A Review of Automated Speech-Based Interaction for Cognitive Screening

Abstract: Language, speech and conversational behaviours reflect cognitive changes that may precede physiological changes and offer a much more cost-effective option for detecting preclinical cognitive decline. Artificial intelligence and machine learning have been established as a means to facilitate automated speech-based cognitive screening through the automated recording and analysis of linguistic, speech and conversational behaviours. In this work, a scoping literature review was performed to document and analyse current automated speech-based implementations for cognitive screening from the perspective of human–computer interaction. At this stage, the goal was to identify and analyse the characteristics that define the interaction between the automated speech-based screening systems and their users, potentially revealing interaction-related patterns and gaps. In total, 65 articles were identified as potentially relevant, of which 15 satisfied the inclusion criteria. The literature review led to the documentation and further analysis of five interaction-related themes: (i) user interface, (ii) modalities, (iii) speech-based communication, (iv) screening content and (v) screener. Cognitive screening through speech-based interaction might benefit from two practices: (1) implementing more multimodal user interfaces that facilitate, amongst others, speech-based screening and (2) introducing the element of motivation into the speech-based screening process.


Introduction
Language impairment is common in prodromal dementia states such as mild cognitive impairment (MCI). Well-documented literature has identified the early disruption of normative patterns and processing of speech and language to be characteristic in patients with Alzheimer's disease and in MCI stages [1,2]. Early foundational clinical studies on language have highlighted changes in verbal fluency and naming [1,3].
Research has shown that MCI individuals with impairments in multiple domains, including language, are more likely to develop dementia compared to those with pure memory impairment [2]. Simple linguistic markers, such as word choice, verbal fluency, phrasing (i.e., 'utterance') and short speech patterns possess predictive power in assessing early cognitive decline status [1,4]. Language, speech and conversational behaviours reflect cognitive changes that may precede physiological changes and offer a much more cost-effective option for detecting preclinical cognitive decline [4,5].
Artificial intelligence (AI) and machine learning (ML) have been established as a means to facilitate automated speech-based cognitive screening through automated recording and analysis of linguistic, speech and conversational behaviours. Automatic speech recognition (ASR) and automatic speech analysis (ASA) can be used to (i) explore and discover latent speech patterns that are challenging to identify and analyse in a 'manual' way, (ii) provide a cost-effective method for cognitive screening by targeting the early onset of dementia and (iii) eventually place cognitive screening in non-clinical settings, away from 'white coat' effects and extra stress that the formal process may place on the screenee, potentially skewing the results.
In this work, a scoping literature review is performed to document and analyse current automated speech-based implementations for cognitive screening from the perspective of human-computer interaction (HCI). At this stage, the goal is to identify and analyse the characteristics that define the interaction between the automated speech-based screening systems and the users, potentially revealing interaction-related patterns and gaps. Therefore, this work can be of use to HCI designers, researchers and practitioners of the field aspiring to learn about existing practices related to automated speech-based cognitive screening and inform their designs accordingly. At an internal level, the scoping literature review is utilised to approach a specific topic and answer broad research questions within that context. It also serves as a preliminary stage for a systematic literature review that will cover the interaction themes discovered herein, as well as cognitive health-related aspects of the documented studies.
This paper is organised as follows. First, the literature review process is described (Section 2) and the review's concept matrix is presented (Table 1). Next, findings from the review process are presented (Section 3). A discussion of the key findings and study limitations is presented in Section 4. The paper concludes in Section 5.

Research Questions
To assess the current state of research in the field of cognitive screening using automated speech-based systems, the literature review will address two research questions (RQs):
• RQ1: Which automated speech-based systems for cognitive screening have been studied?
• RQ2: What are the interaction-related characteristics of the studied automated speech-based systems for cognitive screening?
RQ1 focuses on identifying the automated speech-based systems that have been studied. The interaction aspects of the retrieved systems are important for exploring and analysing these systems further; RQ2 examines these aspects.

Search Strategy
A literature search was performed in the Scopus academic search engine. The Scopus engine searches through the databases of publishers, such as ACM, IEEE, Elsevier, Springer, Taylor and Francis, Sage, Emerald, Oxford University Press, Cambridge University Press and many more. Scopus was chosen from amongst other academic search engines (e.g., Google Scholar, Web of Science) for the main search process owing to its wider coverage of scientific publishers, flexible result-filtering capabilities and the consistent accuracy of results [6,7]. The keywords used were: ('dementia' or 'alzheimer' or 'impairment') and ('cognitive screening' or 'cognitive assessment' or 'preclinical screening' or 'preclinical assessment') and ('artificial intelligence' or 'AI' or 'speech recognition' or 'natural language processing' or 'speech analysis'). The publication's title, abstract, and keywords were utilised for the retrieval of eligible articles. Eligible articles were also identified through backward reference searching, that is, by screening the reference lists of the retrieved publications [8].
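The keyword groups above combine into a single Boolean query. The following is a minimal Python sketch of how such a query could be assembled in Scopus advanced-search syntax, where the `TITLE-ABS-KEY` field restricts matching to a publication's title, abstract and keywords; the helper name `or_group` is purely illustrative.

```python
# Illustrative sketch: assembling the review's Boolean query in Scopus
# advanced-search syntax (TITLE-ABS-KEY matches title, abstract, keywords).
condition_terms = ["dementia", "alzheimer", "impairment"]
screening_terms = ["cognitive screening", "cognitive assessment",
                   "preclinical screening", "preclinical assessment"]
method_terms = ["artificial intelligence", "AI", "speech recognition",
                "natural language processing", "speech analysis"]

def or_group(terms):
    """Join one keyword group into a parenthesised OR clause."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# The three groups are combined conjunctively, as in the review's search string.
query = "TITLE-ABS-KEY(" + " AND ".join(
    or_group(g) for g in (condition_terms, screening_terms, method_terms)
) + ")"
print(query)
```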

Inclusion and Exclusion Criteria
Articles with the following characteristics were included in the literature review:

1. Written in English and accepted and presented in peer-reviewed publications;
2. Describing a speech-based system for cognitive screening that implements ASR and, possibly, ASA;
3. Describing a speech-based system that screens for cognitive impairment related to dementia; and
4. Evaluating the speech-based system for cognitive screening to some degree, either technically or empirically.
Based on inclusion criterion #4, conceptual work (e.g., conceptual descriptions of systems, theoretical frameworks) was excluded. Based on inclusion criterion #3, studies regarding cognitive screening outside the dementia domain were also excluded. Criterion #2 ensured that only automated approaches to speech-based interaction were included. Works in languages other than English or presented in pre-print form were also excluded (criterion #1).
The criteria's formulation is based on the facts that: (i) the peer-review process adds to the credibility and reliability of the publications and the respective presented systems, and (ii) the evaluation of the speech-based system ensures that the systems exist, are usable and go beyond the conceptual level.

Screening Process and Results
In total, 65 articles (49 from Scopus and 16 from backward reference searching) were identified as potentially relevant, of which 15 satisfied the inclusion criteria. The author and an independent expert in the field reviewed all 15 articles independently; the two reviewers also shaped the categories and themes of the review based on the data extraction process. A high level of agreement (>80%) between the reviewers was achieved regarding the inclusion/exclusion decisions and the formulated categories and themes. Any disagreements were discussed and settled.
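The reported inter-reviewer agreement can be expressed as simple percent agreement over the include/exclude decisions. The sketch below is illustrative only: the decision lists are invented for demonstration and are not the review's actual data.

```python
# Hypothetical reviewer decisions -- invented data for illustration only.
reviewer_a = ["include", "exclude", "include", "include", "exclude",
              "include", "exclude", "exclude", "include", "include"]
reviewer_b = ["include", "exclude", "include", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "include"]

def percent_agreement(a, b):
    """Share of items on which both reviewers made the same decision."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

agreement = percent_agreement(reviewer_a, reviewer_b)
print(f"Agreement: {agreement:.0%}")  # 9 of 10 decisions match here
```

Percent agreement is the simplest such measure; chance-corrected statistics (e.g., Cohen's kappa) are a common stricter alternative.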

Data Collection and Analysis
The data extracted from each article were as follows:
• The source and full reference;
• The description of the automated speech-based system(s) for cognitive screening, focusing on the technical and interaction aspects;
• The cognitive screening tests that were utilised by the automated speech-based system(s) and their administration details.
The two reviewers jointly performed the data extraction process. Axial coding of the main themes took place so that each theme contains comparable and consistent categories. The main themes identified in the review and tabulated were:
• Studies presenting automated speech-based systems for cognitive screening (addressing RQ1);
• The interaction aspects of these systems (addressing RQ2).
The themes were classified into a concept matrix to facilitate comparisons, provide structure and help clarify the concepts of the review for the reader [9,10]. Table 1 shows the concept matrix of the literature review.

Results
The literature review led to the documentation of five interaction-related themes: (i) user interface (UI), (ii) modalities, (iii) speech-based communication, (iv) screening content and (v) screener, all of which were further analysed in terms of their sub-categories and characteristics.

User Interface
The review showed that some automated speech-based systems for cognitive screening screen exclusively for linguistic markers, thus featuring a voice user interface (VUI). Other systems feature multimodal UIs, which facilitate interaction in more ways than just speech; these interfaces aim to screen for markers in addition to the linguistic ones, such as motor and auditory skills. Systems like those described by Tang et al. [4] and Mirheidari et al. [15] utilise VUIs through several modalities, such as intelligent virtual agents (IVAs), robots, virtual assistants (VAs) and regular voice recorders. Naturally, this decision affects the nature of the screening test that accompanies these interfaces, because its content should only support voice input and its results should be based on linguistic markers. Other systems, like the ones presented by Di Nuovo et al. [11] and Luperto et al. [14], utilise some of the aforementioned modalities to combine voice interfaces with other types of interfaces and offer multimodal interaction. Interaction then comes in the form of additional cognitive screening tasks, such as drawing a clock or a path, listening to an audio track or copying a cube, mainly originating from structured and validated cognitive screening tests, such as the Mini-Mental State Examination (MMSE) [24] and the Montreal Cognitive Assessment (MoCA) [25].

Modalities
Users interact with the automated speech-based systems for cognitive screening through several technologies. The literature review identified five modalities currently used in speech-based systems: IVAs, VAs, socially assistive robots (SARs), audio recorders and telephone audio recorders. IVAs can 'learn' and be trained, thus enabling valuable real-time conversations between the system and the user [4], while VAs and SARs can provide screening tasks, recognise voice commands and facilitate multimodal interaction, as required by structured cognitive screening tests [14,19,23]. Voice recorders and telephones are less technologically advanced devices, which, however, can facilitate a less automated type of interaction that usually requires the assistance of a human screener to administer the cognitive test [12,18]. All the modalities implement ASR and, in some cases, ASA to support the speech-based communication between the system and the user.

Speech-Based Communication
The way users/screenees communicate with the screeners can take three forms:
• Structured speech-based communication: This type of communication is based on the screener evaluating the screenee on a particular set of predetermined, standardised, speech-based screening tasks, where specific responses are expected for the screening score to be calculated. Speech-based, structured and standardised cognitive screening tests, such as the speech-based versions of the MMSE and the MoCA, which ask the screenee for specific responses (e.g., what is the year or the season) that contribute to the score, facilitate this type of communication [11,14,19,23].
• Semi-structured speech-based communication: Here, the screening tasks are standardised but the expected responses are open-ended. Semi-structured and standardised cognitive screening tests, such as verbal fluency tests that, for example, ask the screenee to name as many animals as they can in one minute, facilitate this type of communication [15,18].
• Unstructured speech-based communication: In this case, the screener and the screenee make conversation about various topics, and the evaluation of the screenee is based on linguistic markers and the quality of the dialogue. The user/screenee can converse with a human screener [16,17] or an AI agent [4].
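Structured speech-based communication, as described above, reduces scoring to matching an ASR transcript against a set of expected responses. A minimal sketch follows; the item wording and accepted answers are invented for illustration (loosely modelled on MMSE/MoCA-style orientation items, not taken from the actual tests).

```python
# Illustrative only: items and accepted answers are invented,
# not drawn from the MMSE or MoCA themselves.
ITEMS = {
    "What season is it?": {"winter", "spring", "summer", "autumn", "fall"},
    "What year is it?": {"2024"},  # the accepted year would be set at test time
}

def score_item(question, asr_transcript):
    """Mark the item right (1) or wrong (0): the response is correct
    if the transcript contains any of the item's accepted answers."""
    words = set(asr_transcript.lower().split())
    return int(bool(words & ITEMS[question]))

print(score_item("What season is it?", "um I think it is winter now"))  # -> 1
print(score_item("What season is it?", "no idea sorry"))                # -> 0
```

A production system would also need to handle ASR errors, synonyms and number normalisation (e.g., "twenty twenty-four"), which this sketch ignores.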

Screening Content
The screening content affects interaction because its nature suggests the type of communication between the screener and the screenee, as mentioned above. Screening content for speech-based systems can come from validated tests (e.g., MMSE, MoCA and verbal fluency tests) and be used on an as-is basis [11,18,19] or adjusted to fit the purposes of the researcher and/or the screener [12,15]. Custom screening content is also utilised, especially in cases where the type of communication is unstructured (i.e., in conversations) [4,13,16,17].

Screener
The screening test can be administered by a human screener or an AI agent. In the former case, the human screener communicates the contents of the screening test to the user/screenee, and ASR and ASA may take place later in an asynchronous manner; audio recorders can document the communication and facilitate the asynchronous processing [5,12,18]. When an AI agent is the screener, the user communicates with IVAs, VAs or SARs throughout the whole screening process. ASR and ASA, in this case, may take place in real time, facilitated by AI and ML [4,11,15,19].

Discussion
A main observation arising from the 'speech-based communication' category is the shortage of conversational agents for cognitive screening. The work of Tang et al. [4] presents an AI dialogue agent; however, the other two studies that support unstructured speech-based communication (i.e., [16,17]) are based on a human screener conducting the screening and on asynchronous processing of the speech. A finely tuned speech-based AI conversational agent would allow for more opportunistic use, thus screening for linguistic markers on a more frequent basis and increasing the chances of documenting early signs of cognitive decline. At the same time, its application for cognitive screening could be inexpensive, less stressful and neutral with respect to the assessor, compared to a formal healthcare process with a human screener [4,23]. However, more research is needed to construct reliable AI conversational agents since they currently present several limitations, such as their inability to fully capture the differences between human–AI and human–human conversations in real-world settings (i.e., the human-in-the-loop problem [4,26]). Therefore, the accuracy of the dialogue simulation must be improved continuously.
When it comes to cognitive screening tests, Table 1 clearly shows that most studies utilise validated tests, either as-is (i.e., [19,23]) or adjusted to fit their research purposes (i.e., [12,15]). Pen-and-paper cognitive screening tests (e.g., MoCA and MMSE) screen for various cognitive functions, such as orientation, attention, memory, language and abstraction [24,25]. Their standardised form supports structured speech-based communication between the screener and screenee; in this case, speech analysis is limited to identifying the screenee's answers and marking them as right or wrong. The same speech analysis process is followed for standardised (and potentially validated) cognitive screening tests specifically tailored for speech-based screening (e.g., semantic verbal fluency tests); only in this case, the screenee's responses are open-ended, and thus semi-structured speech-based communication with the screener is established. A number of studies utilise custom cognitive screening tests, mainly for conversations. These tests rely heavily on ASA for identifying linguistic markers in an unstructured communication context. Naturally, the unstructured nature of these screening tests makes them more challenging to validate; thus, further research on their validity, sensitivity and specificity is necessary.
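The open-ended scoring of a semantic verbal fluency test can be sketched in the same spirit: count the unique valid category members in the one-minute transcript, ignoring repetitions and intrusions. The tiny animal lexicon below is an illustrative stand-in for a real category word list.

```python
# Illustrative stand-in lexicon; a real test would use a full category list
# and handle multi-word names and ASR errors.
ANIMALS = {"cat", "dog", "horse", "lion", "tiger", "elephant", "cow", "sheep"}

def fluency_score(asr_transcript):
    """Count unique animal names mentioned; repetitions and intrusions
    (non-animal words) contribute nothing to the score."""
    tokens = asr_transcript.lower().split()
    return len({t for t in tokens if t in ANIMALS})

print(fluency_score("cat dog um dog horse table lion"))  # -> 4 (repeat 'dog' and 'table' ignored)
```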
The results of this review suggest that cognitive screening through speech-based interaction might benefit from two practices. Firstly, multimodal UIs supporting speech-based interaction are scarce. Multimodal UIs can facilitate the screening of multiple cognitive functions: apart from linguistic markers, which can be covered by speech-based interaction and AI conversational agents, other cognitive and cognition-related functions, such as motor, auditory and visuospatial functions, can be screened through other modalities, including writing, listening to sounds and so on. Multimodal interventions have been shown to be significant in the fight against cognitive decline as they cover a wide range of cognitive decline markers [27]. Respective UIs that include speech-based interaction (and, potentially, screening conversations) can mark the next generation of multimodal cognitive screening systems. Secondly, the element that was missing from the reviewed studies and their screening processes was motivation. Motivation is an important element in cognitive screening for increasing the screening frequency (and thus the chances of detecting cognitive decline) and making the process less stressful and more enjoyable [28–30]. Gamified screening and serious games for screening have been suggested as effective ways of adding motivation to the cognitive screening process [28–32]. Therefore, a cognitive screening solution that (i) features multimodal functionality facilitating speech-based screening and (ii) introduces the element of motivation (e.g., by gamifying the screening process) might be an interesting subject for future research with true potential.

Study Limitations
As stated earlier, this review does not cover the cognitive health-related aspects of the documented studies. At this early stage, this study focuses on identifying the interaction aspects of speech-based screening because interaction is an important aspect that can affect screening results. Cognitive health-related aspects will be the subject of future work, as described in Section 5. Moreover, in the analysis of results, no distinction was made between synchronous (real-time) and asynchronous ASR and ASA for cognitive screening. The reasons for this were as follows: (i) speech-based cognitive screening can be of great value for detecting early signs of cognitive decline whether the processing of speech data takes place in real time or after a few minutes or hours, and (ii) in several reviewed articles, the descriptions of data processing were unclear or missing. Naturally, in future work, this information will be retrieved by contacting the respective authors, as the timing of data processing is a feature that can affect the design and development of future speech-based cognitive screening systems.

Conclusions
Overall, this work, despite its descriptive nature, can lay the groundwork for researchers, designers and practitioners of the field to build on, further examine and produce prescriptive approaches around the use of automated speech analysis for cognitive screening. The studies documented herein reveal considerable interest in the topic over the last three years (with 9 out of 15 studies taking place from 2017 to 2020), and this work may act as a guide, assisting interested parties in the field in getting an overview of speech-based interaction for cognitive screening and of the interaction-related practices and gaps in the field. Interaction is an important aspect of speech-based cognitive screening systems: it operates as a lower implementation layer on which the cognitive health-related content can be based and, if not treated in an informed way, it may affect or skew the screening results. In this work, the conducted literature review produced the interaction-related themes that are necessary for describing and classifying speech-based cognitive screening, and the review's concept matrix highlighted the underlying relationships among the interaction attributes of the systems.
Future work will address the cognitive health-related aspects of speech-based cognitive screening systems, focusing, amongst others, on their screening evaluation methodologies, their screening validity and performance and the screenees' targeted cognitive status or impairment. To that end, a systematic literature review will be utilised as an extension of the scoping review presented herein, and the interaction elements of the studied systems will, naturally, be revisited.