Speech- and Language-Based Classification of Alzheimer’s Disease: A Systematic Review

Background: Alzheimer’s disease (AD) is of paramount importance due to its rising prevalence, its impact on patients and society, and the related healthcare costs. However, current diagnostic techniques are not designed for frequent mass screening, which delays therapeutic intervention and worsens prognoses. To detect AD at an early stage, ideally a pre-clinical one, speech analysis emerges as a simple, low-cost, non-invasive procedure. Objectives: This work presents a systematic review of speech-based detection and classification of Alzheimer’s disease, with the purpose of identifying the most effective algorithms and best practices. Methods: A systematic literature search was performed from Jan 2015 up to May 2020 using ScienceDirect, PubMed and DBLP. Articles were screened by title, abstract and full text as needed. A manual complementary search among the references of the included papers was also performed. Inclusion criteria and search strategies were defined a priori. Results: We were able to identify the main resources that can support the development of decision support systems for AD, to list speech features that are correlated with the linguistic and acoustic footprint of the disease, to recognize the data models that can provide robust results, and to observe the performance indicators that were reported. Discussion: A computational system with an adequate combination of elements, based on the identified best practices, can point to a whole new diagnostic approach, leading to better insights about AD symptoms and disease patterns, and creating conditions to promote a longer life span as well as an improvement in patient quality of life. The clinically relevant results that were identified can be used to establish a reference system and help to define research guidelines for future developments.


Context and Objectives
Alzheimer's Disease (AD) is currently the most common cause of neurodegenerative dementia worldwide, contributing to 60-70% of all cases. In 2006, the worldwide prevalence of AD was 26.6 million and, by 2050, the prevalence is predicted to reach 131 million, resulting in 1 in every 83 people in the world living with the disease [1,2]. Early and accurate diagnosis of AD has a major impact on its progress and follow-up, and although memory loss and behavioral changes are relevant indicators for its detection, these only become evident in more advanced stages of the disease, often leading to the late diagnosis of dementia [3,4]. Neuropsychological tests, an alternative to more expensive and often invasive approaches, can be powerful predictors of conversion (from mild cognitive impairment to AD), in particular when machine learning approaches are used [5,6]. In a systematic review, encompassing neuropsychological measures [7], categorical fluency tests for language, covering executive control ability and verbal ability, showed the highest performance when discriminating between healthy controls and Alzheimer's, and measures caused by something inherent in dementia (e.g., delirium, substances or other medical, neurological or psychiatric disorders). To answer these questions, a medical history is acquired, and appropriate physical examinations and laboratory studies are performed, as well as cognitive screenings, which also use neuroimaging techniques [15]. Among cognitive tests, the Mini-Mental State Exam (MMSE), the Clock-drawing test, and the Alzheimer's Disease Assessment Scale stand out [12,16,17]. The main exams using imaging techniques are Computed Axial Tomography (CT), Magnetic Resonance Imaging (MRI), Positron Emission Tomography (PET), and Single-Photon Emission Computed Tomography (SPECT) [15].
Although there is currently a wide range of diagnostic methods applied to AD, there is still a need for new methods that respond more promptly to dementia while being simple and cost-effective.
Alzheimer's disease is characterized by a progressive worsening of deficits in several cognitive domains, including language. Aphasia and dysarthria are common symptoms, and language impairment in AD occurs mainly due to a decline in the semantic and pragmatic levels of language processing [18]. From a physiological perspective, superior parietal, posterior temporal, and occipital cortical areas are interconnected by the posterior corpus callosum. The superior longitudinal fasciculus surrounds the putamen, connecting all four cerebral lobes, areas that are known to be affected in MCI and AD and that have a central role in language processing [19,20]. Language difficulties are a major problem for most patients with dementia, especially as the disease progresses. The first signs that communication is being affected are difficulties in finding words, especially when it comes to naming familiar people or objects. Words are replaced by wrong or meaningless ones, and pauses during speech also increase [21]. In the early stages of AD, language impairment involves problems of lexical retrieval, loss of verbal fluency, and a breakdown in higher-order written and spoken language comprehension. In the moderate and severe phases of AD, the loss of verbal fluency is profound, with loss of comprehension and prominent literal and semantic paraphasias. In the very severe phases of AD, speech is often restricted to echolalia and verbal stereotypies. Table 1 shows the association of the mentioned speech impairments with the stage of the disease [18,22]. Communicative difficulties (speech and language) constitute one of the groups of symptoms that most often accompany dementia and, therefore, should be recognized as a central study instrument. This recognition aims to provide earlier diagnosis, resulting in greater effectiveness in delaying the evolution of the disease.
Table 1. Language changes in AD (adapted from Ferris and Farlow [18] and Greta et al. [23]).

Temporal and acoustic parameters, though less explored for AD, are also reported to change. Fundamental frequency, interruptions of sound, voiced periods, and speech rate, among others, show distinct ranges in AD and healthy individuals [24-26]. Though out of the scope of this review, depression and mood changes, symptoms connected with AD, can also be classified using speech analysis.

Materials and Methods
The methodology for this systematic review was inspired by PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) [27,28], and the review was registered with the number CRD42022296738 in the National Institute for Health Research (PROSPERO) database. The ScienceDirect, PubMed, and DBLP scientific repositories, used as information sources, were searched through May 2020. Based on central keywords, we defined the search query (Alzheimer's [Title] AND Speech [Title] AND (Detection [Title] OR Classification [Title])), which was used similarly for each database. The eligibility criteria were: (a) English language articles; (b) published in a peer-reviewed journal; (c) related to machine learning or statistical methods; (d) processing pipeline details clearly defined. Using the first repository, as a preparatory step, a statistical analysis of the number of publications per year was made, from 1996 to May 2020. After a coarse removal of out-of-scope articles and duplicates, it was possible to count the number of publications per year, as presented in Figure 1. This showed a significant increase in research interest in this topic since 2015; therefore, it was decided to restrict the analysis to the period from 2015 to 2020. In ScienceDirect, a filter was applied so that only research articles were displayed, and in DBLP two filters were applied simultaneously, restricting the results to articles classified as academic journals and whose content was related to "machine learning".
We have not assessed the risk of bias for these studies due to their great heterogeneity and the differences in background scientific fields (some studies were clinically oriented, such as non-randomized studies or randomized controlled trials, while others were developed as exploratory machine learning exercises, with no pretension to immediate application in clinical decision). However, we consider that, since many studies are based on stochastic approaches, bias risk should be better addressed in these articles, especially when creating speech databases, where gender, age, disease severity, and comorbidities, among others, should be carefully balanced.
After applying the filters, the articles of interest were selected manually. This process involved careful reading of each article's abstract, and only those that addressed the detection of AD or MCI based on speech and language were selected. In a deeper analysis of the obtained articles, 14 duplicates were detected. In addition, 2 more articles from the IEEE platform were added by following references in the first selected bibliography. Thus, the database created has 24 articles from the platforms mentioned. Figure 2 shows the process followed to reach this total number of articles. Finally, our search strategy was focused on identifying the main components of machine learning and statistical-based approaches (data sources, data models, parameter optimization strategies) and the outcomes provided by such systems (evaluation strategies and performance indicators).
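The title query used in the search can be read as a boolean filter. The following is a minimal sketch of that logic only, assuming titles are available as plain strings and that case-insensitive substring matching is acceptable; the actual matching rules of each repository differ, and `matches_query` and the example titles are hypothetical.

```python
def matches_query(title: str) -> bool:
    """Alzheimer's AND Speech AND (Detection OR Classification), applied to a title.

    Assumption: case-insensitive substring matching on the word stems.
    """
    t = title.lower()
    return (
        "alzheimer" in t
        and "speech" in t
        and ("detection" in t or "classification" in t)
    )


# Hypothetical titles, used only to exercise the filter.
titles = [
    "Speech-Based Detection of Alzheimer's Disease",
    "MRI Classification of Alzheimer's Disease",
    "Speech Classification in Alzheimer's Patients",
]
selected = [t for t in titles if matches_query(t)]
print(selected)
```

Such a filter only reproduces the first, automatic pass; the manual screening of abstracts described above is still required.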

Results
In this section we present the outcomes of our literature review. We start with the overall architecture of the systems and then, in each subsection, focus on their composing elements.

Machine Learning Pipeline
The use of speech analysis is potentially a useful, non-invasive, and simple method for the early diagnosis of AD. Automating this process allows fast, accurate, and economical follow-up over time. Initially, speech-based tests for AD detection were performed by linguists. These tests were designed to extract linguistic characteristics from speech or writing samples. However, more recent studies seek to optimize this task by automating speech recognition from audio recordings [29]. The process can be described in 4 sequential steps:
1. Data Preparation: In this step the extraction, optimization and normalization of features occur. This consists of selecting the most significant features (by removing the non-dominant ones) and transforming their ranges to similar limits, which reduces training time and the complexity of the classification models. Metadata are "the data of the data", more specifically, structured and organized information on a given object (in this case voice recordings) that allows certain characteristics of it to be known. This metadata, together with the results of the preprocessing of the recordings, makes up the final database. Incorrect or poor-quality data (e.g., outliers, wrong labels, noise), if not properly handled, will lead to under-optimized models and unsatisfactory results. If there is not enough data, for example when deep learning algorithms are used, data augmentation techniques can be useful.
2. Training and Validation: The supporting database is divided into subsets, usually 70-90% for training and 30-10% for testing. The subsets can be randomly generated several times and the results averaged for additional confidence, a procedure known as cross-validation. The data model is trained, i.e., the involved parameters are adjusted by one or more optimizers, and the performance is calculated using the test subset. This step allows categorizing and organizing the data to promote better analysis [30]. When there is not enough data, transfer learning approaches can be used.
3. Optimization: After model evaluation, it is possible to conclude which parameters need to be improved and to select the most interesting and relevant features more effectively, so that a new extraction and, consequently, a new iteration of Training and Validation can be performed.
4. Run-Time: Having concluded the previous steps, the system is ready to be deployed and to classify new unseen inputs. More specifically, from the recording of a patient's voice, the system classifies the speaker as possibly healthy or as a possible Alzheimer's patient.
In Figure 3 we can observe the described methodology in detail.
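The Data Preparation and Training and Validation steps can be illustrated with a small, standard-library-only sketch. It assumes min-max scaling as the normalization, k-fold splitting as the cross-validation variant, and a nearest-centroid rule as a stand-in classifier; none of these choices is prescribed by the reviewed studies, and the two features and all values below are synthetic.

```python
import random


def min_max_fit(rows):
    """Per-feature (min, max), computed on the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def min_max_apply(rows, bounds):
    """Scale features with the training bounds; constant features map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, (lo, hi) in zip(row, bounds)]
        for row in rows
    ]


def k_fold_indices(n, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid_fit(X, y):
    """Stand-in classifier: one mean vector per class."""
    model = {}
    for label in set(y):
        members = [x for x, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(members) for col in zip(*members)]
    return model


def centroid_predict(model, x):
    """Return the label of the closest class centroid."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, model[lab])))


def cross_validate(X, y, k=5, seed=0):
    """Mean test accuracy over k folds (each fold: 80% train, 20% test for k=5)."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        # Fit the scaler on training data only, to avoid leaking test information.
        bounds = min_max_fit([X[i] for i in train_idx])
        X_train = min_max_apply([X[i] for i in train_idx], bounds)
        X_test = min_max_apply([X[i] for i in test_idx], bounds)
        model = centroid_fit(X_train, [y[i] for i in train_idx])
        hits = sum(centroid_predict(model, x) == y[i] for x, i in zip(X_test, test_idx))
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


# Synthetic, well-separated data: [speech rate, pause ratio] per recording.
X = [[4.0 + 0.1 * i, 0.10 + 0.01 * i] for i in range(10)]
X += [[2.0 + 0.1 * i, 0.40 + 0.01 * i] for i in range(10)]
y = ["HC"] * 10 + ["AD"] * 10
print(cross_validate(X, y))
```

Fitting the scaler inside each fold, rather than once on the whole database, keeps the test subset unseen during training.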

Speech and Language Resources
As mentioned above, to be able to create a mechanism for detecting AD, a speech database is required. Building a speech database implies careful planning. Important steps that should be followed and prepared in an initial design stage are: recording conditions, acquisition and storage hardware, data collection protocol, informant selection, speech task, data organization and labelling. As sensitive data can be collected, ethical and safety aspects should also be of concern. The quality of the database is crucial since it supports the analysis and the conclusions that can be drawn.
With the increasing interest in the area, the number of speech and language resources has also increased (although many languages are not yet covered). Table 2 presents the main databases that are referred to in the scientific literature, accompanied by a summary of their characteristics. These resources are crucial for supporting the development of new systems, in particular when deep learning approaches are used. The use of the same databases in different studies, by different researchers, also provides a common ground for evaluation and performance comparison. The BEA (whose acronym comes from BEszélt nyelvi Adatbázis) is a growing database containing various types of spontaneous speech, reading aloud, and conversation in Hungarian. To date, it consists of recordings of 280 healthy and cognitively declining subjects between the ages of 20 and 90 [56].
Cinderella contains recordings of 60 subjects spontaneously telling the story of Cinderella. These 60 subjects, native Portuguese speakers, are equally divided into three groups: healthy, with MCI, and with AD. The recordings that make up the database were made by Toledo et al. [45] for the study in question; the availability of the database is not stated.
TalkBank is a project whose main objective is to encourage the study of human communication. Currently, it makes available repositories from several research areas covering more than 34 languages, all of them open-source upon request. DementiaBank is one of the repositories of this project which, as its name indicates, focuses on the communication of people with dementia. Within this repository, there are several corpora with different languages, tasks, and dementias under analysis. In Tables 2 and 3, there are two examples of the corpora that can be found in DementiaBank, the Lu Corpus and the Pitt Corpus.
Dem@care is a European project focused on improving the quality of life of people with dementia. This project has multilingual databases and files of different types, such as audio and video. These databases are available upon request, and there is also a quick contact section in the footer of the project website. Although none of the studies made use of this database, it is highly referenced in the literature covered.
The Gipuzkoa-Alzheimer Project (GAP) is a longitudinal Spanish study, running since 2011, in which volunteers are observed every 3 years to analyze the evolution of the disease. The database that this study gathers can be accessed upon request [57].
The Wisconsin Registry for Alzheimer's Prevention (WRAP) has been conducting a longitudinal study to assess parameters that allow early detection of cognitive decline at older ages. To date, 1561 people have participated in this study, undergoing various types of analysis continuously over several years. The WRAP protocol resources and the databases of related studies can be accessed by qualified researchers by completing an online form and a data use agreement, which can be found on the Global Alzheimer's Association Interactive Network website [58].

Table 3. Linguistic features that have been used for AD detection. The features are organized by type. For each feature name, the number of occurrences/usages is provided inside parentheses.
Feature Type | Feature Name
Semantic density | The density of the idea (1); Efficiency of the idea (1); Density of information (2); Density of the sentences (1).
Complexity | The entropy of words (1); Honoré's Statistics (1).
Lexical Variation |

Language and Speech Features
As mentioned in Table 1, the most evident problems early on in AD, as far as speech is concerned, are related to difficulties in general semantics, that is, in finding words to name objects. In this sense, temporal cycles during spontaneous speech production (speech fluency) are affected and, therefore, can be detectable in the patient's hesitation and pronunciation [59]. Other speech characteristics affected in AD patients seem to be those related to articulation (speed in language processing), prosody in terms of temporal and acoustic measurements, and eventually, in later phases, phonological fluency [60].
Considering the linearity of the features, they can be classified as linear or non-linear, the linear ones being more conventionally used. Linear features can be subdivided into several groups, but these are always closely interconnected. Thus, we chose to divide them into two groups, linguistic and acoustic, and present them in Tables 3 and 4. For each reviewed article we have collected the names of the features that were used.
Table 4. Acoustic features that have been used for AD detection. The features are organized by type. For each feature name, the number of occurrences/usages is provided inside parentheses.
Regularity | Jitter (11); Shimmer (11); Intensity (6)
The reviewed literature does not present an immediate pattern regarding the extraction and use of features. It is possible to find simple sets based on traditional metrics, as well as approaches using advanced parameters and methods, with one or several feature sets. All studies report good accuracies and promising results.
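Jitter and shimmer, the most frequently used regularity features in Table 4, measure cycle-to-cycle variation of the glottal period and of the peak amplitude, respectively. The sketch below uses one common definition (the "local" variant: mean absolute difference between consecutive cycles, divided by the mean value); the reviewed studies may use other variants, and the period and amplitude values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical measurements for four consecutive glottal cycles.
periods_s = [0.0100, 0.0102, 0.0098, 0.0100]  # cycle durations in seconds
amplitudes = [0.80, 0.78, 0.82, 0.80]         # peak amplitudes (arbitrary units)

jitter = local_perturbation(periods_s)    # local jitter, as a fraction
shimmer = local_perturbation(amplitudes)  # local shimmer, as a fraction
print(f"jitter={jitter:.4f} shimmer={shimmer:.4f}")
```

In practice the periods and amplitudes come from a pitch-tracking step on the recording, which is outside the scope of this sketch.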
Using linguistic features, Rentoumi et al. [40] developed studies for computational linguistic analysis in Alzheimer's patients, resulting in maximum accuracies of 88%.
To identify changes in the macro-linguistic aspects of speech in subjects with cognitive decline, Toledo et al. [45] conducted a study, in Portuguese, where the story of Cinderella was used as the main task of analysis. Also using linguistic features, it was possible to distinguish the various degrees of dementia.
The picture description task is one of the most used for the analysis of spontaneous speech. A study carried out by Hernández-Domínguez et al. [61] uses this same task, proposing a new methodology that allows patients to be described and later classified as Alzheimer's patients or not. This classification reached accuracies of 94% using linguistic features.
With the main objective of detecting MCI, Fraser et al. [51] developed two studies. The first, bilingual, allowed the creation of a detection system applicable to two languages, English and Swedish, and also the evaluation of the impact of the language on the accuracy of this detection. The second took a cascade approach to combine data from multiple language tasks to distinguish patients with MCI from healthy subjects, achieving 83% accuracy [51]. In both studies, the extracted features were linguistic.
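Two of the complexity features listed in Table 3, the entropy of words and Honoré's statistic, can be computed directly from a tokenized transcript. The sketch below assumes whitespace tokenization, Shannon entropy in bits, and the usual form of Honoré's statistic, R = 100 ln(N) / (1 - V1/V), with N tokens, V distinct words and V1 words used exactly once; the example sentence is hypothetical and the reviewed studies may define these features differently.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy (bits) of the word frequency distribution; tokens must be non-empty."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def honore_statistic(tokens):
    """Honoré's R = 100 * ln(N) / (1 - V1 / V); higher values suggest a richer vocabulary."""
    counts = Counter(tokens)
    n = len(tokens)
    v = len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    if v1 == v:
        # Assumption: the statistic is undefined when every word occurs once.
        return float("inf")
    return 100.0 * math.log(n) / (1.0 - v1 / v)


# Hypothetical picture-description fragment.
tokens = "the boy is on the stool and the girl is laughing".split()
print(word_entropy(tokens), honore_statistic(tokens))
```

Both measures depend on transcript length, so comparisons between speakers are more meaningful on samples of similar size.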
Martínez-Sánchez et al. [49] presented a study to validate a prototype that automatically performs speech analysis in older people with AD. The device, based on acoustic features, provides numerical parameters that can be interpreted to identify specific changes in speech fluency, acoustics, and prosody, and was able to correctly classify 92.4% of the subjects under study. Also using acoustic features, the studies in [13,52,62,63] achieved accuracies of 97%, 83%, 71.4%, and 62%, respectively.
Khodabakhsh et al. [54,55] conducted three studies in the area of focus. In the first two studies, acoustic features were used to detect AD, where accuracies of 94% were reported for both proposed approaches. The third study encompassed a more extensive set of features where acoustic and linguistic features were combined, resulting in 84% accuracy, for a distinct dataset [53].
Qiao et al. [44] created an automatic speech recognition software specialized in cognitive impairment, allowing the characterization of language impairment in people with AD and MCI. For this, they used acoustic features.
König et al. [36] proposed to use several short cognitive vocal tasks to distinguish between healthy controls, mild cognitive impairment and AD patients, with the best distinction being between healthy subjects and Alzheimer's patients, with an accuracy of 87%. The same authors also proposed a mobile application to record spontaneous speech in an uncontrolled environment, which proved to be a useful tool in providing additional indicators for early assessment and detection of AD and MCI [37]. By combining acoustic features in a semantic verbal fluency analysis, aimed at automating this process, the authors were able to successfully distinguish healthy subjects from patients with AD and MCI [38].
Acoustic and linguistic features were also used by Gosztolya et al. [41]. The authors developed independent systems for each set of features, with an accuracy of 82% in both cases. Combining both feature sets raised the score to 86%, showing the importance of using acoustic and linguistic information together.
Beltrami et al. [42] also combined acoustic and linguistic features, obtaining an accuracy of 77%, compared with the 86% of Gosztolya et al. [41].
Chien et al. [43] have also developed a system for the analysis of AD through speech. However, contrary to what happens in most studies, the features instead of being selected by statistical methods were selected through an acoustic feature sequence generator created and trained as part of the proposed system.
Other unconventional feature sets have also been used with interesting results. For example, in [47,48] non-linear features are used, namely the fractal dimension and the permutation entropy, which allowed accuracies of 90.9% to be reached.
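As an illustration of one of these non-linear features, the sketch below computes permutation entropy in its usual formulation (the Shannon entropy of the ordinal patterns found in sliding windows of the signal). The order, delay and tie-breaking rule are assumptions made for this example, and are not necessarily the settings used in [47,48].

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy (bits) of the ordinal patterns of length `order` in `series`."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows < 1:
        raise ValueError("series too short for the chosen order and delay")
    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + j * delay] for j in range(order)]
        # Ordinal pattern: the positions that sort the window (ties keep their order).
        patterns[tuple(sorted(range(order), key=lambda j: window[j]))] += 1
    entropy = -sum((c / n_windows) * math.log2(c / n_windows) for c in patterns.values())
    if normalize:
        # Divide by the maximum entropy, log2(order!), to obtain a value in [0, 1].
        return entropy / math.log2(math.factorial(order))
    return entropy


signal = [4, 7, 9, 10, 6, 11, 3]  # short toy series
print(permutation_entropy(signal, order=3, normalize=False))
```

A perfectly monotonic signal contains a single ordinal pattern and yields zero, while irregular signals approach the maximum.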

Classification Models
Classification consists of identifying to which of a given set of categories a new observation belongs, based on a training set of observations whose categories are already known [64]. Thus, after the extraction and selection of the most significant features, it is necessary to proceed to their classification so that the groups of data under study can also be classified.
When the data distribution or patterns are known, a compatible model (linear, polynomial, exponential or other) will lead to optimal results. However, machine learning has gained special relevance due to its ability to provide good estimates even when facing unstructured, high-dimensional data. In this context, deep neural networks (DNN) can excel. These are flexible models where elements, inspired by the anatomy and physiology of the human brain, are combined in large structures, with several sequential layers, to provide the output. The number of elements per layer, the number of layers, and the behavior of each layer (fully connected, convolutional, recurrent, etc.) are some of the parameters that can be adjusted to fit the network to the data or problem. Despite the widespread use of these techniques, the large amount of data required to train the huge number of parameters and the "black-box" model that is obtained in the end are often-mentioned caveats.
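To give a sense of how quickly the number of trainable parameters grows, the following sketch counts the weights and biases of a fully connected network from its layer widths. The layer sizes are hypothetical and only fully connected layers are considered.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical architectures: 20 input features, 2 output classes.
small = count_dense_parameters([20, 16, 2])        # one hidden layer of 16 units
large = count_dense_parameters([20, 256, 256, 2])  # two hidden layers of 256 units
print(small, large)
```

Even the smaller network has several times more parameters than a database such as Cinderella has subjects (60), which illustrates why data augmentation and transfer learning are mentioned for this family of models.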
Table 5 summarizes some of the most commonly used models and defines them in general terms.

Table 5. Most significantly used classification models.

Model | Characterization | References
NB | Consists of a network composed of a main node with other associated descending nodes that follow Bayes' theorem [65]. | [13,35,40,53]
SVM | Consists of building the hyperplane with maximum margin capable of optimally separating two classes of a data set [65]. | [13,37-41,50-55,61,66]
RF | Relies on the creation of a large number of uncorrelated decision trees, each based on a random selection of predictor variables [67]. | [13,61]
DT | Consists of building a decision tree where each node specifies a test on an attribute, each branch descending from that node corresponds to one of the possible values for that attribute, and each leaf represents the class label associated with the instance. The instances of the training set are classified by following the path from the root to a leaf, according to the results of the tests along the path [68]. | [39,53-55]
KNN | Based on the memory principle, in the sense that it stores all cases and classifies new cases based on similarity measures [65]. | [42,46,48]
LR | A model capable of finding an equation that predicts the outcome of a binary variable from one or more predictor variables [69]. | [42,51]
LDA | A discriminatory approach based on the differences between samples of certain groups. Supervised learning technique where the objective is to maximize the ratio between the variance between groups and the variance within the same group [70]. | [54,55]
ANN, DNN | Naturally inspired models. Supervised learning approach based on a theory of association (pattern recognition) between cognitive elements [71]. There are many possibilities with different elements, structures, layers, etc. The larger the number of parameters, the larger the dataset must be. |

Based on Table 5, it is possible to determine the frequency of use of each model, as shown in Figure 4. The most popular classification models are based on the Support Vector Machine (SVM), with 34%, followed by the several variations of Artificial Neural Networks (ANN), with 21%. Their ability to deal with non-linear data distributions and the possibility of finding non-obvious patterns in data may be the main motivations for their use.
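To make one of the entries of Table 5 concrete, the sketch below implements the KNN principle (store all cases, classify a new case by similarity) using only the Python standard library. KNN is chosen here because it is the shortest model to write down, not because it is the most used. The two features, their values, and the labels are invented for illustration and do not come from any reviewed study.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases."""
    # Euclidean distance from the query to every stored case.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    # Majority vote among the k closest cases.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features per speaker: (pause ratio, jitter in %).
train_x = [(0.20, 1.0), (0.30, 1.2), (0.25, 0.9),
           (0.80, 2.5), (0.90, 2.8), (0.85, 2.4)]
train_y = ["HC", "HC", "HC", "AD", "AD", "AD"]

print(knn_predict(train_x, train_y, (0.82, 2.6)))
print(knn_predict(train_x, train_y, (0.22, 1.1)))
```

In practice, features measured on different scales would usually be normalized before computing distances; that step is omitted here for brevity.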

Testing and Performance Indicators
To draw conclusions about the efficiency and viability of the adopted classification model, it is necessary to evaluate it. To compare the performance of a given system against other reported systems, it is important to choose a common metric and a well-defined testing setup; otherwise, it is impossible to understand how well a system stands against its competitors. In this sense, Table 6 presents the evaluation methods applied in the reviewed literature.

Table 6. Evaluation models for classification models.

Model | Method | Reference
Cross Validation | k-Fold | [40,41,43,46-48,52,61]
Cross Validation | Leave-pair-out | [51,66]
Cross Validation | Leave-one-out | [13,38,50,53,54]
Split Evaluation | 90-10% | [52]
Split Evaluation | 80-20% | [42]
Random Sub-Sampling | - | [37]

Accuracy, among other metrics, is an indicator of quality that allows one to objectively evaluate the performance of systems, either alone or by comparison. Other common metrics of interest are the Area Under the Curve (AUC) and the F1 score. However, accuracy is one of the preferred metrics, and its value is provided by most authors. Figure 5 shows, for each classification model, the average accuracy values reported in the reviewed articles.
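The evaluation methods and metrics above can be sketched in a few lines. The following is a minimal illustration, assuming contiguous folds without shuffling or stratification and a binary labeling with "AD" as the positive class. The function names and the example labels are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Leave-one-out corresponds to k == n_samples.
    """
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive="AD"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced test set: a classifier that always answers "HC".
y_true = ["HC"] * 9 + ["AD"]
y_pred = ["HC"] * 10
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred))
```

The last example shows why accuracy is best reported together with a second metric: on an imbalanced set, a classifier that never detects AD still reaches an accuracy of 0.9, while its F1 score for the AD class is 0.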

Discussion
Speech analysis, in general, is an important source of information that encompasses the phonetic, phonological, lexical-semantic, morphosyntactic, and pragmatic levels of language organization [72]. The first signs of cognitive decline are clearly present in the discourse of patients with neurodegenerative disease, so diagnosis via speech analysis is a viable and effective method that may even lead to an earlier and more accurate diagnosis.
The reviewed articles focused on various aspects of the identification or classification of cognitive loss. In terms of the evolution of the disease, speech-based assessment techniques can be applied at several stages: (a) in early diagnosis; (b) in the classification/distinction between pathological cases and healthy individuals; (c) in the quantification of symptom intensity; (d) in the follow-up of the disease, characterizing the effectiveness of therapeutic approaches.
Further research is required to improve the performance and reliability of these systems.

Base Model for System Development
Despite the distinct objectives of the articles included in this review, it was possible to identify common modules, similar resources and shared methodologies. A base system, with a robust development base and with flexibility for exploration, should follow these guidelines:
• DATABASE. The DementiaBank database, provided by the TalkBank platform, would be used due to its versatility in terms of population, types of tasks, and languages. This is a robust resource, widely known and used, that can be useful when comparing systems using a common linguistic base.
• FEATURES. A combination of linguistic and acoustic features seems to provide the best results, namely the duration and the total number of silences, voice segments, and hesitations, as well as the fundamental frequency, jitter, and shimmer, as these are among the characteristics that differ most between healthy individuals and individuals with AD.
• TASK. Given the previously mentioned features, spontaneous speech would be used as the main task for assessment, using questions that would generate a fluent and spontaneous conversation.
• CLASSIFICATION MODELS. Artificial Neural Networks should constitute the base model for decision due to their flexibility to data patterns and because they provide a high-dimensional parameter space that can be explored and tuned. Systems based on these models have the highest reported accuracies.
• EVALUATION MODELS. As it is the most recurrent method, cross-validation should be applied to evaluate the classification models. Accuracy and F-score should be the comparison metrics.
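Some of the acoustic features named in the FEATURES item can be computed from simple measurements. The sketch below assumes that glottal cycle periods, cycle peak amplitudes, and voiced segment boundaries have already been extracted by some other tool. It uses the common "local" definition of jitter and shimmer (mean absolute difference between consecutive cycles, divided by the mean). The 0.25 s pause threshold and all numeric values are illustrative assumptions.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive values, relative to the mean.

    Applied to cycle periods this gives local jitter; applied to cycle
    peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are required")
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def silence_stats(voiced_segments, min_pause=0.25):
    """Return (number of silences, total silence duration in seconds).

    voiced_segments is a time-ordered list of (start, end) pairs in seconds;
    gaps shorter than min_pause (an assumed threshold) are ignored.
    """
    gaps = [next_start - end
            for (_, end), (next_start, _) in zip(voiced_segments,
                                                 voiced_segments[1:])]
    pauses = [gap for gap in gaps if gap >= min_pause]
    return len(pauses), sum(pauses)


# Made-up measurements for one recording.
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]
amplitudes = [0.50, 0.48, 0.52, 0.49, 0.51]
segments = [(0.0, 1.5), (2.0, 3.0), (3.1, 4.0), (5.0, 6.0)]

jitter = local_perturbation(periods_ms)
shimmer = local_perturbation(amplitudes)
mean_f0_hz = 1000.0 / (sum(periods_ms) / len(periods_ms))
n_silences, silence_duration = silence_stats(segments)
print(jitter, shimmer, mean_f0_hz, n_silences, silence_duration)
```

A feature vector built from values such as these, possibly combined with linguistic features, is what the classification model would receive as input.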
The integration of the modules and the tuning of the final system are also a matter of concern. Closed-loop systems that can automate the parameter search are of great interest when designing a machine learning tool. A better-performing system makes the subject's final rating more reliable and safer. That said, although these systems are a possible way of detecting and classifying AD, it is important to note that their purpose is to assist the diagnostic process. None of the reported systems was evaluated as a clinical tool, and the official diagnosis should be made by a specialist doctor. However, they add value as time-savers, allowing people to be diagnosed earlier and more quickly, and by raising awareness among at-risk age groups, who may then visit a neurologist.
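One simple form of automated parameter search is an exhaustive grid search, sketched below. The scoring function is a placeholder standing in for "train the system with these parameters and return its cross-validated accuracy". The parameter names (`k`, `min_pause`) and the toy scoring formula are assumptions made only so that the example runs.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Placeholder score; a real system would train and cross-validate here.
    return 1.0 - abs(params["k"] - 5) * 0.05 - abs(params["min_pause"] - 0.25)


grid = {"k": [1, 3, 5, 7], "min_pause": [0.15, 0.25, 0.35]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Exhaustive search grows multiplicatively with the number of parameters, so larger systems would likely rely on random or model-guided search instead.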

Future Work
As technology evolves, so do the methods of diagnosis and analysis. Thus, more and better ways of detecting diseases, and even new diagnostic processes, are appearing. The detection and classification of Alzheimer's disease, which was usually performed via neurological tests and neuroimaging, is now possible through less invasive and equally efficient methods. The existing models for the detection of AD through speech have been increasing in quantity and in quality, though improvements are still needed. At present, the biggest barriers to the automatic detection of AD are that: (a) most systems are language dependent; (b) the number of samples used per study is very small, so the systems are trained on too few examples to achieve optimal performance; (c) system components are not always integrated and may require human intervention; (d) feature sets are not yet fully established, although temporal aspects (total duration, speech rate, articulation rate, among others), pitch, voice periods and interruptions, when combined with language or linguistic features, can lead to very good results. Additional research is needed to find the optimal combination of parameters and the tasks that the (potential) patient should be invited to perform. Thus, future work is envisioned to include the implementation of multilingual or language-independent systems, supported by extensive and diverse databases (which still must be gathered, with a balanced distribution of sex, age, and disease severity), as well as the automation of feature selection and extraction. Better, task-oriented decision models are also required.