Using a Human Interviewer or an Automatic Interviewer in the Evaluation of Patients with AD from Speech

Studies focused on the evaluation of Alzheimer's disease (AD) through the automatic analysis of patients' speech are increasingly frequent, whether to detect the presence of the disease in an individual or to monitor its evolution. However, studies focused on analyzing the effect of the methodology used to elicit the spontaneous speech of the speaker who undergoes this type of analysis are rare. The objective of this work is to study two different strategies to facilitate the generation of spontaneous speech for further analysis: the use of a human interviewer who promotes the generation of speech through an interview, and the use of an automatic system (an automatic interviewer) that invites the speaker to describe certain visual stimuli. In this study, a database called Cross-Sectional Alzheimer Prognosis R2019 has been created, consisting of speech samples from speakers recorded using both methodologies. The speech recordings have been studied through a feature extraction based on five basic temporal measurements. This study demonstrates the discriminatory capacity of these features between speakers with AD and control subjects, independently of the strategy used to elicit spontaneous speech. These results are promising and can serve as a basis for assessing the effectiveness and reach of automated interview processes, especially in telemedicine and telecare scenarios.


Introduction
Alzheimer's disease (AD) is currently the most common cause of neurodegenerative dementia in the world [1,2]. It accounts for 70-76% of dementia cases in developed countries, whose increasingly long-lived populations [1] mean that the numbers threaten to triple by 2050. Memory loss appears as one of the first symptoms, to which others, such as difficulties with language use or spatial and temporal disorientation, are added. In more advanced stages, the ability to carry out daily activities, or even basic body functions such as walking or swallowing [3], decreases or disappears. In any case, by the time the first symptoms are revealed, the damage caused is already irreversible and chronic.
No cure has been found to date, nor is a fully reliable diagnosis possible during life. The diagnostic process remains complex, is inevitably carried out in the advanced stages of the disease, and is prolonged over time. Its use as a screening method is limited, since current methods are expensive and invasive [4]. For this reason, there is great interest in finding biomarkers in more accessible parts of the body that are sensitive to AD before the clinical onset of dementia, and which also allow the specific stages of the disease to be monitored afterwards [5]. In such circumstances, developing eHealth 4.0 solutions based on more accessible biomarkers would allow democratizing the evolutionary and pharmacological control of the disease in an easy, fast, non-invasive and scalable way.

Table 1. Databases of recordings identified for the linguistic analysis of Alzheimer's disease (AD), and types of study according to the distribution in time of the measurements and the type of interviewer used.

Automatic Interviewers
A common denominator in the collection of voice samples in AD studies is to rely on a person who leads the interview, usually by encouraging the subject to recall past memories or answer questions [32,33], read texts [84] or describe a photo [85], although some works have already included computer avatars [71,74], which would make the results more objective and less dependent on human factors such as the interviewer's ability to direct the interview. In any case, a common and objective method of recording the speech of AD patients seems necessary.
Several research efforts have identified relevant advantages in computer-assisted communication compared to human interaction. Among these advantages, participants feel a greater sense of anonymity while being interviewed. These assisted communication systems are increasingly being equipped with communication skills that facilitate interaction, among others by using vision and prosodic analysis to implement active listening behaviours, smiles, head movements and postural mimicry. They use non-verbal strategies and show empathy so that the interviewed subject feels comfortable and the conversation goes more smoothly, which helps to generate feelings of sympathy and trust in the interviewee [86]. In any case, the automatic interviewer needs to be able to perceive human behaviour in order to process it, understand it and react to it. It is also essential that these skills do not work in isolation and that they allow integration into other systems, so that they learn from human interaction as they are used.
There are already systems that work to reduce the barriers between human interviewees and virtual systems that interact with them, implementing automatic speech recognition [87], facial expression recognition [88] and natural language generation [89]. Standardization efforts have tried to consolidate these systems. For example, SAIBA [90] is a framework for generating empathic conversations. LiteBody and DTask, on the other hand, aim at creating social-emotional relationships with users through verbal and non-verbal social behaviours [91]. GRETA [92] is a conversational agent that complies with SAIBA and focuses less on natural language interaction and more on the generation and performance of non-verbal affective behaviours; in particular, it applies a complex facial generation model. The SEMAINE project [93] aims to integrate various research technologies, including some of the above, into the creation of a virtual listener, focusing on perception and backchannelling rather than on deep representations of dialogue. There are even tools that allow the creation of virtual humans that interact with people in a conversation [94]. An example of an automatic interviewer that is based on a virtual human and employs multiple resources to make communication as human as possible is SimSensei Kiosk [86].
Either way, few works have focused on applying these strategies to AD. Although there is evidence of some solutions, such as SimpleC built by IBM [95], that replace purely human variables (such as the interviewer's empathy with the subject), these remain rare.
It has not yet been demonstrated to what extent, for the same subject, the parameters obtained with a human or an automatic interviewer can vary, nor how such differences would fit into current practice.
Within the framework of the detection and evolutionary control of AD based on voice recordings and their automatic processing, this work intends to objectively determine the discriminatory capacity (between healthy voices and pathological AD voices) of two different recording strategies, specifically obtaining the samples through a human interviewer and through an automatic interviewer.

Materials and Methods
This section describes the materials and methodology we have used to determine the discriminatory capacity of the voice of AD patients and healthy control (HC) subjects under the two proposed scenarios: the human interviewer and the automatic interviewer.
A database, which we have called Cross-Sectional Alzheimer Prognosis R2019, has been created to carry out the study and has been used to compare both types of samples. Section 2.2.1 explains how the recordings were made. Subsequently, voice processing and a feature extraction based on speech times were performed. Using a series of descriptive statistical measures and then a statistical analysis based on the Wilcoxon rank-sum test [96], two parallel studies were carried out to determine the discriminatory capacity of each interviewer.

Methods
To find out to what extent the type of interviewer influences the differentiation of healthy voices from pathological AD voices, the Cross-Sectional Alzheimer Prognosis R2019 database has been created. In it, two types of interviews were conducted with each recorded subject, both AD patients and HC subjects, yielding two different types of recordings. The first type of recording was carried out by an automatic interviewer (the Prognosis software, see Section 2.2.3). The second type of recording, as is widely done in the field, was conducted by a member of the research team. Hereafter, we will refer to the first type of recordings as induced speech samples, and to those obtained through the human interviewer as spontaneous speech samples. It is worth clarifying that all recordings were made in the subjects' own environment, both for the patients and the HC subjects, so that the result is as realistic as possible.
After collecting recordings of both types for all subjects, a Voice Activity Detector (VAD) was applied to the samples using Matlab ® software (Figure 1). As a result of this process, the different speech and silence sequences are obtained and stored in a matrix that indicates the initial and final time of each sound fragment $S_i$. The voice signal is then characterized using descriptive statistical estimators of the durations of the sound fragments into which a speaker's speech is divided. The different features are described below.
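By way of illustration, the following is a minimal sketch of an energy-based detector of this kind in Python; it is not the Matlab ® detector used in the study, and the frame length and threshold are illustrative assumptions.

```python
import numpy as np


def energy_vad(signal, fs, frame_ms=30, threshold_db=-40.0):
    """Minimal energy-based voice activity detector.

    Splits the signal into non-overlapping frames, marks a frame as
    speech when its energy exceeds a threshold relative to the most
    energetic frame, and merges consecutive speech frames. Returns an
    (N, 2) matrix with the start and end times (seconds) of each
    sound fragment S_i.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame energy in dB relative to the most energetic frame.
    energy = np.sum(frames ** 2, axis=1)
    energy_db = 10.0 * np.log10(energy / (energy.max() + 1e-12) + 1e-12)
    is_speech = energy_db > threshold_db

    # Collapse runs of speech frames into [start, end] segments.
    segments, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i                                   # a speech run begins
        elif not flag and start is not None:
            segments.append((start * frame_len / fs, i * frame_len / fs))
            start = None
    if start is not None:                               # run reaches the end
        segments.append((start * frame_len / fs, n_frames * frame_len / fs))
    return np.array(segments)
```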
- Average speech time $\bar{t}_S$: describes the average duration of the different sound fragments that can be identified in a speech sample of a speaker. It is estimated [97] as the arithmetic mean of the durations of all the sound fragments in a single recording:

$$\bar{t}_S = \frac{1}{N} \sum_{i=1}^{N} t_{S_i}$$

where $t_{S_i}$ is the duration of each of the sound fragments $(S_1, S_2, \ldots, S_N)$ into which each speech recording $\{S_i\}$ is divided.
- Variance of speech time $\sigma^2_{t_S}$: describes the variability of the durations of the different sound fragments in a recording. It is estimated [98] using the following estimator of the variance:

$$\sigma^2_{t_S} = \frac{1}{N-1} \sum_{i=1}^{N} \left( t_{S_i} - \bar{t}_S \right)^2$$

- Skewness of speech time $\mu_{t_S 3}$: this measure characterizes the behavior of the probability distribution function of the durations of the different sound fragments, quantifying [99] the lack of symmetry around the average duration of the voice fragments. When the studied samples follow a normal distribution, the value of $\mu_{t_S 3}$ is zero; positive or negative values of $\mu_{t_S 3}$ indicate data skewed to the right or to the left of the distribution curve, respectively. Skewness of speech time is calculated using the following estimator:

$$\mu_{t_S 3} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{t_{S_i} - \bar{t}_S}{\sigma_{t_S}} \right)^3$$

where $t_{S_i}$ is the duration of each sound fragment, $\bar{t}_S$ is the average speech time, $\sigma^2_{t_S}$ is the variance of speech time, and $N$ is the number of sound fragments in the speech sample.
- Kurtosis of speech time $Kurt_{t_S}$: another measure that characterizes the behavior of the probability distribution function of the durations of the different sound fragments. It states the proportion of sound fragments in a recording whose duration is close to the average duration: the larger $Kurt_{t_S}$ is, the steeper its distribution curve will be. $Kurt_{t_S}$ is calculated [99] using the following estimator:

$$Kurt_{t_S} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{t_{S_i} - \bar{t}_S}{\sigma_{t_S}} \right)^4$$

- Index of speech time $Ind_{t_S}$: describes the relationship between the total time the subject is speaking and the total duration of the recording. It is calculated by dividing the total duration of the speech sequences by the total recording time of the sample:

$$Ind_{t_S} = \frac{\sum_{i=1}^{N} t_{S_i}}{T_{TOTAL}}$$
where $t_{S_i}$ is the duration of each of the sound fragments $(S_1, S_2, \ldots, S_N)$ into which a speech recording is divided and $T_{TOTAL}$ is the total duration of the recording.
In order to know exactly which of the above variables are AD-discriminatory and which are not for each interviewer, a descriptive statistical analysis and a non-parametric study based on the Wilcoxon rank-sum test [100] have been carried out, since the samples do not follow a normal distribution [101]. The Stata ® software was used for the statistical study.
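By way of illustration, the following minimal Python sketch, assuming a segment matrix such as the one produced by a VAD as above, shows how the five temporal measures could be computed. It uses scipy's moment estimators; the exact estimators of [97-99] may differ in details such as the degrees-of-freedom correction, so this is an approximation rather than the authors' implementation.

```python
import numpy as np
from scipy import stats


def temporal_features(segments, total_time):
    """Compute the five duration-based measures from a VAD segment
    matrix of [start, end] times (seconds) and the total recording
    duration T_TOTAL."""
    durations = segments[:, 1] - segments[:, 0]             # t_{S_i}
    return {
        "mean_t": durations.mean(),                         # average speech time
        "var_t": durations.var(ddof=1),                     # variance of speech time
        "skew_t": stats.skew(durations),                    # skewness of speech time
        "kurt_t": stats.kurtosis(durations, fisher=False),  # kurtosis (non-excess)
        "ind_t": durations.sum() / total_time,              # index of speech time
    }
```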

Databases
The Cross-Sectional Alzheimer Prognosis R2019 database was created to test how discriminatory a voice sample can be for AD depending on the type of interviewer used in the recording. Accordingly, for each subject, whether AD patient or HC subject, this database contains both spontaneous and induced speech samples, all collected in a single session.
It should be noted that, to elicit speech from the subjects, we have not looked for a pre-designed sentence structure or balanced sentences; instead, we have recorded spontaneously generated speech. In this sense, there is no sentence design, but rather recollection elicited through stimuli: in the case of the human interviewer, by inviting the subject to speak, and in the case of the automatic interviewer, through videos and images, with the aim of producing reminiscences that evoke memories in the subjects.
The tools used for the recordings were a laptop, a headset with microphone [102] and, in the case of the automatic interviews, the Prognosis software (see Section 2.2.3). Specifically, four different recordings were obtained for each subject: three of induced speech and one of spontaneous speech. They were recorded at a sampling rate of 44,100 Hz and stored as WAV files.
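As a purely hypothetical illustration (the human-interviewer recordings were actually captured with Audacity ®, and the induced ones with Prognosis), a capture at these settings could look as follows in Python, assuming the third-party sounddevice and soundfile packages:

```python
import sounddevice as sd
import soundfile as sf

FS = 44_100  # sampling rate used in the database (Hz)


def record_sample(path, seconds):
    """Record a mono sample at 44,100 Hz and store it as a WAV file."""
    audio = sd.rec(int(seconds * FS), samplerate=FS, channels=1)
    sd.wait()                  # block until the recording finishes
    sf.write(path, audio, FS)  # WAV format inferred from the .wav extension
```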
The total number of people recorded was 87, of whom 41 were AD patients and 46 were HC subjects, all over 65 years of age. The total number of women recorded was 56, compared to 31 men. As for the distribution according to the degree of the disease, of the 41 AD patients, 15 correspond to the moderate degree and 26 to the mild degree. Table 2, as well as Figures 2-4, represents the distribution of the people recorded according to sex, presence or absence of AD, and the comparison between the different degrees analyzed (HC subjects, mild patients and moderate patients).

Table 2. Distribution of the people recorded according to sex and presence or absence of AD.

Population        Women   Men   Total
Healthy control   23      23    46
AD patients       33      8     41
Total             56      31    87

Human Interviewer
The recordings of the samples were carried out by one member of the research team using an Intel Core i5 laptop with 12 GB of RAM and a 250 GB SSD, where the different recordings have been stored. Audacity ® software was used to carry out the recording, together with the Ozone Rage ST [102] headset with microphone. When conducting the interview, each subject was asked or encouraged to talk freely about any topic they wished, with the objective of obtaining a spontaneous speech recording of between 30 s and 2 min.


Automatic Interviewer
The automatic interviewer samples were obtained using the Prognosis software, whose main objective is to make recordings guided by an "automatic interviewer". In order to obtain these samples of induced speech and encourage the subjects to speak, a series of videos was used as stimuli, taking into account that the common denominator of all the participants is their age: all are over 65 years old. The Portal Memoria Digital de Canarias of the University Library of the ULPGC [103] offers a wide repertoire of old reports and news items, among others, which we have used to provoke reminiscences in the subjects. Numerous studies currently demonstrate the effectiveness of multimedia tools, such as videos, in helping subjects who suffer from dementia or AD to communicate or express themselves more easily [104-107], and they are especially interesting when they are personalized or provoke reminiscences of lived moments. On this basis, we have expressly selected from the repository those videos that reflect past times and that, therefore, we consider could somehow awaken memories in the participants.
The program is divided into three phases (see Figure 5). The first phase corresponds to the registration of the participant's data. The interviews are anonymous, but some participant data are collected: degree of AD (control, mild, moderate or severe), sex and age.
The second phase corresponds to the recording of a sustained vowel (/aa/). The software shows a short motivational video in which one of the project investigators explains how the recording will proceed and how the subject should act. Then, as indicated in this video, the subject must pronounce the vowel /aa/ in a sustained manner for 8 s, which is recorded automatically. The sustained vowel recordings have not been used in this work, although they are kept in the database for future studies.
In the third phase, three samples of induced speech are recorded. For each sample, the software first plays a video inviting the speaker to describe the video that is about to be shown (see Figure 6). Next, the software plays a stimulus video, which is expected to evoke fond memories of childhood or youth in the speaker (see Figure 7), and afterwards it plays another video inviting the speaker to describe what they have seen. The description made by the speaker is automatically recorded for 30 s. The stimulus is expected to generate reminiscences that make the speaker interested in sharing their experiences and memories. The three videos last approximately between 30 s and 2 min and are randomly selected from a video repository provided with the software.
Once each recording period is finished, the software also checks whether the collected sample contains audio from the participant or whether, by contrast, only silence has been recorded. If the participant has not spoken during the recording, a short video is shown in which the same team researcher points out this situation and encourages the subject to speak; since no sample would have been obtained, this deficiency is compensated by playing a new stimulus video. Once the recordings are finished, the researcher reappears on screen and thanks the subject for participating in the study.
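The internal logic of Prognosis has not been published in detail, so the following Python sketch of this silence check and re-prompt loop is hypothetical; play_video, record and pick_video are placeholder callbacks, and the threshold values are assumptions.

```python
import numpy as np


def contains_speech(audio, fs, threshold_dbfs=-45.0, min_active=0.05):
    """Heuristic silence check: the sample is accepted when at least
    `min_active` of its 30 ms frames exceed an RMS energy threshold
    (audio assumed normalized to [-1, 1])."""
    frame = int(0.03 * fs)
    n = len(audio) // frame
    frames = np.asarray(audio[: n * frame], dtype=float).reshape(n, frame)
    rms_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return np.mean(rms_db > threshold_dbfs) >= min_active


def induced_speech_round(play_video, record, pick_video, max_retries=1):
    """One induced-speech recording: prompt the speaker, play a
    reminiscence stimulus, record 30 s, and re-prompt with a new
    video if only silence was captured."""
    for _ in range(max_retries + 1):
        play_video("invitation")     # ask the speaker to describe the next video
        play_video(pick_video())     # randomly selected stimulus video
        sample = record(seconds=30)
        if contains_speech(sample, fs=44_100):
            return sample
        play_video("encouragement")  # researcher encourages the subject to speak
    return sample
```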

Results
The descriptive statistics of each measure have been studied for each of the populations (HC and AD), differentiating by the interview methodology (human and automatic interviewer). The results are shown in Table 3. Based on the data obtained in the feature extraction process, Table 4 shows boxplots of the analyzed samples for each of the five variables under study.


In the boxplots of Table 4 it is possible to see the discriminative capacity of some of the variables under study, such as, for example, the variable $Ind_{t_S}$. Others, however, are less discriminatory.
Subsequently, in order to determine which variables are capable of discriminating AD from HC with each interviewer, a non-parametric analysis based on the Wilcoxon rank-sum test has been made. The results are shown in Table 5. In this case, the p-value associated with the $z$ statistic must be below 0.05 (a 95% confidence level) for a variable to be accepted as discriminatory. In other words, a variable is considered discriminatory when the null hypothesis that there is no difference between the two populations is rejected.
As can be seen in Table 5, the variables $\bar{t}_S$, $\sigma^2_{t_S}$ and $Ind_{t_S}$ confirm that there is a difference between HC and AD subjects, since the p-value is below 0.05 for both methodologies.
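As a point of reference only (the statistical study itself was carried out with the Stata ® software), the same decision rule can be sketched in Python with scipy, assuming hc and ad are hypothetical arrays holding the per-subject values of one feature for each population:

```python
from scipy.stats import ranksums


def is_discriminatory(hc, ad, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test: the variable is considered
    discriminatory when the null hypothesis of no difference between
    the HC and AD populations is rejected (p < alpha)."""
    z, p = ranksums(hc, ad)
    return z, p, p < alpha


# Example usage for one feature and one interviewer type:
# z, p, verdict = is_discriminatory(hc_ind_ts, ad_ind_ts)
```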

Discussion
Databases are a fundamental aspect of any research, since only on their basis can we develop our experimental studies and analyses. In this work, we have located and classified a series of databases related to language use in AD patients in order to better understand their role and the current situation in the field. From a thorough review of the state of the art, we have found, in addition to a shortage in the number of located databases, a great diversity in how the recordings of the subjects are made. For example, aspects related to the recording process, to the automation of the interviews or to the linguistic tasks performed by the subjects differ greatly from one study to another. Although aspects like those mentioned above influence the recordings, they are not the only ones; there are many other variables, such as the language, the environment or simply the pre-processing methods used, which mean that each database is obtained under different conditions. However, it should be mentioned that there is a certain inclination as regards the type of linguistic task performed: up to 80% of the cases analyzed use spontaneous speech in their recordings, understanding spontaneous speech as those tasks in which the subject is asked questions and given a limited, relatively long time to express themselves freely. Nevertheless, we do not have clear evidence as to whether other linguistic tasks, such as reading, could provide more or less information or even be complementary. On the other hand, we have been able to observe that only 18% of the located databases correspond to longitudinal studies. This type of study is of special interest because it allows the analysis of the time variable on the samples, which is undoubtedly a reflection of the progressive deterioration of language that these patients suffer. Among the databases located, it should be noted that only one of them automated the interview process with the subject by means of computerized avatars. However, we are not aware of how this fact can effectively affect the final results.
The Cross-Sectional Alzheimer Prognosis R2019 database has been created in order to find out how automating the interview process affects the collection of samples and the results of the subsequent statistical analyses. This database consists of two types of recordings: spontaneous speech (human interviewer) and induced speech (automatic interviewer). In this sense, it is worth highlighting two main advantages of the database used. First, it collects two clearly differentiated methodologies for recording the voice for the purposes indicated in this study, with the same subject participating in both types of recordings. The second advantage resides in the potential of these recordings which, on the basis of the pertinent analyses carried out, allow us to discover to what extent these methodologies are validated, the automatic one in particular.
A descriptive statistical analysis and a non-parametric statistical study of the variables that characterize the durations of the different sound fragments of a speech sample ($\bar{t}_S$, $\sigma^2_{t_S}$ and $Ind_{t_S}$) have been carried out.
As a first step, by means of the descriptive statistical analysis, we have observed that, for both kinds of interviewer, the durations of the different sound fragments of a speech sample, and their variability, are always higher in HC subjects than in AD patients. In a second round, the Wilcoxon rank-sum test has been applied to these variables ($\bar{t}_S$, $\sigma^2_{t_S}$ and $Ind_{t_S}$) in order to verify whether they have discriminatory capacity to differentiate between HC and AD subjects, and whether this depends on the interview methodology (human interviewer or automatic interviewer). As a result, the p-values obtained for each of these variables show that the proposed measures have discriminatory capacity independently of the methodology applied in the interview.
For its part, it can be seen from Table 4 that the $Ind_{t_S}$ variable has a high discriminatory potential for both the automatic interviewer and the human interviewer. For the rest of the variables, the sample values overlap to a greater or lesser extent in all cases. This does not apply to the $Ind_{t_S}$ variable where, apart from the values not overlapping, the mean of each population is clearly different regardless of the interviewer used: for the human interviewer it is approximately 0.8 for the control subjects and about 0.55 for the AD patients, while for the automatic interviewer it is close to 0.7 for the HC subjects and around 0.5 for the AD population. Regarding the dispersion of the data, it is greater in the case of the automatic interviewer than in the case of the human interviewer.
In any case, both interviewers offer good results. In the case of the automatic interviewer, there are fewer studies that allow us to quantify its effectiveness compared to the human one. The good results obtained in this first approach serve as a baseline for continuing to work along this line, applying different tests, such as parametric ones, or analyzing other variables beyond the temporal ones.

Conclusions
In this work we have carried out a search and review of the state of the art on the different strategies that have been followed so far in the recording of subjects for the linguistic analysis of AD. Currently, although the repository of located databases is not as extensive as would be desired, we have been able to note the great diversity of strategies followed in the creation of these databases and the lack of a common criterion. Such a criterion would imply knowing which kinds of tasks carried out by the subject are most informative; who conducts the interviews and how, where and when they take place; and the possible influence of these factors on the recording. Up until now, there has been no clear evidence of which option is more beneficial for the purpose: the detection and/or evolutionary control of AD. Likewise, we have been able to verify that there is a significant lack of longitudinal databases. Their contribution is particularly interesting, as they would allow us to capture the progressive deterioration that these patients suffer.
Another important aspect is the automation of the recording process as such. Few works have focused on applying automatic strategies, at least to AD. This includes, for example, the automation of neuropsychological tests or patient interviews that are currently still administered manually. Although there is evidence that at least one of the databases we have located conducts guided interviews with computer avatars, little is known about the benefits of automating these processes. Therefore, it seems interesting to check and explore in depth the extent to which, for the same subject, the parameters obtained could vary depending on the type of interviewer.
In this paper, two concepts have been defined to understand the specific benefits of these automatic techniques compared to their manual counterparts: induced speech (obtained with the automatic interviewer) and spontaneous speech (obtained with a human interviewer). Based on them, a database called Cross-Sectional Alzheimer Prognosis R2019 has been created, in which samples of both types have been taken. From the processing of these samples and their subsequent statistical analysis, the extracted features prove to be discriminatory regardless of the recording strategy, whether a person acted as interviewer or whether the recordings were obtained automatically with the Prognosis software. Although the results obtained are promising, the scarcity of this type of study implies a clear need to continue working towards increasingly strong evidence that allows for a higher level of interpretation.
The promising results obtained also open the door to a thorough study in which the Cross-Sectional Alzheimer Prognosis R2019 database is expanded, complemented and improved. As future lines, we propose combining the speech analysis raised in this work with an expansion of the database, not only to new participants but also to new data that can likely be correlated with AD, such as educational or pharmacological factors, as well as other data of a geographical and social nature.
It also seems interesting to extend this type of study to new techniques proposed in the field of speech recognition, in which ideas based on the study of samples obtained in real noisy environments, such as social gatherings, streets, cafes and restaurants, are raised [108]. Likewise, an interesting line in this regard is given by the current challenges of scenarios that combine speech enhancement, speaker diarization and speech recognition modules, for example, by means of multispeaker speech recognition for unsegmented recordings [109].
Above all, this study aims to open the door to an evaluation methodology based on non-invasive, objective, collaborative and easily replicable techniques. In addition, the high acceptance by patients is an important advantage. Looking to the future, all these conditions together would help make remote online sample acquisition a reality, useful not for medical diagnosis as such but for evolutionary control, pharmacological control or early detection of AD, among others.
Although the automatic methods applied to date are obviously not yet sophisticated enough, it seems clear that, in the future, greater knowledge in this area could contribute to the scalability of this type of language test, which until now has been applied manually, with the economic, social and health limitations that this involves. There is also a wide field of study in which examinations could be automated not only from an acoustic point of view but also considering other cognitive aspects such as semantics or the lexicon. In any case, deepening the automation of these processes would open more possibilities for techniques currently on the rise, such as telecare.