Speech Quality Feature Analysis for Classification of Depression and Dementia Patients

Loss of cognitive ability is commonly associated with dementia, a broad category of progressive brain diseases. However, major depressive disorder may also cause a temporary deterioration of cognition known as pseudodementia. Differentiating true dementia from pseudodementia remains difficult even for an experienced clinician, and extensive, careful examinations must be performed. Although mental disorders such as depression and dementia have been widely studied, there is still no solution for a shorter and less demanding pseudodementia screening. This study inspects and compares the distributions and statistical characteristics of acoustic features from dementia patients and depression patients. We found that some acoustic features were shared by both dementia and depression, albeit with reversed correlations. Statistical significance was also found when comparing the features. Additionally, the possibility of utilizing machine learning for automatic pseudodementia screening was explored. The machine learning pipeline includes feature selection using the LASSO algorithm and a support vector machine (SVM) with a linear kernel as the predictive model, trained on a database of age-matched symptomatic depression patients and dementia patients. High accuracy, sensitivity, and specificity were obtained in both the training and testing sessions. The resulting model was also tested against datasets that were not included in training and still performed considerably well. These results imply that dementia and depression might be both detected and differentiated based on acoustic features alone, and that automated screening is possible given the high accuracy of the machine learning results.


Introduction
Dementia is a collective term for symptoms attributed to loss of recent and remote memory, along with difficulty in absorbing new knowledge and trouble in decision making. The most common cause of dementia is Alzheimer's disease, which accounts for 60-70% of all dementia cases worldwide. Presently there is no treatment available [1], and recent research focuses on early detection of dementia signs [2][3][4][5][6][7][8][9] and on reducing risk factors to slow cognitive decline [10][11][12][13].
Preliminary diagnosis of dementia is typically performed in a mental hospital by a licensed psychiatrist interviewing the patient and administering tests [14][15][16]. Occasionally, diagnosing dementia becomes a complex process, as elderly patients with major depressive disorder often have symptoms overlapping with dementia. To determine whether a patient is truly suffering from dementia, rigorous testing must be performed [17]. A temporary decrease in mental cognition caused by mental disorders is defined as pseudodementia [17][18][19][20][21]. The key difference of pseudodementia is the reversibility of the cognitive impairment, in contrast with the progressive nature of dementia. In some cases, pseudodementia also serves as a biomarker of dementia [21]. Unfortunately, most engineering research concerns only depression severity or dementia severity [22,23], and almost none has focused on pseudodementia.
Features commonly employed for automated mental health screening include facial features (gaze, blink, emotion detection, etc.) [24][25][26], biosignals (electroencephalogram, heart rate, respiration, etc.) [27][28][29][30], and auditory features (intensity, tone, speed of speech, etc.) [23,31]. Although biosignals are the most reliable data source, most biosignal measurement devices are arduous to equip, which limits their value. On the other hand, facial and acoustic features may be obtained with minimal burden to the patient. As audio feature analysis is comparatively straightforward compared with facial image analysis, we utilized audio features in this study instead of image features.
The aim of this study was to use an array microphone to record conversations between psychiatrists and depression or dementia patients in a clinical setting, and to investigate the differences in acoustic features between the two patient groups, rather than against healthy volunteers, differing from conventional studies. Additionally, we used datasets labelled by licensed psychiatrists to reduce subjectivity. We reveal the features contributing to pseudodementia screening, and we examine the possibility of utilizing machine learning for automatic pseudodementia screening.

Data Acquisition
This study was conducted as part of the Project for Objective Measures using Computational Psychiatry Technology (PROMPT), a research project aimed at developing objective, noninvasive, and easy-to-use biomarkers for assessing the severity of depressive and neurocognitive disorders, including dementia. The details of the project may be found in [32].
The PROMPT study was approved by the Keio University Hospital Ethics Committee (20160156, 20150427). All participants provided written informed consent. The experiment was conducted at Keio University Hospital and the Joint Medical Research Institute. During the interview, the patient and the psychiatrist were seated across a table, as shown in Figure 1.
A single session consists of a "free talk" segment followed by a "rating" segment. In the "free talk", the psychiatrist conducts a typical clinical interview concerning the patient's daily life and mood; this segment lasts around 10 min. In the "rating" segment, the patient is interviewed based on clinical assessment tools related to their mental health history, which may include tasks such as a clock-drawing test and a memory test, or personal questions such as their sleep habits and depressive mood in recent weeks. The "rating" segment typically lasts more than 20 min.
Figure 1. Recording setup during the interview session. P is the patient and T is the psychiatrist. There is a distance of approximately 70 cm between the patient's seat and the recording apparatus.

Participants
For the statistical analysis and the first and second parts of machine learning, several datasets were removed from the PROMPT database in consideration of age and the presence of symptoms. Only datasets satisfying the following criteria were included:
1. Age between 57 and 84 years old; 57 is the lowest age among dementia patients and 84 is the highest age among depression patients. The purpose of this criterion was to remove the effect of age, which is positively correlated with dementia.
2. For dementia patients: mini-mental state examination (MMSE) score of 23 or less, accompanied by a 15-item geriatric depression scale (GDS) score of 5 or less. A person is defined as having symptomatic dementia if the MMSE score is 23 or less; the purpose of this criterion was to select only patients with dementia symptoms and exclude patients with both symptoms.
3. For depression patients: 17-item Hamilton depression rating scale (HAMD17) score of 8 or more. A person is defined as depressed if their HAMD17 score is 8 or more.
4. The recording session was from "free talk" and was at least 10 min long. The purpose of this criterion was to ensure enough information was contained within the recordings.
For the third part of machine learning, different criteria were applied to the PROMPT database to construct a test set consisting of young depressed and old dementia datasets. Specifically, the criteria were:
1. For dementia patients: MMSE score of 23 or less, accompanied by a GDS score of 5 or less, with an age of 85 years or more.
2. For depression patients: HAMD17 score of 8 or more, with an age of no more than 56 years.
3. The recording session was from "free talk" and was at least 10 min long.
Each dataset corresponds to an interview session from one subject. In this study, the datasets were considered independent because (1) the time gap between sessions was long, two weeks at minimum; and (2) the clinical scores may increase or decrease compared with the first visit, especially for depression patients. Figure 2 illustrates the dataset filtering for the statistical analysis and machine learning phases.

Materials
A vertical array microphone, the Classis RM30W (Beyerdynamic GmbH & Co. KG, Heilbronn, Germany), with an internal noise cancellation filter to remove wind and pop noise, was used to record conversations between patients and psychiatrists. The sampling rate was set to 16 kHz. Feature extraction and analysis were performed on a typical consumer machine (Dell G7 7588, Intel Core i7-8750H @ 2.20 GHz, 16 GB RAM, manufactured in China) running the Windows 10 operating system. All methods were available as built-in functions in MATLAB 2019b.

Preprocessing
In some rare cases, the recordings contained outliers, possibly caused by random errors, so preprocessing of the raw data needed to be conducted. We defined outliers using the inter-quartile range (IQR). A point X_i in the audio recording is defined to be an outlier if it satisfies one of the following conditions: X_i < Q1 − 1.5 × IQR or X_i > Q3 + 1.5 × IQR. Here, X is the signal, Q1 is the lower (1st) quartile, Q3 is the upper (3rd) quartile, and IQR is the inter-quartile range, computed by subtracting Q1 from Q3. We then applied cubic smoothing spline fitting to the audio signal without the outliers. The objective of this method is twofold: (1) to interpolate the removed outliers, and (2) to remove subtle noise.
Additionally, intensity normalization was performed to ensure that the data were on an equal scale and to reduce clipping in the audio signals. The normalization was conducted by rescaling the signal such that the maximum absolute value of its amplitude is 0.99. Continuous silence, in the form of runs of zeroes at the front and end of the recordings, was also deleted.
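The preprocessing chain described above can be sketched as follows. This is an illustrative Python/NumPy version, not the original MATLAB implementation: the 1.5 × IQR fences are the conventional choice, and an interpolating cubic spline stands in for MATLAB's smoothing spline.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def preprocess(signal: np.ndarray) -> np.ndarray:
    """IQR-based outlier removal with spline interpolation, peak
    normalization to 0.99, and trimming of zero-valued silence at
    the front and end of the recording."""
    q1, q3 = np.percentile(signal, [25, 75])
    iqr = q3 - q1
    # Keep points inside the conventional 1.5*IQR fences.
    ok = (signal >= q1 - 1.5 * iqr) & (signal <= q3 + 1.5 * iqr)
    idx = np.arange(len(signal))
    # Fit a cubic spline to the inliers and interpolate over the outliers.
    cleaned = CubicSpline(idx[ok], signal[ok])(idx)
    # Rescale so the maximum absolute amplitude is 0.99.
    cleaned = 0.99 * cleaned / np.max(np.abs(cleaned))
    # Trim exact-zero silence at both ends.
    nz = np.flatnonzero(cleaned)
    return cleaned[nz[0]:nz[-1] + 1]
```

In practice the spline step both fills the removed samples and lightly smooths the surrounding signal, matching the twofold objective stated above.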

Feature Extraction
A total of ten acoustic features were extracted from the raw data: pitch, harmonics-to-noise ratio (HNR), zero-crossing rate (ZCR), Mel-frequency cepstral coefficients (MFCC), gammatone cepstral coefficients (GTCC), mean frequency, median frequency, signal energy, spectral centroid, and spectral rolloff point, with details in Table 1. These features were chosen because they represent both the temporal and the spectral characteristics of a signal. Additionally, some of these features relate closely to speech, which is a common biomarker for both depression and dementia [35][36][37]. The features were computed once every 10 ms by applying a 10 ms window with no overlap, and feature extraction was performed on the windowed signals. The total count of audio features is 36, as the MFCC and GTCC each comprise 14 coefficients. As we used data with a length of at least 10 min, a minimum of 60,000 datapoints were obtained for each feature. We then computed the mean, median, and standard deviation (SD) of the datapoints and used them for statistical analysis and machine learning, resulting in a total feature count of 108. Table 1. List of features utilized in this study.
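The framing and summary-statistics steps can be illustrated with two of the simpler features (ZCR and framewise energy); the remaining features would be computed per frame in the same way. This is a sketch, not the MATLAB pipeline, and the sum-of-squares energy here is a simplification of the paper's energy definition.

```python
import numpy as np

def frame_features(signal, fs=16000, win_ms=10):
    """Split the signal into non-overlapping 10 ms frames and compute
    zero-crossing rate and energy for each frame."""
    win = int(fs * win_ms / 1000)            # 160 samples at 16 kHz
    n_frames = len(signal) // win
    frames = signal[:n_frames * win].reshape(n_frames, win)
    # Fraction of adjacent sample pairs whose signs differ.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    # Framewise energy as the sum of squared samples (a simplification).
    energy = np.sum(frames ** 2, axis=1)
    return zcr, energy

def summarize(feature):
    """Collapse a per-frame series into the three statistics used
    downstream: mean, median, and standard deviation."""
    return np.mean(feature), np.median(feature), np.std(feature)
```

A 10 min recording at 16 kHz yields 60,000 such frames per feature, which are then reduced to the mean/median/SD triple used in the analysis.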

Features, Mathematical Functions, and References
Pitch [38]; harmonics-to-noise ratio (HNR) [39]; zero-crossing rate (ZCR) [40]; gammatone cepstral coefficients (GTCC) [41]; mean frequency, the mean of the power spectrum of the signal; median frequency, the median of the power spectrum of the signal.
Notation: for ZCR, N, sgn, and X_i denote the length of the signal, the signum function extracting the sign of a real number (positive, negative, or zero), and the i-th sample of signal X, respectively. For mean frequency and median frequency, the power spectrum of the signal was obtained by Fourier transform. For signal energy, E(X) is the signal energy of signal X, σ(X) denotes the standard deviation of signal X, and µ(X) the mean of signal X. For the spectral centroid, c denotes the spectral centroid, f_i is the frequency in Hertz corresponding to bin i, s_i is the spectral value at bin i, and b_1 and b_2 are the band edges, in bins, over which the spectral centroid is calculated. For the spectral rolloff point, r is the spectral rolloff frequency, with s_i, b_1, and b_2 defined as above.
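As a rough illustration of the spectral descriptors defined above, the centroid and rolloff point can be computed from an FFT spectrum as follows. This sketch operates over the full band rather than the band edges b_1..b_2, and the 95% rolloff threshold is an assumption, not a value stated in the paper.

```python
import numpy as np

def spectral_centroid(x, fs):
    """Magnitude-weighted mean frequency:
    c = sum(f_i * s_i) / sum(s_i) over the spectrum bins."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return np.sum(freqs * spec) / np.sum(spec)

def spectral_rolloff(x, fs, fraction=0.95):
    """Frequency below which `fraction` of the total spectral
    energy is contained (95% is a common default)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    cumulative = np.cumsum(spec)
    idx = np.searchsorted(cumulative, fraction * cumulative[-1])
    return freqs[idx]
```

For a pure tone, both descriptors collapse onto the tone's frequency, which makes them a convenient sanity check for the extraction pipeline.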

Statistical Analysis
To investigate the relationship between audio features and clinical symptoms, linear correlations of the acoustic features against the corresponding clinical rating tools were computed. The clinical rating tools were the HAMD for depression subjects and the MMSE for dementia subjects. In addition, two-tailed t-tests were performed to check for statistical significance, with p-values adjusted using the Bonferroni correction. Additionally, correlations of age and sex with the clinical rating tools were evaluated for validation purposes.
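The per-feature correlation test with Bonferroni adjustment can be sketched as below. This is an illustrative implementation, assuming each column of `features` is one of the 108 summary features and `n_tests` is the number of comparisons being corrected for.

```python
import numpy as np
from scipy import stats

def correlate_with_bonferroni(features, scores, n_tests):
    """For each feature column, compute Pearson's r against the
    clinical score and Bonferroni-adjust the p-value by multiplying
    it by the number of tests (capped at 1.0)."""
    results = []
    for j in range(features.shape[1]):
        r, p = stats.pearsonr(features[:, j], scores)
        results.append((r, min(p * n_tests, 1.0)))
    return results
```

A feature is then reported as significantly correlated when its adjusted p-value falls below 0.05, matching the significance threshold used in the paper.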

Machine Learning
Machine learning was performed in three stages: (1) to examine the possibility of automatic pseudodementia diagnosis with unsupervised learning, (2) to examine the possibility of automatic pseudodementia diagnosis with a supervised classifier, and (3) to validate the classifier's robustness against non-age-matched datasets. The unsupervised learning algorithm used in the first stage was k-means clustering, with k = 2 and the squared Euclidean distance metric. For stages 2 and 3, the machine learning model was a binary classifier: a support vector machine (SVM) with a linear kernel, a 3rd-order polynomial kernel, or a radial-basis function (RBF) kernel [43]. The linear and polynomial kernels have a single hyperparameter, the cost parameter C, while the RBF kernel has two: C and gamma. Hyperparameters were optimized by grid search over values ranging from 1/1000 to 1000. The linear kernel was chosen because it allows visualization of feature contributions, unlike SVMs with nonlinear kernels. For the second phase, the machine learning session used nested 10-fold cross-validation, defined as follows:
1. Split the datasets into ten smaller groups, maintaining the ratio of the classes.
2. Perform ten-fold cross-validation using these groups. For each fold:
(a) Split the training group into ten smaller subgroups.
(b) Perform another ten-fold cross-validation using these subgroups. For each inner fold:
i. Perform LASSO regression [44] and obtain the coefficients. The LASSO regression solves
min over (α, β) of (1/(2N)) Σ_{i=1}^{N} (y_i − α − x_i^T β)^2 + λ Σ_j |β_j|,
where α is a scalar intercept, β is a vector of coefficients, N is the number of observations, y_i is the response at observation i, x_i is the vector of predictors at observation i, and λ is a nonnegative regularization parameter. A high value of λ results in stricter feature selection; in this study, it was computed automatically as the largest possible value yielding a non-null model. The performance of the model is not considered at this step.
ii. Mark the features with coefficients of absolute value less than 0.01.
(c) Remove the features rejected across the inner folds and train the SVM on the outer training group, then evaluate it on the held-out group.
3. Compute the average performance and standard deviation of the models.
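The nested cross-validation procedure can be sketched with scikit-learn as follows. This is an assumption-laden illustration, not the MATLAB implementation: the fixed LASSO alpha, the standardization step, and the "selected in any inner fold" rule are stand-ins for the paper's automatic λ choice and rejection criterion.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def nested_cv(X, y, n_outer=10, n_inner=10, coef_threshold=0.01):
    """Outer 10-fold CV evaluating a linear SVM; inner 10-fold CV runs
    LASSO and keeps features whose |coefficient| reaches the threshold
    in at least one inner fold."""
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        votes = np.zeros(X.shape[1])
        inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=1)
        for in_idx, _ in inner.split(X_tr, y_tr):
            scaler = StandardScaler().fit(X_tr[in_idx])
            lasso = Lasso(alpha=0.05).fit(
                scaler.transform(X_tr[in_idx]), y_tr[in_idx])
            votes += np.abs(lasso.coef_) >= coef_threshold
        selected = votes > 0  # features surviving any inner fold
        clf = SVC(kernel="linear", C=1.0).fit(X_tr[:, selected], y_tr)
        accuracies.append(clf.score(X[test_idx][:, selected], y[test_idx]))
    return np.mean(accuracies), np.std(accuracies)
```

The key design point the sketch preserves is that feature selection happens strictly inside each outer training fold, so the held-out group never influences which features are kept.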
In the third phase, an SVM model was trained using the age-matched subjects and the features selected in the second phase. The resulting model's performance was evaluated against the filtered-out subjects: young depression and old dementia subjects. In both cases, dementia patients were labelled as class 0 (negative) and depression patients as class 1 (positive). An illustration of the phases is shown in Figure 3.

Evaluation Metrics
We utilized eight metrics to evaluate the effectiveness of the machine learning model, all of which are computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In this study, the depression class was labelled "positive" and the dementia class "negative". The TP, FP, TN, and FN values were obtained from the confusion matrix, as shown in Figure 4. From the confusion matrices, the evaluation metrics of observed accuracy, true positive rate (TPR/sensitivity), true negative rate (TNR/specificity), positive predictive value (PPV/precision), negative predictive value (NPV), F1-score, Cohen's kappa, and Matthews correlation coefficient (MCC) can then be computed. The formulas for these metrics are described in Table 2. These are conventional performance evaluation metrics; metrics related to inter-rater reliability, such as Cohen's kappa and the MCC, were included to ensure validity of measurement in cases of imbalanced samples.
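All eight metrics follow directly from the four confusion-matrix counts; a minimal implementation using the standard definitions (which Table 2 is expected to match) is:

```python
import math

def metrics(tp, fp, tn, fn):
    """Compute the eight evaluation metrics from confusion-matrix
    counts, with depression as the positive class."""
    n = tp + fp + tn + fn
    acc = (tp + tn) / n
    tpr = tp / (tp + fn)                 # sensitivity / recall
    tnr = tn / (tn + fp)                 # specificity
    ppv = tp / (tp + fp)                 # precision
    npv = tn / (tn + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - p_e) / (1 - p_e)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, tpr, tnr, ppv, npv, f1, kappa, mcc
```

Unlike accuracy, kappa and MCC stay near zero for a classifier that merely predicts the majority class, which is why they are the more informative metrics for the imbalanced third-phase evaluation.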

Demographics
A total of 120 participants (depression n = 77, dementia n = 43) took part in the study, and 419 datasets (300 depression, 119 dementia) were obtained. After age-matching, 177 datasets (89 depression, 88 dementia) from 53 participants (depression n = 24, dementia n = 29) qualified for the first and second phases of machine learning. The datasets excluded by age-matching, consisting of young depression patients and old dementia patients, formed the test set used in the third phase of machine learning: 242 datasets (211 depression, 31 dementia) from 67 patients (depression n = 53, dementia n = 14). Details of the subject demographics are described in Table 3.

Statistical Analysis
In this section, the statistical analysis of the extracted features is reported. Pearson's correlation found significant correlations with the clinical interview tools for GTCCs 1, 3, and 12 and MFCCs 1, 3, 4, 7, and 12. The average absolute correlation coefficient R was 0.264 with an SD of 0.049. The highest absolute correlation with statistical significance (p < 0.05) was |R| = 0.346 for depression and |R| = 0.400 for dementia. Features with significant correlations related to depression tended to yield weak to moderate negative Pearson correlations (average absolute R ± SD = 0.289 ± 0.05), while features with significant correlations related to dementia tended to yield weak to moderate positive Pearson correlations (average absolute R ± SD = 0.281 ± 0.06). The features' distributions are depicted in Figure 5 and their corresponding Pearson correlation values are shown in Table 4. The corrected two-tailed t-test showed significant differences in HNR, ZCR, GTCC coefficients 4-14, mean frequency, median frequency, MFCC coefficients 4-13, spectral centroid, and spectral rolloff point; no significant difference was found in pitch and energy. No significant correlation was found between sex and the clinical assessment tools (depression R = 0.021, p = 0.853; dementia R = 0.142, p = 0.928). Age had no significant correlation with depression's clinical assessment tool (R = 0.097, p = 0.403), but a significant, moderate correlation between age and dementia's clinical assessment tool was found (R = 0.424, p = 0.0046).

Machine Learning
In this section, the results of machine learning are presented. The evaluation results from unsupervised learning with the k-means algorithm are shown in Table 5. For the SVM with linear kernel, 26 features were completely rejected in the feature selection and were therefore removed when creating the model for the second phase. The rejected features were related to pitch, GTCCs 1-3, MFCCs 1-3, signal energy, spectral centroid, and spectral rolloff point. Feature selection for the SVM with 3rd-order polynomial kernel resulted in the removal of 28 features, related to pitch, GTCCs and MFCCs (1-3, 12-13), signal energy, spectral centroid, and spectral rolloff. LASSO with the RBF-SVM similarly rejected 28 features, related to the low-order (1-4) and high-order (10-13) MFCC and GTCC coefficients, pitch, signal energy, spectral centroid, and spectral rolloff. The feature contributions of the trained linear SVM model are presented in Figure 6, alongside the list of the remaining 82 features. The feature contributions are the absolute values of the linear SVM coefficients. Machine learning evaluation results for phase 2 are shown in Tables 6-8 and the results for phase 3 in Table 9. Results with and without the LASSO algorithm are also shown in these tables to confirm the effectiveness of feature selection. Here, the label "positive" represents depression patients and "negative" dementia patients.

Discussion
In the present study, we obtained audio recordings from clinical interviews of depression and dementia patients. The recordings were filtered according to the analysis criteria, and preprocessing and acoustic feature extraction were performed on the qualifying datasets, followed by statistical analysis and machine learning on the acoustic features. This study has potential limitations. First, although subtle, the recordings were contaminated with the doctor's voice, which naturally reduces the quality of the acoustic features. Next, there was no removal of silence between the dialogues; we hypothesized that long silences correspond to low motivation and are therefore useful for predicting depression. Third, we did not consider real-time applications: we utilized the full length of the recordings for predicting dementia versus depression. Finally, all the experiments were conducted in a Japanese hospital, with Japanese doctors and Japanese patients, so the speech features we extracted might be specific to Japanese speakers. Needless to say, these limitations imply potential bias, and the results of our study must be interpreted with attention to them.
As a result, we found that GTCC coefficients 1, 3, and 12, along with MFCC coefficients 1, 3, 4, 7, and 12, showed significant correlations with both clinical assessment tools, the HAMD and the MMSE, as shown in Table 4. Interestingly, the signs of the Pearson correlation coefficients differed: negative correlation was observed for HAMD and positive correlation for MMSE. This suggests that although these features were important for both depression and dementia, they correlated differently. Another thing to note is that the highest absolute correlation with significance (p < 0.05) was 0.346 for HAMD and 0.400 for MMSE, suggesting a weak to moderate correlation between the audio features and the clinical rating scores.
The corrected t-test between these features in Figure 5 showed statistical differences only in certain features. Interestingly, the standard deviation of a rather high-order MFCC coefficient showed a significant difference. Normally, most of the information is represented in the lower-order coefficients, whose distributions are important for speech analysis. The feature contributions shown in Figure 6 place these features in the middle of the selected features, and some of the lower-order MFCC features were even removed. This might imply that the features shared between dementia and depression did not contribute much to predicting them.
Statistical comparison of acoustic features between two groups found significant differences in both temporal and spectral acoustic features. No significant difference between the two groups can be found in pitch and energy, both in the family of temporal features.
Although the result from the unsupervised clustering algorithm was not satisfactory, both the accuracy and the inter-rater agreement show that the performance was better than chance, denoting underlying patterns in the data. In the second part of machine learning, feature selection was performed using the LASSO algorithm. Here, both pitch and signal energy were rejected, alongside some spectral features. Considering that pitch and signal energy also showed no statistical significance in the t-test, it can be inferred that these features do not contribute to the classification of depression and dementia. In contrast, GTCCs 4-14 and MFCCs 4-14 had statistically significant differences and were also selected by the LASSO algorithm. GTCCs and MFCCs are similar features, related to the tones of human speech. Although the GTCC was not developed for speech analysis, both are commonly used in speech recognition systems [45,46]. This finding is consistent with the fact that a person's speech characteristics might be related to their mental health. The SVM feature contributions also confirmed that the top contributing features were MFCCs and GTCCs. As the MFCC and GTCC coefficients are related to the filterbanks used to compute them, these coefficients have the benefit of being interpretable [47].
Surprisingly, the best result was obtained with the linear-kernel SVM, although its scores were only slightly superior to those of the nonlinear SVMs. Additionally, the effectiveness of the LASSO algorithm for feature selection was evaluated, with an interesting result: in the second phase, all the SVM models benefited from LASSO feature selection, but in the third phase, the nonlinear SVMs appeared to benefit most. This might be explained by the LASSO algorithm itself. As LASSO is a penalized linear regression and the feature selection step essentially discards features that contribute nothing to that regression, the linear SVM may behave similarly, making the selection redundant in this case.
Nevertheless, high accuracy and inter-rater agreement were obtained from the models in both machine learning phases. For comparison, studies [24,25,28,29] and [23] report accuracies of 87.2%, 81%, 81.23%, 89.71%, and 73% for predicting depression, respectively; [31] reports 73.6% accuracy for predicting dementia, and [30] reports 99.9% TNR and 78.8% TPR. However, most of these studies compared healthy subjects against symptomatic patients, while our study compared patients afflicted with different mental disorders. Additionally, most conventional studies measure depression by questionnaire rather than clinical examination, so this cannot be said to be a fair comparison. The low NPV and inter-rater agreement scores in the third phase may be due to the fact that the third-phase evaluation used a heavily imbalanced dataset with a higher number of samples than the training phase. These results suggest the possibility of using audio features for automatic pseudodementia screening.

Conclusions
We recorded the audio of clinical interview sessions of depression patients and dementia patients in a clinical setting using an array microphone. Statistical analysis showed significant differences in audio features between depressed patients and dementia patients. A machine learning model was constructed and evaluated; considerable performance was recorded for distinguishing depression patients from dementia patients. Feature contribution analysis revealed the MFCC and GTCC features to be the highest contributing features, with the 9th and 4th MFCC coefficients at the top. Based on our findings, we conclude that automated pseudodementia screening with machine learning is feasible.

Future Work
Although this study has yielded considerably good results, there is still room for improvement. One example is eliminating the psychiatrist's voice from the recordings. Although the microphone was directed at the patient, a subtle amount of the psychiatrist's voice was also included. As such, a voice separation algorithm needs to be developed and applied to remove the psychiatrist's voice. This will certainly add silent parts to the recordings, and the feature extraction methodology would need to be modified; instead of processing audio with a 10 ms window, activity-based windows might be considered. Additionally, a dynamic cardioid microphone or a multichannel array microphone might be beneficial for picking up sounds only from the patient's side. In that case, room settings for suppressing reverberation and microphone placement become very important.
In conjunction with psychiatrist voice removal, activity-based features might also reveal relevance in aspects we did not consider in this study. Here, we hypothesized that longer silences between answers correspond to lower patient cognition. We assumed that these silence segments would affect the mean value of the features while minimally affecting the median value, and would thus be beneficial for differentiating dementia from depression. However, activity-based or content-based analysis might reveal differences in features we considered irrelevant in this study, such as signal energy.
Also, this study does not consider patients with overlapping symptoms of depression and dementia. Thus, the next step of this study is to develop a multi-class classifier capable of predicting patients with overlapping symptoms. A regression model trained with clinical assessment tools for both depression and dementia is also a possibility.
To improve accuracy, more advanced machine learning techniques such as neural networks might be suitable. Although the number of available datasets is relatively small for neural networks, sub-sampling and bootstrapping techniques might help increase it. Attention must be paid during validation so that no data leakage occurs. Additionally, feature extraction methods such as the combination of numerous hybrid acoustic features, as listed in [48], might also be beneficial. Nevertheless, the curse of dimensionality should be avoided when handling so many predictors.
Additionally, while this study did not consider real-time analysis, shorter audio input lengths should be considered. In this study we used 10 min recordings of the "free talk" session and disregarded processing time. In a real deployment, however, it would be more beneficial if processing were complete before the patient and psychiatrist started the examination with the clinical assessment tools.
Finally, regarding the datasets used for training and testing: all experiments were conducted in a Japanese hospital, with Japanese therapists and Japanese patients. Although the audio features relating to mental health are supposed to be language-independent, there is a need to replicate this research outside of Japan and to evaluate the performance of our model against publicly available databases. Utilizing other databases would also allow a fair evaluation of our model's effectiveness.