According to the World Health Organization (WHO), over 300 million people are estimated to suffer from depression [1], and that number is rising globally, especially at advanced ages. Scientifically known as Major Depressive Disorder (MDD), depression is a mental disorder characterized by low mood, low self-esteem, loss of interest, low energy, and pain without a clear cause for an extended period of time. It has negative impacts on a person's family, work, and sleeping and eating habits. In extreme cases, as reported in [2], half of all completed suicides are related to depressive and other mood disorders. For this reason, many researchers have focused on developing systems to diagnose and prevent this mental disorder, helping psychiatrists and psychologists to assist patients as soon as possible.
Traditional mental health disorder instruments, such as Beck's Depression Inventory (BDI-II) [3], the Geriatric Depression Scale (GDS) [4], the Hamilton Rating Scale for Depression (HRSD) [5], and the Patient Health Questionnaire (PHQ-8) [6] (used as the target variable in this research), are not completely reliable or require too much time and effort from the psychiatrist to diagnose the illness. Moreover, they are quite tedious for the patients, and the results could be biased as a consequence.
Automatic Depression Detection (ADD) using computer algorithms is the next step in the prevention and monitoring of this mental disorder. Studies on ADD based on speech signals started in the late 2000s [7] and have taken on increased importance since then. It is worth noting that ADD systems are not intended to substitute medical experts' judgments, but to assist them with clinical decision support.
In this context, since 2011, the Audio–Visual Emotion Challenges (AVEC) have been organized annually. These competitions are "aimed at comparison of multimedia processing and machine-learning methods for automatic audio, visual, and audio–visual health and emotion sensing, with all participants competing strictly under the same conditions". Using different databases, each competition has focused on one particular mental disorder, such as depression [8] or bipolar disorder [11], as well as on emotion recognition through four well-known dimensions of the emotions—arousal, valence, expectancy, and power—which have been present in all the challenges [8]. Some of the tasks are called sub-challenges, and researchers can propose solutions for one or many of them.
This research follows the guidelines of the Depression Classification Sub-Challenge (DCC) of the 2016 Audio–Visual Emotion Challenge (AVEC) [10], whose main goal is to determine whether a speaker has been diagnosed as depressed or not using audio, visual, and/or text data. In particular, in this work, we focus on the audio information. For this task, we propose a system based on ensemble learning, whose individual classifiers are Convolutional Neural Networks (CNNs) that take speech log-spectrograms as inputs. As in the DCC sub-challenge, the English-speaker database DAIC-WOZ [13] is used for the system evaluation.
The rest of the paper is organized as follows: Section 2 reviews the state of the art in automatic depression detection, CNNs, and ensemble methods. In Section 3, we present the dataset we have used and describe the proposed system. Section 4 contains the relevant experiments and results. Finally, in Section 5, we present our conclusions and some lines of future work.
2. Related Work
In this section, we first briefly analyze the systems developed for automatic depression detection in recent years. After that, an overview of CNN models and ensemble methods is provided.
Most of the works in the literature related to ADD systems consider two main sources of information, either individually or in combination: the visual and audio modalities. Although we do not use visual features in this research, we include here a summary of the related state of the art for completeness.
The majority of the work on the visual modality is based on facial video information, by means of the modeling of facial expressions, head movements, eye gaze, or blinks. The way the information is encoded plays a very important role in these kinds of systems. Facial expressions have generally been represented by the Facial Action Coding System (FACS), Active Appearance Models (AAM), or Local Phase Quantization at Three Orthogonal Planes (LPQ-TOP) [14]. With these representations, the well-known Support Vector Regression (SVR) models achieve good results [15]. In other works, such as [17], eigenfaces are proposed to predict depression in adolescents. Other systems combine visual cues with other types of information, mainly acoustic, resulting in multimodal systems that usually outperform the individual modalities. In this context, it is worth mentioning the system described in [18], which consists of neural network-based hierarchical classifiers and SVR ensembles for fusing the audio and visual information, and was successfully assessed on the AVEC-2013 challenge [8]. Another relevant work is described in [19]: a combination of audio and visual features with a Decision Tree as classifier, which was the winner of the AVEC-2016 challenge.
As mentioned in Section 1, in this work we focus on the acoustic modality. Speech not only conveys linguistic content, but also contains paralinguistic features (how words are said) that provide important clues about the emotional, neurological, and mental traits and states of the speaker. For this reason, in recent years, speech technologies have been proposed for the assessment, diagnosis, and tracking of different health conditions that affect the subject's voice [20]. In this area, commonly referred to as Computational Paralinguistic Analysis, current research encompasses the detection of pathological voices due, for example, to laryngeal disorders [21]; the diagnosis and monitoring of neurodegenerative conditions, such as Parkinson's disease [22], Mild Cognitive Impairment [24], Alzheimer's disease [24], or Amyotrophic Lateral Sclerosis [26]; the prediction of stress and cognitive load levels [27]; and the detection of psychological pathologies, such as autism [29] or depression [30], which is the topic of this paper.
Conventional systems for speech-based health tasks consist of data-driven approaches based on hand-crafted acoustic features, such as pitch, prosody, loudness, speech rate, and energies, among others, combined with a machine-learning algorithm such as Logistic Regression, Support Vector Machines (SVM), or Gaussian Mixture Models [22]. Nevertheless, very recent works, such as [20], use deep-learning techniques for these tasks, since these methods have achieved unprecedented success in the field of automatic learning applied to signal processing, particularly in image, video, and audio problems.
In the specific field of speech-based automatic depression detection, most of the developed systems also follow one of these two strategies. On the one hand, conventional ADD systems rely on studies about the importance of several acoustic characteristics for depression detection. In fact, according to [30], the most relevant ones for this task are pitch, formants, intensity, jitter, shimmer, harmonics-to-noise ratio, and speech rate. These voice quality features are related to the observation that depressed speakers tend to speak in an unnatural and monotonous way [33]. On the other hand, several ADD systems based on the deep-learning paradigm have recently been proposed. In particular, Convolutional Neural Networks (CNNs), a specific case of Artificial Neural Networks (ANNs), are among the most commonly used architectures in this field.
CNNs appeared in the 1980s [34], but the research that became a milestone was [35], where a CNN-based network was proposed for image recognition tasks. Since then, a large number of problems in computer vision have been solved with this kind of architecture, such as handwritten digit recognition, traffic sign classification, people detection, and image recognition for health applications. Later, these approaches spread to audio-related tasks, such as automatic speech recognition [36], speech emotion recognition [40], and acoustic scene classification [42]. For our purposes, the research in [44] is especially relevant, as it is the basis of our work: it proposes a speech-based depression detection system, called DepAudionet, that uses One-Dimensional CNN (1d-CNN), Long Short-Term Memory (LSTM), and fully connected layers.
On the other hand, ensemble methods are meta-algorithms that improve machine-learning results by combining more than one model, usually based on the same base learner. Ensemble learning with neural networks was introduced in [45], and since then, it has been used in many architectures and applications, such as image categorization [46], sentiment analysis [47], and acoustic environment classification [48].
In this paper, we deal with the problem of automatic depression detection using only the subject's voice. In particular, we present two main contributions. First, we propose a refined 1d-CNN architecture based on the aforementioned DepAudionet model [44], optimized by selecting the best configuration from exhaustive experimentation. Second, we take advantage of the ensemble learning strategy to fuse several machines with this 1d-CNN architecture, in such a way that the performance of the individual systems is improved. Although ensemble CNN models have been successfully used in other speech- and audio-related tasks, such as automatic speech recognition [39], speech emotion recognition [40], and acoustic scene classification [43], to the best of our knowledge, they have not previously been used for automatic depression detection from speech.
An important issue in real-world speech-based health applications is privacy, as patients' audio and/or video recordings are highly sensitive and contain personal information. In the DAIC-WOZ dataset used in this work, this issue does not apply, as participants completed consent forms allowing their data to be shared for research purposes [13]. However, there is presently a growing interest in developing this kind of application for real-world scenarios where, at least, data acquisition is done through microphone-enabled smart devices and/or other wearable technologies, which would allow the remote monitoring of patients [20]. According to [50], apart from consent forms, two main strategies for strengthening privacy can be considered. The first approach consists of extracting acoustic features from which it is not possible to reconstruct the raw speech signal. A good example is [51], where the audio characteristics used are the percentage of speech detected in a given time period and the percentage of speech uttered by the patient, so it is not necessary to store the whole raw recordings. When the system uses features that allow the recovery of the original speech signal, a second approach can be applied, consisting of the extraction and encryption of the acoustic characteristics on the local device and their transmission to a secure server where further analysis is done [49]. If the system proposed in this paper had to work in these kinds of real-world scenarios, the second approach should be considered, as its inputs are log-spectrograms (a reversible transformation of the speech signal), although this issue is beyond the scope of this paper.
4. Results and Discussion
4.1. Experimental Protocol
This section details the tools used to develop the ADD system and the experimental protocol followed to assess it.
To build the different models, the Python 3.6 programming language has been used. The computation of log-spectrograms has been carried out with the LibROSA library, and the neural network model is programmed in Keras. Other Python libraries used for the experiments and graphs are Scikit-Learn and Matplotlib.
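As an illustration, the following is a minimal sketch of how a log-spectrogram can be obtained with LibROSA. The sampling rate and STFT parameters shown here are placeholders, not necessarily the values used in our experiments.

```python
import numpy as np
import librosa

# Placeholder parameters: the actual sampling rate and STFT settings
# used in this work are those described in Section 3.
def compute_log_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=512):
    """Return the log-magnitude (dB) spectrogram of an audio file."""
    signal, _ = librosa.load(wav_path, sr=sr)
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    return librosa.amplitude_to_db(np.abs(stft))
```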
We adopt the F1-score at speaker level as the main metric to assess the system, since it is the one proposed in the AVEC-2016 challenge [10]. The F1-score is the harmonic mean of precision and recall. These three metrics are more appropriate for unbalanced problems than accuracy, which has also been calculated. All of them have been measured for both depressed and non-depressed speakers.
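For clarity, a minimal sketch of how these speaker-level metrics can be computed with Scikit-Learn follows; the array and function names are illustrative placeholders.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# y_true, y_pred: speaker-level ground-truth and predicted labels
# (1 = depressed, 0 = non-depressed); placeholder names.
def speaker_level_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # pos_label selects the class the metric refers to:
        # 1 for the depressed class, 0 for the non-depressed class.
        "f1_depressed": f1_score(y_true, y_pred, pos_label=1),
        "f1_non_depressed": f1_score(y_true, y_pred, pos_label=0),
        "precision_depressed": precision_score(y_true, y_pred, pos_label=1),
        "recall_depressed": recall_score(y_true, y_pred, pos_label=1),
    }
```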
The different architectures have been trained for 50 epochs with a batch size of 80 samples, using the Adadelta optimizer, a binary cross-entropy loss function, and a learning rate decreasing from an initial value of 1.
Regarding the system input, each log-spectrogram has been standardized independently using min-max normalization, so that the resulting log-spectrograms lie in the range [0, 1]. Standard (z-score) normalization was also tried, resulting in worse performance.
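A minimal sketch of this per-spectrogram min-max normalization (variable names are illustrative):

```python
import numpy as np

def minmax_normalize(log_spec):
    """Scale a single log-spectrogram to the range [0, 1]."""
    s_min, s_max = log_spec.min(), log_spec.max()
    return (log_spec - s_min) / (s_max - s_min)
```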
Unless stated otherwise, the parameter configuration of the 1d-CNN architecture (the input dimensions, the number and size of the filters in Layer 1, the kernel size, stride, and padding of Layer 2, and the number of neurons in Layer 4) takes the default values described above.
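To make this setup concrete, the following is an illustrative Keras sketch of a single 1d-CNN of this type, combined with the training configuration described above. The exact layer stack (in particular the step between the max-pooling and the hidden dense layer), the input dimensions, and the convolutional kernel width are assumptions, not the exact configuration of the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_1dcnn(n_frames=120, n_freq_bins=513, n_filters=128, n_hidden=128):
    """Illustrative 1d-CNN sketch; shapes and kernel widths are assumed."""
    model = models.Sequential([
        # Layer 1: one-dimensional convolution along time; the frequency
        # bins act as input channels (the kernel width of 3 is assumed).
        layers.Conv1D(n_filters, kernel_size=3, activation="relu",
                      input_shape=(n_frames, n_freq_bins)),
        # Layer 2: max-pooling along the temporal dimension, using the
        # best values found later in this section (pool 5, stride 4).
        layers.MaxPooling1D(pool_size=5, strides=4),
        # Assumed intermediate step: flatten the feature maps.
        layers.Flatten(),
        # Layer 4: fully connected hidden layer.
        layers.Dense(n_hidden, activation="relu"),
        # Output layer: probability of the depressed class.
        layers.Dense(1, activation="sigmoid"),
    ])
    # Training configuration as described above: Adadelta with an initial
    # learning rate of 1 and a binary cross-entropy loss.
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=1.0),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```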
Concerning the system training, to obtain more reliable results, 5-fold cross-validation has been used. For each fold, the original training set has been divided into a training subset and a validation subset, guaranteeing that each speaker belongs to only one of these subsets. Final metrics have been computed by concatenating the partial results obtained over the test set with the model trained in each of the 5 fold iterations with the corresponding training+validation configuration.
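A speaker-independent split of this kind can be sketched with Scikit-Learn's GroupKFold, as shown below; the variable names are placeholders and this is not necessarily how the original partitions were generated.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X: sample-level features; y: sample-level labels; groups: NumPy array
# with the speaker ID of each sample (placeholder names).
def speaker_independent_folds(X, y, groups, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, val_idx in gkf.split(X, y, groups=groups):
        # GroupKFold guarantees that no speaker appears in both subsets.
        assert not set(groups[train_idx]) & set(groups[val_idx])
        yield train_idx, val_idx
```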
We compare our proposal with two different reference systems. The first is the baseline system provided by the AVEC-2016 challenge, which is based on an SVM classifier and uses the hand-crafted features mentioned in Section 3.1.3. The second is the DepAudionet depression detection system [44]. As mentioned before, its architecture is composed of 1d-CNN, LSTM, and fully connected layers with the particular configuration described in [44].
4.2. Results
In this section, we present the results of the experiments we have carried out to evaluate the performance of the system. First, we show the performance of the three ensemble methods under consideration. Next, some parameters of the 1d-CNN architecture are varied to observe their influence on the performance of the whole system, both for the single network and for the ensemble.
4.2.1. Results with Different Ensemble Methods
The number M of machines that compose the ensemble is quite relevant. Therefore, in this set of experiments, we analyze the influence of the value of M on the performance of the system, as well as the improvements achieved by the ensemble averaging algorithm for the three proposed methods: average of the probabilities at sample level (Method 1), most frequent value of the labels at sample level (Method 2), and most frequent value of the labels at speaker level (Method 3). In particular, Figure 6 represents the variation of the F1-score for depressed (left) and non-depressed (right) speakers as a function of the number of ensemble machines, M, for the three methods under consideration. The case of using a single 1d-CNN corresponds to M = 1. Results for M > 50 are not reported, since preliminary experiments showed that values of M greater than 50 do not improve the performance of the system for this database. The mean (black line) and the standard deviation (colored shadows) of the F1-score have been obtained by taking into account 200 different combinations of the individual machines.
As can be seen, in general, the performance increases with the number of classifiers in the ensemble for the three methods. Only for Method 3 is there a worsening in the depressed case when a large number of machines is used. Nevertheless, the results are still better than those of a single machine.
In addition, the variance of the individual predictions decreases as the number of machines increases, except for Method 3 (mode of the labels at speaker level). This is due to the strong influence that a change in a single speaker's prediction has on the F1-score. In contrast, the F1-scores achieved by ensembles operating at sample level are more robust to these variations, because the number of samples is considerably higher than the number of speakers.
From these results, we can conclude that the best option is to use Method 1 with M = 50; therefore, this is the configuration of the ensemble used in the experiments, indicated as "Ensemble 50 1d-CNN" in the next sections.
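For clarity, the three fusion rules described above can be sketched in NumPy as follows; probs, speaker_ids, and the 0.5 decision threshold are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

# probs: array of shape (M, n_samples) with each machine's probability of
# the depressed class; speaker_ids: shape (n_samples,). Placeholder names.

def method1_sample_prob_average(probs):
    """Method 1: average the probabilities at sample level, then threshold."""
    return (probs.mean(axis=0) > 0.5).astype(int)

def method2_sample_label_vote(probs):
    """Method 2: majority vote of the per-machine labels at sample level."""
    labels = (probs > 0.5).astype(int)            # shape (M, n_samples)
    return (labels.mean(axis=0) > 0.5).astype(int)

def method3_speaker_label_vote(probs, speaker_ids):
    """Method 3: each machine labels each speaker; machines then vote."""
    labels = (probs > 0.5).astype(int)
    decisions = {}
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        # Per-machine speaker label: majority over that speaker's samples.
        per_machine = (labels[:, mask].mean(axis=1) > 0.5).astype(int)
        # Final speaker label: majority vote across the M machines.
        decisions[spk] = int(per_machine.mean() > 0.5)
    return decisions
```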
4.2.2. Number of Filters in Layer 1 and Size of Layer 4
Table 1 presents the four performance metrics of the system when varying the number of filters N, i.e., the depth of the first layer, and the number of neurons in Layer 4, i.e., the hidden layer of the non-convolutional part. Please note that these two parameters should be related. Table 1 shows the results of the AVEC-2016 baseline, the DepAudionet system, a single 1d-CNN (represented as the mean performance of 50 machines), and the combination of the same machines with the best ensemble configuration from Section 4.2.1, i.e., Method 1. These metrics are shown for both the depressed and the non-depressed classes (the latter in parentheses).
As expected, the ensemble method provides better F1-scores than the single network. In addition, all the F1-scores obtained by our system in both configurations, single network and ensemble, are better than the baseline. For the depressed class, this is because, although the baseline produces a better recall than our system, its precision is clearly worse than those achieved by both the single network and the ensemble. For the non-depressed category, our system also obtains better F1-scores than the baseline, since its recall is clearly higher than the one produced by the baseline, whereas its precision is only slightly lower.
Regarding the DepAudionet system, again, its performance is better than the baseline in terms of F1-score and precision for the depressed class and in terms of F1-score and recall for the non-depressed class. However, it is outperformed by our systems (single network and ensemble) according to the accuracy, F1-score, precision and recall measures for both the depressed and non-depressed categories.
The F1-scores for more combinations of these two parameters are shown in a grid representation, corresponding to an ensemble system with M = 50, since it is the best one as shown above. Although more than one pair of parameter values produces the best performance for the depressed and non-depressed samples, we have selected the solution with 128 filters in the convolutional layer and 128 neurons in Layer 4, since it is the solution with the fewest parameters to train.
4.2.3. Size of the Max-Pooling Window
Another important parameter is the size of the kernel in the max-pooling layer. Taking into account the characteristics of the network, this kernel is rectangular, with a value equal to 1 in the frequency dimension, due to the one-dimensional filters in the first convolutional layer, and a variable length in the temporal dimension. Consequently, Table 2 shows a comparison of different kernel lengths in this second layer. Although there are many possibilities when also changing the stride of the kernels, Table 2 displays only the best combination for each kernel size.
From this table, we can see that the best kernel size is the one covering 5 temporal slices. For this best case, the kernel stride is (1, 4).
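In Keras terms, and viewing the input as a 2-D time-frequency map, this best configuration corresponds to something like the following sketch (not the authors' code):

```python
from tensorflow.keras import layers

# Pooling acts only along time: the frequency dimension is already
# collapsed to 1 by the one-dimensional filters of the first layer.
best_pool = layers.MaxPooling2D(pool_size=(1, 5), strides=(1, 4))
```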
To sum up, the best parameters of the system, at least for this database, are: 128 filters in the first convolutional layer, a kernel size in the max-pooling layer of (1, 5) with a stride of (1, 4), 128 neurons in the last hidden layer, and an ensemble of 50 machines combined with Method 1 (i.e., averaging the probabilities at sample level). With these values, our proposed ensemble-based system achieves relative improvements in terms of F1-score with respect to the baseline, the DepAudionet system, and the single 1d-CNN architecture.
5. Conclusions
In this paper, we have presented an automatic system for detecting whether a person suffers from depression by analyzing his/her voice. It is based on an ensemble method, ensemble averaging, that combines One-Dimensional Convolutional Neural Networks (1d-CNNs). Each individual 1d-CNN uses log-spectrograms as inputs and is composed of one input layer, four hidden layers, and one output layer, whose parameters have been optimized through exhaustive experimentation. Moreover, a random sampling procedure has been used to balance and augment the training data.
The system has been evaluated on the DAIC-WOZ dataset in the context of the AVEC-2016 Depression Classification Sub-Challenge and compared to the baseline system provided in this challenge, which is based on an SVM classifier and hand-crafted features, and to the DepAudionet architecture, consisting of 1d-CNN, LSTM, and fully connected layers. Results have shown that our proposed system achieves relative improvements in terms of F1-score with respect to the baseline, the DepAudionet system, and the single 1d-CNN architecture.
For future work, we plan to address the following lines: studying the use of other ensemble learning methods, such as bagging or stacking; building deeper and narrower networks; adding other features as input to the network, such as gender or other types of metadata; and exploring different strategies for combining our audio-based system with the information provided by the visual modality.