Article

Evaluating the Performance of eGeMAPS Features in Detecting Depression Using Resampling Methods

by Joshua Turnipseed and Benedito J. B. Fonseca, Jr. *
Department of Electrical Engineering, Northern Illinois University, 590 Garden Rd., DeKalb, IL 60115, USA
* Author to whom correspondence should be addressed.
Signals 2026, 7(3), 41; https://doi.org/10.3390/signals7030041
Submission received: 22 January 2026 / Revised: 1 March 2026 / Accepted: 1 April 2026 / Published: 6 May 2026
(This article belongs to the Special Issue Advances in Biomedical Signal Processing and Analysis)

Abstract

This paper investigates how well eGeMAPS features can be used to classify depression from a patient’s speech audio samples through the use of statistical resampling methods. We use permutation tests to evaluate, with high confidence, whether eGeMAPS features and the speaker’s depression status are dependent. We use bootstrap confidence intervals to test, with high confidence, whether eGeMAPS features are able to better discriminate depression in male speakers than in female speakers. Lastly, we compare the detection power of different subsets of the eGeMAPS features. We use an open-source dataset of depressed and non-depressed speakers (E-DAIC), an open-source audio feature extractor (eGeMAPS), and open-source machine learning classifiers (WEKA) to enable replication of results and establish a baseline for future studies.

1. Introduction

Major depression is one of the most prevalent mental disorders in the United States [1], underscoring the urgent need for effective approaches to understanding and treating this debilitating condition. However, it can take a clinician several meetings with a patient, while becoming familiar with the patient's behaviors, to diagnose major depression. If a tool existed to aid in the detection of depression, clinicians could consult it to help make their decision.
One way to aid in the detection of depression is to use machine learning to analyze a patient's speech. The general approach is to observe the acoustic parameters of the patient's voice and notify the clinician if certain features fall in ranges that indicate depression. As described in the Literature Review section, previous researchers have considered using audio features and machine learning to detect depressed patients; however, the results are inconclusive, often reporting conflicting classification performance. Analyses are also often based on training datasets that are not large enough, raising the question of whether the classification models overfit the training set. Furthermore, results are often based on proprietary datasets or use classification tools with a large number of hyperparameters, making them difficult to replicate.
The goal of our paper is to further evaluate the performance of audio features in detecting depression while using open-source datasets and tools to enable replication of results. We evaluate the depression detection performance of several machine learning methods using (1) an open database (E-DAIC) of depressed and non-depressed speakers; (2) the open-source OpenSMILE tool; (3) a well-defined set of audio features (eGeMAPS), which are readily available in the OpenSMILE distribution; and (4) the open-source WEKA machine learning tool. To facilitate the replication of our results, and given the limited availability of training data, we focus on traditional machine learning methods as opposed to deep learning methods. Details about E-DAIC, OpenSMILE, eGeMAPS, and WEKA are given in Section 2.1, Section 2.2, the eGeMAPS Features Set subsection, and Section 2.3, respectively.
We start by addressing a fundamental question that has not been properly addressed before: can we state, with high confidence, that audio features and depression status are dependent on each other? We answer this question by using resampling methods. More precisely, we use permutation tests to test the hypothesis.
$H_0^{(I)}$: eGeMAPS audio features of speakers are independent of their depression status.
Rejecting $H_0^{(I)}$ provides evidence that, with high confidence, eGeMAPS features may be correlated with depression. While permutation tests have been considered in the past to evaluate the performance of machine learning algorithms [2], this is the first study that applies permutation tests to evaluate the efficacy of audio features in detecting depression. We further use bootstrapping to provide confidence intervals for the probability of decision errors, to compare the discrimination power of eGeMAPS features in detecting depression in both male and female speakers, and to compare the discrimination power of various subsets of eGeMAPS features.
The main contributions of this paper are as follows:
  • We show that, at least in the speaker set from E-DAIC, hypothesis $H_0^{(I)}$ can be rejected with 95% confidence in many instances, strongly indicating that eGeMAPS audio features and speakers' depression status are in general dependent.
  • Our results further reinforce previous results obtained in the literature by providing 95% bootstrap confidence intervals for the probability of decision errors when using eGeMAPS features.
  • We show that eGeMAPS features are able to better classify depression in male speakers than in female speakers.
  • We show that eGeMAPS temporal features provide higher discrimination power when detecting depression in females, while eGeMAPS energy features provide higher discrimination power when detecting depression in males.

Literature Review

Depression detection from speech is a well-established concept, as individuals suffering from major depression exhibit speech characteristics described as dull, monotone, monoloud, and lifeless [3]. These characteristics manifest in lowered pitch, reduced pitch range, slower speaking rates, and increased articulation errors [4]. Various acoustic features have been employed to detect these speech alterations, including fundamental frequency (F0), formants, jitter, shimmer, intensity, and speech rate [3,5,6]. While some studies incorporate multimodal approaches that integrate audio, video, and textual features for depression detection, this paper focuses exclusively on audio-based analysis.
Several studies have investigated depression detection using traditional machine learning approaches. One study explored monolingual and multilingual depression detection by analyzing vowel-based features across three languages: German, Hungarian, and Italian. A Support Vector Machine (SVM) classifier was employed for both binary and multi-class classification tasks, with results indicating a 26% improvement in classification performance when accounting for gender differences [7]. Another study used Gaussian Mixture Models to model depressed and neutral speech, finding that the first formant (F1) was the most effective single feature, despite prior research suggesting that the second formant (F2) may better capture emotional and cognitive influences linked to depression [8]. Additionally, research on speech landmark features examined acoustic event transitions such as vocal cord vibrations, nasality, frication, and speech bursts. These features, converted into n-gram representations and used with an SVM classifier, achieved an F1 score of 0.86 on the E-DAIC dataset [9].
Although deep learning methods generally require larger datasets than traditional learning methods [10], some studies have employed convolutional neural networks and Long Short-Term Memory networks to classify depression. The authors in [11,12] reported moderate detection performance, with F1 scores ranging between 0.52 and 0.63. Another investigation [13] utilized a deep convolutional neural network model with over 4.5 million parameters, reporting balanced accuracy scores ranging between 0.54 and 0.60. Using a proprietary dataset containing over 84 h of speech data from 4748 participants, the authors in [14] compared various machine learning techniques, including deep learning models, and reported unweighted average recall (UAR) values ranging between 0.56 and 0.63. These studies emphasize the difficulty of accurately detecting depression using only audio-based features. On the other hand, using Long Short-Term Memory networks, the authors in [15] used a new corpus of 118 native Italian speakers and reported F1 scores of 0.83.
The findings across these studies present both promising and inconsistent outcomes. While some research reports high classification accuracy [7,9,15,16], others highlight limitations in audio-based depression detection [11,13,14]. Furthermore, gender has been identified as a key factor influencing classification performance [5,7,8,11,16]. Acoustic features such as prosodic and glottal characteristics, F0, formants, Mel-Frequency Cepstral Coefficients (MFCCs), shimmer, and jitter have consistently proven effective in detecting depressed speech [3,5,6,17,18,19]. Several studies have also raised concerns regarding the small sample sizes used in many investigations, which may affect the generalizability of their results [16,17].
The E-DAIC dataset is widely used in depression detection research [4,9,11,12,13]; however, reported classification performance is inconclusive. While some have reported strong classification performance [9], others have reported significantly lower classification performance metrics on this dataset [11,12,13].
OpenSMILE is a widely used tool for extracting speech features, and the included eGeMAPS feature set has been frequently applied in depression detection. eGeMAPS includes low-level descriptors such as shimmer, jitter, and formants, which have been identified as relevant indicators of depressive speech patterns. However, studies show inconsistent effectiveness, with some demonstrating strong classification performance [16], while others highlight limitations in predictive capabilities in both traditional and deep machine learning models [13,14].
Although the studies reviewed do not explicitly mention WEKA, many of their approaches could be implemented using this framework, as it provides the same machine learning algorithms mentioned above. WEKA offers a comprehensive collection of classification, regression, and feature selection methods, making it a valuable tool for reproducible research in depression detection.

2. Materials and Methods

We evaluate the depression detection performance of several machine learning methods using the E-DAIC dataset of depressed and non-depressed speakers, the OpenSMILE tool, and the WEKA machine learning tool. In this section, we provide more details about the dataset and tools and how they were used.

2.1. The E-DAIC Dataset

The Extended Distress Analysis Interview Corpus (E-DAIC) is a dataset created by the University of Southern California containing 275 speakers. These speakers are divided into training, development, and test sets. The training set contains 163 speakers, the development set contains 56 speakers, and the test set contains the remaining 56 speakers. E-DAIC has been used as a benchmark in many evaluation efforts [4].
When creating the dataset, each speaker was assigned an identifier and was interviewed by a virtual assistant. The interview questions aimed at diagnosing the speaker for psychological conditions such as depression. For this, the designers of E-DAIC used the Eight-Item Patient Health Questionnaire (PHQ-8) [20], which is a commonly used tool to indicate the level of depression in a subject. Depending on the answers given, each speaker was given a score ranging from 0 to 24. From this score, E-DAIC labels the speaker as depressed or non-depressed: speakers with scores 10 or below are considered non-depressed; otherwise, the speaker is considered to be depressed. More details about E-DAIC can be found in [21,22].
We highlight that our research uses the E-DAIC dataset by considering the whole set of 275 speakers to build various training and test subsets. This was done to better create balanced sets and avoid bias in the training of the classifier [12]; i.e., to control the number of depressed and non-depressed speakers as well as the number of male and female speakers during different classification trials. Another motivation for this was to enable the creation of several balanced sets using different E-DAIC IDs each time.

Preparation of Audio Files from Speakers

For each speaker, the E-DAIC database contains a single audio WAV file. The WAV file contains audio from both the interviewer and the speaker during a back-and-forth interview process. Thus, before extracting audio features from the WAV file, it is important to remove the audio from the interviewer. For this, we used E-DAIC’s transcript files, which indicate the time periods containing the speaker’s audio samples. Using such transcripts, we extracted the audio samples from the WAV file from only the periods of time in which the speaker was speaking.
For each time period containing the speaker's speech, we partitioned it into nonoverlapping segments of equal duration $T_{UL}$. Time periods shorter than $T_{UL}$, as well as any remaining portion of a time period that did not fill a full segment of duration $T_{UL}$, were discarded.
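To make this step concrete, the following is a minimal sketch of the segmentation procedure in Python. It assumes a transcript CSV with "start_time" and "stop_time" columns (in seconds) marking the participant's speech; the actual E-DAIC transcript column names and file layout may differ, and the soundfile package is assumed to be installed.

import csv
import soundfile as sf  # assumed dependency for WAV reading

def segment_speaker_audio(wav_path, transcript_path, t_ul=8.0):
    # Cut the participant's speech into non-overlapping segments of t_ul seconds.
    audio, sr = sf.read(wav_path)
    seg_len = int(t_ul * sr)
    segments = []
    with open(transcript_path, newline="") as f:
        for row in csv.DictReader(f):
            start = int(float(row["start_time"]) * sr)
            stop = int(float(row["stop_time"]) * sr)
            period = audio[start:stop]
            # Keep only full-length segments; shorter remainders are discarded.
            for k in range(len(period) // seg_len):
                segments.append(period[k * seg_len:(k + 1) * seg_len])
    return segments, sr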

2.2. OpenSMILE

OpenSMILE is a free and open-source software toolkit developed by researchers from the Technical University of Munich [23]. OpenSMILE extracts various features from an audio stream, and it has been used by many researchers in classification tasks involving audio features [5]. In fact, it has been used in two of the Audio/Visual Emotion Challenges (AVEC2017 and AVEC2019) [4,24].
OpenSMILE works by first creating a configuration file specifying the desired audio features to be extracted from a given input audio file. OpenSMILE offers several built-in configuration files. In this paper, we use the eGeMAPS feature set and its configuration file, present in the OpenSMILE distribution under the name 'eGeMAPSv02'.

The eGeMAPS Features Set

In order to standardize the set of OpenSMILE parameters and enable better comparison across studies that extract key vocal features from speakers, the Geneva Emotion Research Group created the Extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [25].
eGeMAPS contains 88 features. These features cover energy, temporal, frequency, and spectral characteristics and aim at capturing the changes in speech during emotional expression. The Geneva Emotion Research Group selected these features based on their reported efficacy, theoretical importance, and historical usage [25]. The 88 features are obtained from statistical functionals of 25 low-level descriptors (LLDs). To produce the functionals, the LLD features are first smoothed using a 3-frame moving average filter. We highlight that only voiced regions are used to smooth the functionals related to pitch, jitter, and shimmer. Additional functionals, such as percentiles and slope statistics, are also calculated within voiced regions for loudness and pitch. The only exception is loudness, whose functionals are calculated over all regions. Certain functionals, such as arithmetic means and variation coefficients, are extracted from both voiced and unvoiced segments. Arithmetic means and variation coefficients are also extracted from features related to formant bandwidths; however, such functionals only consider voiced regions. Further features and functionals included in eGeMAPS are the arithmetic mean and variation coefficient of the spectral flux and of MFCC coefficients 1–4 in voiced regions, and the arithmetic mean of the spectral flux in unvoiced regions. Detailed descriptions of all the eGeMAPS features can be found in [25].
Each segment of duration $T_{UL}$ is processed with OpenSMILE using the eGeMAPSv02 configuration file, which produces an ARFF output file for subsequent processing by WEKA. It should be mentioned that OpenSMILE may output a value of 0.0 for some attributes, for instance when no voiced segments are found in a segment. To avoid such segments impacting the analysis, any audio segment producing an ARFF output file with 10 or more consecutive features with a value of 0.0 was removed from the analysis.
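The two operations above (per-segment feature extraction and the removal of segments with long runs of 0.0 features) can be sketched as follows. This is a sketch only: it assumes the SMILExtract command-line tool is installed, that the -O option of the eGeMAPSv02 configuration writes the functionals to an ARFF file (as described above), and that the ARFF data row lists the feature values after a leading name field; paths and the exact ARFF layout are assumptions.

import subprocess

def extract_egemaps(wav_path, arff_path, config="eGeMAPSv02.conf"):
    # Run OpenSMILE on one audio segment; the 88 functionals are written to arff_path.
    subprocess.run(["SMILExtract", "-C", config, "-I", wav_path, "-O", arff_path],
                   check=True)

def has_zero_run(arff_path, run_len=10):
    # Flag an ARFF file whose (single) data row contains run_len or more consecutive 0.0 features.
    with open(arff_path) as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    data_start = next(i for i, ln in enumerate(lines) if ln.lower() == "@data")
    values = lines[data_start + 1].split(",")[1:]   # drop the leading instance-name field
    run = 0
    for v in values:
        run = run + 1 if v in ("0.0", "0") else 0
        if run >= run_len:
            return True
    return False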

2.3. WEKA

The Waikato Environment for Knowledge Analysis (WEKA) is a free software tool developed by researchers at the University of Waikato in New Zealand. WEKA offers a large collection of machine learning algorithms for data mining and classification tasks [26]. It has been used by researchers and practitioners to preprocess and visualize data and to evaluate the effectiveness of machine learning tools. The version of WEKA used in this research is 3.8.6. The classifiers that we focus on in this study are: Support Vector Machines (SVM), Decision Trees, and Naive Bayes.
For the SVM classifier, WEKA implements the Sequential Minimal Optimization (SMO) algorithm described in [27]. In our results, we considered various values for the complexity constant (C). All other WEKA SMO parameters were set at their default values, including standard normalization and a polynomial kernel with exponent 1.0 (WEKA command for C = 0.1: weka.classifiers.functions.SMO -C 0.1 -L 0.001 -P 1.0E-12 -N 1 -V -1 -W 1 -K “weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007” -calibrator “weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4”). We highlight that no internal cross-validation was used. This is important because a speaker may have multiple audio segments producing multiple (feature vector, label) entries in the training set, and any internal cross-validation would bias the classifier.
For the Naive Bayes classifier, we used the “Flexible Naive Bayes” classifier proposed by [28], which is implemented in WEKA. This variant uses kernels to estimate the distributions for each feature (WEKA command: weka.classifiers.bayes.NaiveBayes -K.).
For the Decision Tree classifier, we used WEKA's J.48 classifier, which is an implementation of the C4.5 Decision Tree algorithm proposed in [29]. In our results, we considered various confidence thresholds for pruning (C) and various minimum numbers of instances per leaf (M). All other WEKA J.48 parameters were set at their default values (WEKA command for C = 0.25 and M = 15: weka.classifiers.trees.J48 -C 0.25 -M 15).
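For completeness, the sketch below shows one way the WEKA classifiers could be driven from a script using the command strings quoted above. It assumes weka.jar is available on the classpath, that train/test ARFF files have already been prepared, and that WEKA's standard -t (training file) and -T (test file) options are used; it is a sketch, not the exact pipeline used in this paper.

import subprocess

# Option string for SMO with C = 1.0, following the command quoted above (calibrator left at default).
SMO_OPTS = ('-C 1.0 -L 0.001 -P 1.0E-12 -N 1 -V -1 -W 1 '
            '-K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007"')

def run_weka(classifier, options, train_arff, test_arff, weka_jar="weka.jar"):
    # Train on train_arff, evaluate on test_arff, and return WEKA's text report.
    cmd = f'java -cp {weka_jar} {classifier} {options} -t {train_arff} -T {test_arff}'
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

# Example: report = run_weka("weka.classifiers.trees.J48", "-C 0.25 -M 15",
#                            "fold_train.arff", "fold_test.arff")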

2.4. Resampling Methods

In this paper, we use two resampling methods to evaluate eGeMAPS effectiveness in classifying depression: permutation tests and bootstrapping.

2.4.1. Permutation Tests

Let $\{x_b\}_{b=1}^{B}$ be realizations of $B$ i.i.d. random variables $\{X_b\}_{b=1}^{B}$; let $\{y_b\}_{b=1}^{B}$ be realizations of $B$ i.i.d. random variables $\{Y_b\}_{b=1}^{B}$; and assume one observes $\frac{1}{B}\sum_{b=1}^{B} y_b > \frac{1}{B}\sum_{b=1}^{B} x_b$, which could suggest that $\{Y_b\}_{b=1}^{B}$ has a different distribution than $\{X_b\}_{b=1}^{B}$.
The permutation test can be used to formally test the hypothesis $H_0$: $\{X_b\}_{b=1}^{B}$ and $\{Y_b\}_{b=1}^{B}$ form a set of $2B$ i.i.d. random variables. For this, let $z_0 := [x_1, \ldots, x_B, y_1, \ldots, y_B]$ and compute a statistic from the last $B$ values of $z_0$, which cover only the realizations of $\{Y_b\}_{b=1}^{B}$; for instance, let $S_0 := \frac{1}{B}\sum_{b=B+1}^{2B} z_{0,b}$, where $z_{0,b}$ denotes the $b$-th entry of $z_0$. Subsequently, compute the same statistic from $N_{perm}$ random permutations of $z_0$: let $z_n$ be the $n$-th permutation, and let $S_n := \frac{1}{B}\sum_{b=B+1}^{2B} z_{n,b}$.
The p-value for testing $H_0$ is then estimated with
$$\hat{p} := \frac{1 + \sum_{n=1}^{N_{perm}} \mathbf{1}\{S_n > S_0\}}{1 + N_{perm}},$$ (1)
where $\mathbf{1}\{\mathrm{statement}\} = 1$ when the statement is true and $\mathbf{1}\{\mathrm{statement}\} = 0$ otherwise, and a small $\hat{p}$ provides evidence to reject $H_0$.
If $x_b$ and $y_b$ represent statistics obtained from the same subjects undergoing different experiments, then one can increase the power of the permutation test by permuting in pairs: let $S_0 := \frac{1}{B}\sum_{b=1}^{B} y_b$ as before and create $N_{perm}$ random permutations by randomly switching between $x_b$ and $y_b$ for each $b$. More precisely, for the $n$-th random permutation, let $q^{(n)} := [q_1^{(n)}, \ldots, q_B^{(n)}]$, where $q_b^{(n)}$ equals $x_b$ or $y_b$ with probability 0.5; let $S_n := \frac{1}{B}\sum_{b=1}^{B} q_b^{(n)}$; and estimate the p-value using (1).
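Both tests can be written compactly; the sketch below, in Python, assumes x and y are NumPy arrays holding the observed realizations $\{x_b\}$ and $\{y_b\}$.

import numpy as np

rng = np.random.default_rng(0)

def permutation_test(x, y, n_perm=10_000):
    # Unpaired test of H0: the pooled x and y values are i.i.d. (Equation (1)).
    z0 = np.concatenate([x, y])
    s0 = y.mean()
    count = 0
    for _ in range(n_perm):
        zn = rng.permutation(z0)
        count += zn[len(x):].mean() > s0      # statistic over the last B entries
    return (1 + count) / (1 + n_perm)

def paired_permutation_test(x, y, n_perm=10_000):
    # Paired variant: for each b, x_b and y_b are swapped with probability 0.5.
    s0 = y.mean()
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(x)) < 0.5
        count += np.where(swap, x, y).mean() > s0
    return (1 + count) / (1 + n_perm)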

2.4.2. Permutation Tests for Evaluating Machine Learning Classifiers

In this paper, we use permutation tests to evaluate how well eGeMAPS features can detect depression using the methodology described by Ojala and Garriga in [2]. Ojala and Garriga propose to evaluate whether a machine learning classifier is able to find structure in the data by permuting the labels in the data.
More precisely, letting $\mathcal{B} := \{(X_k, y_k)\}_{k=1}^{K}$ be a set of training data, where $y_k$ is the label of each feature vector $X_k$, Ojala and Garriga evaluate a classifier $f$ using the leave-one-out cross-validation error $P_e(f, \mathcal{B}) = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\{ f_{\mathcal{B} \setminus \{(X_k, y_k)\}}(X_k) \neq y_k \}$, where $f_{\mathcal{B} \setminus \{(X_k, y_k)\}}$ represents the classifier trained with the training set without $(X_k, y_k)$.
Ojala and Garriga then propose to test the hypothesis that the features $\{X_k\}_{k=1}^{K}$ are independent of the labels $\{y_k\}_{k=1}^{K}$ using the permutation test: let $\{\tilde{\mathcal{B}}_n\}_{n=1}^{N_{perm}}$ represent $N_{perm}$ training sets, each obtained by randomly permuting the labels $\{y_k\}_{k=1}^{K}$, and compute the p-value
$$\hat{p} := \frac{1 + \sum_{n=1}^{N_{perm}} \mathbf{1}\{ P_e(f, \tilde{\mathcal{B}}_n) \le P_e(f, \mathcal{B}) \}}{1 + N_{perm}},$$ (2)
where low values of $\hat{p}$ indicate evidence that the features $\{X_k\}_{k=1}^{K}$ and labels $\{y_k\}_{k=1}^{K}$ are dependent variables and that the classifier is able to exploit such dependence.
We apply Ojala and Garriga's method to evaluate whether eGeMAPS features are able to detect depression in speakers, as described in detail in the next section.

2.4.3. Bootstrap Confidence Intervals

Let $\{z_b\}_{b=1}^{B}$ be realizations of $B$ i.i.d. random variables $\{Z_b\}_{b=1}^{B}$ for which one desires to estimate $E[Z_b]$.
The bootstrap method allows one to estimate a 95% confidence interval for $E[Z_b]$ as follows [30] (p. 8): let $Z_0 := [z_1, \ldots, z_B]$ be the array containing the $B$ realizations obtained in the experiment, and let $Z_n := [z_{n,1}, \ldots, z_{n,B}]$ be a $B$-dimensional array obtained by resampling from $Z_0$ with replacement; more precisely, the value of $z_{n,b}$ for each $b$ is chosen from among the $B$ values in $Z_0$ following an equiprobable distribution. Compute the average of $Z_n$; i.e., let $\mu_n(Z) := \frac{1}{B}\sum_{b=1}^{B} z_{n,b}$. After repeating the resampling process $N_{perm}$ times, one builds a 95% confidence interval for $E[Z_b]$ by first sorting the values $\{\mu_n(Z)\}_{n=1}^{N_{perm}}$ and then taking the values at the 2.5% and 97.5% percentiles as the interval endpoints. The confidence interval is not exact; however, the method is known to provide good estimates when $B > 100$ [30] (p. 19).
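A minimal sketch of this percentile bootstrap interval, assuming z is a NumPy array of the B observed realizations:

import numpy as np

def bootstrap_ci(z, n_resamples=10_000, alpha=0.05, seed=0):
    # 95% percentile bootstrap confidence interval for E[Z_b] (alpha = 0.05).
    rng = np.random.default_rng(seed)
    z = np.asarray(z)
    means = np.array([rng.choice(z, size=z.size, replace=True).mean()
                      for _ in range(n_resamples)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])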

2.4.4. Hypothesis Testing Using Bootstrap

The bootstrap method can also be used to test the hypothesis that the expected values of two populations are equal [31] (p. 104): let $\{z_b\}_{b=1}^{B}$ be realizations as above; let $\{w_b\}_{b=1}^{B}$ be realizations of $B$ i.i.d. random variables $\{W_b\}_{b=1}^{B}$; let $W_0 := [w_1, \ldots, w_B]$; build $\{W_n\}_{n=1}^{N_{perm}}$; and compute their averages $\{\mu_n(W)\}_{n=1}^{N_{perm}}$ as above. To test the hypothesis $H_0: E[Z_b] = E[W_b]$, one builds a 95% bootstrap confidence interval for the parameter $E[Z_b] - E[W_b]$ using the values from $\{\mu_n(Z) - \mu_n(W)\}_{n=1}^{N_{perm}}$. If the confidence interval does not contain 0, then $H_0$ is rejected with 95% confidence.
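The corresponding two-sample test is a small extension of the previous sketch; z and w are placeholder arrays holding the two sets of realizations.

import numpy as np

def bootstrap_difference_test(z, w, n_resamples=10_000, seed=0):
    # 95% bootstrap CI for E[Z_b] - E[W_b]; H0 is rejected when 0 lies outside the interval.
    rng = np.random.default_rng(seed)
    z, w = np.asarray(z), np.asarray(w)
    diffs = np.array([rng.choice(z, z.size, replace=True).mean()
                      - rng.choice(w, w.size, replace=True).mean()
                      for _ in range(n_resamples)])
    low, high = np.quantile(diffs, [0.025, 0.975])
    return low, high, not (low <= 0.0 <= high)   # third value True -> reject H0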

2.5. Testing Independence of Features and Depression Labels

In this section, we use the permutation test to test the hypothesis $H_0^{(I)}$ specified in Section 1; i.e., to test whether the eGeMAPS features of speakers are independent of their depression status. We test $H_0^{(I)}$ for male and female speakers separately.
To test $H_0^{(I)}$ for speakers of the selected gender, we used the following procedure:

2.5.1. Step 1: Preprocessing

We extracted audio segments from all speakers of the selected gender in the E-DAIC database, as described in the Preparation of Audio Files from Speakers subsection.
For a given utterance length $T_{UL}$, let $N_{seg}(s)$ be the number of segments of duration $T_{UL}$ for a speaker $s$.
For a given minimum number of segments $N_{seg,\min}$, we discarded any speaker with $N_{seg}(s) < N_{seg,\min}$. As described in Section 3.1, we experimented with different values of $T_{UL}$.
We used OpenSMILE to extract the eGeMAPS audio features for each of the $N_{seg}(s)$ segments of each remaining speaker $s$, as described in Section 2.2. For most of the results in this paper, all 88 eGeMAPS features were extracted; however, in Section 3.3, we consider different subsets of features.

2.5.2. Step 2: Build Random Balanced Training Sets

Using the eGeMAPS features, we built $B$ balanced training sets $\{\mathcal{B}^{(b)}\}_{b=1}^{B}$. By a balanced set, we mean a set with an equal number of segments from depressed and non-depressed speakers. More precisely, let $\mathcal{D}_{\mathrm{all}}$ and $\mathcal{N}_{\mathrm{all}}$ respectively represent the sets of depressed and non-depressed speakers of the selected gender, and let $|S|$ represent the number of elements in a set $S$. We note that $\mathcal{D}_{\mathrm{all}}$ and $\mathcal{N}_{\mathrm{all}}$ contain all speakers of E-DAIC; i.e., all speakers of E-DAIC's training, development, and testing sets of the selected gender were included in $\mathcal{D}_{\mathrm{all}}$ and $\mathcal{N}_{\mathrm{all}}$.
Since $|\mathcal{D}_{\mathrm{all}}| < |\mathcal{N}_{\mathrm{all}}|$ in the E-DAIC dataset, we build each balanced set $\mathcal{B}^{(b)}$ of training data as follows:
  • Include all $|\mathcal{D}_{\mathrm{all}}|$ depressed speakers in $\mathcal{B}^{(b)}$.
  • Randomly choose $|\mathcal{D}_{\mathrm{all}}|$ speakers from the non-depressed set $\mathcal{N}_{\mathrm{all}}$ and include them in $\mathcal{B}^{(b)}$. Each non-depressed speaker is equally likely to be chosen, but any non-depressed speaker is chosen at most once.
  • Let $\mathcal{D}$ and $\mathcal{N}$ be the subsets of speakers from $\mathcal{D}_{\mathrm{all}}$ and $\mathcal{N}_{\mathrm{all}}$ included in $\mathcal{B}^{(b)}$. For each speaker $s \in \mathcal{D} \cup \mathcal{N}$, randomly choose $N_{seg,\min}$ segments out of the $N_{seg}(s)$ segments from that speaker and include their corresponding feature vectors and truth labels in $\mathcal{B}^{(b)}$. Each segment of speaker $s$ is equally likely to be chosen, but any segment is chosen at most once.
To facilitate the explanation of our procedure, we enumerate each chosen segment and each speaker in $\mathcal{D}$ and $\mathcal{N}$, and let $\{\{X^{(b)}_{(i,j,\mathcal{D})}, y^{(b)}_{(i,j,\mathcal{D})}\}_{j=1}^{N_{seg,\min}}\}_{i=1}^{|\mathcal{D}|}$ and $\{\{X^{(b)}_{(i,j,\mathcal{N})}, y^{(b)}_{(i,j,\mathcal{N})}\}_{j=1}^{N_{seg,\min}}\}_{i=1}^{|\mathcal{D}|}$ represent the feature vectors ($X$) and the truth labels ($y$) of the $j$-th chosen segment of the $i$-th speaker from each subset ($\mathcal{D}$ or $\mathcal{N}$) in the training set $\mathcal{B}^{(b)}$. To avoid creating new notation for the depressed and non-depressed classes, we shall abuse the notation and use $\mathcal{D}$ and $\mathcal{N}$ also to represent the classes of each feature vector; i.e., $y^{(b)}_{(i,j,\mathcal{D})} = \mathcal{D}$ and $y^{(b)}_{(i,j,\mathcal{N})} = \mathcal{N}$.
We also define
$$\mathcal{B}^{(b)}_{i,:,\mathcal{D}} := \{X^{(b)}_{(i,j,\mathcal{D})}, y^{(b)}_{(i,j,\mathcal{D})}\}_{j=1}^{N_{seg,\min}}$$ (3)
$$\mathcal{B}^{(b)}_{i,:,\mathcal{N}} := \{X^{(b)}_{(i,j,\mathcal{N})}, y^{(b)}_{(i,j,\mathcal{N})}\}_{j=1}^{N_{seg,\min}}$$ (4)
to represent the sets of all feature vectors and their truth labels from the $i$-th speaker of each class; the balanced set $\mathcal{B}^{(b)}$ is then given by
$$\mathcal{B}^{(b)} := \bigcup_{i=1}^{|\mathcal{D}|} \mathcal{B}^{(b)}_{i,:,\mathcal{D}} \cup \mathcal{B}^{(b)}_{i,:,\mathcal{N}}$$ (5)
At the end of this process, the set $\mathcal{B}^{(b)}$ contains $2 \cdot |\mathcal{D}| \cdot N_{seg,\min}$ feature vectors and their corresponding truth labels.
We note that all $\{\mathcal{B}^{(b)}\}_{b=1}^{B}$ contain all speakers from $\mathcal{D}$; however, each $\mathcal{B}^{(b)}$ contains a random selection of $N_{seg,\min}$ segments from each depressed speaker.
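A minimal sketch of the construction of one balanced set $\mathcal{B}^{(b)}$, assuming the per-speaker eGeMAPS vectors are held in a dictionary keyed by E-DAIC speaker ID (the variable names are placeholders, not the actual scripts used in this paper):

import random

def build_balanced_set(depressed, nondepressed, features, n_seg_min=30, seed=0):
    # depressed / nondepressed: lists of speaker IDs of the selected gender;
    # features[s]: list of eGeMAPS vectors for speaker s (assumed len >= n_seg_min).
    rng = random.Random(seed)
    chosen_n = rng.sample(nondepressed, len(depressed))   # match |D_all| speakers
    balanced = []
    for label, speakers in (("D", depressed), ("N", chosen_n)):
        for s in speakers:
            for x in rng.sample(features[s], n_seg_min):   # n_seg_min segments per speaker
                balanced.append((s, label, x))
    return balanced   # 2 * |D_all| * n_seg_min (speaker, label, feature) entries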

2.5.3. Step 3: Training and Testing a Classifier Using $\mathcal{B}^{(b)}$

Let $f: \mathbb{R}^{88} \to \{\mathcal{N}, \mathcal{D}\}$ be a classifier (SVM, Decision Tree, or Naive Bayes) with a chosen set of hyperparameters that decides the depression status of an input sample from its 88 eGeMAPS features. For each training set $\mathcal{B}^{(b)}$, we used WEKA to train $f$ and estimate its error performance using samples outside the set used to learn $f$.
Similar to [2], we evaluated the performance of $f$ using the cross-validation error; however, we did not use leave-one-out cross-validation; instead, we used a $|\mathcal{D}|$-fold cross-validation: in each fold, we removed from $\mathcal{B}^{(b)}$ the $2 N_{seg,\min}$ feature vectors corresponding to one pair of depressed/non-depressed speakers and used those feature vectors to test the function learned with the remaining feature vectors in $\mathcal{B}^{(b)}$. More precisely, letting $f_{\mathcal{B}^{(b)} \setminus (\mathcal{B}^{(b)}_{i,:,\mathcal{D}} \cup \mathcal{B}^{(b)}_{i,:,\mathcal{N}})}$ represent the classifier learned from the feature vectors and labels of $\mathcal{B}^{(b)}$ excluding all the feature vectors and labels from the $i$-th depressed speaker and the $i$-th non-depressed speaker, we define
$$P_{e,i}(f, \mathcal{B}^{(b)}) := \frac{1}{2 N_{seg,\min}} \sum_{j=1}^{N_{seg,\min}} \left[ \mathbf{1}\left\{ f_{\mathcal{B}^{(b)} \setminus (\mathcal{B}^{(b)}_{i,:,\mathcal{D}} \cup \mathcal{B}^{(b)}_{i,:,\mathcal{N}})}\left(X^{(b)}_{(i,j,\mathcal{D})}\right) \neq \mathcal{D} \right\} + \mathbf{1}\left\{ f_{\mathcal{B}^{(b)} \setminus (\mathcal{B}^{(b)}_{i,:,\mathcal{D}} \cup \mathcal{B}^{(b)}_{i,:,\mathcal{N}})}\left(X^{(b)}_{(i,j,\mathcal{N})}\right) \neq \mathcal{N} \right\} \right],$$ (6)
and we evaluate the error performance of the classification algorithm $f$ for the training set $\mathcal{B}^{(b)}$ with
$$P_e(f, \mathcal{B}^{(b)}) := \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} P_{e,i}(f, \mathcal{B}^{(b)}).$$ (7)
We highlight that the reason for using feature vectors from multiple segments when building $\mathcal{B}^{(b)}$ is to enable better training of the classifier. The reason for removing all feature vectors from the speakers used in a cross-validation fold is to avoid biasing the classifier with any feature vector from those speakers. Lastly, the reason for including both a depressed and a non-depressed speaker in each fold is to ensure that both training and evaluation are done with balanced sets, avoiding biasing the classifier towards either class. These steps are necessary because of the relatively limited number of depressed speakers in the E-DAIC dataset.
From $\{P_e(f, \mathcal{B}^{(b)})\}_{b=1}^{B}$, we compute
$$\overline{P_e} := \frac{1}{B} \sum_{b=1}^{B} P_e(f, \mathcal{B}^{(b)}).$$ (8)
Step 3 is repeated for various sets of hyperparameters, and the hyperparameter set providing the smallest $\overline{P_e}$ is selected. This $\overline{P_e}$ is then used in Step 5 to conclude the test of $H_0^{(I)}$.
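The speaker-wise $|\mathcal{D}|$-fold cross-validation can be sketched as follows, using any classifier exposing scikit-learn-style fit/predict methods as a stand-in for the WEKA classifiers actually used; balanced is the list of (speaker_id, label, feature_vector) triplets produced in Step 2.

import numpy as np

def speaker_fold_error(balanced, clf):
    # balanced: list of (speaker_id, label, feature_vector) triplets from one B^(b).
    dep = sorted({s for s, lab, _ in balanced if lab == "D"})
    non = sorted({s for s, lab, _ in balanced if lab == "N"})
    fold_errors = []
    for d_spk, n_spk in zip(dep, non):        # one depressed/non-depressed pair per fold
        held = {d_spk, n_spk}
        X_tr = [x for s, _, x in balanced if s not in held]
        y_tr = [lab for s, lab, _ in balanced if s not in held]
        X_te = [x for s, _, x in balanced if s in held]
        y_te = [lab for s, lab, _ in balanced if s in held]
        clf.fit(np.array(X_tr), np.array(y_tr))
        # Fraction of held-out segments misclassified: P_{e,i} of Equation (6).
        fold_errors.append(np.mean(clf.predict(np.array(X_te)) != np.array(y_te)))
    return float(np.mean(fold_errors))        # P_e(f, B^(b)) of Equation (7)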

2.5.4. Step 4: Permutation of Labels

In this step, we randomly permute the labels in each $\mathcal{B}^{(b)}$. The rationale for this step is the following: if $H_0^{(I)}$ were true, i.e., if eGeMAPS features were indeed independent of the depression status of speakers, then permuting the labels in $\mathcal{B}^{(b)}$ would not change the distribution of features and, therefore, would not change the distribution of $P_e(f, \mathcal{B}^{(b)})$.
To precisely define this step, let $\mathcal{A} := \{1, \ldots, |\mathcal{D}|\} \times \{1, \ldots, N_{seg,\min}\} \times \{\mathcal{D}, \mathcal{N}\}$ represent the set of triplets $(i, j, k)$ indexing the speaker, the segment number, and the class of each segment in the training set. Let $\pi$ be a random permutation of the elements of $\mathcal{A}$; i.e., $\pi: \mathcal{A} \to \mathcal{A}$ is any bijective function from $\mathcal{A}$ to itself; and let $\pi(i, j, k)_l$ represent the $l$-th coordinate of $\pi(i, j, k)$. For instance, if $\pi(3, 9, \mathcal{D}) = (7, 1, \mathcal{N})$, then $\pi(3, 9, \mathcal{D})_1 = 7$, $\pi(3, 9, \mathcal{D})_2 = 1$, and $\pi(3, 9, \mathcal{D})_3 = \mathcal{N}$.
Using a random $\pi$, we permute the labels $y^{(b)}_{(i,j,\mathcal{D})}$ and $y^{(b)}_{(i,j,\mathcal{N})}$ among all speakers, segments, and classes. More precisely, we let $\tilde{y}^{(b)}_{(i,j,k)} = y^{(b)}_{\pi(i,j,k)}$, which means that the label of the $j$-th segment of the $i$-th speaker in the depressed class ($y^{(b)}_{(i,j,\mathcal{D})}$) is replaced by the label of segment $\pi(i,j,\mathcal{D})_2$ of speaker $\pi(i,j,\mathcal{D})_1$ of class $\pi(i,j,\mathcal{D})_3$, which may be different from $\mathcal{D}$. This means that the permutation $\pi$ may cause $\tilde{y}^{(b)}_{(i,j,\mathcal{D})} = \mathcal{N}$ or $\tilde{y}^{(b)}_{(i,j,\mathcal{N})} = \mathcal{D}$.
Define
$$\tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{D}} := \{X^{(b)}_{(i,j,\mathcal{D})}, \tilde{y}^{(b)}_{(i,j,\mathcal{D})}\}_{j=1}^{N_{seg,\min}}$$
$$\tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{N}} := \{X^{(b)}_{(i,j,\mathcal{N})}, \tilde{y}^{(b)}_{(i,j,\mathcal{N})}\}_{j=1}^{N_{seg,\min}},$$
which represent the sets of all feature vectors from the $i$-th speaker with labels permuted as described above, and define the permuted training set $\tilde{\mathcal{B}}^{(b)}$ as
$$\tilde{\mathcal{B}}^{(b)} := \bigcup_{i=1}^{|\mathcal{D}|} \tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{D}} \cup \tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{N}}$$
Using the permuted training set $\tilde{\mathcal{B}}^{(b)}$, we compute the error performance of the classification algorithm $f$ using $P_e(f, \tilde{\mathcal{B}}^{(b)})$ and $P_{e,i}(f, \tilde{\mathcal{B}}^{(b)})$ of (7) and (6). We highlight that $\mathcal{B}^{(b)}$, $\mathcal{B}^{(b)}_{i,:,\mathcal{D}}$, and $\mathcal{B}^{(b)}_{i,:,\mathcal{N}}$ are replaced with $\tilde{\mathcal{B}}^{(b)}$, $\tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{D}}$, and $\tilde{\mathcal{B}}^{(b)}_{i,:,\mathcal{N}}$ in (7) and (6) for this computation.
The random permutation of labels and the computation of $P_e(f, \tilde{\mathcal{B}}^{(b)})$ are performed $B$ times, once for each $\mathcal{B}^{(b)}$.
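In code, this step is simply a shuffle of the labels over all entries of one balanced set; a minimal sketch, reusing the triplet representation of the earlier sketches:

import random

def permute_labels(balanced, seed=0):
    # Return a copy of the balanced set with labels randomly reassigned across
    # all speakers, segments, and classes.
    rng = random.Random(seed)
    labels = [lab for _, lab, _ in balanced]
    rng.shuffle(labels)
    return [(s, lab, x) for (s, _, x), lab in zip(balanced, labels)]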

2.5.5. Step 5: Permutation Test of $H_0^{(I)}$

If $H_0^{(I)}$ were true, i.e., if eGeMAPS features were indeed independent of the depression status of speakers, then for each $b$, $X^{(b)}_{(i,j,\mathcal{D})}$ and $X^{(b)}_{(i,j,\mathcal{N})}$ would be i.i.d. vectors, permuting the labels would not impact their distribution, and the estimates $P_e(f, \mathcal{B}^{(b)})$ and $P_e(f, \tilde{\mathcal{B}}^{(b)})$ would also be i.i.d.
This allows us to test $H_0^{(I)}$ using a paired permutation test as follows: from Steps 3 and 4, we obtain $B$ pairs of probability-of-error estimates, $\{(P_e(f, \mathcal{B}^{(b)}), P_e(f, \tilde{\mathcal{B}}^{(b)}))\}_{b=1}^{B}$. Using these pairs, we generate $N_{perm}$ $B$-dimensional vectors $\{q^{(n)}\}_{n=1}^{N_{perm}}$ by randomly choosing either $P_e(f, \mathcal{B}^{(b)})$ or $P_e(f, \tilde{\mathcal{B}}^{(b)})$ for each entry. More precisely, for each $n$, each entry of $q^{(n)} := [q_1^{(n)}, \ldots, q_B^{(n)}]$ is defined from $B$ i.i.d. realizations $w_b$ of a Bernoulli(0.5) random variable: if $w_b = 0$, assign $P_e(f, \mathcal{B}^{(b)})$ to $q_b^{(n)}$; otherwise, assign $P_e(f, \tilde{\mathcal{B}}^{(b)})$ to $q_b^{(n)}$.
Let
$$\bar{q}^{(n)} := \frac{1}{B} \sum_{b=1}^{B} q_b^{(n)};$$
i.e., $\bar{q}^{(n)}$ contains the average of the randomly chosen probabilities of error in the $n$-th permutation. Note that, if $H_0^{(I)}$ were true, then the expected value of $\bar{q}^{(n)}$ would be the same as the expected value of $\overline{P_e}$ computed in (8) without any permutation of labels.
Lastly, we obtain an estimate of the p-value of $H_0^{(I)}$ with
$$\hat{p}^{(I)} = \frac{1 + \sum_{n=1}^{N_{perm}} \mathbf{1}\{\bar{q}^{(n)} \le \overline{P_e}\}}{1 + N_{perm}},$$
and low values of $\hat{p}^{(I)}$ provide evidence to reject $H_0^{(I)}$. For instance, if the p-value of $H_0^{(I)}$ is less than 0.05, then we are justified in rejecting $H_0^{(I)}$ and concluding that, with high confidence, eGeMAPS features are not independent of the depression status of speakers of the selected gender.
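A sketch of this paired permutation test, assuming pe_true and pe_perm are arrays holding $P_e(f, \mathcal{B}^{(b)})$ and $P_e(f, \tilde{\mathcal{B}}^{(b)})$ for the $B$ balanced sets:

import numpy as np

def paired_permutation_pvalue(pe_true, pe_perm, n_perm=10_000, seed=0):
    # Paired permutation test of H0^(I); small return values favor rejection.
    rng = np.random.default_rng(seed)
    pe_true, pe_perm = np.asarray(pe_true), np.asarray(pe_perm)
    pe_bar = pe_true.mean()
    count = 0
    for _ in range(n_perm):
        pick = rng.random(pe_true.size) < 0.5          # Bernoulli(0.5) choice per pair
        count += np.where(pick, pe_perm, pe_true).mean() <= pe_bar
    return (1 + count) / (1 + n_perm)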

3. Results and Discussion

3.1. Testing Independence of Features and Depression Labels

We performed Steps 1 through 5 of Section 2.5 to test $H_0^{(I)}$ for female and male speakers of the E-DAIC set for various values of the utterance length ($T_{UL}$).
For Step 1, we considered $N_{seg,\min} = 30$; i.e., speakers needed at least 30 segments of duration $T_{UL}$ to be included in the test. This resulted in a different number of speakers in $\mathcal{D}$ for each $T_{UL}$: for $T_{UL} = 8$ s, $|\mathcal{D}_{female}| = 18$ and $|\mathcal{D}_{male}| = 15$; for $T_{UL} = 6$ s, $|\mathcal{D}_{female}| = 21$ and $|\mathcal{D}_{male}| = 19$; for $T_{UL} = 4$ s, $|\mathcal{D}_{female}| = 26$ and $|\mathcal{D}_{male}| = 24$; and for $T_{UL} = 2$ s, $|\mathcal{D}_{female}| = 30$ and $|\mathcal{D}_{male}| = 33$.
For Step 2, we considered a total of $B = 300$ random balanced sets.
For Step 3, we considered three classifier algorithms: SVM, Naive Bayes, and Decision Trees (J.48). To select the set of hyperparameters for each classifier, we used 100 of the 300 random balanced sets to compute $\overline{P_e}$ for various hyperparameter sets. For the J.48 algorithm, we considered $C \in \{0.1, 0.25, 0.5\}$ and $M \in \{2, 5, 10, 15\}$ for the pruning confidence and the minimum number of instances per leaf. The hyperparameters that provided the smallest $\overline{P_e}$ were $C = 0.1$ and $M = 15$ for female speakers, and $C = 0.1$ and $M = 5$ for male speakers. It should be mentioned that the maximum variation in $\overline{P_e}$ among all hyperparameters considered was less than 0.05. For SVM, we observed a wider variation. We considered $C \in \{10^{-5}, 10^{-4}, 10^{-3}, 0.01, 0.1, 1.0, 10.0\}$ for the complexity constant. For both male and female speakers, the higher values ($C = 1.0$ and $C = 10.0$) provided the smallest $\overline{P_e}$, with $C = 10.0$ providing the smallest overall. Since $C = 10.0$ required excessive running time and $C = 1.0$ resulted in a $\overline{P_e}$ less than 0.001 higher, well within the standard deviation, we used $C = 1.0$ for the permutation tests. (We note that the Naive Bayes implementation of WEKA does not offer hyperparameter selection.)
Table 1 shows the average ($\overline{P_e}$) and the standard deviation of $\{P_e(f, \mathcal{B}^{(b)})\}_{b=1}^{B}$ for the final hyperparameter set of each classifier algorithm and various $T_{UL}$. We note that the results for the SVM classifier are in agreement with previous results in the literature, which reported $\overline{P_e} \in [0.35, 0.45]$ [14]. We also note that, if we were to rely solely on $\overline{P_e}$ among the random training sets, we would not be able to reject $H_0^{(I)}$, since a typical 95% confidence interval obtained by multiplying the standard deviation by 1.96 would include 0.50, which is the error probability of a random-guess classifier. Even in the case of male speakers, as indicated in Figure 1, the 95% confidence interval for the average $P_e(f, \mathcal{B})$ would also include 0.50, showing weak evidence for rejecting $H_0^{(I)}$. This shows the nontriviality of the proposed permutation test.
To properly test $H_0^{(I)}$, we performed Steps 4 and 5 of Section 2.5; the permutation test results are shown in Table 2. For Step 5, we considered $N_{perm} = 10{,}000$ and, to obtain confidence intervals for the p-value estimates $\hat{p}^{(I)}$, we partitioned the 300 balanced sets into 10 sets of 30 different random balanced sets. The values shown in Table 2 correspond to the average and standard deviation of the 10 p-value estimates obtained.
Since Table 2 shows average p-values below 0.05 for at least one classifier when $T_{UL} \ge 4$ s, we can reject $H_0^{(I)}$ for both female and male speakers and conclude that there is strong evidence that eGeMAPS features are dependent on the depression status of E-DAIC speakers. We note that the high p-values obtained for the Naive Bayes classifier do not contradict this conclusion: such high p-values only indicate that this specific classifier is not able to exploit the dependence between eGeMAPS features and the depression status of E-DAIC speakers.
The histograms of $P_e(f, \mathcal{B}^{(b)})$ and $P_e(f, \tilde{\mathcal{B}}^{(b)})$ shown in Figure 1 support this conclusion. These histograms illustrate the distribution of $P_e(f, \mathcal{B}^{(b)})$ and $P_e(f, \tilde{\mathcal{B}}^{(b)})$ over all 300 training sets built with $T_{UL} = 8$ s for each classifier. As expected, when the depressed and non-depressed labels are randomly shuffled, $P_e(f, \tilde{\mathcal{B}}^{(b)})$ is concentrated around 0.50. When considering the actual labels, both the SVM and Tree classifiers were able to change the distribution of $P_e(f, \mathcal{B}^{(b)})$ and reduce its mean below 0.50.
To further support this conclusion, we computed 95% confidence intervals for $E[\overline{P_e}]$ using the bootstrap resampling method described in Section 2.4.3. These intervals are shown in Table 3. By comparing them with the $\hat{p}^{(I)}$ estimates of Table 2, we can see that bootstrap intervals farther from 0.50 correspond to lower values of $\hat{p}^{(I)}$. It is also possible to note that the intervals move farther away from 0.50 as $T_{UL}$ increases, confirming that longer values of $T_{UL}$ provide better detection performance.

3.2. eGeMAPS Dependence with Speaker Gender

The results of Table 1, Table 2 and Table 3 suggest that eGeMAPS is more effective in detecting depression in male than in female speakers of the E-DAIC set. In this section, we formally test the hypothesis
$H_0^{(G)}$: eGeMAPS discrimination power does not depend on the speaker's gender.
To test $H_0^{(G)}$, let $\{\mathcal{B}_F^{(b)}\}_{b=1}^{B}$ be a sequence of training sets randomly built from female speakers following Steps 1 and 2 of Section 3.1; and, for a given classifier algorithm $f$, let $P_e(f, \mathcal{B}_F^{(b)})$ be the error performance obtained by $f$ when trained on $\mathcal{B}_F^{(b)}$. Likewise, let $\{\mathcal{B}_M^{(b)}\}_{b=1}^{B}$ and $\{P_e(f, \mathcal{B}_M^{(b)})\}_{b=1}^{B}$ respectively be training sets randomly built from male speakers and their corresponding error performance obtained by the same classification algorithm $f$.
Let $Z_0 := [P_e(f, \mathcal{B}_M^{(1)}), \ldots, P_e(f, \mathcal{B}_M^{(B)})]$ and $W_0 := [P_e(f, \mathcal{B}_F^{(1)}), \ldots, P_e(f, \mathcal{B}_F^{(B)})]$.
As described in Section 2.4.3, build $N_{perm}$ resamples with replacement of $Z_0$ and $W_0$ and compute the averages of each resample, obtaining $\{\mu_n(Z)\}_{n=1}^{N_{perm}}$ and $\{\mu_n(W)\}_{n=1}^{N_{perm}}$. Use the 2.5% and 97.5% percentiles of $\{\mu_n(W) - \mu_n(Z)\}_{n=1}^{N_{perm}}$ to build a 95% confidence interval. If the confidence interval contains only positive values, then one has strong evidence to reject $H_0^{(G)}$ and conclude that eGeMAPS features are better able to detect depression in males than in females.
Using Steps 1 and 2 of Section 2.5, we tested $H_0^{(G)}$ for various utterance durations ($T_{UL}$) and various classifier algorithms. We evaluated the error performance of SVM, Naive Bayes, and Decision Trees with the same hyperparameters as in Section 3.1. All results were obtained considering $N_{seg,\min} = 30$ and $B = 300$; i.e., we considered 300 random sets of depressed/non-depressed speakers, each with a random selection of 30 segments from each speaker.
The 95% confidence intervals shown in Table 4 provide strong evidence to reject $H_0^{(G)}$ and conclude that eGeMAPS features are able to discriminate depression in male speakers better than in female speakers, particularly at high $T_{UL}$ values and when using the SVM or Tree (J.48) classifier.
We note that the SVM classifier with $T_{UL} = 6$ s showed a confidence interval with negative values, suggesting the opposite conclusion, i.e., better discrimination performance in female speakers, while the Tree (J.48) and Naive Bayes classifiers showed better discrimination performance in male speakers. This observation was confirmed even when considering a different random set of 6 s utterances for the 21 female and 19 male speakers. This result illustrates that the performance of algorithms can be highly dependent on the subset of speakers chosen for comparison. It also illustrates that discrimination performance can vary widely between classifiers.
Nevertheless, the positive confidence intervals for all other $T_{UL}$ values and classifiers suggest that better discrimination performance is expected in male speakers.

3.3. Detection Power of Subsets of eGeMAPS Features

In this section, we compare the depression detection performance of different subsets of eGeMAPS features by building bootstrap confidence intervals.
For this, we start by choosing two subsets of eGeMAPS features. For the first subset, let $\{\mathcal{B}_1^{(b)}\}_{b=1}^{B}$ be a sequence of training sets randomly built from speakers of a given gender following Steps 1 and 2 of Section 3.1 and considering only the features of the first subset; and, for a given classifier algorithm $f$, let $P_e(f, \mathcal{B}_1^{(b)})$ be the error performance obtained by $f$ when trained on $\mathcal{B}_1^{(b)}$. For the second subset, let $\{\mathcal{B}_2^{(b)}\}_{b=1}^{B}$ be a sequence of training sets obtained from the same set of speakers as the first subset, but considering only the features of the second subset; and let $\{P_e(f, \mathcal{B}_2^{(b)})\}_{b=1}^{B}$ be the corresponding error performance.
Let $\overline{P_{e,1}}$ and $\overline{P_{e,2}}$ respectively represent the average probability of error obtained with each feature subset. Since these averages are estimates of the true expected probability of error, we cannot conclude that the first subset provides better detection performance if $\overline{P_{e,1}} < \overline{P_{e,2}}$ unless the 95% confidence intervals are nonoverlapping. We can, however, use the bootstrap to build a 95% confidence interval for $\overline{P_{e,1}} - \overline{P_{e,2}}$. If such an interval lies in $\mathbb{R}^{+}$, then we can conclude that the second subset offers a lower probability of decision error.
To build the 95% bootstrap confidence interval, let $Z_0 := [P_e(f, \mathcal{B}_1^{(1)}), \ldots, P_e(f, \mathcal{B}_1^{(B)})]$ and $W_0 := [P_e(f, \mathcal{B}_2^{(1)}), \ldots, P_e(f, \mathcal{B}_2^{(B)})]$. As described in Section 2.4.3, build $N_{perm}$ resamples with replacement of $Z_0$ and $W_0$ and compute the averages of each resample, obtaining $\{\mu_n(Z)\}_{n=1}^{N_{perm}}$ and $\{\mu_n(W)\}_{n=1}^{N_{perm}}$. The 2.5% and 97.5% percentiles of $\{\mu_n(W) - \mu_n(Z)\}_{n=1}^{N_{perm}}$ are then used to build a 95% confidence interval.
Using the procedure above, we built 95% bootstrap confidence intervals for the difference in error performance ($\overline{P_e}$) between the following subsets of eGeMAPS features (see [25] for more details):
  • All eGeMAPS features (all): the subset with all 88 features defined in [25];
  • Temporal features (temp): (6 features) rate of loudness peaks, mean length and standard deviation of continuously voiced regions and of unvoiced regions, and number of continuously voiced regions per second (Features 82–87 produced by the configuration file eGeMAPSv02.conf available on the OpenSMILE website);
  • eGeMAPS features excluding temporal features ($\mathrm{temp}^{c}$);
  • Frequency features (freq): (24 features) mean and coefficient of variation of pitch, jitter, and the center frequency and bandwidth of the first, second, and third formants; the 20th, 50th, and 80th percentiles of pitch; the 20th–80th percentile range of pitch; and the mean and standard deviation of the rising/falling slopes of pitch (Features 1–10, 31, 32, 41–44, 47–50, and 53–56 produced by eGeMAPSv02.conf);
  • eGeMAPS features excluding frequency features ($\mathrm{freq}^{c}$);
  • Energy features (ener): (15 features) mean and coefficient of variation of shimmer, loudness, and the harmonic-to-noise ratio; equivalent sound level; the 20th, 50th, and 80th percentiles of loudness; the 20th–80th percentile range of loudness; and the mean and standard deviation of the rising/falling slopes of loudness (Features 11–20, 33–36, and 88 produced by eGeMAPSv02.conf);
  • eGeMAPS features excluding energy features ($\mathrm{ener}^{c}$);
  • Spectral features (spec): (43 features) mean and coefficient of variation of the alpha ratio (voiced and unvoiced segments), the Hammarberg index (voiced and unvoiced segments), the spectral slope from 0 to 500 Hz, and the spectral slope from 500 to 1500 Hz; the relative energy of the first, second, and third formants; the ratio of the energy of the spectral harmonic peak at the center frequency of the first, second, and third formants to the energy of the peak at the pitch frequency; the ratio between the energy of the first harmonic and the second, and between the first and the third; MFCCs 1 through 4; and the spectral flux (Features 21–30, 37–40, 45, 46, 51, 52, and 57–81 produced by eGeMAPSv02.conf);
  • eGeMAPS features excluding spectral features ($\mathrm{spec}^{c}$).
For each pair of feature subsets being compared, we performed Steps 1 and 2 of Section 2.5 for each feature subset and each gender, considering $T_{UL} = 8$ s and $N_{seg,\min} = 30$; i.e., speakers needed at least 30 segments of 8 s duration. From $B = 300$ random speaker sets, we obtained $\{P_e(f, \mathcal{B}_1^{(b)})\}_{b=1}^{B}$ and $\{P_e(f, \mathcal{B}_2^{(b)})\}_{b=1}^{B}$.
Figure 2 shows the 95% bootstrap confidence intervals for $\overline{P_{e,1}} - \overline{P_{e,2}}$ for various pairs of feature subsets, considering the SVM and Decision Tree classifiers. For each classifier and feature subset, we selected the best set of hyperparameters by using 100 random balanced sets and selecting the hyperparameter set providing the smallest $\overline{P_e}$, as described in Section 3.1.
Although both the SVM and Decision Tree classifiers provided similar results when testing the hypotheses $H_0^{(I)}$ and $H_0^{(G)}$, $\overline{P_{e,1}} - \overline{P_{e,2}}$ varied significantly between these two classifiers. For instance, while the 95% C.I. for $\overline{P_{e,\mathrm{ener}}} - \overline{P_{e,\mathrm{spec}}}$ was negative for SVM, it was positive for the Decision Tree. Such differences suggest that SVM and Decision Trees build significantly different decision regions.
The results for $\overline{P_{e,\mathrm{all}}} - \overline{P_{e,\mathrm{ener}^{c}}}$ and $\overline{P_{e,\mathrm{all}}} - \overline{P_{e,\mathrm{spec}^{c}}}$ reinforce this conclusion: while SVM better classifies depression in males when energy features are included, the Decision Tree better classifies depression in males when energy features are excluded; and while SVM better classifies depression in females when spectral features are included, the Decision Tree better classifies depression in females when spectral features are excluded.
Regarding $\overline{P_{e,\mathrm{all}}} - \overline{P_{e,\mathrm{freq}^{c}}}$, the 95% C.I. for females suggests that frequency features provide additional, albeit small, discrimination power to both the SVM and Decision Tree classifiers.
Regarding $\overline{P_{e,\mathrm{all}}} - \overline{P_{e,\mathrm{temp}^{c}}}$, its 95% C.I. centered at 0 for both males and females suggests that the temporal features do not provide additional discrimination power to either the SVM or the Decision Tree classifier.
Focusing on feature classes in isolation, it is interesting to observe that the Tree classifier was able to significantly exploit temporal-only features when detecting depression in females, but the conclusion was the opposite for males. SVM also showed significantly worse detection performance with temporal-only features in males. It is also interesting to observe that the Tree classifier was able to exploit spectral-only features in males, providing a lower $\overline{P_e}$ than all other feature types.

4. Conclusions

Using resampling methods, we showed in this paper that eGeMAPS features contain information that can assist in the detection of depression in speakers. Although several previous studies addressed this question, they often lack confidence intervals or statistical guarantees for their conclusions. Using a permutation test, we showed that eGeMAPS features and speakers' depression status are dependent on each other.
Using bootstrap methods, we were also able to show that eGeMAPS features are better able to discriminate depression in male speakers than in female speakers. Although previous studies have addressed this question, this is the first study to do so while providing statistical guarantees.
We further evaluated the performance of different subsets of eGeMAPS features in their power to discriminate depression. This is important because machine learning algorithms are known to be sensitive to feature configuration, and eGeMAPS was designed to be used not only for depression detection but also for emotion recognition and the analysis of other mental health conditions.
It is important to mention that, similar to other studies, our results are restricted to eGeMAPS features, to the E-DAIC dataset, and to the classifiers used. Different feature sets, different depression datasets, or different classifiers (e.g., deep neural networks) may offer different detection performance. However, by using standard open-source datasets, open-source audio features, and open-source machine learning tools, our results can be replicated and be used as a baseline for future studies.
As avenues for future research, we highlight the need for additional and larger annotated datasets that are made open to all researchers. Larger datasets would not only increase the confidence level of the analysis, but also facilitate the training of deep learning methods. It would also be interesting to consider how detection performance is sensitive to the population sampled in the datasets, the culture and language of speakers, and the task performed by speakers. We also highlight that some of our observations need further study from a physiological mechanism perspective. For instance, it would be important to investigate the underlying causes for eGeMAPS temporal features providing a higher discrimination power in females and for eGeMAPS energy features providing a higher discrimination power in males.
Further research is also needed to make depression detection from speech a practical approach. Clearly, the detection performance (probability of error) of speech-based depression detection is far from desirable: the average probability of error was greater than 0.40 (in agreement with previous studies) and had large standard deviations, highlighting the difficulty in detecting depression from speech-only features. One avenue to improve the detection performance is to collect features from additional signals, such as videos or textual information, and build classifiers that fuse such features along with audio features to detect depression [4,19]. Another avenue is to develop classifiers tailored to detect depression episodes on a specific patient [32]; i.e., instead of designing a classifier to detect depression in any speaker, the classifier would focus on a particular patient. Such a classifier would train on audio samples of the patient during both non-depressed and depressed times. Such a classifier could then be used to determine whether treatment is reducing the rate of depression episodes in the patient.

Author Contributions

Conceptualization, J.T. and B.J.B.F.J.; methodology, B.J.B.F.J.; software, J.T. and B.J.B.F.J.; validation, J.T. and B.J.B.F.J.; writing—original draft preparation, J.T.; writing—review and editing, J.T. and B.J.B.F.J.; supervision, B.J.B.F.J.; project administration, B.J.B.F.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The E-DAIC dataset is distributed by University of Southern California Affective Computing team upon request to them. OpenSmile (v3.0) software is freely available from https://www.audeering.com/research/opensmile/. The version used in this study was accessed around 8 August 2024. The WEKA tool is freely available from https://ml.cms.waikato.ac.nz/weka/. The version used in this study was accessed around 8 August 2024. The Python scripts used to process OpenSmile features from the E-DAIC dataset, execute WEKA, and process its results are available upon request.

Acknowledgments

The authors acknowledge the Affective Computing team of the University of Southern California for making E-DAIC available for our research. The authors also acknowledge the authors of OpenSmile and WEKA for making their software openly available.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
E-DAIC    Extended Distress Analysis Interview Corpus
eGeMAPS    Extended Geneva Minimalistic Acoustic Parameter Set
MFCC    Mel-Frequency Cepstrum Coefficients
SMO    Sequential Minimal Optimization
SVM    Support Vector Machine
WEKA    Waikato Environment for Knowledge Analysis

References

  1. Goodwin, R.D.; Dierker, L.C.; Wu, M.; Galea, S.; Hoven, C.W.; Weinberger, A.H. Trends in US depression prevalence from 2015 to 2020: The widening treatment gap. Am. J. Prev. Med. 2022, 63, 726–733.
  2. Ojala, M.; Garriga, G.C. Permutation tests for studying classifier performance. J. Mach. Learn. Res. 2010, 11, 1833–1863.
  3. Low, L.S.A.; Maddage, N.C.; Lech, M.; Sheeber, L.B.; Allen, N.B. Detection of clinical depression in adolescents’ speech during family interactions. IEEE Trans. Biomed. Eng. 2010, 58, 574–586.
  4. Ringeval, F.; Schuller, B.; Valstar, M.; Cummins, N.; Cowie, R.; Tavabi, L.; Schmitt, M.; Alisamir, S.; Amiriparian, S.; Messner, E.M.; et al. AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop; Association for Computing Machinery: New York, NY, USA, 2019; pp. 3–12.
  5. Wang, J.; Zhang, L.; Liu, T.; Pan, W.; Hu, B.; Zhu, T. Acoustic differences between healthy and depressed people: A cross-situation study. BMC Psychiatry 2019, 19, 300.
  6. Low, D.M.; Bentley, K.H.; Ghosh, S.S. Automated assessment of psychiatric disorders using speech: A systematic review. Laryngoscope Investig. Otolaryngol. 2020, 5, 96–116.
  7. Kiss, G.; Vicsi, K. Mono- and multi-lingual depression prediction based on speech processing. Int. J. Speech Technol. 2017, 20, 919–935.
  8. Cummins, N.; Epps, J.; Breakspear, M.; Goecke, R. An investigation of depressed speech detection: Features and normalization. In Proceedings of the INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association; International Speech Communication Association: Grenoble, France, 2011; pp. 2997–3000.
  9. Huang, Z.; Epps, J.; Joachim, D. Investigation of speech landmark patterns for depression detection. IEEE Trans. Affect. Comput. 2019, 13, 666–679.
  10. Lyu, S.H.; Yang, L.; Zhou, Z.H. A refined margin distribution analysis for forest representation learning. Adv. Neural Inf. Process. Syst. 2019, 32, 5530–5540.
  11. Bailey, A.; Plumbley, M.D. Gender bias in depression detection using audio features. In Proceedings of the 2021 29th European Signal Processing Conference (EUSIPCO); IEEE: Piscataway, NJ, USA, 2021; pp. 596–600.
  12. Ma, X.; Yang, H.; Chen, Q.; Huang, D.; Wang, Y. DepAudioNet: An efficient deep model for audio based depression classification. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge; Association for Computing Machinery: New York, NY, USA, 2016; pp. 35–42.
  13. Kwon, N.; Hossain, S.; Blaylock, N.; O’Connell, H.; Hachen, N.; Gwin, J. Detecting anxiety and depression from phone conversations using x-vectors. In Proceedings of the Workshop on Speech, Music and Mind, Virtual, 15 September 2022; pp. 1–5.
  14. Brueckner, R.; Kwon, N.; Subramanian, V.; Blaylock, N.; O’Connell, H. Audio-based detection of anxiety and depression via vocal biomarkers. In Proceedings of the Future of Information and Communication Conference; Springer: Cham, Switzerland, 2024; pp. 124–141.
  15. Tao, F.; Esposito, A.; Vinciarelli, A. The androids corpus: A new publicly available benchmark for speech based depression detection. Depression 2023, 47, 11–19.
  16. Alghowinem, S.; Goecke, R.; Epps, J.; Wagner, M.; Cohn, J.F. Cross-cultural depression recognition from vocal biomarkers. In Proceedings of the 17th Annual Conference of the International Speech Communication Association, Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016; pp. 1943–1947.
  17. Cummins, N.; Scherer, S.; Krajewski, J.; Schnieder, S.; Epps, J.; Quatieri, T.F. A review of depression and suicide risk assessment using speech analysis. Speech Commun. 2015, 71, 10–49.
  18. Quatieri, T.F.; Malyska, N. Vocal-source biomarkers for depression: A link to psychomotor activity. In Proceedings of the INTERSPEECH 2012, ISCA’s 13th Annual Conference, Portland, OR, USA, 9–13 September 2012; Volume 2, pp. 1059–1062.
  19. Alghowinem, S.; Gedeon, T.; Goecke, R.; Cohn, J.F.; Parker, G. Interpretation of depression detection models via feature selection methods. IEEE Trans. Affect. Comput. 2020, 14, 133–152.
  20. Kroenke, K.; Strine, T.W.; Spitzer, R.L.; Williams, J.B.; Berry, J.T.; Mokdad, A.H. The PHQ-8 as a measure of current depression in the general population. J. Affect. Disord. 2009, 114, 163–173.
  21. DeVault, D.; Artstein, R.; Benn, G.; Dey, T.; Fast, E.; Gainer, A.; Georgila, K.; Gratch, J.; Hartholt, A.; Lhommet, M.; et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, Paris, France, 5–9 May 2014; pp. 1061–1068.
  22. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128.
  22. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128. [Google Scholar]
  23. Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2010; pp. 1459–1462. [Google Scholar]
  24. Ringeval, F.; Schuller, B.; Valstar, M.; Gratch, J.; Cowie, R.; Scherer, S.; Mozgai, S.; Cummins, N.; Schmitt, M.; Pantic, M. AVEC 2017: Real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge; Association for Computing Machinery: New York, NY, USA, 2017; pp. 3–9. [Google Scholar]
  25. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
  26. Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques"; The University of Waikato: Hamilton, New Zealand, 2016. [Google Scholar]
  27. Platt, J. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Advances in Kernel Methods—Support Vector Learning; Schoelkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  28. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; pp. 338–345. [Google Scholar]
  29. Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993. [Google Scholar]
  30. Good, P.I. Resampling Methods; Springer: Cham, Switzerland, 2006. [Google Scholar]
  31. Chernick, M.R.; LaBudde, R.A. An Introduction to Bootstrap Methods with Applications to R; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  32. Campbell, E.L.; Dineley, J.; Conde, P.; Matcham, F.; White, K.M.; Oetzmann, C.; Simblett, S.; Bruce, S.; Folarin, A.A.; Wykes, T.; et al. Classifying depression symptom severity: Assessment of speech representations in personalized and generalized machine learning models. In Proceedings of the INTERSPEECH 2023; ISCA: Dublin, Ireland, 2023; Volume 2023, pp. 1738–1742. [Google Scholar]
Figure 1. Histograms of $P_e(f, B^{(b)})$ and $P_e(f, \tilde{B}^{(b)})$ obtained from 300 random training sets for $T_{UL} = 8$ s and various classifiers: (a) SVM (females), (b) Naive Bayes (females), (c) Tree (females), (d) SVM (males), (e) Naive Bayes (males), and (f) Tree (males).
Figure 2. 95% bootstrap confidence intervals for $\bar{P}_{e,1} - \bar{P}_{e,2}$ for various pairs of feature subsets, obtained from 300 random training sets with $T_{UL} = 8$ s. Blue intervals correspond to male speakers and red intervals to female speakers. (a) SVM classifier; (b) Decision Tree classifier.
Table 1. Average ($\bar{P}_e$) and standard deviation of the probability of classification error over the $B = 300$ random balanced sets ($\{P_e(f, B^{(b)})\}_{b=1}^{B}$) for female and male speakers of the E-DAIC set, for various $T_{UL}$ values and classification algorithms.

$T_{UL}$       2 s              4 s              6 s              8 s
Female
  SVM          0.494 ± 0.033    0.463 ± 0.038    0.451 ± 0.043    0.464 ± 0.051
  Bayes        0.526 ± 0.025    0.520 ± 0.030    0.511 ± 0.036    0.513 ± 0.044
  Tree         0.519 ± 0.024    0.510 ± 0.032    0.500 ± 0.035    0.500 ± 0.042
Male
  SVM          0.471 ± 0.039    0.439 ± 0.054    0.466 ± 0.060    0.411 ± 0.062
  Bayes        0.504 ± 0.028    0.506 ± 0.041    0.491 ± 0.049    0.481 ± 0.063
  Tree         0.484 ± 0.024    0.467 ± 0.037    0.452 ± 0.052    0.431 ± 0.057
Table 2. Results of the $H_0^{(I)}$ tests (average and standard deviation of $\hat{p}^{(I)}$) for female and male speakers of the E-DAIC set, for various $T_{UL}$ values and classification algorithms.

$T_{UL}$       2 s            4 s            6 s            8 s
Female
  SVM          0.21 ± 0.20    0.00 ± 0.00    0.00 ± 0.00    0.01 ± 0.01
  Bayes        1.00 ± 0.00    0.99 ± 0.03    0.87 ± 0.12    0.86 ± 0.16
  Tree         0.18 ± 0.19    0.10 ± 0.15    0.03 ± 0.08    0.04 ± 0.04
Male
  SVM          0.00 ± 0.00    0.00 ± 0.00    0.03 ± 0.06    0.00 ± 0.00
  Bayes        0.65 ± 0.29    0.73 ± 0.19    0.24 ± 0.31    0.08 ± 0.06
  Tree         0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
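As a reading aid for Table 2, the sketch below outlines the kind of permutation test used to assess whether features and depression labels are independent, in the spirit of Ojala and Garriga [2]: the classifier is repeatedly re-trained on label-permuted data and its error is compared with the error on the original labels. The study itself used WEKA classifiers on E-DAIC features, so the scikit-learn SVC, the cross-validated error, and the synthetic data below are illustrative stand-ins only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def cv_error(X, y):
    # 10-fold cross-validated probability of classification error
    return 1.0 - cross_val_score(SVC(kernel="linear"), X, y, cv=10).mean()

def permutation_p_value(X, y, n_permutations=200):
    # p-value estimate for the null hypothesis that X and y are independent
    observed = cv_error(X, y)
    count = 0
    for _ in range(n_permutations):
        y_perm = rng.permutation(y)          # break any feature-label dependence
        if cv_error(X, y_perm) <= observed:  # permuted labels do at least as well
            count += 1
    return (count + 1) / (n_permutations + 1)

# Synthetic stand-in: 60 samples of 88 eGeMAPS functionals with random labels
X = rng.normal(size=(60, 88))
y = rng.integers(0, 2, size=60)
print(permutation_p_value(X, y))
```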
Table 3. 95% bootstrap confidence intervals for the expected probability of classification error ($E[\bar{P}_e]$) for female and male speakers of the E-DAIC set, for various $T_{UL}$ values and classification algorithms.

$T_{UL}$       2 s               4 s               6 s               8 s
Female
  SVM          [0.490, 0.497]    [0.459, 0.467]    [0.446, 0.456]    [0.459, 0.471]
  Bayes        [0.523, 0.529]    [0.517, 0.524]    [0.507, 0.515]    [0.508, 0.518]
  Tree         [0.516, 0.521]    [0.507, 0.514]    [0.497, 0.504]    [0.496, 0.505]
Male
  SVM          [0.466, 0.475]    [0.432, 0.445]    [0.459, 0.472]    [0.404, 0.418]
  Bayes        [0.500, 0.507]    [0.501, 0.510]    [0.485, 0.496]    [0.474, 0.488]
  Tree         [0.482, 0.487]    [0.462, 0.471]    [0.447, 0.458]    [0.424, 0.437]
Table 4. Results of testing $H_0^{(G)}$ for various classifiers and $T_{UL}$ values.

$T_{UL}$   2 s               4 s               6 s                8 s
SVM        [0.017, 0.029]    [0.017, 0.033]    [−0.023, 0.007]    [0.044, 0.063]
Bayes      [0.018, 0.027]    [0.009, 0.020]    [0.013, 0.027]     [0.023, 0.041]
Tree       [0.030, 0.038]    [0.038, 0.049]    [0.041, 0.055]     [0.061, 0.078]
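For completeness, the sketch below illustrates the percentile-bootstrap construction behind the intervals in Table 4 and Figure 2: a 95% confidence interval for the difference between two mean error rates. The arrays stand in for the $B = 300$ per-resample error probabilities of two groups and are filled with synthetic placeholder values; the exact resampling scheme used in the paper may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder values standing in for B = 300 per-resample error probabilities
errors_group_1 = rng.normal(0.46, 0.05, size=300)
errors_group_2 = rng.normal(0.41, 0.06, size=300)

def bootstrap_ci_difference(a, b, n_boot=10_000, alpha=0.05):
    # Percentile-bootstrap confidence interval for mean(a) - mean(b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                    - rng.choice(b, size=b.size, replace=True).mean())
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

low, high = bootstrap_ci_difference(errors_group_1, errors_group_2)
print(f"95% bootstrap CI for the difference in mean error: [{low:.3f}, {high:.3f}]")
```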
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
