Article

Beyond Short-Frame Acoustic Features: Capturing Long-Term Speech Patterns for Depression Detection

1 Division of Information and Electronic Engineering, Muroran Institute of Technology, 27-1 Mizumoto-cho, Muroran 050-8585, Japan
2 College of Information and Systems, Muroran Institute of Technology, 27-1 Mizumoto-cho, Muroran 050-8585, Japan
* Author to whom correspondence should be addressed.
Technologies 2026, 14(4), 198; https://doi.org/10.3390/technologies14040198
Submission received: 2 March 2026 / Revised: 23 March 2026 / Accepted: 24 March 2026 / Published: 25 March 2026
(This article belongs to the Special Issue Advanced Technologies for Enhancing Safety, Health, and Well-Being)

Abstract

Speech-based depression detection is promising for objective mental health assessment. However, conventional methods relying on short-frame acoustic features often fail to capture long-term temporal and behavioral characteristics of speech essential for modeling depression-specific speaking patterns. Herein, four novel acoustic feature sets extracted from long-term speech are proposed: utterance interval feature set (UIFS), pause interval feature set (PIFS), response interval feature set (RIFS), and speech density (SD). These features explicitly characterize temporal structures and session-level speech behaviors beyond short-frame analysis. These features are combined with conventional acoustic features, including standard features extracted using openSMILE and voice level features, and evaluated using support vector machines under subject-independent conditions for the binary classification of depressed and nondepressed speakers. Incorporating the proposed features improves classification performance compared with baseline features (accuracy: 0.54 for openSMILE and 0.52 for openSMILE + voice level features). The configuration integrating all four proposed feature sets achieves an accuracy of 0.58, a precision of 0.56, a recall of 0.58, and a specificity of 0.58, indicating consistent performance gains under subject-independent and strictly controlled evaluation conditions. Thus, depression-related speech patterns can be captured by explicitly modeling temporal and behavioral speech characteristics across entire dialog sessions. This study contributes to advancing acoustic feature design for speech-based depression detection and developing clinically supportive screening and monitoring technologies.

1. Introduction

Depression is a mood disorder characterized by psychological symptoms, such as persistent sadness and loss of motivation, and physical symptoms, such as sleep disturbances and fatigue. The International Classification of Diseases recognizes depression as a mental disorder [1]. According to the World Health Organization (WHO) [2], more than 300 million people worldwide are affected by depression. Beyond its impact on individual health, depression has serious societal consequences, such as reduced work productivity, increased medical and social security costs, and an elevated risk of suicide. Therefore, early identification and timely, accurate assessment and diagnosis of depression are essential for clinical practice and public health management.
However, in clinical settings, the evaluation and diagnosis of depression largely depend on clinicians’ judgment and subjective information gathered via questionnaires and interviews. Thus, diagnostic outcomes may vary across evaluators and different assessment time points, hindering the early identification of depression. To address these challenges, recent studies have extensively employed machine learning techniques to detect depression based on objective indicators that complement subjective evaluations.
Studies on machine learning-based depression classification have used various information sources such as text, facial expressions, and speech. These studies captured depressive states from multiple perspectives such as linguistic content, visual nonverbal information, and acoustic features embedded in speech signals. Text-based approaches focus on analyzing lexical usage patterns and contextual features from written content such as chat messages or blog posts for detecting depression using deep learning methods [3,4,5]. Facial expression-based approaches analyze temporal changes in facial expressions extracted from facial images and classify depression by examining dynamic features derived from local regions such as the eyes and mouth [6,7,8]. In contrast, speech-based approaches focus on acoustic features extracted from speech signals and detect depression by analyzing variations in speech energy, pitch, and spectral characteristics [9,10,11]. These acoustic features include fundamental frequency and voice perturbation measures, which suggest a potential relationship between depressive states and changes in speaking style [12].
Speech is a valuable information source for depression detection because it can be collected with relatively low effort, without requiring the comprehension of linguistic content or the capture of facial images, while still reflecting nonverbal aspects of speech behavior. Moreover, speech data can be integrated naturally into existing communication modalities such as telephone calls and online interviews, making it ideal for clinical and remote-support applications. Therefore, the present study focused on single-modal depression classification based on speech signals. This approach enabled evaluating the classification performance and effectiveness of speech alone and serves as a fundamental investigation toward future extensions to multimodal frameworks.
Existing studies on speech-based depression classification have employed general-purpose acoustic feature sets such as COVAREP [13] and openSMILE [14] for feature extraction. Some of these studies have reported relatively high classification accuracies of ~80–90% under subject-dependent experimental settings, in which speech samples from the same subjects are included in both the training and test sets [15,16,17]. In such subject-dependent settings, however, models may overfit to speaker-specific speech features. Thus, the corresponding results cannot be used to evaluate the generalization performance for unseen subjects. In contrast, subject-independent experimental settings, wherein speech samples from the same subjects do not appear in both training and test sets, enable evaluating generalization performance for unseen subjects. Previous studies have reported relatively modest classification accuracies of ~50–60% under these conditions [18,19,20]. One possible reason for such limited performance is data imbalance arising from disparities in the number of samples between individuals with depression and the control group [21]. This imbalance can bias models toward the majority (healthy) class, leading to apparently high accuracy while failing to properly evaluate the ability to detect depression. In imbalanced clinical speech datasets, this bias often results in poor sensitivity to depressive speech characteristics, as the minority-class samples are much more limited during training. Many existing studies have primarily relied on local acoustic features extracted from short-time frames on the order of several tens of milliseconds [22,23]. Such feature designs may not adequately capture speech features related to temporal structures and speech behaviors that emerge across an entire utterance. In addition, many conventional acoustic feature sets, including those provided by openSMILE, were originally designed for tasks such as automatic speech recognition and emotion recognition. Therefore, they may not be tailored to depression-specific speech characteristics.
This study aims to improve the speech-based depression classification performance of machine learning models by combining conventional general-purpose acoustic features with novel global acoustic features that focus on depression-related speech behaviors under a subject-independent experimental setting. Previous studies [24,25] have reported that individuals with depression exhibit distinctive patterns in utterance intervals (speech-active periods), pause intervals (silent intervals occurring during speech), and response intervals (the time interval before responding in a dialog). Individuals with depression tend to produce shorter utterances and avoid redundant expressions compared with the control group, which can reduce the information content of their speech (i.e., lower speech density (SD)) [26]. Capturing such characteristics using local acoustic features extracted from short-time frames is difficult. To address this issue, four types of global acoustic features that reflect temporal structures and speech behaviors across an entire speech signal were proposed: (1) acoustic features related to utterance intervals, (2) acoustic features related to pause intervals, (3) acoustic features related to response intervals, and (4) acoustic features related to SD. These features are designed not to capture the linguistic content itself but to quantify nonverbal aspects manifested in speech production and speech behavior. For classification based on such global acoustic features, the support vector machine (SVM) [27] was employed as the machine learning algorithm herein. This is because deep learning approaches that rely on the fine segmentation of speech to generate large numbers of training samples are considered unsuitable given the limited availability of subject-independent data. To address data imbalance resulting from the limited availability of speech samples from individuals with depression and its impact on classification performance, undersampling was applied during training.
The main contributions of this study are summarized as follows:
  • Novel global acoustic features that explicitly capture depression-related speech behaviors under a subject-independent experimental setting are designed and introduced.
  • By combining the proposed features with conventional general-purpose acoustic features in classification experiments, enhanced classification performance is achieved; this demonstrates the effectiveness and robustness of the proposed feature design.
  • Feature importance analysis reveals that acoustic features related to pause intervals play a particularly important role in depression classification.
The remainder of this paper is organized as follows. Section 2 provides an overview of related studies. Section 3 describes the dataset and preprocessing procedures. Section 4 details the acoustic features, including both conventional and proposed features, as well as the construction of SVM-based classification models and feature selection methods. Section 5 presents the experimental setup. Section 6 and Section 7 report the experimental results and related discussion, respectively. Finally, Section 8 concludes the paper and outlines directions for future work.

2. Related Works

2.1. Studies on Depression Classification Using Machine Learning

Various machine learning methods have been used for depression classification, including SVMs [28] and random forests [29]. With the advancement of deep learning-based classification techniques, methods such as recurrent neural networks (RNNs) [30], long short-term memory (LSTM) networks [31], and convolutional neural networks (CNNs) [22] have been introduced. These methods automatically learn discriminative acoustic features from speech signals that contribute to classification performance. However, deep learning models generally have complex internal structures, making it difficult to explicitly interpret the contribution of individual acoustic features to depression classification. Moreover, due to their large number of parameters, deep learning models tend to overfit when trained on small clinical datasets, whereas simpler models such as SVMs can achieve more stable generalization under such conditions. In addition, the computational cost of training deep architectures is substantially higher than that of traditional machine learning methods such as SVMs, making deep models less suitable for small and imbalanced clinical datasets where the number of training samples is limited. Therefore, to enhance the interpretability of the classification process, feature selection was performed herein prior to training and classification using an SVM.

2.2. Studies on Depression Classification Based on Acoustic Features

A wide variety of acoustic features have been investigated in speech-based depression classification, such as Mel-frequency cepstral coefficients (MFCCs), fundamental frequency, and jitter [32]. Other studies have focused on vocal characteristics such as creaky voice and prosodic features [33], as well as temporal properties, including the duration of pauses and their proportion within an entire speech signal [25]. These studies have demonstrated that individuals with depression and the control group exhibit statistically significant differences in acoustic features. Acoustic feature extraction methods include openSMILE [16,34], COVAREP [21,33], and HOSA [35]. Classification accuracy can also be improved by combining multiple types of acoustic features rather than relying only on general-purpose acoustic feature sets [34,36]. For instance, Dumpala et al. [34] compared classification performance using RNNs, CNNs, and LSTMs between two conditions: using only openSMILE features and combining openSMILE features with speaker embeddings, which represent speaker-specific characteristics such as voice quality, speaking habits, and spectral properties. Consequently, the classification accuracy improved by ~10%, increasing from 0.56–0.64 to 0.65–0.74, for all methods. However, acoustic features are typically extracted and processed by segmenting speech into short-time frames using a fixed frame shift over the entire speech signal. With such approaches, it is difficult to explicitly distinguish and evaluate utterance and nonutterance intervals, in which the speech behavior characteristics of individuals with depression are particularly pronounced. To address this limitation, newly proposed global acoustic features were integrated with conventional acoustic features, and speech was explicitly segmented into utterance and nonutterance intervals. This enabled classification that effectively captures the temporal structure and quantitative characteristics of each segment.

3. Dataset

The Distress Analysis Interview Corpus–Wizard of Oz (DAIC-WOZ) dataset [37] was used. This dataset was designed to facilitate the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. It contains interview data collected from a total of 189 participants (corresponding to 189 sessions) comprising 102 males and 87 females. Four sessions were excluded due to technical issues such as missing audio; consequently, 185 interview sessions were included. The collected data contained audio and video recordings, as well as questionnaire responses; however, only the audio data were utilized. The average duration of audio data per session was ~16 min, and the speech length varied across participants. All interviews were conducted in English, and all participant responses were provided exclusively in English.
In the DAIC-WOZ dataset, each participant is assigned a score based on the Patient Health Questionnaire depression scale (PHQ-8) [38] developed to assess the severity of depressive symptoms. The PHQ-8 score is a self-report measure calculated from participants’ responses to eight questionnaire items, ranging from 0 to 24. Table 1 and Table 2 summarize the eight questionnaire items and the corresponding severity categories based on the PHQ-8 score, respectively. Participants with PHQ-8 scores ranging from 0 to 9 were categorized as nondepressed, whereas those with scores ranging from 10 to 24 were categorized as depressed. Fine-grained severity levels were not modeled, and a binary classification task was performed to distinguish between nondepressed and depressed participants. Note that the DAIC-WOZ dataset exhibited class imbalance. Of the 185 included sessions, 132 participants belonged to the nondepressed class and 53 belonged to the depressed class. To mitigate the impact of class imbalance on classification performance, undersampling was applied during training, and the number of samples in both classes was set to 53, corresponding to the size of the depressed class. This undersampling was conducted only on the training data, whereas the test data kept the original class distribution.
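For illustration, the training-only undersampling step can be sketched as follows. This is a minimal sketch (not the authors' released code) assuming NumPy arrays with binary labels in which 1 denotes the depressed class; the helper name is hypothetical.

```python
import numpy as np

def undersample_training_set(X_train, y_train, seed=0):
    """Randomly undersample the majority (nondepressed) class so that
    both classes match the size of the minority (depressed) class.
    Applied to the training split only; the test split keeps the
    original class distribution."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y_train == 1)[0]           # depressed samples
    neg_idx = np.where(y_train == 0)[0]           # nondepressed samples
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    rng.shuffle(keep)
    return X_train[keep], y_train[keep]
```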

4. Methods

4.1. Definition of Temporal Intervals in Speech Signals

In the DAIC-WOZ dataset, audio data obtained from each session contained dialogs between an interviewer and a participant. As shown in Figure 1, segments during which the participant is speaking are defined as utterance intervals, whereas those during which the participant is not speaking are defined as nonutterance intervals. Among the nonutterance intervals, silent segments occurring within a participant’s ongoing utterance are defined as pause intervals, whereas the intervals from the end of the interviewer’s question to the onset of the participant’s response are defined as response intervals. The intervals between the end of a participant’s utterance and the beginning of the interviewer’s subsequent utterance were excluded from the analysis. In the DAIC-WOZ dataset, manual annotations indicating the speaking segments of both the interviewer and the participant are provided. Based on these annotations, only the segments labeled as participant speech were used for acoustic feature extraction, thereby minimizing the influence of overlapping speech. Furthermore, additional noise reduction or signal enhancement techniques (e.g., spectral filtering) were not applied, as such processing may alter the original spectral characteristics and energy distribution of the speech signal. Instead, the raw audio recordings were directly used for feature extraction to preserve the inherent acoustic properties.
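As an illustrative sketch, the interval definitions above can be derived from turn-level annotations as follows. The column names (start_time, stop_time, speaker) and the interviewer label "Ellie" are assumptions about the DAIC-WOZ transcript format, not details stated in this paper.

```python
import pandas as pd

def extract_intervals(transcript_path):
    """Derive utterance, pause, and response intervals (in seconds)
    from a DAIC-WOZ-style tab-separated transcript."""
    df = pd.read_csv(transcript_path, sep="\t")
    utterances, pauses, responses = [], [], []
    prev_speaker, prev_stop = None, None
    for row in df.itertuples():
        if row.speaker == "Participant":
            utterances.append((row.start_time, row.stop_time))
            if prev_speaker == "Participant":
                # silence inside the participant's ongoing speech
                pauses.append((prev_stop, row.start_time))
            elif prev_speaker == "Ellie":
                # gap from end of interviewer question to response onset
                responses.append((prev_stop, row.start_time))
        # gaps after participant speech, before the interviewer's next
        # turn, are excluded from the analysis as described above
        prev_speaker, prev_stop = row.speaker, row.stop_time
    return utterances, pauses, responses
```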

4.2. Conventional Acoustic Features

openSMILE features and volume-related acoustic features were adopted as conventional acoustic features. The details of each type of acoustic feature are described below.

4.2.1. openSMILE Features

openSMILE [14] is an open-source acoustic feature extraction toolkit that generates high-dimensional acoustic feature vectors by extracting various low-level descriptors (LLDs) from speech signals and applying statistical functionals to these descriptors. openSMILE features were constructed by computing the LLDs listed in Table 3 and applying the statistical measures and transformation processes shown in Table 4. These operations include representative statistical measures for characterizing the distributions and temporal dynamics of the LLDs, duration-related indicators, and regression coefficients and discrete cosine transform (DCT) coefficients. openSMILE features were extracted from the utterance intervals defined in Section 4.1. Each speech sample was represented as an 88-dimensional feature vector using the eGeMAPS configuration. The eGeMAPS feature set consists of physiologically and acoustically relevant parameters, such as pitch, loudness, jitter, shimmer, and formants, which are known to reflect emotional states and physiological changes during speech [39]. These parameters have been widely used in speech emotion recognition and mental health assessment tasks [40,41,42], and are therefore considered suitable for capturing depression-related speech characteristics, such as reduced prosodic variation and voice irregularities. Hereafter, these acoustic features are referred to as openSMILE features (OS).
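A minimal sketch of this extraction step is shown below, using the opensmile Python package, which exposes the eGeMAPSv02 functionals (88 dimensions). Concatenating the participant's utterance intervals before computing session-level functionals is an assumption of this sketch; the interval list is the kind produced by the sketch in Section 4.1.

```python
import numpy as np
import soundfile as sf
import opensmile

# eGeMAPSv02 functionals yield one 88-dimensional vector per input
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

signal, sr = sf.read("session_audio.wav")
utterance_intervals = [(12.3, 15.8), (20.1, 24.4)]   # example (start, end) in s
segments = [signal[int(s * sr):int(e * sr)] for s, e in utterance_intervals]
features = smile.process_signal(np.concatenate(segments), sr)
print(features.shape)   # (1, 88)
```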
Compared with that of the control group, speech produced by individuals with depression exhibits characteristic changes in energy, fundamental frequency, spectral characteristics, and voice quality–related features such as voice perturbation and noise components [12]. OS is therefore expected to capture acoustic cues associated with these depression-related speech characteristics. Accordingly, OS was adopted as a baseline acoustic feature set to facilitate direct comparison with existing studies on speech-based depression classification.

4.2.2. Volume Level Features

Speech produced by individuals with depression exhibits lower vocal intensity and reduced temporal variability in loudness compared with that of the control group [43]. Therefore, volume level features (VL) were introduced herein to quantitatively characterize loudness-related properties of speech. Specifically, basic descriptive statistics of the loudness distribution were computed, including the maximum, mean, median, and mode. To further describe the distribution asymmetry, skewness was employed. Standard deviation and kurtosis were used to capture the variability and peakedness of loudness fluctuations, respectively. To represent the distributional structure of loudness in greater detail, the loudness range from 0 to 70 dB was divided into 10-dB intervals, and an additional interval was defined for values exceeding 70 dB; this resulted in a total of eight intervals. Then, the frequency counts of loudness values falling into each interval were computed. These eight frequency-based features were incorporated to model the overall shape of the loudness distribution. Skewness and kurtosis were calculated based on this binned frequency distribution.
For extracting VL, loudness analysis was conducted over the utterance intervals defined for each participant in Section 4.1. Each speech sample was ultimately represented as a 15-dimensional acoustic feature vector.
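A minimal sketch of the VL computation, assuming frame-level loudness values in dB are already available, might look as follows. The assignment of the standard deviation to the raw values and of skewness and kurtosis to the binned counts follows the description above; any remaining details are assumptions.

```python
import numpy as np
from scipy import stats

def volume_level_features(loudness_db):
    """15-dimensional VL vector from frame-level loudness values (dB)
    measured within utterance intervals."""
    x = np.asarray(loudness_db, dtype=float)
    # 8 intervals: 0-10, 10-20, ..., 60-70, and >70 dB
    edges = [0, 10, 20, 30, 40, 50, 60, 70, np.inf]
    counts, _ = np.histogram(x, bins=edges)
    return np.array([
        x.max(), x.mean(), np.median(x),
        stats.mode(x, keepdims=False).mode,   # most frequent loudness value
        x.std(),                              # variability of loudness
        stats.skew(counts),                   # asymmetry of the binned distribution
        stats.kurtosis(counts),               # peakedness of the binned distribution
        *counts,                              # 8 frequency counts
    ])
```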

4.3. Proposed Acoustic Features

Conventional acoustic features, namely OS and VL, primarily represent local acoustic characteristics derived by segmenting the speech signal into short-time frames. Thus, they are limited in their ability to represent the global temporal structure and speech behaviors manifested over an entire utterance. To address this limitation, speech timing structures and speech behaviors, which are characteristic of individuals with depression, were evaluated herein. Based on the findings, a set of novel acoustic features computed from the entire speech signal was constructed. These proposed features were designed not to capture fine-grained, short-term acoustic fluctuations but rather to quantify macroscopic temporal and behavioral aspects of speech production, including utterance interval, pause interval, response interval, and SD. By explicitly modeling these factors, the proposed features aim to characterize depression-related speaking patterns that are difficult to capture using conventional frame-level acoustic descriptors. These proposed features are summarized in Table 5. Detailed definitions and computational procedures for each proposed feature are provided in subsequent sections, and the abbreviations are listed in Table 5.

4.3.1. Utterance Interval Feature Set: UL, UN, and UR

Individuals with depression exhibit speech patterns characterized by changes in the duration and temporal variability of utterance intervals compared with the control group [24]. Reduced speech frequency due to decreased responsiveness and variations in the proportion of utterance intervals within an entire dialog also correlate with the severity of depression [25,44]. These findings suggest that the temporal and quantitative characteristics of utterance intervals provide informative cues for depression classification.
Motivated by these observations, a novel acoustic feature set focusing on utterance intervals is proposed herein. It comprises features related to utterance duration (utterance length: UL), the number of utterance intervals (utterance number: UN), and the ratio of utterance intervals (utterance ratio: UR). Collectively, these features are referred to as the utterance interval feature set (UIFS).
UL characterizes the statistical distribution of utterance interval durations and includes basic descriptive statistics, namely the total duration, mean, median, and mode of utterance intervals. Variability and distributional shape are quantified using the standard deviation, skewness, and kurtosis. In addition, utterance interval durations are discretized into 10 classes by partitioning the range from 0 to 900 ms into 100-ms bins, with an additional class for durations of 900 ms or longer; histogram counts are computed for each class, from which the skewness and kurtosis are derived. UN is defined as the total number of utterance intervals observed within a participant’s session. It is intended to capture changes in speech frequency associated with reduced responsiveness. UR is defined as the ratio of the total duration of utterance intervals to the combined duration of utterance and nonutterance intervals. Thus, this feature represents the proportion of speech occupancy within the entire dialog.
Herein, UL, UN, and UR were extracted for each participant based on the utterance intervals defined in Section 4.1. Each speech sample was finally represented as a 19-dimensional acoustic feature vector.
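The following sketch illustrates how such a 19-dimensional interval feature vector could be assembled; the same helper applies to PIFS and RIFS by passing pause or response interval durations instead. The statistics follow the description above, and the exact placement of the shape statistics over raw durations versus binned counts is an interpretation of the text.

```python
import numpy as np
from scipy import stats

def interval_feature_set(durations_ms, total_session_ms):
    """Sketch of UIFS (19 dims): UL (17) + UN (1) + UR (1)."""
    d = np.asarray(durations_ms, dtype=float)
    # 10 classes: 0-100, ..., 800-900 ms, plus >= 900 ms
    edges = list(range(0, 1000, 100)) + [np.inf]
    counts, _ = np.histogram(d, bins=edges)
    ul = np.array([
        d.sum(), d.mean(), np.median(d),
        stats.mode(d, keepdims=False).mode,
        d.std(),                          # variability of durations
        stats.skew(counts),               # shape of the binned distribution
        stats.kurtosis(counts),
        *counts,                          # 10 histogram counts
    ])
    un = len(d)                           # UN: number of intervals
    ur = d.sum() / total_session_ms       # UR: occupancy ratio in the dialog
    return np.concatenate([ul, [un, ur]])
```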

4.3.2. Pause Interval Feature Set: PL, PN, and PR

Individuals with depression exhibit speech patterns characterized by changes in the duration and temporal variability of pause intervals compared with the control group [25]. In addition, increases in pause frequency and the proportion of pause intervals within an entire dialog correlate with the severity of depression [25,45]. These findings suggest that the temporal and quantitative characteristics of pause intervals constitute informative cues for depression classification.
Motivated by these observations, a novel acoustic feature set focusing on pause intervals is proposed herein. It comprises features related to pause duration (pause length: PL), number of pause intervals (pause number: PN), and the ratio of pause intervals within a dialog (pause ratio: PR). Collectively, these features are referred to as the pause interval feature set (PIFS).
PL characterizes the statistical distribution of pause interval durations and includes basic descriptive statistics, namely the total duration, mean, median, and mode of pause intervals. In addition, pause interval durations were converted into histogram representations, and their standard deviation, skewness, and kurtosis were computed using the unified procedure described in Section 4.3.1. PN is defined as the total number of pause intervals observed within a participant’s session. It is intended to capture changes in speech fluency and stagnation tendencies, such as increased pause frequency. PR is defined as the ratio of the total duration of pause intervals to the combined duration of utterance and nonutterance intervals. It represents the proportion of silence occupancy within the entire dialog.
Herein, PL, PN, and PR were extracted for each participant based on the pause intervals defined in Section 4.1. Each speech sample was finally represented as a 19-dimensional acoustic feature vector.

4.3.3. Response Interval Feature Set: RL, RN, and RR

Individuals with depression exhibit speech patterns characterized by prolonged response latency and reduced spontaneous speech activity compared with the control group [44]. Increased variability in response times and a higher proportion of response delays within an entire dialog correlate with the severity of depression [25,46]. These findings suggest that the temporal and quantitative characteristics of response intervals provide informative cues for depression classification.
Motivated by these observations, a novel acoustic feature set focusing on response intervals is proposed herein. It comprises features related to response interval duration (response length: RL), number of response intervals (response number: RN), and the ratio of response intervals (response ratio: RR). Collectively, these features are referred to as the response interval feature set (RIFS).
RL characterizes the statistical distribution of response interval durations and includes basic descriptive statistics, namely the total duration, mean, median, and mode of response intervals. In addition, response interval durations were converted into histogram representations, and their standard deviation, skewness, and kurtosis were computed using the unified procedure described in Section 4.3.1. RN is defined as the total number of response intervals observed within a participant’s session. It is intended to capture delayed response tendencies and reductions in conversational engagement. RR is defined as the ratio of the total duration of response intervals to the combined duration of utterance and nonutterance intervals. It thus represents the proportion of response latency within the entire dialog.
Herein, RL, RN, and RR were extracted for each participant based on the response intervals defined in Section 4.1. Each speech sample was finally represented as a 19-dimensional acoustic feature vector.

4.3.4. Speech Density: SD

Individuals with depression tend to produce shorter utterances and avoid redundant expressions compared with the control group [26]. Such changes in speech behavior may manifest as a reduction in the temporal proportion of utterance intervals within an entire dialog and a decrease in the density of actual speech components produced within utterance intervals.
Motivated by these observations, an acoustic feature that quantifies SD is proposed herein. SD is defined as the ratio of the total number of voiced and unvoiced speech frames observed within utterance intervals to the total duration of the entire session, including utterance and nonutterance intervals. Herein, voiced and unvoiced frames are considered fundamental speech components within English utterance intervals. Voiced frames correspond to speech frames in which a fundamental frequency (F0) is detected, such as vowels and voiced consonants, whereas unvoiced frames correspond to speech frames without detectable F0, typically associated with unvoiced consonants such as /k/ and /t/. These frames are distinguished based on the presence or absence of F0 during frame-level speech analysis, and the total number of frames in each category is counted. In this study, F0 was estimated using the pYIN algorithm [47], and frames with detected F0 were classified as voiced. Frames without detected F0 and with spectral flatness below 0.6 were regarded as unvoiced consonants, whereas those with spectral flatness above this threshold were treated as noise. Because this procedure does not use a dedicated voice activity detection (VAD) algorithm, complete separation of unvoiced consonants and background noise cannot be guaranteed; however, applying spectral-flatness filtering helps reduce this ambiguity to some extent. Based on this definition, SD conceptually differs from UR, which represents the proportion of time occupied by utterance intervals within the dialog. Instead, SD indicates how densely actual speech components are generated over the entire session, including both utterance and nonutterance intervals. In other words, UR is a macroscopic temporal metric, whereas SD provides a microscopic characterization of speech activity density. Thus, SD and UR capture complementary aspects of speech behavior.
Herein, SD was computed for each participant based on the utterance and nonutterance intervals defined in Section 4.1, and each speech sample was represented as a one-dimensional acoustic feature.
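A minimal sketch of the SD computation using librosa's pYIN implementation and spectral flatness is given below. The 0.6 flatness threshold follows the text, whereas the frame parameters (2048-sample frames with a 512-sample hop) and the F0 search range are assumptions; the input y is assumed to be the concatenated utterance-interval audio, normalized by the full session duration.

```python
import librosa
import numpy as np

def speech_density(y, sr, total_session_sec):
    """Sketch of SD: voiced frames (pYIN detects F0) plus unvoiced
    speech frames (no F0, spectral flatness < 0.6) within utterance
    intervals, normalized by the duration of the entire session."""
    _, voiced_flag, _ = librosa.pyin(
        y, fmin=65.0, fmax=600.0, sr=sr,
        frame_length=2048, hop_length=512,
    )
    flatness = librosa.feature.spectral_flatness(
        y=y, n_fft=2048, hop_length=512,
    )[0]
    n = min(len(voiced_flag), len(flatness))
    voiced = voiced_flag[:n]
    # no F0 detected: flatness < 0.6 -> unvoiced consonant, else noise
    unvoiced = (~voiced) & (flatness[:n] < 0.6)
    return (int(voiced.sum()) + int(unvoiced.sum())) / total_session_sec
```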

4.4. Construction of the Classification Model

4.4.1. Classification Algorithm

Many existing deep learning-based studies have trained models on abundant feature vectors extracted from short-time speech frames [22,23]. In contrast, feature vectors were extracted herein from each participant’s entire dialog session, generating one feature vector per participant. Thus, the total number of training samples was determined by the number of participants and was substantially smaller than that typically used in conventional deep learning-based approaches. Under such conditions, deep learning models, which generally involve abundant trainable parameters, cannot be employed because insufficient training data may considerably increase the risk of overfitting and degrade generalization performance.
Therefore, SVM [27] was employed as the classification algorithm; it is recognized for its robustness, strong generalization capability, and effectiveness in high-dimensional feature spaces with limited training data. SVM determines an optimal decision boundary (hyperplane) between two classes by maximizing the margin between them. As the learned decision function is determined by a limited subset of training samples, known as support vectors, SVM can achieve stable classification performance even with a small number of training samples.
As only a single feature vector was extracted for each participant based on long-duration speech signals herein, the number of available training samples was inherently limited. Therefore, SVM was a suitable and reliable choice of model. Accordingly, approaches that artificially increase the number of training samples by segmenting speech into short-time frames were not employed; this was because such approaches could introduce redundancy and potentially distort participant-level speaking characteristics. Instead, an SVM-based classification model was constructed using feature vectors that preserved participant-level statistical and temporal characteristics extracted from entire dialog sessions.

4.4.2. Feature Selection

Among the acoustic features employed herein, some may not contribute to classification performance or exhibit strong redundancy with other features. Therefore, feature selection was performed to enhance classification performance, reduce model complexity, and improve interpretability.
Recursive feature elimination with cross-validation (RFECV) [48] was employed for feature selection, which integrates recursive feature elimination (RFE) with cross-validation (CV). RFE is a wrapper-based feature selection technique that iteratively removes the least informative features based on their contribution to the classification model. Specifically, feature importance is evaluated using the absolute values of the learned SVM weights, and features with the smallest contributions are eliminated at each iteration. By progressively discarding less informative features, RFE identifies a compact and discriminative subset of features that preserves classification performance. Meanwhile, CV was employed to objectively evaluate the generalization capability of the model by repeatedly training and testing it on multiple data splits. Subject-independent five-fold CV was employed to ensure that speech samples from the same participant did not appear in both the training and test sets, enabling an accurate evaluation of generalization performance for unseen participants. In addition, feature selection using RFECV was performed exclusively on the training data within each cross-validation fold. The selected feature subset was then applied to the corresponding test data. This procedure was adopted to avoid data leakage and to ensure that the feature selection process did not utilize information from the test set. Using RFECV, the optimal number of features and their most effective combination were automatically determined by maximizing classification performance under CV. In the RFECV process, the F1-score was adopted as the optimization metric. This choice enables balanced evaluation of classification performance across classes, particularly under imbalanced dataset conditions, by considering both precision and recall. Consequently, a compact and discriminative feature subset that optimized the performance of the SVM-based classifier was selected.
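A sketch of this selection step with scikit-learn's RFECV might look as follows. Because each participant contributes exactly one feature vector, a stratified participant-level split is subject-independent by construction; the fixed random seed shown here is varied across trials.

```python
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Rank features by the absolute weights of a linear SVM and recursively
# eliminate the least informative one, optimizing the F1-score under
# 5-fold cross-validation on the training split only (no leakage).
selector = RFECV(
    estimator=SVC(kernel="linear"),
    step=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)   # same subset applied to the test fold
print("selected features:", selector.n_features_)
```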

4.4.3. Training and Classification

To integrate all acoustic features defined in Section 4.2 and Section 4.3 into a unified numerical scale, min–max normalization was applied independently to each feature dimension. The normalization parameters (minimum and maximum values) were computed exclusively from the training set in each cross-validation fold and subsequently applied to the corresponding test set. This procedure was adopted to prevent data leakage and ensure a fair evaluation of generalization performance. Then, normalized acoustic feature vectors were constructed for further analysis. Multiple randomized trials were conducted to account for the influence of data partitioning in RFECV-based feature selection. To reduce bias caused by arbitrary data partitioning and obtain a more robust estimation of generalization performance, a total of 100 independent trials with different random seeds were performed. In each trial, feature selection using RFECV and subsequent training and classification were conducted using both linear and nonlinear SVM models. By iterating the experiments, the variability of selected feature subsets caused by different data partitions was mitigated. This process enabled a reliable assessment of both classification performance stability and feature selection reproducibility. The final classification performance was reported as the median value of each evaluation metric across the 100 trials. The median provided a robust summary statistic that was less sensitive to outliers and random fluctuations resulting from individual data partitions.
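Putting these pieces together, the trial loop could be sketched as follows; the RFECV selection step shown earlier is omitted inside the loop for brevity. X and y denote the assumed participant-level feature matrix and binary labels, and undersample_training_set is the hypothetical helper sketched in Section 3.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

scores = []
for trial in range(100):                              # 100 random partitions
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=trial)
    for tr, te in folds.split(X, y):
        scaler = MinMaxScaler().fit(X[tr])            # train-fold statistics only
        X_tr, X_te = scaler.transform(X[tr]), scaler.transform(X[te])
        X_tr, y_tr = undersample_training_set(X_tr, y[tr], seed=trial)
        clf = SVC(kernel="rbf").fit(X_tr, y_tr)       # RFECV step omitted here
        scores.append(accuracy_score(y[te], clf.predict(X_te)))

print("median accuracy:", np.median(scores))          # median across trials/folds
```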

5. Experiments

5.1. Experimental Settings

Herein, the extent to which depression classification performance can be improved was quantitatively evaluated by introducing the proposed global acoustic features in addition to conventional acoustic features. To this end, seven experimental conditions were designed by varying combinations of acoustic features (or feature sets). Then, classification experiments were conducted using an SVM-based classifier with RFECV-based feature selection. Table 6 summarizes the acoustic features used in each experimental condition. In Table 6, the “+” symbol indicates that the corresponding feature sets are concatenated to form a single feature vector for classification. For example, the combined feature vector for the condition OS + VL + PIFS is obtained by concatenating the 88-dimensional vectors of OS, the 15-dimensional vectors of VL, and the 19-dimensional vectors of PIFS, resulting in a 122-dimensional vector.
First, Baseline 1 is defined as the condition wherein only OS is used. Baseline 2 is defined as the condition that augments OS with VL and is denoted as OS + VL. Using this condition, the incremental contribution of VL to classification performance could be appropriately evaluated. Relative to Baseline 2, four experimental conditions were established in which UIFS, PIFS, RIFS, and SD were individually added. These conditions were designed to independently assess the contribution of each proposed acoustic feature set to classification performance. Finally, a comprehensive condition integrating all acoustic features (OS, VL, UIFS, PIFS, RIFS, and SD) was evaluated and denoted as ALL. ALL was then compared with both baseline conditions and the individual feature-addition conditions to systematically and comprehensively assess the overall effectiveness of the proposed acoustic feature sets.
As the characteristics of the decision boundary in SVM are strongly influenced by the choice of kernel function, the classification performance of linear and nonlinear (radial basis function, RBF) kernels was compared to determine the most suitable condition for the dataset. For the nonlinear SVM with the RBF kernel, hyperparameter optimization was conducted using a grid search over the penalty parameter C and the kernel coefficient γ. The optimal parameter set was determined based on cross-validation performance. Using the selected kernel, the aforementioned seven experimental conditions were evaluated to verify the effectiveness and contribution of each acoustic feature set to depression classification performance.
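The grid search itself can be sketched as follows; the candidate values for C and γ are illustrative assumptions, as the paper does not list the searched grid.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],                # illustrative candidate values
    "gamma": [1e-3, 1e-2, 1e-1, "scale"],
}
search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
best_rbf_svm = search.best_estimator_      # selected by cross-validation score
```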

5.2. Evaluation Metrics for Classification Performance

Classification performance was evaluated using subject-independent five-fold CV. In each fold, 80% of the participants were used for training and the remaining 20% for testing. This ensured that no speech data from the same participant appeared in both the sets in any fold.
For the binary classification task, four commonly used evaluation metrics were employed: accuracy, precision, recall, and specificity. Each metric is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP},$$
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP},$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Herein, the depressed and nondepressed classes were defined as the positive and negative classes, respectively.
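Equivalently, these four metrics can be computed from a confusion matrix, as in the following sketch:

```python
from sklearn.metrics import confusion_matrix

# Depressed = positive class (label 1); nondepressed = negative (label 0)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy    = (tp + tn) / (tp + fn + tn + fp)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)        # sensitivity to the depressed class
specificity = tn / (tn + fp)
```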

6. Results

In this section, we first compare the classification performance of linear and nonlinear SVMs to determine the classifier used in the subsequent analyses. Subsequently, we report the ablation results obtained using each proposed feature set in isolation, as well as the classification results obtained by adding each proposed feature set to the baseline acoustic features. Finally, we analyze the contribution of individual acoustic features based on the configuration that achieved the strongest overall performance.
All of the evaluation scores reported in this section were obtained from a total of 2500 trials, which correspond to the combination of 5-fold cross-validation for classification, 5-fold cross-validation within the RFECV process, and 100 random initialization trials in RFECV (5 × 5 × 100). This procedure was adopted to ensure the stability and reliability of the performance estimates.

6.1. Performance Comparison Between Linear and Nonlinear SVMs

Table 7 compares the performance of the linear SVM and the nonlinear SVM with an RBF kernel. For each evaluation metric, the higher median score between the two models is highlighted. The values shown in parentheses alongside the median represent the interquartile range (from the first quartile to the third quartile). Across the feature sets examined, the nonlinear SVM showed performance that was broadly similar to that of the linear SVM, and for several metrics, such as precision and recall, the nonlinear SVM yielded slightly higher median values. Although the overall differences between the two classifiers were small, acoustic characteristics related to depression may include patterns that are not fully captured by linear separation; therefore, the subsequent analyses were conducted using the nonlinear SVM with an RBF kernel.

6.2. Classification Results Using Conventional and Proposed Acoustic Features

Before examining the classification performance obtained by combining the proposed features with the baseline acoustic features, we evaluated each proposed feature set in isolation using the nonlinear SVM. The four feature sets exhibited similar levels of accuracy, precision, and recall, with median values concentrated in the range of 0.50 to 0.55 (see Table 8). This indicates that the discriminative capability of each proposed feature set, when used individually, was limited and did not exceed that of the baseline acoustic features. A notable exception appeared in specificity, where the SD feature set showed a markedly lower median value than the other feature sets, suggesting a reduced ability to correctly identify negative cases when used alone and a potential imbalance in class-wise discriminability. In contrast, UIFS, PIFS, and RIFS maintained specificity levels comparable to the baselines. Taken together, these results demonstrate that the proposed acoustic feature sets are not sufficiently informative on their own but rather function as supplementary cues that contribute more effectively when combined with the baseline acoustic features.
Figure 2 visualizes the classification results as box-and-whisker plots for each evaluation metric. Compared with the baseline conditions using only conventional acoustic features (OS and OS + VL), the OS + VL + UIFS, OS + VL + PIFS, and ALL conditions showed higher median values across all four evaluation metrics. Among these, the ALL condition achieved the highest overall performance. To formally assess the improvement, we compared the best-performing configuration (ALL) with the two baselines using the Wilcoxon signed-rank test with Bonferroni correction. The ALL condition showed significantly higher accuracy, precision, and recall than both OS and OS + VL (p < 0.001 for all comparisons). For specificity, the ALL condition significantly outperformed OS (p < 0.05), whereas the difference from OS + VL did not reach the conventional 5% significance level but showed a marginally significant trend (p < 0.1). For the OS + VL + RIFS condition, the median accuracy was comparable to that of OS; the medians of precision, recall, and specificity were higher than those of the baselines, although these differences were not statistically significant under the same test. In contrast, the OS + VL + SD condition showed median values that were comparable to or lower than those of OS or OS + VL across all four evaluation metrics, indicating that SD did not contribute positively when integrated with the baseline features.
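For reference, the significance test described above can be sketched as follows; the per-trial score arrays (scores_all, scores_os, scores_os_vl) are hypothetical placeholders for the paired metric values across trials.

```python
from scipy.stats import wilcoxon

comparisons = {
    "ALL vs. OS": (scores_all, scores_os),
    "ALL vs. OS + VL": (scores_all, scores_os_vl),
}
for name, (a, b) in comparisons.items():
    _, p = wilcoxon(a, b)                       # paired, per-trial scores
    p_corr = min(p * len(comparisons), 1.0)     # Bonferroni correction
    print(f"{name}: corrected p = {p_corr:.4g}")
```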

6.3. Feature Importance of the Best-Performing Feature Set

Figure 3 shows the acoustic features selected by RFECV for the ALL condition, which achieved the highest classification performance among all experimental settings. The horizontal axis in this figure indicates the mean of the absolute weight coefficients from the linear SVM across 2500 trials. Since all features were scaled to [0, 1] via min–max normalization, these mean absolute weights provide a direct statistical measure of the relative discriminative contribution of each feature, independent of their original scales and units. In this analysis, the top 20 acoustic features with the strongest discriminative contribution were extracted. Among these features, the top two features, OS(F0semitoneFrom27.5Hz_meanRisingSlope) and PL(mode), showed substantially larger contributions than the others, indicating that both short-term acoustic dynamics and the statistical properties of pause behavior strongly influenced the classification. In addition to these two dominant features, a total of nine features related to the proposed long-duration session-level representations, derived from UL-, PL-, and RL-based features included in UIFS, PIFS, and RIFS, were ranked within the top 20. These proposed features collectively contributed to the overall performance, reflecting the influence of multiple session-level temporal and statistical descriptors in the ALL condition.

7. Discussion

7.1. Discussion on Performance Comparison Between Linear and Nonlinear SVMs

The results presented in Section 6.1 showed that the nonlinear SVM with an RBF kernel achieved performance that was comparable to or slightly higher than the linear SVM across most feature conditions. This finding suggests that depression-related acoustic characteristics may contain nonlinear relationships among the feature dimensions, and that such relationships can be captured more effectively by a nonlinear decision boundary. Based on this observation, the nonlinear SVM was adopted for the subsequent analyses in this study.
Because feature selection in this study was performed using a linear SVM within the RFECV framework, the interaction between nonlinear classification models and feature selection procedures optimized specifically for nonlinear kernels remains to be clarified. Evaluating feature selection methods that are tailored to nonlinear models will be an important direction for future work.

7.2. Classification Results of Conventional and Proposed Acoustic Features

7.2.1. Effectiveness of the Proposed Acoustic Features

The ablation results indicated that the proposed feature sets, UIFS, PIFS, RIFS, and SD, did not show markedly stronger discriminative ability when used individually, either relative to one another or compared with the existing acoustic features. However, the proposed feature sets demonstrated their effectiveness when combined with the baseline feature sets, contributing to improved classification performance. The most pronounced gains were observed under the ALL condition relative to the OS and OS + VL conditions. The OS + VL + PIFS condition also demonstrated consistently strong performance among the proposed feature combinations, and this improvement can be attributed to the addition of acoustic features designed to capture depression-related speech behaviors at the session level. Notably, in the analysis of discriminative contribution (Figure 3), one of the PIFS-related features, PL(mode), exhibited the second-largest weight among all features, with a contribution level substantially higher than most remaining features. This suggests that pause-related temporal statistics remain highly influential, even when all feature sets are integrated. Recent clinical studies also provide a physiological and psychological basis for the strong effectiveness of pause-related features. Psychomotor retardation, a core symptom of depression, causes slowing in cognitive processing and motor activity, directly affecting speech production. Bennabi et al. [49] reported that such slowing appears in speech as increased hesitation and longer silent intervals. Yamamoto et al. [50] demonstrated that individuals with depression exhibit significantly extended pause intervals, reduced speech rates, and delays in response initiation, and that these pause characteristics correlate with clinical severity. Moreover, Esposito et al. [51] showed that depression affects not only pause intervals but also pausing strategies, including empty pauses, filled pauses, and phoneme lengthening. These findings suggest that the effectiveness of PIFS arises from its ability to capture behavioral markers directly linked to psychomotor retardation and underlying neurocognitive impairment.
As shown in Figure 3, a total of nine of the top 20 features were derived from the proposed long-duration session-level representations, including those from UIFS and RIFS as well as PIFS, indicating that session-level descriptors contributed meaningfully to shaping the learned decision boundary. These feature-level findings match the overall performance trends, and as illustrated in Figure 2, the combinations involving UIFS and RIFS also enhanced classification performance over the baseline. Within these feature sets, features related to utterance length (UL) and response length (RL) were identified as key contributors to the classification. It is well established in the clinical literature that individuals with depression exhibit shorter utterance durations [24,25,44] and prolonged response latencies [25,46] compared with control groups. Taken together, the incorporation of UL- and RL-related features likely enabled the model to better capture behavioral characteristics that are representative of depressive speech.
The relatively limited effectiveness of SD may be attributed to the low dimensionality of its feature representation compared with the other acoustic feature sets, which may have restricted its ability to capture sufficient discriminative information. In future work, acoustic information could be combined with textual information, such as utterance-length or word-count measures related to SD, to enhance the representational capacity of SD and improve classification performance.
In Figure 2, substantial variance and broad overlap in the score distributions across all four evaluation metrics were observed among the experimental conditions. This can be attributed to the characteristics of the DAIC-WOZ dataset and the evaluation protocol employed in this study. First, the DAIC-WOZ dataset contains a limited number of participants, with the depressed class being particularly small. Under subject-independent cross-validation, this results in considerable heterogeneity in speaker composition across folds, and variability in individual speaker characteristics has a substantial influence on classification performance. Such fold-to-fold variation is a major factor contributing to the overall performance variance observed across all metrics. Second, RFECV-based feature selection was performed independently in each of the 100 trials. Even slight differences in the training data split yield stochastic variation in the selected feature subsets, which further amplifies the variance of the classification metrics. These factors jointly explain why the distributions of accuracy, precision, recall, and specificity show wide variance and strong overlap among the experimental conditions. For future work, achieving more robust feature design will require both deeper investigation into the acoustic properties of depressive speech and validation using larger and more diverse participant populations.

7.2.2. Comparison of Classification Performance with Previous Studies

Table 9 compares the classification performance of the ALL condition with conditions from previous studies that used a single acoustic feature set [34,36]. Table 10 compares the ALL condition with conditions from previous studies that combined multiple acoustic feature sets [34,36]. All previous studies employed the same dataset as this study and conducted experiments under subject-independent evaluation protocols.
It should be noted, however, that the present study employs a much more conservative performance estimation in order to strictly evaluate the generalization ability of the model. As a result, the accuracy achieved by the ALL condition in this study does not necessarily exceed the values reported in earlier work, and these numerical differences should therefore be interpreted with caution. In addition, methodological differences between the existing studies and the present study also contribute to the differences in the reported performances. Many existing studies employed deep neural network-based models, which offer strong representational capacity but typically require a large number of parameters and substantial computational resources during both training and inference. Such models can also be more susceptible to unintended data leakage and often provide limited interpretability. In contrast, the present study adopts an SVM-based model with a rigorous cross-validation procedure and focuses on the detailed evaluation of long-duration session-level acoustic features. The proposed method relies on linear or nonlinear SVMs, whose computational complexity is generally lower than that of neural network-based approaches such as the CNNs and LSTMs commonly used in earlier work. This lower complexity may lead to reduced computational cost and shorter inference time, offering practical advantages in scenarios that require real-time processing or deployment in resource-limited clinical settings.
Although a quantitative comparison of computational efficiency (e.g., inference time or number of parameters) was not conducted in this study, this remains an important direction for future work. From this perspective, the primary contribution of the present study lies not in achieving superior absolute accuracy but in demonstrating that session-level acoustic feature sets, UIFS, PIFS, RIFS, and SD, contribute to depression classification even under a stricter and more conservative evaluation protocol. As shown in Section 6.3, PL(mode), a PIFS-derived feature, remained one of the highest-ranking features under the ALL condition, confirming that long-duration behavioral indicators retain discriminative relevance for identifying depressive speech.
From a clinical perspective, it is important to consider the trade-off between precision and recall (sensitivity). In medical screening, high recall is often prioritized to avoid missing individuals who may have the condition, since false-negative outcomes can delay diagnosis and treatment. However, excessively low precision leads to a large number of false-positive predictions, which may place unnecessary psychological burden on patients and increase the cost of follow-up assessments. Therefore, a balanced performance between these metrics is desirable for safe and practical real-world application.
Taken together, the performance differences between this study and the existing studies are better understood as reflecting differences in evaluation rigor, model complexity, and feature design, rather than a direct comparison of superiority or inferiority among specific acoustic feature sets.

7.3. Limitations and Future Directions of the Study

Although the proposed method achieved enhanced classification performance, it may still be insufficient for direct clinical deployment in real-world medical settings. The following five challenges should be addressed to further improve classification performance and enhance the practical applicability of the proposed method.
(1)
The integration of multimodal information represents a critical future direction for depression classification. Previous studies have reported that multimodal approaches combining speech with facial expressions, textual content, or physiological signals achieve improved classification performance [52,53]. By using multiple modalities, individual differences and symptom variability can be captured more effectively while mitigating the impact of noise, missing data, and modality-specific limitations. Future research should explore multimodal fusion strategies that combine acoustic features with complementary information sources to achieve more robust and reliable depression detection.
(2)
Data imbalance and limited sample size remain significant challenges. Herein, under-sampling was applied to address class imbalance, which inevitably reduced the number of training samples (a minimal under-sampling sketch is shown after this list). In general, increasing the number of training samples improves model generalization and robustness [54]. Therefore, expanding the dataset via large-scale data collection, multi-institutional collaboration, or public dataset integration constitutes an important direction for future research. Such data expansion will also facilitate the application of deep learning models, which typically require large-scale datasets to fully exploit their representational capacity.
(3)
The exploration of feature selection strategies was limited to a single method, RFECV. Although RFECV provides an effective framework for selecting discriminative features, alternative methods such as sequential forward selection (SFS) and sequential backward selection (SBS) may yield different feature subsets and classification performance (a comparison sketch is shown after this list). A systematic comparison of multiple feature selection techniques would provide deeper insights into optimal feature subset construction and contribute to more robust model design for depression classification.
(4)
In addition, the SD metric showed lower discriminative performance than the other proposed feature sets. One possible reason is that the current implementation does not employ a dedicated voice activity detection (VAD) algorithm, so unvoiced consonants and background noise are not always separated reliably. Such misclassification could lead to over- or underestimation of SD (a naive VAD sketch illustrating this failure mode is shown after this list). Therefore, introducing a more robust VAD or noise-resilient voicing detection method will be important for improving the SD feature in future work.
(5)
Furthermore, this study did not control for demographic factors such as gender, age, and other speaker-specific characteristics, which may have influenced both the extracted acoustic features and the generalization ability of the trained model. In particular, clear differences in fundamental frequency and formant structure between male and female speakers are well documented, and gender-dependent patterns in depressive speech have also been reported [55]. Such demographic attributes have likewise been identified as potential sources of bias in depression-detection models [18]. Future work should therefore use datasets covering a wider range of demographic profiles and develop methods that appropriately separate and evaluate the influence of speaker-related factors on depression classification.
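
Regarding limitation (2), the following minimal sketch in plain NumPy (toy data, not the study's pipeline) illustrates the trade-off of random under-sampling: the class labels become balanced, but a majority of the original samples are discarded.

```python
# Minimal sketch for limitation (2) (illustrative toy data): random
# under-sampling of the majority class, which balances labels at the cost
# of discarding training samples.
import numpy as np

rng = np.random.default_rng(3)
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels (toy)
X = rng.normal(size=(100, 5))

minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=minority.size, replace=False)
keep = np.concatenate([majority, minority])
X_bal, y_bal = X[keep], y[keep]
print(y_bal.mean())     # 0.5 after balancing
print(X_bal.shape)      # (40, 5): 60 samples discarded
```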
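
Regarding limitation (3), the sketch below compares three selection strategies on the same synthetic data using scikit-learn (an assumed toolkit, not necessarily the authors' implementation); the point is simply that SFS, SBS, and RFE can return different feature subsets.

```python
# Minimal sketch for limitation (3) (assumed scikit-learn API): comparing
# forward selection (SFS), backward selection (SBS), and RFE on identical
# synthetic data; the selected index sets often differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=20, random_state=0)
svm = SVC(kernel="linear")

sfs = SequentialFeatureSelector(svm, n_features_to_select=5, direction="forward").fit(X, y)
sbs = SequentialFeatureSelector(svm, n_features_to_select=5, direction="backward").fit(X, y)
rfe = RFE(svm, n_features_to_select=5).fit(X, y)

print("SFS:", np.flatnonzero(sfs.get_support()))
print("SBS:", np.flatnonzero(sbs.get_support()))
print("RFE:", np.flatnonzero(rfe.get_support()))
```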
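
Regarding limitation (4), the following sketch implements a deliberately naive energy-threshold voicing decision (our own simplification for illustration, not the paper's SD extractor, whose SD is defined as the occurrence frequency of speech components per unit time). Quiet unvoiced consonants near the noise floor fall below such a threshold, which is exactly the kind of misclassification that can bias a speech-density estimate.

```python
# Minimal sketch for limitation (4) (illustrative only): a naive
# energy-threshold VAD. Low-energy unvoiced consonants are easily
# confused with background noise, biasing the density estimate.
import numpy as np

def speech_density(signal: np.ndarray, sr: int, frame_ms: float = 25.0,
                   energy_ratio: float = 0.1) -> float:
    """Fraction of frames whose RMS energy exceeds a global threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    threshold = energy_ratio * rms.max()   # naive global threshold
    return float((rms > threshold).mean())

sr = 16000
t = np.arange(sr * 2) / sr
toy = np.sin(2 * np.pi * 150 * t) * (t < 1.0)   # 1 s of "speech", 1 s of silence
toy += 0.01 * np.random.default_rng(2).normal(size=toy.size)
print(f"speech density = {speech_density(toy, sr):.2f}")   # about 0.5 for this toy signal
```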

8. Conclusions

In this study, we introduced novel acoustic features designed to capture the temporal and behavioral characteristics of speech and examined their effectiveness for depression classification under subject-independent conditions. Although the four proposed feature sets (UIFS, PIFS, RIFS, and SD) did not show marked discriminative advantages when used individually, they contributed to improved classification performance when combined with conventional acoustic features. In particular, the SVM model under the ALL condition, which integrates all four proposed feature sets, achieved higher performance than models relying only on existing acoustic features. The experiments were conducted under a rigorous evaluation protocol that avoided data leakage, providing performance estimates that more accurately reflect the generalization ability of speech-based depression classification models.
While the proposed features improved the modeling of depression-related speech patterns, the current performance levels remain insufficient for immediate clinical use. Future work should explore multimodal integration, evaluate models on larger and more demographically diverse datasets, and further refine feature engineering and classification strategies. These directions will be important for enhancing model generalization and advancing the practical applicability of speech-based depression screening in real diagnostic settings.

Author Contributions

Conceptualization, S.F. and Y.O.; methodology, S.F. and Y.O.; validation, S.F. and Y.O.; formal analysis, S.F.; investigation, S.F.; resources, S.F.; data curation, S.F.; writing—original draft preparation, S.F. and Y.O.; writing—review and editing, S.F., Y.O., M.A.A. and M.C.; visualization, S.F. and M.A.A.; supervision, Y.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it involved the secondary analysis of the DAIC-WOZ dataset, which consists of publicly available, de-identified data. The original data collection was conducted in accordance with ethical standards and approved by the relevant institutional review boards.

Informed Consent Statement

Not applicable.

Data Availability Statement

The DAIC-WOZ dataset used in this study is available upon request from the University of Southern California (USC) Institute for Creative Technologies at https://dcapswoz.ict.usc.edu/ (accessed on 1 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UIFS: Utterance interval feature set
PIFS: Pause interval feature set
RIFS: Response interval feature set
SD: Speech density
UL: Utterance length
UN: Utterance number
UR: Utterance ratio
PL: Pause length
PN: Pause number
PR: Pause ratio
RL: Response length
RN: Response number
RR: Response ratio
OS: openSMILE features
VL: Volume level features
LLD: Low-level descriptor
MFCC: Mel-frequency cepstral coefficient
PLP-CC: Perceptual linear prediction cepstral coefficients
LPC: Linear predictive coding
LSP: Line spectral pair
HNR: Harmonics-to-noise ratio
DCT: Discrete cosine transform
ACF: Autocorrelation function
SHS: Sub-harmonic summation
FFT: Fast Fourier transform
CENS: Chroma energy normalized statistics
SVM: Support vector machine
RBF: Radial basis function
RFECV: Recursive feature elimination with cross-validation
RFE: Recursive feature elimination
CV: Cross-validation
SFS: Sequential forward selection
SBS: Sequential backward selection
DNN: Deep neural network
CNN: Convolutional neural network
RNN: Recurrent neural network
LSTM: Long short-term memory
TP: True positive
TN: True negative
FP: False positive
FN: False negative
DAIC-WOZ: Distress Analysis Interview Corpus-Wizard of Oz

References

1. Marx, W.; Penninx, B.W.J.H.; Solmi, M.; Furukawa, T.A.; Firth, J.; Carvalho, A.F.; Berk, M. Major depressive disorder. Nat. Rev. Dis. Primers 2023, 9, 44.
2. World Health Organization. Depression. Available online: https://www.who.int/news-room/fact-sheets/detail/depression (accessed on 15 November 2022).
3. Amanat, A.; Rizwan, M.; Javed, A.R.; Abdelhaq, M.; Alsaqour, R.; Pandya, S.; Uddin, M. Deep learning for depression detection from textual data. Electronics 2022, 11, 676.
4. Liu, D.; Feng, X.L.; Ahmed, F.; Shahid, M.; Guo, J. Detecting and measuring depression on social media using a machine learning approach: Systematic review. JMIR Ment. Health 2022, 9, e27244.
5. Tahir, W.B.; Khalid, S.; Almutairi, S.; Abohashrh, M.; Memon, S.A.; Khan, J. Depression detection in social media: A comprehensive review of machine learning and deep learning techniques. IEEE Access 2025, 13, 12789–12818.
6. He, L.; Jiang, D.; Sahli, H. Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Trans. Multimed. 2019, 21, 1476–1486.
7. Cao, X.; Zhai, L.; Zhai, P.; Li, F.; He, T.; He, L. Deep learning based depression recognition through facial expression: A systematic review. Neurocomputing 2025, 627, 129605.
8. Wang, R.; Huang, J.; Zhang, J.; Liu, X.; Zhang, X.; Liu, Z.; Zhao, P.; Chen, S.; Sun, X. FacialPulse: An efficient RNN based depression detection via temporal facial landmarks. arXiv 2024, arXiv:2408.03499.
9. Asgari, M. Algorithms for Extracting Robust and Accurate Speech Features and Their Application in Clinical Domain. Ph.D. Dissertation, Oregon Health & Science University, Portland, OR, USA, 2014.
10. Maran, P.L.; Braquehais, M.D.; Vlaic, A.; Alonzo Castillo, M.T.; Vendrell Serres, J.; Ramos Quiroga, J.A.; Rodríguez Urrutia, A. Performance of automatic speech analysis in detecting depression: Systematic review and meta-analysis. JMIR Ment. Health 2025, 12, e67802.
11. Almaghrabi, S.A.; Clark, S.R.; Baumert, M. Bioacoustic features of depression: A review. Biomed. Signal Process. Control 2023, 85, 105020.
12. Esposito, A.; Raimo, G.; Maldonato, M.; Vogel, C.; Conson, M.; Cordasco, G. Behavioral sentiment analysis of depressive states. In Proceedings of the 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2020), Naples, Italy, 16–18 September 2020.
13. Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP: A collaborative voice analysis repository for speech technologies. In Proceedings of the ICASSP 2014, Florence, Italy, 4–9 May 2014.
14. Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE—The Munich versatile and fast open-source audio feature extractor. In Proceedings of the ACM MM 2010, Florence, Italy, 25–29 October 2010.
15. Solieman, H.; Pustozerov, E.A. The detection of depression using multimodal models based on text and voice quality features. In Proceedings of the ElConRus 2021, St. Petersburg, Russia, 26–29 January 2021.
16. Higuchi, M.; Nakamura, M.; Shinohara, S.; Omiya, Y.; Takano, T.; Mizuguchi, D.; Sonota, N.; Toda, H.; Saito, T.; So, M.; et al. Detection of major depressive disorder based on a combination of voice features: An exploratory approach. Int. J. Environ. Res. Public Health 2022, 19, 11397.
17. Tasnim, M.; Stroulia, E. Detecting depression from voice. In Proceedings of the Advances in Artificial Intelligence: Proceedings of Canadian AI 2019, Kingston, ON, Canada, 28–31 May 2019.
18. Dumpala, S.H.; Rodriguez, S.; Rempel, S.; Uher, R.; Oore, S. Significance of speaker embeddings and temporal context for depression detection. arXiv 2021, arXiv:2107.13969.
19. Scherer, S.; Stratou, G.; Gratch, J.; Morency, L.P. Investigating voice quality as a speaker-independent indicator of depression and PTSD. In Proceedings of the Interspeech 2013, Lyon, France, 25–29 August 2013.
20. Wang, J.; Ravi, V.; Flint, J.; Alwan, A. Unsupervised instance discriminative learning for depression detection from speech signals. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022.
21. Ishimaru, M.; Okada, Y.; Uchiyama, R.; Horiguchi, R.; Toyoshima, I. Classification of depression and its severity based on multiple audio features using a graphical convolutional neural network. Int. J. Environ. Res. Public Health 2023, 20, 1588.
22. Kim, A.Y.; Jang, E.H.; Lee, S.H.; Choi, K.Y.; Park, J.G.; Shin, H.C. Automatic depression detection using smartphone-based text-dependent speech signals: Deep convolutional neural network approach. J. Med. Internet Res. 2023, 25, e34474.
23. Wang, J.; Ravi, V.; Flint, J.; Alwan, A. Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification. Speech Commun. 2024, 163, 103049.
24. Menne, F.; Dörr, F.; Schräder, J.; Tröger, J.; Habel, U.; König, A.; Wagels, L. The voice of depression: Speech features as biomarkers for major depressive disorder. BMC Psychiatry 2024, 24, 794.
25. Mundt, J.C.; Vogel, A.P.; Feltner, D.E.; Lenderking, W.R. Vocal acoustic biomarkers of depression severity and treatment response. Biol. Psychiatry 2012, 72, 580–587.
26. Trifu, R.N.; Nemeș, B.; Herta, D.C.; Bodea-Hategan, C.; Talaș, D.A.; Coman, H. Linguistic markers for major depressive disorder: A cross-sectional study using an automated procedure. Front. Psychol. 2024, 15, 1355734.
27. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
28. Sumali, B.; Mitsukura, Y.; Liang, K.C.; Yoshimura, M.; Kitazawa, M.; Takamiya, A.; Fujita, T.; Mimura, M.; Kishimoto, T. Speech quality feature analysis for classification of depression and dementia patients. Sensors 2020, 20, 3599.
29. Nasir, M.; Jati, A.; Shivakumar, P.G.; Chakravarthula, S.N.; Georgiou, P. Multimodal and multiresolution depression detection from speech and facial landmark features. In Proceedings of the AVEC 2016, Amsterdam, The Netherlands, 16 October 2016.
30. Rejaibi, E.; Komaty, A.; Meriaudeau, F.; Agrebi, S.; Othmani, A. MFCC-based recurrent neural network for automatic clinical depression recognition and assessment from speech. Biomed. Signal Process. Control 2022, 71, 103149.
31. Zhang, X.; Zhang, X.; Chen, W.; Li, C.; Yu, C. Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments. Sci. Rep. 2024, 14, 9543.
32. Wei, Y.; Qin, S.; Liu, F.; Liu, R.; Zhou, Y.; Chen, Y.; Xiong, X.; Zheng, W.; Ji, G.; Meng, Y.; et al. Acoustic-based machine learning approaches for depression detection in Chinese university students. Front. Public Health 2025, 13, 1561332.
33. Wang, Y.; Liang, L.; Zhang, Z.; Xu, X.; Liu, R.; Fang, H.; Zhang, R.; Wei, Y.; Liu, Z.; Zhu, R.; et al. Fast and accurate assessment of depression based on voice acoustic features: A cross-sectional and longitudinal study. Front. Psychiatry 2023, 14, 1195276.
34. Dumpala, S.H.; Dikaios, K.; Rodriguez, S.; Langley, R.; Rempel, S.; Uher, R.; Oore, S. Manifestation of depression in speech overlaps with characteristics used to represent and recognize speaker identity. Sci. Rep. 2023, 13, 11155.
35. Miao, X.; Li, Y.; Liu, Y.; Julian, I.N.; Guo, H. Fusing features of speech for depression classification based on higher-order spectral analysis. Speech Commun. 2022, 143, 46–56.
36. Lim, E.; Jhon, M.; Kim, J.W.; Kim, S.H.; Kim, S.; Yang, H.J. A lightweight approach based on cross-modality for depression detection. Comput. Biol. Med. 2025, 186, 109618.
37. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The distress analysis interview corpus of human and computer interviews. In Proceedings of the LREC 2014, Reykjavik, Iceland, 26–31 May 2014.
38. Kroenke, K.; Strine, T.W.; Spitzer, R.L.; Williams, J.B.W.; Berry, J.T.; Mokdad, A.H. The PHQ-8 as a measure of current depression in the general population. J. Affect. Disord. 2009, 114, 163–173.
39. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202.
40. Toyoshima, I.; Okada, Y.; Ishimaru, M.; Uchiyama, R.; Tada, M. Multi-input speech emotion recognition model using mel spectrogram and GeMAPS. Sensors 2023, 23, 1743.
41. Atmaja, B.T.; Akagi, M. Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning. APSIPA Trans. Signal Inf. Process. 2020, 9, e17.
42. Jordan, E.; Terrisse, R.; Lucarini, V.; Alrahabi, M.; Krebs, M.-O.; Desclés, J.; Lemey, C. Speech emotion recognition in mental health: Systematic review of voice-based applications. JMIR Ment. Health 2025, 12, e74260.
43. Wang, J.; Zhang, L.; Liu, T.; Pan, W.; Hu, B.; Zhu, T. Acoustic differences between healthy and depressed people: A cross-situation study. BMC Psychiatry 2019, 19, 300.
44. Schneider, K.; Leinweber, K.; Jamalabadi, H.; Teutenberg, L.; Brosch, K.; Pfarr, J.K.; Thomas-Odenthal, F.; Usemann, P.; Wroblewski, A.; Straube, B.; et al. Syntactic complexity and diversity of spontaneous speech production in schizophrenia spectrum and major depressive disorders. Schizophrenia 2023, 9, 63.
45. Mundt, J.C.; Snyder, P.J.; Cannizzaro, M.S.; Chappie, K.; Geralts, D.S. Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology. J. Neurolinguist. 2007, 20, 50–64.
46. Deurzen, P.A.M.; Buitelaar, J.K.; Brunnekreef, J.A.; Ormel, J.; Minderaa, R.B.; Hartman, C.A.; Huizink, A.C.; Speckens, A.E.M.; Oldehinkel, A.J.; Slaats-Willemse, D.I.E. Response time variability and response inhibition predict affective problems in adolescent girls, not in boys: The TRAILS study. Eur. Child Adolesc. Psychiatry 2012, 21, 277–287.
47. Mauch, M.; Dixon, S. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In Proceedings of the ICASSP 2014, Florence, Italy, 4–9 May 2014.
48. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46, 389–422.
49. Bennabi, D.; Vandel, P.; Papaxanthis, C.; Pozzo, T.; Haffen, E. Psychomotor retardation in depression: A systematic review of diagnostic, pathophysiologic, and therapeutic implications. Biomed. Res. Int. 2013, 2013, 158746.
50. Yamamoto, M.; Takamiya, A.; Sawada, K.; Yoshimura, M.; Kitazawa, M.; Liang, K.-C.; Fujita, T.; Mimura, M.; Kishimoto, T. Using speech recognition technology to investigate the association between timing-related speech features and depression severity. PLoS ONE 2020, 15, e0238726.
51. Cummins, N.; Vlasenko, B.; Sagha, H.; Schuller, B. Enhancing speech-based depression detection through gender dependent vowel-level formant features. In Proceedings of the Artificial Intelligence in Medicine (AIME 2017), Vienna, Austria, 30 May 2017.
52. Xu, C.; Chen, Y.; Tao, Y.; Xie, W.; Liu, X.; Lin, Y.; Liang, C.; Du, F.; Lin, Z.; Shi, C. Deep learning-based detection of depression by fusing auditory, visual and textual clues. J. Affect. Disord. 2025, 391, 119860.
53. Nurfidausi, A.F.; Mancini, E.; Torroni, P. TRI-DEP: A trimodal comparative study for depression detection using speech, text, and EEG. arXiv 2025, arXiv:2510.14922.
54. Qiu, J.; Wu, Q.; Ding, G.; Xu, Y.; Feng, S. A survey of machine learning for big data processing. EURASIP J. Adv. Signal Process. 2016, 2016, 67.
55. Esposito, A.; Esposito, A.M.; Likforman-Sulem, L.; Maldonato, N.M.; Vinciarelli, A. On the significance of speech pauses in depressive disorders: Results on read and spontaneous narratives. In Recent Advances in Nonlinear Speech Processing; Springer: Cham, Switzerland, 2016; pp. 73–82.
Figure 1. Temporal intervals in speech signals.
Figure 2. Box-and-whisker plots of classification performance obtained using a nonlinear SVM under different acoustic feature combinations. The orange horizontal lines represent the median values, and the blue circles represent the mean values.
Figure 3. Ranking of the top 20 acoustic features based on their discriminative contribution in the ALL condition. Error bars represent mean ± 1 standard deviation (n = 2500). Abbreviations: Stdev, standard deviation; Norm. Stdev, normalized standard deviation; V, voiced segments; UV, unvoiced segments.
Table 1. Items of the PHQ-8 questionnaire.

| Over the Last 2 Weeks, How Often Have You Been Bothered by Any of the Following Problems? | Not at All | Several Days | More Than Half the Days | Nearly Every Day |
|---|---|---|---|---|
| 1. Little interest or pleasure in doing things | 0 | 1 | 2 | 3 |
| 2. Feeling down, depressed, or hopeless | 0 | 1 | 2 | 3 |
| 3. Trouble falling or staying asleep or sleeping too much | 0 | 1 | 2 | 3 |
| 4. Feeling tired or having little energy | 0 | 1 | 2 | 3 |
| 5. Poor appetite or overeating | 0 | 1 | 2 | 3 |
| 6. Feeling bad about yourself or that you are a failure or have let yourself or your family down | 0 | 1 | 2 | 3 |
| 7. Trouble concentrating on things such as reading the newspaper or watching television | 0 | 1 | 2 | 3 |
| 8. Moving or speaking so slowly that other people could have noticed? Or the opposite, being so fidgety or restless that you have been moving around a lot more than usual | 0 | 1 | 2 | 3 |
Table 2. Depression severity categories based on PHQ-8 scores.

| PHQ-8 Score | Severity Level | Assessment |
|---|---|---|
| 0–4 | Asymptomatic | Nondepressed |
| 5–9 | Mild | Nondepressed |
| 10–14 | Moderate | Depressed |
| 15–19 | Moderate to severe | Depressed |
| 20–24 | Severe | Depressed |
Table 3. Speech-based low-level descriptors (LLDs). The terminological representations in this table, including linear (lin), logarithmic (dB), and A-weighted (dBA) magnitudes, follow the original definitions in openSMILE [14].

| Feature Group | Description |
|---|---|
| Waveform | Zero crossings, extremes, and DC offset |
| Signal energy | Root mean square energy and logarithmic energy |
| Loudness | Intensity and approximate loudness |
| FFT spectrum | Phase and magnitude (lin, dB, and dBA scales) |
| ACF, cepstrum | Autocorrelation and cepstrum |
| Mel/Bark spectrum | Bands 0–N_mel |
| Semitone spectrum | FFT-based and filter-based |
| Cepstral | Cepstral features, e.g., MFCC and PLP-CC |
| Pitch | F0 via ACF and SHS methods; probability of voicing |
| Voice quality | HNR, jitter, and shimmer |
| LPC | LPC coefficients, reflection coefficients, residual signal, and line spectral pairs (LSPs) |
| Auditory | Auditory spectra and PLP coefficients |
| Formants | Center frequencies and bandwidths |
| Spectral | Energy in N user-defined bands, multiple roll-off points, centroid, entropy, flux, and relative position of maxima/minima |
| Tonal | CHROMA, CENS, and CHROMA-based features |
Table 4. Statistics, regression, and transformation operations.

| Category | Description |
|---|---|
| Extremes | Extreme values, positions of extrema, and ranges |
| Means | Arithmetic mean, quadratic mean, and geometric mean |
| Moments | Standard deviation, variance, kurtosis, and skewness |
| Percentiles | Percentiles and percentile ranges |
| Regression | Linear and quadratic approximation coefficients, regression error, and centroid |
| Peaks | Number of peaks, mean peak distance, and mean peak amplitude |
| Segments | Number of segments based on delta thresholding and mean segment length |
| Sample values | Contour values at configurable relative positions |
| Times/durations | Up- and down-level times, rise and fall times, and duration |
| Onsets | Number of onsets and relative position of first and last on-/offset |
| DCT | Coefficients of the discrete cosine transform (DCT) |
| Zero crossings | Zero-crossing rate and mean-crossing rate |
Table 5. Description of proposed acoustic features.

| Acoustic Feature Set (Dimensionality) | Acoustic Feature | Description |
|---|---|---|
| UIFS (19) | UL | Total duration, mean, median, mode, count, skewness, kurtosis, and standard deviation |
| | UN | Total count of utterances |
| | UR | Ratio of the total utterance interval to the total duration of utterance and nonutterance intervals |
| PIFS (19) | PL | Total duration, mean, median, mode, count, skewness, kurtosis, and standard deviation |
| | PN | Total count of pauses |
| | PR | Ratio of the total pause interval to the total duration of utterance and nonutterance intervals |
| RIFS (19) | RL | Total duration, mean, median, mode, count, skewness, kurtosis, and standard deviation |
| | RN | Total count of responses |
| | RR | Ratio of the total response interval to the total duration of utterance and nonutterance intervals |
| SD (1) | — | Occurrence frequency of speech components per unit time |
Table 6. Experimental conditions. ✓ and — indicate the inclusion and exclusion of each acoustic feature set in the experimental conditions, respectively.

| Experimental Condition | OS | VL | UIFS | PIFS | RIFS | SD |
|---|---|---|---|---|---|---|
| OS (Baseline 1) | ✓ | — | — | — | — | — |
| OS + VL (Baseline 2) | ✓ | ✓ | — | — | — | — |
| OS + VL + UIFS | ✓ | ✓ | ✓ | — | — | — |
| OS + VL + PIFS | ✓ | ✓ | — | ✓ | — | — |
| OS + VL + RIFS | ✓ | ✓ | — | — | ✓ | — |
| OS + VL + SD | ✓ | ✓ | — | — | — | ✓ |
| ALL | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 7. Performance comparison of linear and nonlinear SVMs for each feature set condition.

| Experimental Condition | Model | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|
| OS | Linear | 0.54 [0.46–0.57] | 0.51 [0.45–0.56] | 0.51 [0.45–0.57] | 0.55 [0.45–0.65] |
| | Nonlinear | 0.54 [0.46–0.57] | 0.51 [0.46–0.56] | 0.51 [0.45–0.58] | 0.55 [0.45–0.65] |
| OS + VL | Linear | 0.54 [0.46–0.59] | 0.52 [0.46–0.58] | 0.52 [0.46–0.60] | 0.53 [0.47–0.63] |
| | Nonlinear | 0.52 [0.48–0.59] | 0.53 [0.47–0.59] | 0.54 [0.47–0.60] | 0.53 [0.47–0.63] |
| OS + VL + UIFS | Linear | 0.58 [0.50–0.65] | 0.56 [0.50–0.62] | 0.57 [0.50–0.64] | 0.58 [0.53–0.68] |
| | Nonlinear | 0.58 [0.50–0.62] | 0.55 [0.50–0.61] | 0.56 [0.50–0.63] | 0.58 [0.47–0.68] |
| OS + VL + PIFS | Linear | 0.58 [0.51–0.64] | 0.56 [0.50–0.61] | 0.57 [0.50–0.63] | 0.58 [0.53–0.68] |
| | Nonlinear | 0.58 [0.50–0.63] | 0.56 [0.50–0.62] | 0.57 [0.50–0.64] | 0.58 [0.47–0.68] |
| OS + VL + RIFS | Linear | 0.54 [0.50–0.62] | 0.54 [0.48–0.60] | 0.54 [0.48–0.61] | 0.58 [0.47–0.67] |
| | Nonlinear | 0.54 [0.50–0.62] | 0.55 [0.49–0.60] | 0.55 [0.49–0.62] | 0.58 [0.47–0.63] |
| OS + VL + SD | Linear | 0.54 [0.46–0.59] | 0.52 [0.46–0.58] | 0.52 [0.45–0.58] | 0.53 [0.47–0.63] |
| | Nonlinear | 0.54 [0.46–0.62] | 0.53 [0.46–0.58] | 0.53 [0.46–0.60] | 0.53 [0.47–0.63] |
| ALL | Linear | 0.58 [0.50–0.65] | 0.56 [0.49–0.62] | 0.57 [0.49–0.65] | 0.58 [0.53–0.68] |
| | Nonlinear | 0.58 [0.50–0.65] | 0.56 [0.50–0.62] | 0.58 [0.50–0.65] | 0.58 [0.47–0.68] |

The higher median score between the two models is highlighted in background color.
Table 8. Ablation test results obtained by evaluating each proposed feature set in isolation using the nonlinear SVM.

| Feature Set | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|
| UIFS | 0.52 [0.48–0.60] | 0.52 [0.46–0.60] | 0.52 [0.47–0.60] | 0.55 [0.46–0.70] |
| PIFS | 0.50 [0.43–0.57] | 0.50 [0.43–0.57] | 0.50 [0.44–0.57] | 0.55 [0.40–0.64] |
| RIFS | 0.55 [0.45–0.60] | 0.55 [0.45–0.61] | 0.55 [0.45–0.60] | 0.55 [0.40–0.70] |
| SD | 0.50 [0.43–0.55] | 0.49 [0.38–0.60] | 0.50 [0.45–0.55] | 0.18 [0.09–0.30] |
Table 9. Comparison of classification performance with conditions proposed in previous studies that employ a single acoustic feature set.

| Author | Method | Feature | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| S. Dumpala et al. [34] | DNN | COVAREP | 0.56 | — | — | — |
| | CNN | COVAREP | 0.61 | — | — | — |
| | LSTM | COVAREP | 0.60 | — | — | — |
| | DNN | openSMILE | 0.59 | — | — | — |
| | CNN | openSMILE | 0.63 | — | — | — |
| | LSTM | openSMILE | 0.64 | — | — | — |
| E. Lim et al. [36] | CNN | Mel spectrogram | — | 0.44 | 0.78 | — |
| | Multimodal fusion (cross-modality model) | Mel spectrogram | — | 0.59 | 0.60 | — |
| | Multimodal fusion (cross-modality model) | HuBERT | — | 0.49 | 0.70 | — |
| Ours | Nonlinear SVM | ALL | 0.58 | 0.56 | 0.58 | 0.58 |
Table 10. Comparison of classification performance with those proposed in previous studies that employ multiple combined acoustic feature sets.

| Author | Method | Feature | Accuracy | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| S. Dumpala et al. [34] | DNN | openSMILE + Speaker Embeddings | 0.65 | — | — | — |
| | CNN | openSMILE + Speaker Embeddings | 0.72 | — | — | — |
| | LSTM | openSMILE + Speaker Embeddings | 0.74 | — | — | — |
| E. Lim et al. [36] | Multimodal fusion (cross-modality model) | Mel spectrogram + Text | — | 0.67 | 0.70 | — |
| | Multimodal fusion (cross-modality model) | HuBERT + Text | — | 0.66 | 0.70 | — |
| Ours | Nonlinear SVM | ALL | 0.58 | 0.56 | 0.58 | 0.58 |