Systematic Review

Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study

by
Ivan Danylenko
1 and
Olgierd Unold
2,*
1
Independent Researcher, 50-370 Wroclaw, Poland
2
Faculty of Information and Communication Technology, Wroclaw University of Science and Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 422; https://doi.org/10.3390/app16010422
Submission received: 5 December 2025 / Revised: 26 December 2025 / Accepted: 29 December 2025 / Published: 30 December 2025

Abstract

The DAIC-WOZ dataset is a widely used benchmark for the task of depression severity estimation from multimodal behavioral data. Yet the reliability, reproducibility, and methodological rigor of published machine learning models remain uncertain. In this systematic review, we examined all works published through September 2025 that mention the DAIC-WOZ dataset and report mean absolute error as an evaluation metric. Our search identified 536 papers, of which 414 remained after deduplication. Following title and abstract screening, 132 records were selected for full-text review. After applying eligibility criteria, 66 papers were included in the quality assessment stage. Of these, only five met minimal reproducibility standards (such as clear data partitioning, model description, and training protocol documentation) and were included in this review. We found that published models suffer from poor documentation and methodology, and, inter alia, identified subject leakage as a critical methodological flaw. To illustrate its impact, we conducted experiments on the DAIC-WOZ dataset, comparing the performance of the model trained with and without subject leakage. Our results indicate that leakage produces significant overestimation of the validation performance; however, our evidence is limited to the audio, text, and combined modalities of the DAIC-WOZ dataset. Without leakage, the model consistently performed worse than a simple mean predictor. Aside from poor methodological rigor, we found that the predictive accuracy of the included models is poor: reported MAEs on DAIC-WOZ are of the same magnitude as the dataset’s own PHQ-8 variability, and are comparable to or larger than the variability typically observed in general population samples. We conclude with specific recommendations aimed at improving the methodology, reproducibility, and documentation of manuscripts. Code for our experiments is publicly available.

1. Introduction

Globally, more than 300 million people live with depression, making it the leading cause of disability according to the World Health Organization [1]. The traditional assessment of depression primarily relies on clinical interviews and questionnaires, tools heavily dependent on the expertise and subjective judgment of trained psychologists.
Currently, there is no widely accepted standardized method for the automated assessment of the severity of depression within the psychological community. Clinical evaluations are based on nuanced expert knowledge, making the process time-consuming and potentially variable between practitioners. General practitioners find distinguishing depression from other physical and psychological disorders to be a difficult task requiring considerable skill. Furthermore, at any given consultation, about half of the patients with depression are not recognized by general practitioners, with 20% remaining unrecognized even after 6 months [2]. As Stringaris [3] emphasizes, depression might not be a single disease entity, but rather a collection of related but distinct syndromes. The main symptoms of major depressive disorder defined in DSM-5 [4] include anhedonia (loss of interest or pleasure), depressed mood, sleep disturbances, fatigue, changes in appetite, feelings of worthlessness, concentration problems, and psychomotor agitation or retardation.
Given these challenges, depression assessment has received considerable research interest. The potential of machine learning in clinical depression diagnosis has been illustrated by Yi et al. [5], who combined electroencephalography and functional near-infrared spectroscopy features with support vector machines, achieving a diagnostic accuracy of 92.7%. Current advances in machine learning enable the extraction of complex, nonlinear features from raw multimodal data, such as speech, text, and facial cues. Numerous works explore whether expert psychological knowledge can be distilled through machine learning models to the point where it allows for an efficient assessment of the severity of depression from clinical interviews. As a consequence, over recent years, researchers have published many studies claiming the development of highly efficient machine learning systems. We show that most of these studies either do not meet minimal reproducibility criteria, report overestimated performance due to methodological flaws, or report results that lack statistical validity and clinical utility.
While prior surveys have broadly examined machine learning approaches for depression detection across various modalities and highlighted challenges in automated depression detection [6,7,8], our analysis focuses on depression severity estimation studies that use the Distress Analysis Interview Corpus–Wizard-of-Oz (DAIC-WOZ) dataset [9]—the most widely adopted benchmark for depression severity estimation—and evaluates their methodological rigor. Thus, our findings are grounded exclusively in the context of this particular dataset, with potential applicability to its extended version (E-DAIC [10]). In this systematic review, we address the following questions related to the use of machine learning in depression severity estimation using the DAIC-WOZ dataset:
RQ1: 
How reliable, reproducible, and methodologically rigorous are published machine learning models?
RQ2: 
What are the most common methodological flaws present in current machine learning research?
RQ3: 
Are there specific practices that systematically lead to the overestimation or misrepresentation of machine learning models’ performance?
RQ4: 
How does subject leakage between training and evaluation partitions affect the measured performance of machine learning models?
RQ5: 
What recommendations can be made to improve the methodological rigor, reproducibility, and documentation standards for future machine learning research?
We incorporate a structured quality screening stage to ensure that only works with sufficient documentation and reproducibility standards are examined in detail. Our analysis goes further by identifying systematic methodological flaws, inter alia, subject leakage, that undermine the reliability of reported results. To make the issue concrete, we illustrate the impact of subject leakage through an empirical case study. We conclude with detailed recommendations in three domains: methodological rigor, study reproducibility, and documentation standards. Code for our experiments is publicly available at the following link: https://github.com/Kowd-PauUh/ml-in-depression-estimation, accessed on 26 December 2025.

2. Methods

This section focuses on methods adopted for the systematic screening of the current literature to answer RQ1, RQ2, and RQ3. Empirical experiments that address RQ4 are described in Section 4. Our study follows a systematic review protocol, aligned with the PRISMA [11] guidelines, to critically analyze prior work on machine learning for depression severity estimation using the DAIC-WOZ dataset. We restrict the scope of this review to studies that utilized the DAIC-WOZ dataset, as the dataset has been widely adopted as a gold-standard benchmark for the task of depression severity estimation. Empirical experiments focus on the audio, text, and combined modalities of the DAIC-WOZ, as the video modality is available only through pre-extracted features rather than raw footage. Consequently, our analysis does not cover other datasets or the video modality of the DAIC-WOZ. A description of the DAIC-WOZ is provided in Section 4.1.
Search strategy. We searched the following electronic sources: IEEE Xplore, arXiv, ScienceDirect, SpringerLink, ACM Digital Library, Scopus, and Google Scholar. The query terms were “DAIC” and “WOZ” and “MAE”, applied consistently to title, abstract, and full text across all databases, using each platform’s appropriate Boolean syntax. No restrictions were made on publication year. The search covered all available records until September 2025.  
We specifically included the criterion “MAE” because virtually all studies on depression severity estimation report mean absolute error (MAE) as their primary evaluation metric, often in combination with root mean squared error (RMSE). Although RMSE and MAE are mathematically related, they are not directly interchangeable, and converting RMSE to MAE would require unverified assumptions about error distributions. To ensure comparability and avoid introducing assumptions, we did not include the “RMSE” term in our query, which may have excluded studies reporting only RMSE.
Study selection. The study selection procedure consisted of three stages: (i) Screening of titles and abstracts to exclude clearly irrelevant records; (ii) Full-text assessment to verify compliance with the eligibility criteria; (iii) Quality screening to verify study reproducibility. At each stage, all records were screened and assessed manually. No automation tools were used in the entire process.
Eligibility criteria. Studies were eligible for quality screening stage if they met the following requirements:
  • The study reported experimental results on the DAIC-WOZ dataset.
  • The evaluation metrics for regression of depression severity scores included MAE.
  • The study presented scientific novelty; studies that solely repurposed or applied existing methods without introducing methodological innovation were excluded.
  • The study was published in English.
To ensure methodological rigor and reproducibility, we required studies to explicitly report the following methodological details (adapted from a subset of the CLAIM [12] guidelines):
  • Clear description of data preprocessing steps;
  • Explicit reporting of the data partitioning scheme (proportions, level at which partitions are disjoint);
  • Detailed model description, including architecture, inputs, outputs, and intermediate layers;
  • Specification of training details (data augmentation, hyperparameters, number of models trained);
  • Method for selecting the final model.
Although CLAIM was originally proposed as a checklist for the use of artificial intelligence in medical imaging, the criteria we employ in our study focus on general reproducibility principles. These principles are directly applicable to the task of depression severity estimation from human-subject data. We acknowledge that these criteria are more stringent than common reporting practices in the field; however, their adoption was essential for the consistent evaluation of methodological rigor and the identification of methodological issues such as subject leakage. Exclusion during the quality screening stage occurred when missing methodological details prevented independent reconstruction of the experimental pipeline (e.g., ambiguous data partitioning, model training objective, convergence criteria).
Data collection. Data extraction was performed manually from the reports during quality screening stage. We sought data on (i) data preprocessing, (ii) proposed model architecture, (iii) best model selection and evaluation protocols, and (iv) reported results (MAE and other metrics, if available). To assess reproducibility and reliability, we further examined whether code was made available and experiments were repeated. In addition, we searched for the presence of any pitfalls in the methodology, such as subject leakage.

3. Results

This section focuses on findings of our systematic review that are further discussed in Section 5 with respect to RQ1, RQ2, and RQ3.
Initial search. Our initial search identified 536 records. After removing 122 duplicates, 414 papers remained for title and abstract screening. At this stage, 280 studies were excluded as their titles and/or abstracts were deemed irrelevant to our research scope. The remaining 134 records were sought for retrieval, but 2 full texts could not be accessed, leaving 132 papers for full-text screening.
Study selection. During the full-text review, 66 papers were excluded for not meeting eligibility criteria. The reasons for exclusion were as follows:
  • Not addressing the DAIC-WOZ dataset ( n = 47 );
  • Not reporting MAE as an evaluation metric ( n = 2 );
  • Not introducing a new model or repurposing an existing model ( n = 11 );
  • Not written in English ( n = 6 ).
The remaining 66 papers proceeded to quality assessment. From this point onward, all percentages are calculated with respect to the 66 papers that entered the quality assessment stage. Of these, 61 were excluded for insufficient documentation of one or more of the reproducibility aspects listed in Section 2.
Ultimately, five papers met all inclusion criteria and were included in this systematic review. Of these, three studies utilize the textual modality of the dataset and employ the deep learning paradigm [74,75,76], while the other two adopt classical machine learning utilizing the audio and video modalities [77,78]. Figure 1 presents the inclusion and exclusion of papers at each study selection stage.
Reproducibility failures. Among the 66 papers assessed for quality, 92.4% ( n = 61 ) did not comply with at least one criterion; 71.2% ( n = 47 ) did not comply with at least two criteria; 42.4% ( n = 28 ) did not comply with at least three criteria; 16.7% ( n = 11 ) did not comply with at least four criteria; and 9.1% ( n = 6 ) complied with none of the criteria. Additionally, we checked the code availability (not mandatory for inclusion) of the published studies and found that only 15.2% ( n = 10 ) have publicly available code [20,23,24,35,42,46,50,69,72,74].
Reporting issues. In total, 12.1% ( n = 8 ) of the studies reported results for an unknown data partition [19,27,29,32,35,37,39,42]. Additionally, 3.0% ( n = 2 ) of the studies inconsistently reported results within the article [30,40].
Methodological flaws. Among papers that underwent quality screening, we identified five methodological flaws:
  • In total, 81.8% ( n = 54 ) of the studies do not use repeated experiments and instead rely on a single experimental run [13,14,15,16,17,18,20,21,22,23,25,26,27,28,29,30,31,32,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,58,59,60,61,62,64,65,67,69,70,71,73,76,77,78]. This is a critical flaw, as many computational methods in machine learning involve stochastic elements such as random initialization, data shuffling, or sampling. Without running experiments multiple times and reporting measures of variability (e.g., mean and standard deviation (SD) over different seeds or cross-validation folds), the reported performance lacks statistical validity.
  • The vast majority of the studies report error measures such as MAE and RMSE, while completely ignoring goodness-of-fit metrics such as the coefficient of determination (R²). Although useful, MAE and RMSE have no upper bound and do not provide information about the performance of the regression with respect to the distribution of the ground truth values. In contrast, R² quantifies the proportion of the variance in the dependent variable that the model predicts from the independent variables [79]. Without R² or a similar measure of explained variance, it is difficult to assess whether low prediction errors reflect genuine variance explained by the model or are a consequence of the target variable’s scale. (R² has its own limitations. Due to its definition (see Equation (2)), direct comparisons of R² across datasets with differing variance in the dependent variable are problematic. R² may also be less informative when residuals or the dependent variable are non-normally distributed. Nevertheless, it provides a useful measure of the proportion of variance explained by the model within a given dataset.) Yet only 6.1% ( n = 4 ) of the studies additionally report R² [17,26,61,63].
  • Only 30.3% ( n = 20 ) of the studies report performance on a held-out test set [28,31,41,44,45,49,52,53,54,59,63,64,68,69,71,72,73,74,76,77], while the rest evaluate models solely on the validation set. Relying on the validation set alone can lead to overly optimistic results, as this set is often used for model selection (early stopping on validation loss, best validation performance, etc.) or hyperparameter tuning, which constitutes a form of inadvertent overfitting.
  • It has been shown for DAIC-WOZ that incorporating interviewer turns into the feature set leads to inflated performance estimates. Burdisso et al. [80] demonstrated that models trained solely on interviewer prompts achieve equal or higher performance than those trained on participant responses, indicating that interviewer turns in the DAIC-WOZ dataset act as discriminative shortcuts rather than clinically meaningful signals. Although the available evidence is limited to the textual modality, a similar effect may plausibly occur in the audio modality, since the transcripts are extracted directly from the audio recordings. This may potentially constitute a methodological pitfall in 42.4% ( n = 28 ) of the studies, as they incorporate interviewer utterances as input data [18,19,24,32,33,36,37,39,40,41,47,48,50,51,52,53,54,55,58,60,63,68,71,72,73,74,75,76] (due to underreported data preprocessing procedures in some of the papers, data on the use of interviewer turns are missing for 25.8% ( n = 17 ) of the studies).
  • At least 9.1% ( n = 6 ) of the studies exhibit subject leakage [32,34,55,56,60,70], while for an additional 24.2% ( n = 16 ) the presence of leakage could not be confirmed due to insufficient documentation of data preprocessing procedures. Subject leakage occurs when data partitions are disjoint at a level more granular than the interview (e.g., at the level of 30 s audio clips, individual participant turns, etc.), and random splits are applied after such preprocessing. This results in the same participant contributing data to both training and evaluation partitions, allowing models to exploit subject-specific cues rather than learning generalizable patterns (the mechanism is illustrated in the sketch after this list).
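To make the leakage mechanism concrete, the following minimal sketch contrasts a random utterance-level split with a participant-level grouped split. The data here are synthetic and purely illustrative (not DAIC-WOZ); the printed counts show that only the former places the same participants in both partitions.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 50 participants with 20 utterances each.
participant_ids = np.repeat(np.arange(50), 20)
utterance_idx = np.arange(len(participant_ids))

# Leaky split: utterances are shuffled regardless of which participant produced them.
train_idx, val_idx = train_test_split(utterance_idx, test_size=0.2, random_state=0)
overlap = set(participant_ids[train_idx]) & set(participant_ids[val_idx])
print(f"participants shared between partitions (utterance-level split): {len(overlap)}")

# Leakage-free split: all utterances of a participant end up in the same partition.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(gss.split(utterance_idx, groups=participant_ids))
overlap = set(participant_ids[train_idx]) & set(participant_ids[val_idx])
print(f"participants shared between partitions (participant-level split): {len(overlap)}")
```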
The last finding motivated the formulation of RQ4. In Section 4, we empirically demonstrate that subject leakage significantly inflates validation performance.
Papers included. A total of five papers were included in this systematic review. Although all included papers used the interview as a whole to predict the Patient Health Questionnaire (PHQ-8) score, data preprocessing and modeling techniques varied significantly across studies.
Data partitioning. All papers included in this systematic review employed the standard DAIC-WOZ split with 107 interviews in the training set, 35 in the validation set, and 47 in the test set, as proposed by the dataset authors, with all partitions being disjoint at the interview level. Most importantly, because the partitions were disjoint at the interview level in each study, none of the included studies exhibits subject leakage.
Deep learning papers. Each of the three studies focused on the textual modality, yet they differed substantially in how interview transcripts were represented and modeled. Milintsevich et al. [74] performed turn-wise feature extraction from interviews using a RoBERTa-based encoder and aggregated the resulting embeddings with a single-layer BiLSTM with an attention mechanism. Hong et al. [75] constructed a text graph and employed a Graph Neural Network (GNN) with a message passing mechanism to obtain interview-level representations. Kang et al. [76] employed a Large Language Model to generate a synopsis and sentiment analysis from an interview, and a BERT model was utilized to predict the PHQ-8 score based on the generated data.
Classical machine learning papers. The two classical machine learning studies relied on handcrafted features combined with conventional regression techniques. Syed et al. [77] used 68-point 3D facial landmarks for the video data and segmented the audio files to extract 73 COVAREP [81] features, which were combined with Fisher Vector encoding and Partial Least Squares Regression (PLSR). Rathi et al. [78] utilized Histogram of Oriented Gradients features and applied various regression techniques, with linear regression yielding the best results.
Best model selection. With respect to the final model selection, the papers fall into two groups: those that select the best performing model based on the metrics achieved on the validation set [74,77], and those that train for a fixed number of epochs with early stopping based on validation performance [75,76].
Methodological rigor. Among the included studies, methodological rigor remains limited. Only one paper has publicly available source code [74], which constrains reproducibility. All three text-based studies incorporate interviewer prompts [74,75,76], which has been shown to artificially inflate performance estimates on the text modality of the DAIC-WOZ dataset [80]. Only two studies conduct repeated experiments [74,75], which limits the reliability of the reported results. Furthermore, only three of the studies report results on the held-out test set [74,76,77], and none incorporate additional goodness-of-fit measures such as R². All studies report MAE; three studies additionally report RMSE [76,77,78]. Milintsevich et al. [74] also report a macro-averaged version of the MAE, and Syed et al. [77] also report Pearson correlation coefficients. None of these metrics quantifies the variance explained by a model: MAE and RMSE measure the prediction error, while the Pearson correlation coefficient captures the strength of the linear association between predicted and true values.
Achieved results. Table 1 summarizes the reported validation and test MAEs. On the validation set, MAE ranged from 3.76 [75] to 5.51 [74], and on the held-out test set, from 3.66 [76] to 5.42 [77]. However, the lowest MAE values came exclusively from studies that incorporated interviewer prompts. Our model, trained without subject leakage or interviewer utterances on the audio modality of the DAIC-WOZ dataset, yielded an MAE of 4.978 ± 0.571 on the validation set and 5.001 ± 0.101 on the test set—similar to the results obtained by three of the included studies [74,77,78]. However, the coefficient of determination for our model is negative (val. R² = −0.248 ± 0.295, test R² = −0.058 ± 0.063), which indicates that the model explains less variance in the PHQ-8 scores than a simple mean predictor. This example shows that reporting MAE alone provides limited insight into a model’s performance.
Furthermore, because the PHQ-8 score spans 0–24 points, an MAE of 5 corresponds to more than 20% of the full scale. Within DAIC-WOZ itself, such MAE is already comparable to the dataset’s own variability (train. SD  = 5.46 , val. SD  = 6.59 , test SD  = 6.47 ), indicating that the model does not achieve predictive accuracy substantially better than the natural spread of scores in the dataset sample. Clinical datasets such as DAIC-WOZ are expected to have higher PHQ-8 variability than general population samples. To contextualize the results, we provide a comparison to the typical spread of PHQ-8 scores in population-based samples. In population-based samples, PHQ-8 variability is considerably smaller: Riazy et al. [82] show that in a population study of 287,530 participants from 29 European countries, the 50th percentile PHQ-8 score is 1.8, the 75th percentile is 4.52, and the 95th percentile reaches 10.85—far below the maximum possible score of 24. In combination with mean scores ranging from 1.2 to 4.0 and standard deviations typically falling between 2.7 and 4.5 [82], the MAEs of the included studies currently match or exceed the typical spread of PHQ-8 scores in the general population. However, this comparison should be interpreted with caution given the higher PHQ-8 variability in DAIC-WOZ.

4. Empirical Demonstration of Subject Leakage Pitfall

During our systematic review, we identified subject leakage as a critical methodological flaw in at least 9.1% ( n = 6 ) of the studies. In this section, we present an empirical study of the impact of subject leakage on measured model performance, addressing RQ4. We first introduce the dataset used in our experiments, followed by a description of our data preprocessing pipeline. We then describe the architecture designed for this demonstration and outline our training and evaluation protocols. Finally, we report and analyze the results of the experiments. The implications of these results are further discussed in Section 5.

4.1. Dataset

DAIC-WOZ is a central resource in the development of automated methods for detecting psychological distress, including major depressive disorder [83]. It is part of the larger DAIC corpus and was specifically designed to support multimodal research in clinical interview contexts. The dataset comprises semi-structured interviews conducted in four formats: (i) traditional face-to-face interviews with human interviewers, (ii) teleconferencing with human interviewers, (iii) “Wizard of Oz” (WOZ) interviews in which an animated virtual agent named Ellie is operated by a hidden human interviewer, and (iv) fully automated interviews in which Ellie functions autonomously without human input. DAIC-WOZ includes rich multimodal data collected during the interviews, including audio, audio transcripts, features extracted from video, and depth recordings from Microsoft Kinect sensors. A subset of sessions also incorporates physiological data such as electrocardiogram, galvanic skin response, and respiration signals. Additionally, the dataset contains manual and automatic annotations of verbal and nonverbal behavior, including speech prosody, facial expressions, gaze, gesture patterns, and dialogue acts. Participants, drawn from both the general population and US military veterans, completed a series of validated psychological assessments including the PHQ-8, the PTSD Checklist, PANAS, and the State-Trait Anxiety Inventory [9]. The DAIC-WOZ dataset comprises 189 interviews, while its extended version E-DAIC [10] consists of 275 interviews.

4.2. Data Preprocessing

Audio transcripts of the DAIC-WOZ dataset were analyzed to compute two metrics per utterance: (i) speech duration and (ii) word count, obtained via whitespace tokenization of the transcribed text. To ensure sufficient acoustic and lexical content per instance, we retained only utterances satisfying the following thresholds: duration between 10 and 30 s and word count above 10. These thresholds were empirically selected based on dataset inspection to exclude both brief interjections (e.g., “Uh”, “Umm”) and unusually long, multi-topic utterances. After applying these filters and removing the interviewer’s speech, approximately 10.8 h of usable participant speech remained from a total of ≈42 h of recordings (≈26%). In experiments that involved the audio modality, all audio files were resampled to a 16 kHz sample rate. Additionally, during training, we applied a random gain adjustment within the ±6 dB range (uniform distribution) to the utterances. In experiments that involved the text modality, audio transcripts were used without additional preprocessing.
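A minimal sketch of this filtering and augmentation step is given below. It assumes a DAIC-WOZ-style transcript table with start_time, stop_time, speaker, and value columns and a per-session WAV file; the column names, speaker label, and function names are illustrative assumptions rather than details taken from our pipeline.

```python
import pandas as pd
import torch
import torchaudio

def filter_utterances(transcript_path: str) -> pd.DataFrame:
    """Keep participant utterances lasting 10-30 s with more than 10 words."""
    df = pd.read_csv(transcript_path, sep="\t")
    duration = df["stop_time"] - df["start_time"]
    word_count = df["value"].astype(str).str.split().str.len()  # whitespace tokenization
    keep = (
        (df["speaker"].str.lower() != "ellie")  # drop interviewer turns (label assumed)
        & duration.between(10.0, 30.0)
        & (word_count > 10)
    )
    return df[keep].reset_index(drop=True)

def load_utterance(wav_path: str, start_s: float, stop_s: float,
                   training: bool = False) -> torch.Tensor:
    """Cut one utterance, resample to 16 kHz, and optionally apply a +/- 6 dB random gain."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = waveform[:, int(start_s * sr):int(stop_s * sr)]
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
    if training:
        gain_db = torch.empty(1).uniform_(-6.0, 6.0).item()  # uniform gain in dB
        waveform = waveform * (10.0 ** (gain_db / 20.0))
    return waveform
```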
To prevent subject leakage into the held-out test partition, we split the dataset at the participant level. Participants were divided into training and test groups using stratified sampling on the binary PHQ-8 label (PHQ-8 ≥ 10), with 20% reserved for testing. Post-split, the PHQ-8 score distribution in both subsets closely matches the original. Each recording was annotated at the participant level with a single PHQ-8 score, which was propagated across all the corresponding utterances. The training partition was used for cross-validation, and the test partition served as a held-out set for model evaluation after training was finished.
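The participant-level split can be sketched as follows; the per-participant table with participant_id and phq8 columns is an assumed input format, not the dataset's native layout.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_participants(participants: pd.DataFrame, seed: int = 0):
    """Stratified participant-level split on the binary PHQ-8 label (PHQ-8 >= 10)."""
    binary_label = (participants["phq8"] >= 10).astype(int)
    train_ids, test_ids = train_test_split(
        participants["participant_id"],
        test_size=0.20,            # 20% of participants reserved for the held-out test set
        stratify=binary_label,     # keep the depressed/non-depressed ratio comparable
        random_state=seed,
    )
    # Every utterance later inherits the partition of its participant,
    # so no subject contributes data to both training and test sets.
    return set(train_ids), set(test_ids)
```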

4.3. Model Architecture

To demonstrate the impact of subject leakage into the validation split, we train models whose architecture depends on the data modality, as visualized in Figure 2. We deliberately do not include the video modality in our experiments, because in the DAIC-WOZ dataset it is not represented by raw video but only through pre-extracted facial and visual features.
Audio modality. The model is composed of (i) a pretrained CNN backbone, (ii) a single transformer encoder layer (torch.nn.TransformerEncoderLayer) initialized with d_model = 512, nhead = 2, dim_feedforward = 2048, and dropout = 0.1, and (iii) a linear layer with a single output neuron for regression, applied after averaging the output of the transformer. We initialize the CNN backbone with a torchvision ResNet18 pretrained on the ImageNet [84] dataset and replace its last fully connected layer with a randomly initialized one with 512 neurons. All model components are trainable, including the CNN backbone.
Each speech segment is transformed into a MEL spectrogram using torchaudio.compliance.kaldi.fbank with a 25 ms Hanning window and a 10 ms frame shift. This yields a time–frequency matrix of shape F × T, where F = 224 is the number of MEL bins. The spectrogram is partitioned into N = ⌊T/F⌋ non-overlapping square fragments {X_i}, i = 1, …, N, each of size F × F. The choice of square fragments ensures compatibility with the ResNet18 architecture. When T is not an exact multiple of F, the remainder is discarded, effectively dropping the last incomplete chunk. Each fragment X_i ∈ ℝ^{F×F} is duplicated across 3 channels (without ImageNet normalization applied) and passed independently through the shared-weight CNN backbone. In the next step, the vector representations of all fragments are fed through the transformer encoder layer. The outputs of the transformer are averaged to obtain a single vector representation for the input audio. Finally, a single regression unit is applied over this representation to model the PHQ-8 score.
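A condensed sketch of the audio branch is shown below. The hyperparameter values mirror the description above, while the class name and the toy usage example are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

class AudioPHQRegressor(nn.Module):
    def __init__(self, n_mels: int = 224):
        super().__init__()
        self.n_mels = n_mels
        backbone = torchvision.models.resnet18(
            weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Linear(backbone.fc.in_features, 512)   # replace final FC layer
        self.backbone = backbone                                 # fully trainable
        self.encoder = nn.TransformerEncoderLayer(
            d_model=512, nhead=2, dim_feedforward=2048, dropout=0.1, batch_first=True)
        self.head = nn.Linear(512, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # MEL spectrogram with a 25 ms Hanning window and 10 ms frame shift -> (T, F)
        spec = torchaudio.compliance.kaldi.fbank(
            waveform, num_mel_bins=self.n_mels, frame_length=25.0,
            frame_shift=10.0, window_type="hanning", sample_frequency=16_000)
        n_frames = (spec.shape[0] // self.n_mels) * self.n_mels  # drop incomplete chunk
        fragments = spec[:n_frames].reshape(-1, self.n_mels, self.n_mels)
        fragments = fragments.unsqueeze(1).repeat(1, 3, 1, 1)    # duplicate to 3 channels,
        embeddings = self.backbone(fragments)                    # no ImageNet normalization
        encoded = self.encoder(embeddings.unsqueeze(0))          # (1, N, 512)
        return self.head(encoded.mean(dim=1)).squeeze(-1)        # predicted PHQ-8 score

# Toy usage: model = AudioPHQRegressor(); score = model(torch.randn(1, 16_000 * 12))
```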
The intuition behind this design is grounded in the ability of convolutional neural networks to capture localized patterns in two-dimensional images. Additionally, the attention mechanism allows for modeling sequences where relevant information is unevenly distributed and may occur sparsely over time. In the case of MEL spectrograms, the two-dimensional patterns correspond to joint time–frequency structures that encode acoustic features.
Although ImageNet pretraining is based on natural images, it has been shown that transfer learning from natural images contributes to improvements in audio-related tasks [85]. We note that domain-specific pretraining on large audio datasets could potentially improve performance; however, our primary goal is to investigate the effects of subject leakage rather than to optimize predictive performance.
Text modality. The model is composed of (i) a pretrained RoBERTa [86] model (checkpoint: https://huggingface.co/FacebookAI/roberta-large, accessed on 13 October 2025) and (ii) a linear layer with a single output neuron for regression. All model components are trainable, including all RoBERTa layers.
Each transcript of a speech segment is truncated to a length of 512 tokens during both training and evaluation. RoBERTa produces a 1024-dimensional feature vector for each input token, and we use the [CLS] token features as the representation of the transcript. A single regression unit is applied over this representation to model the PHQ-8 score.
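A minimal sketch of the text branch, using the Hugging Face transformers library, is shown below; the checkpoint name comes from the text above, while the class name is illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextPHQRegressor(nn.Module):
    def __init__(self, checkpoint: str = "FacebookAI/roberta-large"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        self.encoder = AutoModel.from_pretrained(checkpoint)        # all layers trainable
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)   # 1024 -> 1

    def forward(self, transcripts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(transcripts, truncation=True, max_length=512,
                               padding=True, return_tensors="pt")
        hidden = self.encoder(**batch).last_hidden_state            # (B, L, 1024)
        cls = hidden[:, 0, :]                                        # [CLS]/<s> token features
        return self.head(cls).squeeze(-1)                            # predicted PHQ-8 score
```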
Combined modality. The model combines architectural components of both the audio and text models and is composed of (i) an audio model without the last linear layer, (ii) a text model without the last linear layer, and (iii) a linear layer with a single output neuron for regression. All components are trainable.
For each input (audio–transcript pair), audio and text representations are concatenated to form a single vector with 512 + 1024 = 1536 features, as shown in Figure 2. These features are then passed to the regression unit to predict the PHQ-8 score.
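The fusion step can be sketched as follows. The two encoders are assumed to be the audio and text models above with their final linear layers removed, i.e., modules returning 512- and 1024-dimensional embeddings, respectively.

```python
import torch
import torch.nn as nn

class CombinedPHQRegressor(nn.Module):
    def __init__(self, audio_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.audio_encoder = audio_encoder   # waveform   -> (B, 512) embedding
        self.text_encoder = text_encoder     # transcript -> (B, 1024) embedding
        self.head = nn.Linear(512 + 1024, 1) # single regression unit on 1536 features

    def forward(self, waveform, transcript) -> torch.Tensor:
        fused = torch.cat([self.audio_encoder(waveform),
                           self.text_encoder(transcript)], dim=-1)  # (B, 1536)
        return self.head(fused).squeeze(-1)                          # predicted PHQ-8 score
```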

4.4. Training Procedure

Each model was trained under the same conditions regardless of the data modality. Training was conducted on NVIDIA A100 Tensor Core GPU (80 GB) hardware using the AdamW optimizer with default parameters: β₁ = 0.9, β₂ = 0.999, ϵ = 1 × 10⁻⁸, and weight decay set to 0.01. The learning rate was set to 1 × 10⁻⁵ and was constant. The maximum number of epochs was set to 10, and early stopping was activated when the validation loss did not decrease by at least 10⁻³ for a single epoch (patience = 1). We deliberately chose a conservative early stopping and learning rate strategy to mitigate the risk of overfitting, given the relatively small size of the DAIC-WOZ dataset.
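A condensed sketch of this training loop is given below; the minibatch size of 16 and the MSE objective are described in the following paragraph, the early-stopping logic reflects one plausible reading of the patience setting, and the model and dataloaders are assumed to be defined elsewhere.

```python
import torch

def train(model, train_loader, val_loader, max_epochs: int = 10, device: str = "cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, betas=(0.9, 0.999),
                                  eps=1e-8, weight_decay=0.01)   # constant learning rate
    criterion = torch.nn.MSELoss()                               # optimization criterion
    best_val = float("inf")
    for epoch in range(max_epochs):
        model.train()
        for inputs, targets in train_loader:                     # minibatch size 16
            optimizer.zero_grad()
            loss = criterion(model(inputs.to(device)), targets.float().to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.float().to(device)).item()
                           for x, y in val_loader) / len(val_loader)
        if best_val - val_loss < 1e-3:   # validation loss did not improve by >= 1e-3
            break                        # early stopping (patience = 1 epoch)
        best_val = val_loss
    return model                         # final checkpoint kept as the "best" model
```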
From each run, we selected the final checkpoint as the best. Training and evaluation minibatch sizes were set to 16. The training objective was to minimize the mean squared error (MSE) between predicted and ground truth PHQ-8 scores. While MSE served as the optimization criterion, model performance was evaluated using two metrics: mean absolute error and the coefficient of determination. Let y_i denote the ground truth PHQ-8 score for the i-th sample, ŷ_i the corresponding model prediction, ȳ the mean of all true scores, and N the number of samples. The metrics are then defined as follows:
Mean absolute error measures the average magnitude of the absolute prediction errors:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \quad (1)$$
where |ŷ_i − y_i| is the absolute error for the i-th prediction.
Coefficient of determination quantifies the proportion of variance in the ground truth scores explained by the model:
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} = 1 - \frac{\mathrm{MSE}}{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \quad (2)$$
where the numerator is the residual sum of squares and the denominator is the total sum of squares, or equivalently, the numerator is the mean squared error and the denominator is the variance of the true scores.
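Both metrics follow directly from Equations (1) and (2); the short sketch below implements them with NumPy and evaluates the mean-predictor baseline used later for context. The PHQ-8 scores here are synthetic and purely illustrative, not DAIC-WOZ data.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))               # Equation (1)

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    residual_ss = np.sum((y_pred - y_true) ** 2)                 # residual sum of squares
    total_ss = np.sum((y_true - y_true.mean()) ** 2)             # total sum of squares
    return float(1.0 - residual_ss / total_ss)                   # Equation (2)

rng = np.random.default_rng(0)
y_train = rng.integers(0, 25, size=100).astype(float)            # synthetic PHQ-8 scores
y_test = rng.integers(0, 25, size=40).astype(float)
mean_pred = np.full_like(y_test, y_train.mean())                 # mean-predictor baseline
print(mae(y_test, mean_pred), r2(y_test, mean_pred))             # R^2 near zero or negative
```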
Each experiment was executed under five independent 5-fold cross-validation schemes, with folds partitioned at (a) the utterance level to simulate leakage in the experiment with subject leakage, and (b) the participant level to avoid leakage in the experiment without subject leakage. This resulted in 25 runs per experiment. Random seeds were fixed for data loading and initialization to ensure reproducibility. To assess statistical significance, we employed parametric and non-parametric tests, depending on whether the normality assumption of the paired differences was met. Specifically, we used a two-sided paired t-test on the per-run scores (25 paired results per experiment, corresponding to identical seeds), under the assumption that the paired differences are normally distributed. When this normality assumption was violated, we instead applied the Wilcoxon signed-rank test, a non-parametric alternative that does not rely on normality. We verified the normality assumption on the paired differences using the Shapiro–Wilk test and report that normality was violated in the validation R² comparison for all modalities (audio modality: p = 5.87 × 10⁻⁵; text modality: p = 2.45 × 10⁻⁸; combined modality: p = 4.76 × 10⁻⁶). To account for multiple comparisons, we applied Holm–Bonferroni correction across all significance tests. We report metric mean ± standard deviation, Cohen’s d effect sizes, p-values, and adjusted p-values. Additionally, for each comparison, we report the mean difference Δ and its 95% bootstrap confidence interval, computed by resampling the paired differences with 10,000 bootstrap iterations.
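A compact sketch of this testing pipeline, applied to a single metric, could look as follows. The per-run scores used here are synthetic placeholders, the Holm correction would in practice be applied jointly across all comparisons, and the paired-samples Cohen's d shown is one common definition rather than necessarily the exact formula we used.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
with_leakage = rng.normal(3.6, 0.3, size=25)       # e.g., per-run validation MAE
without_leakage = rng.normal(5.0, 0.6, size=25)    # 25 paired runs (identical seeds)
diff = with_leakage - without_leakage

# Choose the paired test based on normality of the paired differences (Shapiro-Wilk).
if stats.shapiro(diff).pvalue >= 0.05:
    p_value = stats.ttest_rel(with_leakage, without_leakage).pvalue  # two-sided paired t-test
else:
    p_value = stats.wilcoxon(with_leakage, without_leakage).pvalue   # non-parametric fallback

adjusted = multipletests([p_value], method="holm")[1][0]             # Holm-Bonferroni

cohens_d = diff.mean() / diff.std(ddof=1)                            # paired-samples d
boot = rng.choice(diff, size=(10_000, diff.size), replace=True).mean(axis=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])                   # 95% bootstrap CI
print(f"p={p_value:.2e}, adj. p={adjusted:.2e}, d={cohens_d:.2f}, "
      f"delta={diff.mean():.2f} [{ci_low:.2f}, {ci_high:.2f}]")
```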
To contextualize the results, we deliberately report the performance of the weakest possible baseline, a mean predictor, which always outputs the average PHQ-8 score computed over the entire training set. This choice directly aligns with the definition of R² (see Equation (2)), which quantifies the performance relative to the mean of the dependent variable. While more sophisticated baselines such as linear regression or support vector machines could certainly be considered, the rationale is to establish a minimal reference point for model performance evaluation. As our results show, complex models evaluated without subject leakage consistently yield negative R² values, indicating performance worse than the mean predictor. Therefore, comparison against a stricter baseline would not alter the conclusion that these models fail to predict the PHQ-8 score from the multimodal data.

4.5. Impact of Subject Leakage on Model Performance

The evaluation results of the described model trained with and without subject leakage on different data modalities are presented in Table 2, which reports the achieved R² and MAE metrics. Table 3 reports statistics on the significance of the observed differences and effect sizes for each comparison. In the following, we analyze these results per modality. Additional data on the learning dynamics of the model under each experimental condition are presented in Appendix A.
Baseline. A simple mean predictor, trained on the entire training set, achieves R² = 0 and MAE = 4.773 on this set, and R² = −0.100 and MAE = 5.149 on the test set.
Audio modality. On the validation set, subject leakage produced a significant overestimation of model performance: R² increased from −0.248 ± 0.295 to 0.256 ± 0.083 when the leakage was present (adj. p = 5.36 × 10⁻⁷, Cohen’s d = 1.83). Correspondingly, MAE decreased from 4.978 ± 0.571 to 3.595 ± 0.269 (adj. p = 2.98 × 10⁻⁹, Cohen’s d = 2.07)—a substantial gain in predictive accuracy. Crucially, this inflated performance does not translate to the held-out test set, which contains participants unseen during both training and validation. Test R² degrades from −0.058 ± 0.063 to −0.250 ± 0.108 when leakage was present (adj. p = 5.07 × 10⁻⁷, Cohen’s d = 1.56). Similarly, MAE on the held-out set increased from 5.001 ± 0.101 to 5.373 ± 0.178 (adj. p = 2.28 × 10⁻⁸, Cohen’s d = 1.86). These results demonstrate that on the audio modality, subject leakage artificially boosts validation performance while degrading generalization to truly unseen participants.
Text modality. On the validation set, a modest yet significant overestimation in R² was observed: validation R² increased from −0.236 ± 0.525 to −0.045 ± 0.095 when leakage was present (adj. p = 6.16 × 10⁻³, Cohen’s d = 0.41). Validation MAE slightly decreased from 4.758 ± 0.610 to 4.510 ± 0.244, although this difference was not statistically significant. On the held-out test set, no differences were observed. Thus, on the text modality, subject leakage has a comparatively smaller effect than on the audio modality. Although generalization to the held-out test data was not affected, the leakage still caused validation performance overestimation in terms of the R² metric.
Combined modality. Similarly to the audio modality, we observed significant validation performance inflation: R² increased from −0.308 ± 0.513 to −0.123 ± 0.124 when leakage was present (adj. p = 2.38 × 10⁻⁶, Cohen’s d = 0.80) and MAE decreased from 4.901 ± 0.611 to 4.058 ± 0.357 (adj. p = 1.19 × 10⁻⁵, Cohen’s d = 1.26). Held-out test performance degradation was also observed: R² decreased from −0.147 ± 0.120 to −0.224 ± 0.144 when leakage was present, although after p-value correction, the differences were not statistically significant.

5. Discussion

In this work, we systematically evaluated the state of machine learning for depression severity estimation using the DAIC-WOZ dataset. A total of 66 studies were examined through the lens of methodological rigor and reproducibility, of which 5 were included in this systematic review. In the following, we discuss our findings with respect to the research questions formulated in the introduction of this work.
RQ1. Our findings reveal that, despite the growing body of literature, the methodological rigor and reproducibility of the existing studies are limited: many reported results are either inflated (30.3% ( n = 20 ) rely solely on the validation set, 42.4% ( n = 28 ) use interviewer turns, 9.1% ( n = 6 ) exhibit subject leakage) or lack statistical validity (81.8% ( n = 54 ) do not repeat experiments). Studies with available code are scarce (15.2%, n = 10 ). Furthermore, 92.4% ( n = 61 ) of studies suffer from poor documentation, i.e., do not report basic elements such as clear data partitioning, training protocols, or best model selection criteria. The stringent inclusion criteria adopted in this review inevitably resulted in a substantial reduction in the number of eligible studies. These criteria were deliberately chosen to reflect the minimum information required for independent verification and replication of the experimental pipeline. The most common reason for exclusion was a failure to document the model training protocol, observed in 71.2% ( n = 47 ) of studies. As a minimum, studies adopting the deep learning paradigm were expected to report training hyperparameters such as the optimizer used, the number of epochs, and the learning rate and its strategy. In-context learning studies were expected to disclose prompts.
RQ2. Our systematic review identified a recurring set of methodological shortcomings in the surveyed literature. Specifically, we identified three major flaws, each related to the model evaluation protocol: (i) lack of repeated experiments, (ii) lack of evaluation on a held-out dataset, and (iii) absence of goodness-of-fit measures. Each of these issues compromises the reliability of the reported findings:
i.
Reporting results from a single experimental run is prone to misrepresenting the actual performance due to the stochasticity in machine learning.
ii.
The lack of tests on held-out data causes inadvertent overfitting to validation data.
iii.
The most prevalent choice for the evaluation metric in the literature related to depression severity estimation is MAE. This is often accompanied by RMSE and by MSE used as an optimization criterion. While MSE is a good choice as a cost function for model training, it and its rooted variant RMSE do not quantify errors in an easily interpretable way. MAE directly translates to the scale of actual labels and therefore is easier to interpret. Nevertheless, all three metrics share the common limitations of not having an upper bound and not explaining how well the model fits the data. We strongly encourage researchers to incorporate R² into their evaluation procedures. R² in the [0, 1] interval directly indicates the fraction of variance explained by the model, while a negative value means that the regression fits worse than the mean predictor. Furthermore, R² is monotonically related to MSE (Equation (2)). This means that the ordering of regression models based on the coefficient of determination will be identical to the ordering based on MSE or RMSE [79]. Our model, trained without subject leakage on the audio modality of DAIC-WOZ, achieved MAE (val.: 4.978 ± 0.571, test: 5.001 ± 0.101) comparable to the models included in this review. Relying on MAE alone would misleadingly suggest success, whereas a negative R² clearly indicates that the model fails to generalize to unseen data.
RQ3. Our work identified two major flaws related to specific data preprocessing decisions that systematically lead to an overestimation of the model’s performance: (i) use of interviewer turns and (ii) subject leakage. These flaws differ in severity from the aforementioned issues with evaluation protocols. They introduce shortcuts that models may exploit, producing artificially high scores without learning actual patterns in data:
i.
For the textual modality of the DAIC-WOZ, a model trained on interviewer prompts yields better performance than one trained on participant responses [80]. This flaw is specific to the DAIC-WOZ text modality, although similar effects may plausibly occur in other modalities of the dataset. We therefore recommend that authors explicitly state whether interviewer turns are included in the model inputs and justify their use. The best results included in our review (val. MAE of 3.76 [75] and test MAE of 3.66 [76]) were reported by studies that used the text modality and incorporated interviewer prompts.
ii.
In our study, we examined the impact of subject leakage into the validation set and showed that it results in substantial performance overestimation on the audio, text, and combined modalities of the DAIC-WOZ dataset. This effect cannot be diagnosed without evaluation on a held-out set with unseen participants. In the following paragraph, we discuss the results of our empirical experiments.
RQ4. Our experiments demonstrate that subject leakage consistently inflates validation performance across audio, text, and combined modalities of the DAIC-WOZ dataset. Additionally, on the audio modality, subject leakage reduces the model’s generalizability to the unseen data. As we demonstrate in Appendix A, Figure A1, Figure A2, Figure A3, Figure A4, Figure A5 and Figure A6, leakage directly affects the models’ training dynamics and results in overfitting to the validation data. Due to the overfitting to subject-specific cues that occurs under leakage conditions, the magnitude of performance overestimation is expected to depend on the model architecture. Thus, high-capacity models designed to capture fine-grained patterns in data may be more susceptible to leakage effects. We also note that leakage effects may depend on specific data preprocessing procedures (e.g., data cleaning, normalization, etc.).
To isolate and empirically demonstrate the leakage effect, we deliberately adopted utterance-level depression severity modeling in our experiments. Propagating a single interview-level PHQ-8 score to all constituent utterances implicitly assumes that each utterance contributes equally to the depression severity score. However, if depression-related information is sparsely distributed throughout the interview, many utterances become weakly or incorrectly labeled. Given the strong, well-established pretrained backbones used in our architecture, the weak performance of our models without leakage suggests that depression severity is likely more appropriately modeled at the interview rather than at a more granular level.
Without leakage, none of our models achieved results better than the mean predictor. This effectively means that leakage constitutes a critical methodological flaw—an experimental setup that fails to yield a positive R² is misleadingly perceived as successful once leakage is introduced. Importantly, underreporting of data preprocessing prevents estimation of the actual percentage of studies exhibiting this issue, which may be substantially higher than the confirmed 9.1%. At the field level, failure to rigorously prevent and report subject leakage risks undermining the validity and clinical utility of mental health estimation models.
RQ5. To address the issues identified in the surveyed literature, we provide in Table 4 a structured list of recommendations aimed at improving the methodological rigor, reproducibility, and documentation standards of future research on depression severity estimation. The list can be treated as a checklist of essential requirements and best-practice recommendations for authors, as well as for reviewers and editors. We recommend adopting the checklist as the standard for publication in this domain. Additionally, our code provides a ready-to-use framework for developing machine learning models inside a Docker container with on-premise MLflow integration.

6. Limitations

In this study, we focused specifically on studies related to the DAIC-WOZ dataset. Although this is the most commonly used benchmark for the training and evaluation of depression severity estimators, other datasets exist in this domain. A non-exhaustive list of datasets for the task of depression severity estimation that were not considered in this systematic review includes E-DAIC [10], the audio–visual depressive language corpus (AViD-Corpus) [87], and DEPAC [88].
The empirical experiments conducted to demonstrate the impact of subject leakage were limited to the audio, text, and combined audio–text modalities of the DAIC-WOZ dataset. Consequently, the magnitude of leakage effects may differ for models based on the video modality of the dataset (as well as its combinations with other modalities) and other datasets. We also expect that the impact of subject leakage is likely to increase with model complexity: deeper architectures capable of capturing more complex dependencies in the data will be more prone to overfitting to subject-specific patterns when leakage is present.
Although our systematic review is restricted to depression severity estimation studies, the methodological pitfalls it identifies are likely to be relevant to other estimation tasks in the mental health domain (e.g., anxiety or PTSD severity estimation); however, the prevalence of these issues and the magnitude of their effects are expected to vary across datasets and task formulations.

7. Conclusions

To our knowledge, this is the first systematic review of machine learning approaches applied to the DAIC-WOZ dataset for the task of depression severity estimation. Despite substantial research effort to develop machine learning models in this domain, we found that only a small number of papers (7.6%, n = 5 ) in the surveyed literature represent sufficiently documented and reproducible manuscripts. The majority of the works do not follow best practices for the development of machine learning models and exhibit issues with methodological rigor. Beyond this, the predictive accuracy of the models included in this review remains limited, with errors comparable to the inherent variability of the PHQ-8 scores. In this work, we also conducted an empirical demonstration of the impact of subject leakage—a critical methodological issue present in at least 9.1% ( n = 6 ) of the surveyed studies—on measured model performance. We showed that for a model achieving negative R² under no-leakage conditions, significant performance overestimation and misrepresentation occur when leakage is present.
The findings of this review are limited in scope to the DAIC-WOZ dataset. Similarly, the empirical evidence provided in our work is limited to the audio, text, and combined modalities of the discussed dataset. Nevertheless, these insights likely extend beyond the DAIC-WOZ dataset and offer guidance for other mental health estimation tasks as well. As promising as machine learning methods appear, in their current form—and as evaluated on DAIC-WOZ—they are far from clinical translation for depression severity estimation. Rigorous methodology, standardized evaluation protocols, detailed documentation, and open code are required to advance reliable research progress.

Author Contributions

I.D.: Conceptualization; Methodology; Data Curation; Software; Investigation; Formal Analysis; Validation; Visualization; Writing—Original Draft; Writing—Review and Editing. O.U.: Conceptualization; Writing—Review and Editing; Supervision; Funding Acquisition. All authors reviewed and approved the final manuscript and agree to be accountable for their contributions. All authors have read and agreed to the published version of the manuscript.

Funding

Open access funding provided by Wroclaw University of Science and Technology.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in our experiments is available upon request at https://dcapswoz.ict.usc.edu/ (accessed on 22 September 2025). All data generated in this study and code for our experiments are accessible at the following link: https://github.com/Kowd-PauUh/ml-in-depression-estimation (accessed on 22 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Supplementary Data

Learning dynamics. Figure A1 and Figure A2 illustrate the model’s learning dynamics under each experimental condition on the audio modality. In the presence of subject leakage, validation metrics gradually improve, reflecting information leakage from shared participants across splits. This also causes training to last longer, as the optimization criterion (MSE) improves as well, which leads to overfitting and overestimation of the model’s performance. Without leakage, the validation R² remains negative, revealing that the model fails to generalize. These observations remain consistent on the combined audio–text modality, as shown in Figure A5 and Figure A6. The effect also persists on the text modality, albeit less noticeably, as shown in Figure A3 and Figure A4.
Figure A1. Learning curves of the model trained on audio modality: (a) with subject leakage into validation split, (b) without subject leakage. Solid lines show average training and validation R 2 across 25 cross-validation runs; shaded areas denote ± 1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.
Figure A2. Learning curves of the model trained on audio modality: (a) with subject leakage into validation split, (b) without subject leakage. Solid lines show average training and validation MAE across 25 cross-validation runs; shaded areas denote ± 1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.
Figure A3. Learning curves of the model trained on text modality: (a) with subject leakage into validation split, (b) without subject leakage. Solid lines show average training and validation R 2 across 25 cross-validation runs; shaded areas denote ± 1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.
Figure A4. Learning curves of the model trained on text modality: (a) with subject leakage into validation split, (b) without subject leakage. Solid lines show average training and validation MAE across 25 cross-validation runs; shaded areas denote ± 1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.
Figure A5. Learning curves of the model trained on the combined modality: (a) with subject leakage into the validation split, (b) without subject leakage. Solid lines show average training and validation R² across 25 cross-validation runs; shaded areas denote ±1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.
Figure A6. Learning curves of the model trained on the combined modality: (a) with subject leakage into the validation split, (b) without subject leakage. Solid lines show average training and validation MAE across 25 cross-validation runs; shaded areas denote ±1 standard deviation. Each point is accompanied by the number of samples contributing to the average, i.e., the count of active runs at each epoch.

References

  1. World Health Organization. Depression: Let’s Talk Says WHO, as Depression Tops List of Causes of Ill Health. 2017. Available online: https://www.who.int/news/item/30-03-2017–depression-let-s-talk-says-who-as-depression-tops-list-of-causes-of-ill-health (accessed on 26 September 2025).
  2. Paykel, E.S.; Priest, R.G. Recognition and management of depression in general practice: Consensus statement. BMJ 1992, 305, 1198–1202. [Google Scholar] [CrossRef] [PubMed]
  3. Stringaris, A. Editorial: What is depression? J. Child Psychol. Psychiatry 2017, 58, 1287–1289. [Google Scholar] [CrossRef] [PubMed]
  4. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Publishing: Arlington, VA, USA, 2013. [Google Scholar] [CrossRef]
  5. Yi, L.; Xie, G.; Li, Z.; Li, X.; Zhang, Y.; Wu, K.; Shao, G.; Lv, B.; Jing, H.; Zhang, C.; et al. Automatic depression diagnosis through hybrid EEG and near-infrared spectroscopy features using support vector machine. Front. Neurosci. 2023, 17, 1205931. [Google Scholar] [CrossRef] [PubMed]
  6. Li, Y.; Kumbale, S.; Chen, Y.; Surana, T.; Chng, E.S.; Guan, C. Automated Depression Detection from Text and Audio: A Systematic Review. IEEE J. Biomed. Health Inform. 2025, 29, 1–17. [Google Scholar] [CrossRef]
  7. Li, Q.; Liu, X.; Hu, X.; Rahman Ahad, M.A.; Ren, M.; Yao, L.; Huang, Y. Machine Learning-Based Prediction of Depressive Disorders via Various Data Modalities: A Survey. IEEE/CAA J. Autom. Sin. 2025, 12, 1320–1349. [Google Scholar] [CrossRef]
  8. Mao, K.; Wu, Y.; Chen, J. A systematic review on automated clinical depression diagnosis. npj Ment. Health Res. 2023, 2, 20. [Google Scholar] [CrossRef]
  9. Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128. [Google Scholar]
  10. Ringeval, F.; Schuller, B.; Valstar, M.; Cummins, N.; Cowie, R.; Tavabi, L.; Schmitt, M.; Alisamir, S.; Amiriparian, S.; Messner, E.M.; et al. AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, New York, NY, USA, 21 October 2019; pp. 3–12. [Google Scholar] [CrossRef]
  11. Tugwell, P.; Tovey, D. PRISMA 2020. J. Clin. Epidemiol. 2021, 134, A5–A6. [Google Scholar] [CrossRef]
  12. Mongan, J.; Moy, L.; Kahn, C.E., Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol. Artif. Intell. 2020, 2, e200029. [Google Scholar] [CrossRef]
  13. Niu, M.; Li, M.; Fu, C. PointTransform Networks for automatic depression level prediction via facial keypoints. Knowl.-Based Syst. 2024, 297, 111951. [Google Scholar] [CrossRef]
  14. Fang, M.; Peng, S.; Liang, Y.; Hung, C.C.; Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 2023, 82, 104561. [Google Scholar] [CrossRef]
  15. Atta, A.; El Sayad, D.; Ezzat, D.; Amin, S.; El Gamal, M. Speech-Based Depression Detection System Optimized Using Particle Swarm Optimization. In Proceedings of the 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 19–21 October 2024; pp. 250–253. [Google Scholar] [CrossRef]
  16. Shu, T.; Zhang, F.; Sun, X. Gaze Behavior based Depression Severity Estimation. In Proceedings of the 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, 4–6 August 2023; pp. 313–319. [Google Scholar] [CrossRef]
  17. Firoz, N.; Beresteneva, O.G.; Aksyonov, S.V. Enhancing Depression Detection: Employing Autoencoders and Linguistic Feature Analysis with BERT and LSTM Model. In Proceedings of the 2023 International Russian Automation Conference (RusAutoCon), Sochi, Russia, 10–16 September 2023; pp. 299–304. [Google Scholar] [CrossRef]
  18. Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.T.; Dagli, C.; Quatieri, T.F. Detecting Depression using Vocal, Facial and Semantic Communication Cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 11–18. [Google Scholar] [CrossRef]
  19. Huang, G.; Li, J.; Lu, H.; Guo, M.; Chen, S. Rethinking Inconsistent Context and Imbalanced Regression in Depression Severity Prediction. IEEE Trans. Affect. Comput. 2024, 15, 2154–2168. [Google Scholar] [CrossRef]
  20. Feng, K.; Chaspari, T. Robust and Explainable Depression Identification from Speech Using Vowel-Based Ensemble Learning Approaches. In Proceedings of the 2024 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Houston, TX, USA, 10–13 November 2024; pp. 1–8. [Google Scholar] [CrossRef]
  21. Firoz, N. Detecting Depression from Text: A Gender Based Comparative Approach Using Machine Learning and BERT Embeddings. In Proceedings of the 20th All-Russian Conference of Student Research Incubators, Tomsk, Russia, 10–16 September 2023; pp. 164–166. [Google Scholar]
  22. Iyortsuun, N.K.; Kim, S.H.; Yang, H.J.; Kim, S.W.; Jhon, M. Additive Cross-Modal Attention Network (ACMA) for Depression Detection Based on Audio and Textual Features. IEEE Access 2024, 12, 20479–20489. [Google Scholar] [CrossRef]
  23. Wang, X.; Xu, J.; Sun, X.; Li, M.; Hu, B.; Qian, W.; Guo, D.; Wang, M. Facial Depression Estimation via Multi-Cue Contrastive Learning. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6007–6020. [Google Scholar] [CrossRef]
  24. Dinkel, H.; Wu, M.; Yu, K. Text-based depression detection on sparse data. arXiv 2020, arXiv:1904.05154. [Google Scholar] [CrossRef]
  25. Zhao, Z.; Wang, K. Unaligned Multimodal Sequences for Depression Assessment From Speech. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 3409–3413. [Google Scholar] [CrossRef]
  26. Firoz, N.; Beresteneva, O.G.; Vladimirovich, A.S.; Tahsin, M.S. Enhancing depression detection through advanced text analysis: Integrating BERT, autoencoder, and LSTM models. Res. Sq. Platf. LLC 2023. [Google Scholar] [CrossRef]
  27. Guangyao, S.; Shenghui, Z.; Bochao, Z.; Yubo, A. Multimodal depression detection using a deep feature fusion network. In Proceedings of the Third International Conference on Computer Science and Communication Technology, ICCSCT 2022, Beijing, China, 30–31 July 2022. [Google Scholar] [CrossRef]
  28. Niu, M.; Wang, X.; Gong, J.; Liu, B.; Tao, J.; Schuller, B.W. Depression Scale Dictionary Decomposition Framework for Multimodal Automatic Depression Level Prediction. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6195–6210. [Google Scholar] [CrossRef]
  29. Zheng, W.; Yan, L.; Gou, C.; Wang, F.Y. Graph Attention Model Embedded with Multi-Modal Knowledge for Depression Detection. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME 2020), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
  30. TJ, S.J.; Jacob, I.J.; Mandava, A.K. D-ResNet-PVKELM: Deep neural network and paragraph vector based kernel extreme machine learning model for multimodal depression analysis. Multimed. Tools Appl. 2023, 82, 25973–26004. [Google Scholar] [CrossRef]
  31. Niu, M.; Li, Y.; Tao, J.; Zhou, X.; Schuller, B.W. DepressionMLP: A Multi-Layer Perceptron Architecture for Automatic Depression Level Prediction via Facial Keypoints and Action Units. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8924–8938. [Google Scholar] [CrossRef]
  32. Xu, Z.; Gao, Y.; Wang, F.; Zhang, L.; Zhang, L.; Wang, J.; Shu, J. Depression detection methods based on multimodal fusion of voice and text. Sci. Rep. 2025, 15, 21907. [Google Scholar] [CrossRef]
  33. Hong, J.; Lee, J.; Choi, D.; Jung, J. LEFORMER: Liquid Enhanced Multimodal Learning for Depression Severity Estimation. In Proceedings of the 2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS 2025), Madrid, Spain, 18–20 June 2025; pp. 423–428. [Google Scholar] [CrossRef]
  34. Kumar, P.; Misra, S.; Shao, Z.; Zhu, B.; Raman, B.; Li, X. Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio and Textual Data. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 28 February–4 March 2025; pp. 5305–5315. [Google Scholar] [CrossRef]
  35. Yang, Y.; Zheng, W. Multi-level spatiotemporal graph attention fusion for multimodal depression detection. Biomed. Signal Process. Control 2025, 110, 108123. [Google Scholar] [CrossRef]
  36. Zhao, Z.; Bao, Z.; Zhang, Z.; Cummins, N.; Wang, H.; Schuller, B. Hierarchical Attention Transfer Networks for Depression Assessment from Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 7159–7163. [Google Scholar] [CrossRef]
  37. Gupta, A.K.; Dhamaniya, A.; Gupta, P. RADIANCE: Reliable and interpretable depression detection from speech using transformer. Comput. Biol. Med. 2024, 183, 109325. [Google Scholar] [CrossRef]
  38. Lu, J.; Liu, B.; Lian, Z.; Cai, C.; Tao, J.; Zhao, Z. Prediction of Depression Severity Based on Transformer Encoder and CNN Model. In Proceedings of the 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore, 11–14 December 2022; pp. 339–343. [Google Scholar] [CrossRef]
  39. Chen, Z.; Wang, D.; Lou, L.; Zhang, S.; Zhao, X.; Jiang, S.; Yu, J.; Xiao, J. Text-guided multimodal depression detection via cross-modal feature reconstruction and decomposition. Inf. Fusion 2025, 117, 102861. [Google Scholar] [CrossRef]
  40. Wang, Z.; Chen, L.; Wang, L.; Diao, G. Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model. IEEE Access 2020, 8, 101181–101191. [Google Scholar] [CrossRef]
  41. Tang, J.; Guo, Q.; Sun, W.; Shang, Y. A Layered Multi-Expert Framework for Long-Context Mental Health Assessments. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI), Santa Clara, CA, USA, 5–7 May 2025; pp. 435–440. [Google Scholar] [CrossRef]
  42. Wei, P.C.; Peng, K.; Roitberg, A.; Yang, K.; Zhang, J.; Stiefelhagen, R. Multi-modal Depression Estimation Based on Sub-attentional Fusion. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2023; pp. 623–639. [Google Scholar] [CrossRef]
  43. Nanggala, K.; Elwirehardja, G.N.; Pardamean, B. Depression detection through transformers-based emotion recognition in multivariate time series facial data. Int. J. Artif. Intell. 2025, 14, 1302–1310. [Google Scholar] [CrossRef]
  44. Chen, M.; Xiao, X.; Zhang, B.; Liu, X.; Lu, R. Neural Architecture Searching for Facial Attributes-based Depression Recognition. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 877–884. [Google Scholar] [CrossRef]
  45. Yang, L.; Jiang, D.; He, L.; Pei, E.; Oveneke, M.C.; Sahli, H. Decision Tree Based Depression Classification from Audio Video and Language Information. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 89–96. [Google Scholar] [CrossRef]
  46. Lin, L.; Chen, X.; Shen, Y.; Zhang, L. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Appl. Sci. 2020, 10, 8701. [Google Scholar] [CrossRef]
  47. Rasipuram, S.; Bhat, J.H.; Maitra, A.; Shaw, B.; Saha, S. Multimodal Depression Detection Using Task-oriented Transformer-based Embedding. In Proceedings of the 2022 IEEE Symposium on Computers and Communications (ISCC), Rhodes Island, Greece, 30 June–3 July 2022; pp. 01–04. [Google Scholar] [CrossRef]
  48. Lau, C.; Chan, W.Y.; Zhu, X. Improving Depression Assessment with Multi-Task Learning from Speech and Text Information. In Proceedings of the 55th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–3 November 2021; pp. 449–453. [Google Scholar] [CrossRef]
  49. Rohanian, M.; Hough, J.; Purver, M. Detecting Depression with Word-Level Multimodal Fusion. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1443–1447. [Google Scholar] [CrossRef]
  50. Prakrankamanant, P.; Watanabe, S.; Chuangsuwanich, E. Explainable Depression Detection using Masked Hard Instance Mining. arXiv 2025, arXiv:2505.24609. [Google Scholar] [CrossRef]
  51. Hu, J.; Wang, A.; Xie, Q.; Ma, H.; Li, Z.; Guo, D. Agentmental: An interactive multi-agent framework for explainable and adaptive mental health assessment. arXiv 2025, arXiv:2508.11567. [Google Scholar]
  52. Chen, X.; Shao, Z.; Jiang, Y.; Chen, R.; Wang, Y.; Li, B.; Niu, M.; Chen, H.; Hu, Q.; Wu, J.; et al. TTFNet: Temporal-Frequency Features Fusion Network for Speech Based Automatic Depression Recognition and Assessment. IEEE J. Biomed. Health Inform. 2025, 29, 7536–7548. [Google Scholar] [CrossRef]
  53. Zhang, J.; Guo, Y. Multilevel depression status detection based on fine-grained prompt learning. Pattern Recognit. Lett. 2024, 178, 167–173. [Google Scholar] [CrossRef]
  54. Sun, B.; Zhang, Y.; He, J.; Yu, L.; Xu, Q.; Li, D.; Wang, Z. A Random Forest Regression Method With Selected-Text Feature For Depression Assessment. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 61–68. [Google Scholar] [CrossRef]
  55. Sun, B.; Zhang, Y.; He, J.; Xiao, Y.; Xiao, R. An automatic diagnostic network using skew-robust adversarial discriminative domain adaptation to evaluate the severity of depression. Comput. Methods Programs Biomed. 2019, 173, 185–195. [Google Scholar] [CrossRef]
  56. Zhang, Y.; Hu, W.; Wu, Q. Autoencoder Based on Cepstrum Separation to Detect Depression from Speech. In Proceedings of the 3rd International Conference on Information Technologies and Electrical Engineering, New York, NY, USA, 14–15 September 2021; pp. 508–510. [Google Scholar] [CrossRef]
  57. Zhang, W.; Mao, K.; Chen, J. A Multimodal approach for detection and assessment of depression using text, audio and video. Phenomics 2024, 4, 234–249. [Google Scholar] [CrossRef]
  58. Niu, M.; Chen, K.; Chen, Q.; Yang, L. HCAG: A Hierarchical Context-Aware Graph Attention Model for Depression Detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4235–4239. [Google Scholar] [CrossRef]
  59. Yang, L.; Sahli, H.; Xia, X.; Pei, E.; Oveneke, M.C.; Jiang, D. Hybrid Depression Classification and Estimation from Audio Video and Text Information. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 45–51. [Google Scholar] [CrossRef]
  60. Ishimaru, M.; Okada, Y.; Uchiyama, R.; Horiguchi, R.; Toyoshima, I. A new regression model for depression severity prediction based on correlation among audio features using a graph convolutional neural network. Diagnostics 2023, 13, 727. [Google Scholar] [CrossRef] [PubMed]
  61. Tang, J.; Shang, Y. Advancing Mental Health Pre-Screening: A New Custom GPT for Psychological Distress Assessment. In Proceedings of the 2024 IEEE 6th International Conference on Cognitive Machine Intelligence (CogMI), Washington, DC, USA, 28–31 October 2024; pp. 162–171. [Google Scholar] [CrossRef]
  62. Qureshi, S.A.; Dias, G.; Saha, S.; Hasanuzzaman, M. Gender-Aware Estimation of Depression Severity Level in a Multimodal Setting. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  63. Qureshi, S.A.; Dias, G.; Hasanuzzaman, M.; Saha, S. Improving Depression Level Estimation by Concurrently Learning Emotion Intensity. IEEE Comput. Intell. Mag. 2020, 15, 47–59. [Google Scholar] [CrossRef]
  64. Hu, M.; Liu, L.; Wang, X.; Tang, Y.; Yang, J.; An, N. Parallel Multiscale Bridge Fusion Network for Audio–Visual Automatic Depression Assessment. IEEE Trans. Comput. Soc. Syst. 2024, 11, 6830–6842. [Google Scholar] [CrossRef]
  65. Du, Z.; Li, W.; Huang, D.; Wang, Y. Encoding Visual Behaviors with Attentive Temporal Convolution for Depression Prediction. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar] [CrossRef]
  66. Song, S.; Shen, L.; Valstar, M. Human Behaviour-Based Automatic Depression Analysis Using Hand-Crafted Statistics and Deep Learned Spectral Features. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 158–165. [Google Scholar] [CrossRef]
  67. Yang, L.; Jiang, D.; Sahli, H. Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals. IEEE Access 2020, 8, 24033–24045. [Google Scholar] [CrossRef]
  68. Gong, Y.; Poellabauer, C. Topic Modeling Based Multi-modal Depression Detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 69–76. [Google Scholar] [CrossRef]
  69. Li, Y.; Shao, S.; Milling, M.; Schuller, B.W. Large language models for depression recognition in spoken language integrating psychological knowledge. Front. Comput. Sci. 2025, 7, 1629725. [Google Scholar] [CrossRef]
  70. Shabana, S.; Bharathi, V.C. Se-GCaTCT: Correlation based optimized self-guided cross-attention temporal convolutional transformer for depression detection with effective optimization strategy. Biomed. Signal Process. Control 2026, 112, 108561. [Google Scholar] [CrossRef]
  71. Wang, Y.; Lin, Z.; Teng, Y.; Cheng, Y.; Jiang, H.; Yang, Y. SIMMA: Multimodal Automatic Depression Detection via Spatiotemporal Ensemble and Cross-Modal Alignment. IEEE Trans. Comput. Soc. Syst. 2025, 12, 3548–3564. [Google Scholar] [CrossRef]
  72. Dai, Z.; Zhou, H.; Ba, Q.; Zhou, Y.; Wang, L.; Li, G. Improving depression prediction using a novel feature selection algorithm coupled with context-aware analysis. J. Affect. Disord. 2021, 295, 1040–1048. [Google Scholar] [CrossRef]
  73. Han, Z.; Shang, Y.; Shao, Z.; Liu, J.; Guo, G.; Liu, T.; Ding, H.; Hu, Q. Spatial–Temporal Feature Network for Speech-Based Depression Recognition. IEEE Trans. Cogn. Dev. Syst. 2024, 16, 308–318. [Google Scholar] [CrossRef]
  74. Milintsevich, K.; Sirts, K.; Dias, G. Towards automatic text-based estimation of depression through symptom prediction. Brain Inform. 2023, 10, 4. [Google Scholar] [CrossRef]
  75. Hong, S.; Cohn, A.; Hogg, D. Using graph representation learning with schema encoders to measure the severity of depressive symptoms. In Proceedings of the Tenth International Conference on Learning Representations, Online, 25–29 April 2022; Available online: https://eprints.whiterose.ac.uk/id/eprint/186629/ (accessed on 26 September 2025).
  76. Kang, A.; Chen, J.Y.; Lee-Youngzie, Z.; Fu, S. Synthetic Data Generation with LLM for Improved Depression Prediction. arXiv 2024, arXiv:2411.17672. [Google Scholar] [CrossRef]
  77. Syed, Z.S.; Sidorov, K.; Marshall, D. Depression Severity Prediction Based on Biomarkers of Psychomotor Retardation. In Proceedings of the AVEC ’17: 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 37–43. [Google Scholar] [CrossRef]
  78. Rathi, S.; Kaur, B.; Agrawal, R.K. Enhanced Depression Detection from Facial Cues Using Univariate Feature Selection Techniques. In Proceedings of the Pattern Recognition and Machine Intelligence, Kolkata, India, 17–20 December 2019; pp. 22–29. [Google Scholar] [CrossRef]
  79. Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
  80. Burdisso, S.; Reyes-Ramírez, E.; Villatoro-tello, E.; Sánchez-Vega, F.; Lopez Monroy, A.; Motlicek, P. DAIC-WOZ: On the Validity of Using the Therapist’s prompts in Automatic Depression Detection from Clinical Interviews. In Proceedings of the 6th Clinical Natural Language Processing Workshop, Mexico City, Mexico, 21 June 2024; pp. 82–90. [Google Scholar] [CrossRef]
  81. Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP—A collaborative voice analysis repository for speech technologies. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 960–964. [Google Scholar] [CrossRef]
  82. Riazy, L.; Grote, M.; Liegl, G.; Rose, M.; Fischer, F. Cross-Sectional Reference Data From 29 European Countries for 6 Frequently Used Depression Measures. JAMA Netw. Open 2025, 8, e2517394. [Google Scholar] [CrossRef]
  83. Leal, S.S.; Ntalampiras, S.; Sassi, R. Speech-Based Depression Assessment: A Comprehensive Survey. IEEE Trans. Affect. Comput. 2025, 16, 1318–1333. [Google Scholar] [CrossRef]
  84. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  85. Shin, S.; Kim, J.; Yu, Y.; Lee, S.; Lee, K. Self-supervised transfer learning from natural images for sound classification. Appl. Sci. 2021, 11, 3043. [Google Scholar] [CrossRef]
  86. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  87. Valstar, M.; Schuller, B.; Smith, K.; Eyben, F.; Jiang, B.; Bilakhia, S.; Schnieder, S.; Cowie, R.; Pantic, M. AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the AVEC ’13: 3rd ACM International Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 21 October 2013; pp. 3–10. [Google Scholar] [CrossRef]
  88. Tasnim, M.; Ehghaghi, M.; Diep, B.; Novikova, J. DEPAC: A Corpus for Depression and Anxiety Detection from Speech. arXiv 2023, arXiv:2306.12443. [Google Scholar]
Figure 1. PRISMA flowchart of our systematic review, highlighting the inclusion and exclusion of studies at each stage.
Figure 2. Architecture used for the demonstration of the subject leakage pitfall. In experiments on the text modality, a feature vector is obtained by feeding the interview transcript through the RoBERTa model. In experiments on the audio modality, input spectrograms are split into square fragments and processed independently by a CNN backbone; chunk-level representations are weighted by the attention mechanism of an encoder-only transformer, and its outputs are averaged to obtain a single feature vector. In experiments on the combined modality, these feature vectors are concatenated before a regression head produces the final PHQ-8 score.
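To make the fusion pipeline in Figure 2 concrete, the following is a minimal PyTorch sketch of the combined-modality head: attention over audio chunk embeddings via an encoder-only transformer, mean pooling, concatenation with a text feature vector, and a linear regression output. The dimensions, layer counts, and module names (AudioChunkEncoder, FusionRegressor) are illustrative assumptions, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class AudioChunkEncoder(nn.Module):
    """Self-attention over CNN chunk embeddings (encoder-only transformer),
    followed by mean pooling into a single audio feature vector.
    Layer sizes are illustrative assumptions, not the exact configuration."""
    def __init__(self, chunk_dim: int = 512, n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=chunk_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, chunk_embeddings: torch.Tensor) -> torch.Tensor:
        # chunk_embeddings: (batch, n_chunks, chunk_dim), produced by the CNN backbone
        attended = self.encoder(chunk_embeddings)   # attention-weighted chunk representations
        return attended.mean(dim=1)                 # (batch, chunk_dim)

class FusionRegressor(nn.Module):
    """Concatenates audio and text feature vectors and regresses the PHQ-8 score."""
    def __init__(self, audio_dim: int = 512, text_dim: int = 768):
        super().__init__()
        self.audio_encoder = AudioChunkEncoder(chunk_dim=audio_dim)
        self.head = nn.Linear(audio_dim + text_dim, 1)

    def forward(self, audio_chunks: torch.Tensor, text_features: torch.Tensor) -> torch.Tensor:
        audio_vec = self.audio_encoder(audio_chunks)           # (batch, audio_dim)
        fused = torch.cat([audio_vec, text_features], dim=-1)  # (batch, audio_dim + text_dim)
        return self.head(fused).squeeze(-1)                    # predicted PHQ-8 score

# Example: a batch of 4 interviews, 10 spectrogram chunks each, plus RoBERTa-sized text features.
model = FusionRegressor()
scores = model(torch.randn(4, 10, 512), torch.randn(4, 768))
print(scores.shape)  # torch.Size([4])
```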
Table 1. Comparative summary of the approaches included in this systematic review, along with the reported results. We also report results achieved by our models on the audio (A), text (T), and combined (A–T) modalities of the DAIC-WOZ dataset without subject leakage and without use of the interviewer’s utterances. Results are presented either as a single value or as mean ± SD. “—” denotes missing data, i.e., the metric was not reported in the study.
Approach | Data Modality | Model | Val. MAE | Test MAE | Repeated Experiments | Uses Interviewer’s Prompts | Subject Leakage
Milintsevich et al. [74] | T | RoBERTa-BiLSTM | 5.51 ± 0.06 | 5.03 ± 0.09 | | |
Syed et al. [77] | V–A | Fisher Vectors, PLSR | 5.50 | 5.42 | | |
Hong et al. [75] | T | GNN | 3.76 | | | |
Rathi et al. [78] | V | Linear Regression | 4.64 | | | |
Kang et al. [76] | T | BERT | 3.66 | | | |
Ours (no leakage) | A | CNN-Transformer | 4.978 ± 0.571 | 5.001 ± 0.101 | Yes | No | No
Ours (no leakage) | T | RoBERTa | 4.758 ± 0.610 | 5.450 ± 0.255 | Yes | No | No
Ours (no leakage) | A–T | CNN-Transformer, RoBERTa | 4.901 ± 0.611 | 5.269 ± 0.261 | Yes | No | No
Table 2. Measured performance of the model trained under two experimental conditions (without and with subject leakage into the validation split) on the R² and MAE metrics across dataset modalities.
Modality | Metric | Subject Leakage | Val. | Test (Held-Out, No Leakage)
A | R² | False | −0.248 ± 0.295 | −0.058 ± 0.063
A | R² | True | 0.256 ± 0.083 | −0.250 ± 0.108
A | MAE | False | 4.978 ± 0.571 | 5.001 ± 0.101
A | MAE | True | 3.595 ± 0.269 | 5.373 ± 0.178
T | R² | False | −0.236 ± 0.525 | −0.235 ± 0.142
T | R² | True | −0.045 ± 0.095 | −0.295 ± 0.149
T | MAE | False | 4.758 ± 0.610 | 5.450 ± 0.255
T | MAE | True | 4.510 ± 0.244 | 5.596 ± 0.280
A–T | R² | False | −0.308 ± 0.513 | −0.147 ± 0.120
A–T | R² | True | 0.123 ± 0.124 | −0.224 ± 0.144
A–T | MAE | False | 4.901 ± 0.611 | 5.269 ± 0.261
A–T | MAE | True | 4.058 ± 0.357 | 5.427 ± 0.262
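The negative validation and test R² values in Table 2 can be read against a mean-predictor baseline: an estimator that always outputs the mean PHQ-8 score obtains R² = 0 by construction, so any negative R² indicates a fit worse than that trivial baseline, even when the accompanying MAE looks superficially reasonable. A minimal illustration with scikit-learn follows; the PHQ-8 values below are made-up examples, not DAIC-WOZ data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical PHQ-8 ground-truth scores and two sets of predictions (not DAIC-WOZ data).
y_true = np.array([2, 5, 9, 14, 18, 7, 11, 3])
y_mean_baseline = np.full_like(y_true, fill_value=y_true.mean(), dtype=float)
# Predictions that cluster near the mean but are slightly anti-correlated with the targets.
y_model = np.array([9.5, 9.0, 8.0, 7.5, 7.0, 8.5, 7.8, 9.2])

# The mean predictor has R^2 == 0 (it explains none of the variance) but a non-trivial MAE.
print("baseline: MAE=%.2f, R2=%.2f" % (
    mean_absolute_error(y_true, y_mean_baseline), r2_score(y_true, y_mean_baseline)))

# The model's MAE is of a similar magnitude, yet its R^2 is negative,
# i.e., it performs worse than simply predicting the mean.
print("model:    MAE=%.2f, R2=%.2f" % (
    mean_absolute_error(y_true, y_model), r2_score(y_true, y_model)))
```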
Table 3. Statistical significance of performance differences between models trained without and with subject leakage across modalities and metrics. For each comparison, we report the mean difference (Δ) and its 95% bootstrap confidence interval, the Shapiro–Wilk normality test p-value, the paired statistical test used to assess significance of the difference and its p-value, effect size (Cohen’s d), and the adjusted p-value after Holm–Bonferroni correction.
Modality | Metric | Δ | 95% CI | Shapiro p | Test | p | Cohen’s d | Adj. p (Holm) | Significant
A | Val. R² | 0.504 | [0.411, 0.624] | 5.87 × 10⁻⁵ | Wilcoxon | 5.96 × 10⁻⁸ | 1.83 | 5.36 × 10⁻⁷ | Yes
A | Test R² | −0.192 | [−0.239, −0.146] | 1.77 × 10⁻¹ | t-test | 5.07 × 10⁻⁸ | −1.56 | 5.07 × 10⁻⁷ | Yes
A | Val. MAE | −1.383 | [−1.632, −1.136] | 6.43 × 10⁻¹ | t-test | 2.48 × 10⁻¹⁰ | −2.07 | 2.98 × 10⁻⁹ | Yes
A | Test MAE | 0.371 | [0.294, 0.447] | 3.43 × 10⁻¹ | t-test | 2.07 × 10⁻⁹ | 1.86 | 2.28 × 10⁻⁸ | Yes
T | Val. R² | 0.191 | [0.059, 0.398] | 2.45 × 10⁻⁸ | Wilcoxon | 1.03 × 10⁻³ | 0.41 | 6.16 × 10⁻³ | Yes
T | Test R² | −0.060 | [−0.141, 0.023] | 5.44 × 10⁻¹ | t-test | 1.67 × 10⁻¹ | −0.28 | 1.78 × 10⁻¹ | No
T | Val. MAE | −0.248 | [−0.488, −0.011] | 5.61 × 10⁻¹ | t-test | 5.97 × 10⁻² | −0.40 | 1.78 × 10⁻¹ | No
T | Test MAE | 0.146 | [0.005, 0.287] | 8.72 × 10⁻¹ | t-test | 5.95 × 10⁻² | 0.40 | 1.78 × 10⁻¹ | No
A–T | Val. R² | 0.431 | [0.257, 0.668] | 4.76 × 10⁻⁶ | Wilcoxon | 2.98 × 10⁻⁷ | 0.80 | 2.38 × 10⁻⁶ | Yes
A–T | Test R² | −0.077 | [−0.139, −0.016] | 9.71 × 10⁻¹ | t-test | 2.42 × 10⁻² | −0.48 | 1.21 × 10⁻¹ | No
A–T | Val. MAE | −0.842 | [−1.107, −0.594] | 2.41 × 10⁻¹ | t-test | 1.70 × 10⁻⁶ | −1.26 | 1.19 × 10⁻⁵ | Yes
A–T | Test MAE | 0.158 | [0.025, 0.288] | 5.34 × 10⁻¹ | t-test | 2.97 × 10⁻² | 0.46 | 1.21 × 10⁻¹ | No
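The testing procedure summarized in Table 3 can be reproduced with standard SciPy and statsmodels routines: check normality of the paired differences with Shapiro–Wilk, use a paired t-test when normality is not rejected and the Wilcoxon signed-rank test otherwise, compute a paired Cohen’s d, and apply Holm–Bonferroni correction across all comparisons. The sketch below follows that logic for a single comparison; the per-run score arrays are random placeholders, not our actual results.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon
from statsmodels.stats.multitest import multipletests

def paired_comparison(scores_no_leak, scores_leak, alpha=0.05):
    """Compare per-run metric values from the two conditions, paired by cross-validation run."""
    diff = np.asarray(scores_leak) - np.asarray(scores_no_leak)
    shapiro_p = shapiro(diff).pvalue
    if shapiro_p > alpha:   # normality of differences not rejected -> paired t-test
        test_name, p_value = "t-test", ttest_rel(scores_leak, scores_no_leak).pvalue
    else:                   # otherwise fall back to the Wilcoxon signed-rank test
        test_name, p_value = "Wilcoxon", wilcoxon(scores_leak, scores_no_leak).pvalue
    cohens_d = diff.mean() / diff.std(ddof=1)   # paired Cohen's d
    return {"delta": diff.mean(), "shapiro_p": shapiro_p,
            "test": test_name, "p": p_value, "d": cohens_d}

# Placeholder per-run MAE values for 25 cross-validation runs (not our actual results).
rng = np.random.default_rng(0)
no_leak = 5.0 + 0.5 * rng.standard_normal(25)
leak = 3.6 + 0.3 * rng.standard_normal(25)

results = [paired_comparison(no_leak, leak)]   # one entry per modality/metric comparison
raw_p = [r["p"] for r in results]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
for r, p_adj, sig in zip(results, adj_p, reject):
    print(f"{r['test']}: delta={r['delta']:.3f}, p={r['p']:.2e}, "
          f"d={r['d']:.2f}, adj_p={p_adj:.2e}, significant={sig}")
```

The 95% bootstrap confidence interval for Δ reported in the table could be obtained analogously, for example with scipy.stats.bootstrap applied to the paired differences.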
Table 4. Checklist of methodological and reproducibility recommendations for machine learning studies on depression severity estimation.
Aspect | Item # | Checklist Item
Critical Requirements
Methodology | 1 | Use a data partitioning scheme with partitions disjoint at the participant level to prevent subject leakage across training, validation, and test sets (see the code sketch following this table).
 | 2 | Preserve a held-out test set with unseen participants and report test performance separately from validation results.
 | 3 | Incorporate R² into the evaluation protocol as a primary goodness-of-fit metric, in addition to error metrics such as MAE or RMSE.
 | 4 | Train and validate models with cross-validation schemes or other forms of repeated experiments whenever applicable.
 | 5 | Report metric variability (e.g., standard deviation) across repeated experiments.
Documentation | 6 | Explicitly document the adopted methodology: describe in detail all steps from data preprocessing through training and evaluation to final model selection, including (where applicable), but not limited to, data cleaning and augmentation, training objective and hyperparameters, the best-model selection criterion, and the number of models trained.
 | 7 | Provide a complete description of the model architecture (e.g., number of layers, dimensionalities).
Best Practices
Model diagnostics | 8 | Report the model’s calibration curve or error distribution plots to visualize the model’s errors with respect to the ground-truth labels.
 | 9 | Conduct ablation studies where applicable.
Reproducibility | 10 | Make source code with frozen dependency versions publicly available for each part of the study, including model training, evaluation, and statistical analysis of the results.
 | 11 | Ensure accessibility of the outputs of each experimental run (e.g., logs, metrics, checkpoints).
 | 12 | Employ containerized environments (e.g., Docker) to ensure portability and long-term reproducibility.
 | 13 | Use open-source experiment tracking tools (e.g., MLflow) to foster experimentation and further ease of adoption.
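Checklist items 1, 4, and 5 can be satisfied with off-the-shelf scikit-learn utilities: group-aware splitters keep all samples of a participant in a single partition, and repeating the evaluation yields the mean ± SD reporting recommended above. The sketch below is a generic illustration with a placeholder feature matrix, placeholder participant IDs, and an arbitrary regressor; it is not our DAIC-WOZ pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Placeholder data: several samples (e.g., interview chunks) per participant.
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 20))
y = rng.integers(0, 25, size=300).astype(float)   # PHQ-8-like targets in the 0-24 range
groups = np.repeat(np.arange(60), 5)              # participant ID attached to every sample

# Item 2: hold out a test set that is disjoint at the participant level.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
dev_idx, test_idx = next(outer.split(X, y, groups))

# Items 1 and 4: group-wise cross-validation on the development portion,
# so no participant appears in both the training and validation folds.
maes, r2s = [], []
for tr_idx, val_idx in GroupKFold(n_splits=5).split(X[dev_idx], y[dev_idx], groups[dev_idx]):
    model = RandomForestRegressor(random_state=0)
    model.fit(X[dev_idx][tr_idx], y[dev_idx][tr_idx])
    pred = model.predict(X[dev_idx][val_idx])
    maes.append(mean_absolute_error(y[dev_idx][val_idx], pred))
    r2s.append(r2_score(y[dev_idx][val_idx], pred))

# Items 3 and 5: report R^2 alongside MAE, with variability across folds.
print(f"val MAE = {np.mean(maes):.2f} +/- {np.std(maes, ddof=1):.2f}")
print(f"val R^2 = {np.mean(r2s):.2f} +/- {np.std(r2s, ddof=1):.2f}")
```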