Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study
Abstract
1. Introduction
- RQ1: How reliable, reproducible, and methodologically rigorous are published machine learning models?
- RQ2: What are the most common methodological flaws present in current machine learning research?
- RQ3: Are there specific practices that systematically lead to the overestimation or misrepresentation of machine learning models’ performance?
- RQ4: How does subject leakage between training and evaluation partitions affect the measured performance of machine learning models?
- RQ5: What recommendations can be made to improve the methodological rigor, reproducibility, and documentation standards for future machine learning research?
2. Methods
- The study reported experimental results on the DAIC-WOZ dataset.
- The evaluation metrics for regression of depression severity scores included MAE.
- The study presents scientific novelty; studies that solely repurposed or applied existing methods without introducing methodological innovation were excluded.
- The study was published in English.
- Clear description of data preprocessing steps;
- Explicit reporting of the data partitioning scheme (proportions, levels at which partitions are disjoint);
- Detailed model description, including architecture, inputs, outputs, and intermediate layers;
- Specification of training details (data augmentation, hyperparameters, number of models trained);
- Method for selecting the final model.
3. Results
- Not addressing the DAIC-WOZ dataset;
- Not reporting MAE as an evaluation metric;
- Not introducing methodological novelty (i.e., solely applying or repurposing an existing model);
- Not written in English.
- Data partitioning: 37.9% (n = 25);
- Training details: 71.2% (n = 47) [14,15,16,17,18,19,21,23,25,26,27,28,29,31,32,33,36,37,38,40,42,43,44,45,46,47,49,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71]. We additionally stratify this number by publication venue, as stricter page limits for conference papers lead to different documentation expectations than for journal articles. Of the 47 studies with insufficient training details, 25 were conference papers and 22 were journal articles.
- In total, 81.8% (n = 54) of the studies do not use repeated experiments and instead rely on a single experimental run [13,14,15,16,17,18,20,21,22,23,25,26,27,28,29,30,31,32,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,58,59,60,61,62,64,65,67,69,70,71,73,76,77,78]. This is a critical flaw, as many computational methods in machine learning involve stochastic elements such as random initialization, data shuffling, or sampling. Without running experiments multiple times and reporting measures of variability (e.g., mean and standard deviation (SD) over different seeds or cross-validation folds), the reported performance lacks statistical validity (see the first sketch after this list).
- The vast majority of the studies report error measures such as MAE and RMSE while completely ignoring goodness-of-fit metrics such as the coefficient of determination (R²). Although useful, MAE and RMSE have no upper bound and provide no information about the performance of the regression with respect to the distribution of the ground-truth values. In contrast, R² quantifies the proportion of the variance in the dependent variable that the model predicts from the independent variables [79]. Without R² or a similar measure of explained variance, it is difficult to assess whether low prediction errors reflect genuine variance explained by the model or are a consequence of the target variable’s scale. (R² has its own limitations. Due to its definition (see Equation (2)), direct comparisons of R² across datasets with differing variance in the dependent variable are problematic. R² may also be less informative when residuals or the dependent variable are non-normally distributed. Nevertheless, it provides a useful measure of the proportion of variance explained by the model within a given dataset.) Yet only 6.1% (n = 4) of the studies additionally report R² [17,26,61,63]. The first sketch after this list shows how R² can be computed alongside MAE and RMSE.
- Only 30.3% (n = 20) of the studies report performance on a held-out test set [28,31,41,44,45,49,52,53,54,59,63,64,68,69,71,72,73,74,76,77], while the rest evaluate models solely on the validation set. Relying on the validation set alone can lead to overly optimistic results, as this set is often used for model selection (early stopping on validation loss, best validation performance, etc.) or hyperparameter tuning, which constitutes a form of inadvertent overfitting.
- It has been shown for DAIC-WOZ that incorporating interviewer turns into the feature set leads to inflated performance estimates. Burdisso et al. [80] demonstrated that models trained solely on interviewer prompts achieve equal or higher performance than those trained on participant responses, indicating that interviewer turns in the DAIC-WOZ dataset act as discriminative shortcuts rather than clinically meaningful signals. Although the available evidence is limited to the textual modality, a similar effect may plausibly occur in the audio modality, since the transcripts are extracted directly from the audio recordings. This potentially constitutes a methodological pitfall in 42.4% (n = 28) of the studies, as they incorporate interviewer utterances as input data [18,19,24,32,33,36,37,39,40,41,47,48,50,51,52,53,54,55,58,60,63,68,71,72,73,74,75,76] (due to underreported data preprocessing procedures in some of the papers, data on the use of interviewer turns are missing for 25.8% (n = 17) of the studies). The second sketch after this list shows one way to exclude interviewer turns.
- At least 9.1% (n = 6) of the studies exhibit subject leakage [32,34,55,56,60,70], and in an additional 24.2% (n = 16) leakage could not be ruled out due to insufficient documentation of data preprocessing procedures. Subject leakage occurs when data partitions are disjoint at a level more granular than the interview (e.g., at the level of 30 s audio clips, individual participant turns, etc.) and random splits are applied after such preprocessing. As a result, the same participant contributes data to both training and evaluation partitions, allowing models to exploit subject-specific cues rather than learning generalizable patterns. The third sketch after this list shows a participant-disjoint alternative.
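The first sketch below is a minimal illustration (not the authors’ code) of the repeated-experiment protocol recommended above: training is repeated over several random seeds, and MAE, RMSE, and R² are reported as mean ± SD rather than as a single-run figure. The `train_and_predict` callable is a hypothetical stand-in for any model-fitting routine that accepts a seed.

```python
# Minimal sketch: repeated runs over random seeds with mean +/- SD reporting.
# `train_and_predict` is a hypothetical stand-in for any seeded fitting routine.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_over_seeds(train_and_predict, X_tr, y_tr, X_ev, y_ev, seeds=range(5)):
    scores = {"MAE": [], "RMSE": [], "R2": []}
    for seed in seeds:
        y_pred = train_and_predict(X_tr, y_tr, X_ev, seed=seed)
        scores["MAE"].append(mean_absolute_error(y_ev, y_pred))
        scores["RMSE"].append(np.sqrt(mean_squared_error(y_ev, y_pred)))
        scores["R2"].append(r2_score(y_ev, y_pred))
    # Report variability across runs, not just a single draw.
    return {name: (np.mean(vals), np.std(vals, ddof=1)) for name, vals in scores.items()}
```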
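The second sketch shows one way to exclude interviewer turns before feature extraction. It assumes the standard DAIC-WOZ transcript layout: tab-separated files with `start_time`, `stop_time`, `speaker`, and `value` columns, where `speaker` is either "Ellie" (the virtual interviewer) or "Participant". Adjust the column names if your copy of the corpus differs.

```python
# Minimal sketch: drop interviewer ("Ellie") turns from a DAIC-WOZ transcript
# so that models cannot exploit interviewer prompts as discriminative shortcuts.
# Assumes the standard tab-separated transcript layout described above.
import pandas as pd

def participant_turns_only(transcript_path: str) -> pd.DataFrame:
    turns = pd.read_csv(transcript_path, sep="\t")
    mask = turns["speaker"].str.strip().eq("Participant")
    return turns.loc[mask].reset_index(drop=True)
```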
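The third sketch contrasts a leaky random split of preprocessed clips with a participant-disjoint split built on scikit-learn’s `GroupShuffleSplit`. The feature array and participant IDs are synthetic placeholders; only the splitting logic matters.

```python
# Minimal sketch: participant-disjoint partitioning vs. a leaky random split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
clips = rng.normal(size=(1000, 40))            # e.g., features of 30 s audio clips
participant_ids = rng.integers(0, 100, 1000)   # each clip belongs to one participant

# Leaky: clips from the same participant can land in both partitions.
X_train_leaky, X_val_leaky = train_test_split(clips, test_size=0.2, random_state=0)

# Participant-disjoint: no participant appears in both partitions.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(clips, groups=participant_ids))
assert set(participant_ids[train_idx]).isdisjoint(participant_ids[val_idx])
```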
4. Empirical Demonstration of Subject Leakage Pitfall
4.1. Dataset
4.2. Data Preprocessing
4.3. Model Architecture
4.4. Training Procedure
4.5. Impact of Subject Leakage on Model Performance
5. Discussion
- (i) Reporting results from a single experimental run is prone to misrepresenting the actual performance due to stochasticity in machine learning.
- (ii) The lack of evaluation on held-out data causes inadvertent overfitting to the validation data.
- (iii) The most prevalent evaluation metric in the literature on depression severity estimation is MAE, often accompanied by RMSE and by MSE used as an optimization criterion. While MSE is a good choice as a cost function for model training, neither it nor its rooted variant RMSE quantifies errors in an easily interpretable way. MAE translates directly to the scale of the actual labels and is therefore easier to interpret. Nevertheless, all three metrics share the common limitations of having no upper bound and not explaining how well the model fits the data. We strongly encourage researchers to incorporate R² into their evaluation procedures. An R² in the [0, 1] interval directly indicates the fraction of variance explained by the model, while a negative value means that the regression fits worse than the mean predictor. Furthermore, R² is monotonically related to MSE (Equation (2); see also the formula after these lists), so an ordering of regression models based on the coefficient of determination is identical to an ordering based on MSE or RMSE [79]. Our model, trained without subject leakage on the audio modality of DAIC-WOZ, achieved validation and test MAE comparable to the models included in this review. Relying on MAE alone would misleadingly suggest success, whereas the negative R² clearly indicates that the model fails to generalize to unseen data.
- (i) For the textual modality of DAIC-WOZ, a model trained on interviewer prompts yields better performance than one trained on participant responses [80]. This flaw is specific to the DAIC-WOZ text modality, although similar effects may plausibly extend to the other modalities of the dataset. We therefore recommend that authors explicitly state whether interviewer turns are included in the model’s inputs and justify their use. The best results included in our review (the lowest validation MAE [75] and the lowest test MAE [76]) were reported by studies that used the text modality and incorporated interviewer prompts.
- (ii) In our study, we examined the impact of subject leakage on the validation set and showed that it results in substantive performance overestimation on the audio, text, and combined modalities of the DAIC-WOZ dataset. This effect cannot be diagnosed without evaluation on a held-out set with unseen participants. In the following paragraph, we discuss the results of our empirical experiments.
6. Limitations
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Supplementary Data
References
- World Health Organization. Depression: Let’s Talk Says WHO, as Depression Tops List of Causes of Ill Health. 2017. Available online: https://www.who.int/news/item/30-03-2017--depression-let-s-talk-says-who-as-depression-tops-list-of-causes-of-ill-health (accessed on 26 September 2025).
- Paykel, E.S.; Priest, R.G. Recognition and management of depression in general practice: Consensus statement. BMJ 1992, 305, 1198–1202. [Google Scholar] [CrossRef] [PubMed]
- Stringaris, A. Editorial: What is depression? J. Child Psychol. Psychiatry 2017, 58, 1287–1289. [Google Scholar] [CrossRef] [PubMed]
- American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Publishing: Arlington, VA, USA, 2013. [Google Scholar] [CrossRef]
- Yi, L.; Xie, G.; Li, Z.; Li, X.; Zhang, Y.; Wu, K.; Shao, G.; Lv, B.; Jing, H.; Zhang, C.; et al. Automatic depression diagnosis through hybrid EEG and near-infrared spectroscopy features using support vector machine. Front. Neurosci. 2023, 17, 1205931. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Kumbale, S.; Chen, Y.; Surana, T.; Chng, E.S.; Guan, C. Automated Depression Detection from Text and Audio: A Systematic Review. IEEE J. Biomed. Health Inform. 2025, 29, 1–17. [Google Scholar] [CrossRef]
- Li, Q.; Liu, X.; Hu, X.; Rahman Ahad, M.A.; Ren, M.; Yao, L.; Huang, Y. Machine Learning-Based Prediction of Depressive Disorders via Various Data Modalities: A Survey. IEEE/CAA J. Autom. Sin. 2025, 12, 1320–1349. [Google Scholar] [CrossRef]
- Mao, K.; Wu, Y.; Chen, J. A systematic review on automated clinical depression diagnosis. npj Ment. Health Res. 2023, 2, 20. [Google Scholar] [CrossRef]
- Gratch, J.; Artstein, R.; Lucas, G.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The Distress Analysis Interview Corpus of human and computer interviews. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 3123–3128. [Google Scholar]
- Ringeval, F.; Schuller, B.; Valstar, M.; Cummins, N.; Cowie, R.; Tavabi, L.; Schmitt, M.; Alisamir, S.; Amiriparian, S.; Messner, E.M.; et al. AVEC 2019 Workshop and Challenge: State-of-Mind, Detecting Depression with AI, and Cross-Cultural Affect Recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, New York, NY, USA, 21 October 2019; pp. 3–12. [Google Scholar] [CrossRef]
- Tugwell, P.; Tovey, D. PRISMA 2020. J. Clin. Epidemiol. 2021, 134, A5–A6. [Google Scholar] [CrossRef]
- Mongan, J.; Moy, L.; Kahn, C.E., Jr. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): A Guide for Authors and Reviewers. Radiol. Artif. Intell. 2020, 2, e200029. [Google Scholar] [CrossRef]
- Niu, M.; Li, M.; Fu, C. PointTransform Networks for automatic depression level prediction via facial keypoints. Knowl.-Based Syst. 2024, 297, 111951. [Google Scholar] [CrossRef]
- Fang, M.; Peng, S.; Liang, Y.; Hung, C.C.; Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 2023, 82, 104561. [Google Scholar] [CrossRef]
- Atta, A.; El Sayad, D.; Ezzat, D.; Amin, S.; El Gamal, M. Speech-Based Depression Detection System Optimized Using Particle Swarm Optimization. In Proceedings of the 2024 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES), Giza, Egypt, 19–21 October 2024; pp. 250–253. [Google Scholar] [CrossRef]
- Shu, T.; Zhang, F.; Sun, X. Gaze Behavior based Depression Severity Estimation. In Proceedings of the 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, 4–6 August 2023; pp. 313–319. [Google Scholar] [CrossRef]
- Firoz, N.; Beresteneva, O.G.; Aksyonov, S.V. Enhancing Depression Detection: Employing Autoencoders and Linguistic Feature Analysis with BERT and LSTM Model. In Proceedings of the 2023 International Russian Automation Conference (RusAutoCon), Sochi, Russia, 10–16 September 2023; pp. 299–304. [Google Scholar] [CrossRef]
- Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.T.; Dagli, C.; Quatieri, T.F. Detecting Depression using Vocal, Facial and Semantic Communication Cues. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 11–18. [Google Scholar] [CrossRef]
- Huang, G.; Li, J.; Lu, H.; Guo, M.; Chen, S. Rethinking Inconsistent Context and Imbalanced Regression in Depression Severity Prediction. IEEE Trans. Affect. Comput. 2024, 15, 2154–2168. [Google Scholar] [CrossRef]
- Feng, K.; Chaspari, T. Robust and Explainable Depression Identification from Speech Using Vowel-Based Ensemble Learning Approaches. In Proceedings of the 2024 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Houston, TX, USA, 10–13 November 2024; pp. 1–8. [Google Scholar] [CrossRef]
- Firoz, N. Detecting Depression from Text: A Gender Based Comparative Approach Using Machine Learning and BERT Embeddings. In Proceedings of the 20th All-Russian Conference of Student Research Incubators, Tomsk, Russia, 10–16 September 2023; pp. 164–166. [Google Scholar]
- Iyortsuun, N.K.; Kim, S.H.; Yang, H.J.; Kim, S.W.; Jhon, M. Additive Cross-Modal Attention Network (ACMA) for Depression Detection Based on Audio and Textual Features. IEEE Access 2024, 12, 20479–20489. [Google Scholar] [CrossRef]
- Wang, X.; Xu, J.; Sun, X.; Li, M.; Hu, B.; Qian, W.; Guo, D.; Wang, M. Facial Depression Estimation via Multi-Cue Contrastive Learning. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6007–6020. [Google Scholar] [CrossRef]
- Dinkel, H.; Wu, M.; Yu, K. Text-based depression detection on sparse data. arXiv 2020, arXiv:1904.05154. [Google Scholar] [CrossRef]
- Zhao, Z.; Wang, K. Unaligned Multimodal Sequences for Depression Assessment From Speech. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 3409–3413. [Google Scholar] [CrossRef]
- Firoz, N.; Beresteneva, O.G.; Vladimirovich, A.S.; Tahsin, M.S. Enhancing depression detection through advanced text analysis: Integrating BERT, autoencoder, and LSTM models. Res. Sq. Platf. LLC 2023. [Google Scholar] [CrossRef]
- Guangyao, S.; Shenghui, Z.; Bochao, Z.; Yubo, A. Multimodal depression detection using a deep feature fusion network. In Proceedings of the Third International Conference on Computer Science and Communication Technology, ICCSCT 2022, Beijing, China, 30–31 July 2022. [Google Scholar] [CrossRef]
- Niu, M.; Wang, X.; Gong, J.; Liu, B.; Tao, J.; Schuller, B.W. Depression Scale Dictionary Decomposition Framework for Multimodal Automatic Depression Level Prediction. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 6195–6210. [Google Scholar] [CrossRef]
- Zheng, W.; Yan, L.; Gou, C.; Wang, F.Y. Graph Attention Model Embedded with Multi-Modal Knowledge for Depression Detection. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME 2020), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar] [CrossRef]
- TJ, S.J.; Jacob, I.J.; Mandava, A.K. D-ResNet-PVKELM: Deep neural network and paragraph vector based kernel extreme machine learning model for multimodal depression analysis. Multimed. Tools Appl. 2023, 82, 25973–26004. [Google Scholar] [CrossRef]
- Niu, M.; Li, Y.; Tao, J.; Zhou, X.; Schuller, B.W. DepressionMLP: A Multi-Layer Perceptron Architecture for Automatic Depression Level Prediction via Facial Keypoints and Action Units. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 8924–8938. [Google Scholar] [CrossRef]
- Xu, Z.; Gao, Y.; Wang, F.; Zhang, L.; Zhang, L.; Wang, J.; Shu, J. Depression detection methods based on multimodal fusion of voice and text. Sci. Rep. 2025, 15, 21907. [Google Scholar] [CrossRef]
- Hong, J.; Lee, J.; Choi, D.; Jung, J. LEFORMER: Liquid Enhanced Multimodal Learning for Depression Severity Estimation. In Proceedings of the 2025 IEEE 38th International Symposium on Computer-Based Medical Systems (CBMS 2025), Madrid, Spain, 18–20 June 2025; pp. 423–428. [Google Scholar] [CrossRef]
- Kumar, P.; Misra, S.; Shao, Z.; Zhu, B.; Raman, B.; Li, X. Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio and Textual Data. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025), Tucson, AZ, USA, 28 February–4 March 2025; pp. 5305–5315. [Google Scholar] [CrossRef]
- Yang, Y.; Zheng, W. Multi-level spatiotemporal graph attention fusion for multimodal depression detection. Biomed. Signal Process. Control 2025, 110, 108123. [Google Scholar] [CrossRef]
- Zhao, Z.; Bao, Z.; Zhang, Z.; Cummins, N.; Wang, H.; Schuller, B. Hierarchical Attention Transfer Networks for Depression Assessment from Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), Barcelona, Spain, 4–8 May 2020; pp. 7159–7163. [Google Scholar] [CrossRef]
- Gupta, A.K.; Dhamaniya, A.; Gupta, P. RADIANCE: Reliable and interpretable depression detection from speech using transformer. Comput. Biol. Med. 2024, 183, 109325. [Google Scholar] [CrossRef]
- Lu, J.; Liu, B.; Lian, Z.; Cai, C.; Tao, J.; Zhao, Z. Prediction of Depression Severity Based on Transformer Encoder and CNN Model. In Proceedings of the 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore, 11–14 December 2022; pp. 339–343. [Google Scholar] [CrossRef]
- Chen, Z.; Wang, D.; Lou, L.; Zhang, S.; Zhao, X.; Jiang, S.; Yu, J.; Xiao, J. Text-guided multimodal depression detection via cross-modal feature reconstruction and decomposition. Inf. Fusion 2025, 117, 102861. [Google Scholar] [CrossRef]
- Wang, Z.; Chen, L.; Wang, L.; Diao, G. Recognition of Audio Depression Based on Convolutional Neural Network and Generative Antagonism Network Model. IEEE Access 2020, 8, 101181–101191. [Google Scholar] [CrossRef]
- Tang, J.; Guo, Q.; Sun, W.; Shang, Y. A Layered Multi-Expert Framework for Long-Context Mental Health Assessments. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI), Santa Clara, CA, USA, 5–7 May 2025; pp. 435–440. [Google Scholar] [CrossRef]
- Wei, P.C.; Peng, K.; Roitberg, A.; Yang, K.; Zhang, J.; Stiefelhagen, R. Multi-modal Depression Estimation Based on Sub-attentional Fusion. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2023; pp. 623–639. [Google Scholar] [CrossRef]
- Nanggala, K.; Elwirehardja, G.N.; Pardamean, B. Depression detection through transformers-based emotion recognition in multivariate time series facial data. Int. J. Artif. Intell. 2025, 14, 1302–1310. [Google Scholar] [CrossRef]
- Chen, M.; Xiao, X.; Zhang, B.; Liu, X.; Lu, R. Neural Architecture Searching for Facial Attributes-based Depression Recognition. In Proceedings of the 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 877–884. [Google Scholar] [CrossRef]
- Yang, L.; Jiang, D.; He, L.; Pei, E.; Oveneke, M.C.; Sahli, H. Decision Tree Based Depression Classification from Audio Video and Language Information. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 89–96. [Google Scholar] [CrossRef]
- Lin, L.; Chen, X.; Shen, Y.; Zhang, L. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Appl. Sci. 2020, 10, 8701. [Google Scholar] [CrossRef]
- Rasipuram, S.; Bhat, J.H.; Maitra, A.; Shaw, B.; Saha, S. Multimodal Depression Detection Using Task-oriented Transformer-based Embedding. In Proceedings of the 2022 IEEE Symposium on Computers and Communications (ISCC), Rhodes Island, Greece, 30 June–3 July 2022; pp. 01–04. [Google Scholar] [CrossRef]
- Lau, C.; Chan, W.Y.; Zhu, X. Improving Depression Assessment with Multi-Task Learning from Speech and Text Information. In Proceedings of the 55th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–3 November 2021; pp. 449–453. [Google Scholar] [CrossRef]
- Rohanian, M.; Hough, J.; Purver, M. Detecting Depression with Word-Level Multimodal Fusion. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1443–1447. [Google Scholar] [CrossRef]
- Prakrankamanant, P.; Watanabe, S.; Chuangsuwanich, E. Explainable Depression Detection using Masked Hard Instance Mining. arXiv 2025, arXiv:2505.24609. [Google Scholar] [CrossRef]
- Hu, J.; Wang, A.; Xie, Q.; Ma, H.; Li, Z.; Guo, D. Agentmental: An interactive multi-agent framework for explainable and adaptive mental health assessment. arXiv 2025, arXiv:2508.11567. [Google Scholar]
- Chen, X.; Shao, Z.; Jiang, Y.; Chen, R.; Wang, Y.; Li, B.; Niu, M.; Chen, H.; Hu, Q.; Wu, J.; et al. TTFNet: Temporal-Frequency Features Fusion Network for Speech Based Automatic Depression Recognition and Assessment. IEEE J. Biomed. Health Inform. 2025, 29, 7536–7548. [Google Scholar] [CrossRef]
- Zhang, J.; Guo, Y. Multilevel depression status detection based on fine-grained prompt learning. Pattern Recognit. Lett. 2024, 178, 167–173. [Google Scholar] [CrossRef]
- Sun, B.; Zhang, Y.; He, J.; Yu, L.; Xu, Q.; Li, D.; Wang, Z. A Random Forest Regression Method With Selected-Text Feature For Depression Assessment. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 61–68. [Google Scholar] [CrossRef]
- Sun, B.; Zhang, Y.; He, J.; Xiao, Y.; Xiao, R. An automatic diagnostic network using skew-robust adversarial discriminative domain adaptation to evaluate the severity of depression. Comput. Methods Programs Biomed. 2019, 173, 185–195. [Google Scholar] [CrossRef]
- Zhang, Y.; Hu, W.; Wu, Q. Autoencoder Based on Cepstrum Separation to Detect Depression from Speech. In Proceedings of the 3rd International Conference on Information Technologies and Electrical Engineering, New York, NY, USA, 14–15 September 2021; pp. 508–510. [Google Scholar] [CrossRef]
- Zhang, W.; Mao, K.; Chen, J. A Multimodal approach for detection and assessment of depression using text, audio and video. Phenomics 2024, 4, 234–249. [Google Scholar] [CrossRef]
- Niu, M.; Chen, K.; Chen, Q.; Yang, L. HCAG: A Hierarchical Context-Aware Graph Attention Model for Depression Detection. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 4235–4239. [Google Scholar] [CrossRef]
- Yang, L.; Sahli, H.; Xia, X.; Pei, E.; Oveneke, M.C.; Jiang, D. Hybrid Depression Classification and Estimation from Audio Video and Text Information. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 45–51. [Google Scholar] [CrossRef]
- Ishimaru, M.; Okada, Y.; Uchiyama, R.; Horiguchi, R.; Toyoshima, I. A new regression model for depression severity prediction based on correlation among audio features using a graph convolutional neural network. Diagnostics 2023, 13, 727. [Google Scholar] [CrossRef] [PubMed]
- Tang, J.; Shang, Y. Advancing Mental Health Pre-Screening: A New Custom GPT for Psychological Distress Assessment. In Proceedings of the 2024 IEEE 6th International Conference on Cognitive Machine Intelligence (CogMI), Washington, DC, USA, 28–31 October 2024; pp. 162–171. [Google Scholar] [CrossRef]
- Qureshi, S.A.; Dias, G.; Saha, S.; Hasanuzzaman, M. Gender-Aware Estimation of Depression Severity Level in a Multimodal Setting. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Qureshi, S.A.; Dias, G.; Hasanuzzaman, M.; Saha, S. Improving Depression Level Estimation by Concurrently Learning Emotion Intensity. IEEE Comput. Intell. Mag. 2020, 15, 47–59. [Google Scholar] [CrossRef]
- Hu, M.; Liu, L.; Wang, X.; Tang, Y.; Yang, J.; An, N. Parallel Multiscale Bridge Fusion Network for Audio–Visual Automatic Depression Assessment. IEEE Trans. Comput. Soc. Syst. 2024, 11, 6830–6842. [Google Scholar] [CrossRef]
- Du, Z.; Li, W.; Huang, D.; Wang, Y. Encoding Visual Behaviors with Attentive Temporal Convolution for Depression Prediction. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–7. [Google Scholar] [CrossRef]
- Song, S.; Shen, L.; Valstar, M. Human Behaviour-Based Automatic Depression Analysis Using Hand-Crafted Statistics and Deep Learned Spectral Features. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 158–165. [Google Scholar] [CrossRef]
- Yang, L.; Jiang, D.; Sahli, H. Feature Augmenting Networks for Improving Depression Severity Estimation From Speech Signals. IEEE Access 2020, 8, 24033–24045. [Google Scholar] [CrossRef]
- Gong, Y.; Poellabauer, C. Topic Modeling Based Multi-modal Depression Detection. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 69–76. [Google Scholar] [CrossRef]
- Li, Y.; Shao, S.; Milling, M.; Schuller, B.W. Large language models for depression recognition in spoken language integrating psychological knowledge. Front. Comput. Sci. 2025, 7, 1629725. [Google Scholar] [CrossRef]
- Shabana, S.; Bharathi, V.C. Se-GCaTCT: Correlation based optimized self-guided cross-attention temporal convolutional transformer for depression detection with effective optimization strategy. Biomed. Signal Process. Control 2026, 112, 108561. [Google Scholar] [CrossRef]
- Wang, Y.; Lin, Z.; Teng, Y.; Cheng, Y.; Jiang, H.; Yang, Y. SIMMA: Multimodal Automatic Depression Detection via Spatiotemporal Ensemble and Cross-Modal Alignment. IEEE Trans. Comput. Soc. Syst. 2025, 12, 3548–3564. [Google Scholar] [CrossRef]
- Dai, Z.; Zhou, H.; Ba, Q.; Zhou, Y.; Wang, L.; Li, G. Improving depression prediction using a novel feature selection algorithm coupled with context-aware analysis. J. Affect. Disord. 2021, 295, 1040–1048. [Google Scholar] [CrossRef]
- Han, Z.; Shang, Y.; Shao, Z.; Liu, J.; Guo, G.; Liu, T.; Ding, H.; Hu, Q. Spatial–Temporal Feature Network for Speech-Based Depression Recognition. IEEE Trans. Cogn. Dev. Syst. 2024, 16, 308–318. [Google Scholar] [CrossRef]
- Milintsevich, K.; Sirts, K.; Dias, G. Towards automatic text-based estimation of depression through symptom prediction. Brain Inform. 2023, 10, 4. [Google Scholar] [CrossRef]
- Hong, S.; Cohn, A.; Hogg, D. Using graph representation learning with schema encoders to measure the severity of depressive symptoms. In Proceedings of the Tenth International Conference on Learning Representations, Online, 25–29 April 2022; Available online: https://eprints.whiterose.ac.uk/id/eprint/186629/ (accessed on 26 September 2025).
- Kang, A.; Chen, J.Y.; Lee-Youngzie, Z.; Fu, S. Synthetic Data Generation with LLM for Improved Depression Prediction. arXiv 2024, arXiv:2411.17672. [Google Scholar] [CrossRef]
- Syed, Z.S.; Sidorov, K.; Marshall, D. Depression Severity Prediction Based on Biomarkers of Psychomotor Retardation. In Proceedings of the AVEC ’17: 7th Annual Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 23 October 2017; pp. 37–43. [Google Scholar] [CrossRef]
- Rathi, S.; Kaur, B.; Agrawal, R.K. Enhanced Depression Detection from Facial Cues Using Univariate Feature Selection Techniques. In Proceedings of the Pattern Recognition and Machine Intelligence, Kolkata, India, 17–20 December 2019; pp. 22–29. [Google Scholar] [CrossRef]
- Chicco, D.; Warrens, M.J.; Jurman, G. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef]
- Burdisso, S.; Reyes-Ramírez, E.; Villatoro-tello, E.; Sánchez-Vega, F.; Lopez Monroy, A.; Motlicek, P. DAIC-WOZ: On the Validity of Using the Therapist’s prompts in Automatic Depression Detection from Clinical Interviews. In Proceedings of the 6th Clinical Natural Language Processing Workshop, Mexico City, Mexico, 21 June 2024; pp. 82–90. [Google Scholar] [CrossRef]
- Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP—A collaborative voice analysis repository for speech technologies. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 960–964. [Google Scholar] [CrossRef]
- Riazy, L.; Grote, M.; Liegl, G.; Rose, M.; Fischer, F. Cross-Sectional Reference Data From 29 European Countries for 6 Frequently Used Depression Measures. JAMA Netw. Open 2025, 8, e2517394. [Google Scholar] [CrossRef]
- Leal, S.S.; Ntalampiras, S.; Sassi, R. Speech-Based Depression Assessment: A Comprehensive Survey. IEEE Trans. Affect. Comput. 2025, 16, 1318–1333. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Shin, S.; Kim, J.; Yu, Y.; Lee, S.; Lee, K. Self-supervised transfer learning from natural images for sound classification. Appl. Sci. 2021, 11, 3043. [Google Scholar] [CrossRef]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Valstar, M.; Schuller, B.; Smith, K.; Eyben, F.; Jiang, B.; Bilakhia, S.; Schnieder, S.; Cowie, R.; Pantic, M. AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the AVEC ’13: 3rd ACM International Workshop on Audio/Visual Emotion Challenge, New York, NY, USA, 21 October 2013; pp. 3–10. [Google Scholar] [CrossRef]
- Tasnim, M.; Ehghaghi, M.; Diep, B.; Novikova, J. DEPAC: A Corpus for Depression and Anxiety Detection from Speech. arXiv 2023, arXiv:2306.12443. [Google Scholar]


| Approach | Data Modality | Model | Val. MAE | Test MAE | Repeated Experiments | Uses Interviewer’s Prompts | Subject Leakage |
|---|---|---|---|---|---|---|---|
| Milintsevich et al. [74] | T | RoBERTa-BiLSTM | | | ✓ | ✓ | ✗ |
| Syed et al. [77] | V–A | Fisher Vectors, PLSR | | | ✗ | ✗ | ✗ |
| Hong et al. [75] | T | GNN | | — | ✓ | ✓ | ✗ |
| Rathi et al. [78] | V | Linear Regression | | — | ✗ | ✗ | ✗ |
| Kang et al. [76] | T | BERT | — | | ✗ | ✓ | ✗ |
| Ours (no leakage) | A | CNN-Transformer | | | ✓ | ✗ | ✗ |
| Ours (no leakage) | T | RoBERTa | | | ✓ | ✗ | ✗ |
| Ours (no leakage) | A–T | CNN-Transformer, RoBERTa | | | ✓ | ✗ | ✗ |
| Modality | Metric | Subject Leakage | Val. | Test (Held-out, No Leakage) |
|---|---|---|---|---|
| A | R² | False | | |
| | R² | True | | |
| | MAE | False | | |
| | MAE | True | | |
| T | R² | False | | |
| | R² | True | | |
| | MAE | False | | |
| | MAE | True | | |
| A–T | R² | False | | |
| | R² | True | | |
| | MAE | False | | |
| | MAE | True | | |
| Modality | Metric | 95% CI | Shapiro p | Test | p | Cohen’s d | Adj. p (Holm) | Significant |
|---|---|---|---|---|---|---|---|---|
| A | Val. R² | [, ] | | Wilcoxon | | | | ✓ |
| | Test R² | [, ] | | t-test | | | | ✓ |
| | Val. MAE | [, ] | | t-test | | | | ✓ |
| | Test MAE | [, ] | | t-test | | | | ✓ |
| T | Val. R² | [, ] | | Wilcoxon | | | | ✓ |
| | Test R² | [, ] | | t-test | | | | ✗ |
| | Val. MAE | [, ] | | t-test | | | | ✗ |
| | Test MAE | [, ] | | t-test | | | | ✗ |
| A–T | Val. R² | [, ] | | Wilcoxon | | | | ✓ |
| | Test R² | [, ] | | t-test | | | | ✗ |
| | Val. MAE | [, ] | | t-test | | | | ✓ |
| | Test MAE | [, ] | | t-test | | | | ✗ |
| Aspect | Item # | Checklist Item |
|---|---|---|
| Critical Requirements | | |
| Methodology | 1 | Use a data partitioning scheme with partitions disjoint at the participant level to prevent subject leakage across training, validation, and test sets. |
| | 2 | Preserve a held-out test set with unseen participants and report test performance separately from validation results. |
| | 3 | Incorporate R² into the evaluation protocol as a primary goodness-of-fit metric, in addition to error metrics such as MAE or RMSE. |
| | 4 | Train and validate models in cross-validation schemes or other forms of repeated experiments whenever applicable. |
| | 5 | Report metric variability (e.g., standard deviation) across repeated experiments. |
| Documentation | 6 | Explicitly document the adopted methodology: describe in detail all steps from data preprocessing through training and evaluation to final model selection, including (where applicable), but not limited to, data cleaning and augmentation, training objective and hyperparameters, best-model selection objective, and number of models trained. |
| | 7 | Provide a complete description of the model architecture (i.e., number of layers, dimensionalities, etc.). |
| Best Practices | | |
| Model diagnostics | 8 | Report the model’s calibration curve or error distribution plots to visualize the model’s errors with respect to the ground-truth labels. |
| | 9 | Conduct ablation studies where applicable. |
| Reproducibility | 10 | Make source code with frozen dependency versions publicly available for each part of the study, including model training, evaluation, and statistical analysis of the results. |
| | 11 | Ensure accessibility of the outputs of each experimental run (e.g., logs, metrics, checkpoints). |
| | 12 | Employ containerized environments (e.g., Docker) for your experiments to ensure portability and long-term reproducibility. |
| | 13 | Use open-source experiment tracking tools (e.g., MLflow) to foster experimentation and further ease of adoption; a minimal tracking sketch follows this table. |
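As an assumed, minimal illustration of checklist items 10–13 (not a prescription of specific infrastructure), the sketch below logs the parameters, metrics, and artifacts of one seeded run with MLflow; the run name, metric values, and checkpoint path are placeholders.

```python
# Minimal sketch: track each seeded run with MLflow so that parameters,
# metrics, and artifacts remain inspectable after publication.
# Metric values and the checkpoint path below are placeholders, not results.
import mlflow

with mlflow.start_run(run_name="daicwoz-audio-seed0"):
    mlflow.log_param("seed", 0)
    mlflow.log_param("learning_rate", 1e-4)
    # ... train the model and compute validation metrics here ...
    mlflow.log_metric("val_mae", 0.0)   # placeholder
    mlflow.log_metric("val_r2", 0.0)    # placeholder
    mlflow.log_artifact("checkpoints/best.pt")  # hypothetical checkpoint path
```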
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.