Article

DepressionMIGNN: A Multiple-Instance Learning-Based Depression Detection Model with Graph Neural Networks

1 HACI Laboratory, Sydney Smart Technology College, Northeastern University, Shenyang 110167, China
2 School of Computer Science and Engineering, Northeastern University, Shenyang 110167, China
3 Cloudlore Big Data Technology (Qinhuangdao) Co., Ltd., Qinhuangdao 066600, China
4 Daikin Comfort Technologies, Waller, TX 77484, USA
* Author to whom correspondence should be addressed.
Sensors 2025, 25(14), 4520; https://doi.org/10.3390/s25144520
Submission received: 27 May 2025 / Revised: 10 July 2025 / Accepted: 16 July 2025 / Published: 21 July 2025

Abstract

The global prevalence of depression calls for technological solutions, particularly sensor-based systems, to augment scarce clinical resources for early diagnosis. In this study, we use benchmark datasets that contain multimodal data, including video, audio, and transcribed text. To treat depression detection as the assessment of a chronic, long-term disorder reflected in temporal behavioral patterns, we propose a novel framework that segments videos into utterance-level instances, uses GRUs to obtain contextual representations, and then constructs graphs in which utterance embeddings serve as nodes connected through dual relationships capturing both chronological development and intermittent relevant information. Graph neural networks are employed to learn multi-dimensional edge relationships and align multimodal representations across different temporal dependencies. Our approach achieves superior performance, with an MAE of 5.25 and an RMSE of 6.75 on AVEC2014, and a CCC of 0.554 and an RMSE of 4.61 on AVEC2019, demonstrating significant improvements over existing methods that focus primarily on momentary expressions.

1. Introduction

Depression, also referred to as depressive disorder [1], is a common and serious mental disorder worldwide. However, there is a shortage of mental health workers in most low- and middle-income countries. To aid mental health diagnosis and provide patients with early diagnoses, automatic depression estimation (ADE) technologies have been widely explored in recent years [2].
With the development of deep learning in affective computing [3,4,5,6], researchers have explored how to construct multimodal ADE systems. Some works have focused on feature fusion strategies to combine information from different modalities such as visual, audio, and text, and various fusion methods have been proposed, such as equal-weighted fusion and attention-based fusion [7,8,9]. Other works have developed multimodal neural network architectures [10], including CNNs [11], RNNs, and transformers [12] tailored to each modality before fusion. Some researchers have specifically focused on capturing temporal dependencies by employing multi-scale temporal CNNs and transformers to model long-range dependencies [13,14,15]. However, most existing approaches still focus on identifying transient moments in a video that represent depression symptoms. This setting implicitly assumes that depression is a temporary expression rather than a persistent state.
Limitations of existing methods: Despite the progress made in multimodal ADE systems, existing methods suffer from three major limitations. First, as illustrated in Figure 1a,b, current temporal modeling approaches primarily utilize GRU to analyze sequential connections or employ attention mechanisms to explore bidirectional full connections of temporal information. However, the sequential connection approach fails to account for jump-connected event relationships [16], while the bidirectional full connection approach, though aiming to capture key segments, disrupts the chronological logic of temporal event development to some extent.
Second, according to clinical descriptions of depression symptoms [17,18,19], depression is fundamentally a chronic disorder that progressively worsens over time, manifesting as a persistent state rather than momentary expressions even within a single interaction episode. Current approaches that focus on identifying transient depressive moments fail to capture this essential characteristic of depression as a sustained condition that requires analysis of feature-aware past–future chronological development and associations among intermittent relevant events.
Third, as illustrated in Figure 1(d1), multimodal features from audio, text, and video modalities (represented by different colors for different subspaces) are naturally unaligned in high-dimensional space. Existing alignment and fusion methods, namely, direct concatenation and multi-head attention mechanisms, have significant limitations. Direct concatenation simply connects feature vectors sequentially, ignoring dynamic correlations and temporal dependencies between modalities, resulting in shallow and context-unaware fusion. While multi-head attention mechanisms consider contextual associations, they typically learn all relationships from scratch to capture temporal and crossmodal associations, making them sensitive to data and feature quality and potentially suboptimal. As shown in Figure 1(d2), these approaches lead to incomplete modal alignment and suboptimal feature fusion.
Proposed solution: To address these limitations, this paper proposes a novel graph-based temporal depression representation alignment and learning approach. Our method addresses the first limitation by designing two-dimensional edges (Figure 1c)—forward-full-connection (FFC) and backward-full-connection (BFC)—to construct graphs with multimodal utterance embeddings, enabling consideration of both past–future chronological development and associations among intermittent relevant events. For the second limitation, our approach segments multimodal long-term samples into utterance-level instances processed with bidirectional GRU, maintaining the integrity of chronological development logic while capturing the chronic nature of depression symptoms. To solve the third limitation, we employ relational graph convolutional neural networks (RGCN) [20] and Graph Attention Networks (GATs) [21] for dynamic learning of multimodal features. Through learning to aggregate nodes from different modalities, our graph network effectively aligns and fuses features at the feature level, as demonstrated in Figure 1(d3), successfully aligning all three modalities to the same subspace.
The main contributions of this paper, which address the three problems discussed above, can be summarized as follows:
  • We segment the multimodal long-term sample into multiple instances at the utterance level and process them with the bidirectional GRU, so that the segment-level contextual information can be well captured.
  • We design two-dimensional edges (see Figure 1c), forward-full-connection (FFC) and backward-full-connection (BFC), to construct a graph with multimodal utterance embeddings, so that both the past–future chronological development and the association among intermittent relevant information can be considered.
  • We propose a novel graph-based temporal depression representation alignment and learning approach with a relational graph convolutional neural network (RGCN) [20] and Graph Attention Network (GAT) [21] to analyze the multimodal features by making full use of the two relation types. We use the graph network for dynamic learning of multimodal features, through learning to aggregate the nodes from different modalities, and then effectively align and fuse them at the feature level, as shown in Figure 1(d3); finally, the three modalities are effectively aligned to the same subspace.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed architecture, with a focus on graph network-based multimodal fusion techniques. Section 4 describes the experimental preparation and setup and reports the results with an in-depth discussion and analysis. Section 5 concludes the paper.

2. Related Works

Multimodal data processing encompasses various fusion techniques, including early fusion, late fusion, and model-based fusion. Early fusion, also referred to as feature-level fusion, entails integrating data features from disparate modalities at the initial stage of data processing, thereby facilitating an in-depth exploration of crossmodal interactions and correlations. However, this approach is constrained by intermodal mismatches and the complexity of high-dimensional data. Late fusion, which involves fusing results after each piece of modal data is processed independently, is more suitable when modalities exhibit greater independence and simplifies processing, but may overlook potential correlations between modalities. Model-based fusion, on the other hand, performs fusion directly within the model through learning algorithms, such as graph networks or integrated learning strategies, which can adapt flexibly to the characteristics of different modalities, but necessitate more stringent algorithm design requirements. Notably, early and late fusion techniques often employ attention mechanisms or direct feature splicing, while model-based fusion frequently utilizes graph-based representation learning and edge learning.

2.1. Fusion Based on Feature Splicing

Among the feature fusion strategies, fusion based on dimensional splicing is a commonly used and effective approach. This method fuses multi-source information by concatenating features from different data sources or feature extraction methods along the feature dimension. Through dimensional splicing, features of different types or dimensions can be combined to expand the feature space and improve the model’s representation of the data. The authors of [22] proposed a set of multimodal and multiresolution feature extraction methods for detecting depression from speech and facial marker features, and explored the model performance of early- and late-stage fusion strategies for the audio and video features, respectively. In recent years, deep neural networks have developed rapidly, so more and more neural network-based multimodal fusion methods are being used to better accomplish downstream tasks. To address the challenges of cross-language and cross-cultural depression prediction, the authors of [7] proposed a GRU-based trimodal fusion network for text, video, and audio, which effectively captures complex signals related to depression severity and cross-cultural sentiment recognition, demonstrating the potential and effectiveness of multimodal analysis in affective computing. In the same year, ref. [10] extracted and fused the features of the three modalities using BERT-CNN, VGG-GCNN, and ResNet-50, respectively, and the experimental results showed that feature extraction has a large impact on model performance. To fuse modalities effectively and investigate the impact of global topic information in text and images on the depression detection task, the authors of [23] proposed a Multimodal Topic Augmented Assisted Learning (MTAL) method, which captures topic information within different modalities via a modality-independent topic model capable of mining topic cues from discrete textual signals or continuous visual signals to assist depression detection. The authors of [24] extract features from text and speech separately, apply an attention mechanism to the text to highlight depression-related elements, and finally concatenate the features for depression detection.

2.2. Fusion Based on Attention Mechanisms

In multimodal feature fusion, the attention mechanism plays a crucial role in emphasizing the information most critical for diagnosis by assigning different weights to the features of each modality. For example, when fusing speech, video, and text data, the attention model can learn that subtle changes in facial micro-expressions are more indicative of a patient’s depressive state than tone of voice, and thus prioritize this part of the data. Recent advances in attention-based multimodal fusion have demonstrated significant improvements across various domains. The effectiveness of attention mechanisms in multimodal tasks has been widely validated, with studies showing that enhanced attention networks can effectively integrate information from multiple views or modalities [25]. These findings underscore the importance of sophisticated attention mechanisms for complex multimodal analysis tasks, including depression detection, where different modalities may contribute varying levels of diagnostic information. To efficiently extract depression-related cues from speech and facial activity, ref. [26] proposed an approach that combines a spatial–temporal attention network (STA) and a multimodal attention feature fusion strategy (MAFF), segmenting speech and video data and using the attention mechanism to emphasize depression-related features, finally generating multimodal representations with complementary information across modalities. To fully explore the impact of semantic content and visual information on depression assessment, the authors of [27] proposed applying a Bi-GRU combined with an attention mechanism to effectively recognize depression; the attention mechanism is also utilized to enhance the association learning between the visual and textual modalities and thereby improve the accuracy and efficiency of depression detection. To better extract depression features and fuse audio and text, the authors of [28] used a combination of GRU and BiLSTM to analyze mood fluctuations in the audio and semantic information in the text, and then applied an attention mechanism to the two modalities to fuse their features effectively. To improve the accuracy of sentiment analysis and depression detection, ref. [13] established a tensor-based multimodal transformer model, TensorFormer, whose global cross-attention module and parallel feed-forward module allow information from different modalities to be comprehensively interacted and fused, improving the performance and flexibility of the model when processing multimodal data. In [29], the authors proposed a network based on contextual attention and information interaction mechanisms that can capture important acoustic and visual features at critical time points and extract correlations and interactions between acoustic and visual features at local and global scales.

2.3. Fusion Based on Graph Network

In practical applications, many types of information can be structured as non-grid topologies, such as social networks [30] and character interactions [31], and traditional convolutional neural networks face challenges in processing such non-Euclidean data. Therefore, ref. [32] first proposed the Graph Neural Network (GNN), which can directly process graph structures. However, the classical GNN has limitations, such as using the same parameters across iterations, making it difficult for the model to learn deeper feature representations. To address this, ref. [20] proposed transferring information from neighboring nodes to the target node to perform graph convolution operations, which was the first time the convolution operation from image processing was applied to graph-structured data. Nevertheless, Graph Convolutional Networks (GCNs) assign the same weight to different neighbors of the same order, limiting the GCN model’s ability to capture the relevance of spatial information. More recently, attention mechanisms have attracted increasing interest, and [21] proposed using attention for the weighted summation of neighboring node features, where the weights depend entirely on the node features and are independent of the graph structure. Owing to the rapid development of graph networks in prediction tasks, researchers have applied them to disease detection. For multimodal graph embedding in depression detection, ref. [33] proposed a novel hierarchical context-aware graph attention model for automated depression detection, simulating the hierarchical structure of depression assessment and using a graph attention network to capture relational contextual information in the text and audio modalities. To model dynamic fusion between modalities, ref. [34] proposed a multi-head intermodal attention mechanism based on GAT. By exploiting the ability of graph attention networks to capture complex dynamic relationships between modalities, and by using multi-head attention, the model can focus on different subsets of information simultaneously, enhancing multimodal data fusion. This intermodal attention mechanism enables the model to better understand and integrate information from different modalities, such as speech, text, and visual signals. To explore the heterogeneity/homogeneity among modalities, ref. [35] proposed a multimodal fusion method called MS2-GNN, which explores the heterogeneity/homogeneity among multiple physiological and psychological modalities and studies the differential relationships among individuals; modality-shared and modality-specific GNN architectures are utilized to extract intermodal and intramodal features, respectively. However, this method does not consider that depression is a chronic long-term illness: it does not model the entire session temporally, does not model the node embeddings of individual sessions using graph networks, and does not analyze jump-connection event relationships.
To capture long-term temporal patterns of depression and achieve effective multimodal fusion, we segment data into discourse-level instances and obtain contextual representations via GRU. These are embedded as nodes in a graph, where RGCN and GAT are used to model intermodal relationships and temporal dependencies. As summarized in Table 1, prior studies mainly rely on handcrafted fusion or attention-based methods, often overlooking the chronic and sequential nature of depression. Our approach addresses this gap by introducing dual-edge temporal graph modeling, enabling fine-grained alignment and session-level behavioral reasoning.

3. Materials and Methods

The proposed model introduces a new approach to analyzing long-term and episodic temporal dependencies of depression. The experimental data are sourced from established multimodal datasets, which include synchronized video, audio, and text streams. These modalities are collected using sensors such as RGB cameras and condenser microphones—commonly embedded in consumer-grade devices like webcams and smartphones—under controlled recording environments. The video recordings provide facial expressions and head movements, while the audio captures vocal tone, rhythm, and energy. These sensor-acquired signals offer crucial cues for assessing mental health status. Unlike previous approaches that mainly rely on momentary or handcrafted fusion mechanisms, our framework is tailored to depression’s long-term and episodic nature by learning temporal dependencies through a graph-based formulation.
First, video is segmented into utterance-level instances and encoded into contextual representations, so as to analyze short-term features. These contextual representations are processed as nodes in a graph with dual connections to model both chronological development and relevant intermittent information among nodes. This allows the model to capture multi-dimensional temporal dependencies that are critical for understanding depression disorders.

3.1. Preprocessing

The multimodal data used in this study were originally collected through commonly used sensor devices. Specifically, visual signals were captured using RGB cameras at a frame rate of 30 frames per second (FPS), and audio signals were recorded through standard microphones at a sample rate of 16 kHz. These sensors recorded participants’ facial expressions, vocal characteristics, and spoken content under controlled interview or conversational settings. The transcribed text was obtained by applying automatic speech recognition (ASR) to the audio recordings. Regarding utterance-level segmentation, we directly utilize the transcriptions provided by the AVEC2019 dataset, which were pre-processed by the dataset creators using automatic speech recognition. The dataset provides utterance-level transcriptions with corresponding timestamps. We simply extract features from each pre-defined utterance segment: audio features from the temporal boundaries specified in the dataset, visual features from the corresponding video frames, and textual features from the provided transcribed content. This approach ensures consistent temporal alignment across all three modalities based on the dataset’s established utterance boundaries.
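For illustration, the following minimal Python sketch shows how per-utterance slices of the three modalities could be assembled from the dataset-provided timestamps; the column names (start_time, stop_time, value) and the function name are our own assumptions for the sketch, not the dataset’s exact schema or the authors’ released code.

import pandas as pd

FPS = 30              # video frame rate reported above
SAMPLE_RATE = 16000   # audio sample rate in Hz

def segment_by_utterance(transcript_csv, audio_wave, video_frames):
    """Slice audio samples and video frames according to utterance boundaries."""
    df = pd.read_csv(transcript_csv)  # assumed columns: start_time, stop_time, value (text)
    utterances = []
    for _, row in df.iterrows():
        a_start, a_end = int(row.start_time * SAMPLE_RATE), int(row.stop_time * SAMPLE_RATE)
        v_start, v_end = int(row.start_time * FPS), int(row.stop_time * FPS)
        utterances.append({
            "audio": audio_wave[a_start:a_end],     # waveform slice for OpenSmile/eGeMAPS features
            "frames": video_frames[v_start:v_end],  # aligned face frames for the CNN backbone
            "text": row.value,                      # transcribed utterance for the text encoder
        })
    return utterances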
Assume that we have a dataset $D = \{S_{id}, Y_{id}\}_{id=1}^{N_s}$, where $S_{id}$ is the sample of a subject consisting of $[X_{id}^{A}, X_{id}^{V}, X_{id}^{T}]$. $X_{id}^{A}$ represents the acoustic features extracted from speech, $X_{id}^{V}$ represents the visual features extracted from facial expressions, and $X_{id}^{T}$ represents the linguistic features extracted from the speech transcripts. $Y_{id}$ is the corresponding BDI or PHQ-8 score. The BDI (Beck Depression Inventory) is a 21-item self-report measure designed to assess depressive symptom severity. The PHQ-8 (Patient Health Questionnaire-8) is a clinically validated 8-item scale widely adopted for screening and grading depression severity. The proposed approach considers information from the audio, visual, and text modalities. In order to ensure that all three modalities are available at the same time and aligned with each other, we segment samples according to the speaker’s utterance timestamps and re-organize them in temporal order.
Acoustic features: Acoustic features were extracted using the OpenSmile toolkit with the IS10 configuration [36]; $X_{id}^{A} = [x_0^a, x_1^a, \ldots, x_n^a]$, where $id$ is the index of the sample and $n$ is the number of utterances. The same notation applies below. We divided each speech recording using a sliding window of 4 s with a 1 s step size, and extracted feature vectors within each window using the Bag-of-Audio-Words eGeMAPS [37] approach. The 100-dimensional feature vector is determined empirically based on the Bag-of-Audio-Words approach, which is commonly used in the audio processing literature for effective acoustic representation. Finally, the dimension of $X_{id}^{A}$ is $(n, 100)$.
Visual features: Facial features were extracted from video clips using a DenseNet pretrained on the FER+ dataset [38] as $X_{id}^{V} = [x_0^v, x_1^v, \ldots, x_n^v]$. The DenseNet was pretrained on the FER+ dataset to classify eight basic emotions using cross-entropy loss. We first extract the facial expression region and align it with the OpenFace toolkit, and then feed the aligned face images to the pretrained ResNet-50 [39] to obtain the deep representation. The 2048-dimensional vector follows the standard output dimension of ResNet-50’s feature layer, which is widely adopted for visual feature representation in multimodal analysis.
Textual features: We utilized a fine-tuned RoBERTa Large model [40] for the text transcripts, appending an <S> token to the tokenized utterances, yielding $X_{id}^{T} = [x_0^t, x_1^t, \ldots, x_n^t]$. We employed a pretrained BERT model to convert the transcript into sentence embeddings. The 768-dimensional representation corresponds to the hidden state size of the BERT model, which is the standard dimension for BERT-based text embeddings. The dimension of the textual features is $(n, 768)$.

3.2. Model Architecture

The proposed DepressionMIGNN has 3 procedures as depicted in Figure 2: (1) Multimodal Context Encoding, (2) Graph Construction and Transformation, and (3) Score Prediction.
Multimodal Context Encoding: As the features obtained from the pre-processing procedure only reflect short-term information, to account for contextual information, inspired by [41], bidirectional GRUs were utilized to update features across time-steps. The following equations adapted from [42] demonstrate the process.
$$ c_i^{[a,v,t]},\; h_i^{[a,v,t]} = \mathrm{GRU}\!\left(x_i^{[a,v,t]},\; h_{i-1}^{[a,v,t]}\right) \qquad (1) $$
where $\mathrm{GRU}(\cdot)$ indicates that the outputs of the forward GRU and the backward GRU are concatenated along the channel dimension. Here, $h_i^{[a,v,t]}$ represents the updated hidden state for each modality, and $c_i^{[a,v,t]}$ denotes the context-enhanced representation obtained by concatenating the forward and backward outputs, which is then used for subsequent multimodal fusion. Each modality is processed with an independent GRU, and their weights are not shared. After the contextual representation is obtained for each modality, the representations of the three modalities are concatenated to jointly represent a short-term segment:
$$ c_i^m = c_i^a \,\|\, c_i^v \,\|\, c_i^t; \qquad C_i = \left[c_0^m, c_1^m, \ldots, c_n^m\right] \qquad (2) $$
where $C_i$ denotes the multimodal context matrix assembled from the short-term segment representations.
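As an illustrative sketch of the context encoding in Equations (1) and (2), one independent bidirectional GRU per modality can be used as follows; the module and variable names are ours, and the hidden size is a placeholder apart from the value reported in Section 4.2.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, dim_a=100, dim_v=2048, dim_t=768, hidden=200):
        super().__init__()
        # one independent bidirectional GRU per modality (weights are not shared)
        self.gru_a = nn.GRU(dim_a, hidden, batch_first=True, bidirectional=True)
        self.gru_v = nn.GRU(dim_v, hidden, batch_first=True, bidirectional=True)
        self.gru_t = nn.GRU(dim_t, hidden, batch_first=True, bidirectional=True)

    def forward(self, x_a, x_v, x_t):
        # each input: (batch, n_utterances, feature_dim)
        c_a, _ = self.gru_a(x_a)   # (batch, n, 2*hidden): forward/backward outputs concatenated
        c_v, _ = self.gru_v(x_v)
        c_t, _ = self.gru_t(x_t)
        # c_i^m = c_i^a || c_i^v || c_i^t, as in Equation (2)
        return torch.cat([c_a, c_v, c_t], dim=-1)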
Graph Construction and Feature Transformation: Given the multimodal contextual representation, we construct a temporal graph $G = (V, E, R, W)$, where each node represents a segment (utterance) from the sample. Specifically, each node $v_i \in V$ corresponds to the multimodal feature representation $c_i^m$ of an utterance segment with dimension 160. $V$ refers to the set of nodes (utterance segments), $E$ denotes the set of edges connecting these segments, $R$ represents the types of temporal relationships among consecutive and non-consecutive utterances (including forward and backward temporal connections), and $W$ contains the corresponding edge weights, which are learned and optimized during training. The edges $e_{ij} \in E$ are generated based on temporal proximity and semantic similarity between utterance segments $c_i^m$ and $c_j^m$. Each edge $e_{ij}$ has a relation type $r \in R$ and a weight $\alpha_{ij} \in W$ with $0 \le \alpha_{ij} \le 1$, which is updated through the training process.
Additionally, a window of range $N_w$ is implemented around a central utterance when constructing the graph to constrain the number of nodes. This aims to improve the model’s ability to capture relationships within a specific time period. In detail, we select $\frac{N_w}{2}$ sentences before and after the current central utterance to form the graph nodes, meaning that $\frac{N_w}{2}$ refers to the number of sentences on each side of the central one. The graph hops along the time sequence with a step length of 1.
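The window-constrained node selection, together with the two relation types introduced below (FFC and BFC), can be sketched as follows; this is our illustrative reading of the construction, not the authors’ released code.

import torch

def build_edges(num_nodes, window=20):
    """Connect every node to all nodes within +/- window/2 positions,
    tagging future links as relation 0 (FFC) and past links as relation 1 (BFC)."""
    half = window // 2
    src, dst, rel = [], [], []
    for i in range(num_nodes):
        for j in range(max(0, i - half), min(num_nodes, i + half + 1)):
            if j == i:
                continue
            src.append(i)
            dst.append(j)
            rel.append(0 if j > i else 1)   # 0: forward-full-connection, 1: backward-full-connection
    edge_index = torch.tensor([src, dst], dtype=torch.long)  # shape (2, num_edges), PyG convention
    edge_type = torch.tensor(rel, dtype=torch.long)
    return edge_index, edge_type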
As Figure 2 demonstrates, we designed two relation types, FFC and BFC. The adjacent matrices are learnable with an attention-like process as Equation (3) shows.
$$ \alpha_{ij} = \mathrm{softmax}\!\left( (c_i^m)^{\mathsf T}\, W_e^r \left[ c_{\,i-\frac{N_w}{2}}^m, \ldots, c_{\,i+\frac{N_w}{2}}^m \right] \right) \qquad (3) $$
where $W_e^r$ is a trainable weight matrix for relation type $r$. Once the graph is constructed, we update the features with graph neural networks. First, we use the RGCN to update the node representations according to our definition of both FFC and BFC connections, considering the past–future chronological development.
$$ h_i^{(1)} = \sigma\!\left( \sum_{r \in R} \sum_{j \in N_i^r} \frac{\alpha_{ij}}{\left| N_i^r \right|}\, W_r^{(1)} c_j^m \;+\; \alpha_{ii}\, W_0^{(1)} c_i^m \right) \qquad (4) $$
where $W_r^{(1)}$ and $W_0^{(1)}$ refer to weight matrices learned during training. The variable $\alpha$ represents the edge weights, and $N_i^r$ is the set of indices of the nodes in the neighborhood of node $i$ under relation $r$. The activation function is denoted by $\sigma$. Next, in order to capture the association among intermittent relevant information, we utilize GAT with the RGCN output features $h_i^{(1)}$ as input to compute new connection weights $\alpha$ based on the updated features, followed by another round of feature updating with the new weights, as shown in Equation (5), which is derived from the Graph Attention Network formulation in [21]:
$$ \alpha_{ij} = \frac{\exp\!\left( a^{\mathsf T}\, \mathrm{LeakyReLU}\!\left( W_g^{(2)} \left[ h_i^{(1)} \,\|\, h_j^{(1)} \right] \right) \right)}{\sum_{j' \in N_i^r} \exp\!\left( a^{\mathsf T}\, \mathrm{LeakyReLU}\!\left( W_g^{(2)} \left[ h_i^{(1)} \,\|\, h_{j'}^{(1)} \right] \right) \right)}, \qquad h_i^{(2)} = \alpha_{ii}\, W_g^{(2)} h_i^{(1)} + \sum_{j \in N_i^r} \alpha_{ij}\, W_g^{(2)} h_j^{(1)} \qquad (5) $$
where $W_g^{(2)}$ represents a learnable weight matrix, $a$ represents a parameterized weight vector, and $\mathrm{LeakyReLU}$ is an activation function.
To stabilize the learning process of self-attention, we found it beneficial to extend our mechanism to multi-head attention. Specifically, $K$ independent attention mechanisms execute the transformation of Equation (5), and their features are then concatenated, resulting in the following output feature representation:
$$ h_i^{(2)} = \Big\|_{k=1}^{K} \left( \alpha_{ii}^{k}\, W_g^{(2),k} h_i^{(1)} + \sum_{j \in N_i^r} \alpha_{ij}^{k}\, W_g^{(2),k} h_j^{(1)} \right) \qquad (6) $$
where $\|$ represents concatenation, $\alpha_{ij}^{k}$ are the normalized attention coefficients computed by the $k$-th attention mechanism, and $W_g^{(2),k}$ is the corresponding input linear transformation’s weight matrix.
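A minimal sketch of this two-stage update using standard PyTorch Geometric layers is given below: RGCNConv covers the two relation types of Equation (4), and a multi-head GATConv plays the role of Equations (5) and (6). Channel sizes are assumptions, and the library layers approximate rather than reproduce the exact weighting above.

import torch
import torch.nn as nn
from torch_geometric.nn import RGCNConv, GATConv

class GraphFusion(nn.Module):
    def __init__(self, in_dim, hid_dim=128, heads=4):
        super().__init__()
        self.rgcn = RGCNConv(in_dim, hid_dim, num_relations=2)          # FFC + BFC relations
        self.gat = GATConv(hid_dim, hid_dim, heads=heads, concat=True)  # K attention heads, outputs concatenated

    def forward(self, x, edge_index, edge_type):
        h1 = torch.relu(self.rgcn(x, edge_index, edge_type))  # relation-aware neighborhood aggregation
        h2 = self.gat(h1, edge_index)                          # attention re-weighting of intermittent links
        return h2                                              # (num_nodes, hid_dim * heads)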
Subject-Level Prediction: It is important to clarify that the AVEC datasets used in our study provide depression severity scores at the subject level, not at the utterance level. Therefore, all utterances from a given subject share the same label. Following the Multiple-Instance Learning (MIL) paradigm, we segment each video into utterance-level instances to allow the model to capture fine-grained temporal patterns within the session. These utterances are treated as instances in a bag, and the model is trained to make predictions based on the collective evidence of all utterances.
The MIL formulation offers several advantages for depression detection: it enables automatic identification of the most diagnostically relevant utterances rather than treating all segments equally, handles noise in long-term behavioral data by focusing on discriminative instances, and aligns with clinical assessment practices where depression is evaluated based on overall symptom patterns. The subject-level prediction is obtained by averaging the instance-level predictions from all utterances, ensuring the final score reflects collective evidence while allowing the model to emphasize more informative temporal segments through learned representations. This approach enables our framework to focus on the most informative segments while still producing a subject-level prediction, aligning with the nature of depression as a long-term disorder.
Score Prediction: To predict the BDI/PHQ-8 score, we first concatenate the contextual representations c i m and node features h i ( 2 ) . Then, we perform predictions for each utterance-level short-term information and calculate the average score of all utterances as the final estimation result. The PHQ-8 and BDI scores are continuous values that quantify depression severity, where higher scores indicate more severe depressive symptoms. These scores are provided as ground truth labels in the AVEC dataset at the subject level. Our model outputs a regression score that directly corresponds to these clinical assessment scales, enabling clinicians to interpret the results in terms of established depression severity categories.
$$ \hat{y}_i = \sigma\!\left( W_c \left[ c_i^m \,\|\, h_i^{(2)} \right] + b_c \right); \qquad \hat{Y}_{id} = \frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \qquad (7) $$
where $\sigma$ denotes the activation function, $\hat{y}_i$ is the predicted score for utterance $i$, $W_c$ is a learnable weight matrix, and $b_c$ is the bias.
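For illustration, a sketch of the utterance-level regression head and the subject-level averaging of Equation (7) is shown below; the choice of ReLU for the activation $\sigma$ and the layer sizes are our assumptions.

import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    def __init__(self, context_dim, node_dim):
        super().__init__()
        self.fc = nn.Linear(context_dim + node_dim, 1)

    def forward(self, c, h):
        # c: (n_utterances, context_dim), h: (n_utterances, node_dim)
        y_hat = torch.relu(self.fc(torch.cat([c, h], dim=-1))).squeeze(-1)  # per-utterance scores
        return y_hat.mean()  # subject-level BDI/PHQ-8 estimate = mean over all utterances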
We adopted Concordance Correlation Coefficient (CCC) loss as the cost function during the training procedure.
$$ \mathrm{CCC} = \frac{2 \rho\, \sigma_f \sigma_y}{\sigma_f^2 + \sigma_y^2 + \left( \mu_f - \mu_y \right)^2}, \qquad \mathcal{L}_{\mathrm{CCC}} = 1 - \mathrm{CCC} \qquad (8) $$
where $\rho$ is the Pearson correlation coefficient, $\mu_f$ and $\mu_y$ are the mean values of the predictions and ground-truth labels, respectively, and $\sigma_f$ and $\sigma_y$ are the corresponding standard deviations. The CCC ranges from −1 to 1, where 1 denotes an ideal positive correlation and −1 denotes a completely negative correlation.
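The CCC loss of Equation (8) can be transcribed directly into a differentiable PyTorch function, for example:

import torch

def ccc_loss(pred, target, eps=1e-8):
    mu_f, mu_y = pred.mean(), target.mean()
    var_f, var_y = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - mu_f) * (target - mu_y)).mean()            # equals rho * sigma_f * sigma_y
    ccc = 2 * cov / (var_f + var_y + (mu_f - mu_y) ** 2 + eps)
    return 1.0 - ccc                                           # L_CCC = 1 - CCC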

4. Experiment

In this section, we introduce the public benchmark dataset, the baseline methodology used for comparisons, as well as the detailed experimental results and visualizations.

4.1. Dataset and Baselines

We conduct experiments on two widely used multimodal depression datasets, AVEC2014 and AVEC2019, summarized in Table 2. AVEC2014 includes audio–visual data from 82 subjects across two tasks (Northwind and Freeform), with depression levels labeled using the BDI. The two tasks were combined for the experiments and equally divided into a training set, development set, and test set, resulting in 100 video samples per set. Each set therefore includes approximately 100–200 min of video data in total, depending on the response duration of each subject, and the average duration of each sample is 1 to 2 min. AVEC2019 is built upon the E-DAIC corpus and contains data from 275 subjects, with depression levels annotated using the PHQ-8. In AVEC2019, the transcripts were derived from semi-automatic transcriptions with manual correction. The entire AVEC2019 corpus contains approximately 73 h of semi-clinical interviews, with the training set covering about 43 h and the development and test sets each around 15 h.
To validate our model, we compare it with a series of baselines ranging from traditional machine learning regressors to recent deep multimodal models. These methods include handcrafted feature pipelines, attention-based fusion, temporal modeling, and transformer-based approaches. Table 3 provides a summary of representative baselines and their key modeling strategies.

4.2. Settings and Metrics

We develop DepressionMIGNN using PyTorch 1.6.0 and PyG 1.6.3 as frameworks. The dimensions of the processed audio, visual, and textual features in the preprocessing procedure are 100, 2048, and 768, respectively. The number of cells in the GRU used in the Multimodal Context Encoding is 200, and the context window size ($N_w$) is 20 utterances. The learning rate was set to $1 \times 10^{-4}$ with a weight decay of $1 \times 10^{-8}$. The model was trained for 500 epochs with a batch size of 50. Training was conducted on a single NVIDIA RTX 4070 GPU (12 GB), with a peak memory usage of approximately 5 GB. On average, each training epoch took around 12 s. Our proposed model takes approximately 100 min to train on the training set and around 20 min to evaluate on the test set.
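For reference, the hyperparameters listed above can be collected into a single configuration object; the field names below are illustrative, not taken from the authors’ code.

config = {
    "feature_dims": {"audio": 100, "visual": 2048, "text": 768},
    "gru_hidden": 200,       # cells in each bidirectional GRU
    "window_size": 20,       # N_w, utterances per graph window
    "attention_heads": 4,    # K heads in the GAT (see Section 4.3.5)
    "learning_rate": 1e-4,
    "weight_decay": 1e-8,
    "epochs": 500,
    "batch_size": 50,
}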
To evaluate and compare the performance of our method with selected baselines, we utilized two commonly used metrics in previous studies: the Concordance Correlation Coefficient (CCC) and the Root Mean Square Error (RMSE).

4.3. Results

4.3.1. Sensor Requirements and Modal Usage

To apply the proposed depression detection method in real-world settings, only two commonly available sensors are required: an RGB camera and a microphone. The RGB camera captures facial movements and visual cues, while the microphone records acoustic signals during spontaneous speech. These sensors are typically embedded in consumer devices such as laptops and smartphones, making the system hardware-efficient and highly deployable.
In our implementation, the recorded audio and video streams are temporally synchronized and segmented into utterances. Visual frames are aligned and processed using a pretrained ResNet-50 to extract facial features, while acoustic signals are analyzed using the eGeMAPS feature set via the OpenSmile toolkit. Transcripts are generated through automatic speech recognition (ASR), and semantic features are encoded using a pretrained RoBERTa model. These three modalities are fused and modeled through our graph-based architecture to capture both short- and long-term temporal patterns indicative of depressive symptoms.

4.3.2. Comparison Experiment Results

Table 4 presents the results of DepressionMIGNN and the baselines. The Parameters column indicates the number of model parameters. As the table shows, our proposed model outperforms the selected baselines on the CCC metric. To elucidate the reason for this improvement, we examine the fundamental differences between our approach and the baselines. The baseline approaches in refs. [7,10,43,44,45,46,48,50] focus on fusing features without placing much emphasis on the relationships within temporal information. In the AVEC2014 experiment, ref. [26], with 2.6 M parameters, achieved a slightly better MAE than ours, but we improved significantly on RMSE compared with this model. Although [8], with 4.2 M parameters, used a multiscale dilated CNN to consider temporal information, this approach is limited in handling temporal features and cannot consider long time intervals at once. Although [47], with 32.1 M parameters, can capture global and local spatial–temporal information, its parameter count is very large. Although [13], with 65 M parameters, exploited the transformer’s ability to capture long-range dependencies, the attention mechanism’s treatment of temporal information aims to capture key segments and is somewhat disruptive, without considering the back-and-forth logic of temporal event development. In addition, ref. [53], while considering the effect of affective states on depression, did not adequately consider contextual information or the prolonged dependency of depression during the interaction. These methods therefore do not fully match the chronic, sporadic, and intermittent characteristics of depressive symptoms. Our FFC and BFC are designed to preserve the chronological development logic of depressive symptoms at the video level, which aligns with the chronic nature of the condition. Our graph-based method also utilizes the adjacency matrix to capture non-sequential connections between temporal events, in line with the sporadic and intermittent nature of depressive symptoms. This allows for a more comprehensive representation of the temporal dynamics of depression, which ultimately improves the accuracy of depression estimation.
As shown in Table 4, our proposed model achieves competitive performance across both datasets. On AVEC2014, we obtain the lowest RMSE (6.75) among all models and a near-best MAE (5.25), slightly higher than MAFF (5.21). On AVEC2019, our model achieves a CCC of 0.554, which is on par with the best-performing method FAU-GF (0.555) and outperforms most other baselines. Our RMSE (4.61) is also competitive, only slightly higher than TensorFormer (4.31), while using less than 1/10th of its parameters (6.1 M vs. 65 M).
These results demonstrate that our model achieves a strong trade-off between accuracy and computational efficiency. In particular, the high CCC scores validate the effectiveness of our temporal graph-based fusion in capturing long-term depressive patterns.

4.3.3. Ablation Study

In order to investigate the amount of depression information contained in different modalities, an ablation study is conducted by comparing the performance of different combinations of modalities. In addition, we discuss the contribution of the components employed in the proposed model. Specifically, we removed multi-instance learning, GNN, GAT, and the designed relations one by one, and compare the performance of different model configurations.
Table 5 shows the performance of the unimodal and multimodal models on the AVEC2014 dataset and AVEC2019 dataset, where A denotes audio modality, V denotes visual modality, and T denotes text modality. As shown in Table 5, among the single modalities, the textual modality (T) achieves the best performance, highlighting the importance of semantic information in depression detection. Among bimodal combinations, T + A performs the best, indicating strong complementarity. V + A does not outperform unimodal A, suggesting limited contribution from visual features alone. The full trimodal configuration (T + A + V) achieves the best results overall, confirming the effectiveness of multimodal integration despite modality-specific noise.
Furthermore, we discuss the contribution of the components employed in the model. Specifically, we remove the multi-instance learning strategy, the GAT module, the GNN module, and the bidirectional relation to compare the performance advantages and disadvantages of the different models, and the results are also shown in Table 5. It can be found that the absence of any of these components leads to a sharp drop in performance.
The multi-instance learning strategy enhances the model’s understanding of time-series data by segmenting video samples into multiple instances at the utterance level and combining them with a GRU model to extract rich contextual information. This enables the model to capture detailed information within each instance and broader temporal dependencies through the contextual relationships between instances, thus improving the accuracy of depression prediction. Without the multi-instance module, the model loses the ability to understand video content at a fine-grained level, resulting in a significant decrease in its ability to capture temporal dependencies and in overall confidence.
The GNN module, particularly with the bidirectional connectivity relations, significantly enhances the model’s ability to fuse time-series data, enabling it to combine past and future information and more accurately reflect the complex associations between temporal development and intermittently relevant information. This approach shows clear advantages in capturing subtle changes in depression video samples and is particularly important for understanding and predicting the time-dependent manifestations of depression. We also found that introducing only the GCN without the GAT module, although improved relative to the model that does not use a GNN at all, still performs poorly relative to the model that uses GCN and GAT together. This is because the GCN updates the representation of each node mainly by averaging or summing the features of neighboring nodes; although this captures graph structural information, it is not sufficient to capture all important feature interactions and subtle differences between modalities. In contrast, GAT captures relationships between nodes in a more fine-grained way by assigning different importance to different edges through the attention mechanism. Finally, we conducted an ablation experiment on the bidirectional connectivity relations by considering only the sequential order of utterances, and found that the model’s predictive performance is unsatisfactory because important contextual information and interaction patterns are lost. Bidirectional connections help capture the interplay and mutual support between utterances, reflecting their true complexity. When only unidirectional utterance-order relationships are considered, the model cannot fully understand the dynamic associations between user behaviors and mental states, which reduces its ability to identify depression.

4.3.4. Effect of Different Numbers of Windows

To investigate the optimal context window size for capturing sporadic depressive symptoms, we conducted experiments with varying window sizes. Using (10,10) as the baseline, we tested increasing window sizes of (12,12), (15,15), and (20,20), and decreasing window sizes of (8,8), (5,5), and (0,0). The results (Table 6) show that the model’s performance degrades gradually as the window size deviates from the optimal value of 20 utterances, suggesting that this interval is most indicative of depressive mood expression in this dataset. Notably, the model with a context window size of (0,0), equivalent to a sequential encoder-only model, performs significantly worse, highlighting the importance of contextual information.

4.3.5. Effect of Different Numbers of Heads

Since this model uses an attention mechanism in the graph, and the number of attention heads is an adjustable parameter, multiple attention heads allow the network to learn the relationships between nodes from different perspectives. To study the effect of the number of attention heads on model performance, we set the number of attention heads to three, four, five, and six; the results, shown in Table 7, indicate that the model is most effective with four attention heads. Figure 3 shows the experimental results on the AVEC2019 dataset. Increasing the number of attention heads means the model can capture information from more subspaces, but it also increases the number of parameters. Too few attention heads limit the model’s ability to capture information, while too many make the model overly complex and prone to overfitting. With four heads, the model reaches the best balance between parameter count and learning ability, effectively capturing diverse feature information without overfitting. To understand the effect of attention heads more comprehensively, we conducted combination experiments with different context window sizes and numbers of heads, and obtained an optimal combination of $N_w$ = 20 and four heads; the experimental results on the AVEC2019 dataset are shown in Figure 4. These combination experiments further verify the effectiveness of the model and provide a reference for selecting the optimal number of attention heads and context window size in different application scenarios.

4.4. Visualization

In this section, to better explore the performance of the comparison models, we perform several visualizations of the experiments from the previous section, including fitting the true and predicted scatterplots using regression lines, visualization of the GAT attention weights, and visualization of the model’s hidden layer representation.

4.4.1. Visualization of Scatterplots

In order to compare the predictive performance of the different models, we visualize scatterplots of the true and predicted values. The distribution of data points in a scatterplot reflects the correlation between the predicted and true values (i.e., how the predicted values change as the true values change). Ideally, the points should lie along the line y = x (the gray dotted line in each subplot), indicating perfect prediction. If the points are arranged roughly along some straight line other than y = x, the predicted values have a linear correlation with the true values but contain prediction error. To better illustrate the prediction trend, we fit a regression line to the scatter points, highlighted in blue in each subplot. To investigate the effect of the context window size and the number of attention heads on the different ablation models, we visualize the scatter of the predictions separately; the results for the context window sizes and the attention heads are shown in Figure 5 and Figure 6, respectively, where each row corresponds to a different ablation model and each column to a different context window size or number of attention heads. As can be seen from the figures, the regression line between the predicted and true values is closest to 45 degrees when using the three modalities T + A + V with $N_w$ = 20 and four heads, indicating that this model predicts most accurately. The regression lines of the other models deviate more from y = x and their predicted values are more dispersed; moreover, some models produce a large span of predicted values for a single true value, such as the T + V modalities with $N_w$ = 10, which shows that their predictions are not as good as those of the model proposed in this paper.

4.4.2. Visualization of Attention Weights

We visualize the attention weights of GAT and multi-head attention mechanisms to illustrate the relationships among timesteps (Figure 7). The weight graphs reveal distinct attention patterns: multi-head attention focuses on consecutive partial utterances, resulting in chunk-like attention weights, while GAT exhibits jump-like attention weights, aligning with the intermittent nature of depression symptoms (Problem 1). The sparser distribution of attention weights in GAT indicates higher model confidence, whereas the denser distribution in multi-head attention suggests lower confidence. In depression detection tasks, GAT’s ability to identify and reinforce intermittent and discontinuous associations makes it a more suitable choice. The jumping attention weights of GAT, combined with the intermittent manifestation of depression symptoms, enable effective recognition of critical but discontinuous symptom patterns. Furthermore, the GAT model’s attention weights are more scattered and adaptive, allowing it to capture complex relationships between timesteps, whereas multi-head attention tends to focus on local patterns. This adaptability makes GAT more effective at handling the variability and unpredictability of depression symptoms. Additionally, the visualization results show that GAT attention weights are more consistent across different attention heads, indicating a more robust and reliable attention mechanism. Overall, the visualization results demonstrate the superiority of GAT in capturing intermittent and discontinuous associations in depression detection tasks.

4.4.3. Visualization of Hidden Layer Representations

We employed t-SNE dimensionality reduction to visualize the hidden layer representations of the three parts of each model in the ablation experiments. The results are shown in Figure 8. The modal features after GRU processing are depicted in Figure 8a, which corresponds to the output C in the second part of Figure 2. The figures reveal discriminative clusters, with each color representing a modal feature. The blue dots represent textual features, green dots represent audio features, and red dots represent visual features. Figure 8(a1) shows the feature visualization of the three-modal model, where the dots of each color represent the hidden layer representations of the corresponding modal features after GRU processing. The t-SNE results indicate that these dots form distinctly separated clusters in the two-dimensional space, suggesting that the features of different modalities have distinguishable spatial representations after GRU processing. This distribution implies complementarity between the different modalities in the depression detection task, where each modality provides unique and distinguishable information. Figure 8(a2–a4) show visualizations of the three bimodal models. Notably, the audio and text modal internal features exhibit relatively concentrated clustering, whereas the text–video and audio–video modal features exhibit more dispersed clustering. This is because audio and text features are processed to maintain their original structural information, whereas video features are more decentralized due to their unstructured and complex nature. This difference is attributed to the fact that video data contains richer environmental information and dynamic changes, which are more difficult to retain during dimensionality reduction. Moreover, the video modal information is more diffusely distributed in each model, indicating that the video modality is not effective for depression prediction, consistent with the results in Table 5, where a single–video modal model has the worst performance.
We also visualized the hidden layer features of different graph network modules in the depression detection model after processing multimodal data by t-SNE. The results demonstrate the difference between the GCN module alone (Figure 8b) and the GCN module followed by GAT-weighted fusion (Figure 8c). Figure 8b compares the original multimodal features through the distribution of the output features formed by the GCN-processed features in the two-dimensional space, demonstrating the GCN’s ability to capture and distinguish different modal features. However, the aggregation effect is not significant enough compared to Figure 8c. Moreover, there are many independent feature points inside each Figure 8b, indicating that the fusion effect is not good enough. As shown in Table 5, using only GCN to aggregate depression information results in unsatisfactory depression prediction performance due to its inability to dynamically adjust the importance of different neighboring nodes. In contrast, Figure 8c demonstrates the effect of the GCN output features after the second dynamic weighted fusion of GAT, which significantly enhances the differentiation and aggregation of the features, forming more compact clusters. By comparing Figure 8b and Figure 8c, we can clearly see how GAT-weighted fusion optimizes the model performance and further improves the accuracy of depression detection.
To further verify the effectiveness of the proposed model, we compared the output features of the graph fusion model with the output features of the attention mechanism fusion model (Figure 8d). The attention mechanism fusion model can only align some of the features fused, prioritizing features considered more informative. In contrast, the graph fusion model proposed in this paper shows its remarkable ability to capture overall feature alignment, providing a global feature representation by exploiting the connectivity patterns between nodes. This global perspective allows the graph network to be more comprehensive and in-depth when fusing features from different sources. Instead of considering each node in isolation, each node is evaluated in the context of the entire graph, enabling the model to capture subtle feature changes. Thus, the effectiveness of the model proposed in this paper is further validated.

5. Conclusions

In this work, we proposed a novel graph-based multimodal framework for automatic depression severity estimation, tailored to capture the temporally irregular nature of depressive behaviors. The model utilizes data collected via common sensors—RGB cameras and microphones—to obtain visual and acoustic signals during interviews or conversations. Unlike prior approaches that predominantly model momentary affect, our method segments long interview recordings into utterance-level instances and models their interrelations using relational and attentional graph neural networks (RGCN and GAT). This design allows the model to learn both local sequential cues and long-range symptom patterns, enabling more accurate modeling of depression’s intermittent and evolving manifestations.
Compared to the state of the art, our approach achieves substantial performance gains on two widely used benchmark datasets (AVEC2014 and AVEC2019), confirming the advantage of modeling depressive symptoms as temporally structured and crossmodally expressed signals. Our findings also suggest that a context window of around 20 utterances offers an effective granularity for estimating depressive states, potentially guiding the design of future time-aware models.
Importantly, although our framework adopts a regression-based formulation to predict continuous depression severity scores (BDI, PHQ-8), it remains clinically meaningful. These scores can be mapped to established diagnostic categories (e.g., minimal, mild, moderate, severe), facilitating integration into clinical workflows and enabling derivation of classification metrics such as sensitivity and specificity. This highlights the model’s potential for real-world screening and triaging scenarios, where early risk detection is critical.
Nonetheless, the present study has limitations. It relies on pre-collected datasets under constrained environments, and has not yet been validated in live clinical settings. The absence of interpretability mechanisms also limits its direct usability by clinicians. Future work will focus on extending the framework to larger and more diverse populations, exploring explainable modeling techniques, and embedding the model within real-time decision-making systems for mental health professionals. Such steps will be essential to bridge the gap between technical development and clinical impact.

Author Contributions

Conceptualization, S.Z.; methodology, S.Z. and Y.Z.; software, S.Z.; validation, Y.S.; formal analysis, K.S.; data curation, Y.Z.; writing—original draft, S.Z.; writing—review and editing, S.Z.; visualization, Y.Z.; supervision, J.L. and T.W.; project administration, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Yunlifang (Qinhuangdao) Technology Co., Ltd. and Yideluoke Intelligent Technology (Qinhuangdao) Co., Ltd. It is also supported by the Central Guidance Fund for Local Science and Technology Development (Project No. 226Z0312G) and the Qinhuangdao Science and Technology Plan Project (Project No. 202303B005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

This work was supported by two joint research projects: “Development of Intelligent Recognition Algorithms for a Dust Suppression System Based on Hawk-Eye Vision”, jointly developed by Yunlifang (Qinhuangdao) Technology Co., Ltd. and Northeastern University; and “Platform Software Development for a Dust Suppression System Based on Hawk-Eye Vision”, jointly developed by Yideluoke Intelligent Technology (Qinhuangdao) Co., Ltd. and Northeastern University.

Conflicts of Interest

Authors Shiwen Zhao, Yunze Zhang, Yikai Su, Kaifeng Su, Jiemin Liu, and Tao Wang were employed by the company Cloudlore Big Data Technology (Qinhuangdao) Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Yunlifang (Qinhuangdao) Technology Co., Ltd. and Yideluoke Intelligent Technology (Qinhuangdao) Co., Ltd. The funders were not involved in the study design, the collection, analysis, or interpretation of data, the writing of this article, or the decision to submit it for publication.

References

  1. Fava, M.; Kendler, K.S. Major depressive disorder. Neuron 2000, 28, 335–341. [Google Scholar] [CrossRef] [PubMed]
  2. Trotzek, M.; Koitka, S.; Friedrich, C.M. Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences. IEEE Trans. Knowl. Data Eng. 2018, 32, 588–601. [Google Scholar] [CrossRef]
  3. Song, S.; Jaiswal, S.; Shen, L.; Valstar, M. Spectral representation of behaviour primitives for depression analysis. IEEE Trans. Affect. Comput. 2020, 13, 829–844. [Google Scholar] [CrossRef]
  4. Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Schuller, B.W. Multitask Learning from Augmented Auxiliary Data for Improving Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022, 14, 3164–3176. [Google Scholar] [CrossRef]
  5. Fu, C.; Liu, C.; Ishi, C.; Ishiguro, H. An adversarial training based speech emotion classifier with isolated gaussian regularization. IEEE Trans. Affect. Comput. 2022, 14, 2361–2374. [Google Scholar] [CrossRef]
  6. Wolohan, J.; Hiraga, M.; Mukherjee, A.; Sayyed, Z.A.; Millard, M. Detecting linguistic traces of depression in topic-restricted text: Attending to self-stigmatized depression with NLP. In Proceedings of the First International Workshop on Language Cognition and Computational Models, Santa Fe, NM, USA, 20–25 August 2018; pp. 11–21. [Google Scholar]
  7. Kaya, H.; Fedotov, D.; Dresvyanskiy, D.; Doyran, M.; Mamontov, D.; Markitantov, M.; Akdag Salah, A.A.; Kavcar, E.; Karpov, A.; Salah, A.A. Predicting depression and emotions in the cross-roads of cultures, para-linguistics, and non-linguistics. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 27–35. [Google Scholar]
  8. Yin, S.; Liang, C.; Ding, H.; Wang, S. A multi-modal hierarchical recurrent neural network for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 65–71. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  10. Rodrigues Makiuchi, M.; Warnita, T.; Uto, K.; Shinoda, K. Multimodal fusion of bert-cnn and gated cnn representations for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 55–63. [Google Scholar]
  11. Yuan, J.; Xiong, H.C.; Xiao, Y.; Guan, W.; Wang, M.; Hong, R.; Li, Z.Y. Gated CNN: Integrating multi-scale feature layers for object detection. Pattern Recognit. 2020, 105, 107131. [Google Scholar] [CrossRef]
  12. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  13. Sun, H.; Chen, Y.W.; Lin, L. TensorFormer: A Tensor-based Multimodal Transformer for Multimodal Sentiment Analysis and Depression Detection. IEEE Trans. Affect. Comput. 2022, 14, 2776–2786. [Google Scholar] [CrossRef]
  14. Saggu, G.S.; Gupta, K.; Arya, K.; Rodriguez, C.R. DepressNet: A Multimodal Hierarchical Attention Mechanism approach for Depression Detection. Int. J. Eng. Sci. 2022, 15, 24–32. [Google Scholar] [CrossRef]
  15. Yin, F.; Du, J.; Xu, X.; Zhao, L. Depression Detection in Speech Using Transformer and Parallel Convolutional Neural Networks. Electronics 2023, 12, 328. [Google Scholar] [CrossRef]
  16. Ghosal, D.; Majumder, N.; Poria, S.; Chhaya, N.; Gelbukh, A. DialogueGCN: A Graph Convolutional Neural Network for Emotion Recognition in Conversation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 154–164. [Google Scholar]
  17. Judd, L.L.; Akiskal, H.S.; Maser, J.D.; Zeller, P.J.; Endicott, J.; Coryell, W.; Paulus, M.P.; Kunovac, J.L.; Leon, A.C.; Mueller, T.I.; et al. Major depressive disorder: A prospective study of residual subthreshold depressive symptoms as predictor of rapid relapse. J. Affect. Disord. 1998, 50, 97–108. [Google Scholar] [CrossRef] [PubMed]
  18. Kennedy, S.H. Core symptoms of major depressive disorder: Relevance to diagnosis and treatment. Dialogues Clin. Neurosci. 2022, 10, 271–277. [Google Scholar] [CrossRef] [PubMed]
  19. Ayuso-Mateos, J.L.; Nuevo, R.; Verdes, E.; Naidoo, N.; Chatterji, S. From depressive symptoms to depressive disorders: The relevance of thresholds. Br. J. Psychiatry 2010, 196, 365–371. [Google Scholar] [CrossRef] [PubMed]
  20. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  21. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  22. Nasir, M.; Jati, A.; Shivakumar, P.G.; Nallan Chakravarthula, S.; Georgiou, P. Multimodal and multiresolution depression detection from speech and facial landmark features. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 43–50. [Google Scholar]
  23. An, M.; Wang, J.; Li, S.; Zhou, G. Multimodal topic-enriched auxiliary learning for depression detection. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 1078–1089. [Google Scholar]
  24. Lin, L.; Chen, X.; Shen, Y.; Zhang, L. Towards automatic depression detection: A BiLSTM/1D CNN-based model. Appl. Sci. 2020, 10, 8701. [Google Scholar] [CrossRef]
  25. Li, P.; Tao, H.; Zhou, H.; Zhou, P.; Deng, Y. Enhanced Multiview attention network with random interpolation resize for few-shot surface defect detection. Multimed. Syst. 2025, 31, 36. [Google Scholar] [CrossRef]
  26. Niu, M.; Tao, J.; Liu, B.; Huang, J.; Lian, Z. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans. Affect. Comput. 2020, 14, 294–307. [Google Scholar] [CrossRef]
  27. Hao, Y.; Cao, Y.; Li, B.; Rahman, M. Depression recognition based on text and facial expression. In Proceedings of the International Symposium on Artificial Intelligence and Robotics, Fukuoka, Japan, 21–27 August 2021; SPIE: Bellingham, WA, USA, 2021; Volume 11884, pp. 513–522. [Google Scholar]
  28. Shen, Y.; Yang, H.; Lin, L. Automatic depression detection: An emotional audio-textual corpus and a gru/bilstm-based model. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6247–6251. [Google Scholar]
  29. Zhou, L.; Liu, Z.; Yuan, X.; Shangguan, Z.; Li, Y.; Hu, B. CAIINET: Neural network based on contextual attention and information interaction mechanism for depression detection. Digit. Signal Process. 2023, 137, 103986. [Google Scholar] [CrossRef]
  30. Liu, Z.; Liu, Y.; Lyu, C.; Ye, J. Building personalized transportation model for online taxi-hailing demand prediction. IEEE Trans. Cybern. 2020, 51, 4602–4610. [Google Scholar] [CrossRef] [PubMed]
  31. Liu, X.; Ji, Z.; Pang, Y.; Han, J.; Li, X. DGIG-Net: Dynamic graph-in-graph networks for few-shot human–object interaction. IEEE Trans. Cybern. 2021, 52, 7852–7864. [Google Scholar] [CrossRef] [PubMed]
  32. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 729–734. [Google Scholar]
  33. Niu, M.; Chen, K.; Chen, Q.; Yang, L. Hcag: A hierarchical context-aware graph attention model for depression detection. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4235–4239. [Google Scholar]
  34. Fu, C.; Liu, C.; Ishi, C.T.; Ishiguro, H. Multi-modality emotion recognition model with GAT-based multi-head inter-modality attention. Sensors 2020, 20, 4894. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, T.; Hong, R.; Guo, Y.; Hao, S.; Hu, B. MS2-GNN: Exploring GNN-Based Multimodal Fusion Network for Depression Detection. IEEE Trans. Cybern. 2022, 53, 7749–7759. [Google Scholar] [CrossRef] [PubMed]
  36. Schuller, B.; Batliner, A.; Steidl, S.; Seppi, D. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun. 2011, 53, 1062–1087. [Google Scholar] [CrossRef]
  37. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202. [Google Scholar] [CrossRef]
  38. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  39. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  40. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  41. Kumar, A.; Sachdeva, N. A Bi-GRU with attention and CapsNet hybrid model for cyberbullying detection on social media. World Wide Web 2022, 25, 1537–1550. [Google Scholar] [CrossRef]
  42. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  43. Kaya, H.; Çilli, F.; Salah, A.A. Ensemble CCA for continuous emotion prediction. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 19–26. [Google Scholar]
  44. Senoussaoui, M.; Sarria-Paja, M.; Santos, J.F.; Falk, T.H. Model fusion for multimodal depression classification and level detection. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, Orlando, FL, USA, 7 November 2014; pp. 57–63. [Google Scholar]
  45. Jan, A.; Meng, H.; Gaus, Y.F.B.A.; Zhang, F. Artificial intelligent system for automatic depression level analysis through visual and vocal expressions. IEEE Trans. Cogn. Dev. Syst. 2017, 10, 668–680. [Google Scholar] [CrossRef]
  46. Cholet, S.; Paugam-Moisy, H.; Regis, S. Bidirectional associative memory for multimodal fusion: A depression evaluation case study. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  47. Pan, Y.; Shang, Y.; Shao, Z.; Liu, T.; Guo, G.; Ding, H. Integrating Deep Facial Priors into Landmarks for Privacy Preserving Multimodal Depression Recognition. IEEE Trans. Affect. Comput. 2023, 15, 828–836. [Google Scholar] [CrossRef]
  48. Pan, Y.; Shang, Y.; Liu, T.; Shao, Z.; Guo, G.; Ding, H.; Hu, Q. Spatial–temporal attention network for depression recognition from facial videos. Expert Syst. Appl. 2024, 237, 121410. [Google Scholar] [CrossRef]
  49. Zhu, Z.; Dai, W.; Hu, Y.; Li, J. Speech emotion recognition model based on Bi-GRU and Focal Loss. Pattern Recognit. Lett. 2020, 140, 358–365. [Google Scholar] [CrossRef]
  50. Fan, W.; He, Z.; Xing, X.; Cai, B.; Lu, W. Multi-modality depression detection via multi-scale temporal dilated cnns. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; pp. 73–80. [Google Scholar]
  51. Sun, H.; Liu, J.; Chai, S.; Qiu, Z.; Lin, L.; Huang, X.; Chen, Y. Multi-modal adaptive fusion transformer network for the estimation of depression level. Sensors 2021, 21, 4764. [Google Scholar] [CrossRef] [PubMed]
  52. Fang, M.; Peng, S.; Liang, Y.; Hung, C.C.; Liu, S. A multimodal fusion model with multi-level attention mechanism for depression detection. Biomed. Signal Process. Control 2023, 82, 104561. [Google Scholar] [CrossRef]
  53. Teng, S.; Chai, S.; Liu, J.; Tateyama, T.; Lin, L.; Chen, Y.W. Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  54. Fu, C.; Qian, F.; Su, Y.; Su, K.; Song, S.; Niu, M.; Shi, J.; Liu, Z.; Liu, C.; Ishi, C.T.; et al. Facial action units guided graph representation learning for multimodal depression detection. Neurocomputing 2025, 619, 129106. [Google Scholar] [CrossRef]
Figure 1. Differences in temporal connection type between existing methods and our method, together with different states of the features: (a) sequential connection; (b) bidirectional full connection; (c) the proposed graph-based connection with multi-dimensional edges; (d) hidden features, where different colors represent different subintervals; (d1) the original features of the three modalities, which are not aligned; (d2) the features after alignment by the multi-head attention mechanism, which still contain many outliers; (d3) the features after alignment by the graph network, all of which are aligned to the same subspace.
Figure 2. Architecture of the proposed model. Firstly, the input video sample, collected via sensors, is divided into multiple instances at the utterance-level. The BoAW eGeMAPS, pretrained ResNet, and pretrained BERT models are utilized to extract audio, visual, and textual features. Bidirectional GRUs are then used to update the multimodal features across time-steps. The graph is constructed initially with the designed FFC and BFC. RGCN and GAT models are employed to update the node features and connections. Finally, the updated node features are concatenated with contextual representations and used for PHQ score prediction.
Figure 3. CCC results of the proposed method for different parameter combinations on the AVEC2019 dataset, with different N_w values on the horizontal axis and CCC values on the vertical axis; different colors and bubble sizes indicate different numbers of heads.
Figure 4. Modal ablation results of the proposed method on the AVEC2019 dataset when N_w = 20 and head = 4, where the horizontal axis shows the different modal ablation models, the red line represents the CCC metric, and the blue curve represents the RMSE metric.
Figure 5. Correlation between predictions and ground truth on the AVEC2019 dataset under different N_w conditions, where the basic model is T + A + V (N_w = 20).
Figure 6. The correlation of prediction vs. ground truth on the AVEC2019 dataset with different numbers of heads.
Figure 7. Visualization of attention weights on the AVEC2019 dataset under different models; redder colors indicate higher weight values.
Figure 8. t-SNE visualization of features and graph network output of different models on AVEC2019 dataset. (a) Context features output by GRU; (b) representation output by GCN; (c) representation output by GAT; (d) representation output by multi-head attention. Red, green, and blue colors represent the audio, video, and text modalities, respectively.
Table 1. Summary of representative multimodal depression detection models.

| Methods | Backbone Network | Limitations |
|---|---|---|
| Early/Late Fusion [22] | Handcrafted features + fusion | Shallow feature-level fusion; lacks deep contextual representation |
| GRU-TriModal [7] | GRU-based fusion for text/audio/video | Does not model long-range temporal structure; limited session-level reasoning |
| BERT-CNN+Gated-CNN [10] | CNN + modality-specific deep networks | Independent modality modeling; lacks crossmodal alignment |
| MTAL [23] | Multimodal topic-enhanced model | Ignores long-term temporal progression; topic modeling not adaptive to sequences |
| BiGRU-Attn [24] | Text/speech encoding + attention fusion | Static fusion; lacks temporal dependency modeling |
| STA-MAFF [26] | Spatial–temporal attention + fusion | Focused only on local cues; lacks global session modeling |
| BiGRU-Attn [27] | BiGRU + attention across modalities | No modeling of intermodal structure |
| GRU-BiLSTM-Attn [28] | Sequential encoders + attention | No cross-utterance or session-level modeling |
| TensorFormer [13] | Multimodal transformer with tensor fusion | Computationally expensive; not optimized for graph structure |
| CAIINet [29] | Contextual attention + interaction modules | Focuses on local timepoints; lacks global coherence |
| HCAG [33] | Hierarchical GAT on text/audio | No utterance-level fusion; limited modality diversity |
| Multi-Head GAT [34] | Graph attention with intermodal fusion | Ignores chronic nature of depression; no dual-edge modeling |
| MS2-GNN [35] | Modality-shared/specific GNN | No full-session temporal modeling; lacks skip connections |
Table 2. Summary of datasets used in our experiments.

| Dataset | Subjects | Modality | Label Type |
|---|---|---|---|
| AVEC2014 | 82 | Audio + Video | BDI score |
| AVEC2019 | 275 | Audio + Video + Text | PHQ-8 score |
Table 3. Summary of representative baseline models used for comparison.

| Model | Description |
|---|---|
| CCA [43] | Ensemble of Canonical Correlation Analyzers for audio–visual affect prediction. |
| GMM+ELM [44] | Gaussian mixture models with ELM using vocal and facial features. |
| PLS+LR [45] | Combines partial least squares and linear regression for depression score estimation. |
| m-BAM [46] | Bidirectional associative memory for multisensory fusion. |
| MAFF [26] | Multimodal attention feature fusion framework. |
| AVA-DepressNet [47] | Audio–visual attention network with privacy-preserving design. |
| STA-DRN [48] | Spatial–temporal attention network for depression prediction. |
| EF [7] | Kernel-based multimodal fusion addressing cross-lingual emotion recognition. |
| BERT-CNN+Gated-CNN [10] | Combines BERT and CNN for multimodal fusion. |
| Hierarchical BiGRU [49] | BiGRU model with Focal Loss for speech emotion imbalance. |
| Multi-scale TDCNN [50] | Dilated CNNs over multiple time scales with BERT and statistical features. |
| Adaptive Fusion Transformer [51] | Transformer with adaptive fusion of multimodal signals. |
| DepressNet [14] | BiLSTM-based hierarchical attention model for end-to-end depression scoring. |
| TensorFormer [13] | Tensor-based Transformer for crossmodal interaction. |
| MFM-Att [52] | Multimodal fusion model capturing complex depressive phenotype. |
| MT [53] | Multitask model combining depression detection and sentiment analysis. |
| FAU-GF [54] | Uses identity-free facial muscle movement and speech cues with graph-based modeling to reduce noise and enhance episodic temporal reasoning. |
Table 4. Comparison results of the proposed model and baselines, where (-) indicates that the relevant paper did not mention it. ↑ indicates that higher is better, ↓ indicates that lower is better; the same applies below.

| Models (AVEC2014) | Parameters | MAE ↓ | RMSE ↓ |
|---|---|---|---|
| CCA [43] | 2.9 M | 7.69 | 9.61 |
| GMM+ELM [44] | 3.8 M | 6.31 | 8.12 |
| PLS+LR [45] | 2.1 M | 6.14 | 7.43 |
| m-BAM [46] | 0.9 M | 5.78 | 7.47 |
| MAFF [26] | 2.6 M | 5.21 | 7.03 |
| AVA-DepressNet [47] | 32.1 M | 5.32 | 6.83 |
| STA-DRN [48] | 4.7 M | 6.00 | 7.75 |
| Ours | 6.1 M | 5.25 | 6.75 |

| Models (AVEC2019) | Parameters | CCC ↑ | RMSE ↓ |
|---|---|---|---|
| EF [7] | 0.6 M | 0.344 | - |
| BERT-CNN+Gated-CNN [10] | 2.3 M | 0.403 | 6.11 |
| Hierarchical BiLSTM [8] | 4.2 M | 0.442 | 5.50 |
| Multi-scale Temporal Dilated CNN [50] | 0.5 M | 0.430 | 4.39 |
| Adaptive Fusion Transformer [51] | - | 0.331 | - |
| DepressNet [14] | 7.1 M | 0.457 | 5.36 |
| TensorFormer [13] | 65 M | 0.493 | 4.31 |
| MFM-Att [52] | 26.1 M | - | 5.17 |
| MT [53] | - | 0.466 | - |
| FAU-GF [54] | 15.4 M | 0.555 | 4.95 |
| Ours | 6.1 M | 0.554 | 4.61 |
Table 5. Comparison results of each component in the proposed model, where (-) indicates that the AVEC2014 dataset lacks the transcribed text modality.

| Configurations | AVEC2014 MAE ↓ | AVEC2014 RMSE ↓ | AVEC2019 CCC ↑ | AVEC2019 RMSE ↓ |
|---|---|---|---|---|
| DepressionMIGNN (T) | - | - | 0.531 | 6.21 |
| DepressionMIGNN (A) | 6.21 | 8.62 | 0.418 | 6.47 |
| DepressionMIGNN (V) | 7.76 | 7.78 | 0.210 | 7.37 |
| DepressionMIGNN (T + A) | - | - | 0.541 | 5.51 |
| DepressionMIGNN (T + V) | - | - | 0.522 | 5.44 |
| DepressionMIGNN (A + V) | 5.25 | 6.75 | 0.240 | 7.31 |
| DepressionMIGNN (T + A + V) | - | - | 0.554 | 4.61 |
| DepressionGNN (w/o Multiple Instance) | 6.41 | 7.43 | 0.458 | 5.64 |
| DepressionMI (w/o GNN) | 8.05 | 8.52 | 0.296 | 8.67 |
| DepressionMIGCN (w/o GAT, w/ GCN) | 6.72 | 7.27 | 0.406 | 6.01 |
| DepressionMIGNN (w/o BFC) | 5.96 | 6.33 | 0.425 | 6.12 |
| DepressionMIGNN (w/o GNN, w/ Multiple Attention) | 7.97 | 7.41 | 0.302 | 8.12 |
Table 6. Effects of different window sizes N_w.

| Configurations | AVEC2014 MAE ↓ | AVEC2014 RMSE ↓ | AVEC2019 CCC ↑ | AVEC2019 RMSE ↓ |
|---|---|---|---|---|
| DepressionMIGNN (N_w = 0) | 7.38 | 8.07 | 0.324 | 6.68 |
| DepressionMIGNN (N_w = 10) | 6.82 | 7.52 | 0.520 | 5.42 |
| DepressionMIGNN (N_w = 16) | 6.38 | 7.21 | 0.546 | 5.32 |
| DepressionMIGNN (N_w = 20) | 5.25 | 6.75 | 0.554 | 4.61 |
| DepressionMIGNN (N_w = 24) | 6.06 | 6.98 | 0.375 | 6.19 |
| DepressionMIGNN (N_w = 30) | 6.47 | 7.37 | 0.409 | 5.90 |
| DepressionMIGNN (N_w = 40) | 6.93 | 8.42 | 0.337 | 6.29 |
Table 7. Effects of different numbers of heads.

| Configurations | AVEC2014 MAE ↓ | AVEC2014 RMSE ↓ | AVEC2019 CCC ↑ | AVEC2019 RMSE ↓ |
|---|---|---|---|---|
| DepressionMIGNN (head = 3) | 6.71 | 7.43 | 0.434 | 5.95 |
| DepressionMIGNN (head = 4) | 5.25 | 6.75 | 0.554 | 4.61 |
| DepressionMIGNN (head = 5) | 6.06 | 7.19 | 0.488 | 5.70 |
| DepressionMIGNN (head = 6) | 6.40 | 7.36 | 0.432 | 6.05 |
