DynMultiDep: A Dynamic Multimodal Fusion and Multi-Scale Time Series Modeling Approach for Depression Detection
Abstract
1. Introduction
1. We propose DynMultiDep, a dynamic multimodal depression detection framework that combines multi-scale temporal modeling with an adaptive fusion mechanism. Its core innovations are the Multi-scale Temporal Expert Module (MTEM) and the Dynamic Multimodal Fusion Module (DynMM). MTEM extracts long-term trend features with Mamba experts, captures short-term fluctuations with a local window Transformer, and fuses the two adaptively through a long-short routing mechanism.
2. We introduce DynMM, a dynamic fusion approach, into multimodal depression detection for the first time. DynMM adjusts modality selection and fusion strategies based on the input features during both training and inference, significantly improving computational efficiency and adaptability. This overcomes the limitations of traditional static fusion strategies and handles complex multimodal data more efficiently.
3. DynMultiDep is the first multimodal depression detection framework to combine dynamic fusion with multi-scale temporal modeling, providing an effective way to accurately capture subtle changes in depressive symptoms. Extensive experiments on two large depression datasets, D-Vlog and LMVD, show that the proposed method significantly outperforms existing state-of-the-art approaches.
2. Related Work
2.1. Depression Detection Techniques
2.2. Multimodal Fusion and Time Series Modeling
3. Methodology
3.1. Multimodal Feature Extraction
3.2. Multiscale Time Series Modeling (MTEM)
3.2.1. Dynamic Multimodal Fusion and Signal Preprocessing
3.2.2. The Signal Processing Flow of the Expert Module
Mamba Long-Term Trend Expert
| Algorithm 1 MambaBlock |
|---|
| Require: Feature sequence input_features of shape [batch_size, seq_len, d_model] |
| Optional parameters: hidden_dim = 1024, state_dim = 16, conv_kernel = 4 |
| Ensure: Output feature sequence of shape [batch_size, seq_len, d_model] |
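Since the pseudocode body of Algorithm 1 is not reproduced above, the following is a minimal PyTorch sketch of a Mamba-style selective state-space block matching the stated interface (input and output of shape [batch_size, seq_len, d_model], with hidden_dim = 1024, state_dim = 16, conv_kernel = 4). The per-step Python loop is a slow reference for the fused selective-scan kernel used in practice, and all module and parameter names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlock(nn.Module):
    """Selective-SSM block sketch: [B, L, d_model] -> [B, L, d_model]."""

    def __init__(self, d_model, hidden_dim=1024, state_dim=16, conv_kernel=4):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * hidden_dim)        # value + gate branches
        self.conv1d = nn.Conv1d(hidden_dim, hidden_dim, conv_kernel,
                                groups=hidden_dim, padding=conv_kernel - 1)
        self.x_proj = nn.Linear(hidden_dim, 2 * state_dim + 1)   # input-dependent B, C, dt
        self.A_log = nn.Parameter(                               # fixed-structure decay matrix
            torch.log(torch.arange(1, state_dim + 1).float()).repeat(hidden_dim, 1))
        self.out_proj = nn.Linear(hidden_dim, d_model)
        self.state_dim = state_dim

    def forward(self, x):
        B, L, _ = x.shape
        val, gate = self.in_proj(x).chunk(2, dim=-1)
        # Causal depthwise convolution over time, then SiLU
        val = self.conv1d(val.transpose(1, 2))[..., :L].transpose(1, 2)
        val = F.silu(val)
        # Input-dependent SSM parameters (the "selective" mechanism)
        Bmat, Cmat, dt = self.x_proj(val).split(
            [self.state_dim, self.state_dim, 1], dim=-1)
        dt = F.softplus(dt)                                      # positive step size [B, L, 1]
        A = -torch.exp(self.A_log)                               # negative real poles [H, N]
        # Sequential scan: h_t = exp(A*dt)*h_{t-1} + dt*B_t*x_t; y_t = C_t . h_t
        h = val.new_zeros(B, val.size(-1), self.state_dim)
        ys = []
        for t in range(L):
            step = dt[:, t].unsqueeze(-1)                        # [B, 1, 1]
            h = h * torch.exp(A * step) \
                + step * Bmat[:, t].unsqueeze(1) * val[:, t].unsqueeze(-1)
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))     # [B, H]
        y = torch.stack(ys, dim=1) * F.silu(gate)                # gated output
        return self.out_proj(y)
```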
Local Window Transformer (LWT) Short-Term Fluctuation Expert
| Algorithm 2 TransformerBlock |
|---|
| Require: Feature sequence input_features of shape [batch_size, seq_len, d_model] |
| Optional parameters: heads = 8, head_dim = 64, ff_dim = 2048 |
| Ensure: Output feature sequence of shape [batch_size, seq_len, d_model] |
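Algorithm 2's body is likewise omitted, so the sketch below shows one plausible Local Window Transformer block matching the stated interface (heads = 8, head_dim = 64, ff_dim = 2048, d_model preserved). The window size of 16 frames is an assumption for illustration; restricting attention to a band around each time step is what confines the block to short-term fluctuations.

```python
import torch
import torch.nn as nn

class LocalWindowAttention(nn.Module):
    """Multi-head self-attention restricted to a local temporal window."""

    def __init__(self, d_model, heads=8, head_dim=64, window=16):
        super().__init__()
        inner = heads * head_dim                       # 512 with the defaults
        self.heads, self.window = heads, window
        self.scale = head_dim ** -0.5
        self.qkv = nn.Linear(d_model, 3 * inner)
        self.out = nn.Linear(inner, d_model)

    def forward(self, x):
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, L, self.heads, -1).transpose(1, 2)  # [B, h, L, d]
        q, k, v = map(split, (q, k, v))
        scores = (q @ k.transpose(-2, -1)) * self.scale
        # Band mask: position i attends only to |i - j| <= window // 2
        idx = torch.arange(L, device=x.device)
        outside = (idx[None, :] - idx[:, None]).abs() > self.window // 2
        scores = scores.masked_fill(outside, float('-inf'))
        y = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, L, -1)
        return self.out(y)

class TransformerBlock(nn.Module):
    """Pre-norm block: local-window attention + feed-forward, both residual."""

    def __init__(self, d_model, heads=8, head_dim=64, ff_dim=2048, window=16):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attn = LocalWindowAttention(d_model, heads, head_dim, window)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_dim), nn.GELU(),
                                nn.Linear(ff_dim, d_model))

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.ff(self.norm2(x))
        return x
```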
3.3. Dynamic Feature Fusion
3.4. Dynamic Multimodal Fusion (DynMM)
3.5. Modality-Level Dynamic Decision Making
- Path 1 (Audio-Only): This expert analyzes information solely from the audio modality and incurs the lowest computational cost (approx. 0.6 GFLOPs).
- Path 2 (Audio + Lightweight Video): This expert first applies compressed sensing to the input video features, sampling and reconstructing a reduced representation that preserves key information while lowering the data volume. It then processes the audio features alongside these lightweight video features. This pathway balances the use of basic audio-visual cues against computational efficiency (approx. 1.0 GFLOPs).
- Path 3 (Audio + Standard Video): This expert processes the original audio and video features directly, enabling standard audio-visual analysis with the full video information at moderate complexity (approx. 1.6 GFLOPs).
- Path 4 (Deep Audio-Visual Fusion): This expert also processes the original audio and video features but employs the most sophisticated analysis for deep multimodal integration. It entails the highest computational cost (approx. 2.3 GFLOPs) and is reserved for the most challenging samples.
| Algorithm 3 Dynamic Path Selection in DynMM |
|---|
| Require: Initial audio features |
| Ensure: Selected path index |
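Algorithm 3 only states its interface (audio features in, a path index out), so the following is a minimal sketch of one common way to implement such a gate: a small MLP over time-pooled audio features, with a straight-through Gumbel-softmax so the discrete path choice stays differentiable during training. The hidden size of 256 matches the gating network described in Section 4.2; the Gumbel-softmax trick itself is an assumption, not confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PathGate(nn.Module):
    """Gating network: pooled audio features -> one of four expert pathways."""

    def __init__(self, d_audio, hidden_dim=256, num_paths=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_audio, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, num_paths))

    def forward(self, audio_feats, tau=1.0):
        # audio_feats: [B, L, d_audio]; mean-pool over time before gating
        logits = self.mlp(audio_feats.mean(dim=1))               # [B, 4]
        if self.training:
            # Straight-through sample keeps gradients flowing to the gate
            probs = F.gumbel_softmax(logits, tau=tau, hard=True)
        else:
            probs = F.one_hot(logits.argmax(dim=-1), logits.size(-1)).float()
        return probs.argmax(dim=-1), probs                       # path index in {0..3}
```

At inference only the selected expert needs to run, which is where the per-sample FLOP savings of the cheaper pathways come from.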
3.5.1. Fusion-Level Dynamic Decision Making
- Audio-Internal Fusion (Path 1 strategy): This operation refines the uni-modal audio representation. It performs internal audio feature fusion by integrating representations from different levels of the audio encoder, employing a feature pyramid structure or skip connections to combine shallow acoustic features with deep semantic features. Multi-scale aggregation with feature weighting then adaptively emphasizes the relevant temporal scales and feature levels within the audio modality. No cross-modal fusion is involved.
- Lightweight Audio-Visual Fusion (Path 2 strategy): This operation performs lightweight cross-modal fusion between the audio features and the lightweight video features. Its architecture prioritizes efficiency through a selective gating mechanism in which the audio features generate signals that control the activation of relevant lightweight video features. A simplified single-head attention mechanism provides basic, shallow feature interaction, and dimensionality reduction (e.g., 512→128) is applied before interaction to lower the computational load. The goal is to capture fundamental cross-modal cues efficiently.
- Standard Audio-Visual Fusion (Path 3 strategy): This operation executes standard, comprehensive cross-modal fusion between the audio features and the full video features. It relies on bidirectional multi-head cross-attention (e.g., using 4 attention heads) for rich audio-to-video and video-to-audio interaction. Fusion occurs at intermediate feature levels, and residual connections preserve the original modality information while incorporating cross-modal enhancements, enabling the model to learn diverse cross-modal relationships (a minimal sketch of this bidirectional cross-attention follows this list).
- Deep Audio-Visual Fusion (Path 4 strategy): This operation implements the most sophisticated deep fusion strategy for the audio and full video features. It employs a multi-level interaction framework (e.g., a 3-layer structure) for progressive, deep fusion; utilizes co-attention mechanisms; incorporates fine-grained temporal alignment between audio and video frames; and applies hierarchical feature fusion combining features from shallow to deep layers. The deeply fused features are integrated, potentially with self-attention in a joint space, and a decoder structure (e.g., a Transformer decoder) interprets the representation for the final task. This operation aims to capture the most complex inter-modal dependencies.
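As referenced in the Path 3 description above, here is a minimal sketch of bidirectional multi-head cross-attention with residual connections, using the 4 attention heads mentioned in the text. The feature dimension of 256 and the LayerNorms are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiCrossAttentionFusion(nn.Module):
    """Path-3-style fusion: audio attends to video and video attends to audio,
    with residual connections preserving each original modality."""

    def __init__(self, d_model=256, heads=4):
        super().__init__()
        self.a_from_v = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.v_from_a = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.norm_a, self.norm_v = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, audio, video):
        # audio: [B, La, d], video: [B, Lv, d]
        a_enh, _ = self.a_from_v(query=audio, key=video, value=video)
        v_enh, _ = self.v_from_a(query=video, key=audio, value=audio)
        audio = self.norm_a(audio + a_enh)   # residual keeps original audio info
        video = self.norm_v(video + v_enh)
        return audio, video
```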
3.5.2. End-to-End Training and Classification Process
4. Experimental Setup
4.1. Datasets and Baselines
1. D-Vlog: D-Vlog is a multimodal depression dataset focused on the non-verbal behaviors of individuals with depression. It consists of 961 video logs, 555 from individuals with depression and 406 from non-depressed individuals, and provides visual, audio, and text features that capture multidimensional characteristics of depressive symptoms.
2. LMVD: LMVD is a large-scale multimodal depression dataset designed for audiovisual depression detection, with data sourced from social media platforms. It includes visual, audio, and text data, providing rich emotional-expression information for identifying depression-related features.
3. Baselines: On the D-Vlog dataset, we adopt four state-of-the-art methods as baselines: MCRVT [67], Spike Memory Transformer [68], JAMFN [69], and DepMamba [70]. On the LMVD dataset, we adopt two state-of-the-art baselines, DepMamba [70] and LMTformer [71], focusing on multimodal (video and audio) methods.
4.2. Implementation Details
1. Optimizer and Training Settings: The Adam optimizer was used with a learning rate of 1e-4 and a batch size of 16. To ensure reproducibility, a fixed random seed of 42 was applied to all experimental runs. Hyperparameters were tuned by grid search on the validation set. Training used early stopping, terminating if the validation loss failed to improve for 15 consecutive epochs, within a maximum of 120 epochs.
2. Feature Extraction: For the video and audio modalities, a Mamba model (state dimension = 10) and a Transformer model (hidden dimension = 256) were employed. The data were passed through a 1×1 convolution layer (256 channels), followed by the Mamba and Transformer blocks to extract long-term dependencies and local features, respectively.
3. Multimodal Fusion: A gating network (hidden dimension = 256) decides whether to activate the single-modality path or a fused multimodal path. A resource-aware loss function balances the task loss against computational cost (see the sketch after this list).
4. Dataset Splitting: The D-Vlog dataset was split into training, validation, and test sets in a 7:1:2 ratio; the LMVD dataset was split 8:1:1. For both datasets, we enforced strict subject-level separation across the splits: no individual appears in more than one subset, preventing identity leakage that could artificially inflate performance metrics.
5. Evaluation Metrics: Six metrics were used: accuracy, precision, recall, F1-score, unweighted accuracy (UA), and weighted average F1 (WF1) [79].
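As referenced in item 3, here is a minimal sketch of a resource-aware loss that combines the task loss with the expected cost of the selected pathway. The per-path costs come from the approximate FLOP counts in Section 3.5; the trade-off weight lam and the use of a soft expectation over the gate's path probabilities are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Approximate per-path costs from Section 3.5, in GFLOPs
PATH_COST = torch.tensor([0.6, 1.0, 1.6, 2.3])

def resource_aware_loss(logits, labels, path_probs, lam=0.05):
    """Task loss plus the expected computational cost of the chosen pathway.
    logits: [B, num_classes]; labels: [B]; path_probs: [B, 4] from the gate."""
    task = F.cross_entropy(logits, labels)
    expected_cost = (path_probs * PATH_COST.to(path_probs.device)).sum(dim=-1).mean()
    return task + lam * expected_cost
```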
5. Results and Analysis
- RQ1: Does DynMultiDep demonstrate strong generalization ability when applied to different datasets?
- RQ2: In what aspects does the DynMultiDep model outperform existing state-of-the-art methods?
- RQ3: How does DynMultiDep effectively improve detection accuracy through dynamic multimodal fusion and multiscale time-series modeling?
- RQ4: Can DynMultiDep maintain stable detection performance under different noise levels?
- RQ5: How does DynMultiDep perform when handling missing or disrupted modalities?
5.1. Our Method vs. Baseline Models: A Comparative Study (RQ1)
5.2. Comparison Between Our Method and SOTA Models (RQ2)
5.3. Ablation Study (RQ3)
5.3.1. D-Vlog Dataset
5.3.2. LMVD Dataset
5.3.3. Synergistic Analysis
5.4. Visual Analysis
5.4.1. Statistical Significance Analysis
5.4.2. Cross-Time Attention Patterns in Multimodal Depression Detection
5.4.3. Model Performance Under Different Noise Levels (RQ4)
5.4.4. Sensitivity Analysis of Imbalanced Data Handling Strategies
- Precision-Recall Tradeoff (Top Left): As the proportion of depressed samples decreases from 50% to 2%, without any balancing strategy (blue line), the recall rate declines significantly (from 0.85 to 0.4), indicating a reduced ability of the model to detect the minority class. In contrast, when using the Focal Loss strategy (red line), even under extreme imbalance conditions (2%), the recall rate remains high (approximately 0.8), with only a small change in precision and stable false positive rates.
- Performance Improvement Percentage (Top Right): Compared to the baseline with no balancing strategy, Focal Loss yields an increasingly large improvement as the imbalance intensifies. Under the extreme condition of 2% depressed samples, recall improves by over 100% and the F1 score by approximately 70–80%, demonstrating the effectiveness of Focal Loss on highly imbalanced data (a minimal Focal Loss implementation is sketched after this list).
- Modality Contribution Analysis (Bottom Left): As data imbalance increases, the contribution of different modalities to the model’s performance changes. The performance of the audio modality declines rapidly, while multimodal fusion effectively enhances model performance under high imbalance, maintaining a relatively stable F1 score.
- ROC Curve Comparison (Bottom Right): On the 50% balanced dataset (black line), the model performs the best. The Focal Loss strategy (red line) significantly outperforms the no-balancing strategy (blue dashed line) under the 5% imbalance condition, nearly recovering the performance of the balanced dataset.
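For reference, a minimal implementation of the binary Focal Loss strategy analyzed above. The alpha and gamma values are the common defaults from Lin et al.'s original formulation, assumed here rather than taken from the paper's configuration.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples so the rare depressed
    class dominates the gradient. logits, targets: [B], targets in {0, 1}."""
    t = targets.float()
    bce = F.binary_cross_entropy_with_logits(logits, t, reduction='none')
    p_t = torch.exp(-bce)                    # probability assigned to the true class
    alpha_t = alpha * t + (1 - alpha) * (1 - t)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```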
5.4.5. Robustness of Multimodal Models Under Temporal Disturbances
5.4.6. Analysis of Modality Missing Experiment Results (RQ5)
- The upper-left plot illustrates the effect of modality missing on the F1 score at various missing rates for audio and video modalities. It is evident that missing video modality (orange line) leads to a rapid decline in the F1 score, particularly when the missing rate exceeds 30%. In contrast, missing the audio modality (blue line) has a more moderate impact, suggesting that video features contribute more significantly to the model’s performance.
- The upper-right plot visually demonstrates the importance of the two modalities. The complete absence of the video modality results in a 71.1% performance degradation, while the absence of the audio modality only leads to a 51.0% drop in performance. This further confirms the higher importance of the video modality in depression detection.
- The lower-left plot compares three data recovery strategies: optimal recovery, mean imputation, and zero imputation. For either modality, the optimal recovery strategy (e.g., using GANs or autoencoders) achieves the highest F1 score. Across all strategies, recovery of the audio modality outperforms recovery of the video modality, consistent with the greater complexity of video data (the two simple imputation baselines are sketched after this list).
- The lower-right plot presents a heatmap of the F1 scores for the audio and video modalities at various missing rates. The brighter the color, the higher the F1 score. It is observed that when no modality is missing, the F1 score is highest (0.83). When one modality is entirely missing (100%), the integrity of the remaining modality becomes crucial. When both modalities are severely missing (lower-right corner), the model’s performance significantly declines. The interactive pattern in the heatmap indicates that there is a synergistic effect between modalities, where the integrity of one modality can partially compensate for the loss of the other modality.
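For clarity, the two simple recovery baselines from the lower-left panel can be expressed as below. This is an illustrative helper, not the paper's recovery pipeline; the optimal-recovery strategy (GAN- or autoencoder-based) would replace the mean/zero fill with a learned reconstruction.

```python
import torch

def impute_missing(feats, missing_mask, strategy="mean"):
    """Fill missing time steps of one modality before fusion.
    feats: [B, L, D]; missing_mask: [B, L] bool, True where a frame is missing."""
    out = feats.clone()
    if strategy == "zero":
        out[missing_mask] = 0.0              # zero imputation
    elif strategy == "mean":
        for b in range(feats.size(0)):       # mean of each sample's observed frames
            observed = feats[b][~missing_mask[b]]
            fill = (observed.mean(dim=0) if observed.numel()
                    else feats.new_zeros(feats.size(-1)))
            out[b][missing_mask[b]] = fill
    return out
```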
5.4.7. Analysis of Computation Paths and Efficiency in DynMM Model
5.5. Error Analysis and Case Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Cassano, P.; Fava, M. Depression and public health: An overview. J. Psychosom. Res. 2002, 53, 849–857.
2. Remes, O.; Mendes, J.F.; Templeton, P. Biological, psychological, and social determinants of depression: A review of recent literature. Brain Sci. 2021, 11, 1633.
3. Sabo, E.; Reynolds, C.F.; Kupfer, D.J.; Berman, S.R. Sleep, depression, and suicide. Psychiatry Res. 1991, 36, 265–277.
4. Steiger, A.; Pawlowski, M. Depression and sleep. Int. J. Mol. Sci. 2019, 20, 607.
5. Hawton, K.; Comabella, C.C.I.; Haw, C.; Saunders, K. Risk factors for suicide in individuals with depression: A systematic review. J. Affect. Disord. 2013, 147, 17–28.
6. Furr, S.R.; Westefeld, J.S.; McConnell, G.N.; Jenkins, J.M. Suicide and depression among college students: A decade later. Prof. Psychol. Res. Pract. 2001, 32, 97.
7. World Health Organization. Depressive Disorder (Depression); World Health Organization: Geneva, Switzerland, 2023.
8. Zhao, Y.-J.; Jin, Y.; Rao, W.-W.; Zhang, Q.-E.; Zhang, L.; Jackson, T.; Su, Z.-H.; Xiang, M.; Yuan, Z.; Xiang, Y.-T. Prevalence of major depressive disorder among adults in China: A systematic review and meta-analysis. Front. Psychiatry 2021, 12, 659470.
9. Liang, D.; Mays, V.M.; Hwang, W.-C. Integrated mental health services in China: Challenges and planning for the future. Health Policy Plan. 2018, 33, 107–122.
10. Xu, Z.; Gahr, M.; Xiang, Y.; Kingdon, D.; Rüsch, N.; Wang, G. The state of mental health care in China. Asian J. Psychiatry 2022, 69, 102975.
11. Lu, J.; Xu, X.; Huang, Y.; Li, T.; Ma, C.; Xu, G.; Yin, H.; Xu, X.; Ma, Y.; Wang, L.; et al. Prevalence of depressive disorders and treatment in China: A cross-sectional epidemiological study. Lancet Psychiatry 2021, 8, 981–990.
12. Al-Harbi, K.S. Treatment-resistant depression: Therapeutic trends, challenges, and future directions. Patient Prefer. Adherence 2012, 6, 369–388.
13. Gotlib, I.H.; Joormann, J. Cognition and depression: Current status and future directions. Annu. Rev. Clin. Psychol. 2010, 6, 285–312.
14. Goldman, L.S.; Nielsen, N.H.; Champion, H.C.; Council on Scientific Affairs; American Medical Association. Awareness, diagnosis, and treatment of depression. J. Gen. Intern. Med. 1999, 14, 569–580.
15. Alghowinem, S.; Goecke, R.; Wagner, M.; Epps, J.; Hyett, M.; Parker, G.; Breakspear, M. Multimodal depression detection: Fusion analysis of paralinguistic, head pose and eye gaze behaviors. IEEE Trans. Affect. Comput. 2016, 9, 478–490.
16. Safa, R.; Bayat, P.; Moghtader, L. Automatic detection of depression symptoms in twitter using multimodal analysis. J. Supercomput. 2022, 78, 4709–4744.
17. Gui, T.; Zhu, L.; Zhang, Q.; Peng, M.; Zhou, X.; Ding, K.; Chen, Z. Cooperative multimodal approach to depression detection in twitter. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 110–117.
18. Shen, G.; Jia, J.; Nie, L.; Feng, F.; Zhang, C.; Hu, T.; Chua, T.-S.; Zhu, W. Depression detection via harvesting social media: A multimodal dictionary learning solution. IJCAI 2017, 2017, 3838–3844.
19. Cai, H.; Qu, Z.; Li, Z.; Zhang, Y.; Hu, X.; Hu, B. Feature-level fusion approaches based on multimodal EEG data for depression recognition. Inf. Fusion 2020, 59, 127–138.
20. Sardari, S.; Nakisa, B.; Rastgoo, M.N.; Eklund, P. Audio based depression detection using Convolutional Autoencoder. Expert Syst. Appl. 2022, 189, 116076.
21. Ma, X.; Yang, H.; Chen, Q.; Huang, D.; Wang, Y. Depaudionet: An efficient deep model for audio based depression classification. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 16 October 2016; pp. 35–42.
22. Wang, Z.; Chen, L.; Wang, L.; Diao, G. Recognition of audio depression based on convolutional neural network and generative antagonism network model. IEEE Access 2020, 8, 101181–101191.
23. Lynch, C.J.; Elbau, I.G.; Ng, T.; Ayaz, A.; Zhu, S.; Wolk, D.; Manfredi, N.; Johnson, M.; Chang, M.; Chou, J.; et al. Frontostriatal salience network expansion in individuals in depression. Nature 2024, 633, 624–633.
24. Mizrahi, B.; Shilo, S.; Rossman, H.; Kalkstein, N.; Marcus, K.; Barer, Y.; Keshet, A.; Shamir-Stein, N.; Shalev, V.; Zohar, A.E.; et al. Longitudinal symptom dynamics of COVID-19 infection. Nat. Commun. 2020, 11, 6208.
25. Kuppens, P.; Verduyn, P. Emotion dynamics. Curr. Opin. Psychol. 2017, 17, 22–26.
26. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125.
27. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250.
28. Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
29. Bech, P. Rating scales in depression: Limitations and pitfalls. Dialogues Clin. Neurosci. 2006, 8, 207–215.
30. Sepehry, A.A. Self-rating depression scale (SDS). In Encyclopedia of Quality of Life and Well-Being Research; Springer: Berlin/Heidelberg, Germany, 2024; pp. 6269–6276.
31. Faravelli, C.; Albanesi, G.; Poli, E. Assessment of depression: A comparison of rating scales. J. Affect. Disord. 1986, 11, 245–253.
32. Boyle, G.J. Self-report measures of depression: Some psychometric considerations. Br. J. Clin. Psychol. 1985, 24, 45–59.
33. Hamilton, M. A rating scale for depression. J. Neurol. Neurosurg. Psychiatry 1960, 23, 56.
34. Platona, R.I.; Voiță-Mekereș, F.; Enătescu, V.R. Depression rating scales-benefits and limitations. A literature review. J. Psychol. Educ. Res. 2023, 31, 138–152.
35. Maust, D.; Cristancho, M.; Gray, L.; Rushing, S.; Tjoa, C.; Thase, M.E. Psychiatric rating scales. Handb. Clin. Neurol. 2012, 106, 227–237.
36. Hamilton, M. General problems of psychiatric rating scales (especially for depression). Mod. Probl. Pharmacopsychiatry 1974, 7, 125–138.
37. Cotes, R.O.; Boazak, M.; Griner, E.; Jiang, Z.; Kim, B.; Bremer, W.; Seyedi, S.; Rad, A.B.; Clifford, G.D. Multimodal assessment of schizophrenia and depression utilizing video, acoustic, locomotor, electroencephalographic, and heart rate technology: Protocol for an observational study. JMIR Res. Protoc. 2022, 11, e36417.
38. Zhu, J.; Wang, Z.; Gong, T.; Zeng, S.; Li, X.; Hu, B.; Li, J.; Sun, S.; Zhang, L. An improved classification model for depression detection using EEG and eye tracking data. IEEE Trans. Nanobiosci. 2020, 19, 527–537.
39. de Melo, W.C.; Granger, E.; Hadid, A. Combining global and local convolutional 3d networks for detecting depression from facial expressions. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–8.
40. Xie, W.; Wang, C.; Lin, Z.; Luo, X.; Chen, W.; Xu, M.; Liang, L.; Liu, X.; Wang, Y.; Luo, H.; et al. Multimodal fusion diagnosis of depression and anxiety based on CNN-LSTM model. Comput. Med. Imaging Graph. 2022, 102, 102128.
41. Zhang, Q.; Wu, H.; Zhang, C.; Hu, Q.; Fu, H.; Zhou, J.T.; Peng, X. Provable dynamic fusion for low-quality multimodal data. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 41753–41769.
42. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125.
43. Wang, Z.; Wang, K.; Wang, M.-Q. Super-resolution Reconstruction of Remote Sensing Image Based on Transformer of Multi-scale Feature Fusion. J. Northeast. Univ. (Nat. Sci.) 2024, 45, 1178.
44. Yan, Z.; Ma, C.; Liu, S.; Sun, Y. Multi-scale feature enhanced Transformer network for efficient semantic segmentation. Opt.-Electron. Eng. 2024, 51, 240237-1.
45. Liang, Z.; Zhao, K.; Liang, G.; Li, S.; Wu, Y.; Zhou, Y. MAXFormer: Enhanced transformer for medical image segmentation with multi-attention and multi-scale features fusion. Knowl.-Based Syst. 2023, 280, 110987.
46. Li, T.; Cui, Z.; Han, Y.; Li, G.; Li, M.; Wei, D. Enhanced multi-scale networks for semantic segmentation. Complex Intell. Syst. 2024, 10, 2557–2568.
47. Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.-H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12094–12103.
48. Yan, H.; Zhang, C.; Wu, M. Lawin transformer: Improving semantic segmentation transformer with multi-scale representations via large window attention. arXiv 2022, arXiv:2201.01615.
49. Ao, Y.; Shi, W.; Ji, B.; Miao, Y.; He, W.; Jiang, Z. MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation. Comput. Biol. Med. 2024, 170, 108057.
50. Lee, D.-J.; Shin, D.-H.; Son, Y.-H.; Han, J.-W.; Oh, J.-H.; Kim, D.-H.; Jeong, J.-H.; Kam, T.-E. Spectral graph neural network-based multi-atlas brain network fusion for major depressive disorder diagnosis. IEEE J. Biomed. Health Inform. 2024, 28, 2967–2978.
51. Chen, P.; Zhang, Y.; Cheng, Y.; Shu, Y.; Wang, Y.; Wen, Q.; Yang, B.; Guo, C. Pathformer: Multi-scale transformers with adaptive pathways for time series forecasting. arXiv 2024, arXiv:2402.05956.
52. Shabani, A.; Abdi, A.; Meng, L.; Sylvain, T. Scaleformer: Iterative multi-scale refining transformers for time series forecasting. arXiv 2022, arXiv:2206.04038.
53. Du, D.; Su, B.; Wei, Z. Preformer: Predictive transformer with multi-scale segment-wise correlations for long-term time series forecasting. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
54. Zhao, L.; Mo, C.; Ma, J.; Chen, Z.; Yao, C. LSTM-MFCN: A time series classifier based on multi-scale spatial–temporal features. Comput. Commun. 2022, 182, 52–59.
55. Wang, T.; Liu, Z.; Zhang, T.; Hussain, S.F.; Waqas, M.; Li, Y. Adaptive feature fusion for time series classification. Knowl.-Based Syst. 2022, 243, 108459.
56. Sun, H.; Liu, J.; Chai, S.; Qiu, Z.; Lin, L.; Huang, X.; Chen, Y. Multi-modal adaptive fusion transformer network for the estimation of depression level. Sensors 2021, 21, 4764.
57. Ma, L.; Yao, Y.; Liang, T.; Liu, T. Multi-scale cooperative multimodal transformers for multimodal sentiment analysis in videos. In Australasian Joint Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2024; pp. 281–297.
58. Luo, H.; Ji, L.; Huang, Y.; Wang, B.; Ji, S.; Li, T. Scalevlad: Improving multimodal sentiment analysis via multi-scale fusion of locally descriptors. arXiv 2021, arXiv:2112.01368.
59. Philemon, W.; Mulugeta, W. A machine learning approach to multi-scale sentiment analysis of amharic online posts. HiLCoE J. Comput. Sci. Technol. 2014, 2, 8.
60. Xiong, G.; Yan, K.; Zhou, X. A distributed learning based sentiment analysis methods with Web applications. World Wide Web 2022, 25, 1905–1922.
61. Zhang, M.; Liu, Z.; Feng, J.; Liu, L.; Jiao, L. Remote sensing image change detection based on deep multi-scale multi-attention Siamese transformer network. Remote Sens. 2023, 15, 842.
62. Zhang, P.; Dai, X.; Yang, J.; Xiao, B.; Yuan, L.; Zhang, L.; Gao, J. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2998–3008.
63. Zhu, F.; Zhao, S.; Wang, P.; Wang, H.; Yan, H.; Liu, S. Semi-supervised wide-angle portraits correction by multi-scale transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19689–19698.
64. Xiang, Q.; Huang, T.; Zhang, Q.; Li, Y.; Tolba, A.; Bulugu, I. A novel sentiment analysis method based on multi-scale deep learning. Math. Biosci. Eng. 2023, 20, 8766–8781.
65. Yoon, J.; Kang, C.; Kim, S.; Han, J. D-vlog: Multimodal vlog dataset for depression detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 12226–12234.
66. He, L.; Chen, K.; Zhao, J.; Wang, Y.; Pei, E.; Chen, H.; Jiang, J.; Zhang, S.; Zhang, J.; Wang, Z.; et al. Lmvd: A large-scale multimodal vlog dataset for depression detection in the wild. arXiv 2024, arXiv:2407.00024.
67. Li, X.-H.; Liu, Z.-T.; Zou, Y.-J.; She, J.; Hirota, K. MCRVT: Multi-Hierarchical Cross-Reconstruction Networks With Versatile Transformer for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2025, 16, 2189–2199.
68. Yang, M.; Liu, Y.; Tao, Y.; Hu, B. Spike Memory Transformer: An Energy-Efficient Model in Distributed Learning Framework for Autonomous Depression Detection. IEEE Internet Things J. 2025, 12, 44025–44036.
69. Zhou, L.; Liu, Z.; Shangguan, Z.; Yuan, X.; Li, Y.; Hu, B. JAMFN: Joint Attention Multi-Scale Fusion Network for Depression Detection. In Proceedings of the Interspeech 2023, Dublin, Ireland, 20–24 August 2023.
70. Ye, J.; Zhang, J.; Shan, H. Depmamba: Progressive fusion mamba for multimodal depression detection. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
71. He, L.; Zhao, J.; Zhang, J.; Jiang, J.; Qi, S.; Wang, Z.; Wu, D. LMTformer: Facial depression recognition with lightweight multi-scale transformer from videos. Appl. Intell. 2025, 55, 195.
72. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
73. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 2019, 31, 1235–1270.
74. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; pp. 3285–3292.
75. Dey, R.; Salem, F.M. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), Boston, MA, USA, 6–9 August 2017; pp. 1597–1600.
76. Fan, J.; Zhang, K.; Huang, Y.; Zhu, Y.; Chen, B. Parallel spatio-temporal attention-based TCN for multivariate time series prediction. Neural Comput. Appl. 2023, 35, 13109–13118.
77. Han, K.; Wang, Y.; Guo, J.; Tang, Y.; Wu, E. Vision gnn: An image is worth graph of nodes. Adv. Neural Inf. Process. Syst. 2022, 35, 8291–8303.
78. Wang, Q.; Zhan, L.; Thompson, P.; Zhou, J. Multimodal learning with incomplete modalities by knowledge distillation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 1828–1838.
79. Evidently AI. Accuracy, Precision, and Recall in Multi-Class Classification. Available online: https://www.evidentlyai.com/classification-metrics/multi-class-metrics (accessed on 16 March 2024).
Comparison between DynMultiDep and baseline models on the D-Vlog dataset (A = audio, V = visual).

| Modal Features | Metric | Transformer | LSTM | BiLSTM | GRU | TCN | ResNet | MMDNet | DynMultiDep |
|---|---|---|---|---|---|---|---|---|---|
| A+V | Accuracy | 0.6564 | 0.5849 | 0.6447 | 0.5802 | 0.6226 | 0.5802 | 0.6604 | 0.8000 |
| | Precision | 0.6667 | 0.5956 | 0.6778 | 0.5802 | 0.6188 | 0.5810 | 0.6783 | 0.7917 |
| | Recall | 0.8687 | 0.8862 | 0.7534 | 1.0000 | 0.9106 | 0.9919 | 0.7886 | 0.8906 |
| | F1 Score | 0.7544 | 0.7124 | 0.7110 | 0.7343 | 0.7368 | 0.7327 | 0.7293 | 0.8382 |
| A (Audio) | Accuracy | 0.6038 | 0.5802 | 0.5802 | 0.5595 | 0.5802 | 0.5991 | 0.6415 | 0.7692 |
| | Precision | 0.6211 | 0.5802 | 0.5802 | 0.5776 | 0.5802 | 0.5990 | 0.6407 | 0.7284 |
| | Recall | 0.8130 | 1.0000 | 1.0000 | 0.9116 | 1.0000 | 0.9350 | 0.8699 | 0.9672 |
| | F1 Score | 0.7042 | 0.7343 | 0.7343 | 0.7071 | 0.7343 | 0.7302 | 0.7379 | 0.8310 |
| V (Visual) | Accuracy | 0.5613 | 0.5849 | 0.5755 | 0.5896 | 0.6375 | 0.6032 | 0.5802 | 0.7745 |
| | Precision | 0.5949 | 0.5936 | 0.5782 | 0.6023 | 0.6316 | 0.6035 | 0.5817 | 0.7701 |
| | Recall | 0.7642 | 0.9024 | 0.9919 | 0.8618 | 0.9796 | 0.9320 | 0.9837 | 0.9571 |
| | F1 Score | 0.6690 | 0.7161 | 0.7305 | 0.7090 | 0.7680 | 0.7326 | 0.7311 | 0.8535 |
Comparison between DynMultiDep and baseline models on the LMVD dataset (A = audio, V = visual).

| Modal Features | Metric | Transformer | LSTM | BiLSTM | GRU | TCN | ResNet | MMDNet | DynMultiDep |
|---|---|---|---|---|---|---|---|---|---|
| A+V | Accuracy | 0.6995 | 0.7213 | 0.6685 | 0.7104 | 0.6995 | 0.7219 | 0.7268 | 0.7651 |
| | Precision | 0.6667 | 0.7564 | 0.6581 | 0.7111 | 0.7045 | 0.6970 | 0.7356 | 0.7586 |
| | Recall | 0.7912 | 0.6484 | 0.7033 | 0.7033 | 0.6813 | 0.6765 | 0.7033 | 0.8250 |
| | F1 Score | 0.7236 | 0.6982 | 0.6783 | 0.7072 | 0.6927 | 0.6441 | 0.7191 | 0.7904 |
| A (Audio) | Accuracy | 0.5410 | 0.5574 | 0.5355 | 0.6099 | 0.5738 | 0.5191 | 0.5410 | 0.6600 |
| | Precision | 0.5229 | 0.5362 | 0.5190 | 0.5714 | 0.5455 | 0.5090 | 0.5238 | 0.6790 |
| | Recall | 0.8791 | 0.8132 | 0.9011 | 0.9014 | 0.8571 | 0.9341 | 0.8462 | 0.6875 |
| | F1 Score | 0.6557 | 0.6463 | 0.6586 | 0.6995 | 0.6667 | 0.6589 | 0.6471 | 0.6832 |
| V (Visual) | Accuracy | 0.6612 | 0.5660 | 0.6612 | 0.6721 | 0.6831 | 0.6721 | 0.7377 | 0.7452 |
| | Precision | 0.6355 | 0.5771 | 0.6381 | 0.6240 | 0.6602 | 0.6505 | 0.7263 | 0.7416 |
| | Recall | 0.7473 | 0.9431 | 0.7363 | 0.8571 | 0.7473 | 0.7363 | 0.7582 | 0.7952 |
| | F1 Score | 0.6869 | 0.7160 | 0.6837 | 0.7222 | 0.7010 | 0.6907 | 0.7419 | 0.7674 |
Comparison between DynMultiDep and state-of-the-art methods on the D-Vlog dataset (values in %; \ = not reported). Each baseline row is followed by DynMultiDep's results under the same evaluation protocol.

| Methods | Accuracy | Precision | Recall | F1 Score | UA | WF1 |
|---|---|---|---|---|---|---|
| MCRVT | 63.10 | 72.51 | 67.74 | 70.03 | 62.53 | 54.10 |
| DynMultiDep | 65.40 | 75.15 | 70.21 | 72.58 | 64.81 | 56.07 |
| Spike Memory Transformer | 70.73 | \ | \ | \ | \ | \ |
| DynMultiDep | 71.83 | \ | \ | \ | \ | \ |
| JAMFN | \ | 68.18 | 68.39 | 68.25 | \ | \ |
| DynMultiDep | \ | 69.18 | 69.24 | 69.38 | \ | \ |
| DepMamba | 67.87 | 67.20 | 85.73 | 75.33 | \ | \ |
| DynMultiDep | 80.00 | 79.17 | 89.06 | 83.82 | \ | \ |
Comparison with state-of-the-art methods on the LMVD dataset (values in %).

| Method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| DepMamba | 70.13 | 68.24 | 74.44 | 71.17 |
| LMTformer | 74.22 | 69.96 | 82.24 | 75.62 |
| DynMultiDep | 76.51 | 75.86 | 82.50 | 79.04 |
| Baseline | Mean Diff | Effect Size (Cohen’s d) | Adjusted p-Value | Significance |
|---|---|---|---|---|
| Transformer | 0.08982 | 5.156104 | | *** |
| LSTM | 0.063626 | 3.293931 | | *** |
| BiLSTM | 0.110838 | 5.399268 | | *** |
| GRU | 0.072973 | 3.185507 | | *** |
| TCN | 0.088799 | 3.597811 | | *** |
| ResNet | 0.064569 | 3.183101 | | *** |
| MMDNet | 0.061750 | 4.024491 | | *** |

*** indicates an adjusted p-value below 0.001.
| Baseline | Mean Diff | Effect Size (Cohen’s d) | Adjusted p-Value | Significance |
|---|---|---|---|---|
| Transformer | 0.15182 | 1.542 | <0.001 | *** |
| LSTM | 0.252626 | 3.914 | | *** |
| BiLSTM | 0.165838 | 1.725 | | *** |
| GRU | 0.236973 | 3.382 | | *** |
| TCN | 0.194799 | 2.246 | | *** |
| ResNet | 0.234569 | 3.328 | | *** |
| MMDNet | 0.164750 | 1.694 | | *** |

*** indicates an adjusted p-value below 0.001.
| Case ID | Ground Truth | Prediction | Key Modality | Primary Reason for Misclassification |
|---|---|---|---|---|
| S1 | Depressed | Healthy | Video | Severe facial occlusion (hands covering face) |
| S2 | Depressed | Healthy | Video | Motion blur caused by rapid head rotation |
| S3 | Healthy | Depressed | Audio | Vocal hoarseness due to seasonal cold/flu |
| S4 | Depressed | Healthy | Audio | Extended silence segments exceeding 15 s |
| S5 | Healthy | Depressed | Both | Significant environmental noise (street traffic) |
| S6 | Depressed | Healthy | Video | Insufficient lighting masking facial micro-expressions |
| S7 | Depressed | Healthy | Both | Atypical “smiling depression” with high masking |
| S8 | Healthy | Depressed | Video | Heavy facial hair obstructing landmark detection |