DEPART: Multi-Task Interpretable Depression and Parkinson’s Disease Detection from In-the-Wild Video Data
Abstract
1. Introduction
- We propose DEPART (DEpression and PArkinson’s Recognition Technique), a unified multi-task method for the simultaneous video-based detection of depression and Parkinson’s disease (PD) from in-the-wild recordings.
- We introduce a prototype-aware temporal architecture with gated fusion that combines discriminative classification and exemplar-based reasoning for robust disease prediction.
- We adapt gradient-based attention visualization to Contrastive Language-Image Pre-training (CLIP) and Transformer-based video encoders, enabling the interpretation of task-specific spatial regions of interest.
- We provide a comprehensive experimental study analyzing multi-task vs. single-task learning, architectural depth, prototype learning, and loss design.
- We demonstrate competitive quantitative performance against State-of-the-Art (SOTA) methods on the In-the-Wild Speech Medical (WSM) corpus and analyze computational efficiency and inference characteristics.
2. Related Work
3. Proposed Method
3.1. Data Pre-Processing
3.2. Body Region Detection
3.3. Multi-Task Depression and Parkinson’s Disease Detection Model
3.3.1. Static and Temporal Feature Encoding
3.3.2. Gated Residual Connections
1. Fixed Gating Coefficient (FGC). A single global scalar is shared across all layers, frames, and feature dimensions. The scalar is selected as a hyperparameter.
2. Time-Wise Gating (TWG). A learnable vector is maintained per layer. After softmax normalization, it yields frame-specific mixing weights: a single weight per frame, applied uniformly across all feature dimensions.
3. Feature-Wise Gating (FWG). A learnable vector is maintained per layer. After softmax normalization, it yields a single weight per feature dimension, applied uniformly across all frames.
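The three gating variants differ only in the shape of the mixing weight. The following is a minimal NumPy sketch of that difference, not the authors' implementation: it assumes the gate blends a layer output `fx` with its input `x` as a convex combination `g * fx + (1 - g) * x`, which is an assumption, since the exact update equations are not reproduced in this text. The function and argument names (`gated_residual`, `alpha`, `logits`) are hypothetical.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def gated_residual(x, fx, mode="fgc", alpha=0.75, logits=None):
    """Blend layer input x and layer output fx, both of shape (T, D).

    ASSUMPTION: the blend is g * fx + (1 - g) * x; only the shape of g
    follows the descriptions of FGC, TWG, and FWG in the text.
    """
    if mode == "fgc":
        g = alpha                        # one scalar for all frames and features
    elif mode == "twg":
        g = softmax(logits)[:, None]     # (T, 1): one weight per frame
    elif mode == "fwg":
        g = softmax(logits)[None, :]     # (1, D): one weight per feature dimension
    else:
        raise ValueError(f"unknown gating mode: {mode}")
    return g * fx + (1.0 - g) * x

# Toy example: 4 frames, 3 feature dimensions.
x = np.zeros((4, 3))
fx = np.ones((4, 3))
out_fgc = gated_residual(x, fx, "fgc", alpha=0.75)
out_twg = gated_residual(x, fx, "twg", logits=np.zeros(4))
out_fwg = gated_residual(x, fx, "fwg", logits=np.zeros(3))
```

In a trained model the `logits` vectors would be learnable parameters held per layer; here they are plain arrays so that the broadcasting pattern is easy to inspect.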
3.3.3. Prototype-Aware Classification
3.4. Loss Function
4. Experiments
4.1. Research Corpus
4.2. Experimental Results
5. Error Analysis
1. Annotation-modality mismatch: In some segments, the speech comes from a healthy person, but the video alternates between healthy individuals and individuals with PD. Since the annotations in both sub-corpora are derived solely from the audio modality, these segments are labeled “healthy”. However, the presence of PD patients in the visual modality creates a conflict. The model correctly detects visual cues related to PD and predicts the PD class, which points to a genuine annotation inconsistency rather than a model failure.
2. Misleading static visual content: Some videos of healthy individuals contain prolonged static frames without any movement. Given that motor impairment is a hallmark of PD, the model interprets this lack of movement as a pathological signal, which leads to false positive predictions even though the person is healthy.
3. Body detection failures: In some cases, the YOLO-based body detector fails to locate any person in a frame. As a result, the entire frame, including background noise and irrelevant text overlays, is used as input to the proposed model. Because the model tends to favor minority classes in order to counteract class imbalance, it is prone to predicting PD (the least frequent class) when presented with ambiguous or noisy input.
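The third failure mode comes from a fallback path: when no person is detected, the whole frame is passed on. The sketch below illustrates such a fallback and makes it visible through a flag, so that affected frames can be counted or inspected. It is an illustration under assumptions, not the authors' pipeline: detections are taken as plain `(x1, y1, x2, y2, score)` tuples rather than the output of any specific detector API, choosing the highest-scoring box is an assumption, and the name `crop_person_or_fallback` is hypothetical.

```python
import numpy as np

def crop_person_or_fallback(frame, boxes):
    """Return (image, detected) for one frame.

    frame: array of shape (H, W, C).
    boxes: list of (x1, y1, x2, y2, score) person detections (hypothetical format).
    If no usable box exists, the full frame is returned with detected=False,
    mirroring the fallback behavior described in the text.
    """
    if not boxes:
        return frame, False
    # ASSUMPTION: keep the single highest-scoring detection.
    x1, y1, x2, y2, _ = max(boxes, key=lambda b: b[4])
    h, w = frame.shape[:2]
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(w, int(x2)), min(h, int(y2))
    if x2 <= x1 or y2 <= y1:
        return frame, False              # degenerate box: treat as a miss
    return frame[y1:y2, x1:x2], True

frame = np.zeros((100, 200, 3), dtype=np.uint8)
full, found_none = crop_person_or_fallback(frame, [])
crop, found_one = crop_person_or_fallback(frame, [(10, 20, 60, 80, 0.9)])
```

Tracking the flag per video would make it possible to measure how often full-frame fallbacks coincide with false PD predictions.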
6. Visualization of Task-Specific Regions of Interest
7. Discussion
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CLIP | Contrastive Language-Image Pre-training |
| CNN | Convolutional Neural Network |
| DAIC | Distress Analysis Interview Corpus |
| DEPART | DEpression and PArkinson’s Recognition Technique |
| DT | Decision Tree |
| E-DAIC | Extended DAIC |
| FAU | Facial Action Units |
| FE | Facial Expressions |
| FGC | Fixed Gating Coefficient |
| FL | Facial Landmarks |
| FPS | Frames Per Second |
| FWG | Feature-Wise Gating |
| GELU | Gaussian Error Linear Unit |
| GT | Gaze Tracking |
| HMM | Hidden Markov Model |
| LSTM | Long Short-Term Memory |
| MF1 | Macro F1-score |
| MLP | Multi-Layer Perceptron |
| PD | Parkinson’s disease |
| PL | Pose Landmarks |
| ReLU | Rectified Linear Unit |
| SOTA | State-of-the-Art |
| SVM | Support Vector Machine |
| SWIN | Shifted WINdow |
| TWG | Time-Wise Gating |
| UAR | Unweighted Average Recall |
| ViT | Vision Transformer |
| WF1 | Weighted F1-score |
| WSM | In-the-Wild Speech Medical |
References
1. Wallensten, J.; Ljunggren, G.; Nager, A.; Wachtler, C.; Bogdanovic, N.; Petrovic, P.; Carlsson, A.C. Stress, depression, and risk of dementia—A cohort study in the total population between 18 and 65 years old in Region Stockholm. Alzheimer’s Res. Ther. 2023, 15, 161.
2. Lokshina, A.; Grishina, D. Treatment of noncognitive neuropsychiatric disorders in Alzheimer’s disease. Neurol. Neuropsychiatry Psychosom. 2021, 13, 132–138.
3. Byers, A.L.; Yaffe, K. Depression and risk of developing dementia. Nat. Rev. Neurol. 2011, 7, 323–331.
4. Sharma, D.; Singh, J.; Sehra, S.S.; Sehra, S.K. Demystifying Mental Health by Decoding Facial Action Unit Sequences. Big Data Cogn. Comput. 2024, 8, 78.
5. Markitantov, M.; Ryumina, E.; Kaya, H.; Karpov, A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion. In Proceedings of the Interspeech; ISCA Archive: Rotterdam, The Netherlands, 2025; pp. 3010–3014.
6. Parikh, A.; Sadeghi, M.; Eskofier, B. Exploring facial biomarkers for depression through temporal analysis of action units. arXiv 2024, arXiv:2407.13753.
7. Shangguan, Z.; Liu, Z.; Li, G.; Chen, Q.; Ding, Z.; Hu, B. Dual-stream multiple instance learning for depression detection with facial expression videos. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 31, 554–563.
8. Yang, N.; Liu, J.; Sun, D.; Ding, J.; Sun, L.; Qi, X.; Yan, W. Motor Symptoms of Parkinson’s Disease: Critical Markers for Early AI-assisted Diagnosis. Front. Aging Neurosci. 2025, 17, 1602426.
9. Rangel-Cascajosa, C.; Luna-Perejón, F.; Vicente-Diaz, S.; Domínguez-Morales, M. Gait-Based Parkinson’s Disease Detection Using Recurrent Neural Networks for Wearable Systems. Big Data Cogn. Comput. 2025, 9, 183.
10. Brien, D.C.; Riek, H.C.; Yep, R.; Huang, J.; Coe, B.; Areshenkoff, C.; Grimes, D.; Jog, M.; Lang, A.; Marras, C.; et al. Classification and staging of Parkinson’s disease using video-based eye tracking. Park. Relat. Disord. 2023, 110, 105316.
11. Maddage, N.C.; Senaratne, R.; Low, L.S.A.; Lech, M.; Allen, N. Video-based detection of the clinical depression in adolescents. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society; IEEE: New York, NY, USA, 2009; pp. 3723–3726.
12. Mu, X.; Seyedi, S.; Zheng, I.; Jiang, Z.; Chen, L.; Omofojoye, B.; Hershenberg, R.; Levey, A.I.; Clifford, G.D.; Dodge, H.H.; et al. Detecting Cognitive Impairment and Psychological Well-being among Older Adults Using Facial, Acoustic, Linguistic, and Cardiovascular Patterns Derived from Remote Conversations. arXiv 2024, arXiv:2412.14194.
13. Escalante, H.J.; Kaya, H.; Salah, A.A.; Escalera, S.; Güçlütürk, Y.; Güçlü, U.; Baró, X.; Guyon, I.; Junior, J.C.J.; Madadi, M.; et al. Modeling, recognizing, and explaining apparent personality from videos. IEEE Trans. Affect. Comput. 2020, 13, 894–911.
14. Williamson, J.R.; Godoy, E.; Cha, M.; Schwarzentruber, A.; Khorrami, P.; Gwon, Y.; Kung, H.T.; Dagli, C.; Quatieri, T.F. Detecting depression using vocal, facial and semantic communication cues. In Proceedings of the International Workshop on Audio/Visual Emotion Challenge; ACM: New York, NY, USA, 2016; pp. 11–18.
15. Song, S.; Shen, L.; Valstar, M. Human behaviour-based automatic depression analysis using hand-crafted statistics and deep learned spectral features. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018); IEEE: New York, NY, USA, 2018; pp. 158–165.
16. Wei, P.C.; Peng, K.; Roitberg, A.; Yang, K.; Zhang, J.; Stiefelhagen, R. Multi-modal depression estimation based on sub-attentional fusion. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2022; pp. 623–639.
17. Gimeno-Gómez, D.; Bucur, A.M.; Cosma, A.; Martínez-Hinarejos, C.D.; Rosso, P. Reading between the frames: Multi-modal depression detection in videos from non-verbal cues. In Proceedings of the European Conference on Information Retrieval; Springer: Berlin/Heidelberg, Germany, 2024; pp. 191–209.
18. Jaegle, A.; Gimeno, F.; Brock, A.; Vinyals, O.; Zisserman, A.; Carreira, J. Perceiver: General perception with iterative attention. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 4651–4664.
19. Zhang, Z.; Zhang, S.; Ni, D.; Wei, Z.; Yang, K.; Jin, S.; Huang, G.; Liang, Z.; Zhang, L.; Li, L.; et al. Multimodal sensing for depression risk detection: Integrating audio, video, and text data. Sensors 2024, 24, 3714.
20. Kyprakis, I.; Skaramagkas, V.; Boura, I.; Karamanis, G.; Fotiadis, D.I.; Kefalopoulou, Z.; Spanaki, C.; Tsiknakis, M. A Deep Learning approach for Depressive Symptoms assessment in Parkinson’s disease patients using facial videos. arXiv 2025, arXiv:2505.03845.
21. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 3202–3211.
22. Yoon, J.; Kang, C.; Kim, S.; Han, J. D-vlog: Multimodal vlog dataset for depression detection. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2022; Volume 36, pp. 12226–12234.
23. Dolgushin, M.; Guseva, D.; Karpov, A. Investigation of Explainable Multimodal Methods for Detecting Mental Disorders. In Proceedings of the International Conference on Speech and Computer (SPECOM); Springer: Berlin/Heidelberg, Germany, 2025; pp. 173–187.
24. Zhou, A.; Li, S.; Sriram, P.; Li, X.; Dong, J.; Sharma, A.; Zhong, Y.; Luo, S.; Kindratenko, V.; Heintz, G.; et al. Youtubepd: A multimodal benchmark for Parkinson’s disease analysis. Adv. Neural Inf. Process. Syst. 2023, 36, 55140–55159.
25. Calvo-Ariza, N.R.; Gómez-Gómez, L.F.; Orozco-Arroyave, J.R. Classical FE Analysis to Classify Parkinson’s Disease Patients. Electronics 2022, 11, 3533.
26. Rabie, H.; Akhloufi, M.A. A review of machine learning and deep learning for Parkinson’s disease detection. Discov. Artif. Intell. 2025, 5, 24.
27. Valstar, M.; Schuller, B.; Smith, K.; Eyben, F.; Jiang, B.; Bilakhia, S.; Schnieder, S.; Cowie, R.; Pantic, M. AVEC 2013: The continuous audio/visual emotion and depression recognition challenge. In Proceedings of the ACM International Workshop on Audio/Visual Emotion Challenge; ACM: New York, NY, USA, 2013; pp. 3–10.
28. Valstar, M.; Schuller, B.; Smith, K.; Almaev, T.; Eyben, F.; Krajewski, J.; Cowie, R.; Pantic, M. AVEC 2014: 3D Dimensional Affect and Depression Recognition Challenge. In Proceedings of the International Workshop on Audio/Visual Emotion Challenge; Association for Computing Machinery: New York, NY, USA, 2014; pp. 3–10.
29. Gratch, J.; Artstein, R.; Lucas, G.M.; Stratou, G.; Scherer, S.; Nazarian, A.; Wood, R.; Boberg, J.; DeVault, D.; Marsella, S.; et al. The distress analysis interview corpus of human and computer interviews. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014; Volume 14, pp. 3123–3128.
30. DeVault, D.; Artstein, R.; Benn, G.; Dey, T.; Fast, E.; Gainer, A.; Georgila, K.; Gratch, J.; Hartholt, A.; Lhommet, M.; et al. SimSensei Kiosk: A virtual human interviewer for healthcare decision support. In Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems; ACM: New York, NY, USA, 2014; pp. 1061–1068.
31. Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2016; pp. 1–10.
32. Correia, J.; Teixeira, F.; Botelho, C.; Trancoso, I.; Raj, B. The in-the-wild speech medical corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2021; pp. 6973–6977.
33. Dodge, H.H.; Yu, K.; Wu, C.Y.; Pruitt, P.J.; Asgari, M.; Kaye, J.A.; Hampstead, B.M.; Struble, L.; Potempa, K.; Lichtenberg, P.; et al. Internet-based conversational engagement randomized controlled clinical trial (I-CONECT) among socially isolated adults 75+ years old with normal cognition or mild cognitive impairment: Topline results. Gerontol. 2024, 64, gnad147.
34. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193.
35. Boccignone, G.; Conte, D.; Cuculo, V.; D’Amelio, A.; Grossi, G.; Lanzarotti, R.; Mortara, E. pyVHR: A Python framework for remote photoplethysmography. PeerJ Comput. Sci. 2022, 8, e929.
36. Teng, S.; Chai, S.; Liu, J.; Tateyama, T.; Lin, L.; Chen, Y.W. Multi-Modal and Multi-Task Depression Detection with Sentiment Assistance. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE); IEEE: New York, NY, USA, 2024; pp. 1–5.
37. Yang, Q.; Zhou, J.; Wei, Z. Time Perspective-Enhanced Suicidal Ideation Detection Using Multi-Task Learning. Int. J. Netw. Dyn. Intell. 2024, 3, 100011.
38. Hu, Y.H.; Wu, R.Y.; Su, M.Y.; Lin, I.L.; Shen, C.C. Multimodal Multitask Learning for Predicting Depression Severity and Suicide Risk Using Pretrained Audio and Text Embeddings: Methodology Development and Application. JMIR Med Inf. 2025, 13, e66907.
39. Białek, K.; Potulska-Chromik, A.; Jakubowski, J.; Nojszewska, M.; Kostera-Pruszczyk, A. Analysis of handwriting for recognition of Parkinson’s disease: Current state and new study. Electronics 2024, 13, 3962.
40. Markovic, F.; Jovanovic, L.; Spalevic, P.; Kaljevic, J.; Zivkovic, M.; Simic, V.; Shaker, H.; Bacanin, N. Parkinsons detection from gait time series classification using modified metaheuristic optimized long short term memory. Neural Process. Lett. 2025, 57, 14.
41. Lim, W.S.; Chiu, S.I.; Wu, M.C.; Tsai, S.F.; Wang, P.H.; Lin, K.P.; Chen, Y.M.; Peng, P.L.; Chen, Y.Y.; Jang, J.S.R.; et al. An integrated biometric voice and facial features for early detection of Parkinson’s disease. NPJ Park. Dis. 2022, 8, 145.
42. Lv, C.; Fan, L.; Li, H.; Ma, J.; Jiang, W.; Ma, X. Leveraging multimodal deep learning framework and a comprehensive audio-visual dataset to advance Parkinson’s detection. Biomed. Signal Process. Control 2024, 95, 106480.
43. Junaid, M.; Ghergherehchi, M.; Lee, S. Multitask Deep Learning for Predicting Parkinson’s Progression and Depression From Multimodal Time Series Data. IEEE Access 2025, 13, 147818–147841.
44. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
45. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML); PMLR: New York, NY, USA, 2021; pp. 8748–8763.
46. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2009; pp. 248–255.
47. Zou, Y.; Yi, S.; Li, Y.; Li, R. A closer look at the cls token for cross-domain few-shot learning. Adv. Neural Inf. Process. Syst. 2024, 37, 85523–85545.
48. Zhao, Z.; Liu, Y.; Wu, H.; Wang, M.; Li, Y.; Wang, S.; Teng, L.; Liu, D.; Cui, Z.; Wang, Q.; et al. CLIP in medical imaging: A survey. Med Image Anal. 2025, 102, 103551.
49. Oh, S.; Kim, N.; Ryu, J. Analyzing to discover origins of CNNs and ViT architectures in medical images. Sci. Rep. 2024, 14, 8755.
50. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 1–11.
51. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024; pp. 1–27.
52. Alomar, K.; Aysel, H.I.; Cai, X. CNNs, RNNs and Transformers in human action recognition: A survey and a hybrid model. Artif. Intell. Rev. 2025, 58, 387.
53. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4080–4090.
54. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
55. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017; pp. 618–626.




| Method | Corpus | Model | Features |
|---|---|---|---|
| Depression | |||
| Williamson et al. [14] | DAIC-WOZ | Gaussian staircase regressor | FAU |
| Song et al. [15] | DAIC-WOZ | CNN | FAU, GT, Head PL |
| Wei et al. [16] | DAIC-WOZ | ConvBiLSTM | FAU, GT, Head PL |
| Gimeno-Gómez et al. [17] | DAIC-WOZ | Perceiver [18]; Transformer with Modality and Position Condition | FL, FAU, GT, Head PL |
| Zhang et al. [19] | Own | ResNet34 Swin-Transformer with BiLSTM and Multi-Head Attention | Neural Face Embeddings |
| Kyprakis et al. [20] | Own | ViT based on SWIN [21] | Neural Face Embeddings |
| Yoon et al. [22] | D-Vlog | Transformer | FAU |
| Dolgushin et al. [23] | WSM | Majority voting of classifiers SVM, CNN, Random Forest | ResNet50, FL, PL |
| PD | |||
| Zhou et al. [24] | YouTubePD | FE ResNet50 and FL Region Encoders with Spatial-Temporal Attention | Frame and Region embeddings |
| Calvo-Ariza et al. [25] | Own | ResNet50 for FAU prediction and SVM | FAU |
| Dolgushin et al. [23] | WSM | Majority voting of classifiers CNN, DT, MLP | PL, FL |
| Class | Subset | Number of Samples | Women, % | Mean Age, Years |
|---|---|---|---|---|
| Healthy | Train | 3962 (353) | 58.2 (52.4) | 36.7 (35.5) |
| Healthy | Development | 514 (63) | 57.2 (52.4) | 35.6 (34.3) |
| Healthy | Test | 896 (63) | 69.2 (54.0) | 36.2 (34.9) |
| Depression | Train | 1431 (191) | 57.4 (55.0) | 29.3 (29.9) |
| Depression | Development | 317 (38) | 63.1 (52.6) | 29.9 (29.5) |
| Depression | Test | 335 (37) | 60.6 (51.4) | 28.0 (28.9) |
| PD | Train | 1058 (157) | 52.3 (49.7) | 42.5 (45.1) |
| PD | Development | 105 (24) | 55.2 (50.0) | 43.0 (44.6) |
| PD | Test | 133 (28) | 26.3 (50.0) | 48.3 (45.0) |
| Method | Seq. Length, Frames | Healthy Recall/Precision, % | Depr. Recall/Precision, % | PD Recall/Precision, % | UAR, % | MF1, % | Rank |
|---|---|---|---|---|---|---|---|
| CLIP + Mamba | 10 | 55.02/90.96 | 81.49/57.84 | 71.43/27.14 | 69.31 | 58.52 | 9.4 |
| CLIP + Mamba | 20 | 67.52/86.93 | 67.76/59.58 | 75.19/34.84 | 70.16 | 62.34 | 6.6 |
| CLIP + Mamba | 30 | 53.68/86.36 | 82.39/56.56 | 60.90/25.39 | 65.66 | 56.37 | 12.5 |
| CLIP + Mamba | 60 | 63.73/90.78 | 77.61/67.01 | 76.69/29.39 | 72.68 | 63.10 | 5.0 |
| CLIP + Mamba | 90 | 62.50/92.87 | 80.90/61.17 | 72.93/30.50 | 72.11 | 62.47 | 5.5 |
| CLIP + Transformer | 10 | 59.04/88.35 | 72.84/58.37 | 76.69/29.57 | 69.52 | 59.36 | 8.0 |
| CLIP + Transformer | 20 | 61.27/86.05 | 68.96/64.35 | 73.68/26.98 | 67.97 | 59.10 | 10.2 |
| CLIP + Transformer | 30 | 54.24/90.45 | 79.10/55.67 | 74.44/27.97 | 69.26 | 58.03 | 10.0 |
| CLIP + Transformer | 60 | 61.61/91.69 | 82.99/59.02 | 74.44/34.02 | 73.01 | 63.13 | 4.4 |
| CLIP + Transformer | 90 | 58.48/86.26 | 78.81/55.35 | 65.41/31.10 | 67.57 | 58.99 | 10.5 |
| ViT + Mamba | 10 | 74.67/83.94 | 65.07/60.06 | 49.62/32.35 | 60.22 | 63.12 | 9.0 |
| ViT + Mamba | 20 | 63.62/81.66 | 62.09/54.74 | 49.62/23.08 | 58.44 | 53.73 | 17.6 |
| ViT + Mamba | 30 | 66.96/83.45 | 62.99/51.21 | 48.87/27.90 | 59.61 | 55.44 | 16.2 |
| ViT + Mamba | 60 | 73.44/83.61 | 65.07/53.69 | 36.09/28.07 | 58.20 | 56.20 | 15.1 |
| ViT + Mamba | 90 | 69.87/83.58 | 62.69/54.83 | 46.62/26.72 | 59.72 | 56.19 | 15.8 |
| ViT + Transformer | 10 | 65.07/86.24 | 68.66/51.45 | 54.14/29.88 | 62.62 | 57.17 | 12.6 |
| ViT + Transformer | 20 | 69.42/88.86 | 70.45/56.59 | 47.37/25.51 | 62.41 | 57.96 | 12.0 |
| ViT + Transformer | 30 | 72.99/82.99 | 64.18/56.14 | 41.35/28.50 | 59.51 | 57.12 | 14.5 |
| ViT + Transformer | 60 | 71.43/88.77 | 76.12/61.35 | 60.90/36.24 | 69.48 | 64.06 | 5.1 |
| ViT + Transformer | 90 | 73.77/84.55 | 71.64/58.19 | 44.36/34.88 | 63.26 | 60.72 | 9.2 |
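The summary columns of the table above follow from its per-class columns: UAR is the mean of the per-class recalls, and MF1 is the mean of the per-class F1-scores. As a small worked check, the sketch below recomputes both for the first row (CLIP + Mamba, 10 frames). The function names are illustrative, and because the inputs are already rounded to two decimals, other rows may differ from the reported values in the last digit.

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision (both in percent)."""
    total = recall + precision
    return 0.0 if total == 0 else 2 * recall * precision / total

def uar_and_mf1(pairs):
    """pairs: list of (recall, precision) tuples, one per class."""
    uar = sum(r for r, _ in pairs) / len(pairs)       # unweighted average recall
    mf1 = sum(f1(r, p) for r, p in pairs) / len(pairs)  # macro F1-score
    return uar, mf1

# CLIP + Mamba, 10 frames: Healthy, Depr., PD as (recall, precision).
row = [(55.02, 90.96), (81.49, 57.84), (71.43, 27.14)]
uar, mf1 = uar_and_mf1(row)
print(f"UAR = {uar:.2f}, MF1 = {mf1:.2f}")  # UAR = 69.31, MF1 = 58.52
```

The gap between the two metrics in this row is driven by the PD class, whose high recall and low precision give a low F1-score.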
| Hyperparameter | Search Values | Transformer | Mamba |
|---|---|---|---|
| Hidden dimension | {64, 128, 256, 512, 1024} | 128 | 128 |
| Output features | {64, 128, 256, 512, 1024} | 512 | 512 |
| Number of layers (H) | {2, 3, 4, 5, 6, 7, 8, 9} | 5 | 7 |
| Number of attention heads | {2, 4, 8, 16} | 2 | – |
| State dimension | {4, 8, 16, 32} | – | 16 |
| Kernel size | {3, 4, 5, 6, 7, 8, 9} | – | 7 |
| Global scalar | {0, 0.05, 0.10, …, 1} | 0.75 | 0.75 |
| Prototype-based model hyperparameters | |||
| Temperature | {0.05, 0.1, 0.2, …, 10} | 0.1 | – |
| Contrastive weight | {0, 0.05, 0.10, …, 1} | 0.05 | – |
| Number of prototypes per class | {1, 2, …, 20} | 9 | – |
| Training parameters | |||
| Scheduler type | {none, plateau, cosine} | none | none |
| Learning rate | {, , , } | ||
| Optimizer | {adam, adamw, lion, sgd} | adamw | adamw |
| Dropout rate | {0.1, 0.15, 0.2, 0.25, 0.3} | 0.25 | 0.25 |
| Modification | Healthy Recall/Precision, % | Depr. Recall/Precision, % | PD Recall/Precision, % | UAR, % | MF1, % | Rank |
|---|---|---|---|---|---|---|
| CLIP + Transformer (60) | 61.61/91.69 | 82.99/59.02 | 74.44/34.02 | 73.01 | 63.13 | 3.2 |
| + FGC | 62.83/92.60 | 82.99/61.50 | 76.69/33.55 | 74.17 | 64.07 | 2.6 |
| + TWG | 68.86/88.78 | 78.21/68.41 | 69.17/32.17 | 72.08 | 64.82 | 3.6 |
| + FWG | 68.86/88.78 | 78.21/68.77 | 69.92/32.29 | 72.33 | 64.98 | 3.0 |
| + Prototype | 65.96/89.55 | 79.70/66.92 | 78.20/34.10 | 74.62 | 65.40 | 2.0 |
| | | | | Healthy Recall/Precision, % | Depr. Recall/Precision, % | PD Recall/Precision, % | UAR, % | MF1, % | Rank |
|---|---|---|---|---|---|---|---|---|---|
| + | – | – | – | 64.17/89.70 | 80.60/64.75 | 77.44/33.66 | 74.07 | 64.52 | 5.6 |
| – | + | – | – | 61.83/90.38 | 82.09/61.25 | 75.94/33.44 | 73.29 | 63.34 | 8.2 |
| – | – | + | – | 71.65/82.31 | 64.48/78.83 | 67.67/29.03 | 67.93 | 62.73 | 10.9 |
| – | – | – | + | 72.21/62.46 | 66.27/75.03 | 65.41/70.09 | 67.96 | 63.12 | 9.1 |
| + | + | – | – | 61.94/89.81 | 82.39/61.06 | 78.20/35.37 | 74.18 | 64.06 | 6.0 |
| + | – | + | – | 65.51/88.80 | 79.40/64.72 | 72.18/32.88 | 72.37 | 63.96 | 10.2 |
| + | – | – | + | 65.96/89.55 | 79.70/66.92 | 78.20/34.10 | 74.62 | 65.40 | 4.1 |
| – | + | + | – | 64.40/89.04 | 80.30/64.05 | 74.44/33.45 | 73.04 | 64.05 | 9.5 |
| – | + | – | + | 69.08/86.69 | 75.22/67.74 | 75.19/35.97 | 73.17 | 65.61 | 6.1 |
| – | – | + | + | 71.76/81.39 | 63.28/79.10 | 64.66/28.10 | 66.57 | 61.92 | 11.5 |
| + | + | + | – | 64.62/89.49 | 80.90/64.68 | 75.19/33.56 | 73.57 | 64.45 | 6.6 |
| + | + | – | + | 65.62/89.09 | 80.30/64.51 | 76.69/35.54 | 74.21 | 65.23 | 5.1 |
| + | – | + | + | 62.95/88.82 | 80.30/62.12 | 73.68/33.11 | 72.31 | 63.14 | 11.1 |
| – | + | + | + | 64.73/89.09 | 80.30/64.20 | 74.44/33.67 | 73.16 | 64.24 | 7.9 |
| + | + | + | + | 64.51/89.47 | 81.19/64.61 | 75.19/33.67 | 73.63 | 64.48 | 6.2 |
| Method | MT/ST | NL | Healthy Recall/Precision, % | Depr./PD Recall/Precision, % | UAR, % | MF1/WF1, % | Rank |
|---|---|---|---|---|---|---|---|
| Depression | |||||||
| CLIP + Transformer (60) + Prototype | MT | 5 | 82.11/87.26 | 82.39/75.82 | 82.25 | 81.79/82.32 | 3.0 |
| CLIP + Transformer (60) + Prototype | ST | 5 | 74.59/83.03 | 77.61/67.53 | 76.10 | 75.40/76.01 | 6.6 |
| CLIP + Transformer (60) | ST | 5 | 86.18/83.46 | 74.93/78.68 | 80.55 | 80.78/81.54 | 3.9 |
| CLIP + Transformer (60) | ST | 4 | 82.52/84.41 | 77.61/75.14 | 80.07 | 79.91/80.58 | 4.9 |
| CLIP + Transformer (60) | ST | 3 | 83.33/90.91 | 87.76/78.19 | 85.55 | 84.83/85.23 | 1.4 |
| CLIP + Transformer (60) | ST | 2 | 81.50/85.87 | 80.30/74.72 | 80.90 | 80.52/81.11 | 4.4 |
| Correia et al. [32] (Audio) | – | – | – | – | 77.00 | 76.90/– | 6.5 |
| Dolgushin et al. [23] | – | – | 87.37/– | 80.84/– | 84.11 | –/83.38 | 1.8 |
| Yoon et al. [22] (D-Vlog corpus, Visual) | – | – | – | – | – | –/56.38 | – |
| Yoon et al. [22] (D-Vlog corpus, Multimodal) | – | – | – | – | – | –/63.50 | – |
| Gimeno-Gómez et al. [17] (D-Vlog corpus, Multimodal) | – | – | – | – | – | –/76.00 | – |
| PD | |||||||
| CLIP + Transformer (60) + Prototype | MT | 5 | 55.45/88.54 | 78.20/36.62 | 66.82 | 59.03/63.65 | 5.1 |
| CLIP + Transformer (60) + Prototype | ST | 5 | 60.15/90.00 | 79.70/39.70 | 69.92 | 62.55/67.37 | 3.9 |
| CLIP + Transformer (60) | ST | 5 | 77.23/89.14 | 71.43/50.80 | 74.33 | 71.07/76.97 | 3.0 |
| CLIP + Transformer (60) | ST | 4 | 78.96/91.67 | 78.20/55.03 | 78.58 | 74.42/79.83 | 1.3 |
| CLIP + Transformer (60) | ST | 3 | 48.51/86.73 | 77.44/33.12 | 62.98 | 54.31/58.30 | 6.6 |
| CLIP + Transformer (60) | ST | 2 | 69.80/87.31 | 69.17/42.99 | 69.49 | 65.30/71.50 | 4.4 |
| Correia et al. [32] (Audio) | – | – | – | – | 77.80 | 77.80/– | 1.5 |
| Dolgushin et al. [23] | – | – | 91.27/– | 40.00/– | 65.63 | –/71.88 | 4.0 |
| Zhou et al. [24] (YouTubePD corpus, Visual) | – | – | – | – | – | –/59.00 | – |
| Zhou et al. [24] (YouTubePD corpus, Multimodal) | – | – | – | – | – | –/61.00 | – |
| Method | MT/ST | NL | Clean. | Healthy Recall/Precision, % | Depr./PD Recall/Precision, % | UAR, % | MF1/WF1, % | Rank |
|---|---|---|---|---|---|---|---|---|
| Depression | ||||||||
| CLIP + Transformer (60) + Prototype | MT | 5 | – | 82.11/87.26 | 82.39/75.82 | 82.25 | 81.79/82.32 | 3.57 |
| CLIP + Transformer (60) | ST | 3 | – | 83.33/90.91 | 87.76/78.19 | 85.55 | 84.83/85.23 | 1.86 |
| CLIP + Transformer (60) + Prototype | MT | 5 | + | 79.84/89.97 | 87.50/75.56 | 83.67 | 82.85/83.14 | 3.29 |
| CLIP + Transformer (60) | ST | 3 | + | 81.68/95.41 | 94.49/78.59 | 88.08 | 86.91/87.10 | 1.29 |
| PD | ||||||||
| CLIP + Transformer (60) + Prototype | MT | 5 | – | 55.45/88.54 | 78.20/36.62 | 66.82 | 59.03/63.65 | 3.93 |
| CLIP + Transformer (60) | ST | 4 | – | 78.96/91.67 | 78.20/55.03 | 78.58 | 74.42/79.83 | 1.93 |
| CLIP + Transformer (60) + Prototype | MT | 5 | + | 75.29/93.30 | 86.14/57.62 | 80.71 | 76.19/79.33 | 1.29 |
| CLIP + Transformer (60) | ST | 4 | + | 73.75/91.39 | 82.18/54.97 | 77.96 | 73.75/77.20 | 2.86 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ryumina, E.; Axyonov, A.; Dolgushin, M.; Ryumin, D.; Karpov, A. DEPART: Multi-Task Interpretable Depression and Parkinson’s Disease Detection from In-the-Wild Video Data. Big Data Cogn. Comput. 2026, 10, 89. https://doi.org/10.3390/bdcc10030089

