DEPART: Multi-Task Interpretable Depression and Parkinson’s Disease Detection from In-the-Wild Video Data
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes DEPART to jointly detect depression and Parkinson’s disease (PD) from in-the-wild videos. Its main contribution is to integrate CLIP, a Transformer-based temporal encoder, prototype learning, and multi-task learning into this specific clinical scenario, which has practical application value. My suggestions are as follows.
(1) Although the introduction reviews prior work on depression and PD detection, it does not sufficiently cover the use of multi-task learning in medical diagnosis. Please add more recent studies on multi-task learning for medical diagnosis, especially works that handle comorbidity, to better highlight the novelty of this study.
(2) Why is CLIP chosen as the static encoder instead of other vision models? Why is a Transformer adopted rather than other temporal modeling approaches? Why is YOLOv8 used instead of the latest YOLO26 or other detectors? Please provide clearer justification or experimental evidence for these design choices to strengthen the persuasiveness of the method.
(3) Please check and unify the definitions of variables in all equations to ensure they are self-consistent and consistent across the paper.
(4) The preprocessing step uses Whisper to transcribe the audio track and removes clips without speech. While this may discard many irrelevant segments, it may also introduce selection bias. Please consider and discuss this issue.
(5) Failures of the human detection module can directly affect the overall performance, and the paper also identifies detection failure as a major error source. Please consider adding key YOLOv8 hyperparameters and related experiments.
(6) What is the basis for setting the temperature hyperparameter τ to 0.1? In the total loss, the first three terms use a coefficient of 1, whereas the last term is weighted by α, but this asymmetric design is not explained. In addition, in Table 4, the row “+FGC (α = 0.75)” uses α with an unclear meaning, while the FGC equation in the paper only includes λ and not α.
(7) Table 6 compares the proposed method with SOTA methods. Please clarify whether these SOTA methods are representative.
(8) The paper reports the number of parameters and inference time. Could you also provide FLOPs and compare efficiency with other SOTA methods?
(9) It is recommended to release the code.
Author Response
We sincerely thank the reviewer for your valuable time and effort in reviewing our manuscript. These comments are all valuable and very helpful for revising and improving our manuscript, and provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet the reviewer’s approval.
The fonts used in this document are as follows:
- Reviewers’ original comments are reproduced in blue-colored fonts.
- Plain fonts are our answers to Reviewers’ comments.
- Text reproduced from the manuscript is shown in red color.
Comment 1: Although the introduction reviews prior work on depression and PD detection, it does not sufficiently cover the use of multi-task learning in medical diagnosis. Please add more recent studies on multi-task learning for medical diagnosis, especially works that handle comorbidity, to better highlight the novelty of this study.
Reply: Thank you for this important comment. We agree that the original introduction did not sufficiently emphasize multi-task learning (MTL) as a broader paradigm in medical diagnostics and comorbidity-aware modeling. We expanded the related-work discussion to better reflect recent MTL studies on medical/clinical detection tasks, including approaches that explicitly consider jointly learned signals and comorbidity-related outcomes.
Lines 154-162: There is also a substantial body of work on multi-task detection of conditions related to depression and cognitive disorders. For example, Teng et al. [36] proposed multi-task classification of depression and sentiment, using the latter as an auxiliary component when training deep neural network models. Similarly, Yang et al. [37] used multi-task learning with a BERT-based model to incorporate time-perspective cues for suicidal ideation detection. Also, Hu et al. [38] considered, in a multi-task manner, predicting depression severity jointly with suicide risk via multimodal fusion of audio and text embeddings. However, the joint use of the video modality and multi-task learning under in-the-wild conditions is not found in the literature.
Lines 194-198: It is worth noting the study by Junaid et al. [43], which proposes an integrated framework for multi-task detection of depression and PD based on time-series data from the Parkinson’s Progression Markers Initiative and magnetic resonance imaging. Although this work demonstrated effective joint detection of these diseases, it did not consider video data.
Comment 2: Why is CLIP chosen as the static encoder instead of other vision models? Why is a Transformer adopted rather than other temporal modeling approaches? Why is YOLOv8 used instead of the latest YOLO26 or other detectors? Please provide clearer justification or experimental evidence for these design choices to strengthen the persuasiveness of the method.
Reply: Thank you for this important comment. In the revised manuscript, we added a clearer justification for our design choices and explicitly indicated where these choices are supported in the text. In particular, we now clarify that the study considers two static visual encoders (ViT and CLIP) and two temporal encoders (Transformer and Mamba), motivate their selection, and compare their performance in our task (Section 4.2). We also added a clearer justification for the YOLOv8-based body detector, emphasizing that in our pipeline robust single-class human body localization is more crucial than adopting the latest generic detector version.
Lines 254-257: YOLOv8 is selected as a stable and efficient detection tool for frame-based body localization, as in our pipeline robust single-class human body detection is more critical than adopting the most recent generic detector version.
Lines 284-286: We apply CLIP and ViT as static visual encoders, as they provide strong, transferable frame-level representations thanks to large-scale pre-training, and have shown robust performance on various medical recognition tasks [48,49].
Lines 295-296: These temporal encoders are selected because they explicitly model contextual dependencies across frame sequences, in contrast to simpler classical temporal models [52].
Lines 302-303: In Section 4.2, we compare the performance of both static and temporal encoders.
Comment 3: Please check and unify the definitions of variables in all equations to ensure they are self-consistent and consistent across the paper.
Reply: Thank you for this comment. We carefully re-checked the notation in all equations and revised the manuscript to make the variable definitions self-consistent and consistent across sections. In particular, we corrected several notation inaccuracies, clarified previously implicit variable definitions, and introduced missing symbols, where needed.
Comment 4: The preprocessing step uses Whisper to transcribe the audio track and removes clips without speech. While this may discard many irrelevant segments, it may also introduce selection bias. Please consider and discuss this issue.
Reply: Thank you for this comment. We agree that filtering clips based on Whisper transcripts can introduce selection bias by excluding non-speech segments and over-representing participants with longer speech-rich recordings. In the revised manuscript, we added Table 2, which summarizes demographic statistics of the data. We then explicitly discuss this limitation and clarify that the segment-level dataset distribution may differ from the participant-level distribution after preprocessing. We also emphasized that these differences are minor and mainly affect the Development and Test subsets, so they do not add substantial distortions to the Train subset.
Lines 424-432: Table 2 summarizes the demographic distribution of segments and individuals by disease group and subset after segmentation and pre-processing. It should be pointed out that individuals with longer recordings contribute more segments. Moreover, since our pipeline retains only speech- and face-containing clips, the effective class balance and gender distribution of our segment-level composition differ from the individual-level ones. However, these differences are minor and mainly affect the Development and Test subsets, so they do not add substantial distortions to the Train subset. In general, the distribution between classes is not balanced, with individuals with PD represented by the fewest segments.
Comment 5: Failures of the human detection module can directly affect the overall performance, and the paper also identifies detection failure as a major error source. Please consider adding key YOLOv8 hyperparameters and related experiments.
Reply: Thank you for this important comment. We agree that failures of the human detection module can directly affect the downstream diagnostic performance and should be documented more clearly. In the revised manuscript, we added the key YOLOv8 inference hyperparameters used in our pipeline and clarified that we used the default inference settings of the adopted implementation.
Lines 261-265: The YOLOv8 human body detector is used with the default inference settings for the adopted implementation. These settings include a confidence threshold of 0.5 and an intersection over union threshold of 0.5. The input image size is set to 640x640 pixels. Only the weights of the detector are replaced with a pretrained single-class checkpoint for humans’ bodies that is used in our pipeline.
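As a rough illustration of how the two reported thresholds interact, the following plain-Python sketch applies a confidence cut-off and then greedy non-maximum suppression. It is not the detector's actual code: the function names `iou` and `filter_detections` are hypothetical, and only the threshold values 0.5 and 0.5 are taken from the text above.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes: List[Box], scores: List[float],
                      conf_thr: float = 0.5, iou_thr: float = 0.5):
    """Drop low-confidence boxes, then suppress boxes overlapping a better one."""
    candidates = sorted(
        ((s, b) for s, b in zip(scores, boxes) if s >= conf_thr),
        key=lambda sb: sb[0], reverse=True)
    kept = []
    for score, box in candidates:
        # Keep the box only if it does not overlap any already-kept box too much.
        if all(iou(box, k) <= iou_thr for _, k in kept):
            kept.append((score, box))
    return kept
```

Raising `conf_thr` trades missed bodies for fewer spurious crops, which is the failure mode the reviewer asks about.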
Comment 6: What is the basis for setting the temperature hyperparameter τ to 0.1? In the total loss, the first three terms use a coefficient of 1, whereas the last term is weighted by α, but this asymmetric design is not explained. In addition, in Table 4, the row “+FGC (α = 0.75)” uses α with an unclear meaning, while the FGC equation in the paper only includes λ and not α.
Reply: Thank you for this careful comment. We revised the manuscript to clarify all three points. Firstly, the temperature hyperparameter τ was selected by grid search (as were the other hyperparameters), but in the previous version we reported only the final best value (τ = 0.1) in the text. We corrected this by adding τ to Table 3 (Grid search results for hyperparameters) and explicitly listing the searched values and the selected optimal value. Secondly, we clarified the role of α in the total loss: the first three terms are primary supervised classification losses, while α is used to control the contribution of the contrastive prototype loss as an auxiliary regularization term. Thirdly, we agree that the notation in Table 4 was inconsistent: the row “+FGC (α = 0.75)” contained a typographical error, and this parameter should be λ (the FGC mixing coefficient), not α. We corrected this in the revised manuscript.
Lines 401-403: The first three terms are the primary classification losses and are therefore combined with unit weights, whereas L_cont is used as an auxiliary regularization term and scaled by α.
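The weighting described above can be written compactly as follows. This is only a sketch of the structure stated in the text: the symbols for the three classification terms are placeholders, since their names in the manuscript are not reproduced here, while L_cont and α are as given above.

```latex
% Placeholder names for the three unit-weight classification terms;
% only \mathcal{L}_{\mathrm{cont}} and \alpha come from the text.
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{cls}}^{(1)}
  + \mathcal{L}_{\mathrm{cls}}^{(2)}
  + \mathcal{L}_{\mathrm{cls}}^{(3)}
  + \alpha \, \mathcal{L}_{\mathrm{cont}}
```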
Comment 7: Table 6 compares the proposed method with SOTA methods. Please clarify whether these SOTA methods are representative.
Reply: Thank you for this comment. After the revision, the table number has changed to Table 7. To the best of our knowledge, the previously cited SOTA methods are the only ones that directly use this dataset, and we therefore consider them representative of prior work on this specific benchmark. In order to further improve the broader representativeness of the comparison, we have additionally included several recent SOTA visual and multimodal methods that address an analogous task, automatic disease prediction from videos collected from open sources (e.g., YouTube), even though they are not trained or evaluated on our dataset.
Lines 576-580: A comparison with contemporary visual and multimodal SOTA models [17,22,24] trained on other representative single-task datasets sourced from YouTube (e.g., D-Vlog and YouTubePD) indicates that our proposed model achieves results comparable to these methods and, in many cases, surpasses them. Nevertheless, the scope for a direct comparison of these results remains limited.
Comment 8: The paper reports the number of parameters and inference time. Could you also provide FLOPs and compare efficiency with other SOTA methods?
Reply: Thank you for this comment. We have added the computational cost in FLOPs to the manuscript and reported it per sample (GFLOPs/sample for readability) for each pipeline component, as well as the total FLOPs of the full pipeline.
Lines 725-731: The temporal detection module contains 0.669 M trainable parameters, corresponding to a model size of 2.58 MB. The static CLIP visual encoder and YOLO-based body detector contain 87.456 M and 3.011 M parameters, respectively, and require 605 MB and 5.97 MB of memory. Overall, the full pipeline contains 91.136 M parameters. In terms of the computational cost, YOLO, CLIP and the temporal detection module require 485.03, 523.76, and 0.08 GFLOPs/sample, respectively. This totals to 1,008.87 GFLOPs for each sample.
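The reported totals can be cross-checked by summing the per-component figures. The snippet below is just that arithmetic; the dictionary keys are illustrative labels and the numbers are copied from the passage above.

```python
# Per-component figures as reported above (parameters in millions, cost in GFLOPs/sample).
params_m = {"temporal_module": 0.669, "clip_encoder": 87.456, "yolo_detector": 3.011}
gflops = {"yolo_detector": 485.03, "clip_encoder": 523.76, "temporal_module": 0.08}

total_params_m = round(sum(params_m.values()), 3)
total_gflops = round(sum(gflops.values()), 2)

# Share of the compute spent in the trainable temporal module.
temporal_share = gflops["temporal_module"] / sum(gflops.values())

print(total_params_m, total_gflops, f"{temporal_share:.5%}")
```

Both sums agree with the stated totals (91.136 M parameters and 1,008.87 GFLOPs per sample), and almost all of the compute sits in the frozen detector and static encoder rather than in the temporal module.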
Comment 9: It is recommended to release the code.
Reply: Thank you for the suggestion. We have released the code publicly at: https://github.com/SMIL-SPCRAS/DEPART/tree/main. The repository includes the training/evaluation pipeline, configuration files, usage instructions, and links to the data. We have also added this link to the revised manuscript.
Lines 16-17: The source code is available at https://github.com/SMIL-SPCRAS/DEPART/tree/main.
Reviewer 2 Report
Comments and Suggestions for Authors
The research is well done and well presented. I have no complaints about the scientific content, but I would recommend publishing the manuscript after major revision.
Main concern.
What worries me most is the ethical issue. When researchers use biological material from individuals (blood, saliva, etc.), they must comply with various ethical standards. They must prove this when submitting the article. On the other hand, today, using smartphones and cloud technologies, it is possible, for example, to perform an electrocardiogram on anyone. I haven't checked yet whether I can install such an application on my smartphone today or not, but I will definitely be able to tomorrow. This will be done absolutely freely and without any ethical restrictions. I see that very soon I will be able to recognize depression and Parkinson's disease in pedestrians right on the street. Doesn't this contradict ethical standards and medical confidentiality? Are researchers violating ethical standards if they analyze YouTube videos?
Please answer this question in the updated version of the manuscript.
Small corrections
- Line 6
“CLIP-based visual encoding,”
What is CLIP? Explain here.
- Lines 260-261
“[CLS] token representation.”
What is [CLS]?
- Line 274
“(H-th) layer of the temporal encoder”
What does (H-th) mean?
- Line 330
“GELU activation”
What is GELU?
- Equations (11) and (12)
What is B?
- Table 5.
I can't figure out the order of the results in this table.
Can the authors arrange the binary numbers in ascending order: - - - +, - - + -, etc.?
Why is the result for 3 (- - + +) omitted? Is this an impossible combination?
- Line 559
“where G^(p)_{i,n} ∈ ℝ^D is the gradient vector”
What is D?
- Equation (17)
What is ReLU? Despite the popularity of this abbreviation, it is better to give a description.
Author Response
We sincerely thank the reviewer for your valuable time and effort in reviewing our manuscript. These comments are all valuable and very helpful for revising and improving our manuscript, and provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet the reviewer’s approval.
Comment 1: What worries me most is the ethical issue. When researchers use biological material from individuals (blood, saliva, etc.), they must comply with various ethical standards. They must prove this when submitting the article. On the other hand, today, using smartphones and cloud technologies, it is possible, for example, to perform an electrocardiogram on anyone. I haven't checked yet whether I can install such an application on my smartphone today or not, but I will definitely be able to tomorrow. This will be done absolutely freely and without any ethical restrictions. I see that very soon I will be able to recognize depression and Parkinson's disease in pedestrians right on the street. Doesn't this contradict ethical standards and medical confidentiality? Are researchers violating ethical standards if they analyze YouTube videos?
Reply: Thank you for pointing out this important ethical issue. Firstly, in our research, we use the openly available multimodal In-the-Wild Speech Medical (WSM) corpus, collected by Correia et al. [32] (Correia, J.; Teixeira, F.; Botelho, C.; Trancoso, I.; Raj, B. The in-the-Wild Speech Medical Corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021, pp. 6973–6977. https://doi.org/10.1109/ICASSP39728.2021.9414230; Correia, J. In-the-wild detection of speech affecting diseases. Carnegie Mellon University. Thesis. 2024. https://doi.org/10.1184/R1/24944607.v1). Secondly, since we have segmented and partially cleaned the visual data of the WSM corpus as described in Section 5, we have received official approval from the Scientific Council of the St. Petersburg Federal Research Center of the Russian Academy of Sciences (our organization) in accordance with the guidelines of the WMA Declaration of Helsinki – Ethical Principles for Medical Research Involving Human Participants (https://www.wma.net/policies-post/wma-declaration-of-helsinki). In the manuscript, we also emphasize that we use this publicly available corpus in compliance with the rules for its use.
Lines 435-439: WSM is the publicly available multimodal corpus collected and released by Correia et al. [32]. The corpus is disseminated by its authors for research purposes; therefore, our work constitutes a secondary analysis of existing resources. We adhere to the corpus’s terms of use and apply privacy-preserving pre-processing to minimize ethical and confidentiality risks.
Comment 2: Line 6
“CLIP-based visual encoding,”
What is CLIP? Explain here.
Reply: Thank you for this comment. We revised the abstract and added the full form of CLIP at its first mention (i.e., Contrastive Language–Image Pre-training) instead of its abbreviation.
Lines 5-7: It performs body region extraction, Contrastive Language-Image Pre-training (CLIP)-based visual encoding, Transformer-based temporal modeling, and prototype-aware classification with a gated fusion technique.
Comment 3: Lines 260-261
“[CLS] token representation.”
What is [CLS]?
Reply: Thank you for this comment. We expanded the sentence in the manuscript and added a clarification that [CLS] is a special classification token prepended to the input sequence [47].
Lines 280-283: Both models produce a D_in-dimensional embedding f ∈ ℝ^{D_in}, corresponding to the [CLS] token representation, where [CLS] is a special classification token prepended to the input sequence [47].
Comment 4: Line 274
“(H-th) layer of the temporal encoder”
What does (H-th) mean?
Reply: Thank you for this comment. The symbol H is defined earlier in the same subsection as the number of stacked temporal encoder blocks. To make this clearer at the point of use, we explicitly clarified that the “H-th” layer refers to the final temporal encoder block.
Lines 306-307: Let h̃_{i,n} ∈ ℝ^{D_h} denote the output representation of the n-th frame after the final temporal encoder block (the H-th block, where H is the number of stacked blocks).
Comment 5: Line 330
“GELU activation”
What is GELU?
Reply: Thank you for this comment. We revised the manuscript to expand GELU at its first mention and clarify that it refers to the Gaussian Error Linear Unit activation function.
Lines 366-367: Between these two linear layers, LayerNorm, a smooth non-linear Gaussian Error Linear Unit (GELU) activation, and dropout are applied.
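For readers unfamiliar with the activation, the exact GELU is x·Φ(x), where Φ is the standard normal cumulative distribution function. A minimal scalar sketch follows; the function name is illustrative, and deep-learning frameworks provide their own vectorized versions.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Unlike a hard threshold, GELU is smooth: it passes large positive inputs almost
# unchanged and damps negative inputs towards zero instead of clipping them.
values = [gelu(x) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)]
```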
Comment 6: Equations (11) and (12)
What is B?
Reply: Thank you for this comment. We revised the manuscript to explicitly define B in Equations (11) and (12) as the number of segments in a batch.
Lines 385-386: B is the number of segments in a batch.
Comment 7: Table 5.
I can't figure out the order of the results in this table.
Can the authors arrange the binary numbers in ascending order: - - - +, - - + -, etc.?
Why is the result for 3 (- - + +) omitted? Is this an impossible combination?
Reply: Thank you for the suggestion. We agree that the ordering of configurations in the original Table 5 was not sufficiently transparent. We have therefore revised Table 5 to present the loss-component combinations in a clearer, systematic order. Regarding the (- - + +) setting, it is a feasible combination and was presented in the original manuscript; however, we realized that in the previous ordering it could be easily missed. In the revised Table 5, it is placed and formatted consistently with the rest of the configurations.
Revised Table 5 (reordered configurations for clarity).
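The ascending binary ordering the reviewer asks for can be generated mechanically. The sketch below assumes four on/off loss components, as suggested by the reviewer's examples; the function name is hypothetical, and the actual rows of the revised Table 5 are not reproduced here.

```python
from itertools import product


def ordered_configs(n_components: int = 4):
    """All on/off combinations in ascending binary order ('-' = off, '+' = on)."""
    return [" ".join("+" if bit else "-" for bit in bits)
            for bits in product((0, 1), repeat=n_components)]


configs = ordered_configs()
# configs[1] is "- - - +", configs[2] is "- - + -", configs[3] is "- - + +"
```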
Comment 8: Line 559
“where G^(p)_{i,n} ∈ ℝ^D is the gradient vector”
What is D?
Reply: Thank you for this comment. We revised the manuscript to explicitly define this dimensionality and to use the notation consistently with the rest of the paper. In particular, we replaced D with D_in and clarified that D_in is the token embedding dimensionality of the static encoder.
Line 650: D_in is the token embedding dimensionality of the static encoder.
Comment 9: Equation (17)
What is ReLU? Despite the popularity of this abbreviation, it is better to give a description.
Reply: Thank you for this comment. We have revised the manuscript to expand ReLU at its first use in Equation (17) and clarified that it denotes the Rectified Linear Unit activation, which is applied here to retain only positive relevance scores.
Lines 661-662: ReLU(·) denotes the Rectified Linear Unit activation, applied here to retain only positive relevance scores.
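In code, the operation amounts to clipping negative scores to zero. The sketch below uses made-up relevance values and a hypothetical function name; it does not reproduce Equation (17) itself.

```python
def relu(scores):
    """Rectified Linear Unit applied element-wise: max(0, s)."""
    return [max(0.0, s) for s in scores]


# Negative relevance scores are zeroed; positive ones pass through unchanged.
positive_relevance = relu([-0.4, 0.0, 1.2])
```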
Reviewer 3 Report
Comments and Suggestions for Authors
Summary
This manuscript proposes a multi-task, interpretable framework for detecting depression and Parkinson’s disease (PD) using in-the-wild video data. To develop a non-invasive health monitoring algorithm, the authors design a sophisticated pipeline. First, static features are extracted using a static encoder, and a temporal encoder then captures sequential features based on the aggregated static representations. The resulting feature vectors are compared with prototype representations to derive class probability vectors. Finally, disease classes are predicted by integrating the classifier outputs with the prototype representations. Extensive experiments demonstrate the effectiveness of the proposed algorithm. Since the method provides an interpretable framework, it has potential value for related research fields and practical applications. However, several concerns should be addressed before publication.
Major Comments
1. Dataset description and demographic information
Please provide detailed demographic information for the datasets. It is important to report the number of samples in each group (HC, PD, and depressed). Additionally, demographic characteristics such as age and gender distribution would improve the clarity and reproducibility of the study. If available, providing disease severity measures (e.g., UPDRS scores for PD patients) would further strengthen the clinical relevance of the work.
2. Risk of overfitting and validation strategy
Considering the model complexity relative to the dataset size, there is a potential risk of overfitting. If computational cost permits, reporting K-fold cross-validation results would strengthen the reliability of the findings. If full cross-validation is impractical due to training time, it would still be helpful to discuss measures taken to mitigate overfitting and provide evidence supporting the robustness of the reported results.
3. Evaluation metrics
The experimental results report recall only. While recall is an important metric for disease screening, precision is also necessary to evaluate the balance between true positive and false positive predictions. Please provide per-class precision values to offer a more comprehensive assessment of model performance.
4. Comparison with existing methods
In Table 6, the proposed method is compared with only two state-of-the-art (SOTA) methods. Since Table 1 lists several relevant approaches, including additional comparisons would strengthen the experimental validation. Including results for two or three more representative methods is recommended, if feasible.
5. Multi-task classification and label interpretation
The proposed method appears to focus on multi-task classification. In such a setting, multiple label combinations may be possible (e.g., HC, PD, Depressed, PD+Depressed). It is not clear how the model handles cases involving multiple conditions. Please clarify how classification decisions are made when multiple categories are present and how the outputs should be interpreted in such scenarios.
Author Response
We sincerely thank the reviewer for your valuable time and effort in reviewing our manuscript. These comments are all valuable and very helpful for revising and improving our manuscript, and provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope will meet the reviewer’s approval.
Comment 1: Dataset description and demographic information
Please provide detailed demographic information for the datasets. It is important to report the number of samples in each group (HC, PD, and depressed). Additionally, demographic characteristics such as age and gender distribution would improve the clarity and reproducibility of the study. If available, providing disease severity measures (e.g., UPDRS scores for PD patients) would further strengthen the clinical relevance of the work.
Reply: Thank you for this important comment. We have added Table 2, which summarises the Parkinson’s and Depression subsets of the WSM corpus after the segmentation, reporting the number of segments (with unique participants in parentheses), percentage of positive diagnosis, percentage of female participants, and mean age for each split. Regarding disease severity measures, these are not available in the original WSM corpus released in [32] (Correia, J.; Teixeira, F.; Botelho, C.; Trancoso, I.; Raj, B. The in-the-Wild Speech Medical Corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada, 2021, pp. 6973–6977. https://doi.org/10.1109/ICASSP39728.2021.9414230) as it was compiled from YouTube videos where such clinical scores could not be extracted.
Lines 424-432: Table 2 summarizes the demographic distribution of segments and individuals by disease group and subset after segmentation and pre-processing. It should be pointed out that individuals with longer recordings contribute more segments. Moreover, since our pipeline retains only speech- and face-containing clips, the effective class balance and gender distribution of our segment-level composition differ from the individual-level ones. However, these differences are minor and mainly affect the Development and Test subsets, so they do not add substantial distortions to the Train subset. In general, the distribution between classes is not balanced, with individuals with PD represented by the fewest segments.
Comment 2: Risk of overfitting and validation strategy
Considering the model complexity relative to the dataset size, there is a potential risk of overfitting. If computational cost permits, reporting K-fold cross-validation results would strengthen the reliability of the findings. If full cross-validation is impractical due to training time, it would still be helpful to discuss measures taken to mitigate overfitting and provide evidence supporting the robustness of the reported results.
Reply: Thank you for this important comment. We agree that overfitting is a relevant concern given the model complexity and dataset size. In our study, we follow the fixed protocol with the Train, Development, and Test subsets provided by the corpus authors [32]. Under this protocol, all models are selected and hyperparameters are optimized on the Development subset, and final results are reported only on the unseen Test subset. This setup provides a clear separation between model selection and final evaluation, which helps reduce the risk of overfitting to the test data. We also note that, while K-fold cross-validation could be useful, it is rarely applied in this setting because it introduces additional variance across folds and raises a non-trivial model selection challenge (i.e., which fold-specific model should be treated as the final one). In contrast, the predefined split into three subsets ensures a standardized and reproducible evaluation protocol aligned with the benchmark corpora. We have clarified the choice of validation strategy and the corresponding considerations for robustness in the revised manuscript.
Lines 454-459: This experimental setup follows the predefined protocol of the corpus and ensures a clear separation between model fitting, hyperparameter selection, and final evaluation on an unseen Test subset. We therefore do not use any cross-validation setup, as the fixed split into three subsets provides a standardized and reproducible benchmark protocol for a fair model comparison with SOTA, while reducing the risk of a model overfitting to the Test data.
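The protocol described above can be sketched generically as follows. This is an illustration of the selection logic only, not the released training code: `train_fn`, `eval_fn`, and the function name itself are hypothetical callables supplied by the user.

```python
def select_on_dev(candidates, train_fn, eval_fn, train, dev, test):
    """Pick the configuration with the best Development score, then score it once on Test."""
    best_cfg, best_model, best_dev = None, None, float("-inf")
    for cfg in candidates:
        model = train_fn(cfg, train)      # fit on the Train subset only
        dev_score = eval_fn(model, dev)   # compare candidates on Development
        if dev_score > best_dev:
            best_cfg, best_model, best_dev = cfg, model, dev_score
    # The Test subset is touched exactly once, after selection is finished.
    return best_cfg, best_dev, eval_fn(best_model, test)
```

Because Test scores never influence which candidate is chosen, the reported Test result is not biased by the hyperparameter search.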
Comment 3: Evaluation metrics
The experimental results report recall only. While recall is an important metric for disease screening, precision is also necessary to evaluate the balance between true positive and false positive predictions. Please provide per-class precision values to offer a more comprehensive assessment of model performance.
Reply: Thank you for this valuable comment. We agree that, in addition to the recall, the per-class precision is important for assessing the balance between true positive and false positive predictions. In the revised manuscript, we have added per-class precision values into all the tables with experimental results. We have also expanded the corresponding discussion to include a class-wise analysis of both recall and precision, highlighting their different behavior across classes and the resulting error patterns.
Lines 470-477: Per-class Recall shows that the models are generally more sensitive to depression and, in some configurations, the PD class than to the healthy class. In contrast, the per-class Precision is highest for the healthy class and lowest for the PD one. This suggests that predictions for the healthy class are relatively reliable (with fewer false positive errors), while PD detection often achieves a higher sensitivity at the expense of more false positive errors. Overall, this pattern shows that the balance between Recall and Precision values differs across the classes, with PD being the most difficult one. This pattern also holds for all subsequent experiments.
Lines 623-632: Precision shows a more nuanced pattern. In the depression sub-corpus, cleaning the Test data leads to a noticeable increase in the Precision for healthy individuals in both models (87.26% vs. 89.97% for the multi-task and 90.91% vs. 95.41% for the single-task setup), while the Precision for depression remains nearly stable in both cases (75.82% vs. 75.56% for the multi-task and 78.19% vs. 78.59% for the single-task). In the PD sub-corpus, the Precision improvements are higher for the pathological class: the Precision for PD increases substantially in the multi-task model (from 36.62% to 57.62%) and remains nearly unchanged in the single-task model (55.03% vs. 54.97%). At the same time, the Precision for healthy individuals increases in the multi-task setting (from 88.54% to 93.30%) and changes marginally in the single-task model (from 91.67% to 91.39%).
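For readers less familiar with the two metrics discussed in Comment 3, the following minimal sketch shows how per-class precision and recall can be computed from paired true and predicted labels. The function name `per_class_precision_recall` and the toy labels are assumptions made for illustration; they are not taken from the manuscript, and the numbers are unrelated to the reported results.

```python
def per_class_precision_recall(y_true, y_pred, labels):
    """Return {label: (precision, recall)} from paired label lists.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Precision: share of predictions for this class that are correct.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        # Recall: share of true samples of this class that are found.
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        scores[label] = (precision, recall)
    return scores


# Toy, invented labels: 8 healthy and 2 PD samples.
y_true = ["healthy"] * 8 + ["PD"] * 2
# The classifier flags two healthy samples as PD and finds both PD samples.
y_pred = ["healthy"] * 6 + ["PD"] * 2 + ["PD"] * 2

scores = per_class_precision_recall(y_true, y_pred, ["healthy", "PD"])
for label, (precision, recall) in scores.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case the minority class reaches full recall but only 0.5 precision, while the healthy class has full precision and lower recall. This is the same qualitative trade-off the reply describes for the PD and healthy classes.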
Comment 4: Comparison with existing methods
In Table 6, the proposed method is compared with only two state-of-the-art (SOTA) methods. Since Table 1 lists several relevant approaches, including additional comparisons would strengthen the experimental validation. Including results for two or three more representative methods is recommended, if feasible.
Reply: Thank you for this comment. After the revision, the table number has changed to Table 7. We agree that including additional baselines would strengthen the experimental validation. However, to the best of our knowledge, the two previously reported SOTA methods are the only approaches that provide results on this specific dataset, and thus they remain the only directly comparable methods on this benchmark. To address your suggestion, we have expanded Table 7 by adding results for several additional representative SOTA visual and multimodal methods that deal with similar tasks. Although these methods are not evaluated on our dataset, we include them to provide a broader contextual comparison and position our method with respect to SOTA approaches.
Lines 576-580: A comparison with contemporary visual and multimodal SOTA models [17, 22, 24] trained on other representative single-task datasets sourced from YouTube (e.g., D-Vlog and YouTubePD) indicates that our proposed model achieves results comparable to these methods and, in many cases, surpasses them.
Comment 5: Multi-task classification and label interpretation
The proposed method appears to focus on multi-task classification. In such a setting, multiple label combinations may be possible (e.g., HC, PD, Depressed, PD+Depressed). It is not clear how the model handles cases involving multiple conditions. Please clarify how classification decisions are made when multiple categories are present and how the outputs should be interpreted in such scenarios.
Reply: Thank you for this comment. In our dataset, each video is annotated with exactly one label for Parkinson’s disease (PD) and one label for depression, and, according to the available annotations [32], no samples are labeled as positive for both conditions at once (i.e., there are no explicit comorbid cases such as PD+Depressed). Consequently, we were not able to empirically evaluate the model’s behavior under explicit comorbidity labeling or to report performance for such combined categories. Throughout this work, we therefore proceeded under the dataset-driven assumption that, within this corpus, the presence of one disease implies the absence of the other. We agree, however, that handling comorbidity is an important and practically relevant scenario. In future work, we plan to extend the annotation scheme of the WSM corpus to explicitly capture ambiguous and comorbid cases (including combinations such as those in your examples), which will enable a systematic analysis of multi-disease scenarios and a more precise interpretation of multi-task outputs. Moreover, we have expanded the description of the research corpus to explicitly state this comorbidity-related limitation of the current pipeline.
Lines 418-423: Importantly, the annotation protocol [32] used provides only binary labels (disorder or healthy) within each sub-corpus and does not include any comorbidity (e.g., both depression and PD, or PD and Alzheimer’s disease) as an explicit category. Therefore, our study does not model or estimate multi-label combinations, and the outputs should be interpreted only with respect to the single-disease labels available in the corpus.
Lines 773-776: Fourthly, since the current study does not evaluate comorbid conditions due to the lack of explicit annotations, in future work we plan to investigate multi-disease settings by leveraging corpora with annotated comorbidity cases.
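The dataset-driven assumption in the reply to Comment 5 (no sample is positive for both conditions) can be checked mechanically on an annotation table. The sketch below is only an illustration: the record layout with `video_id`, `depression`, and `pd` fields is a hypothetical schema, not the corpus format, and the three records are invented.

```python
from collections import Counter


def count_label_combinations(annotations):
    """Count how often each (depression, pd) label pair occurs."""
    return Counter((a["depression"], a["pd"]) for a in annotations)


def comorbid_ids(annotations):
    """Return ids of records labeled positive for both conditions."""
    return [
        a["video_id"]
        for a in annotations
        if a["depression"] == 1 and a["pd"] == 1
    ]


# Invented toy records using an assumed schema (1 = positive, 0 = negative).
toy_annotations = [
    {"video_id": "v1", "depression": 1, "pd": 0},
    {"video_id": "v2", "depression": 0, "pd": 1},
    {"video_id": "v3", "depression": 0, "pd": 0},
]

counts = count_label_combinations(toy_annotations)
overlap = comorbid_ids(toy_annotations)
print(dict(counts))
print("comorbid samples:", overlap)
```

An empty overlap list corresponds to the situation described in the reply, where the outputs can only be interpreted with respect to single-disease labels.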
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have revised the manuscript according to the reviewers' comments.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors took all my comments into account. I can recommend the manuscript for publication.
Reviewer 3 Report
Comments and Suggestions for Authors
In this revision, all the issues raised have been fully addressed.
