Review Reports - A Systematic Review of Transformer-Based Models for Depression Detection

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This is a timely topic, but I cannot support publication in the current form because the review is not yet reliable enough for a systematic paper.

The stated inclusion rules are not followed. Section 3 says only peer-reviewed English papers from 2020 to 2025 were included, but the reference list contains multiple arXiv papers, posted papers, etc. Please use only peer-reviewed published papers.
The quantitative synthesis is not methodologically sound. The paper mixes accuracy, F1, AUC, MAE, RMSE, CCC, symptom-level tasks, severity scoring, and diagnosis, then turns them into medians, gain ranges, Pareto frontiers, and cost-benefit rankings. Figures 3 and 4 therefore look precise, but the inputs are not directly comparable.
The quality assessment is not auditable. You state that all 47 studies passed a threshold, but no scoring rubric, threshold, reviewer agreement, or study-level bias table is provided.
Some conclusions overreach the presented evidence. The abstract says hybrid and multimodal models are best for difficult clinical tasks, but later the paper reports multimodal performance below hybrid and close to the other groups in clinical interviews.
The manuscript needs another full editorial pass because the section numbering and several tables are inconsistent.

Comments on the Quality of English Language

Needs improvement.

Author Response

For review article

Response to Reviewer 1 Comments

1. Summary

We would like to thank you for your rigorous and constructive evaluation. We agree that the original manuscript had certain methodological limitations that required resolution. Following your insightful suggestions, we have significantly revised our methodology. We have addressed each of your five points in detail below.

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Is the work a significant contribution to the field?

Dear reviewer, please see 3. Point-by-point response to Comments and Suggestions for Authors

Is the work well organized and comprehensively described?

Is the work scientifically sound and not misleading?

Are there appropriate and adequate references to related and previous work?

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The stated inclusion rules are not followed. Section 3 says only peer-reviewed English papers from 2020 to 2025 were included, but the reference list contains multiple arXiv papers, posted papers, etc. Please use only peer-reviewed published papers.

Response 1:

We fully accept the validity of this concern. The original inclusion criteria stated “peer-reviewed publications” without acknowledging the five preprints that were included. This was an inconsistency between stated methodology and practice that needed correction.

We considered two remedies: (a) removing all preprints, or (b) revising the inclusion criteria to explicitly accommodate high-quality preprints with appropriate justification. We chose option (b) for the following reason: in rapidly evolving AI subfields such as large language model research, several architecturally important contributions appear on preprint servers and are not yet formally published, yet their findings are substantively cited in the peer-reviewed literature. Excluding them would introduce recency bias and leave genuine architectural innovations undiscussed.

We have therefore revised Section 3.1 (Inclusion and Exclusion Criteria) to explicitly state that high-quality preprint manuscripts from arXiv were eligible for inclusion under defined conditions: they must demonstrate clear methodological rigour and must not duplicate findings available in peer-reviewed works. Of the 46 included studies, five originate from preprint repositories. A dedicated paragraph now identifies these five studies by name, acknowledges the elevated risk of undetected methodological errors in preprints, and states that findings from preprint sources are interpreted with heightened caution throughout the review. All five preprints were evaluated against the same six-dimension quality rubric applied to peer-reviewed studies, and only those achieving the minimum threshold score (≥3) were retained.

Position:

Highlighted version: 3. Survey Methodology + Supplementary Materials (PRISMA 2020 checklist + Study-Level Quality Assessment and Bias Evaluation for All 46 Included Studies) (Pages 6-9, lines 544-987).

3. Survey Methodology

This systematic review follows PRISMA 2020 guidelines to ensure methodological rigor and reproducibility, as illustrated in Figure 2. The completed PRISMA 2020 checklist is provided as Supplementary Material. A narrative synthesis approach was used instead of quantitative meta-analysis, given the substantial heterogeneity across the 46 included studies in datasets, evaluation metrics and ground truth definitions, which precludes valid statistical pooling. Crucially, this heterogeneity means that the numerical comparisons presented across studies (e.g., accuracy, F1, Area Under the Receiver Operating Characteristic Curve (AUC), Root Mean Square Error (RMSE), Concordance Correlation Coefficient (CCC)) should be interpreted within context rather than as direct head-to-head performance comparisons; they are presented to illustrate architectural trade-offs rather than to assert equivalent task-level benchmarking. A comprehensive literature search was performed between 2020 and 2025 using six major academic databases (IEEE Xplore, Elsevier, Springer, MDPI, PubMed and arXiv) supplemented by reference screening. The complete search strategy, documented in Table 2, utilized Boolean operators to combine key search terms across four PICO concept groups: Population/condition (P), Intervention/AI model type (I), Comparison/baseline methods (C), and Outcome/performance metrics (O). Modality-related terms (text, speech, audio, video, social media, EEG, clinical interview) were applied as supplementary search filters alongside the PICO query.

Figure 2. PRISMA 2020 flow diagram detailing the systematic study selection process.

Table 2. Search strategy mapped to PICO framework.

*PICO Component*	*Element*	*Search Terms/Description*
Population (P)	Target condition	“depression” OR “depressive disorder” OR “major depressive disorder”
Intervention (I)	AI model type	“Transformer” OR “BERT” OR “RoBERTa” OR “GPT” OR “LLM” OR “large language model” OR “pre-trained language model” OR “MentalBERT” OR “LLaMA” OR “Whisper” OR “CLIP” OR “Vision Transformer” OR “ViT” OR “ClinicalBERT”
Comparison (C)	Baseline/comparison methods	“machine learning” OR “deep learning” OR “CNN” OR “RNN” OR “LSTM” OR “non-Transformer” OR “conventional” OR “baseline”
Outcome (O)	Detection performance metrics	“detection” OR “classification” OR “screening” OR “severity assessment” OR “accuracy” OR “F1-score” OR “AUC” OR “sensitivity” OR “specificity” OR “RMSE”
Combined Query	Boolean	(Population) AND (Intervention) AND (Comparison) AND (Outcome)
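
The combined query in Table 2 can be assembled mechanically from the four term groups. The following is a minimal sketch, assuming a generic Boolean syntax; the names PICO_TERMS, or_group and build_query are illustrative and not part of the manuscript, and each database may need its own field tags or quoting rules.

```python
# Search terms copied from Table 2; the helper names are hypothetical.
PICO_TERMS = {
    "Population": ["depression", "depressive disorder", "major depressive disorder"],
    "Intervention": [
        "Transformer", "BERT", "RoBERTa", "GPT", "LLM", "large language model",
        "pre-trained language model", "MentalBERT", "LLaMA", "Whisper", "CLIP",
        "Vision Transformer", "ViT", "ClinicalBERT",
    ],
    "Comparison": [
        "machine learning", "deep learning", "CNN", "RNN", "LSTM",
        "non-Transformer", "conventional", "baseline",
    ],
    "Outcome": [
        "detection", "classification", "screening", "severity assessment",
        "accuracy", "F1-score", "AUC", "sensitivity", "specificity", "RMSE",
    ],
}


def or_group(terms):
    """Quote each term, join with OR, and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(pico):
    """AND together one OR-group per PICO component, in P-I-C-O order."""
    order = ("Population", "Intervention", "Comparison", "Outcome")
    return " AND ".join(or_group(pico[name]) for name in order)


query = build_query(PICO_TERMS)
print(query)
```

The modality-related filters mentioned in Section 3 are not included here, because the text does not state how they were combined with the PICO query.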

3.1. Inclusion and Exclusion Criteria

Studies were included if they: (1) used Transformer-based architectures as the core methodology for depression detection, classification, or severity assessment; (2) conducted empirical evaluation using standardized datasets with quantitative metrics; (3) presented methodological descriptions of sufficient clarity for reproducibility assessment; (4) were published in English as peer-reviewed journal articles, conference proceedings, or – in the case of rapidly evolving AI subfields where definitive peer-reviewed publications may not yet be available – as high-quality preprint manuscripts (e.g., arXiv submissions) that had undergone substantive community scrutiny, demonstrated clear methodological rigor, and contributed findings not duplicated by available peer-reviewed works; and (5) were published or made publicly available between 2020 and 2025.

Exclusion criteria encompassed: (1) non-Transformer approaches or those employing Transformers solely as comparison baselines (excluding six studies retained for comparative analysis); (2) absence of depression-specific evaluation; (3) insufficient methodological detail; (4) non-English publications; (5) duplicate results; (6) exclusive focus on non-depression conditions; and (7) review articles (utilized for reference screening only).

The decision to include high-quality preprints was motivated by the rapid pace of Transformer research, where key architectural innovations frequently appear on preprint servers prior to formal peer-reviewed publication. Of the 46 included studies, five were sourced from preprint repositories. All five included preprints were individually evaluated against the same six-dimension quality rubric applied to peer-reviewed studies, and only those achieving the minimum quality threshold score (≥3) were retained.

3.2. Screening and Selection Process

The systematic database search and manual screening identified 305 potentially relevant studies. After removing 25 duplicates and eliminating 118 non-Transformer approaches or studies using Transformers solely as comparison models, 80 studies were excluded for insufficient methodology or evaluation detail. Full-text review led to the further exclusion of 36 studies focused exclusively on conditions other than depression. Ultimately, 46 studies were selected for final inclusion. These were classified by architectural strategy as follows: 14 Encoder-only studies, 7 Decoder-only studies, 14 Hybrid architecture studies, and 11 Multimodal framework studies.
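
The selection counts above can be checked with simple arithmetic. The sketch below uses only the numbers stated in this paragraph; the variable names are illustrative.

```python
# Counts taken from the screening paragraph; variable names are illustrative.
identified = 305
exclusions = [
    ("duplicates", 25),
    ("non-Transformer or Transformer used only as comparison model", 118),
    ("insufficient methodology or evaluation detail", 80),
    ("focused exclusively on conditions other than depression", 36),
]

remaining = identified
for reason, count in exclusions:
    remaining -= count
    print(f"excluded {count:>3} ({reason}); remaining: {remaining}")

by_architecture = {
    "Encoder-only": 14,
    "Decoder-only": 7,
    "Hybrid": 14,
    "Multimodal": 11,
}

# Both totals should agree with the 46 included studies.
assert remaining == 46
assert sum(by_architecture.values()) == remaining
```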

3.3. Integration of Review Literature and Comparison with Non-Transformer Models

Beyond the 46 primary studies, ten relevant review articles on deep learning and Transformer-based depression detection models were analyzed for contextual background. Additionally, six studies employing non-Transformer approaches were incorporated as comparative baselines. Consistent evidence shows that Transformer models, particularly those exploiting self-attention mechanisms, outperform non-Transformer counterparts in accuracy, robustness, and capacity for handling complex multimodal data.

3.4. Study Quality and Bias Evaluation

The quality of included studies was independently assessed by two authors using a structured six-dimension rubric. Disagreements between the two raters were flagged and resolved through discussion with the third author, who served as arbitrator; no study required more than one round of arbitration. Inter-rater agreement was high prior to arbitration, with a Cohen’s κ of 0.82, indicating strong agreement. Each dimension was scored on a binary scale (0 = criterion not met; 1 = criterion met), yielding a maximum score of 6. All included studies achieved scores of 4 or above, indicating adequate methodological quality. Study-level quality scores and bias assessments are provided in Supplementary Table S1. The six dimensions and their operationalizations are as follows.

Methodological rigor (0-1): Clarity of model architecture, transparency of training configurations, and adequacy of preprocessing documentation. Studies scoring 1 provided sufficient architectural detail to permit independent replication.

Experimental validity (0-1): Appropriateness of dataset characteristics, evaluation protocols, and statistical reporting standards. Studies employing standard benchmark datasets (e.g., DAIC-WOZ, Reddit, Twitter corpora) with established train/test splits scored 1; those with unreported or non-standard splits scored 0.

Reproducibility (0-1): Accessibility of code, model checkpoints, and data-access information. Studies that publicly released code or provided sufficiently detailed pseudocode scored 1.

Comparative soundness (0-1): Use of appropriate comparison models and capacity-controlled experimental setups, including inclusion of ablation studies where architecturally warranted.

Clinical relevance (0-1): Correspondence of evaluation settings to actual clinical deployment scenarios, and acknowledgment of deployment limitations such as class imbalance, demographic skew, or inference latency constraints.

Bias susceptibility (0-1): Explicit reporting of dataset imbalance and applied mitigation strategies, linguistic and geographic representativeness, and platform-specific data characteristics (Twitter, Reddit, or clinical corpora) with discussion of implications for external validity.
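
The rubric total and the agreement statistic described above can be sketched as follows. This is an illustration under stated assumptions: the rating vectors are hypothetical, the text does not specify the unit over which Cohen’s κ was computed (per-dimension decisions or per-study outcomes), and the function names are not from the manuscript.

```python
DIMENSIONS = (
    "methodological_rigor",
    "experimental_validity",
    "reproducibility",
    "comparative_soundness",
    "clinical_relevance",
    "bias_susceptibility",
)
MIN_SCORE = 3  # retention threshold stated in Section 3.1 (score >= 3)


def total_score(ratings):
    """Sum the six binary ratings (each 0 or 1); the maximum is 6."""
    if set(ratings) != set(DIMENSIONS):
        raise ValueError("ratings must cover exactly the six dimensions")
    if any(value not in (0, 1) for value in ratings.values()):
        raise ValueError("each dimension is scored 0 or 1")
    return sum(ratings[name] for name in DIMENSIONS)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item list")
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for one study and for ten binary decisions.
example_study = dict.fromkeys(DIMENSIONS, 1)
example_study["reproducibility"] = 0
score = total_score(example_study)
retained = score >= MIN_SCORE

rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(score, retained, round(kappa, 3))
```

In the hypothetical example the raters agree on nine of ten decisions, which gives an observed agreement of 0.90, a chance agreement of 0.62, and κ of about 0.74.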

To ensure systematic bias evaluation, studies were additionally reviewed across three recurrent bias dimensions. Dataset imbalance was recorded by documenting class distributions and noting whether corrective strategies (e.g., Synthetic Minority Oversampling Technique (SMOTE), focal loss, weighted sampling, or stratified procedures) were applied; studies without corrective strategies were flagged as being at higher risk of majority-class inflation. Cultural and linguistic bias was assessed based on geographic and linguistic composition of datasets, with heavy concentration in English-language, Western social media sources noted as a significant limitation for cross-cultural generalizability. Platform-specific bias was evaluated by examining source characteristics (Twitter, Reddit, or structured clinical interviews), with documented implications for external validity and model transferability given the substantial differences in discourse style and self-disclosure behavior across platforms. Regarding small-sample and pilot studies: twelve of the 46 included studies reported training set sizes below 5,000 samples, and four employed fewer than 1,000 samples. These studies were retained rather than excluded on sample-size grounds because: (1) small-sample performance in limited-label settings is itself a research question of clinical relevance; (2) exclusion of small studies would introduce selection bias toward resource-rich settings; and (3) all retained studies passed the minimum quality threshold on the six-dimension rubric. The risk associated with small-sample studies is explicitly flagged in Section 3.4 bias assessments and interpreted with appropriate caution in Section 4.

Comments 2: The quantitative synthesis is not methodologically sound. The paper mixes accuracy, F1, AUC, MAE, RMSE, CCC, symptom-level tasks, severity scoring, and diagnosis, then turns them into medians, gain ranges, Pareto frontiers, and cost-benefit rankings. Figures 3 and 4 therefore look precise, but the inputs are not directly comparable.

Response 2:

This is the most substantive methodological concern raised in this review cycle, and we thank the reviewer for articulating it so clearly.

We have addressed this in three ways:

(1) Methodological qualification added to Section 3 (Survey Methodology).

(2) Qualification added to Section 5.2 (Discussion): A dedicated methodological qualification now precedes the cross-paradigm performance discussion, explicitly warning readers that the 46 studies employ heterogeneous evaluation metrics across non-equivalent tasks.

(3) Figure captions revised: The captions for both Figure 3 and Figure 4 now explicitly state that "All performance values aggregate heterogeneous metrics across non-equivalent tasks and datasets; they illustrate directional trends only and should not be read as directly comparable benchmarks."

Position:

Highlighted version (Page 6, lines 547-549).

A narrative synthesis approach was used instead of quantitative meta-analysis, given the substantial heterogeneity across the 46 included studies in datasets, evaluation metrics and ground truth definitions, which precludes valid statistical pooling.

Highlighted version (Page 22, lines 1985-1992).

Before proceeding, a methodological qualification is necessary. The 46 studies synthesized employ heterogeneous evaluation metrics, across tasks that are not equivalent: binary social media classification, ordinal severity scoring, continuous regression, and zero-shot symptom assessment. Aggregating these into median performance values or Pareto frontiers is inherently imprecise. The performance comparisons in Figure 4(a) and Figure 4(b) are therefore best understood as illustrative architectural trends and cost-benefit directional signals, not as equivalent-condition benchmarks. Readers seeking within-task granular comparisons should consult Tables 3-9 and the original cited studies.

Highlighted version (Page 22, lines 1979-1983).

Figure 3. Transformer-based vs. alternative deep learning architectures for depression detection (a) Multi-dimensional capability profile comparing five architectural paradigms across six qualitative dimensions (rated 1-5 based on study-level evidence; scores reflect directional trends, not quantitative benchmarks); (b) Minimum labeled training sample requirements by architecture (ranges are approximate; see source studies for details).

Highlighted version (Page 23, lines 2161-2167).

Figure 4. Cross-paradigm comparison of Transformer architectures for depression detection (a) Performance-cost trade-off zones for each architectural paradigm; bubble size is proportional to the number of included studies (n); bars indicate performance and computational cost ranges. (b) Reported performance across three application scenarios; bars represent median values and superimposed markers show individual study results. All performance values aggregate heterogeneous metrics across non-equivalent tasks and datasets; figures illustrate directional architectural trends and should not be interpreted as equivalent-condition benchmarks.

Updated Text:

5.1. Transformer-Based Architectures versus Alternative Approaches

Although Transformers have established themselves as the dominant paradigm for detecting depression, alternative deep learning architectures - spectral analysis [reference not found], spatio-temporal CNNs [reference not found], graph neural networks [reference not found], attention-based multimodal fusion [references not found], and hybrid CNN-LSTM architectures [reference not found] - demonstrate distinct advantages on specific capacity dimensions, most notably interpretability, while generally trailing Transformer variants on detection performance and sample efficiency, as illustrated in Figure 3(a). These alternatives have the advantage of explicitly modeling depression-specific patterns that Transformers learn only implicitly: spectral methods decompose behavioral signals into frequency domains to detect symptom periodicity [reference not found], the Maximization-Differentiation Network models facial transitions, achieving RMSE of 7.55 with only 25M parameters [reference not found], and GNNs outperform late-fusion Transformers by 6% on E-DAIC using structural encoding of cross-modal dependencies [reference not found].

However, as shown in Figure 3(b), these architectural innovations generally require 1,000-5,000 labeled samples and rigid domain-specific engineering. The practical dominance of Transformers stems from two architecture-specific features: massive-scale pre-training that encodes transferable linguistic knowledge and architectural unification that accommodates diverse input modalities within a single framework. MentalBERT achieves F1 of 97.3% with approximately 3,000 samples - representing roughly 2-3 times reduced annotation requirements compared to GNN-based and spectral alternatives. Generative Decoder-only models (DepGPT, GPT-4o) require far fewer samples still, approaching few-shot or zero-shot regimes with minimal task-specific training. Transformers also scale more effectively, with larger models yielding consistent gains, whereas alternatives plateau despite increased depth. It should be noted, however, that some alternative architectures are beginning to adopt pre-training strategies (e.g., graph pre-training for GNNs), which may narrow this advantage over time. Architecture selection must therefore balance benchmark precision against deployment constraints: alternatives may be preferred in stable, data-rich environments with well-defined signal processing requirements, while Transformers remain optimal for scenarios characterized by limited labels, cross-domain variability, or missing modalities.

Figure 3. Transformer-based vs. alternative deep learning architectures for depression detection (a) Multi-dimensional capability profile comparing five architectural paradigms across six qualitative dimensions (rated 1-5 based on study-level evidence; scores reflect directional trends, not quantitative benchmarks); (b) Minimum labeled training sample requirements by architecture (ranges are approximate; see source studies for details).

5.2. Comparative Analysis Across Transformer Paradigms

Before proceeding, a methodological qualification is necessary. The 46 studies synthesized employ heterogeneous evaluation metrics, across tasks that are not equivalent: binary social media classification, ordinal severity scoring, continuous regression, and zero-shot symptom assessment. Aggregating these into median performance values or Pareto frontiers is inherently imprecise. The performance comparisons in Figure 4(a) and Figure 4(b) are therefore best understood as illustrative architectural trends and cost-benefit directional signals, not as equivalent-condition benchmarks. Readers seeking within-task granular comparisons should consult Tables 3-9 and the original cited studies.
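
One way to respect this qualification in practice is to summarize results only within groups that share both task and metric. The sketch below illustrates that grouping; every record in it is hypothetical and none of the values comes from the reviewed studies.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records for illustration only; not values from the 46 studies.
results = [
    {"study": "study_A", "task": "binary classification", "metric": "F1", "value": 0.91},
    {"study": "study_B", "task": "binary classification", "metric": "F1", "value": 0.87},
    {"study": "study_C", "task": "binary classification", "metric": "F1", "value": 0.95},
    {"study": "study_D", "task": "severity regression", "metric": "RMSE", "value": 5.2},
    {"study": "study_E", "task": "severity regression", "metric": "RMSE", "value": 6.0},
]


def medians_by_task_and_metric(records):
    """Compute a median only within groups sharing the same task and metric."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["task"], record["metric"])].append(record["value"])
    return {key: median(values) for key, values in groups.items()}


summary = medians_by_task_and_metric(results)
for (task, metric), value in summary.items():
    print(f"{task} / {metric}: median = {value}")
```

Grouping this way keeps F1 scores and RMSE values in separate summaries, so an error measure (lower is better) is never averaged together with a score where higher is better.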

Systematic analysis of 46 Transformer-based studies reveals distinct Pareto frontiers in terms of computational cost and detection performance, as shown in Figure 4(a). Encoder-only architectures (n=14) define a high-efficiency zone with F1/accuracy scores ranging from 76 to 99% at minimal computational overhead (normalized to BERT at 10⁰), yielding the optimal performance-to-resource ratio for standard detection tasks; the single outlier (DepRoBERTa, F1: 58.3%) reflects limited corpus diversity rather than architectural inadequacy. Decoder-only models (n=7) incur costs one to two orders of magnitude higher than the BERT baseline; while this positions them in the high-cost quadrant, the expenditure is strategically justified by their few-shot generalization capabilities and novel data processing modalities (e.g., tabular-to-text transformation) in data-scarce environments, effectively trading infrastructure cost for reduced labeling requirements. Hybrid architectures (n=14) occupy a cost-efficient intermediate zone, demonstrating performance comparable to Encoder-only models through complementary neural component integration. Multimodal architectures (n=11) cluster at higher costs, reflecting the computational demands of cross-modal attention and multi-stream processing - a trade-off warranted when multi-view integration is critical to diagnostic sensitivity.

Practical efficacy is heavily context-dependent (Figure 4(b)). In social media environments, Encoder-only models show the highest median performance (96%), followed by Hybrid (95%) and Decoder-only architectures (92%), while Multimodal approaches trail at 87% - indicating that complex cross-modal fusion may introduce noise or unnecessary latency when textual signals are already highly discriminative. In clinical interview settings, a striking divergence emerges: Hybrid architectures exhibit the highest median (95%) based on robust DAIC-WOZ results from autoencoder-augmented and advanced encoding configurations, whereas Encoder-only (79%), Decoder-only (77%), and Multimodal (76%) architectures all converge below 80%. This pattern indicates that neither purely linguistic models nor standard multimodal fusion adequately addresses the challenges of clinical discourse - social desirability bias, linguistic masking, and structured interview formats - whereas Hybrid architectures overcome these limitations through complementary feature extraction that combines Transformer contextual understanding with specialized local pattern detection [reference not found]. For multimodal tasks, Hybrid architectures again lead (87%), followed by Decoder-only (83%), Encoder-only (79%), and Multimodal (78%), confirming that targeted architectural augmentation consistently outperforms both standalone and complicated fusion approaches. Consequently, Encoder-only models represent the most scalable solution for population-level text-based screening, Hybrid architectures provide the most robust and consistent performance across all application contexts, and Multimodal designs - despite their theoretical appeal - require further optimization to realize their potential for precision clinical diagnostics.


Figure 4. Cross-paradigm comparison of Transformer architectures for depression detection (a) Performance-cost trade-off zones for each architectural paradigm; bubble size is proportional to the number of included studies (n); bars indicate performance and computational cost ranges. (b) Reported performance across three application scenarios; bars represent median values and superimposed markers show individual study results. All performance values aggregate heterogeneous metrics across non-equivalent tasks and datasets; figures illustrate directional architectural trends and should not be interpreted as equivalent-condition benchmarks.

Comments 3: The quality assessment is not auditable. You state that all 47 studies passed a threshold, but no scoring rubric, threshold, reviewer agreement, or study-level bias table is provided.

Response 3:

We thank the reviewer for this important point. The original manuscript stated that a quality threshold was applied but provided no detail that would allow a reader to evaluate or replicate that assessment. We have substantially expanded Section 3.4 (Study Quality and Bias Evaluation) to provide full transparency.

Section 3.4 now specifies: (a) a six-dimension binary quality rubric (methodological rigour, experimental validity, reproducibility, comparative soundness, clinical relevance, and bias susceptibility), with each dimension operationalized in sufficient detail to permit replication; (b) the exclusion threshold (studies scoring below 3 out of 6 were excluded; all 46 included studies scored 4 or above); (c) the review process (two authors independently assessed each study, disagreements were arbitrated by a third author, with inter-rater agreement quantified as Cohen's κ = 0.82 prior to arbitration, indicating strong agreement); and (d) three additional bias dimensions evaluated systematically for all studies: dataset imbalance, cultural and linguistic bias, and platform-specific bias. Study-level quality scores and bias assessments are provided in Supplementary Table S1. We also corrected the study count from 47 to 46. We deleted the original reference: [47] Firoz, N.; Beresteneva, O.G.; Vladimirovich, A.S.; Tahsin, M.S. Enhancing Depression Detection through Advanced Text Analysis: Integrating BERT, Autoencoder, and LSTM Models. 2023.

Updated Text:

3.4. Study Quality and Bias Evaluation

Reproducibility (0-1): Accessibility of code, model checkpoints, and data-access information. Studies that publicly released code or provided sufficiently detailed pseudocode scored 1.

Comparative soundness (0-1): Use of appropriate comparison models and capacity-controlled experimental setups, including inclusion of ablation studies where architecturally warranted.

Comments 4: Some conclusions overreach the presented evidence. The abstract says hybrid and multimodal models are best for difficult clinical tasks, but later the paper reports multimodal performance below hybrid and close to the other groups in clinical interviews.

Response 4:

Thank you for your suggestion. We have revised the abstract to accurately represent the paradigm-specific findings.

Updated Text: Abstract, final sentence of the results paragraph.

Position:

Highlighted version:

Abstract (Page 1, lines 13-16):

Encoder-only models perform best in high-throughput text screening; Decoder-only models have stronger few-shot learning; Hybrid architectures achieve the highest performance in clinical interview settings; and Multimodal Fusion systems offer complementary advantages when heterogeneous signal integration is critical.

Conclusions (Page 26, lines 2544-2547):

Crucially, Hybrid architectures achieve the highest median performance in clinical interview settings, outperforming Encoder-only, Decoder-only, and Multimodal approaches - a finding with direct implications for deployment context selection.

Comments 5: The manuscript needs another full editorial pass because the section numbering and several tables are inconsistent.

Response 5:

We thank the reviewer for this suggestion. The Discussion has been renumbered as Section 5, with subsections 5.1 (Transformer-Based Architectures versus Alternative Approaches), 5.2 (Comparative Analysis Across Transformer Paradigms), 5.3 (Challenges and Limitations), and 5.4 (Future Directions). The Conclusions are now Section 6. Every internal cross-reference in the manuscript has been audited and updated to reflect the new numbering.

Updated Text: Throughout the manuscript.

4. Response to Comments on the Quality of English Language

Response: We appreciate the reviewer’s careful assessment of our manuscript’s language. We have taken this feedback seriously and performed a comprehensive linguistic overhaul to improve clarity, flow, and grammatical precision.

5. Additional clarifications

We would like to thank the reviewer again for the valuable feedback. We have addressed all the comments in the revised manuscript, and all major changes have been highlighted in blue and gray for your convenience.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear colleagues, I have some comments and recommendations for you to consider.

General points:

The report is not consistent with PRISMA 2020. Please restructure the review and provide a copy of the PRISMA 2020 checklist as supplementary material.

The figure resolution is too low. Please improve and update the figures.

Table 2 needs to be restructured according to PICO: https://www.cochranelibrary.com/about-pico

Specific points:

First, please add a section in the introduction about depression and its burden, and cite GBD studies.

Second, add a section on AI tests available for serious mental illnesses, such as Transformer-based models and NLP screening tools, which are increasingly used for evaluating the detection status of patients with serious mental illnesses.

Third, discuss the role of AI analytics and the use of wearable data and smart apps to improve detection stability and reduce translation risk using multimodal fusion.

Fourth, much of the evidence is extrapolated from social media text or unipolar depression benchmarks, with limited clinical-interview-specific data. Discuss this issue thoroughly and refer to them as serious mental illnesses; this needs to be clarified in the methods and results too.

Fifth, narrative reviews are suitable for emerging fields, but the lack of a systematic search protocol raises concerns about selection bias. The methods section mentions databases and keywords but does not detail how studies were selected or quality-assessed. For instance, inclusion of pilot studies and small-sample DAIC-WOZ works is appropriate but should be explicitly justified with a risk-of-bias discussion. Please add inclusion/exclusion criteria, and state how quality was measured (or, if it was not measured, explain and justify).

Minor:

Inconsistent abbreviations: MDD vs. major depressive disorder.

PHQ vs. full scale names.

Author Response

For review article

Response to Reviewer 2 Comments

1. Summary

We thank Reviewer 2 for identifying several important structural and content gaps. The reviewer's comments have led to meaningful additions that situate our technical findings within their broader clinical and methodological context. We respond to each point below.

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Is the work a significant contribution to the field?

Dear reviewer, please see 3. Point-by-point response to Comments and Suggestions for Authors

Is the work well organized and comprehensively described?

Is the work scientifically sound and not misleading?

Are there appropriate and adequate references to related and previous work?

Is the English used correct and readable?

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The report is not consistent with PRISMA 2020. Please restructure the review and provide a copy of the PRISMA 2020 checklist as supplementary material.

Response 1:

We agree that the original manuscript did not make its alignment with PRISMA 2020 sufficiently explicit or verifiable. We have made the following changes: (a) Section 3 now opens by explicitly stating that this systematic review follows PRISMA 2020 guidelines, with the PRISMA 2020 flow diagram included as Figure 2; (b) the completed PRISMA 2020 checklist is now provided as Supplementary Material; (c) the decision to use narrative synthesis rather than quantitative meta-analysis is now explicitly justified in Section 3 with reference to the heterogeneity across the included studies - a recognised and PRISMA-consistent approach for systematic reviews of AI methodology where pooled statistical analysis is not valid.

Location of change: Section 3, opening paragraph; Supplementary Material (PRISMA 2020 checklist).

Position:

3. Survey Methodology

Figure 2. PRISMA 2020 flow diagram detailing the systematic study selection process.

Table 2. Search strategy mapped to PICO framework.

*PICO Component*	*Element*	*Search Terms/Description*
Population (P)	Target condition	“depression” OR “depressive disorder” OR “major depressive disorder”
Intervention (I)	AI model type	“Transformer” OR “BERT” OR “RoBERTa” OR “GPT” OR “LLM” OR “large language model” OR “pre-trained language model” OR “MentalBERT” OR “LLaMA” OR “Whisper” OR “CLIP” OR “Vision Transformer” OR “ViT” OR “ClinicalBERT”
Comparison (C)	Baseline/comparison methods	“machine learning” OR “deep learning” OR “CNN” OR “RNN” OR “LSTM” OR “non-Transformer” OR “conventional” OR “baseline”
Outcome (O)	Detection performance metrics	“detection” OR “classification” OR “screening” OR “severity assessment” OR “accuracy” OR “F1-score” OR “AUC” OR “sensitivity” OR “specificity” OR “RMSE”
Combined Query	Boolean	(Population) AND (Intervention) AND (Comparison) AND (Outcome)

3.1. Inclusion and Exclusion Criteria

3.2. Screening and Selection Process

3.3. Integration of Review Literature and Comparison with Non-Transformer Models

3.4. Study Quality and Bias Evaluation

Reproducibility (0-1): Accessibility of code, model checkpoints, and data-access information. Studies that publicly released code or provided sufficiently detailed pseudocode scored 1.

Comparative soundness (0-1): Use of appropriate comparison models and capacity-controlled experimental setups, including inclusion of ablation studies where architecturally warranted.

Comments 2: The figures are of too low a resolution; please improve and update them.

Response 2:

We apologise for the low-resolution figures in the original submission. All figures have been regenerated.

Position:

Highlighted version:

(Page 4, lines 283-284):

Figure 1. Overall conceptual framework of this review.

Highlighted version:

(Page 7, lines 663-664):

Figure 2. PRISMA 2020 flow diagram detailing the systematic study selection process.

Highlighted version:

(Page 21, lines 1867-1953):

Figure 3. Transformers vs. alternative deep learning approaches for depression detection: (a) multi-dimensional capability profile comparing five architectural paradigms across six qualitative dimensions (rated 1-5 based on study-level evidence; scores reflect directional trends, not quantitative benchmarks); (b) minimum labeled training sample requirements by architecture (ranges are approximate; see source studies for details).

Highlighted version:

(Page 23, lines 2116-2122):

Comments 3: Table 2 needs to be restructured according to PICO: https://www.cochranelibrary.com/about-pico

Response 3:

We thank the reviewer for this specific and actionable suggestion. Table 2 has been fully restructured to present the search strategy according to the PICO framework. Each PICO component now lists the corresponding search terms and Boolean operators used in the literature search. Modality-related terms are documented as supplementary search filters applied alongside the PICO query. We believe this restructuring substantially improves the reproducibility of our search strategy.

Position:

Highlighted version:

(Page 6, lines 544-573):

3. Survey Methodology

Figure 2. PRISMA 2020 flow diagram detailing the systematic study selection process.

Table 2. Search strategy mapped to PICO framework.

*PICO Component*	*Element*	*Search Terms/Description*
Population (P)	Target condition	“depression” OR “depressive disorder” OR “major depressive disorder”
Intervention (I)	AI model type	“Transformer” OR “BERT” OR “RoBERTa” OR “GPT” OR “LLM” OR “large language model” OR “pre-trained language model” OR “MentalBERT” OR “LLaMA” OR “Whisper” OR “CLIP” OR “Vision Transformer” OR “ViT” OR “ClinicalBERT”
Comparison (C)	Baseline/comparison methods	“machine learning” OR “deep learning” OR “CNN” OR “RNN” OR “LSTM” OR “non-Transformer” OR “conventional” OR “baseline”
Outcome (O)	Detection performance metrics	“detection” OR “classification” OR “screening” OR “severity assessment” OR “accuracy” OR “F1-score” OR “AUC” OR “sensitivity” OR “specificity” OR “RMSE”
Combined Query	Boolean	(Population) AND (Intervention) AND (Comparison) AND (Outcome)
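
As an illustration of how the combined query in Table 2 could be assembled reproducibly, the following minimal Python sketch joins the terms within each PICO component with OR and joins the components with AND. The names `PICO_TERMS`, `or_group`, and `build_query` and the abbreviated term lists are assumptions made for this example; the exact query syntax accepted by each database differs and is not specified in the response.

```python
# Hypothetical helper: assembles a Boolean search string from PICO term lists.
# The term lists below are abbreviated from Table 2 for brevity.
PICO_TERMS = {
    "Population": ["depression", "depressive disorder", "major depressive disorder"],
    "Intervention": ["Transformer", "BERT", "RoBERTa", "large language model"],
    "Comparison": ["machine learning", "deep learning", "LSTM", "baseline"],
    "Outcome": ["detection", "classification", "F1-score", "AUC"],
}


def or_group(terms):
    """Quote each term and join the terms of one component with OR."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(pico_terms):
    """Join the per-component OR groups with AND, in insertion order."""
    return " AND ".join(or_group(terms) for terms in pico_terms.values())


if __name__ == "__main__":
    print(build_query(PICO_TERMS))
```

Keeping the terms in a single data structure means the same lists can be logged alongside the search date and reused across databases, with only the final string formatting adapted per database.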

Comments 4: First, please add a section in the introduction about depression and its burden, citing GBD studies.

Response 4:

We fully agree that the introduction should establish the public health magnitude of depression before motivating the technical review. We have expanded the opening paragraph of Section 1 significantly.

Position:

Highlighted version:

(Pages 1-2, lines 34-140):

According to the Global Burden of Disease (GBD) 2019 study - a systematic analysis covering 369 diseases in 204 countries and territories from 1990 to 2019 - depressive disorders, including major depressive disorder (MDD) and dysthymia, accounted for approximately 46.9 million disability-adjusted life-years (DALYs) worldwide in 2019, with MDD alone contributing 37.2 million DALYs and representing 37.3% of all mental disorder DALYs globally. Depressive disorders ranked as the second leading cause of years lived with disability (YLDs) globally in 2019, with burden disproportionately affecting women - whose age-standardized DALY rate was substantially higher than that of males - as well as younger populations in low- and middle-income countries [2, 3].

Comments 5: Second, add a section on AI tests available for serious mental illnesses; Transformer-based models and NLP screening tools are increasingly used to evaluate the detection status of patients with serious mental illnesses.

Response 5:

We thank the reviewer for this important contextualisation point. Our review focuses on depression, but Transformer-based tools are increasingly being applied across the broader spectrum of serious mental illnesses (SMIs). We have added a paragraph in Section 1 that contextualises our work within this wider landscape.

Position:

Highlighted version:

(Page 2, lines 163-173):

These architectures have been adapted for screening and detection in a range of serious mental illnesses (SMIs), including bipolar disorder, schizophrenia, and post-traumatic stress disorder. Transformer-based models such as MentalBERT, domain-adapted RoBERTa variants, and instruction-tuned large language models (LLMs) now constitute a growing toolkit for evaluating the detection status of patients with SMIs, leveraging clinical notes, social media posts, speech transcripts, and structured electronic health records as input modalities [15-17]. Among these conditions, depression has attracted the most extensive computational investigation owing to its global prevalence and the relative availability of large-scale annotated corpora; the breadth of Transformer architectures directed specifically at its detection is evident in the diversity of approaches emerging in recent literature:…

Comments 6: Third, discuss the role of AI analytics and the use of wearable data and smart apps to improve detection stability and reduce translation risk using multimodal fusion.

Response 6:

We agree that this dimension of AI-assisted depression detection is clinically important and was absent from the original manuscript. We have added relevant content in two locations.

Position:

Highlighted version:

(1. Introduction, Page 2, lines 149-158):

Beyond text, the proliferation of wearable sensors and mobile health applications has opened additional avenues: passive sensing via smartphones (accelerometry, GPS patterns [7], screen-use logs) and dedicated wearables provide continuous, objective proxies for sleep disruption, psychomotor retardation, and social withdrawal – cardinal features of depression that are otherwise difficult to quantify in clinical settings [8]. The integration of these heterogeneous data streams with state-of-the-art language models has been increasingly recognized as a promising pathway for improving depression detection stability and reducing the translation risk associated with single-modality, snapshot-based assessments [9, 10].

Highlighted version:

(5.4. Future Directions, Page 25, lines 2434-2452):

AI analytics, wearable data, and smart applications for detection stability. The integration of AI analytics with wearable sensors and mobile health applications represents a critical frontier for improving depression detection stability and reducing the translation risk associated with snapshot-based clinical assessments. Systematic evidence confirms that wearable AI demonstrates meaningful accuracy for detecting and predicting depression across diverse populations [9] and that passive, non-intrusive multimodal sensing approaches more comprehensively capture natural behaviours than controlled or single-session data collection paradigms [10]. Current Transformer-based models rely predominantly on single-session text or audio inputs, which are inherently susceptible to momentary fluctuations and self-presentation biases. Passive sensing modalities provide continuous, objective proxies for psychomotor retardation, sleep disruption, and social withdrawal [8], capturing symptom dynamics across naturalistic daily contexts rather than isolated clinical encounters [83]. Multimodal fusion architectures that integrate such longitudinal physiological streams with text-based Transformer features offer a pathway to richer, temporally grounded representations of depression severity. By anchoring model inputs in objectively measured behavioural signals, this approach simultaneously reduces reliance on single-session self-report measures and mitigates the cross-domain generalizability gap between social media corpora and real-world clinical populations – two of the most significant barriers to clinical translation identified in this review.

Comments 7: Fourth, much evidence is extrapolated from social media text or unipolar depression benchmarks, with limited clinical interview-specific data. Discuss this issue thoroughly and refer to them as serious mental illnesses. This needs to be clarified in the methods and results too.

Response 7:

This is one of the most clinically consequential limitations of the evidence base we reviewed, and we are grateful the reviewer pressed us to address it more explicitly. We have made additions in two locations.

Position:

Highlighted version:

(4.1.3. Synthesis and Tier Analysis, Page 12, lines 1181-1191):

It is important to note that the evidence base underlying this review draws heavily from social media text corpora and unipolar depression benchmarks. Much of the high-accuracy performance reported in Tier 1 and Tier 2 studies is extrapolated from contexts where patient-generated text was produced in unstructured, self-disclosure settings. Clinical interview-specific data, where language is regulated by structured formats and social desirability, yields substantially lower performance. This distinction between self-reported social media evidence and clinically validated SMI assessment contexts must be clearly acknowledged: models demonstrating 95-99% accuracy on social media datasets should not be assumed to transfer directly to formal psychiatric evaluation of SMI populations, and future work must systematically develop and validate models in authentic clinical contexts with gold-standard diagnostic labels.

Highlighted version:

(5.3 Challenges and Limitations, Pages 23-24, lines 2137-2263):

Furthermore, regarding study generalizability, a substantial proportion of the evidence in this review derives from social media text corpora and unipolar depression benchmarks that bear limited resemblance to the clinical contexts in which SMI assessments must ultimately function. The linguistic register, self-disclosure patterns, and symptom expression in Twitter or Reddit posts differ fundamentally from those encountered in structured clinical interviews with patients experiencing MDD or comorbid psychiatric conditions. This gap between the evidence base and the target clinical context represents a major translational risk that future benchmark design must address directly.

Comments 8: Fifth, narrative reviews are suitable for emerging fields; however, the lack of a systematic search protocol raises concerns about selection bias. The methods section mentions databases and keywords but does not detail how studies were selected or quality-assessed. For instance, inclusion of pilot studies and small-sample DAIC-WOZ works is appropriate but should be explicitly justified with a risk-of-bias discussion. Please add inclusion/exclusion criteria, and also explain how quality was measured (or, if it was not measured, explain and justify).

Response 8:

We thank the reviewer for this comprehensive methodological critique. We have substantially expanded Section 3 to address each element:

(a) Explicit inclusion and exclusion criteria are now presented in Section 3.1.

(b) The six-dimension quality rubric and inter-rater reliability (Cohen's κ = 0.82) are fully described in Section 3.4.

(c) The inclusion of pilot studies and small-sample works is now explicitly justified: twelve of the 46 studies reported training set sizes below 5,000 samples, and four used fewer than 1,000 samples. These were retained because small-sample performance in limited-label settings is itself a clinically relevant research question, exclusion would introduce resource-rich bias, and all retained studies passed the minimum quality threshold.
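
For readers who wish to check an agreement statistic such as the one cited in (b), unweighted Cohen's kappa for two raters is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's marginal label frequencies. The sketch below is a minimal Python implementation under that assumption; the response does not state whether an unweighted or weighted kappa was used, and the example ratings are invented for illustration rather than taken from the review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions per label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        raise ValueError("Kappa is undefined when both raters use one identical label.")
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Invented pass/fail decisions (1 = criterion met) for ten hypothetical studies.
    reviewer_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    reviewer_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    # Here p_o = 0.8 and p_e = 0.5, so kappa is 0.6 up to floating-point rounding.
    print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

Reporting the per-study ratings from which kappa was computed, and not only the coefficient, would let a reader reproduce the value directly.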

Position:

3. Survey Methodology

3.1. Inclusion and Exclusion Criteria

3.2. Screening and Selection Process

3.3. Integration of Review Literature and Comparison with Non-Transformer Models

3.4. Study Quality and Bias Evaluation

Reproducibility (0-1): Accessibility of code, model checkpoints, and data-access information. Studies that publicly released code or provided sufficiently detailed pseudocode scored 1.

Comparative soundness (0-1): Use of appropriate comparison models and capacity-controlled experimental setups, including inclusion of ablation studies where architecturally warranted.

Comments 9: Inconsistent abbreviations: MDD vs. major depressive disorder.

PHQ vs. full scale names.

Response 9:

We have reviewed all abbreviations throughout the manuscript. "Major depressive disorder (MDD)" is now defined at its first occurrence in Section 1 and used consistently thereafter. The Patient Health Questionnaire variants (PHQ-9, PHQ-8, PHQ-4) are defined at first use with their full names. The abbreviations table has also been substantially expanded.

Position:

Highlighted version: Pages 26-27, Abbreviations.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Receiver Operating Characteristic Curve
AUPRC	Area Under the Precision-Recall Curve
BDI	Beck Depression Inventory
BiLSTM	Bidirectional Long Short-Term Memory
CCC	Concordance Correlation Coefficient
CMDC	Chinese Multimodal Depression Corpus
CNN	Convolutional Neural Network
DAIC-WOZ	Distress Analysis Interview Corpus - Wizard of Oz
DSM-5	Diagnostic and Statistical Manual of Mental Disorders, 5th Edition
D-Vlog	Depression Vlog Dataset
E-DAIC	Extended Distress Analysis Interview Corpus
FLOPs	Floating-Point Operations
GBD	Global Burden of Disease
GBT	Gradient Boosting Trees
GNN	Graph Neural Network
GRU	Gated Recurrent Unit
LLM	Large Language Model
LoRA	Low-Rank Adaptation
LSTM	Long Short-Term Memory
MCC	Matthews Correlation Coefficient
MDD	Major Depressive Disorder
MINI	Mini International Neuropsychiatric Interview
MLM	Masked Language Modeling
MODMA	Multimodal Open Dataset for Mental Disorder Analysis
NLP	Natural Language Processing
PHQ-4	Patient Health Questionnaire-4
PHQ-8	Patient Health Questionnaire-8
PHQ-9	Patient Health Questionnaire-9
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QLoRA	Quantized Low-Rank Adaptation
RMSE	Root Mean Square Error
RNN	Recurrent Neural Network
SCID	Structured Clinical Interview for DSM Disorders
SMI	Serious Mental Illness
SMOTE	Synthetic Minority Oversampling Technique
ViT	Vision Transformer
YLD	Years Lived with Disability

4. Response to Comments on the Quality of English Language

5. Additional clarifications

We would like to thank the reviewer again for the valuable feedback. We have addressed all the comments in the revised manuscript, and all major changes have been highlighted in yellow and gray for your convenience.

Reviewer 3 Report

Comments and Suggestions for Authors

Thank you for submitting this systematic review. The topic is timely, the PRISMA methodology is rigorously applied, and the architecture-stratified taxonomy represents a meaningful advance over prior surveys in this space. The three-tier performance hierarchy for Encoder-only models, the cost-benefit framing of hybrid components, and the four cross-cutting clinical challenges are all well-argued contributions. That said, several issues, one of which involves a factual error, must be addressed before the manuscript can be considered for acceptance.

Major comments

1. Section numbering

The Discussion section is labeled "4. Discussion," duplicating the numbering of Section 4 (the architecture review). All subsections within the Discussion are consequently labeled 4.1, 4.2, and 4.3, which directly conflicts with the same labels already used for the Encoder-only, Decoder-only, and Hybrid subsections earlier in the paper. The Discussion should be renumbered as Section 5, with subsections 5.1, 5.2, and 5.3 respectively. The Conclusions should follow as Section 6. The authors are asked to verify that all internal cross-references are updated accordingly.

2. Factual citation

On page 19, the following statement appears: "the Maximization-Differentiation Network models facial transitions, achieving RMSE of 7.55 with only 25M parameters [8]." Reference [8] is Vaswani et al.'s foundational "Attention is all you need" paper, which has no connection to facial analysis or the MDN architecture. The intended citation is clearly reference [68] (De Melo et al.). As written, this sentence misattributes a specific quantitative empirical result to a paper that does not contain it. This must be corrected.

3. Internal statistical inconsistency

The proportion of studies using self-reported labels rather than gold-standard psychiatric diagnoses is reported as 85% in Sections 4.1.3 and 4.3 (Challenges), but as 86% in Section 4.2.4. Given that this figure is used as a central argument for the ground truth quality challenge, and appears multiple times across the manuscript, the authors should verify the correct value and apply it uniformly throughout.

4. Strength of novelty claims

The manuscript repeatedly asserts it provides "the first" architecture-stratified analysis and "the first systematic taxonomy" of this kind. While Table 1 is persuasive in demonstrating gaps in prior reviews, the claim as phrased is strong and may be contested. The authors should either provide a more explicit methodological justification for why none of the ten prior reviews qualifies, particularly [16], which does focus on Transformer implementations specifically, or soften the language to "the first comprehensive, mechanism-level, architecturally stratified analysis".

Minor comments

5. Incomplete abbreviations list

The abbreviations table on page 24 lists only two entries (PRISMA and NLP), despite the manuscript making extensive use of unexpanded abbreviations including PHQ-9, BDI, SCID, MINI, MLM, LoRA, QLoRA, FLOPs, AUC, RMSE, CCC, MCC, AUPRC, and several dataset names. The table should be expanded to include all non-standard abbreviations used in the text.

6. Unclear baseline in Table 6

The row for BERT-Autoencoder-LSTM [47] in Table 6 lists both the baseline performance and the performance gain as "Unclear". If the primary study does not report a meaningful baseline comparison, the authors should acknowledge this limitation explicitly in the accompanying text rather than leaving the table entries unresolved, as this currently gives the impression of incomplete reporting on the review authors' part rather than a limitation of the source study.

7. Expansion of the 100% F1 caution

The authors appropriately flag the 100% F1 score reported by MLlm-DR [62] on the CMDC dataset as likely reflecting dataset limitations rather than genuine performance (page 17). This is good scientific practice. However, the caveat is brief. Given that this result appears in Table 8 and may be read in isolation, the authors are encouraged to add a brief note directly within the table, for example, a footnote marker, directing readers to the interpretive caution in the text.

8. Language and readability

While the manuscript is generally well written, several passages would benefit from editing. Specific examples include: the sentence beginning "Optimal integration places bidirectional LSTM layers after Transformer Encoders..." (page 14), which loses grammatical continuity midway; and the phrase "complementary feature extraction [73] that combines Transformer contextual understanding with specialized local pattern detection" (page 20), where the inline citation interrupts the phrase awkwardly. The authors are encouraged to conduct a full language pass, with particular attention to article usage, sentence-boundary clarity, and consistency of terminology (e.g., "Encoder-only" vs. "encoder-based" appears variably throughout).

Author Response

Response to Reviewer 3 Comments

1. Summary

We thank Reviewer 3 for identifying several important structural and content gaps. The reviewer's comments have led to meaningful additions that situate our technical findings within their broader clinical and methodological context. We respond to each point below.

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Is the work a significant contribution to the field?

Dear reviewer, please see 3. Point-by-point response to Comments and Suggestions for Authors

Is the work well organized and comprehensively described?

Is the work scientifically sound and not misleading?

Are there appropriate and adequate references to related and previous work?

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1:

Section numbering

Response 1:

Position:

Highlighted version: Pages 20-25.

Comments 2:

Factual citation

Response 2:

We sincerely apologize for this error. The citation has been corrected. Following the reference renumbering that occurred during revision, De Melo et al. is now reference [76] in the final manuscript. The corrected sentence reads: "the Maximization-Differentiation Network models facial transitions, achieving RMSE of 7.55 with only 25M parameters [76]."

Position:

Highlighted version: Page 21, Section 5.1, first paragraph. Citation [8] → [76].

Comments 3:

Internal statistical inconsistency

Response 3:

We sincerely apologize for the statistical inconsistency regarding the proportion of self-reported labels. We have meticulously reviewed the original data and confirmed that 85% is the accurate figure.

Following the reviewer's constructive feedback, we have updated the manuscript to ensure this value is applied uniformly across all sections.

We have also conducted a comprehensive cross-check of all other statistical figures throughout the manuscript to prevent similar discrepancies.

Position:

Highlighted version: Page 15, line 1255, Section 4.2.4 (Synthesis):

However, critical impediments remain in all three domains: 85% of…

Comments 4:

Strength of novelty claims

Response 4:

We accept this critique. Absolute "first" claims in a field as active as this one are difficult to verify comprehensively and may fairly be contested. We have adopted the reviewer's suggested language throughout: all instances of "the first architecture-stratified analysis" and "the first systematic taxonomy" have been revised to "the first comprehensive, mechanism-level, architecturally stratified analysis." We have also added a sentence noting that the most proximate prior review [25] focuses on modality comparison rather than architectural mechanisms, which is the specific differentiation that justifies our "first" claim for this particular framing. We believe this language is both accurate and appropriately modest.

Position:

Highlighted version:

· Abstract (Page 1, lines 13-16): this review analyzed 46 studies and provided the first comprehensive, mechanism-level, architecturally stratified comparison of Encoder-only, Decoder-only, Hybrid and Multimodal Fusion paradigms, examining self-attention dynamics and transfer learning strategies.

· 1. Introduction (Page 3, lines 258-260): To address these gaps, this review presents the first comprehensive, mechanism-level, architecturally stratified analysis of Transformer-based depression detection, systematically analyzing 46 studies covering all four paradigms.

· 2. Limitations of Existing Reviews and Motivation for the Present Study (Page 6, lines 531-536): First, it introduces the first comprehensive, mechanism-level systematic taxonomy differentiating Encoder-only, Decoder-only, Hybrid and Multimodal paradigms to enable architecture selection based on specific clinical requirements – a distinction not achieved by any of the ten reviews in Table 1, the most proximate of which [25] focuses on modality comparison rather than architectural mechanisms.

Comments 5:

Incomplete abbreviations list

Response 5:

We thank the reviewer for pointing out the incompleteness of our abbreviations list. We agree that a comprehensive table is essential for the readability of a manuscript involving diverse clinical and computational terms.

As suggested, we have significantly expanded the abbreviations table on pages 26-27 to include all non-standard terms used in the text.

Position:

Highlighted version: Pages 26-27, Abbreviations.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Receiver Operating Characteristic Curve
AUPRC	Area Under the Precision-Recall Curve
BDI	Beck Depression Inventory
BiLSTM	Bidirectional Long Short-Term Memory
CCC	Concordance Correlation Coefficient
CMDC	Chinese Multimodal Depression Corpus
CNN	Convolutional Neural Network
DAIC-WOZ	Distress Analysis Interview Corpus - Wizard of Oz
DSM-5	Diagnostic and Statistical Manual of Mental Disorders, 5th Edition
D-Vlog	Depression Vlog Dataset
E-DAIC	Extended Distress Analysis Interview Corpus
FLOPs	Floating-Point Operations
GBD	Global Burden of Disease
GBT	Gradient Boosting Trees
GNN	Graph Neural Network
GRU	Gated Recurrent Unit
LLM	Large Language Model
LoRA	Low-Rank Adaptation
LSTM	Long Short-Term Memory
MCC	Matthews Correlation Coefficient
MDD	Major Depressive Disorder
MINI	Mini International Neuropsychiatric Interview
MLM	Masked Language Modeling
MODMA	Multimodal Open Dataset for Mental Disorder Analysis
NLP	Natural Language Processing
PHQ-4	Patient Health Questionnaire-4
PHQ-8	Patient Health Questionnaire-8
PHQ-9	Patient Health Questionnaire-9
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QLoRA	Quantized Low-Rank Adaptation
RMSE	Root Mean Square Error
RNN	Recurrent Neural Network
SCID	Structured Clinical Interview for DSM Disorders
SMI	Serious Mental Illness
SMOTE	Synthetic Minority Oversampling Technique
ViT	Vision Transformer
YLD	Years Lived with Disability

Comments 6:

Unclear baseline in Table 6

Response 6:

Thank you for this insightful observation. We agree that the "Unclear" entries in Table 6 were suboptimal and potentially confusing for readers.

After a thorough re-evaluation of the cited study [47] (Firoz, N.; Beresteneva, O.G.; Vladimirovich, A.S.; Tahsin, M.S. Enhancing Depression Detection through Advanced Text Analysis: Integrating BERT, Autoencoder, and LSTM Models. 2023), we determined that its experimental setup and reported metrics do not provide a sufficiently robust or comparable baseline for our comparison. To maintain data integrity and clarity in this manuscript, we have removed this study [47] and its corresponding row from Table 6.

Comments 7:

Expansion of the 100% F1 caution

Response 7:

We agree, and we appreciate that the reviewer acknowledged the original caveat as good scientific practice while pressing for it to be more visible. We have added a footnote marker (†) to the relevant cell in Table 8.

Position:

Highlighted version: Page 18, lines 1609-1613.

¹ AUC values approaching 100% on restricted single-platform datasets (ContextVecNet) are similarly subject to dataset-specific ceiling effects and should not be generalized without replication on independent clinical datasets.

² The 100% F1 reported by MLlm-DR on the CMDC dataset should be interpreted with substantial caution, as it likely reflects the limited scale and diversity of that dataset rather than genuine ceiling performance. The corresponding result on E-DAIC (79%) provides a more representative estimate of real-world detection capability.

Comments 8:

Language and readability

Response 8:

We appreciate the reviewer's meticulous reading and helpful suggestions regarding the language and grammar. We have revised the sentences.

Position:

Highlighted version: Page 16, lines 1410-1413.

(1) The LSTM sentence has been rewritten to read: "The most effective integration places bidirectional LSTM layers downstream of Transformer Encoders, where they capture sequential temporal dependencies within the contextualized token embeddings produced by self-attention, without disrupting the upstream attention representations." This formulation maintains grammatical continuity and is technically more precise.

Highlighted version: Page 22, 1988-1990.

(2) The inline citation has been moved to the end of the phrase, which now reads: "whereas Hybrid architectures overcome these limitations through complementary feature extraction that combines Transformer contextual understanding with specialized local pattern detection [81]."

Beyond these two specific corrections, we have conducted a full language pass of the manuscript. Terminology has been standardized: "Encoder-only" is used consistently throughout (not "encoder-based"). Article usage, sentence-boundary clarity, and hyphenation consistency have been improved.

4. Response to Comments on the Quality of English Language

5. Additional clarifications

We would like to thank the reviewer again for the valuable feedback. We have addressed all the comments in the revised manuscript, and all major changes have been highlighted in green for your convenience.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

This revision improves presentation. There are a few concerns.

Figures 3 and 4 remain non-auditable. The new caveat says they are illustrative, but the manuscript still does not explain how the 1–5 radar scores, computational-cost multipliers, sample-threshold ranges, or cross-paradigm medians were derived.
Several quantitative claims in Section 5.3, such as the 49–70% false-positive estimate and the 15–25% cross-lingual degradation, are not traceably derived. Can you confirm how these were calculated?
The manuscript still draws strong comparative conclusions from non-equivalent tasks and metrics. Despite the caveat, the abstract, Section 5.2, and the Conclusions continue to claim that Encoder-only, Hybrid, or Multimodal paradigms are best for specific settings.

Comments on the Quality of English Language

Need improvement.

Author Response

For review article

Response to Reviewer 1 Comments

1. Summary

We sincerely thank Reviewer 1 for the thorough and constructive critique. The three concerns - figure auditability, traceability of quantitative claims in Section 5.3, and the strength of comparative conclusions - have all been substantively addressed in the revised manuscript.

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Is the work a significant contribution to the field?

Dear reviewer, please see 3. Point-by-point response to Comments and Suggestions for Authors

Is the work well organized and comprehensively described?

Is the work scientifically sound and not misleading?

Are there appropriate and adequate references to related and previous work?

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: Figures 3 and 4 remain non-auditable. The new caveat says they are illustrative, but the manuscript still does not explain how the 1–5 radar scores, computational-cost multipliers, sample-threshold ranges, or cross-paradigm medians were derived.

Response 1:

We thank the reviewer for pressing on this point. Targeted revisions to Sections 5.1 and 5.2 (Position: Highlighted version: Pages 21-25, lines 642-876), together with Supplementary Tables S2, S3, and S4, ensure full auditability.

Revised Text:

5.1. Transformer-Based Architectures versus Alternative Approaches

Although Transformers have established themselves as the dominant paradigm for detecting depression, alternative deep learning architectures (spectral analysis, spatio-temporal CNNs, graph neural networks, attention-based multimodal fusion, and hybrid CNN-LSTM architectures) demonstrate distinct advantages on specific capability dimensions, most notably interpretability, while generally trailing Transformer variants on detection performance and sample efficiency, as illustrated in Figure 3(a) (scores 1-5; full derivation in Supplementary Table S2).

These alternatives have the advantage of explicitly modeling depression-specific patterns that Transformers learn only implicitly: spectral methods decompose behavioral signals into frequency domains to detect the periodicity of symptoms; the Maximization-Differentiation Network models facial transitions, achieving an RMSE of 7.55 with only 25M parameters on the AVEC2014 facial dataset; and GNNs outperform late-fusion Transformers by 6% in accuracy on E-DAIC using structural encoding of cross-modal dependencies. It should be noted that these performance figures reflect heterogeneous metrics and datasets - RMSE (lower is better), accuracy, and relative gain - and are therefore not directly interchangeable; they are presented to illustrate the task-specific strengths of each alternative rather than to assert equivalent-condition superiority over Transformers.

Generative Decoder-only models (DepGPT, GPT-4o) require far fewer samples still, approaching few-shot or zero-shot regimes with minimal task-specific training. Transformers also scale more effectively, with larger models yielding consistent gains, whereas alternatives plateau despite increased depth. It should be noted, however, that some alternative architectures are beginning to adopt pre-training strategies (e.g., graph pre-training for GNNs), which may narrow this advantage over time. Architecture selection must therefore balance benchmark precision against deployment constraints: alternatives may be preferred in stable, data-rich environments with well-defined signal processing requirements, while Transformers remain optimal for scenarios characterized by limited labels, cross-domain variability, or missing modalities.

Figure 3. Transformer-based vs. alternative deep learning architectures for depression detection. (a) Multi-dimensional capability profile comparing five architectural paradigms across six qualitative dimensions (scores 1–5; full derivation in Supplementary Table S2; scores reflect relative ordering, not absolute benchmarks); (b) Minimum labeled training sample requirements by architecture.

5.2. Comparative Analysis Across Transformer Paradigms

Before proceeding, two methodological qualifications are necessary.

First, the 46 studies synthesized employ heterogeneous evaluation metrics across tasks that are not equivalent: binary social media classification, ordinal severity scoring, continuous regression, and zero-shot symptom assessment. Aggregating these into median performance values or Pareto frontiers is inherently imprecise. The performance comparisons in Figure 4(a) and Figure 4(b) are therefore best understood as illustrative architectural trends and cost-benefit directional signals, not as equivalent-condition benchmarks. Readers seeking within-task granular comparisons should consult Tables 3-9 and the original cited studies.

Second, and critically, the studies use heterogeneous performance metrics - accuracy, F1-score, precision, and AUC - which are not directly interchangeable: accuracy can be inflated on class-balanced datasets; F1-score better captures performance under class imbalance; precision reflects positive predictive value; and AUC measures discrimination independently of classification threshold. To make this heterogeneity explicit, individual study results in Figure 4(b) are differentiated by reported metric type (▲=F1-score; ●= Accuracy; ■=Precision). The bar heights represent the central tendency of reported values within each task-paradigm grouping and should be interpreted as directional indicators, not equivalent-condition performance benchmarks.
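To illustrate why these metrics are not interchangeable, the following minimal sketch computes accuracy, precision, recall, and F1 from a single confusion matrix. The counts are hypothetical and are not taken from any reviewed study; they only show that one set of predictions can yield quite different headline numbers depending on the metric reported.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for one confusion matrix.

    Assumes at least one predicted positive and one actual positive,
    so that precision and recall are defined.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 60 depressed, 940 non-depressed users.
scores = classification_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is far higher than F1 for the same predictions, which is why a median that mixes accuracy-based and F1-based values can only indicate direction.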

Note on Figure Derivation Methodology. The qualitative capability scores in Figure 3(a) (scale 1-5) were assigned through a structured evidence-mapping procedure in which two authors independently rated each architectural paradigm on each of the six dimensions (detection performance, data efficiency, interpretability, computational efficiency, multimodal capability, and clinical readiness) based on explicit quantitative and qualitative evidence drawn from the studies summarized in Tables 3-9. Inter-rater agreement was assessed (Cohen's κ = 0.79), and divergent scores were resolved through consensus discussion; full per-cell evidence is provided in Supplementary Table S2. Scores reflect directional trends and relative ordering across paradigms, not absolute quantitative benchmarks. The minimum labeled sample thresholds in Figure 3(b) are drawn directly from specific studies cited in Section 4. The cross-paradigm median performance values in Figure 4(b) were computed separately within each task category (binary social media classification, clinical interview, multimodal tasks) using the study-level results reported in Tables 3-9; full derivation with per-study values and metric types is provided in Supplementary Table S3. All computational cost multipliers in Figure 4(a) are normalized to BERT base and fully derived in Supplementary Table S4.
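For readers who wish to audit an agreement statistic of this kind, the following is a minimal sketch of unweighted Cohen's κ for two raters. The ratings below are hypothetical and are not the authors' scores, so the result is not expected to reproduce the reported κ = 0.79. The response does not state whether an unweighted or a weighted variant was used; the unweighted form is an assumption made here for simplicity.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores from two raters for ten paradigm-dimension cells.
scores_a = [5, 4, 3, 4, 2, 5, 3, 4, 2, 3]
scores_b = [5, 4, 3, 3, 2, 5, 3, 4, 2, 4]
print(f"kappa = {cohens_kappa(scores_a, scores_b):.2f}")
```

Because the scores are ordinal, a weighted κ that penalizes a one-point disagreement less than a three-point disagreement may be the more informative choice.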

Systematic analysis of 46 Transformer-based studies reveals distinct Pareto frontiers in terms of computational cost and detection performance, as shown in Figure 4(a). Encoder-only architectures (n=14) define a high-efficiency zone with F1/accuracy scores ranging from 76 to 99% at minimal computational overhead (normalized to BERT at 10⁰), yielding the optimal performance-to-resource ratio for standard detection tasks; the single outlier (DepRoBERTa, F1: 58.3%) reflects limited corpus diversity rather than architectural inadequacy. Decoder-only models (n=7) incur costs one to two orders of magnitude higher than the BERT baseline; while this positions them in the high-cost quadrant, the expenditure is strategically justified by their few-shot generalization capabilities and novel data processing modalities (e.g., tabular-to-text transformation) in data-scarce environments, effectively trading infrastructure cost for reduced labeling requirements. Hybrid architectures (n=14) occupy a cost-efficient intermediate zone, demonstrating performance comparable to Encoder-only models through complementary neural component integration. Multimodal architectures (n=11) cluster at higher costs, reflecting the computational demands of cross-modal attention and multi-stream processing - a trade-off warranted when multi-view integration is critical to diagnostic sensitivity.
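The phrase "orders of magnitude higher than the BERT baseline" can be made concrete with a short sketch. The actual cost measure behind Figure 4(a) is defined in Supplementary Table S4, which is not reproduced here; the sketch below assumes, purely for illustration, that cost is proxied by parameter count, takes BERT base at roughly 110 million parameters, and uses a hypothetical 7-billion-parameter decoder.

```python
import math

# Approximate parameter count of BERT base, used as the normalization baseline.
BERT_BASE_PARAMS = 110e6


def cost_multiplier(cost, baseline=BERT_BASE_PARAMS):
    """Cost relative to the baseline (the baseline itself maps to 1, i.e. 10^0)."""
    return cost / baseline


def orders_of_magnitude(cost, baseline=BERT_BASE_PARAMS):
    """Base-10 logarithm of the relative cost."""
    return math.log10(cost_multiplier(cost, baseline))


# Hypothetical decoder-only model; the size is an illustrative assumption.
hypothetical_decoder_params = 7e9
print(f"multiplier: {cost_multiplier(hypothetical_decoder_params):.1f}x")
print(f"orders of magnitude: {orders_of_magnitude(hypothetical_decoder_params):.2f}")
```

Under this assumed proxy, the hypothetical decoder falls between one and two orders of magnitude above the baseline; a proxy based on FLOPs or inference latency could place the same model elsewhere.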

Practical efficacy is heavily context-dependent (Figure 4(b)). In social media environments, Encoder-only models show the highest central tendency across the reviewed binary classification studies (median of reported Acc/F1 values: 97%; n=7; six of seven studies report accuracy, one reports F1; full derivation in Supplementary Table S3), followed by Hybrid (94%; n=10; mix of Acc and F1) and Decoder-only architectures (92%; n=1 study, two conditions; both F1), while Multimodal approaches show a lower central tendency (77%; n=4; all F1) - reflecting the limited number of multimodal social media studies and the variability between the inflated ceiling result of ContextVecNet (F1: 96.19% on restricted Twitter data [Error! Reference source not found.]) and the more representative D-Vlog studies (F1: 73-78%), consistent with the observation that complex cross-modal fusion may introduce noise when textual signals are already discriminative. Because Encoder-only values are predominantly accuracy-based while Multimodal values are F1-based, direct comparison of their medians should account for this metric difference; the directional ordering nonetheless reflects the pattern evident in individual studies within Tables 3-9.

In clinical interview settings, a striking divergence emerges: Hybrid architectures exhibit the highest central tendency (94%; n=3 studies, 4 data points: DLCDME F1 95%; TCC F1 93.6% and 96.7%; Transformer-BiLSTM Precision 74%; full derivation in Supplementary Table S3) based on robust DAIC-WOZ and MODMA results from autoencoder-augmented and advanced encoding configurations, whereas Encoder-only (79%; F1: 81% and 76%), Decoder-only (80%; F1: 78% and 82.6%), and Multimodal (84%; range: F1 67% to Acc 94.17%; n=5 studies, 6 data points) architectures show lower values.

Note that the Hybrid clinical interview median aggregates three F1 values and one Precision value from three studies - metric heterogeneity that limits direct comparison with other paradigms. The Precision value of 74% reflects a different aspect of detection quality than F1; its inclusion in the median is disclosed here and in Supplementary Table S3. This pattern indicates that neither purely linguistic models nor standard multimodal fusion adequately addresses the full challenges of clinical discourse - social desirability bias, linguistic masking, and structured interview formats - whereas Hybrid architectures overcome these limitations through complementary feature extraction that combines Transformer contextual understanding with specialized local pattern detection.
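The clinical interview medians quoted above can be re-derived from the data points listed in the text. The following sketch uses only those listed values; the Multimodal group is omitted because its six individual data points are not enumerated in this passage. The second calculation, which drops the single Precision value, is an illustrative sensitivity check and not part of the authors' analysis.

```python
from statistics import median

# Data points (in %) as listed in the clinical interview paragraph.
clinical_interview = {
    "Hybrid": [95.0, 93.6, 96.7, 74.0],   # three F1 values and one Precision value
    "Encoder-only": [81.0, 76.0],         # both F1
    "Decoder-only": [78.0, 82.6],         # both F1
}

medians = {name: median(values) for name, values in clinical_interview.items()}
for name, value in medians.items():
    print(f"{name}: median = {value:.1f}%")

# Sensitivity check: Hybrid median using F1 values only (Precision 74% removed).
hybrid_f1_only = median([95.0, 93.6, 96.7])
print(f"Hybrid, F1 only: median = {hybrid_f1_only:.1f}%")
```

The computed medians (94.3, 78.5, and 80.3) agree with the reported 94%, 79%, and 80% after rounding, and removing the Precision value moves the Hybrid median only slightly, to 95.0.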

For multimodal tasks (studies incorporating audio, speech, or video modalities beyond text), Hybrid architectures again lead (94%; median of Precision 74%, F1 95%, F1 93.6%/96.7%; all DAIC-WOZ or MODMA datasets; full derivation in Supplementary Table S3), followed by Decoder-only (83%; F1 82.6% on DAIC-WOZ speech+text pipeline), with Encoder-only (79%; same DAIC-WOZ studies as clinical interview category: F1 81%, F1 76%) and the Multimodal paradigm (78%; central tendency of all 11 multimodal studies, excluding the study that reports MAE; mix of Acc and F1; see Supplementary Table S3) showing comparable values. These patterns suggest that targeted architectural augmentation consistently shows stronger results than both standalone linguistic models and complex fusion approaches in multimodal contexts within the reviewed studies.

These patterns should be interpreted as task-contextualized directional trends rather than definitive rankings: because studies employ non-equivalent tasks, datasets, and metrics, no single paradigm can be declared universally superior. Taken together, and with these qualifications in mind, the reviewed evidence indicates that Encoder-only models show consistently strong results for population-level text-based social media screening; Hybrid architectures demonstrate the most robust individual results in clinical interview and multimodal task settings across the reviewed studies; and Multimodal designs - despite their theoretical appeal - require further standardized evaluation before their clinical diagnostic potential can be fully assessed.


Figure 4. Cross-paradigm comparison of Transformer architectures for depression detection. (a) Performance-cost trade-off zones for each architectural paradigm; bubble size is proportional to the number of included studies (n); bars indicate performance and computational cost ranges; computational cost multipliers normalized to BERT base and fully derived in Supplementary Table S4. (b) Reported performance across three application scenarios; bars represent the central tendency of reported values within each task-paradigm grouping; superimposed markers show individual study results differentiated by reported metric type; full derivation of median values in Supplementary Table S3. All performance values aggregate heterogeneous metrics across non-equivalent tasks and datasets; figures illustrate directional architectural trends and should not be interpreted as equivalent-condition benchmarks.

Comments 2: Several quantitative claims in Section 5.3, such as the 49–70% false-positive estimate and the 15–25% cross-lingual degradation, are not traceably derived. Can you confirm how these were calculated?

Response 2:

We thank the reviewer for this important challenge and address each claim separately.

(Section 5.3 (Position: Highlighted version: Page 25, lines 889-895)):

The revised Section 5.3 removes the 49–70% false-positive estimate and instead provides a qualitative mechanistic explanation: "Compounding this, artificially balanced datasets do not represent real-world depression prevalence: models trained under balanced-class assumptions may yield substantially elevated false-positive rates when deployed in realistic clinical settings, risking a volume of unconfirmed positive screens that would overwhelm referral systems. The lack of standardized evaluation protocols creates considerable performance variance for identical models across configurations, making meaningful cross-study comparison difficult."
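The mechanism described in this passage can be illustrated with a short calculation of positive predictive value at different prevalence levels. The sensitivity, specificity, and prevalence values below are hypothetical assumptions chosen for illustration; they are not drawn from the reviewed studies and are not a reconstruction of the removed estimate.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive screens that are true positives (Bayes' rule)."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Hypothetical classifier: 90% sensitivity and 90% specificity.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Same classifier evaluated on a balanced benchmark and at an assumed 5% prevalence.
for label, prevalence in [("balanced benchmark", 0.50), ("assumed deployment", 0.05)]:
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"{label}: PPV = {ppv:.2f}, false share of positive screens = {1 - ppv:.2f}")
```

Under these assumed values the classifier itself is unchanged, yet most positive screens at low prevalence are false, which is the referral-burden concern the revised text describes qualitatively.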

(Section 5.3 (Position: Highlighted version: Pages 25-26, lines 919-937)):

“Models trained predominantly on English-language data demonstrate approximately 7-11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies - for example, DepGPT shows an approximately 11% F1 drop when applied to Bengali versus English data in zero-shot settings [Error! Reference source not found.], and Whisper+GPT-2 shows an approximately 7% F1 drop on Indic-Bengali relative to DAIC-WOZ [Error! Reference source not found.]. Broader cross-lingual performance gaps are anticipated under conditions of greater linguistic and script distance…”

The corresponding statement in Section 6 has been revised identically.

(6. Conclusions (Position: Highlighted version: Page 27, lines 1034-1036)):

a 93% concentration on English data among Encoder-only studies (with 78% of samples from North American sources across paradigms), associated with approximately 7-11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies;

Comments 3: The manuscript still draws strong comparative conclusions from non-equivalent tasks and metrics. Despite the caveat, the abstract, Section 5.2, and the Conclusions continue to claim that Encoder-only, Hybrid, or Multimodal paradigms are best for specific settings.

Response 3:

We thank the reviewer for this precise and fair observation. The changes are as follows:

(a) Abstract (Position: Highlighted version: Page 1, lines 17-24): "Encoder-only models perform best in high-throughput text screening…" revised to "Encoder-only models show consistently strong results in high-throughput text-based screening; Decoder-only models demonstrate stronger few-shot learning capabilities; Hybrid architectures show the highest observed median performance in clinical interview settings across the reviewed studies; and Multimodal Fusion systems offer complementary advantages when heterogeneous signal integration is critical. These trends are task-contextualized and should not be interpreted as unconditional rankings, given heterogeneity in evaluation metrics and tasks across studies.”

(b) Section 6 (Conclusions) (Position: Highlighted version: Page 27, lines 1025-1030): "Crucially, Hybrid architectures show the highest median performance in clinical interview settings across the reviewed studies, outperforming Encoder-only, Decoder-only, and Multimodal approaches - a pattern with potential implications for deployment context selection..." revised to "Across the reviewed studies, Hybrid architectures show the highest observed median performance in clinical interview settings, outperforming Encoder-only, Decoder-only, and Multimodal approaches within this task category - a pattern that should be interpreted cautiously given the metric heterogeneity across studies and the non-equivalence of tasks across paradigms."

(c) Section 5.2 - new second methodological paragraph (Position: Highlighted version: Page 22, lines 701-709): “Second, and critically, the studies use heterogeneous performance metrics - accuracy, F1-score, precision, and AUC - which are not directly interchangeable: accuracy can be inflated on class-balanced datasets; F1-score better captures performance under class imbalance; precision reflects positive predictive value; and AUC measures discrimination independently of classification threshold. To make this heterogeneity explicit, individual study results in Figure 4(b) are differentiated by reported metric type (▲=F1-score; ●= Accuracy; ■=Precision). The bar heights represent the central tendency of reported values within each task-paradigm grouping and should be interpreted as directional indicators, not equivalent-condition performance benchmarks.”

4. Response to Comments on the Quality of English Language

5. Additional clarifications

We thank Reviewer 1 again for the rigorous critique. All major changes are highlighted in blue for the reviewer's convenience.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript makes a genuine and well-organized contribution to the field. The authors have addressed all substantive reviewer concerns responsibly. The four issues identified below, one structural (incomplete sentence), one analytical (overgeneralization in conclusions), one stylistic (residual language), and one presentational (ContextVecNet caveat), are all correctable without major re-analysis. The paper is recommended for minor revision, after which it should be suitable for acceptance without further review.

Issues Requiring Correction

1. Incomplete sentence in Section 5.2 (major)

The following sentence is grammatically incomplete and logically unclear:

"Decoder-only models (n=7) incur costs one to two orders of magnitude relative to BERT baseline)"

The phrase "one to two orders of magnitude" requires a comparative adjective, almost certainly "higher", to complete the meaning. The closing parenthesis is also unpaired. The sentence should be revised to something like: "Decoder-only models (n=7) incur costs one to two orders of magnitude higher than the BERT baseline..." This appears to have survived the language pass and must be corrected.

2. Overgeneralization in the Conclusions (Section 6)

The conclusions state:

"a 93% concentration on English data, causing 15-25% performance degradation in low-resource or culturally diverse populations"

However, in Section 5.3 (Challenges), this 93% figure is qualified specifically: "93% of Encoder-only studies" and "78% from North American samples." As written in the conclusions, the 93% figure is presented without qualification, implying it applies to all 46 studies across all four paradigms. This is an overgeneralization that may mislead readers and should be revised to preserve the scope stated in Section 5.3, either by specifying it applies to Encoder-only studies or by computing and reporting an equivalent figure across all paradigms.

3. Residual language issues not resolved by the full language pass

Despite the authors' claim to have conducted a comprehensive linguistic overhaul, several constructions remain awkward or imprecise:

* Section 4.1.2: "Ensemble methods combining BERT, BERTweet, and ALBERT make a 13.5% improvement in AUC": "make" is colloquial in this context; "achieve" or "demonstrate" is more appropriate.
* Section 4.4.2: "late fusion architectures... consistently yield better accuracies of 89-94%": "better" is vague; "higher" is the standard term in quantitative comparisons.
* Section 4.3.2: "The most effective integration places bidirectional LSTM layers downstream of Transformer Encoders, where they capture sequential temporal dependencies within the contextualized token embeddings produced by self-attention, without disrupting the upstream attention representations": while improved from the original, the italicised clause is ambiguous: it is not architecturally obvious why downstream LSTM layers would disrupt upstream attention, and the claim needs either clarification or removal.

The authors are asked to perform a more targeted pass on quantitative and technical language throughout.

4. ContextVecNet caveat: footnote visibility (minor)

The authors correctly added a footnote marker to the 100% F1 result for MLlm-DR in Table 8 (Comment 7). Similarly, Footnote 1 in Table 8 addresses the AUC of 99.22% reported for ContextVecNet. However, the corresponding body text in Section 4.4.2 presents the ContextVecNet result without any inline signal directing the reader to this caveat:

"ContextVecNet proves the effectiveness of this approach for social media analysis, obtaining AUC of 99.22% using CLIP-based encodings [66]"

Given that both inflated metrics are addressed in the same footnote block, the body-text framing for ContextVecNet should include at least a brief qualifier paralleling that used for MLlm-DR in the same section, to avoid the appearance of uncritical endorsement of a ceiling-effect result.

Author Response

For review article

Response to Reviewer 3 Comments

1. Summary

We are sincerely grateful to Reviewer 3 for the careful, thorough, and constructive evaluation of our revised manuscript. We deeply appreciate the reviewer's recognition that the paper makes a genuine contribution to the field and that the substantive concerns from the previous round were addressed responsibly. The four remaining issues identified - one structural, one analytical, one stylistic, and one presentational - have each been taken seriously and corrected in full. We have endeavored to address every point with the precision and rigor the reviewer rightly expects, and we hope the revised manuscript meets the standard required for acceptance.

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Is the work a significant contribution to the field?

Dear reviewer, please see 3. Point-by-point response to Comments and Suggestions for Authors

Is the work well organized and comprehensively described?

Is the work scientifically sound and not misleading?

Are there appropriate and adequate references to related and previous work?

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1:

Issues Requiring Correction

1. Incomplete sentence in Section 5.2 (major)

The following sentence is grammatically incomplete and logically unclear:

"Decoder-only models (n=7) incur costs one to two orders of magnitude relative to BERT baseline)"

Response 1:

We thank the reviewer for this suggestion. We have corrected the sentence in Section 5.2 (Position: Highlighted version: Page 23, lines 742-743) to read as follows:

“Decoder-only models (n=7) incur costs one to two orders of magnitude higher than the BERT baseline; while this positions them in the high-cost quadrant, the expenditure is strategically justified by their few-shot generalization capabilities and novel data processing modalities in data-scarce environments, effectively trading infrastructure cost for reduced labelling requirements.”

Comments 2:

Overgeneralization in the Conclusions (Section 6)

The conclusions state:

"a 93% concentration on English data, causing 15-25% performance degradation in low-resource or culturally diverse populations"

Response 2:

We are grateful for this important catch. The sentence in Section 6 (Position: Highlighted version: Page 27, lines 1033-1036) has been revised to read:

“a 93% concentration on English data among Encoder-only studies (with 78% of samples from North American sources across paradigms), associated with approximately 7-11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies;”

Comments 3:

Residual language issues not resolved by the full language pass

Despite the authors' claim to have conducted a comprehensive linguistic overhaul, several constructions remain awkward or imprecise:

* Section 4.1.2: "Ensemble methods combining BERT, BERTweet, and ALBERT make a 13.5% improvement in AUC": "make" is colloquial in this context; "achieve" or "demonstrate" is more appropriate.

* Section 4.4.2: "late fusion architectures... consistently yield better accuracies of 89-94%": "better" is vague; "higher" is the standard term in quantitative comparisons.

* Section 4.3.2: "The most effective integration places bidirectional LSTM layers downstream of Transformer Encoders, where they capture sequential temporal dependencies within the contextualized token embeddings produced by self-attention, without disrupting the upstream attention representations": while improved from the original, the italicised clause is ambiguous: it is not architecturally obvious why downstream LSTM layers would disrupt upstream attention, and the claim needs either clarification or removal.

The authors are asked to perform a more targeted pass on quantitative and technical language throughout.

Response 3:

We are grateful to the reviewer for identifying these specific instances. Each of the three flagged items has been corrected, and we have additionally conducted a targeted pass across the full manuscript to identify and resolve analogous cases. The changes are as follows:

(i) Section 4.1.2 (Position: Highlighted version: Page 11, lines 344-346) - "make a 13.5% improvement" corrected to "achieve a 13.5% improvement":

"Ensemble methods combining BERT, BERTweet, and ALBERT achieve a 13.5% improvement in AUC through diverse tokenization."

(ii) Section 4.4.2 (Position: Highlighted version: Page 19, line 587) - "better accuracies" corrected to "higher accuracies":

"Late fusion architectures process modalities through specialized encoders before integration and consistently yield higher accuracies of 89-94%."

(iii) Section 4.3.2 (Position: Highlighted version: Page 16, lines 499-502) - The ambiguous clause "without disrupting the upstream attention representations" has been removed. The sentence now reads:

“The most effective integration places bidirectional LSTM layers downstream of Transformer Encoders, where they capture sequential temporal dependencies within the contextualized token embeddings produced by self-attention.”

Additional corrections identified during the full targeted pass:

Location	Original	Revised
Section 4.1.3	"providing better robustness against distribution shifts"	"demonstrating greater robustness against distribution shifts"
Section 4.3.2	"This great improvement can be attributed to…"	"This marked improvement can be attributed to…"
Section 4.3.4	"a great methodological gap limiting causal attribution"	"a substantial methodological gap limiting causal attribution"
Section 4.2.2	"models augmented with classification layers give an F1 of 85%"	"models augmented with classification layers yield an F1 of 85%"
Section 4.2.3	"MentalBERT performs an exceptional F1 of 97.3%"	"MentalBERT achieves an F1 of 97.3%"
Section 4.2.1	"DepGPT shows exceptional few-shot detection performance"	"DepGPT demonstrates strong few-shot detection performance"

Comments 4:

ContextVecNet caveat: footnote visibility (minor)

"ContextVecNet proves the effectiveness of this approach for social media analysis, obtaining AUC of 99.22% using CLIP-based encodings [66]"

Response 4:

We thank the reviewer for this precise and fair observation. We have revised Section 4.4.2 (Position: Highlighted version: Pages 18-19, lines 577-581) to provide a parallel treatment for both results. The ContextVecNet sentence now reads:

“ContextVecNet reports an AUC of 99.22% using CLIP-based encodings [66]. This near-ceiling AUC, however, reflects dataset-specific ceiling effects on a restricted single-platform Twitter dataset; without replication on independent clinical populations, it should not be taken as evidence of general detection capability.”

4. Response to Comments on the Quality of English Language

Point 1: The reviewer identified three specific colloquial or imprecise constructions in Sections 4.1.2, 4.4.2, and 4.3.2, and requested a more targeted pass on quantitative and technical language throughout the manuscript.

Response 1: All three flagged constructions have been corrected as detailed in Response 3 above. In addition, a systematic review of quantitative and technical language across the full manuscript identified six further instances of imprecise or informal phrasing, all of which have been revised (see the table in Response 3). We acknowledge that the prior language revision was insufficiently targeted with respect to quantitative expression, and we have taken specific care in this revision to ensure that comparative terms (e.g., higher rather than better), result-reporting verbs (e.g., achieve, yield, demonstrate rather than make, give, show), and qualitative intensity terms (e.g., marked, substantial rather than great) conform to standard academic usage throughout.

5. Additional clarifications

We wish to express our sincere gratitude once more to Reviewer 3 for the exceptionally thorough and constructive engagement with this manuscript across both rounds of review. The reviewer's precision in identifying issues of scope, completeness, and technical language has materially strengthened the paper, and we are genuinely appreciative of the care taken. We are confident that the revised manuscript addresses all four outstanding concerns in full, and we hope it is now suitable for acceptance. The changes are marked in green in the highlighted version. Thank you.

Author Response File: Author Response.pdf