A Systematic Review of Transformer-Based Models for Depression Detection

Zhou, Shiwen; Mohd, Masnizah; Zakaria, Lailatul Qadri

doi:10.3390/app16105018

Open Access Systematic Review

by Shiwen Zhou *, Masnizah Mohd and Lailatul Qadri Zakaria *

Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5018; https://doi.org/10.3390/app16105018

Submission received: 12 March 2026 / Revised: 7 May 2026 / Accepted: 7 May 2026 / Published: 18 May 2026

Abstract

Depression is a critical global public health challenge, and the demand for accurate automated detection methods has generated considerable research interest in Transformer-based models. Despite their substantial promise, a comprehensive investigation into their architectural efficacy, intrinsic mechanisms, and barriers to practical implementation remains lacking. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines, this systematic review was conducted across six databases (IEEE Xplore, Elsevier, Springer, MDPI, PubMed, and arXiv). The final search was performed in October 2025, covering English-language empirical studies published between 2020 and 2025 that employed Transformer-based architectures for depression detection. Risk of bias and methodological quality were independently appraised by two authors using a six-dimension structured rubric, with disagreements resolved by a third author. Findings were narratively synthesized given substantial cross-study heterogeneity. This systematic review analyzed 46 studies and provided the first comprehensive, mechanism-level, architecturally stratified comparison of encoder-only, decoder-only, hybrid, and multimodal fusion paradigms, examining self-attention dynamics and transfer learning strategies. Since 2019, these frameworks have evolved from text-centric approaches to advanced multimodal systems. Encoder-only models show consistently strong results in high-throughput text-based screening, decoder-only models demonstrate stronger few-shot learning capabilities, hybrid architectures show the highest observed median performance in clinical interview settings across the reviewed studies, and multimodal fusion systems offer complementary advantages when heterogeneous signal integration is critical. These trends are task-contextualized and should not be interpreted as unconditional rankings, given heterogeneity in evaluation metrics and tasks across studies. 
Nonetheless, four principal challenges hinder clinical translation: overreliance on self-reported data, cross-linguistic bias, absence of uncertainty quantification, and substantial computational overhead. Future efforts should shift from incremental benchmark improvements toward clinical utility through standardized psychiatric validation, uncertainty-aware architectures, fairness-enforced training across diverse populations, and the integration of Transformer-based models with wearable and mobile health data to improve detection stability and reduce translational risk. This systematic review was registered on the Open Science Framework (OSF; DOI: 10.17605/OSF.IO/SYF9N). This research was funded by the Faculty of Information Science and Technology and by Universiti Kebangsaan Malaysia under Grant TAP-K014364.

Keywords:

deep learning; depression detection; natural language processing; Transformer-based models

1. Introduction

Depression is a serious and disabling mental disorder affecting over 280 million individuals globally [1]. According to the Global Burden of Disease (GBD) 2019 study—a systematic analysis covering 369 diseases in 204 countries and territories from 1990 to 2019—depressive disorders, including major depressive disorder (MDD) and dysthymia, accounted for approximately 46.9 million disability-adjusted life-years (DALYs) worldwide in 2019, with MDD alone contributing 37.2 million DALYs and representing 37.3% of all mental disorder DALYs globally. Depressive disorders ranked as the second leading cause of years lived with disability (YLDs) globally in 2019, with the burden disproportionately affecting women—whose age-standardized DALY rate was substantially higher than that of males—as well as younger populations in low- and middle-income countries [2,3]. Conventional diagnostic strategies—structured clinical interviews and standardized questionnaires like the Patient Health Questionnaire-9 (PHQ-9) and Beck Depression Inventory (BDI)—remain susceptible to subjective bias, recall errors, accessibility limitations, and heterogeneity issues, which contribute to large treatment gaps [4].

Concurrently, digital platforms have produced rich behavioral data: depressive symptomatology manifests through unique linguistic markers (elevated first-person pronouns, absolutist words, negative words, past tense, and simplicity) alongside behavioral features (changed posting frequency, interaction patterns, and temporal activity shifts), allowing computational screening, longitudinal monitoring, and early intervention [5,6]. Beyond text, the proliferation of wearable sensors and mobile health applications has opened additional avenues: passive sensing via smartphones (accelerometry, GPS patterns [7], and screen-use logs) and dedicated wearables provides continuous, objective proxies for sleep disruption, psychomotor retardation, and social withdrawal—cardinal features of depression that are otherwise difficult to quantify in clinical settings [8]. The integration of these heterogeneous data streams with state-of-the-art language models has been increasingly recognized as a promising pathway for improving depression detection stability and reducing the translation risk associated with single-modality, snapshot-based assessments [9,10].

Transformer models have revolutionized natural language processing (NLP) by enabling unprecedented contextual analysis of human language [11,12,13]. Introduced in [14], Transformers use self-attention mechanisms to model relationships between distant sequence elements, capturing complex linguistic patterns that earlier architectures failed to detect. These architectures have been adapted for screening and detection in a range of serious mental illnesses (SMIs), including bipolar disorder, schizophrenia, and post-traumatic stress disorder. Transformer-based models such as MentalBERT, domain-adapted RoBERTa variants, and instruction-tuned large language models (LLMs) now constitute a growing toolkit for screening patients for SMIs, leveraging clinical notes, social media posts, speech transcripts, and structured electronic health records as input modalities [15,16,17]. Among these conditions, depression has attracted the most extensive computational investigation owing to its global prevalence and the relative availability of large-scale annotated corpora. The breadth of Transformer architectures directed specifically at its detection is evident in the diversity of approaches in the recent literature: the 46 studies considered in this systematic review comprise encoder-only architectures (n = 14), decoder-only models (n = 7), hybrid approaches (n = 14), and multimodal architectures (n = 11), reflecting the wide architectural landscape of this rapidly evolving field.

Since 2020, the field has matured through four distinct architectural paradigms, each conferring unique analytical capabilities. Encoder-only architectures dominate text-based screening using bidirectional contextual understanding. Decoder-only models allow for generative assessment and few-shot learning with minimal labeled data. Hybrid architectures combine Transformers with complementary components to address specific modeling gaps. Multimodal Transformers combine text, speech, facial expressions, and physiological signals using cross-modal attention for comprehensive evaluation of symptoms. Despite the rapid proliferation of studies (46 of which, published between 2020 and 2025, are reviewed here), systematic understanding of architectural trade-offs, performance determinants, and clinical deployment readiness remains fragmented.

This systematic review addresses three limitations in the existing literature. First, architectural superficiality: existing surveys catalog Transformer models without examining the self-attention dynamics, transfer learning strategies, or pre-training methodologies that mechanistically distinguish them. Second, inadequate coverage: detailed reviews focus disproportionately on early BERT variants, with insufficient attention to decoder-only models, hybrid designs, and multimodal fusion approaches. Third, translation gaps: clinical considerations—psychiatric validation, uncertainty quantification, demographic fairness, computational feasibility, and workflow integration—remain substantively unaddressed despite being prerequisites for real-world deployment.

To address these gaps, this systematic review presents the first comprehensive, mechanism-level, architecturally stratified analysis of Transformer-based depression detection, systematically analyzing 46 studies covering all 4 paradigms. The analysis investigates Transformer-specific mechanisms that drive performance advantages—including how self-attention captures depressive semantic patterns, how transfer learning enables clinical adaptation with minimal labeled samples, and how unified multimodal frameworks integrate heterogeneous signals. Comparative synthesis across application contexts identifies performance determinants, develops principles for architecture selection, and defines critical gaps constraining clinical implementation.

This paper is intended for mental health researchers, clinicians, and interdisciplinary teams working at the intersection of artificial intelligence, deep learning, and related fields. The remainder of the paper is structured as follows: Section 2 presents the limitations of existing reviews and the motivation for this work, Section 3 describes the review methodology following the PRISMA 2020 guidelines, Section 4 reviews Transformer models stratified by architectural paradigm, Section 5 presents a cross-paradigm comparative analysis and synthesizes challenges and future directions, and Section 6 concludes with key findings and clinical implications. For ease of navigation, the overall conceptual framework is shown in Figure 1.

2. Limitations of Existing Reviews and Motivation for the Present Study

While several reviews of AI-driven approaches for depression detection exist, their analyses of Transformer-based architectures leave substantial gaps. Table 1 summarizes ten representative review articles and highlights the critical gaps that motivate the present study. Existing reviews fall into three categories, each with limitations that constrain the understanding of Transformer-based depression detection methodologies.

Encyclopedic surveys prioritize breadth over architectural depth. Reference [18] reviewed 50 studies but devoted only 8% to Transformers, overlooking architectural variants, vision Transformers, and multimodal applications without mechanism-level analysis of self-attention and transfer learning. Reference [21] involved 401 studies with a focus on CNNs (133) and RNNs (159) with minimal Transformer analysis despite acknowledging their emergence. Reference [20] considered 86 studies but treated Transformers descriptively rather than analytically.

Transformer-focused reviews are limited in coverage and temporal scope. Reference [19] surveyed 16 studies but focused heavily on foundational BERT models (13/16) with a sole GPT-2 study, omitting contemporary LLMs entirely and relying exclusively on Web of Science with social media text. Reference [22] reviewed 34 LLM studies but recognized the predominant BERT reliance and noted that newer models were underrepresented, with the inclusion of management studies diluting the detection focus.

Methodologically constrained reviews lack necessary rigor. Reference [23] reviewed 399 studies across psychiatric disorders, with depression comprising 45%, but the heterogeneous scope prevented focused architectural analysis, while traditional ML dominated (59%), with Transformers representing only 17% of deep learning approaches. Reference [26] reviewed 95 studies on diverse mental health domains, with depression accounting for 35%, conflating architecturally distinct generative and discriminative models without taxonomic differentiation. Reference [25] reviewed only 14 studies with questionable database selection (ResearchGate and Google Scholar), while Reference [24] reviewed 39 studies without major technical databases and Reference [27] reviewed only 11 studies without a systematic search strategy.

These limitations converge on three interrelated deficits. Architectural superficiality characterizes reviews that enumerate models without examining attention mechanisms, pre-training strategies, or fine-tuning techniques. Performance ambiguity results from absent standardized architectural comparisons, depriving practitioners of evidence-based model selection guidance. Implementation considerations remain substantively unaddressed—data privacy, computational requirements, clinical integration, cross-linguistic validation, and explainability are prerequisites for deployment, yet receive cursory treatment.

Compared to previous surveys, this work offers three contributions. First, it introduces the first comprehensive, mechanism-level systematic taxonomy differentiating encoder-only, decoder-only, hybrid, and multimodal paradigms to enable architecture selection based on specific clinical requirements—a distinction not achieved by any of the ten reviews in Table 1, the most proximate of which [25] focuses on modality comparison rather than architectural mechanisms. Second, it provides mechanism-level analysis examining how self-attention heads differentially weight depressive linguistic markers, how domain-adaptive pre-training alters token-level representations to encode clinical semantics, and how transfer learning bridges the gap between general-purpose language understanding and disorder-specific detection. Third, it explicitly discusses barriers to clinical translation, including psychiatric validation requirements, uncertainty quantification, cross-cultural fairness, and computational deployment constraints.

3. Survey Methodology

This systematic review follows PRISMA 2020 guidelines [28] to ensure methodological rigor and reproducibility, as illustrated in Figure 2. The protocol has been registered on the Open Science Framework (OSF) under the digital object identifier 10.17605/OSF.IO/SYF9N (available at: https://doi.org/10.17605/OSF.IO/SYF9N; accessed on 1 May 2026). The registered protocol provides a permanent, time-stamped record of the eligibility criteria, search strategy, screening and selection procedures, data extraction items, quality assessment rubric, and synthesis approach, all of which were defined in advance and applied throughout the review. The completed PRISMA 2020 checklist and the PRISMA 2020 abstracts checklist [29] are available in the Supplementary File S1. A narrative synthesis approach was adopted rather than quantitative meta-analysis, given the substantial heterogeneity across the 46 included studies in datasets, evaluation metrics, and ground truth definitions, which precludes valid statistical pooling. This heterogeneity has an important implication for interpretation: numerical comparisons across studies (e.g., accuracy, F1, area under the receiver operating characteristic curve (AUC), root mean square error (RMSE), and concordance correlation coefficient (CCC)) reflect architectural trade-offs rather than equivalent task-level benchmarks and should not be treated as direct head-to-head performance comparisons. A systematic literature search was conducted across six databases (IEEE Xplore, Elsevier, Springer, MDPI, PubMed, and arXiv), covering publications from 2020 to 2025, with databases last searched in October 2025. 
The search strategy, documented in Table 2, combined key terms across four PICO concept groups—population/condition (P), intervention/AI model type (I), comparison/baseline methods (C), and outcome/performance metrics (O)—using Boolean operators, with modality-related terms (text, speech, audio, video, social media, EEG, and clinical interview) applied as supplementary filters.
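The way such a query is assembled can be sketched as follows. This is a minimal illustration, not the registered search string: the terms in the four PICO groups are placeholders invented for the example (the actual terms are those documented in Table 2), and the names `PICO_GROUPS`, `or_group`, and `build_query` are hypothetical. Only the modality filter terms are taken from the text above.

```python
# Placeholder PICO terms for illustration only; the real terms are in Table 2.
PICO_GROUPS = {
    "population": ["depression", "depressive disorder"],
    "intervention": ["transformer", "BERT", "large language model"],
    "comparison": ["baseline", "machine learning"],
    "outcome": ["accuracy", "F1"],
}

# Modality-related supplementary filter terms, as listed in the text.
MODALITY_TERMS = [
    "text", "speech", "audio", "video",
    "social media", "EEG", "clinical interview",
]


def or_group(terms):
    """Join synonyms with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(groups, modality_terms=None):
    """AND together one OR-group per concept, plus an optional modality filter."""
    parts = [or_group(terms) for terms in groups.values()]
    if modality_terms:
        parts.append(or_group(modality_terms))
    return " AND ".join(parts)


query = build_query(PICO_GROUPS, MODALITY_TERMS)
print(query)
```

Each database has its own field tags and syntax limits, so a string built this way would still need per-database adaptation.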

3.1. Inclusion and Exclusion Criteria

Studies were included if they (1) used Transformer-based architectures as the core methodology for depression detection, classification, or severity assessment; (2) conducted empirical evaluation using standardized datasets with quantitative metrics; (3) presented methodological descriptions of sufficient clarity for reproducibility assessment; (4) were published in English as peer-reviewed journal articles, conference proceedings, or—in the case of rapidly evolving AI subfields where definitive peer-reviewed publications may not yet be available—as high-quality preprint manuscripts (e.g., arXiv submissions) that had undergone substantive community scrutiny, demonstrated clear methodological rigor, and contributed findings not duplicated by available peer-reviewed works; (5) were published or made publicly available between 2020 and 2025.

Exclusion criteria encompassed (1) non-Transformer approaches or those employing Transformers solely as comparison baselines (excluding six studies retained for comparative analysis), (2) absence of depression-specific evaluation, (3) insufficient methodological detail, (4) non-English publications, (5) duplicate results, (6) exclusive focus on non-depression conditions, and (7) review articles (utilized for reference screening only).

The decision to include high-quality preprints was motivated by the rapid pace of Transformer research, where key architectural innovations frequently appear on preprint servers prior to formal peer-reviewed publication. Of the 46 included studies, 5 were sourced from preprint repositories. All five included preprints were individually evaluated against the same six-dimension quality rubric applied to peer-reviewed studies, and only those achieving the minimum quality threshold score (≥ 3) were retained.

3.2. Screening and Selection Process

Two authors (S.Z. and M.M.) independently screened all identified records against the inclusion and exclusion criteria specified in Section 3.1, with full-text review also performed independently. Disagreements at either stage were resolved through discussion with the third author (L.Q.) acting as arbitrator, following the same procedure documented in the header of Supplementary Table S1 for quality assessment. Data extraction was subsequently performed collaboratively by the three authors (S.Z., M.M., and L.Q.), with fields corresponding to the columns in Tables 3–9 and Supplementary Table S1, and was validated by M.M. and L.Q. No automation tools were used in screening or data extraction. The systematic database search and manual screening identified 305 potentially relevant studies. Of these, 25 duplicates were removed, 118 were eliminated as non-Transformer approaches or studies using Transformers solely as comparison models, and 80 were excluded for insufficient methodology or evaluation detail. Full-text review led to the further exclusion of 36 studies focused exclusively on conditions other than depression. Ultimately, 46 studies were selected for final inclusion. These were classified by architectural strategy as follows: 14 encoder-only studies, 7 decoder-only studies, 14 hybrid architecture studies, and 11 multimodal framework studies.
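The selection counts reported above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
# Counts as reported in the screening and selection paragraph.
identified = 305
exclusions = [
    ("duplicates", 25),
    ("non-Transformer, or Transformer used only as comparison model", 118),
    ("insufficient methodology or evaluation detail", 80),
    ("exclusive focus on conditions other than depression", 36),
]

remaining = identified
for reason, count in exclusions:
    remaining -= count
    print(f"excluded {count:>3} ({reason}); remaining: {remaining}")

# Classification of the included studies by architectural strategy.
by_paradigm = {
    "encoder-only": 14,
    "decoder-only": 7,
    "hybrid": 14,
    "multimodal": 11,
}

# The exclusion flow and the paradigm breakdown should agree.
assert remaining == sum(by_paradigm.values())
```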

3.3. Integration of Review Literature and Comparison with Non-Transformer Models

Beyond the 46 primary studies, 10 relevant review articles on deep learning and Transformer-based depression detection models were analyzed for contextual background. Additionally, six studies employing non-Transformer approaches were incorporated as comparative baselines. Consistent evidence shows that Transformer models, particularly those exploiting self-attention mechanisms, outperform non-Transformer counterparts in accuracy, robustness, and capacity for handling complex multimodal data.

3.4. Study Quality and Bias Evaluation

The quality of included studies was independently assessed by two authors using a structured six-dimension rubric. Disagreements between the two authors were flagged and resolved through discussion with the third author, who served as arbitrator. No study required more than one round of arbitration. Inter-rater agreement was high prior to arbitration, with a Cohen’s κ of 0.82, indicating strong agreement. Each dimension was scored on a binary scale (0 = criterion not met; 1 = criterion met), yielding a maximum score of 6. All included studies achieved scores of 4 or above, indicating adequate methodological quality. Study-level quality scores and bias assessments are provided in Supplementary Table S1. The six dimensions and their operationalizations are as follows:

Methodological rigor (0–1): Clarity of model architecture, transparency of training configurations, and adequacy of preprocessing documentation. Studies scoring 1 provided sufficient architectural detail to permit independent replication.

Experimental validity (0–1): Appropriateness of dataset characteristics, evaluation protocols, and statistical reporting standards. Studies employing standard benchmark datasets (e.g., DAIC-WOZ, Reddit, and Twitter corpora) with established train/test splits scored 1, while those with unreported or non-standard splits scored 0.

Reproducibility (0–1): Accessibility of code, model checkpoints, and data-access information. Studies that publicly released code or provided sufficiently detailed pseudocode scored 1.

Comparative soundness (0–1): Use of appropriate comparison models and capacity-controlled experimental setups, including inclusion of ablation studies where architecturally warranted.

Clinical relevance (0–1): Correspondence of evaluation settings to actual clinical deployment scenarios and acknowledgment of deployment limitations, such as class imbalance, demographic skew, or inference latency constraints.

Bias susceptibility (0–1): Explicit reporting of dataset imbalance and applied mitigation strategies, linguistic and geographic representativeness, and platform-specific data characteristics (Twitter, Reddit, or clinical corpora) with discussion of implications for external validity.
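The scoring scheme and the agreement statistic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the example ratings are hypothetical and are not data from the review; Cohen's κ is computed with the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance.

```python
DIMENSIONS = (
    "methodological_rigor",
    "experimental_validity",
    "reproducibility",
    "comparative_soundness",
    "clinical_relevance",
    "bias_susceptibility",
)


def rubric_score(ratings):
    """Sum six binary ratings (0 = not met, 1 = met); the maximum is 6."""
    if set(ratings) != set(DIMENSIONS):
        raise ValueError("ratings must cover exactly the six dimensions")
    if any(value not in (0, 1) for value in ratings.values()):
        raise ValueError("each dimension is scored 0 or 1")
    return sum(ratings.values())


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_e = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical study that meets every criterion except reproducibility.
ratings = {dimension: 1 for dimension in DIMENSIONS}
ratings["reproducibility"] = 0
print(rubric_score(ratings))

# Hypothetical binary judgements from two raters on ten items.
rater_a = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
rater_b = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

In practice κ would be computed over the raters' pre-arbitration judgements, which is how the text describes the reported value.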

To ensure systematic bias evaluation, studies were additionally reviewed across three recurrent bias dimensions. Dataset imbalance was recorded by documenting class distributions and noting whether corrective strategies (e.g., Synthetic Minority Oversampling Technique (SMOTE), focal loss, weighted sampling, or stratified procedures) were applied. Studies without corrective strategies were flagged as being at higher risk of majority-class inflation. Cultural and linguistic bias was assessed based on the geographic and linguistic composition of datasets, with heavy concentration in English-language, Western social media sources noted as a significant limitation for cross-cultural generalizability. Platform-specific bias was evaluated by examining source characteristics (Twitter, Reddit, or structured clinical interviews), with documented implications for external validity and model transferability given the substantial differences in discourse style and self-disclosure behavior across platforms. Regarding small-sample and pilot studies, 12 of the 46 included studies reported training set sizes below 5000 samples, and 4 employed fewer than 1000 samples. These studies were retained rather than excluded on sample size grounds because (1) small-sample performance in limited-label settings is itself a research question of clinical relevance, (2) exclusion of small studies would introduce selection bias toward resource-rich settings, and (3) all retained studies passed the minimum quality threshold on the six-dimension rubric. The risk associated with small-sample studies is explicitly flagged in Section 3.4 bias assessments and interpreted with appropriate caution in Section 4.

4. Transformer-Based Models for Depression Detection

This section discusses four Transformer architectural paradigms for depression detection, each exploiting distinct properties of the self-attention mechanism to solve different aspects of the detection problem. Encoder-only models use bidirectional attention to encode holistic contextual representations, excelling at text classification but encountering limitations in clinical interview contexts. Decoder-only models use causal attention for generative assessment and few-shot learning from minimal labeled data, though at considerable computational expense. Hybrid architectures combine Transformers with complementary neural components, namely recurrent neural networks (RNNs) for sequential dynamics and convolutional neural networks (CNNs) for local feature extraction, to address specific modeling gaps. Multimodal Transformers extend cross-modal attention to fuse heterogeneous signal modalities at the cost of increased equipment requirements and deployment complexity. The taxonomy progresses from foundational encoder-only models through generative and hybrid variants to multimodal systems, reflecting escalating capability accompanied by increasing complexity.

4.1. Encoder-Only Transformers: Bidirectional Context Understanding

4.1.1. Mechanistic Basis of Self-Attention in Depression Detection

Encoder-only Transformer architectures are the dominant paradigm in detecting depression from text. Their effectiveness stems from the bidirectional self-attention mechanism inherent in masked language modeling (MLM): during pre-training, each token attends to all other tokens in the sequence simultaneously, constructing contextual representations that encode both local syntactic structure and long-range semantic dependencies. This bidirectional encoding is mechanistically appropriate for depression detection given that depressive language shows distributional patterns across multiple linguistic levels simultaneously. At the lexical level, self-attention heads can learn to weight co-occurrences of depression-indicative tokens (such as first-person singular pronouns, absolutist quantifiers, and negative affect terms) even when these markers are distributed across non-adjacent positions. At the discourse level, attention across sentence boundaries enables detection of coherence disruptions, topic narrowing, and rumination patterns (repeated return to negative themes) that characterize depressive writing styles. These capabilities extend beyond those of bag-of-words or sequential models because the self-attention mechanism simultaneously encodes both individual marker presence and their relational co-occurrence structure.
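The difference between the bidirectional attention described here and the causal attention used by decoder-only models can be illustrated with a small sketch of scaled dot-product attention weights for a single head. It is a simplification under stated assumptions: the token vectors are made-up two-dimensional values, the same vectors serve as queries and keys (a trained model would apply learned projections), and `attention_weights` is a hypothetical helper, not code from any reviewed study.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys, causal=False):
    """Scaled dot-product attention weights for one head.

    causal=False: every token attends to every position (encoder-style).
    causal=True:  token i attends only to positions <= i (decoder-style).
    """
    dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # mask out future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(dim))
        weights.append(softmax(scores))
    return weights


# Made-up vectors for the tokens "I", "always", "feel", "worthless".
tokens = ["I", "always", "feel", "worthless"]
vectors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

bidirectional = attention_weights(vectors, vectors)
causal = attention_weights(vectors, vectors, causal=True)

# With bidirectional attention the first token can weight the last one;
# under the causal mask that weight is exactly zero.
print(bidirectional[0][3] > 0.0, causal[0][3] == 0.0)
```

In the bidirectional case the pronoun in first position can place weight on a negative-affect term at the end of the sequence, which is the non-adjacent co-occurrence pattern the paragraph above refers to.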

Domain-specific pre-training amplifies these capabilities by reshaping the model’s internal representations to encode clinical semantics. When MentalBERT is pre-trained on mental health Reddit corpora, its token embeddings shift such that clinically significant terms cluster more closely in embedding space relative to general-purpose models. This representational restructuring enables greater discrimination between depressive and non-depressive uses of ambiguous terms.

4.1.2. Performance Analysis of Encoder-Only Models

MentalBERT, a domain-adapted model, achieves F1-score gains of 6–8% over general-purpose BERT/RoBERTa base models across severity classification tasks [30,31], while DepRoBERTa improves recall regardless of its absolute scores [32]. These gains are highest in contexts that require interpreting subtle linguistic patterns (e.g., metaphorical expressions and disrupted discourse coherence), rather than in simple binary detection. However, specialized models incur 2–5 times higher computational costs during pre-training and often exhibit limited cross-domain generalization. Consequently, while high-volume screening may favor general-purpose models, investment in domain-adapted architectures is justified for specialized clinical evaluation. Beyond pre-training, architectural comparisons show that RoBERTa matches BERT's 98% accuracy on balanced datasets while demonstrating greater robustness to distribution shifts [33,34], and that compressed variants such as DistilBERT and ALBERT achieve comparable performance (F1: 81%) with far fewer parameters [35]. Severity-weighted scoring approaches on longitudinal eRisk data further demonstrate RoBERTa’s capacity for calibrated risk assessment [36]. Ensemble methods combining BERT, BERTweet, and ALBERT achieve a 13.5% improvement in AUC through diverse tokenization [37], and RoBERTa with dense layers achieves 90% accuracy, challenging the assumption that larger parameter scales invariably outperform targeted architectural optimizations [38].

Preprocessing strategies vary greatly across studies, indicating a lack of consensus on standardization. Integration of sentiment analysis tools like VADER offers marginal gains (1–2%) in binary classification but does not produce consistent improvement in severity stratification [39], while lexicon-based approaches are useful only for document compression, not for the analysis of high-density social media posts [35]. Addressing class imbalance is a critical concern: SMOTE (applied in 43% of studies) and focal loss increase minority class recall but risk generating synthetic artifacts that may amplify annotation noise (noted in 6/14 studies).
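To make the focal loss mechanism concrete, the sketch below computes the binary focal loss for a single prediction, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). It is an illustration only: the function name is hypothetical, and the defaults γ = 2 and α = 0.25 are commonly used values from the focal loss literature, not settings reported by the reviewed studies.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example.

    p: predicted probability of the positive (e.g., depressed) class.
    y: true label, 0 or 1.
    gamma: focusing parameter; larger values down-weight easy examples more.
    alpha: weight on the positive class (1 - alpha on the negative class).
    """
    p = min(max(p, 1e-7), 1 - 1e-7)  # avoid log(0)
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


# A confidently correct positive versus a poorly predicted positive.
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(easy < hard)

# With gamma = 0 the focusing term vanishes and the loss reduces to
# alpha-weighted cross-entropy.
weighted_ce = binary_focal_loss(0.8, 1, gamma=0.0, alpha=0.5)
print(abs(weighted_ce - 0.5 * -math.log(0.8)) < 1e-12)
```

Unlike SMOTE, this approach reweights existing examples rather than synthesizing new ones, so it does not create synthetic samples, although it can still concentrate training on mislabeled hard cases.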

4.1.3. Synthesis and Tier Analysis

Systematic comparison reveals a three-tier performance hierarchy governed by the characteristics of the data source. In Tier 1, binary social media classification achieves accuracies of 95–99%, with marginal differentiation between BERT and optimized variants [33,34,39,40,41,42]. This performance reflects the high signal density of informal online discourse, where diminished conversational constraints facilitate explicit symptom disclosure, a key point often overlooked when assessing model capability. From a self-attention standpoint, the bidirectional encoding mechanism can leverage the co-occurrence structure of explicitly negative language in social media posts, offering a fundamentally simpler distributional pattern than the guarded language of clinical interviews. Tier 2 severity classification highlights the requirement of domain adaptation: MentalBERT achieves an F1 of 97.3% [30], while ensemble approaches attain AUC scores of 98.48% for severe depression detection [37]. However, the modest performance of DepRoBERTa (F1: 58.3%) shows that corpus quality is an important determinant of specialization benefits [32].

The limitations of encoder-only models are most evident in Tier 3 clinical interview studies. Despite state-of-the-art preprocessing, BERT-based approaches yield only moderate performance on DAIC-WOZ (F1: 81%), highlighting a persistent 12–17% performance gap relative to social media analysis [35,43]. This performance ceiling reflects a fundamental mismatch between the attention mechanism’s reliance on explicit lexical cues and the nature of clinical interviews, where socially desirable responding and structured formats suppress the overt linguistic markers upon which bidirectional self-attention depends. Future performance gains in this tier may, therefore, require multimodal integration rather than purely linguistic refinement. Critically, 85% of studies across all tiers rely on self-reported labels rather than gold-standard psychiatric diagnoses (Structured Clinical Interview for DSM Disorders (SCID) and Mini International Neuropsychiatric Interview (MINI)), raising concerns that high reported accuracies may reflect consistency with subjective self-assessments rather than genuine clinical validity.

It is important to note that the evidence base underlying this review draws heavily from social media text corpora and unipolar depression benchmarks. Much of the high-accuracy performance reported in Tier 1 and Tier 2 studies is extrapolated from contexts where patient-generated text was produced in unstructured, self-disclosure settings. Clinical interview-specific data—where language is regulated by structured formats and social desirability—yields substantially lower performance. This distinction between self-reported social media evidence and clinically validated SMI assessment contexts must be clearly acknowledged: models demonstrating 95–99% accuracy on social media datasets should not be assumed to transfer directly to formal psychiatric evaluation of SMI populations, and future work must systematically develop and validate models in authentic clinical contexts with gold-standard diagnostic labels. As summarized in Table 3, domain-specific pre-training is the most consistent performance determinant across the 14 reviewed studies.

Table 3. Systematic cross-study comparison of encoder-only Transformers for depression detection.

Tier	Model Architecture	Datasets	Task	Metrics	Preprocessing	Limitations/Future Directions
Tier 1: Binary Classification (Social Media)	MentalBERT [31]	MentalHelp	Binary	F1: 91%	Pseudo-labeling	Multilingual, multimodal approaches
	RoBERTa-Dense [33]	Twitter	Binary	Acc: 98%	Hybrid dense layer	Limited to binary; extend to severity classification
	BERT/RoBERTa [34]	Reddit	Binary	Acc: 98%	Dynamic masking	Explore more diverse datasets and languages
	BERT [39]	Twitter	Binary	Acc: 99%	VADER + SMOTE	Extend to multilingual; incorporate multimodal data
	BERT [40]	Twitter	Binary	Acc: 97%	Emotion classification	Support multiple languages; expand dataset diversity
	BERT [41]	Reddit	Binary	Acc: 95.5%	None	Comprehensive ethical AI-based mental health support remains challenging
	BERT [42]	Reddit	Binary	Acc: 94.9%	None	Class imbalance; long-context handling; multimodal integration
Tier 2: Severity Classification (Social Media)	MentalBERT [30]	Reddit	4-Class	F1: 97.3%	None	Multimodal integration; noisy data handling
	DepRoBERTa [32]	Reddit	3-Class	F1: 58.3%	None	Limited corpus diversity; expand to larger depression-focused corpora
	RoBERTa-Weighted [36]	eRisk 2020, 2021	Scoring	Best ADODL/DCHR	Severity weighting	Assumes equal post severity; improve temporal sensitivity
	BERT/BERTweet/ALBERT [37]	DEPTWEET	4-Class	AUC: 98.48%	Local and global explainability	Expand beyond single platform; address severe class underestimation
	RoBERTa-FT [38]	Twitter	3-Class	Acc: 90%	Dense + dropout	Annotations not clinically verified; psychiatrist validation needed
Tier 3: Clinical Interview Data	BERT + KeyBERT + Focal Loss [35]	DAIC-WOZ	Binary	F1: 81%	KeyBERT + Focal Loss	Class imbalance; long-context processing; multimodal integration
	BERT [43]	DAIC-WOZ	Binary	F1: 76%, AUC: 82%	None	Enhance robustness; explore multimodal approaches

4.2. Decoder-Only Transformers: Generative Capabilities for Enhanced Assessment

4.2.1. Mechanistic Basis: Causal Attention and In-Context Learning

Decoder-only Transformers mark a shift in approach to depression detection, exploiting autoregressive generation and in-context learning capabilities absent from encoder-only architectures. Unlike bidirectional models optimized solely for classification, decoder-only LLMs use causal self-attention: each token can attend only to itself and to preceding tokens, enabling sequential reasoning through token-by-token generation. This architectural distinction matters for depression detection in three ways. First, autoregressive generation allows for explanatory assessment: rather than producing a binary label, the model generates interpretive symptom narratives that trace the reasoning from textual evidence to diagnostic conclusion, improving clinical transparency. Second, in-context learning lets these models adapt to new tasks through prompt engineering alone: presenting a few labeled examples within the input context enables the model to implicitly learn the mapping from text to depression indicators without parameter updates, a capability enabled by massive pre-training scales with no parallel in encoder-only architectures. Third, the generative framework naturally accommodates multimodal inputs: speech transcriptions, structured clinical records, and text can be serialized into a unified token sequence for joint reasoning.

However, the causal attention constraint means decoder-only models cannot simultaneously utilize bidirectional context for each token, potentially limiting their ability to capture the full co-occurrence structure of depressive markers within a text. This theoretical limitation is partially offset by the sheer scale of pre-training, which induces rich distributional knowledge, but it still emerges empirically as reduced performance on tasks involving fine-grained bidirectional discrimination.
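
The difference between causal and bidirectional self-attention can be made concrete with a small numerical sketch. The example below is illustrative only: the function name `attention_weights` and the random toy scores are assumptions, not taken from any reviewed study. It shows that a causal mask zeroes out attention to later positions, whereas the unmasked (bidirectional) variant lets every token attend to every other token.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over query-key scores, optionally with a causal mask."""
    n = scores.shape[0]
    if causal:
        # Position i may attend only to positions j <= i (itself and the past).
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))  # toy scores for a 4-token sequence (hypothetical)

bidirectional = attention_weights(scores, causal=False)  # encoder-style
causal = attention_weights(scores, causal=True)          # decoder-style

print(np.round(bidirectional, 2))
print(np.round(causal, 2))
```

In the causal matrix the upper triangle is exactly zero, so the first token can attend only to itself; this is the constraint that prevents a decoder-only model from using right-hand context when encoding a given token.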

4.2.2. Performance Analysis of Decoder-Only Models

In text-based depression detection, Bengali-specialized DepGPT demonstrates strong few-shot detection performance (F1: 98%), surpassing GPT-4 (94%) and encoder-only comparators like SahajBERT (87%) [44]. However, this advantage does not universally transfer to formal clinical environments. GPT-3.5 Turbo underperforms in clinical interview analysis (F1: 78%) relative to the more specialized Distil-RoBERTa (F1: 82%) [45], consistent with the mechanistic expectation that causal attention is less effective than bidirectional encoding for fine-grained clinical text discrimination. Architectural hybridity is a partial remedy: LLaMA, Mistral, and Phi models augmented with classification layers yield a best F1 of 85% (Phi-3 mini; Table 4) on the Dreaddit dataset, surpassing BERT-based classifiers by 8–12% [46]. Zero-shot approaches like GPT-4o combined with the RISEN framework show promise with 75.9% accuracy but exhibit systematic inconsistencies, especially in detecting nuanced symptoms like anhedonia compared to more prominent affective markers [47].

The greatest empirical support for decoder-only models comes from multimodal fusion and structured data processing, where their generative architecture facilitates integration of heterogeneous data. Systems using Whisper transcription with GPT-2 embeddings achieve an F1 of 82.6% on DAIC-WOZ, outperforming text-only encoders by 9–12% through effective alignment of acoustic prosody with semantic content [48], while Whisper-to-LLM pipelines using GPT-4 and LLaMA-3 achieve competitive PHQ-8 prediction (RMSE: 3.975, CCC: 0.781) on E-DAIC [49]. In structured data processing, MDD-LLM-70B demonstrates considerable robustness: trained on over 270,000 UK Biobank samples, it achieves an AUC of 89.2% [50] and loses only 6% of its performance when 60% of feature data is missing, in sharp contrast to the 35% degradation observed in non-Transformer models. This resilience probably stems from the model’s ability to exploit contextual relationships among the remaining features via in-context reasoning. As described in Table 4, performance depends substantially on adaptation strategy and domain context.

Table 4. Performance and limitations of decoder-only Transformers in depression detection.

Domain	Model Architecture	Dataset	Task	Performance	Preprocessing	Limitations/Future Directions
Zero/Few-Shot Learning	DepGPT [44]	Bengali Social Media	Binary	F1: 98% (few-shot), F1: 87% (zero-shot)	Domain-specific fine-tuning	Language restricted (Bengali); self-reported labels; lacks real-world testing
Comparative	GPT-3.5 Turbo vs. Distil-RoBERTa [45]	DAIC-WOZ	Binary (PHQ-4)	F1: 78% (GPT-3.5), 82% (RoBERTa)	Fine-tuning + synthetic data	No clinical validation; training bias; high API costs
Classification	LLaMA/Mistral/Phi + classification layers [46]	Dreaddit, SAD, CAMS, IRF	Multiclass	Best F1: 85% (Phi-3 mini)	Added classification layers	Requires fine-tuning; scalability; class imbalance
Prompt Engineering	GPT-4o + RISEN [47]	DAIC-WOZ	Symptom (PHQ-8)	F1: 74%, Acc: 75.9%	Zero-shot structured prompting	Low interpretability; high API costs; symptom-specific variance
Foundation Model	Whisper/MMS + GPT-2 [48]	DAIC-WOZ, Indic-Bengali	Binary	F1: 82.6% (DAIC-WOZ), 75.3% (Indic-Bengali)	LoRA + prompt engineering	Limited to two languages; real-world testing required
LLM-based	Whisper + GPT-4/GPT-3.5/LLaMA-3 8B [49]	E-DAIC, DAIC-WOZ	PHQ-8 regression + classification	Best: RMSE 3.975, CCC 0.781	Audio transcription–text pipeline	Multimodal fusion improvements needed; GPT-4 API costs
Tabular-to-Text LLM	MDD-LLM: LLaMA-3.1 (8B and 70B) [50]	UK Biobank (274,348)	MDD diagnosis	Acc: 83.8%, F1: 81.8%, AUC: 89.2% (70B)	LoRA/QLoRA; tabular to NL transformation	Hallucination risk; requires medical-domain LLMs; lacks prospective validation
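
For readers less familiar with the regression metrics reported for PHQ-8 prediction in Table 4, the following sketch computes the root mean squared error (RMSE) and the concordance correlation coefficient (CCC) from their standard definitions. The function names and the toy score vectors are hypothetical and are not data from any reviewed study. The example illustrates that, unlike Pearson correlation, CCC penalizes a constant offset between predicted and reference scores.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def ccc(y_true, y_pred):
    """Concordance correlation coefficient (population variances)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return float(2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2))

# Hypothetical PHQ-8-style scores: predictions are perfectly correlated with
# the reference but shifted upward by one point.
reference = [0, 1, 2, 3]
predicted = [1, 2, 3, 4]

print(rmse(reference, predicted))  # 1.0
print(ccc(reference, predicted))   # about 0.714, although Pearson r would be 1
```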

4.2.3. Fine-Tuning Strategy Analysis

Fine-tuning strategy selection strongly affects model performance and deployment feasibility. Full fine-tuning attains peak task-specific performance but requires extensive computational resources and risks catastrophic forgetting. Parameter-efficient methods like Low-Rank Adaptation (LoRA) update only 0.1–1% of parameters and retain 94–97% of full-tuning performance with 40% less training time and 60% less memory. However, LoRA proves less effective for tasks requiring major domain shifts. Prompt-based adaptation (zero-shot and few-shot) eliminates parameter updates entirely but shows 10–15% variation in performance depending on prompt design. For clinical deployment, a staged approach is recommended: prompt-based methods for initial screening, followed by LoRA fine-tuned models for cases requiring higher precision.
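
The parameter savings of LoRA follow directly from its low-rank form: a frozen weight matrix W is augmented by a trainable product BA of rank r. The sketch below is a minimal NumPy illustration under assumed, hypothetical dimensions (a 1024 × 1024 projection, rank 4, scaling factor 8). It is not the configuration of any reviewed study, and whole-model trainable fractions also depend on which layers receive adapters.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Compute x W^T + (alpha / r) * x A^T B^T; W is frozen, A and B are trainable."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r, alpha = 1024, 1024, 4, 8.0  # hypothetical sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-initialized
x = rng.normal(size=(2, d_in))

trainable = A.size + B.size
fraction = trainable / W.size
print(f"trainable parameters for this matrix: {trainable} ({fraction:.2%})")
```

With B initialized to zero, the adapted layer initially reproduces the frozen layer exactly, so training starts from the pre-trained behavior. For this single matrix the trainable share is under 1%.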

4.2.4. Synthesis of Decoder-Only Capabilities

As summarized in Table 5, systematic synthesis reveals that decoder-only Transformers occupy a distinct functional niche. The capacity of models like DepGPT to operate with 10–50 times less labeled data is a substantial advantage for low-resource languages [44], but this benefit is mostly seen in informal social media contexts rather than rigorous clinical settings [45]. In multimodal settings, speech–text fusion through Whisper transcription with LLM embeddings achieves an F1 of 82.6%, yielding 9–12% gains over text-only encoders [48,49], which suggests that the generative architecture enables effective alignment of acoustic prosody with semantic content. In structured data processing, MDD-LLM-70B achieves an AUC of 89.2% on tabular clinical data [50] (a capability entirely inaccessible to encoder-only architectures) while maintaining robustness with only 6% degradation under 60% missing data. However, critical impediments remain in all three domains: 85% of reviewed studies employ self-reported labels or screening scores rather than psychiatrist-confirmed diagnoses, and the potential for plausible yet erroneous hallucinated clinical inferences remains unresolved. This validation deficit, combined with extreme computational overhead, suggests that despite their unique potential for data-scarce, multimodal, and structured data scenarios, the incorporation of decoder-only models into routine clinical depression assessment is premature in the absence of prospective trials and improved uncertainty calibration.

Table 5. Decoder-only Transformers’ performance by domain.

Application Domain	Best Performance	Key Advantage	Primary Limitation
Text-based Analysis	F1: 98% (few-shot, social media)	Data efficiency; zero/few-shot learning	Underperforms encoders on clinical interviews (F1: 78% vs. 82%)
Multimodal Fusion	F1: 82.6% (speech + text)	9–12% gain over text-only encoders	High computational cost; limited language coverage
Structured Data	AUC: 89.2% (tabular)	Novel capability inaccessible to encoders; robust to missing data	Extreme compute requirements; hallucination risk

4.3. Hybrid Transformers: Synergistic Integration of Neural Components

4.3.1. Mechanistic Rationale for Hybridization

Hybrid Transformer architectures augment the global contextual modeling of self-attention with complementary neural components that address specific analytical limitations. The core rationale is that while self-attention excels at encoding long-range linguistic dependencies, it treats all token pairs with uniform computational priority—potentially underweighting diagnostically important local patterns (e.g., n-grams, specific syntactic constructions, and prosodic segments) or temporal dynamics across utterances.

4.3.2. Transformer–RNN Integration

Combinations of BERT/RoBERTa with LSTM/BiLSTM/GRU yield accuracy of 86–98.76% on social media text, corresponding to improvements of 2–11.8% over standard Transformers [51,52,53,54,55,56,57]. On clinical interview data, the advanced DLCDME architecture attains 96% precision and 95% F1, resulting in 8–10% gains over baselines [58]. The most effective integration places bidirectional LSTM layers downstream of Transformer encoders, where they capture sequential temporal dependencies within the contextualized token embeddings produced by self-attention. BERT-BiLSTM augmented by emotion-aware modules (emoji normalization, slang dictionary, and emotion scoring) achieves 93.8% F1 and 89.6% AUC [57], surpassing vanilla BERT by 14.4%. This marked improvement is attributable in part to the emotion-aware preprocessing, which enlarges the feature space available to the BiLSTM, and not solely to the recurrent architecture. Notably, these gains are highly context-dependent, diminishing to 1–2% on clinical interview data [56], consistent with the hypothesis that temporal modeling adds the greatest value when explicit symptom progression markers are present in the data. Computational costs are substantial: BERT-BiLSTM requires 2.5–3 times more floating-point operations (FLOPs) than base BERT.

4.3.3. Transformer–CNN Integration

Pairing CNN layers with Transformers for local feature extraction achieves accuracy of 92–98.49% with a consistently better cost–benefit ratio than RNN integration [55,59,60,61,62]. Controlled comparisons show that BERT-CNN outperforms BERT-BiLSTM by 2% and base BERT by 9% on identical datasets [55]. Ablation studies underscore the importance of CNN components: removing convolutional layers results in an 8% F1 decrease, with degraded detection of subtle depression cues, such as metaphor use and hedging language [61]. The mechanistic interpretation is that CNN filters, applied to Transformer embeddings, extract diagnostic n-gram patterns at multiple granularities; global attention, which treats all token pairs with equal computational priority, may not preferentially activate such patterns. In multimodal contexts, parallel architectures such as TCC (Transformer–CNN–CNN) achieve 8–12% gains over traditional baselines at 1.5–2 times the computational cost [62]. As summarized in Table 6, this review identified three hybrid integration strategies: Transformer–RNN hybrids, Transformer–CNN hybrids, and other hybrid configurations.

Table 6. Performance, mechanisms, and limitations of hybrid Transformers by component type.

Hybrid Type	Model Architecture	Dataset	Task	Performance	Baseline	Gain	Limitations/Future Directions
Domain-Specific	LSTM-MentalBERT [51]	Twitter (CLPsych 2021)	Binary	Acc: 86%	BERT: 82%	+4%	Small dataset (3.2 K); modest gains
BiLSTM + Optimization	BiLSTM + AdaBoost/BERT-CNN-BiLSTM [52]	Kaggle	Binary	Acc: 94%/95.6%	BERT: ~92%	+3.6% hybrid	High training time (8–12 h); needs larger datasets
Multimodal Temporal	RoBERTa-GRU + Multimodal [53]	Reddit, Twitter	Binary	90.18%, 89.92%	RoBERTa: ~88%	+2–3%	High complexity; marginal gains
Triple Hybrid	BERT-CNN-LSTM [54]	GitHub/Kaggle	Binary	Acc: 98.76%	BERT: ~97%	+1.8%	High training cost; ad hoc stacking; no ablation
Comparative	BERT-BiLSTM/BERT-CNN [55]	Reddit, Tweets	Binary	F1: 94.1%/96.1%	BERT: ~89%	+9% (CNN)	CNN > BiLSTM; 3 times cost
Clinical Interview	Transformer-BiLSTM [56]	DAIC-WOZ, EATD	Binary	Precision: 74% (DAIC), 93% (EATD)	Transformer: ~72%, 91%	+1–2%	Minimal clinical gain
Emotion-Aware	BERT-BiLSTM + Emotion [57]	Kaggle	Binary	F1: 93.8%, AUC: 89.6%	BERT: ~82%	+14.4%	2.5–3 times inference cost
Advanced Encoding	DLCDME (ClinicalBERT + Transformer + LSTM) [58]	DAIC-WOZ	Binary	Precision: 96%, F1: 95%	Baselines: ~85%	+8–10%	High complexity; deployment untested
Domain-Specialized	MentalBERT + MelBERT + CNN [59]	Reddit	Binary	Acc: 92%, F1: 92%	BERT/RoBERTa: ~85–87%	+5–7%	Dual Transformers; needs curated corpora
Simple Hybrid	BERT + MLP [60]	Twitter	Binary	Acc: 98.49%	BERT: ~98%	+0.5%	Minimal gain; MLP adds little
Deep Linguistic	RoBERTa-CNN (DLAD) [61]	Reddit Depression	Binary	Acc: 96%, Recall: +57.9%	Baselines: ~84%	+12.3%	Binary only; lacks demographic data
Parallel Audio	TCC (Transformer–CNN–CNN) [62]	DAIC-WOZ, MODMA	Binary	F1: 93.6%/96.7%	Single modal: ~84%, ~87%	+8–12%	Parallel CNNs; 3 times inference time
Neural-Symbolic	DORIS (LLM + GBT) [63]	SWDD, Twitter	Binary	F1: 77.5%, AUPRC: 81.5%	BERT: ~72%	+3–5%	LLM annotation costs; scalability limited
Vision Transformer	DNet (FEM + ViT) [64]	AVEC2014, CZ2023	Severity	MAE: 6.09, RMSE: 7.85	CNN: ~6.5, ~8.2	Competitive	Face-only; needs multimodal validation
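
The Transformer–CNN pattern discussed in Section 4.3.3, in which convolutional filters of several widths are applied to contextual token embeddings, can be sketched in a few lines of NumPy. This is a generic illustration rather than the architecture of any study in Table 6: the random `token_embeddings` stand in for the output of a Transformer encoder, and the filter counts and widths (2, 3, and 4 tokens) are arbitrary assumptions.

```python
import numpy as np

def conv1d_max_pool(embeddings, filters):
    """Valid 1D convolution over the token axis, ReLU, then max-over-time pooling.

    embeddings: (seq_len, hidden); filters: (n_filters, width, hidden).
    Returns one pooled activation per filter, shape (n_filters,).
    """
    seq_len = embeddings.shape[0]
    width = filters.shape[1]
    # Sliding windows of `width` consecutive tokens: (n_windows, width, hidden).
    windows = np.stack([embeddings[i:i + width] for i in range(seq_len - width + 1)])
    activations = np.einsum("twh,fwh->tf", windows, filters)
    return np.maximum(activations, 0.0).max(axis=0)

def ngram_features(embeddings, filter_banks):
    """Concatenate pooled features from filter banks of different n-gram widths."""
    return np.concatenate([conv1d_max_pool(embeddings, f) for f in filter_banks])

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 16))  # stand-in for encoder output
filter_banks = [rng.normal(size=(4, w, 16)) for w in (2, 3, 4)]

features = ngram_features(token_embeddings, filter_banks)
print(features.shape)  # (12,): 4 filters for each of 3 widths
```

The pooled vector would then feed a classification layer. Max-over-time pooling keeps only the strongest local match per filter, which is what makes this component sensitive to short, localized patterns regardless of where they occur in the text.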

4.3.4. Synthesis of Hybrid Architectures

CNN-based hybrids provide better cost–benefit ratios than RNN alternatives, with median gains of 8% versus 3.6% at substantially lower computational overhead. This pattern implies that for cross-sectional data, extracting localized sentiment patterns is more diagnostically informative than modeling sequential progression. Beyond traditional hybrids, neural-symbolic approaches such as DORIS combine LLM-based annotation with symbolic reasoning, achieving an F1 of 77.5% and area under the precision–recall curve (AUPRC) of 81.5% on social media datasets, though LLM annotation costs limit scalability [63]. Vision Transformer hybrids also show promise: DNet obtains competitive severity estimation (MAE: 6.09, RMSE: 7.85) on facial expression data, though validation is limited to single-modality visual input [64]. However, 40% of studies report less than 5% improvement despite doubling or tripling computational costs, and 87% lack systematic ablations to separate component contributions, a substantial methodological gap that limits causal attribution of performance gains to specific architectural components. The evidence therefore favors CNN hybrids for cross-sectional analysis and RNN hybrids for longitudinal monitoring, with the caveat that cost-effective preprocessing can rival architectural modifications in resource-constrained settings, as summarized in Table 7.

Table 7. Hybrid Transformers’ performance summary by component type ¹.

Component	n Studies	Accuracy Range	Gain Range	Median Gain	Computational Cost	Primary Benefit	Key Limitation
Trans + RNN	7	86–98.76%	+1–14.4%	+3.6%	2.5–3 times FLOPs	Sequential dependency modeling	High cost; inconsistent gains (1–14% variance)
Trans + CNN	4	92–98.49%	+0.5–12.3%	+8%	1.5–2 times FLOPs	Local feature extraction	Task-specific (social media > clinical)
Trans + Other	3	95–98.6%	+3–10%	+5%	Varies widely	Specialized capabilities	Ad hoc design; unclear synergy mechanisms

¹ Accuracy ranges and gain values aggregate heterogeneous metrics across different tasks. Values illustrate architectural trends, not directly comparable benchmarks.

4.4. Multimodal Transformers: Cross-Modal Integration

4.4.1. Mechanistic Rationale for Cross-Modal Attention

Multimodal Transformer architectures mark a shift away from text-only depression detection, combining complementary signal modalities (text, speech, facial expressions, and physiological signals) via cross-modal attention mechanisms. The mechanistic rationale rests on the multifaceted nature of depression as a clinical construct: linguistic content identifies cognitive symptoms, paralinguistic features reveal psychomotor changes, facial dynamics capture emotional blunting and expression reduction, and physiological signals reflect neurobiological substrates. Critically, cross-modal attention allows the model to learn modality alignment, e.g., detecting semantic–prosodic misalignment where verbal wellness reports contradict depressive intonation patterns [65]. Such cross-modal inconsistencies are clinically significant because they represent masking behavior that no single modality can capture in isolation.
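
As a minimal illustration of cross-modal attention, the sketch below lets text-token representations act as queries over audio-frame representations, so that each token receives a weighted summary of the acoustic frames most relevant to it. The dimensions, random inputs, and projection matrices are hypothetical, and real systems add multiple heads, learned projections, and temporal alignment that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, audio, Wq, Wk, Wv):
    """Text tokens (queries) attend over audio frames (keys and values)."""
    Q = text @ Wq    # (n_tokens, d)
    K = audio @ Wk   # (n_frames, d)
    V = audio @ Wv   # (n_frames, d)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 32))    # 5 text tokens, 32-dim features (hypothetical)
audio = rng.normal(size=(20, 24))  # 20 audio frames, 24-dim features (hypothetical)
Wq = rng.normal(size=(32, 16))
Wk = rng.normal(size=(24, 16))
Wv = rng.normal(size=(24, 16))

fused, weights = cross_modal_attention(text, audio, Wq, Wk, Wv)
print(fused.shape, weights.shape)  # (5, 16) (5, 20)
```

Each row of `weights` is a distribution over audio frames for one text token; inspecting these rows is one way a semantic–prosodic mismatch could be localized, although the reviewed studies differ in how they implement and interpret this step.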

However, this sophistication introduces significant challenges: synchronization across heterogeneous sampling rates, computational scaling with the number of modalities, and interpretation of distributed decision-making across modality-specific encoders. As outlined in Table 8, early, late, and hybrid fusion strategies offer distinct trade-offs among accuracy, robustness, and computational efficiency.

Table 8. Performance, mechanisms, and limitations of multimodal Transformers by fusion strategy.

Fusion Strategy	Study	Dataset	Key Innovation	Performance	vs. Baseline	Core Limitations
Early Fusion (n = 3)	[66]	LMVD	ViT + LLM (Qwen3-32B); confidence weighting	Acc: 78.02%, F1: 78.01%	Visual: +10%, Audio: +7%	Limited to 3 modalities; ethical bias
	[67]	DAIC-WOZ	MLP-Mixer + XLM-RoBERTa	F1: 67%	Text-only: +5%	Clinical-only; needs bio-signals
	[68]	Multimodal Twitter dataset	ContextVecNet: CLIP + context vectors	AUC: 99.22% ¹, F1: 96.19%	Text-only: +3%	High compute; small window; replication needed
Late Fusion (n = 5)	[69]	MODMA, DAIC-WOZ	FTSM: 64→4 channels	Acc: 91.22%, 94.17%	EEG: +10%, Text: +12%	Channel reduction; explore speech
	[70]	DAIC-WOZ	Llama2 + landmarks	F1: 71.9%	WavLM + Rob: +7%	Data scarcity; imbalance
	[71]	D-Vlog	DepMSTAT: SAB + TAB	Acc: 71.53%, F1: 73.51%	Visual: +5%, Audio: +4%	Data authenticity; segmentation challenges
	[72]	CMDC, E-DAIC	MLlm-DR: LQ-former; speech + visual	F1: 100% ², 79%	Text-only: +15%	Cross-modal; emotion features
	[73]	MODMA	MHA-GCN ViT topology	Acc: 89.03%, F1: 88.83%	EEG: +8%, Speech: +11%	High compute; generalization
Hybrid Fusion (n = 3)	[74]	D-Vlog	MDD-Net: Mutual Bidirectional	F1: 77.07%, Recall: 80.65%	SOTA: +17.37%	Mislabeled data; cross-dataset
	[75]	D-Vlog, LMVD	Disentangled learning	Acc: 70.28%, F1: 77.58%	Unified: +3–5%	Noise handling; expand modalities
	[76]	CMU-MOSI, MOSEI, AVEC2019	TensorFormer: 3-way tensor	MAE: 0.753, 0.517, CCC: 0.493	Sequential: +5%	Scalability; generalization

¹ AUC values approaching 100% on restricted single-platform datasets (ContextVecNet) are subject to dataset-specific ceiling effects and should not be generalized without replication on independent clinical datasets.

² The 100% F1 reported by MLlm-DR on the CMDC dataset should be interpreted with substantial caution, as it likely reflects the limited scale and diversity of that dataset rather than genuine ceiling performance. The corresponding result on E-DAIC (79%) provides a more representative estimate of real-world detection capability.

4.4.2. Fusion Strategy Analysis

Early fusion architectures, combining modalities at input or early encoding layers, facilitate fine-grained cross-modal interactions from the initial processing stages, achieving competitive accuracies of 78–96% [66,67,68]. For instance, ContextVecNet reports an AUC of 99.22% using CLIP-based encodings [68]. This near-ceiling AUC, however, reflects dataset-specific ceiling effects on a restricted single-platform Twitter dataset. Without replication in independent clinical populations, it should not be taken as evidence of general detection capability. Confidence-based strategies surpass isolated modalities by 4–10% through adaptive weighting [66]. However, these advantages are counterbalanced by severe limitations: computational complexity is quadratic, and the architecture exhibits catastrophic failure in 100% of studies when input data are incomplete [69,73] because early-stage feature concatenation cannot compensate for absent modality representations.

Late fusion architectures process modalities through specialized encoders before integration and consistently yield higher accuracies of 89–94% [69,70,73]. DepMSTAT uses spatiotemporal attention blocks for video-based depression detection, achieving an F1 of 73.51% on D-Vlog, with gains over the visual (+5%) and audio (+4%) single-modality baselines [71]. The multimodal Transformer integrating EEG–interview data achieves 94.17% accuracy on DAIC-WOZ with EEG channels reduced from 64 to 4, a 16-fold decrease in data collection burden [69]. MLlm-DR [72] uses the LQ-former architecture for cross-modal emotion features, reporting 100% F1 on the CMDC dataset (79% on E-DAIC). The 100% F1 on CMDC should be treated with significant caution: it most likely reflects the limited scale and diversity of the CMDC dataset rather than flawless detection capability. The E-DAIC result of 79% is a more realistic indication of expected real-world performance. The decisive advantage of late fusion is graceful degradation: 78–82% accuracy is maintained even with absent modalities, since each modality-specific encoder produces an independent representation that can still be aggregated when some inputs are missing. This robustness comes at 2–3 times the computational cost of early fusion.
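
The graceful degradation of late fusion can be illustrated with a simple decision-level scheme: each modality-specific model outputs its own depression probability, and the fused score is a weighted average over whichever modalities are present. The function name, weights, and probabilities below are hypothetical, and the reviewed systems typically learn the aggregation rather than fixing it by hand.

```python
def late_fusion(modality_probs, weights):
    """Weighted average of per-modality probabilities, skipping missing inputs.

    modality_probs maps a modality name to a probability, or to None if that
    modality is unavailable for the current sample.
    """
    available = {m: p for m, p in modality_probs.items() if p is not None}
    if not available:
        raise ValueError("at least one modality is required")
    total_weight = sum(weights[m] for m in available)
    # Renormalize over the available modalities so the result stays in [0, 1].
    return sum(weights[m] * p for m, p in available.items()) / total_weight

weights = {"text": 0.5, "audio": 0.3, "video": 0.2}  # hypothetical weights

full = late_fusion({"text": 0.8, "audio": 0.6, "video": 0.4}, weights)
no_video = late_fusion({"text": 0.8, "audio": 0.6, "video": None}, weights)

print(round(full, 3))      # 0.66
print(round(no_video, 3))  # 0.725
```

Because aggregation happens after each encoder has produced its output, a missing modality only changes the normalization. An early-fusion model that concatenates raw features has no equivalent fallback when one input block is absent.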

Hybrid fusion architectures seek to combine the advantages of both approaches through hierarchical processing, achieving state-of-the-art accuracies of 88–97% while introducing significant complexity [74,75,76]. TensorFormer, for example, uses three-way attention tensors to capture higher-order dependencies, outperforming late fusion by 4–6% [76], while MDD-Net’s Mutual Transformer achieves 17.37% F1 improvement via bidirectional cross-attention [74]. However, these gains require O(n³) computational complexity and inference latencies of 2–5 s, which raises legitimate questions about clinical deployment feasibility.

4.4.3. Synthesis of Multimodal Fusion Strategies

Late fusion architectures offer the best deployment balance, yielding 91.2% median accuracy with improved robustness to missing modalities. The modality contribution hierarchy shows that audio–text fusion contributes the most incremental value (+8–11%), exceeding visual or EEG combinations, which suggests a pragmatic entry point for clinical implementation: audio data can be captured using standard microphones without specialized equipment, substantially lowering the deployment barrier. Nevertheless, the field confronts critical translation shortfalls: 82% of studies lack clinical validation, none report prospective trials, and 100% omit fairness audits, a pressing concern given potential algorithmic bias against non-Western populations and non-standard speech patterns, as summarized in Table 9.

Table 9. Cross-study performance summary and methodological patterns of multimodal Transformers.

Category	Subcategory	Metric/Value	Key Finding
Fusion Strategy Performance	Late Fusion	91.2% median	Best cost–benefit; robustness: 82–87%
	Early Fusion	87.3% median	Catastrophic failure with missing data: 67–73%; 2.5–3 times cost
	Hybrid Fusion	89.5% median, 97% peak	Highest peak; 8–12 times cost; worst interpretability
Modality Contribution	Audio–Text	+8–11% vs. text	Most valuable; paralinguistic features
	Visual–Text	+3–5% vs. text	Social media; limited by masking
	EEG–Behavioral	+6–9% vs. text	Neurobiological; high equipment cost
Clinical Context	Clinical Assessment	+8–16% improvement	DAIC-WOZ, MODMA: high-quality signals
	Social Media	+3–7% improvement	D-Vlog, Twitter: variable quality
Computational Costs	Early/Late/Hybrid	2.5–3 times/5–8 times/8–12 times	Exponential scaling with complexity
	Inference Latency	0.5–1 s/1–2 s/2–5 s	Real-time challenging
Missing Modality Robustness	Late Fusion	82–87% (7–12% drop)	Graceful degradation; maintains utility
	Early Fusion	67–73% (>10% drop)	Catastrophic failure; no graceful strategy
	Hybrid Fusion	75–80% (8–17% drop)	Moderate, between early and late
Validation Gaps	Clinical validation	18% (2/11)	Prospective trials: 0%
	Cross-dataset	18% (2/11)	Generalization untested
	Fairness audits	0% (0/11)	Demographic bias unassessed
Deployment Barriers	Equipment access	100% require specialized equipment	EEG: amplifying disparities
	Real-time capability	73% require > 1 s	Prohibitive for high-throughput screening
	Interpretability	82% no analysis	Clinical transparency needed

5. Discussion

The preceding section-by-section examination of individual architectural paradigms revealed performance patterns specific to each approach. This section elevates the analysis to cross-paradigm synthesis, addressing three questions that cannot be answered within any single paradigm: How do Transformers compare with fundamentally different deep learning approaches? What are the Pareto-optimal trade-offs across architectural paradigms? And what systemic challenges must be resolved to enable clinical translation?

5.1. Transformer-Based Architectures Versus Alternative Approaches

Although Transformers have established themselves as the dominant paradigm for detecting depression, alternative deep learning architectures, including spectral analysis [77], spatiotemporal CNNs [78], graph neural networks [79], attention-based multimodal fusion [80,81], and hybrid CNN-LSTM architectures [82], demonstrate distinct advantages on specific capability dimensions (most notably interpretability) while generally trailing Transformer variants on detection performance and sample efficiency, as illustrated in Figure 3a (scores 1–5; full derivation in Supplementary Table S2).

These alternatives have the advantage of explicitly modeling depression-specific patterns that Transformers learn only implicitly: spectral methods decompose behavioral signals into frequency domains to detect symptom periodicity [77]; the Maximization–Differentiation Network models facial transitions, achieving an RMSE of 7.55 with only 25M parameters on the AVEC2014 facial dataset [78]; and GNNs outperform late-fusion Transformers by 6% in accuracy on E-DAIC using structural encoding of cross-modal dependencies [79]. It should be noted that these performance figures reflect heterogeneous metrics and datasets (RMSE, where lower is better; accuracy; and relative gain) and are, therefore, not directly interchangeable. They are presented to illustrate the task-specific strengths of each alternative rather than to assert equivalent-condition superiority over Transformers.

However, as shown in Figure 3b, these architectural innovations generally require 1000–5000 labeled samples and rigid domain-specific engineering. The practical dominance of Transformers stems from two architecture-specific features: massive-scale pre-training that encodes transferable linguistic knowledge, and architectural unification that accommodates diverse input modalities within a single framework. MentalBERT achieves an F1 of 97.3% with approximately 3000 samples—representing roughly 2–3 times fewer annotation requirements compared to GNN-based and spectral alternatives in comparable social media text classification settings—though this comparison is restricted to sample efficiency and does not generalize across tasks or datasets.

Generative decoder-only models (DepGPT and GPT-4o) require far fewer samples still, approaching few-shot or zero-shot regimes with minimal task-specific training. Transformers also scale more effectively, with larger models yielding consistent gains, whereas alternatives plateau despite increased depth. It should be noted, however, that some alternative architectures are beginning to adopt pre-training strategies (e.g., graph pre-training for GNNs), which may narrow this advantage over time. Architecture selection must, therefore, balance benchmark precision against deployment constraints: alternatives may be preferred in stable, data-rich environments with well-defined signal processing requirements, while Transformers remain optimal for scenarios characterized by limited labels, cross-domain variability, or missing modalities.

5.2. Comparative Analysis Across Transformer Paradigms

Before proceeding, two methodological qualifications are necessary.

First, the 46 synthesized studies employ heterogeneous evaluation metrics across non-equivalent tasks: binary social media classification, ordinal severity scoring, continuous regression, and zero-shot symptom assessment. Aggregating these into median performance values or Pareto frontiers is inherently imprecise. The performance comparisons in Figure 4a,b are, therefore, best understood as illustrative architectural trends and directional cost–benefit signals, not as equivalent-condition benchmarks. Readers seeking granular within-task comparisons should consult Tables 3–9 and the original cited studies.

Second, and critically, the studies use heterogeneous performance metrics (accuracy, F1-score, precision, and AUC) that are not directly interchangeable: accuracy can be inflated on class-imbalanced datasets, F1-score better captures performance under class imbalance, precision reflects positive predictive value, and AUC measures discrimination independently of the classification threshold. To make this heterogeneity explicit, individual study results in Figure 4b are differentiated by reported metric type (▲ = F1-score; ● = accuracy; ■ = precision). The bar heights represent the central tendency of reported values within each task–paradigm grouping and should be interpreted as directional indicators, not equivalent-condition performance benchmarks.
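The practical consequence of this metric heterogeneity can be illustrated with a minimal sketch. The labels and predictions below are hypothetical (they are not drawn from any reviewed study), and the helper names `confusion_counts` and `summarize` are introduced only for illustration. On a set with 10% prevalence, a classifier that misses most positive cases can still report high accuracy and perfect precision, while F1 exposes the poor recall.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = depressed."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical screening set: 10 depressed users out of 100 (10% prevalence).
y_true = [1] * 10 + [0] * 90
# Hypothetical classifier: flags only 2 of the 10 positives, no negatives.
y_pred = [1] * 2 + [0] * 8 + [0] * 90

metrics = summarize(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case accuracy is 0.92 and precision is 1.00, yet recall is 0.20 and F1 is about 0.33, which is why values reported under different metrics cannot be pooled as if they were equivalent.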

Note on figure derivation methodology. The qualitative capability scores in Figure 3a (scale 1–5) were assigned through a structured evidence-mapping procedure in which two authors independently rated each architectural paradigm on each of the six dimensions (detection performance, data efficiency, interpretability, computational efficiency, multimodal capability, and clinical readiness) based on explicit quantitative and qualitative evidence drawn from the studies summarized in Tables 3–9. Inter-rater agreement was quantified (Cohen’s κ = 0.79), any divergent scores were resolved through consensus discussion, and full per-cell evidence is provided in Supplementary Table S2. Scores reflect directional trends and relative ordering across paradigms, not absolute quantitative benchmarks. The minimum labeled sample thresholds in Figure 3b are drawn directly from specific studies cited in Section 4. The cross-paradigm median performance values in Figure 4b were computed separately within each task category (binary social media classification, clinical interview, and multimodal tasks) using the study-level results reported in Tables 3–9; the full derivation, with per-study values and metric types, is provided in Supplementary Table S3. All computational cost multipliers in Figure 4a are normalized to BERT base and fully derived in Supplementary Table S4.
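As an illustration of the agreement statistic reported above, the following sketch computes unweighted Cohen's κ for two raters. The source does not state whether a weighted or unweighted variant was used, so the unweighted form is an assumption, and the ratings shown are hypothetical rather than the authors' actual scores.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of cells on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 capability scores from two raters over ten cells.
rater_a = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
rater_b = [1, 1, 2, 3, 3, 3, 4, 4, 5, 4]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

For these hypothetical ratings the observed agreement is 0.80 and the chance agreement is 0.20, giving κ = 0.75. Because the scores are ordinal, a weighted κ would also be a defensible choice.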

Systematic analysis of 46 Transformer-based studies reveals distinct Pareto frontiers in terms of computational cost and detection performance, as shown in Figure 4a. Encoder-only architectures (n = 14) define a high-efficiency zone with F1/accuracy scores ranging from 76 to 99% at minimal computational overhead (normalized to BERT at 10⁰), yielding the optimal performance-to-resource ratio for standard detection tasks. The single outlier (DepRoBERTa, F1: 58.3%) reflects limited corpus diversity rather than architectural inadequacy. Decoder-only models (n = 7) incur costs one to two orders of magnitude higher than the BERT baseline; while this positions them in the high-cost quadrant, the expenditure is strategically justified by their few-shot generalization capabilities and novel data-processing modalities (e.g., tabular-to-text transformation) in data-scarce environments, effectively trading infrastructure cost for reduced labeling requirements. Hybrid architectures (n = 14) occupy a cost-efficient intermediate zone, demonstrating performance comparable to encoder-only models through complementary neural component integration. Multimodal architectures (n = 11) cluster at higher costs, reflecting the computational demands of cross-modal attention and multistream processing—a trade-off warranted when multiview integration is critical to diagnostic sensitivity.
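The notion of a Pareto frontier used above can be made concrete with a short sketch: a model lies on the frontier if no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The model names, cost multipliers, and scores below are hypothetical placeholders, not values taken from Figure 4a or Supplementary Table S4.

```python
def pareto_frontier(points):
    """Return names of points not dominated on (lower cost, higher score).

    points: list of (name, relative_cost, score) tuples.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (cost relative to BERT base, reported score in %) pairs.
models = [
    ("encoder_a", 1.0, 95.0),
    ("encoder_b", 1.2, 90.0),
    ("hybrid_a", 3.0, 96.0),
    ("multimodal_a", 20.0, 84.0),
    ("decoder_a", 50.0, 98.0),
]

front = pareto_frontier(models)
print(front)
```

With these placeholder values the frontier consists of `encoder_a`, `hybrid_a`, and `decoder_a`. Any such frontier built from the reviewed studies inherits the caveat stated earlier: when scores mix accuracy, F1, and precision, it indicates a direction rather than a like-for-like ranking.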

Practical efficacy is heavily context-dependent (Figure 4b). In social media environments, encoder-only models show the highest central tendency across the reviewed binary classification studies (median of reported Acc/F1 values: 97%; n = 7; six of seven studies report accuracy, one reports F1 [31]; full derivation in Supplementary Table S3), followed by hybrid (94%; n = 10; mix of Acc and F1) and decoder-only architectures (92%; n = 1 study, two conditions [44]; both F1). Multimodal approaches show a lower central tendency (77%; n = 4; all F1 [68,71,74,75]), reflecting the limited number of multimodal social media studies and the variability between the inflated ceiling result of ContextVecNet (F1: 96.19% on restricted Twitter data [68]) and the more representative D-Vlog studies (F1: 73–78% [71,74,75]), consistent with the observation that complex cross-modal fusion may introduce noise when textual signals are already discriminative. Because encoder-only values are predominantly accuracy-based, while multimodal values are F1-based, direct comparison of their medians should account for this metric difference. The directional ordering nonetheless reflects the pattern evident in individual studies within Tables 3–9.

In clinical interview settings, a striking divergence emerges: hybrid architectures exhibit the highest central tendency (94%; n = 3 studies, 4 data points: DLCDME F1 95% [58]; TCC F1 93.6% and 96.7% [62]; Transformer-BiLSTM precision 74% [56]; full derivation in Supplementary Table S3) based on robust DAIC-WOZ and MODMA results from autoencoder-augmented and advanced encoding configurations, whereas encoder-only (79%; F1: 81% [35] and 76% [43]), decoder-only (80%; F1: 78% [45] and 82.6% [48]), and multimodal (84%; range: F1 67% to Acc 94.17%; n = 5 studies, 6 data points [67,69,70,72,73]) architectures show lower values.

Note that the hybrid clinical interview median aggregates three F1 values and one precision value from three studies—metric heterogeneity that limits direct comparison with other paradigms. The precision value of 74% [56] reflects a different aspect of detection quality than F1—its inclusion in the median is disclosed here and in Supplementary Table S3. This pattern indicates that neither purely linguistic models nor standard multimodal fusion adequately addresses the full challenges of clinical discourse—social desirability bias, linguistic masking, and structured interview formats—whereas hybrid architectures overcome these limitations through complementary feature extraction that combines Transformer contextual understanding with specialized local pattern detection [83].
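The clinical-interview medians quoted above can be reproduced from the per-study values given in the text. The sketch below assumes that a plain median is the central-tendency measure used; the authoritative derivation is the one in Supplementary Table S3. The final line is an illustrative sensitivity check added here, not a result reported in the source.

```python
from statistics import median

# Per-study values (%) as quoted in the text for clinical interview settings.
clinical_interview = {
    "hybrid": [95.0, 93.6, 96.7, 74.0],  # F1 [58]; F1, F1 [62]; precision [56]
    "encoder_only": [81.0, 76.0],        # F1 [35]; F1 [43]
    "decoder_only": [78.0, 82.6],        # F1 [45]; F1 [48]
}

medians = {paradigm: median(values)
           for paradigm, values in clinical_interview.items()}
for paradigm, value in medians.items():
    print(f"{paradigm}: {value:.1f}")

# Sensitivity check (not from the source): hybrid median over F1 values only.
hybrid_f1_only = median([95.0, 93.6, 96.7])
print(f"hybrid, F1 only: {hybrid_f1_only:.1f}")
```

The medians come out at 94.3, 78.5, and 80.3, matching the rounded 94%, 79%, and 80% in the text. Dropping the single precision value moves the hybrid median only to 95.0, so under these assumptions the hybrid lead does not hinge on that mixed-metric data point.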

For multimodal tasks (studies incorporating audio, speech, or video modalities beyond text), hybrid architectures again lead (94%; median of [56] precision 74%, [58] F1 95%, [62] F1 93.6%/96.7%; all DAIC-WOZ or MODMA datasets; full derivation in Supplementary Table S3), followed by decoder-only (83%; [48] F1 82.6% on DAIC-WOZ speech + text pipeline), with encoder-only (79%; same DAIC-WOZ studies as clinical interview category: [35] F1 81%, [43] F1 76%) and the multimodal paradigm (78%; central tendency of 10 of the 11 multimodal studies [66,67,68,69,70,71,72,73,74,75], with [76] excluded because it reports MAE; mix of Acc and F1; see Supplementary Table S3) showing comparable values. These patterns suggest that, within the reviewed studies, targeted architectural augmentation consistently shows stronger results than both standalone linguistic models and complex fusion approaches in multimodal contexts.

These patterns should be interpreted as task-contextualized directional trends rather than definitive rankings: because studies employ non-equivalent tasks, datasets, and metrics, no single paradigm can be declared universally superior. Taken together, and with these qualifications in mind, the reviewed evidence indicates that encoder-only models show consistently strong results for population-level text-based social media screening, hybrid architectures demonstrate the most robust individual results in clinical interview and multimodal task settings across the reviewed studies, and multimodal designs—despite their theoretical appeal—require further standardized evaluation before their clinical diagnostic potential can be fully assessed.

5.3. Challenges and Limitations

While Section 4 outlined paradigm-specific limitations, four systemic challenges transcend all architectural boundaries and collectively impede clinical translation. Rather than re-enumerating individual study limitations, this section summarizes these cross-cutting barriers and their interactions.

Ground truth quality and evaluation heterogeneity. The most fundamental challenge is that about 85% of reviewed studies use self-reported screening scores (e.g., PHQ-9, BDI) as ground truth, instead of gold-standard psychiatric interviews (SCID and MINI). This reliance on self-report does not simply reflect a data quality concern—it systematically biases model learning toward subjective self-assessment patterns rather than clinically validated diagnostic criteria. Consequently, reported performance metrics may indicate high consistency with self-reported symptoms (convergent validity) as opposed to true diagnostic accuracy. Compounding this, artificially balanced datasets do not represent real-world depression prevalence: models trained under balanced-class assumptions may yield substantially elevated false-positive rates when deployed in realistic clinical settings, risking a volume of unconfirmed positive screens that would overwhelm referral systems. The lack of standardized evaluation protocols creates considerable performance variance for identical models across configurations, making meaningful cross-study comparison difficult.

Furthermore, regarding study generalizability, a substantial proportion of the evidence in this systematic review derives from social media text corpora and unipolar depression benchmarks that bear limited resemblance to the clinical contexts in which SMI assessments must ultimately function. The linguistic register, self-disclosure patterns, and symptom expression in Twitter or Reddit posts differ fundamentally from those encountered in structured clinical interviews with patients experiencing MDD or comorbid psychiatric conditions. This gap between the evidence base and the target clinical context represents a major translational risk that future benchmark design must address directly.

Mechanistic opacity and clinical interpretability. Current explainability approaches focused on attention weight visualization [84] provide only surface-level insight. While attention heatmaps can indicate which tokens or modalities contribute to a prediction, they do not reveal the diagnostic reasoning chain—whether the model is detecting real patterns in DSM-5 symptoms, leveraging superficial lexical correlates, or using dataset-specific artifacts. This interpretability deficit has concrete clinical consequences: clinicians cannot verify model reasoning against their own diagnostic judgment, patients cannot receive meaningful explanations of screening results, and regulatory bodies do not have access to the transparency required for medical device approval. The challenge is especially acute for multimodal models, where decision-making is distributed across modality-specific encoders and cross-modal attention layers.

Geographic, linguistic, and demographic bias. The concentration of research on English-language, Western social media data (93% of encoder-only studies; 78% from North American samples) creates a compounding bias problem. Depression is expressed through culturally mediated linguistic patterns—metaphorical expressions, somatic idioms, and disclosure norms vary substantially across cultures. Models trained predominantly on English-language data demonstrate approximately 7–11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies—for example, DepGPT shows an approximately 11% F1 drop when applied to Bengali versus English data in zero-shot settings [44], and Whisper + GPT-2 shows an approximately 7% F1 drop on Indic-Bengali relative to DAIC-WOZ [48]. Broader cross-lingual performance gaps are anticipated under conditions of greater linguistic and script distance, perpetuating the very healthcare disparities that computational screening tools aim to mitigate. This bias interacts with the ground truth problem: PHQ-9 and BDI, while widely translated, may not capture culturally specific symptom presentations, meaning that even translated models may be optimizing for a culturally biased diagnostic target.

Computational and infrastructural constraints. Multimodal Transformers require specialized equipment and substantial computational resources for both training and inference. These requirements present a fundamental tension: the populations most in need of automated screening (resource-constrained communities and low- and middle-income countries) are least able to deploy the most capable architectures. Furthermore, critical human factors—clinician cognitive burden, trust calibration, and patient acceptance—remain underexplored, despite being prerequisites for successful clinical workflow integration.

This systematic review also has methodological limitations: a narrative synthesis was adopted in place of a quantitative meta-analysis because of cross-study heterogeneity, and eligibility was restricted to English-language publications from 2020 to 2025.

5.4. Future Directions

Transitioning Transformer-based depression detection to clinically validated instruments requires strategic realignment across four domains, each targeting a specific challenge identified above.

Standardized clinical validation. Addressing the ground truth deficit requires convergence on evaluation frameworks based on criterion-standard psychiatric interviews, supplanting reliance on self-report measures. Benchmark datasets must be stratified by depression subtype and demographics. To mitigate the burden of false positives arising from balanced-data training, evaluation protocols must mandate prevalence-adjusted test sets reflecting epidemiological base rates of 5–20%, and expanded metrics must incorporate positive predictive value, calibration error, and net benefit analyses.
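The case for prevalence-adjusted evaluation follows directly from Bayes' rule: for fixed sensitivity and specificity, positive predictive value falls as the base rate falls. The sketch below uses a hypothetical classifier with 90% sensitivity and 90% specificity (not a figure from any reviewed study) and evaluates it at a balanced 50% prevalence and at the 20% and 5% base rates mentioned above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(condition | positive screen), obtained from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive screens are possible with these inputs")
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 90% sensitivity, 90% specificity.
ppv_by_prevalence = {
    prevalence: positive_predictive_value(0.90, 0.90, prevalence)
    for prevalence in (0.50, 0.20, 0.05)
}
for prevalence, ppv in ppv_by_prevalence.items():
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

Under these assumptions PPV drops from 0.900 on a balanced test set to about 0.692 at 20% prevalence and about 0.321 at 5% prevalence, meaning roughly two of every three positive screens would be false at the lowest base rate.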

Interpretable and uncertainty-aware architectures. Remediating mechanistic opacity requires innovations extending beyond attention visualization. Hierarchical Transformer architectures with explicit temporal modeling can enable the capture of longitudinal symptom trajectories, differentiating first-episode major depression, recurrent depression, and chronic dysthymia. Integrating uncertainty quantification with Bayesian formulations or conformal prediction provides calibrated confidence intervals, enabling clinicians to distinguish high-confidence predictions from ambiguous cases warranting additional scrutiny. Concept-based explanation methods that align model representations with DSM-5 symptom domains would bridge the gap between computational outputs and clinical reasoning. Robust multimodal fusion with conditional computation and dynamic weighting can ensure performance under data degradation while managing computational constraints.
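To indicate what conformal prediction could look like in this setting, the sketch below implements split conformal prediction for a binary depressed/not-depressed classifier. It is a generic textbook construction rather than a method used in any reviewed study; the function names and calibration data are hypothetical, and the output is a prediction set of labels rather than a numeric interval. Under the usual exchangeability assumption, the set contains the true label with probability of at least 1 − α.

```python
import math


def nonconformity(prob_depressed, label):
    """Score = 1 - probability the model assigns to `label` (1 = depressed)."""
    p_label = prob_depressed if label == 1 else 1.0 - prob_depressed
    return 1.0 - p_label


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibration-score quantile used by split conformal prediction."""
    scores = sorted(nonconformity(p, y) for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed order statistic
    if rank > n:
        return 1.0  # too few calibration cases: every label is retained
    return scores[rank - 1]


def prediction_set(prob_depressed, threshold):
    """Labels whose nonconformity score does not exceed the threshold."""
    return [y for y in (1, 0) if nonconformity(prob_depressed, y) <= threshold]


# Hypothetical held-out calibration cases: model P(depressed) and true label.
cal_probs = [0.9, 0.1, 0.8, 0.3, 0.6, 0.4, 0.7]
cal_labels = [1, 0, 1, 0, 1, 1, 0]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.25)
confident = prediction_set(0.9, threshold)  # a single label is retained
ambiguous = prediction_set(0.5, threshold)  # both labels are retained
print(threshold, confident, ambiguous)
```

A single-label set corresponds to a high-confidence prediction, while a set containing both labels flags an ambiguous case that could be routed to a clinician for additional scrutiny.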

Equity-centered development. Mitigating geographic and linguistic representation deficits requires proactive strategies, including adversarial debiasing, demographically stratified loss reweighting, and performance parity constraints to minimize disparities across subgroups. Purposefully constructed datasets must ensure sufficient statistical power across cross-cultural, multilingual, and intersectional demographic strata through active recruitment in historically underrepresented communities. Cross-lingual transfer learning—exploitation of multilingual pre-trained models with culture-aware adaptation—offers a scalable path toward equitable global coverage.
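One simple form of demographically stratified loss reweighting is inverse-frequency group weighting, in which each subgroup contributes equally to the training loss regardless of its sample count. The sketch below is an illustration under that assumption; the group codes (`"en"`, `"bn"`), probabilities, and function names are hypothetical, and the reviewed studies may use different schemes.

```python
import math
from collections import Counter


def group_weights(groups):
    """Inverse-frequency weights so every group has equal total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {group: n / (k * count) for group, count in counts.items()}


def weighted_log_loss(y_true, y_prob, groups):
    """Binary cross-entropy with per-sample weights taken from group_weights."""
    weights = group_weights(groups)
    total, weight_sum = 0.0, 0.0
    for y, p, g in zip(y_true, y_prob, groups):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += weights[g] * loss
        weight_sum += weights[g]
    return total / weight_sum


# Hypothetical corpus: 8 English-language and 2 Bengali-language samples.
groups = ["en"] * 8 + ["bn"] * 2
y_true = [1] * 10
y_prob = [0.9] * 8 + [0.5] * 2  # the model is less certain on the minority group

weights = group_weights(groups)
loss = weighted_log_loss(y_true, y_prob, groups)
print(weights, round(loss, 4))
```

With these weights the two minority-group samples carry the same total weight as the eight majority-group samples, so poor performance on the smaller group is no longer averaged away in the loss.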

AI analytics, wearable data, and smart applications for detection stability. The integration of AI analytics with wearable sensors and mobile health applications represents a critical frontier for improving depression detection stability and reducing the translation risk associated with snapshot-based clinical assessments. Systematic evidence confirms that wearable AI demonstrates meaningful accuracy for detecting and predicting depression across diverse populations [9] and that passive, non-intrusive multimodal sensing approaches more comprehensively capture natural behaviors than controlled or single-session data collection paradigms [10]. Current Transformer-based models rely predominantly on single-session text or audio inputs, which are inherently susceptible to momentary fluctuations and self-presentation biases. Passive sensing modalities provide continuous, objective proxies for psychomotor retardation, sleep disruption, and social withdrawal [8], capturing symptom dynamics across naturalistic daily contexts rather than isolated clinical encounters [85]. Multimodal fusion architectures that integrate such longitudinal physiological streams with text-based Transformer features offer a pathway to richer, temporally grounded representations of depression severity. By anchoring model inputs in objectively measured behavioral signals, this approach simultaneously reduces reliance on single-session self-report measures and mitigates the cross-domain generalizability gap between social media corpora and real-world clinical populations—two of the most significant barriers to clinical translation identified in this systematic review.

Prospective pragmatic validation. Addressing deployment feasibility requires randomized controlled trials comparing AI-augmented workflows against standard-of-care baselines. Trials must employ patient-centered outcome measures, including diagnostic accuracy, time-to-diagnosis, cost-effectiveness, and critically, clinician cognitive workload and patient satisfaction. Embedded pilot implementations across resource-constrained primary care facilities and community mental health centers serving socioeconomically disadvantaged populations are crucial for identifying context-specific barriers and generating implementation science evidence that maximizes equitable benefit.

6. Conclusions

This systematic review analyzed 46 Transformer-based depression detection studies across encoder-only, decoder-only, hybrid, and multimodal architectural paradigms. It provides the first comprehensive, mechanism-level analysis examining how self-attention, transfer learning, and pre-training distinguish Transformers from conventional approaches.

Analysis reveals context-dependent optimization patterns. Encoder-only models achieve 95–99% accuracy in binary social media classification but exhibit a 12–17% performance deficit on clinical data, highlighting the limitations of purely lexical bidirectional attention. Decoder-only models excel in few-shot learning (F1 up to 98%). Parameter-efficient fine-tuning (e.g., LoRA) reduces training requirements by 40% with marginal loss, facilitating resource-constrained deployment. Hybrid architectures using CNN-based variants yield higher cost–benefit gains (8%) than RNN alternatives (3.6%), emphasizing the value of local feature extraction. Late-fusion multimodal architectures balance performance (91.2% median accuracy) with robustness against missing modalities. Across the reviewed studies, hybrid architectures show the highest observed median performance in clinical interview settings, outperforming encoder-only, decoder-only, and multimodal approaches within this task category—a pattern that should be interpreted cautiously given the metric heterogeneity across studies and the non-equivalence of tasks across paradigms.

However, four systemic challenges constrain clinical translation: an 85% reliance on self-reported labels over gold-standard psychiatric diagnoses, which raises questions about diagnostic validity; a 93% concentration on English data among encoder-only studies (with 78% of samples from North American sources across paradigms), associated with approximately 7–11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies; a lack of uncertainty quantification; and an absence of prospective validation in 82% of multimodal approaches. Attention visualization provides pattern-level insights but falls short of clinically grounded interpretation.

Advancing Transformer-based detection requires shifting focus from benchmark optimization to clinical utility and deployment feasibility. This necessitates validation against gold-standard psychiatric interviews, uncertainty-aware architectures, fairness-enforced training for diverse populations, and prospective trials measuring diagnostic agreement and workflow integration.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16105018/s1, File S1: PRISMA 2020 Checklist and PRISMA 2020 for Abstracts Checklist; Table S1: Study-Level Quality Assessment and Bias Evaluation for All 46 Included Studies; Table S2: Evidence Basis for Figure 3a Qualitative Capability Scores (Scale 1–5); Table S3: Derivation of Figure 4b within-task Median Performance Values; Table S4: Derivation of Figure 4a Computational Cost Multipliers.

Author Contributions

Conceptualization, S.Z., M.M., and L.Q.Z.; methodology, S.Z., M.M., and L.Q.Z.; software, S.Z.; validation, M.M. and L.Q.Z.; formal analysis, S.Z.; investigation, S.Z., M.M., and L.Q.Z.; resources, S.Z., M.M., and L.Q.Z.; data curation, S.Z., M.M., and L.Q.Z.; writing—original draft preparation, S.Z.; writing—review and editing, M.M. and L.Q.Z.; visualization, S.Z.; supervision, M.M. and L.Q.Z.; project administration, M.M. and L.Q.Z.; funding acquisition, L.Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Faculty of Information Science and Technology, and by Universiti Kebangsaan Malaysia under Grant TAP-K014364.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

We would like to thank the Cyber Analytics lab at the Center for Cyber Security (CYBER), Universiti Kebangsaan Malaysia (UKM), for providing us with the research infrastructure to conduct this work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AUC	Area Under the Receiver Operating Characteristic Curve
AUPRC	Area Under the Precision–Recall Curve
BDI	Beck Depression Inventory
BiLSTM	Bidirectional Long Short-Term Memory
CCC	Concordance Correlation Coefficient
CMDC	Chinese Multimodal Depression Corpus
CNN	Convolutional Neural Network
DAIC-WOZ	Distress Analysis Interview Corpus–Wizard of Oz
DSM-5	Diagnostic and Statistical Manual of Mental Disorders, 5th Edition
D-Vlog	Depression Vlog Dataset
E-DAIC	Extended Distress Analysis Interview Corpus
FLOPs	Floating-Point Operations
GBD	Global Burden of Disease
GBT	Gradient Boosting Tree
GNN	Graph Neural Network
GRU	Gated Recurrent Unit
LLM	Large Language Model
LoRA	Low-Rank Adaptation
LSTM	Long Short-Term Memory
MCC	Matthews Correlation Coefficient
MDD	Major Depressive Disorder
MINI	Mini International Neuropsychiatric Interview
MLM	Masked Language Modeling
MODMA	Multimodal Open Dataset for Mental Disorder Analysis
NLP	Natural Language Processing
OSF	Open Science Framework
PHQ-4	Patient Health Questionnaire-4
PHQ-8	Patient Health Questionnaire-8
PHQ-9	Patient Health Questionnaire-9
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QLoRA	Quantized Low-Rank Adaptation
RMSE	Root Mean Square Error
RNN	Recurrent Neural Network
SCID	Structured Clinical Interview for DSM Disorders
SMI	Serious Mental Illness
SMOTE	Synthetic Minority Oversampling Technique
ViT	Vision Transformer
YLD	Years Lived with Disability

References

Chang, Y.Y.; Omar, N. Data annotation architecture for automatic depression detection. Asia-Pac. J. Inf. Technol. Multimed. 2023, 12, 39–56.
GBD 2019 Mental Disorders Collaborators. Global, regional, and national burden of 12 mental disorders in 204 countries and territories, 1990–2019: A systematic analysis for the Global Burden of Disease Study 2019. Lancet Psychiatry 2022, 9, 137–150.
Liu, J.; Liu, Y.; Ma, W.; Tong, Y.; Zheng, J. Temporal and spatial trend analysis of all-cause depression burden based on Global Burden of Disease (GBD) 2019 study. Sci. Rep. 2024, 14, 12346.
Horwitz, A.G.; Zhao, Z.; Sen, S. Peak-end bias in retrospective recall of depressive symptoms on the PHQ-9. Psychol. Assess. 2023, 35, 378–381.
Aldkheel, A.; Zhou, L. Depression Detection on Social Media: A Classification Framework and research Challenges and Opportunities. J. Healthc. Inform. Res. 2023, 8, 88–120.
Trifu, R.N.; Nemeș, B.; Herta, D.C.; Bodea-Hategan, C.; Talaș, D.A.; Coman, H. Linguistic markers for major depressive disorder: A cross-sectional study using an automated procedure. Front. Psychol. 2024, 15, 1355734.
Terhorst, Y.; Knauer, J.; Philippi, P.; Baumeister, H. The relation between passively collected GPS mobility metrics and depressive symptoms: Systematic review and meta-analysis. J. Med. Internet Res. 2024, 26, e51875.
Rohani, D.A.; Faurholt-Jepsen, M.; Kessing, L.V.; Bardram, J.E. Correlations between objective behavioral features collected from mobile and wearable devices and depressive mood symptoms in patients with affective disorders: Systematic review. JMIR mHealth uHealth 2018, 6, e165.
Abd-Alrazaq, A.; AlSaad, R.; Shuweihdi, F.; Ahmed, A.; Aziz, S.; Sheikh, J. Systematic review and meta-analysis of performance of wearable artificial intelligence in detecting and predicting depression. npj Digit. Med. 2023, 6, 84.
Khoo, L.S.; Lim, M.K.; Chong, C.Y.; McNaney, R. Machine learning for multimodal mental health detection: A systematic review of passive sensing approaches. Sensors 2024, 24, 348.
Li, Z.; Zhang, Z.; Zhao, H.; Wang, R.; Chen, K.; Utiyama, M.; Sumita, E. Text compression-aided transformer encoding. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 1.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; ACL: Stroudsburg, PA, USA, 2020; pp. 38–45.
Yadav, A.B. Generative AI in the Era of Transformers: Revolutionizing Natural Language Processing with LLMs. J. Image Process. Intell. Remote Sens. 2024, 4, 54–61.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; NeurIPS: San Diego, CA, USA, 2017; Volume 30.
Harvey, D.; Lobban, F.; Rayson, P.; Warner, A.; Jones, S. Natural language processing methods and bipolar disorder: Scoping review. JMIR Ment. Health 2022, 9, e35928.
Deneault, A.; Dumais, A.; Désilets, M.; Hudon, A. Natural language processing and schizophrenia: A scoping review of uses and challenges. J. Pers. Med. 2024, 14, 744.
Chen, F.; Ben-Zeev, D.; Sparks, G.; Kadakia, A.; Cohen, T. Detecting PTSD in clinical interviews: A comparative analysis of NLP methods and large language models. In Biocomputing 2026: Proceedings of the Pacific Symposium, Hawaii, USA, 3–7 January 2026; World Scientific Publishing: Singapore, 2026; pp. 265–279.
Kumari, M.; Singh, G.; Pande, S.D. A survey of current progress in depression detection using deep learning and machine learning. Biomed. Mater. Devices 2025, 4, 716–740.
Mao, H.; Han, Q. Applications of Transformer-Based Language Models for Depression Detection: A Scoping Review. J. Integr. Soc. Sci. Humanit. 2025, 2, 1–8.
Tahir, W.B.; Khalid, S.; Almutairi, S.; Abohashrh, M.; Memon, S.A.; Khan, J. Depression detection in social media: A comprehensive review of machine learning and deep learning techniques. IEEE Access 2025, 13, 12789–12818.
Wang, N.; Chiong, R.; Kamil, R.; Zhang, W.; Al-Haddad, S.A.R.; Ibrahim, N. Depression detection using speech audio and text: A comprehensive review focusing on deep learning methods. Authorea Prepr. 2024.
Omar, M.; Levkovich, I. Exploring the efficacy and potential of large language models for depression: A systematic review. J. Affect. Disord. 2025, 371, 234–244.
Zhang, T.; Schoene, A.M.; Ji, S.; Ananiadou, S. Natural language processing applied to mental illness detection: A narrative review. npj Digit. Med. 2022, 5, 46.
Teferra, B.G.; Rueda, A.; Pang, H.; Valenzano, R.; Samavi, R.; Krishnan, S.; Bhat, V. Screening for depression using natural language processing: Literature review. Interact. J. Med. Res. 2024, 13, e55067.
Nanggala, K.; Elwirehardja, G.N.; Pardamean, B. Systematic Literature Review of Transformer Model Implementations in Detecting Depression. In Proceedings of the 2023 6th International Conference of Computer and Informatics Engineering (IC2IE), Lombok, Indonesia, 14–15 September 2023; IEEE: New York, NY, USA, 2023; pp. 203–208.
Jin, Y.; Liu, J.; Li, P.; Wang, B.; Yan, Y.; Zhang, H.; Ni, C.; Wang, J.; Li, Y.; Bu, Y.; et al. The Applications of Large Language Models in Mental Health: Scoping Review. J. Med. Internet Res. 2025, 27, e69284.
Gori, F.; Singh, S.; Quraishi, S.J. The Role of AI in Identifying Depression: A Review of Techniques and Approaches. In Proceedings of the 2024 9th International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 16–18 December 2024; pp. 1830–1834.
Parums, D.V. Editorial: Review articles, systematic reviews, meta-analysis, and the updated Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. Med. Sci. Monit. 2021, 27, e934475.
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.
Hussain, N.; Qasim, A.; Mehak, G.; Zain, M.; Sidorov, G.; Gelbukh, A.; Kolesnikova, O. Multi-Level Depression Severity Detection with Deep Transformers and Enhanced Machine Learning Techniques. AI 2025, 6, 157.
Raihan, N.; Puspo, S.S.C.; Farabi, S.; Bucur, A.-M.; Ranasinghe, T.; Zampieri, M. MentalHelp: A Multi-Task Dataset for Mental Health in Social Media. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 22–24 May 2024; ACL: Stroudsburg, PA, USA, 2024; pp. 11196–11203.
Poświata, R.; Perełkiewicz, M. OPI@LT-EDI-ACL2022: Detecting Signs of Depression from Social Media Text using RoBERTa Pre-trained Language Models. In Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion, Dublin, Ireland, 27 May 2022; ACL: Stroudsburg, PA, USA, 2022; pp. 276–282. [Google Scholar] [CrossRef]
Karna, P.; Keshari, S.K.; Mandal, A.K.; Chakraborty, B. BERT-Driven Deep Learning Approach for Depression Detection in Social Media Posts. In Proceedings of the 2023 1st International Conference on Optimization Techniques for Learning (ICOTL), Guwahati, India, 7–8 December 2023; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
Kurniadi, F.I.; Paramita, N.L.P.S.P.; Sihotang, E.F.A.; Anggreainy, M.S.; Zhang, R. BERT and ROBERTA models for enhanced detection of Depression in social Media text. Procedia Comput. Sci. 2024, 245, 202–209. [Google Scholar] [CrossRef]
Gavalan, H.S.; Rastgoo, M.N.; Nakisa, B. A BERT-Based Summarization approach for depression detection. arXiv 2024, arXiv:2409.08483. [Google Scholar] [CrossRef]
Wu, S.-H.; Qiu, Z.-J. A RoBERTa-based model on measuring the severity of the signs of depression. In Proceedings of the CLEF (Working Notes), Bucharest, Romania, 21–24 September 2021; CEUR Workshop Proceedings: Aachen, Germany, 2021; pp. 1071–1080. [Google Scholar]
Ahmed, T.; Ivan, S.; Munir, A.; Ahmed, S. Decoding depression: Analyzing social network insights for depression severity assessment with transformers and explainable AI. Nat. Lang. Process. J. 2024, 7, 100079. [Google Scholar] [CrossRef]
Zaman, A.; Ferdous, S.S.; Akhter, N.; Ena, T.I.; Nabi, M.M.; Asma, S.A. A Multilevel Depression Detection from Twitter using Fine-Tuned RoBERTa. In Proceedings of the 2023 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh, 21–23 September 2023; IEEE: New York, NY, USA, 2023; pp. 280–284. [Google Scholar] [CrossRef]
Balcı, E.; Saraç, E. Automated Depression Detection from Tweets: A Comparison of NLP Techniques. In Proceedings of the 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Turkey, 21–22 September 2024; IEEE: New York, NY, USA, 2024; pp. 1–5. [Google Scholar] [CrossRef]
Hasan, M.; Ghane, S. Data-driven Depression Detection System for Textual Data on Twitter using Deep Learning. In Proceedings of the 2022 2nd Asian Conference on Innovation in Technology (ASIANCON), Pune, India, 26–28 August 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar] [CrossRef]
Kumari, M.; Singh, G.; Pande, S.D. Depressonify: BERT a deep learning approach of detection of depression. EAI Endorsed Trans. Pervasive Health Technol. 2024, 10, e5513. [Google Scholar] [CrossRef]
Raj, A.; Ali, Z.; Chaudhary, S.; Bali, K.K.; Sharma, A. Depression Detection Using BERT on Social Media Platforms. In Proceedings of the 2024 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Malaysia, 26–28 August 2024; IEEE: New York, NY, USA, 2024; pp. 228–233. [Google Scholar] [CrossRef]
Reyes-Cocoletzi, L.; Aldama-Ramos, J.A.; Elias-Zapata, A.; Betancourt-González, J.; Rojas-Hernández, J. Detection of Tendency to Depression through Text Analysis. Comput. Sist. 2025, 29, 3. [Google Scholar] [CrossRef]
Chowdhury, A.K.; Sujon, S.R.; Shafi, S.S.; Ahmmad, T.; Ahmed, S.; Hasib, K.M.; Shah, F.M. Harnessing large language models over transformer models for detecting Bengali depressive social media text: A comprehensive study. Nat. Lang. Process. J. 2024, 7, 100075. [Google Scholar] [CrossRef]
Arcan, M.; Niland, D.-P. Evaluating Large Language Models for Anxiety, Depression, and Stress Detection: Insights into Prompting Strategies and Synthetic Data. arXiv 2025, arXiv:2511.07044. [Google Scholar] [CrossRef]
Nowacki, A.; Sitek, W.; Rybiński, H. LLM-based classifiers for discovering mental disorders. J. Intell. Inf. Syst. 2025, 64, 161–178. [Google Scholar] [CrossRef]
Teferra, B.G.; Perivolaris, A.; Hsiang, W.-N.; Sidharta, C.K.; Rueda, A.; Parkington, K.; Wu, Y.; Soni, A.; Samavi, R.; Jetly, R.; et al. Leveraging large language models for automated depression screening. PLoS Digit. Health 2025, 4, e0000943. [Google Scholar] [CrossRef]
Maji, B.; Swain, M.; Nasreen, S.; Majumdar, D.; Guha, R.; Routray, A.; Søgaard, A. A Study on The Impact of Foundation Models on Automatic Depression Detection from Speech Signals. In Proceedings of Interspeech, Rotterdam, The Netherlands, 17–21 August 2025; Delft University of Technology: Delft, The Netherlands, 2025; pp. 5258–5262. [Google Scholar] [CrossRef]
Tank, C.; Pol, S.; Katoch, V.; Mehta, S.; Anand, A.; Shah, R.R. Depression detection and analysis using large language models on textual and audio-visual modalities. arXiv 2024, arXiv:2407.06125. [Google Scholar] [CrossRef]
Sha, Y.; Pan, H.; Xu, W.; Meng, W.; Luo, G.; Du, X.; Zhai, X.; Tong, H.H.; Shi, C.; Li, K. MDD-LLM: Towards accuracy large language models for major depressive disorder diagnosis. J. Affect. Disord. 2025, 388, 119774. [Google Scholar] [CrossRef] [PubMed]
Anindyaputri, N.; Girsang, A. A comparative study of deep learning models for detecting depressive disorder in tweets. J. Syst. Manag. Sci. 2024, 14, 0318. [Google Scholar] [CrossRef]
Pandey, A.; Mohapatra, S.; Mishra, J.; Sinha, R.K. Novel depression detection technique using Bert on social media. In Proceedings of the 2022 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC), Bhubaneswar, India, 19–20 November 2022; IEEE: New York, NY, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
Wan, Q.; Pan, Y.; Zakeri, S. Analyzing depression in college students using NLP and Transformer Models: Implications for career and Educational Counseling. Brain Behav. 2025, 15, e70828. [Google Scholar] [CrossRef]
Wani, M.A.; Elaffendi, M.; Bours, P.; Imran, A.S.; Hussain, A.; El-Latif, A.A.A. CODES: A Deep learning Framework for Identifying COVID-Caused Depression Symptoms. Cogn. Comput. 2023, 16, 305–325. [Google Scholar] [CrossRef]
Xin, C.; Zakaria, L.Q. Integrating BERT with CNN and BILSTM for explainable detection of depression in social media contents. IEEE Access 2024, 12, 161203–161212. [Google Scholar] [CrossRef]
Zhang, Y.; He, Y.; Rong, L.; Ding, Y. A hybrid model for depression detection with transformer and bi-directional Long Short-Term memory. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; IEEE: New York, NY, USA, 2023; pp. 2727–2734. [Google Scholar] [CrossRef]
Zhou, S.; Mohd, M. Mental Health Safety and Depression Detection in Social Media Text Data: A classification approach based on a deep learning model. IEEE Access 2025, 13, 63284–63297. [Google Scholar] [CrossRef]
Firoz, N.; Berestneva, O.; Aksyonov, S.V. Dual Layer Cogni—Insight Deep-Mood Encoder: A Two-Tiered Approach for Depression Detection. In Proceedings of the 2024 International Russian Smart Industry Conference (SmartIndustryCon), Sochi, Russia, 25–29 March 2024; IEEE: New York, NY, USA, 2024; pp. 928–937. [Google Scholar] [CrossRef]
Karamat, A.; Imran, M.; Yaseen, M.U.; Bukhsh, R.; Aslam, S.; Ashraf, N. A hybrid transformer architecture for multiclass mental illness prediction using social media text. IEEE Access 2024, 13, 12148–12167. [Google Scholar] [CrossRef]
Rawat, N.; Chauhan, S.; Awasthi, L.K. Deep Learning Approaches for Predicting Mental States through Tweet Analysis. In Proceedings of the ICITIIT 2024, Surat, India, 15–16 March 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
Salameh, K.; Suboh, T.; ElAmayreh, R.; Alhijawi, B. Deep linguistic analysis for depression in social media using RoBERTa and CNN. Int. J. Speech Technol. 2025, 28, 825–836. [Google Scholar] [CrossRef]
Yin, F.; Du, J.; Xu, X.; Zhao, L. Depression detection in speech using transformer and parallel convolutional neural networks. Electronics 2023, 12, 328. [Google Scholar] [CrossRef]
Lan, X.; Han, Z.; Cheng, Y.; Sheng, L.; Feng, J.; Gao, C.; Li, Y. Depression Detection on Social Media with Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, Suzhou, China, 4–9 November 2025; ACL: Stroudsburg, PA, USA, 2025; pp. 2155–2171. [Google Scholar] [CrossRef]
Jiang, Z.; Xu, K.; Gao, X.; Cao, Y.; Zhang, Y.; Dong, G.; Chen, Y.; Zhu, X.; Zhang, Q.; Bi, R.; et al. DNet: A depression recognition network combining residual network and vision transformer. BMC Psychiatry 2025, 25, 880. [Google Scholar] [CrossRef]
Hou, J.; Omar, N.; Tiun, S.; Saad, S.; He, Q. TF-BERT: Tensor-based fusion BERT for multimodal sentiment analysis. Neural Netw. 2025, 185, 107222. [Google Scholar] [CrossRef]
Fan, Y.; Zhou, Z.; Zhao, J.; Kong, J.; Liu, Y.; Li, J. A Multimodal Deep Learning Framework for Depression Detection Using Vision Transformers and Large Language Models. In Proceedings of the 2025 5th International Conference on Machine Learning and Intelligent Systems Engineering (MLISE), Guangzhou, China, 13–15 June 2025; IEEE: New York, NY, USA, 2025; pp. 417–420. [Google Scholar] [CrossRef]
Lim, E.; Jhon, M.; Kim, J.-W.; Kim, S.-H.; Kim, S.; Yang, H.-J. A lightweight approach based on cross-modality for depression detection. Comput. Biol. Med. 2025, 186, 109618. [Google Scholar] [CrossRef]
Tahir, W.B.; Khalid, S.; Alshahrani, S.; Alharbi, S.S.; Alhasson, H.F. ContextVecNet: A Context-Driven multimodal learning framework for depression Detection. IEEE J. Biomed. Health Inform. 2025, 1–12. [Google Scholar] [CrossRef]
Esmi, N.; Shahbahrami, A.; Gaydadjiev, G.; De Jonge, P. Multimodal transformer for depression detection based on EEG and interview data. Biomed. Signal Process. Control. 2025, 113, 109039. [Google Scholar] [CrossRef]
Zhang, X.; Liu, H.; Xu, K.; Zhang, Q.; Liu, D.; Ahmed, B.; Epps, J. When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection. arXiv 2024, arXiv:2402.13276. [Google Scholar] [CrossRef]
Tao, Y.; Yang, M.; Li, H.; Wu, Y.; Hu, B. DEPMSTAT: Multimodal Spatio-Temporal Attentional Transformer for Depression Detection. IEEE Trans. Knowl. Data Eng. 2024, 36, 2956–2966. [Google Scholar] [CrossRef]
Zhang, W.; Chen, J.; Zhu, E.; Cheng, W.; Li, Y.; Li, Y.; Wang, Y.J. MLlm-DR: Towards Explainable Depression Recognition with MultiModal Large Language Models. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 111, 1–23. [Google Scholar] [CrossRef]
Jia, X.; Chen, J.; Liu, K.; Wang, Q.; He, J. Multimodal depression detection based on an attention graph convolution and transformer. Math. Biosci. Eng. 2025, 22, 652–676. [Google Scholar] [CrossRef]
Haque, M.R.; Islam, M.M.; Raju, S.M.; Altaheri, H.; Nassar, L.; Karray, F. MDD-Net: Multimodal Depression Detection through Mutual Transformer. arXiv 2025, arXiv:2508.08093. [Google Scholar] [CrossRef]
Mou, L.; Zhen, S.; Mao, S.; Ma, N. Disentangled Representation Learning via Transformer with Graph Attention Fusion for Depression Detection. In Proceedings of the 1st International Workshop on Cognition-Oriented Multimodal Affective and Empathetic Computing, Melbourne, Australia, 26 October 2025; ACM: New York, NY, USA, 2025; pp. 20–29. [Google Scholar] [CrossRef]
Sun, H.; Chen, Y.-W.; Lin, L. TensorFormer: A Tensor-Based multimodal transformer for multimodal sentiment analysis and depression detection. IEEE Trans. Affect. Comput. 2022, 14, 2776–2786. [Google Scholar] [CrossRef]
Song, S.; Jaiswal, S.; Shen, L.; Valstar, M. Spectral Representation of Behaviour Primitives for Depression analysis. IEEE Trans. Affect. Comput. 2020, 13, 829–844. [Google Scholar] [CrossRef]
De Melo, W.C.; Granger, E.; López, M.B. MDN: A Deep Maximization-Differentiation Network for Spatio-Temporal Depression Detection. IEEE Trans. Affect. Comput. 2021, 14, 578–590. [Google Scholar] [CrossRef]
Shen, H.; Song, S.; Gunes, H. Multi-modal Human Behaviour Graph Representation Learning for Automatic Depression Assessment. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; IEEE: New York, NY, USA, 2024; pp. 1–10. [Google Scholar] [CrossRef]
Liang, J.; Cao, P.; Wang, C.; Yang, J.; Wang, F.; Zaiane, O.R. Modeling multimodal depression diagnosis from the perspective of local depressive representation. IEEE Trans. Affect. Comput. 2025, 17, 497–510. [Google Scholar] [CrossRef]
Niu, M.; Zhao, Z.; Tao, J.; Li, Y.; Schuller, B.W. Dual attention and element recalibration networks for automatic depression level prediction. IEEE Trans. Affect. Comput. 2022, 14, 1954–1965. [Google Scholar] [CrossRef]
Niu, M.; Tao, J.; Liu, B.; Huang, J.; Lian, Z. Multimodal spatiotemporal representation for automatic depression level detection. IEEE Trans. Affect. Comput. 2020, 14, 294–307. [Google Scholar] [CrossRef]
Li, W.; Kaur, W.; Wangmei, C. Detecting Emotional State of Depression in Social Media Posts Using Logistic Regression-Recursive Feature Elimination. In Proceedings of the 2024 16th International Conference on Knowledge and System Engineering (KSE), Kuala Lumpur, Malaysia, 5–7 November 2024; IEEE: New York, NY, USA, 2025; pp. 289–296. [Google Scholar] [CrossRef]
Hamzah, S.; Mohd, M.; Zakaria, L.Q. Exploring the Hybrid Neural Network and Attention Mechanism for Classification of Social Bias. In Proceedings of the 2023 15th International Conference on Knowledge and Systems Engineering (KSE), Hanoi, Vietnam, 18–20 October 2023; IEEE: New York, NY, USA, 2023; pp. 1–4. [Google Scholar] [CrossRef]
Shen, S.; Qi, W.; Zeng, J.; Li, S.; Liu, X.; Zhu, X.; Dong, C.; Wang, B.; Shi, Y.; Yao, J.; et al. Passive Sensing for Mental Health Monitoring Using Machine Learning with Wearables and Smartphones: Scoping Review. J. Med. Internet Res. 2025, 27, e77066. [Google Scholar] [CrossRef]

Figure 1. Overall conceptual framework of this review.

Figure 2. PRISMA 2020 flow diagram detailing the systematic study selection process.

Figure 3. Transformer-based vs. alternative deep learning architectures for depression detection. (a) Multidimensional capability profile comparing five architectural paradigms across six qualitative dimensions (scores 1–5; full derivation in Supplementary Table S2; scores reflect relative ordering, not absolute benchmarks). (b) Minimum labeled training sample requirements by architecture.

Figure 4. Cross-paradigm comparison of Transformer architectures for depression detection. (a) Performance–cost trade-off zones for each architectural paradigm. Bubble size is proportional to the number of included studies (n), bars indicate performance and computational cost ranges, and computational cost multipliers are normalized to BERT base and fully derived in Supplementary Table S4. The grey horizontal arrow pointing leftwards indicates the direction of higher computational efficiency. (b) Reported performance across three application scenarios. Bars represent the central tendency of reported values within each task–paradigm grouping, superimposed markers show individual study results differentiated by reported metric type (▲ = F1-score; ● = accuracy; ■ = precision); full derivation of median values is provided in Supplementary Table S3. All performance values aggregate heterogeneous metrics across non-equivalent tasks and datasets. Figures illustrate directional architectural trends and should not be interpreted as equivalent-condition benchmarks.

Table 1. Summary of existing reviews on deep learning and Transformer-based models for depression detection.

Reference	Sample Size	Transformer Coverage	Primary Focus	Key Limitations
[18]	50	Minimal (8%, 4 papers)	Comprehensive ML/DL survey	Basic BERT coverage only; no Transformer mechanisms; missing multimodal Transformers
[19]	16	BERT-focused (13/16 studies)	BERT-based language models	Single generative model (GPT-2); Web of Science only; descriptive synthesis
[20]	86	Limited depth	Evolution from rule-based to DL	Breadth sacrifices depth; minimal theoretical analysis; insufficient multimodal discussion
[21]	401	Brief mention	CNN/RNN/Autoencoder for speech	CNN/RNN dominance (292 studies); minimal Transformer analysis; no attention mechanism coverage
[22]	34	BERT-dominant	LLMs for detection and management	Includes non-detection tasks; BERT overrepresentation; no meta-analysis
[23]	399	Superficial (17%)	NLP across multiple disorders	Multidisorder scope dilutes focus; traditional ML dominance (59%); BERT enumeration only
[24]	39	Brief mention	NLP for screening and assessment	Small sample; excludes IEEE/Scopus; narrative format lacks rigor
[25]	14	Transformer models	Modality comparison	Critically small sample; excessive modality focus neglects architecture; DAIC-WOZ overrepresented
[26]	95	GPT vs. BERT conflated	LLMs across mental health	Conflates generative/discriminative architectures; depression diluted; application-centric
[27]	11	Extremely brief	NLP and ML for social media	Smallest review sample; no systematic methodology

Table 2. Search strategy mapped to PICO framework.

PICO Component	Element	Search Terms/Description
Population (P)	Target condition	“depression” OR “depressive disorder” OR “major depressive disorder”
Intervention (I)	AI model type	“Transformer” OR “BERT” OR “RoBERTa” OR “GPT” OR “LLM” OR “large language model” OR “pre-trained language model” OR “MentalBERT” OR “LLaMA” OR “Whisper” OR “CLIP” OR “Vision Transformer” OR “ViT” OR “ClinicalBERT”
Comparison (C)	Baseline/comparison methods	“machine learning” OR “deep learning” OR “CNN” OR “RNN” OR “LSTM” OR “non-Transformer” OR “conventional” OR “baseline”
Outcome (O)	Detection performance metrics	“detection” OR “classification” OR “screening” OR “severity assessment” OR “accuracy” OR “F1-score” OR “AUC” OR “sensitivity” OR “specificity” OR “RMSE”
Combined Query	Boolean	(Population) AND (Intervention) AND (Comparison) AND (Outcome)
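
The combined query in Table 2 can be assembled mechanically from the four PICO term lists. The following is a minimal sketch, assuming plain quoted terms joined by OR within each component and AND across components; the names PICO_TERMS, or_group, and build_query are illustrative and do not come from the review, and each database may require its own field tags or syntax adjustments.

```python
# Search terms copied from Table 2. The dictionary and helper names below are
# hypothetical, introduced only for this illustration.
PICO_TERMS = {
    "Population": [
        "depression", "depressive disorder", "major depressive disorder",
    ],
    "Intervention": [
        "Transformer", "BERT", "RoBERTa", "GPT", "LLM",
        "large language model", "pre-trained language model", "MentalBERT",
        "LLaMA", "Whisper", "CLIP", "Vision Transformer", "ViT",
        "ClinicalBERT",
    ],
    "Comparison": [
        "machine learning", "deep learning", "CNN", "RNN", "LSTM",
        "non-Transformer", "conventional", "baseline",
    ],
    "Outcome": [
        "detection", "classification", "screening", "severity assessment",
        "accuracy", "F1-score", "AUC", "sensitivity", "specificity", "RMSE",
    ],
}


def or_group(terms):
    """Quote each term and join the terms of one PICO component with OR."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(pico):
    """Join the OR-groups of all PICO components with AND, in dict order."""
    return " AND ".join(or_group(terms) for terms in pico.values())


query = build_query(PICO_TERMS)
print(query)
```

Keeping the term lists in one structure makes it straightforward to regenerate the same Boolean string for each database and to document any per-database adjustments alongside it.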


