The preceding section-by-section examination of individual architectural paradigms revealed performance patterns specific to each approach. This section elevates the analysis to cross-paradigm synthesis, addressing three questions that cannot be answered within any single paradigm: How do Transformers compare with fundamentally different deep learning approaches? What are the Pareto-optimal trade-offs across architectural paradigms? And what systemic challenges must be resolved to enable clinical translation?
5.1. Transformer-Based Architectures Versus Alternative Approaches
Although Transformers have established themselves as the dominant paradigm for detecting depression, alternative deep learning architectures, namely spectral analysis [77], spatiotemporal CNNs [78], graph neural networks [79], attention-based multimodal fusion [80,81], and hybrid CNN-LSTM architectures [82], demonstrate distinct advantages on specific capability dimensions (most notably interpretability) while generally trailing Transformer variants on detection performance and sample efficiency, as illustrated in Figure 3a (scores 1–5; full derivation in Supplementary Table S2).
These alternatives explicitly model depression-specific patterns that Transformers learn only implicitly: spectral methods decompose behavioral signals into frequency domains to detect symptom periodicity [77]; the Maximization–Differentiation Network models facial transitions, achieving an RMSE of 7.55 with only 25M parameters on the AVEC2014 facial dataset [78]; and GNNs outperform late-fusion Transformers by 6% in accuracy on E-DAIC using structural encoding of cross-modal dependencies [79]. These performance figures reflect heterogeneous metrics and datasets (RMSE, where lower is better; accuracy; and relative gain) and are therefore not directly interchangeable. They are presented to illustrate the task-specific strengths of each alternative rather than to assert equivalent-condition superiority over Transformers.
However, as shown in Figure 3b, these architectural innovations generally require 1000–5000 labeled samples and rigid domain-specific engineering. The practical dominance of Transformers stems from two architecture-specific features: massive-scale pre-training that encodes transferable linguistic knowledge, and architectural unification that accommodates diverse input modalities within a single framework. MentalBERT achieves an F1 of 97.3% with approximately 3000 samples, roughly one-half to one-third of the annotation requirement of GNN-based and spectral alternatives in comparable social media text classification settings, although this comparison is restricted to sample efficiency and does not generalize across tasks or datasets.
Generative decoder-only models (DepGPT and GPT-4o) require far fewer samples still, approaching few-shot or zero-shot regimes with minimal task-specific training. Transformers also scale more effectively, with larger models yielding consistent gains, whereas alternatives plateau despite increased depth. It should be noted, however, that some alternative architectures are beginning to adopt pre-training strategies (e.g., graph pre-training for GNNs), which may narrow this advantage over time. Architecture selection must, therefore, balance benchmark precision against deployment constraints: alternatives may be preferred in stable, data-rich environments with well-defined signal processing requirements, while Transformers remain optimal for scenarios characterized by limited labels, cross-domain variability, or missing modalities.
5.2. Comparative Analysis Across Transformer Paradigms
Before proceeding, two methodological qualifications are necessary.
First, the 46 studies synthesized employ heterogeneous evaluation metrics across tasks that are not equivalent: binary social media classification, ordinal severity scoring, continuous regression, and zero-shot symptom assessment. Aggregating these into median performance values or Pareto frontiers is inherently imprecise. The performance comparisons in Figure 4a,b are therefore best understood as illustrative architectural trends and directional cost–benefit signals, not as equivalent-condition benchmarks. Readers seeking granular within-task comparisons should consult Tables 3–9 and the original cited studies.
Second, and critically, the studies report different performance metrics (accuracy, F1-score, precision, and AUC), which are not directly interchangeable: accuracy can be inflated on class-imbalanced datasets, F1-score better captures performance under class imbalance, precision reflects positive predictive value, and AUC measures discrimination independently of the classification threshold. To make this heterogeneity explicit, individual study results in Figure 4b are differentiated by reported metric type (▲ = F1-score; ● = accuracy; ■ = precision). The bar heights represent the central tendency of reported values within each task–paradigm grouping and should be interpreted as directional indicators, not equivalent-condition performance benchmarks.
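The non-interchangeability of accuracy and F1 can be illustrated with a small worked example. The following is a minimal sketch, not taken from any reviewed study: the confusion-matrix counts are hypothetical and describe an imbalanced test set with 10 depressed and 90 non-depressed cases, and the helper name `accuracy_and_f1` is introduced here for illustration only.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1 for the positive class) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical counts (assumption): the classifier finds only 2 of 10 depressed cases.
accuracy, f1 = accuracy_and_f1(tp=2, fp=1, fn=8, tn=89)
print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```

Under these assumed counts, accuracy is 0.91 while F1 is about 0.31, which is why a median that mixes accuracy-based and F1-based results should be read only as a directional signal.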
Note on figure derivation methodology. The qualitative capability scores in Figure 3a (scale 1–5) were assigned through a structured evidence-mapping procedure in which two authors independently rated each architectural paradigm on each of the six dimensions (detection performance, data efficiency, interpretability, computational efficiency, multimodal capability, and clinical readiness) based on explicit quantitative and qualitative evidence drawn from the studies summarized in Tables 3–9. Inter-rater agreement was measured (Cohen’s κ = 0.79), divergent scores were then resolved through consensus discussion, and full per-cell evidence is provided in Supplementary Table S2. Scores reflect directional trends and relative ordering across paradigms, not absolute quantitative benchmarks. The minimum labeled sample thresholds in Figure 3b are drawn directly from specific studies cited in Section 4. The cross-paradigm median performance values in Figure 4b were computed separately within each task category (binary social media classification, clinical interview, and multimodal tasks) using the study-level results reported in Tables 3–9; the full derivation, with per-study values and metric types, is provided in Supplementary Table S3. All computational cost multipliers in Figure 4a are normalized to BERT base and fully derived in Supplementary Table S4.
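For readers unfamiliar with the agreement statistic used in the scoring procedure, the following minimal sketch shows how an unweighted Cohen's κ is computed from two raters' scores. It is illustrative only: the ratings are hypothetical and do not reproduce the authors' data, the function name `cohens_kappa` is introduced here, and whether the review used an unweighted or a weighted κ for its ordinal 1–5 scale is not stated, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal frequencies per category.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 scores for six paradigm-dimension cells (assumption).
scores_rater_1 = [5, 4, 3, 2, 4, 3]
scores_rater_2 = [5, 4, 3, 3, 4, 2]
kappa = cohens_kappa(scores_rater_1, scores_rater_2)
print(f"kappa = {kappa:.3f}")
```

Because κ subtracts the agreement expected by chance, it is lower than the raw proportion of matching scores (4 of 6 in this hypothetical case).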
Systematic analysis of 46 Transformer-based studies reveals distinct Pareto frontiers in terms of computational cost and detection performance, as shown in Figure 4a. Encoder-only architectures (n = 14) define a high-efficiency zone with F1/accuracy scores ranging from 76 to 99% at minimal computational overhead (normalized to BERT at 10^0), yielding the optimal performance-to-resource ratio for standard detection tasks. The single outlier (DepRoBERTa, F1: 58.3%) reflects limited corpus diversity rather than architectural inadequacy. Decoder-only models (n = 7) incur costs one to two orders of magnitude higher than the BERT baseline; while this positions them in the high-cost quadrant, the expenditure is strategically justified by their few-shot generalization capabilities and novel data-processing modalities (e.g., tabular-to-text transformation) in data-scarce environments, effectively trading infrastructure cost for reduced labeling requirements. Hybrid architectures (n = 14) occupy a cost-efficient intermediate zone, demonstrating performance comparable to encoder-only models through complementary neural component integration. Multimodal architectures (n = 11) cluster at higher costs, reflecting the computational demands of cross-modal attention and multistream processing, a trade-off warranted when multiview integration is critical to diagnostic sensitivity.
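The notion of a Pareto frontier over cost and performance can be made concrete with a short sketch. The model names, cost multipliers, and scores below are hypothetical placeholders, not values from Figure 4a or Supplementary Table S4, and `pareto_frontier` is a helper introduced here for illustration. A point is on the frontier if no other point is at least as cheap and at least as accurate while being strictly better on one of the two.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, cost, score) tuples, sorted by cost.

    Lower cost is better; higher score is better.
    """
    frontier = []
    for name, cost, score in points:
        dominated = any(
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append((name, cost, score))
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical (name, cost relative to BERT base, score) tuples (assumption).
candidates = [
    ("encoder_a", 1.0, 0.90),
    ("encoder_e", 1.5, 0.80),
    ("hybrid_b", 3.0, 0.94),
    ("multimodal_c", 8.0, 0.85),
    ("decoder_d", 50.0, 0.96),
]
frontier = pareto_frontier(candidates)
for name, cost, score in frontier:
    print(f"{name}: cost x{cost:g}, score {score:.2f}")
```

In this hypothetical set, `encoder_e` and `multimodal_c` are dominated and drop out, leaving a frontier on which each additional unit of cost buys some gain in score.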
Practical efficacy is heavily context-dependent (Figure 4b). In social media environments, encoder-only models show the highest central tendency across the reviewed binary classification studies (median of reported Acc/F1 values: 97%; n = 7; six of seven studies report accuracy, one reports F1 [31]; full derivation in Supplementary Table S3), followed by hybrid (94%; n = 10; mix of Acc and F1) and decoder-only architectures (92%; n = 1 study, two conditions [44]; both F1). Multimodal approaches show a lower central tendency (77%; n = 4; all F1 [68,71,74,75]). This reflects the limited number of multimodal social media studies and the variability between the inflated ceiling result of ContextVecNet (F1: 96.19% on restricted Twitter data [68]) and the more representative D-Vlog studies (F1: 73–78% [71,74,75]), and it is consistent with the observation that complex cross-modal fusion may introduce noise when textual signals are already discriminative. Because encoder-only values are predominantly accuracy-based, whereas multimodal values are F1-based, direct comparison of their medians should account for this metric difference. The directional ordering nonetheless reflects the pattern evident in individual studies within Tables 3–9.
In clinical interview settings, a striking divergence emerges: hybrid architectures exhibit the highest central tendency (94%; n = 3 studies, 4 data points: DLCDME F1 95% [58]; TCC F1 93.6% and 96.7% [62]; Transformer-BiLSTM precision 74% [56]; full derivation in Supplementary Table S3) based on robust DAIC-WOZ and MODMA results from autoencoder-augmented and advanced encoding configurations, whereas encoder-only (79%; F1: 81% [35] and 76% [43]), decoder-only (80%; F1: 78% [45] and 82.6% [48]), and multimodal (84%; range: F1 67% to Acc 94.17%; n = 5 studies, 6 data points [67,69,70,72,73]) architectures show lower values.
Note that the hybrid clinical interview median aggregates three F1 values and one precision value from three studies, a metric heterogeneity that limits direct comparison with other paradigms. The precision value of 74% [56] reflects a different aspect of detection quality than F1; its inclusion in the median is disclosed here and in Supplementary Table S3. This pattern suggests that neither purely linguistic models nor standard multimodal fusion adequately addresses the full challenges of clinical discourse (social desirability bias, linguistic masking, and structured interview formats), whereas hybrid architectures appear to mitigate these limitations through complementary feature extraction that combines Transformer contextual understanding with specialized local pattern detection [83].
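The clinical interview medians can be reproduced directly from the per-study values quoted in the text. The sketch below uses only those quoted values; the dictionary layout and variable names are introduced here for illustration, and the multimodal group is omitted because its six individual data points are not listed in this section.

```python
from statistics import median

# Per-study values (%) as quoted in the text for the clinical interview setting.
clinical_interview = {
    "hybrid": [95.0, 93.6, 96.7, 74.0],  # three F1 values plus one precision value
    "encoder_only": [81.0, 76.0],         # F1
    "decoder_only": [78.0, 82.6],         # F1
}

medians = {paradigm: median(values) for paradigm, values in clinical_interview.items()}
for paradigm, value in medians.items():
    print(f"{paradigm}: median = {value:.1f}%")

# Sensitivity check: the hybrid median when the single precision value is excluded.
hybrid_f1_only = median([95.0, 93.6, 96.7])
print(f"hybrid (F1 only): median = {hybrid_f1_only:.1f}%")
```

The computed medians (94.3, 78.5, and 80.3) agree with the reported 94%, 79%, and 80% after rounding. Excluding the precision value moves the hybrid median only from 94.3 to 95.0, so the ordering across paradigms does not depend on that one data point.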
For multimodal tasks (studies incorporating audio, speech, or video modalities beyond text), hybrid architectures again lead (94%; median of [56] precision 74%, [58] F1 95%, [62] F1 93.6%/96.7%; all DAIC-WOZ or MODMA datasets; full derivation in Supplementary Table S3), followed by decoder-only (83%; [48] F1 82.6% on DAIC-WOZ speech + text pipeline), with encoder-only (79%; same DAIC-WOZ studies as clinical interview category: [35] F1 81%, [43] F1 76%) and the multimodal paradigm (78%; central tendency of all 11 multimodal studies [66,67,68,69,70,71,72,73,74,75], excluding [76], which reports MAE; mix of Acc and F1; see Supplementary Table S3) showing comparable values. These patterns suggest that targeted architectural augmentation consistently shows stronger results than both standalone linguistic models and complex fusion approaches in multimodal contexts within the reviewed studies.
These patterns should be interpreted as task-contextualized directional trends rather than definitive rankings: because studies employ non-equivalent tasks, datasets, and metrics, no single paradigm can be declared universally superior. Taken together, and with these qualifications in mind, the reviewed evidence indicates that encoder-only models show consistently strong results for population-level text-based social media screening, hybrid architectures demonstrate the most robust individual results in clinical interview and multimodal task settings across the reviewed studies, and multimodal designs—despite their theoretical appeal—require further standardized evaluation before their clinical diagnostic potential can be fully assessed.
5.3. Challenges and Limitations
While Section 4 outlined paradigm-specific limitations, four systemic challenges transcend all architectural boundaries and collectively impede clinical translation. Rather than re-enumerating individual study limitations, this section summarizes these cross-cutting barriers and their interactions.
Ground truth quality and evaluation heterogeneity. The most fundamental challenge is that about 85% of reviewed studies use self-reported screening scores (e.g., PHQ-9, BDI) as ground truth, instead of gold-standard psychiatric interviews (SCID and MINI). This reliance on self-report does not simply reflect a data quality concern—it systematically biases model learning toward subjective self-assessment patterns rather than clinically validated diagnostic criteria. Consequently, reported performance metrics may indicate high consistency with self-reported symptoms (convergent validity) as opposed to true diagnostic accuracy. Compounding this, artificially balanced datasets do not represent real-world depression prevalence: models trained under balanced-class assumptions may yield substantially elevated false-positive rates when deployed in realistic clinical settings, risking a volume of unconfirmed positive screens that would overwhelm referral systems. The lack of standardized evaluation protocols creates considerable performance variance for identical models across configurations, making meaningful cross-study comparison difficult.
Furthermore, regarding study generalizability, a substantial proportion of the evidence in this systematic review derives from social media text corpora and unipolar depression benchmarks that bear limited resemblance to the clinical contexts in which SMI assessments must ultimately function. The linguistic register, self-disclosure patterns, and symptom expression in Twitter or Reddit posts differ fundamentally from those encountered in structured clinical interviews with patients experiencing MDD or comorbid psychiatric conditions. This gap between the evidence base and the target clinical context represents a major translational risk that future benchmark design must address directly.
Mechanistic opacity and clinical interpretability. Current explainability approaches focused on attention weight visualization [84] provide only surface-level insight. While attention heatmaps can indicate which tokens or modalities contribute to a prediction, they do not reveal the diagnostic reasoning chain: whether the model is detecting real patterns in DSM-5 symptoms, leveraging superficial lexical correlates, or exploiting dataset-specific artifacts. This interpretability deficit has concrete clinical consequences: clinicians cannot verify model reasoning against their own diagnostic judgment, patients cannot receive meaningful explanations of screening results, and regulatory bodies lack the transparency required for medical device approval. The challenge is especially acute for multimodal models, where decision-making is distributed across modality-specific encoders and cross-modal attention layers.
Geographic, linguistic, and demographic bias. The concentration of research on English-language, Western social media data (93% of encoder-only studies; 78% from North American samples) creates a compounding bias problem. Depression is expressed through culturally mediated linguistic patterns: metaphorical expressions, somatic idioms, and disclosure norms vary substantially across cultures. Models trained predominantly on English-language data demonstrate approximately 7–11% performance degradation in zero-shot cross-lingual transfer within the reviewed studies. For example, DepGPT shows an approximately 11% F1 drop when applied to Bengali versus English data in zero-shot settings [44], and Whisper + GPT-2 shows an approximately 7% F1 drop on Indic-Bengali relative to DAIC-WOZ [48]. Broader cross-lingual performance gaps are anticipated under conditions of greater linguistic and script distance, perpetuating the very healthcare disparities that computational screening tools aim to mitigate. This bias interacts with the ground truth problem: PHQ-9 and BDI, while widely translated, may not capture culturally specific symptom presentations, meaning that even translated models may be optimizing for a culturally biased diagnostic target.
Computational and infrastructural constraints. Multimodal Transformers require specialized equipment and substantial computational resources for both training and inference. These requirements present a fundamental tension: the populations most in need of automated screening (resource-constrained communities and low- and middle-income countries) are least able to deploy the most capable architectures. Furthermore, critical human factors—clinician cognitive burden, trust calibration, and patient acceptance—remain underexplored, despite being prerequisites for successful clinical workflow integration.
This systematic review also has methodological limitations: a narrative synthesis was adopted in place of a quantitative meta-analysis because of cross-study heterogeneity, and eligibility was restricted to English-language publications from 2020 to 2025.
5.4. Future Directions
Transitioning Transformer-based depression detection to clinically validated instruments requires strategic realignment across five domains, each targeting one or more of the challenges identified above.
Standardized clinical validation. Addressing the ground truth deficit requires convergence on evaluation frameworks based on criterion-standard psychiatric interviews, supplanting reliance on self-report measures. Benchmark datasets must incorporate stratification in depression subtypes and demographics. To mitigate the burden of false positives arising from balanced-data training, evaluation protocols must mandate prevalence-adjusted test sets reflecting epidemiological base rates of 5–20%, and expanded metrics must incorporate positive predictive value, calibration error, and net benefit analyses.
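The effect of prevalence on positive predictive value follows from Bayes' rule and can be shown with a brief calculation. In the sketch below, the sensitivity and specificity of 0.90 are hypothetical values chosen for illustration, not results from any reviewed study; the prevalences of 5% and 20% are the bounds of the base-rate range stated above, and 50% represents an artificially balanced test set.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(condition present | positive screen), via Bayes' rule."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1.0 - specificity) * (1.0 - prevalence)
    positives = true_positive_rate + false_positive_rate
    if positives == 0.0:
        raise ValueError("no positive screens are possible with these inputs")
    return true_positive_rate / positives


# Hypothetical screening model (assumption): sensitivity = specificity = 0.90.
ppv_by_prevalence = {
    prevalence: positive_predictive_value(0.90, 0.90, prevalence)
    for prevalence in (0.50, 0.20, 0.05)
}
for prevalence, ppv in ppv_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}")
```

With these assumed operating characteristics, PPV falls from 0.90 on a balanced test set to about 0.69 at 20% prevalence and about 0.32 at 5% prevalence, meaning that roughly two of every three positive screens would be false positives at the lower base rate.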
Interpretable and uncertainty-aware architectures. Remediating mechanistic opacity requires innovations extending beyond attention visualization. Hierarchical Transformer architectures with explicit temporal modeling can enable the capture of longitudinal symptom trajectories, differentiating first-episode major depression, recurrent depression, and chronic dysthymia. Integrating uncertainty quantification with Bayesian formulations or conformal prediction provides calibrated confidence intervals, enabling clinicians to distinguish high-confidence predictions from ambiguous cases warranting additional scrutiny. Concept-based explanation methods that align model representations with DSM-5 symptom domains would bridge the gap between computational outputs and clinical reasoning. Robust multimodal fusion with conditional computation and dynamic weighting can ensure performance under data degradation while managing computational constraints.
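As one concrete illustration of uncertainty quantification, the following is a minimal sketch of split conformal prediction for a binary depression classifier. It is not drawn from any reviewed study: the calibration probabilities are hypothetical, the function names are introduced here, and the nonconformity score (one minus the probability the model assigns to a label) is one common choice among several. Under the usual exchangeability assumption, the resulting prediction sets contain the true label with probability of at least 1 − α.

```python
import math


def conformal_threshold(probs_for_true_label, alpha):
    """Calibrate a score threshold from held-out examples with known labels.

    probs_for_true_label: model probability assigned to each example's true label.
    alpha: tolerated miscoverage rate (e.g., 0.2 targets 80% coverage).
    """
    n = len(probs_for_true_label)
    scores = sorted(1.0 - p for p in probs_for_true_label)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected quantile rank
    if rank > n:
        return float("inf")  # too few calibration points: every label is retained
    return scores[rank - 1]


def prediction_set(prob_depressed, threshold):
    """Return every label whose nonconformity score is within the threshold."""
    probs = {"depressed": prob_depressed, "not_depressed": 1.0 - prob_depressed}
    return {label for label, p in probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration probabilities for the true label (assumption).
calibration = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.45, 0.40, 0.30]
threshold = conformal_threshold(calibration, alpha=0.2)

confident_case = prediction_set(0.90, threshold)   # a single label is retained
ambiguous_case = prediction_set(0.50, threshold)   # both labels are retained
print(confident_case, ambiguous_case)
```

A prediction set containing both labels is the signal that a case is ambiguous and may warrant additional clinical scrutiny, whereas a single-label set corresponds to a prediction the model can make at the requested coverage level.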
Equity-centered development. Mitigating geographic and linguistic representation deficits requires proactive strategies, including adversarial debiasing, demographically stratified loss reweighting, and performance parity constraints to minimize disparities across subgroups. Purposefully constructed datasets must ensure sufficient statistical power across cross-cultural, multilingual, and intersectional demographic strata through active recruitment in historically underrepresented communities. Cross-lingual transfer learning—exploitation of multilingual pre-trained models with culture-aware adaptation—offers a scalable path toward equitable global coverage.
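Demographically stratified loss reweighting can be sketched in a few lines. The version below uses inverse-frequency group weights, which is one simple scheme assumed here for illustration; the group labels and per-sample losses are hypothetical, and the function names are not taken from any library or reviewed study.

```python
from collections import Counter


def group_weights(groups):
    """Inverse-frequency weights, scaled so that the weights sum to len(groups)."""
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return {g: n_samples / (n_groups * count) for g, count in counts.items()}


def reweighted_mean_loss(losses, groups):
    """Mean loss in which each demographic group contributes equally."""
    if len(losses) != len(groups) or not losses:
        raise ValueError("losses and groups must be non-empty and of equal length")
    weights = group_weights(groups)
    return sum(weights[g] * loss for loss, g in zip(losses, groups)) / len(losses)


# Hypothetical batch (assumption): group "b" is underrepresented and poorly fit.
batch_groups = ["a", "a", "a", "b"]
batch_losses = [0.2, 0.2, 0.2, 1.0]

plain_mean = sum(batch_losses) / len(batch_losses)
balanced_mean = reweighted_mean_loss(batch_losses, batch_groups)
print(f"unweighted = {plain_mean:.2f}, reweighted = {balanced_mean:.2f}")
```

In this hypothetical batch the unweighted mean loss is 0.40, whereas the reweighted loss is 0.60, equal to the average of the two per-group mean losses; the optimizer therefore receives a stronger signal from the underrepresented group than its sample share alone would provide.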
AI analytics, wearable data, and smart applications for detection stability. The integration of AI analytics with wearable sensors and mobile health applications represents a critical frontier for improving depression detection stability and reducing the translation risk associated with snapshot-based clinical assessments. Systematic evidence confirms that wearable AI demonstrates meaningful accuracy for detecting and predicting depression across diverse populations [9] and that passive, non-intrusive multimodal sensing approaches more comprehensively capture natural behaviors than controlled or single-session data collection paradigms [10]. Current Transformer-based models rely predominantly on single-session text or audio inputs, which are inherently susceptible to momentary fluctuations and self-presentation biases. Passive sensing modalities provide continuous, objective proxies for psychomotor retardation, sleep disruption, and social withdrawal [8], capturing symptom dynamics across naturalistic daily contexts rather than isolated clinical encounters [85]. Multimodal fusion architectures that integrate such longitudinal physiological streams with text-based Transformer features offer a pathway to richer, temporally grounded representations of depression severity. By anchoring model inputs in objectively measured behavioral signals, this approach simultaneously reduces reliance on single-session self-report measures and mitigates the cross-domain generalizability gap between social media corpora and real-world clinical populations, two of the most significant barriers to clinical translation identified in this systematic review.
Prospective pragmatic validation. Addressing deployment feasibility requires randomized controlled trials comparing AI-augmented workflows against standard-of-care baselines. Trials must employ patient-centered outcome measures, including diagnostic accuracy, time-to-diagnosis, cost-effectiveness, and critically, clinician cognitive workload and patient satisfaction. Embedded pilot implementations across resource-constrained primary care facilities and community mental health centers serving socioeconomically disadvantaged populations are crucial for identifying context-specific barriers and generating implementation science evidence that maximizes equitable benefit.