1. Introduction
Depression is one of the most prevalent mental health disorders, affecting over 380 million people worldwide according to the World Health Organization (WHO) [1]. In China alone, more than 54 million individuals suffer from depression, yet only 30% receive effective treatment due to diagnostic delays, insufficient medical resources, and social stigma. Traditional clinical diagnosis relies primarily on face-to-face interviews and psychological assessments (e.g., the PHQ-9 [2] and HAMD scales), which suffer from high costs, limited accessibility, and delayed intervention: patients typically seek help only after symptoms have significantly worsened, missing critical early intervention windows.
Social media platforms have emerged as valuable data sources for mental health monitoring due to their openness, real-time nature, and user spontaneity. Weibo, China's largest microblogging platform, has 586 million monthly active users [3] who publish over 120 million posts every day, naturally sharing emotional states, daily activities, and psychological struggles. These textual expressions contain early signals of depression: sustained negative emotions, self-deprecating language, and cognitive biases. If effectively mined, such information could enable low-cost, wide-coverage mental health screening that complements traditional clinical methods [4].
However, social media-based depression detection faces several critical challenges. First, linguistic informality and semantic implicitness: Chinese Weibo texts contain abundant internet slang, emojis, and topic hashtags that traditional rule-based sentiment dictionaries struggle to interpret [5]. Depression is often expressed through contradictory statements (e.g., “I’m fine, just don’t want to live”) or metaphors (e.g., “my heart feels hollowed out”) that require deep semantic understanding. Second, a lack of theoretical grounding: existing data-driven approaches experiment with assorted features without systematically mapping them to clinical diagnostic criteria (DSM-5), leading to models that may achieve statistical performance but lack clinical validity and interpretability [6]. Third, limited exploitation of multi-modal information: most studies focus solely on textual content, ignoring behavioral patterns (posting times, interaction frequency) and topic distributions that reflect physiological and cognitive symptoms. Fourth, ethical and privacy considerations: data collection and model deployment must balance public health value with user privacy protection, yet current research gives insufficient attention to compliance constraints and potential diagnostic liability [7].
In the era of large language models (LLMs) [8], while general-purpose models like GPT-4 demonstrate impressive capabilities across diverse tasks, they face specific limitations for clinical applications in mental health. LLMs lack systematic integration of domain expertise (DSM-5 criteria [9]), cannot analyze behavioral patterns beyond text, incur high API costs for large-scale screening (USD 30–45 per 1000 inferences), require uploading sensitive user data to external servers (privacy risks), and offer limited interpretability for clinical decision-making [10]. These limitations motivate the development of domain-specialized models that explicitly incorporate clinical knowledge, leverage multi-modal information, and support local deployment.
This study addresses these challenges through a comprehensive methodology integrating clinical psychology (DSM-5 diagnostic criteria), multi-modal data analysis (text, behavior, and topic), and advanced deep learning techniques (BERT, hierarchical attention [11], and multi-task learning). Our contributions are fourfold:
We propose a DSM-5-guided feature engineering methodology that systematically maps clinical symptom dimensions (emotional, physiological, and cognitive) to computable social media features. This theory-driven approach ensures clinical validity and interpretability, distinguishing our work from purely data-driven methods.
We design a hierarchical attention mechanism that models depression at both character and post levels, automatically identifying key linguistic patterns within individual posts and critical posts across a user’s timeline. This provides interpretability through attention weight visualization while improving performance.
We develop a multi-task learning framework that jointly optimizes depression classification, DSM-5 symptom dimension recognition (nine symptoms), and severity assessment (four levels). This not only improves main task performance (F1 = 91.8% vs. 89.6% single-task) but also provides fine-grained clinical outputs beyond binary classification.
We conduct comprehensive comparisons with state-of-the-art methods, including traditional ML, deep learning baselines, and large language models (GPT-4, Claude-3), demonstrating that our domain-specialized approach achieves superior performance (91.8% F1 vs. 86.9% GPT-4 few-shot) while offering significant advantages in cost (~3000× cheaper), speed (40–100× faster), privacy protection (local deployment), and interpretability.
Experimental results on a large-scale dataset (WU3D [12]: 32,570 users, 2.19 million posts) validate the effectiveness of each component through ablation studies and demonstrate strong performance across multiple metrics. Our model’s low inference cost (~USD 0.00001 per sample vs. GPT-4’s USD 30–45 per 1 K), fast speed (50 ms vs. 2–5 s), and local deployability make it well-suited for real-world mental health screening systems capable of processing 20,000 users per hour on a single GPU.
As shown in Table 1, which compares state-of-the-art depression detection methods on social media, depression detection methods have evolved through four distinct generations, each demonstrating progressive improvements in performance and sophistication. Traditional machine learning approaches (2013–2017) pioneered the field by leveraging manual feature engineering with tools such as LIWC and behavioral pattern analysis, achieving F1-scores of 70–72% [13,14]. De Choudhury et al. [14] conducted the first large-scale study using Twitter data with SVM classifiers, while Coppersmith et al. [13] established the widely used CLPsych2015 benchmark dataset. However, these methods required extensive domain expertise for feature design and were often platform-specific. Deep learning methods (2017–2020) introduced automatic feature learning through neural architectures such as CNNs and LSTMs, improving performance to 78–84% [15,16]. Trotzek et al. [16] demonstrated the effectiveness of LSTM networks with linguistic metadata for sequential modeling, while Tadesse et al. [15] applied CNN-based approaches to Reddit data. These methods eliminated the need for manual feature engineering but struggled with gradient issues and limited context windows. Pre-trained language models (2020–2023) revolutionized the field through contextual embeddings and transfer learning, achieving F1-scores of 85.6–90.4% [17,18,19]. Domain-general models like BERT [17] and RoBERTa [19] provided rich semantic representations, while MentalBERT [18], pre-trained on 13.6 million mental health-related sentences from Reddit, achieved 88.6% by capturing domain-specific linguistic patterns. The current state-of-the-art baseline, BERT + LSTM + Attention, achieved a 90.4% F1-score on the WU3D dataset by combining contextual embeddings with sequential modeling and attention mechanisms. Large language models (2023–2024) represent the latest generation, with models like GPT-4 [6] and Mental-LLM [20] demonstrating impressive reasoning and interpretability capabilities. However, their zero-shot and few-shot performance (82–87% F1) falls short of specialized fine-tuned models. Moreover, LLMs face significant practical deployment barriers: (1) high cost: GPT-4 costs approximately USD 0.03 per user compared to our model’s USD 0.0002, a 150× difference; (2) privacy concerns: external API dependency requires transmission of sensitive mental health data; and (3) lack of clinical interpretability: no symptom-level assessment or DSM-5 grounding. Our DSM-5-guided multi-task learning approach achieves a 91.8% F1-score, a +1.4 percentage point improvement over the previous best baseline and +4.9 to +9.8 points over GPT-4. Crucially, our method is the only approach that explicitly integrates DSM-5 clinical diagnostic criteria with multi-task learning, enabling both superior performance and clinical interpretability through symptom-level predictions across nine DSM-5 depression symptoms.
The remainder of this paper is organized as follows. Section 2 describes our DSM-5-guided feature engineering methodology and model architecture. Section 3 presents comprehensive experimental results, including comparisons with baselines, ablation studies, and analysis of LLM performance. Section 4 concludes with a discussion of limitations and future directions.
3. Results and Discussion
3.1. Overall Performance
Table 3 compares our complete model against six baseline methods on the WU3D test set. Our model achieves state-of-the-art results across all metrics: 91.2% accuracy, 89.7% precision, 92.4% recall, and 91.8% F1-score.
Compared to the baselines, the improvements are substantial: vs. SVM (F1 = 72.1%), +19.7 pp, demonstrating deep learning’s superiority over traditional methods; vs. TextCNN (F1 = 78.6%), +13.2 pp, showing that pre-trained language models significantly outperform CNN-based methods; vs. BERT-Base (F1 = 85.6%), +6.2 pp, validating our architectural innovations; vs. BERT + LSTM (F1 = 87.9%), +3.9 pp, highlighting the contributions of hierarchical attention and multi-task learning; and vs. BERT + LSTM + Attention (F1 = 90.4%), +1.4 pp, confirming synergistic effects (Figure 3).
Notably, our model achieves 92.4% recall, which is crucial for mental health screening applications. This high recall means the model successfully identifies 92.4% of users with depression, minimizing the risk of missing individuals who are in need of help.
3.2. Ablation Studies
To understand each component’s contribution, we conducted systematic ablation experiments (Table 4).
Hierarchical Attention: Replacing hierarchical attention with single-level attention decreases F1 by 1.4 pp to 90.4%. This validates that jointly modeling character-level and post-level importance yields better representations than single-level attention (Figure 4).
Multi-Task Learning: Removing multi-task learning (keeping only the main task) decreases F1 by 2.2 pp to 89.6%. Further analysis reveals the symptom recognition task contributes 1.3 pp, and severity classification contributes 0.6 pp. Auxiliary tasks provide fine-grained supervision signals that help the shared encoder to learn more discriminative features.
Bi-LSTM: Removing Bi-LSTM layers (directly connecting BERT to attention) causes the largest drop of 3.1 pp to 88.7%, emphasizing the critical role of temporal modeling in capturing emotional evolution and behavioral patterns.
Multi-Modal Features: We tested different feature combinations: Text only (F1 = 89.2%), Text + Behavior (F1 = 90.2%, +1.0 pp), Text + Topic (F1 = 89.8%, +0.6 pp), and Text + Behavior + Topic (F1 = 91.8%, +2.6 pp). Results demonstrate that behavioral and topic features provide complementary information to textual semantics.
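In practice, this multi-modal combination amounts to concatenating per-user feature vectors before classification. The sketch below illustrates the idea; the dimensions and the specific behavioral/topic statistics are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def fuse_features(text_vec, behavior_vec, topic_vec):
    # Concatenate modality vectors into one user representation
    # consumed by the downstream classifier.
    return np.concatenate([text_vec, behavior_vec, topic_vec])

text_vec = np.zeros(768)                    # BERT-style text embedding
behavior_vec = np.array([0.42, 0.18, 3.5])  # e.g., night-post ratio, reply rate, posts/day
topic_vec = np.array([0.6, 0.1, 0.3])       # e.g., topic-model distribution over 3 topics

fused = fuse_features(text_vec, behavior_vec, topic_vec)
print(fused.shape)  # (774,)
```

Concatenation keeps each modality's contribution inspectable, which matters for the clinical-interpretability goals discussed above; more elaborate fusion (gating, cross-attention) is possible but not assumed here.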
Importantly, the combined contribution of all three innovations (6.2 pp) is close to the sum of their individual ablation drops (3.1 + 1.4 + 2.2 = 6.7 pp), and the components interact rather than acting in isolation: better temporal representations from the Bi-LSTM enable more effective attention focusing; hierarchical attention extracts finer features, improving multi-task prediction accuracy; and multi-task learning provides additional supervision, helping both the Bi-LSTM and the attention layers learn more discriminative representations.
As shown in Table 5, which compares LSTM variants for depression detection, Bi-LSTM achieves the best performance (91.8% F1-score) among all LSTM variants. While Economic LSTM [24] reduces parameters by 46% and inference time by 43%, it suffers a 2.9 percentage point drop in F1-score. The bidirectional architecture’s ability to capture context from both directions proves essential for identifying subtle linguistic patterns in depression-related posts, justifying the modest computational overhead.
3.3. Transfer Learning Effectiveness
Table 6 demonstrates the impact of our two-stage transfer learning strategy. Without additional pre-training on sentiment datasets (using only Google’s BERT-Base Chinese weights), the model achieves F1 = 83.2%. Pre-training on general sentiment datasets (SST-2, IMDB) improves F1 to 87.1% (+3.9 pp), demonstrating that general emotional understanding helps depression detection. Adding domain-relevant pre-training data (Douban movie reviews, similar to social media language) further boosts F1 to 88.5% (+1.4 pp).
The full model with architectural innovations (Bi-LSTM, hierarchical attention, and multi-task learning), on top of this pre-training, achieves F1 = 91.8%, showing that transfer learning and architectural design are complementary: transfer learning provides better initialization and richer prior knowledge, while our architecture specifically addresses depression detection characteristics.
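The staged gains reported above can be tallied directly. This small calculation simply re-derives the per-stage improvements from the F1 values quoted from Table 6; it adds no new numbers.

```python
# F1 (%) after each cumulative stage of the two-stage transfer-learning
# pipeline plus the full architecture, as reported in Table 6 and the text.
stages = [
    ("BERT-Base Chinese weights only", 83.2),
    ("+ general sentiment pre-training (SST-2, IMDB)", 87.1),
    ("+ domain-relevant pre-training (Douban reviews)", 88.5),
    ("+ full architecture (Bi-LSTM, hierarchical attention, MTL)", 91.8),
]
gains = [round(b[1] - a[1], 1) for a, b in zip(stages, stages[1:])]
print(gains)  # [3.9, 1.4, 3.3]
```

The total improvement over the vanilla initialization is 8.6 pp, of which 5.3 pp comes from the two pre-training stages and 3.3 pp from the architectural innovations, consistent with the complementarity claim above.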
3.4. Comparison with Large Language Models
We conducted a comprehensive comparison with state-of-the-art large language models (LLMs) [7] using the same test set. For a fair comparison, we designed structured prompts instructing the LLMs to act as mental health experts and judge depression tendency based on user posts, providing only binary answers with a brief rationale. Due to high API costs, we evaluated on a random, balanced subset of 500 test users (250 depressed, 250 control). Results are shown in Table 7.
As shown in Figure 5, our model significantly outperforms all LLMs: vs. the best LLM (GPT-4 few-shot, F1 = 86.9%), +4.9 pp, demonstrating that domain-specific methods can surpass general-purpose models; vs. Claude-3 Sonnet (F1 = 85.7%), +6.1 pp; and vs. GPT-3.5 few-shot (F1 = 82.1%), +9.7 pp.
More importantly, our model offers decisive advantages in practical deployment:
Cost: Our model’s inference cost (~USD 0.00001 per sample, mainly server electricity) is 3000–4500× lower than GPT-4’s (USD 30–45 per 1 K inferences, i.e., USD 0.03–0.045 per user). For screening 100 K users, our model costs < USD 1, while GPT-4 costs USD 3000–4500.
Speed: Our model requires ~50 ms per inference on CPU, while GPT-4 API calls take 2–5 s, making our model 40–100× faster and more suitable for real-time applications.
Privacy: Our model can be deployed locally without uploading sensitive mental health data to third-party servers, whereas LLM APIs require the external transmission of user data, raising privacy concerns.
Interpretability: Our hierarchical attention mechanism provides visualizable explanations, while LLM reasoning is largely opaque.
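The cost and speed ratios above follow from simple arithmetic. The check below takes our model at ~USD 0.00001 and 50 ms per inference, and GPT-4 at USD 0.03–0.045 and 2–5 s per request, i.e., the figures quoted in this section.

```python
users = 100_000
our_cost, gpt4_cost = 0.00001, (0.03, 0.045)  # USD per inference
our_ms, gpt4_ms = 50, (2000, 5000)            # latency per inference (ms)

our_total = users * our_cost                            # ~1 USD for 100 K users
gpt4_total = [users * c for c in gpt4_cost]             # 3000-4500 USD
cost_ratio = [round(c / our_cost) for c in gpt4_cost]   # 3000-4500x cheaper
speedup = [t // our_ms for t in gpt4_ms]                # 40-100x faster
print(round(our_total), gpt4_total, cost_ratio, speedup)
```

These ratios assume per-request API pricing stays constant at scale; batching or discounted API tiers would change the absolute numbers but not the orders of magnitude.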
Analysis of LLM Limitations: Our case study revealed three key weaknesses: (1) Lack of behavioral analysis—LLMs primarily analyze textual content and cannot utilize posting timestamps, interaction patterns, or temporal trends. Our model identified 87.3% of “masked depression” cases (text appears positive, but behavioral patterns are abnormal), while GPT-4 only identified 61.2%. (2) Insufficient DSM-5 grounding—LLMs may judge based on surface-level emotional words rather than systematic DSM-5 criteria (symptom persistence, severity, and multi-dimensional manifestation). Our explicit DSM-5 feature design ensures clinical validity. (3) Limited context window—LLMs have context length limitations (GPT-4: 32 K tokens, ~200 posts). For users with hundreds of posts, LLMs must truncate input, potentially missing critical information. Our hierarchical attention can flexibly process an arbitrary number of posts.
These results demonstrate that while LLMs are powerful general-purpose tools, domain-specialized models with explicit expert knowledge integration remain valuable and are often superior for specific professional tasks.
3.5. Attention Mechanism Analysis
To validate the effectiveness of our hierarchical attention and provide interpretability, we compared five attention strategies on the test set (Table 8).
Results show hierarchical attention outperforms all alternatives. Compared to average pooling (treating all content equally), our approach gains 4.3 pp, highlighting the importance of selective focus. Single-level attention strategies (character-only or post-only) cannot simultaneously identify keywords and key posts, resulting in 2.9–3.6 pp performance gaps. Hierarchical attention models the importance at both granularities, achieving optimal performance.
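The two-level pooling just described can be sketched compactly. The query-vector scoring below is one standard formulation of attention pooling and stands in for whatever parameterization the trained model actually uses; dimensions and random values are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(vectors, query):
    # Score each vector against a (learned, here random) query vector,
    # normalize with softmax, and return the weighted sum plus the weights.
    weights = softmax(vectors @ query)
    return weights @ vectors, weights

rng = np.random.default_rng(0)
d = 8
char_query = rng.standard_normal(d)   # character-level attention query
post_query = rng.standard_normal(d)   # post-level attention query

# Character level: pool token vectors within each post into one post vector.
posts = [rng.standard_normal((n, d)) for n in (5, 7, 3)]  # 3 posts, varying length
post_vecs = np.stack([attention_pool(tokens, char_query)[0] for tokens in posts])

# Post level: pool post vectors across the timeline into one user vector.
user_vec, post_weights = attention_pool(post_vecs, post_query)
print(user_vec.shape, post_weights.round(2))
```

Because the same mechanism runs at both granularities, the intermediate `post_weights` (and the analogous character-level weights) are exactly the quantities visualized in the attention analysis that follows.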
Attention Weight Visualization: We visualized attention weights for a sample of depressed users. At the post level, the model assigns high weights (α = 0.35) to posts explicitly expressing depression symptoms (e.g., “Another sleepless night, can’t fall asleep”), moderate weights (α = 0.28) to posts with negative emotions, and low weights (α = 0.04) to neutral daily records. At the character level, the model highlights symptom-related keywords (“insomnia”, “meaningless”, “hopeless”, and “suicide”) with high weights (Figure 6).
This visualization demonstrates that our model focuses on clinically relevant content, providing transparency for human verification and building trust for clinical deployment.
3.6. Multi-Task Learning Analysis
3.6.1. Effect of Auxiliary Tasks
Table 9 compares single-task and multi-task configurations. Adding multi-task learning improves the main task (depression classification) from F1 = 89.6% to F1 = 91.8%, a gain of 2.2 pp. This confirms that auxiliary tasks provide valuable supervision signals, helping the shared encoder learn more discriminative representations.
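The joint objective behind these numbers can be sketched as a weighted sum of the three task losses. The task weights and the toy probabilities below are illustrative assumptions; the text does not specify the actual weighting.

```python
import numpy as np

def cross_entropy(probs, label):
    # Negative log-likelihood of the correct class.
    return -np.log(probs[label])

def multitask_loss(main_probs, main_y,
                   symptom_probs, symptom_y,
                   severity_probs, severity_y,
                   w=(1.0, 0.5, 0.5)):  # assumed task weights
    l_main = cross_entropy(main_probs, main_y)          # depressed vs. control
    # Mean binary cross-entropy over the nine DSM-5 symptom predictions.
    l_symp = float(np.mean([-(y * np.log(p) + (1 - y) * np.log(1 - p))
                            for p, y in zip(symptom_probs, symptom_y)]))
    l_sev = cross_entropy(severity_probs, severity_y)   # four severity levels
    return w[0] * l_main + w[1] * l_symp + w[2] * l_sev

loss = multitask_loss(
    main_probs=np.array([0.1, 0.9]), main_y=1,
    symptom_probs=np.full(9, 0.7), symptom_y=np.ones(9, dtype=int),
    severity_probs=np.array([0.05, 0.15, 0.6, 0.2]), severity_y=2,
)
print(round(float(loss), 3))
```

All three losses backpropagate into the shared encoder, which is the mechanism by which the auxiliary tasks supply the extra supervision discussed above.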
Symptom Recognition Performance: Our model achieves a macro-averaged F1 of 84.8% across all nine DSM-5 symptom dimensions. The easiest symptom to detect is suicidal ideation (F1 = 90.8%), as users often express it explicitly (“don’t want to live”, “want to die”). Sleep disturbance (F1 = 87.3%) and self-blame (F1 = 85.5%) are also well-recognized due to clear linguistic markers (“insomnia”, “sleepless”, “it’s my fault”, and “I’m useless”).
The most challenging symptoms are psychomotor changes (F1 = 82.0%) and appetite changes (F1 = 82.9%), as these are expressed more implicitly and inconsistently. Psychomotor changes include both agitation and retardation, described with varied language (“restless”, “sluggish”, and “can’t sit still”). Appetite changes can involve increases or decreases, and users may not directly mention them (Figure 7).
Notably, all symptoms achieve F1 ≥ 82%, demonstrating that multi-task learning successfully enables fine-grained symptom recognition without compromising main task accuracy (91.8%).
Severity Classification: Our model classifies depression severity into four levels (none, mild, moderate, and severe) with 78.6% accuracy. The confusion matrix reveals that most errors occur between adjacent severity levels (mild vs. moderate), which is clinically acceptable. Distinguishing none vs. severe achieves 94.3% accuracy, indicating that the model reliably identifies high-risk cases.
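The "adjacent-level errors are clinically acceptable" observation corresponds to a within-one-level accuracy. The confusion matrix below is a toy illustration (not the paper's actual matrix) showing how the two accuracies are computed.

```python
import numpy as np

# Toy 4x4 confusion matrix over (none, mild, moderate, severe);
# rows are true levels, columns are predictions. Values are illustrative.
cm = np.array([[90,  8,  2,  0],
               [ 7, 70, 20,  3],
               [ 2, 18, 72,  8],
               [ 0,  2,  9, 89]])

exact_acc = np.trace(cm) / cm.sum()
# Count predictions within one severity level of the truth as acceptable.
within_one = sum(cm[i, j] for i in range(4) for j in range(4)
                 if abs(i - j) <= 1) / cm.sum()
print(float(exact_acc), float(within_one))
```

For this toy matrix, exact accuracy is 80.25% while within-one-level accuracy is 97.75%, mirroring how most of the reported errors concentrate between adjacent severities.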
These results confirm that the auxiliary symptoms and severity tasks provide useful supervision signals that improve user-level depression detection while enabling fine-grained clinical predictions. In the next subsection, we further examine how the multi-task framework behaves under noisy heuristic labels for symptom annotations.
3.6.2. Noise Robustness of Multi-Task Learning
The WU3D dataset’s automatic rule-based label generation process results in approximately 18% label noise, primarily arising from temporal misalignment between symptom expressions and PHQ-9 assessment dates, keyword ambiguity, and context-dependent language. This noise level is acknowledged as a limitation inherent to large-scale automatic labeling approaches.
To investigate whether this noise affects different symptom types uniformly, we analyzed the relationship between symptom linguistic explicitness and detection difficulty. Based on error analysis of misclassified cases and the linguistic characteristics of the nine DSM-5 symptoms, we categorize them into explicit symptoms that are typically expressed through direct depression-related keywords (e.g., “失眠” for insomnia, “想死” for suicidal ideation) and implicit symptoms that are more often described through subtle behavioral or contextual clues (e.g., “吃得少” for appetite changes, “动作变慢” for psychomotor retardation).
Table 10 presents our empirical observations on symptom detection performance by linguistic explicitness.
Consistent with this categorization, implicit symptoms show consistently lower performance (approximately 79–85% F1) than explicit symptoms (approximately 83–92% F1). This gap is attributable to their reliance on ambiguous behavioral descriptions that are more susceptible to noise during keyword-based labeling, whereas explicit symptoms use direct depression terminology that maps more reliably to ground truth labels.
Despite this noise, our multi-task learning framework demonstrates robustness. As shown in Table 9, incorporating the auxiliary symptom recognition and severity assessment tasks improves the main depression classification F1-score from 89.6% (single-task) to 91.8% (full MTL), a +2.2 percentage point improvement. This improvement stems from two mechanisms: (1) a regularization effect, in which the auxiliary tasks provide additional supervision that prevents overfitting to potentially noisy binary labels; and (2) the hierarchical attention’s implicit filtering, as the character-level and post-level attention mechanisms tend to downweight contradictory or ambiguous posts.
Table 9 further shows that adding symptom recognition alone achieves 90.9% F1 and adding severity assessment alone reaches 91.2% F1 on the main task, indicating that both auxiliary tasks contribute independently before combining synergistically in the full MTL framework. Thus, while the 18% noise does impact implicit symptom detection more severely, our MTL architecture partially mitigates this through its regularization properties and attention-based signal prioritization.
3.7. Error Analysis and Computational Efficiency
3.7.1. Error Analysis
Overall Error Distribution
To understand model limitations, we analyzed false positive and false negative predictions (Figure 8).
False Positives (36 cases, 7.2% of control users): Primary cause is frequent negative emotion expression without meeting DSM-5 criteria for persistent, severe, and multi-dimensional symptoms. For example, users experiencing temporary stressful events (exam pressure, breakup) may post negatively but lack sustained depression. Our model sometimes misclassifies these due to high negative sentiment scores.
False Negatives (30 cases, 6.0% of depressed users): The main reason is an implicit or humorous expression of depression symptoms. Some users employ self-mockery, sarcasm, or metaphorical language that the model fails to fully understand. For instance, “Life is so wonderful, I can’t wait to sleep forever” contains implicit suicidal ideation, but surface-level positive words (“wonderful”) may confuse the model.
Sarcasm and Metaphorical Language Challenges
Through qualitative error analysis of misclassified cases, we observed that a notable portion of false negatives involves posts containing sarcasm or metaphorical expressions. These linguistic phenomena pose particular challenges for automated depression detection systems.
Representative failure cases. We illustrate three typical scenarios where the model struggles.
Example 1 (sarcastic expression).
Original post: “人生真美好啊, 我都不想活了🙃” (“Life is so wonderful, I don’t even want to live anymore.”).
Model prediction: non-depressed.
Ground truth: depressed.
The model appears to focus on the literal positive keyword “美好” (“wonderful”) and fails to recognize the sarcastic contrast with “不想活了” (“don’t want to live”), as well as the sarcastic emoji. This illustrates how surface-level positive words can mislead the classifier when sarcasm inverts the intended sentiment.
Example 2 (metaphorical expression).
Original post: “感觉自己淹没在黑暗的海洋中, 怎么也游不到岸边” (“I feel like I’m drowning in a dark ocean and can never reach the shore.”).
Model prediction: depressed.
Ground truth: depressed.
In this case, the model correctly identifies depression, likely because common metaphors such as “黑暗” (“darkness”) and “淹没” (“drowning”) occur frequently enough in the training data for the model to learn their association with hopelessness. This example shows that well-established metaphors can still be captured by pattern learning.
Example 3 (mixed sarcasm and genuine expression).
Original post: “又是充满希望的一天呢, 和每天一样空虚无聊” (“Another day full of hope, just as empty and boring as always.”).
Model prediction: borderline.
Ground truth: depressed.
The juxtaposition of ostensibly positive framing (“充满希望”—“full of hope”) with explicitly negative descriptors (“空虚无聊”—“empty and boring”) creates ambiguity. The borderline prediction indicates that the model detects conflicting signals but cannot resolve them confidently.
Several factors contribute to these difficulties. First, the underlying BERT encoder is mainly pre-trained on formal text (e.g., news, encyclopedia articles) with limited exposure to informal social media language where sarcasm is prevalent. Second, Chinese sarcasm often relies on subtle contextual cues and cultural knowledge rather than explicit markers; unlike English, which sometimes uses tags such as “/s”, Chinese sarcastic expressions frequently reuse positive words in an ironic way. Third, our training dataset does not explicitly annotate sarcastic versus literal expressions, so the model cannot learn to treat sarcasm as a distinct phenomenon requiring special handling.
Addressing sarcasm and metaphor remains an important direction for future work. Possible extensions include incorporating emoji and punctuation patterns as potential sarcasm indicators, augmenting the training data with explicitly labeled sarcastic examples, using contrastive learning objectives to better distinguish literal from figurative language, and integrating user-level modeling to capture individual communication styles. While these enhancements are beyond the scope of the present study, they may further improve robustness to figurative language in future systems.
3.7.2. Computational Efficiency
We measured inference time and memory usage on a standard CPU (Intel Xeon E5-2680 v4; Intel, Santa Clara, CA, USA) and GPU (NVIDIA Tesla V100; NVIDIA, Santa Clara, CA, USA). Training took ~3.5 h on a V100 GPU with 32 GB memory; inference takes 50 ms per user (CPU) or 12 ms per user (GPU); the model occupies 142 MB (BERT: 110 M parameters; other components: 32 M parameters); and throughput reaches ~20,000 users per hour on a single GPU. Compared to the GPT-4 API (2–5 s per request), our model is 40–100× faster and can be deployed on commodity hardware without expensive API subscriptions.
4. Conclusions
This work presents a comprehensive methodology for depression detection on social media, systematically integrating clinical knowledge (DSM-5 criteria), multi-modal data (text, behavior, and topic), and advanced deep learning techniques (BERT, hierarchical attention, and multi-task learning). The proposed model achieves state-of-the-art performance (91.8% F1-score) on a large-scale Chinese social media dataset, significantly outperforming traditional methods, deep learning baselines, and even large language models like GPT-4. Extensive ablation studies confirm that each component contributes meaningfully and that the components provide complementary, mutually reinforcing benefits.
Beyond accuracy metrics, our model provides interpretable predictions through attention weight visualization and outputs fine-grained symptom assessments aligned with clinical diagnostic criteria. These characteristics, combined with low computational cost and local deployability, make the model well-suited for practical mental health screening applications. The success of this DSM-5-guided, multi-modal, multi-task approach demonstrates that domain-specialized methods with explicit expert knowledge integration remain highly valuable in the era of general-purpose large language models, particularly for professional applications requiring high accuracy, interpretability, transparency, and cost-effectiveness.
Limitations and Future Work: Our study has several limitations: (1) The dataset is limited to Chinese Weibo users, and generalization to other platforms (Twitter, Reddit) and languages requires further validation. (2) Social media users are not fully representative of the general population, with a potential bias toward younger, urban demographics. (3) Self-reported depression labels may contain noise compared to clinical diagnosis, though WU3D’s professional annotation mitigates this. (4) Rule-based symptom label generation (~18% noise) could be improved with more sophisticated NLP techniques or semi-supervised learning. (5) Error analysis shows that sarcastic and metaphorical posts remain a major source of false negatives, as the model sometimes relies on surface-level sentiment words and misses figurative or ironic expressions.
Future directions include the following: (1) incorporating additional modalities such as images (color tone, scene content) and temporal posting patterns (frequency fluctuations); (2) developing multilingual models using cross-lingual transfer learning to extend coverage to non-Chinese populations; (3) exploring causal inference methods to understand relationships between social media behaviors and depression onset; (4) conducting longitudinal studies to validate the model’s early warning capabilities for depression episode prediction; and (5) improving robustness to figurative language by integrating dedicated sarcasm detection, emoji and punctuation pattern features, and more context-aware modeling of ironic or metaphorical expressions. This work establishes a reproducible and extensible framework that can inform depression detection across diverse social media platforms and provide insights for other mental health conditions.
Code Availability Statement: Model implementation code will be made available upon reasonable request to researchers who meet the criteria for access to confidential data and have obtained appropriate ethical approvals. Inquiries should be directed to the corresponding author.