1. Introduction
Mental illnesses, particularly anxiety and depression, can lead to profound suffering and significantly diminish social functioning and quality of life [
1]. According to a 2019 World Health Organization report [
2], one in every eight people worldwide is affected by a mental disorder. Given the elevated risk of suicide among those with mental illnesses, early detection and intervention have become critical public health objectives. A key challenge in the early detection of mental disorders is that the majority of individuals likely experiencing these conditions do not seek professional assistance, despite the availability of mental health services. Consequently, many miss the window for timely and effective intervention [
2]. Instead, a significant number choose to share their emotions and personal experiences on social media platforms such as Reddit and Twitter. Unlike traditional offline psychotherapy, this form of online expression features a lower barrier to entry and generates substantial volumes of data, creating opportunities for the application of Natural Language Processing (NLP) techniques to identify mental health conditions. Transformer-based pre-trained models, such as BERT, have demonstrated exceptional capabilities in sentiment analysis [
3]. Nonetheless, research focusing on the application of these models for mental health analysis remains relatively nascent, particularly in the accurate identification and prediction of users’ mental health statuses. Data annotation in this domain is inherently complex, labor-intensive, and costly. Moreover, machine learning models in this field often grapple with challenges such as insufficient data and class imbalance, which significantly constrain their performance and overall effectiveness.
To address these issues, transfer learning technology has emerged as a powerful approach, allowing models trained on large-scale datasets to adapt to new tasks and greatly reducing the need for extensive annotated data. By leveraging transfer learning, pre-trained models like BERT can be fine-tuned for mental health analysis, achieving high performance and generalization capabilities even with smaller datasets [
4]. This study thus employs transfer learning strategies, fine-tuning pre-trained models on a small, annotated dataset, effectively enhancing the model’s performance in mental health classification.
To further improve the model’s classification performance, the study also proposes optimization strategies. To address the class imbalance in mental health datasets, data augmentation and undersampling techniques are employed. Data augmentation generates more samples of the minority class through Masked Language Modeling (MLM) [
5], while undersampling reduces the majority class samples, creating a balanced dataset. This approach helps the model avoid overfitting and enhances its ability to recognize minority classes. Additionally, the study improves the model structure by proposing an optimization strategy for multi-feature fusion. Traditional BERT models typically use only the [CLS] [
5] vector for classification, but this study introduces a multi-feature fusion layer that integrates local features, global features, and Term Frequency–Inverse Document Frequency (TF-IDF) features. This enables the model to fully utilize the semantic information in the text, thereby further improving classification accuracy.
The methodology proposed in this study presents a novel framework for mental health analysis that establishes a distinct approach from recent large language model solutions like MentaLLaMA [
6]. While existing methods primarily rely on model scaling and domain adaptation of large language models, our work introduces fundamental innovations at the feature representation level. The methodology begins with comprehensive data preprocessing and class balancing, followed by the development of an enhanced model through sophisticated feature integration techniques. The primary contributions of this study are outlined as follows:
Innovative Multi-Source Feature Fusion: We introduce a novel feature fusion mechanism that fundamentally differs from single-modality approaches. Unlike MentaLLaMA’s reliance on massive parameter tuning, our method achieves enhanced performance through intelligent integration of local, global, and TF-IDF features, creating a more comprehensive and nuanced representation for mental health classification.
Inherent Interpretability Framework: Our approach embeds explainability directly into the model architecture through attention-based fusion weights, providing transparent decision-making processes. This contrasts with post hoc explanation methods commonly used in transformer frameworks and offers more reliable interpretability for clinical applications compared to black-box large language models.
Computationally Efficient Alternative: The proposed framework demonstrates that strategic feature engineering can achieve competitive performance without the computational overhead of large language models. Our model-agnostic fusion module enhances any pre-trained transformer through efficient feature combination rather than costly parameter updates, making advanced mental health analysis more accessible and deployable.
Robust Learning Methodology: By combining data balancing techniques with gradient boosting optimization, our approach effectively addresses class imbalance and feature sparsity while maintaining strong generalization capabilities across diverse mental health categories.
2. Related Work
Mental health analysis has emerged as a critical subfield within Natural Language Processing (NLP). Early NLP approaches predominantly relied on rule-based systems and sentiment lexicons; however, these methods demonstrated significant limitations in capturing the nuanced and complex emotions associated with mental health signals.
With the development of machine learning and deep learning, models such as Support Vector Machine (SVM), Random Forest (RF), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM) networks have been extensively employed in text classification tasks. While these methods have achieved notable success in specific applications, they exhibit inherent shortcomings. Traditional models, such as SVM and RF, are heavily dependent on manual feature engineering, which can be labor-intensive and prone to bias. On the other hand, sequential models like RNN and LSTM, though capable of learning patterns in sequential data, often struggle to fully capture long-range contextual dependencies and the intricate diversity of sentiment and mental health classifications. These limitations underscore the need for more advanced methods to address the complexity inherent in mental health analysis.
For instance, Jina K et al. [
7] analyzed 488,472 posts from 248,537 users in the mental health community on Reddit. Using TF-IDF vectors with an XGBoost classifier, along with Continuous Bag of Words (CBOW) embeddings and a convolutional neural network (CNN) classifier, they evaluated users’ mental states. Similarly, to assess the psychological impact of COVID-19, Banna et al. [
8] combined LSTM and CNN models to examine Twitter data, revealing insights into the pandemic’s impact on mental health. Tariq S et al. [
9] used TF-IDF to extract features from texts and adopted the semi-supervised learning mode of Co-training to construct SVM, RF and Naive Bayes (NB) models to predict mental illness, showing that collaborative training techniques can be highly effective in predicting mental illness. Other researchers have used polynomial NB, multi-layer perceptron and LightGBM to classify Reddit posts [
10], highlighting the challenges posed by the informal language, acronyms and slang, and noise typical of social media data. Further, a “one-shot decision approach” has been explored to assess individuals at risk of developing psychological disorders, using an integrated model combining SVM and K-Nearest Neighbors (KNN), with noise label correction methods and intrinsic interpretability to distinguish between depressive symptoms and suicidal thoughts [
11]. Another study constructed a deep learning framework based on Bi-LSTM to classify the emotions of COVID-19-related discussions on social media [
12], outperforming the traditional LSTM model and other techniques.
In recent years, pre-trained language models based on the Transformer architecture, such as BERT [
5] and GPT [
13], have provided transformative solutions to longstanding challenges in text analysis. By effectively capturing contextual information, dynamically generating word embeddings, and minimizing the reliance on manual feature engineering, these models have significantly advanced the performance and generalization capabilities of NLP systems, particularly in sentiment analysis.
Models like BERT [
5] and GPT [
13] exploit the self-attention mechanism inherent to the Transformer architecture to address long-range dependency issues. Unlike the sequential processing paradigm of RNN and LSTM, self-attention enables parallel computation, assigning distinct attention weights to words based on their contextual relevance to other words within the text. This allows the models to accurately capture context in lengthy and complex texts. The incorporation of transfer learning further enhances these models’ utility. Through large-scale pre-training on extensive text corpora, these models develop rich and robust language representations. During fine-tuning, they can rapidly adapt to task-specific requirements, demonstrating strong performance even on datasets characterized by class imbalance. Additionally, ongoing research explores various optimization strategies to refine these models further, broadening their applicability to mental health analysis and related domains.
Dinu A et al. [
14] used three pre-trained models, BERT, RoBERTa and XLNET, to train on the SMHD mental health status dataset from Reddit, achieving state-of-the-art performance in post-level classification. Another study collected a dataset of 13 topics on Reddit, with five topics representing mental illnesses and eight representing no mental illness [
15]. Results showed that RoBERTa outperformed LSTM and BERT in mental illness analysis. Some studies have also built typologies for social media content based on psychological theory, such as DEPTWEET, a dataset with 40,191 tweets and crowdsourced tags [
16]. Using BERT and DistilBERT, it was found that accurate depression diagnosis depends on models’ abilities to understand context and distinguish between different levels of depression severity. Finally, Ameer I et al. [
17] used unstructured user data from Reddit to train multiple models across traditional machine learning, deep learning and transfer learning, with RoBERTa achieving the highest classification accuracy and F1 score for mental health analysis. Yang K et al. [
6] applied a large language model (LLM) to explainable mental health analysis on social media. Although the LLM did not match existing supervised methods in zero-shot and few-shot settings, the researchers reframed explainable mental health analysis as a text generation task and constructed the IMHI dataset. Training the LLaMA2 model on the IMHI dataset led to results comparable to the most advanced discriminative methods.
Recent approaches like MentaLLaMA [
6] showcase the potential of adapting large language models for mental health analysis. However, this study establishes a distinct path: instead of scaling model size, we enhance performance through feature-level innovation. We propose a portable fusion module that upgrades any pre-trained transformer without costly fine-tuning, while embedding interpretability directly into the fusion process through meaningful attention weights—offering both transparency and efficiency absent in many complex frameworks.
3. Proposed Approach
The dataset used in this study (CounselChat) is a publicly available, pre-anonymized corpus hosted on Kaggle [
18]. As no personally identifiable information was accessible and no human subjects were recruited, this study qualifies for exemption from ethical review under the Common Rule (45 CFR 46.104(d)(4)). All analyses adhered to Kaggle’s data use policy and the ACM Code of Ethics. Each entry consists of two primary attributes: Text and Category. The corpus comprises 310,000 records distributed across eight distinct categories: depression, suicide, anxiety, social anxiety, eating disorders (edanonymous), health anxiety, addiction, and alcoholism. However, the dataset exhibits significant class imbalance, with certain categories containing disproportionately higher numbers of entries than others. This imbalance poses a substantial challenge for accurate analysis and classification. To address these challenges, this study implements a comprehensive methodological framework encompassing data preprocessing, class balancing, feature fusion, and transfer learning, as illustrated in
Figure 1. These measures collectively aim to enhance the representational quality of the data and improve model performance across imbalanced categories.
3.1. Data Preprocessing Principles
In Natural Language Processing (NLP), data preprocessing constitutes a critical phase that significantly influences the model’s efficacy and the precision of its outcomes. This study outlines the systematic preprocessing steps undertaken to maintain data integrity and enhance the robustness of the model’s predictions. The preprocessing pipeline comprises the following steps:
3.1.1. Data Cleaning
Removal of long texts: Text entries with excessively high word counts are identified and excluded to ensure consistency in dataset length and quality.
Lowercasing: All text entries are converted to lowercase to establish a uniform data format and eliminate case-sensitive discrepancies.
Elimination of URLs and excessive whitespace: URLs and unnecessary whitespace are removed, as they introduce noise and can adversely affect the model’s training process.
Removal of punctuation and numerical values: Punctuation marks and numbers are stripped from the text, as they typically do not contribute to sentiment or thematic analysis.
3.1.2. Standardization
Stemming: Words are reduced to their root forms, consolidating morphological variants into a unified representation. This simplification minimizes lexical diversity and facilitates more efficient model learning.
Morphological reduction: Similarly to stemming, morphological reduction removes affixes, enabling the model to treat semantically similar or related words as equivalent. This approach enhances the model’s interpretative capability and reduces redundancy in word representations.
3.1.3. Deletion of Stop Words
Elimination of English stop words: Frequently used words such as articles, prepositions, and conjunctions, which contribute little semantic value, are removed. By focusing on the most semantically significant terms, this step highlights key features in the text, improving the model’s focus and analytical performance.
By employing these preprocessing strategies, the study ensures that the data is not only cleaner and more uniform but also optimized for extracting meaningful patterns and improving model accuracy.
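As a minimal illustration of the pipeline above, the following sketch combines these steps using common NLTK utilities. The specific regular expressions and the Porter stemmer are assumptions for demonstration; the 500-word length threshold corresponds to the cut-off introduced in Section 4.2.

```python
# Illustrative preprocessing sketch (assumes NLTK's English stopword list and the
# Porter stemmer; the study's exact cleaning rules may differ in detail).
import re
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
MAX_WORDS = 500                            # threshold for removing abnormally long texts

def preprocess(text):
    if len(text.split()) > MAX_WORDS:      # data cleaning: drop excessively long entries
        return None
    text = text.lower()                                    # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)                  # remove punctuation and digits
    text = re.sub(r"\s+", " ", text).strip()               # collapse excessive whitespace
    tokens = [STEMMER.stem(t) for t in text.split() if t not in STOP_WORDS]  # stop words + stemming
    return " ".join(tokens)

print(preprocess("I CAN'T sleep... visit https://example.com, it's been 3 weeks!!"))
```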
3.2. Data Balancing Mechanism
Imbalanced datasets often bias models toward majority classes, significantly impairing their ability to accurately identify minority categories and reducing overall predictive performance. To address this issue and enhance the model’s detection capabilities, this study employs a combination of data-level optimization strategies, including data augmentation and data balancing techniques.
For data augmentation, the study utilizes the Masked Language Modeling (MLM) method. This approach involves randomly selecting 15% of the words in a sentence, with 80% of the selected words masked, 10% replaced with random words, and 10% left unchanged. The pre-trained BERT model is then employed to predict the masked words, which are subsequently reintegrated into the original sentence to generate new text variants. To ensure data uniqueness, any duplicate variants generated during this process are removed. This technique effectively diversifies the dataset by increasing the representation of minority class samples and introducing linguistic variability, thereby enhancing the model’s generalization capabilities. Furthermore, the introduction of noise and textual variations improves the model’s robustness, enabling it to better capture contextual and semantic relationships and achieve more stable performance on complex classification tasks.
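The sketch below illustrates this MLM-based augmentation idea with a HuggingFace fill-mask pipeline. It is a simplified approximation of the procedure described above: the model name and the one-word-at-a-time masking are illustrative assumptions rather than the study's exact 80/10/10 masking scheme.

```python
# Illustrative MLM-based augmentation sketch using a BERT fill-mask pipeline.
# Roughly 15% of the words are masked (one at a time here, for simplicity) and the
# predicted completions are kept as new variants, dropping duplicates of the original.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(sentence, mask_ratio=0.15, seed=0):
    random.seed(seed)
    words = sentence.split()
    n_mask = max(1, int(len(words) * mask_ratio))
    variants = set()
    for idx in random.sample(range(len(words)), n_mask):
        masked = words.copy()
        masked[idx] = fill_mask.tokenizer.mask_token          # insert "[MASK]"
        prediction = fill_mask(" ".join(masked), top_k=1)[0]  # BERT predicts the masked word
        candidate = prediction["sequence"]
        if candidate.lower() != sentence.lower():             # keep only novel variants
            variants.add(candidate)
    return list(variants)

print(augment("i feel completely alone and cannot talk to anyone about it"))
```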
To further mitigate the detrimental effects of data imbalance, the study implements a data balancing strategy using similarity-based downsampling for majority class samples. Following the methodology proposed by Reimers et al. [
19], this approach calculates the Cosine Similarity between samples—a metric well-suited for measuring the semantic similarity of high-dimensional text feature vectors, as it effectively captures the inherent semantic relationships between text samples and helps accurately identify and retain representative samples in the dataset. Cosine Similarity scores range from −1 to 1, and samples with a score exceeding 0.75 are deemed redundant, with only one representative sample retained. This strategy reduces the over-representation of majority classes while preserving data diversity, thereby improving the model’s ability to learn balanced decision boundaries across categories.
3.3. Data Encoding and Data Sparsity
To alleviate the sparsity of TF-IDF features and construct an efficient classifier, this study designs a hybrid TF-IDF-based classifier construction workflow (as shown in Algorithm 1). This workflow integrates feature encoding and gradient boosting strategies, enabling the effective utilization of sparse data while retaining the discriminative power of TF-IDF representations.
The text content is converted into dimension-adjustable TF-IDF word vectors using TfidfVectorizer [
20], a widely adopted vectorization technique in text mining and information retrieval. This method transforms document collections into numerical feature representations based on the term frequency–inverse document frequency (TF-IDF) metric, which effectively balances local term importance against global distribution patterns [
21].
The TF-IDF value for a term $t$ in document $d$ is calculated as:

$$\mathrm{TFIDF}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}$$

where $\mathrm{tf}(t, d)$ represents the frequency of term $t$ in document $d$, $N$ is the number of documents in the collection, $\mathrm{df}(t)$ is the number of documents containing $t$, and $\mathrm{idf}(t)$ quantifies the term’s rarity across the document collection [21].
TF-IDF assigns higher weights to terms that appear frequently within a specific document but rarely across the entire corpus, thereby identifying terms with strong discriminative power for document classification [
22]. This characteristic makes TF-IDF particularly valuable for capturing distinctive lexical patterns in textual data.
After transformation, the input text is represented as a word vector array $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ denotes the $i$-th word vector. The sparsity $S(x)$ of the resulting representation is defined as:

$$S(x) = \frac{n_{\mathrm{zero}}(x)}{n_{\mathrm{total}}(x)}$$

where $n_{\mathrm{total}}(x)$ indicates the total number of elements in $x$, and $n_{\mathrm{zero}}(x)$ represents the count of zero elements in $x$ [23].
Algorithm 1. Hybrid TF-IDF-Based Classifier Construction for Addressing Data Sparsity (proposed approach for data sparsity). The procedure takes the preprocessed dataset as input (Step 1) and declares the trained hybrid classifier as output (Step 2). Steps 3–8 iterate over the boosting rounds, fitting a gradient (weak) classifier at each round and returning the resulting ensemble; Steps 9–15 then build the final classifier over the TF-IDF features from the weak learners and return the trained model.
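For illustration, the following sketch mirrors the two phases of Algorithm 1 with off-the-shelf scikit-learn components; the toy corpus and hyperparameters are assumptions rather than the study's configuration.

```python
# Minimal sketch of the hybrid TF-IDF + gradient-boosting workflow of Algorithm 1.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["i feel hopeless and empty", "nothing brings me joy anymore",
         "panic attacks before every exam", "my heart races in crowds",
         "i cannot stop drinking lately", "one more bottle every night",
         "i am terrified of speaking in class", "parties make me want to hide"]
labels = ["depression", "depression", "anxiety", "anxiety",
          "alcoholism", "alcoholism", "socialanxiety", "socialanxiety"]

# Phase 1: dimension-adjustable TF-IDF encoding of the preprocessed corpus.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(texts)
sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])       # fraction of zero entries
print(f"TF-IDF sparsity: {sparsity:.3f}")

# Phase 2: iteratively build the boosted classifier over the TF-IDF features.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
clf.fit(X.toarray(), labels)
print(clf.predict(vectorizer.transform(["i just feel so alone and sad"]).toarray()))
```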
This study selects TF-IDF as an external feature based on its optimal balance between feature complementarity, computational efficiency, and model adaptability. TF-IDF quantifies term importance from a statistical perspective, effectively compensating for the limitations of pre-trained models in capturing explicit textual features, thereby establishing a powerful complementary relationship. Compared to psychologically-informed linguistic features that rely on external resources or N-gram features that may cause dimensionality issues, TF-IDF can be efficiently computed using only the text data itself, offering notable engineering advantages [
20]. Furthermore, its sparse vector representation can be seamlessly integrated into deep learning frameworks via attention mechanisms, enabling efficient information compression and fusion.
3.4. Classifier Design
After obtaining TF-IDF features, we apply gradient boosting [
24] with weak classifier models to mitigate sparsity in the feature space. The algorithm follows the standard gradient boosting framework [
24,
25], where at each iteration
m, the model is updated by focusing on misclassified samples from previous iterations.
The iterative construction process of the classifier adheres to the gradient boosting framework, with specific implementation steps detailed in the classifier building phase of Algorithm 1 (Steps 9–15). The pseudo-residuals $r_{im}$ are calculated as the negative gradient of the loss function:

$$r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}$$

The optimal step size $\gamma_m$ is determined by minimizing the loss function:

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\big(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\big)$$

so that the model is updated as $F_m(x) = F_{m-1}(x) + \nu\, \gamma_m\, h_m(x)$, where $h_m$ is the weak learner fitted to the pseudo-residuals. This approach effectively reduces classification errors while maintaining model robustness across training epochs, with the learning rate $\nu$ controlling the contribution of each weak learner to the ensemble [
24].
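To make the update rule concrete, the following minimal sketch implements a standard binary gradient-boosting loop with a logistic loss. It illustrates the general framework [24], not the study's exact classifier, and for simplicity folds the step size into the shrinkage term.

```python
# Minimal gradient-boosting sketch: r_im = y_i - sigmoid(F_{m-1}(x_i)) is the negative
# gradient of the logistic loss, a regression tree h_m fits the residuals, and
# F_m = F_{m-1} + lr * h_m (the line-search step gamma_m is absorbed into lr here).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    F = np.zeros(len(y), dtype=float)            # F_0 = 0
    trees = []
    for m in range(n_rounds):
        residuals = y - sigmoid(F)               # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # weak learner h_m
        F += learning_rate * tree.predict(X)     # shrinkage controls each learner's contribution
        trees.append(tree)
    return trees

def predict_proba(trees, X, learning_rate=0.1):
    F = np.zeros(X.shape[0])
    for tree in trees:
        F += learning_rate * tree.predict(X)
    return sigmoid(F)

X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])
print(predict_proba(fit_gradient_boosting(X_toy, y_toy), X_toy).round(2))
```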
3.5. Feature Fusion
To further improve model performance, this study refines how pre-trained models are applied to the text classification task. Traditional pre-trained models such as BERT use only the [CLS] vector of the output hidden layer for text classification and do not fully exploit global features. To meet the needs of sentiment analysis, this study proposes a multi-feature fusion attention layer that integrates the pre-trained model’s original features, global features, and TF-IDF features (term frequency–inverse document frequency, used to evaluate the importance of a word in a document collection). The newly generated features are then fed into the classifier for prediction. The structure of the multi-feature fusion attention layer is shown in
Figure 2. The experimental results show that the performance of the pre-trained models after feature fusion is significantly improved, which verifies the effectiveness of the feature fusion strategy.
The multi-source heterogeneous feature fusion method proposed in this study is based on an attention mechanism, which can adaptively adjust the contribution of different features to the classification task. The specific implementation is as follows:
First, multiple semantic features are extracted from the pre-trained model: the embedding vector corresponding to the [CLS] token in the last hidden layer is extracted as the contextual semantic feature, denoted as $h_{\mathrm{cls}}$; the global average pooling feature, denoted as $h_{\mathrm{avg}}$, is obtained by taking the mean of the last hidden state tensor along the sequence dimension; and the global max pooling feature, denoted as $h_{\mathrm{max}}$, is obtained by taking the maximum of the last hidden state tensor along the sequence dimension. These features capture the deep semantic information and contextual relationships of the text.
Second, the TF-IDF feature is processed: the term frequency–inverse document frequency is calculated for the input text to obtain a sparse vector $v$ of dimension $|V|$ (the size of the TF-IDF vocabulary). This sparse vector is mapped to a dense vector, denoted as $h_{\mathrm{tfidf}}$, with the same dimension as the semantic features through a single-head self-attention layer. It provides importance information at the term frequency statistical level. The single-head self-attention layer maps the sparse vector to query vector $Q$, key vector $K$, and value vector $V$ through linear transformation matrices $W_Q$, $W_K$, $W_V$, and computes the dense vector $h_{\mathrm{tfidf}}$ via the Softmax function:

$$h_{\mathrm{tfidf}} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

where $d$ is the dimension of the semantic features.
Then, the initial fusion of the four features is performed:

$$h_0 = h_{\mathrm{cls}} + h_{\mathrm{avg}} + h_{\mathrm{max}} + h_{\mathrm{tfidf}}$$

The initial fused features are fed into an attention layer to generate an attention-weighted feature representation:

$$h = \mathrm{Attention}(h_0)$$

The feature $h$ generated by the attention layer is input into four independent weight generation modules, each consisting of a fully connected layer and a Softmax activation function, to generate dedicated weight vectors for the corresponding features:

$$w_i = \mathrm{Softmax}(W_i h + b_i), \quad i \in \{\mathrm{cls}, \mathrm{avg}, \mathrm{max}, \mathrm{tfidf}\}$$

Each $W_i$ has independent parameters, enabling the model to learn customized weight distribution patterns for different features, thereby enhancing the model’s expressive power.

The weight vectors are used to perform element-wise re-weighting of each original feature, and the re-weighted features are summed to obtain the final fused feature:

$$h_{\mathrm{fused}} = \sum_{i \in \{\mathrm{cls}, \mathrm{avg}, \mathrm{max}, \mathrm{tfidf}\}} w_i \odot h_i$$

where the ⊙ symbol denotes element-wise multiplication, used to achieve fine-grained adjustment at the feature level. The fused feature $h_{\mathrm{fused}}$ retains complete feature information and is input into the classifier, and the final category probability distribution is output through the Softmax function.
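A simplified PyTorch sketch of the fusion layer described above is given below. The layer sizes, the vector form of the single-head attention over the TF-IDF feature, and the internal attention layer are assumptions for illustration, not the study's exact implementation.

```python
# Simplified sketch of the multi-feature fusion attention layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureFusion(nn.Module):
    def __init__(self, hidden_dim: int, tfidf_dim: int, num_classes: int):
        super().__init__()
        # Single-head attention that maps the sparse TF-IDF vector to a dense vector
        # with the same dimension as the semantic features.
        self.w_q = nn.Linear(tfidf_dim, hidden_dim)
        self.w_k = nn.Linear(tfidf_dim, hidden_dim)
        self.w_v = nn.Linear(tfidf_dim, hidden_dim)
        self.fusion_attn = nn.Linear(hidden_dim, hidden_dim)   # attention over the fused feature
        # Four independent weight-generation modules (fully connected layer + Softmax).
        self.weight_heads = nn.ModuleList([nn.Linear(hidden_dim, hidden_dim) for _ in range(4)])
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, last_hidden_state, tfidf):
        h_cls = last_hidden_state[:, 0]                 # [CLS] contextual feature
        h_avg = last_hidden_state.mean(dim=1)           # global average pooling
        h_max = last_hidden_state.max(dim=1).values     # global max pooling
        d = h_cls.size(-1)
        q, k, v = self.w_q(tfidf), self.w_k(tfidf), self.w_v(tfidf)
        h_tfidf = F.softmax(q * k / d ** 0.5, dim=-1) * v   # simplified vector-form attention
        feats = [h_cls, h_avg, h_max, h_tfidf]
        h0 = torch.stack(feats, dim=0).sum(dim=0)       # initial fusion of the four features
        h = torch.tanh(self.fusion_attn(h0))            # attention-weighted representation
        fused = torch.zeros_like(h0)
        for feat, head in zip(feats, self.weight_heads):
            w = F.softmax(head(h), dim=-1)              # dedicated weight vector per feature
            fused = fused + w * feat                    # element-wise re-weighting and summation
        # Softmax over classes; for training one would typically return the raw logits.
        return F.softmax(self.classifier(fused), dim=-1)

# Toy usage with BERT-base-like dimensions (batch=2, seq_len=8, hidden=768, vocab=5000).
model = MultiFeatureFusion(hidden_dim=768, tfidf_dim=5000, num_classes=8)
out = model(torch.randn(2, 8, 768), torch.rand(2, 5000))
print(out.shape)   # torch.Size([2, 8])
```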
The model adopts a hierarchical processing architecture that establishes a distinct paradigm from scale-based approaches like MentaLLaMA [
6]. Rather than relying on massive parameter tuning of large language models, our feature-centric methodology sequentially performs feature extraction from pre-trained models and statistical methods, followed by attention-based cross-feature interaction for adaptive multi-source fusion, and finally completes classification using the integrated representations. This design draws inspiration from cross-modal fusion principles but implements them within a single modality through a model-agnostic fusion layer that simultaneously enhances expressive power and provides inherent interpretability—achieving performance gains while maintaining computational efficiency and transparency unavailable in resource-intensive LLM solutions.
3.6. Computational Complexity Analysis
To assess the computational efficiency of the proposed feature fusion mechanism, this section analyzes its computational complexity and compares it with current mainstream feature fusion strategies.
Assume the input feature dimension is d, the number of features is k (where k = 4 in this study), and the sequence length is L.
3.6.1. Proposed Multi-Source Attention Fusion Mechanism
The computational cost of our method primarily stems from two parts:
Multi-source Feature Extraction: Extracting the CLS token, average pooling, and max pooling features from the pre-trained model’s last hidden state has a time complexity of $O(L \cdot d)$.
Attention Fusion Layer: This layer first generates weight vectors through independent attention modules, followed by element-wise weighted summation. Its time complexity is dominated by the linear transformations, amounting to $O(k \cdot d^{2})$, where k is the number of features.
Consequently, the overall time complexity of the proposed method is $O(L \cdot d + k \cdot d^{2})$. As it involves only linear projections and element-wise operations, the space complexity is also maintained at a reasonable level of $O(k \cdot d)$ for the intermediate feature representations.
3.6.2. Comparison with Mainstream Fusion Strategies
To highlight the efficiency of our method, we compare it with two common fusion strategies:
Simple Concatenation Followed by Attention: This method first concatenates multiple feature vectors and then computes attention on the resulting high-dimensional vector, leading to a high time complexity of $O\big((k \cdot d)^{2}\big) = O(k^{2} \cdot d^{2})$. The computational cost therefore increases quadratically with the number of features k.
Convolutional Block Attention Module (CBAM): While generic attention fusion modules like CBAM perform excellently in computer vision, their channel and spatial attention mechanisms require structural adjustments when adapted to text sequences, and their computational profile may not be optimal for efficiency in the textual modality.
3.6.3. Comprehensive Analysis
Table 1 below comprehensively compares the complexity and characteristics of different fusion strategies:
The analysis results demonstrate that the proposed method achieves a good balance between computational complexity and feature interaction capability. Compared to the simple concatenation strategy, our method avoids computing attention in ultra-high-dimensional spaces by utilizing independent and parallel weight generation paths, thereby achieving finer-grained feature fusion at a lower computational cost.
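As a rough illustration of this trade-off, for BERT-base dimensions ($d = 768$) and $k = 4$ fused features, the fusion layer costs on the order of $k \cdot d^{2} = 4 \times 768^{2} \approx 2.4$ million operations per example, whereas attention over the concatenated $k \cdot d$-dimensional vector would cost on the order of $(k \cdot d)^{2} = 3072^{2} \approx 9.4$ million operations, roughly a factor of $k$ more.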
4. Results and Analysis
4.1. Experiments Methods
Traditional machine learning methods (e.g., Support Vector Machine, Random Forest, Naive Bayes) require additional feature extraction steps and often struggle to capture contextual nuances in text. Likewise, conventional deep learning models tend to face challenges in handling context effectively, limiting their utility for complex text sentiment analysis tasks. This study, therefore, primarily focuses on pre-trained models with attention mechanisms, which offer enhanced capabilities for handling context in NLP tasks.
Pre-trained models built on the Transformer architecture with an attention mechanism enable parallel processing and are well-suited for capturing long-range dependencies, providing significant advantages over traditional RNN-based models in both performance and stability. Additionally, pre-trained models demonstrate greater resilience when handling data imbalance, particularly through transfer learning and data augmentation. These techniques allow models to effectively capture features of minority class samples, improving accuracy in detecting rare emotional signals, which is essential in mental health analysis. This study uses a variety of pre-trained models based on the Transformer architecture, including:
GPT-2: Pre-training on a large amount of text data based on unsupervised learning, good at generating text [
13].
BERT: The use of bidirectional pre-trained language representation has revolutionized various tasks in the field of NLP [
5].
RoBERTa: Improved BERT performance through longer training, larger data sets, and removal of NSP tasks [
26].
DistilBERT: A lightweight version of BERT with fewer parameters but similar performance [
27].
SpanBERT: The model’s ability to recognize text fragments was improved by modifying the pretraining task [
28].
ALBERT: BERT memory consumption is reduced through parameter sharing and model compression [
29].
XLNet: Achieves bidirectional context learning by maximizing the expected likelihood of the arrangement order, making up for BERT’s limitation of ignoring dependencies between mask locations [
30].
In addition to the above models, this study also compares the traditional LSTM [
31], GRU [
32] and the latest MAMBA models [
33].
4.2. Experimental Data Preprocessing
In the data cleaning experiment of
Section 3, we plotted a kernel density estimation plot for all categories of text in the original data set (see
Figure 3), which showed that over 99% of the texts had fewer than 1000 words, while very few had a word count between 1000 and 8000. This distribution was deemed unreasonable, so a threshold of 500 words was established. Eliminating abnormally long text reduces noise and outliers, enabling the model to better capture key data features and improving the efficiency and accuracy of model training.
After data preprocessing, this study further investigates the correlations between different data categories. In this study, the original labels were refined to differentiate the categories of “anxiety,” “social anxiety,” and “health anxiety.” Although these three categories share some textual similarities, they exhibit discernible semantic differences. Other categories in the data set also showed notable correlations. For example, individuals with depression may show signs of suicidal ideation, and those with addiction could also struggle with alcoholism. These associations, while significant, are not absolute, as individuals may express and experience emotions uniquely. The goal of this study was to explore deeper correlations between emotional categories, rather than merely implementing simple emotional classifications or assessing mental health status through social media texts.
To investigate these correlations, we randomly selected 100 data samples from each category and processed them with word embeddings [
34,
35] to obtain relevant features. These features were subsequently dimensionally reduced using t-SNE [
36]). Outliers were identified using the IQR (interquartile range [
37]) method, and an optimal ellipse was fitted using the least-squares method, with the remaining data points falling within this ellipse.
Figure 4 shows that the data points of the edanonymous category formed an isolated cluster, indicating clear distinction from other categories. Analysis of other categories showed a strong correlation between addiction and alcoholism, with weaker associations to other categories. In contrast, the suicide and depression categories exhibited considerable overlap, underscoring a strong correlation. While social anxiety and health anxiety showed relative independence, they were still related to general anxiety. These insights align with previous discussions on category correlation. Furthermore, results from the confusion matrix in subsequent experiments were consistent with the category correlation shown in
Figure 4, as indicated by the false-positive rates.
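For reference, a condensed sketch of this analysis pipeline (embedding, t-SNE projection, and IQR-based outlier filtering) is shown below. The embedding model and toy samples are assumptions, and the least-squares ellipse fitting is omitted for brevity.

```python
# Illustrative sketch of the category-correlation analysis: embed sampled posts,
# project with t-SNE, and flag outliers with the IQR rule.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

posts = ["i feel hopeless", "panic before exams", "cannot stop drinking",
         "crowds terrify me", "i skip meals for days", "thinking about ending it"] * 20
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(posts)                      # embedding features per post

points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

# IQR-based outlier detection on each t-SNE dimension.
q1, q3 = np.percentile(points, 25, axis=0), np.percentile(points, 75, axis=0)
iqr = q3 - q1
inlier_mask = np.all((points >= q1 - 1.5 * iqr) & (points <= q3 + 1.5 * iqr), axis=1)
print(f"kept {inlier_mask.sum()} / {len(points)} points inside the IQR bounds")
```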
4.3. Experimental Data Balancing
In the data enhancement and balancing experiments described in
Section 3, we adopted a dual strategy of similarity-based downsampling and data augmentation-based upsampling to balance the dataset, effectively reducing the dominance of majority class samples and achieving a more equitable sample distribution. Specifically, the most prevalent categories saw notable declines in both proportion and quantity—for example, the "depression" class dropped from 39% (108,224 samples) to 25% (75,752 samples), while the "suicide" class decreased from 22% (61,118 samples) to 15% (45,835 samples). In contrast, minority classes experienced significant growth: the "alcoholism" class, originally accounting for 2% (5513 samples), rose to 7% (21,951 samples), and the "healthanxiety" class increased from 3% (8191 samples) to 8% (24,539 samples). Detailed changes in the proportion and quantity of all classes before and after balancing are summarized in
Table 2.
By integrating these two complementary data optimization techniques, the study successfully mitigated issues of model overfitting, particularly concerning majority class samples. Additionally, the improved balance in the dataset significantly enhanced the model’s ability to classify minority class samples, leading to overall improvements in predictive performance and robustness.
4.4. Experimental Setup
4.4.1. Pre-Trained Model Configuration
Due to computational resource limitations, each model configuration was only tested once. To ensure the reliability of the results, we validated the proposed method on seven pre-trained models with different architectures. All experiments were conducted under the same training-validation-test split and hyperparameter settings. All transformer-based pre-trained models (GPT-2, BERT, RoBERTa, DistilBERT, SpanBERT, ALBERT, XLNet) were sourced from the HuggingFace Transformers library, utilizing the base versions of each architecture. The MAMBA model was obtained from its official GitHub [
38] repository. The models and their corresponding tokenizers were loaded using their respective dedicated classes.
Model Specifications
Models and tokenizers were loaded using their specific classes from HuggingFace (e.g., BertModel, RobertaTokenizer)
Base versions were selected for consistency across experiments
Model weights were initialized from pre-trained checkpoints without domain-specific pre-training.
4.4.2. Hyperparameter Configuration
Transformer-based Models (GPT-2, BERT, RoBERTa, DistilBERT, SpanBERT, ALBERT, XLNet, MAMBA):
Pre-trained Source: HuggingFace Transformers (base versions)
MAMBA: Official GitHub repository
Learning Rate: 1 × 10⁻⁵
Batch Size: 16
Optimizer: Adam with default PyTorch parameters
Loss Function: CrossEntropyLoss
Maximum Epochs: 10
Early Stopping: Enabled, selecting the model with minimum validation loss
LSTM and GRU Models:
Learning Rate: 0.001 (default Adam optimizer in Keras)
Batch Size: 32
Optimizer: Adam with default Keras parameters (learning rate = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 1 × 10⁻⁷)
Loss Function: Categorical crossentropy
Maximum Epochs: 50
Early Stopping: Enabled, based on validation loss minimization
4.4.3. Data Configuration
Train/Validation/Test Split: 8:1:1 ratio
Maximum Sequence Length: 512 tokens
Evaluation Metrics: Accuracy, Recall, F1-Score
4.4.4. Computational Environment
GPU: NVIDIA RTX 4060 Ti (16GB VRAM)
CPU: Intel i5-12400
RAM: 32 GB
Python 3.9.18
PyTorch 2.1.2+cu121 (for transformer models)
TensorFlow-GPU 2.10.0 (for LSTM/GRU models)
HuggingFace Transformers library (for pre-trained models)
CUDA 12.1
4.4.5. Training Strategy
Transformer models utilized pre-trained weights from HuggingFace base versions
Consistent fine-tuning approach applied across all transformer models
Fixed learning rate schedule without decay for all models
Model selection based on early stopping with validation loss monitoring
Identical evaluation protocol across all models using the same data splits
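The configuration above corresponds to a conventional fine-tuning loop; a minimal skeleton is sketched below. The dataloaders, tokenization, and classification head are assumed to exist, and the code is illustrative rather than the study's exact implementation.

```python
# Skeleton of the fine-tuning loop implied by the configuration above.
import copy
import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Adam

def fine_tune(model, train_loader, val_loader, device="cuda", max_epochs=10, lr=1e-5):
    model.to(device)
    optimizer = Adam(model.parameters(), lr=lr)        # fixed learning rate, no decay
    criterion = CrossEntropyLoss()
    best_val, best_state = float("inf"), None
    for epoch in range(max_epochs):
        model.train()
        for input_ids, attention_mask, labels in train_loader:
            optimizer.zero_grad()
            logits = model(input_ids.to(device), attention_mask.to(device))
            loss = criterion(logits, labels.to(device))
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for input_ids, attention_mask, labels in val_loader:
                logits = model(input_ids.to(device), attention_mask.to(device))
                val_loss += criterion(logits, labels.to(device)).item()
        if val_loss < best_val:                         # early stopping: keep min-val-loss model
            best_val, best_state = val_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```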
In this study, the baseline and improved versions of the pre-trained models are evaluated on both the original and the enhanced datasets, using three evaluation metrics: Accuracy, Recall, and F1 score. Because the data distribution is uneven, recall and F1 scores are more indicative of the true performance of the models. Their calculation methods are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where $\mathrm{Precision} = \frac{TP}{TP + FP}$, and $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, respectively.
4.5. Experiment Result
The initial evaluation of the models was conducted on the original dataset, with the results summarized in
Table 3. These outcomes serve as the baseline for comparative analysis in subsequent experiments.
Table 4 presents the evaluation results of the feature fusion models applied to the original dataset. The inclusion of the multi-feature fusion attention layer resulted in significant performance improvements for several models. Notably, the BERT model exhibited substantial gains, with recall increasing from 76.0% to 80.7% and the F1 score rising from 76.6% to 81.0%. Similarly, the SpanBERT model demonstrated marked improvement, with recall increasing from 74.9% to 81.8% and the F1 score improving from 76.7% to 81.9%. Both models achieved notable enhancements, with recall and F1 scores rising by over 4%. In contrast, the performance improvements for the RoBERTa model were more modest, with recall increasing from 81.2% to 82.0% and the F1 score rising from 81.6% to 82.5%. The GPT-2 model, however, experienced a slight decline in performance, with recall decreasing marginally to 80.5% and the F1 score dropping from 80.8% to 80.5%. These results highlight the effectiveness of the multi-feature fusion attention layer in enhancing the performance of certain models, particularly BERT and SpanBERT, while also underscoring the varying impact of feature fusion across different architectures.
Additionally, this study investigates the impact of data balancing on model performance. By augmenting minority class data and downsampling majority class data, we enhance the model’s generalization capability and robustness, thereby improving its overall performance.
Table 5 presents the performance of various pre-trained models on the dataset after data augmentation and balancing. A comparison with the results in
Table 3 reveals significant improvements across all models. For instance, the accuracy of the RoBERTa model increased from 0.792 to 0.824, the recall improved from 0.812 to 0.856, and the F1 score rose from 0.816 to 0.857. Similarly, the XLNet model exhibited enhancements, with accuracy increasing from 0.788 to 0.822, recall from 0.801 to 0.854, and F1 score from 0.809 to 0.854. Other models also demonstrated notable performance gains, indicating that the data balancing strategy effectively improves the classification performance of the models. Finally, we evaluated the feature fusion model on a balanced dataset, and the results are shown in
Table 6. All models achieved Accuracy, Recall, and F1 scores above 0.8. Among them, the RoBERTa model performed exceptionally well, with an accuracy of 0.832, a recall of 0.864, and an F1 score of 0.863—all exceeding 0.83. This is significantly better than the results of the benchmark experiment (
Table 3), the data balancing experiment (
Table 5), and the feature fusion experiment (
Table 4). DistilBERT and XLNet also performed well across all metrics, further validating the effectiveness of feature fusion models on balanced datasets.
The experimental results demonstrate the consistent effectiveness of the proposed feature fusion strategy across multiple pre-trained architectures. Although computational constraints precluded multiple replicate experiments, the observed performance improvements showed remarkable consistency across all seven pre-trained models (see
Table 6). This cross-model stability provides compelling evidence that the performance gains stem from the methodological innovations rather than random variations, establishing a solid empirical foundation for our approach.
The enhanced models not only achieved significant improvements in accuracy, recall, and F1-score on the original dataset but also exhibited superior generalization capabilities and minority class recognition on balanced datasets. Particularly noteworthy is the framework’s robust performance in addressing data imbalance, where the feature fusion mechanism enables more accurate capture of minority class patterns. These findings validate the potential for extending these improvements to broader natural language processing tasks, including sentiment analysis, named entity recognition, and machine translation.
Furthermore, while direct numerical comparison with closed-source implementations like MentaLLaMA is impractical, our approach establishes distinct advantages in computational efficiency and transparency. The framework achieves competitive performance (e.g., BERT-FF F1: 0.855) without requiring massive parameter sets typical of 7B+ LLMs, significantly enhancing accessibility. More importantly, unlike the “black-box” predictions characteristic of many large language models, our architecture provides intrinsic explainability through fusion attention weights—a crucial feature for clinical applications where decision transparency is paramount.
4.6. Analysis of Ablation Study
As evidenced in
Table 7, the ablation study conducted on the augmented dataset consistently validates the efficacy of the proposed feature fusion strategy across diverse model architectures. The incremental integration of TF-IDF and global semantic features reliably contributes to performance gains in most configurations, confirming the complementary nature of statistical and deep semantic representations.
BERT demonstrates the most pronounced sensitivity to feature enrichment, with TF-IDF and global features providing substantial individual improvements of +1.2% and +1.3% in the F1-score, respectively. RoBERTa and DistilBERT exhibit more modest but consistent gains from the complete fusion framework (+0.6% and +0.9% F1-score). Notably, even in the case of ALBERT where individual components showed slight degradation, the full fusion mechanism demonstrated robust integrative capability, nearly recovering baseline performance and highlighting the framework’s adaptability.
The consistent positive trajectory across architectures underscores that the observed improvements are systematic rather than stochastic, with the attention-based fusion mechanism effectively leveraging multi-source information to enhance model discriminability.
4.7. Model Classification Evaluation
To visualize which categories are most likely to be confused with one another, we analyze the confusion matrices.
Figure 5,
Figure 6 and
Figure 7 show the confusion matrices of BERT, RoBERTa, and DistilBERT for the benchmark experiment (
Table 3), the feature fusion experiment (
Table 4), the data enhancement experiment (
Table 5), and the feature fusion and data enhancement combination experiment (
Table 6), respectively. Through comparative analysis of these confusion matrices, we found correlations among the eight categories in the dataset used in this study (depression, suicide, anxiety, social anxiety, edanonymous, health anxiety, addiction, alcoholism). Among them, the edanonymous category was the most reliably identified by the model, while the confusion between depression and suicide was high, indicating a strong correlation between them. There is also a degree of confusion between suicide and anxiety. Although the direct correlation between health anxiety and social anxiety is not strong, both show some correlation with the anxiety category.
4.8. Model Classification Error Analysis
We selected some misclassified samples, shown in
Table 8. For instance, in Text 1, the statement “I just wish I could be happy and myself but I can’t” does not explicitly convey “suicide” but rather leans toward expressing “depression.” The emotional expression here is subtle, which makes it difficult for the model to accurately capture and classify the underlying sentiment. Samples with vague or unclear semantic expressions often challenge the model’s ability to discern emotions in text, increasing classification difficulty.

Similarly, Text 2 contains multiple statements (e.g., “Constant thought of suicide,” “All I want to do is end it,” “Mental institutions don’t help, therapist doesn’t help, medications don’t help,” and “I’m really intent on killing myself”) that clearly indicate suicidal thoughts. Despite this, the text is labeled as “depression” rather than “suicide.” Such labeling inaccuracies can misguide the model during training, leading it to establish incorrect associations between emotions and categories; these errors directly impact prediction accuracy by blurring the boundaries between distinct emotional states.

Additionally, Text 3 reflects “anxiety” (expressed through phrases like “my anxiety attacks insomnia” and “Anxiety has been affecting me academically and socially”), “depression” (e.g., “I have this unexplainable feeling of severe depression”), and even “suicide” (e.g., “wanting to just die and I already contemplate death every day”). Although Text 3 contains multiple emotional expressions, it is assigned only one label in the dataset. This one-label treatment of multi-emotion texts disregards the diversity of emotions within the text and limits the model’s capacity to accurately identify and distinguish different emotional features.

These issues within the dataset, including unclear semantic expressions, label errors, and cases where multiple emotions are assigned to a single label, not only reduce the accuracy of the model’s emotion classification but also hinder its generalization ability, affecting its performance in practical applications. To address these problems and improve model performance, more refined labeling and thorough data cleaning are necessary to ensure label accuracy and clarity of emotional expression. However, fine-grained tagging and cleaning of large volumes of text is time-consuming and inefficient. Manual annotation requires expertise to accurately distinguish between different emotions, especially when a text contains multiple complex emotions, which increases the difficulty and workload of annotation. In addition, manual labeling is susceptible to subjective factors, and judgments may differ between annotators, further increasing the challenge of maintaining consistency in data processing.
5. Conclusions
This study demonstrates the efficacy of employing pre-trained language models for emotional text classification, achieving substantial performance improvements through systematic data preprocessing and feature optimization. By implementing text standardization, noise reduction, and dataset simplification techniques, including lowercase conversion, stop word removal, and stemming, we enhanced the models’ capacity to interpret nuanced emotional expressions.
To address class imbalance challenges, we applied data augmentation and downsampling techniques that significantly improved minority class identification and generalization capabilities. The introduction of a novel feature fusion approach, integrating TF-IDF features with transformer-based representations, further optimized text feature characterization and substantially enhanced emotion classification performance.
Our experimental results validate the proposed methodology. Models such as RoBERTa and BERT, when enhanced with our feature fusion mechanism, achieved significant gains in accuracy, recall, and F1-score on the balanced dataset. More importantly, this work outlines a distinct and valuable research direction. In contrast to the prevailing trend of pursuing performance through model scaling, as exemplified by large language models (LLMs) like MentaLLaMA, our research demonstrates that strategic feature engineering remains a highly viable path for performance enhancement. Our approach offers a set of compelling advantages in specific contexts: it is computationally more efficient, does not require massive parameter sets, and provides inherent explainability through its attention-based fusion mechanism—a feature particularly crucial for sensitive domains like mental health analysis.
Analysis of confusion matrices provided valuable insights into model behavior, revealing semantic correlations between categories and identifying challenges such as ambiguous expressions and labeling inconsistencies. These findings underscore the importance of dataset quality in emotional classification tasks.
Future research will focus on refining dataset labeling processes and exploring the application of our feature fusion framework to broader natural language processing tasks. The demonstrated combination of computational efficiency, transparent decision-making, and robust performance positions our approach as a viable pathway for real-world mental health applications where both accuracy and interpretability are paramount.