Attention-Enabled Ensemble Deep Learning Models and Their Validation for Depression Detection: A Domain Adoption Paradigm

Depression is increasingly prevalent, leading to higher suicide risk. Depression detection and sentimental analysis of text inputs in cross-domain frameworks are challenging. Solo deep learning (SDL) and ensemble deep learning (EDL) models are not robust enough. Recently, attention mechanisms have been introduced in SDL. We hypothesize that attention-enabled EDL (aeEDL) architectures are superior compared to attention-not-enabled SDL (aneSDL) or aeSDL models. We designed EDL-based architectures with attention blocks to build eleven kinds of SDL model and five kinds of EDL model on four domain-specific datasets. We scientifically validated our models by comparing “seen” and “unseen” paradigms (SUP). We benchmarked our results against the SemEval (2016) sentimental dataset and established reliability tests. The mean increase in accuracy for EDL over their corresponding SDL components was 4.49%. Regarding the effect of attention block, the increase in the mean accuracy (AUC) of aeSDL over aneSDL was 2.58% (1.73%), and the increase in the mean accuracy (AUC) of aeEDL over aneEDL was 2.76% (2.80%). When comparing EDL vs. SDL for non-attention and attention, the mean aneEDL was greater than aneSDL by 4.82% (3.71%), and the mean aeEDL was greater than aeSDL by 5.06% (4.81%). For the benchmarking dataset (SemEval), the best-performing aeEDL model (ALBERT+BERT-BiLSTM) was superior to the best aeSDL (BERT-BiLSTM) model by 3.86%. Our scientific validation and robust design showed a difference of only 2.7% in SUP, thereby meeting the regulatory constraints. We validated all our hypotheses and further demonstrated that aeEDL is a very effective and generalized method for detecting symptoms of depression in cross-domain settings.


Introduction
Depression is a serious and debilitating mental health condition, affecting 260 million people worldwide [1]. According to the National Institute of Mental Health, depression is increasingly prevalent, affecting individuals' ability to function in daily life, with suicide risk increasing by 35.2% from 2000 to 2020 [2]. It is characterized by persistent feelings of sadness, hopelessness, and a loss of interest in daily activities [3]. Individuals with depression often experience a range of physical and emotional symptoms, including fatigue, insomnia, changes in appetite, and difficulty concentrating in day-to-day activities [4,5]. Depression can have a significant impact on an individual's quality of life, affecting their personal and professional relationships, their ability to work, and their overall sense of well-being [6,7]. Therefore, early detection is essential to prevent the condition from worsening and to help individuals access appropriate treatment and support. Through this, individuals may be better able to manage their symptoms and maintain their ability to work and contribute to society. This can lead to better outcomes for both individuals and society as a whole [8,9].
Depression detection has been conducted for over 200 years through the identification of an individual's behavior by qualified psychologists [10,11]. Machine learning (ML) has become very popular in healthcare, particularly in the field of the classification of diseased vs. control patients [12,13]. Several studies have explored the use of statistical ML models to categorize a person's chats and texts as exhibiting either depressive or non-depressive behavior by analyzing patterns in language use and to identify features that are indicative of depression [14,15]. As observed before, ML models suffer from poor performance due to their inability to handle the non-linearity of risk predictors and gold standard labels or events [13,16,17]. Similarly, the linear architecture of current automated depression detection models renders them susceptible to poor performance, since they only focus on individual words and fail to consider the context of previous and subsequent words. These models also tend to be slow, due to non-parallel processing, and offer few options for algorithmic tuning and refinement.
Deep learning (DL) has rapidly gained momentum in a large number of applications due to its ability to extract features automatically [18]. These models utilize fully connected layers with neurons and activation functions, creating networks that mimic the human brain's functioning [19]. Recently, advanced DL models have penetrated the field of text classification and are capable of identifying complex sequences in language data [20][21][22]. The use of DL models and open-source embedding techniques, such as Word2vec and GloVe [23], has shown promise in addressing the challenge of detecting depression. By using embeddings, text data can be converted into dense vectors, where semantically similar inputs are located close to each other [24]. The introduction of architectures such as Gated Recurrent Units (GRUs) has improved the results of depression detection [25], but there are still limitations, such as the nature of a single input-output channel and the inability to achieve optimal results through a single base classifier.
Ensemble deep learning (EDL) represents a breakthrough in the field of DL, providing the potential for better performance than standalone models [26,27]. It enables the training of data of varying sizes, shapes, and types to different base classifiers and produces a single predictive output, which may be helpful in situations where data are of a multimodal nature [28][29][30]. Studies, including [31], have employed clustering and ensemble-based models to yield superior results in sentiment detection. To further enhance the performance of EDL, incorporating attention channels (or blocks) into the model architecture could increase its robustness and enable a more focused analysis of specific input tokens [32][33][34]. By identifying key features within the input data, the attention mechanism could potentially improve the accuracy, reliability, and generalizability of the model, particularly in applications related to mental health or other complex domains. Study [35] employed an attention-enabled LSTM model for sentiment analysis at the document level, which included a joint loss function to enhance its performance. Additionally, transformers have been widely used for sentiment analysis, as demonstrated by another study [36] that utilized a weight ensemble of transformer models to detect aggressive text in the Bengali language, employing various BERT-based techniques.
Multi-head co-attention networks enable us to attend independently to different parts of a sequence. Study [37] leveraged this approach to perform aspect-level sentiment analysis on a text dataset, surpassing existing methods. In the same domain of aspect-based sentiment extraction, numerous studies have explored triplet extraction techniques involving target, opinion, and sentiment extraction [38,39].
To conduct our study, we constructed eleven attention-enabled solo deep learning (aeSDL) and five attention-enabled ensemble deep learning (aeEDL) models and evaluated their performance on four main datasets, namely, two public datasets (SD-Sford-09 and DD-Kgg-22) and two proprietary datasets (DD-Red-14 and SD-Twi-23). We utilized self-attention blocks to determine the overall performance improvement after incorporating attention mechanisms into the models. Additionally, we calculated the average performance gain of aeEDL versus aeSDL, as well as attention models versus non-attention, or without attention (wa), models. We conducted a benchmark of our model on two public datasets and achieved the best performance with our aeEDL (ALBERT+BERT-BiLSTM), outperforming previous studies in the literature. Furthermore, we validated our model using various statistical tests and cross-validation protocols on seen vs. unseen datasets to verify the robustness of our aeEDL. Finally, we conducted cross-domain tests to demonstrate the adaptability of our model by training and testing it on datasets with differing semantics.
This paper follows a systematic flow, beginning with Section 2, which describes the methodology. Section 2.1 includes a discussion of the four types of dataset used in the study, the architecture and building of both the SDL and EDL models are outlined in Section 2.2, and the experimental protocols undertaken are discussed in Section 2.3. Next, the paper describes the performance metrics used throughout the study in Section 2.4. The Results section of the paper reports the findings of the study, which are presented across five different experimental protocols in Section 3. Section 4 presents a performance evaluation, whereby ROC curves and bar graphs showcase the models' performance in Section 4.1, and a discussion of the statistical tests used is presented in Section 4.2. Finally, the paper discusses how the study's findings compare with other related research studies in Section 5. It also includes a discussion of the study's principal findings, and a brief note on attention, the strengths and weaknesses, and possible extensions of our current study.

Methodology
Our methodology involved using simple DL models as a starting point, since they have proven to be effective in various natural language processing (NLP) tasks. To evaluate a model's efficiency and generalizability, we conducted tests in multiple intra- and cross-domain settings. Therefore, the first step in our methodology was to collect multiple datasets. Next, we constructed the architecture of the individual models, and then used them to build the aeEDL models. Finally, we describe the experimental protocols that we implemented, as well as the performance metrics that we utilized to evaluate the models.

Data Types and Their Preparation
The methodology employed in this study involved gathering data from multiple sources and domains. To conduct our experiment, we collected data from four different sources, two of which were publicly available and two of which were proprietary. Finally, a fifth dataset, the well-known SemEval (2016) sentiment dataset [40], was used for benchmarking.

Dataset 1: SD-Sford-09
This dataset, "Sentiment140", labeled "SD-Sford-09", comprises sentimental data (SD) from Stanford University (Sford), first published in 2009. This is a publicly available dataset [41], which contains 1.6 million tweets, each labeled with the polarity of the tweet, as portrayed in Table 1. A polarity value of zero indicates a negative tweet, while a value of four indicates a positive tweet. The dataset is well balanced, with 800,000 members in each class, and our analysis focused solely on the polarity and text content of each tweet.

Dataset 2: DD-Red-14
We adopted the methodology followed by study [42] to create a depression-centric dataset, using the PushShift API to download information from 12 subreddits focused on mental health (such as r/bipolarreddit, r/socialanxiety, r/healthanxiety, r/ptsd, r/autism, r/schizophrenia, r/addiction, r/adhd, r/anxiety, r/alcoholism, r/lonely, and r/depression) and 11 subreddits focused on non-mental health-related topics (such as r/jokes, r/gaming, r/india, r/music, r/teaching, r/legaladvice, r/mildlyinteresting, r/unexpected, r/space, r/cats, and r/news). Subreddits are individual communities within the larger Reddit platform, where users can join and participate in discussions centered around specific topics. These communities often have their own rules, moderators, and user base, creating a unique environment for sharing and interacting with content. To address the validation of the ground truth label, we specifically obtained the dataset from specialized mental health subreddits. These subreddits were externally moderated by dedicated subreddit moderators, who played a crucial role in ensuring the data's quality. The dataset was labeled "DD-Red-14" since it comprises depression data (DD) from a Reddit source (Red) and was first published in 2014. A total of 13,000 posts were collected from the mental health subreddits and labeled 'depressive', and an additional 13,000 posts were collected from the non-mental health-related subreddits and labeled 'neutral', as portrayed in Table 2.

Dataset 3: DD-Kgg-22
This dataset was extracted from the Kaggle platform and contains 27,977 posts comprising text related to people suffering from anxiety, depression, and other mental health issues [43]. It was labeled "DD-Kgg-22" since it comprises depression data (DD) from the Kaggle (Kgg) source and was first published in 2022. Of these, 14,139 entries are from people free from any mental health issues, labeled 0, while 13,838 entries are from people who are suffering from mental health issues, labeled 1, as visualized in Table 3.

Dataset 4: SD-Twi-23
This dataset contains 31,000 tweets published between January 2018 and January 2021. It was labeled "SD-Twi-23" since it comprises sentimental data (SD) taken from Twitter (Twi) and was first published in 2023. Of these, 16,000 tweets were labeled "negative" and were extracted using keywords such as 'sad', 'bad', and 'negative'. An additional 15,000 tweets labeled "neutral" were collected without any filters to serve as a control group for the analysis.

Dataset 5: SD-SemEval-16
This dataset contains tweets with the tags "Positive" and "Negative" from SemEval-2016 Task 4 Subtask B [40]. It was labeled "SD-SemEval-16" since it comprises sentimental data (SD) taken from the 2016 competition. The tweets were categorized into the categories train, dev, devtest, and test, with 14,042 labeled as having positive sentiment and 3677 as having negative sentiment, as visualized in Table 4.

Overall Architecture and Pre-Processing
Figure 1 depicts the overall architecture of our study. We began by collecting datasets using published resources and APIs such as Twint. The first block was the pre-processing of the dataset prior to the application of the DL models. Firstly, we performed data cleaning by converting the input to lowercase and removing punctuation and symbols. Then, we tokenized the input paragraphs using word tokenization to convert the sentences into a stream of tokens that can be passed to the machine. Lemmatization was performed to convert words to their base form, or lemma, while retaining their inherent meaning. Stop words, including articles, pronouns, and conjunctions, were removed from the tokens. Finally, we performed embedding to map the processed input to its vector counterpart. Embedding is necessary to represent text data as vectors in a high-dimensional space, and we used Word2vec and pre-trained BERT embedding techniques to create a distributed representation of words that captures semantics and relationships among the words.
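A minimal sketch of this pre-processing pipeline using NLTK follows; it is illustrative only, and the exact libraries and parameters of our implementation are not reproduced here.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the tokenizer, lemmatizer, and stop-word resources
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("stopwords")

LEMMATIZER = WordNetLemmatizer()
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Clean, tokenize, lemmatize, and remove stop words from a raw string."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # lowercase, drop punctuation/symbols
    tokens = word_tokenize(text)                   # word tokenization
    return [LEMMATIZER.lemmatize(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("I can't focus; everything feels pointless lately."))
# e.g., ['focus', 'everything', 'feel', 'pointless', 'lately']
```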
The power of AI was used once the data preparation had been conducted. Here, we divided the dataset into training and testing sets, and then, built the training models for the (a) SDL models and (b) EDL models. These training models were then used to transform the test datasets, yielding the prediction labels, which were then used for performance evaluation and explainability using the explainable AI module.


Solo Deep Learning and Ensemble Deep Learning Architectures
For our preliminary data analysis, we built a total of 16 models, which included 11 SDL and 5 EDL models constructed from the SDL models. Among the SDL models, we developed three unidirectional models: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and recurrent neural network (RNN), along with their corresponding bidirectional versions: Bidirectional LSTM (BiLSTM), Bidirectional GRU (BiGRU), and Bidirectional RNN (BiRNN), comprising six models. Additionally, we created two pre-trained SDL models: the Bidirectional Encoder Representations from Transformers (BERT) [44] model and its optimized version, A Lite BERT (ALBERT) [45], from Huggingface. BERT and ALBERT were chosen over other models, such as DistilBERT [46] and XLNet [47], for several reasons. While DistilBERT and XLNet offer comparable performance with faster inference times [48], the larger model sizes and training procedures of BERT and ALBERT provide more powerful representations, capturing complex contextual information and improving overall performance. This advantage was particularly important for our cross-domain approach across all four datasets, as BERT and ALBERT exhibited better adaptability and understanding of text nuances across different domains [49].
In addition to their advantages mentioned earlier, BERT and ALBERT excel in their ease of incorporating attention mechanisms and additional attention channels. Their complex architectures include multiple attention heads that can attend to different parts of the input, enabling straightforward extension of the models with extra attention layers and refinement of the attention mechanisms. Furthermore, BERT and ALBERT offer a wide range of pretrained models, providing flexibility and adaptability to meet various experimental requirements. We were able to choose from various model sizes and variations, allowing for customization and optimization based on our specific needs.
While individual deep learning models have shown limited success in detecting depression, combining them into hybrid deep learning (HDL) models has been shown to improve performance and overcome data scarcity [50]. By leveraging multiple architectures, hybrid models can address domain-specific challenges and improve accuracy in tasks such as detecting depression [51][52][53][54]. In this spirit, we constructed three HDL models: a CNN-LSTM, a CNN-BiLSTM, and a BERT-BiLSTM. Together, these constitute our total of eleven SDL models.
We constructed five EDL models using fusion through concatenation of seven SDLs. These EDL models, designed to surpass their individual SDLs, are depicted in Figures 2-6, and their constituents are detailed in Table 5. While all of these architectures follow a similar skeleton, it is the combination of different SDL models, the sequence of layers in each SDL model, and their embeddings that differentiate the architectures and affect their training. To ensure that the anticipated performance gain is a general trait, independent of the constituent models' architecture, it was necessary to create five different EDL models. We anticipate that the performance of the EDL models will vary because their cores are built from different SDL models. Hence, each EDL model will retain some behavior of its original SDL models, which is then optimized within the EDL architecture.
All of the architectures have two modules that can accept two identical or different sets of input tokens. The modules converge into a concatenation layer, followed by a depression detection module that consists of max pooling, dropout, and a fully connected dense layer.
The EDL1 model utilizes two modules: GRU and CNN-LSTM; these are visualized in Figure 2. The GRU module contains a self-attention layer followed by dense and dropout layers. The CNN-LSTM module includes a convolution layer and a max pooling layer, which is connected to an LSTM layer. The EDL2 model employs the same CNN-LSTM module, along with a BiLSTM module that has self-attention enabled within it, as shown in Figure 3. EDL3 and EDL4 use a CNN-LSTM module with convolution layers along with a BiLSTM module, but with Word2vec embedding replaced by pre-trained BERT and ALBERT embedding, respectively, as visualized in Figures 4 and 5. EDL5 incorporates both ALBERT and BERT-BiLSTM modules, where each module includes a self-attention layer. These two modules are concatenated to form the depression detection module. Additionally, the BERT module also contains a BiLSTM layer, as depicted in Figure 6.
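As an illustration of this two-module pattern, the following is a minimal Keras sketch loosely following the EDL1 layout (a GRU branch with self-attention plus a CNN-LSTM branch); all layer sizes are assumed values, not the paper's exact hyperparameters. Each branch is pooled to a fixed-length vector before concatenation, since the two branches produce sequences of different lengths.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 20000, 128, 100  # assumed values

tokens_a = layers.Input(shape=(SEQ_LEN,), name="module_a_tokens")
tokens_b = layers.Input(shape=(SEQ_LEN,), name="module_b_tokens")

# Module A: GRU branch with self-attention (loosely following EDL1)
emb_a = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens_a)
gru = layers.GRU(64, return_sequences=True)(emb_a)
att = layers.MultiHeadAttention(num_heads=4, key_dim=16)(gru, gru)
branch_a = layers.Dense(64, activation="relu")(layers.Dropout(0.2)(att))

# Module B: CNN-LSTM branch (convolution -> max pooling -> LSTM)
emb_b = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens_b)
conv = layers.Conv1D(64, kernel_size=3, activation="relu")(emb_b)
pool = layers.MaxPooling1D(pool_size=2)(conv)
branch_b = layers.LSTM(64, return_sequences=True)(pool)

# Depression detection module: concatenation -> dropout -> dense classifier
vec_a = layers.GlobalMaxPooling1D()(branch_a)
vec_b = layers.GlobalMaxPooling1D()(branch_b)
merged = layers.Dropout(0.3)(layers.Concatenate()([vec_a, vec_b]))
output = layers.Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[tokens_a, tokens_b], outputs=output)
```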
Attention mechanisms in deep learning are used to improve model performance by allowing the model to selectively focus on important parts of the input. By doing so, attention mechanisms can improve the accuracy of the model's predictions and reduce training time. Additionally, attention mechanisms can increase the interpretability of the model by providing insight into which parts of the input are most relevant to a given prediction. Hence, we applied attention layers using multi-head self-attention to all the SDL and EDL models to observe the effects of incorporating attention into the models.
Multi-head self-attention is a variant of the self-attention mechanism that involves computing multiple attention heads in parallel and concatenating the output of each head before applying a linear transformation. The multi-head self-attention mechanism that we incorporated can be mathematically illustrated as follows.

Attention Score Computation: Let $X = [x_1, x_2, \ldots, x_n]$ be the input sequence of length $n$, and let $H = [h_1, h_2, \ldots, h_n]$ be the output sequence of the multi-head self-attention layer, where $h_i$ is the representation of the $i$th element in $H$.

First, we compute the Query ($Q$), Key ($K$), and Value ($V$) vectors, whose projection matrices are learnable parameters, and where $d_{ks}$ is the dimensionality of the query. Then, we calculate the attention scores using the SoftMax function $s$ and multiply them with $V$ to obtain the weighted sum for each head $i$:

$\mathrm{head}_i = s\left( \dfrac{Q_i K_i^{\top}}{\sqrt{d_{ks}}} \right) V_i$ (1)

Then, we compute the output sequence $H$ as the concatenation of each head, where $\oplus$ denotes the concatenation function:

$H = \mathrm{head}_1 \oplus \mathrm{head}_2 \oplus \cdots \oplus \mathrm{head}_h$ (2)

Lastly, we apply a linear transformation to map the output and obtain the desired output size (final output sequence) $H_{os}$ from $H$, where this is the desirable size for the next layer:

$H_{os} = H a$ (3)

where $a$ is a learnable weight matrix for mapping the output of the layer to the desired output size.
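A minimal NumPy sketch of Equations (1)-(3) follows, with randomly initialized stand-ins for the learnable matrices (illustrative only; the trained models learn these weights).

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads=4, d_ks=16, d_out=64, seed=0):
    rng = np.random.default_rng(seed)
    n, d_model = X.shape
    heads = []
    for _ in range(num_heads):
        # Learnable projections for Q, K, and V (stand-ins here)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_ks)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        scores = softmax(Q @ K.T / np.sqrt(d_ks))  # Equation (1): attention scores
        heads.append(scores @ V)                   # weighted sum per head
    H = np.concatenate(heads, axis=-1)             # Equation (2): concatenation
    W_a = rng.normal(size=(H.shape[-1], d_out))    # Equation (3): output mapping
    return H @ W_a

X = np.random.default_rng(1).normal(size=(100, 128))  # n=100 tokens, d_model=128
print(multi_head_self_attention(X).shape)             # (100, 64)
```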

Training and Loss Functions
The models in the study were trained using a batch size of 128 and an input layer of 100 tokens, with a binary cross-entropy (BCE) loss function. Binary cross-entropy is a loss function used in ML for binary classification problems. Cross-entropy is a mathematical function that is defined in terms of the logarithm of the predicted label and the gold standard label. It measures the difference between the predicted probabilities of the positive class and the true labels, and penalizes the model for large errors. The binary cross-entropy loss function is denoted as $L_{bce}$, and mathematically, it can be expressed as:

$L_{bce} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ Y_i \times \log\left(\hat{Y}_i\right) + \left(1 - Y_i\right) \times \log\left(1 - \hat{Y}_i\right) \right]$ (4)

where $N$ is the number of samples in the dataset, $Y_i$ is the ground truth label (either 0 or 1), $\hat{Y}_i$ is the predicted probability of the positive class, $\log$ is the natural logarithm, and $\times$ denotes multiplication.

Table 6 provides information on the number of epochs each model took, as well as their initial learning rates and optimizers. As shown, EDL1 and EDL2 used Adam optimizers, while EDL3, EDL4, and EDL5 used SGD optimizers. The models were trained and tested on a 9:1 split using the K10 protocol. The initial learning rate for EDL1, EDL2, and EDL3 was 2 × 10⁻⁵, and for EDL4 and EDL5, it was 1 × 10⁻⁴. Finally, it should be noted that EDL1 was trained for 30 epochs, EDL2 and EDL3 were trained for 40 epochs, EDL4 was trained for 45 epochs, and EDL5 was trained for 50 epochs. The study was implemented using Python 3.8 and the TensorFlow framework. To implement the system, an NVIDIA P100 16 GB Graphics Processing Unit (GPU) was utilized. Additionally, the system was equipped with an Intel Xeon processor and 12 GB of RAM.
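A minimal Keras sketch of this training setup follows, continuing the ensemble sketch above; the array names (x_train_a, x_train_b, y_train, and the validation arrays) are assumed placeholders, and the optimizer/learning-rate pairing shown follows the EDL1 row of Table 6.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),  # SGD(1e-4) for EDL4/EDL5
    loss=tf.keras.losses.BinaryCrossentropy(),               # Equation (4)
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)
history = model.fit(
    [x_train_a, x_train_b], y_train,          # two token inputs for the EDL models
    validation_data=([x_val_a, x_val_b], y_val),
    batch_size=128,
    epochs=30,                                # 30-50 epochs depending on the model
)
```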

Experimental Protocols
Based on our preliminary analysis and introduction, we developed an experimental workflow, which is outlined in this section. Initially, we examined the SDL models and compared the advantages of bidirectional models over unidirectional models. Subsequently, we investigated how combining SDL models with EDL models can enhance performance on standard datasets. We then evaluated the impact of adding an attention layer to these models on the overall performance of the depression and sentiment analysis task. Finally, we cross-validated our observed results and demonstrated the domain adaptability of our system by performing an unseen paradigm (a situation where the deep learning model is tested on a new and previously unseen task or dataset that is significantly different from the data it was trained upon).

Experiment 1: Unidirectional vs. Bidirectional SDL Models
We conducted this experiment on the SDL models, comparing the performance metrics of unidirectional models versus their bidirectional counterparts (LSTM vs. BiLSTM, GRU vs. BiGRU, and RNN vs. BiRNN), averaged across our four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23), to visualize how the bidirectional models fared compared to the unidirectional models under a constant K10 partition protocol.

Experiment 2: SDL Models vs. EDL Models
The aim of this experiment was to determine whether the EDL models are superior to their corresponding SDL models, averaged across our four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23). For this experiment, we compared five EDL models and seven SDL models to demonstrate how joining multiple SDL models can improve performance under a constant K10 partition protocol.

Experiment 3: Effect of Training Size on the Performance of SDL/EDL Models
To validate the robustness of the models, we applied four cross-validation protocols, namely, K2, K4, K5, and K10, to vary the training size for each model and evaluate the corresponding performance drop resulting from reducing the training size. We utilized these partition protocols across the four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23) and averaged the results to illustrate how data size affects our model.
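A sketch of these partition protocols using scikit-learn follows (K10 corresponds to a 90:10 train:test split per fold, K2 to 50:50); `texts`, `labels`, and the `build_model` factory are assumed placeholders, with `build_model` returning a compiled model whose first metric is accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

for k in (10, 5, 4, 2):  # K10, K5, K4, K2 protocols
    accs = []
    kfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    for train_idx, test_idx in kfold.split(texts, labels):
        model = build_model()  # fresh model per fold (hypothetical factory)
        model.fit(texts[train_idx], labels[train_idx], epochs=5, verbose=0)
        _, acc = model.evaluate(texts[test_idx], labels[test_idx], verbose=0)
        accs.append(acc)
    print(f"K{k}: mean accuracy = {np.mean(accs):.4f}")
```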

Experiment 4: EDL Models without Attention Block vs. EDL Models with Attention Block
The purpose of this experiment was to observe the change in performance of the SDL and EDL models when augmented with a self-attention block after the classifier in the architecture, compare them with the original EDL models across all datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23), and benchmark them under a constant partition protocol.

Experiment 5: Domain Adoption of Ensemble Deep Learning Models in Unseen Paradigm
This experiment was one of the most critical, as it aimed to evaluate the EDL model's performance when encountering cross-domain data (where a model was trained on one domain and was then applied to test a different domain) using an "unseen test dataset". Specifically, we trained the model on one dataset and evaluated it on a different dataset with varying domains and semantics to demonstrate its generalization ability. Our model showcased domain adaptation following training on a single domain of sentiment data and an evaluation of its ability to adapt to a new domain by testing its performance on depression data. We were able to transfer knowledge learned from the original domain to a new domain and assess the model's ability to generalize to different tasks and datasets.
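One way to script the 4 × 3 = 12 cross-domain sub-experiments is sketched below; the per-dataset arrays and the `build_model` factory are assumed placeholders, under the assumption that all datasets share the same tokenizer and pre-processing.

```python
datasets = {
    "SD-Sford-09": (x_sford, y_sford),
    "DD-Red-14":   (x_red, y_red),
    "DD-Kgg-22":   (x_kgg, y_kgg),
    "SD-Twi-23":   (x_twi, y_twi),
}

for train_name, (x_tr, y_tr) in datasets.items():
    model = build_model()  # hypothetical factory returning a compiled model
    model.fit(x_tr, y_tr, epochs=5, verbose=0)
    for test_name, (x_te, y_te) in datasets.items():
        if test_name == train_name:
            continue  # skip the "seen" case; 4 x 3 = 12 unseen sub-experiments
        _, acc = model.evaluate(x_te, y_te, verbose=0)
        print(f"train={train_name} test={test_name} acc={acc:.4f}")
```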

Performance Metrics
The proposed models were evaluated using the parameters "true positive (TP)", "true negative (TN)", "false positive (FP)", and "false negative (FN)", which are defined as follows: If a normal/neutral sentiment input is detected as a normal/neutral sentiment by the depression detection mechanism, then it is identified as a true positive (TP). If a depressive/negative sentiment input is detected as a depressive/negative sentiment by the depression detection mechanism, then it is identified as a true negative (TN). In the other case, if a depressive/negative input is detected as a normal/neutral sentiment by the mechanism, then it is identified as a false positive (FP). Finally, if a normal/neutral input is detected as a depressive/negative sentiment by the mechanism, then it is identified as a false negative (FN). Using these parameters, we can derive the following PE parameters: (i) Accuracy: this denotes the overall correct predictions out of the total predictions made (Equation (5)). (ii) Recall (R): this is the number of correctly predicted positive class predictions made out of all the positive members in the dataset (Equation (6)). (iii) Precision (P): this is the ratio of correctly predicted positive class predictions to the total number of classified positive predictions (Equation (7)). (iv) F1-Score (F): this is defined as the harmonic mean of precision and recall. It is useful for imbalanced datasets (Equation (8)). (v) Finally, the area-under-the-curve (AUC) represents the two-dimensional area underneath the plotted ROC curve.
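These metrics follow their standard definitions:

$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (5)

$R = \dfrac{TP}{TP + FN}$ (6)

$P = \dfrac{TP}{TP + FP}$ (7)

$F1 = \dfrac{2 \times P \times R}{P + R}$ (8)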

Mean and Standard Deviation of the Statistics
In this study, we propose formulations for measuring the overall robustness of the model. To accomplish this, we measure six quantities in this section. η(m) denotes model m's accuracy summarized over all D datasets; η(d) denotes the robustness of dataset d, achieved by summarizing M models; η_sys denotes the overall system robustness, obtained by averaging the accuracy achieved over M models and D datasets; α(m) denotes model m's area-under-the-curve summarized over all D datasets; α(d) denotes the robustness of dataset d, achieved by summarizing the area-under-the-curve over M models; and α_sys denotes the overall system robustness, obtained by averaging the area-under-the-curve achieved over M models and D datasets. All these formulas were computed in Section 3 using the K10 partition protocol.
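The formulas themselves are not reproduced above; one consistent formalization, assuming $\eta(m, d)$ and $\alpha(m, d)$ denote the accuracy and AUC of model $m$ on dataset $d$, is:

$\eta(m) = \dfrac{1}{D} \sum_{d=1}^{D} \eta(m, d), \qquad \eta(d) = \dfrac{1}{M} \sum_{m=1}^{M} \eta(m, d), \qquad \eta_{sys} = \dfrac{1}{MD} \sum_{m=1}^{M} \sum_{d=1}^{D} \eta(m, d)$

with $\alpha(m)$, $\alpha(d)$, and $\alpha_{sys}$ defined analogously using the AUC.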

Results
The experimental results of the protocols were obtained by employing four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23) and sixteen (11 + 5) models through the utilization of the TensorFlow framework. The training process was executed using a Tesla P100 GPU. Each result was obtained by conducting ten rounds of training and testing, and subsequently, calculating the mean value.

Unidirectional vs. Bidirectional SDL Models
In this study, we demonstrated that bidirectional models consistently outperform unidirectional DL models of comparable architecture. We tested this hypothesis by training and testing six baseline models, with both bidirectional and unidirectional variations. Across all three variations, the bidirectional models consistently outperformed the unidirectional models, as visualized in Table 7, validating our hypothesis. Specifically, the BiLSTM model achieved the greatest absolute increase in performance compared to the LSTM model, with an increase of 2.65% averaged over all four datasets, with BiRNN giving a 1.80% increase over RNN, and BiGRU giving a 1.47% increase over GRU. This can be attributed to the fact that bidirectional models have the ability to process data from past and future inputs, giving them better comprehension of the sequence and context, which can improve performance in comparison to unidirectional models, which only work in one direction.

SDL Models vs. EDL Models (without Attention)
This experiment demonstrates how EDL models outperform their individual components. We evaluated eleven SDL models and five EDL models, and their performance metrics were averaged over four datasets; the results are presented in Table 8. As shown in the table, the EDL models consistently outperformed their individual components in every case, with a mean increase in performance over the five EDL models of 4.49%. Moreover, the best increase in EDL model performance was observed in EDL4, which showed a 7.22% absolute increase over its component, SDL8. These results demonstrate the effectiveness of EDL models in improving the overall performance of DL models.
This could be attributed to the fact that EDL models are able to leverage the strengths of different models and overcome their weaknesses, leading to improved accuracy and generalization ability. In contrast, SDL models may struggle to capture complex relationships in the data or be prone to overfitting. EDL models address these limitations by combining the predictions of multiple models, each with different strengths and weaknesses, thereby reducing the risk of overfitting and improving the robustness of the model.

Cross-Validation Protocols of All Models
In this experiment, we studied the effect of training data size on the performance of our models. Tables 9 and 10 present the results of our analysis, showcasing how the accuracy and area-under-the-curve (AUC) metrics, respectively, gradually decrease over different cross-validation protocols (K10 (default), K5, K4, and K2). In this case, the accuracy of EDL5 dropped from 95.01% when using the K10 protocol to 90.07% when using the K2 protocol, and the AUC fell from 0.9251 when using the K10 protocol to 0.8667 when using the K2 protocol. Even with a reduced amount of training data in the K2 (50:50) validation protocol, the metrics of our EDL models did not drop significantly, demonstrating the generalizability of the models. These results suggest that our EDL models can be used effectively even when the amount of available training data is limited.

Effect of Attention on the SDL and EDL Models and Its Benchmarking against SemEval Dataset
The fourth experiment aimed to demonstrate the effect of using an attention layer in all of the SDL and EDL models. For this purpose, we implemented a self-attention channel on top of the EDL models. As demonstrated by Table 11, the use of attention increased the performance of the models summarized over all four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23). According to Table 11, mean aeSDL > aneSDL for all five of the PE metrics, and similarly, mean aeEDL > aneEDL for all five of the PE metrics. This further proves our hypothesis that "attention blocks" are a powerful paradigm in depression and sentimental analysis. Furthermore, with this experiment, we were able to establish a benchmark on the SemEval 2016 Subtask A dataset, with an accuracy of 85.09% and an AUC score of 0.8008; this is the highest accuracy achieved so far, using EDL5 with the self-attention block, giving a boost in accuracy of 3.86% compared to the best score for SDL11 with the self-attention block. These results, shown in Tables 12 and 13, demonstrate the effectiveness of our approach in achieving state-of-the-art performance in sentiment analysis on the SemEval dataset.

Unseen Tests Using Cross-Domain Testing for SDL and EDL Models
In this experiment, we demonstrate our model's ability to perform in a cross-domain setting by conducting unseen tests. We performed 12 sub-experiments on four datasets, involving training on one dataset and testing on a different one, covering all possible combinations. The performance results were averaged out for all the datasets, and the accuracy and percentage differences between seen and unseen accuracy are shown in Tables 14 and 15 for the SDL and EDL models. Our analysis showed that the mean difference between unseen and seen accuracy for the SDL models was ~3%. Similarly, the mean difference between unseen and seen accuracy for the EDL models was ~2.7%. The corresponding AUC values and percentage differences are shown in Tables 16 and 17 for the SDL and EDL models, respectively. Our analysis showed that the mean difference between the unseen and seen AUC for the SDL models was ~3%. Similarly, the mean difference between the unseen and seen AUC for the EDL models was ~2.4%. Note that the criterion for a robust design, leading to superior generalizability, was that the difference between seen and unseen analysis be less than 3% to 5% [54][55][56]; our system design demonstrates differences of less than 3%, which qualifies it as a robust, generalizable, and stable design, which is also part of our running hypothesis.

Performance Evaluation and Explainable AI
As part of the performance evaluation, the discriminative ability of the models' classifiers was characterized through their ROC curves, and bar charts were plotted to visualize the performance of the models. ROC curves and bar charts provide a visual representation of the models' performance. Overall, the performance evaluation provides insight into the strengths and weaknesses of the system and helps to identify areas for improvement. The reliability of the system was assessed to determine its robustness and the stability of the models. This was achieved through various statistical tests, such as the adjusted R-squared test and the paired t-test. The statistical tests were used to determine whether the differences in performance between the models are significant.
As part of the increasing interpretability of the AI models, explainable AI techniques were employed. These techniques provide insights into how the black-box models make decisions and help understand the factors contributing to depression detection.

Receiver Operating Characteristic Curves
ROC curves are used to evaluate the performance of the models across their entire operating range. In Figure 7, we visualize the effect of the size of the training data on the EDL5 model by implementing cross-validation protocols K10, K5, K4, and K2. We observe that the AUC for K10 is 0.9251, and the AUC for K2 is 0.8867. The ROC performance of the five EDL models (EDL1, EDL2, EDL3, EDL4, and EDL5) is visualized in Figure 8, with EDL5 having the highest AUC score of 0.9251 and EDL1 having the lowest AUC score of 0.8616.

Bar charts are helpful for visualizing the information present in tables more efficiently. Figure 9 showcases the accuracy of all EDL models averaged over the four datasets, with EDL1 having an accuracy of 87.63% and EDL5 having an accuracy of 95.01%. Figure 10 visualizes the effect on accuracy of the change in the amount of training data for EDL5 through the use of cross-validation protocols K2, K4, K5, and K10. The accuracy in K2 drops to 90.07% from 95.01% in K10 (default).

Reliability Analysis Using Statistical Tests
The stability of the system was validated through four statistical tests conducted on the EDL models across all five datasets. The tests performed were the adjusted R-squared test, two-tailed Z test, paired t-test, and ANOVA test. These tests were conducted to determine whether the predicted data were significant and to monitor the p-value in the paired t-test and ANOVA test to check whether it was less than 0.001 (p < 0.001). The results of these tests are presented in Table 18, across all five EDL models and the five datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, SD-Twi-23, and SD-SemEval-16).

Along the lines of [57], we conducted these tests and observed that all five EDL models showcased p < 0.001 in the paired t-test and the ANOVA test, signifying the significance of the data and validating their clinical importance. The adjusted R-squared test, which portrays the correctness of the model, illustrates the extent of a feature's variance, and the Z in the two-tailed tests denotes the Z-score, which describes the standard deviation above or below the population mean.
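A minimal SciPy sketch of the paired t-test and one-way ANOVA used for this reliability analysis follows; `fold_scores` is an assumed mapping from model name to per-fold accuracies (an illustrative structure, not the paper's actual numbers).

```python
from scipy import stats

def reliability_tests(fold_scores):
    """Run paired t-tests and a one-way ANOVA over per-fold accuracies."""
    names = list(fold_scores)
    best = names[0]
    # Paired t-test between one reference model and each other model
    for other in names[1:]:
        t, p = stats.ttest_rel(fold_scores[best], fold_scores[other])
        print(f"paired t-test {best} vs {other}: t={t:.3f}, p={p:.2e}")
    # One-way ANOVA across all models' fold accuracies
    f, p = stats.f_oneway(*fold_scores.values())
    print(f"ANOVA across models: F={f:.3f}, p={p:.2e}")
```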

Explainable AI
Given that deep learning models, such as BERT, are often considered black-box models, we recognized the importance of providing insights into the interpretability of our results. To address this, we employed the "SequenceClassificationExplainer" module from the "transformers-interpret" library in our paradigm, as showcased in Figure 11. This explainer allowed us to calculate the attribution of each word in a given sentence after cleaning, tokenization, and prediction. It enabled us to identify the most impactful tokens contributing to the sentiment classification. Additionally, by using fixed thresholds, we constructed masked sentences that highlight the most impactful tokens. In our study, a value closer to 0 indicates a depressive sentiment, while a value closer to 1 indicates a non-depressive sentiment. By incorporating this explainable AI technique, our aim was to shed light on the underlying factors influencing sentiment classification. The results over the two datasets are demonstrated in Figures 12 and 13. Although BERT itself does not inherently provide specific interpretability features, leveraging the explainability module helped us address the lack of fixed features and provided additional insights into the decision-making process of our model.
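A minimal sketch of this explainability step using the library named above follows; the checkpoint name and input sentence are illustrative placeholders, not the paper's exact setup.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

# Placeholder checkpoint; any fine-tuned sequence-classification model works
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

explainer = SequenceClassificationExplainer(model, tokenizer)
word_attributions = explainer("I feel hopeless and tired all the time")
print(word_attributions)                 # list of (token, attribution score) pairs
print(explainer.predicted_class_name)    # predicted label for the input sentence
```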

Discussion
Since we implemented 11 SDL and 5 EDL models on four datasets with and without attention paradigms (16 × 2 = 32 models), we summarize the primary and secondary findings of our comprehensive analysis. Further, it is critical to benchmark our design (aeEDL and aeSDL) against the existing studies in the domains of depression and sentimental analysis. Another important component is to elaborate on the bounds of the attention mechanism under which it was adapted. Lastly, as part of the discussion, we illustrate the strengths and weaknesses and possible extensions of this study.

Principal Findings
Through an exhaustive study, we have proven three major and three minor hypotheses. We developed eleven SDL models and five EDL models. Using our four main datasets (SD-Sford-09, DD-Red-14, DD-Kgg-22, and SD-Twi-23), we discovered that bidirectional SDL models outperformed unidirectional models. Building upon this finding, we discovered that EDL models outperformed their component SDL models by 4.49% and yielded better results when they were utilized together in the architecture. Using self-attention layers, we observed significant improvement in the performance of the DL models. This was further enhanced by incorporating attention mechanisms into our EDL architecture, leading to benchmark accuracy on the SemEval-2016 dataset. We observed that the increase in the mean accuracy (AUC) of aeSDL over aneSDL was 2.58% (1.73%), and the increase in the mean accuracy (AUC) of aeEDL over aneEDL was 2.76% (2.80%). When comparing EDL vs. SDL for non-attention and attention, the mean aneEDL was greater than aneSDL by 4.82% (3.71%), and the mean aeEDL was greater than aeSDL by 5.06% (4.81%). On the benchmarking dataset (SemEval), the best-performing aeEDL model (ALBERT + BERT-BiLSTM) was superior to the best aeSDL (BERT-BiLSTM) model by 3.86%. Furthermore, we validated our models through statistical tests, demonstrating their ability to effectively handle cross-domain challenges by performing well on unseen paradigms and predicting on different domains to those on which they were trained. We met the regulatory requirement by showing that the accuracy and AUC differences between unseen and seen paradigms were less than 3%.

Benchmarking: A Comparative Analysis
The crux of our study was positioned using an attention-enabled paradigm in EDL models. These EDL models were designed by fusing DL-based models. Thus, it is important to evaluate our framework against the previous SDL and EDL models. We therefore decided to squarely address the benchmarking efforts in two consecutive steps, with step one involving a comparison of our proposed models with previous DL models and step two consisting of a deeper comparison of our proposed models with previous EDL models. Since the total numbers of studies in sentiment analysis and depression detection were 9 and 18, respectively, we organized our benchmarking into two clusters in the form of two tables, namely, Tables 19 and 20. Table 19 focuses on nine studies that did not use attention in their architecture, and Table 20 consists of studies where attention blocks were an integral part of the paradigm. Table 19 showcases fourteen attributes for each of the nine studies. Columns C1 to C14 are as follows: the year of the study (C1); the last name of the author (C2); the main objective of the paper (C3); the base model (C4); the use of an SDL vs. an EDL model (C5); the fusion or stacking technique used, if any (C6); the main method used (C7); the data type used (C8); data size (C9); the evaluation metric (C10) and evaluation score (C11); scientific validation (C12) and clinical validation (C13); and the conduction of an unseen paradigm, if any (C14).
The studies demonstrated up to four kinds of data source, namely, social media chat, reviews, or published psychological data (column C8). Unlike these, in our proposed study (R10), keeping generalizability in mind, we used five kinds of data source, namely, Twitter, Twitter (Stanford), Reddit, Kaggle, and SemEval (for benchmarking). Furthermore, these nine studies had a data size range of 718 to 550,000 sentences (column C9), while in our proposed study (R10), the data size ranged from 10,000 to 1,000,000 sentences. All of these studies computed at least one out of accuracy, F1-Score, or AUC (columns C10, C11). Our study (R10) outperformed existing studies and yielded a mean accuracy of 91.44% across all EDL models and 95.01% in the best EDL5 model, and an F1-Score of 0.8941. Seven out of the nine studies [58][59][60][62][63][64][65] presented some sort of scientific or clinical validation by performing an ablation study, providing p-values of less than 0.001, or by performing cross-validation (columns C12, C13), unlike in our proposed study (R10), where we conducted exhaustive tests, including six individual statistical tests that yielded p-values of less than 0.001, deployed four cross-validation protocols, and achieved an overall standard deviation of less than 2.5%. Further, it is noted that only our proposed work (R10) conducted a true unseen paradigm (column C14) by training and testing the model on datasets of different domains, thus proving its generalizability over cross-domains.

Table 20 shows the state-of-the-art DL models used for sentiment analysis and depression detection. It showcases fifteen attributes for each of the eighteen studies. Columns C1 to C15 are as follows: the year of the study (C1); the last name of the author (C2); the main objective of the paper (C3); the base model (C4); the use of an SDL vs. an EDL model (C5); the fusion or stacking technique used, if any (C6); the attention block technique (C7); the main method used (C8); the data type used (C9); data size (C10); the evaluation metric (C11) and evaluation score (C12); scientific validation (C13) and clinical validation (C14); and the conduction of an unseen paradigm, if any (C15).
Studies [67][68][69][71][74][76][78][79][80][82][83] presented some sort of scientific and clinical validation by performing ablation studies, yielding p-values of less than 0.001, or performing cross-validation (columns C13, C14), unlike in our proposed study (R19), where we conducted six individual statistical tests that yielded p-values of less than 0.001, deployed four cross-validation protocols, and achieved an overall standard deviation of less than 2.5%. While only three studies attempted to validate unseen data [74,83,84] by employing sub-sampling, data merging, or cross-domain validation (column C15), it is noted that only our proposed work (R19) conducted a true unseen paradigm through training and testing on data of different domains, thus proving its generalizability.

A Special Note on Attention in Depression Detection
Attention mechanisms help in depression detection by allowing the model to selectively focus on important parts of the input text, instead of treating the entire text equally. This is particularly useful in depression detection, where certain words or phrases may be more indicative of depression in some individuals than in others. For example, an attention mechanism could help the model to identify important keywords or phrases that are highly indicative of depression, such as negative self-talk, hopelessness, or social isolation.
In depression detection, certain words or phrases may carry more weight than others, and attention mechanisms can help the model identify and prioritize these important features. Additionally, attention mechanisms can help the model better understand the context and meaning of the text by focusing on relevant information and ignoring irrelevant information. This can lead to improved accuracy and performance of the model in identifying depression in text data.

Strengths, Weakness, and Extensions
This article focuses on the application of EDL models and attention layer for depression detection. The study shows significant improvement in predicting sentiment and depression from multiple data sources, making the proposed EDL a benchmark in the field of depression detection. The EDL model outperforms existing studies on two datasets. Additionally, cross-validation, clinical validation, and unseen implementations prove the system's robustness and domain adaptability, as it performs fairly well on a different domain to that on which it was trained, demonstrating its generalization ability.
Due to the limited availability of high-quality open datasets on depression, the existing study focused mainly on training classifiers on specific datasets. Consequently, the model's accuracy was not improved beyond 95%, although it still outperformed existing studies. To encourage further research in the field and improve current benchmark models, high-quality datasets of substantial size are necessary to build more robust and optimal models. The scope of this approach focuses only on the NLP text-based approach, and hence, we adopted the Twitter and Reddit datasets, as they follow the same paradigm. This was an in-depth exploration of the domain adaptation paradigm, which focused on the adaptation of ensemble-based NLP models through extensive experimentation. We developed 11 solo deep learning models and 5 ensemble models, which were specifically designed to leverage their capabilities in detecting depression from users' text patterns. Additionally, within our ensemble models, we incorporated attention channels to enhance explainability, highlighting key textual features that contribute to the classification decision. By employing these techniques, we aim to achieve both high performance in depression detection and meaningful explanations for the model's decisions, ensuring transparency and interpretability in our classification approach.
The exploration of multimodal videos and images will be considered as a potential continuation in future research. Through such research, we can explore datasets from visual-based social media platforms such as TikTok, YouTube, and Instagram. Here, we will shift our focus to video classification, which requires different methodologies compared to the NLP-based classification we have utilized thus far. We will employ computer vision-based classification models to analyze visual cues, facial expressions, body language, and other visual elements in order to detect depression using an entirely different paradigm.

In the future, our goal is to develop new datasets and explore novel architectures, such as Generative Adversarial Networks (GANs), for improving depression detection. We aim to compare these new models, such as the fusion of ML with an exhaustive feature space and DL [86], to our existing EDL models to evaluate their performance and perform variability analysis [87]. Additionally, we plan to develop new loss functions and incorporate multiple loss functions into our aeEDL models, as adopted in the imaging framework, to increase their robustness and improve their performance metrics [34,88]. The designed systems can also be pruned to reduce the size of the training models [89]. Since artificial intelligence designs are susceptible to bias, we further intend to review existing studies and rank them according to their bias [90][91][92][93].
Lastly, there have been studies in different domains, such as immunology [94,95], cardiovascular risk assessment [96], and psoriasis diagnosis [97], where cloud-based end-to-end systems are used for detection and moderation. We therefore intend to use a similar paradigm to create an automated and scalable cloud-based system using research into AI sentiment analysis to interpret the emotional content present in various forms of communication, such as text messages, social media posts, and online interactions. The proposed cloud system follows a layered architecture, where the presentation layer operates locally on users' devices, while the business and persistence layers are hosted on the cloud. This architecture ensures a user-friendly experience by providing real-time sentiment analysis and emotional guidance directly on the device. Additionally, this architecture facilitates secure connectivity between the system and psychologists, enabling them to access and utilize the system's insights to provide personalized support and assistance to their patients. The automated moderation provided by this system can greatly benefit psychologists in their practice. When patients visit a psychologist, it can sometimes be challenging for them to express their emotions fully. With the assistance of our automated system, psychologists can gain deeper insights into their patients' emotional well-being. This enhanced understanding will enable psychologists to provide more personalized and tailored treatment plans, improving the effectiveness of their interventions. By utilizing this system, psychologists can leverage technology to follow up with their patients' mental well-being in a more comprehensive and individualized manner.
The cloud-based nature of our system will play a crucial role in its capabilities. It will enable the secure storage and processing of a vast amount of data, allowing the system to continuously learn and enhance its understanding of emotions. This accumulated knowledge and analysis of sentiment data contribute to a more robust and accurate sentiment analysis process. Furthermore, the integration of this system into mobile phone-based applications will provide users with convenient access to its features. Users can benefit from real-time guidance and support, empowering them to manage their emotions and prioritize their mental well-being more effectively. This sentiment analysis system will act as a personal mental guide in a robust pipeline, operating locally to help users recognize and address their emotions. Additionally, the integration of AI sentiment analysis could enable holistic support for mental well-being, positively impacting individuals' lives by providing timely assistance and resources based on their emotional needs.

Conclusions
Our study presents a novel paradigm for depression detection and sentimental analysis in a cross-domain framework based on text inputs. This utilizes five kinds of attention-enabled ensemble deep learning model designed using eleven kinds of solo deep learning model. A comprehensive data analysis was conducted using four kinds of dataset to prove our hypothesis. Further, a benchmarking strategy was developed on the standardized SemEval dataset, establishing our model's superior performance in both classification accuracy and area-under-the-curve. As part of a generalizability assessment, "seen" and "unseen" experiments were conducted, with the model meeting the regulatory requirements. Finally, the system's reliability and stability were demonstrated using statistical tests.