A Multi-Class Deep Learning Approach for Early Detection of Depressive and Anxiety Disorders Using Twitter Data

Abstract: Social media occupies an important place in people’s daily lives where users share various contents and topics such as thoughts, experiences, events and feelings. The massive use of social media has led to the generation of huge volumes of data. These data constitute a treasure trove, allowing the extraction of high volumes of relevant information, particularly by involving deep learning techniques. Based on this context, various research studies have been carried out with the aim of studying the detection of mental disorders, notably depression and anxiety, through the analysis of data extracted from the Twitter platform. However, although these studies were able to achieve very satisfactory results, they nevertheless relied mainly on binary classification models by treating each mental disorder separately. Indeed, it would be better if we managed to develop systems capable of dealing with several mental disorders at the same time. To address this point, we propose a well-defined methodology involving the use of deep learning to develop effective multi-class models for detecting both depression and anxiety disorders through the analysis of tweets. The idea consists in testing a large number of deep learning models ranging from simple to hybrid variants to examine their strengths and weaknesses. Moreover, we involve the grid search technique to help find suitable values for the learning rate hyper-parameter due to its importance in training models. Our work is validated through several experiments and comparisons by considering various datasets and other binary classification models. The aim is to show the effectiveness of both the assumptions used to collect the data and the use of multi-class models rather than binary class models. Overall, the results obtained are satisfactory and very competitive compared to related works.


Introduction
In this research, we are interested in analyzing social Twitter data (tweets) to help detect psychological disorders, more specifically depression and anxiety disorders. Millions of people are now living with mental disorders, which are one of the leading causes of ill health worldwide. Therefore, early detection is crucial for rapid intervention in order to reduce the escalation of these disorders. In what follows, we first provide an overview of depression and anxiety disorders, then highlight the use of the Twitter platform to help deal with them and finally summarize the paper structure.
Table 1. Differences and commonalities between depressive and anxiety disorders [2].

Psychological diagnoses
In common with the same degree: Difficulty concentrating, fear, excessive worry and nightmares.
In common but of different degree: Sad/melancholy *** Sad/melancholy *
Which are not common points: Loss of interest (loss of pleasure = anhedonia, despair about the future), feelings of guilt or failure, low self-esteem,

Detection of Depression and Anxiety Disorders on the Twitter Platform
In general, social media allows users to post and share their feelings and moods. This has made it possible to analyze these contents in order to understand several mental disorders and make predictions accordingly. More specifically, the growing popularity of Twitter (currently known as the X platform) has contributed to making it an excellent data source for performing such content analyses, in particular for depression and anxiety detection. Indeed, people with severe symptoms of mental disorders are affected in their professional, family and social lives. This is why the automatic detection of these symptoms through social media would have important implications for those affected.
In this paper, we focus on the analysis of data extracted from the Twitter platform (i.e., tweets) with the aim of developing models capable of detecting mental disorders in users, more specifically depression and anxiety. In this regard, much research has been conducted in order to understand the statements expressed through tweets and to classify them into positive and negative sentiments while taking into account certain parameters (e.g., population, language, etc.). Traditional approaches used classic machine learning algorithms such as decision trees and SVMs (support vector machines) (see for instance [3-9]). However, as the data volumes have become very large, recent research has shifted towards deep learning techniques such as recurrent neural networks (RNN) and convolutional neural networks (CNN) (see for example [10,11]).
Even if the detection of depressive and anxious disorders using deep learning can give satisfactory results, these approaches nevertheless mainly rely on binary classification models by treating each mental disorder separately (i.e., depressive or non-depressive/anxious or non-anxious). This is because dealing with one single mental disorder is easier. Table 1 shows how difficult it is to distinguish between these mental disorders due to the existence of several symptoms in common (e.g., disturbed sleep, fluctuations, etc.). On the other hand, some symptoms that are not in common between depression and anxiety disorders (e.g., dizziness, heart palpitations, etc.) can overlap with other disorders such as heart disease and cancer. Thus, it would be better if we managed to develop effective models capable of treating more than one mental disorder at the same time.
To fill this gap, we propose a well-defined methodology involving the use of deep learning so as to develop efficient multi-class models for detecting depression and anxiety via the analysis of tweets. The objective is to classify tweets into three distinct classes: normal, potentially depressive and potentially anxious. This multi-class approach should allow a better understanding and a more precise assessment of the different nuances linked to these two mental disorders when they are expressed in tweets and thus improve the sensitivity and specificity of their detection.
The basic idea of our proposal is to build several multi-class deep learning models, considering both simple and hybrid variants obtained through an efficient combination of different models, in order to test them all. To validate our proposal, we first evaluate the performance of the tested models using different metrics. Then, the well-performing models are used to classify tweets from other datasets. Finally, we compare their performance with binary deep learning models that classify depressive and anxious disorders separately. As a result, the accuracy of our models reaches up to 93%, which is very competitive with related works and higher than that of binary models that separately predict depressive and anxious disorders.

Paper Structure
The rest of this paper is organized as follows: Section 2 reviews and summarizes some related works on depression and anxiety detection with a special focus on those involving the Twitter platform. Section 3 provides the details of the proposed methodology for the detection of depressive and anxious disorders using multi-class deep learning models. Section 4 summarizes the experimental stage, gives a set of numerical results and discusses and analyzes the obtained results. Finally, Section 5 provides some concluding remarks.

Related Works
Many people around the world suffer from mental disorders due to several factors such as quality of life and stress. Consequently, intensive research efforts have been made in terms of diagnosis and management. In this regard, the evolution of computing technologies has further supported these efforts in different ways, notably by involving artificial intelligence [12]. Indeed, as reported in [13], artificial intelligence methods could improve psychotherapy by providing therapists and patients with real-time or near-real-time recommendations based on the patient's response to treatment, especially since 40% of patients do not respond to psychotherapy as planned. In particular, machine learning and data mining techniques can be used to analyze a patient's history to diagnose a problem, thereby helping to mimic human reasoning or make logical decisions [12].
Much research has been conducted on the detection of depressive and anxiety mental disorders through social media platforms [3-11], in particular using Twitter, while considering different factors such as population, period, language, etc. Most of these studies rely on supervised machine learning models for text classification using either traditional learning techniques such as SVM, RF, NB and LR or deep learning approaches such as RNN, LSTM, GRU, Bi_RNN, Bi_LSTM and Bi_GRU. In addition, some approaches are designed around the hybridization of different models, such as combining different variants of CNN with RNN (see for instance [33,37]). The general scheme of this kind of analysis mainly consists in collecting data according to some assumptions and hypotheses (i.e., keywords, location, etc.), preprocessing these data, labeling the data according to the target classes, extracting the features, training the adopted models and finally evaluating their performance so that they can be deployed (i.e., they become ready for use). Tables 2-4 summarize and compare some typical research studies according to the classification techniques used.

Research Methodology
The proposed process uses multi-class classification models to categorize tweets as "normal", "potentially depressed" or "potentially anxious". In order to achieve this objective, we rely on a rigorous methodology which allows us to obtain efficient classifiers by exploiting Twitter data. This process follows a clear sequence of well-defined phases, as illustrated in Figure 1. In the following, we detail each phase by providing explanations on its role within the system.

Preparation Dataset
The goal of this phase is to obtain a large number of relevant tweets. To do so, four steps are required. First, raw data are collected using dedicated tools. Then, these data are preprocessed to make them ready for use. Next, the preprocessed data are labeled in order to bind them to one of the three classes, namely "normal", "potentially depressed" and "potentially anxious". Finally, the labeled data are balanced so that the class sizes are approximately equal.

Data Collection
The aim of this step is to collect a large dataset of tweets written in English. The period of tweets related to depression and anxiety is from 1 December 2019 to 31 December 2021. This period corresponds to the circumstances of the COVID-19 pandemic, where many people were affected by the requirements of confinement, isolation, risk of illness, loss of loved ones, etc. These poor living conditions have encouraged people to use social media to express their feelings. In contrast, the period of the tweets related to normal behaviors is from 25 January 2022 to 31 January 2022.
The keywords used to collect the data were carefully inspired by the symptoms of depression and anxiety summarized in Table 1. This procedure for collecting data from Twitter is widely adopted by several deep learning approaches for many purposes. In what follows, we give some typical cases. For instance, Shen et al. collected data for depression detection using keywords close to "(I'm/I was/I am/I've been) diagnosed depression" [36]. These data were reused in other works [5,28,36-38] for different purposes. Chang et al. use the disease names 'Borderline, bpd, bipolar' as keywords to predict borderline personality disorder (BPD) and bipolar disorder (BD) [39]. In [40], Wang collected data based on the names of five dietary supplements 'Melatonin, Kava, Ginkgo, Biloba, Ginseng' to predict depression, anxiety and mood disorders. Note that the use of a single word as a keyword (e.g., name of a disease or a food supplement) does not confirm that the user is sick, so the ambiguity rate is systematically high. In contrast, using these words together with one or more symptoms within an explanatory sentence may reduce the rate of ambiguity. This is because such sentences correspond to user statements and thus their content is more likely to contain negative sentiments and expressions that help train models.
To generate depressive and anxiety tweets, we first used patterns close to: "I am/was/have been diagnosed/identified with depression/anxiety". The aim is to target users who self-report their issues. Then, we intensified the search around these data using other keywords related to both common and non-common symptoms between depression and anxiety disorders. For common symptoms, we used several verbs like "feel", "suffer", "want", "can", "be", "have" under several forms (conjugated in the past and the present according to negative and affirmative forms, depending on the meaning targeted) combined with words related to "sleep", "appetite", "fatigue", "suicide", "death", "sadness", "melancholy", "fear", "worry", under several forms (nouns, adjectives, gerunds in addition to some of their synonyms). The degree of a given symptom was expressed using adverbs such as "so", "very", "little" (e.g., so sad, little sad).
In the same way, we have generated depressive and anxiety tweets based on the symptoms which are not in common. For depression disorder, we used keywords close to "loss of pleasure", "despair about the future", "feelings of failure". Regarding anxiety disorder, we used keywords close to "dizziness", "heart palpitations", "panic attack". All these keywords were involved under several forms such as nouns, adjectives, gerunds in addition to some of their synonyms. Finally, normal tweets were generated based on keywords related to positive sentiments and feelings such as "happiness", "love" and "beauty". Table 5 gives typical examples of such keywords used within some parts of sentences that can appear in tweets. Our choice to create our own dataset can be summarized in two main points. First, in the context of deep learning, it is better to rely on large volumes of data in the hope that they lead to good performance. Second, as one of the goals of our paper is to show the effects of the nuances between depression and anxiety disorders on the training process, it is better to rely on our own datasets provided that they follow a robust method leading to reliable data. On the other hand, one might ask whether the training of our models could be done using data extracted from other sources such as statements, reports and questionnaires of those affected in hospitals and clinics. Unfortunately, social media have their own specificities (post format, language used, emoticons, multimedia contents, etc.). So, even if a given user is affected by a mental disorder, she/he will most likely adapt to the way social media are used. Therefore, ideally, the models should be trained using data extracted from social media platforms.
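The keyword patterns described in this subsection can be sketched as a simple combination of word lists. The sketch below is only illustrative: the lists are heavily shortened, and the function and constant names are hypothetical rather than taken from the authors' collection scripts.

```python
from itertools import product

# Hypothetical, shortened word lists (the full lists used in the paper are not reproduced).
SELF_REPORT_PREFIXES = ["I am", "I was", "I have been"]
SELF_REPORT_VERBS = ["diagnosed with", "identified with"]
DISORDERS = ["depression", "anxiety"]

SYMPTOM_VERBS = ["I feel", "I felt"]
DEGREE_ADVERBS = ["so", "very", "little"]
COMMON_SYMPTOMS = ["sad", "tired", "worried"]


def self_report_queries():
    """Build quoted phrases such as "I am diagnosed with depression"."""
    combos = product(SELF_REPORT_PREFIXES, SELF_REPORT_VERBS, DISORDERS)
    return [f'"{prefix} {verb} {disorder}"' for prefix, verb, disorder in combos]


def symptom_queries():
    """Build quoted phrases such as "I feel so sad" (verb + degree adverb + symptom)."""
    combos = product(SYMPTOM_VERBS, DEGREE_ADVERBS, COMMON_SYMPTOMS)
    return [f'"{verb} {adverb} {symptom}"' for verb, adverb, symptom in combos]


if __name__ == "__main__":
    for query in self_report_queries()[:3]:
        print(query)
```

Each generated phrase would then be passed to the collection tool as a search query; how the queries are batched or filtered by date is not specified here and is left open.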

Preprocessing of Data
The data collection phase results in building three datasets, denoted as D0, D1 and D2, with a total size of over seven million tweets, as shown in Table 6. Unfortunately, these data are unclear, incomplete and unstructured, and contain errors and redundancy; therefore, it is not recommended to analyze them directly. This is why data preprocessing is a much-needed step to obtain relevant data. In our methodology, we have adopted 14 preprocessing techniques that handle: (1) emojis, (2) emoticons, (3) URLs, (4) hashtags (#), (5) mentions (@name), (6) special characters, (7) punctuation, (8) symbols, (9) digits, (10) repetitive letters in words, (11) extra whitespace, (12) uppercase letters, (13) contractions (e.g., "It's" becomes "It is") and (14) NaN values and duplicates in the text column. Table 6 gives the numbers of tweets before and after preprocessing the collected data. The word clouds are given in Figure 2, which shows the visual representation of the most frequent keywords (tags) in the preprocessed data of datasets D0, D1 and D2.
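A minimal sketch of such a cleaning pipeline is given below, using only the Python standard library. It is an approximation under stated assumptions, not the authors' code: the contraction table is a tiny hypothetical sample, the hashtag sign is dropped while the tagged word is kept, and runs of three or more identical letters are collapsed to two.

```python
import re

# Tiny illustrative contraction table (assumption; a real pipeline needs a fuller list).
CONTRACTIONS = {"it's": "it is", "i'm": "i am", "can't": "cannot", "don't": "do not"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text):
    """Apply steps (1)-(13) to a single tweet and return the cleaned string."""
    text = text.lower()                                   # (12) uppercase letters
    text = text.replace("\u2019", "'")                    # normalise curly apostrophes
    text = URL_RE.sub(" ", text)                          # (3) URLs
    text = MENTION_RE.sub(" ", text)                      # (5) mentions
    for short, full in CONTRACTIONS.items():              # (13) contractions
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # (1), (2), (4), (6)-(9): anything that is not a letter or whitespace is dropped,
    # which covers emojis, emoticons, the '#' sign, punctuation, symbols and digits.
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)        # (10) repetitive letters
    return re.sub(r"\s+", " ", text).strip()              # (11) extra whitespace


def clean_corpus(tweets):
    """Clean every tweet, then drop missing values, empty results and duplicates (14)."""
    seen, cleaned = set(), []
    for raw in tweets:
        if not isinstance(raw, str):                      # skips None / NaN entries
            continue
        text = clean_tweet(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

For example, `clean_tweet("Sooooo sad 😢 :( it's 2 much... @bob http://t.co/x #depressed")` yields `"soo sad it is much depressed"`.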

Data Labeling
The next step is data labeling, which assigns a label to each tweet in the datasets based on its class. The tweets from datasets D0, D1 and D2 are bound to the three classes "normal", "potentially depressed" and "potentially anxious", respectively. Therefore, we have labeled tweets from dataset D0 with value '0', tweets from dataset D1 with value '1' and finally tweets from dataset D2 with value '2'. This data labeling aims to build classification models that only classify tweets as potentially positive towards depressive and anxiety mental disorders or not; thus, the analysis is done at the tweet level. If a tweet is classified as positive, the behavior of the concerned user on social media platforms can then be analyzed by other systems which further process user data in order to make decisions (user-level analysis).
In general, data collected from social media should always be treated with a certain degree of caution. This is why we collected a large volume of data relating to users self-reporting their cases, in order to increase the degree of confidence in the statements contained in the tweets. Moreover, according to the above-stated objectives, our models may allow a certain tolerance regarding the confidence of tweets toward mental disorders because they do not make decisions about users but only classify tweets for further processing. In addition, large volumes of data are generally more suitable for deep learning approaches in order to obtain good results.

Balancing Data
After labeling, datasets D0, D1 and D2 are merged into a single dataset denoted as Main_dataset. Imbalanced datasets are those for which the target classes have an uneven distribution of observations, leading to the appearance of minority and majority classes [41]. This risks producing models with poor predictive performance, particularly for minority classes. Regarding our dataset, Table 6 shows that, after preprocessing, the contents of datasets D0, D1 and D2 represent approximately 32.00%, 32.63% and 35.37% of the total, respectively. Consequently, our main dataset is quite balanced. Next, Main_dataset is randomly divided into three balanced datasets that we refer to as Train_dataset, Test_dataset and Eval_dataset, as shown in Figure 3. Train_dataset contains 70% of the tweets from each of the datasets D0, D1 and D2, which represents 70% of the total tweets from Main_dataset; it is used to train the models. Test_dataset contains 15% of the tweets from each of the datasets D0, D1 and D2, which represents 15% of the total tweets from Main_dataset; it is used as a test dataset throughout the training of the models. Finally, Eval_dataset contains the remaining tweets (about 15% of the total tweets); it is used in the evaluation phase.
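The labeling, merging and 70/15/15 splitting steps can be sketched as follows. This is a standard-library illustration under assumptions: the function names are hypothetical, the random seed is arbitrary, and the split is made per class so that each subset keeps the class proportions of Main_dataset.

```python
import random
from collections import Counter


def label_and_merge(d0, d1, d2):
    """Attach labels 0 (normal), 1 (depression) and 2 (anxiety), then merge."""
    return [(t, 0) for t in d0] + [(t, 1) for t in d1] + [(t, 2) for t in d2]


def class_shares(dataset):
    """Return the fraction of examples per label, to check how balanced the data are."""
    counts = Counter(label for _, label in dataset)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


def stratified_split(dataset, train=0.70, test=0.15, seed=42):
    """Split each class into train/test/eval parts (the remainder goes to eval)."""
    rng = random.Random(seed)
    by_class = {}
    for item in dataset:
        by_class.setdefault(item[1], []).append(item)

    splits = {"train": [], "test": [], "eval": []}
    for items in by_class.values():
        rng.shuffle(items)
        n_train = round(len(items) * train)
        n_test = round(len(items) * test)
        splits["train"] += items[:n_train]
        splits["test"] += items[n_train:n_train + n_test]
        splits["eval"] += items[n_train + n_test:]

    for part in splits.values():
        rng.shuffle(part)  # mix the classes inside each subset
    return splits
```

Splitting per class rather than over the merged list guarantees that each of the three subsets contains the stated percentage of every class, which a plain random split only achieves approximately.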

Tokenization
Tokenization is a crucial procedure in our process. It breaks up each tweet in the dataset into words called tokens. These tokens help understand the context and thus develop the model for natural language processing tasks. In our dataset, the maximum length of tweets is 131 words.
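As an illustration, word-level tokenization followed by padding to the maximum length of 131 can be sketched as below. The whitespace tokenizer, the `<pad>` and `<unk>` entries and the function names are assumptions made for this sketch; the text does not specify which tokenizer implementation was used.

```python
from collections import Counter

MAX_LEN = 131  # maximum tweet length (in words) reported for the dataset


def build_vocab(tweets):
    """Map each word to an integer index, most frequent words first."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved indices (assumption)
    for word, _ in counts.most_common():
        vocab[word] = len(vocab)
    return vocab


def encode(tweet, vocab, max_len=MAX_LEN):
    """Turn a cleaned tweet into a fixed-length list of token indices."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in tweet.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    vocab = build_vocab(["i feel so sad", "i feel happy"])
    print(encode("i feel very sad", vocab, max_len=6))
```

Fixed-length sequences are what allow tweets to be grouped into batches for the embedding layer described next.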

Feature Extraction
This phase aims to extract the most important features from tweets. In our case, we use word embedding, which is one of the most popular representations of document vocabulary. It helps extract many useful features of a given word in a document (e.g., context, semantics, etc.). For this task, we rely on the GloVe (Global Vectors) model, which produces vector representations for words by integrating global word co-occurrence statistics [42]. GloVe was developed as an open-source project at Stanford University and released in 2014. In our work, the pre-trained word vectors used are the GloVe Twitter word embeddings (200 d), which were trained on 2 billion tweets (containing 27 billion tokens and a vocabulary of 1.2 million words). These data are made available under the Public Domain Dedication and License v1.0 [43].
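The way pre-trained vectors are mapped onto a vocabulary can be sketched as follows. The sketch assumes the usual GloVe text format (one word per line followed by its space-separated vector components) and uses a toy 3-dimensional sample in place of the real 200-dimensional Twitter file; the function names are hypothetical.

```python
import io


def load_glove(handle):
    """Parse GloVe-style lines ('word v1 v2 ... vd') into a word -> vector dict."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay at zero."""
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for word, index in vocab.items():
        vector = vectors.get(word)
        if vector is not None and len(vector) == dim:
            matrix[index] = vector
    return matrix


if __name__ == "__main__":
    # Toy 3-d sample; with the real file one would open it and use dim=200.
    sample = io.StringIO("sad 0.1 0.2 0.3\nhappy -0.1 0.0 0.5\n")
    vocab = {"<pad>": 0, "<unk>": 1, "sad": 2, "joy": 3}
    print(build_embedding_matrix(vocab, load_glove(sample), dim=3))
```

The resulting matrix is typically used to initialize the embedding layer of the network, so that each token index produced by the tokenizer is replaced by its pre-trained vector.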

Training the Models
In order to build well-performing models for classifying normal, depression and anxiety cases, our proposal is based on:

• An efficient hybridization that combines the CNN model with other types of neural networks, namely (1) Simple RNN, (2) LSTM, (3) GRU, (4) Bidirectional RNN (BiRNN), (5) BiLSTM and (6) BiGRU, to take advantage of their respective strengths. Subsequently, we build hybrid multi-class classifier models according to our multi-labeled dataset of tweets;
• The optimization of the learning rate parameter, which is considered one of the most important parameters in deep learning-based tasks. To do so, we first adopt the Adam optimizer while initializing the learning rate parameter with 0.0001 (the smallest value). Then, we apply the grid search optimization technique to find the best learning rate value for each model in the interval [0.0001, 0.001].
Each trained deep learning classifier is saved as a model file (model.h5) so that it can later be used to predict normal cases and depressive and anxious disorders.
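The learning rate search described above can be sketched as a plain grid search over the interval [0.0001, 0.001]. The number of grid points is not stated in the text, so the ten evenly spaced values are an assumption; likewise, `toy_score` is a purely hypothetical stand-in for "train the model with this learning rate and return its validation accuracy".

```python
def learning_rate_grid(low=0.0001, high=0.001, steps=10):
    """Evenly spaced candidate learning rates in [low, high] (step count is assumed)."""
    if steps == 1:
        return [low]
    return [round(low + i * (high - low) / (steps - 1), 6) for i in range(steps)]


def grid_search(train_and_score, candidates):
    """Evaluate every candidate and return (best_learning_rate, best_score)."""
    best_lr, best_score = None, float("-inf")
    for lr in candidates:
        score = train_and_score(lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score


def toy_score(lr):
    """Hypothetical placeholder for training a model and measuring its accuracy."""
    return 1.0 - abs(lr - 0.0005) * 1000


if __name__ == "__main__":
    print(grid_search(toy_score, learning_rate_grid()))
```

In practice, `train_and_score` would build one of the hybrid models, compile it with the Adam optimizer set to the candidate learning rate, train it and return the metric of interest; since every candidate requires a full training run, the grid is usually kept small.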

Evaluation of Models
In this phase, we evaluate the performance of all models built. For this purpose, we use the four metrics given by Formulas (1)-(4), namely accuracy, precision, recall and F1-score, due to their wide use in the literature. These measures are calculated according to the confusion matrix, which summarizes the number of correct and incorrect predictions made by a given classifier. Four outcomes are distinguished: (1) True Positives: when both the current value and the predicted value are positive with respect to a given class; (2) True Negatives: when both the current value and the predicted value are negative with respect to a given class (i.e., neither the current label nor the label output by the model matches the class label); (3) False Positives: when the current value is negative while the predicted value is positive with respect to a given class; (4) False Negatives: when the current value is positive while the predicted value is negative with respect to a given class.
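The four outcomes and the derived metrics can be computed from a multi-class confusion matrix as sketched below. The sketch assumes the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two, accuracy = correct predictions / all predictions) and that rows denote actual classes while columns denote predicted classes; the matrix in the usage example contains made-up numbers, not results from this study.

```python
def per_class_metrics(confusion, cls):
    """Return TP, TN, FP, FN, precision, recall and F1 for one class.

    confusion[i][j] = number of tweets of actual class i predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                            # actual cls, predicted other
    fp = sum(confusion[r][cls] for r in range(size)) - tp    # actual other, predicted cls
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def overall_accuracy(confusion):
    """Fraction of predictions that fall on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Made-up 3x3 example (classes: 0 = normal, 1 = depression, 2 = anxiety).
    toy = [[90, 5, 5],
           [4, 80, 16],
           [6, 10, 84]]
    print(overall_accuracy(toy))
    print(per_class_metrics(toy, 1))
```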

Software and Hardware Configuration
The training of our models was performed on an AMD Ryzen 5 4600H laptop endowed with a 3.00-GHz Radeon processor and 16 GB of RAM. The tweets composing the datasets were collected using the Twitter API and the Twarc2 Python library. Regarding the parameters of the training process, we have empirically set them as follows: the number of epochs is 20, the batch size is 256, the maximum tweet length is 131 words, the embedding is GloVe 200 d and Adam is adopted as the default optimization algorithm.

Performance of the Developed Models
To build multi-class models for predicting normal, depressive and anxiety tweets, we have tested around 100 models ranging from simple to hybrid models combining different types of neural network layers: convolution, recurrent, attention and bidirectional. Consequently, we found that the following hybrid multi-classifiers are the most representative typical cases of both success and failure: CNN_RNN, CNN_LSTM, CNN_GRU, CNN_BiRNN, CNN_BiLSTM and CNN_BiGRU. The CNN_BiRNN, CNN_BiLSTM and CNN_BiGRU models are the best in terms of performance for all experiment instances, while the CNN_RNN and CNN_GRU models benefit the most from the grid search technique in terms of performance improvements. Finally, the CNN_LSTM model represents a failure case where the grid search technique was unable to provide performance improvements. Figure 4 shows the performance of these models in terms of training accuracy and training loss. In particular, the best-performing model is CNN_BiGRU with a learning rate of 0.001.
By setting the learning rate value to 0.001, CNN_RNN was the worst model as it recorded poor accuracy. Moreover, CNN_LSTM and CNN_GRU also showed significant overfitting (the red and blue curves are far from each other). However, this unwanted overfitting effect gradually disappeared by setting the learning rate value to 0.0001. In contrast, value 0.001 for the learning rate led to better performance for CNN_BiRNN, CNN_BiLSTM and CNN_BiGRU compared to 0.0001, in addition to good behavior regarding overfitting. Figure 4 shows the associated curves (the curves on the left concern learning rate value 0.0001 while the curves on the right concern learning rate value 0.001).
The above results suggest that changing the learning rate value of the Adam optimizer can have a positive or negative influence on the performance of each model. Thus, we need efficient methods to define such a value in order to provide efficient models. In this respect, we adopt grid search, which is a well-known technique serving as a hyperparameter optimizer for each model. The results are given in Tables 7 and 8.
According to Tables 7 and 8, the best accuracy achieved is 93.38%; it corresponds to the CNN_BiGRU model, for which the F1-score of the Normal class is 96%, the F1-score of the Depression class is 91% and the F1-score of the Anxiety class is 93%. Figure 5 illustrates the confusion matrix for both cases: grid search and fixed learning rate values. Thus, it can be seen that the grid search could make some improvements in some cases for which the diagonal contains the maximum number of correct predictions.

Evaluation and Analysis of the Well-Performing Models
In this section, we evaluate our approach regarding the quality of the data collected and the models built. The objective is twofold: (1) verify the effectiveness of the assumptions used to collect data and (2) show the effectiveness of using multi-class models rather than binary class models. To this end, we leverage the dataset used in [36] to perform an evaluation using binary class models for depression and anxiety detection. Thus, we have randomly selected 12,982 tweets from Depression Dataset D1 and 2658 tweets from Non-Depression Dataset D2. After preprocessing these data, we obtained 5955 tweets labeled '1' and 2325 tweets labeled '0'; the resulting dataset is denoted as Shen_dataset. These data are then tested using the well-performing models discussed in Tables 7 and 8. The results are given in Table 9.
According to Table 9, the prediction accuracy on Shen_dataset is only average and thus does not show very good results. This is because many depressive tweets were classified as anxious tweets by our models. Indeed, as mentioned in Table 1, there are some common symptoms between depressive and anxiety disorders, which consequently may lead to classification errors. Since the tweets of Shen_dataset were collected using some keywords that overlap with anxiety disorders (e.g., "I am depressed and anxious", "I am too tired", "I am so sad" and "I have depression anxiety suicidal thoughts"), our models most likely classify them as anxiety tweets instead of depressive ones. To check this issue, we have reused our dataset to build two binary class models for predicting depression and anxiety separately while keeping the same parameter values. These models are based on the hybridization of CNN and Bi-GRU. Hence, Main_dataset was divided into two datasets denoted as Dataset1 and Dataset2. Dataset1 contains only normal and depressive tweets labeled, respectively, with '0' and '1', while Dataset2 contains only normal and anxiety tweets labeled, respectively, with '0' and '1'. Once these models are built, we test datasets Eval_dataset, Shen_dataset, Dataset1 and Dataset2 to make comparisons and thus draw conclusions. The results are given in Table 10.
According to Table 10, both binary class models classify depressive tweets from Shen_dataset as depressive and anxiety tweets with very high accuracy. Regarding our datasets, the obtained results are much better. For instance, Model_2 was trained to classify depressive tweets. By evaluating Dataset2 (anxiety dataset), the accuracy is about 86.35%, which means that many anxious tweets were classified as non-depressive. Likewise, by evaluating Dataset1 (depressive dataset) using Model_3, the accuracy is about 62.96%; this means that most depressive tweets were classified as non-anxious. The conclusions we draw from these results can be summarized as follows:
1. The improved accuracy of the studied models comes from the way the data were collected, by relying on both common and non-common symptoms instead of only using keywords related to common symptoms between depressive and anxiety disorders.
2. Our multi-class models seem to be more effective than the corresponding binary class models as they can resolve ambiguities. Indeed, as depressive and anxiety disorders present certain intersections, binary models most likely classify them as positive tweets (i.e., either depressive or anxious tweets) regardless of the model used (see for instance the results of using Model_2).
It should be noted that the conclusions drawn concern only the context of our work and can in no way be generalized.
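The division of Main-dataset into Dataset1 and Dataset2 described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the integer encoding of the three classes, the helper name `make_binary_dataset` and the sample tweets are hypothetical and do not come from the study.

```python
from typing import List, Tuple

# Assumed multi-class encoding (hypothetical, not stated in the text).
NORMAL, DEPRESSION, ANXIETY = 0, 1, 2

Sample = Tuple[str, int]


def make_binary_dataset(samples: List[Sample], positive_label: int) -> List[Sample]:
    """Keep normal tweets (relabelled 0) and tweets of one disorder (relabelled 1)."""
    binary: List[Sample] = []
    for text, label in samples:
        if label == NORMAL:
            binary.append((text, 0))
        elif label == positive_label:
            binary.append((text, 1))
        # Tweets belonging to the other disorder are left out.
    return binary


# Made-up tweets, for illustration only.
main_dataset: List[Sample] = [
    ("went for a walk with friends", NORMAL),
    ("i feel hopeless and empty", DEPRESSION),
    ("my heart is racing and i feel dizzy", ANXIETY),
    ("watching a movie tonight", NORMAL),
]

dataset1 = make_binary_dataset(main_dataset, DEPRESSION)  # normal vs. depressive
dataset2 = make_binary_dataset(main_dataset, ANXIETY)     # normal vs. anxiety
```

Each binary dataset keeps all normal tweets, so the two datasets differ only in which disorder plays the role of the positive class.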

Assessment of Our Proposal
Finally, we objectively assess our proposal against related works. Table 11 provides a comparison between our proposal and some other related works within the same context (i.e., those dealing with depression and/or anxiety disorders based on Twitter data), according to six criteria, C1 to C6: mental disorder studied, data collection, dataset size, type of learning model, type of classification and accuracy achieved.

According to Tables 7 and 8, the best accuracy achieved is 93.38%; it corresponds to the CNN_BiGRU model, with an F1-score of 96% for the Normal class, 91% for the Depression class and 93% for the Anxiety class. Figure 5 illustrates the confusion matrices for both cases, grid search and fixed learning rate values. It can be seen that grid search brought improvements in some cases, where the diagonal holds the largest number of correct predictions.
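To make the link between a confusion matrix and the per-class F1-scores concrete, the sketch below derives one F1-score per class from a three-class matrix. The counts are invented for illustration and are not the values of Figure 5; the row/column convention (rows are true classes, columns are predicted classes) and the function name `per_class_f1` are assumptions.

```python
from typing import Dict, List


def per_class_f1(cm: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Compute one F1-score per class from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(labels)
    scores: Dict[str, float] = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]                               # diagonal entry
        fp = sum(cm[i][k] for i in range(n)) - tp   # rest of column k
        fn = sum(cm[k]) - tp                        # rest of row k
        denom = 2 * tp + fp + fn
        scores[name] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical counts, not taken from the paper.
labels = ["Normal", "Depression", "Anxiety"]
cm = [
    [90, 5, 5],
    [6, 80, 14],
    [4, 10, 86],
]
f1_scores = per_class_f1(cm, labels)
```

A larger diagonal entry raises the true positives of that class and lowers the off-diagonal errors, which is why a matrix with a heavier diagonal yields higher F1-scores.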

In view of the foregoing, the main potential advantage of our study is that it can be viewed as complementary to existing research on the detection of depression and anxiety disorders, for the following reasons:
1. In contrast to many related works that rely on binary classification, our approach is based on multi-class models;
2. Our study showed that multi-class classification may be more efficient than binary class models as it could better resolve ambiguity issues, although this cannot be generalized;
3. The data were collected based on assumptions involving both common and non-common symptoms of depression and anxiety disorders.
Our approach also has some drawbacks, which are discussed below together with possible solutions. It should be noted that these limitations concern not only our approach but also much of the research conducted in the same context.

1. Although the data were generated according to a well-defined process, we still lack more efficient methods for collecting and labelling data (tweets). This remains a big challenge for large volumes of data, in contrast to small volumes of data that can be processed and annotated within a reasonable time. As ongoing work, we are studying the use of semantics to help collect and label the data through ontology-computing while considering emoji, emoticons and related contents.

2. Many researchers have embarked on a frantic race to design or improve classification models for the detection of mental disorders through the Twitter platform. Undoubtedly, this is very important, but it should not be an end in itself; what matters more is to leverage these models to perform useful tasks. In this line of thinking, we are currently working to deploy our models within a syndromic surveillance system, in order to improve public health systems.

... sleep, fluctuations in appetite or weight, agitation, anxiety, isolation (absenteeism) and sexual inhibition.
In common but of different degree: intense fatigue (loss of energy) ***, suicidal thoughts ***; intense fatigue (loss of energy) *, suicidal thoughts *.
Which are not common points: -; dizziness, heart palpitations.

Figure 1. The proposed methodology for building effective classifiers of mental disorders detection.

Figure 2. Word cloud of dataset etching after preprocessing; (a) Word cloud of dataset D0; (b) Word cloud of dataset D1; (c) Word cloud of dataset D2.

$$\text{Accuracy} = \frac{TN + TP}{TN + FP + TP + FN}$$

(1) True Positives: when current and predicted values are positive with respect to a given class (i.e., both the current label and the label output by the model match the class label); (2) True Negatives: when current and predicted values are negative with respect to a ...
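As a worked illustration of the accuracy formula, the sketch below counts true/false positives and negatives for one class in a one-vs-rest fashion and then applies the formula. The label vectors and the function names are hypothetical and chosen only for the example.

```python
from typing import Sequence, Tuple


def one_vs_rest_counts(
    y_true: Sequence[int], y_pred: Sequence[int], cls: int
) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for class `cls` against all other classes."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == cls and predicted == cls:
            tp += 1   # both labels match the class
        elif actual != cls and predicted != cls:
            tn += 1   # neither label is the class
        elif actual != cls and predicted == cls:
            fp += 1   # predicted as the class, but actually another one
        else:
            fn += 1   # actually the class, but predicted as another one
    return tp, tn, fp, fn


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TN + TP) / (TN + FP + TP + FN)."""
    return (tn + tp) / (tn + fp + tp + fn)


# Made-up labels for six tweets (0, 1, 2 are arbitrary class codes).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

counts = one_vs_rest_counts(y_true, y_pred, cls=1)
acc = accuracy(*counts)
```

Here one tweet of class 2 is wrongly predicted as class 1, so it counts as a false positive for class 1 and lowers the accuracy to 5/6.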

... typical cases of both success and failure: CNN_RNN, CNN_LSTM, CNN_GRU, CNN_BiRNN, CNN_BiLSTM and CNN_BiGRU. The CNN_BiRNN, CNN_BiLSTM and CNN_BiGRU models perform best across all experiment instances, while the CNN_RNN and CNN_GRU models benefit most from the grid search technique. Finally, the CNN_LSTM model represents a failure case in which grid search was unable to improve performance. Figure 4 shows the performance of these models in terms of training accuracy and training loss. In particular, the best-performing model is CNN_BiGRU with a learning rate of 0.001.
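The idea of selecting the learning rate by grid search can be sketched as follows. This is not the training setup of the study: a tiny logistic regression trained with plain gradient descent stands in for the hybrid models and the Adam optimizer, the data are toy values, and the candidate grid, the use of validation loss as the selection score and all function names are assumptions made for the example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, int]

# Toy one-dimensional data (hypothetical), standing in for tweet features.
TRAIN: List[Point] = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
VALID: List[Point] = [(-1.5, 0), (-0.2, 0), (0.3, 1), (1.2, 1)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def validation_loss(lr: float, epochs: int = 200) -> float:
    """Train a one-feature logistic regression and return its log loss on VALID."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in TRAIN) / len(TRAIN)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in TRAIN) / len(TRAIN)
        w -= lr * grad_w
        b -= lr * grad_b
    eps = 1e-12  # keeps log() away from zero
    loss = 0.0
    for x, y in VALID:
        p = min(max(sigmoid(w * x + b), eps), 1.0 - eps)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss / len(VALID)


def grid_search(
    candidates: Sequence[float], score: Callable[[float], float]
) -> Tuple[float, Dict[float, float]]:
    """Evaluate every candidate value and return the one with the lowest score."""
    scores = {lr: score(lr) for lr in candidates}
    best = min(scores, key=lambda lr: scores[lr])
    return best, scores


best_lr, all_scores = grid_search([0.1, 0.01, 0.001, 0.0001], validation_loss)
print(best_lr, all_scores)
```

The same loop applies to any training routine: only `validation_loss` would need to be replaced by a function that trains the model of interest with the given learning rate and returns the selection score.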

Figure 4. Comparison between training and test for accuracy and loss of hybrid models; (a) learning rate 0.001; (b) learning rate 0.0001.

C1. Mental disorder: this refers to the mental disorder studied, which can be either depression (denoted as Dep) or anxiety (denoted as Anx) disorders.
C2. Data collection: this refers to whether the training data were collected using keywords (e.g., symptoms, usernames, etc.) or reused from other datasets.
C3. Dataset size: this refers to the total number of tweets used to train the models.
C4. Type of learning model: this refers to whether the well-performing classifier adopts simple variants (denoted as S) or hybridization (denoted as H) of models.
C5. Type of classification: this refers to whether the well-performing classifier is a binary (denoted as B) or a multi-class (denoted as M) model.
C6. Accuracy achieved: this refers to the accuracy achieved by the well-performing classifier (measured as a percentage).

Table 2. Comparison of recent studies using traditional machine learning approaches to detect mental disorders from different data sources.

Table 3. Comparison of recent studies using simple deep learning approaches to detect mental disorders from different data sources.

Table 5. Typical keywords used as parameters to collect our dataset.
I have had dizziness for more than six months. I have had heart palpitations for more than six months.

Table 6. Number of tweets before and after preprocessing sub-steps.

Table 7. The evaluation of our models on the evaluation dataset (Eval_dataset), based on fixed learning rate values for Adam optimizer.

Table 8. The evaluation of our models on the evaluation dataset (Eval_dataset), by using grid search optimizer to determine the learning rate value for Adam optimizer.

Table 9. Prediction of tweets from Shen_dataset using our well-performing models.

Table 10. The CNN-BiGRU classifiers used to predict normal cases and depression and anxiety disorders on different datasets.
