Machine Learning Algorithms for Depression: Diagnosis, Insights, and Research Directions

Abstract: Over the years, stress, anxiety, and modern-day fast-paced lifestyles have had immense psychological effects on people's minds worldwide. Technological development in healthcare has digitized copious amounts of data, making it possible to map various aspects of human biology more accurately than traditional measuring techniques. Machine learning (ML) is recognized as an efficient approach for analyzing the massive amount of data in the healthcare domain. ML methodologies are being utilized in mental health to predict the probabilities of mental disorders and, therefore, to guide potential treatment. This review paper surveys different machine learning algorithms used to detect and diagnose depression. The ML-based depression detection algorithms are categorized into three classes: classification, deep learning, and ensemble. A general model for depression diagnosis involving data extraction, pre-processing, training of the ML classifier, detection classification, and performance evaluation is presented. Moreover, the paper presents an overview of the objectives and limitations of different research studies in the domain of depression detection. Furthermore, it discusses future research possibilities in the field of depression diagnosis.


Introduction
The modern lifestyle has a psychological impact on people's minds that causes emotional distress and depression [1]. Depression is a prevalent mental disturbance affecting an individual's thinking and mental development. According to the WHO, approximately 1 billion people have mental disorders [2] and over 300 million people suffer from depression worldwide [3]. Depression can give rise to suicidal thoughts, and around 800,000 people die by suicide annually. A comprehensive response is therefore required to deal with the burden of mental health issues [4,5]. Depression may harm the socioeconomic status of an individual, and people suffering from depression are more reluctant to socialize. Counseling and psychological therapies can help fight depression. Machine learning (ML) aims at creating algorithms that can train themselves to perceive complex patterns. This ability helps to find solutions to new problems by using previous data and solutions. ML algorithms implement processes with regulated and standardized outcomes [6,7]. Broadly, ML algorithms are categorized into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning algorithms. Supervised ML algorithms [8] use labeled inputs to predict known target values, whereas unsupervised ML algorithms [9] uncover hidden patterns and clusters within the given data. Semi-supervised learning [10] combines labeled and unlabeled data and thus lies between supervised and unsupervised learning. Reinforcement learning [11] is concerned with learning, through trial and error, which actions in an environment lead to desired outcomes. The applications of ML techniques in healthcare have proven to be practical, as they can process a huge amount of heterogeneous data and provide efficient clinical insights.
ML-based approaches provide an efficient understanding of mental conditions and assist mental health specialists in predictive decision making [12]. ML techniques benefit prediction and diagnosis in the healthcare domain by generating information from unstructured medical data. The prediction outcomes help to identify high-risk medical conditions in patients for early treatment [13]. In mental disorders, ML techniques help identify potential behavioral biomarkers [14] to assist healthcare specialists in predicting the likelihood of mental disorders and administering effective treatment. The techniques also support the visualization and interpretation of complex healthcare data, and visualization helps develop effective hypotheses regarding the diagnosis of mental disorders. The traditional clinical diagnostic approach does not accurately capture the complexity of depression. The combination of symptoms related to mental disorders such as depression can be detected and anticipated more readily by utilizing ML methods. Therefore, the ML-based diagnostic approach seems to be an efficient choice for predictive analysis. In the healthcare sector, the major sources used for extracting observations associated with mental disorders through ML can be classified as sensors, text, structured data, and multimodal technology interactions [14]. Sensor data can be collected from mobile phones and audio signals. Text can be extracted from social media platforms, text messages, and clinical records. Structured data comprise the data extracted from standard screening scales, questionnaires, and medical health records. Multimodal technology interactions include data from human interactions with everyday technological equipment, robots, and virtual agents. ML approaches can be used to assist in diagnosing mental health conditions.
The majority of the studies analyze Twitter data [15-17] and sensor data from mobile devices [18,19] to identify mood disorders. Analyzing textual data can help extract diagnostic information from an individual's psychiatric records [20]. ML approaches can help to predict risk factors in patients with mental disorders. The analysis of sensor data [20], clinical health records [21,22], and text message data [23] can help predict the severity of mental disorders and suicidal behaviors. Various studies have been put forward to aid medical specialists in identifying depression and multiple other mental disorders. The domain of mental disorders comprises a diverse range of mental illnesses; however, this review paper concentrates on the methods presented for the detection of depression. It elaborates on the ML approaches and algorithms used to diagnose and detect depression in individuals. The paper briefly presents the objectives and limitations of the reviewed studies in depression diagnosis, which will help analyze and recognize the best ML approach for a depression diagnosis. The analysis presented in this review paper can help medical specialists and clinicians choose a suitable diagnosis approach for patients with depression. This review paper presents the following:
(1) Significant studies that extract mental health-related insights.
(2) A general model for depression diagnosis involving data extraction, pre-processing, training of the ML classifier, detection classification, and performance evaluation.
(3) An overview of different ML algorithms to diagnose depression, categorized into three classes: classification, deep learning, and ensemble.
(4) The limitations of the reviewed studies in the depression diagnosis domain, to give clinicians and healthcare professionals a better understanding of the choice of ML approach for depression diagnosis.
(5) Future research possibilities in the domain of depression diagnosis.
The remaining sections of this paper are organized as follows: Section 2 gives a brief description of past studies. The methodology for depression diagnosis is explained in Section 3. Section 4 describes the depression detection models. Section 5 explains future directions in the domain of depression diagnosis. Section 6 concludes this review.

Related Work
Over the years, there have been numerous studies on the use of ML to improve the study of mental disorders. In [24], the authors present a history of depression, imaging, and ML approaches. The review also covers studies that have used imaging and ML to study depression. The algorithms under review are SVM (linear kernel), SVM (nonlinear kernel), and relevance vector regression. The survey considers only one mental health domain (MHD). It does not mention depression screening scales, and there is no comprehensive comparison of algorithms. Garcia et al. [25] surveyed mental health monitoring systems (MHMS) using ML and sensor data in mental disorders. This study also analyzed supervised, unsupervised, semi-supervised, transfer, and reinforcement learning as applied in the domains of mental well-being, including depression, anxiety, bipolar disorder (BD), migraine, and stress. However, the study only presents a brief review of MHMS cases and applications. Gao et al. [26] compared ML-based brain imaging classification and prediction research studies for diagnosis. Major depressive disorder (MDD) and BD were analyzed using MRI data. SVM, LDA, GPC, DT, RVM, NN, and LR algorithms are under review in this study. However, the depression screening scales used in the different studies are not mentioned, and the review focuses only on MDD and BD-based research studies. Gyeongcheol et al. [27] analyzed five ML algorithms (SVM, Gradient Boosting Machine (GBM), RF, Naïve Bayes, and KNN) applied in the domains of mental disorders, including PTSD, schizophrenia, depression, ASD, and BD studies. This study reviewed a limited number of ML algorithms and did not specify the advantages of using a particular ML approach.
In [28], the authors analyzed Facebook data to detect depression-relevant factors. The Facebook users' data were analyzed using LIWC. Four supervised learning ML approaches were applied to the acquired data: DT, KNN, SVM, and an ensemble model. Experimental results indicated that DT yielded better classification accuracy. Liu et al. [29] presented a brief review of generic AI-based applications for mental disabilities and an illustration of AI-based exploration of biomarkers for psychiatric disorders. The study [30] reviewed three major approaches for brain analysis for psychiatric disorders: magnetic resonance imaging (MRI), electroencephalography (EEG), and kinesics diagnosis, along with five AI methods: Bayesian model, LR, DT, SVM, and DL. In [31], the authors reviewed DL methodology for extracting a representation of depression cues from audio and video to detect depression. This review introduced the databases and described objective markers for automatic depression estimation (ADE). Furthermore, it reviewed the DL methods (DCNN, RNN, and LSTM) for automatic depression detection from audio and video. Finally, it discussed challenges and promising directions related to the automatic diagnosis of depression using DL approaches. Table 1 gives an overview of the different studies.

Methodology for Depression Diagnosis
The detection methodology involves a series of processes, including the data extraction, the pre-processing of the extracted data, feature extraction methods for selecting the required set of features for identifying symptoms of depression, and ML classifiers for classifying the input data into defined data categories. This section discusses each of these steps and the different methods and approaches used for implementing each step.

Pre-Processing Algorithms
(1) Linear Discriminant Analysis (LDA): LDA is a dimensionality reduction approach that removes redundant features by projecting the data from the original feature space onto a lower-dimensional space. LDA reduces the dimensions in each dataset, retains the most important features, and achieves higher class separability [31].
(2) Synthetic Minority Oversampling Technique (SMOTE): SMOTE is a statistical oversampling technique for obtaining a synthetically class-balanced dataset. It balances the class distribution by generating synthetic samples from the minority class [32].
(3) Linguistic Inquiry and Word Count (LIWC): LIWC is a text analysis technique for understanding different emotional, subjective, and structural components present in spoken and written speech patterns [33].
(4) Hidden Markov Model (HMM): HMM is a probabilistic model used to capture and describe information from observable sequential symbols. In HMM, the observed data are modeled as a series of outputs generated by several internal states [34].
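To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation, not the reference implementation from [32]. The function name `smote_sketch`, the neighbour count `k`, and the toy minority samples are assumptions made for illustration: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours.

```python
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region already spanned by the minority class.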

Feature Extraction Methods
Feature selection is a technique that selects the features that are the most accurate predictors of the target variable.
(1) SelectKBest: SelectKBest is a feature selection approach that retains relevant features and drops unwanted features in the given input data. It is a univariate approach: a univariate statistical test scores each feature against the target variable, and the K features with the best scores are kept.
(2) Particle Swarm Optimization (PSO): PSO is a computational process for optimizing nonlinear functions by iteratively improving candidate solutions with respect to a defined quality measure. The general concept of the PSO algorithm is inspired by swarm behavior in nature, such as bird flocking and fish schooling [35].
(3) Maximum Relevance Minimum Redundancy (mRMR): mRMR is a feature selection approach that manages multivariate temporal data without compressing previous data. The algorithm selects the features that are most relevant to the class while being least redundant with one another. It provides significantly improved class predictions in extensive datasets [36].
(4) Boruta: Boruta is a feature selection approach designed around a Random Forest classifier. Boruta extracts all the relevant variables by iteratively removing the features that a statistical test shows to be less relevant [37].
(5) RELIEFF: The RELIEFF algorithm is one of the most successful filter-based feature selection methods. It is used to eliminate redundant features [38].
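The univariate "score every feature, keep the best K" idea can be sketched in a few lines. This is an illustrative stand-in rather than the library routine: the helper names `abs_correlation` and `select_k_best` are hypothetical, and the absolute Pearson correlation with the label is an assumed scoring function chosen for brevity (other univariate tests, such as an ANOVA F-test, could be substituted).

```python
from statistics import mean

def abs_correlation(feature, labels):
    """Absolute Pearson correlation between one feature column and the labels."""
    mf, ml = mean(feature), mean(labels)
    cov = sum((f - mf) * (l - ml) for f, l in zip(feature, labels))
    var_f = sum((f - mf) ** 2 for f in feature)
    var_l = sum((l - ml) ** 2 for l in labels)
    if var_f == 0 or var_l == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / (var_f * var_l) ** 0.5

def select_k_best(rows, labels, k):
    """Return the indices of the k highest-scoring columns, plus all scores."""
    n_features = len(rows[0])
    scores = [abs_correlation([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k]), scores

# Toy data: column 0 tracks the label, column 1 is constant, column 2 is noisy
rows = [[1, 5, 0.2], [2, 5, 0.9], [3, 5, 0.1], [4, 5, 0.8]]
labels = [0, 0, 1, 1]
selected, scores = select_k_best(rows, labels, k=1)
```

Each feature is scored in isolation, which keeps the method fast but means that interactions between features are ignored; multivariate methods such as mRMR address that limitation.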

Supervised Learning Classifiers
In supervised learning, the training dataset has a specific format: each instance is assigned a label. A dataset consists of pairs (x, y) with x in X and y in Y, where x denotes a data point and y its label. The problem is a classification task if the output y belongs to a discrete domain and a regression task if it belongs to a continuous domain. In both cases, the task is to predict the value of the dependent attribute from the input variables.

Classification
(1) Naïve Bayes Classifier: A Naïve Bayes classifier applies Bayes' theorem with strong independence assumptions between the features. It is a simple probabilistic learning strategy used to build ML models, particularly for document classification and disease prediction [39].
i. Bagging (RF, DF): Bagging is an ensemble algorithm [8]. It fits multiple models on different subsets of the training dataset, and the predictions from all models are then combined. Random Forest (RF), an extension of bagging, additionally selects random subsets of the features from the given dataset.
ii. Boosting (GBDT, XGBoost): Gradient Boosting is an ensemble classifier used for supervised ML tasks. It combines individual weak learners into a single collective model.
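As an illustration of the independence assumption behind Naïve Bayes, here is a minimal Gaussian variant written from scratch. The function names, the Gaussian likelihood, and the two-feature toy data are assumptions for this sketch; they are not taken from any of the reviewed studies.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate class priors and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9  # avoid zero variance
            stats.append((mu, var))
        model[label] = (len(members) / len(rows), stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mu, var) in zip(row, stats):
            # Features are treated as independent, so their log densities simply add up
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical two-feature screening data: 1 = depressed, 0 = not depressed
rows = [[2.0, 7.5], [3.0, 8.0], [2.5, 7.0], [14.0, 4.0], [16.0, 5.0], [15.0, 4.5]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
```

Calling `predict_gaussian_nb(model, [15.5, 4.2])` assigns the new sample to the class whose per-feature distributions explain it best.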

Regression
Regression is used to understand the relationship between dependent and independent variables. It is generally used to make projections, for example, of sales revenue for a given business. Linear regression and logistic regression are popular regression algorithms.
(1) Logistic Regression: When the dependent variable is dichotomous (binary), logistic regression is an appropriate regression technique. Logistic regression is employed to describe and explain the connection between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
(2) Lasso Regression: Lasso regression is a form of shrinkage-based linear regression. In shrinkage, data values are shrunk toward a central point, such as the mean. The lasso method encourages simple, sparse models.
(3) Elastic Net: Elastic net is a regularized linear regression that incorporates two well-known penalties, the L1 and L2 penalty functions.
(4) SVR: Support vector regression (SVR) allows the flexibility to define how much error is acceptable in each model and finds an appropriate line to fit the data.
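The regression variants above differ mainly in the objective they minimize. As a compact summary in standard textbook notation (the symbols $n$, $x_i$, $y_i$, $\beta$, $\lambda$, and the $1/(2n)$ scaling are conventions assumed here, not taken from the reviewed studies):

```latex
% Logistic regression: probability of the positive class
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-(\beta_0 + x^{\top}\beta)\right)}

% Lasso: squared error plus an L1 penalty, which drives some coefficients exactly to zero
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2
  + \lambda \sum_{j}\lvert\beta_j\rvert

% Elastic net: both the L1 and the L2 penalty
\hat{\beta}_{\text{enet}} = \arg\min_{\beta}\;
  \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2
  + \lambda_1 \sum_{j}\lvert\beta_j\rvert
  + \lambda_2 \sum_{j}\beta_j^{2}
```

Setting $\lambda_2 = 0$ in the elastic net recovers the lasso, while setting $\lambda_1 = 0$ leaves only the L2 (ridge) penalty.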

Deep Learning
Deep learning is a type of ML that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. Image segmentation is a key concept in image processing and computer vision, with applications such as scene understanding, medical image analysis, robotic perception, augmented reality, video surveillance, and image compression. Because of the success of DL models in a wide range of vision applications, a substantial amount of work has been aimed at developing image segmentation approaches using DL models.
Neural Networks: The neural network is a classifier that simulates the human brain and its neurons. Neural networks (NNs), or artificial neural networks (ANNs), are based on a collection of processing units (e.g., nodes, neurons, or processing layers). Each processing unit receives signals from other neurons, combines and transforms them, and generates a result.
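The behavior of a single processing unit can be sketched directly. In this minimal example the sigmoid activation, the weights, and the input values are all illustrative assumptions: the unit combines its incoming signals as a weighted sum and transforms the sum into an output, and a layer is simply several such units reading the same inputs.

```python
import math

def neuron(inputs, weights, bias):
    """Combine incoming signals as a weighted sum, then transform with a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weight_rows, biases):
    """A layer is several neurons that read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two inputs -> hidden layer with two neurons -> one output neuron
hidden = dense_layer([0.5, -1.0], [[1.0, 2.0], [-1.0, 0.5]], [0.0, 0.1])
output = neuron(hidden, [1.5, -0.5], 0.0)
```

Training consists of adjusting the weights and biases so that `output` moves closer to the desired label; that step is omitted here.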
(1) Convolutional Neural Network: A ConvNet, also known as a CNN, is a deep learning (DL) method that can take an input image, assign significance (learnable weights and biases) to various aspects or objects in the image, and distinguish one from the other. Compared to other classification methods, a ConvNet needs much less pre-processing. While filters are hand-engineered in basic techniques, ConvNets can learn these filters and characteristics with enough training.
(2) Artificial Neural Network (ANN): ANNs are built from many interconnected processing units. Each processing unit receives signals from other neurons, combines and transforms them, and produces an output. The processing units are loosely compared to biological neurons, which gives artificial neural networks their name.
(3) DNN: An artificial neural network (ANN) having many layers between the input and output layers is known as a deep neural network (DNN) [9]. Neural networks come in various types, but they all share the same basic components: neurons, synapses, weights, biases, and functions. These components work analogously to the human brain and can be trained like any other machine learning algorithm.

(5) RNN: Recurrent neural networks (RNNs) are utilized in language modeling applications because input may flow in either direction. For this purpose, long short-term memory is especially useful. Long short-term memory (LSTM) is a recurrent neural network design utilized in deep learning. LSTM contains feedback connections, unlike conventional feedforward neural networks. It can handle large data sequences as well as single data points.
(6) AiME (Novel Model): An artificial intelligence mental evaluation (AiME) framework for detecting symptoms of depression using multimodal deep networks-based human-computer interactive evaluation.

Depression Detection Models
Depression is a type of mental illness that places a serious burden on individuals, families, and society. According to the WHO, depression will be the most common mental illness by 2030 [44]. In severe cases, depression leads to suicide. Currently, there is no efficient clinical characterization of depression, which makes the diagnostic process restricted and biased. Diagnosing depression is complicated, depending not only on the educational background, cognitive ability, and honesty of the subject in describing the symptoms but also on the experience and motivation of the clinicians. Comprehensive information and thorough clinical training are needed to diagnose the severity of depression accurately [10]. Hence, in recent years, numerous automatic depression estimation (ADE) systems have been introduced to automatically estimate the severity of depression by using different ML algorithms. Figure 1 illustrates various ML algorithms for the diagnosis of depression.

Classification Models
This section highlights the supervised classification models used in several studies for diagnosing depression. A mobile application, Mood Assessment Capable Framework (Moodable), has been presented in [45] to interpret voice samples, data from smartphones and social media accounts, and Patient Health Questionnaire (PHQ-9) data for assessing an individual's mood and mental health and inferring symptoms of depression by using the ML classifiers SVM, KNN, and RF. The framework achieved 76.6% precision for depression assessment. In [46], the authors used six ML classifiers (KNN, Weighted Voting classifier, AdaBoost, Bagging, GB, and XGBoost) to predict depression. The SelectKBest, mRMR, and Boruta techniques were used for feature selection, and SMOTE was applied to reduce class imbalance. They used a dataset of 604 individuals, including sociodemographic and psychosocial data and Burns Depression Checklist (BDC) data, among which a depression prevalence of 65.73% was identified. The analysis indicated that the AdaBoost classifier achieved the highest classification accuracy of 92.56% when used with the SelectKBest algorithm.
An ML model using the RF algorithm has been implemented for the prognosis of depression among Korean adults in [47]. SMOTE was applied for class balancing between two classes: depression and non-depression. CES-D-11 was used as the depression screening scale, and 10-fold cross-validation was utilized to tune the hyperparameters. Data from a total of 6588 Korean citizens were included in the study; the model reached an AUROC of 0.870 and an accuracy of 86.20%. However, biomarkers were not included in the dataset. In [48], the authors used three ML algorithms, KNN, RF, and SVM, to diagnose depression among Bangladeshi students. The study aimed at predicting depression at early stages using related features to avoid drastic incidents. The analysis, performed on data from 577 students, indicated that the Random Forest algorithm detected the symptoms of depression in the students with 75% accuracy and a 60% F-measure.
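The k-fold cross-validation procedure mentioned above can be sketched as an index-splitting routine. This is a generic illustration, not the protocol of [47]; the function name `k_fold_indices`, the seed, and the sample count are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=25, k=10))
```

For hyperparameter tuning, each candidate setting would be trained on `train_idx` and scored on `test_idx` for all ten splits, and the setting with the best average score retained. When oversampling such as SMOTE is part of the pipeline, a common precaution is to apply it only to the training portion of each split so that synthetic samples do not leak into the evaluation data.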
In [49], ensemble learning and DL approaches have been applied to electroencephalography (EEG) features for detecting depression. Deep Forest (DF) and SVM classifiers were used for feature transformation. Image conversion and CNN were used for feature recognition from the EEG spatial information. The ensemble model with DF and SVM obtained 89.02% classification accuracy, and the DL approach achieved 84.75% accuracy. In [50], the ML algorithms DT, RF, Naïve Bayes, SVM, and KNN were used to predict stress, anxiety, and depression. Data from 348 individuals, collected with the Depression, Anxiety, and Stress Scale questionnaire (DASS 21), were analyzed. The analysis indicated that Naïve Bayes achieved the highest accuracy of 85.50% for predicting depression. Based on F1 scores, the RF algorithm was more efficient in the case of imbalanced classes. In [51], the author used sentiment and linguistic analysis with ML to discriminate between depressive and non-depressive social content. RF with the RELIEFF feature selector, the LIWC text-analysis tool, the Hierarchical Hidden Markov Model (HMM), and the ANEW scale were used to analyze 4026 social media posts, with an accuracy of 90% for depressive post classification, 92% for depression degree classification, and 95% for depressive community classification. However, this study treats all depression categories as a single class. Sharma et al. [52] used the XGBoost algorithm on data samples to diagnose mental disorders. The dataset used in this study had imbalanced classes, and different sampling techniques were applied to it. The study achieved values above 0.90 for accuracy, precision, recall, and F1 score.
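The reason F1 is preferred over accuracy for imbalanced classes becomes clear with a small worked example. The labels below are fabricated toy values used purely to illustrate the arithmetic, and the helper name `classification_metrics` is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

# Imbalanced toy labels: 9 negatives, 1 positive.
# A trivial model that predicts "negative" for everyone still looks accurate.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
metrics = classification_metrics(y_true, y_pred)
```

Here the accuracy is 0.9 even though the single positive case is missed, whereas recall and F1 are both 0, which exposes the failure.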
Generalized Anxiety Disorder (GAD) is difficult to perceive and distinguish from major depression (MD) in a clinical framework. In [53], a multi-model ML algorithm was presented to distinguish GAD from MD using structural MRI data and clinical and hormonal information. The authors concluded that MRI data provided additional information for GAD classification. However, the sample size and accuracy needed to be increased, and the groups were unbalanced. Xiang et al. [54] used a multikernel SVM with a minimum spanning tree (MST) and the Kolmogorov-Smirnov test for feature selection. The proposed approach supported an effective network analysis. A total of 38 MDD patients and 28 healthy controls were included in the dataset. The presented approach achieved 97.54% accuracy. Table 2 presents a comparison of different classification models used for the diagnosis of depression.

Discussion of Classification Models
The multikernel SVM proposed in [54] with a high-order MST achieved the highest MDD classification accuracy, 97.54%, among the reviewed studies. The multikernel SVM model captures dynamic changes in the functional association between brain regions, and the integration of multiple kernels can enhance classification. Another model with a high classification accuracy was presented in [46], which achieved 92.56% using AdaBoost with the SelectKBest feature selection method and SMOTE for balancing the classes. AdaBoost falls under the category of DT ensembles. Comparing the two studies [46,54], no biomarker was included in the dataset in [46], while in [54] the dataset was limited and no depression screening scale was identified. Considering the studies [45,48-50,53,54], SVM has been the most used classifier for the detection of depression, as it works well on unstructured and high-dimensional data. SVM is also resistant to overfitting. For data with an unknown and irregular distribution, SVM can prove to be an efficient algorithm.
Random Forest (RF) is the second most used classifier in the reviewed studies [45,47,48,50,51], as it is a computationally efficient algorithm. In [51], the RF model achieved 90, 95, and 92% accuracy for classifying depressive posts, depressive communities, and depression degrees, respectively. RF enhances the classification accuracy on continuous data by reducing the overfitting of individual decision trees. As RF is based on ensemble learning, it can model both simple and complex functions more accurately. Figure 2 shows the comparison of classification models used for depression diagnosis.

Deep Learning Models
This section highlights the deep learning models presented in multiple studies to detect depression. An artificial intelligence mental evaluation (AiME) framework [55] has been presented in a study for detecting symptoms of depression using multimodal deep networks-based human-computer interactive evaluation. The framework was applied to audio, video, and speech responses of 671 participants and PHQ-9 data. The authors of [56] discuss the multimodal stress detection using fusion of machine learning algorithms.
In [56], a DL framework based on EEG data has been suggested for the automatic analysis of depression. The framework includes two DL models: a one-dimensional convolutional neural network (1DCNN) and a combination of the 1DCNN and an LSTM model. The dataset used in the study contained EEG data and quantitative information from 30 healthy subjects and 33 MDD patients. BDI-II and HADS were used as the assessment scales. The framework achieved an overall classification accuracy of 98.32%. Erguzel, Sayar et al. [57] presented a hybridized methodology using PSO and ANN to distinguish between unipolar and bipolar depression based on EEG recordings. The presented ANN-PSO approach discriminated 31 bipolar and 58 unipolar subjects with 89.89% accuracy. SCID-I, the 17-item HDRS, YMRS, DSM-IV, and HADS were used as the assessment scales. However, this study used a limited dataset.
Feng et al. [58] presented the X-A-BiLSTM model for diagnosing depression from social media data. The XGBoost component helped reduce imbalanced classes, and the Attention-BiLSTM neural network component enhanced the classification capacity. The RSDD dataset with approximately 9000 depressed users and 107,000 control users was used in the study. However, no standard screening scale for depression was used in their work. In [59], a novel approach was presented to optimize word embedding for classification. The proposed approach outperformed the previous state-of-the-art models on the RSDD dataset. The comparative evaluation was performed on some DL models for diagnosing depression from tweets on the user level. The experiments were performed on two publicly available datasets, CLPsych 2015 and Bell Let's Talk. Results showed that CNN-based models performed better than RNN-based models. However, the word embedding models did not perform efficiently with larger datasets.
Zogan et al. [59] presented an interpretable Multimodal Depression Detection with Hierarchical Attention Network (MDHAN) to detect depressed people on social media. User posts along with Twitter-based multimodal features were considered. The semantic sequence features were captured from the individuals' profiles. MDHAN outperformed other baseline methods, indicating that combining DL with multimodal features can be effective. MDHAN achieved an accuracy of 89.5% and provided adequate evidence to explain its predictions. However, this study needs to use a standard dataset of Twitter users because social media data may be vague and can distort the experimental outcome. In [60], deep convolutional neural networks (DCNN) are first designed to learn deep-learned characteristics from spectrograms and raw voice waveforms. To improve the depression recognition performance, the authors suggest using joint fine-tuning layers to merge the raw and spectrogram DCNNs.
He and Cao [60] used DCNN to enhance depression classification. DCNN with LLD and MRELBP texture descriptors was applied on 100 training, 100 development, and 100 testing samples. The AVEC2013 and AVEC2014 datasets were combined. The results were an MAE of 8.1901 and an RMSE of 9.8874 for the combined dataset. In [61], the authors presented a model for diagnosing mild depression by processing EEG signals using CNN. The model used four functional connectivity metrics (coherence, correlation, PLV, and PLI) and obtained a classification accuracy of 80.74%. Only functional connectivity matrices were used in the research, and other metrics need to be used for evaluation. Ahmed et al. [62] discussed early depression diagnosis by analyzing posts of Reddit users using a DL-based hybrid model. BiLSTM with GloVe, Word2Vec, and FastText embedding techniques, metadata features, and LIWC were applied on 401 (for testing) and 486 (for training) with 531,453 posts for depression detection. The Beck Depression Inventory (BDI) was used as the assessment scale. The proposed model obtained an F1 score, precision, and recall of 81, 78, and 86%, respectively. Table 3 presents a comparison of different deep learning models used for the diagnosis of depression.
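When depression severity is predicted as a score rather than a class, the error is reported as MAE and RMSE, as in [60]. The definitions can be sketched as follows; the score values are hypothetical and are not results from any reviewed study.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more strongly than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical depression-scale scores for four subjects
true_scores = [10, 22, 5, 30]
predicted_scores = [12, 18, 5, 36]
error_mae = mae(true_scores, predicted_scores)
error_rmse = rmse(true_scores, predicted_scores)
```

RMSE is never smaller than MAE on the same predictions, and a large gap between the two indicates that a few subjects are predicted much worse than the rest.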

Discussion of Deep Learning Models
The studies reviewed in this section used various DL models with different feature extraction and word embedding techniques. The DL models presented in [56] discriminated efficiently between depressed participants and healthy controls: the 1DCNN achieved the highest classification accuracy of 98.32%, and the one-dimensional DCNN with LSTM achieved an accuracy of 95.97%. These DL models discriminate EEG signal patterns automatically.
In the majority of the studies [56,57,61], EEG data have been utilized to diagnose the symptoms of depression in the participants. EEG patterns can help to indicate abnormalities in brain functions and irregular emotional alterations. EEG signals resemble waves with peaks and valleys, from which irregularities can be identified. In [56], a variant of CNN, namely DCNN, was applied to EEG signals to diagnose unipolar depression. In [57], a hybrid model of ANN with the PSO algorithm was used to discriminate unipolar and bipolar disorders based on EEG recordings, achieving 89.89% accuracy. In [61], a CNN classification model was used to diagnose mild depression by processing EEG signals, and the model achieved 80.74% accuracy using the coherence functional connectivity metric. It can be concluded that EEG-based diagnosis is an efficient and cost-effective method for understanding brain activity and the neural correlates of social anxiety. Figure 3 presents the comparison of DL models for depression.
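Of the four functional connectivity metrics used in [61] (coherence, correlation, PLV, and PLI), correlation is the simplest to illustrate. The following minimal sketch builds a channel-by-channel Pearson correlation matrix from toy signals. It is not the pipeline of [61]; the three short "channels" are hypothetical values, and real EEG recordings would first require filtering and artifact removal.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length signals."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signals must have the same length (at least 2 samples)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / (norm_x * norm_y)

def connectivity_matrix(channels):
    """Symmetric matrix whose entry (i, j) is the correlation of channels i and j."""
    return [[pearson(a, b) for b in channels] for a in channels]

# Hypothetical toy "EEG channels" (one list of samples per channel).
channels = [
    [1.0, 2.0, 3.0, 4.0],   # channel 0
    [2.0, 4.0, 6.0, 8.0],   # channel 1: scaled copy of channel 0
    [4.0, 3.0, 2.0, 1.0],   # channel 2: reversed trend of channel 0
]

for row in connectivity_matrix(channels):
    print([round(value, 3) for value in row])
```

A matrix of this kind, computed per participant, is the sort of image-like input that a CNN classifier can consume.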

Ensemble Models
This section briefly highlights the ensemble models presented in the reviewed studies for the diagnosis of depression. In [64], ML and statistical models were used to predict clinical depression and MDD among individuals suffering from immune-mediated inflammatory disease (IMID) by identifying patient-reported outcome measures (PROMs). LR, NN, and RF algorithms were used to analyze a dataset of 637 IMID patients. In [65], long short-term memory (LSTM) and six ML models, namely LR, logistic regression with lasso regularization, RF, gradient boosted decision tree (GBDT), SVM, and deep neural network (DNN), were used. LSTM was applied to predict the level of different depression risk factors over the course of two years. The dataset contained records of 1538 elderly people in China from the Chinese Longitudinal Healthy Longevity Survey (CLHLS). The results indicated that logistic regression with lasso regularization achieved a higher AUC value than the other ML algorithms.
Tao, Chi et al. [66] proposed an ensemble binary classifier to analyze health survey data against ground truth from the SF-20 Quality of Life scales. With an ensemble model (DT, ANN, KNN, SVM) applied to the NHANES dataset, the classifier achieved an F1 score of 0.976 in prediction, without any incorrectly identified depression instances. The limitations of this study are that rich online social media sources still need to be used for feature extraction and that the range of the dataset is not defined. Karoly and Ruehlman [67] proposed an algorithm to distinguish between MDD and BD patients based on clinical variables. LR with Elastic Net and XGBoost were applied to 103 MDD and 52 BD patients, and the LR with Elastic Net model achieved an accuracy of 78%. The limitations of this paper include the small and unbalanced sample, the lack of external sample validation, some misclassifications, and a limited range of evaluation features.
Zhao, Feng et al. [68] evaluated the depression status of Chinese recruits using ML algorithms. NN, SVM, and DT were applied to 1000 participants and achieved accuracies of 86, 86, and 73%, respectively. BDI-II was used as the assessment scale. This study still needs to incorporate complex socio-demographic and career variables into the model. Ji et al. [69] diagnosed bipolar disorder among Chinese individuals by developing a BDCC using ML algorithms. SVR, RF, LASSO, LR, and LDA were applied to sample data from 255 MDD patients, 360 BPD patients, and 228 healthy participants. The experiments obtained an accuracy of 92% for MDD and 92% for BPD detection. However, this model requires large datasets and is limited by its cross-sectional nature. Table 4 presents a comparison of different ensemble models used for the diagnosis of depression.

Discussion of Ensemble Models
Among the reviewed studies, the ensemble model in [66] obtained the highest accuracy, 95.4%. That study used the NHANES dataset for evaluation, and the model misclassified only about 4% of cases. The ensemble model achieved an F1 measure, accuracy, and precision of 97, 95, and 95%, respectively, on the whole dataset. The study also shows that the ensemble method for identifying depression remains stable and resilient on a partial dataset. The method and experiment showed that combining a classification methodology with binary ground truth may provide better prediction results than baseline standards. The ensemble technique is a straightforward approach similar to the bagging and majority voting ensemble methods.
Using five machine learning algorithms and Chinese multicenter cohort data, the ensemble model described in [69] obtained the second-highest classification accuracy, 92%. The higher AUC obtained in this study, compared to other studies, supports the acceptance of the research and the validity of the Chinese version of the BDCC. In addition, the BDCC cuts the time it takes to gather clinical data in half: the ADE takes more than 30 min to complete, while the BDCC takes 10-15 min. The present findings show that the BDCC is just as reliable as the previous form but much easier to deploy.
Considering the studies [64,65,67,69], regression has been the most used ML technique for the detection of depression. Regression is simple to implement, and its output coefficients are easy to interpret. Regression is susceptible to overfitting, but this can be mitigated using dimensionality reduction, regularization (L1 and L2), and cross-validation.
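The voting idea behind such ensembles can be illustrated with a minimal Python sketch. This is not the implementation used in [66]; the three base classifiers and their predictions are hypothetical, and the helper name `majority_vote` is illustrative only.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one list by taking the most common label.

    `predictions` holds one list of labels per base model; all lists must have
    the same length. Ties fall back to the label seen first among the votes,
    so an odd number of binary classifiers avoids ties altogether.
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*predictions):  # votes for one sample across all models
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three base classifiers (1 = depressed, 0 = not depressed).
tree_preds = [1, 0, 1, 1]
knn_preds = [1, 1, 0, 1]
svm_preds = [0, 0, 1, 1]

print(majority_vote([tree_preds, knn_preds, svm_preds]))  # prints [1, 0, 1, 1]
```

Each base model errs on a different sample in this toy case, yet the combined vote recovers the label on which the majority agrees, which is the intuition behind the stability reported for ensemble methods.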

Future Research Possibilities
Based on the review of prior research in the preceding sections, we propose some possible directions for future study.
(1) A larger data sample is required: The majority of prior depression detection research utilized a small sample size. A small sample may suffice for building a prediction model, but a bigger sample is important for constructing a more accurate model that works well throughout the population. Training a model on a large sample allows a greater diversity of depressed patients to be included, perhaps leading to models with real therapeutic value. As more studies use bigger datasets, the methods will most likely change and report more mature validation metrics. In particular, bigger datasets allow k-fold cross-validation to be run with higher k-values while each held-out fold remains large enough to test prediction models reliably, which helps in assessing generalizability.
(2) Learning method(s): Different learning techniques give better outcomes in different situations; therefore, choosing the right one is crucial. Unlabeled data may sometimes help develop a prediction model when a large sample is available but only a little of it is labeled. The first step is therefore to determine whether the incoming data are labeled, unlabeled, or a combination of both, which determines whether a supervised, unsupervised, or semi-supervised learning technique is employed. The second step depends on the objective that the learning method must address. The third step is to identify whether the relationship in the data is linear or nonlinear; linear models help to prevent overfitting when the dataset is small, whereas nonlinear models become important when the dataset is big. The last step is to choose a learning technique from the narrowed options by assessing factors such as complexity, flexibility, computation time, and optimization ability. If many candidate learning methods remain, evaluate the performance of each on the given data; if only a few remain, simply tune the default model to make it more appropriate for the given data.
(3) Clinical application: In the long term, the aim of creating a predictive model is to find a method that can improve accuracy. However, such a scenario is unlikely to arise in the next few years, since SVM and a few other supervised learning algorithms are presently trusted and appear to be well established in this area of research. Regardless, once a sufficiently strong method has been thoroughly validated in preliminary studies that demonstrate its efficacy and determine whether it will benefit patients, its progression to clinical trials will be critical. Future clinical trials should verify that machine learning methods efficiently identify depressed individuals who are unlikely to respond to the treatment under investigation, and that clinicians' use of this information improves patient outcomes (for example, a shorter delay between diagnosis and remission).
(4) Collaboration of research groups: With the significant progress among different disciplines, collaboration across disciplines is crucial for ADE. For affective computing, relevant fields include psychology, physiology, computer science, and ML. Thus, researchers should draw on each other's strengths to advance ADE. In audio-based ADE, deep models represent the depression scale only from audio; in video-based ADE, they capture patterns only from facial expressions. Notably, physiological signals also contain significant information closely related to depression estimation. Accordingly, researchers from different fields should work together to build multimodal DL approaches for clinical application.
(5) Availability of databases: Because of the sensitivity of depression data, it is difficult to obtain diverse data for estimating the scale of depression. Hence, the availability of data is a major issue. First, in contrast to the facial expression recognition task, available databases remain scarce to this day. From the literature review, one can note that the widely used depression databases are AVEC2013, AVEC2014, and DAIC-WOZ. Notably, AVEC2014 is a subset of AVEC2013. Second, there is no multimodal (i.e., audio, video, text, and physiological signals) database from which to learn comprehensive depression representations for ADE. The existing databases consist of two or three modalities. Though the DAIC database comprises three modalities (audiovisual and text), the organizer has not provided the original videos of DAIC, which causes a certain inconvenience for ADE. Third, the limited size of the datasets restricts research in depression prediction, especially when using DL technologies. For instance, AVEC2013 contains only 50 samples each for the training, development, and test sets. Effective methods to augment the limited amount of annotated data are called for to address this bottleneck. Fourth, the criteria for data collection should be standardized. At present, different organizers adopt a range of conditions, equipment, and configurations to collect multimodal data.
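The k-fold cross-validation mentioned in point (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper name `k_fold_indices` and the sample count are hypothetical, and in practice a stratified split that preserves the proportion of depressed and control participants in each fold would usually be preferable.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Return k (train_indices, test_indices) pairs covering all samples once as test data."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle before splitting
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

# Hypothetical dataset of 10 participants split into 5 folds.
for fold_number, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold_number}: {len(train)} training samples, {len(test)} test samples")
```

Each held-out fold contains roughly n/k samples, so with small datasets a high k leaves very few samples per test fold; larger datasets relax this trade-off.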

Conclusions
ML approaches can be used to assist in diagnosing mental health conditions. PTSD, schizophrenia, depression, ASD, and bipolar disorder fall within the domain of mental disorders. Social media data, clinical health records, and mobile device sensor data can be analyzed to identify mood disorders. In this paper, we surveyed state-of-the-art research studies on the diagnosis of depression using ML-based approaches. The purpose of this review paper is to provide information about the basic concepts of ML algorithms frequently used in the mental health domain, specifically for depression, and their practical application. Among the reviewed studies, SVM has been the most used classifier for detecting depression, as it works well with unstructured and high-dimensional data. SVM is also resistant to overfitting and can prove to be an efficient algorithm for data with an unknown and irregular distribution. As anticipated, most of the SVM classifiers developed in the articles had a high accuracy of greater than 75%. Because data in the mental health area are scarce, SVM outperforms other machine learning methods for diagnosis. We discussed some of the MHMS's research difficulties and potential advancements in mental health and depression. According to the research reviewed, applications based on machine learning offer significant potential for progress in mental healthcare, including the prediction of outcomes and therapies for mental illnesses and depression.

Data Availability Statement: The data supporting this study's findings are available from the corresponding author upon request.