A Systematic Review of Machine Learning Models in Mental Health Analysis Based on Multi-Channel Multi-Modal Biometric Signals

Abstract: With the increase in biosensors and data collection devices in the healthcare industry, artificial intelligence and machine learning have attracted much attention in recent years. In this study, we offered a comprehensive review of the current trends and the state-of-the-art in mental health analysis as well as the application of machine-learning techniques for analyzing multi-variate/multi-channel multi-modal biometric signals. This study reviewed the predominant mental-health-related biosensors, including polysomnography (PSG), electroencephalogram (EEG), electro-oculogram (EOG), electromyogram (EMG), and electrocardiogram (ECG). We also described the processes used for data acquisition, data-cleaning, feature extraction, machine-learning modeling, and performance evaluation. This review showed that support-vector-machine and deep-learning techniques have been well studied to date. After reviewing over 200 papers, we also discussed the current challenges and opportunities in this field.


Introduction
It is a bitter pill to swallow: At least one-in-five adults suffers from at least one form of mental health issue or disorder. These health conditions involve changes in emotions, thinking, behavior, or a combination of these [1], such as attention-deficit/hyperactivity disorder (ADHD), sleep apnea disorder, and depression [2][3][4][5][6]. Mental health issues affect well-being, impairing relationships and cognitive activities and causing body responses that may place individuals at risk.
A significant amount of research has leveraged the application of machine learning (ML) techniques for extracting, detecting, and classifying mental health biomarkers in sensor datasets [7][8][9][10][11][12]. These biosensor data are usually multi-channel, and even multi-modal, time series [13]. In the medical field, two types of signals are commonly collected for diagnosis: bio-electric and non-bio-electric signals. These signals typically require expert evaluation to make a valid diagnosis [14]. With the assistance of ML techniques, there is the potential to increase the efficiency of mental health diagnosis, and even to support the prognosis of mental disorders at an early stage, given how widely such signals have been monitored through wearable devices in recent years. Biological signals can be collected through different modalities. For example, in this paper, we reviewed the application of ML techniques for electroencephalograms (EEGs), which record signals from the brain [15]; electro-oculograms (EOGs), which record the movement signals of the eyes [16]; electromyograms (EMGs), which record signals from muscle activities during sleep stages [17][18][19]; and electrocardiograms (ECGs), which record signals from the heart via a heart-rate monitor. Non-bio-electric signals include body temperature, respiration, and blood pressure. Although there are many biological signals for diagnosing mental diseases, this work concentrated on bio-electrical signals and the ML techniques that have been used to promote the diagnosis of mental health issues [20].
There are various biological signal types [21]: bio-electrical signals, bio-acoustic signals, bio-mechanical signals, bio-chemical signals, and body temperature. Bio-electrical signals originate from the electric activities occurring in the cells of the body. These signals have been used for diagnosing various diseases using ML techniques, which are a subset of artificial intelligence (AI) methodologies. In this work, we reviewed the trends and the state-of-the-art of these ML techniques for mental health diagnosis, and we investigated the methods by which these signals are used to increase the efficiency of diagnosing mental health diseases.
These bio-electric signals are collected through electrodes and specialty devices. In sleep medicine, large datasets have been generated with these devices to assist in characterizing and quantifying sleep and sleep-related disorders [22]. Polysomnography (PSG) [23] has been the most commonly used test for the diagnosis of obstructive sleep apnea syndrome (OSAS) and other related ailments. PSG procedures have been conducted primarily overnight in a sleep laboratory. To effectively diagnose sleep disorders, PSG records have been collected and scored by experts [24][25][26][27]. PSG records are data extracted from brain-wave recordings, oxygen levels, heart rate monitors, and breathing rates, as well as the leg and eye movements of patients. EEG, EOG, EMG, and ECG signals as well as sleep videos have also been integrated into PSGs [17,23–25,28]. It has been estimated by the World Health Organization (WHO) that nearly one-third of the world population suffers from sleep disorders [29]. PSG analysis has been defined as the gold standard for detecting sleep disorders and other mental health diseases [30]. PSG records are multi-channel signals. For sleep studies and scoring, an expert is often required to manually examine PSG records. Therefore, the results are at risk of human error, and the process is time-consuming and expensive to carry out [31].
Collecting PSG data can be very expensive and uncomfortable for the patient; therefore, it is vital to ensure an accurate diagnosis based on this test's results [32][33][34][35]. The traditional PSG process requires the measuring of EEG, EOG, EMG, and ECG signals [28]. A significant amount of research has employed deep-learning approaches to model the spatio-temporal aspects of PSG data [24]. Later in this paper, we review the advantages of ML in the study of mental health diseases. Since 1970, there have been improvements in the automatic scoring of PSG records, in accordance with Rechtschaffen and Kales (RK) sleep research [33,36] and the American Academy of Sleep Medicine (AASM) rules [37]. The visual interpretation of the PSG signals of patients has been a widely accepted approach for analyzing sleep stages and mental-health-related diseases [38]. In many countries, however, PSG technology and experts in sleep study have been limited, so there is an urgent need to achieve automated PSG data analysis with the help of AI techniques [39].
Sleep recordings require the measurement of brain activity (EEG), eye movement (EOG), and muscle activity (EMG) to accurately identify specific sleep stages. EEGs have been intensely researched by many scholars. EEG signals are classified by employing a common spatial pattern (CSP) and differential entropy (DE) characteristics to the delta, theta, alpha, beta, and gamma frequency bands [40,41]. The diagnosis of mental health and sleep disorders can be tedious and requires significant time investment and expertise to obtain a reliable and accurate diagnosis. In many cases, patients have been subjected to prolonged interviews to improve the diagnostic accuracy of the health personnel or expert [41]. With an EEG system, some limitations have been overcome, and the process of feature extraction, classification, and prediction for the diagnosis of mental health diseases based on PSG datasets could potentially be automated using ML techniques.
Among the signal types, EEGs have been the focus of much study for mental health diagnosis by many researchers. However, there are significantly fewer articles on other biological signals, such as EMG, EOG, and ECG. The study in [42] presented a comprehensive survey of ECG signals, and the authors concluded that a significant number of studies will be published on ECG in the near future. The study in [43] showed that HRV analysis was a viable method for feature extraction from ECG signals. The researchers in [44] proposed and evaluated an automated analysis of single-lead ECG signals using human recognition patterns. EMGs have different statistical and spectral properties from the other signals [25]. EMG signals have been used as a bio-signal for hand-and-wrist-gesture recognition [45]. PSG data provide comprehensive information for sleep studies and sleep disorder diagnosis.
It has been estimated that at least 2-4 percent of adults and 1-3 percent of children suffer from sleep-related ailments [31]. Many classification schemes have been used for determining sleep stages. The application of ML and AI has assisted scientists and health professionals in recent times in improving the accuracy of sleep-stage classification and of the diagnosis of mental health diseases [46]. Combrisson et al. [47] implemented several algorithms for the automatic detection of sleep features and embedded them within a software platform, which they referred to as "detection" panels.
PSG records have been broken into 30 s epochs, which were then classified as different sleep stages by experts [48], based on the AASM and Rechtschaffen and Kales sleep classification recommendations [29,48,49]. Sleep has been classified into periods of rapid eye movement (REM) and non-rapid eye movement (NREM), including Stage W (wakefulness), Stage N1 (NREM 1), Stage N2 (NREM 2), Stage N3/N4 (NREM 3), and Stage R (REM) [22,24,25,32,50]. Many studies have identified EEG signals as a more effective bio-electric signal for sleep classification [51] and for the diagnosis of other mental health diseases, such as depression and ADHD [35,52]. Figure 1 [29,38] shows a schematic flowchart of sleep-stage classification with PSG signals from bio-electric signals, specifically EEG. We classified sleep as light sleep and deep sleep, and the wave pattern shown in Figure 1 is typical for EEG signals [53]. Each sleep stage can be distinguished based on the wavelengths. In this work, we reviewed the application of ML techniques on multi-modal and multi-channel PSG datasets. This work aimed to provide researchers with information on the current trends related to the application of ML for bio-electrical signals.
The rest of this paper is structured as follows: First, a background section with a subsection details the method of article selection. Secondly, a section illustrates the methods of applying ML techniques on multi-modal and multi-channel PSG datasets. This section has multiple subsections, including data acquisition, data preparation, feature extraction, balancing datasets, ML techniques, and performance evaluation. Thirdly, a summary and discussion section is presented. Finally, this study is concluded, and some recommendations are discussed.

Literature Search Process
For this review, we carried out keyword searches on specific literature databases. The keywords were based on the goal of this study and were as follows: ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health". This search was carried out on commonly used databases, such as Science Direct, IEEE Xplore, MDPI, and PubMed. The search criteria used for obtaining literature for this work are summarized as follows:
• Publications had to be released in 2017 or later.
• To deepen the understanding of the research questions, we also added 25 articles published between the years 2000 and 2016. This range was selected based on references from similar research.
• Articles had to have at least one of the keywords.
• Articles had to be published in recognized literature databases/websites.
• All selected papers had to be written in English.
• All papers were either studies, surveys, or reviews of the application of ML on PSG data and the classification of mental health issues using ML.
Table 1 shows the digital-database advanced search strings used to collect articles for this review. The search string ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health" consisted of commonly used boolean operators [1,54], i.e., AND (must be included in the search) as well as OR (may or may not be included in the search). Using this string, 1074 articles were identified, and they were further screened for inclusion in this work.
Table 1. A summary of search criteria and results from the different digital databases.

Digital Database | Search String Used | Total Articles Collected
IEEE Xplore Access | ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health" | 41
Science Direct | ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health" | 944
MDPI | ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health" | 26
PubMed | ("EEG" OR "ECG" OR "EOG" OR "EMG" OR "PSG") AND "Machine Learning" AND "Mental Health" | 76

Figure 2 shows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow for the article search and collection in our review. Following the PRISMA 2020 statement and guidelines, a checklist was used to structure this systematic review and avoid biases during article selection. PRISMA was designed in 2009 [55] to address poor or weak reporting of systematic reviews, and it also assists in structuring a review so that it provides useful value for its readers. This study was also registered in INPLASY, which is an international platform of registered systematic review and meta-analysis protocols. Many existing review articles studied in this systematic review also used the PRISMA statement and guidelines [32,54,56–60]. This work leveraged prior research from different authors to perform a detailed review of the applications of ML on multi-channel and multi-modal PSG data. In this work, we included 218 papers that met our criteria and were related to our research questions in terms of material presentation, methods, and results, as shown in the PRISMA flow diagram in Figure 2. Records were screened and reviewed against the quality of work conducted in the studies and its relevance to our research goal.
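The boolean search string used for this review can be read as a simple predicate over a title or abstract. The following is a minimal sketch under stated assumptions: `matches_query` is a hypothetical helper written for illustration, it does not reproduce the matching rules of any of the databases, and it treats terms as case-insensitive whole words with an optional plural "s".

```python
import re

# Terms copied from the search string; the matching rules are assumptions.
MODALITY_TERMS = ("EEG", "ECG", "EOG", "EMG", "PSG")
REQUIRED_PHRASES = ("machine learning", "mental health")


def matches_query(text: str) -> bool:
    """Evaluate (EEG OR ECG OR EOG OR EMG OR PSG) AND "Machine Learning" AND "Mental Health"."""
    lowered = text.lower()
    # OR group: at least one modality term must appear as a whole word.
    has_modality = any(
        re.search(rf"\b{term}s?\b", text, flags=re.IGNORECASE)
        for term in MODALITY_TERMS
    )
    # AND group: every required phrase must appear.
    has_phrases = all(phrase in lowered for phrase in REQUIRED_PHRASES)
    return has_modality and has_phrases
```

A record such as "Machine learning for EEG-based mental health screening" satisfies the predicate, whereas a record that lacks any of the five signal terms does not.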

Word-Cloud Overview
Word clouds and bar graphs were created based on the titles and abstracts of all the articles reviewed [61]. Figure 3 shows a word cloud and bar graph of all the titles of the articles reviewed for this work. The most frequent words, as shown by the word cloud, were learning, EEG, sleep, machine, detection, and anomaly, among others. Figure 4 shows a word cloud and the 30 most frequent words used in the abstracts from all the articles reviewed in this paper. The most frequent words used in the abstracts were similar to those used in the full-text articles, but with more weight on words such as data, model, and so on. This showed that the collected articles aligned with the main goal of this study. Furthermore, most of the sleep studies were related to the use of EEG signals and considered the problem as a time-series anomaly detection problem.
Figure 5 is a summary of the number of articles reviewed in this paper, grouped by publication year, with 55 and 43 articles published in 2022 and 2021, respectively. We focused on recent research that had been carried out by researchers in this field.

Methods
In this section, Figure 6 shows a process flow of deploying ML on PSG datasets. We considered the details of each individual block, as shown in Figure 6. Each block shows how the PSG data were prepared [62], extracted, processed, and classified for disease diagnosis, step by step. A similar approach was also used by Sekkal et al. [63].

Data Preparation
PSG signal-processing methodologies have been divided into four stages [67]: data acquisition, pre-processing, feature extraction, and classification. PSG data have become the most prevalent records used for studies that apply ML for mental health diagnosis and classification, and there were many open-access datasets that had been generated by previous research. In this review, we summarized some of the available datasets and the papers that used them. We identified three predominant resources that provided open access to PSG data. 1.
An [90][91][92]. With the assistance of expert interviews, the data were determined. All raw PSG datasets were stored in the European data format (EDF). Devices/sensors were always attached to the body of the subject overnight to collect data [41,42,93]. The key characteristics of a good dataset were the following: It had an appropriate sampling frequency in Hz with multiple channels and a reference electrode. The source of the dataset was also necessary to ensure researchers could reference the datasets in their works. The collected datasets had to be large in size and heterogeneous in nature. Most EEG datasets used 19 or 16 channels [67,94], i.e., (F3, F4, T3, T4, C3, C4, P3, P4, FZ, CZ, PZ, Fp1, Fp2, F7, F8, T5, T6), and (O1, O2) [67,74]. Notations including F, T, C, P, and O denoted frontal, temporal, central, parietal, and occipital, respectively, which were used to identify the brain lobes and placements of the electrodes on the scalp surface [14,21,54]. Figure 7 shows the various locations of electrode placement for electroencephalogram (EEG) data collection using a 10-20 scalp electrode system [14,95]. The scalp electrodes were used to record brain activities via EEG signals, which were recorded by an electroencephalograph. In clinical practice, a standard ECG signal was obtained using 10 electrodes (4 limb and 6 thorax electrodes) [21,96]. All these bio-signals included noise that originated from the patient's body and the environment. Moreover, noise caused distortions in the time and frequency of the signals. A filtering process, known as pre-processing, is typically required to eliminate this noise [97]. Most existing sleep studies considered EEG signals the predominant PSG signals for the diagnosis. These are brain wave signals collected at a sampling frequency.

Pre-Processing of Signals
Before feature extraction, all acquired signals needed to be pre-processed by setting an amplitude threshold. It is common practice to set the EEG threshold at 100 µV. Signals above this threshold are then considered to be noise [95,98]. Figure 8 shows the pre-processing steps used to eliminate various noises from EEG signals. There are many techniques used for pre-processing. For instance, the ResNet-50 model [99] was adapted to automatically extract EEG features and reduce the manual steps required for pre-processing data. EEG waveforms are typically categorized into five frequency bands, namely, Delta (δ), Theta (θ), Alpha (α), Beta (β), and Gamma (γ) [34,74,79,95,100–105]. These bands carry informative frequency detail for signal classification based on the waveform, such as sleep-stage and sleep-disorder classification [28,31]. Table 3 shows the various frequencies and amplitudes as recommended by the American Academy of Sleep Medicine (AASM).
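As an illustration of threshold-based pre-processing, the sketch below splits a single-channel recording into the 30 s epochs used for sleep scoring and discards any epoch whose absolute amplitude exceeds the 100 µV threshold. It is a minimal sketch, assuming the signal is a plain list of samples in microvolts; the function names are hypothetical, and discarding whole epochs is only one of several possible ways to handle samples flagged as noise.

```python
from typing import List


def split_into_epochs(signal: List[float], fs: float, epoch_s: float = 30.0) -> List[List[float]]:
    """Cut a signal sampled at fs Hz into non-overlapping epochs of epoch_s seconds.

    A trailing partial epoch is dropped (an assumption of this sketch).
    """
    n = int(round(fs * epoch_s))
    if n <= 0:
        raise ValueError("fs * epoch_s must give at least one sample per epoch")
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]


def reject_noisy_epochs(epochs: List[List[float]], threshold_uv: float = 100.0) -> List[List[float]]:
    """Keep only epochs whose peak absolute amplitude stays within the threshold."""
    return [e for e in epochs if max(abs(v) for v in e) <= threshold_uv]
```

For example, with a signal sampled at 2 Hz, each epoch holds 60 samples; an epoch that contains a single 150 µV sample would be removed, while epochs that stay within ±100 µV are kept.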

Feature Extraction
After filtering and signal pre-processing, informative features needed to be extracted [43,80,106,107]. This was one of the critical steps for the application of ML models for bio-electric signals, and the appropriate design of this step has improved model performance [108,109]. Different ML models have been used for PSG data analysis [28,31]. Most models used for PSG datasets required robust feature extraction that yielded sufficiently correlated features. These features have been extracted based on uni-variate (measures taken on each channel separately), multi-variate (measures taken on two or more channels) [110], and multi-modal (measures taken from multiple modalities, such as EEG, ECG, EMG, and EOG) approaches. There are four predominant feature types commonly extracted from PSG data: (1) time-domain, (2) frequency-domain, (3) time-frequency domain, and (4) non-linear [39,111–113]. EEG has typically been analyzed in the frequency domain, while EOG and EMG have been analyzed in the time domain [36]. Power spectrum density (PSD) has been a common approach for feature extraction from EEG signals [114][115][116][117]. The most popular method of estimating PSD was based on measuring the signal's power as a function of frequency [34,47,71,118]. PSD was calculated using Welch's method and the Fourier transform [114,119–123]. Other widely researched feature extraction methods for PSG signals have included continuous wavelet transform (CWT) coefficients, autoregressive (AR) coefficients [124,125], and Hjorth parameters [126]. Zhang et al. [112,127] and Galvao et al. [87] discussed additional feature extraction methods. Table 4 shows a summary of feature extraction methods for PSG data.
Table 4. A summary of feature extraction methods for PSG data.
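Two of the feature families named above, PSD-based band power and Hjorth parameters, can be sketched in a few lines. The example below is illustrative only: the band edges in `BANDS` are commonly used conventions assumed here rather than values taken from this review, the function names are hypothetical, and the naive DFT is used solely to keep the sketch dependency-free (in practice, a library routine such as `scipy.signal.welch` would normally be preferred).

```python
import cmath
import math
from statistics import pvariance

# Assumed band edges in Hz (not taken from the paper; conventions vary).
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 45.0),
}


def welch_psd(x, fs, seg_len):
    """One-sided PSD averaged over Hann-windowed segments with 50% overlap."""
    if seg_len % 2 or len(x) < seg_len:
        raise ValueError("seg_len must be even and no longer than the signal")
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / seg_len) for n in range(seg_len)]
    scale = fs * sum(w * w for w in window)
    n_bins = seg_len // 2 + 1
    psd = [0.0] * n_bins
    starts = range(0, len(x) - seg_len + 1, seg_len // 2)
    for start in starts:
        segment = x[start:start + seg_len]
        mean = sum(segment) / seg_len
        tapered = [(v - mean) * w for v, w in zip(segment, window)]
        for k in range(n_bins):
            # Naive DFT bin; a real pipeline would use an FFT routine instead.
            coeff = sum(t * cmath.exp(-2j * math.pi * k * n / seg_len)
                        for n, t in enumerate(tapered))
            one_sided = 1.0 if k in (0, n_bins - 1) else 2.0
            psd[k] += one_sided * abs(coeff) ** 2 / (scale * len(starts))
    freqs = [k * fs / seg_len for k in range(n_bins)]
    return freqs, psd


def band_powers(freqs, psd):
    """Integrate the PSD over each assumed EEG band (rectangle rule)."""
    df = freqs[1] - freqs[0]
    return {name: sum(p for f, p in zip(freqs, psd) if lo <= f < hi) * df
            for name, (lo, hi) in BANDS.items()}


def hjorth_parameters(x):
    """Return Hjorth activity, mobility, and complexity of a 1-D signal."""
    dx = [b - a for a, b in zip(x, x[1:])]
    ddx = [b - a for a, b in zip(dx, dx[1:])]
    activity = pvariance(x)
    mobility = math.sqrt(pvariance(dx) / activity)
    complexity = math.sqrt(pvariance(ddx) / pvariance(dx)) / mobility
    return activity, mobility, complexity
```

For a pure 10 Hz sinusoid, nearly all of the power falls in the assumed alpha band and the Hjorth complexity is close to 1, which gives a quick sanity check before applying the functions to real epochs.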

Balancing Datasets
It has been established that PSG data in their raw state are not balanced, as a normal sleep pattern contains more non-REM sleep than REM sleep, as well as more light sleep than deep sleep [137]. The imbalance makes it difficult for an ML model to be trained effectively. Zhou et al. [138] studied different dataset-balancing approaches. Efe et al. [139] proposed a hybrid neural network architecture using focal-loss and discrete-cosine-transform methods to solve the training data imbalance. Utomo et al. [133] proposed a model based on ECG signals to address imbalanced learning challenges. Over-sampling and under-sampling have been two common strategies, but each has critical weaknesses. For instance, the most straightforward way to handle class imbalance has been to increase the number of minority-class samples, i.e., over-sampling, but this approach disrupts the data architecture [140].
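Random over-sampling, the simplest of the strategies mentioned above, can be sketched as follows. This is a minimal sketch, assuming epochs are represented by arbitrary Python objects with one sleep-stage label each; `random_oversample` is a hypothetical helper and is not the method of any of the cited studies. Because minority epochs are merely duplicated, the balanced set contains no new information, which is one reason over-sampling should be applied with care.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes match the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement to fill the gap to the majority-class size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

In a typical workflow, this would be applied to the training split only, so that duplicated epochs do not leak into validation or test data.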

Machine-Learning Modeling
ML leverages the framework of mathematical modeling to classify, predict trends, and detect anomalies in specified time series [4,30,141,142]. In healthcare, ML has been used for the feature extraction and classification of disease in many studies [57,143,144]. There have been a plethora of ML approaches used for PSG data classification and performance improvement. Research concerning AI and ML approaches for the analysis of PSG datasets has shown an upward trend [145,146]. When applying ML techniques to any dataset, statistical and machine-learning models have been the two most common types of models applied [66,147], and these have been further broken down into sub-categories, including supervised, semi-supervised, and unsupervised learning [8,59,148,149]. PSG signals have been treated as multi-variate, multi-modal time series [150]. Lu et al. [7,40] concluded that deep learning was the most-used ML approach for feature extraction. A significant amount of research has explored both statistical and deep-learning approaches on PSG datasets [151]; support vector machines were widely studied among the shallow ML approaches, and CNNs among the deep-learning approaches. Sarkar et al. [152] studied the suitability of recurrent neural networks (RNN) with long short-term memory (LSTM), support-vector-machine (SVM) [151,153,154], and logistic-regression (LR) models [155,156] to monitor depressive symptoms via EEG and found that, under supervised learning, SVM and LR outperformed the others. In recent years, different ML models [40,46] have been used to classify and diagnose mental health, imagery, emotions, behaviors, etc. With the recent increase in the use of ML in the healthcare domain, some of these models have been intensely studied, while others have not been sufficiently explored. Thamaraimanalan et al. [157] proposed a radial basis function network (RBFN), which is a variation of artificial neural network (ANN) models. The main aim of the RBFN model was to solve problems faster and more accurately. The authors of [158] studied explainable artificial intelligence (XAI), which assisted the final users in obtaining a reasonable explanation of the underlying fundamentals of the AI model. Many models have been proposed in various studies. Below are some of the popular models noted in the literature selected for this review.
K-Nearest Neighbor (KNN): KNN has been shown to provide high accuracy for EEG-based emotion classifications [44,86,159,160]. It is a supervised learning method that was first developed in 1951. KNN has commonly been used for both classification and regression [161]. KNN is considered to be one of the simplest ML models. It follows the principle that the "majority carries the day": an object is classified based on the plurality vote of its neighbors [44]. The classification speed decreases as the number of variables increases. KNN algorithms are also notably sensitive to the actual data structure.
Support Vector Machine (SVM): SVM is a kernel-based, supervised ML algorithm, which has also been commonly adopted for regression problems [154,159,162]. It has been widely studied and used for the classification of PSG datasets [161,163–165]. In many studies, SVM has resulted in a higher accuracy score than its unsupervised counterparts [161]. Similar to KNN, it is efficient in analyzing data for classification and regression. In contrast to KNN, however, SVM is a fast and reliable algorithm, and it also performs well with a limited number of samples for analysis.
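The plurality-vote rule behind KNN can be made concrete with a short sketch. The example below assumes Euclidean distance and feature vectors stored as tuples; `knn_predict` is a hypothetical helper, and the two-dimensional features and stage labels used with it are invented purely for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by the plurality vote of its k nearest training points."""
    # Pair each training point's distance to the query with its label.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

Because every prediction scans all training points and every feature dimension, the cost of a query grows with both the dataset size and the number of variables, which matches the slowdown noted above.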
Logistic Regression (LR): LR uses a logistic function to model the dependent variable [166]. Subani et al. [165] used LR to model the relationship between a reduced set of features and the corresponding treatment outcomes based on captured datasets that had been processed and feature-extracted [163][164][165]. For feature interpretation, LR model coefficients have been noted as indicators. LR is easy to implement and interpret, and unknown records are easily classified. When the datasets were linearly structured, LR was very effective [166], because LR assumes a linear relationship between the independent variables and the log-odds of the outcome [166].
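The role of the logistic function and of the coefficients as interpretable indicators can be sketched with a small batch-gradient-descent fit. This is a minimal sketch under stated assumptions: `fit_logistic` and `predict_proba` are hypothetical helpers, labels are 0 or 1, and the learning rate and iteration count are arbitrary illustrative choices rather than recommended settings.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights and bias by minimizing the mean log-loss with gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y  # derivative of log-loss w.r.t. the score
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b
```

After fitting, the sign and magnitude of each entry in `w` indicate how the corresponding feature shifts the log-odds of the positive class, which is what makes the model straightforward to interpret.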
Extreme Learning Machine (ELM): ELM uses a single-hidden-layer feedforward neural network (SLFN) and chooses the input weights randomly [43,125,133,134,167,168]. ELM is a simplified form of an artificial neural network (ANN). ELM was invented in 2006. It differs from other neural network models in that it does not need gradient-based backpropagation to be trained. It has been reported to be less accurate than other neural network models. Kadam et al. [125] studied a different type of ELM, called hierarchical ELM, which extended the basic ELM to multiple layers. Hierarchical ELM was implemented as a supervised learning method.
Multi-Layer Perceptron (MLP): MLP is classified as a feedforward ANN [152,169,170] that has input, output, and hidden layers in its architecture [137]. It is credited as the algorithm that forms the base of a complex neural network. MLP classifies data that are not linearly separable. For difficult or complex datasets, MLP can be customized with a robust architecture to solve regression and classification tasks. In many applications, MLP has been shown to be sensitive to feature-scaling due to the option of its activation functions.
Long Short-Term Memory (LSTM): LSTM has been considered by many researchers as an effective and scalable model for several learning problems related to time-series data [7,8,13,35,82,105,119,169,171–174]. Using LSTM on PSG data has also resulted in much success. LSTM has been firmly established as a state-of-the-art approach in sequence modeling [175,176]. In addition, LSTM has been credited with advanced results in sequence-processing tasks [131,142,177–183]. The study in [175] presented a more robust model called a transformer, which was the first sequence-transduction model based entirely on attention, replacing recurrent layers [175,176,184]. To the best of our knowledge, there have been few studies that have applied this model on a PSG dataset [185].
Convolutional Neural Network (CNN): A CNN is typically composed of two types of layers, in which a convolutional layer is followed by a max-pooling layer [169,186–188]. CNNs are more commonly used for image recognition and feature extraction. Using a CNN alone has produced relatively low forecasting accuracy for time-series data; therefore, a CNN-LSTM hybrid model has been widely studied on PSG datasets [7,66,142,169,189–196]. For EEG-based analysis, in which the signals are non-linear due to their random and chaotic properties, it provided high accuracy [192,194].
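The convolution-then-pooling pattern described above can be illustrated on a one-dimensional signal. The following is a minimal sketch with a hand-picked kernel instead of learned weights; the function names are hypothetical, and, as in most CNN libraries, the "convolution" is implemented as a sliding dot product (cross-correlation) without padding.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


# A rising-edge detector applied to a toy square wave.
feature_map = max_pool1d(relu(conv1d([0, 0, 1, 1, 0, 0, 1, 1], [-1, 1])))
```

Here the kernel responds to rising edges, the activation discards negative responses, and pooling halves the temporal resolution while keeping the strongest response in each window. In a CNN-LSTM hybrid, feature maps of this kind would be passed on to the recurrent layers.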
Spiking Neural Network (SNN): SNN is often referred to as the third generation of ANN [197]. It is a relatively rare approach used to model spatio-temporal brain data (STBD), and EEG is a well-known non-invasive type of STBD [198]. It has the ability to learn from changes in temporal data. SNN was inspired by information processing in biology [199]. Despite the increase in the research using SNN, SNN performance has been reported as relatively low, as compared to other ML counterparts [199]. This limitation was found in major benchmark datasets. However, because of SNN's ability to measure biological spikes without further transformation issues, it has attracted the interest of AI researchers. The training time of SNN has been an impediment, due to the fact that SNN uses a more complex method, as compared to other CNN approaches [199]. Table 5 provides a summary of the studies using machine learning for the classification and prediction of PSG data.

Performance Evaluation
In this section, we discuss model performance measures, which quantify the effectiveness of a model for classifying or predicting new cases or disease conditions after it has been trained, validated, and tested using the available dataset. The results are described based on different aspects of performance [193,195,203,204]. Most of the papers in this review measured accuracy, precision, sensitivity (recall), specificity, F1-scores, and confusion matrices [13,54,97,205–210].

1. Sensitivity: It is also known as recall. This measures the ratio of the number of samples in a class that were correctly predicted to the total number of samples in that class. Sensitivity can be calculated based on true positive (TP) and false negative (FN) parameters [31,208,211]. Equation (1) shows a mathematical representation of the sensitivity computation.

\[ \text{Sensitivity} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{1} \]

2. Accuracy: This is the fraction of samples that were correctly classified. Accuracy can be expressed as the ratio of the summation of true-positive (TP) and true-negative (TN) parameters to the total sample size, which includes true positive (TP), false positive (FP), false negative (FN), and true negative (TN) [31,137,208,211]. Equation (2) shows a mathematical representation of accuracy.

\[ \text{Accuracy}\,(A_{cc}) = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{False Positive} + \text{True Negative} + \text{False Negative}} \tag{2} \]

3. Precision: It is the ratio of the samples correctly predicted as positive to the total predicted positive samples. Equation (3) shows a mathematical representation of the precision computation.

\[ \text{Precision}\,(P_{s}) = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{3} \]

4. Specificity: It measures how many healthy (negative) samples were identified as healthy (negative) samples by a model. Equation (4) shows a mathematical representation of the specificity computation.

\[ \text{Specificity}\,(S_{e}) = \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} \tag{4} \]

5. F1-score: It is a function of precision and sensitivity (recall). It is represented as the harmonic mean of sensitivity and precision. F1-scores range from 0 to 1, with 1 indicating perfect precision and sensitivity (recall) and 0 indicating the lowest precision and sensitivity. Equation (5) shows a mathematical representation of the F1-score computation.

\[ \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{5} \]
The above are the most commonly used performance evaluation metrics for classification problems. Using accuracy alone to determine the performance of a classification model could be misleading. Calculating a confusion matrix provides a more accurate benchmark for evaluating the performance of classification models, particularly regarding their accuracy and suitability [54,107,204,212–214].
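The five metrics can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention assumed here, not one prescribed by the reviewed papers.

```python
def _ratio(numerator, denominator):
    """Divide safely; return 0.0 for an empty denominator (an assumed convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics described above from binary confusion-matrix counts."""
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Imbalanced toy example: 10 positive cases among 1000 samples.
metrics = classification_metrics(tp=5, fp=5, tn=985, fn=5)
```

In this toy example, the accuracy is 0.99 even though only half of the positive cases are detected (sensitivity, precision, and F1-score are all 0.5), which illustrates why accuracy alone can be misleading on imbalanced data.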

Summary and Discussion
The study of ML methodologies for analyzing biomedical signals has attracted increasing attention in recent years. In this work, the reviewed literature provided an overview of the application of ML approaches to PSG datasets. In this section, we summarize the advantages, the limitations, and the current research gaps concerning these models. We provide detailed steps for applying ML methods to multi-channel and multi-modal PSG data. Using ML methods for feature extraction, prediction, diagnosis, and disease classification has reduced the dependence of clinical professionals on the manual processing of PSG datasets.
ML facilitates a more robust and deeper understanding of PSG dataset processing. The benefits of ML for mental health diagnosis and classification have been confirmed. We reviewed the literature to identify the key steps for applying ML to PSG datasets, as well as the limitations and benefits. We discuss data-capturing, the types of datasets, data-processing, feature extraction, model classification, and performance evaluation.

Challenges of Using ML on Multi-Channel and Multi-Modal PSG Datasets
Data-capturing: Biomedical signals are often recorded using multiple electrodes. This increases the dimensionality of the recorded signals, which makes the analysis of multi-channel PSG datasets challenging. Typically, multi-channel EEG signals are converted to single-channel signals for ease of analysis.
Recently, an increase in available PSG datasets has made ML research possible. Research conducted by Guillot et al. utilized eight different datasets [185]. Most of the articles reviewed in this work used open-access datasets. Using the appropriate type of data increased the accuracy of the results. The limitations of certain datasets concern the issues of imbalance and noise. For satisfactory ML development, datasets must be cleaned to remove noise, and otherwise prepared, before a classification model is applied.
Data-formatting: Biomedical signals are stored using different data formats, such as the European data format (EDF), the general data format (GDF), and BrainVision (VHDR, VMRK, EEG). It is standard practice to store PSG data in EDF or EDF+ format, as these are simple and flexible formats for the exchange and storage of multi-channel, multimodal biological and physical signals [215,216]. However, inconsistencies in data-formatting create barriers to the widespread use of a dataset in the ML research community. For example, unlike the comma-separated-values (CSV) files commonly used for data storage, EDF and EDF+ files require a special tool to read and pre-process.
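To illustrate why EDF files need dedicated tooling, the following sketch parses the fixed-width ASCII header that precedes the signal data, using only the Python standard library. The field widths follow the published EDF header layout. The function names, the channel labels, and the synthetic header built for the demonstration are hypothetical and do not come from any dataset discussed in this review. In practice, a dedicated EDF reader would normally be used instead.

```python
from typing import Dict, List, Union

# (field name, width in bytes) of the fixed 256-byte EDF header record
_FIXED_FIELDS = [
    ("version", 8), ("patient_id", 80), ("recording_id", 80),
    ("start_date", 8), ("start_time", 8), ("header_bytes", 8),
    ("reserved", 44), ("n_records", 8), ("record_duration_s", 8),
    ("n_signals", 4),
]


def parse_edf_header(raw: bytes) -> Dict[str, Union[str, int, float, List[str]]]:
    """Parse the fixed EDF header and the per-signal channel labels."""
    if len(raw) < 256:
        raise ValueError("an EDF header is at least 256 bytes long")
    header: Dict[str, Union[str, int, float, List[str]]] = {}
    offset = 0
    for name, width in _FIXED_FIELDS:
        header[name] = raw[offset:offset + width].decode("ascii").strip()
        offset += width
    n_signals = int(header["n_signals"])
    header["n_signals"] = n_signals
    header["n_records"] = int(header["n_records"])
    header["header_bytes"] = int(header["header_bytes"])
    header["record_duration_s"] = float(header["record_duration_s"])
    # Channel labels: n_signals fields of 16 bytes each, right after the fixed part
    header["labels"] = [
        raw[256 + 16 * i:256 + 16 * (i + 1)].decode("ascii").strip()
        for i in range(n_signals)
    ]
    return header


def make_demo_header(labels: List[str]) -> bytes:
    """Build a synthetic header (no real recording) for demonstration only."""
    ns = len(labels)
    fixed = "".join([
        "0".ljust(8), "X X X X".ljust(80), "Startdate X X X X".ljust(80),
        "01.01.00", "00.00.00", str(256 + 256 * ns).ljust(8), "".ljust(44),
        "1".ljust(8), "30".ljust(8), str(ns).ljust(4),
    ])
    # Labels take 16 bytes per signal; the other per-signal fields are left blank
    per_signal = "".join(label.ljust(16) for label in labels) + " " * (240 * ns)
    return (fixed + per_signal).encode("ascii")


# Hypothetical two-channel recording with 30-second data records
print(parse_edf_header(make_demo_header(["EEG Fpz-Cz", "EOG horizontal"])))
```

Unlike a CSV file, nothing in this layout is self-describing: a reader must know the field widths in advance, which is the barrier described above.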
Data imbalance: PSG datasets must be processed and balanced for satisfactory classification performance. Due to the nature of these datasets, balancing all the channels in a multi-channel dataset is always a challenge. In many cases, there is a dominant class. We found a significant amount of literature that detailed the process of balancing a PSG dataset. Furthermore, additional modalities included in a PSG dataset must each be treated as a separate modality, which further increases the complexity of balancing the dataset.
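One common way to compensate for a dominant class, shown here only as a hedged sketch, is to weight each class inversely to its frequency during training. This is one option among several, such as resampling, and is not a procedure this review prescribes. The function name and the sleep-stage labels in the example are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count).

    A perfectly balanced dataset gives every class a weight of 1.0;
    rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical epoch labels in which stage N2 dominates
epoch_labels = ["N2"] * 6 + ["N1"] * 2 + ["W"] * 4
print(inverse_frequency_weights(epoch_labels))
```

The resulting weights can then be passed to any classifier that accepts per-class weights, so that errors on rare classes contribute more to the training loss.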
Extracting features: In this study, we showed the most commonly used methods for feature extraction. However, the suitability of a feature extraction method depends on the properties of the data, so extensive testing is usually required. PSG datasets often contain a significant amount of noise, making feature extraction difficult.
Classification model: As with other prediction models, classification requires training, validation, and testing datasets. Dividing PSG datasets into training, validation, and testing data poses a challenge because of insufficient labeling, large dataset sizes, and large time stamps. The presence of additional modalities in PSG datasets also poses a challenge for effective data fusion.
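As one possible way to organize the division into training, validation, and testing data, the sketch below assigns whole subjects, rather than individual epochs, to each subset, so that no subject contributes to more than one subset. It assumes that every epoch carries a subject identifier, which is an assumption made for this illustration. The function name, the split fractions, and the identifiers are likewise hypothetical.

```python
import random
from typing import Dict, List, Sequence


def split_by_subject(
    subject_ids: Sequence[str],
    train_frac: float = 0.6,
    val_frac: float = 0.2,
    seed: int = 0,
) -> Dict[str, List[int]]:
    """Return epoch indices for train/val/test, keeping each subject in one subset."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)  # reproducible shuffle of the subjects
    n_train = round(len(subjects) * train_frac)
    n_val = round(len(subjects) * val_frac)

    # Map each subject to exactly one subset; the remainder goes to the test set
    subset_of: Dict[str, str] = {}
    for i, subject in enumerate(subjects):
        if i < n_train:
            subset_of[subject] = "train"
        elif i < n_train + n_val:
            subset_of[subject] = "val"
        else:
            subset_of[subject] = "test"

    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for index, subject in enumerate(subject_ids):
        splits[subset_of[subject]].append(index)
    return splits


# Hypothetical dataset: 10 subjects with 3 labeled epochs each
ids = [f"S{i:02d}" for i in range(10) for _ in range(3)]
splits = split_by_subject(ids)
print({name: len(indices) for name, indices in splits.items()})
```

Splitting at the subject level is a design choice of this sketch: it prevents epochs from the same recording from appearing in both the training and testing data.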
Performance evaluation: We found in this work that the most used performance evaluation metrics for ML applications were sensitivity (recall), specificity, F1-score, and accuracy. Most of the literature in this review calculated model accuracy as a way of determining model performance, but calculating accuracy alone was not an effective evaluation of the robustness of a model. It remained challenging to interpret the information presented in PSG datasets without the assistance of experts to determine the actual performance of a classification model. In practice, sex and race were considered protected attributes. This increased concerns regarding the security of subjects' or patients' private information, since patients' actual personal information was required for data-capturing. Measuring fairness for PSG datasets has also been challenging, and the fairness measures used for ML have been reported as not universally suitable [217].

Research Gaps
It was clear that ML offered significant benefits for the diagnosis and classification of mental health issues when used correctly. We showed that when implementing ML for multi-variate multi-modal PSG datasets, a solid understanding of the steps and techniques was required. Many techniques and methods are highly dependent on the type of dataset. One major gap in the research was that gold-standard methods and techniques have yet to be clearly defined.
Data-capturing: Compared to other data-capturing procedures for ML applications, sleep studies required an overnight period for data-capturing. Many patients reported this as "too long". In the articles reviewed in this work, there was little effort to reduce this time. Furthermore, to the best of our knowledge, there has been limited research concerning the reduction in the number of electrodes or sensors used to capture these datasets. Using a high number of electrodes has been reported as uncomfortable by some patients [218,219].

Data-formatting: The accessibility of bio-electric datasets has been limited by the exclusive use of the EDF and EDF+ file formats. Among the 204 articles selected for this review, none proposed or developed a more widely accessible format.
Feature extraction: Establishing gold-standard feature-extraction techniques could improve their scalability and ease of use.
Classification modeling: Many existing studies have been based on typical ML models, and some classification models have shown robust performance on PSG datasets. SVM appeared to be a common model that many have used effectively on PSG datasets. Many studies used deep-learning methods, whereas only a few used self-attention models. More advanced approaches should be considered, such as transfer learning, explainable machine learning, and robust learning.

Conclusions and Future Work
This study reviewed over 200 articles and their contributions towards the application of ML techniques for the diagnosis of mental health issues. According to the literature related to sleep studies and other mental health disorders, ML approaches have improved diagnostic accuracy, as compared to traditional manual processes, and this trend is likely to continue. More and more researchers are leveraging ML and AI tools to improve various aspects of the healthcare environment. More research is still needed to advance the reliability and efficiency of these techniques. In this study, we focused on providing the detailed steps involved when applying ML to PSG datasets.
Based on the literature reviewed in this work and on our own knowledge, more research should be conducted regarding the application of attention-based models. We plan on creating a benchmark for the implementation of ML techniques on PSG datasets. Various techniques have been proposed to date, with most reporting robust performance improvements. We also plan on researching methods to assist researchers in selecting the best classification models and feature extraction techniques.

Conflicts of Interest:
The authors declare no conflicts of interest.

Nomenclatures
The following terms were used in this manuscript: