A Comprehensive Review of Computer-Aided Diagnosis of Major Mental and Neurological Disorders and Suicide: A Biostatistical Perspective on Data Mining

The World Health Organization (WHO) suggests that mental disorders, neurological disorders, and suicide are growing causes of morbidity. Depressive disorders, schizophrenia, bipolar disorder, and Alzheimer's disease and other dementias account for 1.84%, 0.60%, 0.33%, and 1.00% of total Disability Adjusted Life Years (DALYs), respectively. Furthermore, suicide, the 15th leading cause of death worldwide, could be linked to mental disorders. More than 68 computer-aided diagnosis (CAD) methods published in peer-reviewed journals from 2016 to 2021 were analyzed, among which 75% were published in 2018 or later. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol was adopted to select the relevant studies. In addition to the gold standard, the sample size, neuroimaging techniques or biomarkers, validation frameworks, classifiers, and performance indices were analyzed. We further discussed why various performance indices are essential from the biostatistical and data mining perspectives. Moreover, critical information related to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines was analyzed. We discussed how balancing the dataset and omitting external validation could hinder the generalization of CAD methods, and we provided a list of critical issues to consider in such studies.


Introduction
Mental health is a state of successful cognitive function resulting in adapting to change and coping with everyday stresses of life [1,2]. Mental disorders refer to a wide range of conditions affecting mood, thinking, and behavior. They could be occasional or chronic [3]. Some major mental disorders include depression, bipolar disorder (BD), and schizophrenia (SZ) [4]. Mental illnesses are globally among the leading causes of disability in Disability Adjusted Life Years (DALYs) [5]. Figure 1 shows the composition of mental disorder DALYs by type of disorder for both sexes combined worldwide from 1990 to 2019 [6]. Depressive disorders (29.74%), followed by anxiety disorders (22.86%), and schizophrenia (11.66%) are the top three contributors to mental disorder DALYs [6].
Among mental disorders, depressive disorders account for 1.84%, anxiety disorders for 1.13%, schizophrenia for 0.60%, and BD for 0.33% of total DALYs [6]. As shown in Figure 2 (Source: Institute for Health Metrics Evaluation. Used with permission. All rights reserved.), the countries with the highest age-standardized mental disorder DALY rates in 2019 were Portugal (2603.92), Greece (2510.55), Greenland (2486.44), Iran (2436.44), and Spain (2396.768) DALYs per 100,000 [6]. The World Health Organization (WHO) reported that over 450 million people worldwide suffer from mental disorders [7].

Gold Standard
Due to the multiplicity of mental disorders and the importance of proper diagnosis and treatment, there has always been a need to classify these disorders, which led to the publication of the Diagnostic and Statistical Manual of Mental Disorders (DSM). Its latest version, DSM-5, was released in 2013. The Structured Clinical Interview for DSM-5 (SCID-5) is a structured diagnostic interview for diagnosing mental disorders according to the DSM-5 criteria, and it should be administered by a trained clinician. Its structure specifies the order of the questions, how the questions are worded, and how the subject's responses are classified. The primary diagnosis methods are summarized as follows [45].

Depression Disorder
SCID is considered the commonly used gold standard for diagnosing depression. Major depressive disorder (MDD) is a type of depression characterized by separate episodes of at least 14 days. Critical symptoms of MDD are depressed mood, loss of interest, weight loss or weight gain without any particular diet, insomnia or hypersomnia, frequent thoughts of death or suicide, decreased ability to concentrate and think, feelings of being worthless and guilty, psychomotor agitation or retardation, feelings of energy loss, and indecisiveness. A depression diagnosis requires five or more of the above symptoms, at least one of which must be one of the first two [46].

Bipolar Disorder
SCID is used as the gold standard among diagnostic interviews, but its validity will not be known until related biomarkers are discovered. At least one period of mania is necessary for a specific diagnosis of bipolar disorder I (BD-I), while one hypomanic episode and one major depressive episode without a manic episode are essential for a bipolar II (BD-II) diagnosis [47,48].

Schizophrenia
Patients' descriptions of symptoms, mental state tests, and behavioral observations help psychiatrists diagnose schizophrenia based on the DSM-5 criteria, which are the gold standard of diagnosis to date. The most important symptoms are delusions, hallucinations, disorganized speech, grossly disorganized or catatonic behavior, and negative symptoms such as decreased emotional expression. A schizophrenia diagnosis requires two or more of these symptoms, at least one of which must be one of the first three, and each of them should be present for a considerable period within a month [49,50].

Alzheimer's
AD is a specific type of dementia. The gold standard hallmarks for definitive diagnosis of AD are cortical atrophy, amyloid-predominant neuritic plaques, and tau-predominant neurofibrillary tangles validated by postmortem histopathological examination. Amyloid precursor protein (APP), presenilin 1 (PSEN1), and presenilin 2 (PSEN2) are known causative genes of AD, and genetic tests can reveal their mutations in early-onset cases. Furthermore, amyloid-based diagnostic tests such as positron emission tomography (PET) scans and cerebrospinal fluid (CSF) assays can be useful diagnostic tools [51].

Dementia
In DSM-5, major neurocognitive disorder (MCD) is considered an alternative term for dementia, which was used in previous versions. A significant decrease in the level of the subject's cognitive performance (for example, in learning and memory functions), followed by interference with independent daily activities, is a sign of dementia. The Clinical Dementia Rating (CDR) is a cognitive diagnostic assessment widely used as the gold standard for diagnosing dementia. The CDR test is a semi-structured interview with the patient and a reliable informant, consisting of 46 questions, that takes 30-90 min to complete and must be administered by a trained clinician [52-54].

Suicide
Validated questionnaires have been used in the literature to diagnose high-risk individuals for suicidal behaviors [55]. Suicide Behaviors Questionnaire-Revised (SBQ-R) is a globalized test for identifying individuals at increased risk of suicidal behaviors, including ideation and attempts [56]. The SBQ-R test was designed based on the SBQ test, a 34-item questionnaire measuring the suicide tendency. It is a self-report test distinguishing between suicidal and non-suicidal subjects. The SBQ-R test includes four Likert-type questions that measure the risk of suicide according to the subject's suicide ideation/attempt during lifetime, suicidal ideation rate in the last year, expressing thoughts of committing suicide with others, and suicidal behavior occurrence probability in the future. Each question has different points from 0 to 6 based on the subject's choice. Two scoring criteria have been proposed so far to classify suicidal and non-suicidal individuals based on SBQ-R results: SBQ-R Item 1 and SBQ-R total score varying between 3 and 18. Clinical and non-clinical samples have an identical cutoff score of 2 in the SBQ-R Item 1. The SBQ-R total score's cutoff scores were 7 and 8 for clinical and non-clinical samples, respectively [42].

The Literature Review
There are currently not enough biomarkers in psychiatry to distinguish the disease state from the normal state, so diagnosis mostly depends on patient-physician interactions and questionnaires. Clinical observations based on patient self-reports are subjective and can be inaccurate even when they are based on DSM-5 criteria, since they cannot identify false positives or distinguish disorders from risk states. This is where artificial intelligence (AI) comes in handy. In psychiatry, AI is a general term that denotes the use of advanced computerized techniques and algorithms to diagnose, prevent, and treat mental disorders, such as automatic speech processing and machine learning algorithms applied to electronic medical databases and health records to assess a patient's mental state. AI-based interventions can reduce false negative and false positive diagnoses and remove the stigma associated with reporting mental illness symptoms to a clinician. They are also affordable and have significant benefits for patients whose symptoms restrict their movement. AI-based methods do not replace clinicians; they can complement human clinical decisions by providing more comprehensive information to empower the health care system [57,58]. Here, we provide a literature review of the CAD systems for suicide, neurological disorders, and mental disorders, focusing on the sample size, input features, classifiers, type of validation, and performance indices.

PRISMA Guideline
We reviewed the CAD methods proposed in the literature for the diagnosis and prediction of suicide, neurological disorders, and mental disorders. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [59,60] was proposed in the literature to enrich and standardize medical review papers [61]. We adopted the PRISMA guideline to select the relevant studies.

Search Strategy
A literature search of the PubMed online database for the period between 2016 and 2021 was performed using the terms ("bipolar" OR "bipolar disorder" OR "schizophrenia" OR "suicide" OR "Alzheimer" OR "dementia" OR "major depressive disorder" OR "depression") AND ("machine learning" OR "deep learning") AND "accuracy". The reference lists of the identified publications were also reviewed. Peer-reviewed articles in English on humans were analyzed.

Eligibility Criteria
Published studies were included in the review if they met the following criteria: (1) at least a measure of the diagnostic accuracy was provided, (2) at least the classifier, the validation framework, or the validation type were provided. Figure 5 shows a flow diagram describing the study selection process. Among 563 records screened, 71 studies were excluded as irrelevant to the original research question. Among the remaining 492 studies, 424 studies did not meet the eligibility criteria. Thus, 68 studies were included in our analysis.

Data Abstraction
The following characteristics were recorded for each study included in our analysis: publication reference (the first author's surname and the year of publication), the sample size, the case and control groups, input features, classifiers, internal or external validation, type of validation (holdout or resampling), and the diagnostic accuracy.

Results
The CAD methods for mental and neurological disorders are listed in Tables 1-7, while the CAD methods for suicide prediction are provided in Tables 8-11.

Validation Frameworks
The validation framework is one of the critical issues in data mining approaches. In "holdout," the simplest validation scheme, the data set is randomly split into two sets: the training set and the test set. In addition to using the data inefficiently, this method yields pessimistically biased error estimates [127,128]. Moreover, it does not guard against testing hypotheses suggested by the data (type III errors [129]), since the data may be repeatedly permuted until an acceptable accuracy is obtained on both the training and test sets. Therefore, other validation frameworks such as repeated holdout, leave-one-out validation, 0.632+ bootstrap, and cross-validation [130] are preferred. These issues are also addressed in the TRIPOD guideline from a clinical perspective [131].
Choi et al. [104] proposed a framework for early detection of dementia using holdout validation. Moreira et al. [105] presented a hybrid data mining model for the diagnosis of dementia using a holdout setting. Lin et al. [97] designed a convolutional neural network (CNN)-based approach to predict mild cognitive impairment to Alzheimer's disease (MCI-to-AD) conversion using MRI data with leave-one-out cross-validation (CV). Ding et al. [98] proposed a hybrid computational approach to classify AD with holdout validation and resampling based on the synthetic minority oversampling technique (SMOTE). Aidos et al. [101] presented a new methodology to obtain an efficient CAD system for predicting AD using longitudinal information with holdout validation. Li et al. [132] developed a spectral CNN for reliable AD prediction with 10-fold CV. Sayed et al. [133] designed an automatic system for AD diagnosis with 7-fold CV.
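The difference between a single holdout split and k-fold cross-validation can be sketched in a few lines. The following is a minimal illustration only, assuming a toy one-feature threshold classifier and synthetic Gaussian data; the function names and the data are hypothetical and are not taken from any of the cited studies. The point it shows is that every sample is used for testing exactly once, and that the classifier is refit on each training fold.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Return k (train, test) index pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    splits = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits


def fit_threshold(x, y):
    """Toy classifier (assumption): midpoint between the two class means."""
    m0 = mean(v for v, c in zip(x, y) if c == 0)
    m1 = mean(v for v, c in zip(x, y) if c == 1)
    return (m0 + m1) / 2


def cross_validated_accuracy(x, y, k=10, seed=0):
    """Pool test-fold predictions; the threshold is refit on each training fold."""
    correct = 0
    for train, test in k_fold_indices(len(x), k, seed):
        thr = fit_threshold([x[i] for i in train], [y[i] for i in train])
        # Predict class 1 when the feature exceeds the threshold
        # (assumes cases have the larger mean, as in the synthetic data below).
        correct += sum((x[i] > thr) == (y[i] == 1) for i in test)
    return correct / len(x)


# Synthetic one-feature data: 50 controls (label 0) and 50 cases (label 1).
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
y = [0] * 50 + [1] * 50
print(f"10-fold CV accuracy: {cross_validated_accuracy(x, y):.2f}")
```

Setting `k` equal to the number of samples gives leave-one-out validation, and repeating the procedure with different seeds gives a repeated variant whose spread indicates the stability of the estimate.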

Subject-Wise Cross-Validation
The other critical issue is the use of leave-one-subject-out cross-validation when there are repeated measurements for each subject [134]. In this case, all measurements of a subject must be removed from the training set, and the trained system's performance must be reported for the held-out test subject. Otherwise, if other internal validation methods are used and the training and test sets are formed by randomly permuting individual measurements rather than subjects, there is a high probability that some measurements of a subject end up in the training set and others in the test set. If such repeated measurements are highly correlated, the accuracy of the diagnosis system is overestimated. To reduce the estimation variance, subject-wise cross-validation with a larger test sample size is preferred over leave-one-subject-out cross-validation [135].
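A subject-wise split can be sketched as follows. This is a minimal illustration under the assumption that each record carries a subject identifier; the function name `subject_wise_folds` and the example identifiers are hypothetical. Folds are formed over subjects rather than over individual measurements, so no subject contributes records to both the training and the test set.

```python
import random
from collections import defaultdict


def subject_wise_folds(subject_ids, k, seed=0):
    """Split record indices into k folds so that no subject spans two folds."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    folds = []
    for i in range(k):
        test_subjects = set(subjects[i::k])
        test = [j for s in subjects if s in test_subjects for j in by_subject[s]]
        train = [j for s in subjects if s not in test_subjects for j in by_subject[s]]
        folds.append((train, test))
    return folds


# Example: 6 hypothetical subjects with 3 repeated recordings each.
subject_ids = [s for s in ["S1", "S2", "S3", "S4", "S5", "S6"] for _ in range(3)]
folds = subject_wise_folds(subject_ids, k=3)
for train, test in folds:
    print("test subjects:", sorted({subject_ids[i] for i in test}))
```

With `k` equal to the number of subjects, this reduces to leave-one-subject-out cross-validation; a smaller `k` places several subjects in each test fold, which corresponds to the larger test sample size preferred above.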

Critical Performance Indices
It is also essential to report various performance indices since they convey information that is critical in clinical systems. One of the most important formulas related to the posterior probability is the following [136]:

$$\mathrm{PPV} = P(D \mid E) = \frac{Se \times Prev}{Se \times Prev + (1 - Sp) \times (1 - Prev)} \qquad (1)$$

where Se is the sensitivity, Sp is the specificity, Prev is the prevalence of the disease, D is the positive condition event determined by the gold standard, and E is the positive test outcome event determined by the diagnosis system. The parameter PPV (Positive Predictive Value) is the probability of disease given that the patient's test result is positive, which is essential when the system is used in practice. The PPV drops significantly in imbalanced datasets, in which the prevalence of the disease is low. For example, when a CAD system with an Se of 80% and an Sp of 95% is tested in practice where Prev is 10%, the expected PPV is 64%. A minimum sensitivity of 80% and specificity of 95% [137], a maximum False Discovery Rate (FDR = 1 - PPV) of 5% [138], and a minimum Diagnostic Odds Ratio (DOR) of 100 [139] could be considered reasonable requirements for a reliable clinical diagnosis system. As a complementary condition, a minimum Negative Predictive Value (NPV) of 95% could be listed [136].
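The worked example above (Se = 80%, Sp = 95%, Prev = 10%) can be checked numerically. The sketch below applies Bayes' theorem for the PPV and NPV and uses the standard definition of the diagnostic odds ratio, DOR = (Se x Sp) / ((1 - Se) x (1 - Sp)); the function names and the `check_requirements` helper are illustrative assumptions, not part of any cited method.

```python
def ppv(se, sp, prev):
    """Positive predictive value P(D | E) from Bayes' theorem."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))


def npv(se, sp, prev):
    """Negative predictive value P(not D | not E)."""
    return sp * (1 - prev) / (sp * (1 - prev) + (1 - se) * prev)


def dor(se, sp):
    """Diagnostic odds ratio: odds of a positive test in cases vs. controls."""
    return (se * sp) / ((1 - se) * (1 - sp))


def check_requirements(se, sp, prev):
    """Compare a system against the minimum requirements quoted in the text."""
    return {
        "Se >= 80%": se >= 0.80,
        "Sp >= 95%": sp >= 0.95,
        "FDR <= 5%": (1 - ppv(se, sp, prev)) <= 0.05,
        "DOR >= 100": dor(se, sp) >= 100,
        "NPV >= 95%": npv(se, sp, prev) >= 0.95,
    }


se, sp, prev = 0.80, 0.95, 0.10
print(f"PPV = {ppv(se, sp, prev):.2f}, NPV = {npv(se, sp, prev):.3f}, DOR = {dor(se, sp):.0f}")
for name, ok in check_requirements(se, sp, prev).items():
    print(f"{name}: {'met' if ok else 'not met'}")
```

For this example, the sensitivity, specificity, and NPV thresholds are met, whereas the FDR (36%) and the DOR (about 76) fall short, which illustrates why a single index is not sufficient to judge clinical reliability.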
Some of the published works on mental health provided a variety of performance indices. For example, Lee et al. [62] designed a diagnostic model using biomarkers in peripheral blood to diagnose BD-II with a specificity of 90% and a sensitivity of 85%. Ildiz et al. [73] obtained 94% sensitivity, specificity, and precision with their analytical model for diagnosing SZ and BD. Alici et al. [63] proposed the utility of optical coherence tomography (OCT) data to distinguish BD-I patients from controls with a sensitivity of 87.5%, a specificity of 47.5%, a positive predictive value (PPV) of 52.5%, and a negative predictive value (NPV) of 79.2%. Fernandes et al. [66] reached a sensitivity of 88.29% and specificity of 71.11% for BD vs. control, a sensitivity of 84% and specificity of 81% for SZ vs. control, and a sensitivity of 71% and specificity of 73% for BD vs. SZ. Achalia et al. [74] used multimodal neuroimaging and neurocognitive measures to differentiate BD patients from healthy controls and obtained a sensitivity of 82.3% and specificity of 92.7%. Li et al. [140] obtained a sensitivity of 80.6% and specificity of 86.3% in predicting AD with actigraphy data. Li et al. [132] showed that their spectral CNN could achieve a sensitivity of 88.24% and specificity of 95.45% in AD vs. normal control classification, a sensitivity of 92.86% and specificity of 77.78% in AD vs. MCI classification, and a sensitivity of 84.38% and specificity of 92% in MCI vs. normal control classification.
A machine learning approach was used by Bin-Hezam and Ward [102] to detect dementia and yielded a precision of 91.34%, a sensitivity of 91.53%, and F1 score of 91.41% for dementia vs. non-dementia, a precision of 76.76%, sensitivity of 77.00%, and F1 score of 76.35% for control normal (CN) vs. MCI vs. dementia. Choi et al. [104] proposed a novel framework for dementia identification with an F1 score of 78%, sensitivity of 93.43%, specificity of 89.66%, positive likelihood ratio of 9.0319, a negative likelihood ratio of 0.0732, PPV of 0.5064, and NPV of 0.9917. Chen et al. [117] used ensemble learning to predict suicide attempts/death following a visit to psychiatric specialty care. The sensitivity, specificity, PPV, and NPV of the 90-day prediction model were 47.2%, 96.6%, 34.9%, and 97.9%. Ensemble learning was also used by Naghavi et al. [42] for the prediction of suicide ideation/behavior. The proposed system had the sensitivity, specificity, PPV, and DOR of 81%, 98%, 94%, and 227, respectively. In such examples, various performance indices could provide valuable information about the designed systems' clinical reliability. Otherwise, it is not possible to judge the clinical applications of CAD systems.
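The likelihood ratios mentioned above follow directly from the sensitivity and specificity, using the standard definitions LR+ = Se / (1 - Sp) and LR- = (1 - Se) / Sp. The sketch below recomputes them from the rounded Se and Sp reported for Choi et al. [104]; small differences from the published values (9.0319 and 0.0732) are presumably due to rounding. The 10% prevalence used for the post-test probability is an assumption for illustration only, since the prevalence in that dataset is not stated here.

```python
def likelihood_ratios(se, sp):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    return se / (1 - sp), (1 - se) / sp


def post_test_probability(prev, lr):
    """Update a pre-test probability with a likelihood ratio via the odds form of Bayes' theorem."""
    pre_odds = prev / (1 - prev)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


se, sp = 0.9343, 0.8966  # rounded values reported for Choi et al. [104]
lr_pos, lr_neg = likelihood_ratios(se, sp)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.4f}")

assumed_prev = 0.10  # hypothetical pre-test probability, not taken from the study
print(f"P(disease | positive) = {post_test_probability(assumed_prev, lr_pos):.3f}")
print(f"P(disease | negative) = {post_test_probability(assumed_prev, lr_neg):.4f}")
```

The post-test probability after a positive result is the PPV at the assumed prevalence, and one minus the post-test probability after a negative result is the NPV, so the likelihood ratios carry the same information as Se and Sp in a form that is independent of prevalence.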
Munkholm et al. [70] demonstrated that a composite marker containing different molecular levels and tissue data is an operational biomarker for discriminating bipolar disorder patients from healthy subjects, with an Area Under the ROC Curve (AUC) of 0.826 (95% CI 0.749-0.904). Utilizing optical coherence tomography, Alici et al. [63] reported an AUC of 0.688 (95% CI 0.604-0.771) when comparing bipolar disorder patients and healthy individuals. In 2016, Zhao et al. [64] reported that plasma mBDNF and proBDNF levels were the best biomarkers for identifying bipolar disorder among patients in depressive episodes, with an AUC of 0.858 (95% CI 0.753-0.963). In the study by Tsujii et al. [67], a high AUC of 0.917 (95% CI 0.849-0.985) was obtained based on hemodynamic response and mitochondrial dysfunction for distinguishing bipolar disorder from major depressive disorder. Naghavi et al. [42] assessed the performance of suicide ideation/behavior prediction using different indices and their 95% CIs.
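A confidence interval for the AUC can be estimated in several ways; the sketch below uses a percentile bootstrap, which is one common option and not necessarily the method used in the studies cited above. It assumes continuous classifier scores for cases and controls (synthetic here), computes the AUC as the Mann-Whitney probability that a case outscores a control, and resamples cases and controls separately. All names and data are hypothetical.

```python
import random


def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney probability that a case outscores a control."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))


def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases and controls separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        bp = rng.choices(scores_pos, k=len(scores_pos))
        bn = rng.choices(scores_neg, k=len(scores_neg))
        stats.append(auc(bp, bn))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic scores for 40 cases and 40 controls (illustration only).
rng = random.Random(42)
cases = [rng.gauss(1.0, 1.0) for _ in range(40)]
controls = [rng.gauss(0.0, 1.0) for _ in range(40)]
estimate = auc(cases, controls)
lower, upper = bootstrap_auc_ci(cases, controls)
print(f"AUC = {estimate:.3f} (95% CI {lower:.3f}-{upper:.3f})")
print("interval excludes 0.5:", lower > 0.5)
```

When the lower bound of the interval does not exceed 0.5, the classifier cannot be distinguished from chance at the chosen confidence level, regardless of how high the point estimate appears.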
Functional neuroimaging techniques, such as PET and fMRI, enable mapping of the brain's physiology by measuring blood flow, receptor-ligand binding, and metabolism. Such techniques have recently been used in mental health, which has improved the understanding of the underlying mechanisms [146]. Functional imaging is divided into resting-state studies (e.g., rs-fMRI) and studies in active conditions. On the other hand, structural neuroimaging, such as NMR and MRI, has been widely used to exclude organic brain disease in mental disorders. It was shown in the literature that structural brain imaging is clinically useful to discriminate mental disorders, including SZ, BD, depression (MDD), and AD [147].
Both functional and structural neuroimaging techniques, except CT scans, were shown to be useful for suicide diagnosis [148]. Both techniques have advantages and disadvantages (e.g., spatial versus temporal resolution) [149], and their combination, known as multimodal neuroimaging, can yield important insights due to its complementary spatiotemporal resolution [150]. Lei et al. used the combination of MRI and rs-fMRI for diagnosing SZ patients. In this study, multimodal neuroimaging showed better performance than structural or functional neuroimaging alone [151].
Perhaps the most commonly used features for predicting suicide ideation/attempts are demographics, socioeconomic status (SES), and lifestyle variables. For example, Jung et al. [113] designed a suicide prediction model for middle and high school students based on multivariate logistic regression and reached a prediction accuracy of 77.9%. The selected significant features included gender, school grade, city type, academic achievement, living with parents, family SES, father's and mother's education, physical activity, and self-rated weight and health.
Decision trees and their ensemble extensions, such as random forests, were frequently used for mental health in the literature [42,105-108,112,118,120,122]. A decision tree is a rule-based system that, in its simplest form, provides a clinically interpretable structure used in clinical decision analysis [152]. Naghavi et al. [42] used the combination of stability feature selection and stacked ensemble decision trees (Figure 7) for suicide ideation/behavior diagnosis and reached an AUC of 0.9. In this study, a variety of questionnaires and demographic information were used.
The classifiers used for mental health could be categorized into two main categories: traditional machine learning (e.g., DA and its variants, SVM, decision tree) and deep learning (e.g., LSTM, CNN). A deep neural network (DNN) is an artificial neural network with more than one hidden layer. Unlike many traditional classifiers such as linear discriminant analysis (LDA), SVM, or Decision Tree (DT), where few parameters must be estimated or tuned, DNNs have many tunable variables. Thus, they require massive amounts of data to estimate their parameters accurately. When the available data are limited, various issues must be considered to avoid overfitting [153]. Strategies such as early stopping criteria, data augmentation, dropout, and regularization are used [154]. Moreover, when the dataset is imbalanced (e.g., in mental disorder classification), specific deep learning techniques must be taken into account [155]. Geometrical augmentation is usually used to increase the image sample size by random rotation, translation, and horizontal flipping. However, it was shown that such augmentations do not necessarily improve the predictive accuracy of deep learning methods [156]. DNNs were used in the literature for multimodal neuroimaging classification in mental health [157]. Although DNNs are promising, they usually appear to be black boxes: the input is the raw data, the output is the predicted class, and no internal interpretation is provided. This is problematic since clinicians require a proper interpretation of, for example, abnormal brain regions in neuroimaging data [158]. There have been some attempts in the literature to visualize the black box of DNNs [159].
Statistical models such as multivariate logistic regression (MLR) and Cox regression were used in the mental health literature [67,116]. MLR is an extension of linear regression for binary outcomes. It not only provides the probability that a sample belongs to an output class but also identifies the significant features in the model. Thus, it also serves as a feature selection method [160]. On the other hand, Cox regression models are time-to-event models in which both the event of interest (e.g., committing suicide) and the time to the event (e.g., the time from the previous hospitalization to the suicide attempt) are essential. Such models are usually used in survival analysis. When a proper threshold is estimated, the model's continuous output risk can be dichotomized to discriminate between output classes [161]. Unlike other classification methods, both MLR and Cox models support mixed-type input data, and no transformation of nominal or ordinal data is required.
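How a fitted logistic model yields a probability, and how a continuous risk can then be dichotomized, can be sketched as follows. The coefficients, predictors, and labels below are hypothetical and chosen only for illustration. The text does not specify how the threshold is estimated; the sketch uses the Youden index (J = Se + Sp - 1), which is one common choice and is named here as an assumption.

```python
import math


def logistic_probability(features, coefficients, intercept):
    """Predicted probability of the positive class from a fitted logistic model."""
    z = intercept + sum(b * v for b, v in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


def youden_threshold(risks, labels):
    """Cut-off on the predicted risk that maximizes J = Se + Sp - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_thr, best_j = None, -1.0
    for thr in sorted(set(risks)):
        tp = sum(1 for r, y in zip(risks, labels) if r >= thr and y == 1)
        tn = sum(1 for r, y in zip(risks, labels) if r < thr and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Hypothetical coefficients for two illustrative predictors (not from any cited study).
coefficients, intercept = [0.9, -0.4], -1.0
odds_ratios = [math.exp(b) for b in coefficients]  # change in odds per unit increase
subjects = [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5], [0.5, 3.0], [2.5, 1.0]]
labels = [0, 0, 1, 1, 0, 1]

risks = [logistic_probability(s, coefficients, intercept) for s in subjects]
threshold, j = youden_threshold(risks, labels)
predicted = [int(r >= threshold) for r in risks]
print("odds ratios:", [round(o, 2) for o in odds_ratios])
print(f"threshold = {threshold:.3f}, Youden J = {j:.2f}, predicted = {predicted}")
```

In practice the threshold should be estimated on training data only and then applied unchanged to the validation data; otherwise the reported sensitivity and specificity are optimistically biased.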

Balancing the Dataset and Generalization of the Results
Bayes' theorem (Equation (1)) has been used in the literature to explain how the low prevalence of a disorder confounds the performance of CAD systems [162], even when the AUC is very high [163]. Events such as suicide attempt/death have a low prevalence in the population (e.g., 10.7 per 100,000 individuals [164]). Other mental and neurological disorders also have a relatively low prevalence (e.g., a global prevalence of 1% for SZ [165]). Thus, they can only be reliably predicted by a system with an extraordinary capability to discriminate between higher- and lower-risk groups. Suppose that a CAD system has a sensitivity of 90% and a specificity of 95% based on the cross-validated confusion matrix, which is very good for an imbalanced dataset. The probability that a new subject has the disorder, given a positive CAD result, can be estimated using Equation (1) for different disease prevalences (Figure 8). For example, at a prevalence of 1%, the PPV is only 15%. If the dataset is balanced for the analysis (e.g., 3549 suicide-indicative posts versus 3652 nonsuicidal posts in [126]), the PPV is 95% on the analyzed dataset. However, when the system is used in practice (at a prevalence of 1%), the PPV drops to 15%. Thus, the analyzed dataset must resemble the population, which is ensured only when proper sampling and sample size calculation are performed.
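The effect of prevalence on the PPV described above can be reproduced with a short calculation. The sketch below evaluates Bayes' theorem for Se = 90% and Sp = 95% at the prevalences mentioned in the text, and additionally solves the same relation for the specificity that would be required to reach a target PPV. The helper `required_specificity` is a hypothetical name introduced for illustration; it is simply an algebraic rearrangement of the PPV formula.

```python
def ppv(se, sp, prev):
    """Positive predictive value from Bayes' theorem."""
    return se * prev / (se * prev + (1 - sp) * (1 - prev))


def required_specificity(se, prev, target_ppv):
    """Specificity needed to reach target_ppv, obtained by solving the PPV formula for Sp."""
    return 1 - se * prev * (1 - target_ppv) / (target_ppv * (1 - prev))


se, sp = 0.90, 0.95
balanced_prev = 3549 / (3549 + 3652)  # class proportion in the balanced dataset of [126]
scenarios = [
    ("balanced dataset", balanced_prev),
    ("prevalence 1%", 0.01),
    ("10.7 per 100,000", 10.7 / 100_000),
]
for label, prev in scenarios:
    print(f"{label:>18}: PPV = {ppv(se, sp, prev):.4f}")

needed = required_specificity(se, 0.01, 0.95)
print(f"Sp needed for a PPV of 95% at 1% prevalence: {needed:.4f}")
```

At a prevalence of 1%, a specificity above 99.9% would be needed to keep the PPV at 95% with the same sensitivity, which quantifies the "extraordinary discrimination capability" mentioned above.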

EEG-Based Diagnosis
Among the studies analyzed in Tables 1-11, some use the EEG signal for diagnosis. For such studies, the number of EEG channels is shown in the tables. It is also necessary to report the discriminative features based on the traditional frequency bands as important clinical biomarkers. It is not enough to show that the classification system has an acceptable accuracy, as these discriminative features are very important for clinicians. The spatial distribution of such features over the scalp must also be provided [166]. In EEG studies of mental disorders, either the resting state [166] or evoked or cognitive functions [167] were used.
An example comparing schizophrenia patients and healthy subjects during cognitive functions is provided in Figure 9. It showed significantly lower power in the gamma, beta, theta, and alpha bands in healthy subjects than in schizophrenia patients, and the differences covered more or less the entire brain. This agrees with the theory that schizophrenia is not a lesion of one part of the brain but a disconnection syndrome. This disconnection would be expressed as a failure to modulate synchronous activity caused by disturbances in the dopaminergic mechanism [168]. It is hypothesized that information flow across larger cortical networks is projected by low-frequency brain oscillations, while local cortical information processing is represented by high-frequency oscillations [169]. Thus, the interaction between different high- and low-frequency bands, also known as cross-frequency coupling (CFC) (Figure 10), could provide valuable insights into brain functions [170] and mental disorder diagnosis [171]. Such a representation is currently used instead of a simple energy representation of the different frequency bands. However, as the dimension increases, it is essential to select connected or disconnected regions of interest and representative interactions.
The EEG amplitude modulation analysis (Figure 11) has been used to diagnose AD [172]. First, the full-band EEG signal was decomposed into five sub-bands (delta, theta, alpha, beta, and gamma). The Hilbert transform was used to extract the envelope of each sub-band signal. A second frequency decomposition was then applied using modulation filters to represent the cross-frequency modulation interaction [173]. The modulation frequency bands were denoted, for example, as m-delta (0.5-4 Hz) or m-theta (4-8 Hz). The m-delta modulation frequency content in the theta frequency band could discriminate between healthy normal subjects and patients with mild and moderate AD (Figure 12).
A single row for each condition is generated by merging data from all channels (reproduced with permission from [171]).
Figure 11. Signal processing steps used to compute resting EEG spectro-temporal modulation energy (reproduced with permission from [172]).
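The processing chain described above (sub-band decomposition, Hilbert envelope, second decomposition into modulation bands) can be sketched on a synthetic signal. This is a simplified illustration and not the implementation of [172]: ideal frequency-domain masks stand in for the band-pass and modulation filters, a naive DFT is used to keep the example dependency-free, the theta band is assumed to span 4-8 Hz, and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)); adequate for short demos."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def band_bins(n, fs, lo, hi):
    """Positive-frequency DFT bins whose frequency lies in [lo, hi) Hz."""
    return [k for k in range(1, n // 2) if lo <= k * fs / n < hi]


def band_envelope(x, fs, lo, hi):
    """Band-limit x to [lo, hi) Hz and return its Hilbert (analytic-signal) envelope."""
    n = len(x)
    spectrum = dft(x)
    analytic = [0j] * n
    for k in band_bins(n, fs, lo, hi):
        analytic[k] = 2 * spectrum[k]  # keep positive frequencies only, doubled
    return [abs(v) for v in idft(analytic)]


def modulation_energy(envelope, fs, mlo, mhi):
    """Energy of the mean-removed envelope in the modulation band [mlo, mhi) Hz."""
    n = len(envelope)
    mean_env = sum(envelope) / n
    spectrum = dft([v - mean_env for v in envelope])
    return sum(abs(spectrum[k]) ** 2 for k in band_bins(n, fs, mlo, mhi)) / n ** 2


# Synthetic signal: a 6 Hz (theta) carrier whose amplitude varies at 1 Hz.
fs, n = 64, 256
signal = [(1 + 0.8 * math.cos(2 * math.pi * 1 * t / fs)) * math.cos(2 * math.pi * 6 * t / fs)
          for t in range(n)]
theta_env = band_envelope(signal, fs, 4.0, 8.0)      # assumed theta band: 4-8 Hz
m_delta = modulation_energy(theta_env, fs, 0.5, 4.0)  # m-delta: 0.5-4 Hz
m_theta = modulation_energy(theta_env, fs, 4.0, 8.0)  # m-theta: 4-8 Hz
print(f"theta band: m-delta energy = {m_delta:.3f}, m-theta energy = {m_theta:.3f}")
```

Because the synthetic theta carrier is modulated at 1 Hz, essentially all of its envelope energy falls in the m-delta band. For real EEG, the same quantities would be computed per channel and per sub-band, typically with properly designed filters and a fast transform.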

Discussion
This review focused on the data mining methods proposed in the literature to classify major mental and neurological disorders, namely SZ, BD, MDD, and AD, as well as suicide ideation, attempt, or death. More than 68 recently published peer-reviewed journal papers since 2016 were considered, among which 75% were published in 2018 or later. Alonso et al. [174] provided a systematic review of the major mental and neurological disorders. However, they analyzed papers published by 2017, and the data mining validation frameworks and methods that are the focus of our study were not covered in their study.
Moreover, other (systematic) reviews were published in the literature on this topic [175]. Jo et al. [153] analyzed deep learning papers on AD diagnosis and prognosis published between January 2013 and July 2018 in which neuroimaging data were used. Librenza-Garcia et al. [176] analyzed machine learning papers on BD diagnosis, personalized treatment, and prognosis published up to January 2017. de Filippis et al. [177] analyzed machine learning methods for SZ diagnosis based on structural and functional MRI published between 2012 and 2019. Castillo-Sánchez et al. [26] reviewed machine learning methods for suicide risk assessment on social networks from 2010 until December 2019. Although the classifiers, sample size, input features, and their performance were taken into account in such studies, the validation type and framework were not directly analyzed. Together with not following the related clinical standards such as STARD and TRIPOD, these omissions could hinder the widespread application of machine learning methods in practice.
Our study has some limitations. First, we only considered PubMed for the search strategy. Other online databases such as ISI, Embase, Google Scholar, and Cochrane Collaboration could improve our initial screening records. We only focused on SZ, BD, depression (MDD), AD, dementia, and suicide. Other significant disorders, including anxiety and headache were not considered. Moreover, we mainly focused on the validation type and framework with the biostatistical perspective. However, feature extraction, selection, and classifiers are essential issues in machine learning.
In our study, the epidemiological information from the GBD was provided to identify the importance of such disorders, and the gold standard methods for their diagnosis were briefly reviewed. The CAD systems were classified based on the classification goal, sample size, neuroimaging techniques, the number of channels (in EEG signals), type of validation in terms of internal and external (subject-based) methods, type of validation based on holdout, cross-validation, and resampling methods, the performance index, and its value. We also discussed the importance of reporting a variety of performance indices and their CI 95%. Some frequency-domain features used in the literature were reviewed for major mental and neurological disorders.
Some issues must be taken into account for better clinical applications of the CAD systems in this field [136]. A simple and intuitive method must present the classification features' discrimination over the recording electrodes and (or) their interactions. The system must be validated using proper performance indices and statistical tests. The proposed system's clinical reliability must also be identified based on Type I, II, and III errors. The clinical interpretation, using the activity maps (for example), must be provided. The rule-based systems or interaction networks are preferred over black box methods to facilitate clinical interpretation and validation [178]. Standardization (e.g., in terms of the brain frequency bands) and benchmark datasets could facilitate the comparison of the state-of-the-art and thus improve the CAD systems' effectiveness to diagnose major mental disorders, neurological disorders, and suicide.

Conclusions
The following issues must be taken into account to improve the clinical application of the CAD systems for mental health:

• The related standards, including STARD and TRIPOD, must be used. A TRIPOD extension for artificial intelligence (TRIPOD-AI) is under development owing to AI applications in CAD [179,180].
• Proper performance indices must be provided in addition to their interpretation. This issue is especially critical when the database is imbalanced, and some indices could be biased [136].
• The 95% CI of the performance indices must be provided. It is especially critical for the AUC. If its 95% CI includes 0.5, the diagnostic method's performance is not significantly better than that of a random generator.
• The prevalence of the disorder in the analyzed dataset must resemble its actual prevalence in the population. Otherwise, the performance of the method in practice, as measured by the PPV, deteriorates substantially.
• A proper validation framework must be used to avoid Type III error. External validation is the best method to improve the generalization of the designed CAD.
• The clinical interpretation of the input features, their ranking, and the classifier structure must be provided for clinicians.