Systematic Review

Potential Clinical Applicability of Deep Learning in the Diagnosis of Major Depressive Disorder Using rs-fMRI: A Systematic Literature Review

1 FutureNeuro Research Ireland Centre, UCD School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
2 UCD School of Electrical and Electronic Engineering, University College Dublin, D04 V1W8 Dublin, Ireland
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3444; https://doi.org/10.3390/app16073444
Submission received: 11 February 2026 / Revised: 23 March 2026 / Accepted: 25 March 2026 / Published: 1 April 2026

Abstract

Background: Major Depressive Disorder (MDD) is one of the leading causes of disability worldwide. Deep learning methods have been widely used for MDD detection, with research suggesting that deep models outperform traditional machine learning techniques. However, detecting MDD remains challenging due to data heterogeneity, model complexity, and the need for discriminative feature representations. Objective: This review outlines recent progress in deep learning methods for MDD detection from resting-state fMRI (rs-fMRI), with a focus on model generalisability and on the features that most effectively represent the function and anatomy of the brain, contributing to biomarker identification and interpretability. Further, the review assesses the applicability of current models to real-world challenges. Methods: This systematic review followed the PRISMA guidelines. Included studies involved clinically diagnosed MDD subjects, a control group, and deep learning methods for classification tasks. Results: The cerebellum, thalamus, amygdala, insula, and default mode network are the most frequently reported brain regions associated with depression. Although deep learning has shown impressive results, it is limited by its reliance on labelled data, the heterogeneity of data from different hospitals, and poor model interpretability. Most studies lacked external validation, used single-site or regionally homogeneous datasets, and did not consider the temporal and dynamic nature of rs-fMRI data. Conclusion: Deep learning offers considerable potential for advancing MDD diagnosis and understanding its mechanisms. Multi-regional data collection, harmonisation techniques, and rigorous testing in real-world workflows should be the primary focus of future research.

1. Introduction

Artificial intelligence (AI) has demonstrated promising results in Clinical Decision Support Systems (CDSS), which can be employed to promote earlier and more reliable diagnostic methods. Deep learning techniques provide automated systems for identifying a variety of neurological and mental health conditions [1]. Previous research indicated that these automated systems can complement clinical decision-making while developing more personalised treatment plans [2]. These systems, in some cases, have achieved performance comparable to a radiologist [1,3], which makes them a desirable tool for reducing the workload of medical professionals and improving understanding of various medical conditions.
Major Depressive Disorder (MDD) is a serious mental health condition that affects how individuals think, feel, and behave [4]. According to the World Health Organisation’s latest report, over 280 million people suffer from depression worldwide [5]. Among adults, about 5% are affected, with a higher prevalence in women (6%) than in men (4%). MDD is characterised by persistent feelings of sadness, hopelessness, and a loss of interest in activities once enjoyed, and is often isolating, making even simple daily tasks feel overwhelming.
Depression affects not only individuals but also their relationships and broader society. The exact cause of depression is still unknown. It can result from various factors: genetics, meaning that individuals with a relevant family history are more prone to developing it [6]; imbalances in neurotransmitters such as serotonin, norepinephrine, or dopamine in the central nervous system; and trauma or stressful life events [7].
Untreated depression can increase the risk of serious outcomes such as substance abuse and suicide, which is one of the leading causes of death among young adults aged between 15 and 29 [5]. Despite the availability of various treatment methods, more than 75% of individuals in low- and middle-income countries do not receive the necessary care due to a lack of resources, a shortage of trained professionals, and social stigma [8]. Following the COVID-19 pandemic, the prevalence of MDD increased by 27.6% [9].
Currently, the clinical interview is the most common approach to detecting MDD. The DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) provides the structured diagnostic criteria used by psychiatrists to detect depression, requiring at least five symptoms to have been consistently present over the preceding two weeks [10,11]. The Hamilton Depression Rating Scale (HDRS) is another standardised scale for assessing the severity of depressive symptoms and treatment response [12].
Early diagnosis of depression contributes to timely and more effective intervention and treatment [13]. Current diagnostic methods rely on self-reported symptoms and a psychologist’s judgment, which is subjective, can be influenced by patient bias, and is complicated by symptoms varying from person to person. To improve the accuracy of depression detection, an objective approach is required that does not depend solely on reported symptoms. Functional magnetic resonance imaging (fMRI) is a non-invasive neuroimaging method that reveals brain activity in real time by detecting changes in blood flow and oxygen levels (the blood-oxygen-level-dependent (BOLD) signal) [14]. This is particularly beneficial for studying depression, which affects brain function. Certain areas of the brain that regulate mood, emotion, and memory, such as the hippocampus and amygdala, have demonstrated lower activity in individuals with depression [15]. fMRI uses powerful magnets and radio waves to create a detailed image of the brain, capturing both its structure and function. As brain regions become more active, their metabolic demands rise, and so does Cerebral Blood Flow (CBF). Consequently, oxygen-rich blood is delivered to the active areas to support the increased functional needs [16]. fMRI can capture these shifts as they happen, allowing researchers to identify consistent spatiotemporal patterns across the brain [17]. A commonly used type of fMRI is resting-state fMRI (rs-fMRI), which involves acquiring data while individuals are at rest, without performing a specific task. It captures low-frequency fluctuations in the BOLD signal associated with spontaneous brain activity across regions, allowing the identification of resting-state networks (RSNs). This approach has shown promise in clinical applications, including presurgical planning, and may support future diagnostic and prognostic tools [18].
With the use of artificial intelligence and rs-fMRI, more reliable diagnostic methods can be implemented, enhancing our understanding of the underlying causes of depression. Even though current computer-aided systems cannot completely replace a medical expert, they can provide additional information to enhance clinical judgments [19].
Artificial-intelligence-based MDD detection is often challenging for researchers due to the following:
  • Small sample sizes, which may limit the generalisability of the models and lead to overfitting in the absence of validation on independent datasets.
  • The complexity of brain regions and the lack of knowledge to extract the most representative features of both the structure and function of the brain.
  • Data heterogeneity resulting from variations in sites, scanners, and acquisition methods.
  • The black box nature of AI algorithms and the limited availability of explainable methods.
Several studies have reviewed the detection of MDD via machine learning algorithms [20,21]. These studies investigated different preprocessing, feature extraction, and machine learning methods. The results demonstrate that machine learning algorithms are able to detect depression from rs-fMRI data. However, the accuracy of these models is highly dependent on sample size, which limits the applicability of traditional machine learning algorithms in real-world practice [22].
Deep learning has the potential to revolutionise image analysis in medicine. Although machine learning has been used for medical image analysis for many years, it is not widely used in clinical practice due to the limited performance of traditional machine learning approaches [23]. As shown in Figure 1, the distribution of publication years highlights the growing interest in this field over time.
This paper provides a comprehensive overview of MDD detection with deep learning. We aim to demonstrate how deep learning can be used in both supervised and unsupervised modes to improve the understanding of MDD. We analyse MDD detection using deep learning to identify potential biomarkers, the most discriminative regions, the connections of the brain and also the possibility of automated detection of MDD, the generalisability of the models, and the applicability for clinical use.

2. Materials and Methods

This systematic literature review follows Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines (Supplementary Materials) [24] to provide an overview of MDD detection via deep learning algorithms. The review process includes the following steps:
  • Defining research questions
  • Identifying related literature by performing searches on certain databases
  • Selecting studies based on inclusion and exclusion criteria
  • Extracting key data
  • Assessing the results

2.1. Research Questions

The goal of this review is to respond to these research questions:
RQ1: How do sample size and the use of multi-site datasets promote the generalisability of deep learning models?
RQ2: How can 4D rs-fMRI brain scans be used for feature extraction, and what biomarkers have been identified?
RQ3: What are the most common deep learning techniques applied, and to what extent do they accurately detect MDD?
RQ4: What are the typical explainability methods for MDD detection in deep models, and how do they contribute to the interpretation of the model?

2.2. Search Strategy

Figure 2 shows an overview of this literature review workflow. Three well-known databases were searched: Google Scholar, PubMed, and Scopus, using the search terms listed in Table 1.

2.3. Inclusion Criteria

The inclusion criteria for this review were as follows:
  • Publications in English within 2020–2024.
  • Studies involving subjects clinically diagnosed with MDD and the presence of a control group.
  • The use of resting-state fMRI (rs-fMRI) data.
  • A focus on classification problems.
  • The application of deep learning methods.
Exclusion criteria were established to ensure that only relevant and high-quality studies were selected. Specifically excluded were:
  • Papers focusing solely on disorders other than MDD.
  • Non-research articles such as reviews, meta-analyses, book chapters, posters, and theses.
  • Non-peer-reviewed studies, including preprints and conferences not published by IEEE or ACM.
  • Journals that are not indexed in JCR.
  • Studies employing only traditional machine learning or statistical methods.

2.4. Study Selection

The search resulted in 1167 papers from Google Scholar, PubMed, and Scopus. After removing duplicates and the initial screening of titles and abstracts, 62 papers remained for full-text review. Ultimately, 35 studies were included after voting by two independent researchers, with a third researcher resolving any conflicts.

2.5. Data Extraction

Relevant information was extracted from the full texts of the included studies. General details such as author, year of publication, sample size, and data source were collected. Additionally, information regarding feature extraction methods, deep learning methods, reported biomarkers, and evaluation metrics was retrieved (Table 2).

3. Results

3.1. RQ1: How Do Sample Size and the Use of Multi-Site Datasets Promote the Generalisability of Deep Learning Models?

Deep learning models need to handle inter-site and inter-subject variability so that they can generalise well to real-world clinical settings [56]. Inter-site variability is caused by variation in MRI technology, image acquisition parameters, and the instructions given to subjects. Additionally, inter-subject variation, such as differences in age, sex, medication, and clinical history, can cause significant variability in neuroimaging data. These differences may introduce unwanted biases, causing the model to learn site- or subject-specific features rather than generalisable patterns applicable to MDD detection.
A major challenge is to verify that models trained on one dataset or site can generalise well to data from other hospitals. As a result, studies have started using multi-site datasets for training, validation, and testing, as well as harmonisation techniques to reduce the impact of the variations.
Table 3 shows the various datasets that have been used for MDD detection based on rs-fMRI. REST-meta-MDD [60] is currently the largest publicly available multi-site dataset, collected from 17 hospitals across China. Other large datasets include SRPBS and PsyMRI. PsyMRI is a collection of resting-state MRI data from multiple European centres [32]. SRPBS is a multi-site dataset collected across eight sites in Japan, covering multiple psychiatric disorders, including MDD [61].
To control inter-site variability in MDD classification and reduce the impact of scanner and acquisition variation, several studies [26,32,41,58] used ComBat for harmonisation. ComBat harmonisation is a statistical strategy that uses empirical Bayes estimation to remove non-biological fluctuations from a signal. While a number of studies have shown an improvement in accuracy when including this strategy [58,62], others have found only a small influence, particularly when it is applied across multi-region datasets [32].
Some research has emphasised the usefulness of adopting multimodal datasets that combine functional MRI with structural or demographic data to better address data heterogeneity in MDD. Wang et al. [35] employed a multimodal fusion model to address inter-modality heterogeneity between rs-fMRI and sMRI. They incorporated a feature adaptation module to enable more effective integration of structural and functional data. Similar to this, Zheng et al. [30] showed that using sMRI in addition to rs-fMRI significantly increased accuracy to 75.2%, from 69.3% for sMRI alone and 61.9% with rs-fMRI.
Pan et al. [48] addressed inter-subject heterogeneity in MDD diagnosis through adaptive fusion of multimodal data, including rs-fMRI and non-imaging phenotypic data. Liu et al. [58] managed subject heterogeneity in diagnosing MDD by using a demographic graph in a multimodal GNN. Similarities based on age, sex, education, and site were represented by the graph, allowing population-level inference in which predictions utilise shared information across subjects.
A growing number of studies also apply transfer learning methods, particularly domain adaptation techniques, to harmonise data and enhance cross-site performance. Fang et al. [56] introduced a dual-expert fMRI harmonisation (DFH) methodology to address multi-site heterogeneity. They utilised a domain-generic student model and domain-specific expert models to reduce site-specific variations. Fang et al. [57] addressed site heterogeneity in MDD classification by applying an unsupervised adaptation framework. They used Maximum Mean Discrepancy to reduce distribution shifts between source and target domains. Wang et al. [55] proposed an unsupervised framework that combines bi-level augmentation and contrastive learning to enhance cross-site and cross-disorder generalisation.
According to Table 3, the majority of the studies used REST-meta-MDD or a sub-site of this dataset (N = 30). A general limitation observed across these studies is reliance on data collected from a single geographical location, which may significantly reduce a model’s ability to generalise to other populations or clinical settings. Several studies combined data from various REST-meta-MDD sites and further applied cross-validation to improve model training. This helps increase sample sizes, but it cannot avoid the problem of overfitting, because a model may still capture site-specific features rather than generalisable patterns. Leave-one-site-out (LOSO) cross-validation has been applied in a number of past studies and found to be an effective technique for reducing overfitting. It also assists in assessing a model’s generalisability on unseen data from different distributions. Most of these studies, however, applied this method only within the same region, with relatively small samples, which undermines generalisability to other regions with different medical protocols. Gallo et al. [32] combined datasets from Europe and China; the model attained an average accuracy of 60%, indicating that regional data heterogeneity makes modelling challenging. Liu et al. [58] trained their model on REST-meta-MDD (China) and SRPBS (Japan) and tested it on independent datasets from Anding (China) and OpenNeuro (Russia). They reported an accuracy of 78.75%, which dropped to 69.97% and 69.05% on the external sets. These results demonstrate promising performance but also highlight the critical importance of geographically and clinically diverse testing.
This substantiates the need for LOSO and cross-site validation when handling data from geographically diverse locations, addressing differences in data distribution to ensure more reliable and consistent model performance. This will make deep learning models more clinically applicable and ensure that the learned patterns are consistent across different populations [47].
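The LOSO scheme discussed above can be sketched with scikit-learn’s `LeaveOneGroupOut`, using site labels as groups; the features, labels, and classifier below are synthetic placeholders, not any reviewed model.

```python
# Leave-one-site-out (LOSO) cross-validation sketch using scikit-learn.
# Features, labels, and site IDs below are synthetic placeholders.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 50))          # e.g. vectorised FC features
y = rng.integers(0, 2, size=120)        # MDD vs. control labels
sites = np.repeat([0, 1, 2, 3], 30)     # acquisition site per subject

logo = LeaveOneGroupOut()
accs = []
for train_idx, test_idx in logo.split(X, y, groups=sites):
    # The scaler is fit on training sites only, avoiding test-site leakage.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
```

Each fold holds out one entire site, so the per-fold scores directly estimate cross-site generalisation rather than within-site performance.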

3.2. RQ2: How Can 4D rs-fMRI Brain Scans Be Used for Feature Extraction, and What Biomarkers Have Been Identified?

4D rs-fMRI is becoming increasingly popular for its ability to capture both spatial and temporal brain dynamics, and is now widely used in deep learning research for brain function mapping [63]. To prepare 4D rs-fMRI data (number of slices, height, width, time points) for deep learning models, several processing steps are required. These data include both structural and functional components of the brain. The structural component is represented by a 3D volume of the brain, composed of a number of 2D slices with width and height. The additional time dimension captures dynamic changes in functional blood-oxygenation signals across all spatial locations over the scanning period. While deep learning models minimise the need for heavy preprocessing and feature extraction [30], dynamic rs-fMRI data still require preprocessing, brain parcellation, and feature extraction to make them suitable for analysis. Figure 3 shows a typical pipeline used for MDD detection using rs-fMRI.
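The reduction from a 4D volume to region-level time series described above can be sketched in NumPy; the BOLD array and atlas label volume here are small synthetic stand-ins for real scan data.

```python
# Reducing a 4D rs-fMRI array (x, y, z, time) to ROI-averaged time series
# using an atlas label volume. Data here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
bold = rng.normal(size=(16, 16, 12, 100))        # 4D BOLD volume
atlas = rng.integers(0, 5, size=(16, 16, 12))    # 0 = background, 1-4 = ROIs

def roi_time_series(bold, atlas):
    """Average the BOLD signal over all voxels in each atlas region."""
    rois = np.unique(atlas)
    rois = rois[rois != 0]                       # drop the background label
    # Result shape: (time points, number of ROIs)
    return np.stack([bold[atlas == r].mean(axis=0) for r in rois], axis=1)

ts = roi_time_series(bold, atlas)
```

The resulting `(time, ROI)` matrix is the common starting point for the connectivity features discussed later in this section.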

3.2.1. Preprocessing

All of the included studies applied standard preprocessing pipelines before extracting features. The most common frameworks were the Configurable Pipeline for the Analysis of Connectomes (CPAC) [64] and the Data Processing Assistant for Resting-State fMRI (DPARSF) [65], the latter based on the Statistical Parametric Mapping (SPM) package [66]. In a few cases, standard tools such as FSL [67] or ANTs [68] were additionally used. Preprocessing usually began by discarding the first few volumes (typically 10) for signal stabilisation [35], followed by slice-timing correction and rigid-body realignment for head motion correction. Many studies applied stricter motion exclusion thresholds and further regressed out motion artefacts using the Friston 24-parameter model. Images were then spatially normalised to MNI space with tools such as DARTEL [69] and resampled to a standard voxel size (e.g., 3 × 3 × 3 mm3). Spatial smoothing with a Gaussian kernel (usually 4–6 mm FWHM) was applied to improve the signal-to-noise ratio. Temporal preprocessing steps included band-pass filtering (0.01–0.08 or 0.01–0.10 Hz), linear detrending, and nuisance covariate regression to remove white matter (WM) and cerebrospinal fluid (CSF) signals. For the REST-meta-MDD dataset, used in the majority of included studies, a standardised preprocessing pipeline was applied by the original consortium prior to data release. This improves consistency across sites; however, it also limits the ability to assess the impact of individual preprocessing steps, as these are applied uniformly before analysis. As a result, methodological differences may still arise from subsequent steps, particularly in the application of global signal regression (GSR) and harmonisation strategies.
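The temporal steps above (detrending plus 0.01–0.08 Hz band-pass filtering) can be sketched with SciPy; the repetition time and signal below are illustrative values, not taken from any reviewed dataset.

```python
# Detrending and band-pass filtering of an ROI time series to 0.01-0.08 Hz,
# mirroring the temporal preprocessing steps described above.
# The TR and the signal itself are illustrative.
import numpy as np
from scipy.signal import butter, filtfilt, detrend

tr = 2.0                                  # repetition time in seconds
fs = 1.0 / tr                             # sampling frequency (0.5 Hz)
rng = np.random.default_rng(0)
signal = rng.normal(size=200)             # synthetic BOLD series

# Linear detrending, then a zero-phase Butterworth band-pass filter.
signal = detrend(signal)
low, high = 0.01, 0.08
b, a = butter(N=2, Wn=[low / (fs / 2), high / (fs / 2)], btype="band")
filtered = filtfilt(b, a, signal)
```

`filtfilt` applies the filter forward and backward, so the filtered series has no phase shift relative to the original, which matters when aligning signals across regions.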
In particular, GSR remains a debated step: while it reduces physiological noise, it may also distort true group differences and introduce artificial group differences in functional connectivity, making it a potential source of inconsistency across studies [70]. In multi-site datasets, additional preprocessing is often applied to reduce the variability introduced by differences in scanners, acquisition protocols, and site-specific factors. Harmonisation methods such as ComBat are commonly used to remove these effects after feature extraction [26,32,41,58]. Their impact, however, has been mixed across studies. Dai et al. [41] reported that applying ComBat improved performance in recurrent MDD classification, increasing accuracy from 74.5% to 75.1%. Similarly, Liu et al. [58] incorporated ComBat within the LGMF-GNN framework and reported that site-effect suppression strategies improved accuracy from 69.02% to 78.75% in a six-site evaluation, corresponding to a 9.73% reduction in performance loss. In contrast, Gallo et al. [32], using large multi-centre datasets from Europe and China, found that ComBat had little influence on final classification results, suggesting that statistical harmonisation alone may be insufficient when clinical and technical heterogeneity remain substantial. These findings indicate that harmonisation can be beneficial, but its effect depends on dataset composition and study design. Harmonisation also introduces important methodological considerations. Best practice is to estimate harmonisation and nuisance regression parameters using only the training data within each cross-validation fold and then apply them to the test set. Estimating these parameters on the full dataset before splitting risks information leakage and inflated performance [26]. Clear reporting of preprocessing order, harmonisation strategy, and data partitioning is therefore essential for assessing model generalisability.
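The train-only estimation practice described above can be illustrated with a deliberately simplified harmonisation step that removes per-site location and scale; full ComBat additionally applies empirical Bayes shrinkage across features, which this sketch omits.

```python
# Simplified site harmonisation fitted on training data only, then applied
# to the test set -- the leakage-avoidance practice described above.
# This only standardises per-site location and scale; full ComBat also
# applies empirical Bayes shrinkage across features.
import numpy as np

def fit_site_params(X, sites):
    """Per-site mean and std of each feature, estimated on training data."""
    return {s: (X[sites == s].mean(axis=0), X[sites == s].std(axis=0) + 1e-8)
            for s in np.unique(sites)}

def apply_site_params(X, sites, params):
    """Remove site-specific location/scale using previously fitted params."""
    Xh = X.copy()
    for s, (mu, sd) in params.items():
        mask = sites == s
        Xh[mask] = (Xh[mask] - mu) / sd
    return Xh

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 10)) + 2.0    # synthetic features, site offset
X_test = rng.normal(size=(20, 10)) + 2.0
sites_train = np.repeat([0, 1], 40)
sites_test = np.repeat([0, 1], 10)

params = fit_site_params(X_train, sites_train)        # training data only
X_train_h = apply_site_params(X_train, sites_train, params)
X_test_h = apply_site_params(X_test, sites_test, params)
```

Because the parameters come only from the training split, nothing about the test subjects can leak into the fitted transform, which is the point of performing harmonisation inside each fold.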

3.2.2. Feature Extraction

Functional connectivity (FC) (N = 30) is the most frequent feature extraction method; it measures statistical dependencies among brain regions. FC is typically estimated using correlation methods, with the Pearson correlation coefficient used most frequently. Pearson correlation measures the linear relationship between the BOLD time series of pairs of predefined regions of interest and provides a symmetric connectivity matrix expressing the magnitude of these interactions [71]. Beyond Pearson, other measures have been used in research [46]: Spearman’s rank correlation, which identifies monotonic relationships and is less affected by outliers, and partial correlation, which estimates the direct relationship between two regions by eliminating the effect of other brain regions that may cause indirect correlations. More recent approaches, such as Ledoit–Wolf (LDW) shrinkage correlation, have been explored to improve estimation reliability, particularly in high-dimensional settings [29].
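The Pearson-based static FC described above reduces to a single correlation call once ROI time series are available; the series below are synthetic, and the Fisher z-transform shown is a common (not universal) follow-up step.

```python
# Static functional connectivity via Pearson correlation between ROI
# time series, as described above. Time series here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_rois = 200, 6
ts = rng.normal(size=(n_timepoints, n_rois))   # (time, ROI)

# np.corrcoef expects variables in rows, hence the transpose.
fc = np.corrcoef(ts.T)                         # (n_rois, n_rois), symmetric

# A Fisher z-transform is often applied before group statistics;
# clipping avoids infinities on the unit diagonal.
fc_z = np.arctanh(np.clip(fc, -0.999999, 0.999999))
```

The symmetric matrix (or its flattened upper triangle) is what most of the reviewed deep models consume as input features.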
In functional connectivity analysis, the brain must first be divided into Regions of Interest (ROIs), a process known as brain parcellation. This is usually done using predefined brain atlases, which are standard templates for dividing the brain into anatomical or functional areas. Anatomical brain atlases focus on brain structures and spatial organisation, describing where specific brain areas are positioned. In contrast, functional brain atlases describe connectivity networks and how different areas collaborate. These atlases play an important role in obtaining region-level time series from rs-fMRI data and constructing brain networks in which each ROI is a node. Well-known anatomical atlases include the Automated Anatomical Labelling (AAL) and Harvard–Oxford atlases, which are frequently employed in prior work [29] (Table 4). The choice of atlas significantly affects the extracted features and model performance, because each atlas defines different boundaries and numbers of regions. Other studies have used multi-atlas approaches or high-resolution functional atlases such as Dosenbach, as well as whole-brain parcellation schemes, to obtain more detailed or more global coverage, avoiding the loss of small or functionally relevant areas [25,26,48]. To further explore the influence of atlas selection on reported performance, we analysed the relationship between classification accuracy and the type of brain atlas used across the reviewed studies (Figure 4a,b). The AAL atlas was the most widely used, while other atlases, including Dosenbach, Harvard–Oxford, and CC200, were applied in fewer cases. Dosenbach and CC200 show relatively stable accuracy distributions, whereas studies using AAL exhibit greater variability, likely reflecting differences in model design and dataset characteristics.
Evidence from individual studies indicates that no single atlas consistently outperforms the others; CC200 has been associated with improved performance in some cases due to its higher-resolution functional representation [41], while AAL remains competitive due to its anatomical interpretability [58], and several studies report only marginal or no significant differences between atlas choices [26,56]. In addition, a comparison between studies using a single atlas and those combining multiple atlases indicates that multi-atlas approaches tend to report slightly higher median accuracy. Prior work suggests that combining atlases can enhance performance [43,48]. However, this also increases feature dimensionality and computational complexity, which may elevate the risk of overfitting, particularly in smaller datasets. Graph-theory techniques further extend FC [34]. Network construction entails selecting an atlas to define the nodes and weighting the edges by correlation values. After the computation of FC, the graph is constructed by representing brain regions as nodes and functional connections as edges. Methods such as thresholding or K-Nearest Neighbours (KNN) are applied to discard weak connections and obtain a sparser, more interpretable network. Once the graph is constructed, nodal metrics describe properties of individual regions, while global metrics reflect the overall organisation of the brain network [17,57].
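The KNN-style sparsification step described above can be sketched as follows; the FC matrix is synthetic, and keeping the k strongest absolute correlations per node is one common convention among several.

```python
# Building a sparse brain graph from an FC matrix by keeping only the
# k strongest connections per node (KNN sparsification), as described above.
import numpy as np

rng = np.random.default_rng(0)
n_rois, k = 8, 3
ts = rng.normal(size=(150, n_rois))            # synthetic ROI time series
fc = np.corrcoef(ts.T)
np.fill_diagonal(fc, 0.0)                      # no self-loops

adj = np.zeros_like(fc)
for i in range(n_rois):
    # indices of the k neighbours with the largest |correlation|
    nbrs = np.argsort(-np.abs(fc[i]))[:k]
    adj[i, nbrs] = fc[i, nbrs]

# Symmetrise: keep an edge if either endpoint selected it.
adj = np.where(np.abs(adj) >= np.abs(adj.T), adj, adj.T)

degree = (adj != 0).sum(axis=1)                # a simple nodal metric
```

From this adjacency matrix, nodal metrics such as degree and global metrics such as average path length can be computed, matching the graph-theory analyses mentioned above.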
While effective, FC is not without limitations. It treats connectivity as static over time and does not capture the direction of connections between brain regions. A hierarchical encoding model combining multi-view FC graphs (Pearson, Spearman, and partial correlation) was introduced in [46] to describe connectivity more comprehensively.
Besides identifying direct correlations between brain regions in low-order FC (LOFC), high-order functional connectivity (HOFC) considers the interaction of connectivity patterns and presents a higher-order description of brain network interactions [47]. A multi-layer FC network consisting of LOFC, topographical HOFC (tHOFC), and associated HOFC (aHOFC) was introduced in [47], showing that high-order interactions offer greater insight into connectivity. Similarly, classification accuracy was shown to improve through the integration of multiple FC levels in [49]. However, although these methods improve the representation of connectivity, they are based on static FC matrices that fail to capture time-dependent variations [43].
Dynamic functional connectivity (dFC) (N = 4) addresses this limitation by analysing time-dependent connectivity variations rather than static interactions. Compared to static FC, dFC computes connectivity patterns in shorter windows of time and thus can detect transient brain states. Sliding-window analysis is a common method, where connectivity matrices are computed over overlapping windows of time. Spatial and temporal encoding have been shown to improve the representation of the states of connectivity [31], and graph-based dFC extraction was examined in [33]. A time-varying FC approach for capturing fluctuations in recurrent MDD (rMDD) was proposed in [40]. Tian et al. [59] extended conventional dFC by introducing dynamic high-order functional connectivity (dHOFC), which models not only temporal fluctuations of connectivity but also second-order interactions between connectivity patterns.
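The sliding-window analysis described above yields a stack of windowed FC matrices; the window length, step, and time series below are illustrative choices, not parameters from any reviewed study.

```python
# Sliding-window dynamic functional connectivity: FC matrices computed
# over overlapping temporal windows, as described above.
import numpy as np

rng = np.random.default_rng(0)
ts = rng.normal(size=(200, 6))        # synthetic (time points, ROIs) series
window, step = 50, 10                 # illustrative window length and stride

dfc = []
for start in range(0, ts.shape[0] - window + 1, step):
    segment = ts[start:start + window]          # one temporal window
    dfc.append(np.corrcoef(segment.T))
dfc = np.stack(dfc)                             # (n_windows, ROIs, ROIs)
```

The resulting tensor of windowed matrices is what clustering or recurrent models operate on to identify transient brain states.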
The importance of frequency-domain features in rs-fMRI processing for MDD diagnosis was proposed in [37]. Low-frequency features were shown to characterise global structural patterns, while high-frequency features revealed local details. Frequency decomposition was used to support functional connectivity analysis, with temporal convolutional units enabling the identification of BOLD signal patterns.
Apart from connectivity-based features, cross-sample entropy (CSE) (N = 1) has also been investigated to estimate the complexity of rs-fMRI time-series data. Unlike FC, which relies on correlation metrics, CSE measures the irregularity and randomness of signal patterns and provides complementary information about brain dynamics. Entropy features were shown to complement connectivity features in classification tasks in [36].
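A minimal cross-sample entropy implementation, following the standard template-matching definition, is sketched below; the template length `m = 2` and tolerance `r = 0.2` are conventional illustrative choices, and the input signals are synthetic.

```python
# Minimal cross-sample entropy (CSE) comparing the shared regularity of
# two signals, following the standard template-matching definition.
import numpy as np

def cross_sample_entropy(x, y, m=2, r=0.2):
    """Cross-SampEn of two 1-D series, each normalised to unit variance."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()

    def match_count(m):
        # All length-m templates from each series.
        xt = np.array([x[i:i + m] for i in range(len(x) - m)])
        yt = np.array([y[j:j + m] for j in range(len(y) - m)])
        # Chebyshev distance between every template pair.
        d = np.abs(xt[:, None, :] - yt[None, :, :]).max(axis=2)
        return (d <= r).sum()

    b, a = match_count(m), match_count(m + 1)
    return -np.log(a / b)   # lower values indicate more shared regularity

rng = np.random.default_rng(0)
x = rng.normal(size=300)
cse = cross_sample_entropy(x, x + 0.01 * rng.normal(size=300))
```

Because every length-(m+1) match implies a length-m match of its prefix, the ratio is at most 1 and the entropy is non-negative; nearly identical signals, as in the example call, yield values close to zero.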
Feature extraction has also been investigated as a way to incorporate both functional and anatomical information. In addition to the functional insights from rs-fMRI, sMRI offers information on grey matter volume, cortical thickness, and anatomical integrity. Studies suggest that combining the two modalities can increase diagnostic accuracy, and multimodal approaches outperform single-modality models. Feature-level fusion, which combines deep features from both rs-fMRI and sMRI, has been particularly successful. In some frameworks, structural alterations were weighted using attention mechanisms, assigning different importance to structural versus functional features [35].
Some studies also incorporated demographic information alongside sMRI or rs-fMRI features [44,48,58], enabling more individualised feature extraction. This multimodal approach showed above-average performance, suggesting its potential for improving personalised MDD diagnosis.

3.2.3. Biomarkers

Figure 5 and Figure 6, together with Table 5 and Table 6, summarise the most frequently reported anatomical and functional biomarkers across the studied literature. The amygdala (N = 6), thalamus (N = 6), cerebellum (N = 7), and insula (N = 7) were the most frequently identified regions. All of these structures are consistently implicated in emotion regulation, memory, motor control, and cognitive control. Functionally, the DMN (N = 5) was most prominent, followed by the FPN and CON (each N = 2); these networks are associated with self-referential thinking, attention, and executive function.
Amygdala
One of the limbic system’s most vital regions for processing emotions is the amygdala. It contributes to the formation of emotional memories as well as the recognition of emotions [73]. Deep learning models can use the amygdala’s structural and functional characteristics to distinguish between MDD patients and healthy individuals. Studies have found that the amygdala often shows volume alterations in individuals with MDD [74]. Female patients with remitted depression have been found to have increased amygdala connectivity and functional activity [33]. The right amygdala has been associated with suicidal thoughts and behaviours [72]. The amygdala was also found to be one of the most discriminative regions for predicting response to medication [75].
Thalamus
The thalamus filters, amplifies, and transmits desired sensory information [76]. Studies have identified that thalamic hyperconnectivity could be the most noticeable neurophysiological feature of MDD. Removing the thalamus and its FC with the rest of the brain from the model input can result in a 5% drop in classification accuracy [32]. The thalamus has also been shown to be highly predictive of suicidality [72].
Cerebellum
Traditionally linked to motor control, the cerebellum is now recognised for its involvement in emotion regulation and higher-order cognition [31]. Studies have reported abnormal FC between the cerebellum and cerebral regions. Specifically, reduced connectivity with the prefrontal cortex and the default mode network (DMN), alongside increased connectivity with visual networks, has been interpreted as a possible compensatory mechanism [47].
Insula
The insula processes emotions, social interactions, and sensory data. It works closely with the amygdala, integrates body signals, and contributes to emotions like anxiety and fear [77]. It is also a core node in the salience network. Reduced connectivity between the salience network and the left executive control network (ECN) has been linked to thinking difficulties in late-life depression [36], while weakened connectivity between the insula and the hypothalamus has been observed in patients with rMDD [41]. Beyond diagnosis, the insula has also been implicated in treatment response prediction in young adults with MDD [75].
Default Mode Network
The default mode network (DMN) is most active when an individual is concentrating on internal activities like daydreaming, mind-wandering, and self-reflection rather than on outside stimuli. Abnormal connectivity patterns within the DMN have been associated with depression [72]. Remission after antidepressant treatment can be predicted by increased connectivity within the DMN [26].
Fronto-Parietal Network (FPN) and Cingulo-Opercular Network (CON)
The fronto-parietal network controls the in-the-moment choices made when preparing and carrying out goal-directed behaviour [78]. The cingulo-opercular network is the human brain’s executive network for regulating behaviour [79]. Studies have shown that most brain regions predictive of suicidal behaviour lie in the canonical networks, including the FPN, DMN, and CON [36,72]. Suicide risk was found to be higher when there was less functional connectivity between the DMN and the FPN, especially the inferior parietal lobule (IPL); this may result in impaired top-down emotional control [72]. The DMN, FPN, and CON were also the primary locations of distinct patterns of salient regions that contributed to the characterisation of MDD, the first-episode drug-naive (FEDN) subtype, and the recurrent subtype. Additionally, the topological profiles of the IPL and the dorsolateral prefrontal cortex (DLPFC) correlated significantly with illness duration and depressive symptom severity, respectively [26].

3.3. RQ3: What Are the Most Common Deep Learning Techniques Applied, and to What Extent Do They Accurately Detect MDD?

Deep learning has been widely used in neuroimaging because of its ability to automatically learn features from complex, high-dimensional data [80]. This section outlines the underlying principles and structures of the deep learning techniques used for MDD detection. The methods are classified as supervised or unsupervised, with subcategories including Auto-Encoders (AEs), Generative Adversarial Networks (GANs), domain adaptation techniques, Deep Neural Networks (DNNs), Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), and 2D/3D Convolutional Neural Networks (CNNs). Figure 7 shows the frequency of each deep model, with graph models being the most commonly used architecture, while Table 7 provides more information about the strengths and limitations of these methodologies.

3.3.1. Supervised Methods

Supervised learning is a form of machine learning in which algorithms learn from labelled training data and then predict the outcome [81]. Our literature review shows that supervised approaches, which combine feature extraction and classification, are more widely used than unsupervised approaches. This section discusses supervised methods for detecting MDD.
Deep Neural Networks (DNNs)
DNNs have been applied in MDD classification, leveraging automatic feature learning. DNNs are a type of artificial neural network that contains numerous hidden layers, making them more sophisticated and resource-intensive than traditional neural networks [82]. There are multiple variations of deep neural networks, including CNNs, GNNs, RNNs, and Transformers, which employ attention mechanisms to focus on the most relevant parts of the input. DNNs are computationally more efficient than graph models in some cases [53] but face constraints with rs-fMRI data because they handle brain features independently and are unable to preserve topology [83].
To address these limitations, Gupta et al. [51] introduced the Layerwise Elimination of Accessory Nodes (LEAN) and Correlation-based eLimination of InPuts (CLIP) optimisation techniques. LEAN removes redundant nodes and CLIP reconstructs useful correlated features, reducing overfitting and improving classification efficiency, resulting in a model with significantly fewer parameters (2.6% of the original). Traditional methods typically use low-order FC, which measures the direct correlation between two brain regions. Two studies [47,49] introduced multi-level FC DNNs that also include high-order FC, which captures the similarity between low-order FC patterns, particularly indirect brain interactions. DNN classifiers achieved classification accuracies ranging from 61.93% to 78.3%. Depending on the characteristics of the data and the computational requirements, optimised DNNs can be valid competitors to structural and hybrid neuroimaging models. However, their inability to model the connectivity structure of the brain has led to growing interest in specialised architectures that capture spatial and topological properties.
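The low-order versus high-order FC distinction can be sketched in a few lines: low-order FC correlates ROI time series, while high-order FC correlates the resulting connectivity profiles. The atlas size and time-series length below are arbitrary stand-ins, and real pipelines include preprocessing omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
ts = rng.standard_normal((90, 200))        # 90 ROIs x 200 time points (synthetic)

lofc = np.corrcoef(ts)                     # low-order FC: ROI-to-ROI correlation
hofc = np.corrcoef(lofc)                   # high-order FC: correlation of FC profiles

# Vectorise both matrices (upper triangles) into one feature vector per subject
features = np.concatenate([lofc[np.triu_indices(90, k=1)],
                           hofc[np.triu_indices(90, k=1)]])
```

Because each entry of the high-order matrix compares two regions' whole connectivity profiles, it can reflect indirect interactions that a single pairwise correlation misses.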
Graph Neural Networks (GNNs)
Graph Neural Networks (GNNs) are the most common deep learning models (N = 21) for MDD detection, because they effectively represent the brain as a graph [58]. In this method, ROIs are represented as nodes, and functional connectivity between them is represented as edges in an adjacency matrix [84]. These graphs are typically constructed by computing correlations between time series from different brain regions. GNNs operate on such structures by learning embeddings at the node, edge, and graph levels. Through a message-passing mechanism, each node updates its embedding by aggregating information from its neighbours. Variants like Graph Convolutional Networks (GCNs) apply convolutional operations, while Graph Attention Networks (GATs) assign adaptive weights to neighbouring nodes [34].
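As a minimal sketch of this pipeline, the following code builds an FC graph from ROI time series (random stand-ins here) and applies one symmetrically normalised, Kipf-style message-passing step. The correlation threshold and layer sizes are illustrative assumptions rather than settings from any reviewed study.

```python
import numpy as np

rng = np.random.default_rng(42)
ts = rng.standard_normal((16, 100))            # 16 ROIs x 100 time points (synthetic)
fc = np.corrcoef(ts)                           # functional connectivity matrix

adj = (np.abs(fc) > 0.2).astype(float)         # threshold correlations into edges
np.fill_diagonal(adj, 1.0)                     # add self-loops

# Symmetric normalisation D^(-1/2) A D^(-1/2), as in standard GCNs
d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
a_hat = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

x = fc                                         # node features: each ROI's FC profile
w = rng.standard_normal((16, 8)) * 0.1         # learnable weight matrix (untrained)
h = np.maximum(a_hat @ x @ w, 0)               # one message-passing layer + ReLU
```

Each row of `h` is an updated node embedding that aggregates information from the ROI's neighbours, which is the core mechanism the GNN studies above build on.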
In contrast to traditional deep learning models developed to handle Euclidean data, GNNs are particularly suitable for neuroimaging data since they are built to handle non-Euclidean spaces [25,85]. Moreover, GNNs learn semantically informative feature representations from single brain regions and their interactions and improve their performance in classifying MDD subtypes and predicting treatment responses [39].
GNNs possess several strengths, but they also have weaknesses. One of the primary issues is the definition of the graph topology, as it influences the ability of the GCN to identify useful MDD patterns. Another issue is that the majority of GCN methods were built on small datasets, which can restrict the reliability of the models. The use of a single brain atlas to construct connectivity matrices also restricts analysis to a single spatial scale [45].
Over-smoothing is another problem with GNNs: as network depth increases, node representations become increasingly similar regardless of their intrinsic differences, which reduces the diversity of learned features and ultimately degrades classification performance [35,45]. To address these challenges, a number of GNN architectures have been proposed.
Graph Design and Topology Optimisation
One of the challenges in using GNNs is constructing the brain graph, which entails defining ROIs as nodes and weighting the interactions between them (edges) [86]. Topology plays an important role in the performance of GNNs, as dense graphs can mix signal with noise, while sparse graphs can lose important functional interactions [87]. Several papers introduced novel GNN structures to enhance the diversity of information flow and learn optimal paths. Venkatapathy et al. [25] employed an ensemble of GNNs using GCN, GAT, and GraphSAGE to analyse whole-brain FC for MDD classification; this ensemble approach aimed to diversify the learned representations. Zheng et al. [38] introduced BPI-GNN, an interpretable GNN that uses prototype learning to identify subtype-specific connectivity subgraphs. It prioritises edges over nodes to capture network-level disruptions and applies total correlation regularisation to capture non-redundant patterns. Xia et al. [43] introduced DepressionGraph, a two-channel GNN that uses two types of features: one based on the connectivity between brain regions (node features) and another representing the number of connections between them (node numbers). Pitsik et al. [54] investigated the importance of sparsity in graph topology, demonstrating that keeping only the strongest 2.5% of the original edges increased accuracy and reduced overfitting.
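The strategy of retaining only the strongest fraction of edges can be sketched as follows. The 2.5% retention rate follows the figure reported by Pitsik et al. [54], but the thresholding details are a generic reconstruction on a synthetic matrix, not their implementation.

```python
import numpy as np

def sparsify(fc, keep_fraction=0.025):
    """Zero all but the strongest |r| edges; keep_fraction of the
    off-diagonal upper-triangle edges survive."""
    n = fc.shape[0]
    iu = np.triu_indices(n, k=1)
    weights = np.abs(fc[iu])
    k = max(1, int(round(keep_fraction * weights.size)))
    thresh = np.sort(weights)[-k]                 # k-th largest |r|
    sparse = np.where(np.abs(fc) >= thresh, fc, 0.0)
    np.fill_diagonal(sparse, 0.0)
    return sparse

rng = np.random.default_rng(3)
a = rng.standard_normal((40, 40))
fc = (a + a.T) / 2                                # symmetric toy FC matrix
sparse_fc = sparsify(fc)
```

Because weak correlations are often dominated by noise, discarding them can act as a form of regularisation, consistent with the reduced overfitting reported above.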
Temporal Dynamics Modeling
To characterise the rapidly changing patterns of brain activity that underlie mental function [88], several studies modelled brain network dynamics with temporal structures to capture the importance and interaction of time points.
Kong et al. [33] and Dai et al. [40] introduced the Spatio-Temporal Graph Convolutional Network (STGCN) and the Time Point Graph Convolutional Network (TGCN), respectively. STGCN combines graph and temporal convolutions to simultaneously learn spatial and temporal patterns, while TGCN constructs graphs over time points based on evolving BOLD signals. Both models showed promising results for MDD diagnosis, with STGCN also applied to treatment response prediction.
Multi-Site Generalisation and Harmonisation
Some studies focused on inter-site variability by implementing harmonisation strategies and aiming to achieve consistent performance across diverse datasets.
Qin et al. [26] and Gallo et al. [32] applied GCN-based classifiers using ComBat for data harmonisation. While the Qin et al. [26] model achieved strong performance on REST-meta-MDD, Gallo et al. [32] reported that their GCN achieved poor classification accuracy, failing to outperform an SVM, and that harmonisation had minimal influence on accuracy, highlighting the complex nature of building and optimising GCNs for multi-site MDD detection.
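Full ComBat models site effects with empirical Bayes shrinkage; as a rough intuition, a heavily simplified, ComBat-inspired location-scale adjustment (mapping each site's per-feature mean and variance onto the pooled values) can be sketched on synthetic data. This is a conceptual sketch only, not the ComBat algorithm used in the studies above.

```python
import numpy as np

def harmonise(features, sites):
    """Simplified location-scale harmonisation (no empirical Bayes):
    align each site's per-feature mean/std with the pooled mean/std."""
    out = features.astype(float)
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0)
    for s in np.unique(sites):
        idx = sites == s
        site_mean = features[idx].mean(axis=0)
        site_std = features[idx].std(axis=0)
        out[idx] = (features[idx] - site_mean) / (site_std + 1e-8) * grand_std + grand_mean
    return out

rng = np.random.default_rng(5)
feats = np.vstack([rng.standard_normal((50, 4)),          # site A
                   rng.standard_normal((60, 4)) + 2.0])   # site B, shifted scanner effect
sites = np.array(["A"] * 50 + ["B"] * 60)
adjusted = harmonise(feats, sites)
```

After adjustment, site-specific offsets are removed while within-site variation is preserved, which is the basic goal of harmonisation before pooling multi-site data.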
Multimodal Fusion
Multimodal methods have gained attention for combining various data sources such as rs-fMRI, sMRI, and EEG to improve MDD detection. Incorporating patient demographics enhances model explanations and enables more personalised predictions [89].
Liu et al. [44] introduced the Multi-Channel Fusion Graph Convolutional Network (MFGCN), which uses Cross-Level High-Order Interaction (CLHOI) to generate high-order functional connectivity (HOFC) matrices from low-order FC (LOFC). MFGCN improved accuracy on REST-meta-MDD by 3% over earlier models using single-order FC.
Pan et al. [48] proposed MAMF-GCN, an adaptive multi-channel fusion GCN for MDD and autism classification. The network merges rs-fMRI and non-imaging data via adaptive fusion, applies multi-scale convolution for brain network feature extraction, and uses an attention mechanism to highlight important connections. MAMF-GCN achieved accuracy improvements ranging from 3% to 39.83%, outperforming baselines such as SVM (59.4%), GAT (62.59%), and earlier GCN models (up to 95.9%).
Liu et al. [58] used attention-based graphs to integrate rs-fMRI, sMRI, and demographic data, enabling interpretation at both the ROI and modality levels. They reported that functional features contributed most to classification (attention score 0.3148), followed by demographic (0.1598) and structural (0.1367) features, with the combined embedding receiving the highest score (0.3887).
Kong et al. [50] and Wang et al. [35] explored structural–functional fusion: the Multi-Stage Graph Fusion Network (MSGFN) integrates white matter (WM) and grey matter (GM) functional connectivity, while the Adaptive Multimodal Neuroimage Integration (AMNI) framework takes a similar approach, combining rs-fMRI and sMRI using a GCN for functional and a CNN for structural feature extraction [35,50].
Despite the promising results, one of the biggest challenges in multimodal fusion for GCNs is that most current methods employ a rough fusion approach, simply combining information from different modalities without exploiting the unique characteristics of each [30,48].
Classification accuracy ranged from 56.77% to 99.2% for GNN models, with graph-design-focused models reporting 69.4–93%, temporal modelling 75.8–84.1%, multi-site generalisation and harmonisation approaches 61–81.5%, and multimodal fusion obtaining the highest range of 77.6–99.2%.
Although GNNs are the most popular deep learning solution for MDD detection, interpretability, data availability, and graph construction still pose challenges to their widespread adoption. Future methodological developments in multi-site dataset fusion, explainability, and optimised graph representations have the potential to increase their clinical utility in psychiatric research.
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are a type of deep neural network consisting of three components: convolutional, pooling, and fully connected layers. Convolutional and pooling layers extract features and reduce dimensionality, while fully connected layers map the extracted features to the final output [90]. 2D CNNs extract spatial information from their input, whereas 3D CNNs convolve across the full input volume, capturing spatial and spectral/temporal characteristics simultaneously. These properties are extremely beneficial for analysing volumetric data in medical imaging [91]. CNNs have also been used in MDD classification, either standalone or in ensembles with other models, for extracting spatial features from neuroimaging [30,31,35,52]. Tian et al. [59] used a convolutional neural network with self-attention, applying attentional feature fusion to combine dynamic low- and high-order functional connectivity (dtHOFC and daHOFC). Lin et al. [36] proposed a 3D CNN with Cross-Sample Entropy (CSE) for classifying late-life depression, and Wang et al. [35] integrated high- and low-frequency features of DTI and sMRI through CNNs for local feature learning and a transformer-based encoder for global relations. CNN-based models reported accuracies between 65% and 85%, with the higher values from 3D CNNs and moderate results from multimodal fusion approaches.
Recurrent Neural Networks (RNNs)/Transformers
Recurrent Neural Networks (RNNs) process sequential data and can discover temporal dependencies in brain activity [92]. They handle sequences by maintaining a hidden state that retains information about prior inputs; the basic structure consists of an input layer, a hidden layer, and an output layer. Vanishing and exploding gradients make RNNs difficult to train: during back-propagation, gradients can either shrink or grow exponentially, making it hard for the network to learn long-term relationships. To address these issues, RNN variants such as LSTMs and GRUs have been developed, which employ gating mechanisms to control the flow of data and gradients through the network. Transformers overcome the drawbacks of conventional RNNs by replacing recurrence with self-attention and processing sequences in parallel, enabling the network to concentrate on the important parts of the input sequence [93]. Unlike models that focus on static properties of functional connectivity, RNNs can model the dynamic changes of brain networks over time [57]. RNNs were typically applied not alone but as part of hybrid models, with an emphasis on their role in modelling temporal dynamics [35].
Zheng et al. [30] employed an attention-based multimodal fusion model with CNN and Transformers to extract temporal and spatial features from rs-fMRI. Their co-attention mechanism gave more weight to sMRI features. Hu et al. [31] introduced a spatio-temporal model consisting of an MLP spatial encoder and a Transformer temporal encoder for capturing dynamic FC patterns. Liu et al. [36] suggested a hierarchical encoding and fusion framework for rs-fMRI depression diagnosis and subtype classification. The framework uses pre-trained LSTMs for extracting regional brain features and builds multi-view FC graphs using Pearson, Spearman, and partial correlation. When RNNs were incorporated to model temporal dynamics, reported accuracies ranged from 65.8% to 82.38%.

3.3.2. Unsupervised Methods

Unsupervised learning (UL) works on an unlabelled training set, focusing on identifying patterns and relationships in the data without prior knowledge of its structure or classes. UL has the power to find previously unknown disease mechanisms or patient characteristics, allowing researchers to make discoveries that conventional analysis cannot. This sort of learning is especially well-suited to complicated and high-dimensional data, where manual labelling is costly [94].
Autoencoders
Autoencoders are valuable tools in MDD detection, used to extract features, reduce dimensionality, and eliminate noise. They provide a data-driven approach to investigating alterations in brain networks by projecting high-dimensional FC data onto low-dimensional latent representations. An autoencoder consists of two main parts: an encoder, which maps the input to a latent vector, and a decoder, which attempts to reconstruct it [41,95]. Denoising autoencoders encourage robust feature extraction, and graph autoencoders (GAEs) preserve the brain connectome topology. However, flattening FC maps in fully connected architectures comes at the expense of spatial relationships and functional pattern preservation, and some autoencoder decoders reconstruct only the network structure [29].
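The encoder/decoder idea can be sketched with a linear autoencoder trained by gradient descent on synthetic "FC feature" vectors. The dimensions, learning rate, and data are illustrative assumptions; the reviewed models use nonlinear, regularised architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature vectors: 200 samples lying near a 5-D subspace of a 50-D space
latent = rng.standard_normal((200, 5))
basis = rng.standard_normal((5, 50))
x = latent @ basis + 0.01 * rng.standard_normal((200, 50))

w_e = rng.standard_normal((50, 5)) * 0.1    # encoder weights: 50 -> 5
w_d = rng.standard_normal((5, 50)) * 0.1    # decoder weights: 5 -> 50
lr = 1e-3
for _ in range(500):
    z = x @ w_e                              # encode to the latent vector
    x_hat = z @ w_d                          # decode / reconstruct the input
    err = x_hat - x
    w_d -= lr * (z.T @ err) / len(x)         # gradient steps on the MSE loss
    w_e -= lr * (x.T @ (err @ w_d.T)) / len(x)

recon_error = np.mean((x - (x @ w_e) @ w_d) ** 2)
```

Once trained, the 5-D latent vectors `z` serve as compact representations of the 50-D inputs, which is how autoencoder features feed downstream MDD classifiers.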
Several studies have used autoencoders for MDD detection, each with novel contributions. Noman et al. [29] proposed a GAE that preserves brain connectivity topology by reconstructing the original graph from learned embeddings. These embeddings were used both unsupervised for representation learning and supervised via an FCNN for classification. Notably, the unsupervised GAE-FCNN outperformed the supervised GCN-FCNN, highlighting the value of topology-aware embeddings learned without label supervision. Zheng et al. [39] proposed CI-GNN, a causal-reasoning variational autoencoder that uses conditional mutual information to improve interpretability. Dai et al. [41] proposed Res-DAE, a residual denoising autoencoder for more accurate classification of recurrent MDD through noise reduction in time-series features. Autoencoder models reported accuracies ranging from 59.5% to 75.1%.
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) consist of two neural networks: the generator and discriminator. The generator creates medical images that mimic real data, while the discriminator distinguishes between real and artificially generated images. The generator iteratively adjusts its output, resulting in more realistic data for applications such as augmentation, synthesis, and classification [96].
Several studies have explored GAN methods to augment functional connectivity data for MDD classification. Zhao et al. [27] used adversarial training with feature matching to improve performance over traditional classifiers. Tan et al. [28] introduced Deep Convolutional GANs (DCGANs) to better capture spatial connectivity patterns using convolutional architectures. More recently, Oh et al. [45] proposed a Graph Convolutional Conditional GAN (GC-GAN) that preserves brain network topology by incorporating GCNs into both the generator and discriminator, offering topology-aware data augmentation. GAN models reported accuracies between 66.84% and 70.1%. Despite these benefits, GANs have drawbacks: conventional GANs applied to FC data do not necessarily retain network structure, GCN-based GANs rely on pre-specified connectivity patterns that may restrict their adaptability, and GANs are difficult to apply to small neuroimaging datasets because they need large amounts of data to produce high-quality synthetic samples [28].
Domain Adaptation Methods
Domain adaptation and knowledge transfer apply pretrained models to carry knowledge from a source domain to a target domain, reducing the need for large amounts of labelled data. The approach helps overcome data scarcity and increases model generalisation [97].
Fang et al. [56] introduced a dual-expert fMRI harmonisation (DFH) framework to improve multi-site rs-fMRI generalisation for MDD diagnosis. The model uses a student–expert architecture with knowledge distillation and GCNs, achieving improved cross-site generalisation. UFA-Net, a cross-domain unsupervised fMRI adaptation model, was proposed by Fang et al. [57] to minimise site heterogeneity in MDD classification. The model employs an attention-guided spatio-temporal GCN (AST-GCN) for capturing spatial and temporal rs-fMRI patterns, and Maximum Mean Discrepancy (MMD) for feature alignment. Wang et al. [55] proposed an unsupervised contrastive graph learning (UCGL) method for rs-fMRI-based brain disorder detection (MDD, autism, and Alzheimer’s disease). The model applies contrastive learning and bi-level rs-fMRI augmentation to enhance the learning of functional connectivity representations, and employs transfer learning, pre-training on large datasets before fine-tuning for the target disorders. Domain adaptation models reported accuracies between 56.77% and 63.0%, comparable to fully supervised methods under cross-site settings.
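The MMD criterion used for feature alignment can be illustrated in its simplest (linear-kernel) form, where it reduces to the distance between the two domains' mean feature vectors. Published models typically use multi-kernel variants on learned features, so this is only a conceptual sketch on synthetic data.

```python
import numpy as np

def linear_mmd(x_src, x_tgt):
    """Squared MMD with a linear kernel: the squared distance between
    the mean feature vectors of the source and target domains."""
    diff = x_src.mean(axis=0) - x_tgt.mean(axis=0)
    return float(diff @ diff)

rng = np.random.default_rng(1)
site_a = rng.standard_normal((100, 30))                 # source-site features
site_b = rng.standard_normal((100, 30)) + 1.0           # target site with a shift
# Naive alignment: recentre the target domain on the source mean
site_b_aligned = site_b - site_b.mean(axis=0) + site_a.mean(axis=0)
```

In adaptation frameworks, a term like `linear_mmd` is minimised during training so that the feature extractor produces site-invariant representations rather than recentring features after the fact.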
General Comparison of Model Accuracies
According to Table 2, the highest accuracy, 99.2%, was obtained by MAMF-GCN when applied to a single site of the REST-meta-MDD dataset using 10-fold cross-validation. The dual-expert fMRI harmonisation framework (DFH) achieved the lowest performance, with 56.77% accuracy, when implemented on multiple sites of the same dataset with cross-site validation.
The median accuracy is 72% with a standard deviation of 9.15, indicating the variability of model performance. The difference between the top and bottom models is 42.43%. The majority of the models fall between 65.95% and 77.95%, corresponding to moderate performance, with a few models reaching high accuracy. As shown in Figure 8, this distribution highlights the variability of model performance across studies.
These results demonstrate that high accuracy is often achieved with single-site and small datasets [42,44,46,48,54], while the GCN proposed by Qin et al. [26] is a notable exception, achieving 81.5% accuracy on the multi-site REST-meta-MDD dataset. Similarly, Liu et al. [58] achieved strong performance (accuracy = 78.75%) using LGMF-GNN across a multi-region dataset, showing notable generalisability beyond single-site and single-region sources.
Influence of Architecture on Model Accuracies
The boxplot analysis shown in Figure 9c illustrates that the GNN models demonstrated the highest range of performance and usage, with top-performing structures such as MAMF-GCN (99.2%), GCN (81.5%) [26], LGMF-GNN (78.75%), and MFGCN (77.6%). GNNs also showed the largest variance and the most outliers, indicating that performance depends heavily on the implementation of the graph structures and fusion mechanisms. In contrast, CNN models performed consistently but less accurately, reflecting their stable performance but limited ability to capture brain network topology. Some CNN models, including a 3D CNN with CSE (85%), FSFC (75.4%), and CNN+AFF (78.3%), which incorporated structural and dynamic functional features, respectively, achieved notably high performance. Autoencoder models such as GAE (65.07%), CI-GNN (72%), and Res-DAE (75.10%) also had relatively compact distributions, but with slightly lower overall performance. RNN-based models like TGCN (75.8%), MT-STN (68.56%), STANet (82.38%), and STGCN (84.14%) also showed competitive performance, highlighting the importance of temporal modelling and transformer integration in rs-fMRI analysis. GAN models like GC-GAN (66.84%), 1D-DCGAN (68.32%), and GAN (70.1%) [27] typically showed lower median accuracies with smaller variances. DNN models showed a lower median accuracy but achieved competitive results in custom setups, including LEAN+CLIP (78.3%) and DDN-Net (72.43%).
Impact of Validation Methods on Model Performance and Generalisability
Cross-validation, used by most studies (N = 20) (Table 8), is associated with higher accuracies, ranging from 64.1% to 99.2% with a mean of 74.56% (Figure 9a) and a standard deviation of 9.24. The models that achieved the highest accuracy, including MAMF-GCN (99.2%) and the GCN of [54] (93%), used cross-validation, suggesting that internal cross-validation tends to yield higher reported accuracies.
The Leave-One-Site-Out (LOSO) validation method (N = 4), though used by fewer models, performed fairly well, with a mean accuracy of 69.41% and a range of 61.93% to 81.5%. The standard deviation of LOSO is 8.77, implying that this method gives more stable and consistent outputs. This is to be expected, as LOSO strictly evaluates generalisability by training models on data from every site but one and evaluating on the left-out site [49,53]. The relatively stable accuracy demonstrates that LOSO provides a more reliable estimate of performance on real-world datasets than traditional validation methods.
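The LOSO protocol itself is straightforward to express in code; the site labels below are made up for illustration.

```python
def leave_one_site_out(site_labels):
    """Yield (held_out_site, train_indices, test_indices) for each fold:
    train on every site except one, test on the held-out site."""
    for held_out in sorted(set(site_labels)):
        train = [i for i, s in enumerate(site_labels) if s != held_out]
        test = [i for i, s in enumerate(site_labels) if s == held_out]
        yield held_out, train, test

sites = ["A", "A", "B", "B", "B", "C"]    # one site label per scan/subject
folds = list(leave_one_site_out(sites))
```

Because every test fold comes from a site the model never saw during training, LOSO accuracy approximates deployment on a new hospital far better than subject-level cross-validation.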
Cross-site validation and independent test sets (N = 6) [30,32,55,56,57,58] performed the least well, with a mean of 64.28% and a standard deviation of 6.94. The underperformance under this type of validation suggests that models do not perform well on totally unseen sites. This reflects the challenge of generalising between datasets from diverse sources, one of the main factors in real-world deployment: models trained on some data sources will not necessarily generalise to new ones.
Models using the data-split (hold-out) technique (N = 5) had an average accuracy of 75.84%, ranging from 70.91% to 85%. However, while hold-out yields high performance, it may not always be consistent across studies due to the increased clinical heterogeneity associated with larger samples [26].
We further examined the relationship between reported performance, dataset size, and validation method (Figure 9b). A negative trend was observed, with higher accuracies (above 90–95%) primarily reported in smaller, single-site datasets using internal cross-validation. In contrast, studies using larger multi-site datasets and more rigorous validation strategies, including cross-site validation or independent test sets, reported more moderate performance. This pattern suggests that the performance reported in smaller studies may be inflated by overfitting or limited data heterogeneity. It also highlights the critical role of validation strategy in assessing model generalisability, as performance tends to decrease when models are evaluated on more diverse and unseen data. Focusing specifically on cross-site validation studies, the FSFC (functional–structural co-attention) model, trained on REST-meta-MDD and evaluated on an independent hospital in Gansu, China (52 MDD, 80 HC), achieved an accuracy of 75.2% [30], indicating moderate generalisation within the same region. However, when models were evaluated across large multi-centre datasets from different regions, such as training on REST-meta-MDD (China) and testing on PsyMRI (Europe; 531 MDD, 508 HC), performance dropped to around 61% accuracy [32]. Similarly, cross-site experiments using REST-meta-MDD showed limited generalisation when models trained on the largest site (Site 20; 282 MDD, 251 HC) were evaluated on other independent sites (Site 21; 86 MDD, 70 HC and Site 1; 74 MDD, 74 HC), with accuracies ranging from approximately 56% to 63% using domain adaptation approaches [55,56,57]. This highlights the difficulty of transferring models across sites, even within the same region. Multimodal approaches show promising improvements but do not fully resolve this challenge.
The LGMF-GNN [58] achieved accuracies of around 69–70% on external datasets from different regions, including the Anding dataset (China; 196 MDD, 177 HC) and OpenNeuro (Russia; 21 MDD, 21 HC), compared to 78.75% on internal data. While the performance on Anding suggests moderate generalisability on a reasonably sized external site, results on smaller datasets such as OpenNeuro (21 MDD, 21 HC) should be interpreted with caution due to the limited sample size, highlighting the need for validation on larger cohorts to assess cross-regional generalisability.
Model Accuracy Comparison on REST-Meta-MDD
The REST-meta-MDD dataset is the largest publicly available multi-site rs-fMRI dataset for MDD, revealing further details about model performance (Figure 10). Notably, various subsets of this dataset were used in experiments, ranging from a single site, to a few distinct sites, to the entire multi-site cohort, which affects both reported accuracy and generalisability. Models trained and tested on REST-meta-MDD achieved accuracies between 56.77% and 99.2%, reflecting both the diversity of the dataset and the challenge of generalising over multiple clinical sites. The best-performing models were MAMF-GCN (99.2%), LGMF-GNN (78.75%), and the GCN model by Qin et al. [26] (81.5%), reflecting the strength of multimodal fusion (particularly the inclusion of individual characteristics) and graph learning. CNN+AFF (77.5%), FSFC (75.4%), and a DNN with LEAN+CLIP (78.3%) also achieved strong performance despite lacking graph topology. CNN+AFF and MFGCN (77.6%) both modelled high- and low-order functional connectivity, but CNN+AFF further incorporated dynamic patterns via attention, resulting in better accuracy. FSFC fused structural and functional inputs using a multimodal design. Unsupervised models, including UCGL (63%) and GAE (65.07%), performed comparably to some supervised approaches, reflecting their promise for mitigating reliance on annotated datasets through contrastive pretraining and denoising. On the other hand, models such as DFH (56.77%) and UFA-Net (59.73%) reported lower accuracy due to the effect of cross-site validation. These results highlight the difficulty of dealing with site-specific variation even within the same region: while high accuracy can be achieved on single-site data, performance drops significantly under cross-site validation across multiple sites, underscoring the challenge of generalisability.

3.4. RQ4: What Explainability Methods Have Been Applied for MDD Detection in Deep Models, and How Do They Contribute to the Interpretation of the Model?

When applied to the medical field, Explainable Artificial Intelligence (XAI) can help to overcome clinicians’ scepticism of AI outcomes [98]. Because MDD is a heterogeneous psychiatric disorder, understanding how deep models make predictions is essential for their use in the clinic. Explainability not only provides confidence in AI-based diagnosis but also helps validate whether model choices align with current neurological and clinical understanding [99].
Current XAI methods in deep models can generally be classified as ante hoc and post hoc methods. Ante hoc methods embed interpretability into model design, such that explanations emerge naturally from the model’s decision-making process. Post hoc methods analyse trained models to infer insights [100] (Table 9). The majority of studies (N = 24) utilised one of the two methods, and the others lacked explainability (Figure 11).

3.4.1. Post Hoc Methods

A trained machine learning model’s decision-making process is interpreted using post hoc explainability techniques, which offer insights into how the model arrived at its decisions [100]. Post hoc techniques can either be model-specific or model-agnostic [101].
Class activation mapping is a technique for creating heatmaps that highlight which areas of an image were most important to a neural network's classification decision. There are various versions of the approach, including Grad-CAM (Gradient-Weighted Class Activation Mapping) [62]. In rs-fMRI studies, these methods highlight the brain regions (nodes) or connections (edges) that contribute most to classification by generating class-specific activation heatmaps from functional connectivity matrices or node embeddings [26,40].
Feature attribution methods assign importance scores to input features based on internal model metrics such as gradients and learned weights. These model-specific techniques are used to rank brain regions or functional connections by their contribution to classification. Salience-based scoring [43], weight averaging [50], node pooling scores [44], and cross-validation accuracy-based ROI ranking [57] have been applied to identify the most informative features for distinguishing MDD from controls.
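As a minimal illustration of weight-based feature attribution (a sketch, not the implementation of any reviewed model), the example below fits a simple logistic model to toy ROI features and ranks features by the absolute learned weights, in the spirit of weight-averaging importance scores; the data and dimensions are entirely synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 100 subjects x 6 ROI features; feature 2 carries the group signal.
X = rng.normal(size=(100, 6))
y = (X[:, 2] + 0.3 * rng.normal(size=100) > 0).astype(float)

# Fit a logistic-regression-style linear model by gradient descent.
w = np.zeros(6)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(y)

# Attribution: rank features (ROIs) by absolute learned weight.
importance = np.abs(w)
ranking = np.argsort(importance)[::-1]
```

On this toy data, the informative feature dominates the ranking; with real FC features, such scores are typically averaged over cross-validation folds before ROIs are reported.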
Ablation methods are a form of perturbation explainability that assess feature importance by removing or masking input components and measuring the impact on model performance. Unlike feature attribution methods, these techniques rely on external intervention to test feature relevance. In rs-fMRI studies, region-level ablation [32,42] and leave-one-functional-connection-out (FNC-out) strategies [27] have been used to identify the brain regions or connections most relevant to classification.
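Region-level ablation can be sketched model-agnostically: mask one region at a time and record the performance drop. The nearest-class-mean "model" and toy data below are hypothetical placeholders for a trained classifier and real FC features:

```python
import numpy as np

rng = np.random.default_rng(1)

def model_accuracy(X, y):
    """Stand-in 'trained model': nearest-class-mean classifier accuracy."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = np.linalg.norm(X - mu1, axis=1) < np.linalg.norm(X - mu0, axis=1)
    return (pred == (y == 1)).mean()

# Toy data: 80 subjects x 5 regions; region 3 is discriminative.
y = rng.integers(0, 2, size=80)
X = rng.normal(size=(80, 5))
X[:, 3] += 2.0 * y

baseline = model_accuracy(X, y)

# Region-level ablation: mask one region at a time, measure the accuracy drop.
drops = []
for r in range(X.shape[1]):
    X_abl = X.copy()
    X_abl[:, r] = X[:, r].mean()  # replace region r with an uninformative value
    drops.append(baseline - model_accuracy(X_abl, y))

most_important = int(np.argmax(drops))
```

The region whose removal causes the largest drop is flagged as most important; leave-one-connection-out works identically, masking FC edges instead of regions.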
Explainability methods such as Grad-CAM and SHapley Additive Explanations (SHAP) enhance model transparency for researchers, but Grad-CAM often produces coarse heatmaps that do not clearly localise features [102], and SHAP is computationally expensive and only explains correlation, not causality [103,104].

3.4.2. Ante Hoc Methods

Ante hoc explainability methods are designed to make a model’s decision process transparent from the start, by building interpretability directly into the model structure. A major contribution of these methods is their ability to guide the model’s focus during training. Several studies have used attention mechanisms to allow the model to learn which brain areas or time intervals are most relevant for discriminating between patients and controls by weighting important inputs [34,56,58]. To go beyond associations, some frameworks implement causal reasoning to identify brain connections that influence model predictions. These models are designed to filter out spurious correlations and identify functionally relevant subgraphs [30].
Prototype learning has also been employed, which generates representative prototypes for each diagnostic category. At the time of prediction, the model compares query inputs against learned prototypes and makes decisions based on similarity. This method is particularly beneficial in subtype classification [38,49].
Furthermore, counter-condition analysis has been introduced to simulate the brain connectivity of a subject under the opposite diagnosis label (e.g., simulating healthy connectivity for an MDD subject). By comparing the original and simulated connectivity patterns, the model determines which functional connections must change to shift the diagnosis, which also provides instance-level explanations [34].
Ante hoc methods are preferred in clinical practice for interpretability [105]. However, they depend on model design and alignment with clinical reasoning. They are intrinsically explainable but still require domain knowledge for correct interpretation. In addition, these methods involve a trade-off between generalisability and explainability, meaning interpretable models may perform worse on diverse or complicated data [106].

3.4.3. XAI Challenges and Future Directions

Post hoc methods provide explanations after model training, but their stability may vary, particularly on small, high-dimensional datasets, due to sensitivity to noise and dependence on model layers [107]. They may even produce inconsistent outcomes across different runs or with perturbed inputs. Such constraints mean that post hoc methods must be properly validated before clinical use. Measures including fidelity, comprehensibility, stability, and completeness have been proposed to assess explanation quality [103], but there are still no standardised benchmarks in neuroimaging-based XAI studies.
Ante hoc methods, which generate explanations as part of the model's prediction process, are generally more efficient and predictable. However, this often comes at the expense of predictive performance or generalisability across datasets, and they typically still require domain knowledge to interpret [108]. Although such methods offer a pathway to intrinsically interpretable models, they remain underexplored in neuroimaging pipelines.
In the future, research should focus on the development of clinically validated XAI models that are not only accurate but also understandable. This includes causal reasoning approaches for better representing condition-specific brain patterns, as well as the incorporation of personalised explanation techniques that can handle the inter-subject heterogeneity of MDD. One of the directions is combining natural language processing (NLP) and large language models (LLMs) for explaining complex model outputs to clinicians [109,110]. Such multimodal systems could enable diagnostic transparency and build trust in AI-driven decisions, yet this integration is technically and ethically challenging [103]. Demonstrating the clinical utility and cognitive usability of explanations should be a goal of future research in the field [111,112].

4. Discussion

This review systematically evaluated 35 deep learning studies applied to resting-state fMRI data for MDD detection. These studies are designed to contribute to clinical diagnosis by identifying disorder-specific patterns. A growing number of studies employed advanced neural architectures, multimodal features, and explainability approaches to enhance model performance and clinical interpretability. Most studies used functional connectivity (Table 2) derived from the AAL atlas (Table 4) as their main feature representation, with a smaller number exploring dynamic functional connectivity to capture temporal changes. Frequently reported brain regions included the amygdala, cerebellum, thalamus, and the Default Mode Network (Table 5 and Table 6). Supervised deep learning dominated the field, particularly Graph Neural Networks, followed by hybrid CNN–RNN architectures for capturing spatiotemporal patterns (Table 7). A few studies explored generative models and domain adaptation to address data scarcity and site variability. Despite this progress, several challenges remain. High accuracy was typically achieved only on single-site datasets with small samples. When evaluated across hospitals or regions, performance dropped sharply, largely due to inter-site variability as well as differences in patient characteristics and symptom profiles. To address this, some studies have explored multimodal approaches to integrate complementary information across modalities, which have shown promising potential for improving generalisability. Domain adaptation strategies have also been proposed to reduce site-related variability and lessen reliance on labelled data. However, performance in external validation settings remains moderate, highlighting the ongoing challenge of generalisation (see Section Impact of Validation Methods on Model Performance and Generalisability). 
There was also no clear agreement on which feature types are most informative, and biomarker findings were inconsistent across studies. As a result, the reproducibility of identified biomarkers across sites and populations remains an open question. In the following sections, we summarise the main outcomes of these studies and their methodological implications and areas for further research.

4.1. Generalisation and Data Diversity (RQ1)

A recurrent limitation in this research has been the reliance on regionally constrained datasets drawn mainly from China. Although multi-site datasets increase sample size, many studies still rely on the same underlying dataset (REST-meta-MDD), meaning that cross-validation within pooled data does not constitute true external validation. Cross-validation on a merged dataset of different sites tends to yield optimistic performance, and accuracy typically drops under LOSO or cross-site tests. Moreover, reported classification performance is heavily dependent on dataset size, with small samples being more prone to overfitting.
Few studies tested their model on completely independent test sets [30,32,56,57,58], which are required to measure true generalisability. Adopting evaluation approaches that challenge models "as is" (neither retrained nor fine-tuned) and comparing against random performance could provide more realistic estimates [113], especially in clinical classification tasks, where small sample sizes and class imbalance inflate performance [52]. It is also important to clearly report subject-level data splits during validation to avoid information leakage: no data from the same subject should appear in both the training and test sets [114]. Despite these challenges, domain adaptation techniques and multimodal approaches demonstrated strong performance and may reduce inter-site and inter-subject biases while increasing robustness across settings.
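A leave-one-site-out split of the kind discussed above can be sketched in a few lines; the record structure below is hypothetical, but the invariant it enforces, that no scan from the held-out site appears in training, is what prevents site- and subject-level leakage:

```python
from collections import defaultdict

# Toy records: (subject_id, site); one subject may contribute several scans.
records = [("s01", "A"), ("s01", "A"), ("s02", "A"),
           ("s03", "B"), ("s04", "B"), ("s05", "C")]

def leave_one_site_out(records):
    """Yield (site, train_idx, test_idx) with all scans from one site held out."""
    by_site = defaultdict(list)
    for i, (_, site) in enumerate(records):
        by_site[site].append(i)
    for site, test_idx in sorted(by_site.items()):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(records)) if i not in held_out]
        yield site, train_idx, test_idx

splits = list(leave_one_site_out(records))
```

Because splitting is done by site (and therefore by subject), repeated scans of one subject can never straddle the train/test boundary, unlike naive scan-level shuffling.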

4.2. Feature Extraction and Biomarker Discovery (RQ2)

Most studies employed FC, calculated primarily using the Pearson correlation coefficient, to characterise inter-regional brain connectivity. While robust, this method discards the temporal information necessary to characterise dynamic brain activity. Gradually, more studies have utilised dFC, which improved classification in several cases and provided more individualised neural profiles [115,116]. Hybrid strategies that combined low- and high-order FC also performed better by capturing both direct and indirect neural connections [28,44,47,49].
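The static and sliding-window FC features described above can be sketched as follows; the time series are synthetic, and the window length and stride are illustrative choices rather than values from any reviewed study:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy rs-fMRI time series: T=120 volumes x N=4 ROIs.
T, N = 120, 4
ts = rng.normal(size=(T, N))
ts[:, 1] += 0.8 * ts[:, 0]  # make ROI 0 and ROI 1 correlated

# Static FC: Pearson correlation between full ROI time series.
fc = np.corrcoef(ts.T)  # (N, N) symmetric matrix

# Dynamic FC: correlations within sliding windows (window=30, stride=10 TRs).
win, stride = 30, 10
dfc = np.stack([np.corrcoef(ts[s:s + win].T)
                for s in range(0, T - win + 1, stride)])  # (windows, N, N)
```

The static matrix collapses the whole scan into one `(N, N)` summary, whereas the dynamic variant yields a sequence of matrices whose variation over windows is what temporal models attempt to exploit.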
Nonetheless, there are several technical limitations in feature extraction. Feature selection is often biased toward group differences and disregards individual heterogeneity. Some models also use fixed graph construction methods and a single brain atlas, failing to leverage hierarchical information [117,118]. In addition, the limited use of structural or non-imaging data restricts the model's ability to contextualise functional activity. Multimodal pipelines are further limited by incomplete data, because not all subjects have complete structural or clinical records, reducing the viable sample size. There is also no agreement on which feature type is most informative among static FC, dFC, higher-order FC, structural features, and patient demographics, with performance varying between studies. Data augmentation, while common in other domains, is not typically applied to rs-fMRI, since it is challenging to design realistic transformations that preserve spatial–temporal brain structure. Despite these constraints, studies have converged on the same biomarkers across various models and data sources. Brain regions such as the amygdala, thalamus, hippocampus, cerebellum, and insula were frequently identified, as were functional networks including the DMN, FPN, and CON. Furthermore, the thalamus was repeatedly implicated across multi-regional datasets, suggesting its possible role as a consistent biomarker of MDD [32]. This recurring pattern may indicate an emerging convergence toward candidate neuroimaging biomarkers associated with MDD alterations. However, these findings should be interpreted cautiously, as many of these regions are implicated in other psychiatric disorders, and their specificity remains uncertain. Further validation across independent datasets, imaging sites, and populations is required to determine whether these regions represent reliable and generalisable biomarkers.

4.3. Deep Learning Techniques and Performance (RQ3)

Deep learning methods used for MDD prediction fall into supervised and unsupervised categories. Supervised models dominate the literature and have demonstrated strong performance. However, they are constrained by limited data, inter-site and inter-subject variability, and class imbalance. Among them, GNNs achieved the highest performance and have been applied most frequently, owing to their ability to model brain connectivity as a graph. However, their performance varied significantly across experiments, with some showing better accuracy than conventional machine learning models and others demonstrating minimal or no improvement. These inconsistencies can be attributed to variations in graph construction, sample size, and disease heterogeneity. There is no standard rule for graph construction; model depth and the use of transformer modules or positional encoding vary across studies and often follow choices made in previous works. Such differences in model construction reduce comparability and reproducibility. Additionally, some models, such as DNNs and CNNs, obtained moderate performance but struggled to represent the full richness of brain network topology and are mostly applied for structural feature extraction. Transfer learning has been utilised instead of training deep models from scratch, which is often infeasible for small neuroimaging datasets, showing improved convergence by reusing features learned in related domains.
Unsupervised and semi-supervised approaches are receiving increasing attention, especially where annotated data are limited. These include autoencoders, GANs, contrastive learning models, and domain adaptation techniques. They are useful for learning latent representations or matching features between domains and have shown performance comparable to supervised methods. However, features learned by purely unsupervised means may not be adequate for downstream classification tasks and often require additional supervised fine-tuning [49,55]. Furthermore, these methods rely largely on data augmentation strategies that are difficult to define meaningfully for complex structures like brain networks and remain to be explored. GANs have also been investigated as a remedy for limited dataset size by generating synthetic connectivity data. However, this area demands more work, and the success of data generation in medical imaging remains to be determined [119].
Multimodal fusion methods have also been found to enhance model performance. Studies incorporating structural MRI and demographic data alongside functional data reported improved classification accuracy. Though multimodal fusion remains underutilised, the most common mechanism is feature concatenation, which processes each modality independently and fails to capture relations between modalities. Richer integration would enable more comprehensive and personalised diagnosis.
Emerging research can leverage language models or attention-based fusion operations to adaptively weight and interpret different input types [89]. It is also important to note that model performance reflects not just architecture but also dataset properties, brain atlas selection, pre-processing methods, and validation strategy. Even when the same dataset is used, differences in subjects, sample size, and fold definition may produce inconsistent performance. Although this review investigated the accuracy of models in general and on the same datasets, direct comparability remains limited. Code and model weights were typically not publicly available, and more benchmarking is required to compare such methods under standard validation frameworks and across datasets.

4.4. Explainability (RQ4)

The majority of the studies reviewed employed some form of explainability method, highlighting the growing demand for model interpretability in clinical use. The methods were shown to be promising in identifying biomarkers and improving model explainability. Post hoc and ante hoc interpretation methods are both valuable tools, particularly for researchers who want to understand complex deep model outputs.
However, there is considerable variability in the dysfunctional regions identified by these methods. To enable clinical adoption, these methods must be validated, with assurance that the explanations are clinically relevant, interpretable, and stable.
Ante hoc approaches offer the advantage of producing more faithful representations of model behaviour, while post hoc approaches are adaptable and easy to apply across architectures. Regardless of the approach taken, future efforts should emphasise individual-level and causal reasoning to understand the nature of psychiatric disorders such as MDD.
In addition, large language models could be used to translate challenging model outputs into a comprehensible format so that clinician and developer communication can be facilitated [109,110]. Explainability methods should evolve beyond research settings toward clinical usability. Having the capacity to test the stability of biomarkers over a broad set of datasets and clinical settings will be essential to real-world deployment.

4.5. Clinical Translation and Deployment Challenges

This section outlines the main challenges in applying AI models for MDD classification in clinical practice. These include site-level issues (such as scanner differences and system integration), population-level challenges in identifying reliable biomarkers, subject-level variability reflecting the clinical heterogeneity of MDD, and clinical-level concerns such as model interpretability and the clear presentation of results to support decision-making (Figure 12).

4.5.1. Site-Level Challenges

One of the barriers to clinical deployment is the ability of models to generalise across hospitals and imaging centres. Differences in MRI scanners and acquisition protocols introduce substantial inter-site variability in rs-fMRI data [120]. These differences arise from a range of factors: scanner-related characteristics such as field strength and manufacturer (Siemens, GE, and Philips), head coil configuration (8-, 12-, or 32-channel), and acquisition settings such as spatial resolution, repetition time (TR), echo time (TE), scan duration, and number of volumes, which together contribute to variability in the resulting data. Several studies have attempted to address this variability using harmonisation techniques, such as ComBat. Large multi-site datasets, including REST-meta-MDD, show that harmonisation can partially reduce site-related variability in functional connectivity measures [58]. However, the effectiveness of these approaches may be limited in highly heterogeneous multi-region datasets, indicating the need for further systematic evaluation [32]. Reproducibility and benchmarking remain significant challenges. Many studies do not release implementation code or trained models, limiting reproducibility and preventing fair comparison across methods. In addition, variability in preprocessing pipelines, brain parcellation schemes, feature extraction strategies, model architectures, and validation protocols makes direct comparison of reported results difficult. Addressing these issues requires greater emphasis on the open-source sharing of code and models, as well as the development of standardised evaluation frameworks. Future efforts should also prioritise harmonised datasets, unified preprocessing and validation protocols, and consistent reporting standards to enable more reliable cross-study comparisons and clearer assessment of model generalisability. Clinical deployment introduces additional constraints.
Imaging data are typically stored in DICOM format within Picture Archiving and Communication Systems (PACS), requiring AI systems to operate within an end-to-end pipeline that includes automated data retrieval, preprocessing of high-dimensional rs-fMRI data, feature extraction, model inference, and generation of interpretable outputs. These steps are computationally demanding and often rely on GPU-based systems, with studies commonly reporting the use of NVIDIA architectures such as P100, TITAN XP, TITAN RTX, and A100, typically with memory ranging from approximately 12–24 GB depending on the configuration [35,41,43,51]. However, few studies report computational requirements or inference time, limiting the assessment of real-world feasibility. In addition, commonly used preprocessing tools such as DPARSF provide standardised and user-friendly pipelines for rs-fMRI analysis, but are based on MATLAB and primarily designed for research use, which may limit scalability and integration into clinical workflows. In contrast, Python frameworks may offer more flexible and deployable solutions for real-world applications. The limited availability of annotated neuroimaging data remains another major barrier. Reliable labels often require expert clinical evaluation, making annotation costly and time-consuming. This challenge becomes more pronounced after deployment as new data are continuously generated while labelled examples remain scarce. Recent studies therefore explore approaches such as domain adaptation, self-supervised learning, and active learning, which leverage unlabelled data or adapt models across different acquisition sites. Although these approaches may sometimes achieve slightly lower accuracy than fully supervised methods, they can learn more generalisable representations and may therefore be better suited to real-world clinical settings where labelled data are limited.
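The location-and-scale adjustment that underlies harmonisation methods such as ComBat can be sketched in a simplified form; the version below omits ComBat's empirical-Bayes shrinkage and covariate preservation, and the two-site data are synthetic, so this is an illustration of the idea rather than a usable harmonisation pipeline:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy FC features from two sites with different offsets and scales (site effects).
site_a = 0.2 + 1.0 * rng.normal(size=(40, 6))
site_b = 0.5 + 1.8 * rng.normal(size=(60, 6))

def location_scale_harmonise(batches):
    """Align each site's per-feature mean/std to the pooled reference."""
    pooled = np.vstack(batches)
    ref_mu, ref_sd = pooled.mean(axis=0), pooled.std(axis=0)
    out = []
    for b in batches:
        z = (b - b.mean(axis=0)) / b.std(axis=0)  # remove site location/scale
        out.append(z * ref_sd + ref_mu)           # map onto pooled reference
    return out

ha, hb = location_scale_harmonise([site_a, site_b])
```

After adjustment, both sites share the pooled per-feature mean and standard deviation; ComBat proper additionally shrinks the per-site estimates and protects biological covariates (e.g., diagnosis, age) from being removed along with the site effect.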

4.5.2. Population-Level Challenges

At the population level, a major challenge concerns the reproducibility and generalisability of reported neuroimaging biomarkers. Many studies rely on datasets collected from a single region or country, most frequently from Chinese cohorts, without performing cross-site evaluation. Although several studies report similar candidate biomarkers, the diversity of reported regions and connectivity patterns suggests that these findings may be influenced by dataset-specific characteristics. Only a limited number of studies have explicitly examined consistency across regions, with the thalamus [32] being one of the few regions evaluated across multi-regional datasets. Consequently, biomarkers identified in one cohort must be validated across multiple hospitals, populations, and acquisition protocols to determine whether they represent robust disease-related signals rather than site-specific effects. In addition, most current approaches rely on static functional connectivity representations derived from predefined brain atlases. While these representations provide a simplified description of brain organisation, they may overlook the dynamic and time-varying nature of neural activity. Only a limited number of studies have explored dynamic functional connectivity or temporal biomarkers [31] that capture changes in connectivity states over time. Future research should therefore investigate both spatial and temporal biomarkers, as well as alternative feature representations, to better capture the complex and heterogeneous nature of MDD across populations.

4.5.3. Subject-Level Challenges

At the subject level, clinical translation is challenged by the substantial heterogeneity of MDD across individuals. Patients differ in demographic factors such as age and sex, as well as clinical characteristics including symptom severity, illness duration, medication status, and comorbidities. These factors can influence brain connectivity patterns and may affect model predictions across individuals. Most studies reviewed rely on binary classification between MDD patients and healthy controls. While useful for model development, this formulation does not reflect clinical diagnostic practice, where MDD diagnosis is established according to DSM-5 criteria. Some studies incorporate multimodal data, combining rs-fMRI with structural MRI, demographic variables, or clinical measures. Findings from these studies suggest that imaging data alone may not be sufficient [48,58]; however, multimodal approaches often rely on simple feature concatenation, and the contribution of each modality to overall model performance remains unclear. Clinical deployment would also require integration with hospital systems such as Radiology Information Systems (RIS) and Electronic Health Records (EHR), allowing models to access imaging, demographic, and clinical information within routine workflows. Future work should therefore focus on models that account for inter-subject variability, integrate multimodal information, and estimate the influence of demographic and clinical factors (HAMD or PHQ-9) on model predictions rather than providing prediction probabilities alone. Prospective clinical studies will also be required to evaluate performance across diverse patient populations.

4.5.4. Clinical-Level Challenges

At the clinical level, AI models must perform accurately and communicate results in a way clinicians can understand. Clinical utility depends not only on performance but also on interpretability, as clinicians need to understand and justify how a model makes its decisions. Approaches such as causal reasoning frameworks, prototype learning, and counterfactual analyses improve interpretability by identifying brain subnetworks or comparing patient patterns with similar cases. An important challenge is the trade-off between model complexity and interpretability. Although sophisticated GNNs may achieve high performance on retrospective datasets, they are computationally demanding and difficult to interpret. Simplified DNNs using only nine pre-selected regions achieved similar performance to whole-brain models with lower computational cost. Optimisation strategies, including LEAN and CLIP [51], removed redundant parameters while maintaining accuracy using only 2% of the original network. Pitsik et al. [54] showed that keeping only the strongest 2.5% of edges improved accuracy while simplifying the network. A GCN on multi-region MDD data did not outperform a traditional SVM [32]. These findings provide evidence that simpler, more interpretable models can remain competitive with complex deep learning approaches, while offering advantages in transparency and potential clinical usability. Few studies compare model predictions with clinician-level decisions. The MAMF-GCN framework [48] evaluated 50 hospital patients: clinicians classified 29 as depressive and 21 as having a depressive tendency. The model correctly identified all 29 confirmed cases and predicted 16 of the 21 uncertain cases as depressed; follow-up confirmed 11 of these. The framework also shows which brain regions and demographic features contributed most to the predictions, providing some level of interpretability. 
These findings suggest that AI models may support clinicians in unclear cases, but the small sample size and retrospective design highlight the need for larger prospective clinical studies. For deployment, AI outputs should be integrated with existing clinical reports and measures rather than providing only binary predictions. Summaries can complement assessments and symptom scales, supporting more interpretable decision-making in psychiatric practice. In addition, AI systems should operate within a human-in-the-loop framework, where outputs are treated as recommendations and reviewed by clinicians, with opportunities for improvement through clinical feedback.
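The edge-sparsification idea noted above, retaining only the strongest fraction of connections (e.g., the 2.5% reported by Pitsik et al. [54]), can be sketched as a simple thresholding step; the matrix below is random and the helper is illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy symmetric FC matrix for N ROIs, zero diagonal.
N = 20
fc = rng.normal(size=(N, N))
fc = (fc + fc.T) / 2
np.fill_diagonal(fc, 0.0)

def keep_strongest_edges(fc, fraction=0.025):
    """Zero all but the top `fraction` of off-diagonal edges by |weight|."""
    iu = np.triu_indices_from(fc, k=1)
    weights = np.abs(fc[iu])
    k = max(1, int(round(fraction * weights.size)))
    thresh = np.sort(weights)[-k]        # k-th largest edge magnitude
    mask = np.abs(fc) >= thresh
    np.fill_diagonal(mask, False)
    return fc * mask

sparse_fc = keep_strongest_edges(fc, fraction=0.025)
```

Such pruning shrinks the graph a downstream model must process, which is one route to the simpler, cheaper, and more transparent models discussed above, at the cost of a threshold that must itself be validated.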

4.6. Future Research Directions

  • Data diversity must be improved.
    Current datasets lack sufficient geographical and demographic diversity, leading to population bias and limiting the generalisability of models across different cohorts.
  • Preprocessing and harmonisation require careful evaluation.
    Choices such as global signal regression and harmonisation can influence both model performance and reported biomarkers, potentially introducing artificial differences or removing meaningful signals.
  • Biomarker reproducibility remains unclear.
    Reported neuroimaging biomarkers are often inconsistent across sites, datasets, and preprocessing pipelines, raising concerns about their robustness.
  • Individual-level variability should be better modelled.
    Most approaches rely on group-level classification, while clinical heterogeneity in MDD requires models that incorporate subject-specific clinical and demographic information.
  • Robust validation strategies are essential.
    Many studies rely on internal validation, and cross-site evaluation or independent test sets are still limited, restricting conclusions about real-world generalisability.
  • Domain adaptation and data-efficient methods are promising.
    Approaches that reduce sensitivity to site effects and dependence on large labelled datasets offer potential for improving generalisation.
  • Clear and clinically meaningful explanations are needed.
    Current explainability methods often highlight important regions but do not sufficiently explain decisions at the individual level, limiting clinical trust and usability.
  • Standardised benchmarking is lacking.
    Differences in datasets, preprocessing pipelines, and evaluation protocols make comparisons across studies difficult, highlighting the need for unified benchmarks and reporting standards.
  • Practical deployment constraints must be addressed.
    Real-world implementation requires efficient pipelines, manageable computational demands, and integration into existing clinical workflows.

4.7. Limitations and Contributions

This review has some limitations. Only English-language, peer-reviewed journal articles published between 2020 and 2024 were covered. Although we performed a comprehensive search, related articles may have been overlooked due to limitations in search terms and databases. A meta-analysis was not possible because models were not tested under the same conditions. In most cases, code was not publicly available, limiting detailed comparison of model performance and reproducibility.
Whereas previous reviews [20,21] surveyed machine learning algorithms, best reported accuracies, and the most commonly occurring biomarkers, this review focuses on generalisability, robustness of testing conditions in real-world settings, and reproducibility of biomarkers across populations. It also discusses validating and communicating model explanations with clinicians, a preliminary step toward an environment in which deep learning algorithms may be reliably implemented in practice.

5. Conclusions

This systematic review examined deep learning approaches for diagnosing MDD using rs-fMRI. While graph-based and multimodal models are increasingly adopted and show promising diagnostic accuracy, most studies rely on regionally restricted or single-site datasets, so their generalisability to clinical practice remains unknown. Although many studies have explored explainability methods, their findings are often not validated by clinicians and require additional testing on a range of datasets to determine whether they are reliable and consistent. Cross-site validation and real-world deployment also remain limited. On the positive side, recent approaches that incorporate multimodal data, such as structural imaging or text- and phenotype-based data, and that employ unsupervised pre-training have yielded encouraging results and provide valuable directions for future research. These findings may serve as a foundation for more robust, interpretable, and clinically applicable systems for the diagnosis of MDD.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16073444/s1; the PRISMA checklist is available online.

Author Contributions

Conceptualization, M.S., L.W. and C.M.; methodology, M.S., L.W. and C.M.; software, M.S.; validation, M.S., L.W. and M.E.; formal analysis, M.S.; investigation, M.S.; resources, C.M.; data curation, M.S.; writing—original draft preparation, M.S.; writing—review and editing, M.S., L.W. and C.M.; visualization, M.S.; supervision, L.W. and C.M.; project administration, L.W. and C.M.; and funding acquisition, M.S. and C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This publication has emanated from research conducted with the financial support of Taighde Éireann—Research Ireland, under grant number 21/RC/10294_P2 at FutureNeuro Research Ireland Centre for Translational Brain Science and under grant number GOIPG/2024/3520.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this review.

Acknowledgments

OpenAI’s ChatGPT 5.2 was employed to enhance the grammar, clarity, and language expression. The authors retain full responsibility for the content and conclusions presented in this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Najjar, R. Redefining radiology: A review of artificial intelligence integration in medical imaging. Diagnostics 2023, 13, 2760. [Google Scholar] [CrossRef]
  2. Zafar, F.; Alam, L.F.; Vivas, R.R.; Wang, J.; Whei, S.J.; Mehmood, S.; Nazir, Z. The Role of Artificial Intelligence in Identifying Depression and Anxiety: A Comprehensive Literature Review. Cureus 2024, 16, e56472. [Google Scholar] [CrossRef] [PubMed]
  3. Klöppel, S.; Stonnington, C.M.; Barnes, J.; Chen, F.; Chu, C.; Good, C.D.; Frackowiak, R.S. Accuracy of dementia diagnosis—A direct comparison between radiologists and a computerized method. Brain 2008, 131, 2969–2974. [Google Scholar] [CrossRef]
  4. Fava, M.; Kendler, K.S. Major Depressive Disorder. Neuron 2000, 28, 335–341. [Google Scholar] [CrossRef] [PubMed]
  5. World Health Organization. Depression. 2023. Available online: https://www.who.int/news-room/fact-sheets/detail/depression (accessed on 31 March 2023).
  6. Zalar, B.; Blatnik, A.; Maver, A.; Klemenc-Ketiš, Z.; Peterlin, B. Family history as an important factor for stratifying participants in genetic studies of major depression. Balk. J. Med. Genet. 2018, 21, 5–12. [Google Scholar] [CrossRef]
  7. Hasler, G. Pathophysiology of depression: Do we have any solid evidence of interest to clinicians? World Psychiatry 2010, 9, 155. [Google Scholar] [CrossRef]
  8. Mascayano, F.; Armijo, J.E.; Yang, L.H. Addressing stigma relating to mental illness in low-and middle-income countries. Front. Psychiatry 2015, 6, 38. [Google Scholar] [CrossRef] [PubMed]
  9. Brasso, C.; Cisotto, M.; Del Favero, E.; Giordano, B.; Villari, V.; Rocca, P. Impact of COVID-19 pandemic on major depressive disorder in acute psychiatric inpatients. Front. Psychol. 2023, 14, 1181832. [Google Scholar] [CrossRef]
  10. Mao, K.; Wu, Y.; Chen, J. A systematic review on automated clinical depression diagnosis. npj Ment. Health Res. 2023, 2, 20. [Google Scholar] [CrossRef] [PubMed]
  11. Tolentino, J.C.; Schmidt, S.L. DSM-5 criteria and depression severity: Implications for clinical practice. Front. Psychiatry 2018, 9, 450. [Google Scholar] [CrossRef] [PubMed]
  12. Renemane, L.; Vrublevska, J. Hamilton depression rating scale: Uses and applications. In The Neuroscience of Depression; Academic Press: Cambridge, MA, USA, 2021; pp. 175–183. [Google Scholar]
  13. Khalifa, M.; Albadawy, M. Artificial Intelligence for Clinical Prediction: Exploring Key Domains and Essential Functions. Comput. Methods Programs Biomed. Update 2024, 5, 100148. [Google Scholar] [CrossRef]
  14. Glover, G.H. Overview of functional magnetic resonance imaging. Neurosurg. Clin. 2011, 22, 133–139. [Google Scholar] [CrossRef]
  15. Pandya, M.; Altinay, M.; Malone, D.A.; Anand, A. Where in the brain is depression? Curr. Psychiatry Rep. 2012, 14, 634–642. [Google Scholar] [CrossRef]
  16. Buxton, R.B. The physics of functional magnetic resonance imaging (fMRI). Rep. Prog. Phys. 2013, 76, 096601. [Google Scholar] [CrossRef] [PubMed]
  17. Zhang, X.; Maltbie, E.A.; Keilholz, S.D. Spatiotemporal trajectories in resting-state FMRI revealed by convolutional variational autoencoder. NeuroImage 2021, 244, 118588. [Google Scholar] [CrossRef]
  18. Lee, M.H.; Smyser, C.D.; Shimony, J.S. Resting-state fMRI: A review of methods and clinical applications. Am. J. Neuroradiol. 2013, 34, 1866–1872. [Google Scholar] [CrossRef] [PubMed]
  19. Mosch, L.; Fürstenau, D.; Brandt, J.; Wagnitz, J.; Klopfenstein, S.A.; Poncette, A.S.; Balzer, F. The medical profession transformed by artificial intelligence: Qualitative study. Digit. Health 2022, 8, 20552076221143903. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, Y.; Zhao, W.; Yi, S.; Liu, J. The diagnostic performance of machine learning based on resting-state functional magnetic resonance imaging data for major depressive disorders: A systematic review and meta-analysis. Front. Neurosci. 2023, 17, 1174080. [Google Scholar] [CrossRef] [PubMed]
  21. Bondi, E.; Maggioni, E.; Brambilla, P.; Delvecchio, G. A systematic review on the potential use of machine learning to classify major depressive disorder from healthy controls using resting state fMRI measures. Neurosci. Biobehav. Rev. 2023, 144, 104972. [Google Scholar] [CrossRef]
  22. Yeasmin, M.N.; Al Amin, M.; Joti, T.J.; Aung, Z.; Azim, M.A. Advances of AI in image-based computer-aided diagnosis: A review. Array 2024, 23, 100357. [Google Scholar] [CrossRef]
  23. Chan, H.P.; Samala, R.K.; Hadjiiski, L.M.; Zhou, C. Deep Learning in Medical Image Analysis; Springer: Berlin/Heidelberg, Germany, 2020; pp. 3–21. [Google Scholar]
  24. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef] [PubMed]
  25. Venkatapathy, S.; Votinov, M.; Wagels, L.; Kim, S.; Lee, M.; Habel, U.; Ra, I.H.; Jo, H.G. Ensemble graph neural network model for classification of major depressive disorder using whole-brain functional connectivity. Front. Psychiatry 2023, 14, 1125339. [Google Scholar] [CrossRef] [PubMed]
  26. Qin, K.; Lei, D.; Pinaya, W.H.; Pan, N.; Li, W.; Zhu, Z.; Sweeney, J.A.; Mechelli, A.; Gong, Q. Using graph convolutional network to characterize individuals with major depressive disorder across multiple imaging sites. eBioMedicine 2022, 78, 103977. [Google Scholar] [CrossRef]
  27. Zhao, J.; Huang, J.; Zhi, D.; Yan, W.; Ma, X.; Yang, X.; Li, X.; Ke, Q.; Jiang, T.; Calhoun, V.D.; et al. Functional network connectivity (FNC)-based generative adversarial network (GAN) and its applications in classification of mental disorders. J. Neurosci. Methods 2020, 341, 108756. [Google Scholar] [CrossRef] [PubMed]
  28. Tan, Y.F.; Ting, C.M.; Noman, F.; Phan, R.C.W.; Ombao, H. fMRI Functional Connectivity Augmentation Using Convolutional Generative Adversarial Networks for Brain Disord. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
  29. Noman, F.; Ting, C.M.; Kang, H.; Phan, R.C.; Ombao, H. Graph autoencoders for embedding learning in brain networks and major depressive disorder identification. IEEE J. Biomed. Health Inform. 2024, 28, 1644–1655. [Google Scholar] [CrossRef]
  30. Zheng, G.; Zheng, W.; Zhang, Y.; Wang, J.; Chen, M.; Wang, Y.; Cai, T.; Yao, Z.; Hu, B. An attention-based multi-modal MRI fusion model for major depressive disorder diagnosis. J. Neural Eng. 2023, 20, 066005. [Google Scholar] [CrossRef]
  31. Hu, J.; Luo, J.; Xu, Z.; Liao, B.; Dong, S.; Peng, B.; Hou, G. Spatio-temporal learning and explaining for dynamic functional connectivity analysis: Application to depression. J. Affect. Disord. 2024, 364, 266–273. [Google Scholar] [CrossRef]
  32. Gallo, S.; El-Gazzar, A.; Zhutovsky, P.; Thomas, R.M.; Javaheripour, N.; Li, M.; Bartova, L.; Bathula, D.; Dannlowski, U.; Davey, C.; et al. Functional connectivity signatures of major depressive disorder: Machine learning analysis of two multicenter neuroimaging studies. Mol. Psychiatry 2023, 28, 3013–3022. [Google Scholar] [CrossRef] [PubMed]
  33. Kong, Y.; Gao, S.; Yue, Y.; Hou, Z.; Shu, H.; Xie, C.; Zhang, Z.; Yuan, Y. Spatio-temporal graph convolutional network for diagnosis and treatment response prediction of major depressive disorder from functional connectivity. Hum. Brain Mapp. 2021, 42, 3922–3933. [Google Scholar] [CrossRef] [PubMed]
  34. Kang, E.; Heo, D.W.; Lee, J.; Suk, H.I. A Learnable Counter-Condition Analysis Framework for Functional Connectivity-Based Neurological Disorder Diagnosis. IEEE Trans. Med. Imaging 2023, 43, 1377–1387. [Google Scholar] [CrossRef] [PubMed]
  35. Wang, Q.; Li, L.; Qiao, L.; Liu, M. Adaptive multimodal neuroimage integration for major depression disorder detection. Front. Neuroinform. 2022, 16, 856175. [Google Scholar] [CrossRef] [PubMed]
  36. Lin, C.; Lee, S.H.; Huang, C.M.; Chen, G.Y.; Chang, W.; Liu, H.L.; Ng, S.H.; Lee, T.M.; Wu, S.C. Automatic diagnosis of late-life depression by 3D convolutional neural networks and cross-sample Entropy analysis from resting-state fMRI. Brain Imaging Behav. 2023, 17, 125–135. [Google Scholar] [CrossRef]
  37. Wang, J.; Li, T.; Sun, Q.; Guo, Y.; Yu, J.; Yao, Z.; Hou, N.; Hu, B. Automatic diagnosis of major depressive disorder using a high-and low-frequency feature fusion framework. Brain Sci. 2023, 13, 1590. [Google Scholar] [CrossRef] [PubMed]
  38. Zheng, K.; Yu, S.; Chen, L.; Dang, L.; Chen, B. BPI-GNN: Interpretable brain network-based psychiatric diagnosis and subtyping. NeuroImage 2024, 292, 120594. [Google Scholar] [CrossRef] [PubMed]
  39. Zheng, K.; Yu, S.; Chen, B. Ci-gnn: A granger causality-inspired graph neural network for interpretable brain network-based psychiatric diagnosis. Neural Netw. 2024, 172, 106147. [Google Scholar] [CrossRef] [PubMed]
  40. Dai, P.; Lu, D.; Shi, Y.; Zhou, Y.; Xiong, T.; Zhou, X.; Chen, Z.; Zou, B.; Tang, H.; Huang, Z.; et al. Classification of recurrent major depressive disorder using a new time series feature extraction method through multisite rs-fMRI data. J. Affect. Disord. 2023, 339, 511–519. [Google Scholar] [CrossRef] [PubMed]
  41. Dai, P.; Shi, Y.; Lu, D.; Zhou, Y.; Luo, J.; He, Z.; Chen, Z.; Zou, B.; Tang, H.; Huang, Z.; et al. Classification of recurrent major depressive disorder using a residual denoising autoencoder framework: Insights from large-scale multisite fMRI data. Comput. Methods Programs Biomed. 2024, 247, 108114. [Google Scholar] [CrossRef]
  42. Zhang, Y.; Liu, X.; Zhang, Z. DDN-Net: Deep Residual Shrinkage Denoising Networks with Channel-Wise Adaptively Soft Thresholds for Automated Major Depressive Disorder Identification. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1626–1630. [Google Scholar]
  43. Xia, Z.; Fan, Y.; Li, K.; Wang, Y.; Huang, L.; Zhou, F. DepressionGraph: A Two-Channel Graph Neural Network for the Diagnosis of Major Depressive Disorders Using rs-fMRI. Electronics 2023, 12, 5040. [Google Scholar] [CrossRef]
  44. Liu, S.; Gui, R. Fusing multi-scale fMRI features using a brain-inspired multi-channel graph neural network for major depressive disorder diagnosis. Biomed. Signal Process. Control 2024, 90, 105837. [Google Scholar] [CrossRef]
  45. Oh, J.H.; Lee, D.J.; Ji, C.H.; Shin, D.H.; Han, J.W.; Son, Y.H.; Kam, T.E. Graph-based conditional generative adversarial networks for major depressive disorder diagnosis with synthetic functional brain network generation. IEEE J. Biomed. Health Inform. 2023, 28, 1504–1515. [Google Scholar] [CrossRef] [PubMed]
  46. Liu, M.; Zhang, H.; Liu, M.; Chen, D.; Zhou, R.; Lu, W.; Zhang, L.; Shen, D.; Wang, Q.; Peng, D. Hierarchical Encoding and Fusion of Brain Functions for Depression Subtype Classification. IEEE Trans. Affect. Comput. 2024, 15, 1826–1837. [Google Scholar] [CrossRef]
  47. Long, D.; Zhang, M.; Yu, J.; Zhu, Q.; Chen, F.; Li, F. Intelligent diagnosis of major depression disease based on multi-layer brain network. Front. Neurosci. 2023, 17, 1126865. [Google Scholar] [CrossRef]
  48. Pan, J.; Lin, H.; Dong, Y.; Wang, Y.; Ji, Y. MAMF-GCN: Multi-scale adaptive multi-channel fusion deep graph convolutional network for predicting mental disorder. Comput. Biol. Med. 2022, 148, 105823. [Google Scholar] [CrossRef]
  49. Liang, Y.; Xu, G. Multi-level functional connectivity fusion classification framework for brain disease diagnosis. IEEE J. Biomed. Health Inform. 2022, 26, 2714–2725. [Google Scholar] [CrossRef]
  50. Kong, Y.; Niu, S.; Gao, H.; Yue, Y.; Shu, H.; Xie, C.; Zhang, Z.; Yuan, Y. Multi-stage graph fusion networks for major depressive disorder diagnosis. IEEE Trans. Affect. Comput. 2022, 13, 1917–1928. [Google Scholar] [CrossRef]
  51. Gupta, S.; Chan, Y.H.; Rajapakse, J.C.; Initiative, A.D.N. Obtaining leaner deep neural networks for decoding brain functional connectome in a single shot. Neurocomputing 2021, 453, 326–336. [Google Scholar] [CrossRef]
  52. Zhang, W.; Zeng, W.; Chen, H.; Liu, J.; Yan, H.; Zhang, K.; Wang, N. STANet: A Novel Spatio-Temporal Aggregation Network for Depression Classification with Small and Unbalanced FMRI Data. Tomography 2024, 10, 1895–1914. [Google Scholar] [CrossRef] [PubMed]
  53. Zhu, M.; Quan, Y.; He, X. The classification of brain network for major depressive disorder patients based on deep graph convolutional neural network. Front. Hum. Neurosci. 2023, 17, 1094592. [Google Scholar] [CrossRef]
  54. Pitsik, E.N.; Maximenko, V.A.; Kurkin, S.A.; Sergeev, A.P.; Stoyanov, D.; Paunova, R.; Kandilarova, S.; Simeonova, D.; Hramov, A.E. The topology of fMRI-based networks defines the performance of a graph neural network for the classification of patients with major depressive disorder. Chaos Solitons Fractals 2023, 167, 113041. [Google Scholar] [CrossRef]
  55. Wang, X.; Chu, Y.; Wang, Q.; Cao, L.; Qiao, L.; Zhang, L.; Liu, M. Unsupervised contrastive graph learning for resting-state functional MRI analysis and brain disorder detection. Hum. Brain Mapp. 2023, 44, 5672–5692. [Google Scholar] [CrossRef]
  56. Fang, Y.; Potter, G.G.; Wu, D.; Zhu, H.; Liu, M. Addressing multi-site functional MRI heterogeneity through dual-expert collaborative learning for brain disease identification. Hum. Brain Mapp. 2023, 44, 4256–4271. [Google Scholar] [CrossRef]
  57. Fang, Y.; Wang, M.; Potter, G.G.; Liu, M. Unsupervised cross-domain functional MRI adaptation for automated major depressive disorder identification. Med. Image Anal. 2023, 84, 102707. [Google Scholar] [CrossRef] [PubMed]
  58. Liu, S.; Zhou, J.; Zhu, X.; Zhang, Y.; Zhou, X.; Zhang, S.; Jin, C. An Objective Quantitative Diagnosis of Depression Using a Local-to-Global Multimodal Fusion Graph Neural Network. Patterns 2024, 5, 101081. [Google Scholar] [CrossRef]
  59. Tian, C.; Lu, M. High-Order Functional Connectivity Based Major Depressive Disorder Classification Using Self-Attention Based CNN. In Proceedings of the 2024 7th International Conference on Data Science and Information Technology (DSIT), Nanjing, China, 13–15 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  60. Yan, C.G.; Chen, X.; Li, L.; Castellanos, F.X.; Bai, T.J.; Bo, Q.J.; Cao, J.; Chen, G.M.; Chen, N.X.; Chen, W.; et al. Reduced default mode network functional connectivity in patients with recurrent major depressive disorder. Proc. Natl. Acad. Sci. USA 2019, 116, 9078–9083. [Google Scholar] [CrossRef]
  61. Tanaka, S.C.; Yamashita, A.; Yahata, N.; Itahashi, T.; Lisi, G.; Yamada, T.; Ichikawa, N.; Takamura, M.; Yoshihara, Y.; Kunimatsu, A.; et al. A multi-site, multi-disorder resting-state magnetic resonance image database. Sci. Data 2021, 8, 227. [Google Scholar] [CrossRef]
  62. Qian, J.; Li, H.; Wang, J.; He, L. Recent Advances in Explainable Artificial Intelligence for Magnetic Resonance Imaging. Diagnostics 2023, 13, 1571. [Google Scholar] [CrossRef]
  63. Zhao, L. Advances in fMRI-Based Brain Function Mapping: A Deep Learning Perspective. Psychoradiology 2025, 5, kkaf007. [Google Scholar] [CrossRef] [PubMed]
  64. Craddock, C.; Sikka, S.; Cheung, B.; Khanuja, R.; Ghosh, S.S.; Yan, C.G.; Milham, M. Towards automated analysis of connectomes: The configurable pipeline for the analysis of connectomes (C-PAC). Front. Neuroinform. 2013, 7, 42. [Google Scholar] [CrossRef]
  65. Yan, C.G.; Zang, Y.F. DPARSF: A MATLAB toolbox for “pipeline” data analysis of resting-state fMRI. Front. Syst. Neurosci. 2010, 4, 13. [Google Scholar] [CrossRef] [PubMed]
  66. Friston, K.J. Statistical Parametric Mapping. In Neuroscience Databases: A Practical Guide; Toga, A.W., Mazziotta, J.C., Eds.; Springer: Boston, MA, USA, 2003; pp. 237–250. [Google Scholar]
  67. Woolrich, M.W.; Jbabdi, S.; Patenaude, B.; Chappell, M.; Makni, S.; Behrens, T.; Beckmann, C.; Jenkinson, M.; Smith, S.M. Bayesian analysis of neuroimaging data in FSL. NeuroImage 2009, 45, S173–S186. [Google Scholar] [CrossRef] [PubMed]
  68. Avants, B.B.; Tustison, N.; Song, G. Advanced normalization tools (ANTS). Insight J. 2009, 2, 1–35. [Google Scholar]
  69. Ashburner, J. A fast diffeomorphic image registration algorithm. NeuroImage 2007, 38, 95–113. [Google Scholar] [CrossRef] [PubMed]
  70. Liu, T.T.; Nalci, A.; Falahpour, M. The global signal in fMRI: Nuisance or information? NeuroImage 2017, 150, 213–229. [Google Scholar] [CrossRef]
  71. Bastos, A.M.; Schoffelen, J.M. A tutorial review of functional connectivity analysis methods and their interpretational pitfalls. Front. Syst. Neurosci. 2016, 9, 175. [Google Scholar] [CrossRef]
  72. Lin, C.; Huang, C.M.; Chang, W.; Chang, Y.X.; Liu, H.L.; Ng, S.H.; Lin, H.L.; Lee, T.M.; Lee, S.H.; Wu, S.C. Predicting suicidality in late-life depression by 3D convolutional neural network and cross-sample entropy analysis of resting-state fMRI. Brain Behav. 2024, 14, e3348. [Google Scholar] [CrossRef] [PubMed]
  73. Liu, Z.; Wong, N.M.; Shao, R.; Lee, S.H.; Huang, C.M.; Liu, H.L.; Lee, T.M. Classification of Major Depressive Disorder using Machine Learning on brain structure and functional connectivity. J. Affect. Disord. Rep. 2022, 10, 100428. [Google Scholar] [CrossRef]
  74. Kim, H.; Han, K.M.; Choi, K.W.; Tae, W.S.; Kang, W.; Kang, Y.; Ham, B.J. Volumetric alterations in subregions of the amygdala in adults with major depressive disorder. J. Affect. Disord. 2021, 295, 108–115. [Google Scholar] [CrossRef]
  75. Duan, J.; Li, Y.; Zhang, X.; Dong, S.; Zhao, P.; Liu, J.; Zheng, J.; Zhu, R.; Kong, Y.; Wang, F. Predicting treatment response in adolescents and young adults with major depressive episodes from fMRI using graph isomorphism network. NeuroImage Clin. 2023, 40, 103534. [Google Scholar] [CrossRef]
  76. Jones, E.G. The Thalamus; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  77. Uddin, L.Q.; Nomi, J.S.; Hébert-Seropian, B.; Ghaziri, J.; Boucher, O. Structure and function of the human insula. J. Clin. Neurophysiol. 2017, 34, 300–306. [Google Scholar] [CrossRef]
  78. Zanto, T.P.; Gazzaley, A. Fronto-parietal network: Flexible hub of cognitive control. Trends Cogn. Sci. 2013, 17, 602–603. [Google Scholar] [CrossRef]
  79. D’Andrea, C.B.; Laumann, T.O.; Newbold, D.J.; Nelson, S.M.; Nielsen, A.N.; Chauvin, R.; Gordon, E.M. Substructure of the brain’s Cingulo-Opercular network. bioRxiv 2023. [Google Scholar] [CrossRef]
  80. Avberšek, L.K.; Repovš, G. Deep learning in neuroimaging data analysis: Applications, challenges, and solutions. Front. Neuroimaging 2022, 1, 981642. [Google Scholar] [CrossRef] [PubMed]
  81. Ono, S.; Goto, T. Introduction to supervised machine learning in clinical epidemiology. Ann. Clin. Epidemiol. 2022, 4, 63–71. [Google Scholar] [CrossRef]
  82. Hussain, H.; Tamizharasan, P.S.; Rahul, C.S. Design Possibilities and Challenges of DNN Models: A Review on the Perspective of End Devices. Artif. Intell. Rev. 2022, 55, 5109–5167. [Google Scholar] [CrossRef]
  83. Mohanasundaram, R.; Malhotra, A.S.; Arun, R.; Periasamy, P.S. Chapter 8 - Deep Learning and Semi-Supervised and Transfer Learning Algorithms for Medical Imaging. In Deep Learning and Parallel Computing Environment for Bioengineering Systems; Sangaiah, A.K., Ed.; Academic Press: Cambridge, MA, USA, 2019; pp. 139–151. [Google Scholar] [CrossRef]
  84. Mohammadi, H.; Karwowski, W. Graph Neural Networks in Brain Connectivity Studies: Methods, Challenges, and Future Directions. Brain Sci. 2024, 15, 17. [Google Scholar] [CrossRef]
  85. Asif, N.A.; Sarker, Y.; Chakrabortty, R.K.; Ryan, M.J.; Ahamed, M.H.; Saha, D.K.; Tasneem, Z. Graph Neural Network: A Comprehensive Review on Non-Euclidean Space. IEEE Access 2021, 9, 60588–60606. [Google Scholar] [CrossRef]
  86. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  87. Dai, E.; Jin, W.; Liu, H.; Wang, S. Towards robust graph neural networks for noisy graphs with sparse labels. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; ACM: New York, NY, USA, 2022; pp. 181–191. [Google Scholar]
  88. Gevins, A.; Smith, M.E.; McEvoy, L.K.; Leong, H.; Le, J. Electroencephalographic imaging of higher brain function. Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci. 1999, 354, 1125–1134. [Google Scholar] [CrossRef]
  89. Chaabene, S.; Boudaya, A.; Bouaziz, B.; Chaari, L. An overview of methods and techniques in multimodal data fusion with application to healthcare. Int. J. Data Sci. Anal. 2025, 20, 3093–3117. [Google Scholar] [CrossRef]
  90. Yamashita, R.; Nishio, M.; Do, R.K.; Togashi, K. Convolutional neural networks: An overview and application in radiology. Insights Into Imaging 2018, 9, 611–629. [Google Scholar] [CrossRef]
  91. Singh, S.P.; Wang, L.; Gupta, S.; Goli, H.; Padmanabhan, P.; Gulyás, B. 3D deep learning on medical images: A review. Sensors 2020, 20, 5097. [Google Scholar] [CrossRef] [PubMed]
  92. Güçlü, U.; Van Gerven, M.A. Modeling the dynamics of human brain activity with recurrent neural networks. Front. Comput. Neurosci. 2017, 11, 7. [Google Scholar] [CrossRef] [PubMed]
  93. Mienye, I.D.; Swart, T.G.; Obaido, G. Recurrent Neural Networks: A Comprehensive Review of Architectures, Variants, and Applications. Information 2024, 15, 517. [Google Scholar] [CrossRef]
  94. Trezza, A.; Visibelli, A.; Roncaglia, B.; Spiga, O.; Santucci, A. Unsupervised Learning in Precision Medicine: Unlocking Personalized Healthcare through AI. Appl. Sci. 2024, 14, 9305. [Google Scholar] [CrossRef]
  95. Ehrhardt, J.; Wilms, M. Autoencoders and variational autoencoders in medical image analysis. In Biomedical Image Synthesis and Simulation; Burgos, N., Svoboda, D., Eds.; The MICCAI Society book series; Academic Press: Cambridge, MA, USA, 2022; pp. 129–162. [Google Scholar] [CrossRef]
  96. Showrov, A.A.; Aziz, M.T.; Nabil, H.R.; Jim, J.R.; Kabir, M.M.; Mridha, M.F.; Shin, J. Generative Adversarial Networks (GANs) in Medical Imaging: Advancements, Applications and Challenges. IEEE Access 2024, 12, 35728–35753. [Google Scholar] [CrossRef]
  97. Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185. [Google Scholar] [CrossRef]
  98. Sadeghi, Z.; Alizadehsani, R.; Cifci, M.A.; Kausar, S.; Rehman, R.; Mahanta, P.; Pardalos, P.M. A review of Explainable Artificial Intelligence in healthcare. Comput. Electr. Eng. 2024, 118, 109370. [Google Scholar] [CrossRef]
  99. Alkhanbouli, R.; Matar Abdulla Almadhaani, H.; Alhosani, F.; Simsekler, M.C.E. The role of explainable artificial intelligence in disease prediction: A systematic literature review and future research directions. BMC Med. Inform. Decis. Mak. 2025, 25, 110. [Google Scholar] [CrossRef]
  100. Retzlaff, C.; Angerschmid, A.; Saranti, A.; Schneeberger, D.; Roettger, R.; Mueller, H.; Holzinger, A. Post-hoc vs ante-hoc explanations: XAI design guidelines for data scientists. Cogn. Syst. Res. 2024, 86, 101243. [Google Scholar] [CrossRef]
  101. Chen, Z.; Xiao, F.; Guo, F.; Yan, J. Interpretable machine learning for building energy management: A state-of-the-art review. Adv. Appl. Energy 2023, 9, 100123. [Google Scholar] [CrossRef]
  102. Mohamed, E.; Sirlantzis, K.; Howells, G. A review of visualisation-as-explanation techniques for convolutional neural networks and their evaluation. Displays 2022, 73, 102239. [Google Scholar] [CrossRef]
  103. Chaddad, A.; Peng, J.; Xu, J.; Bouridane, A. Survey of explainable AI techniques in healthcare. Sensors 2023, 23, 634. [Google Scholar] [CrossRef]
  104. Salih, A.M.; Raisi-Estabragh, Z.; Galazzo, I.B.; Radeva, P.; Petersen, S.E.; Lekadir, K.; Menegaz, G. A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME. Adv. Intell. Syst. 2024, 7, 2400304. [Google Scholar] [CrossRef]
  105. Antoniadi, A.M.; Du, Y.; Guendouz, Y.; Wei, L.; Mazo, C.; Becker, B.A.; Mooney, C. Current challenges and future opportunities for XAI in machine learning-based clinical decision support systems: A systematic review. Appl. Sci. 2021, 11, 5088. [Google Scholar] [CrossRef]
  106. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  107. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef]
  108. Longo, L.; Brcic, M.; Cabitza, F.; Choi, J.; Confalonieri, R.; Del Ser, J.; Stumpf, S. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Inf. Fusion 2024, 106, 102301. [Google Scholar] [CrossRef]
  109. Clusmann, J.; Kolbinger, F.R.; Muti, H.S.; Carrero, Z.I.; Eckardt, J.N.; Laleh, N.G.; Kather, J.N. The future landscape of large language models in medicine. Commun. Med. 2023, 3, 141. [Google Scholar] [CrossRef] [PubMed]
  110. Nazario-Johnson, L.; Zaki, H.A.; Tung, G.A. Use of large language models to predict neuroimaging. J. Am. Coll. Radiol. 2023, 20, 1004–1009. [Google Scholar] [CrossRef] [PubMed]
  111. Alowais, S.A.; Alghamdi, S.S.; Alsuhebany, N.; Alqahtani, T.; Alshaya, A.I.; Almohareb, S.N.; Aldairem, A.; Alrashed, M.; Bin Saleh, K.; Badreldin, H.A.; et al. Revolutionizing healthcare: The role of artificial intelligence in clinical practice. BMC Med. Educ. 2023, 23, 689. [Google Scholar] [CrossRef]
  112. Aravazhi, P.S.; Gunasekaran, P.; Benjamin, N.Z.; Thai, A.; Chandrasekar, K.K.; Kolanu, N.D.; Prajjwal, P.; Tekuru, Y.; Brito, L.V.; Inban, P. The integration of artificial intelligence into clinical medicine: Trends, challenges, and future directions. Disease-a-Month 2025, 71, 101882. [Google Scholar] [CrossRef] [PubMed]
  113. Yang, J.; Soltan, A.A.; Clifton, D.A. Machine learning generalizability across healthcare settings: Insights from multi-site COVID-19 screening. npj Digit. Med. 2022, 5, 69. [Google Scholar] [CrossRef]
  114. Rosenblatt, M.; Tejavibulya, L.; Jiang, R.; Noble, S.; Scheinost, D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat. Commun. 2024, 15, 1829. [Google Scholar] [CrossRef]
  115. Jin, C.; Jia, H.; Lanka, P.; Rangaprakash, D.; Li, L.; Liu, T.; Hu, X.; Deshpande, G. Dynamic brain connectivity is a better predictor of PTSD than static connectivity. Hum. Brain Mapp. 2017, 38, 4479–4496. [Google Scholar] [CrossRef]
  116. Kaufmann, T.; Alnæs, D.; Doan, N.T.; Brandt, C.L.; Andreassen, O.A.; Westlye, L.T.; Duff, E.P. Connectome fingerprinting identifies highly heritable patterns of brain connectivity. Sci. Rep. 2019, 9, 7442. [Google Scholar] [CrossRef]
  117. Eickhoff, S.B.; Yeo, B.T.T.; Genon, S. The impact of different brain parcellation schemes on connectome-based predictions of behavior and psychiatric traits. Hum. Brain Mapp. 2023, 44, 1590–1608. [Google Scholar] [CrossRef]
  118. Bzdok, D.; Eickhoff, S.B. Parcellations, parcellation-based connectome, and the challenges of individual variability. Trends Cogn. Sci. 2013, 17, 664–686. [Google Scholar] [CrossRef]
  119. Pezoulas, V.C.; Zaridis, D.I.; Mylona, E.; Androutsos, C.; Apostolidis, K.; Tachos, N.S.; Fotiadis, D.I. Synthetic data generation methods in healthcare: A review on open-source tools and methods. Comput. Struct. Biotechnol. J. 2024, 23, 2892–2910. [Google Scholar] [CrossRef]
  120. Yamashita, A.; Yahata, N.; Itahashi, T.; Lisi, G.; Yamada, T.; Ichikawa, N.; Takamura, M.; Yoshihara, Y.; Kunimatsu, A.; Okada, N.; et al. Harmonization of resting-state functional MRI data across multiple imaging sites via the separation of site differences into sampling bias and measurement bias. PLoS Biol. 2019, 17, e3000042. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Distribution of publication years.
Figure 2. Overview of the literature review workflow: A flow diagram summarising the search process, paper selection, and filtering steps.
Figure 3. A typical workflow of deep learning approaches for MDD detection using rs-fMRI.
Figure 4. Comparison of classification accuracy based on the brain atlas used for feature extraction. (a) Accuracy comparison between studies using a single atlas and those combining multiple atlases. (b) Reported accuracies across the most frequently used brain atlases.
Figure 5. Most discriminative brain regions for Major Depressive Disorder (MDD).
Figure 6. Top functional networks associated with MDD.
Figure 7. Most common deep learning techniques across studies. Counts reflect all model occurrences, including hybrids.
Figure 8. Accuracy distribution across all model categories. The curve represents the kernel density estimate and the trend of the accuracy distribution.
Figure 9. Comparison of model evaluation results.
Figure 10. Model performance on the REST-meta-MDD dataset. The models correspond to studies reported in the literature [25,26,27,28,29,30,32,34,35,38,39,40,41,42,43,44,45,46,47,48,49,50,51,53,55,56,57,58,59]. Different colors represent different models, with darker shades corresponding to higher accuracy.
Figure 11. Usage of explainability methods in studies.
Figure 12. Pipeline illustrating the main challenges in translating deep learning models for MDD detection into clinical practice. The framework highlights barriers at four levels: site-level (inter-site variability), population-level (inconsistent biomarker findings), subject-level (inter-subject variability), and clinical-level (communication of model outputs to clinicians).
Table 1. Search strategy for Google Scholar, Scopus and PubMed databases.
Google Scholar
((“major depressive disorder” OR mdd) AND (fmri OR (functional AND magnetic AND resonance) OR (functional AND mri)) AND ((resting AND state) OR rs-fmri) AND (“deep learning”) AND (control* OR healthy) AND (classif*))
Scopus
TITLE-ABS-KEY(((“major depressive disorder” OR mdd) AND (fmri OR (functional AND magnetic AND resonance) OR (functional AND mri)) AND ((resting AND state) OR rs-fmri) AND (“deep learning”) AND (control* OR healthy) AND (classif*)))
PubMed
(“major depressive disorder”[Text Word] OR mdd[Text Word]) AND (fmri[Text Word] OR (functional[Text Word] AND magnetic[Text Word] AND resonance[Text Word]) OR (functional[Text Word] AND mri[Text Word])) AND ((resting[Text Word] AND state[Text Word]) OR rs-fmri[Text Word]) AND (“deep learning”[Text Word]) AND (control*[Text Word] OR healthy[Text Word]) AND (classif*[Text Word])
Table 2. Summary of studies investigating depression using various datasets and methods. Sample sizes indicate the number of patients with Major Depressive Disorder (MDD) and Healthy Controls (HC). Metrics reported include Accuracy (ACC), Area Under the ROC Curve (AUC), Sensitivity (SEN), Specificity (SPE), Precision (PRE), Recall (REC), and F1-score (F1).
Authors | Year | Sample Size | Data Source | Feature Extraction | Method | Metrics
Venkatapathy [25] | 2023 | 821 MDD, 765 HC | REST-meta-MDD | Dosenbach, FC using Pearson correlation | Ensemble GNN (GCN, GAT, GraphSAGE) | ACC: 71.18%, AUC: 76.53%, SEN: 68.23%, SPE: 74.96%
Qin [26] | 2022 | 821 MDD, 765 HC | REST-meta-MDD | Dosenbach, FC | GCN | ACC: 81.5%, AUC: 86.5%
Zhao [27] | 2020 | 269 MDD, 286 HC from 4 sites | Henan Mental Hospital, West China Hospital, Anding Hospital, First Affiliated Hospital of Zhejiang | FNC using ICA | GAN | ACC: 70.1%, AUC: 70.3%, SEN: 73.5%, SPE: 66.5%, F1: 71.7%
Tan [28] | 2024 | 250 MDD, 227 HC | REST-meta-MDD | AAL, FC | 1D-DCGAN | ACC: 68.3%, AUC: 68.0%, F1: 67.9%
Noman [29] | 2022 | 250 MDD, 227 HC | REST-meta-MDD | AAL and HO, FC using LDW | GAE, GCN (supervised and unsupervised) | Supervised: ACC: 59.47%, SEN: 54.87%, F1: 58.57%; unsupervised: ACC: 65.07%, SEN: 69.7%, F1: 67.29%
Zheng [30] | 2023 | 1179 MDD, 1008 HC (52 MDD, 80 HC from Gansu Provincial Hospital as an independent test set) | REST-meta-MDD | AAL, FC | FSCF: BFE (rs-fMRI), BSE (sMRI) | ACC: 75.2%, SEN: 69.0%, SPE: 80.5%, AUC: 80.8%
Hu [31] | 2024 | 89 MDD, 89 HC | Shenzhen Kangning Hospital | AAL, dFC | MT-STN | ACC: 68.56%, SEN: 67.4%, SPE: 69.7%, AUC: 70.6%
Gallo [32] | 2023 | 531 MDD, 508 HC; 1255 MDD, 1083 HC | PsyMRI, REST-meta-MDD | HO, FC | GCN, SVM | ACC: 61.0% (range: 57–63%)
Kong [33] | 2021 | 82 MDD, 50 HC; 98 MDD, 47 HC | ZhongDa Hospital of Southeast University, Hospital of Xinxiang Medical University | dFC | STGCN | ZhongDa: ACC: 84.1%, SEN: 89.4%, SPE: 68.3%; Xinxiang: ACC: 83.9%, SEN: 92.9%, SPE: 67.9%
Kang [34] | 2023 | 830 MDD, 771 HC | REST-meta-MDD | CC200, FC | Unified deep learning framework | AUC: 75.6%, ACC: 70.2%, SEN: 69.7%, SPE: 70.7%
Wang [35] | 2022 | 282 MDD, 251 HC | REST-meta-MDD | HO, FC, feature fusion (rs-fMRI + sMRI) | GCN + CNN | ACC: 65.0%, SEN: 69.4%, SPE: 60.9%, AUC: 66.5%
Lin [36] | 2023 | 49 LLD, 28 HC | Chang Gung Medical Foundation | AAL, 3D CSE volumes | CNN | ACC: 85% for 4 ROIs, 80% for 20 ROIs
Wang [37] | 2023 | 54 MDD, 62 HC | Gansu Provincial Hospital | AAL, MLFE, MHFE, FC | Feature cross-fusion (CNN Transformer) | ACC: 72.4%, PRE: 75.0%, SEN: 88.2%, F1: 60.0%, AUC: 66.7%
Zheng [38] | 2024 | 828 MDD, 776 HC | REST-meta-MDD | AAL, FC | BPI-GNN | ACC: 73%, F1: 72.0%
Zheng [39] | 2024 | 828 MDD, 776 HC | REST-meta-MDD | AAL, FC | CI-GNN | ACC: 72.0%, F1: 70.0%
Dai [40] | 2023 | 189 MDD, 426 HC | REST-meta-MDD | dFC | TGCN | ACC: 75.8%, SPE: 66.0%, SEN: 85.3%
Dai [41] | 2024 | rMDD616 dataset: 189 rMDD, 427 HC; all-MDD1611 dataset: 832 MDD, 779 HC | REST-meta-MDD | CC200, AAL, FC | Res-DAE | rMDD616: ACC: 75.1%, SEN: 69%, SPE: 77.8%; all-MDD1611: ACC: 70%
Zhang [42] | 2024 | 1179 MDD, 1008 HC | REST-meta-MDD | AAL, FC | DDN-Net | ACC: 72.4%, PRE: 71.6%, SEN: 70.1%, F1: 67.8%
Xia [43] | 2023 | 282 MDD, 251 HC | REST-meta-MDD | CC200 and AAL, FC | DepressionGraph | ACC: 69.4%, F1: 75.1%, PRE: 65.9%, SEN: 87.2%, SPE: 49.4%, AUC: 68.3%
Liu [44] | 2024 | 282 MDD, 251 HC | REST-meta-MDD | AAL, LOFC, HOFC, demographic info | MFGCN | ACC: 77.6%, SEN: 81.9%, PRE: 79.7%, F1: 79.3%
Oh [45] | 2023 | 249 MDD, 228 HC | REST-meta-MDD | HO, FC using Pearson correlation | GC-GAN | ACC: 66.84%, SEN: 70.24%, SPE: 63.14%, F1: 68.72%
Liu [46] | 2024 | 46 melancholic, 42 atypical, 34 anxious | Shanghai Mental Health Center | AAL, FC using Pearson/Spearman/partial correlation | LSTM and graph fusion | ACC: 64.2% (MDD vs. HC); multi-class: 65.8%
Long [47] | 2023 | 597 MDD, 563 HC | REST-meta-MDD | AAL; tHOFC, aHOFC, LOFC via Pearson's correlation | DNN | tHOFC: 60.12%, aHOFC: 60.53%, LOFC: 62.49%; combined networks ACC: 61.93%
Pan [48] | 2022 | 282 MDD, 251 HC | Southwest University dataset | AAL and HO, FC using Pearson's correlation | MAMF-GCN | ACC: 99.2%, SEN: 99.2%, SPE: 99.8%, AUC: 99.2%, F1: 99.2%
Liang [49] | 2022 | 282 MDD, 251 HC | REST-meta-MDD | CC200, AAL, FC | Multi-level FC fusion classification (MFC) with DNN | ACC: 64.1%, SEN: 1.9%, F1: 60.7%
Kong [50] | 2022 | 129 MDD, 89 HC | ZhongDa Hospital, Southeast University | FC between GM and WM using Pearson correlation | Multi-Stage Graph Fusion Networks (MSGFN) | ACC: 70.91%, SEN: 73.85%, SPE: 66.60%
Gupta [51] | 2021 | 289 MDD, 168 HC | Brain Imaging Center at Southwest University | Power atlas, FC | DNN (LEAN + CLIP) | ACC: 78.3%
Zhang [52] | 2024 | 51 MDD, 21 HC | OpenNeuro | AAL, FC | STANet | ACC: 82.38%
Zhu [53] | 2023 | 830 MDD, 771 HC | REST-meta-MDD | Dosenbach, FC | DGCNN | ACC: 72.1%
Pitsik [54] | 2023 | 35 MDD, 49 HC | Medical University of Plovdiv | AAL3, FC | GNN with GCN blocks | ACC: 93.0%; F1 maximal at 2.5% graph sparsity
Wang [55] | 2023 | Site 20: 282 MDD, 251 HC; Site 21: 86 MDD, 70 HC; Site 1: 74 MDD, 74 HC | REST-meta-MDD | BOLD signal augmentation, AAL, FC | UCGL: pretext model + fine-tuning | Site 20 → Site 21: ACC: 63%, AUC: 65%, REC: 63%, PRE: 68%, F1: 65%; Site 20 → Site 1: ACC: 62%, AUC: 66%, REC: 60%, PRE: 62%, F1: 61%
Fang [56] | 2023 | 282 MDD, 251 HC | REST-meta-MDD | AAL, FC | DFH | Site 21: ACC: 56.77%, AUC: 57.23%, SEN: 66.12%, SPE: 45.43%; Site 1: ACC: 57.16%, AUC: 57.30%, SEN: 64.05%, SPE: 50.27%
Fang [57] | 2023 | Site 20: 282 MDD, 251 HC; Site 1: 74 MDD, 74 HC | REST-meta-MDD | Spatio-temporal graph, AST-GCM, MMD alignment | UFA-Net | ACC: 59.73%, AUC: 62.50%, SEN: 69.46%, SPE: 50.00%, PRE: 58.49%
Liu [58] | 2024 | REST-meta-MDD: 814 MDD, 756 HC; SRPBS: 229 MDD, 228 HC; Anding: 196 MDD, 177 HC; OpenNeuro: 21 MDD, 21 HC | REST-meta-MDD, SRPBS, Anding, OpenNeuro | ROI- and subject-level graphs (rs-fMRI, sMRI, demographics) | LGMF-GNN | 10-fold CV: AUC: 80.6%, ACC: 78.75%; LOSO: AUC: 73.7%; external: AUC: 72.9% (Anding), 70.3% (OpenNeuro); cross-site ACC: 69.97% (Anding), 69.05% (OpenNeuro)
Tian [59] | 2024 | 368 MDD, 299 HC | REST-meta-MDD | dtHOFC, daHOFC | CNN + self-attention | ACC: 78.3%, F1: 80.4%, SEN: 79.3%
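The metrics summarised in Table 2 all derive from a binary confusion matrix, with MDD as the positive class. A minimal sketch of how they relate (the counts are illustrative only, not taken from any reviewed study):

```python
# Standard classification metrics from a binary confusion matrix,
# treating MDD as the positive class and HC as the negative class.
# The counts below are illustrative, not from any study in Table 2.

def classification_metrics(tp, fp, tn, fn):
    """Return the metrics of Table 2 as fractions in [0, 1]."""
    acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy (ACC)
    sen = tp / (tp + fn)                   # sensitivity (SEN) = recall for MDD
    spe = tn / (tn + fp)                   # specificity (SPE) = recall for HC
    pre = tp / (tp + fp)                   # precision (PRE)
    f1 = 2 * pre * sen / (pre + sen)       # F1: harmonic mean of PRE and SEN
    return {"ACC": acc, "SEN": sen, "SPE": spe, "PRE": pre, "F1": f1}

# Example: 70 of 100 MDD subjects and 80 of 100 HC subjects correctly labelled.
m = classification_metrics(tp=70, fp=20, tn=80, fn=30)
print({k: round(v, 3) for k, v in m.items()})
```

Note that AUC, also reported in Table 2, is computed from the ranking of continuous model scores across thresholds rather than from a single confusion matrix.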
Table 3. Summary of datasets used in MDD classification studies.
Dataset Name | Number of Participants | Description | References
REST-meta-MDD (DIRECT Consortium) | 2428 (1300 MDD, 1128 NC) | Large multi-site dataset from 17 hospitals in China | [25,26,27,28,29,30,32,33,34,35,38,39,40,41,42,43,44,45,46,47,48,49,50,51,53,55,56,57,58,59]
PsyMRI Consortium | 1039 (531 patients, 508 controls) | Data from 23 cohorts worldwide | [32]
SRPBS Dataset (Japan) | 229 MDD, 228 HC | Multi-site dataset collected across 8 sites in Japan | [58]
Anding Hospital Dataset | 196 MDD, 177 HC | Used for external testing | [27,58]
OpenNeuro Dataset (Russia) | 51 MDD, 21 HC | Public dataset used for training and external testing | [52,58]
Affiliated ZhongDa Hospital of Southeast University & Second Affiliated Hospital of Xinxiang Medical University | 218 (129 MDD, 89 HC) | Two-site dataset with matched controls | [33]
Southwest University Dataset | 282 MDD, 251 HC | Subset of REST-meta-MDD (Site 20) | [55,57]
Shanghai Mental Health Center Dataset | 122 (46 melancholic MDD, 42 atypical MDD, 34 anxious MDD) | Focused on MDD subtypes | [46]
Gansu Provincial Hospital Dataset | 52 MDD, 80 HC | Used as a main dataset and as an independent test set for validation | [30,37]
Shenzhen Kangning Hospital, China | 178 (89 MDD, 89 HC) | Balanced MDD–HC dataset | [31]
Henan Mental Hospital, West China Hospital, Anding Hospital, First Affiliated Hospital of Zhejiang | 555 (269 MDD, 286 HC from 4 sites) | Multi-site MDD dataset | [27]
Medical University of Plovdiv Dataset | 84 (35 MDD, 49 HC) | Study on topological properties of brain networks in MDD | [54]
Chang Gung Medical Foundation | 77 older adults (49 LLD, 28 HC) | Late-life depression (LLD) study | [36]
Table 4. Summary of brain atlases used in neuroimaging studies.
Atlas Name | Number of Regions (ROIs) | Type (Anatomical/Functional) | References
Automated Anatomical Labeling (AAL) | 116 (AAL-90: 90; AAL3: 166) | Anatomical | [28,29,30,31,36,37,38,39,40,41,42,43,44,46,47,48,49,54,55,56,57,58,59]
Harvard–Oxford (HO) Atlas | 112 | Anatomical | [29,32,35,43,45,48]
Dosenbach's Atlas | 160 | Functional | [25,26,40,53,57]
Power Atlas | 264 | Functional | [51]
Craddock (CC200) Atlas | 200 | Functional | [34,40,41,49,58]
Brodmann Atlas | 82 (GM regions) | Anatomical | [50]
JHU ICBM-DTI-81 Atlas | 48 (WM bundle regions) | Anatomical | [50]
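Across Tables 2 and 4, the dominant feature-extraction pipeline reduces rs-fMRI to a functional-connectivity (FC) matrix: atlas-defined ROI time series are pairwise Pearson-correlated and the unique region pairs are vectorised as model input. A minimal sketch, with random data standing in for real BOLD signals and the AAL region count used only as an example:

```python
# Sketch of FC feature extraction: atlas-parcellated ROI time series
# -> Pearson correlation matrix -> flattened upper-triangle feature vector.
# Random data substitutes for preprocessed BOLD signals.
import numpy as np

rng = np.random.default_rng(0)
n_rois, n_timepoints = 116, 200                    # e.g. AAL: 116 regions
ts = rng.standard_normal((n_rois, n_timepoints))   # one row per ROI

fc = np.corrcoef(ts)                               # (116, 116) symmetric FC matrix

# Models typically use only the upper triangle (unique region pairs),
# since the matrix is symmetric with a unit diagonal.
iu = np.triu_indices(n_rois, k=1)
features = fc[iu]                                  # 116 * 115 / 2 = 6670 values
print(fc.shape, features.shape)
```

The same vector is what graph-based models reinterpret as edge weights on a brain network, rather than as a flat feature list.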
Table 5. Summary of top anatomical biomarkers.
Region | Occurrence | References
Cerebellum | 7 | [26,27,31,34,42,47,58]
Insula | 7 | [26,32,36,40,41,46,57]
Thalamus | 6 | [32,41,42,45,51,56]
Amygdala | 6 | [29,32,39,40,42,45]
Lingual Gyrus | 5 | [32,41,44,56,57]
Temporal Gyrus | 5 | [34,41,42,51,56]
Precentral Gyrus | 5 | [34,40,41,44,58]
Hippocampus | 4 | [40,42,46,58]
Caudate | 4 | [33,41,42,56]
Fusiform Gyrus | 4 | [26,41,42,44]
Superior Frontal Gyrus | 4 | [36,41,42,44]
Putamen | 3 | [33,45,56]
Calcarine Cortex | 3 | [32,42,57]
Precuneus | 3 | [26,42,51]
Postcentral Gyrus | 3 | [33,42,45]
Supramarginal Gyrus | 3 | [32,41,45]
Inferior Frontal Gyrus | 3 | [33,40,42]
Inferior Parietal Lobule (IPL) | 2 | [26,42]
Temporal Pole | 2 | [41,42]
Rolandic Operculum | 2 | [42,58]
Calcarine region | 2 | [31,56]
Cuneus | 2 | [42,44]
Middle Frontal Gyrus | 2 | [34,39]
Superior Parietal Gyrus | 2 | [40,42]
Table 6. Summary of top network biomarkers.
Network | Occurrence | References
Default Mode Network (DMN) | 5 | [26,27,29,36,47]
Frontoparietal Network (FPN) | 2 | [26,27]
Cingulo-Opercular Network (CON) | 2 | [26,72]
Sensorimotor Network (SMN) | 1 | [27]
Cognitive Control Network (CC) | 1 | [27]
Table 7. Summary of deep learning models for MDD detection.
Model Type | Strengths | Limitations | References | Accuracy Range
DNN | Learns high-dimensional patterns automatically; computationally more efficient than more complex models | Fails to model the spatial/temporal structure of the brain | [42,47,49,51,53] | 61.93–78.3%
CNN (2D/3D) | Strong at local feature extraction from rs-fMRI slices or volumes; well suited to spatial patterns | Cannot model temporal dynamics; 3D CNNs are computationally intensive | [30,35,36,52,55,59] | 65.0–85.0%
RNN/Transformers | Models sequential and temporal patterns; suitable for dynamic FC | Limited performance alone; often used in hybrid models | [30,31,37,46,52] | 65.8–82.38%
GNN | Captures topological brain-network structure; supports node-, edge-, and graph-level prediction; strong with FC data | Sensitive to graph topology; risk of over-smoothing; high implementation complexity | [25,26,32,33,34,35,37,38,39,40,43,44,45,46,48,50,53,54,56,57,58] | 56.77–99.2%
GAN | Generates synthetic functional connectivity; augments small datasets; helps with generalisation | Needs large training data; may produce unrealistic samples; graph structure may be lost | [27,28,45] | 66.84–70.1%
Autoencoder | Reduces dimensionality; denoising and connectivity preservation; good for feature learning | Loses spatial/temporal context; limited interpretability of flattened FC maps | [39,41,48] | 65.07–75.1%
Domain Adaptation | Enhances generalisation across sites using unlabelled data or transfer learning | Still limited by site heterogeneity; often poor performance | [55,56,57] | 56.77–63.0%
Table 8. Validation methods used across studies.
Validation Category | Validation Method | References
Cross-validation | 5-fold cross-validation | [28,29,31,34,41,42,43,45,46,51,53]
Cross-validation | 10-fold cross-validation | [25,33,35,40,44,48,52,54]
Cross-validation | 4-fold cross-validation | [37]
LOSO | Leave-one-site-out | [26,27,47,49]
Cross-site | Cross-site/independent test set | [30,32,55,56,57,58]
Data Split | Random train/validation/test partitioning | [36,38,39,50,59]
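Of the methods in Table 8, leave-one-site-out (LOSO) validation is the stricter test of cross-site generalisability: each fold holds out every subject from one acquisition site, so the model is always evaluated on a site it never saw during training. A minimal sketch with hypothetical site labels:

```python
# Sketch of leave-one-site-out (LOSO) splitting. Each fold withholds all
# subjects from one acquisition site; site labels here are placeholders.

def leave_one_site_out(site_labels):
    """Yield (held_out_site, train_indices, test_indices), one fold per site."""
    for held_out in sorted(set(site_labels)):
        train = [i for i, s in enumerate(site_labels) if s != held_out]
        test = [i for i, s in enumerate(site_labels) if s == held_out]
        yield held_out, train, test

# Five subjects drawn from three hypothetical sites.
sites = ["S1", "S1", "S20", "S20", "S21"]
for held_out, train, test in leave_one_site_out(sites):
    print(held_out, train, test)
```

Equivalent functionality is provided by scikit-learn's `LeaveOneGroupOut` splitter when site IDs are passed as the `groups` argument.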
Table 9. Summary of post hoc and ante hoc explainability methods in MDD studies.
Table 9. Summary of post hoc and ante hoc explainability methods in MDD studies.
MethodTechniqueReferences
Post HocGrad-CAM [26,40]
Feature attribution (gradients or weights) [43,44,50,51]
ROI/FC ranking via accuracy [33,36,41,45,55]
Region/connection ablation [27,30,32,42]
Statistical testing [29,47]
Layer-wise relevance propagation (LRP) [31]
Ante HocAttention-based ROI/time weighting [56,57,58]
Causal subgraph discovery [30]
Prototype learning [38]
Counter-condition analysis [34]
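The region/connection ablation approach in Table 9 scores each ROI by how much removing its input changes a trained model's output; regions whose removal shifts the prediction most are treated as the most discriminative. A toy sketch of the idea, with a stand-in linear scorer in place of a real trained network (all weights and features are illustrative):

```python
# Sketch of post hoc region ablation: zero out one ROI's features at a
# time and rank ROIs by the resulting change in model output.
# The "model" is a toy linear scorer, purely illustrative.
import numpy as np

weights = np.array([0.0, 2.0, 0.1, -1.5])  # toy trained weights for 4 "ROIs"
x = np.array([1.0, 1.0, 1.0, 1.0])         # one subject's ROI features

def score(features):
    return float(weights @ features)        # stand-in for model confidence

baseline = score(x)
importance = []
for roi in range(len(x)):
    ablated = x.copy()
    ablated[roi] = 0.0                      # "remove" the region
    importance.append(abs(baseline - score(ablated)))

ranking = np.argsort(importance)[::-1]      # most influential ROI first
print(ranking)
```

For a linear scorer the importance of ROI i is simply |w_i * x_i|, so ROI 1 ranks first here; with a deep network the same loop captures non-linear interactions that weights alone would miss.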
Share and Cite


Saeedi, M.; Wei, L.; Edoho, M.; Mooney, C. Potential Clinical Applicability of Deep Learning in the Diagnosis of Major Depressive Disorder Using rs-fMRI: A Systematic Literature Review. Appl. Sci. 2026, 16, 3444. https://doi.org/10.3390/app16073444
