DepressionGraph: A Two-Channel Graph Neural Network for the Diagnosis of Major Depressive Disorders Using rs-fMRI

: Major depressive disorder (MDD) is a prevalent psychiatric condition with a complex and unknown pathological mechanism. Resting-state functional magnetic resonance imaging (rs-fMRI) has emerged as a valuable non-invasive technology for MDD diagnosis. By utilizing rs-fMRI data, a dynamic brain functional connection network (FCN) can be constructed to represent the complex interacting relationships of multiple brain sub-regions. Graph neural network (GNN) models have been widely employed to extract disease-associated information. The simple averaging or summation graph readout functions of GNNs may lead to a loss of critical information. This study introduces a two-channel graph neural network (DepressionGraph) that effectively aggregates more comprehensive graph information from the two channels based on the node feature number and node number. Our proposed DepressionGraph model leverages the transformer–encoder architecture to extract the relevant information from the time-series FCN. The rs-fMRI data were obtained from a cohort of 533 subjects, and the experimental data show that DepressionGraph outperforms both traditional GNNs and simple graph readout functions for the MDD diagnosis task. The introduced DepressionGraph framework demonstrates efficacy in extracting complex patterns from rs-fMRI data and exhibits promising capabilities for the precise diagnosis of complex neurological disorders. The current study acknowledges a potential gender bias due to an imbalanced gender distribution in the dataset. Future research should prioritize the development and utilization of gender-balanced datasets to mitigate this limitation and enhance the generalizability of the findings.


Introduction
Major depressive disorder (MDD) represents one of the most prevalent mental diseases worldwide, and it stands as a leading cause of disabilities like depressed mood, marked loss of interest in activities, and difficulty in concentrating [1,2].The heterogeneity among the MDD patients leads to significant variations in symptoms and poses a considerable challenge in the accurate diagnosis [3].Technology innovations may facilitate the early diagnosis and timely treatment of MDD patients.
Research on automated, computer-aided medical decision-making systems has made significant progress with the advancement of medical imaging technologies.Bayareh-Mancilla et al. integrated the morphological shape and skin thickness in the mammogrambased asymmetry screening and demonstrated that such explainable features could significantly improve the mammogram screening performance [4].Explainable medical image features like segmentation-based skin thickness were further supported by the efficacy of early cancer detection [5].
study has a limitation regarding potential gender bias, as the dataset used for analysis contains a larger number of female samples than male samples.

Dataset
The public rs-fMRI dataset utilized in this study was generously provided by the REST-meta-MDD consortium and comprises data from 17 Chinese hospitals/sites [26].This study specifically selects the site 20 with the largest number of subjects, similar to [27].The data were downloaded from http://rfmri.org/REST-meta-MDDaccessed on 27 March 2021.The dataset consists of 533 subjects, among whom 282 individuals were diagnosed with MDD-the other 251 subjects served as the matched healthy controls (HCs).
The diversity in demographic characteristics of the participants is important in influencing the manifestation of MDD symptoms and will impact the generalizability of the MDD prediction models.The demographic characteristics of the dataset are summarized in Table 1.The proportion of female participants in the MDD and HC classes is similar.The mean and standard deviation values of the participants' ages are also similar in the two classes.The HC participants have a slightly longer education length (12.97 years) than the MDD patients (10.78 years).Due to that, this whole cohort was recruited from Chinese hospitals/sites.The majority of the participants are Chinese.The rs-fMRI scans were acquired using the Siemens Tim Trio 3T MRI scanner at Southwestern University.The scanning parameters used for data acquisition were as follows: a repetition time (TR) of 2000 ms, an echo time (TE) of 30 ms, and a flip angle of 90 • .The image slice thickness/gap was set to 3.0 mm/1.0 mm, resulting in a total of 32 slices.Each rs-fMRI scan consisted of 242-time points, and the voxel size was 3.44 × 3.44 × 4.00 mm 3 , with a field of view (FOV) measuring 220 × 220 mm 2 .
We applied the preprocessing step using the Data Processing Assistant for rs-fMRI (DPARSF) [28,29].During this step, the rs-fMRI BOLD signals were extracted for each subject.The preprocessing operations included (1) discarding the first ten volumes to achieve magnetization equilibrium, (2) slice timing correction and head motion correction, (3) normalization to the Montreal Neurological Institute (MNI) template and resampling to isotropic 3 mm voxel size, (4) smoothing, (5) detrending and band-pass filtering, and (6) regressing covariates.Ultimately, each subject's data was transformed as a matrix of size T × N, where T denotes the number of time points and N is the number of brain regions of interest (ROIs).
Compared to other fMRI data preprocessing software, such as SPM version 12 (https://www.fil.ion.ucl.ac.uk/spm/software/download/spmreg.php),dparsf is more user-friendly and efficient, significantly improving the efficiency of image analysis.Its streamlined workflow minimizes the risk of operational and subjective errors in the rs-fMRI image analysis.

Overall Framework
The proposed DepressionGraph framework aims to identify MDD by effectively capturing the time-varying information of brain FCNs based on rs-fMRI time-series data, as depicted in Figure 1.Each subject is denoted as X ∈ R N×T , where there are N brain regions of interest (ROIs) and T length of the sequence of the BOLD signals over time.The ROIs are segmented according to the brain region segmentation template.We adopt a sliding window approach by dividing the time series data of each ROI into T dyn segments of length L. As a result, each subject is represented as

Overall Framework
The proposed DepressionGraph framework aims to identify MDD by effectively capturing the time-varying information of brain FCNs based on rs-fMRI time-series data, as depicted in Figure 1.Each subject is denoted as  ∈ ℝ × , where there are N brain regions of interest (ROIs) and T length of the sequence of the BOLD signals over time.The ROIs are segmented according to the brain region segmentation template.We adopt a sliding window approach by dividing the time series data of each ROI into Tdyn segments of length L. As a result, each subject is represented as   = { 1 , ⋯    } ∈ ℝ   ×× .For   ∈ ℝ × at the tth time point, we construct an FCN with the ROIs as the nodes and the BOLD signal value of the tth time point as the node feature of the tth FCN.An edge is defined between two nodes if the correlation coefficient between their node features satisfies a predefined threshold.Let   = (  ,   ) be the FCN of the tth time point.The feature matrix is denoted by   = {x  1 , ⋯ x   }, and the adjacency matrix is denoted by   ∈ {0, 1} × .
Our proposed DepressionGraph framework uses a gate recurrent unit (GRU) network to encode the node features and the two-channel graph neural network to obtain the feature representation of the entire graph at the tth time point G_fea  .Subsequently, we introduce a transformer-based network to capture temporal information from the FCN graphs, denoted as G_fea dyn .Finally, we employ a multi-layer perceptron (MLP) to predict the labels, expressed as   = (G_fea dyn ).
The next two sections give detailed descriptions of the proposed DepressionGraph model and the transformer-based timing information extraction module.

DepressionGraph: Two-Channel Graph Neural Network
The DepressionGraph framework aims to generate a one-dimensional feature vector that comprehensively captures the global information of an input graph (Figure 2).It leverages two distinct channels for graph representation learning: one for the fine-grained node features and the other for the coarse-grained features of the entire graph.Given the graph, G = (X, A) with N nodes and M features for each node.The global features of the input graph obtained through the DepressionGraph can be represented as: Here, _ channel1 ∈ ℝ  represents the graph features extracted through Channel 1, where the dimension of node features is transformed into a one-dimensional vector of length N. Similarly, _ channel2 ∈ ℝ M represents the graph features extracted through For X t ∈ R N×L at the tth time point, we construct an FCN with the ROIs as the nodes and the BOLD signal value of the tth time point as the node feature of the tth FCN.An edge is defined between two nodes if the correlation coefficient between their node features satisfies a predefined threshold.Let G t = (X t , A t ) be the FCN of the tth time point.The feature matrix is denoted by

and the adjacency matrix is denoted by
Our proposed DepressionGraph framework uses a gate recurrent unit (GRU) network to encode the node features and the two-channel graph neural network to obtain the feature representation of the entire graph at the tth time point G_fea t .Subsequently, we introduce a transformer-based network to capture temporal information from the FCN graphs, denoted as G_fea dyn .Finally, we employ a multi-layer perceptron (MLP) to predict the labels, expressed as Label pred = MLP G_fea dyn .
The next two sections give detailed descriptions of the proposed DepressionGraph model and the transformer-based timing information extraction module.

DepressionGraph: Two-Channel Graph Neural Network
The DepressionGraph framework aims to generate a one-dimensional feature vector that comprehensively captures the global information of an input graph (Figure 2).It leverages two distinct channels for graph representation learning: one for the fine-grained node features and the other for the coarse-grained features of the entire graph.Given the graph, G = (X, A) with N nodes and M features for each node.The global features of the input graph obtained through the DepressionGraph can be represented as: Here, G_ f ea channel1 ∈ R N represents the graph features extracted through Channel 1, where the dimension of node features is transformed into a one-dimensional vector of length N. Similarly, G_ f ea channel2 ∈ R M represents the graph features extracted through Channel 2, where the dimension of node count is transformed into a one-dimensional vector of length M. The final feature representation of the entire graph is obtained by concatenating the two vectors acquired from the two channels.
Channel 2, where the dimension of node count is transformed into a one-dimensional vector of length M. The final feature representation of the entire graph is obtained by concatenating the two vectors acquired from the two channels.

Channel 1: Graph Feature Extraction by Changing the Number of Node Features
This channel modifies the number of node features to enable the extraction of essential graph information.Firstly, we employ three aggregation methods to combine information from neighboring nodes: (1) summing the features of neighboring nodes to obtain  Sum_agg , (2) taking the maximum feature value of neighboring nodes to obtain  Max_agg , and taking the minimum feature value of neighboring nodes to obtain  Min_agg .After aggregation, the node feature matrix becomes  agg ∈ ℝ N×3M , represented as: Next, we employ a multi-layer perceptron (MLP) to reduce the dimensionality of node features.Through one layer of MLP, the node features of length 3 × M are transformed into features of length M. Another layer of MLP further reduces the length of each node's features to one.These one-dimensional node features are concatenated to form a one-dimensional vector of length N, which serves as the representation of the graph for Channel 1.This representation captures the fine-grained features of each node._ ℎ1 = MLP 2 (MLP 1 (  )). (3)

Channel 1: Graph Feature Extraction by Changing the Number of Node Features
This channel modifies the number of node features to enable the extraction of essential graph information.Firstly, we employ three aggregation methods to combine information from neighboring nodes: (1) summing the features of neighboring nodes to obtain X Sum_agg , (2) taking the maximum feature value of neighboring nodes to obtain X Max_agg , and taking the minimum feature value of neighboring nodes to obtain X Min_agg .After aggregation, the node feature matrix becomes X agg ∈ R N×3M , represented as: Next, we employ a multi-layer perceptron (MLP) to reduce the dimensionality of node features.Through one layer of MLP, the node features of length 3 × M are transformed into features of length M. Another layer of MLP further reduces the length of each node's features to one.These one-dimensional node features are concatenated to form a onedimensional vector of length N, which serves as the representation of the graph for Channel 1.This representation captures the fine-grained features of each node.

Channel 2: Graph Feature Extraction by Changing the Number of Nodes
Channel 2 aims to capture critical graph information by altering the number of nodes.The top-k pooling algorithm [30] is employed to successively reduce the N nodes to N/2 nodes and then further reduce the N/2 nodes to a single node.Therefore, all node features in the graph are aggregated into this single node while preserving the original number of features (M).The resulting one-dimensional vector of length M (denoted as G_ f ea channel2 ) serves as the representation of the graph for Channel 2, capturing the global information of the graph.

ViT-Based Timing Information Extraction
Figure 3 shows that the brain FCN at each time period can be viewed as a graph structure.The proposed DepressionGraph framework represents a graph as a one-dimensional vector G_ f ea t ∈ R C , where C = M + N. Therefore, each subject is represented as a timevarying graph representation H ∈ R T×C .A simple weighted averaging of the graph representation at different time points would not adequately capture the temporal dynamics.To address this issue, we draw inspiration from the vision transformer (ViT) network [31] and introduce location encoding via the encoder part of the transformer network to effectively capture information across time.

Channel 2: Graph Feature Extraction by Changing the Number of Nodes
Channel 2 aims to capture critical graph information by altering the number of nodes The top-k pooling algorithm [30] is employed to successively reduce the N nodes to N/ nodes and then further reduce the N/2 nodes to a single node.Therefore, all node feature in the graph are aggregated into this single node while preserving the original number o features (M).The resulting one-dimensional vector of length M (denoted as _ channel2 serves as the representation of the graph for Channel 2, capturing the global informatio of the graph.

ViT-Based Timing Information Extraction
Figure 3 shows that the brain FCN at each time period can be viewed as a graph structure.The proposed DepressionGraph framework represents a graph as a one-dimen sional vector _  ∈ ℝ  , where C = M + N. Therefore, each subject is represented as time-varying graph representation H ∈ ℝ T×C .A simple weighted averaging of the grap representation at different time points would not adequately capture the temporal dy namics.To address this issue, we draw inspiration from the vision transformer (ViT) net work [31] and introduce location encoding via the encoder part of the transformer net work to effectively capture information across time.The dynamic graph features _  for each subject can be represented as: ( Here, the variable W represents the weight assigned to different time points,  Sum ∈ ℝ  is the vector obtained by weighted summing of the different time point vectors, Hloc i the matrix obtained by adding location encoding to H, and  encoder ∈ ℝ  is the vector ob tained after passing through the encoder.We concatenate the vectors obtained from thes two methods to obtain the dynamic graph representation vector _  ∈ ℝ 2C .

Fusion of Prediction Results of Multiple Brain Region Segmentation Templates
The utilization of various brain region segmentation templates has been demon strated to enhance the accuracy of brain disease diagnosis [32,33].This study selects thre widely adopted brain segmentation templates to construct brain FCNs, i.e., Harvard-Ox ford atlas [34], Automated Anatomical Labeling (AAL) atlas [35], and Craddock's cluster ing 200 ROIs (CK) [36].We construct a brain FCN using each template and employ th The dynamic graph features G_ f ea dyn for each subject can be represented as: Here, the variable W represents the weight assigned to different time points, f Sum ∈ R C is the vector obtained by weighted summing of the different time point vectors, H loc is the matrix obtained by adding location encoding to H, and f encoder ∈ R C is the vector obtained after passing through the encoder.We concatenate the vectors obtained from these two methods to obtain the dynamic graph representation vector G_ f ea dyn ∈ R 2C .

Fusion of Prediction Results of Multiple Brain Region Segmentation Templates
The utilization of various brain region segmentation templates has been demonstrated to enhance the accuracy of brain disease diagnosis [32,33].This study selects three widely adopted brain segmentation templates to construct brain FCNs, i.e., Harvard-Oxford atlas [34], Automated Anatomical Labeling (AAL) atlas [35], and Craddock's clustering 200 ROIs (CK) [36].We construct a brain FCN using each template and employ the majority voting strategy from the predicted labels of the three trained classification models.This prediction strategy allows us to leverage the diverse information captured by different brain region segmentation templates, and we aim to achieve an improved and more reliable diagnosis of MDD using rs-fMRI.

Implementation Details
The code implementation was carried out in the Python programming language, utilizing PyTorch version 1.7.1 and Scikit-learn version 0.24.2.All the experiments were conducted on a system equipped with an NVIDIA P100 GPU featuring 16 GB VRAM.
The graph pooling in channel two was implemented by the Top-K pooling adjusted to the problem setting in this study [30].The approach allowed us to effectively reduce the number of nodes while preserving essential graph information.
The original BOLD signal was partitioned into multiple segments using a sliding window strategy.We set the window size to 50 and the window stride to 3. For example, if the BOLD signal has a length of 232 time points, it will be partitioned into 61 segments.
A graph was constructed by determining an edge between two nodes if their correlation coefficient was within the top 30% of all the correlation coefficients.We employed the cross-entropy loss function and utilized the OneCycleLR technique to adjust the learning rate dynamically.The maximum learning rate was set to 0.001.If a fixed learning rate is used, satisfying the learning rate requirements of both the initial and final stages of the model becomes challenging.If the learning rate is too large, the model may struggle to find a better solution.On the other hand, if the learning rate is too small, the optimization process at the initial stage becomes slow, resulting in slow convergence.We used dropout with a 0.5 probability instead of regularization methods.This was done because there were no apparent signs of overfitting during the model training process.Our model was trained for 50 epochs, and by the 50th epoch, the model had already converged.The training batch size, either 1 or 4, was determined by the number of nodes in the graph.Specifically, for experiments using the AAL and HO templates, we set the batch size to 4, while for experiments using the CK template, the batch size was set to 1.We used a grid search strategy to obtain the optimal values of the hyper-parameters.Regarding

Performance Measurements
This study formulates the determination of whether a sample suffers from MDD as a binary classification task.The prediction model is evaluated by the following five widely used measurements, i.e., accuracy (Acc), precision (Pre), sensitivity (Sen), specificity (Spe), F1 score (F1), and the area under the receiver operating characteristic curve (AUC).The numbers of correctly predicted positive (MDD) and negative (HC) samples are denoted as TP and TN, respectively, while those of the incorrectly predicted positive and negative samples are denoted as FN and FP, respectively.Acc = (TP + TN)/(TP + FN + TN + FP), (6) Pre = TP/(TP + FP), Sen = TP/(TP + FN), Spe = TN/(TN + FP) (7) and F1 = 2 × Pre × Sen/(Pre + Sen).( 8) All the classifiers are trained using the training subset and evaluated on the test subset.To ensure the experimental robustness and reduce bias, we employ a five-fold crossvalidation strategy for the evaluation of all the models.This approach involves randomly dividing a dataset into five equally sized subsets and using four subsets for training and the remaining one subset for testing iteratively.The final performance is averaged across the five iterations.

Comparison of Different Graph Neural Networks
In this section, we assess the performance of various graph neural networks on the MDD prediction task (Figure 4).Traditional graph neural networks, such as graph convolutional networks (GCNs) and graph isomorphism networks (GINs), often employ simple averaging or summation methods for graph readout in graph-based classification tasks.However, this simplistic approach may overlook critical graph information.The proposed DepressionGraph utilizes two channels of graph information to capture more comprehensive global information from the graph.

Comparison of Different Graph Neural Networks
In this section, we assess the performance of various graph neural networks on th MDD prediction task (Figure 4).Traditional graph neural networks, such as graph convo lutional networks (GCNs) and graph isomorphism networks (GINs), often employ simpl averaging or summation methods for graph readout in graph-based classification tasks However, this simplistic approach may overlook critical graph information.The proposed DepressionGraph utilizes two channels of graph information to capture more comprehen sive global information from the graph.
Figure 4 presents the MDD prediction accuracies of five models, including GCN GIN, and two variants of DepressionGraph using only channel one and channel two in dependently, together with the proposed DepressionGraph framework.Depres sionGraph outperforms all four other models using each of the three template HO/AAL/CK.At least a 2.24% improvement in accuracy has been achieved by Depres sionGraph.The DepressionGraph variant with channel one only achieves the second-bes accuracy using each of the three templates.It is evident that utilizing features from two channels significantly enhances the model's performance compared to using feature from just one channel across three templates.The experimental results in Figure 4

Comparison with Other Classification Algorithms
We constructed a graph structure and used the local clustering coefficients as the features, similarly to [37].These constructed features were then classified using the classifiers, including support vector machine (SVM), random forest (RF), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and convolutional neural network (CNN).Figure 5 presents the performance of these machine learning or deep learning methods as well as our proposed DepressionGraph.The experimental data demonstrated that DepressionGraph outperformed other algorithms in terms of metrics, such as Acc = 68.86%and AUC = 67.90%.Particularly, the Sen = 84.40% of DepressionGraph was significantly higher than that of other algorithms.Sen measures the detection rate of the model for diseased samples, and a higher Sen value can reduce the rate of false negatives, which is crucial for the MDD diagnosis task.DepressionGraph achieves Spe = 51.39%,which is similar to the classifiers SVM, XGBoost, and MLP.The other two classifiers, RF and CNN, outperform DepressionGraph in Spe, but their Sen values are the worst two in Figure 5.The main reason for the poor performance of neural network algorithms like MLP and CNN may be that the complexity of the feature extraction models is insufficient.It may require tens of thousands of training samples to train a better model.However, our proposed DepressionGraph offers a more targeted approach to extract features from changes in brain networks caused by diseases such as MDD or MCI.Even with fewer training samples, it can still achieve excellent performance.

Comparison with Other Classification Algorithms
We constructed a graph structure and used the local clustering coefficients as the fea tures, similarly to [37].These constructed features were then classified using the classifi ers, including support vector machine (SVM), random forest (RF), extreme gradient boost ing (XGBoost), multilayer perceptron (MLP), and convolutional neural network (CNN Figure 5 presents the performance of these machine learning or deep learning methods a well as our proposed DepressionGraph.The experimental data demonstrated that De pressionGraph outperformed other algorithms in terms of metrics, such as Acc = 68.86%and AUC = 67.90%.Particularly, the Sen = 84.40% of DepressionGraph was significantl higher than that of other algorithms.Sen measures the detection rate of the model fo diseased samples, and a higher Sen value can reduce the rate of false negatives, which i crucial for the MDD diagnosis task.DepressionGraph achieves Spe = 51.39%,which i similar to the classifiers SVM, XGBoost, and MLP.The other two classifiers, RF and CNN outperform DepressionGraph in Spe, but their Sen values are the worst two in Figure 5 The main reason for the poor performance of neural network algorithms like MLP an CNN may be that the complexity of the feature extraction models is insufficient.It ma require tens of thousands of training samples to train a better model.However, our pro posed DepressionGraph offers a more targeted approach to extract features from change in brain networks caused by diseases such as MDD or MCI.Even with fewer trainin samples, it can still achieve excellent performance.

Necessity of Dynamic Brain Functional Connection Network
To demonstrate the necessity of adopting the dynamic functional connectivity net work (DFCN) approach, we compared the performance of DepressionGraph using th static functional connectivity network (SFCN) versus DFCN.The AAL template was used As shown in Table 2, it is evident that DFCN outperforms SFCN in all metrics, highlight ing the importance of incorporating time-varying features in our method.This furthe validates the effectiveness of DepressionGraph based on DFCN in capturing patterns o functional connectivity that evolve over time.

Necessity of Dynamic Brain Functional Connection Network
To demonstrate the necessity of adopting the dynamic functional connectivity network (DFCN) approach, we compared the performance of DepressionGraph using the static functional connectivity network (SFCN) versus DFCN.The AAL template was used.As shown in Table 2, it is evident that DFCN outperforms SFCN in all metrics, highlighting the importance of incorporating time-varying features in our method.This further validates the effectiveness of DepressionGraph based on DFCN in capturing patterns of functional connectivity that evolve over time.

Contribution of Transformer-Encoder Module
We further investigate the impact of incorporating the transformer-encoder for the timing information extraction in the MDD prediction task (Figure 6).We

Contribution of Transformer-Encoder Module
We further investigate the impact of incorporating the transformer-encoder for th timing information extraction in the MDD prediction task (Figure 6).We

Contributions of Brain Segmentation Templates
We evaluate the contributions of the three brain segmentation template HO/AAL/CK and their fusion to the DepressionGraph framework (Table 3).The complet version of DepressionGraph fuses all three templates and achieves the best performance in three overall measurements, including Acc = 71.48%,F1 = 77.91%,and AUC = 70.03%The variant DepressionGraph(AAL) achieves the highest Pre = 66.11%, slightly better than that (66.01%) of DepressionGraph.The other variant, DepressionGraph(CK) (Sen 98.58%), outperforms the complete version DepressionGraph (Sen = 95.04%) in the meas urement Sen.

Contributions of Brain Segmentation Templates
We evaluate the contributions of the three brain segmentation templates HO/AAL/CK and their fusion to the DepressionGraph framework (Table 3).The complete version of DepressionGraph fuses all three templates and achieves the best performances in three overall measurements, including Acc = 71.48%,F1 = 77.91%,and AUC = 70.03%.The variant DepressionGraph(AAL) achieves the highest Pre = 66.11%, slightly better than that (66.01%) of DepressionGraph.The other variant, DepressionGraph(CK) (Sen = 98.58%), outperforms the complete version DepressionGraph (Sen = 95.04%) in the measurement Sen.
We analyzed the features encoded by the second-last layer of the HO-based model.We chose the HO brain parcellation template because it outperformed the models based on the AAL and CK templates (Table 3).The features in the layer before the final layer of the model are one-dimensional vectors with a length of 480.We separately calculated the Pearson correlation coefficients (PCCs) between these 480 features and the class labels separately.PCC ranges from −1 to 1.After analyzing the distribution of the latent features encoded by the HO-based model, we found that the highest absolute value of these PCCs was less than 0.1 (Figure 7).We speculate that this might be because the features before the final layer of the model still need to pass through a linear layer for classification.The weight parameters of the linear layer and the cross-entropy loss function are also crucial for the classification results, as they can potentially lead to a lack of strong correlation between the features before the final layer and the labels.We believe that improving the correlation between the latent features and the labels would further contribute to enhancing the model's performance.In our future work, we plan to design additional loss functions specifically for enhancing the correlations of the latent features with the labels.We analyzed the features encoded by the second-last layer of the HO-based model We chose the HO brain parcellation template because it outperformed the models based on the AAL and CK templates (Table 3).The features in the layer before the final layer o the model are one-dimensional vectors with a length of 480.We separately calculated th Pearson correlation coefficients (PCCs) between these 480 features and the class labels sep arately.PCC ranges from −1 to 1.After analyzing the distribution of the latent feature encoded by the HO-based model, we found that the highest absolute value of these PCC was less than 0.1 (Figure 7).We speculate that this might be because the features befor the final layer of the model still need to pass through a linear layer for classification.Th weight parameters of the linear layer and the cross-entropy loss function are also crucia for the classification results, as they can potentially lead to a lack of strong correlation between the features before the final layer and the labels.We believe that improving th correlation between the latent features and the labels would further contribute to enhanc ing the model's performance.In our future work, we plan to design additional loss func tions specifically for enhancing the correlations of the latent features with the labels.In summary, the DepressionGraph framework achieves similarly well, with value of Acc = 69.42%,68.86%, and 68.48% for the three templates HO, AAL, and CK, respec tively.The fusion of all three templates further improves the MDD prediction model to Acc = 71.48%.However, the fusion of three brain segmentation templates increased th algorithmic complexity and model training time.Future work may consider the construc tion of a better single template-based prediction framework or the designing of refined brain segmentation templates.In summary, the DepressionGraph framework achieves similarly well, with values of Acc = 69.42%,68.86%, and 68.48% for the three templates HO, AAL, and CK, respectively.The fusion of all three templates further improves the MDD prediction model to Acc = 71.48%.However, the fusion of three brain segmentation templates increased the algorithmic complexity and model training time.Future work may consider the construction of a better single template-based prediction framework or the designing of refined brain segmentation templates.

Comparison with the Existing Studies
The overall performance of DepressionGraph is compared with the existing studies (Figure 8).The comparison experiment focuses on the models using the same dataset as in this study, including Gu et al. [38], Li et al. [39], Jie et al. [40], Guo et al. [41], Yao et al. [27], Zhang et al. [42], and Zhu et al. [43].Figure 8 shows that DepressionGraph outperforms all the other studies in both the two performance measurements, Acc and F1.Previous studies did not release the performance measurements Pre, Sen, Spe, and AUC.The proposed DepressionGraph even improves the MDD prediction accuracy of Yao et al. [27] by 11.34%.Zhang et al. proposed a multi-view graph neural network to detect MDD across ten sites and achieved Acc = 65.61% and F1 = 64.55%,which were worse than DepressionGraph [42].Zhu et al. evaluated their model across 16 sites, and DepressionGraph outperformed their model by 0.78% in Acc and 6.78% in F1 on the same site [43].The multi-site investigations referenced in [42,43] reported F1 scores around 65%, whereas the DepressionGraph model attained a more elevated F1 score of 71.13%.This disparity implies the presence of an intersite batch effect, underscoring the necessity for future research focused on the development and implementation of sophisticated batch effect mitigation algorithms.

Comparison with the Existing Studies
The overall performance of DepressionGraph is compared with the existing studie (Figure 8).The comparison experiment focuses on the models using the same dataset a in this study, including Gu et al. [38], Li et al. [39], Jie et al. [40], Guo et al. [41], Yao et a [27], Zhang et al. [42], and Zhu et al. [43].Figure 8 shows that DepressionGraph outper forms all the other studies in both the two performance measurements, Acc and F1.Pre vious studies did not release the performance measurements Pre, Sen, Spe, and AUC.Th proposed DepressionGraph even improves the MDD prediction accuracy of Yao et al. [27 by 11.34%.Zhang et al. proposed a multi-view graph neural network to detect MDD across ten sites and achieved Acc = 65.61% and F1 = 64.55%,which were worse than De pressionGraph [42].Zhu et al. evaluated their model across 16 sites, and DepressionGrap outperformed their model by 0.78% in Acc and 6.78% in F1 on the same site [43].Th multi-site investigations referenced in [42,43] reported F1 scores around 65%, whereas th DepressionGraph model attained a more elevated F1 score of 71.13%.This disparity im plies the presence of an inter-site batch effect, underscoring the necessity for future re search focused on the development and implementation of sophisticated batch effect mit igation algorithms.

Validation of DepressionGraph on Mild Cognitive Impairment
To further validate the effectiveness of DepressionGraph, we downloaded 149 sam ples of early mild cognitive impairment (EMCI), 105 samples of mild cognitive impair ment (MCI), and 133 samples of normal controls (NC) from the ADNI database [44].Th data were preprocessed using dparsf following the approach described in Yang et al. [45 We employed the commonly used automated anatomical labeling (AAL) template fo brain region segmentation.
We set up two binary classification tasks: EMCI vs. NC and LMCI vs. NC.We com pared the performance of DepressionGraph with [45] to further demonstrate the effective ness of the proposed algorithm.Table 4 presents the performance of our proposed method compared to the method proposed in [45].The experimental data showed that our method outperformed Yang et al. in most metrics and provided evidence that DepressionGrap not only performed well in diagnosing MDD but also exhibited excellent performance i predicting other disorders such as MCI.We anticipated that DepressionGraph can als

Validation of DepressionGraph on Mild Cognitive Impairment
To further validate the effectiveness of DepressionGraph, we downloaded 149 samples of early mild cognitive impairment (EMCI), 105 samples of mild cognitive impairment (MCI), and 133 samples of normal controls (NC) from the ADNI database [44].The data were preprocessed using dparsf following the approach described in Yang et al. [45].We employed the commonly used automated anatomical labeling (AAL) template for brain region segmentation.
We set up two binary classification tasks: EMCI vs. NC and LMCI vs. NC.We compared the performance of DepressionGraph with [45] to further demonstrate the effectiveness of the proposed algorithm.Table 4 presents the performance of our proposed method compared to the method proposed in [45].The experimental data showed that our method outperformed Yang et al. in most metrics and provided evidence that De-pressionGraph not only performed well in diagnosing MDD but also exhibited excellent performance in predicting other disorders such as MCI.We anticipated that Depression-Graph can also achieve good results on the MCI dataset because both disorders primarily arise from abnormalities within the brain and exhibit symptomatic similarities.Compared to the MDD dataset, the MCI dataset's shorter temporal signals per brain region result in quicker training and testing speeds.The performance also suggests that EMCI and LMCI are easier to detect than MDD against the normal control samples.

Model Discussion
The above results indicated that our proposed DepressionGraph framework could extract more comprehensive information from graphs compared to the existing graph neural network methods, such as GCN and GIN.By utilizing the dynamic functional connectivity network (DFCN), we can capture the temporal features of functional connectivity networks (FCN).This approach significantly outperforms the SFCN method, with an improvement of 12.39% in accuracy (ACC).Recent studies have also adopted DFCN for their experiments.While most papers use the AAL template for brain parcellation, it may not be suitable for all brain disorders.To address this limitation, we employed three different brain parcellation templates and fused the results to compensate for the weakness of a single template.The experimental results further supported the effectiveness of fusing multiple brain parcellation templates.
DepressionGraph extracts high-level abstracted information from the complex brain rs-fMRI data, which lacks explainable features to guide the future clinical practice of MDD diagnosis.Explainable artificial intelligence (XAI) technologies may be considered in future studies.Combining expert medical knowledge with XAI (explainable artificial intelligence) technology is also more likely to achieve excellent performance.

Clinical Implications
The diagnosis of major depressive disorder (MDD) currently relies heavily on clinical observations.The spatiotemporal changes in brain blood flow can be detected by fMRI technology and have been related to neural activities of various neural disorders, including MDD [46].The rapidly innovated fMRI technology offers high temporal and spatial resolution of imaging human brains, and the fMRI-based machine learning and deep learning algorithms have been extensively explored for their contributions to diagnosing MDD [47].Precise classification of MDD patients from other neural disorders and normal controls may facilitate the computational diagnosis of MDD in future clinical practice and regular health examinations.

Conclusions
This study presents a two-channel graph neural network framework, Depression-Graph, for MDD prediction using rs-fMRI data.DepressionGraph leverages the capabilities of graph neural network and transformer-encoder-based timing information extraction and effectively captures rich temporal dynamics in the brain FCNs.We also address the limitation of a single brain segmentation template by fusing the prediction results from models trained with three different templates.
The experimental results clearly demonstrate the superiority of DepressionGraph over traditional GNNs as well as the existing studies.At least a 6.78% improvement in F1 is achieved by our proposed DepressionGraph over the other models on the same dataset.Similarly to [42,43], the cross-site generalizability of DepressionGraph needs to be explored in future work.The impact of the biased ethnicity in the cohort will be evaluated with additional diversified datasets.
Long-term trends associated with MDD in the time series data may greatly facilitate a better understanding of MDD's biological mechanism and the early diagnosis of MDD in clinical practice.However, most of the existing datasets overlooked this issue, probably due to the difficulties in identifying and tracking MDD patients before and after the disease onset in the long term.The lack of long-term time-series data means most of the computational studies focus on detecting the short-term fluctuations that only reflect acute changes in the mental status of neurological disorders like MDD.The accumulation of long-term time series brain imaging data for various neurological disorders can only be resolved by the establishment of a large international consortium.
DepressionGraph extracted complex high-level information across the time series activities of the whole brain, which captured the subtle patterns of MDD.The main limitation of the DepressionGraph-extracted features is the lack of explainability, and it cannot intuitively decompose the precise MDD classification model into the causal relations of activated brain regions and the in-depth characterization of functional neurodynamic patterns.In our future work, we aim to extract clinically meaningful types of node features from the current model.We plan to visualize the changes in brain networks over time using visualization tools to explore clinically actionable conclusions.Additionally, we will attempt to optimize the network structure by adding additional constraints to the node features and analyzing the changes in brain networks over time based on the node features and the connections between nodes.This analysis will help us identify specific brain regions activated in MDD patients, which facilitates a better understanding of the causal relations of activated brain regions and enhances the interpretability and clinical applicability of our model.
The other limitations of DepressionGraph include the confounding effect of unbalanced gender ratios and the high requirement for computational resources due to the FCN construction and the evaluation of the time series data.In future work, we will explore network modules that are more computationally efficient for the MDD prediction task.Additionally, we will conduct experiments on more datasets to ensure a balanced malefemale ratio in both positive and negative samples, thereby avoiding the confounding effect of gender on MDD classification.The imbalanced ratio between the two genders will be evaluated by building two separate gender-specific models.We will attempt to mitigate the limitations of the model in characterizing functional neurodynamics and explainability by utilizing more advanced algorithms to extract temporal dynamics information of individual brain regions.Additionally, we will refer to the image feature extraction methods evaluated by El-Gayar et al. [48] to extract intrinsic features from the magnetic resonance imaging itself, aiming to further enhance the performance of the model.Explainable artificial intelligence (XAI) algorithms will also be evaluated for their MDD prediction performance and the possible roles in characterizing the causal relations of brain activations and functional neurodynamics.Hierarchical clustering and k-means clustering strategies of the XAI-extracted explainable features will be employed to group the MDD patients into clinically meaningful subpopulations of MDD biotypes in future work.
the National Natural Science Foundation of China (U19A2061), the Jilin Provincial Key Laboratory of Big Data Intelligent Computing (20180622002JC), and the Fundamental Research Funds for the Central Universities, JLU.

Figure 1 .
Figure 1.Overview of the workflow for detecting MDD based on rs-fMRI data.

Figure 1 .
Figure 1.Overview of the workflow for detecting MDD based on rs-fMRI data.

Figure 2 .
Figure 2. Illustration of the proposed DepressionGraph framework.The upper half focuses on extracting features based on changes in the number of nodes (Channel 1), while the lower half focuses on extracting features based on changes in the number of node features (Channel 2).Specifically, during the channel2 process, the graph of n input nodes undergoes the first graph pooling, aggregating into n/2 nodes, represented by the yellow nodes in Figure 2.After the second graph pooling, the nodes further aggregate into a single node, represented by the orange node in Figure 2.

Channel 1 :Figure 2 .
Figure 2. Illustration of the proposed DepressionGraph framework.The upper half focuses on extracting features based on changes in the number of nodes (Channel 1), while the lower half focuses on extracting features based on changes in the number of node features (Channel 2).Specifically, during the channel2 process, the graph of n input nodes undergoes the first graph pooling, aggregating into n/2 nodes, represented by the yellow nodes in Figure 2.After the second graph pooling, the nodes further aggregate into a single node, represented by the orange node in Figure 2.

Figure 3 .
Figure 3.The architecture of transformers-based timing information extraction.

Figure 3 .
Figure 3.The architecture of transformers-based timing information extraction.
training speed, for the AAL and HO templates, each epoch took approximately 4 min 30 s. Completing the five-fold cross-validation required approximately 19 h.Due to the larger number of brain regions in the CK template resulting in a higher computational load, each epoch took approximately 6 min.Completing the five-fold cross-validation for the CK template required approximately 25 h.Slow training may pose challenges in the practical implementation of Depression-Graph for clinical situations with new cohorts of samples or new ethnic populations.Taking into account the factor of model training time, we have designed the model structure with a moderately sized number of MLP layers and graph pooling layers.This helps control the parameter count of the model and ensures that it does not require excessively long training time.Neural network distillation technology may be utilized to reduce the model training time further.
validat the effectiveness of our proposed model and demonstrate that the features extracted from Channel 1, which focuses on capturing local attention, and Channel 2, which emphasize global attention, are complementary to each other.

Figure 4 .
Figure 4. Classification accuracies of different graph neural networks for the MDD prediction task The horizontal axis gives the three brain region segmentation templates, and the vertical axis is th classification accuracy.

Figure 4 .
Figure 4. Classification accuracies of different graph neural networks for the MDD prediction task.The horizontal axis gives the three brain region segmentation templates, and the vertical axis is the classification accuracy.

Figure 4
Figure 4 presents the MDD prediction accuracies of five models, including GCN, GIN, and two variants of DepressionGraph using only channel one and channel two independently, together with the proposed DepressionGraph framework.DepressionGraph outperforms all four other models using each of the three templates HO/AAL/CK.At least a 2.24% improvement in accuracy has been achieved by DepressionGraph.The De-pressionGraph variant with channel one only achieves the second-best accuracy using each of the three templates.It is evident that utilizing features from two channels significantly enhances the model's performance compared to using features from just one channel across three templates.The experimental results in Figure 4 validate the effectiveness of our proposed model and demonstrate that the features extracted from Channel 1, which focuses on capturing local attention, and Channel 2, which emphasizes global attention, are complementary to each other.

Figure 5 .
Figure 5. Performance comparison between DepressionGraph and the other machine learning o deep learning algorithms.The horizontal axis gives the performance metrics Acc, F1, Pre, Sen, Spe and AUC.The vertical axis lists the metric values.The experiment evaluates six classification mod els, including SVM, XGBoost, RF, MLP, CNN, and DepressionGraph.

Figure 5 .
Figure 5. Performance comparison between DepressionGraph and the other machine learning or deep learning algorithms.The horizontal axis gives the performance metrics Acc, F1, Pre, Sen, Spec, and AUC.The vertical axis lists the metric values.The experiment evaluates six classification models, including SVM, XGBoost, RF, MLP, CNN, and DepressionGraph.
use the weighted summation module to replace the transformer encoder module in the DepressionGraph framework and denote this simplified variant as DepressionGraph-NoTE.The Depression-Graph framework outperforms DepressionGraph-NoTE on the MDD prediction task on all three brain region segmentation templates HO/AAL/CK.The experimental data show the valuable addition of the transformer encoder module to the DepressionGraph framework.Electronics 2023, 12, x FOR PEER REVIEW 10 of 1 use the weighted summation module to replace the transformer encoder module in the DepressionGraph framework and denote this simplified variant as DepressionGraph-NoTE.The Depres sionGraph framework outperforms DepressionGraph-NoTE on the MDD prediction task on all three brain region segmentation templates HO/AAL/CK.The experimental dat show the valuable addition of the transformer encoder module to the DepressionGraph framework.

Figure 6 .
Figure 6.Evaluation of the transformer encoder module in the DepressionGraph framework.Th model "DepressionGraph-NoTE" denotes the DepressionGraph without the transformer encode module.Three brain region segmentation templates, HO/AAL/CK, are evaluated on the horizonta axis.The vertical axis gives the prediction accuracies.

Figure 6 .
Figure 6.Evaluation of the transformer encoder module in the DepressionGraph framework.The model "DepressionGraph-NoTE" denotes the DepressionGraph without the transformer encoder module.Three brain region segmentation templates, HO/AAL/CK, are evaluated on the horizontal axis.The vertical axis gives the prediction accuracies.

Figure 7 .
Figure 7. Distribution of the feature importance in the HO-based model.The horizontal axis give the value ranges of the feature importance metrics in the HO-based model.The vertical axis lists th percentages of the features with the importance metrics in the HO-based model.The data series "FI and "Abs(FI)" refer to the feature importance and the absolute values, respectively, of the featur importance of the latent features encoded by the HO-based model.
importance Feature importances in the HO-based model FI Abs(FI)

Figure 7 .
Figure 7. Distribution of the feature importance in the HO-based model.The horizontal axis gives the value ranges of the feature importance metrics in the HO-based model.The vertical axis lists the percentages of the features with the importance metrics in the HO-based model.The data series "FI" and "Abs(FI)" refer to the feature importance and the absolute values, respectively, of the feature importance of the latent features encoded by the HO-based model.

Figure 8 .
Figure 8. Performance comparison with the existing studies.The horizontal axis gives the two meas urements, Acc and F1, and the vertical axis gives the values of these performance measurements The proposed DepressionGraph framework is compared with the previous studies, including Gu e al. [38], Li et al. [39], Jie et al. [40], Guo et al. [41], Yao et al. [27], Zhang et al. [42], and Zhu et al. [43 Comparison with the existing studiesGu et al.Li et al.Jie et al.Guo et al.Yao et al.Zhang et al.Zhu et al.DepressionGraph

Figure 8 .
Figure 8. Performance comparison with the existing studies.The horizontal axis gives the two measurements, Acc and F1, and the vertical axis gives the values of these performance measurements.The proposed DepressionGraph framework is compared with the previous studies, including Gu et al. [38], Li et al. [39], Jie et al. [40], Guo et al. [41], Yao et al. [27], Zhang et al. [42], and Zhu et al. [43].

Table 1 .
Demographics of the subjects in the MDD dataset.MDD: major depressive disorder; HC: healthy control; M/F: male/female; Edu: education years; std: standard deviation.

Table 2 .
Comparison of performance between DepressionGraph using DFCN and SFCN.The bolded values indicate the best-performing method for a specific indicator.

Table 2 .
Comparison of performance between DepressionGraph using DFCN and SFCN.The bolded values indicate the best-performing method for a specific indicator.

Table 3 .
Contribution evaluation of DepressionGraph using different brain segmentation templates.The column "Model" includes the complete version of DepressionGraph and the three variants using only one of the three templates HO/AAL/CK.The next columns give the five performance measurements, i.e., Acc, F1, Pre, Sen, Spe, and AUC.The bolded values indicate the best-performing method for a specific indicator.

Table 3 .
Contribution evaluation of DepressionGraph using different brain segmentation templates.The column "Model" includes the complete version of DepressionGraph and the three vari ants using only one of the three templates HO/AAL/CK.The next columns give the five performance measurements, i.e., Acc, F1, Pre, Sen, Spe, and AUC.The bolded values indicate the bestperforming method for a specific indicator.

Table 4 .
Performance of DepressionGraph and Yang et al.'s method on the MCI dataset.The bolded values indicate the best-performing method for a specific indicator.