The Use of Generative Adversarial Network and Graph Convolution Network for Neuroimaging-Based Diagnostic Classification

Functional connectivity (FC) obtained from resting-state functional magnetic resonance imaging has been integrated with machine learning algorithms to deliver consistent and reliable brain disease classification outcomes. However, in classical learning procedures, custom-built specialized feature selection techniques are typically used to filter out uninformative features from FC patterns to generalize efficiently on the datasets. The ability of convolutional neural networks (CNN) and other deep learning models to extract informative features from data with grid structure (such as images) has led to the surge in popularity of these techniques. However, the designs of many existing CNN models still fail to exploit the relationships between entities of graph-structured data (such as networks). Therefore, the graph convolution network (GCN) has been suggested as a means for uncovering the intricate structure of brain network data, which has the potential to substantially improve classification accuracy. Furthermore, overfitting in classifiers can be largely attributed to the limited number of available training samples. Recently, the generative adversarial network (GAN) has been widely used in the medical field for its ability to generate synthetic images, which helps cope with the problems of data scarcity and patient privacy. In our previous work, GCN and GAN were designed to investigate FC patterns to perform diagnosis tasks, and their effectiveness was tested on the ABIDE-I dataset. In this paper, the models are further applied to FC data derived from more public datasets (ADHD, ABIDE-II, and ADNI) and our in-house dataset (PTSD) to assess their generalization across different types of data. The results of a number of experiments show that GAN can effectively mimic FC data to achieve high performance in disease prediction. 
When employing GAN for data augmentation, the diagnostic accuracy across the ADHD-200, ABIDE-II, and ADNI datasets surpasses that of other machine learning models, including results achieved with BrainNetCNN. Specifically, with GAN the accuracy increased from 67.74% to 73.96% in ADHD and from 70.36% to 77.40% in ABIDE-II, and in ADNI it reached 52.84% and 88.56% for multiclass and binary classification, respectively. GCN also obtains decent results, with the best accuracy in the ADHD dataset (71.38% for multinomial and 75% for binary classification) and the second-best accuracy in the ABIDE-II dataset (72.28% and 75.16%, respectively). Both GAN and GCN achieved the highest accuracy for the PTSD dataset, reaching 97.76%. Some limitations remain to be addressed, but both methods show promise for the prediction and diagnosis of brain disorders.


Introduction
Functional magnetic resonance imaging (fMRI) is a neuroimaging tool that measures changes in cerebral blood flow to provide a visual representation of brain activity, allowing researchers to study brain function. The use of functional connectivity (FC) obtained from resting-state fMRI (rs-fMRI) enables imaging of temporal interaction between brain regions and has therefore been extensively employed in the classification of brain disorders and the identification of objective biomarkers associated with the underlying disorders. FC is a connectivity matrix representing functional communication between different brain regions, and the strength of connection between region i and region j is represented as the value of row i and column j in the matrix. The value is calculated using Pearson's correlation between the time series representing regions i and j; however, other metrics of association between time series can also be used [1,2]. Considerable evidence from rs-fMRI studies has shown the alteration or disruption of FC in individuals with neuropsychiatric and neurodegenerative disorders [3][4][5][6][7]. Several recent works have applied convolutional neural networks (CNNs) that incorporate these altered brain FC patterns as relevant features for rapid and reliable classification of brain disorders. However, these models are constrained by two challenges. First, although traditional CNNs can extract local meaningful features from ordered, grid-like data (such as images), the spatial features learned by a CNN may not be optimal for graph-structured data (such as networks), which are invariant to node ordering and have irregular relationships between nodes. Second, patient fMRI data used for training is currently limited in its sample size because of a range of factors, such as the exorbitant expense of data acquisition and the barriers to standardized data acquisition across different sites and to the consequent open sharing of data. The relatively small sample size of patient data often leads to overfit models. When relatively smaller samples of patient data are used with larger samples of healthy controls in the same model, it also causes the problem of class imbalance. To overcome those issues, graph convolutional networks (GCNs), an extension of CNNs, have been proposed to deal with graph-structured data, while generative adversarial networks (GANs) can deal with data scarcity in neuroimaging due to their ability to generate additional data for training purposes.
The brain can be conceptualized as a network where the specialized regions are represented as nodes, and the pathways of communication or links between these regions are regarded as edges. By analyzing the patterns of FC, we can gain valuable insight into the temporal properties and dynamic interplay between the brain regions, revealing a more comprehensive view of the brain network. Therefore, graph theoretical analysis may be an ideal tool to investigate the organizational mechanisms underlying brain networks. Several complex graph theoretic algorithms have been applied to study the pathophysiology of various diseases [8][9][10]. The brain graph is a network representation of the intricate interactions between N distinct regions of the brain and therefore can be captured by an N × N matrix. The elements in the matrix capture the strength or degree of correlation between each pair of nodes in the network. In general, brain graphs can be categorized as functional connectivity or effective connectivity, where the former captures the strength of statistical associations or correlation between brain regions and the latter represents the directionality of information flow. Networks can also be grouped as unweighted or weighted, depending on whether the edges are assigned a binary or continuous value. In functional brain networks, the edges can be estimated by various statistical methods, such as Pearson's correlation coefficients, Spearman's correlation, or Kendall rank correlation coefficients.
Our research aims to design an end-to-end GCN model that can be applied to functional graphs (here, constructed from rs-fMRI data) for distinguishing healthy controls from those with brain disorders. Similar to CNN, the proposed GCN also includes a convolution operation that learns localized patterns from the networks and a pooling operation that not only downsamples the graph but also increases the receptive field, allowing the model to learn global graph-level patterns. The model learns features from each node and its relationship with neighboring nodes to generate new feature maps via the spectral-based convolution method. The spectral convolution operation [11] can transform complex node representations to low-dimensional representations to tackle graph-structured data more easily.
To solve the problem of small sample sizes and class imbalance, we recently proposed a modified version of the existing GAN model that can generate realistic FC correlation matrices [12]. Generally, a GAN consists of two main models that are trained in an adversarial optimization process: a generator G, which is designed to generate outputs that can mislead the discriminator into treating them as authentic, and a discriminator D, which is trained to distinguish generated outputs from real data. An unconditioned (unsupervised) GAN can discover the nature of the data distribution and its latent structure to produce synthetic data. Building on those characteristics, conditional GAN and auxiliary classifier GAN have been used to allow GAN to perform classification tasks [13,14]. The classification performance can be improved by adding synthetic data to the classifier [15,16]. The proposed GAN model adapted these ideas to perform semi-supervised tasks. One of the issues involved in training GANs is the phenomenon called mode collapse, where the model only produces data belonging to a specific class. To prevent mode collapse, the proposed model utilizes supplementary information such as class category or phenotypic features to enhance the variety of the dataset. The generator of the GAN receives random noise combined with additional attributes, such as gender or age, to generate a synthetic FC matrix. The discriminator D adopts the architecture of BrainNetCNN [17], where filters are customized to function well with the connectivity matrix. Our previous paper [12] also applies the inner product operation to embedding vectors to quantify the statistical link between two brain regions. Thus, we utilize the GAN we previously developed, which is an improvement over existing GAN-based methods for neuroimaging data.
We have reported on the designs of GCN and GAN needed to work on FC data and tested them on the ABIDE-I dataset [12,18]. However, there is a need to examine the generalizability of these models to other datasets derived from different patient populations. Therefore, here we test the applicability of GCN- and GAN-based models on FC-based brain networks for discriminating healthy subjects from individuals diagnosed with ADHD (ADHD-200 [19] dataset), autism (ABIDE-II [20] dataset instead of ABIDE-I used in our previous work), PTSD (acquired in-house but publicly shared [21]), and Alzheimer's disease (ADNI [22] dataset). We have reported the utility of traditional machine learning models on these datasets before, and here we compare those results with the ones obtained from GCN and GAN. We also compared the proposed models with BrainNetCNN [17] to evaluate the efficacy of GCN for extracting structural features and GAN for data augmentation. Statistical tests were also conducted to determine which models achieved superior performance.

Related Work
Deep learning has attracted considerable attention for its potential to automatically detect and classify neurological diseases at an early stage. Specifically, convolutional neural networks (CNN) have been successful in using high-dimensional medical imaging data to predict diagnostic status. Kawahara et al. [17] proposed the BrainNetCNN in 2017, which is a class of CNNs that can be used to predict non-imaging variables (such as diagnostic status) using brain networks as input features. Another study [23] improves the detection of epileptic seizures using electroencephalogram (EEG) data by applying variable-frequency complex demodulation (VFCDM) and CNNs. Building on basic CNNs, researchers have improved the classification performance by applying transfer learning, a technique that uses pre-trained models to leverage knowledge gained from one dataset to perform well on different datasets [24][25][26]. This method has the advantage of allowing the model to train on image data acquired at multiple sites.
GCN is able to model the complex interconnections between nodes in a graph, making it particularly well-suited for analyzing the irregular structure of brain network data. Therefore, it has been employed for diagnostic classification using functional brain networks. Prior works proposed different GCN-based architectures to distinguish between healthy and unhealthy subjects, which can be categorized as individual-based graph architectures and population-based graph architectures. The main difference between these two methods is the representation of a node: nodes in the individual-based graph represent brain regions, while nodes in the population-based graph denote subjects. For instance, Ktena et al. [27] proposed a Siamese GCN that analyzes brain functional connectivity networks by exploiting the similarities between two brain networks, with the assumption that the classification task can be significantly improved with more accurate similarity metrics. Another study used varied templates to generate brain functional/structural connectivity networks for individual subjects and then trained a triplet graph convolutional network to learn the relationship at multiple scales [28]. The proposed model achieved high performance in the classification of mild cognitive impairment and attention-deficit/hyperactivity disorder versus healthy controls. On the other hand, Parisot et al. [29] considered implementing spectral GCN on a population-based graph where each subject is considered a node. The model leverages the relevant features from both rs-fMRI and non-imaging data to discriminate between nodes of healthy controls and nodes of individuals with autism. Kim et al. [30] introduced the spatio-temporal attention graph isomorphism network (STAGIN) model, which addresses dynamic graphs by employing two spatial attention READOUT mechanisms (Graph-Attention READOUT (GARO) and Squeeze-Excitation READOUT (SERO)) to capture spatial features at each time point and employing a transformer encoder to learn temporal attended features. Zhao et al. [31] introduced a data augmentation approach combining a "sliding window" strategy with the self-attention mechanism GCN (SA-GCN) for autism classification, utilizing time series subsegments to construct correlation matrices, and introducing both low-order and high-order functional graphs to enable the model to exploit features from various perspectives. Another study [32] proposed a model that comprises two distinct GCNs, f-GCN and p-GCN. The f-GCN analyzes individual brain networks within subjects by utilizing stacked GCNs and eigenpooling for coarsened graph generation, employing max pooling for node representation aggregation, while p-GCN, a population-based model, treats each subject as a graph node and utilizes the f-GCN output as a node feature.
Researchers have applied the generative aspect of GAN to various tasks in medical image analysis, including classification [33], segmentation [34], de-noising [35], image reconstruction [36], and image synthesis [37]. The use of GAN as a data augmentation method has been shown to outperform various traditional augmentation methods. GAN with feature matching has been proposed to discriminate psychiatric patients from controls [38]. The model learns to generate functional network connectivity that is constructed by independent component analysis, and the feature matching technique was used to stabilize the training process. The paper shows that GAN performs better than other traditional machine learning methods, such as support vector machine or nearest neighbors, with more than 6% higher accuracy. Barile et al. [39] utilized GAN with an autoencoder to generate brain connectivity for multiple sclerosis (MS) classification, ensuring that the model's training prevents collapse by producing synthetic data matching real data statistics. Cao et al. [40] introduced a multiloop algorithm aimed at improving the quality of generated data by enabling the assessment and ranking of sample distribution in each iteration, facilitating the selection of high-quality samples for training. While many studies have focused on generating realistic 3D brain images, only a few studies have developed GAN models to learn to mimic functional connectivity networks. Generating connectivity networks is not only computationally less demanding but also helpful in understanding brain network anomalies and underlying brain disorders.

Data
Attention deficit hyperactivity disorder (ADHD). ADHD is a prevalent neurobehavioral disorder in childhood that is typically characterized by symptoms of inattention, hyperactivity, and impulsivity. Children with ADHD are classified into three separate categories: ADHD-I (inattention), ADHD-H (hyperactive/impulsive), and ADHD-C (combination of both symptoms). The ADHD-200 Global Competition was held in summer 2011 and challenged teams to provide the best performance for diagnosing individuals with ADHD from their resting-state fMRI scans [19]. There are 929 subjects in the dataset, which consists of 573 healthy controls, 207 individuals with ADHD-C, 13 individuals with ADHD-H, and 136 individuals with ADHD-I. Scanning for each participant took place at one of seven distinct sites, namely Peking University, Kennedy Krieger Institute, NeuroIMAGE Sample, New York University Child Study Center, Oregon Health & Science University, University of Pittsburgh, and Washington University. For more information regarding the acquisition parameters and site distribution, please refer to the webpage http://fcon_1000.projects.nitrc.org/indi/adhd200/ (accessed on 19 March 2024). Since there are far fewer subjects diagnosed with the ADHD-H subtype than with the other classes, we merged subjects with ADHD-H into ADHD-C, which turns the problem into a 3-way diagnostic classification.
Autism Spectrum Disorder (ASD). ASD is a clinical term that encompasses a range of neurodevelopmental disorders marked by deficits in social behavior and communication skills, along with repeated behaviors and restricted interests. The classification of ASD individuals was carried out using rs-fMRI images from the Autism Brain Imaging Data Exchange (ABIDE). ABIDE is a group of organizations that has collected and distributed datasets containing rs-fMRI, alongside additional clinical and demographic information from both individuals with ASD and those who are typically developing [20,41]. The initial ABIDE data, or ABIDE I, were used to evaluate the two models in our previous papers [12,18]. In this work, the algorithms were extended to apply to ABIDE II, a new multi-site open data resource that was established to increase the sample size. Data for the imaging were obtained from 11 different facilities and involved a total of 623 participants. Of these, 356 were healthy controls, 214 had been diagnosed with autism, and 53 had been diagnosed with Asperger's syndrome (a milder form of autism).
Post-traumatic stress disorder (PTSD) & post-concussive syndrome (PCS). PTSD is a psychological disorder that develops in some individuals who have experienced shocking, horrifying, or life-threatening events. PCS is a condition in which symptoms or other functional difficulties persist for a period of time after sustaining a concussion or a mild traumatic brain injury. Such disorders often co-occur in individuals serving in the military. This study investigating PTSD/PCS involved 87 active-duty US soldiers recruited from Fort Moore, GA and Fort Novosel, AL, USA. Data collection was approved by the Institutional Review Board (IRB) at Auburn University and the U.S. Army Medical Research and Development Command IRB (HQ USAMRDC IRB). This sample included 28 combat controls, 17 individuals diagnosed with PTSD, and 42 individuals who had both PTSD and PCS. The imaging data for the study were obtained exclusively at the Auburn University Neuroimaging Center. Information about screening procedures to diagnose PTSD/PCS symptoms and acquisition parameters can be found in the paper [21]. Since each subject has 2 runs, we treated each run as 1 subject, resulting in a dataset with 174 subjects in total.
Mild cognitive impairment (MCI) & Alzheimer's disease (AD). As people age, the risk of developing AD increases, and this condition is the primary cause of dementia in the US. When an individual experiences mild cognitive dysfunction in the memory domain, they may be diagnosed with MCI, and it is believed that people who are diagnosed with MCI are at an increased risk of developing AD later in life. Diagnosis and treatment of the condition remain challenging, with no definitive diagnostic test or cure available at present. Therefore, accurate detection of MCI can aid in preventing further deterioration and slowing the progression of AD. The imaging data were sampled from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database to perform a 4-way multiclass classification: healthy controls, early MCI (EMCI), late MCI (LMCI), and AD. In particular, 35 matched healthy controls, 34 subjects with EMCI, 34 subjects with LMCI, and 29 subjects with AD were collected from the database. The data acquisition process used for this study can be found in the paper [22].

Data Preprocessing
FC was derived with the assistance of the Data Processing Assistant for Resting-State fMRI (DPARSF, version V5.3_210101) and the functional connectivity toolbox (CONN, version 22.a; https://web.conn-toolbox.org/, accessed on 19 March 2024). First, to minimize subject motion artifacts during the scanning process, motion correction was performed to align each image to a reference volume. Then, slice time correction was performed, and after that, the subject's data underwent a nonlinear transformation to align it with the common MNI152 (Montreal Neurological Institute) reference space, which facilitates group-level analysis. The preprocessing pipeline also includes regressing out nuisance variables, such as six head motion parameters, the mean white matter signal, and the cerebrospinal fluid (CSF) signal, in order to minimize confounding effects. Then, the underlying neural time series was estimated using the blind deconvolution method proposed by Wu et al. [42]. The deconvolved data were then obtained with a Wiener filter. We applied a temporal band-pass filter with a passband of 0.01-0.1 Hz to the data. Mean time series were extracted from the 200 regions of interest defined by Craddock (known as the CC200 template) [43]. Pearson's correlations between the mean time series of each pair of brain regions were computed, resulting in an FC matrix of shape 200 × 200 for each subject. However, due to incomplete brain coverage in the ADHD data, only 190 out of 200 regions were captured using the Craddock atlas. Similar to the ADHD dataset, the PTSD dataset suffered from incomplete brain coverage and covered only 125 out of 200 regions.
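The final step of this pipeline, turning regional mean time series into an FC matrix, can be illustrated with a minimal sketch. It assumes the preprocessed data are available as a NumPy array of shape (time points, regions); the function name `functional_connectivity` and the random input are hypothetical and stand in for real CC200 time series.

```python
import numpy as np

def functional_connectivity(ts):
    """Pearson FC matrix from a (timepoints, regions) array of mean ROI time series."""
    ts = np.asarray(ts, dtype=float)
    # np.corrcoef treats each row as one variable, so regions must be rows.
    return np.corrcoef(ts.T)

rng = np.random.default_rng(0)
# Hypothetical input: 150 volumes for 200 regions (random noise, not real data).
ts = rng.standard_normal((150, 200))
fc = functional_connectivity(ts)
print(fc.shape)
```

For datasets with incomplete coverage, the same call would simply receive fewer columns (for example 190 or 125 regions) and return a correspondingly smaller matrix.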

Graph Convolutional Network
The GCN architecture is depicted in Figure 1. For each subject, we define an undirected graph $G \equiv \{V, E\}$ as a functional brain network, where $V = \{v_1, \ldots, v_N\}$ is a set of N nodes (N may vary depending on the number of regions of interest) and $E = \{e_{ij}\}$ represents a collection of connectivity edges from node $v_i$ to node $v_j$. The graph was represented by an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where each element $a_{ij} = 1$ if the value at the corresponding position of the mean matrix $\bar{A}$ is greater than the cutoff threshold $\tau$ and $a_{ij} = 0$ otherwise. The mean matrix $\bar{A}$ was computed as the mean of all the functional connectivity matrices in the training dataset, and the threshold $\tau$ was determined by the percentage of positive connections that we want to keep. One reason for this design is that thresholding the mean matrix lets us sparsify the data to different degrees by varying the threshold. Furthermore, by keeping only relevant connections between regions, we can detect abnormal changes in meaningful patterns or connections that can effectively separate healthy subjects and subjects with brain disorders [3][4][5][6][7].
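The thresholding step can be sketched as follows. This is one plausible reading of the description, not the authors' code: it assumes the threshold is the quantile of the positive off-diagonal entries of the mean matrix that leaves the requested fraction above it, and it excludes the diagonal because self-loops are added later in the convolution. The names `binary_adjacency` and `keep_fraction` are hypothetical.

```python
import numpy as np

def binary_adjacency(train_fc, keep_fraction):
    """Threshold the mean training FC matrix into a binary adjacency matrix.

    train_fc: array of shape (subjects, N, N).
    keep_fraction: share of positive off-diagonal mean connections to keep.
    """
    mean_fc = np.mean(np.asarray(train_fc, dtype=float), axis=0)
    mean_fc = (mean_fc + mean_fc.T) / 2.0            # enforce exact symmetry
    off_diag = ~np.eye(mean_fc.shape[0], dtype=bool)
    positive = mean_fc[off_diag & (mean_fc > 0)]
    # Assumed rule: tau leaves roughly keep_fraction of positive values above it.
    tau = np.quantile(positive, 1.0 - keep_fraction)
    adj = ((mean_fc > tau) & off_diag).astype(int)
    return adj, tau

rng = np.random.default_rng(0)
# Hypothetical training set: 20 subjects, 30 regions, 120 time points each.
train_fc = np.stack([np.corrcoef(rng.standard_normal((30, 120))) for _ in range(20)])
adj, tau = binary_adjacency(train_fc, keep_fraction=0.2)
print(adj.sum(), round(float(tau), 4))
```

Because the adjacency matrix is computed once from the training set, every subject shares the same graph structure and differs only in node features.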
In this work, the graph convolutional layer was implemented from the spectral perspective. In the process of spectral graph convolution, the graph signals are transformed from the node domain to the frequency domain using the graph Fourier transform. Then, to reduce the computational complexity and enable the graph to learn locally, ChebNet uses K-order polynomial filters; this approach can be simplified by taking only the first-order approximation [11]. Hence, at layer l, the output node representation was computed as:
$$H^{(l)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l-1)}\,W^{(l)}\right) \tag{1}$$
where $\tilde{A} = I + A$ is equivalent to adding self-loops to the adjacency matrix and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, i.e., $\tilde{D}_{i,i} = \sum_j \tilde{A}_{ij}$. $\sigma$ is the activation function (Rectified Linear Unit (ReLU) or linear activation function). In this work, ReLU activation was chosen. Furthermore, $H^{(l-1)} \in \mathbb{R}^{N \times d}$ represents d attributes of the N nodes, and $W^{(l)} \in \mathbb{R}^{d \times m}$ refers to a learnable matrix used at layer l that transforms the input node representation $H^{(l-1)}$ from d to m feature dimensions. The initial node representations $H^{(0)}$ are just the original input features, i.e., the functional connectivity of each subject: $H^{(0)} = X$. As is evident, we employed an individual-based graph architecture. Equation (1) aggregates node representations in their direct neighborhood, helping to gain more information after each iteration for the purpose of learning the graph.
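The propagation rule in Equation (1) can be illustrated with a small NumPy sketch. It assumes the standard first-order form with symmetric degree normalization and ReLU activation; the helper names and the random weights are hypothetical, and no training is performed.

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_tilde = adj + np.eye(adj.shape[0])
    deg = a_tilde.sum(axis=1)              # >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(adj, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    return np.maximum(normalized_adjacency(adj) @ h @ w, 0.0)

rng = np.random.default_rng(0)
n, d, m = 8, 8, 2                          # toy sizes: 8 nodes, 8 -> 2 features
adj = np.triu((rng.random((n, n)) < 0.3).astype(float), k=1)
adj = adj + adj.T                          # undirected graph, no self-loops
h0 = rng.standard_normal((n, d))           # stands in for H^(0) = X
w = rng.standard_normal((d, m))            # untrained, illustrative weights
h1 = gcn_layer(adj, h0, w)
print(h1.shape)
```

With an empty graph the normalized adjacency reduces to the identity, so the layer degenerates to a per-node dense layer; the edges are what let each node mix in its neighbors' features.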
Figure 1. Illustration of the GCN architecture proposed in our previous work [18] that we have applied here. In the figure, the model consists of two convolutional layers that transform the number of node features from 8 to 2 and one pooling layer that pools the number of nodes from 8 to 3. The output of the GCN was also concatenated with the subject's attribute data (gender, age, imaging site), and then the combined input was passed to the classifier. The results reported in this paper were generated by this GCN architecture with slight changes in the parameters of each layer (as described in methods).
To apply GCN to the graph classification task, a graph-level representation is needed. Similar to conventional CNNs, where pooling is applied to reduce the spatial resolution, many pooling methods for GCNs have been proposed with the aim of decreasing the number of nodes to obtain coarser graphs while preserving important graph properties. One of the graph pooling approaches is self-attention graph pooling (SAGPool), a technique that utilizes a graph neural network to produce a score for each node based on its features and subsequently selects the K nodes with the highest scores [44]. Specifically, the self-attention score z for each node is calculated as:
$$z = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l-1)}\,\Theta\right)$$
where $\tilde{A} = A^{(l-1)} + I$, which depends on the adjacency matrix of the previous layer, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, and $\Theta \in \mathbb{R}^{d \times 1}$ is the weight of the pooling layer. Because graph pooling changes the graph, and in particular the adjacency matrix A, the shapes of the adjacency matrix and of the output node representation after pooling depend on the number of top-k nodes we want to keep. To update those variables, the indices of the top-k nodes were first obtained as:
$$\mathrm{idx} = \operatorname{top\text{-}k}(z, k)$$
The outputs of graph pooling were then determined as:
$$H^{(l)} = H^{(l-1)}(\mathrm{idx}, :) \odot z(\mathrm{idx}), \qquad A^{(l)} = A^{(l-1)}(\mathrm{idx}, \mathrm{idx})$$
where $H^{(l-1)}(\mathrm{idx}, :)$ contains the node-specific features that are indexed, $\odot$ performs elementwise multiplication, and $A^{(l-1)}(\mathrm{idx}, \mathrm{idx})$ is the adjacency matrix indexed by both rows and columns. Non-imaging measures that contribute variance to the imaging data, such as gender, age, and imaging site, can also be combined with the features extracted by the GCN to boost the prediction performance. To guarantee that all feature values are bounded in the interval [0, 1], gender and imaging site features were first encoded as one-hot vectors, while the age feature was normalized by dividing by 100. All non-imaging features were then transformed to a vector of length 2 by a dense layer, and one dense layer was also used to transform the output of the GCN model to a vector of length 15. Those vectors were then concatenated and used as input for the classifier, which consists of one dense layer with a softmax activation function to compute the likelihood of each subject's network belonging to a particular class label.
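The self-attention pooling steps described above (score each node, keep the top-k, gate the kept features by their scores, and slice the adjacency matrix) can be sketched in NumPy. The choice of tanh for the score activation is an assumption, since the text does not name it, and `sag_pool` and the random weights are hypothetical.

```python
import numpy as np

def sag_pool(adj, h, theta, k):
    """Self-attention graph pooling: keep the k highest-scoring nodes."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # One score per node; tanh is an assumed choice of activation.
    z = np.tanh(a_hat @ h @ theta).ravel()
    idx = np.argsort(-z)[:k]                 # indices of the top-k scores
    h_out = h[idx, :] * z[idx, None]         # gate kept features by their score
    a_out = adj[np.ix_(idx, idx)]            # index rows and columns
    return h_out, a_out, idx, z

rng = np.random.default_rng(0)
n, d, k = 8, 2, 3                            # toy sizes: pool 8 nodes down to 3
adj = np.triu((rng.random((n, n)) < 0.4).astype(float), k=1)
adj = adj + adj.T
h = rng.standard_normal((n, d))
theta = rng.standard_normal((d, 1))          # untrained, illustrative weights
h_out, a_out, idx, z = sag_pool(adj, h, theta, k)
print(h_out.shape, a_out.shape)
```

Multiplying the kept features by their scores is what makes the pooling weights trainable, because the selection by index alone has no gradient.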

Generative Adversarial Network
A generative adversarial network (GAN) comprises two different functional models, namely the discriminator (D) and the generator (G). The two models can be trained simultaneously: the generator takes a random variable z from a prior distribution (usually Gaussian noise or a uniform distribution) to generate new images, while the discriminator focuses on distinguishing whether the image is authentic or not. For supervised learning, the output of the discriminator will also include the probabilities of the class label in addition to its validity output. GAN is able to generate synthetic data that are of high quality and closely resemble real data by using an iterative adversarial approach. The specific designs of the discriminator and the generator are described below and illustrated in Figure 2.
Figure 2. Illustration of the GAN model that we proposed previously [12] and used in this work. The generator produces a synthetic functional connectivity matrix via the combined input of random noise and feature codes (gender, age, and label). The discriminator was trained on both real FC data and synthesized FC data generated from the generator.
Generator architecture: The generator takes the random noise vector z drawn from a uniform distribution to produce synthetic functional connectivity data. One of the issues of the generator is mode collapse, which occurs when there is only a limited set of samples that the generator can generate. To mitigate this problem, we use ideas from conditional GAN (CGAN) [13] and InfoGAN [45], which integrate more attribute data into the latent input, including category labels and phenotypic measures (such as age, gender, etc.).
Typically, the generator will directly output the image from the latent input, which will violate the nature of functional connectivity, where each entry in the matrix corresponds to the correlation coefficient between the average time series of a pair of brain regions i and j. By transforming the latent vector z to a matrix $X \in \mathbb{R}^{N \times d}$, we will have each row in X representing the embedding vector of one brain region (N is the number of ROIs and d is the dimension of the embedded region). Then the generated output A is determined by taking the inner product of X with itself, followed by the tanh activation function so that each value in A lies between −1 and 1:
$$A = \tanh\left(X X^{\top}\right)$$
Discriminator architecture: The discriminator is provided with both types of inputs (the original image or a synthesized one) and decides whether the input is real or not. To boost the performance of the discriminator, phenotypic features for each subject were also included as input besides the FC matrix. Similar to the design of deep convolutional GAN (DCGAN) [45], which uses multiple convolution layers to extract features, we employed BrainNetCNN, which provides convolutional filters specifically designed for modeling brain networks. The BrainNetCNN consists of three special convolution layers: the edge-to-edge layer (ECE), the edge-to-node layer (ECN), and the node-to-graph layer (NCG). The ECE layer uses cross-shaped filters to calculate the weighted sum of all the neighboring edges, which results in a new edge value. The ECN layer, given one node, performs the convolution over all the edges that connect to that node. If the number of ROIs is N, then the output of the ECE layer will have the shape N × N, while the shape of the output of the ECN layer is N × 1. Finally, the NCG layer acts as a fully connected layer, which summarizes all the nodes into a single graph.
Then the dense layers were used to convert the output of the NCG layer and phenotypic features to a new feature space. These two vectors were then concatenated and fed to the dense layer with two heads, one with sigmoid activation for validity classification and another with softmax activation for label classification.
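To make the three discriminator layer types more concrete, the sketch below implements a single-filter, single-channel, bias-free version of the edge-to-edge, edge-to-node, and node-to-graph operations in NumPy, following the cross-shaped filter formulation of BrainNetCNN [17]. It is a simplification under stated assumptions: the real layers use many feature maps, and the function names and random weights here are hypothetical.

```python
import numpy as np

def edge_to_edge(a, r, c):
    """out[i, j] = sum_k r[k] * a[i, k] + sum_k c[k] * a[k, j] (cross-shaped filter)."""
    row_part = a @ r                       # weighted sum along row i
    col_part = c @ a                       # weighted sum along column j
    return row_part[:, None] + col_part[None, :]

def edge_to_node(a, r, c):
    """out[i] = sum_k r[k] * a[i, k] + sum_k c[k] * a[k, i] (edges touching node i)."""
    return a @ r + c @ a

def node_to_graph(nodes, w):
    """Weighted sum over all nodes, giving one value per graph."""
    return float(w @ nodes)

rng = np.random.default_rng(0)
n = 5                                      # toy number of ROIs
a = rng.standard_normal((n, n))
a = (a + a.T) / 2.0                        # symmetric, like an FC matrix
r, c, w = rng.standard_normal((3, n))      # untrained, illustrative weights
edges = edge_to_edge(a, r, c)              # shape (n, n)
nodes = edge_to_node(edges, r, c)          # shape (n,)
graph = node_to_graph(nodes, w)            # scalar
print(edges.shape, nodes.shape)
```

In this formulation the edge-to-node response of node i equals the diagonal entry (i, i) of the edge-to-edge response computed with the same weights, which is why the two layers share the cross-shaped filter.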

Experimental Setting
The architectures and hyper-parameters of both GAN and GCN were adopted from our previous papers [12,18] based on their highest performances on the ABIDE-I dataset.
In particular, the GCN model that was tested on the datasets has the following structure: 2 convolution layers, followed by 1 pooling layer. The first and second convolution layers transformed feature vectors to have sizes of 25 and 10, respectively, and then the pooling layer was applied to downsample the graph from N nodes to 10 nodes. The shallow GCN was selected because the model performance tends to decrease with an increase in the number of layers. This phenomenon is known as over-smoothing, where through many message-passing steps, all node representations may become similar to each other, making it infeasible to identify discriminant features. The output of the pooling layer is then flattened and integrated with normalized age, one-hot coding of gender, and the imaging site (only available for ADHD and ABIDE-II datasets). One classifier layer was used to directly read out the combined inputs to produce the probability for each class by using the softmax activation function.
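The non-imaging inputs mentioned above (normalized age, one-hot gender, one-hot imaging site) could be encoded as in the following sketch. Dividing age by 100 follows the normalization stated in the GCN description; the function name, the category labels, and the ordering of the features are hypothetical.

```python
def encode_phenotypes(age, gender, site, gender_levels, site_levels):
    """Encode age, gender, and site as a flat list of values in [0, 1]."""
    features = [age / 100.0]                                      # age scaled by 100
    features += [1.0 if gender == g else 0.0 for g in gender_levels]  # one-hot gender
    features += [1.0 if site == s else 0.0 for s in site_levels]      # one-hot site
    return features

# Hypothetical category labels, used only for illustration.
vec = encode_phenotypes(12, "F", "NYU", ["F", "M"], ["KKI", "NYU", "PKU"])
print(vec)  # [0.12, 1.0, 0.0, 0.0, 1.0, 0.0]
```

For datasets without site information, the site levels can be passed as an empty list so that only age and gender are encoded.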
Regarding GAN, the discriminator has three types of layers similar to BrainNetCNN: an ECE layer with 16 feature maps, followed by an ECN layer with 64 filters and an NCG layer with 128 filters to extract all the nodes' features. Batch normalization, the LeakyReLU activation function, and dropout with a rate of 0.5 were applied consecutively after each layer. A dense layer with 64 hidden units further extracts features from the flattened output of the NCG layer. To incorporate phenotypic features, the age and gender of an individual are first concatenated into a vector of length 2, which is then transformed into a vector of length 16 by a dense layer. The fully connected output is merged with this feature vector. The combined input is passed through one more dense layer with 32 units before being fed to the classification layer, which predicts the class label of the subject as well as the validity of the FC (real or fake).
As for the generator, a random vector of length 50 (including gender, age, and label) is fed into the embedding layer, which turns the input into an N × d matrix, where N is the number of regions and d is the embedding dimension. N is equal to 190, 200, 125, and 200 for the ADHD, ABIDE-II, PTSD, and ADNI datasets, respectively, while d is set to 10. Since not all subjects in the ADHD and PTSD datasets had usable data from all 200 ROIs (either because of data quality or a lack of whole-brain coverage), the values of N for these datasets are not equal to 200. Nonetheless, the left-out ROIs corresponded to the cerebellum, and subcortical and cortical ROIs were present in all datasets. The feature representation of each region is stored in a single row of the matrix. The inner product is then taken to output the functional connectivity matrix.
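The generator's final step can be sketched in a few lines. The sketch assumes that the embedding layer is a single linear map (the weight matrix W is a hypothetical stand-in for the learned layer) and uses a small N to keep the example fast. It illustrates only how an N × d embedding becomes a symmetric matrix with entries between −1 and 1.

```python
import math
import random


def generate_fc(z, W, n_rois, d):
    """Map latent vector z to an N x d embedding X, then return tanh(X X^T)."""
    # Hypothetical embedding layer: one linear map from z to N*d values.
    flat = [sum(W[row][k] * z[k] for k in range(len(z)))
            for row in range(n_rois * d)]
    X = [flat[i * d:(i + 1) * d] for i in range(n_rois)]
    # Inner product between region embeddings, squashed into [-1, 1].
    return [[math.tanh(sum(X[i][k] * X[j][k] for k in range(d)))
             for j in range(n_rois)] for i in range(n_rois)]


rng = random.Random(0)
latent_dim, n_rois, d = 50, 8, 10  # paper: N is 125 to 200, d = 10
z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
W = [[rng.gauss(0.0, 0.05) for _ in range(latent_dim)]
     for _ in range(n_rois * d)]
A = generate_fc(z, W, n_rois, d)
```

Because the output is built from inner products of row embeddings, it is symmetric by construction, as a correlation-based FC matrix must be. Note that in this sketch the diagonal equals the tanh of each embedding's squared norm rather than exactly 1.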
A test set consisting of 10% of the data was held out for each dataset to assess the model's performance. After leaving out this 10% for testing, a 5-fold cross-validation approach was used to split the remaining data into training and validation sets. Each model was therefore trained five times, and the cross-validation performance of each model is the average over these runs. The model that performed best on the validation set was chosen for assessment on the test set. The test accuracy is obtained by applying the trained model to the test data once. For the GAN model, validity accuracy is also considered when selecting the model, in addition to its performance on the validation set (note that in GANs, the discriminator has two outputs: one for the probability of validity, which tests the authenticity of the FC (real or fake), and one for classification (HC or patients)). We used the Adam optimizer with a learning rate of 0.01 for GCN and a learning rate of 0.0001 and β1 = 0.5 for GAN.
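The evaluation protocol above (a 10% hold-out test set followed by 5-fold cross-validation on the rest) could be set up as in the sketch below. The function name and the plain random shuffling are assumptions made for illustration; the paper does not state whether the splits were stratified by class or site.

```python
import random


def holdout_and_folds(n_subjects, test_frac=0.10, k=5, seed=0):
    """Return held-out test indices and k (train, validation) index splits."""
    idx = list(range(n_subjects))
    random.Random(seed).shuffle(idx)
    n_test = round(n_subjects * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    folds = [rest[i::k] for i in range(k)]  # k roughly equal folds
    splits = []
    for i in range(k):
        val = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        splits.append((train, val))
    return test, splits


test_idx, cv_splits = holdout_and_folds(200)
```

Each of the five (train, validation) pairs would be used to train one model, and the held-out indices would be touched only once, for the final test accuracy.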
Other models: For comparison purposes, 18 traditional machine learning models used by Lanka et al. [21] were also trained on all the datasets with the default hyper-parameters from the Scikit-learn and Matlab tools provided in that paper. These models include probabilistic (Bayesian) methods. In the probabilistic framework, a prior belief about the data distribution is assumed, and the model parameters are then selected to maximize the probability of the observed data given particular parameter settings. The representatives of the probabilistic models were Gaussian Naïve Bayes (GNB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), sparse logistic regression (SLR), and ridge logistic regression (RLR). Kernel-based models use kernel functions to map the input into a different space and are then trained on the new feature space; they include support vector machines with linear kernels (LinearSVM) and radial basis function kernels (RBF-SVM), and relevance vector machines (RVM). Some traditional neural networks are also included, namely the multilayer perceptron neural net (MLP-Net), the fully-connected neural net (FC-Net), the extreme learning machine (ELM), and the learning vector quantization net (LVQNET). In addition, k-nearest neighbors (kNN) is an instance-based learning model that assigns unknown data to categories based on the distances between the unknown data and the labeled data points. Finally, ensemble learning allows multiple classifiers to solve a problem jointly, on the premise that multiple classifiers can provide a better result than a single classifier. Using a decision tree as the base classifier, several methods were used to train ensemble classifiers, namely bagged trees, boosted stumps, random forest, and rotation forest. Further details regarding these models can be found in Lanka et al. [21]. Additionally, BrainNetCNN, which is the top-performing method for connectome classification, was trained with the same 5-fold CV; its hyper-parameters and training process are similar to the settings of the discriminator in GAN.
Accuracy alone may not be an appropriate evaluation metric in imbalanced classification scenarios. Therefore, other metrics such as precision, recall (sensitivity), specificity, F1 score, and area under the curve (AUC) are also reported. These metrics are usually defined for binary classification problems; therefore, to handle multiclass classification, the one-vs-rest (OvR) scheme (with a macro-averaging strategy) was used.
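As an illustration of the one-vs-rest, macro-averaged metrics mentioned above, the following sketch treats each class in turn as the positive class and then averages the per-class scores with equal weight. The function name and the toy labels are hypothetical, and the convention of scoring 0 when a denominator is empty is an assumption.

```python
def macro_ovr_scores(y_true, y_pred):
    """Macro-averaged one-vs-rest precision, recall, specificity, and F1."""
    classes = sorted(set(y_true))
    per_class = {"precision": [], "recall": [], "specificity": [], "f1": []}
    pairs = list(zip(y_true, y_pred))
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        tn = sum(t != c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class["precision"].append(prec)
        per_class["recall"].append(rec)
        per_class["specificity"].append(spec)
        per_class["f1"].append(f1)
    # Macro averaging: every class counts equally, regardless of its size.
    return {name: sum(vals) / len(classes) for name, vals in per_class.items()}


# Toy 3-class example (labels are illustrative only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_ovr_scores(y_true, y_pred)
```

Macro averaging keeps a small patient subgroup from being drowned out by a large control group, which is the concern raised for imbalanced datasets.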

Cutoff Threshold
The binary adjacency matrix representing the graph for each dataset was built by thresholding the values of the mean matrix derived from the training data. Specifically, if the correlation coefficient between region i and region j is greater than the cutoff threshold τ, the value of the adjacency matrix at (i, j) is 1, and 0 otherwise. To choose an appropriate threshold, we plotted the percentage of preserved edges against the cutoff threshold and chose the elbow of the curve as the cutoff, as in previous work [46,47]. The mean matrix was derived from the average of all the training data across the 5-fold CV. Figure 3a-d shows the cutoff thresholds that preserve meaningful edges for the ADHD, ABIDE-II, PTSD, and ADNI datasets, respectively. The cutoff threshold for the ADHD, ABIDE-II, and ADNI datasets is 0.15, which maintains 13.17%, 20.60%, and 14.80% of the total edges in each dataset, respectively, while the threshold for the PTSD dataset is 0.2, which keeps 16.19% of the edges.
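The thresholding step and the preserved-edge curve can be sketched as follows. The helper names and the toy matrices are hypothetical, and counting edges only over the off-diagonal upper triangle is an assumption about how the percentages were computed.

```python
def mean_matrix(mats):
    """Element-wise mean of a list of equally sized square matrices."""
    n = len(mats[0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
            for i in range(n)]


def binarize(mean_fc, tau):
    """Adjacency entry is 1 where the mean correlation exceeds tau, else 0."""
    return [[1 if v > tau else 0 for v in row] for row in mean_fc]


def preserved_fraction(adj):
    """Share of region pairs (i < j) that remain connected."""
    n = len(adj)
    kept = sum(adj[i][j] for i in range(n) for j in range(i + 1, n))
    return kept / (n * (n - 1) / 2)


# Two toy training FC matrices for 3 regions (illustrative values).
fc_1 = [[1.0, 0.3, 0.1], [0.3, 1.0, 0.0], [0.1, 0.0, 1.0]]
fc_2 = [[1.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.0]]
mean_fc = mean_matrix([fc_1, fc_2])
adj = binarize(mean_fc, tau=0.15)

# Sweeping tau gives the curve from which an elbow would be picked.
curve = [(tau, preserved_fraction(binarize(mean_fc, tau)))
         for tau in (0.0, 0.15, 0.5)]
```

Plotting the second element of each pair in `curve` against the first, over a finer grid of thresholds, gives the kind of curve whose elbow is used as the cutoff.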

Model Comparison
The outcomes of all the models for multinomial classification are presented in Tables 1(a) to 4(a), while Tables 1(b) to 4(b) show the results for the same datasets in the binary classification scenario. The value highlighted in red represents the top-performing result across all the models, while the blue highlight indicates the second highest result. In Figure 4, the models have been sorted from worst to best performance. We can observe that some models may perform very well for some metrics or datasets, but the deep learning models (including GCN and GAN) generally perform well across all metrics and datasets.
ADHD: For multinomial classification, GCN achieves the highest accuracy, precision, and F1 score and the second highest AUC. GAN achieves the second highest accuracy at 68.16%, which is only about 3 percentage points below that of GCN. The results remain the same in the binary classification scenario, with the only exception being precision, where GAN takes first place and GCN second. Although the RBF-SVM model has the highest specificity and AUC, its recall is very low at only 1.67%, meaning that it fails to identify most of the actual patients. GAN and GCN therefore achieve the best overall performance among all the models.
ABIDE-II: GAN and GCN outperform the other models in accuracy for both multinomial classification (73.56% and 72.28%) and binary classification (77.40% and 75.16%). GAN also shows the highest precision and F1 score. The kNN, RBF-SVM, and random forest classifiers obtained the highest and second highest specificity; however, their recall scores are rather low. The specificity scores of GAN and GCN, on the other hand, are relatively high (88.34% and 88.9%, respectively).
PTSD: This is a homogeneous dataset in which all subjects were scanned on a single scanner using the same sequence. Since the sources of non-neural variability are relatively minimized in this dataset, most models performed very well (AUC > 90%). It is therefore not very informative to evaluate the classification models against one another. Nevertheless, BrainNetCNN outperforms GAN and GCN in terms of accuracy, precision, and F1 score for 3-way classification. Also in 3-way classification, although GCN was outperformed by Linear SVM and BrainNetCNN, it still performs better than the remaining models (by a margin of 1% to 4%). For binary classification, GAN and GCN show approximately similar patterns: they achieve the highest accuracy, recall, and F1 score (97.76%, 100%, and 98.40%, respectively) and the second highest precision (96.92%) and specificity (93.33%). RLR, Linear SVM, and BrainNetCNN are also among the best-performing models on this dataset.
ADNI: GAN reaches the top level of performance in both 4-way and binary classification, particularly in accuracy, where its value exceeds the second highest value by large margins (52.84% vs. 44.28% and 88.56% vs. 82.86%). GCN attains only the second highest accuracy for multinomial classification. This may be due to the limited number of training samples and to the possibility that the cutoff threshold removes some important features from the graph.

Effect of Different Thresholds on GCN's Performance
Even though we used a criterion for threshold selection that has been widely reported before, we wanted to ensure that our choices did not remove important connections and thereby harm the model's performance. Therefore, we evaluated binary classification accuracy for the four datasets and plotted it against different cutoff thresholds. As Figure 5a-d shows, accuracy peaks at our chosen threshold for all four datasets, justifying the selection of thresholds based on the elbow criterion.

Statistical Significance
A random classifier for a binary classification problem has a 50% probability of predicting the label correctly. A model whose accuracy falls below that expectation cannot be used [48]. Therefore, we modeled the outcomes of each classifier as a Bernoulli process B(n, p), where n is the total number of subjects in the test sample and p is the probability of success. We then tested whether the probability of correctly predicted labels for each classifier surpasses the expected probability. The results of all the models on all the datasets are shown in Table 5. GAN and GCN achieve significant results on all the datasets.
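One way to carry out such a test is an exact one-sided binomial test against chance level, sketched below. That the authors used exactly this form is an assumption; the function name and the example counts are hypothetical and are not results from the paper.

```python
from math import comb


def binomial_p_value(n_correct, n_total, p0=0.5):
    """One-sided p-value P(K >= n_correct) for K ~ B(n_total, p0)."""
    return sum(comb(n_total, k) * p0 ** k * (1 - p0) ** (n_total - k)
               for k in range(n_correct, n_total + 1))


# Hypothetical test set of 40 subjects.
p_strong = binomial_p_value(30, 40)  # 75% accuracy
p_weak = binomial_p_value(22, 40)    # 55% accuracy
alpha = 0.05
significant = p_strong < alpha
```

With a small test set, an accuracy only slightly above 50% is not distinguishable from chance, which is why the test depends on n as well as on the observed accuracy.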

Statistical Comparison
To test the hypothesis that GAN and GCN generalize better than the other models, all the accuracy scores generated by the CV method were collected as samples for a statistical test. Specifically, we took as the null hypothesis that the performances of GAN and GCN are worse than those of the other models and checked whether there is enough evidence to reject it. The Wilcoxon rank-sum test was applied to compare the performances of GAN and GCN with those of the other models. The Wilcoxon test, as an alternative to Student's t-test, can be more appropriate when the sample is small because we cannot assume that the data are normally distributed [49]. The level of significance was set at α = 0.05.
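The comparison described above can be illustrated with a one-sided rank-sum test on two sets of fold accuracies. The sketch computes an exact p-value by enumerating all ways of assigning the pooled ranks to the first group, which is feasible for five folds per model. Whether the authors used an exact or a normal-approximation version is not stated, so this choice is an assumption, and the accuracy values are made up for illustration.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p_greater(a, b):
    """Exact one-sided p-value for the alternative 'a tends to exceed b'."""
    ranks = midranks(list(a) + list(b))
    observed = sum(ranks[:len(a)])
    sums = [sum(c) for c in combinations(ranks, len(a))]
    return sum(s >= observed - 1e-12 for s in sums) / len(sums)


# Hypothetical 5-fold accuracies for two models (not taken from the paper).
model_a = [0.77, 0.75, 0.78, 0.76, 0.79]
model_b = [0.70, 0.72, 0.69, 0.71, 0.73]
p_value = rank_sum_p_greater(model_a, model_b)
```

With five folds per model, the smallest attainable one-sided exact p-value is 1/252 (about 0.004), reached only when every accuracy of one model exceeds every accuracy of the other. SciPy's `scipy.stats.ranksums` provides a normal-approximation version of the same test.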
Table 6 (a) and (b) show the results (p-values) of the Wilcoxon test for the comparison of GAN and GCN, respectively, with the other models on all the datasets. The tests indicate that GAN and GCN have statistically greater accuracy scores than almost all the traditional ML models on all the datasets (p-value < 0.05). However, we do not have enough evidence to conclude that GAN and GCN perform statistically better than BrainNetCNN in general, although the test suggests that GAN performs better than BrainNetCNN on the ABIDE-II dataset (p-value = 0.02).

Discussion
GAN shows excellent results on independent test data for both large and small datasets: the model had the best performance on the ABIDE-II, PTSD, and ADNI datasets and the second best performance on the ADHD dataset. The improvement of GAN with BrainNetCNN as the backbone network over BrainNetCNN alone demonstrates the benefit of data augmentation by GAN. This could potentially address the problem of data scarcity for neuroimaging-based diagnostic prediction in patient populations in neurology and psychiatry.
Table 7 shows the computational time required for each model to complete training on each dataset. In general, all three deep learning models require more time to train than the traditional methods, which can be attributed to their complexity and larger number of trainable parameters. The GAN exhibits the longest training time. This is because the GAN model needs to learn the data distribution in order to synthesize data, in addition to the time required for training the classifier. Despite this extended training time, GAN achieves the best or second best performance among all models on the four datasets. Notably, GCN requires less training time than BrainNetCNN on three datasets (ABIDE-II, PTSD, and ADNI), yet it achieves better performance on ABIDE-II and ADNI and equivalent performance on PTSD. This suggests that, despite having fewer trainable parameters, GCN is a superior tool for capturing the complex structure of brain networks. Some traditional models require very little training time, sometimes as little as 0.01 s. However, their performance does not match that of GAN and GCN. This indicates a trade-off between training time and performance across traditional and deep learning models. Future research should aim to decrease the training time of GAN and GCN while maintaining satisfactory accuracy, in order to enhance their practical applicability in real-world clinical settings.
In Figure 3a-d, we can see that each dataset has a different cutoff threshold. As mentioned above, we aim to retain only the strong connections in the backbone network that are crucial for identifying abnormal patterns in individuals with brain disorders. Therefore, we prune the low tail of the curve, which comprises solely low connection values. However, selecting an excessively high threshold may eliminate many relevant connections and thereby reduce accuracy (as demonstrated in Figure 5a-d, where accuracy decreases with increasing thresholds). To strike a balance, we set the threshold at the elbow of each curve, which is similar in concept to the elbow criterion used in k-means clustering. This choice retains meaningful connections while removing redundant, noisy ones. Our hypothesis is supported by the accuracy results presented in Figure 5. Additionally, since each dataset exhibits a distinct distribution of connection values, the location of the elbow varies accordingly. This accounts for the differences in cutoff threshold across datasets.

Limitations and Future Research
The hyperparameters used in this paper were obtained from our previous works [12,18], where a hyperparameter tuning approach was employed to select the parameters yielding the best results. We applied the same parameters in this paper and achieved good results. However, it must be noted that extensive tuning of hyperparameters to a given dataset makes the model overfit the data and hence makes it less generalizable. This is not desirable in clinical diagnostic applications, since there is wide variability in the human population and we want these models to be generally applicable.
Ensemble methods can combine multiple deep neural networks to achieve more stable and generalizable predictions by mitigating variance and reducing generalization errors. However, due to the distinct characteristics and nature of GANs and GCNs, the development of ensemble frameworks for these techniques remains incomplete. While implementing this method requires careful planning and a significant time investment, its potential benefits are substantial. In our future work, we aim to explore the integration of GANs and GCNs to investigate whether this combination can lead to further improvements in accuracy.
Interpretability is considered a crucial factor when integrating deep learning into clinical practice. In our study, we employed GCN coupled with a top-k pooling method. This approach offers interpretability by selecting a set of k brain regions that are most predictive of brain disorders. These identified regions have the potential to serve as biomarkers, helping in the early detection of diseases. Although this paper has not presented such results, the methods hold significant potential, and we plan to implement them in future work.
GCN illustrates the effectiveness of applying graph neural networks to graph-structured data by achieving the highest performance on the ADHD dataset and comparatively good results on the other datasets. One way to improve GCN is to train node embeddings in a lower-dimensional space instead of directly using row vectors as feature vectors [50]. This technique uses an encoder-decoder framework that can better capture the information contained in the data. The design of the adjacency matrix also plays an essential role. Instead of static non-directional graphs obtained from FC, directional graphs can be obtained using effective connectivity [51]. The graphs could also be computed across different blocks of time to estimate the dynamics [52]. These types of advanced graphical features, when used with GCN, have the potential to improve our understanding of the mechanisms underlying neuronal dynamics by examining alterations between patients and healthy controls.

Conclusions
We identified two major challenges for the application of deep learning to neuroimaging-based diagnostic classification: the small sample sizes of patient data and the incompatibility between the graphical features of brain networks and the architectures of traditional deep learning models. We have illustrated how these issues can be addressed using brain connectivity features from four different clinical datasets. The patient data scarcity issue was addressed using GANs, while GCNs allowed us to conveniently handle graph-based features within a deep learning framework. Both GAN and GCN provided the best and second best accuracy for the four clinical datasets we used.
the Department of the Army or the Department of Defense or the United States Government. The investigators have adhered to the policies for protection of human subjects as prescribed in AR 70-25.

Figure 2. Illustration of the GAN model proposed previously [12], which we have used in this work. The generator produces a synthetic functional connectivity matrix from the combined input of random noise and feature codes (gender, age, and label). The discriminator was trained on both real FC data and synthesized FC data produced by the generator.

Figure 3. Percentages of edges preserved when the cutoff threshold is varied for each dataset.

Figure 4. Illustration of the models' performance sorted from worst to best for each dataset.

Figure 5. GCN's performance on different thresholds for each dataset.

Table 1. Performance comparison of models on the ADHD dataset for multinomial (a) and binary (b) classification (red indicates the best performance, blue the second best).

Table 2. Performance comparison of models on the ABIDE-II dataset for multinomial (a) and binary (b) classification (red indicates the best performance, blue the second best).

Table 3. Performance comparison of models on the PTSD dataset for multinomial (a) and binary (b) classification (red indicates the best performance, blue the second best).

Table 4. Performance comparison of models on the ADNI dataset for multinomial (a) and binary (b) classification (red indicates the best performance, blue the second best).

Table 5. The p-values of the Bernoulli test for all the models. Significance was defined at α = 0.05.

Table 6. The p-values of the Wilcoxon rank-sum test for the comparisons of GAN with the other models (a) and GCN with the other models (b) on all the datasets. Significance was defined at α = 0.05.

Table 7. Comparison of the computational time (in seconds) required to train each model.