Diagnosis of Autism Spectrum Disorder (ASD) Using Recursive Feature Elimination–Graph Neural Network (RFE–GNN) and Phenotypic Feature Extractor (PFE)

Autism spectrum disorder (ASD) is a multifaceted neurodevelopmental condition that significantly impacts children's social, behavioral, and communicative capacities. Despite extensive research, the precise etiological origins of ASD remain elusive, although connections to brain activity have been observed. In this study, we propose a novel framework for ASD detection that extracts the characteristics of functional magnetic resonance imaging (fMRI) data and phenotypic data separately. Specifically, we employ recursive feature elimination (RFE) for feature selection on the fMRI data and subsequently apply graph neural networks (GNNs) to extract informative features from the selected data. Moreover, we devise a phenotypic feature extractor (PFE) to extract phenotypic features effectively. We then synergistically fuse the features and validate the framework on the ABIDE datasets, achieving 78.7% and 80.6% accuracy, respectively, thereby showcasing competitive performance compared to state-of-the-art methods. The proposed framework provides a promising direction for the development of effective diagnostic tools for ASD.


Introduction
ASD is a neurodevelopmental disorder characterized by social communication impairment, restricted interests, repetitive and stereotyped behaviors, and sensory abnormalities [1]. Treatment for ASD requires substantial financial resources and greatly affects the daily lives of patients and their families. Moreover, diagnosing ASD requires broad and systematic medical knowledge and is also subject to the physician's subjective judgment. Computer-aided diagnosis (CAD) [2] can therefore provide an effective path toward objective diagnosis.
As fMRI can show changes in blood flow in different areas of the brain, most researchers use fMRI to explore the causes of ASD. Mostafa et al. [3] manually defined some brain features and then used LR, SVM [4], LDA, and KNN to classify fMRI data, achieving 77% accuracy. The manual approach to defining features has significant limitations: it requires knowledge of the relevant domain, and the features are limited to what humans can formulate. Shao et al. [5] and Wang et al. [6] proposed their own SVM-based methods to detect attention deficit hyperactivity disorder (ADHD) and ASD, obtaining accuracies of 77% and 68%, respectively. Methods based on conventional machine learning achieve good classification results thanks to their well-designed feature selection (dimensionality reduction) algorithms; however, they can only handle a limited amount of data and cannot yield good results as the amount of data increases. In addition to traditional machine learning, some researchers have used deep learning algorithms to classify ASD and typical control (TC) subjects. Hu et al. [7] proposed a fully connected neural network model to classify ASD patients and TC, and they also explained the features based on the weights of the model. Although a 69.81% accuracy was achieved, the network was designed manually, with a fixed number of layers and nodes and limited data, making their model difficult to generalize. CNN, the most widely used neural network, has shown excellent performance in many fields [8,9]. Sherkatghanad et al. [10] directly applied their CNN model to fMRI data to detect ASD. Even though an accuracy of 70.22% was achieved, the model takes too long to train due to its large number of parameters, and more images are needed to train a more robust model. In addition, some researchers fed brain correlation matrices and topological features obtained from processed fMRI data into their proposed neural networks and achieved 70.2% accuracy [11]. Their model takes the topological features of each subject and the connections between the features as input, making it possible to achieve good results while increasing the amount of data the model can handle. It is worth noting that the fMRI data used in these studies are generally sourced from the ABIDE dataset [12]. This dataset increases sample size at the cost of uncontrolled heterogeneity, which introduces disturbances into the data samples as well as some loss in classification accuracy.
Brain activity crucially relies on communication between distinct neurons, akin to the node representation update process in graph neural networks (GNNs). In the cited research, both traditional machine learning and deep learning methodologies primarily focus on processing fMRI data to fit their models, e.g., CNNs for direct feature extraction on the images and LSTMs for the time series extracted from the fMRI data. What is missing are models that effectively simulate inter-neuronal communication within the brain. To address this issue, many investigators use GNNs to detect ASD; some have harnessed fMRI data for training GNNs, enabling the simulation of communication patterns among different brain regions through node feature updates [13][14][15]. The potential of a mutual information (MI) loss (Infomax) has been investigated to help the model learn better graph embeddings and increase its robustness [13]. Two studies used graph attention networks to process the input data [14,15]: the former proposed a new method called Pearson correlation-based spatially constrained representation (PSCR), resulting in an accuracy of 72.4%, while the latter enhanced their model by designing a new attention network and using a larger synthetic graph dataset with 4000 subjects, obtaining an accuracy of 68.02%. These methods fit the input patterns of the GNNs by calculating the functional connectivity of the fMRI data. However, this causes a problem: the number of input subjects is small, while the dimensionality of the functional connectivity matrix is large, leading to model overfitting. To address this, some researchers have used a combination of simple models [16], while others have exploited the similarity between samples to make greater use of the data [17]; both approaches led to some improvement in model accuracy. Many studies have demonstrated the effectiveness of recursive feature elimination (RFE) in the diagnosis of ASD [18][19][20][21]. Some used RFE for dimensionality reduction of the input data, some to reduce the computational complexity of the model, and others for feature selection, and all achieved good results. However, they ignored the contribution of phenotypic data to the classification results. Though several papers incorporate phenotypic data [22][23][24], they predominantly utilize the phenotypic similarity of subjects for graph construction and do not extract features from the phenotypic data, relegating it to auxiliary status within their studies.
To solve the above problems, we propose a novel framework that combines RFE with GNNs. Moreover, we explore the correlation between ASD emergence and specific phenotypic data such as age and gender. Our framework separates the feature extraction processes for image data and phenotypic data before unifying the features to discern ASD from TC.
The key contributions of this paper can be succinctly summarized as follows:
1. There is a limited sample size in medical datasets and a substantial number of feature dimensions in graph data.

Data Acquisition and Preprocessing
The Autism Brain Imaging Data Exchange I (ABIDE I) is a shared repository comprising resting-state fMRI data, anatomical data, and phenotypic data. The extensive heterogeneity arising from the amalgamation of data across 17 distinct collection sites often leads researchers to test their models on partial datasets. In our investigation, we excluded incomplete data, focusing on a carefully selected cohort of 1035 subjects (505 ASD and 530 typical control) to ascertain the efficacy of the proposed framework. Detailed subject information is presented in Table 1. To preprocess the selected image data, we adopted the widely used Configurable Pipeline for the Analysis of Connectomes (CPAC).
We used the brain anatomical template cc200 [25] to divide the preprocessed fMRI data into 200 ROIs. To transform the data into the input mode of the graph neural network, we converted each subject into an undirected graph, G = (V, E, Ω), where V = [v_1, ..., v_N] denotes the node set; E = [e_1, ..., e_N] represents the edge set, a collection of pairs (v_i, v_j) linking vertex v_i to vertex v_j; and Ω = [ω_1, ..., ω_N] represents the edge weight set. G has an associated node feature set, H = [h_1, ..., h_N], where h_i is the feature vector associated with node v_i. We use the Pearson correlation coefficient between different ROIs as the node feature in the graph, which is calculated as follows:

r(\chi, \Upsilon) = \frac{\sum_i (\chi_i - \bar{\chi})(\Upsilon_i - \bar{\Upsilon})}{\sqrt{\sum_i (\chi_i - \bar{\chi})^2} \sqrt{\sum_i (\Upsilon_i - \bar{\Upsilon})^2}}    (1)

where χ and Υ represent the time series of different ROIs and χ_i and Υ_i represent the i-th time point of χ and Υ, respectively. Based on the Pearson correlation coefficient, we used the absolute value of partial correlation coefficients as the edge weights to alleviate over-smoothing.
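As a concrete illustration, the Pearson-based node features can be computed in a few lines. This is a sketch on synthetic data, not the authors' pipeline; `node_features` is a hypothetical helper name, and the random time series merely stand in for preprocessed ROI signals:

```python
import numpy as np

def node_features(ts: np.ndarray) -> np.ndarray:
    """Pearson correlation between ROI time series.

    ts: (T, R) array -- T time points for R ROIs (R = 200 for cc200).
    Returns an (R, R) matrix; row i is the feature vector h_i of node v_i.
    """
    return np.corrcoef(ts.T)  # pairwise Pearson correlation of the columns

rng = np.random.default_rng(0)
H = node_features(rng.standard_normal((120, 200)))  # 120 time points, 200 ROIs
assert H.shape == (200, 200)
assert np.allclose(np.diag(H), 1.0)  # each ROI correlates perfectly with itself
```

Since H is symmetric, only the upper triangle (200 × 199 ÷ 2 = 19,900 values) carries independent information, which is the dimensionality figure discussed later.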
For phenotypic data, based on previous research [26], we selected the data collection site, sex, age, FIQ, VIQ, PIQ, eye_at_scan, and handedness_category as phenotypic features. We used one-hot encoding for the categorical features and normalized the numerical features to the range [0, 1] to eliminate the impact of different scales. Finally, we concatenated the categorical and numerical features into the final phenotypic feature vector, P = [p_1, ..., p_N]. The procedure can be formulated as Equation (2):

p_i = O(p_i^{cat}) ⊕ S(p_i^{num})    (2)

where p_i denotes the i-th subject's phenotypic feature vector; p_i^{cat} and p_i^{num} indicate its categorical and numerical feature vectors, respectively; O(·) and S(·) denote one-hot encoding and scaling; and ⊕ denotes concatenation. For the ABIDE II dataset, the specific information is shown in Table 2. We selected 1110 samples for our study, including 518 ASD and 592 TC. We used the Data Processing Assistant for Resting-State fMRI (DPARSF) pipeline [27] for preprocessing; more details are available at http://rfmri.org/DPARSF (accessed on 25 October 2023). Afterwards, as for ABIDE I, we used the cc200 template to partition the brain into 200 ROIs and construct the graph structure.
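The encoding step can be sketched as follows. This is an illustrative reading of the procedure with made-up vocabularies and value ranges; `encode_phenotype` and all its arguments are hypothetical names, and min-max scaling is assumed for S(·):

```python
import numpy as np

def encode_phenotype(cat, cat_vocab, num, num_min, num_max):
    """p_i = O(p_cat) concat S(p_num): one-hot categoricals, scale numericals."""
    # O(.): one-hot encode each categorical value against its vocabulary
    onehots = [np.eye(len(v))[v.index(c)] for c, v in zip(cat, cat_vocab)]
    # S(.): min-max scale numerical features into [0, 1]
    scaled = (num - num_min) / (num_max - num_min)
    return np.concatenate(onehots + [scaled])

# Toy subject: sex = "F", handedness = "R", FIQ = 110 (range assumed 40..160)
p = encode_phenotype(cat=["F", "R"],
                     cat_vocab=[["M", "F"], ["L", "R"]],
                     num=np.array([110.0]),
                     num_min=np.array([40.0]),
                     num_max=np.array([160.0]))
assert p.shape == (5,)            # 2 + 2 one-hot dims plus 1 scaled numeric dim
assert 0.0 <= p[-1] <= 1.0
```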

Methods
This manuscript presents a novel framework, depicted in Figure 1. The image data and phenotypic data are preprocessed in different ways to obtain the functional connectivity matrix and the phenotypic vector, respectively. These are then fed into RFE-GNN for feature extraction, and finally they are concatenated and fed into the fully connected layer for classification.

The Framework of RFE-GNN

Each subject in the original fMRI data is preprocessed into a graph, G = (V, E, Ω), with an associated node feature set, H = [h_1, ..., h_n], where n is the number of nodes in V. Since H is a symmetric matrix and the atlas we used is cc200, the node features for each subject have 19,900 (200 × 199 ÷ 2) dimensions, while the total number of subjects is small, only 1035 in ABIDE I and 1110 in ABIDE II. This causes the typical "small sample, high dimension" problem.
In challenging data scenarios where neural networks struggle to learn meaningful features, often leading to severe overfitting and limited generalization, we employ RFE to mitigate this issue. RFE reduces the feature dimensionality by leveraging a selected estimator. We use an SVM as the estimator, which recursively partitions the feature space into two distinct parts. The separating hyperplane of the SVM can be formulated as Equation (3):

W_SVM · H + b_SVM = 0    (3)

where H denotes the node features and W_SVM and b_SVM are the SVM weights and bias. After each partition, we use the distance between each feature point and the separating hyperplane as its importance score, as formulated in Equation (4):

I_i = |W_SVM · h_i + b_SVM| / ||W_SVM||_2    (4)

where I_i denotes the i-th importance score, h_i denotes the i-th feature, and |·| and ||·||_2 denote the l1 and l2 norms. The feature ranking is then determined from the obtained scores, and features with lower scores are successively eliminated. This iterative process continues until the number of remaining features matches the predetermined value we have set.
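A close analogue of this procedure is available in scikit-learn, whose RFE ranks features by the magnitude of the linear SVM weights (a proxy for the distance-based scores described above). The snippet below is a sketch on synthetic stand-in data, not the authors' exact configuration:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic stand-in for the flattened connectivity features
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 400))   # 100 subjects x 400 features
y = rng.integers(0, 2, size=100)      # ASD / TC labels

# A linear SVM is fit repeatedly; RFE drops the lowest-ranked 10% of
# features each round until n_features_to_select remain.
selector = RFE(estimator=LinearSVC(C=1.0, dual=False, max_iter=5000),
               n_features_to_select=50, step=0.1)
X_sel = selector.fit_transform(X, y)
assert X_sel.shape == (100, 50)
```

The reduced matrix `X_sel` is what would then be fed into the GNN, shrinking the 19,900-dimensional connectivity vectors to a preset size N.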
Given that the underlying graph is an undirected weighted graph, and our objective pertains to graph classification, it becomes imperative to aggregate comprehensive global information from the entire graph. To achieve this, we adopt GraphConv [28], which incorporates edge weights while updating node representations. Specifically, GraphConv multiplies the edge weights with the neighboring node representations, sums them to construct the neighborhood information, and propagates this information to the target node. To mitigate the over-smoothing [29] effect, our proposed framework uses only two layers of GraphConv for learning node embeddings. The specific process can be represented by the following formula:

H_i^{(l+1)} = σ( W_1 H_i^{(l)} + W_2 Σ_{j ∈ N_i} Ω_{ji} H_j^{(l)} )

where W_1 and W_2 are network parameters, H_i^{(l)} indicates the i-th node's feature in the l-th layer representation, N_i represents the set of neighboring nodes of v_i, and Ω_{ji} represents the weight coefficient of the edge between v_i and v_j.
Additionally, we integrate BatchNorm layers after each convolution step to accelerate convergence and reduce overfitting. Global max pooling then selects an appropriate node representation as the overall graph representation:

G = max_{i = 1, ..., N} h_i

where G represents the graph representation, N the total number of nodes in the graph, and h_i the feature representation of v_i in the graph. After graph feature extraction, the data are processed into a two-dimensional vector.
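The two-layer propagation followed by global max pooling can be sketched in plain NumPy. This is a minimal re-implementation of the update rule for illustration only (ReLU is assumed as the activation and BatchNorm is omitted); it is not the torch_geometric GraphConv layer the framework actually uses:

```python
import numpy as np

def graph_conv(H, W1, W2, A):
    """One GraphConv-style layer: self term plus edge-weighted neighbour sum.

    H: (N, d_in) node features; A: (N, N) weighted adjacency (the Omega weights);
    W1, W2: (d_in, d_out) parameter matrices. ReLU activation is assumed.
    """
    return np.maximum(H @ W1 + A @ H @ W2, 0.0)

rng = np.random.default_rng(0)
N, d = 6, 4
H = rng.standard_normal((N, d))
A = np.abs(rng.standard_normal((N, N)))   # |partial correlation| edge weights
np.fill_diagonal(A, 0.0)                  # no self-loops

H1 = graph_conv(H, rng.standard_normal((d, 8)), rng.standard_normal((d, 8)), A)
H2 = graph_conv(H1, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)), A)
g = H2.max(axis=0)                        # global max pooling -> graph embedding
assert H2.shape == (6, 8) and g.shape == (8,)
```

Keeping the depth at two layers, as in the paper, limits how far node information diffuses and thus counteracts over-smoothing.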
In real-world diagnostic contexts, reliance on single-modal data alone often proves inadequate for accurate diagnosis. Professional medical practitioners adopt a holistic approach, analyzing multimodal information about the patient to arrive at comprehensive judgments. Similarly, in computer-aided diagnosis, we capitalize on the complementarity of diverse modalities of information to make well-rounded assessments. Beyond medical imaging data, non-imaging phenotypic data offers valuable supplementary insights into the associations among subjects. After preprocessing, the phenotypic data are encoded into the vector denoted as P. To obtain its latent representation, we design a PFE comprising an MLP with L hidden layers. In this study, we set L to 1, and the output dimension of the MLP matches that of the imaging data features. After processing the original phenotypic data with the MLP, its features can be represented as:

P' = σ_MLP( W_MLP P + b_MLP )

where σ_MLP represents the activation function and W_MLP and b_MLP represent the layer parameters. P and P' refer to the original and the processed phenotypic vector, respectively. Upon extracting features from both image data and phenotypic data, we acquire their respective latent representations. However, directly feeding these representations into the classifier may yield suboptimal results, as unimportant features can influence the outcome, the data varying in importance. To address this concern and effectively fuse information from diverse modalities, we introduce an attention mechanism. The attention mechanism efficiently captures and integrates relevant information, thus enhancing the overall performance of the classification process.
In calculating the attention score, we concatenate the representations of the image data and phenotypic data as the input A:

A = G ⊕ P'

where A represents the attention input and ⊕ represents vector concatenation. We use a scaled dot-product model as the scoring function; the attention coefficient, α, can therefore be expressed as:

α = softmax( A A^T / √D )

where A^T is the transpose of the input vector and D represents the dimension of the input vector.
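One plausible reading of this fusion step, treating the scaled dot-product scoring as an elementwise operation over the concatenated vector, can be sketched as follows. The exact form of the attention computation is an assumption here, and `fuse` and `softmax` are hypothetical helper names:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def fuse(g, p):
    """Concatenate graph and phenotypic embeddings, then reweight each
    dimension with scaled dot-product attention scores (assumed elementwise)."""
    a = np.concatenate([g, p])                 # A = G concat P'
    alpha = softmax(a * a / np.sqrt(a.size))   # scores scaled by sqrt(D)
    return alpha * a                           # attention-weighted classifier input

rng = np.random.default_rng(0)
out = fuse(rng.standard_normal(8), rng.standard_normal(8))
assert out.shape == (16,)
```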
After obtaining the attention score, we multiply it with the original input A to obtain the classifier's input. The classifier consists of two fully connected layers, each followed by ReLU activation. We chose the cross-entropy function as our objective function:

L = -(1/K) Σ_{i=1}^{K} [ y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i) ]

where K represents the number of subjects and y_i and ŷ_i represent the true and predicted values, respectively. We minimize the objective function to obtain the best classification performance.
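The objective can be sketched as a standard binary cross-entropy over K subjects, assuming the model outputs a probability per subject (the clipping constant `eps` is an added numerical safeguard, not from the paper):

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over K subjects."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))

# Three toy subjects: two ASD (label 1), one TC (label 0)
loss = cross_entropy(np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.8]))
assert loss > 0.0   # loss approaches 0 only as predictions approach the labels
```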

Experiments and Results
In this section, we validate the efficacy of our proposed framework in capturing the salient features of multimodal data, leveraging the ABIDE datasets. In Section 3.1, we describe the training steps of the model and some parameter settings. In Section 3.2, we present the formulas for the performance metrics. In Section 3.3, we substantiate the effectiveness of our framework by comparing it with recent methods. In Section 3.4, the raw features and the embeddings learned by RFE-GNN on the ABIDE datasets are visualized. In Section 3.5, ablation experiments shed light on the contribution of individual components; ROC curves are also plotted, providing a more intuitive visualization of the performance improvements achieved by our framework. In Section 3.6, we scrutinize the impact of different hyperparameter values on overall performance.

Training Setup
In this study, we implemented our model in Python using the PyTorch library and trained it on the Linux platform using 12 vCPUs of an Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50 GHz with 43 GB of memory and an NVIDIA GeForce RTX 2080 Ti with 11 GB of GPU memory. The dataset was randomly split into training and test sets at a ratio of 8:2. During training, the fMRI data were preprocessed into functional connectivity matrices, feature-selected using RFE, and then fed into the GNN to extract features. The phenotypic data, on the other hand, were encoded into two-dimensional vectors by one-hot encoding, and feature extraction was performed using the PFE. Afterward, we concatenated the features and used the attention layer to assign them different weights. Finally, the resulting vectors were fed into a fully connected layer to obtain the class to which the subject belonged. The parameter values used in this experiment are shown in Table 3.

Statistical Metrics
As this is a binary classification task, the commonly used evaluation metrics include Accuracy (ACC), Sensitivity (SEN), Precision (PRE), and F1_score, which can be calculated as follows:

ACC = (TP + TN) / (TP + TN + FP + FN)
SEN = RECALL = TP / (TP + FN)
PRE = TP / (TP + FP)
F1_score = 2 × PRE × SEN / (PRE + SEN)

In the above formulas, TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively. ACC is the proportion of correctly predicted samples among all subjects. SEN (RECALL) is the proportion of correctly predicted positive samples among all positive samples. PRE is the proportion of true positive samples among all predicted positive samples. To comprehensively evaluate the model's prediction performance, we use the F1_score, which considers both PRE and SEN. AUC is also used as an evaluation metric; it is obtained by calculating the area under the ROC curve.
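These metrics follow directly from the four confusion-matrix counts; a minimal sketch (`metrics` is a hypothetical helper, and the counts are made-up numbers):

```python
def metrics(tp, fp, tn, fn):
    """Compute ACC, SEN, PRE, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall correctness
    sen = tp / (tp + fn)                    # sensitivity / recall
    pre = tp / (tp + fp)                    # precision
    f1 = 2 * pre * sen / (pre + sen)        # harmonic mean of PRE and SEN
    return acc, sen, pre, f1

acc, sen, pre, f1 = metrics(tp=80, fp=20, tn=70, fn=30)
assert acc == 0.75      # 150 correct out of 200
assert pre == 0.8       # 80 of 100 predicted positives are true positives
```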

Comparison with State-of-the-Art (SOTA) Models
In this section, we summarize the SOTA methods for detecting ASD, which can be divided into two categories based on the datasets. We also plot the distribution of the accuracies of different methods over different sample sizes. It is worth noting that we measured the computational efficiency of our model: following the setup in Section 3.1, training the model for 100 epochs takes 264.95 s, and in validation mode it takes only 0.2 s to evaluate a subject.
According to the data presented in Tables 4 and 5, our framework exhibits superior performance compared to recent SOTA methods. These results substantiate the framework's ability to effectively extract pertinent features from both image data and phenotypic data. As demonstrated in Figure 2, our framework not only surpasses the performance of most current methods but also successfully handles a larger data sample. This robustness and competitiveness affirm the efficacy of our framework in ASD detection. Furthermore, we observe that for various SOTA methods applied to the ABIDE II dataset, the sample size of their models is considerably smaller than that of ABIDE I. A possible reason for this disparity is the lack of preprocessed data available for the ABIDE II dataset. To address this issue, we utilize preprocessed data derived from the procedural description provided in [39] in our study.

Visualization of the Classification
In this section, we use t-SNE [43] to visualize the classification performance of our proposed framework. t-SNE is a data visualization tool that embeds high-dimensional data into two dimensions (commonly after a PCA-based initialization) so it can be represented on a flat plot.
Figures 3 and 4 present compelling visualizations of the features learned by our framework. Upon careful examination, it becomes evident that our proposed approach yields superior separability compared to directly reducing the original data input. This observation strongly indicates the efficacy of our proposed framework in extracting relevant features, thereby enhancing the discriminative capability of the model.

Comparison with Different Models
In this section, we conduct ablation experiments for different methods, showing the effects of the various modules of our model as well as the robustness of the methods. Furthermore, to provide a visual understanding of the classification performance, ROC curves are plotted, facilitating insightful observations and conclusive interpretations.
• SVM: SVM finds a hyperplane in the feature space of the input data such that the distance between each sample and the hyperplane is maximized, achieving the classification of the original data. In the subsequent experiments, we use the SVC model from the sklearn Python library, with the regularization parameter C = 1.0, kernel = 'linear', and the remaining parameters set to their defaults.
• RF [44]: Random Forest is an ensemble algorithm that builds multiple decision trees on the data and combines their results. The RF model is also from the sklearn library, with n_estimators = 1600 and the other parameters set to their defaults.
• GAT [45]: GAT is a graph convolutional layer based on GCN. It uses a self-attention mechanism to aggregate neighboring nodes and adaptively learn the weights of different neighbors. The GATConv we use is from the torch_geometric library, with the layer parameters set to their defaults; the GAT model has two layers.
From Tables 6 and 7, we can see that both RFE and PFE contribute to the improvement of model accuracy, and adding both modules improves accuracy even more. We also applied our modules to different models and found that accuracy improved to varying degrees, indicating that our method is generalizable. As depicted in Figure 5, the AUC denotes the discriminative power of the model, with higher values indicating better performance. Our framework achieves AUC values of 0.82 and 0.86 on the ABIDE I and II datasets, respectively, outperforming the comparative methods.

Hyperparameter Discussion
To investigate the influence of the number of remaining features when employing RFE, we adjust the parameter N and observe its impact on performance. Specifically, we vary N over the range [500, 5000] in increments of 500 and use ACC and AUC to assess performance holistically. The results, depicted in Figure 6, indicate that ACC and AUC attain their peak values when N is set to 2500 for ABIDE I and 3500 for ABIDE II, respectively.

Discussion
Based on the above results on ABIDE, we validate the effectiveness of the proposed framework. In Section 3.3, by examining the number of subjects, accuracy, and sensitivity, we substantiate our model's excellent performance. Visualization of the classification demonstrates the discriminability of the features extracted by our framework, which can better separate ASD and TC. These gains are due to RFE and PFE. RFE, by recursively ranking the features of the graph data and retaining those useful for detecting ASD, effectively overcomes the overfitting caused by the high feature dimensionality relative to the small sample size. PFE extracts the phenotypic features through training, which further improves accuracy. The ablation experiments demonstrate the enhancements from PFE and RFE and also illustrate that combining graph data and phenotypic data to detect ASD is effective and generalizable, providing a possible method for the CAD of ASD.
There are some limitations in our method. First, in constructing graph data from fMRI, we consider relationships at spatial scales but ignore information at temporal scales. Second, in the feature selection for graph data, the method of calculating feature importance has deficiencies. Finally, for the fusion of graph features and phenotypic features, we only perform a simple concatenation and do not consider inter-modal feature relationships. We will explore these aspects in future research.

Conclusions
In this paper, we present a novel framework designed for ASD detection utilizing multimodal data. Our approach incorporates RFE to efficiently reduce the number of features fed into the graph convolution layers, mitigating the risk of overfitting. To further enhance performance, we introduce phenotypic data and devise a PFE for effective feature extraction. Through rigorous comparative analysis against recent SOTA methods, we demonstrate the superiority of our proposed framework. The promising results obtained from our research suggest its potential applicability in the realm of CAD, thus contributing to the advancement of ASD detection.

Figure 2. Comparison of the performance with previous literature on ABIDE I and II.

Figure 3. Visualization of the features learned by different methods in ABIDE I. (a) PCA in ABIDE I. (b) Ours in ABIDE I.
Figure 4. Visualization of the features learned by different methods in ABIDE II. (a) PCA in ABIDE II. (b) Ours in ABIDE II.

Figure 5. ROC curve comparison of different methods. (a) ROC of different methods on ABIDE I. (b) ROC of different methods on ABIDE II.

Figure 6. Comparison of different N with RFE. (a) The impact of our method on ABIDE I. (b) The impact of our method on ABIDE II.

Table 1. ABIDE I demographic information.

Table 2. ABIDE II demographic information.

Table 4. Comparison of the performance of the SOTA methods on ABIDE I.

Table 5. Comparison of the performance of the SOTA methods on ABIDE II.

Table 6. Performance of different methods on ABIDE I.

Table 7. Performance of different methods on ABIDE II.