A Similarity Measure-Based Approach Using RS-fMRI Data for Autism Spectrum Disorder Diagnosis

Autism spectrum disorder (ASD) is a lifelong neurodevelopmental disorder that seriously reduces patients' quality of life. Early diagnosis is generally beneficial for improving the quality of life of children with ASD. Current ASD diagnosis methods built on samples from multiple sites generalize poorly due to the heterogeneity of multi-site data. To address this problem, this paper presents a similarity measure-based approach for ASD diagnosis. Specifically, a few-shot learning strategy is used to measure potential similarities in the RS-fMRI data distributions, and a similarity function for samples from multiple sites is trained to enhance generalization. On the ABIDE database, the presented approach is compared with representative methods, such as SVM and random forest, in terms of accuracy, precision, and F1 score. The experimental results show that the proposed method outperforms the comparison methods to varying degrees; for example, its accuracy on the TRINITY site is more than 5% higher than that of the comparison methods, which demonstrates that the presented approach achieves better generalization performance.


Introduction
Children with autism spectrum disorder (ASD) present a range of phenotypes, such as social and communication difficulties, restricted interests, and cognitive impairments [1]. Autistic children typically face lifelong intellectual disability, and their quality of life is often quite low. Unfortunately, ASD cannot yet be cured. Nevertheless, diagnosis and intervention as early as possible play an important role in improving the quality of life of children with ASD. Neuroimaging has significantly advanced our understanding of the underlying pathological mechanisms of brain diseases [2][3][4][5] and has therefore also been employed to diagnose ASD [6][7][8].
In neuroimaging, resting-state functional magnetic resonance imaging (RS-fMRI) utilizes blood oxygen level-dependent (BOLD) signals to explore biomarkers of nervous system diseases [9][10][11][12]. Recently, RS-fMRI-based ASD diagnosis has made significant progress [13][14][15][16]. For instance, Zhao et al. [14] presented a multi-view high-order functional connectivity network (FCN) based on RS-fMRI data for ASD vs. normal control (NC) classification. However, these approaches were designed on single-site data and therefore cannot be generalized to other sites' data because of the heterogeneity of the data from different sites. Additionally, approaches developed from a small number of samples may overfit.
The RS-fMRI samples from multiple sites are somewhat heterogeneous due to differences in scanner type and imaging acquisition protocol. In particular, the heterogeneity of RS-fMRI samples tends to cause low generalization performance [17][18][19]. To deal with this problem, many diagnostic models of ASD based on samples from multiple sites have been explored [20,21]; these can be roughly categorized into two types. The first type ignores the heterogeneity [19,20] by assuming that samples from multiple sites are collected from the same or similar distributions. For instance, Brown et al. [21] proposed an element-wise layer for DNNs to predict ASD without considering the heterogeneity of data from different sites. The second type aims to avoid the adverse effect of data heterogeneity on the results [22][23][24]. For example, Niu et al. [22] proposed a multi-channel deep attention neural network that captures the correlations in multi-site data by integrating multi-layer neural networks, attention mechanisms, and feature fusion. However, these methods require extensive data that are hard to collect.
To improve generalization with a limited number of RS-fMRI samples, this study presents a similarity measure-based method for early ASD diagnosis. First, a Siamese network is devised to reduce the negative effect of heterogeneity on the performance of the model. Afterward, we design an independent objective function for each training site; the total objective function, obtained by summing the per-site objectives, is used to train the parameters via the backpropagation algorithm. Finally, a few samples that the model has never seen before are used to fine-tune the parameters so that the model adapts to new samples.

Problem Formulation
To improve the generalization and accuracy of ASD diagnosis, we follow the few-shot learning idea [25,26] to design the similarity measure-based approach in this study. Specifically, few-shot learning is formulated as learning a classifier that recognizes the remaining samples given a few labeled samples for each class in the target site. If the classifier is trained directly with traditional optimization algorithms, it is hard to obtain satisfactory performance due to the heterogeneity. Under the few-shot learning strategy, some imaging sites are used to train the parameters, while others are used to fine-tune the model parameters and evaluate the performance of the model. Accordingly, the whole dataset used in this study is partitioned into training sites, target sites, and a baseline site. In particular, the training sets are further split into meta-training sets and meta-test sets, and we aim to accurately match the samples of the target site to the baseline site.

Materials and Methods
This section illustrates the proposed method in detail, especially the Siamese network and the training strategy.

Materials
In this work, the studied RS-fMRI data were sampled from the Autism Brain Imaging Data Exchange (ABIDE), which aims to facilitate discovery and comparison in ASD research. We included data from the C-PAC preprocessing pipeline of ABIDE [27]. Subjects whose original imaging data were missing at some time points were excluded. After preprocessing, the RS-fMRI data from 12 imaging sites (402 ASD patients and 423 NCs) were included in the final analysis. The detailed demographic information, including age and sex, is summarized in Table 1. In this work, the AAL atlas [28] with 116 brain regions is used for brain parcellation.

Constructing Dynamic FCN
Functional connectivity exploits the temporal correlation of BOLD signals in different brain regions to demonstrate how structurally segregated and functionally specialized brain regions interact. FCNs are of great significance for discovering the functional organization of the human brain and for finding biomarkers of neuropsychiatric diseases [10,29,30]. Studies have shown that dynamic functional connectivity patterns contribute significantly to the diagnosis of neurological diseases [9,31,32]. The sliding-window strategy is a popular method for constructing a dynamic FCN (D-FCN); the construction process is illustrated in Figure 1. A D-FCN is a sequential collection of sub-networks created by dividing the entire RS-fMRI time series into multiple overlapping sub-segments, each of which is constructed as a sub-network that reflects short-term correlations. The short-term correlation in the k-th window is calculated by Pearson's correlation coefficient as follows:

FC_{ij}(k) = \frac{\sum_{m=1}^{M}\left(x_{im}(k)-\bar{x}_i(k)\right)\left(x_{jm}(k)-\bar{x}_j(k)\right)}{\sqrt{\sum_{m=1}^{M}\left(x_{im}(k)-\bar{x}_i(k)\right)^2}\sqrt{\sum_{m=1}^{M}\left(x_{jm}(k)-\bar{x}_j(k)\right)^2}}, (1)

where M is the length of a segment, x_i(k) denotes the BOLD signal of the i-th ROI in the k-th window, \bar{x}_i(k) is the average value of all elements in x_i(k), and x_{im}(k) is the m-th element of x_i(k). Thus, a sub-FCN of the D-FCN is constructed as D(k) = [FC_{ij}(k)], 1 ≤ k ≤ K, and the corresponding D-FCN can be represented as D = [D(1), ..., D(k), ..., D(K)], where K is the total number of segments. In this work, the mean matrix of the D-FCN [14,15,33] is used as the input. Taking into account the symmetry of the functional connectivity matrix, its off-diagonal upper triangular part is vectorized as the feature vector.
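To make the construction concrete, the sliding-window D-FCN and its vectorized mean matrix can be sketched in Python with NumPy; the window length and step size below are illustrative choices, not values reported in this work:

```python
import numpy as np

def dynamic_fcn_features(bold, window, step):
    """bold: (T, R) BOLD time series for R ROIs.
    Returns the vectorized mean matrix of the D-FCN."""
    T, R = bold.shape
    sub_networks = []
    for start in range(0, T - window + 1, step):
        segment = bold[start:start + window]          # one overlapping sub-segment
        sub_networks.append(np.corrcoef(segment.T))   # Pearson correlation sub-FCN
    mean_fc = np.mean(sub_networks, axis=0)           # mean matrix of the D-FCN
    iu = np.triu_indices(R, k=1)                      # off-diagonal upper triangle
    return mean_fc[iu]
```

With the 116-region AAL atlas, the upper triangle yields 116 × 115 / 2 = 6670 features, matching the input dimension of the Siamese network described below.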

Siamese Network Framework
Recently, deep neural networks have been used broadly in various fields [34][35][36][37][38]. However, training neural networks requires abundant data, which is difficult to collect in some fields. Few-shot learning [25,26] was proposed to solve this problem, and reliable results have been obtained in numerous studies. Few-shot learning is usually implemented with metric learning methods, such as prototypical networks and Siamese networks.
As shown in Figure 2, the Siamese network [39,40] consists of two sub-networks that share parameters. In this work, the Siamese network consists of two identical feature extractors, an autoencoder, and fully connected layers; the input is a 6670-dimensional feature vector extracted from each subject, and the output is the similarity of two subjects. The similarity calculation unit is formulated as:

dist(\tilde{x}_1, \tilde{x}_2) = \|\tilde{x}_1 - \tilde{x}_2\|_2, (2)

where \tilde{x}_1 and \tilde{x}_2 are the outputs of the two feature extractors, respectively. For an intuitive comparison, the sigmoid function is used to map dist(\tilde{x}_1, \tilde{x}_2) into the range (0, 1). The loss function is important for artificial neural networks to generate separable representations for unseen classes. In this work, the mean square error is employed as the loss function:

L = \left(y - \sigma(dist(\tilde{x}_1, \tilde{x}_2))\right)^2, (3)

where y is the label indicating whether the two subjects are from the same category, that is: y = 0 if the two subjects are from the same category, and y = 1 otherwise. As illustrated in Figure 2, training the Siamese network requires paired samples as input, and the model parameters are optimized by minimizing the loss function. In the settings of this section, the similarity is measured by the distance (Equation (2)), so when minimizing the loss function (Equation (3)), the distance between paired samples from the same category should be small, and the distance between paired samples from different categories should be large.
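A minimal sketch of the similarity unit and loss, assuming the Euclidean norm for the distance (the norm type is an assumption of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_similarity(f1, f2):
    # Distance between the two feature-extractor outputs, mapped into a
    # bounded range with the sigmoid function. The Euclidean norm is an
    # assumption; the choice of norm does not change the overall scheme.
    return sigmoid(np.linalg.norm(f1 - f2))

def pair_loss(f1, f2, y):
    # Mean square error between the mapped distance and the pair label y
    # (y = 0 if the two subjects share a category, 1 otherwise).
    return (pair_similarity(f1, f2) - y) ** 2
```

Identical features give a distance of zero, so their mapped similarity is sigmoid(0) = 0.5, the lower end of the reachable range.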
In the test phase, the test dataset is divided into a support set and a query set. Suppose that there are C categories in the dataset and N samples in the support set of each category; such a few-shot classification task is called a C-way N-shot task. Figure 3 shows the test flowchart of the trained Siamese network model. In our case, there are two categories (NC and ASD), so it is a two-way N-shot task. Given a query sample Q from the query set, the distances/similarities between Q and the N samples in each of the two support sets are calculated to predict the label of Q.
If the similarity between Q and the NC support set is higher, the label of Q is predicted to be NC; otherwise, the label of Q is predicted to be ASD.
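The decision rule above can be written as a small helper; the two similarity arrays are hypothetical inputs holding the similarities between Q and the N support samples of each class:

```python
import numpy as np

def predict_query(sim_to_nc, sim_to_asd):
    # Two-way N-shot decision: the query takes the label of the support
    # set to which it is more similar on average.
    if np.mean(sim_to_nc) > np.mean(sim_to_asd):
        return "NC"
    return "ASD"
```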

Few-Shot Training Strategy
Because the underlying pathology of ASD is similar across sites, it can reasonably be assumed that the data extracted from multiple sites share an inherent underlying data structure [23]. The baseline set is used in both the training and the final performance test. In other words, we use the data of a specific site as the baseline and compare the subjects from other imaging sites with the baseline. Our goal is to train the model parameters so that the distance/similarity comparison between other sites' data and the baseline site's data achieves satisfactory results. That is to say, in addition to training the model parameters, a site should be chosen as the baseline site during training, and this site is used as the support set in the test phase.
Unlike the traditional few-shot task (the C-way N-shot task), a specific site is selected here as the baseline set. It is worth noting that the number of NC samples in the baseline set may not equal the number of ASD samples; therefore, the task can be called a C-way (C = 2 in our case) task, but N is uncertain. In this work, we borrow from the prototypical network [41] to convert the few-shot task into a two-way one-shot task. Specifically, the average of the features (extracted by the feature extractor) of the NC/ASD samples in the baseline set is used as the prototypical feature of each class to calculate the similarity with other samples.
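The prototype computation can be sketched as follows, under the assumption that `features` is an (n, d) matrix of extracted baseline-set features and `labels` encodes NC as 0 and ASD as 1:

```python
import numpy as np

def class_prototypes(features, labels):
    # Average the extracted features per class. The two class means act as
    # prototypical features, turning the C-way task with uncertain N into
    # a two-way one-shot comparison against one prototype per class.
    return {int(c): features[labels == c].mean(axis=0)
            for c in np.unique(labels)}
```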
During training, the data of each training site are partitioned into a meta-training set and a meta-test set. The training process is shown in Figure 4. The model parameters are determined by minimizing the loss between the meta-training set and the baseline set; the meta-test set is used to verify the performance of the model in each iteration. Each training site yields an independent loss. Supposing that there are K sites for training, the sum of the K losses is used as the total loss for tuning the parameters:

L_{total} = \sum_{k=1}^{K} L_k, (4)

where L_k is the loss of the k-th training site. Using the total loss as the objective function prevents the model from overfitting the data of any single site during training and promotes the model's learning of the "basic concept" of similarity.
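The per-site losses and their sum can be sketched as follows; the pair similarities and labels are placeholders standing in for the meta-training pairs of each site:

```python
import numpy as np

def site_loss(similarities, pair_labels):
    # Mean square error over one training site's meta-training pairs.
    return np.mean((np.asarray(similarities, dtype=float)
                    - np.asarray(pair_labels, dtype=float)) ** 2)

def total_loss(per_site_pairs):
    # per_site_pairs: list of (similarities, labels), one entry per training
    # site; the K independent site losses are summed into the objective.
    return sum(site_loss(s, y) for s, y in per_site_pairs)
```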

Experimental Setup
We selected five imaging sites, UCLA, UM, USM, NYU, and YALE, as the baseline and training sets. In each training set, 70% of the samples are used to train the model parameters, and the remaining 30% are used to evaluate the performance of the model during training. During the training process, one of the five sites is selected as the baseline set, and the other four serve as training sets. The model parameters are determined by minimizing the loss between the baseline set and the training sets.
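The per-site 70/30 split can be reproduced with scikit-learn; the toy arrays and the stratify and random_state arguments are our additions for a balanced, repeatable illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6670))   # illustrative feature vectors
y = np.repeat([0, 1], 50)              # 0 = NC, 1 = ASD (illustrative)

# 70% to train the model parameters, 30% to monitor performance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```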
Additionally, another seven imaging sites, independent of the training sets, are used to test the performance of the final model. The test strategy is to predict the category by comparing the similarity between the target subject and the NC and ASD samples in the baseline set. Specifically, if the similarity between the target subject and the NC samples in the baseline set is higher, the label is predicted to be NC; otherwise, the label is predicted to be ASD.
In this article, the accuracy, precision, and F1 score are used as the criteria for experimental performance evaluation.
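These criteria can be computed with scikit-learn's metrics; the labels below are toy values for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

y_true = [0, 0, 1, 1]   # toy ground-truth labels (1 = ASD)
y_pred = [0, 1, 1, 1]   # toy predictions

acc = accuracy_score(y_true, y_pred)    # 3 of 4 correct: 0.75
prec = precision_score(y_true, y_pred)  # 2/3: one NC predicted as ASD
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```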

Classification Performance on Meta-Test Sets
When one site was selected as the baseline site, the average meta-test performance of the other four sites is presented in Table 2; in turn, each site except for the target sites was selected as the baseline site. From Table 2, we can make the following observations. First, the selected baseline site affects the performance of the model. Due to the heterogeneity of the data among different imaging sites, it is difficult for the learner to make the loss function of each site converge to a certain range, even though the total loss may converge. Therefore, when different baseline sets are selected, the learner updates the model parameters through different loss values; as a result, changing the baseline set changes the classification performance of the model. Second, the proposed model is effective on the training sites. Although the meta-test sets are selected from the training sites, they are not used to update or tune the model parameters. Every classification accuracy is greater than 60%, indicating that the model is effective on the training sites. Finally, the average accuracy is between 60% and 70%, which can reasonably be taken to mean that the model neither overfits nor underfits the data of the training sites.
The loss curves shown in Figure 5 are based on the following experimental settings: the RS-fMRI data from the YALE imaging site are selected as the baseline set, and the data from the UCLA, UM, USM, and NYU imaging sites are selected as the training sets. As can be seen from Figure 5, the loss functions converge well.

Generalization Performance on Target Sites
To validate the generalization performance of the proposed model, we compared it with several classical methods, including the support vector machine (SVM), stacked autoencoder (SAE), and random forest (RF). The SVM and RF are implemented using the Python-based sklearn library with default parameters. The SAE consists of three fully connected layers with tanh activation functions and uses two fully connected layers for classification.
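A minimal sketch of the SVM and RF baselines with sklearn defaults, trained here on toy data rather than ABIDE features:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 20))
y = (X[:, 0] > 0).astype(int)        # toy, easily separable labels

svm = SVC().fit(X, y)                # default parameters, as in this work
rf = RandomForestClassifier(random_state=0).fit(X, y)

svm_pred = svm.predict(X)
rf_pred = rf.predict(X)
```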
In the current experiment, the RS-fMRI data from the YALE imaging site are selected as the baseline set for the proposed method, and several subjects' data from each target site are selected to fine-tune the model parameters. To avoid sample imbalance, some data are randomly removed from the categories with plenty of samples so that the numbers of ASD and NC samples are equal. For a consistent comparison, the RS-fMRI data from the five imaging sites (i.e., UCLA, UM, USM, NYU, and YALE) were combined for training the SVM, RF, and SAE. The data from the target site are used as a test set to evaluate performance. In the training process, 70% of the training samples are used to train the comparison methods' parameters, and the remaining 30% are used to evaluate the performance of the models during training. The results are summarized in Table 3. From the results, the proposed method generalizes to other imaging sites better than the competing methods.
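The random undersampling used to balance the classes can be sketched as follows (the seed parameter is our addition for reproducibility):

```python
import numpy as np

def balance_classes(features, labels, seed=0):
    # Randomly drop samples from the larger class so that the numbers
    # of NC (0) and ASD (1) samples are equal.
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(labels == 0)
    idx1 = np.flatnonzero(labels == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, n, replace=False),
                           rng.choice(idx1, n, replace=False)])
    return features[keep], labels[keep]
```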

Discussion
In light of their simplicity and ease of implementation, classic machine learning methods, such as the SVM and RF, have been widely used in previous RS-fMRI studies. However, to avert inter-site heterogeneity, these methods are often modeled and validated on single-site data, which restricts their generalization to other imaging sites. Since the SAE easily overfits when dealing with small-sample data, we merged the data from the training sites for training the SAE. In general, the SAE reshapes feature patterns in vector form to learn more informative high-level features for diagnosing ASD, but it still struggles to generalize to other imaging sites in the face of heterogeneous data. Few-shot learning works by transferring knowledge learned during training tasks to unseen tasks. In our case, we chose four training sites and one baseline site and expected the model to learn general concepts for distinguishing ASD from NC. This general capability may not be sufficient to handle some unique characteristics of the target site, so fine-tuning the model parameters with a small amount of data from the target site is required. From the experimental results summarized in Table 3, the generalization performance of the proposed model outperforms the comparison methods, indicating that our method is effective for classifying ASD based on multi-site data.
It can be seen from Figure 5 that, after about 100 training steps, the loss values become stable, showing that the proposed method converges rapidly. In addition, the trends of the four curves are similar, and there is no large difference in their loss values; except for one curve, the other three almost coincide. This shows that the model does not overfit any particular training site and is balanced across the multiple training sites.

Conclusions
This paper presents a similarity measure-based approach for ASD diagnosis with RS-fMRI data. A unique property of this study is that samples from multiple sites are used to co-learn a similarity function with the baseline site, enabling the presented approach to achieve good generalization on unseen samples from target sites. Extensive experiments on the ABIDE I dataset show that the proposed approach has robust generalization performance with comparable diagnostic accuracy relative to several well-established methods.
In a previous study, the ComBat harmonization method was used to eliminate the impact of data heterogeneity among sites [42]. In future work, we will consider combining few-shot learning and ComBat to better eliminate the adverse impact of inter-site data heterogeneity. In addition, this work has yielded satisfactory results using only RS-fMRI data, but some recent studies have shown that multimodal data can be used to diagnose neurological diseases, for example by combining RS-fMRI data and structural MRI data. Using multimodal data to diagnose neurological diseases will be the focus of our future work.