1. Introduction
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by core features of social communication deficits and repetitive, restrictive behaviors [1]. Epidemiological studies indicate that approximately 1 in 100 children worldwide are diagnosed with autism, placing a significant burden on both families and society [2]. Currently, the clinical diagnosis of ASD relies primarily on the criteria outlined in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5), supplemented by behavioral assessment tools such as the Autism Diagnostic Observation Schedule (ADOS) [3]. However, this diagnostic approach is inherently limited by its subjective nature and cannot uncover the underlying neurobiological mechanisms of the disorder.
The application of magnetic resonance imaging (MRI) in ASD research has been steadily increasing, particularly in the exploration of brain structural and functional abnormalities, and these studies have provided valuable insights into the neurodevelopmental characteristics of ASD. However, although MRI techniques have revealed brain structural and functional changes associated with ASD, these findings have often not been consistently validated across all ASD patients [4]. Structural MRI (sMRI) is primarily used to observe static anatomical features of the brain, focusing on metrics such as regional brain volume, morphology, and gray and white matter density in order to identify potential structural abnormalities. Such studies, however, tend to focus on morphological changes and often cannot fully capture the dynamic functional processes of the brain. In this regard, functional MRI (fMRI) has emerged as an important tool for investigating brain activity and neural network connectivity, offering a new avenue for exploring the neurobiological underpinnings of ASD. Through fMRI, researchers can observe brain activity patterns during different tasks or resting states, revealing changes in functional cooperation and connectivity between brain regions. This dynamic approach provides a unique perspective for understanding the neurobiological mechanisms of ASD, and it holds irreplaceable advantages for uncovering abnormal connectivity within brain functional networks.
In the analysis of sMRI, Ecker et al. [5] employed a multi-parametric classification approach to characterize the complex and subtle gray matter anatomical patterns associated with autism in adults. Jiao et al. [6] used regional cortical thickness extracted from surface-based morphometry to classify ASD, achieving good classification performance. Kong et al. [7] proposed a deep neural network classifier based on stacked autoencoders, which effectively classified ASD by extracting features from structural MRI images and performing feature selection. Ji et al. [8] proposed training 3D Convolutional Neural Networks on whole-brain images to obtain relevant feature information. However, the high-dimensional nature of sMRI data places significant demands on computational hardware, making direct input into networks difficult. To address this, the complete 3D images were divided into smaller patches before being fed into the network, a strategy that not only made full use of whole-brain features but also significantly improved training efficiency. Puranik et al. [9] adopted a dimensionality reduction strategy that sliced 3D sMRI images into 2D slices and combined them with transfer learning for classification, effectively alleviating the computational burden. However, owing to the heterogeneity of brain structure in ASD patients, the generalization ability of such methods remains limited.
In the analysis of functional MRI (fMRI), Nielsen et al. [10] achieved a classification accuracy of 60% based on whole-brain functional connectivity analysis, identifying connection patterns in key brain regions such as the default mode network as the most discriminative. While such methods are simple and effective, they rely on manually designed feature extraction and have difficulty capturing complex nonlinear relationships. With the development of deep learning techniques, researchers have begun to explore more complex model architectures. Heinsfeld et al. [11] employed deep neural networks to analyze multi-center brain imaging data, achieving 70% accuracy in identifying ASD patients from functional connectivity patterns. Eslami et al. [12] used a joint learning framework combining autoencoders and single-layer perceptrons (SLPs) to optimize feature extraction and classification parameters, and enhanced the performance of the diagnostic model with a data augmentation strategy based on linear interpolation in the feature space. Rakić et al. [13] combined the volumes of regions of interest (ROIs) from structural MRI with functional connectivity matrices, trained autoencoders and multi-layer perceptrons, and achieved the highest classification accuracy of 93.18% at the CMU site. Although significant progress has been made in ASD classification based on single-site data, existing methods generally face the key challenge of limited generalization ability. To address this challenge, Epalle et al. [14] proposed an innovative multi-atlas feature fusion strategy, aligning the feature dimensions of resting-state fMRI data from three different sites using fully connected layers to construct a unified multi-site feature representation space. The method achieved cross-site knowledge transfer through feature space alignment, offering a new approach to improving model generalization. Kang et al. [15] proposed a multi-center autism recognition method based on LeNet-5 and an MLP, achieving 93% accuracy in single-site classification and 83.5% accuracy in multi-center classification using glass-brain feature extraction and an SMS data partitioning strategy.
Recently, several deep-learning-based methods have attempted to extract multi-dimensional features. Deng et al. [16] proposed the ST-Transformer, which incorporates a spatial–temporal multi-head attention mechanism to jointly capture spatial structures and temporal dynamics in fMRI signals, while also addressing data imbalance through a Gaussian GAN-based augmentation strategy. Liu et al. [17] introduced a pseudo-4D ResNet model that decomposes spatiotemporal convolutions into parallel 3D spatial and 1D temporal blocks, reducing computational complexity while preserving essential spatiotemporal patterns. Alharthi et al. [18] explored multi-slice generation from both sMRI and fMRI modalities, applying 3D-CNNs together with vision transformer models and leveraging transfer learning to enhance ASD diagnosis under limited data conditions. Although these studies have made notable progress, they still face the following key issues:
Firstly, most studies reduce the dimensionality of 4D fMRI data or use only static functional connectivity matrices, thereby losing valuable spatiotemporal dynamic information. As a neurodevelopmental disorder, ASD-related brain dysfunction is often characterized by dynamic changes, and static analysis methods are insufficient to capture these subtle variations. Although some studies have incorporated temporal features into their models, many still depend on manually crafted slice selection, downsampling procedures, or modality-specific adjustments; such design choices, while enabling the extraction of spatial or temporal representations, may hinder the scalability of the methods and restrict their ability to capture fine-grained temporal dynamics. Secondly, there is significant heterogeneity in data across scanning centers, including differences in scanning parameters, subject populations, and preprocessing pipelines, which severely limits the generalization ability of models. Thirdly, existing methods insufficiently model the dynamic evolution of functional networks and make little effective use of temporal information.
To address these challenges, this study proposes an ASD recognition method that combines 3D Convolutional Neural Networks (3D-CNNs) with a segmented temporal decision network. The method first uses 3D-CNNs to automatically extract high-dimensional spatial features from raw 4D fMRI data, avoiding the information loss that traditional dimensionality reduction techniques can introduce. To capture the temporal dynamics of brain activity, we then designed a segmented Long Short-Term Memory (LSTM) network that divides the time series into physiologically meaningful segments. Gradient Boosting Decision Trees (GBDTs) are subsequently used to classify the concatenated spatiotemporal features, and a voting mechanism integrates the predictions from the different time segments to classify each subject as either ASD or typically developing. This design not only improves the efficiency of spatiotemporal feature extraction but also strengthens the model's ability to learn complex dynamic brain activity patterns while preserving feature integrity. Overall, the main contributions of our work are as follows:
This study proposes a method that combines 3D-CNNs with segmented LSTM networks to automatically extract high-dimensional spatial features from raw 4D fMRI data and capture the temporal dynamics of brain activity through segmented LSTM. This approach effectively leverages the rich information in spatiotemporal data, enabling joint modeling of both spatial and temporal information, thereby enhancing the model’s ability to understand complex brain activity patterns.
This study employs GBDTs for classification and designs a voting mechanism to integrate predictions from different time segments, leading to the final ASD diagnosis. This approach not only enhances the model’s learning capacity but also improves the accuracy and efficiency of ASD identification.
Experimental results on the ABIDE dataset show that our method outperforms existing state-of-the-art approaches, achieving an average accuracy of 0.85. Furthermore, we employed t-SNE dimensionality reduction to illustrate the improved discriminative ability of the spatiotemporal fused features in distinguishing between ASD and typical control groups.
We hypothesize that the proposed deep learning framework, which jointly models spatial and temporal features from fMRI data, will achieve superior classification performance compared to conventional baseline approaches. Furthermore, we expect the model to obtain a robust average classification accuracy exceeding 80% across multiple imaging sites, thereby demonstrating its potential applicability in clinical diagnostic support for ASD.
3. Results
To evaluate the performance of the proposed model, this section reports the experimental results on data from 17 sites in the ABIDE I dataset. Several commonly used evaluation metrics were employed, including accuracy, precision, recall, F1-score, and specificity, which measure the model's performance from different perspectives. In addition, this section compares the performance of different methods on the same datasets and provides a further analysis of the strengths and weaknesses of each approach.
3.1. Performance Metrics
(1) Accuracy
Accuracy represents the proportion of samples correctly predicted by the classification model. It is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN),
where TP (True Positive) is the number of samples that are truly positive and predicted as positive, TN (True Negative) is the number of samples that are truly negative and predicted as negative, FP (False Positive) is the number of samples that are truly negative but predicted as positive, and FN (False Negative) is the number of samples that are truly positive but predicted as negative.
(2) Precision
Precision measures the proportion of predicted positive samples that are actually positive, reflecting the accuracy of the model in predicting the positive class. It is calculated as follows:
Precision = TP / (TP + FP).
(3) Recall
Recall, also known as sensitivity, measures the proportion of actual positive samples that the model correctly identifies as positive, reflecting the model's ability to capture positive-class samples. It is calculated as follows:
Recall = TP / (TP + FN).
(4) F1-score
The F1-score is the harmonic mean of precision and recall, providing a combined evaluation of the two. It is particularly useful under class imbalance because it balances precision and recall. It is calculated as follows:
F1-score = 2 × Precision × Recall / (Precision + Recall).
(5) Specificity
Specificity measures the proportion of actual negative samples that the model correctly identifies as negative, reflecting the model's ability to identify negative-class samples. It is calculated as follows:
Specificity = TN / (TN + FP).
These evaluation metrics provide a comprehensive reflection of the model's classification performance. In particular, for the task of ASD recognition, accuracy, recall, and the F1-score offer important reference points for model evaluation.
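For reference, the following is a minimal sketch of how these metrics can be computed from binary predictions in Python (scikit-learn assumed available); the label convention 1 = ASD and 0 = typical control is an illustrative assumption.

```python
# Minimal sketch: computing the five evaluation metrics from binary predictions.
# Assumes the convention 1 = ASD (positive) and 0 = typical control (negative).
import numpy as np
from sklearn.metrics import confusion_matrix

def evaluate(y_true, y_pred):
    # confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # sensitivity
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}

# Example usage with dummy predictions
print(evaluate(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 0])))
```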
3.2. Experimental Setup
In this study, all experiments were conducted using the PyTorch 1.11.0 framework and were run on an NVIDIA RTX 3090 GPU to ensure efficient model training and evaluation. To comprehensively assess the performance of the proposed model, a k-fold cross-validation strategy was employed. Specifically, in the single-site classification task, the dataset was randomly divided into k equally sized subsets; in each of the k runs, a different subset served as the test set while the remaining k - 1 subsets were used for training, so that every subset was used as the test set exactly once. This procedure minimizes data partition bias and improves the generalizability of the experimental results. Finally, the evaluation metrics from the k runs were averaged to yield the model's final performance.
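As an illustration of this protocol, the following is a minimal sketch of the k-fold evaluation loop, assuming subject-level feature arrays X and labels y are already prepared; the use of StratifiedKFold and the default k = 5 are illustrative assumptions rather than details stated in the text.

```python
# Minimal sketch of the k-fold protocol: each subject appears in the test set exactly once.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_and_eval, k=5, seed=42):
    """build_and_eval(train_idx, test_idx) trains a model and returns a dict of metrics."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    fold_metrics = []
    for train_idx, test_idx in skf.split(X, y):
        fold_metrics.append(build_and_eval(train_idx, test_idx))
    # Average each metric over the k folds to obtain the reported performance.
    return {m: float(np.mean([f[m] for f in fold_metrics])) for m in fold_metrics[0]}
```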
For spatial feature extraction, a 3D-CNN with two convolutional layers was employed. In each layer, multiple parallel convolutional kernels of different sizes were applied to capture multi-scale spatial patterns, and the outputs of the parallel convolutions were aggregated into a unified feature representation. Each convolutional layer was followed by a 3D max-pooling layer with a kernel size of 2 and a stride of 2, which progressively reduced the spatial dimensions while preserving salient features. For temporal feature extraction, an LSTM with 500 hidden units was used. The learning rate was set to 0.001, the batch size to 32, and the model was trained for 50 epochs.
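To make this architecture concrete, the following is a minimal PyTorch sketch of the spatial–temporal extractor under stated assumptions: the parallel kernel sizes (3, 5, 7), the channel counts, concatenation as the aggregation step, and global average pooling before the LSTM are illustrative choices, while the pooling settings (kernel 2, stride 2) and the 500 LSTM hidden units follow the text.

```python
# Minimal sketch of the 3D-CNN + LSTM spatial-temporal feature extractor (PyTorch).
import torch
import torch.nn as nn

class ParallelConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):   # kernel sizes are placeholders
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes
        ])
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)         # as described in the text

    def forward(self, x):                        # x: (batch, in_ch, D, H, W)
        x = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.pool(x)                      # aggregate branches, then downsample

class CNNLSTMExtractor(nn.Module):
    def __init__(self, out_ch=8, lstm_hidden=500):
        super().__init__()
        self.block1 = ParallelConvBlock(1, out_ch)
        self.block2 = ParallelConvBlock(out_ch * 3, out_ch)
        self.gap = nn.AdaptiveAvgPool3d(1)        # per-frame spatial summary (placeholder choice)
        self.lstm = nn.LSTM(out_ch * 3, lstm_hidden, batch_first=True)

    def forward(self, x):                         # x: (batch, time, D, H, W) fMRI volumes
        b, t = x.shape[:2]
        frames = x.reshape(b * t, 1, *x.shape[2:])
        feats = self.gap(self.block2(self.block1(frames))).flatten(1)
        feats = feats.reshape(b, t, -1)            # (batch, time, features)
        out, _ = self.lstm(feats)
        return out[:, -1]                          # last hidden state as the segment feature
```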
In this experiment, the fMRI time series was divided using a sliding window; the time window size, the sliding window size, and the sliding step were selected during hyperparameter tuning.
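The segmentation step can be sketched as follows; since the window size and step are tuned as hyperparameters, the values below are placeholders only.

```python
# Minimal sketch of segmenting a 4D fMRI series into overlapping time windows.
# `window` and `step` are illustrative values, not the tuned hyperparameters.
import numpy as np

def segment_timeseries(volume_4d, window=30, step=15):
    """volume_4d: array of shape (T, D, H, W). Returns a list of (window, D, H, W) segments."""
    segments = []
    for start in range(0, volume_4d.shape[0] - window + 1, step):
        segments.append(volume_4d[start:start + window])
    return segments

# Example: a dummy series of 120 frames yields overlapping segments for the segmented LSTM.
dummy = np.zeros((120, 61, 73, 61), dtype=np.float32)
print(len(segment_timeseries(dummy)))  # number of per-segment inputs fed to the pipeline
```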
3.3. Experimental Results
In this section, we present the experimental results of multiple configurations to validate the effectiveness and superiority of the proposed methods.
3.3.1. Performance Comparison of Different Model Configurations
In this study, to comprehensively evaluate the performance of the proposed 3D-CNN + LSTM + GBDT model and compare it with other commonly used combinations, we designed several experiments. Specifically, we compared the following model configurations: (1) 3D-CNN + LSTM + GBDT, which is our proposed method; (2) 3D-CNN + GBDT; (3) 3D-CNN + RNN + GBDT; (4) 3D-CNN + LSTM + Random Forest (RF); (5) 3D-CNN + RNN + RF. These models were evaluated using both five-fold and ten-fold cross-validation. By comparing these configurations, we aim to validate the advantages of the selected model: configuration (2) was used to verify the effectiveness of temporal–spatial feature fusion, while configurations (3), (4), and (5) were used to determine the optimal choice of temporal feature extractor and classifier.
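For clarity, the classifier stage that differs across configurations (1), (4), and (5) can be sketched as follows; the hyperparameters shown are illustrative assumptions, not the tuned values.

```python
# Minimal sketch of the classifier stage used in the compared configurations.
# Assumes `train_features` holds the concatenated spatiotemporal vectors (one row per segment)
# and `train_labels` the corresponding diagnoses; hyperparameters are illustrative only.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def make_classifier(kind="gbdt"):
    if kind == "gbdt":     # configuration (1): 3D-CNN + LSTM + GBDT
        return GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
    return RandomForestClassifier(n_estimators=200)   # configurations (4)/(5) with RF

# clf = make_classifier("gbdt").fit(train_features, train_labels)
# segment_preds = clf.predict(test_features)  # later aggregated by the voting mechanism
```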
Through five-fold and ten-fold cross-validation, we obtained the average classification results for different models.
Figure 3 and Figure 4 present the classification accuracy for each site under the different model configurations, while Table 3 and Table 4 show the average precision, recall, accuracy, F1-score, and specificity for each site under the various configurations. As shown in the figures and tables, 3D-CNN + LSTM + GBDT consistently achieved the best average classification performance across all model configurations. Specifically, this model achieved an accuracy of 0.85 in both five-fold and ten-fold cross-validation, with only minimal differences observed between the two. Additionally, 3D-CNN + LSTM + GBDT significantly outperformed 3D-CNN + GBDT in terms of accuracy, recall, and F1-score, indicating that the model is highly effective at capturing temporal information and that the fusion of spatial and temporal features further enhances classification accuracy. More specifically, temporal–spatial feature fusion allows the model to extract spatial features at each time point and capture dynamic changes in brain activity, providing a more comprehensive feature representation and improving classification performance. In contrast, the other combinations of temporal models and classifiers had lower accuracy, with performance decreasing notably when RNN and RF were used. This suggests that the combination of LSTM and GBDT has a unique advantage in capturing temporal dependencies and nonlinear relationships. The results from both five-fold and ten-fold cross-validation show that, although the validation methods differ slightly, the relative ranking of the configurations remains consistent, with 3D-CNN + LSTM + GBDT consistently outperforming all other configurations in terms of stability and classification efficiency.
The performance differences between the selected model configurations are substantial. First, 3D-CNN + LSTM + GBDT significantly outperforms all other combinations across all metrics. This combined model integrates the spatial feature extraction capability of 3D-CNN, the temporal dependency modeling advantage of LSTM, and the efficient nonlinear classification power of GBDT. Specifically, 3D-CNN extracts spatial information from brain images through convolution operations, LSTM captures long-term dependencies in the time series, and GBDT optimizes classification performance by integrating decision trees. Compared to other combinations, LSTM is better at handling the complexity of temporal features in ASD data, and GBDT is more efficient in handling nonlinear relationships between temporal and spatial features, making 3D-CNN + LSTM + GBDT the optimal choice.
In contrast, 3D-CNN + RNN + GBDT performed slightly worse. Although RNNs can handle sequential data, they have limitations in capturing long-term dependencies, especially when dealing with longer time series, where issues such as vanishing or exploding gradients may occur. Therefore, the inability of RNN to effectively model long-term dependencies results in lower classification performance compared to LSTM. On the other hand, the 3D-CNN + LSTM + RF model, while performing well on some metrics, is less effective than GBDT. RF, as an ensemble method based on decision trees, is typically not as efficient as GBDT in handling high-dimensional data. GBDT optimizes decision boundaries better, leading to superior performance compared to RF. Hence, 3D-CNN + LSTM + GBDT is better suited to handle complex classification tasks. Finally, 3D-CNN + RNN + RF performed the worst, especially in terms of accuracy and recall, which were significantly lower than other models. The combination of RNN and RF did not effectively leverage the strengths of both methods, leading to poor performance in processing the ABIDE dataset.
Through a series of experiments and comparisons, we found that the 3D-CNN + LSTM + GBDT model performed best across all configurations, especially in key metrics such as accuracy, precision, and recall. This model effectively combines the spatial feature extraction capability of 3D-CNN, the temporal dependency modeling of LSTM, and the nonlinear classification power of GBDT, making it highly effective in processing the complex temporal-spatial features in the ABIDE dataset. For ASD classification tasks, the combination of LSTM and GBDT proves to be the optimal choice, further validating the effectiveness of supervised learning methods in processing brain imaging data.
3.3.2. Comparison with Other Studies
In this section, we compare our proposed method with other state-of-the-art approaches to evaluate its performance in the single-site classification task.
Table 5 presents the classification accuracy results for various algorithms, including Epalle [14], Heinsfeld [11], Eslami [12], Nielsen [10], Rakić [13], and Kang [15]. The table lists the accuracy of each method at each individual site.
From Table 5, it is evident that our proposed method outperforms the other methods at multiple sites, with an overall average accuracy of 0.85, which is significantly higher than Heinsfeld [11] (0.65) and Nielsen [10] (0.60), and which also surpasses Rakić [13] (0.80) and Kang [15] (0.83). At several sites, including Cal, Leu, NYU, Olin, Pitt, SBL, and UM, our method performs exceptionally well, achieving accuracies of 0.83, 0.83, 0.90, 0.89, 0.93, 0.96, and 0.92, respectively, all surpassing the other methods. Although the improvement over Kang [15] is relatively modest, our approach demonstrates enhanced robustness, with lower variance across folds (±0.03 vs. ±0.06), suggesting better generalization and stability across heterogeneous sites.
The performance improvement of our model can be attributed to the effective combination of 3D-CNN, LSTM, and GBDT. Notably, 3D-CNN is capable of effectively extracting spatial features from brain imaging data, capturing voxel-level details of brain activity. LSTM enhances the model’s ability to handle time-series data by modeling temporal dependencies, with a particular advantage in capturing long-term dependencies. Lastly, GBDT leverages decision tree ensembles to handle complex nonlinear relationships, further optimizing classification performance.
While most sites show excellent performance, some sites, such as CMU, SDSU, and Trinity, exhibit relatively lower performance. For instance, at the CMU site, our method achieves an accuracy of 0.77, which is still higher than Heinsfeld's [11] 0.66 but lower than Epalle [14]. This variation could be attributed to differences in site data characteristics, such as sample size, fMRI data quality, and noise levels. Nevertheless, despite these discrepancies at individual sites, our method consistently demonstrates strong competitive performance, especially on challenging datasets, providing reliable classification results. To further explore the lower classification performance observed at certain sites, we examined the relationship between misclassification and individual-level variables such as age and head motion. While no clear age-related trends were found, misclassified subjects tended to have higher mean framewise displacement, suggesting that motion-related artifacts may have negatively impacted model predictions.
Table 6 presents the comparative performance of different methods for multi-site classification. As indicated in the table, the proposed method exhibits superior performance across several evaluation metrics, including accuracy, precision, recall, F1-score, and specificity. Our method outperforms all other approaches in most of these metrics, showing improvements in accuracy, recall, F1-score, and specificity when compared with the state-of-the-art method. This highlights its robust generalization capability across multi-site datasets. Notably, our method excels in recall, demonstrating high sensitivity in identifying ASD samples.
3.3.3. t-SNE Visualization of Features
In this experiment, we compare the feature distributions extracted by our proposed method with those of other methods at the Olin site. To facilitate a more intuitive comparison of the classification performance of each method, we employed t-distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction and visualization of the features.
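A minimal sketch of how such a projection can be produced with scikit-learn is given below; the perplexity value and the two-dimensional embedding are illustrative assumptions.

```python
# Minimal sketch of the t-SNE projection used to visualize feature separability.
# Assumes `features` is an (n_subjects, n_dims) array of fused spatiotemporal features
# and `labels` is an integer array with 1 for ASD and 0 for typical controls.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, perplexity=30, seed=0):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=seed).fit_transform(features)
    for cls, name in [(1, "ASD"), (0, "Control")]:
        pts = emb[labels == cls]
        plt.scatter(pts[:, 0], pts[:, 1], label=name, alpha=0.7)
    plt.legend()
    plt.title("t-SNE of fused spatiotemporal features")
    plt.show()
```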
Figure 5 presents the feature distributions of each method after dimensionality reduction.
As shown in Figure 5, the feature distribution of our method clearly demonstrates a better separation between positive and negative samples. Specifically, the overlap between the positive and negative samples is minimal, and they form distinct clusters. This indicates that the fusion of spatiotemporal features effectively enhances the model's ability to distinguish between classes, especially between the ASD and normal control groups, showing strong separability.
In contrast, the feature distributions of the other methods are more overlapping. While some methods exhibit partial separation between positive and negative samples, the overall degree of separation is not as clear as with our method. For instance, in the methods of Heinsfeld [11], Eslami [12], and Nielsen [10], the positive and negative samples overlap significantly, making it difficult to achieve good classification performance. Other methods, such as Epalle [14], Rakić [13], and Kang [15], show some improvement in separation but still lack sufficient distinction between the positive and negative samples compared with our proposed method.
This phenomenon can be attributed to the advantages of our spatiotemporal feature fusion approach. By using 3D-CNN to extract spatial features and combining it with LSTM for modeling temporal information, we can more comprehensively capture the dynamic changes in brain activity over time, rather than solely relying on static spatial features or local temporal features. This spatiotemporal fusion allows the model to not only accurately extract spatial information from brain images but also to capture the temporal dynamics of brain activity, thus improving the model’s ability to differentiate between ASD and normal control groups.
3.4. Discussion
The experimental results of this study demonstrate that the proposed method, which combines a 3D-CNN with a segmented temporal decision network, exhibits superior performance in ASD classification, particularly in handling spatiotemporal features. The fusion of spatiotemporal features is a key aspect of the method and significantly enhances its discriminative power. By using the 3D-CNN for spatial feature extraction, the LSTM for modeling temporal dependencies, and the nonlinear classification ability of GBDT, the model is able to comprehensively capture complex patterns of brain activity across both spatial and temporal dimensions, thereby optimizing classification performance.
Firstly, the 3D-CNN effectively extracts spatial structural information from fMRI data, capturing spatial correlations between different brain regions. LSTM, on the other hand, captures the dynamic changes in brain activity over time by modeling long-term dependencies in time-series data. Compared to traditional methods that rely solely on spatial information or simple temporal features, the fusion of spatiotemporal features enables the model to not only extract spatial features more accurately but also handle the temporal variations in brain activity, enhancing the model’s ability to distinguish between ASD and typically developing controls.
Compared to other methods, our spatiotemporal fusion approach significantly improves the separability of the features. In particular, the t-SNE visualization results clearly show a substantial reduction in the overlap between positive and negative samples, indicating that our method effectively distinguishes between different classes in the feature space. This improved feature separability directly contributes to better classification performance across multiple sites, particularly at the Cal, NYU, and Olin sites, where our method achieves higher accuracy than other methods.
In this study, we used t-SNE visualization to demonstrate the separation between the autism spectrum disorder (ASD) group and the control group, but we did not identify the specific features driving this separation. Understanding these features is crucial for a deeper exploration of the biological mechanisms underlying ASD. Therefore, future research will incorporate techniques such as feature importance analysis and SHAP values to identify and discuss the potential associations between key features and ASD. Additionally, we will strive to enhance the transparency of the model to ensure its effectiveness and interpretability in clinical applications, thereby providing insights for understanding the causes of ASD and supporting early intervention.
Furthermore, the voting mechanism employed in this study further enhances the model's classification performance. The voting mechanism combines the predictions made on the different time segments, leveraging each segment's view of the spatiotemporal features to increase the robustness of the final classification result. Specifically, the voting mechanism effectively mitigates errors that a single segment-level prediction may make on challenging samples, improving both classification accuracy and stability. In practical terms, this strategy aggregates information across multiple time windows, ensuring that the final prediction reflects the full scope of the data and thereby producing more reliable results.
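A minimal sketch of such a segment-level majority vote is shown below; the tie-breaking rule is an illustrative assumption.

```python
# Minimal sketch of the voting mechanism: aggregate per-segment predictions into a
# subject-level decision. A simple majority vote is shown; ties default to the ASD
# class here purely as an illustrative choice.
import numpy as np

def vote(segment_predictions):
    """segment_predictions: iterable of 0/1 predictions, one per time segment."""
    preds = np.asarray(list(segment_predictions))
    return int(preds.sum() * 2 >= len(preds))   # 1 = ASD, 0 = typical control

print(vote([1, 0, 1, 1, 0]))  # -> 1 (three of five segments voted ASD)
```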
The recall of the proposed model ranges from 0.83 to 0.85, indicating that some ASD cases may be missed. We recognize that false negatives can delay the early identification of individuals who require intervention, which could affect their long-term development and treatment outcomes. To improve recall, we plan to explore additional features and variables in future research to enhance the model's ability to detect ASD. We will also consider adjusting the model's decision threshold, employing different algorithms, or integrating multiple methods to raise recall and minimize the risk associated with false negatives.
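As an example of the threshold adjustment mentioned above, the following sketch lowers the decision threshold on the classifier's predicted ASD probability to favor recall; the threshold value is an illustrative assumption.

```python
# Minimal sketch of threshold adjustment: lowering the decision threshold on the
# predicted ASD probability trades precision for recall. The value 0.4 is illustrative.
from sklearn.metrics import recall_score

def predict_with_threshold(clf, X, threshold=0.4):
    # Column 1 of predict_proba is the probability of the positive (ASD) class.
    return (clf.predict_proba(X)[:, 1] >= threshold).astype(int)

# recall_score(y_test, predict_with_threshold(clf, X_test, threshold=0.4))
```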
In clinical applications, we recommend using our model as an auxiliary tool in conjunction with traditional assessment methods. While the model can support early identification, the final diagnosis should still be made by qualified healthcare professionals to ensure that each individual receives a comprehensive evaluation and the necessary support.
However, despite the outstanding performance of our model across most sites, there are still some sites where the performance is relatively lower. This performance variability could be attributed to differences in data characteristics between sites, such as sample size, data quality, and noise levels. To address these issues, future work could explore further optimization of data preprocessing and noise removal methods to improve the model’s performance at these sites.
Although our model performs excellently in terms of accuracy, its computational complexity may pose challenges for real-time clinical applications, particularly when constrained by computational resources for larger datasets. To enhance clinical usability, future research should consider model optimization strategies to reduce computational demands or shorten training time through parameter adjustments and optimization of algorithms to improve the feasibility of practical applications.
Current clinical ASD diagnosis largely depends on subjective behavioral evaluations, which can be time-consuming and variable. Our method provides an objective neuroimaging-based approach to complement clinical assessments, potentially improving diagnostic efficiency and accuracy. In the context of neuroimaging-based ASD classification, an accuracy threshold of approximately 70–75% is generally regarded as clinically meaningful, given the inherent heterogeneity of ASD and variability in fMRI data quality across sites. Our proposed model consistently achieves classification accuracies above this benchmark across multiple independent datasets, demonstrating robustness and potential clinical value. Overall, the experimental results validate the effectiveness of the proposed method in ASD classification tasks, demonstrating its strong classification capability, particularly in handling complex spatiotemporal features.
4. Conclusions
In this study, we proposed a method for ASD recognition based on a 3D-CNN and segmented LSTM, demonstrating excellent performance in the ASD classification task across multiple sites. By combining spatiotemporal feature extraction and a multi-model voting mechanism, our approach effectively captures both the spatial structure information and the temporal dynamics of brain activity. The proposed model achieved an average classification accuracy of 85%, outperforming several state-of-the-art methods and showing strong generalizability across sites, providing a more comprehensive solution for ASD identification.
Our findings show that utilizing the 3D-CNN to extract high-dimensional spatial features, combined with LSTM to capture the long-term dependencies in time-series data, can better reveal the characteristic brain activity patterns in ASD patients. Furthermore, the GBDT classifier enhances the model’s nonlinear classification capability, and the integration of the voting mechanism significantly improves the robustness and accuracy of the model. With this approach, we were able to effectively differentiate ASD patients from typical controls, achieving high classification accuracy, especially at challenging sites.
Although our study has made considerable progress, several challenges remain. Firstly, the performance variation across sites may be attributed to data heterogeneity, sample size differences, and variations in preprocessing protocols. Future work will focus on optimizing and standardizing preprocessing strategies as well as enhancing model generalization to improve stability and applicability across diverse datasets. Secondly, while the proposed spatiotemporal feature extraction and voting mechanisms have demonstrated efficacy in ASD classification, the high dimensionality and abstract nature of the extracted features pose challenges for biological interpretability. Developing methods to map these learned features onto meaningful neurobiological constructs will be an important direction for future research. Lastly, the framework introduced here provides promising insights for the identification of other neurodevelopmental disorders, such as ADHD or schizophrenia, and could be adapted accordingly to broaden its clinical utility.
Overall, the results of this study provide a new approach for the automated diagnosis of ASD, particularly in the modeling of spatiotemporal features. In the future, we will continue to explore advanced deep learning techniques, incorporate larger and more diverse datasets, and further improve the accuracy and generalizability of the model.