In this section, we assess the performance of our proposed method for ADHD classification using resting-state fMRI data, and we compare our results with those from state-of-the-art approaches in the literature.
5.3. Discussion
Table 7 reports the classification performance of the proposed PLV-GraphSAGE framework across individual ADHD-200 acquisition sites and the combined dataset, demonstrating robust and stable discriminative capability despite inter-site heterogeneity. The model achieves an average accuracy of 0.900 and a mean ROC-AUC of 0.908, indicating strong separation between ADHD and typically developing subjects and confirming reliable probabilistic ranking across operating points. Precision remains relatively high across sites (average ≈ 0.905), while recall averages 0.894, reflecting effective detection of ADHD cases. The mean F1-score of 0.891 further confirms a balanced trade-off between precision and recall, ensuring that performance is not driven by a single metric.
In addition, the average specificity reaches 0.886, indicating that the model effectively identifies typically developing controls while maintaining controlled false-positive rates. For instance, KKI and the Combined Dataset show particularly high specificity values (0.971 and 0.965, respectively), demonstrating strong ability to correctly classify non-ADHD subjects. In contrast, sites such as Peking exhibit slightly lower specificity (0.818), which is consistent with its sensitivity-oriented behavior characterized by very high recall (0.961). This reflects different operating balances rather than instability. Site-specific variations are expected in multi-site neuroimaging studies due to differences in acquisition protocols, demographic characteristics, and preprocessing pipelines. For example, KKI achieves high accuracy (0.947) and strong F1-score (0.897) with high specificity (0.971), reflecting well-balanced classification. OHSU and NeuroIMAGE also demonstrate stable and competitive performance across all metrics. Importantly, the Combined Dataset maintains competitive performance (accuracy = 0.898 ± 0.022, recall = 0.787 ± 0.044, specificity = 0.965 ± 0.017, F1 = 0.852 ± 0.012, AUC = 0.958 ± 0.012), confirming good generalization under heterogeneous conditions. Although recall in the Combined Dataset is more moderate, specificity remains very high, indicating a conservative and stable operating point that limits false positives while maintaining reliable discrimination.
The relatively low standard deviations (Mean ± Std) observed overall confirm stable cross-validation behavior and reproducibility, indicating that the performance does not rely on favorable data splits. In particular, the Combined Dataset exhibits consistently lower standard deviation values compared to individual sites, reflecting strong performance consistency across folds and highlighting the robustness of the model when trained on a larger and more diverse sample. This improved stability can be attributed to the increased dataset size, which reduces sensitivity to small perturbations during cross-validation.
However, higher standard deviations are observed at certain sites (e.g., NYU precision SD = 0.138), which aligns with prior findings on the ADHD-200 dataset. Brown et al. [78] showed that NYU exhibits high variability due to substantial intra-site participant heterogeneity and class distribution shifts between training and test sets. Olivetti et al. [79] further demonstrated that inter-site batch effects, arising from differences in acquisition protocols and scanner hardware, significantly impact classifier performance across sites. Taspinar et al. [80] emphasized that intra-site heterogeneity is a key driver of unstable model performance, while [44] showed that limited samples per fold further increase cross-fold variability in ADHD classification tasks.
In contrast, site-specific datasets, characterized by smaller sample sizes, show relatively higher standard deviations, as even minor variations in fold composition may lead to noticeable fluctuations in performance. Therefore, the variability observed across sites is mainly related to dataset size and heterogeneity, whereas the Combined Dataset benefits from enhanced statistical robustness and model stability. Methodologically, the integration of PLV-based functional connectivity with GraphSAGE aggregation enables the extraction of robust node embeddings that capture local topological relationships within brain networks, thereby stabilizing connectivity representations across sites and contributing to consistent predictive performance.
These results demonstrate strong discriminative power, balanced sensitivity–specificity behavior, high AUC values, and stable generalization across multiple acquisition sites, confirming the robustness and practical reliability of the proposed framework.
Figure 4 presents the confusion matrices of the GraphSAGE classifier for each site and for the combined dataset. These matrices provide a direct view of the classification outcomes by showing the number of correctly and incorrectly predicted samples for each class.
The confusion matrices show strong classification performance, with dominant values along the main diagonal indicating that the model correctly identifies the majority of both ADHD and TDC cases across sites. NeuroIMAGE stands out with particularly low error counts (only 2 FN and 2 FP), reflecting near-perfect separation. OHSU and Peking also show few FNs (2 each), consistent with high recall for ADHD detection, although Peking has a higher FP count (26), which indicates some over-prediction of the positive class. For KKI, the matrix reveals balanced performance, with few misclassifications (3 FN and 2 FP) relative to the sample size, confirming reliable discrimination despite the site’s smaller dataset. NYU shows a larger number of correct classifications (140 TP and 64 TN), but also noticeable FPs (25) and FNs (22), pointing to weaker precision and recall that align with its PR and ROC patterns. The Combined Dataset matrix contains substantial correct predictions (537 TN and 263 TP) and comparatively few errors (19 FP and 76 FN); most of these errors are false negatives, which matches the high-specificity, moderate-recall operating point reported in Table 7, and the overall pattern suggests that multisite training helps reduce site-specific biases. Overall, the confusion matrices support the effectiveness of GraphSAGE for ADHD classification, with small off-diagonal counts indicating reliable class separation. The results for individual sites highlight variability influenced by sample size and data quality, while the Combined Dataset illustrates the advantages of integration for improved robustness and clinical applicability.
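As a worked illustration of how the reported metrics relate to the confusion-matrix counts, the following minimal sketch derives accuracy, precision, recall, specificity, and F1 from the Combined Dataset counts quoted above. The function name `confusion_metrics` is a hypothetical helper introduced here. The sketch assumes the counts are pooled over folds, so its outputs are close to, but need not coincide with, the fold-averaged values in Table 7.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive standard binary metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (true for the counts used below).
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity, ADHD as positive class
    specificity = tn / (tn + fp)       # true-negative rate for controls
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Combined Dataset counts quoted in the text: 263 TP, 537 TN, 19 FP, 76 FN.
combined = confusion_metrics(tp=263, tn=537, fp=19, fn=76)
for name, value in combined.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.894, precision: 0.933, recall: 0.776,
# specificity: 0.966, f1: 0.847
```

The pooled values differ from the reported means only in the second or third decimal place, which is what one would expect when per-fold metrics are averaged rather than computed from pooled counts.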
Figure 5 presents the Precision–Recall (PR) curves of the GraphSAGE model evaluated on the ADHD-200 dataset for each acquisition site and for the combined dataset. The PR representation is particularly suitable for this task because the dataset is imbalanced, and it directly reflects the trade-off between correctly identifying ADHD subjects (recall) and avoiding false positives (precision). The PR curves demonstrate strong discriminative performance across sites, with high Average Precision (AP) values. NeuroIMAGE (AP = 0.97), OHSU (AP = 0.96), Peking (AP = 0.95), and NYU (AP = 0.89) exhibit consistently high precision across a wide range of recall levels. In these datasets, precision remains close to 1.0 at lower recall values and decreases gradually as recall increases, indicating that the model effectively identifies positive cases while maintaining control over false positives.
The curves for OHSU and Peking show a smooth and progressive decline in precision as recall approaches 1.0, suggesting stable ranking of predictions and well-calibrated probability outputs. NeuroIMAGE demonstrates particularly strong performance, with precision remaining high even at moderate-to-high recall levels, reflecting robust class separation.
In contrast, KKI (AP = 0.84) presents a less stable curve, with more pronounced fluctuations in precision as recall increases. Precision decreases more sharply at higher recall levels, indicating that when the model attempts to capture nearly all positive cases, it introduces more false positives. This behavior may be due to smaller sample sizes, increased noise, or stronger variability in acquisition protocols and subject characteristics.
For the Combined Dataset (AP = 0.93), the curve remains smooth and stable over a broad recall range. Precision stays high at moderate recall levels and decreases mainly as recall approaches its maximum. This pattern indicates strong global ranking performance and confirms that the model generalizes well when trained on a larger and more diverse population. It also suggests that combining data from multiple sites improves the robustness of the learned representations by exposing the model to more diverse connectivity patterns. As a result, the model becomes less sensitive to site-specific characteristics and achieves more consistent behavior.
The PR analysis confirms that the proposed framework maintains high precision across clinically relevant recall ranges. The consistently high AP values across most sites demonstrate reliable positive class detection, while the shape of the curves reflects stable probability estimation and robust discriminative capacity.
Figure 5 indicates that GraphSAGE performs well in general, although its effectiveness remains influenced by data quality and diversity. It further highlights the importance of multisite integration for improving stability and reducing variability across datasets.
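For readers unfamiliar with the AP summary used in the PR analysis, the sketch below computes Average Precision as the recall-weighted sum of precisions over the ranked predictions. It is a minimal illustration on made-up labels and scores (not data from this study), it assumes distinct scores (ties are not treated specially), and the function name `average_precision` is introduced here for illustration only.

```python
import numpy as np


def average_precision(y_true, scores):
    """AP = sum_k (R_k - R_{k-1}) * P_k over samples ranked by decreasing score."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = y_true[order]                        # 1 where a ranked sample is positive
    tp = np.cumsum(hits)                        # true positives among the top-k
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / hits.sum()
    recall_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - recall_prev) * precision))


# Illustrative toy data only: 1 = ADHD, 0 = control.
y_toy = [1, 1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ap = average_precision(y_toy, s_toy)
print(f"AP = {ap:.3f}")  # (1/1 + 2/2 + 3/4) / 3 = 0.917
```

Because precision only contributes at ranks where recall increases, a single control ranked above an ADHD subject lowers AP, which is why the less stable KKI curve translates into a lower AP value.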
Figure 6 presents the Receiver Operating Characteristic (ROC) curves of the GraphSAGE model for each acquisition site and for the Combined Dataset. The ROC curve evaluates the trade-off between the true positive rate (sensitivity) and the false positive rate across different classification thresholds, while the Area Under the Curve (AUC) summarizes the discriminative ability of the model.
The ROC curves demonstrate strong classification performance across all sites, with high AUC values. NeuroIMAGE and OHSU achieve the highest performance (AUC = 0.97), followed by Peking (AUC = 0.96). NYU obtains an AUC value of 0.89, while KKI reaches 0.93, indicating that both exhibit comparatively lower discriminative capacity. For NeuroIMAGE and OHSU, the ROC curves rise steeply toward the top-left corner of the plot, reflecting high sensitivity even at low false positive rates. This behavior indicates excellent class separation and strong ranking of predicted probabilities. Similarly, the Peking dataset shows a smooth curve approaching the optimal region, confirming reliable discrimination between ADHD and control subjects.
In contrast, NYU and KKI present less steep initial slopes and slightly lower AUC values, suggesting that the model has more difficulty distinguishing ADHD from control subjects in these datasets. This reduced performance may be related to differences in imaging protocols, scanner properties, population characteristics, or sample size. Achieving high sensitivity in these sites requires accepting a relatively higher false positive rate. Nevertheless, the curves remain clearly above the diagonal reference line, confirming performance substantially better than random classification.
The Combined Dataset (AUC = 0.96) exhibits a smooth and stable ROC curve that closely approaches the top-left corner. This result confirms strong global discriminative performance when training and evaluating on a larger and more diverse sample. The high AUC indicates that the model maintains robust ranking ability across heterogeneous data distributions.
The ROC analysis confirms that GraphSAGE achieves high sensitivity across a broad range of false positive rates. The consistently high AUC values across sites demonstrate reliable class separation, while the Combined Dataset further highlights the benefit of multisite integration for improving generalization and stability.
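The AUC values discussed above can be read as the probability that a randomly chosen ADHD subject receives a higher score than a randomly chosen control. The following minimal sketch computes AUC directly from that pairwise definition, counting ties as one half. The labels and scores are made-up toy values rather than study data, and `roc_auc` is a hypothetical helper name.

```python
import numpy as np


def roc_auc(y_true, scores):
    """AUC as P(score_pos > score_neg), with ties counted as 1/2."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


# Illustrative toy data only: 1 = ADHD, 0 = control.
y_toy = [1, 1, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_toy, s_toy)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs correctly ordered -> 0.889
```

This pairwise view makes explicit that AUC depends only on the ranking of scores and not on the decision threshold, which is why it is reported alongside threshold-dependent metrics such as recall and specificity.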
5.3.1. Ablation Study
This section presents an ablation study aimed at assessing the contribution of key design choices in the proposed framework. We first investigate the impact of the connectivity measure used to construct the input graphs, before examining the effect of the architectural components of the model.
Correlation-based functional connectivity is one of the most widely used approaches for constructing brain networks from fMRI data. It quantifies the linear statistical dependencies between the time series of different brain regions using the Pearson correlation coefficient, and has been extensively adopted in neuroimaging research [81]. In this work, pairwise correlations between all regions of interest are computed to obtain a full correlation matrix. To derive a sparse graph from this matrix, we apply a global threshold based on the 80th percentile of the absolute correlation values, retaining only the strongest connections and discarding the weaker ones. This thresholding strategy is consistent with the one used for PLV-based graphs, ensuring a direct and fair comparison between the two connectivity approaches.
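The percentile-based sparsification described above can be sketched as follows. This is a minimal illustration on synthetic time series, not the pipeline used in this study. It assumes that the percentile is taken over the off-diagonal (upper-triangle) absolute values and that the resulting graph is binary and undirected; `threshold_graph` is a hypothetical helper name.

```python
import numpy as np


def threshold_graph(conn, percentile=80.0):
    """Binary undirected adjacency keeping pairs whose |value| reaches the
    global percentile of the off-diagonal absolute connectivity values."""
    conn = np.asarray(conn, dtype=float)
    n = conn.shape[0]
    rows, cols = np.triu_indices(n, k=1)        # each region pair once
    strengths = np.abs(conn[rows, cols])
    cutoff = np.percentile(strengths, percentile)
    keep = strengths >= cutoff
    adj = np.zeros((n, n), dtype=int)
    adj[rows[keep], cols[keep]] = 1
    return adj + adj.T, float(cutoff)           # mirror to make it symmetric


# Synthetic example: 120 time points for 10 regions (assumed sizes).
rng = np.random.default_rng(0)
ts = rng.standard_normal((120, 10))
corr = np.corrcoef(ts, rowvar=False)            # full Pearson correlation matrix
adj, cutoff = threshold_graph(corr, percentile=80.0)
print(adj.sum() // 2, "edges kept out of", 10 * 9 // 2)
```

With an 80th-percentile cutoff, roughly the strongest 20% of region pairs are retained. The same function can be applied to a PLV matrix, which is what makes the two graph constructions directly comparable.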
Table 8 compares the performance of GraphSAGE when built on correlation-based versus PLV-based functional connectivity graphs. Both connectivity measures yield strong classification results, which confirms that GraphSAGE can effectively leverage different types of brain network representations. The correlation-based approach already performs competitively, with an accuracy of 0.874 ± 0.015 and an AUC of 0.952 ± 0.017, suggesting that Pearson correlation captures sufficient linear structure to support ADHD classification. PLV-based graphs show consistent but moderate improvements across most metrics—accuracy (0.898), precision (0.930), F1-score (0.852), and AUC (0.958)—while the gain in recall remains relatively limited (0.787 vs. 0.760). Beyond mean performance, PLV-based results tend to exhibit slightly lower standard deviations across several metrics, indicating more stable behavior across folds. These observations suggest that PLV may provide a complementary representation of functional interactions by capturing phase synchronization effects, although the overall performance differences remain moderate.
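To make the notion of phase synchronization concrete, the sketch below computes a PLV matrix using the standard definition, PLV_ij = |mean over t of exp(i(φ_i(t) − φ_j(t)))|, with instantaneous phases taken from an FFT-based analytic signal. It is a minimal illustration under stated assumptions: the band-pass filtering and other preprocessing applied in this study are not reproduced, the test signals are synthetic sinusoids, and the helper names `analytic_signal` and `plv_matrix` are introduced here.

```python
import numpy as np


def analytic_signal(x):
    """FFT-based analytic signal along axis 0 (Hilbert-transform construction)."""
    n = x.shape[0]
    spectrum = np.fft.fft(x, axis=0)
    gain = np.zeros(n)
    gain[0] = 1.0                               # keep the DC component once
    if n % 2 == 0:
        gain[n // 2] = 1.0                      # keep the Nyquist component once
        gain[1:n // 2] = 2.0                    # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain[:, None], axis=0)


def plv_matrix(ts):
    """PLV for time series of shape (time points, regions)."""
    phase = np.angle(analytic_signal(np.asarray(ts, dtype=float)))
    unit = np.exp(1j * phase)                   # unit phasors, shape (T, R)
    return np.abs(unit.conj().T @ unit) / phase.shape[0]


# Synthetic signals: two 5-cycle sinusoids with a fixed phase lag, one 8-cycle.
t = np.arange(200) / 200
signals = np.column_stack([
    np.sin(2 * np.pi * 5 * t),
    np.sin(2 * np.pi * 5 * t + 1.0),
    np.sin(2 * np.pi * 8 * t),
])
plv = plv_matrix(signals)
print(np.round(plv, 3))
```

The first two signals have a constant phase difference and therefore a PLV close to 1 even though they are not in phase, whereas the third drifts relative to them and yields a PLV close to 0. This is the sense in which PLV captures phase locking rather than amplitude co-variation.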
Table 9 presents an ablation study comparing adaptive threshold optimization with a fixed decision threshold of 0.5. This analysis highlights the impact of threshold selection on classification behavior and calibration in a multi-site clinical setting.
The results indicate that adaptive thresholding achieves a more balanced trade-off between precision and recall across most sites. In KKI and NYU, adaptive calibration improves accuracy and precision while maintaining competitive recall, indicating improved control of false positives without substantially reducing sensitivity. These results suggest that a fixed 0.5 threshold may not be optimal under heterogeneous data distributions.
At some sites, such as Peking and NeuroIMAGE, the fixed threshold slightly increases recall (e.g., 0.970 vs. 0.961 at Peking), indicating that the optimal operating point may vary according to site-specific characteristics. Nevertheless, adaptive thresholding generally provides a more consistent balance between performance metrics, which is particularly important in clinical classification tasks where both reliability and interpretability are required.
For the Combined Dataset, recall is higher with the fixed 0.5 threshold (0.896 vs. 0.787). However, this improvement in sensitivity is associated with reduced precision and increased variability. The standard deviation values reported in Table 8 indicate that the adaptive configuration ensures more consistent performance across folds. This observation reflects a trade-off between sensitivity and precision, where the fixed threshold favors higher recall, while the adaptive strategy provides better control of false positives. In a screening-oriented context, higher recall may be preferred, whereas in settings requiring more reliable predictions, improved precision and stability can be advantageous. Therefore, the choice of threshold should be guided by the intended clinical objective rather than assuming a single optimal operating point.
These findings demonstrate that threshold selection significantly affects operating characteristics. In this context, adaptive calibration provides more stable and controlled performance across heterogeneous sites, although it does not uniformly dominate the fixed threshold across all metrics. This makes it a suitable option when prioritizing robustness and consistency in practical clinical deployment.
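One simple way to realize an adaptive decision threshold is to scan a grid of candidate thresholds on held-out validation predictions and keep the one that maximizes a chosen criterion. The sketch below uses F1 as that criterion, which is an assumption made for illustration; the selection criterion and grid are not taken from this study's description. The probabilities are made-up toy values, and `f1_at` and `select_threshold` are hypothetical helper names.

```python
import numpy as np


def f1_at(y_true, probs, threshold):
    """F1-score obtained when predicting positive for probs >= threshold."""
    pred = probs >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def select_threshold(y_true, probs, grid=None):
    """Return the grid threshold with the highest validation F1 (first on ties)."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    scores = [f1_at(y_true, probs, t) for t in grid]
    return float(grid[int(np.argmax(scores))])


# Illustrative validation predictions only: 1 = ADHD, 0 = control.
y_val = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_val = np.array([0.12, 0.22, 0.33, 0.42, 0.47, 0.62, 0.71, 0.91])
best = select_threshold(y_val, p_val)
print(f"selected threshold: {best:.2f}")
print(f"F1 at selected: {f1_at(y_val, p_val, best):.3f}, "
      f"F1 at 0.5: {f1_at(y_val, p_val, 0.5):.3f}")
```

In this toy case the selected threshold is lower than 0.5 and trades one false positive for one additional detected positive. To avoid optimistic bias, the threshold would be chosen on training or validation folds only and then applied unchanged to the corresponding test fold.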
Table 10 presents an ablation study evaluating the effect of the PLV threshold percentile on classification performance using the Combined Dataset. The results clearly show that graph sparsification plays a critical role in model effectiveness.
Among the tested configurations, the 80th percentile achieves the best performance, with the highest accuracy (0.898 ± 0.022), precision (0.930 ± 0.036), F1-score (0.852 ± 0.012), and AUC (0.958 ± 0.012). Recall is also highest at this threshold (0.787 ± 0.044), providing a balanced trade-off between sensitivity and precision.
The 60th percentile yields moderate results, while the 70th percentile leads to a noticeable decrease in accuracy, precision, and AUC. Although recall remains relatively similar across percentiles, the discriminative performance declines when the threshold is lower. This suggests that retaining a larger number of weaker functional connections may introduce noise into the graph structure, reducing the quality of learned representations.
In contrast, the 80th percentile appears to preserve the most informative connectivity patterns while removing less relevant edges. The relatively low standard deviation at this level also indicates stable behavior across folds. These findings support the selection of the 80th percentile as the best of the tested configurations for graph construction. However, it is worth noting that we did not explicitly analyze site-specific PLV distributions, which may influence the behavior and generalizability of a uniform global threshold across heterogeneous acquisition sites.
Table 11 reports an ablation study analyzing the effect of GraphSAGE depth on classification performance using the Combined Dataset.
The results show a clear improvement as the number of layers increases. With only one layer, the model achieves limited performance (accuracy = 0.715 ± 0.039, AUC = 0.784 ± 0.042), indicating that shallow aggregation is insufficient to capture complex inter-regional interactions. Although recall remains moderate (0.755 ± 0.004), precision and discriminative ability remain relatively low.
Using two layers substantially improves all metrics, confirming that incorporating broader neighborhood information enhances representation learning. However, the best performance is obtained with three layers (proposed configuration), achieving the highest accuracy (0.898 ± 0.022), precision (0.930 ± 0.036), recall (0.787 ± 0.044), F1-score (0.852 ± 0.012), and AUC (0.958 ± 0.012).
Importantly, the standard deviation remains low for the three-layer model, indicating that increased depth improves performance without compromising stability. These results demonstrate that deeper message passing enables the model to better capture higher-order connectivity patterns, supporting the choice of a three-layer GraphSAGE architecture.
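The link between depth and neighborhood reach can be illustrated with a minimal NumPy sketch of GraphSAGE-style message passing. It assumes the mean aggregator, scalar features, and fixed unit weights purely for illustration; the aggregator, layer sizes, and readout of the proposed model are not reproduced here, and `sage_layer` and `sage_forward` are hypothetical helper names.

```python
import numpy as np


def sage_layer(h, adj, w_self, w_neigh):
    """One mean-aggregator layer: ReLU(h W_self + mean_neighbors(h) W_neigh)."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ h) / np.maximum(deg, 1)  # isolated nodes get zeros
    return np.maximum(h @ w_self + neigh_mean @ w_neigh, 0.0)


def sage_forward(x, adj, weights):
    """Stack layers; `weights` is a list of (W_self, W_neigh) pairs."""
    h = x
    for w_self, w_neigh in weights:
        h = sage_layer(h, adj, w_self, w_neigh)
    return h


# Path graph 0-1-2-3-4 with a single non-zero feature on node 4.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
x = np.zeros((5, 1))
x[4, 0] = 1.0
unit = (np.ones((1, 1)), np.ones((1, 1)))        # illustrative fixed weights

out1 = sage_forward(x, adj, [unit])
out2 = sage_forward(x, adj, [unit] * 2)
out3 = sage_forward(x, adj, [unit] * 3)
for depth, out in enumerate((out1, out2, out3), start=1):
    print(depth, "layer(s):", out.ravel())
```

After k layers, only nodes within k hops of node 4 carry a non-zero value, so a one-layer model sees direct neighbors only, while three layers integrate information from regions up to three hops away in the thresholded connectivity graph.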
Table 12 presents an ablation study analyzing the effect of the class weight parameter on classification performance using the Combined Dataset. The class weight is applied to the minority class in order to address the class imbalance problem and to reduce bias toward the majority class.
When no weighting is applied (class weight = 1), the model achieves moderate performance (accuracy = 0.853 ± 0.025, AUC = 0.920 ± 0.025). Although recall remains acceptable (0.772 ± 0.011), precision and F1-score are lower compared to higher weighting configurations, indicating that the model does not sufficiently emphasize the minority ADHD class.
Increasing the class weight to 1.5 improves performance, particularly in terms of precision (0.923 ± 0.069) and AUC (0.954 ± 0.028). This suggests that assigning greater importance to the minority class helps the model better distinguish ADHD subjects. However, the variability across folds slightly increases in this configuration.
With a class weight of 2, recall slightly decreases (0.755 ± 0.003), and the performance does not surpass that of the 1.5 configuration, indicating that moderate reweighting alone does not guarantee optimal balance.
The best results are obtained with a class weight of 3 (proposed configuration), which achieves the highest accuracy (0.898 ± 0.022), precision (0.930 ± 0.036), recall (0.787 ± 0.044), F1-score (0.852 ± 0.012), and AUC (0.958 ± 0.012). Importantly, this configuration also shows low standard deviation, reflecting stable and consistent behavior across folds.
These findings confirm that applying an appropriate class weight to the minority class effectively mitigates the impact of class imbalance. Among the tested values, a class weight of 3 provides the most balanced and stable performance, demonstrating that emphasizing ADHD samples during training improves discriminative ability without compromising generalization.
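The effect of the class weight can be made explicit with a weighted binary cross-entropy, in which errors on the minority (ADHD) class are multiplied by the weight. The sketch below is a minimal illustration assuming the weight enters the loss in this way; how the weighting is implemented in the proposed framework is not detailed here. The function name `weighted_bce` and the toy predictions are introduced for illustration only.

```python
import numpy as np


def weighted_bce(y_true, probs, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    per_sample = -(pos_weight * y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(per_sample.mean())


# Toy case: one ADHD subject scored too low (0.2) and one control scored 0.2.
y_toy = [1, 0]
p_toy = [0.2, 0.2]
loss_w1 = weighted_bce(y_toy, p_toy, pos_weight=1.0)
loss_w3 = weighted_bce(y_toy, p_toy, pos_weight=3.0)
print(f"weight 1: {loss_w1:.3f}, weight 3: {loss_w3:.3f}")
```

Raising the weight from 1 to 3 triples the penalty for the missed ADHD subject while leaving the control term unchanged, so gradient updates are pushed toward correcting minority-class errors.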
5.3.2. Comparison with Other Methods
Table 13 presents a comparative analysis of classification accuracy across ADHD-200 acquisition sites. The comparison includes several previously published methods and the proposed PLV-GraphSAGE approach.
Early approaches such as FCNet and DeepFMRI report moderate average accuracies (60.4% and 67.9%, respectively), indicating limited generalization across sites. The 3D-CNN method improves performance, achieving an average accuracy of 71.6%, but still shows variability between datasets.
More recent approaches demonstrate stronger performance. Dual Subspace Learning achieves high accuracy in NYU (92.4%) and Peking (89.4%), with an average of 87.1%. The attention attribute-enhanced network also reports strong results, particularly in KKI (94.5%) and NeuroIMAGE (98.4%), reaching an average accuracy of 86.2%.
The proposed PLV-GraphSAGE model achieves the highest average accuracy (89.9%) across sites. It delivers competitive or superior performance in KKI (94.7%) and OHSU (93.9%), and maintains strong results in NeuroIMAGE (92.0%) and Peking (87.7%). Although NYU accuracy (81.6%) does not exceed the best reported value, it remains competitive and consistent with the overall trend.
Importantly, PLV-GraphSAGE demonstrates stable performance across multiple sites without extreme fluctuations. Unlike some previous methods that achieve very high performance in specific datasets but show inconsistency across others, the proposed model maintains balanced accuracy across heterogeneous acquisition conditions.
The results confirm that integrating PLV-based connectivity with GraphSAGE representation learning provides strong generalization across sites. The higher average accuracy highlights the robustness of the proposed framework and supports its effectiveness for multi-site ADHD classification.
Table 14 presents the comparative performance of various methods on the ADHD-200 dataset, evaluated using common metrics including Accuracy, Recall, and Specificity. The results reveal considerable variability across approaches.
Conventional and early deep learning approaches show moderate performance. BrainNetCNN [83] achieves 63.77% accuracy, while MDCN [85] reports 67.45%, reflecting limitations in capturing complex brain connectivity patterns with conventional architectures. The LSTM with spatio-temporal convolution model reaches 71.3% accuracy, showing the improvements brought by temporal modeling. The CNN approach proposed by De Silva et al. achieves 85.36% accuracy but with relatively lower recall (72.8%) and specificity (66.54%), indicating class imbalance in prediction. TLNN [84] reports strong recall (90.0%), though specificity remains lower (77.0%). CAMEL [45] attains 86.7% accuracy; however, missing recall and specificity values limit detailed comparison. USMDA [43] achieves 84.38% accuracy and 83.87% recall, highlighting the effectiveness of unsupervised multisource domain adaptation for ADHD classification.
Graph-based approaches, including GCN [47] and BrainGNN [48], deliver intermediate performance, showing the advantages of modeling the brain as a graph but also indicating that further enhancements are required to fully leverage graph structures. Recent hybrid models, such as HAGCN [52], achieve competitive results, with an accuracy of 77.95% and recall of 80.98%, suggesting that multi-head attention mechanisms can improve classification performance.
The proposed PLV-GraphSAGE method achieves the highest accuracy among the compared approaches (89.9%), with a recall of 78.7% and a specificity of 96.5%. This suggests that integrating phase-locking value (PLV) connectivity features with GraphSAGE provides a more discriminative representation of functional brain networks. Nonetheless, there is still room for improvement, particularly in raising recall, which remains below that of some competing methods (e.g., TLNN at 90.0%), without compromising specificity. Some studies do not report certain metrics (NA), which limits direct comparison for those measures. More generally, this comparison should be interpreted with caution, as differences in preprocessing steps, feature extraction methods, and evaluation protocols can significantly influence the reported results and make direct comparisons challenging. Overall, modern graph-based and hybrid approaches, particularly PLV-GraphSAGE, show strong effectiveness for ADHD classification.