Background: Breast cancer remains a leading cause of cancer-related mortality, and reliable computational decision support is increasingly viewed as a complement to expert pathological assessment rather than a replacement for it. Variational quantum classifiers (VQCs) and Quantum Support Vector Machines (QSVMs) have recently been promoted as candidate models for medical classification, yet most published comparisons rely on internal hold-out validation alone and report only a single point estimate of discrimination, omitting calibration, decision-analytic value, and explainability—three ingredients that any clinically credible model must furnish.
Methods: We assembled a complete quantum–classical machine learning pipeline and evaluated it under a deliberately stringent protocol designed to expose, rather than conceal, the limitations of current Noisy Intermediate-Scale Quantum (NISQ)-era models. The analytical hypothesis was conservative and stated in advance; in light of saturated classical baselines on this benchmark, we did not anticipate a quantum advantage in raw discrimination, and we framed the study as a methodological probe rather than as a competition. Using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (n = 569) for development and an independent Wisconsin Original (WBC) cohort (n = 683) for external validation, we benchmarked five classical learners (XGBoost, LightGBM, CatBoost, RandomForest, RBF-SVM), two quantum models (an eight-qubit VQC implemented in PennyLane and a ZZ-feature-map QSVM implemented in Qiskit), and a stacked hybrid ensemble. The evaluation framework combined Optuna-driven hyperparameter optimisation, internal–external cross-validation, and external validation on the independent WBC cohort. Robustness and interpretability were then probed through circuit depth and embedding rotation ablation, depolarising noise stress tests, learning curve and feature stability analysis, decision curve analysis, and dual SHAP-based explanations covering both a direct tree-based explanation and a quantum surrogate. Reporting followed the TRIPOD+AI guideline.
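For readers unfamiliar with decision curve analysis, the quantity it plots is the net benefit at a chosen risk threshold: the true-positive rate per patient minus the false-positive rate per patient, weighted by the odds of the threshold. The following is a minimal sketch of that standard definition, not the pipeline used in this study; the function names and the toy labels and probabilities are hypothetical and serve only as illustration.

```python
import numpy as np


def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating every patient whose predicted risk >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    treat = np.asarray(y_prob, dtype=float) >= threshold
    n = y_true.size
    tp = np.sum(treat & y_true)    # treated and truly malignant
    fp = np.sum(treat & ~y_true)   # treated but benign
    odds = threshold / (1.0 - threshold)
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat everyone regardless of predicted risk."""
    prevalence = np.mean(np.asarray(y_true, dtype=bool))
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical toy data (illustration only, not taken from WDBC or WBC).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05])

for pt in (0.1, 0.3, 0.5):
    model_nb = net_benefit(y_true, y_prob, pt)
    all_nb = net_benefit_treat_all(y_true, pt)
    # The treat-none strategy has net benefit 0 at every threshold.
    print(f"threshold={pt:.1f}  model={model_nb:.3f}  treat-all={all_nb:.3f}  treat-none=0.000")
```

A model is useful at a given threshold only if its net benefit exceeds both reference strategies, which is the comparison a decision curve displays across the threshold range.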
Results: On the internal test partition, RBF-SVM achieved the highest discrimination (AUC = 0.998), with XGBoost, LightGBM, CatBoost, the hybrid ensemble, and the VQC clustering between 0.992 and 0.996; the QSVM with a ZZ-fidelity kernel underperformed substantially (AUC = 0.727). Pairwise tests for correlated ROC curves indicated that most differences among top models were not statistically significant. On the external WBC cohort, model rankings reorganised, as RBF-SVM (AUC = 0.986, 95% CI 0.946–0.997), RandomForest (0.985, 95% CI 0.945–0.996), VQC (0.983, 95% CI 0.942–0.995), and the hybrid ensemble (0.982, 95% CI 0.941–0.995) all retained near-ceiling discrimination with extensively overlapping confidence intervals. Ablation analysis demonstrated that the choice of embedding rotation is decisive: Z-rotation embeddings collapsed VQC performance to chance levels (AUC ≈ 0.50), whereas X- and Y-rotations preserved it. Depolarising noise up to p = 0.10 had a negligible effect on the VQC, and SHAP analyses converged on worst concave points, mean concave points, and worst area as the dominant predictors across both classical and quantum models. Decision curve analysis showed positive net benefit for both classical and hybrid models across the clinically meaningful threshold range, exceeding both the treat-all and treat-none reference strategies throughout.
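One plausible reading of the Z-rotation collapse can be checked numerically. Assuming the embedding rotation acts directly on qubits initialised in |0⟩, with no preceding basis change (an assumption; the abstract does not specify the circuit layout), RZ(x)|0⟩ differs from |0⟩ only by a global phase, so the encoded state carries no information about the feature x. The single-qubit NumPy sketch below illustrates this; the helper names are hypothetical and this is not the PennyLane circuit used in the study.

```python
import numpy as np


def rx(theta):
    """Single-qubit rotation about the X axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(theta):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def rz(theta):
    """Single-qubit rotation about the Z axis."""
    return np.array([[np.exp(-1j * theta / 2), 0], [0, np.exp(1j * theta / 2)]])


def density_after_embedding(gate, x):
    """Density matrix of gate(x)|0>; global phases cancel in this representation."""
    psi = gate(x) @ np.array([1.0, 0.0], dtype=complex)
    return np.outer(psi, psi.conj())


# Assumed feature range after scaling (hypothetical values for illustration).
features = np.linspace(0.0, np.pi, 5)
rho_zero = np.array([[1.0, 0.0], [0.0, 0.0]], dtype=complex)


def max_deviation(gate):
    """Largest entrywise distance between the encoded state and |0><0|."""
    return max(np.abs(density_after_embedding(gate, x) - rho_zero).max() for x in features)


for name, gate in (("RX", rx), ("RY", ry), ("RZ", rz)):
    print(f"{name}: max deviation from |0><0| = {max_deviation(gate):.3f}")
```

Under this assumption the RZ-encoded state never leaves |0⟩⟨0|, while RX and RY move it across the Bloch sphere as the feature changes, which would be consistent with chance-level discrimination for Z-rotation embeddings.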
Conclusions: In the present regime, the principal contribution of QML is not raw discrimination—modern classical learners are already at the data ceiling—but the construction of a rigorous, reproducible, externally validated, and interpretable benchmarking framework in which quantum models can be fairly compared with their classical counterparts. Because evaluation was confined to curated benchmark datasets rather than real-world clinical populations, the interpretability and net benefit findings reported here should be read as benchmark-level evidence and not as a demonstration of readiness for clinical deployment.