Comparative Explainable AI for Breast Cancer Classification: Cross-Model SHAP Agreement Analysis Using XGBoost and Logistic Regression

Alalawi, Khalid

doi:10.3390/app16115684


by

Khalid Alalawi

College of Computer Science and Engineering (CCSE), Taibah University, Medina 42353, Saudi Arabia

Appl. Sci. 2026, 16(11), 5684; https://doi.org/10.3390/app16115684 (registering DOI)

Submission received: 7 May 2026 / Revised: 23 May 2026 / Accepted: 25 May 2026 / Published: 5 June 2026


Abstract

Breast cancer remains one of the most frequently diagnosed cancers worldwide, and improving the accuracy and transparency of automated diagnostic tools is an ongoing clinical priority. This study examines whether three established machine-learning classifiers achieve comparable performance on a standard breast cancer benchmark and whether SHAP-based feature explanations converge on the same predictive signals across model architectures. Logistic Regression (LR), Support Vector Machines (SVM), and XGBoost were trained and tested on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset using a shared preprocessing pipeline to ensure fair comparison. Hyperparameters were selected through grid search with 5-fold stratified cross-validation, and model performance was estimated using 10-fold stratified cross-validation. Paired t-tests and Wilcoxon signed-rank tests were used to determine whether performance differences between models were statistically meaningful. SHapley Additive exPlanations (SHAP) values were computed separately for XGBoost using TreeExplainer and for Logistic Regression using LinearExplainer, and Spearman’s rank correlation was used to quantify the agreement between the two models’ feature importance rankings. All three classifiers achieved an area under the receiver operating characteristic curve (ROC-AUC) above 0.994, with SVM achieving the highest accuracy (0.9737) and F1-score (0.9630). No statistically significant difference was found between any model pair (p > 0.05). The cross-model SHAP analysis yielded a Spearman correlation of r = 0.578 (p = 0.0008), with seven of the ten most important features ranked consistently across both architectures. The agreement between two structurally different models on which features matter most provides evidence that these features carry a consistent predictive signal that goes beyond what any single model’s architecture alone would produce. Cross-model explainability analysis of this kind offers a stronger basis for feature interpretation than the output of any single model.

Keywords:

breast cancer; machine learning; explainable AI; SHAP; XGBoost; logistic regression; feature importance

1. Introduction

Breast cancer is among the most frequently diagnosed cancers and a leading cause of cancer-related death in women across the United States and many other countries [1]. When caught early, treatment is substantially more effective, yet meaningful access barriers continue to hinder timely diagnosis in numerous healthcare settings worldwide [2]. Conventional diagnostic methods—biopsies, mammography, and clinician-led image review—are resource-intensive and subject to inter-observer variability. Against this backdrop, the growing body of evidence showing that AI can match or exceed specialist-level performance in controlled clinical evaluations [3,4] has generated considerable interest in AI-assisted breast cancer screening and classification [5,6].

Machine learning (ML) has proven well-suited to breast cancer classification tasks that rely on structured tabular data. Studies using Wisconsin breast cancer datasets have shown that classifiers ranging from Logistic Regression and Decision Trees to ensemble methods can achieve strong discrimination [7,8,9]. Support Vector Machines have attracted particular attention in this context. By maximising the margin between class-separating hyperplanes in high-dimensional spaces, they construct decision boundaries with strong generalisation properties [10], and they continue to demonstrate strong performance in breast cancer classification tasks [8,9].

Much of the existing ML work in this domain stops at predictive accuracy, giving relatively little attention to validation rigour or model interpretability [9,11]. In clinical settings, a high AUC figure alone is rarely sufficient justification for deployment—clinicians need to be able to follow the model’s reasoning behind predictions before they will trust its outputs [12]. Deep-learning and ensemble models often perform impressively on benchmark data yet remain opaque to the practitioners who depend on them—a problem that is especially acute in oncology, where diagnostic decisions carry direct patient consequences [11,13].

Methodological choices further complicate the picture. When a single train-test split is used both to select hyperparameters and to estimate generalisation error, the reported figures tend to be optimistic [14]. Stratified cross-validation within proper pipeline structures, combined with formal statistical testing to verify that observed performance gaps are genuine rather than artefacts of random variation, is what credible comparative evaluation actually requires [14].

Explainable artificial intelligence (XAI) offers a principled response to the interpretability problem. Among available techniques, SHapley Additive exPlanations (SHAP)—rooted in cooperative game theory—has become particularly prominent: it assigns each feature a contribution score satisfying consistency and local accuracy axioms, making attributions theoretically grounded rather than heuristic [15]. Several recent reviews have argued that XAI is no longer optional in healthcare AI—it is a prerequisite for clinical trust and regulatory acceptance [6,13]. SHAP’s kernel-based variant applies to any model architecture, while dedicated tree and linear explainers deliver exact attributions for specific model families [15,16].

The present study brings these threads together. We train Logistic Regression, SVM, and XGBoost on the WDBC dataset [17] within a shared preprocessing pipeline, tune each model through grid search cross-validation, and compare their performance using formal pairwise statistical tests. Beyond accuracy, we apply SHAP to both XGBoost and Logistic Regression to generate global and local explanations. Then, we take a further step that has not, to our knowledge, been applied to the WDBC dataset in this form: a quantitative cross-model agreement analysis that asks whether two explainers—one tree-based and the other linear—converge on the same features as consistently important across both models.

The main contributions of this study are as follows:

A robust and fair comparative evaluation of machine-learning models using unified preprocessing pipelines, stratified cross-validation, and statistical significance testing across all model pairs.
A SHAP-based explainability analysis of both XGBoost (TreeExplainer) and Logistic Regression (LinearExplainer) that identifies and quantifies the key features influencing breast cancer classification.
A comprehensive analysis linking model behaviour with established tumour characteristics, supporting the development of interpretable AI in breast cancer classification.
A cross-model SHAP agreement analysis using Spearman rank correlation to quantify whether feature importance rankings are consistent across architecturally distinct models. To our knowledge, this approach has not previously been applied to the WDBC dataset, and it may offer a more architecture-neutral basis for identifying candidate predictive features than single-model explanations can provide.

The experimental results show that all three classifiers reach comparable, high-level performance on this benchmark, with no statistically significant differences between them. Meanwhile, the cross-model SHAP analysis yields interpretable insights grounded in features with established predictive relevance, going beyond what single-model explanations can offer.

The remainder of this paper is organised as follows. Section 2 surveys related work on machine learning for breast cancer classification and the use of SHAP-based explainability in medical AI. Section 3 describes the dataset, preprocessing pipeline, model architectures, and evaluation strategy. Section 4 presents the experimental results, including test-set performance, cross-validation outcomes, statistical comparisons, and a cross-model SHAP agreement analysis. Section 5 discusses these findings in the context of the existing literature, and Section 6 concludes the paper.

2. Related Work

The use of machine learning for breast cancer classification has a substantial history, driven by the clinical need to identify malignant tumours earlier and more reliably. Structured datasets derived from cell nuclei measurements, most notably the Wisconsin breast cancer collections, have served as a longstanding benchmark for this work [7,8]. Early work established that models ranging from Support Vector Machines and Logistic Regression to rule-based Decision Trees can reliably separate benign and malignant cases when the underlying feature distributions are well-defined [8].

Subsequent research has broadened the comparison, pitting a wider range of algorithms against one another across different dataset configurations. Classical discriminative models have generally held their own in these evaluations, often matching or exceeding more complex alternatives [9]. SVM has been a recurring standout, largely because its maximum-margin optimisation is well-matched to the high-dimensional, moderately sized feature spaces that characterise morphological cell data [8,10]. Ensemble approaches—Random Forest and gradient-boosted trees in particular—have also been applied extensively, as their ability to model nonlinear feature interactions typically produces competitive results [9,11]. Across all of these comparisons, though, accuracy tends to be treated as the main endpoint—questions about fold-to-fold stability, whether preprocessing was applied consistently across models, or how to interpret the predictions are largely left unaddressed [9,11].

Evaluation methodology has received greater scrutiny in more recent work. A single train-test split, while convenient, conflates model selection with error estimation and systematically overstates generalisation performance [14,18]. Cross-validation procedures and proper hyperparameter tuning pipelines are now more widely adopted in response, though their uptake remains inconsistent across the literature [14,18]. Even where cross-validation is used, formal statistical tests comparing model pairs remain the exception rather than the rule, leaving readers unable to determine whether reported performance differences reflect genuine algorithmic advantages or merely sampling noise [14,18].

Running alongside the push for better predictive accuracy, a parallel strand of research has asked how ML decisions can be made comprehensible to the clinicians who must act on them. In high-stakes medical settings, black-box behaviour is a genuine obstacle: practitioners will not integrate a model into diagnostic workflows unless they can interrogate its outputs [12]. A comprehensive review of XAI in healthcare highlights that transparency and interpretability are essential for ensuring trust, accountability, and adoption of AI systems in clinical environments [11].

XAI methods have been applied specifically to breast cancer prediction with encouraging results. A systematic scoping review of this literature found that incorporating explainability substantially improved both the transparency and the perceived clinical usability of the models under review [13]. Of the available XAI techniques, SHAP has seen the widest uptake, owing to its theoretical soundness (it is the unique additive feature attribution method satisfying local accuracy, missingness, and consistency) and its applicability across model types [15]. Later work introduced TreeExplainer, which computes exact SHAP values for tree-based models efficiently, along with tools for global interpretation [16].

At the applied level, studies pairing ML classifiers with SHAP or related tools have produced findings with direct clinical relevance. For instance, Arravalli et al. [5] showed that stacking multiple ML models with an XAI layer improved both interpretability and generalisation in a breast cancer prediction setting. Similarly, SHAP-based analysis has illuminated diagnostically relevant risk factors in clinical datasets [19]. Taken together, these studies underscore that explainability is not merely a theoretical concern: it changes which features practitioners attend to and how they reason about model outputs.

Despite this progress, broader reviews of AI in breast cancer diagnosis consistently flag unresolved obstacles: interpretability gaps, data quality limitations, and the substantial gap between controlled benchmarking and routine clinical deployment [6]. Many of the models that score highest on published benchmarks remain functionally opaque, which continues to slow their translation into everyday diagnostic practice [11].

Two gaps stand out from this review. First, most comparative studies report accuracy figures obtained under conditions that vary between models (different preprocessing steps, different validation schemes), which makes the comparisons difficult to interpret, and few accompany their results with statistical significance tests. Second, SHAP analyses in the breast cancer literature have almost universally been confined to a single model, leaving open the question of whether the identified features reflect robust predictive patterns or simply the idiosyncrasies of a particular learning algorithm. The present study is designed around both of these gaps: preprocessing and evaluation conditions are held constant across all three models, and the extent to which two structurally different models agree in their SHAP-derived feature rankings is formally quantified, which to our knowledge has not previously been done on this dataset.

3. Materials and Methods

3.1. Dataset Description

This study utilises the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, obtained directly from the University of California Irvine (UCI) Machine Learning Repository [17]. The dataset consists of 569 instances and 30 numerical features computed from the cell nuclei visible in digitised images of fine needle aspirates (FNA) of breast masses. The ten base nuclear properties measured are: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. For each of these ten measurements, three summary statistics were computed across the nuclei present in the image: the mean value, the standard error, and the worst (largest) value. This yields 30 features in total.

The target variable is binary, indicating whether a tumour is Benign (B, encoded as 0) or Malignant (M, encoded as 1). The dataset contains 357 benign cases and 212 malignant cases, representing a moderately imbalanced but manageable class distribution. All features are continuous-valued.

An ID column is present in the raw data file but was excluded from the analysis because it carries no predictive information. No missing values are present in the dataset, so no imputation was required.

3.2. Data Preprocessing

Prior to model training, the dataset was preprocessed to ensure consistency and improve model performance. The ID column was removed, and the diagnosis label was encoded as a binary integer (B = 0, M = 1).

The dataset was split into training and test sets using a stratified split, with 80% allocated to training and 20% to testing. Feature scaling was applied using StandardScaler within the pipeline for all three models, ensuring that scaling was applied independently within each cross-validation fold and that the comparison was fair and free of data leakage [20].

The class distribution is unequal—357 benign cases against 212 malignant—though the ratio is not severe enough to warrant resampling. No oversampling or undersampling was applied. Stratified sampling was used at both the data-splitting and cross-validation stages to maintain consistent class proportions across all folds.

3.3. Machine-Learning Models

3.3.1. Logistic Regression (LR)

Logistic Regression is a linear probabilistic classifier that models the probability of class membership using the logistic (sigmoid) function [21], as given in Equation (1):

P (y = 1 | x) = \frac{1}{1 + e^{- (w^{T} x + b)}}

(1)

where w is the weight vector, and b is the bias term. The model is trained by maximising the log-likelihood with L2 regularisation controlled by parameter C. Despite its simplicity, Logistic Regression offers direct interpretability through its learned coefficients and achieves strong performance on structured medical datasets, particularly when features are approximately linearly separable.
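
Equation (1) can be evaluated directly for a single instance. The sketch below is illustrative only: the coefficients, bias, and feature values are hypothetical and are not taken from the fitted model in this study.

```python
import math

def predict_proba(weights, bias, x):
    """Return P(y = 1 | x) for a logistic model, following Equation (1)."""
    z = sum(w_j * x_j for w_j, x_j in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two standardised features (illustrative only).
weights = [1.5, -0.5]
bias = -0.25

# z = 0.75 - 0.5 - 0.25 = 0, so the instance lies on the decision boundary.
p_boundary = predict_proba(weights, bias, [0.5, 1.0])
# z = 3.0 - 0.25 = 2.75, so the predicted probability is well above 0.5.
p_high = predict_proba(weights, bias, [2.0, 0.0])

print(p_boundary, round(p_high, 3))
```

An instance with w^T x + b = 0 receives a probability of exactly 0.5, which is why the sign of the linear term determines the predicted class at the default threshold.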

3.3.2. Support Vector Machine (SVM)

SVM seeks the hyperplane that maximises the margin between classes [10], as defined in Equation (2):

\min_{w, b} \; \frac{1}{2} \lVert w \rVert^{2} \quad \text{subject to} \quad y_{i} (w^{T} x_{i} + b) \geq 1 \quad \forall i

(2)

For nonlinearly separable data, the soft-margin SVM introduces a regularisation parameter C to trade off the margin width against classification errors. The Radial Basis Function (RBF) kernel maps inputs into a higher-dimensional space where linear separation becomes feasible, as shown in Equation (3):

K(x_{i}, x_{j}) = \exp\left(-\gamma \lVert x_{i} - x_{j} \rVert^{2}\right)

(3)

where γ (gamma) is a hyperparameter that controls the effective radius of each training example’s influence: a small γ gives a smoother, more generalised boundary, while a large γ fits the training data more tightly and risks overfitting.
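
The effect of γ described above can be seen by evaluating Equation (3) for one pair of points at two different settings. This is a small sketch with hypothetical points and γ values chosen only to make the contrast visible.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), following Equation (3)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

a = [0.0, 0.0]
b = [1.0, 1.0]  # squared Euclidean distance from a is 2.0

# Small gamma: the two points still look similar, giving a smoother boundary.
k_small_gamma = rbf_kernel(a, b, gamma=0.01)
# Large gamma: similarity decays quickly, so each point's influence is local.
k_large_gamma = rbf_kernel(a, b, gamma=10.0)

print(k_small_gamma, k_large_gamma)
```

With the small γ the kernel value stays close to 1, while with the large γ it is close to 0 for the same pair of points.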

SVM has been widely reported as one of the most effective classifiers for the Wisconsin breast cancer data due to its suitability for high-dimensional feature spaces [8,9].

3.3.3. Extreme Gradient Boosting (XGBoost)

XGBoost builds an ensemble of decision trees sequentially, where each new tree corrects the residual errors of the previous ensemble [22], as expressed in Equation (4):

{\hat{y}}_{i}^{(t)} = {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})

(4)

where the superscript (t − 1) on ŷ denotes the cumulative prediction after t − 1 trees have been added, and f_t(x_i) is the t-th tree fitted to minimise the residual error of the current ensemble.

Key hyperparameters include the number of estimators, maximum tree depth, and learning rate. XGBoost has demonstrated strong empirical performance across structured datasets and medical classification tasks [9,22].
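
The additive update in Equation (4) can be illustrated with a deliberately simplified booster. The sketch assumes squared-error loss, one-split regression stumps, and a shrinkage factor (`learning_rate`); XGBoost itself optimises a regularised objective using second-order gradient information, so this shows only the update rule and not the library's algorithm. The toy data and helper names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Sequentially add stumps, each fitted to the current residuals."""
    predictions = [0.0] * len(xs)
    history = []  # mean squared error after each added stump
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # y_hat^(t) = y_hat^(t-1) + learning_rate * f_t(x)
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, history

# Hypothetical one-dimensional toy data.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
predictions, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

Because each stump is a least-squares fit to the current residuals, the training error in `history` does not increase from one round to the next for a learning rate between 0 and 1.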

3.4. Model Training and Hyperparameter Optimisation

All three models were implemented using scikit-learn Pipeline structures that integrate StandardScaler and model training. This ensures consistent preprocessing and prevents data leakage across all models. Hyperparameter tuning was performed using Grid Search combined with 5-fold stratified cross-validation, optimising for ROC-AUC [23].

For Logistic Regression, the regularisation parameter C was tuned over {0.1, 1, 10}. For SVM, both C and kernel type (linear, RBF) were explored. For XGBoost, the number of estimators, maximum tree depth, and learning rate were optimised.
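
The tuning procedure for Logistic Regression can be sketched as follows. This is a minimal illustration, not the study's code: it assumes the copy of WDBC bundled with scikit-learn (`load_breast_cancer`), and `random_state=42`, `max_iter=1000`, and shuffled folds are assumptions that are not stated in the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
# scikit-learn codes malignant as 0 and benign as 1; flip so that malignant = 1.
y = 1 - data.target

# Stratified 80/20 split; the test partition is not used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # seed is an assumption
)

# Scaling sits inside the pipeline, so it is refitted within every CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),  # L2 penalty is the default
])
param_grid = {"clf__C": [0.1, 1, 10]}  # grid stated in the text
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=inner_cv)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```

Placing `StandardScaler` inside the `Pipeline` is what keeps validation-fold statistics out of the scaler's fit. The same pattern would apply to the SVM and XGBoost grids with their own parameter names.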

3.5. Evaluation Metrics and Statistical Testing

Five metrics were used to assess model performance: Accuracy, Precision, Recall, F1-Score, and ROC-AUC [18]. Precision, Recall, and F1-Score were computed with the malignant class as the positive class, reflecting the clinical priority of minimising false negatives.

Performance estimates were obtained through stratified 10-fold cross-validation; the resulting fold-level ROC-AUC scores were then used for pairwise statistical comparison. Both a paired t-test and the Wilcoxon signed-rank test were applied to all three model pairs [18].
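
A paired comparison of two models on fold-level scores might look like the sketch below. The ten scores per model are hypothetical placeholders, not the study's results; `ttest_rel` and `wilcoxon` are the SciPy functions for the paired t-test and the Wilcoxon signed-rank test.

```python
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical fold-level ROC-AUC scores for two models (NOT the study's values).
# Index k in both lists refers to the same cross-validation fold.
scores_a = [0.992, 0.991, 0.989, 0.992, 0.990, 0.991, 0.987, 0.991, 0.995, 0.988]
scores_b = [0.990, 0.994, 0.988, 0.996, 0.985, 0.997, 0.980, 0.999, 0.986, 0.998]

# Paired t-test: assumes the per-fold differences are roughly normal.
t_stat, t_p = ttest_rel(scores_a, scores_b)

# Wilcoxon signed-rank test: non-parametric alternative on the same differences.
w_stat, w_p = wilcoxon(scores_a, scores_b)

print(f"paired t-test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
```

Both tests operate on the per-fold differences, so the two models must be scored on identical folds for the pairing to be valid.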

3.6. Explainable AI Using SHAP

SHAP was applied to two models to provide both global and local explanations. TreeExplainer was used for XGBoost [16], and LinearExplainer was used for Logistic Regression [15]. Global explanations were obtained using SHAP summary plots and mean absolute SHAP values [16]. Local explanations were generated using SHAP waterfall plots for individual predictions. Both models received scaled input features, consistent with their training conditions. Additionally, a cross-model SHAP agreement analysis was conducted by computing Spearman’s rank correlation between the two models’ mean absolute SHAP rankings across all 30 features, providing a quantitative measure of how closely the two models agree on feature importance.

SVM was deliberately excluded from the SHAP analysis, and the reasons for this should be made explicit. SHAP values for kernel-based models, such as SVMs, must be estimated using KernelExplainer, which uses a weighted sampling approach to approximate SHAP values rather than computing them exactly. This stands in contrast to TreeExplainer and LinearExplainer, both of which derive exact attributions—the former by exploiting the tree structure of XGBoost, the latter by working directly with the linear model’s coefficient weights. Mixing exact and approximate attributions in a rank correlation analysis would introduce a form of measurement error that is difficult to quantify and could obscure genuine agreement or disagreement between models. Beyond this technical consideration, the decision to compare XGBoost and Logistic Regression is motivated by the structural differences between the two models. A gradient-boosted ensemble and a linear classifier represent substantially different modelling paradigms. Therefore, finding meaningful agreement in their feature attributions is less likely to arise solely from architectural similarity than it would be if the models were more alike. That said, extending this framework to SVM using approximate SHAP values, or exploring SVM-compatible exact methods such as TreeSHAP applied to surrogate models, remains a worthwhile direction for future work.
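
To make concrete what "exact attributions" means for a linear model, the sketch below computes SHAP values in closed form under the assumption that features are treated as independent: each attribution is the coefficient times the feature's deviation from its background mean, taken in the model's linear (log-odds) output space. The weights, background means, and instance are hypothetical, and the helper names are illustrative rather than part of the SHAP library.

```python
def linear_output(weights, bias, x):
    """Linear (log-odds) output w^T x + b."""
    return sum(w * x_j for w, x_j in zip(weights, x)) + bias

def linear_shap_values(weights, x, background_means):
    """phi_j = w_j * (x_j - E[x_j]), assuming independent features."""
    return [w * (x_j - mu) for w, x_j, mu in zip(weights, x, background_means)]

# Hypothetical model and data (illustrative only).
weights = [2.0, -1.0, 0.5]
bias = 0.25
background_means = [0.0, 1.0, 2.0]
x = [1.0, 3.0, 4.0]

phi = linear_shap_values(weights, x, background_means)
baseline = linear_output(weights, bias, background_means)  # E[f(x)] for a linear model
fx = linear_output(weights, bias, x)

# Local accuracy: the baseline plus all attributions reproduces the model output.
print(phi, baseline, fx)
```

No sampling is involved, which is the contrast with KernelExplainer drawn above: the attributions follow deterministically from the coefficients and the background means.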

3.7. Data Partitioning and Validation Strategy

The dataset was partitioned into training and test subsets using a stratified 80/20 split, preserving the original class proportions in both halves. The test set was held out entirely until final evaluation—it played no role in model selection or tuning at any earlier stage.

For performance estimation and statistical comparison, 10-fold stratified cross-validation was used across all three models, with each fold maintaining the same class proportions as the full dataset—a consideration that matters when one class is clinically more consequential than the other. This cross-validation was applied to the training partition (455 instances) rather than the complete dataset, ensuring that the held-out test set played no role whatsoever in generating the fold-level scores used for statistical testing. The 80/20 hold-out split was reserved strictly for final performance reporting, as described above.
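
The partitioning described above can be sketched as follows, again assuming the scikit-learn copy of WDBC. The RBF-kernel SVM with C = 1 is used purely as an example model, and `random_state=0` and shuffled folds are assumptions rather than details given in the text.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
# Flip scikit-learn's coding so that malignant = 1, as in this study.
X, y = data.data, 1 - data.target

# Stratified 80/20 hold-out split: 455 training and 114 test instances.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0  # seed is an assumption
)

# 10-fold stratified CV on the training partition only; X_test is never touched.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_auc = cross_val_score(model, X_train, y_train, scoring="roc_auc", cv=outer_cv)

print(len(X_train), len(X_test), fold_auc.round(4))
```

The ten values in `fold_auc` are the kind of fold-level scores that feed the pairwise statistical tests, provided every model is evaluated with the same fold assignment.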

Hyperparameter selection used Grid Search with 5-fold stratified cross-validation, operating strictly on training data. The configuration yielding the highest mean ROC-AUC across validation folds was carried forward for each model.

3.8. Implementation, Evaluation, and Statistical Testing

All experiments were carried out in Python 3.12.13 (Google Colab). Model training and evaluation pipelines were built with scikit-learn 1.6.1 [23], with XGBoost 3.2.0 [22] for gradient boosting and SHAP 0.51.0 [15,16] for explainability. Supporting data processing and visualisation relied on NumPy 2.0.2, Pandas 2.2.2, Matplotlib 3.10.0, and Seaborn 0.13.2.

Once tuning was complete, each best-performing configuration was retrained on the full training set before being applied to the held-out test data. ROC curves and confusion matrices were produced for all three models to support both quantitative comparison and error analysis.

To determine whether performance differences between models were statistically meaningful, paired t-tests and Wilcoxon signed-rank tests were applied to all three model pairs, using the fold-level ROC-AUC scores from the 10-fold cross-validation as the basis for comparison.

4. Results

4.1. Hyperparameter Optimisation Results

Hyperparameter tuning was performed using Grid Search with stratified cross-validation. For Logistic Regression, the best performance was achieved with C = 1, yielding a mean ROC-AUC of 0.9958. For SVM, the optimal configuration used an RBF kernel with C = 1, achieving a mean ROC-AUC of 0.9949. For XGBoost, the best configuration consisted of 100 estimators, a maximum tree depth of 3, and a learning rate of 0.1, with a mean ROC-AUC of 0.9938.

4.2. Test Set Performance

The optimised models were evaluated on the held-out test set. Results are summarised in Table 1. All three models achieved high performance with ROC-AUC values exceeding 0.994. SVM achieved the highest accuracy (0.9737) and F1-score (0.9630) for the malignant class, along with perfect precision (1.000). XGBoost achieved the highest ROC-AUC (0.9967) with perfect precision (1.000) but the lowest recall (0.9048). Logistic Regression achieved an accuracy of 0.9649 and a ROC-AUC of 0.9960, matching SVM’s recall (0.9286) but with slightly lower precision (0.9750 vs. 1.000): it produced one false positive and three false negatives, whereas SVM and XGBoost produced no false positives. Logistic Regression is therefore the only model that does not achieve perfect precision on the malignant class. However, the difference is marginal, and all three models perform comparably on this test set.

All three models achieve close and competitive performance, indicating that the applied preprocessing and evaluation framework provides a consistent basis for comparison.

4.3. Cross-Validation Performance

Stratified 10-fold cross-validation results are presented in Table 2. All models achieved mean ROC-AUC values above 0.992. Logistic Regression achieved the highest mean ROC-AUC (0.9952) with a standard deviation of 0.0084. SVM recorded a mean of 0.9942 with a standard deviation of 0.0096. XGBoost achieved a mean of 0.9929 with the highest standard deviation (0.0098), suggesting slightly greater sensitivity to data partitioning.

4.4. Statistical Significance Testing

To determine whether observed performance differences between models are statistically significant, paired statistical tests were conducted for all three model pairs using 10-fold cross-validation ROC-AUC scores. Results are presented in Table 3.

All three pairwise comparisons yielded p-values well above the 0.05 significance threshold in both tests, so no statistically significant difference was detected among the three models. The observed differences in mean ROC-AUC are therefore consistent with random variation across folds and provide no evidence that any one model is inherently superior. All three pairwise comparisons were conducted to avoid selective reporting and ensure a complete view of model differences. The Wilcoxon signed-rank test was selected as the primary comparison method for each pair, in line with the recommendation of Demšar [24] for pairwise classifier comparisons. While García et al. [25] and Rainio et al. [18] recommend Friedman-type tests when the intent is to compare several classifiers simultaneously, the use of individual pairwise Wilcoxon tests here is appropriate given the small number of models (three) and the specific interest in each pairwise relationship.

4.5. Confusion Matrix Analysis

Confusion matrices were generated for all three models on the test set (Figure 1). Logistic Regression produced 71 true benign, 39 true malignant, 1 false positive, and 3 false negatives. SVM produced 72 true benign, 39 true malignant, 0 false positives, and 3 false negatives. XGBoost produced 72 true benign, 38 true malignant, 0 false positives, and 4 false negatives. SVM demonstrated the best balance, achieving zero false positives while maintaining competitive recall.
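
The headline test-set metrics follow directly from these counts. The sketch below recomputes them with the malignant class as the positive class; the counts are those reported above, while the helper name `metrics` is illustrative.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 with malignant as the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts reported in the text: true benign = tn, true malignant = tp.
confusion = {
    "Logistic Regression": dict(tn=71, fp=1, fn=3, tp=39),
    "SVM": dict(tn=72, fp=0, fn=3, tp=39),
    "XGBoost": dict(tn=72, fp=0, fn=4, tp=38),
}

results = {name: metrics(**counts) for name, counts in confusion.items()}
for name, values in results.items():
    print(name, {key: round(value, 4) for key, value in values.items()})
```

Each matrix sums to 114 test instances, 42 of them malignant, so a single additional false negative changes recall by roughly 0.024.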

4.6. ROC Curve Analysis

The ROC curves for all models are shown in Figure 2. All models exhibit curves closely approaching the top-left corner, indicating excellent discrimination ability. XGBoost achieved the highest ROC-AUC (0.997), followed by Logistic Regression (0.996) and SVM (0.995). All three curves are visually nearly identical, consistent with the statistical finding that there is no significant difference between the models.

4.7. Feature Importance Analysis

Gain-based feature importance was computed for the XGBoost model (Figure 3). The most influential features are worst perimeter (importance score ~0.33), mean concave points (~0.17), worst radius (~0.13), and worst concave points (~0.11). These features represent key geometric and structural characteristics of tumour cells, indicating that size-related and concavity-related measurements play a dominant role in tree-split decisions. It is important to note that gain-based importance (shown here) and SHAP-based importance (presented in Section 4.8 and Table 4) measure different aspects of feature contribution. Gain-based importance quantifies how much a feature improves the purity of tree splits on average, which tends to favour features with large numerical ranges, such as perimeter and area. SHAP values, by contrast, measure the actual contribution of each feature to individual predictions, account for feature interactions, and provide a more theoretically grounded measure of importance. Differences between the two rankings are therefore expected and do not reflect inconsistency in the analysis.

4.8. SHAP Explainability Analysis

Where the gain-based importance in Section 4.7 reflects how much each feature reduces impurity across tree splits, the SHAP analysis below captures how much each feature actually shifts individual predictions away from the baseline—a subtler and often more informative picture of what the model has learned.

SHAP was applied to both the XGBoost and Logistic Regression models to provide global and local interpretability.

For XGBoost (Figure 4), the SHAP summary plot shows that worst concavity, area error, worst concave points, and mean concave points have the highest impact on model output. High values of concavity and concave point features are strongly associated with malignant predictions in the present study, consistent with the irregular shapes and increased concavity that distinguish malignant tumours from benign lesions [26,27]. In the SHAP framework applied here, positive values indicate contributions toward malignant classification. The SHAP waterfall plot (Figure 5) illustrates how individual features contribute to a specific prediction: for the instance shown, worst texture and worst area make the largest negative contributions, pushing the prediction toward a benign classification, with worst texture alone accounting for a shift of −0.87. In this plot, E[f(x)] is the expected model output across the full test set, that is, the baseline from which all feature contributions are measured, and f(x) is the model’s actual output for the instance shown. Each bar shows how much a single feature pushes the prediction away from that baseline: bars to the right push toward malignant, bars to the left toward benign.

To complement the visual analysis, a quantitative feature importance evaluation was conducted using the mean absolute SHAP values across all test instances. This provides a global measure of each feature’s average contribution to the model’s predictions.

Table 4 shows that worst concavity tops the XGBoost SHAP rankings (0.8338), with area error close behind (0.7227), followed by worst concave points (0.6962) and mean concave points (0.6612). The presence of area error at rank 2 is notable, as it is a size-related measurement rather than a shape feature, suggesting that cell size variability plays a role alongside concavity in driving XGBoost’s predictions.

For Logistic Regression (Figure 6), the SHAP LinearExplainer summary plot reveals that worst texture, worst symmetry, radius error, and mean concave points are the most influential features. Neither radius error nor worst symmetry appears in Table 5, which lists only the top 10 XGBoost features: radius error ranks 16th for XGBoost but third for Logistic Regression, and worst symmetry ranks 12th for XGBoost but second for Logistic Regression. These large gaps further illustrate how differently the two architectures weigh certain features. Notably, worst texture emerges as the dominant feature in LR, whereas it ranks fifth in XGBoost.

4.9. Cross-Model SHAP Agreement Analysis

To quantify the degree of agreement between the two explainable models on which features matter most, a cross-model SHAP agreement analysis was conducted. Mean absolute SHAP values were computed for all 30 features for both XGBoost and Logistic Regression, and each feature was then ranked by its mean |SHAP| value within each model. A Spearman rank correlation coefficient was calculated between the two models’ feature importance rankings to provide a model-independent measure of agreement [18].
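The ranking and correlation steps can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the code used in the study: the helper names and the five-feature importance vectors are hypothetical, and a statistics library would normally be used in practice.

```python
def rank_descending(values):
    """Rank values so the largest gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = rank_descending(x), rank_descending(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical mean |SHAP| values for five features under two models.
xgb_importance = [0.83, 0.72, 0.70, 0.66, 0.62]
lr_importance = [1.20, 0.20, 0.60, 0.55, 0.35]
rho = spearman(xgb_importance, lr_importance)
```

Because only the ranks enter the calculation, the different scales of the two explainers (tree-based versus linear attributions) do not affect the coefficient.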

The analysis yielded a Spearman rank correlation of r = 0.578 (p = 0.0008), indicating moderate and statistically significant agreement between the two models on the relative importance of features. The result is methodologically meaningful. When a linear model and a gradient-boosted ensemble, trained and evaluated independently, converge on similar feature rankings, the overlap is difficult to attribute to the quirks of either algorithm. It suggests the features they agree on carry consistent predictive weight that is not architecture-specific.
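A rough consistency check on the reported p-value is possible under one assumption that the text does not state: that significance was assessed with the common t approximation for a rank correlation, with n − 2 degrees of freedom. The function name is hypothetical; r and n are the values reported above.

```python
import math

def spearman_t_statistic(r, n):
    """t approximation for testing a rank correlation against zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

t_stat = spearman_t_statistic(0.578, 30)  # r and n as reported in the text
```

With r = 0.578 and n = 30 this gives a t value of roughly 3.75 on 28 degrees of freedom, which exceeds the two-sided 0.001 critical value of about 3.67 and is therefore compatible with the reported p = 0.0008.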

Figure 7 presents side-by-side bar charts of mean |SHAP| values for both models across the top 15 features. The bars reveal that the two models share a broadly similar hierarchy at the top, with concavity and concave-point features prominent in both, but differ in their middle ranks. Worst texture, which sits fifth for XGBoost, rises to first place for Logistic Regression, while area error occupies second position for XGBoost yet falls outside the LR top ten (rank 12). This pattern suggests that the two architectures draw on a common core of discriminative features while weighting secondary signals differently.

Figure 8 provides a rank-agreement scatter plot across all 30 features, with points near the diagonal indicating close agreement and colour intensity reflecting the magnitude of the rank difference. Mean concave points falls directly on the diagonal (rank 4 for both models), indicating exact agreement on this feature. Most of the top-10 features cluster reasonably close to the diagonal, in line with the moderate Spearman correlation of r = 0.578. The clearest outliers are area error (XGBoost rank 2, LR rank 12) and mean texture (XGBoost rank 7, LR rank 15), both of which lie well off the diagonal: a reminder that moderate overall agreement can still conceal meaningful disagreement on specific features.
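The rank differences and the top-10 overlap can be recomputed directly from the ranks listed in Table 5. The sketch below copies those ranks into a dictionary (the variable names are illustrative) and covers only the ten features in that table, not all 30.

```python
# (XGBoost rank, LR rank) for the top 10 XGBoost features, copied from Table 5.
table5 = {
    "worst concavity": (1, 6),
    "area error": (2, 12),
    "worst concave points": (3, 10),
    "mean concave points": (4, 4),
    "worst texture": (5, 1),
    "worst area": (6, 8),
    "mean texture": (7, 15),
    "worst perimeter": (8, 11),
    "worst radius": (9, 7),
    "compactness error": (10, 5),
}

# Absolute rank difference per feature.
rank_diff = {name: abs(xgb - lr) for name, (xgb, lr) in table5.items()}
# XGBoost top-10 features that also fall within the LR top 10.
shared_top10 = [name for name, (_, lr) in table5.items() if lr <= 10]
# The two features with the largest disagreement.
largest_gaps = sorted(rank_diff, key=rank_diff.get, reverse=True)[:2]
```

This reproduces the seven shared features and identifies area error and mean texture as the largest gaps, matching the outliers described for Figure 8.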

Figure 9 shows a normalised heatmap of mean |SHAP| values for the top 15 features. The darkest cell in the XGBoost column corresponds to worst concavity (0.8338), whereas worst texture dominates the LR column with the highest value in the entire heatmap (1.2046), which makes the key difference between the two models’ feature weightings immediately visible. The colour contrast between columns is most pronounced for area error (dark in the XGBoost column, much lighter in the LR column) and for worst texture, where the pattern is reversed. These cross-column differences illustrate that even features present in both top-10 lists can vary substantially in their relative contribution, and should be read alongside the rank data in Table 5 rather than in isolation.

Figure 10 displays a dot plot connecting XGBoost and Logistic Regression ranks for the top 20 features by average rank, where shorter connecting lines indicate stronger pairwise agreement. Mean concave points stands out with a near-zero line length, reflecting its rank-4 position in both models. Several other features also show short connectors, including worst area, mean compactness, and concave points error, suggesting that agreement is not confined to the top-ranked features but extends across a broader set of morphological measurements. The longest lines belong to mean concavity and perimeter error, which are ranked very differently by the two architectures despite both appearing within the top 20 by average rank; mean area and worst symmetry also show notable divergence. Across Figures 7–10, the same pattern recurs: the two models tend to agree most on concavity-related features and core geometric measurements, whereas features such as area error, mean texture, and worst symmetry are weighted differently by the models.

These findings have practical implications for feature interpretation. Features that rank highly in both models carry more interpretive weight, since their importance is not tied to the assumptions of either architecture. These are reasonable candidates for follow-up in future validation work, though confirming their clinical relevance will require testing in prospective settings beyond this benchmark.

5. Discussion

The results of this study invite several observations about the nature of machine learning performance on the WDBC benchmark and about the value of combining rigorous evaluation with multi-model explainability. Three classifiers trained under identical preprocessing conditions all exceeded ROC-AUC 0.994, and none differed significantly from the others. What distinguishes the present work from prior comparisons is less the accuracy achieved than the methodological controls applied and the added cross-model interpretability layer.

5.1. Model Performance Analysis

All three classifiers surpassed an ROC-AUC of 0.994 on the held-out test set. SVM achieved the best overall test set performance, with an accuracy of 0.9737, an F1-score of 0.9630, and zero false positives. XGBoost achieved the highest ROC-AUC (0.9967) and perfect precision, though with the lowest recall for the malignant class. Logistic Regression, despite being the simplest of the three models, performed competitively across all metrics and achieved the highest mean ROC-AUC in cross-validation (0.9952).
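The test-set metrics in Table 1 follow from the confusion-matrix counts in the standard way. In the sketch below the counts are an assumption: they were chosen to be consistent with Table 1 (a 114-instance test set with 42 malignant cases) and are not quoted from the confusion matrices in Figure 1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (malignant) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Assumed counts (not quoted from Figure 1): 114 test cases, 42 malignant.
lr = classification_metrics(tp=39, fp=1, fn=3, tn=71)
svm = classification_metrics(tp=39, fp=0, fn=3, tn=72)
xgb = classification_metrics(tp=38, fp=0, fn=4, tn=72)
```

Under these assumed counts the three models differ by a single misclassified case, which helps explain why the pairwise significance tests find no reliable difference.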

Before concluding, it is worth noting how the comparison was set up. All three models went through the same StandardScaler pipeline; hence, any differences in performance reflect the models themselves rather than differences in data preparation. XGBoost’s greater architectural complexity did not translate into clearly better results, which is exactly the kind of finding that inconsistent preprocessing would obscure.

The grid search settled on an RBF kernel for SVM, suggesting that the class boundary in WDBC feature space is not perfectly captured by a hyperplane, a nuance consistent with the nonlinear geometry of nuclear morphology data [10]. The competitive performance of Logistic Regression nonetheless indicates that any such nonlinearity is modest.

5.2. Model Stability and Generalisation

Stability across cross-validation folds is arguably as clinically important as peak accuracy: a model with erratic fold-to-fold performance cannot be trusted to behave consistently on prospective patients. Logistic Regression was the most stable, with a standard deviation of 0.0084 across the ten folds. SVM was similarly consistent (std = 0.0096). XGBoost was marginally more variable (std = 0.0098), in line with the tendency of gradient-boosted ensembles, which fit sequences of trees iteratively, to be more sensitive to variation in fold composition, although the differences between the three standard deviations are small. This observation is worth bearing in mind when considering deployment: performance on a well-curated benchmark can decline markedly when models encounter real-world clinical data distributions [28,29].

5.3. Statistical Significance of Model Comparison

The paired t-test and Wilcoxon signed-rank test both returned p-values well above 0.05 for every model pairing, confirming that none of the observed performance differences would survive a standard hypothesis test. The appropriate conclusion is not that the models are equivalent, but that the current evidence cannot distinguish between them. Reporting all three pairwise comparisons—rather than selecting only the most favourable—avoids the selective reporting bias that inflates apparent differences in many published benchmarks.
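For reference, the paired t statistic is computed from the per-fold score differences, as in the minimal sketch below. The per-fold ROC-AUC values are hypothetical, since the fold-level scores are not reported here, and the function name is illustrative.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic on per-fold score differences (df = n_folds - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))

# Hypothetical per-fold ROC-AUC scores for two models on the same 10 folds.
model_a = [0.998, 1.000, 0.985, 0.996, 1.000, 0.990, 0.999, 0.997, 1.000, 0.987]
model_b = [0.999, 1.000, 0.980, 0.997, 0.999, 0.985, 0.999, 0.995, 1.000, 0.981]
t_stat = paired_t_statistic(model_a, model_b)
```

With ten folds the test has nine degrees of freedom, for which the two-sided 5% critical value is about 2.262. All three t statistics in Table 3 (1.6374, 1.7745 and 1.0894) fall below that threshold, in agreement with the reported p-values.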

5.4. Explainability and Feature Analysis

The SHAP profiles for XGBoost and Logistic Regression complement rather than mirror each other, and this is informative. XGBoost weighted concavity-related features most heavily (worst concavity, worst concave points, mean concave points), together with area error. Logistic Regression placed worst texture at the top, with worst symmetry also prominent at rank 2. Both models, however, assigned meaningful importance to worst perimeter and worst radius, geometrically intuitive predictors that have long been associated with malignant growth patterns in the breast pathology literature [7].

The cross-model SHAP agreement analysis (Section 4.9) provides a quantitative basis for this observation. The Spearman rank correlation between the two models’ mean |SHAP| rankings was r = 0.578 (p = 0.0008), indicating moderate, statistically significant agreement. Notably, 7 of the top 10 features were shared between the two models, including worst concavity, mean concave points, worst texture, worst area, worst concave points, worst radius, and compactness error. These features can be considered robust candidate predictive features, as their importance is not an artefact of any single model’s architecture. This model-independent feature evidence may provide a stronger basis for model interpretation than SHAP from a single model alone.

The SHAP waterfall plots provide local explainability, showing which features drove a specific prediction for a given instance. This kind of transparency supports interpretability and may help identify cases where model reasoning warrants closer scrutiny before any clinical application.

5.5. Comparison with Existing Studies

The SVM accuracy of 0.9737 reported here is consistent with published benchmarks on this dataset. Cakmak and Pacal [9] similarly identified SVM as the top-performing classifier on the same benchmark. Where the present study departs from theirs is in methodology: their comparison applied neither consistent preprocessing across models nor formal statistical testing between classifier pairs, leaving it unclear whether their performance rankings reflect algorithmic differences or methodological ones. The current framework addresses this ambiguity by showing, using both a paired t-test and a Wilcoxon signed-rank test, that none of the pairwise differences is statistically significant (all p > 0.05).

Regarding explainability, Arravalli et al. [5] applied multiple XAI techniques, including SHAP, to a breast cancer classification framework using clinical features, finding that tumour size and lymph node involvement were the dominant predictors—demonstrating the value of integrating explainability methods with ML models in clinical breast cancer contexts. Similarly, Jafarabadi and Abdolkarimi [30] used SHAP-based analysis on gene expression data from breast cancer patients and highlighted the value of interpretable feature attribution for identifying clinically relevant genetic biomarkers. The present study extends these contributions by applying SHAP to two architecturally distinct models simultaneously—XGBoost and Logistic Regression—rather than a single model, and by formally quantifying the agreement between them using Spearman’s rank correlation.

Alelyani et al. [19] applied explainable AI to breast cancer classification in a clinical context. They identified breast density, age, and family history as the most influential predictors—demonstrating the value of explainability methods for uncovering clinically actionable risk factors across diverse datasets. Inspection of the studies catalogued by Ghasemi et al. [13] reveals that each of the 30 reviewed studies applied XAI methods to a single model type, with SHAP predominantly applied to tree-based ensemble models in isolation. While cross-model comparisons of SHAP rankings have been explored in the broader XAI literature, the specific approach of quantifying rank correlation between exact tree-based and linear SHAP attributions has not been applied to the WDBC dataset. The cross-model SHAP agreement analysis presented here directly addresses this gap by demonstrating that 7 of the top 10 features are consistently identified across both models, providing a more robust basis for feature-level interpretation than any single-model analysis can offer.

5.6. Limitations

Several limitations qualify the conclusions of this study. The WDBC dataset, though a well-validated benchmark, comprises 569 cases with clean numerical features, a far cry from the noisy, heterogeneous data encountered in clinical practice [28,29]. The feature set is limited to FNA-derived nuclear measurements; clinical covariates such as patient age, family history, hormonal status, and genetic markers are absent. Combining both types of information would likely improve both prediction and explainability. Most fundamentally, all results here are based on a single, publicly available benchmark. Validation across independent, multi-site clinical datasets is necessary before these findings can be translated into practice.

A further constraint is that the cross-model SHAP agreement analysis covers only two of the three classifiers studied. SVM was excluded because its SHAP implementation relies on approximation rather than exact computation, which would reduce the precision of the rank correlation. The agreement result should therefore be read as characterising the relationship between a linear and a tree-based model specifically, and not as a general statement about consensus across all three architectures. Future work could address this by incorporating approximate SHAP values for SVM, with appropriate caveats about their precision, or by applying a different attribution method that supports exact inference for kernel models.

The hyperparameter search grids used here were deliberately kept narrow (three values for the regularisation parameter C in Logistic Regression and SVM; two values each for the number of estimators, tree depth, and learning rate in XGBoost), which is reasonable for a benchmark study but may have left better configurations unexplored. A wider search or a Bayesian optimisation approach might yield modestly different models and should be considered in follow-up work.

In addition, all hyperparameter selection and cross-validation scoring in this study were optimised for ROC-AUC, a threshold-independent metric that does not encode the relative costs of false positives and false negatives. In a clinical screening context, missing a malignant case is considerably more serious than a false alarm. Future work should explore cost-sensitive evaluation metrics, such as weighted F-beta scores or decision-curve analysis, that better reflect this asymmetry.
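As one illustration of a recall-weighted metric, the F-beta score with beta = 2 counts a missed malignant case more heavily than a false alarm. The counts below are an assumption chosen to match the SVM row of Table 1 (39 true positives, 0 false positives, 3 false negatives); they are not quoted from the confusion matrices.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Assumed counts consistent with Table 1 for SVM (39 TP, 0 FP, 3 FN).
f1_svm = f_beta(tp=39, fp=0, fn=3, beta=1.0)
f2_svm = f_beta(tp=39, fp=0, fn=3, beta=2.0)
```

Under these counts the F2 score is lower than the F1 score, because all of the SVM errors are false negatives and F2 penalises those more strongly.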

A related methodological constraint concerns the evaluation’s statistical power. Ten-fold cross-validation yields ten performance estimates, which is a fairly narrow basis for the paired significance tests used here. Approaches such as repeated k-fold cross-validation or bootstrap resampling would produce more stable estimates and tighter confidence intervals. They should be considered in follow-up work that seeks to characterise model differences with greater precision.
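A percentile bootstrap over per-fold differences is one way such an interval could be obtained. The sketch below is illustrative only: the fold differences are hypothetical, the function name and defaults are assumptions, and resampling ten fold-level values does not remove the dependence between folds that share training data.

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical per-fold ROC-AUC differences between two models (10 folds).
fold_diffs = [-0.001, 0.000, 0.005, -0.001, 0.001, 0.005, 0.000, 0.002, 0.000, 0.006]
ci_low, ci_high = bootstrap_ci(fold_diffs)
```

An interval that contains zero would lead to the same conclusion as the paired tests, namely that the available evidence cannot distinguish the two models.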

It is also worth acknowledging that the Spearman rank correlation of r = 0.578, while statistically significant, reflects moderate rather than strong agreement. As the scatter plot in Figure 8 shows, features such as area error and mean texture exhibit rank differences of 10 and 8, respectively, which is a substantial divergence given that only 30 features are ranked in total. The headline correlation figure should therefore not be read as implying that the two models agree closely across the board; it indicates that agreement is genuine and non-random, but is concentrated among the higher-ranked features and weakens considerably further down the list.

5.7. Implications for Clinical Practice

One practical message from these results is that well-validated classical models, when fairly evaluated, perform strongly on established benchmarks—on this dataset, they achieved ROC-AUC values above 0.994—without requiring the architectural complexity of deep learning. Whether this translates to clinically useful performance in real screening settings remains an open question that prospective validation would need to address. Applying SHAP to multiple model types, rather than a single one, sets a higher bar for feature-based interpretation [11,12]. Global feature rankings allow researchers to verify that the model attends to morphological features with established predictive relevance, while individual waterfall plots make specific predictions interpretable at the instance level.

6. Conclusions

This paper set out to evaluate three established machine-learning classifiers on the Wisconsin Diagnostic Breast Cancer dataset under conditions designed to support fair comparison, and to extend the analysis with a cross-model SHAP agreement analysis that, to the author’s knowledge, has not previously been applied to the WDBC dataset in this form. Logistic Regression, SVM, and XGBoost were trained within a shared preprocessing pipeline, tuned via grid search cross-validation, and compared using both a paired t-test and the Wilcoxon signed-rank test.

All three classifiers exceeded an ROC-AUC of 0.994, and no pairwise differences in performance were statistically significant. SVM achieved the highest accuracy and F1-score; XGBoost achieved the highest ROC-AUC, with perfect precision for the malignant class. Running all three classifiers through the same preprocessing pipeline matters methodologically—it means the performance rankings reflect actual model differences rather than differences in how each model was set up.

The more distinctive contribution is the cross-model SHAP agreement analysis. TreeExplainer was applied to XGBoost and LinearExplainer to Logistic Regression independently, and Spearman’s rank correlation was computed between the two models’ feature importance rankings across all 30 features. The resulting coefficient (r = 0.578, p = 0.0008) indicates moderate, statistically significant agreement: seven features (worst concavity, mean concave points, worst texture, worst area, worst concave points, worst radius, and compactness error) appeared in the top 10 for both architectures. Because these rankings were produced independently by structurally different classifiers, their agreement constitutes evidence of a consistent predictive signal unlikely to reflect a single model’s idiosyncrasies, though confirming genuine clinical relevance will require external validation in independent, prospectively collected datasets. To the author’s knowledge, this cross-model SHAP agreement approach has not previously been applied to the WDBC dataset, and it warrants further investigation across other medical classification tasks.

The main limitation is reliance on a single benchmark dataset. The WDBC collection is clean, well-curated, and relatively small; performance on it will not automatically transfer to noisier or more heterogeneous clinical data. Extending this framework to larger, multi-site datasets that include clinical covariates alongside morphological measurements is the most pressing direction for future work. Prospective clinical validation and integration with real diagnostic workflows would further establish whether the identified features and the cross-model agreement pattern hold in practice.

Funding

This research received no external funding.

Institutional Review Board Statement

This study did not require ethics approval. The WDBC dataset is a publicly available, fully anonymized secondary dataset obtained from the UCI Machine Learning Repository. No human participants were directly involved in this research, and no personally identifiable information was used or accessed.

Informed Consent Statement

Not applicable. This study used publicly available, de-identified data, and no additional informed consent was required.

Data Availability Statement

The Wisconsin Diagnostic Breast Cancer (WDBC) dataset used in this study is publicly available through the UCI Machine Learning Repository at https://doi.org/10.24432/C5DW2B. No proprietary or restricted data were used.

Acknowledgments

The author would like to thank Taibah University for providing the institutional support and resources that facilitated this research. During the preparation of this manuscript, the author used Gemini (Google, version 3.1) for limited language editing and readability improvement of the text. After using this tool, the author carefully reviewed and edited the content as necessary and takes full responsibility for the accuracy, integrity, and final content of the publication.

Conflicts of Interest

The author declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Siegel, R.L.; Miller, K.D.; Wagle, N.S.; Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 2023, 73, 17–48.
2. Ginsburg, O.; Yip, C.; Brooks, A.; Cabanes, A.; Caleffi, M.; Yataco, J.A.D.; Gyawali, B.; McCormack, V.; de Anderson, M.M.; Mehrotra, R.; et al. Breast cancer early detection: A phased approach to implementation. Cancer 2020, 126, 2379–2393.
3. Esteva, A.; Kuprel, B.; Novoa, R.A.; Ko, J.; Swetter, S.M.; Blau, H.M.; Thrun, S. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017, 542, 115–118; Erratum in Nature 2017, 546, 686. https://doi.org/10.1038/nature22985.
4. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
5. Arravalli, T.; Chadaga, K.; Muralikrishna, H.; Sampathila, N.; Cenitta, D.; Chadaga, R.; Swathi, K.S. Detection of breast cancer using machine learning and explainable artificial intelligence. Sci. Rep. 2025, 15, 26931.
6. Ali, A.; Alghamdi, M.; Marzuki, S.S.; Din, T.A.T.; Yamin, M.S.; Alrashidi, M.; Alkhazi, I.S.; Ahmed, N. Exploring AI Approaches for Breast Cancer Detection and Diagnosis: A Review Article. Breast Cancer Targets Ther. 2025, 17, 927–947.
7. Wolberg, W.H.; Street, W.; Mangasarian, O. Machine learning techniques to diagnose breast cancer from image-processed nuclear features of fine needle aspirates. Cancer Lett. 1994, 77, 163–171.
8. Akay, M.F. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst. Appl. 2009, 36, 3240–3247.
9. Cakmak, Y.; Pacal, I. Enhancing Breast Cancer Diagnosis: A Comparative Evaluation of Machine Learning Algorithms Using the Wisconsin Dataset. J. Oper. Intell. 2025, 3, 175–196.
10. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
11. Mienye, I.D.; Obaido, G.; Jere, N.; Mienye, E.; Aruleba, K.; Emmanuel, I.D.; Ogbuokiri, B. A survey of explainable artificial intelligence in healthcare: Concepts, applications, and challenges. Inform. Med. Unlocked 2024, 51, 101587.
12. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608.
13. Ghasemi, A.; Hashtarkhani, S.; Schwartz, D.L.; Shaban-Nejad, A. Explainable artificial intelligence in breast cancer detection and risk prediction: A systematic scoping review. Cancer Innov. 2024, 3, e136.
14. Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. 2006, 7, 91.
15. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process Syst. 2017, 30, 4765–4774.
16. Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.-I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
17. Wolberg, W.H.; Mangasarian, O.L.; Street, W.N.; Street, W.H. Breast Cancer Wisconsin (Diagnostic). Dataset. UCI Machine Learning Repository. 1993. Available online: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic (accessed on 1 April 2026).
18. Rainio, O.; Teuho, J.; Klén, R. Evaluation metrics and statistical tests for machine learning. Sci. Rep. 2024, 14, 6086.
19. Alelyani, T.; Alshammari, M.M.; Almuhanna, A.; Asan, O. Explainable Artificial Intelligence in Quantifying Breast Cancer Factors: Saudi Arabia Context. Healthcare 2024, 12, 1025.
20. Kaufman, S.; Rosset, S.; Perlich, C.; Stitelman, O. Leakage in data mining: Formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 2012, 6, 1–21.
21. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013.
22. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16); ACM: New York, NY, USA, 2016; pp. 785–794.
23. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
24. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.
25. García, S.; Fernández, A.; Luengo, J.; Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 2010, 180, 2044–2064.
26. Birwadkar, A.; Buttny, S.; Sahli, C.; Kenry. Machine-Learning-Guided Analysis of Breast Tumor Malignancy Based on Nuclear Morphological Features. Adv. Intell. Discov. 2025, 1, e202500034.
27. Pan, H.; Shi, C.; Zhang, Y.; Zhong, Z. Artificial intelligence-based classification of breast nodules: A quantitative morphological analysis of ultrasound images. Quant. Imaging Med. Surg. 2024, 14, 3381–3392.
28. Yang, J.; Soltan, A.A.S.; Clifton, D.A. Machine learning generalizability across healthcare settings: Insights from multi-site COVID-19 screening. npj Digit. Med. 2022, 5, 69.
29. Zvuloni, E.; Celi, L.A.; Behar, J.A. Generalization in medical AI: A perspective on developing scalable models. arXiv 2023, arXiv:2311.05418.
30. Jafarabadi, A.; Abdolkarimi, E.S. Advanced explainable AI-driven biomarker identification for early breast cancer detection using peripheral blood mononuclear cells: Insights into prognostic biomarkers. Biomed. Signal Process. Control 2025, 108, 107910.

Figure 1. Confusion matrices for Logistic Regression, SVM, and XGBoost on the test dataset. Darker cells indicate correctly classified instances (true positives and true negatives); lighter cells indicate misclassifications (false positives and false negatives).

Figure 2. ROC curves of Logistic Regression, SVM, and XGBoost models on the test dataset.

Figure 3. Top 10 feature importance scores based on the XGBoost model.

Figure 4. SHAP summary plot (XGBoost) showing global feature importance and impact direction on model predictions.

Figure 5. SHAP waterfall plot (XGBoost) illustrating feature contributions for an individual prediction.

Figure 6. SHAP summary plot (Logistic Regression) showing global feature importance using LinearExplainer.

Figure 7. Side-by-side bar charts of mean |SHAP| values for XGBoost (blue) and Logistic Regression (orange) across the top 15 features. Spearman r = 0.578 (p = 0.0008).

Figure 8. Rank agreement scatter plot for all 30 features. Points near the diagonal indicate agreement; colour intensity (green to red) indicates rank difference magnitude. The top 10 XGBoost features are labelled.

Figure 9. Heatmap of normalised mean |SHAP| values for the top 15 features across XGBoost and Logistic Regression. Raw values shown in cells; colour represents relative importance within each model column.

Figure 10. Dot plot showing XGBoost (blue circle) and Logistic Regression (orange diamond) ranks for the top 20 features by average rank. Shorter connecting lines indicate stronger agreement between models.

Table 1. Test Set Performance of Models.

Model	Accuracy	Precision (Malignant)	Recall (Malignant)	F1-Score (Malignant)	ROC-AUC
Logistic Regression	0.9649	0.9750	0.9286	0.9512	0.9960
SVM	0.9737	1.0000	0.9286	0.9630	0.9947
XGBoost	0.9649	1.0000	0.9048	0.9500	0.9967

Table 2. Cross-Validation Results (ROC-AUC, 10-Fold Stratified).

Model	Mean ROC-AUC	Std Dev
Logistic Regression	0.9952	0.0084
SVM	0.9942	0.0096
XGBoost	0.9929	0.0098

Table 3. Statistical Significance Testing Results (all model pairs).

Comparison	t-Statistic	t-Test p-Value	Wilcoxon W	Wilcoxon p-Value
LR vs. SVM	1.6374	0.1360	2.0000	0.1875
LR vs. XGBoost	1.7745	0.1097	2.5000	0.1250
SVM vs. XGBoost	1.0894	0.3043	6.0000	0.2188

Table 4. Top 10 Features Ranked by Mean Absolute SHAP Values (XGBoost).

Feature	Mean |SHAP|
worst concavity	0.8338
area error	0.7227
worst concave points	0.6962
mean concave points	0.6612
worst texture	0.6247
worst area	0.5281
mean texture	0.4581
worst perimeter	0.4550
worst radius	0.4119
compactness error	0.2809

Table 5. Top 10 Features by XGBoost SHAP Rank vs. Logistic Regression SHAP Rank.

Feature	XGBoost Mean |SHAP|	XGBoost Rank	LR Rank	|Rank Diff|
worst concavity	0.8338	1	6	5
area error	0.7227	2	12	10
worst concave points	0.6962	3	10	7
mean concave points	0.6612	4	4	0
worst texture	0.6247	5	1	4
worst area	0.5281	6	8	2
mean texture	0.4581	7	15	8
worst perimeter	0.4550	8	11	3
worst radius	0.4119	9	7	2
compactness error	0.2809	10	5	5

Note: Features appearing in the top 10 for both XGBoost and Logistic Regression (7 of 10 features): worst concavity, mean concave points, worst texture, worst area, worst concave points, worst radius, compactness error.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alalawi, K. Comparative Explainable AI for Breast Cancer Classification: Cross-Model SHAP Agreement Analysis Using XGBoost and Logistic Regression. Appl. Sci. 2026, 16, 5684. https://doi.org/10.3390/app16115684

