1. Introduction
Breast cancer is the most frequently diagnosed malignancy and a leading cause of cancer-related death among women worldwide. According to the World Health Organization, an estimated 2.3 million women were diagnosed with breast cancer and 670,000 deaths occurred globally in 2022, underscoring the need for effective early detection strategies [1,2]. Prognosis is strongly associated with stage at diagnosis, making timely screening a cornerstone of breast cancer control.
Digital Mammography (DM) remains the gold standard for population-based screening owing to its availability, reproducibility, and relatively low cost [3]. However, its diagnostic performance declines in women with dense breast tissue, where tissue overlap may obscure lesions or generate false-positive findings, leading to unnecessary biopsies or missed diagnoses [4,5,6]. These limitations have stimulated interest in imaging techniques that combine morphological and functional information, particularly contrast-enhanced modalities such as Contrast-Enhanced Spectral Mammography (CESM) [7,8].
CESM has emerged as a complementary modality that augments conventional X-ray imaging with iodine-based contrast, enabling visualization of hypervascular regions commonly associated with malignancy. State-of-the-art reviews report improved lesion conspicuity and diagnostic confidence with CESM, highlighting its expanding roles in diagnosis and procedural guidance [7,9,10,11,12]. Accumulating evidence demonstrates that CESM provides diagnostic performance comparable to breast magnetic resonance imaging (MRI). Several comparative studies report that CESM achieves sensitivities equivalent to MRI while offering practical advantages such as shorter examination time [13,14]. A 2022 study additionally describes slightly higher overall accuracy and specificity for CESM in the evaluation of multifocal and multicentric breast cancer [15]. Likewise, a 2020 meta-analysis finds pooled sensitivities of approximately 97% for both modalities, with a diagnostic odds ratio favoring CESM [16]. Moreover, a systematic review encompassing 19 studies confirms consistently high sensitivities (≈97%) for both modalities and, in several instances, superior specificity with CESM [17]. Collectively, this body of evidence supports CESM as a robust and more accessible alternative to MRI for breast cancer detection, particularly in healthcare settings with limited MRI availability. Despite these advantages, CESM requires intravenous iodinated contrast and dedicated workflows, which may limit widespread implementation, particularly in resource-constrained health systems. Furthermore, iodinated contrast introduces risks of hypersensitivity reactions and kidney-related complications, prompting professional guidance documents to provide patient-selection and management recommendations [18,19].
These clinical and operational constraints underscore the need for complementary, non–contrast-dependent strategies to strengthen diagnostic performance and support decision-making across breast imaging modalities. In parallel, artificial intelligence (AI) has emerged as a powerful tool to complement imaging-based decision-making. Deep learning (DL) approaches, particularly convolutional neural networks (CNNs), demonstrate strong performance in mammography for detection and classification tasks [20,21,22,23]. However, most studies focus on a single imaging modality, and comparative evidence across DM and CESM remains scarce. Assessing whether both modalities reveal consistent visual cues offers valuable insights into the robustness and generalizability of AI-based breast lesion assessment.
The recent literature highlights the importance of explainable AI (XAI) in breast imaging, particularly for understanding modality-specific errors and identifying features that influence model predictions. Shifa et al. (2025) emphasize that because diagnostic decisions have major implications for patient outcomes, understanding the rationale behind AI predictions is essential [24]. Similarly, a 2023 systematic review by Gurmessa et al. reports that XAI not only enhances accuracy and reduces human error but also addresses key ethical challenges by promoting transparency, accountability, robustness, and the right to information in machine learning–based decision making [25]. Despite these advances, only a limited number of studies extend XAI analyses to both DM and CESM, leaving unresolved whether their diagnostic patterns align or diverge when evaluated using deep learning models.
This limitation is further amplified by recent trends in CESM-focused AI research, which remains largely modality-specific: most recent AI studies involving CESM restrict their analyses to this modality alone. Dominique et al. (2022), for example, propose a CNN-based CESM classifier but do not evaluate its behavior on DM images [26]. Likewise, Jailin et al. (2023) develop a CESM-based multimodal CAD framework without performing direct cross-modality comparisons [27]. Because DM and CESM differ in acquisition physics, contrast enhancement patterns, and tissue representation, the absence of harmonized cross-modality evaluations limits our understanding of how AI models generalize across imaging techniques. This gap underscores the need for systematic comparisons under identical preprocessing, training, and evaluation conditions.
DL research in mammography advances rapidly with the emergence of transformer-based and hybrid attention architectures. Studies published between 2022 and 2024 show that these models outperform traditional CNNs in lesion localization, breast density prediction, and malignancy classification [28,29,30]. Chang et al. (2025) report that deep-learning systems trained on large, multi-institutional datasets achieve screening performance approaching that of radiologists [30]. Brahmareddy et al. (2025) further demonstrate that a multimodal, multitask hybrid CNN–Transformer framework enhances breast cancer diagnosis by jointly modeling spatial, temporal, and clinical information, outperforming state-of-the-art baselines in both subtype classification and stage prediction while improving interpretability through integrated explainability modules [28]. Despite these advances, most investigations remain modality-specific, underscoring the need to examine whether consistent diagnostic patterns emerge across DM and CESM. Collectively, these observations highlight the importance of developing a unified analytical framework capable of assessing modality-consistent diagnostic patterns using contemporary deep learning and explainability techniques.
The present study investigates whether similar visual diagnostic patterns are identified in DM and CESM using explainable deep learning models. In this context, SHapley Additive exPlanations (SHAP) provides pixel-wise attribution scores that quantify how each image region contributes to a prediction, thereby making CNN decisions more transparent and clinically interpretable [31]. Three CNN architectures, ResNet-18, DenseNet-121, and EfficientNet-B0, are evaluated across three clinically relevant binary tasks (Normal vs. Benign, Benign vs. Malignant, and Normal vs. Malignant). To elucidate model behavior, SHAP is applied to generate pixel-level attribution maps in both modalities. The comparison of these explanations between DM and CESM enables identification of consistent visual cues that may support lesion characterization even when CESM is unavailable, thus promoting equitable access to AI-assisted, high-quality breast imaging in resource-limited healthcare settings.
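For readers who wish to reproduce this kind of analysis, the following minimal sketch shows how pixel-level SHAP attribution maps can be generated for a fine-tuned binary CNN. The input resolution, placeholder tensors, and the choice of shap.GradientExplainer are illustrative assumptions, not details of the study's pipeline.

```python
# Minimal sketch: pixel-level SHAP attributions for a binary CNN classifier.
# Input size and placeholder tensors are illustrative assumptions.
import torch
import shap
from torchvision import models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical EfficientNet-B0 adapted to one binary task (e.g., Benign vs. Malignant).
model = models.efficientnet_b0(weights=None)
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 2)
# In practice, the fine-tuned task weights would be loaded here, e.g.:
# model.load_state_dict(torch.load("effnet_b0_benign_vs_malignant.pt", map_location=device))
model.to(device).eval()

# background: a small reference batch drawn from the training set;
# test_batch: the DM or CESM images to explain. Random tensors stand in for real data.
background = torch.randn(16, 3, 224, 224, device=device)
test_batch = torch.randn(4, 3, 224, 224, device=device)

# GradientExplainer approximates SHAP values via expected gradients.
explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(test_batch)

# Each attribution map assigns every pixel a signed contribution toward a class:
# positive values push the prediction toward that class, negative values away.
# (The exact output layout varies across SHAP versions.)
```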
This article is an extended version of the conference paper “Interpretable Deep Learning for Breast Lesion Classification: A SHAP-Based Comparison of CESM and Digital Mammography,” presented at AHTBE 2025 (Paper ID: AH1558).
4. Discussion
This study presents a comprehensive comparison of three CNN architectures, ResNet-18, DenseNet-121, and EfficientNet-B0, for breast lesion classification using DM and CESM. The results reveal clear performance differences between models and modalities, while SHAP-based interpretability analysis provides valuable insight into the spatial reasoning underlying CNN decisions. Together, these findings offer a balanced understanding of how model architecture and imaging modality jointly influence diagnostic performance and clinical reliability.
Across all metrics, CESM consistently outperforms DM in tasks requiring the discrimination of subtle tissue differences, such as Normal vs. Benign and Benign vs. Malignant. The use of iodine-based contrast improves lesion conspicuity and enhances discriminative power, reflected in higher AUC, precision, and F1-scores across architectures. In contrast, DM demonstrates stronger separability in the Normal vs. Malignant task, suggesting that structural and textural information alone can be sufficient for identifying overt malignancies. These results underscore the complementary diagnostic value of both modalities: CESM provides functional information that facilitates early and accurate characterization of complex lesions [42,43], whereas DM remains a reliable and interpretable baseline for broad screening workflows [44,45].
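As a point of reference, the per-modality comparison above relies on standard ranking and threshold-based metrics; a minimal sketch of how such a summary can be computed with scikit-learn follows. The labels and probabilities are placeholders, not outputs from the study.

```python
# Minimal sketch: per-modality test metrics (AUC, precision, recall, F1).
# y_true and the probability arrays are placeholders for real model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

def summarize(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    auc = roc_auc_score(y_true, y_prob)  # threshold-free ranking metric
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {"AUC": auc, "precision": prec, "recall": rec, "F1": f1}

# Placeholder predictions for the same test cases under each modality.
y_true = np.array([0, 1, 1, 0, 1, 0])
print("DM  :", summarize(y_true, np.array([0.2, 0.7, 0.6, 0.4, 0.9, 0.1])))
print("CESM:", summarize(y_true, np.array([0.1, 0.8, 0.9, 0.2, 0.95, 0.05])))
```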
Although EfficientNet-B0 is selected as the reference model due to its overall stability and superior performance across modalities, its AUC in the Normal vs. Benign task for DM approximates random classification, indicating limited discriminative learning in this specific scenario. This behavior is visually supported by the SHAP attribution maps (Figure 4), where activations appear spatially inconsistent compared with the other tasks. Such diffuse activation likely reflects stochastic learning rather than meaningful lesion recognition. In contrast, SHAP visualizations for the Benign vs. Malignant and Normal vs. Malignant tasks display anatomically coherent attention patterns across both modalities. CESM heatmaps show focal attributions concentrated on contrast-enhancing tissue, whereas DM maps, although more diffuse, consistently highlight the same anatomical regions. This convergence supports the reliability of DM-based representations, even when the absence of contrast reduces localization sharpness.
Particularly noteworthy is the performance of DM in the Normal vs. Malignant comparison, where the AUC surpasses that of CESM. This finding is clinically significant because DM is more widely available and less resource-intensive than CESM, making it a cornerstone modality in low- and middle-income healthcare systems. The ability of DM to achieve high separability between malignant and normal cases demonstrates that conventional mammography, when analyzed using well-tuned CNNs, provides accurate and interpretable diagnostic cues. This strengthens its role in accessible AI-assisted breast cancer screening. Moreover, such performance aligns with clinical screening priorities, where differentiating malignant from normal cases is critical for triage and early intervention, even without contrast enhancement.
When comparing CNN architectures, EfficientNet-B0 achieves the most balanced and superior performance across modalities, particularly in test AUC and precision. Its compound scaling strategy and efficient parameter utilization enable it to capture richer hierarchical representations without overfitting. DenseNet-121, while occasionally achieving competitive results during cross-validation, shows greater variability in the test set, suggesting increased sensitivity to distribution shifts. ResNet-18 demonstrates stable but conservative behavior, producing moderate F1-scores and recall values that reflect robustness but limited discriminative capacity in complex cases. These architectural differences highlight the importance of depth, connectivity, and representational efficiency in capturing multimodality imaging features. Recent studies further emphasize the continued competitiveness of EfficientNet-based models in medical imaging, even relative to emerging transformer architectures, owing to their parameter efficiency and stable performance on small to medium-sized datasets.
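To make the architectural comparison concrete, the following sketch shows one plausible way the three backbones can be instantiated with binary classification heads in torchvision; the use of ImageNet pretraining is an assumption about the setup rather than a confirmed detail of the study.

```python
# Minimal sketch: the three backbones compared here, each adapted to a binary task.
# ImageNet pretraining and head replacement are assumptions about the setup.
import torch.nn as nn
from torchvision import models

def build(name: str, num_classes: int = 2) -> nn.Module:
    if name == "resnet18":
        m = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    elif name == "densenet121":
        m = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        m.classifier[1] = nn.Linear(m.classifier[1].in_features, num_classes)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return m
```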
To contextualize these findings within existing literature, Table 6 summarizes selected deep learning studies that address breast lesion classification using DM and CESM. Although methodological details and target tasks vary among studies, the reported results collectively establish reference benchmarks for CNN-based mammographic analysis.
In comparison with prior research, our results show that EfficientNet-B0 achieves competitive or superior discriminative performance across modalities. For instance, Aboutalib et al. (2018) [46] report AUC values between 0.76 and 0.91 for distinguishing benign from malignant findings in DM without contrast, whereas our EfficientNet-B0 model attains an AUC of 0.97 in the Normal vs. Malignant task using DM, representing a substantial improvement in classification separability. This highlights the ability of modern architectures to extract richer morphological features even from non-contrast data. Similarly, Ribli et al. (2018) [47] achieve an AUC of approximately 0.85 using a Faster R-CNN approach on the INbreast dataset, comparable to our CESM results, which exceed 0.93 in the same task.
Moreover, Helal et al. (2024) [48] validate a multiview deep-learning framework on CESM that achieves AUC values between 0.90 and 0.94 for benign versus malignant classification, supporting the high discriminative capacity of CNNs when contrast enhancement is available. Their findings are consistent with our CESM results, reinforcing that contrast information enhances lesion separability through improved vascular and morphological characterization.
Finally, Qasrawi et al. (2024) [49] demonstrate that ensemble CNNs can achieve high accuracy (96.6%) on DM, consistent with our observation that EfficientNet-B0 effectively leverages structural and textural cues even without contrast information. This comparison underscores that, while prior CNN-based systems report strong performance on individual datasets, our framework extends this evidence by directly contrasting CESM and DM under identical preprocessing, training, and evaluation conditions, thereby providing a controlled benchmark for cross-modality generalization.
Unlike previous studies, which typically evaluate DM or CESM in isolation, the present work compares both modalities within a single harmonized framework, which strengthens the validity of cross-modality conclusions and underlies the novelty of this approach.
It is important to note that the studies summarized in Table 6 do not use the same dataset employed in the present work. None of the prior publications were conducted using the CDD-CESM dataset, nor did they evaluate DM and CESM under a harmonized pipeline with identical preprocessing, training, and evaluation criteria. Therefore, the values reported in Table 6 serve as contextual benchmarks rather than direct numerical comparisons. A key contribution of this study is that, to the best of our knowledge, it is the first to analyze both DM and CESM within the same dataset and modeling framework, enabling a controlled modality-level comparison that is not available in prior literature.
The interpretability results derived from SHAP further contextualize these quantitative findings. In both DM and CESM, the CNNs attend to anatomically coherent regions, primarily surrounding lesions or areas of contrast uptake. CESM produces sharper and more localized SHAP activations, consistent with contrast-driven delineation of vascularized tissue, whereas DM yields broader but still meaningful attention patterns focused on parenchymal structures. This alignment between modalities indicates that the models learn semantically relevant visual cues regardless of contrast enhancement. These consistent attribution patterns across modalities support the integration of SHAP-based interpretability as a quality-control step in clinical AI pipelines, ensuring that model predictions remain anatomically and pathophysiologically meaningful.
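One way to quantify the cross-modality agreement described here, assuming spatially aligned attribution maps, is to threshold each SHAP map at a high quantile and measure overlap. The Dice coefficient and 90th-percentile cutoff in the sketch below are illustrative assumptions, not values drawn from the study.

```python
# Minimal sketch: quantify DM-CESM agreement by thresholding each SHAP map at a
# high quantile and computing a Dice overlap. The 90th-percentile cutoff is an
# illustrative assumption, and the maps are assumed to be spatially aligned.
import numpy as np

def dice_overlap(map_dm: np.ndarray, map_cesm: np.ndarray, q: float = 0.90) -> float:
    # Keep only the strongest attributions (by magnitude) in each map.
    a = np.abs(map_dm) >= np.quantile(np.abs(map_dm), q)
    b = np.abs(map_cesm) >= np.quantile(np.abs(map_cesm), q)
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / max(a.sum() + b.sum(), 1)

# Usage: dice_overlap(shap_dm, shap_cesm) near 1 indicates that both modalities
# attribute the prediction to the same image regions.
```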
Benign lesion classification remains the most challenging task across all models and modalities. SHAP maps reveal dispersed and inconsistent attributions, suggesting model uncertainty in differentiating benign from malignant or normal tissue patterns. In CESM, mild enhancement occasionally leads to overestimation of malignancy likelihood, whereas in DM, low-contrast lesions are often underrepresented. These observations indicate that even state-of-the-art CNNs struggle to capture the nuanced imaging characteristics of benign lesions. Integrating radiologist-annotated regions of interest, fine-tuning with domain-specific loss functions, or adopting attention-based or transformer architectures could help refine decision boundaries and improve clinical reliability in this category.
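As one hedged example of the domain-specific loss functions mentioned above, focal loss down-weights confidently classified examples so that training concentrates on ambiguous benign cases; the gamma and alpha values below are conventional defaults, not tuned choices from this work.

```python
# Sketch of focal loss as a possible domain-specific training objective.
# gamma and alpha are conventional defaults, not values tuned in this study.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability the model assigns to the true class
    # Easy examples (p_t near 1) are down-weighted by (1 - p_t) ** gamma.
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```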
The consistent spatial overlap observed between DM and CESM SHAP maps is an encouraging finding for clinical adoption. It implies that, despite quantitative performance differences, the networks base their predictions on biologically plausible and clinically relevant features. This spatial agreement across modalities strengthens trust in AI-assisted systems, supporting their use as complementary diagnostic tools rather than opaque classifiers. Furthermore, the convergence of model attention across architectures suggests that learned representations are not arbitrary but anchored to consistent anatomical cues.
A brief SWOT perspective further contextualizes these findings. The main strengths of this work include the harmonized DM–CESM preprocessing pipeline and the use of SHAP for transparent interpretability across models and modalities. A key weakness is the reliance on a single public dataset, which may limit generalizability. Opportunities arise from extending the framework to multiclass analyses, incorporating larger multi-institutional datasets, and integrating transformer-based architectures. Potential threats include variability in CESM acquisition protocols and the risk that modality-specific artifacts may influence model predictions if not carefully controlled.
From a clinical perspective, the three binary models evaluated in this study could be integrated into a sequential decision-making workflow. A Normal vs. Malignant classifier may function as an initial triage tool, identifying cases that require further diagnostic evaluation. Images classified as benign or suspicious could then be analyzed using the Benign vs. Malignant model to refine the decision and reduce unnecessary biopsies. This cascaded approach allows each model to operate within the diagnostic boundaries for which it is most reliable. Although a full multi-stage deployment system is beyond the scope of the present study, preliminary evaluation of such a combined workflow suggests that overall diagnostic accuracy would depend on the cumulative performance of each model in its respective decision step.
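The cascaded workflow can be expressed compactly as follows; the predict_proba interface and the 0.5 decision thresholds are hypothetical placeholders for the trained binary models and the operating points a deployment would actually use.

```python
# Sketch of the sequential triage workflow described above: a Normal-vs-Malignant
# model screens first, and a Benign-vs-Malignant model refines non-normal cases.
# Model interfaces and thresholds are hypothetical placeholders.
def cascade_decision(image, triage_model, refine_model,
                     t1: float = 0.5, t2: float = 0.5) -> str:
    p_malignant = triage_model.predict_proba(image)   # hypothetical interface
    if p_malignant < t1:
        return "normal: routine screening"
    p_refined = refine_model.predict_proba(image)     # hypothetical interface
    return "malignant: refer for workup" if p_refined >= t2 else "benign: follow-up"
```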
Despite the encouraging results, several limitations should be acknowledged. The dataset size, although balanced across classes, restricts generalizability and may amplify model variance, particularly in the benign category. The use of JPEG-compressed images instead of raw DICOM data may also reduce preservation of fine-grained texture features critical for subtle lesion discrimination. Moreover, the evaluation focuses on three architectures; expanding future analyses to include transformer-based or hybrid CNN–Transformer models could offer a more comprehensive understanding of feature abstraction in breast imaging. Finally, the SHAP analysis, while effective for qualitative interpretation, remains computationally expensive and may not fully capture complex nonlinear dependencies across feature hierarchies.
Collectively, the results demonstrate that CESM enhances classification performance through functional contrast information, whereas DM retains strong interpretability and clinical relevance. EfficientNet-B0 emerges as the most effective and explainable architecture, providing stable and anatomically coherent attributions across modalities. These insights reinforce the potential of explainable deep learning frameworks to bridge diagnostic accuracy with transparency, ultimately supporting equitable and reliable AI integration into breast cancer screening and diagnostic workflows. Future extensions of this framework could include multi-center datasets and multimodal fusion approaches, enabling a more holistic understanding of lesion behavior across imaging modalities and populations.
5. Conclusions
This study presents a comprehensive comparison of CNN-based models for breast lesion classification using DM and CESM. The findings confirm that CESM enhances diagnostic performance across most evaluation metrics due to the addition of functional contrast information. However, DM maintains strong predictive capability and interpretability, particularly in the Normal vs. Malignant task, where it achieves a higher AUC than CESM. This result underscores the continued clinical relevance of DM, which remains the most accessible imaging modality for population-level screening, especially in resource-limited settings.
The analysis of SHAP attribution maps further reinforces these conclusions by revealing that both modalities focus on comparable anatomical regions associated with lesion presence. CESM produces more spatially concentrated attributions, whereas DM generates broader yet anatomically consistent activation patterns. These findings demonstrate that even without contrast enhancement, DM-based deep learning models capture meaningful diagnostic cues that remain clinically coherent and interpretable.
EfficientNet-B0 emerges as the most effective architecture across both modalities, showing stable performance and anatomically coherent SHAP distributions. Nevertheless, certain tasks, particularly Normal vs. Benign in DM, remain challenging due to limited discriminative information and stochastic learning effects.
Overall, this study highlights the potential of explainable AI to enhance breast cancer detection by combining quantitative performance with clinical transparency. The results support the continued optimization and deployment of DM-based models as equitable, accessible, and explainable tools within AI-assisted breast imaging workflows.