Explainable Artificial Intelligence (XAI) for Cancer Classification in Medical Imaging: A Systematic Review

Ghauth, Khairil Imran; Kustiawan, Yanche Ari

doi:10.3390/make8050134

Open Access Systematic Review

by Khairil Imran Ghauth * and Yanche Ari Kustiawan

Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63000, Malaysia

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(5), 134; https://doi.org/10.3390/make8050134

Submission received: 5 April 2026 / Revised: 2 May 2026 / Accepted: 18 May 2026 / Published: 20 May 2026

(This article belongs to the Special Issue Clinically Robust and Transparent AI-Assisted Medical Diagnostics: From Learning Dynamics to Real-World Deployment)

Abstract

Our study examines the growing role of Explainable Artificial Intelligence (XAI) in cancer medical imaging, where transparency and interpretability are essential for trustworthy clinical decision making. Using a PRISMA-guided systematic literature review, 926 records published between 2020 and 2026 were identified from major databases, with 46 studies meeting the inclusion criteria after screening and quality assessment. The review systematically analyzes XAI techniques, model architectures, evaluation approaches, interpretability mechanisms, challenges, and future research directions. The findings show that gradient-based methods, particularly Grad-CAM, dominate the field due to their ease of integration with convolutional neural networks. At the same time, complementary approaches such as LIME, SHAP, and Integrated Gradients provide additional attribution insights. Evaluation practices remain heterogeneous, with a strong reliance on qualitative visual inspection and limited standardized quantitative frameworks. XAI contributes to interpretability primarily through spatial localization, feature attribution, and clinical decision support; however, challenges persist, including instability in explanations, coarse localization, high computational cost, and limited compatibility with transformer-based models. Overall, while XAI enhances transparency in cancer imaging, its clinical reliability remains constrained by methodological and technical limitations. Future work should focus on standardized evaluation, clinician-centered validation, and the development of robust, multimodal, and architecture-aware explainability frameworks.

Keywords: Explainable Artificial Intelligence (XAI); medical image analysis; cancer detection; cancer classification; deep learning; machine learning; model interpretability; medical imaging; systematic review

1. Introduction

Cancer remains one of the leading causes of mortality worldwide in the 21st century, accounting for nearly one in six deaths (16.8%) globally. Recent global estimates indicate that about 20 million new cancer cases and approximately 9.7 million cancer-related deaths occurred in 2022 [1]. The global cancer burden is expected to rise significantly, with new cases projected to reach about 35 million by 2050, representing a 77% increase compared to 2022. This growth will likely place greater pressure on countries with limited healthcare resources and may further widen existing disparities in cancer outcomes [2].

Cancer is commonly identified using medical imaging, which enables clinicians to observe internal body structures without invasive procedures. Common imaging techniques such as X-ray, computed tomography (CT), magnetic resonance imaging (MRI), ultrasound, positron emission tomography (PET), and mammography help detect suspicious lesions by revealing differences in tissue density, structure, blood flow, or metabolic activity compared with normal surrounding tissues [3,4]. Radiologists analyze these images by evaluating the lesion’s size, shape, borders, internal features, and potential spread, which often helps determine whether additional tests, such as a biopsy, are needed to confirm the diagnosis.

Artificial intelligence (AI), particularly machine learning (ML) and deep learning (DL), has emerged as a powerful technology in healthcare, providing strong capabilities for predictive modeling and clinical decision support [5,6,7]. Machine learning enables computers to learn patterns from data and improve performance over time, while deep learning uses neural network architectures inspired by the human brain to analyze complex, high-dimensional datasets [8]. In healthcare, especially in oncology, these technologies are widely used to analyze medical data and images to support disease detection, diagnosis, and clinical decision making.

Despite the adoption of AI in healthcare, concerns remain regarding the interpretability of these systems. Many AI models function as “black boxes,” making it difficult for users to understand how decisions are generated [9]. This issue is particularly evident in complex machine learning models, especially deep learning architectures, which often contain millions of parameters arranged across multiple layers, making their decision-making processes difficult to interpret or explain [10]. In medical contexts, transparency and interpretability are critical, as medical practitioners and patients need clear explanations of how decisions are made to trust and effectively use AI-assisted outcomes [11,12].

According to a recent study by Sun et al. [13], several Explainable Artificial Intelligence (XAI) techniques have been developed and applied in the medical field to improve the transparency of AI systems. In the context of cancer medical imaging, these approaches help reveal how AI models interpret medical images and support clinical decision-making by providing interpretable explanations of model predictions.

Most existing review studies focus primarily on the application of XAI methods to specific cancer types. To the best of our knowledge, there is still a lack of studies that examine the use of XAI across different cancer types in medical imaging. Furthermore, current reviews provide limited discussion on the evaluation approaches used to assess the interpretability of XAI. In addition, existing reviews do not capture recent developments in XAI techniques for improving interpretability in cancer medical imaging systems.

The primary objective of this systematic review is to examine the current landscape of XAI techniques applied in cancer medical imaging. The review specifically focuses on identifying commonly used XAI techniques, the model architectures integrated with XAI, the evaluation methods used to assess interpretability, and the limitations of XAI in clinical contexts.

Our review study follows a PRISMA-guided systematic literature review methodology to identify, screen, and analyze relevant studies published between 2020 and 2026. Through a structured qualitative synthesis of the literature, the review aims to provide an understanding of how XAI has been applied in cancer medical imaging and potential research directions for improving interpretability and clinical adoption. The main contributions of this study are as follows:

We employed a systematic literature review methodology, with predefined search strategies, inclusion criteria, and screening procedures, to ensure methodological rigor and transparency.
We reviewed and synthesized existing studies that integrate XAI techniques with classical machine learning and deep learning models for cancer detection and classification using medical imaging.
We examined the evaluation approaches used to assess the interpretability of XAI and analyzed their role in improving the transparency of cancer imaging models.
We identified current challenges in applying XAI for cancer medical imaging and highlighted future research opportunities to enhance trustworthy AI-assisted diagnosis.

The remainder of this paper is structured as follows. Section 2 presents the related work. Section 3 outlines the systematic literature review (SLR) methodology, including the search strategy and study selection process. Section 4 reports and categorizes the findings of the review. Section 5 discusses and interprets the results. Section 6 describes the limitations of the study. Finally, Section 7 concludes the paper and highlights directions for future research.

2. Related Work

In 2021, Gulum et al. [14] presented a review of explainable deep learning approaches for cancer detection in medical imaging, summarizing how deep learning models combined with explainability techniques can assist clinicians in interpreting diagnostic results across various imaging modalities. The study highlights the growing use of visualization-based XAI methods to improve the transparency of deep learning models and support clinical decision-making in cancer diagnosis. However, because the review primarily focuses on explainable deep learning methods, it does not provide a broader synthesis of different explainable AI techniques for DL and ML. In addition, the review provides insufficient discussion on the evaluation approaches used to assess the interpretability and clinical usefulness of model explanations.

Hauser et al. [15] conducted a review examining the application of XAI in skin cancer recognition using medical imaging. Their study reviewed research articles that applied deep learning models together with XAI to improve the interpretability of automated skin lesion classification systems. The findings indicate that most studies rely on convolutional neural networks (CNNs) combined with visualization-based XAI methods, such as saliency maps and Grad-CAM, to highlight image regions that influence model predictions. These techniques help clinicians better understand how AI systems detect malignant skin lesions and support trust in automated diagnosis. However, the review also identifies several gaps in the literature, including limited clinical validation of XAI explanations, a lack of standardized evaluation methods for assessing explanation quality, and insufficient comparisons across XAI techniques.

Gurmessa and Jimma [16] examined explainable machine learning approaches for breast cancer diagnosis using mammography and ultrasound images. The review summarizes how machine learning and deep learning models are integrated with XAI techniques to improve the interpretability of breast cancer detection systems. Their findings indicate that CNNs combined with visualization-based explanation methods, such as Grad-CAM, LIME, and saliency maps, are commonly used to highlight important regions in mammography and ultrasound images that influence model predictions. These approaches help improve the transparency of AI models and support clinicians in understanding diagnostic outcomes. However, their study primarily focuses on breast cancer imaging and specific modalities, limiting its scope across other cancer types and imaging contexts.

Wyatt et al. [17] conducted a review focusing on the application of XAI in oncological ultrasound image analysis. The study reviewed existing research that integrates deep learning models with explainability techniques to improve the interpretability of ultrasound-based cancer diagnosis. Their findings indicate that most studies rely on CNNs combined with visualization-based XAI methods, such as Grad-CAM, saliency maps, and attention-based explanations, to highlight clinically relevant regions in ultrasound images. These approaches help clinicians understand how AI models identify suspicious lesions and support more transparent diagnostic decisions. However, the review is limited to oncological ultrasound imaging, which restricts the applicability of its findings to other imaging modalities. In addition, the study highlights gaps, including limited clinical validation, a lack of standardized evaluation approaches for explanation quality, and the need for broader investigations of XAI applications across diverse cancer imaging domains.

Karthiga et al. [18] presented a review of artificial intelligence and XAI approaches for breast cancer diagnosis using multiple imaging modalities, including mammography, ultrasound, and MRI. The study summarizes how machine learning and deep learning models are applied to breast cancer detection and how XAI techniques are incorporated to improve model interpretability. The findings show that CNNs combined with explanation methods such as Grad-CAM, LIME, and saliency maps are widely used to highlight suspicious regions in breast images and assist clinicians in understanding model predictions. The review also discusses the benefits of integrating XAI to increase transparency and trust in AI-assisted diagnosis. However, the study mainly focuses on breast cancer imaging and specific modalities, which limits its generalizability to other cancer types and imaging contexts.

Although existing reviews have explored XAI in cancer medical imaging, most are limited in scope, focusing on specific cancer types, imaging modalities, or particular model categories. In contrast, this study provides a comprehensive and cross-domain synthesis of XAI techniques across multiple cancer types and imaging modalities, enabling a broader understanding of current practices. More importantly, unlike prior reviews that primarily describe methods, this study systematically analyzes evaluation approaches, identifies key methodological limitations, and proposes a standardized multi-dimensional evaluation framework for XAI. This contribution addresses a critical gap in the literature, namely, the lack of consistent evaluation, which has limited the comparability and clinical applicability of existing XAI methods. Therefore, this work extends beyond descriptive synthesis by offering both critical analysis and a structured methodological contribution to guide future research. To further clarify the contribution of this study, Table 1 compares existing review works with the proposed study in terms of scope, evaluation analysis, and methodological contribution.

3. Methodology

This SLR was conducted to synthesize existing studies on the application of XAI in cancer medical imaging. The review followed the PRISMA guidelines [19] to ensure methodological rigor, transparency, and a structured screening process. The study selection procedure is summarized using a PRISMA flow diagram [20], presented in Figure 1.

This systematic review was structured with six key research questions (RQs) that define the scope and direction of the study:

RQ1—What XAI techniques are used in cancer medical imaging?
RQ2—How is XAI evaluated in cancer medical imaging?
RQ3—How does XAI enhance the interpretability of cancer medical imaging models?
RQ4—What model architectures are used with XAI in cancer imaging?
RQ5—What challenges are reported in XAI for cancer medical imaging?
RQ6—What future directions are proposed for XAI in cancer medical imaging?

A predefined review protocol was registered on the Open Science Framework (https://doi.org/10.17605/OSF.IO/8VTB5). The review process consisted of five main stages: search strategy, study selection, included studies, data extraction, and qualitative synthesis.

3.1. Search Strategy

First, a search was conducted across several electronic databases, including the ACM Digital Library, IEEE Xplore, MDPI, and Scopus. These databases were chosen for their strong coverage of computer science research and for providing high-quality, peer-reviewed studies on machine learning, deep learning, and explainable AI, making them suitable for capturing the technical and methodological aspects of XAI in cancer medical imaging. The search string used in this study is presented in Table 2. To ensure that the review reflects recent advancements in the field, the search was limited to studies published within the last six years, from early January 2020 to the end of February 2026, focusing on recent developments of XAI in cancer medical imaging.

3.2. Study Selection

A total of 926 records were initially retrieved and exported in Research Information Systems (RIS) format, then imported into Rayyan [21,22], a tool designed to support the screening workflow in systematic reviews. The process began with deduplication, during which duplicate records (n = 106) were removed, leaving unique studies (n = 820) for further screening.

Next, title and abstract screening was conducted according to the predefined inclusion and exclusion criteria. At this stage, records that were clearly outside the scope of the study (n = 663) were excluded, including studies unrelated to cancer detection or classification, those that did not employ Explainable Artificial Intelligence (XAI), and those that were not primary research articles.

Subsequently, the full texts of potentially relevant articles (n = 157) were sought for further assessment. However, some studies (n = 20) could not be retrieved due to access limitations or unavailable full texts. The remaining retrieved articles (n = 137) were then assessed for eligibility through full-text review. During this stage, additional studies (n = 91) were excluded for documented reasons, such as not implementing XAI techniques or failing to evaluate XAI explanations.

Finally, the screening process resulted in the inclusion of 46 studies. Any disagreements between reviewers at each stage of the screening process were resolved through discussion until mutual agreement was reached.
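The record counts reported in this subsection can be cross-checked with simple arithmetic. The following is a minimal sketch of that check; the function name `prisma_flow` and the dictionary keys are illustrative choices for this example and do not come from the review protocol or from Rayyan.

```python
# Counts as reported in Section 3.2 of this review.
IDENTIFIED = 926
DUPLICATES_REMOVED = 106
EXCLUDED_TITLE_ABSTRACT = 663
NOT_RETRIEVED = 20
EXCLUDED_FULL_TEXT = 91


def prisma_flow(identified, duplicates, excluded_screening,
                not_retrieved, excluded_full_text):
    """Return the number of records remaining after each screening stage."""
    screened = identified - duplicates            # after deduplication
    sought = screened - excluded_screening        # after title/abstract screening
    assessed = sought - not_retrieved             # full texts actually obtained
    included = assessed - excluded_full_text      # after full-text eligibility review
    return {
        "screened": screened,
        "sought_for_retrieval": sought,
        "assessed_for_eligibility": assessed,
        "included": included,
    }


flow = prisma_flow(IDENTIFIED, DUPLICATES_REMOVED, EXCLUDED_TITLE_ABSTRACT,
                   NOT_RETRIEVED, EXCLUDED_FULL_TEXT)
for stage, count in flow.items():
    print(f"{stage}: {count}")
```

Each intermediate value matches the figure stated in the text (820, 157, 137, and 46), so the reported flow is internally consistent.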

3.3. Included Studies

The final stage of the selection process resulted in the inclusion of 46 research articles, as illustrated in Figure 1. These studies represent the body of literature examining the application of XAI in cancer medical imaging. All eligible studies were systematically analyzed to ensure a comprehensive synthesis of the available evidence in this domain. Key information from the included studies, including source databases, publication years, and references, is summarized in Table 3, providing a transparent overview of the study selection outcomes.

3.4. Quality Assessments

To ensure the methodological rigor and reliability of the included studies, a quality assessment (QA) procedure was conducted following the guidelines proposed by Kitchenham and Charters, one of the QA instruments recommended by Yang et al. [69]. These guidelines recommend evaluating the methodological soundness of primary studies to minimize bias and to support reliable evidence synthesis in SLRs.

A structured QA checklist consisting of eight criteria was developed and adapted to the context of XAI in cancer medical imaging. Table 4 presents the quality assessment checklist used in this study. The checklist evaluated whether the included studies clearly described their research objectives, XAI techniques, model architectures, datasets, explainability evaluation methods, interpretability contributions, reported challenges, and proposed future research directions.

Each criterion was scored using a three-point scale: 1 (Yes) if the criterion was fully satisfied, 0.5 (Partially) if the information was incomplete, and 0 (No) if the criterion was not addressed. The maximum possible score for each study was 8 points. Based on the total score, studies were categorized into high quality (6–8 points), moderate quality (3.5–5.5 points), and low quality (<3 points).
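The scoring rule described above can be sketched as a small helper. This is an illustration only: the function names and the example ratings are hypothetical, and the stated bands leave a total of exactly 3.0 unassigned, so the sketch makes the labeled assumption that such a score falls in the low band.

```python
ALLOWED_RATINGS = (0, 0.5, 1)   # No, Partially, Yes
NUM_CRITERIA = 8


def total_score(ratings):
    """Sum the eight per-criterion ratings for one study."""
    if len(ratings) != NUM_CRITERIA:
        raise ValueError("expected exactly eight criterion ratings")
    if any(r not in ALLOWED_RATINGS for r in ratings):
        raise ValueError("each rating must be 0, 0.5, or 1")
    return float(sum(ratings))


def quality_band(score):
    """Map a total score to the quality band described in the text."""
    if score >= 6.0:
        return "high"
    if score >= 3.5:
        return "moderate"
    # Assumption: the text defines low quality as <3 points and moderate as
    # starting at 3.5, so a score of exactly 3.0 is unspecified; it is
    # treated as low here.
    return "low"


# Hypothetical ratings for a single study (not taken from Table 5).
example_ratings = [1, 1, 1, 0.5, 1, 0.5, 1, 1]
example_total = total_score(example_ratings)
print(example_total, quality_band(example_total))
```

With half-point ratings, totals move in steps of 0.5, which is why the band boundaries at 3.5 and 6.0 are sufficient to separate moderate from high quality.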

The quality assessment was conducted independently by the first reviewer and subsequently verified by the second reviewer to ensure consistency and minimize potential bias. The QA results, as shown in Table 5, indicate that among the 46 studies reviewed, 36 (78.3%) were categorized as high quality, with scores ranging from 6.0 to 8.0, while 10 (21.7%) were rated as moderate quality, with scores ranging from 3.5 to 5.5. No studies were classified as low quality (scores below 3.0). The average quality score across all studies was 6.73 out of a maximum possible score of 8.0.

3.5. Data Extraction and Synthesis

A structured data extraction process was used to collect relevant information from the 46 included studies systematically. The extracted data covered several aspects, including the type of XAI technique used, the evaluation methods applied to assess explanations, the machine learning or deep learning models integrated with XAI, the mechanisms used to enhance interpretability, and the reported challenges and future research directions identified in each study.

The data extraction was conducted independently by the first author and subsequently reviewed by the second author to ensure accuracy and consistency and to reduce potential bias during the review process. The synthesis of the extracted information was performed using descriptive analysis and thematic categorization. Frequency analysis was used to identify commonly used XAI techniques, evaluation approaches, and model architectures, while thematic analysis was used to group findings into interpretability improvements, existing challenges, and emerging research directions.

4. Results

This section presents the synthesized findings from the selected studies, organized according to the predefined research questions. The results provide a structured synthesis of current evidence of XAI techniques, evaluation approaches, interpretability mechanisms, model architectures, reported challenges, and future research directions, highlighting key insights identified across the literature.

4.1. RQ1-What XAI Techniques Are Used in Cancer Medical Imaging?

Across the 46 studies, 18 distinct XAI techniques were identified and grouped into seven methodological families: gradient-based class activation mapping methods, perturbation-based model-agnostic methods, additive feature attribution methods, gradient attribution methods, backpropagation-based relevance methods, attention-based methods, and expert-guided clinical explainability. Table 6 presents the complete frequency distribution of the identified techniques.

Gradient-based class activation mapping methods were the most frequently reported family. Within this group, Grad-CAM was the most common technique, appearing in 34 of the 46 studies. This suggests that Grad-CAM is preferred because it is simple to integrate with different models and produces intuitive heatmap visualizations.
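To make the simplicity of Grad-CAM concrete, the sketch below implements its core computation in NumPy: each feature map is weighted by the spatial mean of the class-score gradient, the weighted maps are summed, and a ReLU keeps only positive evidence. It assumes the activations and gradients of a chosen convolutional layer have already been extracted from a model. The toy arrays are made up for illustration, and the final scaling to [0, 1] is a common visualization convention rather than part of the method's definition. Upsampling the map to the input resolution is omitted.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Core Grad-CAM computation for one image and one target class.

    activations: feature maps of a conv layer, shape (K, H, W)
    gradients:   d(class score) / d(activations), same shape
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    if activations.shape != gradients.shape:
        raise ValueError("activations and gradients must share a shape")
    # Channel weights: global average of the gradients over space.
    weights = gradients.mean(axis=(1, 2))                 # shape (K,)
    # Weighted sum of feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=1)      # shape (H, W)
    # ReLU: keep only regions with positive influence on the class.
    cam = np.maximum(cam, 0.0)
    # Assumption: scale to [0, 1] for display.
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example with two channels on a 4x4 grid (illustrative values only).
toy_activations = np.zeros((2, 4, 4))
toy_activations[0, :2, :2] = 1.0    # channel 0 fires in the top-left corner
toy_activations[1, 2:, 2:] = 1.0    # channel 1 fires in the bottom-right corner
toy_gradients = np.zeros((2, 4, 4))
toy_gradients[0] = 0.5              # channel 0 supports the target class
toy_gradients[1] = -0.5             # channel 1 counts against it

heatmap = grad_cam(toy_activations, toy_gradients)
print(heatmap)
```

In this toy case only the top-left block is highlighted, because the channel with a negative average gradient is suppressed by the ReLU.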

Other CAM-based variants included Grad-CAM++ in 12 studies, ScoreCAM in 6 studies, LayerCAM in 2 studies, and EigenCAM in 2 studies. Less frequently reported CAM variants were EnsembleCAM, AblationCAM, XGradCAM, and Counter-CAM, each identified in 1 study. These variants were developed specifically to address known weaknesses in the original Grad-CAM; however, their low level of adoption suggests that the emerging techniques are not yet widely validated or adopted in practice.

Among the non-CAM methods, LIME was the second most frequently used technique overall, reported in 15 studies. This suggests that LIME is widely adopted because it generates superpixel-based local explanations that are visually interpretable to non-technical users. SHAP was identified in 10 studies, likely due to its ability to support multimodal interpretation by integrating imaging features with genomic or clinical variables within a unified explanatory framework.

Within the gradient attribution category, Integrated Gradients appeared in 6 studies, Guided Backpropagation in 2 studies, SmoothGrad in 1 study, and DeepLIFT/PDA in 1 study. Layer-wise Relevance Propagation (LRP) was reported in 2 studies. Attention Rollout appeared in 1 study, Saliency Maps in 2 studies, and ClinicXAI in 1 study. The limited use of these techniques suggests that they are still considered complementary rather than primary tools for interpretability. Because several studies applied multiple XAI methods, the summed frequencies exceed the total number of included studies.
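As an illustration of the gradient attribution family, the sketch below approximates Integrated Gradients for a toy function with a known gradient. It is a minimal example under stated assumptions: the function `f`, its gradient `grad_f`, the all-zero baseline, and the number of steps are chosen for this example and do not reflect the setup of any reviewed study. In practice the gradient would come from automatic differentiation of a trained network.

```python
import numpy as np


def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate Integrated Gradients along a straight path.

    grad_fn:  function returning the gradient of the model output w.r.t. input
    x:        input to explain
    baseline: reference input (assumed all zeros when not given)
    """
    x = np.asarray(x, dtype=float)
    if baseline is None:
        baseline = np.zeros_like(x)
    baseline = np.asarray(baseline, dtype=float)
    # Midpoint Riemann sum over interpolation coefficients in (0, 1).
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = [grad_fn(baseline + a * (x - baseline)) for a in alphas]
    avg_grad = np.mean(path_grads, axis=0)
    return (x - baseline) * avg_grad


# Toy differentiable "model": f(x) = sum(x_i^2), with gradient 2x.
def f(x):
    return float(np.sum(np.asarray(x, dtype=float) ** 2))


def grad_f(x):
    return 2.0 * np.asarray(x, dtype=float)


x = np.array([1.0, -2.0, 3.0])
attributions = integrated_gradients(grad_f, x)
print(attributions, attributions.sum(), f(x) - f(np.zeros_like(x)))
```

The attributions sum to the difference between the output at the input and at the baseline, which is the completeness property that distinguishes Integrated Gradients from plain saliency maps.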

In addition, Table 7 shows the distribution of XAI methods across cancer types. Lung cancer was the most represented cancer type with 11 studies, followed by breast cancer with 9 studies, multi-cancer/general imaging with 8 studies, skin/melanoma with 7 studies, and cervical cancer with 6 studies. Colorectal/polyp was represented by 2 studies, while ovarian, bone, and pancreatic cancers were each represented by 1 study. Across these subgroups, Grad-CAM was the primary technique in lung, breast, multi-cancer/general imaging, colorectal/polyp, and bone cancer studies. In skin/melanoma studies, Grad-CAM and Grad-CAM++ were the predominant techniques, while the Grad-CAM family dominated cervical cancer studies. Ovarian cancer studies reported LIME, Integrated Gradients, and SHAP, whereas pancreatic cancer studies reported Grad-CAM and SHAP.

4.2. RQ2—How Is XAI Evaluated in Cancer Medical Imaging?

Across the reviewed studies, three primary evaluation paradigms were identified for assessing XAI methods in cancer medical imaging: qualitative visual assessment, quantitative metric-based evaluation, and expert-validated clinical assessment. These approaches are not mutually exclusive, as several studies employ combinations of methods to evaluate different aspects of explanation quality.

A quantitative overview reveals a clear imbalance in evaluation practices. As shown in Figure 2, qualitative visual assessment remains the dominant approach, appearing in 27 out of 46 studies (58.7%) [23,25,29,32,33,35,36,40,41,43,50,52,58,62,63,65]. In contrast, expert-validated evaluation is reported in only 10 studies (21.7%) [24,28,39,44,51,54,58,64,67,68], while hybrid approaches combining qualitative and quantitative methods are used in 8 studies (17.4%) [26,27,30,33,37,59,61,64]. Purely quantitative evaluation is observed in only 5 studies (10.9%) [28,31,38,39,60]. This distribution highlights a strong reliance on subjective evaluation, with limited adoption of objective, reproducible, and standardized benchmarking methods.

Qualitative visual assessment is typically performed by inspecting heatmaps, saliency maps, or superpixel overlays to determine whether highlighted regions correspond to clinically relevant areas. For example, several studies evaluate Grad-CAM outputs by visually confirming activation over tumor-related regions [29,40,53], while others compare activation patterns between benign and melanoma classes [49]. In addition, SHAP attribution plots are examined in relation to known radiological features in lung CT imaging [63,65], and structured rating schemes have been introduced to compare explanation quality across methods such as LIME, Grad-CAM, and Integrated Gradients [37]. Although qualitative evaluation provides intuitive and clinically interpretable insights, it remains inherently subjective and lacks reproducibility, making it insufficient as a standalone evaluation strategy.

Expert-validated evaluation introduces greater rigor by incorporating clinical expertise into the assessment process. Radiologist-based validation is used to assess whether explanation outputs align with annotated cancer regions or clinically meaningful structures [24,28,39,44,51,54,58,64,67,68]. For instance, some studies report strong agreement between Grad-CAM heatmaps and expert segmentation masks, achieving high pixel-level alignment with ground truth annotations [44]. Other works employ structured scoring frameworks or expert review to evaluate explanation quality across multiple XAI techniques [51,67]. Despite its importance, expert validation remains limited due to its reliance on domain expertise and resource constraints, suggesting that many XAI systems have not yet been evaluated to the level required for robust clinical deployment.

Quantitative evaluation methods provide objective measurement of explanation quality and are essential for benchmarking. Among these, pixel-level ground truth comparison plays a central role, particularly in studies involving segmentation or lesion localization. A total of 8 studies employ pixel-level comparison using annotated masks to directly measure how well explanation maps correspond to clinically defined regions [28,30,38,39,44,59,60,64]. Common metrics used in this context include Intersection over Union (IoU) and Dice Similarity Coefficient (DSC), which quantify spatial overlap between explanation outputs and ground truth annotations. These metrics provide objective evidence of localization accuracy and are critical for validating whether XAI methods can correctly identify tumor regions.
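The two overlap metrics named above can be sketched as follows. The example assumes that a continuous explanation map is first binarized; the threshold of 0.5, the toy heatmap, and the toy mask are hypothetical values chosen for illustration, and the reviewed studies may binarize differently. Returning 1.0 when both masks are empty is also a convention adopted here.

```python
import numpy as np


def binarize(heatmap, threshold=0.5):
    """Turn a continuous explanation map into a binary region."""
    return np.asarray(heatmap) >= threshold


def iou(pred, truth):
    """Intersection over Union between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return float(np.logical_and(pred, truth).sum() / union)


def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return float(2 * np.logical_and(pred, truth).sum() / total)


# Toy 4x4 explanation map and expert mask (illustrative values only).
toy_heatmap = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
toy_mask = np.zeros((4, 4), dtype=bool)
toy_mask[0:2, 1:3] = True

region = binarize(toy_heatmap)
iou_value = iou(region, toy_mask)
dice_value = dice(region, toy_mask)
print(iou_value, dice_value)
```

Here the thresholded region and the mask each cover four pixels and share two, giving an IoU of 1/3 and a DSC of 0.5. The gap between the two values for the same overlap is one reason scores reported with different metrics are not directly comparable.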

Beyond spatial evaluation, additional quantitative metrics capture different aspects of explanation quality. These include Pointing Game Score, Hausdorff distance, entropy-based measures, Fréchet Inception Distance (FID), Counterfactual Validity (CV), and computational efficiency indicators such as runtime [26,27,30,33,37,59,61,64]. Fidelity-based evaluation techniques, including deletion and insertion curves and pixel-flipping methods, are also reported, although less frequently. These methods assess whether explanations truly reflect the model’s decision-making process by measuring the impact of perturbing important features on prediction outcomes.
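A deletion curve, one of the fidelity-based techniques mentioned above, can be sketched in a few lines. The example is a toy under stated assumptions: the "model" is a fixed linear scorer defined for this illustration, deleted pixels are replaced by a baseline of zero, and the number of steps is arbitrary. None of these choices is taken from a reviewed study.

```python
import numpy as np


def deletion_curve(model, image, attribution, steps=8, baseline=0.0):
    """Model score as the highest-ranked pixels are progressively removed."""
    image = np.asarray(image, dtype=float)
    order = np.argsort(np.asarray(attribution).ravel())[::-1]  # most important first
    perturbed = image.ravel().copy()
    scores = [model(perturbed.reshape(image.shape))]
    chunk = int(np.ceil(order.size / steps))
    for start in range(0, order.size, chunk):
        perturbed[order[start:start + chunk]] = baseline
        scores.append(model(perturbed.reshape(image.shape)))
    return np.array(scores)


def area_under_curve(scores):
    """Trapezoidal area with the deleted fraction on the x-axis in [0, 1]."""
    x = np.linspace(0.0, 1.0, len(scores))
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(x)))


# Toy linear "model": a weighted sum of pixel intensities.
WEIGHTS = np.arange(16, dtype=float).reshape(4, 4) / 120.0


def toy_model(img):
    return float(np.sum(WEIGHTS * img))


toy_image = np.ones((4, 4))
faithful = deletion_curve(toy_model, toy_image, attribution=WEIGHTS)
misleading = deletion_curve(toy_model, toy_image, attribution=-WEIGHTS)
print(area_under_curve(faithful), area_under_curve(misleading))
```

An explanation that ranks pixels in the order the model actually relies on them makes the score fall quickly, so the faithful ranking yields a smaller area under the deletion curve than the reversed one.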

A critical comparison of these evaluation approaches reveals that current practices are fragmented and often incomplete. Pixel-level metrics such as IoU and DSC provide strong evidence of spatial alignment but do not capture whether the highlighted regions are causally relevant to the model’s prediction. In contrast, fidelity-based metrics offer deeper insight into model behavior but are less commonly used due to their computational complexity. As a result, many studies emphasize spatial correctness while overlooking explanation faithfulness, leading to partial and potentially misleading evaluation outcomes.

Hybrid evaluation approaches attempt to address these limitations by combining multiple evaluation dimensions. For example, several studies integrate pixel-level metrics such as IoU or DSC with qualitative heatmap inspection [30,59], while others combine entropy-based or fidelity metrics with expert validation [37,64]. These approaches provide a more comprehensive assessment by simultaneously capturing spatial accuracy, feature relevance, and clinical interpretability. However, such methods remain relatively limited and are not standardized across studies.

Despite the use of diverse evaluation strategies, benchmarking across studies remains highly inconsistent. Variations in datasets, annotation quality, evaluation protocols, and selected metrics make direct comparison between XAI methods difficult. For example, while some studies report high pixel-level alignment scores for Grad-CAM-based explanations [44], others rely solely on qualitative validation without quantitative evidence [33,46]. Furthermore, only a small number of studies evaluate multiple XAI methods under identical experimental conditions [47,66], limiting the ability to draw reliable conclusions regarding comparative performance. This lack of standardized benchmarking prevents meaningful comparison across studies and reduces the generalizability of reported results.

4.3. RQ3—How Does XAI Enhance the Interpretability of Cancer Medical Imaging Models?

XAI was reported to enhance interpretability through eight functional mechanisms: spatial localization via visual heatmaps, clinical trust and transparency, clinical decision support and workflow integration, feature attribution and contribution quantification, bias and error detection, pixel-level segmentation and sub-cellular localization, counterfactual reasoning, and multi-modal interpretability. Table 8 summarizes the distribution of these themes across different studies.

Spatial localization via visual heatmaps was the most frequently reported interpretability mechanism, appearing in 35 studies. This indicates that heatmap-based localization is the dominant approach for interpreting model predictions, with CAM-based methods serving as the primary tools for identifying clinically relevant regions. Across the included studies, Grad-CAM and Grad-CAM++ were used to overlay class-discriminative heatmaps onto diagnostic images, highlighting tumor-associated and decision-relevant areas. For example, in a breast cancer ultrasound case study [40], Grad-CAM heatmaps were applied to interpret model predictions by visually emphasizing diagnostically relevant regions, enabling clinicians to better understand the basis of the model’s decisions. These heatmaps consistently focused on biologically meaningful structures, such as atypical glandular cells in lung adenocarcinoma and malignant epithelial regions in breast carcinoma, thereby strengthening clinical trust through transparent visual explanations.

Our findings also show that visual heatmaps enhance transparency and interpretability by illustrating which image features drive a model’s decision, such as asymmetry, irregular pigmentation, and lesion structures. Methods like Grad-CAM and saliency maps align well with clinical understanding, improving trust, supporting human-in-the-loop workflows, and aiding diagnosis through consistent localization. Enhanced variants such as Grad-CAM++ further improve clarity and reduce noise. However, limitations remain, including coarse and sometimes diffuse localization, limited fine-grained detail, and only partial overlap with expert-annotated ground truth regions.

Clinical decision support and workflow integration were reported in 32 studies, highlighting the increasing role of XAI in real clinical settings. A representative case study in early lung cancer detection [37] demonstrates how explainability enhances clinician confidence and supports workflow integration. Deep learning models such as InceptionV3 and ResNet50 were applied to the classification of chest X-rays and CT scans, and Grad-CAM and LIME were integrated to provide visual and local explanations of model predictions. Grad-CAM-generated heatmaps highlighted suspicious regions, allowing radiologists to visually verify whether the model focused on clinically relevant areas, such as lung nodules, while LIME provided instance-level explanations to support decision-making. This enabled clinicians to validate the model’s reasoning, detect potential biases or misattributions, and maintain expert oversight throughout the diagnostic process.

Clinical trust and transparency were reported in 26 studies. In breast cancer diagnosis from mammographic images [28], explainable AI was integrated to address the opacity of high-performing CNNs and enhance clinician confidence. Techniques such as Grad-CAM, LIME, and SHAP were used to visualize and quantify model reasoning by highlighting decision-relevant regions and feature contributions. These explanations enabled radiologists to verify whether predictions were based on clinically meaningful lesions or spurious patterns, supporting the detection of biases and errors. By providing transparent visual evidence, these methods align well with clinical workflows, support differential diagnosis, and function as diagnostic assistants, particularly in high-throughput settings or for less experienced clinicians. They also reinforce clinically relevant patterns and facilitate human-in-the-loop decision-making. However, saliency-based methods primarily highlight areas of attention without fully explaining the model’s internal reasoning process. Visual outputs may not always align perfectly with expert annotations, indicating limitations in localization precision. In addition, methods like Grad-CAM can produce coarse or diffuse heatmaps, requiring enhanced variants such as Grad-CAM++ to improve clarity. Current approaches also lack interactivity, limiting real-time exploration and feedback in practical clinical use.

Feature attribution and contribution quantification were reported in 12 studies, using methods such as LIME, SHAP, and Integrated Gradients. In a cervical cancer diagnosis case study [42], a LightGBM model was interpreted using SHAP, which quantified feature contributions and identified Mean Intensity, Area, and Equivalent Diameter as the most influential factors. SHAP decision plots further revealed directional effects, with features such as Skewness and Area positively influencing cancer predictions, while Kurtosis and Perimeter negatively influenced cancer predictions. These findings aligned with established clinical indicators, strengthening trust in the model’s reasoning. Across studies, feature attribution methods, including Chi-square and saliency-based techniques such as Grad-CAM, were widely used to interpret model behavior. These approaches effectively identified key features, for example, EquivDiameter and Mean Intensity, and provided complementary insights that improved feature selection and overall model understanding. SHAP offered transparent, quantitative explanations, while Grad-CAM visually highlighted decision-relevant regions that often corresponded to clinically meaningful patterns, such as abnormal cells or tissue structures. This alignment supported clinical validation and enhanced interpretability, although overlap with expert annotations was sometimes partial. Despite these advantages, limitations remain. Saliency-based methods primarily indicate areas of attention without fully explaining the model’s internal reasoning. Grad-CAM may produce coarse or diffused heatmaps with limited localization precision, and results can be affected by preprocessing steps such as image downsampling. In addition, these explanations are generally static and require careful interpretation, as they provide only partial insight into the model’s decision-making process.
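The contribution quantification that SHAP approximates can be illustrated with exact Shapley values on a very small model. The sketch below is a hedged illustration only: the linear scoring function, its weights, the instance, and the all-zero baseline are hypothetical and do not correspond to the features or results of the cited study. Exact enumeration is feasible only for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition of features."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical linear model and instance, for illustration only.
weights = [2.0, -1.0, 0.5]
instance = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]


def value_fn(present):
    # Features outside the coalition are replaced by their baseline value.
    x = [instance[i] if i in present else baseline[i] for i in range(3)]
    return sum(w * v for w, v in zip(weights, x))


phi = shapley_values(value_fn, 3)
print(phi)
```

For a linear model each attribution equals the weight times the feature's deviation from the baseline, so the signs reproduce the kind of directional effect described above, and the attributions sum to the difference between the prediction and the baseline output.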

Bias, error, and spurious feature detection were reported in 6 studies. A representative case study [24] of lung cancer detection from CT scans illustrates this application. Deep learning models such as ResNet50 and InceptionV3 achieved high classification accuracy but sometimes focused on irrelevant regions, suggesting reliance on spurious features rather than true pathological indicators. To address this, Grad-CAM and LIME were applied to generate visual heatmaps and local explanations, revealing which regions influenced the model’s predictions. These methods enabled researchers and radiologists to detect misattributions, such as attention to background noise or non-diagnostic areas, and to identify potential biases and inaccuracies. Through expert validation, clinicians assessed whether the model focused on clinically meaningful features, ensuring that predictions were accurate for the right reasons. More broadly, interpretability methods such as SHAP, LIME, and Grad-CAM provide valuable insights into model behavior by revealing which features drive predictions. These techniques help identify data bias, uncover reliance on irrelevant or sensitive attributes, and validate alignment with domain knowledge. They also support error analysis by explaining misclassifications at both feature and instance levels, while visual methods indicate whether models focus on clinically relevant regions or incorrect areas. However, their effectiveness depends on domain expertise to interpret whether identified patterns represent true bias or error. Most approaches are post hoc, providing explanations after predictions rather than preventing issues, and they may not fully capture complex feature interactions. In addition, since XAI primarily reveals correlations rather than causation, distinguishing spurious features still requires careful expert validation.

Pixel-level segmentation and subcellular localization were reported in 6 studies each. Explanation methods such as Layer-wise Relevance Propagation were used to assign pixel-wise relevance and localize diagnostically important cellular structures, including nucleus and cytoplasm regions, while saliency-based and CAM-based outputs supported fine-grained localization and segmentation. A representative case study in cervical cancer screening demonstrates this approach [61]. To overcome the limitations of manual annotation and black-box deep learning models, a weakly supervised framework combined image classification with XAI-guided segmentation. The model first classified cervical cells as normal or abnormal, and interpretability techniques highlighted regions contributing to the prediction. These insights were then used to guide segmentation, for example via GraphCut, enabling accurate pixel-level delineation without requiring extensive ground-truth masks. This approach enabled effective localization of subcellular features, such as nuclear shape and the nucleus-to-cytoplasm ratio, which are clinically relevant indicators of malignancy. Overall, XAI-driven segmentation reduces annotation burden, improves transparency, and aligns model focus with clinically meaningful features, supporting trust and adoption in clinical workflows. It also enables error detection and iterative model refinement by revealing discrepancies between model attention and expected pathology. However, challenges remain, including variability across XAI methods, potential imprecision compared to fully supervised segmentation, and sensitivity to noise or image quality. In addition, visual interpretations may be subjective, and integrating multiple techniques can introduce computational complexity and consistency issues.

Multi-modal interpretability was identified in 2 studies, both combining Grad-CAM for imaging data with SHAP for clinical or genomic variables within a unified framework. In a multiclass classification study of tomosynthesis breast lesion shapes [23], this approach integrated visual methods, such as Grad-CAM and LIME, with mathematical techniques, such as t-SNE and UMAP. Visual explanations highlighted lesion regions influencing classification, while dimensionality reduction revealed how different lesion types were separated in the model’s feature space. This combination provided both local (image-level) and global (feature-level) insights, enabling clinicians to verify whether the model focused on clinically meaningful regions and learned distinct representations for each class, thereby improving transparency and supporting error analysis. More broadly, multi-modal interpretability, which integrates imaging with genomic or clinical data, offers a more comprehensive and clinically meaningful understanding of disease by capturing both phenotypic and underlying biological patterns. This approach can improve diagnostic performance, support the discovery of relevant biomarkers, and enhance trust by linking predictions to both visual evidence and patient-specific factors. However, it introduces challenges, including complex data integration, high computational requirements, and difficulties in interpreting how different modalities are fused. In addition, issues such as data imbalance, variability across sources, and the need for multidisciplinary expertise can affect scalability and practical implementation in clinical settings.

Finally, counterfactual reasoning was reported in 1 study, represented by Counter-CAM, which combines Grad-CAM with CycleGAN-based counterfactual image generation to illustrate how image modifications affect model predictions. In a breast cancer classification study using infrared images [26], this approach generated minimally altered “counterfactual” images that could change a prediction from cancerous to non-cancerous while preserving high visual similarity. By comparing Grad-CAM heatmaps of the original and counterfactual images, clinicians identified the specific regions responsible for the prediction changes, highlighting the most critical diagnostic features. This method provides intuitive and actionable “what-if” insights, helping users understand not only where the model focuses but also how small changes influence outcomes, thereby improving transparency, trust, and clinical confidence. However, counterfactual reasoning also presents several challenges. Generating realistic and clinically plausible counterfactuals can be computationally intensive and depends on carefully defined similarity constraints. The concept of “minimal change” may vary across metrics, leading to multiple valid counterfactuals and potential ambiguity. In addition, counterfactual explanations are inherently local, providing insight into individual predictions rather than global model behavior. There is also a risk of misinterpretation if users infer causal relationships beyond the model’s learned associations, highlighting the need for careful validation and expert interpretation.

4.4. RQ4—What Model Architectures Are Used with XAI in Cancer Imaging?

Our study reported the underlying ML or DL model used together with XAI. The identified architectures spanned convolutional neural networks (CNNs), Vision Transformers (ViTs), hybrid CNN-ViT pipelines, segmentation networks, classical ML classifiers, and several custom-designed architectures. CNN was the dominant architecture category, accounting for 73.9% of the studies, either as standalone models or primary feature extractors. Most studies adopted transfer learning with ImageNet pre-trained weights, reflecting the common practice of adapting large-scale natural image models to medical imaging tasks.

Table 9 shows the distribution of deep learning model families. Among CNN families, ResNet was the most frequently used architecture, appearing in 15 studies (32.6%) across variants from ResNet-18 to ResNet-152. This indicates that ResNet is the preferred backbone due to its residual learning, which enables effective training of deeper networks and robust feature extraction in complex medical imaging tasks. VGG architectures (VGG-16 and VGG-19) were the second most common (12 studies, 26.1%), particularly in earlier or comparative studies where their sequential structure enabled straightforward integration with Grad-CAM explanations. This shows that VGG models are frequently adopted due to their simple, interpretable architecture, which facilitates the direct application of gradient-based explanation methods.

EfficientNet variants (B0–B7 and V2) were reported in 10 studies (21.7%), often selected for their compound scaling strategy that balances model depth, width, and resolution. For example, EfficientNet-B0 was used as a feature extractor [38], and its representations were classified using XGBoost, SVM, and Decision Tree models, with SHAP used for feature attribution. This finding reveals that EfficientNet is widely used for its efficiency and strong feature representation, making it suitable for hybrid pipelines that combine deep feature extraction with interpretable classical machine learning models.

InceptionV3 appeared in 8 studies, valued for its multi-scale convolutional structure that captures heterogeneous tissue patterns in medical images. DenseNet architectures were identified in 7 studies, particularly in lung cancer imaging, where dense feature reuse supports the detection of subtle CT abnormalities. Lightweight architectures were also observed. MobileNet variants were reported in 5 studies, typically for deployment in resource-constrained environments such as mobile or clinical edge devices. The Xception architecture appeared in 4 studies, including cervical cell classification and melanoma detection. This shows that a diverse range of CNN architectures is adopted to address different imaging requirements, including multi-scale feature extraction, detection of subtle patterns, and efficient deployment in resource-constrained environments.

Vision Transformer (ViT) models were identified in 17.4% of the studies, including three ViT architectures and five hybrid CNN–ViT configurations. ViT variants reported in this study included ViT-B/16, ViT-DINO, DeiT-B/16, Swin Transformer, MobileViT, and LeViT. Standard ViT and DeiT models were applied for oesophageal endoscopic image classification, using Grad-CAM and ScoreCAM for visual explanation [41]. CerviTrans-XAI proposed an ensemble of four ViT variants for cervical cancer classification with LIME explanations [45]. MRAViT-XAI introduced a custom transformer architecture incorporating multi-resolution attention mechanisms for lung cancer CT classification [52]. This finding shows that ViT models are emerging in XAI-based cancer imaging, with adoption extending to both standard and hybrid architectures, indicating their application across diverse classification tasks and their integration with multiple explanation methods.

Hybrid CNN–ViT architectures were implemented in 10.9% of the studies, combining convolutional inductive biases with transformer-based global attention mechanisms. For example, LungXResViT integrated ResNet-152 with a ViT backbone for lung cancer classification [60], while another study benchmarked nine architectures spanning CNN and transformer models for breast ultrasound classification [53]. A hybrid melanoma detection pipeline combined CNN backbones with a Mask-guided Vision Transformer (SM-ViT) [38]. Another notable hybrid design was ETCapsNet [67], which integrated EfficientNetV2-Small, transformer attention blocks, and capsule networks, with multiple XAI techniques including Grad-CAM, SHAP, and Integrated Gradients. This result shows that hybrid CNN–ViT architectures are adopted to leverage both local feature extraction and global contextual modeling, indicating their use in integrating complementary learning capabilities within XAI-based cancer imaging systems.

Custom architectures tailored to specific medical imaging tasks were introduced in 10.9% of the studies. Examples include the CNN-MLP-Multi-Head Attention model proposed for lung nodule classification [63], AECNet for skin cancer detection [49], and PancreoFusion-Net [56], which combined imaging, clinical, and genomic data in a multimodal learning framework. This finding shows that custom architectures are developed to address task-specific requirements, including integrating multimodal data and enhancing model adaptability for specialized cancer imaging applications.

Segmentation models were reported in 4.3% of studies. An Attention-Guided U-Net was used for polyp segmentation in colonoscopy images with Grad-CAM explanations, achieving 92.3 ± 1.5% pixel alignment with expert annotations [44]. Additionally, a U-Net–based imaging module was integrated into a multimodal classification framework [56]. This reveals that segmentation models are selectively used for tasks that require precise pixel-level localization, underscoring their role in aligning XAI outputs with clinically annotated regions.

Classical ML models appeared in 4.3% of studies. LightGBM was used for cervical cell classification with SHAP explanations [42]. In another study, XGBoost, SVM, and Decision Tree classifiers were applied to features extracted from EfficientNet-B0, demonstrating hybrid pipelines that combine deep feature extraction with interpretable classical classifiers [57]. This finding shows that classical ML models are combined with deep feature extractors, underscoring their role in hybrid pipelines that support interpretable prediction through feature-based learning.

4.5. RQ5—What Challenges Are Reported in XAI for Cancer Medical Imaging?

Table 10 presents the challenges reported across the studies. Twelve challenge categories were identified, covering technical, methodological, architectural, and data-related limitations.

Explanation instability is not caused by a single factor but emerges from both method-inherent properties and data-related noise. For instance, perturbation-based methods such as LIME are inherently stochastic due to random sampling, leading to variability in explanation outputs even for identical inputs. This indicates a methodological limitation rather than a data issue. In contrast, instability observed in gradient-based approaches, such as Grad-CAM, is more strongly linked to model sensitivity and data characteristics, particularly gradient noise, activation saturation, and the high variability of medical images, including low contrast, artifacts, and subtle lesion boundaries. Therefore, explanation instability should be understood as a combined effect of algorithmic randomness and the inherently noisy and complex nature of medical imaging data.
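The sampling-induced variability described here can be probed with a simplified stand-in for a perturbation-based explainer. The sketch below is not LIME itself (it omits the locality kernel and the fitted surrogate model); it only estimates importance from random feature masking. The toy model, its coefficients, and all function names are assumptions made for illustration.

```python
import random


def toy_model(x):
    # Hypothetical scoring function: feature 2 is irrelevant by construction.
    return 3.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


def perturbation_importance(model, x, n_samples, seed):
    """Mean output with a feature kept minus mean output with it masked."""
    rng = random.Random(seed)
    n = len(x)
    on_sum, off_sum = [0.0] * n, [0.0] * n
    on_cnt, off_cnt = [0] * n, [0] * n
    for _ in range(n_samples):
        # Keep each feature with probability 0.5, otherwise zero it out.
        mask = [rng.random() < 0.5 for _ in range(n)]
        y = model([xi if keep else 0.0 for xi, keep in zip(x, mask)])
        for i, keep in enumerate(mask):
            if keep:
                on_sum[i] += y
                on_cnt[i] += 1
            else:
                off_sum[i] += y
                off_cnt[i] += 1
    return [on_sum[i] / max(on_cnt[i], 1) - off_sum[i] / max(off_cnt[i], 1)
            for i in range(n)]


x = [1.0, 1.0, 1.0]
# Same input, small sampling budget, two different seeds.
run_a = perturbation_importance(toy_model, x, n_samples=20, seed=0)
run_b = perturbation_importance(toy_model, x, n_samples=20, seed=1)
spread = max(abs(a - b) for a, b in zip(run_a, run_b))
# A much larger budget gives estimates close to the true coefficients.
stable = perturbation_importance(toy_model, x, n_samples=20000, seed=0)
print(run_a, run_b, spread, stable)
```

With a small sampling budget the two runs generally disagree even though the input and model are identical, whereas fixing the seed or enlarging the budget reduces the spread. This is the sense in which the instability is a property of the method rather than of the data.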

The limited spatial precision of explanation maps is primarily attributed to architectural constraints. CAM-based methods rely on low-resolution feature maps generated in deeper convolutional layers, which are then upsampled. This leads to coarse localization that struggles to capture fine-grained pathological structures. However, this limitation is further amplified by data properties, such as small lesion size and low contrast, which make precise localization inherently difficult even for well-trained models.
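A short arithmetic sketch illustrates why upsampled maps are coarse. It assumes a 224 × 224 input and a 7 × 7 final feature map, which are common for ImageNet-style CNN backbones but are not figures reported by the reviewed studies, and it uses nearest-neighbor upsampling as a simplification of the smoother interpolation typically applied.

```python
def cell_size(input_px, feature_map_px):
    """Input pixels covered by one feature-map cell along each axis."""
    return input_px // feature_map_px


def upsample_nearest(cam, factor):
    """Nearest-neighbor upsampling of a 2D map by an integer factor."""
    rows, cols = len(cam), len(cam[0])
    return [[cam[i // factor][j // factor] for j in range(cols * factor)]
            for i in range(rows * factor)]


# Assumed sizes: each cell of a 7 x 7 map spans a 32 x 32 pixel block.
stride = cell_size(224, 7)

# A single active cell in a 2 x 2 map, upsampled by a factor of 3.
coarse = [[1, 0], [0, 0]]
fine = upsample_nearest(coarse, 3)
highlighted = sum(sum(row) for row in fine)
print(stride, highlighted)
```

Under these assumptions, a lesion smaller than 32 × 32 pixels cannot be resolved below the size of one cell: the whole block is highlighted regardless of where inside it the evidence lies, which is consistent with the coarse localization reported above.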

The absence of standardized quantitative evaluation stems from a methodological gap in the field rather than from technical limitations of individual methods. The lack of benchmark datasets with pixel-level annotations and the absence of universally accepted evaluation metrics prevent consistent comparison across studies. This issue reflects the early stage of maturity of XAI evaluation frameworks in medical imaging.

Computational cost is directly linked to method design complexity. Perturbation-based and Shapley value approaches require repeated model inference, making them computationally expensive by design. This challenge is further intensified when applied to high-resolution medical images, where each evaluation involves large input dimensions and deep network architectures.
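The scale of this cost can be sketched by counting model evaluations. The region count and sampling budget below are illustrative assumptions (a 224-pixel image divided into 16-pixel regions, and 1,000 perturbed samples), not values reported by the reviewed studies.

```python
def exact_shapley_evaluations(n_features):
    # Every subset of features is a coalition whose output must be evaluated.
    return 2 ** n_features


def sampled_evaluations(n_samples):
    # Sampling-based explainers run one forward pass per perturbed input.
    return n_samples


side = 224 // 16              # assumed region grid: 14 x 14
n_regions = side * side       # 196 regions treated as features

small_case = exact_shapley_evaluations(10)       # 1024 coalitions
sampling_budget = sampled_evaluations(1000)      # 1000 forward passes
exact_is_infeasible = exact_shapley_evaluations(n_regions) > 10 ** 58
print(n_regions, small_case, sampling_budget, exact_is_infeasible)
```

By comparison, a gradient-based map such as Grad-CAM needs roughly one forward and one backward pass per image. Exact Shapley computation is therefore out of reach for image-sized inputs, which is why practical tools rely on approximations whose cost still grows with the number of samples.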

Failures in specific cancer types or morphological patterns are largely driven by data imbalance and limitations in representation learning. Models trained on imbalanced datasets tend to prioritize dominant classes, resulting in less informative feature representations for rare or atypical lesions. Consequently, explanation methods reflect these learned biases rather than true clinical relevance.

Incompatibility between XAI methods and modern architectures arises from the design assumptions of traditional XAI techniques. Most post hoc methods were developed for convolutional neural networks and rely on spatial feature maps, whereas transformer-based models operate using token-based representations. This fundamental mismatch explains why CAM-based methods often produce diffuse or irrelevant explanations when applied to Vision Transformers. This limitation is driven by how information is represented in each architecture. CNNs preserve spatial locality through hierarchical feature maps, allowing gradient-based methods to highlight important regions with reasonable precision. In contrast, Vision Transformers distribute information across attention layers, capturing global relationships rather than localized features. As a result, mapping model importance back to pixel space becomes less direct, leading to fragmented or less meaningful heatmaps.
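The representational mismatch can be made concrete with a small sketch of how per-token relevance scores from a transformer must be rearranged before they can be shown as a heatmap. It assumes a ViT-B/16-style layout (224 × 224 input, 16 × 16 patches, one leading class token); the helper name `tokens_to_grid` is hypothetical.

```python
def tokens_to_grid(token_scores, image_px, patch_px, has_cls_token=True):
    """Reshape a flat list of per-token scores into a square patch grid."""
    side = image_px // patch_px
    # The class token carries no spatial position, so it is dropped.
    scores = token_scores[1:] if has_cls_token else token_scores
    if len(scores) != side * side:
        raise ValueError("token count does not match the patch grid")
    return [scores[r * side:(r + 1) * side] for r in range(side)]


# 14 x 14 = 196 patch tokens plus one class token.
scores = [0.0] * 197
scores[1 + 3 * 14 + 5] = 1.0   # relevance on the patch at row 3, column 5
grid = tokens_to_grid(scores, image_px=224, patch_px=16)
print(len(grid), len(grid[0]), grid[3][5])
```

The resulting grid has a resolution of one value per patch, and because self-attention mixes information across all tokens in every layer, the score attached to a token does not necessarily describe only the pixels of its own patch. This is one reason heatmaps derived from transformer internals can appear diffuse.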

Dataset limitations and generalization issues primarily stem from data-level constraints rather than methodological flaws. The absence of cross-domain validation reflects reliance on single-source datasets, in which models are trained and evaluated under similar imaging conditions. This leads to poor generalizability when applied to data from different scanners, institutions, or patient populations. Similarly, class imbalance and underrepresented lesion types introduce representation bias, leading models to fail to learn meaningful features for rare classes and causing explanations to reflect dominant patterns rather than clinically relevant features.

The locality of explanations, particularly in LIME, is an inherent methodological limitation. Since LIME approximates the model’s decision boundary locally, it explains individual predictions but does not capture the model’s global behavior. This restricts its usefulness in clinical settings where consistent, generalizable explanations across cases are required.

Challenges in transfer learning arise from a mismatch between learned representations and explanation mechanisms. When models are fine-tuned from pre-trained networks, their internal feature representations may not align well with XAI assumptions, leading to reduced explanation fidelity. This indicates a combined model-level and method-level limitation, where explanation techniques struggle to interpret features not originally learned for the target domain.

Architectural and methodological gaps drive difficulties in multi-modal data adaptation. Most XAI techniques are developed for single-modality inputs, particularly imaging data. When applied to multi-modal systems that integrate imaging with clinical or genomic data, these methods require significant adaptation, as they are not inherently designed to handle heterogeneous feature spaces or cross-modal interactions.

Model confidence miscalibration is primarily a model-level issue that directly affects explanation reliability. Overconfident predictions, even when incorrect, lead XAI methods to assign strong importance to misleading features. This results in explanations that appear convincing but do not reflect true model reasoning, reducing trust in clinical decision support.
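Miscalibration of this kind is commonly quantified with the expected calibration error (ECE). The following is a minimal sketch using equal-width confidence bins; the binning convention, the default of ten bins, and the toy predictions are assumptions for illustration and are not drawn from the reviewed studies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy across bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical overconfident model: 95% confidence, but only half correct.
confidences = [0.95, 0.95, 0.95, 0.95]
correct = [True, True, False, False]
ece = expected_calibration_error(confidences, correct)
print(ece)
```

In this toy case the gap between stated confidence and observed accuracy is approximately 0.45. Reporting such a value alongside explanation maps would indicate how much weight a confident-looking heatmap deserves.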

Finally, the XAI-to-segmentation translation gap reflects a methodological limitation in bridging interpretability and clinical usability. While heatmaps provide visual insights into model attention, they are not directly usable for precise clinical tasks such as lesion segmentation. The lack of robust methods for converting explanation maps into accurate, pixel-level segmentations highlights a disconnect between current XAI outputs and practical diagnostic requirements.

4.6. RQ6—What Future Directions Are Proposed for XAI in Cancer Medical Imaging?

Among the selected studies, thirty studies (65.2%) proposed future research directions for advancing XAI in cancer medical imaging. Across these studies, several recurring themes were identified, including clinical integration, methodological advancement, evaluation standardization, dataset expansion, and the development of hybrid explainability frameworks.

As shown in Figure 3, clinical integration and expert collaboration were the most frequently proposed directions, reported in 14 studies. These studies emphasized the importance of incorporating clinician expertise into the design and validation of XAI systems. Proposed directions included clinician-driven explanation design, expert validation of explanation outputs, interactive explainability tools for real-time analysis, and the development of explainable clinical decision support systems (CDSS) aligned with radiologists’ diagnostic workflows [28,36,38,40,42,50,51,52,53,54,55,56,57,67].

The adoption of advanced or complementary XAI techniques was proposed in thirteen studies. These studies recommended extending beyond the dominant Grad-CAM approach by integrating additional explanation methods such as SHAP, LIME, Layer-wise Relevance Propagation (LRP), ScoreCAM, LayerCAM, and EigenCAM. Several studies also suggested the use of multiple explanation methods simultaneously to provide complementary interpretability perspectives and improve explanation robustness [23,24,30,33,40,42,43,50,53,61,62,64,66].

Seven studies proposed the development of quantitative and standardized evaluation frameworks for explainability. These proposals included the use of spatial overlap metrics such as the Dice score and Intersection-over-Union (IoU), the adoption of faithfulness and stability metrics, and the creation of annotated benchmark datasets specifically designed to evaluate explanation quality. Several studies also recommended incorporating clinician-based evaluation through expert annotation or user studies [27,31,33,46,50,52,55].
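The spatial overlap metrics mentioned here can be sketched as follows: a normalized heatmap is thresholded into a binary mask and compared with an expert annotation. The threshold of 0.5, the function names, and the 3 × 3 toy arrays are illustrative assumptions.

```python
def binarize(heatmap, threshold=0.5):
    """Turn a [0, 1] heatmap into a binary mask."""
    return [[1 if v >= threshold else 0 for v in row] for row in heatmap]


def dice_and_iou(pred, truth):
    """Dice score and Intersection-over-Union for two binary masks."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    pred_sum = sum(sum(row) for row in pred)
    truth_sum = sum(sum(row) for row in truth)
    union = pred_sum + truth_sum - inter
    # Two empty masks are treated as a perfect match.
    dice = 2 * inter / (pred_sum + truth_sum) if pred_sum + truth_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Hypothetical heatmap and expert mask.
heatmap = [[0.9, 0.8, 0.1],
           [0.7, 0.2, 0.0],
           [0.1, 0.0, 0.0]]
truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
dice, iou = dice_and_iou(binarize(heatmap), truth)
print(dice, iou)
```

Here three of the four annotated pixels are covered, giving a Dice score of 6/7 and an IoU of 0.75. Both values depend on the chosen threshold, so a standardized protocol would need to fix it or report results across a range of thresholds.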

Another frequently proposed direction involved real-world clinical deployment and prospective validation, reported in seven studies. These studies suggested validating XAI-based cancer detection models using prospective clinical data, multi-institutional datasets, or real-world healthcare environments to assess explanation reliability and clinical applicability. Some studies also proposed integrating explainable models into clinical decision support systems for practical diagnostic workflows [34,36,38,42,52,55,56].

Dataset expansion and improved generalization were proposed in six studies. These studies recommended constructing multi-institutional datasets that include diverse patient populations, imaging protocols, and clinical metadata. Such datasets were suggested as a means to improve model generalizability and ensure that explanation outputs remain reliable across different healthcare environments [25,36,42,43,54,66].

Hybrid or multi-method explainability frameworks were proposed in five studies. These approaches combine multiple XAI techniques within a unified pipeline to leverage complementary strengths. For example, several studies have proposed sequential pipelines in which localization methods, such as Grad-CAM, identify relevant regions, followed by attribution methods, such as SHAP or LIME, to quantify feature importance [40,57,61,62,66].

Four studies proposed integrating fairness, bias mitigation, and robustness considerations into explainable AI systems. These studies suggested combining XAI with fairness-aware machine learning methods to detect biases in cancer classification models and improve transparency in model decision-making across different demographic groups [31,40,42,43].

Similarly, four studies proposed using XAI to support segmentation tasks or clinical decision support systems. Proposed approaches included converting explanation heatmaps into segmentation masks, integrating XAI maps into segmentation pipelines, and developing transparent clinical systems that combine diagnostic predictions with interpretable visual explanations [30,55,61,67].

Several emerging research directions were also identified, but reported less frequently. Two studies suggested developing intrinsically interpretable or white-box deep learning models to reduce reliance on post hoc explanation methods [27,42]. Two studies proposed extending explainability techniques to three-dimensional imaging or multimodal data environments that combine imaging with clinical or genomic information [30,56]. Two studies highlighted the need for computationally efficient explainability methods that reduce processing time and GPU memory requirements [57,67]. Finally, one study proposed integrating XAI with federated learning or privacy-preserving frameworks to enable explainable AI deployment in distributed healthcare environments [36].

5. Discussion

5.1. XAI Techniques in Cancer Medical Imaging

A comparative benchmarking of the reviewed XAI methods reveals clear trade-offs among performance, robustness, and clinical usability that directly influence method selection in medical imaging contexts. Based on the synthesized evidence, class activation mapping (CAM) variants, both gradient-based and gradient-free, consistently outperform other approaches due to their strong balance between visualization quality and interpretability.

Among these, Grad-CAM++ and ScoreCAM emerge as the top-performing methods. Grad-CAM++ improves localization accuracy by incorporating higher-order gradients, resulting in clearer, more precise heatmaps, particularly at object boundaries. ScoreCAM demonstrates strong robustness by avoiding gradient noise, generating reliable explanations even in complex or ambiguous cases. Both methods achieve high clinical usability, as their outputs are more consistent and easier to interpret for clinicians.

Grad-CAM, XGradCAM, and Counter-CAM form a second tier of effective methods with minor limitations. Grad-CAM remains widely used for its computational efficiency and ease of integration, though it produces relatively coarse heatmaps. XGradCAM improves localization through enhanced weighting mechanisms, while Counter-CAM provides additional insight by identifying critical regions via counterfactual analysis. Despite their strengths, these methods show greater variability in robustness and precision than the top tier.

A third tier includes LayerCAM, EigenCAM, and AblationCAM, which provide specialized advantages. LayerCAM improves spatial resolution by leveraging layer-wise relevance, while EigenCAM captures more holistic feature representations. AblationCAM offers strong theoretical reliability by directly measuring the impact of feature removal on predictions. However, these methods demonstrate moderate robustness and, in the case of AblationCAM, higher computational cost, limiting scalability and real-time applicability.

EnsembleCAM represents a robustness-focused approach that combines multiple CAM variants to improve explanation stability and coverage. While this increases reliability, the added complexity reduces practicality for routine clinical deployment.

Finally, model-agnostic methods such as LIME rank lower overall. Although flexible and applicable across different architectures, LIME depends on random perturbation sampling and can therefore produce inconsistent explanations for the same input. This limits its reliability in clinical settings where stability is essential.

Overall, the findings indicate that no single XAI method is universally optimal, and the specific requirements of performance, robustness, and clinical usability should guide selection. For CNN-based medical imaging tasks, Grad-CAM++ and ScoreCAM provide the most balanced and reliable performance, while Grad-CAM remains a practical baseline. More complex methods, such as AblationCAM and EnsembleCAM, are better suited for detailed analysis rather than real-time use, and model-agnostic approaches should be applied cautiously. To provide clearer guidance, the benchmarking and ranking of these methods are summarized in Table 11, which consolidates their comparative performance across key evaluation dimensions.

5.2. Integration of XAI into Clinical Workflows

Figure 4 illustrates the integration of Explainable Artificial Intelligence (XAI) into a typical clinical workflow for cancer imaging. The process begins with image acquisition, during which modalities such as CT, MRI, and X-ray are captured and stored in clinical systems. AI models then process these images during the inference stage to generate diagnostic predictions.

At the explanation stage, XAI methods such as Grad-CAM produce heatmaps that highlight regions influencing the model’s decision. These visual explanations allow clinicians to assess whether the model focuses on clinically relevant structures. This stage supports a human-in-the-loop approach, where clinicians review and validate both the prediction and its explanation before making a decision.
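To make the explanation stage concrete, the following is a minimal sketch of how a Grad-CAM heatmap is computed from a convolutional layer, assuming the feature maps and the gradients of the class score with respect to them have already been extracted (for example through framework hooks). The function name `grad_cam` and the toy inputs are hypothetical and are not taken from any reviewed study.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map from K feature maps and their gradients.

    Both arguments are lists of K maps, each an H x W nested list of floats.
    They are hypothetical inputs that a framework hook would normally supply.
    """
    num_maps = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of each map's gradients.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(num_maps)
    ]

    # Weighted sum of the feature maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(num_maps)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Rescale to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: the first map supports the class, the second opposes it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(toy_activations, toy_gradients))
```

Because the map has the resolution of the chosen convolutional layer, it must be upsampled to the image size before overlay, which is one source of the coarse localization noted for this method.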

Following validation, the clinician proceeds to the final decision stage, which may include diagnosis, treatment planning, or further testing. The workflow also includes a feedback loop in which predictions, explanations, and clinical outcomes are stored and analyzed. This enables performance monitoring, error analysis, and continuous model refinement, improving both accuracy and reliability over time.

Despite this structured workflow, several limitations affect practical deployment. The lack of standardized evaluation frameworks makes it difficult to assess the reliability of explanations. Instability in explanation methods and coarse localization reduce clinical confidence, particularly for detecting small or subtle lesions. In addition, the high computational cost limits real-time applicability, and existing XAI methods often exhibit limited compatibility with advanced architectures such as Vision Transformers.

To support effective integration, key enablers are required. These include standardized evaluation protocols, clinician-centered validation, and seamless integration with systems such as PACS (Picture Archiving and Communication System) and clinical decision support systems. Interactive, real-time XAI interfaces are also important for improving usability. Furthermore, large-scale and multi-institutional validation is necessary to ensure generalizability across clinical settings.

5.3. Proposed Standardized Evaluation Framework for XAI

One key finding of this review is the absence of a standardized, consistent evaluation framework for assessing XAI methods in cancer medical imaging. As discussed in Section 4.2, current evaluation practices are highly heterogeneous, with most studies relying on qualitative visual inspection, while only a limited number adopt quantitative metrics or expert-based validation. This lack of consistency makes it difficult to compare results across studies and limits the reliability and clinical applicability of XAI systems.

To address this gap, this study proposes a structured, multidimensional evaluation framework for XAI methods to provide a more comprehensive and consistent assessment. The proposed framework consists of four key evaluation dimensions, as shown in Figure 5.

First, spatial faithfulness (Localization Accuracy) evaluates how well the explanation aligns with clinically relevant regions or ground-truth annotations. It is particularly important in medical imaging, where the correct localization of lesions or abnormal structures is critical. Common metrics include Intersection over Union (IoU), Dice Similarity Coefficient (DSC), and Pointing Game Score. These metrics provide an objective measurement of how accurately the explanation highlights relevant regions in the image.
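As an illustration of the spatial metrics named above, the sketch below computes IoU, DSC, and a single Pointing Game hit for one heatmap against one ground-truth mask. The binarization threshold of 0.5, the function names, and the toy arrays are assumptions made for this example; the reviewed studies may use different thresholds or conventions.

```python
def binarize(heatmap, threshold=0.5):
    """Turn a [0, 1] heatmap into a binary mask (threshold is an assumption)."""
    return [[1 if value >= threshold else 0 for value in row] for row in heatmap]


def iou_and_dice(pred_mask, true_mask):
    """Return (IoU, DSC) for two binary masks given as nested lists."""
    pred = [v for row in pred_mask for v in row]
    true = [v for row in true_mask for v in row]
    intersection = sum(1 for p, t in zip(pred, true) if p and t)
    pred_area = sum(1 for p in pred if p)
    true_area = sum(1 for t in true if t)
    union = pred_area + true_area - intersection
    # Two empty masks are treated as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / (pred_area + true_area) if union else 1.0
    return iou, dice


def pointing_game_hit(heatmap, true_mask):
    """True if the heatmap's maximum falls inside the annotated region."""
    best = max(
        ((i, j) for i, row in enumerate(heatmap) for j in range(len(row))),
        key=lambda pos: heatmap[pos[0]][pos[1]],
    )
    return bool(true_mask[best[0]][best[1]])


# Toy 3 x 3 example with a hypothetical lesion annotation.
toy_heatmap = [[0.9, 0.6, 0.1], [0.2, 0.7, 0.0], [0.0, 0.1, 0.0]]
toy_truth = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(iou_and_dice(binarize(toy_heatmap), toy_truth))
print(pointing_game_hit(toy_heatmap, toy_truth))
```

The Pointing Game Score over a dataset is then the fraction of cases that register a hit.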

Second, explanation fidelity (Model Faithfulness) assesses whether the explanation accurately reflects the model’s internal decision-making process. Unlike spatial alignment, fidelity focuses on the causal relationship between input features and model predictions. Metrics such as deletion and insertion curves, pixel-flipping, and counterfactual validation are used to evaluate how changes in important regions affect model output. High fidelity indicates that the explanation truly represents model behavior rather than producing visually plausible but misleading results.
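A deletion curve can be sketched as follows, assuming a model that maps a flat list of pixel values to a class score and an attribution of the same length. Pixels are removed in order of decreasing attribution and the score is recorded after each removal; a faithful explanation should make the score drop quickly, giving a smaller area under the curve. The toy linear model, the zero baseline, and all names here are hypothetical.

```python
def deletion_curve(model, image, attribution, baseline=0.0):
    """Model scores as pixels are removed from most to least attributed."""
    order = sorted(range(len(image)), key=lambda i: attribution[i], reverse=True)
    perturbed = list(image)
    scores = [model(perturbed)]
    for index in order:
        perturbed[index] = baseline  # "delete" the pixel
        scores.append(model(perturbed))
    return scores


def area_under_curve(scores):
    """Trapezoidal area with the x-axis normalized to [0, 1]."""
    steps = len(scores) - 1
    return sum((scores[i] + scores[i + 1]) / 2 for i in range(steps)) / steps


def toy_model(pixels):
    """Hypothetical linear scorer standing in for a trained classifier."""
    return 0.5 * pixels[0] + 0.25 * pixels[1] + 0.25 * pixels[2] + 0.0 * pixels[3]


toy_image = [1.0, 1.0, 1.0, 1.0]
faithful = [0.5, 0.25, 0.25, 0.0]    # matches the model's true weights
unfaithful = [0.0, 0.25, 0.25, 0.5]  # highlights an irrelevant pixel first

print(area_under_curve(deletion_curve(toy_model, toy_image, faithful)))
print(area_under_curve(deletion_curve(toy_model, toy_image, unfaithful)))
```

An insertion curve follows the same pattern in reverse, starting from the baseline image and restoring pixels in attribution order, where a larger area indicates higher fidelity.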

Third, the stability and robustness dimension measures the consistency of explanations under small perturbations in the input or model. In clinical settings, explanations must be reliable and reproducible across similar cases. Methods such as sensitivity analysis, variance measurement, and repeatability testing can be used to evaluate stability. This is particularly important for perturbation-based methods such as LIME, which are known to produce variable outputs due to stochastic sampling.
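Variance measurement and repeatability testing can be sketched as below, assuming an explainer that takes an image and a random seed and returns an attribution per pixel. The two explainers are hypothetical stand-ins: one is deterministic, and the other adds Gaussian noise to imitate a method that relies on stochastic sampling. Neither is an implementation of LIME or any other named method.

```python
import random
from statistics import pvariance


def explanation_variance(explain, image, runs=10):
    """Mean per-pixel variance of attributions over repeated runs."""
    maps = [explain(image, seed) for seed in range(runs)]
    per_pixel = [pvariance([m[i] for m in maps]) for i in range(len(image))]
    return sum(per_pixel) / len(per_pixel)


def deterministic_explainer(image, seed):
    """Hypothetical explainer that ignores the seed entirely."""
    return list(image)


def noisy_explainer(image, seed):
    """Hypothetical explainer whose output depends on random sampling."""
    rng = random.Random(seed)
    return [pixel + rng.gauss(0.0, 0.1) for pixel in image]


toy_image = [0.2, 0.8, 0.5]
print(explanation_variance(deterministic_explainer, toy_image))
print(explanation_variance(noisy_explainer, toy_image))
```

A value near zero indicates repeatable explanations. The same function can be reused for sensitivity analysis by perturbing the image, rather than the seed, between runs.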

Fourth, the clinical relevance and expert validation dimension evaluates whether the explanation aligns with clinical reasoning and expert knowledge. It involves human-in-the-loop validation, typically through radiologist assessment, scoring frameworks, or comparison with annotated clinical data. This dimension is essential because technically correct explanations may still lack clinical meaning. Expert validation ensures that explanations are not only accurate but also interpretable and useful in real diagnostic workflows.

A key principle of the proposed framework is that XAI evaluation should not rely on a single metric or approach. Instead, a robust assessment requires integrating multiple dimensions, combining quantitative metrics with qualitative and expert validation. This multi-dimensional approach ensures that explanations are evaluated not only for technical correctness but also for stability and clinical usefulness.

The proposed framework is directly derived from the limitations identified in the reviewed studies. Many existing works rely heavily on qualitative visual inspection, which is subjective and difficult to reproduce. While some studies incorporate quantitative metrics such as IoU or DSC, these are often applied in isolation without considering explanation fidelity or robustness. In addition, expert validation is used in only a limited number of studies, despite its importance for clinical deployment. These gaps highlight the need for a unified evaluation framework to standardize assessment practices across different XAI methods.

From a practical perspective, the framework can also guide the selection of evaluation strategies based on application requirements. For example, real-time clinical systems may prioritize spatial accuracy and computational efficiency, while research-oriented studies may focus more on fidelity and robustness. Similarly, applications involving multimodal data may require a stronger emphasis on feature attribution and clinical validation. This flexibility allows the framework to be adapted to different use cases while maintaining a consistent evaluation structure.

Overall, the proposed standardized evaluation framework provides a structured approach to address the current fragmentation in XAI assessment. By integrating spatial accuracy, model fidelity, robustness, and clinical validation, it supports a more reliable comparison of XAI methods. It contributes toward improving the trustworthiness and clinical readiness of explainable AI systems in cancer medical imaging.

5.4. Clinical Applicability and Deployment Challenges

Although XAI techniques have shown strong potential in improving the transparency of cancer medical imaging models, their translation into real-world clinical environments remains limited. The findings of this review indicate that current XAI approaches remain largely research-oriented, with several critical barriers preventing their reliable deployment in clinical practice.

One of the primary challenges is the lack of standardized and clinically validated evaluation frameworks. As identified in Section 4.2, most studies rely heavily on qualitative visual assessment, with limited adoption of quantitative benchmarking and expert validation. This inconsistency makes it difficult to determine whether explanation outputs are sufficiently reliable for clinical decision-making. In real-world settings, clinicians require explanations that are not only visually intuitive but also quantitatively validated and reproducible across different patient cases and imaging conditions.

Another significant limitation is the instability and inconsistency of explanation outputs. Perturbation-based methods, such as LIME, exhibit variability due to stochastic sampling, while gradient-based approaches, such as Grad-CAM, are sensitive to noise, model parameters, and input variations. As discussed in Section 4.5, this instability is further influenced by the inherent complexity and variability of medical imaging data, including low contrast, imaging artifacts, and subtle lesion boundaries. In clinical environments, such inconsistency reduces trust, as explanations must remain stable across similar cases to support reliable diagnosis.

Computational cost and scalability also present major barriers to deployment. Methods such as LIME and SHAP require repeated model inference or complex feature-attribution calculations, resulting in high computational overhead. This limitation becomes more critical when applied to high-resolution medical images or real-time diagnostic workflows, where rapid response is required. As a result, many XAI techniques that perform well in controlled experimental settings are not feasible for integration into time-sensitive clinical systems.

A further challenge lies in integrating XAI into clinical workflows and decision-support systems. While XAI methods are designed to improve interpretability, many current approaches do not align well with how clinicians interpret medical images. For example, heatmap-based explanations provide visual localization but often lack precise boundaries or contextual reasoning, limiting their usefulness in tasks such as treatment planning or surgical decision-making. In addition, most XAI methods operate as standalone post hoc tools, rather than being embedded within interactive clinical systems. This lack of integration reduces usability and limits adoption in routine diagnostic practice.

The gap between explanation outputs and clinical usability is also evident in XAI methods’ limited ability to support actionable insights. While heatmaps highlight regions of interest, they do not directly yield clinically meaningful outputs, such as segmentation masks or quantitative measurements. As identified in Section 4.5, the XAI-to-segmentation translation gap remains a key limitation, where explanation maps cannot be directly used for downstream clinical tasks. This disconnect reduces the practical value of XAI in supporting diagnostic workflows.

Another important deployment challenge is dataset limitation and generalizability. Many studies rely on single-source datasets, with limited cross-institutional validation. This raises concerns regarding the robustness of XAI explanations when applied to data from different scanners, imaging protocols, or patient populations. In clinical practice, models must generalize across diverse environments, and explanation methods must remain consistent under such variability. Without multi-institutional validation, the reliability of XAI systems in real-world settings remains uncertain.

In addition, compatibility issues between XAI methods and modern model architectures further limit deployment. Many existing XAI techniques are designed for convolutional neural networks and rely on spatial feature maps. However, as discussed in Section 4.5, emerging architectures such as Vision Transformers use token-based representations, which are not directly compatible with traditional explainability methods. This architectural mismatch can degrade explanation quality and limit the applicability of XAI in next-generation AI systems.

Finally, clinical adoption also depends on trust, interpretability, and regulatory considerations. Explanations must be not only technically accurate but also understandable and meaningful to clinicians. Current XAI methods often produce static, non-interactive outputs, requiring expert interpretation and limiting usability for non-technical users. Furthermore, regulatory requirements for medical AI systems emphasize transparency, reproducibility, and validation, which are not consistently addressed in existing XAI studies.

5.5. Model Architectures and Their Impact on XAI Interpretability

The choice of model architecture plays an important role in shaping both diagnostic performance and the effectiveness of XAI in cancer medical imaging. Current evidence indicates a strong dominance of CNNs, alongside a gradual shift toward ViTs and hybrid architectures that aim to balance local feature extraction with global contextual understanding.

CNN-based models remain the most widely adopted due to their maturity and compatibility with established XAI techniques, particularly the Class Activation Mapping family. Architectures such as ResNet and VGG are commonly used because they provide stable, structured feature representations that facilitate intuitive heatmap generation. This alignment between model design and explainability methods has contributed to the widespread reliance on CNNs in clinically oriented studies. However, their depth and complexity also pose challenges, as highly nonlinear feature interactions can limit transparency in internal decision-making processes.

In contrast, Vision Transformers introduce a fundamentally different paradigm by leveraging self-attention mechanisms to capture long-range dependencies. This allows for improved modeling of complex spatial relationships in medical images. Nevertheless, a key limitation lies in their compatibility with traditional XAI methods. Techniques such as Grad-CAM often produce less precise visual explanations when applied to transformer-based models, prompting the adoption of attention-based interpretability methods, such as attention rollout. While these approaches offer finer spatial resolution, they also introduce new challenges in interpretation due to the distributed nature of attention mechanisms.
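For readers unfamiliar with attention rollout, a minimal sketch is given below, assuming head-averaged attention matrices (tokens by tokens, each row summing to one) are available for every transformer layer. Each matrix is mixed equally with the identity to account for residual connections, and the results are multiplied from the first layer to the last. The function names and toy matrices are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layer_attentions):
    """Combine per-layer attention maps, ordered from first layer to last."""
    size = len(layer_attentions[0])
    identity = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    rollout = identity
    for attention in layer_attentions:
        # Mix in the identity to model the residual (skip) connection.
        mixed = [
            [0.5 * attention[i][j] + 0.5 * identity[i][j] for j in range(size)]
            for i in range(size)
        ]
        # Re-normalize so every row remains a distribution over tokens.
        mixed = [[value / sum(row) for value in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy example with two tokens and two layers.
layer_1 = [[1.0, 0.0], [0.5, 0.5]]
layer_2 = [[0.5, 0.5], [0.0, 1.0]]
print(attention_rollout([layer_1, layer_2]))
```

In a Vision Transformer, the row for the classification token, restricted to the patch tokens and reshaped to the patch grid, is typically what gets visualized as a heatmap.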

Hybrid architectures have emerged as a promising direction, combining the strengths of CNNs and transformers. By integrating local feature extraction with global contextual reasoning, these models aim to improve both predictive performance and interpretability. In particular, architectures that incorporate domain knowledge, such as segmentation-guided attention or multi-modal fusion frameworks, demonstrate enhanced alignment between model explanations and clinically relevant features. These developments suggest that interpretability can be improved not only through post hoc XAI methods but also through thoughtful architectural design.

Despite these advancements, several structural challenges remain. There is an inherent trade-off between model complexity and interpretability, where more powerful architectures often produce less transparent explanations. Additionally, the impact of training strategies, such as fine-tuning depth, on the quality of XAI outputs remains underexplored. Emerging trends in federated learning and lightweight models further complicate this landscape by introducing constraints on privacy, computational efficiency, and deployment environments.

We argue that the evolution of model architectures necessitates a parallel advancement in XAI methodologies. As the field moves toward more complex and hybrid designs, there is a growing need to develop explainability techniques that are not only compatible with these architectures but also capable of producing stable, clinically meaningful, and trustworthy interpretations.

5.6. Challenges and Limitations of XAI in Cancer Medical Imaging

In this study, the evaluation of XAI highlights a landscape constrained by interconnected technical, methodological, and architectural challenges. Despite increasing adoption, more than half of the studies reported substantial limitations, raising concerns about the clinical readiness and reliability of current XAI approaches.

From a technical perspective, instability and imprecision remain the most prominent issues. Perturbation-based methods such as LIME exhibit stochastic behavior, producing inconsistent explanations for identical inputs, which undermines reproducibility and clinical trust. Similarly, widely used gradient-based methods, particularly the CAM family, often produce coarse, low-resolution heatmaps that fail to capture fine-grained pathological details essential for accurate diagnosis. These limitations are further exacerbated when models encounter rare or atypical cancer morphologies, where explanation methods tend to perform inconsistently or fail to provide explanations.

Methodologically, the absence of standardized evaluation frameworks is a major barrier. Many studies rely heavily on qualitative visual inspection, often supported by selectively chosen examples, which limits objectivity and transparency. The lack of pixel-level ground truth annotations further restricts the use of quantitative metrics such as IoU and Dice Similarity Coefficient, resulting in weak validation of explanation accuracy. Moreover, the absence of unified benchmarks prevents meaningful comparison across different XAI methods and imaging modalities, contributing to fragmentation within the field.

Architectural evolution introduces additional complexity, particularly with the transition from CNNs to Vision Transformers. Traditional XAI techniques designed for convolutional structures are not fully compatible with transformer-based models, leading to reduced explanation fidelity. Although transformer-native methods have been proposed, they introduce new interpretability challenges, including the propagation of irrelevant attention across distant regions. Reliance on transfer learning from non-medical datasets may also reduce the clinical relevance of learned features, thereby affecting the validity of explanations.

Practical and resource-related constraints further limit real-world deployment. Computationally intensive methods such as SHAP and LIME require substantial processing power, which may not be feasible in many clinical environments. Most XAI approaches also provide only local explanations, offering limited insight into global model behavior across diverse patient populations. The lack of standardized pipelines for converting heatmaps into clinically usable outputs, such as segmentation masks, further restricts their applicability in clinical workflows.

These challenges are not isolated but mutually reinforcing, collectively limiting the robustness and clinical utility of XAI systems. Addressing these issues requires the development of stable, high-resolution explanation methods, alongside standardized evaluation protocols that balance quantitative accuracy and clinical relevance. Without such advancements, the transition of XAI from research settings to routine clinical practice will remain constrained.

5.7. Future Directions for XAI in Cancer Medical Imaging

The future development of XAI is increasingly oriented toward transforming current post hoc interpretability approaches into robust, clinically reliable decision-support systems. This transition is guided by three key strategic pillars: methodological advancement, rigorous evaluation standardization, and deep clinical integration, all of which are essential to achieving meaningful real-world impact.

First, from the technical perspective, there is a clear shift away from reliance on a limited set of dominant techniques toward more diverse and complementary XAI methodologies. Future research should integrate attribution-based methods, such as SHAP and LIME, with gradient-based methods to provide both spatial and quantitative explanations. Hybrid pipelines that combine fast localization with detailed attribution are emerging as a practical solution to balance efficiency and interpretability. In parallel, there is growing interest in intrinsic explainability, where interpretability is embedded directly within model architectures, reducing dependence on post hoc analysis. The extension of XAI from 2D imaging to 3D volumetric and multimodal settings further underscores the need for more comprehensive, clinically aligned explanations.

Equally important is advancing evaluation frameworks. Current reliance on subjective visual inspection is increasingly recognized as insufficient, prompting a shift toward standardized quantitative benchmarking. Metrics such as Dice Similarity Coefficient, Intersection over Union, and Pointing Game scores are expected to play a central role in objectively assessing spatial accuracy. Beyond localization, future evaluation must incorporate measures of faithfulness, stability, and fairness to ensure that explanations are both computationally valid and clinically trustworthy. The development of large-scale, multi-institutional datasets with expert-annotated ground truth is critical for enabling such standardized evaluation and improving reproducibility across studies.

Clinical translation represents the most crucial dimension of future progress. There is a growing recognition that effective XAI systems must be co-designed with domain experts rather than evaluated retrospectively. Concept-based frameworks that integrate clinical knowledge directly into model reasoning show strong potential to align AI outputs with medical decision-making processes. In addition, interactive XAI systems are expected to enhance human–AI collaboration by allowing clinicians to explore and refine explanations dynamically. Importantly, future research must move beyond retrospective validation and incorporate prospective clinical studies to assess the real-world impact of XAI on diagnostic accuracy and patient outcomes.

Despite these promising directions, several structural challenges remain. The continued dependence on post hoc methods highlights a gap between current practices and the goal of inherently interpretable models. Furthermore, the pace of technical innovation continues to outstrip clinical adoption, emphasizing the need for stronger interdisciplinary collaboration. A persistent issue is the discrepancy between model accuracy and explanation quality, where models may achieve high performance while relying on clinically irrelevant features. Addressing this “right for the wrong reasons” problem through standardized, clinically grounded evaluation frameworks is essential to advancing trust in XAI systems.

The future of XAI in cancer medical imaging depends on a coordinated effort to advance methodology, standardize evaluation, and integrate clinical expertise. Only through this multi-dimensional approach can XAI evolve from a supportive visualization tool into a reliable and actionable component of clinical decision-making.

6. Limitations

While this study provides a comprehensive synthesis of XAI techniques in cancer medical imaging, several limitations should be acknowledged. First, the review was restricted to a selected set of databases, including ACM Digital Library, IEEE Xplore, MDPI, and Scopus. Although these sources provide strong coverage of computer science and technical research, excluding core biomedical databases such as PubMed/MEDLINE and Web of Science may have led to the omission of relevant clinical studies, potentially affecting the comprehensiveness of the review.

Second, the review is limited by potential selection bias arising from study inclusion criteria and access to full-text articles. Some relevant studies may have been excluded due to restricted access or incomplete reporting of XAI evaluation methods. In addition, the heterogeneity across included studies in terms of datasets, imaging modalities, XAI techniques, and evaluation protocols limited the ability to perform a quantitative meta-analysis. As a result, the findings are based on qualitative synthesis and descriptive analysis rather than statistically aggregated evidence.

Third, the analysis of evaluation methods is constrained by inconsistent reporting across studies. Many papers do not provide detailed quantitative metrics or standardized benchmarks, limiting the ability to perform direct comparisons or draw generalizable conclusions. This reflects a broader issue in the field, where the lack of standardized evaluation frameworks affects not only primary studies but also systematic reviews.

Finally, the rapidly evolving nature of XAI and medical imaging research means that new techniques, architectures, and evaluation approaches may not be fully captured within the selected time frame. Although this review focuses on recent studies from 2020 to 2026, emerging developments beyond this period may further influence the conclusions presented.

Despite these limitations, this study provides valuable insights into current trends, challenges, and research gaps. It highlights the need for more standardized, rigorous, and clinically validated approaches in future XAI research.

7. Conclusions

This systematic review highlights that while XAI has made meaningful progress in improving the transparency of cancer medical imaging models, the field remains at an early stage in terms of clinical reliability and practical deployment. Current approaches are dominated by gradient-based visualization techniques such as Grad-CAM, supported by complementary methods including LIME, SHAP, and Integrated Gradients. However, evaluation practices remain inconsistent, with a strong reliance on qualitative assessment and limited use of standardized quantitative frameworks, which restricts comparability and clinical validation.

The findings of this review indicate that the major bottleneck lies in the lack of standardized, multi-dimensional evaluation frameworks and limited clinician-centered validation. Without consistent benchmarking and expert validation, it remains difficult to determine whether XAI explanations are reliable for clinical decision-making. In addition, technical challenges such as instability in explanations, coarse localization, high computational cost, and incompatibility with emerging architectures further limit real-world applicability.

To advance the field toward clinical reliability, several key research directions should be prioritized. First, future studies should develop and adopt standardized evaluation frameworks that integrate spatial accuracy, model fidelity, robustness, and clinical validation to ensure consistent and reproducible assessment of XAI methods. Second, there is a need for large-scale, multi-institutional datasets with high-quality annotations to support cross-domain validation and improve generalizability. Third, researchers should focus on designing clinician-centered XAI systems that align with radiologists’ workflows, including interactive and real-time explanation tools that support decision-making. Fourth, future work should explore architecture-aware explainability methods compatible with modern models, such as Vision Transformers and hybrid architectures. Finally, the development of hybrid and multi-method XAI frameworks that combine localization, attribution, and causal reasoning approaches is recommended to improve both interpretability depth and robustness.

Overall, advancing XAI from a research-oriented tool to a clinically reliable system requires a coordinated effort across technical, methodological, and clinical domains. By addressing these key directions, future research can help develop more trustworthy, interpretable, and deployable AI systems for cancer diagnosis and clinical decision support.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/make8050134/s1. A predefined review protocol is registered on the Open Science Framework at https://doi.org/10.17605/OSF.IO/8VTB5 [70].

Author Contributions

Conceptualization, Y.A.K. and K.I.G.; methodology, Y.A.K. and K.I.G.; software, Y.A.K.; validation, Y.A.K. and K.I.G.; formal analysis, Y.A.K.; writing—original draft preparation, Y.A.K. and K.I.G.; writing—review and editing, Y.A.K. and K.I.G.; visualization, Y.A.K.; supervision, K.I.G.; project administration, K.I.G.; funding acquisition, K.I.G. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Multimedia University.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ANN	Artificial Neural Network
AUC	Area Under the Curve
CAD	Computer-Aided Diagnosis
CAM	Class Activation Mapping
CNN	Convolutional Neural Network
CT	Computed Tomography
DL	Deep Learning
DSC	Dice Similarity Coefficient
EHR	Electronic Health Record
FN	False Negative
FP	False Positive
GBM	Gradient Boosting Machine
Grad-CAM	Gradient-weighted Class Activation Mapping
IoU	Intersection over Union
KNN	k-Nearest Neighbors
LIME	Local Interpretable Model-agnostic Explanations
ML	Machine Learning
MRI	Magnetic Resonance Imaging
PET	Positron Emission Tomography
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RF	Random Forest
RNN	Recurrent Neural Network
ROC	Receiver Operating Characteristic
SHAP	SHapley Additive exPlanations
SLR	Systematic Literature Review
SVM	Support Vector Machine
TN	True Negative
TP	True Positive
US	Ultrasound
ViT	Vision Transformer
XAI	Explainable Artificial Intelligence

References

Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global Cancer Statistics 2022: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 2024, 74, 229–263.
Bizuayehu, H.M.; Dadi, A.F.; Ahmed, K.Y.; Tegegne, T.K.; Hassen, T.A.; Kibret, G.D.; Ketema, D.B.; Bore, M.G.; Thapa, S.; Odo, D.B.; et al. Burden of 30 Cancers among Men: Global Statistics in 2022 and Projections for 2050 Using Population-based Estimates. Cancer 2024, 130, 3708–3723.
Fass, L. Imaging and Cancer: A Review. Mol. Oncol. 2008, 2, 115–152.
Pulumati, A.; Pulumati, A.; Dwarakanath, B.S.; Verma, A.; Papineni, R.V.L. Technological Advancements in Cancer Diagnostics: Improvements and Limitations. Cancer Rep. 2023, 6, e1764.
DrKumo Inc. AI in Healthcare: Practical Applications for 2025; DrKumo: Buena Park, CA, USA, 2025.
Canada’s Drug Agency (CDA). 2025 Watch List: Artificial Intelligence in Health Care; Canada’s Drug Agency (CDA): Ottawa, ON, Canada, 2025; Volume 5.
Oei, S.P.; Bakkes, T.H.G.F.; Mischi, M.; Bouwman, R.A.; Van Sloun, R.J.G.; Turco, S. Artificial Intelligence in Clinical Decision Support and the Prediction of Adverse Events. Front. Digit. Health 2025, 7, 1403047.
Kufel, J.; Bargieł-Łączek, K.; Kocot, S.; Koźlik, M.; Bartnikowska, W.; Janik, M.; Czogalik, Ł.; Dudek, P.; Magiera, M.; Lis, A.; et al. What Is Machine Learning, Artificial Neural Networks and Deep Learning?—Examples of Practical Applications in Medicine. Diagnostics 2023, 13, 2582.
Hoffman, R.R.; Mueller, S.T.; Klein, G.; Litman, J. Metrics for Explainable AI: Challenges and Prospects. arXiv 2018, arXiv:1812.04608.
Samek, W.; Montavon, G.; Vedaldi, A.; Hansen, L.K.; Müller, K.-R. (Eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11700, ISBN 978-3-030-28953-9.
Reddy, S. Explainability and Artificial Intelligence in Medicine. Lancet Digit. Health 2022, 4, e214–e215.
Precise4Q Consortium; Amann, J.; Blasimme, A.; Vayena, E.; Frey, D.; Madai, V.I. Explainability for Artificial Intelligence in Healthcare: A Multidisciplinary Perspective. BMC Med. Inform. Decis. Mak. 2020, 20, 310.
Sun, Q.; Akman, A.; Schuller, B.W. Explainable Artificial Intelligence for Medical Applications: A Review. ACM Trans. Comput. Healthc. 2025, 6, 1–31.
Gulum, M.A.; Trombley, C.M.; Kantardzic, M. A Review of Explainable Deep Learning Cancer Detection Models in Medical Imaging. Appl. Sci. 2021, 11, 4573.
Hauser, K.; Kurz, A.; Haggenmüller, S.; Maron, R.C.; Von Kalle, C.; Utikal, J.S.; Meier, F.; Hobelsberger, S.; Gellrich, F.F.; Sergon, M.; et al. Explainable Artificial Intelligence in Skin Cancer Recognition: A Systematic Review. Eur. J. Cancer 2022, 167, 54–69.
Gurmessa, D.K.; Jimma, W. Explainable Machine Learning for Breast Cancer Diagnosis from Mammography and Ultrasound Images: A Systematic Review. BMJ Health Care Inform. 2024, 31, e100954.
Wyatt, L.S.; van Karnenbeek, L.M.; Wijkhuizen, M.; Geldof, F.; Dashtbozorg, B. Explainable Artificial Intelligence (XAI) for Oncological Ultrasound Image Analysis: A Systematic Review. Appl. Sci. 2024, 14, 8108.
Karthiga, R.; Narasimhan, K.; Thanikaiselvan, V.; Hemalatha, M.; Amirtharajan, R. Review of AI & XAI-Based Breast Cancer Diagnosis Methods Using Various Imaging Modalities. Multimed. Tools Appl. 2025, 84, 2209–2260.
Liberati, A.; Altman, D.G.; Tetzlaff, J.; Mulrow, C.; Gøtzsche, P.C.; Ioannidis, J.P.A.; Clarke, M.; Devereaux, P.J.; Kleijnen, J.; Moher, D. The PRISMA Statement for Reporting Systematic Reviews and Meta-Analyses of Studies That Evaluate Healthcare Interventions: Explanation and Elaboration. BMJ 2009, 339, b2700.
Haddaway, N.R.; Page, M.J.; Pritchard, C.C.; McGuinness, L.A. PRISMA2020: An R Package and Shiny App for Producing PRISMA 2020-Compliant Flow Diagrams, with Interactivity for Optimised Digital Transparency and Open Synthesis. Campbell Syst. Rev. 2022, 18, e1230.
Johnson, N.; Phillips, M. Rayyan for Systematic Reviews. J. Electron. Resour. Librariansh. 2018, 30, 46–48.
Rayyan: AI-Powered Systematic Review Management Platform. Available online: https://www.rayyan.ai/ (accessed on 30 January 2026).
Hussain, S.M.; Buongiorno, D.; Altini, N.; Berloco, F.; Prencipe, B.; Moschetta, M.; Bevilacqua, V.; Brunetti, A. Shape-Based Breast Lesion Classification Using Digital Tomosynthesis Images: The Role of Explainable Artificial Intelligence. Appl. Sci. 2022, 12, 6230.
Alomar, A.; Alazzam, M.; Mustafa, H.; Mustafa, A. Lung Cancer Detection Using Deep Learning and Explainable Methods. In Proceedings of the 2023 14th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 21–23 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–4.
Mustari, A.; Ahmed, R.; Tasnim, A.; Juthi, J.S.; Shahariar, G.M. Explainable Contrastive and Cost-Sensitive Learning for Cervical Cancer Classification. In Proceedings of the 2023 26th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 13–15 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–6.
Raghavan, K.; Sivaselvan, B.; Kamakoti, V. Counter-CAM: An Improved Grad-CAM Based Visual Explainer for Infrared Breast Cancer Classification. In Proceedings of the 2023 IEEE 20th India Council International Conference (INDICON), Hyderabad, India, 14–17 December 2023; IEEE: New York, NY, USA, 2023; pp. 661–666.
Shivhare, I.; Jogani, V.; Purohit, J.; Shrawne, S.C. Analysis of Explainable Artificial Intelligence Methods on Medical Image Classification. In Proceedings of the 2023 Third International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 5–6 January 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
Ahmed, M.; Bibi, T.; Khan, R.A.; Nasir, S. Enhancing Breast Cancer Diagnosis in Mammography: Evaluation and Integration of Convolutional Neural Networks and Explainable AI. In Proceedings of the 2024 26th International Multi-Topic Conference (INMIC), Karachi, Pakistan, 30–31 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
Burgos, D.; Morshed, A.; Rashid, M.M.; Mandala, S. A Comparison of Machine Learning Models to Deep Learning Models for Cancer Image Classification and Explainability of Classification. In Proceedings of the 2024 International Conference on Data Science and Its Applications (ICoDSA), Kuta, Indonesia, 10–11 July 2024; IEEE: New York, NY, USA, 2024; pp. 386–390.
Dugăeşescu, A.; Chiru, C.-M.; Nan, M.; Trăscău, M.; Florea, A.M. Explainable Cancer Segmentation Through Classification. In Proceedings of the 2024 IEEE 20th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 17–19 October 2024; IEEE: New York, NY, USA, 2024; pp. 1–8.
Gnanavel, N.; Inparaj, P.; Sritharan, N.; Meedeniya, D.; Yogarajah, P. Interpretable Cervical Cell Classification: A Comparative Analysis. In Proceedings of the 2024 4th International Conference on Advanced Research in Computing (ICARC), Belihuloya, Sri Lanka, 21–24 February 2024; IEEE: New York, NY, USA, 2024; pp. 7–12.
Kaushik, S.; Lamba, A.K.; Kansal, I.; Khullar, V.; Sharma, P. Explainable Deep Learning for Lung Cancer Detection: Comparing CNN and DenseNet201 with Grad-CAM. In Proceedings of the 2024 2nd International Conference on Signal Processing, Communication, Power and Embedded System (SCOPES), Paralakhemundi, India, 19–21 December 2024; IEEE: New York, NY, USA, 2024; pp. 1–5.
Nugroho, W.A.; Supriyanto, C.; Shidik, G.F.; Pujiono. Modified ReLU in Deep Learning Models and Explainable AI Techniques for Accurate and Interpretable Breast Cancer Subtype Classification. In Proceedings of the 2024 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA), Bali, Indonesia, 17–19 December 2024; IEEE: New York, NY, USA, 2024; pp. 772–777.
Ramoliya, F.; Gohil, K.; Gohil, A.; Gupta, R.; Kakkar, R.; Tanwar, S.; Rodrigues, J.J.P.C. X-CaD: Explainable AI for Skin Cancer Diagnosis in Healthcare 4.0 Telesurgery. In Proceedings of the ICC 2024—IEEE International Conference on Communications, Denver, CO, USA, 9–13 June 2024; IEEE: New York, NY, USA, 2024; pp. 238–243.
Sritharan, N.; Gnanavel, N.; Inparaj, P.; Meedeniya, D.; Yogarajah, P. EnsembleCAM: Unified Visualization for Explainable Cervical Cancer Identification. In Proceedings of the 2024 International Research Conference on Smart Computing and Systems Engineering (SCSE), Colombo, Sri Lanka, 4 April 2024; IEEE: New York, NY, USA, 2024; Volume 7, pp. 1–6.
Ulla, S.; Yousuf, M.A. MSCC: Multi-Class Skin Cancer Classification and Interpretable Deep Learning Systems. In Proceedings of the 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), Dhaka, Bangladesh, 2–4 May 2024; IEEE: New York, NY, USA, 2024; pp. 735–740.
Wedisinghe, H.; Fernando, T.G.I. Explainable AI for Early Lung Cancer Detection: A Path to Confidence. In Proceedings of the 2024 4th International Conference on Advanced Research in Computing (ICARC), Belihuloya, Sri Lanka, 21–24 February 2024; IEEE: New York, NY, USA, 2024; pp. 13–18.
Gamage, L.; Isuranga, U.; Meedeniya, D.; De Silva, S.; Yogarajah, P. Melanoma Skin Cancer Identification with Explainability Utilizing Mask Guided Technique. Electronics 2024, 13, 680.
Cerekci, E.; Alis, D.; Denizoglu, N.; Camurdan, O.; Ege Seker, M.; Ozer, C.; Hansu, M.Y.; Tanyel, T.; Oksuz, I.; Karaarslan, E. Quantitative Evaluation of Saliency-Based Explainable Artificial Intelligence (XAI) Methods in Deep Learning-Based Mammogram Analysis. Eur. J. Radiol. 2024, 173, 111356.
Ahmad, I.; Amelio, A.; Ali, F.; Merla, A.; Scozzari, F.; Ahmad, N. A Comparative Analysis of Artificial Intelligence Methods for Breast Cancer Interpretation. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–8.
Bissoonauth-Daiboo, P.; Muzzammil Auzine, M.; Khan, M.I.; Alshannaq, F.; Saba, T.; Gao, X.; Heenaye-Mamode Khan, M. Exploring Vision Transformers and Explainable AI for Enhanced Artefact Classification in Esophageal Endoscopic Images. IEEE Access 2025, 13, 176221–176244.
Bhatia, V. Explainable AI for Cervical Cancer Classification Utilizing Optimized Feature Selection with Model Interpretability. In Proceedings of the 2025 International Conference on Artificial Intelligence and Emerging Technologies (ICAIET), Bhubaneswar, India, 28–30 August 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
El Mrabet, A.; Benaly, M.; Kouach, B.; Alihamidi, I.; Hlou, L.; El Gouri, R. Explainable AI for Skin Cancer Classification Unlocking Insights with Grad-Cam and Grad-Cam++. In Proceedings of the 2025 International Conference on Circuit, Systems and Communication (ICCSC), Fez, Morocco, 19–20 June 2025; IEEE: New York, NY, USA, 2025; pp. 1–4.
Gupta, D.; Benila, S. Attention-Guided U-Net With Grad-CAM for Explainable Polyp Segmentation in Colonoscopy Images. IEEE Access 2025, 13, 185125–185136.
Hossain, D.; Rahman, M.; Rahman, A.; Ahmed, T.; Dhar, S.; Biswass, O. CerviTrans-XAI: An Explainable Vision Transformer Ensemble for Accurate Cervical Cancer Classification. In Proceedings of the 2025 IEEE 2nd International Conference on Computing, Applications and Systems (COMPAS), Kushtia, Bangladesh, 23–24 October 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Hotiet, H.; AlHawat, M.; Ali, M.A.; Kassem, A. Lung Cancer Detection Using Deep Learning Models: A Comparative Study of Preprocessing Techniques with Explainable AI Integration. In Proceedings of the 2025 37th International Conference on Microelectronics (ICM), Cairo, Egypt, 14–17 December 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Ifty, H.S.; Nirjan, N.; Islam, L.; Diganta, M.A.; Ornate, R.A.; Tasnim, A.; Islam, S. Automated Detection of Malignant Lesions in the Ovary Using Deep Learning Models and XAI. In Proceedings of the 2025 IEEE 4th International Conference on AI in Cybersecurity (ICAIC), Houston, TX, USA, 5–7 February 2025; IEEE: New York, NY, USA, 2025; pp. 1–8.
Nabi, M.S.; Fauzi, M.F.A.; Bin Abdul Karim, H.; Khalid, A.S.; Tang, T.B.; Razak, N.N. Explainable AI for Breast Cancer Diagnosis Using EfficientNetB3 with Attention Mechanism. In Proceedings of the TENCON 2025—2025 IEEE Region 10 Conference (TENCON), Kota Kinabalu, Malaysia, 27–30 October 2025; IEEE: New York, NY, USA, 2025; pp. 1554–1558.
Nipa, R.N.; Jakir Hossain, M.; Hassan, M.J.; Faisal Ahammed, M.; Alamgir Hossain, M. AECNet: Advancing Skin Cancer Classification Using ACGAN-Based Synthetic Data with Grad-CAM Explainability. In Proceedings of the 2025 2nd International Conference on Next-Generation Computing, IoT and Machine Learning (NCIM), Gazipur, Bangladesh, 27–28 June 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Priambodo, A.R.; Fatichah, C. Evaluating Lightweight CNN Models with CBAM for Explainable AI Skin Cancer Classification Using GradCAM and ScoreCAM. In Proceedings of the 2025 IEEE International Conference on Artificial Intelligence and Mechatronics Systems (AIMS), Sumedang, Indonesia, 24–25 May 2025; IEEE: New York, NY, USA, 2025; pp. 1–7.
Rafferty, A.; Ramaesh, R.; Rajan, A. Leveraging Expert Input for Robust and Explainable AI-Assisted Lung Cancer Detection in Chest X-Rays. In Proceedings of the 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), Rende, Italy, 18–21 June 2025; IEEE: New York, NY, USA, 2025; pp. 576–587.
Rahman, A.; Aftahee, S.; Pronay, L.Z.; Rahman, M.A. MRAViT-XAI: A Novel Multi-Resolution Attention Vision Transformer Framework with Explainable AI for Enhanced Lung and Colon Cancer Classification. In Proceedings of the 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN), Rangpur, Bangladesh, 31 July–2 August 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Rakha, M.; Wisesty, U.N.; Gunawan, P.H. A Comparative Study of Deep Learning Models and XAI for Breast Ultrasound Image Classification. In Proceedings of the 2025 International Conference on Data Science and Its Applications (ICoDSA), Jakarta, Indonesia, 3–5 July 2025; IEEE: New York, NY, USA, 2025; pp. 350–356.
Rani, P.; Dahiya, T.; Gupta, D.; Yadav, P.; Sachdeva, T. Deep Learning-Based Skin Cancer Detection: A Comprehensive Approach for Automated Dermatological Diagnosis. In Proceedings of the 2025 Second International Conference on Pioneering Developments in Computer Science & Digital Technologies (IC2SDT), Delhi, India, 4–6 December 2025; IEEE: New York, NY, USA, 2025; pp. 18–23.
Reji, R.P.; Shine, D.; Nair, R. Interpretable Deep Learning for Bone Cancer Diagnosis: An Explainable AI Perspective. In Proceedings of the 2025 8th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 23–25 July 2025; IEEE: New York, NY, USA, 2025; pp. 897–902.
Meena Kumari, K.S.; Maheswari, E.; Suganyadevi, S.; Thanka Roobi, R.G.; Seethalakshmy, A.; Devi, M.R. An Explainable Multi-Modal Deep Learning Framework for Early Detection and Classification of Pancreatic Cancer Abstract Using PancreoFusion-Net. In Proceedings of the 2025 2nd International Conference on New Frontiers in Communication, Automation, Management and Security (ICCAMS), Bangalore, India, 11–12 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Rajalakshmi, S.; Nithya, E. Lung Cancer Detection Using EfficientNet-B0 Features: A Comparative Analysis of XGBoost, SVM, and Decision Tree. In Proceedings of the 2025 4th International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 17–19 December 2025; IEEE: New York, NY, USA, 2025; pp. 360–365.
Sathya, A.; Blessy, J.J. Genomic and Imaging Biomarker Analysis in Lung Cancer: A SHAP-Guided Approach. In Proceedings of the 2025 2nd International Conference on Computing and Data Science (ICCDS), Chennai, India, 25–26 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–7.
Serhani, M.A.; Tariq, A.; Qayyum, T.; Taleb, I.; Din, I.; Trabelsi, Z. Meta-XPFL: An Explainable and Personalized Federated Meta-Learning Framework for Privacy-Aware IoMT. IEEE Internet Things J. 2025, 12, 13790–13805.
Shrestha, T.E.; Islam, J.; Rukaia, A.R.; Rizon, F.R.; Dipto, S.K.; Ali, A. LungXResViT: A Robust Hybrid DL Framework for Accurate Lung Cancer Classification with XAI. In Proceedings of the 2025 IEEE 2nd International Conference on Computing, Applications and Systems (COMPAS), Kushtia, Bangladesh, 23–24 October 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Sritharan, N.; Gnanavel, N.; Inparaj, P.; Meedeniya, D.; Yogarajah, P. Explainable Artificial Intelligence Driven Segmentation for Cervical Cancer Screening. IEEE Access 2025, 13, 71306–71322.
Fontes, M.F.; Neto, A.H.; Almeida, J.D.; Cunha, A.T. A Controlled Variation Approach for Example-Based Explainable AI in Colorectal Polyp Classification. Appl. Sci. 2025, 15, 8467.
Haque, F.; Asif Hasan, M.; Siddique, A.I.; Roy, T.; Kanti Shaha, T.; Islam, Y.; Paul, A.; Chowdhury, M.E.H. An End-to-End Concatenated CNN Attention Model for the Classification of Lung Cancer With XAI Techniques. IEEE Access 2025, 13, 96317–96336.
Sebukpor, D.; Odezuligbo, I.; Nagey, M.; Chukwuka, M.; Akinsuyi, O.; Ndubuisi, B. Browser-Based Multi-Cancer Classification Framework Using Depthwise Separable Convolutions for Precision Diagnostics. Diagnostics 2025, 15, 3066.
Tabashoum, S.; Roza, A.A.; Masud, A. An Explainable AI with Federated EfficientNet-B3 and Multi-Head Attention for Privacy-Preserving Lung Cancer Classification. In Proceedings of the 2025 IEEE 7th International Conference on Sustainable Technologies For Industry 5.0 (STI), Dhaka, Bangladesh, 11–12 December 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
Hendry, A.; Felicia, E.; Natanael, Y.; Rumagit, R.Y. Implementation of Explainable AI on CNN- and ViT-Based Models for Classifying Breast Cancer. Procedia Comput. Sci. 2025, 269, 968–978.
Singh, S.K.; Patnaik, K.S. MammXAI: An XAI Integrated Adaptive Multi-Model Deep Learning Approach for Breast Cancer Detection Using Multi-Modality Images. Biomed. Signal Process. Control 2026, 113, 109173.
Singh, S.; Kumar, R.; Singh, N.P.; Maurya, M.K. Unsupervised Attention-Based Framework for Cancerous Region Segmentation with Explainable Post-Hoc Thresholding in Histopathological Images. Biomed. Signal Process. Control 2026, 113, 108815.
Yang, L.; Zhang, H.; Shen, H.; Huang, X.; Zhou, X.; Rong, G.; Shao, D. Quality Assessment in Systematic Literature Reviews: A Software Engineering Perspective. Inf. Softw. Technol. 2021, 130, 106397.
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71.

Figure 1. PRISMA flow diagram for this systematic literature review.

Figure 2. Distribution of XAI evaluation approaches.

Figure 3. Proposed future directions for XAI in cancer medical imaging.

Figure 4. Integration of XAI into Clinical Workflow for Cancer Imaging.

Figure 5. Proposed standardized evaluation framework for XAI in Cancer Medical Imaging.

Table 1. Comparison of existing reviews and the proposed study.

Study	Scope	XAI Techniques Coverage	Evaluation Analysis	Cross-Cancer/Multi-Modal Coverage	Critical Analysis	Proposed Evaluation Framework
Gulum et al. [14]	General cancer imaging (DL-focused)	Limited to DL-based XAI	Limited discussion	Focused scope	Mostly descriptive	-
Hauser et al. [15]	Skin cancer imaging	Grad-CAM, saliency maps	Identifies lack of evaluation	Single cancer type	Limited	-
Gurmessa and Jimma [16]	Breast cancer (mammography, ultrasound)	Grad-CAM, LIME, saliency	Minimal evaluation discussion	Single cancer type	Descriptive	-
Wyatt et al. [17]	Ultrasound imaging (oncology)	Grad-CAM, attention-based	Highlights lack of standardization	Single modality	Limited	-
Karthiga et al. [18]	Breast cancer (multi-modal imaging)	Grad-CAM, LIME, SHAP	Basic discussion	Limited generalization	Limited	-
This Study	Multi-cancer + multi-modal imaging	Comprehensive (CAM, LIME, SHAP, IG, LRP, etc.)	Systematic and multi-dimensional analysis	Broad coverage	Critical comparative analysis	Proposes standardized evaluation framework

Table 2. Search keywords and filters applied for each database.

Source	Filters	Search String
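ACM	Publication Type: All publication, Article Type: Research Article, Publication Date: 2020–2026.	(“explainable artificial intelligence” OR “explainable AI” OR XAI) AND cancer AND medical imaging
IEEE Xplore	Document Type: Journals, Conferences. Publication Years: 2020–2026.	(“explainable artificial intelligence” OR “explainable AI” OR XAI) AND cancer AND medical imaging
MDPI	Article Type: Research Article, Publication Year: 2020–2026.	(“explainable artificial intelligence” OR “explainable AI” OR XAI) AND cancer AND medical imaging
Scopus	Subject Area: Computer Science, Medicine. Document Type: Article, Conference Paper. Year: 2020–2026.	(“explainable artificial intelligence” OR “explainable AI” OR XAI) AND cancer AND medical imaging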

Table 3. Summary of reviewed papers.

Year	Source	Refs.
2022	MDPI	[23]
2023	IEEE Xplore	[24,25,26,27]
2024	IEEE Xplore	[28,29,30,31,32,33,34,35,36,37]
2024	MDPI	[38]
2024	Scopus	[39]
2025	IEEE Xplore	[40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61]
2025	MDPI	[62,63,64,65]
2025	Scopus	[66]
2026	Scopus	[67,68]

Table 4. QA evaluation criteria.

Criterion	Quality Assessment Question
QA1	Are the research objectives or problem statements clearly defined?
QA2	Is the explainable AI (XAI) technique clearly described?
QA3	Is the model architecture used with XAI clearly specified?
QA4	Is the medical imaging dataset or modality clearly described?
QA5	Is the evaluation methodology for explainability clearly reported?
QA6	Does the study demonstrate how XAI improves interpretability or understanding of the model predictions?
QA7	Are limitations or challenges related to XAI methods discussed?
QA8	Does the study provide recommendations or future research directions related to XAI in medical imaging?

Table 5. QA score for each study.

Refs.	QA1	QA2	QA3	QA4	QA5	QA6	QA7	QA8	Total	Grade
[23]	1	1	1	1	0.5	1	1	1	7.5	High
[24]	1	1	1	1	0.5	0.5	0	1	6	High
[25]	1	1	1	1	1	1	1	1	8	High
[26]	1	1	1	1	1	1	0	0	6	High
[27]	1	1	1	1	1	1	1	1	8	High
[28]	1	1	1	1	1	1	1	1	8	High
[29]	1	0.5	0.5	1	0.5	0.5	0	0	4	Moderate
[30]	1	1	1	1	0.5	1	1	1	7.5	High
[31]	1	1	1	1	1	1	0	1	7	High
[32]	1	0.5	1	1	0.5	0.5	0	1	5.5	Moderate
[33]	1	0.5	1	1	0.5	0.5	1	1	6.5	High
[34]	1	1	1	1	0.5	1	0	1	6.5	High
[35]	1	1	1	1	1	1	1	0	7	High
[36]	1	1	1	1	0.5	1	1	1	7.5	High
[37]	1	1	1	1	1	1	1	0	7	High
[38]	1	1	1	1	1	1	1	1	8	High
[39]	1	1	1	1	1	1	1	0	7	High
[40]	1	0.5	1	1	0.5	1	0	1	6	High
[41]	1	1	1	1	0.5	1	1	0	6.5	High
[42]	1	1	1	1	0.5	1	0	1	6.5	High
[43]	1	1	1	1	0.5	1	0	1	6.5	High
[44]	1	1	1	1	1	1	1	0	7	High
[45]	1	1	1	1	0.5	1	0	0	5.5	Moderate
[46]	1	1	1	1	0.5	1	1	1	7.5	High
[47]	1	1	1	1	0.5	1	1	0	6.5	High
[48]	1	0.5	1	1	0.5	0.5	0	0	4.5	Moderate
[49]	1	0.5	1	1	0.5	0.5	0	0	4.5	Moderate
[50]	1	1	1	1	1	1	1	1	8	High
[51]	1	1	1	1	1	1	0	1	7	High
[52]	1	1	1	1	0.5	1	1	1	7.5	High
[53]	1	0.5	1	1	0.5	1	0	1	6	High
[54]	1	0.5	0.5	1	0.5	0.5	0	1	5	Moderate
[55]	1	1	1	1	0.5	1	1	1	7.5	High
[56]	1	1	1	1	0.5	1	1	1	7.5	High
[57]	1	1	1	1	1	1	1	1	8	High
[58]	1	1	1	1	0.5	1	0	0	5.5	Moderate
[59]	1	1	1	1	0.5	1	0	0	5.5	Moderate
[60]	1	1	1	1	1	1	0	0	6	High
[61]	1	1	1	1	1	1	1	1	8	High
[62]	1	1	1	1	1	1	1	1	8	High
[63]	1	1	1	1	0.5	1	1	1	7.5	High
[64]	1	1	1	1	1	1	1	1	8	High
[65]	1	1	1	1	0.5	1	0	0	5.5	Moderate
[66]	1	1	1	1	1	1	1	1	8	High
[67]	1	1	1	1	1	1	1	1	8	High
[68]	1	0.5	1	1	0.5	1	0	0	5	Moderate
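
The totals and grades in Table 5 can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' procedure: the function names are hypothetical, the 0/0.5/1 scale is read off the values in the table, and the cut-off of 6 for "High" is an assumption inferred from the rows shown (a total of 6 is graded High, 5.5 Moderate). No grade below Moderate appears in the table, so none is modeled.

```python
from typing import Sequence

# Per-criterion scores that appear in Table 5 (assumed scale: 0, 0.5 or 1).
ALLOWED_SCORES = (0, 0.5, 1)

def qa_total(scores: Sequence[float]) -> float:
    """Sum the eight criterion scores QA1..QA8 for one study."""
    if len(scores) != 8:
        raise ValueError("expected exactly eight scores (QA1..QA8)")
    if any(s not in ALLOWED_SCORES for s in scores):
        raise ValueError("each score must be 0, 0.5 or 1")
    return float(sum(scores))

def qa_grade(total: float, high_threshold: float = 6.0) -> str:
    """Map a total to a grade; the threshold is inferred from Table 5, not stated there."""
    return "High" if total >= high_threshold else "Moderate"

# Rows [24] and [45] as listed in Table 5.
row_24 = [1, 1, 1, 1, 0.5, 0.5, 0, 1]
row_45 = [1, 1, 1, 1, 0.5, 1, 0, 0]
print(qa_total(row_24), qa_grade(qa_total(row_24)))
print(qa_total(row_45), qa_grade(qa_total(row_45)))
```

Keeping the threshold as a parameter makes it easy to see how many studies would change grade under a stricter or looser cut-off.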

Table 6. Frequency Distribution of XAI Techniques Identified.

XAI Category	XAI Method	Occurrence	Refs.
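Gradient-based Class Activation Mapping (CAM)	Grad-CAM	34	[23,24,25,26,27,28,29,30,31,32,33,35,37,38,39,40,41,43,44,46,48,49,50,53,54,55,56,57,58,59,63,64,66,67]
Gradient-based Class Activation Mapping (CAM)	Grad-CAM++	12	[25,31,33,35,38,39,41,43,60,61,65,67]
Gradient-based Class Activation Mapping (CAM)	ScoreCAM	6	[25,35,41,50,67,68]
Gradient-based Class Activation Mapping (CAM)	LayerCAM	2	[25,35]
Gradient-based Class Activation Mapping (CAM)	EigenCAM	2	[35,39]
Gradient-based Class Activation Mapping (CAM)	EnsembleCAM	1	[35]
Gradient-based Class Activation Mapping (CAM)	AblationCAM	1	[41]
Gradient-based Class Activation Mapping (CAM)	XGradCAM	1	[41]
Gradient-based Class Activation Mapping (CAM)	Counter-CAM	1	[26]
Perturbation-based (Model-agnostic)	LIME	15	[23,24,25,27,28,34,36,37,45,47,52,60,62,66,67]
Additive Feature Attribution	SHAP	10	[28,36,42,47,56,57,58,63,65,66]
Gradient Attribution	Integrated Gradients	6	[27,34,37,46,47,67]
Gradient Attribution	Guided Backpropagation	2	[37,46]
Gradient Attribution	SmoothGrad	1	[67]
Gradient Attribution	DeepLIFT/PDA	1	[46,67]
Backpropagation-based Relevance	LRP	2	[31,61]
Attention-based	Attention Rollout	1	[66]
Expert-guided/Clinical XAI	Saliency Maps	2	[46,59]
Expert-guided/Clinical XAI	ClinicXAI	1	[51]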

Note: Studies may employ multiple methods.
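
Because a study may employ several methods, the occurrence counts in Table 6 add up to more than the number of reviewed studies. The sketch below illustrates the tallying on a four-study subset whose method lists are taken from the reference columns of Table 6; the function name and the dictionary layout are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, List

# Methods per study, as listed in Table 6 for these four references.
study_methods: Dict[str, List[str]] = {
    "[23]": ["Grad-CAM", "LIME"],
    "[26]": ["Grad-CAM", "Counter-CAM"],
    "[42]": ["SHAP"],
    "[51]": ["ClinicXAI"],
}

def tally_occurrences(methods_by_study: Dict[str, List[str]]) -> Counter:
    """Count, for each XAI method, the number of studies that use it."""
    counts: Counter = Counter()
    for methods in methods_by_study.values():
        # set() ensures a method is counted once per study even if listed twice.
        counts.update(set(methods))
    return counts

counts = tally_occurrences(study_methods)
print(counts["Grad-CAM"])                         # studies in the subset using Grad-CAM
print(sum(counts.values()), len(study_methods))   # total occurrences vs. number of studies
```

The same tally, run over all 46 studies, gives a quick consistency check between each occurrence count and the length of its reference list.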

Table 7. XAI Techniques Usage by Cancer Type.

Cancer Type	Occurrence	Primary XAI	Additional XAI Methods	Refs.
Lung	11	Grad-CAM	SHAP, LIME, Integrated Gradients, Guided BP	[24,32,37,46,51,52,57,58,60,63,65]
Breast	9	Grad-CAM	LIME, SHAP, Grad-CAM++, Integrated Gradients	[23,26,28,33,40,48,53,66,67]
Skin/Melanoma	7	Grad-CAM/Grad-CAM++	LIME, SHAP, ScoreCAM	[34,36,38,43,49,50,54]
Cervical	6	Grad-CAM family	LIME, SHAP, LRP, EnsembleCAM	[25,31,35,42,45,61]
Multi-cancer/General imaging	8	Grad-CAM	LIME, SHAP, ScoreCAM, Saliency	[27,29,30,39,41,59,64,68]
Colorectal/Polyp	2	Grad-CAM	LIME	[44,62]
Ovarian	1	LIME + Integrated Gradients + SHAP	-	[47]
Bone	1	Grad-CAM	-	[55]
Pancreatic	1	Grad-CAM + SHAP	-	[56]

Table 8. Interpretability enhancement identified in XAI in cancer medical imaging.

Interpretability	Occurrence	Refs.
Spatial localization via visual heatmaps	35	[25,26,27,29,30,32,35,37,38,39,44,46,48,49,50,52,53,54,55,57,58,59,61,62,63,64,65,66,67,68]
Clinical decision support and workflow integration	32	[23,27,28,29,30,32,33,36,37,40,41,42,43,44,45,46,48,49,50,51,53,54,55,56,57,59,61,63,64,65,67,68]
Clinical trust and transparency	26	[23,25,28,29,30,33,36,37,38,40,41,42,44,48,49,52,53,56,57,58,59,61,63,64,66,68]
Feature attribution and contribution quantification	12	[25,27,28,34,36,42,47,56,57,63,65,66]
Bias, error, and spurious feature detection	6	[25,30,31,36,40,46]
Pixel-level segmentation and sub-cellular localization	6	[28,31,37,59,61,68]
Multi-modal interpretability (imaging + genomic/clinical)	2	[56,58]
Counterfactual reasoning	1	[26]

Table 9. Distribution of deep learning model families.

Model Family	Specific Variants Identified	No. of Studies	Refs.
ResNet	ResNet-18, ResNet-50, ResNet-101, ResNet-152, InceptionResNetV2	15	[23,24,25,27,28,30,34,36,38,39,53,55,56,60,66]
VGG	VGG-16, VGG-19	12	[23,25,27,28,30,31,36,38,40,53,61]
EfficientNet	EfficientNetB0, B1, B3, B7, V2S, V2B0	10	[26,48,50,53,57,62,65,66,67,68]
Inception	InceptionV3, InceptionResNetV2	8	[23,24,28,30,36,47,51,53]
Vision Transformer (ViT)	ViT-B/16, ViT-DINO, DeiT-B/16, Swin-S, MRAViT, SM-ViT, MobileViT, LeViT	8	[38,41,45,52,53,60,62,66]
DenseNet	DenseNet-121, DenseNet-201	7	[25,26,32,33,37,46,66]
MobileNet	MobileNetV2, MobileNetV3Large, ShuffleNetV2	5	[25,36,43,50,53]
Xception/XceptionNet	Xception, XceptionNet (1024 units)	4	[35,38,61,64]
Novel/Custom Architecture	AECNet, MRAViT-XAI, PancreoFusion-Net, ETCapsNet, Concatenated CNN-MLP-MHA	5	[49,52,56,63,67]
U-Net/Segmentation backbone	Attention-Guided U-Net, U-Net (ResNet backbone)	2	[44,56]

Table 10. Challenges reported in XAI for cancer medical imaging.

Challenge Category	Sub-Theme	Occurrence	Refs.
Explanation Instability and Inconsistency	LIME stochastic perturbation instability	4	[28,50,62,66]
Explanation Instability and Inconsistency	Ensemble activation map sensitivity to outliers	1	[35]
Explanation Instability and Inconsistency	Intra-class region fluctuation in CNN-attention models	1	[63]
Explanation Instability and Inconsistency	Gradient noise instability in GradCAM	1	[50]
Spatial Resolution and Localization Precision	Coarse, low-resolution heatmaps (CAM family)	3	[23,35,66]
Spatial Resolution and Localization Precision	Failure on low-contrast/small lesions	2	[44,50]
Spatial Resolution and Localization Precision	Limited class-agnostic saliency support	1	[50]
Absence of Standardized Quantitative Evaluation	No unified evaluation framework	3	[28,39,55]
Absence of Standardized Quantitative Evaluation	Qualitative-only reporting (no quantitative metrics)	2	[33,46]
Absence of Standardized Quantitative Evaluation	Annotation scarcity preventing ground truth comparison	2	[46,64]
Computational Cost and Resource Demands	High per-image computation time (LIME)	2	[27,37]
Computational Cost and Resource Demands	GPU memory and acceleration requirements (SHAP, LIME)	2	[66,67]
Class- and Morphology-Specific Explanation Failure	CAM family failure on rare/morphologically atypical classes	2	[25,50]
Class- and Morphology-Specific Explanation Failure	Failure on structurally subtle boundaries (mucus, debris)	1	[44]
Architecture–XAI Incompatibility	ViT interpretability opacity; CAM incompatibility	3	[41,62,66]
Architecture–XAI Incompatibility	Ensemble weighting subjectivity (EnsembleCAM)	1	[35]
Architecture–XAI Incompatibility	Post hoc structural nuance gap (complex networks)	1	[30]
Dataset Limitation and Generalization	Cross-domain validation absence	3	[36,38,46]
Dataset Limitation and Generalization	Class imbalance and underrepresented lesion types	2	[36,50]
Locality of Explanations	LIME: prediction-local explanations only	2	[23,52]
Transfer Learning XAI Difficulty	Reduced XAI fidelity in transfer-learned models	1	[47]
Multi-Modal Data Adaptation	XAI methods require adaptation for heterogeneous fusion	1	[56]
Model Confidence Miscalibration	Overconfident XAI outputs from classical classifiers	1	[57]
XAI-to-Segmentation Translation Gap	Scarcity of methods for converting XAI maps to segmentation	1	[61]

Table 11. Comparative analysis table of various XAI methods.

XAI Method	Performance	Robustness	Clinical Usability
Grad-CAM	Effective at highlighting broad discriminative regions; widely used for CNNs in medical imaging.	Moderate robustness; provides coarse heatmaps; sometimes less precise in boundary localization.	High; intuitive visual heatmaps aiding clinicians in understanding model focus areas.
Grad-CAM++	Improved visualization accuracy over Grad-CAM by considering higher-order gradients; sharper maps.	Higher robustness in highlighting critical regions; steeper performance drop on pixel flipping (better focus).	High; clearer regions of interest with better localization enhance trust.
ScoreCAM	Produces reliable heatmaps even where Grad-CAM/Grad-CAM++ struggle (e.g., difficult classes).	Robust against noisy or less distinctive features.	High; offers complementary insights, especially in complex cases.
LayerCAM	Similar to Grad-CAM but uses layer-wise relevance to improve resolution of heatmaps.	Moderate robustness; sometimes less accurate for certain classes.	Moderate; can be valuable but inconsistencies in some classes noted.
EigenCAM	Localizes using the eigenvectors of feature maps; can capture holistic information.	Variable; depends on the nature of eigenvectors; less widely benchmarked.	Moderate; less intuitive for clinical users due to abstract nature.
EnsembleCAM	Combines multiple CAM variants aiming to balance weaknesses; potentially better coverage.	Increased robustness by aggregating multiple methods.	Moderate to High; complexity may limit straightforward clinical interpretation but improves reliability.
AblationCAM	Uses ablation to identify important regions; computationally intensive compared to Grad-CAM.	Robust due to direct observation of impact of regions on prediction.	Moderate; can highlight important areas well but less used clinically due to higher complexity.
XGradCAM	Enhanced weighting scheme improving localization performance.	Improved robustness compared to standard Grad-CAM.	High; similar clinical usability as Grad-CAM but with better accuracy.
Counter-CAM	Utilizes counterfactual analysis to see prediction changes on image modification.	High robustness in identifying critical regions affecting prediction.	High; provides causal explanations, beneficial for clinicians to assess model behavior.
LIME	Model-agnostic, local linear approximations produce explainable masks; sensitive to input perturbations.	Moderate robustness; can produce inconsistent explanations with similar inputs.	Moderate; easy to use but variability may reduce reliability in clinical settings.
SHAP	Provides consistent, theoretically grounded feature attributions with global and local explanations.	High computational cost reduces practical robustness for large medical images.	Moderate to High; highly interpretable but computational demands limit real-time clinical use.
Integrated Gradients	Provides axiomatic feature attributions respecting sensitivity and invariance properties.	Good robustness; mathematically sound; can be applied across data types.	Moderate; requires baseline choice, which can be challenging in medical imaging.
Guided Backpropagation	Produces fine-grained visual attribution maps.	Lower robustness; sensitive to noise and model specifics.	Moderate; detailed but may be less clinically intuitive due to noisy outputs.
SmoothGrad	Reduces noise in gradient-based maps by averaging over noisy inputs.	Improved robustness over Guided Backpropagation.	Moderate; clearer visualizations aid clinical interpretability.
DeepLIFT/PDA	Computes contribution scores by comparing activations to reference; accounts for saturated neurons.	High robustness; explains nonlinearities effectively.	Moderate to High; provides more consistent attributions than plain gradients.
LRP (Layer-wise Relevance Propagation)	Excellent at focusing on highly influential pixels; tends to produce sparse, localized explanations.	High robustness; low entropy in highlighted regions indicates focused attribution.	High; focused explanations enhance clinician trust and interpretability.
Attention Rollout	Offers fine, patch-level explanations due to self-attention propagation.	High robustness with fine spatial specificity; may highlight background patches as well.	Moderate to High; gives precise lesion borders aiding clinical decision making.
Saliency Maps	Basic gradients highlighting sensitive areas; often noisy and hard to interpret.	Low robustness due to noise and sensitivity to input perturbations.	Low to Moderate; limited direct clinical usability due to lack of clear focus.
ClinicXAI	Tailored for clinical settings; integrates various XAI methods with domain knowledge.	High robustness by combining multiple strategies; adapts explanations for clinical context.	Very High; designed for seamless clinical integration and actionable insights.
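
Two rows of the table above can be made concrete with a small sketch: the baseline dependence noted for Integrated Gradients and the noise averaging described for SmoothGrad. The example below is illustrative only and does not come from any of the reviewed studies. It assumes a hypothetical four-input logistic unit (`WEIGHTS`, `BIAS`) in place of an imaging model, and the "black" and "gray" baselines are arbitrary choices made for illustration.

```python
import math
import random

# Hypothetical toy "classifier": a logistic unit over four pixel intensities.
# The weights and bias are made up for illustration, not taken from the review.
WEIGHTS = [2.0, -1.0, 0.5, 3.0]
BIAS = -1.0


def predict(x):
    """Return the toy model's output probability for input vector x."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x):
    """Analytic gradient of predict() with respect to each input."""
    p = predict(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum
    along the straight line from baseline to x."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def smoothgrad(x, sigma=0.1, samples=100, seed=0):
    """Average the plain gradient over Gaussian-perturbed copies of x."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        for i, g in enumerate(gradient(noisy)):
            total[i] += g / samples
    return total


x = [0.8, 0.2, 0.5, 0.9]
black = [0.0] * 4   # all-zero baseline (assumed)
gray = [0.5] * 4    # mid-intensity baseline (assumed)

ig_black = integrated_gradients(x, black)
ig_gray = integrated_gradients(x, gray)
smooth = smoothgrad(x)

print("IG, black baseline:", [round(a, 4) for a in ig_black])
print("IG, gray baseline: ", [round(a, 4) for a in ig_gray])
print("SmoothGrad:        ", [round(a, 4) for a in smooth])
print("Completeness gap:  ", sum(ig_black) - (predict(x) - predict(black)))
```

In this sketch the third input receives a nonzero attribution under the black baseline and exactly zero under the gray baseline, because it equals the gray reference value. This is one way to see why baseline choice matters. The completeness gap should be close to zero, reflecting the property that Integrated Gradients attributions sum to the difference between the prediction at the input and at the baseline.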


Share and Cite

MDPI and ACS Style

Ghauth, K.I.; Kustiawan, Y.A. Explainable Artificial Intelligence (XAI) for Cancer Classification in Medical Imaging: A Systematic Review. Mach. Learn. Knowl. Extr. 2026, 8, 134. https://doi.org/10.3390/make8050134

