Peer-Review Record

Artificial Intelligence Pipeline for Mammography-Based Breast Cancer Detection: An Integrated Systematic Review and Large-Scale Experimental Validation

Medicina 2025, 61(12), 2237; https://doi.org/10.3390/medicina61122237

by Daniel Añez^1,*, Giuseppe Conti^2,3, Juan José Uriarte^4, José-Javier Serrano-Olmedo^5,6,7, Ricardo Martínez-Murillo^8 and Oscar Casanova-Carvajal^2,5,*

Reviewer 1:

Fnu Neha

Reviewer 2:

Ahmed Al Marouf

Submission received: 18 November 2025 / Revised: 11 December 2025 / Accepted: 14 December 2025 / Published: 18 December 2025

(This article belongs to the Special Issue Application of Artificial Intelligence in Disease Diagnosis and Treatment)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Paper presents a combination of CNN-based and classical ML-based pipelines for breast cancer classification with added explainability via Grad-CAM and SHAP but parts of the manuscript tend to mix a systematic review, a modelling study, and a software/pipeline architecture description.

Author must clarify the contribution in one line.

Remove low-level details (feature names, specific ROC numbers).

Add missing XAI context by referring the survey “A Survey on Explainable Artificial Intelligence (XAI) Techniques for Visualizing Deep Learning Models in Medical Imaging.”
This will support the need for XAI and contextualize the use of Grad-CAM and SHAP.

Explain the impact of resizing mammograms (risk of losing microcalcifications).

Improve ML baselines with radiomics or clinical metadata.

Author Response

Response to Reviewer 1
We sincerely thank the reviewer for the thorough and constructive comments. Below we address each point in detail and indicate the corresponding changes in the revised manuscript.
1. Overall focus and main contribution
Reviewer comment: “Paper presents a combination of CNN-based and classical ML-based pipelines for breast cancer classification with added explainability via Grad-CAM and SHAP but parts of the manuscript tend to mix a systematic review, a modelling study, and a software/pipeline architecture description. Author must clarify the contribution in one line.”
Response: We agree that the original version did not state clearly enough how the systematic review, the modelling study, and the pipeline description were integrated.
In the revised manuscript, we have reorganised and expanded the end of the Introduction to clarify the study design and the main contribution. Specifically:
● We now describe explicitly the two-part design (systematic review plus experimental study) and the three components of the work: systematic review, experimental comparison of CNN, SVM and XGBoost on CBIS-DDSM, and an MLOps-oriented pipeline with XAI.
● We added a short list of items that summarizes: (i) the PRISMA-2020-based review, (ii) the experimental evaluation on CBIS-DDSM, and (iii) the implementation of a reproducible, MLOps-oriented workflow with Grad-CAM and SHAP.
The main contribution is now stated in one line: to link a PRISMA-guided systematic review with an experimental evaluation and interpret the behaviour of representative CNN, SVM and XGBoost models within a reproducible pipeline.
2. Level of detail in the Results
Reviewer comment: “Remove low-level details (feature names, specific ROC numbers).”
Response: We appreciate the reviewer’s comment regarding the amount of low-level detail in the Results section. We agree that the initial version contained too many specific numerical values and long lists of feature names in the main text, which could affect readability. In response, we have streamlined Sections 3.3–3.5:
● Detailed AUC-ROC and Recall values for each model are now reported primarily in the tables, while the text emphasizes the main comparative findings (for example, that ResNet50 outperforms the other CNNs and that XGBoost performs better than SVM among classical models).
● The description of the SHAP results has been rewritten to avoid long enumerations of individual features; instead, we summarize the main feature groups and refer the reader to the corresponding figures for exact importance values.
3. XAI context and reference to recent survey
Reviewer comment: “Add missing XAI context by referring the survey ‘A Survey on Explainable Artificial Intelligence (XAI) Techniques for Visualizing Deep Learning Models in Medical Imaging’. This will support the need for XAI and contextualize the use of Grad-CAM and SHAP.”
Response: We thank the reviewer for this suggestion.
● In the Introduction, we added a dedicated paragraph that summarizes the main conclusions of the requested survey by Bhati et al.
● In the same paragraph, we explain that our work uses Grad-CAM for CNNs and SHAP for tabular models as concrete examples of the gradient-based and attribution-based methods discussed in that survey, and we motivate their inclusion from a clinical-interpretability standpoint.
● In Section 3.7 we explicitly connect our Grad-CAM and SHAP analyses to this broader XAI context, stressing their role in making model behaviour more transparent to clinicians.
4. Impact of mammogram resizing and microcalcification detection
Reviewer comment: “Explain the impact of resizing mammograms (risk of losing microcalcifications). While the discussion acknowledges that multi-scale or patch-based approaches could mitigate this issue, the manuscript should more explicitly discuss the trade-off between computational feasibility (224x224 input) and diagnostic fidelity for micro-lesions, and clarify whether the current metrics are sufficient to cover reliable microcalcification detection.”
Response: We fully agree that the impact of downsampling on microcalcification visibility deserves explicit discussion.
● In Section 2.8.1 (Preprocessing), we now explain that resizing all images to 224 × 224 pixels was a pragmatic compromise to match ImageNet-based architectures and keep training computationally feasible, and we state that this downsampling may partially smooth very small structures, including isolated microcalcification clusters.
● In Section 4.1 (Limitations), we expand the discussion to clarify that:
○ We did not perform an explicit ablation study across different input resolutions,
○ Therefore the quantitative impact of this decision on microcalcification detection cannot be fully characterised, and
○ The reported global metrics (for example AUC-ROC 0.95 for ResNet50) should be interpreted as overall performance on CBIS-DDSM rather than as a guarantee of optimised sensitivity for tiny micro-lesions.
● We also emphasise that future work should explicitly evaluate per-lesion performance, especially for microcalcifications, and compare single-scale versus multi-scale or patch-based strategies under controlled conditions.
This revision makes the trade-off between computational feasibility and microcalcification fidelity more transparent.
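The downsampling concern described in this response can be illustrated with a minimal sketch. It assumes area-average (block-mean) downsampling, which is only one possible interpolation scheme; the record does not state which method the manuscript used. The patch size, the `block_average_downsample` helper, and the single bright pixel standing in for an isolated microcalcification are all hypothetical.

```python
def block_average_downsample(image, factor):
    """Downsample a 2D list of floats by averaging non-overlapping factor x factor blocks."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be divisible by factor")
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Hypothetical 16 x 16 patch: dark background (0.0) with one bright pixel (1.0)
# standing in for an isolated microcalcification.
patch = [[0.0] * 16 for _ in range(16)]
patch[5][9] = 1.0

small = block_average_downsample(patch, factor=4)  # 16 x 16 -> 4 x 4

peak_before = max(max(row) for row in patch)
peak_after = max(max(row) for row in small)
print("peak intensity before:", peak_before)
print("peak intensity after: ", peak_after)
```

Under this assumption, the peak contrast of a one-pixel structure shrinks by the square of the downsampling factor, which is the kind of smoothing the authors flag as a limitation.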
5. Strengthening the classical ML baselines and radiomics context
Reviewer comment: “Improve ML baselines with radiomics or clinical metadata.”
Response: We thank the reviewer for this insightful suggestion.
Regarding clinical metadata, we wish to clarify that our classical machine learning baselines (XGBoost and SVM) effectively utilized the available clinical attributes provided in the CBIS-DDSM dataset. As illustrated in the SHAP summary plots (Figures 12 and 13) and discussed in Section 4.4, the models were trained on a hybrid feature set including global image statistics (mean intensity, standard deviation, width, height) and clinical metadata (BI-RADS assessment and subtlety). Our results confirm that these metadata features were highly influential, with BI-RADS assessment and breast density emerging as top predictors alongside image intensity statistics.
Regarding radiomics, we acknowledge that we did not employ high-dimensional texture descriptors (e.g., GLCM or GLRLM features). In Section 4.1 (Limitations), we have explicitly addressed this design choice, stating that our models relied on a "relatively compact set of clinical and morphological features," which limits direct comparability with studies that use extensive radiomic feature sets. We now highlight that extending the current pipeline to incorporate high-dimensional radiomics is a natural direction for future work to potentially narrow the performance gap with deep learning models.
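A hybrid feature vector of the kind described in this response could be assembled roughly as follows. This is a sketch under stated assumptions, not the authors' code: the function name `hybrid_features`, the dictionary keys, the use of the population standard deviation, and the example values are all hypothetical.

```python
from statistics import fmean, pstdev


def hybrid_features(image, birads_assessment, subtlety):
    """Combine global image statistics with clinical metadata in one flat record.

    `image` is a 2D list of pixel intensities; the metadata arguments are
    assumed to be integer-coded, as in a typical tabular export.
    """
    pixels = [p for row in image for p in row]
    return {
        "mean_intensity": fmean(pixels),
        "std_intensity": pstdev(pixels),  # population standard deviation (assumption)
        "width": len(image[0]),
        "height": len(image),
        "birads_assessment": birads_assessment,
        "subtlety": subtlety,
    }


# Hypothetical 2 x 3 image and metadata values, for illustration only.
example = hybrid_features([[10, 20, 30], [40, 50, 60]], birads_assessment=4, subtlety=3)
print(example)
```

A record like this can be passed to a tabular classifier; high-dimensional texture descriptors such as GLCM features would add further keys to the same structure.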
6. External validation and statistical comparisons
Reviewer-related concern (raised in the general critique and editorial summary): Lack of external validation and absence of formal statistical tests for performance differences.
Response: We agree that these aspects are important for assessing robustness.
● In Section 2.7.3 we state that no external dataset was used and that our findings are limited to internal validation.
● In Section 4.1, we add that:
○ No statistical tests (for example p values or confidence intervals for AUC differences) were computed,
○ As a consequence, performance comparisons between models should be interpreted descriptively, and
○ Rigorous external, multi-institutional validation on independent datasets will be essential before any clinical deployment is considered.
● We reiterate this need for cross-dataset and multi-centre validation in the concluding remarks of Section 5, where we position our work as a pipeline and proof-of-concept rather than a ready-to-use clinical tool.
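For context, one common way to attach a confidence interval to an AUC difference between two models scored on the same test cases is a paired percentile bootstrap. The sketch below is illustrative only: the authors state that no such test was computed, and the function names, the number of resamples, and the toy data are assumptions. Other approaches (for example the DeLong test) are equally valid.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_difference(labels, scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC(model A) - AUC(model B) on paired cases."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacks one class, so AUC is undefined; draw again
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auc(y, a) - auc(y, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, for illustration only (not results from the manuscript).
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]
model_b = [0.2, 0.6, 0.30, 0.7, 0.5, 0.6, 0.4, 0.5]
print(bootstrap_auc_difference(y_true, model_a, model_b))
```

If the resulting interval excludes zero, the difference is unlikely to be explained by sampling variation in the test set alone; with small test sets the interval is typically wide.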
7. Risk of bias assessment and PROBAST
Reviewer-related concern: Risk of bias assessment is qualitative only; suggest using PROBAST.
Response: We thank the reviewer for pointing to this important aspect.
● In Section 2.6, we maintain the current qualitative dataset- and study-level bias analysis, but in Section 4.1 we now explicitly acknowledge that this approach is qualitative and that future work should adopt structured instruments such as PROBAST for a more rigorous risk-of-bias and applicability assessment.
● We added the PROBAST reference as [56] Wolff et al., Annals of Internal Medicine 2019, and we cite it when discussing this limitation.
Overall, we believe these revisions address the reviewer’s concerns and help clarify the scope, contributions, and limitations of the work, while aligning the manuscript more closely with current expectations for methodological transparency, XAI context, and clinical robustness.

Reviewer 2 Report

Comments and Suggestions for Authors

Authors have presented a systematic literature review on the topic of medical image processing with big data, focusing on breast cancer only. The paper needs major and minor revisions, as follows.

Minor: Update the type of paper from "Article" to "Systematic Review".
Minor: The title of the paper should not be in all uppercase. Make it sentence case keeping the first letter capitalized.
Major: The abstract does not reflect the paper. It is more like a methodological abstract. The idea of review is not mentioned. Update the abstract maintaining the rules of MDPI.
Major: Some of the terms must have references when they are mentioned for the first time in the introduction section. For example, MRI, CNN, SVM, XGBoost, Grad-CAM, SHAP etc.
Minor: Line 131 - When URL is given, put the last accessed Date, Month, Year.
Major: Are the same keywords used (line 148-151) for Scopus and PubMed? Mention clearly.
Major: Put the subsection titles as per the convention. Just keeping Bold is not enough. You may use subsection like 2.7.1, 2.7.2 etc.
Major: Are the figure 5, 6, 7 drawn and designed by the authors? If not, put the right reference and take permission to reuse.
Major: In the results section, add the confusion matrix of the methods.
Major: For Figure 8, 9 and 10, put some additional description of the figures. Like the meaning of the colours, what the colours are actually representing.

Author Response

Response to Reviewer 2
We are grateful to Reviewer 2 for the careful reading of our manuscript and the constructive suggestions, which have helped us improve clarity and alignment with MDPI/Medicina standards. Below we address each point in turn, indicating the corresponding changes in the revised manuscript.
Comment 1 (Minor). Update the type of paper from "Article" to "Systematic Review".
Response 1. We thank the reviewer for this clarification. In the revised manuscript, we have updated the article type in the front matter to reflect that the paper is a “Systematic Review and Experimental Study”, consistent with the study design described in the Abstract, Introduction, and Methods.
Comment 2 (Minor). The title of the paper should not be in all uppercase. Make it sentence case keeping the first letter capitalized.
Response 2. We agree. The title has been reformatted to sentence/title case:
Artificial Intelligence Pipeline for Mammography-Based Breast Cancer Detection: Systematic Review and Large-Scale Experimental Validation.
Comment 3 (Major). The abstract does not reflect the paper. It is more like a methodological abstract. The idea of review is not mentioned. Update the abstract maintaining the rules of MDPI.
Response 3. We appreciate this important observation. The Abstract has been completely rewritten to clearly reflect the two-part nature of the work: (i) a PRISMA-2020–guided systematic review and (ii) an original experimental study. The revised Abstract now:
● Explicitly states that the study “combines a PRISMA 2020-compliant systematic review with an original experimental validation”.
● Summarizes the search strategy and number of included studies.
● Briefly reports the main quantitative results of the experimental evaluation (ResNet50, XGBoost, SVM).
● Emphasizes the role of Grad-CAM and SHAP within a reproducible MLOps-oriented pipeline.
This brings the Abstract into line with MDPI’s structure and with the actual scope of the manuscript.
Comment 4 (Major). Some of the terms must have references when they are mentioned for the first time in the introduction section. For example, MRI, CNN, SVM, XGBoost, Grad-CAM, SHAP etc.
Response 4. We have revised the Introduction so that each of these key terms is now accompanied by an appropriate reference at first mention:
● MRI is referenced in the context of breast imaging modalities.
● CNNs, SVM, and XGBoost are each introduced with dedicated citations summarizing their role in medical/healthcare applications.
● Grad-CAM and SHAP are referenced when explainable AI methods are first discussed, and again in the Methods section where we detail their use.
This ensures that all core methodological concepts are properly grounded in the literature.
Comment 5 (Minor). Line 131 - When URL is given, put the last accessed Date, Month, Year.
Response 5. We have added explicit “accessed on” dates for all URLs, including the CBIS-DDSM dataset, the ECIS cancer statistics resource, and the project GitHub repository, following MDPI style (e.g., “accessed 4 December 2025” / “accessed 8 December 2025”).
Comment 6 (Major). Are the same keywords used (line 148–151) for Scopus and PubMed? Mention clearly.
Response 6. We thank the reviewer for pointing out this potential ambiguity. In Section 2.2 “Information Sources and Search Strategy” we now explicitly clarify that:
● PubMed queries used a combined MeSH and free-text formulation (with Boolean operators) tailored to that database.
● Scopus/ScienceDirect used an equivalent set of conceptual keywords (breast cancer, imaging modality, CNN, SVM, XGBoost, etc.), but expressed using the syntax appropriate to that platform.
The revised text explains that, while the exact query strings differ, the underlying keyword set and concepts were aligned across both databases to ensure comparability.
Comment 7 (Major). Put the subsection titles as per the convention. Just keeping Bold is not enough. You may use subsection like 2.7.1, 2.7.2 etc.
Response 7. We have restructured the Methods and Results sections so that all subsections follow the numbered MDPI style (e.g., 2.7.1 Convolutional neural networks, 2.7.2 Machine-learning models, 2.7.3 Explainability and interpretation, etc.). This change improves navigability and brings the manuscript into line with the journal’s formatting conventions.
Comment 8 (Major). Are the figure 5, 6, 7 drawn and designed by the authors? If not, put the right reference and take permission to reuse.
Response 8. We confirm that Figures 5, 6, and 10 (software architecture, CNN training loop, and handcrafted-feature workflow) are original diagrams created by the authors. The captions have been updated to explicitly state “Original figure created by the authors.”
Figures 7–9 (confusion matrices) are also direct outputs of our own trained models generated with the code described in the Methods, and thus do not reproduce any external material. This has now been clarified in the text and captions where appropriate.
Comment 9 (Major). In the results section, add the confusion matrix of the methods.
Response 9. Following this valuable suggestion, we have added confusion matrices for each of the three CNN models (ResNet50, EfficientNetB0, MobileNetV3-Small) in the Results section (Figures 7–9). In Section 3.4 we also include a concise explanation of how to interpret a confusion matrix (diagonal entries as correct classifications, off-diagonal entries as misclassifications, and their relation to TP, FP, TN, FN), so that the figures are self-explanatory for readers.
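The relation between confusion-matrix cells and TP, FP, TN, FN mentioned in this response can be sketched for the binary case as follows. The helper names, the choice of 1 as the malignant (positive) label, the row/column layout, and the toy labels are assumptions made for illustration; they are not taken from the manuscript.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


def as_matrix(counts):
    """Rows are true classes, columns are predicted classes (negative first)."""
    return [[counts["TN"], counts["FP"]],
            [counts["FN"], counts["TP"]]]


def recall(counts):
    """Sensitivity: fraction of actual positives that were detected."""
    return counts["TP"] / (counts["TP"] + counts["FN"])


# Toy labels, for illustration only: 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(as_matrix(counts), recall(counts))
```

In this layout the diagonal holds the correct classifications (TN and TP) and the off-diagonal cells hold the errors (FP and FN), matching the interpretation described in the response.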
Comment 10 (Major). For Figure 8, 9 and 10, put some additional description of the figures. Like the meaning of the colours, what the colours are actually representing.
Response 10. We have expanded both the figure captions and the main text to clarify the meaning of colours and visual encodings:
● For the confusion matrices (now Figures 7–9), the text in Section 3.4 explains that colour intensity reflects the count (or proportion) of cases in each cell, with darker shades indicating higher frequencies.
● For the workflow diagram of handcrafted feature extraction and classification (Figure 10), the caption specifies the role of each block in the pipeline.
● In Sections 3.7.1 and 3.7.2, we further describe the colour scales used in Grad-CAM and SHAP plots (e.g., red/yellow as stronger positive contribution to malignant predictions; blue as lower contribution), making these figures more interpretable.
We hope these clarifications address the reviewer’s concern and make all visual elements more accessible to the readership.
Once again, we thank Reviewer 2 for these helpful comments. We believe the revisions have substantially improved the clarity, structure, and transparency of the manuscript, and we hope that the updated version meets the reviewer’s expectations.
