Open Access Article

Peer-Review Record

Explainable Multi-Modal Medical Image Analysis Through Dual-Stream Multi-Feature Fusion and Class-Specific Selection

AI 2026, 7(1), 30; https://doi.org/10.3390/ai7010030

by Naeem Ullah ¹,*, Ivanoe De Falco ² and Giovanna Sannino ²

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 20 November 2025 / Revised: 24 December 2025 / Accepted: 13 January 2026 / Published: 16 January 2026

(This article belongs to the Special Issue Digital Health: AI-Driven Personalized Healthcare and Applications)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes a dual-stream framework combining handcrafted descriptors with deep statistical features, utilizing LIME for explainability. However, the proposed explainability technique remains at a basic "Level 0" XAI, primarily offering feature ranking and importance scores. Since the architecture explicitly relies on semantically meaningful handcrafted features (like texture, shape, and frequency), the authors miss a significant opportunity to provide deeper, natural language-style explanations. Instead of simply ranking feature contributions, the system could leverage these interpretable inputs to generate clinically relevant insights, which would vastly improve the paper's contribution to transparent medical diagnostics.

Regarding generalizability, the method’s heavy reliance on a pre-defined suite of handcrafted features (such as Gabor filters, HOG, and Hu moments) inherently limits its scope. While these descriptors are effective for the texture-rich data found in medical imaging (MRI, Ultrasound, Retinal), the framework appears tightly coupled to this domain. This dependence restricts the method's potential as a general-purpose image analysis tool and raises doubts about its adaptability to other image domains where these specific morphological or textural descriptors may not be discriminative.

Finally, the proposed methodology introduces significant computational overhead. The pipeline requires extracting multiple complex feature sets, performing iterative rank-based elimination, and executing a multi-method selection process for every specific class. This multi-stage approach, combined with ensemble voting, implies a high computational cost for training and optimization. The authors should critically analyze this complexity against the performance gains, as the current computational burden may hinder practical deployment in resource-constrained clinical environments.

Author Response

Concern # 1: However, the proposed explainability technique remains at a basic "Level 0" XAI, primarily offering feature ranking and importance scores. Since the architecture explicitly relies on semantically meaningful handcrafted features (like texture, shape, and frequency), the authors miss a significant opportunity to provide deeper, natural language-style explanations. Instead of simply ranking feature contributions, the system could leverage these interpretable inputs to generate clinically relevant insights, which would vastly improve the paper's contribution to transparent medical diagnostics.

Author Response:

We appreciate the reviewer’s comment. In the revised manuscript, the LIME component has been enhanced to link influential handcrafted features to clinically meaningful visual properties, providing natural-language-style explanations alongside feature importance scores. This improvement is reflected in the Abstract, Introduction (Section # 1), LIME Explanation (Section # 3.1.7), LIME explainability (Section # 4.4), Discussion (Section # 5), and Conclusion (Section # 6). It enables the framework to provide clearer, clinically relevant insights and elevates the interpretability beyond a baseline Level 0 XAI approach.
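As a rough illustration of what moving from ranked importances to natural-language-style explanations could look like, the sketch below turns (feature, weight) pairs, of the kind a local surrogate explainer such as LIME yields for tabular features, into short sentences. The feature names, the descriptive phrases, and the `describe` helper are assumptions made for illustration; they are not taken from the manuscript, and the phrases are placeholders rather than clinical findings.

```python
# Hypothetical mapping from handcrafted feature families to plain-language
# visual properties. These phrases are placeholders, not clinical findings.
FEATURE_PHRASES = {
    "gabor": "oriented texture patterns",
    "hog": "edge and gradient structure",
    "hu": "overall shape of the region",
    "fft_entropy": "irregularity of frequency content",
}


def describe(weights, class_name, top_k=3):
    """Turn (feature_name, weight) pairs into short explanatory sentences.

    `weights` mimics the (feature, weight) pairs produced by a local
    surrogate explainer; positive weights support `class_name`.
    """
    # Keep only the top_k features with the largest absolute weight.
    ranked = sorted(weights, key=lambda fw: abs(fw[1]), reverse=True)[:top_k]
    sentences = []
    for name, weight in ranked:
        # Match the feature to a family by its name prefix, if any.
        family = next((k for k in FEATURE_PHRASES if name.startswith(k)), None)
        phrase = FEATURE_PHRASES.get(family, f"feature '{name}'")
        direction = "supports" if weight > 0 else "argues against"
        sentences.append(
            f"The {phrase} ({name}, weight {weight:+.2f}) {direction} "
            f"the prediction '{class_name}'."
        )
    return sentences


# Made-up feature names and weights, purely for demonstration.
example = [("gabor_theta45_mean", 0.31), ("hu_moment_2", -0.12), ("hog_bin_7", 0.05)]
explanation = describe(example, "tumor")
```

In practice the phrase table would need to be written and validated with clinical input; the template only shows where such wording would plug in.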

Concern # 2: Regarding generalizability, the method’s heavy reliance on a pre-defined suite of handcrafted features (such as Gabor filters, HOG, and Hu moments) inherently limits its scope. While these descriptors are effective for the texture-rich data found in medical imaging (MRI, Ultrasound, Retinal), the framework appears tightly coupled to this domain. This dependence restricts the method's potential as a general-purpose image analysis tool and raises doubts about its adaptability to other image domains where these specific morphological or textural descriptors may not be discriminative.

Author Response:

We thank the reviewer for this valuable feedback. We acknowledge that the current framework is effective for texture-rich medical imaging (MRI, Ultrasound, Retinal). We agree that the method’s heavy reliance on a pre-defined suite of handcrafted features (such as Gabor filters, HOG, and Hu moments) inherently limits its direct applicability to other image domains.

This limitation is now explicitly discussed in the Discussion (Section # 5) section of the revised manuscript. We also highlighted that future work will explore incorporating domain-agnostic or learned features to extend the framework’s adaptability beyond medical imaging while preserving interpretability.

We have also added a brief note in the Introduction (Section # 1) to clarify that the proposed framework is effective across multiple medical imaging modalities, demonstrating generalizability within the medical domain.

Furthermore, we clarified in the Conclusion (Section # 6) that the framework demonstrates strong robustness and generalizability across multiple medical imaging modalities (MRI, ultrasound, and retinal fundus), as reflected in the Results section. This emphasizes its effectiveness within the intended medical imaging domain while noting that adaptation to other image domains would require additional exploration.

Concern # 3: Finally, the proposed methodology introduces significant computational overhead. The pipeline requires extracting multiple complex feature sets, performing iterative rank-based elimination, and executing a multi-method selection process for every specific class. This multi-stage approach, combined with ensemble voting, implies a high computational cost for training and optimization. The authors should critically analyze this complexity against the performance gains, as the current computational burden may hinder practical deployment in resource-constrained clinical environments.

Author Response:

We appreciate the reviewer’s observation regarding computational overhead. In response, we have added a clarification to the Discussion (Section # 5) acknowledging that the multi-stage pipeline increases training complexity. We also explain why these steps are necessary to achieve the observed gains in accuracy, robustness, and interpretability.

Additionally, we have added text on future work to the Discussion (Section # 5) outlining potential optimization directions, such as feature preselection, dimensionality reduction, and model pruning, to reduce the computational load and support deployment in resource-constrained clinical settings.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript presents a comprehensive and well-structured framework for explainable multi-modal medical image classification by integrating handcrafted features, statistically summarized deep features, ensemble learning, and class-specific feature selection. The topic is timely and highly relevant to the fields of medical image analysis, multimodal learning, and explainable AI. The methodological pipeline is clearly organized, and the experimental validation across three different medical imaging modalities (MRI, ultrasound, and retinal fundus) demonstrates the robustness and generalizability of the proposed approach. Overall, the study is technically sound and well motivated.

Nevertheless, several aspects could be improved to further strengthen the contribution and clarity of the work:

The reported performance on the BTTypes and ACRIMA datasets is extremely high (close to 99–100%). While impressive, such results also raise concerns about potential overfitting, especially given the use of extensive feature fusion and multiple selection stages. It would be helpful if the authors could provide additional discussion on how overfitting is mitigated (e.g., cross-validation strategy, random seed control, multiple train–test splits, or stability analysis across repeated runs).
All experiments are conducted on publicly available benchmark datasets. While this is fully acceptable, the generalization claims would be significantly strengthened by either (i) including an external independent dataset, or (ii) providing a more explicit discussion of the limitations associated with training and testing on datasets derived from similar public sources.
Although the focus of the paper is on interpretability and hybrid feature fusion, recent medical image analysis literature includes strong baselines based on Vision Transformers, hybrid CNN–Transformer models, and end-to-end deep learning with built-in explainability. A quantitative or at least qualitative comparison with such modern architectures would make the experimental evaluation more complete and up to date.
The framework involves multiple feature extraction pipelines, ensemble classifiers, and iterative feature selection stages. A more detailed comparison of computational cost (training and inference time) against simpler baselines would improve the practical relevance of the proposed solution, particularly for real-world clinical deployment.
The use of LIME and class-specific feature selection is an important strength of the paper. However, the presentation of explanation results could be further enhanced by including more explicit clinical interpretation of the selected features (e.g., how frequency-domain entropy or specific Gabor responses relate to known pathological image characteristics).
All three evaluated datasets are binary classification problems. It would be beneficial if the authors could discuss how the proposed CSMMFS and calibrated soft-voting strategy would scale to multi-class medical classification scenarios, which are common in real clinical practice.
While the general experimental setup is well described, some implementation details (e.g., exact MobileNet layer used for deep feature extraction, statistical feature normalization, and the value of parameter k in CSMMFS) could be more explicitly stated to further enhance reproducibility.

In summary, this study presents a solid and well-executed hybrid framework with strong experimental performance and meaningful interpretability. Addressing the above points would further improve the scientific rigor, transparency, and practical impact of the work.

Comments on the Quality of English Language

The English language of the manuscript is generally understandable and allows the reader to follow the main methodology and experimental results without major difficulty. The overall structure of the text is clear, and most technical descriptions are conveyed accurately.

However, there are several areas where the clarity and fluency of the language could be improved. In particular, some sentences are overly long and would benefit from being split into shorter, more concise statements. Minor grammatical inconsistencies, article usage, and punctuation errors are occasionally present throughout the manuscript. In addition, certain technical expressions are repeated multiple times using very similar wording, which slightly affects readability.

It is therefore recommended that the manuscript undergo a careful professional English proofreading to improve linguistic clarity, consistency, and overall readability, while preserving the technical meaning of the content.

Author Response

Reviewer#2, Concern # 1: The reported performance on the BTTypes and ACRIMA datasets is extremely high (close to 99–100%). While impressive, such results also raise concerns about potential overfitting, especially given the use of extensive feature fusion and multiple selection stages. It would be helpful if the authors could provide additional discussion on how overfitting is mitigated (e.g., cross-validation strategy, random seed control, multiple train–test splits, or stability analysis across repeated runs).

Author Response:

We thank the reviewer for highlighting this point. In the revised version, we have added the previously missing details to Section 4.2 (Experimental Setup) and an explicit treatment of overfitting considerations and the stability of the results to the Discussion (Section # 5).

These clarifications confirm that the reported high performance reflects genuine generalization rather than memorization.
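The checks the reviewer lists (seed control, multiple train–test splits, stability across repeated runs) can be sketched in a few lines. The example below is a minimal standard-library illustration, not the authors' protocol: `stratified_split`, `stability_report`, the toy data, and the threshold classifier are all assumptions introduced here.

```python
import random
import statistics


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def stability_report(evaluate, labels, seeds, test_fraction=0.2):
    """Call `evaluate(train_idx, test_idx) -> (train_acc, test_acc)` per seed.

    A large train/test gap, or a high spread across seeds, is a warning
    sign of overfitting or of results that depend on one lucky split.
    """
    train_scores, test_scores = [], []
    for seed in seeds:
        train_idx, test_idx = stratified_split(labels, test_fraction, seed)
        train_acc, test_acc = evaluate(train_idx, test_idx)
        train_scores.append(train_acc)
        test_scores.append(test_acc)
    return {
        "test_mean": statistics.mean(test_scores),
        "test_std": statistics.pstdev(test_scores),
        "mean_gap": statistics.mean(train_scores) - statistics.mean(test_scores),
    }


# Toy one-dimensional data and a midpoint-threshold classifier (placeholders).
labels = [0] * 50 + [1] * 50
values = list(range(100))


def evaluate(train_idx, test_idx):
    mean0 = statistics.mean(values[i] for i in train_idx if labels[i] == 0)
    mean1 = statistics.mean(values[i] for i in train_idx if labels[i] == 1)
    threshold = (mean0 + mean1) / 2  # fitted on training indices only

    def accuracy(idx):
        return sum((values[i] > threshold) == (labels[i] == 1) for i in idx) / len(idx)

    return accuracy(train_idx), accuracy(test_idx)


report = stability_report(evaluate, labels, seeds=range(10))
```

Reporting the mean, the spread, and the train/test gap over several seeds gives readers more to judge than a single headline accuracy.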

Reviewer#2, Concern # 2: All experiments are conducted on publicly available benchmark datasets. While this is fully acceptable, the generalization claims would be significantly strengthened by either (i) including an external independent dataset, or (ii) providing a more explicit discussion of the limitations associated with training and testing on datasets derived from similar public sources.

Author Response:

We appreciate the reviewer’s suggestion. While all experiments were conducted on publicly available benchmark datasets, we have clarified in the Discussion (Section # 5) that the generalizability of the framework is currently validated within these datasets and may be limited when applied to independent or real-world clinical data from other sources. We also highlight that future work will focus on testing the framework on additional external datasets and more heterogeneous clinical data to further assess robustness and generalization.

Reviewer#2, Concern # 3: Although the focus of the paper is on interpretability and hybrid feature fusion, recent medical image analysis literature includes strong baselines based on Vision Transformers, hybrid CNN–Transformer models, and end-to-end deep learning with built-in explainability. A quantitative or at least qualitative comparison with such modern architectures would make the experimental evaluation more complete and up to date.

Author Response:

We thank the reviewer for this valuable suggestion. To address the reviewer’s concern, we have added a new subsection, Section 4.6 (Comparison with Modern Deep Learning Architectures), which provides a quantitative comparison with representative end-to-end baselines, including a CNN, a Vision Transformer, and a CNN–Transformer hybrid. Table 11 reports performance metrics across all datasets, demonstrating that the proposed method achieves comparable or superior accuracy while using substantially fewer parameters.

Reviewer#2, Concern # 4: The framework involves multiple feature extraction pipelines, ensemble classifiers, and iterative feature selection stages. A more detailed comparison of computational cost (training and inference time) against simpler baselines would improve the practical relevance of the proposed solution, particularly for real-world clinical deployment.

Author Response:

We thank the reviewer for this valuable suggestion. In the revised manuscript, to address computational complexity and deployment feasibility, we have included Section 4.6 (Comparison with Modern Deep Learning Architectures) and Table 11, which provide a detailed comparison of training time, inference time, and parameter counts against simpler CNN baselines and Transformer-based models. These results show that the proposed framework maintains competitive computational cost and moderate inference latency despite its multi-stage design, supporting its practical applicability in clinical settings.
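For readers who want to reproduce this kind of cost comparison, the sketch below shows one simple way to measure training time and per-sample inference latency. It is a hedged illustration only: `median_seconds` is a helper introduced here, and `fit` and `predict` are trivial stand-ins rather than the pipeline or baselines reported in Table 11.

```python
import statistics
import time


def median_seconds(fn, *args, repeats=5):
    """Median wall-clock time of `fn(*args)` over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-ins for a real pipeline: `fit` and `predict` are hypothetical
# placeholders so the sketch runs without any model or dataset.
def fit(samples):
    return sum(samples) / len(samples)


def predict(model, samples):
    return [s > model for s in samples]


samples = [float(i % 17) for i in range(10_000)]
train_time = median_seconds(fit, samples)
model = fit(samples)
# Divide batch time by batch size to get an average per-sample latency.
per_sample_latency = median_seconds(predict, model, samples) / len(samples)
```

Using the median over several repeats reduces the influence of one-off slowdowns; for a multi-stage pipeline, feature extraction and feature selection would each be timed the same way and reported separately.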

Reviewer#2, Concern # 5: The use of LIME and class-specific feature selection is an important strength of the paper. However, the presentation of explanation results could be further enhanced by including more explicit clinical interpretation of the selected features (e.g., how frequency-domain entropy or specific Gabor responses relate to known pathological image characteristics).

Author Response:

We thank the reviewer for this insightful comment. We have revised Section # 4.4 (LIME explainability) to include explicit clinical interpretation of frequency-domain and Gabor-based features, linking LIME-attributed features to known pathological image characteristics.

Reviewer#2, Concern # 6: All three evaluated datasets are binary classification problems. It would be beneficial if the authors could discuss how the proposed CSMMFS and calibrated soft-voting strategy would scale to multi-class medical classification scenarios, which are common in real clinical practice.

Author Response:

We thank the reviewer for this suggestion. We have added details to Section # 5 (Discussion) explaining that the proposed CSMMFS and calibrated soft-voting strategy can be naturally extended to multi-class settings using one-vs-rest or one-vs-one feature selection and probability aggregation. Validation on multi-class and multi-label medical image datasets is identified as future work.
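The probability-aggregation half of a one-vs-rest extension can be sketched as follows. This is an assumption-laden illustration, not the authors' implementation: `soft_vote_ovr` and the example scores are hypothetical, and the per-class feature selection that would feed each one-vs-rest model is not shown.

```python
def soft_vote_ovr(member_scores):
    """Aggregate one-vs-rest probabilities from several ensemble members.

    `member_scores[m][c]` is member m's calibrated probability that the
    sample belongs to class c (versus all other classes). Scores are
    averaged over members, renormalized to sum to 1, and the arg-max
    class index is returned together with the normalized probabilities.
    """
    n_members = len(member_scores)
    n_classes = len(member_scores[0])
    averaged = [
        sum(scores[c] for scores in member_scores) / n_members
        for c in range(n_classes)
    ]
    total = sum(averaged)
    # Fall back to a uniform distribution if every score is zero.
    probs = [a / total for a in averaged] if total > 0 else [1 / n_classes] * n_classes
    predicted = max(range(n_classes), key=probs.__getitem__)
    return predicted, probs


# Three hypothetical members scoring one image against three classes.
scores = [
    [0.70, 0.20, 0.30],
    [0.60, 0.35, 0.10],
    [0.40, 0.50, 0.20],
]
label, probabilities = soft_vote_ovr(scores)
```

Renormalizing is needed because independent one-vs-rest probabilities do not have to sum to 1 across classes; a one-vs-one variant would instead combine pairwise outputs.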

Reviewer#2, Concern # 7: While the general experimental setup is well described, some implementation details (e.g., exact MobileNet layer used for deep feature extraction, statistical feature normalization, and the value of parameter k in CSMMFS) could be more explicitly stated to further enhance reproducibility.

Author Response:

We sincerely thank the reviewer for this valuable observation. In the revised manuscript, we have addressed the reviewer’s concern regarding reproducibility by adding a dedicated Section # 4.2.1 (Implementation Specifications for Reproducibility) specifying key implementation details.

Reviewer#2, Concern # 8: The English language of the manuscript is generally understandable and allows the reader to follow the main methodology and experimental results without major difficulty. The overall structure of the text is clear, and most technical descriptions are conveyed accurately.

Author Response:

We thank the reviewer for the helpful feedback on language clarity and readability. The manuscript has been carefully reviewed and revised to improve sentence structure, correct minor grammatical issues, reduce repetition of technical expressions, and enhance overall readability. These changes ensure that the methodology, results, and technical content are presented more clearly and concisely while preserving the original scientific meaning.

Author Response File: Author Response.docx
