DeepFocusNet: An Attention-Augmented Deep Neural Framework for Robust Colorectal Cancer Classification in Whole-Slide Histology Images
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Uddin et al. present their model DeepFocusNet, which incorporates an attention module that allows a more global representation to be built. The authors demonstrate its performance using the CRC tissue classification dataset, which is publicly available and commonly used for testing such deep learning models.
While I think the architecture of the model is intriguing, I think the utility of it has been insufficiently demonstrated. The benchmarking task that the authors use is one which is commonly done, for which most modern models demonstrate excellent performance. Furthermore, the authors discuss clinical utility for such a model but the real life clinical problem which this model is applicable for is unclear - tissue classification of this sort is a trivial task for the pathologist.
My major concerns relate to:
- Unclear methodology, specifically the dataset makeup. The exact volume of data used in training/validation/test sets is unclear and the performance statistics are not shown for each set. The authors discuss curation/cleaning of the data, but do not discuss the inclusion/exclusion criteria. The tissue classes are not those typically seen in NCT (complex is not one, this is usually muscle). The magnification/pixel size of the images is not discussed. It is not clear why sub-selections of data are used instead of the entire set of images.
- Lack of benchmarking. There are numerous models which could serve as a comparator, including foundation models such as UNI, Virchow or CTransPath. How does performance relate to them?
- Is there scope to trial this on another task? For example MSI prediction in CRC, or LN metastasis detection? Public datasets are available for these tasks. This would solidify the utility of the model.
I have minor comments, going through the manuscript in order:
- Some of the figures are distorted/stretched. The presentation of these could be improved eg. Fig 1. The font size of others is too small and difficult to read.
- There is a mis-placed paragraph break on line 109
- Table 1, the inclusion of the task associated with the dataset would improve clarity.
- Line 162 'whole-slide pictures' is abbreviated to WSI. This should read 'whole slide images'.
- It is unclear where the 5000x5000px images are from. Are these from NCT or from the authors' institution?
- Figure 2 - the image for adipose shows an empty tile, and the one labelled debris shows fibrillar collagen rather than debris.
- Lines 171-177 Why is there discussion around MRI and CT scans? How does that relate to the paper? The normalisation strategy is unclear -- what is it applied to?
- Line 183, what was the data cleaning strategy? How much was removed? What were the criteria exactly?
- Line 196, If it was necessary to use SMOTE, what were the class balances?
- Figure 3, incorporating some directionality in the flowchart would improve clarity
- Is the model pre-trained in any way?
- Line 299, why were the images resized to 224x224px?
- Line 302, is the learning rate 1x10? This is possibly some repetition from the previous paragraph in lines 287-293.
- Line 308, is the reference to figure 6 supposed to read figure 7?
- Figure 7 may benefit from additional detail in the legend.
- The colours mentioned in the text for malignant regions on line 341 do not appear to be the same as those in Figure 8b. Figure 8 appears somewhat redundant given the content of figure 9. Figure 8c appears inverted on the x axis compared to 8a and 8b.
- In figures 8/9 a visual indicator of image size would be helpful, for example a scale bar.
- Figure 11, what was the multi-class AUC measure specifically?
Author Response
Response to Reviewer 1
Major concerns:
- Unclear methodology, specifically the dataset makeup. The exact volume of data used in training/validation/test sets is unclear and the performance statistics are not shown for each set. The authors discuss curation/cleaning of the data, but do not discuss the inclusion/exclusion criteria. The tissue classes are not those typically seen in NCT (complex is not one, this is usually muscle). The magnification/pixel size of the images is not discussed. It is not clear why sub-selections of data are used instead of the entire set of images.
Response:
Thank you for your insightful comments on our manuscript. We greatly appreciate your feedback, which has been instrumental in enhancing the quality of our work. Regarding the tissue classes, we would like to clarify that our study did not use the NCT-CRC-HE-100K dataset from the NCT Biobank. Instead, we used the publicly available Colorectal Histology Dataset by Kather et al. (Zenodo record: https://zenodo.org/record/53169).
In response to your other suggestions, we have made the necessary updates to the manuscript to address the points raised. We believe these revisions have strengthened our arguments and improved the overall clarity of the paper.
- Lack of benchmarking. There are numerous models which could serve as a comparator, including foundation models such as UNI, Virchow or CTransPath. How does performance relate to them?
Response:
We acknowledge the importance of benchmarking our proposed DeepFocusNet against emerging foundation models such as UNI, Virchow, and CTransPath. Currently, direct comparisons are limited due to differences in model architectures, training datasets, and implementation frameworks. Nonetheless, existing literature indicates that these foundation models achieve high accuracy in histopathological classification tasks. Future work will involve comprehensive benchmarking of DeepFocusNet against these models on standardized datasets to evaluate relative performance, robustness, and clinical applicability. Such comparisons will offer valuable insights into the advantages and limitations of our approach in the context of cutting-edge AI models in histopathology.
- Is there scope to trial this on another task? For example MSI prediction in CRC, or LN metastasis detection? Public datasets are available for these tasks. This would solidify the utility of the model.
Response:
Given the demonstrated accuracy and interpretability of DeepFocusNet in colorectal cancer histology classification, there is promising potential to adapt and evaluate the framework on other critical tasks such as MSI prediction and LN metastasis detection. Existing public datasets, including the TCGA repository for MSI status and databases containing lymph node images, provide an excellent resource for such investigations. Conducting trials on these tasks would not only validate the generalizability and adaptability of the model but also reinforce its utility as a comprehensive tool for multiple aspects of colorectal cancer diagnosis and prognosis. Future research will focus on fine-tuning the model for these tasks and assessing its performance relative to existing methods.
Minor concerns:
- Some of the figures are distorted/stretched. The presentation of these could be improved eg. Fig 1. The font size of others is too small and difficult to read.
Response:
Thank you for your feedback. We have updated the figures to correct any distortion and adjusted the font sizes to ensure they are clear and easy to read.
- There is a mis-placed paragraph break on line 109
Response:
Thank you for pointing this out. We have corrected the misplaced paragraph break on line 109.
- Table 1, the inclusion of the task associated with the dataset would improve clarity.
Response:
Thank you for the suggestion. We have added a 'Task' column to Table 1 to improve clarity.
- Line 162 'whole-slide pictures' is abbreviated to WSI. This should read 'whole slide images'.
Response:
Thank you for noting this. We have corrected the text so that it now reads 'whole slide images' instead of 'whole-slide pictures'.
- It is unclear where the 5000x5000px images are from. Are these from NCT or from the authors' institution?
Response:
Thank you for your question regarding the source of the 5000x5000px images used in our study. We would like to clarify that these images are sourced from the dataset by Kather et al., Zenodo 2016.
- Figure 2 - the image for adipose shows an empty tile, and the one labelled debris shows fibrillar collagen rather than debris.
Response:
We thank the reviewer for this careful observation. The images in Figure 2 were randomly sampled examples from the training dataset (Kather et al., Zenodo 2016). In the adipose class, some tiles appear largely empty due to the histological processing of fat tissue, which leaves clear spaces where adipocytes were present. We acknowledge that the chosen example may appear less representative, and we will replace it with a more typical adipose tile showing clear adipocyte boundaries.
For the debris class, we agree that the originally selected tile may resemble fibrillar collagen. This is due to the inherent variability in patch-level labeling within the dataset. To address this, we will update Figure 2 with a more representative debris tile consistent with the published dataset definitions. We have revised the figure accordingly to avoid misinterpretation.
- Lines 171-177 Why is there discussion around MRI and CT scans? How does that relate to the paper? The normalisation strategy is unclear -- what is it applied to?
Response:
Thank you for your comment. We have removed the discussion of MRI and CT scans as it was not relevant to the focus of our paper. Additionally, we have clarified the normalization strategy to specify what it is applied to.
- Line 183, what was the data cleaning strategy? How much was removed? What were the criteria exactly?
Response:
Thank you for your comment. We have updated the manuscript to clearly describe the data cleaning strategy.
- Line 196, If it was necessary to use SMOTE, what were the class balances?
Response:
We appreciate the reviewer’s comment. The Kather_texture_2016_image_tiles_5000 dataset contains 625 images per class across 8 classes, resulting in a total of 5,000 balanced image tiles. Since the dataset is inherently balanced, there was no need to apply oversampling techniques such as SMOTE.
- Figure 3, incorporating some directionality in the flowchart would improve clarity
Response:
Thank you for the suggestion. We have updated Figure 3 to incorporate directionality in the flowchart, improving clarity.
- Is the model pre-trained in any way?
Response:
Yes, the model utilizes transfer learning by leveraging a pre-trained backbone. Specifically, the architecture is based on ResNet50V2, which is a convolutional neural network originally trained on ImageNet. This allows the model to benefit from learned features prior to fine-tuning on the specific colorectal histopathology dataset.
- Line 299, why were the images resized to 224x224px?
Response:
The images were resized from 150×150 pixels to 224×224 pixels to align with the input size requirements of pre-trained deep learning models, including DeepFocusNet, which is based on ResNet50V2. Other models compared in the study include VGG19, InceptionV3, and Xception, all originally trained on the ImageNet dataset, which also uses a 224×224 pixel input size. This resizing is essential for transfer learning because it ensures compatibility with the model architecture and enables the effective use of pre-trained weights, enhancing model performance and reducing training time.
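To make the resizing step concrete, here is a minimal, dependency-free sketch of upscaling a 150×150 tile to the 224×224 input expected by ImageNet-pretrained backbones. This is an illustrative nearest-neighbour implementation, not the authors' pipeline; in practice a framework routine with bilinear interpolation (e.g. in TensorFlow or PIL) would be used.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list-of-lists image."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[min(in_h - 1, int(r * in_h / out_h))]
            [min(in_w - 1, int(c * in_w / out_w))]
         for c in range(out_w)]
        for r in range(out_h)
    ]

# Hypothetical 150x150 grayscale tile with synthetic values.
tile = [[(r * 150 + c) % 255 for c in range(150)] for r in range(150)]
resized = resize_nearest(tile, 224, 224)
print(len(resized), len(resized[0]))  # 224 224
```

The same index-mapping idea underlies bilinear resizing; only the sampling of source pixels differs.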
- Line 302, is the learning rate 1x10? This is possibly some repetition from the previous paragraph in lines 287-293.
Response:
Thank you for pointing this out. We have clarified the learning rate value and removed the repetition to avoid confusion.
- Line 308, is the reference to figure 6 supposed to read figure 7?
Response:
Thank you for noticing this. We have corrected the reference so that it now points to the correct figure.
- Figure 7 may benefit from additional detail in the legend.
Response:
Thank you for the suggestion. We have ensured that Figure 7 is sufficiently described in the main text, providing the necessary detail.
- The colours mentioned in the text for malignant regions on line 341 do not appear to be the same as those in Figure 8b. Figure 8 appears somewhat redundant given the content of figure 9. Figure 8c appears inverted on the x axis compared to 8a and 8b.
Response:
Thank you for your observations. We have updated the manuscript to ensure that the colors in Figure 8b match the text description, clarified the relationship between Figures 8 and 9, and corrected the orientation of Figure 8c.
- In figures 8/9 a visual indicator of image size would be helpful, for example a scale bar.
Response:
Thank you for the suggestion. Figure 8 already includes a visual indicator of image size, and we have added more detail to Figure 9 to improve clarity.
- Figure 11, what was the multi-class AUC measure specifically?
Response:
The multi-class Area Under the Curve (AUC) measure presented in Figure 11 is reported to be near-perfect, with individual class AUCs reaching 1.00. Specifically, the ROC curves demonstrate that the model achieved AUC values of 1.00 across all classes, indicating exceptional discriminative ability for distinguishing among the tissue categories in the dataset.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes an attention-augmented deep neural framework for robust colorectal cancer classification in whole-slide histology images. The topic is interesting and relevant for today. The use of AI is also decent and relevant. I consider that they answered the research questions posed in the Introduction. I would, however, enlarge some figures, as the text is not seen properly. The biggest advantage of the paper is the amount of medical images used in the study. I do not detect any major flaws in the paper. However, I can suggest several modifications, if the Authors find them suitable:
- Do you need number 1 next to Authors' names if all Authors are from the same institution?
- Affiliation of Shah Muhammad Imtiyaj Uddin is missing?
- Elaborate more on how the training data was made. Especially, how many pathologists annotated the training images? Having more pathologists might improve it.
- Elaborate more how training images were processed, such as rotation, etc.
- Does MDPI allow graphical abstracts in the Abstract? If not, you can put it in the Introduction.
- Elaborate more in the Introduction on your novelties, i.e. how is your network different from others?
- Write paper outline in the Introduction so a reader can navigate easily through your paper.
- Are there any limitations of your network?
- Revise the whole manuscript because some spelling/grammar mistakes were found.
- I would enlarge some Figures in which text is hardly visible.
Author Response
Comments 1: Do you need number 1 next to Authors' names if all Authors are from the same institution?
Response 1: Thank you for your observation. All authors are from the same university and the same department; therefore, according to MDPI guidelines, numbering next to the authors’ names is not required. We have listed the authors’ names followed by the single shared affiliation as recommended.
Comments 2: Affiliation of Shah Muhammad Imtiyaj Uddin is missing?
Response 2: We have added the missing affiliation for Shah Muhammad Imtiyaj Uddin. All authors now have their correct affiliations listed in the manuscript and in the submission system.
Comments 3: Elaborate more on how the training data was made. Especially, how many pathologists annotated the training images? Having more pathologists might improve it.
Response 3: Thank you for the question about the annotation process. We used the publicly available “Collection of textures in colorectal cancer histology” dataset (Kather et al., Zenodo 2016) as our primary training/validation/testing source. The 5,000-image dataset was split at the patient level to prevent information leakage and ensure a robust evaluation of the model's generalization capability. We used a standard 70-15-15 split, resulting in 3,500 images for the training set, 750 images for the validation set, and 750 images for the independent test set.
However, the documentation for this dataset does not explicitly state how many pathologists performed the annotation, or whether there was inter-observer agreement or consensus labeling. We were therefore unable to determine from the published record whether one or multiple pathologists labelled the images.
We agree that having more pathologists involved (and/or using consensus labelling) could improve annotation quality, reduce label noise, and thus potentially improve model generalization. In future work, we will attempt to use datasets with detailed annotation provenance, or perform additional annotation validation (e.g. panel consensus, multiple annotators). Meanwhile, because our model shows stable convergence and good validation accuracy, this suggests the dataset labels are of reasonably good quality, but we acknowledge this limitation in the Discussion section.
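The 70-15-15 split described above (3,500 / 750 / 750 tiles from the 5,000-tile dataset) can be sketched as a seeded shuffle-and-slice. This is an illustrative sketch, not the authors' script; a per-class stratified split would additionally preserve the 625-per-class balance exactly.

```python
import random

def split_indices(n, train=0.70, val=0.15, seed=42):
    """Shuffle indices with a fixed seed, then carve train/val/test slices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_indices(5000)
print(len(tr), len(va), len(te))  # 3500 750 750
```

Because the slices are disjoint by construction, no tile can appear in more than one partition.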
Comments 4: Elaborate more on how training images were processed, such as rotation, etc.
Response 4: We thank the reviewer for this valuable suggestion. During training, images underwent a series of data augmentations to improve generalization and robustness. Each image had a 50% chance of being randomly flipped horizontally, was rotated by an angle θ uniformly sampled from −30° to 30°, and was scaled by a zoom factor s sampled from 0.8 to 1.2 to simulate variations in object size and viewpoint. These transformations increased dataset diversity and reduced overfitting by encouraging the model to learn rotation- and scale-invariant features.
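The parameter sampling described above can be made concrete with a short sketch. This draws one augmentation configuration per call; it is a hypothetical illustration (the actual pipeline would typically apply these via a framework utility such as Keras's `ImageDataGenerator` or `tf.image` ops).

```python
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters as described in the response."""
    return {
        "hflip": rng.random() < 0.5,        # 50% chance of horizontal flip
        "theta": rng.uniform(-30.0, 30.0),  # rotation angle in degrees
        "zoom":  rng.uniform(0.8, 1.2),     # isotropic scale factor
    }

rng = random.Random(0)
params = [sample_augmentation(rng) for _ in range(1000)]
```

Sampling fresh parameters for every image on every epoch is what makes the effective training set far more diverse than its 3,500 base tiles.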
Comment 5: Does MDPI allow graphical abstracts in Abstract? If not, you can put it in the Introduction.
Response 5: MDPI does not allow graphical abstracts inside the text Abstract, so we have added it in the Introduction.
Comments 6: Elaborate more in Introduction your novelties, i.e. how is your network different from others?
Response 6: DeepFocusNet introduces three key innovations over existing methods. First, it integrates attention mechanisms that focus on diagnostically relevant regions, improving both accuracy and interpretability. Second, it employs a scalable tiling strategy, dividing large whole-slide images into high-resolution windows and reconstructing them into probability heatmaps and multi-class overlays for precise spatial localization. Third, it uses a progressive training pipeline with augmentation, fine-tuning, and selective layer freezing to reduce overfitting and enhance generalization. Together, these innovations enable superior diagnostic precision, stability, and clinical relevance compared to prior models.
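The tiling strategy mentioned in the second innovation can be sketched as enumerating tile origins over a slide; reconstruction then places each tile's class probabilities back at its origin to form the heatmap. The function and the 150-px tile size below are illustrative assumptions based on the dataset's patch size, not the authors' code.

```python
def tile_coordinates(width, height, tile, stride=None):
    """Top-left corners of tiles covering a slide; stride < tile gives overlap."""
    stride = stride or tile  # default: non-overlapping grid
    xs = range(0, width - tile + 1, stride)
    ys = range(0, height - tile + 1, stride)
    return [(x, y) for y in ys for x in xs]

# A 5000x5000 px region covered by non-overlapping 150x150 px tiles.
coords = tile_coordinates(5000, 5000, 150)
print(len(coords))  # 1089 tiles (33 x 33 grid)
```

A smaller stride trades computation for a smoother probability heatmap, since overlapping predictions can be averaged per pixel.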
Comments 7: Write a paper outline in the Introduction so a reader can navigate easily through your paper.
Response 7: The remainder of this paper is organized as follows. Section 2 provides an overview of related work in deep learning for medical image analysis, with a focus on histopathology. Section 3 describes the dataset acquisition and description, data preprocessing, and data augmentation pipeline. In Section 4, we describe the proposed DeepFocusNet architecture and its underlying design principles. Section 5 presents the experimental setup, training and convergence analysis, qualitative analysis, quantitative performance, and a comparative analysis of different models. In Section 6, we discuss the interpretation of the results, clinical implications, and limitations. Finally, Section 7 concludes the paper and outlines potential directions for future research.
Comments 8: Are there any limitations to your network?
Response 8: We appreciate this important point. While DeepFocusNet demonstrates superior performance compared to baseline models, several limitations should be acknowledged. First, the training dataset is limited in size and originates from a single institution, which may constrain generalizability to other populations or staining protocols. Second, the dataset documentation does not explicitly describe the number of pathologists involved in annotation, and therefore possible inter-observer variability cannot be excluded. Third, although our design is computationally lighter than large-scale CNNs, GPU resources are still required for training, which may pose challenges in low-resource clinical environments. Moreover, while the attention mechanism improves focus on diagnostically relevant regions, interpretability is not fully solved, and further explainability techniques are needed for clinical trust. Finally, our validation was retrospective; multi-center prospective validation remains an important direction for future work.
These limitations are now explicitly discussed in the revised manuscript (Discussion section).
Comments 9: Revise the whole manuscript because some spelling/grammar mistakes were found.
Response 9: The entire manuscript has been carefully revised to correct spelling, grammar, and stylistic issues. We also improved clarity and consistency of terminology to ensure that the text reads more smoothly and maintains a professional scientific tone.
Comments 10: I would enlarge some Figures in which text is hardly visible.
Response 10: We thank the reviewer for this helpful suggestion. All figures have been carefully revised to improve readability. Specifically, we enlarged the font sizes of axis labels, legends, and annotations, and adjusted the overall figure resolution to ensure clarity in both the online and printed versions. These modifications make the figures easier to interpret without altering their scientific content.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
Thank you for the opportunity to review your manuscript entitled “DeepFocusNet: An Attention-Augmented Deep Neural Framework for Robust Colorectal Cancer Classification in Whole-Slide Histology Images.” The topic is highly relevant, and your work has clear potential to contribute to AI-driven pathology. The manuscript is generally well-structured and technically sound. Nevertheless, I believe that several aspects can be further refined to enhance clarity, scientific rigor, and accessibility for the readership. Please consider the following suggestions:
1. Introduction and Related Work
- While the introduction sets an appropriate context, some references are relatively recent but others are missing. For example, recent transformer-based histopathology approaches (beyond CBAM and CNNs) could be briefly acknowledged to situate your model within the most current landscape.
- The related work section is comprehensive but could be condensed, emphasising how existing methods fall short and how DeepFocusNet directly addresses these gaps. This would improve focus and readability.
2. Methods
- The description of preprocessing and augmentation is detailed, but it sometimes blends standard practices (e.g., stain normalisation, CLAHE) with advanced techniques. It would help to clarify which of these steps were actually implemented in your pipeline versus those presented as general options.
- Please ensure reproducibility by specifying dataset splits (exact percentages for training, validation, test). Although you mention patient-level splitting, additional clarity would be beneficial.
- The mathematical notations (e.g., in augmentation and residual connections) could be checked for consistency and typographical accuracy.
3. Results and Figures
- The results are convincing, particularly with strong accuracy and AUC values. However, presenting standard deviations or confidence intervals for performance metrics would strengthen the claims of robustness.
- Figures 6–12 are informative, yet the captions could be expanded to be self-explanatory. For instance, the ROC curve figure should indicate the sample size or the evaluation dataset.
- The confusion matrix (Figure 10) is visually clear, but providing per-class misclassification numbers in the text would make interpretation easier.
4. Discussion and Limitations
- The discussion is thorough, but the limitations section could be expanded. For example, please discuss potential biases introduced by the single data source (NCT biobank) and how model generalizability might be affected across staining protocols and scanners.
- You may also wish to mention computational efficiency in more detail, as the model might be resource-intensive for low-resource clinical environments.
5. Language and Style
- The manuscript would benefit from careful language polishing to reduce redundancy and improve readability. Shortening long sentences and ensuring consistent tense usage will help.
Overall Assessment
Your study demonstrates strong technical execution and clinical relevance. With revisions to enhance methodological clarity, reporting rigour, and linguistic style, the manuscript could make a valuable contribution to the Electronics readership. I encourage you to consider these points to strengthen your work further.
Author Response
- Introduction and Related Work
Comments:
-While the introduction sets an appropriate context, some references are relatively recent but others are missing. For example, recent transformer-based histopathology approaches (beyond CBAM and CNNs) could be briefly acknowledged to situate your model within the most current landscape.
-The related work section is comprehensive but could be condensed, emphasising how existing methods fall short and how DeepFocusNet directly addresses these gaps. This would improve focus and readability.
Response: We thank the reviewer for the constructive feedback. In the revised manuscript, the Introduction has been updated to acknowledge the paradigm shift towards transformer-based histopathology approaches, including hybrid CNN–Transformer models, self-supervised Vision Transformers, and transformer-based MIL frameworks. The Related Work section has been streamlined to emphasize the limitations of CNNs (restricted contextual scope) and transformers (computational overhead and limited interpretability), while positioning DeepFocusNet as an efficient and interpretable alternative that bridges these gaps.
- Methods
Comments: The description of preprocessing and augmentation is detailed, but it sometimes blends standard practices (e.g., stain normalisation, CLAHE) with advanced techniques. It would help to clarify which of these steps were actually implemented in your pipeline versus those presented as general options.
Response: Thank you for the comment. We have updated the manuscript to clearly distinguish between the preprocessing and augmentation steps actually implemented in our pipeline and those mentioned as general options.
Comment: Please ensure reproducibility by specifying dataset splits (exact percentages for training, validation, test). Although you mention patient-level splitting, additional clarity would be beneficial.
Response: We thank the reviewer for this valuable suggestion. The dataset (NCT-CRC-HE-100K, [Zenodo link]) is provided as image patches without explicit patient identifiers; therefore, patient-level splitting was not possible. To ensure reproducibility, we randomly divided the dataset into 70% training, 15% validation, and 15% test, maintaining class balance. In addition, we have shared multiple scripts for preprocessing, splitting, and training to facilitate reproducibility of our results.
Comments: The mathematical notations (e.g., in augmentation and residual connections) could be checked for consistency and typographical accuracy.
Response: We have carefully reviewed all mathematical notations in the manuscript, including those in the augmentation pipeline and residual connections. We confirm that they are consistent and typographically accurate, and no changes are necessary.
- Results and Figures
Comments: The results are convincing, particularly with strong accuracy and AUC values. However, presenting standard deviations or confidence intervals for performance metrics would strengthen the claims of robustness.
Response: We thank the reviewer for the suggestion. Table 2 reports the quantitative performance metrics of DeepFocusNet on the test set as mean ± SD over 5 independent runs. All metrics are expressed as percentages, and the reported SD reflects variability across runs, supporting the robustness of the results.
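The mean ± SD reporting over 5 independent runs amounts to the following computation; the accuracy values here are hypothetical placeholders, not the paper's results, and `statistics.stdev` gives the sample standard deviation (n−1 denominator).

```python
from statistics import mean, stdev

runs = [99.2, 98.8, 99.0, 99.4, 98.9]  # hypothetical accuracies (%) over 5 runs
mu, sd = mean(runs), stdev(runs)       # sample SD, as typically reported
print(f"{mu:.2f} ± {sd:.2f}")
```

A small SD relative to the mean is what supports the robustness claim across random seeds.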
Comment:
-Figures 6–12 are informative, yet the captions could be expanded to be self-explanatory. For instance, the ROC curve figure should indicate the sample size or the evaluation dataset.
-The confusion matrix (Figure 10) is visually clear, but providing per-class misclassification numbers in the text would make interpretation easier.
Response: We thank the reviewer for the suggestions. We have updated the captions for Figures 6–12 to make them more self-explanatory, including relevant details such as the sample size and evaluation dataset for the ROC curves. Additionally, we have provided per-class misclassification numbers in the text for Figure 10, facilitating easier interpretation of the confusion matrix and highlighting specific model strengths and confusions between classes.
- Discussion and Limitations
Comment: The discussion is thorough, but the limitations section could be expanded. For instance, please discuss potential biases introduced by using a single data source (NCT Biobank) and how model generalizability might be affected across different staining protocols and scanners. Additionally, the manuscript could address computational efficiency in more detail, as the model may be resource-intensive for deployment in low-resource clinical environments.
Response: We thank the reviewer for highlighting these points. We have expanded the limitations section.
- Language and Style
Comment: The manuscript would benefit from careful language polishing to reduce redundancy and improve readability. Shortening long sentences and ensuring consistent tense usage will help.
Response: We thank the reviewer for this suggestion. We have carefully revised the manuscript to improve readability, reduce redundancy, and ensure consistent use of tense throughout the text. Long sentences have been shortened where appropriate, and language has been polished to enhance clarity and flow.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
My comments have been addressed. No further comments.
Reviewer 2 Report
Comments and Suggestions for Authors
All comments were addressed in this revised version of the manuscript. I do not require any more changes.
Reviewer 3 Report
Comments and Suggestions for Authors
Accept in present form.