Review Reports - The Diagnostic Value of Deep Learning for Multi-Classification of Rectal Cancer T Staging Based on Regional Attention

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

General comments

The aim of the study is to evaluate the feasibility and effectiveness of a deep learning model for the preoperative identification of rectal cancer T stage on contrast-enhanced CT images. The authors conclude that their deep learning models (binary and multi-class classification), based on regional attention, exhibited superior predictive performance for rectal cancer staging compared to both radiomics- and clinical data-based models. The main strengths of the study include the large number of retrospectively included patients and CT examinations, as well as the detailed statistical analysis (although the 12 figures in the Results section may be excessive). The main limitations are the insufficient contextualization within the most recent medical literature and a tendency to overgeneralize the results in the conclusions.

Abstract
Clear and well written.

Introduction

1. The authors state: "On the other hand, MRI requires high patient compliance and longer scanning times, making it unsuitable for routine screening of rectal cancer."

This statement is not consistent with the previous discussion. The proposed deep learning model is not intended for screening purposes. In this context, MRI remains the gold standard for local staging of rectal cancer. Both the Introduction and the Discussion lack a comparison in terms of predictive performance for T staging between MRI and contrast-enhanced CT (with and without deep learning-based image analysis). While CT is certainly more widely accessible than MRI, the current role of MRI in loco-regional staging of rectal cancer should be clearly emphasized.

2. The paragraph discussing imaging modalities should be revised. The authors should better highlight the advantages and limitations of each modality specifically in the context of loco-regional staging of rectal cancer. CT can be described as a comprehensive, whole-body imaging modality; however, its limitations—particularly in terms of contrast resolution—should be clearly addressed.

3. "In fact, radiomics-trained CT-based models can predict the T stage and length of esophageal squamous cell carcinoma[11]. Furthermore, Liang et al.[12] used radiomics to train MRI-based models for predicting synchronous liver metastasis in rectal cancer."

In this section, the authors should focus more specifically on loco-regional staging of rectal cancer. Introducing heterogeneous topics (e.g., esophageal cancer, metastases, GIST) may distract the reader. The reader expects a concise overview of the current evidence regarding radiomics and deep learning applied to contrast-enhanced CT for rectal cancer staging. If such data are limited or lacking, this should be explicitly stated. Additionally, references 12 and 13 appear outdated, highlighting the need for a more thorough and up-to-date literature review. The authors should better clarify how the present study fits within the current radiological literature.

Materials and Methods

1. Replace: "and pelvic CT dynamic enhancement scanning images"
with: "contrast-enhanced CT scans in the portal venous phase."

2. The description of the CT acquisition protocol is unclear. Portal venous phase imaging is typically acquired approximately 45–50 seconds after reaching a threshold of ~100 HU in the abdominal aorta using bolus tracking technique. The authors should provide a more precise description of the acquisition technique. Additionally, was a fixed volume (80 mL) of iodinated contrast agent administered to all patients? In clinical practice, contrast volume is usually adjusted according to patient weight and iodine delivery rate. The rationale for using a fixed contrast volume should be clarified.

3. "Several studies have utilized portal venous phase enhanced CT images for tumor lesion segmentation [14]"

Reference 14 refers to gastric cancer. Are there prior studies specifically addressing segmentation of rectal tumors on contrast-enhanced CT? This should be clarified.

4. "T staging was determined based on the 8th edition of the AJCC TNM classification criteria[15]..."

This statement may be misleading. The AJCC TNM classification does not provide detailed CT imaging criteria for local staging. For example, the CT imaging features described (e.g., “localized enhancement within the submucosa,” “well-defined soft tissue density masses,” “smooth outer bowel wall”) do not appear to be directly derived from AJCC criteria. The authors should clarify the source of these imaging definitions and specify where such CT-based staging criteria can be found in the literature.

Results

The difference between micro-AUC and macro-AUC is not clearly explained.
The section: "Radiomics Feature Processing and Model Development..."
should be moved to the Materials and Methods section.
Although the number of figures (n = 12) is relatively high, they do help in presenting the results clearly.

Discussion

1. "Radiologists often misclassify T1 stage tumors as T2, and T2 stage tumors as T3 when relying solely on CT images..."

Reference 19 appears to refer to endorectal ultrasonography. This should be corrected or better justified. The authors should also discuss the specific challenges encountered in segmenting rectal tumors on CT images. For instance, is it possible to reliably distinguish the different layers of the rectal wall on CT?

2. The study by Hou et al. [20] is highly relevant and appropriately cited.

3. "Moreover, our binary classification model... achieved superior performance compared to the HRT2 model trained using high-resolution MRI data."

This is a potentially outstanding finding; however, the authors should be cautious not to overgeneralize their results.

4. Original: "Currently, local excision after neoadjuvant chemoradiotherapy recommended for T2 stage rectal cancer in case patients refuse or are unsuitable for abdominal surgical resection."

Revised: "Currently, local excision after neoadjuvant chemoradiotherapy may be considered for T2 stage rectal cancer in patients who refuse or are unfit for abdominal surgical resection."

5. "Given the relatively limited resolution of CT scans..."

The authors should specify whether they are referring to contrast resolution or spatial resolution.

6. Reference 25 appears to relate to a study on mammography and therefore does not seem relevant to the topic of rectal cancer imaging and staging. The authors are kindly requested to verify this reference and replace it with a more appropriate and relevant citation, if necessary.

Comments on the Quality of English Language

The overall quality of the English language is acceptable; however, several sections would benefit from careful revision to improve clarity and consistency.

Author Response

Introduction

Comments 1. The authors state: "On the other hand, MRI requires high patient compliance and longer scanning times, making it unsuitable for routine screening of rectal cancer."

Response 1:

We sincerely apologize for the inaccurate use of the term 'routine screening' in our previous draft. We have completely rewritten this sentence in the Introduction. As the reviewer correctly pointed out, MRI is the undisputed gold standard for local staging. Our revised sentence now explicitly acknowledges this gold standard status, while clarifying that the long scanning times, requirement for high patient compliance, and limited availability of MRI in primary care hospitals justify the widespread clinical use of CECT for staging.

Revised Text(Page 2, Lines 78-89):

Nevertheless, MRI requires high patient compliance and entails long scanning and appointment waiting times. Furthermore, the limited availability of MRI equipment, particularly in primary care facilities, significantly restricts its universal applicability for routine staging. Consequently, contrast-enhanced CT, characterized by its non-invasive nature, rapid acquisition, and widespread accessibility, remains a broadly adopted routine staging modality for comprehensive whole-body assessment.

Comments 2. The paragraph discussing imaging modalities should be revised. The authors should better highlight the advantages and limitations of each modality specifically in the context of loco-regional staging of rectal cancer. CT can be described as a comprehensive, whole-body imaging modality; however, its limitations—particularly in terms of contrast resolution—should be clearly addressed.

Response 2:

We highly appreciate this insightful and constructive suggestion. We have thoroughly rewritten the paragraph to systematically contrast the advantages and limitations of each imaging modality for loco-regional staging, maintaining our primary focus on contrast-enhanced CT. Specifically, we noted the invasiveness of EUS and the high cost of PET-CT. We then unequivocally highlighted MRI as the undisputed gold standard due to its unmatched soft-tissue resolution. However, we explicitly emphasized that its prolonged scanning time and limited availability in primary care facilities restrict its widespread use. Consequently, we positioned CT as a widely adopted, highly accessible whole-body imaging alternative, while clearly acknowledging its primary limitation in soft-tissue contrast for local staging. This logical progression strictly aligns with clinical realities and directly justifies our rationale for enhancing CT evaluation. The revised paragraph has been updated in the Introduction.

Revised Text(Page 2, Lines 67-89):

Currently, the primary imaging modalities used for the preoperative staging of rectal cancer include endoscopic ultrasound (EUS), positron emission tomography-computed tomography (PET-CT), magnetic resonance imaging (MRI), and computed tomography (CT). Regarding loco-regional staging, each modality presents distinct advantages and limitations. EUS demonstrates high accuracy in evaluating the early involvement of the rectal wall layers; however, its invasiveness reduces patient acceptance, and it is inadequate for comprehensively assessing distant metastases. While PET-CT is highly valuable for whole-body evaluation, its high cost and associated radiation risks generally preclude its use as a routine modality for local staging. In contrast, high-resolution MRI, with its exceptional soft-tissue contrast, enables precise assessment of tumor invasion and is widely recognized as the undisputed gold standard for the local staging of rectal cancer. Nevertheless, MRI requires high patient compliance and entails long scanning and appointment waiting times. Furthermore, the limited availability of MRI equipment, particularly in primary care facilities, significantly restricts its universal applicability for routine staging. Consequently, contrast-enhanced CT, characterized by its non-invasive nature, rapid acquisition, and widespread accessibility, remains a broadly adopted routine staging modality for comprehensive whole-body assessment. Despite its extensive clinical use, CT possesses notable limitations in loco-regional staging; its inferior soft-tissue contrast resolution presents significant challenges for accurate local evaluation, such as T-staging. Given the extensive clinical reliance on CT and its inherent limitations in local assessment, exploring novel auxiliary analytical techniques (such as deep learning) to overcome its staging deficiencies holds immense clinical significance.

Comments 3. "In fact, radiomics-trained CT-based models can predict the T stage and length of esophageal squamous cell carcinoma[11]. Furthermore, Liang et al.[12] used radiomics to train MRI-based models for predicting synchronous liver metastasis in rectal cancer."

Response 3:

We sincerely appreciate this insightful and constructive comment. We fully agree that discussing heterogeneous topics like esophageal cancer, distant metastasis, and gastrointestinal stromal tumors distracted from our core clinical focus. Accordingly, we have completely removed these examples and their corresponding outdated references. We clarified that while numerous deep learning approaches currently focus on tumor segmentation and staging using magnetic resonance imaging or positron emission tomography, applications specifically utilizing contrast-enhanced computed tomography for precise local staging remain relatively limited. This highly focused literature review explicitly highlights the specific clinical gap our study aims to bridge. All these targeted modifications have been clearly highlighted in the Introduction section of the revised manuscript.

Revised Text(Page 3, Lines 94-105):

Prior studies have utilized radiomics to develop MRI-based models for predicting the preoperative T-stage of rectal cancer. Furthermore, while some research has applied deep learning approaches for the detection and segmentation of rectal tumors, the majority of these investigations have predominantly focused on MRI or PET imaging. In contrast, deep learning studies utilizing routine contrast-enhanced CT (CECT) for the precise locoregional staging of rectal cancer remain relatively scarce. Given the large patient population and the uneven distribution of medical resources, automated assessment of local invasion using CECT can significantly enhance diagnostic efficiency and provide an objective preoperative reference for clinicians. Therefore, this study specifically focuses on the primary rectal tumor, aiming to fill the literature gap regarding CECT-based deep convolutional models for locoregional staging.

Materials and Methods

Comments 1. Replace: "and pelvic CT dynamic enhancement scanning images"

with: "contrast-enhanced CT scans in the portal venous phase."

Response 1:

We sincerely thank the reviewer for this accurate and professional correction. We confirm that the imaging data utilized in our study were indeed thin-slice contrast-enhanced CT scans acquired during the portal venous phase. We have revised the terminology in the manuscript as suggested.

Revised Text(Page 3, Lines 115-116):

3) availability of complete high-resolution CT scans of the rectum and contrast-enhanced CT scans in the portal venous phase.

Comments 2. The description of the CT acquisition protocol is unclear. Portal venous phase imaging is typically acquired approximately 45–50 seconds after reaching a threshold of ~100 HU in the abdominal aorta using bolus tracking technique. The authors should provide a more precise description of the acquisition technique. Additionally, was a fixed volume (80 mL) of iodinated contrast agent administered to all patients? In clinical practice, contrast volume is usually adjusted according to patient weight and iodine delivery rate. The rationale for using a fixed contrast volume should be clarified.

Response 2:

We sincerely appreciate this precise and professional comment. We apologize for the inaccurate description in the initial manuscript.

Regarding the acquisition protocol, we completely agree that detailing the exact timing is crucial. We have clarified that a bolus tracking technique was employed, with the portal venous phase acquired exactly 50 seconds after the contrast attenuation reached the predefined 100 HU threshold in the abdominal aorta.

Regarding the contrast agent dosage, thank you for catching this. Upon re-verifying our clinical records, "80 mL" was incorrectly stated as a fixed dose. The actual protocol rigorously utilized a weight-based individualized dosing strategy of 1.5 mL/kg for all patients. This strategy ensures optimal and consistent enhancement across varying body habitus.

The manuscript has been carefully updated to accurately reflect these specific acquisition and dosing parameters.

Revised Text(Page 4, Lines 131-137):

Contrast-enhanced CT scans from the abdomen to the pelvis were performed using a 64-slice CT scanner (Revolution CT; GE Medical Systems, IL, USA). An individualized, weight-based dose of 1.5 mL/kg of iohexol (Omnipaque; GE Healthcare, Shanghai, China) was administered via the cubital vein. For image acquisition, a bolus tracking technique was employed. Specifically, the portal venous phase imaging was acquired with a precise diagnostic delay of 50 seconds after the contrast attenuation reached a predefined threshold of 100 HU within the abdominal aorta.

Comments 3. "Several studies have utilized portal venous phase enhanced CT images for tumor lesion segmentation [14]"

Reference 14 refers to gastric cancer. Are there prior studies specifically addressing segmentation of rectal tumors on contrast-enhanced CT? This should be clarified.

Response 3:

We deeply appreciate the reviewer's careful reading and precise observation. We apologize for the previous oversight.

We completely agree that a reference specifically focusing on rectal cancer is necessary here. Accordingly, we have replaced the previous citation regarding gastric cancer with a highly relevant study that specifically investigates the segmentation of primary rectal tumors using contrast-enhanced CT.

The text has been slightly adjusted to reflect this specific citation, and the reference list has been updated.

Revised Text(Page 4, Lines 144-145):

Several studies have utilized portal venous phase enhanced CT images for tumor lesion segmentation[14]

Comments 4. "T staging was determined based on the 8th edition of the AJCC TNM classification criteria[15]..."

Response 4:

We sincerely appreciate this insightful and professional comment. You are completely correct that the AJCC TNM classification provides pathological definitions rather than detailed CT morphological criteria. We apologize for this misleading statement.

To address this, we have corrected the manuscript to clarify the true source of our imaging definitions. Specifically, the CT-based staging criteria were not directly derived from the AJCC manual. Instead, they were meticulously synthesized from the consensus clinical experience of our specialized radiologists and gastrointestinal surgeons, in strict accordance with established radiological staging guidelines for rectal cancer documented in previous literature.

The corresponding sentences in the manuscript have been revised to accurately reflect the radiological origins of these criteria, and appropriate references have been added.

Revised Text(Page 4, Lines 148-155):

CT-based T staging was evaluated to best approximate the pathological categories of the 8th edition of the AJCC TNM classification. Since the AJCC manual does not explicitly provide detailed CT morphological standards, the specific imaging criteria utilized in our study were established based on the consensus clinical experience of specialized radiologists and gastrointestinal surgeons, alongside established morphological descriptions from relevant radiological literature. The detailed CT evaluation criteria were as follows:

Results

Comments 1. The difference between micro-AUC and macro-AUC is not clearly explained.

The section: "Radiomics Feature Processing and Model Development..."

should be moved to the Materials and Methods section.

Although the number of figures (n = 12) is relatively high, they do help in presenting the results clearly.

Response 1:

We sincerely appreciate your constructive feedback and your positive recognition of our figures.

First, to address the metric definitions, we have added a clear explanation in the text: micro-AUC aggregates the contributions of all classes globally to reflect overall performance, whereas macro-AUC computes the metric independently for each class and averages them, treating all classes equally to account for potential class imbalance. Second, exactly as you suggested, we have moved the entire "Radiomics Feature Processing and Model Development" section into the Materials and Methods section. This structural adjustment significantly improves the logical flow of the manuscript. The manuscript has been updated accordingly to reflect these necessary clarifications and structural adjustments.

Revised Text(Page 9, Lines 303-305):

"...Vision Transformer (AUC = 0.745, accuracy = 0.750) models. For the multi-classification approach, we evaluated both micro-average AUC (which aggregates the contributions globally to reflect overall performance) and macro-average AUC (which treats all classes equally to adjust for potential class imbalance). The micro-average AUC and macro-average AUC of the ROITransStage model were 0.873 and 0.862 respectively, and the accuracy was 0.81 (Figure 8), and these metrics were superior to that of Convnext (micro-average AUC=0.813, macro-average AUC=0.802, accuracy=0.71) and Vision Transformer..."
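The distinction between the two averages can be illustrated with a minimal sketch, assuming a one-vs-rest formulation and toy data. The function names (`binary_auc`, `macro_auc`, `micro_auc`) and the example scores are illustrative assumptions; they are not taken from the manuscript, and the authors' exact computation is not described in this exchange.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random
    negative; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(y_true, y_score, n_classes):
    """One-vs-rest AUC per class, then an unweighted mean over classes,
    so a rare class counts as much as a common one."""
    per_class = []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / n_classes


def micro_auc(y_true, y_score, n_classes):
    """Pool every (sample, class) pair into one binary problem, so classes
    with more samples contribute more positives to the pooled ranking."""
    labels = [1 if y == c else 0 for y in y_true for c in range(n_classes)]
    scores = [row[c] for row in y_score for c in range(n_classes)]
    return binary_auc(labels, scores)


# Toy example (hypothetical): 4 samples, 3 classes, class 0 over-represented.
y_true = [0, 0, 1, 2]
y_score = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.5, 0.2, 0.3],
]
print(round(macro_auc(y_true, y_score, 3), 4))  # 0.9167
print(round(micro_auc(y_true, y_score, 3), 4))  # 0.8594
```

In this toy case the two averages differ because the pooled (micro) ranking compares scores across classes, whereas the macro average only ranks scores within each class.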

Discussion

Comments 1. "Radiologists often misclassify T1 stage tumors as T2, and T2 stage tumors as T3 when relying solely on CT images..."

Response 1:

We appreciate this precise comment. We apologize for the inaccurate reference and have replaced it with an appropriate study focusing on CT-based rectal cancer staging. Furthermore, we have expanded our discussion on the inherent challenges of segmenting rectal tumors on CT. As you correctly pointed out, standard CT lacks the soft-tissue resolution required to reliably distinguish the distinct microscopic layers of the rectal wall. This physical limitation makes precise visual boundary delineation highly challenging, often leading to cautious over-staging by radiologists. We have explicitly addressed these specific segmentation difficulties in the revised manuscript. This addition perfectly justifies why our data-driven deep learning approach, which captures subtle textural variations beyond human visual perception, provides a significant advantage in overcoming these conventional imaging limitations.

Revised Text(Page 14, Lines 371-376):

Radiologists often misclassify T1 stage tumors as T2, and T2 stage tumors as T3 when relying solely on CT images. A fundamental challenge in segmenting rectal tumors is that standard CT lacks sufficient soft-tissue resolution to reliably distinguish the distinct histological layers of the rectal wall. Faced with such blurred anatomical boundaries, radiologists tend to err on the side of caution to avoid under-staging and the risk of further development. To overcome these inherent visual limitations, the deep learning model based on regional attention and trained with a large data volume can eliminate the influence of subjective factors, thus allowing accurate identification of the subtle differences between various tumor stages, and enhancing the differentiation between T1 and T2 stage tumors.

Comments 2. The study by Hou et al. [20] is highly relevant and appropriately cited.

Response 2:

We sincerely appreciate this positive feedback regarding our literature selection. We deliberately chose to highlight that specific research because it effectively establishes the clinical necessity of binary classification for rectal cancer staging. By framing the diagnostic task into early and advanced stages, that prior work perfectly contextualizes the clinical foundation upon which our current model is built. Acknowledging such foundational methodology allows us to clearly position the practical value of our proposed architecture. It demonstrates how our data-driven approach aligns with established clinical workflows while addressing the ongoing challenges of evaluating morphological features on medical images.

Comments 3. "Moreover, our binary classification model... achieved superior performance compared to the HRT2 model trained using high-resolution MRI data."

This is a potentially outstanding finding; however, the authors should be cautious not to overgeneralize their results.

Response 3:

We sincerely appreciate this insightful and cautious comment. We completely agree that directly claiming our CT-based model is superior to an MRI-based model could be misleading, particularly since MRI remains the established clinical gold standard for local staging of rectal cancer. To avoid any overgeneralization, we have carefully toned down this statement. Instead of claiming absolute superiority, we now emphasize that our attention-guided CECT model demonstrates highly competitive diagnostic efficacy. We further clarified that the primary clinical value of our proposed method is not to replace MRI, but rather to serve as a robust and reliable alternative when high-resolution MRI is inaccessible or contraindicated for patients. These nuanced adjustments have been incorporated into the revised manuscript to reflect a more rigorous and balanced clinical perspective.

Revised Text(Page 14, Lines 402-407):

Moreover, our binary classification model based on regional attention and trained with enhanced CT images demonstrated highly competitive diagnostic efficacy, achieving results comparable to previously reported models trained on high-resolution MRI data. While MRI remains the clinical gold standard, this finding highlights the potential of our AI-assisted CECT approach as a robust and reliable alternative when MRI is contraindicated or unavailable.

Comments 4. Original: "Currently, local excision after neoadjuvant chemoradiotherapy recommended for T2 stage rectal cancer in case patients refuse or are unsuitable for abdominal surgical resection."

Revised: "Currently, local excision after neoadjuvant chemoradiotherapy may be considered for T2 stage rectal cancer in patients who refuse or are unfit for abdominal surgical resection."

Response 4:

We sincerely appreciate this precise comment and the excellent text refinement. We completely agree that the original term "recommended" overstated the current clinical consensus for T2 stage rectal cancer management. Your suggested phrasing, "may be considered," much more accurately reflects established clinical guidelines, which approach local excision in this specific context as an alternative or compromise rather than the definitive standard of care. Furthermore, we appreciate the grammatical improvements which make the sentence read more naturally. We have completely adopted your suggested revision in the updated manuscript to ensure absolute clinical accuracy and maintain an objective, rigorous tone.

Revised Text(Page 14, Lines 394-395):

Currently, local excision after neoadjuvant chemoradiotherapy may be considered for T2 stage rectal cancer in patients who refuse or are unfit for abdominal surgical resection.

Comments 5. "Given the relatively limited resolution of CT scans..."

The authors should specify whether they are referring to contrast resolution or spatial resolution.

Response 5:

We appreciate this precise comment. We completely agree that the general term "resolution" was ambiguous in this specific clinical context. We have clarified this by explicitly specifying "soft-tissue contrast resolution." While modern CT scanners provide excellent spatial resolution, their primary inherent limitation in evaluating rectal tumors lies in differentiating the distinct histological layers of the rectal wall. This specific limitation in contrast resolution makes it extremely challenging for the human eye to detect subtle structural changes and accurately stage the tumor. Your insightful suggestion perfectly highlights the exact physical limitation of CT that our attention-guided deep learning model aims to overcome by extracting high-dimensional attenuation patterns. We have updated the specific terminology in the revised manuscript accordingly.

Revised Text(Page 15, Lines 420-422):

Given the relatively limited soft-tissue contrast resolution of CT scans, it may be difficult to accurately detect these subtle changes in tumor tissue structure, and differentiate between the two stages.

Comments 6. Reference 25 appears to relate to a study on mammography and therefore does not seem relevant to the topic of rectal cancer imaging and staging. The authors are kindly requested to verify this reference and replace it with a more appropriate and relevant citation, if necessary.

Response 6:

We sincerely appreciate your careful review and this precise observation. We apologize for the oversight regarding the misaligned citation. We have corrected this error and replaced the mammography-related reference with a highly relevant study focused specifically on rectal cancer imaging. The newly cited work successfully employs a similar bounding box methodology to isolate the tumor region of interest within abdominal scans prior to deep learning analysis. By making this replacement, we ensure that our description of ROI extraction methods is appropriately contextualized within the specific domain of gastrointestinal image processing. The reference has been updated accordingly in the revised manuscript.

Revised Text(Page 15, Lines 433-435):

Studies similar to ours had adopted a bounding box drawing method for labeling samples, and marked tumor areas with rectangular boxes to obtain the ROI for training deep learning models

Comments on the Quality of English Language

Comments 1. The overall quality of the English language is acceptable; however, several sections would benefit from careful revision to improve clarity and consistency.

Response 1:

We appreciate your valuable feedback regarding the manuscript's language quality. Prior to our initial submission, the manuscript was edited by a professional language polishing service. However, we entirely agree that several sections needed further refinement to meet the highest standards of clarity and consistency. Following your advice, we have meticulously reviewed and revised the entire text. We focused on eliminating ambiguities, improving sentence flow, and ensuring consistent medical and deep learning terminology throughout the paper. All the linguistic modifications and phrasing adjustments have been clearly highlighted in the revised version of the manuscript. We believe these careful revisions have substantially enhanced the overall readability and presentation of our work.

Reviewer 2 Report

Comments and Suggestions for Authors

This article presents an improved CT deep learning model based on a regional attention mechanism for preoperative multi-class classification of rectal cancer T stages. Positive aspects of the article include:
1. A detailed literature review is provided, including a definition of the problem.
2. The data collection procedure is presented, including the manual segmentation process.
3. The training process, software, and hardware resources are provided.
4. The details, processes, and layers of the proposed model are presented.
5. The results of the multi-class classification model are presented in detail.

Points to consider in the article are as follows:
1. A confusion matrix should be provided for the best results (model). Clinical evaluation of incorrect cases should be performed. It is problematic to classify a normal case as a patient, or to classify a patient as a normal case.
2. Model training error and accuracy graphs should be provided for the best result (model).
3. Class imbalance exists in the dataset. A performance parameter that considers class imbalance should be presented and discussed.
4. GradCam images should be added to observe the tissues being considered for correct and incorrect classification, and evaluated under the supervision of a doctor.

Author Response

Comments 1. A confusion matrix should be provided for the best results (model). Clinical evaluation of incorrect cases should be performed. It is problematic to classify a normal case as a patient, or to classify a patient as a normal case.

Response 1:

We thank the reviewer for this suggestion. We have added the confusion matrix for the optimal model in the revised manuscript (Figure A1). The confusion matrix shows a good diagonal distribution across categories. Severe classification collapse did not occur even in classes with smaller sample sizes, indicating that the model has good feature extraction and discrimination capabilities. Our analysis showed that misclassifications mainly occurred between adjacent stages (e.g., T2 and T3) rather than across multiple stages (e.g., mistaking T1 for T3). This aligns with clinical practice, as imaging differences between adjacent stages are often subtle, and the boundaries can be blurred. In addition, we noted that the model rarely made severe misjudgments across multiple stages. This is an important feature for clinical applications, as it helps reduce the risk of major misdiagnosis. We have added this analysis to the Results section of the revised manuscript. To maintain the conciseness of the main text, detailed analyses have been added to the Appendix and cited in the relevant sections of the main text.

Revised Text in the Main Text (Page 10, Lines 320-326):

Furthermore, a confusion matrix was generated to detail the classification behavior, which revealed that misclassifications primarily occurred between adjacent stages rather than as severe cross-stage errors (Appendix Figure A1). A clinical evaluation of the misclassified cases was also conducted. A representative case of a T3 tumor incorrectly downstaged to T2 by the model is analyzed in Appendix Figure A2, highlighting the inherent limitations of CT soft-tissue contrast in detecting micro-invasion.

Revised Text in the Appendix (Page 17, Lines 506-532):

To further analyze the model's classification behavior, we generated a confusion matrix (Figure A1). The sample distribution for each category is primarily concentrated along the diagonal. This shows that the model maintains good discrimination ability across different lesions, despite a certain degree of class imbalance. Misclassifications mainly occur between adjacent categories (e.g., T2 and T3), with rare instances of severe, multi-stage misjudgments. This performance is consistent with the clinical reality that adjacent-stage lesions often share similar features and have indistinct boundaries. Also, by avoiding severe cross-stage errors, the model demonstrates its safety and potential value in clinical computer-aided diagnosis.

As shown in Figure A2, the model incorrectly downstaged a T3 rectal cancer to T2. This highlights the limitations of deep learning algorithms in identifying minor anatomical invasion. The key pathological distinction between T2 and T3 stages is whether the tumor penetrates the muscularis propria and invades the surrounding perirectal fat. However, due to the inherent soft-tissue contrast and spatial resolution limits of CT, minor invasion in early T3 tumors often appears as bowel wall thickening with blurred margins rather than clear fat involvement. These "pseudo-benign" imaging signs contribute to missed radiological diagnoses. In terms of feature extraction, CNNs build high-level semantic representations through successive convolution and pooling operations. This process inevitably causes the loss of high-frequency edge information. Combined with the partial volume effect during imaging, the microscopic invasion features at the lesion margins are smoothed out. Ultimately, the algorithm relies too heavily on the macroscopic morphology of the main lesion and fails to capture the subtle visual evidence of muscularis penetration, leading to downstaging.
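
The adjacent-stage versus cross-stage analysis described above can be sketched in a few lines. This is only an illustration, not the authors' code: it assumes four ordinal labels (T1 to T4), the function names `confusion_matrix` and `error_breakdown` are hypothetical, and the labels are made up rather than taken from the study.

```python
STAGES = ["T1", "T2", "T3", "T4"]  # assumed ordinal T-stage labels

def confusion_matrix(y_true, y_pred, labels=STAGES):
    """Build a matrix whose rows are true stages and columns are predicted stages."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def error_breakdown(matrix):
    """Split predictions into correct, adjacent-stage and cross-stage groups.

    Adjacent means the predicted stage is one step from the true stage;
    cross-stage means it is two or more steps away.
    """
    counts = {"correct": 0, "adjacent": 0, "cross_stage": 0}
    for i, row in enumerate(matrix):
        for j, n in enumerate(row):
            distance = abs(i - j)
            if distance == 0:
                counts["correct"] += n
            elif distance == 1:
                counts["adjacent"] += n
            else:
                counts["cross_stage"] += n
    return counts

# Made-up labels for illustration only; not data from the study.
y_true = ["T1", "T2", "T2", "T3", "T3", "T3", "T4", "T4"]
y_pred = ["T1", "T2", "T3", "T3", "T3", "T2", "T4", "T2"]

cm = confusion_matrix(y_true, y_pred)
breakdown = error_breakdown(cm)
print(cm)
print(breakdown)
```

Reporting the adjacent and cross-stage counts alongside the matrix makes the claim that severe multi-stage misjudgments are rare directly checkable.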

Comments 2. Model training error and accuracy graphs should be provided for the best result (model).

Response 2:

We thank the reviewer for this suggestion. We have added the loss function and accuracy curves from the model training process to the revised manuscript (Figure A3). As shown in the figure, the training loss decreases gradually with iterations, indicating stable model convergence. Both training and validation accuracies increase steadily and plateau in the later stages. Although the training accuracy is higher than the validation accuracy, suggesting a certain generalization gap, the validation accuracy steadily improves throughout the training process without any significant drop. This shows that the model did not experience severe overfitting. These results indicate that the proposed model has good convergence and generalization capabilities during training. We have added the relevant descriptions to the revised manuscript.

Revised Text in the Main Text (Page 10, Lines 326-328):

The proposed model demonstrated stable convergence and good generalization during the training process, without evidence of severe overfitting (see Appendix Figure A3 for detailed training loss and accuracy curves).

Revised Text in the Appendix (Page 17, Lines 533-542):

Figure A3 displays the changes in the loss function and accuracy during the training process. The training loss decreases gradually with iterations and eventually stabilizes, indicating good model convergence. Also, the accuracies of both the training and validation sets improve steadily and plateau in the later epochs. The trends of the training and validation curves are highly consistent, which shows that the model has good generalization ability without obvious overfitting. The slight fluctuations observed in the validation curve may be related to the inherent heterogeneity and complexity of the medical imaging data.
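
The two properties discussed for the training curves, the gap between training and validation accuracy and the absence of a marked drop in validation accuracy, can be quantified with a short sketch. The function name `summarize_curves` and the per-epoch values are hypothetical and serve only as an illustration; the response does not state how the authors assessed the curves.

```python
def summarize_curves(train_acc, val_acc):
    """Summarize per-epoch accuracy curves.

    final_gap    : training minus validation accuracy at the last epoch.
    max_val_drop : largest fall of validation accuracy below its running peak.
    """
    if not train_acc or len(train_acc) != len(val_acc):
        raise ValueError("curves must be non-empty and of equal length")
    final_gap = train_acc[-1] - val_acc[-1]
    peak = val_acc[0]
    max_val_drop = 0.0
    for value in val_acc:
        peak = max(peak, value)
        max_val_drop = max(max_val_drop, peak - value)
    return {"final_gap": final_gap, "max_val_drop": max_val_drop}

# Made-up accuracy histories for illustration only; not values from the study.
train_acc = [0.50, 0.65, 0.78, 0.85, 0.88]
val_acc = [0.48, 0.60, 0.70, 0.69, 0.74]

summary = summarize_curves(train_acc, val_acc)
print(summary)
```

A small `max_val_drop` together with a moderate `final_gap` would be consistent with the description of a generalization gap without severe overfitting.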

Comments 3. Class imbalance exists in the dataset. A performance parameter that considers class imbalance should be presented and discussed.

Response 3:

We thank the reviewer for pointing out the class imbalance issue. To address this, we have added macro-average metrics (Macro-Precision, Macro-Recall, and Macro-F1-score) in the revised manuscript to better evaluate the model's performance. The results show that the model achieved a macro-average F1-score of 0.7372. Because macro-averaging assigns equal weight to each class, it is not dominated by the sample size of the majority class. This demonstrates that the model maintains satisfactory classification performance on minority classes. In addition, the micro-average F1-score, which equals the overall accuracy in a single-label multi-class classification task, reached 0.7600. Together, these metrics show that despite the class imbalance in the dataset, the model maintains stable classification performance across different T stages. We have added the relevant details to the revised manuscript.

Revised Text in the Main Text (Page 10, Lines 317-320):

To address class imbalance and comprehensively evaluate per-category performance, macro-average metrics were calculated, showing stable discrimination across all T stages (Appendix Table A1).

Revised Text in the Appendix (Page 18, Lines 554-567):

To evaluate the model's performance on the class-imbalanced dataset, we calculated the precision, recall, and F1-score for each category, along with the macro-average and micro-average metrics (Table A1). The macro-average is the unweighted arithmetic mean across all classes. It treats each class equally regardless of sample size. The micro-average aggregates global true positives, false positives, and false negatives. As shown in the results, the model achieved stable performance across all categories. T3 showed the best performance (F1-score = 0.8163). T4 yielded a lower F1-score (0.6667) due to its smaller sample size, but this result remains reasonable. The macro-average F1-score was 0.7372. This indicates that the model avoided severe bias toward the majority class and achieved a good balance. The micro-average F1-score, which is numerically equivalent to the overall accuracy, was 0.7600. This demonstrates the overall classification reliability of the model.
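
The macro- and micro-averaging described above can be illustrated with a minimal sketch. It assumes single-label predictions over four T stages and takes the macro F1-score as the unweighted mean of per-class F1-scores; the function names and the labels are hypothetical and do not reproduce the values in Table A1.

```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives and false negatives per class."""
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_micro_f1(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {label: f1_score(**c) for label, c in counts.items()}
    # Macro: every class contributes equally, regardless of its sample size.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts over all classes before computing F1.
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    micro = f1_score(tp, fp, fn)
    return per_class, macro, micro

# Made-up labels for illustration only; not data from the study.
labels = ["T1", "T2", "T3", "T4"]
y_true = ["T1", "T2", "T2", "T3", "T3", "T3", "T4", "T4"]
y_pred = ["T1", "T2", "T3", "T3", "T3", "T2", "T4", "T2"]

per_class, macro_f1, micro_f1 = macro_micro_f1(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(per_class, macro_f1, micro_f1, accuracy)
```

In the single-label setting every misclassification adds exactly one false positive and one false negative to the pooled counts, which is why the micro-average F1-score coincides with the overall accuracy.
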

Comments 4. GradCam images should be added to observe the tissues being considered for correct and incorrect classification, and evaluated under the supervision of a doctor.

Response 4:

We thank the reviewer for this suggestion. We have added Grad-CAM visualization results to the revised manuscript (Figure A4) to analyze the key regions the model focuses on during classification.

Under the same fine-tuning conditions for downstream tasks, our method shows a more concentrated response in the tumor region compared to the ViT and ConvNeXt models. This effectively reduces over-attention to irrelevant background tissues. The high-response areas of our model align well with the ROIs manually annotated by clinicians. These results demonstrate that the introduced regional attention mechanism effectively guides the model to focus on clinically significant lesion areas, thereby improving its interpretability.

In addition, medical experts preliminarily evaluated these results. They agreed that the feature regions highlighted by the model are consistent with clinical diagnostic logic. We have added this analysis to the revised manuscript.

Revised Text in the Main Text (Page 10, Lines 328-332):

Finally, Grad-CAM visualizations were utilized to assess the interpretability of the model. The results confirmed that the introduced regional attention mechanism effectively guided the model to focus on clinically relevant lesion areas, matching the diagnostic logic of medical experts (Appendix Figure A4).

Revised Text in the Appendix (Page 18, Lines 543-553):

To evaluate the interpretability of the model predictions, we used the Grad-CAM method to visualize the attention regions of different models (Figure A4). Compared to the ViT and ConvNeXt models, our method shows a more concentrated response in the tumor region. It effectively reduces background noise interference and demonstrates higher consistency with manually annotated ROIs. These results indicate that the proposed regional attention mechanism effectively improves the model's ability to localize key lesion areas. Based on a preliminary evaluation by medical experts, the high-attention regions of the model match the actual lesion locations well, reflecting reasonable clinical diagnostic logic.
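
For reference, the core Grad-CAM computation can be sketched as follows: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and a ReLU is applied. The sketch assumes the activations and gradients of the chosen layer have already been extracted from the network. The `roi_energy_fraction` helper is a hypothetical way to quantify agreement with an annotated ROI; the response does not say how the authors measured that agreement. All values are made up.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from K feature maps of size H x W.

    activations, gradients: nested lists indexed as [channel][row][column];
    gradients holds the derivative of the class score w.r.t. each activation.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions with a positive influence on the class score.
    return [[max(0.0, value) for value in row] for row in heatmap]

def roi_energy_fraction(heatmap, roi_mask):
    """Hypothetical agreement measure: share of heatmap mass inside the ROI."""
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0
    inside = sum(
        value
        for heat_row, mask_row in zip(heatmap, roi_mask)
        for value, in_roi in zip(heat_row, mask_row)
        if in_roi
    )
    return inside / total

# Toy 2-channel, 2x2 example; not data from the study.
activations = [[[2, 1], [1, 0]], [[0, 0], [2, 0]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
roi_mask = [[1, 0], [0, 0]]

heatmap = grad_cam(activations, gradients)
fraction = roi_energy_fraction(heatmap, roi_mask)
print(heatmap, fraction)
```

A measure of this kind would let the comparison with ViT and ConvNeXt be reported as a number in addition to the visual assessment by medical experts.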

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have responded in a sufficiently comprehensive manner to all the comments raised. I have no further remarks on the manuscript in its current form.

Comments on the Quality of English Language

The overall quality of the English language is acceptable; however, several sections would benefit from careful revision to improve clarity and consistency.

Reviewer 2 Report

Comments and Suggestions for Authors

My recommendations have been taken into consideration and the article has been updated. I recommend that the article be accepted in its current form.