1. Introduction
Colorectal cancer is the third most common malignancy globally, accounting for approximately 10% of all cancer cases and ranking as the second leading cause of cancer-related mortality [1]. Accurate staging of rectal cancer is essential for treatment planning and prognostication: it guides clinical decisions regarding the timing of surgery, the necessity of neoadjuvant therapy, and appropriate follow-up strategies, and therefore holds substantial clinical value.
Imaging plays a central role in the diagnosis and staging of rectal cancer. Magnetic resonance imaging (MRI) is widely regarded as the gold standard for local staging due to its superior soft-tissue contrast [2,3,4,5]. However, in routine clinical practice, MRI has several limitations, including high cost, longer acquisition and waiting times, and limited accessibility in many healthcare settings. In contrast, computed tomography (CT) is more accessible and faster to perform and, when combined with rectal insufflation [6], can improve bowel distension and lesion visualization, making it a practical alternative in real-world scenarios. Meta-analyses have shown that while MRI generally outperforms CT in local staging accuracy [3,4], optimized CT protocols can achieve clinically acceptable diagnostic performance [7,8], particularly when interpreted by experienced radiologists. The Society of Abdominal Radiology recognizes CT as an important staging tool in settings where MRI is unavailable or contraindicated [7]. However, CT interpretation remains highly dependent on radiologist expertise, and less experienced readers may demonstrate substantial variability in staging accuracy.
Recent advances in artificial intelligence (AI) and deep learning have significantly impacted medical imaging, with successful applications in disease classification, tumor detection, and organ segmentation [9,10,11]. Convolutional neural networks (CNNs), including their three-dimensional extensions (3D CNNs), have shown promise in enhancing CT-based diagnostic accuracy [12,13]. In particular, U-Net and its variants have become state-of-the-art models for medical image segmentation, enabling precise delineation of anatomical structures and tumor boundaries [14,15,16]. Several studies have explored deep learning approaches for tumor detection and segmentation in colorectal and rectal cancer [17,18,19], with most efforts focused on MRI or PET. Recent work has also investigated radiomics and deep learning for response prediction and staging in rectal cancer [9]. However, comprehensive evaluations of AI-based staging using CT imaging remain limited, especially those directly comparing AI models with radiologist performance using pathology as the reference standard.
In this study, we present an integrated AI framework specifically tailored to a dual-phase, dual-position CT protocol. Unlike standard CT approaches, this study leverages the complementary information of prone non-contrast and supine contrast-enhanced acquisitions to assess the feasibility of automated rectal cancer evaluation. The proposed system consists of three components: (1) a Rectal Cancer Detection Convolutional Neural Network (RCD-CNN) for automatic lesion detection, (2) a U-Net-based model for rectal contour delineation and tumor localization, and (3) a 3D CNN (RCS-3DCNN) for staging prediction. Furthermore, AI-based staging results were directly compared with clinical assessments by radiologists, using pathological diagnosis as the reference standard, to evaluate the feasibility of using AI as a decision-support tool in rectal cancer management.
2. Materials and Methods
2.1. Study Design and Patient Population
This retrospective study was approved by the Institutional Review Board of Kaohsiung Medical University Chung-Ho Memorial Hospital (KMUHIRB-20200444), with the requirement for informed consent waived owing to the retrospective nature of the investigation.
Inclusion criteria were as follows:
- (1) newly diagnosed rectal adenocarcinoma confirmed by colonoscopy and biopsy between January 2013 and November 2019;
- (2) pathological T stage of T1–T3;
- (3) availability of complete pre- and post-contrast CT examinations; and
- (4) surgical resection performed within 30 days after CT without any intervening neoadjuvant therapy.
Exclusion criteria were as follows:
- (1) prior neoadjuvant chemotherapy or radiotherapy;
- (2) recurrent rectal cancer;
- (3) severe image artifacts that compromised interpretation;
- (4) incomplete pathological reports; and
- (5) T4 disease at presentation.
Patients with T4 disease were excluded for two reasons. First, the distinction between T2 and T3 presents a greater diagnostic challenge and determines the indication for neoadjuvant therapy. Second, and methodologically more critical, T4 patients typically undergo neoadjuvant chemoradiotherapy prior to surgery. This treatment alters the tumor stage (downstaging), meaning the post-operative pathological stage (ypT) would not serve as an accurate reference standard for the pre-treatment CT images. Excluding these cases ensured a direct and accurate correlation between imaging features and pathological ground truth.
A total of 223 patients met these criteria and were included in the study. Each patient underwent dual-phase, dual-position rectal CT examinations, consisting of (1) a non-contrast prone acquisition and (2) a contrast-enhanced supine acquisition after rectal insufflation. In the prone position, the rectum was elevated, allowing insufflated air to accumulate under gravity, thereby facilitating luminal distension and improving delineation of the rectal wall. Conversely, the contrast-enhanced scan was performed in the supine position, where tumor enhancement could be better visualized, and gravitational shifting of fecal material helped distinguish tumor tissue from residual stool. This dual-position protocol was designed to be complementary, maximizing rectal wall visualization and overall tumor conspicuity.
Pathology was used as the reference standard. The study was designed as a binary classification task (T1/T2 vs. T3) a priori. This grouping aligns with clinical management guidelines, in which the distinction between early-stage (T1/T2) and locally advanced (T3) disease dictates the decision for neoadjuvant therapy. Furthermore, this binary formulation maximized statistical power given the available sample size. To prevent information leakage, dataset partitioning was performed at the patient level, with patients randomly assigned to training, validation, and testing sets in approximately a 7:1:2 ratio (n = 155, 22, and 46, respectively). An independent held-out test subset (n = 46) with both clinical staging and pathology results was used for direct comparison between AI staging and radiologists' clinical staging. An overview of the study workflow, including patient allocation, AI pipeline, and comparison with pathology and radiologists, is illustrated in Figure 1.
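The patient-level partitioning described above can be sketched as follows. This is a minimal illustration only; the function name, random seed, and rounding behavior are assumptions, not the study's actual code:

```python
import random

def split_patients(patient_ids, ratios=(0.7, 0.1, 0.2), seed=42):
    """Partition patients (not slices) into train/val/test so that all
    images from one patient land in exactly one subset, preventing leakage."""
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = round(len(ids) * ratios[0])
    n_val = round(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_patients(range(223))
print(len(train), len(val), len(test))  # 156 22 45 here; the study's split was 155/22/46
```

Splitting by patient ID before any slice extraction is what prevents slices of the same tumor from appearing in both the training and test sets.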
2.2. CT Acquisition Protocol
All CT examinations were performed using a 128-row dual-source, dual-energy multidetector CT scanner (SOMATOM Definition Flash; Siemens Medical Solutions, Forchheim, Germany). Acquisition parameters included a tube voltage of 120 kVp and a quality reference mAs of 250 with automatic exposure control (CAREDose 4D; Siemens Healthcare, Erlangen, Germany).
Each patient was positioned in the left lateral decubitus position, and a 12-F balloon-tipped rectal tube was inserted. Approximately 1200 mL of room air was gently insufflated to achieve adequate colonic distension and facilitate tumor visualization [6]. A scout image was obtained to confirm satisfactory distension, defined as clear visualization of all colonic segments. Additional insufflation was performed when necessary. No insufflation-related complications occurred.
For the non-contrast scan, patients were repositioned prone to allow air accumulation in the rectum. The scan range extended from the third lumbar vertebra (L3) to the perineum. Subsequently, 100 mL of nonionic iodinated contrast material (Ultravist 300; Bayer, Berlin, Germany) was administered intravenously at 3 mL/s using an automatic injector. Spiral CT was then acquired during the portal venous phase (70 s after injection), covering the region from the hepatic dome to the perineum.
2.3. Image Annotation and Preprocessing
Axial CT images were de-identified and annotated using 3D Slicer (version 4.11; https://www.slicer.org; accessed on 6 February 2026). Two board-certified abdominal radiologists (14 and 30 years of experience, respectively) independently delineated tumor regions with pathology reports as the reference standard. Discrepancies were resolved by consensus, and the pathology-informed masks were used as ground truth for training and evaluation. DICOM images were encoded as 12-bit data with pixel values ranging from 0 to 4095. Images were first resampled to 512 × 512 pixels, windowed to Hounsfield units of −300 to 200 to emphasize the rectal wall and tumor boundaries, then normalized to the range [0, 255], and finally converted to PNG format.
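The windowing and 8-bit normalization steps can be sketched as below (a minimal example assuming a linear rescale after clipping; the function name is illustrative):

```python
import numpy as np

def window_and_normalize(hu, lo=-300.0, hi=200.0):
    """Clip a CT slice (in Hounsfield units) to the [lo, hi] window,
    then linearly rescale to 8-bit [0, 255] for PNG export."""
    hu = np.clip(np.asarray(hu, dtype=np.float32), lo, hi)
    return ((hu - lo) / (hi - lo) * 255.0).round().astype(np.uint8)

# Air (-1000 HU), window floor, soft tissue, and window ceiling:
slice_hu = np.array([[-1000.0, -300.0], [-50.0, 200.0]])
print(window_and_normalize(slice_hu))  # [[0 0] [128 255]]
```

Everything at or below −300 HU maps to 0 and everything at or above 200 HU maps to 255, concentrating the 8-bit dynamic range on soft tissue around the rectal wall.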
2.4. Data Augmentation and ROI Processing
To enhance model generalization and reduce overfitting during training, data augmentation was applied, including random rotation (±20°), width shift (±0.05), height shift (±0.05), shear (±0.05), zoom (±0.05), and horizontal flipping. No augmentation was applied during validation or testing. Because rectal tumors occupy a relatively small proportion of each slice, regions of interest (ROIs) were extracted by first performing rectal contour segmentation and then computing the maximum bounding rectangle to isolate tumor areas. Because the number of tumor slices per patient varied, a fixed input sequence length was enforced for the RCS-3DCNN: (1) during training, sequences with fewer than 12 slices were extended to the required length by random duplication with augmentation; (2) during validation and testing, deterministic padding (slice repetition) was used to reach 12 slices. The 12 contrast-enhanced and 12 non-contrast slices were then combined, yielding a fixed total of 24 ROI slices per patient across all datasets. This input length was selected to balance volumetric coverage with computational efficiency; dataset inspection confirmed that 24 slices (covering approximately 12 cm of craniocaudal extent) fully captured all tumors. Slice repetition, rather than zero-padding, was chosen for smaller tumors to preserve spatial texture continuity for the 3D CNN and avoid introducing artificial zero-value artifacts.
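The deterministic padding used at validation/test time can be sketched as follows. The exact repetition order is not specified in the text; cyclic repetition is shown here as one reasonable choice:

```python
def pad_sequence(slices, target_len=12):
    """Deterministically extend a short slice sequence to target_len by
    cyclically repeating existing slices (validation/test behavior);
    longer sequences are truncated to target_len."""
    if len(slices) >= target_len:
        return list(slices)[:target_len]
    out = list(slices)
    i = 0
    while len(out) < target_len:
        out.append(slices[i % len(slices)])
        i += 1
    return out

print(pad_sequence(["s0", "s1", "s2", "s3", "s4"]))
```

Repeating real slices keeps the intensity statistics of the padded volume identical to the tumor's, whereas zero-filled slices would introduce sharp artificial boundaries for the 3D convolutions.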
2.5. Model Architectures
(1) RCD-CNN: The RCD-CNN model was designed to detect tumor regions in axial CT slices. The architecture consisted of three convolution–pooling blocks with channel dimensions of 8, 16, and 8, and kernel sizes of 5 × 5, 3 × 3, and 3 × 3, respectively. A stride of 1 was used in all convolutional layers. Zero padding was not applied, as rectal tumors were consistently located in the central region of the image following rectal insufflation, making peripheral image content less informative. Max pooling layers were used for down-sampling. The convolutional blocks were followed by two fully connected layers with 256 and 1 neurons, respectively. The ReLU activation function was applied throughout the network. The overall network structure is shown in Figure 2.
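Under the stated parameters (stride 1, no zero padding) and assuming 2 × 2 max pooling, the feature map sizes can be traced as below; the resulting flatten size is an inference from these assumptions, not a figure quoted from the paper:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window=2):
    """Output size after non-overlapping max pooling."""
    return size // window

size = 256  # RCD-CNN input, resized from 512 x 512
for kernel in (5, 3, 3):  # three conv-pool blocks, stride 1, no zero padding
    size = pool_out(conv_out(size, kernel))
print(size)  # 30 -> flattening 30 * 30 * 8 channels gives 7200 inputs to the 256-neuron FC layer
```

The trace (256 → 252 → 126 → 124 → 62 → 60 → 30) shows how omitting zero padding progressively discards the less informative image periphery.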
(2) U-Net: The U-Net model was used for rectal contour extraction and tumor localization. The architecture consisted of two main parts: a contracting path (encoder) and an expansive path (decoder). The contracting path was responsible for feature extraction through a series of down-sampling operations achieved via convolutional layers and max pooling. It comprised four blocks, each consisting of two convolutional layers followed by a max pooling operation. With each down-sampling step, the feature map size was reduced by half, resulting in a final feature map size of 32 × 32. The expansive path aimed to reconstruct the segmentation mask from the compressed representation. It also consisted of four blocks that performed up-sampling operations through transpose convolution. Skip connections were used to concatenate feature maps from the contracting path with the corresponding up-sampled feature maps in the expansive path, preserving high-resolution information and enabling recovery of fine details. Finally, a 1 × 1 convolutional layer produced the output with two channels representing the foreground and background. The detailed architecture is illustrated in Figure 3.
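The encoder's halving schedule and a single skip connection can be illustrated as follows (channel counts and the channels-last layout are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Encoder: four down-sampling blocks, each halving the feature map,
# taking a 512 x 512 input down to the 32 x 32 bottleneck.
size, encoder_sizes = 512, []
for _ in range(4):
    size //= 2
    encoder_sizes.append(size)
print(encoder_sizes)  # [256, 128, 64, 32]

# Skip connection at one decoder stage: concatenate the encoder feature
# map with the up-sampled decoder map along the channel axis, so the
# decoder sees both coarse semantics and fine spatial detail.
encoder_feat = np.zeros((64, 64, 128), dtype=np.float32)
upsampled = np.zeros((64, 64, 128), dtype=np.float32)
merged = np.concatenate([encoder_feat, upsampled], axis=-1)
print(merged.shape)  # (64, 64, 256)
```

The channel-wise concatenation is what lets the expansive path recover boundaries at the original resolution despite the 16-fold spatial compression at the bottleneck.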
(3) RCS-3DCNN: The RCS-3DCNN was developed for staging prediction, consisting of three 3D convolutional layers and two 3D max-pooling layers. The first layer had 8 channels with a filter size of 3 × 3 × 3, followed by the second layer with 16 channels and the same filter size. These two sets of 3D convolutions extracted spatial-temporal features, followed by a 3D max-pooling layer for down-sampling. The output was then connected to the third convolutional layer with 8 channels and a filter size of 3 × 3 × 3. Another 3D max-pooling layer was applied to further extract features. Batch normalization was incorporated after each convolutional layer to stabilize and normalize the learning process, contributing to improved training and generalization performance. The output from the final max-pooling layer was flattened and connected to three fully connected layers with 256, 32, and 1 neuron, respectively. These fully connected layers captured higher-level representations and produced a one-dimensional output for binary classification between T12 and T3. By employing 3D convolutions and max pooling, the RCS-3DCNN architecture could effectively capture spatial and temporal information from the dual-phase CT image sequences. The architecture of the proposed 3D network is shown in Figure 4.
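Assuming 'same'-padded 3 × 3 × 3 convolutions and 2 × 2 × 2 max pooling (neither is stated explicitly in the text), the volume shapes through the network can be traced as:

```python
def pool3d(shape, window=2):
    """Halve each spatial dimension (depth, height, width)."""
    return tuple(s // window for s in shape)

# Input: 24 ROI slices resized to 224 x 224 -> (depth, height, width).
# With 'same' padding, only the two max-pooling layers change the shape.
shape = (24, 224, 224)
shape = pool3d(shape)  # after conv(8) -> conv(16) -> pool: (12, 112, 112)
shape = pool3d(shape)  # after conv(8) -> pool: (6, 56, 56)
print(shape)
```

Pooling across the depth axis as well as height and width is how the network aggregates evidence of mural invasion across neighboring slices rather than judging each slice in isolation.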
2.6. Model Training and Implementation
Training was conducted on workstations with Intel i9-9900K CPU (Intel, Santa Clara, CA, USA), 64 GB RAM, and NVIDIA RTX 2080Ti GPU (NVIDIA, Santa Clara, CA, USA) under Linux/Windows. Batch sizes were set to 12 for RCD-CNN, 4 for U-Net, and 8 for RCS-3DCNN. Training was set to a maximum of 200 epochs.
For RCD-CNN, images were resized from 512 × 512 to 256 × 256 before training. The Adam optimizer was used with an initial learning rate of 0.001. A class weight ratio of 1:5 was applied to address the imbalance between tumor and non-tumor slices, and the loss function was weighted binary cross-entropy.
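The 1:5 class weighting can be sketched as a weighted binary cross-entropy. This is a plain-Python illustration of the loss behavior; the original implementation presumably used a framework's built-in class-weight mechanism:

```python
import math

def weighted_bce(y_true, p_pred, w_pos=5.0, w_neg=1.0, eps=1e-7):
    """Binary cross-entropy with a 1:5 negative:positive class weight,
    penalizing missed tumor slices more heavily than false alarms."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)

# A confident miss on a tumor slice costs 5x a confident false alarm:
miss = weighted_bce([1], [0.1])
false_alarm = weighted_bce([0], [0.9])
print(round(miss / false_alarm, 2))  # 5.0
```

The weighting counteracts the slice-level imbalance: most axial slices in a scan contain no tumor, so an unweighted loss would reward the trivial "no tumor" prediction.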
For U-Net, training was performed separately for rectal contour extraction (learning rate 3 × 10⁻⁵) and tumor localization (learning rate 1 × 10⁻⁵), both using the Adam optimizer. Dice coefficient loss was employed. ROI preprocessing was applied before training. Early stopping was applied if validation loss did not improve for 10 consecutive epochs, and the learning rate was adjusted using the ReduceLROnPlateau strategy (patience = 5).
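A minimal soft Dice loss consistent with this description might look as follows (the smoothing constant is an assumption; it is commonly added to avoid division by zero on empty masks):

```python
import numpy as np

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss: 1 - (2|X∩Y| + s) / (|X| + |Y| + s), where `pred`
    holds foreground probabilities and `target` is a binary mask."""
    pred = pred.ravel()
    target = target.ravel()
    intersection = (pred * target).sum()
    dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
    return 1.0 - dice

mask = np.array([[0, 1], [1, 1]], dtype=np.float32)
print(dice_loss(mask, mask))  # perfect overlap -> 0.0
```

Unlike pixel-wise cross-entropy, the Dice loss scores overlap relative to region size, which keeps small structures such as rectal tumors from being swamped by the background.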
For RCS-3DCNN, sequences of 24 ROI slices per patient were used as input. Before feeding into the network, the extracted ROIs were resized to 224 × 224 pixels to match the input dimensions of the RCS-3DCNN architecture. The Adam optimizer was applied with an initial learning rate of 0.01 and exponential decay. Binary cross-entropy was used as the loss function, with batch normalization to stabilize training. Early stopping and model checkpointing were employed to save the best-performing weights.
2.7. Evaluation Metrics and Statistical Analysis
For RCD-CNN, accuracy, recall, and precision were calculated. For U-Net, the Dice similarity coefficient was reported. For RCS-3DCNN, the primary endpoint was distinguishing between T3 and T1/T2 at the patient level using pathology as the reference. Sensitivity, specificity, and accuracy were reported with 95% confidence intervals (Wilson method). Binary classification metrics were specifically selected to facilitate a direct, head-to-head comparison with the categorical staging results provided by the radiologists, ensuring consistency in performance evaluation. Agreement was assessed with Cohen's kappa, and McNemar's exact test (two-sided) was used to compare paired proportions. For baseline characteristics, continuous variables were compared using the independent t-test or Mann–Whitney U test, as appropriate, and categorical variables were compared using the chi-square or Fisher's exact test. Statistical significance was defined as p < 0.05.
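The Wilson score interval can be computed directly from the confusion counts. As an illustration, 37 of 46 correct corresponds to the reported 80.4% staging accuracy on the held-out test set:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_ci(37, 46)
print(round(lo, 3), round(hi, 3))  # roughly (0.67, 0.89)
```

At n = 46 the interval is wide, which is why the Wilson method (better behaved than the normal approximation for small samples and extreme proportions) is the appropriate choice here.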
2.8. Software and Availability
All preprocessing and annotation tasks were conducted using publicly available tools, including 3D Slicer (version 4.11; https://www.slicer.org; accessed on 6 February 2026). Source code is available upon reasonable request to the corresponding author. This study was not preregistered. Generative AI tools (ChatGPT, OpenAI, San Francisco, CA, USA) were used only for language editing and were not involved in study design, data acquisition, analysis, or interpretation.
4. Discussion
This study developed and validated an integrated AI system for CT-based rectal cancer detection and staging. The framework demonstrated robust performance across all tasks, including lesion detection (97.6% accuracy), segmentation (Dice coefficient of 0.897 for rectal contours and 0.856 for tumors), and staging (80.4% accuracy). Notably, AI staging accuracy was comparable to that of experienced radiologists (82.6%, p = 1.0). Agreement analysis showed substantial concordance for radiologists (κ = 0.66) and moderate concordance for AI (κ = 0.59) when compared with pathology.
4.1. Technical Contributions and Comparison with Prior Work
The primary contribution of this study lies in developing an integrated three-stage pipeline that provides end-to-end automation, from raw CT images to staging predictions. Unlike previous studies that addressed isolated tasks [16,17,18,19], the proposed framework replicates the entire clinical workflow.
Our dual-phase CT protocol combines non-contrast prone and contrast-enhanced supine acquisitions with rectal insufflation [6], addressing CT's intrinsic limitation of lower soft-tissue contrast compared with MRI. The prone acquisition facilitates gravitational air accumulation, improving luminal distension, whereas the supine contrast-enhanced acquisition enables better tumor enhancement and discrimination from fecal material.
The segmentation performance (Dice 0.897 for rectal contours, 0.856 for tumors) compares favorably with previous segmentation studies (typically 0.70–0.80) [16,17,18] and approaches the accuracy achieved by MRI-based systems [17,18,19]. The RCS-3DCNN architecture leverages volumetric information across 24 slices to capture three-dimensional tumor extent, enabling precise assessment of mural invasion depth, which is critical for differentiating T2 from T3 disease [13].
The staging accuracy of 80.4% is consistent with CT performance reported in systematic reviews (75–85%) [4], although it remains slightly lower than that of MRI-based staging (85–90%) [4,5]. The direct head-to-head comparison with radiologists using pathology as the reference confirms the clinical feasibility of AI-assisted staging.
4.2. Clinical Rationale for CT-Based AI
While MRI remains the reference standard for rectal cancer staging [2,4,5,20], CT-based AI addresses several important clinical needs. CT is routinely performed in newly diagnosed colorectal cancer patients for initial staging, is widely available, and can be used when MRI is contraindicated (e.g., pacemakers, claustrophobia) or unavailable. In resource-limited settings, CT may represent the only accessible cross-sectional imaging modality [6,7].
CT-based AI can play complementary roles in: (1) triaging patients for urgent MRI, (2) providing initial staging while awaiting MRI, (3) serving as a primary staging tool when MRI is unavailable, (4) supporting less experienced radiologists, and (5) functioning as a quality-assurance tool to reduce inter-observer variability. These applications are consistent with AI-assisted workflows designed to augment, rather than replace, human expertise [9,10,21]. Furthermore, the proposed framework may be relevant to radiotherapy workflows, in which CT remains the standard modality for dose calculation and target volume definition. Although radiotherapy applications were not directly evaluated in this study, the tumor localization performance (Dice = 0.856) suggests that the generated masks could serve as an initial reference for gross tumor volume (GTV) delineation, with final contours determined by expert review.
4.3. Error Patterns and Human–AI Collaboration
Error analysis revealed distinct decision biases rather than complementary error patterns (Figure 5). Radiologists tended to overstage (7/8 errors, 87.5%, were T1–T2 cases misclassified as T3), likely reflecting clinical caution when mural invasion was equivocal. In contrast, AI errors were balanced (4/9 over-staging, 5/9 under-staging), suggesting consistent decision boundaries without clinical bias.
Importantly, when both radiologist and AI predictions agreed (n = 35, 76.1% of cases), the concordant prediction was correct in 94.3% (33/35 against pathology). This finding suggests that concordant predictions may indicate high-confidence cases, whereas discordant cases should prompt further review or MRI confirmation. Such hybrid human–AI workflows have shown promise in other imaging domains [9,21] and merit prospective validation.
The T2–T3 boundary remains a well-recognized diagnostic challenge even on MRI [22], as distinguishing subserosal invasion (T3) from muscularis propria involvement (T2) requires subtle assessment of tissue planes. Representative misclassification examples (Figure 5) illustrate this diagnostic difficulty. Although our AI model captured relevant features, it could not fully overcome this inherent limitation.
4.4. Limitations and Future Directions
This study has several limitations. First, the single-center retrospective design with a modest sample size (n = 223) limits generalizability; the current results should be regarded as a proof-of-concept feasibility study, and external validation on multi-center cohorts with heterogeneous scanners and acquisition protocols is essential prior to clinical implementation. Although rigorous data augmentation was employed to mitigate overfitting, k-fold cross-validation was not performed, in order to prioritize a strictly held-out test set for the head-to-head comparison with radiologists. We also employed lightweight custom architectures rather than deep pre-trained networks (e.g., ResNet or EfficientNet) to prevent overfitting given the dataset size; because our primary contribution lies in the clinically motivated dual-phase CT workflow rather than architectural novelty, a standard U-Net was adopted as a strong, well-validated baseline. Future studies with larger cohorts will explore transfer learning and more complex backbones to potentially enhance feature extraction. Second, the disease spectrum was restricted to T1–T3 stages, excluding T4, post-neoadjuvant, and mucinous histologic subtypes, thereby limiting applicability to broader real-world scenarios.
Third, while AI staging accuracy approximated radiologist performance (80.4% vs. 82.6%), it did not exceed it. The clinical value of AI lies in consistent decision support—offering reproducible assessment, reducing inter-observer variability, flagging discordant cases, and operating continuously without fatigue. These advantages require demonstration through prospective workflow studies. Fourth, this study did not directly compare CT-based AI with MRI-based staging or explore multi-modal fusion approaches, which may further enhance performance. Fifth, practical deployment metrics such as inference time, PACS integration, workflow efficiency, and user acceptance were not formally evaluated.
Future research should prioritize (1) multi-center external validation across diverse CT protocols, (2) prospective reader studies comparing radiologist performance with and without AI assistance, (3) expansion to T4 staging, nodal evaluation, and post-treatment assessment, (4) investigation of CT-based AI as a triage or multi-modal prediction tool, and (5) clinical impact and cost-effectiveness analyses to demonstrate real-world value and inform implementation strategies. Regarding model interpretability, the current framework relies on the U-Net-generated segmentation masks to visualize tumor location; this explicit delineation provides radiologists with direct evidence of the region being analyzed. Future iterations may incorporate additional attention visualization techniques, such as Grad-CAM, specifically for the staging network. Moreover, to address variability across CT scanners and protocols, future work will explore advanced domain adaptation and dynamic learning strategies. Recent methodologies in intelligent control and unsupervised adaptation, such as those proposed for dynamic network environments [23] and prototype learning [24], offer conceptual frameworks that could be adapted to improve the robustness of medical AI systems in complex, real-world clinical settings.
Finally, the translation of deep learning models from research to routine clinical practice requires careful assessment of algorithmic safety and robustness. Recent studies have highlighted that medical AI systems in cross-sectional imaging, including both MRI and CT, may be vulnerable to specific reliability issues. These include hallucination-like failure modes, in which models produce confident but incorrect predictions, and sensitivity to adversarial perturbations, in which subtle input modifications may lead to erroneous outputs [25,26,27]. Although our study employed a supervised learning framework with radiologist-verified ground truth to improve label reliability, safe clinical deployment will still require robustness evaluation, uncertainty estimation, and human-in-the-loop verification strategies to ensure patient safety in real-world settings.