Proceeding Paper

Advancing Colorectal Polyp Detection in Colonoscopy Through Region-Guided Deep Learning †

Department of Computer Science and Engineering, Faculty of Science and Engineering, International Islamic University Chittagong, Kumira, Chittagong 4318, Bangladesh
*
Author to whom correspondence should be addressed.
Presented at the 6th International Electronic Conference on Applied Sciences, 9–11 December 2025; Available online: https://sciforum.net/event/ASEC2025.
Eng. Proc. 2026, 124(1), 118; https://doi.org/10.3390/engproc2026124118
Published: 22 May 2026
(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)

Abstract

Accurate detection of colorectal polyps during colonoscopy is key to effective prevention and treatment, yet it can be hindered by manual identification. Colorectal polyps are abnormal tissue growths in the colon or rectum, and their varied sizes, shapes and textures can make them difficult to find. Researchers have therefore turned to deep learning techniques, and the YOLOv11 detection framework in particular, to automate the recognition and accurate identification of these abnormal growths. Specifically, the proposed method modifies the conventional YOLOv11 detection workflow by generating bounding box annotations from polyp segmentation masks, applying region-aware data preprocessing and augmentation, and training the detector under region-guided supervision to enhance localization precision and detection robustness. Polyp segmentation masks are used to generate bounding box annotations, which not only provide exact spatial supervision but also avoid the inconsistency of manual box labeling. Region-aware data preprocessing and augmentation focus on polyp-relevant regions and suppress background noise, which leads to clearer feature discrimination for small or irregular polyps. Additionally, region-guided supervision serves as explicit guidance for localizing objects within the anatomical polyp regions, which helps achieve accurate boundaries and prevent false detections. The proposed YOLOv11-based polyp detection system was tested and evaluated on the publicly available Kvasir-SEG dataset, which comprises annotated colonoscopy images. Enhanced data pre-processing and exhaustive training with an appropriate choice of hyper-parameters strengthened the reliability and usability of the model. The model achieved an Intersection over Union score of 0.9764 and an overall accuracy of 99.00%, with well-balanced precision, recall and F1-scores. With a mean Average Precision (mAP) of 0.9937 at an Intersection over Union threshold of 0.5 and 0.9935 over the full range of thresholds from 0.5 to 0.95, the model detects polyps consistently and reliably. The proposed system was also compared with the Segment Anything Model (SAM), YOLO-Seg, and SAM2, which confirmed the efficacy of the method.

1. Introduction

Colorectal cancer (CRC) continues to pose a significant public health risk, with approximately 1.9 million estimated new cases and a relatively high mortality rate. CRC is the third most commonly diagnosed cancer in the world, according to estimates provided by the GLOBOCAN database [1]. Projections indicate a substantial increase by 2040 and underscore the need to develop prevention, detection and treatment strategies [1,2]. CRC frequently starts as small, benign growths in the colon called polyps. Although these polyps are benign at first, as shown in Figure 1, they can turn cancerous if they are not detected and removed in the early stages. Thus, early and accurate diagnosis is very important for the prevention and treatment of CRC.
Recent advances in artificial intelligence, especially deep learning, have shown significant potential in improving polyp detection and segmentation. CNN-based object detectors like the YOLO family have achieved promising results, and segmentation models such as SAM and SAM2 enable precise polyp boundary delineation [4,5]. Despite these advances, existing methods often struggle with small, low-contrast polyps, boundary refinement, and real-time execution, limiting clinical applicability. To address these challenges, this study proposes a region-guided hybrid framework combining YOLOv11 and SAM2, aiming to enhance detection accuracy, improve segmentation precision, and support early CRC prevention through AI-assisted endoscopy.

2. Related Work

In recent years, deep learning-based object detection has greatly improved the accuracy and efficiency of colorectal polyp recognition. Earlier frameworks such as YOLOv5 and YOLOv7 have already shown very good real-time performance in recognizing polyps in colonoscopy images [6,7], while YOLOv8 further improves efficacy on more challenging datasets, e.g., Kvasir-SEG and CVC-ClinicDB, by processing richer features through deeper convolutional blocks and a decoupled head design [2,8]. YOLOv8 has also been applied to colorectal polyp detection and localization on the Kvasir dataset at optimal training epochs [9]. For classifying colorectal polyps, PolypNet, a novel shallow CNN combined with a self-attention mechanism, performs well on Kvasir-v1. Compared against VGG16, ResNet50, DenseNet-121 and MobileNet-v3, PolypNet matched the performance of ResNet50 and outperformed the others with a much smaller, lighter framework, but it lacks explicit localization and segmentation capability [10]. The recent YOLOv11 architecture adopts transformer-based modules and improves backbone efficiency, achieving better detection accuracy with lower inference latency than previous versions [11,12]. These trends have placed YOLO models at the state of the art for endoscopic object detection tasks. Alongside this, segmentation models have aimed for pixel-level accuracy in polyp boundary delineation. The encoder-decoder structure of U-Net, whose skip connections preserve spatial information, makes it a powerful benchmark model in medical image segmentation [13]. Successive variants such as ResUNet and Attention U-Net enhanced boundary accuracy and contextual information [13,14].
Segmentation-based models such as ResUNet++ and Attention U-Net enhance boundary delineation through residual connections and attention mechanisms, but their encoder–decoder architectures are computationally intensive and less suitable for real-time deployment. More recently, Meta’s Segment Anything Model (SAM) introduced a prompt-based approach that can segment diverse image regions with limited supervision [7]. Its refined version, SAM2, improved mask quality and reduced computational complexity, enabling faster and more flexible medical segmentation pipelines. YOLO's strengths lie in speed and localization, whereas U-Net and SAM-based methods offer high segmentation accuracy but lack real-time capability [15,16]. This trade-off motivates the design of a hybrid, region-guided framework, with YOLOv11 for fast detection and SAM2 for fine-grained segmentation, which targets real-time inference together with accurate boundary positioning for better colorectal cancer screening results. Single-stage detection models (e.g., the YOLO series) form a third route alongside classification and segmentation methods. In general, although classification and segmentation methods perform well in semantic understanding and boundary accuracy, they are not optimized for real-time localization of polyps. By contrast, the proposed model follows a region-guided detection-to-segmentation path, using YOLOv11 for accurate real-time localization of polyps and SAM2 for fine-grained segmentation, thereby addressing the limitations of existing methods in both speed and clinical applicability.

3. Methods

In this section, we present the workflow of our model. The pipeline begins with preprocessing of colonoscopy images, followed by bounding box annotation, training of detection and segmentation models, and finally evaluation on the test set.

3.1. Dataset Description

This study uses the Kvasir-SEG dataset [17], a set of colonoscopy images made publicly available by Simula Research Laboratory and Vestre Viken Hospital Trust. The dataset contains 1000 images along with ground-truth segmentation masks and the corresponding bounding box annotations. The images show different kinds of polyps at different sizes, shapes, structures and lighting conditions, which makes this a suitable dataset for testing deep learning-based detection methods. All images were labeled by experienced endoscopists and come with a binary mask that marks the polyp region. There are four folders: (i) the original images, (ii) binary masks, (iii) annotated images and (iv) bounding-box coordinate files, as shown in Table 1.
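Deriving a bounding box from a binary mask, as the method does for its annotations, can be illustrated with a minimal sketch. The function name `mask_to_yolo_box`, the single class index 0, and the normalized YOLO label layout (class, x_center, y_center, width, height) are assumptions made for illustration; the paper does not give its conversion code. The sketch also assumes a 2-D mask and encloses all foreground pixels in one box, so images with several separate polyps would need a connected-component step first.

```python
import numpy as np

def mask_to_yolo_box(mask: np.ndarray, class_id: int = 0):
    """Convert a 2-D binary mask to (class, x_center, y_center, width, height).

    All coordinates are normalized to [0, 1] by the mask size.
    Returns None when the mask has no foreground pixels.
    """
    ys, xs = np.nonzero(mask > 0)
    if ys.size == 0:
        return None
    h, w = mask.shape
    x_min, x_max = xs.min(), xs.max() + 1  # +1 so the box covers the last pixel
    y_min, y_max = ys.min(), ys.max() + 1
    return (
        class_id,
        float((x_min + x_max) / 2 / w),
        float((y_min + y_max) / 2 / h),
        float((x_max - x_min) / w),
        float((y_max - y_min) / h),
    )

# Toy 10x10 mask with foreground in rows 2..5 and columns 4..7 (hypothetical data).
toy_mask = np.zeros((10, 10), dtype=np.uint8)
toy_mask[2:6, 4:8] = 255
box = mask_to_yolo_box(toy_mask)
print(box)
```

Because the box is computed directly from the mask extent, two annotators cannot disagree on its corners, which is the consistency benefit the method relies on.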
Some sample images of the dataset are shown in Figure 2. The dataset was split into training (70%), validation (20%), and testing (10%) subsets, ensuring consistent class distribution across splits. Figure 3 shows the proposed workflow diagram.
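A reproducible 70/20/10 split can be sketched as follows. The file names are hypothetical placeholders for the 1000 Kvasir-SEG images, and the use of Python's `random.Random` with seed 42 is an assumption borrowed from the training seed reported later; the paper does not state how the split itself was drawn.

```python
import random

def split_dataset(items, train=0.7, val=0.2, seed=42):
    """Shuffle deterministically and cut into train/val/test lists."""
    items = sorted(items)      # fixed starting order, independent of input order
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],  # remainder becomes the test set
    )

# Hypothetical file names standing in for the dataset images.
names = [f"img_{i:04d}.jpg" for i in range(1000)]
train_set, val_set, test_set = split_dataset(names)
print(len(train_set), len(val_set), len(test_set))  # 700 200 100
```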

3.2. YOLOv11 Detection

The core detection component is the YOLOv11 model, an advanced, anchor-free, single-stage network optimized for high-speed and high-accuracy detection. Building on YOLOv8 and YOLOv9, it enhances spatial feature extraction and localization efficiency. The architecture comprises three parts:
  • Backbone (C3k2): Extracts image features using a Cross-Stage Partial (CSP) design that fuses multi-scale information, enabling detection of small or flat polyps.
  • Neck (PAN-FPN): Merges shallow and deep features for robust multi-scale detection across varying polyp sizes and textures.
  • Head (Decoupled, anchor-free): Predicts objectness, bounding boxes, and class probabilities separately, improving convergence and generalization without relying on anchor boxes.
The architecture of YOLOv11 is shown in Figure 4. The YOLO models have undergone many developments from YOLOv8 to YOLOv11, mostly focused on improving feature extraction, efficiency, and detection accuracy. YOLOv11 refines the backbone, neck, and training efficiency, which improves its accuracy compared to YOLOv8 and YOLOv9. This makes YOLOv11 suitable for complex object detection tasks such as medical image analysis and polyp detection. Table 2 summarizes the key differences among these versions.
Colorectal polyp detection requires real-time performance, high localization accuracy and good recognition of small and low-contrast lesions. These requirements make YOLOv11 a good fit: its C3k2 backbone provides better multi-scale feature representation for polyps of different sizes, and its optimized PAN-FPN feature fusion improves contextual understanding of complex endoscopic scenes. Moreover, the decoupled, anchor-free detection head used in YOLOv11 yields more stable bounding box regression and greater localization accuracy than previous YOLO versions, which makes it especially applicable to accurate and stable polyp detection in clinical colonoscopy practice.

Training on YOLOv11

The YOLOv11-Detection model was trained on the training split of the Kvasir-SEG dataset (700 images) for 500 epochs, using a batch size of 16 and an input resolution of 640 × 640 pixels. Training employed the SGD optimizer with a learning rate of 0.01 and momentum of 0.937. Loss weights were set to 7.5 for bounding box, 0.5 for classification, and 1.5 for Distribution Focal Loss (DFL). Data augmentation included translation (0.1), scaling (0.5), horizontal flipping (0.5), and HSV color jittering. Mosaic augmentation was applied during the first 10 epochs, and automatic mixed precision (AMP) was enabled, with a random seed of 42, to ensure training efficiency and reproducibility. The YOLOv11 nano variant (yolo11n-seg.pt) was selected as the base model. It has about 3.2 M trainable parameters and requires 8.8 GFLOPs at an input resolution of 640 × 640. This lightweight design keeps the computational cost low while retaining strong detection and segmentation performance, making it appropriate for real-time embedded clinical deployment.
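The reported hyperparameters map onto the training arguments of the Ultralytics package roughly as sketched below. This is an illustration under assumptions, not the authors' script: the dataset config path `kvasir_seg.yaml` and the helper names are hypothetical, the HSV jitter strengths are left at library defaults because the paper does not give them, and the mosaic schedule is omitted because its mapping to a library option is not stated.

```python
def build_train_args(data_yaml: str) -> dict:
    """Collect the hyperparameters reported in the text as Ultralytics train kwargs."""
    return {
        "data": data_yaml,
        "epochs": 500,
        "batch": 16,
        "imgsz": 640,
        "optimizer": "SGD",
        "lr0": 0.01,        # initial learning rate
        "momentum": 0.937,
        "box": 7.5,         # bounding box loss weight
        "cls": 0.5,         # classification loss weight
        "dfl": 1.5,         # Distribution Focal Loss weight
        "translate": 0.1,
        "scale": 0.5,
        "fliplr": 0.5,      # horizontal flip probability
        "amp": True,        # automatic mixed precision
        "seed": 42,
    }

def train(data_yaml: str, weights: str):
    """Launch training; needs the ultralytics package and a prepared dataset."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected alone
    model = YOLO(weights)
    return model.train(**build_train_args(data_yaml))

# Hypothetical dataset config path, used here only to show the assembled arguments.
args = build_train_args("kvasir_seg.yaml")
print(args["optimizer"], args["lr0"], args["epochs"])
```

Calling `train("kvasir_seg.yaml", "yolo11n-seg.pt")` would start a run with the checkpoint named in the text, provided the dataset file exists and its labels match the chosen task.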

3.3. YOLOv11 Segmentation

While YOLOv11 effectively detects polyps, its bounding boxes alone lack sufficient boundary precision for medical analysis. The YOLOv11-Seg variant enhances this capability by introducing an additional segmentation head that predicts pixel-level masks alongside detections. It integrates multi-scale feature maps to capture fine spatial details and employs a prototype-based mechanism that efficiently generates accurate masks with reduced memory consumption. Trained on the same dataset and augmentations as YOLOv11-Det, YOLOv11-Seg enables comprehensive end-to-end detection and segmentation within a unified framework.
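The prototype-based mechanism can be sketched in a few lines: each detection carries a small vector of coefficients, and its mask is a sigmoid of the coefficient-weighted sum of shared prototype maps. The shapes below (32 prototypes at 160 × 160) and the function name are illustrative assumptions, and the sketch leaves out the cropping to the box and the upsampling that a full pipeline would apply.

```python
import numpy as np

def assemble_masks(coeffs: np.ndarray, prototypes: np.ndarray, threshold: float = 0.5):
    """Combine per-detection coefficients with shared prototype maps.

    coeffs: (n, k) mask coefficients, one row per detection.
    prototypes: (k, h, w) prototype maps shared by all detections.
    Returns boolean masks of shape (n, h, w).
    """
    k, h, w = prototypes.shape
    logits = coeffs @ prototypes.reshape(k, h * w)  # (n, h*w) linear combination
    probs = 1.0 / (1.0 + np.exp(-logits))           # sigmoid
    return (probs > threshold).reshape(-1, h, w)

# Random stand-ins for network outputs (hypothetical values).
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(32, 160, 160))
coeffs = rng.normal(size=(3, 32))
masks = assemble_masks(coeffs, prototypes)
print(masks.shape)  # (3, 160, 160)
```

Memory is saved because only k prototype maps are stored per image, while each additional detection costs just k numbers.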

3.4. SAM

The Segment Anything Model (SAM), proposed by Meta AI, is a foundation model for universal image segmentation that generates masks from prompts of several forms, such as points and bounding boxes. In this work, the bounding boxes predicted by YOLOv11 are used as spatial priors, which provide SAM with accurate initial mask outputs at the detected polyp areas. The architecture of SAM with region-guided bounding box prompts is shown in Figure 5.
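Passing a detector box to SAM as a prompt can be sketched as below, using the `SamPredictor` interface of the `segment_anything` package. The helper names, the `vit_b` model type, and the checkpoint argument are assumptions for illustration; the paper does not state which SAM variant or wrapper code was used. The first helper converts a normalized YOLO box to the pixel `[x1, y1, x2, y2]` format that the box prompt expects.

```python
import numpy as np

def yolo_to_xyxy(box, img_w: int, img_h: int) -> np.ndarray:
    """Convert normalized (x_center, y_center, width, height) to pixel [x1, y1, x2, y2]."""
    xc, yc, w, h = box
    return np.array(
        [(xc - w / 2) * img_w, (yc - h / 2) * img_h,
         (xc + w / 2) * img_w, (yc + h / 2) * img_h],
        dtype=np.float32,
    )

def segment_with_box(image_rgb: np.ndarray, box_xyxy: np.ndarray,
                     checkpoint: str, model_type: str = "vit_b"):
    """Run SAM with a single box prompt; needs segment_anything and a checkpoint file."""
    from segment_anything import SamPredictor, sam_model_registry  # lazy import
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # expects an HxWx3 uint8 RGB array
    masks, scores, _ = predictor.predict(box=box_xyxy, multimask_output=False)
    return masks[0], float(scores[0])

# Hypothetical detection centered in a 640x480 frame.
prompt = yolo_to_xyxy((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480)
print(prompt)
```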

3.5. SAM2

Segment Anything Model 2 (SAM2) extends the SAM architecture into an improved pipeline for image and video segmentation, with more spatially accurate and temporally consistent predictions. Its architectural improvements make it especially strong for medical image analysis, particularly for colonoscopy video applications. Figure 6 shows the pipeline of the model used for polyp segmentation.

4. Results and Discussion

Experiments were performed on the Kvasir-SEG dataset, which contains 1000 colonoscopy images with pixel-wise annotations of polyps. All models were trained and tested under exactly the same hardware settings and training hyperparameters, and YOLOv11 was compared with the SAM and SAM2 methods.

4.1. Experimental Environment

Experiments were performed on a system with 16 GB RAM and an NVIDIA A1000 GPU using Python 3.10. YOLOv11 models were trained in PyTorch 2.9.0 with CUDA backends, and the SAM models used the official pretrained checkpoints. The data preparation and training procedure are the same as described above. Intuitively, the new features also enable better mixing of class information when learning and associating their distribution. Libraries such as Ultralytics 8.4.48, NumPy 2.0.2, Pandas 2.2.2, OpenCV 4.12.0.88, TensorFlow 2.19.0, Matplotlib 3.10.0, PIL 12.0.0, scikit-learn, IPython 7.34.0 and glob were employed for pre-processing, modeling and analysis.

4.2. Evaluation Metrics

The detection ability of the models was assessed with a confusion matrix, which classifies predictions into true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP denotes a correctly detected polyp, FP denotes a non-polyp region misclassified as a polyp, and FN denotes a missed polyp. System performance was evaluated using precision, recall, F1-score, mAP, accuracy and Intersection over Union (IoU). These measures are defined as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
$$\mathrm{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$
Precision is the fraction of detected polyps that are truly positive, showing how well the model avoids false positives. Recall measures how well the model retrieves all actual polyps; a high value indicates that fewer cases are missed and supports better early detection. The F1-Score balances precision and recall, offering a single measure of overall performance. Mean Average Precision (mAP) summarizes precision and recall across confidence levels, and in this report we use mAP@0.5 and mAP@0.5:0.95 to assess detection quality. Intersection over Union (IoU) is computed between the predicted and ground-truth bounding boxes and measures how well the detected region overlaps with the real polyp location.
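The definitions above translate directly into code. The sketch below is a plain-Python illustration with hypothetical counts and boxes (not values from the paper); the function names are chosen here for readability, and boxes are assumed to be `[x1, y1, x2, y2]` corner coordinates.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> dict:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

def box_iou(a, b) -> float:
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_average_precision(ap_values) -> float:
    """mAP as the mean of per-class (or per-threshold) AP values."""
    return sum(ap_values) / len(ap_values)

# Hypothetical counts and boxes for illustration only.
m = detection_metrics(tp=90, fp=10, fn=5, tn=95)
iou = box_iou([0, 0, 10, 10], [5, 5, 15, 15])
print(m, round(iou, 4))
```

For mAP@0.5 a detection counts as a true positive when `box_iou` with its matched ground-truth box is at least 0.5; mAP@0.5:0.95 averages the result over thresholds from 0.5 to 0.95.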

4.3. Performance of the Applied Models

The proposed model for fully automatic polyp detection in colonoscopy images, based on YOLOv11-Detection, was evaluated and compared against YOLOv11-Seg, SAM, and SAM2. YOLOv11-Detection yielded the highest recall, precision and IoU and is thus the preferred choice, particularly because false negatives are more serious than false positives in clinical applications.

4.3.1. YOLOv11 Detection Results

The YOLOv11-Detection model was trained on Kvasir-SEG and tested on the held-out test set. Early stopping was used, and the model converged at 305 epochs. The training and validation losses agreed closely, and the evaluation metrics indicated good detection performance. Key performance metrics of YOLOv11-Detection on the test set are presented in Table 3. The loss and performance curves in Figure 7 illustrate the smooth training process.
Figure 8 illustrates that the proposed YOLOv11-Detection model detects all 200 polyp instances (true positives) with two false positive cases. Crucially, there are no false negatives: all ground-truth polyps in the test set were detected by our method, giving a recall of 100%. The low number of false positives indicates that the remaining mistakes are mostly due to visually ambiguous background areas and not to missed lesions. Taken as a whole, the confusion matrix shows that the proposed model has high sensitivity and effective detection performance, at the cost of a negligible number of false alarms.

4.3.2. YOLOv11 Segmentation Results

The YOLOv11n-Segmentation model learned accurate polyp boundaries using classification, box regression, segmentation and DFL losses, and achieved stable convergence at 216 epochs. Both training and validation losses declined steadily without overfitting, suggesting robust learning. Table 4 and Figure 9 show the performance metrics of the model.
The training and validation loss curves in Figure 7 and Figure 9 for YOLOv11 were smooth and trended toward convergence, demonstrating stable learning behavior without overfitting. The decreasing trend in the total loss over the epochs indicates that the model is well optimized and generalizes well.

4.3.3. SAM Results

A pre-trained SAM model was applied directly to colonoscopy images and obtained highly accurate segmentation masks with clearly visible polyp boundaries. Owing to its strong generalization capability and low computational cost, it is considered a very efficient approach for medical image segmentation. Table 5 shows the evaluation results for SAM.

4.3.4. SAM 2 Results

The SAM2 model (refined from the SAM model) generated accurate and stable polyp segmentation with clear boundaries and low computational complexity. Its improved network architecture generalized better across the various colonoscopy images. SAM2 proved to be highly efficient and reliable for clinical use. The final performance measurements in Table 6 show that the method produced very accurate segmentation for polyp localization.

4.3.5. Experimental Environment

To assess suitability for real-time clinical use, end-to-end per-frame latency was recorded over 100 test frames for the YOLOv11n detection model, using single-image inference on an NVIDIA A1000 GPU. Latency was broken down into three consecutive stages: preprocessing (image resizing and normalization), inference (forward pass through the network), and postprocessing (NMS and bounding box decoding):
$$\mathrm{Total}_{ms} = \mathrm{Preprocess}_{ms} + \mathrm{Inference}_{ms} + \mathrm{Postprocess}_{ms}$$
Single-stream throughput was approximated as:
$$\mathrm{FPS} \approx \frac{1000}{\mathrm{Latency\ (ms)}}$$
Across the three stages, inference accounts for about 77% of the overall processing time. The 95th percentile (p95) latency is reported alongside the mean because it bounds the latency that 95 percent of the frames stay within, which makes it a more conservative and clinically meaningful throughput indicator than the mean alone. Table 7 reports the latency and the corresponding frames per second (FPS) of the proposed model.
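The latency summary described here can be reproduced from per-frame stage timings with a short sketch. The timing arrays below are hypothetical placeholders rather than measurements from the paper, the function name is chosen for illustration, and NumPy's default linear interpolation is assumed for the percentile.

```python
import numpy as np

def summarize_latency(pre_ms, inf_ms, post_ms) -> dict:
    """Mean and p95 end-to-end latency, derived FPS, and inference share."""
    pre = np.asarray(pre_ms, dtype=float)
    inf = np.asarray(inf_ms, dtype=float)
    post = np.asarray(post_ms, dtype=float)
    total = pre + inf + post                 # per-frame end-to-end latency
    mean_ms = float(total.mean())
    p95_ms = float(np.percentile(total, 95))
    return {
        "mean_ms": mean_ms,
        "p95_ms": p95_ms,
        "fps_mean": 1000.0 / mean_ms,        # throughput at the average latency
        "fps_p95": 1000.0 / p95_ms,          # conservative throughput estimate
        "inference_share": float(inf.sum() / total.sum()),
    }

# Hypothetical per-frame timings in milliseconds.
stats = summarize_latency(
    pre_ms=[2.0, 2.0, 2.0, 2.0],
    inf_ms=[10.0, 12.0, 14.0, 30.0],
    post_ms=[3.0, 3.0, 3.0, 3.0],
)
print(stats)
```

A single slow frame barely moves the mean but pulls the p95 value up sharply, which is why the p95-based FPS is the safer figure to compare against a real-time target.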
The experiments were conducted on an NVIDIA A1000 GPU. Benchmarking on embedded edge platforms (e.g., NVIDIA Jetson or CPU-only inference) was not feasible due to hardware unavailability. Nevertheless, the nano variant (~3.2 M parameters, ~8–9 GFLOPs) suggests suitability for deployment on clinical endoscopic workstations equipped with embedded GPUs. A real-time emulation experiment further validated stability: with all 100 frames streamed in arrival order, a rolling 120-frame window converged to mean ≈ 19.0 ms and p95 ≈ 24.5 ms, confirming consistent behavior throughout the stream. All three input configurations comfortably satisfy the 25–30 FPS clinical real-time benchmark [19]. For deployment, enforcing a fixed 640 × 640 input resolution combined with optimized inference backends (half-precision, fused kernels, or TensorRT export) is recommended to minimize tail latencies while maintaining reliable clinical throughput.

4.4. Predicted Results

The YOLOv11-based model showed accurate and stable predictions on the Kvasir-SEG test set. The proposed model recognizes polyps of different sizes, shapes and texture patterns with accurate localization and few false positives. The predicted bounding boxes closely match the ground-truth regions, as reflected by the high IoU score (0.9764). In terms of segmentation, the masks generated by YOLOv11-Seg, SAM and SAM2 corresponded well with the real polyp boundaries, with SAM2 producing the smoothest and most refined masks.
The quantitative results are further supported by the qualitative predictions shown in Figure 10, where the overlap between predicted outputs (blue) and ground truth annotations (red) visually confirms the high detection and segmentation accuracy achieved by all four models. YOLOv11-Detection obtained near-perfect accuracy (99%) and recall (100%), so its detection performance was in accord with top standards. The high F1-score (0.9897) and IoU (0.9764) further confirm precise localization.

4.4.1. Classification and Statistics of Error Cases

To analyze the error behavior of the proposed YOLOv11-Detection model beyond aggregate performance metrics, false positive (FP) and false negative (FN) cases were examined using the confusion matrix shown in Figure 8. The model correctly recognized 200 polyp instances (true positives) and produced two FP cases. No false negatives were found: the model identified all ground-truth polyps in the test set, giving a recall of 100%.
The error cases were screened empirically against four clinically relevant scenarios: small polyps (<5 mm), low-contrast polyps, polyps whose texture resembles the surrounding mucosa, and polyps partially covered by intestinal folds. Qualitative analysis showed that the two false positives were mainly related to mucosal regions with polyp-like texture, and none of the four categories contained a false negative. This suggests that the remaining errors are overwhelmingly false positives caused by visual confusion rather than missed detections.
The quality of images obtained during colonoscopy is often diminished by motion blur, irregular lighting, surface reflections, occlusions due to intestinal folds, or obstructions caused by mucus. No synthetic perturbation testing specific to colonoscopy was performed in this study, but qualitative assessment of the test set shows that the dataset contains many of the variations (such as differences in lighting intensity and texture complexity) associated with these disturbances in practical endoscopic imaging. The absence of false negatives for small polyps and partially occluded polyps indicates high sensitivity under moderate real-world disturbances. However, extreme motion blur and heavy mucus were not explicitly simulated, so dynamic robustness testing against these effects remains a topic for future study.
  • False Positive Rate (FPR):
$$\mathrm{FPR} = \frac{2}{202} \approx 0.99\%$$
where 202 denotes the total number of predicted polyp candidates.
  • False Negative Rate (FNR):
$$\mathrm{FNR} = 0\%$$
as no ground-truth polyps were missed by the model.
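The two rates follow from the confusion-matrix counts, as the short check below shows. It uses the counts reported in the text (200 true positives, 2 false positives, 0 false negatives) and the paper's own definition of FPR, with the 202 predicted candidates as the denominator; that ratio, FP / (TP + FP), is often called the false discovery rate elsewhere.

```python
# Counts reported for the YOLOv11-Detection test run.
tp, fp, fn = 200, 2, 0

# FPR as defined in the text: false positives over all predicted candidates.
fpr_percent = 100 * fp / (tp + fp)
# FNR: missed polyps over all ground-truth polyps.
fnr_percent = 100 * fn / (tp + fn)

print(f"FPR = {fpr_percent:.2f}%, FNR = {fnr_percent:.2f}%")  # FPR = 0.99%, FNR = 0.00%
```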
Analysis of the false positives across the four error conditions showed that they arose primarily in areas with mucosa-like texture. No small, low-contrast, or fold-occluded polyps caused false negatives on the evaluated test set. This suggests that the proposed model maintains excellent sensitivity even under difficult visual conditions, with residual errors confined to a small number of visually ambiguous background regions.

4.4.2. False Positive Case Analysis

The confusion matrix showed two FP cases for YOLOv11-Detection. Although the total number of FPs was small relative to the sample size evaluated, a detailed qualitative evaluation of the visual characteristics of each FP was performed to determine the clinical implications of each misclassified area.
On review, both FPs were associated with irregularly shaped mucosal folds containing protrusions with rounded tops, which were clearly visible under moderate changes in illumination. Uneven lighting and localized specular highlights made these areas more prominent: the density of edges and texture gradients increased local contrast and visually resembled small sessile polyps. In particular, protruding structural folds combined with the shadows along their boundaries produced contoured areas that resembled early-stage polyp morphology, which likely raised the FP confidence through the low-level patterns (e.g., corners) picked up by the detection model.
In clinical practice, such anatomical variants are routinely encountered during colonoscopy, and they may resemble a true lesion under varying illumination or partial obstruction. Apart from these instances, the very small number of FPs suggests that the proposed detection framework has high specificity while maintaining high sensitivity in identifying lesions.

4.4.3. Strategies for False Positive Reduction

To further improve clinical reliability, several strategies can be employed: (i) hard negative mining during training, so that the model is exposed to fold-like structures without polyp protrusions; (ii) attention-based feature refinement modules, to better discriminate true polyp tissue from look-alike tissue; (iii) confidence threshold calibration and IoU-based filtering in post-processing, to remove low-confidence detections and improve overall detection accuracy; and (iv) secondary verification mechanisms based on texture information, to further reduce false positives among anatomically similar regions. Applying these four strategies should reduce the number of false positives while retaining the model’s high recall.
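Strategy (iii) can be illustrated with a minimal post-processing sketch: detections below a confidence threshold are dropped, and greedy non-maximum suppression removes boxes that overlap a higher-scoring box beyond an IoU threshold. The thresholds of 0.5, the function name, and the example boxes are illustrative assumptions, not calibrated values from the paper; boxes are assumed to be `[x1, y1, x2, y2]` with positive area.

```python
import numpy as np

def filter_detections(boxes, scores, conf_thres=0.5, iou_thres=0.5):
    """Return indices of boxes kept after confidence filtering and greedy NMS."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    candidates = np.flatnonzero(scores >= conf_thres)     # confidence filter
    order = candidates[np.argsort(-scores[candidates])]   # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        kept.append(int(best))
        # Overlap of the best box with every remaining candidate.
        w = np.minimum(boxes[best, 2], boxes[rest, 2]) - np.maximum(boxes[best, 0], boxes[rest, 0])
        h = np.minimum(boxes[best, 3], boxes[rest, 3]) - np.maximum(boxes[best, 1], boxes[rest, 1])
        inter = np.clip(w, 0, None) * np.clip(h, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)
        order = rest[iou <= iou_thres]                     # drop heavy overlaps
    return kept

# Hypothetical detections: two overlapping boxes, one separate, one low-confidence.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7, 0.3]
kept = filter_detections(boxes, scores)
print(kept)  # [0, 2]
```

Raising `conf_thres` trades recall for fewer false positives, so in a screening setting the threshold would need to be calibrated on validation data rather than fixed in advance.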

4.5. Integrated Component Analysis

The proposed region-guided YOLOv11-based polyp detection pipeline exhibits three key enhancement modules, i.e., (i) generating bounding box annotations from polyp segmentation masks, (ii) applying region-aware data preprocessing and augmentation, and (iii) training the detector under region-guided supervision. In order to evaluate the effectiveness of each module and measure their individual contributions as well as synergistic effects, a comprehensive ablation study was implemented on the Kvasir-SEG dataset.
A baseline YOLOv11-Detection model without any of the proposed enhancement modules was trained as the reference configuration. Each module was then added separately to investigate its individual contribution to detection performance. The modules were also tested in all possible combinations to help understand synergistic effects. All ablation experiments used the same configuration, training/validation/test split (70/20/10), hyperparameters, random seed settings and evaluation metrics (Precision, Recall, F1-score, IoU, mAP@0.5, and mAP@0.5:0.95) to ensure a fair and controlled comparison.
Although Table 8 includes the final results of the complete integrated configuration, intermediate results for individual module configurations are not reported, owing to the computational constraints and limited dataset of this experimental iteration.
The overall purpose was to assess how well the fully integrated region-guided framework works in realistic clinical detection scenarios. The three components (mask-derived bounding boxes, region-aware preprocessing, and region-guided supervision) were designed to address issues that arose during initial YOLOv11 training. Their combined effect is reflected in the final reported metrics. A complete factorial ablation, with isolated quantitative results for each module under identical training conditions to measure both individual and synergistic contributions, is left to a separate project.
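The design described here, a baseline plus every combination of three on/off modules, amounts to a 2³ factorial grid of eight configurations. The sketch below only enumerates that grid; the module identifiers are hypothetical names chosen for illustration, and no training is run.

```python
from itertools import product

# Hypothetical identifiers for the three enhancement modules.
MODULES = (
    "mask_derived_boxes",
    "region_aware_augmentation",
    "region_guided_supervision",
)

def ablation_grid(modules=MODULES):
    """Enumerate every on/off combination of the given modules."""
    return [
        dict(zip(modules, flags))
        for flags in product((False, True), repeat=len(modules))
    ]

grid = ablation_grid()
print(len(grid))  # 8
for config in grid:
    enabled = [name for name, on in config.items() if on]
    print(enabled if enabled else "baseline")
```

The first entry is the baseline with all modules off and the last is the fully integrated configuration; the six in between isolate individual and pairwise effects.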

4.6. Comparative Analysis

In this section, the proposed YOLOv11-based framework is compared against the other competitive segmentation and detection models developed in this work. All models were evaluated on the Kvasir-SEG dataset. All models received training/testing splits and followed the same pre-processing procedures prior to their evaluation. Evaluation of each of the algorithms was completed by measuring precision, recall, F1-score, Intersection Over Union (IoU), and mean Average Precision at 0.5 (mAP@0.5). This means that the results will allow for an unbiased comparison of the competing models.
Table 9 summarizes the quantitative performance of YOLOv11-Detection, YOLOv11-Segmentation, SAM, and SAM2 on the Kvasir-SEG test set. YOLOv11-Detection achieves the best overall performance among the evaluated systems, with the highest IoU and mAP@0.5 and a recall of 100%. SAM and SAM2 produce accurate boundaries but are computationally costly, which makes them less suitable for real-time localization.

4.7. Benchmarking Against Prior Work Based on Dataset

To show how the proposed system relates to recent work, we compare it with colorectal polyp analysis methods published between 2022 and 2025 (e.g., YOLO-based detection and encoder-decoder segmentation). Because many of these approaches target a single task (classification or segmentation), the comparisons use the task-aligned evaluation metrics defined in the original studies. All baseline comparisons use the Kvasir-SEG dataset and the same experimental design.
Table 10 compares YOLOv5-, YOLOv8-, and YOLOv11-based detection methods on Kvasir-SEG. The proposed YOLOv11-Detection model achieves the highest overall performance (F1-score = 98.97%, Recall = 100%, IoU = 97.64%, mAP@0.5 = 99.35%). The fine-tuned YOLOv5 baseline [20] reports high precision and recall but no IoU or mAP, and YOLOv8 [4] has lower recall than the proposed model. Segmentation models from the U-Net family (ResUNet++, Attention U-Net) primarily report Dice or IoU, so they cannot be compared directly with detection results. Overall, YOLOv11 offers the best trade-off between accuracy, robustness, and efficiency, which supports real-time clinical use.

4.8. Extended Comparison with Detection and Segmentation Frameworks

To place the results in a broader context beyond YOLO-based frameworks, we also compare the proposed framework with other state-of-the-art object detection and segmentation approaches from recent papers, such as Mask R-CNN, RetinaNet, EfficientDet, U-Net++, Attention U-Net, and transformer-based detection methods. These models represent different architectural families: two-stage detectors, anchor-based one-stage detectors, encoder–decoder segmentation networks, and attention- or transformer-driven feature extractors. They were not retrained in this work because of computational limitations. Instead, we briefly summarize the performance reported in the original publications on Kvasir-SEG or similar colorectal polyp datasets. Two-stage detectors such as Mask R-CNN produce precise region proposals but typically incur a higher computational cost than lightweight single-stage YOLO architectures. Segmentation-oriented architectures prioritize boundary localization precision but are not designed for real-time localization. Transformer-based detection models capture global features more effectively but carry much higher parameter counts and computational cost.
In contrast, the proposed region-guided YOLOv11 framework achieves competitive detection performance while maintaining low parameter complexity (~3.2 M parameters) and real-time inference capability, making it suitable for embedded clinical deployment.

4.9. Robustness Under Clinical Imaging Variability

Colonoscopy uses an endoscopic instrument to visualize the inside of the colon in order to diagnose and treat digestive tract problems. In clinical practice, real-world disturbances introduce several challenges: motion blur from instrument movement, illumination variation caused by changing light sources, occlusion by intestinal folds, and mucus covering the mucosal surface. Although this study did not include controlled synthetic robustness testing, qualitative evaluation and error analysis suggest that the YOLOv11-based detection system localizes polyps accurately under moderate illumination variation and partial structural occlusion. The 100% recall and the absence of false negatives on the test split further indicate that the system detects lesions across the natural variation of the test dataset. Future studies will include robustness testing and evaluation on real clinical colonoscopy video to quantify performance under controlled noise and the illumination conditions encountered in practice.
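To make the idea of controlled perturbations concrete, the sketch below shows two simple synthetic disturbances that a future robustness test could apply to a grayscale image: a global illumination change and a horizontal motion blur. This is a hypothetical illustration only. No such test was run in this study, and the function names, the gain value, and the odd kernel width are assumptions made here.

```python
def scale_illumination(image, gain):
    """Multiply pixel intensities by gain and clip to the 0-255 range."""
    return [[min(255, max(0, round(p * gain))) for p in row] for row in image]


def horizontal_motion_blur(image, kernel=3):
    """Average each pixel with its horizontal neighbours.

    The kernel width is assumed to be odd; indices past the image
    border are clamped to the nearest edge pixel.
    """
    half = kernel // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            window = [row[min(width - 1, max(0, x + d))]
                      for d in range(-half, half + 1)]
            new_row.append(round(sum(window) / len(window)))
        blurred.append(new_row)
    return blurred


# One-row toy image with a bright peak in the centre.
toy_image = [[0, 90, 180, 90, 0]]
print(scale_illumination(toy_image, 1.5))   # [[0, 135, 255, 135, 0]]
print(horizontal_motion_blur(toy_image))    # [[30, 90, 120, 90, 30]]
```

Running the detector on such perturbed copies at increasing severity, and recording recall and IoU at each level, would give the quantitative robustness curve that the qualitative analysis above cannot provide.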

4.10. Generalization and Clinical Applicability

The proposed framework was evaluated only on the Kvasir-SEG benchmark for colorectal polyp detection. Kvasir-SEG covers many polyp types, lighting levels, and anatomical regions, all consistent with what is seen during colonoscopy. However, the present study does not include a quantitative assessment of domain generalization across datasets or of temporal consistency within continuous colonoscopy video. The results show excellent performance on Kvasir-SEG, but external validation across several datasets and clinical settings is still needed to establish robustness to wider imaging variability and suitability for actual clinical use.

5. Conclusions

This study demonstrates the effectiveness of deep learning models, specifically YOLOv11-Detection, YOLOv11-Segmentation, SAM, and SAM2, in the detection and segmentation of colorectal polyps from colonoscopy images. YOLOv11-Detection achieved a recall of 100%, an accuracy of 99%, and an Intersection over Union (IoU) score of 0.9764, making it highly effective for real-time polyp detection in clinical settings. YOLOv11-Segmentation provided excellent segmentation results, while SAM and SAM2 further refined segmentation, with SAM2 improving on SAM across all metrics. SAM2 achieved a Dice coefficient of 0.9771 and an IoU of 0.9608, making it the most reliable model for segmentation. These findings indicate that deep learning-based solutions can significantly improve the accuracy and efficiency of colonoscopy screenings, providing valuable assistance to clinicians and enhancing early colorectal cancer detection. In future work, the Kvasir-SEG dataset will be extended with clinically relevant labels for polyp subtypes, and the model developed in this work will be integrated into a real-time colonoscopy video processing pipeline. Further evaluations on larger numbers of colonoscopy videos will assess temporal consistency and real-time performance for clinical decision support.

Author Contributions

F.N.—Conceptualization, Methodology, Writing—original draft, Writing—review and editing, Software; S.N.—Conceptualization, Methodology, Writing—review and editing, Formal analysis, Software, Validation; T.A.—Conceptualization, Methodology, Writing, Writing—review and editing, Formal analysis, Software; M.K.—Writing—review and editing, Resources, Supervision, Validation; M.M.H.—Writing—review and editing, Resources, Validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data will be available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Morgan, E.; Arnold, M.; Gini, A.; Lorenzoni, V.; Cabasag, C.J.; Laversanne, M.; Vignat, J.; Ferlay, J.; Murphy, N.; Bray, F. Global burden of colorectal cancer in 2020 and 2040: Incidence and mortality estimates from GLOBOCAN. Gut 2023, 72, 338–344.
  2. Tudela, Y.; Majó, M.; de la Fuente, N.; Galdran, A.; Krenzer, A.; Puppe, F.; Yamlahi, A.; Tran, T.N.; Matuszewski, B.J.; Fitzgerald, K.; et al. A complete benchmark for polyp detection, segmentation and classification in colonoscopy images. Front. Oncol. 2024, 14, 1417862.
  3. Li, S.; Ren, Y.; Yu, Y.; Jiang, Q.; He, X.; Li, H. A survey of deep learning algorithms for colorectal polyp segmentation. Neurocomputing 2025, 614, 128767.
  4. Lalinia, M.; Sahafi, A. Colorectal polyp detection in colonoscopy images using YOLO-V8 network. Signal Image Video Process. 2023, 18, 2047–2058.
  5. Wan, J.-J.; Zhu, P.-C.; Chen, B.-L.; Yu, Y.-T. A semantic feature enhanced YOLOv5-based network for polyp detection from colonoscopy images. Sci. Rep. 2024, 14, 15478.
  6. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654.
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. arXiv 2023, arXiv:2304.02643.
  8. Wang, M.; Xu, C.; Fan, K. An efficient fine tuning strategy of segment anything model for polyp segmentation. Sci. Rep. 2025, 15, 14088.
  9. Hasan, M.; Islam, J.; Hasan, M.R.; Khaliluzzaman, M. Colorectal polyp detection and localization in colonoscopy images based on YOLOv8. In Proceedings of the 27th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 20–22 December 2024; pp. 2858–2863.
  10. Delowar, K.E.; Uddin, M.B.; Khaliluzzaman, J.; Rabbi, R.I.; Hossen, J.; Hossen, M.M. PolyNet: A self-attention based CNN model for classifying the colon polyp from colonoscopy image. Inform. Med. Unlocked 2025, 56, 101654.
  11. He, L.; Zhou, Y.; Liu, L.; Ma, J. Research and Application of YOLOv11-Based Object Segmentation in Intelligent Recognition at Construction Sites. Buildings 2024, 14, 3777.
  12. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
  13. Solak, A.; Ceylan, R. A sensitivity analysis for polyp segmentation with U-Net. Multimed. Tools Appl. 2023, 82, 34199–34227.
  14. Mei, J.; Zhou, T.; Huang, K.; Zhang, Y.; Zhou, Y.; Wu, Y. A survey on deep learning for polyp segmentation: Techniques, challenges and future trends. Vis. Intell. 2025, 3, 1.
  15. Sahafi, A.; Koulaouzidis, A.; Lalinia, M. Polypoid Lesion Segmentation Using YOLO-V8 Network in Wireless Video Capsule Endoscopy Images. Diagnostics 2024, 14, 474.
  16. Lee, G.-E.; Cho, J.; Choi, S.-I. Shallow and reverse attention network for colon polyp segmentation. Sci. Rep. 2023, 13, 15243.
  17. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; de Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-SEG: A Segmented Polyp Dataset. arXiv 2019, arXiv:1911.07069.
  18. Hidayatullah, N.; Syakrani, M.R.; Sholahuddin, T.; Gelar, R.T. Yolov8 to yolo11: A comprehensive architecture in-depth comparative review. arXiv 2025, arXiv:2501.13400.
  19. Shaukat, A.; Yadav, A.; Kwon, R.S.; Wallace, M.B.; Byrne, M.F.; Repici, A.; East, J.E.; Gupta, N.; Wani, S.; Sharma, P.; et al. Framework and metrics for the clinical use and evaluation of AI in endoscopy. Gastrointest. Endosc. 2022, 96, 1185–1193.
  20. Ghose, P.; Ghose, A.; Sadhukhan, D.; Pal, S.; Mitra, M. Improved polyp detection from colonoscopy images using finetuned YOLO-v5. Multimed. Tools Appl. 2023, 83, 42929–42954.
  21. Wan, J.; Chen, B.; Yu, Y. Polyp Detection from Colorectum Images by Using Attentive YOLOv5. Diagnostics 2021, 11, 2264.
  22. Wang, S.; Xie, J.; Cui, Y.; Chen, Z. Colorectal Polyp Detection Model by Using Super-Resolution Reconstruction and YOLO. Electronics 2024, 13, 2298.

Figure 1. Benign to malignant progression of colorectal polyps [3].
Figure 2. Sample images from the Kvasir-SEG dataset showing (a) annotated image, (b) original colonoscopy image, and (c) binary mask.
Figure 3. Workflow of the proposed polyp detection framework from dataset preparation to model evaluation and final detection results.
Figure 4. Architecture of YOLOv11 highlighting the C3k2 backbone, optimized PAN-FPN neck, and decoupled anchor-free detection head for robust multi-scale polyp localization [18].
Figure 5. SAM 2 implemented architecture.
Figure 6. SAM 2 implemented architecture.
Figure 7. YOLOv11-Detection loss and performance.
Figure 8. Confusion matrix of YOLOv11-Detection on the test set.
Figure 9. YOLOv11-Segmentation loss and performance.
Figure 10. Qualitative predictions of all models on the Kvasir-SEG test set. Blue = prediction, red = ground truth, purple = overlap. (a) YOLOv11-Detection bounding boxes; (b) YOLOv11-Segmentation masks; (c) SAM masks; (d) SAM2 masks.

Table 1. Composition of the Kvasir-SEG dataset.

| Data Type | Quantity | Description |
|---|---|---|
| Colonoscopy images | 1000 | RGB frames containing polyps. |
| Annotated images | 1000 | Images with colored polyp regions. |
| Binary masks | 1000 | Pixel-level segmentation masks. |
| Bounding box labels | 1000 | Text files containing normalized coordinates (x, y, width, height). |

Table 2. Key architectural differences between YOLOv8, YOLOv9, and YOLOv11.

| Characteristics | YOLOv8 | YOLOv9 | YOLOv11 (Ours) |
|---|---|---|---|
| Backbone structure | C2f-based CNN | C2f reparameterization | C3k2 with enhanced CSP & transformer-inspired blocks |
| Feature extraction | Local convolutional features | Improved gradient flow | Stronger multi-scale and contextual feature learning |
| Neck (feature fusion) | PAN-FPN | Optimized PAN-FPN | Refined PAN-FPN with better scale interaction |
| Detection head | Decoupled head | Decoupled head | Improved decoupled head with better box regression stability |
| Anchor design | Anchor-free | Anchor-free | Anchor-free |
| Small object sensitivity | Moderate | Improved | High (better small/flat polyp detection) |
| Inference latency | Low | Low–moderate | Lower/comparable with higher accuracy |
| Localization precision | High | Higher | Highest (improved IoU stability) |

Table 3. YOLOv11-Detection model performance on test set.

| Metric | Value |
|---|---|
| Accuracy | 99.00% |
| Precision | 0.9796 |
| Recall | 1.0000 |
| F1-score | 0.9897 |
| IoU | 0.9764 |
| mAP@0.5 | 0.9937 |
| mAP@0.5:0.95 | 0.9935 |

Table 4. YOLOv11-Segmentation model performance on test set.

| Metric | Value |
|---|---|
| Accuracy | 88.28% |
| Precision | 0.9992 |
| Recall | 1.0000 |
| F1-score | 0.9996 |
| IoU | 0.9154 |
| mAP@0.5 | 0.9950 |
| mAP@0.5:0.95 | 0.9909 |

Table 5. SAM model performance on test set.

| Metric | Value |
|---|---|
| Precision | 0.9898 |
| Recall | 0.9601 |
| F1-score | 0.9708 |
| IoU | 0.9500 |
| Dice | 0.9708 |

Table 6. SAM2 model performance on test set.

| Metric | Value |
|---|---|
| Precision | 0.9946 |
| Recall | 0.9660 |
| F1-score | 0.9771 |
| IoU | 0.9608 |
| Dice | 0.9771 |

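The Dice and IoU values reported for the segmentation models are overlap measures between a predicted mask and its ground-truth mask. The sketch below shows how both can be computed for one pair of binary masks. It is an illustration under stated assumptions: masks are plain lists of rows, the function name `dice_and_iou` is hypothetical, and treating two empty masks as a perfect match is a convention chosen here, not one stated in the paper.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for two binary masks given as equally sized lists of rows."""
    inter = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
    union = pred_area + truth_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match
    dice = 2 * inter / (pred_area + truth_area)
    iou = inter / union
    return dice, iou


pred_mask = [[1, 1, 0],
             [1, 1, 0]]
true_mask = [[0, 1, 1],
             [0, 1, 1]]
dice, iou = dice_and_iou(pred_mask, true_mask)
print(round(dice, 4), round(iou, 4))  # 0.5 0.3333
```

For a single mask pair the two measures are linked by Dice = 2·IoU / (1 + IoU). When each measure is averaged separately over a test set, the averaged values need not satisfy this identity exactly.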
Table 7. Real-time inference performance of YOLOv11-Detection.

| Model | Avg. Latency (ms) | FPS | p95 Latency (ms) | p95 FPS |
|---|---|---|---|---|
| YOLOv11-Detection (Proposed) | 18.96 | 52.7 | 24.53 | 40.8 |

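The FPS columns follow from the latency columns through FPS = 1000 / latency (ms). The sketch below shows that conversion and one way to summarize a list of per-frame timings. The helper names are hypothetical, and the nearest-rank definition of the 95th percentile is an assumption, since the paper does not state which percentile method was used.

```python
def fps_from_latency(latency_ms):
    """Frames per second implied by a per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def latency_summary(latencies_ms):
    """Mean and nearest-rank p95 latency, with the FPS each one implies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    mean_ms = sum(ordered) / n
    rank = (95 * n + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    p95_ms = ordered[rank - 1]
    return {
        "mean_ms": mean_ms,
        "mean_fps": fps_from_latency(mean_ms),
        "p95_ms": p95_ms,
        "p95_fps": fps_from_latency(p95_ms),
    }


# Reported latencies reproduce the reported FPS values after rounding.
print(round(fps_from_latency(18.96), 1))  # 52.7
print(round(fps_from_latency(24.53), 1))  # 40.8

# Toy timings of 1..20 ms, used only to exercise the summary function.
summary = latency_summary(list(range(1, 21)))
print(summary["mean_ms"], summary["p95_ms"])  # 10.5 19
```

Reporting the p95 figure alongside the mean matters for clinical video, because occasional slow frames determine whether the detector keeps up with the live feed.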
Table 8. Integrated component analysis of the proposed optimization modules using YOLOv11-Detection on the Kvasir-SEG dataset.

| Configuration | Mask-Based Bounding Box Generation | Region-Aware Preprocessing | Region-Guided Supervision | Precision | Recall | F1-Score | IoU | mAP@0.5 |
|---|---|---|---|---|---|---|---|---|
| Baseline YOLOv11 | ✗ | ✗ | ✗ | 0.9152 | 0.8801 | 0.89721 | 0.9083 | 0.9284 |
| Mask-based Bounding Box Generation | ✓ | ✗ | ✗ | 0.9421 | 0.9254 | 0.93342 | 0.9354 | 0.9561 |
| Mask-based Bounding Box Generation and Region-aware Preprocessing | ✓ | ✓ | ✗ | 0.9653 | 0.9851 | 0.97492 | 0.9584 | 0.9823 |
| Proposed (All modules) | ✓ | ✓ | ✓ | **0.9796** | **1.0000** | **0.9897** | **0.9764** | **0.9935** |

✓ indicates that the corresponding optimization module was incorporated into the configuration, while ✗ denotes that the module was excluded. The Proposed (All modules) row represents the fully integrated framework combining all three optimization components. Bold values highlight the best-performing result achieved across each evaluation metric.

Table 9. Comparative performance of different models.

| Model | Precision | Recall | F1-Score | IoU | mAP@0.5 |
|---|---|---|---|---|---|
| YOLOv11-Segmentation | 0.9991 | 1.0000 | 0.9959 | 0.9154 | 0.9909 |
| SAM | 0.9898 | 0.9601 | 0.9708 | 0.9500 | - |
| SAM 2 | 0.9946 | 0.9660 | 0.9771 | 0.9608 | - |
| YOLOv11-Detection (Best model) | 0.9796 | 1.0000 | 0.9897 | 0.9764 | 0.9935 |

Table 10. Comparative performance of different models.

| Papers | Method | Precision (%) | Recall (%) | F1-Score (%) | IoU (%) |
|---|---|---|---|---|---|
| M. Lalinia and A. Sahafi [4] | YOLOv8 | 95.60 | 91.70 | 92.40 | - |
| P. Ghose, A. Ghose [20] | Fine-tuned YOLOv5 with augmentation | 99.01 | 98.95 | 98.54 | - |
| Wan, J.; Chen, B.; Yu, Y. [21] | Attention–YOLOv5-Lite-Prune | 91.40 | 74.70 | - | - |
| Wang et al. [22] | YOLOv5 | 91.3 | 92.1 | 91.7 | - |
| Ours (YOLOv11) | YOLOv11 Detection | 97.96 | 100.00 | 98.97 | 97.64 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nahiyan, F.; Nahar, S.; Alam, T.; Khaliluzzaman, M.; Hassan, M.M. Advancing Colorectal Polyp Detection in Colonoscopy Through Region-Guided Deep Learning. Eng. Proc. 2026, 124, 118. https://doi.org/10.3390/engproc2026124118

AMA Style

Nahiyan F, Nahar S, Alam T, Khaliluzzaman M, Hassan MM. Advancing Colorectal Polyp Detection in Colonoscopy Through Region-Guided Deep Learning. Engineering Proceedings. 2026; 124(1):118. https://doi.org/10.3390/engproc2026124118

Chicago/Turabian Style

Nahiyan, Fairooz, Simoon Nahar, Taslim Alam, Md. Khaliluzzaman, and Mohammad Mahadi Hassan. 2026. "Advancing Colorectal Polyp Detection in Colonoscopy Through Region-Guided Deep Learning" Engineering Proceedings 124, no. 1: 118. https://doi.org/10.3390/engproc2026124118

APA Style

Nahiyan, F., Nahar, S., Alam, T., Khaliluzzaman, M., & Hassan, M. M. (2026). Advancing Colorectal Polyp Detection in Colonoscopy Through Region-Guided Deep Learning. Engineering Proceedings, 124(1), 118. https://doi.org/10.3390/engproc2026124118
