Research on Polyp Segmentation via Dynamic Multi-Scale Feature Fusion and Global–Local Semantic Enhancement

Wei Qing; Yuyao Ouyang; Pengfei Yin

doi:10.3390/s25206495

,

and

College of Computer Science and Engineering, Jishou University, Jishou 416000, China

^*

Author to whom correspondence should be addressed.

Sensors2025, 25(20), 6495;https://doi.org/10.3390/s25206495

This article belongs to the Section Biomedical Sensors

Version Notes

Order Reprints

Review Reports

Abstract

Accurate segmentation of colorectal polyps is crucial for the early screening and clinical diagnosis of colorectal cancer. However, the diverse morphology of polyps, significant variations in scale, and unstable quality of endoscopic imaging pose serious challenges for existing algorithms in achieving precise boundary segmentation. To address these issues, this study proposes a novel polyp segmentation algorithm, GDCA-Net, which is developed based on the You Only Look Once version 12 segmentation model (YOLOv12-seg). GDCA-Net introduces several architectural innovations. First, a Gather-and-Distribute (GD) mechanism is incorporated to optimize multi-scale feature fusion, while Alterable Kernel Convolution (AKConv) is integrated to enhance the modeling of complex geometric structures. Second, the Convolution and Attention Fusion Module (CAF) and Context-Mixing dynamic convolution (ContMix) modules are designed to strengthen long-range dependency modeling and multi-scale feature extraction for polyp boundary representation. Finally, a Wise Intersection over Union–based (Wise-IoU) loss function is introduced to accelerate model convergence and improve robustness to low-quality samples. Experiments conducted on the PolypDB, Kvasir-SEG, and CVC-ClinicDB datasets demonstrate the superior performance of GDCA-Net in polyp segmentation tasks. On the most challenging PolypDB dataset, GDCA-Net achieved a mean Average Precision at 50% IoU threshold (mAP50) of 85.9% and an F1-score (F1) of 85.5%, representing improvements of 2.2% and 0.7% over YOLOv12-seg, respectively. Moreover, on the Kvasir-SEG dataset, GDCA-Net achieved a leading F1 score of 94.9%. These results clearly demonstrate that GDCA-Net possesses strong performance and generalization capabilities in handling polyps of varying sizes, shapes, and imaging qualities.

Keywords:

colon polyp segmentation; YOLOv12-seg; multi-scale feature fusion

1. Introduction

Accurate segmentation of colorectal polyps plays a critical role in computer-aided medical diagnosis, facilitating early detection, timely intervention, and effective prevention of malignant transformation [1]. However, existing segmentation algorithms still face considerable challenges in real clinical applications.

First, under multimodal imaging conditions, polyps exhibit significant variations in color, texture, boundary features, morphology, and scale, making it difficult for conventional pixel-level feature extraction methods to capture fine-grained information [2]. Second, many existing approaches tend to overemphasize local features while neglecting the complex interactions between the global semantic context and local structural details. As a result, their performance often deteriorates when dealing with blurred boundaries or small, low-contrast polyps. In addition, insufficient modeling of long-range dependencies and high sensitivity to low-quality inputs such as noise and specular reflections further undermine robustness and generalization, thereby limiting their clinical utility in fine-grained segmentation tasks [3].

In recent years, the rapid advancement of deep learning has brought about significant breakthroughs in medical image segmentation [4,5,6,7]. Many studies have demonstrated promising results by introducing attention mechanisms [8], convolutional kernel modeling [9,10], and context-aware feature representations [11]. Nevertheless, these methods still face limitations in balancing global and local feature integration, adapting to complex morphological variations, and maintaining robustness against low-quality imaging data, which restricts their widespread clinical adoption [12].

To address these challenges, this study proposes a novel deep neural network architecture that achieves superior segmentation performance and enhanced robustness through a multi-module collaborative optimization strategy. Specifically, to tackle the difficulty of jointly modeling local details and global context in fine-grained boundary segmentation, we designed the CAFMAttention mechanism [13], which simultaneously captures local detail features and global contextual information, thereby improving boundary delineation. To overcome the limitations of conventional convolution in modeling long-range dependencies, we introduce the ContMix module [14], a context-aware dynamic convolution mechanism that breaks the locality constraint of standard convolution. This module adaptively models long-range dependencies, thereby significantly enhancing feature representation and improving model generalization.

Furthermore, to address the inherent information loss in traditional feature pyramid networks (FPNs), we propose a Gather-and-Distribute (GD) mechanism [15], which collaboratively optimizes feature utilization efficiency and strengthens the detection capability of each branch. We also employ Arbitrary Kernel Convolution (AKConv) [16], a dynamic convolution module capable of generating and adjusting kernel sampling coordinates. This enhances the modeling of multi-scale and irregular polyp structures, thereby improving small-object detection performance.

Finally, to mitigate the sensitivity to low-quality samples and slow convergence in bounding-box regression, we adopt the improved Wise-IoU loss function [17]. By integrating a dual-attention mechanism with a dynamic gradient gain strategy, this loss function adaptively evaluates anchor-box quality and allocates gradients, thereby reducing sensitivity to low-quality samples, accelerating convergence, and enhancing robustness across datasets of varying quality.

The main contributions of this study can be summarized as follows:

We propose a multiscale feature fusion framework based on the GD mechanism and AKConv. The GD module integrates three key components: the Feature Alignment Module (FAM), the Information Fusion Module (IFM), and the Information Injection Module (Inject). Together with AKConv, this design significantly enhances branch segmentation capability and feature utilization, improving the modeling of complex geometric structures.
We design a local–global semantic extraction mechanism based on CAFM and ContMix. This approach adaptively generates convolutional kernels from input features, enabling effective modeling of long-range dependencies and significantly improving feature representation.
We introduce Wise-IoU, a loss function that combines a dual-attention mechanism with a dynamic gradient gain strategy. This loss accelerates model convergence and enhances adaptability to samples of varying quality.

3. Methodology

3.1. YOLOv12-Seg Overview

YOLOv12-Seg is an advanced segmentation model developed on the basis of the YOLO family of architectures, with the design objective of balancing high-precision semantic modeling and real-time inference efficiency [41]. In medical image segmentation tasks, a model must not only delineate boundaries with high accuracy but also satisfy real-time requirements for clinical applications. Therefore, YOLOv12-Seg provides a feasible solution that integrates both speed and accuracy for medical image segmentation.

Structurally, YOLOv12-Seg inherits the backbone of YOLOv12 and incorporates improved convolutional and feature aggregation modules (C3k2 and A2C2f) to enhance multi-scale feature representation [42]. Through cross-stage connections and multi-level fusion [43], the model integrates semantic and spatial information across different layers. Its head includes a segmentation branch that jointly leverages P3, P4, and P5 feature maps for segmentation prediction, thereby preserving global contextual information while strengthening the delineation of edge details.

3.2. Improved Feature Fusion Network

Current YOLO algorithms typically adopt feature pyramid networks (FPNs) and PANet structures for multi-scale fusion [43]. However, the degree of fusion remains limited: the PAFPN used in the neck of the standard YOLO series can only fully integrate information between adjacent layers, while non-adjacent layers require recursive upward fusion. This not only increases complexity but also leads to potential information loss.

With the introduction of the GD mechanism, features at different scales are uniformly collected, summarized, and fused, after which the centralized information is redistributed to each level [15]. During this process, the Feature Alignment Module (FAM) and the Information Fusion Module (IFM) work jointly to achieve multi-level feature integration [44]. This approach enables the model to effectively leverage diverse features, thereby improving segmentation accuracy while maintaining low latency. It overcomes the information loss problem of conventional FPNs and strengthens the feature integration in the neck, making better use of the features extracted by the backbone.

As shown in Figure 1, the GD mechanism is integrated into the neck of the original YOLOv12-Seg network: Low-GD replaces the up-sampling fusion in PANet, while Higher-GD substitutes the down-sampling fusion. Additionally, higher-level B2 features are incorporated into Low-GD to maximize the integration of low-level information. Within Low-GD, the B4 layer is used as a reference: larger feature maps, such as those from B2 and B3, are down-sampled via average pooling, whereas smaller feature maps, such as those from B5, are up-sampled using bilinear interpolation to standardize feature-map sizes and obtain fused features. The fused feature layers (P3, P4, and P5) from Low-GD are subsequently processed by High-GD fusion, further enhancing information integration and effectively preserving the feature details of small objects.

Figure 1. Improved network architecture. The Gather-and-Distribute (GD) mechanism is integrated into the neck of the original You Only Look Once version 12 segmentation model (YOLOv12-Seg) network: Low-GD replaces the up-sampling fusion in Path Aggregation Network (PANet), while Higher-GD substitutes the down-sampling fusion.

In addition, ContMix is incorporated to adaptively adjust the shape and size of convolutional kernels, thereby enhancing flexibility in feature extraction. CAFMAttention is also introduced to effectively model long-range dependencies and strengthen local feature extraction while maintaining computational efficiency. Furthermore, AKConv is adopted, which accommodates arbitrary kernel shapes and sizes through flexible initial sampling and learnable offsets, thereby improving feature extraction capability and adaptability to objects with diverse shapes [45].

Finally, the improved Wise-IoU loss function is employed to optimize the model. By leveraging a dual-attention mechanism and dynamic gradient gain, Wise-IoU accelerates model convergence and enhances its adaptability to datasets of varying quality [46].

3.3. Principles of Information Aggregation and Distribution Mechanism

To effectively address the problem of information loss in traditional Feature Pyramid Networks (FPNs) during feature transmission, we introduce the Gather-and-Distribute (GD) mechanism, as shown in Figure 2. This mechanism employs a unified module to aggregate and fuse multi-level features, then redistributes the fused results back to each layer [47]. By doing so, it mitigates the inherent information loss of conventional methods while enhancing the feature integration capability of the neck, without introducing significant latency. Consequently, this approach enables more efficient utilization of backbone-extracted features and can be conveniently embedded into any existing backbone–neck–head architecture.

Figure 2. Structure of the GD mechanism. The GD mechanism enhances the model’s ability to detect objects of varying sizes by constructing two dedicated branches: the Low-Stage Gather-and-Distribute Branch (Low-GD) and the High-Stage Gather-and-Distribute Branch (High-GD).

3.4. Convolution and Attention Fusion Module

To overcome the limitations of traditional convolutions in capturing long-range dependencies and global context, this project plans to introduce a Convolution and Attention Fusion Module (CAFM) [48], as shown in Figure 3. This module combines the strong local modeling capabilities of convolutions with the strong global perception capabilities of attention mechanisms, enabling more comprehensive extraction of multi-scale, multi-level semantic information and effectively enhancing the model’s feature expression capabilities and robustness in complex scenarios.

Figure 3. Module diagram of the attention and convolution fusion between the local branch and the global branch. The former specializes in efficiently extracting local details and facilitating inter-channel interactions, while the latter focuses on modeling long-range feature dependencies and capturing global spatial relationships.

Unlike the standard Transformer self-attention mechanism, which computes global dependencies based on pairwise similarity among all tokens using Q (query), K (key), and V (value) matrices, the proposed CAFM introduces a dual-branch structure that explicitly combines local convolutional feature extraction with global attention modeling. The convolutional branch focuses on capturing fine-grained spatial and edge details with strong inductive bias, while the attention branch aggregates global context information without fully relying on pairwise token relationships.

3.5. Variably Sized Convolution

Convolution-based neural networks have achieved remarkable success in deep learning; however, standard convolution operations suffer from two inherent limitations. On the one hand, convolution is restricted to local receptive fields with fixed sampling patterns. On the other hand, the number of kernel parameters grows quadratically with kernel size. To address these limitations, this study investigates Alterable Kernel Convolution (AKConv), which enables convolution kernels to adopt arbitrary numbers of parameters and flexible sampling patterns, thereby offering richer trade-offs between computational cost and model performance. In AKConv, a novel coordinate generation algorithm is introduced to define the initial positions for kernels of arbitrary size, while offsets are employed to adjust the sampling pattern of each position.

3.5.1. Define the Initial Sampling Position

Convolutional neural networks are based on convolution operations, which locate features at corresponding positions through a regular sampling grid. In [11,33,34], for a

3 \times 3

convolution operation, the regular sampling grid is given by the following equation. Let R denote the sampling grid; then, R is defined as follows:

R = {(- 1, - 1), (- 1, 0), \dots, (0, 1), (1, 1)}

(1)

Nevertheless, such a regular sampling strategy constrains the flexibility of kernel shapes. The AKConv explored in this work overcomes this limitation by allowing the convolution kernels to operate with irregular shapes. To provide irregular kernels with a structured sampling grid, we designed an algorithm capable of generating initial sampling coordinates for convolutions of arbitrary size, with the top-left corner of the kernel (0, 0) defined as the sampling origin to accommodate different kernel dimensions. After defining the initial coordinates (

P_{n}

) for an irregular convolution, the corresponding convolution operation at position

P_{0}

can be defined as follows:

Conv (P_{0}) = \sum w \times (P_{0} + P_{n})

(2)

where w represents the weight parameters of the convolution kernel;

w \in R^{C \times K}

; where C is the number of channels, and K is the size of the convolution kernel.

3.5.2. Novel Deformable Convolution Operation

Since the sampling positions of standard convolution are fixed, it can only capture local window information and fails to extract features from other regions. Although deformable convolution enhances flexibility by learning offsets to adjust sampling positions, it remains constrained to regular grids, and the number of kernel parameters increases quadratically with kernel size, leading to high computational overhead. To address these limitations, we propose a novel deformable convolution operation, termed AKConv, which enables convolutional kernels to adopt arbitrary sampling patterns and parameter counts, thereby adapting to targets of varying scales and irregular shapes. To address these limitations, we propose a novel deformable convolution operation, termed AKConv, which enables convolutional kernels to adopt arbitrary sampling patterns and parameter counts, thereby adapting to targets of varying scales and irregular shapes. However, the irregularity of sampling positions makes direct modeling challenging. To overcome this, operations such as stacked convolutions, RFAConv, or reshape-Conv can be employed to project the feature maps into higher-dimensional spaces, followed by dimensionality reduction using convolution, thereby approximating the extraction of irregular sampling features.

3.5.3. Extended AKConv

We designed various initial sampling shapes for convolutions of size 5. Therefore, even without using the offset idea in deformable convolutions, AKConv can still generate a wide variety of convolution kernel shapes. But in fact, the size of AKConv can be arbitrary and is not limited to 5. As the size increases, the initial convolution sampling shapes of AKConv become more diverse and rich. Given that target shapes vary across different datasets, designing convolution operations corresponding to the sampling shapes is crucial.

Existing methods, such as deformable convolutions and DSConv, which aim to address the limitations of regular convolutions or are designed for specific target shapes, have not explored convolutional operations for arbitrary sizes and arbitrary sampling shapes. AKConv’s design addresses these limitations by enabling convolutional operations to efficiently extract features from irregular sampling shapes using offsets and by granting convolutional kernels the ability to have an arbitrary number of parameters and multiple shapes.

3.6. Context-Mixing Dynamic Convolution

To enable the model to achieve dynamic global modeling capabilities comparable to those of Transformer- and Mamba-based models while retaining strong inductive bias [49], as shown in Figure 4, we introduce a novel context-mixing dynamic convolution (ContMix) that can dynamically model long-range dependencies while maintaining strong inductive bias. Our core idea is to represent the relationship between a token and its context using a set of affinity values between the token and all tokens in a group of region centers within the feature map. These affinity values can then be aggregated in a learnable manner to define a token-level dynamic convolution kernel, thereby injecting contextual knowledge into each weight of the convolution kernel. Once this dynamic kernel is applied to the feature map via a sliding window, each token in the feature map is modulated by the approximate global information collected through the region centers.

Figure 4. Dynamic convolution structure diagram. The advantage of Alterable Kernel Convolution (AKConv) lies in its ability to efficiently model complex kernel shapes by learning offsets and dynamic sampling grids while maintaining flexible sampling.

3.7. Introduction of Wise-IoU

To address the issue of low-quality samples degrading model generalization, this work introduces Wise-IoU [17], a bounding-box regression loss that integrates a dynamic non-monotonic focusing mechanism with a metric-based dual attention strategy [50]. The Intersection over Union (IoU) is a metric used to quantify the overlap between predicted and ground-truth bounding boxes in detection tasks. Here,

W_{i}

and

H_{i}

denote the width and height of the overlapping region, while

S_{u}

represents the union area. During backpropagation, the gradient of

L_{I o U}

is computed as shown in Equations (3) and (4). When no overlap exists between boxes, i.e.,

W_{i} = 0

or

H_{i} = 0

, the gradient of

L_{I o U}

vanishes, making the width untrainable.

\begin{matrix} L_{I o U} = 1 - I o U = 1 - \frac{W_{i} \times H_{i}}{S_{u}} \end{matrix}

(3)

\begin{matrix} \frac{\partial L_{I o U}}{\partial W_{i}} = \{\begin{matrix} - H_{i} \times \frac{I o U + 1}{S_{u}}, & W_{i} > 0 \\ 0, & W_{i} = 0 \end{matrix} \end{matrix}

(4)

As shown in Equations (5) and (6), Wise-IoU introduces a penalty term (

R_{W I o U}

) by separately incorporating

W_{g}

and

H_{g}

. This enhances the loss (

L_{I o U}

) for normal-quality anchors while avoiding gradient hindrance from

R_{W I o U}

. Such a design weakens geometric penalties under good overlap but prevents excessive interference during training, thereby improving the model’s generalization capability.

\begin{matrix} L_{i} = L_{I o U} + R_{i} \end{matrix}

(5)

\begin{matrix} R_{W I o U} = exp [\frac{{(x - x_{g t})}^{2} + {(y - y_{g t})}^{2}}{{(W_{g}^{2} + H_{g}^{2})}^{2}}], R_{W I o U} \in [1, e) \end{matrix}

(6)

Furthermore, to reinforce bounding-box regression and mitigate harmful gradients from low-quality samples, Wise-IoU introduces an anchor quality assessment term (

β

), as defined in Equation (7). This adjustment assigns smaller gradient increments to outlier anchors. Finally, the Wise-IoU formulation in Equation (8) integrates both the non-monotonic focusing factor and a dual attention mechanism emphasizing spatial distance.

\begin{matrix} β = \frac{L_{I o U}^{*}}{\bar{L_{I o U}}}, β \in [0, + \infty); \end{matrix}

(7)

\begin{matrix} L_{W I o U} = r \times R_{W I o U} \times L_{I o U}, r = \frac{β}{δ \times α^{β - δ}}, \end{matrix}

(8)

4. Experimental Results

4.1. Datasets Preparation

This study first selected the white-light imaging (WLI) subset from the PolypDB dataset [51] (as shown in Table 1) as the basis for evaluation. PolypDB is a high-quality medical image database specifically constructed for colorectal polyps.

Table 1. Description of WLI from the PolypDB dataset.

Its images are sourced from a variety of clinical scenarios and have high annotation consistency and representativeness, providing a reliable benchmark for preliminary model verification.

To overcome the limitations of single datasets in terms of image diversity, sample scale, and clinical coverage scope and to enhance the comprehensiveness and generalizability of model evaluation, this study further introduces two widely used public datasets: Kvasir-SEG [52] and CVC-ClinicDB [53] (as shown in Table 2). Kvasir-SEG, with its rich image count, diverse lesion morphologies, and fine annotations, serves as an authoritative benchmark for evaluating segmentation algorithms, effectively testing the model’s boundary recognition and fine segmentation capabilities for irregular targets under complex endoscopic conditions. Meanwhile, CVC-ClinicDB, derived from different devices and clinical environments, exhibits significant background differences, illumination variations, and quality fluctuations, which can enhance the heterogeneity of test samples and validate the model’s robustness and stability under practical interference scenarios such as low contrast, noise, and blur.

Table 2. Kvasir-SEG and CVC-ClinicDB dataset description.

For each dataset, we followed a stratified random sampling strategy to split the data into training, validation, and test sets in an approximate ratio of 8:1:1. Stratification ensures that the distribution of polyp size, morphology, and imaging conditions is consistent across subsets. To minimize labeling noise and batch effects caused by differences in imaging devices and acquisition years, we performed visual quality inspection on all annotations and applied histogram-based intensity normalization across datasets.

By integrating three datasets—namely, PolypDB, Kvasir-SEG, and CVC-ClinicDB—this study constructs a multi-level, multi-dimensional evaluation framework that systematically examines the model’s segmentation performance under different scales, shapes, textures, and imaging conditions, comprehensively demonstrating its cross-domain transfer and generalization capabilities.

4.2. Preparation for the Experiment

4.2.1. Experimental Environment

To ensure consistency in the experimental environment and robustness of the methods, all experiments were strictly conducted on a designated server. The server has detailed specifications, as shown in Table 3. The system is equipped with an RTX 4090D graphics processing unit with 24 GB of video memory and 60 GB of memory and is powered by an AMD EPYC 9754 128-core processor with 18 virtual cores. Hard disk storage includes a 30 GB system disk and a 50 GB data disk.

Table 3. Description of the software and hardware environment.

4.2.2. Model Training Strategy

To optimize model performance and enhance generalization capability, a systematic training strategy was employed in this study. Regarding the choice of optimizer, we adopted the stochastic gradient descent (SGD) algorithm, with its parameter configuration presented in Table 4. This strategy effectively ensures convergence stability during the early stages of training. In terms of data augmentation, we designed a multi-level augmentation pipeline that includes two main categories—geometric transformations and color-space perturbations—as detailed in Table 5. By integrating semantic information from diverse samples, the model’s feature discrimination capability is significantly enhanced. This comprehensive augmentation strategy substantially improves the model’s adaptability to variations in illumination, scale differences, and morphological diversity in wireless optical imaging.

Table 4. Optimizer parameter configuration.

Table 5. Data Augmentation Strategy.

4.2.3. Training Parameters

The configuration of hyperparameters is critical to the performance of the YOLOv12-SEG model during optimization. To rigorously evaluate the effectiveness of algorithmic improvements, it is essential to maintain highly consistent hyperparameters before and after any modifications; otherwise, it is impossible to determine whether performance gains stem from the algorithm itself or parameter adjustments. Therefore, this study adopts a standardized set of hyperparameters (as shown in Table 6) to ensure a fair comparison.

Table 6. Training hyperparameter settings.

We initially performed a grid search on a subset of the PolypDB dataset to identify the optimal initial parameters, which were subsequently validated on two additional datasets. The results demonstrate that the configuration (learning rate: 0.01; batch size: 16; epochs: 200; input size: 640 × 640) achieved stable IoU convergence across all datasets (as shown in Figure 5), confirming its robustness and generalization capability. Ultimately, this unified hyperparameter setup ensured that all experiments were conducted under identical conditions, thereby guaranteeing the fairness and credibility of the performance comparison.

Figure 5. Segmentation loss curves during training. Blue: training; red: validation. Both losses converge smoothly, indicating effective model training. Note that training on the Kvasir-SEG and CVC-ClinicDB datasets was terminated around epoch 175 due to the early stopping criterion. (a) Loss trend observed on the PolypDB dataset. (b) Loss trend observed on the Kvasir-SEG dataset. (c) Loss trend observed on the CVC-ClinicDB dataset.

4.2.4. Evaluation Metrics

To evaluate the performance of GDCA-Net, we employed several widely used metrics in polyp segmentation and object detection tasks, including Precision (Pre), Recall (Rec), F1-score (F1), and mean Average Precision (mAP) at different IoU thresholds (mAP@0.5 and mAP@[0.5:0.95]). Let TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The metrics are defined as Equations (9)–(12):

\begin{matrix} Precision = \frac{| T P |}{| T P | + | F P |} \end{matrix}

(9)

\begin{matrix} Recall = \frac{| T P |}{| T P | + | F N |} \end{matrix}

(10)

\begin{matrix} mAP = \frac{1}{m} \sum_{i = 1}^{m} {AP}_{i} \end{matrix}

(11)

\begin{matrix} F 1 Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \end{matrix}

(12)

where

A P_{i}

denotes the average precision of the i-th class and N is the total number of classes. mAP@0.5 represents the mean average precision at an IoU threshold of

0.5

, while mAP@[0.5:0.95] indicates the averaged precision across multiple IoU thresholds ranging from

0.5

to

0.95

.

In summary, these evaluation metrics form a comprehensive evaluation framework that provides strong support for in-depth analysis of the performance of polyp segmentation models in terms of pixel-level prediction accuracy and the completeness of segmentation results. Precision, recall, F1 score, mAP50, and mAP50-95 are widely recognized as standard benchmarks in YOLO-based segmentation research. These metrics are sufficient to comprehensively evaluate the model’s performance in terms of detection accuracy, lesion sensitivity, and segmentation quality. Although additional measures, such as sensitivity, specificity, accuracy, and AUC, could also be considered, they are mathematically correlated with the metrics already reported. Therefore, we focus on these widely accepted indicators in this study.

4.3. Experimental Analysis

4.3.1. Quantitative Comparison and Evaluation

The quantitative evaluation reported in this paper aims to comprehensively examine the effectiveness and generalization ability of the proposed method, GDCA-Net. We used Equations (9)–(12) to calculate performance metrics such as the accuracy, recall rate, mAP50, and mAP50-95. Given the diverse instances in the dataset—covering varying lesion sizes, shapes, and textures, as well as differences in endoscopic image quality—this study systematically tested multiple different deep learning models.

This paper focuses on utilizing deep learning models for polyp segmentation, particularly emphasizing early, precise segmentation of polyps to assist in clinical diagnosis. After thoroughly evaluating the dataset, we selected YOLOv12-seg as the primary framework due to its outstanding performance and efficiency in rapidly segmenting polyps of varying sizes and shapes. The segmentation model built on YOLOv12-seg demonstrated significant improvements across multiple performance metrics.

To systematically assess the effectiveness of the proposed method, this study conducted a comprehensive comparative analysis of a series of deep learning-based segmentation techniques and their improvements [43]. These comparison models include YOLOv6-seg [54], YOLOv8-seg, YOLOv8p2-seg [55], YOLOv10n-seg [56], YOLOv11-seg [57], YOLOv12-seg [58], EfficientNetv2-seg [59], vanillanet-seg [60], and ADNet-seg. Additionally, to assess the model’s generalization ability across diverse scenarios and requirements, we selected multiple datasets for experimentation. These include the PolypDB dataset, the Kvasir-SEG dataset, and the CVC-ClinicDB dataset. The comparison results of various performance metrics across different datasets are compiled in Table 7 and Figure 6, comprehensively demonstrating the advantages and disadvantages of the proposed GDCA-Net model compared to other models.

Table 7. Comparison of experimental results.

Figure 6. Curves of model evaluation metrics. (a) The mean Average Precision at 50% IoU threshold (mAP50) metric change graph of GDCA-Net under the PolypDB dataset. (b) The mAP50 metric change graph of GDCA-Net under the Kvasir-SEG dataset. (c) The mAP50 metric change graph of GDCA-Net under the CVC-ClinicDB dataset.

As shown in Table 7, the proposed GDCA-Net model performs well on most core metrics. On the highly challenging PolypDB dataset, GDCA-Net achieved the best results in terms of both the mAP50 and mAP50-95 metrics, with values of 85.9% and 46.9%, respectively. This indicates that the model demonstrates strong robustness when faced with challenging data such as low image quality, uneven lighting, and blurred polyp boundaries. On the high-quality Kvasir-SEG dataset, GDCA-Net also performed exceptionally well. GDCA-Net topped the rankings, with an F1 score of 94.9%, and achieved outstanding results of 97.0% and 74.1% on mAP50 and mAP50-95, respectively. It is worth noting that other advanced models in this dataset, such as YOLOv11-seg and YOLOv8-seg, also achieved very high scores, indicating that the dataset is relatively less challenging. However, GDCA-Net maintains its lead on this high-standard dataset, further validating its advanced capabilities. On the CVC-ClinicDB dataset, GDCA-Net achieved an mAP50 of 98.5% and an mAP50-95 of 82.9%, with performance comparable to that of advanced models such as ADNet-seg and YOLOv8-p2-seg.

Although GDCA-Net does not achieve the highest precision or recall across all datasets when compared to certain models (e.g., YOLOv8-seg and ADNet-seg), it demonstrates consistently high performance across all metrics and datasets, indicating superior overall generalization capability. Those models with marginally higher precision often exhibit a corresponding decrease in recall, highlighting the inherent trade-off between detection sensitivity and false-positive suppression. In contrast, GDCA-Net achieves a more balanced performance profile, which is particularly crucial in clinical applications where both high sensitivity and high specificity are equally critical.

These experimental results strongly demonstrate the effectiveness and robustness of GDCA-Net in handling datasets of different styles and with differing challenges. It is worth noting that in multiple comparison experiments, due to the strict requirements of EfficientNetv2-seg and vanillanet-seg on the dataset, they performed poorly on the PolypDB dataset, resulting in underfitting.

4.3.2. Ablation Experiments

To clarify the contributions and functions of each component in the GDCA-Net model, a series of ablation experiments was conducted in this study. Specifically, eight ablation experiments were performed using the PolypDB dataset, covering the YOLOv12-seg model, models ➀ to ➇, and GDCA-Net. The experimental results are detailed in Table 8. By systematically introducing improved modules based on the YOLOv12-seg baseline model, this study evaluated the roles of the GD mechanism, AKConv, CAF, ContMix, and Wise-IoU. The experimental results clearly demonstrate that the model’s outstanding performance is not accidental but the result of the synergistic effects of multiple key technologies.

Table 8. Results of ablation experiments.

In the evaluation of individual components, model ➃ achieved the most significant performance improvement. Its mAP50 improved significantly from 83.7% in the baseline model to 86.4%, and its F1 score also jumped from 84.8% to 86.7%. This significant improvement demonstrates that Wise-IoU can achieve extraordinary effectiveness in optimizing boundary regression and handling complex, uneven segmentation samples. Thus, it has become the core driving force behind the model’s performance improvement. Meanwhile, model ➁ improved the mAP50 by 1.1% and mAP50-95 by 2.1%, demonstrating the effectiveness of this mechanism in enhancing multi-scale feature fusion; model ➂ had a positive impact on recall and mAP50-95, indicating that CAF and ContMix effectively enhance the model’s ability to capture contextual information and irregular morphological features. By integrating global contextual attention and dynamic convolutions that can adapt to irregular shapes, the model can more accurately distinguish polyps in complex backgrounds and precisely segment their diverse, non-linear shapes.

Further experiments reveal the strong synergistic effects between components. When the GD mechanism and AKconv are combined with CAF and ContMix (Model ➄), the model’s performance is further improved in terms of mAP50 and F1 scores. Notably, the combination of CAF and ContMix with Wise-IoU (Model ➅) performs well, achieving an mAP50 of 85.5% and an F1 score of 85.9%, demonstrating strong synergistic gains. The combination of the GD mechanism with AKConv and Wise-IoU (Model ➆) also achieved outstanding performance, with an mAP50 of 85.2% and mAP50-95 of 47.5%.

Finally, by integrating all improved components—GD mechanism with AKConv, CAF with ContMix, and Wise-IoU—into GDCA-Net—the model achieves the best overall performance among all combinations. Although its F1 score of 85.5% is slightly lower than the peak value achieved by Wise-IoU alone, its mAP50 and mAP50-95 reach 85.9% and 46.9%, respectively, approaching optimal levels across all evaluation metrics. This fully demonstrates that GDCA-Net achieves optimal balance and optimization across all performance dimensions by integrating all components, making it a robust model that performs exceptionally well in various complex scenarios.

It should be noted that Models ➃, ➆, and ➇ achieved comparable results, as they share several key components (e.g., Wise-IoU and GD + AKConv), which significantly enhances segmentation performance. However, Model ➇ consistently demonstrates more stable performance across all metrics, indicating that the synergistic integration of modules yields more robust and balanced performance improvements than any individual component alone.

4.3.3. Qualitative Analysis

To comprehensively evaluate the segmentation performance of the GDCA-Net model, this study not only conducted the aforementioned quantitative analysis but also performed qualitative analysis. First, we randomly selected 12 images from the PolypDB dataset [51], as shown in Figure 7a. The GDCA-Net demonstrated strong robustness and adaptability under challenging conditions such as low image quality, uneven lighting, complex backgrounds, and blurred polyp boundaries. The model can reliably segment polyps of various shapes and handles blurred boundaries with great precision, with its prediction results highly consistent with the ground-truth labels (Figure 7b).

Figure 7. Qualitative comparisons between the segmentation results of GDCA-Net and the ground truth on samples from three datasets. From left to right: segmentation outputs of GDCA-Net and ground-truth masks. The results are evaluated based on the Intersection over Union (IoU) metric, where a higher IoU indicates a closer match between the predicted mask and the ground truth. (a) Segmentation results on the PolypDB dataset. (b) The mask on the PolypDB dataset. (c) Segmentation results on the Kvasir-SEG dataset. (d) The mask on the Kvasir-SEG dataset. (e) Segmentation results on the CVC-ClinicDB dataset. (f) The mask on the CVC-ClinicDB dataset.

Second, to assess the model’s generalization ability, we randomly selected 12 images from the Kvasir-SEG dataset [52], as shown in Figure 7c. The images in this dataset have relatively high quality, with polyp boundaries typically being clear. GDCA-Net also demonstrated exceptional segmentation capabilities on this dataset, with its prediction results matching the true labels in Figure 7d. This demonstrates that GDCA-Net can effectively utilize the rich information in high-quality images to achieve high-precision segmentation and successfully generalize to datasets of different styles.

In addition, to comprehensively evaluate the segmentation performance of GDCA-Net across different clinical scenarios, we randomly selected 12 images from the CVC-ClinicDB dataset, as shown in Figure 7e. Polyps in this dataset are often characterized by uneven illumination, mucus interference, and complex surrounding tissue textures. Under these challenging conditions, GDCA-Net still demonstrates excellent segmentation performance, accurately capturing both the overall structure and subtle contours of polyps. Its predicted results show a high degree of consistency with the ground-truth labels in Figure 7f; even in the presence of slight occlusion, the model maintains stable segmentation consistency.

Overall, the GDCA-Net model proposed in this study performs well in tasks involving different image qualities and polyp features, covering a wide range of scenarios, including blurry, complex, clear, and simple ones. On the PolypDB core dataset, GDCA-Net can accurately and effectively segment various types of polyps, thereby providing important auxiliary support for clinical doctors in early diagnosis.

4.3.4. Failure Cases Analysis

Although GDCA-Net demonstrates superior segmentation performance across various datasets, we also observed a few typical failure cases during qualitative analysis, as shown in Figure 8. Including and analyzing these cases is essential for understanding the current limitations of the proposed method and guiding future improvements. As shown in Figure 8, GDCA-Net occasionally fails to detect polyps with extremely low contrast, smooth texture, or unclear boundaries, particularly when they are small or flat against the surrounding mucosa. In these cases, the model struggles to differentiate subtle intensity variations, leading to incomplete or missed segmentation regions. In addition, the model sometimes misclassifies bright reflections caused by endoscopic illumination as polyp regions. These false positives are likely due to the similar intensity distribution between specular highlights and actual lesions, which confuses the feature extraction process.

Figure 8. Examples of failure cases encountered by GDCA-Net. These cases reveal potential areas for improvement in illumination invariance and contextual understanding. (a) False positives caused by specular highlights misclassified as polyp regions. (b) Missed detection of flat polyps with low contrast and unclear boundaries.

These failure cases highlight potential areas for improvement. Future work will focus on incorporating illumination-invariant feature representations and context-aware refinement mechanisms to better handle challenging imaging conditions and reduce misclassification.

5. Conclusions

This study proposes an improved gastrointestinal polyp segmentation algorithm based on an enhanced version of YOLOv12-seg. Compared to the base model, the proposed algorithm framework first reconstructs the neck network by introducing the GD mechanism, significantly enhancing multi-scale feature fusion capabilities; second, it employs the AKConv module to achieve dynamic sampling of arbitrarily shaped convolution kernels, enhancing the modeling capability of irregularly shaped polyp edges; Furthermore, it integrates the CAFM and ContMix modules to combine local details with global contextual information, optimizing the modeling of long-range dependencies in segmentation boundaries; finally, it introduces the Wise-IoU loss function, utilizing a dynamic non-monotonic focusing mechanism and mask quality evaluation strategy to enhance the model’s robustness against low-quality samples. Empirical results show that this algorithm outperforms existing methods in terms of polyp segmentation accuracy and boundary consistency, achieving significant results on both public datasets and internal clinical data while maintaining real-time inference efficiency.

Although the algorithm performs exceptionally well, it still exhibits minor segmentation errors under extreme lighting conditions, small polyps, and blurred boundaries. Future work will focus on the following directions: constructing a multi-center, multi-modal polyp image dataset to deeply analyze the correlation between polyp morphological features and segmentation accuracy; exploring lightweight architectures based on Transformers to balance model complexity and real-time segmentation requirements; and combining pathological prior knowledge to study the mapping relationship between polyp malignancy and segmentation boundaries, providing more comprehensive decision support for clinical diagnosis. The ultimate goal is to achieve a high-precision, robust intelligent segmentation system through algorithm optimization and multidisciplinary research.

Author Contributions

Conceptualization, W.Q.; data curation, W.Q. and Y.O.; formal analysis, W.Q.; funding acquisition, W.Q., Y.O. and P.Y.; methodology, W.Q.; resources, P.Y.; supervision, P.Y.; validation, W.Q.; visualization, W.Q. and Y.O.; writing—original draft, W.Q. and Y.O.; writing—review and editing, P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62461027); the Natural Science Foundation of Hunan Province, China (Grant No. 2024JJ739); the Research Foundation of the Education Bureau of Hunan Province, China (Grant No. 22A0371); the Hunan Student’s Innovation and Entrepreneurship Training Program (Grant No. S202510531054); and the Hunan Provincial Common Institutions of Higher Learning Teaching Reform Research Project (Grant No. HNJG-20230695).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, S.; Ren, Y.; Yu, Y.; Jiang, Q.; He, X.; Li, H. A survey of deep learning algorithms for colorectal polyp segmentation. Neurocomputing 2025, 614, 128767. [Google Scholar] [CrossRef]
Rana, D.; Pratik, S.; Balabantaray, B.K.; Peesapati, R.; Pachori, R.B. GCAPSeg-Net: An efficient global context-aware network for colorectal polyp segmentation. Biomed. Signal Process. Control 2025, 100, 106978. [Google Scholar] [CrossRef]
Mameli, M.; Shiralizadeh, S.; Papi, M.; Coltea, I.G. DeepPolyp: An artificial intelligence framework for polyp detection and segmentation. Explor. Digit. Health Technol. 2025, 3, 101158. [Google Scholar] [CrossRef]
Liao, B.; Han, L.; Cao, X.; Li, S.; Li, J. Double integral-enhanced Zeroing neural network with linear noise rejection for time-varying matrix inverse. CAAI Trans. Intell. Technol. 2024, 9, 197–210. [Google Scholar] [CrossRef]
Liao, B.; Wang, Y.; Li, J.; Guo, D.; He, Y. Harmonic noise-tolerant ZNN for dynamic matrix pseudoinversion and its application to robot manipulator. Front. Neurorobot. 2022, 16, 928636. [Google Scholar] [CrossRef]
Xiao, L.; Dai, J.; Lu, R.; Li, S.; Li, J.; Wang, S. Design and comprehensive analysis of a noise-tolerant ZNN model with limited-time convergence for time-dependent nonlinear minimization. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5339–5348. [Google Scholar] [CrossRef]
Xiao, L.; He, Y.; Dai, J.; Liu, X.; Liao, B.; Tan, H. A variable-parameter noise-tolerant zeroing neural network for time-variant matrix inversion with guaranteed robustness. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 1535–1545. [Google Scholar] [CrossRef]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Li, D.; Hu, J.; Wang, C.; Li, X.; She, Q.; Zhu, L.; Zhang, T.; Chen, Q. Involution: Inverting the inherence of convolution for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12321–12330. [Google Scholar]
Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6070–6079. [Google Scholar]
Li, S.; Wang, Z.; Liu, Z.; Tan, C.; Lin, H.; Wu, D.; Chen, Z.; Zheng, J.; Li, S.Z. Moganet: Multi-order gated aggregation network. arXiv 2022, arXiv:2211.03295. [Google Scholar]
Liu, X.; Isa, N.A.M.; Chen, C.; Lv, F. Colorectal Polyp Segmentation Based on Deep Learning Methods: A Systematic Review. J. Imaging 2025, 11, 293. [Google Scholar] [CrossRef]
Chen, Z.; Lu, S. Caf-yolo: A robust framework for multi-scale lesion detection in biomedical imagery. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar]
Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 128–138. [Google Scholar]
Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inf. Process. Syst. 2023, 36, 51094–51112. [Google Scholar]
Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional kernel with arbitrary sampled shapes and arbitrary number of parameters. arXiv 2023, arXiv:2311.11587. [Google Scholar] [CrossRef]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
Pal, A.; Rajanala, S.; Ting, C.; Phan, R. Denoising via Repainting: An image denoising method using layer wise medical image repainting. arXiv 2025, arXiv:2503.08094. [Google Scholar] [CrossRef]
Moghtaderi, S.; Yaghoobian, O.; Wahid, K.A.; Lukong, K.E. Endoscopic image enhancement: Wavelet transform and guided Filter decomposition-based Fusion Approach. J. Imaging 2024, 10, 28. [Google Scholar] [CrossRef]
Zhao, Y.; Xu, J. A small sample bearing fault diagnosis method based on novel Zernike moment feature attention convolutional neural network. Meas. Sci. Technol. 2024, 35, 066208. [Google Scholar] [CrossRef]
Santhoshi, A.; Muthukumaravel, A. Texture and Shape-Based Feature Extraction for Colorectal Tumor Segmentation. In Proceedings of the 2024 10th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 14–15 March 2024; IEEE: New York, NY, USA, 2024; Volume 1, pp. 315–320. [Google Scholar]
Prasad, G.; Gaddale, V.S.; Kamath, R.C.; Shekaranaik, V.J.; Pai, S.P. A study of dimensionality reduction in GLCM feature-based classification of machined surface images. Arab. J. Sci. Eng. 2024, 49, 1531–1553. [Google Scholar] [CrossRef]
Dinesh, P.; Vickram, A.; Kalyanasundaram, P. Medical image prediction for diagnosis of breast cancer disease comparing the machine learning algorithms: SVM, KNN, logistic regression, random forest and decision tree to measure accuracy. In Proceedings of the AIP Conference Proceedings; AIP Publishing LLC: Melville, NY, USA, 2024; Volume 2853, p. 020140. [Google Scholar]
Thakur, N.; Kumar, P.; Kumar, A. A systematic review of machine and deep learning techniques for the identification and classification of breast cancer through medical image modalities. Multimed. Tools Appl. 2024, 83, 35849–35942. [Google Scholar] [CrossRef]
Espinosa, R.; Cerriteño, J.; Gonzalez-Dominguez, S.; Ochoa-Ruiz, G.; Daul, C. A deep learning-based image pre-processing pipeline for enhanced 3D colon surface reconstruction robust to endoscopic illumination artifacts. In Proceedings of the 2024 IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS), Guadalajara, Mexico, 26–28 June 2024; IEEE: New York, NY, USA, 2024; pp. 81–88. [Google Scholar]
Annavarapu, A.; Borra, S. An adaptive watershed segmentation based medical image denoising using deep convolutional neural networks. Biomed. Signal Process. Control 2024, 93, 106119. [Google Scholar] [CrossRef]
Mei, J.; Zhou, T.; Huang, K.; Zhang, Y.; Zhou, Y.; Wu, Y.; Fu, H. A survey on deep learning for polyp segmentation: Techniques, challenges and future trends. Vis. Intell. 2025, 3, 1. [Google Scholar] [CrossRef]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–11. [Google Scholar]
Hu, S.; Gao, F.; Zhou, X.; Dong, J.; Du, Q. Hybrid convolutional and attention network for hyperspectral image denoising. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
Kathariya, B.; Li, Z.; Wang, H.; Van Der Auwera, G. Multi-stage locally and long-range correlated feature fusion for learned in-loop filter in VVC. In Proceedings of the 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; IEEE: New York, NY, USA, 2022; pp. 1–5. [Google Scholar]
Zhang, X.; Liu, C.; Yang, D.; Song, T.; Ye, Y.; Li, K.; Song, Y. RFAConv: Innovating spatial attention and standard convolutional operation. arXiv 2023, arXiv:2304.03198. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
Xiao, L.; Liao, B.; Li, S.; Chen, K. Nonlinear recurrent neural networks for finite-time solution of general time-varying linear matrix equations. Neural Netw. 2018, 98, 102–113. [Google Scholar] [CrossRef]
Khanam, R.; Hussain, M. A Review of YOLOv12: Attention-Based Enhancements vs. Previous Versions. arXiv 2025, arXiv:2504.11995. [Google Scholar] [CrossRef]
Hua, C.; Cao, X.; Liao, B.; Li, S. Advances on intelligent algorithms for scientific computing: An overview. Front. Neurorobot. 2023, 17, 1190977. [Google Scholar] [CrossRef]
Yılmaz, A.; Yurtay, Y.; Yurtay, N. AYOLO: Development of a Real-Time Object Detection Model for the Detection of Secretly Cultivated Plants. Appl. Sci. 2025, 15, 2718. [Google Scholar] [CrossRef]
Mao, R.; Shen, D.; Wang, R.; Cui, Y.; Hu, Y.; Li, M.; Wang, M. An Integrated Gather-and-Distribute Mechanism and Attention-Enhanced Deformable Convolution Model for Pig Behavior Recognition. Animals 2024, 14, 1316. [Google Scholar] [CrossRef]
Xiong, C.; Zayed, T.; Abdelkader, E.M. A novel YOLOv8-GAM-Wise-IoU model for automated detection of bridge surface cracks. Constr. Build. Mater. 2024, 414, 135025. [Google Scholar]
Liao, B.; Xiang, Q.; Li, S. Bounded Z-type neurodynamics with limited-time convergence and noise tolerance for calculating time-dependent Lyapunov equation. Neurocomputing 2019, 325, 234–241. [Google Scholar] [CrossRef]
Wang, T.; Zhang, Z.; Huang, Y.; Liao, B.; Li, S. Applications of zeroing neural networks: A survey. IEEE Access 2024, 12, 51346–51363. [Google Scholar] [CrossRef]
Xiao, L.; Liao, B. A convergence-accelerated Zhang neural network and its solution application to Lyapunov equation. Neurocomputing 2016, 193, 213–218. [Google Scholar] [CrossRef]
Liao, B.; Xu, J.; Hua, C.; Wang, T.; Li, S. Predefined-time ZNN model with noise reduction for solving quadratic programming and its application to binary assignment problem in logistics: B. Liao et al. J. Supercomput. 2025, 81, 1228. [Google Scholar] [CrossRef]
Jha, D.; Tomar, N.K.; Sharma, V.; Trinh, Q.H.; Biswas, K.; Pan, H.; Jha, R.K.; Durak, G.; Hann, A.; Varkey, J.; et al. Polypdb: A curated multi-center dataset for development of ai algorithms in colonoscopy. arXiv 2024, arXiv:2409.00045. [Google Scholar]
Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Halvorsen, P.; De Lange, T.; Johansen, D.; Johansen, H.D. Kvasir-seg: A segmented polyp dataset. In Proceedings of the International Conference on Multimedia Modeling, Daejeon, Republic of Korea, 5–8 January 2020; Springer: Cham, Switzerland, 2019; pp. 451–462. [Google Scholar]
Bernal, J.; Sánchez, F.J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; Vilariño, F. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Comput. Med Imaging Graph. 2015, 43, 99–111. [Google Scholar] [CrossRef]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on yolov8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. Yolov10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
Chen, H.; Wang, Y.; Guo, J.; Tao, D. Vanillanet: The power of minimalism in deep learning. Adv. Neural Inf. Process. Syst. 2023, 36, 7050–7064. [Google Scholar]

Figure 1. Improved network architecture. The Gather-and-Distribute (GD) mechanism is integrated into the neck of the original You Only Look Once version 12 segmentation model (YOLOv12-Seg) network: Low-GD replaces the up-sampling fusion in Path Aggregation Network (PANet), while Higher-GD substitutes the down-sampling fusion.

Figure 2. Structure of the GD mechanism. The GD mechanism enhances the model’s ability to detect objects of varying sizes by constructing two dedicated branches: the Low-Stage Gather-and-Distribute Branch (Low-GD) and the High-Stage Gather-and-Distribute Branch (High-GD).

Figure 3. Module diagram of the attention and convolution fusion between the local branch and the global branch. The former specializes in efficiently extracting local details and facilitating inter-channel interactions, while the latter focuses on modeling long-range feature dependencies and capturing global spatial relationships.

Figure 4. Dynamic convolution structure diagram. The advantage of Alterable Kernel Convolution (AKConv) lies in its ability to efficiently model complex kernel shapes by learning offsets and dynamic sampling grids while maintaining flexible sampling.

Figure 5. Segmentation loss curves during training. Blue: training; red: validation. Both losses converge smoothly, indicating effective model training. Note that training on the Kvasir-SEG and CVC-ClinicDB datasets was terminated around epoch 175 due to the early stopping criterion. (a) Loss trend observed on the PolypDB dataset. (b) Loss trend observed on the Kvasir-SEG dataset. (c) Loss trend observed on the CVC-ClinicDB dataset.

Figure 6. Curves of model evaluation metrics. (a) The mean Average Precision at 50% IoU threshold (mAP50) metric change graph of GDCA-Net under the PolypDB dataset. (b) The mAP50 metric change graph of GDCA-Net under the Kvasir-SEG dataset. (c) The mAP50 metric change graph of GDCA-Net under the CVC-ClinicDB dataset.

Figure 7. Qualitative comparisons between the segmentation results of GDCA-Net and the ground truth on samples from three datasets. From left to right: segmentation outputs of GDCA-Net and ground-truth masks. The results are evaluated based on the Intersection over Union (IoU) metric, where a higher IoU indicates a closer match between the predicted mask and the ground truth. (a) Segmentation results on the PolypDB dataset. (b) The mask on the PolypDB dataset. (c) Segmentation results on the Kvasir-SEG dataset. (d) The mask on the Kvasir-SEG dataset. (e) Segmentation results on the CVC-ClinicDB dataset. (f) The mask on the CVC-ClinicDB dataset.

Figure 8. Examples of failure cases encountered by GDCA-Net. These cases reveal potential areas for improvement in illumination invariance and contextual understanding. (a) False positives caused by specular highlights misclassified as polyp regions. (b) Missed detection of flat polyps with low contrast and unclear boundaries.

Table 1. Description of WLI from the PolypDB dataset.

Source	Category	Sample Count
High-quality colonoscopic images from multiple medical centers worldwide [51]	Training set	2870
	Validation set	358
	Test set	360
	Total	3588

Table 2. Kvasir-SEG and CVC-ClinicDB dataset description.

Dataset	Source	Training Set	Validation Set	Test Set
Kvasir-SEG	Vestre Viken Health Trust [52]	800	100	100
CVC-ClinicDB	Hospital Clinic, Barcelona and Computer Vision Center [53]	489	61	62

Table 3. Description of the software and hardware environment.

Device Type	Device Description
GPU	RTX 4090D 24 GB
CPU	18 vCPU AMD EPYC 9754 128-Core Processor (AMD)
RAM	60 GB
Storage	System disk: 30 GB; Data disk: 50 GB

Table 4. Optimizer parameter configuration.

Parameter Category	Parameter Name	Setting Value
Optimizer	Type	SGD
Basic Parameters	Initial Learning Rate	0.01
	Momentum	0.937
	Weight Decay	0.0005
Warm-up Strategy	Warm-up Period	3
Warm-up Strategy	Warm-up Momentum	0.8

Table 5. Data Augmentation Strategy.

Augmentation Type	Specific Method	Parameter Setting
Geometric Transformation	Random Horizontal Flip	Probability: 0.5
	Random Translation	Ratio: 0.1
	Random Scaling	Ratio: 0.5
Color Augmentation	Random Erasing	Probability: 0.4
	AutoAugment	randaugment
Advanced Augmentation	Copy–Paste	Probability: 0.1; Mode: Flip

Table 6. Training hyperparameter settings.

Dataset	Training Hyperparameters	Setting Quantity
PolypDB	Learning rate	0.01
	Batch size	16
	Epoch	200
	Image size	640
Kvasir-SEG	Learning rate	0.01
	Batch size	16
	Epoch	200
	Image size	640
CVC-ClinicDB	Learning rate	0.01
	Batch size	16
	Epoch	200
	Image size	640

Table 7. Comparison of experimental results.

Dataset	Model	Precision (%)	Recall (%)	mAP50 (%)	mAP50-95 (%)	F1 Score
PolypDB	YOLOv6-seg	93.4	79.8	84.9	44.8	86.1
	YOLOv8p2-seg	88.5	81.3	84.9	44.9	84.7
	YOLOv8-seg	88.2	82.9	85.3	46.3	85.5
	yolov10n-seg	90.1	82.5	85.4	46.6	86.1
	YOLOv11-seg	91.6	79.4	84.3	45.4	85.1
	YOLOv12-seg	90.5	79.7	83.7	44.0	84.8
	EfficientNetv2-seg	82.8	76.9	78.7	37.5	79.7
	vanillanet-seg	72.0	62.7	59.8	24.9	67.0
	ADNet-seg	89.3	80.7	83.6	44.7	84.8
	GDCA-Net	90.6	81.0	85.9	46.9	85.5
Kvasir-SEG	YOLOv6-seg	89.7	92.0	93.3	71.6	90.8
	YOLOv8p2-seg	91.7	88.7	94.7	73.8	90.2
	YOLOv8-seg	95.8	86.0	96.0	76.0	90.6
	yolov10n-seg	88.2	89.7	94.9	72.9	88.9
	YOLOv11-seg	95.7	90.0	96.9	73.8	92.8
	YOLOv12-seg	94.4	89.0	96.5	74.8	91.6
	EfficientNetv2-seg	91.9	87.0	94.0	65.7	89.4
	vanillanet-seg	87.0	88.0	92.5	67.2	87.5
	ADNet-seg	88.6	93.0	96.3	73.5	90.7
	GDCA-Net	93.9	96.0	97.0	74.1	94.9
CVC-ClinicDB	YOLOv6-seg	98.3	94.8	97.1	82.7	96.5
	YOLOv8p2-seg	98.1	96.7	98.3	83.8	97.4
	YOLOv8-seg	97.3	91.8	96.0	83.2	94.5
	yolov10n-seg	98.3	94.8	97.0	82.8	96.5
	YOLOv11-seg	94.5	95.1	98.4	82.4	94.8
	YOLOv12-seg	83.5	95.1	95.0	81.2	88.9
	EfficientNetv2-seg	96.2	91.8	95.5	75.4	93.9
	vanillanet-seg	98.2	87.4	97.8	81.2	92.5
	ADNet-seg	98.3	94.9	98.4	83.3	96.6
	GDCA-Net	98.1	93.4	98.5	82.9	95.7

Table 8. Results of ablation experiments.

Num	GD and AKConv	CAF and ContMix	Wise-IoU	Precision (%)	Recall (%)	mAP50 (%)	mAP50-95 (%)	F1 Score
➀				90.5	79.7	83.7	44.0	84.8
➁	✓			90.3	79.7	84.8	46.1	84.7
➂		✓		87.4	81.0	83.9	44.9	84.1
➃			✓	90.1	83.5	86.4	45.3	86.7
➄	✓	✓		89.8	81.9	84.9	45.6	85.7
➅		✓	✓	90.6	81.7	85.5	46.2	85.9
➆	✓		✓	91.3	78.5	85.2	47.5	84.4
➇	✓	✓	✓	90.6	81.0	85.9	46.9	85.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Research on Polyp Segmentation via Dynamic Multi-Scale Feature Fusion and Global–Local Semantic Enhancement

Abstract

1. Introduction

3. Methodology

3.1. YOLOv12-Seg Overview

3.2. Improved Feature Fusion Network

3.3. Principles of Information Aggregation and Distribution Mechanism

3.4. Convolution and Attention Fusion Module

3.5. Variably Sized Convolution

3.5.1. Define the Initial Sampling Position

3.5.2. Novel Deformable Convolution Operation

3.5.3. Extended AKConv

3.6. Context-Mixing Dynamic Convolution

3.7. Introduction of Wise-IoU

4. Experimental Results

4.1. Datasets Preparation

4.2. Preparation for the Experiment

4.2.1. Experimental Environment

4.2.2. Model Training Strategy

4.2.3. Training Parameters

4.2.4. Evaluation Metrics

4.3. Experimental Analysis

4.3.1. Quantitative Comparison and Evaluation

4.3.2. Ablation Experiments

4.3.3. Qualitative Analysis

4.3.4. Failure Cases Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Research on Polyp Segmentation via Dynamic Multi-Scale Feature Fusion and Global–Local Semantic Enhancement

Abstract

1. Introduction

2. Related Works

2.1. Traditional Polyp Segmentation Methods

2.2. Advanced Polyp Segmentation Methods

3. Methodology

3.1. YOLOv12-Seg Overview

3.2. Improved Feature Fusion Network

3.3. Principles of Information Aggregation and Distribution Mechanism

3.4. Convolution and Attention Fusion Module

3.5. Variably Sized Convolution

3.5.1. Define the Initial Sampling Position

3.5.2. Novel Deformable Convolution Operation

3.5.3. Extended AKConv

3.6. Context-Mixing Dynamic Convolution

3.7. Introduction of Wise-IoU

4. Experimental Results

4.1. Datasets Preparation

4.2. Preparation for the Experiment

4.2.1. Experimental Environment

4.2.2. Model Training Strategy

4.2.3. Training Parameters

4.2.4. Evaluation Metrics

4.3. Experimental Analysis

4.3.1. Quantitative Comparison and Evaluation

4.3.2. Ablation Experiments

4.3.3. Qualitative Analysis

4.3.4. Failure Cases Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics