1. Introduction
With the rapid development of computer-aided technologies in the medical field, these intelligent tools provide healthcare professionals with more efficient and precise diagnostic support. The effectiveness of medical image segmentation methods critically influences physicians’ analysis and diagnosis of patient conditions [
1]. Accurate edge segmentation in medical images can clearly distinguish between pathological regions and normal tissues. For example, the segmentation of intestinal polyp images determines the scope of polyp resection: an excessively large resection area may damage healthy intestinal cells, while an insufficient resection may leave potential risks. Therefore, the precision of edge segmentation is particularly crucial in medical image segmentation. However, in many medical images, the edge regions requiring segmentation are often blurred, making it challenging to delineate the segmentation boundaries.
The emergence of U-Net [
2] provided a foundational framework for biomedical image processing. In recent years, numerous methods have focused on improving skip connections from a global perspective to achieve comprehensive fusion of multi-level features. UNet++ [
3] employed a combination of long and short skip connections to aggregate features at each node of the decoder. UNet# [
4] enhanced the restrictive skip connections in UNet++ to enable more flexible and thorough fusion of decoder-extracted features. GLFRNet [
5] proposed global feature reconstruction based on multi-level features from the encoder to capture global contextual information. MSD-Net [
6] introduced a channel-based attention mechanism, utilizing cascaded pyramid convolution blocks for multi-scale spatial context fusion. These methods, which focused on skip connections, aimed to effectively capture long-range dependencies in downsampled feature maps to facilitate the restoration of spatial information during upsampling. However, they often overlooked the challenges posed by blurred edge segmentation. To address this issue, several approaches have been developed: SFA [
7] incorporated an additional decoder dedicated to boundary prediction and employed a boundary-sensitive loss function to leverage region-boundary relationships. This specifically designed loss introduces region-boundary constraints to generate more accurate predictions. Psi-Net [
8] proposed three parallel decoders for contour extraction, mask prediction, and distance map estimation tasks. EA-Net [
9] introduced an edge preservation module that integrates learned boundaries into intermediate layers to enhance segmentation accuracy. MEGANet [
10] presented an edge-guided attention module utilizing the Laplacian operator to emphasize boundary information. Although these methods enhance the model's attention to boundaries through explicit boundary-related supervision, they fail to fundamentally improve the model's inherent capability to autonomously extract comprehensive edge features. As the encoder progressively extracts deeper features, it primarily captures abstract semantic information. Since these approaches do not specifically process the encoder's shallow features, the edge feature information contained in the early encoding stages remains underutilized.
Vision Transformer [
11] marked a breakthrough by applying Transformer [
12] to image processing, challenging the dominance of traditional convolutional neural networks. TransUnet [
13] replaced the deepest features in the U-Net encoder with features extracted from Vision Transformer. TCRNet [
14] integrated both ViT and CNN as encoders in the segmentation network. Due to the inherent limitations of convolutional operations, CNN-based methods struggle to learn long-range semantic interactions. Consequently, researchers have begun incorporating Transformer technology to enhance skip connections. MCTrans [
15] combined self-attention and cross-attention modules, enabling the model to capture cross-scale contextual dependencies and correspondences between different categories. Swin-UNet [
16] replaced convolutional modules in U-Net with Swin Transformer blocks. DS-TransNet [
17] employed dual Swin-Transformers as the encoder on Swin-Unet’s architecture. PolypFormer [
18] introduced a novel cross-shaped windows self-attention mechanism and integrated it into the Transformer architecture to enhance the semantic understanding of polyp regions. However, the semantic levels and types of information contained in encoder features at different scales are inconsistent, leading to a semantic gap between multi-scale features in conventional Transformer architectures. Consequently, segmentation results from these methods often exhibit satisfactory overall segmentation but coarse edge delineation. Moreover, cross-channel feature fusion Transformers primarily focus on global information integration and fail to effectively address the semantic mismatch between the fused features output by the Transformer and the decoder features.
To address the aforementioned challenges—insufficient extraction of edge features in medical images, dilution of edge information during long-range modeling, and semantic mismatch between cross-channel Transformer outputs and decoder features—this paper proposes a Dual-Path Fusion Network (DPF-Net) based on edge feature enhancement. The contribution of this work can be summarized in four key aspects:
We introduce an Edge Feature Enhancement Gating (EFG) module that effectively aggregates edge features extracted by the early encoder. This strategy not only provides more complete edge features for subsequent inputs but also enhances the model’s robustness in extracting edge features in complex scenarios.
We introduce the Channel-wise Cross-Fusion Transformer (CCFT) module with its channel-wise cross-attention mechanism, which not only effectively avoids the dilution of edge information during multi-scale fusion but also shifts edge information from passive transmission to active enhancement.
The proposed Dual-Path Fusion (DPFM) module adopts a two-path strategy: an attention path (with channel and spatial attention) that highlights important edge regions and suppresses background noise, and a weighting path that performs targeted fusion of multi-source features to alleviate semantic gaps.
The proposed DPF-Net is applied to four medical image segmentation tasks across six public datasets, including gland segmentation, colon polyp segmentation, and skin lesion segmentation. Compared with other popular methods, DPF-Net achieves better performance.
2. Method
2.1. Comprehensive Framework
The overall architecture of our proposed network is illustrated in
Figure 1. The input image is first processed through a 3 × 3 convolutional layer, followed by batch normalization and a ReLU activation function, to generate the initial feature map. This feature map then undergoes four successive pooling operations, progressively reducing the spatial resolution while increasing the receptive field, thereby producing the remaining four multi-scale feature maps from the encoder. To capture more comprehensive edge information in the encoder pathway, we incorporate an Edge Feature Enhancement Gating (EFG) module. This module strengthens the extraction of edge contours in lesion regions by performing weighted fusion of the edge features contained in the shallow-level features. The enhanced feature maps from this module, together with the features from the other encoder layers (except the deepest one), are processed through patch embedding and then fed into the Channel-wise Cross-Fusion Transformer (CCFT) module. To emphasize regions of interest and mitigate the semantic gap between the CCFT-processed features and the decoder features, both sets of features are jointly fed into the dual-path decoder. Finally, through upsampling operations, the spatial resolution is progressively restored layer by layer, ultimately yielding the segmented image.
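To make the data flow concrete, the sketch below outlines the forward pass described above. It is a minimal PyTorch-style illustration under our own assumptions: the callables encoder, efg, ccft, dpfm_blocks, decoder_blocks, and head are hypothetical interfaces introduced for exposition, not the authors' released implementation.

```python
import torch.nn.functional as F

def dpf_net_forward(x, encoder, efg, ccft, dpfm_blocks, decoder_blocks, head):
    # Encoder: initial 3x3 conv block, then four pooling stages -> five multi-scale maps
    e1, e2, e3, e4, e5 = encoder(x)                    # e1 shallowest, e5 deepest

    # EFG: fuse the two shallow feature maps to strengthen edge cues (Section 2.2)
    e1 = efg(e1, e2)

    # CCFT: channel-wise cross-fusion of the four non-deepest encoder features (Section 2.3)
    o1, o2, o3, o4 = ccft(e1, e2, e3, e4)

    # Decoder: restore resolution layer by layer, fusing CCFT outputs via the DPFM (Section 2.4)
    d = e5
    for skip, dpfm, block in zip((o4, o3, o2, o1), dpfm_blocks, decoder_blocks):
        d = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=False)
        d = block(dpfm(skip, d))                       # dual-path fusion, then conv block

    return head(d)                                     # 1x1 conv -> segmentation map
```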
2.2. Edge Feature Enhancement Gating (EFG) Module
Due to the often indistinct boundaries between lesion areas and background in medical images, effective extraction of boundary information can significantly enhance segmentation performance. To address this, we propose a gating module that integrates more comprehensive low-level edge features to serve as the first-level features in skip connections. Based on previous studies [
19,
20], early encoder-stage low-level features contain sufficient edge information. Therefore, we integrate two low-level features and adopt a gated fusion strategy [
21] to combine them. This gating approach offers the advantage of fusing complementary information from both features to obtain more complete boundary details while adaptively weighting the features to reduce noise interference. The module structure is shown in
Figure 2. First, the first two layers of features obtained from the encoder are concatenated along the channel dimension. Although both sets of encoder features contain edge characteristics, their proportions vary. Therefore, separate gating vectors are created for each feature to independently adjust their respective contributions to the final fused output. This ensures that while edge features are being fused, other detailed features are not neglected. Specifically, the two features are fed into a 3 × 3 convolutional layer followed by batch normalization and an activation function, respectively, to obtain E3 and E4. Subsequently, the two features are concatenated and passed through two separate 1 × 1 convolutional layers to compute the fusion weights. The above processing steps can be formulated as follows:
The operation concat denotes channel-wise concatenation. In order to fully integrate the edge information from the two gating vectors, we further concatenate them to form a new vector G:
A Softmax layer is then applied to obtain a probability distribution along the channel dimension for each position, which is used to weight the different channel feature maps:
Here, the symbol Gi denotes an element of G along the channel dimension, and j represents the index over all channels. The weight extracted for each channel is utilized to perform a weighted fusion of the inputs:
Here, the notation (i, j) denotes the weight value corresponding to a pixel. The two low-level features then undergo fusion via the gating mechanism to produce the final fused feature, as follows:
Owing to the sufficient edge information preserved in low-level features, the edge features are assigned greater weights. Thus, the final fused feature map better retains edge details while simultaneously suppressing noise. Finally, the fused features are passed through five convolutional blocks—each consisting of three 3 × 3 kernels followed by one 1 × 1 kernel—to produce the final output feature map with enhanced edge information.
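A compact sketch of this gated fusion is given below. It assumes PyTorch, that the deeper of the two shallow features is bilinearly upsampled to match the first, and that each gating branch predicts one weight per channel before a softmax normalizes the two branches against each other; these details, along with the channel widths, are our assumptions rather than the exact published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c):
    # Refinement block sketch: three 3x3 convolutions followed by one 1x1 convolution
    return nn.Sequential(
        nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c, c, 1),
    )

class EFG(nn.Module):
    """Edge Feature Enhancement Gating, sketched from the description in Section 2.2."""

    def __init__(self, c1, c2, c_out):
        super().__init__()
        self.proj1 = nn.Sequential(nn.Conv2d(c1, c_out, 3, padding=1),
                                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        self.proj2 = nn.Sequential(nn.Conv2d(c2, c_out, 3, padding=1),
                                   nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        # Two separate 1x1 convolutions produce one gating vector per input feature
        self.gate1 = nn.Conv2d(2 * c_out, c_out, 1)
        self.gate2 = nn.Conv2d(2 * c_out, c_out, 1)
        # Five refinement blocks applied to the fused feature map
        self.refine = nn.Sequential(*[conv_block(c_out) for _ in range(5)])

    def forward(self, f1, f2):
        # Match the resolutions of the two shallow encoder features (assumed upsampling)
        f2 = F.interpolate(f2, size=f1.shape[2:], mode="bilinear", align_corners=False)
        e1, e2 = self.proj1(f1), self.proj2(f2)        # 3x3 conv + BN + activation
        cat = torch.cat([e1, e2], dim=1)               # channel-wise concatenation
        g = torch.stack([self.gate1(cat), self.gate2(cat)], dim=1)  # gating maps: (B, 2, C, H, W)
        w = torch.softmax(g, dim=1)                    # probability distribution over the two branches
        fused = w[:, 0] * e1 + w[:, 1] * e2            # per-channel, per-pixel weighted fusion
        return self.refine(fused)
```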
2.3. Channel-Wise Cross-Fusion Transformer (CCFT) Module
To address the semantic gaps among multi-scale features and obtain comprehensively enhanced features with balanced semantics, while leveraging the strength of Transformers in modeling long-range dependencies, we introduce the Channel-wise Cross-Fusion Transformer (CCFT) module as a feature exchange platform. By employing a cross-attention mechanism along the channel dimension, this module enables features at each encoder level to adaptively select and assimilate information from all other levels—complementing details from shallow layers and semantic context from deep layers. Consequently, it proactively bridges the informational and semantic disparities among different-scale features before they are passed to the decoder.
2.3.1. Multi-Scale Feature Embedding
The four-level encoder feature maps are first flattened into 2D patch sequences for tokenization, using a level-specific patch size. The resulting tokens Ti (i = 1, 2, 3, 4) are then concatenated along the channel dimension to form the aggregated token TΣ, which serves as both the key and value in the subsequent cross-attention.
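The tokenization step can be sketched as follows. The strided-convolution embedding, and the choice of shrinking the patch size at deeper levels so that all four levels yield the same number of tokens, are assumptions made for illustration; only the channel-wise concatenation into the aggregated token TΣ is taken directly from the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Flatten one encoder feature map into a sequence of patch tokens (sketch)."""

    def __init__(self, channels, patch_size):
        super().__init__()
        # A strided convolution plays the role of non-overlapping patch extraction
        self.proj = nn.Conv2d(channels, channels, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C_i, H_i, W_i)
        x = self.proj(x)                        # (B, C_i, H_i / P_i, W_i / P_i)
        return x.flatten(2).transpose(1, 2)     # tokens T_i: (B, d, C_i), d = number of patches

# Usage sketch: if the patch size halves at each deeper level, all T_i share the same d,
# so they can be concatenated along the channel dimension to form the aggregated token:
# t_sigma = torch.cat([t1, t2, t3, t4], dim=2)  # (B, d, C1 + C2 + C3 + C4), used as key/value
```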
2.3.2. Multi-Head Cross-Attention
These tokens are subsequently fed into the CCFT module, followed by a Multi-Layer Perceptron (MLP) with a residual structure, to encode channel-wise dependencies and refine the features from each U-Net encoder level using multi-scale information. As shown in Figure 3, the CCFT module takes five inputs: the four independent tokens Ti serve as query vectors, while the aggregated token TΣ, generated through multi-token concatenation, serves as both the key and value vectors.
Here, WQi, WK, and WV denote the projection weights of the different inputs, d is the sequence length (the number of patches), and Ci (i = 1, 2, 3, 4) are the channel dimensions of the four skip-connection layers. With the queries Qi, key K, and value V obtained from these projections, the similarity matrix Mi is produced, and the value V is weighted by Mi through a cross-attention (CA) mechanism:
where ψ(·) denotes the instance normalization. In the case of N-head attention, the output is computed by applying multi-head cross-attention followed by an MLP with residual connections, as follows:
Here, N denotes the number of attention heads. Finally, the four outputs of the L-th layer are reconstructed through an up-sampling operation followed by a convolution layer and concatenated with the corresponding decoder features.
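A single-head sketch of the channel-wise cross-attention is shown below. The scaling factor, the placement of instance normalization on the similarity matrix, and the linear projection sizes are assumptions consistent with the description above; the full CCFT additionally stacks L such layers with N heads and a residual MLP.

```python
import math
import torch
import torch.nn as nn

class ChannelCrossAttention(nn.Module):
    """Single-head channel-wise cross-attention for one encoder level (sketch)."""

    def __init__(self, c_i, c_sigma):
        super().__init__()
        self.w_q = nn.Linear(c_i, c_i, bias=False)          # query projection of tokens T_i
        self.w_k = nn.Linear(c_sigma, c_sigma, bias=False)  # key projection of T_sigma
        self.w_v = nn.Linear(c_sigma, c_sigma, bias=False)  # value projection of T_sigma
        self.psi = nn.InstanceNorm2d(1)                     # instance normalization of the similarity map

    def forward(self, t_i, t_sigma):
        # t_i: (B, d, C_i) tokens of level i; t_sigma: (B, d, C_sigma) aggregated tokens
        q, k, v = self.w_q(t_i), self.w_k(t_sigma), self.w_v(t_sigma)
        # Similarity is computed BETWEEN CHANNELS, summing over the d patch positions
        sim = torch.einsum("bdc,bde->bce", q, k) / math.sqrt(k.shape[-1])    # (B, C_i, C_sigma)
        attn = torch.softmax(self.psi(sim.unsqueeze(1)).squeeze(1), dim=-1)  # similarity matrix M_i
        out = torch.einsum("bce,bde->bdc", attn, v)                          # value weighted by M_i
        return out                                                           # (B, d, C_i)
```

In the full module, the four per-level outputs would then be reshaped back into 2D maps, upsampled, convolved, and passed together with the decoder features to the DPFM described next.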
2.4. Dual-Path Fusion (DPFM) Module
Upsampling serves as a core process in U-shaped networks for restoring spatial resolution and achieving precise localization. However, relying solely on interpolation and convolution operations is insufficient to fully reconstruct the detailed information lost during downsampling. Although skip connections are introduced to supplement these details, significant semantic gaps exist between features processed by the Channel-wise Cross-Fusion Transformer (CCFT) module and those in the upsampling path. To guide channel information filtering, focus on preserving complete edge features, and resolve ambiguities with decoder features, we propose a Dual-Path Fusion Module (DPFM).
The core of this module is an innovative attention fusion mechanism that integrates local detail features from the CNN branch with global contextual representations from the Transformer branch. Each attention path is based on the classical CBAM unit [
22], which consists of a Channel Attention (CA) module and a Spatial Attention (SA) module connected in series. The CA module assigns a weight to each channel, recalibrating the channel-wise importance of the features. The SA module learns spatial weights, recalibrating the importance of each position to highlight target objects while suppressing background interference. The weighting path, formed by connecting average pooling, a feed-forward neural network, and Softmax in sequence, dynamically allocates appropriate weights to each branch, thereby reducing semantic disparities between the branches.
Figure 4 illustrates the forward propagation process of the dual-path fusion module. Initially, features from two different sources undergo targeted weighted fusion. By incorporating average pooling, a feed-forward neural network, and Softmax operations, the weighted fusion of features is further enhanced, improving compatibility during the integration of the two branch features:
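The sketch below illustrates one plausible realization of this dual-path fusion: each branch passes through a CBAM-style channel-then-spatial attention unit, while a pooled feed-forward head predicts softmax weights that blend the two branches. The exact wiring (for example, whether attention is applied before or after the weighting, and the reduction ratio) is an assumption for illustration rather than the published design.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(c // reduction, c))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))                  # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))                   # max-pooled channel descriptor
        return x * torch.sigmoid(avg + mx)[:, :, None, None]

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in the CBAM unit [22]."""
    def __init__(self, c, reduction=8):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c, reduction), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

class DPFM(nn.Module):
    """Dual-Path Fusion Module sketch: attention path plus weighting path."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.att_skip = CBAM(channels, reduction)           # attention path, CCFT branch
        self.att_dec = CBAM(channels, reduction)            # attention path, decoder branch
        # Weighting path: global average pooling -> feed-forward network -> softmax
        self.ffn = nn.Sequential(nn.Linear(2 * channels, channels), nn.ReLU(inplace=True),
                                 nn.Linear(channels, 2))

    def forward(self, skip, dec):
        a, b = self.att_skip(skip), self.att_dec(dec)       # emphasize edges, suppress background
        pooled = torch.cat([a, b], dim=1).mean(dim=(2, 3))  # (B, 2C) pooled descriptor
        w = torch.softmax(self.ffn(pooled), dim=1).view(-1, 2, 1, 1, 1)
        return w[:, 0] * a + w[:, 1] * b                    # targeted weighted fusion of the branches
```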
2.5. Loss Function
We employed a combined loss function consisting of both Cross-Entropy and Dice loss, which enables the model to achieve precise pixel-level classification while simultaneously optimizing the quality of the segmented regions. The loss function is defined as follows:
where α denotes the weighting coefficient for the Dice loss, which is set to 0.5 in our experiments, y represents the ground-truth label, ŷ is the predicted value, and p signifies the predicted probability.
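For reference, a minimal sketch of such a combined objective for binary segmentation is given below. The specific combination alpha * Dice + beta * CE with alpha = beta = 0.5 is an assumption consistent with the description above and with the hyperparameter study in Section 4.3.1, not necessarily the exact published formula.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Weighted sum of Dice loss and Cross-Entropy loss for binary segmentation (sketch)."""
    prob = torch.sigmoid(logits)                                  # predicted probability map
    ce = F.binary_cross_entropy_with_logits(logits, target)       # pixel-level classification term
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)                      # per-image Dice coefficient
    return alpha * (1 - dice.mean()) + beta * ce                  # region-quality term + pixel term
```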
4. Results
4.1. Quantitative Comparison
Table 3,
Table 4 and
Table 5 present the comparative evaluation results between our proposed method and baseline approaches across multiple metrics, with bold values indicating the best performance.
As detailed in
Table 3, our network achieves Dice and IoU scores of 80.50%/67.54% on the MoNuSeg dataset and 91.90%/85.48% on the GlaS colorectal gland dataset, representing improvements of 0.97%/0.62% and 0.68%/0.48%, respectively, over the compared methods. Additionally, our method maintains a relatively lower parameter count compared to the listed CNN-Transformer hybrid approaches.
Table 4 demonstrates that our method attains Dice and IoU values of 86.47%/78.36% on ISIC-2017 and 88.87%/81.79% on ISIC-2018, corresponding to performance gains of 0.82%/0.71% and 1.00%/1.22%, respectively, in skin lesion segmentation.
The results in
Table 5 show our network achieving 90.62%/84.23% on Kvasir and 92.14%/87.01% on CVC-ClinicDB for colon polyp segmentation, with respective improvements of 0.97%/0.69% and 0.80%/1.84% over existing approaches.
These experimental findings substantiate that our proposed network not only significantly enhances medical image segmentation accuracy but also maintains consistent performance gains across diverse dataset types, confirming its robust generalization capability.
4.2. Qualitative Comparison
To validate the effectiveness of our proposed network architecture, we conducted comparative experiments with eight methods: the classical U-Net, three edge-enhanced algorithms primarily based on CNN architectures (ET-Net, MEGANet, and ConDseg-Net), and four general segmentation algorithms combining CNN and Transformer architectures (Swin-UNet, FUSION-UNet, UCTransNet, and UDTransNet).
For each dataset, we selected two types of images: the first row contains images with indistinct lesion boundaries and low environmental contrast, while the second row presents images with relatively clear lesion edges and high environmental contrast. As shown in
Figure 5, the first row of each dataset displays the first type, and the second row shows the second type. From the results, we observe that ET-Net and ConDseg-Net demonstrate advantages over Swin-UNet, UCTransNet, and UDTransNet in segmenting low-contrast images. Their superior focus on edge features enables them to delineate contours in such environments, rarely resulting in over-segmentation. In contrast, the three Transformer-based networks, focusing on global multi-level feature fusion, tend to overlook edge details. This leads to over-segmentation when processing low-contrast images with ambiguous lesion boundaries. For high-contrast images, UCTransNet and UDTransNet, despite occasional boundary-related over-segmentation, outperform ET-Net and ConDseg-Net in lesion region segmentation. This advantage stems from the Transformer architecture’s strength in long-range dependency modeling, capturing remote semantic information through a global perspective and achieving more complete segmentation areas via multi-level fusion. Conversely, ET-Net and ConDseg-Net, lacking such long-range modeling capabilities, exhibit incomplete segmentation regions. U-Net, with its basic encoder–decoder structure and simple skip connections, demonstrates the weakest performance among all compared methods.
Our proposed model integrates an Edge Feature Enhancement Gating (EFG) Module to capture more complete edge features, combines a Channel-wise Cross-Fusion Transformer (CCFT) module to leverage long-range modeling for remote semantic information, and finally incorporates a Dual-Path Fusion (DPFM) module to emphasize critical features (including edge features) while reducing semantic gaps. As illustrated in the figure, our network produces segmentation results with both clear boundaries and complete regions, confirming the effectiveness of our approach.
4.3. Ablation Study
4.3.1. Analysis of Hyperparameter Settings
To identify the optimal hyperparameter configuration for our experimental setup, we conducted a series of systematic experiments on three distinct datasets: GlaS, ISIC2018, and Kvasir. We initially set α = β = 0.5 (i.e., equal weights for the Dice and Cross-Entropy terms), a common and reasonable strategy to maintain balanced contributions from both loss terms and prevent either from dominating the optimization process. Furthermore, since the edge feature enhancement module in our model already acquires relatively sufficient edge features, increasing the weight α further is not suitable. As shown in
Table 6, adjusting α away from 0.5 leads to decreased accuracy, confirming our choice of α = 0.5. Regarding the configuration of heads (H) and layers (L) in the CCFT module, these parameters were empirically initialized with the optimal values reported for UCTransNet. Experimental results demonstrate that reducing L does not negatively impact performance and even slightly improves segmentation accuracy. However, simultaneously decreasing both L and H causes noticeable performance degradation. Therefore, we ultimately set H = 4 and L = 2.
4.3.2. Comparison with Baselines
To validate the effectiveness of each component in our proposed algorithm, ablation studies were conducted on five datasets: GlaS, ISIC2017, ISIC2018, CVC-ClinicDB, and Kvasir. The U-Net architecture was adopted as our baseline model. We incrementally incorporated key components—the Cross-channel Fusion Transformer module, the Edge Feature Enhancement Gating module, and the Dual-path Fusion module—to evaluate their individual and combined impact on segmentation performance. As shown in
Table 7, both the Dice and IoU metrics improved progressively with the addition of each key component. Furthermore, representative samples from two types of datasets were selected for visual analysis, as illustrated in
Figure 6. Consistent with the selection criteria used in the segmentation result visualization, we included both high-contrast and low-contrast images from each dataset category. It can be observed that, regardless of whether the input image exhibits high or low environmental contrast, the segmentation results demonstrate noticeable enhancement in both boundary delineation and overall region consistency as each proposed module is successively integrated. These experimental results confirm that every module contributes positively to the performance. The sequential integration strategy not only effectively enhances the capability of boundary extraction and segmentation accuracy but also enables the network to segment medical images with higher precision.
5. Conclusions
To address the limitations of existing segmentation methods, this paper proposes a novel edge feature-enhanced dual-path fusion network for image segmentation. By incorporating an edge feature enhancement gating module, the model effectively captures comprehensive edge features from the early encoder stages, thereby providing richer edge information for subsequent feature fusion while significantly improving the model’s robustness in segmenting images with blurred boundaries. During the feature fusion stage, the multi-level features processed by the cross-channel fusion Transformer module, along with the corresponding decoder features, are fed into the dual-path fusion module. This module employs a dual-path design: the attention path enhances critical regions and suppresses background interference through channel and spatial attention mechanisms, effectively preventing the dilution of edge features in deep layers; the weighting path achieves precise fusion of multi-source features via adaptive weight allocation, significantly bridging the semantic gap. Ultimately, the synergistic combination of dual-path output and upsampling operations enables fine restoration of spatial resolution, resulting in more accurate segmentation outcomes. Experiments on six public datasets demonstrate that the proposed network exhibits exceptional capability in capturing complex boundary structures across various segmentation tasks, along with remarkable generalization performance. The method holds significant potential for early clinical practice and pathological analysis, providing reliable technical support for automated lesion detection, precise lesion measurement, and early disease diagnosis.
However, this study still has certain limitations. The adoption of Transformer architecture for skip connections considerably increases the model parameter count, leading to reduced computational efficiency. Therefore, developing lightweight models to improve the efficiency of medical image segmentation will be a key focus of our future research.