BFLE-Net: Boundary Feature Learning and Enhancement Network for Medical Image Segmentation

Jiale Fan; Liping Liu; Xinyang Yu

doi:10.3390/electronics14153054

,

and

¹

College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China

²

China Hebei Key Laboratory of Industrial Intelligent Perception, Tangshan 063210, China

³

College of Mechanical and Energy Engineering, Shanghai Technical Institute of Electronics Information, Shanghai 201411, China

^*

Author to whom correspondence should be addressed.

Electronics2025, 14(15), 3054;https://doi.org/10.3390/electronics14153054

This article belongs to the Topic Applications of Image and Video Processing in Medical Imaging

Version Notes

Order Reprints

Abstract

Multi-organ medical image segmentation is essential for accurate clinical diagnosis, effective treatment planning, and reliable prognosis, yet it remains challenging due to complex backgrounds, irrelevant noise, unclear organ boundaries, and wide variations in organ size. To address these challenges, the boundary feature learning and enhancement network is proposed. This model integrates a dedicated boundary learning module combined with an auxiliary loss function to strengthen the semantic correlations between boundary pixels and regional features, thus reducing category mis-segmentation. Additionally, channel and positional compound attention mechanisms are employed to selectively filter features and minimize background interference. To further enhance multi-scale representation capabilities, the dynamic scale-aware context module dynamically selects and fuses multi-scale features, significantly improving the model’s adaptability. The model achieves average Dice similarity coefficients of 81.67% on synapse and 90.55% on ACDC datasets, outperforming state-of-the-art methods. This network significantly improves segmentation by emphasizing boundary accuracy, noise reduction, and multi-scale adaptability, enhancing clinical diagnostics and treatment planning.

Keywords:

deep learning; medical image segmentation; transformer; multi-scale features; boundary optimization; feature fusion

1. Introduction

Medical image segmentation aims to accurately differentiate organ or lesion structures in images, providing key anatomical references for clinical diagnosis and treatment [1]. The current clinical reliance on manual visual inspection has inherent limitations, such as low efficiency and high subjectivity, while the complexity of medical images further exacerbates the challenges of automated segmentation. For example, computed tomography (CT) [2] is limited by noise and radiation dose, while there are sequence differences and motion artifacts in magnetic resonance imaging (MRI). Various types of modalities generally face problems [3] such as blurred tissue boundaries, overlapping organ dynamics topology, and heterogeneous pathological structures, leading to more difficult modeling of semantic relationships between target and background, as well as between target and target.

Deep learning has revolutionized the field of image segmentation, especially through convolutional neural networks (CNNs), such as U-Net [4] and its improved models [5,6,7,8,9], which have achieved accurate parsing of medical images in hierarchical feature extraction through a unique encoder–decoder structure and jump-connection mechanism, and have demonstrated significant advantages in capturing complex structures. The model performance has been limited by this type of U-shaped network, as it also ignores the rich spatial position information included in the shallow network features and loses local detail information in the encoding stage. Moreover, it is difficult to model long-range contextual dependency due to the limited range of the sensory field. Unlike CNNs, the Transformer [10] can capture global information through its self-attention mechanism and exhibits strong generalization. It has been quickly introduced into the field of medical image segmentation. Transformer-based models [11,12] appear to deal with targets of different sizes in multi-organ medical images by modeling global dependencies. Nevertheless, they lack detailed depictions of local features. Many studies have proposed hybrid architectures that combine Transformers with CNNs to fully utilize their advantages and encode target features through various hybrid approaches [13,14,15]. However, these methods still have shortcomings in structural consistency modeling, and spatial semantic correlation mining of multi-scale features limits segmentation performance enhancement in complex medical scenarios.

Fine-grained segmentation of anatomical structures or lesion regions relies on accurate boundary segmentation, and current approaches have advanced somewhat through methods such as boundary-related prior knowledge [16], geometric regularization [17], and boundary point prediction [18]. In addition, the uncertainty quantization-based approach [19] leverages probabilistic modeling to evaluate boundary confidence, showing a unique advantage in weak contrast areas. Wang et al. [20] proposed BA-Net to extract boundary features with multi-granularity information using the pyramid edge extraction module, allowing the network to more effectively understand and handle the target object’s edge region. FRBNet [21] was proposed to enhance segmentation accuracy by complementarily fusing the rough prediction map and boundary features through the boundary detection and feedback refinement module. A boundary and mask guidance module and a boundary structure perception module are combined in CMR-BENet [22] to boost network sensitivity to target boundaries and learn the connection between target boundaries and structures. To improve the accuracy of local fuzzy information evaluation, Chen et al. [23] presented LD-UNet, which makes use of both global and local long-range sensing modules. The experimental performance in border fuzzy data is outstanding. However, there are still two major issues with the current approach, as follows: first, in severe boundary blur scenarios, it is hard to accurately capture morphological asymptotic information using the traditional boundary feature representation; second, it does not effectively use the boundary region’s inherent uncertainty features to guide the segmentation decision, which leads to limited localization accuracy of important anatomical targets.

Inspired by the considerations mentioned above, this study improves multi-organ segmentation performance on three levels, as follows: boundary structure identification, multi-scale feature fusion, and redundant interference removal. For multi-organ image segmentation, a boundary feature learning and enhancement network (BFLE-Net) is proposed. To utilize the convolutional layer and self-attention mechanism, the encoder of the established BFLE-Net serially fuses Transformer and Res2Net-50. The model also includes three essential modules for interference filtering, fuzzy boundary refinement, and target localization to effectively extract feature information from fuzzy medical images—the dynamic scale-aware context module (DSCM), the boundary learning module (BLM), and the channel and position compound attention (CPCA). To provide richer feature representations, the CPCA filters out noisy interferences and makes up for the information lost in higher-level features as a result of downsampling. To enhance the model’s capacity to discriminate between fuzzy border regions, BLM employs pixel-level confidence analysis to help the model learn the boundary details and boosts the model to focus more on uncertain boundaries. DSCM achieves collaborative optimization of multi-scale feature information by investigating the inherent connection between parallel features from different sensory fields and better leveraging the complementarities between multiple levels of features. The main contributions of this paper are as follows:

We design a novel attentional coding mechanism, CPCA, which can enhance the feature extraction ability of the coded part and filter out the interference of noise in the shallow information.
We construct a novel boundary learning module, BLM, which guides the decoder to focus on the fine-grained learning of the boundary region through the confidence and uncertainty weighting mechanism and realizes the enhancement of target subjects and boundary detail optimization. We propose a module named the dynamic scale-aware context module, DSCM. Breaking through the limitation of a fixed perceptual field, it enhances the representation of multi-scale features through adaptive fusion of multi-scale contexts.
We propose a novel hybrid model that fully utilizes the potential of CNNs and Transformers, aiming to address the challenges posed by complex backgrounds, noisy interferences, large-scale variations, and ambiguous boundaries. The effectiveness of BFLE-Net for multi-organ segmentation is demonstrated through extensive experiments on datasets and qualitative and quantitative comparisons with state-of-the-art methods.

2. Related Works

2.1. CNN and Transformer Architectures

Deep learning-based methods have made great progress in the field of image processing. For segmenting specific targets in medical images, feature extraction through a unique U-shaped encoder and decoder structure is essential, as is the addition of a jump-connection structure to refine spatial information localization. Many U-Net versions have emerged to increase segmentation accuracy, including UNet++ [5], nnUNet [6], FCRB U-Net [7], NAG-Net [8], and MEF-UNet [9]. In order to facilitate the deep fusion of features at various levels, UNet++ redesigns the connection path between the encoder and decoder. This reduces the information loss that long-hopping connections in traditional U-Net cause while also maintaining a low computational complexity. By employing a selective feature extraction encoder and establishing information storage modules on the jump connections, MEF-UNet bridges the semantic gap between the feature maps. Information storage modules, which enable the network to better represent contextual information, capture detailed morphological aspects, and perform better in both quantitative and qualitative analyses.

Compared with CNNs, Transformers are able to capture remote dependencies through a self-attention mechanism. As a way to comprehend and reason about contextual interactions in various image regions, Swin-UNet [11] proposes a hierarchical framework. MissFormer [12] introduces a local–global attention mechanism; it models global dependencies and captures local information to handle targets of varying sizes in medical images. Many researchers have utilized them with CNNs in hybrid architectures to fully leverage their benefits. The first hybrid architecture, TransUNet [13], greatly improves the accuracy and robustness of medical image segmentation. It fuses the global context modeling advantage of Transformers with the local feature extraction power of CNNs. TranSiam [14], proposed by Li et al., extracts features from various modes through two parallel CNNs, and extracts global information at a higher level using Transformers, improving the quality of feature extraction for each modality. The dual-stream architecture of HIGF-Net [15], which was proposed by Wang et al., employs the CNN encoder and Transformer encoder in combination to mine the image’s shallow local spatial characteristics and deep global semantic information. However, it still has challenges processing images with fuzzy boundaries, target-independent noise, and complex backgrounds.

2.2. Feature Fusion in Medical Image Segmentation

In multi-organ image segmentation, single-scale features cannot meet the needs of multi-organ modeling, and multi-scale feature fusion is crucial for capturing contextual information. ST-UNet [24] enables cross-layer feature learning by integrating multi-scale and upsampled encoder features via a cross-layer feature enhancement module, while also effectively bridging the semantic gap between the encoding and decoding phases. GCFormer [25] was designed to integrate the pass-down module in the encoder to fuse multi-scale features and to use the cat module in the decoder to effectively connect different-scale features, achieving improved learning diversity and excellent segmentation results in multi-organ segmentation tasks. HM-Net [26] presents a hybrid pyramid attention module, which adaptively deepens shallow semantic information from both spatial and channel dimensions using multi-scale feature fusion, which reduces the semantic space between the decoder and encoder in jump connections. To obtain richer and more accurate features, Xu et al. propose FCSU-Net [27], which replaces the simple jump connections with the proposed multi-scale feature blocks and integrates the cross-dimensional self-attention mechanism to create multi-dimensional dependencies. While numerous fusion strategies have been proposed in previous studies, they all have failed to pay enough attention to feature interactions between layers. This is particularly true when using static fusion weights, which not only make it difficult to retain the full spatial structure information but also limit the model’s ability to adapt to complex noise conditions.

2.3. Boundary-Based Modeling

Many previous studies concentrated on boundary awareness in medical image segmentation tasks and then attempted to enhance the accuracy of identifying ambiguous pixels close to the boundary. BDNet [28] employs a boundary refinement module to determine the boundary point placements; the boundary points are redrawn to keep them from shifting. This aids the model in extracting crucial structured information. TBP-Net [29] adds a CBAM block after each stage’s upsampling, combining deep and shallow feature maps to focus on important information, particularly for shape and boundary difficulties in multi-organ segmentation. Through a multi-stage edge feature fusion module and a sparse dynamic encoder–decoder module that extracts global spatial semantic features, SDV-TUNet [30] achieves layer-by-layer incremental fusion of multi-stage local edge information. To enhance segmentation performance on fuzzy boundary targets, MEGA-Net [31] employs a boundary-guided attention module to preserve important high-frequency details, such as boundary information. For the purpose of enhancing the segmentation performance on multi-organ datasets, Yu et al. [32] presented a more effective 3D medical image segmentation approach that uses boundary-constrained knowledge distillation methods and attention-corrected features. However, most methods do not delve more deeply into the advantages of further using uncertainty quantification in semantic segmentation.

3. Method

3.1. Overall Architecture

The framework structure of BFLE-Net is shown in Figure 1; it is mainly composed of three parts—encoder, decoder, and hopping connection—and focuses on the design of the channel and position compound attention (CPCA), boundary learning module (BLM), and dynamic scale-aware context module (DSCM).

Figure 1. Architecture of BFLE-Net.

To take advantage of the convolutional layers and the self-attention mechanism, the encoder part serially fuses Res2Net-50 and the Transformer, and the Transformer part consists of a stack of 12 layers. Unlike TransUNet, which slices the token directly on the last convolutional feature map and feeds it to the Transformer, we designed the CPCA module before the Transformer layer in the encoder to preserve the shallow features extracted from the earlier coding layers and to filter out the interference of redundant information. Unlike simple decoding processes such as TransUNet, the decoder for this method dynamically mines the clues of fuzzy boundaries through the uncertainty-guided BLM module, and optimizes the semantic discriminative ability of fuzzy boundaries by combining the fine-grained global information fusion strategy. To achieve full-scale modeling from microscopic organizational features to macroscopic anatomical structure, the DSCM module is integrated into the jump-connected three-level feature transfer path. This allows for collaborative optimization of the dynamic calibration of multi-scale features, thereby creating a “local-global” feature optimization link. After training the optimization network to minimize the total primary and auxiliary losses, the network is optimized for convergence. The segmentation head then outputs the segmentation results.

3.2. Channel and Position Compound Attention

The CPCA module is designed before the Transformer layer of the original encoder part, which is compounded by the channel and positional attention in parallel in order to reduce the interference of redundant information in the encoding layer when extracting shallow features, and to preserve the intrinsic features of the input image. The specific structure is shown in Figure 2.

Figure 2. Architecture of the CPCA module.

On the one hand, the original input feature map is input into the positional attention module, and the input features are first transformed and augmented by convolution and batch normalization. The positional attention module then receives feature

F \in R^{B \times C / 8 \times H \times W}

as input. Three independent 1 × 1 convolution layers are used to obtain the three tensors,

Q

,

K

,

V

. The spatial dimension

(B, C, H, W)

of the

Q

and

K

tensors is then reshaped to

(B, C, H \times W)

to obtain

Q_{f}

and

K_{f}

. The attention weight matrix

A_{i j} \in R^{B \times (H \times W) \times (H \times W)}

is obtained by calculating and normalizing the dot products of

Q_{f}

and

K_{f}

:

A_{i j} = \frac{\exp (Q_{f i} \cdot K_{f j})}{\sum_{i = 1}^{H \times W} \exp (Q_{f i} \cdot K_{f j})},

(1)

where

A_{i j}

represents the influence of the ith position on the jth position, with a higher value indicating a closer association between pixels. Next, the resulting attention weight matrix

A_{i j}

is matrix multiplied by the spread tensor

V_{f}

. To preserve information of the original features and prevent the attention mechanism from over-modifying the features, a learnable hyperparameter

α

is introduced as the weight in the matrix multiplication, and the obtained results are restored to the original spatial dimensions. Finally, they are weighted and output with the feature

F

after the residual linkage to obtain

F_{1}

, which has the following expression:

F_{1} = α \sum_{i = 1}^{H \times W} (A_{i j} \cdot V_{f i}) + F .

(2)

On the other hand, the feature

F

is input to the channel attention module, and the feature

F

is directlys used as the tensor

Q_{1}

,

K_{1}

,

V_{1}

, and its dimension is reshaped to

(B, C, H \times W)

. Then the dot product of the reshaped

Q_{1 f}

and

K_{1 f}

is calculated, and the result is Softmax-normalized to obtain the attention weight matrix

B_{i j}

, where

B_{i j}

denotes the effect of the ith channel on the jth channel, and then

B_{i j}

is matrix-multiplied with the

V_{1 f}

tensor to obtain multiplication, introducing the learnable hyperparameter

β

, restoring the result to the original spatial dimensions. Finally, it is weighted with the feature

F

through residual linkage to produce the output

F_{2}

. Finally, the positional attention module output

F_{1}

and the channel attention module output

F_{2}

are summed element-by-element through the convolutional layers, and the fused features are translated to the output space through the last convolutional layer to generate the final

F_{Output}

. The designed CPCA module is able to refine the compressed high-level features to filter out noise and other redundant information interference, which helps the model to better understand the global structure in the image and generate richer feature representations.

3.3. Boundary Learning Module

Fuzzy boundaries are characteristics of medical images that make it challenging for the model to accurately locate the target bounds. For the purpose of creating a pixel-level uncertainty map that guides the model to focus on the fine-grained learning of the boundary region and improves the network’s capacity to distinguish fuzzy boundaries, the boundary learning module combines data from the encoder and jump connections. During that time, to improve the model’s robustness in boundary modeling, the intermediate prediction results are limited early in the training process by the auxiliary supervision mechanism. The structure is shown in Figure 3.

Figure 3. Architecture of the BLM module.

The module receives the global context features

f_{t}

output after 12 layers of the Transformer of the encoder and the local spatial detail features

f_{c}

from the shallow CNN of the encoder. The two types of features are spliced together and then passed through three parallel convolutional branches with kernels of 1 × 1, 3 × 3, and 5 × 5, which integrate the target location information in

f_{t}

and the spatial detail information in

f_{c}

, to obtain the locally augmented feature map

f_{L}

:

f_{i n} = Concat (f_{t}, f_{c}),

(3)

f_{L} = Concat ({Conv}_{1 \times 1} (f_{i n}), {Conv}_{3 \times 3} (f_{i n}) {, Conv}_{5 \times 5} (f_{i n})),

(4)

The

f_{L}

is then mapped to the number of categories

C

by convolution of the feature map, and the activation function is used to compute the category probabilities for each pixel location to obtain a multi-category preliminary prediction map

{P_{l}}^{'} \in R^{H_{1} \times W_{1} \times C}

, and a confidence map

P_{conf}

is computed for each pixel location

(h, w)

:

{P_{l}}^{'} = Softmax ({Conv}_{1 \times 1} (f_{L})),

(5)

P_{conf} (h, w) = \max {P_{l}}^{'} (h, w, c),

(6)

where

l

denotes the number of layers. The confidence map reflects the “confidence level” of the model in pixel classification, which is used to strengthen the feature representation of explicit regions; then the uncertainty map

P_{u n}

is defined based on the normalized information entropy, and its expression is as follows:

P_{u n} (h, w) = - \frac{1}{\log C} \sum_{c = 1}^{C} {P_{l}}^{'} (h, w, c) \cdot \log {P_{l}}^{'} (h, w, c),

(7)

The normalized entropy value can reflect the degree of uncertainty more intuitively—a larger value indicates that the classification of the position is more uncertain, which guides the model to pay more attention to the boundary region. Under the condition of retaining the original features, the confidence feature of the target region and the uncertainty feature of the boundary region are jointly used as the guidance information,

P_{conf}

ensures the consistency of segmentation of the target subject, and

P_{u n}

drives the refinement of the boundary contour; in order to avoid over-amplification of the feature response, the learnable parameter

α, β

is introduced, which is optimized by gradient descent and participates in the subsequent feature fusion process:

f_{out} = α \cdot (P_{conf} ⊙ f_{L}) + β \cdot (P_{u n} ⊙ f_{L}) + f_{L}, α + β \leq 1 .

(8)

In the boundary learning module (BLM), the uncertainty map assumes the role of a “difficulty marker”, whose central role is to dynamically adjust the intensity of the network’s attention to different pixels. First, the network implicitly compares the consistency between the predicted probability distributions and the truth labels, which produces relatively high uncertainty values in the boundary fuzzy regions. Next, this information is used to implement a targeted weight mapping on the deep semantic feature map, specifically, pixel locations with high uncertainty are assigned greater gradient weights via the Hadamard product operation. Both strengthen the model’s semantic feature response to high-confidence regions (e.g., organ interiors) by focusing on the consistent representation of the target subject, while enhancing the model’s ability to learn the details of uncertain regions (e.g., organ boundary regions). They also optimize the prediction results of the fuzzy regions and enable focused modeling and fine-tuning of the boundary fuzzy regions.

3.4. Dynamic Scale-Aware Context Module

The complexity of medical images is reflected in the large target scale differences and significant intra-class morphological differences. In this paper, we design a hierarchical embedding of the DSCM module and construct an adaptive multi-scale feature fusion mechanism to achieve dynamic calibration of local multi-scale contextual information while preserving the advantages of the Transformer’s global modeling, which significantly improves robustness in complex medical scenarios; the structure of the module is shown in Figure 4.

Figure 4. Architecture of the DSCM module.

First, a scale-continuous receptive field network is established through multi-scale feature sensing. In order to capture the target multi-scale features without introducing too much extra computation, the feature map

F_{i n}

is processed in parallel by three sets of parameter-sharing dilated convolutions, which generate the multi-scale features

M_{a}

,

M_{b}

,

M_{c}

.

M_{x} = {Conv}_{3 \times 3}^{d = n} (F_{i n}), (x = a, b, c; n = 1, 2, 4),

(9)

M_{i j} = Concat (M_{i}, M_{j}),

(10)

where i, j denote the feature maps of neighboring scales. By introducing the parameter-sharing strategy, the multi-scale generalized feature representation of network learning targets is achieved while reducing computational redundancy, and the generalization ability to unknown-scale targets can also be enhanced.

The spliced features are processed by dual paths to achieve context-aware scale selection, and in the spatial excitation path, the adjacent scale features are subjected to channel compression and spatial excitation operations to generate the spatial attention graph

Q

, which realizes the dynamic allocation of feature weights:

Q = σ (Conv (GAP (M_{i j}))) ⊙ M_{i j},

(11)

where

GAP (\cdot)

denotes global mean pooling,

σ

is the sigmoid function, and

⊙

is element-by-element multiplication. In the adaptive fusion path, the dynamic fusion weights

W_{i}

and

W_{j}

are learned through the gating mechanism, and the cross-scale feature representation

K

is obtained from the channel-by-channel spatial weight multiplication, which preserves the spatial details of the original features and dynamically focuses on the target-relevant scale information through the weights. Then the dual-path features are integrated by residual concatenation, and finally, the fused features of all scale pairs are aggregated by hierarchical contextual integration.

\{\begin{cases} W_{i}, W_{j} = Softmax (Conv (ϕ (Conv (M_{i j})))) \\ K = W_{i} ⊙ M_{i} + W_{j} ⊙ M_{j} \\ {M_{i j}}^{'} = Q + K \\ F_{o u t} = ϕ (Conv ({M_{a b}}^{'} + {M_{a b c}}^{'} + F_{i n})) . \end{cases}

(12)

where

ϕ

is the BN-RELU activation layer, and

W_{i}

,

W_{j}

are the spatial perception weight vectors. Different from the ASPP module with a fixed expansion rate, DSCM is able to realize the dynamic expansion of the sensory field through the parameter-sharing mechanism to capture the contextual information of different scales, solve the scale sensitivity problem caused by the traditional method due to the fixed sensory field, reduce the risk of over-segmentation, and enable the model to better utilize the complementarity between the features of different levels to generate more accurate segmentation results.

3.5. Loss Function

We optimize BFLE-Net using a composite loss function that consists of cross-entropy loss weighting and Dice loss. The total loss is a combination of the primary and auxiliary losses. The main loss is intended to reduce the discrepancy between the final segmentation maps and the ground truth. The auxiliary loss is incorporated in the preliminary prediction maps to guide the network’s attention on the target region and improve its capacity to refine boundaries. The following is the definition of the total loss:

L_{s e g} = α \cdot L_{C E} + (1 - α) \cdot L_{D i c e},

(13)

L_{aux} = \sum_{l = 1}^{3} L_{s e g} ({P_{l}}^{'}, Y_{l, GT}),

(14)

L_{total} = λ \cdot L_{seg} (P, Y_{GT}) + β \cdot L_{aux} .

(15)

where

λ

and

β

are hyperparameters balancing the primary and auxiliary losses,

α

controls the cross-entropy and Dice loss weights;

P

and

Y_{GT}

denote the final segmentation map and Ground Truth;

{P_{l}}^{'}

and

Y_{l, GT}

denote the preliminary prediction map and downsampling labels.

4. Experiments and Discussion

4.1. Datasets and Evaluation Indicators

To evaluate the performance of the proposed approach, we use two publicly accessible medical image segmentation datasets—the Synapse dataset and the ACDC dataset. See Table 1 and Figure 5 for details.

Table 1. A brief introduction to the datasets.

Figure 5. Pictures of the selected datasets are shown.

Here are 30 abdominal CT scans in the Synapse dataset. Annotations of eight organs—liver, pancreas, spleen, stomach, right and left kidneys, aorta, and gallbladder—are included in the dataset, which consists of 3779 axial slices of abdominal CT scans. Every slice is 512 × 512 pixels in size. In this case, with reference to the TransUNet literature, the slices from 18 cases were used as the training set, and the remaining 12 cases were reserved for testing and evaluation.

Each of the 100 sets of cardiac magnetic resonance images in the ACDC dataset was annotated by a specialist with knowledge of the left atrium, right ventricle, and myocardium. In this investigation, we adhered to the data partitioning approach utilized in TransUnet, segregating the dataset into 70 cases for training, 10 for validation, and 20 for testing.

The experiments employ the Dice similarity coefficient (DSC) and 95% Hausdorff distance (HD95) as quantitative assessment metrics to fully evaluate the test results of the proposed model on the dataset.

4.2. Experimental Environment and Parameter Settings

The network employs a CNN–Transformer cascade architecture as the encoder. The CNN encoder is based on the Res2Net-50 backbone, while the Transformer component consists of a 12-layer stack. To enhance both convergence speed and segmentation performance, pre-trained weights from ImageNet-21k are utilized for both components. All experiments were conducted on a Linux (Ubuntu 20.04) operating system using a single RTX 3090 (24 GB) GPU, implemented with PyTorch 1.11.0, Python 3.8, and CUDA 11.3. The input images were resized to a consistent resolution of 224 × 224. Data augmentation techniques, including random rotation and flipping, were applied to improve the model’s generalization ability. During training, a batch size of 16 was used, with a total of 150 epochs. Model parameters were optimized using the SGD optimizer, with an initial learning rate of 0.01, momentum set to 0.9, and weight decay of 0.0001.

4.3. Results

4.3.1. Quantitative Experimental Analysis

To evaluate the accuracy of the BFLE-Net network in medical image segmentation tasks, a comprehensive comparison experiment is conducted between the proposed method and several state-of-the-art models on the Synapse multi-organ segmentation dataset and the ACDC dataset. The experimental results are presented in Table 2 and Table 3, where underlined values represent suboptimal results and bolded values denote the best results. As shown in Table 2, BFLE-Net achieves optimal performance across the overall metrics, with an average DSC of 81.67%. Additionally, the average HD95 metric of BFLE-Net significantly outperforms other models, achieving a performance of 21.67 mm. For organ-specific segmentation tasks, BFLE-Net excels in the pancreas segmentation task, achieving an average DSC of 66.30%, which is 5.70 percentage points higher than the next best-performing model, TBP-Net. In the spleen segmentation task, BFLE-Net attains an average DSC of 91.36%, outperforming all other models. Similarly, for the segmentation of the left and right kidneys, BFLE-Net achieves average DSC values of 84.54% and 81.31%, respectively, maintaining high segmentation accuracy. These results comprehensively demonstrate the effectiveness of the proposed method in addressing the challenge of small target organ segmentation.

Table 2. Comparison of algorithm performance on the Synapse dataset.

Table 3. Quantitative tests on the ACDC dataset.

As shown in Table 3, in the ACDC segmentation task, BFLE-Net demonstrates strong performance in maintaining a balanced segmentation accuracy across all structures, with an average DSC of 90.55%. This performance is 0.84 percentage points higher than the second-best model, TransUNet. Additionally, in the myocardial (MYO) segmentation task, BFLE-Net achieves an average DSC of 88.67%, significantly outperforming other models. These results highlight the superiority of BFLE-Net in handling thin-walled tissue segmentation tasks. Overall, the experimental findings provide strong evidence of the robustness and generalization capability of the proposed method in addressing complex medical image segmentation tasks.

4.3.2. Qualitative Experimental Analysis

To further validate the segmentation performance of BFLE-Net, this section presents a systematic visualization and comparison experiment conducted on the Synapse and ACDC datasets. By visualizing the segmentation results, the advantages of BFLE-Net in handling complex anatomical structures are thoroughly analyzed.

In this study, the performance of various segmentation methods, including U-Net, MT-UNet, TransUNet, and BFLE-Net, is systematically compared on the Synapse and ACDC datasets. The experimental results reveal significant performance differences among the methods in segmenting complex anatomical structures. As shown in Figure 6, in the multi-organ segmentation task on the Synapse dataset, both U-Net and MT-UNet suffer from considerable over-segmentation, particularly in the stomach region, where organ misclassification occurs. More notably, MT-UNet and TransUNet exhibit substantial topology confusion in the segmentation of the left and right kidneys, as well as inaccuracies in localizing the spatial relationship between the gallbladder and the liver. These issues highlight that current methods still face significant challenges in incorporating anatomical prior knowledge and modeling 3D spatial context dependencies for multiple organs. In contrast, the method proposed in this study substantially improves the completeness of segmentation and effectively addresses the problem of semantic confusion between organs, particularly in distinguishing neighboring organs (e.g., the liver and gallbladder) with similar grayscale features, thus demonstrating significant advantages.

Figure 6. Qualitative analysis results on the Synapse dataset. Note: The green box in the figure indicates the location where the effect of segmentation is insufficient.

As shown in Figure 7, the existing methods also show obvious limitations in the right ventricle segmentation task in the ACDC dataset. Both TransUNet and Swin-UNet struggle with missing the right ventricle structure. By adopting a dynamic feature calibration mechanism, the method in this study completely preserves the fine anatomical structural features of the right ventricle and significantly optimizes boundary continuity.

Figure 7. Qualitative analysis results on the ACDC dataset. Note: The green box in the figure indicates the location where the effect of segmentation is insufficient.

Through the above data comparison and visualization analysis, the experimental results show that the proposed CPCA, DSCM, and BLM modules synergistically significantly improve the model performance. The CPCA module effectively solves the redundancy problem of early feature information through shallow feature purification; the DSCM module makes multi-scale feature fusion efficient through three-level feature dynamic calibration, which significantly improves the segmentation performance of the fine structures (e.g., left kidney, right kidney, pancreas, and spleen). The boundary optimization strategy of the BLM module further enhances the boundary keeping performance of the model and enables fine segmentation of fuzzy boundary regions. Based on Table 4, our model has 43.62 M parameters and 23.95 G FLOPs on the Synapse dataset. While its parameter count is moderate compared to other models, it stands out with a lower FLOP value, indicating better computational efficiency. Performance-wise, it achieves a Dice score of 81.67%, outperforming many models such as AttenUNet and TransUNet. This shows that our model maintains high accuracy while being computationally efficient, striking a strong balance between performance and resource usage. Overall, compared to larger models like TransUNet, our model offers advantages in both parameter count and FLOPs without sacrificing accuracy, making it highly suitable for environments with limited computing resources.

Table 4. The number of parameters, calculation amount, and inference time of different methods are all tested on RTX 3090.

4.3.3. Ablation Experiment

To validate the effectiveness of the designed CPCA, BLM, and DSCM modules, a systematic ablation experiment was conducted. The performance contribution of each module to the overall model was quantitatively assessed by sequentially introducing each module. The detailed results of this analysis are presented in Table 5 and Figure 8 and Figure 9.

Table 5. Ablation experiments on the Synapse dataset.

Figure 8. Visualization results of ablation experiments on the Synapse dataset.

Figure 9. Class activation mapping of module combination results in ablation experiments on the Synapse dataset. Note: Red represents the areas of the most interest to the model, green represents the areas of greater interest, and blue represents the areas of less interest.

The experimental results reveal that the introduction of the CPCA module alone leads to an improvement of 2.17 percentage points in the average DSC and a reduction of 10.65 mm in HD95, enhancing the segmentation performance of small organs through shallow feature purification. The BLM module alone improves the average DSC by 1.81 percentage points and reduces HD95 by 7.74 mm. Its boundary optimization strategy effectively enhances the sharpness of the segmentation boundaries. The introduction of the DSCM module alone improves the average DSC by 0.62 percentage points and reduces HD95 by 0.13 mm, with its multi-scale feature calibration mechanism enhancing the model’s adaptability to the overall anatomical structure.

In terms of combined module effects, the combination of CPCA and BLM achieves an average DSC of 81.33%, indicating that the synergy between interference filtering, feature purity enhancement, and adaptive information supplementation strengthens boundary segmentation performance. The combination of BLM and DSCM results in an average DSC of 80.42%, with HD95 reduced to 25.16 mm, significantly enhancing the model’s performance through the synergy of multi-scale features and boundary optimization. This combination improves the model’s ability to capture both local and global features, as well as its robustness in segmentation tasks.

When combining CPCA and DSCM modules, the average DSC is slightly lower than that of CPCA alone but still higher than that of DSCM alone, with a significant decrease in HD95. This suggests a trade-off when combining the modules, playing a crucial role in further optimizing boundary errors and complementing the segmentation task. The experimental results show that as the model becomes more comprehensive and complete, it can focus more on important information while suppressing the representation of irrelevant features, thereby enhancing its ability to handle complex segmentation tasks. The synergistic effect of each module has significantly improved performance, demonstrating the overall model’s boundary retention and scale adaptability while maintaining segmentation accuracy, strengthening boundary prediction capabilities, and validating the effectiveness of this method.

To further validate the effectiveness of the proposed BLM module, we visualize the uncertainty map in the BLM of the last layer at the decoder end, visualizing the uncertainty map side by side with the predicted segmentation mask and ground truth. As shown in Figure 10, regions with high uncertainty values primarily appear at object boundaries or in ambiguous areas (such as weak edges or occluded structures), where pixel-level classification is inherently more challenging. These observations align with the design intent of BLM, which aims to allocate more attention to uncertain boundary regions. Based on the final results, the segmentation accuracy for the target boundary blurred areas achieved a relatively satisfactory effect.

Figure 10. Visualization of uncertainty map functions and their impact on network attention.

5. Discussion

5.1. Experiments on Module Layout and Functional Performance

This section describes the design and setup of the experiments, as well as the effect of each module combination on the segmentation performance. As shown in Table 6, we first show the experimental results under different configurations and analyze the synergistic effect of DSCM, CPCA, and ASPP modules. The model achieves an average DSC of 79.01% and an HD95 of 27.42 mm when only DSCM is used. This suggests that the DSCM module performs superiorly in enhancing the model’s ability to capture global semantic information, which effectively improves the overall accuracy of the segmentation and the boundary accuracy. However, the relatively high HD95 indicates that there is still room for further improvement of segmentation accuracy in the detailed part. Further, when the DSCM module is added between CNN and Transformer, and the CPCA module is added to the jump connection, the average DSC slightly decreases to 78.92%, but the HD95 significantly improves and decreases to 23.12 mm. This result indicates that although the introduction of the CPCA module leads to a slight decrease in the overall segmentation performance, it is effective in improving the segmentation boundary accuracy, reducing errors. The CPCA module optimizes the HD95 metrics by enhancing the ability to model local details and helping the model to more accurately deal with complex boundary sections. Moreover, when only two modules are added to the jump connection, although it is higher than the baseline model effect, the segmentation performance still decreases compared with other methods. This combination fails to complement the modules effectively, and insufficient interaction leads to a decrease in overall segmentation performance. When the ASPP module with a fixed dilation rate is used to replace the DSCM module, it does not adapt well to the target scale due to the limitations of the receptive field, and the segmentation performance is limited due to the lack of spatial position sensing. When using the methods in this paper, the CPCA module performs particularly well in dealing with detailed parts and complex boundaries, which strengthens the refinement ability of low-level features, and the DSCM improves the segmentation accuracy of details and boundaries by enhancing the ability to capture contextual information. The balanced performance of the model between global and local is further improved, which results in a significant improvement in segmentation accuracy and boundary accuracy.

Table 6. Analysis of DSCM and CPCA, and ASPP module integration in segmentation tasks.

5.2. Statistical Significance Analysis Experiments of Multiple Models

In this subsection, a comprehensive comparison and validation of the performance of the BFLE-Net model proposed in this paper with the Baseline, U-Net, MT-UNet, and the improved network Dual-Net in the medical image segmentation task are presented in Figure 11, focusing on distribution and statistical confidence, respectively, to illustrate the robustness and accuracy of BFLE-Net. The performance of the proposed BFLE-Net model is comprehensively compared and validated against the Baseline, U-Net, MT-UNet, and the improved network Dual-Net in the medical image segmentation task, thus illustrating the robustness and accuracy of BFLE-Net.

Figure 11. Comparison of the distribution of Dice coefficients on different model test datasets.

Figure 11 shows the Dice coefficient for each model in five independent replicate experiments. The horizontal axis represents Baseline, U-Net, MT-UNet, Dual-Net, and BFLE-Net in that order, while the vertical axis represents the Dice coefficients of the models. It can be seen that the overall position of the BFLE-Net box is significantly higher than that of the other four models, with its median located at about 81.5%, which is higher than that of Dual-Net by more than 1.5%, and higher than the medians of MT-UNet and Baseline by between 3% and 4%. More importantly, the height of the BFLE-Net box is the smallest, indicating that its results have the narrowest range of fluctuation and almost no significant outliers, whereas the other models have more obvious discrete or occasional anomalies, either broadly or narrowly. For example, the lowest value of U-Net falls to about 74.93%, with a difference of more than 1% between the upper and lower quartiles; MT-UNet is slightly better than Baseline in the average, but its maximum and minimum values range from about 80.08% to 76.26%, which also shows relatively large fluctuations in performance. These distributional features show that BFLE-Net not only has an absolute advantage in average performance but also exhibits far greater stability than the comparison models, thus proving its high robustness under different data slices and different random initialization conditions.

To assess the variability and significance of model performance, this study conducted five independent replicate experiments under identical hyperparameter settings for the overall average Dice score. The baseline model achieved average Dice scores of 77.59, 77.48, 77.41, 75.47, and 77.84 across the five experiments, with an overall mean of 77.16. The 95% confidence interval, calculated using Student’s t-distribution, was ±1.19 (n = 5). Correspondingly, the average Dice scores for BFLE-Net in the five experiments were 81.33, 81.67, 81.37, 82.05, and 81.93, with an overall mean of 81.67 and a 95% confidence interval of ±0.40 (n = 5). To test whether the difference in overall performance between the two models is statistically significant, we performed a paired t-test on the five paired samples, yielding t = 8.64 and an original p-value of 9.9 × 10⁻⁴; after Bonferroni correction, p_(corr) = 9.9 × 10⁻⁴ (<0.001). This result indicates that the overall Dice improvement of BFLE-Net compared to the baseline model is highly significant at the α = 0.05 level, and its narrower confidence interval also reflects superior stability and reliability. The detailed results are shown in Table 7.

Table 7. Paired t-test results and statistical analysis.

Combined with the above results, BFLE-Net not only leads in average segmentation accuracy but also minimizes the error fluctuation, proving its stronger generalization ability and reliability in multi-organ segmentation tasks. Therefore, this method satisfies the dual demands of high accuracy and high stability for clinical applications, and lays a solid foundation for subsequent large-scale validation and practical deployment.

6. Conclusions

In this paper, we propose a BFLE-Net model for medical image segmentation to enhance the model’s ability to represent fuzzy boundaries and capture detailed image information, addressing the issue of inaccurate boundary localization in target regions. The model utilizes a hybrid CNN and Transformer encoder, leveraging the strengths of both CNN and Transformer architectures. The CPCA module is designed to effectively reduce the interference of redundant information. Furthermore, confidence and uncertainty guidance methods are incorporated through the uncertainty-guided BLM module to improve the model’s ability to discriminate fuzzy boundaries and to achieve fine boundary segmentation. Additionally, the DSCM module is introduced to explore the intrinsic relationships between parallel features from different sensory fields, enabling the capture of contextual information across multiple scales.

To validate the effectiveness of the proposed approach, both quantitative and qualitative comparison experiments are conducted on publicly available datasets. The experimental results demonstrate that the proposed method performs excellently in medical image segmentation tasks, achieving advanced segmentation performance. It effectively addresses challenges related to fuzzy boundaries and multi-target semantic confusion, showcasing the model’s robustness and accuracy.

Our network shows promise for more accurate organ delineation in radiotherapy planning, surgical assessment, and disease monitoring, but it was only tested on two public datasets and may not generalize to other modalities or pathologies; moreover, its extra boundary/scale modules incur some computational cost. Future work will focus on domain adaptation, model compression, and uncertainty estimation to broaden applicability and speed up inference.

Author Contributions

Conceptualization, J.F.; methodology, J.F.; software, J.F.; validation, J.F.; formal analysis, J.F.; investigation, X.Y.; resources, L.L.; data curation, X.Y.; writing—original draft preparation, J.F.; writing—review and editing, L.L.; visualization, J.F.; supervision, L.L.; project administration, L.L.; funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study can be downloaded from the URLs on their official websites. The code is accessible from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CT	computed tomography
MRI	magnetic resonance imaging
CNN	convolutional neural network

References

Azad, R.; Kazerouni, A.; Heidari, M.; Aghdam, E.K.; Molaei, A.; Jia, Y.; Jose, A.; Roy, R.; Merhof, D. Advances in medical image analysis with vision transformers: A comprehensive review. Med. Image Anal. 2024, 91, 103000. [Google Scholar] [CrossRef] [PubMed]
Rokach, L.; Aperstein, Y.; Akselrod-Ballin, A. Deep active learning framework for chest-abdominal CT scans segmentation. Expert Syst. Appl. 2024, 263, 125522. [Google Scholar] [CrossRef]
Ramedani, M.; Moussavi, A.; Memhave, T.R.; Boretius, S. Deep learning-based automated segmentation of cardiac real-time MRI in non-human primates. Comput. Biol. Med. 2025, 189, 109894. [Google Scholar] [CrossRef] [PubMed]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Trans. Med. Imaging 2019, 39, 1856–1867. [Google Scholar] [CrossRef]
Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
Shu, X.; Gu, Y.; Zhang, X.; Hu, C.; Cheng, K. FCRB U-Net: A novel fully connected residual block U-Net for fetal cerebellum ultrasound image segmentation. Comput. Biol. Med. 2022, 148, 105693. [Google Scholar] [CrossRef]
Huang, Q.; Zhao, L.; Ren, G.; Wang, X.; Liu, C.; Wang, W. NAG-Net: Nested attention-guided learning for segmentation of carotid lumen-intima interface and media-adventitia interface. Comput. Biol. Med. 2023, 156, 106718. [Google Scholar] [CrossRef]
Xu, M.; Ma, Q.; Zhang, H.; Kong, D.; Zeng, T. MEF-UNet: An end-to-end ultrasound image segmentation algorithm based on multi-scale feature extraction and fusion. Comput. Med. Imaging Graph. 2024, 114, 102370. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Helly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar] [CrossRef]
Huang, X.; Deng, Z.; Li, D.; Yuan, X.; Fu, Y. Missformer: An effective transformer for 2d medical image segmentation. IEEE Trans. Med. Imaging 2022, 42, 1484–1494. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
Li, X.; Ma, S.; Xu, J.; Tang, J.; He, S.; Guo, F. TranSiam: Aggregating multi-modal visual features with locality for medical image segmentation. Expert Syst. Appl. 2024, 237, 121574. [Google Scholar] [CrossRef]
Wang, J.; Tian, S.; Yu, L.; Zhou, Z.; Wang, F.; Wang, Y. HIGF-Net: Hierarchical information-guided fusion network for polyp segmentation based on transformer and convolution feature learning. Comput. Biol. Med. 2023, 161, 107038. [Google Scholar] [CrossRef]
Wang, J.; Wei, L.; Wang, L.; Zhou, Q.; Zhu, L.; Qin, J. Boundary-aware transformers for skin lesion segmentation. In Proceedings of the Medical Image Computing and Computer–Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27–30 September 2021; Proceedings, Part I 24. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 206–216. [Google Scholar]
Qin, C.; Zheng, B.; Zeng, J.; Chen, Z.; Zhai, Y.; Genovese, A.; Piuri, V.; Scotti, F. Dynamically aggregating MLPs and CNNs for skin lesion segmentation with geometry regularization. Comput. Methods Programs Biomed. 2023, 238, 107601. [Google Scholar] [CrossRef] [PubMed]
Bi, H.; Cai, C.; Sun, J.; Jiang, Y.; Lu, G.; Shu, H.; Ni, X. BPAT-UNet: Boundary preserving assembled transformer UNet for ultrasound thyroid nodule segmentation. Comput. Methods Programs Biomed. 2023, 238, 107614. [Google Scholar] [CrossRef] [PubMed]
Jungo, A.; Reyes, M. Assessing reliability and challenges of uncertainty estimations for medical image segmentation. In Proceedings of the Medical Image Computing and Computer–Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; Proceedings, Part II 22. Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 48–56. [Google Scholar] [CrossRef]
Wang, R.; Chen, S.; Ji, C.; Fan, J.; Li, Y. Boundary-aware context neural network for medical image segmentation. Med. Image Anal. 2022, 78, 102395. [Google Scholar] [CrossRef] [PubMed]
Li, W.; Zeng, G.; Li, F.; Zhao, Y.; Zhang, H. FRBNet: Feedback refinement boundary network for semantic segmentation in breast ultrasound images. Biomed. Signal Process. Control 2023, 86, 105194. [Google Scholar] [CrossRef]
Yu, Q.; Ning, H.; Yang, J.; Li, C.; Qi, Y.; Qu, M.; Li, H.; Sun, S.; Cao, P.; Feng, C. CMR-BENet: A confidence map refinement boundary enhancement network for left ventricular myocardium segmentation. Comput. Methods Programs Biomed. 2025, 260, 108544. [Google Scholar] [CrossRef]
Chen, S.; Luo, C.; Liu, S.; Li, H.; Liu, Y.; Zhou, H.; Liu, L.; Chen, H. LD-UNet: A long-distance perceptual model for segmentation of blurred boundaries in medical images. Comput. Biol. Med. 2024, 171, 108120. [Google Scholar] [CrossRef]
Zhang, J.; Qin, Q.; Ye, Q.; Ruan, T. ST-unet: Swin transformer boosted U-net with cross-layer feature enhancement for medical image segmentation. Comput. Biol. Med. 2023, 153, 106516. [Google Scholar] [CrossRef]
Feng, Y.; Cong, Y.; Xing, S.; Wang, H.; Ren, Z.; Zhang, X. GCFormer: Multi-scale feature plays a crucial role in medical images segmentation. Knowl.-Based Syst. 2024, 300, 112170. [Google Scholar] [CrossRef]
Zhao, G.; Zhu, X.; Wang, X.; Yan, F. HM-Net: Hybrid multi-scale cross-order fusion network for medical image segmentation. Biomed. Signal Process. Control 2024, 98, 106658. [Google Scholar] [CrossRef]
Xu, S.; Chen, Y.; Yang, S.; Zhang, X.; Sun, F. FCSU-Net: A novel full-scale Cross-dimension Self-attention U-Net with collaborative fusion of multi-scale feature for medical image segmentation. Comput. Biol. Med. 2024, 180, 108947. [Google Scholar] [CrossRef] [PubMed]
Huang, Q.; Jia, L.; Ren, G.; Wang, X.; Liu, C. Extraction of vascular wall in carotid ultrasound via a novel boundary-delineation network. Eng. Appl. Artif. Intell. 2023, 121, 106069. [Google Scholar] [CrossRef]
Li, B.; Li, P.; Liu, Z.; Wang, B.; Li, C.; Guo, X.; Wang, J.; Li, W. A Tri-Path boundary preserving network via Context-aware aggregator for multi-organ medical image segmentation. Biomed. Signal Process. Control 2025, 104, 107593. [Google Scholar] [CrossRef]
Zhu, Z.; Sun, M.; Qi, G.; Li, Y.; Gao, X.; Liu, Y. Sparse dynamic volume TransUNet with multi-level edge fusion for brain tumor segmentation. Comput. Biol. Med. 2024, 172, 108284. [Google Scholar] [CrossRef]
Bui, N.T.; Hoang, D.H.; Nguyen, Q.T.; Tran, M.; Le, N. Meganet: Multi-scale edge-guided attention network for weak boundary polyp segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2024; pp. 7985–7994. [Google Scholar] [CrossRef]
Yu, X.; Teng, L.; Zhang, D.; Zheng, J.; Chen, H. Attention correction feature and boundary constraint knowledge distillation for efficient 3D medical image segmentation. Expert Syst. Appl. 2025, 262, 125670. [Google Scholar] [CrossRef]
Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.; Kainz, B. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
Wang, H.; Xie, S.; Lin, L.; Iwamoto, Y.; Han, H.; Chen, W.; Tong, R. Mixed transformer u-net for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 2390–2394. [Google Scholar]
Wang, H.; Cao, P.; Wang, J.; Zaiane, O. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 2441–2449. [Google Scholar] [CrossRef]
Yuan, F.; Zhang, Z.; Fang, Z. An effective CNN and Transformer complementary network for medical image segmentation. Pattern Recognit. 2023, 136, 109228. [Google Scholar] [CrossRef]
Yuan, F.; Tang, Z.; Wang, C.; Huang, Q.; Shi, J. A multiple gated boosting network for multi-organ medical image segmentation. IET Image Process. 2023, 17, 3028–3039. [Google Scholar] [CrossRef]
Bao, H.; Zhu, Y.; Li, Q. Hybrid-scale contextual fusion network for medical image segmentation. Comput. Biol. Med. 2023, 152, 106439. [Google Scholar] [CrossRef]
Tan, D.; Hao, R.; Zhou, X.; Xia, J.; Su, Y.; Zheng, C. A Novel Skip-Connection Strategy by Fusing Spatial and Channel Wise Features for Multi-Region Medical Image Segmentation. IEEE J. Biomed. Health Inform. 2024, 28, 2168–2194. [Google Scholar] [CrossRef] [PubMed]
Xu, J. Hc-mamba: Vision mamba with hybrid convolutional techniques for medical image segmentation. arXiv 2024, arXiv:2405.05007. [Google Scholar]
Hong, Z.; Chen, M.; Hu, W.; Yan, S.; Qu, A.; Chen, L.; Chen, J. Dual encoder network with transformer-CNN for multi-organ segmentation. Med. Biol. Eng. Comput. 2023, 61, 661–671. [Google Scholar] [CrossRef] [PubMed]
Manzari, O.N.; Kaleybar, J.M.; Saadat, H.; Maleki, S. BEFUnet: A hybrid CNN-transformer architecture for precise medical image segmentation. arXiv 2024, arXiv:2402.08793. [Google Scholar]
Chen, Y.; Lu, X.; Xie, Q. ATFormer: Advanced transformer for medical image segmentation. Biomed. Signal Process. Control 2023, 85, 105079. [Google Scholar] [CrossRef]
Gao, Y.; Zhang, S.; Shi, L.; Zhao, G.; Shi, Y. Collaborative transformer U-shaped network for medical image segmentation. Appl. Soft Comput. 2025, 173, 112841. [Google Scholar] [CrossRef]
Li, Z.; Zheng, Y.; Shan, D.; Yang, S.; Li, Q.; Wang, B.; Zhang, Y.; Hong, Q.; Shen, D. Scribformer: Transformer makes cnn work better for scribble-based medical image segmentation. IEEE Trans. Med. Imaging 2024, 43, 2254–2265. [Google Scholar] [CrossRef]
Islam, M.R.; Qaraqe, M.; Serpedin, E. CoST-UNet: Convolution and swin transformer based deep learning architecture for cardiac segmentation. Biomed. Signal Process. Control 2024, 96, 106633. [Google Scholar] [CrossRef]
Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6202–6212. [Google Scholar]

Figure 1. Architecture of BFLE-Net.

Figure 2. Architecture of the CPCA module.

Figure 3. Architecture of the BLM module.

Figure 4. Architecture of the DSCM module.

Figure 5. Pictures of the selected datasets are shown.

Figure 6. Qualitative analysis results on the Synapse dataset. Note: The green box in the figure indicates the location where the effect of segmentation is insufficient.

Figure 7. Qualitative analysis results on the ACDC dataset. Note: The green box in the figure indicates the location where the effect of segmentation is insufficient.

Figure 8. Visualization results of ablation experiments on the Synapse dataset.

Figure 9. Class activation mapping of module combination results in ablation experiments on the Synapse dataset. Note: Red represents the areas of the most interest to the model, green represents the areas of greater interest, and blue represents the areas of less interest.

Figure 10. Visualization of uncertainty map functions and their impact on network attention.

Figure 11. Comparison of the distribution of Dice coefficients on different model test datasets.

Table 1. A brief introduction to the datasets.

Attribute	Synapse Multi-Organ CT Segmentation Dataset	ACDC Cardiac MRI Segmentation Dataset
Full Name	MICCAI 2015 Multi-atlas Abdomen Labeling Challenge	Automated Cardiac Diagnosis Challenge
Imaging Modality	Computed Tomography (CT)	Cine Magnetic Resonance Imaging (cine-MRI)
Anatomical Region	Abdomen	Heart
Slices/Frames	3779 2D slices, 512 × 512 pixels	Approx. 9–35 frames per case (short-axis sections), totaling approx. 1900 frames
Slice Thickness	1.5–2.5 mm	5–10 mm
Segmentation Labels	8 organs: aorta, gallbladder, liver, pancreas, left/right kidney, spleen, stomach	3 cardiac chamber structures: left ventricle (LV), right ventricle (RV), left ventricular myocardium (Myo)
Data Format	NIfTI (.nii.gz)	NIfTI (.nii.gz)

Table 2. Comparison of algorithm performance on the Synapse dataset.

Model	Avg		DSC (%)
Model	DSC (%)	HD95 (mm)	Aor	Gal	Kid (L)	Kid (R)	Liv	Pan	Spl	Stom
U-Net [4]	76.85	39.70	89.07	69.72	77.77	68.60	93.43	53.98	86.67	75.58
TransUNet [13]	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
TBP-Net [29]	80.15	26.76	87.75	65.31	83.61	79.59	95.10	60.60	88.17	81.11
AttenUNet [33]	77.77	36.02	89.55	68.88	77.98	71.11	93.57	58.04	87.30	75.75
MT-UNet [34]	78.59	26.59	87.92	64.99	81.47	77.29	93.06	59.46	87.75	76.81
UCTransUNet [35]	78.23	26.75	88.86	66.97	80.18	73.17	93.16	56.22	87.84	79.43
CTC-Net [36]	78.41	22.52	86.46	63.53	83.71	80.79	93.78	59.73	86.87	72.39
MGB-Net [37]	80.66	-	88.49	67.94	80.87	79.91	94.79	62.77	89.50	80.96
Hybrid-Net [38]	79.77	24.43	88.41	68.99	84.05	75.69	94.66	57.97	90.15	78.25
FSCA-Net [39]	79.12	29.10	87.90	67.60	83.65	79.05	93.80	59.40	86.85	74.70
HC-Mamba [40]	79.58	26.34	89.93	67.65	84.57	78.27	95.38	52.08	89.49	79.84
Dual-Net [41]	80.68	25.46	86.98	70.07	84.40	81.20	94.34	61.06	89.41	77.96
BFLE-Net	81.67	21.67	87.72	66.85	84.54	81.31	95.03	66.30	91.36	80.23

Note: Underlined values in the table indicate suboptimal results and bolded values indicate optimal results.

Table 3. Quantitative tests on the ACDC dataset.

Model	Avg DSC (%)	RV	MYO	LV
Swin-UNet [11]	88.07	85.77	84.42	94.03
R50U-Net [13]	87.55	87.10	80.63	94.92
TransUNet [13]	89.71	88.86	84.53	95.73
BEFUNet [42]	87.04	82.75	85.98	92.39
ATFormer [43]	87.38	86.51	84.99	90.63
CoTransUNet [44]	89.08	90.98	84.47	91.79
ScribFormer [45]	88.53	86.86	87.08	91.64
CoST-UNet [46]	86.50	87.30	82.61	89.50
HiFormer [47]	89.68	87.49	86.55	94.99
MISSFormer [12]	89.47	87.73	87.51	93.16
BFLE-Net (our)	90.55	89.00	88.67	93.98

Note: Underlined values in the table indicate suboptimal results and bolded values indicate optimal results.

Table 4. The number of parameters, calculation amount, and inference time of different methods are all tested on RTX 3090.

Methods	U-Net	TransUNet	Swin-UNet	Atten-UNet	MISSFormer	DA-TransUNet	BFLE-Net
Params (M)	39.4	105.28	40.67	34.9	32.46	97.52	43.62
Inference times (ms)	6.37	21.24	9.02	85.01	21.39	61.94	37.60
FLOPs (G)	61.85	38.73	13.95	101.8	37.28	110.34	23.95

Table 5. Ablation experiments on the Synapse dataset.

Baseline	CPCA	BLM	DSCM	DSC (%)	HD95 (mm)
√				77.48	31.69
√	√			79.64	23.31
√		√		79.29	23.95
√			√	78.10	31.56
√	√	√		81.33	24.47
√	√		√	79.06	21.44
√		√	√	80.42	25.16
√	√	√	√	81.67	21.67

Table 6. Analysis of DSCM and CPCA, and ASPP module integration in segmentation tasks.

Between CNN and Transformer	Skip Connection	Avg DSC% (%)	HD95 (mm)
-	ASPP	76.88	34.71
-	DSCM + CPCA	77.95	31.81
-	CPCA + DSCM	78.03	25.30
DSCM	CPCA	78.92	23.12
DSCM	DSCM	79.01	27.42
CPCA	ASPP	78.12	29.74
CPCA	DSCM	81.67	21.67

Table 7. Paired t-test results and statistical analysis.

	1	2	3	4	5	Mean	95%CI	t-Statistic	p-Value	p-Value (Bonferroni Corrected)
Baseline	77.59	77.48	77.41	75.47	77.84	77.16	1.19	8.64	0.00099	0.00099
BFLE-Net	81.33	81.67	81.37	82.05	81.93	81.67	0.4	8.64	0.00099	0.00099

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

BFLE-Net: Boundary Feature Learning and Enhancement Network for Medical Image Segmentation

Abstract

1. Introduction

2. Related Works

2.1. CNN and Transformer Architectures

2.2. Feature Fusion in Medical Image Segmentation

2.3. Boundary-Based Modeling

3. Method

3.1. Overall Architecture

3.2. Channel and Position Compound Attention

3.3. Boundary Learning Module

3.4. Dynamic Scale-Aware Context Module

3.5. Loss Function

4. Experiments and Discussion

4.1. Datasets and Evaluation Indicators

4.2. Experimental Environment and Parameter Settings

4.3. Results

4.3.1. Quantitative Experimental Analysis

4.3.2. Qualitative Experimental Analysis

4.3.3. Ablation Experiment

5. Discussion

5.1. Experiments on Module Layout and Functional Performance

5.2. Statistical Significance Analysis Experiments of Multiple Models

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Article Metrics

Citations

Article Access Statistics