Article

A U-Shaped Architecture Based on Hybrid CNN and Mamba for Medical Image Segmentation

School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7821; https://doi.org/10.3390/app15147821
Submission received: 30 May 2025 / Revised: 4 July 2025 / Accepted: 10 July 2025 / Published: 11 July 2025

Abstract

Accurate medical image segmentation plays a critical role in clinical diagnosis, treatment planning, and a wide range of healthcare applications. Although U-shaped CNNs and Transformer-based architectures have shown promise, CNNs struggle to capture long-range dependencies, whereas Transformers suffer from quadratic growth in computational cost as image resolution increases. To address these issues, we propose HCMUNet, a novel medical image segmentation model that innovatively combines the local feature extraction capabilities of CNNs with the efficient long-range dependency modeling of Mamba, enhancing feature representation while reducing computational cost. In addition, HCMUNet features a redesigned skip connection and a novel attention module that integrates multi-scale features to recover spatial details lost during down-sampling and to promote richer cross-dimensional interactions. HCMUNet achieves Dice Similarity Coefficients (DSC) of 90.32%, 81.52%, and 92.11% on the ISIC 2018, Synapse multi-organ, and ACDC datasets, respectively, outperforming baseline methods by 0.65%, 1.05%, and 1.39%. Furthermore, HCMUNet consistently outperforms U-Net and Swin-UNet, achieving average Dice score improvements of approximately 5% and 2% across the evaluated datasets. These results collectively affirm the effectiveness and reliability of the proposed model across different segmentation tasks.

1. Introduction

With advancements in computing technology and hardware, computer-aided diagnosis has become increasingly prominent in modern medicine. Among its applications, medical image segmentation stands out as particularly critical and remains one of the most challenging tasks in image analysis. Accurate segmentation enables precise delineation of anatomical structures and quantitative assessment of organ and lesion morphology, thereby supporting disease classification, progression monitoring, and personalized treatment planning. These capabilities significantly enhance diagnostic accuracy and drive advances in medical technology. Medical image segmentation is predominantly addressed using deep neural networks, particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) [1], both of which have proven to perform exceptionally well on a variety of benchmarks. CNN-based architectures have garnered significant attention for their strong feature extraction capabilities, especially following the introduction of U-Net [2]. However, convolutional operations are inherently limited, with constraints such as local feature focus, poor modeling of long-range dependencies [3], and fixed receptive field sizes [4]. In contrast, ViT-based segmentation models leverage global self-attention mechanisms to effectively capture long-range dependencies, thereby enhancing segmentation accuracy. Nevertheless, this advantage comes at the cost of increased computational complexity, owing to the quadratic nature of self-attention. To mitigate this, the Pyramid Vision Transformer [5] introduces variable-sized attention windows across multiple resolution levels. This approach not only boosts recognition accuracy by 8.7% but also reduces the parameter count by nearly 50%. Similarly, the Swin Transformer [6] employs a shifted window mechanism to constrain the attention scope, achieving a 70% reduction in computational cost while surpassing standard ViTs in segmentation accuracy. Although the above methods effectively reduce model parameters and computational complexity, practical constraints such as limited computational power and memory resources must still be considered in clinical and real-world healthcare settings. Low parameter counts and minimal memory usage are fundamental requirements for mobile medical applications. Therefore, designing segmentation algorithms that deliver high accuracy with minimal computational demand is essential for deployment on portable and low-power medical devices. Recently, State Space Models (SSMs) [7] have attracted growing interest due to their efficiency and suitability for lightweight model design. Additionally, their strength in modeling long-range interactions contributes significantly to the integration of global semantic context. In contrast to traditional State Space Models, the Mamba model [8] enhances dynamic modeling by incorporating time-varying parameters, which allows it to adapt more flexibly to complex data. When processing text data, Mamba requires significantly fewer parameters than standard Transformers, leading to improved computational efficiency and scalability. Additionally, Mamba introduces a selective scanning mechanism, which enables it to process large-scale data approximately five times faster than Transformer models. Vision Mamba [9] significantly extends the model’s applicability to computer vision tasks. 
While maintaining modeling capabilities comparable to those of Vision Transformers, Vision Mamba demonstrates nearly threefold speed improvements over ViT-based models while reducing GPU memory consumption by about 86% in the context of batch inference for image feature processing. VMamba [10] presents a 2D selective scanning module, effectively bridging one-dimensional sequence modeling and two-dimensional spatial feature extraction. This innovation leads to a 3% improvement in performance on the ImageNet-1K dataset. U-Mamba [11] is the first architecture to integrate Mamba with CNNs for medical image segmentation. It outperforms ViT-based models by achieving 5% to 10% higher DSC scores across various datasets, along with a notable reduction in parameter count. Extending previous work, VM-UNet [12] employs an asymmetric encoder–decoder architecture and integrates Visual State Space blocks to enhance contextual understanding. It is the first medical image segmentation model fully constructed on the SSM framework. VM-UNet demonstrates strong performance across multiple datasets, achieving a 3.49% improvement over U-Net and showing further gains compared to ViT-based models.
Inspired by VM-UNet, we propose a hybrid architecture that fuses CNNs with Mamba networks to exploit their complementary strengths in medical image segmentation. The model comprises four main components: an encoder, a decoder, a bottleneck module, and an enhanced skip connection. In the encoder, CNNs are embedded within Visual State Space blocks to form a foundational unit termed the Hybrid CNN-Mamba (HCM) block, which captures both fine-grained local features and global semantic context. A patch merging mechanism is employed for down-sampling, enhancing representational capacity while optimizing computational efficiency. The decoder follows a similar architecture, utilizing HCM blocks alongside patch expansion layers to gradually recover spatial information and refine boundary predictions. The bottleneck comprises two HCM blocks, which enrich feature representations and support the learning of complex image characteristics. For the skip connection, we enhance the traditional design by introducing attention mechanisms to mitigate the loss of local spatial information and facilitate global multi-scale feature interactions. We evaluate our model on three widely used publicly available datasets for medical image segmentation: the International Skin Imaging Collaboration 2018 Challenge (ISIC 2018) [13], the Synapse multi-organ segmentation dataset [14], and the Automatic Cardiac Diagnosis Challenge (ACDC) dataset [15]. Results show that our model consistently delivers high segmentation accuracy with strong generalization across diverse medical imaging scenarios, underscoring its effectiveness and practical value. The main contributions of this work are summarized as follows:
  • We propose HCMUNet, a novel U-shaped architecture that integrates a hybrid CNN-Mamba network within a dual-branch encoder-decoder framework. This design facilitates efficient extraction of both local and global features while maintaining low computational complexity.
  • We introduce Skip Connection Dual Attention (SCDA), which enhances conventional skip connections by incorporating both channel and spatial attention. This mechanism strengthens cross-dimensional feature fusion and improves the recovery of spatial information lost during down-sampling.
  • We validate HCMUNet on three benchmark datasets: ISIC 2018, Synapse, and ACDC. The experimental results indicate that HCMUNet achieves high segmentation performance and exhibits strong generalization capability across diverse medical image segmentation tasks.

2. Related Work

2.1. U-Shaped Model for Medical Image Segmentation

CNNs constitute a fundamental component of deep learning and were originally developed for visual classification tasks. The advancement of artificial intelligence has significantly accelerated the adoption of models like Fully Convolutional Networks (FCNs) [16] in image segmentation tasks. In contrast to natural images, medical images typically possess higher complexity and require more accurate feature extraction. U-Net, the first CNN-based model tailored for medical image segmentation, incorporates skip connections to merge detailed spatial and abstract semantic features, significantly improving segmentation results. In the ISBI 2015 Cell Tracking Challenge, U-Net improved the Intersection over Union metric by approximately 10% compared to traditional image processing methods and ranked first in multiple sub-tasks, demonstrating its strong modeling capacity and superior accuracy. By embedding dense skip connections within a nested architecture, UNet++ [17] extends U-Net to facilitate feature reuse and enhance segmentation performance. Experiments indicate a 3% to 5% increase in Dice scores over U-Net across multiple benchmark datasets. ResUNet++ [18] builds upon UNet++ by integrating residual blocks and dilated convolutions, thereby improving global feature perception and delivering an approximate 5% increase in segmentation accuracy on the skin cancer dataset. Attention U-Net [19] further improves U-Net by incorporating attention gates that selectively focus on relevant feature regions. This design yields a 1% to 3% improvement in Dice scores and reduces boundary errors, leading to better precision and robustness in medical image segmentation. UNetV2 [20] presents an improved skip connection design that leverages Hadamard product-based fusion to better combine deep feature representations with detailed spatial cues. Its segmentation accuracy surpasses that of existing state-of-the-art models on skin lesion and polyp benchmarks. While CNNs have achieved notable success, the limited spatial extent of their receptive fields restricts their ability to capture global context and learn complex feature dependencies. This constraint can impair the richness of feature representations and negatively impact segmentation accuracy. To overcome these issues, recent research has increasingly explored SSMs as a promising alternative to conventional attention-based architectures. Mamba introduces a selective State Space architecture that achieves strong representational capacity by dynamically controlling input selection through linear recurrent operations. Unlike Transformers, Mamba avoids the quadratic complexity of self-attention, improving scalability for high-resolution medical images. Its efficient modeling of long-range dependencies makes it a compelling approach for advancing semantic segmentation in this domain. Recent advancements in Mamba-based architectures have resulted in the creation of models such as U-Mamba, SegMamba [21], and their variants, which have been successfully incorporated into encoder–decoder frameworks for medical image segmentation. For example, U-Mamba enhances nnUNet [22] with the Mamba architecture, leading to a roughly 2% improvement in segmentation accuracy and outperforming ViT-based models by a significant margin. SegMamba reduces both computational cost and memory usage relative to SwinUNETR [23], while also exhibiting more stable training behavior. Leveraging these advantages, it achieves an increase of around 7% in Dice score. 
Swin-UMamba [24] extends U-Mamba by integrating ImageNet-pretrained weights, achieving approximately a 50% reduction in parameters while maintaining comparable segmentation performance. SliceMamba [25] introduces a bidirectional slice scanning module to enhance local feature modeling, particularly excelling in boundary-sensitive segmentation tasks. It achieves a 25% reduction in the HD95 metric compared to Swin-UNet [26]. VM-UNet represents the first model purely based on SSMs for medical image segmentation, outperforming several ViT-based counterparts on the ISIC2017, ISIC2018, and Synapse datasets. H-VMUNet [27], an extension of VM-UNet, introduces a high-order 2D selective scanning mechanism that leverages hierarchical feature interactions to address the redundancy of conventional SS2D operations. In addition to enhancing segmentation accuracy, this design reduces the parameter count by around 67% relative to VM-UNet. Together, these findings highlight the strong capability and future promise of Mamba-based models in the domain of medical image segmentation.

2.2. Attention Mechanism

Attention mechanisms have seen significant advancements in both design and application within computer vision. Among these, attention mechanisms that operate along the channel and spatial dimensions have proven particularly effective. Channel-wise attention adaptively reweights each feature channel, diminishing less informative responses while strengthening those most critical for the task. In contrast, spatial-wise attention emphasizes salient regions across spatial dimensions by learning position-dependent weights, thereby improving the model’s capacity to capture spatial contextual details. This allows the model to focus on semantically important areas while reducing the impact of irrelevant background, thereby enhancing recognition accuracy. To effectively integrate these attention strategies into practical models, several notable architectures have been proposed. SENet [28] adaptively recalibrates channel-wise feature responses, significantly enhancing the network’s representational capacity. When integrated into ResNet-50 [29], it improves Top-1 accuracy by approximately 1.3% with only a marginal increase in computational cost. Building on the principles of channel and spatial attention, CBAM [30] introduces a sequential attention module that captures intricate inter-channel and inter-spatial relationships with negligible additional computational cost. Compared to the SE module, CBAM achieves a 0.5% improvement on the COCO dataset. BAM [31] applies channel and spatial attention in parallel, placing greater emphasis on deep semantic features. On the ImageNet classification task, it outperforms SE by approximately 0.22 percentage points. ECA-Net [32] replaces fully connected layers with one-dimensional convolutions to eliminate dimensionality reduction, preserving efficient feature modeling with fewer parameters. It achieves a slightly better classification performance than CBAM while maintaining lower complexity. SFA [33] further improves upon these designs by capturing long-range spatial dependencies through strip pooling and enhancing computational efficiency via grouped channel attention. Compared to other mechanisms, SFA achieves higher accuracy, while requiring fewer parameters and less computational resources.
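A minimal sketch of the channel-attention idea described above, in the style of an SE block [28]: global average pooling squeezes each channel to a scalar, a small bottleneck MLP produces per-channel weights, and the input feature map is rescaled by those weights. The reduction ratio of 16 and the module name are illustrative assumptions, not taken from any of the cited papers.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """SE-style channel attention: squeeze spatially, then reweight channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: per-channel global statistics
        return x * w.view(b, c, 1, 1)       # excite: rescale each channel


if __name__ == "__main__":
    print(ChannelAttention(64)(torch.randn(1, 64, 32, 32)).shape)  # (1, 64, 32, 32)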

3. Method

3.1. Framework Overview

As shown in Figure 1, the architecture of HCMUNet is structured around four core components: the encoder, decoder, bottleneck, and SCDA module. The encoder initiates the process by partitioning the input image of size H × W × 3 into non-overlapping patches, which are subsequently projected into a fixed embedding dimension via a patch embedding operation. These embeddings are passed through Hybrid CNN-Mamba (HCM) blocks along with patch merging layers to construct a hierarchical multi-scale representation. While the patch merging layers down-sample the spatial resolution and expand the channel dimension, the HCM blocks focus on extracting rich and discriminative features. The bottleneck module preserves both the spatial resolution and feature dimensionality, enabling effective global representation learning without altering the scale. In the decoder, HCM blocks are combined with patch expanding layers to progressively restore spatial resolution while decreasing feature dimensionality. To mitigate the loss of spatial information caused by down-sampling, the SCDA module integrates contextual features with multi-level encoder outputs, enhancing cross-scale feature interactions and long-range dependencies. A final 4× up-sampling operation performed by the last patch expanding layer restores the original input resolution, and a subsequent linear projection produces pixel-wise segmentation predictions. Each component is described in detail in the subsequent sections. The pseudo-code of the model is detailed in Algorithm 1.
Algorithm 1 Pseudo code of HCMUNet.
Require: Image {x_i, i ∈ N}, ground truth {y_i^l, i ∈ N}, number of training steps T
Ensure: Prediction y_i^out, model parameters
 1: /* Iterate over the training steps */
 2: for k = 1 to T do
 3:     x = PatchEmbedding(x_i)
 4:     /* Iterate through the stages */
 5:     for i = 1 to num_stage do
 6:         for j = 1 to num_stage_block do
 7:             x = HCMBlock(x)
 8:         end for
 9:         x_m = PatchMerging(x)
10:         tmp[i] = x_m
11:     end for
12:     for s = num_stage downto 1 do
13:         x = PatchExpanding(x)
14:         /* Concatenate encoder and decoder features */
15:         x = Concat(tmp[s], x)
16:         x = SCDA(x)
17:         for t = 1 to num_stage_block do
18:             x = HCMBlock(x)
19:         end for
20:     end for
21:     /* Apply a final 4x expansion to the output */
22:     x = PatchExpanding4x(x)
23:     /* Apply linear projection to obtain the final output */
24:     y_i^out = LinearProjection(x)
25:     /* Compute the loss between predicted and ground truth */
26:     L = Loss(y_i^out, y_i^l)
27:     Back propagation
28:     Update parameters
29: end for
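To make the data flow of Algorithm 1 concrete, the following PyTorch sketch wires up the same U-shaped structure: patch embedding, stacked encoder stages with down-sampling, a bottleneck, and decoder stages that fuse skip features before up-sampling back to the input resolution. The blocks below are simplified convolutional stand-ins; HCM blocks, patch merging/expanding, and SCDA are only mimicked, and all layer sizes are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn


class StandInBlock(nn.Module):
    """Placeholder for an HCM block: preserves spatial size and channel count."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return x + self.body(x)


class HCMUNetSketch(nn.Module):
    def __init__(self, in_ch=3, base=96, stages=3, num_classes=2):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, base, kernel_size=4, stride=4)   # patch embedding
        chs = [base * 2 ** i for i in range(stages + 1)]
        self.enc = nn.ModuleList([StandInBlock(c) for c in chs[:-1]])
        self.down = nn.ModuleList([nn.Conv2d(chs[i], chs[i + 1], 2, stride=2)
                                   for i in range(stages)])             # patch merging
        self.bottleneck = StandInBlock(chs[-1])
        self.up = nn.ModuleList([nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2)
                                 for i in reversed(range(stages))])     # patch expanding
        self.fuse = nn.ModuleList([nn.Conv2d(2 * chs[i], chs[i], 1)
                                   for i in reversed(range(stages))])   # SCDA stand-in
        self.dec = nn.ModuleList([StandInBlock(chs[i]) for i in reversed(range(stages))])
        self.final_up = nn.ConvTranspose2d(base, base, kernel_size=4, stride=4)  # 4x expand
        self.head = nn.Conv2d(base, num_classes, 1)                     # linear projection

    def forward(self, x):
        x = self.embed(x)
        skips = []
        for blk, down in zip(self.enc, self.down):
            x = blk(x)
            skips.append(x)                          # stage feature kept for the skip path
            x = down(x)
        x = self.bottleneck(x)
        for up, fuse, blk, skip in zip(self.up, self.fuse, self.dec, reversed(skips)):
            x = up(x)
            x = fuse(torch.cat([skip, x], dim=1))    # skip fusion (SCDA in the paper)
            x = blk(x)
        return self.head(self.final_up(x))


if __name__ == "__main__":
    net = HCMUNetSketch()
    out = net(torch.randn(1, 3, 256, 256))
    print(out.shape)  # torch.Size([1, 2, 256, 256])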

3.2. Hybrid Convolutional Mamba Block

Effective medical image segmentation requires capturing both local details and global context. To achieve this, we design a lightweight and efficient feature extraction module. While CNN-based models are proficient at capturing local features, their ability to capture long-range dependencies remains inherently limited. In contrast, Mamba-based models excel at capturing global context through structured State Space mechanisms. However, they are often computationally intensive and may lack the inductive biases needed for learning precise local structures. To address the complementary limitations, we propose the Hybrid CNN-Mamba (HCM) block, a dual-branch module integrated into the HCMUNet framework. The HCM block is specifically designed to learn local and global representations simultaneously and efficiently. As illustrated in Figure 2, we adopt a channel-splitting strategy, which partitions the input feature map into two parallel branches for processing. One branch is processed by a convolutional branch to extract detailed local representations, while the other captures long-range dependencies via the Mamba branch, facilitating a balanced distribution of features across the branches. This design effectively reduces computational costs while maintaining high processing efficiency. The outputs of both branches are subsequently fused through channel concatenation to restore the original channel dimension. To address potential challenges such as information fragmentation and feature redundancy introduced by grouped processing, we incorporate a channel shuffle mechanism. This mechanism rearranges the channels of the output feature map, promoting effective information exchange between branches and ensuring more comprehensive utilization of features in each channel, thus enhancing feature diversity. In addition, activation functions are carefully selected to optimize feature expressiveness and training stability: ReLU [34] is used in the convolutional branch, while SiLU [35] is employed in the Mamba branch to improve gradient flow and convergence. Due to its efficient balance of local and global feature extraction and a lower parameter count relative to ViT-based models, the HCM block presents a compelling solution for medical image segmentation.
Given a feature input $X \in \mathbb{R}^{H \times W \times C}$, the feature output after processing is $Y \in \mathbb{R}^{H \times W \times C}$. $X$ is divided equally into $X_1$ and $X_2$, followed by separate feature extraction in two branches. To ensure compatibility with convolution operations, the original feature map is rearranged for subsequent processing.
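The channel-split, dual-branch, and channel-shuffle flow of the HCM block can be sketched as follows. This is a hedged illustration, not the paper's implementation: the global branch is a placeholder (layer normalization plus a pointwise MLP with SiLU) standing in for the Visual State Space (Mamba) branch, and the exact convolutional layout of the local branch is assumed.

import torch
import torch.nn as nn


def channel_shuffle(x, groups=2):
    """Interleave channels from the two branches so information mixes across them."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)


class HCMBlockSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        # Local branch: depthwise + pointwise convolution (PWConv) with BN and ReLU.
        self.local = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1, groups=half),
            nn.Conv2d(half, half, 1),          # PWConv mixes channel-level information
            nn.BatchNorm2d(half),
            nn.ReLU(inplace=True),
        )
        # Global branch placeholder: LN + pointwise MLP with SiLU.
        # The actual model uses a Mamba (Visual State Space) block here.
        self.norm = nn.LayerNorm(half)
        self.global_mlp = nn.Sequential(nn.Linear(half, half), nn.SiLU(),
                                        nn.Linear(half, half))

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)      # split channels into two halves
        y1 = self.local(x1)                    # local branch: convolutional features
        b, c, h, w = x2.shape
        t = x2.flatten(2).transpose(1, 2)      # (B, H*W, C/2) token sequence
        t = self.global_mlp(self.norm(t))      # stand-in for the Mamba branch
        y2 = t.transpose(1, 2).view(b, c, h, w)
        y = torch.cat([y1, y2], dim=1)         # concatenation restores C channels
        return channel_shuffle(y)              # mix information across the branches


if __name__ == "__main__":
    blk = HCMBlockSketch(64)
    print(blk(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])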

3.3. Skip Connection Dual Attention

To improve segmentation accuracy, particularly in preserving fine-grained anatomical structures, the integration of features between the encoder and decoder stages is strengthened. However, existing attention mechanisms, which operate solely on the spatial or channel dimension, often fail to capture comprehensive contextual dependencies. When only the importance or positional information of features is considered, the model’s capacity to learn fully discriminative representations becomes constrained. Motivated by global attention mechanisms [36], we propose a Skip Connection Dual Attention (SCDA) module, developed to prevent spatial information loss in the encoder’s down-sampling process, while also strengthening the interaction of features between the encoder and decoder. SCDA incorporates both channel and spatial attention mechanisms, enabling the model to capture richer contextual information than employing either mechanism in isolation. As illustrated in Figure 3, adaptive reweighting is applied along the channel dimension to emphasize task-relevant features and suppress redundant information, facilitated by the channel attention submodule. In parallel, the spatial attention submodule assigns importance to specific spatial locations, guiding the model to focus on relevant regions while reducing background interference. By embedding both attention mechanisms within a unified skip connection framework, SCDA promotes enhanced multi-scale feature fusion and facilitates the aggregation of both global and local contextual information. This design not only preserves structural details but also improves overall segmentation performance.
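A minimal sketch of the SCDA idea follows: encoder and decoder features are concatenated, reweighted along the channel dimension, and then along the spatial dimension. The internal layout shown here (reduction ratio, 7 × 7 spatial convolution, sequential channel-then-spatial ordering) follows a generic dual-attention pattern and is an assumption, not the module's actual design.

import torch
import torch.nn as nn


class SCDASketch(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatially, excite channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        # Spatial attention: a 7x7 convolution over pooled channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, enc_feat, dec_feat):
        x = torch.cat([enc_feat, dec_feat], dim=1)        # fuse skip and decoder features
        b, c, _, _ = x.shape
        w_c = self.channel_mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        x = x * w_c                                       # channel reweighting
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        w_s = self.spatial_conv(torch.cat([avg, mx], dim=1))
        return x * w_s                                    # spatial reweighting


if __name__ == "__main__":
    scda = SCDASketch(channels=192)
    enc = torch.randn(1, 96, 56, 56)
    dec = torch.randn(1, 96, 56, 56)
    print(scda(enc, dec).shape)  # torch.Size([1, 192, 56, 56])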

3.4. Loss Function

Inspired by [12,37], we adopt the $L_{BceDice}$ loss [17] for the ISIC 2018 dataset, as defined in Equation (1). This loss combines Binary Cross-Entropy (BCE) and Dice loss in a weighted fashion, with $\lambda_1$ controlling their relative contributions. The BCE component enhances pixel-wise precision by minimizing the error between predicted values and actual binary labels. In contrast, Dice loss evaluates the overall spatial overlap, enhancing the segmentation of target structures, particularly in the presence of class imbalance. This combination effectively balances both local precision and global consistency. For the Synapse and ACDC datasets, which involve multi-class segmentation with notable class imbalance, we use a hybrid loss function $L_{CeDice}$ [38], defined in Equation (2). This function integrates the standard multi-class Cross-Entropy (CE) loss and Dice loss, with individual weighting coefficients $\lambda_2$ and $\lambda_3$. The CE loss promotes accurate classification across all categories by minimizing the softmax-normalized cross-entropy. Meanwhile, the Dice loss emphasizes per-class spatial overlap, improving performance on underrepresented or small structures. This hybrid approach ensures both categorical precision and structural integrity in the segmentation results.
$$L_{BceDice} = \lambda_1 L_{Bce} + (1 - \lambda_1) L_{Dice}$$
$$L_{Bce} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
where $N$ denotes the total number of samples, $\hat{y}_i \in [0, 1]$ is the predicted probability, and $y_i \in \{0, 1\}$ is the ground truth label for the $i$-th pixel.
$$L_{Dice} = 1 - \frac{2 |X \cap Y|}{|X| + |Y|}$$
where $|X|$ and $|Y|$ denote the number of positive pixels in the ground truth and prediction, respectively, and $|X \cap Y|$ denotes their intersection. The weighting factor $\lambda_1 \in [0, 1]$ controls the balance between the two components (default: 0.6).
$$L_{CeDice} = \lambda_2 L_{Ce} + \lambda_3 L_{Dice}$$
$$L_{Ce} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
where $y_{i,c} \in \{0, 1\}$ is an indicator (1 if pixel $i$ belongs to class $c$, 0 otherwise), and $\hat{y}_{i,c} \in [0, 1]$ is the softmax probability that pixel $i$ belongs to class $c$. $C$ denotes the number of classes.
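The two hybrid losses can be sketched in PyTorch as follows. The weighting $\lambda_1 = 0.6$ follows the text above; equal weights for $\lambda_2$ and $\lambda_3$ are an assumption, as their values are not specified here.

import torch
import torch.nn.functional as F


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss (1 - Dice coefficient), averaged over classes."""
    dims = (0, 2, 3)
    inter = (probs * target).sum(dims)
    union = probs.sum(dims) + target.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


def bce_dice_loss(logits, target, lam1=0.6):
    """Weighted BCE + Dice loss used for binary segmentation (ISIC 2018)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    dice = dice_loss(torch.sigmoid(logits), target)
    return lam1 * bce + (1.0 - lam1) * dice


def ce_dice_loss(logits, target, lam2=0.5, lam3=0.5):
    """Weighted CE + Dice loss used for multi-class segmentation (Synapse, ACDC)."""
    ce = F.cross_entropy(logits, target)                        # target: class indices
    one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    dice = dice_loss(torch.softmax(logits, dim=1), one_hot)
    return lam2 * ce + lam3 * dice


if __name__ == "__main__":
    pred = torch.randn(2, 1, 64, 64)
    mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(bce_dice_loss(pred, mask).item())
    pred_mc = torch.randn(2, 9, 64, 64)
    mask_mc = torch.randint(0, 9, (2, 64, 64))
    print(ce_dice_loss(pred_mc, mask_mc).item())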

4. Experimental Results and Analysis

4.1. Dataset

To evaluate HCMUNet’s segmentation performance, we employ three benchmark datasets: ISIC 2018, Synapse, and ACDC. These datasets encompass diverse imaging modalities, including dermoscopy, computed tomography (CT), and magnetic resonance imaging (MRI). This variability enables a comprehensive evaluation of the model’s segmentation accuracy, its generalizability across anatomical structures, and its robustness under varying imaging conditions and clinical scenarios.
ISIC 2018 Dataset: Released by the International Skin Imaging Collaboration, the ISIC 2018 dataset is a widely used benchmark for skin lesion segmentation. It consists of 2694 dermoscopic images with corresponding segmentation masks, divided into 1886 training and 808 testing samples.
Synapse Multi-Organ Segmentation Dataset: Provided as part of the MICCAI 2015 multi-atlas abdominal CT segmentation challenge, this dataset is widely used in medical image segmentation. It includes 30 contrast-enhanced abdominal CT scans with a total of 3779 axial slices, annotated with labels for eight organs: aorta, gallbladder, left and right kidneys, liver, pancreas, spleen, and stomach. In our experiments, we use 21 cases for training and nine for testing.
ACDC Dataset: The Automatic Cardiac Diagnosis Challenge (ACDC) dataset is a widely used benchmark for cardiac MRI segmentation. It comprises cardiac MRI scans of 100 patients, with manual annotations provided for the left ventricle, right ventricle, and myocardium. In our experiments, we use 70 cases for training and 30 for testing.

4.2. Implementation Details

All experiments are conducted on a workstation running CentOS 7, equipped with NVIDIA Tesla V100 GPUs. The model is implemented in PyTorch 1.13 with Python 3.8. For preprocessing, images from the ISIC 2018 and ACDC datasets are resized to 256 × 256, while images from the Synapse dataset are resized to 224 × 224. To enhance generalization and mitigate overfitting, standard data augmentation techniques, such as random horizontal and vertical flipping, are applied. Training is performed using the AdamW optimizer, which combines the adaptive learning rate of Adam with decoupled weight decay. The learning rate is initialized to 0.001 and gradually decayed using a cosine annealing schedule over 300 epochs, with a batch size of 32.
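A minimal sketch of the corresponding training setup, assuming the hyper-parameters stated above (AdamW, initial learning rate 0.001, cosine annealing over 300 epochs, batch size 32, random horizontal and vertical flips). The weight decay value, the placeholder model, and the omitted data loader are assumptions for illustration only.

import torch
from torch import nn, optim
from torchvision import transforms

epochs, batch_size, init_lr = 300, 32, 1e-3

# Preprocessing and augmentation: resize plus random flips.
# ISIC 2018 and ACDC images are resized to 256 x 256, Synapse to 224 x 224.
augment = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])

model = nn.Conv2d(3, 2, kernel_size=1)     # placeholder network, not HCMUNet
optimizer = optim.AdamW(model.parameters(), lr=init_lr, weight_decay=1e-2)  # decay assumed
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # for images, masks in train_loader:    # data loading and loss computation omitted
    #     loss = criterion(model(images), masks)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()                        # cosine annealing of the learning rate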

4.3. Evaluation Metrics

To evaluate the model’s performance across diverse datasets, we employ dataset-specific metrics. For the ISIC 2018 dataset, segmentation is assessed using five metrics: mean Intersection over Union (mIoU) [16], Dice Similarity Coefficient (DSC) [39], Accuracy (Acc) [2], Sensitivity (Sen), and Specificity (Spe) [40]. For the Synapse dataset, the DSC is used to measure organ-wise segmentation accuracy, while the Hausdorff Distance (HD) [41] quantifies the shape similarity between predicted and ground truth contours, with a focus on boundary precision. For the ACDC dataset, the DSC is similarly used to evaluate the segmentation quality of cardiac structures.
$$IoU = \frac{TP}{TP + FP + FN}$$
$$mIoU = \frac{1}{n} \sum_{i=1}^{n} IoU_i$$
$$DSC = \frac{2 \times TP}{2 \times TP + FP + FN}$$
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Sen = \frac{TP}{TP + FN}$$
$$Spe = \frac{TN}{TN + FP}$$
$$HD(Y, \hat{Y}) = \max \left\{ \max_{y \in Y} \min_{\hat{y} \in \hat{Y}} d(y, \hat{y}),\; \max_{\hat{y} \in \hat{Y}} \min_{y \in Y} d(y, \hat{y}) \right\}$$
where the false negative (FN) refers to a pixel belonging to the ground truth target region but mistakenly classified as background, while the true negative (TN) denotes a background pixel correctly identified as such. The false positive (FP) is a background pixel incorrectly predicted as part of the target region, and the true positive (TP) is a pixel correctly identified within the target region. In medical image segmentation, positive samples are pixels within the lesion or organ of interest, while negative samples are background pixels. $Y$ denotes the ground truth mask and $\hat{Y}$ the corresponding prediction. The term $d(y, \hat{y})$ refers to the Euclidean distance between a ground truth point $y$ and the nearest predicted point $\hat{y}$.
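The metrics above can be computed from binary masks as sketched below with NumPy and SciPy. Note that the Hausdorff computation here returns the full symmetric Hausdorff distance over foreground pixel coordinates; the HD95 variant reported on Synapse instead takes the 95th percentile of the surface distances, which is not shown.

import numpy as np
from scipy.spatial.distance import directed_hausdorff


def confusion(pred, gt):
    """Pixel-level TP, FP, FN, TN counts for a binary prediction and ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return tp, fp, fn, tn


def segmentation_metrics(pred, gt, eps=1e-7):
    tp, fp, fn, tn = confusion(pred, gt)
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "Sen": tp / (tp + fn + eps),
        "Spe": tn / (tn + fp + eps),
    }


def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between the foreground pixel sets."""
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    return max(directed_hausdorff(g, p)[0], directed_hausdorff(p, g)[0])


if __name__ == "__main__":
    gt = np.zeros((64, 64)); gt[16:48, 16:48] = 1
    pred = np.zeros((64, 64)); pred[20:50, 18:46] = 1
    print(segmentation_metrics(pred, gt))
    print(hausdorff_distance(pred, gt))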

4.4. Experimental Results

Automatic segmentation of skin lesions is crucial for enabling precise clinical tasks such as diagnosis, prognosis, and treatment planning. We evaluate the proposed model on the ISIC 2018 dataset to assess its effectiveness in skin lesion segmentation. As illustrated in Figure 4, HCMUNet excels at delineating lesion boundaries and effectively mitigates the edge-blurring artifacts observed in other models. The resulting segmentation masks show high fidelity to the ground truth, maintaining sharp contours and structural integrity even for lesions with irregular shapes and varying sizes. Quantitative results in Table 1 further validate the model’s performance. HCMUNet achieves the best scores across all key metrics, including an mIoU of 82.19%, a DSC of 90.32%, and a Sensitivity of 91.21%, outperforming both traditional CNN-based architectures (e.g., U-Net, U-Net++, Att-UNet) and ViT-based models (e.g., TransUNet, TransFuse, Swin-UNet). Notably, it surpasses the well-established Transformer baseline, Swin-UNet, by 1.48% in mIoU, 0.66% in the DSC, and 0.90% in Sensitivity. Compared to the baseline, HCMUNet shows improved segmentation accuracy, as reflected in the higher DSC, and demonstrates greater stability across diverse cases. These consistent improvements underscore the effectiveness of HCMUNet’s hybrid design, which integrates convolutional layers for detailed local feature extraction with Mamba networks to capture long-range dependencies. Overall, the results demonstrate not only strong generalization across complex scenarios but also the model’s practical applicability in real-world medical image segmentation tasks.
To further validate the generalizability of HCMUNet, we evaluate its performance on the Synapse multi-organ CT dataset, which poses challenges due to large anatomical variability and complex organ boundaries. As shown in Figure 5, our model delivers anatomically coherent segmentations, accurately capturing both the shape and spatial layout of organs, even in regions with high overlap or irregularity. Table 2 provides a comprehensive comparison with prior methods. HCMUNet achieves a DSC of 81.52 and an HD95 of 17.83, outperforming most CNNs and ViT-based counterparts. Specifically, it improves over TransUNet by 5.06% in DSC, and over Swin-UNet by 2.33%, highlighting its effectiveness in modeling both local textures and global spatial cues. In organ-wise comparisons, HCMUNet attains top scores in critical and structurally diverse organs such as the gallbladder (69.60%), right kidney (82.35%), and liver (95.10%), demonstrating high robustness across categories with varying shapes, sizes, and boundary clarity. While EMCAD performs competitively in certain cases, HCMUNet exhibits more consistent accuracy across all organs. These improvements can be attributed to its hybrid design: Convolutional encoders preserve fine-grained local patterns, while the Mamba blocks enhance global reasoning. Importantly, the reduced HD95 suggests stronger alignment with true organ contours, further reinforcing HCMUNet’s suitability for complex 3D medical image segmentation tasks.
As part of a broader evaluation, we applied our model to the ACDC dataset, which contains cine cardiac MRI scans annotated for three key anatomical regions: the right ventricle (RV), myocardium (MYO), and left ventricle (LV), thereby testing its cross-domain performance. This dataset poses unique challenges due to the dynamic nature of the cardiac cycle, varying wall thickness, and frequent shape deformations. Figure 6 shows that HCMUNet produces segmentation maps with smooth contours and strong anatomical fidelity, accurately capturing cardiac structures even under substantial shape and intensity variations. The model preserves clear boundaries in challenging regions like the myocardium, which often suffers from segmentation ambiguity due to its thin and non-uniform geometry. Table 3 presents a detailed comparison of segmentation performance across models, where HCMUNet attains the highest overall Dice score of 92.11%. In particular, it outperforms the standard U-Net by 4.27% and exhibits a 1.39% improvement over a Mamba-only variant, highlighting the synergy of convolutional and sequence modeling components in our design. In addition, HCMUNet consistently achieves high performance across all cardiac substructures, including a Dice score of 91.50% for the RV, which often presents irregular boundaries, and 90.20% for the MYO, which is particularly challenging to segment due to its thin and variable shape. Although marginally lower scores are observed for the left ventricle when compared to EMCAD and TransCASCADE, HCMUNet demonstrates superior overall accuracy, stability, and generalization, further validating the robustness of the proposed hybrid framework under diverse anatomical and imaging conditions.

4.5. Ablation Study

Ablation experiments were conducted on three datasets to systematically evaluate the contribution of each architectural component. The performance under various configurations is summarized in Table 4. Specifically, when only the Mamba branch was used, the model exhibited a slight decline in most evaluation metrics, indicating that relying solely on structured state-space modeling is insufficient to capture fine-grained spatial features. Upon integrating a convolutional branch to form a dual-branch architecture, segmentation accuracy improved considerably. Further enhancements were observed with the inclusion of the SCDA module. By incorporating both channel and spatial attention mechanisms, SCDA facilitates more effective feature interaction and mitigates spatial information loss. We also examined the effect of adding the SCDA module independently to the Mamba and convolutional branches. Notably, in the dual-branch configuration, the model achieved superior performance while significantly reducing the number of parameters compared to single-branch variants. This finding underscores not only the effectiveness of the hybrid design but also the parameter efficiency afforded by channel-balanced decomposition. Compared to its counterpart without SCDA, the full model yielded higher DSC scores and more precise segmentation boundaries. However, this improvement came at the cost of increased model complexity due to the additional parameters introduced by SCDA. One possible reason for the limited improvements in certain metrics is that the current fusion mechanism between local CNN features and the global representations learned by Mamba has not yet been fully optimized for segmentation tasks. In future work, we aim to further explore and enhance this interaction strategy to fully leverage the advantages of the dual-branch architecture in medical image segmentation.
To assess the influence of training data volume on segmentation performance, we conducted experiments using progressively larger subsets of the datasets, corresponding to 25%, 50%, 75%, and 100% of the available samples. The results, presented in Table 5, show a consistent improvement in the DSC metric as the amount of training data increases, highlighting the model’s strong scalability and robustness under varying data availability scenarios. Specifically, on the ISIC 2018 dataset, the DSC improved from 87.15% with 25% of the data to 90.32% with the full dataset, reflecting a gain of 3.17 percentage points. A similar trend was observed on Synapse, where the DSC rose from 75.35% to 81.52%. The ACDC dataset exhibited the smallest relative gain, from 88.95% to 92.12%, likely due to its comparatively smaller inter-sample variability. Notably, even with only 25% of the training data, the model maintained competitive performance across all three datasets, demonstrating its strong generalization ability in low-resource settings. These findings underscore HCMUNet’s data efficiency and robustness, making it well suited for real-world medical scenarios where labeled data is often limited. Furthermore, the observed performance gains suggest that the model is capable of leveraging additional training data effectively, offering a promising path for future extensions to larger-scale datasets.

4.6. Discussion

Based on experiments conducted on three benchmark datasets, HCMUNet consistently delivers robust and high-accuracy segmentation results. However, the degree of improvement varies depending on the complexity and nature of the task. For the relatively straightforward ISIC 2018 skin lesion segmentation, HCMUNet achieves a DSC of 90.32%, outperforming U-Net by 2.9% and Swin-UNet by 0.66%, while preserving lesion boundaries with high fidelity and minimal edge blurring. Figure 7 illustrates the ROC curves of multiple models tested on the ISIC dataset. Our model (black) consistently lies closest to the top-left corner, indicating superior classification performance with a high TPR and a low FPR.
On the more complex Synapse multi-organ segmentation task, HCMUNet reaches a DSC of 81.52%, marking a significant 5.6% gain over U-Net (75.92%), alongside a reduced HD95 of 17.83 mm, indicating strong spatial precision. It notably achieves superior accuracy on organs with diverse morphologies such as the gallbladder, right kidney, and liver, showcasing enhanced structural modeling and edge delineation capabilities. For the ACDC cardiac MRI segmentation task, HCMUNet attains an average DSC of 92.11%, outperforming advanced methods like EMCAD (91.30%) and TransCASCADE (91.59%). It achieves class-specific DSCs of 91.50% (RV) and 90.20% (MYO), reflecting its robustness across anatomical regions with varying spatial characteristics and intensity distributions. Collectively, these results demonstrate HCMUNet’s strong generalization ability across both simple and highly complex segmentation scenarios. The model effectively segments target regions while preserving fine structural details, reducing both over-segmentation and under-segmentation. Furthermore, its outputs maintain sharp, continuous boundaries, minimizing blurry artifacts commonly seen in other architectures. These attributes make HCMUNet particularly suitable for real-world clinical applications involving diverse imaging modalities and anatomical targets. Figure 8 illustrates the comparison of DSC scores among various segmentation models on three benchmark datasets.

5. Conclusions

In this paper, we presented HCMUNet, a novel U-shaped hybrid architecture that integrates CNNs with Mamba for medical image segmentation. By combining the local feature extraction capabilities of CNNs with Mamba’s strength in modeling long-range dependencies, HCMUNet enables efficient contextual representation learning with reduced computational complexity. To address the spatial information loss introduced during down-sampling, we proposed the Skip Connection Dual Attention (SCDA) module, which facilitates multi-scale and cross-dimensional feature fusion. Extensive experiments on publicly available datasets demonstrate that HCMUNet consistently delivers strong segmentation performance, particularly excelling in boundary preservation and structural detail retention. The model achieved DSC scores of 90.32%, 81.52%, and 92.11% on the ISIC 2018, Synapse multi-organ, and ACDC cardiac datasets, respectively, underscoring its robustness and strong generalization ability across diverse segmentation tasks. In future work, we plan to further refine the model’s internal structure to enhance its applicability to broader medical imaging domains. We also intend to evaluate its performance under data-scarce conditions using self-supervised and transfer learning techniques. Additionally, we aim to develop a lightweight variant of HCMUNet that maintains competitive accuracy while significantly reducing computational overhead, thereby facilitating deployment in resource-constrained clinical environments.

Author Contributions

All authors contributed to the approach conception and design. Material preparation, data collection, and analysis were performed by Y.D. The first draft of the manuscript was written by Y.D. under the supervision of X.M. and D.S. All authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code is available at https://doi.org/10.5281/zenodo.15606243.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  3. Rao, Y.; Zhao, W.; Tang, Y.; Zhou, J.; Lim, S.; Lu, J. HorNet: Efficient high-order spatial interactions with recursive gated convolutions. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 10353–10366. [Google Scholar]
  4. Wu, R.; Liang, P.; Huang, X.; Shi, L.; Gu, Y.; Zhu, H.; Chang, Q. Mhorunet: High-order spatial interaction unet for skin lesion segmentation. Biomed. Signal Process. Control 2024, 88, 105517. [Google Scholar] [CrossRef]
  5. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 548–558. [Google Scholar]
  6. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  7. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the 10th International Conference on Learning Representations, Virtually, 25–29 April 2022. [Google Scholar]
  8. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  9. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; pp. 62429–62442. [Google Scholar]
  10. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. In Proceedings of the Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Volume 37, pp. 103031–103063. [Google Scholar]
  11. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  12. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
  13. Codella, N.; Rotemberg, V.; Tschandl, P.; Celebi, M.E.; Dusza, S.; Gutman, D.; Helba, B.; Kalloo, A.; Liopyris, K.; Marchetti, M.; et al. Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv 2018, arXiv:1902.03368. [Google Scholar]
  14. Harrigr. Segmentation Outside the Cranial Vault Challenge. Synapse 2015. [Google Scholar] [CrossRef]
  15. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.-A.; Cetin, I.; Lekadir, K.; Camara, O.; Gonzalez Ballester, M.A.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef] [PubMed]
  16. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3431–3440. [Google Scholar]
  17. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  18. Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; De Lange, T.; Halvorsen, P.; Johansen, H.D. ResUNet++: An Advanced Architecture for Medical Image Segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–2255. [Google Scholar]
  19. Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. In Proceedings of the Medical Imaging with Deep Learning, Amsterdam, The Netherlands, 4–6 July 2018. [Google Scholar]
  20. Peng, Y.; Sonka, M.; Chen, D.Z. U-Net v2: Rethinking the skip connections of U-Net for medical image segmentation. In Proceedings of the IEEE 22nd International Symposium on Biomedical Imaging, Houston, TX, USA, 14–17 April 2025; pp. 1–5. [Google Scholar]
  21. Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In Proceedings of the 27th International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 578–588. [Google Scholar]
  22. Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P.; Kohl, S.; Wasserthal, J.; Koehler, G.; Norajitra, T.; Wirkert, S.; et al. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Shenzhen, China, 13–17 October 2019; pp. 18–28. [Google Scholar]
  23. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 272–284. [Google Scholar]
  24. Liu, J.; Yang, H.; Zhou, H.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. In Proceedings of the 27th International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 615–625. [Google Scholar]
  25. Fan, C.; Yu, H.; Huang, Y.; Wang, L.; Yang, Z.; Jia, X. SliceMamba with Neural Architecture Search for Medical Image Segmentation. IEEE J. Biomed. Health Inform. 2025, 1–13. [Google Scholar] [CrossRef] [PubMed]
  26. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  27. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order vision mamba unet for medical image segmentation. Neurocomputing 2025, 624, 129447. [Google Scholar] [CrossRef]
  28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  30. Woo, S.; Park, J.; Lee, J.; Kweon, I. Cbam: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  31. Park, J.; Woo, S.; Lee, J.; Kweon, I. Bam: Bottleneck attention module. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018; p. 147. [Google Scholar]
  32. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  33. Xu, W.; Wan, Y.; Zhao, D. SFA: Efficient Attention Mechanism for Superior CNN Performance. Neural Process. Lett. 2025, 57, 38. [Google Scholar] [CrossRef]
  34. Nair, V.; Hinton, G.E. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  35. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef] [PubMed]
  36. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
  37. Lan, L.; Cai, P.; Jiang, L.; Liu, X.; Li, Y.; Zhang, Y. BRAU-Net++: U-Shaped Hybrid CNN-Transformer Network for Medical Image Segmentation. arXiv 2024, arXiv:2401.00722. [Google Scholar]
  38. Yeung, M.; Sala, E.; Schönlieb, C.B.; Rundo, L. Unified Focal Loss: Generalising Dice and Cross Entropy-Based Losses to Handle Class Imbalanced Medical Image Segmentation. Comput. Med. Imaging Graph. 2022, 95, 102026. [Google Scholar] [CrossRef] [PubMed]
  39. Murguía, M.; Villaseñor, J.L. Estimating the Effect of the Similarity Coefficient and the Cluster Algorithm on Biogeographic Classifications. Ann. Bot. Fennici 2003, 40, 415–421. [Google Scholar]
  40. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.W.M.; van Ginneken, B.; Sánchez, C.I. A Survey on Deep Learning in Medical Image Analysis. Med. Image Anal. 2017, 42, 60–88. [Google Scholar] [CrossRef] [PubMed]
  41. Karimi, D.; Salcudean, S.E. Reducing the Hausdorff Distance in Medical Image Segmentation with Convolutional Neural Networks. IEEE Trans. Med. Imaging 2019, 39, 499–513. [Google Scholar] [CrossRef] [PubMed]
  42. Zhang, Q.L.; Yang, Y.B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar]
  43. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtually, 19–25 June 2021; pp. 574–584. [Google Scholar]
  44. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention, Virtually, 27 September–1 October 2021; pp. 14–24. [Google Scholar]
  45. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Virtually, 22 February–1 March 2022; pp. 2441–2449. [Google Scholar]
  46. Xu, G.; Zhang, X.; He, X.; Wu, X. Levit-unet: Make faster encoders with transformer for medical image segmentation. In Proceedings of the 6th Chinese Conference on Pattern Recognition and Computer Vision, Xiamen, China, 13–15 October 2023; pp. 42–53. [Google Scholar]
  47. Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.; Cohen-Adad, J.; Merhof, D. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6202–6212. [Google Scholar]
  48. Rahman, M.; Munir, M.; Marculescu, R. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 19–21 June 2024; pp. 11769–11779. [Google Scholar]
  49. Rahman, M.; Marculescu, R. Medical image segmentation via cascaded attention decoding. In Proceedings of the IEEE/CVF Winter Conference on Applications Of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6222–6231. [Google Scholar]
Figure 1. The overall architecture of HCMUNet integrates the strengths of CNNs and the Mamba network through HCM blocks, which are designed to capture both local and global features effectively. To address spatial information loss induced by down-sampling, the SCDA module enhances cross-dimensional interactions by jointly leveraging channel and spatial attention. Furthermore, the resolution and dimensionality of feature maps are preserved within the bottleneck to maintain representational integrity.
Figure 2. Structure of the HCM block: The upper branch captures local features via convolution, complementing the lower branch’s global context modeling achieved through Mamba-based processing. PWConv modifies and combines channel-level information, maintaining the spatial resolution throughout the process. BN and LN refer to batch normalization and layer normalization, respectively.
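For readers who prefer code to diagrams, the two-branch structure described in this caption can be sketched in PyTorch as below. This is an illustrative interpretation only: the token-mixing module is injected as `mamba_layer` (for example, an instance of `mamba_ssm.Mamba`), and choices such as the GELU activation and the 3 × 3 kernel are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HCMBlock(nn.Module):
    """Hypothetical two-branch block: a conv branch for local features and a
    Mamba branch for global context, fused by a point-wise (1x1) convolution."""

    def __init__(self, channels: int, mamba_layer: nn.Module):
        super().__init__()
        # Upper branch: local feature extraction (3x3 conv + BN + activation).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Lower branch: global context over flattened tokens (LN + Mamba).
        self.norm = nn.LayerNorm(channels)
        self.mamba = mamba_layer  # e.g., mamba_ssm.Mamba(d_model=channels)
        # PWConv: channel-wise fusion of the two branches; spatial size is unchanged.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        global_ = self.mamba(self.norm(tokens))          # (B, H*W, C)
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, global_], dim=1))

# Example with a trivial mixer standing in for Mamba (shapes only):
block = HCMBlock(64, nn.Identity())        # substitute mamba_ssm.Mamba(d_model=64) if installed
out = block(torch.randn(1, 64, 56, 56))    # -> (1, 64, 56, 56)
```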
Figure 3. The architecture of SCDA, which combines features from the encoder and decoder. These features are refined through channel and spatial attention modules to enhance model representation.
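One plausible realization of this fusion, assuming an SE-style channel attention followed by a CBAM-style spatial attention, is sketched below; the reduction ratio, kernel size, and fusion order are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SCDA(nn.Module):
    """Hypothetical skip-connection attention: concatenate encoder/decoder
    features, then apply channel attention followed by spatial attention."""

    def __init__(self, enc_ch: int, dec_ch: int, out_ch: int, reduction: int = 8):
        super().__init__()
        self.proj = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=1)
        # Channel attention: global average pool -> bottleneck MLP -> sigmoid gate.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // reduction, out_ch, 1),
            nn.Sigmoid(),
        )
        # Spatial attention: channel-wise mean/max maps -> 7x7 conv -> sigmoid gate.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        x = self.proj(torch.cat([enc, dec], dim=1))
        x = x * self.channel_att(x)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)
        return x * self.spatial_att(pooled)
```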
Figure 4. The figure presents segmentation results from the ISIC 2018 dataset, showcasing four representative models for comparison. In each visualization, the background is rendered in black, while the regions of interest are highlighted in white to clearly delineate the segmented structures.
Figure 5. Segmentation results on the Synapse multi-organ CT dataset. Different colors indicate distinct anatomical structures, providing a visual representation of the model’s organ-level segmentation performance.
Figure 6. Comparison of segmentation results on the ACDC dataset. The red region represents the right ventricle (RV), green corresponds to the myocardium (MYO), and blue indicates the left ventricle (LV).
Figure 7. ROC curves of different models on the ISIC 2018 dataset. The false positive rate (FPR) is the fraction of negative pixels incorrectly predicted as positive, and the true positive rate (TPR) is the fraction of positive pixels correctly identified.
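As a reference for how such curves are produced, the snippet below computes FPR/TPR pairs and the area under the curve with scikit-learn, using synthetic labels and scores purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Illustrative stand-ins for flattened ground-truth masks and predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)                           # 0 = background, 1 = lesion
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 10_000), 0, 1)

# FPR = FP / (FP + TN), TPR = TP / (TP + FN) at each decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {auc(fpr, tpr):.3f}")
```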
Figure 8. Comparison of segmentation performance (DSC) across multiple datasets using different models.
Table 1. Comparative experimental results on the ISIC 2018 dataset (best results are highlighted in Bold).
Model | mIoU (%) | DSC (%) | Acc (%) | Spe (%) | Sen (%)
U-Net [2] | 77.64 ± 0.73 | 87.42 ± 0.87 | 93.88 ± 0.95 | 96.32 ± 1.20 | 87.71 ± 1.44
U-Net++ [17] | 78.30 ± 0.90 | 87.63 ± 0.71 | 93.76 ± 0.78 | 95.19 ± 1.07 | 88.10 ± 1.23
Att-UNet [19] | 78.95 ± 1.17 | 87.91 ± 1.01 | 93.21 ± 1.15 | 96.23 ± 1.35 | 87.60 ± 1.53
SANet [42] | 79.47 ± 1.29 | 88.42 ± 1.11 | 94.29 ± 0.69 | 95.57 ± 1.42 | 89.46 ± 1.37
TransUNet [43] | 79.71 ± 1.20 | 88.52 ± 1.27 | 94.57 ± 1.41 | 96.05 ± 1.30 | 89.14 ± 1.19
TransFuse [44] | 80.66 ± 1.25 | 89.33 ± 1.28 | 93.66 ± 0.67 | 93.73 ± 1.17 | 90.78 ± 1.45
Swin-UNet [26] | 80.71 ± 1.13 | 89.66 ± 1.04 | 94.19 ± 0.78 | 95.41 ± 1.22 | 90.31 ± 1.10
VM-UNet [12] | 81.27 ± 0.79 | 89.67 ± 0.68 | 94.83 ± 0.63 | 96.13 ± 1.01 | 90.79 ± 0.98
HCMUNet (Ours) | 82.19 ± 0.62 | 90.32 ± 0.53 | 94.71 ± 0.70 | 96.19 ± 0.78 | 91.21 ± 0.80
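The metrics in Table 1 follow their standard confusion-matrix definitions; a compact reference implementation for binary masks, given here as an illustration of those definitions rather than the exact evaluation code used in the paper, is:

```python
import numpy as np

def binary_seg_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> dict:
    """IoU, Dice (DSC), accuracy, specificity, and sensitivity for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # true positives
    tn = np.sum(~pred & ~gt)    # true negatives
    fp = np.sum(pred & ~gt)     # false positives
    fn = np.sum(~pred & gt)     # false negatives
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "Spe": tn / (tn + fp + eps),
        "Sen": tp / (tp + fn + eps),
    }
```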
Table 2. Comparative experimental results on the Synapse dataset.
Model | DSC (%) | HD95 | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach
U-Net [2] | 75.92 ± 2.71 | 37.55 ± 4.80 | 87.39 ± 2.87 | 67.52 ± 3.11 | 78.72 ± 2.04 | 68.87 ± 3.19 | 92.45 ± 1.26 | 51.51 ± 3.60 | 86.09 ± 3.67 | 74.82 ± 2.08
Att-UNet [19] | 76.14 ± 3.32 | 33.51 ± 3.17 | 88.61 ± 2.12 | 66.40 ± 2.44 | 77.12 ± 1.49 | 71.07 ± 1.41 | 91.78 ± 2.43 | 55.01 ± 3.61 | 85.66 ± 2.72 | 73.47 ± 2.46
TransUNet [43] | 76.46 ± 1.80 | 29.32 ± 4.27 | 86.55 ± 1.69 | 61.65 ± 3.40 | 79.41 ± 2.41 | 76.30 ± 2.95 | 93.13 ± 1.33 | 55.30 ± 3.29 | 84.63 ± 3.43 | 74.72 ± 1.87
UCTransNet [45] | 78.24 ± 1.77 | 26.35 ± 1.38 | 86.52 ± 3.13 | 65.23 ± 2.05 | 80.69 ± 1.61 | 73.19 ± 2.58 | 93.05 ± 1.62 | 57.07 ± 2.65 | 87.55 ± 1.05 | 77.28 ± 2.48
LeViT-UNet [46] | 78.82 ± 3.99 | 18.89 ± 3.99 | 85.23 ± 3.73 | 66.32 ± 1.39 | 82.68 ± 3.33 | 78.13 ± 3.37 | 93.61 ± 1.80 | 60.95 ± 2.50 | 89.00 ± 1.56 | 74.62 ± 3.68
Swin-UNet [26] | 79.19 ± 2.07 | 22.07 ± 2.67 | 84.72 ± 2.61 | 66.60 ± 3.40 | 82.82 ± 1.19 | 79.41 ± 0.68 | 93.94 ± 0.75 | 59.49 ± 3.67 | 89.21 ± 2.31 | 77.36 ± 1.22
HiFormer [47] | 80.55 ± 2.19 | 15.63 ± 3.41 | 87.07 ± 2.24 | 66.67 ± 2.27 | 83.92 ± 2.24 | 81.09 ± 2.91 | 94.09 ± 1.84 | 60.17 ± 2.72 | 90.76 ± 1.74 | 80.61 ± 2.31
VM-UNet [12] | 80.47 ± 1.71 | 18.91 ± 2.42 | 86.84 ± 1.59 | 68.43 ± 4.13 | 85.04 ± 2.16 | 81.12 ± 2.67 | 94.11 ± 0.48 | 59.49 ± 1.78 | 87.77 ± 3.16 | 80.97 ± 2.33
EMCAD [48] | 81.16 ± 2.01 | 16.72 ± 4.44 | 85.24 ± 1.80 | 67.24 ± 4.64 | 87.62 ± 0.30 | 81.38 ± 0.91 | 94.67 ± 1.77 | 61.02 ± 3.96 | 91.81 ± 2.35 | 80.30 ± 1.93
HCMUNet (Ours) | 81.52 ± 1.14 | 17.83 ± 1.47 | 88.06 ± 1.63 | 69.60 ± 2.30 | 87.04 ± 0.67 | 82.35 ± 1.48 | 95.10 ± 3.71 | 59.24 ± 3.62 | 90.63 ± 1.54 | 80.76 ± 2.30
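Table 2 additionally reports HD95, the 95th-percentile Hausdorff distance between predicted and ground-truth boundaries. One common way to compute it, sketched below with SciPy distance transforms, is to pool the symmetric surface distances and take their 95th percentile; the exact implementation behind the reported numbers is not specified here, and the result is in voxels unless scaled by the voxel spacing.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile symmetric surface distance between two non-empty binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surf = pred ^ binary_erosion(pred)      # boundary voxels of the prediction
    gt_surf = gt ^ binary_erosion(gt)            # boundary voxels of the ground truth
    dist_to_gt = distance_transform_edt(~gt_surf)      # distance to nearest GT surface voxel
    dist_to_pred = distance_transform_edt(~pred_surf)  # distance to nearest predicted surface voxel
    distances = np.hstack([dist_to_gt[pred_surf], dist_to_pred[gt_surf]])
    return float(np.percentile(distances, 95))   # multiply by voxel spacing to obtain mm
```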
Table 3. Comparative experimental results on the ACDC dataset.
Model | DSC (%) | RV | MYO | LV
U-Net [2] | 87.84 ± 1.38 | 86.51 ± 1.32 | 84.66 ± 5.65 | 92.36 ± 3.92
Att-UNet [19] | 88.04 ± 2.02 | 86.70 ± 2.33 | 84.59 ± 8.96 | 92.83 ± 1.57
TransUNet [43] | 89.45 ± 1.13 | 87.85 ± 2.22 | 86.09 ± 2.23 | 94.39 ± 1.43
nnUNet [22] | 90.14 ± 1.83 | 88.58 ± 1.44 | 89.72 ± 1.18 | 92.13 ± 2.24
Swin-UNet [26] | 90.20 ± 0.78 | 87.44 ± 2.46 | 88.20 ± 3.40 | 94.97 ± 1.26
LeViT-UNet [46] | 90.42 ± 1.15 | 88.85 ± 1.52 | 88.29 ± 2.04 | 94.12 ± 1.88
HiFormer [47] | 90.69 ± 0.85 | 90.06 ± 1.30 | 89.00 ± 1.29 | 93.00 ± 1.07
VM-UNet [12] | 90.72 ± 0.77 | 90.61 ± 1.67 | 89.40 ± 1.88 | 92.14 ± 0.76
EMCAD [48] | 91.30 ± 0.35 | 90.17 ± 1.88 | 89.16 ± 1.24 | 94.55 ± 1.59
TransCASCADE [49] | 91.59 ± 0.21 | 90.12 ± 2.66 | 90.14 ± 0.68 | 94.51 ± 2.55
HCMUNet (Ours) | 92.11 ± 0.26 | 91.50 ± 0.78 | 90.20 ± 0.56 | 94.61 ± 1.12
Table 4. Effectiveness of different modules on HCMUNet.
Dataset | Mamba | Conv | SCDA | Paras (M) | DSC (%)
ISIC 2018 | | | | 44.27 | 89.67 ± 0.68
 | | | | 25.43 | 90.16 ± 0.59
 | | | | 55.11 | 90.03 ± 0.73
 | | | | 67.17 | 85.22 ± 1.21
 | | | | 36.27 | 90.32 ± 0.53
Synapse | | | | 44.27 | 80.47 ± 1.71
 | | | | 25.43 | 80.92 ± 1.44
 | | | | 55.11 | 80.67 ± 2.02
 | | | | 67.17 | 74.48 ± 3.13
 | | | | 36.27 | 81.52 ± 1.14
ACDC | | | | 44.27 | 90.72 ± 0.77
 | | | | 25.43 | 91.34 ± 0.55
 | | | | 55.11 | 90.88 ± 0.97
 | | | | 67.17 | 87.05 ± 1.35
 | | | | 36.27 | 92.11 ± 0.26
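The "Paras (M)" column reports trainable parameters in millions; for any PyTorch model this quantity can be checked with a small helper such as the following (an illustrative snippet, not part of the paper's code).

```python
import torch.nn as nn

def param_count_m(model: nn.Module) -> float:
    """Trainable parameters in millions, the usual convention behind a 'Params (M)' column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```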
Table 5. DSC performance (%) of HCMUNet under different training data proportions.
Dataset | 25% | 50% | 75% | 100%
ISIC 2018 | 87.15 ± 1.39 | 88.68 ± 0.71 | 89.73 ± 0.68 | 90.32 ± 0.53
Synapse | 75.35 ± 3.41 | 79.46 ± 1.85 | 81.08 ± 1.45 | 81.52 ± 1.14
ACDC | 88.95 ± 1.67 | 90.36 ± 0.74 | 91.68 ± 0.45 | 92.11 ± 0.26
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
