Hybrid Cascade and Dual-Path Adaptive Aggregation Network for Medical Image Segmentation

Junhong Ren; Sen Chen; Yange Sun; Huaping Guo; Yongqiang Tang; Wensheng Zhang

doi:10.3390/electronics14244879

,

and

¹

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

²

State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

³

School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China

^*

Authors to whom correspondence should be addressed.

Electronics2025, 14(24), 4879;https://doi.org/10.3390/electronics14244879

This article belongs to the Special Issue Advances in Artificial Intelligence and Computer Vision Based on Deep Learning

Version Notes

Order Reprints

Abstract

Deep learning methods based on convolutional neural networks (CNNs) and Mamba have advanced medical image segmentation, yet two challenges remain: (1) trade-off in feature extraction, where CNNs capture local details but miss global context, and Mamba captures global dependencies but overlooks fine structures, and (2) limited feature aggregation, as existing methods insufficiently integrate inter-layer common information and delta details, hindering robustness to subtle structures. To address these issues, we propose a hybrid cascade and dual-path adaptive aggregation network (HCDAA-Net). For feature extraction, we design a hybrid cascade structure (HCS) that alternately applies ResNet and Mamba modules, achieving a spatial balance between local detail preservation and global semantic modeling. We further employ a general channel-crossing attention mechanism to enhance feature expression, complementing this spatial modeling and accelerating convergence. For feature aggregation, we first propose correlation-aware aggregation (CAA) to model correlations among features of the same lesions or anatomical structures. Second, we develop a dual-path adaptive feature aggregation (DAFA) module: the common path captures stable cross-layer semantics and suppresses redundancy, while the delta path emphasizes subtle differences to strengthen the model’s sensitivity to fine details. Finally, we introduce a residual-gated visual state space module (RG-VSS), which dynamically modulates information flow via a convolution-enhanced residual gating mechanism to refine fused representations. Experiments on diverse datasets demonstrate that our HCDAA-Net outperforms some state-of-the-art (SOTA) approaches.

Keywords:

hybrid cascade structure; correlation-aware aggregation; dual-path adaptive feature aggregation; medical image

1. Introduction

Medical image segmentation is a fundamental task in medical image analysis, aiming to accurately delineate regions of interest, such as lesions or anatomical structures, from medical images [1]. The quality of segmentation directly influences the reliability of subsequent clinical decisions, including disease diagnosis, prognosis evaluation, and treatment planning. Although manual segmentation is considered the gold standard due to its high accuracy, it is time-consuming, labor-intensive, and highly dependent on expert knowledge, thereby limiting its scalability and consistency in clinical practice [2]. To address these limitations, automated medical image segmentation has attracted considerable research attention, aiming to improve segmentation efficiency while maintaining or enhancing accuracy [3].

In recent years, deep learning, particularly convolutional neural networks (CNNs), has achieved remarkable success in medical image segmentation. Notably, UNet [4] stands out as a representative model, owing to its strong generalization and feature extraction capabilities enabled by multiscale feature propagation and aggregation. Based on UNet, several improved variants have been proposed to further improve the model’s performance in medical image segmentation. For example, UNet++ [5] introduces nested skip pathways, which reduce the semantic gap between the feature extraction and reconstruction stages and enhance multiscale feature aggregation, thereby improving segmentation performance. Similarly, UNetv2 [6] introduces semantic information fusion module that leverages skip connections to achieve bidirectional aggregation of semantic and detailed features, significantly enhancing feature representation. Building on these foundations, UDTransNet [7] improves skip connections with attention-based recalibration, effectively narrowing the semantic gap. Unlike UDTransNet, UTANet [8] adopts task-adaptive skip connection strategy, flexibly selecting the information flow between the downsampling and upsampling paths to further improve medical image segmentation performance.

More recently, Mamba [9,10,11] and its variants have demonstrated impressive performance in vision tasks by modeling long-range dependencies using selective state space models (SSMs). Inspired by this ability, a series of Mamba-based architectures have been developed to address key challenges in medical image segmentation, including limited data, complex lesion regions or anatomical structures, and the need for efficient global context modeling. For example, UMamba [12] integrates Mamba modules with convolutional layers in a hybrid framework, using Mamba for global context modeling and CNNs for feature aggregation, achieving competitive segmentation results. In contrast, VMUNet employs Mamba modules for both feature extraction and aggregation, forming a fully SSM-based architecture with higher efficiency. Building on VMUNet, Swin-UMamba [13] utilizes ImageNet-pretrained weights for transfer learning in medical image segmentation, while H-vmunet [14] introduces a hierarchical channel-wise interaction mechanism within the Mamba module to suppress redundant information and improve feature representation purity.

These approaches above have demonstrated strong performance across diverse medical image segmentation tasks by learning rich representations from complex inputs. Despite architectural differences, their success fundamentally hinges on two key capabilities: extracting discriminative features that capture local details and/or global context and aggregating these features to produce accurate and coherent segmentation results.

(a): Feature Extraction: Feature extraction refers to the process of automatically generating informative representations from raw input data, with the goal of converting the original information into a compact and task-relevant form. In medical image segmentation, CNN-based methods extract multiscale features through stacked convolutions and downsampling, effectively capturing local structures, including edges and textures [4]. However, due to their limited receptive fields and strong inductive biases, CNNs struggle to model long-range dependencies and global semantics [3,14,15]. In contrast, Mamba-based methods leverage SSMs to capture global information [10,16], such as complete object contours, spatial relationships among multiple objects, and scene-level contextual patterns. However, their reliance on flattening 2D images into 1D sequences disrupts spatial continuity and weakens the ability to capture fine-grained local structures [17]. Consequently, both CNN- and Mamba-based methods face an inherent trade-off: CNNs retain spatial locality but struggle with global context, while Mamba captures global information but compromises local detail.
(b): Feature Aggregation: Feature aggregation combines features from multiple levels or scales to produce richer representations, improving model performance [4]. CNN- and Mamba-based models typically use skip connections to fuse high-resolution shallow features with deep semantic features, mitigating fine-grained information loss caused by downsampling [8]. However, most aggregation strategies rely on simple operations, such as addition or concatenation with fixed weights, which do not adapt to varying inter-layer content [18]. As a result, these methods inadequately capture commonalities and subtle differences across layers, compromising feature stability and representation quality. The problem is further amplified by semantic gaps and redundant noise in features from different layers, particularly under challenging conditions such as blurred boundaries, complex structures, or small lesions [8]. Therefore, inflexible aggregation limits the model’s ability to capture fine structural details, highlighting the need for adaptive methods that explicitly account for inter-layer similarities and fine-grained variations to enhance both stability and boundary sensitivity.

To address issues existing in feature extraction and aggregation, as discussed above, we propose a hybrid cascade and dual-path adaptive aggregation network (HCDAA-Net) for medical image segmentation. For feature extraction, we design a hybrid cascade structure (HCS) that alternately stacks ResNet [19] and Mamba modules, achieving a spatial balance between short-range detail extraction and long-range semantic modeling. To complement this spatial modeling, we introduce a channel-crossing attention mechanism (CCAM) [20], which strengthens feature representations and accelerates convergence through inter-channel interactions. For feature aggregation, we first design a correlation-aware aggregation (CAA) module, which captures the global correlations among features of the same class to address the feature representation inconsistency caused by variations in size and morphology, improving the segmentation accuracy of critical structures. We then propose the dual-path adaptive feature aggregation (DAFA) module with two branches: the common path (CP) and delta path (DP). CP employs a sum operation and channel attention to learn common semantics across scales and suppress redundancy, whereas DP adopts a difference operation to highlight fine-grained variations and enhance local lesion extraction. By integrating both paths, DAFA jointly achieves robust representation of common features and precise perception of discrepancies, thereby strengthening multiscale feature fusion. Finally, we propose a novel residual-gated visual state space (RG-VSS) module to further refine the fused features by dynamically modulating information flow through a convolution-enhanced residual gating mechanism.

The main contributions of this paper are as follows:

We design a novel HCDAA-Net for medical image segmentation. We employ an HCS that alternately integrates ResNet and Mamba modules to balance short-range detail extraction and long-range semantic modeling, followed by CCAM for channel-wise feature enhancement.
We propose a CAA to capture global correlations among same-class features to reduce feature inconsistencies from shape and scale variations, enhancing critical structure segmentation.
We introduce a DAFA to integrate a common path for shared semantics with delta path for highlighting fine-grained differences, enabling robust and precise multiscale feature fusion.
We develop a RG-VSS that dynamically modulates information flow through a convolution-enhanced residual gating mechanism to refine feature representations.

Experiments on four public datasets, covering various anatomical and pathological segmentation tasks, demonstrate that HCDAA-Net achieves SOTA performance.

The structure of this paper is as follows: Section 2 reviews related research progress. Section 3 details the proposed HCDAA-Net architecture. Section 4 presents experimental results. Section 5 concludes with the main contributions.

2. Related Work

In recent years, deep learning has made remarkable progress in medical image segmentation, and existing methods for this task can be mainly grouped into three categories: CNN-, Transformer-, and Mamba-based methods.

2.1. CNN-Based Methods

Many CNN-based methods have been proposed for medical image segmentation owing to their strong capabilities in feature extraction and aggregation. Among them, UNet [4] and its variants including ResUNet [21], ResUNet++ [22] and DRU-Net [23], improve feature representation by introducing residual or dense connectivity mechanisms, facilitating deeper training and effective multiscale feature aggregation. Motivated by these advances, segmentation networks for medical image analysis have further evolved in recent years, increasingly adopting advanced strategies such as attention and multiscale modeling. For example, CSCA U-Net [24] incorporates channel-spatial composite attention modules to enhance critical lesion feature representation, whereas EMCAD [25] employs channel, spatial, and grouped gated attention mechanisms with multiscale depthwise convolutions to capture intricate spatial relationships and highlight salient regions. LGCE-Net [26] captures local features via dense atrous convolution attention and incorporates global context using spatial grid attention, resulting in precise and efficient medical image segmentation. Despite their strong capability in capturing local features and hierarchical representations, the intrinsic locality of CNNs limits their ability to model long-range dependencies.

2.2. Transformer-Based Methods

Transformer [27] has recently been adopted in medical image segmentation, leveraging self-attention to capture long-range dependencies and global contextual information. Among the representative architectures, Polyp-PVT [28] leverages pure vision transformer (ViT) layers to extract multiscale features, allowing the model to learn more powerful and robust representations of polyp characteristics. Unlike Polyp-PVT, TransUNet [15] combines ViT layers with CNN layers, with the former modeling high-level global context and the latter capturing fine-grained local features, enabling general medical image feature extraction. To overcome the inefficiency caused by the quadratic complexity of ViT’s self-attention, Swin-UNet [29] introduces cross-window interactions through a shifted window mechanism, and CSWin-UNet [30] applies horizontal and vertical strip-based self-attention, enabling fast long-range dependency modeling. Despite these advances, the quadratic computational and memory complexity of self-attention with respect to input resolution remains a significant bottleneck for high-resolution medical image.

2.3. Mamba-Based Methods

Mamba-based architectures [11] have emerged as a promising approach for medical image segmentation, employing selective state-space model (SSM) mechanisms to model long-range dependencies with linear computational complexity. VMUNet [14] uses Mamba modules as feature extractors to capture multiscale features with long-range dependencies, while CCViM [31] combines Mamba modules with context clustering to capture local-global multiscale lesion features. ASP-VMUNet [32] further incorporates atrous scanning mechanism into Mamba modules to enlarge the receptive field and reduce background interference, enhancing lesion feature extraction. Moreover, HMT-UNet [33] employs a hybrid structure, using Mamba modules for modeling long-range dependencies in shallow layers and a Transformer for extracting global context in deeper layers. Although these methods effectively capture long-range dependencies, they often fail to fully capture fine-grained local features, limiting segmentation accuracy for small lesions.

3. Methods

3.1. Overall Architecture

We propose the HCDAA-Net architecture for medical image segmentation, as shown in Figure 1, which employs an HCS for feature extraction. In particular, we used ResNet modules to preserve local fine-grained details while employing Mamba modules to capture global semantic information. Unlike most previous approaches that apply ResNet and Mamba modules in parallel, HCS alternately stacks them to form a multi-level cascade, enabling progressive integration of local and global features across spatial dimensions. To complement this spatial feature extraction, the CCAM was applied prior to transferring features from the feature extraction side to the feature aggregation side, enhancing inter-channel information flow and providing a comprehensive representation of both spatial and channel-wise features. In this way, we obtain refined feature blocks that facilitate faster convergence of the model. We represent the four refined feature outputs from the feature extraction side as

{f_{i}^{'}}_{i = 1}^{4}

, where i denotes the i-th layer. Afterward, the features

{f_{i}^{'}}_{i = 1}^{4}

were fed into the feature aggregation stage, as shown in Figure 1b. We first applied a CAA on the fourth-layer feature

f_{4}^{'}

to enhance the autocorrelation among features belonging to the same anatomical structure or lesion within the global semantic context. This layer was chosen because it provides the optimal balance between semantic richness and computational efficiency for modeling global correlations, avoiding both the noise and excessive computational cost associated with lower-layer features. The resulting feature was then processed by DAFA, a dual-path module that hierarchically fuses multiscale features by jointly capturing common and distinctive information. RG-VSS was subsequently employed to refine the fused features, leveraging the SSM mechanism and residual-gated block (RGB) to incorporate global and local information, respectively. Finally, we use the large kernel patch expanding (LKPE) [34] to upsample the features and enhance the representation of lesion characteristics.

Figure 1. Overall architecture of our HCDAA-Net, where HCS indicates the hybrid cascade structure.

3.2. Feature Extraction

CNN-based backbones, such as ResNet [19], are widely used in medical image analysis for their ability to capture local spatial details, yet their inherently limited receptive fields restrict the modeling of long-range dependencies and global semantic relationships. Conversely, Mamba-based backbones, including VMamba [16], Vim [35], EfficientVMamba [17], LocalMamba [36], and MambaVision [37], excel at capturing global context via SSM, complementing CNNs. However, Mamba modules inadequately preserve fine-grained local details, potentially causing the loss of lesion boundaries or small structural features [32].

To address these limitations, we propose a hybrid cascade structure (HCS), illustrated in Figure 1c. In contrast to prior methods that extract features using CNN- and Mamba-based backbones in parallel and then fuse them, HCS alternately stacks CNN- and Mamba-based blocks, enabling progressive integration of local and global features across spatial scales. Let

x \in R^{H \times W \times C}

be an input image with spatial dimensions

H \times W

and C channels. The HCS feature extraction process is formulated as

f_{i} = \{\begin{matrix} VSS (RB (PM (x))), if i = 1, \\ VSS (RB (PM (f_{i - 1}))), if i \in {2, 3, 4}, \end{matrix}

(1)

where PM denotes patch merging module [3], while RB and VSS correspond to the ResNet and VMamba base blocks, respectively. We used PM rather than convolutions for downsampling, aiming to preserving more visual clues.

HCS primarily focuses on extracting multiscale features from the spatial perspective, as discussed above. To achieve a more comprehensive representation that integrates both spatial and channel-wise characteristics, the CCAM (as shown in Figure 2) was introduced before features are passed from the extraction side to the aggregation side. Unlike traditional channel attention, which merely reallocates channel weights along the channel dimension, CCAM explicitly captures inter-channel dependencies and reinforces the joint spatial–channel representation, resulting in more discriminative features. Formally, the feature set

{f_{i}}_{i = 1}^{4}

was processed by the refinement blocks, i.e., CCAMs, to yield the refined feature set

{f_{i}^{'}}_{i = 1}^{4}

:

f_{i}^{'} = {CCAM}_{i} (f_{i}), i \in {1, 2, 3, 4} .

(2)

Specifically, each feature

f_{i}

was first split into a token sequence and then processed by the Q-Shift operation [20] to allow spatial interactions among neighboring tokens, followed by linear projections to generate K and V, which encode channel-wise correlations:

\begin{matrix} q & = & S p l i t (f_{i}), \\ K & = & Q - S h i f t (q) W_{k}, \\ V & = & Q - S h i f t (q) W_{v}, \end{matrix}

(3)

where

W_{k}

and

W_{v}

denote the linear projection matrices. Subsequently, K underwent squaredrelu (SR) activation and was multiplied element-wise with the sigmoid-activated V, effectively modulating each channel based on its correlations with other channels while preserving spatial locality. The resulting feature was then normalized, enhanced with a residual connection, and reshaped to produce the output:

\begin{matrix} f_{i}^{'} = R S (N o r m (σ (V) ⊙ S R (K)) + q), \end{matrix}

(4)

where RS denotes reshape operation, and

σ (\cdot)

denotes sigmoid function. Finally,

{f_{i}^{'}}_{i = 1}^{4}

was input into the aggregation stage to undergo subsequent processing and integration.

Figure 2. Illustration of the proposed CCAM.

3.3. Feature Aggretation

Feature extraction streams inherently provide complementary information in medical image segmentation, with high-level features capturing semantic representations and low-level features preserving abundant spatial details. Therefore, we first designed a correlation-aware aggregation (CAA) module to capture global semantic dependencies, followed by an LKPE for upsampling. We then proposed a dual-path adaptive feature aggregation (DAFA) module to effectively integrate multi-level features and a residual-gated visual state space (RG-VSS) module to suppress irrelevant activations, followed by another LKPE for upsampling to strengthen feature representations. Formally,

f_{i}^{″} = \{\begin{matrix} Up (CAA (f_{i}^{'})), & if i = 4, \\ Up (RG-VSS (DAFA (f_{i}^{'}, f_{i + 1}^{″}))), & if i \in {1, 2, 3}, \end{matrix}

(5)

where

Up (\cdot)

denotes the upsampling operation implemented via LKPE, which was chosen for its ability to enhance lesion feature representation by integrating spatial and channel information.

3.3.1. Correlation-Aware Aggregation (CAA)

In medical image segmentation, similar lesions or anatomical structures may exhibit variations in pixel-level features across different regions. The network’s ability to learn these similar regions can be influenced by their size or morphology, resulting in inconsistent feature representations. Therefore, we proposed a correlation-aware aggregation (CAA) module as shown in Figure 3, which captures the global correlations among features of the same lesions or anatomical structures, improving the model’s ability to accurately segment critical structures.

Figure 3. Illustration of the proposed CAA.

In CAA, the input feature

f_{4}^{'} \in R^{h \times w \times c}

was reshaped along the spatial dimensions into a sequence

{\hat{f}}_{4}^{'} \in R^{h w \times c}

, treating each spatial position as an individual token for correlation modeling, and then a linear projection was applied to generate the packed feature matrix:

q k v = {\hat{f}}_{4}^{'} w_{q k v} + b_{q k v}, q k v \in R^{h w \times 3 c},

(6)

where

w_{q k v} \in R^{c \times 3 c}

is the packed projection matrix that simultaneously generates the concatenated representations of q, k, and v, and

b_{q k v} \in R^{3 c}

is the bias term.

q k v

was reshaped and split along the channel dimension:

q, k, v = Split (Reshape (q k v)), q, k, v \in R^{t \times h w \times d},

(7)

where

t = 8

is the number of attention heads, and

d = 768

is the dimension of each head, which balances the model’s representational capacity and computational efficiency.

To improve the spatial consistency of the similar lesions or anatomical structures across different positions, we first computed the similarity between vectors in the correlation-aware feature matrices q and k via matrix multiplication. The resulting similarity matrix was then normalized using a softmax function to obtain a set of weights encoding global pixel correlations. These weights were applied to correlation-aware feature matrix v to guide the construction of a new feature representation. After this reconstruction, the spatial correlations of the original feature were effectively embedded into the newly generated feature matrix:

f_{a t t} = Reshape (softmax (q \otimes k^{⊤}) \otimes v),

(8)

where

f_{a t t} \in R^{h \times w \times c}

is a global correlation feature, and ⊗ is the matrix multiplication. To obtain more robust weights, we further process

f_{a t t}

through a CBR module, which refines the correlation-aware features and stabilizes their representations. Then, we apply a simple sigmoid activation function to generate the weight with rich information:

f_{w e i g h t} = σ (CBR (f_{a t t})),

(9)

where

σ (\cdot)

denotes the sigmoid activation function. Finally, the pooled features were reweighted using the

f_{w e i g h t}

and subsequently aggregated through another CBR, which refines and stabilizes the fused representations, to generate the correlation-aware enhanced output feature:

\begin{matrix} f_{g m p} & = f_{weight} ⊙ GMP (f_{4}^{'}), \\ f_{g a p} & = f_{weight} ⊙ GAP (f_{4}^{'}), \\ f_{4}^{g l b} & = CBR (f_{g m p} © f_{g a p}), \end{matrix}

(10)

where GMP and GAP provide complementary salient and contextual information, and ⊙ and © denote the element-wise multiplication and concatentaion along channel dimension.

3.3.2. Dual-Path Adaptive Feature Aggregation (DAFA)

Segmentation accuracy can be further improved by inter-layer feature aggregation, which enables the integration of complementary semantic and spatial information [4,18]. Conventional aggregation strategies often rely on lightweight element-wise addition to combine multiscale features, efficiently emphasizing shared information among them [3,4,33,38]. Nevertheless, when feature streams conflict or differ in importance, this operation may suppress critical cues and fails to adaptively regulate their contributions, thereby limiting the modeling of fine-grained structures [18]. To address this issue, we propose the DAFA module as shown in Figure 4, which adaptively aggregates features across layers to enhance sensitivity to structural variations. DAFA was designed with a commonality–diversity collaborative learning mechanism that explicitly models two complementary pathways: the common path (CP) and the delta path (DP), enabling more effective feature fusion. Specifically, given a low-level feature

f_{i}^{'}

and a high-level feature

f_{i + 1}^{″}

(Figure 1), DAFA applies independent

3 \times 3

convolutions to them, producing enhanced feature representations

{\hat{f}}_{i}

and

{\hat{f}}_{i + 1}

that preserve local spatial details.

Figure 4. Illustration of the proposed DAFA.

CP was designed to capture inter-layer common information through an element-wise addition operation:

f_{c p} = {\hat{f}}_{i} + {\hat{f}}_{i + 1} .

(11)

f_{c p}

was subsequently passed through a channel attention (CA) module [39] to reinforce critical feature representations and suppress noise, followed by a residual connection to facilitate gradient flow and preserve original feature information:

f_{c p}^{c a} = CA (f_{c p}) \oplus f_{c p},

(12)

where ⊕ denotes the element-wise addition.

In parallel, DP aims to capture inter-layer delta details utilizing element-wise subtraction operation, highlighting subtle variations and enhancing sensitivity to fine-grained structures:

f_{d p} = {\hat{f}}_{i} - {\hat{f}}_{i + 1} .

(13)

Similarly to CP,

f_{d p}

was then passed through a CA, followed by a residual connection:

f_{d p}^{c a} = CA (f_{d p}) \oplus f_{d p} .

(14)

Finally, DAFA dynamically aggregate the commonality and diversity features using learnable weights

α

and

β

, enabling the model to automatically balance the robustness of common features and the fine-grained perception of distinctive features. The adaptive aggregation is formally expressed as follows:

f_{i}^{a g g} = α \cdot f_{c p}^{c a} + β \cdot f_{d p}^{c a} .

(15)

In this way, DAFA effectively captures and aggregates both common patterns and distinctive details across feature layers, boosting semantic representation and fine-grained structure modeling.

3.3.3. Residual-Gated Visual State Space Module (RG-VSS)

To fully leverage the DAFA-enhanced features, we design the RG-VSS module to capture global correlations across spatial and channel dimensions, while extracting local details such as edges and textures, as illustrated in Figure 5. RG-VSS is composed of two consecutive newly residual blocks, namely, SSMM and CERG. SSMM incorporates an SSM-based mechanism to model global spatial correlations among features, whereas CERG employs a convolution-enhanced residual gating scheme to capture global channel correlations and reinforce local spatial information.

Figure 5. Illustration of the proposed RG-VSS.

SSMM takes the aggregated feature

f_{i}^{a g g}

from DAFA as input and uses layer normalization (LN) to standardize

f_{i}^{a g g}

, followed by a

1 \times 1

convolutional linear projection (LP) to ensure numerical stability, producing

{\hat{f}}_{p r e} \in R^{H_{i} \times W_{i} \times 2 C}

:

{\hat{f}}_{p r e} = LP (LN (f_{i}^{a g g})) .

(16)

{\hat{f}}_{p r e}

was then fed into the SS2D module [16], followed by LN and a short-cut connection:

f_{p r e} = LP (SS 2 D ({\hat{f}}_{p r e})) + f_{i}^{a g g} .

(17)

SS2D consists of four consecutive blocks: a depthwise separable convolution (DWConv), silu activation, a scan block and LN. Formally,

\begin{matrix} f_{s i l u} & = SiLU (DWConv (f_{p r e})), \\ f_{g l o b a l} & = LN (Scan (f_{s i l u})), \end{matrix}

(18)

where DWConv aims to capture localized patterns and fine-grained structural details. For the scan block, we split the feature

f_{s i l u}

into four sequences along symmetric directions (top-down, bottom-up, left-right, and right-left), as shown in Figure 6. Formally,

\begin{matrix} f_{t d}, f_{b u}, f_{l r}, f_{r l} = C r o s s S c a n (f_{s i l u}) . \end{matrix}

(19)

Each sequence was processed through the S6 block [10] to model the autocorrelation of lesion or anatomical regions:

f_{t d}^{'}, f_{b u}^{'}, f_{l r}^{'}, f_{r l}^{'} = S 6 (f_{t d}, f_{b u}, f_{l r}, f_{r l}) .

(20)

The outputs from the four directions are cross-merged, producing globally correlations and direction-aware feature representations:

f_{s c a n}^{'} = C r o s s M e r g e (f_{t d}^{'}, f_{b u}^{'}, f_{l r}^{'}, f_{r l}^{'}) .

(21)

Figure 6. Illustration of the core operations in the scan block.

CERG takes the feature

f_{g l o b a l}

as input, applies LP to remap the features for enhanced representation and then processes it through a residual-gated block (RGB) with a shortcutpath:

\begin{matrix} \hat{f} & = LP (f_{g l o b a l}), \\ f_{i}^{r g} & = RGB (\hat{f}) + f_{g l o b a l} . \end{matrix}

(22)

Within the RGB, the feature

\hat{f}

processed by

1 \times 1

convolution was evenly split along the channel axis into two branches,

{\hat{f}}_{1}

and

{\hat{f}}_{2}

:

{\hat{f}}_{1}, {\hat{f}}_{2} = Split (Conv 2 d (\hat{f})) .

(23)

The

{\hat{f}}_{2}

branch employed a DWConv to enhance local detail representations under a low computational overhead, while a residual connection was integrated into this pathway to facilitate efficient gradient propagation and ensure stable optimization:

f_{d w} = DWConv ({\hat{f}}_{2}) + {\hat{f}}_{2} .

(24)

To modulate information flow, RGB applies a GELU activation to

f_{d w}

as a nonlinear gating function. The resulting feature was subsequently multiplied element-wise with

{\hat{f}}_{1}

, acting as a modulation step to highlight informative patterns while suppressing redundancy. The gated output was further refined through a

1 \times 1

convolution to enable channel-wise interaction and was finally fused with the residual path:

f_{r g b} = Conv 2 d ({\hat{f}}_{1} ⊙ GELU (f_{d w})) + \hat{f} .

(25)

By incorporating both convolutional operations and gating mechanisms, the RGB preserves spatial structure, captures neighborhood-level dependencies, and enhances sensitivity to fine-grained features.

3.4. Loss Function

For multi-class segmentation, a hybrid loss function was adopted by combining Dice loss and cross-entropy (CE) loss, formulated as follows:

\begin{matrix} L_{t o t a l} & = α \times L_{d i c e} + (1 - α) \times L_{c e}, \end{matrix}

(26)

where

α

is the weighting coefficient. For binary segmentation tasks, we employ the binary cross-entropy (BCE) and dice loss, expressed as follows:

\begin{matrix} \begin{matrix} L_{b c e d i c e} & = L_{b c e} + L_{d i c e} \end{matrix} \end{matrix}

(27)

3.4.1. Datasets

We evaluated the performance of our HCDAA-Net on four benchmark datasets, including Synapse [31], ACDC [40], ISIC18 [41] and Glas [7].

Synapse dataset consists of 30 abdominal CT scans with a total of 3779 contrast-enhanced axial slices. Following prior studies [15,29], we randomly partition the dataset into 18 cases for training and 12 cases for testing. The segmentation task focuses on eight abdominal organs: aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach.

ACDC contains 100 cardiac MRI scans, each annotated with three anatomical structures: right ventricle (RV), myocardium (Myo), and left ventricle (LV). Following Ref. [15], we split the dataset into 70 cases for training, 10 for validation, and 20 for testing.

ISIC 18 consists of 2694 dermoscopic images with corresponding segmentation masks. Following Ref. [41], we split the dataset into a 70% training set (1886 images) and 30% test set (808 images).

Glas comprises 165 H&E-stained histological images, with 85 used for training and the remaining 80 for testing.

3.4.2. Implementation Details

Our HCDAA-Net was implemented in PyTorch 2.1.0 and trained on a NVIDIA A100 80G GPU. AdamW [42] was used to train our HCDAA-Net for up to 300 epochs with a cosine-annealed learning rate initialized at

5 \times 10^{- 4}

and batch sizes of 32 for Synapse, ACDC, and ISIC18, and 8 for Glas. To mitigate overfitting, training images were augmented with flips, rotations, Gaussian noise, Gaussian blur, and contrast adjustments. In addition, we resized ISIC18 images to

256 \times 256

and Synapse, ACDC, and Glas images to

224 \times 224

.

4. Experiments

4.1. Experimental Setup

Evaluation Metrics

To evaluate the performance of our HCDAA-Net, we use the dice similarity coefficient (DSC) [7] and 95% hausdorff distance (HD95) [7] as evaluation metrics for multi-class segmentation tasks, specifically on the Synapse and ACDC datasets, and mean intersection over union (mIoU), DSC, accuracy (Acc), specificity (Spe), and sensitivity (Sen) [14] for binary segmentation tasks, specifically on the ISIC18 and Glas datasets.

4.2. Quantitative Comparison

To validate the effectiveness of our HCDAA-Net, we select sixteen advanced approaches for comparison on Synapse, ACDC, ISIC18 and Glas datasets, which are categorized into four groups: two CNN-based methods including UNet [4] and MALUNet [43]; nine Transformer-based methods including Att-UNet [44], TransUNet [15], PVT-CASCADE [44], TransCASCADE [44], Swin-UNet [29], CoTransUNet [45], Swin-IBNet [46], CSWin-UNet [30] and PARF-Net [1]; one RWKV-based approach including RWKV-UNet [20]; and four Mamba-based approaches including Swin-UMamba [13], H-vmunet [14], VM-UNet [3] and MSVM-UNet [34]. Table 1, Table 2, Table 3 and Table 4 show the corresponding results of these methods on the four datasets, respectively. To ensure fairness, the results are retrained by running the official codes with default parameter settings or obtained from published articles. “-” denotes that the corresponding data is not available or not provided in the paper.

Table 1. Performance comparison with SOTA methods on the Synapse dataset. Bold black data indicates the best results. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Table 2. Performance comparison with SOTA methods on the ACDC dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

Table 3. Performance comparison with SOTA methods on the ISIC18 dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

Table 4. Performance comparison with SOTA methods on the Glas dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

From Table 1, our HCDAA-Net achieves the overall best performance, with a DSC of 86.94% and HD95 of 12.66 mm. In terms of organ categories, our HCDAA-Net achieves the highest DSC on five organs, namely, the aorta, gallbladder, kidney (L), pancreas, and stomach. Notably, HCDAA-Net attains a DSC of 95.82% on the small pancreas, demonstrating excellent fine-grained recognition capability. For Kidney (R), our HCDAA-Net achieves competitive result (77.24%). With respect to large organs, including Liver and Spleen, HCDAA-Net is significantly outperformed by some methods such as MSVM-UNet. This result indicates that HCDAA-Net’s performance on large targets remains limited, resulting in incomplete contours and structural errors.

Table 2 shows that HCDAA-Net achieves the best DSC (92.73%), demonstrating its superiority in capturing fine-grained structures and enhancing overall segmentation accuracy. In terms of the RV, Myo, and LV organs, HCDAA-Net achieves DSCs of 91.26%, 90.70%, and 96.25%, respectively, and ranks second for RV and Myo and third for LV among the compared methods. These results indicate that, although HCDAA-Net may not always achieve the top performance on every individual organ, it maintains highly competitive results across all categories, reflecting its robustness and balanced capability. Similar results are observed from Table 3 and Table 4. Specifically, Table 3 represents that HCDAA-Net achieves the best overall performance on the ISIC18 dataset, with a mIoU of 81.49%, DSC of 89.80%, and ACC of 95.01%. In addition, HCDAA-Net ranks second in the Spe metric with 96.54%, only behind CSWin-UNet, and ranks third in the Sen metric with 90.25%. Table 4 shows that our method ranks first with a mIoU of 84.63% and DSC of 91.18%, significantly surpassing traditional methods such as Att-UNet and UNet.

4.3. Visual Comparison

We conduct visual comparisons between HCDAA-Net and nine SOTA methods, including UNet [4], TransUNet [15], Swin-UMamba [13], H-vmunet [14], VM-UNet [3], MSVM-UNet [34], CSwin-UNet [30], RWKV-UNet [20], and PARF-Net [1], as shown in Figure 7, Figure 8, Figure 9 and Figure 10.

Figure 7. Visual comparison of different methods on the Synapse dataset.

Figure 8. Visual comparison of different methods on the ACDC dataset.

Figure 9. Visual comparison of different methods on the ISIC18 dataset.

Figure 10. Visual comparison of different methods on the Glas dataset.

From Figure 7, HCDAA-Net consistently achieves superior visual performance in multi-organ segmentation on the Synapse dataset. Compared with other methods, HCDAA-Net effectively suppresses over- and under-segmentation while avoiding the mis-segmented regions observed in VM-UNet, CSwin-UNet, and PARF-UNet (row 1). Particularly, HCDAA-Net yields more accurate delineation of challenging organs such as the stomach (row 2 and 3) and pancreas (row 3) and achieves nearly ideal segmentation of the gallbladder (row 4). Although minor deviations remain in some large organs (e.g., the liver (row 2)), the overall boundaries generated by HCDAA-Net are more continuous and anatomically consistent, highlighting its strength in fine-grained structure preservation.

Similarly, as shown in Figure 8, Figure 9 and Figure 10, HCDAA-Net consistently delivers superior segmentation accuracy and robustness across diverse datasets. On the ACDC dataset (Figure 8), HCDAA-Net produces results that closely match the ground truth, outperforming nine SOTA methods. For the right ventricle (row 1 and 2), HCDAA-Net achieves highly accurate delineation, while methods such as CSwin-UNet suffer from under-segmentation. For the myocardium and left ventricle (row 3), H-vmunet, VM-UNet, CSwin-UNet, and RWKV-UNet generate extensive errors. In contrast, HCDAA-Net provides anatomically consistent masks. Even in row 4, although all methods under-segment the right ventricle, HCDAA-Net still maintains superior performance. On the ISIC18 dataset (Figure 9), HCDAA-Net precisely captures complex lesion contours (row 1) and fine-grained structures (row 2 and 3), avoiding the blurring and mis-segmentation problems common to UNet, TransUNet, VM-UNet, and MSVM-UNet. In challenging cases (row 4), where most methods suffer from over- or under-segmentation, HCDAA-Net still produces masks closely aligned with the ground truth. On the GlaS dataset (Figure 10), UNet and CSWin-UNet show over-segmentation, while TransUNet and VM-UNet exhibit under-segmentation (rows 3 and 4). MSVM-UNet and PARF-Net preserve some structural details but remain prone to errors across rows 1 to 4. In contrast, HCDAA-Net consistently achieves results that are both precise and robust, highlighting its strong generalization capability across various medical image segmentation tasks.

4.4. Ablation Studies

We conduct comprehensive ablation studies on the Synapse dataset to validate the effectiveness of key components of our HCDAA-Net.

4.4.1. Structure Ablation

Table 5 presents the ablation results of HCDAA-Net on the Synapse dataset, showing the impact of key components on segmentation performance. The baseline denotes the model without any trick of our HCDAA-Net. According to Table 5, sequentially incorporating the HCS, CCAM, and CAA into the baseline yields consistent performance gains, validating their roles in balancing feature extraction, representation refinement, and global correlation modeling, respectively. Incorporating DAFA further enhances boundary precision, reducing HD95 to 12.69 mm and increasing DSC to 86.53%. Finally, with the addition of RG-VSS, feature refinement is further improved, yielding 86.94% DSC and 12.66 mm HD95.

Table 5. Ablation studies on the Synapse dataset. ‘✓’ indicates that the corresponding module is used. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Figure 11 presents the stage-wise feature map visualization of HCDAA-Net. The HCS extracts multiscale features, which are further refined by CAAM from the channel perspective. For feature aggregation, CAA enhances global correlations in the highest-level feature (row 4), DAFA integrates common and delta paths to achieve multiscale fusion, and RG-VSS subsequently modulates information flow to further refine the representations. The visualizations in Figure 11 are consistent with the quantitative results in Table 5. In particular, CCAM improves boundary sharpness and captures fine structures in HCS-extracted features, while CAA strengthens target-region correlations in the fourth-layer feature and reduces background interference. Moreover, in rows 1–3, DAFA highlights edges and preserves fine details through low- and high-level feature fusion, and RG-VSS further enhances both global and local information, especially in row 1.

Figure 11. Visualization of feature maps from each stage of our HCDAA-Net, highlighting the effects of the HCS, CCAM, CAA, DAFA and RG-VSS.

4.4.2. Ablation Study on Attention Modules

To evaluate the effectiveness of the CAA module relative to the conventional Self-Attention module [27], we conduct ablation studies, as shown in Table 6. When adopting the CAA module, the DSC increases by 0.26% and the HD95 decreases by 0.55 mm compared to the conventional Self-Attention module, indicating improvements in both overall segmentation accuracy and boundary precision. Meanwhile, the number of parameters, FLOPs, and latency increase by 1.41 M, 0.14 G, and 0.02 ms, respectively, indicating only a slight increase. Overall, these results show that the CAA module effectively captures feature correlations within the same lesions or anatomical structures, improving segmentation of critical regions while adding only slight computational overhead.

Table 6. Ablation study of self-attention and CAA modules on the synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

4.4.3. Loss Ablation

We investigate the influence of the weighting parameter

α

in the combined loss function (Equation (26)). As reported in Table 7, our HCDAA-Net achieves the best performance at

α = 0.4

with a DSC of 86.94% and an HD95 of 12.66 mm. Both smaller and larger values lead to degraded accuracy and boundary precision, indicating that an appropriate balance of loss components is essential for optimal segmentation.

Table 7. Ablation studies on the parameter

α

in the loss function. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

4.4.4. Backbone Network Effectiveness

We conduct an ablation study on different backbone architectures, including ResNet34, VMamba and an HCS (ours), each evaluated with and without pretrained weights. As shown Table 8, without pretrained weights, the model using an HCS as the backbone achieves the best performance, yielding a DSC of 84.67% and an HD95 of 19.70 mm, outperforming its counterparts with ResNet34 and VMamba. With pretrained weights, all variants improve, and the HCS-based model remains the top performer (DSC 86.94%, HD95 12.66 mm), demonstrating the effectiveness of the HCS backbone for accurate boundary delineation.

Table 8. Ablation studies on backbone architectures: ResNet34, VMamba, and HCS (ours). ‘-’ denotes backbone without pretrained weights, and ‘✓’ denotes backbone with pretrained weights. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

The quantitative results above are consistent with the visualizations in Figure 12. ResNet34 shows strong edge responses in the low layers (Layers 1 and 2) but exhibits reduced structural clarity in the high layers (Layers 3 and 4), indicating limited capability for global context modeling. VMamba effectively captures global structures in the high layers, whereas its low layers provide weaker boundary and texture representation. In contrast, the HCS backbone combines the strengths of both networks, yielding more balanced feature representations. The low layers of the HCS preserve fine edges and textures, and the high layers accurately capture global semantics, resulting in the feature of layer 4 that closely matches the ground truth boundaries and provides favorable local and global feature representation.

Figure 12. Visualization of feature maps from four layers using three backbone networks: ResNet34 (row 1), VMamba (row 2), and HCS (row 3). Orange boxes denote the final segmentation results.

4.4.5. Different Upsampling Method Effectiveness

To evaluate the impact of different upsampling methods, we perform ablation studies on the Synapse dataset, comparing the original decoder with transposed convolution, the upsample block [44], the patch expanding layer [29], and the LKPE layer. As shown in Table 9, the LKPE layer achieves the highest DSC of 86.94% with a competitive HD95 of 12.66 mm, demonstrating superior segmentation accuracy and boundary delineation. The upsample block yields a modest DSC gain of 0.65% over transposed convolution but introduces increased computational cost (7.35G FLOPs) and parameter overhead (8.25M). In contrast, the patch expanding layer shows the lowest DSC (84.78%) and the worst HD95 (20.65 mm), suggesting less effective feature reconstruction, despite moderate computational and parameter requirements. Overall, the LKPE layer offers the best trade-off between accuracy and efficiency among the evaluated upsampling methods.

Table 9. Ablation study of upsampling methods on the Synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

4.4.6. Ablation of RG-VSS Layer Number

We conduct an ablation study by varying the number of RG-VSS layers to evaluate the impact of module depth on segmentation performance. As shown in Table 10, increasing the number of layers from 1 to 2 improves the DSC from 85.79% to 86.94% and reduces HD95 from 14.32 mm to 12.66 mm, indicating better segmentation accuracy and boundary delineation. However, adding a third layer does not further improve performance significantly (DSC 86.88%, HD95 12.73 mm) while increasing computational cost and parameter count. These results suggest that a two-layer RG-VSS module provides a favorable trade-off between accuracy and computational efficiency.

Table 10. Ablation study on RG-VSS layer number on the Synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

5. Conclusions

In this work, we present HCDAA-Net, a hybrid cascade and dual-path adaptive aggregation network, to improve medical image segmentation through enhanced feature extraction and feature aggregation. For feature extraction, we design an HCS to alternately integrate ResNet and Mamba modules, preserving fine edges while capturing global semantics. We further employ CCAM to strengthen feature representation and accelerate convergence. For feature aggregation, we introduce CAA to capture intra-structure correlations and DAFA to combine stable cross-layer semantics with subtle differences, improving sensitivity to fine details. Finally, we introduce RG-VSS to refine fused features through dynamic modulation of information flow. Experimental results demonstrate that HCDAA-Net achieves superior segmentation performance, effectively capturing both fine-grained structures and global context, highlighting its potential for robust and accurate medical image analysis.

Although HCDAA-Net demonstrates strong overall performance, its segmentation accuracy for certain specific organs, such as the liver, remains limited. We attribute this suboptimal performance in segmenting large targets to two main factors: (1) RG-VSS excessively modulates feature flow, overemphasizing fine-grained details and impairing the modeling of overall organ structures and (2) LKPE emphasizes local details via convolution and pixel-shuffle upsampling but lacks global structural modeling, leading to blurred boundaries in large targets. To address these issues, future work will focus on enhancing global structural modeling for large targets by (1) improving RG-VSS to enhance local details while preserving overall structural integrity and avoiding overemphasis on fine-grained features and (2) optimizing the upsampling strategy of the LKPE by incorporating larger receptive fields to reduce boundary blurring and improve contour reconstruction for large targets. Additionally, our HCDAA-Net relies entirely on fully supervised segmentation and does not consider reducing annotation costs or leveraging large Vision-Language Models (VLMs). Given the high cost of medical labels, future work also will explore semi-supervised learning, weakly-supervised segmentation, or VLMs to improve practicality and scalability.

Author Contributions

J.R.: Methodology, Writing—original draft; S.C.: Software, Formal Analysis, Resources, Writing—review and editing, Software; Y.S.: Conceptualization, Writing—review and editing; H.G.: Validation, Writing—review and editing, Formal Analysis; Y.T.: Writing—review and editing, Resources; W.Z.: Supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded in part by Henan Provincial Science and Technology Program under Grants 241111212200 and 252102220046, in part by Henan Joint Fund for Science and Technology Research under Grant 20240012, in part by Key Scientific Research Projects of Higher Education Institutions in Henan Province under Grants 26A520036 and 26A520037, and in part by Henan KeyLaboratory of Education Big Data Analysis and Application under Grant 2025jYDSj01.

Data Availability Statement

Our code is available at https://github.com/hpguo1982/HCDAA-Net (accessed on 1 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ma, X.; Chen, M.; Zhang, J.; Song, L.; Du, F.; Yu, Z. PARF-Net: Integrating pixel-wise adaptive receptive fields into hybrid Transformer-CNN network for medical image segmentation. arXiv 2025, arXiv:2501.02882. [Google Scholar]
Sun, G.; Pan, Y.; Kong, W.; Xu, Z.; Ma, J.; Racharak, T.; Nguyen, L.; Xin, J. DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation. arXiv 2023, arXiv:2310.12570. [Google Scholar] [CrossRef]
Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
Peng, Y.; Sonka, M.; Chen, D.Z. U-Net v2: Rethinking the Skip Connections of U-Net for Medical Image Segmentation. arXiv 2024, arXiv:2311.17791. [Google Scholar]
Wang, H.; Cao, P.; Yang, J.; Zaiane, O. Narrowing the semantic gaps in u-net with learnable skip connections: The case of medical image segmentation. Neural Netw. 2024, 178, 106546. [Google Scholar] [CrossRef]
Luo, Z.; Zhu, X.; Zhang, L.; Sun, B. Rethinking U-Net: Task-Adaptive Mixture of Skip Connections for Enhanced Medical Image Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 5874–5882. [Google Scholar]
Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
Mehta, H.; Gupta, A.; Cutkosky, A.; Neyshabur, B. Long range language modeling via gated state spaces. arXiv 2022, arXiv:2206.13947. [Google Scholar] [CrossRef]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 615–625. [Google Scholar]
Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order vision mamba unet for medical image segmentation. Neurocomputing 2025, 624, 129447. [Google Scholar] [CrossRef]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
Pei, X.; Huang, T.; Xu, C. Efficientvmamba: Atrous selective scan for light weight visual mamba. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6443–6451. [Google Scholar]
Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Jiang, J.; Zhang, J.; Liu, W.; Gao, M.; Hu, X.; Yan, X.; Huang, F.; Liu, Y. RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation. arXiv 2025, arXiv:2501.08458. [Google Scholar]
Alom, M.Z.; Yakopcic, C.; Hasan, M.; Taha, T.M.; Asari, V.K. Recurrent residual U-Net for medical image segmentation. J. Med. Imaging 2019, 6, 014006. [Google Scholar] [CrossRef]
Jha, D.; Smedsrud, P.H.; Riegler, M.A.; Johansen, D.; De Lange, T.; Halvorsen, P.; Johansen, H.D. Resunet++: An advanced architecture for medical image segmentation. In Proceedings of the 2019 IEEE International Symposium on Multimedia (ISM), San Diego, CA, USA, 9–11 December 2019; pp. 225–2255. [Google Scholar]
Jafari, M.; Auer, D.; Francis, S.; Garibaldi, J.; Chen, X. DRU-Net: An efficient deep convolutional neural network for medical image segmentation. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 1144–1148. [Google Scholar]
Shu, X.; Wang, J.; Zhang, A.; Shi, J.; Wu, X.J. CSCA U-Net: A channel and space compound attention CNN for medical image segmentation. Artif. Intell. Med. 2024, 150, 102800. [Google Scholar] [CrossRef]
Rahman, M.M.; Munir, M.; Marculescu, R. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11769–11779. [Google Scholar]
Zhu, Y.; Peng, M.; Wang, X.; Huang, X.; Xia, M.; Shen, X.; Jiang, W. LGCE-Net: A local and global contextual encoding network for effective and efficient medical image segmentation. Appl. Intell. 2025, 55, 66. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Dong, B.; Wang, W.; Fan, D.P.; Li, J.; Fu, H.; Shao, L. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv 2021, arXiv:2108.06932. [Google Scholar] [CrossRef]
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
Liu, X.; Gao, P.; Yu, T.; Wang, F.; Yuan, R.Y. CSWin-UNet: Transformer UNet with cross-shaped windows for medical image segmentation. Inf. Fusion 2025, 113, 102634. [Google Scholar] [CrossRef]
Zhu, Y.; Zhang, D.; Lin, Y.; Feng, Y.; Tang, J. Merging Context Clustering With Visual State Space Models for Medical Image Segmentation. IEEE Trans. Med. Imaging 2025, 44, 2131–2142. [Google Scholar] [CrossRef] [PubMed]
Bao, M.; Lyu, S.; Xu, Z.; Zhao, Q.; Zeng, C.; Bai, W.; Cheng, G. ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation. arXiv 2025, arXiv:2503.19427. [Google Scholar]
Zhang, M.; Chen, Z.; Ge, Y.; Tao, X. HMT-UNet: A hybird mamba-transformer vision UNet for medical image segmentation. arXiv 2024, arXiv:2408.11289. [Google Scholar]
Chen, C.; Yu, L.; Min, S.; Wang, S. MSVM-UNet: Multi-Scale Vision Mamba UNet for Medical Image Segmentation. In Proceedings of the 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Lisboa, Portugal, 3–6 December 2024; pp. 3111–3114. [Google Scholar]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. In Proceedings of the European Conference on Computer Vision, Madrid, Spain, 4–5 June 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 12–22. [Google Scholar]
Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 25261–25270. [Google Scholar]
Xu, Z.; Tang, F.; Chen, Z.; Zhou, Z.; Wu, W.; Yang, Y.; Liang, Y.; Jiang, J.; Cai, X.; Su, J. Polyp-mamba: Polyp segmentation with visual mamba. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 510–521. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef]
Zhang, Y. KM-UNet KAN Mamba UNet for medical image segmentation. arXiv 2025, arXiv:2501.02559. [Google Scholar] [CrossRef]
Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 45, 6486–6493. [Google Scholar] [CrossRef] [PubMed]
Ruan, J.; Xiang, S.; Xie, M.; Liu, T.; Fu, Y. Malunet: A multi-attention and light-weight unet for skin lesion segmentation. In Proceedings of the 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Las Vegas, NV, USA, 6–8 December 2022; pp. 1150–1156. [Google Scholar]
Rahman, M.M.; Marculescu, R. Medical image segmentation via cascaded attention decoding. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6222–6231. [Google Scholar]
Gao, Y.; Zhang, S.; Shi, L.; Zhao, G.; Shi, Y. Collaborative transformer U-shaped network for medical image segmentation. Appl. Soft Comput. 2025, 173, 112841. [Google Scholar] [CrossRef]
Gao, Y.; Xu, H.; Liu, Q.; Bie, M.; Che, X. A swin-transformer-based network with inductive bias ability for medical image segmentation. Appl. Intell. 2025, 55, 1–18. [Google Scholar] [CrossRef]

Figure 1. Overall architecture of our HCDAA-Net, where HCS indicates the hybrid cascade structure.

Figure 2. Illustration of the proposed CCAM.

Figure 3. Illustration of the proposed CAA.

Figure 4. Illustration of the proposed DAFA.

Figure 5. Illustration of the proposed RG-VSS.

Figure 6. Illustration of the core operations in the scan block.

Figure 7. Visual comparison of different methods on the Synapse dataset.

Figure 8. Visual comparison of different methods on the ACDC dataset.

Figure 9. Visual comparison of different methods on the ISIC18 dataset.

Figure 10. Visual comparison of different methods on the Glas dataset.

Figure 11. Visualization of feature maps from each stage of our HCDAA-Net, highlighting the effects of the HCS, CCAM, CAA, DAFA and RG-VSS.

Figure 12. Visualization of feature maps from four layers using three backbone networks: ResNet34 (row 1), VMamba (row 2), and HCS (row 3). Orange boxes denote the final segmentation results.

Table 1. Performance comparison with SOTA methods on the Synapse dataset. Bold black data indicates the best results. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Methods	DSC (%) ↑	HD95 (mm) ↓	Aorta	Gallbladder	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach
UNet [4]	74.81	54.60	85.66	53.24	81.13	71.60	92.68	56.80	87.46	69.93
Att-UNet [44]	71.71	34.46	82.61	61.94	76.08	70.42	87.54	46.71	80.67	67.66
TransUNet [15]	76.75	44.30	86.70	58.97	83.33	77.94	94.13	53.60	84.01	75.38
PVT-CASCADE [44]	81.05	20.21	83.00	70.60	82.23	80.37	94.11	63.34	90.11	83.69
TransCASCADE [44]	82.58	17.24	86.62	68.37	87.65	84.56	94.41	65.33	90.79	83.50
Swin-UNet [29]	79.14	21.53	85.46	66.53	83.28	79.62	94.29	56.58	90.66	76.60
Swin-UMamba [13]	82.25	19.50	86.22	70.77	83.58	81.60	95.22	69.36	89.95	81.13
H-vmunet [14]	74.05	31.77	87.50	68.21	73.31	59.23	60.28	92.31	71.69	79.85
CoTransUNet [45]	82.39	17.51	87.13	71.34	86.40	81.12	94.88	66.27	90.37	81.63
MALUNet [43]	65.69	23.89	78.17	65.78	62.65	55.97	49.36	86.76	57.62	69.24
Swin-IBNet [46]	82.58	17.46	87.96	67.99	84.12	84.64	94.51	66.64	91.12	83.68
VM-UNet [3]	82.40	16.21	87.01	69.38	85.52	82.25	94.10	65.78	91.54	83.50
MSVM-UNet [34]	85.00	14.75	88.73	74.90	85.62	84.47	95.74	71.53	92.52	86.51
CSWin-UNet [30]	81.12	18.86	87.13	67.85	83.51	78.53	95.23	65.94	89.05	81.74
RWKV-UNet [20]	84.04	15.73	89.56	73.91	86.74	84.07	95.57	69.35	90.95	83.94
PARF-Net [1]	84.27	13.82	90.38	72.95	85.91	83.42	95.54	72.86	91.43	78.91
Ours	86.94	12.66	94.13	85.83	88.26	77.24	76.87	95.82	87.50	89.82

Table 2. Performance comparison with SOTA methods on the ACDC dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

Methods	DSC (%) ↑	RV	Myo	LV
UNet [4]	87.54	87.10	80.62	94.91
Att-UNet [44]	86.75	87.54	79.26	93.48
TransUNet [15]	89.73	88.89	84.5	95.76
PVT-CASCADE [44]	91.50	89.00	89.98	95.51
TransCASCADE [44]	91.62	89.13	90.25	95.49
Swin-UNet [29]	90.01	88.56	85.63	95.84
Swin-UMamba [13]	92.15	89.98	90.00	95.73
H-vmunet [14]	84.03	80.09	79.94	92.07
CoTransUNet [45]	89.08	91.79	90.98	84.47
MALUNet [43]	72.80	68.21	67.16	83.03
Swin-IBNet [46]	91.54	89.89	89.12	95.63
VM-UNet [3]	92.23	90.73	89.94	96.01
MSVM-UNet [34]	92.58	91.00	90.35	96.39
CSWin-UNet [30]	91.46	89.68	88.94	95.76
RWKV-UNet [20]	92.18	90.87	88.72	96.92
PARF-Net [1]	90.98	88.65	88.82	95.49
Ours	92.73	91.26	90.70	96.25

Table 3. Performance comparison with SOTA methods on the ISIC18 dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

Methods	mIoU (%) ↑	DSC (%) ↑	Acc (%) ↑	Spe (%) ↑	Sen (%) ↑
UNet [4]	77.86	87.55	94.05	96.69	85.86
Att-UNet [44]	78.43	87.91	94.13	96.23	87.60
TransUNet [15]	78.09	87.7	94.06	96.37	86.91
PVT-CASCADE [44]	-	-	-	-	-
TransCASCADE [44]	-	-	-	-	-
Swin-UNet [29]	64.48	78.41	89.68	93.79	76.94
Swin-UMamba [13]	79.91	88.84	94.59	96.57	88.43
H-vmunet [14]	79.56	88.61	94.56	96.98	87.01
CoTransUNet [45]	-	88.75	93.65	95.27	89.49
MALUNet [43]	80.25	89.04	94.62	96.19	89.74
Swin-IBNet [46]	-	-	-	-	-
VM-UNet [3]	81.35	89.71	94.91	96.13	91.12
MSVM-UNet [34]	79.30	88.46	94.44	96.66	87.54
CSWin-UNet [30]	74.00	85.05	93.00	96.62	81.76
RWKV-UNet [20]	81.36	89.72	94.93	96.22	90.92
PARF-Net [1]	74.88	85.64	93.20	96.37	83.32
Ours	81.49	89.80	95.01	96.54	90.25

Table 4. Performance comparison with SOTA methods on the Glas dataset. Bold black data indicates the best results. ‘↑’ indicates higher is better.

Methods	mIoU (%) ↑	DSC (%) ↑
UNet [4]	$81.23 \pm 1.61$	$89.09 \pm 1.09$
Att-UNet [44]	$83.69 \pm 0.51$	$90.65 \pm 0.31$
TransUNet [15]	$78.92 \pm 1.22$	$87.33 \pm 0.87$
PVT-CASCADE [44]	-	-
TransCASCADE [44]	-	-
Swin-UNet [29]	$75.38 \pm 1.68$	$88.27 \pm 1.63$
Swin-UMamba [13]	$77.99 \pm 1.86$	$86.74 \pm 1.31$
H-vmunet [14]	$73.52 \pm 2.14$	$83.73 \pm 1.60$
CoTransUNet [45]	-	-
MALUNet [43]	$67.39 \pm 2.29$	$79.6 \pm 1.75$
Swin-IBNet [46]	-	-
VM-UNet [3]	$76.08 \pm 4.85$	$85.51 \pm 3.46$
MSVM-UNet [34]	$84.48 \pm 0.89$	$91.06 \pm 0.55$
CSWin-UNet [30]	$52.60 \pm 2.87$	$67.50 \pm 2.67$
RWKV-UNet [20]	$79.52 \pm 1.40$	$87.86 \pm 0.99$
PARF-Net [1]	$83.76 \pm 0.65$	$91.06 \pm 0.40$
Ours	84.63± 0.70	91.18± 0.46

Table 5. Ablation studies on the Synapse dataset. ‘✓’ indicates that the corresponding module is used. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

	HCS	CCAM	CAA	DAFA	RG-VSS	DSC (%) ↑	HD95 (mm) ↓
Baseline	-	-	-	-	-	85.00	14.75
	✓	-	-	-	-	85.85	16.11
	✓	✓	-	-	-	86.06	13.04
	✓	✓	✓	-	-	86.23	12.72
	✓	✓	✓	✓	-	86.53	12.69
	✓	✓	✓	✓	✓	86.94	12.66

Table 6. Ablation study of self-attention and CAA modules on the synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Module	DSC (%) ↑	HD95 (mm) ↓	FLOPs (G) ↓	Params (M) ↓	Latency (ms) ↓
Self-Attention [27]	86.68	13.21	16.98	17.20	0.86
CAA	86.94	12.66	17.12	18.61	0.88

Table 7. Ablation studies on the parameter

α

in the loss function. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Table 7. Ablation studies on the parameter

α

in the loss function. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

$α$	DSC (%) ↑	HD95 (mm) ↓
0.0	83.58	20.19
0.2	84.23	18.55
0.4	86.94	12.66
0.6	85.40	14.83
0.8	84.79	15.12
0.9	84.00	15.79
1.0	83.54	17.91

Table 8. Ablation studies on backbone architectures: ResNet34, VMamba, and HCS (ours). ‘-’ denotes backbone without pretrained weights, and ‘✓’ denotes backbone with pretrained weights. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Pretrained	Backbone	DSC (%) ↑	HD95 (mm) ↓
-	ResNet34	81.89	30.15
-	VMamba	83.67	15.18
-	HCS	84.67	19.70
✓	ResNet34	84.81	14.77
✓	VMamba	85.85	16.11
✓	HCS	86.94	12.66

Table 9. Ablation study of upsampling methods on the Synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Upsampling	DSC (%) ↑	HD95 (mm) ↓	FLOPs (G) ↓	Params (M) ↓
Transposed Conv	85.31	12.17	4.51	6.56
UpSample [44]	85.97	15.35	7.35	8.25
Patch Expand [29]	84.78	20.65	5.78	6.09
LKPE	86.94	12.66	5.43	6.36

Table 10. Ablation study on RG-VSS layer number on the Synapse dataset. ‘↑’ and ‘↓’ indicate that higher and lower values are better, respectively.

Layer	DSC (%) ↑	HD95 (mm) ↓	FLOPs (G) ↓	Params (M) ↓
1	85.79	14.32	14.87	15.8
2	86.94	12.66	17.12	18.61
3	86.88	12.73	19.23	20.01

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Hybrid Cascade and Dual-Path Adaptive Aggregation Network for Medical Image Segmentation

Abstract

1. Introduction

2. Related Work

2.1. CNN-Based Methods

2.2. Transformer-Based Methods

2.3. Mamba-Based Methods

3. Methods

3.1. Overall Architecture

3.2. Feature Extraction

3.3. Feature Aggretation

3.3.1. Correlation-Aware Aggregation (CAA)

3.3.2. Dual-Path Adaptive Feature Aggregation (DAFA)

3.3.3. Residual-Gated Visual State Space Module (RG-VSS)

3.4. Loss Function

3.4.1. Datasets

3.4.2. Implementation Details

4. Experiments

4.1. Experimental Setup

Evaluation Metrics

4.2. Quantitative Comparison

4.3. Visual Comparison

4.4. Ablation Studies

4.4.1. Structure Ablation

4.4.2. Ablation Study on Attention Modules

4.4.3. Loss Ablation

4.4.4. Backbone Network Effectiveness

4.4.5. Different Upsampling Method Effectiveness

4.4.6. Ablation of RG-VSS Layer Number

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics