As depicted in the left half of
Figure 1, the encoder is a hybrid CNN-Transformer design. The CNN component extracts features via convolutional and downsampling layers, progressively reducing feature map size. After processing through the DW-CAM, the features flow into the DPA. The DW-CAM improves feature representation by integrating depthwise convolution with a Composite Attention Module (CAM), which combines Alterable Kernel Convolution (AKConv) [35], ParNet Attention (PA) [36], and Spatial and Channel Reconstruction Convolution (SCConv) [37] to capture multi-scale contextual information. The DPA improves global feature representation by combining spatial and channel attention. This hybrid approach balances local and global feature processing, optimizing encoder performance.
3.2.1. The Depthwise Composite Attention Module (DW-CAM)
In
Figure 1, the DW-CAM is designed to address the primary limitations of hybrid models, most notably their limited local feature extraction. The DW-CAM overcomes these challenges by incorporating the following:
Depthwise convolution (DWConv): Enhances local feature extraction, capturing fine details more precisely.
Composite Attention Module (CAM): Improves attention mechanisms and local feature extraction efficiency.
Depthwise convolution (DWConv): To enhance computational efficiency while preserving spatial information, we employ a depthwise convolution operation. Specifically, a depthwise convolution with stride = 1 and padding = 1 is utilized to ensure that the spatial dimensions of the feature maps remain unchanged. The number of groups in the convolution is set equal to the number of input channels, allowing each channel to be processed independently. Compared to standard convolution, depthwise convolution significantly reduces the number of parameters and computational complexity while maintaining the integrity of spatial features, thus improving the overall computational efficiency of the model.
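For concreteness, a minimal PyTorch sketch of this operation is given below; the 3 × 3 kernel size is assumed (implied by padding = 1), and the layer names are illustrative rather than taken from our implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the depthwise convolution described above (3x3 kernel assumed):
# groups equals the number of input channels, so each channel is filtered independently,
# and stride = 1, padding = 1 keep the spatial size of the feature map unchanged.
class DepthwiseConv(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.dwconv = nn.Conv2d(
            channels, channels,
            kernel_size=kernel_size,
            stride=1,
            padding=kernel_size // 2,
            groups=channels,   # one filter per channel -> far fewer parameters than a standard conv
            bias=False,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dwconv(x)

x = torch.randn(2, 64, 56, 56)      # (N, C, H, W)
y = DepthwiseConv(64)(x)
assert y.shape == x.shape            # spatial dimensions are preserved
```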
Composite Attention Module (CAM): As illustrated in
Figure 2, the Composite Attention Module (CAM) is an advanced attention mechanism designed to integrate spatial and channel information effectively. By incorporating geometric transformations, interactions between local and global features, and joint spatial–channel reconstruction, the CAM module dynamically enhances input features across multiple dimensions. Compared to conventional attention mechanisms, CAM offers greater flexibility and adaptability, models features in complex scenarios more readily, and significantly reduces computational complexity, thus contributing to improved overall performance. The following paragraphs introduce AKConv, PA, and SCConv.
Given an input feature map X ∈ R^(N×C×H×W), where N is the batch size, C is the number of channels, and H and W denote the height and width of the feature map, a 2D convolution is first applied to obtain an offset matrix ΔP. Before adjusting the sampling coordinates, an initial sampling coordinate set P_0 = {p_1, p_2, …, p_K} is generated, where K represents the number of sampling points (the kernel size), and each point is defined by two coordinate values (x, y). By adding the learned offsets ΔP to the initial sampling coordinates P_0, the adjusted sampling coordinates are obtained as P = P_0 + ΔP. Since these adjusted sampling coordinates may not align with integer grid positions, a resampling process is performed using bilinear interpolation to compute the feature values at the new locations. Finally, the resampled feature map is mapped to the standard channel dimension through a convolution layer, ensuring compatibility with subsequent network processing.
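The following PyTorch sketch illustrates the general offset-and-resample idea; it is a simplified stand-in rather than the reference AKConv implementation of [35], and the initial sampling pattern, the number of sampling points K, and the final 1 × 1 projection are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of offset-based adaptive sampling: a small conv predicts per-pixel
# offsets for K sampling points, the feature map is resampled at the shifted coordinates
# with bilinear interpolation, and a final conv maps the samples back to out_ch channels.
class OffsetResample(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        self.k = k
        self.offset_conv = nn.Conv2d(in_ch, 2 * k, kernel_size=3, padding=1)  # (dx, dy) per point
        self.proj = nn.Conv2d(in_ch * k, out_ch, kernel_size=1)               # back to standard channels
        # initial sampling pattern P0: up to K points around the centre pixel (assumed layout)
        init = torch.tensor([[0., 0.], [-1., 0.], [1., 0.], [0., -1.], [0., 1.]])[:k]
        self.register_buffer("p0", init)                                      # (K, 2) in pixel units

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        offsets = self.offset_conv(x).view(n, self.k, 2, h, w)                # learned offsets
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device, dtype=x.dtype),
                                torch.arange(w, device=x.device, dtype=x.dtype),
                                indexing="ij")
        samples = []
        for i in range(self.k):
            px = xs + self.p0[i, 0] + offsets[:, i, 0]                        # adjusted x coordinates
            py = ys + self.p0[i, 1] + offsets[:, i, 1]                        # adjusted y coordinates
            # normalise to [-1, 1] for grid_sample, then resample with bilinear interpolation
            grid = torch.stack((2 * px / (w - 1) - 1, 2 * py / (h - 1) - 1), dim=-1)
            samples.append(F.grid_sample(x, grid, mode="bilinear", align_corners=True))
        return self.proj(torch.cat(samples, dim=1))                           # map to standard channels
```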
The output feature map is first processed using a 1 × 1 convolution to extract channel-wise information and a 3 × 3 convolution to capture spatial information; the 1 × 1 convolutions use stride = 1 and padding = 0, and the 3 × 3 convolution uses stride = 1 and padding = 1. Subsequently, global features are extracted by global average pooling (GAP), followed by a 1 × 1 convolution for dimensionality reduction. A sigmoid activation function is then applied to generate the attention weights for global attention computation. Next, attention-weighted refinement is performed by applying element-wise multiplication between the generated attention weights and the input feature map, thereby enhancing key regions. Finally, the refined features are fused by aggregating the outputs of the 1 × 1 convolution, the 3 × 3 convolution, and the attention-weighted features through element-wise addition. This fusion process enhances target region information, improving the feature representation and its discriminative capability.
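A rough PyTorch sketch of this PA-style branch is shown below; it follows the description above rather than the reference implementation of [36], and the layer widths (in particular a channel-preserving 1 × 1 convolution after GAP) are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the ParNet-Attention-style fusion described above: a 1x1 conv for channel
# information, a 3x3 conv for spatial information, and a GAP -> 1x1 conv -> sigmoid
# branch whose weights re-weight the input; the three outputs are added element-wise.
class ParNetStyleAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1, stride=1, padding=0)  # channel-wise info
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)  # spatial info
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global average pooling (GAP)
            nn.Conv2d(channels, channels, kernel_size=1),  # 1x1 conv (channel-preserving, assumed)
            nn.Sigmoid(),                                  # attention weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attended = x * self.global_att(x)                    # attention-weighted refinement of key regions
        return self.conv1x1(x) + self.conv3x3(x) + attended  # fuse the three outputs by element-wise addition
```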
To optimize spatial and channel information, enhance feature representation capability, reduce redundant information, and improve computational efficiency, we introduce a Spatial Reconstruction Unit (SRU) and a Channel Reconstruction Unit (CRU). Given an input feature map X ∈ R^(N×C×H×W), the SRU is first applied to separate primary information from redundant information and reconstruct the feature representation. Subsequently, the Channel Reconstruction Unit is employed to further refine channel-wise information, thereby enhancing feature expressiveness. The final output is the optimized feature map Y.
The following describes the implementation steps of the SRU. To assess the significance of each channel, Group Normalization (GN) is employed to compute the weight of each channel:
W_γ = {w_1, w_2, …, w_C},  w_i = γ_i / Σ_j γ_j,  i, j = 1, 2, …, C, (1)
where γ_i denotes the learnable scaling factor of GN for the i-th channel. In this process, W_γ represents the channel importance, which is utilized to distinguish between primary information and redundant information.
Based on the computed channel importance weights, the feature map X is divided into two components:
X_1^w = W_1 ⊙ X,  X_2^w = W_2 ⊙ X, (2)
where X_1^w represents the primary information channels (high-weight features), while X_2^w corresponds to the redundant information channels (low-weight features). The weight distribution satisfies W_1 + W_2 = 1 to ensure the total information remains unchanged. W_1 and W_2 are weight masks derived from the channel importance vector W_γ (in Formula (1)). Element-wise multiplication (⊙) allows high-weight channels to retain more features while suppressing low-weight channels.
Finally, the spatial features are reconstructed by concatenating features from different sources to enhance information representation capability:
X^s = X_1^w ∪ X_2^w, (3)
where ∪ represents channel concatenation, enabling the model to retain information from different perspectives simultaneously.
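The SRU steps above can be sketched as follows; this is a simplified illustration based on SCConv [37], and the sigmoid gating threshold of 0.5 and the cross-reconstruction used to keep the channel count are assumptions carried over from that design.

```python
import torch
import torch.nn as nn

# Sketch of the SRU: GN's scaling factors give channel importance (Formula (1)),
# a sigmoid gate separates primary from redundant information (Formula (2)),
# and the weighted parts are cross-reconstructed and concatenated (Formula (3)).
class SRU(nn.Module):
    def __init__(self, channels: int, groups: int = 4, threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xn = self.gn(x)
        # channel importance from the normalised GN scaling factors gamma (Formula (1))
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        gate = torch.sigmoid(xn * w_gamma)
        w1 = (gate >= self.threshold).float()       # W1: mask for primary (high-weight) information
        w2 = (gate < self.threshold).float()        # W2: mask for redundant (low-weight) information
        x1_a, x1_b = torch.chunk(w1 * x, 2, dim=1)  # X1^w = W1 ⊙ X, split into two halves
        x2_a, x2_b = torch.chunk(w2 * x, 2, dim=1)  # X2^w = W2 ⊙ X, split into two halves
        # cross-reconstruct and concatenate along channels: X^s (keeps C channels)
        return torch.cat([x1_a + x2_b, x2_a + x1_b], dim=1)
```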
The spatially optimized feature map X^s is fed into the CRU. Since X^s still contains redundant information, channel optimization is applied:
X_up, X_low = Split(X^s), (4)
where X_up represents the primary channel information (high-frequency features), and X_low serves as the supplementary channel information (low-frequency features). The function Split(·) divides the channels according to a split ratio α.
Next, a transformation is performed using Group-Wise Convolution (GWC) and Point-Wise Convolution (PWC) to extract features:
Y_1 = GWC(X_up) + PWC(X_up),  Y_2 = PWC(X_low), (5)
where Y_1 employs group-wise convolution to extract features and improve computational efficiency, while Y_2 applies point-wise (1 × 1) convolution to reduce the channel dimensions. Y_1 and Y_2 represent the optimized channel features.
Finally, feature fusion is performed by computing Softmax weights to integrate the two types of channel information:
S_m = GAP(Y_m), m = 1, 2,  (β_1, β_2) = Softmax(S_1, S_2),  Y = β_1·Y_1 + β_2·Y_2, (6)
where S_1 and S_2 are computed by global average pooling (GAP) and represent the importance scores of the channel features, and β_1 and β_2 are the corresponding weights computed by Softmax.
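A corresponding sketch of the CRU is given below; the split ratio α = 0.5, the group number, and the kernel sizes are assumptions, and Y_2 is produced by a single point-wise convolution as in the simplified Formula (5).

```python
import torch
import torch.nn as nn

# Sketch of the CRU: split the channels (Formula (4)), transform the upper part with
# GWC + PWC and the lower part with PWC (Formula (5)), and fuse the two results with
# Softmax weights computed from global average pooling (Formula (6)).
class CRU(nn.Module):
    def __init__(self, channels: int, alpha: float = 0.5, groups: int = 2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        self.gwc = nn.Conv2d(self.c_up, channels, 3, padding=1, groups=groups)  # group-wise conv
        self.pwc1 = nn.Conv2d(self.c_up, channels, 1)                           # point-wise conv
        self.pwc2 = nn.Conv2d(self.c_low, channels, 1)                          # point-wise conv
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_up, x_low = torch.split(x, [self.c_up, self.c_low], dim=1)  # Split(X^s) by ratio alpha
        y1 = self.gwc(x_up) + self.pwc1(x_up)       # Y1: GWC + PWC features
        y2 = self.pwc2(x_low)                       # Y2: PWC features
        s1, s2 = self.gap(y1), self.gap(y2)         # S1, S2: channel importance scores
        beta = torch.softmax(torch.stack([s1, s2], dim=0), dim=0)  # beta1, beta2
        return beta[0] * y1 + beta[1] * y2          # Y = beta1*Y1 + beta2*Y2
```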
In summary, the CAM module effectively learns both local details and global contextual information from input features, thus enhancing the representational capacity of spatial and channel features. This multi-scale fusion approach significantly improves the model’s receptive field and the efficiency of information utilization. By adaptively generating weight distributions across spatial and channel dimensions based on the input content, the module dynamically models the relative importance of information. Through the integration of spatial reconstruction and channel reconstruction, the CAM module fully harnesses the potential of input features across multiple dimensions, ensuring comprehensive information fusion and substantially boosting the performance of downstream tasks.
3.2.2. The Depthwise Polarized Attention (DPA) Module
Figure 3 illustrates the Depthwise Polarized Attention (DPA) module, which is designed to effectively employ global context modeling to address the challenges of medical image segmentation. The module begins with a Depthwise Convolution (DWConv) layer, which allows the extraction of lightweight local features and preserves high-resolution details. Layer Normalization is applied in the channel attention branch (A^ch) before the final sigmoid activation. Its purpose is to normalize the feature statistics across the channel dimension after the feature transformation (1 × 1 conv) and potential reshaping/multiplication. This stabilizes training, prevents exploding/vanishing activations, and ensures the attention weights generated by the sigmoid function are in a well-behaved range, leading to more effective channel-wise feature refinement.
To further enhance feature representation, the DPA module employs a dual-branch mechanism that focuses on refining both spatial and channel information. This mechanism adaptively redistributes feature importance by dynamically learning the relevance of different spatial and channel dimensions. By combining these refined features, the DPA module achieves a comprehensive and robust representation of input features, balancing local detail preservation with global context integration.
In the channel dimension, the output features are represented as A^ch(X) ⊙^ch X:
A^ch(X) = F_SG[ LN( W_z( σ_1(W_v(X)) × F_SM(σ_2(W_q(X))) ) ) ], (7)
Here, W_q, W_v, and W_z are 1 × 1 convolution layers, while σ_1 and σ_2 represent reshape operations. F_SM denotes the SoftMax function, the symbol “×” represents matrix multiplication, and F_SG represents the sigmoid function. The output channel dimension of W_v and W_q is C/2. Therefore, the output in the channel dimension is A^ch(X) ⊙^ch X ∈ R^(C×H×W), and ⊙^ch represents the multiplication operator in the channel dimension.
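The channel branch can be sketched as follows; the layer widths follow the polarized self-attention formulation that the DPA builds on and are assumptions here rather than our exact configuration (in this sketch W_q is reduced to a single channel, a detail of that formulation).

```python
import torch
import torch.nn as nn

# Sketch of the channel attention branch A^ch (Formula (7)): 1x1 convs W_v, W_q, W_z,
# reshapes, Softmax over spatial positions, matrix multiplication, LayerNorm, sigmoid.
class ChannelBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.wv = nn.Conv2d(channels, channels // 2, 1)   # W_v: C -> C/2
        self.wq = nn.Conv2d(channels, 1, 1)               # W_q: C -> 1 (assumed, polarized design)
        self.wz = nn.Conv2d(channels // 2, channels, 1)   # W_z: C/2 -> C
        self.ln = nn.LayerNorm(channels)                  # LayerNorm before the sigmoid
        self.softmax = nn.Softmax(dim=1)                  # F_SM over spatial positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        v = self.wv(x).view(n, c // 2, h * w)                  # sigma_1(W_v(X)): (N, C/2, HW)
        q = self.softmax(self.wq(x).view(n, h * w, 1))         # F_SM(sigma_2(W_q(X))): (N, HW, 1)
        z = torch.matmul(v, q).view(n, c // 2, 1, 1)           # matrix multiplication "×"
        attn = torch.sigmoid(self.ln(self.wz(z).view(n, c)))   # F_SG(LN(W_z(...)))
        return x * attn.view(n, c, 1, 1)                       # A^ch(X) ⊙^ch X
```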
In the spatial dimension, the output features are represented as A^sp(X) ⊙^sp X:
A^sp(X) = F_SG[ σ_3( F_SM(σ_1(F_GP(W_q(X)))) × σ_2(W_v(X)) ) ], (8)
Here, W_q and W_v are both 1 × 1 convolution layers, and σ_1, σ_2, and σ_3 are reshape operations. F_GP represents the Global Pooling operation. The symbol “×” indicates matrix multiplication. Thus, the output features in the spatial dimension are represented as A^sp(X) ⊙^sp X ∈ R^(C×H×W), and ⊙^sp denotes the multiplication operation in the spatial dimension. To reduce the risk of overfitting, we add Dropout layers after the channel branch A^ch and the spatial branch A^sp, respectively. The Dropout rate is set to 0.1, which means 10% of neurons are randomly discarded during training to enhance the adaptability of the model to complex medical images.
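A companion sketch of the spatial branch, with the same caveat that the layer widths are assumed, is shown below; the Dropout placement follows the description above.

```python
import torch
import torch.nn as nn

# Sketch of the spatial attention branch A^sp (Formula (8)): W_q is globally pooled and
# Softmax-normalised, multiplied against the reshaped W_v features, reshaped back to a
# spatial map, passed through a sigmoid, and used to re-weight the input.
class SpatialBranch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.wq = nn.Conv2d(channels, channels // 2, 1)   # W_q: C -> C/2
        self.wv = nn.Conv2d(channels, channels // 2, 1)   # W_v: C -> C/2
        self.gap = nn.AdaptiveAvgPool2d(1)                # F_GP: global pooling
        self.softmax = nn.Softmax(dim=-1)                 # F_SM over channels
        self.dropout = nn.Dropout(0.1)                    # Dropout rate 0.1, as in the text

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.softmax(self.gap(self.wq(x)).view(n, 1, c // 2))  # F_SM(sigma_1(F_GP(W_q(X)))): (N, 1, C/2)
        v = self.wv(x).view(n, c // 2, h * w)                      # sigma_2(W_v(X)): (N, C/2, HW)
        attn = torch.sigmoid(torch.matmul(q, v).view(n, 1, h, w))  # sigma_3 + F_SG: (N, 1, H, W)
        return self.dropout(x * attn)                              # A^sp(X) ⊙^sp X, with Dropout
```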
Fusion: To dynamically adjust the contribution of spatial and channel features, we introduce learnable weights α and β to perform weighted fusion. The output features from the two branches described above are fused in parallel:
F_out = α·(A^ch(X) ⊙^ch X) + β·(A^sp(X) ⊙^sp X), (9)
Here, α and β are trainable parameters learned by the network with initial values set to 0.5 and optimized by backpropagation. The weights satisfy the constraint α + β = 1 (implemented by softmax normalization) to ensure the stability of feature fusion.
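A minimal sketch of this fusion, assuming the two scalar weights are stored as a learnable vector and softmax-normalized in the forward pass:

```python
import torch
import torch.nn as nn

# Weighted fusion of the channel and spatial branch outputs (Formula (9)):
# two scalars initialised to 0.5 are learned by backpropagation and
# softmax-normalised so that alpha + beta = 1.
class WeightedFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.tensor([0.5, 0.5]))   # initial values 0.5

    def forward(self, z_ch: torch.Tensor, z_sp: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.logits, dim=0)                   # alpha + beta = 1
        return w[0] * z_ch + w[1] * z_sp                        # parallel fusion of the two branches
```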
Through its innovative design, the DPA module improves the segmentation accuracy by integrating multi-scale information and effectively enhancing critical features while suppressing redundant ones. This approach ensures precise and reliable delineation of anatomical structures in medical images, making it a powerful component in the proposed segmentation framework.