Author Contributions
Conceptualization, Y.C., Y.D., N.M. and X.W.; Methodology, Y.C. and Y.D.; Software, Y.C. and X.L. (Xuefeng Li); Validation, Y.C. and X.L. (Xuefeng Li); Formal Analysis, Y.C.; Investigation, Y.C. and X.L. (Xuefeng Li); Data Curation, Y.C. and X.L. (Xuefeng Li); Writing—Original Draft Preparation, Y.C.; Writing—Review and Editing, X.L. (Xuefeng Li), Y.D., H.J., X.L. (Xiaohui Liu), N.M. and X.W.; Visualization, Y.C.; Supervision, H.J., X.L. (Xiaohui Liu), N.M. and X.W.; Project Administration, X.W.; Funding Acquisition, X.W. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Segmentation of the liver and liver tumors faces many challenges. (A) The boundary between the region of interest and the surrounding tissues is ambiguous. (B) Inter-individual differences in the liver and liver tumors are substantial. (C) Early-stage or small tumors have low contrast with the surrounding tissues.
Figure 2.
Attention gate structure. Schematic diagram of the attention gate integrated into the skip connections of Attention U-Net. The gate receives two inputs: the gating signal (from the decoder) and the encoder feature map. After upsampling the gating signal to match the spatial dimensions, element-wise summation is performed, followed by ReLU activation and 1 × 1 convolution to generate attention coefficients. A sigmoid function normalizes the coefficients to [0, 1], which are then resampled and multiplied with the encoder feature map to suppress irrelevant background regions and emphasize target structures.
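As an illustration, the computation described above can be expressed as the following minimal PyTorch-style sketch. The class name `AttentionGate`, the channel arguments, and the bilinear resampling are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate on a skip connection (illustrative sketch)."""
    def __init__(self, gate_ch, skip_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)  # project gating signal
        self.w_x = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # project encoder features
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)        # 1x1 conv producing attention logits

    def forward(self, g, x):
        # Resample the gating signal to the encoder feature map's spatial size.
        g_up = F.interpolate(g, size=x.shape[2:], mode="bilinear", align_corners=False)
        # Element-wise sum, ReLU, 1x1 conv, then sigmoid -> coefficients in [0, 1].
        alpha = torch.sigmoid(self.psi(F.relu(self.w_g(g_up) + self.w_x(x))))
        # Multiply with the encoder features: background suppressed, targets emphasized.
        return x * alpha
```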
Figure 3.
Structure of the SBM–Attention U-Net. The encoder consists of five downsampling stages: the first three stages incorporate SCDA modules for fine-grained low-level feature enhancement, while the two deepest stages employ BiFormer blocks for global semantic modeling. The decoder integrates MSB at each upsampling stage to fuse multi-scale features from skip connections and decoder inputs. Attention gates (AGs) are retained in skip connections to further refine feature transmission.
Figure 4.
Structure of the CA module. Input feature maps (C × H × W) are pooled along the height and width dimensions separately using global average pooling, producing two directional feature vectors (C × H × 1 and C × 1 × W). These are concatenated and passed through a 1 × 1 convolution for channel reduction, batch normalization, and non-linear activation. The resulting tensor is split back into two directional components, each processed by a 1 × 1 convolution and sigmoid to generate attention weights. These weights are broadcasted and multiplied with the original input to achieve orientation-aware feature recalibration.
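A minimal PyTorch sketch of the coordinate attention computation described above is given below; the reduction ratio and the bottleneck floor of eight channels are assumed defaults, not values taken from this work.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate attention (CA) block, sketched from the description above."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)                 # assumed bottleneck width
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Directional global average pooling: (N, C, H, 1) and (N, C, W, 1).
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)
        # Concatenate along the spatial axis, reduce channels, normalize, activate.
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        # Split back into the two directional components and form attention weights.
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        # Broadcast the directional weights over the input feature map.
        return x * a_h * a_w
```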
Figure 5.
Structure of the SE module. Global average pooling squeezes spatial information from each channel into a channel descriptor (1 × 1 × C). Two fully connected layers (with reduction ratio r) perform excitation: the first reduces dimensionality and applies ReLU, and the second restores the original channel dimensions followed by sigmoid activation. The resulting channel-wise weights are multiplied with the input feature map to adaptively recalibrate feature responses.
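The SE block described above reduces to a few lines of code; the sketch below assumes a reduction ratio of r = 16 and fully connected layers implemented with `nn.Linear`.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block, sketched from the description above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: (N, C, 1, 1)
        self.fc = nn.Sequential(                       # excitation with reduction ratio r
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)  # channel-wise weights
        return x * w                                           # recalibrate the input
```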
Figure 6.
Structure of SENetV2. After global average pooling, the squeezed vector passes through multiple parallel fully connected branches (dense layers) before aggregation. This design captures richer inter-channel dependencies and improves global context modeling compared to the original SE block.
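One plausible reading of the multi-branch design is sketched below; the number of branches and the concatenate-then-restore aggregation are assumptions made for illustration and may differ from the published SENetV2 layout.

```python
import torch
import torch.nn as nn

class SqueezeAggregatedExcitation(nn.Module):
    """Multi-branch SE variant, sketched from the SENetV2 description above."""
    def __init__(self, channels, reduction=16, num_branches=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        mid = channels // reduction
        # Parallel fully connected branches applied to the squeezed descriptor.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(channels, mid), nn.ReLU(inplace=True))
            for _ in range(num_branches)
        )
        # Aggregation: concatenate the branch outputs and restore the channel dimension.
        self.restore = nn.Sequential(nn.Linear(mid * num_branches, channels), nn.Sigmoid())

    def forward(self, x):
        n, c, _, _ = x.shape
        s = self.pool(x).view(n, c)                                    # squeezed vector
        z = torch.cat([branch(s) for branch in self.branches], dim=1)  # parallel branches
        w = self.restore(z).view(n, c, 1, 1)
        return x * w
```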
Figure 7.
Structure of SCDA. Input features first pass through a CA branch to encode direction-sensitive spatial information. The output is then processed by two consecutive convolutional layers (Conv + BN + ReLU) to refine representations, followed by a SENetV2 block for channel-wise recalibration. Residual connections add the original input to the final output to preserve information and facilitate gradient flow.
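Combining the pieces, the SCDA block can be sketched as below, reusing the `CoordinateAttention` and `SqueezeAggregatedExcitation` sketches given earlier; the 3 × 3 kernel size of the two convolutional layers is an assumption.

```python
import torch.nn as nn

class SCDA(nn.Module):
    """SCDA block, sketched from Figure 7: CA -> two Conv+BN+ReLU layers -> SENetV2 -> residual."""
    def __init__(self, channels):
        super().__init__()
        self.ca = CoordinateAttention(channels)          # from the earlier sketch
        self.convs = nn.Sequential(                      # two consecutive Conv + BN + ReLU layers
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.se = SqueezeAggregatedExcitation(channels)  # from the earlier sketch

    def forward(self, x):
        y = self.se(self.convs(self.ca(x)))
        return x + y                                     # residual connection preserves the input
```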
Figure 8.
Structure of BiFormer. Input features undergo a 3 × 3 depthwise convolution to encode relative position, then pass through a Bi-level Routing Attention module that selectively aggregates information from the most relevant regions in a content-aware manner. Finally, a two-layer MLP with an expansion ratio e performs per-position embedding. This design efficiently captures global context while reducing computational complexity.
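Only the block layout is sketched below: the bi-level routing attention itself is replaced by a standard multi-head self-attention placeholder, since the routing step (selecting the most relevant regions before fine-grained attention) is considerably more involved; the head count and expansion ratio are assumed values.

```python
import torch
import torch.nn as nn

class BiFormerBlock(nn.Module):
    """BiFormer block layout from Figure 8; the attention here is only a placeholder for BRA."""
    def __init__(self, dim, num_heads=8, expansion=4):
        super().__init__()
        self.pos = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # 3x3 depthwise positional encoding
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # stand-in for bi-level routing attention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                                             # two-layer MLP with expansion ratio e
            nn.Linear(dim, dim * expansion), nn.GELU(), nn.Linear(dim * expansion, dim)
        )

    def forward(self, x):                        # x: (N, C, H, W)
        x = x + self.pos(x)                      # encode relative position
        n, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # tokens: (N, H*W, C)
        q = self.norm1(t)
        t = t + self.attn(q, q, q, need_weights=False)[0]
        t = t + self.mlp(self.norm2(t))          # per-position embedding
        return t.transpose(1, 2).reshape(n, c, h, w)
```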
Figure 9.
Structure of the multi-scale parallel large convolution kernel module. Input features are normalized and split into two parallel paths: a 1 × 1 convolution for channel adjustment and a 5 × 5 convolution to allow a larger receptive field. The outputs are fed into three parallel depthwise dilated convolutions with different dilation rates to capture multi-scale context. These features are concatenated, refined by 1 × 1 convolutions and GELU activation, and finally added to the input via a residual connection.
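A sketch of the module is given below. The description does not fully specify how the two parallel paths feed the dilated branches, so their outputs are summed here, and the dilation rates (1, 2, 3) are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleLargeKernelBlock(nn.Module):
    """Multi-scale parallel large-kernel module, sketched from Figure 9."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)              # channel adjustment path
        self.large = nn.Conv2d(channels, channels, kernel_size=5, padding=2)  # larger receptive field path
        # Three parallel depthwise dilated convolutions with different dilation rates.
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d, groups=channels)
            for d in dilations
        )
        self.fuse = nn.Sequential(                                            # 1x1 convs and GELU refinement
            nn.Conv2d(channels * len(dilations), channels, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        y = self.norm(x)
        y = self.proj(y) + self.large(y)                               # two parallel paths (summed here)
        y = torch.cat([branch(y) for branch in self.dilated], dim=1)   # multi-scale context
        return x + self.fuse(y)                                        # residual connection
```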
Figure 10.
Structure of the enhanced parallel attention module. Input features are normalized and processed by three parallel attention branches: pixel attention (focuses on spatial positions), channel attention (models channel dependencies), and simple pixel attention (models local pixel correlations). The outputs of these branches are aggregated to refine the feature representation, improving the segmentation of boundaries and small structures.
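A three-branch sketch in the same style is shown below; the internals of each branch and the summation-based aggregation are assumptions chosen only to illustrate the parallel layout.

```python
import torch
import torch.nn as nn

class EnhancedParallelAttention(nn.Module):
    """Three-branch parallel attention, sketched from the description in Figure 10."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.norm = nn.BatchNorm2d(channels)
        # Pixel attention: a full spatial-and-channel weight map.
        self.pixel = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Channel attention: SE-style pooling and channel re-weighting.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Simple pixel attention: a single-channel spatial map.
        self.simple = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.norm(x)
        # Aggregate the three branch outputs (summation assumed) to refine the features.
        return y * self.pixel(y) + y * self.channel(y) + y * self.simple(y)
```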
Figure 11.
Trends of the consolidated ablation study results on the 3Dircadb dataset. The line chart illustrates the performance changes in tumor segmentation after progressively integrating different modules into the baseline Attention U-Net. The sequential integration of the proposed modules yielded consistent and significant performance gains.
Figure 12.
Model inference results on the 3Dircadb dataset, where the white areas represent liver tumors and the black area represents the background.
Figure 13.
Trends of the consolidated ablation study results on the LITS dataset. The line chart demonstrates the performance evolution for both tumor and liver segmentation across different model configurations.
Figure 14.
Model inference results on the LITS dataset, where the white areas represent liver tumors, the gray area represents the liver, and the black area represents the background.
Figure 15.
Trends of the consolidated ablation study results on the CHAOS dataset. The line chart illustrates the performance evolution of multi-abdominal organ segmentation (liver, right kidney, left kidney, and spleen) across different model configurations. The sequential integration of the proposed modules demonstrates comprehensive optimization across all target categories.
Figure 16.
Model inference results on the CHAOS dataset, where the blue area represents the liver, the purple area the right kidney, the yellow area the left kidney, and the white area the spleen.
Figure 17.
Comparison of the training-time costs of the baseline Attention U-Net, the individual ablation variants (+SCDA, +BiFormer, +MSB), and the proposed SBM–Attention U-Net.
Figure 18.
Comparison of the testing-time costs of the baseline Attention U-Net, the individual ablation variants (+SCDA, +BiFormer, +MSB), and the proposed SBM–Attention U-Net.
Table 1.
Consolidated ablation study results on the 3Dircadb dataset. Bold indicates the best value for each evaluation metric.
| Model Variant | Tumor IoU | Tumor Dice | Tumor Precision | Tumor Recall | Mean Dice | Mean Recall |
|---|---|---|---|---|---|---|
| Attention U-Net | 71.48 | 83.37 | 89.35 | 78.13 | 91.58 | 89.01 |
| +SCDA | 75.41 | 85.98 | 90.30 | 82.05 | 92.91 | 90.97 |
| +BiFormer | 76.34 | 86.58 | 89.08 | 84.22 | 93.21 | 92.04 |
| +MSB | 76.88 | 86.93 | 89.82 | 84.23 | 93.39 | 92.05 |
| SBM–Attention U-Net | 78.07 | 87.69 | 90.40 | 85.13 | 93.77 | 92.51 |
Table 2.
Consolidated ablation study results on the LITS dataset. Bold indicates the best value for each evaluation metric.
| Model Variant | Tumor IoU | Tumor Dice | Tumor Precision | Tumor Recall | Liver Dice | Mean Dice |
|---|---|---|---|---|---|---|
| Attention U-Net | 61.73 | 76.34 | 77.52 | 75.19 | 94.76 | 90.27 |
| +SCDA | 66.52 | 79.89 | 81.94 | 77.95 | 95.37 | 91.67 |
| +BiFormer | 66.52 | 81.30 | 81.79 | 80.82 | 95.45 | 92.17 |
| +MSB | 67.38 | 80.51 | 82.40 | 78.71 | 95.26 | 91.84 |
| SBM–Attention U-Net | 69.93 | 82.30 | 84.84 | 79.92 | 95.64 | 92.57 |
Table 3.
Consolidated ablation study results on the CHAOS dataset (Dice). Bold indicates the best value for each evaluation metric.
| Model Variant | Liver | Right Kidney | Left Kidney | Spleen | Mean Dice |
|---|---|---|---|---|---|
| Attention U-Net | 93.82 | 95.19 | 95.11 | 91.78 | 95.11 |
| +SCDA | 94.00 | 95.23 | 94.57 | 92.57 | 95.20 |
| +BiFormer | 94.30 | 95.06 | 95.19 | 93.71 | 95.58 |
| +MSB | 94.00 | 95.23 | 94.57 | 92.57 | 95.20 |
| SBM–Attention U-Net | 94.98 | 96.41 | 95.73 | 93.73 | 96.11 |
Table 4.
Comparison of the proposed method with state-of-the-art methods on the 3Dircadb dataset. Bold indicates the best value for each evaluation metric.
| Architecture | Year | Core Mechanism | Tumor Dice |
|---|---|---|---|
| SBM–Attention U-Net | 2026 | CNN+Transformer | 87.69 |
| TransUNet [30] | 2021 | CNN+Transformer | 76.06 |
| Swin-UNet [43] | 2022 | Swin Transformer | 71.38 |
| MS-FANet [44] | 2023 | Multi-Scale Feature Attention | 87.50 |
| nnU-Net (V2) [45] | 2024 | Self-Configuring Baseline | 82.34 |
| E2Net [46] | 2025 | Edge-Enhanced Network | 83.00 |
| G-UNETR++ [47] | 2025 | Gradient-Enhanced Encoder | 83.21 |
Table 5.
Comparison of the proposed method with state-of-the-art methods on the LITS dataset. Bold indicates the best value for each evaluation metric.
| Architecture | Year | Core Mechanism | Tumor Dice | Liver Dice |
|---|---|---|---|---|
| SBM–Attention U-Net | 2026 | CNN+Transformer | 82.30 | 95.64 |
| TransUNet [30] | 2021 | CNN+Transformer | 82.19 | 94.95 |
| Swin-UNet [43] | 2022 | Pure Transformer | 81.73 | 93.64 |
| MS-FANet [44] | 2023 | Multi-Scale Feature Attention | 74.20 | 94.80 |
| SBCNet [48] | 2024 | Dual-Branch CNN | 81.35 | 94.21 |
| T-MPEDNet [49] | 2025 | CNN+Transformer | 81.98 | 95.54 |
| SegMamba (V2) [50] | 2025 | State-Space Model | 82.68 | 96.62 |
Table 6.
Comparison of the proposed method with state-of-the-art methods on the CHAOS dataset. Bold indicates the best value for each evaluation metric.
| Architecture | Year | Core Mechanism | Liver Dice | R. Kidney Dice | L. Kidney Dice | Spleen Dice |
|---|---|---|---|---|---|---|
| SBM–Attention U-Net | 2026 | Hybrid Attention + MSB | 94.98 | 96.41 | 95.73 | 93.73 |
| Swin-UNETR [51] | 2021 | Swin Transformer | 87.67 | 86.50 | 85.82 | 86.39 |
| MISSFormer [52] | 2022 | Hierarchical Transformer | 91.88 | 93.52 | 92.73 | 91.13 |
| SegMamba (V2) [50] | 2025 | State-Space Model | 95.38 | 96.20 | 95.08 | 92.40 |
| MRSegmentator [53] | 2024 | nnU-Net variant | 93.55 | 92.62 | 90.02 | 89.58 |
| CabiNet [54] | 2025 | Multi-Dataset Learning | 90.39 | 87.41 | 89.09 | 90.78 |
Table 7.
Comparison of model training time on the three datasets.
| Dataset | Attention U-Net | +SCDA | +BiFormer | +MSB | SBM–Attention U-Net |
|---|---|---|---|---|---|
| 3Dircadb | 17.2 min | 39.63 min | 38.13 min | 56.42 min | 1.16 h |
| LITS | 16.75 min | 29.17 min | 21.7 min | 47.63 min | 1.05 h |
| CHAOS | 22.23 min | 45.38 min | 33.14 min | 47.24 min | 1.2 h |
Table 8.
Comparison of model testing time on the three datasets.
| Dataset | Attention U-Net | +SCDA | +BiFormer | +MSB | SBM–Attention U-Net |
|---|---|---|---|---|---|
| 3Dircadb | 16.33 min | 37.77 min | 36.64 min | 52.77 min | 1.1 h |
| LITS | 15.93 min | 27.97 min | 20.66 min | 44.76 min | 1 h |
| CHAOS | 21.08 min | 43.65 min | 31.47 min | 43.95 min | 1.15 h |