3.1. Dataset
Synapse Multi-Organ Segmentation Dataset. The Synapse dataset [53] originates from the MICCAI 2015 Beyond the Cranial Vault (BTCV) multi-organ segmentation challenge. It comprises 30 abdominal CT cases, totaling 3779 axial clinical CT slices. Each 3D CT volume contains between 85 and 198 slices, with a spatial resolution of 512 × 512 pixels. Although the original dataset includes annotations for 13 organs, following the protocol of previous works [29,31], we focus on the segmentation of 8 abdominal organs: aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. For data partitioning, 18 cases are used for training and the remaining 12 cases for testing, without a separate validation set. Specifically, we extract 2D axial slices from the 18 training volumes, resulting in 2211 2D CT images. The training/testing split follows a provided .txt file list, consistent with prior studies [29,31].
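The slice-extraction step described above can be sketched as follows. This is a minimal illustration assuming volumes are held as NumPy arrays with the slice axis last; in practice the CT volumes would be loaded with a medical imaging library such as nibabel, which is not shown here.

```python
import numpy as np

def extract_axial_slices(volume):
    """Split a 3D CT volume of shape (H, W, num_slices) into 2D axial slices."""
    return [volume[:, :, k] for k in range(volume.shape[2])]

# Illustrative stand-in for one training volume: real Synapse volumes are
# 512 x 512 with 85-198 slices and would be loaded from disk.
volume = np.zeros((512, 512, 120), dtype=np.float32)
slices = extract_axial_slices(volume)
print(len(slices), slices[0].shape)  # 120 (512, 512)
```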
Automated Cardiac Diagnosis Challenge (ACDC). The ACDC dataset [54] is a benchmark specifically designed for cardiac MRI segmentation and diagnosis tasks, derived from real clinical examinations conducted at Dijon University Hospital. The dataset consists of 3D short-axis cardiac MRI sequences, along with expert-annotated segmentation masks for key cardiac structures. The label values and their corresponding cardiac regions are as follows: 0 for the background, 1 for the left ventricle (LV), 2 for the right ventricle (RV), and 3 for the myocardium (MYO). The dataset contains data from 150 patients, officially split into 100 patients (1902 axial slices) for training and 50 patients (1076 axial slices) for testing. For each patient, the dataset officially provides keyframe cardiac images at the end-diastolic and end-systolic phases. In our experiments, we further randomly split the 100 training samples into 80 training cases (1521 axial slices) and 20 validation cases (381 axial slices). To make the 3D volumetric data compatible with deep learning architectures that operate on 2D images, we extracted individual axial slices from each volume.
International Skin Imaging Collaboration (ISIC) 2018 Dataset. The ISIC 2018 dataset [55] was released by the ISIC organization as part of a skin lesion analysis challenge at MICCAI 2018, focusing on the automated processing of dermoscopic images. It contains high-quality clinical images of skin lesions and provides annotations for three tasks: lesion segmentation (Task 1), lesion attribute detection (Task 2), and disease classification (Task 3). In our work, we used only the data from Task 1, which includes 2594 training images, 100 validation images, and 1000 testing images, each accompanied by a binary lesion segmentation mask. These images cover a variety of skin conditions, including melanoma and melanocytic nevus.
Breast Ultrasound Images (BUSI) Dataset. This dataset [56], collected in 2018, consists of breast ultrasound images from 600 female patients aged 25 to 75. It contains a total of 780 images with an average resolution of 500 × 500 pixels, stored in PNG format and categorized into three classes: normal, benign, and malignant. Ground-truth segmentation masks are provided alongside the original images. We used 624 images for training and validation with data augmentation, while the remaining 156 images were used for testing.
3.2. Implementation Details
The implementation of our approach was developed using Python 3.8, TensorFlow 2.19 [57], and Keras 3.10.0. Our experiments were conducted on an Intel Xeon CPU with six cores and 39 MB of cache, and significantly accelerated with an NVIDIA A100 GPU equipped with 40 GB of VRAM, which is well suited to high-performance deep learning tasks. We adapted several networks from established sources, specifically keras-unet-collection (https://github.com/yingkaisha/keras-unet-collection, accessed on 1 October 2025), applying the modifications necessary to optimize them for our specific datasets. These adaptations were crucial in handling the datasets' unique characteristics.
Due to the increased complexity and multi-organ nature of the Synapse dataset, we resized all input images to 224 × 224 and normalized the pixel values to the range [0, 1] when training both the baseline methods and our proposed network. For the ACDC, ISIC 2018, and BUSI datasets, all images were resized to 128 × 128 and normalized to the same range. To enhance the generalization capability of our network and reduce overfitting, we applied four data augmentation techniques: horizontal flipping, vertical flipping, diagonal flipping, and random rotations within the range of [−20°, 20°]. Each transformation was applied with a random probability, allowing different augmentations to be independently combined to generate a more diverse and robust training set. The SGD optimizer was employed with a learning rate of 1 × 10^-2, a momentum of 0.9, and a weight decay of 1 × 10^-4. For the Synapse dataset, we used the hybrid loss (7), which equally weights categorical cross-entropy and Dice loss to balance pixel-level accuracy and region-level overlap. For the other three datasets, a plain Dice loss was adopted.
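The loss combination described above can be sketched framework-agnostically as follows. This is a minimal NumPy illustration of an equally weighted cross-entropy plus Dice objective, not the exact form of Eq. (7), and it assumes one-hot ground truth and softmax-normalized predictions.

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-6):
    # Region-level term: soft Dice computed over all pixels and classes.
    inter = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def cce_loss(y_true, y_pred, eps=1e-7):
    # Pixel-level term: categorical cross-entropy averaged over pixels.
    return -np.mean(np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)), axis=-1))

def hybrid_loss(y_true, y_pred):
    # Equal weighting of the pixel-level and region-level terms.
    return 0.5 * cce_loss(y_true, y_pred) + 0.5 * dice_loss(y_true, y_pred)

# A perfect one-hot prediction drives both terms (and hence the sum) to ~0.
y = np.array([[1.0, 0.0], [0.0, 1.0]])
print(round(hybrid_loss(y, y), 6))  # 0.0
```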
To ensure efficient utilization of computational resources, we adopted a batch size of 16 for all datasets. Training was conducted for 300 epochs on the ACDC dataset and 150 epochs on the Synapse, ISIC 2018, and BUSI datasets. A model checkpoint automatically monitored the minimum validation loss during training on the ACDC, ISIC 2018, and BUSI datasets, ensuring that the best-performing network was saved. For the Synapse dataset, previous works such as TransUNet [31] and SwinUNet [29] do not employ a separate validation set, as the training set contains only 18 volumes and further splitting would significantly reduce the effective training data. Following this common practice, we trained the model for a fixed number of epochs and periodically saved checkpoints, selecting the one that achieved the lowest training loss without exhibiting signs of overfitting. The final evaluation was then performed on the official testing set.
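The checkpoint rule used for ACDC, ISIC 2018, and BUSI (keep the weights that achieve the minimum validation loss) amounts to the following selection logic; in the actual Keras implementation this is typically delegated to a ModelCheckpoint callback with save_best_only=True, and the validation-loss values below are invented purely for illustration.

```python
# Hypothetical per-epoch validation losses (invented for illustration).
history = [0.41, 0.35, 0.37, 0.31, 0.33]

best_val = float("inf")
best_epoch = None
for epoch, val_loss in enumerate(history, start=1):
    if val_loss < best_val:   # new minimum validation loss
        best_val = val_loss
        best_epoch = epoch    # in practice: save the model weights here
print(best_epoch, best_val)   # 4 0.31
```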
The proposed architecture settings largely follow the widely used configurations of TransUNet and R2U-Net to ensure reproducibility and fair comparison. For the Transformer bottleneck, we adopted a ViT-Base style encoder, as in TransUNet [31]: the embedding dimension is 768, the number of Transformer layers is 12, and each layer uses 12 attention heads with an MLP expansion ratio of 4 (hidden dimension 3072). Since the encoder reduces the spatial resolution to 1/16 of the input, we tokenized the bottleneck feature map into non-overlapping tokens with an effective patch size of 1 × 1 (i.e., each spatial location corresponds to one token), and we used learnable positional embeddings. Unless otherwise stated, we followed the default TransUNet-style implementation with LayerNorm and residual connections; dropout was set to 0 in our implementation for stable training on the considered medical datasets.
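The tokenization step can be sketched as follows: a minimal NumPy illustration (not the actual Keras implementation) of flattening a bottleneck feature map into one token per spatial location and adding positional embeddings, assuming a 224 × 224 input reduced to a 14 × 14 bottleneck and a ViT-Base width of 768.

```python
import numpy as np

def tokenize_bottleneck(feat, pos_emb):
    """Flatten a (H, W, C) bottleneck feature map into (H*W, C) tokens,
    one token per spatial location, then add positional embeddings."""
    h, w, c = feat.shape
    return feat.reshape(h * w, c) + pos_emb

# 224x224 input at 1/16 resolution -> 14x14 bottleneck; ViT-Base width 768.
feat = np.random.rand(14, 14, 768).astype(np.float32)
pos_emb = np.zeros((14 * 14, 768), dtype=np.float32)  # learnable in practice
tokens = tokenize_bottleneck(feat, pos_emb)
print(tokens.shape)  # (196, 768)
```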
For the convolutional backbone, we used recurrent residual convolution blocks (RRCBs) following the design philosophy of R2U-Net [39]. Specifically, at each encoder/decoder stage, we set the number of RRCL layers per RRCB to 2 and the recurrent iteration number to 2. Each RRCL consists of a 3 × 3 convolution followed by Batch Normalization and ReLU activation, and a residual connection is applied at the RRCB level. The channel widths follow the standard U-Net pyramid (64, 128, 256, 512, and 1024 from shallow to deep), and the decoder mirrors the encoder. A 1 × 1 convolution is used for channel projection before entering the Transformer and for the final prediction head.
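The block structure described above can be traced with the following shape-level sketch; `toy_conv` is a hypothetical shape-preserving stand-in for the real convolution + BN + ReLU, so the sketch shows only the recurrence and residual wiring, not the learned computation.

```python
import numpy as np

def rrcl(x, conv, t=2):
    # Recurrent conv layer: re-inject the layer input at each recurrent step.
    out = conv(x)
    for _ in range(t):
        out = conv(x + out)
    return out

def rrcb(x, conv, num_layers=2, t=2):
    # Recurrent residual conv block: stacked RRCLs + block-level residual.
    out = x
    for _ in range(num_layers):
        out = rrcl(out, conv, t=t)
    return x + out  # residual connection at the RRCB level

# Hypothetical shape-preserving stand-in for conv + BN + ReLU.
toy_conv = lambda z: np.maximum(z, 0.0)
x = np.random.rand(8, 8, 64).astype(np.float32)
y = rrcb(x, toy_conv)
print(y.shape)  # (8, 8, 64)
```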
3.5. Qualitative Results
In Figure 3, the dark blue region indicates the aorta, the bright green region represents the gallbladder, and the red region corresponds to the left kidney. The cyan region marks the right kidney, while the purple region highlights the liver. The yellow area represents the pancreas, the light purple region corresponds to the spleen, and the nearly white region denotes the stomach. In Figure 4, the red region on the left indicates the right ventricle, the green circular region on the right corresponds to the myocardium, and the blue region on the right denotes the left ventricle. The following can be observed from Figure 3 and Figure 4:
(1) Our RCV-UNet achieves more complete target segmentation. For instance, in the first image example of Figure 3, the stomach is segmented most accurately and completely by the proposed RCV-UNet. In contrast, other methods, such as U-Net [18], produce segmentations with missing regions or internal holes. SwinUNet [29] and -Net [58] fail to capture the stomach region almost entirely. Furthermore, models including U-Net [18], Attention U-Net [26], U-Net++ [24], V-Net [23], ResUNet [27], TransUNet [31], and R2U-Net [39] tend to produce over-segmented outputs, erroneously dividing the stomach into multiple disconnected parts.
Similarly, in the fifth image of Figure 4, when segmenting the left ventricle, U-Net [18], UNet3+ [25], V-Net [23], TransUNet [31], SwinUNet [29], and -Net [58] all generate disjointed regions, while UNet++ [24] erroneously splits the myocardium into two parts. Additionally, ResUNet [27] overpredicts the extent of the left ventricle, while Attention U-Net makes incorrect predictions for the right ventricle. In contrast, only our RCV-UNet successfully and accurately segments all three regions: the right ventricle, myocardium, and left ventricle.
(2) Our RCV-UNet achieves more accurate target boundary segmentation. For example, in the second image of Figure 3, our proposed RCV-UNet produces the most accurate segmentation of the pancreas. Models such as UNet++ [24], UNet3+ [25], V-Net [23], ResUNet [27], SwinUNet [29], and -Net [58] fail to fully capture the sharp protrusion at the rightmost end of the pancreas, while U-Net [18], UNet++ [24], and TransUNet [31] are unable to correctly segment the leftmost portion of the organ.
Furthermore, in the first image of Figure 4, the mask generated by RCV-UNet is the closest to the ground truth (second row). Other methods, such as UNet++ [24], UNet3+ [25], V-Net [23], TransUNet [31], and SwinUNet [29], show poor segmentation of the right ventricle, and -Net [58] severely mispredicts the right ventricle region. Moreover, Attention U-Net [26], UNet3+ [25], V-Net [23], ResUNet [27], and TransUNet [31] fail to accurately segment the myocardium and left ventricle. Attention U-Net [26], UNet3+ [25], V-Net [23], and ResUNet [27], in particular, incorrectly predict the right side of the myocardium, producing protrusions where the ground truth shows a smoother, more circular shape. Additionally, U-Net [18] produces incorrect predictions in the myocardium region.
In Figure 5 and Figure 6, the red regions represent predicted positives, the green regions represent actual positives, and the overlapping areas appear in yellow. Accordingly, the red regions indicate false positives, the green regions indicate false negatives, and the yellow regions represent true positives. The following can be observed from Figure 5 and Figure 6:
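The color convention just described can be reproduced with a small helper. This is a minimal sketch assuming binary NumPy masks; the function name and array layout are our own illustration, not part of the released code.

```python
import numpy as np

def error_overlay(pred, gt):
    """Render a binary prediction against the ground truth:
    red = false positive, green = false negative, yellow = true positive."""
    img = np.zeros(pred.shape + (3,), dtype=np.uint8)
    img[(pred == 1) & (gt == 0)] = (255, 0, 0)    # FP: predicted only
    img[(pred == 0) & (gt == 1)] = (0, 255, 0)    # FN: ground truth only
    img[(pred == 1) & (gt == 1)] = (255, 255, 0)  # TP: overlap
    return img

pred = np.array([[1, 1, 0]])
gt = np.array([[0, 1, 1]])
print(error_overlay(pred, gt)[0].tolist())
# [[255, 0, 0], [255, 255, 0], [0, 255, 0]]
```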
(1) Our method demonstrates a stronger capability in complete target segmentation. For instance, in the second image of Figure 5, our proposed RCV-UNet achieves the most accurate and complete segmentation. V-Net [23], ResUNet [27], and -Net [58] exhibit incomplete results with internal holes and discontinuities in the lesion region. Other models either miss portions of the actual lesion or produce over-segmented outputs that extend beyond the lesion boundary.
Similarly, in the first image of Figure 6, compared to the other methods, our RCV-UNet achieves the highest overlap between the predicted mask and the ground truth (the largest yellow area) while producing the fewest incorrect pixels (the smallest red area). In the sixth image, the mask generated by our method is most consistent with the ground truth. Other methods, such as U-Net [18], UNet++ [24], and SwinUNet [29], yield a much smaller yellow area, indicating that they predict only a small portion of the target.
(2) Our method achieves more precise predictions, particularly in boundary delineation. For example, in the last image of Figure 5, U-Net [18], Attention U-Net [26], UNet++ [24], UNet3+ [25], V-Net [23], SwinUNet [29], and R2U-Net [39] all exhibit noticeable over-segmentation in the upper part of the lesion region (highlighted in red). Additionally, Attention U-Net [26], UNet3+ [25], and SwinUNet [29] also show over-segmentation in the lower part of the lesion. In contrast, V-Net [23], ResUNet [27], and TransUNet [31] suffer from clear under-segmentation in the lower region (highlighted in green). Compared with these methods, our proposed RCV-UNet produces a result that is closest to the ground truth, with only a small number of missed pixels in the upper-left area and slight over-segmentation in the upper-right region of the lesion.
Furthermore, in the fourth image of Figure 6, although most methods can roughly predict the target, SwinUNet [29] predicts a significant amount of incorrect regions, while UNet++ [24] identifies only a small portion of the target. Upon closer inspection of the target boundaries, methods such as U-Net [18], V-Net [23], and TransUNet [31] make incorrect predictions on the top and right boundaries of the target, while Attention U-Net [26], UNet3+ [25], V-Net [23], ResUNet [27], and -Net [58] fail to predict the bottom target area. Of the two remaining methods, R2U-Net and our RCV-UNet, RCV-UNet produces fewer false positives (smaller red areas), indicating better prediction performance.
These observations demonstrate that RCV-UNet excels in achieving more complete target segmentation and higher segmentation accuracy, particularly in boundary delineation. This is because RCV-UNet not only captures global contextual information but also effectively extracts and learns local features.
3.6. Quantitative Results
The data in Table 2, Table 3, Table 4, Table 5, and Table 6 clearly show the quantitative results of our proposed network compared with the other 10 baseline methods across the four datasets. The best result is highlighted in bold, and the second-best is underlined. The following can be observed:
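The per-case scores underlying these tables (and the boxplots and histograms in Figures 7 and 8) can be computed as follows. This is a minimal NumPy sketch for binary masks, with function names of our own choosing; multi-class Synapse/ACDC scores are obtained per organ or structure in the same way.

```python
import numpy as np

def dice_coef(pred, gt, eps=1e-6):
    # Dice = 2|A∩B| / (|A| + |B|), with eps for empty-mask stability.
    inter = np.sum(pred * gt)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(gt) + eps)

def iou_score(pred, gt, eps=1e-6):
    # IoU = |A∩B| / |A∪B|.
    inter = np.sum(pred * gt)
    union = np.sum(pred) + np.sum(gt) - inter
    return (inter + eps) / (union + eps)

pred = np.array([[1, 1, 0, 0]], dtype=np.float64)
gt = np.array([[1, 0, 1, 0]], dtype=np.float64)
print(round(dice_coef(pred, gt), 3), round(iou_score(pred, gt), 3))  # 0.5 0.333
```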
(1) For the Synapse dataset, it is evident that segmentation of the gallbladder, pancreas, and stomach is relatively challenging. As shown in Table 2, RCV-UNet achieves outstanding performance, with an average Dice score of 0.812, outperforming the second-best method by 3.5%. Specifically, RCV-UNet ranks first among all competitors on five of the eight organs (gallbladder, left kidney, right kidney, pancreas, and spleen) while ranking second for the aorta, liver, and stomach, falling behind by 0.5%, 0.6%, and 2.0%, respectively. In Figure 7a, it can be observed that our method significantly outperforms the others for the aorta, left kidney, and spleen, with boxplots that are not only narrower but also distributed higher overall. The segmentation performance for the gallbladder is also better, as indicated by a higher median value. Although the boxplot for the right kidney is relatively wide, its median is higher than that of the competing methods. For the liver and stomach, our method performs slightly worse than the top-performing TransUNet [31]. Moreover, as shown in Figure 8a, although the Dice distribution of our model does not exhibit a strong right-skewed trend, it still consistently outperforms the other methods.
(2) For the ACDC dataset, a multi-class segmentation task, it can be observed that segmenting the RV and MYO is more challenging than the LV, as reflected by the higher Dice scores achieved for the LV. From Table 3, RCV-UNet again demonstrates excellent performance, with its average metrics ranking first across all categories. From Table 4, RCV-UNet is particularly outstanding in segmenting the RV, where it surpasses the second-best method by 1% in Dice and 1.4% in IoU. As can be seen from Figure 7b, the distributions of Dice coefficients for the RV, MYO, and LV obtained by RCV-UNet are relatively concentrated. This is particularly evident in the MYO and LV categories, where the interquartile range of the boxplot is narrower, indicating higher prediction stability. Additionally, the median Dice coefficient of our method in the RV and LV categories is significantly higher than that of the other competitors. In the MYO category, the performance of RCV-UNet decreases slightly, but it remains within a reasonable range. As shown in the histogram in Figure 8b, the Dice score distribution predicted by RCV-UNet is noticeably skewed to the right compared to the other 10 methods, with a higher concentration between 0.8 and 1.0. This indicates that RCV-UNet demonstrates superior overall segmentation performance on the test samples.
(3) For the ISIC 2018 dataset, our network achieves the best performance across four evaluation metrics: Dice, IoU, Accuracy, and Sensitivity. Specifically, as shown in Table 5, our method achieves Dice, IoU, and Sensitivity scores of 0.870, 0.771, and 0.857, respectively, surpassing the second-best competitor by 1.2%, 1.9%, and 2.7%. As illustrated in the boxplot in Figure 7c, our method outperforms the other approaches in the median, first-quartile, and third-quartile Dice values, and exhibits a more compact interquartile range. Furthermore, Figure 8c shows that the Dice distribution of RCV-UNet is clearly right-skewed, with the highest proportion of cases falling within the 0.8–1.0 range; notably, the proportion of predictions in the 0.98–1.0 interval is markedly higher than that of the other methods. These results demonstrate the superior segmentation performance of RCV-UNet.
(4) For the BUSI dataset, our network achieves the best performance across all six metrics. Specifically, the Dice, IoU, and precision (Pre) scores reach 0.773, 0.630, and 0.879, respectively, surpassing the second-best competitor by 3.5%, 4.5%, and 5%, as shown in Table 6. From the Dice coefficient boxplot in Figure 7d, it is evident that RCV-UNet exhibits higher median, first-quartile, and third-quartile values than the competing methods. Moreover, its interquartile range is the smallest, indicating more stable performance, and the smaller number of outliers demonstrates greater robustness across a wide range of samples. From the histogram in Figure 8d, it is evident that the Dice distribution of RCV-UNet exhibits a stronger right skew with a higher peak density, highlighting its advantage in segmentation quality.
3.7. Ablation Study
3.7.1. Effect of Different Network Settings
Figure 9 presents two competitor architectures and eight variants of our network, and Table 7 reports the detailed results.
Effectiveness of combining RRCNN with a bottleneck ViT. Comparing (a) with (c) shows a substantial gain (+0.039) achieved by replacing CNN blocks with RRCNN blocks in the encoder–decoder. Similarly, comparing (b) with (c) shows that replacing an RRCNN bottleneck with a ViT bottleneck yields a notable improvement (+0.034). These results indicate that integrating RRCNN and ViT within a U-Net architecture is effective.
First decoder stage: CNN block outperforms RRCNN. The comparisons between (c) and (d), (e) and (f), and (i) and (j) indicate that using a simple CNN block in the first decoder stage yields better performance than an RRCNN block. This stage primarily reconstructs spatial details from globally encoded features and fuses them with the deepest skip connection; local convolutions provide more stable statistics and preserve high-frequency boundaries, whereas recurrent refinement at this low resolution tends to over-smooth and complicate optimization.
Deepest skip connection: Consistent gains. The comparisons between (c) and (e) and between (d) and (f) show that retaining the deepest skip connection (i.e., one level above the bottleneck) consistently improves performance. The bottleneck skip injects strong global semantics into the first decoder stage, enhances cross-scale consistency during early up-sampling, and shortens the gradient path; this coarse prior complements shallow skips that refine boundaries, stabilizing feature fusion.
ViT placement across scales: bottleneck-only is best and most efficient. Comparing (c), (g), (h), and (i) indicates that placing a single ViT at the bottleneck achieves the best performance; adding ViT to the encoder and bottleneck or to the decoder and bottleneck yields similar but inferior results, and using ViT in the encoder, decoder, and bottleneck narrows the gap yet still lags behind (c). We attribute this to the bottleneck being the most effective location for aggregating global context while CNN/RRCNN blocks recover local spatial details; additional ViT blocks introduce heterogeneous feature statistics across skip connections and extra normalization/attention overhead, complicating fusion and optimization. Correspondingly, inserting ViT blocks into the encoder or decoder significantly increases FLOPs/parameters and notably reduces training throughput.
Overall, placing ViT blocks only at the bottleneck, using RRCNN blocks in the main encoder–decoder (with a CNN block in the first decoder stage), and retaining the deepest skip connection yields the best performance and efficiency, whereas extending ViT to the encoder or decoder provides limited gains while markedly increasing computational cost and reducing throughput.
3.7.2. Effect of the Number of Skip Connections
We ablate the number of skip connections from 0 to 5 while holding all other settings fixed (Table 8). The largest single-step improvement arises when introducing the first skip: Dice increases from 0.765 to 0.855 (+0.090) and IoU from 0.625 to 0.748 (+0.123), confirming that the mere presence of skip connections is pivotal for recovering fine structures and stabilizing optimization.
Beyond the first skip, additional skips yield steady—albeit smaller—gains. In particular, the fifth skip—i.e., the deep skip adjacent to the ViT bottleneck—provides the largest marginal improvement among the added skips (4→5): Dice improves from 0.869 to 0.889 (+0.020) and IoU from 0.771 to 0.802 (+0.031). This indicates that injecting high-level semantics from the stage immediately preceding the bottleneck into the decoder is especially effective, reducing boundary ambiguity and mitigating both false positives and false negatives.
Overall, performance improves with the number of skips, with two effects standing out: (i) the presence versus absence of skip connections drives the largest jump, and (ii) among the configurations that include skips, the bottleneck-adjacent (fifth) skip has the strongest incremental impact. Consequently, we adopted five skip connections in the final model, which achieves the best overall results.
3.7.3. Effect of Different Up-Sampling Methods
As shown in Table 9, transposed convolution achieves the best performance, clearly outperforming bilinear and nearest-neighbor interpolation. We attribute this advantage to its learnable parameters, which enable adaptive restoration of structural details and sharper boundary delineation during up-sampling. In contrast, bilinear interpolation, although smooth and computationally efficient, often over-smooths boundaries and reduces overlap accuracy, while nearest-neighbor interpolation produces blocky artifacts and loses fine detail. These results highlight that learnable up-sampling provides a more effective mechanism for accurate medical image segmentation.
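The contrast between fixed and learnable up-sampling can be made concrete with a toy 2× example: a minimal NumPy sketch (not the Keras implementation) in which a stride-2 transposed convolution with a 2 × 2 kernel reduces exactly to nearest-neighbor repetition when the kernel is all ones, while training is free to move the kernel away from that fixed behavior.

```python
import numpy as np

def nearest_upsample_2x(x):
    # Fixed rule: repeat every pixel twice along each axis.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def transposed_conv_2x(x, kernel):
    # Stride-2 transposed convolution with a 2x2 kernel: every input pixel
    # "paints" a 2x2 output patch weighted by the (learnable) kernel.
    h, w = x.shape
    out = np.zeros((2 * h, 2 * w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
# An all-ones kernel reproduces nearest-neighbour up-sampling exactly.
same = np.array_equal(transposed_conv_2x(x, np.ones((2, 2))), nearest_upsample_2x(x))
print(same)  # True
```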