RE-XswinUnet: Rotary Positional Encoding and Inter-Slice Contextual Connections for Multi-Organ Segmentation

Yang, Hang; Yang, Chuanghua; Yang, Dan; Hang, Xiaojing; Liu, Wu

doi:10.3390/bdcc9110274


by Hang Yang ¹, Chuanghua Yang ¹,*, Dan Yang ¹, Xiaojing Hang ² and Wu Liu ¹

¹ School of Physics and Telecommunication Engineering, Shaanxi University of Technology, Hanzhong 723001, China

² School of Communications and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710061, China

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(11), 274; https://doi.org/10.3390/bdcc9110274

Submission received: 23 August 2025 / Revised: 21 October 2025 / Accepted: 29 October 2025 / Published: 31 October 2025


Abstract

Medical image segmentation has been a central research focus in deep learning, but convolution-based methods are limited in modeling long-range dependencies in images. To overcome this issue, hybrid CNN-Transformer architectures have been explored, with SwinUNet being a classic approach. However, SwinUNet still faces challenges such as insufficient modeling of relative position information, limited feature fusion capabilities in skip connections, and the loss of translational invariance caused by Patch Merging. To overcome these limitations, the architecture RE-XswinUnet is presented as a novel solution for medical image segmentation. In our design, relative position biases are replaced with rotary position embedding to enhance the model’s ability to extract detailed information. During the decoding stage, XskipNet is designed to improve cross-scale feature fusion and learning capabilities. Additionally, an SCAR Block downsampling module is incorporated to preserve translational invariance more effectively. The experimental results demonstrate that RE-XswinUnet achieves improvements of 2.65% and 0.95% in Dice coefficients on the Synapse multi-organ and ACDC datasets, respectively, validating its superiority in medical image segmentation tasks.

Keywords:

deep learning; medical image segmentation; multi-organ CT; SwinUNet; RoPE

1. Introduction

With the continuous advancement of medical imaging technologies, medical image segmentation has been established as an indispensable component of computer-aided diagnostic systems [1]. Accurate and robust segmentation results are essential for assisting clinicians in lesion localization, lesion assessment, and preoperative planning. These improvements enhance both the efficiency and accuracy of diagnosis and treatment [2,3,4,5]. Since the success of AlexNet on ImageNet [6], convolutional neural networks (CNNs) have become the dominant paradigm in computer vision and have driven notable progress in medical image analysis [7,8].

CNN-based architectures, particularly U-Net [9], have been recognized as highly effective for medical image segmentation owing to their symmetric encoder–decoder design and skip connections [1]. Inspired by U-Net, numerous modifications have been proposed to strengthen feature extraction capabilities and adapt to diverse tasks [10,11,12]. For example, V-Net [13] integrates residual learning into 3D segmentation, making it particularly suitable for structural preservation in volumetric tasks. Three-dimensional U-Net [14] has been adapted to handle volumetric data, capturing inter-slice contextual information to improve segmentation in CT and MRI scans. Attention U-Net [15] has incorporated an attention gating mechanism to emphasize small organs or tumor boundaries, proving effective in noisy or low-contrast environments. UNet++ [16,17], by employing nested and dense skip connections, has been shown to mitigate semantic discrepancies between the encoder and decoder, thereby integrating high-level and low-level features more effectively [8,18]. nnU-Net [19], through its adaptive setup, has consistently achieved top-tier performance in multiple segmentation benchmarks. Moreover, ResUNet [20] and DenseUNet [21] have introduced residual and dense connection structures, respectively, which enhance gradient propagation and feature reuse. Despite their demonstrated effectiveness, CNN-based approaches are still limited by the fixed receptive field of convolutional operations, limiting their capacity to capture long-range dependencies, particularly for complex organ structures and morphological variations [5,7,8,16].

To overcome this limitation, Transformer architectures [22,23] have been introduced into computer vision due to their ability to model global dependencies. Methods such as Vision Transformers (ViTs) and DeiT [24] have demonstrated the feasibility of applying Transformers to visual tasks. ViT divides images into fixed-size patches and treats each patch as a token input to the self-attention module [18,25]. However, the pure Transformer architecture relies on large-scale training data and lacks spatial inductive biases, resulting in limited ability to model fine-grained structures and hindering its direct application to medical image segmentation [26,27]. More recently, considerable research has focused on hybrid CNN-Transformer architectures [28,29,30]. These models typically use dual-branch or embedded designs that integrate Transformer modules into a CNN backbone [31]. TransUNet [16], a representative hybrid architecture, incorporates ViTs modules between the encoder and decoder, employing self-attention to capture global semantic information and thereby improve segmentation accuracy [8,16]. The Segmentation Transformer (SETR) [32] utilizes a pure Transformer encoder to extract global features for semantic segmentation, combined with multi-scale reconstruction modules for precise prediction. CoTr (CNN-Transformer Hybrid) [33] adopts a parallel branch design in which CNNs and Transformers extract local and global features separately, while an attention fusion module enables information interaction. TransBTS [34] focuses on 3D medical images by embedding a Transformer into the bottleneck layer of a 3D U-Net to enhance volumetric structural continuity. UNetr [35] employs a pure Transformer encoder with a U-Net-style decoder to maintain global modeling capabilities while restoring spatial details. 
SwinUNet [36] adopts a U-Net backbone augmented with the hierarchical Swin Transformer [37], enabling the integration of windowed attention mechanisms and contextual feature modeling, and has become a prominent baseline in medical image segmentation research [18,38]. Nevertheless, despite their impressive results, most of these dual-branch methods have a limitation. They rely on direct connections at specific scales during encoding and decoding. This design limits their ability to effectively extract multi-scale information [17].

The Transformer architecture relies on self-attention to model relationships among all elements in a sequence, but it inherently lacks positional information. Unlike text, images exhibit strong spatial structures in which pixel layout is crucial for interpretation and segmentation [39,40]. Vision Transformers (ViTs) therefore typically incorporate explicit positional encoding mechanisms. Common strategies include absolute position encoding (APE) and relative position encoding (RPE) [41,42]. APE, widely used in early Transformers (e.g., ViTs), lacks translational invariance, is sensitive to input size, and generalizes poorly. RPE instead models relative spatial relationships between elements, yielding stronger generalization ability [43]. SwinUNet employs relative position bias (RPB), where learnable bias weights are introduced for each positional pair within a local window to adjust the attention score matrix. However, because RPB is embedded after the multiplication of query (Q) and key (K), relative positional information cannot directly participate in the computation of attention weights [42,44,45].

In summary, despite the advantages of existing methods and the trend towards hybrid models, key challenges remain unresolved. These include: (1) the ineffective participation of local RPE in attention calculations, which limits complex structure modeling; (2) overly simplistic skip connections that lack multi-scale semantic interaction; and (3) downsampling operations that disrupt spatial structures and impair translational invariance. To address these limitations, an enhanced SwinUNet-based architecture, RE-XSwinUNet, is proposed. First, RoPE is introduced to replace the traditional relative position bias (RPB), thereby enhancing the Transformer’s spatial modeling and detail perception capabilities. Second, a multi-scale skip connection block, XskipNet, is designed to achieve rich semantic fusion. Third, an SCAR Block downsampling module is designed to enhance feature structural preservation and cross-block interaction capabilities.

The main contributions of this work are summarized as follows:

(1): We integrate RoPE into SwinUNet, verifying the feasibility of RoPE in medical image segmentation and enhancing detail modeling capabilities.
(2): A plug-and-play multi-scale skip connections block, XskipNet, is designed to enhance semantic information fusion capabilities.
(3): We design SCAR Block, a novel downsampling module that preserves structural information while improving translational invariance.
(4): Extensive experiments on the Synapse and ACDC datasets demonstrate competitive performance, and ablation experiments confirm the efficacy and generalizability for every component.

The paper is organized as follows: Section 2 provides a detailed introduction to the overall architecture and key module design of RE-XSwinUNet. Section 3 presents evaluation and visualization analysis on the Synapse dataset and the ACDC dataset. Section 4 provides a detailed discussion of each module’s contribution. Section 5 concludes with a summary and analysis of this paper.

2. Materials and Methods

In this section, we will provide a detailed account of the RE-XSwinUNet network architecture and the design of its key modules. The methodology is divided into four parts: the overall network architecture, Rotary Position Embedding (RoPE), the multi-scale skip connections block (XskipNet), and the structure-preserving downsampling module (SCAR Block). Each module is designed to address specific limitations of SwinUNet, with improvements justified from both theoretical and practical perspectives.

2.1. Overall Architecture

The overall architecture of the proposed RE-XSwinUNet is illustrated in Figure 1. The encoder–decoder structure of SwinUNet is retained. In the encoder, Rotary Position Embedding (RoPE) is introduced during the window multi-head self-attention stage to replace the original relative position bias (RPB), thereby enhancing the modeling of local and global details. During downsampling, the SCAR Block is employed to preserve structural information, improve translational invariance, and facilitate cross-block information interaction. In the decoder, multi-scale feature fusion is achieved using XskipNet skip connections, promoting the integration of semantic information across scales and improving the model’s generalization capability.

Specifically, an input image $I \in \mathbb{R}^{H \times W \times 3}$ is first processed through a patch embedding layer using a convolution with a stride of 4 and 96 output channels, producing a feature map $F_{0} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$ with $C = 96$. The feature map $F_{0}$ is first encoded by a SwinUNet Transformer Block with RoPE. It is then downsampled by the SCAR Block to generate four multi-scale features:

$$\left\{F_{1} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times 96},\; F_{2} \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 192},\; F_{3} \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 384},\; F_{4} \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 768}\right\}$$

The decoder gradually restores spatial resolution through Patch Expanding and RoPE-SwinUNet Transformer Blocks. At each scale of the decoder, the four multi-scale features generated by the encoder are cross-fused through XskipNet. Finally, a segmentation probability map is generated through a fully connected layer.
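As a quick sanity check of the feature pyramid described above, the following sketch computes the encoder feature shapes for a given input size. It assumes a 224 × 224 input (the resolution used later in the experiments), a base width of 96 channels, and channel doubling at every downsampling step. The function name `encoder_feature_shapes` is illustrative and not part of the original implementation.

```python
def encoder_feature_shapes(height, width, base_channels=96, num_stages=4):
    """Return (H, W, C) for each encoder stage.

    Assumes the patch embedding reduces resolution by 4 and that every
    later stage halves the resolution and doubles the channel count.
    """
    shapes = []
    for stage in range(num_stages):
        stride = 4 * (2 ** stage)                # 4, 8, 16, 32
        channels = base_channels * (2 ** stage)  # 96, 192, 384, 768
        shapes.append((height // stride, width // stride, channels))
    return shapes


if __name__ == "__main__":
    for index, shape in enumerate(encoder_feature_shapes(224, 224), start=1):
        print(f"F{index}: {shape}")
```

Under these assumptions the deepest feature map is 7 × 7 with 768 channels, which is the size the decoder has to expand back to the input resolution.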

2.2. Rotary Position Embedding

Rotary Position Embedding (RoPE) is a method of encoding relative position information into the Transformer architecture [39,42,43]. Despite its prevalence in natural language processing tasks like text classification and machine translation, its application in computer vision remains limited. In Swin Transformer, a relative position bias (RPB) is added to the attention score. Let $X_{i = 1}^{n} = [x_{1}, x_{2}, \dots, x_{n}]$ represent $n$ token embedding vectors. The query ($Q$), key ($K$), and value ($V$) matrices are derived by applying linear transformations to the input representations, as expressed in Equation (1):

$$Q = W_{q} \cdot X_{i = 1}^{n}, \quad K = W_{k} \cdot X_{i = 1}^{n}, \quad V = W_{v} \cdot X_{i = 1}^{n} \tag{1}$$

where $W_{q}$, $W_{k}$, and $W_{v}$ represent the self-attention projection weights. The query at position $m$ and the key at position $n$ are represented by $Q_{m}$ and $K_{n}$, respectively. The attention score can then be obtained as shown in Equation (2):

$$A(Q_{m}, K_{n}) = \mathrm{softmax}\left(\frac{Q_{m}^{T} \cdot K_{n}}{\sqrt{d}}\right) \tag{2}$$

where $d$ is the dimension of the key vector per head, and $Q_{m}^{T}$ denotes the transpose of $Q_{m}$. The RPB variant is obtained by adding the relative position bias $B$ to the scaled inner product of Equation (2), as shown in Equation (3):

$$A(Q_{m}, K_{n}) = \mathrm{softmax}\left(\frac{Q_{m}^{T} \cdot K_{n}}{\sqrt{d}} + B\right) \tag{3}$$

However, this bias is applied after the inner product, which limits its influence on the similarity between queries and keys. RoPE uses Euler’s formula ($e^{j \varphi} = \cos \varphi + j \sin \varphi$) to embed relative position information directly into $Q_{m}$ and $K_{n}$. The RoPE transformation is defined in Equation (4):

$$Q_{m}^{'} = Q_{m} e^{j m \varphi}, \quad K_{n}^{'} = K_{n} e^{j n \varphi} \tag{4}$$

Substituting Equation (4) into Equation (2) yields the attention scores for the Rotary Position Embedding, as shown in Equation (5):

$$A^{'}(Q_{m}, K_{n}, n - m) = \mathrm{softmax}\left(\frac{\mathrm{Re}\left[Q_{m}^{T} \cdot K_{n} \cdot e^{j (n - m) \varphi}\right]}{\sqrt{d}}\right) \tag{5}$$

where $\mathrm{Re}[\cdot]$ denotes the real part, and the relative position $n - m$ is inherently captured in the inner product. In real-valued matrix form, the rotated query or key at position $m$ is given by Equation (6):

$$f(x_{m}, m) = R(m) W_{\{q, k\}} x_{m} \tag{6}$$

where $x_{m}$ is the $m$-th input embedding, $R(m)$ is the Rotary Position Embedding matrix, and $W_{\{q, k\}}$ is the query or key projection matrix. In actual calculations, $R(m)$ is the block-diagonal matrix shown in Equation (7):

$$R(m) = \begin{pmatrix} \cos m \varphi_{1} & - \sin m \varphi_{1} & 0 & 0 & \cdots & 0 & 0 \\ \sin m \varphi_{1} & \cos m \varphi_{1} & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m \varphi_{2} & - \sin m \varphi_{2} & \cdots & 0 & 0 \\ 0 & 0 & \sin m \varphi_{2} & \cos m \varphi_{2} & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m \varphi_{d / 2} & - \sin m \varphi_{d / 2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m \varphi_{d / 2} & \cos m \varphi_{d / 2} \end{pmatrix} \tag{7}$$

Note that $R(m)$ is an orthogonal matrix, so $R(m)^{T} R(n) = R(n - m)$. Going one step further, we can obtain the product of the rotated $Q_{m}$ and $K_{n}$, as shown in Equation (8):

$$Q_{m}^{T} K_{n} = \left(R(m) W_{q} x_{m}\right)^{T} \cdot \left(R(n) W_{k} x_{n}\right) = x_{m}^{T} W_{q}^{T} R(n - m) W_{k} x_{n} \tag{8}$$

Figure 2 shows the specific implementation block diagram of RoPE in this paper. Before the windowed self-attention is computed, the relative position information is fused into the phase angle of Euler’s formula. This is achieved by rotating the query and key vectors with the complex exponential function, after which the attention scores are computed. This approach effectively captures the position-aware similarity between queries and keys, thereby augmenting the model’s capacity for relative position representation and detailed information extraction.
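The relative-position property of the rotation can be checked numerically. The sketch below is a minimal one-dimensional illustration in plain Python, not the authors' windowed two-dimensional implementation. The frequency schedule `base ** (-2i/d)` and the helper names `rope_frequencies`, `rope_rotate`, and `dot` are assumptions introduced here for illustration; the toy vectors stand in for the projected query and key.

```python
import math


def rope_frequencies(dim, base=10000.0):
    """One rotation frequency per 2-D pair (assumed RoFormer-style schedule)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_rotate(vec, position, freqs):
    """Apply the block-diagonal rotation R(position) to vec, pair by pair."""
    out = []
    for i, phi in enumerate(freqs):
        angle = position * phi
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # 2x2 rotation block: [[cos, -sin], [sin, cos]]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    freqs = rope_frequencies(4)
    q = [0.5, -1.0, 2.0, 0.25]  # toy stand-in for a projected query
    k = [1.5, 0.3, -0.7, 1.0]   # toy stand-in for a projected key
    # Same offset n - m = 3 at two different absolute positions.
    score_a = dot(rope_rotate(q, 2, freqs), rope_rotate(k, 5, freqs))
    score_b = dot(rope_rotate(q, 10, freqs), rope_rotate(k, 13, freqs))
    print(abs(score_a - score_b) < 1e-9)
```

Because each block is a pure rotation, the score depends only on the offset between the two positions, and the vector norms are left unchanged.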

2.3. XskipNet Block

In the baseline SwinUNet, skip connections only concatenate features from encoder and decoder at the same resolution. This approach has two serious issues: First, concatenation at specific dimensions makes it difficult to fuse multi-scale information, reducing the model’s ability to capture details and semantic information; Second, direct concatenation may cause conflicts between shallow (detail-rich) and deep (semantic-rich) features.

To address these issues, the XskipNet block is proposed (Figure 3). Through multi-scale cross-fusion and attention-guided mechanisms within each skip connection, the model effectively merges detailed shallow features with rich semantic information from deeper layers. This integration significantly improves the accuracy of segmenting fine edges and small organ structures. Specifically, for a target skip-connection scale $S_{t}$ (e.g., scale $\frac{1}{8}$), the corresponding encoder feature $E_{t}$ is used. In addition, the encoder feature $E_{t - 1}$ is resized to the resolution of $E_{t}$ using bilinear interpolation. Meanwhile, its channel count is adjusted by a $1 \times 1$ convolution to align with the number of channels in $E_{t}$, yielding multi-scale features that share a uniform channel count. The multi-scale features are concatenated with the original scale features $E_{t}$ in the channel dimension to obtain the multi-scale fusion features $E_{t}^{'}$, which are then fed into the Convolutional Block Attention Module (CBAM) for spatial and channel attention weighting [16,46,47]. The reduction ratio for channel attention is set to 16, and spatial attention uses a $7 \times 7$ convolution to generate spatial weight maps. The corresponding scale features in the decoder are enhanced through residual addition of the attention-weighted features, thereby achieving feature fusion.

For target scale $\frac{1}{8}$, its multi-scale fusion feature $E_{t}^{'}$ is shown in Equation (9):

$$E_{t}^{'} = \mathrm{CBAM}\left(\mathrm{Concat}\left(E_{t}, \mathrm{Conv}\left(\mathrm{Bilinear}\left(E_{t - 1}\right)\right)\right)\right) \tag{9}$$

Here, $E_{t}$ is the $\frac{1}{8}$-scale feature of the encoder, $E_{t - 1}$ is the $\frac{1}{4}$-scale feature of the encoder, $\mathrm{Bilinear}(\cdot)$ represents bilinear downsampling, $\mathrm{Conv}(\cdot)$ is the convolution layer for adjusting channels, and $\mathrm{CBAM}(\cdot)$ represents the attention weighting operation.

Furthermore, the decoder’s multi-scale fusion feature $F_{t}$ is shown in Equation (10):

$$F_{t} = \mathrm{Concat}\left(E_{t}^{'} + S_{t}\right) \tag{10}$$

Similarly, the fusion feature $E_{t + 1}^{'}$ of target scale $\frac{1}{16}$ can be obtained as shown in Equation (11):

$$E_{t + 1}^{'} = \mathrm{CBAM}\left(\mathrm{Concat}\left(E_{t + 1}, \mathrm{Conv}\left(\mathrm{Bilinear}\left(E_{t}^{'}\right)\right)\right)\right) \tag{11}$$

The fusion feature $F_{t + 1}$ of the decoder’s target scale $\frac{1}{16}$ is shown in Equation (12):

$$F_{t + 1} = \mathrm{Concat}\left(E_{t + 1}^{'} + S_{t + 1}\right) \tag{12}$$

The XskipNet block, which is designed to combine information between shallow details and deep semantics effectively, improves the accuracy of edge and small organ segmentation. At the same time, it maintains multi-scale consistency and feature complementarity, reducing conflicts and redundant information introduced by direct splicing.
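To make the tensor bookkeeping of Equation (9) concrete, the sketch below traces feature shapes through the bilinear resize, the 1 × 1 convolution, and the channel concatenation. It is a shape-only illustration under the assumption of a 224 × 224 input with 96 base channels. The function name `xskip_fusion_shapes` is hypothetical, and the trace stops at the concatenation because the text does not specify how the channel count is handled after CBAM.

```python
def xskip_fusion_shapes(shallow, target):
    """Trace (H, W, C) shapes through the fusion steps of Equation (9).

    shallow: shape of the higher-resolution encoder feature E_{t-1}
    target:  shape of the encoder feature E_t at the target scale
    """
    target_h, target_w, target_c = target
    # Bilinear: spatial size is matched to E_t, channels are unchanged.
    resized = (target_h, target_w, shallow[2])
    # 1x1 Conv: channel count is matched to E_t.
    projected = (target_h, target_w, target_c)
    # Concat: E_t and the projected feature are stacked along channels.
    concatenated = (target_h, target_w, 2 * target_c)
    return {"bilinear": resized, "conv1x1": projected, "concat": concatenated}


if __name__ == "__main__":
    # 1/4-scale and 1/8-scale encoder features for a 224 x 224 input.
    for step, shape in xskip_fusion_shapes((56, 56, 96), (28, 28, 192)).items():
        print(step, shape)
```

CBAM reweights features without changing their shape, so the attention step leaves the concatenated size untouched.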

2.4. SCAR Block

In Swin Transformer, Patch Merging reduces the spatial resolution while increasing the number of channels [48,49,50]. The input feature map $x \in \mathbb{R}^{B \times H \times W \times C}$ is first subsampled by taking every second pixel along both spatial dimensions, once for each of the four offsets in a $2 \times 2$ neighborhood. The four subsampled maps are then concatenated along the channel axis, producing the transformed feature map $x_{1} \in \mathbb{R}^{B \times \frac{H}{2} \times \frac{W}{2} \times 4 C}$. Finally, a fully connected layer reduces the channel count from $4C$ to $2C$, yielding the output $f(x) \in \mathbb{R}^{B \times \frac{H}{2} \times \frac{W}{2} \times 2 C}$. Patch Merging is expressed in Equation (13):

$$f(x) = \mathrm{MLP}\left(\mathrm{Norm}\left(M_{merge}(x)\right)\right) \tag{13}$$

Here, $M_{merge}(\cdot)$ is the checkerboard sampling and stitching operation, $\mathrm{Norm}(\cdot)$ is the normalization layer, and $\mathrm{MLP}(\cdot)$ is the fully connected layer responsible for channel compression. This method limits the exchange of non-local information across blocks. When the number of channels is changed through a fully connected layer, the spatial structure of the feature map is altered, thereby destroying translational invariance. To address this issue, we designed a new downsampling module called the SCAR Block.
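The checkerboard sampling step of Equation (13) can be sketched on a small nested list. This is only the $M_{merge}$ operation for a single image; the normalization and the fully connected layer are omitted. The offset order follows the one commonly used in Swin Transformer implementations, and the function name `patch_merge` is illustrative.

```python
def patch_merge(x):
    """Checkerboard sampling of an H x W x C nested list into H/2 x W/2 x 4C.

    Offset order: (even row, even col), (odd row, even col),
    (even row, odd col), (odd row, odd col).
    """
    height, width = len(x), len(x[0])
    if height % 2 or width % 2:
        raise ValueError("H and W must be even")
    offsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
    merged = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            channels = []
            for di, dj in offsets:
                # Stack the four neighbours along the channel axis.
                channels.extend(x[i + di][j + dj])
            row.append(channels)
        merged.append(row)
    return merged


if __name__ == "__main__":
    # 4 x 4 image with one channel; pixel value = row * 4 + col.
    image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
    for row in patch_merge(image):
        print(row)
```

Each output position mixes four pixels that were neighbours in the input, which is why the subsequent fully connected layer operates on a rearranged spatial layout.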

Figure 4 shows the SCAR Block network architecture design. This module performs downsampling without destroying the original feature map spatial structure. The module uses standard 2D convolution for downsampling and introduces the Squeeze Excitation (SE) [51] attention mechanism to highlight important channel features and suppress irrelevant ones. Concurrently, the residual connections are incorporated within the module. This design choice effectively mitigates the vanishing gradient problem prevalent in deep networks, thereby enhancing the stability of the network throughout the training process.

Specifically, the main branch first passes through a $3 \times 3$ convolution (stride = 2) and a normalization layer to transform the input feature map of size $B \times C \times H \times W$ into $B \times 2 C \times \frac{H}{2} \times \frac{W}{2}$. The output is then fed into the Squeeze Excitation module to obtain a weight vector of size $B \times 2 C \times 1 \times 1$, which is multiplied channel-wise with the feature map entering the module to perform channel weighting. The residual branch uses a $1 \times 1$ convolution (stride = 2) and normalization to perform downsampling and channel number adjustment. The outputs of the two branches are summed and then passed through a ReLU activation function. This procedure reduces the spatial resolution while strengthening the model’s ability to extract salient features at different scales during the downsampling phase.
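For the two branches to be summed, they must produce identical shapes. The sketch below checks this with the standard convolution output-size formula. The padding values (1 for the 3 × 3 branch, 0 for the 1 × 1 branch) are assumptions, since the text does not state them, and the function names are illustrative.

```python
def conv_output_size(size, kernel, stride, padding):
    """Standard convolution output-size formula (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def scar_branch_shapes(channels, height, width):
    """Shapes (C, H, W) produced by the main and residual SCAR branches.

    Assumed padding: 1 for the 3x3 convolution, 0 for the 1x1 convolution.
    """
    main = (2 * channels,
            conv_output_size(height, 3, 2, 1),
            conv_output_size(width, 3, 2, 1))
    residual = (2 * channels,
                conv_output_size(height, 1, 2, 0),
                conv_output_size(width, 1, 2, 0))
    return main, residual


if __name__ == "__main__":
    main_shape, residual_shape = scar_branch_shapes(96, 56, 56)
    print(main_shape, residual_shape, main_shape == residual_shape)
```

With these padding choices both branches halve the resolution and double the channels, so the element-wise sum before the ReLU is well defined.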

3. Results

3.1. Datasets

In this study, the Synapse multi-organ segmentation dataset [52] and the Automated Cardiac Diagnosis Challenge (ACDC) dataset [53] were utilized. The Synapse dataset originates from a subset of the Multi-Atlas Labeling Beyond the Cranial Vault (BTCV) Challenge. The ACDC dataset is the official dataset for the Automated Cardiac Diagnosis Challenge held during MICCAI 2017. The manual segmentation labels (i.e., ground truth) for both datasets were provided by the organizers and meticulously annotated by experienced radiologists following standardized clinical protocols. These annotations are recognized as industry benchmarks and have been extensively used in numerous prior studies [16,36] to enable fair and consistent comparisons of segmentation algorithms.

We used the Synapse multi-organ CT dataset, which comprises abdominal contrast-enhanced CT scans from 30 patients. This dataset contains a total of 3779 axial slices. Each scan includes 85–198 slices with an in-plane resolution of 512 × 512 pixels. To ensure consistency with prior work [54], we resampled all slices to a resolution of 224 × 224 pixels. Eight abdominal structures were annotated in this dataset, including the aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, and stomach. Following common practice, the dataset was randomly divided into 18 cases for training and 12 cases for validation, resulting in a training set of 2212 axial slice images [16].

We evaluated the RE-XswinUNet architecture using the ACDC dataset, which contains cine cardiac MR images from 100 patients captured during breath-hold. The dataset provides manual annotations for the left ventricle (LV), right ventricle (RV), and myocardium (MYO). Each short-axis series extends from the left ventricular base to the apex. The images have a slice thickness of 5–8 mm and an in-plane resolution of 0.83–1.75 mm²/pixel. Following the ACDC protocol, we randomly split the data into 70 patients (1930 slices) for training, 10 for validation, and 20 for testing. In addition, we employed the same baseline methods as reported in the original challenge to ensure a fair performance comparison.

Both the Synapse and ACDC experiments represent internal validation, as their training and test datasets are derived from the same source. While this setup allows for fair comparisons with established benchmarks, it may lead to performance estimates that do not fully generalize to external data collected from different clinical environments.

3.2. Implementation Details

This study was implemented in Python 3.8 with PyTorch 1.8.0. The resolution of the input CT or MR images was set to 224 × 224, and the patch size to 16 × 16. All experiments were performed on an NVIDIA A800 GPU. Each training cycle comprised 500 steps with a batch size of 16. We adopted the AdamW optimizer with a learning rate of 1 × 10⁻⁴ to reduce overfitting. For a fair comparison, the loss function is the same as that used in the baseline model SwinUNet. The publicly available pre-trained weights of TransUNet were used to initialize RE-XSwinUNet.

3.3. Evaluation Indicators

This study used the average Dice similarity coefficient (DSC) and the average 95% Hausdorff distance (HD95) as evaluation indicators.

In the field of image segmentation, model performance is often evaluated by Dice similarity coefficient, which measures the degree of overlap between the actual segmentation results and the predicted segmentation. This coefficient ranges in value from 0 to 1. When the value is closer to 1, it is indicative of superior segmentation performance. Its calculation formula can be expressed in Equation (14):

$$\mathrm{DSC}(P, T) = \frac{2 |P \cap T|}{|P| + |T|} \tag{14}$$

where $P$ denotes the segmentation label, and $T$ denotes the prediction result.
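Equation (14) can be illustrated on two small binary masks. The sketch below is a minimal plain-Python version for flattened 0/1 masks. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_coefficient(label, prediction):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(label) != len(prediction):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(label, prediction) if a and b)
    total = sum(1 for a in label if a) + sum(1 for b in prediction if b)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


if __name__ == "__main__":
    label = [1, 1, 1, 0, 0, 0]
    prediction = [0, 1, 1, 1, 0, 0]
    # Two overlapping pixels, three foreground pixels in each mask.
    print(dice_coefficient(label, prediction))
```

In this example the overlap is 2 pixels and each mask has 3 foreground pixels, so the score is 2·2/(3+3) = 2/3.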

The Hausdorff distance, as a measurement criterion, is used to assess the distance between the segmentation result and the labels. It is particularly sensitive to the segmentation boundaries. A smaller HD95 indicates higher similarity, while a larger value indicates greater discrepancy. In this study, the Hausdorff distance was utilized to assess model performance by comparing the actual labels with the segmentation results. The mathematical expression for the Hausdorff distance is presented in Equation (15):

$$\mathrm{HD}(S_{p}, S_{t}) = \max \left\{\max_{p \in S_{p}} \left\{\min_{t \in S_{t}} d(p, t)\right\}, \max_{t \in S_{t}} \left\{\min_{p \in S_{p}} d(p, t)\right\}\right\} \tag{15}$$

where $S_{t}$ denotes the set of points representing the predicted results, $S_{p}$ denotes the set of points representing the segmentation labels, and $\min_{t \in S_{t}} d(p, t)$ denotes the minimum Euclidean distance from $p$ to $S_{t}$.
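Equation (15) gives the maximum Hausdorff distance, whereas HD95 replaces the maxima with a 95th percentile so that a few outlier points do not dominate the score. The sketch below shows both on small point sets. The percentile convention used here (pooling both directed distance lists and interpolating linearly) is one choice among several, since toolkits differ on this detail. All function names are illustrative.

```python
import math


def directed_distances(source, target):
    """For every point in source, the distance to its nearest point in target."""
    return [min(math.dist(p, t) for t in target) for p in source]


def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def hausdorff(points_p, points_t):
    """Symmetric (maximum) Hausdorff distance between two point sets."""
    return max(max(directed_distances(points_p, points_t)),
               max(directed_distances(points_t, points_p)))


def hd95(points_p, points_t):
    """95th percentile of the pooled directed distances (assumed convention)."""
    pooled = (directed_distances(points_p, points_t)
              + directed_distances(points_t, points_p))
    return percentile(pooled, 95)


if __name__ == "__main__":
    labels = [(0, 0), (1, 0), (2, 0)]
    predicted = [(0, 0), (1, 0), (10, 0)]  # one far-away outlier
    print(hausdorff(labels, predicted), hd95(labels, predicted))
```

The single outlier at (10, 0) fully determines the maximum distance, while the percentile version is pulled down by the many well-matched points.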

3.4. Overall Performance on Synapse Dataset

The comparative performance of the proposed RE-XswinUNet architecture is presented in Table 1. The model is evaluated against state-of-the-art methods, including SwinUNet, TransUNet, and UNet, on the Synapse dataset. The results indicate that the RE-XswinUNet architecture achieves a significant improvement in performance, with a DSC of 81.78% and a reduced HD of 18.83 mm. Relative to the SwinUNet baseline, the DSC increased by 2.65 percentage points, while the HD was reduced by 2.72 mm. Furthermore, higher DSC values were consistently obtained across all organs, with particularly pronounced improvements observed in small organs (e.g., pancreas and gallbladder). These results demonstrate that the RE-XswinUNet architecture can effectively extract more detailed anatomical information.

Figure 5 presents representative visualization results on the Synapse dataset, comparing our method with other approaches. The segmentation outcomes of the stomach (first row) and spleen (second row) indicate that the UNet model is prone to over-segmentation, a phenomenon that may be attributed to the inherent dependence of convolutional neural network architectures on local contextual information. When the pancreas is considered (first and third rows), the proposed model produces more precise boundaries and achieves higher segmentation accuracy than SwinUNet. This improvement can be ascribed to the integration of RoPE and XskipNet, as discussed in Section 4. In contrast, for the left kidney, right kidney, and stomach, the performance gain of RE-XSwinUNet over SwinUNet and TransUNet is less substantial, which may be explained by the clearer structures and well-defined boundaries of these larger organs that can already be effectively segmented by conventional architectures. Overall, the proposed model yields moderate improvements for large organs, while considerable enhancements are obtained for smaller and more structurally complex organs.

3.5. Overall Performance on ACDC Dataset

To validate the generalization capability of the RE-XSwinUNet architecture on other datasets, we trained and evaluated it on the ACDC dataset for automated cardiac diagnosis. The experimental evaluation results are shown in Table 2, where the DSC of RE-XSwinUNet reaches 90.95%, outperforming baseline models such as TransUNet and SwinUNet. The improvements are particularly noticeable in the right ventricle and left ventricle, primarily due to the ability of the RoPE and XskipNet modules proposed in this paper to model detailed information.

3.6. Model Complexity and Efficiency Analysis

To assess the practicality of RE-XSwinUNet, we examined its model complexity against leading baseline methods by comparing parameter counts and estimated memory usage. As summarized in Table 3, RE-XSwinUNet has 40.241 million parameters, which is considerably lower than TransUNet (108.596 million) but moderately higher than both SwinUNet (37.169 million) and UNet (31.037 million).

RE-XSwinUNet exhibits notable parameter efficiency relative to TransUNet, attaining higher segmentation accuracy while utilizing fewer than 40% of the parameters. In comparison with SwinUNet, the proposed model incurs only a marginal parameter increase of around 3 million. This slight growth in model size leads to substantial gains in segmentation performance, especially in the delineation of small anatomical structures. With an estimated total memory footprint of 423.98 MB, RE-XSwinUNet remains compatible with common clinical computational environments, supporting its practical deployability.
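The parameter comparisons quoted above follow directly from the reported counts. The short sketch below reproduces the arithmetic using only the figures given in this section. The dictionary and function names are illustrative.

```python
# Parameter counts in millions, as reported in this section.
PARAMS_MILLIONS = {
    "RE-XSwinUNet": 40.241,
    "TransUNet": 108.596,
    "SwinUNet": 37.169,
    "UNet": 31.037,
}


def ratio_to(model, reference):
    """Parameter count of `model` as a fraction of `reference`."""
    return PARAMS_MILLIONS[model] / PARAMS_MILLIONS[reference]


def extra_params(model, reference):
    """Additional parameters (in millions) of `model` over `reference`."""
    return PARAMS_MILLIONS[model] - PARAMS_MILLIONS[reference]


if __name__ == "__main__":
    print(f"vs TransUNet: {ratio_to('RE-XSwinUNet', 'TransUNet'):.1%}")
    print(f"vs SwinUNet: +{extra_params('RE-XSwinUNet', 'SwinUNet'):.3f} M")
```

The ratio comes out at roughly 37% of TransUNet, and the increase over SwinUNet is about 3.07 million parameters, consistent with the statements in the text.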

4. Discussion

4.1. Ablation Studies

To evaluate the contributions of the proposed modules individually and in combination, and to address potential difficulties in component selection, we conducted a comprehensive ablation study. We integrated the RoPE, XskipNet, and SCAR block modules into the SwinUNet baseline model and report their individual and combined performance on the Synapse dataset in Table 4. Additionally, a thorough theoretical investigation was conducted to explain the principles behind the performance gains of these three modules, further demonstrating the feasibility of the RE-XSwinUNet architecture.

RoPE. Following the introduction of RoPE, the average Dice coefficient rose to 80.38%, an increase of 1.25 percentage points over the original baseline model, SwinUNet. In particular, the segmentation of small organs exhibited even more pronounced improvement. For example, the segmentation accuracy of the pancreas was enhanced from 56.58% to 63.43%, an improvement of 6.85 percentage points. We attribute the effectiveness of RoPE primarily to its long-range attenuation characteristic. As the relative distance increases, the inner product of $Q_{m}$ and $K_{n}$ in Equation (8) gradually decreases, enhancing the model’s generalization ability and suppressing excessive focus on long-range dependencies. For Equation (8), when $n = m$, Equation (16) can be obtained:

$$R(n - m) = R(0) = R(m)^{T} \cdot R(m) = E \tag{16}$$

where $E$ represents the identity matrix. Further derivation yields Equation (17):

$$Q_{m}^{T} K_{n} = x_{m}^{T} W_{q}^{T} W_{k} x_{m} \tag{17}$$

At this point, the dot product of the attention vectors does not decay. When $n \neq m$, as $|n - m|$ increases, the direction of the vector shifts due to rotation, and the dot product is governed by the angle between the two vectors, as in Equation (18):

$$\left|Q_{m}^{T} K_{n}\right| = \left|x_{m}^{T} W_{q}^{T} R(n - m) W_{k} x_{n}\right| \leq \left\|W_{q} x_{m}\right\| \cdot \left\|W_{k} x_{n}\right\| \cos \theta_{n - m} \tag{18}$$

where $\|\cdot\|$ represents the Euclidean norm of a vector, and $\theta_{n - m}$ represents the angle between the two vectors. This angle increases with $|n - m|$, which results in a decrease in $\cos \theta_{n - m}$.

This distance-dependent attenuation characteristic prevents interference from tokens at a distance, thereby improving the model’s ability to model local details and relationships. This results in the most significant improvement in performance on the pancreas. At the same time, RoPE can control attention distribution more smoothly and structurally, which helps improve robustness and generalizability.
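The relative-position behaviour discussed above can be illustrated with a small numerical sketch. It is a simplification under stated assumptions: it uses a single two-dimensional rotation with one hypothetical frequency `theta`, and the vectors `q` and `k` are made-up stand-ins for the projected query and key. A full RoPE layer combines many frequency pairs, so this only shows the mechanism, not the authors' implementation.

```python
import math


def rotate(vec, position, theta=0.1):
    """Rotate a 2-D vector by position * theta radians (one RoPE frequency pair)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    """Plain 2-D dot product."""
    return a[0] * b[0] + a[1] * b[1]


def rope_score(q, k, m, n, theta=0.1):
    """Attention logit between a query at position m and a key at position n."""
    return dot(rotate(q, m, theta), rotate(k, n, theta))


# Hypothetical projected vectors standing in for W_q x_m and W_k x_n.
q = (1.0, 0.5)
k = (0.8, -0.3)

# Case n == m: the rotations cancel and the plain dot product is recovered.
same_position = rope_score(q, k, 7, 7)

# The score depends only on the offset n - m, not on the absolute positions.
offset_3_early = rope_score(q, k, 2, 5)
offset_3_late = rope_score(q, k, 10, 13)

# For identical vectors the score is |q|^2 * cos((n - m) * theta), so it
# shrinks as the offset grows while the relative angle stays below pi.
decay = [rope_score(q, q, 0, offset) for offset in range(6)]

print(same_position, dot(q, k))
print(offset_3_early, offset_3_late)
print([round(v, 4) for v in decay])
```

With one frequency the score is periodic in the offset, so the monotonic decrease shown here holds only for small relative angles; the choice of `theta=0.1` and offsets up to 5 keeps the example inside that range.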

XskipNet. Adding the XskipNet block raised the average DSC from 79.13% to 79.97%. According to Table 4, the aorta and liver improved slightly (85.47% to 85.55% and 94.29% to 94.58%), and larger organs such as the stomach and spleen showed relatively minor improvements, whereas the Dice scores of the left and right kidneys decreased slightly (83.28% to 82.37% and 79.61% to 78.14%). In contrast, considerable progress was achieved for smaller organs: the Dice scores of the pancreas and gallbladder increased by 5.68 and 1.30 percentage points (56.58% to 62.26% and 66.53% to 67.83%), respectively. These findings support the module’s benefit for organs that are highly sensitive to detailed information.

The effectiveness of XskipNet is primarily attributed to its multi-scale feature fusion and attention-guided mechanism. Small organs (e.g., pancreas and gallbladder) occupy only a small proportion of medical images and possess complex boundaries, making them prone to detail loss during downsampling. By fusing shallow-layer high-resolution features, XskipNet mitigates detail loss during downsampling and consequently improves edge segmentation accuracy. Furthermore, since small organs are susceptible to interference from surrounding tissues, the spatial attention mechanism of CBAM enables the model to focus on the organ region and suppress background noise. For large organs, which exhibit clearer structures and occupy a larger proportion, traditional skip connections are generally sufficient, and thus the advantages of XskipNet are less significant.
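To make the spatial attention step concrete, the sketch below shows a CBAM-style spatial gate in plain Python. It is an illustration under assumptions, not the XskipNet implementation: CBAM applies a 7×7 convolution to the pooled maps, whereas this sketch replaces it with a hypothetical per-pixel weighted sum (`w_avg`, `w_max`, `bias`) so that it stays short and dependency-free.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(features, w_avg=1.0, w_max=1.0, bias=0.0):
    """Gate a [C][H][W] feature map with a per-pixel attention weight.

    For each pixel, the channel-wise average and maximum are combined
    (here by a weighted sum standing in for CBAM's 7x7 convolution),
    passed through a sigmoid, and used to rescale every channel.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    attention = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            values = [features[c][i][j] for c in range(channels)]
            avg_pool = sum(values) / channels
            max_pool = max(values)
            attention[i][j] = sigmoid(w_avg * avg_pool + w_max * max_pool + bias)
    gated = [
        [[features[c][i][j] * attention[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(channels)
    ]
    return gated, attention


# Toy 2-channel, 2x2 feature map with one strongly activated pixel.
toy_features = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[2.0, 0.0], [0.0, 0.0]],
]
gated, attention = spatial_attention(toy_features)
print(attention)
```

In the toy input, the activated pixel receives an attention weight close to 1 while the empty background pixels receive 0.5, which mirrors the idea of emphasizing the organ region relative to its surroundings.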

SCAR Block. Replacing the original patch merging with the SCAR Block enhanced the structural preservation and translational invariance of the model. The Dice score on the aorta increased from 85.47% to 86.82%, while the Dice coefficients on the left kidney and right kidney reached 84.47% and 81.76%, respectively. The gains were concentrated in large organs and vessels (e.g., kidneys and aorta, with a smaller gain on the liver). We attribute these improvements to the SCAR Block’s ability to maintain structural integrity and enrich semantic information during downsampling.

In the SCAR Block, convolutional downsampling prevents the disruption of spatial structure caused by traditional patch merging, thereby preserving the overall geometric consistency of large organs. The SE module adaptively recalibrates channel-wise feature responses, allowing the model to focus on stable and discriminative semantic patterns within large organs. Meanwhile, residual connections ensure stable transmission of deep features, alleviating information decay under long-range dependencies.
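The channel recalibration and residual path described above can be sketched as follows. This is a minimal illustration under assumptions: the weight matrices `w_reduce` and `w_expand` are hypothetical, the residual is taken to be a plain identity addition, and the strided convolution that performs the actual downsampling is omitted. The exact wiring of the SCAR Block is defined in Section 2.4 and may differ.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def se_with_residual(features, w_reduce, w_expand):
    """Squeeze-and-excitation recalibration followed by an identity residual.

    features: [C][H][W] nested lists.
    w_reduce: [C_reduced][C] weights of the bottleneck layer.
    w_expand: [C][C_reduced] weights of the expansion layer.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excitation: bottleneck with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    scales = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_expand]
    # Recalibrate each channel and add the input back (residual connection).
    output = [
        [[value * scale + value for value in row] for row in channel]
        for channel, scale in zip(features, scales)
    ]
    return output, scales


# Toy input: two channels with the same mean activation.
toy_features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
w_reduce = [[1.0, -0.5]]     # hypothetical 2 -> 1 bottleneck
w_expand = [[2.0], [-2.0]]   # hypothetical 1 -> 2 expansion
output, scales = se_with_residual(toy_features, w_reduce, w_expand)
print(scales)
```

Because of the residual addition, a channel whose learned scale is near zero is still passed through unchanged rather than suppressed entirely, which is one way to read the "stable transmission of deep features" mentioned above.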

Module Combinations. To evaluate the synergistic effects among the proposed modules, we systematically conducted modular combination ablation experiments on the SwinUNet baseline model. Among all dual-module configurations, the “RoPE + XskipNet” combination demonstrated the highest performance with a DSC of 81.15%. Specifically, pancreatic segmentation accuracy improved from the baseline 56.58% to 63.43% after incorporating the RoPE module, and further increased to 64.85% when combined with XskipNet. This result demonstrates that integrating RoPE’s relative positional awareness with XskipNet’s cross-scale detail fusion mechanism enables more effective capture of small organ fine structures and boundary features.

Among other combinations, “RoPE+SCAR Block” demonstrated outstanding performance in aortic (87.42%) and renal segmentation through synergistic positional encoding and structural preservation. While “XskipNet+SCAR Block” achieved the best liver segmentation (94.62%), its pancreatic segmentation (60.85%) lagged behind that of “RoPE+SCAR Block” (62.75%), which we attribute to the lack of positional encoding support. Overall, every dual-module combination outperformed every single-module configuration in average DSC, with the fusion of detail modeling and multi-scale features contributing most substantially to the overall performance enhancement.
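The ranking of configurations discussed above can be checked directly against the average DSC column of Table 4. The short sketch below does only that arithmetic; the variable names are illustrative.

```python
# Average DSC (%) per configuration, copied from Table 4.
dsc = {
    "SwinUNet": 79.13,
    "RoPE": 80.38,
    "XskipNet": 79.97,
    "SCAR Block": 80.12,
    "RoPE+XskipNet": 81.15,
    "RoPE+SCAR Block": 80.95,
    "XskipNet+SCAR Block": 80.65,
    "RE-XSwinUNet (All)": 81.78,
}

baseline = dsc["SwinUNet"]

# Gain of each configuration over the baseline, in percentage points.
gains = {
    name: round(value - baseline, 2)
    for name, value in dsc.items()
    if name != "SwinUNet"
}

# Sort from largest to smallest gain.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)

single = ["RoPE", "XskipNet", "SCAR Block"]
dual = ["RoPE+XskipNet", "RoPE+SCAR Block", "XskipNet+SCAR Block"]
duals_beat_singles = min(gains[d] for d in dual) > max(gains[s] for s in single)

for name, gain in ranked:
    print(f"{name:22s} +{gain:.2f}")
print("Every dual-module config beats every single-module config:", duals_beat_singles)
```

The full model ranks first, “RoPE+XskipNet” is the best dual-module configuration, and the weakest dual-module configuration still exceeds the strongest single-module one.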

4.2. Significance Verification Based on t-Test

To verify that the performance improvement does not stem from random fluctuations, we conducted five independent replicate experiments on the Synapse dataset. Dice scores were obtained using five distinct random seeds (1234, 2341, 3412, 4123, and 5432), followed by an independent-samples t-test. Following standard practice in medical image segmentation, a significance level of α = 0.05 was set. Figure 6 presents the average DSC scores and their standard deviations across the five experiments for different methods. Statistical analysis revealed that RE-XSwinUNet achieved a 2.28% average improvement in Dice score compared to SwinUNet (t(8) = 5.99, p = 0.0003 < α). The 95% confidence interval for this performance gain was [1.37%, 3.19%], which does not include zero, indicating that the observed enhancement exceeds the range of random fluctuation. These results support the statistical significance and reliability of RE-XSwinUNet’s performance improvement.
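For readers who want to reproduce this kind of analysis on their own runs, a minimal sketch of a pooled-variance two-sample t-test with five runs per method is given below. The per-seed Dice scores are not listed in this article, so the inputs are deliberately artificial toy numbers and not experimental results. The critical value 2.306 is the two-sided 95% quantile of the t-distribution with 8 degrees of freedom; the helper name `pooled_t_test` is hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_test(a, b):
    """Student's two-sample t-test assuming equal variances.

    Returns (mean difference, t statistic, degrees of freedom, standard error).
    """
    n_a, n_b = len(a), len(b)
    diff = mean(a) - mean(b)
    dof = n_a + n_b - 2
    # Pooled sample variance across both groups.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return diff, diff / std_err, dof, std_err


# Artificial toy values (NOT results from the paper), five runs per method.
runs_new = [3.0, 4.0, 5.0, 6.0, 7.0]
runs_base = [1.0, 2.0, 3.0, 4.0, 5.0]

diff, t_stat, dof, std_err = pooled_t_test(runs_new, runs_base)

# Two-sided 95% confidence interval for the mean difference (dof = 8).
T_CRIT_95_DOF8 = 2.306
ci_low = diff - T_CRIT_95_DOF8 * std_err
ci_high = diff + T_CRIT_95_DOF8 * std_err

print(f"diff={diff:.2f}, t({dof})={t_stat:.2f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

The p-value requires the t-distribution CDF, which is not in the Python standard library; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=True)` returns both the statistic and the p-value for the same test.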

4.3. Limitations and Future Work

Although RE-XSwinUNet demonstrates strong segmentation performance on both CT and MRI data, its clinical application should be evaluated in the context of how imaging modalities are actually selected. In clinical practice, particularly under stringent radiation protection regulations such as Council Directive 2013/59/Euratom, the inherent radiation risks associated with CT must be fully considered when it is chosen for soft tissue assessment. For several of the abdominal organs covered in this study (e.g., pancreas, gallbladder, and liver), magnetic resonance imaging is generally considered the more appropriate imaging modality. The primary objective of this study is to propose a robust and general segmentation architecture. To this end, we validated the model’s performance on both CT (Synapse) and MRI (ACDC) datasets to demonstrate its cross-modal adaptability. However, the model’s generalization to MRI-based multi-organ abdominal segmentation has not yet been examined and remains a promising direction for future investigation. Additionally, to ensure a fair comparison with prior studies such as TransUNet and SwinUNet, this work reports only internal validation and has not yet been evaluated on external test sets. The model’s generalization performance requires more comprehensive validation in subsequent research.

5. Conclusions

In this study, we propose RE-XSwinUNet, a SwinUNet-based architecture with rotary position embedding and cross-slice context connections, designed for accurate and robust multi-organ segmentation. The framework addresses key limitations of existing SwinUNet-based approaches through three major innovations. First, the integration of Rotary Position Embedding (RoPE) enables effective modeling of relative spatial relationships, thereby enhancing SwinUNet’s capacity to capture fine-grained anatomical structures. Second, the proposed XskipNet module facilitates multi-scale, attention-guided feature fusion between the encoder and decoder, leading to substantial improvements in the segmentation accuracy of small organs with complex boundaries. Third, the structure-preserving SCAR block replaces conventional patch merging, better preserving translational invariance and promoting inter-block information flow during downsampling. RE-XSwinUNet achieved superior performance on the Synapse and ACDC datasets, surpassing the compared state-of-the-art network models. This study is expected to advance the development of multi-organ segmentation techniques. Future work will focus on exploring the generalization performance of RE-XSwinUNet across various imaging modalities and in external validation.

Author Contributions

Conceptualization, H.Y. and C.Y.; Data curation, D.Y. and X.H.; Formal analysis, H.Y. and D.Y.; Funding acquisition, C.Y.; Investigation, H.Y. and X.H.; Methodology, H.Y. and W.L.; Project administration, C.Y.; Resources, C.Y.; Software, H.Y.; Supervision, C.Y.; Validation, H.Y., D.Y. and W.L.; Visualization, H.Y.; Writing—original draft, H.Y.; Writing—review and editing, C.Y. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation for Theoretical Physics special fund “cooperation program” (No. 11547039), the Shaanxi Provincial Natural Science Research Funding Project (No. 2024SF-YBXM-587), and the Shaanxi Institute of Scientific Research Plan projects (No. SLGKYQD2-05).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lv, C.; Li, B.; Wang, X.; Cai, P.; Yang, B.; Sun, G.; Yan, J. ECM-TransUNet: Edge-enhanced multi-scale attention and convolutional Mamba for medical image segmentation. Biomed. Signal Process. Control 2025, 107, 107845.
2. Xia, Q.; Zheng, H.; Zou, H.; Luo, D.; Tang, H.; Li, L.; Jiang, B. A comprehensive review of deep learning for medical image segmentation. Neurocomputing 2025, 613, 128740.
3. Etehadtavakol, M.; Etehadtavakol, M.; Ng, E.Y.K. Enhanced thyroid nodule segmentation through U-Net and VGG16 fusion with feature engineering: A comprehensive study. Comput. Methods Programs Biomed. 2024, 251, 108209.
4. Dan, Y.; Jin, W.; Yue, X.; Wang, Z. Enhancing medical image segmentation with a multi-transformer U-Net. PeerJ 2024, 12, e17005.
5. Gao, Y.; Jiang, Y.; Peng, Y.; Yuan, F.; Zhang, X.; Wang, J. Medical Image Segmentation: A Comprehensive Review of Deep Learning-Based Methods. Tomography 2025, 11, 52.
6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Volume 1, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
7. Yao, W.; Bai, J.; Liao, W.; Chen, Y.; Liu, M.; Xie, Y. From CNN to Transformer: A Review of Medical Image Segmentation Models. J. Imaging Inform. Med. 2024, 37, 1529–1547.
8. Jiang, J.; Zhang, J.; Liu, W.; Gao, M.; Hu, X.; Xue, Z.; Liu, Y.; Yan, S. Rwkv-unet: Improving unet with long-range cooperation for effective medical image segmentation. arXiv 2025, arXiv:2501.08458.
9. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Cham, Switzerland, 5–9 October 2015; pp. 234–241.
10. Du, G.; Cao, X.; Liang, J.; Chen, X.; Zhan, Y. Medical image segmentation based on U-net: A review. J. Imaging Sci. Technol. 2020, 64, 1.
11. Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical image segmentation review: The success of u-net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10076–10095.
12. Huang, L.; Miron, A.; Hone, K.; Li, Y. Segmenting Medical Images: From UNet to Res-UNet and nnUNet. In Proceedings of the 2024 IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS), Guadalajara, Mexico, 26–28 June 2024; pp. 483–489.
13. Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
14. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016, Cham, Switzerland, 17–21 October 2016; pp. 424–432.
15. Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
16. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
17. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis, Granada, Spain, 20 September 2018; pp. 3–11.
18. Qi, Y.; Cai, J.; Chen, R. AO-TransUNet: A multi-attention optimization network for COVID-19 and medical image segmentation. Digit. Signal Process. 2025, 164, 105264.
19. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211.
20. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114.
21. Cai, S.; Tian, Y.; Lui, H.; Zeng, H.; Wu, Y.; Chen, G. Dense-UNet: A novel multiphoton in vivo cellular image segmentation model based on a convolutional neural network. Quant. Imaging Med. Surg. 2020, 10, 1275.
22. Kumar, S.S. Advancements in medical image segmentation: A review of transformer models. Comput. Electr. Eng. 2025, 123, 110099.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
24. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 10347–10357.
25. Krishna, M.S.; Machado, P.; Otuka, R.I.; Yahaya, S.W.; Neves dos Santos, F.; Ihianle, I.K. Plant Leaf Disease Detection Using Deep Learning: A Multi-Dataset Approach. J 2025, 8, 4.
26. Pu, Q.; Xi, Z.; Yin, S.; Zhao, Z.; Zhao, L. Advantages of transformer and its application for medical image segmentation: A survey. Biomed. Eng. Online 2024, 23, 14.
27. Liu, Q.; Kaul, C.; Wang, J.; Anagnostopoulos, C.; Murray-Smith, R.; Deligianni, F. Optimizing vision transformers for medical image segmentation. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023; pp. 1–5.
28. Wang, Y.; Qiu, Y.; Cheng, P.; Zhang, J. Hybrid CNN-transformer features for visual place recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 1109–1122.
29. Qiao, X.; Yan, Q.; Huang, W. Hybrid CNN-Transformer Network With a Weighted MSE Loss for Global Sea Surface Wind Speed Retrieval From GNSS-R Data. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–13.
30. Bashir, T.; Wang, H.; Tahir, M.; Zhang, Y. Wind and solar power forecasting based on hybrid CNN-ABiLSTM, CNN-transformer-MLP models. Renew. Energy 2025, 239, 122055.
31. Tang, H.; Chen, Y.; Wang, T.; Zhou, Y.; Zhao, L.; Gao, Q.; Du, M.; Tan, T.; Zhang, X.; Tong, T. HTC-Net: A hybrid CNN-transformer framework for medical image segmentation. Biomed. Signal Process. Control 2024, 88, 105605.
32. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Event, 19–25 June 2021; pp. 6881–6890.
33. Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. Cotr: Efficiently bridging cnn and transformer for 3d medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 171–180.
34. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. Transbts: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September–1 October 2021; pp. 109–119.
35. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584.
36. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218.
37. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
38. Kumar, S.; Kumar, R.V.; Ranjith, V.; Jeevakala, S.; Varun, S.S.J.C. Grey Wolf optimized SwinUNet based transformer framework for liver segmentation from CT images. Comput. Electr. Eng. 2024, 117, 109248.
39. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063.
40. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155.
41. Li, X.; Cheng, Y.; Fang, Y.; Liang, H.; Xu, S. 2DSegFormer: 2-D Transformer Model for Semantic Segmentation on Aerial Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
42. Wu, K.; Peng, H.; Chen, M.; Fu, J.; Chao, H. Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10033–10041.
43. Liutkus, A.; Cıfka, O.; Wu, S.-L.; Simsekli, U.; Yang, Y.-H.; Richard, G. Relative positional encoding for transformers with linear complexity. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 7067–7079.
44. Heo, B.; Park, S.; Han, D.; Yun, S. Rotary position embedding for vision transformer. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 289–305.
45. Wang, P.; Yang, Q.; He, Z.; Yuan, Y. Vision transformers in multi-modal brain tumor MRI segmentation: A review. Meta-Radiology 2023, 1, 100004.
46. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
47. Li, Y.; Zou, Y.; He, X.; Xu, Q.; Liu, M.; Jin, S.; Zhang, Q.; He, M.M.; Zhang, J. HFA-UNet: Hybrid and full attention UNet for thyroid nodule segmentation. Knowl.-Based Syst. 2025, 328, 114245.
48. Sun, P.; Wu, J.; Zhao, Z.; Gao, H. ACMS-TransNet: Polyp Segmentation Network Based on Adaptive Convolution and Multi-Scale Global Context. IAENG Int. J. Comput. Sci. 2025, 52, 474–483.
49. Zhu, Y.; Zhang, D.; Lin, Y.; Feng, Y.; Tang, J. Merging Context Clustering with Visual State Space Models for Medical Image Segmentation. Qeios 2025, 44, 2131–2142.
50. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13.
51. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
52. Segmentation Outside the Cranial Vault Challenge. 2015. Available online: https://repo-prod.prod.sagebase.org/repo/v1/doi/locate?id=syn3193805&type=ENTITY (accessed on 17 September 2024).
53. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Ballester, M.A.G.; et al. Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525.
54. Fu, S.; Lu, Y.; Wang, Y.; Zhou, Y.; Shen, W.; Fishman, E.; Yuille, A. Domain adaptive relational reasoning for 3d multi-organ segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Lima, Peru, 4–8 October 2020; pp. 656–666.
55. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
56. Jha, A.; Kumar, A.; Pande, S.; Banerjee, B.; Chaudhuri, S. Mt-unet: A novel u-net based multi-task architecture for visual scene understanding. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 2191–2195.
57. Yu, J.; Qin, J.; Xiang, J.; He, X.; Zhang, W.; Zhao, W. Trans-UNeter: A new Decoder of TransUNet for Medical Image Segmentation. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 2338–2341.

Figure 1. Overall network architecture of RE-XSwinUNet.

Figure 2. Swin Transformer Block with RoPE.

Figure 3. XskipNet block architecture.

Figure 4. SCAR Block Network Architecture.

Figure 5. Visualization results of different segmentation methods on the Synapse dataset. From left to right: (a) Original images, (b) Ground Truth, (c) RE-XSwinUNet, (d) SwinUNet, (e) TransUNet, (f) UNet. The proposed RE-XSwinUNet architecture minimizes false positives while retaining more detailed structures.

Figure 6. Statistical performance comparison of segmentation results obtained by different methods on the Synapse dataset.

Table 1. Comparison results of different models on the Synapse dataset (average DSC (%), average HD (mm), and per-organ DSC (%) for eight organs).

Method	DSC	HD	Aorta	Gall.	Kidney(L)	Kidney(R)	Liver	Pancreas	Spleen	Stomach
V-Net [13,16]	68.81	-	75.34	51.87	77.10	80.75	87.84	40.05	80.56	56.98
DARR [16,54]	69.77	-	74.74	53.77	72.31	73.24	94.08	54.18	89.90	45.96
U-Net [9,16]	74.68	36.87	87.74	63.66	80.60	78.19	93.74	56.90	85.87	74.16
AttnUNet [15,16]	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
R50-ViT [16,55]	71.29	32.87	73.73	55.13	75.80	72.20	91.51	45.99	81.99	73.95
TransUNet [16]	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
CoT-TransUNet [36]	78.24	23.75	88.69	62.56	88.33	76.91	94.57	55.23	86.35	78.28
MT-UNet [56]	78.59	26.59	87.92	64.99	81.47	77.29	93.06	59.46	87.75	76.81
SwinUNet [36]	79.13	21.55	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.60
RE-XSwinUNet	81.78	18.83	87.99	68.19	83.76	81.43	94.55	65.11	91.72	81.52

Table 2. Comparison results of different models on the ACDC dataset (average Dice score (%)).

Methods	DSC	RV	Myo	LV
R50 U-Net [16,57]	87.55	87.10	80.63	94.92
R50 Att-UNet [36,57]	86.75	87.58	79.20	93.47
R50 ViT [36,57]	87.57	86.07	81.88	94.75
TransUNet [16,57]	89.71	88.86	84.53	95.73
SwinUNet [36,57]	90.00	88.55	85.62	95.83
RE-XSwinUNet	90.95	91.24	85.52	96.08

Table 3. Comparison of Model Complexity.

Methods	Total Params (M)	Estimated Total Size (MB)
UNet	31.037	567.90
TransUNet	108.596	841.84
SwinUNet	37.169	408.04
RE-XSwinUNet	40.241	423.98

Table 4. Evaluation results of each module on the Synapse dataset (average Dice score (%)).

Block	DSC	Aorta	Gallb.	Kidney (L)	Kidney (R)	Liver	Pancreas	Spleen	Stomach
SwinUNet [36]	79.13	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.60
RoPE	80.38	86.32	67.78	82.76	78.50	94.12	63.43	91.19	78.92
XskipNet	79.97	85.55	67.83	82.37	78.14	94.58	62.26	91.72	77.28
SCAR Block	80.12	86.82	66.34	84.47	81.76	94.38	56.10	91.79	79.32
RoPE+XskipNet	81.15	87.25	68.05	83.25	80.35	94.52	64.85	91.45	80.28
RoPE+SCAR Block	80.95	87.42	67.45	84.12	81.25	94.45	62.75	91.68	80.65
XskipNet+SCAR Block	80.65	86.95	67.28	84.35	81.08	94.62	60.85	91.85	79.95
RE-XSwinUNet (All)	81.78	87.99	68.19	83.76	81.43	94.55	65.11	91.72	81.52

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
