Article

UNETR++ with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers

by
Sithembiso Ntanzi
and
Serestina Viriri
*,†
Computer Science Discipline, University of KwaZulu-Natal, Durban 4000, South Africa
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(20), 11034; https://doi.org/10.3390/app152011034
Submission received: 17 August 2025 / Revised: 25 September 2025 / Accepted: 8 October 2025 / Published: 14 October 2025

Abstract

There have been significant breakthroughs in models for segmenting 3D medical images, with many promising results attributed to the incorporation of Vision Transformers (ViTs). However, the fundamental mechanism of Transformers, self-attention, has quadratic complexity with respect to the input sequence length, which sharply increases computational requirements for 3D medical images. In this paper, we investigate the UNETR++ model and propose a voxel-focused attention mechanism inspired by the pixel-focused attention of TransNeXt. The core component of UNETR++ is the Efficient Paired Attention (EPA) block, which learns from two interdependent branches: spatial and channel attention. The deficiency of UNETR++ lies in its reliance on dimensionality reduction in the spatial branch, which improves efficiency but risks information loss. Our contribution is to replace this projection with a voxel-focused attention design that achieves linear complexity with respect to the input sequence length without projecting the keys and values into lower dimensions, thereby reducing parameters while preserving representational power. This effectively reduces the model's parameter count while maintaining competitive performance and inference speed. On the Synapse dataset, the enhanced UNETR++ model contains 21.42 M parameters, a 50% reduction from the original 42.96 M, while achieving a competitive Dice score of 86.72%.

1. Introduction

Volumetric medical imaging devices, such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), have had a significant impact on the medical field [1,2]. Segmenting the 3D medical images produced by these devices is crucial, as it aids quantitative analysis, guides surgical procedures, and improves patient intervention outcomes. Classical segmentation approaches used in computer-aided diagnosis (CADx), which rely on tools such as thresholding and manual paintbrush delineation, are often inefficient and time-consuming [3,4].
Advancements in deep learning have led to the development of more efficient and accurate segmentation models. Generally, deep learning segmentation methods for 3D medical images fall into three categories: convolution-based, transformer-based, and hybrid. Convolution-based models rely solely on convolutional neural networks (CNNs), while transformer-based models rely solely on Transformers. Hybrid models, particularly for 3D medical imaging, combine CNNs and Transformers and typically follow a U-shaped architecture with an encoder and a decoder. The adoption of Transformers has been driven by the success of Vision Transformers (ViTs) in various visual tasks, such as classification, segmentation, restoration, synthesis, registration, and object detection, where they often outperform fully CNN-based techniques [5,6,7,8,9,10,11].
In this paper, we investigated UNETR++ [12], which features a hybrid hierarchical encoder–decoder architecture that incorporates both Transformer and convolution methods. The main component of UNETR++ is the Efficient Paired Attention (EPA) block, which consists of two interdependent branches: the spatial branch and the channel branch. Spatial attention is applied to the spatial branch, while channel attention is applied to the channel branch. The fundamental component of Transformers is the self-attention mechanism [13], which scales quadratically with the input sequence length. This significantly impacts the computation of attention in medical volumetric images due to their large input sequences. To mitigate this, the original UNETR++ reduces computational cost by projecting keys and values into lower dimensions; however, this projection introduces information loss and limits representational richness.
Our main contributions are summarized as follows:
  • We identify the deficiency of UNETR++ in its spatial attention design, which relies on dimensionality reduction for computational efficiency but sacrifices feature fidelity.
  • We propose a voxel-focused attention (VFA) mechanism that achieves linear complexity without dimensionality reduction, directly computing attention across voxels.
  • We integrate VFA into the spatial branch of the EPA block in UNETR++, resulting in an enhanced model that reduces parameters by nearly 50% while maintaining competitive segmentation accuracy.
  • We evaluate the proposed method on three widely used benchmarks (Synapse, ACDC, and BraTS), demonstrating its effectiveness across diverse medical imaging modalities.
We experimented with full- and semi-voxel-focused attention (VFA) on the spatial branch to compute the attention weights. The performance difference was not significant when evaluated on the ACDC dataset: the full VFA achieved an average Dice score of 92.9, while the semi-VFA (the window-only variant) scored 92.7. The full VFA had fewer trainable parameters but a longer inference time than the semi-VFA. The experiments were conducted on three benchmarks: Synapse [14], ACDC [15], and BraTS [16]. The enhanced UNETR++ model incorporating VFA achieved comparable results across these datasets while having significantly fewer parameters. Figure 1 presents a visual comparison of the Dice similarity score, model parameters, and FLOPs for state-of-the-art models on the Synapse dataset.

2. Literature Review and Related Work

2.1. Convolution-Based Segmentation Methods

Convolutional neural networks (CNNs) were among the first deep learning techniques to be adopted to segment medical images. U-Net [17] was one of the pioneering CNN-based models that demonstrated excellent performance in segmenting medical images. Since then, various U-Net variants have been developed to enhance performance for different medical imaging modalities and segmentation tasks [18,19,20,21]. For volumetric medical image segmentation, some U-Net variants employ a strategy that avoids using the full 3D volumetric images, a technique commonly referred to as 2.5D segmentation [22]. In contrast, 3D methods utilise the entire volumetric data. 3D U-Net [20] is an example of a U-Net variant that processes the complete volume of the images. nnU-Net [23] provides a highly flexible framework due to its ability to configure the architecture, making it capable of handling both 2D and 3D images and adaptable to different data preprocessing requirements. Liu et al. [24] introduced ConvNeXt, a convolutional neural network derived from ResNet [18], to revitalise CNNs as Vision Transformers began outperforming CNNs across multiple vision tasks. Some models have adopted architectural features from ConvNeXt. MedNeXt [25], for instance, is fully based on ConvNeXt for the segmentation of 3D medical images. Furthermore, some methods have explored the use of dilated depthwise convolutions and large kernels to capture more contextual information [26].

2.2. Transformer-Based Segmentation Methods

Vision Transformers (ViTs) have demonstrated exceptional results across multiple visual tasks, outperforming CNNs. The self-attention mechanism is the key component of Transformers, enabling them to capture global context by encoding the relationships between image patches [13]. However, the standard self-attention mechanism has quadratic computational complexity with respect to the input sequence length, which in this case refers to the number of image patches. In recent years, several approaches have been proposed to reduce this complexity. Guo et al. [27] addressed the issue by introducing small, external learnable shared memory units for the keys and values, resulting in a mechanism with linear complexity, which they named External Attention. TransNeXt [11] reduces the complexity of self-attention using a pixel-focused attention mechanism inspired by the focal perception mode of biological vision. More recently, fully Transformer-based designs have also been explored. Huang et al. introduced LightViT [28], a convolution-free design incorporating learnable tokens to capture global dependencies and a bi-dimensional attention module to aggregate global information across both channel and spatial dimensions. LightViT has demonstrated good performance on various 2D visual tasks. Karimi et al. [29] proposed a convolution-free design specifically for 3D medical image segmentation, in which 3D images are broken down into patches, flattened, embedded into a 1D representation, and passed through the network to predict the segmentation map.

2.3. Hybrid Segmentation Methods

Due to the strong inductive spatial bias of CNNs, some research incorporates CNNs to help capture local context. Hybrid designs leverage the strengths of both CNNs and Transformers to capture local and global dependencies, respectively. TransUNet [30] is the first hybrid framework for medical image segmentation, consisting of a CNN–Transformer encoder. The Transformer within the encoder encodes global context from high-resolution spatial CNN features. The decoder is entirely CNN-based and upsamples the encoded features. UNETR [31] uses a Transformer as the encoder and a CNN-based decoder to capture localised information. The encoder and decoder are connected via skip connections at different resolutions. nnFormer [32] has a hierarchical feature representation and is rooted in the SwinUNet [33] architecture. nnFormer adapts the Swin Transformer's shifted window approach to 3D segmentation: it divides the input into 3D patches and employs local and global volume-based self-attention mechanisms. Liu et al. [34] introduced a hierarchical encoder–decoder model with skip connections (SCANeXt) for 3D medical image segmentation. While the hierarchical encoder–decoder structure is similar to that of UNETR++, SCANeXt replaces the EPA block with the dual attention and depthwise convolution (DADC) block. The DADC block utilises dual attention and depthwise convolution, inspired by ConvNeXt; the dual attention mechanism captures global context, while the depthwise convolution block extracts multiscale features.

3. Methods and Techniques

This section outlines the architecture of the enhanced UNETR++ model, which incorporates the voxel-focused attention (VFA) mechanism. The overall architecture remains unchanged; only the spatial branch of the EPA block has been enhanced to compute attention using VFA.

3.1. Overall Architecture

The UNETR++ architecture features a hierarchical encoder–decoder structure connected by skip connections, drawing inspiration from the UNETR architecture introduced by Hatamizadeh et al. [31]. It comprises various components, including ConvBlock, Downsampling, Upsampling, 1 × 1 × 1 Convolution, and EPA blocks, as shown in Figure 2.
In this hierarchical design, the Downsampling layers progressively reduce the resolution of the feature maps by a factor of 2 at each stage through a non-overlapping convolution operation. The encoder consists of four stages. The first stage employs patch embedding to divide the image volume into non-overlapping 3D patches and project them into the channel dimension. With a patch resolution of $(P_1, P_2, P_3)$, the resulting feature map has size $\left(\frac{D}{P_1}, \frac{H}{P_2}, \frac{W}{P_3}\right)$, and the corresponding sequence length is $N = \frac{D}{P_1} \times \frac{H}{P_2} \times \frac{W}{P_3}$. At each subsequent stage, the Downsampling layer is applied, followed by the EPA block.
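As a small illustration of this tokenisation step, the sketch below uses a strided Conv3d as the patch-embedding layer with the Synapse configuration reported in Section 4.1; the layer name, channel width, and single input channel are illustrative assumptions rather than the released UNETR++ code.

```python
import torch
import torch.nn as nn

# Assumed Synapse-style configuration: volume of 64 x 128 x 128 voxels,
# patch resolution (P1, P2, P3) = (2, 4, 4), embedding width C = 32.
D, H, W, C = 64, 128, 128, 32
P1, P2, P3 = 2, 4, 4

patch_embed = nn.Conv3d(1, C, kernel_size=(P1, P2, P3), stride=(P1, P2, P3))
x = torch.randn(1, 1, D, H, W)            # one single-channel CT volume
tokens = patch_embed(x)                   # (1, C, D/P1, H/P2, W/P3)
N = (D // P1) * (H // P2) * (W // P3)     # sequence length N
print(tokens.shape, N)                    # torch.Size([1, 32, 32, 32, 32]) 32768
```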
The decoder also consists of four stages. Each stage includes an EPA block (except for the last stage) and an Upsampling layer that performs deconvolution, increasing the resolution of the feature maps by a factor of 2 while reducing the channel dimension by the same factor. The encoder and decoder stages are linked through skip connections, which help recover spatial information lost during the downsampling operations.
In the final stage of the decoder, the output is fused with the convolution feature maps to restore spatial information and enhance feature representation. This output is then processed through a 3 × 3 × 3 ConvBlock and a 1 × 1 × 1 Convolution Block to generate the final prediction mask.

3.2. Voxel-Focused Attention (VFA)

Voxel-focused attention (VFA) is inspired by the pixel-focused attention (PFA) mechanism introduced by Shi in TransNeXt [11]. The PFA mechanism aims to replicate the behaviour of biological foveal vision by using a query-centred sliding window for pixel-wise attention along with pooling attention. We extended this mechanism to operate voxel-wise for 3D medical image segmentation.

Approaches to VFA

We explored two approaches to implementing VFA:
  • Approach 1—VFA: This approach considers only the query-centred sliding window voxel-wise attention, without incorporating the pooling attention.
  • Approach 2—Full VFA (FVFA): This approach extends the sliding window voxel-wise attention by including the pooling attention operation to capture additional contextual information.
Given a patch $X \in \mathbb{R}^{C \times H \times W \times D}$, we define the set of voxels within the sliding window centred at voxel $(i, j, m)$ as $\rho(i, j, m)$. The size of the window is expressed as:

$$|\rho(i, j, m)| = k \times k \times k,$$

where $k$ denotes the size of the window along each dimension. Using this definition, the voxel-wise attention for Approach 1 can be represented as:

$$S_{(i,j,m)\sim\rho(i,j,m)} = \left(Q_{(i,j,m)} + \mathrm{QE}\right) \cdot K_{\rho(i,j,m)}$$
$$\alpha_{(i,j,m)\sim\rho(i,j,m)} = \mathrm{Softmax}\!\left(\frac{S_{(i,j,m)\sim\rho(i,j,m)}}{\sqrt{d}} + B_{(i,j,m)\sim\rho(i,j,m)}\right)$$
$$\mathrm{VFA}\!\left(X_{(i,j,m)}\right) = \alpha_{(i,j,m)\sim\rho(i,j,m)} \cdot V_{\rho(i,j,m)}$$

where
  • $S_{(i,j,m)\sim\rho(i,j,m)}$: query–key similarity scores over the window;
  • $Q_{(i,j,m)}$: query vector at location $(i, j, m)$;
  • QE: learnable query embedding;
  • $K_{\rho(i,j,m)}$: key vectors at the positions in $\rho(i, j, m)$;
  • $\alpha_{(i,j,m)\sim\rho(i,j,m)}$: attention weights;
  • $d$: dimension of the query and key vectors;
  • $B_{(i,j,m)\sim\rho(i,j,m)}$: learnable position bias;
  • $V_{\rho(i,j,m)}$: value vectors at the positions in $\rho(i, j, m)$.
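To make the window-based formulation concrete, the following PyTorch sketch implements Approach 1 in its simplest form, using a dense unfold of the $k \times k \times k$ neighbourhood. The query embedding QE and the learnable bias are omitted, and all tensor shapes and names are illustrative assumptions rather than the released UNETR++ code; a practical implementation would use a dedicated CUDA kernel, as discussed in Section 5.

```python
import torch
import torch.nn.functional as F

def vfa_window_attention(q, k, v, window=3):
    """Sketch of query-centred sliding-window voxel attention (Approach 1).
    q, k, v: (B, C, D, H, W). Every voxel attends only to the window**3 voxels
    of its own neighbourhood, so the cost grows linearly with the voxel count.
    QE and the learnable position bias are omitted for brevity."""
    B, C, D, H, W = q.shape
    pad = window // 2
    k_pad = F.pad(k, [pad] * 6)   # zero-pad the three spatial dimensions
    v_pad = F.pad(v, [pad] * 6)
    # Gather each voxel's neighbourhood: (B, C, D, H, W, window**3)
    k_win = k_pad.unfold(2, window, 1).unfold(3, window, 1).unfold(4, window, 1)
    v_win = v_pad.unfold(2, window, 1).unfold(3, window, 1).unfold(4, window, 1)
    k_win = k_win.reshape(B, C, D, H, W, -1)
    v_win = v_win.reshape(B, C, D, H, W, -1)
    # Similarity of each query with the keys in its window, scaled by sqrt(d)
    sim = (q.unsqueeze(-1) * k_win).sum(dim=1) / C ** 0.5   # (B, D, H, W, window**3)
    attn = sim.softmax(dim=-1)
    return (v_win * attn.unsqueeze(1)).sum(dim=-1)          # (B, C, D, H, W)
```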
For Approach 2, we first define the set of voxels obtained by pooling the feature map as $\sigma(X)$. Given a pooled size of $H_p \times W_p \times D_p$, the size of this voxel set can be expressed as:

$$|\sigma(X)| = H_p \times W_p \times D_p.$$

The voxel-focused attention for Approach 2 can be expressed as:

$$S_{(i,j,m)\sim\rho(i,j,m)} = \left(Q_{(i,j,m)} + \mathrm{QE}\right) \cdot K_{\rho(i,j,m)}$$
$$S_{(i,j,m)\sim\sigma(X)} = \left(Q_{(i,j,m)} + \mathrm{QE}\right) \cdot K_{\sigma(X)}$$
$$B_{(i,j,m)} = \mathrm{Concat}\!\left(B_{(i,j,m)\sim\rho(i,j,m)},\ \text{log-CPB}\!\left(\Delta_{(i,j,m)\sim\sigma(X)}\right)\right)$$
$$\alpha_{(i,j,m)} = \mathrm{Softmax}\!\left(\tau \log N \cdot \mathrm{Concat}\!\left(S_{(i,j,m)\sim\rho(i,j,m)},\ S_{(i,j,m)\sim\sigma(X)}\right) + B_{(i,j,m)}\right)$$
$$\alpha_{(i,j,m)\sim\rho(i,j,m)},\ \alpha_{(i,j,m)\sim\sigma(X)} = \mathrm{Split}\!\left(\alpha_{(i,j,m)}\right)$$
$$\mathrm{FVFA}\!\left(X_{(i,j,m)}\right) = \left(\alpha_{(i,j,m)\sim\rho(i,j,m)} + Q_{(i,j,m)} \cdot T\right) \cdot V_{\rho(i,j,m)} + \alpha_{(i,j,m)\sim\sigma(X)} \cdot V_{\sigma(X)}$$
where
  • $S_{(i,j,m)\sim\rho(i,j,m)}$: query–key similarity scores over the sliding window;
  • $S_{(i,j,m)\sim\sigma(X)}$: query–key similarity scores over the pooled features;
  • $\alpha_{(i,j,m)}$: attention weights;
  • QE: learnable query embedding;
  • $\sigma(X)$: pooled feature map;
  • log-CPB: log-spaced continuous position bias;
  • $\Delta_{(i,j,m)\sim\sigma(X)}$: relative coordinates between the query and the pooled keys;
  • $T$: learnable tokens.
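The pooled path and the joint softmax over the concatenated similarities can be sketched as follows. Position biases, QE, the padding mask, and the learnable-token term $Q \cdot T$ are left out, and every shape and name is an assumption for illustration, not the authors' implementation.

```python
import torch

def fvfa_combine(q, sim_window, v_win, k_pool, v_pool, tau):
    """Sketch of Approach 2 (FVFA): window similarities (computed as in the
    previous sketch, before softmax) are concatenated with similarities against
    a pooled copy of the feature map, softmaxed jointly with the length scaling
    tau * log(N), then split back into the two paths.
      q:          (B, C, D, H, W)        queries, one per voxel
      sim_window: (B, D, H, W, k**3)     window similarity scores
      v_win:      (B, C, D, H, W, k**3)  window values
      k_pool:     (B, C, Np)             keys of the pooled feature map
      v_pool:     (B, C, Np)             values of the pooled feature map
    """
    sim_pool = torch.einsum("bcdhw,bcp->bdhwp", q, k_pool)       # (B, D, H, W, Np)
    n_keys = sim_window.shape[-1] + sim_pool.shape[-1]           # significant keys N
    scale = tau * torch.log(torch.tensor(float(n_keys)))
    attn = (scale * torch.cat([sim_window, sim_pool], dim=-1)).softmax(dim=-1)
    a_win, a_pool = attn.split([sim_window.shape[-1], sim_pool.shape[-1]], dim=-1)
    out_win = (v_win * a_win.unsqueeze(1)).sum(dim=-1)           # (B, C, D, H, W)
    out_pool = torch.einsum("bdhwp,bcp->bcdhw", a_pool, v_pool)  # (B, C, D, H, W)
    return out_win + out_pool
```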
3D Padding Mask: As in the pixel-focused attention similarity computation of TransNeXt [11], a padding mask is used to prevent zero-padded voxels at the edges of the feature map from influencing the softmax operation. This is achieved by setting the result of any similarity computation involving these voxels to $-\infty$.
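A minimal illustration of this masking, assuming a boolean `pad_mask` tensor (our own name) that marks window positions falling on zero padding:

```python
# pad_mask: True where a window position falls on zero padding, with the same
# trailing shape as sim_window. Masked scores become -inf, so they receive
# zero weight after the softmax.
sim_window = sim_window.masked_fill(pad_mask, float("-inf"))
attn = sim_window.softmax(dim=-1)
```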
Motivated by the exceptional performance of the aggregated attention mechanism demonstrated by TransNeXt [11], several of its key features were incorporated to further enhance the model's effectiveness. These include the Learnable Query Embedding (QE), a component originally designed to let vision–language models perform cross-attention between textual queries and visual keys, aiding tasks such as Visual Question Answering.
To enhance voxel-focused attention across multiscale image inputs, the position bias is computed differently on the two paths. On the pooled-feature path, a log-spaced continuous position bias (log-CPB) is computed by a two-layer multilayer perceptron (MLP) from the relative coordinates $\Delta_{(i,j,m)\sim\sigma(X)}$ between the query at $(i, j, m)$ and the pooled keys $K_{\sigma(X)}$. On the sliding-window path, a learnable position bias $B_{(i,j,m)\sim\rho(i,j,m)}$ is applied directly.
Additionally, length-scaled cosine attention is introduced to improve training stability and moderate the attention weights by using cosine similarity instead of dot-product attention. This method incorporates a learnable scaling factor $\lambda$ that adjusts to the input sequence length, expressed as $\lambda = \tau \log N$, where $\tau$ is a learnable variable and $N$ is the number of significant keys, excluding masked tokens.
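A compact sketch of this scaling, under the assumption that queries and keys are L2-normalised before the similarity is computed and that `tau` is a learnable scalar (e.g., an `nn.Parameter`); shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def length_scaled_cosine_similarity(q, k, tau, n_keys):
    """Cosine similarity between queries and keys scaled by lambda = tau * log(N)."""
    q = F.normalize(q, dim=-1)                     # unit-norm queries
    k = F.normalize(k, dim=-1)                     # unit-norm keys
    lam = tau * torch.log(torch.tensor(float(n_keys)))
    return lam * (q @ k.transpose(-2, -1))         # (..., queries, keys)
```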
Lastly, positional attention, or Query-Learnable-Value (QLV) attention, replaces static key–value pairs with learnable keys that adapt to each query. This dynamic approach improves positional information and enhances locality modelling, providing greater robustness than static methods. It introduces learnable tokens T for each attention head to generate adaptive position bias.
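The QLV idea can be sketched as a set of per-head learnable tokens that produce a query-dependent bias added to the window attention logits; the head count, token count, and initialisation below are illustrative assumptions, not the released code.

```python
import torch
import torch.nn as nn

num_heads, head_dim, num_tokens = 4, 32, 27        # illustrative sizes (3x3x3 window)
T = nn.Parameter(torch.empty(num_heads, num_tokens, head_dim))
nn.init.trunc_normal_(T, std=0.02)

q = torch.randn(2, num_heads, 1000, head_dim)      # (B, heads, voxels, dim)
dynamic_bias = q @ T.transpose(-2, -1)             # (B, heads, voxels, num_tokens)
# dynamic_bias is added to the sliding-window attention logits before softmax,
# giving each query an adaptive positional term instead of a static key set.
```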

3.3. Efficient Paired Attention Block

The EPA block can be broken into two branches: the spatial branch and the channel branch. Given an input volume X, the branches can be computed as follows:
$$X_s = \mathrm{SA}\!\left(Q_{\mathrm{shared}}, K_{\mathrm{shared}}, V_{\mathrm{spatial}}\right)$$
$$X_c = \mathrm{CA}\!\left(Q_{\mathrm{shared}}, K_{\mathrm{shared}}, V_{\mathrm{channel}}\right)$$
where $X_s$ denotes the spatial attention and $X_c$ denotes the channel attention. SA is the spatial attention module, which is performed by the VFA operation, and CA is the channel attention module. $Q_{\mathrm{shared}}$ and $K_{\mathrm{shared}}$ are query and key matrices shared between the branches. $V_{\mathrm{spatial}}$ and $V_{\mathrm{channel}}$ are the spatial and channel value matrices, respectively. Linear layers are used to generate these matrices as follows:
$$Q_{\mathrm{shared}} = W^{Q} X,\qquad K_{\mathrm{shared}} = W^{K} X,\qquad V_{\mathrm{spatial}} = W^{V_s} X,\qquad V_{\mathrm{channel}} = W^{V_c} X$$
where the weight matrices $W^{Q}$, $W^{K}$, $W^{V_s}$, and $W^{V_c}$ are learnable parameters used to linearly project the input volume $X$ into the query, key, and value matrices of the spatial and channel branches. For full-voxel-focused attention (FVFA), additional key and value matrices are generated and utilised in the pooling path of FVFA, as shown in Figure 3 and Figure 4. They can be expressed as follows:
$$K_{\mathrm{pool}} = W^{K_p}\, \sigma(X),\qquad V_{\mathrm{pool}} = W^{V_p}\, \sigma(X)$$
where $W^{K_p}$ and $W^{V_p}$ are learnable weight matrices used to linearly project the pooled input volume $\sigma(X)$.
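As a concrete illustration of these projections, a minimal module sketch is given below; the flattened-token layout, the adaptive pooling choice, and the pooled size are our own assumptions, and the released UNETR++ code may organise this differently.

```python
import torch
import torch.nn as nn

class EPAProjections(nn.Module):
    """Sketch of the EPA projections: queries and keys are shared between the
    spatial and channel branches, each branch has its own value projection,
    and FVFA adds key/value projections on a pooled copy of the input."""

    def __init__(self, dim, pool_size=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)           # W_Q, shared
        self.k = nn.Linear(dim, dim)           # W_K, shared
        self.v_spatial = nn.Linear(dim, dim)   # W_Vs
        self.v_channel = nn.Linear(dim, dim)   # W_Vc
        self.k_pool = nn.Linear(dim, dim)      # W_Kp (FVFA pooled path)
        self.v_pool = nn.Linear(dim, dim)      # W_Vp (FVFA pooled path)
        self.pool = nn.AdaptiveAvgPool1d(pool_size ** 3)

    def forward(self, x):                      # x: (B, N, dim) flattened voxels
        q_shared, k_shared = self.q(x), self.k(x)
        v_s, v_c = self.v_spatial(x), self.v_channel(x)
        x_pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, Np, dim)
        k_p, v_p = self.k_pool(x_pooled), self.v_pool(x_pooled)
        return q_shared, k_shared, v_s, v_c, k_p, v_p
```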
The spatial branch performs spatial attention using voxel-focused attention (VFA) or full-voxel-focused attention (FVFA). The original UNETR++ architecture projects the spatial information into a lower dimension so that the attention computation is linear with respect to the input sequence. Both FVFA and VFA also achieve linear computation. Algorithm 1 shows the overall process that volumetric medical images undergo to produce segmentations using VFA and FVFA UNETR++.
Algorithm 1: VFA and FVFA UNETR++ Segmentation Pseudocode
Input: 3D medical image volume.
Output: Segmented 3D volume.
  1. Preprocessing:
    • Resample the input volume to a target spacing.
    • Apply data augmentations:
      - Rotation, scaling, Gaussian noise, Gaussian blur;
      - Brightness and contrast adjustment, low-resolution simulation;
      - Gamma adjustment, mirroring.
  2. Encoder:
    • Split the input volume into non-overlapping 3D patches.
    • Embed the patches and add positional embeddings for spatial information.
    • Pass the embedded patches through multiple Transformer layers. For each layer:
      - Perform layer normalisation;
      - Apply VFA or FVFA on the spatial dimension;
      - Apply the channel attention mechanism;
      - Apply convolution;
      - Return the features of the layer.
  3. Decoder:
    • Initialise the decoder with the feature maps from the last Transformer layer.
    • For each decoder stage (in reverse order of the encoder stages):
      - Apply an EPA block with VFA or FVFA spatial attention;
      - Upsample the feature maps from the previous stage;
      - Concatenate the upsampled features with the corresponding encoder features (skip connections).
    • Repeat until the original input resolution is reached.
  4. Output layer:
    • Apply a final 1 × 1 × 1 convolution to reduce the channels to the number of segmentation classes.
    • Apply a convolutional block to generate the final segmentation mask.
  5. Postprocessing.
  6. Return the segmented 3D volume.

4. Experimental Results

4.1. Experimental Setup

To evaluate the performance and efficiency of the enhanced UNETR++ model, three datasets were used: Synapse, ACDC, and BraTS. The three benchmark datasets were selected to provide diverse imaging modalities, anatomical targets, and dataset scales, which together form a representative testbed for 3D medical segmentation.
  • Synapse: This dataset consists of 30 abdominal CT scans with annotations for eight target organs: spleen, right kidney, left kidney, gallbladder, liver, stomach, aorta, and pancreas. Although its sample size is modest, Synapse remains valuable because it is publicly available, well curated, and frequently used for efficiency comparisons, enabling reproducibility and direct benchmarking against a large body of prior work.
  • ACDC: The ACDC dataset contains 100 patients’ cardiac MRI images, with annotations for three anatomical structures: right ventricle (RV), left ventricle (LV), and myocardium (MYO).
  • BraTS: This dataset includes 484 MRI images with four modalities (FLAIR, T1w, T1gd, and T2w). The dataset is annotated for peritumoral oedema, GD-enhancing tumour, and necrotic/non-enhancing tumour core.
By spanning CT and MRI, healthy-organ and tumour segmentation, and small-, medium-, and large-scale data regimes, these three datasets together provide a balanced and representative evaluation of the proposed method’s efficiency and accuracy across diverse clinical scenarios. We used the publicly provided preprocessed datasets from the UNETR++ project. According to nnFormer [32], the preprocessing includes resampling all images to a uniform target spacing, followed by augmentations such as rotation, scaling, Gaussian noise, Gaussian blur, brightness and contrast adjustments, low-resolution simulation, gamma augmentation, and mirroring, applied during training. UNETR++ uses the same preprocessing pipeline and augmentation scheme, and we adopted those preprocessed data directly, without re-implementing the full pipeline. Regarding dataset splitting, we used the same training/validation/test splits as in the original UNETR++ work. This helps ensure comparable and fair evaluation across models.
The data splits for each dataset were as follows:
  • Synapse: In total, 18 samples were used for training, and 12 samples were used for evaluation.
  • ACDC: The dataset was split into a 70:10:20 ratio for training, validation, and testing, respectively.
  • BraTS: The data was split into an 80:5:15 ratio for training, validation, and testing, respectively.
These consistent splits ensure a fair comparison of the models’ performance across all datasets.
Evaluation Metrics: The Dice Similarity Coefficient (DSC) and 95% Hausdorff Distance (HD95) are the metrics used to measure the segmentation performance of the models. The DSC measures the overlap between two images: the prediction and the ground truth images. In the context of volumetric images, the DSC measures the overlap between the voxels of the segmentation prediction and the voxels of the ground truth. The DSC can be expressed as follows:
$$\mathrm{DSC}(Y, P) = \frac{2\,|Y \cap P|}{|Y| + |P|}$$
where Y and P denote ground truth voxels and predicted voxels, respectively.
HD95 is a boundary-based metric that measures the 95th percentile of distances between the boundaries of the segmentation predictions and the boundaries of the ground truth. HD95 can be expressed as follows:
$$\mathrm{HD95} = \max\left\{\ \sup_{a \in P} d(a, Y),\ \sup_{b \in Y} d(b, P)\ \right\}$$
where $d(a, Y)$ denotes the shortest distance from a point $a \in P$ to any point in $Y$, and $d(b, P)$ denotes the shortest distance from a point $b \in Y$ to any point in $P$. The boundaries of the prediction and ground truth are represented by $P$ and $Y$, respectively.
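Both metrics can be computed directly from the definitions above. The sketch below uses brute-force nearest-neighbour distances for HD95; the reported results rely on the standard evaluation pipeline (libraries such as MedPy provide an equivalent hd95 function), so this is only an illustrative implementation.

```python
import torch

def dice_score(pred, target, eps=1e-6):
    """DSC between binary prediction and ground-truth voxel masks,
    following the overlap formula above."""
    pred, target = pred.bool(), target.bool()
    inter = (pred & target).sum().float()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def hd95(pred_pts, gt_pts):
    """HD95 between two boundary point sets of shape (N, 3) and (M, 3):
    the 95th percentile of symmetric nearest-neighbour distances."""
    d = torch.cdist(pred_pts.float(), gt_pts.float())   # pairwise distances
    d_pg = d.min(dim=1).values                          # prediction -> ground truth
    d_gp = d.min(dim=0).values                          # ground truth -> prediction
    return torch.quantile(torch.cat([d_pg, d_gp]), 0.95)
```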
Implementation Details: Both VFA UNETR++ and FVFA UNETR++ were implemented using Python version 3.8.19 and PyTorch 2.4.0, with the MONAI libraries also utilised. The models were trained on a single Nvidia V100 16 GB (PCIe) GPU. For efficiency, a custom CUDA implementation was developed to compute the query-key similarity and aggregate the attention weights. The CUDA code was compiled using GCC version 9.2.0 and CUDA version 12.4. The Stochastic Gradient Descent (SGD) optimiser was used with an initial learning rate of 0.01, which was gradually decreased using a Polynomial Learning Rate Schedule at each epoch:
$$\mathrm{lr} = \mathrm{initial\_lr} \times \left(1 - \frac{\mathrm{epoch\_id}}{\mathrm{max\_epoch}}\right)^{0.9}$$
where
  • lr: Learning rate at the current epoch;
  • initial_lr: Initial learning rate at the start of training;
  • epoch_id: The current epoch number;
  • max_epoch: The total number of epochs for training;
  • 0.9: The exponent that controls the decay rate.
The weight decay used across all datasets was set to 3 × 10⁻⁵, and the default SGD momentum (0.99) was applied. The hyperparameters were kept consistent with those of UNETR++ to ensure a fair comparison. Each dataset was trained for 1000 epochs with varying input resolutions and patch resolutions: ACDC with an input resolution of 16 × 160 × 160 and a patch size of 1 × 4 × 4, Synapse with an input resolution of 64 × 128 × 128 and a patch size of 2 × 4 × 4, and BraTS with an input resolution of 128 × 128 × 128 and a patch size of 4 × 4 × 4. Data augmentation was performed in line with the methods used in UNETR, nnFormer, and the original UNETR++.
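A minimal sketch of this optimisation setup follows; the model, data loop, and loss are placeholders, and only the learning rate, momentum, weight decay, epoch count, and decay exponent come from the values listed above.

```python
import torch

initial_lr, max_epoch = 0.01, 1000
model = torch.nn.Conv3d(1, 2, 3)   # placeholder for VFA/FVFA UNETR++
optimizer = torch.optim.SGD(model.parameters(), lr=initial_lr,
                            momentum=0.99, weight_decay=3e-5)

for epoch_id in range(max_epoch):
    # Polynomial learning-rate decay applied once per epoch
    lr = initial_lr * (1 - epoch_id / max_epoch) ** 0.9
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... one training epoch over the augmented 3D patches ...
```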

4.2. Comparison with State-of-the-Art Methods

4.2.1. Synapse Dataset

Table 1 shows the segmentation performance on the Synapse dataset, including the number of parameters, FLOPs, the Dice score for each organ, and the average HD95 and Dice scores. Results are reported for both classical and state-of-the-art models. The results for UNETR++ were reproduced using the weights provided by the authors, while the results for the other methods were taken from their respective published papers. Both voxel-focused attention (VFA) and full-voxel-focused attention (FVFA) results are included. FVFA demonstrated competitive results in terms of Dice score while having fewer parameters than the other models. However, FVFA has higher computational complexity (measured in FLOPs) and requires a longer training time than the original UNETR++ and VFA UNETR++. In Figure 5, the qualitative comparison illustrates that FVFA UNETR++ achieves a better HD95 score, with fewer outliers, reduced voxel misclassification, and sharper boundary delineation.

4.2.2. ACDC Dataset

Table 2 and Table 3 present the segmentation results and computational efficiency on the ACDC dataset. FVFA UNETR++ achieves the highest performance while having the fewest parameters. However, it has a higher FLOP count and longer training time compared to the original UNETR++ and VFA UNETR++. FVFA UNETR++, VFA UNETR++, and UNETR++ outperform the other state-of-the-art methods in terms of average DSC. FVFA UNETR++ has the lowest number of parameters, with 41% fewer than UNETR++, while VFA has 35% fewer parameters than UNETR++. VFA also has lower FLOPs than the other three models. Figure 6 shows the qualitative comparison of segmentations produced by the three models. The differences between the segmentations are minimal, as the models have very close segmentation performance, with a maximum DSC difference of 0.19. However, VFA UNETR++ is more efficient overall due to improvements in the number of parameters and FLOPs.

4.2.3. BraTS Dataset

The experimental results for tumour segmentation are presented in Table 4 and Table 5, covering both segmentation performance and computational efficiency. UNETR++, VFA UNETR++, and FVFA UNETR++ demonstrate similar performance, with a difference of no more than 0.4 DSC between them. VFA UNETR++ achieved the highest DSC of 82.8, with the original UNETR++ trailing by only 0.1 DSC. FVFA UNETR++ had the lowest performance of the three models on the BraTS dataset. All three models outperform the other state-of-the-art models not only in segmentation performance but also in the number of parameters, FLOPs, and memory usage, with the exception that FVFA UNETR++ uses more memory than UNETR. FVFA UNETR++ has the lowest parameter count of all the models (21.4 M), a 50% reduction compared to the original UNETR++. Among the three models, UNETR++ has the shortest average epoch training time at 244.14 s, while FVFA UNETR++ has the longest, with a difference of only 4.44 s. VFA UNETR++ has the fastest inference time and the best (lowest) HD95. FVFA UNETR++ has the longest inference time at 7.94 s, which is 2.66 s longer than that of VFA UNETR++. Figure 7 shows the segmentations of the three models for qualitative analysis.

5. Comparison of CUDA C++ and Native Python-Only Implementations

The computational efficiency of the CUDA and Python-only implementations is presented in Table 6 and Table 7. Table 6 shows results for FVFA UNETR++, while Table 7 shows results for VFA UNETR++. The CUDA implementation outperforms the Python-only implementation in training time, inference speed, and memory usage for both VFA and FVFA UNETR++. For FVFA UNETR++ inference time, the CUDA implementation is 12.36% faster than the Python-only version. Similarly, the CUDA implementation for VFA UNETR++ achieves a 15.52% faster inference speed compared to the Python-only version. The training time difference is smaller, with the CUDA implementation being 5% faster for FVFA UNETR++ and 1% faster for VFA UNETR++. However, the Python-only implementation of VFA UNETR++ is faster than the CUDA implementation of FVFA UNETR++ for inference. In terms of memory efficiency, the CUDA implementation significantly reduces GPU memory usage, with FVFA UNETR++ using 22.67% less GPU memory and VFA UNETR++ using 26.19% less.
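The timing and memory comparisons above can be reproduced in principle with a simple harness such as the sketch below; the warm-up count, iteration count, and input size are arbitrary choices, and absolute numbers depend on the GPU and model configuration.

```python
import time
import torch

def benchmark(model, x, warmup=5, iters=20):
    """Rough sketch for comparing inference time and peak GPU memory between
    two implementations (e.g., Python-only vs. custom CUDA kernels)."""
    model = model.eval().cuda()
    x = x.cuda()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up passes to stabilise timings
            model(x)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = time.time()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters            # average seconds per pass
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return elapsed, peak_gb
```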

6. Discussion

Both approaches demonstrate excellent results, but a notable trade-off was observed between them on the ACDC dataset: the full VFA has fewer trainable parameters yet a longer inference time than the semi-VFA (window-only) variant. This suggests a direction for future work to optimise the computational efficiency of the full VFA implementation and reduce its inference time. Additionally, future research could explore applying the voxel-focused attention mechanism to other hybrid or Transformer-based architectures for 3D medical image segmentation to determine whether similar gains in efficiency and performance can be achieved.

7. Conclusions

This research proposes two competitive models, voxel-focused attention UNETR++ (VFA UNETR++) and full-voxel-focused attention UNETR++ (FVFA UNETR++), which extend pixel-focused attention and adapt aggregated attention from TransNeXt for 3D medical image segmentation. The experimental results demonstrate the competitiveness of the proposed models in terms of DSC, HD95, and computational efficiency. Both models learn effectively from the data with fewer parameters while maintaining strong segmentation performance and computational efficiency.

Author Contributions

Conceptualization, S.N. and S.V.; methodology, S.N.; software, S.N.; validation, S.V.; formal analysis, S.V.; investigation, S.N.; resources, S.V.; data curation, S.N.; writing—original draft preparation, S.N.; writing—review and editing, S.V.; visualization, S.N.; supervision, S.V.; project administration, S.V.; funding acquisition, S.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in Kaggle at https://www.kaggle.com/datasets/awsaf49/brats2020-training-data (accessed on 7 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ViTs: Vision Transformers
CNNs: Convolutional Neural Networks
VFA: Voxel-Focused Attention
FVFA: Full-Voxel-Focused Attention
EPA: Efficient Paired Attention
DADC: Dual Attention and Depthwise Convolution
QLV: Query-Learnable-Value
LQE: Learnable Query Embedding
DSC: Dice Similarity Coefficient
HD95: Hausdorff Distance 95
Spl: Spleen
RKid: Right Kidney
LKid: Left Kidney
Gal: Gallbladder
Liv: Liver
Sto: Stomach
Aor: Aorta
Pan: Pancreas
RV: Right Ventricle
LV: Left Ventricle
MYO: Myocardium

References

  1. National Health Service. Magnetic Resonance Imaging (MRI) Scan; Retrieved from Health A to Z; National Health Service: London, UK, 2022.
  2. National Health Service. CT Scan; Retrieved from Health A to Z; National Health Service: London, UK, 2023.
  3. Padhi, S.; Rup, S.; Saxena, S.; Mohanty, F. Mammogram Segmentation Methods: A Brief Review. In Proceedings of the 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), Jaipur, India, 28–29 September 2019; pp. 218–223. [Google Scholar] [CrossRef]
  4. Doi, K. Computer-aided diagnosis in medical imaging: Historical review, current status and future potential. Comput. Med. Imaging Graph. 2007, 31, 198–211. [Google Scholar] [CrossRef]
  5. Chen, Y.; Zhuang, Z.; Chen, C. Object Detection Method Based on PVTv2. In Proceedings of the 2023 IEEE 3rd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 26–28 May 2023; pp. 730–734. [Google Scholar] [CrossRef]
  6. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  7. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  8. Yang, J.; Li, C.; Dai, X.; Gao, J. Focal Modulation Networks. In Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 4203–4217. [Google Scholar]
  9. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. arXiv 2022, arXiv:2107.00652. [Google Scholar] [CrossRef]
  10. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. arXiv 2023, arXiv:2211.05778. [Google Scholar]
  11. Shi, D. TransNeXt: Robust Foveal Visual Perception for Vision Transformers. arXiv 2024, arXiv:2311.17132. [Google Scholar]
  12. Shaker, A.M.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation. IEEE Trans. Med. Imaging 2024, 43, 3377–3390. [Google Scholar] [CrossRef]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  14. Bernard, O.; Lalande, A.; Zotti, C.; Cervenansky, F.; Yang, X.; Heng, P.A.; Cetin, I.; Lekadir, K.; Camara, O.; Gonzalez Ballester, M.A.; et al. Deep Learning Techniques for Automatic MRI Cardiac Multi-Structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [PubMed]
  15. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef]
  16. Landman, B.; Xu, Z.; Igelsias, J.; Styner, M.; Langerak, T.; Klein, A. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. In Proceedings of the MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, Munich, Germany, 5 October 2015; Volume 5, p. 12. [Google Scholar]
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; pp. 234–241. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
  19. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. arXiv 2020, arXiv:2004.08790. [Google Scholar]
  20. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2016, Athens, Greece, 17–21 October 2016; Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W., Eds.; pp. 424–432. [Google Scholar]
  21. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. arXiv 2018, arXiv:1807.10165. [Google Scholar]
  22. Angermann, C.; Haltmeier, M. Random 2.5 d u-net for fully 3d segmentation. In Proceedings of the Machine Learning and Medical Engineering for Cardiovascular Health and Intravascular Imaging and Computer Assisted Stenting: First International Workshop, MLMECH 2019, and 8th Joint International Workshop, CVII-STENT 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, 13 October 2019; Proceedings 1. Springer: Berlin/Heidelberg, Germany, 2019; pp. 158–166. [Google Scholar]
  23. Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P.F.; Kohl, S.; Wasserthal, J.; Koehler, G.; Norajitra, T.; Wirkert, S.; et al. nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation. arXiv 2018, arXiv:1809.10486. [Google Scholar]
  24. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar] [CrossRef]
  25. Roy, S.; Koehler, G.; Ulrich, C.; Baumgartner, M.; Petersen, J.; Isensee, F.; Jaeger, P.F.; Maier-Hein, K.H. Mednext: Transformer-driven scaling of convnets for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Daejeon, Republic of Korea, 23–27 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 405–415. [Google Scholar]
  26. Li, H.; Nan, Y.; Yang, G. LKAU-Net: 3D Large-Kernel Attention-Based U-Net for Automatic MRI Brain Tumor Segmentation. In Proceedings of the Medical Image Understanding and Analysis; Yang, G., Aviles-Rivero, A., Roberts, M., Schönlieb, C.B., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 313–327. [Google Scholar]
  27. Guo, M.H.; Liu, Z.N.; Mu, T.J.; Hu, S.M. Beyond self-attention: External attention using two linear layers for visual tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5436–5447. [Google Scholar] [CrossRef]
  28. Huang, T.; Huang, L.; You, S.; Wang, F.; Qian, C.; Xu, C. Lightvit: Towards light-weight convolution-free vision transformers. arXiv 2022, arXiv:2207.05557. [Google Scholar]
  29. Karimi, D.; Vasylechko, S.D.; Gholipour, A. Convolution-Free Medical Image Segmentation Using Transformers. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021, Strasbourg, France, 27 September–1 October 2021; de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 78–88. [Google Scholar]
  30. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  31. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  32. Zhou, H.Y.; Guo, J.; Zhang, Y.; Yu, L.; Wang, L.; Yu, Y. nnformer: Interleaved transformer for volumetric segmentation. arXiv 2021, arXiv:2109.03201. [Google Scholar]
  33. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  34. Liu, Y.; Zhang, Z.; Yue, J.; Guo, W. SCANeXt: Enhancing 3D medical image segmentation with dual attention network and depth-wise convolution. Heliyon 2024, 10, e26775. [Google Scholar] [CrossRef]
  35. Huang, X.; Deng, Z.; Li, D.; Yuan, X. Missformer: An effective medical image segmentation transformer. arXiv 2021, arXiv:2109.07162. [Google Scholar] [CrossRef] [PubMed]
  36. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.R.; Xu, D. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In Proceedings of the International MICCAI Brainlesion Workshop; Springer: Berlin/Heidelberg, Germany, 2021; pp. 272–284. [Google Scholar]
  37. Rahman, M.M.; Marculescu, R. Medical Image Segmentation via Cascaded Attention Decoding. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 6211–6220. [Google Scholar] [CrossRef]
  38. Niknejad, M.; Firouzbakht, M.; Amirmazlaghani, M. Enhancing Precision in Dermoscopic Imaging using TransUNet and CASCADE. In Proceedings of the 2024 32nd International Conference on Electrical Engineering (ICEE), Tehran, Iran, 14–16 May 2024; pp. 1–5. [Google Scholar] [CrossRef]
  39. Azad, R.; Niggemeier, L.; Huttemann, M.; Kazerouni, A.; Aghdam, E.K.; Velichko, Y.; Bagci, U.; Merhof, D. Beyond Self-Attention: Deformable Large Kernel Attention for Medical Image Segmentation. arXiv 2023, arXiv:2309.00121. [Google Scholar] [CrossRef]
  40. Rahman, M.M.; Marculescu, R. Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation. arXiv 2023, arXiv:2303.16892. [Google Scholar] [CrossRef]
  41. Rahman, M.M.; Munir, M.; Marculescu, R. EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation. arXiv 2024, arXiv:2405.06880. [Google Scholar] [CrossRef]
  42. Perera, S.; Navard, P.; Yilmaz, A. SegFormer3D: An Efficient Transformer for 3D Medical Image Segmentation. arXiv 2024, arXiv:2404.10156. [Google Scholar]
  43. Tsai, T.Y.; Yu, A.; Maadugundu, M.S.; Mohima, I.J.; Barsha, U.H.; Chen, M.H.F.; Prabhakaran, B.; Chang, M.C. IntelliCardiac: An Intelligent Platform for Cardiac Image Segmentation and Classification. arXiv 2025, arXiv:2505.03838. [Google Scholar] [CrossRef]
Figure 1. Model parameters, Dice similarity score, and FLOP size. The circle size represents the FLOP size.
Figure 2. Overview of the enhanced UNETR++ architecture: The overall structure of the architecture remains unchanged, with enhancements applied only to the Efficient Paired Attention (EPA) block.
Figure 3. Overview of the enhanced Efficient Paired Attention block with the spatial branch at the top and the channel branch at the bottom. (Left): The spatial branch is enhanced to use the full-voxel-focused attention mechanism. (Right): The spatial branch is enhanced to use the semi-voxel-focused attention mechanism.
Figure 4. Illustration of the 3D sliding window. The left side shows a 4 × 4 × 4 input feature. The middle image shows a 3 × 3 × 3 sliding window, with the centre voxel highlighted in red to indicate the query voxel, while the surrounding teal voxels represent key voxels, demonstrating the query-centred nature of the similarity computation. The right side presents an application of the 3D sliding window on the input feature.
Figure 5. Qualitative comparison of segmentation performance among UNETR++, VFA UNETR++, and FVFA UNETR++ on the Synapse dataset. The dashed red boxes highlight segmentation errors.
Figure 6. Qualitative comparison of segmentation performance among UNETR++, VFA UNETR++, and FVFA UNETR++ on the ACDC dataset.
Figure 7. Qualitative comparison of segmentation performance among UNETR++, VFA UNETR++, and FVFA UNETR++ on the BraTs dataset.
Table 1. Comparison of state-of-the-art methods on the abdominal multi-organ Synapse dataset, using Dice Similarity Coefficient (DSC) and Hausdorff Distance 95 (HD95) metrics. The target anatomical structures are Spl (spleen), RKid (right kidney), LKid (left kidney), Gal (gallbladder), Liv (liver), Sto (stomach), Aor (aorta), and Pan (pancreas). Additional columns report the number of parameters (Params) and floating-point operations (FLOPs). The best performance scores are highlighted.
| Methods | Params | FLOPs | Spl | RKid | LKid | Gal | Liv | Sto | Aor | Pan | Avg. HD95 ↓ | Avg. DSC ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| U-Net [17] | - | - | 86.67 | 68.60 | 77.77 | 69.72 | 93.43 | 75.58 | 89.07 | 53.98 | - | 76.85 |
| TransUNet [30] | 96.07 | 88.91 | 85.08 | 77.02 | 81.87 | 63.16 | 94.08 | 75.62 | 87.23 | 55.86 | 31.69 | 77.49 |
| Swin-UNet [33] | - | - | 90.66 | 79.61 | 83.28 | 66.53 | 94.29 | 76.60 | 85.47 | 56.58 | 21.55 | 79.13 |
| UNETR [31] | 92.49 | 75.76 | 85.00 | 84.52 | 85.60 | 56.30 | 94.57 | 78.00 | 89.60 | 60.47 | 18.59 | 78.35 |
| MISSFormer [35] | - | - | 91.92 | 82.00 | 85.21 | 68.65 | 94.41 | 80.81 | 89.06 | 65.67 | 18.20 | 81.96 |
| Swin-UNETR [36] | 62.83 | 384.2 | 95.37 | 86.26 | 86.99 | 66.54 | 95.72 | 77.01 | 91.12 | 68.80 | 10.55 | 83.48 |
| nnFormer [32] | 150.5 | 213.4 | 90.51 | 86.25 | 86.57 | 70.17 | 96.84 | 86.83 | 92.04 | 83.35 | 10.63 | 86.57 |
| UNETR++ [12] | 42.96 | 47.98 | 89.98 | 90.29 | 87.58 | 73.20 | 96.92 | 85.18 | 92.28 | 82.19 | 9.84 | 87.20 |
| SCANeXt [34] | 43.38 | 50.53 | 95.69 | 87.46 | 85.92 | 73.66 | 96.98 | 85.60 | 92.96 | 82.95 | 7.47 | 89.67 |
| PVT-CASCADE [37] | - | - | 83.44 | 80.37 | 82.23 | 70.59 | 94.08 | 83.69 | 83.01 | 64.43 | 20.23 | 81.06 |
| TransCASCADE [38] | - | - | 90.79 | 84.56 | 87.66 | 68.48 | 94.43 | 83.52 | 86.63 | 65.33 | 17.34 | 82.68 |
| 2D D-LKA Net [39] | - | - | 91.22 | 84.92 | 88.38 | 73.79 | 94.88 | 84.94 | 88.34 | 67.71 | 20.04 | 84.27 |
| MERIT-GCASCADE [40] | - | - | 91.92 | 84.83 | 88.01 | 74.81 | 95.38 | 83.63 | 88.05 | 69.73 | 10.38 | 84.54 |
| PVT-EMCAD-B2 [41] | - | - | 92.17 | 84.10 | 88.08 | 68.87 | 95.26 | 83.92 | 88.14 | 68.51 | 15.68 | 83.63 |
| VFA UNETR++ | 28.6 | 46.26 | 88.68 | 87.77 | 87.67 | 69.61 | 97.05 | 84.55 | 92.43 | 81.49 | 10.49 | 86.16 |
| FVFA UNETR++ | 21.42 | 52.27 | 89.56 | 87.39 | 87.33 | 72.96 | 96.71 | 85.56 | 92.56 | 81.68 | 8.84 | 86.72 |
↓ indicates that lower values are better, and ↑ indicates that higher values are better. Bold indicates the best result in each column.
Table 2. State-of-the-art comparison on the ACDC dataset using Dice Similarity Coefficient scores.
| Methods | RV | Myo | LV | Average (DSC) |
|---|---|---|---|---|
| TransUNet [30] | 88.86 | 84.54 | 95.73 | 89.71 |
| Swin-UNet [33] | 88.55 | 85.62 | 95.83 | 90.00 |
| UNETR [31] | 85.29 | 86.52 | 94.02 | 86.61 |
| MISSFormer [35] | 86.36 | 85.75 | 91.59 | 87.90 |
| nnFormer [32] | 90.94 | 89.58 | 95.65 | 92.06 |
| UNETR++ [12] | 91.89 | 90.61 | 96.00 | 92.83 |
| SegFormer3D [42] | 88.50 | 88.86 | 95.53 | 90.96 |
| IntelliCardiac [43] | 92.27 | 90.33 | 95.09 | 92.56 |
| VFA UNETR++ | 91.45 | 90.59 | 96.10 | 92.71 |
| FVFA UNETR++ | 91.75 | 90.74 | 96.21 | 92.90 |
The proposed techniques are highlighted, and their best results are shown in bold.
Table 3. Number of parameters and FLOPs for UNETR++, VFA UNETR++, and FVFA UNETR++.
| Methods | Params (M) | FLOPs |
|---|---|---|
| UNETR++ [12] | 66.8 | 43.71 |
| VFA UNETR++ | 44.36 | 42.4 |
| FVFA UNETR++ | 39.03 | 47.54 |
The proposed techniques are highlighted, and their best results are shown in bold.
Table 4. Comparison of state-of-the-art methods on the BraTS dataset, including the number of parameters, FLOPs, memory (GB), and Dice Similarity Coefficient score.
| Methods | Params | FLOPs | Mem | DSC (%) |
|---|---|---|---|---|
| UNETR [31] | 92.5 | 153.5 | 3.3 | 81.2 |
| SwinUNETR [36] | 62.8 | 572.4 | 19.7 | 81.5 |
| nnFormer [32] | 149.6 | 421.5 | 12.6 | 82.3 |
| UNETR++ [12] | 42.6 | 70.1 | 2.7 | 82.7 |
| VFA UNETR++ | 28.6 | 72.29 | 3.1 | 82.8 |
| FVFA UNETR++ | 21.4 | 78.3 | 5.8 | 82.4 |
The proposed techniques are highlighted, and their best results are shown in bold.
Table 5. Average HD95, training time, and inference time for the original, VFA, and FVFA UNETR++ in seconds.
| Methods | Training Time | Inference Time | HD95 |
|---|---|---|---|
| UNETR++ [12] | 244.14 | 5.69 | 5.27 |
| VFA UNETR++ | 245.21 | 5.28 | 5.01 |
| FVFA UNETR++ | 248.58 | 7.94 | 5.08 |
The proposed techniques are highlighted, and their best results are shown in bold.
Table 6. Comparison of the computational efficiency between CUDA and Python-only implementations for the full voxel-focused attention (FVFA) UNETR++ model.
| Implementations | Training Time (s) | Inference Time (s) | Mem |
|---|---|---|---|
| Python | 261.15 | 9.06 | 7.5 |
| CUDA | 248.58 | 7.94 | 5.8 |
Table 7. Comparison of the computational efficiency between CUDA and Python-only implementations for the voxel-focused attention (VFA) model.
| Implementations | Training Time (s) | Inference Time (s) | Mem |
|---|---|---|---|
| Python | 248.37 | 6.25 | 4.2 |
| CUDA | 245.21 | 5.28 | 3.1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
