Article

UNETR with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers++

by Sithembiso Ntanzi and Serestina Viriri *,†
Computer Science Discipline, University of KwaZulu-Natal, Durban 4000, South Africa
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(20), 11034; https://doi.org/10.3390/app152011034
Submission received: 17 August 2025 / Revised: 25 September 2025 / Accepted: 8 October 2025 / Published: 14 October 2025

Abstract

There have been significant breakthroughs in developing models for segmenting 3D medical images, with many promising results attributed to the incorporation of Vision Transformers (ViTs). However, the fundamental mechanism of transformers, self-attention, has quadratic complexity, which significantly increases computational requirements, especially for 3D medical images. In this paper, we investigate the UNETR++ model and propose a voxel-focused attention mechanism inspired by the pixel-focused attention of TransNeXt. The core component of UNETR++ is the Efficient Paired Attention (EPA) block, which learns from two interdependent branches: spatial and channel attention. The deficiency of UNETR++ lies in its reliance on dimensionality reduction for spatial attention, which improves efficiency but risks information loss. Our contribution is to replace this projection of the keys and values into lower dimensions with a voxel-focused attention design that has linear complexity with respect to input sequence length, thereby reducing parameters while preserving representational power. This effectively reduces the model’s parameter count while maintaining competitive performance and inference speed. On the Synapse dataset, the enhanced UNETR++ model contains 21.42 M parameters, a 50% reduction from the original 42.96 M, while achieving a competitive Dice score of 86.72%.
Keywords: volumetric medical image segmentation; efficient attention; hybrid architecture; 3D sliding window
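The abstract only sketches the mechanism. As an informal illustration, the core idea behind voxel-focused attention, restricting each query voxel to a fixed k×k×k neighbourhood so that cost grows linearly with the number of voxels instead of projecting the keys and values to a lower dimension, can be sketched in PyTorch as follows. The class name VoxelFocusedAttention, the single-head formulation, and the window parameter are assumptions made for illustration only and do not reproduce the authors’ exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelFocusedAttention(nn.Module):
    """Illustrative local-window ("voxel-focused") attention: each query voxel
    attends only to its k*k*k neighbourhood, so the cost is O(N * k^3), i.e.,
    linear in the number of voxels N. Hypothetical sketch, not the paper's code."""
    def __init__(self, channels, window=3):
        super().__init__()
        self.window = window
        self.scale = channels ** -0.5
        self.to_q = nn.Conv3d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv3d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv3d(channels, channels, kernel_size=1)
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)

    def _neighbourhoods(self, t):
        # Gather the k*k*k neighbourhood of every voxel:
        # (B, C, D, H, W) -> (B, C, D, H, W, k^3)
        k, p = self.window, self.window // 2
        t = F.pad(t, (p, p, p, p, p, p))
        t = t.unfold(2, k, 1).unfold(3, k, 1).unfold(4, k, 1)  # (B, C, D, H, W, k, k, k)
        return t.reshape(*t.shape[:5], -1)

    def forward(self, x):
        q = self.to_q(x) * self.scale                    # (B, C, D, H, W)
        k = self._neighbourhoods(self.to_k(x))           # (B, C, D, H, W, k^3)
        v = self._neighbourhoods(self.to_v(x))           # (B, C, D, H, W, k^3)
        attn = (q.unsqueeze(-1) * k).sum(dim=1)          # similarity per neighbour
        attn = attn.softmax(dim=-1)                      # normalise over the window
        out = (v * attn.unsqueeze(1)).sum(dim=-1)        # weighted sum of neighbours
        return self.proj(out)

# Example usage (shapes only):
# attn = VoxelFocusedAttention(channels=32, window=3)
# x = torch.randn(1, 32, 16, 16, 16)
# attn(x).shape  # -> torch.Size([1, 32, 16, 16, 16])
```

Because every voxel looks at a constant-size window, doubling the input volume doubles the attention cost, in contrast to dense self-attention, whose cost grows quadratically with the number of voxels.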

Share and Cite

MDPI and ACS Style

Ntanzi, S.; Viriri, S. UNETR with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers++. Appl. Sci. 2025, 15, 11034. https://doi.org/10.3390/app152011034

AMA Style

Ntanzi S, Viriri S. UNETR with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers++. Applied Sciences. 2025; 15(20):11034. https://doi.org/10.3390/app152011034

Chicago/Turabian Style

Ntanzi, Sithembiso, and Serestina Viriri. 2025. "UNETR with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers++" Applied Sciences 15, no. 20: 11034. https://doi.org/10.3390/app152011034

APA Style

Ntanzi, S., & Viriri, S. (2025). UNETR with Voxel-Focused Attention: Efficient 3D Medical Image Segmentation with Linear-Complexity Transformers++. Applied Sciences, 15(20), 11034. https://doi.org/10.3390/app152011034

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
