Article

3D TractFormer: 3D Direct Volumetric White Matter Tract Segmentation with Hybrid Channel-Wise Transformer

School of Information & Communication Technology, Griffith University, Gold Coast, QLD 4215, Australia
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 1068; https://doi.org/10.3390/s26031068
Submission received: 12 January 2026 / Revised: 26 January 2026 / Accepted: 2 February 2026 / Published: 6 February 2026
(This article belongs to the Special Issue Secure AI for Biomedical Sensing and Imaging Applications)

Abstract

Segmenting white matter tracts in diffusion-weighted magnetic resonance imaging (dMRI) is of vital importance for brain health analysis. It remains a challenging task due to the intersection and overlap of tracts (i.e., multiple tracts coexist in one voxel) and the data complexity of dMRI images (e.g., 4D data with high spatial resolution). Existing methods that demonstrate good performance implement direct volumetric tract segmentation by operating on individual 2D slices. However, this ignores 3D contextual information, requires additional post-processing, and struggles with the boundary handling of 3D volumes. Therefore, in this paper, we propose an efficient 3D direct volumetric segmentation method for segmenting white matter tracts. It has three key innovations. First, we propose to deeply interleave convolutions and transformer blocks into a U-shaped network, which effectively integrates their respective strengths to extract spatial contextual features and global long-distance dependencies for enhanced feature extraction. Second, we propose a novel channel-wise transformer, which integrates depth-wise separable convolution and compressed contextual feature-based channel-wise attention, effectively addressing the memory and computational challenges of 4D computing. Moreover, it helps to model global dependencies of contextual features and ensures each hierarchical layer focuses on complementary features. Third, we propose to train a fully symmetric network with progressively sized volumetric patches, which alleviates the challenge of few 3D training samples and further reduces memory and computational costs. Experimental results on the largest publicly available tract-specific tractograms dataset demonstrate the superiority of the proposed method over the current state-of-the-art methods.

1. Introduction

White matter tract segmentation in diffusion-weighted magnetic resonance imaging (dMRI) is a critical component in the study of brain function and the diagnosis of diseases [1,2]. It presents significant challenges due to the complex nature of white matter tracts and dMRI data, such as the presence of intersecting and overlapping tracts within a single voxel and the 4D high spatial resolution. Various methods have been proposed for automated white matter tract segmentation [3,4,5], which can be broadly classified into two categories: (1) regions-of-interest (ROI)/clustering-based methods, and (2) direct volumetric segmentation.
ROI/clustering-based methods achieve tract segmentation by grouping the fiber streamlines into anatomically defined fiber tracts. They typically require an entire series of processing steps, including atlas registration, tractography, and clustering or parcellation. Therefore, these methods are computationally expensive and tedious for fine-tuning. Direct volumetric segmentation generates complete tract segmentation directly from the input images. It does not require performing tractography that is sensitive to the choice of tracking algorithm and is therefore simple and efficient.
In the early stages, many efforts were made toward direct volumetric segmentation using techniques such as Markov random fields and k-nearest neighbors [6,7,8]. However, the segmentation quality achieved by these methods was limited, which in turn pushed subsequent research to focus on ROI/clustering-based methods. Direct volumetric segmentation of white matter tracts did not achieve promising performance until recently, when Wasserthal et al. [4] proposed using a Unet-based network, TractSeg, to segment white matter tracts in the field of fiber orientation distribution function (fODF) peaks [9].
After that, several studies on enhancing direct volumetric tract segmentation have been conducted, as reviewed and discussed in [5,8]. However, these studies are based on 2D slices. They ignore 3D contextual information, which may lead to inconsistency and discontinuity in segmentation of 3D volumes. Furthermore, they require additional post-processing steps and encounter difficulties in handling the boundaries of 3D volumes. This motivates us to develop an efficient 3D direct volumetric segmentation method for segmenting white matter tracts.
In the literature, Wasserthal et al. [10] attempted to employ a 3D Unet for direct volumetric tract segmentation but failed. Li et al. [11] achieved 3D direct volumetric tract segmentation in native diffusion space, but their method only worked for six gradient directions and single tract segmentation. Nelkenbaum et al. [12] implemented 3D direct volume tract segmentation in MRI images, where the data has very different characteristics from dMRI data. This shows that the main challenges in developing 3D direct volumetric tract segmentation in dMRI are: (1) the inadequacy of relying solely on 3D convolutions to learn the features required for effective segmentation from high angular resolution data, (2) the large memory and computational cost associated with 4D computations impacting the development of sophisticated networks, and (3) the scarcity of 3D annotation data affecting the network training.
In this paper, we address the above challenges through three major innovative designs and, for the first time, propose an efficient 3D direct volumetric white matter tract segmentation method that can simultaneously generate segmentation of 72 anatomically well-described tracts. We name the proposed method 3D TractFormer. It has the following key contributions.
  • We propose to deeply interleave convolutions and transformer blocks into a U-shaped network. This effectively integrates the respective strengths of both to extract spatial contextual features and global long-distance dependencies, thereby enhancing feature extraction for direct volumetric tract segmentation.
  • We propose a novel memory- and computationally efficient channel-wise transformer. It integrates depth-wise separable convolution and compressed contextual feature-based channel-wise attention. It can effectively address the memory and computational challenges of 4D computation. Moreover, it facilitates the modeling of global dependencies of contextual features and ensures that each hierarchical layer of the 3D TractFormer focuses on complementary features for tract segmentation.
  • We propose to train a fully symmetric network with progressively sized volumetric patches, which alleviates the challenge of few 3D training samples and further reduces memory and computational costs.
  • We show experimentally that the proposed 3D TractFormer outperforms existing methods that have demonstrated state-of-the-art performance for white matter tract segmentation.

2. Related Works

2.1. White Matter Tract Segmentation

White matter tract segmentation in dMRI refers to delineating the streamlines/voxels of anatomical white matter tracts in the brain [13,14]. In the context of direct volumetric tract segmentation, which is the focus of this paper, promising segmentation performance was first achieved when Wasserthal et al. [4] proposed using a deep network to perform segmentation in the field of fODF peaks. Specifically, they proposed using dMRI-derived fODF peaks as a compact representation of raw dMRI data and as the input for direct volumetric tract segmentation. This solves the problem of the large input channel size caused by high angular resolution data and makes the segmentation method independent of dMRI acquisition settings. It also facilitates the application of deep learning methods for direct volumetric tract segmentation. This work stands as a significant milestone and has become the benchmark in the field.
Since then, several studies on enhancing direct volumetric tract segmentation have been proposed based on this idea, as reviewed and discussed in [5,8]. Among them, several methods have been proposed to improve the network architecture to enhance segmentation performance [10,15]. For example, Dong et al. [15] enhanced TractSeg by introducing dual inputs, resulting in improved segmentation. Other methods achieve direct volumetric tract segmentation with few training samples or on low-quality/low-resolution images [14,16,17]. A systematic investigation of a large variety of methods for direct volumetric tract segmentation is provided in a recent study [18]. However, these methods are based on 2D slices, which ignore 3D contextual information, require extra post-processing, and encounter difficulties in handling the edges of 3D volumes.
This motivates us to develop an efficient 3D direct volumetric tract segmentation method. In the literature, Wasserthal et al. [10] attempted to utilize a 3D Unet for direct volumetric tract segmentation but failed. Li et al. [11] achieved 3D direct volumetric tract segmentation in native diffusion space; however, their method was only effective for six gradient directions and single tract segmentation. Nelkenbaum et al. [12] implemented 3D direct volume tract segmentation in MRI images, which have significantly different characteristics from dMRI data. These highlight the primary challenges in developing 3D direct volumetric tract segmentation: (1) the insufficiency of relying solely on 3D convolutions to learn the necessary features for effective segmentation from high angular resolution data, (2) the substantial computational demands of 4D computations impacting the development of sophisticated networks, and (3) the scarcity of 3D annotation data affecting the network training. In this paper, we carefully design a 3D TractFormer to address these challenges.

2.2. Vision Transformer

The transformer [19] has a strong capability to learn long-range dependencies due to its core self-attention component. It has been used in various vision tasks such as recognition [20], segmentation [21], and restoration [22], as discussed in [23,24]. Most current vision transformers apply self-attention in the spatial dimension and model the dependencies between pixels. Since the computational complexity of self-attention grows quadratically with spatial resolution, these methods usually apply self-attention within small windows or patches. This is not conducive to capturing true global long-range dependencies, especially on high-resolution images. Channel-wise transformers have been proposed to address this limitation to a certain extent [25,26]. However, when they are directly used to compute attention for 3D images, the computational complexity is still very high, as it increases cubically with the volume dimensions. Therefore, in this paper, we propose a new memory- and computationally efficient channel-wise transformer for 3D attention computation. It can model global long-range dependencies of contextual features while maintaining memory and computational efficiency.
More recently, the medical image segmentation community has actively explored more efficient global context modeling for volumetric data. For example, MA-SAM adapts the Segment Anything Model to 3D medical image segmentation via modality-agnostic volumetric adapters [27]. Slim UNETR further investigates efficient hybrid transformer designs for 3D segmentation under limited computational resources [28]. In addition, state space model-based architectures have emerged as an alternative direction for long-range modeling with improved efficiency, including SegMamba for 3D volumetric segmentation [29] and VM-UNet for medical image segmentation [30]. These recent studies reinforce the importance of designing memory-efficient mechanisms to capture global dependencies in 3D segmentation tasks.

2.3. Hybrid Convolution and Transformer

Transformers excel at capturing distant dependencies, while convolutional networks excel at extracting spatial context [23,24]. As analyzed above, exploiting their complementarity in feature learning is beneficial for 3D direct volumetric tract segmentation. Most existing hybrid convolution and transformer methods involve shallow, block-level blending. For example, in [21,31], transformer blocks are employed in the encoder path for feature encoding, while convolution blocks are adopted in the decoder path for feature decoding. In [32], transformer blocks are only integrated into skip-connection layers. In [33], convolutions and transformer blocks are interleaved at the block level. In [22,25], convolutions are blended into the internal computations of transformers, but convolution blocks are not interleaved with transformer blocks. Therefore, inspired by these works, in this paper, we propose a deep hybrid mode, where convolutions and transformers are interleaved both at the block level and within the attention computation.
Recent work has also continued to advance hybrid segmentation architectures that balance local inductive bias, global dependency modeling, and efficiency. For 3D segmentation, LW-CTrans proposes a lightweight CNN and transformer hybrid network [34], and Switch-UMamba integrates convolution with state space modeling for medical image segmentation [35]. Beyond volumetric segmentation, recent U-Net enhancements for related dense prediction tasks also emphasize dynamic multi-scale fusion and attention-based feature interaction, such as APU-Net for polyp segmentation [36]. These developments provide useful context for positioning our design choices in a broader segmentation landscape.

3. Methods

We propose 3D TractFormer, a hybrid channel-wise transformer for efficient 3D direct volumetric white matter tract segmentation. The overall pipeline is first presented in Section 3.1. Then, the two core components, multi-head convolution-based channel-wise attention (MCCA) and context-enhanced feed-forward network (CFFN) of the proposed channel-wise transformer, are detailed in Section 3.2 and Section 3.3, respectively. Finally, we provide the network configuration and training strategy in Section 3.4.

3.1. Overall Pipeline

Figure 1 illustrates the architecture of 3D TractFormer. Its input is a 3D fiber orientation map with H × W × D × 9 voxels. As mentioned above, instead of using raw dMRI image values as input, we use fiber orientation peaks as the input for direct volumetric tract segmentation. This reduces the number of input channels from 270 to 9, significantly reducing the memory and computational cost. Fiber orientation peaks are computed from dMRI image values using the method in [4]. These 9 channels encode the three principal fiber directions (fODF peaks) in each voxel, with each direction represented by three channels. The output is the complete segmentation for the 72 white matter tracts, which is concatenated channel-wise into a segmentation mask of H × W × D × 72 voxels. This enables multi-label segmentation where multiple tracts coexist in a voxel.
3D TractFormer has a U-shaped encoder–decoder framework, where convolutions (including conv, conv down, and conv up) and transformer blocks are interleaved hierarchically at multiple scales. This design facilitates the extraction of complementary spatial contextual features and global dependencies, which is beneficial for direct volumetric tract segmentation as analyzed above. The first convolution encodes spatial contextual 3D features, while the last convolution generates the segmentation from the decoded features. The down convolutions encode spatial contextual features while halving the spatial resolution and doubling the channels, and the up convolutions do the reverse. Skip-connections [37] are adopted to facilitate feature decoding. The network is symmetric, with a 4-layer encoder–decoder structure. Each layer contains multiple transformer blocks, whose number increases gradually with depth to maintain efficiency. The transformer blocks capture global contextual long-range dependencies through their two core components, MCCA and CFFN, as detailed below.
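To make the overall data flow concrete, the following shape trace illustrates one possible pass through the encoder for a 64 × 64 × 64 training patch, using the channel counts given in Section 3.4. It assumes the first convolution preserves the input resolution, as suggested by Figure 1, and is an illustration rather than the exact implementation.

```python
# Shape flow through the 4-level encoder for a 64x64x64 training patch
# (channels-first PyTorch convention; B = batch size).
# peak input:  (B,   9, 64, 64, 64)
# first conv:  (B,  36, 64, 64, 64)  -> level-1 transformer block(s)
# conv down:   (B,  72, 32, 32, 32)  -> level-2 transformer blocks
# conv down:   (B, 144, 16, 16, 16)  -> level-3 transformer blocks
# conv down:   (B, 288,  8,  8,  8)  -> level-4 transformer blocks (bottleneck)
# The decoder mirrors this path with conv up + skip-connections back to 64^3;
# the last conv + sigmoid produces (B, 72, 64, 64, 64) multi-label tract masks.
```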

3.2. Multi-Head Convolution-Based Channel-Wise Attention

The main computational cost in a transformer comes from its self-attention computation. In a conventional transformer, the computational complexity of the query-key dot product is $O(H^{2}W^{2}D^{2})$ for an input of size $H \times W \times D$. With a vanilla channel-wise transformer, the computational complexity can be reduced to $O(HWD)$. However, this is still computationally expensive, as typical values of $H$, $W$, and $D$ yield a very large product $HWD$, and this complexity increases cubically with the volume dimension. To alleviate this problem, we propose MCCA (Figure 1a), which has a computational complexity of $O(HWD/r^{3})$.
Another core design of MCCA is to use depth-wise separable convolution instead of the learnable weight matrices in conventional transformers to generate the query Q , key K , and value V . Given the small number of parameters in depth-wise separable convolution, this does not introduce much computational complexity, but offers two distinct advantages. First, it compensates for the limitations of transformers in modeling spatial contextual information by introducing a convolution operation. Second, it enables us to model global dependencies of spatial contextual features, which offers unique advantages for direct volumetric tract segmentation.
Specifically, for an input feature $F \in \mathbb{R}^{H \times W \times D \times C}$, our MCCA first generates the query $Q \in \mathbb{R}^{\frac{H}{r} \times \frac{W}{r} \times \frac{D}{r} \times C}$ and key $K \in \mathbb{R}^{\frac{H}{r} \times \frac{W}{r} \times \frac{D}{r} \times C}$ by $Q = f_{m}(f_{ds}^{Q}(F))$ and $K = f_{m}(f_{ds}^{K}(F))$, respectively, where $f_{ds}(\cdot)$ denotes depth-wise separable convolution and $f_{m}$ denotes max-pooling. We use a pooling factor of $r$ to reduce the dimension of the extracted spatial contextual feature maps, which reduces the complexity of the attention computation by a factor of $r^{3}$. Then, $Q$ and $K$ are reshaped so that the spatial contextual features in each channel serve as tokens for computing channel-wise attention; the channel-wise attention map is therefore of size $C \times C$. We generate the value $V \in \mathbb{R}^{H \times W \times D \times C}$ by $V = f_{ds}^{V}(F)$ and also reshape it along the channel dimension. Overall, our MCCA can be formulated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\alpha}\right) V, \qquad F' = f_{c1}\big(\mathrm{Attention}(Q, K, V)\big) + F,$$
where $Q \in \mathbb{R}^{C \times \frac{HWD}{r^{3}}}$, $K \in \mathbb{R}^{C \times \frac{HWD}{r^{3}}}$, and $V \in \mathbb{R}^{C \times HWD}$ are the reshaped tensors, $\alpha$ is a learnable parameter that controls the magnitude of the dot product, $f_{c1}$ is a $1 \times 1 \times 1$ convolution used to refine features, and $F'$ is the output feature generated by MCCA. Similar to conventional multi-head self-attention [22], we partition the channels into multiple heads and learn separate attention maps in parallel.
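To make the formulation concrete, below is a minimal PyTorch sketch of MCCA following the description above. The class and argument names (DepthwiseSeparableConv3d, MCCA, pool_factor) are ours, and details such as treating the learnable scale α as per-head are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3d(nn.Module):
    """3x3x3 depth-wise convolution followed by a 1x1x1 point-wise convolution."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv3d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv3d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class MCCA(nn.Module):
    """Multi-head convolution-based channel-wise attention (sketch).
    Q/K are max-pooled by a factor r, so the attention map is C x C and the
    query-key cost scales with HWD / r^3 rather than (HWD)^2."""
    def __init__(self, channels, num_heads=1, pool_factor=2):
        super().__init__()
        assert channels % num_heads == 0
        self.heads = num_heads
        self.q_conv = DepthwiseSeparableConv3d(channels)   # f_ds^Q
        self.k_conv = DepthwiseSeparableConv3d(channels)   # f_ds^K
        self.v_conv = DepthwiseSeparableConv3d(channels)   # f_ds^V
        self.pool = nn.MaxPool3d(pool_factor)               # f_m with factor r
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))  # learnable scale
        self.proj = nn.Conv3d(channels, channels, 1)         # f_c1

    def forward(self, x):                          # x: (B, C, H, W, D)
        b, c, h, w, d = x.shape
        q = self.pool(self.q_conv(x)).flatten(2)   # (B, C, HWD / r^3)
        k = self.pool(self.k_conv(x)).flatten(2)   # (B, C, HWD / r^3)
        v = self.v_conv(x).flatten(2)              # (B, C, HWD)
        # split the channel dimension into heads: (B, heads, C/heads, tokens)
        q = q.view(b, self.heads, c // self.heads, -1)
        k = k.view(b, self.heads, c // self.heads, -1)
        v = v.view(b, self.heads, c // self.heads, -1)
        # channel-wise attention map of size (C/heads) x (C/heads) per head
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.alpha, dim=-1)
        out = (attn @ v).reshape(b, c, h, w, d)
        return self.proj(out) + x                  # residual connection -> F'
```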

3.3. Context-Enhanced Feed-Forward Network

The feed-forward network in the conventional transformer uses only two 1 × 1 convolutions and a non-linear operation between them to transform features. Thus, its capability to exploit spatial contextual information is limited. However, spatial contextual information is important for white matter tract segmentation, as evidenced by the much-studied convolution-based methods. To address this limitation, we propose CFFN (Figure 1b), which integrates depth-wise separable convolutions into the conventional feed-forward network to emphasize spatial context. Also, it incorporates a gate mechanism as in [25] to further control the features passed forward to the next layer. This enables different layers in the hierarchical 3D TractFormer to focus on complementary features, ultimately enhancing feature learning.
Overall, our CFFN can be formulated as:
$$F'' = f_{c2}\big(f_{ds}^{a}(F') \odot \phi(f_{ds}^{b}(F'))\big) + F',$$
where $F'$ is the output from MCCA, $f_{ds}^{a}$ and $f_{ds}^{b}$ are two depth-wise separable convolutions, $\phi$ is the GELU activation function [38], $\odot$ denotes element-wise multiplication, $f_{c2}$ is a $1 \times 1 \times 1$ convolution used to refine features, and $F''$ is the output feature generated by CFFN.
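Under the same assumptions, a minimal PyTorch sketch of CFFN is given below. It reuses the DepthwiseSeparableConv3d helper from the MCCA sketch; the module name and the absence of channel expansion are our simplifications.

```python
import torch.nn as nn

class CFFN(nn.Module):
    """Context-enhanced feed-forward network (sketch).
    Two depth-wise separable branches produce content and gate features; the
    GELU-activated gate controls which spatial-context features are passed
    forward, followed by a 1x1x1 refinement convolution and a residual."""
    def __init__(self, channels):
        super().__init__()
        self.branch_a = DepthwiseSeparableConv3d(channels)  # f_ds^a (content branch)
        self.branch_b = DepthwiseSeparableConv3d(channels)  # f_ds^b (gate branch)
        self.gelu = nn.GELU()
        self.proj = nn.Conv3d(channels, channels, 1)         # f_c2

    def forward(self, x):                   # x: F', the output of MCCA
        gated = self.branch_a(x) * self.gelu(self.branch_b(x))  # element-wise gating
        return self.proj(gated) + x         # residual connection -> F''
```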
MCCA leverages spatial contextual features to enhance the encoding of global dependencies, while CFFN further enriches these features using contextual information and controls features forward to the next layer, ensuring that different hierarchical layers focus on complementary information. Their synergy empowers 3D TractFormer to effectively extract features that facilitate white matter tract segmentation.

3.4. Network Configuration and Training Strategy

Our 3D TractFormer has four hierarchical levels; from the first to the fourth level, the numbers of transformer blocks are [1, 2, 4, 8], the numbers of attention heads in MCCA are [1, 2, 4, 8], and the numbers of channels are [36, 72, 144, 288]. The pooling factor in MCCA is r = 2. The convolutions have a kernel size of 3 × 3 × 3. The down and up convolutions have a kernel size of 4 × 4 × 4 with a stride of 2. The depth-wise separable convolutions have kernel sizes of 3 × 3 × 3 (depth-wise) and 1 × 1 × 1 (point-wise). The final layer incorporates the sigmoid activation function to transform output values into probabilities. A threshold of 0.5 is applied to distinguish background voxels (assigned 0) from foreground voxels (assigned 1). To train the network, the binary cross-entropy loss is computed between the generated segmentation masks and the ground-truth masks.
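For reference, this configuration could be assembled as sketched below, reusing the MCCA and CFFN modules from the previous snippets. The TransformerBlock name and the absence of normalization layers are our assumptions.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One channel-wise transformer block: MCCA followed by CFFN
    (each carries its own residual connection, as in the equations above)."""
    def __init__(self, channels, num_heads):
        super().__init__()
        self.attn = MCCA(channels, num_heads=num_heads, pool_factor=2)  # r = 2
        self.ffn = CFFN(channels)

    def forward(self, x):
        return self.ffn(self.attn(x))

# Hierarchical configuration from Section 3.4 (levels 1 to 4).
num_blocks = [1, 2, 4, 8]        # transformer blocks per level
num_heads  = [1, 2, 4, 8]        # MCCA attention heads per level
channels   = [36, 72, 144, 288]  # feature channels per level

encoder_stages = nn.ModuleList([
    nn.Sequential(*[TransformerBlock(c, h) for _ in range(n)])
    for n, h, c in zip(num_blocks, num_heads, channels)
])
```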
In 3D direct volumetric white matter tract segmentation, the computational complexity can increase quartically with the dimensions (i.e., H/W/D/C). To improve computational efficiency, we propose to train 3D TractFormer with small volumetric patches. This also naturally increases the amount of training data and reduces the required number of annotations. Our 3D TractFormer is fully symmetric, so it can be trained on small patches but tested on large images. That is, at test time, we directly perform whole-volume inference on the complete 3D sample and do not use any patch-merging procedure. To ensure its performance on images of different sizes, we also incorporate progressively larger patches during training. This is related to the ideas of progressive learning [39] and curriculum learning [40]. With this training strategy, we partially address the challenges of high memory and computational cost and of insufficient training annotations in 3D direct volumetric white matter tract segmentation.
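A sketch of the progressive patch schedule is shown below. The patch and batch sizes follow Section 4.2; the random_patch helper and the two-phase schedule are hypothetical illustrations of random 3D cropping with progressively larger patches, not the authors' exact data pipeline.

```python
import torch

def random_patch(volume, label, size):
    """Randomly crop a cubic patch of side `size` from (C, H, W, D) tensors."""
    _, h, w, d = volume.shape
    x = torch.randint(0, h - size + 1, (1,)).item()
    y = torch.randint(0, w - size + 1, (1,)).item()
    z = torch.randint(0, d - size + 1, (1,)).item()
    return (volume[:, x:x + size, y:y + size, z:z + size],
            label[:, x:x + size, y:y + size, z:z + size])

# Progressively sized patches: smaller patches (larger batches) early in
# training, larger patches (smaller batches) later.
patch_schedule = [
    {"patch_size": 64, "batch_size": 6},
    {"patch_size": 80, "batch_size": 2},
]
```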

4. Experimental Results

4.1. Datasets

We use the largest publicly available tract-specific tractograms dataset provided in [4] for white matter tract segmentation. It contains tractogram annotations for 72 anatomically well-defined white matter tracts across 105 subjects in the WU-Minn Human Connectome Project (HCP) [2]. These annotations are used as ground truth. The fiber orientation peaks input to the 3D TractFormer are computed from the corresponding dMRI scans in the HCP dataset. These dMRI scans have 270 diffusion gradients and a spatial resolution of 145 × 174 × 145 . We crop them to 144 × 144 × 144 for experiments, without loss of any brain tissue.

4.2. Experimental Settings

We use the AdamW optimizer [41] with β1 = 0.9, β2 = 0.999, and a weight decay of 1 × 10⁻⁴ to train the 3D TractFormer. The initial learning rate is set to 3 × 10⁻⁴ and is gradually reduced to 1 × 10⁻⁶ using the cosine annealing algorithm [42]. We train 3D TractFormer for 200 epochs, with each epoch comprising 1000 iterations. The volumetric patches used for training have sizes of 64 × 64 × 64 and 80 × 80 × 80, with batch sizes of 6 and 2, respectively. The experiments were carried out using PyTorch (version 2.4.0) and executed on a cluster of four Nvidia V100 GPUs (Nvidia, Santa Clara, CA, USA), each equipped with 32 GB of memory.
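This optimization setup corresponds to standard PyTorch components, e.g. as sketched below; the placeholder model stands in for 3D TractFormer, and stepping the scheduler per iteration is our reading of the schedule.

```python
import torch
import torch.nn as nn

model = nn.Conv3d(9, 72, kernel_size=3, padding=1)  # placeholder for 3D TractFormer

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=1e-4
)
# cosine annealing from 3e-4 down to 1e-6 over 200 epochs x 1000 iterations
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200 * 1000, eta_min=1e-6
)
criterion = nn.BCELoss()  # binary cross-entropy on the sigmoid outputs
```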
Five-fold cross-validation is used, where each fold contains 63 training subjects, 21 validation subjects, and 21 test subjects. The model that achieved the highest performance on the validation set is used for the final evaluation on the test set.

4.3. Evaluation Metrics

We employ the Dice similarity coefficient (DSC) and the relative volume difference (RVD) as quantitative metrics to evaluate the segmentation performance. The DSC is a widely recognized overlap-based segmentation metric, while the RVD is a prominent size-based segmentation metric. They complement each other, offering a comprehensive evaluation of our 3D TractFormer. Additionally, we employ a two-tailed Wilcoxon signed-rank test with α = 0.05 [43] to evaluate statistical significance. For each pair of methods, we perform a paired test at the tract level by comparing the DSCs (and RVDs) for each of the 72 tracts across all 105 subjects, yielding 72 × 105 = 7560 paired differences for one comparison. The null hypothesis is that the median of the paired differences is zero, and the alternative hypothesis is that it is nonzero. In Table 1, the reported p values compare each method with the method listed immediately above it within the same block, namely within the 2D slice-based block and within the 3D volumetric block. For Table 4, the reported p values are computed by comparing each ablation variant against the full 3D TractFormer.
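For clarity, the two metrics and the paired test can be computed as in the sketch below (NumPy/SciPy). The function names and the synthetic example scores are ours, and the RVD is written in its absolute form, consistent with lower values being better.

```python
import numpy as np
from scipy.stats import wilcoxon

def dice_score(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def relative_volume_difference(pred, gt):
    """Absolute relative volume difference: |V_pred - V_gt| / V_gt."""
    return abs(float(pred.sum()) - float(gt.sum())) / float(gt.sum())

# Paired two-tailed Wilcoxon signed-rank test over per-(tract, subject) scores.
# dice_a and dice_b below are synthetic stand-ins for the 72 x 105 paired scores.
rng = np.random.default_rng(0)
dice_a = rng.uniform(0.80, 0.90, size=72 * 105)
dice_b = np.clip(dice_a + rng.normal(0.003, 0.01, size=dice_a.shape), 0.0, 1.0)
stat, p_value = wilcoxon(dice_a, dice_b, alternative="two-sided")
```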

4.4. Performance Evaluation

4.4.1. Comparison with State-of-the-Art Methods

We compare our 3D TractFormer with the TractSeg method [4], which is the current benchmark in the field. We also compare it with a variety of state-of-the-art methods that have been applied to white matter tract segmentation for all 105 subjects in the HCP. These methods include UNet++ [45], DS-UNet [10], UNet3+ [46], and Attention UNet [44], as investigated in a recent study [18], as well as AC-UNet, a recent method specially designed for white matter tract segmentation [47]. Since most of these baselines operate on 2D slices, we additionally incorporate volumetric 3D baselines to provide a more comprehensive comparison. Concretely, we adapt Neuro4Neuro [11], a direct 3D tract segmentation approach in native diffusion space, to our setting and denote this variant as m-Neuro4Neuro. We also include two additional 3D methods, namely nnUNet [48] and Uformer [22], to benchmark against strong volumetric segmentation pipelines. Collectively, these baselines cover both slice-based and fully 3D formulations, enabling a thorough evaluation of our method.
Table 1 compares the quantitative evaluation results of these methods. It includes the mean and standard deviation (STD) of Dice scores and RVD values for each method, which are computed over the five folds between the generated segmentation and the ground truth. Note that higher Dice scores indicate better performance, while lower RVD values indicate higher performance. Additionally, Table 1 reports p values from paired Wilcoxon signed-rank tests, where each method is compared with the method listed immediately above it within the same block, namely within the 2D block and within the 3D block. As can be seen, our 3D TractFormer outperforms the other methods, showcasing the best segmentation performance and statistically significant improvements in both Dice scores and RVD values. This demonstrates its superiority in 3D direct volumetric tract segmentation. The performance achieved by m-Neuro4Neuro is obviously lower than other methods, which is explainable because Neuro4Neuro was originally designed for single tract segmentation in native diffusion space. This also underscores the importance of developing tailored methods for 3D white matter tract segmentation.
Figure 2 provides a qualitative evaluation of our segmentation results in comparison to those of TractSeg and Unet3+. Both the cross-sectional views and 3D renderings of the segmentation results are illustrated. For better observation, we also present the absolute differences between segmentation results and the manual delineations. It can be seen that our results exhibit greater similarity to the manual delineations compared to those of TractSeg and Unet3+, with smaller absolute differences. This further demonstrates the superiority of 3D TractFormer.

4.4.2. Effectiveness on Small-Scale and Challenging Tracts

Small-scale tracts are inherently more difficult to segment, and slight changes in segmentation volume can have a relatively large impact on their performance. For a more in-depth evaluation of 3D TractFormer, we compare its segmentation performance with those of TractSeg and Unet3+ (one of the top-performing methods in the literature, as shown in Table 1) across varied tract sizes. Figure 3 plots tract size versus segmentation performance in terms of Dice score and RVD value for the 72 white matter tracts. As can be seen, 3D TractFormer achieves better results on small-scale tracts. This supports our claim that direct 3D segmentation can leverage 3D spatial context to capture information about small-volume structures, thereby improving segmentation performance.
Among the 72 tracts, some are more challenging to segment than others. Liu et al. [14] have identified seven particularly challenging tracts that exhibit narrow or uncinate shapes. To evaluate the performance of our 3D TractFormer in tough scenarios, we compare the segmentation performance of our 3D TractFormer and Unet3+ on each of these challenging tracts. Table 2 presents the quantitative evaluation results. As can be seen, 3D TractFormer achieves consistently higher Dice scores for all these tracts, demonstrating its effectiveness in difficult scenarios.

4.4.3. Model Complexity and Computational Cost

We further quantify the efficiency of our method by reporting model complexity in Table 3, including the number of trainable parameters, TMACs, and TFLOPs. For a fair comparison, TMACs and TFLOPs are computed for a single inference pass using the same input resolution of 9 × 144 × 144 × 144 . As shown in Table 3, 3D TractFormer achieves a favorable tradeoff between accuracy and efficiency, requiring only 8.84 M parameters and 0.431 TMACs, which is substantially lower than the transformer-based 3D baseline Uformer. Compared with nnUNet, 3D TractFormer has comparable parameter scale and lower computational cost, while delivering better segmentation performance as shown in Table 1.

4.5. Ablation Study

We conduct ablation studies to quantify the contributions of the two key components in the proposed channel-wise transformer, namely MCCA and CFFN. All ablation models share the same overall 3D TractFormer backbone, training protocol, and evaluation settings as the full model, differing only in the corresponding module replacement. 3D TractFormer-xMCCA removes multi-head convolution-based channel-wise attention, weakening global dependency modeling from spatial context. 3D TractFormer-xCFFN removes context-enhanced transformation and gating, reducing spatial context enrichment and selective feature propagation across stages.
Table 4 reports Dice and RVD. The full 3D TractFormer achieves the best performance with Dice 0.857 ( ± 0.003 ) and RVD 0.096 ( ± 0.005 ) . Replacing CFFN yields a consistent degradation, with Dice decreasing to 0.853 ( ± 0.003 ) and RVD increasing to 0.097 ( ± 0.005 ) . Replacing MCCA leads to a larger performance drop, with Dice further decreasing to 0.852 ( ± 0.003 ) and RVD increasing to 0.097 ( ± 0.005 ) . The p values, computed against 3D TractFormer, indicate that both degradations are statistically significant. Overall, these results verify that MCCA and CFFN are both necessary. MCCA plays a dominant role in improving feature learning, while CFFN provides additional gains by enhancing spatial context modeling and controlling feature propagation, and their combination yields the strongest volumetric tract segmentation performance.

5. Conclusions

This paper proposed 3D TractFormer, which addresses three challenges in 3D direct volumetric tract segmentation and achieves state-of-the-art segmentation performance. Its success relies on three main innovations: (1) the deep interleaving of convolution and transformer blocks in the U-shaped network effectively enhances complementary feature extraction, (2) the memory- and computationally efficient channel-wise transformer facilitates the modeling of global dependencies of contextual features, and (3) the training strategy with progressively sized volumetric patches alleviates the scarcity of 3D training samples and reduces memory and computational costs.

Author Contributions

Conceptualization, X.Y., A.W.-C.L., and H.T.; methodology, X.Y., X.G., H.T., and A.W.-C.L.; software, X.G.; validation, X.G., X.Y., H.T., and A.W.-C.L.; formal analysis, X.G., X.Y., H.T., and A.W.-C.L.; investigation, X.G.; resources, A.W.-C.L. and H.T.; data curation, X.G.; writing—original draft preparation, X.G., X.Y.; writing—review and editing, X.Y., H.T., and A.W.-C.L.; visualization, X.G. and X.Y.; supervision, H.T., A.W.-C.L., and X.Y.; project administration, H.T.; funding acquisition, None. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was waived.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lawes, I.N.C.; Barrick, T.R.; Murugam, V.; Spierings, N.; Evans, D.R.; Song, M.; Clark, C.A. Atlas-based segmentation of white matter tracts of the human brain using diffusion tensor tractography and comparison with classical dissection. Neuroimage 2008, 39, 62–79. [Google Scholar] [CrossRef] [PubMed]
  2. Van Essen, D.C.; Smith, S.M.; Barch, D.M.; Behrens, T.E.; Yacoub, E.; Ugurbil, K.; WU-Minn HCP Consortium. The WU-Minn human connectome project: An overview. Neuroimage 2013, 80, 62–79. [Google Scholar] [CrossRef] [PubMed]
  3. Clayden, J.D.; Storkey, A.J.; Bastin, M.E. A probabilistic model-based approach to consistent white matter tract segmentation. IEEE Trans. Med. Imaging 2007, 26, 1555–1561. [Google Scholar] [CrossRef] [PubMed]
  4. Wasserthal, J.; Neher, P.; Maier-Hein, K.H. TractSeg-Fast and accurate white matter tract segmentation. NeuroImage 2018, 183, 239–253. [Google Scholar] [CrossRef]
  5. Zhang, F.; Daducci, A.; He, Y.; Schiavi, S.; Seguin, C.; Smith, R.E.; Yeh, C.H.; Zhao, T.; O’Donnell, L.J. Quantitative mapping of the brain’s structural connectivity using diffusion MRI tractography: A review. Neuroimage 2022, 249, 118870. [Google Scholar] [CrossRef]
  6. Jonasson, L.; Bresson, X.; Hagmann, P.; Cuisenaire, O.; Meuli, R.; Thiran, J.P. White matter fiber tract segmentation in DT-MRI using geometric flows. Med. Image Anal. 2005, 9, 223–236. [Google Scholar] [CrossRef]
  7. Bazin, P.L.; Ye, C.; Bogovic, J.A.; Shiee, N.; Reich, D.S.; Prince, J.L.; Pham, D.L. Direct segmentation of the major white matter tracts in diffusion tensor images. Neuroimage 2011, 58, 458–468. [Google Scholar] [CrossRef]
  8. Ghazi, N.; Aarabi, M.H.; Soltanian-Zadeh, H. Deep Learning Methods for Identification of White Matter Fiber Tracts: Review of State-of-the-Art and Future Prospective. Neuroinformatics 2023, 21, 517–548. [Google Scholar] [CrossRef]
  9. Anderson, A.W. Measurement of fiber orientation distributions using high angular resolution diffusion imaging. Magn. Reson. Med. Off. J. Int. Soc. Magn. Reson. Med. 2005, 54, 1194–1206. [Google Scholar] [CrossRef]
  10. Wasserthal, J.; Neher, P.F.; Hirjak, D.; Maier-Hein, K.H. Combined tract segmentation and orientation mapping for bundle-specific tractography. Med. Image Anal. 2019, 58, 101559. [Google Scholar] [CrossRef]
  11. Li, B.; De Groot, M.; Steketee, R.M.; Meijboom, R.; Smits, M.; Vernooij, M.W.; Ikram, M.A.; Liu, J.; Niessen, W.J.; Bron, E.E. Neuro4Neuro: A neural network approach for neural tract segmentation using large-scale population-based diffusion imaging. Neuroimage 2020, 218, 116993. [Google Scholar] [CrossRef]
  12. Nelkenbaum, I.; Tsarfaty, G.; Kiryati, N.; Konen, E.; Mayer, A. Automatic segmentation of white matter tracts using multiple brain MRI sequences. In Proceedings of the IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 368–371. [Google Scholar]
  13. Zhang, F.; Karayumak, S.C.; Hoffmann, N.; Rathi, Y.; Golby, A.J.; O’Donnell, L.J. Deep white matter analysis (DeepWMA): Fast and consistent tractography segmentation. Med. Image Anal. 2020, 65, 101761. [Google Scholar] [CrossRef] [PubMed]
  14. Liu, W.; Lu, Q.; Zhuo, Z.; Li, Y.; Duan, Y.; Yu, P.; Qu, L.; Ye, C.; Liu, Y. Volumetric segmentation of white matter tracts with label embedding. Neuroimage 2022, 250, 118934. [Google Scholar] [CrossRef] [PubMed]
  15. Dong, X.; Yang, Z.; Peng, J.; Wu, X. Multimodality white matter tract segmentation using CNN. In Proceedings of the ACM Turing Celebration Conference-China, Chengdu, China, 17–19 May 2019; pp. 1–8. [Google Scholar]
  16. Lu, Q.; Li, Y.; Ye, C. Volumetric white matter tract segmentation with nested self-supervised learning using sequential pretext tasks. Med. Image Anal. 2021, 72, 102094. [Google Scholar] [CrossRef] [PubMed]
  17. Lu, Q.; Liu, W.; Zhuo, Z.; Li, Y.; Duan, Y.; Yu, P.; Qu, L.; Ye, C.; Liu, Y. A transfer learning approach to few-shot segmentation of novel white matter tracts. Med. Image Anal. 2022, 79, 102454. [Google Scholar] [CrossRef]
  18. Tchetchenian, A.; Zhu, Y.; Zhang, F.; O’Donnell, L.J.; Song, Y.; Meijering, E. A comparison of manual and automated neural architecture search for white matter tract segmentation. Sci. Rep. 2023, 13, 1617. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  21. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the 35th International Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 12077–12090. [Google Scholar]
  22. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  23. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 200. [Google Scholar] [CrossRef]
  24. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef]
  25. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  26. Zhang, J.; Zhang, Y.; Gu, J.; Dong, J.; Kong, L.; Yang, X. Xformer: Hybrid X-Shaped Transformer for Image Denoising. arXiv 2023, arXiv:2303.06440. [Google Scholar]
  27. Chen, C.; Miao, J.; Wu, D.; Zhong, A.; Yan, Z.; Kim, S.; Hu, J.; Liu, Z.; Sun, L.; Li, X.; et al. Ma-sam: Modality-agnostic sam adaptation for 3d medical image segmentation. Med. Image Anal. 2024, 98, 103310. [Google Scholar] [CrossRef]
  28. Pang, Y.; Liang, J.; Huang, T.; Chen, H.; Li, Y.; Li, D.; Huang, L.; Wang, Q. Slim UNETR: Scale hybrid transformers to efficient 3D medical image segmentation under limited computational resources. IEEE Trans. Med. Imaging 2023, 43, 994–1005. [Google Scholar] [CrossRef] [PubMed]
  29. Xing, Z.; Ye, T.; Yang, Y.; Cai, D.; Gai, B.; Wu, X.J.; Gao, F.; Zhu, L. Segmamba-v2: Long-range sequential modeling mamba for general 3d medical image segmentation. IEEE Trans. Med. Imaging 2025, 45, 4–15. [Google Scholar] [CrossRef] [PubMed]
  30. Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. ACM Trans. Multimed. Comput. Commun. Appl. 2024. [Google Scholar] [CrossRef]
  31. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  32. Chen, L.; Wan, L. CTUNet: Automatic pancreas segmentation using a channel-wise transformer and 3D U-Net. Vis. Comput. 2023, 39, 5229–5243. [Google Scholar] [CrossRef]
  33. Zhou, H.Y.; Guo, J.; Zhang, Y.; Han, X.; Yu, L.; Wang, L.; Yu, Y. nnFormer: Volumetric medical image segmentation via a 3D transformer. IEEE Trans. Image Process. 2023, 32, 4036–4045. [Google Scholar] [CrossRef]
  34. Kuang, H.; Wang, Y.; Tan, X.; Yang, J.; Sun, J.; Liu, J.; Qiu, W.; Zhang, J.; Zhang, J.; Yang, C.; et al. LW-CTrans: A lightweight hybrid network of CNN and Transformer for 3D medical image segmentation. Med. Image Anal. 2025, 102, 103545. [Google Scholar] [CrossRef]
  35. Zhang, Z.; Ma, Q.; Zhang, T.; Chen, J.; Zheng, H.; Gao, W. Switch-UMamba: Dynamic scanning vision Mamba UNet for medical image segmentation. Med. Image Anal. 2025, 107, 103792. [Google Scholar] [CrossRef]
  36. Li, F.; Xu, L.; Ma, Z.; Zhao, Y.; Li, X. APU-Net: A U-Net Enhanced Network with Dynamic Feature Fusion and Pyramid Cross-Attention Mechanism for Polyp Segmentation. Digit. Signal Process. 2026, 172, 105879. [Google Scholar] [CrossRef]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  38. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  39. Acharya, A.; Ortner, J. Progressive learning. Econometrica 2017, 85, 1965–1990. [Google Scholar] [CrossRef]
  40. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 41–48. [Google Scholar]
  41. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  42. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  43. Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution; Springer: New York, NY, USA, 1992; pp. 196–202. [Google Scholar]
  44. Oktay, O.; Schlemper, J.; Le Folgoc, L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. In Proceedings of the Medical Imaging with Deep Learning (MIDL), Amsterdam, The Netherlands, 4–6 July 2018. [Google Scholar]
  45. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested U-net architecture for medical image segmentation. In Proceedings of the 4th International Workshop on Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; pp. 3–11. [Google Scholar]
  46. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  47. Zhu, Y.; Tchetchenian, A.; Yin, X.; Liew, A.; Song, Y.; Meijering, E. AC-UNet: Adaptive Connection UNet for White Matter Tract Segmentation Through Neural Architecture Search. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar]
  48. Lucena, O.; Borges, P.; Cardoso, J.; Ashkan, K.; Sparks, R.; Ourselin, S. Informative and reliable tract segmentation for preoperative planning. Front. Radiol. 2022, 2, 866974. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed 3D TractFormer. It employs a U-shaped encoder–decoder framework in which convolutions (blue rectangles) and transformer blocks (purple rectangles) are interleaved hierarchically at multiple scales. The proposed channel-wise transformer has two key components: (a) multi-head convolution-based channel-wise attention (MCCA) that models global spatial contextual feature dependence across channels, and (b) a context-enhanced feed-forward network (CFFN) that facilitates the propagation of important contextual information.
Figure 2. Visual comparison of segmentation results of our 3D TractFormer, TractSeg, and Unet3+ for small-scale (FX_right), medium-scale (CST_left), and large-scale (AF_right) white matter tracts in terms of 3D renderings and 2D cross-sectional views. Green indicates the ground truth, orange indicates the predicted segmentation, and yellow indicates the absolute difference from the ground truth.
Figure 3. Graph depicting the relationship between tract size and segmentation performance across 72 white matter tracts. (a) Mean Dice score and (b) mean RVD value. The dots indicate white matter tracts.
Table 1. Comparison of segmentation performance of our 3D TractFormer and the state-of-the-art methods in terms of mean and STD Dice scores and RVD values. p values ( α = 0.05 ) are obtained using a two-tailed Wilcoxon signed-rank test and are computed against the method listed immediately above within the same block, that is, within the 2D block and within the 3D block. N/A indicates that no reference method is defined for that row. Bold values indicate the best performance under the corresponding metric.
| Methods | 2D/3D | Dice Mean (STD) (↑) | Dice p-Value | RVD Mean (STD) (↓) | RVD p-Value |
|---|---|---|---|---|---|
| TractSeg [4] | 2D | 0.849 (±0.003) | N/A | 0.097 (±0.006) | N/A |
| Attention Unet [44] | 2D | 0.850 (±0.003) | <0.000 | 0.097 (±0.005) | 0.599 |
| DS-Unet [10] | 2D | 0.851 (±0.002) | 0.008 | 0.097 (±0.005) | 0.564 |
| Unet++ [45] | 2D | 0.852 (±0.003) | <0.000 | 0.096 (±0.005) | <0.000 |
| Unet3+ [46] | 2D | 0.853 (±0.003) | <0.000 | 0.097 (±0.005) | 0.001 |
| AC-UNet [47] | 2D | 0.856 (±0.003) | <0.000 | 0.096 (±0.005) | 0.001 |
| nnUNet [48] | 3D | 0.842 (±0.040) | N/A | 0.109 (±0.006) | N/A |
| m-Neuro4Neuro [11] | 3D | 0.844 (±0.004) | <0.000 | 0.101 (±0.006) | <0.000 |
| Uformer [22] | 3D | 0.852 (±0.003) | <0.000 | 0.097 (±0.005) | <0.000 |
| 3D TractFormer | 3D | 0.857 (±0.003) | <0.000 | 0.096 (±0.005) | <0.000 |
Table 2. Comparison of segmentation performance in terms of mean and STD Dice scores and RVD values of 3D TractFormer versus Unet3+ and AC-Unet on the seven most challenging tracts. Bold values indicate the best performance for each tract under the corresponding metric.
| Tracts | Unet3+ Dice (↑) | AC-UNet Dice (↑) | 3D TractFormer Dice (↑) | Unet3+ RVD (↓) | AC-UNet RVD (↓) | 3D TractFormer RVD (↓) |
|---|---|---|---|---|---|---|
| CA | 0.704 | 0.706 | 0.726 | 0.482 | 0.482 | 0.402 |
| FX_left | 0.757 | 0.757 | 0.763 | 0.210 | 0.210 | 0.189 |
| FX_right | 0.715 | 0.714 | 0.730 | 0.271 | 0.271 | 0.256 |
| ILF_left | 0.822 | 0.823 | 0.829 | 0.105 | 0.104 | 0.106 |
| ILF_right | 0.806 | 0.808 | 0.819 | 0.115 | 0.115 | 0.121 |
| UF_left | 0.787 | 0.789 | 0.801 | 0.195 | 0.194 | 0.163 |
| UF_right | 0.819 | 0.820 | 0.829 | 0.133 | 0.133 | 0.098 |
Table 3. Comparison of model complexity measured by parameters (M), TMACs, and TFLOPs.
| Methods | Parameters (M) | TMACs | TFLOPs |
|---|---|---|---|
| TractSeg [4] (2D) | 9.24 | 0.696 | 1.395 |
| Attention Unet [44] (2D) | 34.89 | 3.05 | 6.11 |
| DS-Unet [10] (2D) | 55.505 | 4.296 | 8.605 |
| Unet++ [45] (2D) | 9.05 | 1.369 | 2.742 |
| Unet3+ [46] (2D) | 27.179 | 8.848 | 17.695 |
| AC-UNet [47] (2D) | 22.03 | 17.26 | 34.86 |
| nnUNet [48] (3D) | 4.09 | 0.69 | 1.39 |
| m-Neuro4Neuro [11] (3D) | 51.79 | 2.55 | 5.1 |
| Uformer [22] (3D) | 47.45 | 0.672 | 1.361 |
| 3D TractFormer (3D) | 8.84 | 0.431 | 0.867 |
Table 4. Ablation study of 3D TractFormer. 3D TractFormer-xMCCA replaces MCCA with a simple MLP, and 3D TractFormer-xCFFN replaces CFFN with a simple MLP. Results are reported as mean (STD) for Dice and RVD. p values ( α = 0.05 ) are obtained using a two-tailed Wilcoxon signed-rank test and are computed by comparing each ablation variant against the full 3D TractFormer. Bold values indicate the best performance under the corresponding metric. 3D TractFormer serves as the reference (full) model for statistical comparison.
| Methods | Dice Mean (STD) | Dice p-Value | RVD Mean (STD) | RVD p-Value |
|---|---|---|---|---|
| 3D TractFormer | 0.857 (±0.003) | N/A | 0.096 (±0.005) | N/A |
| 3D TractFormer-xCFFN | 0.853 (±0.003) | <0.000 | 0.097 (±0.005) | <0.000 |
| 3D TractFormer-xMCCA | 0.852 (±0.003) | <0.000 | 0.097 (±0.005) | <0.000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
