nnSegNeXt: A 3D Convolutional Network for Brain Tissue Segmentation Based on Quality Evaluation

Accurate and automated segmentation of brain tissue images can significantly streamline clinical diagnosis and analysis. Manual delineation is laborious and repetitive, while automated techniques encounter challenges stemming from disparities in magnetic resonance imaging (MRI) acquisition equipment and the difficulty of obtaining accurate labels. Existing software packages, such as FSL and FreeSurfer, do not fully substitute for ground truth segmentation, highlighting the need for an efficient segmentation tool. To better capture the essence of cerebral tissue, we introduce nnSegNeXt, an innovative segmentation architecture built upon the foundations of quality assessment. This framework effectively addresses the challenges posed by missing and inaccurate annotations. To enhance the model's discriminative capacity, we replace conventional convolutional blocks with a 3D convolutional attention mechanism that encodes contextual information through multiscale convolutional features. Our methodology was evaluated on four multi-site T1-weighted MRI datasets spanning diverse sources, magnetic field strengths, scanning parameters, acquisition times, and neuropsychiatric conditions. Empirical evaluations on the HCP, SALD, and IXI datasets reveal that nnSegNeXt surpasses the esteemed nnUNet, achieving Dice coefficients of 0.992, 0.987, and 0.989, respectively, and demonstrates superior generalizability across four cross-dataset transfer settings, with Dice coefficients ranging from 0.967 to 0.983. Additionally, extensive ablation studies corroborate the effectiveness of the proposed model. These findings represent a notable advancement in brain tissue analysis and suggest that nnSegNeXt holds the promise to significantly refine clinical workflows.


Introduction
The segmentation of brain tissue in magnetic resonance imaging (MRI) scans into constituent elements such as white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is instrumental in facilitating the diagnostic process for neurological pathologies like epilepsy, Alzheimer's disease, and multiple sclerosis. Diseases with psychiatric and neurodegenerative origins often involve changes in cerebral tissue morphology, such as alterations in the volume or configuration of deep gray matter structures, cortical thickness, surface area, and cortical folding patterns [1]. Therefore, the morphometric analysis of cerebral tissue serves as a critical biomarker for disease diagnosis and acts as an effective diagnostic tool [2,3]. In addition, brain tissue segmentation in MRI scans is valuable for preoperative evaluation, surgical planning [4], and the development of radiation therapy plans [5].
Manual segmentation, although accurate, is laborious, repetitive, and subjective, making it impractical even for experts when dealing with large-scale datasets. In the past, numerous conventional techniques have been proposed for cerebral tissue segmentation, including intensity thresholding [6,7], deformable models [8][9][10], clustering [11,12], and other machine learning algorithms. However, these techniques have faced significant challenges due to the complex structure of the brain, variations in tissue morphology and texture, and inherent features of MRI scans, which have limited their performance [13].
In recent years, deep-learning-based methods, particularly those based on fully convolutional networks (FCNs) [14], have emerged as a robust alternative to traditional machine learning algorithms for cerebral tissue segmentation tasks. Among these methods, the U-Net architecture [15] has gained considerable attention in medical image segmentation. However, as medical data often exist in 3D volumetric form, 3D convolution kernels are necessary. To address this, Çiçek et al. [16] extended the U-Net architecture to handle 3D data, resulting in the 3D U-Net for brain tissue segmentation. V-Net [17] utilizes residual connections to accelerate network convergence and provide excellent feature representation. SegNet [18] incorporates non-linear upsampling during decoding to reduce parameters and computational complexity. SegResNet [19] employs a residual encoder-decoder architecture with an auxiliary branch for input data reconstruction. nnU-Net [20,21] demonstrates that minor modifications to the U-Net architecture can yield competitive performance in medical image segmentation.
Attention-based Transformer architectures, along with convolutional networks, have demonstrated promising results in medical image segmentation. Attention U-Net [22] employs attention blocks to refine features before merging them with decoder outputs, while TransUNet [23] integrates a Vision Transformer at critical points to enhance performance. Cao et al. [24] proposed a pure Transformer-based network, Swin-UNet, and applied it to medical image segmentation tasks; it utilizes a hierarchical Swin Transformer [25] as the encoder to extract contextual features. TABS [26] introduces a novel CNN-Transformer hybrid architecture to improve brain tissue segmentation: by designing a multiscale feature representation and a two-layer fusion module, it achieves a fine fusion of global and local features. UNETR [27] eliminates the need for a CNN-based feature extractor by employing a ViT [28] encoder. nnFormer [29] combines convolutional layers and transformer layers in an interleaved encoder-decoder fashion. Although these attention-based architectures have contributed significantly to image segmentation, many solutions rely heavily on extensive labeled datasets. Additionally, the accuracy of labeling is crucial, and automated segmentation toolkits such as FSL [30] or FreeSurfer [1] cannot perfectly substitute for ground truth due to inaccurate labeling and limited generalization capabilities.
Recent studies have explored the use of quality assessment methodologies to enhance the effectiveness of deep convolutional models in medical image analysis. For instance, Roy et al. [31] employed a Bayesian fully convolutional neural network and model uncertainty to modulate brain segmentation quality. They created uncertainty maps and three structure-wise uncertainty indices by generating Monte Carlo samples from the posterior distribution and using dropout during testing. Additionally, Hann et al. [32] developed a quality-control-centric framework for medical image segmentation, utilizing a Dice similarity coefficient prediction methodology to identify optimal segmentations and enhance precision and efficiency. Some researchers have also used deep neural networks to regress evaluation metrics for segmentation tasks. For instance, Li et al. [33] introduced an entropy-weighted Dice loss function to improve subcortical structure segmentation accuracy by training a neural network to better differentiate between foreground and background regions within ambiguous boundary voxels of subcortical structures. However, these approaches require creating training sets for regressors or error map predictors.
To address these limitations, nnSegNeXt presents a novel approach that leverages the edge overlap between input images and labeled segments (see Figure 1). This overlap is a reliable measure of segmentation quality and is utilized during training to dynamically adjust the weights assigned to each image. This dynamic adjustment significantly enhances the overall accuracy of the segmentation: a higher degree of overlap corresponds to a higher level of accuracy in the segmentation results. Furthermore, we enhance the data preprocessing process that generates multi-center labels to further verify the neural network's accuracy. This additional step improves the robustness of our framework and ensures more precise training results. Consequently, our approach effectively tackles the challenges of missing and inaccurate labels, improving image segmentation accuracy. Our approach offers the following significant contributions:

• We present a novel framework for brain tissue segmentation, leveraging a quality evaluation approach. This framework consists of two essential processes: dataset preprocessing and network training.

• We incorporate a 3D Multiscale Convolutional Attention Module in place of conventional convolutional blocks, enabling simultaneous encoding of contextual information. These attention mechanisms significantly curtail computational overhead while eliciting spatial attention via multiscale convolutional features.

• We devise a Data Quality Loss that appraises label quality on training images, thereby attenuating the impact of label quality on segmentation precision during the training process.

Method

The Proposed Segmentation Framework
The nnSegNeXt framework is presented in Figure 1 and comprises two main stages: preprocessing and network training.

The Preprocessing Stage
In the initial phase of data preparation, the methodology delineated by Feng et al. [34] was refined and applied to preprocess the brain tissue imagery. The first procedure entailed bias field correction [35] to rectify unevenness in image intensity. Subsequently, the imagery was standardized to a resolution of 197 × 233 × 189 voxels, ensuring uniformity across all datasets. To prevent the omission of essential anatomical structures, the HD-BET technique [20] was employed to separate brain tissue from non-cerebral elements. Following this, the FSL FLIRT tool (version 6.0.7.11; Analysis Group, FMRIB, Oxford, UK) [36], with trilinear interpolation, was utilized to align the datasets to the MNI152 isotropic standard space, employing a 1 mm³ brain template for affine registration. The final stage involved the application of FSL FAST to delineate the different brain tissues within the imagery, a pivotal step for the ensuing analysis. This comprehensive preprocessing protocol ensured the maintenance of data integrity and consistency, facilitating accurate brain tissue segmentation.
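For readers who wish to reproduce this pipeline, the sketch below strings the steps together in Python. It is a minimal illustration, assuming HD-BET and FSL are installed and on the PATH, and using N4 via SimpleITK as one common realization of the bias field correction of [35]; all file names are hypothetical.

```python
import subprocess
import SimpleITK as sitk

def preprocess(t1_path: str, out_prefix: str) -> None:
    """Illustrative preprocessing pipeline: bias field correction,
    brain extraction, affine registration to MNI152, and FSL FAST."""
    # 1. Bias field correction (N4, one common choice for this step).
    img = sitk.ReadImage(t1_path, sitk.sitkFloat32)
    mask = sitk.OtsuThreshold(img, 0, 1, 200)
    sitk.WriteImage(sitk.N4BiasFieldCorrection(img, mask),
                    f"{out_prefix}_n4.nii.gz")

    # 2. Brain extraction with HD-BET.
    subprocess.run(["hd-bet", "-i", f"{out_prefix}_n4.nii.gz",
                    "-o", f"{out_prefix}_brain.nii.gz"], check=True)

    # 3. Affine registration to the 1 mm MNI152 template with
    #    trilinear interpolation (FSL FLIRT).
    subprocess.run(["flirt", "-in", f"{out_prefix}_brain.nii.gz",
                    "-ref", "MNI152_T1_1mm_brain.nii.gz",
                    "-out", f"{out_prefix}_mni.nii.gz",
                    "-interp", "trilinear", "-dof", "12"], check=True)

    # 4. Tissue segmentation into GM/WM/CSF (FSL FAST, T1 input, 3 classes).
    subprocess.run(["fast", "-t", "1", "-n", "3", "-o", f"{out_prefix}_fast",
                    f"{out_prefix}_mni.nii.gz"], check=True)
```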

The Network Training Stage
During the network training stage, we trained nnSegNeXt on the preprocessed data using a weighted loss function that considers image quality.The network architecture is detailed in the following section.

Network Architecture
The nnSegNeXt network, depicted in Figure 2, processes input data X ∈ ℝ^{H×W×D×S}. It consists of five stages with downsampling rates of 2, 4, 8, 16, and 32. The shallow stages of the encoder (Stages 1 and 2) employ downsampling and 3D convolutional layers with a 3 × 3 × 3 kernel size, while the deeper stages (Stages 3, 4, and 5) integrate downsampling and a 3D Convolutional Attention Module for capturing global information. The bottleneck uses a 3D Convolutional Attention Module to provide a sufficient receptive field to the decoder, which shares a highly symmetrical architecture with the encoder. Strided deconvolution upsamples low-resolution feature maps to high-resolution ones, with skip connections linking corresponding features in the encoding and decoding paths. In line with the nnU-Net training framework, our approach optimizes network learning through a weighted deep-supervised loss function. This function incorporates both the low-resolution outputs from the initial stages and the output from the final stage; notably, only the output from the final stage is used as the final result. By considering the features of different stages' hidden layers during the training process, this method enhances the network's training effectiveness and generalization ability. The architecture also replaces Batch Normalization (BatchNorm) with Instance Normalization (InstanceNorm) for increased stability. By normalizing features per instance and channel, InstanceNorm allows for greater flexibility in handling style variations, which has proven beneficial in our application. The 3D Convolutional Attention Module and the loss function are detailed in the following sections.
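The following PyTorch sketch illustrates one plausible reading of the stage building blocks described above; the channel widths, activation, and fusion details are our own assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Shallow encoder stage: strided 3x3x3 convolution for downsampling,
    followed by InstanceNorm (in place of BatchNorm) and a nonlinearity."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.InstanceNorm3d(out_ch, affine=True),
            nn.LeakyReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

class UpStage(nn.Module):
    """Decoder stage: strided deconvolution for upsampling, with the
    encoder skip feature concatenated before a convolutional block."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = ConvStage(out_ch + skip_ch, out_ch, stride=1)

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.up(x), skip], dim=1))
```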

3D Multiscale Convolutional Attention Module
We implement attention mechanisms similar to those utilized in SegNeXt in both the encoder and the decoder. However, we significantly improve on this approach by extending the Multiscale Convolutional Attention (MSCA) module of SegNeXt into a three-dimensional Multiscale Convolutional Attention (3DMSCA) module, rather than relying on self-attention mechanisms. Additionally, we use InstanceNorm instead of BatchNorm to address the challenges presented by medical images and modify the sizes of the multiscale convolution kernels to better suit them. Our 3DMSCA module comprises three components, as illustrated in Figure 3: a depth-wise convolution for local information aggregation, a multi-branch depth-wise band convolution for capturing multiscale contexts, and a 1 × 1 × 1 convolution for modeling relationships among channels. The output of the 1 × 1 × 1 convolution serves as attention weights that reweigh the inputs of 3DMSCA. Mathematically, our 3DMSCA can be formulated as follows:

$$\mathrm{Att} = \mathrm{Conv}_{1\times1\times1}\left(\sum_{i=0}^{3}\mathrm{Sca}_i\left(\mathrm{Conv}_D(X_{in})\right)\right), \qquad X_{out} = \mathrm{Att} \otimes X_{in},$$

where X_in denotes the input feature to the network and X_out represents the corresponding output. The operation ⊗ refers to element-wise matrix multiplication. Conv_D denotes a depth-wise convolution, whereas Sca_i, i ∈ {0, 1, 2, 3}, represents the specific branch shown in Figure 3; the Sca_0 branch corresponds to the identity connection. To approximate standard convolutions with large kernels, we deploy three depth-wise strip convolutions in each branch, following [37].
In this case, the kernel sizes for the corresponding branches are set to 5, 7, and 11. We prefer depth-wise strip convolutions because of their lightweight nature. Specifically, we can replicate a standard 3D convolution with a kernel size of 5 × 5 × 5 by deploying a set of 5 × 1 × 1, 1 × 5 × 1, and 1 × 1 × 5 convolutions.
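A minimal PyTorch sketch of the 3DMSCA module, written directly from the formulation above, follows; the class names and padding details are ours, and a production implementation may differ.

```python
import torch.nn as nn

class StripConv3d(nn.Module):
    """Approximates a k x k x k depth-wise convolution with three
    depth-wise strip convolutions: k x 1 x 1, 1 x k x 1, and 1 x 1 x k."""
    def __init__(self, ch: int, k: int):
        super().__init__()
        self.conv_d = nn.Conv3d(ch, ch, (k, 1, 1), padding=(k // 2, 0, 0), groups=ch)
        self.conv_h = nn.Conv3d(ch, ch, (1, k, 1), padding=(0, k // 2, 0), groups=ch)
        self.conv_w = nn.Conv3d(ch, ch, (1, 1, k), padding=(0, 0, k // 2), groups=ch)

    def forward(self, x):
        return self.conv_w(self.conv_h(self.conv_d(x)))

class MSCA3D(nn.Module):
    """3D Multiscale Convolutional Attention: local depth-wise aggregation,
    multi-branch strip convolutions (kernels 5, 7, 11) plus an identity
    branch (Sca_0), 1x1x1 channel mixing, and reweighting of the input."""
    def __init__(self, ch: int):
        super().__init__()
        self.local = nn.Conv3d(ch, ch, 5, padding=2, groups=ch)  # Conv_D
        self.branches = nn.ModuleList([StripConv3d(ch, k) for k in (5, 7, 11)])
        self.mix = nn.Conv3d(ch, ch, 1)  # channel relationship modeling

    def forward(self, x):
        feat = self.local(x)
        attn = feat + sum(branch(feat) for branch in self.branches)  # Sca_0..3
        attn = self.mix(attn)
        return attn * x  # element-wise reweighting (the ⊗ in the equation)
```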

Loss Function
Our proposed nnSegNeXt loss function consists of two parts: the segmentation loss L_seg and the data quality loss L_Data. The segmentation loss L_seg adopts a weighted deep-supervised loss function and is composed of the Dice and multi-class cross-entropy losses between the predicted and ground truth labels. The Dice loss is a widely used metric for the evaluation of segmentation algorithms, as it measures the overlap between the predicted and ground truth labels [38]. The multi-class cross-entropy loss, on the other hand, penalizes the differences between the predicted probabilities and the ground truth labels [39]. The Dice loss and the multi-class cross-entropy loss are defined as follows:

$$L_{Dice} = 1 - \frac{2}{|K|} \sum_{k \in K} \frac{\sum_{i \in \Omega} P_i^k G_i^k}{\sum_{i \in \Omega} P_i^k + \sum_{i \in \Omega} G_i^k}, \qquad L_{CE} = -\frac{1}{|\Omega|} \sum_{i \in \Omega} \sum_{k \in K} G_i^k \log P_i^k,$$

where P and G are the predicted and ground truth labels, respectively, and k ∈ K indexes the classes, which comprise background, GM, WM, and CSF. P_i^k is the predicted probability of the k-th class for pixel i, while G_i^k is the corresponding ground truth label. Ω denotes all the pixels in the predicted segmentation result P and its corresponding ground truth G.
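As a concrete reference, a minimal PyTorch sketch of this per-stage Dice plus cross-entropy loss (our own implementation, not the authors' code) could read:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 eps: float = 1e-5) -> torch.Tensor:
    """Soft Dice plus multi-class cross-entropy. `logits` has shape
    (B, K, H, W, D); `target` holds integer class labels of shape
    (B, H, W, D) with dtype long, for K = 4 classes."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()
    dims = (0, 2, 3, 4)  # pool over batch and spatial dims, keep classes
    intersection = (probs * onehot).sum(dims)
    denominator = probs.sum(dims) + onehot.sum(dims)
    dice_loss = 1.0 - (2.0 * intersection + eps) / (denominator + eps)
    return dice_loss.mean() + F.cross_entropy(logits, target)
```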
The overall segmentation loss is then given by the following:

$$L_{seg} = \sum_{s \in S} w_s \left( L_{Dice}(P_s, G) + L_{CE}(P_s, G) \right),$$

where s ∈ S indexes the stages. Due to the size difference between P_s and G, we upsample the low-resolution prediction P_s to the same size as G for loss calculation. w_s denotes the weight of the s-th stage's output prediction, with weights assigned in ascending order of resolution as [0.03125, 0.0625, 0.125, 0.25, 0.5].
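Combining the per-stage loss above with the stage weights gives the deep-supervised objective; the sketch below reuses dice_ce_loss from the previous snippet and assumes the stage outputs are ordered from lowest to highest resolution.

```python
import torch
import torch.nn.functional as F

STAGE_WEIGHTS = [0.03125, 0.0625, 0.125, 0.25, 0.5]  # ascending resolution

def deep_supervision_loss(stage_logits: list,
                          target: torch.Tensor) -> torch.Tensor:
    """Weighted deep-supervised segmentation loss L_seg. Low-resolution
    stage predictions are upsampled to the label size before scoring."""
    total = torch.zeros((), device=target.device)
    for w, logits in zip(STAGE_WEIGHTS, stage_logits):
        if logits.shape[2:] != target.shape[1:]:
            logits = F.interpolate(logits, size=target.shape[1:],
                                   mode="trilinear", align_corners=False)
        total = total + w * dice_ce_loss(logits, target)
    return total
```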
To evaluate the accuracy of preprocessed image labels, our method is based on edge extraction and a comparison of edge overlap. Specifically, we employ the Canny operator [40] to extract edges from both the original input patches and their corresponding labeled patches. We then compare the degree of overlap between the edges to obtain quality weight scores for the labels, denoted as W_Data. Patches with a higher edge overlap are considered to have more accurate labels and should receive more attention during subsequent training to enhance the precision of the segmentation results. The degree of edge overlap is quantitatively measured using the Dice metric, and the resulting weight scores are used to guide network training. This allows us to assess the accuracy of the image labels and optimize the performance of deep learning models. The quality score of an input patch, W_Data, is calculated using the following equation:

$$W_{Data} = \frac{2\,|E_I \cap E_L|}{|E_I| + |E_L|},$$

where E_I and E_L represent the edge maps of the input patch and labeled patch, respectively. The overlap of the two edge maps allows us to evaluate the accuracy of the image labels.
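A sketch of this quality score is given below. Since the classical Canny operator is two-dimensional, the sketch applies it slice-wise to the 3D patch, which is one plausible reading of the method rather than the authors' exact procedure.

```python
import numpy as np
from skimage import feature

def edge_quality_weight(image_patch: np.ndarray, label_patch: np.ndarray,
                        eps: float = 1e-5) -> float:
    """W_Data: Dice overlap between Canny edge maps of the intensity
    patch and its label patch, applied slice-wise along the first axis."""
    def canny_volume(vol: np.ndarray) -> np.ndarray:
        return np.stack([feature.canny(sl) for sl in vol.astype(float)])

    edges_img = canny_volume(image_patch)
    edges_lbl = canny_volume(label_patch)
    intersection = np.logical_and(edges_img, edges_lbl).sum()
    return float((2.0 * intersection + eps) /
                 (edges_img.sum() + edges_lbl.sum() + eps))
```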
The data quality loss L_Data is defined as the product of the quality score W_Data and the cross-entropy loss, and is used to guide the network training process, as follows:

$$L_{Data} = W_{Data} \cdot L_{CE}.$$

Finally, the total loss L_Total, which incorporates both the image quality loss and the segmentation loss, is expressed as follows:

$$L_{Total} = L_{seg} + \lambda\, L_{Data},$$

where λ represents the trade-off parameter that weighs the importance of each component.
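Putting the pieces together, the total objective can be sketched as below, reusing the earlier snippets; we assume (our reading, not stated explicitly in the text) that the cross-entropy term of L_Data is computed on the final-stage output.

```python
import torch
import torch.nn.functional as F

def total_loss(stage_logits, target, image_patch, label_patch,
               lam: float = 1.0) -> torch.Tensor:
    """L_Total = L_seg + lambda * L_Data, with L_Data = W_Data * L_CE."""
    seg = deep_supervision_loss(stage_logits, target)       # L_seg
    w_data = edge_quality_weight(image_patch, label_patch)  # W_Data
    ce = F.cross_entropy(stage_logits[-1], target)          # L_CE
    return seg + lam * w_data * ce
```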

Datasets
We conducted our initial experiment by collecting MRI scan data from a diverse cohort of healthy subjects representing different age groups from three distinct datasets: HCP [41], SALD [42], and IXI (https://brain-development.org/ixi-dataset/, accessed on 5 May 2024). Although all datasets employed the MPRAGE sequence, discrepancies existed in other scanning parameters. Specifically, the datasets had varying field strengths: HCP and SALD used 3T scans, while IXI used 1.5T scans. Furthermore, different scanners were used to acquire the datasets, with a Philips scanner employed for the IXI dataset instead of the Siemens scanner. These datasets also differed in specific scan parameters, such as repetition/echo time and flip angles. Moreover, to evaluate the model's generalizability, we employed the IBSR dataset (https://www.nitrc.org/projects/ibsr, accessed on 5 May 2024), a labeled dataset widely utilized in brain tissue segmentation tasks. The dataset contains only 18 instances with a voxel size of 0.875 × 1.5 × 0.875 mm. We trained the network on this dataset to highlight the superiority of our approach. A detailed overview of the demographic details and acquisition parameters for all four datasets is provided in Table 1. The HCP, SALD, and IXI datasets contained 200, 251, and 224 scans, respectively, all partitioned into training and test sets at a 4:1 ratio. This distribution facilitated a comprehensive evaluation of our model across diverse datasets and scanning parameters, thereby enhancing the robustness and generalizability of our findings.

Evaluation Metrics
In evaluating the segmentation performance of the various methods, we utilized the Dice coefficient [38] and the 95th percentile of the Hausdorff distance [43]. The Dice coefficient (DC) measures the degree of overlap between the predicted segmentation outcome and the ground truth, represented as a percentage ranging from 0% (a complete mismatch) to 100% (a perfect match), as depicted in Equation (7):

$$DC(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} \times 100\%, \qquad (7)$$

where P denotes the predicted segmentation result and G signifies the ground truth. The Hausdorff distance (HD) quantifies the distance between the predicted segmentation result and the ground truth. Nevertheless, the conventional HD is exceedingly sensitive to outliers; as a result, we utilized the 95th percentile of the HD for outlier suppression. The 95th percentile of the HD is defined as follows:

$$HD_{95}(P, G) = \max\left\{ \operatorname{per}_{95, p \in P} \min_{g \in G} d(p, g),\ \operatorname{per}_{95, g \in G} \min_{p \in P} d(g, p) \right\},$$

where p denotes an element of the predicted segmentation result P, g represents an element of the ground truth G, and per_95 denotes the 95th percentile over the respective set. A smaller HD value indicates greater proximity between the segmentation prediction and the ground truth, reflecting superior segmentation performance.
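For reference, both metrics can be computed from binary masks as in the following sketch; for HD95 it pools voxel-to-set distances from both directions via distance transforms, a simplification that uses all foreground voxels rather than extracted surface points.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dice_coefficient(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice overlap between two binary masks, in [0, 1]."""
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum())

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile Hausdorff distance between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    dist_to_gt = distance_transform_edt(~gt)      # distance to nearest gt voxel
    dist_to_pred = distance_transform_edt(~pred)  # distance to nearest pred voxel
    d_pg = dist_to_gt[pred]   # pred voxels -> gt
    d_gp = dist_to_pred[gt]   # gt voxels -> pred
    return float(np.percentile(np.concatenate([d_pg, d_gp]), 95))
```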

Implementation Details
The experiments were conducted using PyTorch (version 2.2.0) [44] on an NVIDIA RTX 3060 with 12 GB of memory. To ensure fair comparisons, all U-shaped fully convolutional neural networks (FCNNs) utilized five scales of feature maps and maintained a similar number of feature channels at each stage along the encoding and decoding paths. Instead of providing the entire MRI volumes as input to the networks, the images were cropped to sizes of 128 × 128 × 128. Training continued until the model's performance on the validation set ceased to improve, with loss computation excluding background voxels. The initial learning rate was set to 0.01, and the "poly" decay strategy described in Equation (9) was employed. The weight decay was set to 3 × 10⁻⁵. To demonstrate the effectiveness of the proposed network, 5-fold cross-validation was conducted with 500 training epochs, where one epoch comprised 250 iterations. The default optimizer was SGD with a momentum of 0.99. For the other hyperparameters, the weighting parameter λ in the total loss was set to 1, and standard data augmentations, such as axial flips and rotations, were applied during training to enhance performance.

$$lr = initial\_lr \times \left(1 - \frac{epoch\_id}{max\_epoch}\right)^{0.9}. \qquad (9)$$
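The schedule in Equation (9) is straightforward to apply per epoch; a short sketch:

```python
def poly_lr(epoch_id: int, max_epoch: int = 500,
            initial_lr: float = 0.01, exponent: float = 0.9) -> float:
    """"Poly" decay from Equation (9): the rate shrinks from initial_lr
    toward zero as training approaches max_epoch."""
    return initial_lr * (1 - epoch_id / max_epoch) ** exponent

# Example usage with an SGD optimizer (momentum 0.99, as in the setup):
# for epoch in range(500):
#     for group in optimizer.param_groups:
#         group["lr"] = poly_lr(epoch)
#     train_one_epoch(...)  # 250 iterations per epoch
```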

Results
In this section, we examine the performance and generality of our model. We first present a comparative analysis of our model's performance against state-of-the-art CNN-based and Transformer-based models. We then discuss the generality of our model, again in comparison with the top CNN-based and Transformer-based models, and extend our validation to the IBSR dataset. Subsequently, statistical validation is provided through paired t-tests with a Bonferroni correction to determine the significance of the enhancements attributed to nnSegNeXt over nnUNet. Additionally, we report the findings of an ablation study conducted on the HCP, SALD, and IXI datasets.

Model Performance
We compared nnSegNeXt with state-of-the-art CNN-based models using the HCP, SALD, and IXI datasets. Table 2 demonstrates that nnSegNeXt consistently outperformed other CNN models in both the Dice coefficient and HD95 for gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) tissue types across all datasets. Notably, on the HCP dataset, nnSegNeXt achieved the highest Dice score of 0.992 and the lowest HD95 value of 0.277. Similar trends were observed for the SALD and IXI datasets, highlighting nnSegNeXt as a superior model for accurate brain tissue segmentation with remarkable generalization capability across diverse datasets. Additionally, Figure 4 provides qualitative comparisons across all methods, and Figure 5 displays exemplary segmentation outputs for the performance testing of all datasets. Table 3 demonstrates that the nnSegNeXt model also consistently outperforms Transformer-based models in terms of Dice and HD95 scores. On the HCP dataset, the nnSegNeXt model achieved Dice coefficients of 0.991 for gray matter (GM) and 0.994 for white matter (WM), significantly surpassing the other models. Furthermore, on the SALD and IXI datasets, nnSegNeXt demonstrated superior results compared with other models in terms of Dice coefficients and HD95 metrics for GM, WM, and CSF. Specifically, on the SALD dataset, nnSegNeXt exhibited GM Dice coefficient and HD95 values of 0.984 and 0.546, WM values of 0.991 and 0.285, and CSF values of 0.986 and 0.486. Additionally, qualitative comparisons across all methods can be found in Figure A4 in Appendix A, while Figure 5 displays representative segmentation outputs for all models.

Model Generality
The generality of nnSegNeXt was evaluated by comparing its performance with those of other CNN-based models on brain tissue segmentation across multiple transfer settings: HCP → SALD, SALD → HCP, HCP → IXI, and SALD → IXI. Table 4 indicates that the nnSegNeXt model consistently outperformed other models, demonstrating superior segmentation across all four settings with higher Dice coefficients, smaller HD95 values, and better average performance overall. For example, in the HCP → IXI and SALD → IXI experiments, the nnSegNeXt model achieved average Dice coefficients of 0.937 and 0.910, respectively, while also surpassing other models in terms of HD95 values. A qualitative comparison of the models' generality is depicted in Figure A1 in Appendix A, and representative segmentation outputs for all models are displayed in Figure A2 in Appendix A.

Validation on IBSR Dataset
We conducted additional validation using the publicly available labeled brain tissue segmentation dataset IBSR to confirm the validity of the model. Despite the limited data and differences in labeling (sulcal CSF regions are labeled as GM in IBSR but as CSF in the other datasets), our method displayed superior performance compared with leading segmentation frameworks such as nnUNet and nnFormer, even on this low-quality dataset. Table 6 presents a performance comparison of nnSegNeXt with other leading models on the IBSR dataset. nnSegNeXt achieved the highest Dice scores of 0.944, 0.922, and 0.796 for GM, WM, and CSF segmentation, respectively. These results demonstrate the substantial advantages of nnSegNeXt in accurately segmenting brain tissues. The comparative results of model performance and representative segmentation outputs are illustrated in Figures A5 and A6 in Appendix A.

Comparison with nnUNet
In this section, we conduct a comparative analysis between nnSegNeXt and the renowned top-tier 3D medical image segmentation model, nnUNet. Observing the average performance metrics in Table 7, nnSegNeXt consistently demonstrates superior average performance. For instance, nnSegNeXt surpasses nnUNet across all three public datasets, achieving higher Dice and lower HD95 values, with average DSC values of 0.992, 0.987, and 0.989, respectively. The term "Meandiff" refers to the average performance discrepancy between nnSegNeXt and nnUNet. A positive "Meandiff" for the DSC indicates enhanced segmentation precision by nnSegNeXt, while negative values for the HD95 score suggest superior edge delineation capabilities. This indicates that nnSegNeXt may offer more accurate object boundary delineation under the HD95 metric.
To further substantiate the performance superiority of nnSegNeXt over nnUNet, we employed paired t-tests with a Bonferroni correction [45] to calculate p-values for the nnSegNeXt and nnUNet methods across the HCP, SALD, and IXI datasets. As shown in Table 7, we present two sets of p-values, for HD95 and DSC, across the three public datasets. The significantly low p-values (well below 0.05) confirm the statistically significant performance improvement of nnSegNeXt over nnUNet.
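The testing procedure can be reproduced with a few lines of SciPy; the sketch below assumes per-subject score arrays for the two methods and a chosen number of comparisons for the Bonferroni correction.

```python
import numpy as np
from scipy.stats import ttest_rel

def paired_test_bonferroni(scores_a: np.ndarray, scores_b: np.ndarray,
                           n_comparisons: int) -> float:
    """Paired t-test on per-subject scores (e.g., DSC of nnSegNeXt vs.
    nnUNet), with the p-value scaled by the number of comparisons."""
    _, p_value = ttest_rel(scores_a, scores_b)
    return min(1.0, float(p_value) * n_comparisons)
```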

Ablation Study
We conducted ablation experiments and evaluated performance on the three datasets using the Dice similarity coefficient (DSC) as the default evaluation metric, as shown in Table 8. The most basic baseline model excluded both the 3DMSCA module and L_Data. Replacing the convolutional layers in the deeper network stages with the 3DMSCA module yielded a noteworthy improvement in segmentation accuracy of 0.6%, 0.6%, and 0.4% on the respective datasets; this variant also achieved a higher average DSC than SegResNet and TransBTS, as observed in the previous experiments. However, replacing all the convolutional layers decreased accuracy, which we attribute to the attention module's initial struggle to efficiently capture spatial dependencies within large medical image volumes; moreover, the features from Stages 1 and 2 contained excessive low-level information, hindering performance. In contrast, convolutional layers excel at capturing local features and preserving spatial information, which is crucial in medical imaging; we therefore retained the convolutional layers in the initial stages. Additionally, we experimented with the L_Data loss function and found that it had a significant impact on the overall performance of nnSegNeXt. In conclusion, our ablation study highlights the crucial role of the 3DMSCA and L_Data components in the effectiveness of the nnSegNeXt architecture, suggesting its potential as a superior and more efficient method for brain tissue segmentation based on quality assessment.

Discussion
In this section, we discuss the issue of label dependency in medical image segmentation tasks within clinical settings. The variability in label quality, influenced by differences in scanning devices, software processing environments, and the expertise of annotators, presents a challenge for the selection of training strategies. In response to this issue, we introduced the nnSegNeXt framework, which aims to enhance segmentation accuracy through the optimization of data preprocessing and training procedures.
The quality of data annotation is intrinsically connected to the dependability of trained models. Zhang et al. [46] provide a framework for the prediction of segmentation errors and the assessment of segmentation quality for whole-heart segmentation, thereby advancing the precision and trustworthiness of automated segmentation technologies. Zhang et al. [47] improved the quality of crowdsourced labels using noise correction methods and assessed their impact on learning models. Marmanis et al. [48] proposed a trainable deep convolutional neural network that enhances segmentation quality by integrating semantic segmentation and edge detection. Cheng et al. [49] proposed a new segmentation evaluation metric, Boundary IoU, concentrating on the improvement of boundary quality to augment segmentation precision. Zhu et al. [50] introduced a brain tumor segmentation approach that fuses semantic and edge features, realized through a graph-convolution-based multi-feature inference block. Unlike these methods, our approach assesses label quality by extracting edges from the training data and incorporates a 3D Multiscale Convolutional Attention Module and a quality loss function, effectively increasing segmentation precision. Despite its strengths, nnSegNeXt's performance is somewhat influenced by the dataset's image quality, suggesting an area for future optimization. Additionally, testing the model on more diverse datasets could further improve its generalization capabilities and explore potential clinical applications. Future research should endeavor to broaden the utilization of this methodology to additional challenges in semantic segmentation, such as the delineation of brain tumors and skin lesions.

Conclusions
In this study, we presented nnSegNeXt, a novel framework for brain tissue segmentation designed to address the challenges of missing and inaccurate labels. The essence of nnSegNeXt lies in its substitution of traditional convolutional blocks with three-dimensional Multiscale Convolutional Attention Modules. This design choice enables the model to encode contextual information more effectively, enhancing its ability to focus on relevant features for more accurate segmentation. Moreover, nnSegNeXt incorporates a data quality loss function, which significantly reduces the model's reliance on the quality of the training dataset, bolstering the model's versatility and robustness across various scenarios. The results revealed that nnSegNeXt achieved superior segmentation accuracy compared with various CNN- and Transformer-based methods, demonstrating its effectiveness for medical image segmentation. This could significantly improve segmentation accuracy and efficiency, particularly in clinical settings where rapid and precise imaging analysis is crucial for timely diagnosis and treatment planning.

Figure 1. The proposed segmentation framework. The framework is composed of two main stages: preprocessing and network training. During the preprocessing stage, the dataset underwent several processing steps, such as bias field correction, brain extraction, affine registration, and FSL FAST, to produce the corresponding labels. In the network training stage, nnSegNeXt was trained using a weighted loss function on the preprocessed data.

Figure 2. Architectural design of nnSegNeXt. The neural network comprises four encoder layers, four decoder layers, and a bottleneck layer. Additionally, we utilize deep supervision at every decoder layer, accompanied by reduced loss weights at lower resolutions. The dashed box shows the downsampling, convolutional, and upsampling layers. We emphasize that InstanceNorm replaces the original BatchNorm to improve stability.

Figure 3. Illustration of the proposed 3DMSCA. We implement a depth-wise convolution with a kernel size of l × m × n and d. We extract multiscale features through convolutions and apply them as attention weights to reweigh the input of 3DMSCA.

Figure 4. Qualitative results of the performance comparison with state-of-the-art CNN-based models on the (a) HCP, (b) SALD, and (c) IXI datasets. Boxplots show Dice scores for different brain MR tissues using the proposed nnSegNeXt and existing segmentation methods.

Figure 5. Visualization of model performance on HCP, SALD, and IXI. Red indicates gray matter (GM), green indicates white matter (WM), and blue indicates cerebrospinal fluid (CSF). Zoomed-in regions are provided below each image.

Figure A3. Qualitative results of the generality comparison with state-of-the-art Transformer-based models on the HCP, SALD, and IXI datasets. (a) HCP → SALD, (b) SALD → HCP, (c) HCP → IXI, (d) SALD → IXI. Boxplots show Dice scores for different brain MR tissues using the proposed nnSegNeXt and existing segmentation methods. The convention HCP → SALD signifies that HCP is utilized as the training set, while SALD is deployed for the subsequent inference process.

Figure A4. Visualization of the generality comparison with state-of-the-art models on the HCP, SALD, and IXI datasets. Red indicates gray matter (GM), green indicates white matter (WM), and blue indicates cerebrospinal fluid (CSF). Zoomed-in regions are provided below each image.

Figure A5. Qualitative results of the performance comparison with state-of-the-art models on the IBSR dataset. Boxplots show Dice scores for different brain MR tissues using the proposed nnSegNeXt and existing segmentation methods.

Figure A6. Visualization of the performance comparison with state-of-the-art models on the IBSR dataset. Red indicates gray matter (GM), green indicates white matter (WM), and blue indicates cerebrospinal fluid (CSF). Zoomed-in regions are provided below each image.

Table 1. Demographic details and acquisition parameters of the HCP, SALD, IXI, and IBSR datasets.

Table 2. Performance comparison with other CNN-based models on brain tissue segmentation. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 3. Performance comparison with other Transformer-based models on brain tissue segmentation. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 4. Generality comparison with other CNN-based models on brain tissue segmentation. The convention HCP → SALD signifies that HCP is utilized as the training set, while SALD is deployed for the subsequent inference process. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 5 demonstrates nnSegNeXt's consistent outperformance of other Transformer-based models across multiple transfer settings. Specifically, on HCP → SALD, nnSegNeXt achieved an impressive Dice score of 0.967 for GM segmentation, surpassing all other models. Additionally, for WM and CSF segmentation, nnSegNeXt attained the highest Dice scores of 0.974 and 0.982, respectively. The comparative results of model generality and representative segmentation outputs are depicted in Figures A3 and A4 in Appendix A.

Table 5. Generality comparison with other Transformer-based models on brain tissue segmentation. The convention HCP → SALD signifies that HCP is utilized as the training set, while SALD is deployed for the subsequent inference process. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 6. Performance comparison on the IBSR dataset. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 7. Performance comparison with nnUNet. Bold text indicates superior performance, with equally performing model metrics both in bold. The upward arrow indicates that higher values are better, while the downward arrow indicates that lower values are better.

Table 8. Impact of the different modules used in nnSegNeXt. Bold text indicates superior performance. nnSegNeXt w/o L_Data denotes nnSegNeXt without the data quality loss. nnSegNeXt w/o 3DMSCA denotes the replacement of 3DMSCA with convolutional layers in Stages 3, 4, and 5. nnSegNeXt w/o Conv denotes the replacement of the convolutional layers with 3DMSCA in Stages 1 and 2.