1. Introduction
The global burden of chronic respiratory diseases continues to rise. Chronic obstructive pulmonary disease (COPD) is the third leading cause of death worldwide, accounting for 3.23 million deaths in 2019 [1].
Imaging modalities such as chest X-rays, CT scans, MRI, and nuclear medicine enable accurate diagnosis of the severity and type of COPD, supporting treatment that alleviates symptoms and enhances daily functioning. Advanced imaging techniques can track the progression of COPD and identify complications, allowing healthcare providers to forecast potential outcomes and adjust treatment plans to improve prognosis and quality of life.
Chest CT is currently the standard diagnostic tool for assessing structural abnormalities in the lungs, particularly in COPD patients [2]. It is regarded as the clinical gold standard for pulmonary imaging owing to its outstanding spatial and temporal resolution. Chest CT findings correlate strongly with pathological results from post-mortem studies, confirming the presence and extent of emphysema [3]. However, despite its diagnostic advantages, CT contributes significantly to medical radiation exposure, raising concerns about the risk of radiation-induced cancer [4]. This is especially worrisome given the increasing use of repeated CT scans in longitudinal studies. Cautious use of CT imaging and consideration of alternative imaging methods are therefore important to mitigate these risks.
Magnetic Resonance Imaging (MRI) is highly valued for its ability to perform repeated studies without exposing patients to ionizing radiation. MRI provides both structural and functional information, enabling comprehensive lung assessments, including airway structures, lung parenchyma, and blood flow [5]. Although traditionally underutilized for lung imaging due to challenges such as low proton density and motion artifacts, recent advancements in MRI technology, including faster imaging techniques and lung-specific contrast agents, have greatly enhanced its effectiveness and reliability.
Hyperpolarized gas MRI, using helium-3 (3He) and xenon-129 (129Xe), has revolutionized pulmonary imaging over the past three decades [6]. This technique offers detailed structural and functional insights into the lungs. As pioneered by Albert and colleagues in 1994, the polarization of 129Xe (or 3He) gas can be increased dramatically through spin exchange with optically pumped rubidium vapor, amplifying the MRI signal by approximately 100,000 times [7]. Hyperpolarized 3He and 129Xe MRI provide unparalleled advantages, including detailed visualization of airway structures, lung ventilation, and gas exchange processes [8]. Extensive research has confirmed the safety and tolerability of these gases, supporting their use in diverse research and clinical settings to provide critical information on lung function and disease.
Hyperpolarized gas MRI enables the quantification of specific biomarkers, offering detailed insights into lung health. One key biomarker, the Ventilation Defect Percent (VDP), measures the percentage of lung volume that is not effectively ventilated; it is visualized as areas lacking inhaled hyperpolarized gas on MRI, indicating poor ventilation. VDP is essential for assessing functional impairment in diseases such as COPD [9].
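VDP is conventionally computed as the unventilated fraction of the thoracic cavity, expressed as a percentage. A minimal sketch, assuming binary NumPy masks (the function name and interface are hypothetical, not the paper's implementation):

```python
import numpy as np

def ventilation_defect_percent(ventilation_mask: np.ndarray,
                               thoracic_cavity_mask: np.ndarray) -> float:
    """VDP = 100 * (unventilated cavity volume) / (thoracic cavity volume).

    Both inputs are binary masks of matching shape (e.g., a 3D volume or a
    stack of 2D slices), where 1 marks ventilated lung / cavity voxels.
    """
    cavity = thoracic_cavity_mask.astype(bool)
    ventilated = ventilation_mask.astype(bool)
    defect_voxels = np.logical_and(cavity, ~ventilated).sum()
    return 100.0 * defect_voxels / max(int(cavity.sum()), 1)
```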
Earlier studies of hyperpolarized gas MRI ventilation segmentation employed classical image processing and machine learning approaches [10], such as hierarchical K-means [11] and spatial fuzzy c-means (SFCM) clustering [12]; however, these algorithms are time-consuming and observer-dependent. For example, segmenting a 16-slice MRI volume could take around 45 min, which is impractical in a clinical setting.
Deep learning (DL) has shown significant promise in image segmentation [13]. The application of DL in medical imaging has advanced rapidly, providing accurate and efficient analysis of medical images. DL-based segmentation offers precise quantitative analysis and standardized indicators for clinical trials. For instance, the U-Net architecture [14], combined with a pre-trained DenseNet121, has been used to segment lung CT scans for detecting COVID-19 abnormalities [15]. Additionally, Sulaiman et al. described a CNN architecture for segmenting lung diseases in chest X-ray images [16].
DL applications in hyperpolarized gas MRI segmentation have also been explored. Studies include a large-scale investigation of ventilated lung segmentation using multi-nuclear hyperpolarized gas MRI [17], a dual-channel DL approach for lung cavity estimation from hyperpolarized gas and proton MRI [18], a 3D CNN-based method for ventilated lung segmentation [19], and a U-Net++-based approach for quantifying ventilation defects [20]. These studies primarily used CNN methods for segmentation tasks.
Despite previous research on DL-based segmentation of hyperpolarized gas MRI, significant gaps remain in understanding how different architectural paradigms perform under data-constrained conditions. A recent review identified a notable gap in the use of vision transformers and foundational models for this task [21]. One recent study [22] compared CNNs and Vision Transformer-based models for lung segmentation in proton and hyperpolarized gas MRI under varying noise levels, finding that transformer-based architectures such as SegFormer outperform CNNs in high-noise settings and underscoring their potential for robust clinical deployment. Recent developments have introduced foundational models pre-trained on vast datasets, as well as advanced large-kernel architectures that challenge conventional design principles. The limited comparative analysis of these emerging architectural approaches for hyperpolarized gas MRI presents a clear opportunity to explore how different paradigms could enhance segmentation accuracy and efficiency, particularly when working with limited datasets.
The broader computer vision field has witnessed the emergence of three distinct architectural paradigms with different approaches to achieving robust performance. Traditional deep learning architectures, such as U-Net [23], Feature Pyramid Networks (FPNs) [24], and DeepLabV3 [25], rely on increasing network depth and complexity to capture spatial relationships, but typically require large datasets to achieve optimal performance. Foundational models, such as the Segment Anything Model (SAM) [26] and its medical variant MedSAM [27], demonstrate remarkable performance through pre-training on vast, diverse datasets, offering potential advantages in data-scarce scenarios through transfer learning. Advanced large-kernel architectures, such as UniRepLKNet [28] and TransXNet [29], have shown that innovative kernel designs and hybrid attention mechanisms can achieve superior spatial context capture without requiring extensive pre-training datasets.
Recent developments in computer vision have continued to push architectural boundaries through innovative approaches to spatial context modeling and feature interaction. OverLoCK [30] demonstrates that pure ConvNets can effectively incorporate top-down attention through a Deep-stage Decomposition Strategy (DDS) and Context-Mixing Dynamic Convolution (ContMix), achieving superior accuracy-efficiency trade-offs compared to ConvNeXt while maintaining ConvNet inductive biases. The emergence of non-causal Vision Mamba architectures, exemplified by VSSD [31], addresses fundamental limitations of causal state space models in vision tasks by preserving relative interaction weights while enabling bidirectional context modeling. Multi-scale approaches have gained prominence through MSVMamba [32], which introduces hierarchy-in-hierarchy designs with economical multi-scale scanning for efficient long-range dependency capture, and SparX-Mamba [33], which enhances both Vision Mamba and transformer architectures through sparse cross-layer connections that improve feature reuse across distant layers with minimal computational overhead. These advances underscore the ongoing architectural innovation in computer vision, where diverse approaches, from enhanced ConvNets to non-causal state space models, continue to explore optimal strategies for spatial context aggregation and feature interaction, complementing the foundational model and large-kernel architecture paradigms examined in our study.
A critical challenge in medical imaging, particularly in specialized modalities like hyperpolarized gas MRI, is the scarcity of annotated data. Medical image acquisition is expensive, time-consuming, and often requires specialized equipment and expertise. Patient recruitment can be challenging due to ethical considerations, rare disease prevalence, and geographical limitations. Expert annotation is labor-intensive and requires specialized knowledge, creating bottlenecks in dataset creation. These constraints are particularly pronounced in hyperpolarized gas MRI, where the technology is available in limited centers worldwide and requires specialized expertise for both acquisition and interpretation.
Traditional deep learning models typically require large datasets to achieve optimal performance, making them vulnerable to overfitting and poor generalization when trained on limited data. This limitation poses significant challenges for clinical translation and widespread adoption of automated segmentation tools in specialized imaging modalities. However, foundational models are trained on vast, diverse datasets and learn generalizable representations that can be fine-tuned for specific applications. Similarly, advanced large-kernel architectures incorporate design principles that enhance spatial context capture and feature reuse efficiency. Both approaches theoretically offer several advantages in medical imaging scenarios: reduced data requirements for achieving good performance, improved generalization across different imaging conditions and populations, faster convergence during fine-tuning, and potential for robust performance under data constraints.
Despite these architectural advances and the recognized challenges of data scarcity in medical imaging, a critical gap remains in understanding the comparative performance of different architectural paradigms under systematically varied data constraints. While individual studies have explored foundational models, advanced architectures, or traditional approaches in isolation, there lacks a comprehensive comparative framework that evaluates these paradigms across identical datasets and progressive data reduction scenarios. This gap is particularly pronounced in specialized imaging modalities like hyperpolarized gas MRI, where practitioners face uncertainty about optimal architectural choices when working with inherently limited datasets. Furthermore, the absence of systematic evidence comparing the data efficiency of foundational pre-training approaches versus innovative architectural designs leaves clinicians and researchers without clear guidance for model selection in resource-constrained environments. Without this comparative understanding, the medical imaging community cannot make informed decisions about resource allocation, computational requirements, and expected performance trade-offs when implementing automated segmentation tools in clinical practice.
This study addresses the critical question of whether foundational models and advanced architectures truly offer advantages over traditional deep learning architectures when working with limited medical imaging data. Our research aims to provide a comprehensive comparative analysis between foundational models (SAM and MedSAM), advanced large-kernel architectures (UniRepLKNet and TransXNet), and established traditional architectures (UNet with VGG19, FPN with MIT-B5, and DeepLabV3 with ResNet152) for both proton MRI and hyperpolarized gas MRI segmentation.
Our specific contributions are as follows: (1) A systematic comparison of foundational, advanced, and traditional models using identical datasets and evaluation metrics. (2) Assessment of model performance across progressive data reduction scenarios (100%, 50%, 25%, and 10% of the original dataset).
This research provides crucial insights for the medical imaging community regarding the practical benefits of foundational and advanced models in real-world clinical scenarios where data availability is inherently limited, ultimately informing decisions about model selection and resource allocation for medical AI development.
2. Materials and Methods
This section describes the proton and hyperpolarized gas MRI dataset compilation, ground truth generation procedures, and deep learning architectures employed in this comparative analysis. All computational experiments were conducted using the PyTorch 2.0 framework on dual NVIDIA GeForce RTX 3090 GPUs (NVIDIA Corporation, Santa Clara, CA, USA).
2.1. Image Acquisition and Dataset Compilation
The imaging dataset utilized in this study was retrospectively compiled from multiple research and clinical investigations involving patients referred for hyperpolarized gas MRI scans. This study was performed following institutional ethics approval, with informed written consent obtained from all participants. Each of the scans in the dataset was accompanied by a semi-automated expert segmentation, serving as the ground truth.
Imaging data were acquired using a 3T Discovery MR750 scanner (GE Healthcare, Waukesha, WI, USA) equipped with flexible vest quadrature coils (Clinical MR Solutions, Milwaukee, WI, USA). Hyperpolarized 129Xe gas was prepared using a 9820 polarizer (Polarean, Durham, NC, USA), achieving polarization levels of 10–40%.
Proton MRI acquisition employed a fast-spoiled gradient-recalled echo sequence with the following parameters: TR/TE/flip angle = 4.7 ms/1.2 ms/30°; field-of-view = 40 × 40 cm2; bandwidth = 24.4 kHz; acquisition matrix = 128 × 80, zero-padded to 128 × 128; partial-echo = 62.5%; slice thickness = 15 mm with 15–17 slices. Hyperpolarized 129Xe imaging utilized a three-dimensional FGRE sequence (TR/TE = 6.7 ms/1.5 ms; variable flip-angle; field-of-view = 40 × 40 cm2; bandwidth = 15.63 kHz; matrix = 128 × 128; slice thickness = 15 mm with 14 slices).
Standardized breathing protocols ensured consistent lung volumes across modalities. Participants were trained to inhale 1.0 L of gas mixture from functional residual capacity: nitrogen for proton imaging and a combination of 400 mL hyperpolarized 129Xe with 600 mL 4He for gas imaging.
2.2. Ground Truth Generation and Quality Control
Ground truth lung segmentation masks were generated using a semi-automated pipeline implemented in MATLAB R2021b (The MathWorks, Natick, MA, USA). The process began with automated thresholding of hyperpolarized 129Xe gas images, followed by manual adjustment to exclude tracheal regions from the segmentation mask. Image registration between proton and hyperpolarized gas images required manual placement of 3–6 anatomical landmarks on both 1H and 129Xe images, with subsequent affine transformation calculation for spatial alignment. Thoracic cavity segmentation was obtained through region growing algorithms applied to the registered gas mask. All semi-automated segmentation results underwent comprehensive quality control review by five independent observers with varying levels of experience in hyperpolarized gas MRI analysis (range: 1–7 years of experience).
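The actual pipeline was implemented in MATLAB with manual landmark placement and tracheal exclusion; the following Python sketch (scikit-image/SciPy, illustrative only) shows what the automated thresholding and region-growing core of such a pipeline could look like:

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.segmentation import flood
from scipy.ndimage import binary_fill_holes

def rough_gas_mask(xe_slice: np.ndarray, seed: tuple) -> np.ndarray:
    """Otsu-threshold a 129Xe slice, then region-grow from a seed placed
    inside the lung. Tracheal voxels would still be removed manually, as
    in the semi-automated pipeline described above."""
    binary = xe_slice > threshold_otsu(xe_slice)
    grown = flood(binary, seed)       # keep the connected region at the seed
    return binary_fill_holes(grown)   # fill interior holes in the lung mask
```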
2.3. Dataset Characteristics and Stratification
We collected data from 205 participants, which included 22 healthy individuals, 26 with COPD, 90 with asthma, and 67 with Long-COVID-19. The study population had a mean age of 54 ± 16 years, included 119 females (58%), and had a mean BMI of 28 ± 6 kg/m2. This yielded 1640 2D slices, originally sized 128 × 128. Proton and hyperpolarized slices were registered using a CNN-based registration approach as described by Mozaffaripour et al. [34]. The dataset was balanced to ensure an even distribution across the participant groups, with an 80% training, 10% validation, and 10% testing split. To avoid data leakage, no participant's data appeared in both the training and testing sets. To ensure representativeness and fairness, we also maintained an even distribution of the different conditions across the training, validation, and test sets.
To ensure rigorous and unbiased model comparison, identical dataset splits were maintained across all architectural evaluations. The stratified splitting methodology addressed two critical distribution considerations: anatomical slice representation and disease condition balance. For anatomical representativeness, slices were categorized by their position within each participant’s imaging volume (apical, middle, and basal lung regions) and distributed proportionally across training, validation, and test sets to ensure representative coverage of complete lung anatomy.
Simultaneously, 205 participants were stratified by clinical condition with proportional representation maintained within each data split, ensuring that training (80%), validation (10%), and test (10%) sets each contained the same relative distribution of disease conditions. This dual-stratification approach prevented potential confounding effects from anatomical bias or disease-specific learning advantages, while consistent use of identical splits across all model architectures eliminated inter-model variability due to data distribution differences.
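A minimal sketch of such a patient-level stratified split (an illustrative helper; the paper does not specify its exact implementation):

```python
import random
from collections import defaultdict

def patient_level_split(patient_ids, conditions, seed=42):
    """80/10/10 split that keeps each clinical condition's proportions
    similar across splits and never divides one patient's slices between
    sets (preventing data leakage)."""
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for pid, cond in zip(patient_ids, conditions):
        by_condition[cond].append(pid)
    train, val, test = [], [], []
    for ids in by_condition.values():
        rng.shuffle(ids)
        n = len(ids)
        n_train, n_val = int(0.8 * n), int(0.1 * n)
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        test += ids[n_train + n_val:]
    return train, val, test
```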
2.4. Preprocessing
Preprocessing was conducted using the MONAI library [35]. The intensity values of the images were normalized to a range of 0 to 255, while label intensities were scaled between 0 and 1, aligning with the expected input ranges for subsequent processing steps. Spatial padding was applied to standardize dimensions across all images and labels. Further preprocessing involved converting grayscale images to RGB format by replicating the grayscale values across three channels. Ground truth masks were adjusted to have positive values for labels and zero for the background. Finally, bounding boxes spanning the full image were generated for each mask to serve as prompts for the SAM and MedSAM processors.
These preprocessing steps were selected to ensure compatibility across diverse model architectures while preserving medical image characteristics. Intensity normalization to the 0–255 range standardized input distributions across different MRI acquisition parameters and enabled consistent model convergence. Grayscale-to-RGB conversion was necessary because all evaluated models were pre-trained on natural RGB images (ImageNet) and required three-channel inputs; replicating grayscale values across channels preserved the original intensity information while maintaining compatibility with pre-trained feature extractors. Spatial padding standardized dimensions to prevent artifacts from variable image sizes, while full-image bounding boxes for foundational models ensured fair comparison by eliminating spatial guidance advantages that could confound architectural performance differences.
The choice of full-image bounding boxes as prompts for SAM and MedSAM was deliberate to ensure fair comparison with advanced and traditional models. Unlike foundational models that utilize prompts to guide segmentation, advanced architectures and traditional models process the entire image without any spatial guidance or region-of-interest indicators. By providing bounding boxes encompassing the full image dimensions rather than tight boxes around lung regions, we eliminated the potential advantage that precise spatial prompts could provide to foundational models, ensuring that all architectures operated under equivalent conditions where they must identify and segment lung structures from the complete image without prior localization information. This methodological approach prevents confounding effects that could arise from differential spatial guidance across model types and maintains the integrity of our comparative analysis focused on architectural resilience rather than prompt engineering advantages.
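A hedged sketch of what this pipeline could look like with MONAI's dictionary transforms (transform choices are illustrative; the paper's exact transform chain may differ):

```python
import numpy as np
from monai.transforms import (Compose, EnsureChannelFirstd, RepeatChanneld,
                              ScaleIntensityd, SpatialPadd)

preprocess = Compose([
    EnsureChannelFirstd(keys=["image", "label"], channel_dim="no_channel"),
    ScaleIntensityd(keys="image", minv=0.0, maxv=255.0),  # image -> [0, 255]
    ScaleIntensityd(keys="label", minv=0.0, maxv=1.0),    # label -> [0, 1]
    SpatialPadd(keys=["image", "label"], spatial_size=(128, 128)),
    RepeatChanneld(keys="image", repeats=3),              # grayscale -> pseudo-RGB
])

def full_image_box(h: int, w: int) -> np.ndarray:
    """Full-size bounding-box prompt (x0, y0, x1, y1) for SAM/MedSAM,
    deliberately providing no localization advantage."""
    return np.array([0, 0, w, h])
```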
2.5. Model Architectures and Fine-Tuning
2.5.1. Foundational Models
Foundation models are expansive AI systems trained on vast datasets using substantial computational resources, capable of generating diverse outputs from text to images. Prominent examples include OpenAI's GPT models [36], DALL-E 3 [37], Segment Anything [26], and BERT [38].
SAM, developed by Meta AI, is a foundational model for image segmentation. Its architecture consists of three main components: an image encoder, a prompt encoder, and a mask decoder. The image encoder, a Vision Transformer pre-trained as a Masked Autoencoder (MAE), creates an embedding by extracting essential features from the input image. This embedding is a 16× downscaled version of the original image, enabling efficient processing while retaining critical features.
Various types of prompts, including points, boxes, masks, and text, can be used with SAM. For our task, we employed box prompts. Each box is represented by an embedding pair, combining the positional encoding of its top-left and bottom-right corners with learned embeddings representing these positions. The lightweight mask decoder, a modified Transformer decoder block followed by a dynamic mask prediction head, predicts segmentation masks by integrating information from the image and prompt embeddings.
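For illustration, box-prompted inference with the publicly released segment-anything package might look like the following (the checkpoint path is a placeholder, and this zero-shot sketch omits the fine-tuning used in our study):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# ViT-B backbone; the checkpoint filename is a placeholder
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def segment_with_box(image_rgb: np.ndarray) -> np.ndarray:
    """Run SAM with a full-image box prompt, as used in this study."""
    predictor.set_image(image_rgb)             # uint8 array of shape (H, W, 3)
    h, w = image_rgb.shape[:2]
    box = np.array([0, 0, w, h])               # full-image prompt
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                            # boolean (H, W) mask
```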
Despite their capabilities, foundational models like SAM have limited applicability in medical image segmentation due to the significant differences between natural and medical images. To address these challenges, MedSAM was developed as an extension of SAM, specifically tailored for the medical imaging domain. MedSAM is trained on a large-scale dataset comprising over one million medical image-mask pairs.
2.5.2. Advanced Large-Kernel Architectures
UniRepLKNet [28] employs very large convolution kernels (up to 13 × 13) following four architectural guidelines: (1) use efficient structures to increase depth; (2) use dilated re-parameterization blocks to enhance large-kernel performance; (3) choose kernel size according to the downstream task, with large kernels typically used in middle and high-level layers; and (4) add small kernels when scaling up model depth. The core principle is that large kernels can “see wide without going deep,” enabling efficient spatial context capture. We evaluated the UniRepLKNet-F (6.2M parameters) and UniRepLKNet-S (55.6M parameters) variants. We also employed TransXNet [29], a hybrid CNN-transformer architecture featuring a Dual Dynamic Token Mixer (D-Mixer) that simultaneously learns global and local dynamics. The D-Mixer combines overlapping spatial reduction attention (OSRA) for global context with input-dependent depthwise convolution (IDConv) for dynamic local feature extraction. The architecture also incorporates multiscale feedforward networks (MS-FFN) for comprehensive feature aggregation. We assessed TransXNet-tiny (12.8M parameters), TransXNet-small (26.9M parameters), and TransXNet-base (48.0M parameters).
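To make the “see wide without going deep” principle concrete, the sketch below shows a strongly simplified large-kernel depthwise block in PyTorch; UniRepLKNet's actual blocks additionally employ dilated re-parameterization branches and other refinements:

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Simplified large-kernel block: a 13x13 depthwise convolution yields a
    wide receptive field in a single layer, followed by a 1x1 pointwise mix
    and a residual connection."""
    def __init__(self, channels: int, kernel_size: int = 13):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.norm = nn.BatchNorm2d(channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw(self.act(self.norm(self.dw(x))))
```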
2.5.3. Traditional Deep Learning Models
As for the traditional deep learning models, the U-Net [14] architecture with a VGG19 [39] encoder represents a well-established approach for medical image segmentation. VGG19 provides a robust feature extraction backbone with its deep convolutional layers, while the U-Net decoder enables precise spatial localization through skip connections. This combination has proven effective across numerous medical imaging applications.
The FPN [24] architecture excels at multi-scale feature extraction, which is crucial for segmenting structures of varying sizes in medical images. The MIT-B5 [40] backbone, part of the Mix Transformer family, incorporates attention mechanisms and hierarchical feature learning, providing strong representational capabilities for complex segmentation tasks.
DeepLabV3 [41] employs atrous convolution and pyramid pooling to capture multi-scale contextual information. The ResNet152 [42] backbone provides deep feature extraction with residual connections, enabling the training of very deep networks while maintaining gradient flow and avoiding degradation problems.
2.6. Training Methodology
During fine-tuning, images (and, for the foundational models, their corresponding bounding box prompts) were processed and fed into each model. We employed the Adam optimizer [43] with learning rates tuned for each model based on preliminary experiments. A combined Dice and cross-entropy loss function was used to optimize both overlap and boundary accuracy.
Early stopping based on validation loss prevented overfitting, ensuring model robustness and generalization. Detailed training specifications, including hyperparameters, optimization strategies, and reproducibility details, are provided in the Supplementary Materials.
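A schematic sketch of this training regime is shown below, using MONAI's combined Dice + cross-entropy loss; `model`, `train_loader`, `val_loader`, and `evaluate` are placeholders for the fine-tuned network and data pipeline, and the hyperparameter values are illustrative rather than the tuned values reported in the Supplementary Materials:

```python
import torch
from monai.losses import DiceCELoss

loss_fn = DiceCELoss(sigmoid=True)  # combined Dice + cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    val_loss = evaluate(model, val_loader)    # mean validation loss (placeholder)
    if val_loss < best_val:                   # keep the best checkpoint
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping on validation loss
            break
```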
2.7. Computational Complexity of Experimented Models
The evaluated models demonstrated significant variation in computational requirements and parameter counts, which have important implications for clinical deployment scenarios. Foundational models required the highest computational resources, with both SAM and MedSAM utilizing ViT-B backbones containing approximately 91 million parameters each. Among advanced architectures, UniRepLKNet variants showed considerable scalability, ranging from UniRepLKNet-F with 6.2 million parameters to UniRepLKNet-S with 55.6 million parameters. TransXNet architectures provided a middle-ground approach with TransXNet-tiny (12.8 M parameters), TransXNet-small (26.9 M parameters), and TransXNet-base (48.0 M parameters). Traditional deep learning models exhibited diverse parameter counts: UNet-VGG19 was the most lightweight at 34.5 million parameters, DeepLabV3-ResNet152 contained 60.1 million parameters, while FPN-MIT-B5 required 81.4 million parameters.
Table 1 summarizes the parameter counts.
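Parameter counts such as those above can be reproduced with a one-line PyTorch helper (illustrative):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Trainable parameter count in millions, as reported in Table 1."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```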
2.8. Progressive Data Reduction Experiments
To provide a comprehensive understanding of architectural resilience to data scarcity, we conducted experiments across four different training data availability scenarios while maintaining consistent validation and testing sets to enable direct performance comparison:
100% Training Data (Baseline): All models were trained on the complete training dataset (80% of total data) to establish optimal performance benchmarks under ideal conditions.
50% Training Data: Models were trained using half of the original training data, randomly sampled while maintaining class distribution balance, to assess moderate data reduction impact.
25% Training Data: Models were trained on a quarter of the original training data to simulate realistic data-constrained scenarios commonly encountered in specialized imaging modalities.
10% Training Data: Models were trained using only one-tenth of the original training data to represent extreme data scarcity conditions and test architectural robustness limits.
For the progressive data reduction experiments (50%, 25%, and 10% training data scenarios), the same dual-stratification methodology was applied when subsampling from the original training set, ensuring that each reduced training subset maintained proportional representation of both anatomical slice positions and disease conditions, thereby preserving dataset balance across all experimental conditions. Additionally, the same validation and testing sets were maintained to enable direct performance comparison across data availability conditions. Each model was trained until convergence with early stopping based on validation loss to prevent overfitting.
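A minimal sketch of such stratified subsampling (an illustrative helper; `strata` would encode each training case's combined condition and slice-position label):

```python
import random
from collections import defaultdict

def subsample_training_set(train_ids, strata, fraction, seed=42):
    """Draw a fraction (e.g., 0.5, 0.25, 0.1) of the training cases while
    preserving the joint distribution of disease condition and anatomical
    slice position encoded in `strata` (one label per id)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for tid, s in zip(train_ids, strata):
        groups[s].append(tid)
    subset = []
    for ids in groups.values():
        rng.shuffle(ids)
        subset += ids[:max(1, round(fraction * len(ids)))]
    return subset
```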
A schematic of the experimental design is shown in Figure 1.
2.9. Evaluation Metrics and Statistical Analysis
Model performance was assessed using multiple complementary metrics to provide a comprehensive evaluation of segmentation quality. Primary metrics included Dice Similarity Coefficient (DSC) for overlap assessment, Hausdorff Distance at 95th percentile (HD95) for boundary accuracy, Average Hausdorff Distance (Avg HD) for overall geometric agreement, and XOR Error for pixel-wise disagreement quantification.
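As an illustration, DSC and HD95 can be computed with MONAI's metric classes; the XOR error shown below uses one common definition (symmetric-difference voxels normalized by ground-truth size), which may differ from the paper's exact formula:

```python
import torch
from monai.metrics import DiceMetric, HausdorffDistanceMetric

dice = DiceMetric(include_background=True, reduction="mean")
hd95 = HausdorffDistanceMetric(include_background=True, percentile=95)

def evaluate_pair(pred: torch.Tensor, gt: torch.Tensor):
    """pred/gt: binarized masks shaped (batch, 1, H, W)."""
    dsc = dice(y_pred=pred, y=gt).mean().item()
    hd = hd95(y_pred=pred, y=gt).mean().item()
    xor = (torch.logical_xor(pred.bool(), gt.bool()).sum().item()
           / max(int(gt.sum().item()), 1))
    return dsc, hd, xor
```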
Statistical analysis employed IBM SPSS version 31 software with non-parametric tests due to the non-normal distribution of metrics as determined by Shapiro–Wilk tests. Friedman tests assessed differences between models, with post hoc pairwise comparisons using Bonferroni correction for multiple comparison adjustment. The analysis specifically evaluated: (1) performance equivalence between foundational and advanced models, (2) performance differences between foundational/advanced models versus traditional approaches, and (3) statistical significance of performance changes across data reduction scenarios for each architectural group.
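The corresponding non-parametric workflow can be sketched with SciPy as a stand-in for the SPSS analysis described above (Wilcoxon signed-rank tests serve here as illustrative post hoc comparisons):

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def compare_models(scores: dict, alpha: float = 0.05):
    """scores: model name -> per-case DSC list (same cases, same order).
    Friedman omnibus test, then Bonferroni-corrected pairwise post hocs."""
    stat, p = friedmanchisquare(*scores.values())
    print(f"Friedman: chi2={stat:.2f}, p={p:.4f}")
    pairs = list(combinations(scores, 2))
    corrected_alpha = alpha / len(pairs)        # Bonferroni correction
    for a, b in pairs:
        _, p_pair = wilcoxon(scores[a], scores[b])
        flag = "significant" if p_pair < corrected_alpha else "n.s."
        print(f"{a} vs {b}: p={p_pair:.4f} ({flag})")
```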
4. Discussion
This study provides comprehensive evidence that architectural approaches incorporating large effective receptive fields, whether through extensive pre-training (foundational models) or innovative kernel designs (advanced architectures), offer fundamental advantages over traditional deep learning approaches when working with limited medical imaging data. The results establish clear architectural performance hierarchies that become increasingly pronounced under data-constrained conditions, providing crucial guidance for clinical AI deployment. Representative qualitative comparisons illustrating these performance differences across model architectures are provided in the Supplementary Materials (Figures S33 and S34).
4.1. Statistical Equivalence of Foundational and Advanced Models
The most significant finding of our comprehensive analysis is the statistical equivalence between foundational models and advanced large-kernel architectures across all data availability scenarios. This equivalence demonstrates that innovative architectural designs can achieve the data efficiency benefits traditionally associated with extensive pre-training approaches, offering multiple viable pathways for robust performance under data constraints.
The success of UniRepLKNet validates the architectural principle that large kernels enable models to “see wide without going deep.” The dilated re-parameterization blocks and strategic kernel size selection created efficient spatial context aggregation that maintained robust performance even with severely limited training data. This approach effectively decouples the traditional relationship between model depth and receptive field size, enabling more efficient parameter utilization under data constraints.
TransXNet’s dual dynamic token mixer demonstrated that hybrid approaches combining global attention mechanisms with input-dependent local feature extraction achieve performance parity with foundational models. The overlapping spatial reduction attention for global context, combined with dynamic convolution for local features, created a balanced architecture capable of maintaining both spatial detail preservation and global context understanding across progressive data reduction scenarios.
4.2. Progressive Data Reduction Performance Analysis
Under optimal data conditions, the performance convergence across all architectural approaches suggests that traditional models can achieve competitive results when sufficient training data is available. This convergence validates that the fundamental task of lung segmentation can be addressed through multiple architectural paradigms when data limitations are not a constraint.
The 25% training data scenario emerged as a critical threshold where architectural differences become clinically significant. At this point, both foundational and advanced models maintained near-optimal performance (DSC > 0.86 for both modalities), while traditional architectures experienced severe degradation. This threshold represents a realistic clinical scenario where specialized imaging modalities face moderate data constraints.
Under 10% training data conditions, the exceptional stability of both foundational and advanced models (maintaining DSC > 0.84) contrasts dramatically with the complete failure of traditional approaches. This resilience addresses a fundamental barrier to clinical AI translation in specialized imaging modalities, demonstrating that high-quality automated segmentation remains achievable even under extreme data limitations.
Effect size analysis using Cohen’s d with 95% confidence intervals provided quantitative evidence for the magnitude of architectural performance differences across progressive data reduction scenarios (Supplementary Materials Figure S35). Under full data conditions (100% training data), all pairwise comparisons yielded negligible effect sizes (d < 0.2) with overlapping confidence intervals, confirming architectural equivalence when sufficient training data is available. However, as training data decreased, large effect sizes (d > 1.0) consistently emerged between foundational/advanced models versus traditional architectures, with some comparisons reaching d > 2.0 under extreme data scarcity (10% training data), representing differences exceeding two pooled standard deviations. Most critically, foundational models and advanced architectures maintained negligible effect sizes (d < 0.2) relative to each other across all data scenarios, providing statistical confirmation of their equivalence despite different underlying design principles. The magnitude of these effect sizes indicates not merely statistical significance but clinically substantial differences in segmentation quality, with traditional architectures experiencing performance degradation of sufficient magnitude to render them unsuitable for clinical deployment under data-constrained conditions, while both foundational and advanced approaches demonstrated statistical equivalence in their superior data efficiency.
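For reference, Cohen's d as used here follows the conventional formulation: the difference in mean per-case scores between two models, scaled by the pooled standard deviation (the 0.2 and 1.0 thresholds above follow the usual negligible/large interpretation):

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
```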
4.3. Modality-Specific Considerations
While hyperpolarized gas MRI’s specialized imaging characteristics initially minimized architectural advantages under full data conditions, the benefits of foundational and advanced models became increasingly apparent with data reduction. This pattern suggests that specialized imaging modalities, which inherently face data scarcity challenges, represent primary beneficiaries of advanced architectural approaches.
The consistent performance advantages observed across both imaging modalities support the hypothesis that architectural approaches enhancing effective receptive field capture provide fundamental benefits for medical image segmentation. Both foundational pre-training and large-kernel designs enable efficient spatial context aggregation that reduces dependence on extensive training examples.
4.4. Clinical Translation and Practical Implications
The demonstrated data efficiency of both foundational and advanced models provides multiple pathways for democratizing advanced medical imaging analysis. The ability to achieve clinically acceptable performance with 75–90% less training data enables deployment in resource-limited settings, specialized centers, or emerging imaging modalities where extensive data collection is impractical.
The statistical equivalence between foundational and advanced models provides practical flexibility for clinical implementation. Selection can be based on computational resources, deployment constraints, regulatory considerations, and specific application requirements while maintaining confidence in performance reliability under data-limited conditions.
The availability of multiple equivalent approaches (foundational vs. advanced) enables optimization based on secondary factors such as computational efficiency, memory requirements, inference speed, or customization capabilities, while ensuring robust performance across varying data availability scenarios.
4.5. Computational Efficiency and Clinical Deployment Trade-Offs
The computational analysis reveals a nuanced relationship between model complexity and performance efficiency that has significant implications for clinical deployment strategies. While foundational models (SAM and MedSAM at 91M parameters each) achieved superior data efficiency, their computational demands may limit deployment in resource-constrained environments. However, the statistical equivalence demonstrated by advanced architectures offers compelling alternatives with more favorable computational profiles. UniRepLKNet-F, with only 6.2M parameters, achieved performance parity with models containing roughly 15× more parameters, representing exceptional parameter efficiency. Similarly, TransXNet variants provided scalable options ranging from 12.8M to 48.0M parameters while maintaining equivalent performance to much larger foundational models. This finding challenges the conventional assumption that superior performance necessarily requires larger models, suggesting that architectural innovation can achieve foundational model-level benefits with significantly reduced computational overhead.
For clinical translation, this implies that healthcare institutions with limited computational resources need not sacrifice performance quality, as advanced architectures like UniRepLKNet-F offer a compelling middle ground between traditional models’ data vulnerability and foundational models’ computational demands. The roughly 14-fold reduction in parameters from foundational to advanced models, with statistical performance equivalence maintained, represents a critical advance toward democratizing advanced medical imaging AI, enabling deployment across diverse clinical environments from resource-limited settings to high-throughput imaging centers. This computational flexibility, combined with demonstrated resilience under data constraints, positions advanced architectures as potentially optimal choices for widespread clinical adoption.
4.6. Limitations
Several limitations warrant consideration in interpreting these results. First, our analysis focused on segmentation tasks within specific anatomical regions (lung segmentation). The generalizability of these findings to other anatomical structures, pathological conditions, or segmentation tasks requires further investigation.
Second, our study did not extensively explore computational efficiency and resource requirements. While foundational models demonstrated superior performance, their larger memory footprints and computational demands may present practical challenges in resource-constrained clinical environments.
Third, validation of the models on real patient data in clinical practice has been limited. Our evaluation was conducted on retrospectively collected research datasets, and prospective clinical validation would be necessary to confirm the practical utility and safety of these approaches in real-world diagnostic workflows.
Additionally, our dataset originates from a single center and has not been validated across different MRI devices, scanner manufacturers, imaging protocols, or diverse patient populations spanning different ethnicities and demographic groups. This single-center limitation restricts the generalizability of our findings and highlights the need for multi-center validation studies to establish robustness across varied clinical environments and patient populations.
To address these limitations, future research should prioritize multi-center validation studies incorporating diverse scanner manufacturers, imaging protocols, and patient demographics to establish broader generalizability. Federated learning approaches could enable collaborative training across institutions while preserving patient privacy, while domain adaptation techniques could facilitate rapid model deployment to new clinical sites with minimal local data requirements. Expanding evaluation beyond lung segmentation to other anatomical structures and pathological conditions would demonstrate architectural robustness across medical imaging applications. Additionally, comprehensive computational efficiency analyses comparing inference times, memory requirements, and energy consumption would provide practical deployment guidance for resource-constrained clinical environments. Prospective clinical validation studies integrating these models into real diagnostic workflows would ultimately confirm their safety, utility, and impact on clinical decision-making and patient outcomes.
5. Conclusions
This comprehensive evaluation across progressive data reduction scenarios provides compelling evidence that architectural design fundamentally determines model resilience to data scarcity in medical imaging applications. Our systematic comparison of foundational models, advanced large-kernel architectures, and traditional deep learning approaches across four data availability scenarios reveals consistent patterns that have profound implications for clinical AI deployment.
Our results demonstrate that both foundational models (SAM and MedSAM) and advanced architectures (UniRepLKNet and TransXNet) maintain robust performance when transitioning from full to severely limited data scenarios, with minimal degradation in segmentation accuracy across both imaging modalities. Most remarkably, these architectures achieve statistically equivalent performance (p > 0.01) across all data conditions, demonstrating that innovative architectural designs can match the benefits of extensive pre-training. In stark contrast, traditional architectures (UNet-VGG19, FPN-MIT-B5, DeepLabV3-ResNet152) suffer catastrophic performance collapse under data constraints, with some models experiencing DSC decreases exceeding 50%.
Importantly, while specialized imaging characteristics (as observed in hyperpolarized gas and proton MRI) may initially level performance differences between all architectural approaches under full data conditions, both foundational and advanced models retain their resilience advantages when data becomes scarce. Under extreme data scarcity (10% training data), both architectural groups maintained acceptable performance (DSC > 0.86) while traditional models experienced complete failure. This suggests that the benefits of both foundational approaches and advanced architectural innovations extend beyond simple domain transfer to encompass fundamental learning efficiency advantages that persist across imaging modalities.
These findings have profound implications for the deployment of AI tools in medical imaging workflows. The statistical equivalence between foundational and advanced models provides healthcare institutions with flexible implementation options based on their specific computational resources, regulatory constraints, and deployment requirements while maintaining confidence in performance reliability. The superior data efficiency of both architectural approaches could significantly lower barriers to implementing automated segmentation in specialized imaging applications, resource-limited settings, and emerging clinical domains where large datasets are difficult to obtain.
For hyperpolarized gas MRI specifically, our results suggest that both foundational and advanced models could enable more widespread adoption of automated biomarker quantification tools, potentially improving the accessibility and standardization of this important but specialized imaging technique. The ability to achieve clinically viable performance with 75–90% less training data addresses fundamental barriers to clinical translation in specialized imaging modalities that are available in limited centers worldwide.
Our findings suggest that the medical imaging community should prioritize architectural approaches that enhance effective receptive field capture. These approaches can be achieved through extensive pre-training, innovative kernel designs, or hybrid attention mechanisms. This paradigm shift from traditional depth-focused architectures to spatial context-focused designs could accelerate the development and deployment of AI tools across diverse clinical applications while improving accessibility in resource-constrained environments.
As medical imaging continues to evolve with new technologies and clinical applications, the need for robust, data-efficient AI tools becomes increasingly critical. Our results provide strong evidence that both foundational models and advanced architectural innovations offer promising paths forward, particularly for addressing the data scarcity challenges that have historically limited the clinical translation of AI tools in specialized medical imaging domains.
The consistent demonstration that architectural innovation can achieve foundational model-level performance while potentially offering greater deployment flexibility opens new avenues for practical clinical AI implementation. This revelation that multiple equivalent pathways exist for achieving robust performance under data-limited conditions ultimately advances the goal of democratizing advanced medical imaging analysis across diverse healthcare settings and improving patient care through more accessible automated diagnostic tools.
Building on these findings, our future research will pursue several interconnected avenues that leverage the demonstrated advantages of foundational and advanced architectural approaches in data-constrained medical imaging scenarios.
We plan to advance multimodal approaches that integrate language and vision capabilities for enhanced medical image segmentation. This involves developing hybrid architectures that can simultaneously process visual information alongside natural language descriptions. These descriptions will include anatomical structures, pathological findings, and clinical context. Specifically, we will explore how radiological reports, clinical notes, and structured medical vocabularies can be incorporated as additional guidance signals during the segmentation process. This includes investigating prompt engineering techniques that allow clinicians to provide natural language descriptions of regions of interest, pathological characteristics, or diagnostic priorities that can dynamically influence segmentation boundaries and accuracy. We will also develop cross-modal attention mechanisms that enable the model to align textual descriptions with corresponding visual features, potentially improving segmentation precision in ambiguous or challenging cases.
We will explore autonomous quality assessment and correction mechanisms that can iteratively refine segmentation results through self-supervised learning approaches. This involves developing meta-learning frameworks where models can evaluate their own segmentation quality using uncertainty quantification, anatomical plausibility checks, and consistency validation across multiple imaging planes or temporal sequences. These agentic systems will incorporate active learning components that can identify cases requiring human expert review, suggest targeted corrections, and adapt their performance based on accumulated feedback from clinical workflows.
Given the single-center limitation identified in this study, we plan comprehensive multi-center validation studies across different scanner manufacturers, imaging protocols, and patient populations to establish the generalizability of our architectural findings. This will include developing domain adaptation techniques that can rapidly fine-tune models for new clinical sites with minimal local data requirements, investigating federated learning approaches that can benefit from distributed clinical datasets while preserving patient privacy, and establishing standardized evaluation protocols for assessing model performance across diverse clinical environments.
These research directions aim to translate the fundamental architectural insights demonstrated in this study into practical, clinically deployable systems that can improve patient care while addressing the inherent data scarcity challenges in specialized medical imaging domains.