1. Introduction
In this paper, we present a comparative study of three state-of-the-art approaches for bone metastasis segmentation on two-dimensional (2D) computed tomography (CT) slices: the self-configuring nnU-Net pipeline, the self-supervised DINOv3 transformer encoder, and a prompt-free adaptation of MedSAM. Each method was trained and benchmarked under comparable conditions: its configuration was first tuned to maximize our metrics on the validation set, and its final performance was then measured on a held-out test set. On this basis, we assess the respective strengths of the methods in terms of accuracy and robustness to limited annotations, thereby providing a perspective on the suitability of these architectures for clinical-grade slice-wise lesion delineation.
Training a 3D segmentation deep learning (DL) model typically requires large datasets to achieve reasonable performance. Consequently, this study concentrates primarily on processing 2D CT scan slices.
In radiology, a perceptual error refers to a diagnostic error that results in a missed lesion, most notably due to its small size, similar density or intensity to the surrounding tissue, or location in a blind spot or an inconspicuous anatomical site [1]. Missed bone metastases are among the most common perceptual errors, with other examples including missed lymphadenopathy, intra-abdominal solid organ malignancy, lung nodules, and compression fractures [2].
According to the primary site of malignancy, bone metastases most frequently develop from lung, prostate, and breast cancer, with annual incidences of 8.7, 3.2, and 2.4 per 100,000 population, respectively [3]. Other primary sites, such as colorectal cancer or melanoma, have a lower annual incidence, at <1.0 per 100,000 [3]. The majority of bone metastases, around 70% [4], occur in the spinal column, often involving multiple vertebral levels and affecting, in descending order of frequency, the thoracic, lumbar, sacral, and cervical spine [5]. Other common locations include the pelvis, ribs, and the ends of long bones [4].
Given that bone metastases are associated with worse prognosis and lower survival [6] and significantly reduced quality of life [7], timely detection is paramount for effective treatment planning and for preventing further debilitating sequelae such as pathological fractures, spinal cord compression, or hypercalcemia [8].
Advancements in artificial intelligence (AI) models, particularly deep learning, have shown promising potential in aiding radiologists and augmenting the diagnostic workflow by detecting and classifying bone metastatic lesions [9], thus increasing the efficiency of imaging analysis and reducing perceptual errors. The creation of AI models generally requires the transformation of raw imaging data into regions of interest (ROI) that can be labelled and used as training data for the model. This transformation, called segmentation, is the identification, delineation, and grouping of sets of pixels or voxels that form the ROI, in this case, a three-dimensional metastatic bone lesion [10]. Segmentation can be performed manually by an expert and, despite being a time-consuming and arduous process, serves as the reference standard against which model performance is evaluated [10]. To accelerate segmentation, semiautomatic and automatic methods are in continuous development, from simple intensity thresholding to AI-based automatic segmentation models [10]. Different methods notwithstanding, segmentation must precisely and consistently delineate the lesion of interest, producing a high-quality dataset for training effective AI models in radiology [9].
In this study we benchmark three state-of-the-art segmentation frameworks (nnU-Net, DINOv3, and a prompt-free variant of MedSAM) on 2D CT slices of radiologic bone metastasis lesions [11,12,13]. nnU-Net remains a well-established framework for medical image segmentation because its self-configuring pipeline automatically adapts preprocessing, architecture depth, and training hyperparameters to the specific data distribution, thereby delivering robust performance without extensive manual tuning. DINOv3, originally devised for self-supervised visual representation learning, offers powerful generic feature encoders that can be fine-tuned on limited annotated medical data. Its transformer-based backbone captures long-range contextual cues that are valuable for delineating heterogeneous metastatic patterns across slices [12,14]. Prompt-free MedSAM extends the foundation of the Segment Anything Model to the medical domain, eliminating the need for interactive prompts while leveraging large-scale pretraining on diverse anatomical structures. This reduces annotation overhead and enables rapid deployment on new datasets [13,15]. Since our goal was to segment CT scans, and not merely to use the model for annotating the training set, we employ the prompt-free version of MedSAM.
By evaluating these three approaches side by side, we aim to elucidate trade-offs between fully automated pipeline optimization (nnU-Net), flexible self-supervised feature learning (DINOv3), and prompt-free, foundation-model-driven segmentation (MedSAM) for the challenging task of bone metastasis CT slice segmentation.
While nnU-Net, DINOv3, and prompt-free MedSAM represent strong baselines for 2D CT slice segmentation, emerging approaches could further advance performance. The literature suggests that hybrid architectures combining convolutional encoders with vision transformer backbones (such as ConvNeXtUNet hybrids) may capture both local texture details and global contextual relationships more effectively than pure CNN or transformer models alone [16]. Additionally, diffusion-based generative segmentation networks (e.g., SegDiff) have shown promise in producing high-resolution masks with calibrated uncertainty, which is valuable for clinical decision-making [17]. Finally, multimodal self-supervised frameworks that jointly learn from paired imaging modalities (CT + PET) or incorporate radiomics-derived priors can enrich feature representations beyond what single-modality models achieve, potentially yielding superior delineation of bone metastasis lesions on individual slices.
Beyond task-specific architectures, there is a growing interest in radiology foundation models that aim to provide a single, generalist backbone for a wide variety of diagnostic and segmentation tasks across modalities and anatomical regions. Such models are typically pre-trained on large, heterogeneous collections of medical images and associated labels or reports, and then adapted to downstream tasks with comparatively modest amounts of annotated data. In principle, this paradigm could be particularly advantageous for bone metastasis analysis, where rare lesion phenotypes, protocol variability, and evolving treatment regimens pose challenges for conventional supervised pipelines. Our comparison of nnU-Net, DINOv3, and MedSAM can therefore be viewed as a case study in how current foundation-model approaches perform against a strong, carefully configured task-specific baseline in a clinically demanding setting.
This paper makes the following contribution: we systematically compare three frameworks (nnU-Net, DINOv3, and prompt-free MedSAM) for metastatic lesion segmentation on CT slices.
2. Materials and Methods
This section describes the experimental design. We conduct a comparative analysis of the training pipelines for three leading segmentation architectures (nnU-Net, DINOv3, and the prompt-free variant of MedSAM) applied to CT slice-based metastatic lesion delineation. All models are trained under comparable conditions: each method's performance was first maximized on our metrics using a validation set drawn from the same patient cohort, and the final performance was then measured on a held-out test set. Two complementary overlap metrics are used: the Intersection over Union (IoU) and the Dice similarity coefficient (DSC). For a single prediction the two metrics are monotonically related, but IoU penalizes partial overlap more strongly than Dice, so reporting both gives a fuller picture of segmentation fidelity when scores are averaged over slices. These quantitative results form the basis for our subsequent conclusions regarding the relative robustness, convergence behavior, and clinical applicability of the examined frameworks. Additionally, the models are evaluated using the Normalized Hausdorff Distance (NHD). The Hausdorff distance measures the largest distance from a point on one shape to the closest point on the other and therefore reflects the worst-case boundary disagreement between the two objects. It was normalized relative to image dimensions to ensure comparability across slices and to mitigate the scale dependency inherent to absolute boundary distance metrics.
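The three evaluation metrics can be illustrated with a minimal sketch. The choice of the image diagonal as the normalization constant, the use of all foreground pixels rather than boundary pixels only, and the handling of empty masks are assumptions made for illustration; the text above only states that the Hausdorff distance was normalized relative to image dimensions.

```python
import numpy as np

def dice_score(pred, ref):
    """Dice similarity coefficient between two binary masks."""
    pred, ref = np.asarray(pred, dtype=bool), np.asarray(ref, dtype=bool)
    denom = pred.sum() + ref.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(pred, ref).sum() / denom)

def iou_score(pred, ref):
    """Intersection over Union between two binary masks."""
    pred, ref = np.asarray(pred, dtype=bool), np.asarray(ref, dtype=bool)
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, ref).sum() / union)

def normalized_hausdorff(pred, ref):
    """Symmetric Hausdorff distance between foreground pixels,
    divided by the image diagonal (assumed normalization)."""
    pred, ref = np.asarray(pred, dtype=bool), np.asarray(ref, dtype=bool)
    a = np.argwhere(pred).astype(float)
    b = np.argwhere(ref).astype(float)
    if len(a) == 0 or len(b) == 0:
        return float("nan")  # undefined when one mask is empty
    # Brute-force pairwise distances; adequate for small masks only.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    hausdorff = max(d.min(axis=1).max(), d.min(axis=0).max())
    return float(hausdorff / np.hypot(*pred.shape))
```

For a single pair of masks, the two overlap metrics satisfy IoU = DSC / (2 − DSC), which is why they rank individual predictions identically and differ only after averaging.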
Accurate delineation of metastatic bone lesions is clinically important for disease staging, radiotherapy planning, and treatment response monitoring. Automated segmentation methods may reduce annotation burden and improve reproducibility in large-scale oncological workflows.
2.1. CT Scan Dataset for Bone Metastasis Segmentation
A 2D dataset was constructed from the CT scans of 88 patients, each scan having been annotated by medical experts in radiology. The CT slices were then processed using a bone window setting (window level ≈ 400 HU, window width ≈ 1800 HU) to enhance skeletal contrast for the computational pipeline, and the windowed intensities were rescaled to the range 0 to 255 to ensure a suitable dynamic range for algorithmic processing. The annotations were performed in 3D Slicer (ver. 5.10.0) by four annotators under the supervision of a senior radiologist, who ensured consensus in ambiguous cases. The annotation procedure was guided by established clinical radiology practices. A multiplanar approach was employed, incorporating sagittal, coronal, and axial views, with the axial plane serving as the definitive reference in cases of uncertainty.
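The windowing and rescaling step can be sketched as follows. This is a minimal illustration that assumes the input is already expressed in Hounsfield units and that a linear mapping with rounding to 8-bit integers is used; the function name and the rounding choice are hypothetical and not taken from the study's code.

```python
import numpy as np

def apply_bone_window(hu, level=400.0, width=1800.0):
    """Clip HU values to the bone window and rescale linearly to 0..255."""
    lower = level - width / 2.0   # -500 HU for the stated window
    upper = level + width / 2.0   # 1300 HU for the stated window
    clipped = np.clip(np.asarray(hu, dtype=np.float32), lower, upper)
    scaled = (clipped - lower) / (upper - lower) * 255.0
    return np.round(scaled).astype(np.uint8)  # assumption: 8-bit output
```

With the stated window, all values at or below −500 HU map to 0 and all values at or above 1300 HU map to 255, so soft tissue is compressed into the lower part of the range while cortical and sclerotic bone remains distinguishable.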
After the initial preprocessing, the model training stage employed 11,006 image–label pairs, which were divided approximately 70%/17%/13% into 7630 training samples, 1908 validation samples, and 1468 held-out test samples. The train/validation split was performed at the slice level within the 78-patient development cohort (of 88 patients in total). To assess generalization beyond the development cohort and to mitigate any residual risk of patient-level information leakage, the held-out test set comprised scans from a separate group of previously unseen patients and was not used during training or model selection. The bone-window intensity normalization (window level ≈ 400 HU, window width ≈ 1800 HU) was applied slice-wise. Expert radiologist annotations consolidated lytic, sclerotic, and mixed subtypes into a single positive class for binary segmentation. Nevertheless, the distribution of the original lesion subtypes is still reported to characterize the underlying dataset composition and lesion heterogeneity. Providing these subtype statistics improves transparency regarding potential class imbalance and offers additional context for interpreting segmentation performance, especially given the differing attenuation profiles and boundary characteristics associated with lytic and sclerotic lesions.
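A patient-disjoint hold-out of the kind described for the test set can be sketched as below. The function name, the seed, and the idea of specifying the test share as a fraction of patients are hypothetical; the study reports its split as a share of slices, and the actual assignment of patients is not given in the text.

```python
import random
from collections import defaultdict

def patient_level_split(slice_patient_ids, test_fraction=0.13, seed=0):
    """Return (dev_indices, test_indices) so that no patient is in both."""
    by_patient = defaultdict(list)
    for index, patient_id in enumerate(slice_patient_ids):
        by_patient[patient_id].append(index)

    patients = sorted(by_patient)            # deterministic starting order
    random.Random(seed).shuffle(patients)    # reproducible shuffle
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    dev, test = [], []
    for patient_id, indices in by_patient.items():
        (test if patient_id in test_patients else dev).extend(indices)
    return sorted(dev), sorted(test)
```

Splitting by patient rather than by slice matters because neighboring slices of the same lesion are highly correlated; a slice-level split can therefore overestimate performance on unseen patients.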
Both the nnU-Net model and the prompt-free variant of MedSAM were trained at a resolution of 1024 × 1024 pixels. For the DINOv3 model, the images were provided at 512 × 512 pixels, which matches the original data resolution. Upscaling does not add new information to the image; however, nnU-Net and MedSAM were retained at their recommended input resolution of 1024 × 1024, which is the configuration under which their published performance characteristics were established, while DINOv3 was run at 512 × 512. This decision preserves each framework's intended operating point at the cost of an input-size mismatch, which is further discussed as a limitation in Section 4.
Figure 1 illustrates the mean pixel value distributions of lytic and sclerotic lesions across slices. These values are proportional to Hounsfield units following bone windowing and value normalization. The class distribution is as follows: Lytic (L): 28.07%; Sclerotic (S): 15.11%; Mixed (M): 56.83%.
2.2. Training Conditions for CT Segmentation Models
The nnU-Net model was trained using the standard nnU-Net methodology with the following configuration: 500 epochs without pretrained weights (more epochs might be required to reach the optimum) and an initial learning rate of 0.01, with a cosine annealing scheduler replacing the standard polynomial decay. Training used the nnUNetTrainerV2 trainer with the 2D configuration and the framework's automatic preprocessing and augmentation strategies. The dataset was split approximately 70%/17%/13% into training, validation, and test sets (the framework automatically handles the training/validation split and ensures that no sample appears in both sets).
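The difference between the two learning-rate schedules can be illustrated with a short sketch using the stated initial learning rate of 0.01 and 500 epochs. The minimum learning rate of 0 and the polynomial exponent of 0.9 are assumptions for illustration; neither value is stated in the text.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=500, base_lr=0.01, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr (min_lr is assumed)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)

def polynomial_decay_lr(epoch, total_epochs=500, base_lr=0.01, exponent=0.9):
    """Polynomial decay for comparison (the exponent is an assumption)."""
    return base_lr * (1.0 - epoch / total_epochs) ** exponent

if __name__ == "__main__":
    for epoch in (0, 125, 250, 375, 500):
        print(epoch,
              round(cosine_annealing_lr(epoch), 5),
              round(polynomial_decay_lr(epoch), 5))
```

Compared with polynomial decay, the cosine schedule keeps the learning rate higher during the first half of training and lowers it more sharply toward the end.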
The DINOv3 (ViT-Large) model was obtained by fine-tuning a pre-trained semantic segmentation model with 768-dimensional features (approximately 1.5 GB model size) on medical imaging data for binary segmentation over 50 epochs. The architecture consisted of the DINOv3-ViT-Large backbone with a custom decoder head (three Conv2d layers with BatchNorm and ReLU). It was trained using the AdamW optimizer (learning rate 1 × 10⁻⁴, weight decay 0.01), a combined Dice + BCE loss, and cosine annealing learning rate scheduling with a batch size of 4. Throughout training, Dice scores and IoU were tracked for both classes; the best model was saved based on the validation Dice score, and periodic checkpoints were created every 5 epochs.
The prompt-free MedSAM model was trained in two stages: a pretrained MedSAM ViT-B model was first fine-tuned, and a prompt-free variant was then trained without manual prompts through a 3-layer CNN adapter module. Both stages used the same 70%/17%/13% train/validation/test split and a combined Dice + BCE loss, and reached training DSC values above 0.85. The model was trained using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 0.01. Unlike for the DINOv3 model, the CT slices were resized to 1024 × 1024 pixels.
For optimization of the DINOv3 and prompt-free MedSAM segmentation frameworks, a composite loss function combining Dice loss and Binary Cross-Entropy (BCE) loss was employed. The total loss is defined as:
$$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{Dice}} + \lambda_2 \mathcal{L}_{\mathrm{BCE}},$$
where $\lambda_1$ and $\lambda_2$ denote weighting coefficients for the Dice and BCE components, respectively.
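The Dice loss is formulated as:
$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} p_i g_i + \varepsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + \varepsilon},$$
where $p_i$ represents the predicted probability for pixel $i$, $g_i$ denotes the corresponding ground-truth label, $N$ is the total number of pixels, and $\varepsilon$ is a small constant introduced for numerical stability.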
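The Binary Cross-Entropy loss is defined as:
$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ g_i \log p_i + (1 - g_i) \log (1 - p_i) \right].$$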
The combined objective function leverages the complementary properties of region-overlap optimization and pixel-wise classification accuracy. Dice loss improves robustness to foreground–background imbalance, while BCE stabilizes gradient propagation during optimization.
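A combined Dice + BCE objective of this kind can be sketched in a few lines. The sketch assumes the standard soft Dice form and equal weights of 1.0 for both components; the weighting coefficients actually used are not stated in the text, and the probability clipping is added here only to keep the logarithm finite.

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """Soft Dice loss over predicted probabilities p and binary labels g."""
    p = np.ravel(p).astype(float)
    g = np.ravel(g).astype(float)
    return float(1.0 - (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps))

def bce_loss(p, g, clip=1e-7):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    p = np.clip(np.ravel(p).astype(float), clip, 1.0 - clip)
    g = np.ravel(g).astype(float)
    return float(-np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p)))

def combined_loss(p, g, lambda_dice=1.0, lambda_bce=1.0):
    """Weighted sum of Dice and BCE losses (weights are assumed values)."""
    return lambda_dice * dice_loss(p, g) + lambda_bce * bce_loss(p, g)
```

For a maximally uncertain prediction (all probabilities equal to 0.5) on a half-foreground mask, the Dice term is about 0.5 and the BCE term equals log 2, which shows that the two components operate on comparable scales when equally weighted.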
During the experiments, the nnU-Net framework was used with its default augmentation and data preparation configurations. For the MedSAM and DINOv3 models, data augmentation was applied probabilistically and included affine transformations (probability p = 0.7) with random rotations (−20° to +20°) and scaling (0.9–1.1), as well as elastic deformations (p = 0.5) using Gaussian-smoothed displacement fields (α = 50, σ = 8). Additionally, color jitter (p = 0.8) was applied to images only, followed by clipping to the valid range. During inference, images were scaled to [0, 1] by division by 255.0 and then normalized using the mean and standard deviation according to:
$$\hat{x} = \frac{x / 255 - \mathrm{mean}}{\mathrm{std}}.$$
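The probabilistic augmentation settings and the inference-time normalization can be sketched as follows. The function names are hypothetical, the sketch only samples the augmentation parameters rather than applying the transforms, and the mean and standard deviation of 0.5 are placeholder values because the actual statistics are not given in the text.

```python
import random
import numpy as np

def sample_augmentation_params(rng):
    """Draw one random augmentation configuration using the stated settings."""
    params = {}
    if rng.random() < 0.7:  # affine transform, p = 0.7
        params["affine"] = {
            "rotation_deg": rng.uniform(-20.0, 20.0),
            "scale": rng.uniform(0.9, 1.1),
        }
    if rng.random() < 0.5:  # elastic deformation, p = 0.5
        params["elastic"] = {"alpha": 50.0, "sigma": 8.0}
    if rng.random() < 0.8:  # color jitter on the image only, p = 0.8
        params["color_jitter"] = True
    return params

def normalize_for_inference(image_u8, mean=0.5, std=0.5):
    """Scale 8-bit intensities to [0, 1], then standardize.
    The mean and std defaults are placeholders, not the study's values."""
    x = np.asarray(image_u8, dtype=np.float32) / 255.0
    return (x - mean) / std
```

Geometric transforms (affine, elastic) have to be applied identically to the image and its mask, whereas intensity transforms such as color jitter apply to the image only, as stated above.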
To support repeatability, random seeds were fixed for dataset shuffling and weight initialization, and the exact framework versions were recorded for each model (nnU-Net v2 with PyTorch 2.8, MedSAM release corresponding to the ViT-B checkpoint, and DINOv3 ViT-Large pretrained weights). Methodologically, we first maximized each method's performance on our metrics (Dice score and Hausdorff distance), even though this required different configurations for each method. Subsequently, we measured the final performance of all methods on a held-out independent test set. We therefore describe the conditions as "comparable" rather than "identical", and we discuss the implications of these choices in Section 4.
An NVIDIA A100 GPU was used for the more computationally intensive tasks, together with a smaller NVIDIA GeForce RTX 2070 GPU.
To further support methodological transparency and reproducibility, we deliberately relied on publicly available implementations and widely adopted deep learning frameworks for all three architectures, documenting framework versions and training configurations in detail. The use of standardized toolchains such as nnU-Net v2 and open-source MedSAM and DINOv3 repositories facilitates independent replication of our experiments and enables future studies to build on the present work with modified loss functions, architectural variants, or alternative optimization strategies. This design is aligned with recent calls for more rigorous validation of segmentation models, including clear reporting of data splits, hyperparameters, and post-processing steps, especially when benchmarking against strong baselines such as nnU-Net on 3D and 2D medical imaging tasks.
The described experimental protocol establishes a comparative framework for evaluating modern segmentation architectures on metastatic bone lesion delineation. The resulting quantitative and qualitative findings are presented in Section 3, where the segmentation performance of nnU-Net, DINOv3, and the prompt-free MedSAM framework is compared using the evaluation metrics described above.
3. Results
This section presents the results obtained with all three frameworks: nnU-Net, DINOv3, and the prompt-free variant of MedSAM.
The nnU-Net model achieved a mean validation DSC of 0.6996 (Figure 2).
The DINOv3 model converged within 50 epochs (Figure 3), requiring fewer training epochs than nnU-Net.
The DINOv3 model reached a validation DSC of approximately 0.5272 (Figure 3). The lower Dice performance of DINOv3 suggests that the pretrained transformer features, although highly expressive for natural-image representations, may not fully capture the fine-grained boundary characteristics of metastatic bone lesions in CT imaging without more extensive domain-specific adaptation. Nevertheless, the model demonstrated rapid convergence and comparatively stable optimization behavior.
As shown in Figure 4, the prompt-free MedSAM converged rapidly, similar to DINOv3, and achieved a validation DSC of 0.6968, comparable to the nnU-Net model.
Figure 5 presents qualitative examples of nnU-Net segmentation overlaid on the reference standard annotations provided by clinicians. For a more complete qualitative comparison across all three frameworks, Figure 5 also includes side-by-side predictions of MedSAM, DINOv3, and nnU-Net on the same slices (e.g., small lytic lesions, lesions adjacent to cortical bone), together with representative failure cases that highlight where each method under- or over-segments relative to the expert annotation.
The training dynamics differed substantially across frameworks. nnU-Net required a larger number of epochs to achieve stable convergence, reflecting the optimization characteristics of fully supervised convolutional segmentation architectures trained from scratch. In contrast, both DINOv3 and MedSAM converged considerably faster due to initialization from large-scale pretrained transformer representations. However, rapid convergence did not necessarily correspond to superior segmentation fidelity, particularly in the case of DINOv3, where faster optimization was accompanied by lower overlap accuracy on both validation and held-out test sets.
Despite lower segmentation accuracy, DINOv3 achieved substantially faster convergence, indicating potential utility in resource-constrained training scenarios. nnU-Net prioritized final segmentation fidelity at the expense of longer optimization time.
A summary of model performance is given in Table 1. It includes estimates of the Normalized Hausdorff Distance (NHD; lower is better) and the Dice scores considered previously. Confidence intervals (CI) for the mean values are reported to characterize the variability of the performance estimates.
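One common way to obtain a confidence interval for a mean per-slice score is the normal approximation sketched below. The method and the 95% level (z = 1.96) are assumptions for illustration; the text does not state how the reported intervals were computed.

```python
import math
import statistics

def mean_confidence_interval(scores, z=1.96):
    """Return (mean, lower, upper) using a normal approximation.
    z = 1.96 corresponds to an assumed 95% confidence level."""
    n = len(scores)
    mean = statistics.fmean(scores)
    if n < 2:
        return mean, mean, mean  # no spread can be estimated from one value
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * standard_error, mean + z * standard_error
```

Because slices from the same patient are correlated, an interval computed over slices in this way is likely to be narrower than one computed over patients; a patient-level bootstrap would be a more conservative alternative.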
The NHD quantifies shape similarity between predicted and reference segmentations, with lower values indicating better boundary alignment.
A noticeable performance decrease between the validation and held-out test sets was observed for all models, indicating reduced generalization to previously unseen patient data. Among the evaluated frameworks, nnU-Net demonstrated the smallest degradation in Dice score, suggesting superior robustness and better adaptation to dataset-specific anatomical variability. In contrast, MedSAM and DINOv3 exhibited larger reductions in overlap performance, potentially reflecting sensitivity to distributional shifts and lesion appearance heterogeneity.
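The size of the validation-to-test degradation can be made explicit with a short calculation from the Dice scores reported in this paper (validation values from this section, held-out test values as summarized in Section 4). The helper function is only an illustration of the arithmetic.

```python
# (validation Dice, held-out test Dice) as reported in the paper
reported_dice = {
    "nnU-Net": (0.6996, 0.6849),
    "MedSAM": (0.6968, 0.6280),
    "DINOv3": (0.5272, 0.4480),
}

def dice_drop(validation, test):
    """Return the absolute and relative (percent) drop in Dice score."""
    absolute = validation - test
    return absolute, 100.0 * absolute / validation

if __name__ == "__main__":
    for name, (validation, test) in reported_dice.items():
        absolute, relative = dice_drop(validation, test)
        print(f"{name}: absolute drop {absolute:.4f}, relative drop {relative:.1f}%")
```

This gives absolute drops of 0.0147, 0.0688, and 0.0792 for nnU-Net, MedSAM, and DINOv3, corresponding to relative drops of about 2.1%, 9.9%, and 15.0%, respectively.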
In addition to the aggregate metrics summarized in Table 1 and Table 2, qualitative assessment of representative cases highlighted complementary strengths and weaknesses of the evaluated frameworks. nnU-Net tended to produce smooth, contiguous masks that closely followed lesion boundaries, particularly for larger mixed-type metastases, whereas MedSAM occasionally generated slightly over-segmented regions around highly sclerotic foci, reflecting its strong sensitivity to high-contrast structures. DINOv3, by contrast, showed comparatively good localization of the general lesion area but sometimes failed to capture fine cortical interruptions or very small lytic defects, consistent with its lower Dice scores despite competitive Hausdorff distances. These qualitative findings support the quantitative results and underscore the importance of combining numerical metrics with visual inspection when assessing segmentation performance for heterogeneous bone metastases.
From a clinical perspective, the superior Dice performance and lower generalization degradation of nnU-Net suggest improved suitability for radiotherapy planning and longitudinal disease monitoring workflows, where consistent volumetric delineation is essential. Conversely, the lower overlap accuracy observed for DINOv3 may limit direct deployment without further domain-specific adaptation or postprocessing refinement.
Overall, nnU-Net achieved the highest and most consistent segmentation performance across both validation and independent held-out test datasets, demonstrating superior robustness and generalization capability for metastatic bone lesion delineation. MedSAM achieved competitive validation performance with substantially faster convergence, while DINOv3 exhibited weaker overlap accuracy despite favorable optimization characteristics and competitive boundary localization metrics.
4. Discussion and Conclusions
In this study, we compared the foundation models (DINOv3 and prompt-free MedSAM) with the traditional nnU-Net framework under the previously outlined conditions. On the validation set, fine-tuned DINOv3 delivered a Dice score of ~0.53, versus ~0.70 for both nnU-Net and MedSAM. DINOv3 and nnU-Net provided the best Hausdorff distances, which are nearly identical at 0.0425 and 0.0424, respectively. On the held-out test set, the MedSAM, DINOv3, and nnU-Net models achieved Dice scores of 0.6280, 0.4480, and 0.6849 and normalized Hausdorff distances of 0.1008, 0.1019, and 0.0473, respectively.
This finding aligns with some reports favoring CNN-based architectures like nnU-Net for 3D medical image segmentation [18], yet it contrasts with emerging evidence that self-supervised vision transformers (ViTs) can offer superior feature robustness in various medical imaging scenarios [19]. nnU-Net remains a strong baseline due to its self-configuring adaptability [20].
The comparison was performed for 2D bone metastasis segmentation on a CT slice dataset labelled by medical experts, a task complicated by the textural heterogeneity of lytic and sclerotic lesions. Our study indicates that, for bone metastasis segmentation, the AI model architecture can still be effectively based on the nnU-Net framework. Furthermore, general-purpose segmentation models often struggle with the specific morphological irregularities of metastatic bone disease compared to primary bone tumors [21]. Nevertheless, the utility of deep learning for automated bone segmentation remains high, as evidenced by success in related tasks such as pelvic bone metastasis segmentation [22] and PET/CT quantification [23], suggesting that foundation models like DINOv3 may bridge the gap between specialized performance and generalizability in the future.
In the future, to identify the most appropriate deep learning model, the comparison could be expanded to include multimodal architectures. Specifically, the integration of text–visual aligned models such as CLIP variants or RadFM (the radiology foundation model with CT support) warrants exploration, as these models can leverage diagnostic reports to guide segmentation attention [24]. Recent work has demonstrated that multimodal data-driven approaches can outperform unimodal segmentation in complex metastatic scenarios by incorporating clinical context [25]. Additionally, with a larger training dataset, we could explore deep learning models for 3D CT segmentation, addressing the inherent limitations of slice-wise analysis [26]. The transition to 3D remains challenging due to the computational cost and data requirements, but improvements in self-supervised learning for 3D medical volumes [27] and multitask frameworks [28] offer promising pathways. Ultimately, validating these models on external multicenter datasets will be crucial to confirm their clinical safety and utility in routine oncology workflows [29].
A complementary direction is to explore 3D and multi-view architectures that can exploit the full volumetric context of metastatic lesions, rather than relying solely on slice-wise analysis. Transformer-based 3D encoders such as UNETR and self-supervised Swin Transformer variants have demonstrated strong performance on volumetric organ and tumor segmentation, suggesting that similar designs could better capture the complex spatial extent of spinal and pelvic metastases while preserving computational efficiency through patch-based processing. Coupling such models with multitask learning objectives—for example, jointly predicting lesion segmentation, subtype, and simple clinical endpoints—may further improve robustness and clinical utility, as shown in recent work on multi-output histology and PET/CT pipelines. However, realizing these benefits will require larger, multi-centre datasets and careful attention to the practical constraints of clinical deployment, including inference time, hardware availability, and integration with existing radiology workflows.
Several limitations of the present work should be acknowledged. First, although each framework was run at its recommended input resolution, the resulting mismatch (1024 × 1024 for nnU-Net and MedSAM versus 512 × 512 for DINOv3) is a potential confound. Because upscaling the 512 × 512 source data to 1024 × 1024 does not introduce new information, a controlled comparison at a single shared resolution remains a relevant follow-up. Second, the training/validation split (70%/17% of the data) was performed at the slice level, so slices from the same patient could occur in both partitions; to mitigate the resulting leakage risk, we additionally evaluated generalization on a patient-disjoint held-out test cohort (13% of the data). Third, the three lesion subtypes (lytic, sclerotic, mixed) were consolidated into a single positive class to enlarge the effective sample per class; we did not ablate per-subtype segmentation heads, which could reveal class-specific failure modes. Finally, the dataset originates from a single institution, so external, multi-centre validation is required before clinical deployment can be considered. These limitations frame the interpretation of the reported Dice and NHD values and motivate the future-work directions described above.
In summary, the nnU-Net framework provides robust segmentation performance and serves as a strong baseline for 2D slice-wise bone metastasis delineation even with limited annotated data. These findings highlight the continued effectiveness of specialized medical segmentation architectures while also demonstrating the potential and current limitations of large pretrained transformer-based frameworks in CT lesion segmentation tasks.