1. Introduction
Spine-related diseases are highly prevalent in clinical practice, and accurately obtaining the anatomical morphology and identity (ID) of each vertebra is crucial for diagnosis and surgical planning [
1,
2]. CT is widely used owing to its high spatial resolution for osseous structures. However, manual vertebra-by-vertebra annotation is time-consuming and subject to inter-observer variability, motivating research into automated vertebral segmentation and identification.
Deep learning [
3], especially encoder–decoder models, such as U-Net, has advanced medical image segmentation [
4], but performance is often limited by scarce annotations. The release of the VerSe benchmark datasets has enabled standardized evaluation and methodological iteration [
5]. Multi-stage pipelines built on VerSe (e.g., first localizing/identifying vertebrae, then segmenting each vertebra) have shown promising results. However, treating “segmentation” and “identification” as separate steps limits cross-task information sharing [
6,
7], and global numbering shifts can still occur under limited fields of view or in complex cases.
Earlier studies commonly decoupled segmentation and identification. Along the “segmentation-only” line, traditional approaches (statistical shape models [
8,
9,
10], thresholding/active contours, etc.) and deep networks (FCN [
4], U-Net [
4], V-Net [
11], and iterative frameworks [
10,
12,
13]) continuously improved the voxel-level accuracy. Along the “identification-only” line, typical strategies attached classification/ranking heads to already segmented regions or high-level features [
8] or iteratively inferred vertebral indices along the spinal axis [
13,
14,
15]. While foundational, these separate pipelines allow segmentation errors to propagate to identification and make it difficult to use identification priors to enforce segmentation coherence.
More recent work has shifted toward performing vertebral segmentation and identification within a single model, leveraging shared representations for mutual benefit. Representative studies such as SIIL output voxel-wise masks and vertebral IDs in an end-to-end fashion, reducing error propagation arising from segmentation–identification inconsistencies [
16]. Other graph-optimization-based iterative paradigms unify localization, segmentation, and identification within a single iterative solver using anatomical consistency constraints to automatically correct labeling errors and highlight the advantages of “joint modeling + structural priors” [
15]. In parallel, discriminative representation learning for identification (e.g., contrastive learning and uncertainty modeling) has been integrated into joint frameworks to improve separability between adjacent vertebrae and to stabilize the final ID sequence [
17]. Overall, integrated joint segmentation–identification modeling appears to be a more suitable path to enhanced robustness.
Moreover, the high cost of medical image annotation limits large-scale supervised training. Semi-supervised learning (SSL) mitigates the scarcity of labels by incorporating unlabeled data into training [
18]. Typical strategies include Mean Teacher consistency [
18], uncertainty-aware pseudo-label filtering/weighting [
19,
20], transformation-consistency self-ensembling [
21], dual-task consistency [
22], and more general approaches, such as UniMatch v2 [
23]. In general, SSL has proven effective for alleviating annotation scarcity in medical imaging; the key lies in designing task-appropriate consistency constraints and pseudo-label strategies. These strategies are compatible with joint modeling: within a shared-encoder multi-task framework, consistency regularization and high-confidence pseudo-labels can jointly regularize both the segmentation and identification branches, improving overall consistency and data efficiency.
Despite recent progress, vertebra segmentation and identification in routine CT still face two practical failure modes: (i) partial field-of-view (FOV) scans may miss the upper or lower vertebrae, inducing a global index shift, and (ii) the high structural similarity between adjacent vertebrae often causes swaps, merges, or fragmentation. These observations motivate a unified framework that jointly reasons about 3D anatomy and sequential continuity while reducing reliance on dense annotations.
Different from prior joint segmentation–identification approaches that primarily rely on anatomical priors or deterministic indexing, we introduce a semi-supervised teacher–student objective tailored to the joint setting: segmentation and identification are regularized simultaneously via complementary consistency terms. This design not only improves segmentation under limited labels but also mitigates identification failures, such as global index shift and adjacent-vertebra confusion.
To address the above challenges, we propose an end-to-end multi-task framework that performs vertebra segmentation and identification within a single model, leveraging a shared encoder for cross-task synergy and incorporating semi-supervised learning to reduce reliance on manual annotations. Our main contributions are as follows:
Dual-branch 3D U-Net with Mamba and 3D convolutional block attention module (CBAM). We design a dual-branch 3D U-Net that integrates Mamba modules [
24] and 3D-CBAM [
25] to model the sequential dependencies of vertebrae and to accomplish segmentation and identification within a unified architecture. A shared encoder serves as the core feature extractor: the encoder representation feeds (i) a segmentation branch to produce voxel-wise vertebral masks and (ii) an identification branch to predict anatomical indices from global features. Through shared representations and parallel optimization, the two branches mutually reinforce each other, enabling integrated, synergistic vertebra segmentation–identification.
Unified semi-supervised objective for joint segmentation–identification. Building on the UniMatch v2 semi-supervised semantic segmentation framework [
23], we formulate a unified SSL objective tailored to our two-branch setting. For each unlabeled CT volume, we generate weakly and strongly augmented views and enforce teacher–student consistency on both branches [
18,
23]. In the segmentation branch, high-confidence pseudo-masks are obtained by thresholding the teacher’s foreground probabilities, and binary cross-entropy is computed only on voxels that pass confidence filtering. In the identification branch, temperature-calibrated teacher heatmaps provide pseudo-classes, and a weighted cross-entropy is applied at high-confidence locations. Combined with uncertainty-guided filtering [
19,
20], class-frequency reweighting, and a linear ramp-up of the consistency weight, this design markedly improves the utilization of unlabeled data and enhances cross-task coherence and overall performance. Notably, unlike UniMatch v2 (which addresses only single-task segmentation), our approach extends consistency training to a dual-task setting with branch-specific pseudo-labeling strategies, representing a novel application of SSL to joint segmentation–identification.
The remainder of this paper is structured as follows:
Section 2 presents materials and methods, including the overall framework, datasets, network architecture, loss functions, data augmentation, implementation details, and experimental design.
Section 3 reports experimental results.
Section 4 provides the discussion.
Section 5 concludes the study and outlines future work.
2. Materials and Methods
2.1. Overall Framework
We propose a unified end-to-end framework that performs voxel-level segmentation and instance-level identification of vertebrae within a single 3D convolutional neural network. The overall pipeline is illustrated in
Figure 1. Given a spinal CT sub-volume as the input, a deep 3D convolutional encoder first extracts multi-scale features, which are then fed in parallel into two decoder branches: (i) a segmentation branch that progressively upsamples features to reconstruct a vertebral probability mask with the same spatial dimensions as the input and (ii) an identification branch that processes high-level global features from the encoder to predict the anatomical ID of each vertebra within the sub-volume. The final output combines the vertebral masks and their corresponding predicted IDs to form complete labeled results. Note that the dual-branch 3D U-Net is trained under our semi-supervised learning framework, in which both the teacher and student models share the same dual-branch 3D U-Net architecture.
2.2. Dataset
We conduct experiments on VerSe [
5], a publicly available benchmark dataset for vertebral segmentation and identification, comprising VerSe 2019 and VerSe 2020. VerSe 2019 contains 160 CT scans (80 training, 40 validation, and 40 test), with 1725 annotated vertebrae spanning C1–L5. VerSe 2020 expands the scale to 319 CT scans (113 training, 103 validation, and 103 test), with 4141 annotated vertebrae from C1 to L5. Details are summarized in
Table 1.
The native CT volumes have isotropic or near-isotropic voxel spacing of approximately 0.5–1.5 mm. To ensure consistent spatial scale and mitigate scanner-dependent resolution differences, all volumes are resampled to 1.0 × 1.0 × 1.0 mm (using trilinear interpolation). A unified normalization pipeline is then applied. Because a full spinal CT is large (on average ≈ 512 × 512 × 300 voxels) and cannot be processed by a 3D network in a single pass due to GPU memory limits, we extract sub-volumes (patches) via random/sliding-window cropping along the axial direction—random during training and sliding at inference—followed by isotropic scaling. Symmetric padding is applied along the depth dimension as needed to match a fixed input size of 128 × 64 × 64 voxels. Finally, intensities are clipped to [−500, 1500] Hounsfield units (HUs) and linearly normalized to [0, 1].
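For concreteness, the intensity step of this pipeline can be sketched as follows (a minimal NumPy sketch using the stated [−500, 1500] HU window; the function name is illustrative):

```python
import numpy as np

def preprocess_intensities(volume_hu: np.ndarray,
                           hu_min: float = -500.0,
                           hu_max: float = 1500.0) -> np.ndarray:
    """Clip a CT volume to the bone-focused HU window and rescale to [0, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)
```

Resampling and patch extraction are omitted here; the sketch only covers the final clipping-and-normalization stage applied identically to every sub-volume.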
2.3. Network Architecture
As shown in
Figure 1, we use a dual-head 3D U-Net with a shared encoder and two task-specific decoders and insert a Mamba module [
24] at the bottleneck to model long-range cranio–caudal dependencies.
The encoder has four stages. Each stage contains two 3D convolutions followed by instance normalization and LeakyReLU and applies ×2 downsampling to expand the receptive field while preserving multi-scale skip features. At the bottleneck, Mamba aggregates distant context without the quadratic cost of self-attention. Its linear-time complexity (in sequence length) and lower memory footprint make it suitable for 3D volumes, where capturing cranio–caudal continuity is critical for cross-vertebra consistency and robust indexing, especially under partial FOV/truncated scans that can induce global ID shifts.
We adopt separate decoders to reduce negative transfer and allow task-specific optimization. The segmentation decoder performs three successive upsampling steps with skip-feature fusion to output a voxel-wise binary mask at full resolution. Deep-supervision heads at 1/2 and 1/4 resolutions improve the gradient flow, stabilize small-batch 3D training, and enhance the boundary details.
The identification decoder uses a lighter upsampling path and outputs, at 1/4 resolution, multi-channel confidence/heatmaps for vertebral IDs. Predicting at 1/4 resolution balances localization granularity and compute/memory cost while retaining sufficient spatial detail for centroid-based assignments. For VerSe-style reporting, we follow the C1–L5 protocol (24 classes) and explicitly state any label mapping; we optionally support extended labels (e.g., C1–S1 plus background) to handle transitional anatomy while keeping the reporting protocol consistent.
To improve class separability, we insert 3D-CBAM [
25] (channel attention followed by spatial attention) only in the identification decoder (upsampling and output stages) to suppress redundant channels and irrelevant spatial responses. We keep the segmentation decoder purely convolutional to avoid degrading precise boundary modeling.
2.4. Loss Functions
2.4.1. Overall Objective
The total loss comprises a supervised term and a self-supervised (consistency) term with a linearly ramped weight, as in Equation (1). The ramp-up policy is as follows: the consistency weight is zero for the first several epochs (supervised-only training), then increases from a specified epoch and reaches a stable level by mid-training, i.e., it grows linearly from 0 to 1 to fully leverage unlabeled data thereafter.
$$\mathcal{L}_{\mathrm{total}}(e) = \mathcal{L}_{\mathrm{sup}} + \lambda(e)\,\mathcal{L}_{\mathrm{cons}} \tag{1}$$
Here, $\mathcal{L}_{\mathrm{total}}(e)$ is the total loss at epoch $e$; $\mathcal{L}_{\mathrm{sup}}$ is the supervised term; $\mathcal{L}_{\mathrm{cons}}$ is the self-supervised consistency term; and $\lambda(e)$ is the epoch-dependent weight (Equation (2)):
$$\lambda(e) = \min\!\left(\max\!\left(\frac{e - e_s}{e_h - e_s},\ 0\right),\ 1\right) \tag{2}$$
where $e_s$ is the epoch at which the consistency term is activated, and $e_h$ denotes half of the total number of training epochs. In our implementation, we use a supervised-only warm-up of the first 40 epochs ($e_s = 40$) and linearly ramp the consistency weight for 40 epochs ($e_h = 80$).
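The ramp-up policy described above can be sketched as a small helper (a minimal sketch assuming the stated warm-up and ramp epochs; the function name is illustrative):

```python
def consistency_weight(epoch: int, e_start: int = 40, e_half: int = 80) -> float:
    """Linear ramp of the consistency weight: 0 during the supervised-only
    warm-up, rising linearly to 1 by e_half, then held constant."""
    if epoch <= e_start:
        return 0.0
    return min((epoch - e_start) / (e_half - e_start), 1.0)
```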
Notation and symbols used in Equations (1)–(11) are summarized in
Table A1 for clarity.
2.4.2. Supervised Term
The supervised loss $\mathcal{L}_{\mathrm{sup}}$ is the weighted sum of the segmentation and identification losses (Equation (3)), where the identification weight $\lambda_{\mathrm{id}}$ strengthens the learning of vertebral indices:
$$\mathcal{L}_{\mathrm{sup}} = \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{id}}\,\mathcal{L}_{\mathrm{id}} \tag{3}$$
For the segmentation branch, we adopt multi-scale deep supervision: at full, half, and quarter resolutions, we compute an equal-weighted sum of cross-entropy and Dice loss, then aggregate them with scale weights (larger for higher resolutions), as shown in Equation (4):
$$\mathcal{L}_{\mathrm{seg}} = \sum_{s} w_s \left[\, \beta\, \mathcal{L}_{\mathrm{CE}}^{(s)} + (1 - \beta)\, \mathcal{L}_{\mathrm{Dice}}^{(s)} \right] \tag{4}$$
At scale $s$, the Dice term $\mathcal{L}_{\mathrm{Dice}}^{(s)}$ and cross-entropy term $\mathcal{L}_{\mathrm{CE}}^{(s)}$ are as follows:
$$\mathcal{L}_{\mathrm{Dice}}^{(s)} = 1 - \frac{2 \sum_{v \in \Omega_s} p_v\, g_v + \epsilon}{\sum_{v \in \Omega_s} p_v + \sum_{v \in \Omega_s} g_v + \epsilon} \tag{5}$$
$$\mathcal{L}_{\mathrm{CE}}^{(s)} = -\frac{1}{|\Omega_s|} \sum_{v \in \Omega_s} \left[ g_v \log p_v + (1 - g_v) \log (1 - p_v) \right] \tag{6}$$
Notation: $\Omega_s$ is the voxel grid at scale $s$; $v$ is the voxel index; $g_v$ is the ground-truth label (foreground = 1, background = 0); $p_v$ is the predicted foreground probability; $w_s$ is the scale weight with $\sum_s w_s = 1$; $\beta$ is the internal weighting factor of the segmentation loss; and $\epsilon$ avoids division by zero.
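A minimal NumPy sketch of this per-scale Dice + cross-entropy combination follows; the scale weights and the value of $\beta$ below are illustrative assumptions, not the tuned values used in training:

```python
import numpy as np

def dice_loss(p, g, eps=1e-6):
    """Soft Dice loss over a flattened voxel grid."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)

def bce_loss(p, g, eps=1e-7):
    """Voxel-wise binary cross-entropy with probability clipping."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(g * np.log(p) + (1 - g) * np.log(1 - p))))

def multiscale_seg_loss(preds, gts, scale_weights=(0.5, 0.3, 0.2), beta=0.5):
    """Deep-supervised segmentation loss: per-scale CE + Dice mixed by beta,
    aggregated with scale weights (largest weight at full resolution)."""
    total = 0.0
    for w, p, g in zip(scale_weights, preds, gts):
        total += w * (beta * bce_loss(p, g) + (1 - beta) * dice_loss(p, g))
    return total
```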
The identification branch combines two complementary terms: distribution alignment $\mathcal{L}_{\mathrm{KL}}$ and instance consistency $\mathcal{L}_{\mathrm{inst}}$.
Distribution alignment $\mathcal{L}_{\mathrm{KL}}$. Kullback–Leibler (KL) divergence between the target Gaussian heatmap $h$ and the predicted spatial distribution $q$ (Equation (7)):
$$\mathcal{L}_{\mathrm{KL}} = \sum_{v} h_v \log \frac{h_v}{q_v} \tag{7}$$
Instance consistency $\mathcal{L}_{\mathrm{inst}}$. At low resolution, we separate the foreground into per-vertebra instances and compute the mean Dice to penalize merging/fragmentation (Equation (8)):
$$\mathcal{L}_{\mathrm{inst}} = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{Dice}\!\left(G_i, P_i\right) \tag{8}$$
Here, $G_i$ and $P_i$ denote the ground-truth and predicted per-vertebra instance maps, respectively, and $N$ is the number of vertebra instances. $h$ is the normalized Gaussian heatmap target (from ground-truth centroids, or high-confidence teacher centroids for unlabeled data; $\sum_v h_v = 1$), and $q$ is the softmax-normalized spatial distribution from the identification logits. For labeled samples, $h$ is from ground truth; for unlabeled samples, $h$ is from the teacher. The two terms are dynamically reweighted.
Gaussian heatmap parameters. After resampling to the fixed training spacing, each centroid is rendered as a 3D Gaussian with a fixed standard deviation $\sigma$ (in voxels); the heatmap is normalized to sum to 1. We use the same $\sigma$ for all vertebra levels (no level-specific tuning).
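The heatmap construction can be sketched as follows (NumPy; the default $\sigma$ below is an illustrative placeholder, since the exact value is not restated here):

```python
import numpy as np

def gaussian_heatmap(shape, centroid, sigma=2.0):
    """Render a 3D Gaussian around a voxel centroid and normalize it
    to sum to 1, matching the heatmap target described in the text."""
    zz, yy, xx = np.meshgrid(*[np.arange(s) for s in shape], indexing="ij")
    d2 = ((zz - centroid[0]) ** 2
          + (yy - centroid[1]) ** 2
          + (xx - centroid[2]) ** 2)
    h = np.exp(-d2 / (2.0 * sigma ** 2))
    return h / h.sum()
```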
2.4.3. Self-Supervised Term
The total consistency loss $\mathcal{L}_{\mathrm{cons}}$ is the weighted sum of the segmentation consistency term $\mathcal{L}_{\mathrm{cons}}^{\mathrm{seg}}$ and the identification consistency term $\mathcal{L}_{\mathrm{cons}}^{\mathrm{id}}$, with branch weights $w_{\mathrm{seg}}$ and $w_{\mathrm{id}}$ (Equation (9)). We adopt teacher–student consistency: the teacher generates pseudo-labels on a weakly augmented view, while the student receives the corresponding strongly augmented view and is encouraged to match the teacher.
$$\mathcal{L}_{\mathrm{cons}} = w_{\mathrm{seg}}\, \mathcal{L}_{\mathrm{cons}}^{\mathrm{seg}} + w_{\mathrm{id}}\, \mathcal{L}_{\mathrm{cons}}^{\mathrm{id}} \tag{9}$$
Segmentation consistency $\mathcal{L}_{\mathrm{cons}}^{\mathrm{seg}}$. We threshold teacher probabilities to form a pseudo-mask and compute BCE on student logits only over high-confidence voxels (Equation (10)):
$$\mathcal{L}_{\mathrm{cons}}^{\mathrm{seg}} = -\frac{1}{|\Omega_c|} \sum_{x \in \Omega_c} \left[ \hat{y}_x \log p_x + (1 - \hat{y}_x) \log (1 - p_x) \right] \tag{10}$$
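A minimal sketch of this confidence-filtered consistency term (NumPy; treating voxels as confident when the teacher probability is close to either 0 or 1 is an assumption of the sketch, as is the threshold value):

```python
import numpy as np

def seg_consistency_loss(teacher_prob, student_prob, tau=0.9, eps=1e-7):
    """Threshold teacher probabilities into a binary pseudo-mask and compute
    BCE on the student only over voxels where the teacher is confident."""
    confident = (teacher_prob >= tau) | (teacher_prob <= 1.0 - tau)
    if not np.any(confident):
        return 0.0  # no confident voxels -> no consistency signal
    pseudo = (teacher_prob[confident] >= 0.5).astype(np.float32)
    p = np.clip(student_prob[confident], eps, 1.0 - eps)
    return float(np.mean(-(pseudo * np.log(p) + (1 - pseudo) * np.log(1 - p))))
```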
Identification consistency $\mathcal{L}_{\mathrm{cons}}^{\mathrm{id}}$. Temperature-calibrated teacher heatmaps yield pseudo-classes (argmax); we compute CE at high-confidence locations with a ramped threshold, applying class-frequency weights and excluding the background (Equation (11)):
$$\mathcal{L}_{\mathrm{cons}}^{\mathrm{id}} = -\frac{1}{\sum_{k} |\Omega_k|} \sum_{k=1}^{K} w_k \sum_{x \in \Omega_k} \hat{y}_x^{\mathrm{id}} \cdot \log q_x \tag{11}$$
Confidence-threshold schedules. For identification, we apply softmax with temperature $T$ to teacher heatmaps and keep pseudo-labels only at locations whose maximum confidence exceeds $\tau_{\mathrm{id}}$. In our implementation, $\tau_{\mathrm{id}}$ is ramped linearly from 0.30 to 0.55 over the consistency ramp-up epochs (Equation (12)). For segmentation, consistency is computed only on voxels whose teacher confidence exceeds $\tau_{\mathrm{seg}}$, with $\tau_{\mathrm{seg}}$ ramped from 0.50 to 0.90 (Equation (13)). Both schedules follow the epoch-dependent ramp weight $\lambda(e)$ of Equation (2):
$$\tau_{\mathrm{id}}(e) = 0.30 + (0.55 - 0.30)\,\lambda(e) \tag{12}$$
$$\tau_{\mathrm{seg}}(e) = 0.50 + (0.90 - 0.50)\,\lambda(e) \tag{13}$$
EMA teacher update. After each student optimization step, the EMA teacher is updated at every iteration by $\theta_T \leftarrow m\,\theta_T + (1 - m)\,\theta_S$, where $\theta_T$ and $\theta_S$ denote the teacher and student weights and $m$ is the EMA momentum. At the end of warm-up, the EMA teacher is initialized from the current student weights before enabling consistency training.
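The per-iteration EMA step can be sketched as follows (the momentum value below is an illustrative placeholder, since the exact value is not restated here):

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """One EMA step: teacher <- m * teacher + (1 - m) * student,
    applied element-wise to each parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```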
Notation: $\Omega_c$ is the set of high-confidence voxels used for segmentation consistency; $x$ is the voxel index; $\hat{y}_x$ is the teacher's binary pseudo-label at voxel $x$ (obtained by thresholding teacher foreground probabilities); $\Omega_k$ is the set of high-confidence voxels for class $k$ used for identification consistency; $\hat{y}_x^{\mathrm{id}}$ is the teacher's one-hot pseudo-class label at $x$ ($K$ vertebral classes, e.g., C1–L5); $q_x$ is the student's class-probability vector at $x$ with $\sum_k q_{x,k} = 1$; $p_x$ is the sigmoid-normalized student segmentation probability from the logit $z_x$; and $w_k$ is the class-adaptive weight based on training-set frequency.
2.5. Data Augmentation
In our semi-supervised framework, we use teacher–student consistency with synchronized views for each unlabeled volume: weak for the teacher and strong for the student.
Weak augmentation (geometry only). To simulate pose variability while maintaining topology, we apply random flips (X/Y with probability 0.5, Z with probability 0.25) and an in-plane XY rotation with an angle uniformly sampled from a fixed symmetric range (linear interpolation).
Strong augmentation (photometric/noise only). The strong view inherits the weak geometry (no extra geometric distortion) and applies stochastically sampled intensity/noise operations (RandAugment-style [
26]) with independently sampled magnitudes. The pool includes intensity scaling/shifting, gamma correction, Gaussian/salt-and-pepper/Poisson/speckle noise, Gaussian blur, median filtering, unsharp masking, and slice-wise CLAHE. Operations are applied only to image intensities to avoid label contamination.
For labeled samples, we apply the same geometry to the volume and mask (trilinear vs. nearest-neighbor interpolation). We apply the full strong photometric/noise pipeline to the volume with p = 0.5; otherwise, we optionally apply mild intensity scaling with a probability of 0.5. Masks are never modified by photometric/noise operations. See
Table A2 for parameter ranges.
2.6. Implementation Details
We implement the model in PyTorch 2.7.0 and train for 100 epochs on a single NVIDIA RTX 3090 Ti GPU (NVIDIA, Santa Clara, CA, USA) with random seed 42. All FSL and SSL variants share identical preprocessing, patch extraction, and optimization settings to ensure a fair comparison. We use SGD with momentum (initial learning rate of 0.01, momentum of 0.9). The training schedule includes a supervised-only warm-up for the first 40 epochs; the consistency weight is then linearly ramped from 0 to 1 over the next 40 epochs (epochs 41–80) and kept fixed thereafter. In the same schedule, the pseudo-label thresholds are increased linearly ($\tau_{\mathrm{id}}$: 0.30 to 0.55; $\tau_{\mathrm{seg}}$: 0.50 to 0.90).
At inference, the segmentation output is thresholded at a probability of 0.5 to obtain a foreground mask. For identification, the predicted 1/4-resolution class-wise heatmaps are upsampled to the segmentation resolution; per-class peak locations are treated as predicted centroids, and each foreground voxel is assigned to the nearest centroid in Euclidean distance to produce a voxel-wise vertebra ID map. We then apply 3D connected-component labeling on the segmentation mask to extract candidate vertebral instances and remove small, isolated fragments. For each connected instance, we compute its geometric centroid and sort instances along the cranio–caudal axis. Predicted IDs are matched one-to-one with the sorted instances, after which a sequence-continuity prior is enforced to ensure anatomically plausible numbering: for duplicate IDs, we retain the instance with higher confidence and reassign the others according to the spatial order; for missing or skipped IDs, we shift or fill labels from superior to inferior to conform to the typical anatomical sequence. This post-processing, grounded in anatomical continuity, yields a one-to-one, superior-to-inferior consistent vertebral index for each connected instance, improving the coherence and utility of the joint segmentation–identification output.
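The sequence-continuity repair can be illustrated with a simplified, anchor-based sketch (a hypothetical helper operating on already-extracted instance centroids; the full pipeline additionally performs connected-component labeling, fragment removal, and nearest-centroid voxel assignment as described above):

```python
def repair_id_sequence(instances):
    """Enforce superior-to-inferior continuous vertebral numbering.
    `instances` is a list of (z_centroid, predicted_id, confidence) tuples.
    The most confident prediction is kept as an anchor, and all other
    instances are renumbered consecutively by spatial order, which resolves
    duplicate and skipped IDs in one pass."""
    inst = sorted(instances, key=lambda t: t[0])  # cranio-caudal order
    if not inst:
        return []
    # choose the single most confident prediction as the anchor
    anchor = max(range(len(inst)), key=lambda i: inst[i][2])
    anchor_id = inst[anchor][1]
    # renumber relative to the anchor so indices are strictly consecutive
    return [(z, anchor_id + (i - anchor)) for i, (z, _, _) in enumerate(inst)]
```

For example, two instances both predicted as the same ID are split into consecutive IDs, with the lower-confidence one shifted according to its spatial position.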
2.7. Experiments and Evaluation
2.7.1. Fully Supervised vs. Semi-Supervised Settings
To systematically assess performance, we establish two training protocols on VerSe 2019 and VerSe 2020: fully supervised learning (FSL) and SSL. Unless otherwise specified, we adopt the official VerSe 2019 validation/test splits for model selection and final reporting to ensure consistent evaluation.
Data pools. Labeled training pool: VerSe 2019 training split (80 cases) with voxel-wise masks and vertebral IDs. Unlabeled training pool: VerSe 2020 training split (113 cases); labels are not used, and these cases are treated as unlabeled inputs in the teacher–student consistency framework. Evaluation sets: VerSe 2019 validation split (40 cases) for model selection and VerSe 2019 test split (40 cases) for final reporting.
Fixed evaluation protocol. FSL trains only on the 80 labeled VerSe 2019 cases and evaluates on VerSe 2019 validation/test. SSL trains on the same 80 labeled VerSe 2019 cases while additionally incorporating 113 VerSe 2020 cases as unlabeled data; validation and testing remain identical to FSL (both on VerSe 2019).
Fairness statement. This protocol isolates the effect of unlabeled data: the only difference between FSL and SSL is whether unlabeled VerSe 2020 volumes are included during training, while evaluation sets and post-processing remain unchanged.
Table 2 reports the sample counts under both protocols; for SSL, the “Training” entry is shown as “labeled + unlabeled”.
2.7.2. Comparative Studies
We compare the proposed method with several representative approaches for vertebral segmentation and identification. We additionally benchmark against established SSL baselines adapted to 3D medical segmentation, including Mean Teacher [18], UA-MT [20], CPS [27], and a UniMatch-style strong/weak consistency baseline [28]. All SSL baselines use the same labeled/unlabeled splits, backbone capacity, and post-processing to ensure a fair comparison. The fully supervised comparison methods are as follows:
Payer et al. [
6]: a three-stage pipeline of coarse localization → identification → segmentation.
Sekuboyina et al. [
7]: segmentation combined with location-based numbering using an improved indexing strategy.
nnU-Net baseline [
29]: an auto-configuring segmentation framework. Since it only outputs a segmentation mask, we assign vertebral IDs post hoc using a 3D connected-component analysis in superior-to-inferior order to compute the identification accuracy.
SpineCLUE [
17]: a recent model leveraging contrastive learning and uncertainty estimation for vertebral identification (we implement the end-to-end model following the authors’ public description).
All methods are trained and evaluated on the same VerSe 2019 labeled training and test splits. Note that some original papers used different training configurations; for fairness, we restrict all methods to use only the labeled data provided by VerSe 2019, without any extra data or pretraining. To ensure fair comparison, we applied the same connected-component instance labeling and sequential indexing (described in
Section 2.6) to all segmentation outputs that lacked ID predictions (e.g., nnU-Net). In particular, the nnU-Net baseline (which produces only a binary segmentation) was followed by the same vertebra instance extraction and superior-to-inferior ID assignment procedure as our method.
2.7.3. Ablation Studies
To assess the marginal contribution of each component, we design several ablation studies corresponding to the main modules and hyperparameters used in our framework. Unless otherwise specified, all ablation experiments follow the data splits described in
Section 2.7.1; network architecture, optimization settings, and data augmentations are kept fixed, and only the component under investigation is changed.
1. Identification branch and semi-supervised learning. We first evaluate the effect of progressively adding the identification branch and the semi-supervised consistency loss. Three configurations are compared on the VerSe 2019 validation set: (i) baseline, a single-task model with only the segmentation branch, trained fully supervised on labeled VerSe 2019 data; (ii) +ID, the dual-branch model that adds the identification head but is still trained purely supervised on labeled VerSe 2019 data with the consistency loss disabled; and (iii) +SSL, the full semi-supervised model that keeps the dual-branch architecture of +ID and additionally enables teacher–student consistency training on unlabeled VerSe 2020 scans with a linearly ramped consistency weight (Section 2.4.3).
2. Pseudo-label threshold sensitivity.
With the +SSL configuration fixed, we vary the teacher pseudo-label confidence threshold $\tau$ while keeping all other training details identical. We compare validation metrics as a function of $\tau$ to identify a robust operating range and a recommended threshold.
3. Mamba modules at multiple encoder layers.
To investigate where long-range modeling is most effective, we compare two placements of the Mamba module under the +SSL setting: (i) a single Mamba block inserted only at the bottleneck between an encoder and decoders (our default choice) and (ii) Mamba blocks [
24] inserted at all encoder stages (“Mamba@All Encoders”). Both variants are trained with the same semi-supervised protocol. This ablation evaluates whether distributing Mamba across multiple scales yields measurable performance gains relative to the added computational cost.
4. Effect of 3D-CBAM placement.
We further examine the impact of the 3D-CBAM attention modules and where they should be placed. Under the +SSL setting, we compare three configurations: (i) no CBAM in either branch, (ii) CBAM in the ID branch only (our default design), and (iii) CBAM in both the segmentation and ID branches. All other components, including Mamba placement and consistency training, are kept unchanged. This ablation tests whether attention is primarily beneficial for the identification task, which requires discriminative features across vertebrae, or whether it also improves the segmentation branch that focuses on precise boundary delineation.
5. Replacing Mamba with a Transformer block.
Finally, to validate the choice of Mamba for long-range dependency modeling, we compare our default +SSL model (with a Mamba module at the bottleneck) against a variant in which the Mamba block is replaced by a standard 3D self-attention (Transformer [
30]) block inserted at the same location. The two variants share the same dual-branch architecture and semi-supervised training setup. This ablation examines whether Mamba can match the performance of a conventional Transformer while offering lower computational complexity.
2.7.4. Evaluation Metrics
Segmentation metric. The segmentation accuracy is evaluated using the mean Dice similarity coefficient (Dice), defined in Equation (14):
$$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|} \tag{14}$$
where $P$ and $G$ denote the predicted and ground-truth foreground voxel sets, respectively.
Identification metric. The identification accuracy is measured as instance-level accuracy (Acc), i.e., the proportion of correctly identified vertebrae over all vertebrae in the test set. Let the ground-truth index of instance $i$ be $y_i$ and the predicted index of instance $j$ be $\hat{y}_j$. We obtain the ground-truth instance set $\mathcal{G} = \{g_i\}$ by applying 3D connected-component labeling to the ground-truth vertebra mask and obtain the predicted instance set $\mathcal{P} = \{p_j\}$ by applying 3D connected-component labeling to the predicted binary segmentation (thresholded at 0.5), with small isolated fragments removed. The instance-wise identification accuracy Acc is defined in Equation (15):
$$\mathrm{Acc} = \frac{1}{|\mathcal{G}|} \sum_{(i,j) \in \mathcal{M}} \mathbb{1}\!\left[ y_i = \hat{y}_j \right] \tag{15}$$
A ground-truth vertebra counts as correct only if it is matched to a predicted connected component and that component's predicted index is correct; unmatched (missed) instances or mismatched indices are counted as errors. The matching set $\mathcal{M}$ is defined by an IoU criterion (Equation (16)), where $\tau_{\mathrm{IoU}}$ is the Intersection-over-Union (IoU) matching threshold:
$$\mathcal{M} = \left\{ (i, j) : \mathrm{IoU}(g_i, p_j) \geq \tau_{\mathrm{IoU}} \right\} \tag{16}$$
Here, $\mathbb{1}[\cdot]$ denotes the indicator function, and $\mathrm{IoU}(g_i, p_j)$ is computed between the binary masks of the individual instances $g_i$ and $p_j$ as the ratio of intersection to union (Equation (17)):
$$\mathrm{IoU}(g_i, p_j) = \frac{|g_i \cap p_j|}{|g_i \cup p_j|} \tag{17}$$
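These metrics can be sketched as follows (NumPy; the default IoU threshold is an illustrative value, since the exact matching threshold is not restated here):

```python
import numpy as np

def dice_coefficient(pred, gt):
    """Dice = 2|P ∩ G| / (|P| + |G|) over binary foreground masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def instance_iou(a, b):
    """IoU between two binary instance masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def identification_accuracy(gt_instances, pred_instances, iou_thr=0.5):
    """Instance-level Acc: a GT vertebra counts as correct only if some
    predicted instance overlaps it with IoU >= iou_thr and has the same ID.
    Each element of the instance lists is a (binary_mask, vertebra_id) pair."""
    correct = 0
    for g_mask, g_id in gt_instances:
        for p_mask, p_id in pred_instances:
            if instance_iou(g_mask, p_mask) >= iou_thr and g_id == p_id:
                correct += 1
                break
    return correct / len(gt_instances) if gt_instances else 0.0
```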
4. Discussion
This study demonstrates the effectiveness of semi-supervised learning for vertebral segmentation and identification in spinal CT. Compared with prior pipelines, our end-to-end multi-task model fully integrates segmentation and numbering, avoiding the error amplification typical of staged methods. On the VerSe benchmarks, our approach surpasses the best reported results in both Dice and identification accuracy. However, cross-site evaluation revealed that performance can degrade under distribution shifts (
Section 3.3), underscoring the need for domain adaptation.
Cross-domain analysis. For both VerSe and CTSpine1K, we use the same intensity preprocessing (HU clipping to [−500, 1500] and [0, 1] normalization after resampling) and the same evaluation protocol. The cervical/thoracic/lumbar (C/T/L) breakdown is computed using identical vertebral-level criteria across datasets. The observed domain shift therefore stems mainly from differences in scanner/protocol, field-of-view truncation, and anatomical distribution rather than from inconsistent normalization.
Nonetheless, several limitations remain. First, our method still requires a certain amount of labeled data for initialization. Second, we used only the unlabeled data provided by VerSe; future work could incorporate larger, more diverse unlabeled spinal CT corpora to further exploit SSL. Third, generalization across scanners and acquisition protocols may be affected by domain shifts; domain adaptation techniques could alleviate this.
We acknowledge that our fully supervised baseline segmentation Dice is slightly lower than the nnU-Net [
29] benchmark. Importantly, our supervised baseline is designed as a matched control for the SSL study: it uses identical preprocessing, patching (128 × 64 × 64), and the same optimization schedule as the SSL setting, so that the reported SSL gain (+1.8 Dice) reflects the effect of adding unlabeled data rather than changes in training configuration. By contrast, nnU-Net is a segmentation-only framework that automatically selects task-specific hyperparameters (e.g., patch size, augmentation strength, and optimizer/schedule) to maximize Dice. In addition, our architecture is multi-task (segmentation + identification) with a dual-branch decoder and loss balancing; this can slightly reduce the effective capacity/optimization focus for pure segmentation compared with a single-task nnU-Net model. We have added this clarification to contextualize the baseline positioning and to make the comparison fair and interpretable.
In future work, we plan to (1) incorporate richer anatomical priors (e.g., plausible spine length ranges and typical vertebra counts) to improve robustness under congenital variants or partial scans; (2) scale up semi-supervised training with larger unlabeled datasets to mine more latent information; and (3) study domain adaptation across hospitals, devices, and imaging parameters to enhance real-world generalization. Overall, our results highlight the substantial potential of semi-supervised deep learning in medical imaging, with a promise for more reliable clinical decision support.