Article

Prompt-Guided Refinement: A Novel Technique for Improving Intervertebral Disc Semantic Labeling

by
Mohammed N. Alharbi
* and
Mohammad D. Alahmadi
Software Engineering Department, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(24), 3944; https://doi.org/10.3390/math13243944
Submission received: 31 October 2025 / Revised: 6 December 2025 / Accepted: 9 December 2025 / Published: 11 December 2025
(This article belongs to the Special Issue Applications of Artificial Intelligence and Medical Imaging)

Abstract

Accurate detection and semantic labeling of intervertebral discs (IVDs) in magnetic resonance imaging (MRI) are crucial for evaluating and treating spinal-related disorders. Conventional approaches typically utilize convolutional neural networks (CNNs) to extract contextual features from MRI images, but they often overlook the inherent geometric structure of the vertebral column, leading to inaccuracies in IVD localization and segmentation. Addressing this limitation, we propose a novel prompt encoder method that incorporates geometric information to enhance the semantic labeling of IVDs in MRI images. Our approach effectively learns the skeleton structure of the spinal column and adaptively adjusts its predictions to conform to this anatomical framework. Extensive evaluations on multi-center spine datasets demonstrate that our method outperforms existing state-of-the-art techniques, consistently achieving superior performance in both T1w and T2w MRI modalities. The incorporation of geometric information significantly improves the accuracy and robustness of IVD semantic labeling, paving the way for more precise and reliable assessments of spine health and disease progression.

1. Introduction

The human vertebral column, composed of 33 vertebrae interconnected by ligaments and intervertebral discs (IVDs), supports several physiological functions, including shock absorption, load distribution, spinal cord protection, and overall flexibility. Each of its five regions (cervical, thoracic, lumbar, sacral, and coccygeal) plays a distinctive role in maintaining the body’s mechanical integrity [1].
IVDs, consisting of fibrocartilage, act as crucial cushions and joints between adjacent vertebrae, absorbing stress and shock during body movements. Disruptions in IVDs, as a result of age-related changes, degenerative processes, or physical trauma, may lead to changes in disc properties, compromising the mechanical functionalities of surrounding tissues [2]. Accurate localization and segmentation of IVDs are essential for diagnosing spine-related disorders and optimizing treatment procedures. Consequently, various techniques, categorized into hand-crafted and deep learning-based methods, have been proposed in the literature.
Hand-crafted methods, such as Cheng et al.’s two-step approach [3] and Glocker et al.’s use of a regression forest [4], rely on traditional image processing for localizing and segmenting IVDs. While effective in certain cases, these methods may have limitations compared to deep learning-based approaches [3,5]. Recent advancements in deep learning have significantly improved the exploration of robust IVD labeling methods [6].
In this context, researchers have proposed diverse deep learning architectures, ranging from standard Convolutional Neural Networks (CNNs) [7] to sophisticated models like ‘IVD-Net’ [8] and those integrating transfer learning [9]. These methods leverage multi-modal information, as seen in Vania et al.’s mask-RCNN-based approach [10], or cross-modality techniques demonstrated by Wimmer et al. [11].
Despite significant advancements in intervertebral disc semantic labeling, current approaches often employ conventional convolutional neural network (CNN) learning strategies, which can introduce false positive (FP) and false negative (FN) detections.
To rectify this limitation, we propose a novel approach that models intervertebral disc semantic labeling as a pose estimation problem. By minimizing the prediction loss between the predicted points and the actual vertebral locations, our method significantly reduces FPs and FNs, leading to more accurate and reliable intervertebral disc localization.
To further enhance the generalization capability of our approach, we incorporate a prompt-based learning strategy that explicitly encodes the global skeleton of the vertebral column into a set of compact, learnable prompts. These prompts dynamically guide the decoder throughout the refinement process, enabling the model to remain anatomically aligned even in the presence of significant structural variability. As a result, the proposed method is highly effective across a wide range of vertebral herniations and severity levels, consistently achieving robust and accurate predictions without requiring prior knowledge of the underlying pathology. This versatility makes the framework particularly valuable for clinical scenarios, where morphological diversity is the norm rather than the exception.
While recent methods have explored attention mechanisms [12,13] or geometry-aware learning [14,15] for vertebral and disc localization, our framework introduces three innovations that distinguish it from prior work.
First, we propose a dedicated prompt encoder that transforms the global vertebral skeleton into a set of learnable prompts. Unlike geometry-aware approaches [14,16], our prompts are data-driven embeddings that capture inter-disc spacing, spinal curvature, and overall vertebral alignment. This design allows the network to internalize global anatomical structure in a flexible and adaptive way, instead of relying on fixed positional assumptions.
Second, the Multi-Scale Prompt Conditioner (MSPC) introduces prompt-driven affine modulation at multiple encoder scales. Whereas prior multi-scale attention [17] and feature pyramid [18] methods operate exclusively on visual features, MSPC injects global skeletal priors directly into the latent representation across several resolutions. This enables the encoder to align its learned features with the spinal geometry from the earliest stages of representation learning, an architectural mechanism not previously explored in intervertebral disc labeling networks.
Third, the Prompt-Guided Refinement (PGR) module establishes a cross-attention mechanism between decoder features and the geometry-derived prompt vectors. Unlike conventional self-attention [19], where queries, keys, and values originate from the same feature map, PGR uses decoder features as queries and prompt embeddings as keys and values. This design enforces anatomical coherence throughout the decoding hierarchy, improving spatial consistency and robustness to imaging artifacts.
Collectively, these components form a coherent, geometry-aware refinement pipeline based on prompt-driven modulation and global-to-local anatomical guidance. This combination has not appeared in existing attention-based or geometry-aware IVD methods, and it constitutes the core novelty of the proposed framework.
The full implementation of PGR-Net is available at (https://github.com/malharbi2016/PGR, accessed on 6 December 2025). The repository includes the source code, training and evaluation scripts, the prompt and skeleton generation code, and instructions for preparing the dataset.

2. Related Work

IVD Semantic Labeling is a critical aspect of medical image analysis, significantly impacting diagnostic precision and treatment efficiency. Traditional techniques focused on isolating regions of interest based on features like grayscale values [20], edges [21], and shape patterns in medical images. However, these methods often required manual intervention for accuracy refinement and were computationally intensive [22]. The evolution of IVD Semantic Labeling techniques has been profoundly influenced by advancements in deep learning [23,24,25,26,27]. The Fully Convolutional Network (FCN) [28] represented a milestone in applying Convolutional Neural Networks (CNNs) to segmentation tasks. Nevertheless, drawbacks such as the need for extensive computational resources persisted. Ronneberger et al. addressed this by introducing the U-Net network [29], utilizing a symmetric encoder–decoder structure with skip connections for more efficient segmentation, particularly beneficial for complex medical images.
The ResNet architecture [30] enhanced feature representation through deep residual learning, overcoming limitations of previous methods. Subsequent innovations included PSPNet [31], which introduced a pyramid pooling module to expand the receptive field and combine features from different resolutions. However, the computational demands remained a concern. BiSeNet [32] addressed this by categorizing image features into semantic and contextual types, achieving a balance between precision and efficiency. DeepLabv3 [33], leveraging Xception [34] and depth-wise separable convolutions, aimed at reducing model parameters while enhancing feature extraction. Despite this improvement, the handling of varying scales in medical images remained a challenge. HRNet [35] introduced continuous feature merging at different scales, maintaining resolution consistency.
Recent studies in IVD Semantic Labeling have made noteworthy contributions, each addressing specific challenges. Ref. [36] introduced a novel loss function to enhance geometric accuracy, addressing limitations in traditional loss functions. Ref. [37] explored lumbar spinal stenosis (LSS) segmentation, with the ResUNet model achieving high accuracy, mitigating challenges in LSS diagnosis. Ref. [38] proposed a two-stage CNN framework integrating 3D Transformers and 2D CNNs, overcoming limitations in 2D segmentation. Ref. [39] utilized a combination of FCN and Stacked Hourglass Network with a Multi-level Attention Mechanism, reducing false positives in vertebral disc localization. Ref. [40] focused on automatic semantic segmentation using modified U-Net architectures, surpassing standard models in classifying structural elements. Ref. [22] introduced a multi-scale information fusion framework for cervical intervertebral discs, addressing challenges in accuracy, especially for 3D reconstruction and printing applications.
Beyond traditional CNN-based and geometry-aware IVD labeling methods, recent research in multimodal learning and prompt-based feature conditioning has shown promising advances in other domains. For example, hybrid feature fusion strategies have been explored in audio–context emotion recognition [41] and multimodal depth-based gesture understanding [42]. These works highlight the effectiveness of combining heterogeneous cues through explicit conditioning mechanisms. While these methods operate in entirely different application areas and cannot be directly compared to spinal MRI localization tasks, they conceptually reinforce the value of integrating complementary sources of information, a principle that aligns with our use of geometry-driven prompts.
Building on this broader insight, our method introduces anatomical skeleton priors as learnable prompt representations and incorporates them directly into the segmentation pipeline. This strategy allows the network to exploit global vertebral structure to refine local predictions, addressing one of the core challenges in intervertebral disc labeling: effectively capturing geometric relationships along the spinal column. By embedding skeletal distribution information into the learning process, our framework adaptively enhances shape representation and improves localization accuracy, offering a direction not previously explored in existing IVD labeling literature.

3. Method

The primary goal of the proposed Prompt-Guided Refinement Network (PGR-Net) is to achieve anatomically consistent and geometrically constrained intervertebral disc (IVD) semantic labeling in MRI. Unlike conventional convolutional models that operate purely on local pixel intensities, our framework explicitly models the geometric skeleton of the vertebral column as an auxiliary prior. This prior is encoded through a lightweight prompt encoder that dynamically interacts with the latent feature space to regularize the segmentation process. The overall pipeline, illustrated conceptually in Figure 1, comprises three main components: (1) a hierarchical convolutional encoder–decoder backbone for feature extraction, (2) a multi-scale prompt conditioner for geometric embedding, and (3) a prompt-guided refinement mechanism integrated into the decoding stage for anatomically consistent predictions.

3.1. Problem Formulation

Let $X \in \mathbb{R}^{H \times W \times C}$ denote an input MRI slice with spatial dimensions $H \times W$ and $C$ imaging channels. The objective of the model is to predict the corresponding set of intervertebral disc heatmaps
$$Y = \{y_1, y_2, \ldots, y_V\}, \quad y_i \in [0, 1]^{H \times W},$$
where $V$ represents the number of intervertebral levels. The network learns a mapping
$$F_{\Theta} : X \rightarrow \hat{Y},$$
parameterized by $\Theta$, such that the predicted heatmaps $\hat{Y}$ approximate the ground-truth distribution $Y$ while respecting the geometric consistency of the vertebral skeleton. The overall forward pass can be expressed as
$$\hat{Y} = D\big(\Phi\big(E(X; \theta_e),\, P(S; \theta_p)\big);\, \theta_d\big),$$
where $E(\cdot)$, $\Phi(\cdot)$, and $D(\cdot)$ denote the encoder, prompt conditioner, and decoder modules, respectively, parameterized by $\theta_e$, $\theta_p$, and $\theta_d$. $S$ denotes the skeleton prior information extracted from the vertebral distribution map.

3.2. Encoder–Decoder Backbone

The encoder $E(\cdot)$ extracts dense semantic and contextual representations using a modified stacked hourglass network that preserves spatial precision while enlarging the receptive field. Given an input $X$, the encoder produces a latent feature tensor:
$$F = E(X; \theta_e) \in \mathbb{R}^{h \times w \times d},$$
where $d$ denotes the latent dimensionality. Each hourglass block $E_k$ in the stack applies a sequence of convolutional, normalization, and residual operations:
$$E_k(X) = R\big(\sigma(W_k * X + b_k)\big),$$
where $*$ represents convolution, $\sigma(\cdot)$ the activation function (ReLU), and $R(\cdot)$ a residual refinement operator ensuring stable gradient propagation. The multi-stage aggregation is defined as
$$F = \sum_{k=1}^{N} \omega_k E_k(X),$$
with $\omega_k$ denoting trainable attention weights over the $N$ stacked modules. This aggregation enables multi-scale contextual learning, crucial for distinguishing discs with varying morphology and contrast.
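The weighted aggregation above can be illustrated with a short NumPy sketch. It is not the authors' implementation: the softmax normalization of the weights, the function name `aggregate_stages`, and the tensor sizes are assumptions made for illustration, since the text only states that the weights are trainable.

```python
import numpy as np

def aggregate_stages(stage_outputs, stage_logits):
    """Weighted sum F = sum_k w_k * E_k(X) over N stacked hourglass outputs."""
    outputs = np.stack(stage_outputs, axis=0)        # (N, h, w, d)
    logits = np.asarray(stage_logits, dtype=float)   # (N,)
    # Assumption: weights are softmax-normalized so that they sum to one.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Contract the stage axis; the result has shape (h, w, d).
    return np.tensordot(weights, outputs, axes=(0, 0))

rng = np.random.default_rng(0)
stages = [rng.normal(size=(4, 4, 8)) for _ in range(3)]   # stand-ins for E_k(X)
fused = aggregate_stages(stages, stage_logits=[0.0, 0.0, 0.0])
```

With equal logits the aggregation reduces to a plain average of the stage outputs, which is a convenient sanity check.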
The decoder $D(\cdot)$ progressively upsamples the conditioned features to the original image resolution using transposed convolutions and skip connections. Each decoder layer refines predictions through spatial attention gates that emphasize high-confidence regions guided by prompt embeddings. The decoder output is a tensor $\hat{Y} \in \mathbb{R}^{H \times W \times V}$, where each channel corresponds to one IVD location.

3.3. Multi-Scale Prompt Conditioner (MSPC)

To incorporate geometric priors into the latent representation, we propose a Multi-Scale Prompt Conditioner that encodes the global skeletal structure of the vertebral column into the latent space for prompt-guided decoding.

3.3.1. Skeleton Encoding

Given a precomputed skeleton map $S \in \mathbb{R}^{H \times W}$, derived from the centroid distribution of intervertebral discs, we first convert it into a heatmap representation $H_s$ through Gaussian smoothing:
$$H_s(i, j) = \exp\left(-\frac{(i - c_i)^2 + (j - c_j)^2}{2\sigma_s^2}\right),$$
where $(c_i, c_j)$ are the coordinates of the i-th disc center and $\sigma_s$ controls the spatial spread. A convolutional encoder $f_{enc}$ then maps $H_s$ to a low-dimensional prompt vector:
$$P = f_{enc}(H_s; \theta_p) \in \mathbb{R}^{d_p}.$$
This prompt captures essential skeletal geometry, encoding inter-disc spacing, curvature, and orientation.
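To make the Gaussian skeleton heatmap concrete, the following NumPy sketch renders a few disc centers into a single map. The function name `skeleton_heatmap`, the example coordinates, and the choice to merge overlapping blobs with a pixel-wise maximum are assumptions for illustration; the text does not specify how several centers are combined.

```python
import numpy as np

def skeleton_heatmap(centers, height, width, sigma=2.0):
    """Render disc centers (row, col) as a Gaussian heatmap with values in [0, 1]."""
    rows = np.arange(height)[:, None]   # (H, 1)
    cols = np.arange(width)[None, :]    # (1, W)
    heatmap = np.zeros((height, width))
    for c_row, c_col in centers:
        sq_dist = (rows - c_row) ** 2 + (cols - c_col) ** 2
        blob = np.exp(-sq_dist / (2.0 * sigma ** 2))
        # Assumption: overlapping blobs are merged with a pixel-wise maximum.
        heatmap = np.maximum(heatmap, blob)
    return heatmap

centers = [(8, 10), (16, 12), (24, 13)]   # hypothetical disc centers
H_s = skeleton_heatmap(centers, height=32, width=24)
```

Each center produces a peak of exactly 1, and `sigma` plays the role of $\sigma_s$: larger values spread each disc over more pixels.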

3.3.2. Prompt–Feature Interaction

The encoded prompt vector $P$ conditions the feature space through adaptive modulation of latent features. Specifically, we employ channel-wise affine transformations similar to conditional normalization:
$$\tilde{F} = \gamma(P) \odot F + \beta(P),$$
where $\odot$ denotes element-wise multiplication and $\gamma(\cdot), \beta(\cdot) : \mathbb{R}^{d_p} \rightarrow \mathbb{R}^{d}$ are learnable mappings realized via MLPs. This enables each prompt to dynamically scale and shift latent channels based on the inferred skeletal geometry.
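The channel-wise affine modulation can be sketched in a few lines of NumPy. This is an illustrative example, not the published code: the single random linear layer stands in for a trained MLP, and the names `linear`, `modulate`, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_p, d = 16, 8   # prompt and feature channel sizes (illustrative values)

def linear(in_dim, out_dim):
    """A single random linear layer standing in for a trained MLP."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

gamma_mlp = linear(d_p, d)   # plays the role of gamma(.)
beta_mlp = linear(d_p, d)    # plays the role of beta(.)

def modulate(features, prompt):
    """Channel-wise affine modulation: gamma(P) * F + beta(P)."""
    gamma = gamma_mlp(prompt)   # (d,)
    beta = beta_mlp(prompt)     # (d,)
    # Broadcasting applies the same scale and shift at every spatial location.
    return features * gamma + beta

F = rng.normal(size=(6, 5, d))   # latent features (h, w, d)
P = rng.normal(size=d_p)         # prompt vector
F_tilde = modulate(F, P)
```

Because the scale and shift depend only on the prompt, the same geometric prior influences every spatial position of the feature map.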
To ensure multi-scale consistency, we extend the conditioner to three pyramid levels, $\Phi = \{\Phi_1, \Phi_2, \Phi_3\}$, applied to encoder outputs $\{F_1, F_2, F_3\}$ with decreasing resolutions:
$$\tilde{F}_l = \gamma_l(P) \odot F_l + \beta_l(P), \quad l \in \{1, 2, 3\}.$$
These conditioned features are fused into the bottleneck representation:
$$F = \sum_{l=1}^{3} \psi_l(\tilde{F}_l),$$
where $\psi_l$ denotes an upsampling operator to match the bottleneck scale. The resulting $F$ captures both fine-grained local cues and global skeletal dependencies.
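The multi-scale fusion step can be sketched as follows, under two assumptions that the text does not confirm: the fusion is a plain sum, and the resampling operator is nearest-neighbour. The helper names `resize_nearest` and `fuse_scales` are hypothetical, and all levels are assumed to share the same channel count.

```python
import numpy as np

def resize_nearest(features, out_h, out_w):
    """Nearest-neighbour resampling of an (h, w, d) map to (out_h, out_w, d)."""
    h, w, _ = features.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return features[rows][:, cols]

def fuse_scales(conditioned, out_h, out_w):
    """Sum the resampled, prompt-conditioned maps from all pyramid levels."""
    return sum(resize_nearest(f, out_h, out_w) for f in conditioned)

# Three hypothetical conditioned maps with decreasing resolution.
levels = [np.ones((16, 16, 4)), 2 * np.ones((8, 8, 4)), 3 * np.ones((4, 4, 4))]
fused = fuse_scales(levels, out_h=4, out_w=4)
```

A learned resampling layer or a concatenation followed by a convolution would be equally plausible readings; the sum is used here only because it is the simplest operator consistent with the description.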

3.4. Prompt-Guided Refinement Module (PGR)

To further enforce anatomical coherence during decoding, the Prompt-Guided Refinement module iteratively aligns feature maps with the prompt priors. At each decoder stage $t$, the refined feature $R_t$ is computed as
$$R_t = A_t(F_t, P) = \mathrm{Softmax}\left(\frac{Q_t K_p^{\top}}{\sqrt{d_k}}\right) V_p,$$
where $Q_t = W_q F_t$, $K_p = W_k P$, and $V_p = W_v P$ represent the query, key, and value projections, respectively, and $d_k$ is the feature dimension. This cross-attention mechanism enables contextual alignment between the learned image features and the skeletal prior. Because the keys and values originate from compact prompt embeddings rather than full-resolution feature maps, the attention operation is computationally lightweight and adds less than 2% overhead to the backbone. Moreover, the use of stable, anatomy-encoded prompts reduces gradient noise during refinement, leading to improved optimization stability and more consistent localization across subjects. The attention-weighted representation $R_t$ is fused with the decoder features through residual gating:
$$F_t' = \alpha_t R_t + (1 - \alpha_t) F_t,$$
where $F_t'$ denotes the gated output of stage $t$ and $\alpha_t \in [0, 1]$ is a learnable gate controlling the strength of prompt guidance at each scale.
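The cross-attention and gating steps can be illustrated with a small NumPy example. It is a sketch under stated assumptions rather than the released implementation: the prompt is represented as a small set of tokens (with a single prompt vector the softmax would be trivial), the feature map is flattened to a list of pixels, and all weight matrices are random placeholders for trained projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prompt_cross_attention(feat, prompts, W_q, W_k, W_v):
    """Decoder features act as queries; prompt tokens supply keys and values."""
    Q = feat @ W_q                             # (n_pixels, d_k)
    K = prompts @ W_k                          # (n_prompts, d_k)
    V = prompts @ W_v                          # (n_prompts, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # scaled dot-product scores
    attn = softmax(scores, axis=-1)            # each pixel's weights over prompts
    return attn @ V, attn

def gated_fusion(feat, refined, alpha):
    """Residual gate: alpha * R_t + (1 - alpha) * F_t with alpha in [0, 1]."""
    return alpha * refined + (1.0 - alpha) * feat

rng = np.random.default_rng(0)
n_pixels, n_prompts, d, d_p, d_k = 12, 4, 8, 16, 8   # illustrative sizes
F_t = rng.normal(size=(n_pixels, d))
P = rng.normal(size=(n_prompts, d_p))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d_p, d_k))
W_v = rng.normal(size=(d_p, d))
R_t, attn = prompt_cross_attention(F_t, P, W_q, W_k, W_v)
F_out = gated_fusion(F_t, R_t, alpha=0.3)
```

The score matrix has one column per prompt token rather than one per pixel, which is why this form of attention stays cheap compared with full self-attention over the feature map.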

3.5. Training Objective

The entire model is trained using a hybrid objective function that enforces pixel-wise accuracy, structural consistency, and geometric fidelity.

3.5.1. Localization Loss

The primary supervision is a mean squared error (MSE) loss computed between predicted and ground-truth heatmaps:
$$\mathcal{L}_{loc} = \frac{1}{VM} \sum_{i=1}^{V} \sum_{p=1}^{M} \left\| y_p^i - \hat{y}_p^i \right\|^2,$$
where $M = H \times W$ denotes the number of pixels per map.
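As a quick numerical illustration of the localization loss, the sketch below averages squared errors over $V$ heatmaps and $M = H \times W$ pixels. The function name `localization_loss` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def localization_loss(pred, target):
    """Mean squared error averaged over V heatmaps and M = H * W pixels."""
    pred = np.asarray(pred, dtype=float)       # (V, H, W)
    target = np.asarray(target, dtype=float)   # (V, H, W)
    V = pred.shape[0]
    M = pred.shape[1] * pred.shape[2]
    return float(np.sum((target - pred) ** 2) / (V * M))

target = np.zeros((2, 4, 4))
pred = np.full((2, 4, 4), 0.5)   # every pixel is off by 0.5
loss = localization_loss(pred, target)
```

Here every pixel contributes a squared error of 0.25, so the averaged loss is also 0.25.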

3.5.2. Prompt Consistency Loss

To ensure alignment between latent features and geometric prompts, we introduce a prompt consistency loss:
$$\mathcal{L}_{pc} = \left\| P - \bar{P}_{gt} \right\|_2^2,$$
where $\bar{P}_{gt}$ is the encoded prompt derived from ground-truth skeleton maps.

3.5.3. Skeleton Structural Loss

To preserve the relative distances between predicted disc centroids, we define a pairwise skeleton loss:
$$\mathcal{L}_{sk} = \sum_{i=1}^{V-1} \left\| (\hat{c}_{i+1} - \hat{c}_i) - (c_{i+1}^{gt} - c_i^{gt}) \right\|_2^2,$$
where $\hat{c}_i$ and $c_i^{gt}$ denote the predicted and ground-truth centroids of the $i$th disc, respectively. This term encourages inter-disc spacing and curvature that are consistent with the ground-truth anatomical geometry.
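The pairwise skeleton loss compares displacement vectors between consecutive discs, which the following NumPy sketch makes explicit. The function name `skeleton_loss` and the example centroids are hypothetical.

```python
import numpy as np

def skeleton_loss(pred_centroids, gt_centroids):
    """Sum of squared differences between consecutive-disc displacement vectors."""
    pred = np.asarray(pred_centroids, dtype=float)   # (V, 2)
    gt = np.asarray(gt_centroids, dtype=float)       # (V, 2)
    pred_steps = np.diff(pred, axis=0)               # predicted c_{i+1} - c_i
    gt_steps = np.diff(gt, axis=0)                   # ground-truth c_{i+1} - c_i
    return float(np.sum((pred_steps - gt_steps) ** 2))

gt = [(10.0, 5.0), (20.0, 6.0), (30.0, 8.0)]
shifted = [(r + 3.0, c - 1.0) for r, c in gt]          # rigid translation of gt
stretched = [(10.0, 5.0), (22.0, 6.0), (30.0, 8.0)]    # middle disc moved by 2
loss_shifted = skeleton_loss(shifted, gt)
loss_stretched = skeleton_loss(stretched, gt)
```

Because only differences between neighbours enter the loss, a rigid translation of all predictions is not penalized; in this formulation the absolute position has to be anchored by the localization loss.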

3.5.4. Total Objective

The overall loss is defined as a weighted combination:
$$\mathcal{L}_{total} = \mathcal{L}_{loc} + \lambda_1 \mathcal{L}_{pc} + \lambda_2 \mathcal{L}_{sk},$$
where $\lambda_1$ and $\lambda_2$ control the relative influence of prompt and skeleton constraints.
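A minimal sketch of how the three terms combine is given below. The default weights follow the best-performing setting reported in the hyper-parameter study (0.4 and 0.5); the function names and the toy loss values are assumptions for illustration.

```python
def prompt_consistency_loss(prompt, prompt_gt):
    """Squared L2 distance between the predicted and ground-truth prompt vectors."""
    return sum((p - q) ** 2 for p, q in zip(prompt, prompt_gt))

def total_loss(loc, pc, sk, lambda_1=0.4, lambda_2=0.5):
    """Weighted combination: L_loc + lambda_1 * L_pc + lambda_2 * L_sk."""
    return loc + lambda_1 * pc + lambda_2 * sk

pc = prompt_consistency_loss([1.0, 2.0], [0.0, 0.0])   # 1 + 4 = 5
combined = total_loss(loc=1.0, pc=2.0, sk=4.0)         # 1.0 + 0.8 + 2.0
```

Setting either weight to zero removes the corresponding constraint, which is how the ablation over $\lambda_1$ and $\lambda_2$ can be reproduced in a training script.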

4. Experimental Results

This section describes the datasets, the evaluation metrics, the implementation details, and the results of our experiments.

4.1. Dataset

To evaluate our method’s effectiveness, we conducted experiments on the Spine Generic Dataset [43], a collection of MRI scans spanning 42 centers and 260 participants worldwide. The dataset’s variation in imaging equipment, image resolution, and participant age makes intervertebral disc labeling challenging. T1w and T2w MRI contrasts are provided for each participant, which allows the method to be evaluated on both modalities. Following the preprocessing protocol established in [1,13], all experiments were conducted on 2D representations extracted from the 3D MRI volumes. For each subject, we first identified the sagittal plane and generated a 2D slice by averaging the middle five slices, as intervertebral discs consistently appear within this region. The dataset was then divided into non-overlapping training, validation, and test subsets at the subject level, ensuring that no participant contributed data to more than one split.
During training and inference, the model operated exclusively on these 2D slices. However, for evaluation, the predicted 2D disc locations were re-projected back into the original 3D coordinate system. All quantitative metrics were computed in 3D space using the dataset’s original volumetric annotations.
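The slice-extraction step described above (averaging the middle five sagittal slices) can be sketched as follows. The assumption that the sagittal axis is axis 0 is illustrative only, since the actual axis depends on the orientation of the volume; the function name `mid_sagittal_average` is hypothetical.

```python
import numpy as np

def mid_sagittal_average(volume, sagittal_axis=0, n_slices=5):
    """Average the n_slices central slices along the sagittal axis."""
    n = volume.shape[sagittal_axis]
    start = max(n // 2 - n_slices // 2, 0)
    stop = min(start + n_slices, n)
    block = np.take(volume, np.arange(start, stop), axis=sagittal_axis)
    return block.mean(axis=sagittal_axis)

# Toy volume with 11 sagittal slices of size 4 x 3.
volume = np.arange(11 * 4 * 3, dtype=float).reshape(11, 4, 3)
slice_2d = mid_sagittal_average(volume)
```

For the 11-slice toy volume the function averages slices 3 to 7, which are centered on slice 5.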

4.2. Metrics

To validate the reliability of our comparative study and derive conclusions on the suitability of our approach, we employed a multifaceted set of evaluation metrics. To estimate positional precision, the model’s output coordinates are compared to the ground-truth annotations using the L2 norm computed along the superior–inferior direction.
In addition, to assess the versatility of our post-processing technique, we focused on the False Positive Rate (FPR) and False Negative Rate (FNR) as the key performance indicators. Aligning with [1], FPR encapsulates the percentage of predictions that deviate from the ground truth by at least 5 mm, while FNR quantifies the proportion of ground truth annotations that remain undetected by our method.
Together, these metrics cover both accuracy and robustness: the L2 norm measures the precision of individual predictions, while FPR and FNR measure how consistently the method detects and locates intervertebral discs. This allows a direct comparison with established methods.
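One possible reading of these metrics is sketched below in plain Python. The matching rule is an assumption: each prediction is compared with its nearest ground-truth disc along the superior-inferior axis, a prediction counts as a false positive when that distance is at least 5 mm, and a ground-truth disc counts as a false negative when no prediction lies within 5 mm. The evaluation code of [1] may differ in detail, and the function name `evaluate_discs` is hypothetical.

```python
def evaluate_discs(pred_mm, gt_mm, tol_mm=5.0):
    """pred_mm / gt_mm: non-empty lists of superior-inferior disc coordinates in mm."""
    # Distance from each prediction to its nearest ground-truth disc.
    pred_dist = [min(abs(p - g) for g in gt_mm) for p in pred_mm]
    # Distance from each ground-truth disc to its nearest prediction.
    gt_dist = [min(abs(g - p) for p in pred_mm) for g in gt_mm]
    matched = [d for d in pred_dist if d < tol_mm]
    return {
        "dtt_mm": sum(matched) / len(matched) if matched else float("nan"),
        "fpr_pct": 100.0 * sum(d >= tol_mm for d in pred_dist) / len(pred_mm),
        "fnr_pct": 100.0 * sum(d >= tol_mm for d in gt_dist) / len(gt_mm),
    }

gt = [10.0, 40.0, 70.0, 100.0]
pred = [11.0, 38.0, 71.0, 130.0]   # the last prediction is far from any disc
metrics = evaluate_discs(pred, gt)
```

In this toy case three predictions fall within tolerance, one prediction is a false positive, and the disc at 100 mm is missed, so both rates are 25%.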

4.3. Optimization and Implementation Details

The model is optimized using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$, halved every 20 epochs following a step-decay schedule. Mini-batches of size 2 are used due to the high spatial resolution of MRI scans. To enhance generalization, standard data augmentations (random rotation, elastic deformation, and intensity scaling) are applied. The network converges after approximately 100 epochs, with inference performed on a single NVIDIA RTX 3090 GPU. Our PGR-Net is designed as a plug-and-play module compatible with various CNN or Transformer backbones. The inclusion of prompt guidance adds less than 3% additional parameters while substantially improving the network’s robustness to anatomical variations and imaging artifacts.
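The step-decay schedule described above is simple enough to write out directly. The sketch below uses the stated values (initial rate $1 \times 10^{-4}$, halved every 20 epochs); the function name `step_decay_lr` is hypothetical, and a deep learning framework's built-in step scheduler would normally be used instead.

```python
def step_decay_lr(epoch, base_lr=1e-4, factor=0.5, step=20):
    """Learning rate at a given epoch when the rate is halved every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Learning rate at the first epoch of each 20-epoch block over 100 epochs.
schedule = [step_decay_lr(e) for e in range(0, 100, 20)]
```

Over the roughly 100 training epochs the rate therefore drops four times, ending at one sixteenth of its initial value.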

4.4. Results

Table 1 presents a detailed quantitative comparison of the proposed method against several SOTA approaches for IVD semantic labeling, including Template Matching [16], Countception [13], Pose Estimation [1], Look Once [14], and HCA-Net [44]. The evaluation is performed on both T1-weighted (T1w) and T2-weighted (T2w) MRI modalities of the Spine Generic public dataset.
Across both modalities, our proposed PGR-Net consistently achieves the lowest distance to target (DTT) and the lowest false negative rate (FNR), demonstrating clear superiority over competing methods.
In the T1w modality, PGR-Net attains an average DTT of 1.17 mm, outperforming HCA-Net by approximately 1.7% and the pose estimation approach [1] by nearly 11.3%. The standard deviation is reduced to ±1.05 mm, indicating stable and consistent localization across all cases. Furthermore, our model achieves an FNR of only 0.27%, improving upon HCA-Net’s 0.3% and outperforming Countception [13] and Template Matching [16] by more than 4–8%. The false positive rate (FPR) remains at 0.0%, on par with the best-performing alternatives. This demonstrates that PGR-Net achieves highly accurate disc localization without introducing spurious detections.
In the T2w modality, PGR-Net achieves a DTT of 1.24 mm, improving upon HCA-Net’s 1.26 mm by approximately 1.6% and showing more than 30% relative improvement over Countception [13] and Template Matching [16]. Similarly, the FNR drops to 0.58%, a further 4.9% relative reduction from HCA-Net [44] and almost 50% reduction compared to Pose Estimation [1]. The FPR again remains at 0.0%, confirming the method’s robustness in suppressing false detections even under intensity variations inherent to T2w imaging.
We also compare our method with the recent graph-based method [45] and Transformer-based baseline, Swin-UNet [46], which incorporates hierarchical self-attention and shifted window mechanisms and represents a more modern class of encoder–decoder architectures. Despite its strong performance in natural image segmentation tasks, Swin-UNet exhibits noticeably weaker robustness on spinal MRI, achieving a DTT of 1.44 mm on T1w and 1.86 mm on T2w. The method also produces substantially higher FNR and FPR values across both modalities, indicating difficulties in capturing the fine-grained geometry of intervertebral discs under varying contrast conditions. In contrast, PGR-Net leverages explicit skeletal prompts and multi-scale conditioning, enabling it to outperform Swin-UNet by 18–33% in localization precision and by large margins in error rates. This further demonstrates that Transformer-based global attention alone is insufficient for stable IVD localization without incorporating the type of geometry-aware refinement introduced in our framework.
Qualitative results, shown in Figure 2, visually confirm these quantitative gains. Compared to the Pose Estimation approach [1], which fails to correctly identify one disc in the T1w modality, our PGR-Net successfully detects all intervertebral locations with precise alignment to the ground truth. The predictions exhibit well-defined disc boundaries and accurate spatial ordering, reflecting the network’s ability to generalize across subjects and modalities.
Overall, PGR-Net achieves a clear and consistent margin over all baselines in both T1w and T2w scans. Relative to prior best results (HCA-Net), our model improves the mean DTT by approximately 1–2%, reduces FNR by 5–10%, and maintains zero FPR across modalities. Compared with earlier CNN-based frameworks such as Countception and Template Matching, the improvements are far more substantial, exceeding 30% in localization precision. These results confirm that the proposed framework performs accurate and reliable IVD semantic labeling on the Spine Generic dataset.

5. Ablation Study

To rigorously assess the effectiveness of each design component in our framework, three ablation studies were carried out. The first examines the contribution of each key module within PGR-Net, including the Multi-Scale Prompt Conditioner (MSPC), Prompt-Guided Refinement (PGR) module, and Skeleton Structural Loss ($\mathcal{L}_{sk}$). The second investigates the effect of the hyper-parameters $\lambda_1$ and $\lambda_2$ in the total loss formulation (Equation (18)), analyzing their influence on model convergence and accuracy. Finally, the third study focuses on the distributional stability of localization performance, evaluating prediction variance and robustness across samples and modalities.

5.1. Impact of Each Module

The results in Table 2 clearly demonstrate that each module contributes meaningfully to the overall performance. Starting from the baseline stacked hourglass network, which yields a DTT of 1.45 mm on T1w and 1.80 mm on T2w, the inclusion of the Multi-Scale Prompt Conditioner (MSPC) substantially enhances the model’s ability to capture global anatomical cues, reducing DTT by approximately 9.6% on T1w and 25.6% on T2w. The FNR also drops by nearly 70%, confirming that the prompt conditioning effectively integrates geometric information about the vertebral column into the feature representation.
When the Prompt-Guided Refinement (PGR) module is introduced, the model achieves a further reduction in DTT to 1.23 mm (T1w) and 1.28 mm (T2w). This improvement of approximately 6% demonstrates that the cross-attention mechanism between the feature maps and the prompt embeddings facilitates adaptive refinement, enabling the network to correct ambiguous predictions near intervertebral boundaries. The FNR is reduced to below 1% in both modalities, indicating a stronger capability for disc localization under challenging contrast variations.
Finally, integrating the Skeleton Structural Loss $\mathcal{L}_{sk}$ yields the best performance, with a DTT of 1.17 mm on T1w and 1.24 mm on T2w and FNRs of 0.27% and 0.58%, respectively. This demonstrates that $\mathcal{L}_{sk}$ effectively regularizes spatial relationships between adjacent discs, ensuring anatomically plausible predictions that respect the vertebral column’s curvature. Notably, this addition eliminates all false positives (FPR = 0.0%) across both modalities, confirming its stabilizing effect on network inference.
Overall, the ablation results confirm that each proposed component contributes in a complementary manner to the final performance of PGR-Net. The gradual improvement across configurations—from baseline to full model—highlights the importance of combining prompt-based geometric conditioning, cross-attentive refinement, and structural consistency enforcement. This synergy leads to superior precision and robustness in intervertebral disc semantic labeling across diverse MRI modalities.
To provide a more comprehensive understanding of how the proposed components behave beyond a purely sequential pipeline, we additionally evaluated several cross-module configurations. These include PGR only, MSPC + $\mathcal{L}_{sk}$, and PGR + $\mathcal{L}_{sk}$, which isolate different interaction pathways between geometric conditioning, refinement attention, and structural regularization. As shown in the extended ablation table, each module contributes positively when used independently, but their combination yields improvements that are consistently larger than the sum of individual effects. This confirms that MSPC supplies global anatomical structure that both PGR and $\mathcal{L}_{sk}$ refine and stabilize, while PGR and $\mathcal{L}_{sk}$ further enhance localization accuracy when MSPC is present. These additional experiments demonstrate that the modules interact in a complementary manner rather than a simple linear progression, offering a clearer view of the synergy that leads to the final performance of PGR-Net.
It is also worth noting that the geometric prompts operate as an auxiliary refinement signal rather than a primary prediction source; they provide coarse skeletal structure that stabilizes attention without making the model critically dependent on perfectly estimated poses, ensuring that disc localization still relies predominantly on the underlying image features.

5.2. Hyper-Parameter Effect

The overall optimization objective of PGR-Net is defined as
L t o t a l = L l o c + λ 1 L p c + λ 2 L s k ,
where L l o c denotes the localization loss, L p c the prompt-conditioning term, and L s k the skeleton regularization term enforcing geometric consistency. The coefficients λ 1 and λ 2 control the relative importance of prompt guidance and structural regularization in the joint optimization process. Understanding how these hyper-parameters affect convergence and accuracy is essential for ensuring both stability and generalization.
To analyze their effect, we trained PGR-Net with varying $\lambda_1$ and $\lambda_2$ values while keeping all other settings constant. Table 3 presents the results on the T2-weighted modality. As observed, the model achieves the best performance when $\lambda_1 = 0.4$ and $\lambda_2 = 0.5$, striking a balance between the feature alignment guidance from $\mathcal{L}_{pc}$ and the geometric regularization imposed by $\mathcal{L}_{sk}$. Smaller values of $\lambda_1$ reduce the influence of prompt conditioning, causing slight degradation in localization precision due to weaker global contextual encoding. Conversely, excessively large $\lambda_1$ overemphasizes prompt alignment, limiting the network’s flexibility to adapt to local structural variations.
Mathematically, the total gradient with respect to the learnable parameters $\theta$ can be expressed as
$$\nabla_\theta \mathcal{L}_{total} = \nabla_\theta \mathcal{L}_{loc} + \lambda_1 \nabla_\theta \mathcal{L}_{pc} + \lambda_2 \nabla_\theta \mathcal{L}_{sk}.$$
An optimal configuration minimizes the interference among gradient directions from the three terms, i.e.,
$$\nabla_\theta \mathcal{L}_{loc} \cdot \nabla_\theta \mathcal{L}_{pc} > 0 \quad \text{and} \quad \nabla_\theta \mathcal{L}_{loc} \cdot \nabla_\theta \mathcal{L}_{sk} > 0,$$
indicating that all losses cooperate toward the same optimization direction. Empirically, the selected pair $(\lambda_1, \lambda_2) = (0.4, 0.5)$ achieves this balance, minimizing destructive gradient interference and leading to lower DTT and FNR values. This analysis highlights the role of proper weighting in harmonizing prompt-guided refinement with anatomical regularization.
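The gradient-alignment condition can be checked numerically. The sketch below assumes each loss gradient has been flattened into a plain list of floats; the vectors and helper names (`dot`, `cosine`, `total_gradient`) are hypothetical and are not taken from the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; positive values mean the gradients point the same way."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def total_gradient(g_loc, g_pc, g_sk, lam1=0.4, lam2=0.5):
    """Combine the gradients as g_loc + lam1 * g_pc + lam2 * g_sk."""
    return [a + lam1 * b + lam2 * c for a, b, c in zip(g_loc, g_pc, g_sk)]


# Hypothetical flattened gradients (illustration only).
g_loc = [1.0, 2.0, 0.0]
g_pc = [0.5, 1.0, 1.0]
g_sk = [2.0, -0.5, 0.0]

# Both dot products must be positive for the losses to cooperate.
aligned = dot(g_loc, g_pc) > 0 and dot(g_loc, g_sk) > 0
g_total = total_gradient(g_loc, g_pc, g_sk)
print(aligned, g_total)
```

Cosine similarity is included because it is scale-free, which makes alignment comparable across training steps where gradient magnitudes differ.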
Overall, the empirical and theoretical observations jointly confirm that moderate regularization weights yield optimal performance by ensuring synergy among the three loss components. This balance allows PGR-Net to maintain anatomical integrity while remaining adaptive to inter-patient and modality variations.
In addition to the sequential ablations presented above, we expanded the analysis to better reflect the interaction dynamics among the proposed components. While the original configuration order already reveals how each module incrementally enhances the network, we now include additional cross-combinations to demonstrate that the behavior of the modules is not purely linear. These extended results show that MSPC and PGR can each improve performance independently, but their combination yields a larger gain than the sum of the individual improvements, confirming a complementary effect rather than isolated contributions. Likewise, adding the skeleton regularization term $\mathcal{L}_{sk}$ consistently stabilizes predictions across all configurations.

Distributional Stability Analysis on T2-Weighted MRI

To further evaluate the robustness of localization precision, we analyze the empirical distribution of the distance to target (DTT) on the T2-weighted modality, visualized in Figure 3. Each dot represents an individual case-level prediction error, while larger markers indicate the mean DTT per method. Compared to all baselines, PGR-Net exhibits a pronounced contraction of the DTT distribution around a smaller mean, demonstrating both higher accuracy and reduced variance across samples.
Mathematically, for a set of $N$ test cases $\{x_i\}_{i=1}^{N}$, the average localization error (DTT) for a given model $m$ is defined as
$$\bar{d}_m = \frac{1}{N} \sum_{i=1}^{N} \left\| \hat{y}_i^{(m)} - y_i \right\|_2,$$
where $\hat{y}_i^{(m)}$ and $y_i$ denote the predicted and ground-truth intervertebral disc coordinates, respectively. The variance of localization error is then
$$\sigma_m^2 = \frac{1}{N} \sum_{i=1}^{N} \left( \left\| \hat{y}_i^{(m)} - y_i \right\|_2 - \bar{d}_m \right)^2.$$
Lower values of both $\bar{d}_m$ and $\sigma_m^2$ correspond to better localization precision and higher reliability. As shown in our experiments, PGR-Net achieves $\bar{d}_{PGR} = 1.24$ mm with $\sigma_{PGR} = 2.09$ mm, compared to $\bar{d}_{HCA} = 1.26$ mm and $\sigma_{HCA} = 2.16$ mm for HCA-Net. Despite the marginal difference in mean DTT, the reduction in standard deviation of approximately 3.2% (about 6.4% in variance) indicates that PGR-Net produces more stable and predictable localization across subjects.
From a statistical standpoint, this reduction in variance reflects a tighter predictive error distribution. Let $\epsilon_i = \| \hat{y}_i^{(m)} - y_i \|_2$ represent the random localization error. Under the assumption $\mathbb{E}[\epsilon_i] = \bar{d}_m$ and $\mathrm{Var}[\epsilon_i] = \sigma_m^2$, the tighter concentration of $\epsilon_i$ around $\bar{d}_m$ implies a smaller expected deviation, $\mathbb{E}[|\epsilon_i - \bar{d}_m|]$, which suggests more consistent behavior across unseen subjects and varying spine geometries. Practically, this means that PGR-Net not only minimizes the average positional error but also avoids unpredictable deviations caused by modality shifts or morphological variations in intervertebral discs.
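The mean and variance of the localization error can be computed as in the following minimal sketch. The coordinates are hypothetical 2D points in millimetres, and `dtt_stats` is an illustrative helper rather than the authors' evaluation code. The final lines compare the reported spreads of HCA-Net and PGR-Net on T2 (2.16 and 2.09), treating them as standard deviations, which is an assumption.

```python
import math


def dtt_stats(preds, targets):
    """Return (mean, population variance) of Euclidean errors between point pairs."""
    errors = [math.dist(p, t) for p, t in zip(preds, targets)]
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / n  # 1/N normalization, as in the text
    return mean, var


# Hypothetical predicted and ground-truth disc coordinates in mm (illustration only).
preds = [(0.0, 3.0), (4.0, 0.0), (1.0, 1.0)]
targets = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
mean_dtt, var_dtt = dtt_stats(preds, targets)

# Relative reduction in spread, assuming the reported values are standard deviations.
sigma_hca, sigma_pgr = 2.16, 2.09
std_reduction = 1 - sigma_pgr / sigma_hca
var_reduction = 1 - (sigma_pgr / sigma_hca) ** 2
print(f"std reduction: {std_reduction:.1%}, variance reduction: {var_reduction:.1%}")
```

Because variance scales with the square of the standard deviation, the relative reduction in variance is roughly twice the relative reduction in standard deviation when the change is small.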
The dense clustering of points in Figure 3 visually reinforces this quantitative evidence: the majority of PGR-Net’s predictions fall within a narrow low-error band (<1.5 mm), while alternative methods such as Countception or Pose Estimation show wide error dispersion and numerous outliers. Consequently, the combination of prompt-guided refinement and geometric conditioning in PGR-Net yields not only superior mean accuracy but also more stable spatial localization, which is critical for reproducible clinical assessment.

5.3. Statistical Significance of the Improvements

To assess whether the performance gains of the proposed PGR-Net are consistent and not attributable to random variation, we performed a paired two-tailed t-test comparing our method against the strongest baseline, HCA-Net [44]. Since both models were evaluated on the same test subjects, a paired design provides a more sensitive measure of systematic improvements.
Let $e_i^{(PGR)}$ and $e_i^{(HCA)}$ denote the subject-wise DTT errors for PGR-Net and HCA-Net for subject $i \in \{1, \ldots, N\}$. The paired difference for each subject is defined as
$$d_i = e_i^{(PGR)} - e_i^{(HCA)}.$$
The paired t-statistic is computed as
$$t = \frac{\bar{d}}{s_d / \sqrt{N}},$$
where
$$\bar{d} = \frac{1}{N} \sum_{i=1}^{N} d_i, \qquad s_d = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (d_i - \bar{d})^2}.$$
Under the null hypothesis $H_0$ that the mean paired difference is zero ($\mathbb{E}[d_i] = 0$), the statistic $t$ follows a Student’s t-distribution with $N - 1$ degrees of freedom.
Using the subject-wise DTT values underlying Table 1, the paired t-test demonstrates that PGR-Net achieves statistically significant improvements over HCA-Net on both modalities:
$$\text{T1-weighted: } p = 0.041, \qquad \text{T2-weighted: } p = 0.038.$$
These results confirm that the observed reductions in DTT are consistent across subjects and unlikely to arise from chance, thereby reinforcing the robustness and reliability of the proposed prompt-guided refinement mechanism.
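The paired t-statistic defined above can be reproduced with a short sketch. The per-subject errors below are hypothetical placeholders, since the subject-wise values are not listed in the article, and `paired_t_statistic` is an illustrative helper name.

```python
import math


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(errors_a) != len(errors_b):
        raise ValueError("paired samples must have the same length")
    d = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(d)
    d_bar = sum(d) / n
    # Sample standard deviation of the differences (N - 1 in the denominator).
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Hypothetical per-subject DTT errors in mm (illustration only, not the paper's data).
pgr = [1.0, 1.2, 0.9, 1.1]
hca = [1.2, 1.3, 1.2, 1.3]
t_stat, dof = paired_t_statistic(pgr, hca)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

A negative statistic indicates lower error for the first method. The two-tailed p-value follows from the Student's t-distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` computes both the statistic and the p-value directly.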

5.4. Limitations

Although the proposed PGR-Net demonstrates strong localization performance, several limitations remain. As visualized in Figure 4, the attention maps reveal that the model is generally precise in focusing on intervertebral disc regions, with high-probability responses tightly concentrated around the annotated centers. However, the same visualization also exposes cases in which the attention distribution becomes more diffuse, indicating lower confidence in challenging anatomical zones or under reduced contrast. In such cases, the predicted disc locations may slightly deviate from the ground-truth landmarks, reflecting the model’s sensitivity to subtle variations in disc morphology, motion artifacts, or signal inhomogeneity. While these deviations remain small in magnitude, they highlight the need for future improvements in robustness, particularly under difficult imaging conditions or patient-specific anatomical variations.

6. Conclusions

We presented PGR-Net, a geometry-aware framework for intervertebral disc (IVD) semantic labeling in MRI. By embedding vertebral skeleton information through prompt-guided conditioning, PGR-Net achieves anatomically consistent and highly precise localization across both T1w and T2w modalities.
On the Spine Generic dataset, our method outperformed all prior approaches, reducing the average distance to target (DTT) by 1–2% compared to HCA-Net and by over 30% relative to earlier CNN-based models. It also maintained a zero false positive rate and the lowest false negative rate among all competitors, confirming its robustness and reliability.
The ablation study verified that each proposed component—multi-scale prompt conditioning, refinement attention, and skeleton regularization—contributes meaningfully to overall performance. In summary, PGR-Net establishes a new state of the art for geometry-aware IVD labeling and provides a strong foundation for future extensions to 3D volumetric segmentation and multi-modal spine analysis.
Despite these strengths, certain limitations remain: the current framework operates on 2D axial slices and may show reduced confidence in cases of severe disc herniation or highly irregular anatomy, where the geometric cues become less reliable. Future work will explore 3D extensions and deformity-aware priors to further improve robustness under challenging clinical conditions.

Author Contributions

Conceptualization, M.N.A.; Methodology, M.N.A.; Software, M.D.A.; Validation, M.D.A.; Writing—original draft, M.N.A.; Writing—review & editing, M.D.A.; Visualization, M.D.A.; Funding acquisition, M.N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-25-DR-563). Therefore, the authors thank the University of Jeddah for its technical and financial support.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Azad, R.; Rouhier, L.; Cohen-Adad, J. Stacked Hourglass Network with a Multi-level Attention Mechanism: Where to Look for Intervertebral Disc Labeling. In Machine Learning in Medical Imaging, Proceedings of the 12th International Workshop, MLMI 2021, Strasbourg, France, 27 September 2021; Springer: Cham, Switzerland, 2021; pp. 406–415.
  2. Urban, J.P.; Roberts, S. Degeneration of the intervertebral disc. Arthritis Res. Ther. 2003, 5, 120.
  3. Chen, C.; Belavy, D.; Yu, W.; Chu, C.; Armbrecht, G.; Bansmann, M.; Felsenberg, D.; Zheng, G. Localization and segmentation of 3D intervertebral discs in MR images by data driven estimation. IEEE Trans. Med. Imaging 2015, 34, 1719–1729.
  4. Glocker, B.; Feulner, J.; Criminisi, A.; Haynor, D.R.; Konukoglu, E. Automatic localization and identification of vertebrae in arbitrary field-of-view CT scans. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 15th International Conference, Nice, France, 1–5 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 590–598.
  5. Ayed, I.B.; Punithakumar, K.; Garvin, G.; Romano, W.; Li, S. Graph cuts with invariant object-interaction priors: Application to intervertebral disc segmentation. In Information Processing in Medical Imaging, Proceedings of the 22nd International Conference, IPMI 2011, Kloster Irsee, Germany, 3–8 July 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 221–232.
  6. Cheng, Y.K.; Lin, C.L.; Huang, Y.C.; Chen, J.C.; Lan, T.P.; Lian, Z.Y.; Chuang, C.H. Automatic Segmentation of Specific Intervertebral Discs through a Two-Stage MultiResUNet Model. J. Clin. Med. 2021, 10, 4760.
  7. Ji, X.; Zheng, G.; Belavy, D.; Ni, D. Automated intervertebral disc segmentation using deep convolutional neural networks. In Computational Methods and Clinical Applications for Spine Imaging, Proceedings of the 4th International Workshop and Challenge, CSI 2016, Athens, Greece, 17 October 2016; Springer: Cham, Switzerland, 2016; pp. 38–48.
  8. Dolz, J.; Desrosiers, C.; Ayed, I.B. IVD-Net: Intervertebral disc localization and segmentation in MRI with a multi-modal UNet. In Computational Methods and Clinical Applications for Spine Imaging, Proceedings of the 5th International Workshop and Challenge, CSI 2018, Granada, Spain, 16 September 2018; Springer: Cham, Switzerland, 2018; pp. 130–143.
  9. Mbarki, W.; Bouchouicha, M.; Frizzi, S.; Tshibasu, F.; Farhat, L.B.; Sayadi, M. Lumbar spine discs classification based on deep convolutional neural networks using axial view MRI. Interdiscip. Neurosurg. 2020, 22, 100837.
  10. Vania, M.; Lee, D. Intervertebral disc instance segmentation using a multistage optimization mask-RCNN (MOM-RCNN). J. Comput. Des. Eng. 2021, 8, 1023–1036.
  11. Wimmer, M.; Major, D.; Novikov, A.A.; Bühler, K. Fully automatic cross-modality localization and labeling of vertebral bodies and intervertebral discs in 3D spinal images. Int. J. Comput. Assist. Radiol. Surg. 2018, 13, 1591–1603.
  12. Liu, L.; Wolterink, J.M.; Brune, C.; Veldhuis, R.N. Anatomy-aided deep learning for medical image segmentation: A review. Phys. Med. Biol. 2021, 66, 11TR01.
  13. Rouhier, L.; Romero, F.P.; Cohen, J.P.; Cohen-Adad, J. Spine intervertebral disc labeling using a fully convolutional redundant counting model. arXiv 2020, arXiv:2003.04387.
  14. Azad, R.; Heidari, M.; Cohen-Adad, J.; Adeli, E.; Merhof, D. Intervertebral Disc Labeling With Learning Shape Information, A Look Once Approach. In Predictive Intelligence in Medicine, Proceedings of the 5th International Workshop, PRIME 2022, Singapore, 22 September 2022; Springer: Cham, Switzerland, 2022.
  15. Chen, C.; Xie, W.; Franke, J.; Grutzner, P.; Nolte, L.P.; Zheng, G. Automatic X-ray landmark detection and shape segmentation via data-driven joint estimation of image displacements. Med. Image Anal. 2014, 18, 487–499.
  16. Ullmann, E.; Pelletier Paquette, J.F.; Thong, W.E.; Cohen-Adad, J. Automatic labeling of vertebral levels using a robust template-based approach. Int. J. Biomed. Imaging 2014, 2014, 719520.
  17. Azad, R.; Kazerouni, A.; Azad, B.; Khodapanah Aghdam, E.; Velichko, Y.; Bagci, U.; Merhof, D. Laplacian-former: Overcoming the limitations of vision transformers in local texture detection. In Medical Image Computing and Computer-Assisted Intervention, Proceedings of the 26th International Conference, Vancouver, BC, Canada, 8–12 October 2023; Springer: Cham, Switzerland, 2023; pp. 736–746.
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  20. Senthilkumaran, N.; Vaithegi, S. Image segmentation by using thresholding techniques for medical images. Comput. Sci. Eng. Int. J. 2016, 6, 1–13.
  21. Zhao, Y.; Gui, W.; Chen, Z.; Tang, J.; Li, L. Medical images edge detection based on mathematical morphology. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 17–18 January 2006; pp. 6492–6495.
  22. Yang, Y.; Wang, M.; Ma, L.; Zhang, X.; Zhang, K.; Zhao, X.; Teng, Q.; Liu, H. Cervical Intervertebral Disc Segmentation Based on Multi-Scale Information Fusion and Its Application. Electronics 2024, 13, 432.
  23. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
  24. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
  25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  26. Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160.
  27. Alryalat, S.A.; Al-Antary, M.; Arafa, Y.; Azad, B.; Boldyreff, C.; Ghnaimat, T.; Al-Antary, N.; Alfegi, S.; Elfalah, M.; Abu-Ameerh, M. Deep learning prediction of response to anti-VEGF among diabetic macular edema patients: Treatment response analyzer system (TRAS). Diagnostics 2022, 12, 312.
  28. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  29. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241.
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  31. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  32. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341.
  33. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
  34. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
  35. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
  36. Hou, C.; Zhang, W.; Wang, H.; Liu, F.; Liu, D.; Chang, J. A semantic segmentation model for lumbar MRI images using divergence loss. Appl. Intell. 2023, 53, 12063–12076.
  37. Altun, İ.; Altun, S.; Alkan, A. LSS-UNET: Lumbar spinal stenosis semantic segmentation using deep learning. Multimed. Tools Appl. 2023, 82, 41287–41305.
  38. Li, Z.; Zhou, X.; Tong, T. A Two-Stage Network for Segmentation of Vertebrae and Intervertebral Discs: Integration of Efficient Local-Global Fusion Using 3D Transformer and 2D CNN. In Neural Information Processing, Proceedings of the 30th International Conference, ICONIP 2023, Changsha, China, 20–23 November 2023; Springer: Singapore, 2023; pp. 467–479.
  39. Satpute, S.; Manza, R.; Manza, G.; Shaikh, A. Localization of Intervertebral Discs Using Deep-Learning and Region Growing Technique. In Advances in Intelligent Systems Research, Proceedings of the First International Conference on Advances in Computer Vision and Artificial Intelligence Technologies (ACVAIT 2022), Aurangabad, India, 1–2 August 2022; Atlantis Press: Dordrecht, The Netherlands, 2023; pp. 88–98.
  40. Sáenz-Gamboa, J.J.; Domenech, J.; Alonso-Manjarrés, A.; Gómez, J.A.; de la Iglesia-Vayá, M. Automatic semantic segmentation of the lumbar spine: Clinical applicability in a multi-parametric and multi-center study on magnetic resonance images. Artif. Intell. Med. 2023, 140, 102559.
  41. Xu, Y.; Su, H.; Ma, G.; Liu, X. A novel dual-modal emotion recognition algorithm with fusing hybrid features of audio signal and speech context. Complex Intell. Syst. 2023, 9, 951–963.
  42. Zhou, X.; Qi, W.; Ovur, S.E.; Zhang, L.; Hu, Y.; Su, H.; Ferrigno, G.; De Momi, E. A novel muscle-computer interface for hand gesture recognition using depth vision. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 5569–5580.
  43. Cohen-Adad, J.; Alonso-Ortiz, E.; Abramovic, M.; Arneitz, C.; Atcheson, N.; Barlow, L.; Barry, R.L.; Barth, M.; Battiston, M.; Büchel, C.; et al. Open-access quantitative MRI data of the spinal cord and reproducibility across participants, sites and manufacturers. Sci. Data 2021, 8, 219.
  44. Bozorgpour, A.; Azad, B.; Azad, R.; Velichko, Y.; Bagci, U.; Merhof, D. HCA-Net: Hierarchical context attention network for intervertebral disc semantic labeling. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5.
  45. Silvoster, L.; Kumar, R.M.S. Graph cut-based segmentation for intervertebral disc in human MRI. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 2025, 13, 2475992.
  46. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218.
Figure 1. An overview of our proposed method for intervertebral disc semantic labeling. Our approach integrates the global skeleton distribution through a prompt encoder to condition the latent space based on intervertebral disc information.
Figure 2. The proposed method’s qualitative results for detecting intervertebral disc locations on MRI T1 modality are depicted. The model effectively predicts the location of each intervertebral disc and semantically separates them by labeling with different colors.
Figure 3. Distribution of localization error (DTT) on T2-weighted MRI. Each point represents one case, and large markers denote mean values per method. PGR-Net achieves both the lowest average DTT and visibly reduced variance, demonstrating improved precision and stability.
Figure 4. Attention map visualization of PGR-Net across multiple subjects for highlighting the limitations.
Table 1. Performance analysis of intervertebral disc segmentation on the Spine Generic dataset. DTT refers to the Distance to Target metric. The proposed PGR-Net achieves consistent improvements over previous methods in both T1 and T2 modalities.
| Method | T1 DTT (mm) | T1 FNR (%) | T1 FPR (%) | T2 DTT (mm) | T2 FNR (%) | T2 FPR (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Template Matching [16] | 1.97 (±4.08) | 8.1 | 2.53 | 2.05 (±3.21) | 11.1 | 2.11 |
| Countception [13] | 1.03 (±2.81) | 4.24 | 0.9 | 1.78 (±2.64) | 3.88 | 1.5 |
| Pose Estimation [1] | 1.32 (±1.33) | 0.32 | 0.0 | 1.31 (±2.79) | 1.2 | 0.6 |
| Look Once [14] | 1.20 (±1.90) | 0.7 | 0.0 | 1.28 (±2.61) | 0.9 | 0.0 |
| Graph-method [45] | 1.84 (±1.31) | 7.8 | 2.1 | 2.07 (±2.95) | 12.01 | 2.7 |
| Swin-Net [46] | 1.44 (±1.22) | 1.3 | 0.4 | 1.86 (±3.10) | 4.61 | 1.8 |
| HCA-Net [44] | 1.19 (±1.08) | 0.3 | 0.0 | 1.26 (±2.16) | 0.61 | 0.0 |
| Proposed PGR-Net (Ours) | 1.17 (±1.05) | 0.27 | 0.0 | 1.24 (±2.09) | 0.58 | 0.0 |
Table 2. Ablation study of the proposed PGR-Net on the Spine Generic dataset. DTT: Distance to Target (mm). Each component incrementally improves performance, demonstrating the complementary nature of the proposed modules.
| Configuration | T1 DTT | T1 FNR (%) | T1 FPR (%) | T2 DTT | T2 FNR (%) | T2 FPR (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline (Hourglass) | 1.45 | 7.3 | 1.2 | 1.80 | 5.4 | 1.8 |
| + PGR only | 1.39 | 2.8 | 0.8 | 1.42 | 2.4 | 0.9 |
| + MSPC only | 1.31 | 1.9 | 0.6 | 1.34 | 1.6 | 0.3 |
| + MSPC + $\mathcal{L}_{sk}$ | 1.26 | 1.1 | 0.4 | 1.30 | 0.9 | 0.2 |
| + MSPC + PGR | 1.23 | 0.9 | 0.3 | 1.28 | 0.8 | 0.2 |
| + PGR + $\mathcal{L}_{sk}$ | 1.25 | 0.8 | 0.3 | 1.31 | 0.7 | 0.2 |
| + MSPC + PGR + $\mathcal{L}_{sk}$ (Full Model) | 1.17 | 0.27 | 0.0 | 1.24 | 0.58 | 0.0 |
Table 3. Effect of $\lambda_1$ and $\lambda_2$ on model performance (T2w modality). The optimal setting is the balanced configuration ($\lambda_1 = 0.4$, $\lambda_2 = 0.5$).
| $\lambda_1$ | $\lambda_2$ | Configuration | DTT (mm) | FNR (%) |
| --- | --- | --- | --- | --- |
| 0.2 | 0.5 | Weak prompt conditioning | 1.29 | 0.69 |
| 0.4 | 0.3 | Under-regularized | 1.27 | 0.67 |
| 0.4 | 0.5 | Balanced configuration | 1.24 | 0.61 |
| 0.6 | 0.5 | Strong prompt constraint | 1.26 | 0.63 |
| 0.4 | 0.7 | Over-regularized skeleton | 1.27 | 0.65 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alharbi, M.N.; Alahmadi, M.D. Prompt-Guided Refinement: A Novel Technique for Improving Intervertebral Disc Semantic Labeling. Mathematics 2025, 13, 3944. https://doi.org/10.3390/math13243944
