1. Introduction
The human vertebral column serves as the central axis of the body, providing essential support for the upper body and protecting the spinal cord. Abnormalities in vertebrae and intervertebral discs (IVDs) are among the primary causes of neck and lower back pain, with the latter ranking as the second most common neurological condition globally after headaches [
1,
2]. This widespread issue disproportionately affects adolescents and middle-aged adults, prompting significant research attention toward the cervical and lumbar regions of the spine for effective diagnosis and intervention.
Recent studies have emphasized the importance of analyzing disc-specific characteristics such as size, shape, and water content to support clinical decision-making. Traditionally, radiologists manually annotate magnetic resonance (MR) images to label vertebral structures. However, this manual approach is labor-intensive, prone to errors, and highly dependent on the radiologist’s expertise. Labeling often relies on anatomical atlases, but variability in spine shape and disc appearance makes this difficult [
3].
One of the major challenges in this domain is the development of reliable Computer-Aided Diagnosis (CAD) systems tailored for the cervical and lumbar spine. Fully automated methods struggle because nearby structures in MRI/CT scans often have similar intensities and may appear merged. Despite these challenges, CAD systems have demonstrated great promise in the early detection of various conditions, including vertebral fractures, disc degeneration, and other pathologies such as breast cancer and intracranial aneurysms [
4]. These advances are driven by interdisciplinary innovations in medical imaging, artificial intelligence, and software engineering, which play a pivotal role in designing robust, scalable diagnostic tools. Given the superior soft tissue contrast offered by T1- and T2-weighted MRI imaging, this study focuses on cervical spine labeling using these modalities.
To address the limitations of existing approaches, we propose MSA-Net, a novel Multi-Scale Attention Network specifically designed for robust semantic labeling of intervertebral discs in MR images. Unlike conventional methods that rely on disc detection followed by post-labeling, MSA-Net directly performs end-to-end semantic labeling by leveraging spatial attention mechanisms and prior geometric knowledge. Furthermore, we integrate contrastive learning to ensure feature consistency and improve the model’s ability to distinguish discs from surrounding anatomical structures. Extensive experiments on multi-center T1w and T2w datasets demonstrate that MSA-Net achieves superior performance and generalizability compared to state-of-the-art methods.
2. Related Work
Accurately labeling and analyzing intervertebral discs (IVDs) is essential to medical image analysis. Various techniques have been developed for IVD segmentation and localization. Conventional feature engineering approaches involving methods such as thresholding [
5], edge detection [
6], and heuristic rules [
7] were commonly employed for IVD segmentation. Thresholding-based approaches rely on global or adaptive intensity thresholds to isolate disc regions, but they assume a clear contrast difference between disc tissue and surrounding structures, which does not consistently hold in spine MRI. Edge detection-based methods extract disc contours using gradient operators or morphological filters, but they are highly sensitive to noise, weak boundaries, and anatomical variability. Heuristic rule-based systems incorporate handcrafted constraints such as expected disc positions or intensity patterns; while these heuristics may reduce false detections, they require manual tuning and lack scalability across diverse patient populations. These approaches, commonly involving iterations and demanding computational resources, usually necessitated manual intervention to enhance precision [
8].
With the advent of deep learning, the focus shifted towards more automated and data-driven techniques. Recent research has emphasized the application of deep convolutional neural networks (CNNs) for IVD segmentation, demonstrating their superiority over traditional techniques. Chen et al. [
9] introduced a 3D Fully Convolutional Network (FCN) to extract center coordinates and segment IVDs. Their approach leveraged volumetric receptive fields to jointly detect all discs in a 3D scan, improving robustness compared to 2D slice-based methods. However, the method demands substantial GPU memory and does not explicitly enforce anatomical ordering along the spine. Ji et al. [
10] employed a standard CNN with a patch-based approach to investigate the influence of different patch sampling strategies on network performance and achieved comparable results with state-of-the-art methods. Patch-based learning reduces computational cost and allows localized feature extraction, but because patches are processed independently, the method may produce inconsistent segmentations and fails to incorporate long-range anatomical context.
Zeng and Zheng [
11] incorporated deeply supervised multi-scale fully CNNs to address the risk of gradient loss during training for IVD segmentation on three-dimensional (3D) T2-weighted images. By applying supervision at intermediate layers, their model improves feature refinement across multiple scales and enhances boundary detection, especially for discs that vary in size. Yet, the framework still relies on pixel-wise segmentation and does not model disc identity or ordering explicitly. Forsberg et al. [
12] utilized clinically annotated spine labels in distinct pipelines for cervical and lumbar MR, employing two configured CNNs for vertebra localization. This region-specific strategy demonstrates that diagnostic workflows can influence network design, but training separate models increases complexity and limits applicability to full-spine scenarios. Zhu et al. [
13] introduced a Gabor filter bank-based method for IVD localization and segmentation, where multi-scale, multi-orientation filters enhance disc-like textures in the frequency domain prior to detection. While this improves feature distinctiveness, it remains dependent on handcrafted filter selection and tends to fail when discs exhibit irregular texture patterns. Alomari et al. [14] employed a two-level probabilistic model for MRI-based IVD localization. Their hierarchical probabilistic framework jointly models pixel-level appearance and object-level spatial configuration, which improves robustness over purely local descriptors, but still requires manual feature engineering and may struggle under severe anatomical deformation.
Michopoulou et al. [
15] presented a semi-automatic approach for the detection and segmentation of IVDs, considering both degenerated and normal lumbar IVDs. Their atlas-based method propagates segmentation labels through deformable registration, allowing anatomical priors to guide localization. However, performance depends strongly on registration accuracy and manual initialization. Peng et al. [
16] utilized a model-based searching method for the localization of entire spine discs. This framework iteratively matches shape models to image data, enabling full-spine processing but requiring reliable initial estimates and suffering in cases of severe curvature or occlusion. Castro-Mateos et al. [
17] applied an active contour model with fuzzy C-means for IVD segmentation, where deformable contours evolve under fuzzy membership constraints to fit disc boundaries while allowing uncertainty modeling. Haq et al. [
18] utilized the discrete simplex surface model for accurate IVD segmentation. Their 3D simplex model integrates boundary smoothness and anatomical structure, yielding high segmentation accuracy but reduced flexibility for atypical disc morphologies. Law et al. [
19] introduced a novel anisotropic-oriented flux model for IVD segmentation. This technique analyzes gradient vector flux through oriented kernels to detect elongated disc structures, but it may falsely respond to vertebral edges or other elongated features if not carefully constrained.
Azad et al. [
20] reframed semantic labeling of vertebral discs as pose estimation, employing an hourglass neural network. The network predicts heatmaps for disc centroids in a keypoint detection fashion, addressing false positive issues common in segmentation-based methods and implicitly capturing the sequential ordering of discs. In a subsequent study [
21], they further improved the detection process by incorporating image gradient as an auxiliary input, enhancing the capture and representation of global shape information. By injecting gradient maps, the method preserves fine structural details, but it still processes features at a single dominant scale and may miss small or degenerated discs with weak contrast.
Prior endeavors aimed to enrich shape information through various means, such as integrating edge information derived from image gradients [
21], emphasizing the detection of vertebral column regions [
22], and incorporating pose estimation techniques [
23]. For instance, Countception-based models use redundant counting to exploit the repetitive structure of discs, but they do not provide explicit semantic labeling, while symmetry-based detection approaches leverage bilateral symmetry of the vertebrae but break down in cases of scoliosis or spinal deformity. Despite these efforts, existing methods encounter challenges in effectively leveraging global vertebral column information to shape the representation space and capture geometric constraints efficiently. This limitation often leads to suboptimal outcomes, including the generation of false positives and false negatives.
As shown in
Table 1, while existing approaches have addressed disc labeling from multiple perspectives, including heuristic processing, FCN-based segmentation, pose estimation, and attention mechanisms, they still fall short in simultaneously capturing multi-scale context and discriminative feature representations, motivating the design of our proposed MSA-Net. By harnessing Multi-scale Gaussian Context Attention (M-GCA) modules, the network adeptly integrates contextual cues while preserving local intricacies. This approach addresses the shortcomings of prior methods and offers a more robust solution for semantic labeling tasks in medical imaging. Unlike earlier frameworks that treat disc appearance primarily at the pixel or patch level, our design incorporates multi-scale contextual aggregation and contrastive representation learning to explicitly enhance feature separability and anatomical consistency, making the method particularly robust in cases involving low-contrast discs, motion artifacts, or imaging variability across scanners.
3. Methodology
The idea of a multi-scale attention network for IVD labeling stems from the essential need to extract fine-grained and contextual information from the data. Local features are useful, but accurate labeling also requires understanding global spinal structure. This includes considerations such as spine orientation, IVD arrangement, and the relationships between adjacent discs, which are best captured across multiple spatial scales within the medical image. To tackle this challenge, we present our innovative multi-scale attention strategy, showcased in
Figure 1.
In our design, prior geometric knowledge refers to the consistent anatomical arrangement of intervertebral discs along the spinal axis. This structural regularity is implicitly incorporated through the multi-scale attention mechanism, which emphasizes feature alignment along the expected spinal trajectory and enforces spatial consistency across levels of the network.
These multi-scale attention blocks empower the model to capture both local and global dependencies, while also guiding model predictions by leveraging prior knowledge of the distribution of the IVDs.
3.1. Network Architecture
The overall structure of the proposed approach is depicted in
Figure 1. First of all, a series of convolutional layers are utilized to convert the input image into a latent space. Subsequently, an hourglass block with a multi-scale attention module is utilized [
14] to proficiently capture local to global information from the data.
As illustrated in Figure 1, the stacked hourglass network models object pose by generating a sequence of intermediate outputs followed by a final prediction. This setup leverages the multiple levels of representation offered by the N stacked hourglass networks. To further enhance the representational capacity, we introduce a multi-scale attention mechanism. In this mechanism, each hourglass network generates an intermediate representation, and these representations are combined to create a comprehensive multi-level representation. This integrated representation consolidates insights from different network levels, refining the final representation. By harnessing this combined knowledge as a guiding signal, we aim to optimize the representation space. To implement this, we stack all intermediate representations and feed them through an attention block comprising 1 × 1 convolutions with sigmoid activation, resulting in a V-channel prediction map $\hat{y}$. Each channel represents a distinct intervertebral disc position, offering a complete representation of the disc positions. During training, the Mean Squared Error (MSE) loss function is calculated between the predicted mask $\hat{y}$ and the ground truth mask $y$. Equation (1) outlines this loss function, with $N$ here denoting the total number of pixels contained in the ground truth mask:

$$\mathcal{L}_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^{2} \tag{1}$$
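The attention head and the MSE objective described above can be illustrated with a small dependency-free sketch. It is not the authors' implementation: the function names, the nested-list tensor layout, and the toy weights are assumptions made for illustration, and the mean is taken over every element of the V-channel map.

```python
import math


def sigmoid(x):
    """Logistic function used as the activation of the attention block."""
    return 1.0 / (1.0 + math.exp(-x))


def attention_head(stacked, weights, bias):
    """Apply a 1 x 1 convolution followed by a sigmoid.

    stacked: list of C_in feature maps, each an H x W nested list
             (the stacked intermediate hourglass representations).
    weights: V x C_in matrix (hypothetical learned parameters).
    bias:    list of V biases.
    Returns a list of V prediction maps, each H x W.
    """
    height, width = len(stacked[0]), len(stacked[0][0])
    maps = []
    for v in range(len(weights)):
        channel = []
        for i in range(height):
            row = []
            for j in range(width):
                # A 1 x 1 convolution is a per-pixel linear map over channels.
                s = bias[v] + sum(
                    weights[v][c] * stacked[c][i][j] for c in range(len(stacked))
                )
                row.append(sigmoid(s))
            channel.append(row)
        maps.append(channel)
    return maps


def mse_loss(pred, target):
    """Mean squared error over all elements of two V x H x W maps."""
    total, count = 0.0, 0
    for pred_map, target_map in zip(pred, target):
        for pred_row, target_row in zip(pred_map, target_map):
            for p, t in zip(pred_row, target_row):
                total += (t - p) ** 2
                count += 1
    return total / count


# Toy usage: two stacked 2 x 2 representations, one output channel (V = 1).
stacked = [[[0.2, 0.4], [0.6, 0.8]], [[1.0, 0.0], [0.5, 0.5]]]
pred = attention_head(stacked, weights=[[0.0, 0.0]], bias=[0.0])
loss = mse_loss(pred, [[[1.0, 1.0], [1.0, 1.0]]])
```

With all-zero weights every prediction equals sigmoid(0) = 0.5, so the loss against an all-ones target is 0.25.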
We use two stacked hourglass blocks (N = 2), each followed by a residual block and an M-GCA module. The output of every hourglass is refined by its corresponding M-GCA before decoding. A schematic overview is shown in
Figure 1, and the architecture stages are summarized in
Table 2.
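By incorporating a contrastive loss as an additional supervisory signal, we aim to enhance the model’s predictions. The primary goal of optimization is to minimize the comprehensive loss, which is represented as:

$$\mathcal{L}_{total} = \mathcal{L}_{MSE} + \lambda\,\mathcal{L}_{con} \tag{2}$$

where $\mathcal{L}_{con}$ is the contrastive loss introduced in Section 3.3 and $\lambda$ is its weighting factor.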
3.2. Multi-Scale Gaussian Context Attention (M-GCA)
In this work, our objective is to enhance the network’s representational capabilities through the utilization of a specific attention mechanism called M-GCA. This module is crafted to recalibrate contextual information across various scales by dynamically selecting receptive fields from both global and local streams. M-GCA incorporates a dual-path module, integrating convolution, global average pooling, normalization, and Gaussian context excitation in two paths with kernel sizes of 3 × 3 and 7 × 7. This design enables the module to prioritize relevant features by adjusting receptive field sizes dynamically. Consequently, the network can selectively focus on the most informative features for the given task, disregarding irrelevant or noisy data, thereby enhancing the discriminative power of its feature representation.
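Let $X$ denote the input feature map. We use two parallel convolutional layers to extract context at different scales:

$$F_{l} = \mathrm{Conv}_{3\times 3}(X), \qquad F_{g} = \mathrm{Conv}_{7\times 7}(X)$$

These produce local and global features, respectively. After normalization and Gaussian excitation $G(\cdot)$, we obtain attention-weighted maps:

$$\tilde{F}_{l} = G(F_{l}) \odot F_{l}, \qquad \tilde{F}_{g} = G(F_{g}) \odot F_{g}$$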
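The two attention-weighted paths, $\tilde{F}_{l}$ (local) and $\tilde{F}_{g}$ (global), are then fused using an adaptive gate $\alpha$:

$$F_{out} = \alpha \odot \tilde{F}_{l} + (1-\alpha) \odot \tilde{F}_{g}$$

The gate $\alpha$ is computed from the input features and learns to balance local and global contributions.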
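As a rough illustration of gated fusion, the sketch below blends a local and a global feature vector with a scalar gate. The text does not specify how the gate is parameterized, so the convex-combination form, the `adaptive_gate` function, and its weights are hypothetical choices made only to show the balancing behaviour.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean(values):
    return sum(values) / len(values)


def adaptive_gate(local_feat, global_feat, w_local=1.0, w_global=-1.0, bias=0.0):
    """Hypothetical gate: a sigmoid over pooled statistics of both paths.

    The weights stand in for learned parameters; this is an assumed
    parameterization, not the one used in the paper.
    """
    return sigmoid(w_local * mean(local_feat) + w_global * mean(global_feat) + bias)


def gated_fusion(local_feat, global_feat, alpha):
    """Blend the two paths: alpha weights the local path, 1 - alpha the global one."""
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(local_feat, global_feat)]


# Toy usage on flattened feature maps.
local_feat = [1.0, 2.0]
global_feat = [3.0, 4.0]
alpha = adaptive_gate(local_feat, global_feat)
fused = gated_fusion(local_feat, global_feat, alpha)
```

Because the gate lies strictly between 0 and 1, each fused value stays between the corresponding local and global values.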
To condense the spatial information across the feature map, we apply global average pooling (GAP) to the output of the convolution layer, giving $z = \mathrm{GAP}(F)$. We then normalize the resulting representation by computing the sample mean $\mu$ and standard deviation $\sigma$ for each feature vector:

$$\hat{z} = \frac{z - \mu}{\sigma}$$

This normalization step ensures stable feature magnitudes and supports consistent training across samples.
After normalization, the standardized feature $\hat{z}$ is used to compute the Gaussian attention weight:

$$g = \exp\left(-\frac{\hat{z}^{2}}{2c^{2}}\right)$$

where $c$ can be a constant or a learnable parameter.
g represents the attention activations. GCA capitalizes on spatial feature relationships to capture long-range dependencies and contextual cues. By employing Gaussian distributions to model these dependencies, this attention mechanism facilitates the integration of rich contextual information into the network’s representation, while maintaining computational efficiency. This incorporation of global context information contributes to improved generalization and robustness of the network’s representations. The Gaussian constant c in Equation (5) controls the spread of the attention weighting and was fixed to c = 1.0 in all experiments. This choice is motivated by the fact that the input features are normalized to approximately zero mean and unit variance (Equation (7)), making c = 1.0 equivalent to using a standard Gaussian kernel. In this setting, activations within one to two standard deviations are preserved, while higher-magnitude responses are smoothly suppressed without producing hard thresholding effects. We performed preliminary experiments with alternative fixed values and also tested a learnable variant of c, but found no consistent improvement over the fixed setting. Smaller values led to overly sharp attention responses, while larger values weakened spatial selectivity. Fixing c therefore provides a stable compromise between localization sensitivity and robustness, while avoiding unnecessary hyperparameter tuning and preventing overfitting to dataset-specific intensity distributions. The structure of M-GCA is shown in Figure 2.
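The normalization and Gaussian excitation steps can be sketched in a few lines. This is a minimal illustration, assuming the excitation takes the common form g = exp(-ẑ² / (2c²)) applied to a pooled channel vector; the function name and the small `eps` stabilizer are hypothetical additions and not taken from the paper.

```python
import math


def gaussian_context_weights(pooled, c=1.0, eps=1e-5):
    """Map a globally pooled feature vector to Gaussian attention weights.

    pooled: list of per-channel values after global average pooling.
    c:      Gaussian constant controlling the spread (fixed to 1.0 in the text).
    eps:    assumed numerical stabilizer for the standard deviation.
    """
    n = len(pooled)
    mu = sum(pooled) / n
    var = sum((z - mu) ** 2 for z in pooled) / n
    sigma = math.sqrt(var + eps)
    z_hat = [(z - mu) / sigma for z in pooled]
    # Values near the mean receive weights close to 1; outliers are damped smoothly.
    return [math.exp(-(zh ** 2) / (2.0 * c ** 2)) for zh in z_hat]


# Toy usage: the channel at the mean keeps full weight, the others are attenuated.
weights = gaussian_context_weights([1.0, 2.0, 3.0], c=1.0)
sharper = gaussian_context_weights([1.0, 2.0, 3.0], c=0.5)
```

Lowering `c` in the sketch narrows the Gaussian, which mirrors the observation that smaller values produce sharper attention responses.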
3.3. Deep Contrastive Learning
In our methodology, we introduce deep contrastive supervision to refine the model’s representation and bolster its discriminative capabilities. Throughout this subsection, superscript i denotes the disc class index and subscript p the pixel location. Variables and represent voxel embeddings before and after contrastive refinement, respectively.
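Initially, we establish class prototypes by extracting representations from multiple network levels, encompassing both shallow and deep representations. The class prototype $P_k$ for class $k$ can be defined as the average feature vector of all instances belonging to that class:

$$P_k = \frac{1}{|\Omega_k|}\sum_{p \in \Omega_k} f_p$$

where $f_p$ is the embedding at location $p$ and $\Omega_k$ is the set of locations belonging to class $k$.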
By applying a contrastive loss, we improve the feature representation, guiding the model to learn embeddings that capture similarity between classes and suppress noise. This leads to more discriminative, semantically meaningful, and generalizable representations. The contrastive loss is calculated as follows:

$$\mathcal{L}_{con}^{k} = -\frac{1}{|\Omega_k|}\sum_{p \in \Omega_k}\log\frac{\exp\left(\mathrm{sim}(f_p, P_k)/\tau\right)}{\sum_{j}\exp\left(\mathrm{sim}(f_p, P_j)/\tau\right)}$$

where $\mathcal{L}_{con}^{k}$ is the contrastive loss for class $k$ and $\mathrm{sim}(\cdot,\cdot)$ is the similarity function measuring the cosine similarity between each voxel representation $f_p$ and the class prototype $P_k$, and where $\tau$ denotes a temperature parameter that controls the sharpness of the contrastive objective. The contrastive loss is further refined by adding an additional term that accounts for the distances between different class prototypes. This supplementary component is integrated to encourage the representation space to distinctly segregate the clustering regions associated with different classes.
During training, class prototypes are computed dynamically as the batch-wise average embedding of each semantic region (intervertebral disc and non-disc):

$$P_k = \frac{1}{|\Omega_k|}\sum_{p \in \Omega_k} f_p$$

where $f_p$ is the embedding at voxel $p$ and $\Omega_k$ is the set of feature locations labeled $k$. Prototypes are updated each batch to reflect evolving feature distributions. Contrastive learning is used only during training and not applied at inference.
The set $\Omega_k$ contains all spatial feature vectors belonging to class $k$:

$$\Omega_k = \{\, p \mid y_p = k \,\}$$

Positive samples share the same class as the anchor and are pulled toward $P_k$, while features from other classes act as negatives and push the anchor away from other prototypes. In practice, we form $\Omega_k$ and $P_k$ per batch and enforce higher similarity with the correct prototype and lower similarity with the rest.
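The prototype-based contrastive objective can be illustrated with a small sketch, assuming a softmax over temperature-scaled cosine similarities to the batch prototypes. The function names and the temperature value of 0.1 are hypothetical, and the prototype-distance regularization term is not included.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_prototypes(embeddings, labels):
    """Batch-wise mean embedding of every class present in the batch."""
    groups = {}
    for f, k in zip(embeddings, labels):
        groups.setdefault(k, []).append(f)
    return {
        k: [sum(col) / len(vecs) for col in zip(*vecs)]
        for k, vecs in groups.items()
    }


def prototype_contrastive_loss(embeddings, labels, tau=0.1):
    """Average negative log-probability of each embedding's own prototype."""
    protos = class_prototypes(embeddings, labels)
    total = 0.0
    for f, k in zip(embeddings, labels):
        logits = {j: cosine(f, p) / tau for j, p in protos.items()}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total -= logits[k] - log_denom
    return total / len(embeddings)


# Toy usage: class 1 = disc, class 0 = non-disc.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
separated = prototype_contrastive_loss(emb, [1, 1, 0, 0])
mixed = prototype_contrastive_loss(emb, [1, 0, 1, 0])
```

When the labels match the cluster structure the loss is close to zero, whereas the mixed labeling produces identical prototypes and the loss rises to log 2.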
To explicitly separate class prototypes, we add a prototype-distance regularization:
where
controls the penalty strength. The final contrastive objective becomes
with
balancing prototype separation and contrastive alignment.
In Equation (2), the weighting factor $\lambda$ was empirically set to 0.1 after validation experiments. Smaller values weakened contrastive regularization, whereas larger values reduced localization accuracy. An ablation over $\lambda$ confirmed $\lambda = 0.1$ as optimal.
4. Experimental Evaluation and Analysis
This section describes our experimental setup, including the datasets and evaluation metrics, and examines the results obtained. Our study draws upon the Spine Generic Dataset [28], a publicly available resource that captures spinal imaging across a wide range of medical centers worldwide. This dataset, encompassing 42 centers and 260 participants, provides T1 and T2 MRI contrasts for each subject.
4.1. Datasets and Acquisition
The Spine Generic Dataset serves as a robust foundation for our experimentation, capturing a diverse range of imaging scenarios. The dataset’s breadth is derived from its multinational acquisition, spanning 42 centers. The inclusion of 260 participants ensures a representative sample size, reflecting varied demographics and medical conditions. Importantly, the dataset features both T1 and T2 MRI contrasts for each participant, adding a layer of complexity and realism to the evaluation.
The inherent diversity of the dataset extends beyond geographical locations, encapsulating variations in image quality, participant ages, and imaging devices employed by different medical institutes. This heterogeneity intentionally mirrors the real-world challenges faced in clinical scenarios, providing a robust and challenging benchmark for the specific task of intervertebral disc labeling.
4.2. Results
The quantitative results reported in Table 1 confirm that MSA-Net achieves the best localization accuracy among all evaluated methods across both T1- and T2-weighted MRI scans. On T1w, the proposed framework reaches a DTT of 0.98 mm (±1.12), which represents a reduction compared to classical template matching (1.97 mm ± 4.08), Countception (1.03 mm ± 2.81), and the pose estimation baseline (1.32 mm ± 1.33). Even when compared with more recent deep learning systems, the improvement remains clear: a 17.6% relative reduction over HCA-Net (1.19 mm ± 1.08) and a 31.9% reduction over Swin-Net (1.44 mm ± 1.22). On T2-weighted images, MSA-Net maintains this advantage, lowering the DTT from 2.05 mm in Template Matching, 1.78 mm in Countception, 1.31 mm in pose estimation, 1.86 mm in Swin-Net, and 1.26 mm in HCA-Net to 1.05 mm (±1.43). The reduction in variability is equally notable, as the T1w standard deviation decreases from 2.81 and 4.08 in earlier methods to just 1.12 for our model, demonstrating a considerably tighter distribution and more reliable localization across subjects and acquisition centers.
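The relative reductions quoted above follow directly from the reported T1w DTT values. The short sketch below reproduces the arithmetic; the helper name is an assumption introduced only for this illustration.

```python
def relative_reduction(baseline, ours):
    """Percentage reduction of `ours` with respect to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# T1w DTT values (mm) as reported in the text.
msa_net = 0.98
vs_hca_net = round(relative_reduction(1.19, msa_net), 1)   # 17.6
vs_swin_net = round(relative_reduction(1.44, msa_net), 1)  # 31.9
```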
In terms of detection reliability, MSA-Net completely eliminates false positives and achieves the lowest false negative rate on T1w scans. On T1w, our FNR of 0.28% improves upon Template Matching (8.1%), Countception (4.24%), and baseline hourglass (7.3%) by large margins, while also outperforming the transformer-based Swin-Net (1.3%) and slightly improving on HCA-Net (0.3%). On T2w, MSA-Net reduces the FNR to 0.85%, compared to 11.1% in Template Matching, 3.88% in Countception, 5.4% in baseline hourglass, and 4.61% in Swin-Net; only HCA-Net attains a lower value (0.61%). The elimination of false positives (FPR = 0.0% in both T1w and T2w) is shared only with the pose estimation and HCA-Net methods, yet our model provides better spatial accuracy than both (0.98 mm vs. 1.32 mm and 1.19 mm for T1w, respectively). These improvements demonstrate that MSA-Net not only reduces gross detection errors but also maintains precise localization of true discs even in challenging and low-contrast cases.
The underlying reasons for these improvements lie in the complementary nature of the proposed architectural components. Classical methods such as Template Matching exhibit poor performance primarily because they rely on intensity correlation without modeling geometric context, resulting in high FNR (8.1–11.1%) and unstable localization. Countception improves repeat-structure modeling but still lacks anatomical priors, leading to higher FPR (0.9–1.5%) and large estimation variance. The pose estimation approach introduced by Azad et al. reduces false positives through heatmap regression and achieves competitive FNR (0.32%), but its single-scale design fails to capture variability in disc size and appearance, reflected in its higher DTT (1.32 mm vs. 0.98 mm). Transformer-based architectures such as Swin-Net provide improved long-range modeling but introduce large uncertainty when disc size varies across slices, resulting in weaker DTT (1.44 mm and 1.86 mm) and elevated FNR (1.3% and 4.61%). HCA-Net incorporates hierarchical contextual attention and achieves strong performance, yet its reliance on standard segmentation-based supervision limits its ability to separate disc and non-disc features in ambiguous regions, which explains why it still exhibits a higher DTT than MSA-Net even when FPR is suppressed.
In contrast, the proposed M-GCA module demonstrates a clear benefit through its ability to adaptively integrate local 3 × 3 and global 7 × 7 receptive fields, enabling the network to model fine disc boundaries while simultaneously maintaining awareness of vertebral column geometry. This leads to a sharp reduction in localization error and variance. The deep contrastive supervision further improves feature separability by refining the embedding space such that disc and non-disc representations remain well-clustered even under severe intensity variability. Empirically, this is reflected in the reduction in FNR from 7.3% in the baseline to 0.28% in our model, and the complete elimination of false positives without sacrificing spatial precision. The synergy of multi-scale attention and contrastive discrimination therefore provides measurable improvements over both classical appearance-based approaches and more recent attention or transformer-based systems.
4.3. Ablation Study
To validate the contribution of each component, we conducted systematic ablation experiments, as shown in
Table 3. The baseline hourglass network achieves 1.45 mm DTT, which improves to 1.12 mm with the addition of M-GCA alone, demonstrating its effectiveness in multi-scale feature integration. Contrastive learning alone provides different benefits, reducing FNR from 7.3% to 1.8% by improving feature discriminability.
The complete MSA-Net configuration shows synergistic effects, where the combination of M-GCA and contrastive learning yields better results than either component alone. Key observations include that M-GCA contributes most to localization precision (DTT improvement from 1.45 mm to 1.12 mm), as its dynamic receptive fields better capture disc boundaries and spatial relationships. The attention mechanism’s ability to focus on relevant scales reduces errors in cases with unusual disc spacing or orientation. Contrastive learning shows the strongest impact on reducing false negatives (FNR drops from 7.3% to 1.8%), as it learns more robust features that distinguish true discs from similar-looking structures. This is particularly valuable for detecting small or degenerated discs that might otherwise be missed. The complete system’s performance (0.98 mm DTT, 0.28% FNR) surpasses the sum of individual improvements, indicating that M-GCA and contrastive learning complement each other—the attention mechanism provides better spatial features for contrastive learning to discriminate, while the contrastive framework guides the attention to focus on more semantically meaningful regions.
Figure 3 and
Figure 4 provide representative qualitative results for MSA-Net on T1-weighted spine MRI. The predicted masks exhibit strong alignment with ground truth labels, particularly in the central disc regions. The model maintains tight localization even in images with low contrast or anatomical curvature. Notably, the method avoids common errors such as mislabeling vertebrae or merging adjacent discs, which were frequent in baseline models. In some cases involving pathological deformation or disc degeneration, minor under-segmentation occurs at disc boundaries. However, the predicted regions still provide accurate centroid locations, preserving the labeling task’s clinical value. These instances expose a limitation of the method, as it has not been explicitly trained to handle such anatomical extremes. Even in these challenging cases, our method maintains better performance than existing alternatives, demonstrating its robustness to anatomical variability.
4.4. Computational Complexity
In addition to accuracy, we report the computational characteristics of the proposed model. MSA-Net contains approximately 2.96 million trainable parameters, significantly fewer than many transformer-based architectures (e.g., Swin-UNet variants typically exceeding 20 M parameters). Despite incorporating multi-scale attention and contrastive supervision, the model maintains a lightweight footprint and efficient runtime. All experiments were implemented in Python 3.8 using PyTorch 2.0 (Meta AI, New York, NY, USA), together with NumPy 2.2, SciPy 1.15, and OpenCV 4.9 for scientific computing and image processing. On a single NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), inference on a batch of 16 axial MRI slices takes approximately 0.48 s, corresponding to an average processing time of roughly 30 ms per image. This indicates that the proposed architecture can operate in near real time and is sufficiently fast for integration into clinical pipelines, PACS systems, or interactive diagnostic workflows without requiring specialized hardware or extreme memory resources.
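The per-image latency follows from the reported batch timing. The sketch below shows the conversion; the function names are illustrative, and the throughput figure is derived from the stated numbers, not reported in the text.

```python
def per_image_latency_ms(batch_seconds, batch_size):
    """Average processing time per image in milliseconds."""
    return 1000.0 * batch_seconds / batch_size


def throughput_images_per_s(batch_seconds, batch_size):
    """Number of images processed per second."""
    return batch_size / batch_seconds


latency = per_image_latency_ms(0.48, 16)        # about 30 ms per image
throughput = throughput_images_per_s(0.48, 16)  # about 33 images per second
```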
4.5. Training Dynamics and Sensitivity of the Contrastive Objective
To better characterize the training dynamics of the contrastive loss, we conducted a small sensitivity analysis of the temperature parameter $\tau$ and the batch size, while keeping all other settings fixed. In particular, we evaluated several values of $\tau$ and batch sizes of 8, 16, and 32 and report the corresponding DTT values for both T1w and T2w modalities in Table 4. The results show that MSA-Net is robust to moderate changes in these hyperparameters: the DTT on T1w varies within a narrow band of 0.98–1.02 mm and on T2w within 1.05–1.10 mm across all tested configurations. The configuration used in our main experiments (the default $\tau$ with batch size = 16) consistently yields the best trade-off between accuracy and stability, with DTT values of 0.98 mm (T1w) and 1.05 mm (T2w).
Very low temperatures sharpen the contrastive objective and make the similarity distribution more peaked, which we observed to slightly slow down convergence in the early epochs and marginally degrade the final DTT (1.01 mm/1.09 mm for T1w/T2w). Conversely, higher temperatures lead to a flatter similarity landscape and weaker prototype separation, which again results in slightly higher DTT (1.02 mm/1.10 mm). Regarding batch size, smaller batches (8 samples) introduce more gradient noise and yield slightly worse localization than the default setting, while larger batches (32 samples) do not provide a clear performance gain but increase memory consumption and training time. Across all tested settings, training remained numerically stable, with smoothly decreasing training and validation losses and no oscillatory behavior or divergence. These observations indicate that the contrastive component is well-conditioned and that the chosen hyperparameters (the default $\tau$ with batch size = 16) provide a robust and efficient operating point.
5. Conclusions
In this paper, we presented MSA-Net, a multi-scale attention network for the semantic labeling of intervertebral discs in MRI scans. Our method addresses known difficulties in this task, including intensity similarities with surrounding tissues, anatomical variability between subjects, and limited annotated data. We combine attention-based multi-scale feature extraction with contrastive learning to improve both localization precision and feature robustness.
Intervertebral disc labeling remains challenging due to anatomical ambiguities, patient-specific variations in disc geometry, and the need for accurate identification across different imaging protocols. Manual annotation is slow and error-prone, motivating the development of automated tools that are both accurate and generalizable. This study was driven by the need for a method that not only captures fine-grained spatial features but also incorporates prior geometric structure into the learning process.
MSA-Net achieves state-of-the-art performance on the Spine Generic Dataset, showing improved accuracy in both T1w and T2w modalities. Specifically, our method achieves a Distance to Target of 0.98 mm for T1w and 1.05 mm for T2w scans, while eliminating false positives entirely. Compared to previous approaches, it reduces the false negative rate by 56% in T1w and 29% in T2w images.
Looking ahead, further work may explore adapting this approach to related tasks such as vertebra localization or disc degeneration grading and evaluating robustness under limited annotation or different imaging protocols.