MSA-Net: A Multi-Scale Attention Network with Contrastive Learning for Robust Intervertebral Disc Labeling in MRI

Mohammad D. Alahmadi; Abdulrahman Gharawi; Tariq Alsahfi

doi:10.3390/math13233811

¹ Department of Software Engineering, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia

² Department of Computer Science, University College of Al Jamoum, Umm Al-Qura University, Makkah 21421, Saudi Arabia

³ Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 23890, Saudi Arabia

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(23), 3811; https://doi.org/10.3390/math13233811

This article belongs to the Special Issue New Advances in Image Processing and Computer Vision

Abstract

Accurate labeling of intervertebral discs (IVDs) in MRI scans is crucial for diagnosing spinal-related diseases such as osteoporosis, vertebral fractures, and IVD herniation. However, automatic IVD labeling remains challenging. The main issues include visual similarity to surrounding bone, anatomical variation across individuals, and inconsistencies between MRI scans. Traditional post-detection disc labeling methods often struggle when localization algorithms miss discs or generate false positives. To address these challenges, we propose MSA-Net, a novel multi-scale attention network designed for semantic IVD labeling, emphasizing the use of prior geometric data. MSA-Net efficiently extracts multi-scale features and models intricate spatial dependencies throughout the spinal structure. We also integrate contrastive learning to enforce feature consistency. This helps the network distinguish IVDs from surrounding tissues. Extensive experiments on multi-center spine datasets demonstrate that MSA-Net consistently outperforms previous methods across MRI T1w and T2w modalities. These improvements demonstrate MSA-Net’s ability to handle variability in disc geometry, tissue contrast, and missed detections that challenge prior methods.

Keywords:

spine MRI; multi-scale attention network; intervertebral disc segmentation; contrastive feature learning

MSC:

68T07; 94A08

1. Introduction

The human vertebral column serves as the central axis of the body, providing essential support for the upper body and protecting the spinal cord. Abnormalities in vertebrae and intervertebral discs (IVDs) are among the primary causes of neck and lower back pain, with the latter ranking as the second most common neurological condition globally after headaches [,]. This widespread issue disproportionately affects adolescents and middle-aged adults, prompting significant research attention toward the cervical and lumbar regions of the spine for effective diagnosis and intervention.

Recent studies have emphasized the importance of analyzing disc-specific characteristics such as size, shape, and water content to support clinical decision-making. Traditionally, radiologists manually annotate magnetic resonance (MR) images to label vertebral structures. However, this manual approach is labor-intensive, prone to errors, and highly dependent on the radiologist’s expertise. Labeling often relies on anatomical atlases, but variability in spine shape and disc appearance makes this difficult [].

One of the major challenges in this domain is the development of reliable Computer-Aided Diagnosis (CAD) systems tailored for the cervical and lumbar spine. Fully automated methods struggle because nearby structures in MRI/CT scans often have similar intensities and may appear merged. Despite these challenges, CAD systems have demonstrated great promise in the early detection of various conditions, including vertebral fractures, disc degeneration, and other pathologies such as breast cancer and intracranial aneurysms []. These advances are driven by interdisciplinary innovations in medical imaging, artificial intelligence, and software engineering, which play a pivotal role in designing robust, scalable diagnostic tools. Given the superior soft tissue contrast offered by T1- and T2-weighted MRI imaging, this study focuses on cervical spine labeling using these modalities.

To address the limitations of existing approaches, we propose MSA-Net, a novel Multi-Scale Attention Network specifically designed for robust semantic labeling of intervertebral discs in MR images. Unlike conventional methods that rely on disc detection followed by post-labeling, MSA-Net directly performs end-to-end semantic labeling by leveraging spatial attention mechanisms and prior geometric knowledge. Furthermore, we integrate contrastive learning to ensure feature consistency and improve the model’s ability to distinguish discs from surrounding anatomical structures. Extensive experiments on multi-center T1w and T2w datasets demonstrate that MSA-Net achieves superior performance and generalizability compared to state-of-the-art methods.

2. Related Work

Accurately labeling and analyzing intervertebral discs (IVDs) is essential to medical image analysis. Various techniques have been developed for IVD segmentation and localization. Conventional feature engineering approaches involving methods such as thresholding [], edge detection [], and heuristic rules [] were commonly employed for IVD segmentation. Thresholding-based approaches rely on global or adaptive intensity thresholds to isolate disc regions, but they assume a clear contrast difference between disc tissue and surrounding structures, which does not consistently hold in spine MRI. Edge detection-based methods extract disc contours using gradient operators or morphological filters, but they are highly sensitive to noise, weak boundaries, and anatomical variability. Heuristic rule-based systems incorporate handcrafted constraints such as expected disc positions or intensity patterns; while these heuristics may reduce false detections, they require manual tuning and lack scalability across diverse patient populations. These approaches, commonly involving iterations and demanding computational resources, usually necessitated manual intervention to enhance precision [].

With the advent of deep learning, the focus shifted towards more automated and data-driven techniques. Recent research has emphasized the application of deep convolutional neural networks (CNNs) for IVD segmentation, demonstrating their superiority over traditional techniques. Chen et al. [] introduced a 3D Fully Convolutional Network (FCN) to extract center coordinates and segment IVDs. Their approach leveraged volumetric receptive fields to jointly detect all discs in a 3D scan, improving robustness compared to 2D slice-based methods. However, the method demands substantial GPU memory and does not explicitly enforce anatomical ordering along the spine. Ji et al. [] employed a standard CNN with a patch-based approach to investigate the influence of different patch sampling strategies on network performance and achieved comparable results with state-of-the-art methods. Patch-based learning reduces computational cost and allows localized feature extraction, but because patches are processed independently, the method may produce inconsistent segmentations and fails to incorporate long-range anatomical context.

Zeng and Zheng [] incorporated deeply supervised multi-scale fully CNNs to address the risk of gradient loss during training for IVD segmentation on three-dimensional (3D) T2-weighted images. By applying supervision at intermediate layers, their model improves feature refinement across multiple scales and enhances boundary detection, especially for discs that vary in size. Yet, the framework still relies on pixel-wise segmentation and does not model disc identity or ordering explicitly. Forsberg et al. [] utilized clinically annotated spine labels in distinct pipelines for cervical and lumbar MR, employing two configured CNNs for vertebra localization. This region-specific strategy demonstrates that diagnostic workflows can influence network design, but training separate models increases complexity and limits applicability to full-spine scenarios. Zhu et al. [] introduced a Gabor filter bank-based method for IVD localization and segmentation, where multi-scale, multi-orientation filters enhance disc-like textures in the frequency domain prior to detection. While this improves feature distinctiveness, it remains dependent on handcrafted filter selection and tends to fail when discs exhibit irregular texture patterns. Alomari et al. [] employed a two-level probabilistic model for MRI-based IVD localization. Their hierarchical probabilistic framework jointly models pixel-level appearance and object-level spatial configuration, which improves robustness over purely local descriptors, but still requires manual feature engineering and may struggle under severe anatomical deformation.

Michopoulou et al. [] presented a semi-automatic approach for the detection and segmentation of IVDs, considering both degenerated and normal lumbar IVDs. Their atlas-based method propagates segmentation labels through deformable registration, allowing anatomical priors to guide localization. However, performance depends strongly on registration accuracy and manual initialization. Peng et al. [] utilized a model-based searching method for the localization of entire spine discs. This framework iteratively matches shape models to image data, enabling full-spine processing but requiring reliable initial estimates and suffering in cases of severe curvature or occlusion. Castro-Mateos et al. [] applied an active contour model with fuzzy C-means for IVD segmentation, where deformable contours evolve under fuzzy membership constraints to fit disc boundaries while allowing uncertainty modeling. Haq et al. [] utilized the discrete simplex surface model for accurate IVD segmentation. Their 3D simplex model integrates boundary smoothness and anatomical structure, yielding high segmentation accuracy but reduced flexibility for atypical disc morphologies. Law et al. [] introduced a novel anisotropic-oriented flux model for IVD segmentation. This technique analyzes gradient vector flux through oriented kernels to detect elongated disc structures, but it may falsely respond to vertebral edges or other elongated features if not carefully constrained.

Azad et al. [] reframed semantic labeling of vertebral discs as pose estimation, employing an hourglass neural network. The network predicts heatmaps for disc centroids in a keypoint detection fashion, addressing false positive issues common in segmentation-based methods and implicitly capturing the sequential ordering of discs. In a subsequent study [], they further improved the detection process by incorporating image gradient as an auxiliary input, enhancing the capture and representation of global shape information. By injecting gradient maps, the method preserves fine structural details, but it still processes features at a single dominant scale and may miss small or degenerated discs with weak contrast.

Prior endeavors aimed to enrich shape information through various means, such as integrating edge information derived from image gradients [], emphasizing the detection of vertebral column regions [], and incorporating pose estimation techniques []. For instance, Countception-based models use redundant counting to exploit the repetitive structure of discs, but they do not provide explicit semantic labeling, while symmetry-based detection approaches leverage bilateral symmetry of the vertebrae but break down in cases of scoliosis or spinal deformity. Despite these efforts, existing methods encounter challenges in effectively leveraging global vertebral column information to shape the representation space and capture geometric constraints efficiently. This limitation often leads to suboptimal outcomes, including the generation of false positives and false negatives.

As shown in Table 1, while existing approaches have addressed disc labeling from multiple perspectives, including heuristic processing, FCN-based segmentation, pose estimation, and attention mechanisms, they still fall short in simultaneously capturing multi-scale context and discriminative feature representations, motivating the design of our proposed MSA-Net. By harnessing Multi-scale Gaussian Context Attention (M-GCA) modules, the network adeptly integrates contextual cues while preserving local intricacies. This approach addresses the shortcomings of prior methods and offers a more robust solution for semantic labeling tasks in medical imaging. Unlike earlier frameworks that treat disc appearance primarily at the pixel or patch level, our design incorporates multi-scale contextual aggregation and contrastive representation learning to explicitly enhance feature separability and anatomical consistency, making the method particularly robust in cases involving low-contrast discs, motion artifacts, or imaging variability across scanners.

Table 1. Comparison of representative intervertebral disc labeling and segmentation methods in the literature.

3. Methodology

The idea of a multi-scale attention network for IVD labeling stems from the essential need to extract fine-grained and contextual information from the data. Local features are useful, but accurate labeling also requires understanding global spinal structure. This includes considerations such as spine orientation, IVD arrangement, and the relationships between adjacent discs, which are best captured across multiple spatial scales within the medical image. To tackle this challenge, we present our innovative multi-scale attention strategy, showcased in Figure 1.

Figure 1. Overall architecture of the proposed MSA-Net framework. The dual-path hourglass structure integrates multi-scale Gaussian context attention (M-GCA) and contrastive learning to capture both local and global disc features for robust labeling.

In our design, prior geometric knowledge refers to the consistent anatomical arrangement of intervertebral discs along the spinal axis. This structural regularity is implicitly incorporated through the multi-scale attention mechanism, which emphasizes feature alignment along the expected spinal trajectory and enforces spatial consistency across levels of the network.

These multi-scale attention blocks empower the model to capture both local and global dependencies, while also guiding model predictions by leveraging prior knowledge of the distribution of the IVDs.

3.1. Network Architecture

The overall structure of the proposed approach is depicted in Figure 1. First, a series of convolutional layers maps the input image into a latent space. Subsequently, an hourglass block with a multi-scale attention module [] captures local to global information from the data.

As illustrated in Figure 1, the stacked hourglass network models object pose by generating a sequence of $N - 1$ intermediate outputs followed by a final prediction. This setup leverages the multiple levels of representation offered by the $N$ stacked hourglass networks. To further enhance the representational capacity, we introduce a multi-scale attention mechanism. In this mechanism, each hourglass network generates an intermediate representation, and these representations are combined to create a comprehensive multi-level representation. This integrated representation consolidates insights from different network levels, refining the final representation. By harnessing this combined knowledge as a guiding signal, we aim to optimize the representation space. To implement this, we stack all intermediate representations and feed them through an attention block comprising 1 × 1 convolutions with sigmoid activation, resulting in a V-channel prediction map ($\hat{y}$). Each channel represents a distinct intervertebral disc position, offering a complete representation of the disc positions. During training, the Mean Squared Error (MSE) loss is calculated between the predicted mask $\hat{y}$ and the ground truth mask $y$. Equation (1) outlines this loss function, which enters Equation (2) as $L_{v}$. In Equation (1) only, N denotes the total number of pixels contained in the ground truth mask rather than the number of stacked hourglasses.

L_{v} = \frac{1}{V \times N} \sum_{i = 1}^{V} \sum_{p = 1}^{N} {(y_{p}^{i} - {\hat{y}}_{p}^{i})}^{2},

(1)

We use two stacked hourglass blocks (N = 2), each followed by a residual block and an M-GCA module. The output of every hourglass is refined by its corresponding M-GCA before decoding. A schematic overview is shown in Figure 1, and the architecture stages are summarized in Table 2.
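As an illustration of Equation (1), the following minimal NumPy sketch averages the squared error over V channels and N pixels. The function name `heatmap_mse`, the array layout (V, H, W), and the toy sizes are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np

def heatmap_mse(y_true, y_pred):
    """Equation (1): squared error averaged over V channels and N pixels.

    y_true, y_pred: arrays of shape (V, H, W), so N = H * W pixels per channel.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("ground truth and prediction must have the same shape")
    V = y_true.shape[0]
    N = y_true[0].size
    return float(np.sum((y_true - y_pred) ** 2) / (V * N))

# Toy example (hypothetical sizes): V = 2 disc channels on a 4 x 4 grid.
y = np.zeros((2, 4, 4))
y[0, 1, 1] = 1.0   # disc 1 location
y[1, 2, 2] = 1.0   # disc 2 location
y_hat = np.full_like(y, 0.5)   # an uninformative prediction
loss = heatmap_mse(y, y_hat)
```

In this toy case every pixel is off by exactly 0.5, so the loss equals 0.25 regardless of where the discs are, which shows why a uniform prediction is penalized equally on disc and background pixels.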

Table 2. Performance comparison of intervertebral disc labeling methods on the Spine Generic dataset. DTT denotes the Distance to Target metric.

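By incorporating a contrastive loss as an additional supervisory signal, we aim to improve the model's predictions. The primary goal of optimization is to minimize the overall loss, which is represented as:

L = L_{v} + λ L_{c},

(2)

where $L_{v}$ denotes the MSE loss of Equation (1), $L_{c}$ the contrastive loss described in Section 3.3, and $λ$ their relative weight.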

3.2. Multi-Scale Gaussian Context Attention (M-GCA)

In this work, our objective is to enhance the network’s representational capabilities through the utilization of a specific attention mechanism called M-GCA. This module is crafted to recalibrate contextual information across various scales by dynamically selecting receptive fields from both global and local streams. M-GCA incorporates a dual-path module, integrating convolution, global average pooling, normalization, and Gaussian context excitation in two paths with kernel sizes of 3 × 3 and 7 × 7. This design enables the module to prioritize relevant features by adjusting receptive field sizes dynamically. Consequently, the network can selectively focus on the most informative features for the given task, disregarding irrelevant or noisy data, thereby enhancing the discriminative power of its feature representation.

Let $X \in \mathbb{R}^{C \times H \times W}$ denote the input feature map. We use two parallel convolutional layers to extract context at different scales:

Z_{1} = {Conv}_{3 \times 3} (X), Z_{2} = {Conv}_{7 \times 7} (X)

(3)

These produce local and global features, respectively. After normalization and Gaussian excitation $G (\cdot)$, we obtain attention-weighted maps:

{\tilde{Z}}_{1} = G ({\hat{Z}}_{1}), {\tilde{Z}}_{2} = G ({\hat{Z}}_{2})

(4)

The two paths are then fused using an adaptive gate:

F = α {\tilde{Z}}_{1} + (1 - α) {\tilde{Z}}_{2}

(5)

The gate $α$ is computed as:

α = σ ({Conv}_{1 \times 1} ([Z_{1}, Z_{2}]))

(6)

This gate learns to balance local and global contributions based on the input features.
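The gated fusion of Equations (5) and (6) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: $σ$ is taken to be the sigmoid function, $[Z_1, Z_2]$ is taken to be channel-wise concatenation, the gate is assumed to have one value per channel and pixel, and the 1 × 1 convolution is written as a per-pixel linear map over channels. The names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(z1, z2, z1_att, z2_att, w_gate, b_gate):
    """Fuse the local and global paths with a learned gate (Eqs. 5-6).

    z1, z2:         raw path outputs Z_1, Z_2, each of shape (C, H, W)
    z1_att, z2_att: attention-weighted maps, each of shape (C, H, W)
    w_gate, b_gate: 1x1 convolution weights (C, 2C) and bias (C,) (assumed shapes)
    """
    stacked = np.concatenate([z1, z2], axis=0)          # [Z_1, Z_2] -> (2C, H, W)
    # A 1x1 convolution mixes channels independently at every pixel.
    logits = np.einsum("oc,chw->ohw", w_gate, stacked) + b_gate[:, None, None]
    alpha = sigmoid(logits)                             # Eq. (6), values in (0, 1)
    fused = alpha * z1_att + (1.0 - alpha) * z2_att     # Eq. (5)
    return fused, alpha

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
z1 = rng.normal(size=(C, H, W))
z2 = rng.normal(size=(C, H, W))
z1_att, z2_att = 0.5 * z1, 0.5 * z2   # stand-ins for the Gaussian-excited maps
w_gate = 0.1 * rng.normal(size=(C, 2 * C))
b_gate = np.zeros(C)
fused, alpha = gated_fusion(z1, z2, z1_att, z2_att, w_gate, b_gate)
```

With all gate weights set to zero the gate equals 0.5 everywhere and the fusion reduces to a plain average of the two paths, which is a convenient sanity check for any implementation of this step.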

To condense the spatial information across the feature map, we apply global average pooling to the output of the convolution layer, giving a channel descriptor $z$ with entries $z_{i}$, $i = 1, \ldots, C$. We then normalize this descriptor using its sample mean and standard deviation:

\bar{μ} = \frac{1}{C} \sum_{i = 1}^{C} z_{i}, \bar{σ} = \sqrt{\frac{1}{C} \sum_{i = 1}^{C} {(z_{i} - \bar{μ})}^{2}} + ε

(7)

\hat{z} = \frac{1}{\bar{σ}} (z - \bar{μ})

(8)

This normalization step ensures stable feature magnitudes and supports consistent training across samples.

After normalization, the standardized feature $\hat{z}$ is used to compute the Gaussian attention weight:

g = G (\hat{z}) = e^{- \frac{{\hat{z}}^{2}}{2 c^{2}}},

(9)

where c can be a constant or a learnable parameter, and g represents the attention activations. GCA capitalizes on spatial feature relationships to capture long-range dependencies and contextual cues. By employing Gaussian distributions to model these dependencies, this attention mechanism facilitates the integration of rich contextual information into the network’s representation, while maintaining computational efficiency. This incorporation of global context information contributes to improved generalization and robustness of the network’s representations. The Gaussian constant c in Equation (9) controls the spread of the attention weighting and was fixed to 1.0 in all experiments. This choice is motivated by the fact that the input features are normalized to approximately zero mean and unit variance (Equations (7) and (8)), making c = 1.0 equivalent to using a standard Gaussian kernel. In this setting, activations within one to two standard deviations are preserved, while higher-magnitude responses are smoothly suppressed without producing hard thresholding effects. We performed preliminary experiments with alternative fixed values ($c \in \{0.5, 2.0\}$) and also tested a learnable variant of c, but found no consistent improvement over the fixed setting. Smaller values led to overly sharp attention responses, while larger values weakened spatial selectivity. Fixing c therefore provides a stable compromise between localization sensitivity and robustness, while avoiding unnecessary hyperparameter tuning and preventing overfitting to dataset-specific intensity distributions. The structure of M-GCA is shown in Figure 2.
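The pooling, normalization, and Gaussian excitation steps of Equations (7) to (9) can be sketched as below. This is an illustrative NumPy version, not the authors' code: the function names, the value of `eps`, and the final channel-wise rescaling in `apply_gca` are assumptions, since the text does not spell out how g is applied to the feature map.

```python
import numpy as np

def gaussian_context_weights(x, c=1.0, eps=1e-5):
    """Gaussian attention weights for one feature map x of shape (C, H, W)."""
    z = x.mean(axis=(1, 2))                          # global average pooling -> (C,)
    mu = z.mean()                                    # Eq. (7): mean across channels
    sigma = np.sqrt(np.mean((z - mu) ** 2)) + eps    # Eq. (7): std across channels
    z_hat = (z - mu) / sigma                         # Eq. (8)
    return np.exp(-(z_hat ** 2) / (2.0 * c ** 2))    # Eq. (9)

def apply_gca(x, c=1.0):
    """Rescale each channel by its weight (assumed channel-wise excitation)."""
    g = gaussian_context_weights(x, c)
    return x * g[:, None, None]

# Five channels whose pooled responses are -2, -1, 0, 1, 2 (toy values).
channel_means = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x = np.ones((5, 4, 4)) * channel_means[:, None, None]

# Compare the three settings of c discussed in the text.
weights = {c: gaussian_context_weights(x, c=c) for c in (0.5, 1.0, 2.0)}
for c, g in weights.items():
    print(c, np.round(g, 3))
```

The channel whose pooled response sits at the mean always receives weight 1, while channels further from the mean are attenuated; the attenuation is strongest for c = 0.5 and weakest for c = 2.0, matching the qualitative behavior described above.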

Figure 2. Structure of the Multi-Scale Gaussian Context Attention (M-GCA) module. Dual convolutional paths with $3 \times 3$ and $7 \times 7$ kernels model local and global contexts, which are fused through adaptive Gaussian weighting to emphasize informative intervertebral disc regions. Global average pooling (avg) is applied prior to Gaussian excitation.

3.3. Deep Contrastive Learning

In our methodology, we introduce deep contrastive supervision to refine the model’s representation and bolster its discriminative capabilities. Throughout this subsection, superscript i denotes the disc class index and subscript p the pixel location. Variables $v_{i}$ and $v_{i}^{r}$ represent voxel embeddings before and after contrastive refinement, respectively.

Initially, we establish class prototypes by extracting representations from multiple network levels, encompassing both shallow and deep representations. The class prototype $c_{k}$ for class k can be defined as the average feature vector of all instances belonging to that class:

c_{k} = \frac{1}{| S_{k} |} \sum_{(v_{i}^{r}, y_{i}) \in S_{k}} f_{l : L} (v_{i}^{r})

(10)

By applying the contrastive loss, we improve the feature representation, guiding the model to learn embeddings that capture similarity among samples of the same class and suppress noise. This leads to more discriminative, semantically meaningful, and generalizable representations. The contrastive loss is calculated as follows:

L_{c_{k}} = - \frac{1}{| S_{k} |} \sum_{(v_{i}, y_{i}) \in S_{k}} \log \left( \frac{\exp (\mathrm{sim} (v_{i}, c_{k}) / τ)}{\sum_{j \neq k} \exp (\mathrm{sim} (v_{i}, c_{j}) / τ)} \right)

(11)

where $L_{c_{k}}$ is the contrastive loss for class k, $\mathrm{sim} (v_{i}, c_{k})$ is the cosine similarity between each voxel representation $v_{i}$ and the class prototype $c_{k}$, and $τ$ denotes a temperature parameter that controls the sharpness of the contrastive objective. The contrastive loss is further refined by adding an additional term that accounts for the distances between different class prototypes. This supplementary component is integrated to encourage the representation space to distinctly segregate the clustering regions associated with different classes.
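A minimal NumPy sketch of the per-class loss in Equation (11) is given below. It follows the equation as written, with the denominator summed over the other prototypes only. The temperature value 0.1, the function names, and the handling of an empty class are assumptions for illustration; the paper does not report the temperature it used.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a (n, d) and rows of b (m, d) -> (n, m)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def prototype_contrastive_loss(embeddings, labels, prototypes, k, tau=0.1):
    """Equation (11) for class k; tau = 0.1 is a hypothetical value."""
    members = embeddings[labels == k]                  # the set S_k
    if len(members) == 0:
        return 0.0                                     # assumed: skip absent classes
    logits = cosine_sim(members, prototypes) / tau     # shape (|S_k|, K)
    pos = logits[:, k]                                 # similarity to own prototype
    neg = np.delete(logits, k, axis=1)                 # similarities to j != k
    # Numerically stable log-sum-exp over the other prototypes.
    m = neg.max(axis=1, keepdims=True)
    log_den = m[:, 0] + np.log(np.exp(neg - m).sum(axis=1))
    return float(-(pos - log_den).mean())

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0])
aligned = np.array([[1.0, 0.0], [2.0, 0.0]])      # point toward prototype 0
misaligned = np.array([[0.0, 1.0], [0.0, 3.0]])   # point toward prototype 1
loss_aligned = prototype_contrastive_loss(aligned, labels, prototypes, k=0)
loss_misaligned = prototype_contrastive_loss(misaligned, labels, prototypes, k=0)
```

Because the positive term is excluded from the denominator in Equation (11), the loss is not bounded below by zero: embeddings perfectly aligned with their own prototype yield a negative value, while embeddings aligned with another prototype yield a large positive one.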

During training, class prototypes are computed dynamically as the batch-wise average embedding of each semantic region (intervertebral disc and non-disc):

c_{k} = \frac{1}{| S_{k} |} \sum_{(v_{i}, y_{i}) \in S_{k}} f (v_{i}),

(12)

where $f (v_{i})$ is the embedding at voxel $v_{i}$ and $S_{k}$ is the set of feature locations labeled k. Prototypes are updated each batch to reflect evolving feature distributions. Contrastive learning is used only during training and is not applied at inference.

The set $S_{k}$ contains all spatial feature vectors belonging to class k:

S_{k} = \{(v_{i}, y_{i}) \mid y_{i} = k\}.

(13)
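The batch-wise prototype computation of Equations (12) and (13) amounts to a masked mean over feature locations. The sketch below assumes the embeddings have already been flattened to one row per spatial location; the function name `batch_prototypes` and the choice to leave the prototype of an absent class at zero are assumptions for this example.

```python
import numpy as np

def batch_prototypes(features, labels, num_classes):
    """Mean embedding of all locations labeled k (Eqs. 12-13).

    features: (n, d) array of embeddings f(v_i), one row per spatial location
    labels:   (n,) integer array of class labels y_i
    """
    d = features.shape[1]
    protos = np.zeros((num_classes, d))
    for k in range(num_classes):
        mask = labels == k                 # membership in S_k
        if mask.any():                     # assumed: absent classes stay at zero
            protos[k] = features[mask].mean(axis=0)
    return protos

# Toy batch with two classes (0 = non-disc, 1 = disc) and 2-D embeddings.
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
labels = np.array([1, 1, 0, 0])
protos = batch_prototypes(features, labels, num_classes=2)
```

Recomputing the prototypes from each batch, as the text describes, keeps them in step with the current embedding function at the cost of some batch-to-batch noise.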

Positive samples share the same class as the anchor and are pulled toward $c_{k}$, while features from other classes act as negatives and push the anchor away from other prototypes. In practice, we form $S_{disc}$ and $S_{non-disc}$ per batch and enforce higher similarity with the correct prototype and lower similarity with the rest.

To explicitly separate class prototypes, we add a prototype-distance regularization:

L_{proto} = \sum_{i \neq j} exp (- \frac{∥ c_{i} - c_{j} ∥_{2}^{2}}{2 σ^{2}}),

(14)

where $σ$ controls the penalty strength. The final contrastive objective becomes

L_{c} = L_{contrastive} + β L_{proto},

(15)

with $β$ balancing prototype separation and contrastive alignment.
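Equations (14) and (15) can be illustrated with the short NumPy sketch below. The values of $σ$ and $β$ are not reported in the text, so the defaults used here are placeholders, and the function names are hypothetical.

```python
import numpy as np

def prototype_separation_penalty(prototypes, sigma=1.0):
    """Equation (14): Gaussian penalty summed over ordered pairs i != j."""
    diff = prototypes[:, None, :] - prototypes[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)               # pairwise squared distances
    penalty = np.exp(-sq_dist / (2.0 * sigma ** 2))
    off_diag = ~np.eye(len(prototypes), dtype=bool)    # drop the i == j terms
    return float(penalty[off_diag].sum())

def contrastive_objective(l_contrastive, l_proto, beta):
    """Equation (15): contrastive term plus weighted prototype separation."""
    return l_contrastive + beta * l_proto

close = np.array([[0.0, 0.0], [0.1, 0.0]])   # nearly coincident prototypes
far = np.array([[0.0, 0.0], [3.0, 0.0]])     # well separated prototypes
p_close = prototype_separation_penalty(close)
p_far = prototype_separation_penalty(far)
```

The penalty approaches K(K - 1) when all K prototypes coincide and decays toward zero as they move apart, so minimizing it pushes the disc and non-disc prototypes away from each other.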

In Equation (2), $L = L_{v} + λ L_{c}$, the weighting factor $λ$ was empirically set to 0.1 after validation experiments. Smaller values ($λ < 0.05$) weakened contrastive regularization, whereas larger values ($λ > 0.2$) reduced localization accuracy. An ablation over $λ \in \{0.05, 0.1, 0.2, 0.5\}$ confirmed $λ = 0.1$ as optimal.

4. Experimental Evaluation and Analysis

This section describes our experimental setup, including the datasets and evaluation metrics, and examines the results obtained. Our study draws upon the Spine Generic Dataset [], a publicly available resource that covers spinal imaging from medical centers around the world. The dataset spans 42 centers and 260 participants, and provides T1 and T2 MRI contrasts for each subject.

4.1. Datasets and Acquisition

The Spine Generic Dataset serves as a robust foundation for our experimentation, capturing a diverse range of imaging scenarios. The dataset’s breadth is derived from its multinational acquisition, spanning 42 centers. The inclusion of 260 participants ensures a representative sample size, reflecting varied demographics and medical conditions. Importantly, the dataset features both T1 and T2 MRI contrasts for each participant, adding a layer of complexity and realism to the evaluation.

The inherent diversity of the dataset extends beyond geographical locations, encapsulating variations in image quality, participant ages, and imaging devices employed by different medical institutes. This heterogeneity intentionally mirrors the real-world challenges faced in clinical scenarios, providing a robust and challenging benchmark for the specific task of intervertebral disc labeling.

4.2. Results

The quantitative results reported in Table 2 confirm that MSA-Net achieves the lowest Distance to Target (DTT) among all evaluated methods across both T1- and T2-weighted MRI scans. On T1w, the proposed framework reaches a DTT of 0.98 mm (±1.12), which improves on classical template matching (1.97 mm ± 4.08), Countception (1.03 mm ± 2.81), and the pose estimation baseline (1.32 mm ± 1.33). Even when compared with more recent deep learning systems, the improvement remains clear: a 17.6% relative reduction over HCA-Net (1.19 mm ± 1.08) and a 31.9% reduction over Swin-Net (1.44 mm ± 1.22). On T2-weighted images, MSA-Net maintains this advantage, lowering the DTT from 2.05 mm in Template Matching, 1.78 mm in Countception, 1.31 mm in pose estimation, 1.86 mm in Swin-Net, and 1.26 mm in HCA-Net to 1.05 mm (±1.43). The reduction in variability is also notable, as the T1w standard deviation decreases from 2.81 and 4.08 in earlier methods to 1.12 for our model, demonstrating a considerably tighter distribution and more reliable localization across subjects and acquisition centers.
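The relative reductions quoted above follow directly from the reported T1w DTT values. The short check below reproduces the arithmetic; the helper name `relative_reduction` is introduced here only for illustration.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# T1w DTT values (mm) as reported in the text.
msa_net_t1w = 0.98
baselines_t1w = {"HCA-Net": 1.19, "Swin-Net": 1.44}

reductions = {name: relative_reduction(value, msa_net_t1w)
              for name, value in baselines_t1w.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}%")   # HCA-Net: 17.6%, Swin-Net: 31.9%
```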

In terms of detection reliability, MSA-Net achieves the lowest false negative rate on T1w scans and a competitive rate on T2w scans, while completely eliminating false positives. On T1w scans, our FNR of 0.28% improves upon Template Matching (8.1%), Countception (4.24%), and the baseline hourglass (7.3%) by large margins, while also outperforming the transformer-based Swin-Net (1.3%) and slightly improving on HCA-Net (0.3%). On T2w scans, MSA-Net attains an FNR of 0.85%, compared to 11.1% in Template Matching, 3.88% in Countception, 5.4% in the baseline hourglass, and 4.61% in Swin-Net; only HCA-Net reports a lower value (0.61%). The elimination of false positives (FPR = 0.0% in both T1w and T2w) is shared only with the pose estimation and HCA-Net methods, yet our model provides better spatial accuracy than both (0.98 mm vs. 1.32 mm and 1.19 mm for T1w, respectively). These improvements demonstrate that MSA-Net not only reduces gross detection errors but also maintains precise localization of true discs even in challenging and low-contrast cases.

The underlying reasons for these improvements lie in the complementary nature of the proposed architectural components. Classical methods such as Template Matching exhibit poor performance primarily because they rely on intensity correlation without modeling geometric context, resulting in high FNR (8.1–11.1%) and unstable localization. Countception improves repeat-structure modeling but still lacks anatomical priors, leading to higher FPR (0.9–1.5%) and large estimation variance. The pose estimation approach introduced by Azad et al. reduces false positives through heatmap regression and achieves competitive FNR (0.32%), but its single-scale design fails to capture variability in disc size and appearance, reflected in its higher DTT (1.32 mm vs. 0.98 mm). Transformer-based architectures such as Swin-Net provide improved long-range modeling but introduce large uncertainty when disc size varies across slices, resulting in weaker DTT (1.44 mm and 1.86 mm) and elevated FNR (1.3% and 4.61%). HCA-Net incorporates hierarchical contextual attention and achieves strong performance, yet its reliance on standard segmentation-based supervision limits its ability to separate disc and non-disc features in ambiguous regions, which explains why it still exhibits a higher DTT than MSA-Net even when FPR is suppressed.

In contrast, the proposed M-GCA module demonstrates a clear benefit through its ability to adaptively integrate local 3 × 3 and global 7 × 7 receptive fields, enabling the network to model fine disc boundaries while simultaneously maintaining awareness of vertebral column geometry. This leads to a sharp reduction in localization error and variance. The deep contrastive supervision further improves feature separability by refining the embedding space such that disc and non-disc representations remain well-clustered even under severe intensity variability. Empirically, this is reflected in the reduction in FNR from 7.3% in the baseline to 0.28% in our model, and the complete elimination of false positives without sacrificing spatial precision. The synergy of multi-scale attention and contrastive discrimination therefore provides measurable improvements over both classical appearance-based approaches and more recent attention or transformer-based systems.

4.3. Ablation Study

To validate the contribution of each component, we conducted systematic ablation experiments, as shown in Table 3. The baseline hourglass network achieves 1.45 mm DTT, which improves to 1.12 mm with the addition of M-GCA alone, demonstrating its effectiveness in multi-scale feature integration. Contrastive learning alone provides different benefits, reducing FNR from 7.3% to 1.8% by improving feature discriminability.

Table 3. Intervertebral disc labeling results on the public Spine Generic Dataset. Note that DTT indicates the Distance to Target.

The complete MSA-Net configuration shows synergistic effects, where the combination of M-GCA and contrastive learning yields better results than either component alone. Key observations include that M-GCA contributes most to localization precision (DTT improvement from 1.45 mm to 1.12 mm), as its dynamic receptive fields better capture disc boundaries and spatial relationships. The attention mechanism’s ability to focus on relevant scales reduces errors in cases with unusual disc spacing or orientation. Contrastive learning shows the strongest impact on reducing false negatives (FNR drops from 7.3% to 1.8%), as it learns more robust features that distinguish true discs from similar-looking structures. This is particularly valuable for detecting small or degenerated discs that might otherwise be missed. The complete system’s performance (0.98 mm DTT, 0.28% FNR) surpasses the sum of individual improvements, indicating that M-GCA and contrastive learning complement each other—the attention mechanism provides better spatial features for contrastive learning to discriminate, while the contrastive framework guides the attention to focus on more semantically meaningful regions.

Figure 3 and Figure 4 provide representative qualitative results for MSA-Net on T1-weighted spine MRI. The predicted masks exhibit strong alignment with ground truth labels, particularly in the central disc regions. The model maintains tight localization even in images with low contrast or anatomical curvature. Notably, the method avoids common errors such as mislabeling vertebrae or merging adjacent discs, which were frequent in baseline models. In some cases involving pathological deformation or disc degeneration, minor under-segmentation occurs at disc boundaries. However, the predicted regions still provide accurate centroid locations, preserving the labeling task’s clinical value. These instances expose a limitation of the method, as it has not been explicitly trained to handle such anatomical extremes. Even in these challenging cases, our method maintains better performance than existing alternatives, demonstrating its robustness to anatomical variability.

Figure 3. The qualitative outcomes of the proposed approach for identifying intervertebral disc positions in MRI T1 scans. The model accurately predicts the location of each disc and visually distinguishes them using different colors based on semantic segmentation.

Figure 4. The heatmap outcomes of the proposed approach for identifying intervertebral disc positions in MRI T1 scans. The method generates spatial probability maps that highlight disc regions and reinforce localization confidence across the spinal column.

4.4. Computational Complexity

In addition to accuracy, we report the computational characteristics of the proposed model. MSA-Net contains approximately 2.96 million trainable parameters, significantly fewer than many transformer-based architectures (e.g., Swin-UNet variants typically exceeding 20 M parameters). Despite incorporating multi-scale attention and contrastive supervision, the model maintains a lightweight footprint and efficient runtime. All experiments were implemented in Python 3.8 using PyTorch 2.0 (Meta AI, New York, NY, USA), together with NumPy 2.2, SciPy 1.15, and OpenCV 4.9 for scientific computing and image processing. On a single NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), inference on a batch of 16 axial MRI slices takes approximately 0.48 s, corresponding to an average processing time of roughly 30 ms per image. This indicates that the proposed architecture can operate in near real time and is sufficiently fast for integration into clinical pipelines, PACS systems, or interactive diagnostic workflows without requiring specialized hardware or extreme memory resources.
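The per-image figure follows directly from the batch timing. The sketch below reproduces that arithmetic and adds a generic timing helper of the kind one might use to obtain such a measurement; `median_batch_seconds` and its `run_batch` argument are hypothetical, and for GPU inference the callable would need to block until the device has finished before the timer is read.

```python
import statistics
import time

def median_batch_seconds(run_batch, repeats=5):
    """Median wall-clock time of a callable that processes one batch.
    run_batch is a hypothetical placeholder for the model's forward pass."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_batch()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def per_image_latency_ms(batch_seconds, batch_size):
    """Average latency per image in milliseconds."""
    return 1000.0 * batch_seconds / batch_size

# Values reported in the text: 0.48 s for a batch of 16 slices.
latency_ms = per_image_latency_ms(0.48, 16)
images_per_second = 1000.0 / latency_ms
print(f"{latency_ms:.1f} ms per image, about {images_per_second:.1f} images/s")
```

At roughly 30 ms per image, the reported timing corresponds to a throughput of about 33 images per second.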

4.5. Training Dynamics and Sensitivity of the Contrastive Objective

To better characterize the training dynamics of the contrastive loss, we conducted a small sensitivity analysis of the temperature parameter τ and the batch size, while keeping all other settings fixed. In particular, we evaluated τ ∈ {0.05, 0.10, 0.20} and batch sizes of 8, 16, and 32, and report the corresponding DTT values for both T1w and T2w modalities in Table 4. The results show that MSA-Net is robust to moderate changes in these hyperparameters: the DTT on T1w varies within a narrow band of 0.98–1.02 mm and on T2w within 1.05–1.10 mm across all tested configurations. The configuration used in our main experiments (τ = 0.10, batch size = 16) consistently yields the best trade-off between accuracy and stability, with DTT values of 0.98 mm (T1w) and 1.05 mm (T2w).
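The ranges quoted above can be checked directly against the Table 4 entries. The following small sketch stores the five tested configurations and derives the DTT bands and the best-performing setting; the tuple layout and variable names are illustrative.

```python
# Table 4 rows: (tau, batch size, T1 DTT in mm, T2 DTT in mm).
table4 = [
    (0.05, 16, 1.01, 1.09),
    (0.10, 16, 0.98, 1.05),
    (0.20, 16, 1.02, 1.10),
    (0.10, 8, 1.00, 1.07),
    (0.10, 32, 0.99, 1.06),
]

t1_values = [row[2] for row in table4]
t2_values = [row[3] for row in table4]
t1_band = (min(t1_values), max(t1_values))
t2_band = (min(t2_values), max(t2_values))

# Rank configurations by T1 DTT first, then T2 DTT.
best = min(table4, key=lambda row: (row[2], row[3]))

print(f"T1 DTT band: {t1_band[0]:.2f}-{t1_band[1]:.2f} mm")
print(f"T2 DTT band: {t2_band[0]:.2f}-{t2_band[1]:.2f} mm")
print(f"Lowest DTT at tau={best[0]}, batch size={best[1]}")
```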

Table 4. Sensitivity of MSA-Net to contrastive loss hyperparameters on the Spine Generic dataset. Only the temperature τ and batch size are varied; all other settings are kept fixed.

Very low temperatures (e.g., τ = 0.05) sharpen the contrastive objective and make the similarity distribution more peaked, which we observed to slightly slow down convergence in the early epochs and marginally degrade the final DTT (1.01 mm/1.09 mm for T1w/T2w). Conversely, higher temperatures (e.g., τ = 0.20) lead to a flatter similarity landscape and weaker prototype separation, which again results in slightly higher DTT (1.02 mm/1.10 mm). Regarding batch size, smaller batches (8 samples) introduce more gradient noise and yield slightly worse localization than the default setting, while larger batches (32 samples) do not provide a clear performance gain but increase memory consumption and training time. Across all tested settings, training remained numerically stable, with smoothly decreasing training and validation losses and no oscillatory behavior or divergence. These observations indicate that the contrastive component is well-conditioned and that the chosen hyperparameters (τ = 0.10, batch size = 16) provide a robust and efficient operating point.
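The effect of the temperature on the similarity distribution can be seen in a toy example. The sketch below assumes an InfoNCE-style objective, in which the loss is the negative log softmax probability of the positive pair after dividing similarities by τ; the exact loss used in MSA-Net may differ, and the similarity values and function names are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def info_nce(sim_pos, sim_negs, tau):
    """Negative log probability of the positive pair at temperature tau
    (assumed InfoNCE form, not necessarily the paper's exact loss)."""
    probs = softmax([sim_pos / tau] + [s / tau for s in sim_negs])
    return -math.log(probs[0])

# Hypothetical cosine similarities: one positive pair and three negatives.
sim_pos, sim_negs = 0.8, [0.5, 0.3, 0.1]

positive_prob = {}
for tau in (0.05, 0.10, 0.20):
    probs = softmax([sim_pos / tau] + [s / tau for s in sim_negs])
    positive_prob[tau] = probs[0]
    loss = info_nce(sim_pos, sim_negs, tau)
    print(f"tau={tau:.2f}  p(positive)={probs[0]:.3f}  loss={loss:.3f}")
```

With these toy values, a smaller τ concentrates almost all probability mass on the most similar pair (a peaked distribution), while a larger τ spreads it more evenly across the negatives, which matches the qualitative behavior described above.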

5. Conclusions

In this paper, we presented MSA-Net, a multi-scale attention network for the semantic labeling of intervertebral discs in MRI scans. Our method addresses known difficulties in this task, including intensity similarities with surrounding tissues, anatomical variability between subjects, and limited annotated data. We combine attention-based multi-scale feature extraction with contrastive learning to improve both localization precision and feature robustness.

Intervertebral disc labeling remains challenging due to anatomical ambiguities, patient-specific variations in disc geometry, and the need for accurate identification across different imaging protocols. Manual annotation is slow and error-prone, motivating the development of automated tools that are both accurate and generalizable. This study was driven by the need for a method that not only captures fine-grained spatial features but also incorporates prior geometric structure into the learning process.

MSA-Net achieves state-of-the-art performance on the Spine Generic Dataset, showing improved accuracy in both T1w and T2w modalities. Specifically, our method achieves a Distance to Target of 0.98 mm for T1w and 1.05 mm for T2w scans, while eliminating false positives entirely. Compared to previous approaches, it reduces the false negative rate by 56% in T1w and 29% in T2w images.

Looking ahead, further work may explore adapting this approach to related tasks such as vertebra localization or disc degeneration grading and evaluating robustness under limited annotation or different imaging protocols.

Author Contributions

Conceptualization, M.D.A.; Methodology, A.G.; Software, T.A.; Validation, A.G.; Formal analysis, A.G.; Resources, T.A.; Data curation, A.G.; Writing—original draft, M.D.A.; Writing—review and editing, A.G.; Supervision, M.D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-25-DR-753). Therefore, the authors thank the University of Jeddah for its technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on GitHub at https://github.com/spine-generic/data-multi-subject (accessed on 20 June 2025).

Acknowledgments

This work was funded by the University of Jeddah, Jeddah, Saudi Arabia, under grant No. (UJ-25-DR-753). Therefore, the authors thank the University of Jeddah for its technical and financial support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Singh, M.; Ansari, M.S.A.; Govil, M.C. Detection of fractional difference in inter vertebral disk MRI images for recognition of low back pain. Image Vis. Comput. 2025, 153, 105333.
Tian, Z.; Shen, Z.; Chen, H.; Zhao, P. Silk fibroin–collagen hydrogel loaded with IGF1-CESCs attenuates intervertebral disk degeneration by accelerating annulus fibrosus healing in rats. Front. Pharmacol. 2025, 16, 1552174.
Liawrungrueang, W.; Park, J.B.; Cholamjiak, W.; Sarasombath, P.; Riew, K.D. Artificial intelligence-assisted MRI diagnosis in lumbar degenerative disc disease: A systematic review. Glob. Spine J. 2025, 15, 1405–1418.
Kim, K.H.; Koo, H.W.; Lee, B.J. Predictive stress analysis in simplified spinal disc model using physics-informed neural networks. Comput. Methods Biomech. Biomed. Eng. 2025, 1–13.
Senthilkumaran, N.; Vaithegi, S. Image segmentation by using thresholding techniques for medical images. Comput. Sci. Eng. Int. J. 2016, 6, 1–13.
Zhao, Y.-Q.; Gui, W.-H.; Chen, Z.-C.; Tang, J.-T.; Li, L.-Y. Medical images edge detection based on mathematical morphology. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 17–18 January 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 6492–6495.
Ghosh, S.; Malgireddy, M.R.; Chaudhary, V.; Dhillon, G. A new approach to automatic disc localization in clinical lumbar MRI: Combining machine learning with heuristics. In Proceedings of the 2012 9th IEEE International Symposium on Biomedical Imaging (ISBI), Barcelona, Spain, 2–5 May 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 114–117.
Zhan, Y.; Maneesh, D.; Harder, M.; Zhou, X.S. Robust MR spine detection using hierarchical learning and local articulated model. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Nice, France, 1–5 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 141–148.
Chen, H.; Dou, Q.; Wang, X.; Qin, J.; Cheng, J.C.; Heng, P.A. 3D fully convolutional networks for intervertebral disc localization and segmentation. In Proceedings of the International Conference on Medical Imaging and Augmented Reality, Bern, Switzerland, 24–26 August 2016; Springer: Cham, Switzerland, 2016; pp. 375–382.
Ji, X.; Zheng, G.; Belavy, D.; Ni, D. Automated intervertebral disc segmentation using deep convolutional neural networks. In Proceedings of the Computational Methods and Clinical Applications for Spine Imaging: 4th International Workshop and Challenge, CSI 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, 17 October 2016; Revised Selected Papers 4. Springer: Cham, Switzerland, 2016; pp. 38–48.
Zeng, G.; Zheng, G. DSMS-FCN: A deeply supervised multi-scale fully convolutional network for automatic segmentation of intervertebral disc in 3D MR images. In Proceedings of the Computational Methods and Clinical Applications in Musculoskeletal Imaging: 5th International Workshop, MSKI 2017, Held in Conjunction with MICCAI 2017, Quebec City, QC, Canada, 10 September 2017; Revised Selected Papers 5. Springer: Cham, Switzerland, 2018; pp. 148–159.
Forsberg, D.; Sjöblom, E.; Sunshine, J.L. Detection and labeling of vertebrae in MR images using deep learning with clinical annotations as training data. J. Digit. Imaging 2017, 30, 406–412.
Zhu, X.; He, X.; Wang, P.; He, Q.; Gao, D.; Cheng, J.; Wu, B. A method of localization and segmentation of intervertebral discs in spine MRI based on Gabor filter bank. Biomed. Eng. Online 2016, 15, 32.
Alomari, R.; Corso, J.J.; Chaudhary, V. Labeling of lumbar discs using both pixel-and object-level features with a two-level probabilistic model. IEEE Trans. Med. Imaging 2010, 30, 1–10.
Michopoulou, S.K.; Costaridou, L.; Panagiotopoulos, E.; Speller, R.; Panayiotakis, G.; Todd-Pokropek, A. Atlas-based segmentation of degenerated lumbar intervertebral discs from MR images of the spine. IEEE Trans. Biomed. Eng. 2009, 56, 2225–2231.
Peng, Z.; Zhong, J.; Wee, W.; Lee, J.h. Automated vertebra detection and segmentation from the whole spine MR images. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 17–18 January 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 2527–2530.
Castro-Mateos, I.; Pozo, J.M.; Lazary, A.; Frangi, A.F. 2D segmentation of intervertebral discs and its degree of degeneration from T2-weighted magnetic resonance images. In Proceedings of the Medical Imaging 2014: Computer-Aided Diagnosis, San Diego, CA, USA, 15–20 February 2014; SPIE: Bellingham, WA, USA, 2014; Volume 9035, pp. 310–320.
Haq, R.; Aras, R.; Besachio, D.A.; Borgie, R.C.; Audette, M.A. 3D lumbar spine intervertebral disc segmentation and compression simulation from MRI using shape-aware models. Int. J. Comput. Assist. Radiol. Surg. 2015, 10, 45–54.
Law, M.W.; Tay, K.; Leung, A.; Garvin, G.J.; Li, S. Intervertebral disc segmentation in MR images using anisotropic oriented flux. Med. Image Anal. 2013, 17, 43–61.
Azad, R.; Rouhier, L.; Cohen-Adad, J. Stacked Hourglass Network with a Multi-level Attention Mechanism: Where to Look for Intervertebral Disc Labeling. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Strasbourg, France, 27 September 2021; Springer: Cham, Switzerland, 2021; pp. 406–415.
Azad, R.; Heidari, M.; Cohen-Adad, J.; Adeli, E.; Merhof, D. Intervertebral Disc Labeling with Learning Shape Information, A Look Once Approach. In Proceedings of the Predictive Intelligence in Medicine: 5th International Workshop, PRIME 2022, Held in Conjunction with MICCAI 2022, Singapore, 22 September 2022.
Paul Cohen, J.; Boucher, G.; Glastonbury, C.A.; Lo, H.Z.; Bengio, Y. Count-ception: Counting by fully convolutional redundant counting. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 18–26.
Kim, K.; Lee, S. Vertebrae localization in CT using both local and global symmetry features. Comput. Med. Imaging Graph. 2017, 58, 45–55.
Bozorgpour, A.; Azad, B.; Azad, R.; Velichko, Y.; Bagci, U.; Merhof, D. HCA-Net: Hierarchical context attention network for intervertebral disc semantic labeling. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5.
Ullmann, E.; Pelletier Paquette, J.F.; Thong, W.E.; Cohen-Adad, J. Automatic labeling of vertebral levels using a robust template-based approach. Int. J. Biomed. Imaging 2014, 2014, 719520.
Rouhier, L.; Romero, F.P.; Cohen, J.P.; Cohen-Adad, J. Spine intervertebral disc labeling using a fully convolutional redundant counting model. arXiv 2020, arXiv:2003.04387.
Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 205–218.
Cohen-Adad, J.; Alonso-Ortiz, E.; Abramovic, M.; Arneitz, C.; Atcheson, N.; Barlow, L.; Barry, R.L.; Barth, M.; Battiston, M.; Büchel, C.; et al. Open-access quantitative MRI data of the spinal cord and reproducibility across participants, sites and manufacturers. Sci. Data 2021, 8, 219.

Figure 1. Overall architecture of the proposed MSA-Net framework. The dual-path hourglass structure integrates multi-scale Gaussian context attention (M-GCA) and contrastive learning to capture both local and global disc features for robust labeling.

Figure 2. Structure of the Multi-Scale Gaussian Context Attention (M-GCA) module. Dual convolutional paths with 3 × 3 and 7 × 7 kernels model local and global contexts, which are fused through adaptive Gaussian weighting to emphasize informative intervertebral disc regions. Global average pooling (avg) is applied prior to Gaussian excitation.

Table 1. Comparison of representative intervertebral disc labeling and segmentation methods in the literature.

Method	Year	Model Type	Data/Modality	Key Strengths	Main Limitations
Thresholding []	2016	Rule-based, requires prior knowledge of intervertebral location and intensity threshold values	MRI	Simple, fast	Sensitive to intensity variations and noise, not generalizable
Edge Detection []	2006	Morphology-based	MRI	Detects sharp boundaries	Fails under low contrast or curved anatomy, limited generalization, especially on multi-center data
Template Matching []	2012	Atlas-based	MRI	Uses prior shapes	Poor generalization to pathology, requires an accurate atlas model for acceptable performance
Atlas-based segmentation []	2009	Deformable + atlas	MRI	Incorporates anatomical priors	Requires manual initialization, slow
Active Contour + Fuzzy C-means []	2014	Level set-based	MRI	Handles smooth shapes	Sensitive to initialization and noise
3D FCN []	2016	Fully Convolutional	3D MRI	Learns volumetric context	Slow prediction, inability to model long-range dependencies, lacks ordering
Patch-based CNN []	2016	2D CNN	Patches	Good local detail	Ignores global spine structure
Gradient-based Disc Localization []	2022	CNN + gradient input	MRI	Better disc boundary modeling	Still limited to local features
Countception []	2017	Redundant counting CNN	MRI	Good for repetitive structures	Detection only, no labeling or geometry
Vertebra Symmetry Localization []	2017	Symmetry-based	CT/MRI	Uses global symmetry cues	Fails when spine is deformed
DSMS-FCN []	2018	Deeply Supervised Multi-scale	3D MRI	Multi-scale gradients, improved convergence	Still pixel-based, no semantic ordering
Hourglass + Pose []	2021	Stacked Hourglass (heatmap)	MRI	Pose-based disc ordering, centroid supervision	Single-scale, misses degenerated discs
HCA-Net []	2023	Hierarchical Context-Aware Attention Network	IVD labeling	Multi-scale attention + spatial context	Still lacks explicit contrastive representation learning
Proposed MSA-Net (Ours)	2025	Multi-scale Attention + Contrastive	MRI (2D slices)	Captures global + local context, enforces geometry, improves discriminability, reduces false positives and negatives	Requires GPU, additional contrastive loss tuning

Table 2. Performance comparison of intervertebral disc labeling methods on the Spine Generic dataset. DTT denotes the Distance to Target metric.

Method	T1 DTT (mm)	T1 FNR (%)	T1 FPR (%)	T2 DTT (mm)	T2 FNR (%)	T2 FPR (%)
Template Matching []	1.97 (±4.08)	8.1	2.53	2.05 (±3.21)	11.1	2.11
Countception []	1.03 (±2.81)	4.24	0.9	1.78 (±2.64)	3.88	1.5
Pose Estimation []	1.32 (±1.33)	0.32	0.0	1.31 (±2.79)	1.2	0.6
Swin-Net []	1.44 (±1.22)	1.3	0.4	1.86 (±3.10)	4.61	1.8
HCA-Net []	1.19 (±1.08)	0.3	0.0	1.26 (±2.16)	0.61	0.0
Baseline	1.45 (±2.70)	7.3	1.2	1.80 (±2.80)	5.4	1.8
Proposed (MSA-Net)	0.98 (±1.12)	0.28	0.0	1.05 (±1.43)	0.85	0.0

Table 3. Intervertebral disc labeling results on the public Spine Generic Dataset. Note that DTT indicates the Distance to Target.

Method	T1 DTT (mm)	T1 FNR (%)	T1 FPR (%)
Baseline	1.45 (±2.70)	7.3	1.2
M-GCA only	1.12 (±1.65)	3.2	0.8
Contrastive only	1.30 (±2.05)	1.8	0.4
Full MSA-Net	0.98 (±1.12)	0.28	0.0

Table 4. Sensitivity of MSA-Net to contrastive loss hyperparameters on the Spine Generic dataset. Only the temperature τ and batch size are varied; all other settings are kept fixed.

τ	Batch Size	T1 DTT (mm)	T2 DTT (mm)
0.05	16	1.01 (±1.20)	1.09 (±1.47)
0.10	16	0.98 (±1.12)	1.05 (±1.43)
0.20	16	1.02 (±1.18)	1.10 (±1.50)
0.10	8	1.00 (±1.19)	1.07 (±1.49)
0.10	32	0.99 (±1.15)	1.06 (±1.45)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
