Article

Segment Anything Model (SAM) and Medical SAM (MedSAM) for Lumbar Spine MRI

by Christian Chang, Hudson Law, Connor Poon, Sydney Yen, Kaustubh Lall, Armin Jamshidi, Vadim Malis, Dosik Hwang, and Won C. Bae *
1 Punahou School, Honolulu, HI 96822, USA
2 The Cambridge School, San Diego, CA 92129, USA
3 Polytechnic School, Pasadena, CA 91106, USA
4 Valencia High School, Santa Clarita, CA 92870, USA
5 ResMed Inc., San Diego, CA 92123, USA
6 Radicle Science, Encinitas, CA 92024, USA
7 Department of Radiology, University of California-San Diego, San Diego, CA 92093, USA
8 Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Republic of Korea
9 Center for Healthcare Robotics, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
10 Department of Radiology, Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul 03722, Republic of Korea
11 Department of Oral and Maxillofacial Radiology, Yonsei University College of Dentistry, Seoul 03722, Republic of Korea
12 Department of Radiology, VA San Diego Healthcare System, San Diego, CA 92161, USA
* Author to whom correspondence should be addressed.
Sensors 2025, 25(12), 3596; https://doi.org/10.3390/s25123596
Submission received: 14 March 2025 / Revised: 4 June 2025 / Accepted: 5 June 2025 / Published: 7 June 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

Lumbar spine Magnetic Resonance Imaging (MRI) is commonly used to evaluate the intervertebral discs (IVD) and vertebral bodies (VB) in patients with low back pain. Segmentation of these tissues can provide useful quantitative information such as shape and volume. The objective of this study was to determine the performance of the Segment Anything Model (SAM) and medical SAM (MedSAM), two “zero-shot” deep learning models, in segmenting lumbar IVD and VB from MRI images, and to compare them against the nnU-Net model. This cadaveric study used 82 donor spines. Manual segmentation served as the ground truth. Two readers processed the spine MRI using SAM and MedSAM by placing points or drawing bounding boxes around regions of interest (ROI), and the outputs were compared against the ground truths to determine Dice score, sensitivity, and specificity. Qualitatively, results varied, but overall MedSAM produced more consistent results than SAM; neither matched the performance of nnU-Net. Mean Dice scores for MedSAM were 0.79 for IVDs and 0.88 for VBs, significantly higher (each p < 0.001) than those for SAM (0.64 for IVDs, 0.83 for VBs), and both were lower than those for nnU-Net (0.99 for IVD and VB). Sensitivity values also favored MedSAM. These results demonstrate the feasibility of “zero-shot” DL models for segmenting lumbar spine MRI. While their performance falls short of recent supervised models, zero-shot models offer key advantages: they need no training data and adapt quickly to other anatomies and tasks. Validation of a generalizable segmentation model for lumbar spine MRI can lead to more precise diagnostics and follow-up, enhanced back pain research, and potential cost savings from automated analyses, while supporting the broader use of AI and machine learning in healthcare.

1. Introduction

Low back pain (LBP) affects millions worldwide and has severe economic consequences due to healthcare costs, lost productivity, and disability [1]. LBP is a multi-factorial disease, in which an alteration in any of the tissues near the lumbar spine may cause the pain. Among these, degeneration of the intervertebral discs (IVD) [2] and vertebral bodies (VB) [3] has been implicated in LBP, and clinical imaging such as magnetic resonance imaging (MRI) routinely includes evaluation of these tissues [4,5,6,7,8,9,10,11,12]. MRI protocols including sagittal spin echo T2-weighted sequences [13] are often used to evaluate both IVD degeneration [12] and alterations in VB bone and marrow.
While MRI images are typically evaluated as a whole by a radiologist familiar with the anatomy, advanced quantitative techniques such as T2 mapping [14,15], in which MR properties of the tissues are measured, require regions of interest (ROI) to be drawn around each tissue (i.e., segmentation) to determine the tissue-specific T2 value. Measurement of tissue volume [16] or shape [17,18] likewise requires segmentation.
Conventionally, automated segmentation of these tissues has been challenging. As apparent on typical spin echo MRI images of a lumbar spine (Figure 1), the boundaries of the IVD and VB can be irregular, and the signal intensity of the contiguous tissue is often inhomogeneous. Classical methods used histogram-based voxel intensity thresholding to segment the image [19,20]. Another conventional method is region growing, in which a seed point is placed and a region is grown, based on voxel-intensity similarity, to fill an ROI [21,22,23].
Deep learning (DL) models such as the fully convolutional network (FCN) [24], U-Net [25], and Google DeepLab [26] have shown good results in medical image segmentation, reaching Dice scores of 85% or higher when segmenting the IVD [27,28]. In a study by Wang et al. [29], a 3D DeepLab was used to segment multiple lumbar structures, including vertebrae, intervertebral discs, nerve roots, and muscles, achieving high Dice scores > 0.85. Another study [30] developed a deep neural network for automatic segmentation of lumbar muscles (e.g., psoas major, multifidus) in axial MRI scans, validated on external datasets. Similarly, [31] demonstrated the feasibility of convolutional neural networks for fully automated segmentation of vertebral bodies, discs, and paraspinous muscles, with results comparable to human-demarcated masks, with Dice scores ≥ 0.77. Building upon the basic U-Net [32,33], boundary-specific U-Net [27], which uses a loss function modified for the boundary of the segmentation rather than the area, and MultiResUNet [34,35], which uses multiple convolutional layers with different kernel sizes to capture features at varying spatial resolutions, have been used for improved intervertebral disc segmentation. Sharp U-Net [36] is another enhanced variant of U-Net that applies a depthwise convolutional sharpening filter to the encoder’s feature maps before merging them with the decoder’s features; it matched or exceeded a plain U-Net with the same number of parameters when segmenting images from different modalities. Using U-Net’s basic architecture but focusing on refined automatic configuration and training rather than introducing a new architecture, nnU-Net [37] (“no new net”) was found to perform well for segmenting a variety of organs when compared against existing models. However, nnU-Net’s performance on musculoskeletal images has not yet been reported.
These models, while varied in architecture, all require supervised learning with training data that is highly specific in terms of image appearance (e.g., lumbar spine MRI with a particular weighting) and task (e.g., segmentation of only the IVD), rendering them ungeneralizable to other images and tasks. Recently, promptable foundation segmentation DL models have been introduced, such as the Segment Anything Model (SAM) [38] developed by Meta, and a fine-tuned adaptation of SAM called medical SAM (MedSAM) [39], which was trained on additional medical image segmentation pairs. At a high level, these models build on Transformer vision models [40,41] with an image encoder, a prompt encoder, and a mask decoder. These pre-trained “zero-shot” models are meant to perform segmentation across multiple types of images and objects using only a minimal user input prompt, such as the placement of point(s) or a bounding box around the object of interest.
For broader adoption and usage of different DL models, validation and understanding of model behavior under different conditions are needed. SAM and MedSAM have not been tested in depth for lumbar spine MRI segmentation of the IVD and VB, nor has their behavior on MRI images acquired with different scan parameters been examined. The purpose of this study was to (1) compare the performance metrics of SAM, MedSAM, and nnU-Net for the segmentation of intervertebral discs (IVD) and vertebral bodies (VB) in lumbar spine MRI, (2) determine how image contrast affects segmentation metrics, and (3) determine inter-reader agreement in segmentation metrics for SAM and MedSAM.

2. Materials and Methods

IRB Statement: This cadaveric study was exempt from Institutional Review Board approval. Informed written consent was not required.
Study Design: This is a prospective cross-sectional study to compare the performance metrics of segmentation of IVD and VB from lumbar spine MRI, when using different DL models.
Spine Specimens: MRI images of 82 cadaveric lumbar spines, each including at least the L1 to L5 vertebrae, were acquired. The specimens were from 64 males and 18 females, aged 24 to 75 years (mean ± standard deviation, 57.5 ± 10.4 years).
MRI: MRI data from a recently published work [42] were used. Imaging was performed on a 3-T system (GE Signa HDx) with an 8-channel transmit/receive knee coil. The imaging protocol was sagittal fast spin echo (FSE) with constant repetition time (TR) and variable echo time (TE) to acquire eight images suitable for T2 mapping (Figure 1). This opportunistic dataset made it possible to determine the effect of varying TE on the performance of the deep learning models. The following scan parameters were used: sagittal imaging plane through the center of the spine; TR = 2000 milliseconds (ms); eight TEs = 8, 16, 24, 32, 40, 48, 56, and 64 ms; field of view (FOV) = 180 to 220 mm; image matrix = 256 × 256; and slice thickness = 3 millimeters (mm). This MRI sequence is similar to a routine sagittal T2 sequence used in clinical settings [43]. Note the change in image appearance with TE: at the first TE of 8 ms, the signal intensity of all tissues is the highest and the IVD shows the most homogeneous signal intensity; at the highest TE of 64 ms, the IVD shows distinctly high signal intensity in the nucleus pulposus (NP) and low signal intensity in the annulus fibrosus (AF).
Prompts for SAM and MedSAM: To perform segmentation with SAM, two prompting methods were used. The first required point inputs: one within each target region and one in the background. Two readers (high school students) with no previous experience were trained by the senior investigator (W.C.B., with over 10 years of experience) in MRI sectional anatomy of the lumbar spine and tasked with placing points within each IVD and VB on the first-TE image (proton density-weighted), as well as in the background (Figure 2A). The same readers performed the second task, which was to draw multiple boxes encompassing the IVDs and VBs (Figure 2B). MedSAM accepts the box input but not the point input, so the same box inputs were used. These input prompts were created using ImageJ [44] (v2.1.0), and the coordinates of the points and boxes were saved for further processing. Although the readers were inexperienced, given the simple nature of placing points or bounding boxes, they had no trouble completing the task correctly when checked by the senior investigator.
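As a minimal illustration of this hand-off (the file names and CSV layout are assumptions for illustration, not the study’s actual files), coordinates exported from ImageJ can be loaded into MATLAB as numeric arrays:
% Hypothetical files: each row of points.csv is an [x y] point saved from
% ImageJ, and each row of boxes.csv is a bounding box [x y width height].
points = readmatrix('points.csv');
boxes = readmatrix('boxes.csv');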
Ground Truth Segmentation: All images were manually segmented using ImageJ [44] (v2.1.0) by a senior investigator (W.C.B.) with over 10 years of experience in spine MRI research. Regions of interest (ROI) representing the entire IVD and VB were drawn on the first TE image. The boundaries of the IVD, including both the NP and AF, were the superior and inferior endplates, and anterior and posterior longitudinal ligaments (Figure 2C). The boundaries of the VB were the longitudinal ligaments and the IVDs (Figure 2D).
Deep Learning Models: For SAM and MedSAM (Figure 3A), we utilized the pre-trained built-in models available in MATLAB with the Image Processing Toolbox (R2024b). The models and MRI images were loaded into the program, and segmentations of each IVD and VB were generated from the coordinates of the points (for SAM) or the boxes (for SAM and MedSAM). The outputs were segmented images of the IVD (Figure 4C–E) and VB (Figure 4H–J). Processing (inference) times for SAM and MedSAM were ~7.5 and ~10 s per image, respectively, on a MacBook Pro with an M1 Pro processor and 16 GB of RAM (Apple Inc., Cupertino, CA, USA). The following pseudo-code performs SAM or MedSAM segmentation using either a point or a box input.
Figure 3. Deep learning models used in this study. (A) SAM and MedSAM have components of image encoder, prompt encoder, and a mask decoder that creates the segmented mask. (B) U-Net, basis for nnU-Net, has layers of image encoders and decoders with skip connection to retain high spatial details.
% Load the MRI image and record its size (required by the mask decoder).
I = imread('input image');       % placeholder file name
imageSize = size(I);

% Create the pre-trained model (MATLAB Image Processing Toolbox R2024b).
model = segmentAnythingModel;
% model = medicalSegmentAnythingModel;  % for MedSAM (Medical Imaging Toolbox)

% Compute the image embeddings once; each prompt reuses them.
embeddings = extractEmbeddings(model, I);

% Prompt coordinates saved from ImageJ (values here are illustrative only).
points = [128 96];               % [x y] point prompt within an IVD or VB
backgroundPoints = [16 16];      % [x y] background point
box = [100 80 56 40];            % [x y width height] box around an IVD or VB

% Using the point input (SAM only)
segmentation = segmentObjectsFromEmbeddings(model, embeddings, imageSize, ...
    ForegroundPoints=points, BackgroundPoints=backgroundPoints);

% Using the box input (SAM or MedSAM)
segmentation = segmentObjectsFromEmbeddings(model, embeddings, imageSize, ...
    BoundingBox=box);
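For quick visual inspection (a minimal sketch continuing the pseudo-code above; it assumes the returned segmentation is a logical mask), the prediction can be overlaid on the source image:
% Overlay the predicted mask on the MRI slice for visual quality control.
imshow(labeloverlay(I, segmentation));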
Figure 4. A representative case showing the input image (A), ground truths (B,G), segmentations from DL models of SAM with point prompts (C,H), SAM with box prompts (D,I), MedSAM with box prompts (E,J), and nnU-Net (F,K). Top row (BF) are intervertebral discs and the bottom row (GK) are vertebral bodies.
For the U-Net (Figure 3B), we utilized the state-of-the-art iteration of U-Net called nnU-Net [37]. A U-Net was chosen based on our past experience with it, its robust performance across many applications, and its wide usage since inception. Briefly, nnU-Net (the newest v2, dated 18 April 2024, available at https://github.com/MIC-DKFZ/nnUNet, accessed on 10 April 2025) automatically adapts to a given dataset by analyzing the training cases and configuring a matching U-Net pipeline. We used a 70/15/15 split for training/validation/testing, equal to 460/98/98 images. For preprocessing, the images were converted to a compressed NIfTI format, and the labels were rescaled to intensity values of 0, 1, and 2 for the background, discs, and vertebral bodies, respectively. Training was performed on a Windows 10 PC with an Intel Core i9-9900K CPU, 32 GB of RAM, and an NVIDIA RTX 2080 GPU with 8 GB of VRAM, and took ~2 days. The automatically configured model was as follows: a 2D U-Net with 6 stages of progressively increasing feature channels (32, 64, 128, 256, 512, 512), each stage using two convolutional layers with 3 × 3 kernels, with strides that downsample the spatial dimensions at each level after the first. The decoder mirrors this configuration with symmetric upsampling. All convolution operations use 2D convolutions with instance normalization, Leaky ReLU activation functions, and no dropout. The model was trained with a batch size of 25 using 224 × 256 patches extracted from images normalized with z-score normalization, using a batch Dice loss. For direct comparison against SAM and MedSAM, the entire dataset was inferenced to create segmented images of the IVD and VB, rather than inferencing just the test data. Inference, also performed on the PC, took ~1 s per image.
Segmentation Metrics: DL segmentations from SAM, MedSAM, and nnU-Net were compared against the ground truth segmentation using the following metrics. Dice score [37,38] provided a measure of image (i.e., segmentation mask) overlap, defined in Equation (1):
$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{1}$$
where TP is the number of true positive voxels (i.e., value of 1 in both DL and ground truth segmentations), FP is the number of false positive voxels (value of 1 in DL, value of 0 in ground truth), and FN is the number of false negative voxels (value of 0 in DL, value of 1 in ground truth).
Sensitivity [39] was determined as in Equation (2):
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$
Specificity [39] was determined as in Equation (3):
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}$$
where TN is the number of true negative voxels (value of 0 in both DL and ground truth).
Lastly, the robust Hausdorff distance (RHD) [45] between the DL segmentation and the ground truth was determined from the boundary points. Let $A = \{a_1, \ldots, a_m\}$ and $B = \{b_1, \ldots, b_n\}$ be the boundary point sets from the two segmentations. The RHD is a variation of the classical Hausdorff distance that reduces sensitivity to outliers: it uses the 95th percentile of the minimum distances from each point in one set to the nearest point in the other set. Formally, let $d(a, B) = \min_{b \in B} \lVert a - b \rVert$ be the distance from point $a \in A$ to set $B$. Then the directed robust Hausdorff distance from $A$ to $B$ is defined as follows:
$$\mathrm{RHD}(A, B) = \operatorname{quantile}_{95\%}\left\{\, d(a, B) : a \in A \,\right\}$$
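A minimal MATLAB sketch of these four metrics follows; dlMask and gtMask are assumed variable names for equally sized logical masks of the DL output and ground truth (bwperim, pdist2, and prctile are from the Image Processing and Statistics and Machine Learning Toolboxes):
% Voxel-wise counts from the two binary masks.
TP = nnz(dlMask & gtMask);
FP = nnz(dlMask & ~gtMask);
FN = nnz(~dlMask & gtMask);
TN = nnz(~dlMask & ~gtMask);

dice = 2*TP / (2*TP + FP + FN);   % Equation (1)
sensitivity = TP / (TP + FN);     % Equation (2)
specificity = TN / (TN + FP);     % Equation (3)

% Boundary point sets A and B from the mask perimeters.
[yA, xA] = find(bwperim(dlMask));
[yB, xB] = find(bwperim(gtMask));
A = [xA yA];
B = [xB yB];

% Directed RHD: 95th percentile of the minimum distances from A to B.
rhd = prctile(min(pdist2(A, B), [], 2), 95);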
Statistics: Segmentation metrics of the four DL models (SAM with point input, SAM with box input, MedSAM with box input, and nnU-Net) for segmenting the IVD and VB on the first echo image (TE = 8 ms; proton density-weighted) were compared using either the Friedman statistic (for Dice, sensitivity, and specificity) or repeated measures analysis of variance (rmANOVA; for RHD), with post hoc comparisons. The effect of the echo time of the source image on the Dice score was determined using the Friedman statistic with post hoc comparisons, separately for each DL model and tissue (IVD and VB). The inter-reader agreement of the Dice score was determined via the intraclass correlation coefficient (ICC) for each DL model and tissue. For all tests, the significance level was set at 5%. Statistical analyses were performed using JASP software (version 0.18.3, jasp-stats.org).
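As a hedged illustration of the omnibus test (the study used JASP; this MATLAB equivalent assumes a hypothetical N-by-4 matrix diceIVD, with rows = specimens and columns = the four DL models):
% Friedman test for a model effect on Dice scores; p < 0.05 indicates
% a significant difference among the four DL models.
p = friedman(diceIVD, 1);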

3. Results

Effect of DL Model: Quantitatively, the mean segmentation metrics varied by DL model (Table 1). The mean Dice score for the IVD was the lowest for SAM with point prompts (0.64 ± 0.20, mean ± standard deviation), and higher for SAM with box prompts (0.83 ± 0.12), MedSAM (0.79 ± 0.11), and nnU-Net (0.99 ± 0.22). Dice scores for the VB were comparatively higher: for SAM (point), SAM (box), MedSAM, and nnU-Net, the values were 0.83 ± 0.07, 0.86 ± 0.05, 0.88 ± 0.04, and 0.99 ± 0.02, respectively. Post hoc analysis revealed significant (each p < 0.05) differences for nearly every pairwise comparison (Table 1). Similar trends were found for sensitivity, while specificity was very high (~0.99) across all DL models. RHD values (where a higher value indicates a worse match between boundary points) showed a similar trend to the Dice scores: IVD RHD for SAM (point), SAM (box), MedSAM, and nnU-Net was 7.4 ± 3.6, 4.1 ± 1.3, 4.6 ± 1.3, and 0.5 ± 0.7, respectively. The values were slightly higher for the VB, with a similar trend across DL models. The differences in Dice, sensitivity, and RHD values were attributable to false negative inferences, i.e., incomplete coverage of the target regions.
Observations: Qualitatively, the segmentation results varied and were imperfect in many cases, often grossly over-segmenting the intended region (e.g., Figure 4C,H,J). However, many cases nearly matched the ground truth segmentation (e.g., Figure 4D–F,I,K). A notable observation was that SAM and MedSAM segmentations tended to have smooth, rounded corners, while nnU-Net maintained sharp features better. Between the point and box prompts used for SAM, the point prompts tended to result in frequent over-segmentation (Figure 4C,H), while the box prompts appeared to mitigate this problem and better isolate the IVD and VB regions. Between SAM and MedSAM (both with box prompts), the segmentations appeared similar, both performing better than SAM with point prompts. Overall, nnU-Net performed the best, with the fewest instances of over-segmentation. Nonetheless, there were a few exceptions in which SAM with point prompts produced the best outcome (Figure 5).
Effect of TE: There was a significant effect of TE on the Dice scores (Figure 6 and Table 2 show the p-values), but the manifestation varied by DL model. Figure 6A shows an example of IVD segmentation performed with SAM using point prompts at varying TEs. For SAM using point prompts (Figure 6B,C, yellow circles), the Dice score generally increased with TE, being the lowest at TE1 and TE2. This effect was more obvious for the IVD (Figure 6B, yellow circles; p < 0.001) than for the VB (Figure 6C, yellow circles; p = 0.008). For SAM using box prompts, the Dice score decreased slightly with TE for the IVD (Figure 6B, blue circles; p < 0.001) but increased slightly with TE for the VB (Figure 6C, blue circles; p = 0.01). For MedSAM, no change in Dice score with TE was found for the IVD (Figure 6B, pink circles; p = 0.27), but an increasing Dice score with TE was found for the VB (Figure 6C, pink circles; p < 0.001). For nnU-Net, the Dice score averaged above 0.98 across all TEs; while it varied with TE for both the IVD (Figure 6B, green circles; p < 0.001) and the VB (Figure 6C, green circles; p < 0.001), no notable trends were found.
Inter-reader Agreement: Figure 7 shows the intraclass correlation coefficients (ICC) of the Dice scores of the two readers, which were good to excellent (0.757 to 0.98) across the different DL models and tissues. SAM with point prompts (Figure 7A,D) had inconsistent ICCs depending on the tissue (excellent for the IVD at 0.926 and good for the VB at 0.757). SAM with box prompts (Figure 7B,E) and MedSAM (Figure 7C,F) had consistently high ICC values > 0.8, with many data points along the identity line, where the measurements agree perfectly.

4. Discussion

These results demonstrate the feasibility of “zero-shot” DL models for segmenting lumbar spine MRI, with performance compared against nnU-Net. For the IVD, SAM with point prompts performed significantly worse (mean Dice = 0.64), while SAM with box prompts (Dice = 0.83) performed slightly better than MedSAM (Dice = 0.79). nnU-Net performed the best, with Dice scores averaging over 0.98. For the VB, the Dice scores of all models improved to above 0.8, presumably due to a much larger segmented area compared to the IVD. SAM with point prompts still performed the worst, with an average Dice = 0.83, but was not far behind SAM with box prompts (Dice = 0.86), MedSAM (Dice = 0.88), and nnU-Net (Dice = 0.99). The lower performance of SAM with point prompts vs. SAM and MedSAM with box prompts appears attributable largely to the differences in the input: a bounding box appears to hinder (but does not necessarily eliminate) over-segmentation beyond the boundary of the box. Between SAM and MedSAM with box prompts, the performances were similar; SAM performed slightly better for the IVD and MedSAM slightly better for the VB. nnU-Net significantly outperformed the other models.
While there are no specific studies that evaluated segmentation of the IVD and VB using SAM and MedSAM, making a direct comparison difficult, the general trend of better performance with box rather than point prompts is consistent with studies on other types of images. Mazurowski et al. compared point vs. box prompts on 12 different types of medical images (including MRI, CT, ultrasound, and X-ray) and found the lowest mean Dice score (~0.5) for point prompts and a markedly higher score (~0.6) for box prompts [46]. In another study on organ segmentation in CT images [47], a greater average Dice score was found when using box (~0.8) vs. point prompts (~0.5). We also found that nnU-Net greatly outperformed SAM and MedSAM for both the IVD and VB, a finding that agrees with past reports. Although highly variable by modality and task, He et al. [48] reported Dice scores on the order of 0.8 to 0.9 for U-Net, compared to SAM (with point prompts), whose Dice scores ranged widely from about 0.3 to 0.6. Specifically for the IVD and VB, we have previously reported a Dice score of ~0.86 [27] using a version of U-Net, and more recent studies have reported values of ~0.95 [49,50]. While the acceptable level of Dice score is subjective, scores above 0.7 have been proposed as the lower limit in brain tumor studies [51,52], but for spine applications, Dice ~0.95 might be considered “good” [53].
Reasons for the better performance of nnU-Net over SAM and MedSAM may include the supervised nature of nnU-Net, which is fine-tuned on task-specific images and labels, unlike SAM and MedSAM, which were trained on a broad set of data and meant to be generalizable models without a specific task in mind. Although not addressed in this study, another disadvantage of SAM and MedSAM is that they are inherently 2D, so 3D segmentation must be performed as a series of 2D segmentations, each requiring an input prompt, as sketched below. This not only limits their usefulness but may also degrade their relative performance further when compared to 3D deep learning models.
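A minimal sketch of this slice-wise workflow under stated assumptions (vol is a 3D grayscale volume and boxesPerSlice{k} holds the user-drawn boxes for slice k; neither variable comes from this study):
% Slice-by-slice 2D SAM segmentation of a 3D volume.
model = segmentAnythingModel;
mask3D = false(size(vol));
for k = 1:size(vol, 3)
    I = vol(:, :, k);
    embeddings = extractEmbeddings(model, I);
    for b = 1:size(boxesPerSlice{k}, 1)          % one box prompt per object
        m = segmentObjectsFromEmbeddings(model, embeddings, size(I), ...
            BoundingBox=boxesPerSlice{k}(b, :));
        mask3D(:, :, k) = mask3D(:, :, k) | m;   % accumulate object masks
    end
end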
Several recent models for general bone segmentation have also been reported. Stojšić et al. [54] described a supervised hybrid CNN–transformer model for vertebral body assessment, reporting Dice scores > 0.9 for segmentation. Notable is a model named SegmentAnyBone [55], a variation of SAM that has been fine-tuned for bone segmentation and shown to generalize across different MRI sequences; it reported Dice scores > 0.85, very high for a zero-shot model. Other recent and novel approaches for vertebral body segmentation include deep learning active contours [56,57,58], which avoid fragmented regions in the outcome.
Our finding of similar performance between SAM and MedSAM when using box prompts was somewhat unexpected. In the initial development work by Ma et al. [39], MedSAM, with Dice scores reaching 0.9, was reported to be far superior to SAM, which had a mean Dice score of ~0.7. In an application study involving segmentation of liver cancer on MRI, MedSAM (Dice ~0.68) outperformed SAM (Dice ~0.60) by a small but significant margin in the best-case scenario [59]. Reasons for this discrepancy may include the lack of lumbar MRI training data during MedSAM development, although cervical MRI data with vertebral bodies were included.
While we found that the echo time (TE) of the source image had a significant effect on the segmentation performance of all DL models, the differences in Dice scores were generally small. Except for IVD segmentation using SAM with point prompts, which showed a marked increase in Dice score at longer TEs (Figure 6A,B), the DL models showed less than a 0.03 difference in Dice scores across TEs. This finding supports the use of these models on MRI images acquired with different scan parameters. The poor performance of SAM with point prompts for the IVD at the shortest TE of 8 ms (Dice = 0.64) was surprising at first, since the IVD has the most uniform signal intensity at this TE (Figure 1A) compared to longer TEs (Figure 1D), where the periphery of the IVD (the annulus fibrosus) becomes dark along with the bone marrow within the vertebral bodies. However, SAM with point prompts tended to over-segment (e.g., Figure 4C,H) when an adjacent region had similarly high signal intensity (Figure 4A), and this may have contributed to the poor performance.
We also found generally good-to-excellent inter-reader agreement for the SAM and MedSAM models (Figure 7). For SAM with point prompts (Figure 7A,D), despite its poor segmentation performance, the agreement was excellent for the IVD (ICC = 0.926) and good for the VB (ICC = 0.757). Both SAM with box prompts (Figure 7B,E) and MedSAM (Figure 7C,F) also had good to near-perfect agreement, and many data points fell along the identity line, where the two readers’ measurements agreed exactly.
The integration of robust foundational segmentation models into clinical workflows has the potential to streamline image analysis by automating segmentation of the vertebral bodies and intervertebral discs directly within PACS or radiology workstations. This automation could reduce manual effort (which may take tens of minutes per image [60]), improve reproducibility, and enable rapid extraction of quantitative metrics relevant to spinal diseases. In the context of back pain research, the models’ scalability allows for efficient processing of large imaging datasets, which can facilitate the development of imaging biomarkers and support population-scale studies that link morphological features to clinical outcomes [61].
This study has several limitations. It used MRI images of cadaveric specimens instead of in vivo images. The specimens had been dissected and were missing tissues, including the skin and paraspinal muscles, which may have lowered the performance of the MedSAM model trained on in vivo images. Images were annotated by a single researcher, who is a co-author of the study; potential bias could have been reduced had an external expert radiologist also annotated the images. Nonetheless, the segmentation task in this study, while time-consuming, was not a highly difficult one. The number of specimens was small, at fewer than 100, spanning a large range of ages; with larger training data, the nnU-Net model performance could improve further, although this is irrelevant to the performance of the pretrained zero-shot SAM and MedSAM models. It would nonetheless be important to have both in vivo images and greater numbers for better generalization. This study did not utilize the newest iterations of SAM with improved performance, such as modality-agnostic SAM [62] and granular box prompt SAM, or assistive techniques that convert point prompts to box prompts [63]; these could be considered in the future. For this work, we focused on baseline SAM and MedSAM, which are readily available in a variety of deep learning frameworks but have not been validated for specific anatomies and tasks.

5. Conclusions

In conclusion, this study highlights the behavior and performance of the SAM and MedSAM zero-shot models for segmentation of the IVD and VB in lumbar spine MRI images. Overall, IVD segmentation performed by SAM and MedSAM using box prompts yielded acceptable results, while VB segmentation often had obvious and large errors (that humans would not make) regardless of the prompting method. The nnU-Net model yielded the most consistent results, on par with past studies. For carefully chosen and validated targets, zero-shot models offer the key advantages of not needing task-specific training data and of fast adaptation to other anatomies and tasks. The requirement of user input can be a limitation for automated processing but can also provide flexibility when mixed tasks are performed. Validation of a generalizable segmentation model for lumbar spine MRI would lead to more precise diagnostics, follow-up, enhanced back pain research, and potential cost savings from automated analyses, while supporting the broader use of AI and machine learning in healthcare.

Author Contributions

Conceptualization, W.C.B. and K.L.; methodology, W.C.B., V.M., C.C., H.L., K.L., A.J. and D.H.; software, W.C.B., V.M., K.L. and A.J.; validation, W.C.B., V.M., C.C., H.L., C.P., S.Y. and D.H.; formal analysis, W.C.B., C.C., H.L., C.P. and S.Y.; investigation, W.C.B., V.M., C.C., H.L., C.P., S.Y. and D.H.; resources, W.C.B.; data curation, W.C.B., C.C., H.L., C.P. and S.Y.; writing—original draft preparation, W.C.B. and D.H.; writing—review and editing, All authors; visualization, W.C.B.; supervision, W.C.B.; project administration, W.C.B.; funding acquisition, W.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

Research reported in this publication was supported in part by research grant R01 AR066622 from the National Institutes of Health to Dr. Bae.

Institutional Review Board Statement

The study utilized cadaveric specimens and was exempt from Institutional Review Board approval.

Informed Consent Statement

Informed consent was not required for this study.

Data Availability Statement

The data that support the findings of this study are not publicly available due to their sensitive nature. Anonymized data may be made available by the corresponding author upon review of a request. Data are located in controlled-access data storage at the corresponding author’s institution.

Acknowledgments

The authors thank Graeme Bydder for image acquisition and optimization, and Koichi Masuda for specimen acquisition. This work is a culmination of a summer 2024 research program created by Bae entitled “AI in MRI Research”, in which high school students (including Noah Chang and Ethan Kang, along with the first four co-authors) learned about diseases of the lumbar spine, magnetic resonance imaging (MRI), and image processing through lectures, e-learning, and hands-on curation of MRI data.

Conflicts of Interest

Dr. Bae received research funding from Canon Medical Systems, USA, for unrelated work. Kaustubh Lall is an employee of ResMed Inc., and Armin Jamshidi is an employee of Radicle Science. However, the contents of this paper are solely the responsibility of the authors and do not necessarily represent the official views of the authors’ affiliated entities.

References

  1. Freburger, J.K.; Holmes, G.M.; Agans, R.P.; Jackman, A.M.; Darter, J.D.; Wallace, A.S.; Castel, L.D.; Kalsbeek, W.D.; Carey, T.S. The rising prevalence of chronic low back pain. Arch. Intern. Med. 2009, 169, 251–258. [Google Scholar] [CrossRef]
  2. An, H.S.; Anderson, P.A.; Haughton, V.M.; Iatridis, J.C.; Kang, J.D.; Lotz, J.C.; Natarajan, R.N.; Oegema, T.R., Jr.; Roughley, P.; Setton, L.A.; et al. Introduction: Disc degeneration: Summary. Spine 2004, 29, 2677–2678. [Google Scholar] [CrossRef]
  3. Abu-Ghanem, S.; Ohana, N.; Abu-Ghanem, Y.; Kittani, M.; Shelef, I. Acute schmorl node in dorsal spine: An unusual cause of a sudden onset of severe back pain in a young female. Asian Spine J. 2013, 7, 131–135. [Google Scholar] [CrossRef]
  4. Natalia, F.; Sudirman, S.; Ruslim, D.; Al-Kafri, A. Lumbar spine MRI annotation with intervertebral disc height and Pfirrmann grade predictions. PLoS ONE 2024, 19, e0302067. [Google Scholar] [CrossRef] [PubMed]
  5. Lund, T.; Schlenzka, D.; Lohman, M.; Ristolainen, L.; Kautiainen, H.; Klemetti, E.; Osterman, K. The intervertebral disc during growth: Signal intensity changes on magnetic resonance imaging and their relevance to low back pain. PLoS ONE 2022, 17, e0275315. [Google Scholar] [CrossRef] [PubMed]
  6. Takashima, H.; Yoshimoto, M.; Ogon, I.; Takebayashi, T.; Imamura, R.; Akatsuka, Y.; Yamashita, T. T1rho, T2, and T2* relaxation time based on grading of intervertebral disc degeneration. Acta Radiol. 2023, 64, 1116–1121. [Google Scholar] [CrossRef]
  7. Yeung, K.H.; Man, G.C.W.; Deng, M.; Lam, T.P.; Cheng, J.C.Y.; Chan, K.C.; Chu, W.C.W. Morphological changes of Intervertebral Disc detectable by T2-weighted MRI and its correlation with curve severity in Adolescent Idiopathic Scoliosis. BMC Musculoskelet. Disord. 2022, 23, 655. [Google Scholar] [CrossRef]
  8. Kamei, N.; Nakamae, T.; Nakanishi, K.; Tamura, T.; Tsuchikawa, Y.; Morisako, T.; Harada, T.; Maruyama, T.; Adachi, N. Evaluation of intervertebral disc degeneration using T2 signal ratio on magnetic resonance imaging. Eur. J. Radiol. 2022, 152, 110358. [Google Scholar] [CrossRef] [PubMed]
  9. Yeung, K.H.; Man, G.; Hung, A.; Lam, T.P.; Cheng, J.; Chu, W. Morphological changes of intervertebral disc in relation with curve severity of patients with Adolescent Idiopathic Scoliosis—A T2-weighted MRI study. Stud. Health Technol. Inform. 2021, 280, 37–39. [Google Scholar] [CrossRef]
  10. Belavy, D.L.; Owen, P.J.; Armbrecht, G.; Bansmann, M.; Zange, J.; Ling, Y.; Pohle-Frohlich, R.; Felsenberg, D. Quantitative assessment of the lumbar intervertebral disc via T2 shows excellent long-term reliability. PLoS ONE 2021, 16, e0249855. [Google Scholar] [CrossRef]
  11. Haughton, V. Imaging intervertebral disc degeneration. J. Bone Jt. Surg. Am. 2006, 88 (Suppl. 2), 15–20. [Google Scholar] [PubMed]
  12. Pfirrmann, C.W.; Metzdorf, A.; Zanetti, M.; Hodler, J.; Boos, N. Magnetic resonance classification of lumbar intervertebral disc degeneration. Spine 2001, 26, 1873–1878. [Google Scholar] [CrossRef]
  13. Joe, E.; Lee, J.W.; Park, K.W.; Yeom, J.S.; Lee, E.; Lee, G.Y.; Kang, H.S. Herniation of cartilaginous endplates in the lumbar spine: MRI findings. AJR Am. J. Roentgenol. 2015, 204, 1075–1081. [Google Scholar] [CrossRef]
  14. Marinelli, N.L.; Haughton, V.M.; Anderson, P.A. T2 relaxation times correlated with stage of lumbar intervertebral disk degeneration and patient age. AJNR Am. J. Neuroradiol. 2010, 31, 1278–1282. [Google Scholar] [CrossRef]
  15. Trattnig, S.; Stelzeneder, D.; Goed, S.; Reissegger, M.; Mamisch, T.C.; Paternostro-Sluga, T.; Weber, M.; Szomolanyi, P.; Welsch, G.H. Lumbar intervertebral disc abnormalities: Comparison of quantitative T2 mapping with conventional MR at 3.0 T. Eur. Radiol. 2010, 20, 2715–2722. [Google Scholar] [CrossRef]
  16. Pfirrmann, C.W.; Metzdorf, A.; Elfering, A.; Hodler, J.; Boos, N. Effect of aging and degeneration on disc volume and shape: A quantitative study in asymptomatic volunteers. J. Orthop. Res. 2006, 24, 1086–1094. [Google Scholar] [CrossRef] [PubMed]
  17. Klein, J.A.; Hickey, D.S.; Hukins, D.W. Radial bulging of the annulus fibrosus during compression of the intervertebral disc. J. Biomech. 1983, 16, 211–217. [Google Scholar] [CrossRef]
  18. Castoldi, N.M.; O’Rourke, D.; Antico, M.; Sansalone, V.; Gregory, L.; Pivonka, P. Assessment of age-dependent sexual dimorphism in paediatric vertebral size and density using a statistical shape and statistical appearance modelling approach. Bone 2024, 189, 117251. [Google Scholar] [CrossRef] [PubMed]
  19. Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Sys. Man. Cyber. 1979, 9, 62–66. [Google Scholar] [CrossRef]
  20. Schneiderman, H.; Kanade, T. A histogram-based method for detection of faces and cars. In Proceedings of the 2000 International Conference on Image Processing (Cat. No.00CH37101), Vancouver, BC, Canada, 10–13 September 2000; Volume 503, pp. 504–507. [Google Scholar]
  21. Zhu, S.C.; Lee, T.S.; Yuille, A.L. Region competition: Unifying snakes, region growing, energy/Bayes/MDL for multi-band image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Cambridge, MA, USA, 20–23 June 1995; pp. 416–423. [Google Scholar]
  22. Pohle, R.; Toennies, K. Segmentation of Medical Images Using Adaptive Region Growing; SPIE: Pune, MA, USA, 2001; Volume 4322. [Google Scholar]
  23. Biratu, E.S.; Schwenker, F.; Debelee, T.G.; Kebede, S.R.; Negera, W.G.; Molla, H.T. Enhanced Region Growing for Brain Tumor MR Image Segmentation. J. Imaging 2021, 7, 22. [Google Scholar] [CrossRef]
  24. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  25. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  26. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv 2018, arXiv:1802.02611. [Google Scholar]
  27. Kim, S.; Bae, W.C.; Masuda, K.; Chung, C.B.; Hwang, D. Fine-Grain Segmentation of the Intervertebral Discs from MR Spine Images Using Deep Convolutional Neural Networks: BSU-Net. Appl. Sci. 2018, 8, 1656. [Google Scholar] [CrossRef]
  28. Baur, D.; Bieck, R.; Berger, J.; Schofer, P.; Stelzner, T.; Neumann, J.; Neumuth, T.; Heyde, C.E.; Voelker, A. Automated Three-Dimensional Imaging and Pfirrmann Classification of Intervertebral Disc Using a Graphical Neural Network in Sagittal Magnetic Resonance Imaging of the Lumbar Spine. J. Imaging Inform. Med. 2024, 38, 979–987. [Google Scholar] [CrossRef] [PubMed]
  29. Wang, M.; Su, Z.; Liu, Z.; Chen, T.; Cui, Z.; Li, S.; Pang, S.; Lu, H. Deep Learning-Based Automated Magnetic Resonance Image Segmentation of the Lumbar Structure and Its Adjacent Structures at the L4/5 Level. Bioengineering 2023, 10, 963. [Google Scholar] [CrossRef]
  30. Niemeyer, F.; Zanker, A.; Jonas, R.; Tao, Y.; Galbusera, F.; Wilke, H.J. An externally validated deep learning model for the accurate segmentation of the lumbar paravertebral muscles. Eur. Spine J. 2022, 31, 2156–2164. [Google Scholar] [CrossRef]
  31. Hess, M.; Allaire, B.; Gao, K.T.; Tibrewala, R.; Inamdar, G.; Bharadwaj, U.; Chin, C.; Pedoia, V.; Bouxsein, M.; Anderson, D.; et al. Deep Learning for Multi-Tissue Segmentation and Fully Automatic Personalized Biomechanical Models from BACPAC Clinical Lumbar Spine MRI. Pain. Med. 2023, 24, S139–S148. [Google Scholar] [CrossRef] [PubMed]
  32. Carl, M.; Lall, K.; Pai, D.; Chang, E.Y.; Statum, S.; Brau, A.; Chung, C.B.; Fung, M.; Bae, W.C. Shoulder bone segmentation with DeepLab and U-Net. Osteology 2024, 4, 98–110. [Google Scholar] [CrossRef]
  33. Huang, J.; Shen, H.; Wu, J.; Hu, X.; Zhu, Z.; Lv, X.; Liu, Y.; Wang, Y. Spine Explorer: A deep learning based fully automated program for efficient and reliable quantifications of the vertebrae and discs on sagittal lumbar spine MR images. Spine J. 2020, 20, 590–599. [Google Scholar] [CrossRef]
  34. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 2020, 121, 74–87. [Google Scholar] [CrossRef]
  35. Cheng, Y.K.; Lin, C.L.; Huang, Y.C.; Lin, G.S.; Lian, Z.Y.; Chuang, C.H. Accurate Intervertebral Disc Segmentation Approach Based on Deep Learning. Diagnostics 2024, 14, 191. [Google Scholar] [CrossRef]
  36. Zunair, H.; Ben Hamza, A. Sharp U-Net: Depthwise convolutional network for biomedical image segmentation. Comput. Biol. Med. 2021, 136, 104699. [Google Scholar] [CrossRef] [PubMed]
  37. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
  38. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
  39. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef] [PubMed]
  40. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  41. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  42. Finkenstaedt, T.; Siriwananrangsun, P.; Masuda, K.; Bydder, G.M.; Chen, K.C.; Bae, W.C. Ultrashort time-to-echo MR morphology of cartilaginous endplate correlates with disc degeneration in the lumbar spine. Eur. Spine J. 2023, 32, 2358–2367. [Google Scholar] [CrossRef]
  43. Sollmann, N.; Fields, A.J.; O’Neill, C.; Nardo, L.; Majumdar, S.; Chin, C.T.; Tosun, D.; Han, M.; Vu, A.T.; Ozhinsky, E.; et al. Magnetic Resonance Imaging of the Lumbar Spine: Recommendations for Acquisition and Image Evaluation from the BACPAC Spine Imaging Working Group. Pain. Med. 2023, 24, S81–S94. [Google Scholar] [CrossRef]
  44. Schneider, C.A.; Rasband, W.S.; Eliceiri, K.W. NIH Image to ImageJ: 25 years of image analysis. Nat. Methods 2012, 9, 671–675. [Google Scholar] [CrossRef] [PubMed]
  45. Moreno, R.; Koppal, S.; de Muinck, E. Robust estimation of distance between sets of points. Pattern Recognit. Lett. 2013, 34, 2192–2198. [Google Scholar] [CrossRef]
  46. Mazurowski, M.A.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment anything model for medical image analysis: An experimental study. Med. Image Anal. 2023, 89, 102918. [Google Scholar] [CrossRef]
  47. Roy, S.; Wald, T.; Koehler, G.; Rokuss, M.R.; Disch, N.; Holzschuh, J.; Zimmerer, D.; Maier-Hein, K.H. SAM.MD: Zero-shot medical image segmentation capabilities of the segment anything model. arXiv 2023, arXiv:2304.05396. [Google Scholar]
  48. He, S.; Bao, R.; Li, J.; Stout, J.; Bjornerud, A.; Grant, P.E.; Ou, Y. Computer-vision benchmark segment-anything model (sam) in medical images: Accuracy in 12 datasets. arXiv 2023, arXiv:2304.09324. [Google Scholar]
  49. Soydan, Z.; Bayramoglu, E.; Karasu, R.; Sayin, I.; Salturk, S.; Uvet, H. An Automatized Deep Segmentation and Classification Model for Lumbar Disk Degeneration and Clarification of Its Impact on Clinical Decisions. Glob. Spine J. 2023, 15, 554–563. [Google Scholar] [CrossRef] [PubMed]
  50. Nazir, A.; Cheema, M.N.; Sheng, B.; Li, P.; Li, H.; Xue, G.; Qin, J.; Kim, J.; Feng, D.D. ECSU-Net: An Embedded Clustering Sliced U-Net Coupled With Fusing Strategy for Efficient Intervertebral Disc Segmentation and Classification. IEEE Trans. Image Process 2022, 31, 880–893. [Google Scholar] [CrossRef]
  51. Zijdenbos, A.P.; Dawant, B.M.; Margolin, R.A.; Palmer, A.C. Morphometric analysis of white matter lesions in MR images: Method and validation. IEEE Trans. Med. Imaging 1994, 13, 716–724. [Google Scholar] [CrossRef]
  52. Boehringer, A.S.; Sanaat, A.; Arabi, H.; Zaidi, H. An active learning approach to train a deep learning algorithm for tumor segmentation from brain MR images. Insights Imaging 2023, 14, 141. [Google Scholar] [CrossRef] [PubMed]
  53. Ramos, J.S.; Cazzolato, M.T.; Linares, O.C.; Maciel, J.G.; Menezes-Reis, R.; Azevedo-Marques, P.M.; Nogueira-Barbosa, M.H.; Traina Junior, C.; Traina, A.J.M. Fast and accurate 3-D spine MRI segmentation using FastCleverSeg. Magn. Reson. Imaging 2024, 109, 134–146. [Google Scholar] [CrossRef]
  54. Stojšić, K.; Miletić Rigo, D.; Jurković, S. Automated Vertebral Bone Quality Determination from T1-Weighted Lumbar Spine MRI Data Using a Hybrid Convolutional Neural Network–Transformer Neural Network. Appl. Sci. 2024, 14, 10343. [Google Scholar] [CrossRef]
  55. Gu, H.; Colglazier, R.; Dong, H.; Zhang, J.; Chen, Y.; Yildiz, Z.; Chen, Y.; Li, L.; Yang, J.; Willhite, J.; et al. SegmentAnyBone: A universal model that segments any bone at any location on MRI. Med. Image Anal. 2025, 101, 103469. [Google Scholar] [CrossRef]
  56. Zhao, S.; Wang, J.; Wang, X.; Wang, Y.; Zheng, H.; Chen, B.; Zeng, A.; Wei, F.; Al-Kindi, S.; Li, S. Attractive deep morphology-aware active contour network for vertebral body contour extraction with extensions to heterogeneous and semi-supervised scenarios. Med. Image Anal. 2023, 89, 102906. [Google Scholar] [CrossRef]
  57. Tang, Z.; Chen, B.; Zeng, A.; Liu, M.; Li, S. Progressive Deep Snake for Instance Boundary Extraction in Medical Images. Expert Syst. Appl. 2024, 249, 123590. [Google Scholar] [CrossRef]
  58. Qian, L.; Wang, Y.; Zhang, H.; Li, Y.; Li, S. A Sequential Geometry Reconstruction-Based Deep Learning Approach to Improve Accuracy and Consistency of Lumbar Spine MRI Image Segmentation. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4. [Google Scholar]
  59. Saha, A.; van der Pol, C.B. Liver Observation Segmentation on Contrast-Enhanced MRI: SAM and MedSAM Performance in Patients With Probable or Definite Hepatocellular Carcinoma. Can. Assoc. Radiol. J. 2024, 75, 771–779. [Google Scholar] [CrossRef]
  60. Khalil, Y.A.; Becherucci, E.A.; Kirschke, J.S.; Karampinos, D.C.; Breeuwer, M.; Baum, T.; Sollmann, N. Multi-scanner and multi-modal lumbar vertebral body and intervertebral disc segmentation database. Sci. Data 2022, 9, 97. [Google Scholar] [CrossRef] [PubMed]
  61. van den Heuvel, M.M.; Oei, E.H.G.; Renkens, J.J.M.; Bierma-Zeinstra, S.M.A.; van Middelkoop, M. Structural spinal abnormalities on MRI and associations with weight status in a general pediatric population. Spine J. 2021, 21, 465–476. [Google Scholar] [CrossRef]
  62. Chen, C.; Miao, J.; Wu, D.; Zhong, A.; Yan, Z.; Kim, S.; Hu, J.; Liu, Z.; Sun, L.; Li, X.; et al. MA-SAM: Modality-agnostic SAM adaptation for 3D medical image segmentation. Med. Image Anal. 2024, 98, 103310. [Google Scholar] [CrossRef] [PubMed]
  63. Liu, X.; Woo, J.; Ma, C.; Ouyang, J.; Fakhri, G.E. Point-supervised Brain Tumor Segmentation with Box-prompted MedSAM. arXiv 2024, arXiv:2408.00706v1. [Google Scholar]
Figure 1. Representative sagittal MR images of a lumbar spine acquired at varying echo times (TE) of (A) 8 ms, (B) 24 ms, (C) 40 ms, and (D) 64 ms.
Figure 2. Input prompts needed to perform SAM and MedSAM segmentations. SAM utilized either (A) a point prompt, where a point or several points can be placed within the region of interest, or (B) a bounding box prompt. MedSAM required a bounding box prompt. Ground truth images for (C) IVD and (D) VB segmentation.
Figure 5. An atypical case where SAM with point prompts had the best outcome. Input image (A), ground truths (B,G), segmentations from DL models of SAM with point prompts (C,H), SAM with box prompts (D,I), MedSAM with box prompts (E,J), and nnU-Net (F,K). Top row (BF) are intervertebral discs and the bottom row (GK) are vertebral bodies.
Figure 6. Effect of echo time (TE) on segmentation performance. (A) An example of IVD segmentation using SAM with point prompts when source images with different TEs were used. Dice scores for (B) IVD and (C) VB when different DL models and source-image TEs were used.
Figure 7. Plots showing inter-reader agreement of Dice scores using intraclass correlation coefficient (ICC) for different tissues and DL models. Correlation plots for IVD are shown on top (AC) and the plots for VB are shown on the bottom (DF), for the images processed with SAM with point prompts (A,D), SAM with box prompts (B,E), and MedSAM with box prompts (C,F).
Table 1. Segmentation metrics (mean ± standard deviation, SD) of the Dice score, sensitivity, specificity, and robust Hausdorff distance (RHD) for segmenting intervertebral discs and vertebral bodies using SAM with point prompts, SAM with box prompts, MedSAM with box prompts, and nnU-Net, followed by p-values for the effect of DL model.
Intervertebral Disc:

| Parameter | Dice | Sensitivity | Specificity | RHD |
|---|---|---|---|---|
| SAM (Pt), mean ± SD | 0.64 ± 0.20 | 0.56 ± 0.28 | 0.995 ± 0.003 | 7.4 ± 3.6 |
| SAM (box) | 0.83 ± 0.12 | 0.78 ± 0.14 | 0.995 ± 0.006 | 4.1 ± 1.3 |
| MedSAM (box) | 0.79 ± 0.11 | 0.75 ± 0.14 | 0.993 ± 0.005 | 4.6 ± 1.3 |
| nnU-Net | 0.99 ± 0.22 | 0.99 ± 0.03 | 0.999 ± 0.001 | 0.5 ± 0.7 |

Vertebral Body:

| Parameter | Dice | Sensitivity | Specificity | RHD |
|---|---|---|---|---|
| SAM (Pt), mean ± SD | 0.83 ± 0.07 | 0.77 ± 0.10 | 0.988 ± 0.014 | 7.1 ± 1.5 |
| SAM (box) | 0.86 ± 0.05 | 0.80 ± 0.07 | 0.989 ± 0.009 | 6.6 ± 1.3 |
| MedSAM (box) | 0.88 ± 0.04 | 0.85 ± 0.06 | 0.986 ± 0.008 | 5.8 ± 1.0 |
| nnU-Net | 0.99 ± 0.01 | 0.99 ± 0.02 | 0.999 ± 0.002 | 0.6 ± 0.9 |

Effect of DL model: Friedman (Dice, sensitivity, specificity) or rmANOVA (RHD) p-values, with post hoc comparisons using Bonferroni correction:

| Comparison | IVD Dice | IVD Sens. | IVD Spec. | IVD RHD | VB Dice | VB Sens. | VB Spec. | VB RHD |
|---|---|---|---|---|---|---|---|---|
| Overall | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SAM (Pt) vs. SAM (box) | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SAM (Pt) vs. MedSAM | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SAM (Pt) vs. nnU-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| SAM (box) vs. MedSAM | <0.001 | <0.001 | <0.001 | <0.001 | 0.097 | <0.05 | 0.127 | <0.05 |
| SAM (box) vs. nnU-Net | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
| MedSAM vs. nnU-Net | <0.001 | 0.104 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 | <0.001 |
Table 2. Friedman p-values showing the effect of TE on the Dice scores from each DL model, complementing Figure 6B,C.
| DL Model | IVD Dice | VB Dice |
|---|---|---|
| SAM (Pt) | <0.001 | <0.001 |
| SAM (box) | <0.001 | <0.001 |
| MedSAM (box) | <0.05 | <0.001 |
| nnU-Net | <0.001 | <0.001 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
