Article

Transformer-Based Deep Learning for Multiplanar Cervical Spine MRI Interpretation: Comparison with Spine Surgeons and Radiologists

by Aric Lee 1,†, Junran Wu 2,†, Changshuo Liu 2, Andrew Makmur 1,3, Yong Han Ting 1,3, You Jun Lee 1, Wilson Ong 1, Tricia Kuah 1, Juncheng Huang 1, Shuliang Ge 1, Alex Quok An Teo 4, Joey Chan Yiing Beh 5, Desmond Shi Wei Lim 1, Xi Zhen Low 1, Ee Chin Teo 1, Qai Ven Yap 6, Shuxun Lin 7, Jonathan Jiong Hao Tan 4, Naresh Kumar 4, Beng Chin Ooi 2, Swee Tian Quek 1,3 and James Thomas Patrick Decourcy Hallinan 1,3,*
1 Department of Diagnostic Imaging, National University Hospital, Singapore 119228, Singapore
2 Department of Computer Science, School of Computing, National University of Singapore, Singapore 117417, Singapore
3 Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 119074, Singapore
4 National University Spine Institute, Department of Orthopaedic Surgery, National University Health System, Singapore 119228, Singapore
5 Department of Radiology, Ng Teng Fong General Hospital, Singapore 609606, Singapore
6 Biostatistics Unit, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 117597, Singapore
7 Division of Spine Surgery, Department of Orthopaedic Surgery, Ng Teng Fong General Hospital, Singapore 609606, Singapore
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
AI 2025, 6(12), 308; https://doi.org/10.3390/ai6120308
Submission received: 24 September 2025 / Revised: 17 November 2025 / Accepted: 18 November 2025 / Published: 27 November 2025

Abstract

Background: Degenerative cervical spondylosis (DCS) is a common and potentially debilitating condition, with surgery indicated in selected patients. Deep learning models (DLMs) can improve consistency in grading DCS neural stenosis on magnetic resonance imaging (MRI), though existing models focus on axial images, and comparisons are mostly limited to radiologists. Methods: We developed an enhanced transformer-based DLM that trains on sagittal images and optimizes axial and foraminal classification using a maximized dataset. DLM training utilized 648 scans, with internal testing on 75 scans and external testing on an independent 75-scan dataset. Performance of the DLM, spine surgeons, and radiologists of varying subspecialties/seniority was compared against a consensus reference standard. Results: On internal testing, the DLM achieved high agreement for all-class classification: axial spinal canal κ = 0.80 (95%CI: 0.72–0.82), sagittal spinal canal κ = 0.83 (95%CI: 0.81–0.85), and neural foramina κ = 0.81 (95%CI: 0.77–0.84). In comparison, human readers demonstrated lower levels of agreement (κ = 0.60–0.80). External testing showed modestly degraded model performance (κ = 0.68–0.77). Conclusions: These results demonstrate the utility of transformer-based DLMs in multiplanar MRI interpretation, surpassing spine surgeons and radiologists on internal testing and highlighting their potential for real-world clinical adoption.

1. Introduction

Deep learning models (DLMs) for degenerative cervical spondylosis (DCS) have the potential to improve consistency in image interpretation and impact patient outcomes. DCS is a common, debilitating condition with rising prevalence in aging populations worldwide. While many patients with DCS can be managed conservatively, surgery may be indicated for certain groups, such as patients with progressive neurologic symptoms or failed conservative management [1]. Magnetic resonance imaging (MRI) permits visualization of neural structures to quantify stenosis and evaluate for cord signal abnormality, helping to guide management and surgical planning [2,3]. However, conventional evaluation on MRI has limitations: radiologist interpretation can be inconsistent, and there is significant overlap in findings between asymptomatic and symptomatic patients [4].
Recent advances in deep learning have augmented medical imaging through automated image classification and diagnosis across various modalities, potentially leading to more reliable assessment [5]. Several DLMs have been developed for cervical spine MRI. For example, Lee et al. trained a transformer-based DLM that outperformed radiologists at diagnosing spinal canal and neural foraminal stenosis, assessed at each disc level [6,7]. While such prior models showed promise, they were limited to axial images and achieved relatively lower inter-observer agreement than earlier lumbar spine models [8,9]. Model performance was also only compared against radiologists, limiting generalizability, since spine surgeons typically review imaging before planning further management. In addition, although some previous cervical spine MRI models have used sagittal sequences [10,11], they lacked per-level assessment, a key requirement for clinical decision-making.
In this study, we developed an enhanced DLM that expands beyond axial-only models by incorporating sagittal spinal canal training and optimizing axial and foraminal classification. Its performance was then benchmarked against both spine surgeons and radiologists. We hypothesized that this multiplanar design would achieve superior performance to human readers across clinical disciplines, demonstrating broader clinical relevance than existing models.

2. Materials and Methods

We retrospectively extracted all cervical spine MRI scans of adults (≥18 years old) performed from July to August 2023 at National University Hospital, Singapore, excluding scans with prior cervical instrumentation, poor image quality, or confounding pathology (for example, tumors or spondylodiscitis). MRIs were performed on Siemens and GE 1.5 and 3.0 Tesla magnets, with sagittal T2 spin echo and axial T2* sequences extracted. This formed the training and internal test datasets. For external testing, an additional 75 MRIs performed between November 2015 and December 2020 at Ng Teng Fong General Hospital, Singapore, were randomly extracted using similar criteria. MRI scan parameters are presented in Supplementary Table S1.

2.1. Model Training

We optimized our pre-existing axial-only DLM with further training data [7], and developed a new, complementary sagittal spinal canal classifier. Of the training data, 10% of cases were set aside for validation. Building upon recent advances in transformer-based approaches for object detection [13], we further improved training efficiency and detection performance via collaborative hybrid assignments [12]. These assignments enable the model to leverage multiple matching strategies during training. The model directly detects the spinal canal and neural foramina from embedded image features. By utilizing attention mechanisms, the network selectively focuses on clinically relevant regions, improving the accuracy of region of interest (ROI) identification.
For feature extraction, a convolutional neural network backbone, ResNet50 [14], was selected from the DETR Model Zoo. ResNet is specifically designed to support very deep architectures through skip connections, which allow information to bypass intermediate layers. This design alleviates the vanishing gradient problem, ensuring stable learning as depth increases. Owing to these advantages, ResNet performs reliably in medical imaging tasks, maintaining strong accuracy even with complex data. Once features are obtained, the hidden feature map is flattened and enriched with positional encoding before being passed into a transformer encoder. The encoder then processes the data through stacked layers, each containing multi-head self-attention to assess relationships across features and a feed-forward network (FFN) to refine the representations [15]. The FFN in this implementation comprises a single linear layer for multi-class classification, and three fully connected layers ending with sigmoid activation for bounding box regression. The subsequent decoder integrates the encoder output with object queries, positional encodings, and memory to produce final predictions, comprising stenosis grades along with bounding boxes that delineate ROIs. Six encoder and six decoder layers were utilized. Through iterative refinement, the model progressively emphasizes critical features and improves prediction accuracy.
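To make this architecture concrete, the following is a minimal PyTorch sketch of a DETR-style detector matching the description above: a ResNet50 backbone, a transformer with six encoder and six decoder layers, a single linear layer for grade classification, and a three-layer box-regression head with sigmoid output. It is an illustrative reconstruction rather than the study's code; the hidden dimension, attention head count, query count, and omission of positional encodings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class CervicalSpineDETRSketch(nn.Module):
    """Illustrative DETR-style detector per the description above (not the study's code)."""

    def __init__(self, num_grades=4, num_queries=100, hidden_dim=256):
        super().__init__()
        # ResNet50 backbone for feature extraction; drop the pooling and fc head.
        resnet = torchvision.models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        # Six encoder and six decoder layers, each with multi-head self-attention + FFN.
        self.transformer = nn.Transformer(d_model=hidden_dim, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Embedding(num_queries, hidden_dim)  # learned object queries
        # Single linear layer for multi-class grading (+1 for the "no object" class).
        self.class_head = nn.Linear(hidden_dim, num_grades + 1)
        # Three fully connected layers ending in sigmoid for normalized (cx, cy, w, h) boxes.
        self.bbox_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 4))

    def forward(self, images):  # (B, 3, H, W); grayscale slices replicated to 3 channels
        feats = self.input_proj(self.backbone(images))   # (B, hidden_dim, h, w)
        b = feats.shape[0]
        # Flatten the hidden feature map into a sequence; the positional encodings
        # that DETR adds at this point are omitted for brevity.
        src = feats.flatten(2).permute(2, 0, 1)          # (h*w, B, hidden_dim)
        tgt = self.query_embed.weight.unsqueeze(1).repeat(1, b, 1)
        hs = self.transformer(src, tgt)                  # (num_queries, B, hidden_dim)
        grade_logits = self.class_head(hs)               # per-query stenosis grades
        boxes = self.bbox_head(hs).sigmoid()             # ROI boxes in [0, 1]
        return grade_logits, boxes
```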
Each prediction generated by the decoder is subsequently evaluated by a classifier to determine whether it corresponds to a valid object (i.e., a spinal structure with an associated bounding box) or a “no object” case. Considering the complexity of DCS assessment, this study employs a collaborative assignment strategy to mitigate the inefficient cross-attention learning in the transformer decoder associated with the conventional one-to-one set matching paradigm. The strategy incorporates auxiliary query generation techniques, including Adaptive Training Sample Selection (ATSS) [16], fully convolutional one-stage object detection (FCOS) [17], and Faster region-based convolutional neural network (R-CNN) [18], to achieve one-to-many label assignments during training. This hybrid assignment accelerates convergence and enhances detection robustness, which is particularly beneficial in medical imaging, where object boundaries are often ambiguous. Model training is guided by a composite weighted loss that combines cross-entropy loss for classification and L1 regression loss for bounding box refinement, aligned with the collaborative assignment scheme to promote stable optimization and accurate detection.
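As a simplified illustration of this composite loss, the sketch below assumes the assignment step has already matched decoder queries to reference labels (unmatched queries receive the “no object” class); the loss weights are placeholders, not the study's values.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_logits, pred_boxes, tgt_classes, tgt_boxes, matched,
                   w_cls=1.0, w_box=5.0):
    """Weighted cross-entropy + L1 box loss over assigned query/target pairs.

    pred_logits: (N, num_grades + 1) logits including the "no object" class
    pred_boxes:  (N, 4) predicted normalized boxes
    tgt_classes: (N,) assigned grades ("no object" index where unmatched)
    tgt_boxes:   (N, 4) assigned reference boxes (ignored where unmatched)
    matched:     (N,) bool mask, True where a query was assigned a real ROI
    """
    loss_cls = F.cross_entropy(pred_logits, tgt_classes)           # classification term
    loss_box = F.l1_loss(pred_boxes[matched], tgt_boxes[matched])  # box refinement term
    return w_cls * loss_cls + w_box * loss_box
```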
Both ResNet50 and the transformer were implemented with PyTorch 1.7.1 and trained on NVIDIA GeForce RTX/GTX GPU devices with CUDA version 10.1. The backbone and transformer were pre-trained on the COCO dataset and used to initialize our object detector model. Figure 1 provides an overview of the model design.
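For reference, COCO-pretrained DETR weights are publicly available and can initialize such a detector; the snippet below shows one plausible way to do this via the facebookresearch/detr torch.hub entry, with the classification head resized for the four stenosis grades. This illustrates the initialization step only and is not necessarily the exact procedure used in this study.

```python
import torch

# COCO-pretrained ResNet50 backbone + transformer from the public DETR repository.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)

# Resize the classification head: 4 stenosis grades + 1 "no object" class.
model.class_embed = torch.nn.Linear(model.class_embed.in_features, 4 + 1)
```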

2.2. Dataset Labelling

The datasets were uploaded onto a commercially available annotation platform (V7 Darwin, London, UK) to facilitate labelling. At each disc level, the severity of spinal canal stenosis was graded on axial and sagittal images, and neural foraminal stenosis was graded on axial images. The Muhle and modified Kim grading systems were used for the spinal canal (4 classes) and the neural foramina (3 classes), respectively [19,20]. Six human readers labelled the internal test set: two spine surgeons (SS1, 6 years post-board certification, and SS2, 3 years), a neuroradiologist (NR1, 4 years post-board certification), a musculoskeletal radiologist (MR1, 4 years post-board certification), and two radiology residents (RR1 and RR2, both in the 4th year of training). Human readers did not grade the external test set. The training/validation dataset was labelled by a senior musculoskeletal radiologist with an interest in spine imaging (MR2, 13 years post-board certification), assisted by a radiology resident (RR3, 4th year of training). Reference standard labelling for both test datasets was performed by two senior musculoskeletal radiologists in consensus (MR2 and MR3, both 13 years post-board certification). No demographic or clinical information was provided to the readers.

2.3. Demographics

A total of 239 cervical spine MRI scans were extracted, with subsequent exclusion of 30 scans for prior instrumentation (7; 23% of excluded scans), other pathology (13; 43%), and technical factors (10; 33%). This was combined with a pre-existing dataset [7]. A total of 648 scans were included in the training/validation dataset, comprising 279 females (43%), mean age 58 ± 13.7 years (range 19–90), further split into 583 scans for training and 65 for validation. The internal test set contained 75 scans, comprising 40 females (53%), mean age 56 ± 14.3 years (range 20–84). The external test set contained 75 scans, comprising 17 females (23%), mean age 60 ± 13.2 years (range 24–95). Figure 2 provides an overview of the MRI scans selected for inclusion and allocation into the various datasets. Table 1 summarizes the demographic information of the study population.

2.4. Statistical Analysis

We used recall to compare the DLM bounding boxes against the reference standard for correct ROI detection, setting the intersection over union (IOU) threshold at 0.5. Gwet’s kappa (with 95% confidence intervals) was chosen to evaluate DLM and reader accuracy against the reference standard for stenosis classification, as it mitigates the bias that anticipated class imbalance (e.g., a predominance of grade 0/1 labels) would introduce into conventional kappa statistics. Sensitivity and specificity were also calculated for dichotomous classification. Statistical analysis was performed (by Q.V.Y.) using Stata version 18 (StataCorp, College Station, TX, USA). A p-value < 0.05 was considered statistically significant.
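For illustration, the sketch below shows how the two headline metrics can be computed: recall of reference-standard ROIs at an IOU threshold of 0.5, and Gwet’s AC1 (the unweighted form of Gwet’s kappa), which corrects for chance agreement while remaining stable under class imbalance. This is a minimal reimplementation for clarity; the study’s analysis itself was performed in Stata.

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def recall_at_iou(pred_boxes, ref_boxes, thr=0.5):
    """Fraction of reference-standard ROIs matched by any prediction at IoU >= thr."""
    hits = sum(any(box_iou(r, p) >= thr for p in pred_boxes) for r in ref_boxes)
    return hits / len(ref_boxes)

def gwet_ac1(rater_a, rater_b, categories):
    """Gwet's AC1: chance-corrected agreement that is robust to class imbalance."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    q = len(categories)
    p_obs = np.mean(a == b)                              # observed agreement
    # Mean marginal proportion of each grade across both raters.
    pi = np.array([((a == c).mean() + (b == c).mean()) / 2 for c in categories])
    p_chance = np.sum(pi * (1 - pi)) / (q - 1)           # chance agreement
    return (p_obs - p_chance) / (1 - p_chance)

# Example: DLM grades vs. reference grades for the 4-class spinal canal task.
# ac1 = gwet_ac1(dlm_grades, reference_grades, categories=[0, 1, 2, 3])
```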

3. Results

3.1. Model Performance

Total training time for the DLM over 200 epochs on three NVIDIA GeForce RTX 2080 Ti GPUs was 40.3 h. The mean inference time was 0.063 s per scan, enabling near real-time evaluation. Loss and class error curves demonstrated stable convergence, with early plateauing of both curves and minimal divergence (Figure 3). This stability across epochs supports reproducible optimization and argues against overfitting.
The model achieved high recall across all ROIs, although recall was relatively lower at the neural foramina (88.8% vs. 95.7–99.4% at the spinal canal on internal testing).

3.2. Internal Testing

The reference standard bounding boxes comprised a higher number of grade 0/1 (normal/mild) labels (Table 2). At the axial spinal canal, there were 1642 (58.7%) grade 0, 757 (27.0%) grade 1, 216 (7.7%) grade 2 and 180 (6.4%) grade 3 labels. Sagittal spinal canal grades possessed a similar distribution with 877 (57.0%) grade 0, 436 (28.3%) grade 1, 142 (9.2%) grade 2 and 83 (5.4%) grade 3 labels. At the neural foramina, there were 627 (62.0%) grade 0, 186 (18.4%) grade 1 and 199 (19.7%) grade 2 labels.
The DLM achieved high levels of agreement across all metrics (Table 3). At the axial spinal canal, this was κ = 0.80 (95%CI: 0.72–0.82) for all-class and κ = 0.95 (95%CI: 0.93–0.96) for dichotomous classification. At the sagittal spinal canal, κ = 0.83 (95%CI: 0.81–0.85) for all-class and κ = 0.95 (95%CI: 0.94–0.96) for dichotomous classification. At the neural foramina, κ = 0.81 (95%CI: 0.77–0.84) for all-class and κ = 0.90 (95%CI: 0.88–0.93) for dichotomous classification. Table 4 summarizes the sensitivity and specificity at each ROI. Representative model predictions are shown in Figure 4.
Readers demonstrated lower concordance than the DLM (detailed in Table 3, which also reports p-values for reader vs. DLM comparisons). At the axial spinal canal, agreement was significantly lower across all comparisons. For all-class gradings, κ values ranged from 0.60 to 0.72, with spine surgeons showing the lowest concordance (κ = 0.60–0.65) and MR1 the highest (κ = 0.72). For dichotomous classification, readers achieved high agreement (κ = 0.90–0.93), though still significantly lower than the DLM (κ = 0.95). At the sagittal spinal canal, κ values ranged from 0.65 to 0.80 for all-class gradings and from 0.88 to 0.94 for dichotomous gradings. For dichotomous classification, SS2 (κ = 0.93) and RR1 (κ = 0.94) achieved concordance not significantly different from the DLM. At the neural foramina, reader performance remained slightly inferior to the DLM, although RR1’s performance on dichotomous classification (κ = 0.89) was not significantly different.

3.3. External Testing

The reference standard bounding boxes again comprised a higher number of grade 0/1 (normal/mild) labels; however, the degree of class imbalance was lower than in the internal test set (Table 2). The number of grade 3 labels was higher, suggesting more severe degenerative disease in this population. At the axial spinal canal, there were 978 (39.4%) grade 0, 842 (33.9%) grade 1, 330 (13.3%) grade 2 and 332 (13.4%) grade 3 labels. Sagittal spinal canal grades demonstrated a similar distribution with 892 (47.4%) grade 0, 526 (27.9%) grade 1, 229 (12.2%) grade 2 and 236 (12.5%) grade 3 labels. At the neural foramina, there were 370 (41.4%) grade 0, 258 (28.9%) grade 1 and 265 (29.7%) grade 2 labels. Similar to internal testing, the DLM achieved high recall at the spinal canal (95.1–99.8%) and slightly lower recall at the neural foramina (86.7%).
The DLM achieved high but slightly degraded performance compared to internal testing. At the axial spinal canal, it achieved κ = 0.76 (95%CI: 0.74–0.77) for all-class classification and κ = 0.91 (95%CI: 0.90–0.92) for dichotomous classification. At the sagittal spinal canal, performance was similar at κ = 0.77 (95%CI: 0.75–0.79) for all-class classification and κ = 0.92 (95%CI: 0.91–0.94) for dichotomous classification. At the neural foramina, this was slightly lower, at κ = 0.68 (95%CI: 0.64–0.72) for all-class classification and κ = 0.86 (95%CI: 0.83–0.89) for dichotomous classification.

4. Discussion

4.1. Model Performance and Generalizability

We present an augmented transformer-based DLM, incorporating sagittal spinal canal assessment on a per-disc-level basis, in addition to optimized axial canal and foraminal gradings. During the training-validation phase, the DLM demonstrated stable convergence on loss and class error curves. Together with a minimal train-validation gap, this suggests that the DLM is generalizable. By benchmarking its performance against a diverse cohort of human readers, including spine surgeons and radiologists of varying subspecialties and seniority, we showed that the model delivers robust and clinically relevant performance. The DLM consistently outperformed human readers across MRI planes and ROIs on internal testing, although some comparisons did not reach statistical significance. Its performance at the sagittal spinal canal on the internal test set was slightly superior to the axial spinal canal (κ = 0.83 vs. κ = 0.80), supporting the reliability of sagittal image assessment.
On external testing, the DLM’s performance was slightly degraded. Given the robust training-validation metrics, this is likely attributable to differences in scanning parameters and patient populations. For all-class grading, Gwet’s kappa values decreased at the axial spinal canal, sagittal spinal canal, and axial neural foramina from κ = 0.80 to 0.76, κ = 0.83 to 0.77, and κ = 0.81 to 0.68, respectively. Achieving adequate generalizability and minimizing ‘dataset shift’ remain important areas of study for many diagnostic DLMs. For instance, in evaluating mammogram classification models, Wang et al. (2020) [21] showed a decline in AUROC from 0.71–0.79 to 0.44–0.65 when the same models were tested on an external dataset. To this end, techniques such as adversarial training and transfer learning could be trialed in the future to preserve DLM performance [22].

4.2. Inter-Rater Performance

Interestingly, spine surgeons had the lowest concordance for certain metrics, such as all-class classification at the axial spinal canal, and dichotomous classification at the sagittal spinal canal. This could be partly attributed to differences in clinical focus and training. In practice, spine surgeons often interpret imaging studies themselves, correlating imaging findings with clinical characteristics and surgical outcomes. While in many cases there are high levels of agreement between radiologist and surgeon interpretation [23], spine surgeons may be more accurate in correlating certain findings with intraoperative observations [24]. In our study, spine surgeons may have been less accustomed to interpreting MRI without clinical context and less familiar with structured grading systems. In future work, clinical information could be incorporated into the reading task, providing a more accurate representation of clinical practice. Radiologist grading was used as the reference standard in this study, and spine surgeons should be consulted in the future to offer a complementary perspective. Nonetheless, concordance attained by spine surgeons was still substantial.

4.3. Clinical Implications

Given the high levels of concordance shown by the DLM across both axial and sagittal planes, it could be deployed for triage or decision support. The increasing incidence of DCS, and thus of cervical spine MRI examinations, has contributed to a growing imaging workload; automated review and prioritization of severe cases could meaningfully reduce treatment delays.
We included sagittal images in this study because they are obtained in a single acquisition, providing a continuous overview of the spinal cord and the spinal canal. Prior studies have shown that sagittal imaging offers high reliability when evaluating canal size, cord diameter, and identifying the level of most severe stenosis [25,26]. In contrast, axial sequences offer superior evaluation of foraminal stenosis [8,9], but require multiple acquisitions at variable angulations to account for the normal cervical lordosis. This can introduce longer scan times and interslice variability. These complementary properties underscore the importance of multiplanar evaluation, providing both robust assessment of the cord and spinal canal, as well as detailed foraminal assessment.
Importantly, sagittal images are particularly suited for automated triage because they can provide rapid assessment of the most critical findings on a single acquisition, namely high-grade spinal canal stenosis and cord compression. In practice, the DLM could identify potentially urgent pathology even before axial acquisitions are completed. This mirrors existing deep learning-driven triage systems in other aspects of medical imaging, such as hemorrhage detection on MRI brain scans [27] and critical findings on chest radiographs [28]. Thus, integrating sagittal and axial analysis into a single framework optimizes both timeliness and diagnostic performance.

4.4. Limitations and Future Work

Several study limitations should be acknowledged. First, the presence of class imbalance could artificially bias performance metrics, although the relatively higher incidence of grade 0/1 labels likely reflects real-world practice. Second, the reference standard used labels from two radiologists in consensus. As alluded to, future work could include ratings from spine surgeons, or clinical information such as intraoperative findings and surgical outcomes. Third, although sagittal images were incorporated into this study, they were assessed independently. More sophisticated models that evaluate the commonly acquired sagittal MRI sequences collectively (e.g., T2-weighted, T1-weighted and short-tau inversion recovery (STIR)) could potentially offer more insight and improved accuracy. Fourth, a wider range of pathologies (for example, neoplasms and post-surgical cases) could be incorporated in further iterations; the aim of the present DLM was to classify DCS, which still represents the majority of cases in clinical practice. Fifth, we acknowledge that although the DLM was benchmarked against human readers on the internal test set, external testing only evaluated DLM performance. This was limited by ethical and regulatory data constraints, but future work that includes multi-centre reader studies for external comparison, or datasets from other countries or health systems, would add rigor.

4.5. Conclusions

Our DLM provides a validated demonstration of multiplanar cervical spine MRI interpretation, outperforming both spine surgeons and radiologists on internal testing. These findings support its role as a triage or decision-support tool for either sagittal or axial images.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ai6120308/s1, Table S1. MRI scanner parameters.

Author Contributions

Conceptualization, A.L., J.W., C.L., A.M., Y.H.T., D.S.W.L., X.Z.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; methodology, A.L., J.W., C.L., J.H., J.C.Y.B., D.S.W.L., X.Z.L., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; software, J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., J.H., S.G., A.Q.A.T., J.C.Y.B., D.S.W.L., X.Z.L., E.C.T. and J.J.H.T.; validation, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., S.G., A.Q.A.T., J.C.Y.B., D.S.W.L., X.Z.L., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; formal analysis, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., S.G., A.Q.A.T., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; investigation, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., J.H., S.G., A.Q.A.T., J.C.Y.B., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; resources, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., S.G., A.Q.A.T., J.C.Y.B., D.S.W.L., X.Z.L., S.L., J.J.H.T. and J.T.P.D.H.; data curation, A.L., J.W., C.L., Y.J.L., W.O., T.K., J.H., S.G., A.Q.A.T., J.C.Y.B. and J.T.P.D.H.; writing—original draft preparation, A.L., J.W., C.L., T.K., J.J.H.T. and J.T.P.D.H.; writing—review and editing, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., J.H., S.G., A.Q.A.T., J.C.Y.B., D.S.W.L., X.Z.L., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; visualization, A.L., J.W., C.L., E.C.T. and Q.V.Y.; supervision, J.C.Y.B., D.S.W.L., X.Z.L., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; project administration, A.L., J.W., C.L., A.M., Y.H.T., Y.J.L., W.O., T.K., S.G., A.Q.A.T., J.C.Y.B., D.S.W.L., X.Z.L., E.C.T., Q.V.Y., S.L., J.J.H.T., N.K., B.C.O., S.T.Q. and J.T.P.D.H.; funding acquisition, J.T.P.D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was directly funded by MOH/NMRC, Singapore. Specifically, this study received support from the Singapore Ministry of Health National Medical Research Council under the NMRC Clinician Innovator Award (CIA). The grant was awarded for the project titled “From Prototype to Full Deployment: A Comprehensive Deep Learning Pipeline for Whole-Spine MRI” (Grant ID: CIAINV25jan-0005 J.T.P.D.H).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of National Healthcare Group, Singapore (IRB No.: 2021/00752, approval date 29 November 2021).

Informed Consent Statement

Patient consent was waived by the Institutional Review Board due to the minimal risks to human subjects.

Data Availability Statement

Data are available upon request due to privacy restrictions, to protect patient confidentiality.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DCS	Degenerative cervical spondylosis
DLM	Deep learning model
MRI	Magnetic resonance imaging
ROI	Region of interest
FFN	Feed-forward network
ATSS	Adaptive Training Sample Selection
FCOS	Fully convolutional one-stage object detection
R-CNN	Region-based convolutional neural network
IOU	Intersection over union
STIR	Short-tau inversion recovery

References

  1. Theodore, N. Degenerative cervical spondylosis. N. Engl. J. Med. 2020, 383, 159–168.
  2. Nouri, A.; Martin, A.R.; Kato, S.; Reihani-Kermani, H.; Riehm, L.E.; Fehlings, M.G. The relationship between MRI signal intensity changes, clinical presentation, and surgical outcome in degenerative cervical myelopathy: Analysis of a global cohort. Spine 2017, 42, 1851–1858.
  3. Brown, B.M.; Schwartz, R.H.; Frank, E.; Blank, N.K. Preoperative evaluation of cervical radiculopathy and myelopathy by surface-coil MR imaging. AJR Am. J. Roentgenol. 1988, 151, 1205–1212.
  4. Teresi, L.M.; Lufkin, R.B.; Reicher, M.A.; Moffit, B.J.; Vinuela, F.V.; Wilson, G.M.; Bentson, J.R.; Hanafee, W.N. Asymptomatic degenerative disk disease and spondylosis of the cervical spine: MR imaging. Radiology 1987, 164, 83–88.
  5. Cheng, P.M.; Montagnon, E.; Yamashita, R.; Pan, I.; Cadrin-Chênevert, A.; Romero, F.P.; Chartrand, G.; Kadoury, S.; Tang, A. Deep learning: An update for radiologists. Radiographics 2021, 41, 1427–1445.
  6. Lee, A.; Wu, J.; Liu, C.; Makmur, A.; Ting, Y.H.; Lee, S.; Chan, M.D.Z.; Lim, D.S.W.; Khoo, V.M.H.; Sng, J.; et al. Using deep learning to enhance reporting efficiency and accuracy in degenerative cervical spine MRI. Spine J. 2025, 25, 1942–1950.
  7. Lee, A.; Wu, J.; Liu, C.; Makmur, A.; Ting, Y.H.; Muhamat, F.E.; Tan, L.Y.; Ong, W.; Tan, W.C.; Lee, Y.J.; et al. Deep learning model for automated diagnosis of degenerative cervical spondylosis and altered spinal cord signal on MRI. Spine J. 2025, 25, 255–264.
  8. Hallinan, J.T.P.D.; Zhu, L.; Yang, K.; Makmur, A.; Algazwi, D.A.R.; Thian, Y.L.; Lau, S.; Choo, Y.S.; Eide, S.E.; Yap, Q.V.; et al. Deep learning model for automated detection and classification of central canal, lateral recess, and neural foraminal stenosis at lumbar spine MRI. Radiology 2021, 300, 130–138.
  9. Tumko, V.; Kim, J.; Uspenskaia, N.; Honig, S.; Abel, F.; Lebl, D.R.; Hotalen, I.; Kolisnyk, S.; Kochnev, M.; Rusakov, A.; et al. A neural network model for detection and classification of lumbar spinal stenosis on MRI. Eur. Spine J. 2024, 33, 941–948.
  10. Yi, W.; Zhao, J.; Tang, W.; Yin, H.; Yu, L.; Wang, Y.; Tian, W. Deep learning-based high-accuracy detection for lumbar and cervical degenerative disease on T2-weighted MR images. Eur. Spine J. 2023, 32, 3807–3814.
  11. Niemeyer, F.; Galbusera, F.; Tao, Y.; Phillips, F.M.; An, H.S.; Louie, P.K.; Samartzis, D.; Wilke, H.J. Deep phenotyping the cervical spine: Automatic characterization of cervical degenerative phenotypes based on T2-weighted MRI. Eur. Spine J. 2023, 32, 3846–3856.
  12. Zong, Z.; Song, G.; Liu, Y. DETRs with collaborative hybrid assignments training. arXiv 2022, arXiv:2211.12860.
  13. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Lecture Notes in Computer Science, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
  14. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778.
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
  16. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020.
  17. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019.
  18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  19. Muhle, C.; Metzner, J.; Weinert, D.; Falliner, A.; Brinkmann, G.; Mehdorn, M.H.; Heller, M.; Resnick, D. Classification system based on kinematic MR imaging in cervical spondylitic myelopathy. AJNR Am. J. Neuroradiol. 1998, 19, 1763–1771.
  20. Park, H.J.; Kim, J.H.; Lee, J.W.; Lee, S.Y.; Chung, E.C.; Rho, M.H.; Moon, J.W. Clinical correlation of a new and practical magnetic resonance grading system for cervical foraminal stenosis assessment. Acta Radiol. 2015, 56, 727–732.
  21. Wang, X.; Liang, G.; Zhang, Y.; Blanton, H.; Bessinger, Z.; Jacobs, N. Inconsistent performance of deep learning models on mammogram classification. J. Am. Coll. Radiol. 2020, 17, 796–803.
  22. Tran, A.T.; Zeevi, T.; Payabvash, S. Strategies to improve the robustness and generalizability of deep learning segmentation and classification in neuroimaging. BioMedInformatics 2025, 5, 20.
  23. Lurie, J.D.; Doman, D.M.; Spratt, K.F.; Tosteson, A.N.A.; Weinstein, J.N. Magnetic resonance imaging interpretation in patients with symptomatic lumbar spine disc herniations: Comparison of clinician and radiologist readings. Spine 2009, 34, 701–705.
  24. Rihn, J.A.; Yang, N.; Fisher, C.; Saravanja, D.; Smith, H.; Morrison, W.B.; Harrop, J.; Vaccaro, A.R. Using magnetic resonance imaging to accurately assess injury to the posterior ligamentous complex of the spine: A prospective comparison of the surgeon and radiologist: Clinical article. J. Neurosurg. Spine 2010, 12, 391–396.
  25. Grochmal, J.K.; Lozen, A.M.; Klein, A.P.; Mark, L.P.; Li, J.; Wang, M.C. Interobserver reliability of magnetic resonance imaging predictors of outcome in cervical spine degenerative conditions. World Neurosurg. 2018, 117, e215–e220.
  26. Sritharan, K.; Chamoli, U.; Kuan, J.; Diwan, A.D. Assessment of degenerative cervical stenosis on T2-weighted MR imaging: Sensitivity to change and reliability of mid-sagittal and axial plane metrics. Spinal Cord 2020, 58, 238–246.
  27. Shin, H.; Park, J.E.; Jun, Y.; Eo, T.; Lee, J.; Kim, J.E.; Lee, D.H.; Moon, H.H.; Park, S.I.; Kim, S.; et al. Deep learning referral suggestion and tumour discrimination using explainable artificial intelligence applied to multiparametric MRI. Eur. Radiol. 2023, 33, 5859–5870.
  28. Kolossváry, M.; Raghu, V.K.; Nagurney, J.T.; Hoffmann, U.; Lu, M.T. Deep learning analysis of chest radiographs to triage patients with acute chest pain syndrome. Radiology 2023, 306, e221926.
Figure 1. Overview of Model Design. After initial feature map extraction, the model performs region of interest (ROI) detection and classification, with the predictions appearing as color-coded bounding boxes at the respective ROIs.
Figure 2. Flowchart of the MRI scans selection and dataset allocation.
Figure 3. Loss and class error curves for the training and validation sets.
Figure 4. Examples of model predictions on sagittal (top row) and axial (bottom row) images. The left-sided images show reference standard labels (expert consensus). The right-sided images show model predictions for the same images, with percentages presented next to each bounding box to represent the probability of the predicted class. Bounding boxes are color-coded based on the predicted class, with green (grade 0), yellow (grade 1), orange (grade 2) and red (grade 3) colors used.
Table 1. Demographics of all patients included in the datasets.
Characteristics | Training/Validation and Internal Testing (n = 723) | Internal Testing (n = 75) | External Testing (n = 75)
Age (years ± standard deviation (range)) | 58 ± 13.7 (19–90) | 56 ± 14.3 (20–84) | 60 ± 13.2 (24–95)
Women | 319 (44%) | 40 (53%) | 17 (23%)
Indication for MRI
  Neck pain | 174 (35%) | 14 (19%) | 1 (1%)
  Unilateral radiculopathy | 199 (39%) | 20 (27%) | 16 (21%)
  Bilateral radiculopathy | 24 (5%) | 4 (5%) | 3 (4%)
  Myelopathy | 296 (59%) | 35 (47%) | 51 (68%)
  Others | 30 (6%) | 2 (3%) | 4 (5%)
Table 2. Distribution of reference standard labels for the test sets.
Test Set | Grade | Axial Spinal Canal | Sagittal Spinal Canal | Axial Neural Foramina 1
Internal Testing | Grade 0 | 1642 (58.7%) | 877 (57.0%) | 627 (62.0%)
 | Grade 1 | 757 (27.0%) | 436 (28.3%) | 186 (18.4%)
 | Grade 2 | 216 (7.7%) | 142 (9.2%) | 199 (19.7%)
 | Grade 3 | 180 (6.4%) | 83 (5.4%) | N/A
 | Total | 2795 | 1538 | 1012
 | DLM Total (Recall) | 2777 (99.4%) | 1492 (95.7%) | 899 (88.8%)
External Testing | Grade 0 | 978 (39.4%) | 892 (47.4%) | 370 (41.4%)
 | Grade 1 | 842 (33.9%) | 526 (27.9%) | 258 (28.9%)
 | Grade 2 | 330 (13.3%) | 229 (12.2%) | 265 (29.7%)
 | Grade 3 | 332 (13.4%) | 236 (12.5%) | N/A
 | Total | 2482 | 1883 | 893
 | DLM Total (Recall) | 2477 (99.8%) | 1791 (95.1%) | 774 (86.7%)
1 Park et al. (2015) [20] only uses grade 0/1/2 for the neural foramina. These were only assessed on axial images. DLM = deep learning model.
Table 3. Gwet’s kappa values for internal testing.
Rater | Classification | Axial Spinal Canal | p-Value Against DLM | Sagittal Spinal Canal | p-Value Against DLM | Axial Neural Foramina | p-Value Against DLM
Deep Learning Model | All-class | 0.80 (0.72–0.82) | | 0.83 (0.81–0.85) | | 0.81 (0.77–0.84) |
 | Dichotomous | 0.95 (0.93–0.96) | | 0.95 (0.94–0.96) | | 0.90 (0.88–0.93) |
Spine Surgeon 1 | All-class | 0.65 (0.63–0.67) | <0.001 | 0.70 (0.67–0.73) | <0.001 | 0.70 (0.66–0.74) | <0.001
 | Dichotomous | 0.93 (0.92–0.94) | 0.037 | 0.92 (0.91–0.94) | 0.033 | 0.86 (0.83–0.88) | 0.002
Spine Surgeon 2 | All-class | 0.60 (0.58–0.62) | <0.001 | 0.65 (0.62–0.68) | <0.001 | 0.65 (0.61–0.69) | <0.001
 | Dichotomous | 0.91 (0.90–0.93) | <0.001 | 0.93 (0.91–0.94) | 0.058 | 0.84 (0.81–0.86) | <0.001
Neuroradiologist | All-class | 0.67 (0.65–0.69) | <0.001 | 0.70 (0.67–0.73) | <0.001 | 0.65 (0.61–0.68) | <0.001
 | Dichotomous | 0.90 (0.88–0.91) | <0.001 | 0.90 (0.88–0.92) | <0.001 | 0.82 (0.79–0.86) | <0.001
Musculoskeletal Radiologist | All-class | 0.72 (0.70–0.74) | <0.001 | 0.70 (0.67–0.72) | <0.001 | 0.72 (0.69–0.76) | <0.001
 | Dichotomous | 0.92 (0.91–0.93) | 0.001 | 0.88 (0.86–0.90) | <0.001 | 0.87 (0.84–0.89) | 0.009
Radiology Resident 1 | All-class | 0.67 (0.64–0.69) | <0.001 | 0.80 (0.78–0.82) | 0.041 | 0.72 (0.68–0.75) | <0.001
 | Dichotomous | 0.93 (0.91–0.94) | 0.017 | 0.94 (0.93–0.95) | 0.411 | 0.89 (0.86–0.91) | 0.333
Radiology Resident 2 | All-class | 0.67 (0.65–0.69) | <0.001 | 0.68 (0.66–0.71) | <0.001 | 0.68 (0.64–0.71) | <0.001
 | Dichotomous | 0.90 (0.89–0.92) | <0.001 | 0.90 (0.88–0.92) | <0.001 | 0.84 (0.81–0.87) | <0.001
Data are presented as Gwet’s kappa (95% CI). Dichotomous classification is presented for grades 0/1 vs. 2/3. DLM = deep learning model.
Table 4. Sensitivity and specificity for the deep learning model and all readers for internal testing.
Region of Interest | Reader | Sensitivity | Specificity | PPV | NPV | AUROC
Axial Spinal Canal | Deep learning model | 86.8 | 97.4 | 84.5 | 97.8 | 92.1
 | Spine Surgeon 1 | 87.6 | 95.8 | 77.6 | 97.9 | 91.7
 | Spine Surgeon 2 | 89.1 | 94.5 | 72.6 | 98.1 | 91.8
 | Neuroradiologist | 93.9 | 92.5 | 67.4 | 98.9 | 93.2
 | Musculoskeletal Radiologist | 81.1 | 96.1 | 77.4 | 96.9 | 88.6
 | Radiology Resident 1 | 90.2 | 95.2 | 75.6 | 98.3 | 92.7
 | Radiology Resident 2 | 91.4 | 93.1 | 68.7 | 98.5 | 92.3
Axial Neural Foramina | Deep learning model | 89.1 | 94.6 | 80.8 | 97.1 | 91.8
 | Spine Surgeon 1 | 55.3 | 97.8 | 85.9 | 89.9 | 76.5
 | Spine Surgeon 2 | 42.2 | 98.8 | 89.4 | 87.5 | 70.5
 | Neuroradiologist | 36.7 | 98.9 | 89.0 | 86.5 | 67.8
 | Musculoskeletal Radiologist | 53.3 | 99.1 | 93.8 | 89.7 | 76.2
 | Radiology Resident 1 | 76.9 | 96.1 | 82.7 | 94.4 | 86.5
 | Radiology Resident 2 | 43.7 | 99.0 | 91.6 | 87.8 | 71.4
Sagittal Spinal Canal | Deep learning model | 85.0 | 97.9 | 86.7 | 97.6 | 91.5
 | Spine Surgeon 1 | 91.6 | 95.0 | 75.7 | 98.5 | 93.3
 | Spine Surgeon 2 | 96.0 | 94.4 | 74.7 | 99.3 | 95.2
 | Neuroradiologist | 93.3 | 92.6 | 68.4 | 98.8 | 93.0
 | Musculoskeletal Radiologist | 93.3 | 91.4 | 65.0 | 98.8 | 92.4
 | Radiology Resident 1 | 82.2 | 97.8 | 96.5 | 97.0 | 90.0
 | Radiology Resident 2 | 94.2 | 92.9 | 69.5 | 99.0 | 93.6
Data are presented as percentages for dichotomous classification. PPV = positive predictive value; NPV = negative predictive value; AUROC = area under the receiver operating characteristic curve.
