X2P-Net: Context-Aware 2D/3D Vertebra Localization
Abstract
1. Introduction
- We introduce X2P-Net, an end-to-end, context-aware framework for 2D/3D vertebra localization. The framework exploits vertebral context, which is first enhanced by a prompt-guided reference vertebra and then extracted with learnable vertebral embeddings, to localize vertebrae accurately in both 2D and 3D.
- We design a novel BrickFormer architecture built on a dual-attention mechanism. The first attention layer automatically separates foreground regions from the background, and the second attention layer then attends only to the foreground features. This design achieves high localization accuracy at a low computational cost.
- We conduct comprehensive experiments on two datasets to demonstrate the efficacy of the proposed method: a large-scale synthetic dataset of biplanar digitally reconstructed radiographs (DRRs) and a real biplanar X-ray dataset of sheep spines captured with a C-arm imaging system.
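The dual-attention idea in the second contribution can be illustrated with a minimal NumPy sketch. This is not the authors' BrickFormer implementation, whose details are not given here. It assumes a generic coarse-to-fine scheme: a first attention pass scores average-pooled regions and keeps the `top_k` highest-scoring ones as "foreground", and a second pass attends only to the tokens inside those regions. The names `dual_attention`, `region_size`, and `top_k` are hypothetical, loosely echoing the top-k and pooling-stride hyperparameters listed in the ablation tables.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_attention(queries, features, region_size=4, top_k=8):
    """Two-stage attention sketch (illustrative assumption, not BrickFormer).

    queries:  (Q, C) array, e.g. one embedding per vertebra.
    features: (N, C) array of flattened image tokens; N must be a
              multiple of region_size.
    Returns the attended output (Q, C) and the selected region indices (Q, k).
    """
    n, c = features.shape
    if n % region_size != 0:
        raise ValueError("N must be divisible by region_size")
    regions = features.reshape(n // region_size, region_size, c)
    k = min(top_k, regions.shape[0])

    # Stage 1: coarse attention over average-pooled region descriptors.
    region_desc = regions.mean(axis=1)                       # (R, C)
    coarse = softmax(queries @ region_desc.T / np.sqrt(c))   # (Q, R)
    top_idx = np.argsort(-coarse, axis=1)[:, :k]             # (Q, k)

    # Stage 2: fine attention restricted to tokens of the selected regions.
    out = np.empty_like(queries, dtype=float)
    for q in range(queries.shape[0]):
        tokens = regions[top_idx[q]].reshape(-1, c)          # (k * region_size, C)
        weights = softmax(queries[q] @ tokens.T / np.sqrt(c))
        out[q] = weights @ tokens
    return out, top_idx
```

When `top_k` equals the number of regions, the second stage sees every token and the result reduces to ordinary dot-product attention; smaller values of `top_k` shrink the set of tokens that the fine stage has to process.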
2. Related Work
2.1. Leveraging Semantic Context in Vertebra Localization
2.2. Estimating 3D Landmarks from 2D Images
3. Methodology
3.1. Overview
3.2. The VFE Unit
3.3. The Prompt-Guided FE Unit
3.4. The SCE Unit
3.5. The 3D Multi-View Feature Fusion Unit
3.6. Loss Functions
3.7. Implementation Details
4. Experiments
4.1. Datasets
4.1.1. Synthetic Biplanar Spine DRR Dataset (BiSpineX Dataset)
4.1.2. Sheep Spine X-Ray Dataset (SheepSpineX Dataset)
4.2. Evaluation Metrics
4.3. Results
4.3.1. Results on the BiSpineX Dataset
4.3.2. Results on the SheepSpineX Dataset
4.4. Analytical Ablation Studies
4.4.1. Results on Investigating the Effectiveness of Key Components
4.4.2. Results on Examining Different Attention Mechanisms
4.4.3. Results on Investigating the Impact of Different Hyperparameters
4.4.4. Results of Investigating the Sensitivity of Our Method to Prompt Displacement
4.4.5. Analysis of BrickFormer
5. Discussions
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Tajsic, T.; Patel, K.; Farmer, R.; Mannion, R.; Trivedi, R. Spinal navigation for minimally invasive thoracic and lumbosacral spine fixation: Implications for radiation exposure, operative time, and accuracy of pedicle screw placement. Eur. Spine J. 2018, 27, 1918–1924.
- Maken, P.; Gupta, A. 2D-to-3D: A review for computational 3D image reconstruction from X-ray images. Arch. Comput. Methods Eng. 2023, 30, 85–114.
- Unberath, M.; Gao, C.; Hu, Y.; Judish, M.; Taylor, R.H.; Armand, M.; Grupp, R. The impact of machine learning on 2d/3d registration for image-guided interventions: A systematic review and perspective. Front. Robot. AI 2021, 8, 716007.
- Drover, D.; MV, R.; Chen, C.H.; Agrawal, A.; Tyagi, A.; Huynh, C.P. Can 3D Pose Be Learned from 2D Projections Alone? In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; European Computer Vision Association: Milan, Italy, 2018; pp. 78–94.
- Zhao, Q.; Zheng, C.; Liu, M.; Chen, C. A single 2d pose with context is worth hundreds for 3d human pose estimation. Adv. Neural Inf. Process. Syst. 2024, 36, 27394–27413.
- Aubert, B.; Vazquez, C.; Cresson, T.; Parent, S.; de Guise, J.A. Toward automated 3D spine reconstruction from biplanar radiographs using CNN for statistical spine model fitting. IEEE Trans. Med. Imaging 2019, 38, 2796–2806.
- Wang, L.; Xu, Q.; Leung, S.; Chung, J.; Chen, B.; Li, S. Accurate automated Cobb angles estimation using multi-view extrapolation net. Med. Image Anal. 2019, 58, 101542.
- Kasten, Y.; Doktofsky, D.; Kovler, I. End-to-end convolutional neural network for 3D reconstruction of knee bones from bi-planar X-ray images. In Proceedings of the Machine Learning for Medical Image Reconstruction: Third International Workshop, MLMIR 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, 8 October 2020; Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2020; pp. 123–133.
- Huang, Y.; Jones, C.K.; Zhang, X.; Johnston, A.; Waktola, S.; Aygun, N.; Witham, T.; Bydon, A.; Theodore, N.; Helm, P.A.; et al. Multi-perspective region-based CNNs for vertebrae labeling in intraoperative long-length images. Comput. Methods Programs Biomed. 2022, 227, 107222.
- Kyung, D.; Jo, K.; Choo, J.; Lee, J.; Choi, E. Perspective projection-based 3d CT reconstruction from biplanar X-rays. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
- Cafaro, A.; Spinat, Q.; Leroy, A.; Maury, P.; Munoz, A.; Beldjoudi, G.; Robert, C.; Deutsch, E.; Grégoire, V.; Lepetit, V.; et al. X2Vision: 3D CT Reconstruction from Biplanar X-Rays with Deep Structure Prior. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 699–709.
- Ye, K.; Sun, W.; Tao, R.; Zheng, G. A Projective-Geometry-Aware Network for 3D Vertebra Localization in Calibrated Biplanar X-Ray Images. Sensors 2025, 25, 1123.
- Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225.
- Zheng, G.; Gollmer, S.; Schumann, S.; Dong, X.; Feilkas, T.; Ballester, M.A.G. A 2D/3D correspondence building method for reconstruction of a patient-specific 3D bone surface model using point distribution models and calibrated X-ray images. Med. Image Anal. 2009, 13, 883–899.
- Baka, N.; Kaptein, B.L.; de Bruijne, M.; van Walsum, T.; Giphart, J.; Niessen, W.J.; Lelieveldt, B.P. 2D–3D shape reconstruction of the distal femur from stereo X-ray imaging using statistical shape models. Med. Image Anal. 2011, 15, 840–850.
- Wu, H.; Zhang, J.; Fang, Y.; Liu, Z.; Wang, N.; Cui, Z.; Shen, D. Multi-view vertebra localization and identification from ct images. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 136–145.
- Kim, H.; Lee, K.; Lee, D.; Baek, N. 3D reconstruction of leg bones from X-ray images using CNN-based feature analysis. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC); IEEE: Piscataway, NJ, USA, 2019; pp. 669–672.
- Aubert, B.; Vidal, P.; Parent, S.; Cresson, T.; Vazquez, C.; De Guise, J. Convolutional neural network and in-painting techniques for the automatic assessment of scoliotic spine surgery from biplanar radiographs. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, 11–13 September 2017; Proceedings, Part II 20; Springer: Berlin/Heidelberg, Germany, 2017; pp. 691–699.
- Bayat, A.; Sekuboyina, A.; Paetzold, J.C.; Payer, C.; Stern, D.; Urschler, M.; Kirschke, J.S.; Menze, B.H. Inferring the 3D standing spine posture from 2D radiographs. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part VI 23; Springer: Berlin/Heidelberg, Germany, 2020; pp. 775–784.
- Sekuboyina, A.; Husseini, M.E.; Bayat, A.; Löffler, M.; Liebl, H.; Li, H.; Tetteh, G.; Kukačka, J.; Payer, C.; Štern, D.; et al. VerSe: A vertebrae labelling and segmentation benchmark for multi-detector CT images. Med. Image Anal. 2021, 73, 102166.
- Payer, C.; Štern, D.; Bischof, H.; Urschler, M. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med. Image Anal. 2019, 54, 207–219.
- Tao, R.; Liu, W.; Zheng, G. Spine-transformers: Vertebra labeling and segmentation in arbitrary field-of-view spine CTs via 3D transformers. Med. Image Anal. 2022, 75, 102258.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Xie, Z.; Lin, Z.; Sun, E.; Ding, F.; Qi, J.; Zhao, S. Deep learning for automatic vertebra analysis: A methodological survey of recent advances. Comput. Med. Imaging Graph. 2025, 125, 102652.
- Klinder, T.; Ostermann, J.; Ehm, M.; Franz, A.; Kneser, R.; Lorenz, C. Automated model-based vertebra detection, identification, and segmentation in CT images. Med. Image Anal. 2009, 13, 471–482.
- Schmidt, S.; Kappes, J.; Bergtholdt, M.; Pekar, V.; Dries, S.; Bystrov, D.; Schnörr, C. Spine detection and labeling using a parts-based graphical model. In Proceedings of the Information Processing in Medical Imaging: 20th International Conference, IPMI 2007, Kerkrade, The Netherlands, 2–6 July 2007; Proceedings 20; Springer: Berlin/Heidelberg, Germany, 2007; pp. 122–133.
- Glocker, B.; Zikic, D.; Konukoglu, E.; Haynor, D.R.; Criminisi, A. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013: 16th International Conference, Nagoya, Japan, 22–26 September 2013; Proceedings, Part II 16; Springer: Berlin/Heidelberg, Germany, 2013; pp. 262–270.
- Chen, Y.; Gao, Y.; Li, K.; Zhao, L.; Zhao, J. Vertebrae identification and localization utilizing fully convolutional networks and a hidden Markov model. IEEE Trans. Med. Imaging 2019, 39, 387–399.
- Han, Z.; Wei, B.; Mercado, A.; Leung, S.; Li, S. Spine-GAN: Semantic segmentation of multiple spinal structures. Med. Image Anal. 2018, 50, 23–35.
- Huang, Z.; Zhao, R.; Leung, F.H.; Banerjee, S.; Lam, K.M.; Zheng, Y.P.; Ling, S.H. Landmark Localization from Medical Images with Generative Distribution Prior. IEEE Trans. Med. Imaging 2024, 43, 2679–2692.
- Ye, K.; Zou, X.; Sun, W.; Zheng, G. Semi-GDE: Generative distribution estimation for semi-supervised medical landmark localization. Neurocomputing 2025, 652, 131095.
- Yang, Y.; Wang, Y.; Liu, T.; Wang, M.; Sun, M.; Song, S.; Fan, W.; Huang, G. Anatomical prior-based vertebral landmark detection for spinal disorder diagnosis. Artif. Intell. Med. 2025, 159, 103011.
- Chen, H.; Shen, C.; Qin, J.; Ni, D.; Shi, L.; Cheng, J.C.; Heng, P.A. Automatic localization and identification of vertebrae in spine CT via a joint learning model with deep neural networks. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part I 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 515–522.
- Wang, F.; Zheng, K.; Lu, L.; Xiao, J.; Wu, M.; Miao, S. Automatic vertebra localization and identification in CT by spine rectification and anatomically-constrained optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 5280–5288.
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982.
- Bürgin, V.; Prevost, R.; Stollenga, M.F. Robust vertebra identification using simultaneous node and edge predicting graph neural networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 483–493.
- Xiang, S.; Zhang, L.; Wang, Y.; Zhou, S.; Zhao, X.; Zhang, T.; Li, S. VLD-Net: Localization and Detection of the Vertebrae from X-ray Images by Reinforcement Learning with Adaptive Exploration Mechanism and Spine Anatomy Information. IEEE J. Biomed. Health Inform. 2025, 29, 4969–4980.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430.
- Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- Zhang, Y.; Ji, X.; Liu, W.; Li, Z.; Zhang, J.; Liu, S.; Zhong, W.; Hu, L.; Li, W. A spine segmentation method under an arbitrary field of view based on 3d swin transformer. Int. J. Intell. Syst. 2023, 2023, 8686471.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
- Huang, Y.; Jones, C.K.; Zhang, X.; Johnston, A.; Aygun, N.; Witham, T.; Helm, P.A.; Siewerdsen, J.H.; Uneri, A. Automatic labeling of vertebrae in long-length intraoperative imaging with a multi-view, region-based CNN. In Proceedings of the Medical Imaging 2022: Image-Guided Procedures, Robotic Interventions, and Modeling; SPIE: San Francisco, CA, USA, 2022; Volume 12034, pp. 180–185.
- Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 7718–7727.
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 11656–11665.
- Dong, J.; Jiang, W.; Huang, Q.; Bao, H.; Zhou, X. Fast and robust multi-person 3d pose estimation from multiple views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 7792–7801.
- Bridgeman, L.; Volino, M.; Guillemaut, J.Y.; Hilton, A. Multi-Person 3D Pose Estimation and Tracking in Sports. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2019; pp. 2487–2496.
- Ju, F.; Wang, Y.; Zhao, J.; Dong, M. Multiview 2D/3D image registration in minimally invasive pelvic surgery navigation. Sci. Rep. 2025, 15, 26183.
- Lin, J.; Lee, G.H. Multi-view multi-person 3d pose estimation with plane sweep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 11886–11895.
- Wu, S.; Jin, S.; Liu, W.; Bai, L.; Qian, C.; Liu, D.; Ouyang, W. Graph-based 3d multi-person pose estimation using multi-view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 11148–11157.
- Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578.
- Tome, D.; Toso, M.; Agapito, L.; Russell, C. Rethinking pose in 3d: Multi-stage refinement and recovery for markerless motion capture. In Proceedings of the 2018 International Conference on 3D Vision (3DV); IEEE: Piscataway, NJ, USA, 2018; pp. 474–483.
- Tu, H.; Wang, C.; Zeng, W. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 197–212.
- Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. Adafuse: Adaptive multiview fusion for accurate human pose estimation in the wild. Int. J. Comput. Vis. 2021, 129, 703–718.
- Ye, H.; Zhu, W.; Wang, C.; Wu, R.; Wang, Y. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 142–159.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493.
- Russakoff, D.B.; Rohlfing, T.; Mori, K.; Rueckert, D.; Ho, A.; Adler, J.R.; Maurer, C.R. Fast generation of digitally reconstructed radiographs using attenuation fields with application to 2D-3D image registration. IEEE Trans. Med. Imaging 2005, 24, 1441–1454.
- Guo, X.; Xu, S.; Lin, X.; Sun, Y.; Ma, X. 3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space. Pattern Anal. Appl. 2022, 25, 157–167.
- Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297.
- Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 196–202.
- Brost, A.; Liao, R.; Strobel, N.; Hornegger, J. Respiratory motion compensation by model-based catheter tracking during EP procedures. Med. Image Anal. 2010, 14, 695–706.
- Niu, K.; Tao, Z.; Cheng, L.; Wei, Z.; Kang, H.; Wei, T.; Huang, B.; Xu, F.; Xiong, C. Comprehensive workflow with optical navigation in minimally invasive transforaminal lumbar interbody fusion: A retrospective study. J. Orthop. Surg. Res. 2025, 20, 694.










| Methods | | | | |
|---|---|---|---|---|
| 2D localization (LAT view) | | | | |
| SCN-Net [21] | 93.8 | 96.9 | 3.78 | 0.9735 |
| Spine-Trans [22] | 93.2 | 95.9 | 3.87 | 0.9689 |
| AdaFuse [57] | 91.1 | 95.8 | 4.61 | 0.9609 |
| ALG-Net [47] | 92.5 | 96.5 | 4.43 | 0.9716 |
| VOL-Net [47] | 92.3 | 96.5 | 4.33 | 0.9703 |
| Ours | 88.2 | 96.6 | 5.84 | 0.9701 |
| 2D localization (AP view) | | | | |
| SCN-Net [21] | 90.0 | 96.8 | 4.96 | 0.9682 |
| Spine-Trans [22] | 88.1 | 96.5 | 5.26 | 0.9667 |
| AdaFuse [57] | 87.6 | 94.8 | 5.66 | 0.9586 |
| ALG-Net [47] | 88.6 | 95.9 | 5.39 | 0.9607 |
| VOL-Net [47] | 88.2 | 95.4 | 5.44 | 0.9584 |
| Ours | 85.2 | 96.0 | 6.14 | 0.9681 |
| 3D localization | | | | |
| SCN-Net [21] | 90.5 | 92.6 | 8.94 | 0.9274 |
| Spine-Trans [22] | 87.5 | 91.5 | 9.21 | 0.9166 |
| AdaFuse [57] | 95.8 | 97.9 | 3.95 | 0.9826 |
| ALG-Net [47] | 95.7 | 98.3 | 3.25 | 0.9846 |
| VOL-Net [47] | 95.4 | 98.0 | 3.49 | 0.9827 |
| Ours | 96.9 | 98.8 | 2.99 | 0.9923 |


| Method | 3D Localization | | | |
|---|---|---|---|---|
| | ↑ | ↑ | ↓ | ↑ |
| SCN-Net [21] | 95.2 | 97.2 | 3.71 | 0.9854 |
| Spine-Trans [22] | 93.5 | 97.5 | 4.42 | 0.9803 |
| AdaFuse [57] | 97.2 | 99.7 | 2.41 | 0.9944 |
| ALG-Net [47] | 96.5 | 100.0 | 1.56 | 0.9948 |
| VOL-Net [47] | 96.3 | 99.8 | 1.63 | 0.9939 |
| Ours | 98.4 | 100.0 | 1.08 | 0.9972 |


| Components | ↑ | ↑ | ↓ | ↑ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| No Prompt | 90.2 | 93.1 | 6.27 | 0.9653 | 3.04 M | 129.4 GMac |
| No SCE | 92.3 | 94.8 | 5.95 | 0.9657 | 3.00 M | 108.1 GMac |
| No Fusion | 92.8 | 96.5 | 5.90 | 0.9822 | 3.22 M | 123.6 GMac |
| Ours | 96.9 | 98.8 | 2.99 | 0.9923 | 3.24 M | 130.8 GMac |

| Methods | ↑ | ↑ | ↓ | ↑ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| Vanilla attention [23] | 94.7 | 97.2 | 4.13 | 0.9794 | 3.24 M | 109.2 GMac |
| Sparse attention [63] | 93.2 | 98.1 | 4.72 | 0.9870 | 3.24 M | 111.2 GMac |
| Ours | 96.9 | 98.8 | 2.99 | 0.9923 | 3.24 M | 130.8 GMac |

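
The trade-off in the attention comparison can be made concrete with a little arithmetic. The sketch below copies the ↓ column and the FLOPs column from the table above and reports the proposed attention's relative change against each baseline. The metric behind the ↓ column is not named in the table as it appears here, so the code treats it only as a lower-is-better value; the dictionary and function names are illustrative.

```python
# Values copied from the attention-mechanism comparison table.
# "lower_is_better" is the unnamed ↓ column; "gmac" is the FLOPs column.
rows = {
    "Vanilla attention [23]": {"lower_is_better": 4.13, "gmac": 109.2},
    "Sparse attention [63]": {"lower_is_better": 4.72, "gmac": 111.2},
    "Ours": {"lower_is_better": 2.99, "gmac": 130.8},
}


def relative_change(new, ref):
    """Signed percentage change of `new` with respect to `ref`."""
    return 100.0 * (new - ref) / ref


changes = {}
for name, row in rows.items():
    if name == "Ours":
        continue
    changes[name] = {
        key: relative_change(rows["Ours"][key], row[key]) for key in row
    }

for name, delta in changes.items():
    print(
        f"Ours vs {name}: "
        f"↓ column {delta['lower_is_better']:+.1f}%, "
        f"compute {delta['gmac']:+.1f}%"
    )
```

Under these figures, the proposed attention lowers the ↓ column by roughly 28% and 37% relative to vanilla and sparse attention, while its reported compute is about 20% and 18% higher, at an identical parameter count of 3.24 M.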

A. Impact of spatial dimensions of vertebral embeddings.

| Dimensions | ↑ | ↑ | ↓ | ↑ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| | 92.4 | 95.1 | 5.92 | 0.9661 | 3.08 M | 109.8 GMac |
| | 93.2 | 96.7 | 4.19 | 0.9731 | 3.11 M | 114.0 GMac |
| | 96.9 | 98.8 | 2.99 | 0.9923 | 3.24 M | 130.8 GMac |

B. Impact of top-k value.

| Top-k | ↑ | ↑ | ↓ | ↑ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| 2 | 93.1 | 95.4 | 5.63 | 0.9665 | 3.24 M | 114.7 GMac |
| 4 | 95.7 | 97.9 | 3.59 | 0.9813 | 3.24 M | 120.0 GMac |
| 8 | 96.9 | 98.8 | 2.99 | 0.9923 | 3.24 M | 130.8 GMac |

C. Impact of pooling stride.

| | ↑ | ↑ | ↓ | ↑ | Params ↓ | FLOPs ↓ |
|---|---|---|---|---|---|---|
| 1 | 92.9 | 96.2 | 4.32 | 0.9689 | 3.24 M | 110.8 GMac |
| 2 | 94.7 | 97.7 | 4.07 | 0.9806 | 3.24 M | 114.7 GMac |
| 4 | 96.9 | 98.8 | 2.99 | 0.9923 | 3.24 M | 130.8 GMac |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Tao, R.; Ye, K.; Zhang, W.; Sun, W.; Yu, D.; Hang, D.; Zheng, G. X2P-Net: Context-Aware 2D/3D Vertebra Localization. Bioengineering 2026, 13, 178. https://doi.org/10.3390/bioengineering13020178

