X2P-Net: Context-Aware 2D/3D Vertebra Localization

Tao, Rong; Ye, Kangqing; Zhang, Weijun; Sun, Wenyuan; Yu, Derong; Hang, Donghua; Zheng, Guoyan

doi:10.3390/bioengineering13020178


by Rong Tao ¹,†, Kangqing Ye ¹,†, Weijun Zhang ²,†, Wenyuan Sun ¹, Derong Yu ¹, Donghua Hang ³,* and Guoyan Zheng ¹,*

¹ Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

² School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China

³ Department of Orthopedics, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200080, China

* Authors to whom correspondence should be addressed.

† These authors contributed equally to this work.

Bioengineering 2026, 13(2), 178; https://doi.org/10.3390/bioengineering13020178

Submission received: 27 December 2025 / Revised: 23 January 2026 / Accepted: 31 January 2026 / Published: 3 February 2026

(This article belongs to the Special Issue Advances in Computational Imaging and Artificial Intelligence for Biomedical and Clinical Applications)


Abstract

In the context of minimally invasive spine surgery, accurately estimating the 3D coordinates of the vertebrae from intraoperative 2D X-ray images is crucial for aligning preoperative data with the patient’s real-time posture. However, existing methods are hindered by the ill-posed nature of 2D-to-3D localization and the distinctive anatomical features of the spinal column, leading to ambiguities and reduced accuracy. In this paper, we introduce X2P-Net, a novel prompt-guided and semantic context-enhanced 2D/3D vertebra detection framework. To achieve this, we design a novel Transformer architecture, referred to as BrickFormer, which can automatically extract the refined vertebral foreground context at low computational cost using a dual-attention mechanism. Comprehensive experiments were conducted to validate the proposed approach on two datasets: a large-scale synthetic dataset (BiSpineX) and a sheep spine dataset (SheepSpineX). Results obtained from these experiments demonstrate superior landmark localization performance of the proposed method compared to other state-of-the-art methods. Specifically, on the BiSpineX dataset, X2P-Net achieves percentages of correct landmarks (PCL) of 96.9% and 98.8% at 10 mm and 20 mm thresholds, respectively, a mean position error of 2.99 mm, and an AUC of 0.9923. Similar superior performance was also observed when the proposed method was applied to the SheepSpineX dataset, with PCL values of 98.4% and 100.0% at 10 mm and 20 mm thresholds, respectively, a mean position error of 1.08 mm, and an AUC of 0.9972.

Keywords: spine; vertebra detection; Transformer; attention

1. Introduction

Recent years have witnessed an increasing demand for minimally invasive spine surgery (MISS) [1,2]. The success of the surgical procedure depends significantly on the precision with which preoperative planning is mapped onto the patient’s intraoperative position [3]. X-ray fluoroscopy is one of the most popular intraoperative imaging modalities because of its high flexibility, low radiation exposure, and cost-effectiveness. Using intraoperative 2D X-ray images to estimate the 3D locations of the vertebrae—also known as 2D/3D vertebra localization—is crucial for aligning preoperative data with the patient’s intraoperative posture. However, because an X-ray image offers a 2D projected view of the 3D anatomical structures, the 2D/3D vertebra localization task is inherently ill-posed, leading to ambiguities and reduced localization precision [4,5].

Many existing studies have utilized biplanar 2D X-ray images to infer 3D morphological characteristics of the spine [2,6,7,8,9,10,11,12], where the spatial information can be supplemented by two calibrated images acquired from two typical views, i.e., the anteroposterior (AP) and lateral (LAT) views. Currently, two main challenges limit the accuracy of 2D/3D vertebra localization. The first challenge is depth ambiguity, which arises from the superimposition of anatomical structures and the low contrast between the vertebrae and the surrounding soft tissues. The second challenge is semantic ambiguity, which stems from the spine’s chain-like structure, characterized by repetitive vertebrae, making it difficult to distinguish between adjacent vertebrae. Furthermore, metal implants, spinal deformities, and variability in the field of view (FOV) increase the complexity of the task.

To address these challenges, recent studies have proposed various methods that can be largely divided into two categories [13]: lifting-based approaches and direct regression-based approaches. Lifting-based approaches [6,14,15,16,17] involve estimating 2D landmark locations from each view before lifting them into 3D space through triangulation and optimization techniques, such as the least-squares method [16] and statistical shape models [14,15]. Despite their utility, these methods have several limitations, as shown in Figure 1. Notably, 2D landmarks may lack essential depth information for accurate 3D prediction. In particular, when the vertebral centroid landmarks are occluded due to superimposition (Figure 1b), spinal deformities (Figure 1e), or metal implants (Figure 1f,g), it is difficult to estimate precise 3D locations using incomplete visual information. Furthermore, the relationship between 2D visual appearance and 3D structures varies across images. For example, the projection of the vertebral centroid from CT volumes to the 2D detector plane is not always positioned at the center of the projected vertebrae, leading to 2D/3D annotation inconsistency, as demonstrated in Figure 1a,c,d. Lastly, view-angle disparities between multi-view X-ray images make it difficult to establish one-to-one correspondence between landmarks identified independently on each view, as demonstrated in Figure 1h. On the other hand, direct regression-based methods [6,12,18,19] use 2D visual features from multi-view images to construct a pseudo-3D feature volume, which is then processed by 3D convolutions to regress the voxel-wise likelihood of each landmark. However, the accuracy of these methods is constrained by the spatial resolution of the projected volume, since increasing this resolution leads to a quadratic increase in computational cost.

In this study, we tackle the aforementioned challenges by leveraging advanced semantic context integration strategies to reduce ambiguities and to enhance localization accuracy. First, in contrast to previous works [21,22], which depend only on visual context, we introduce an interactive learning scenario where users can place a point-like prompt on each view to target a reference vertebra. The position and visual features of the reference vertebra are then utilized to enhance the semantic context of the remaining vertebrae. Second, we incorporate the anatomical prior information of the spine through a set of learnable vertebral embeddings. These embeddings are employed to delineate each vertebra using the enhanced 2D features. Subsequently, the delineated 2D vertebral location context is fused with high-resolution features, creating a context-enriched pseudo-3D volume for accurate 3D vertebra localization. To demonstrate the effectiveness of our approach, we designed a context-aware 2D/3D vertebra localization framework that uses biplanar X-ray images to estimate vertebral positions in 3D space, hereafter referred to as X2P-Net. The core of X2P-Net is a novel Transformer architecture, referred to as BrickFormer, which facilitates computational efficiency while maintaining performance. Unlike the vanilla Transformer [23], which computes attention weights over all pixels in the feature maps, BrickFormer benefits from a dual-attention mechanism, which automatically discriminates foreground pixels from the background. By removing background pixels, BrickFormer uses sparse foreground pixels as building bricks to support the localization task. Key contributions of this paper are summarized as follows:

We introduce an end-to-end context-aware 2D/3D vertebra localization framework, referred to as X2P-Net. The framework takes advantage of vertebral context, which is first enhanced by a prompt-guided reference vertebra and then extracted using learnable vertebral embeddings, for high-performing 2D/3D vertebra localization.
We design a novel BrickFormer architecture, which leverages a dual-attention mechanism. The initial attention layer automatically identifies the foreground region from the background, and the subsequent attention layer then focuses only on the foreground features. This approach achieves high localization accuracy at a low computational cost.
We conduct comprehensive experiments on two datasets to demonstrate the efficacy of the proposed method: a large-scale synthetic dataset of biplanar digitally reconstructed radiographs (DRRs) and a real biplanar X-ray image dataset of sheep spines, captured by a C-arm imaging system.

2. Related Work

2.1. Leveraging Semantic Context in Vertebra Localization

Automatic localization of vertebrae from spinal images is a challenging task due to the unique morphology of the spinal column [24]. Published vertebra localization methods have leveraged vertebral context to improve localization accuracy. These methods can be broadly categorized into three groups: statistics-based, context-based, and object detection-based.

In statistics-based approaches, traditional methods employed statistical shape models or atlas-based methods to learn the statistical distribution of all vertebrae [25,26]. Subsequent studies combined machine learning [27] or fully convolutional neural networks [28] with a hidden Markov model (HMM) to identify the locations of vertebral bodies. With recent advancements, deep generative models such as generative adversarial networks (GANs) [29] and normalizing flows [30,31] were proposed to learn the prior distribution of vertebral landmarks. These models were able to implicitly capture the prior distribution, at the cost of additional computational complexity.

Conversely, context-based methods often employ sequential and structural modeling techniques to improve the robustness of vertebra localization by integrating global contextual cues or anatomical priors [32]. For instance, Chen et al. [33] proposed a joint learning model that combined the local appearance of one vertebra and the pairwise conditional dependencies of neighboring vertebrae for vertebra localization from CT images. Similarly, Wang et al. [34] designed an anatomically constrained optimization module that employed a soft constraint to regulate the distance between estimated vertebrae and a hard constraint on the consecutive estimated vertebra labels. Payer et al. [21] proposed a spatial configuration network (SCN-Net) to encode the semantic context of the landmarks. Tao et al. [22] proposed Spine-Transformers that formulated vertebra labeling as a one-to-one set mapping problem and introduced a global loss to incorporate the sequential relationships of vertebrae. Acknowledging the constraints of strict sequential assumptions in prior studies [21,22], especially when handling pathological deformations, researchers have also proposed graph neural networks (GNNs) [35] or reinforcement learning [36] to incorporate anatomical prior information. Specifically, Transformers [23] excel at capturing long-range dependencies, while GNNs and reinforcement learning provide greater flexibility in modeling non-linear spatial relationships and anatomical topologies [37]. For example, Bürgin et al. [38] proposed a hybrid network combining convolutional neural networks (CNNs) and GNNs for robust vertebra identification from CT images. Xiang et al. [39] proposed VLD-Net for localizing and detecting the vertebrae from X-ray images using reinforcement learning with an adaptive exploration mechanism and spine anatomy information.

Recently, object detection models like the You Only Look Once (YOLO) series [40,41,42] and Faster R-CNN [43] have gained increasing adoption for vertebra localization tasks owing to their speed and efficiency. For example, Zhang et al. [44] proposed a method that combined 3D Swin Transformers [45] with YOLOX [41] for accurate spine segment detection from 3D CT images. Huang et al. [46] proposed a method that integrated a bidirectional long short-term memory (LSTM) layer into the Faster R-CNN architecture to implicitly ensure sequential consistency. Although object detector-based approaches showed significant potential in accelerating vertebra localization while maintaining acceptable accuracy, they encountered challenges under severe pathological deformations, primarily due to the lack of explicit anatomical modeling.

A common limitation of the above-mentioned studies on vertebra localization was that they mainly focused on single-image (either a 3D volume or a 2D image) scenarios. However, few studies have addressed the more complex problem of multi-view vertebra localization, where the model not only needs to distinguish each vertebra within each image but also needs to establish cross-view semantic correspondence for landmarks detected on each view. Limited attempts have been made to address this task. For example, Wu et al. [16] proposed a multi-view contrastive learning strategy to capture the anatomical structural information from different views. Huang et al. [9] introduced a multi-perspective network for cross-view vertebra localization, where they used a recurrent module to incorporate contextual information and to enforce anatomical order for the detected vertebrae. While achieving promising results, these methods required either extra post-processing steps or repetitive sampling strategies to assemble the landmarks detected on each view.

2.2. Estimating 3D Landmarks from 2D Images

Estimating 3D landmarks from 2D images is inherently an ill-posed problem because multiple 3D predictions may result in the same 2D projection. To alleviate such ambiguities, existing approaches resort to either multi-view visual correspondence [5,47] or long-term temporal clues [48]. As this study focuses on enhancing 3D landmark localization from 2D images for intraoperative applications, we concentrate on studies in the multi-view scenario.

Multi-view images significantly reduce ambiguity in 3D landmark localization, yet effectively aggregating and fusing information from multiple perspectives remains challenging. In the literature, both lifting-based approaches and direct regression-based approaches have been proposed. Lifting-based approaches, such as those by Dong et al. [49] and Bridgeman et al. [50], associate 2D landmark estimations and fuse them into 3D poses. However, establishing cross-view correspondence is still an issue. Existing studies have utilized methods like triangulation [47,51], 4D graph cuts [49], plane sweep stereo [52], cross-view graph matching [53], or statistical shape models [54] to generate 3D poses from 2D landmarks. These methods often adopt a multi-stage framework [55] to refine 3D pose estimation, leading to increased computational cost and accumulated uncertainty. In contrast, direct regression-based approaches [12,47,56,57,58] first build a pseudo-3D feature volume through heatmap estimation and then regress 3D coordinates from the feature volume with 3D CNNs. However, the accuracy of these methods is constrained by the spatial resolution of the projected volume, and increasing this resolution leads to a quadratic increase in computational cost.

3. Methodology

3.1. Overview

Figure 2 illustrates the network architecture of the proposed context-aware X2P-Net, which consists of a 2D visual feature extraction (VFE) unit, a prompt-guided feature enhancement (FE) unit, a semantic context extraction (SCE) unit, and a 3D multi-view feature fusion unit. The network first utilizes positional information and visual features of the reference vertebra, as indicated by prompts, to enhance the features of the remaining vertebrae. Subsequently, the 2D vertebral context is captured through a series of learnable vertebral embeddings, yielding a set of 2D vertebral heatmaps. These 2D heatmaps are then integrated with high-resolution multi-view image features to construct a pseudo-3D volume, which is used to estimate the 3D coordinates of the vertebrae. Unlike previous 2D/3D landmark localization methods that require pretraining a 2D feature extraction backbone, the proposed method leverages the simultaneous learning of 2D vertebral context and 3D vertebral locations, thereby enabling end-to-end training. Details about each unit are presented below.

3.2. The VFE Unit

The VFE unit is shared across both the LAT and the AP views of the spine radiographs. Given a pair of input images, we denote $I_{lat}$ and $I_{ap}$ as the LAT and the AP views, respectively. Each view is represented as $I_i \in \mathbb{R}^{1 \times w \times h}$, where $i \in \{lat, ap\}$, and $w = 512$ and $h = 512$ represent the width and height in pixels, respectively. To capture multi-scale visual cues, we utilize a 2D U-Net-like architecture that comprises four levels and a spatial pyramid pooling (SPP) module [59] for feature integration. Subsequently, the 2D VFE unit generates a pair of feature maps corresponding to the respective LAT and AP views, denoted as $\{f_{lat}, f_{ap}\}$, as follows:

$$ f_i = \mathrm{VFE}(I_i), \quad \forall i \in \{lat, ap\}, \tag{1} $$

where $f_i \in \mathbb{R}^{d \times w \times h}$ and $d = 64$ is the number of feature channels.

In the case where the height of the input images exceeds 512 pixels, we employ a sliding window technique to sample consecutive image patches, as depicted in Figure 2A.

3.3. The Prompt-Guided FE Unit

For a given pair of input images, the prompt-guided FE unit requires users to specify a point-like prompt on each image as the input. It is worth noting that the two prompts in the AP and the LAT images need to be placed around the ground truth centers of the same vertebral body, usually the top-most vertebra in both images. These prompts serve as reference points, guiding the network to predict the locations of the vertebral body center for the remaining levels. Similarly, in the case of long-length images, which are divided into a series of contiguous image patches, the bottom-level vertebra from the preceding patch is utilized as the prompt for the subsequent patch. For simplicity, we introduce the prompt-guided FE unit and the SCE unit for a single image patch.

Let us denote the position of the prompt on an image patch as $(x_0, y_0)$, where $x_0$ and $y_0$ correspond to the horizontal and vertical coordinates of a point-like prompt, respectively. We first preprocess the input prompt to create two types of masks: a binary prompt mask $M_p$ and a unidirectional distance mask $M_d$. Specifically, the binary prompt mask $M_p \in \mathbb{R}^{1 \times w \times h}$ is constructed as follows:

$$ M_p(x, y) = \begin{cases} 1, & (x, y) \in \Omega_p, \\ 0, & \text{otherwise}, \end{cases} \qquad \Omega_p = \{(x, y) \mid x_0 - r \leq x < x_0 + r,\ y_0 - r \leq y < y_0 + r\}, \tag{2} $$

where $(x, y)$ represents pixel coordinates within the image patch, $\Omega_p$ is the set of pixel coordinates covered by the square patch centered at the prompt, and $r = 40$ is an empirical value that delineates a region that can cover the area of the reference vertebral body.

Next, using the features $f_i$ from the 2D VFE unit and the prompt mask $M_p$, we can obtain prompt features $f_p \in \mathbb{R}^{d \times 2r \times 2r}$:

$$ f_p(x, y) = f_i(x, y)\, M_p(x, y), \quad (x, y) \in \Omega_p. \tag{3} $$
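As a rough illustration of Equations (2) and (3), the sketch below builds the binary prompt mask and crops the prompt features for a single feature channel. It is a plain-Python sketch under stated assumptions rather than the authors' implementation: the function names `prompt_mask` and `crop_prompt_features` are hypothetical, arrays are indexed as `[y][x]`, and the square is clipped at the image border, a case the text does not specify.

```python
def prompt_mask(w, h, x0, y0, r=40):
    """Binary mask M_p of Eq. (2), indexed as mask[y][x]."""
    return [
        [1 if (x0 - r <= x < x0 + r and y0 - r <= y < y0 + r) else 0
         for x in range(w)]
        for y in range(h)
    ]


def crop_prompt_features(channel, x0, y0, r=40):
    """Crop one feature channel to the prompt square (Eq. (3)).

    Assumption: rows and columns outside the image are simply dropped,
    so the crop is smaller than 2r x 2r near the border.
    """
    h, w = len(channel), len(channel[0])
    ys = range(max(0, y0 - r), min(h, y0 + r))
    xs = range(max(0, x0 - r), min(w, x0 + r))
    return [[channel[y][x] for x in xs] for y in ys]


# Example: a 512 x 512 patch with a prompt at (x0, y0) = (100, 200).
mask = prompt_mask(w=512, h=512, x0=100, y0=200)
covered = sum(map(sum, mask))  # number of pixels inside the square
```

With the prompt well inside the image, the square covers $(2r)^2$ pixels, which matches the $2r \times 2r$ spatial size of $f_p$.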

Meanwhile, the unidirectional distance mask $M_d \in \mathbb{R}^{1 \times w \times h}$ is derived as follows:

$$ M_d(x, y) = \| y - y_0 \|_2, \quad \forall (x, y). \tag{4} $$

$M_d$ is essential to establish cross-view correspondence, which will be explained in detail in the next section.

Following the preprocessing steps, both $f_i$ and $f_p$ are down-sampled by a factor of 8, yielding $\tilde{f}_i$ and $\tilde{f}_p$, respectively. These down-sampled features are then fed into a 4-layer vanilla Transformer block with 8 heads and a hidden dimension of 64, as depicted in Figure 3. Within the Transformer block, $\tilde{f}_p$ serves as both the key and the value, which enables the enhancement of repetitive vertebral features present in $\tilde{f}_i$. The output of the Transformer layer can be presented by the following equation:

$$ \mathrm{Out} = \mathrm{softmax}\left( \frac{(\tilde{f}_i W_Q) \cdot (\tilde{f}_p W_K)^T}{\sqrt{d}} \right) \cdot (\tilde{f}_p W_V), \tag{5} $$

where $W_Q$, $W_K$, and $W_V$ represent the learnable weight matrices for the query, key, and value, respectively.
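The cross-attention of Equation (5) can be sketched in a few lines of plain Python. This is a single-head, single-layer sketch under assumptions that are not from the paper: features are flattened to token lists (one row per spatial position, one column per channel), and the helper names `matmul`, `transpose`, `softmax_rows`, and `cross_attention` are hypothetical. The 8 heads, 4 layers, and any positional encoding are left out.

```python
import math


def matmul(a, b):
    """Matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(f_i, f_p, w_q, w_k, w_v):
    """Eq. (5): image tokens f_i query the prompt tokens f_p."""
    d = len(w_q[0])                      # projected channel dimension
    q = matmul(f_i, w_q)                 # queries from the image features
    k = matmul(f_p, w_k)                 # keys from the prompt features
    v = matmul(f_p, w_v)                 # values from the prompt features
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)


# Toy example: three image tokens attend to two identical prompt tokens.
identity = [[1.0, 0.0], [0.0, 1.0]]
f_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_p = [[2.0, 3.0], [2.0, 3.0]]
out = cross_attention(f_i, f_p, identity, identity, identity)
```

Because the output is always a weighted average of the prompt values, every image token is rewritten in terms of the reference vertebra's features, which is the sense in which the prompt enhances the repetitive vertebral features.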

Eventually, the prompt-guided FE unit produces a set of enhanced features, represented as $f_e \in \mathbb{R}^{d \times w \times h}$, which are then fed into the SCE unit to extract the important contextual information of the remaining vertebrae.

3.4. The SCE Unit

The SCE unit aggregates spatial contextual information using learnable vertebral embeddings. These embeddings are structured as a sequence of spatial feature maps, denoted as $e = \{e_n \mid n = 1, \dots, N\}$ with $e_n \in \mathbb{R}^{d \times 16 \times 16}$, where $16 \times 16$ is the spatial dimension, and $N$ is the maximum number of predicted vertebrae. In conventional Transformers, the computational expense of the attention mechanism increases quadratically with the spatial dimensions of the key and the query matrices. To mitigate this issue, we develop a novel Transformer variant, referred to as BrickFormer, as illustrated in Figure 4. Notably, BrickFormer reduces computational cost by incorporating a dual-attention mechanism. Specifically, in the first attention layer, it automatically delineates the foreground regions from the background using low-resolution feature maps, while in the second attention layer, it refines the process by engaging only the selected high-resolution but sparse foreground features for the extraction of 2D vertebral context.

Mathematically, we first combine the enhanced features $f_e$ with the distance mask $M_d$ as follows:

$$ f_m(x, y) = f_e(x, y)\, \tilde{M}_d(x, y) + M_d(x, y), \quad \forall (x, y), \qquad \tilde{M}_d(x, y) = \begin{cases} 0, & \text{if } M_d(x, y) = 0, \\ 1, & \text{if } M_d(x, y) > 0. \end{cases} \tag{6} $$

Here, the masked features $f_m \in \mathbb{R}^{d \times w \times h}$ integrate the vertical distance from each pixel to a reference vertebra indicated by the prompt, thus facilitating spatial alignment of 2D features extracted from different views.
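To make Equations (4) and (6) concrete, the following sketch builds the unidirectional distance mask and combines it with a feature map. It assumes that the distance is kept in raw pixel units (the text does not mention normalization), that the same mask is added to every channel, and that arrays are indexed as `[y][x]`; the function names are hypothetical.

```python
def distance_mask(w, h, y0):
    """Unidirectional distance mask M_d of Eq. (4): |y - y0| for every pixel."""
    return [[abs(y - y0) for _ in range(w)] for y in range(h)]


def masked_features(f_e, m_d):
    """Combine features and distance mask as in Eq. (6).

    f_e is a list of channels, each indexed as channel[y][x].
    """
    return [
        [
            [value * (1 if dist > 0 else 0) + dist
             for value, dist in zip(f_row, d_row)]
            for f_row, d_row in zip(channel, m_d)
        ]
        for channel in f_e
    ]


# Toy example: one channel of constant value 2.0 on a 4 x 3 patch, prompt row y0 = 1.
m_d = distance_mask(w=4, h=3, y0=1)
f_e = [[[2.0] * 4 for _ in range(3)]]
f_m = masked_features(f_e, m_d)
```

Read literally, Equation (6) zeroes the features on the prompt row itself, where $M_d = 0$, and shifts all other features by their vertical distance to the prompt.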

Next, $f_m$ undergoes a max pooling operation with a stride of $\alpha$, obtaining $\tilde{f}_m$, which is fed into the first attention layer of BrickFormer. Specifically, within the first attention layer, the attention matrix $\mathrm{Attn}_1$ is calculated as follows:

$$ \mathrm{Attn}_1 = \mathrm{softmax}\left( \frac{(e_n W_{Q1}) \cdot (\tilde{f}_m W_{K1})^T}{\sqrt{d}} \right), \tag{7} $$

where $W_{Q1}$ and $W_{K1}$ represent the learnable weight matrices for the query and the key, respectively.

Following the attention computation, we sort the elements of the matrix $\mathrm{Attn}_1$ by their attention scores in descending order. This sorting operation allows us to identify the indices of the top-$k$ highest-scoring elements, resulting in a set of indices $\tilde{a} = \{\tilde{a}_1, \dots, \tilde{a}_k\}$. Then, we map $\tilde{a}$ to the corresponding positions on the high-resolution features $f_m$ (one position at the low-resolution features will be mapped to $\alpha^2$ positions at the high-resolution features), obtaining the new set of indices $a = \{a_1, \dots, a_{k\alpha^2}\}$, which will be used to extract the subset of foreground features $f_s \in \mathbb{R}^{d \times k\alpha^2}$. Subsequently, these foreground features are input into the second attention layer of BrickFormer for fine-grained attention computation, where the attention matrix $\mathrm{Attn}_2$ is calculated by the following:

$$ \mathrm{Attn}_2 = \mathrm{softmax}\left( \frac{(e_n W_{Q2}) \cdot (f_s W_{K2})^T}{\sqrt{d}} \right). \tag{8} $$

Finally, the output of BrickFormer is a set of context-enriched vertebral embeddings $\tilde{e} = \{\tilde{e}_n \mid n = 1, \dots, N\}$ with

$$ \tilde{e}_n = \mathrm{Attn}_2 \cdot (f_s W_{V2}), \tag{9} $$

where $W_{Q2}$, $W_{K2}$, and $W_{V2}$ are the corresponding weight matrices for the query, key, and value, respectively.
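The foreground selection between the two attention layers is essentially index bookkeeping, which the sketch below illustrates. Several details are assumptions rather than statements from the paper: the attention matrix is taken to be already reduced to one score per low-resolution position (the reduction over the query axis is not specified), the pooling window is assumed to equal the stride $\alpha$, positions are flattened in row-major order, and the function names are hypothetical.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, in descending score order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def expand_to_high_res(low_indices, w_low, alpha):
    """Map flat low-resolution indices to flat high-resolution indices.

    Each low-resolution position covers an alpha x alpha block, so k
    indices expand to k * alpha**2 indices.
    """
    w_high = w_low * alpha
    high = []
    for idx in low_indices:
        row, col = divmod(idx, w_low)
        for dy in range(alpha):
            for dx in range(alpha):
                high.append((row * alpha + dy) * w_high + (col * alpha + dx))
    return high


# Toy example: a 2 x 2 low-resolution map pooled from a 4 x 4 map (alpha = 2).
scores = [0.1, 0.5, 0.2, 0.9]           # one aggregated score per position
selected = top_k_indices(scores, k=2)   # coarse foreground positions
foreground = expand_to_high_res(selected, w_low=2, alpha=2)
```

Only the features at the `foreground` indices would be gathered into $f_s$ and passed to the second attention layer, so the fine-grained attention sees $k\alpha^2$ tokens instead of all $w \times h$ positions.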

These embeddings are then fed into linear layers to obtain the predicted 2D vertebral heatmaps $\tilde{h} = \{\tilde{h}_n \mid n = 1, \dots, N\}$, with $\tilde{h}_n \in \mathbb{R}^{16 \times 16}$.

To summarize, in a standard stacked Transformer with $L$ attention layers, the complexity of the attention computation is $O(L w h M d)$, where $w \times h$ denotes the spatial dimension of $f_m$ and $M = 16 \times 16 \times N$ represents the product of the spatial dimension and the total number of vertebral embeddings. In contrast, our proposed BrickFormer, utilizing the same number of attention layers, significantly reduces the computational complexity to $\frac{L}{2} [O(w h M d / \alpha^2) + O(k \alpha^2 M d)]$. Here, the first term corresponds to the attention computation on the low-resolution features, which identifies foreground vertebral regions. The second term represents the fine-grained attention computation on the high-resolution but sparse foreground features, aimed at improving localization accuracy while maintaining low computational cost.
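A back-of-the-envelope calculation shows what the two complexity expressions imply for the hyperparameters reported in Section 3.7. The sketch below simply evaluates the leading terms with all constant factors set to one, and assumes $f_m$ has the full $512 \times 512$ resolution; it is an illustration of the formulas, not a measurement of runtime. The function names and the choice $L = 2$ are hypothetical.

```python
def standard_cost(L, w, h, M, d):
    """Leading term of the standard attention cost, O(L w h M d)."""
    return L * w * h * M * d


def brickformer_cost(L, w, h, M, d, alpha, k):
    """Leading terms of the BrickFormer cost, (L/2) [w h M d / alpha^2 + k alpha^2 M d]."""
    coarse = w * h * M * d / alpha ** 2   # first layer, low-resolution features
    fine = k * alpha ** 2 * M * d         # second layer, sparse foreground features
    return (L / 2) * (coarse + fine)


def cost_ratio(w, h, n_vertebrae, d, alpha, k, L=2):
    """Standard cost divided by BrickFormer cost for the same settings."""
    M = 16 * 16 * n_vertebrae
    return standard_cost(L, w, h, M, d) / brickformer_cost(L, w, h, M, d, alpha, k)


# Settings reported for the synthetic and the sheep datasets.
synthetic = cost_ratio(w=512, h=512, n_vertebrae=10, d=64, alpha=32, k=8)
sheep = cost_ratio(w=512, h=512, n_vertebrae=5, d=64, alpha=8, k=4)
```

Under these assumptions the ratio is about 62 for the synthetic settings and about 120 for the sheep settings. The ratio does not depend on $L$, $M$, or $d$, since they multiply both expressions.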

3.5. The 3D Multi-View Feature Fusion Unit

The predicted 2D vertebral heatmaps, denoted as $\tilde{h}_i$, are rescaled to match the input dimensions. These heatmaps are combined with the respective feature maps $f_e$ to obtain the concatenated features $f_c$, which are then fed into the 3D multi-view feature fusion unit. Within the unit, we unproject the 2D features into a fixed-size pseudo-3D feature volume based on projective geometry. Specifically, we assume that each calibrated view (LAT and AP) is associated with a projection matrix $P_i \in \mathbb{R}^{3 \times 4}$, which is used to project 3D coordinates to 2D image space. During unprojection, each voxel in the pseudo-3D volume is projected onto the 2D image space using the projection matrix. The voxel feature is assigned by sampling the corresponding 2D feature value at the projected location. Then, the volumes from multiple views are aggregated and processed by 3D convolutions to output heatmaps representing 3D vertebral locations. These 3D vertebral heatmaps are subsequently passed through a soft-argmax function, which transforms the heatmaps into precise vertebral coordinates. Finally, the 3D multi-view feature fusion unit produces a set of predicted 3D vertebral coordinates $\{\tilde{l}_n \mid n = 1, \dots, N\}$ with $\tilde{l}_n \in \mathbb{R}^{1 \times 3}$.
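Two steps of this unit lend themselves to a small sketch: projecting a voxel center into a view with a $3 \times 4$ projection matrix, and turning a 3D heatmap into coordinates with a soft-argmax. The code below is a minimal plain-Python illustration under stated assumptions: the function names are hypothetical, the volume is indexed as `[z][y][x]`, and the soft-argmax is taken to be the expected voxel coordinate under a softmax over the whole volume (the paper does not state the normalization or any temperature). Feature sampling, view aggregation, and the 3D convolutions are omitted.

```python
import math


def project_point(P, xyz):
    """Project a 3D point to 2D image coordinates with a 3 x 4 matrix P."""
    X = (xyz[0], xyz[1], xyz[2], 1.0)             # homogeneous coordinates
    u, v, s = (sum(p * c for p, c in zip(row, X)) for row in P)
    return u / s, v / s                           # perspective division


def soft_argmax_3d(volume):
    """Expected (x, y, z) voxel coordinate under a softmax over the volume."""
    peak = max(v for plane in volume for row in plane for v in row)
    total = ex = ey = ez = 0.0
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                weight = math.exp(value - peak)   # numerically stable softmax
                total += weight
                ex += weight * x
                ey += weight * y
                ez += weight * z
    return ex / total, ey / total, ez / total


# Toy example: a pinhole-like matrix and a sharply peaked 2 x 2 x 2 heatmap.
P = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
pixel = project_point(P, (2.0, 4.0, 2.0))
heat = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 50.0], [0.0, 0.0]]]
location = soft_argmax_3d(heat)
```

Unlike a hard argmax, the soft-argmax is differentiable and can return sub-voxel positions, which is consistent with training the 2D and 3D stages end to end.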

3.6. Loss Functions

Assuming that the 2D ground truth heatmap for the $n$-th vertebra in view $i$ is $h_{i,n} = \{p_{i,n}(x, y) \mid 1 \leq x \leq 16, 1 \leq y \leq 16\}$ and the predicted heatmap for each vertebra is $\tilde{h}_{i,n} = \{\tilde{p}_{i,n}(x, y) \mid 1 \leq x \leq 16, 1 \leq y \leq 16\}$, we compute the MSE loss $\mathrm{Loss}_{MSE}$ and Dice loss $\mathrm{Loss}_{Dice}$ as follows:

$$ \mathrm{Loss}_{MSE} = \frac{1}{2 \times N \times 16 \times 16} \sum_{i} \sum_{n=1}^{N} \sum_{x=1}^{16} \sum_{y=1}^{16} \left( p_{i,n}(x, y) - \tilde{p}_{i,n}(x, y) \right)^2, \tag{10} $$

and

$$ \mathrm{Loss}_{Dice} = 1 - \frac{1}{2N} \sum_{i} \sum_{n=1}^{N} \frac{2 \sum_{x=1}^{16} \sum_{y=1}^{16} p_{i,n}(x, y)\, \tilde{p}_{i,n}(x, y)}{\sum_{x=1}^{16} \sum_{y=1}^{16} \left( p_{i,n}(x, y) \right)^2 + \sum_{x=1}^{16} \sum_{y=1}^{16} \left( \tilde{p}_{i,n}(x, y) \right)^2}, \tag{11} $$

where $p_{i,n}(x, y)$ and $\tilde{p}_{i,n}(x, y)$ indicate, respectively, the ground truth and the predicted probabilities of a pixel at position $(x, y)$ for the $n$-th vertebra in view $i$.

Therefore, we obtain the 2D localization loss $\mathrm{Loss}_{2D}$ as follows:

$$ \mathrm{Loss}_{2D} = \mathrm{Loss}_{MSE} + \mathrm{Loss}_{Dice}. \tag{12} $$

To predict the 3D coordinates of the vertebrae, we assume that the ground truth vertebral locations are $\{l_n \mid n = 1, \dots, N_v\}$ with $N_v \leq N$, where $N_v$ is the number of vertebrae present in the current input. We compute an MSE loss $\mathrm{Loss}_{3D}$ for each vertebra, as follows:

$$ \mathrm{Loss}_{3D} = \frac{1}{N_v} \sum_{n=1}^{N_v} \| l_n - \tilde{l}_n \|_2^2. \tag{13} $$

Then, the overall localization loss $\mathrm{Loss}_{Overall}$ is defined as follows:

$$ \mathrm{Loss}_{Overall} = \mathrm{Loss}_{2D} + \mathrm{Loss}_{3D}. \tag{14} $$
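The loss terms of Equations (10) to (14) can be written out directly. The following plain-Python sketch assumes each heatmap is flattened to a list of probabilities, with one list per (view, vertebra) pair; the function names are hypothetical, and the small `eps` in the Dice denominator is an added safeguard against empty heatmaps that is not part of Equation (11).

```python
def mse_loss(gt, pred):
    """Eq. (10): mean squared error over all views, vertebrae, and pixels."""
    count = sum(len(heatmap) for heatmap in gt)
    squared = sum((p - q) ** 2 for g, h in zip(gt, pred) for p, q in zip(g, h))
    return squared / count


def dice_loss(gt, pred, eps=1e-8):
    """Eq. (11): one minus the mean Dice score with squared-sum denominator.

    eps is an assumption added here to avoid division by zero.
    """
    scores = []
    for g, h in zip(gt, pred):
        overlap = 2 * sum(p * q for p, q in zip(g, h))
        denom = sum(p * p for p in g) + sum(q * q for q in h) + eps
        scores.append(overlap / denom)
    return 1 - sum(scores) / len(scores)


def loss_3d(gt_points, pred_points):
    """Eq. (13): mean squared Euclidean distance over the vertebrae present."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(g_pt, p_pt))
        for g_pt, p_pt in zip(gt_points, pred_points)
    ]
    return sum(squared) / len(squared)


def overall_loss(gt_maps, pred_maps, gt_points, pred_points):
    """Eqs. (12) and (14): unweighted sum of the 2D and 3D terms."""
    loss_2d = mse_loss(gt_maps, pred_maps) + dice_loss(gt_maps, pred_maps)
    return loss_2d + loss_3d(gt_points, pred_points)


# Toy example: one 2 x 2 heatmap whose peak is predicted at the wrong pixel.
gt_maps, pred_maps = [[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]
total = overall_loss(gt_maps, pred_maps, [(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)])
```

Note that the three terms are summed without weights, as in Equation (14), so the 3D term in squared coordinate units can dominate the bounded 2D terms early in training.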

3.7. Implementation Details

The proposed method was developed in Python 3.9 using the PyTorch 2.0 framework and was trained on a workstation equipped with two NVIDIA GeForce RTX 4090 GPUs. The input images had a size of $512 \times 512$. The network was trained from scratch in an end-to-end fashion for 100 epochs, employing the AdamW optimizer [60] with a weight decay of 0.05 and a batch size of 2. For the synthetic dataset, we set the maximum number of predicted vertebrae to $N = 10$ and the hyperparameters of BrickFormer to $\alpha = 32$ and $k = 8$. In contrast, for the real sheep spine dataset, we set $N = 5$, $\alpha = 8$, and $k = 4$. These parameter choices were guided by empirical estimates of the foreground-to-background ratio for each dataset.

During the training phase, we incorporate random cropping as a data augmentation technique and take the ground truth centers of the top-most vertebral bodies in both LAT and AP images as prompts. During inference, a point-like prompt is manually placed around the center of the top-most vertebral body in each image. After that, the network simultaneously generates $N$ heatmaps for each view and $N$ sets of 3D vertebral coordinates. For each predicted heatmap, a vertebra is considered present if the maximum probability $p_{max}$ is higher than 0.5. Thus, we use $p_{max}$ as a criterion to assess the validity of the 3D predictions.

4. Experiments

In this section, we present experimental results on a synthetic dataset consisting of biplanar spine DRR images and a real dataset consisting of biplanar sheep spine X-ray images. Below, we first describe the datasets and the evaluation metrics used in our experiments, and then present the experimental results.

4.1. Datasets

4.1.1. Synthetic Biplanar Spine DRR Dataset (BiSpineX Dataset)

Given the scarcity of biplanar spine X-ray datasets and the difficulty in obtaining accurate 3D annotations, we generated biplanar DRRs from CT images with precise annotations of vertebral body centroids and created a synthetic dataset, referred to as the BiSpineX dataset. To construct the dataset, we utilized CT volumes from the Large Scale Vertebrae Segmentation Challenge (VerSe) held at MICCAI 2019 and MICCAI 2020 [20]. Since the VerSe dataset included images with a variety of FOVs and resolutions, we excluded cases with mismatched CT volumes and annotations, images with fewer than three vertebrae, and those with file reading errors. This resulted in a dataset of 337 spinal CT volumes, including cases with fractures and metal implants, which were reoriented and resampled to a 1 mm isotropic resolution. For each CT volume, we generated LAT and AP DRRs [61] by simulating X-ray projections using a ray-tracing method that accounts for photon attenuation and scattering. A virtual detector plane of size $0.6 \times 0.6$ m² with a resolution of $2048 \times 2048$ pixels was used. To address view-angle disparities, we applied random spatial transformations to the CT volumes, including rotations in a range from −15° to 15° about the vertical axis, and from −5° to 5° about both the coronal and sagittal axes. Consequently, we derived 337 pairs of biplanar spine X-ray images from the corresponding CT volumes. The number of vertebrae present in each radiograph ranged from 3 to 24, covering both traditional C-arm X-ray radiographs that typically contain 3 to 5 vertebrae and emerging long-film X-ray images capable of capturing a larger FOV that spans the entire spinal column. The dataset was partitioned into an 80–20% train-test split, with an additional 5% of the training set reserved for validation purposes.

4.1.2. Sheep Spine X-Ray Dataset (SheepSpineX Dataset)

To demonstrate the performance and efficacy of the proposed method, we further conducted experiments on a real sheep spine biplanar X-ray dataset, referred to as the SheepSpineX dataset. This dataset comprises radiographs of 9 sheep cervical spines, which were obtained from a commercial slaughterhouse. For each sheep cervical spine, the radiographs were captured in both AP and LAT views, utilizing a Siemens Arcadis Varic C-Arm system (see Figure 5 for the experimental setup). For each case, we acquired 30 pairs of X-ray images, each pair consisting of LAT and AP views. The number of vertebrae present in each case ranged from 4 to 7. The ground truth 3D vertebral locations were obtained by annotating the centroids of each vertebral body on the corresponding CT volumes. These 3D coordinates were then projected onto the 2D views, yielding precise 2D ground truth vertebral locations. The dataset was divided into 5 cases for training, 1 case for validation, and 3 cases for testing.

4.2. Evaluation Metrics

We adopt commonly used metrics [13,20,62] for both 2D and 3D vertebra localization, as follows:

Percentage of Correct Landmarks (PCL): The $PCL@\tau$ is defined as the ratio of correctly detected landmarks to the total number of landmarks. A landmark is considered correctly detected if the Euclidean distance to the corresponding ground truth location is below a given threshold $\tau$. $PCL@\tau$ is calculated as follows:

$$
PCL@\tau = \frac{1}{N_t} \sum_{n=1}^{N_t} THR_{\tau}(l_n, \tilde{l}_n), \qquad
THR_{\tau}(l_n, \tilde{l}_n) =
\begin{cases}
1, & \text{if } \| l_n - \tilde{l}_n \|_2 < \tau, \\
0, & \text{otherwise},
\end{cases}
\tag{15}
$$

where $N_t$ denotes the total number of annotated vertebral locations, $l_n$ denotes the ground truth location of the n-th landmark, and $\tilde{l}_n$ denotes the corresponding predicted location.
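
The PCL definition in Equation (15) can be sketched in a few lines of Python. This is an illustrative implementation only; the function name `pcl` and the toy coordinates are assumptions and do not come from the evaluation code used in this work.

```python
import math


def pcl(gt, pred, tau):
    """Fraction of landmarks whose Euclidean error is strictly below tau."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("gt and pred must be non-empty and of equal length")
    hits = sum(1 for g, p in zip(gt, pred) if math.dist(g, p) < tau)
    return hits / len(gt)


# Toy 3D example (values in mm); the per-landmark errors are 5, 12, and 25 mm.
gt = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 12.0, 0.0), (20.0, 0.0, 25.0)]

pcl_10 = pcl(gt, pred, 10.0)  # one of three landmarks is within 10 mm
pcl_20 = pcl(gt, pred, 20.0)  # two of three landmarks are within 20 mm
```

The same function applies to 2D landmarks when the coordinates and the threshold are given in pixels.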

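Mean Position Error (MPE): It is defined as the average Euclidean distance between the predicted and the ground truth landmark locations, expressed in millimeters (mm) for 3D measurements and in pixels for 2D measurements. $MPE$ is calculated as follows:

$$
MPE = \frac{1}{N_t} \sum_{n=1}^{N_t} \| l_n - \tilde{l}_n \|_2
\tag{16}
$$

where $N_t$, $l_n$, and $\tilde{l}_n$ are defined as in Equation (15).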
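
As a minimal sketch of the MPE computation in Equation (16), the function below averages the per-landmark Euclidean distances. The function name `mpe` and the toy coordinates are illustrative assumptions.

```python
import math


def mpe(gt, pred):
    """Mean Euclidean distance between ground truth and predicted landmarks."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("gt and pred must be non-empty and of equal length")
    errors = [math.dist(g, p) for g, p in zip(gt, pred)]
    return sum(errors) / len(errors)


# Toy 3D example (values in mm); the per-landmark errors are 5, 12, and 25 mm.
gt = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (10.0, 12.0, 0.0), (20.0, 0.0, 25.0)]

mean_error = mpe(gt, pred)  # (5 + 12 + 25) / 3 = 14.0 mm
```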

Area Under the Curve (AUC): It is defined as the area under the $PCL$ curve between the upper limit and the lower limit on the x-axis. This metric is used to evaluate the overall performance of a landmark detection method.

For each of the three metrics, we report the corresponding vertebrae localization performance in both 2D and 3D spaces. Specifically, for the measurement of PCL in 2D, we calculate both $PCL_{2D}@10p$ and $PCL_{2D}@20p$ by setting $\tau_{2D}$ to 10 pixels and 20 pixels, respectively, while for the measurement of PCL in 3D, we compute both $PCL_{3D}@10mm$ and $PCL_{3D}@20mm$ by setting $\tau_{3D}$ to 10 mm and 20 mm, respectively. Similarly, we computed $AUC_{2D}$ in 2D by setting the lower limit to 10 pixels and the upper limit to 50 pixels, and $AUC_{3D}$ in 3D by setting the lower limit to 10 mm and the upper limit to 50 mm.
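
One way to compute the AUC described above is sketched below. The text only states that the area is taken under the PCL curve between the lower and upper limits; dividing by the width of the interval (so that a perfect method scores 1.0), the trapezoidal rule, and the number of sampling steps are assumptions of this sketch, as is the function name `pcl_curve_auc`.

```python
def pcl_curve_auc(errors, lower, upper, steps=400):
    """Normalized area under the PCL curve for thresholds in [lower, upper].

    errors: per-landmark Euclidean errors (mm in 3D, pixels in 2D).
    Normalizing by (upper - lower) is an assumption made for illustration.
    """
    if not errors or upper <= lower or steps < 1:
        raise ValueError("need errors, upper > lower, and steps >= 1")
    n = len(errors)
    width = (upper - lower) / steps

    def pcl_at(tau):
        return sum(1 for e in errors if e < tau) / n

    area = 0.0
    for k in range(steps):
        left = pcl_at(lower + k * width)
        right = pcl_at(lower + (k + 1) * width)
        area += 0.5 * (left + right) * width  # trapezoidal rule
    return area / (upper - lower)


# Limits used for the 3D metric in the text: 10 mm to 50 mm.
auc_all_good = pcl_curve_auc([1.0, 2.0, 3.0], 10.0, 50.0)   # every error below 10 mm
auc_all_bad = pcl_curve_auc([60.0, 75.0], 10.0, 50.0)       # every error above 50 mm
auc_half = pcl_curve_auc([30.0], 10.0, 50.0)                # curve steps up at 30 mm
```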

4.3. Results

4.3.1. Results on the BiSpineX Dataset

Due to the limited research on 3D vertebral localization utilizing biplanar radiographs, we evaluated the performance of X2P-Net in comparison with state-of-the-art (SOTA) methods, including benchmark approaches for vertebrae localization, such as SCN-Net [21] and Spine-Transformers (Spine-Trans) [22]. In these methods, 2D vertebral locations are first identified on the LAT and AP views of the radiographs, followed by a triangulation process to derive their 3D coordinates. For Spine-Transformers, which was originally designed for vertebral localization from CT volumes, we re-engineered the network and loss functions to adapt them for use with spinal radiographs. Furthermore, for comparative analysis, we included classic 2D-3D landmark detection algorithms that have previously been applied to human pose estimation, namely AdaFuse [57], ALG-Net [47], and VOL-Net [47]. All of the aforementioned methods were trained from scratch, following the original two-stage training protocol: pretraining the feature extractor, followed by end-to-end fine-tuning of the network.
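
To make the triangulation step used by the 2D baselines concrete, the sketch below recovers a 3D point as the midpoint of the shortest segment between two back-projected rays, one per view. The text does not specify which triangulation algorithm was used, so midpoint triangulation is a swapped-in technique chosen here for its simplicity. The source positions, the orthogonal LAT/AP geometry, and the function name `triangulate_midpoint` are hypothetical.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the closest points between two 3D rays.

    o1, o2: ray origins (e.g. X-ray source positions of the two views).
    d1, d2: ray directions through the detected 2D landmark in each view.
    """
    w = [p - q for p, q in zip(o1, o2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; triangulation is ill-defined")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return tuple(0.5 * (x + y) for x, y in zip(p1, p2))


# Hypothetical geometry (mm): LAT source on the x-axis, AP source on the y-axis.
target = (10.0, 20.0, 30.0)
lat_source = (-1000.0, 0.0, 0.0)
ap_source = (0.0, -1000.0, 0.0)
lat_ray = tuple(t - s for t, s in zip(target, lat_source))
ap_ray = tuple(t - s for t, s in zip(target, ap_source))

estimate = triangulate_midpoint(lat_source, lat_ray, ap_source, ap_ray)
```

With noise-free rays the estimate coincides with the target. With noisy 2D detections the two rays no longer intersect, and the midpoint serves as a compromise between them.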

Results of the comparison study are shown in Table 1. For 3D vertebral landmark localization, X2P-Net achieves the top performance in terms of all evaluation metrics. Specifically, it obtained a $PCL_{3D}@10mm$ of 96.9%, a $PCL_{3D}@20mm$ of 98.8%, an average $MPE_{3D}$ of 2.99 mm, and an $AUC_{3D}$ of 0.9923. In contrast, the second-best method (ALG-Net [47]) obtained a $PCL_{3D}@10mm$ of 95.7%, a $PCL_{3D}@20mm$ of 98.3%, an average $MPE_{3D}$ of 3.25 mm, and an $AUC_{3D}$ of 0.9846. Our method outperforms ALG-Net [47] with a 1.2% increase in $PCL_{3D}@10mm$, a 0.5% increase in $PCL_{3D}@20mm$, a 0.26 mm decrease in $MPE_{3D}$, and a 0.0077 increase in $AUC_{3D}$.

Furthermore, we report the 2D vertebra localization performance in Table 1. For X2P-Net, the 2D vertebral locations were derived from the auxiliary output of the predicted vertebral locations for the LAT and AP views. Although our method was not designed for 2D landmark localization, its performance closely matched that of the best-performing 2D vertebra localization network (SCN-Net [21]), indicating that our method effectively captured the semantic context of each vertebra. Specifically, our method achieved a $PCL_{2D}@20p$ of 96.6% for the LAT view and 96.0% for the AP view, whereas SCN-Net obtained a $PCL_{2D}@20p$ of 96.9% for the LAT view and 96.8% for the AP view. Compared to SCN-Net, however, our method exhibited a higher 2D localization error. In particular, our method achieved an average $MPE_{2D}$ of 5.84 pixels for the LAT view and 6.14 pixels for the AP view, while SCN-Net obtained an average $MPE_{2D}$ of 3.78 pixels for the LAT view and 4.96 pixels for the AP view. This higher 2D error is attributed to the lower spatial resolution of our predicted 2D heatmaps. Nonetheless, by integrating the semantic context from 2D landmark predictions with high-resolution 3D features, our method achieved the lowest 3D localization error, demonstrating the efficacy of the proposed method. This efficacy is further illustrated by the $PCL$ curves in Figure 6A.

Figure 7 illustrates a challenging case for 2D/3D vertebra localization, where the presence of scoliosis complicates accurate localization. Despite this, our method effectively identifies each vertebra’s location, demonstrating the method’s robustness.

We additionally conducted a per-level analysis of the landmark localization results. We evaluated the per-level landmark localization performance in terms of $PCL_{3D}@10mm$ and $MPE_{3D}$; the results are presented in Figure 8. Localization errors exceeding 10 mm were observed for 35 out of 1123 testing vertebrae. Among these 35 vertebrae, 29 were from cases with scoliosis, and 4 were from cases with metal implants.
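
As a quick consistency check, the failure count above can be related to the $PCL_{3D}@10mm$ reported for the BiSpineX dataset. The calculation assumes that the 1123 testing vertebrae are the same set used for the overall metric and that no error equals exactly 10 mm:

```latex
\[
PCL_{3D}@10mm \;=\; 1 - \frac{35}{1123} \;\approx\; 1 - 0.0312 \;=\; 0.9688 \;\approx\; 96.9\%
\]
```

Under these assumptions, the per-level failure count agrees with the overall value of 96.9%.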

4.3.2. Results on the SheepSpineX Dataset

Using the sheep spine dataset, we evaluated X2P-Net against the aforementioned SOTA techniques. The comparative analysis is presented in Table 2, where X2P-Net demonstrates superior performance across all metrics. Specifically, it achieved a $PCL_{3D}@10mm$ of 98.4%, a $PCL_{3D}@20mm$ of 100%, an average $MPE_{3D}$ of 1.08 mm, and an $AUC_{3D}$ of 0.9972. In contrast, the second-best method (ALG-Net [47]) attained a $PCL_{3D}@10mm$ of 96.5%, a $PCL_{3D}@20mm$ of 100%, an average $MPE_{3D}$ of 1.56 mm, and an $AUC_{3D}$ of 0.9948. The PCL curves for the various methods are depicted in Figure 6B. Despite influences such as variations in viewing angles, X2P-Net can still accurately estimate the 3D locations of vertebrae from the sheep spine X-ray images.

4.4. Analytical Ablation Studies

We conducted analytical ablation studies on the BiSpineX dataset to evaluate the performance of the proposed X2P-Net. We designed and conducted the following ablation studies: (1) We first performed major component ablations by systematically removing key components to understand their essential role in the landmark localization pipeline. (2) We then conducted a study to compare the proposed BrickFormer with other attention mechanisms. To achieve this, we replaced the BrickFormer attention layers in the network with either vanilla attention layers [23] or sparse attention layers [63] using the same hyperparameter settings (i.e., 4 layers, 8 heads, and a hidden dimension of 512), while keeping the remaining components of the network unchanged. The inputs to these attention layers were the same as those of the first attention layer in BrickFormer. (3) Subsequently, we investigated the impact of different hyperparameters on the performance of our method by systematically varying key hyperparameters while keeping others fixed, to determine the optimal balance between landmark localization accuracy and computational efficiency. (4) Additionally, we investigated the sensitivity of our method to prompt displacement. Specifically, we shifted the prompt along the x- and y-axes in both the LAT and AP images and evaluated performance under these perturbations. (5) Finally, we performed an in-depth analysis of the dual-attention mechanism of BrickFormer by visualizing features at different stages. For each study, the efficiency of each algorithm was quantified by the number of floating-point operations (FLOPs). The larger the FLOPs, the less efficient the algorithm.

4.4.1. Results on Investigating the Effectiveness of Key Components

The results from ablation experiments evaluating the contribution of each unit are shown in Table 3. We explored three network configurations:

(1) In the first experiment, named No Prompt, we removed the prompt-guided FE unit, allowing features from the VFE unit to proceed directly to the SCE unit. As shown in Table 3, incorporating the prompt information led to a 6.7% increase in $PCL_{3D}@10mm$, a 5.7% increase in $PCL_{3D}@20mm$, a 3.28 mm decrease in average $MPE_{3D}$, and a 0.0270 increase in $AUC_{3D}$.

(2) In the second experiment, named No SCE, we removed both the SCE unit and the $Loss_{2D}$. Consequently, the 3D volumes were solely constructed from the masked features $f_m$ from the previous unit. Compared to the No SCE method, our method improved $PCL_{3D}@10mm$ and $PCL_{3D}@20mm$ by 4.6% and 4.0%, respectively, reduced average $MPE_{3D}$ by 2.96 mm, and increased $AUC_{3D}$ by 0.0266.

(3) In the last experiment, named No Fusion, we excluded the fusion operation between the predicted 2D vertebral heatmaps and the masked features $f_m$. As shown in the results, the No Fusion method achieved better performance than the No SCE method because of the learned vertebral context. However, it still fell short of the performance achieved by our proposed context fusion strategy. Specifically, compared to the No Fusion method, our method achieved 4.1% and 2.3% improvements in $PCL_{3D}@10mm$ and $PCL_{3D}@20mm$, respectively, a 2.91 mm reduction in average $MPE_{3D}$, and a 0.0101 increase in $AUC_{3D}$.

4.4.2. Results on Examining Different Attention Mechanisms

The results of investigating the influence of different attention mechanisms on the performance of the proposed method are presented in Table 4. From this table, one can see that, compared to the vanilla attention mechanism, the proposed BrickFormer demonstrates improved performance, with an increase in $PCL_{3D}@10mm$ by 2.2% and in $PCL_{3D}@20mm$ by 1.6%, a decrease in average $MPE_{3D}$ by 1.14 mm, and an increase in $AUC_{3D}$ by 0.0129. When compared to the sparse attention mechanism, our method achieved 3.7% and 0.7% improvements in $PCL_{3D}@10mm$ and $PCL_{3D}@20mm$, respectively, a 1.73 mm reduction in average $MPE_{3D}$, and a 0.0053 increase in $AUC_{3D}$. These enhancements are attributed to the dual-attention mechanism embedded within BrickFormer, which effectively filters out irrelevant information from high-resolution foreground features, thereby enhancing localization precision.

4.4.3. Results on Investigating the Impact of Different Hyperparameters

We first examined the impact of the spatial dimensions of the vertebral embeddings on the performance of our proposed method, with dimensions set at 4 × 4, 8 × 8, and 16 × 16, corresponding to the resolution of the predicted 2D vertebral heatmaps. The results are shown in Table 5A. It was evident from the results that the accuracy of vertebra localization improved with an increase in the embedding dimension, peaking at a dimension of 16 × 16. This enhancement was likely due to the higher resolution of the predicted 2D vertebral heatmaps, which provided more contextual information for improving the precision of 3D vertebra localization. However, increasing the resolution from 4 × 4 to 16 × 16 raised the computational cost by approximately 20 million FLOPs, as shown in Table 5A.

Next, we explored the effect of varying the top-k value, which was set to 2, 4, or 8, on the performance of our method. The results of this analysis are reported in Table 5B. Increasing the top-k from 4 to 8 resulted in a 1.2% and 2.1% increase in $PCL_{3D}@10mm$ and $PCL_{3D}@20mm$, respectively, a reduction in the average $MPE_{3D}$ by 1.2 mm, and an increase in $AUC_{3D}$ by 0.0110. The top-k value was directly related to the number of regions selected for fine-grained attention computation during the second stage, with a larger k indicating a greater number of features involved in this computation.
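
The role of the top-k value can be illustrated with a minimal sketch of region selection. This is not the BrickFormer implementation: the per-region scores, the helper name `select_top_k_regions`, and the use of a single score per region are hypothetical, and serve only to show how k controls the number of regions passed on to a fine-grained stage.

```python
def select_top_k_regions(coarse_scores, k):
    """Return the indices of the k highest-scoring regions, best first."""
    if not 0 < k <= len(coarse_scores):
        raise ValueError("k must be between 1 and the number of regions")
    order = sorted(
        range(len(coarse_scores)),
        key=lambda i: coarse_scores[i],
        reverse=True,
    )
    return order[:k]


# Hypothetical coarse attention scores for five regions.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]

top_2 = select_top_k_regions(scores, 2)  # regions 1 and 3
top_4 = select_top_k_regions(scores, 4)  # regions 1, 3, 4, and 2
```

Doubling k doubles the number of regions that take part in the second stage, which is consistent with the trade-off between accuracy and computational cost examined in this ablation.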

In our final hyperparameter ablation study, we investigated the influence of the pooling stride $\alpha$. As shown in Table 5C, we tested three configurations: $\alpha = 1$, $\alpha = 2$, and $\alpha = 4$. Optimal performance on the BiSpineX dataset was achieved with $\alpha = 4$. Specifically, when $\alpha$ was set to 2 (compared to $\alpha = 1$), $PCL_{3D}$ increased by 1.8% and 1.5% at the 10 mm and 20 mm thresholds, respectively, $MPE_{3D}$ decreased by 0.25 mm, and $AUC_{3D}$ increased by 0.0117. When $\alpha$ was increased to 4, the performance improved, with $PCL_{3D}$ increasing by 2.2% and 1.1% at the respective thresholds, $MPE_{3D}$ decreasing by 1.08 mm, and $AUC_{3D}$ increasing by 0.0117. This improvement was attributed to the fact that a larger $\alpha$ corresponded to a larger receptive field for the attention computation on the low-resolution features and provided more context for fine-grained attention computation, thereby improving localization accuracy.

4.4.4. Results of Investigating the Sensitivity of Our Method to Prompt Displacement

To assess the sensitivity of our method to prompt displacement, we investigated the effects of shifting the point-like prompt along the x- and the y-axes (ranging from −20 to +20 pixels away from the ground truth center of the top-most vertebral body) in each image. The performance of the proposed method under different point-like prompt inputs was assessed in terms of $PCL_{3D}@20mm$ and $MPE_{3D}$. Furthermore, to compare the performance when different point-like prompts were used, we conducted one-sided Wilcoxon signed-rank tests [64] at a significance level of 0.05. The results of this ablation study are shown in Figure 9. From this figure, one can see that the performance of our method is not sensitive to displacement along the x-axis in either image. In particular, with displacement along the x-axis in either image, the average $MPE_{3D}$ was below 3.00 mm, and the $PCL_{3D}@20mm$ remained above 98.7% for both views. When comparing the results using the ground truth centers with those using the displaced prompts, the maximal change in terms of $MPE_{3D}$ was less than 0.01 mm, and no statistically significant difference was detected (p-value = 0.77 for the LAT view and p-value = 0.42 for the AP view). However, this is not the case for displacement along the y-axis. In particular, when the displacement was constrained to be within 10 pixels of the ground truth center in each image, the maximal change in terms of the average $MPE_{3D}$ increased to 0.06 mm, although no statistically significant difference was detected when comparing the results using the ground truth centers with those using the displaced prompts (p-value = 0.09 for the LAT view and p-value = 0.83 for the AP view). As the displacement along the y-axis increased further to 20 pixels, the maximal change in terms of the average $MPE_{3D}$ increased to 0.14 mm, and the differences between the results using the ground truth centers and those using the displaced prompts were statistically significant (p-values < 0.001 for both views), though the $PCL_{3D}@20mm$ remained above 98.7%.
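
The prompt perturbation protocol can be sketched as follows. The ±20 pixel range comes from the text; the 5-pixel step, the example prompt coordinates, and the function name `displaced_prompts` are assumptions made for illustration, since the sampling density of the displacements is not stated.

```python
def displaced_prompts(center, axis, max_shift=20, step=5):
    """List prompt coordinates shifted along one image axis.

    center: (x, y) ground truth center of the top-most vertebral body, in pixels.
    axis: "x" or "y".
    step: spacing between tested shifts (hypothetical value).
    """
    cx, cy = center
    shifts = range(-max_shift, max_shift + 1, step)
    if axis == "x":
        return [(cx + s, cy) for s in shifts]
    if axis == "y":
        return [(cx, cy + s) for s in shifts]
    raise ValueError("axis must be 'x' or 'y'")


# Hypothetical prompt location in a 512 x 512 image.
prompts_x = displaced_prompts((256, 100), axis="x")
prompts_y = displaced_prompts((256, 100), axis="y")
```

Each displaced prompt would be fed to the network in place of the ground truth center, and the resulting per-vertebra errors compared against the undisplaced baseline with a paired test.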

4.4.5. Analysis of BrickFormer

To obtain a deeper understanding of the dual-attention mechanism in BrickFormer, we conducted an analysis to determine whether the vertebral embeddings effectively capture vertebral context. To this end, we visualized features at various stages of BrickFormer, as depicted in Figure 10.

Given a pair of input images consisting of LAT and AP views, we visualized the context-enriched vertebral embeddings $\tilde{e}$ (the first row) as well as the predicted 2D vertebral heatmaps $\tilde{h}$ (the second row) of a thoracic spine with a fractured vertebra, which was taken from the BiSpineX dataset. Note that while BrickFormer can handle up to 10 vertebral embeddings, only the first five, corresponding to valid vertebral predictions, were visualized here. Each column represents a single predicted vertebra, ordered from superior to inferior. As illustrated in Figure 10, the locations of the vertebrae can be readily estimated from these context-enriched embeddings, demonstrating the effectiveness of the proposed context-enriched representations.

5. Discussions

Estimation of the 3D coordinates of the vertebrae from intraoperative 2D X-ray images is challenging, especially for MISS procedures. This study aimed to develop a novel framework, X2P-Net, that integrated vertebral context to overcome this challenge. Considering the complex and dynamic environment in the operating room, we designed a context-aware 2D/3D vertebra localization framework, which incorporated user interaction in the form of prompts to guide vertebra localization in 3D space. Along with the framework, we introduced BrickFormer, which was a Transformer architecture based on a dual-attention mechanism to delineate each vertebra at low computational cost. The estimated 2D vertebral heatmaps from BrickFormer were fused with multi-view image features to regress 3D vertebral coordinates. Both quantitative and qualitative results demonstrated the effectiveness of the proposed method.

The design of X2P-Net offered several advantages: it leveraged rich vertebral semantics to reduce localization ambiguity and supported a flexible, computationally efficient setup that could be trained end-to-end. This setup ensured high-performance localization in both 2D and 2D/3D scenarios, as demonstrated on synthetic and real spine datasets. Specifically, the prompt-guided FE unit generated masks that extracted the reference vertebral features within specified regions and incorporated positional information for detecting the remaining vertebrae, thereby improving localization performance, as shown in Table 3. The localization accuracy was further improved by incorporating the proposed dual-attention mechanism in BrickFormer, which could automatically distinguish the foreground regions from the background, as shown in Figure 10.

In comparison with the SOTA methods, X2P-Net achieved better results. Specifically, when evaluated on the BiSpineX dataset, X2P-Net attained a $PCL_{3D}@10mm$ of 96.9%, a $PCL_{3D}@20mm$ of 98.8%, an average $MPE_{3D}$ of 2.99 mm, and an $AUC_{3D}$ of 0.9923. In contrast, the second-best method in terms of $MPE_{3D}$ and $AUC_{3D}$ (ALG-Net [47]) achieved an average $MPE_{3D}$ of 3.25 mm and an $AUC_{3D}$ of 0.9846. Qualitative results shown in Figure 6 and Figure 7 also demonstrate the superior performance of the proposed method. On the SheepSpineX dataset, we observed similarly superior performance of X2P-Net over the SOTA methods [21,22,47,57], with a $PCL_{3D}@10mm$ of 98.4%, a $PCL_{3D}@20mm$ of 100.0%, an $MPE_{3D}$ of 1.08 mm, and an $AUC_{3D}$ of 0.9972.

Our method was computationally efficient, with 3.24M parameters and 130.8 GMacs. For an input image of size 512 × 512, the average inference time was 0.1 s, with a GPU memory usage of approximately 3 GB. Moreover, in practical clinical workflows, it was typically unnecessary to perform full 3D vertebral localization on every X-ray image [65]; instead, the localization method could be executed on key frames, thereby further reducing computational and memory demands.

It is worth discussing the limitations of the present study. First, although the BiSpineX dataset included abnormal cases, such as cases with scoliosis or metal implants, and the present method demonstrated superior performance on this dataset, its superiority in more complex clinical scenarios involving more severe abnormal anatomy or more challenging imaging conditions remains to be validated. Second, the method requires manual placement of a point-like prompt in each view. Results (Figure 9) obtained from the ablation study investigating the sensitivity of the proposed method to prompt displacement indicated that the method was robust to moderate prompt displacement from the ground truth center (up to 20 pixels along the x-axis and up to 10 pixels along the y-axis) in either view. Thus, in clinical workflows, such prompts can be provided with mouse clicks or touchscreen taps [66]. Furthermore, several studies [21,22,37,39,45,46] have shown that one can design an end-to-end network for fully automatic vertebra localization from input X-ray images, which may be used to eliminate the manual placement. Third, potential domain shift arising from differences between synthetic data and real clinical images (e.g., acquisition protocols, anatomy) may affect the generalization of the proposed method to clinical data. Nevertheless, our method was validated on both a synthetic dataset (the BiSpineX dataset) and a real dataset (the SheepSpineX dataset). On both datasets, the proposed method demonstrated better results than the competing SOTA methods, indicating its efficacy in 2D/3D landmark localization.

6. Conclusions

In this paper, we introduced X2P-Net, a prompt-guided and context-aware network that estimated the 3D positions of the vertebrae from biplanar X-ray images using a novel BrickFormer architecture. Our network included a prompt-guided FE unit, an SCE unit, and a 3D multi-view feature fusion unit. In addition to visual features, we leveraged vertebral context and positional information from the reference vertebra as indicated by the prompt. We further introduced a generic and novel way to incorporate the anatomical prior information of the spine using a set of learnable vertebral embeddings, which were trained to delineate each vertebral level using BrickFormer. Comprehensive experiments on two datasets demonstrated the superior performance of the proposed method over other SOTA methods. Future work will focus on prospective clinical trials to validate the method in real-world surgical practice. Upon successful validation, X2P-Net could be integrated into MISS systems to support intraoperative guidance, thereby enhancing the safety and quality of spinal surgery.

Author Contributions

Conceptualization, R.T., K.Y., W.Z., D.H. and G.Z.; Methodology, R.T., K.Y., W.Z., D.H. and G.Z.; Software, R.T.; Validation, R.T., K.Y. and W.Z.; Formal analysis, R.T.; Investigation, R.T., K.Y., W.Z., W.S., D.Y., D.H. and G.Z.; Resources, D.H. and G.Z.; Data curation, R.T., K.Y., W.Z., W.S. and D.Y.; Writing—original draft, R.T.; Writing—review and editing, R.T., K.Y., W.Z., W.S., D.Y., D.H. and G.Z.; Visualization, R.T.; Supervision, D.H. and G.Z.; Project administration, D.H. and G.Z.; Funding acquisition, G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The work was partially supported by the National Key R&D Program of China via project 2023YFB4706300 and by the National Natural Science Foundation of China via projects 62471293 and U20A20199.

Institutional Review Board Statement

Ethical review and approval were waived for this study because the nine sheep cervical spines used in this study were obtained from a commercial slaughterhouse, which does not require ethics approval.

Informed Consent Statement

Informed consent was not required for this study as it involved only publicly available, anonymized datasets. The data utilized does not contain personal identifiers, and no interaction with human subjects was conducted.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

Tajsic, T.; Patel, K.; Farmer, R.; Mannion, R.; Trivedi, R. Spinal navigation for minimally invasive thoracic and lumbosacral spine fixation: Implications for radiation exposure, operative time, and accuracy of pedicle screw placement. Eur. Spine J. 2018, 27, 1918–1924.
Maken, P.; Gupta, A. 2D-to-3D: A review for computational 3D image reconstruction from X-ray images. Arch. Comput. Methods Eng. 2023, 30, 85–114.
Unberath, M.; Gao, C.; Hu, Y.; Judish, M.; Taylor, R.H.; Armand, M.; Grupp, R. The impact of machine learning on 2d/3d registration for image-guided interventions: A systematic review and perspective. Front. Robot. AI 2021, 8, 716007.
Drover, D.; MV, R.; Chen, C.H.; Agrawal, A.; Tyagi, A.; Huynh, C.P. Can 3D Pose Be Learned from 2D Projections Alone? In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; European Computer Vision Association: Milan, Italy, 2018; pp. 78–94.
Zhao, Q.; Zheng, C.; Liu, M.; Chen, C. A single 2d pose with context is worth hundreds for 3d human pose estimation. Adv. Neural Inf. Process. Syst. 2024, 36, 27394–27413.
Aubert, B.; Vazquez, C.; Cresson, T.; Parent, S.; de Guise, J.A. Toward automated 3D spine reconstruction from biplanar radiographs using CNN for statistical spine model fitting. IEEE Trans. Med. Imaging 2019, 38, 2796–2806.
Wang, L.; Xu, Q.; Leung, S.; Chung, J.; Chen, B.; Li, S. Accurate automated Cobb angles estimation using multi-view extrapolation net. Med. Image Anal. 2019, 58, 101542.
Kasten, Y.; Doktofsky, D.; Kovler, I. End-to-end convolutional neural network for 3D reconstruction of knee bones from bi-planar X-ray images. In Proceedings of the Machine Learning for Medical Image Reconstruction: Third International Workshop, MLMIR 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, 8 October 2020; Proceedings 3; Springer: Berlin/Heidelberg, Germany, 2020; pp. 123–133.
Huang, Y.; Jones, C.K.; Zhang, X.; Johnston, A.; Waktola, S.; Aygun, N.; Witham, T.; Bydon, A.; Theodore, N.; Helm, P.A.; et al. Multi-perspective region-based CNNs for vertebrae labeling in intraoperative long-length images. Comput. Methods Programs Biomed. 2022, 227, 107222.
Kyung, D.; Jo, K.; Choo, J.; Lee, J.; Choi, E. Perspective projection-based 3d CT reconstruction from biplanar X-rays. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
Cafaro, A.; Spinat, Q.; Leroy, A.; Maury, P.; Munoz, A.; Beldjoudi, G.; Robert, C.; Deutsch, E.; Grégoire, V.; Lepetit, V.; et al. X2Vision: 3D CT Reconstruction from Biplanar X-Rays with Deep Structure Prior. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 699–709.
Ye, K.; Sun, W.; Tao, R.; Zheng, G. A Projective-Geometry-Aware Network for 3D Vertebra Localization in Calibrated Biplanar X-Ray Images. Sensors 2025, 25, 1123.
Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225.
Zheng, G.; Gollmer, S.; Schumann, S.; Dong, X.; Feilkas, T.; Ballester, M.A.G. A 2D/3D correspondence building method for reconstruction of a patient-specific 3D bone surface model using point distribution models and calibrated X-ray images. Med. Image Anal. 2009, 13, 883–899.
Baka, N.; Kaptein, B.L.; de Bruijne, M.; van Walsum, T.; Giphart, J.; Niessen, W.J.; Lelieveldt, B.P. 2D–3D shape reconstruction of the distal femur from stereo X-ray imaging using statistical shape models. Med. Image Anal. 2011, 15, 840–850.
Wu, H.; Zhang, J.; Fang, Y.; Liu, Z.; Wang, N.; Cui, Z.; Shen, D. Multi-view vertebra localization and identification from ct images. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 136–145.
Kim, H.; Lee, K.; Lee, D.; Baek, N. 3D reconstruction of leg bones from X-ray images using CNN-based feature analysis. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC); IEEE: Piscataway, NJ, USA, 2019; pp. 669–672.
Aubert, B.; Vidal, P.; Parent, S.; Cresson, T.; Vazquez, C.; De Guise, J. Convolutional neural network and in-painting techniques for the automatic assessment of scoliotic spine surgery from biplanar radiographs. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, 11–13 September 2017; Proceedings, Part II 20; Springer: Berlin/Heidelberg, Germany, 2017; pp. 691–699.
Bayat, A.; Sekuboyina, A.; Paetzold, J.C.; Payer, C.; Stern, D.; Urschler, M.; Kirschke, J.S.; Menze, B.H. Inferring the 3D standing spine posture from 2D radiographs. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, 4–8 October 2020; Proceedings, Part VI 23; Springer: Berlin/Heidelberg, Germany, 2020; pp. 775–784.
Sekuboyina, A.; Husseini, M.E.; Bayat, A.; Löffler, M.; Liebl, H.; Li, H.; Tetteh, G.; Kukačka, J.; Payer, C.; Štern, D.; et al. VerSe: A vertebrae labelling and segmentation benchmark for multi-detector CT images. Med. Image Anal. 2021, 73, 102166.
Payer, C.; Štern, D.; Bischof, H.; Urschler, M. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med. Image Anal. 2019, 54, 207–219.
Tao, R.; Liu, W.; Zheng, G. Spine-transformers: Vertebra labeling and segmentation in arbitrary field-of-view spine CTs via 3D transformers. Med. Image Anal. 2022, 75, 102258.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
Xie, Z.; Lin, Z.; Sun, E.; Ding, F.; Qi, J.; Zhao, S. Deep learning for automatic vertebra analysis: A methodological survey of recent advances. Comput. Med. Imaging Graph. 2025, 125, 102652.
Klinder, T.; Ostermann, J.; Ehm, M.; Franz, A.; Kneser, R.; Lorenz, C. Automated model-based vertebra detection, identification, and segmentation in CT images. Med. Image Anal. 2009, 13, 471–482.
Schmidt, S.; Kappes, J.; Bergtholdt, M.; Pekar, V.; Dries, S.; Bystrov, D.; Schnörr, C. Spine detection and labeling using a parts-based graphical model. In Proceedings of the Information Processing in Medical Imaging: 20th International Conference, IPMI 2007, Kerkrade, The Netherlands, 2–6 July 2007; Proceedings 20; Springer: Berlin/Heidelberg, Germany, 2007; pp. 122–133.
Glocker, B.; Zikic, D.; Konukoglu, E.; Haynor, D.R.; Criminisi, A. Vertebrae localization in pathological spine CT via dense classification from sparse annotations. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013: 16th International Conference, Nagoya, Japan, 22–26 September 2013; Proceedings, Part II 16; Springer: Berlin/Heidelberg, Germany, 2013; pp. 262–270.
Chen, Y.; Gao, Y.; Li, K.; Zhao, L.; Zhao, J. Vertebrae identification and localization utilizing fully convolutional networks and a hidden Markov model. IEEE Trans. Med. Imaging 2019, 39, 387–399.
Han, Z.; Wei, B.; Mercado, A.; Leung, S.; Li, S. Spine-GAN: Semantic segmentation of multiple spinal structures. Med. Image Anal. 2018, 50, 23–35.
Huang, Z.; Zhao, R.; Leung, F.H.; Banerjee, S.; Lam, K.M.; Zheng, Y.P.; Ling, S.H. Landmark Localization from Medical Images with Generative Distribution Prior. IEEE Trans. Med. Imaging 2024, 43, 2679–2692.
Ye, K.; Zou, X.; Sun, W.; Zheng, G. Semi-GDE: Generative distribution estimation for semi-supervised medical landmark localization. Neurocomputing 2025, 652, 131095.
Yang, Y.; Wang, Y.; Liu, T.; Wang, M.; Sun, M.; Song, S.; Fan, W.; Huang, G. Anatomical prior-based vertebral landmark detection for spinal disorder diagnosis. Artif. Intell. Med. 2025, 159, 103011. [Google Scholar] [CrossRef]
Chen, H.; Shen, C.; Qin, J.; Ni, D.; Shi, L.; Cheng, J.C.; Heng, P.A. Automatic localization and identification of vertebrae in spine CT via a joint learning model with deep neural networks. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part I 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 515–522. [Google Scholar]
Wang, F.; Zheng, K.; Lu, L.; Xiao, J.; Wu, M.; Miao, S. Automatic vertebra localization and identification in CT by spine rectification and anatomically-constrained optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 5280–5288. [Google Scholar]
Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef]
Bürgin, V.; Prevost, R.; Stollenga, M.F. Robust vertebra identification using simultaneous node and edge predicting graph neural networks. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2023; pp. 483–493. [Google Scholar]
Xiang, S.; Zhang, L.; Wang, Y.; Zhou, S.; Zhao, X.; Zhang, T.; Li, S. VLD-Net: Localization and Detection of the Vertebrae from X-ray Images by Reinforcement Learning with Adaptive Exploration Mechanism and Spine Anatomy Information. IEEE J. Biomed. Health Inform. 2025, 29, 4969–4980. [Google Scholar] [CrossRef] [PubMed]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Zhang, Y.; Ji, X.; Liu, W.; Li, Z.; Zhang, J.; Liu, S.; Zhong, W.; Hu, L.; Li, W. A spine segmentation method under an arbitrary field of view based on 3d swin transformer. Int. J. Intell. Syst. 2023, 2023, 8686471. [Google Scholar] [CrossRef]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar]
Huang, Y.; Jones, C.K.; Zhang, X.; Johnston, A.; Aygun, N.; Witham, T.; Helm, P.A.; Siewerdsen, J.H.; Uneri, A. Automatic labeling of vertebrae in long-length intraoperative imaging with a multi-view, region-based CNN. In Proceedings of the Medical Imaging 2022: Image-Guided Procedures, Robotic Interventions, and Modeling; SPIE: San Francisco, CA, USA, 2022; Volume 12034, pp. 180–185. [Google Scholar]
Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable triangulation of human pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2019; pp. 7718–7727. [Google Scholar]
Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 11656–11665. [Google Scholar]
Dong, J.; Jiang, W.; Huang, Q.; Bao, H.; Zhou, X. Fast and robust multi-person 3d pose estimation from multiple views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 7792–7801. [Google Scholar]
Bridgeman, L.; Volino, M.; Guillemaut, J.Y.; Hilton, A. Multi-Person 3D Pose Estimation and Tracking in Sports. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: Piscataway, NJ, USA, 2019; pp. 2487–2496. [Google Scholar]
Ju, F.; Wang, Y.; Zhao, J.; Dong, M. Multiview 2D/3D image registration in minimally invasive pelvic surgery navigation. Sci. Rep. 2025, 15, 26183. [Google Scholar] [CrossRef] [PubMed]
Lin, J.; Lee, G.H. Multi-view multi-person 3d pose estimation with plane sweep stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 11886–11895. [Google Scholar]
Wu, S.; Jin, S.; Liu, W.; Bai, L.; Qian, C.; Liu, D.; Ouyang, W. Graph-based 3d multi-person pose estimation using multi-view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 11148–11157. [Google Scholar]
Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part V 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 561–578. [Google Scholar]
Tome, D.; Toso, M.; Agapito, L.; Russell, C. Rethinking pose in 3d: Multi-stage refinement and recovery for markerless motion capture. In Proceedings of the 2018 International Conference on 3D Vision (3DV); IEEE: Piscataway, NJ, USA, 2018; pp. 474–483. [Google Scholar]
Tu, H.; Wang, C.; Zeng, W. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16; Springer: Berlin/Heidelberg, Germany, 2020; pp. 197–212. [Google Scholar]
Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. Adafuse: Adaptive multiview fusion for accurate human pose estimation in the wild. Int. J. Comput. Vis. 2021, 129, 703–718. [Google Scholar] [CrossRef]
Ye, H.; Zhu, W.; Wang, C.; Wu, R.; Wang, Y. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 142–159. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493. [Google Scholar] [CrossRef]
Russakoff, D.B.; Rohlfing, T.; Mori, K.; Rueckert, D.; Ho, A.; Adler, J.R.; Maurer, C.R. Fast generation of digitally reconstructed radiographs using attenuation fields with application to 2D-3D image registration. IEEE Trans. Med. Imaging 2005, 24, 1441–1454. [Google Scholar] [CrossRef] [PubMed]
Guo, X.; Xu, S.; Lin, X.; Sun, Y.; Ma, X. 3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space. Pattern Anal. Appl. 2022, 25, 157–167. [Google Scholar] [CrossRef]
Zaheer, M.; Guruganesh, G.; Dubey, K.A.; Ainslie, J.; Alberti, C.; Ontanon, S.; Pham, P.; Ravula, A.; Wang, Q.; Yang, L.; et al. Big bird: Transformers for longer sequences. Adv. Neural Inf. Process. Syst. 2020, 33, 17283–17297. [Google Scholar]
Wilcoxon, F. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution; Springer: Berlin/Heidelberg, Germany, 1992; pp. 196–202. [Google Scholar]
Brost, A.; Liao, R.; Strobel, N.; Hornegger, J. Respiratory motion compensation by model-based catheter tracking during EP procedures. Med. Image Anal. 2010, 14, 695–706. [Google Scholar] [CrossRef]
Niu, K.; Tao, Z.; Cheng, L.; Wei, Z.; Kang, H.; Wei, T.; Huang, B.; Xu, F.; Xiong, C. Comprehensive workflow with optical navigation in minimally invasive transforaminal lumbar interbody fusion: A retrospective study. J. Orthop. Surg. Res. 2025, 20, 694. [Google Scholar] [CrossRef]

Figure 1. A schematic illustration of the challenges in 2D/3D vertebra localization, where we show four examples, each including the CT volume along with the LAT and AP views of the radiographs. Ground truth 3D vertebral locations are initially annotated on the CT volume and then projected onto 2D image planes. For the first, second, and third examples, we utilize digitally reconstructed radiographs (DRRs) derived from the VerSe dataset [20]; for the fourth example, we show real biplanar X-ray images of the spine of a sheep, captured by a C-arm imaging system. The boxes highlight the common challenges in 2D/3D vertebra localization, including inconsistencies between 2D and 3D annotations (a,c,d), superimposition (b), spinal deformities (e), metal implants (f,g), and view-angle disparities (h).

Figure 2. A schematic overview of the proposed framework, consisting of (A) a 2D visual feature extraction (VFE) unit; (B) a prompt-guided feature enhancement (FE) unit; (C) a semantic context extraction (SCE) unit; and (D) a 3D multi-view feature fusion unit.

Figure 3. A schematic overview of the prompt-guided FE unit. Using a vanilla Transformer, the image features $f_{i}$ are enhanced with the prompt features $f_{p}$.

Figure 4. A schematic overview of the SCE unit. Learnable vertebral embeddings capture vertebral context from the masked features through BrickFormer, which automatically delineates the foreground regions.

Figure 5. Experimental setup for C-arm image acquisition of a sheep cervical spine. (A) Data acquisition of the sheep spine. (B) A schematic illustration of the experimental setup where both LAT and AP images were acquired. The 2D landmark ground truth was generated by projecting the 3D landmark ground truth into the 2D image space.
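
The projection step mentioned in the Figure 5 caption can be illustrated with a minimal sketch. It assumes an ideal pinhole model with one 3 × 4 projection matrix per view; the focal length, principal point, source distance, and the names `project_points`, `P_ap`, and `P_lat` are hypothetical placeholders, not the calibrated C-arm geometry used in the experiments.

```python
import numpy as np


def project_points(points_3d, projection):
    """Project N x 3 world points to N x 2 pixel coordinates.

    `projection` is a 3 x 4 matrix K @ [R | t] (pinhole model, assumed).
    """
    points_3d = np.asarray(points_3d, dtype=float)
    ones = np.ones((points_3d.shape[0], 1))
    homogeneous = np.hstack([points_3d, ones])   # N x 4
    projected = homogeneous @ projection.T       # N x 3
    return projected[:, :2] / projected[:, 2:3]  # perspective divide


# Hypothetical intrinsics: focal length in pixels and principal point.
K = np.array([[1000.0, 0.0, 256.0],
              [0.0, 1000.0, 256.0],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: the AP view looks along the world z-axis,
# the LAT view is rotated by 90 degrees about the world y-axis.
t = np.array([[0.0], [0.0], [1000.0]])  # object 1000 mm from the source
R_ap = np.eye(3)
R_lat = np.array([[0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0],
                  [-1.0, 0.0, 0.0]])

P_ap = K @ np.hstack([R_ap, t])
P_lat = K @ np.hstack([R_lat, t])

# Two 3D landmarks in millimetres: the origin and a point shifted along x.
landmarks_3d = np.array([[0.0, 0.0, 0.0],
                         [10.0, 0.0, 0.0]])

landmarks_ap = project_points(landmarks_3d, P_ap)
landmarks_lat = project_points(landmarks_3d, P_lat)
print(landmarks_ap)
print(landmarks_lat)
```

In this toy geometry, a 10 mm shift along the x-axis moves the AP projection by 10 pixels but leaves the LAT projection unchanged, which illustrates why a single view cannot resolve depth.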

Figure 6. Comparisons of $PCL$ curves from different methods on (A) the BiSpineX dataset and (B) the SheepSpineX dataset.

Figure 7. Qualitative comparison of 3D vertebra localization performance across various methods on the BiSpineX dataset. (A) One example from the test set of the BiSpineX dataset, showing the CT volume together with the LAT and AP views of the radiographs. Ground truth 3D vertebral locations are annotated on the CT volume. (B) 3D vertebra localization results from different methods.

Figure 8. Per-level performance of our method on the BiSpineX dataset measured by ${PCL}_{3 D} @ 10 mm$ and ${MPE}_{3 D}$. The x-axis denotes vertebral levels from the second cervical vertebra to the sixth lumbar vertebra (C2-L6), where letters C, T, and L refer to the cervical, thoracic, and lumbar spine, respectively.

Figure 9. Results of the ablation study evaluating the effect of prompt displacement in terms of ${PCL}_{3 D} @ 20 mm$ and ${MPE}_{3 D}$. The zero point in each sub-figure indicates using the ground truth center as the point-like prompt. (A) Results of shifting the prompt along the x-axis in the LAT view. (B) Results of shifting the prompt along the y-axis in the LAT view. (C) Results of shifting the prompt along the x-axis in the AP view. (D) Results of shifting the prompt along the y-axis in the AP view.

Figure 10. Visualization of features learned from different stages of BrickFormer. Context-enriched vertebral embeddings (the first row) and the predicted 2D vertebral heatmaps (the second row) learned from both the LAT and the AP images of a thoracic spine with fractured vertebrae are displayed. The thoracic spine is taken from the BiSpineX dataset.

Table 1. Comparisons of 2D and 3D vertebra localization on the BiSpineX dataset with other SOTA methods. ↑: higher value indicates better results. ↓: lower value indicates better results. Pix: pixels. The best results are displayed in bold font.

2D localization (LAT view)
Methods	${PCL}_{2 D} @ 10 pix (%)$ ↑	${PCL}_{2 D} @ 20 pix (%)$ ↑	${MPE}_{2 D} (pix)$ ↓	${AUC}_{2 D}$ ↑
SCN-Net [21]	93.8	96.9	3.78	0.9735
Spine-Trans [22]	93.2	95.9	3.87	0.9689
AdaFuse [57]	91.1	95.8	4.61	0.9609
ALG-Net [47]	92.5	96.5	4.43	0.9716
VOL-Net [47]	92.3	96.5	4.33	0.9703
Ours	88.2	96.6	5.84	0.9701
2D localization (AP view)
Methods	${PCL}_{2 D} @ 10 pix (%)$ ↑	${PCL}_{2 D} @ 20 pix (%)$ ↑	${MPE}_{2 D} (pix)$ ↓	${AUC}_{2 D}$ ↑
SCN-Net [21]	90.0	96.8	4.96	0.9682
Spine-Trans [22]	88.1	96.5	5.26	0.9667
AdaFuse [57]	87.6	94.8	5.66	0.9586
ALG-Net [47]	88.6	95.9	5.39	0.9607
VOL-Net [47]	88.2	95.4	5.44	0.9584
Ours	85.2	96.0	6.14	0.9681
3D localization
Methods	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑
SCN-Net [21]	90.5	92.6	8.94	0.9274
Spine-Trans [22]	87.5	91.5	9.21	0.9166
AdaFuse [57]	95.8	97.9	3.95	0.9826
ALG-Net [47]	95.7	98.3	3.25	0.9846
VOL-Net [47]	95.4	98.0	3.49	0.9827
Ours	96.9	98.8	2.99	0.9923
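
As a reading aid for the threshold-based columns, the sketch below computes a mean position error and a percentage of landmarks within a distance threshold from paired predictions and ground truth. It assumes that PCL counts landmarks whose Euclidean error is at or below the threshold and that MPE is the mean Euclidean error; the function name `localization_metrics` and the toy coordinates are hypothetical, and the AUC column is not reproduced because its definition lies outside this section.

```python
import numpy as np


def localization_metrics(pred, gt, thresholds=(10.0, 20.0)):
    """Return (MPE, {threshold: PCL in percent}) for paired landmarks.

    Assumed definitions: MPE is the mean Euclidean distance, and PCL at a
    threshold is the share of landmarks with error <= that threshold.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=1)  # one error per landmark
    mpe = float(errors.mean())
    pcl = {t: float((errors <= t).mean() * 100.0) for t in thresholds}
    return mpe, pcl


# Toy ground-truth landmarks in millimetres (illustrative values only).
gt = np.array([[0.0, 0.0, 0.0],
               [0.0, 30.0, 0.0],
               [0.0, 60.0, 0.0],
               [0.0, 90.0, 0.0]])

# Predictions with Euclidean errors of 3, 4, 12, and 25 mm.
offsets = np.array([[3.0, 0.0, 0.0],
                    [0.0, 4.0, 0.0],
                    [0.0, 0.0, 12.0],
                    [15.0, 20.0, 0.0]])
pred = gt + offsets

mpe, pcl = localization_metrics(pred, gt)
print(f"MPE = {mpe:.2f} mm")
print(f"PCL@10mm = {pcl[10.0]:.1f}%, PCL@20mm = {pcl[20.0]:.1f}%")
```

The same function applies to the 2D columns when the inputs are pixel coordinates and the thresholds are given in pixels.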

Table 2. Comparisons of 3D vertebra localization with other SOTA methods on the SheepSpineX dataset. ↑: higher value indicates better results. ↓: lower value indicates better results. The best results are displayed in bold font.

3D localization
Method	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑
SCN-Net [21]	95.2	97.2	3.71	0.9854
Spine-Trans [22]	93.5	97.5	4.42	0.9803
AdaFuse [57]	97.2	99.7	2.41	0.9944
ALG-Net [47]	96.5	100.0	1.56	0.9948
VOL-Net [47]	96.3	99.8	1.63	0.9939
Ours	98.4	100.0	1.08	0.9972

Table 3. Results of the ablation study investigating the effectiveness of key components in our method. ↑: higher value indicates better results. ↓: lower value indicates better results. The best results are displayed in bold font.

Components	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑	Params ↓	FLOPs ↓
No Prompt	90.2	93.1	6.27	0.9653	3.04 M	129.4 GMac
No SCE	92.3	94.8	5.95	0.9657	3.00 M	108.1 GMac
No Fusion	92.8	96.5	5.90	0.9822	3.22 M	123.6 GMac
Ours	96.9	98.8	2.99	0.9923	3.24 M	130.8 GMac

Table 4. Results of investigating the influence of different attention mechanisms on the performance of the proposed method. ↑: higher value indicates better results. ↓: lower value indicates better results. The best results are displayed in bold font.

Methods	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑	Params ↓	FLOPs ↓
Vanilla attention [23]	94.7	97.2	4.13	0.9794	3.24 M	109.2 GMac
Sparse attention [63]	93.2	98.1	4.72	0.9870	3.24 M	111.2 GMac
Ours	96.9	98.8	2.99	0.9923	3.24 M	130.8 GMac

Table 5. Results of the ablation study investigating the impact of different hyperparameters. ↑: higher value indicates better results. ↓: lower value indicates better results. The best results are displayed in bold font.

A. Impact of spatial dimensions of vertebral embeddings.
Dimensions	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑	Params ↓	FLOPs ↓
$4 \times 4$	92.4	95.1	5.92	0.9661	3.08 M	109.8 GMac
$8 \times 8$	93.2	96.7	4.19	0.9731	3.11 M	114.0 GMac
$16 \times 16$	96.9	98.8	2.99	0.9923	3.24 M	130.8 GMac
B. Impact of top-k value.
Top-k	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑	Params ↓	FLOPs ↓
2	93.1	95.4	5.63	0.9665	3.24 M	114.7 GMac
4	95.7	97.9	3.59	0.9813	3.24 M	120.0 GMac
8	96.9	98.8	2.99	0.9923	3.24 M	130.8 GMac
C. Impact of pooling stride $α$.
$α$	${PCL}_{3 D} @ 10 mm (%)$ ↑	${PCL}_{3 D} @ 20 mm (%)$ ↑	${MPE}_{3 D} (mm)$ ↓	${AUC}_{3 D}$ ↑	Params ↓	FLOPs ↓
1	92.9	96.2	4.32	0.9689	3.24 M	110.8 GMac
2	94.7	97.7	4.07	0.9806	3.24 M	114.7 GMac
4	96.9	98.8	2.99	0.9923	3.24 M	130.8 GMac

Share and Cite

MDPI and ACS Style

Tao, R.; Ye, K.; Zhang, W.; Sun, W.; Yu, D.; Hang, D.; Zheng, G. X2P-Net: Context-Aware 2D/3D Vertebra Localization. Bioengineering 2026, 13, 178. https://doi.org/10.3390/bioengineering13020178
