1. Introduction
Low back pain represents one of the most prevalent musculoskeletal complaints worldwide. In Japan, the burden is particularly high: the National Health and Nutrition Survey consistently identifies back pain as the leading symptom-based complaint, with prevalence rates of 91.6 per 1000 men and 111.9 per 1000 women [1]. This substantial disease burden underscores the critical importance of accurate and efficient diagnostic imaging techniques to identify treatable etiologies. Among the most clinically significant causes of back pain in young individuals are lumbar spondylolysis and its sequela, spondylolisthesis.
Spondylolysis is a stress fracture of the pars interarticularis of the lumbar vertebra, occurring in approximately 5–7% of the general Japanese population based on CT examination [2]. Its prevalence is markedly higher among adolescent athletes experiencing back pain. The condition exhibits a notable sex disparity, with males showing higher incidence rates than females [3]. Vertebral level distribution is non-uniform: L5 is affected in approximately 90% of cases, followed by L4. The natural history of spondylolysis is consequential: L5 spondylolysis progresses to spondylolisthesis in approximately 30% of cases, while L4 shows a 90% progression rate, emphasizing the critical need for early and accurate diagnosis [4]. Without appropriate management, progressive vertebral slippage can result in radiculopathy, neurogenic claudication, and significant disability.
The diagnostic gold standard for detecting lumbar spondylolysis remains the oblique radiograph, which reveals the characteristic “Scotty dog sign”—a disruption in cortical continuity at the pars interarticularis that appears as a collar on the neck of the Scotty dog silhouette formed by the vertebral elements. Clinically recommended oblique projection angles range from 30° to 45°, with specific institutional protocols suggesting 35°, 40°, or 45° projections depending on patient anatomy. Despite the widespread use of CT and MRI for definitive diagnosis, plain oblique radiographs remain an indispensable first-line screening tool, particularly in outpatient settings and facilities where advanced cross-sectional imaging is not immediately available.
Achieving and verifying the appropriate projection angle in oblique radiography presents several challenges. First, the lack of standardization across institutions leads to inconsistent image quality. Second, subjective assessment of image adequacy frequently results in repeat examinations, increasing patient radiation exposure, examination costs, and workflow inefficiencies. Third, even when radiographers intend to position a patient at a target angle, the actual anatomical projection angle may differ from the intended value due to variability in body habitus, table positioning, and tube angulation. The accurate verification of projection angles is therefore a clinically meaningful problem that has, until recently, lacked an automated, objective solution. Beyond rejecting inadequate images, automated angle knowledge enables several downstream applications: real-time acquisition feedback allowing immediate repositioning before the patient leaves the table; automatic annotation of the verified projection angle in PACS records, which supports consistent radiologist interpretation and medico-legal documentation; retrospective institutional audit of positioning consistency; and geometric correction or normalization of diagnostic measurements—such as slip percentage in spondylolisthesis grading—that depend on a known projection geometry [5]. In surgical planning contexts, knowing the precise oblique angle from preoperative radiographs allows the surgeon to account for the projection geometry when estimating pars defect dimensions or screw trajectory [6].
Artificial intelligence (AI) and deep learning have demonstrated transformative potential in medical imaging, with convolutional neural networks (CNNs) achieving human-level performance across a wide range of clinical decision- and workflow-support tasks [7,8,9,10,11]. The YOLOX anchor-free object detector has been particularly influential, offering real-time detection with strong performance on multi-scale targets [12]. Within spinal imaging specifically, deep learning methods have been applied to vertebral body detection and segmentation [13], spondylolisthesis classification [14], and the automated assessment of degenerative spinal conditions from plain radiographs [15]. These prior works collectively demonstrate that anatomical structures in spine radiographs can be reliably localized using modern object detection frameworks, even in the presence of complex overlapping anatomy.
A key insight motivating the present work is the predictable geometric relationship between vertebral bodies and pedicles as a function of projection angle. As the oblique angle changes, the relative horizontal offset of the pedicle with respect to the vertebral body changes systematically. This change can be quantified using the Vertebral–Pedicle Ratio (VPR), a normalized geometric index derived from bounding box coordinates detected by an object detection model. If the VPR can be reliably estimated from the detected anatomy, projection angle can be estimated through linear regression, enabling fully automatic, quantitative quality control of lumbar oblique radiography.
The use of synthetic X-ray images generated from CT data provides a solution to the training data problem. Digitally reconstructed radiographs (DRRs) produced from CT volumes by ray-sum projection at controlled angles yield large annotated datasets with precisely known ground-truth angles [16]. Several studies have demonstrated that models trained on synthetic spine radiographs can transfer effectively to clinical images [17], and the approach is now recognized as a general strategy for data augmentation and model development in projection radiography.
Initial experiments indicated that simultaneous detection of three anatomical classes—vertebral region, vertebral body, and pedicle—from the full radiograph yielded inconsistent pedicle localization, particularly at the smaller scales at which pedicles appear in whole-image views. This motivated the design of a two-stage strategy in which a dedicated pedicle detector operates within a pre-localized vertebral body crop, reducing the detection search space and increasing the relative size of the target structure. Model1 was therefore retained both as a standalone comparison baseline and as the first-stage detector within the Model2 pipeline. The purpose of this study was to develop and systematically evaluate a deep learning-based framework for automated projection angle estimation in lumbar oblique radiography using a Vertebral–Pedicle Ratio (VPR) approach. Two object detection pipelines were compared: a single-stage three-class detector (Model1) that simultaneously identifies the L2–L4 vertebral region, vertebral bodies, and pedicles from the full radiograph; and a two-stage detector (Model2) in which vertebral body localization by Model1 is followed by dedicated single-class pedicle detection within the cropped vertebral body region. The impact of these contrasting detection strategies on both object detection accuracy and downstream angle estimation error was assessed through five-fold cross-validation.
Novelty and contributions: (i) We introduce a geometry-informed Vertebral–Pedicle Ratio (VPR) computed from detector outputs to estimate the lumbar oblique projection angle via regression; (ii) we propose a two-stage detection pipeline that improves small-structure (pedicle) localization by operating within vertebral-body crops; (iii) we leverage CT-derived digitally reconstructed radiographs (DRRs) with precisely known projection angles for systematic five-fold cross-validation; and (iv) we discuss practical deployment scenarios, including near-real-time acquisition feedback versus post-acquisition quality control.
2. Materials and Methods
2.1. System Configuration and Development Environment
All model development, training, and evaluation were performed on a high-performance workstation equipped with an Intel Core i9-10980XE processor (3.0 GHz), dual NVIDIA RTX A6000 GPUs (48 GB each), and 64 GB DDR4 RAM. Algorithm development and implementation were conducted in MATLAB 2024a (MathWorks Inc., Natick, MA, USA) using the Deep Learning Toolbox and Computer Vision Toolbox. An overview of the entire study pipeline, from data preparation through angle estimation output, is provided in Figure 1.
2.2. Training Data and Target Scope
Training data consisted of JPEG images paired with identically named text annotation files recording class labels and bounding box coordinates in YOLO format (normalized center x, center y, width, height per class). The three annotation classes used for Model1 were: (i) L2–L4 vertebral region as a single encompassing bounding box, (ii) individual vertebral bodies, and (iii) individual pedicles. Each image was accompanied by its projection angle label, which was assigned automatically from the filename suffix (e.g., _n20 indicating a negative-angle projection of −20°, and _p45 indicating a positive-angle projection of +45°), where the prefix “n” denotes a third-oblique (negative-angle) projection and “p” denotes a fourth-oblique (positive-angle) projection. Angle estimation analysis was restricted to images with absolute projection angles between 20° and 60°, corresponding to the clinically recommended range for lumbar oblique radiography.
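For illustration, the filename-suffix convention above can be parsed as follows. This is a minimal Python sketch (the study implementation was in MATLAB); the function names are ours, not from the original code.

```python
import re

def angle_from_filename(name):
    """Parse the projection-angle suffix from an image filename:
    '_n20' -> ('n', -20), a third-oblique (negative) projection;
    '_p45' -> ('p', +45), a fourth-oblique (positive) projection."""
    m = re.search(r"_([np])(\d+)", name)
    if m is None:
        raise ValueError(f"no angle suffix in {name!r}")
    group, deg = m.group(1), int(m.group(2))
    return group, -deg if group == "n" else deg

def in_analysis_range(angle_deg, lo=20, hi=60):
    """Restrict analysis to the clinically recommended 20-60 degree range."""
    return lo <= abs(angle_deg) <= hi
```

The group label ("n" or "p") is reused later to select the group-specific regression coefficients.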
2.3. Dataset Acquisition and Synthetic Image Generation
2.3.1. CT Data Source
Synthetic X-ray images were generated from publicly available CT datasets obtained from The Cancer Imaging Archive (TCIA), focusing on scans with clear lumbar vertebral anatomy. From this repository, 100 high-quality CT datasets were selected. Synthetic radiograph generation followed a ray-sum projection protocol: starting from a patient-specific reference angle defined by the line connecting the spinal canal center and the spinous process, oblique projections were generated at 5° increments from 20° to 60° (nine angles per patient), yielding 900 synthetic radiographs with precisely known ground-truth projection angles. Post-processing steps included window-level optimization to replicate typical radiographic appearance.
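The ray-sum projection can be sketched as follows in simplified parallel-beam form: each axial CT slice is rotated in-plane by the oblique angle and the attenuation values are summed along the ray direction. This Python sketch uses nearest-neighbour rotation for brevity (the actual pipeline was implemented in MATLAB and included window-level post-processing, omitted here); all names are illustrative.

```python
import numpy as np

def rotate_slice(slice2d, angle_deg):
    """Nearest-neighbour rotation of one axial slice about its centre
    (a minimal stand-in for a proper interpolating rotation)."""
    h, w = slice2d.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    th = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: source coordinate sampled for each output pixel.
    sx = np.cos(th) * (xs - cx) + np.sin(th) * (ys - cy) + cx
    sy = -np.sin(th) * (xs - cx) + np.cos(th) * (ys - cy) + cy
    sxi, syi = np.round(sx).astype(int), np.round(sy).astype(int)
    ok = (sxi >= 0) & (sxi < w) & (syi >= 0) & (syi < h)
    out = np.zeros_like(slice2d)
    out[ok] = slice2d[syi[ok], sxi[ok]]
    return out

def ray_sum_drr(ct_volume, angle_deg):
    """Parallel-beam ray-sum projection: rotate every axial slice of the
    (z, y, x) volume by the oblique angle, then integrate along x."""
    return np.stack([rotate_slice(s, angle_deg).sum(axis=1) for s in ct_volume])

def generate_series(ct_volume, reference_deg=0.0):
    """Nine oblique projections per patient: 20-60 deg in 5 deg steps,
    offset by the patient-specific reference angle."""
    return {a: ray_sum_drr(ct_volume, reference_deg + a) for a in range(20, 61, 5)}
```

With 100 patients and nine angles each, this scheme yields the 900 synthetic radiographs described above, each with an exactly known ground-truth angle.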
2.3.2. Data Augmentation
To enhance model robustness and prevent overfitting, a comprehensive augmentation pipeline was applied to all training images before each epoch. Augmentation operations included: horizontal flipping, rotation at discrete angles between −15° and +15°, and intensity variation at multiple brightness levels. Each augmentation was applied independently, generating a substantially larger effective training dataset for both the whole-image detection (Model1) and the vertebra-specific pedicle detection (Model2) tasks. The augmentation parameters were kept identical for both models to ensure a fair comparison.
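Two of these operations are worth noting in code because they must also transform the annotations: a horizontal flip mirrors the normalized YOLO box centers, and brightness scaling must be clamped to the valid intensity range. The following Python sketch illustrates both (function names are ours; the study used MATLAB, and rotation is omitted here for brevity).

```python
def hflip_with_boxes(image, yolo_boxes):
    """Horizontally flip an image (list of pixel rows) together with its
    YOLO-format boxes (normalized cx, cy, w, h): the flipped cx' = 1 - cx."""
    flipped = [list(row[::-1]) for row in image]
    boxes = [(1.0 - cx, cy, w, h) for cx, cy, w, h in yolo_boxes]
    return flipped, boxes

def adjust_brightness(image, factor):
    """Intensity variation: scale pixel values and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, v * factor)) for v in row] for row in image]
```

Because each augmentation is applied independently, every source image contributes one variant per flip, per rotation angle, and per brightness factor.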
2.4. Object Detection Model Architecture (Model1 and Model2)
Both models employed the YOLOX anchor-free detection architecture [12], which is schematically illustrated in Figure 2. YOLOX adopts a CSPDarkNet53 backbone to extract multi-scale feature maps, a Feature Pyramid Network (FPN) combined with a Path Aggregation Network (PAN) neck to fuse features across scales, and a decoupled detection head that separately handles classification and bounding box regression tasks at each of three output scales (P3/8×, P4/16×, P5/32×). Unlike anchor-based predecessors, YOLOX uses an anchor-free prediction strategy whereby each feature map grid point directly predicts an object’s bounding box center and dimensions without reference to pre-defined anchor sizes. Label assignment during training is performed by the SimOTA strategy, which globally optimizes the matching between predicted boxes and ground-truth boxes based on classification and regression costs. Non-maximum suppression (NMS) is applied as a post-processing step to remove duplicate detections.
Model1 is a three-class detector that receives whole radiographs and simultaneously detects L2–L4 vertebral region, vertebral body, and pedicle. Model2 is a single-class pedicle detector applied in a two-stage pipeline: Model1 first detects the vertebral body region; this region is then cropped and input to Model2, which detects pedicles exclusively within the crop. Model2 thus operates on a substantially smaller and more homogeneous search space than Model1, which is the principal mechanism underlying its performance advantage. Both models were trained with the following hyperparameters: 10 epochs, learning rate 0.0001, batch size 128, and momentum 0.9, with the same augmentation strategy applied to both.
2.5. Detection Performance Evaluation
Detection performance was assessed using 5-fold cross-validation applied uniformly to both models. Two complementary metrics were computed. First, Average Precision at an IoU threshold of 0.5 (AP@0.5) was used as the primary detection quality metric, following standard object detection benchmarking practice. Second, a bounding-box-unit Dice Similarity Coefficient (DSC) was computed at the level of individual matched bounding box pairs: DSC = 2|A ∩ B|/(|A| + |B|), where A and B are the areas of a predicted and ground-truth bounding box, respectively. Predicted and ground-truth boxes were matched by greedy IoU-based assignment. Per-fold, per-class DSC statistics (mean, median, SD, IQR) were computed across all matched pairs. For Model1 (three classes), all per-class metrics were averaged with equal weights to yield macro-average values. For Model2 (single class), the macro-average is identical to the class-level result.
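The bounding-box-unit DSC and the greedy IoU-based matching described above can be sketched as follows; this is an illustrative Python implementation (the study used MATLAB), with boxes given as corner coordinates (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def box_dice(a, b):
    """Bounding-box-unit Dice: 2|A intersect B| / (|A| + |B|)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return 2.0 * inter / (area(a) + area(b))

def greedy_match(preds, gts, iou_thr=0.5):
    """Greedy IoU-based assignment: repeatedly take the highest-IoU
    (prediction, ground-truth) pair above threshold, removing both."""
    pairs = sorted(((iou(p, g), i, j) for i, p in enumerate(preds)
                    for j, g in enumerate(gts)), reverse=True)
    used_p, used_g, matches = set(), set(), []
    for v, i, j in pairs:
        if v < iou_thr:
            break
        if i not in used_p and j not in used_g:
            matches.append((i, j))
            used_p.add(i)
            used_g.add(j)
    return matches
```

Per-fold, per-class DSC statistics are then accumulated over the pairs returned by `greedy_match`.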
2.6. Evaluation Metrics
The Vertebral–Pedicle Ratio (VPR) is a normalized geometric index that captures the horizontal offset of the pedicle relative to the corresponding vertebral body as observed in the oblique projection, as illustrated in Figure 3B. It is formally defined as:

VPR = (x_pd − x_vb) / w_vb,

where x_pd is the left-edge x-coordinate of the pedicle bounding box in full-image pixel space, x_vb is the left-edge x-coordinate of the corresponding vertebral body bounding box, and w_vb is the width of the vertebral body bounding box. Normalization by w_vb makes the VPR invariant to image resolution and to the absolute size of the vertebral body, which varies across patients and levels. As the projection angle increases within the n-group or p-group range, the pedicle shifts progressively relative to the vertebral body in a predictable direction, yielding a monotonic and near-linear VPR–angle relationship.
Because each radiograph was expected to contain vertebral levels L2, L3, and L4, up to three VPR measurements were available per image. The image-level VPR was computed as the arithmetic mean of the level-wise VPRs. Individual vertebral body candidates were accepted if their bounding box center fell within the L2–L4 region detected by Model1; pedicle candidates were accepted if their bounding box center fell within the accepted vertebral body bounding box. Images in which all three vertebral levels (L2, L3, L4) could not be successfully assigned were excluded from the angle estimation analysis.
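The level-wise and image-level VPR computation reduces to a few lines. The following Python sketch (illustrative; the study implementation was in MATLAB) uses boxes in (x_left, y_top, width, height) form and returns None for images that fail the three-level requirement.

```python
def vpr(pedicle_box, vertebral_box):
    """VPR = (x_pd - x_vb) / w_vb, with boxes as (x_left, y_top, w, h)
    in full-image pixel coordinates."""
    x_pd = pedicle_box[0]
    x_vb, w_vb = vertebral_box[0], vertebral_box[2]
    return (x_pd - x_vb) / w_vb

def image_vpr(level_pairs):
    """Image-level VPR: arithmetic mean of level-wise VPRs at L2, L3, L4.
    Returns None (image excluded) when fewer than three levels matched."""
    if len(level_pairs) < 3:
        return None
    return sum(vpr(pd, vb) for pd, vb in level_pairs) / len(level_pairs)
```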
2.7. Inference Protocol
The complete inference protocol, illustrated schematically in Figure 3A, proceeded as follows for each test image.
Step 1 (Image Loading and Preprocessing): The input image was loaded from disk, resized to the model input dimensions, and pixel values were normalized to the range required by the YOLOX architecture. For the Model2 pipeline, this step also generated the cropped vertebral body region (see Step 3 below).
Step 2 (Forward Pass through YOLOX): The preprocessed image was passed forward through the YOLOX network, comprising the CSPDarkNet53 backbone, the FPN-PAN neck, and the decoupled detection head. At each of the three output scales (P3, P4, P5), the decoupled head simultaneously produced class score predictions and bounding box regression predictions for all grid positions. Raw predictions were converted to absolute bounding box coordinates and confidence scores by decoding the model outputs.
Step 3 (NMS Post-Processing and Geometric Filtering): Raw detections were first filtered by a confidence threshold, then subjected to Non-Maximum Suppression (NMS) to eliminate duplicate bounding boxes with high overlap. The NMS IoU threshold was set consistently for both models. After NMS, geometric filtering was applied: vertebral body candidates were retained if their bounding box center lay within the detected L2–L4 region, and pedicle candidates were retained if their bounding box center lay within the retained vertebral body bounding box. Final assignments were performed by greedy IoU-based matching between vertebral bodies and pedicles. For the Model2 pipeline specifically, the following additional steps were applied between Step 2 and Step 3: (a) Model1 was first used to detect vertebral body bounding boxes on the full image; (b) for each accepted vertebral body, a cropped subimage was extracted from the full radiograph using the Model1 vertebral body bounding box coordinates; (c) Model2 was applied independently to each cropped subimage to detect the pedicle within the crop; and (d) the pedicle bounding box coordinates were transformed back to full-image coordinate space by adding the offset of the crop within the full image (x_pd_full = x_pd_crop + x_vb_offset; y_pd_full = y_pd_crop + y_vb_offset), after which the standard geometric filtering and VPR computation proceeded on full-image coordinates.
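The crop extraction, crop-to-full coordinate transform, and center-containment filter used in the Model2 pipeline can be sketched as follows (an illustrative Python version; the study used MATLAB, and boxes are (x_left, y_top, width, height)).

```python
def crop_box(image, box):
    """Extract the vertebral-body subimage; image is a 2-D array-like
    (list of pixel rows), box = (x, y, w, h) from Model1."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

def to_full_coords(pedicle_box_crop, vb_box):
    """Map a pedicle box detected inside the crop back to full-image
    coordinates by adding the crop offset:
    x_pd_full = x_pd_crop + x_vb_offset; y_pd_full = y_pd_crop + y_vb_offset."""
    x_c, y_c, w, h = pedicle_box_crop
    return (x_c + vb_box[0], y_c + vb_box[1], w, h)

def contains_center(outer, inner):
    """Geometric filter: keep `inner` only if its center lies inside `outer`."""
    cx = inner[0] + inner[2] / 2.0
    cy = inner[1] + inner[3] / 2.0
    return (outer[0] <= cx <= outer[0] + outer[2]
            and outer[1] <= cy <= outer[1] + outer[3])
```

After `to_full_coords`, the standard geometric filtering and VPR computation proceed on full-image coordinates, exactly as for Model1.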
Step 4 (VPR Computation): For each successfully matched vertebral body–pedicle pair at L2, L3, and L4, the level-wise VPR was computed using the formula defined in Section 2.6. The image-level VPR was obtained as the mean of available level-wise VPRs. Images for which fewer than three levels were successfully matched were excluded from downstream angle estimation.
Step 5 (Group Assignment and Angle Estimation): The projection group (n-group or p-group) was determined from the filename suffix of each test image. Based on the assigned group, the corresponding linear regression coefficients (a_n, b_n or a_p, b_p) estimated during training were applied to compute the projection angle estimate θ̂ = a · VPR + b. The absolute difference between the estimated angle and the ground-truth angle from the filename was recorded as the image-level absolute error, and the mean absolute error (MAE) was computed across all images within a fold.
Step 6 (Processing Speed Measurement): Inference processing speed was measured in frames per second (FPS) for each fold. The mean FPS was then calculated from the second frame onward across all test images within the fold, and fold-level FPS values were summarized as mean ± SD across five folds.
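The throughput measurement in Step 6 can be sketched in a few lines; this is an illustrative Python version (the study used MATLAB), with per-frame wall-clock times supplied in seconds.

```python
def mean_fps(frame_times_s):
    """Mean throughput in frames per second, computed from the second
    frame onward to exclude first-frame warm-up overhead."""
    steady = frame_times_s[1:]
    return len(steady) / sum(steady)

def mean_sd(values):
    """Fold-level summary as (mean, sample SD), e.g. across five folds."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / (n - 1)) ** 0.5
    return m, sd
```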
2.8. Linear Regression for Angle Estimation
The relationship between image-level VPR and projection angle was modelled separately for the n-group and the p-group using ordinary least-squares (OLS) linear regression: θ̂ = a · VPR + b, where a is the regression slope and b is the intercept. This separation is necessary because the direction of pedicle offset relative to the vertebral body is opposite for third-oblique (n-group) and fourth-oblique (p-group) projections, meaning that a single combined regression would be non-linear and poorly fitting. Regression coefficients (a_n, b_n) and (a_p, b_p) and coefficients of determination (R2_n, R2_p) were estimated per fold using training data only. Fold-level coefficients were summarized across five folds (mean, median, SD, IQR). In addition, a pooled regression was computed by concatenating all training data from all five folds to obtain globally stable coefficient estimates.
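The per-group OLS fit, angle estimation, and fold-level MAE can be sketched as follows (an illustrative Python version with hypothetical names; the study implementation was in MATLAB).

```python
def ols_fit(vprs, angles):
    """Ordinary least-squares fit of angle = a * VPR + b for one group
    (n-group or p-group), using training data only."""
    n = len(vprs)
    mx, my = sum(vprs) / n, sum(angles) / n
    sxx = sum((x - mx) ** 2 for x in vprs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(vprs, angles))
    a = sxy / sxx
    return a, my - a * mx

def estimate_angle(vpr_value, coeffs):
    """Apply the group-specific regression: angle = a * VPR + b."""
    a, b = coeffs
    return a * vpr_value + b

def mae(estimates, truths):
    """Mean absolute error over all images in a fold."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(estimates)

# Separate coefficients per projection group, fitted per fold, e.g.:
# coeffs = {"n": ols_fit(vpr_n, ang_n), "p": ols_fit(vpr_p, ang_p)}
```

The pooled regression simply calls `ols_fit` on the concatenated training data of all five folds.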
4. Discussion
This study demonstrates that a two-stage object detection approach combining vertebral body localization and dedicated pedicle detection achieves substantially better angle estimation accuracy compared with a single-stage three-class detector for lumbar oblique radiography. The overall MAE of 5.42° achieved by Model2 is close to the clinically tolerated positioning variation of approximately ±5° commonly accepted in routine spinal radiographic practice. This level of accuracy supports our methodological approach in a radiographic context.
The rationale for developing two models was grounded in the expectation that simultaneous multi-class detection of small structures such as pedicles in whole-image context would be inherently more challenging than single-class detection within a constrained crop. Model1 was first developed as a straightforward end-to-end pipeline; its relatively high fold-to-fold variability in AP@0.5 (range: 0.553–0.890) confirmed this hypothesis and directly motivated the two-stage Model2 design. The superiority of Model2 over Model1 in detection performance and angle estimation can be primarily attributed to the reduction in the detection search space. When pedicle detection is performed within a cropped region already containing the target vertebral body, the relative size of the pedicle in the image is substantially larger, reducing the challenges inherent in detecting small objects in full-field radiographs. This finding aligns with the broader principle that hierarchical or region-of-interest-based detection pipelines improve performance for anatomical structures with large inter-scale variability [13,18]. Our two-stage approach mirrors strategies applied in other spinal imaging studies—such as the YOLOv8-then-classification pipeline reported for lumbar spondylolisthesis detection [14]—and extends this concept to the projection radiograph domain.
The VPR metric proved to be a robust geometric index for projection angle estimation. High R2 values (0.832 for n-group, 0.870 for p-group) in both per-fold and pooled regressions confirm that the horizontal offset of the pedicle relative to the vertebral body width changes predictably and nearly linearly with projection angle across the clinically relevant 20–60° range. The separate treatment of n-group and p-group projections is essential: because the VPR–angle relationship operates in opposing directions for these two projection types, combining both into a single regression would substantially reduce estimation accuracy. In the present study, the angle group label was derived directly from the filename suffix of the synthetic training and test images. In a fully automated clinical pipeline, this label could be reliably assigned by a lightweight oblique-direction classifier or by exploiting DICOM header metadata.
The fold-to-fold variability in regression slope coefficients reflects sensitivity to the composition of each training fold. However, the near-identical R2 values in pooled regression versus per-fold estimates demonstrate that the linear VPR–angle relationship is a stable property of the underlying anatomy rather than an artifact of any particular data split. This stability provides confidence that the regression coefficients estimated from the synthetic training data will generalize to new synthetic images and, by extension, to clinical images in which the same vertebral–pedicle geometric relationship holds.
The processing speed trade-off between the two models has direct implications for clinical deployment scenarios. Model1 operates at 18.3 FPS, which is sufficient for near-real-time feedback during image acquisition. Model2 processes at approximately 2.9 FPS due to the sequential two-stage inference and the overhead of coordinate transformation. While this throughput is adequate for post-acquisition quality control workflows, it may not support real-time intraoperative guidance. Future engineering approaches including GPU-batched multi-vertebra crop inference, model quantization, and asynchronous pipeline execution could substantially improve Model2 throughput while preserving its accuracy advantage.
The use of synthetic radiographs generated from CT data is a principled solution to the training data bottleneck [17,19]. Ray-sum projection of CT volumes at controlled angles produces images with precisely known ground-truth projection angles, enabling supervised regression without the need to acquire multiple radiographs from the same patient at varying angles. The consistent R2 values across folds confirm that overfitting to any particular CT dataset is minimal. Extension to clinical radiographs would require adaptation strategies—such as domain adaptation or fine-tuning with a small set of annotated clinical images—to bridge the appearance gap between synthetic and real radiographs.
Limitations and Assumptions
This study has several limitations and underlying assumptions. Because training and validation were performed on CT-derived DRRs, a domain gap may exist relative to clinical radiographs (e.g., noise/scatter and exposure variability) [20]; therefore, clinical validation using lumbar oblique radiographs with reference projection angles (derived from calibrated acquisition geometry or CT-based reconstruction) will be required [21]. The VPR-based formulation also assumes that the vertebral–pedicle geometric relationship is sufficiently consistent across subjects; this assumption may be violated in severe deformity or post-surgical instrumentation, motivating confidence-based rejection and/or implant-aware filtering as future work. In addition, the current implementation excludes cases where all three levels (L2–L4) cannot be matched, which improves stability but reduces coverage; we will implement a fallback strategy that fuses one- or two-level VPR estimates with reliability weighting and will report both MAE and exclusion rate with and without this fallback. Finally, our results indicate a practical speed–accuracy trade-off (Model1: 18.27 FPS; Model2: 2.87 FPS), supporting a hybrid workflow in which Model1 provides near-real-time feedback during acquisition and Model2 provides high-precision post-acquisition quality control [22].
Despite these limitations, the present results establish VPR-based two-stage detection as a sound methodological foundation for automated lumbar oblique radiographic quality control. Beyond detection accuracy, the clinical value of automated angle estimation lies in its potential for integration across the radiographic workflow. At the point of acquisition, real-time angle feedback can prompt immediate repositioning, eliminating the need for repeat examinations and reducing patient radiation dose—a particular concern in the predominantly young, athletic population affected by spondylolysis [2,3]. After acquisition, automatic PACS annotation of the verified projection angle supports consistent radiological interpretation, because the apparent morphology of the Scotty dog sign varies with projection angle; a documented record of the actual angle allows the reporting radiologist to contextualize borderline findings. In longitudinal follow-up, angle-normalized comparison of serial radiographs reduces measurement variability attributable to positioning differences. In surgical planning, the verified oblique angle from preoperative radiographs enables geometrically corrected estimation of pars defect dimensions and screw trajectory [6], improving operative precision. Finally, from a broader AI-in-radiology perspective, this work contributes to the growing evidence that synthetic data generation combined with appropriate geometric modeling and hierarchical detection pipelines can produce clinically meaningful automated assessments of radiographic acquisition quality [7,15]. Integration of such tools into routine radiography workflows could ultimately reduce repeat examinations, lower patient radiation exposure, and support the standardization of imaging protocols across institutions.