Article

An Automatic Identification Method for Vertebral Compression Fractures in X-Ray Images Based on Multi-Stage Deep Learning

College of Science and Technology, Ningbo University, Ningbo 315212, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(12), 2626; https://doi.org/10.3390/electronics15122626 (registering DOI)
Submission received: 9 May 2026 / Revised: 11 June 2026 / Accepted: 12 June 2026 / Published: 14 June 2026
(This article belongs to the Special Issue AI-Driven Medical Image/Video Processing)

Abstract

Vertebral compression fractures (VCFs) are among the most common spinal disorders encountered clinically. Delayed diagnosis or inaccurate classification often leads to prolonged pain and functional impairment in patients. Because computed tomography (CT) and magnetic resonance imaging (MRI) examinations are costly and of limited applicability, this study builds on the wide availability and convenience of X-ray imaging and proposes a multi-stage deep learning-based method for identifying vertebral compression fractures. The method first employs Discrete Wavelet Transform-YOLOv5 (DWT-YOLOv5) for preliminary vertebral region localization, followed by Polarized Self-Attention-UNet (PSA-UNet) for precise segmentation. Finally, a ResNet50 network incorporating a Convolutional Block Attention Module (CBAM) performs graded classification, categorizing vertebrae into four types: Non-fracture, Mild fracture, Moderate fracture, and Severe fracture. The experimental results show that the proposed method achieved average accuracy, precision, recall, specificity, and F1-score of 83.7%, 88.1%, 86.2%, 97.7%, and 87.2%, respectively. By exploiting the cost-effectiveness and convenience of X-ray imaging, the method provides clinicians with an efficient and economical auxiliary diagnostic tool for rapid and accurate identification of vertebral compression fractures in emergency and initial screening scenarios.

1. Introduction

A vertebral compression fracture (VCF) is a collapse fracture of the vertebra caused by external force or pathological factors such as osteoporosis or tumors, and it typically presents as a reduction in vertebral height of varying degree. The condition is particularly common among middle-aged and elderly individuals, especially patients with osteoporosis, in whom its incidence can reach 40%. Clinical symptoms primarily include back pain, postural changes, and limited mobility. Severe cases may trigger multiple complications, significantly impairing patients’ quality of life and even increasing mortality risk [1,2]. Timely diagnosis and intervention for vertebral compression fractures therefore hold significant clinical importance.
Currently, imaging examinations for vertebral compression fractures primarily rely on X-ray, computed tomography (CT), and magnetic resonance imaging (MRI). Among these, MRI and CT offer high accuracy in assessing lesion structure, bone detail, and freshness. However, their application remains constrained by factors such as high cost, time-consuming procedures, and limited equipment accessibility. In contrast, X-ray examinations offer advantages such as rapid imaging, low radiation exposure, and relatively low cost. Scans can be completed within seconds to minutes, making them particularly suitable for rapid screening and diagnosis in emergency settings [3,4,5].
In recent years, deep learning has significantly advanced medical image analysis, expanding from early lesion detection to tasks including organ segmentation, pathological classification, and disease grading [6]. Through end-to-end learning frameworks, convolutional neural networks and their variants can automatically extract diagnostic features from multimodal images such as X-ray, CT, and MRI, thereby improving recognition efficiency and accuracy [7,8]. Some scholars have achieved automatic detection and localization of vertebral fracture regions by improving detection networks such as YOLO, RetinaNet, and EfficientDet. Other studies employ U-Net and its variants for precise segmentation of vertebral regions, combining attention mechanisms to enhance feature expression and thereby improve the model’s ability to assess fracture severity [9].
However, current mainstream methods for automatically identifying and grading the severity of vertebral compression fractures primarily rely on CT and MRI. For patients unsuitable for CT or MRI examinations—such as those with radiation allergies, renal insufficiency, or metallic implants—X-rays remain a more feasible diagnostic tool. Therefore, developing an automated method for identifying and grading vertebral compression fractures based on X-ray images holds significant clinical value and contributes to the advancement of intelligent diagnostic systems.
To address this challenge, this paper proposes a multi-stage deep learning-based approach for vertebral compression fracture detection. This method fully integrates hierarchical features from detection, segmentation, and classification to enhance the accuracy and robustness of vertebral region identification in X-ray images. The main contributions of this paper are as follows:
(1) Addressing the challenge of ambiguous features and high recognition difficulty in X-ray images of vertebral compression fractures, this study decomposes the recognition task into three stages: localization, segmentation, and classification. Tailored network architectures are designed for each stage, significantly enhancing overall recognition accuracy and robustness.
(2) To address the limitation of single feature extraction networks in simultaneously capturing local details and global semantic information, this study introduces a parallel feature fusion structure combining High-frequency Network (HNET) and Low-frequency Network (LNET) within Discrete Wavelet Transform-YOLOv5 (DWT-YOLOv5). By jointly modeling high- and low-frequency information, it effectively enhances the accuracy and stability of vertebral body detection.
(3) Addressing the limitations of traditional segmentation and classification models in edge region recognition and feature representation, this study incorporates Polarized Self-Attention (PSA) into U-Net to enhance detail capture capabilities. Additionally, integrating the Convolutional Block Attention Module (CBAM) into ResNet50 significantly improves the precision of vertebral segmentation and the accuracy of fracture severity classification.
This integrated approach aims to deliver accurate, automated VCF identification while preserving the clinical practicality and affordability of X-ray imaging, offering a viable diagnostic aid especially where advanced imaging is unavailable.

2. Proposed Method

To address the challenges of grading VCFs on X-ray images, such as low resolution, noise interference, and subtle morphological differences between fracture severity levels, this study designs a multi-stage deep learning approach to enhance diagnostic accuracy and clinical applicability. The overall framework of the method is illustrated in Figure 1. First, because lateral spine X-ray images often exhibit blurred features due to low contrast and noise, the images are preprocessed: Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast, highlighting vertebral edges and textures to provide high-quality input for subsequent deep learning modules. Second, to achieve rapid and precise localization of vertebral regions, an improved DWT-YOLOv5 is proposed. It separates high- and low-frequency features via discrete wavelet transform and processes them in the parallel networks HNET and LNET, which capture vertebral edges and overall structure respectively and thereby mitigate insufficient detail capture in low-resolution images. Third, to enhance segmentation accuracy and detect minute morphological variations, PSA-UNet is constructed based on U-Net. It incorporates a PSA module to optimize encoder–decoder feature fusion, achieving high-precision segmentation of edges and critical regions. Finally, CBAM is introduced into ResNet50 to enhance focus on critical vertebral features while suppressing background noise, enabling classification of vertebrae into Non-fracture, Mild, Moderate, and Severe categories.

2.1. Image Preprocessing

Lateral chest and lumbar spine X-ray images offer advantages such as rapid imaging and low cost. However, their inherent low resolution, uneven brightness, and contrast variations pose challenges for subsequent detection and segmentation tasks. The vertebral region often exhibits uneven gray-scale distribution due to variations in bone density, surrounding soft tissue occlusion, and projection angle changes. This leads to edge blurring, loss of detail, or background noise interference. Particularly in early-stage, mild fractures, such blurring may directly impact the feature extraction capability and recognition accuracy of deep learning models.
To address these issues and enhance input image quality, this study applies CLAHE to the raw X-ray images as a preprocessing step [10]. Figure 2 shows an X-ray image after CLAHE processing. As seen in Figure 2, CLAHE preprocessing enhances key features such as vertebral edges and fine textures. This not only improves the accuracy of DWT-YOLOv5 in locating vertebral regions but also provides clear structural information for the detailed segmentation performed by PSA-UNet, thereby enhancing the overall accuracy and reliability of fracture classification.
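The contrast-limiting step at the core of CLAHE can be illustrated with a minimal NumPy sketch. It equalizes a single tile only; full CLAHE additionally splits the image into a grid of tiles and interpolates between neighbouring tile mappings, which OpenCV's `cv2.createCLAHE` handles. The function name `clip_limited_equalize`, the clip limit of 2.0, and the synthetic tile are assumptions for illustration, since the paper does not report its CLAHE settings.

```python
import numpy as np

def clip_limited_equalize(tile, clip_limit=2.0, n_bins=256):
    """Contrast-limited histogram equalization of a single uint8 tile.

    clip_limit is an assumed value; the paper does not report its setting.
    """
    hist = np.bincount(tile.ravel(), minlength=n_bins).astype(np.float64)
    # Cap every bin at clip_limit times the average bin height.
    ceiling = clip_limit * tile.size / n_bins
    excess = np.maximum(hist - ceiling, 0.0).sum()
    # Spread the clipped mass uniformly so the histogram total is unchanged.
    hist = np.minimum(hist, ceiling) + excess / n_bins
    cdf = np.cumsum(hist) / hist.sum()
    lut = np.round(cdf * (n_bins - 1)).astype(np.uint8)
    return lut[tile]

# Hypothetical low-contrast tile with gray levels confined to 100..140.
tile = (100 + np.arange(64 * 64) % 41).reshape(64, 64).astype(np.uint8)
enhanced = clip_limited_equalize(tile)
print(tile.min(), tile.max(), "->", enhanced.min(), enhanced.max())
```

Clipping the histogram before building the mapping bounds the slope of the gray-level transfer function, which is what keeps noise in homogeneous regions from being over-amplified.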

2.2. DWT-YOLOv5 Vertebral Body Detection

In the initial detection phase, accurately and rapidly locating the vertebral region within X-ray images forms the foundation of the entire identification process. However, X-ray images commonly suffer from low resolution, low contrast, and complex background noise, which challenge the feature extraction and localization accuracy of deep learning models. YOLOv5 strikes a favorable balance between detection speed and accuracy, featuring a lightweight architecture and strong small object detection capabilities, and is therefore widely adopted for medical image detection tasks. Nevertheless, the original YOLOv5 exhibits limitations when processing X-ray images, such as a restricted ability to capture subtle edges and textural features and susceptibility to noise interference, resulting in imprecise vertebral localization. To address these issues, this paper proposes an enhanced DWT-YOLOv5 model that builds upon YOLOv5, incorporates the Discrete Wavelet Transform (DWT), and adds two feature extraction branches: HNET for high-frequency features and LNET for low-frequency features. This enhances detail sensitivity and structural integrity in vertebral detection. The overall network architecture is illustrated in Figure 3.
Vertebral information in X-ray images primarily manifests in two types of features: high-frequency information such as edges and texture, and low-frequency information such as overall contours and density distribution. Traditional convolutional neural networks often struggle to distinguish between these two types of information during feature extraction, leading to weakened edge features. To address this, we perform a first-level DWT decomposition on the input image, decomposing it into one low-frequency subband and three high-frequency subbands. The low-frequency subband primarily contains the image’s overall structure and background information, while the high-frequency subbands respectively reflect texture and edge features in the horizontal, vertical, and diagonal directions. Through this frequency domain decomposition, the model can capture both the structural and detailed information of the vertebrae, thereby enhancing feature discriminability and robustness. This provides more targeted input for subsequent feature extraction.
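A single-level 2-D decomposition into one low-frequency and three high-frequency subbands can be sketched as follows. The Haar wavelet is used purely as an assumption, since the paper does not name its wavelet basis, and `haar_dwt2` is a hypothetical helper that expects even image dimensions.

```python
import numpy as np

def haar_dwt2(img):
    """Single-level 2-D Haar DWT of an image with even height and width.

    Returns (low, horizontal, vertical, diagonal) subbands at half resolution.
    """
    img = np.asarray(img, dtype=np.float64)
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    low = (a + b + c + d) / 2.0         # overall structure (low frequency)
    horizontal = (a + b - c - d) / 2.0  # row differences: horizontal edges
    vertical = (a - b + c - d) / 2.0    # column differences: vertical edges
    diagonal = (a - b - c + d) / 2.0    # diagonal detail
    return low, horizontal, vertical, diagonal

# Hypothetical 4x4 input standing in for an X-ray image.
image = np.arange(16.0).reshape(4, 4)
low, horizontal, vertical, diagonal = haar_dwt2(image)
print(low.shape)
```

With the 1/2 scaling the transform is orthonormal, so the four subbands together carry exactly the energy of the input; in the described pipeline the low-frequency subband would feed LNET and the three detail subbands HNET.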
Building upon this frequency decomposition, this paper designs two parallel sub-networks—HNET and LNET—to extract edge texture features and global structural features, respectively, based on the distinct characteristics of high- and low-frequency features. Their network architectures are illustrated in Figure 4.
In LNET, the basic feature extraction module adopts a CBL architecture (comprising a 3 × 3 convolutional layer, a BN layer, and Leaky ReLU) and incorporates the lightweight channel attention mechanism Efficient Channel Attention (ECA) [11]. This mechanism adaptively adjusts the weights of different channels, thereby enhancing the model’s ability to perceive the overall morphology of the vertebral body within low-frequency subbands. Additionally, the CBAM [12] is incorporated to further focus on key regions and suppress background noise. Let X denote the enhanced feature map output by LNET, where the feature enhancement process of LNET is described by Equation (1):
$$X = \mathrm{CBAM}\bigl(x_1 + \mathrm{CBL}(\mathrm{CBL}(x_1))\bigr) \tag{1}$$
where $x_1 = \mathrm{ECA}(\mathrm{CBL}(\mathrm{CBL}(x)))$ represents the intermediate feature map weighted by the ECA module, and $x$ denotes the low-frequency subband feature map obtained through DWT decomposition.
In HNET, the primary focus is on extracting fine details from high-frequency subband features. By incorporating the Simple Attention Module (SimAM) [13], the model effectively enhances its responsiveness to vertebral edges and textures. SimAM highlights regions of fine-scale features through an adaptive weighting mechanism, thereby increasing the model’s sensitivity to minute fracture textures. Let Y denote the enhanced feature map output by HNET, whose feature extraction process is described by Equation (2):
$$Y = \mathrm{ECA}\bigl(y + \mathrm{CBL}(\mathrm{CBL}(y_1))\bigr) \tag{2}$$
where $y_1 = \mathrm{ECA}(\mathrm{CBL}(y))$ represents the attention-weighted intermediate feature map, and $y$ denotes the high-frequency subband feature map obtained via DWT decomposition.
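The adaptive weighting that SimAM [13] applies can be illustrated with a NumPy sketch of its parameter-free energy formulation. This follows the commonly published form of SimAM; the regularizer `e_lambda = 1e-4` and the toy feature map are assumptions, and the sketch says nothing about where the module sits inside HNET.

```python
import numpy as np

def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM weighting for a feature map of shape (C, H, W)."""
    n = x.shape[1] * x.shape[2] - 1
    # Squared deviation of every position from its channel mean.
    dev = (x - x.mean(axis=(1, 2), keepdims=True)) ** 2
    var = dev.sum(axis=(1, 2), keepdims=True) / n
    # Inverse energy: positions that stand out from their channel score higher.
    energy_inv = dev / (4.0 * (var + e_lambda)) + 0.5
    weights = 1.0 / (1.0 + np.exp(-energy_inv))  # sigmoid
    return x * weights

# Hypothetical single-channel map with one edge-like outlier response.
feat = np.ones((1, 4, 4))
feat[0, 1, 1] = 5.0
out = simam(feat)
print(out[0, 1, 1] / feat[0, 1, 1] > out[0, 0, 0] / feat[0, 0, 0])
```

Because the weights depend only on per-channel statistics, the module adds no learnable parameters, which fits its use on thin, high-frequency structures such as edges.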
In network architecture design, we fully considered the complementarity of features across different levels. Shallow layers primarily capture high-frequency edge and texture information, while deep layers offer larger receptive fields, making them suitable for extracting low-frequency global structures. This paper integrates high-frequency features into shallow branches and low-frequency features into deep branches, thereby achieving synergistic optimization of local details and overall structure. This significantly enhances the robustness and localization accuracy of vertebral body detection.
In summary, DWT-YOLOv5 achieves rapid, precise, and robust detection of vertebral regions through multi-frequency feature decomposition, dual-branch feature extraction, and hierarchical feature fusion. This provides a reliable structural foundation for subsequent segmentation and fracture identification.

2.3. PSA-UNet Vertebral Body Segmentation

During the fine segmentation stage of vertebral compression fractures, accurately segmenting the preliminarily detected vertebral regions is a critical foundation for subsequent classification tasks. However, X-ray images commonly suffer from low resolution, low contrast, and complex background noise. Soft tissue occlusions and artifact interference pose significant challenges to the segmentation task.
In X-ray images, vertebral body edges and morphological features are often distributed globally, yet traditional convolutions struggle to model long-range dependencies in the presence of noise and low contrast. To address this challenge, we embed the PSA module into the encoding path of U-Net to enhance attention in the downsampled features generated by the encoder. The PSA module comprises two parallel branches: channel-only attention and spatial-only attention. It employs polarized vectors to modulate attention weights, thereby enhancing information expressiveness [14]. The polarized vectors in the PSA module (including the convolutional layers associated with these vectors) are all initialized using Kaiming Normal (mode="fan_in"). This initialization is applied uniformly during module construction via the `reset_parameters()` function, which helps to maintain gradient stability and feature representation capabilities during the early stages of deep network training.
The channel-only attention branch focuses on global dependencies between channels, capturing high-level semantic information by compressing the spatial dimension, while the spatial-only attention branch emphasizes local pixel-to-pixel relationships, enhancing detail feature capture by compressing the channel dimension. By modeling global semantic and local detail dependencies across channel and spatial dimensions respectively, the model highlights key regional features in images (such as vertebral texture edges and their morphological variations) while suppressing interference from artifacts and redundant information.
During the skip connection phase, encoder features enhanced by the PSA module are concatenated with the decoder’s upsampled features, ensuring thorough integration of semantic information and detailed characteristics. This mechanism enables the model to more accurately reconstruct vertebral edges and other critical details during decoding. It not only compensates for U-Net’s neglect of global context but also further optimizes segmentation performance for blurred features in X-ray images, providing more reliable feature representations for subsequent classification tasks. The computation for the skip connection can be expressed as:
$$F_o = \mathrm{PSA}(F_i) \tag{3}$$
where $F_i$ denotes the feature map from the encoder, and $F_o$ represents the feature map enhanced by the PSA module, which is then fused with the decoder’s upsampled features.
As a classic model in medical image segmentation, U-Net effectively captures multi-scale features and restores spatial details through its encoder–decoder architecture and skip connection mechanism, making it particularly suitable for pixel-level segmentation tasks of anatomical structures like vertebral bodies. However, the original U-Net exhibits limitations when processing X-ray images: its reliance on local convolutional operations makes it challenging to capture long-range global dependencies. This often leads to segmentation inaccuracies, detail loss, or over-segmentation in areas with blurred edges and complex textures. Particularly concerning is the model’s difficulty in precisely capturing morphological features at vertebral edges, such as compression deformations, which can compromise overall diagnostic accuracy. To address these issues, this study proposes PSA-UNet as an enhancement to U-Net. This improved model incorporates the PSA module to strengthen the network’s ability to model both global contextual information and local structural features, thereby significantly improving segmentation accuracy in vertebral regions following preliminary detection. The modified model architecture is illustrated in Figure 5.

2.4. CBAM-ResNet50 Classification Model

In the final classification task, accurately distinguishing between the four categories (Non-fracture, Mild, Moderate, and Severe) is crucial for clinical diagnosis. ResNet50, as a widely adopted deep convolutional neural network, offers robust feature extraction capabilities through its residual learning framework and effectively mitigates the vanishing gradient problem in deep networks. However, ResNet50 relies on local convolutional operations, making it challenging to fully capture global semantic information and long-range dependencies. This limitation is particularly evident when handling subtle morphological variations in vertebral bodies within X-ray images, where background noise easily interferes and compromises classification accuracy.
To address these challenges, we introduce the CBAM into ResNet50’s residual modules. This enhances feature selection capabilities through channel attention and spatial attention mechanisms. The residual module structure after CBAM integration is illustrated in Figure 6. CBAM guides the network to focus on critical vertebral regions, enhancing the capture of morphological features (such as edges and subtle structural differences) while suppressing background interference and noise from non-target areas. By jointly modeling local details and global semantic features, CBAM significantly improves the model’s ability to distinguish between morphologically similar vertebrae or those with varying degrees of pathology. This optimizes overall classification performance and enhances prediction accuracy.

3. Experimental Results and Analysis

To validate the effectiveness and robustness of the proposed multi-stage deep learning approach in identifying vertebral compression fractures, this section evaluates its performance across the detection, segmentation, and classification stages through a systematic experimental design. Experiments were conducted using a clinically acquired thoracolumbar X-ray dataset. Multiple quantitative evaluation metrics were employed to analyze and compare model performance, alongside comparative experiments against several mainstream benchmark models, in order to assess the proposed method’s accuracy, stability, and generalization capability.

3.1. Dataset and Experimental Details

The dataset used in this study comprises 453 thoracolumbar X-ray images acquired at Ningbo Second Hospital. The dataset was obtained with the patients’ informed consent, has been anonymised, and has been approved by the hospital’s ethics committee. All X-ray images were acquired using the Digital DIAGNOST system from Philips Medical Systems (Amsterdam, The Netherlands). The key acquisition parameters are as follows: tube voltage = 55 kV, exposure time = 20 ms, exposure dose = 6 mAs, source-to-detector distance = 1800 mm, pixel pitch = 0.143 mm × 0.143 mm. To construct a cohort suitable for the assessment of osteoporotic or traumatic VCFs, strict inclusion and exclusion criteria were established during database construction. Patients who had been clinically and radiologically diagnosed with VCFs were included; those with pathological fractures caused by spinal malignancies or infections, those who had undergone vertebral surgery, and those with poor-quality X-ray images or foreign body interference were excluded. The dataset was annotated by specialist orthopaedic surgeons at the hospital using Labelme software (version 5.3.1).
To enhance the model’s robustness and generalization capability, data augmentation by rotation, flipping, and scaling was applied to the original images, ultimately yielding 1870 thoracolumbar X-ray images. The rotation angles were 30°, 60° and 90°; flipping was horizontal or vertical; and the scaling factor was randomly selected from between 0.9 and 1.1. The dataset was divided into training, validation, and test sets at a ratio of 6:2:2 for model training and performance evaluation.
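A 6:2:2 partition that keeps every patient inside a single subset can be sketched as follows. The helper `split_by_patient`, the identifiers, and the fixed seed are hypothetical; the paper states only the ratio and that partitioning is enforced at the patient level.

```python
import random

def split_by_patient(patient_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return sample indices per subset so that no patient spans two subsets.

    patient_ids[i] is the (hypothetical) patient identifier of sample i.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_train = round(ratios[0] * len(patients))
    n_val = round(ratios[1] * len(patients))
    groups = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    return {
        name: [i for i, p in enumerate(patient_ids) if p in members]
        for name, members in groups.items()
    }

# Hypothetical cohort: 10 patients with 3 images each.
ids = [f"p{i // 3}" for i in range(30)]
subsets = split_by_patient(ids)
print({name: len(idx) for name, idx in subsets.items()})
```

Splitting on patient identifiers rather than on images is what prevents augmented or repeated views of one patient from appearing in both training and test data.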
Vertebral fractures were annotated according to the Genant semi-quantitative (SQ) grading criteria [15]. Samples were categorized into four fracture severity grades based on the percentage reduction in vertebral height: Non-fracture, Mild fracture, Moderate fracture, and Severe fracture. A Mild fracture corresponds to a height reduction of approximately 20–25% in the anterior, middle, and/or posterior regions of the vertebral body; a Moderate fracture to 26–40%; and a Severe fracture to more than 40%. Additionally, based on radiographic morphology, fracture types were further subdivided into three categories: wedge deformity, biconcave deformity, and compression deformity. All annotations were performed collaboratively by orthopedic and radiology specialists. The dataset comprised 2668 non-fractured vertebrae, 621 mild fractures, 624 moderate fractures, and 678 severe fractures. The distribution across categories is presented in Table 1. The dataset partitioning described above is strictly enforced at the patient level, so that all vertebrae belonging to the same patient are assigned to the same data subset in order to prevent data leakage.
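The grading thresholds quoted above can be written as a small lookup. This is a sketch only: treating reductions below 20% as Non-fracture and placing values between 25% and 26% in the Moderate grade are assumptions needed to make the rule total, and both function names are hypothetical.

```python
def height_loss_pct(reference_height, measured_height):
    """Percentage reduction of a vertebral height relative to a reference."""
    return 100.0 * (reference_height - measured_height) / reference_height

def genant_grade(loss_pct):
    """Map a height reduction (in percent) to the four grades used here.

    Boundary handling is an assumption; the quoted ranges are
    Mild 20-25%, Moderate 26-40%, Severe above 40%.
    """
    if loss_pct > 40:
        return "Severe fracture"
    if loss_pct > 25:
        return "Moderate fracture"
    if loss_pct >= 20:
        return "Mild fracture"
    return "Non-fracture"

# Hypothetical measurement: reference height 20 mm, anterior height 15 mm.
loss = height_loss_pct(20.0, 15.0)
print(loss, genant_grade(loss))
```

In practice the reduction would be taken as the largest of the anterior, middle, and posterior values measured on the segmented vertebra.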
All models were trained on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) using the PyTorch framework (version 2.4.1). Optimisation used the SGD optimiser with a cosine learning-rate schedule, a batch size of 16, 200 training epochs, a base learning rate of 0.1, momentum of 0.9, and weight decay of 5 × 10⁻⁴.
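One common reading of a cosine learning-rate schedule is cosine annealing from the base rate down to a minimum over the full run. The sketch below uses the reported base rate of 0.1 and 200 epochs; the minimum rate of 0 and the absence of warm-up are assumptions, as the paper does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=200, base_lr=0.1, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (min_lr is assumed)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
for epoch in (0, 100, 200):
    print(epoch, round(cosine_lr(epoch), 4))
```

PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` follows the same closed form when stepped once per epoch with `T_max` set to the number of epochs.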

3.2. Experimental Results

To systematically evaluate the performance and effectiveness of the proposed multi-stage intelligent diagnosis framework for vertebral compression fractures, this section conducts experimental analysis across three levels: detection, segmentation, and classification. First, by comparing with multiple mainstream object detection algorithms, we validate the accuracy and robustness of the proposed DWT-YOLOv5 network in fracture region detection tasks. Subsequently, comparative experiments demonstrate the performance improvement of PSA-UNet in vertebral fine segmentation. Finally, the ResNet50 classification network with the introduced CBAM is validated to assess its accuracy and generalization capability in fracture severity identification. Through a series of comparative experiments and visual analyses, the improvements achieved by each proposed module across different task stages and the overall performance advantages of the system are comprehensively validated. It should be emphasised that, in the comparative experiments, the proposed model and the benchmark models were evaluated using the same dataset and splitting conditions, and followed the same pre-processing and evaluation procedures, in order to ensure the validity of the experiments.
(1) Performance of the Vertebral Body Detection Model
To validate the vertebral detection performance of the DWT-YOLOv5 model designed in this work, we compared it with mainstream object detection models such as YOLOv3, YOLOv4, YOLOv8, and Faster R-CNN across evaluation metrics including Precision, mAP, Recall, and F1-score. The experimental results are shown in Table 2.
As shown in Table 2, DWT-YOLOv5 achieves improvements across all evaluation metrics. Its Precision reaches 0.932, mAP reaches 0.986, Recall reaches 0.971, and F1-score reaches 0.950, all outperforming the baseline model YOLOv5 and the other mainstream detection networks.
Compared to the baseline model, DWT-YOLOv5 achieves a 10.2% increase in Precision, a 1.3% improvement in mAP, and enhancements of 1.5% and 2.0% in Recall and F1-score, respectively. This demonstrates that the separation of high- and low-frequency features via DWT decomposition, combined with the parallel HNET-LNET architecture design, effectively enhances the network’s multi-scale feature extraction capabilities. Compared to YOLOv3 and YOLOv4, the improved model achieves 14.3% and 13.7% gains in mAP, respectively, with particularly significant improvements in Recall at 24.9% and 17.0%. Against YOLOv8, it demonstrates a 1.2% increase in mAP and a 5.4% boost in Precision, highlighting the enhanced method’s advantage in capturing fine-grained features. Compared to Faster R-CNN, DWT-YOLOv5 achieves a 6.2% improvement in mAP, along with 3.5% and 16.7% gains in Recall and F1-score respectively, demonstrating superior adaptability to complex backgrounds and multi-object scenarios.
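The F1-score is the harmonic mean of precision and recall, so the reported detection figures can be related to one another with a short calculation. The helper name `f1_score` is hypothetical; the inputs are the reported Precision and Recall of DWT-YOLOv5.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Reported DWT-YOLOv5 detection precision and recall.
detection_f1 = f1_score(0.932, 0.971)
print(round(detection_f1, 3))
```

The value obtained from the rounded inputs is approximately 0.951, close to the reported F1-score of 0.950.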
(2) Model Performance in Vertebral Body Segmentation
To further validate the performance of the PSA-enhanced U-Net in vertebral body segmentation, we compared PSA-UNet with the baseline model and calculated metrics including mIoU, mPA, and accuracy. To verify the statistical significance of the improvement, we conducted a 5 × 2 cross-validation experiment comparing the mIoU of PSA-UNet with that of the baseline model. In each fold, we trained both the baseline and PSA-UNet models separately, and subsequently performed a paired t-test to analyse the difference in mIoU between the two. The experimental results are shown in Table 3. They indicate that the improvement in vertebral segmentation accuracy brought about by the PSA module is both stable and statistically significant.
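The three segmentation metrics can all be derived from a pixel-level confusion matrix, as in the NumPy sketch below. The helper `segmentation_metrics` and the toy masks are hypothetical, and the sketch assumes every class occurs at least once so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(pred, target, n_classes=2):
    """mIoU, mPA, and pixel accuracy from integer label maps."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # cm[i, j] counts pixels whose true class is i and predicted class is j.
    cm = np.bincount(target * n_classes + pred,
                     minlength=n_classes ** 2).reshape(n_classes, n_classes)
    tp = np.diag(cm).astype(np.float64)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)
    pixel_acc_per_class = tp / cm.sum(axis=1)
    return {
        "mIoU": iou.mean(),
        "mPA": pixel_acc_per_class.mean(),
        "accuracy": tp.sum() / cm.sum(),
    }

# Hypothetical 2x2 masks: 0 = background, 1 = vertebra.
target_mask = [[0, 0], [1, 1]]
pred_mask = [[0, 1], [1, 1]]
seg = segmentation_metrics(pred_mask, target_mask)
print(seg)
```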
As shown in Table 3, our method outperforms the baseline in average mIoU, mPA and accuracy, with p-values of 0.037, 0.028 and 0.011 respectively, all of which are below 0.05, indicating that the improvements are statistically significant.
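The paired t-test on the ten results of a 5 × 2 cross-validation can be sketched in plain Python. The mIoU values below are invented purely for illustration and are not the paper's measurements; the critical value 2.262 is the two-sided 5% threshold for 9 degrees of freedom.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences a - b (df = n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold mIoU values (5 repetitions x 2 folds).
baseline_miou = [0.861, 0.858, 0.865, 0.859, 0.863, 0.860, 0.857, 0.864, 0.862, 0.858]
psa_unet_miou = [0.872, 0.866, 0.871, 0.869, 0.870, 0.873, 0.864, 0.875, 0.868, 0.867]

t_stat = paired_t_statistic(psa_unet_miou, baseline_miou)
print("significant at 0.05:", abs(t_stat) > 2.262)
```

`scipy.stats.ttest_rel` performs the same test and also returns the p-value when SciPy is available.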
(3) Model Performance in Fracture Classification
Finally, to validate the performance of the ResNet50 network with the introduced CBAM in the fracture severity classification task, we compared it with models such as Baseline, ResNet34, VGG16, and VGG19 across metrics including Accuracy, Precision, Recall, and F1-score. The results are shown in Table 4.
The experimental results demonstrate that ResNet50 with the CBAM outperforms all comparison models in overall performance, achieving an overall accuracy of 0.885 and significantly improving recognition accuracy for different fracture grades. The introduction of CBAM enables the network to focus more precisely on critical regions of the vertebral body, thereby enhancing its ability to distinguish between morphologically similar samples or those with subtle differences in lesion severity.
(4) Ablation Experiment
To evaluate the contribution of the DWT, PSA and CBAM to the model’s classification performance, we conducted ablation experiments, the results of which are presented in Table 5. The baseline achieved a precision of 0.785, a recall of 0.793, a specificity of 0.892 and an F1 score of 0.815. After sequentially incorporating the DWT, PSA and CBAM, all metrics reached their optimal values, with accuracy, recall, precision and F1 score reaching 0.837, 0.862, 0.877 and 0.872 respectively. The experimental results demonstrate that each module makes a positive contribution to the final classification results.
Further analysis of the confusion matrix in Figure 7a shows that the model achieves its highest prediction accuracy in the “Non-fracture” and “Severe fracture” categories (90% each). This indicates that the network effectively leverages CBAM’s spatial attention mechanism to capture global features of vertebral morphological changes. However, misclassification persists in the “Mild” and “Moderate” categories, primarily with mild fractures misclassified as “Non-fracture” and moderate fractures as “Severe.” This likely stems from subtle differences in imaging presentation between mild and moderate fractures, where vertebral compression rates and texture alterations are less pronounced, leading to feature space overlap. The normalized confusion matrix in Figure 7b further validates this trend. Nevertheless, the overall confusion rate remains below 10%, indicating that the improved network is stable in multi-class classification.
Overall, CBAM achieves significant advantages in multi-class discrimination by modeling feature importance in both the channel and spatial dimensions. This approach guides the model to focus more on fracture-related regions while effectively suppressing background noise interference. This result not only validates CBAM’s applicability in fine-grained medical image classification but also provides valuable direction for future research on fusing multi-modal image features. In addition, a class-weighted cross-entropy loss combined with Focal Loss was employed when training the classification model, to strengthen the model’s focus on minority classes and hard samples.
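The loss described above can be sketched for a single sample as follows. The source does not give the weighting scheme, the focusing parameter, or how the two terms are combined, so the inverse-frequency weights (derived from the class counts in Table 1), `gamma = 2.0`, the mixing factor `lam`, and all function names here are assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(counts):
    """Class weights proportional to 1/count, rescaled so their mean is 1 (assumed scheme)."""
    inv = [1.0 / c for c in counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]


def weighted_ce_plus_focal(logits, target, class_weights, gamma=2.0, lam=1.0):
    """Class-weighted cross-entropy plus a class-weighted focal term for one sample.

    gamma and lam are hypothetical hyperparameters; the source does not report them.
    """
    p_t = softmax(logits)[target]
    w = class_weights[target]
    ce = -w * math.log(p_t)
    # The focal factor (1 - p_t)^gamma shrinks the loss of well-classified samples.
    focal = -w * (1.0 - p_t) ** gamma * math.log(p_t)
    return ce + lam * focal


# Class counts from Table 1: Non, Mild, Moderate, Severe.
counts = [2668, 621, 624, 678]
weights = inverse_frequency_weights(counts)

# A confidently correct majority-class sample versus a misclassified minority-class sample.
easy = weighted_ce_plus_focal([4.0, 0.0, 0.0, 0.0], target=0, class_weights=weights)
hard = weighted_ce_plus_focal([0.0, 0.0, 0.0, 4.0], target=1, class_weights=weights)
print(weights, easy, hard)
```

Under these assumptions the minority classes receive larger weights than the “Non” class, and the misclassified minority sample contributes far more to the loss than the easy majority sample, which is the behavior the paragraph describes.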
Given the class imbalance in the dataset, we additionally calculated the accuracy, precision, recall, specificity and F1 score for each class, together with their macro-averages. The results are shown in Table 6, which indicates that the model performs well across all four categories, with macro-averages for accuracy, precision, recall, specificity and F1 score reaching 0.837, 0.881, 0.862, 0.977 and 0.872, respectively. This indicates that the model not only performs well on the majority categories but also retains strong classification capability for minority categories, rather than simply relying on class bias.
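For reference, per-class precision, recall, specificity and F1 score, and their macro-averages, can be derived from a confusion matrix in a one-vs-rest fashion. The sketch below uses a made-up 4 × 4 matrix purely for illustration; it is not the data behind Figure 7 or Table 6, and the function names are hypothetical.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics per class; cm[i][j] counts true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"precision": precision, "recall": recall,
                     "specificity": specificity, "f1": f1})
    return rows


def macro_average(rows, key):
    """Unweighted mean of one metric over all classes."""
    return sum(r[key] for r in rows) / len(rows)


# Hypothetical confusion matrix (rows: true Non, Mild, Moderate, Severe).
cm = [
    [18, 2, 0, 0],
    [3, 15, 2, 0],
    [0, 2, 16, 2],
    [0, 0, 2, 18],
]
rows = per_class_metrics(cm)
for name, r in zip(["Non", "Mild", "Moderate", "Severe"], rows):
    print(name, {k: round(v, 3) for k, v in r.items()})
print("macro recall:", round(macro_average(rows, "recall"), 4))
```

Because the macro-average gives every class equal weight regardless of its sample count, it is less dominated by the majority class than a sample-weighted average, which is why it is a natural summary for an imbalanced dataset.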
To assess the feasibility of the proposed method for clinical deployment, we measured the computational cost of the complete multi-stage model. The model has approximately 68.4 M parameters in total. It was run using the PyTorch framework on an NVIDIA RTX 3090 GPU and an Intel Core i7-12700K CPU. The average inference time for a single X-ray image was 185 ms, of which the DWT-YOLOv5 detection stage took approximately 45 ms, the PSA-UNet segmentation stage approximately 95 ms, and the CBAM-ResNet50 classification stage approximately 43 ms; memory usage was approximately 1.8 GB.
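A per-stage latency breakdown of this kind could be collected with a small timing harness such as the one below. This is a sketch only: the three stages are stand-in callables rather than the actual networks, and the names `time_stages`, `warmup` and `repeats` are hypothetical. When timing real GPU models in PyTorch, `torch.cuda.synchronize()` would also need to be called before reading the clock, because CUDA kernels run asynchronously.

```python
import time


def time_stages(stages, sample, repeats=20, warmup=3):
    """Mean latency in milliseconds of each stage when the stages run in sequence."""
    totals = {name: 0.0 for name, _ in stages}
    for i in range(warmup + repeats):
        x = sample
        for name, fn in stages:
            start = time.perf_counter()
            x = fn(x)  # the output of one stage feeds the next
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            if i >= warmup:  # discard warm-up iterations
                totals[name] += elapsed_ms
    return {name: total / repeats for name, total in totals.items()}


# Stand-in callables; the real pipeline would call the detector, segmenter and classifier.
stages = [
    ("detect", lambda x: x),
    ("segment", lambda x: x),
    ("classify", lambda x: x),
]
latency = time_stages(stages, sample=[0] * 16)
print(latency)

# Accounting for the figures reported in the text (milliseconds).
reported_ms = {"detect": 45, "segment": 95, "classify": 43}
stage_sum = sum(reported_ms.values())
images_per_second = 1000.0 / 185
print(stage_sum, round(images_per_second, 1))
```

The three reported stage times sum to 183 ms, slightly below the 185 ms end-to-end figure; the text does not say what accounts for the difference. An end-to-end time of 185 ms corresponds to roughly 5.4 images per second.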
In summary, the proposed multi-stage deep learning framework demonstrates superior performance across vertebral body detection, segmentation, and classification tasks. DWT-YOLOv5 significantly enhances detection accuracy and robustness through frequency domain feature decomposition and a dual-branch feature extraction network; PSA-UNet effectively strengthens the modeling capabilities of both global semantics and local details, achieving more precise vertebral region segmentation; while ResNet50 with CBAM further improves the model’s discrimination ability for different fracture grades. Overall results demonstrate that this multi-stage model maintains high detection, segmentation, and classification accuracy in complex X-ray imaging environments. It provides reliable technical support for intelligent diagnosis of vertebral compression fractures and offers new insights for constructing clinical image-assisted diagnostic systems.

4. Discussion

This study proposes a multi-stage deep learning framework for identifying VCFs. Addressing the limitations in detection and classification accuracy of VCFs in X-ray images, it effectively enhances overall recognition performance through multi-level task decomposition and targeted network optimization. Compared to existing detection methods reliant on CT or MRI images, this study innovates by designing a structured multi-stage optimization strategy tailored to address the low resolution and lack of detail inherent in X-ray images. This enables high-accuracy identification while preserving imaging simplicity and cost-effectiveness.
Methodologically, the VCF identification process is divided into three stages: vertebral body detection, fine segmentation, and fracture grading. Structural enhancements are introduced at each stage. During detection, the proposed DWT-YOLOv5 model employs DWT to decompose images into high- and low-frequency components, which capture fine textures and structural features, respectively. This significantly improves detection accuracy and robustness in low-contrast images. During segmentation, the enhanced PSA-UNet model incorporates a PSA module to strengthen the encoder’s feature representation capabilities. This enables the model to simultaneously focus on global semantic information and local edge details, significantly improving segmentation performance in areas with blurred vertebral edges. In the classification stage, the CBAM-enhanced ResNet50 leverages spatial and channel attention mechanisms to enhance the model’s ability to distinguish vertebral morphological variations and fracture severity, thereby further improving classification accuracy and stability.
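The frequency decomposition used in the detection stage can be illustrated with a single-level 2D Haar transform. The text does not state which wavelet or how many decomposition levels DWT-YOLOv5 uses, so the Haar basis, the function name `haar_dwt2`, and the toy image below are assumptions chosen for simplicity.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar DWT of a 2D list with even height and width.

    Returns four half-resolution sub-bands:
      approx   - low-frequency content (local averages)
      row_diff - differences between the top and bottom rows of each 2x2 block
      col_diff - differences between the left and right columns of each 2x2 block
      diag     - diagonal differences
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    approx, row_diff, col_diff, diag = [], [], [], []
    for r in range(0, rows, 2):
        a_row, rd_row, cd_row, d_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            a_row.append((a + b + d + e) / 2.0)
            rd_row.append((a + b - d - e) / 2.0)
            cd_row.append((a - b + d - e) / 2.0)
            d_row.append((a - b - d + e) / 2.0)
        approx.append(a_row)
        row_diff.append(rd_row)
        col_diff.append(cd_row)
        diag.append(d_row)
    return approx, row_diff, col_diff, diag


def energy(band):
    """Sum of squared coefficients of a 2D list."""
    return sum(v * v for row in band for v in row)


# Toy 4x4 image with a vertical intensity edge between the first and second columns.
image = [[10, 50, 50, 50] for _ in range(4)]
approx, row_diff, col_diff, diag = haar_dwt2(image)
print(approx)    # smooth, low-frequency structure
print(col_diff)  # non-zero only where the vertical edge lies
```

In this toy case the edge shows up only in the column-difference band, while the approximation band keeps the coarse intensity layout. This is the separation of fine detail from overall structure that the paragraph attributes to the high- and low-frequency components.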
Experimental results demonstrate that the proposed multi-stage approach outperforms existing mainstream models across multiple key metrics. Notably, in the detection task, DWT-YOLOv5 achieves a mAP of 0.986 and a precision of 0.932, improvements of 1.3 and 10.2 percentage points over the YOLOv5 baseline, respectively, with recall also rising from 0.956 to 0.971 (Table 2). In the segmentation stage, PSA-UNet outperformed the baseline model in mIoU, mPA, and accuracy, demonstrating the effectiveness of attention mechanisms in detail restoration and global structural perception. In the classification stage, ResNet50 with CBAM achieved an overall accuracy of 0.885, with a recall of 90% for both non-fracture and severe fracture identification, outperforming ResNet34, VGG16, and VGG19 (Table 4). These results indicate that integrating attention mechanisms with frequency domain feature fusion strategies effectively enhances diagnostic performance in complex X-ray imaging scenarios.
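The detection improvements quoted above correspond to absolute differences against the Baseline row of Table 2, that is, percentage points rather than relative gains. The short check below makes the distinction explicit; the helper name `improvement` is hypothetical.

```python
def improvement(baseline, ours):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = (ours - baseline) * 100.0
    relative = (ours - baseline) / baseline * 100.0
    return round(points, 1), round(relative, 1)


# Values from Table 2 (Baseline versus Ours).
map_gain = improvement(0.973, 0.986)
precision_gain = improvement(0.830, 0.932)
print(map_gain)        # (1.3, 1.3)
print(precision_gain)  # (10.2, 12.3)
```

For mAP the two readings nearly coincide because the baseline is close to 1, whereas for precision the relative gain (about 12.3%) is noticeably larger than the 10.2-point absolute gain.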
In recent years, numerous studies have focused on the automatic identification of vertebral compression fractures. Dong et al. [16] collected spinal X-ray images from 4461 patients, extracted 100,409 vertebrae, and then used the deep learning model GoogLeNet to classify the vertebrae into two major categories: moderate to severe fractures and normal/trace/mild fractures. On the test set, the model achieved a sensitivity of 59.8%, an F1-score of 0.72, and an area under the precision–recall curve of 0.82. Kim et al. [17] collected anteroposterior and lateral X-ray images from 1507 patients across two hospitals. They employed the EfficientNet-B5 algorithm to classify images as osteoporotic vertebral compression fracture (OVCF) or non-OVCF. Evaluation on the test set yielded AUC values of 0.915 for anteroposterior images and 0.953 for lateral images. Tian et al. [18] collected CT images from 395 patients with lumbar vertebral fractures across two hospitals and employed deep learning methods for binary classification (fracture/non-fracture) of lumbar fractures. They first generated 3D segmentation maps using 3D-VNet, extracted the region of each vertebra from these maps, and then fed the regions into 3D-ResNet for classification. Evaluation on the test set yielded an AUC-PR of 0.89. Yilmaz et al. [19] collected CT images of 145 patients with vertebral compression fractures. They first employed a hierarchical neural network to detect and identify all visible vertebrae within the field of view. Subsequently, a feedforward convolutional neural network was applied to image patches containing a single vertebra to determine the presence of vertebral compression fractures. The model achieved an AUC of 0.742.
Compared with the above literature, our approach uses more economical and widely available X-ray images for vertebral compression fracture classification while achieving finer-grained categorization (Non, Mild, Moderate, Severe). Additionally, our model covers the entire workflow from vertebral localization to detailed grading, which further enhances interpretability. Its accuracy is comparable to that reported for CT- or MRI-based methods. Consequently, this method can serve as an efficient, cost-effective, and interpretable auxiliary diagnostic tool in clinical practice, with the potential to improve physician workflow in early screening and rapid diagnosis.
Despite achieving promising experimental results, this study has certain limitations. Firstly, it used a single-center dataset, with all images originating from the same medical institution and acquired using identical X-ray equipment and acquisition protocols. Although we have enhanced model performance through CLAHE pre-processing, a multi-stage network architecture and attention mechanisms, the model’s generalization ability still requires further validation using data from multiple centers, different equipment and different acquisition protocols. Secondly, the current dataset is relatively limited in size, with few samples of mild fractures, which may result in insufficient generalization performance for minority categories. Furthermore, there is room for optimization regarding the computational complexity of the model; future work should focus on making the model more lightweight to meet the real-time requirements of clinical medical equipment.
In summary, the multi-stage deep learning framework proposed in this study achieves automatic detection and grading of vertebral compression fractures in X-ray images, providing cost-effective and convenient technical support for early clinical screening and auxiliary diagnosis. However, as this study constitutes a single-center validation, there remains some distance to be covered before true clinical deployment. Future work will focus on collecting multi-center, multi-device data to conduct external validation, and on exploring domain adaptation techniques and model compression methods to enhance the system’s robustness and clinical applicability.

5. Conclusions

This study proposes a multi-stage deep learning framework for automated VCF detection and grading using X-ray images. Through task decomposition and tailored network design, our method achieves collaborative optimization across detection, segmentation, and classification. DWT-YOLOv5 enables high-accuracy vertebral detection, PSA-UNet improves boundary segmentation, and CBAM-enhanced ResNet50 delivers reliable four-grade fracture classification. While distinguishing mild from moderate fractures remains challenging, the model performs robustly in identifying non-fractures and severe fractures. This work demonstrates that X-ray imaging, with its inherent advantages of speed, accessibility, and low cost, can support accurate automated VCF diagnosis. Future work will focus on collecting multi-center, multi-device data to conduct external validation, and on exploring domain adaptation techniques and model compression methods to enhance the system’s robustness and clinical applicability.

Author Contributions

Conceptualization, S.D. and Y.S.; methodology, S.D. and Y.S.; investigation, S.D. and Y.D.; data curation, Y.S., S.D. and Y.D.; writing—original draft preparation, S.D.; writing—review and editing, S.D. and Y.S.; validation, S.D. and Y.D.; supervision, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Ningbo Clinical Research Center for Orthopedics and Exercise Rehabilitation through project No. 2024L004, Zhejiang Key Specialties of Clinical Features through project No. 2024021, and Ningbo Municipal Public Welfare Research Plan through grant No. 2024S177.

Data Availability Statement

Due to patient privacy restrictions, the dataset used in this study is not publicly available; however, it is available from the authors upon reasonable request and subject to approval.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nguyen, H.T.; Nguyen, B.T.; Thai, T.H.N.; Tran, A.V.; Nguyen, T.T.; Vo, T.; Mai, L.D.; Tran, T.S.; Nguyen, T.V.; Ho-Pham, L.T. Prevalence, incidence of and risk factors for vertebral fracture in the community: The Vietnam Osteoporosis Study. Sci. Rep. 2024, 14, 32.
  2. Miao, K.H.; Miao, J.H.; Belani, P.; Dayan, E.; Carlon, T.A.; Cengiz, T.B.; Finkelstein, M. Radiological diagnosis and advances in imaging of vertebral compression fractures. J. Imaging 2024, 10, 244.
  3. VanBerlo, B.; Hoey, J.; Wong, A. A survey of the impact of self-supervised pretraining for diagnostic tasks in medical X-ray, CT, MRI, and ultrasound. BMC Med. Imaging 2024, 24, 79.
  4. Al Taha, K.; Lauper, N.; Bauer, D.E.; Tsoupras, A.; Tessitore, E.; Biver, E.; Dominguez, D.E. Multidisciplinary and coordinated management of osteoporotic vertebral compression fractures: Current state of the art. J. Clin. Med. 2024, 13, 930.
  5. Bendtsen, M.G.; Hitz, M.F. Opportunistic identification of vertebral compression fractures on CT scans of the chest and abdomen, using an AI algorithm, in a real-life setting. Calcif. Tissue Int. 2024, 114, 468–479.
  6. Yabu, A.; Hoshino, M.; Tabuchi, H.; Takahashi, S.; Masumoto, H.; Akada, M.; Morita, S.; Maeno, T.; Iwamae, M.; Inose, H. Using artificial intelligence to diagnose fresh osteoporotic vertebral fractures on magnetic resonance images. Spine J. 2021, 21, 1652–1658.
  7. Windsor, R.; Jamaludin, A.; Kadir, T.; Zisserman, A. Automated detection, labelling and radiological grading of clinical spinal MRIs. Sci. Rep. 2024, 14, 14993.
  8. Lee, S.; Kim, H.; Kim, H.; Cho, S. Risk prediction of stereotactic-body-radiotherapy-induced vertebral compression fracture using multi-modal deep learning network. In Proceedings of the Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling, San Diego, CA, USA, 18–22 February 2024; SPIE: Washington, DC, USA, 2024; pp. 472–477.
  9. Wang, Q.; Liu, J.; Ji, Q.; Qiu, Y.; Min, N.; Wang, L.; Zhang, Y. Percutaneous vertebroplasty versus percutaneous kyphoplasty in elderly patients with osteoporotic vertebral compression fractures: Prospective controlled study. BJS Open 2024, 8, zrad162.
  10. Kim, K.C.; Cho, H.C.; Jang, T.J.; Choi, J.M.; Seo, J.K. Automatic detection and segmentation of lumbar vertebrae from X-ray images for compression fracture evaluation. Comput. Methods Programs Biomed. 2021, 200, 105833.
  11. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542.
  12. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
  13. Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR, 2021; pp. 11863–11874.
  14. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise mapping. Neurocomputing 2022, 506, 158–167.
  15. Genant, H.K.; Wu, C.Y.; Van Kuijk, C.; Nevitt, M.C. Vertebral fracture assessment using a semiquantitative technique. J. Bone Miner. Res. 1993, 8, 1137–1148.
  16. Dong, Q.; Luo, G.; Lane, N.E.; Lui, L.-Y.; Marshall, L.M.; Kado, D.M.; Cawthon, P.; Perry, J.; Johnston, S.K.; Haynor, D. Deep learning classification of spinal osteoporotic compression fractures on radiographs using an adaptation of the Genant semiquantitative criteria. Acad. Radiol. 2022, 29, 1819–1832.
  17. Kim, C.; Kang, M.; Yuh, W.T.; Lee, S.-L.; Lee, J.J.; Hou, J.-U.; Kang, S.H. Comparative efficacy of anteroposterior and lateral X-ray based deep learning in the detection of osteoporotic vertebral compression fracture. Sci. Rep. 2024, 14, 28388.
  18. Tian, J.; Wang, K.; Wu, P.; Li, J.; Zhang, X.; Wang, X. Development of a deep learning model for detecting lumbar vertebral fractures on CT images: An external validation. Eur. J. Radiol. 2024, 180, 111685.
  19. Yilmaz, E.B.; Buerger, C.; Fricke, T.; Sagar, M.M.R.; Peña, J.; Lorenz, C.; Glüer, C.-C.; Meyer, C. Automated deep learning-based detection of osteoporotic fractures in CT images. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Strasbourg, France, 27 September 2021; Springer: Cham, Switzerland, 2021; pp. 376–385.
Figure 1. Overall framework of proposed method. First, the original X-ray images undergo preprocessing via Contrast Limited Adaptive Histogram Equalization (CLAHE). Subsequently, the Discrete Wavelet Transform-YOLOv5 (DWT-YOLOv5) object detection network generates preliminary vertebral object bounding boxes. These vertebral object bounding boxes are then fed into Polarized Self-Attention-UNet (PSA-UNet) for fine segmentation. Finally, the resulting fine segmentation images are submitted to Convolutional Block Attention Module-ResNet50 (CBAM-ResNet50) for severity classification.
Figure 2. Image preprocessing. (a) Original image; (b) Pre-processed image.
Figure 3. DWT-YOLOv5 model structures.
Figure 4. High-frequency Network (HNET) and Low-frequency Network (LNET) model structures.
Figure 5. Structures of the PSA-UNet model.
Figure 6. ResNet50 residual module structure after introducing CBAM.
Figure 7. Confusion matrix for CBAM-ResNet50. (a) Count confusion matrix; (b) Normalized confusion matrix.
Table 1. Dataset distribution.
| Types of Fractures | Total | Non | Mild | Moderate | Severe |
| --- | --- | --- | --- | --- | --- |
| Numbers of Vertebrae | 4591 | 2668 | 621 | 624 | 678 |
Table 2. Performance comparison of object detection networks.
| Model | Precision | mAP | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Baseline | 0.830 | 0.973 | 0.956 | 0.930 |
| FasterRCNN | 0.798 | 0.924 | 0.936 | 0.783 |
| YOLOv3 | 0.813 | 0.863 | 0.722 | 0.765 |
| YOLOv4 | 0.827 | 0.849 | 0.801 | 0.814 |
| YOLOv8 | 0.878 | 0.974 | 0.977 | 0.920 |
| YOLOv11 | 0.892 | 0.980 | 0.963 | 0.902 |
| Ours | 0.932 | 0.986 | 0.971 | 0.950 |
Table 3. Statistical comparison of model performance based on 5 × 2 cross-validation paired t-test.
| Metrics | Model | Average Performance | p Value |
| --- | --- | --- | --- |
| mIoU | Baseline | 0.892 | 0.037 |
| | Ours | 0.901 | |
| mPA | Baseline | 0.941 | 0.028 |
| | Ours | 0.947 | |
| Accuracy | Baseline | 0.979 | 0.011 |
| | Ours | 0.987 | |
Table 4. Performance results of various models in identifying the severity of four types of vertebral compression fractures.
| Class | Metric | Baseline | ResNet34 | VGG16 | VGG19 | ConvNeXt | Ours |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Non | Precision | 0.842 | 0.842 | 0.850 | 0.810 | 0.845 | 0.857 |
| Non | Recall | 0.800 | 0.800 | 0.850 | 0.850 | 0.872 | 0.900 |
| Non | F1-score | 0.820 | 0.821 | 0.850 | 0.829 | 0.853 | 0.878 |
| Mild | Precision | 0.714 | 0.700 | 0.703 | 0.667 | 0.710 | 0.714 |
| Mild | Recall | 0.750 | 0.700 | 0.749 | 0.700 | 0.745 | 0.750 |
| Mild | F1-score | 0.732 | 0.700 | 0.721 | 0.683 | 0.729 | 0.732 |
| Moderate | Precision | 0.714 | 0.762 | 0.750 | 0.700 | 0.785 | 0.800 |
| Moderate | Recall | 0.750 | 0.800 | 0.750 | 0.700 | 0.792 | 0.800 |
| Moderate | F1-score | 0.732 | 0.781 | 0.750 | 0.700 | 0.780 | 0.800 |
| Severe | Precision | 0.895 | 0.850 | 0.894 | 1.000 | 0.937 | 0.995 |
| Severe | Recall | 0.850 | 0.850 | 0.850 | 0.800 | 0.853 | 0.900 |
| Severe | F1-score | 0.872 | 0.850 | 0.872 | 0.889 | 0.925 | 0.945 |
| Total | Accuracy | 0.863 | 0.860 | 0.879 | 0.878 | 0.883 | 0.885 |
| Total | Macro-AUC | 0.874 | 0.868 | 0.883 | 0.881 | 0.895 | 0.902 |
Table 5. The impact of the Discrete Wavelet Transform (DWT), Polarized Self-Attention (PSA) and Convolutional Block Attention Module (CBAM) on overall classification performance.
| Model | DWT | PSA | CBAM | Accuracy | Recall | Specificity | F1-Score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | 0.785 | 0.793 | 0.892 | 0.815 |
| DWT | ✓ | | | 0.810 | 0.823 | 0.930 | 0.843 |
| DWT + PSA | ✓ | ✓ | | 0.821 | 0.835 | 0.958 | 0.860 |
| DWT + PSA + CBAM | ✓ | ✓ | ✓ | 0.837 | 0.862 | 0.977 | 0.872 |
Table 6. Accuracy, precision, recall, specificity, F1-score and macro-averages for each category.
| Class | Accuracy | Precision | Recall | Specificity | F1-Score |
| --- | --- | --- | --- | --- | --- |
| Non | 0.900 | 0.984 | 0.998 | 0.950 | 0.992 |
| Mild | 0.750 | 0.790 | 0.750 | 0.982 | 0.769 |
| Moderate | 0.800 | 0.762 | 0.800 | 0.982 | 0.780 |
| Severe | 0.900 | 0.989 | 0.900 | 0.995 | 0.947 |
| Macro average | 0.837 | 0.881 | 0.862 | 0.977 | 0.872 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Duan, S.; Deng, Y.; Song, Y. An Automatic Identification Method for Vertebral Compression Fractures in X-Ray Images Based on Multi-Stage Deep Learning. Electronics 2026, 15, 2626. https://doi.org/10.3390/electronics15122626

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.