A Deep Learning-Based Model for Classifying Osteoporotic Lumbar Vertebral Fractures on Radiographs: A Retrospective Model Development and Validation Study

Early diagnosis and initiation of treatment for fresh osteoporotic lumbar vertebral fractures (OLVF) are crucial. Magnetic resonance imaging (MRI) is generally performed to differentiate between fresh and old OLVF. However, MRIs can be intolerable for patients with severe back pain. Furthermore, it is difficult to perform in an emergency. MRI should therefore only be performed in appropriately selected patients with a high suspicion of fresh fractures. As radiography is the first-choice imaging examination for the diagnosis of OLVF, improving screening accuracy with radiographs will optimize the decision of whether an MRI is necessary. This study aimed to develop a method to automatically classify lumbar vertebrae (LV) conditions such as normal, old, or fresh OLVF using deep learning methods with radiography. A total of 3481 LV images for training, validation, and testing and 662 LV images for external validation were collected. Visual evaluation by two radiologists determined the ground truth of LV diagnoses. Three convolutional neural networks were ensembled. The accuracy, sensitivity, and specificity were 0.89, 0.83, and 0.92 in the test and 0.84, 0.76, and 0.89 in the external validation, respectively. The results suggest that the proposed method can contribute to the accurate automatic classification of LV conditions on radiography.


Introduction
Osteoporotic lumbar vertebral fracture (OLVF) is one of the most common complications in osteoporotic patients. The main factors associated with OLVF are a decrease in bone density, bone quality deterioration, and bone microstructure degeneration caused by osteoporosis [1,2]. Since the risk of osteoporosis increases with age, the number of patients with osteoporosis and those who develop OLVF will continue to increase as life expectancy increases [3][4][5].
As OLVF progresses, chronic severe pain, decreased vertebral body height, a round back, gait difficulty, decreased pulmonary function, and increased mortality significantly lead to a decrease in activities of daily living (ADL) and quality of life (QOL) [6,7]. Although the progression of these symptoms is often stabilized in patients with old OLVF, the prognosis of fresh OLVF is poor unless appropriate intervention is provided early. Therefore, when OLVF is confirmed, it is crucial to confirm the diagnosis of fresh OLVF as early as possible, relieve pain, and prevent the progression of crushing to maintain ADL and QOL in fresh OLVF patients [8][9][10].
Generally, magnetic resonance imaging (MRI) is used to determine whether OLVF is old or fresh [11]. The presence of fresh OLVF is indicated by low signal intensity on T1-weighted images and high signal intensity on T2-weighted and short tau inversion recovery (STIR) images, which reflect vertebral edema. MRI is a highly sensitive and specific method to determine whether an OLVF is old or fresh [12]. MRI can also detect subclinical fractures and identify the site of injury. MRI is an important tool for diagnosing OLVF because it allows for more detailed treatment strategy decisions [13,14]. On the other hand, MRI is a high-cost exam that burdens patients with severe back pain by forcing them to maintain their body position during long examinations. MRI also has drawbacks, such as the difficulty of emergency examinations and the limited number of examinations that can be performed in a day [11,13,15,16]. MRI should therefore only be performed on appropriately selected patients with a high suspicion of fresh OLVF who are most likely to require therapeutic intervention [15].
Currently, when fresh OLVF is suspected, the first-choice imaging examination is radiography, which is superior in image acquisition time, simplicity, low exposure dose, and low cost [10,17,18]. Radiography can screen for the presence of OLVF by capturing morphological changes in the vertebral body. However, due to its characteristics, it is difficult to diagnose fine fractures with few morphological changes immediately after onset and to accurately determine the stage of OLVF even when a decrease in vertebral height is observed [13,15,19,20]. It is also known that the probability of developing multiple OLVFs is higher in patients who have developed OLVF once in the past [21,22]. In patients with multiple OLVFs, especially when there are both old and fresh fractures in a radiograph, it is more difficult to visually identify the causative vertebrae from their morphology and determine whether MRI is indicated. In such cases, an old or fresh OLVF diagnosis may significantly depend on the physician's experience and ability. To optimize the decision of the indication for MRI, it is therefore necessary to improve the accuracy of OLVF screening in radiography and develop a new automatic evaluation method that is not easily influenced by differences in the physicians' experience and ability.
In recent years, deep learning (DL) methods based on network structures called convolutional neural networks (CNNs) have been used to solve various problems in the field of medical imaging. One of their most important features is high image classification performance based on high feature extraction capability [23,24]. In this study, we utilize DL methods of object detection and 3-class classification based on CNNs. The object detection algorithm used in this study is "You Only Look Once" version 5 (YOLOv5). It has been reported that YOLO has high efficiency and detection accuracy and meets the requirements for use in clinical practice [25][26][27][28].
In a previous study that attempted to identify fresh vertebral compression fractures using radiography, the CNN model was designed for a 2-class classification of fresh and old fractures but could not identify normal vertebrae [29]. Therefore, it was necessary to manually select vertebrae with suspected fractures before applying the CNN model. To the best of our knowledge, this is the first study to classify not only fresh and old vertebral fractures but also normal vertebrae. A 3-class classification allows all vertebrae to be included without prior selection. This reduces the burden of selecting the target vertebrae and the risk of missing a fresh fracture in a vertebra other than the target vertebra.
We have developed an efficient evaluation method with high detection capability by combining radiography's quick image acquisition with CNNs' high image classification performance. Using CNNs to accurately detect fresh OLVF in cases that were previously difficult to evaluate visually on radiographs makes it possible to more accurately determine the indication for further examination using MRI, regardless of the experience or specialty of the attending physician. This study, therefore, aims to evaluate our method to automatically determine the presence of OLVF and classify old and fresh OLVF using a CNN model with radiographs.

Materials and Methods
The overview of this study is shown in Figure 1. In this study, we first automatically detected each lumbar vertebra from lateral radiographs. Then, after preliminary image processing, each vertebra was classified into normal, old, or fresh OLVF using three CNNs. Accuracy evaluation using external data was also performed for both (detection and classification) DL methods.

Subjects
Patients who underwent both lumbar vertebrae radiography and MRI were included in this study. Radiographs for our DL method were collected at two institutions (Institutions 1 and 2). Institutional review boards approved this study in both institutions (Institution 1: No. 20-00457, September 2020, and Institution 2: No. 020-0342, March 2021). Informed consent in this retrospective study was obtained from all subjects by the opt-out method. This manuscript was constructed according to the Standards for Reporting Diagnostic Accuracy Studies (STARD) 2015 guidelines [30].

We collected lateral lumbar vertebral radiographs. In Institution 1, if anterior and posterior flexion imaging were acquired in addition to lateral lumbar vertebrae radiography, both images were also included. They were used as sample images to detect each lumbar vertebra automatically. Each lumbar vertebra image, after automatic cropping from lateral radiographs, was used as a sample image for CNN classification. Furthermore, in patients with thoracic vertebrae imaging, lumbar vertebrae included in the lateral thoracic vertebrae radiographs were also used for CNN classification.
In Institution 1, 523 consecutive patients with suspected OLVF who underwent radiography and MRI from March 2010 to December 2021 were included. In patients with fresh OLVF, lumbar vertebrae MRI was performed with a mean of 3.7 ± 17.0 days after radiography. Each vertebra (the first to the fifth lumbar vertebrae) was blinded to patient information and independently visually evaluated by two radiologists (14 and 12 years of experience) and classified as normal, old, or fresh OLVF. When the evaluation by two radiologists did not agree, the classification group was determined by consensus. In addition, for lumbar vertebrae that were determined to be fresh, the radiologists evaluated whether they were OLVF or pathological fractures. Pathological fractures are those resulting from bone weakness caused by primary or metastatic bone tumors.
Ninety-three patients whose visual evaluation showed that all the lumbar vertebrae from the first to the fifth were normal were excluded from this study because they may be different in the presence or absence of osteoporosis, that is, in the background of bone density, compared to patients with OLVF. To further improve the accuracy of the determination of freshness, the date of injury onset was confirmed for all patients with OLVF judged fresh in the visual evaluation. Of the 430 patients with fresh or old OLVF in one or more vertebrae, 12 fresh and three old OLVF patients were excluded due to exclusion reasons such as severe crush, foreign substance, poor positioning, poor image quality, injury onset date unknown, or only pathological fracture (if there was OLVF other than pathological fractures, those patients were included). In this study, the criterion for severe crush was a more than 40% reduction in post-fracture vertebral body height compared to pre-fracture vertebral body height, based on Genant's criteria [31].
Exclusion reason 1 focused only on the condition of the fractured vertebrae in radiographs, while exclusion reason 2 covered all images of each vertebra after cropping from radiographs. As a result, 415 subjects were employed in this study (Figure 2a).
In Institution 2, 140 patients who underwent radiography and MRI from January 2011 to December 2021 and were diagnosed with OLVF in MRI interpretation reports by radiologists in daily practice were included in this study. In patients with fresh OLVF, lumbar vertebrae MRI was performed with a mean of 8.1 ± 10.9 days after radiography was acquired. After image collection, two fresh and one old OLVF patient were excluded based on the same exclusion criteria as in Institution 1. As a result, 137 subjects were employed in this study (Figure 2b).

Image Acquisition
In Institution 1, radiographs were acquired using either the flat panel detector (FPD) of the CALNEO Smart C77 or the CALNEO MT (FUJIFILM Medical Co., Ltd., Tokyo, Japan). CALNEO Smart C77 uses CsI scintillators, and CALNEO MT uses GOS scintillators. For both FPDs, a real grid (8:1 grid ratio) (Mitaya Manufacturing Co., Ltd., Saitama, Japan) was used for scattered radiation removal instead of a scattered radiation correction process such as a virtual grid (FUJIFILM Medical Co., Ltd., Tokyo, Japan). The acquired image size was 10 × 12 inches, the pixel size was 0.15 mm, and the grayscale depth was 14 bits. The source-to-image receptor distance (SID) was 110 cm, the tube voltage was 85 kV, and the current value was automatically determined by the auto exposure control (AEC) system according to the patient's body thickness. The X-ray generator was RAD Speed Pro (SHIMADZU Corporation, Kyoto, Japan). All MRI images were acquired using a 1.5 Tesla MRI system from Ingenia (Philips Healthcare, Best, The Netherlands).
In Institution 2, radiographs were acquired using one of three FPDs. The X-ray generator and FPD in each room are shown in Table S1. Scattered radiation was removed using a real grid with a grid ratio of 8:1 or 10:1. The acquired image size was 14 × 17 inches, the pixel size was 0.15 mm, and the grayscale depth was 14 bits. The SID was 130 cm, the tube voltage was 90 kV, and the AEC system automatically determined the current value. MRI was performed in a total of five rooms. The imaging equipment and magnetic field strength in each room are shown in Table S2.


Vertebral Body Detection with You Only Look Once
The automatic object detection algorithm used in this study was YOLOv5. Five main models are publicly available: YOLOv5n/s/m/l/x. The main differences between the models are detection accuracy and computational load. In this study, YOLOv5x (the largest) was selected and fine-tuned in training. Training and validation were conducted using 5728 training images and 1432 validation images (8:2 ratio), obtained by a 10-fold augmentation of 716 radiographs from 415 patients in Institution 1. The image augmentation was performed using Imgaug, a Python library. Details of the image augmentation process are shown in Table 1. A total of six image processing steps were combined to create the processed image. The intensity of each process was randomly determined between the maximum and minimum values. Eighty radiographs were randomly selected out of 137 radiographs from 137 patients in Institution 2 and used for the test. One radiological technologist manually set the ground truth bounding box using the free software labelImg. All sample images were converted to 8-bit PNG images of 640 × 640 pixels. The following parameters were determined by hyperparameter evolution, a method of hyperparameter optimization using a genetic algorithm included in the YOLOv5 system: epochs, 300; batch size, 4; initial learning rate, 0.00967; momentum, 0.92755; weight decay, 0.00057.
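The image counts and the random choice of augmentation intensity described above can be illustrated with a minimal sketch. The function names and the example ranges below are hypothetical (they are not the values of Table 1), and drawing each intensity from a uniform distribution is an assumption; the study itself used Imgaug.

```python
import random

def split_counts(n_originals, copies_per_original, train_parts=8, val_parts=2):
    """Return (training, validation) image counts after augmentation."""
    total = n_originals * copies_per_original
    n_train = total * train_parts // (train_parts + val_parts)
    return n_train, total - n_train

def sample_intensities(ranges, rng):
    """Draw one intensity per processing step within its (min, max) range.

    Uniform sampling is an assumption made for this sketch.
    """
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}

# Counts reported in the text: 716 radiographs, 10 images generated from each.
n_train, n_val = split_counts(716, 10)
print(n_train, n_val)  # 5728 1432

# Hypothetical ranges for illustration only, NOT the settings of Table 1.
EXAMPLE_RANGES = {"rotation_deg": (-10.0, 10.0), "brightness_scale": (0.8, 1.2)}
print(sample_intensities(EXAMPLE_RANGES, random.Random(0)))
```

The first call reproduces the reported 5728 training and 1432 validation images from the 8:2 split of 7160 augmented images.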

Sample Creation
Each lumbar vertebra was automatically cropped based on the bounding box coordinates detected by YOLOv5. After cropping, histogram flattening (histogram equalization) was applied. Each image was then resized and padded as necessary to 166 (W) × 140 (H) pixels (Figure 3). In this sample creation phase, 99 and 23 sample images were excluded from Institutions 1 and 2, respectively, due to the adverse conditions shown in Figure 2.
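The two preprocessing steps named above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the text does not say whether the aspect ratio was preserved or where padding was placed, so the sketch assumes aspect-preserving scaling with centered padding, and the function names are hypothetical.

```python
def equalize_histogram(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray levels in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of gray levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to spread out
        return list(pixels)
    return [round((cdf[p] - cdf_min) / (n - cdf_min) * (levels - 1)) for p in pixels]

def fit_with_padding(width, height, target_w=166, target_h=140):
    """Scale a crop to fit the target size and report the padding needed.

    Returns (new_w, new_h, (pad_left, pad_right), (pad_top, pad_bottom)).
    Aspect-preserving scaling and centered padding are assumptions.
    """
    scale = min(target_w / width, target_h / height)
    new_w = min(target_w, max(1, round(width * scale)))
    new_h = min(target_h, max(1, round(height * scale)))
    pad_x, pad_y = target_w - new_w, target_h - new_h
    return new_w, new_h, (pad_x // 2, pad_x - pad_x // 2), (pad_y // 2, pad_y - pad_y // 2)

# A 332 x 200 pixel crop is halved and padded vertically to reach 166 x 140.
print(fit_with_padding(332, 200))  # (166, 100, (0, 0), (20, 20))
```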

Datasets Creation and CNN Classification
In this study, a 3-class CNN classification was performed.All sample images are classified into normal, old, or fresh OLVF groups.
In Institution 1, 228, 68, and 52 sample images, or about 1/10 of the total images in the normal, old, and fresh OLVF groups, respectively, were set aside for testing. The numbers of set-aside old and fresh OLVF images were then tripled and quadrupled, respectively, to resolve the imbalance in the number of images among the groups. As a result, 228, 204, and 208 sample images were prepared in the normal, old, and fresh OLVF groups, respectively, as the test dataset images. After the test dataset images were set aside, a total of 6833 sample images (2056 normal, 2432 old, and 2345 fresh sample images) were divided in the ratio of training 8: validation 2. In Institution 2, no augmentation of the number of sample images was performed. As a result, 436, 135, and 91 sample images were prepared in the normal, old, and fresh OLVF groups, respectively. The sample images in Institution 2 were used for external validation (Figure 4).
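The bookkeeping behind these counts can be checked with a short sketch. The per-class totals (2284, 676, and 521) are those reported for Institution 1 in the Results. The multipliers for the training and validation pool are not stated in the text; the sketch only derives the factors implied by the reported counts.

```python
# Per-class vertebra images from Institution 1 (reported in the Results).
totals = {"normal": 2284, "old": 676, "fresh": 521}
# Images set aside for the test dataset (about 1/10 of each class).
held_out = {"normal": 228, "old": 68, "fresh": 52}
# Stated multipliers for the test set: old tripled, fresh quadrupled.
test_factor = {"normal": 1, "old": 3, "fresh": 4}
# Reported training + validation counts after augmentation.
reported_train_val = {"normal": 2056, "old": 2432, "fresh": 2345}

test_counts = {k: held_out[k] * test_factor[k] for k in totals}
remaining = {k: totals[k] - held_out[k] for k in totals}
# Factors implied by the reported counts (derived here, not stated in the text).
implied_factor = {k: reported_train_val[k] / remaining[k] for k in totals}

print(test_counts)     # {'normal': 228, 'old': 204, 'fresh': 208}
print(remaining)       # {'normal': 2056, 'old': 608, 'fresh': 469}
print(implied_factor)  # {'normal': 1.0, 'old': 4.0, 'fresh': 5.0}
print(sum(reported_train_val.values()))  # 6833
```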
A total of four datasets were prepared: training, validation, test, and external validation. The training datasets were used to create the model to automatically determine the presence of OLVF and classify old and fresh OLVF, while the validation datasets were used to adjust the hyperparameters. The test dataset was used to evaluate the classification performance by using images with the same characteristics as those used for training and validation, while external validation was an evaluation of the classification performance on completely unknown images. The important point is that the images in the test and external validation datasets were not used for training or validation. The robustness of the model created in this study was evaluated in more detail by also performing external validation. The parameter settings were the same as in YOLOv5 training (Table 1). Examples of processed images are shown in Figure 5.
The CNNs output the probability that the input image is normal, old, or fresh OLVF. The class with the highest sum of the predictive probabilities output by three CNNs for each classification group was determined as the result of the CNN classification.
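The decision rule described above, summing the class probabilities of the three CNNs and choosing the class with the largest sum, can be sketched as follows. The function name and the example probabilities are hypothetical; ties, which the text does not address, are resolved here in favor of the first class in the list.

```python
CLASSES = ("normal", "old", "fresh")

def ensemble_predict(per_model_probs):
    """Sum the class probabilities from each model and return the top class.

    per_model_probs: one [p_normal, p_old, p_fresh] list per CNN.
    Returns (predicted_class, summed_probabilities).
    """
    sums = [sum(model[i] for model in per_model_probs) for i in range(len(CLASSES))]
    best = max(range(len(CLASSES)), key=lambda i: sums[i])
    return CLASSES[best], sums

# Illustrative outputs of three CNNs for a single vertebra image.
probs = [
    [0.6, 0.3, 0.1],  # first model favors "normal"
    [0.2, 0.5, 0.3],  # second model favors "old"
    [0.1, 0.6, 0.3],  # third model favors "old"
]
label, sums = ensemble_predict(probs)
print(label)  # old
```

Summing probabilities rather than counting votes lets a confident model outweigh two uncertain ones.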

CNN Model
An ensemble model using three CNNs was employed in this study. The CNNs used were ResNet-50, DenseNet-161, and ResNeXt-50. Each CNN was initialized with weights pre-trained on ImageNet, a large image dataset, in Neural Network Console (NNC) (Sony Network Communications Inc., Tokyo, Japan). Each pre-trained CNN model is available at https://nnabla.readthedocs.io/en/latest/python/api/models/imagenet.html (accessed on 15 May 2021). Before the training process, three layers were inserted just below the input layer in each CNN. The first is the "Broadcast" layer, which changes the number of color channels of the input images from 1 to 3. This allows grayscale images to be used as input to CNNs trained with color images. The second is the "MulScalarX" layer, which divides pixel values by 255 to normalize them. The third is the "ImageAugmentation" layer, which artificially increases the number of sample images to reduce overfitting due to an insufficient number of images. This image augmentation process included scaling, rotation, brightness, and contrast changes. The details of each augmentation process are shown in Table S3. The learning rates of ResNet-50, DenseNet-161, and ResNeXt-50 were set to 0.01, 0.001, and 0.01, respectively. The following training parameters were common to all three CNNs: epochs, 100; batch size, 4; optimizer, Nesterov.

Statistical Analysis
Quantitative data were expressed as the mean and standard deviation. Interobserver and intraobserver agreement in the visual classification of vertebral condition was evaluated by weighted kappa values, interpreted using the Landis and Koch criteria (0.00-0.20: slight agreement, 0.21-0.40: fair agreement, 0.41-0.60: moderate agreement, 0.61-0.80: substantial agreement, 0.81-1.00: almost perfect agreement) [32]. The second visual evaluation for the intraobserver calculation was performed more than one month after the first assessment, using radiographs of 50 randomly selected patients from a total of 523 patients. All weighted kappa values were calculated using R (version 4.2.0) and the package "irr" (version 0.84.1).
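For readers unfamiliar with the statistic, a weighted kappa can be computed as in the sketch below. This is not the R "irr" code used in the study. The text does not state the weighting scheme, so linear weights are assumed by default (squared weights are optional), and the ordering normal < old < fresh is also an assumption of this illustration. The ratings are invented.

```python
def weighted_kappa(rater_a, rater_b, categories, squared=False):
    """Weighted kappa for two raters over ordered categories.

    Disagreement weights are |i - j| / (k - 1), optionally squared.
    Raises ZeroDivisionError if both raters use only one shared category.
    """
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        d = abs(i - j) / (k - 1)
        return d * d if squared else d

    obs = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    exp = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - obs / exp

ORDER = ("normal", "old", "fresh")  # assumed ordinal ordering
a = ["normal", "normal", "old", "old", "fresh", "fresh"]
b = ["normal", "old", "old", "old", "fresh", "normal"]
print(round(weighted_kappa(a, b, ORDER), 3))  # 0.4
```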
The detection accuracy of the YOLOv5 model developed in this study was evaluated by the mean average precision (mAP). The mAP is the score representing the degree of agreement between the coordinates of the detected bounding box and the ground truth bounding box and is defined as follows:

$$\mathrm{mAP} = \frac{1}{n}\sum_{k=1}^{n} \mathrm{AP}_k$$

where AP_k is the average precision of class k and n represents the number of classes. In this study, the mAP (0.5) and the mAP (0.5: 0.95) were used as evaluation indices. The mAP (0.5) is the mAP when the intersection over union (IoU) threshold is set to 0.5, and the mAP (0.5: 0.95) is the average of the mAP values obtained by changing the IoU threshold from 0.5 to 0.95 in 0.05 steps. IoU is an index indicating the degree of overlap between the bounding box detected by YOLO and the ground truth bounding box and is calculated by dividing the area of intersection of the two regions by the area of their union.
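The IoU computation and the threshold sweep described above can be sketched as follows. The box format (x_min, y_min, x_max, y_max) and the function names are assumptions for illustration; the full average precision calculation performed by YOLOv5 is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_average_precision(ap_per_class):
    """Average the per-class AP values (the mean over n classes)."""
    return sum(ap_per_class) / len(ap_per_class)

# IoU thresholds used for mAP (0.5: 0.95): 0.50, 0.55, ..., 0.95.
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

detected = (0, 0, 10, 10)
truth = (5, 0, 15, 10)
score = iou(detected, truth)  # overlap 50, union 150
print(round(score, 3))  # 0.333
print([t for t in THRESHOLDS if score >= t])  # []
```

In this example the detection would not count as a match at any threshold, since its IoU of about 0.33 is below 0.5.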
For CNN classification, we evaluated the classification performance using 5-fold cross-validation. In this method, the datasets are divided into five groups; one group serves as the validation data, the remaining four groups serve as the training data, and the classification performance is evaluated. Each of the five groups is assigned to the validation data in turn. Training and validation were performed five times for each CNN, once per cross-validation fold, for a total of 15 results.
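The fold assignment described above can be sketched as follows. Shuffling before splitting and the fixed seed are assumptions of this illustration; the text does not describe how images were assigned to the five groups.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

# 6833 training/validation images, three CNNs, five folds each.
splits = list(k_fold_indices(6833, k=5))
n_models = 3
print(len(splits) * n_models)  # 15
print([len(val) for _, val in splits])  # [1367, 1367, 1367, 1366, 1366]
```

Each image appears in the validation data exactly once across the five folds, which keeps the validation results for each fold independent of that fold's training data.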
The accuracy, sensitivity, specificity, false positive rate, and false negative rate were calculated for the classification performance. Each metric was calculated by considering one of normal, old, or fresh OLVF as positive and the other two as negative. The name of the group considered positive was appended to each metric. The overall classification performance is the average of the values calculated when each group is considered positive. For example, when normal is considered positive, old and fresh OLVF are considered negative, and the accuracy is shown as accuracy_normal. Accuracy_all is the average of accuracy_normal, accuracy_old, and accuracy_fresh.
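The one-versus-rest calculation and the averaging over the three groups can be sketched as follows. The confusion matrix is invented for illustration and is not the study's data; the function names are hypothetical.

```python
def one_vs_rest_metrics(confusion, positive):
    """Metrics with one class as positive; confusion[true][predicted] holds counts."""
    k = len(confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive]) - tp
    fp = sum(confusion[t][positive] for t in range(k)) - tp
    total = sum(sum(row) for row in confusion)
    tn = total - tp - fn - fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

def overall_metrics(confusion):
    """Average each metric over the classes (the '_all' values)."""
    per_class = [one_vs_rest_metrics(confusion, c) for c in range(len(confusion))]
    return {name: sum(m[name] for m in per_class) / len(per_class) for name in per_class[0]}

# Illustrative counts only; rows are true classes, columns are predictions
# in the order normal, old, fresh.
EXAMPLE = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]
normal = one_vs_rest_metrics(EXAMPLE, 0)
print(normal["sensitivity"], normal["specificity"])  # 0.8 0.9
print(overall_metrics(EXAMPLE)["accuracy"])
```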
In addition, receiver operating characteristic (ROC) curves were plotted, and the area under the curve (AUC) values were calculated using scikit-learn, a Python library.
The 95% confidence intervals (CIs) for the classification performance and AUC were calculated using the Python libraries scikit-learn and statsmodels, respectively.

Results
In Institution 1, a total of 716 lateral lumbar vertebrae radiographs of 415 OLVF patients were employed. The subjects consisted of 280 fresh OLVF patients with a mean age of 78.5 ± 11.4 years and 135 old OLVF patients with a mean age of 77.1 ± 9.3 years. No significant difference in mean age between fresh and old OLVF patients was observed (p > 0.05). Of the 280 fresh OLVF patients, 200 (71.4%) underwent radiography within 14 days of injury onset.
In Institution 2, a total of 137 lateral lumbar vertebrae radiographs of 137 OLVF patients were employed. The subjects consisted of 77 fresh OLVF patients with a mean age of 69.6 ± 13.7 years and 60 old OLVF patients with a mean age of 70.8 ± 11.1 years. No significant difference in mean age between fresh and old OLVF patients was observed (p > 0.05). Of the 77 fresh OLVF patients, 48 (62.3%) underwent radiography within 14 days of injury onset.

Agreement Rate in the Visual Evaluation of Each Vertebral Body
The interobserver agreement value for visual evaluation by the two radiologists was 0.801. The intraobserver agreement values for raters 1 and 2 were 0.821 and 0.861, respectively. The agreement in all evaluations was almost perfect. Consensus was required in 141 of 523 cases because the evaluations of at least one vertebra were inconsistent.
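This section does not name the agreement statistic. Assuming it is Cohen's kappa, which the phrase "almost perfect" suggests but does not confirm, the calculation for two sets of three-class ratings can be sketched as follows. The function name and the example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of six vertebrae (illustration only).
ratings_1 = ["normal", "normal", "old", "fresh", "fresh", "normal"]
ratings_2 = ["normal", "normal", "old", "old", "fresh", "normal"]

if __name__ == "__main__":
    print(cohen_kappa(ratings_1, ratings_2))
```

The same function covers intraobserver agreement if the two lists are one rater's first and second readings.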

YOLOv5
The detection performance of YOLOv5 for lumbar vertebrae was an mAP (0.5) of 0.995 and an mAP (0.5:0.95) of 0.993 for the validation dataset, and an mAP (0.5) of 0.982 and an mAP (0.5:0.95) of 0.835 for the test dataset. Using automatic cropping based on the bounding boxes detected by YOLOv5, vertebra images of 2284 normal, 676 old, and 521 fresh OLVF were produced in Institution 1. Similarly, vertebra images of 436 normal, 135 old, and 91 fresh OLVF were produced in Institution 2. The breakdown of vertebral body numbers is shown in Table 2.
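As conventionally defined for object detection, mAP (0.5) counts a detection as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5, and mAP (0.5:0.95) averages the result over IoU thresholds from 0.5 to 0.95 in steps of 0.05. The sketch below shows only the IoU criterion that underlies both numbers. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration, not the YOLOv5 API.

```python
# IoU thresholds averaged in mAP (0.5:0.95): 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, true_box, threshold=0.5):
    """True if the predicted box overlaps the ground truth enough at this threshold."""
    return iou(pred_box, true_box) >= threshold


if __name__ == "__main__":
    predicted = (12, 40, 112, 120)   # hypothetical detected vertebra box
    annotated = (10, 42, 110, 118)   # hypothetical ground-truth box
    for t in IOU_THRESHOLDS:
        print(t, is_correct_detection(predicted, annotated, t))
```

Because the stricter thresholds demand tighter boxes, mAP (0.5:0.95) is never higher than mAP (0.5), which matches the pattern of the values reported above.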

Classification Performance by CNNs
The confusion matrix for this classification is shown in Figure 6. The classification performance was as follows: the accuracy all, sensitivity all, specificity all, false positive rate all, false negative rate all, and AUC all were 0.894 [...]. A high accuracy all of 0.86 or higher was achieved in both datasets. The ROC curves for each dataset are shown in Figure 7.

The classification performances in both datasets are summarized in Table 3.

Table 3. The summary of each classification performance.
learning each time. The other issue is that it is impossible to show physicians the basis for the CNN classification.

The limitations of this study are as follows. First, cases of lumbar vertebrae with deformation/crush, severe scoliosis, and metal implants were excluded. If lumbar vertebrae with such characteristics are input into the CNN created in this study, it may be unable to output the correct diagnosis because it was not trained on such images. Second, since this study targeted OLVF, it is unclear whether high classification accuracy can be guaranteed for pathological fractures caused by bone metastasis. It may be difficult to use this CNN at facilities with many pathological fracture cases. Third, the number of sample images is limited. To create a more accurate and versatile network, it is necessary to increase the number of samples, collect image data from more facilities, and conduct training using images with various characteristics.

Conclusions
The proposed CNN-based method demonstrated high performance in determining the presence of OLVF and classifying old or fresh OLVF on radiography. Utilizing objective classification results from our CNN is expected to improve the accuracy of fresh OLVF screening. This may lead to appropriate decisions on the indication for close examination with MRI.

Figure 1. Overview of this study. LV, lumbar vertebrae; Normal, normal vertebra; Old, old osteoporotic lumbar vertebral fractures; Fresh, fresh osteoporotic lumbar vertebral fractures.

Figure 2. The number of subjects in Institutions 1 (a) and 2 (b).

Figure 3. The flow of sample image creation. The vertebral number and the confidence value of automatic detection accompany the bounding boxes detected by YOLOv5.

Figure 4. Dataset creation for the CNN classification.

Figure 5. Examples of images processed by imgaug. A total of six image processing steps were combined to create the processed image.

Figure 6. The confusion matrix in CNN classification in the test (a) and external validation (b) datasets. Normal, normal vertebra; Old, old osteoporotic lumbar vertebral fractures; Fresh, fresh osteoporotic lumbar vertebral fractures.

Figure 7. ROC curves for the test (a) and external validation (b) datasets. Normal, normal vertebra; Old, old osteoporotic lumbar vertebral fractures; Fresh, fresh osteoporotic lumbar vertebral fractures.

Table 1. Details of the image augmentation process by imgaug.

Table 2. The breakdown of each vertebral body number in this study.