Effect of Patient Clinical Variables in Osteoporosis Classification Using Hip X-rays in Deep Learning Analysis

Background and Objectives: A few deep learning studies have reported that combining image features with patient variables enhanced identification accuracy compared with image-only models. However, previous studies have not statistically reported the additional effect of patient variables on image-only models. This study aimed to statistically evaluate the osteoporosis identification ability of deep learning by combining hip radiographs with patient variables. Materials and Methods: We collected a dataset containing 1699 images from patients who underwent bone mineral density measurement and hip radiography at a general hospital from 2014 to 2021. Osteoporosis was assessed from hip radiographs using convolutional neural network (CNN) models (ResNet18, 34, 50, 101, and 152). We also investigated ensemble models in which patient clinical variables were added to each CNN. Accuracy, precision, recall, specificity, F1 score, and area under the curve (AUC) were calculated as performance metrics. Furthermore, we statistically compared the accuracy of the image-only model with that of an ensemble model combining images and patient factors, including the effect size for each performance metric. Results: All metrics were improved in the ResNet34 ensemble model compared with the image-only model. The AUC score of the ensemble model was significantly improved compared with that of the image-only model (difference 0.004; 95% CI 0.002–0.0007; p = 0.0004; effect size: 0.871). Conclusions: This study revealed the additional effect of patient variables in the identification of osteoporosis using deep CNNs with hip radiographs. Our results provide evidence that patient variables had additive synergistic effects on the images in osteoporosis identification.


Introduction
Osteoporosis is a socially important disease with a high incidence in the aging society and is one of the risk factors for fragility fractures [1,2]. The global standard test for diagnosing osteoporosis is estimating bone mineral density (BMD) at the proximal femur and lumbar spine using dual-energy X-ray absorptiometry (DXA). The disadvantages of DXA include potential measurement errors and uncertainty caused by the nearby soft tissues [3], radiation exposure, and high medical costs [4].
Attempts to diagnose osteoporosis via different approaches with other modalities, such as bone morphology and bone parameters based on X-rays have been reported [5,6]. Recent review articles have reported that artificial intelligence (AI) technology developments have led to efficient applications in osteoporosis identification [7,8]. A few studies have reported osteoporosis identification analysis from hip radiographs with machine learning or deep learning (DL) [9][10][11]. Yamamoto et al. reported that convolutional neural network (CNN) models diagnosed osteoporosis for hip radiographs with high accuracy, and the diagnostic ability improved further with the addition of clinical patient variables [11].
In clinical settings, clinicians consider patient factors, examine the images, consider differential diagnoses, and reach a definitive diagnosis. Throughout this decision process, clinicians use patient factors to estimate and enhance the pre-test probability. Similarly, diagnostic studies using DL have reported that diagnostic accuracy is higher when patient variables and images are combined [12]. However, most studies reported improvement only as the simple subtraction of diagnostic accuracies [13][14][15][16]. Moreover, few studies have made such comparisons using statistical methods [17]. To our knowledge, previous studies have not statistically reported the additional effect of patient variables on image-only models in osteoporosis identification using AI.
We aimed to compare the diagnostic ability for osteoporosis of DL using hip radiographs alone and in combination with patient variables. We hypothesized that combining image features with patient variables would enhance the diagnostic ability for osteoporosis with a statistical difference. Such a significant difference would clarify the importance of adding patient variables and contribute to the future development of AI diagnostic research in osteoporosis.

Study Design
This study was a single-center retrospective study of DL identification accuracy. The aim was to identify osteoporosis from a dataset of segmented hip radiographs using several residual neural networks (ResNets), a type of CNN. Supervised learning was selected as the DL method. We compared the identification accuracy of DL using hip radiographs only with that of ensemble models in which clinical variables extracted from clinical records were added to the dataset.

Data Collection
Clinical and imaging data from March 2014 to February 2021 were used retrospectively. The subjects of this study were 1699 consecutive patients aged 60 years or older who underwent hip radiography and received DXA at our hospital within 6 months before or after the date of hip radiography.

Data Preprocessing
Simple hip radiographs of each patient were used to acquire the digital images. All digital images were output in tagged image file format (TIFF) (size: 2836 × 2373, 2836 × 2336, and 2832 × 2836 pixels) from our hospital's picture archiving and communication system (HOPE Dr ABLE-GX, FUJITSU Co., Tokyo, Japan). From these images, we segmented the hip joint area. Each image was processed by one of six orthopedic surgeons under the supervision of an orthopedic expert. The surgeons manually cropped the areas of interest in the hip radiographs using Photoshop Elements (Adobe Systems, Inc., San Jose, CA, USA). An appropriate cropping range was selected for each hip image, and the side of the hip measured using DXA was selected as the cropped side. The cropping method was the same as that used in our previous study [11]. As with the DXA measurement, the region from the line of the femoral head to the lower edge of the lesser trochanter was selected and cropped. The cropped areas completely imitated the osteoporosis identification range obtained using the DXA method (Figure 1). Cropped images were saved in portable network graphics (PNG) format. All orthopedic surgeons who performed the cropping were blinded to the patients' BMD status.

Identification of Osteoporosis
In this study, osteoporosis was diagnosed from the hip joint using the DXA method. The parameters investigated were the automatically generated BMD (g/cm²) and T-score, measured at the hip using DXA (HOLOGIC Horizon-A, Apex software version 13.6.0.4, Bedford, MA, USA) by trained personnel following identical measurement routines. Standard position measurements were adopted, and the scanned images complied with the following criteria [18]: the hip joint is located in the center of the image, with an internal rotation of 15° to 25°, and with the femoral neck, head, and greater trochanter completely within the image. The measurement was normally performed at the left hip; when the left hip had a high degree of deformity or a metal implant, the right hip was selected.
Osteoporosis was diagnosed when the T-score obtained by DXA was −2.5 or lower, according to the World Health Organization diagnostic criteria [19].
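As a minimal illustration, the WHO threshold above can be expressed as a labeling rule (the function name is our own, not taken from the study code):

```python
# Hypothetical labeling helper illustrating the WHO diagnostic criterion:
# a hip T-score of -2.5 or lower is labeled as osteoporosis.
def label_osteoporosis(t_score: float) -> int:
    """Return 1 (osteoporosis) when the DXA T-score is <= -2.5, else 0."""
    return 1 if t_score <= -2.5 else 0
```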

Figure 1. Crop method as data preprocessing. The manually cropped area perfectly imitated the osteoporosis identification range obtained using the dual-energy X-ray absorptiometry (DXA) method.

Clinical Variables
Patients in the high-risk group for osteoporosis are generally female, older, and have a lower body mass index (BMI) [20]. Although there are many other patient variables, age, gender, and BMI were selected in this study as easily identifiable patient factors. BMI was calculated by dividing the weight in kilograms by the square of the height in meters (kg/m²). Weight and height were recorded at the same time as the BMD measurement. Table 1 shows the demographic characteristics of the patients included in this study.

CNN Architecture
In this study, the DL analysis was performed using the standard CNN model ResNet [21], proposed by He et al. The residual learning mechanism that characterizes ResNet is a common, easy-to-optimize, and effective training method for deep CNN architectures. In addition, it mitigates the accuracy degradation that occurs as layers become deeper; a typical ResNet contains 18, 34, 50, 101, or 152 layers.
For model construction, it is effective to use the weights of an existing model as the initial values for additional learning and fine-tuning [22]. Therefore, all ResNet CNNs were trained using transfer learning with fine-tuning, employing pre-trained weights from the ImageNet database [23]. The DL analysis was implemented in Python using the PyTorch DL framework.

Architecture of the Ensemble Model
In addition to the DL analysis using hip joint image data only, we constructed an ensemble model to which the patients' clinical variables were added. In preparation for the DL analysis, we preprocessed the patients' structured data. Age and BMI were mean-normalized, and gender was converted to a one-hot vector representation. As a result, a 1 × 4-dimensional vector was created. The flattened one-dimensional features extracted from the CNN convolutional layers were concatenated with the 1 × 4-dimensional vector created from the structured data. The combined image and clinical-variable data were then passed through a fully connected layer. The final osteoporosis identification prediction was output using the rectified linear unit activation function (Figure 2).

Data Augmentation
In this study, several data augmentation techniques were adopted to prevent overfitting. Augmentation was applied only to the training image data, at the time the images were retrieved in batches. Each training image was randomly rotated in the range of −25 to +25 degrees and flipped with 50% vertical and 50% horizontal probability. Brightness and contrast were randomly changed by −5% to +5%. Each training image was processed with a 50% chance of data augmentation.

Dataset
The CNN models were trained using k-fold cross-validation. The images selected for the dataset were split using a stratified k-fold, which divides the training, validation, and test data while maintaining the correct label percentages. The training algorithm used k = 4 to avoid overfitting and bias and to minimize the generalization error. The test data consisted of 425 images. In each fold, the remaining data were randomly divided into separate training and validation sets at a ratio of 8:1. The validation dataset was independent of the training folds and was used to assess the training status. After completing one model-training step, the same procedure was repeated four times, each time with different test data.
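Under the stated assumptions (1699 images, stratified 4-fold, 8:1 train:validation within each fold), the split could be sketched with scikit-learn as follows (the label counts are dummy values for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

labels = np.array([0] * 1000 + [1] * 699)   # dummy labels; 1699 images total
indices = np.arange(len(labels))

# Stratified 4-fold: each test fold preserves the label percentages.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

folds = []
for train_val_idx, test_idx in skf.split(indices, labels):
    # Split the remaining data 8:1 into training and validation sets,
    # again stratified on the labels.
    train_idx, val_idx = train_test_split(
        train_val_idx,
        test_size=1 / 9,
        stratify=labels[train_val_idx],
        random_state=0,
    )
    folds.append((train_idx, val_idx, test_idx))
```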

Identification Process of the DL System
Each ResNet model was trained and analyzed on a 64-bit Ubuntu 16.04.5 LTS operating system with 8 GB of memory and an NVIDIA GeForce GTX 1080 (NVIDIA Co., Santa Clara, CA, USA) graphics processing unit with 8 GB of graphics memory. Regarding the hyperparameters, stochastic gradient descent was used as the optimizer, with a learning rate of 0.001 and a momentum of 0.9. All images were resized to 128 × 128 pixels. All models were trained for a maximum of 100 epochs. An early stopping method was adopted to prevent overfitting: training was stopped if the validation error did not improve for 15 consecutive epochs.
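The early-stopping rule described above (a patience of 15 epochs on the validation error) can be sketched as a simple counter (our own illustration):

```python
class EarlyStopping:
    """Stop training when the validation error has not improved for
    `patience` consecutive epochs (15 in this study)."""

    def __init__(self, patience: int = 15):
        self.patience = patience
        self.best = float("inf")
        self.counter = 0

    def step(self, val_error: float) -> bool:
        """Record one epoch's validation error; return True to stop."""
        if val_error < self.best:
            self.best = val_error    # new best: reset the counter
            self.counter = 0
        else:
            self.counter += 1        # no improvement this epoch
        return self.counter >= self.patience
```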

Performance Metrics
The accuracy, precision, recall, specificity, and F1 score on the test dataset were calculated from the confusion matrix as performance metrics. In addition, the area under the curve (AUC) was measured from the receiver operating characteristic curve; the AUC reflects the classifier's ability to avoid misidentification across decision thresholds.
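For a binary classifier, these confusion-matrix metrics are computed as follows (a self-contained sketch; the AUC itself requires the full score distribution and is omitted here):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts:
    tp/fp/tn/fn = true/false positives and true/false negatives."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}
```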

Statistical Analysis
The differences between the image-only and ensemble model performance metrics were evaluated using the JMP Statistics Software Package Version 14.2.0 for Macintosh (SAS Institute Inc., Cary, NC, USA). The significance level was set at p < 0.05. Parametric tests were selected based on the results of the Shapiro-Wilk test. The difference between the CNN model using images only and the ensemble model with patient variables added was calculated for each performance metric using the t-test. Effect sizes were calculated for each comparison and classified as follows: 0.2 indicated a small effect, 0.5 a medium effect, and 0.8 a large effect [24].

Performance of Image-Only Models
Table 2 shows the performance metrics of each ResNet model using only hip radiographic images. ResNet152 scored the highest in accuracy, AUC score, precision, and F1 score. Recall and specificity were highest for ResNet101 and ResNet50, respectively.

Performance of Ensemble Models
The highest accuracy and AUC score were achieved by ResNet50, the highest precision and specificity by ResNet34, the highest recall by ResNet152, and the highest F1 score by ResNet101 (Table 3).
Table 4 shows the differences between the radiographic image-only and ensemble models for the respective performance metrics, calculated as the ensemble-model value minus the image-only-model value. The AUC improved for all ResNets. ResNet34 improved in all performance metrics, including accuracy, with the addition of patient variables. We therefore compared the radiographic image-only and ensemble models for each performance metric in ResNet34. Table 5 shows the results of the 4-fold cross-validation evaluation performed 30 times. For the AUC, the ensemble model was significantly improved over the image-only model. The effect size for the AUC was 0.871, which is classified as a large effect.
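The model comparison reported above (normality check, t-test over repeated cross-validation runs, and effect size) could be sketched as follows with SciPy. We assume paired samples and Cohen's d for paired data, since the exact procedure is not fully specified, and the numbers below are synthetic:

```python
import numpy as np
from scipy import stats

def compare_models(image_only: np.ndarray, ensemble: np.ndarray):
    """Compare one metric (e.g., AUC) across repeated CV runs of two models."""
    diff = ensemble - image_only
    _, normality_p = stats.shapiro(diff)                # Shapiro-Wilk on differences
    _, p_value = stats.ttest_rel(ensemble, image_only)  # paired t-test
    effect_size = diff.mean() / diff.std(ddof=1)        # Cohen's d (paired)
    return normality_p, p_value, effect_size

# Synthetic example: 30 repeated runs with a small, consistent improvement.
rng = np.random.default_rng(0)
image_only = rng.normal(0.950, 0.005, size=30)
ensemble = image_only + rng.normal(0.004, 0.002, size=30)
normality_p, p_value, effect_size = compare_models(image_only, ensemble)
```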

Discussion
This DL study demonstrated that adding routinely available patient variables to image-only models improved their diagnostic accuracy of osteoporosis. The mean AUC score was significantly improved (difference: 0.004; 95% CI: 0.002 to 0.0007; p = 0.0004). The patient variables had additive synergistic effects on the image in osteoporosis identification in this DL study.
These results are consistent with those of previous studies in other fields [13,15,16,25,26]. However, performance metrics other than the AUC were not improved in some CNN models in this study. These results are similar to those of a diagnostic study of diabetic retinopathy using machine learning [16]. We speculate that diagnostic accuracy improved because the patient variables supply essential information that cannot be extracted or interpreted from the images alone.
The AUC scores improved significantly. The relatively high AUC scores suggest that the image model with patient variables offers high discriminative power as a diagnostic test [27]. A few previous AI studies reported that adding patient variables to images improved the AUC by 2-4% [14,16]. Diagnostic accuracy should evidently be as high as possible in a diagnostic test analysis, but it remains unclear how much clinical benefit such a statistical advantage provides.
In this study, we measured the effect size of the ensemble model with patient variables. The effect size is an indicator of the effectiveness of experimental results and of the strength of relationships between variables. The effect size in AUC for osteoporosis identification was 0.871, which is classified as a large effect. Since few reports have calculated such effect sizes based on comparisons between DL models [28], we believe our study can serve as basic research to help determine sample sizes for future studies.
The strength of this study over previous studies is that the additional effect of the patient factors was statistically assessed in a clinically at-risk patient population. The patient group in this study was as close as possible to a real-world setting. To our knowledge, this is the first study to statistically clarify the additional effects of patient variables in osteoporosis identification using DL. In addition, the calculated effect sizes can be used to estimate sample sizes for future studies. In academic research, it is preferable to evaluate results statistically rather than by simply comparing raw values.
This study has some limitations. First, the selection of patient factors was not assessed. We selected three patient factors in this study based on a previous study [11]. In selecting the patient factors, we believed it was important to choose a few simple, easy-to-collect factors in preparation for real-world application. A machine learning study reported several osteoporosis risk prediction variables, such as the duration of menopause and diabetes mellitus [29]. In future studies, the selection of patient factors that predominantly influence osteoporosis identification needs to be thoroughly examined. Second, we analyzed the diagnostic accuracy of a limited selection of CNN models. CNN models are being developed at a very fast pace, and an appropriate model must be selected for handling high-quality images and patient variables; this will need to be validated using various CNN models. Third, we tested only the ResNet34 model with 30 cycles and found a statistically significant difference. Deeper networks require more parameters and more time; therefore, we were not able to test them in this study. Further research should examine more CNN models and compare the confidence intervals of the differences. We speculate that appropriate models will be identified for the clinically required diagnostic accuracy in each situation. Fourth, we used Photoshop for manual cropping, so there were slight differences in the cropped range between operators. To develop a better osteoporosis detection model, the cropping range, differences in resizing, and padding processing need further study. As a final goal, it will be necessary to develop and evaluate a method for automatically cropping the region of interest from hip radiographs. Fifth, we could not consider sample size in the methods because previous studies did not report effect sizes or clinically important differences. In this study, we reported the effect size for each performance metric; therefore, researchers can use these values for sample size calculations in future studies. Finally, we did not evaluate the external validity of our models. The method and quality of radiographs differ between facilities and settings. Models with residual overfitting to our single-institution data might not transfer to other institutions' datasets, although we adopted meticulous methods to prevent overfitting. In addition, people of different races and from different regions have different bone morphologies, and the degree of influence of patient factors will differ [30]. Big data from multicenter studies will enhance external validity and aid further research.

Conclusions
We have revealed the additional effects of patient variables in diagnosing osteoporosis using deep CNNs with hip radiographs. In particular, we found a statistically significant improvement in AUC scores.

Funding:
The authors received no support from any grants.

Institutional Review Board Statement:
The study protocol was approved by the Institutional Review Committee of the Kagawa Prefectural Central Hospital (approval number: 1031), and the study was conducted according to the guidelines of the Declaration of Helsinki.

Informed Consent Statement:
The institutional review committee waived the need for individual informed consent. Written/verbal informed consent was not obtained from any participant because this study featured a non-interventional retrospective design, and all data were analyzed anonymously.

Data Availability Statement:
Not applicable.