Deep Learning-Based Hip X-ray Image Analysis for Predicting Osteoporosis

Abstract: Osteoporosis is a common problem in orthopedic medicine, and it has become an increasingly important issue as Taiwan gradually becomes an aging society. Bone mineral density (BMD) derived from dual-energy X-ray absorptiometry (DXA) is the main criterion for diagnosing osteoporosis, but because DXA equipment is expensive and far less widely available than X-ray imaging, osteoporosis has not been effectively addressed for many people who suffer from it. At present, doctors cannot accurately assess osteoporosis by reading X-ray images manually and must rely on the data obtained from DXA. In recent years, artificial intelligence, especially machine learning and deep learning, has made significant progress in image recognition. It is therefore worthwhile to revisit the question of whether a convolutional neural network model can read a hip X-ray image and predict the patient's BMD. In this study, we propose a hip X-ray image segmentation model and a hip X-ray image classification model. First, we used the U-Net model as a framework to segment the femoral neck, greater trochanter, Ward's triangle, and total hip in the hip X-ray images. We then performed image matting and data augmentation. Finally, we constructed a predictive model for osteoporosis using deep learning algorithms. In the segmentation experiments, we used intersection over union (IoU) as the evaluation metric, and both the U-Net and U-Net++ models achieved IoU values greater than or equal to 0.5. In the classification experiments, using the T-score as the classification basis, the DenseNet121 model applied to the total hip achieved the highest accuracy, 74%.


Introduction
Osteoporosis is one of the most common issues in orthopedic medicine today. People face the challenges of osteoporosis as a natural consequence of aging. According to the International Osteoporosis Foundation (IOF), both men and women over the age of 50 have a significant risk of developing osteoporosis, with approximately one-fifth of men and one-third of women falling into this category [1]. The risk of osteoporosis increases dramatically in women after menopause. It is estimated that approximately 200 million women worldwide are affected by osteoporosis. As our country gradually transitions into an aging society, the percentage of people afflicted by osteoporosis is steadily rising. Consequently, effective prevention and treatment of osteoporosis have become vital concerns in orthopedic medicine [2]. In the early stages of osteoporosis, there are no obvious symptoms, but fractures may occur because of minor injuries [3]. In severe cases, fractures may occur not only in the hip but also in the spine, wrist, arm, and knee. Ultrasound [4], peripheral bone densitometry, etc., are the main instruments used by orthopedic surgeons to diagnose osteoporosis. However, the images of the hip produced by these methods are difficult for the orthopedic surgeon to read by eye, so more sophisticated instruments, such as dual-energy X-ray absorptiometry (DXA), are required [5]. The bone mineral density (BMD) of the hip is measured and compared with that of young, healthy adults, expressed as a T-score, to diagnose osteoporosis [6]. Currently, in the clinical diagnosis of the hip, physicians base their diagnosis of osteoporosis on a DXA report of the femoral neck and the total hip. BMD from DXA is the main criterion for diagnosing osteoporosis. In Taiwan, because DXA equipment is expensive and limited in number, it is far less widely available than X-ray imaging, and only high-grade hospitals or nursing homes are equipped with it. In some remote or rural areas, this resource is not conveniently accessible, and cheaper equipment such as ultrasound and peripheral densitometry is mostly used for diagnosis, resulting in less accurate and less efficient diagnosis.
Currently, physicians are unable to manually read X-ray images for osteoporosis in clinical diagnosis. Early diagnosis of osteoporosis is important for the prevention of osteoporotic fractures, and in recent years, artificial intelligence has been gradually introduced into medical diagnosis [7]. As machine learning and deep learning methods have made significant advances in image recognition, it is worthwhile to revisit the question of whether convolutional neural networks (CNNs) can read a hip X-ray image and predict a patient's BMD status, and the answer may be affirmative. While traditional machine learning methods can be effective, they rely on manual feature extraction before the model is trained and the results are output. With the great leap in the computing power of graphics processing units (GPUs), deep learning can take the data directly as input; the neural network extracts the features by itself during training and then outputs the results. Compared with traditional machine learning, deep learning omits manual feature extraction. When the amount of data is relatively large, deep learning can extract more features, and its results tend to be better than those of traditional machine learning.
In this study, we collected data from 134 orthopedic patients, most of whom were menopausal women and elderly men. The data comprised the patients' left or right hip X-ray images and DXA diagnostic reports, which gave the BMD and T-score of the femoral neck, greater trochanter, Ward's triangle, and total hip. We applied image segmentation, image matting, and data augmentation to the hip X-ray images and used the DXA reports to label them by BMD and T-score. We then designed experiments with deep learning models to predict whether the femoral neck, greater trochanter, Ward's triangle, and total hip findings on hip X-ray images constitute osteoporosis, and to establish a risk classification model of osteoporosis that assists orthopedic surgeons in diagnosis, in the hope of reducing the time surgeons spend on diagnosis.
The research question of this study is divided into two parts. The first part concerns image segmentation: the segmentation results for the four areas (femoral neck, greater trochanter, Ward's triangle, and total hip) on X-ray images may affect the subsequent image classification results, so we ask whether the segmentation model can correctly and efficiently segment the contours of the four areas for the subsequent classification experiments. The second part concerns the deep learning method: what are the prediction results of different models, and which model gives the best prediction of osteoporosis and can most accurately predict the BMD of patients? In addition to exploring the correlation between the interpretation of hip X-ray images and BMD and the experimental accuracy of the deep learning models, we also analyzed two more areas of the hip, namely the greater trochanter and Ward's triangle, to give physicians more aspects from which to diagnose and analyze patients. The study is carried out using CNNs and supervised learning, with the labeled femoral neck, greater trochanter, Ward's triangle, and total hip as the data input of the whole experiment. The experimental results are then explored further, in the hope of providing some help for the medical diagnosis of osteoporosis.

Osteoporosis
Osteoporosis is a silent disease that may not cause pain before fractures occur, but as a person ages, bone mass continually decreases. Osteoporosis is a disease of bone metabolism that reduces the body's bone mineral density (BMD) and causes fragility of the bones in the hip, spine, and wrist. It also leads to fractures and other complications (e.g., vertebral compression fractures and hip and wrist fractures) [8]. Bone is a self-renewing, active tissue. In the process of maintaining bone health, the body continuously breaks down old bone and replaces it with new bone tissue. During childhood and adolescence, new bone formation occurs rapidly, resulting in larger BMD values, which reach their peak around the age of 20. In the following 7 to 10 years, the rate of new bone production is about the same as the rate of decomposition of old bone, and the adult skeleton reaches complete renewal [9]. However, when people reach about 40 years old, the rate of bone mineral increase slows down. At this point, the rate of bone decomposition is greater than the rate of new bone production, and the shell of the bone becomes thinner, making it more fragile.
In bone density examination, DXA calculates BMD as a direct reference indicator; the lower the BMD, the more likely the patient is to have osteoporosis. Another commonly used indicator is the T-score, which is regarded as an extension of the BMD calculation and serves as a basis for assessing the presence or absence of osteoporosis. The T-score is calculated by dividing the difference between the BMD measured by a specialized instrument such as DXA and the expected young normal (YN) value by the standard deviation (SD) of the BMD in young people [10]. The T-score is defined in Equation (1):

$$\text{T-score} = \frac{\text{BMD} - \text{YN}}{\text{SD}} \qquad (1)$$

According to the standard definition of the World Health Organization [11], a T-score greater than −1 indicates normal bone mass, a T-score less than or equal to −1 and greater than −2.5 indicates low bone mass, and a T-score less than or equal to −2.5 indicates osteoporosis. Bone loss is inevitable as humans age, so both bone density and T-score become lower with increasing age than they are in young people. In the clinical diagnosis of osteoporosis in the hip, the diagnosis is based on a DXA report of the femoral neck and the total hip.
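The T-score calculation and the WHO thresholds described above can be sketched in a few lines of Python. This is only an illustration: the function names and the young-normal reference values are hypothetical and are not taken from any DXA device or from this study's data.

```python
def t_score(bmd, young_normal_mean, young_sd):
    """T-score as described in the text: (BMD - YN) / SD."""
    if young_sd <= 0:
        raise ValueError("young_sd must be positive")
    return (bmd - young_normal_mean) / young_sd


def who_category(t):
    """Map a T-score to the WHO categories quoted in the text."""
    if t > -1.0:
        return "normal"
    if t > -2.5:
        return "low bone mass"
    return "osteoporosis"


# Hypothetical reference values in g/cm^2, chosen only to keep the arithmetic simple.
YN_MEAN, YN_SD = 1.0, 0.1

for bmd in (0.95, 0.80, 0.70):
    t = t_score(bmd, YN_MEAN, YN_SD)
    print(f"BMD={bmd:.2f}  T-score={t:+.1f}  category={who_category(t)}")
```

Note that the boundary values fall into the lower category: a T-score of exactly −1 counts as low bone mass and exactly −2.5 counts as osteoporosis, matching the "less than or equal to" wording of the definition.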

Osteoporosis Detection
In recent years, artificial intelligence has become more common in the medical industry, with machine learning and deep learning as the main applications. In 2016, S.K. Hong et al. proposed a study that primarily used machine learning and deep learning to assist orthopedic surgeons in determining the presence of osteoporosis. The study collected DXA diagnosis results from men aged 50 and above and from postmenopausal women, focusing on the femoral neck in hip X-ray images. Unlike our study, which used only hip X-ray images as the dataset, their study incorporated various covariates (e.g., height and age) as part of the dataset, which makes good prediction results easier to obtain. They employed an artificial neural network (ANN) to perform binary classification of osteoporosis based on T-scores. The results were promising: the accuracy for men reached 85.8%, with a sensitivity of 81.6% and a specificity of 90.0%, and the accuracy for women was 86.2%, with a sensitivity of 84.7% and a specificity of 87.7% [12]. In addition, in 2013, T.K. Yoo et al. collected medical records of postmenopausal women and reported that a Support Vector Machine (SVM) outperformed Logistic Regression (LR) in predicting osteoporosis risk and surpassed some traditional clinical decision tools, such as the Osteoporosis Self-Assessment Tool (OST), Osteoporosis Risk Assessment Instrument (ORAI), Simple Calculated Osteoporosis Risk Estimation (SCORE), and Osteoporosis Index of Risk (OSIRIS), achieving a predictive accuracy of 77%, a sensitivity of 78%, and a specificity of 76% [13].
In addition to DXA-based studies, in 2021, J.W. Adams et al. proposed applying neural networks to osteoporosis screening [14], using low-frequency radiofrequency data passed through the wrist and a multilayer perceptron (MLP) for the analysis. In 2020, B. Zhang et al. trained a CNN model on lumbar spine X-ray images to identify osteoporosis and bone loss [15]. Both are of great help for orthopedic applications. In 2020, N. Yamamoto et al. used ResNet, GoogLeNet, and EfficientNet to classify hip X-ray images for osteoporosis [16]. They used the T-score for binary classification and focused solely on hip X-ray images, achieving the highest accuracy of 84% in the experiments conducted with GoogLeNet and EfficientNet B3. The difference from our study is that in their dataset of 1131 hip X-ray images, 708 were from patients with confirmed hip fractures. Patients with hip fractures typically have lower bone density, which is a key indicator of osteoporosis, so distinguishing between those with and without osteoporosis was easier in cases of confirmed hip fractures. The difficulty lies in predicting bone density correctly from normal hip X-ray images. In a similar study on determining osteoporosis from hip X-ray images, R. Jang et al. proposed in 2021 a model that performs binary classification by T-score and mainly uses a deep neural network (DNN) based on the VGG16 architecture, with an optimal accuracy of 81%, a sensitivity of 91%, and a specificity of 69% [17]. However, that study included only hip X-ray images of postmenopausal women clinically at high risk for osteoporotic fractures. The dataset is therefore limited to women and does not cover male hip X-rays, where bone density is more difficult to predict.
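The studies above are compared by accuracy, sensitivity, and specificity. As a reminder of how these three figures relate to a binary confusion matrix, here is a minimal sketch. The label convention (1 = osteoporosis, 0 = not osteoporosis) and the toy labels are assumptions made for illustration and are not data from any of the cited studies.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for labels in {0, 1} (1 = positive)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: share of true positives among all actual positives.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of true negatives among all actual negatives.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy example: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting all three values matters when the classes are imbalanced, because accuracy alone can look good even if most positive cases are missed.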

Fracture Detection
Osteoporosis is often accompanied by compression fractures, and in patients with severe osteoporosis even slight movement can cause a spinal compression fracture. Like osteoporosis, spinal compression fractures can also be predicted using deep learning. In 2018, F. Cabitza et al. published a literature review of spinal bone image segmentation and osteoarthritis fracture prediction [18], focusing on the literature of the past ten years on the application of machine learning and deep learning methods in medicine and biology. Among these methods, the most widely used in medical imaging are SVM and deep learning. The use of deep learning has been increasing year by year, and it has become the mainstream trend. The review also summarizes the proportion of evaluation indexes used: accuracy is the index that most experiments evaluate, accounting for 45%, while sensitivity and specificity account for 25% and are more common in medical and biological research.
In 2020, W. Abbas et al. used Faster R-CNN to detect and classify fractures in lower leg X-rays [19], collecting X-ray images of lower leg fractures from 50 patients. The Faster R-CNN model achieved an accuracy of 94%, a sensitivity of 96%, and a specificity of 90%. In addition, Y. Yamada et al. used deep learning to determine whether a hip fracture was present based on X-ray images of the anteroposterior and lateral hip positions [20]; the accuracy of hip fracture classification after model training was 98%, which was better than the 95% of the orthopedic surgeon's interpretation. In addition to the spine and hip, J. Olczak et al. used machine learning to make predictions on X-rays of the wrist and ankle [21], and VGG16 performed well in determining fractures, with an accuracy of 83%. S.W. Chung et al. labeled normal shoulder X-ray images and four abnormal proximal humerus fracture types and then used image enhancement and CNN model training to predict whether a fracture had occurred [22]. The accuracy of distinguishing a normal shoulder from a proximal humerus fracture was 96%, the sensitivity was 99%, and the specificity was 97%, which was higher than the accuracy of an orthopedic surgeon's diagnosis.

Image Segmentation
In medical imaging, both high- and low-level features of medical images are very important, but traditional image segmentation methods take more time to filter out the unnecessary noise in medical images. Therefore, this study uses U-Net and its extended variant U-Net++, which work well on medical images through feature concatenation and are less prone to overfitting. The U-Net architecture was proposed by O. Ronneberger et al. in 2015 [23]. The left half of the U-Net architecture is the Encoder, which performs feature extraction. It consists of a series of downsampling modules composed of convolution layers with 3 × 3 kernels and ReLU activations followed by 2 × 2 max-pooling layers, which reduce the spatial dimensions and increase the number of channels. The features extracted during downsampling are passed on to the right half of the architecture, the Decoder, which performs upsampling to recover the image resolution. Each upsampling step uses a 2 × 2 up-convolution and a skip connection for feature fusion, and finally a 1 × 1 convolution layer outputs the result. The U-Net++ architecture was proposed by Z. Zhou et al. in 2018 [24]. The main difference between U-Net++ and U-Net is that the skip connections are redesigned in U-Net++ to bridge the semantic gap between the Encoder and Decoder feature maps, improve feature fusion in the Decoder, and make the semantic feature maps easier to optimize. In addition, dense skip connection paths are added to improve segmentation performance. U-Net++ also adds deep supervision, which allows the model to be pruned to adjust its complexity, and uses a loss function that combines cross entropy and the Dice coefficient to improve the performance of the model.
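The combination of cross entropy and the Dice coefficient mentioned for U-Net++ can be illustrated on flattened binary masks. The sketch below is a plain-Python illustration, not the loss used in this study or the exact formulation of [24]; the equal weighting of the two terms (`bce_weight=0.5`) and the smoothing constant `eps` are assumptions.

```python
import math


def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice overlap between predicted probabilities and a 0/1 target."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross entropy; probabilities are clipped for stability."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def combined_loss(pred, target, bce_weight=0.5):
    """Weighted sum of cross entropy and (1 - Dice); the weighting is an assumption."""
    bce = binary_cross_entropy(pred, target)
    dice_loss = 1.0 - dice_coefficient(pred, target)
    return bce_weight * bce + (1.0 - bce_weight) * dice_loss


target = [1, 1, 0, 0]                 # flattened ground-truth mask
confident = [0.9, 0.9, 0.1, 0.1]      # good prediction
uncertain = [0.5, 0.5, 0.5, 0.5]      # uninformative prediction
print(combined_loss(confident, target), combined_loss(uncertain, target))
```

The cross-entropy term penalizes each pixel independently, while the Dice term measures region overlap, which helps when the foreground region is small relative to the background.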

Deep Learning Neural Network Models

Convolutional Neural Networks (CNNs)
In 2012, A. Krizhevsky et al. proposed the classical CNN model [25]. A CNN learns useful convolution kernels through backpropagation and is mainly divided into three parts: (1) the convolution layer, (2) the pooling layer, and (3) the fully connected layer. Compared with a traditional fully connected neural network, the CNN has additional convolution and pooling layers. The convolution and pooling layers perform feature extraction, while the fully connected layer performs classification. The convolution layer convolves the image with feature detectors (kernels) to produce feature maps. The pooling layer replaces each small region of a feature map with a single value, which reduces the feature map dimension while retaining important features and also helps to avoid overfitting. There are three main approaches: max pooling, mean pooling, and stochastic pooling. The most common is max pooling. Each neuron in the fully connected layer is connected to the neurons in the preceding layer, and each connection has its own weight. The feature maps are flattened and passed to the fully connected layer, which integrates the useful information from the preceding layers and produces the final classification result through the softmax activation function.
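To make the pooling and softmax steps concrete, the following sketch applies 2 × 2 max pooling to a small feature map and converts a vector of class scores into probabilities. It is a plain-Python illustration with made-up numbers; real CNN frameworks implement the same operations on tensors.

```python
import math


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # each 2x2 block is replaced by its largest value

probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

The 4 × 4 map shrinks to 2 × 2 while the strongest response in each block is kept, and the softmax output sums to one so that it can be read as a class probability distribution.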

VGGNet
VGGNet was published in 2014 by K. Simonyan et al. from the Visual Geometry Group of the University of Oxford [26]. Its architecture is simple, but the number of weights, which comprises convolutional kernel weights and fully connected layer weights, is very large. VGGNet also uses many channels: the first stage has 64 channels, and the number doubles in each subsequent stage up to a maximum of 512. A larger number of channels means that more information can be extracted. All convolution kernels are 3 × 3. VGGNet deepens the model by stacking small convolutional kernels, and the deeper network improves the fitting ability of the model. VGGNet has the advantage of a simple structure: the whole network uses the same convolutional kernel size (3 × 3) and the same max-pooling size (2 × 2). The disadvantage is that VGGNet has many parameters, which take up a large amount of memory. The common variants are VGG16 and VGG19. The VGG16 model used in this study consists of 13 convolutional layers, three fully connected layers, five pooling layers, and a softmax output layer.
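The remark that VGGNet has many parameters can be checked with simple arithmetic. The sketch below counts the weights and biases of the standard VGG16 layout (13 convolutional layers in five stages, three fully connected layers). The input size of 224 × 224 RGB and the 1000 output classes are the original ImageNet configuration and are assumptions here; the classification head used in this study would differ.

```python
# Output channels of each 3x3 convolution, grouped by stage;
# every stage is followed by a 2x2 max-pooling layer.
VGG16_STAGES = [
    (64, 64),
    (128, 128),
    (256, 256, 256),
    (512, 512, 512),
    (512, 512, 512),
]


def vgg16_parameter_count(in_channels=3, image_size=224, hidden=4096, num_classes=1000):
    """Return (convolutional parameters, fully connected parameters)."""
    conv_params = 0
    channels, size = in_channels, image_size
    for stage in VGG16_STAGES:
        for out_channels in stage:
            # 3x3 kernel weights plus one bias per output channel.
            conv_params += 3 * 3 * channels * out_channels + out_channels
            channels = out_channels
        size //= 2  # 2x2 max pooling halves the spatial size
    flattened = channels * size * size
    fc_params = (
        (flattened * hidden + hidden)
        + (hidden * hidden + hidden)
        + (hidden * num_classes + num_classes)
    )
    return conv_params, fc_params


conv_params, fc_params = vgg16_parameter_count()
print(f"conv: {conv_params:,}  fc: {fc_params:,}  total: {conv_params + fc_params:,}")
```

Under these assumptions most of the parameters sit in the fully connected layers, not in the convolutions, which is why the memory footprint is large even though the convolutional structure is simple.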

ResNet
In 2015, ResNet [27] was proposed by K. He et al. It largely solved the vanishing gradient problem of deep networks. The paper pointed out that a plain 56-layer network can perform worse than a 20-layer network, that is, a degradation problem appears as the network gets deeper. The introduction of residual learning (RL) maintains accuracy and speed even when the network is deepened, while the softmax layer is composed of the gradient-log-normalizer (GLN) function, which gives the classification probability distribution. The architecture of ResNet can adjust the depth and width of the model by adjusting the number of channels in each block and the number of stacked blocks, which makes the model easier to optimize and addresses the degradation problem of deep neural networks. The common variants are ResNet50, ResNet101, and ResNet152. ResNet50 is used in this study.
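The idea behind residual learning is that each block outputs its input plus a learned correction, y = x + F(x). The toy sketch below uses plain lists in place of feature maps and two hypothetical branch functions to show why stacking such blocks does not have to degrade the signal: when the residual branch contributes nothing, the block reduces to the identity.

```python
def residual_block(x, residual_fn):
    """Identity shortcut: output = input + residual branch."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual branch must preserve the shape of its input")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """Hypothetical branch that has learned nothing (all-zero output)."""
    return [0.0] * len(x)


def small_branch(x):
    """Hypothetical branch that adds a small correction."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]

# Fifty stacked blocks with an all-zero branch leave the input unchanged.
out = x
for _ in range(50):
    out = residual_block(out, zero_branch)
print(out == x)

print(residual_block([1.0, 2.0], small_branch))
```

A plain (non-residual) stack would have to learn the identity mapping explicitly in every layer, which is harder to optimize; the shortcut makes "do nothing" the default behavior.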

DenseNet
The architecture of DenseNet was proposed by G. Huang et al. in 2018 [28] to address the vanishing of gradients as the network gets deeper. The basic idea is the same as that of ResNet. DenseNet does not rely on a very deep or very wide network to obtain good image recognition results; instead, by improving the flow of features and reducing the complexity of the network, it lets each layer of the network receive the gradient from the loss function, which alleviates the vanishing gradient problem. Whereas ResNet combines feature maps by summation, DenseNet concatenates them, connecting each layer with the following layers so that network features are reused. DenseNet is a more compact model with a lower parameter cost than ResNet. Common variants are DenseNet121, DenseNet169, and DenseNet201. DenseNet121 was used in this study.
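The practical difference between the two ways of combining features is that summation (ResNet) keeps the channel count fixed, while concatenation (DenseNet) makes it grow with every layer. The sketch below traces the channel count through the dense blocks of the standard DenseNet121 layout. The block sizes (6, 12, 24, 16), growth rate 32, 64 initial channels, and a transition compression of 0.5 are the commonly published configuration and are assumptions here, not values reported in this study.

```python
def fuse_by_sum(a, b):
    """ResNet-style fusion: element-wise sum, channel count unchanged."""
    return [x + y for x, y in zip(a, b)]


def fuse_by_concat(a, b):
    """DenseNet-style fusion: concatenation, channel count grows."""
    return a + b


def dense_block_channels(block_layers=(6, 12, 24, 16), growth_rate=32,
                         init_channels=64, compression=0.5):
    """Channel count at the end of each dense block."""
    channels = init_channels
    trace = []
    for index, n_layers in enumerate(block_layers):
        # Every layer appends `growth_rate` new feature maps to its input.
        channels += n_layers * growth_rate
        trace.append(channels)
        if index < len(block_layers) - 1:
            # A transition layer between blocks compresses the channel count.
            channels = int(channels * compression)
    return trace


print(len(fuse_by_sum([1, 2], [3, 4])), len(fuse_by_concat([1, 2], [3, 4])))
trace = dense_block_channels()
print(trace)
```

Because each layer adds only a small, fixed number of feature maps and reuses all earlier ones, the layers themselves can stay narrow, which is where the lower parameter cost comes from.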

Research Framework
The research framework and process of this study are shown in Figure 1 and are divided into four parts: (1) dataset, (2) data preprocessing, (3) image classification, and (4) result comparison. After obtaining the dataset from the hospital, we performed data preprocessing (including image labeling, image segmentation, image matting, and data augmentation) and then used three different convolutional neural network models to classify the X-ray images of the four parts of the hip. The X-ray images were organized into three different datasets for testing. The first consists of the original X-ray images, to verify whether training with the original dataset results in underfitting and poor generalization due to the low complexity of the model and the small number of features in the images. The second consists of the X-ray images after data augmentation, to verify whether data augmentation can effectively improve the overall results and enhance the generalization ability of the model. The third takes the best-performing T-score and BMD classification experiments from the first two datasets and adds the image segmentation method, to verify whether the overall results can be improved. Finally, we compared the results of the segmentation model experiment, the classification experiments with and without data augmentation, and the classification experiments with and without image segmentation.

Datasets
The X-ray images for this study came from patients at a regional hospital in Taiwan between September 2020 and September 2021. A total of 134 left or right hip radiographs and DXA diagnoses were collected from 134 patients, mostly elderly men and postmenopausal women, in a retrospective study whose dataset was reviewed by the Institutional Review Board for Research Ethics Programs and Studies (IRB). A total of 139 left and right hip radiographs were collected. Each DXA image had a diagnostic interval distribution of BMD and T-score values for the femoral neck, greater trochanter, Ward's triangle, and total hip. However, two DXA models were in use, and one of them displays bone density and T-score data only for the femoral neck and total hip and lacks data for the greater trochanter and Ward's triangle. Further screening of the DXA reports yielded 139 records for the femoral neck and total hip and 72 records for the greater trochanter and Ward's triangle. With an 8:2 ratio of training set to test set for each part, this resulted in 111 training records for the femoral neck and total hip and 57 training records for the greater trochanter and Ward's triangle. Before the image segmentation experiment, the contour of each of the four parts was marked on the collected X-ray images for use in the subsequent segmentation experiments.
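The 8:2 split and the resulting counts can be reproduced with a short sketch. The shuffling, the fixed seed, and rounding the training size down are assumptions; the text states only the ratio and the resulting training sizes (111 and 57), which rounding down happens to match.

```python
import random


def split_indices(n_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices and split them into training and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_ratio)  # rounds down (assumption)
    return indices[:n_train], indices[n_train:]


# Record counts reported in the text for the two groups of hip regions.
for name, n in (("femoral neck / total hip", 139),
                ("greater trochanter / Ward's triangle", 72)):
    train, test = split_indices(n)
    print(f"{name}: {len(train)} training, {len(test)} test")
```

Using a fixed seed keeps the split reproducible, so that different classification models are compared on the same test images.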

Image Labeling
In this study, each of the four areas of the patient's hip (femoral neck, greater trochanter, Ward's triangle, and total hip) was outlined separately on the X-ray images and manually labeled using Labelme, an open-source labeling tool [29]. The four parts of the hip were framed as shown in Figure 2 below, and the labeled image data were batch converted into binary PNG files, which were used as inputs for the supervised training of U-Net and U-Net++ in image segmentation and for image classification.
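Converting a polygon annotation into a binary mask can be sketched as follows. Labelme stores annotations as JSON with `imageHeight`, `imageWidth`, and a `shapes` list whose entries carry a `label` and a list of `points`. The example annotation, the label name `femoral_neck`, and the pure-Python rasterizer are illustrative assumptions; this is not the conversion script used in the study, and real images would call for an image library because this loop visits every pixel.

```python
import json


def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for a point against a closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def labelme_to_mask(annotation, label):
    """Binary mask (0 or 255) for all polygons carrying the given label."""
    height, width = annotation["imageHeight"], annotation["imageWidth"]
    mask = [[0] * width for _ in range(height)]
    for shape in annotation["shapes"]:
        if shape["label"] != label or shape.get("shape_type", "polygon") != "polygon":
            continue
        for row in range(height):
            for col in range(width):
                # Test the pixel centre against the polygon.
                if point_in_polygon(col + 0.5, row + 0.5, shape["points"]):
                    mask[row][col] = 255
    return mask


# Hypothetical 6x6 annotation with one rectangular region.
example = json.loads("""
{"imageHeight": 6, "imageWidth": 6,
 "shapes": [{"label": "femoral_neck", "shape_type": "polygon",
             "points": [[1, 1], [5, 1], [5, 4], [1, 4]]}]}
""")
mask = labelme_to_mask(example, "femoral_neck")
for row in mask:
    print(row)
```

One mask per hip region is produced by calling the function once per label, which mirrors the four separate binary images described above.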

Image Segmentation
In this study, the labeled X-ray images of the four parts were fed into the U-Net and U-Net++ models for segmentation training, and the bit depth of the images was converted from the original 24 bits to 8 bits before training. U-Net and U-Net++ were chosen because their structure is simple, they do not require much time to filter out the remaining noise in medical images, and they are less prone to overfitting on small image datasets. The binary segmentation predictions obtained after training U-Net and U-Net++ are shown in Figure 3 below.
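The segmentation results in this study are evaluated with intersection over union (IoU). For two binary masks, IoU is the number of pixels marked in both masks divided by the number marked in at least one. The sketch below computes it on small hand-made masks; treating two empty masks as a perfect match is a convention chosen here, not something stated in the text.

```python
def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks given as nested lists."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same height")
    intersection = union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        if len(pred_row) != len(true_row):
            raise ValueError("masks must have the same width")
        for p, t in zip(pred_row, true_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            union += p or t
    # Convention (assumption): two empty masks count as a perfect match.
    return intersection / union if union else 1.0


true_mask = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
pred_mask = [[0, 0, 1, 1],
             [0, 0, 1, 1]]
score = iou(pred_mask, true_mask)
print(f"IoU = {score:.3f}")
```

In this toy case the prediction is shifted one column to the right, so two pixels overlap out of six covered in total. An IoU of 0.5 or more, as reported for U-Net and U-Net++ in this study, means the overlap is at least half of the combined area.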

Image Matting
The binary image obtained from the segmentation of each of the four parts is combined with the original X-ray image so that the original image retains only the segmented part, as shown in Figure 4, and the background outside the contour is removed. This background removal is intended to improve the accuracy of deep learning classification, and the classification results for segmented and unsegmented images are then compared.
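The matting step amounts to keeping the pixels of the original image where the binary mask is set and replacing everything else with a background value. A minimal sketch on a small grayscale image follows; the background value of 0 (black) and the toy pixel values are assumptions.

```python
def apply_mask(image, mask, background=0):
    """Keep pixels where the mask is non-zero; set all others to `background`."""
    if len(image) != len(mask) or any(len(i) != len(m) for i, m in zip(image, mask)):
        raise ValueError("image and mask must have the same shape")
    return [
        [pixel if m else background for pixel, m in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 3x4 grayscale image and a mask covering its central region.
image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]
mask = [
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
matted = apply_mask(image, mask)
for row in matted:
    print(row)
```

Only the region inside the contour keeps its original intensities, so the classifier sees the bone area of interest without the surrounding anatomy.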

Data Augmentation
In this study, the X-ray image data were insufficient for image classification experiments. Training the model with the original dataset could lead to issues such as model underfitting and poor generalization due to the low complexity of the model and the limited image features. Therefore, the dataset consisting of X-ray images of the four hip regions was augmented by applying transformations such as rotation (e.g., Figure 5), shifting (e.g., Figure 6), and random scaling (e.g., Figure 7). Importantly, these augmentations were performed without altering the bone contour morphology, background, or color. The purpose of data augmentation was to aid in the training of deep learning models [30]. Additionally, data augmentation serves to address underfitting problems in classification experiments and can potentially enhance experimental accuracy if overfitting issues arise in the future [31]. Table 1 presents a comparison of data volume before and after data augmentation. Rotation and shifting expanded the amount of data for each region sevenfold; after this expansion, random scaling was added so that each batch of data was randomly scaled by between −20% and 20%, increasing the diversity of the data.
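
The augmentation pipeline can be illustrated with a small NumPy sketch. Several details are assumptions made only for the example: the specific rotation angles and shift offsets (the text reports the sevenfold expansion but not the individual values), the nearest-neighbour resampling, the zero fill for pixels that fall outside the source image, and drawing one scaling factor per image. The helper names `affine_augment`, `expand_sevenfold`, and `random_scale` are hypothetical.

```python
import numpy as np


def affine_augment(image, angle_deg=0.0, shift=(0.0, 0.0), scale=1.0):
    """Rotate, shift, and scale a 2-D image about its centre (nearest neighbour).

    Output pixels whose source falls outside the image are filled with 0.
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)

    # Output pixel grid, expressed relative to the shifted centre.
    ys, xs = np.mgrid[0:h, 0:w]
    y = ys - cy - shift[0]
    x = xs - cx - shift[1]

    # Inverse mapping: undo the rotation and scaling to locate source pixels.
    src_y = np.rint((cos_t * y + sin_t * x) / scale + cy).astype(int)
    src_x = np.rint((-sin_t * y + cos_t * x) / scale + cx).astype(int)

    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(image)
    out[inside] = image[src_y[inside], src_x[inside]]
    return out


def expand_sevenfold(image, angles=(-10, 5, 10), shifts=((8, 0), (0, 8), (-8, -8))):
    """Return the original plus 3 rotated and 3 shifted copies (7 images).

    The angles and shifts are illustrative assumptions, not the study's values.
    """
    expanded = [image]
    expanded += [affine_augment(image, angle_deg=a) for a in angles]
    expanded += [affine_augment(image, shift=s) for s in shifts]
    return expanded


def random_scale(images, rng, max_change=0.2):
    """Scale each image by a factor drawn uniformly from 1 - max_change to 1 + max_change."""
    return [
        affine_augment(img, scale=rng.uniform(1 - max_change, 1 + max_change))
        for img in images
    ]


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32)).astype(np.uint8)
expanded = expand_sevenfold(image)
scaled = random_scale(expanded, rng)
print(len(expanded), scaled[0].shape)
```

Because all three transformations are expressed as one inverse affine mapping, the image is resampled only once per augmented copy, which limits the loss of detail from repeated interpolation.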

Model Evaluation Indicators
Intersection over Union (IoU), a metric commonly used in object detection, was used to evaluate image segmentation [32]. The IoU is calculated by dividing the intersection of two regions by their union, and a result greater than or equal to 0.5 is regarded as a valid image segmentation result. With A denoting the region predicted by the model and B the manually labeled region, the IoU is given by Equation (2):

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$

In deep learning image classification, the common evaluation metric is accuracy, but in medical and biological research, relying on accuracy alone may not give a complete evaluation of the model, and additional metrics make the evaluation more complete. Before explaining the other evaluation indexes, we first define the confusion matrix entries True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN), as shown in Table 2. This study concerns medical image classification, and four model evaluation indexes are derived from the confusion matrix: accuracy, sensitivity, specificity, and F1-score. Each index is defined in Table 3, and the corresponding formulas are given in Equations (3)-(6):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{5}$$

$$\mathrm{F1\text{-}score} = \frac{2TP}{2TP + FP + FN} \tag{6}$$

Table 3. Definition of model evaluation indicators.

| Evaluation Indicator | Description |
| --- | --- |
| Accuracy | Percentage of correctly diagnosed positive and negative patients in all cases. |
| Sensitivity | Also known as the True Positive Rate (TPR): the proportion of positive patients who are diagnosed as positive, indicating the detection rate of symptoms. |
| Specificity | Also known as the True Negative Rate (TNR): the proportion of negative patients who are diagnosed as negative, indicating the detection rate of asymptomatic patients. |
| F1-score | The harmonic mean of precision and sensitivity, used to comprehensively assess the performance of a model. |
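
As a worked illustration of the four indicators, the following sketch computes them from the confusion matrix counts. The function name `classification_metrics` and the example counts are made up for illustration and are not results from this study; returning NaN for an empty denominator is also an assumed convention.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning NaN when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, sensitivity, specificity, and F1-score from confusion matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),   # true positive rate
        "specificity": _ratio(tn, tn + fp),   # true negative rate
        "f1_score": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the accuracy is 0.70 while the sensitivity is only about 0.67, which shows why accuracy alone can hide a weak detection rate for positive cases.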

Osteoporosis Classification Index
There are two common indexes for the diagnosis of osteoporosis: one is the BMD measured by DXA, and the other is the T-score, which is calculated from the BMD. The T-score serves as a more representative criterion for osteoporosis diagnosis. Therefore, we used these two indexes as the classification indexes for the subsequent deep learning experiments. BMD was categorized as normal or abnormal according to the DXA diagnostic images and the patient's age, and the T-score was categorized as not suffering from osteoporosis (T-score > −2.5) or suffering from osteoporosis (T-score ≤ −2.5) according to the DXA diagnostic images, as shown in Table 4 below.
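
The T-score rule can be written as a one-line labeling function. The 0/1 label encoding and the name `t_score_label` are assumptions made for this sketch; only the −2.5 threshold comes from the text. The age-dependent BMD rule is not sketched because its cut-off values are not given here.

```python
OSTEOPOROSIS_THRESHOLD = -2.5  # T-score cut-off stated in the text


def t_score_label(t_score: float) -> int:
    """Return 1 for osteoporosis (T-score <= -2.5) and 0 otherwise (assumed encoding)."""
    return 1 if t_score <= OSTEOPOROSIS_THRESHOLD else 0


labels = [t_score_label(t) for t in (-3.1, -2.5, -2.4, 0.3)]
print(labels)  # [1, 1, 0, 0]
```

Note that the boundary value −2.5 itself falls in the osteoporosis class.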

Deep Learning Model Training
In this study, the model training was divided into two parts: image segmentation and image classification. Image segmentation was performed using radiographic images of the femoral neck, greater trochanter, Ward's triangle, and total hip. The labeled images of the four regions were input into the U-Net and U-Net++ models for training, and the binary segmentation results generated by the two models were compared. For the U-Net and U-Net++ models, we used the Adam optimizer, the cross-entropy loss function, a batch size of 2, and a learning rate of 0.00001. For image classification, the X-ray images of the four regions were divided into two categories according to the BMD and T-score indexes described above, and experiments were conducted to compare the segmented and non-segmented images. The pre-trained deep learning models VGG16, ResNet50, and DenseNet121 were fine-tuned and used for the image classification experiments. In these experiments, we used the Adam optimizer, the cross-entropy loss function, a batch size of 16, and a lower learning rate of 0.000001. All these parameter settings were determined as the optimal parameters through multiple rounds of testing. Additionally, we added a dropout layer and a dense layer to the model structure. The purpose of the dropout layer is to avoid overfitting in the subsequent experiments, and the dense layer is used as the output layer to generate the probability of each class with the softmax function.
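
To make the role of the added dropout and dense layers concrete, the following framework-free NumPy sketch applies dropout to a batch of feature vectors and then a dense layer with softmax over two classes. It is only an illustration of the head, not the trained network: the dropout rate of 0.5, the feature width of 1024, the random weights, and the function names are assumptions, while the batch size of 16 and the two-class output follow the text.

```python
import numpy as np


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def classification_head(features, weights, bias, rate, rng, training=False):
    """Dropout followed by a dense output layer with softmax class probabilities."""
    hidden = dropout(features, rate, rng, training)
    return softmax(hidden @ weights + bias)


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 1024))            # one batch of 16 feature vectors (width assumed)
weights = rng.normal(scale=0.01, size=(1024, 2))  # random stand-in for learned weights
bias = np.zeros(2)
probs = classification_head(features, weights, bias, rate=0.5, rng=rng)
print(probs.shape)  # (16, 2)
```

Dropout is active only when `training=True`; at inference time the features pass through unchanged, so each row of `probs` is a deterministic pair of class probabilities that sums to 1.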

K-Fold Cross-Validation
K-fold cross-validation is a widely used model evaluation method in the field of machine learning. Its primary purpose is to mitigate biases introduced by specific data splits. The process involves dividing the dataset into K subsets. Subsequently, K rounds of evaluation are conducted, where, in each round, one of the subsets serves as the validation set, while the remaining subsets are used for training. This process is repeated until each subset has had the opportunity to be the validation set. Finally, the average of the evaluation results from all rounds is computed as the ultimate evaluation metric. For this study, K was set to 10, and K-fold cross-validation was employed as the model evaluation technique.
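
The procedure above can be sketched with the standard library alone. Shuffling the indices before splitting and the fixed seed are assumptions (the text does not say how the subsets were formed), and `evaluate` is a hypothetical callback that would train a model on the training indices and return its score on the validation indices.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by `evaluate(train, validation)` over k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# Dummy evaluation: report the fraction of samples held out in each round.
n = 95
mean_score = cross_validate(n, lambda train, val: len(val) / n, k=10)
print(round(mean_score, 3))
```

Each sample appears in exactly one validation fold, so every image contributes to the final averaged metric exactly once.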

Image Segmentation Results
U-Net and U-Net++ models were built for each of the four regions, with the IoU used as the evaluation metric; Table 5 shows the image segmentation results of U-Net and U-Net++. After training, the IoU was computed between the images predicted by each model for the four regions and the original manually labeled X-ray images. The U-Net++ result for the greater trochanter was better than that of U-Net, and the results for the other regions were about the same. The U-Net++ segmentation results were used in the present study for the subsequent experiments on image matting and image classification.
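
The IoU comparison between a predicted mask and a manually labeled mask can be sketched as follows. The function name `iou`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for this example; the 0.5 validity threshold is the one used in this study.

```python
import numpy as np


def iou(pred_mask, true_mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask).astype(bool)
    true = np.asarray(true_mask).astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed convention)
    return float(np.logical_and(pred, true).sum() / union)


# Toy masks: two 8-pixel bands that overlap in one row of 4 pixels.
pred = np.zeros((4, 4), dtype=np.uint8)
pred[0:2, :] = 1
true = np.zeros((4, 4), dtype=np.uint8)
true[1:3, :] = 1

score = iou(pred, true)   # intersection 4, union 12
is_valid = score >= 0.5   # validity threshold from the text
print(round(score, 3), is_valid)
```

Here the two masks share 4 of the 12 pixels they cover together, so the segmentation would not count as valid under the 0.5 threshold.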

Image Classification Results
After image segmentation and matting, the segmented and non-segmented X-ray images were each divided into two classes according to BMD and T-score as indicators of osteoporosis, and the pre-trained deep learning models VGG16, ResNet50, and DenseNet121 were used to conduct the experiments.

Categorization Results Using the Original Dataset
Using the T-score as an indicator, Table 6 shows the experimental results of image classification. The pre-trained VGG16 model performed best in the total hip classification test, with an accuracy of 0.69, sensitivity of 0.72, and specificity of 0.66, while the pre-trained DenseNet121 model performed worst in the total hip classification test, with an accuracy of only 0.50. It is worth noting that the pre-trained VGG16 model had a sensitivity of only 0.23 for the greater trochanter and a specificity of only 0.16 for Ward's triangle. This suggests that there may be data imbalance or model overfitting in these two regions of the dataset. Overall, except for VGG16, which achieved a better fit in the classification training for the four regions, most other models showed underfitting in the classification results.
Using BMD as an indicator, Table 7 shows the experimental results of image classification for each region. The pre-trained VGG16 model showed the best performance in the total hip classification test, with an accuracy of 0.70, sensitivity of 0.54, and specificity of 0.80, while the pre-trained DenseNet121 model showed the worst performance in the greater trochanter classification test, with an accuracy of only 0.47. The pre-trained VGG16 model had a specificity of only 0.1 in the classification test for Ward's triangle, which again suggests that the dataset for this region may have data imbalance or model overfitting problems. Overall, except for VGG16, which achieved a better fit for the four regions, most of the other models showed underfitting.
Regardless of whether the T-score or BMD was used as the index for the classification of osteoporosis, most of the experiments suffered from underfitting, as well as possible data imbalance or model overfitting in the greater trochanter and Ward's triangle. Data augmentation and data balancing were therefore required to improve the accuracy and to increase the generalization ability of the model. Because the original data for the greater trochanter and Ward's triangle were very poorly balanced, even data augmentation and balancing could not fully compensate for the low sensitivity and specificity values in these regions, although the values were much improved compared with those obtained before data augmentation. Next, image segmentation was added to the best-performing configuration for each of the two osteoporosis classification indexes to test whether the overall values could be further improved.

Categorization Results Using Segmentation
We took the best-performing configuration for each of the two classification indexes and tested whether adding image segmentation improved accuracy. Table 10 shows the classification results for the total hip using the pre-trained DenseNet121 model with the T-score as the indicator and the pre-trained VGG16 model with BMD as the indicator. After image segmentation was added, regardless of whether the T-score or BMD was used as the index for osteoporosis classification, the accuracy and F1-score did not improve overall, and the accuracy decreased from 74% to about 60%.

Discussion of Experiments
In the image segmentation experiments with U-Net and U-Net++, the results were reported to two decimal places. The results for the femoral neck, Ward's triangle, and total hip were the same for both models, and the U-Net++ result for the greater trochanter was slightly higher than that of U-Net. All segmentation results were greater than or equal to 0.5, so all of the segmentations can be considered valid image segmentations.
The image classification experiments were divided into three parts. The first experiment used the original dataset. Whether the T-score or BMD was used as the index for osteoporosis classification, most of the results showed underfitting, as well as data imbalance or model overfitting in the greater trochanter and Ward's triangle, so it was necessary to improve the accuracy and generalization ability of the model through data augmentation and data balancing. The second experiment used data augmentation. Whether the T-score or BMD was used as the index, most of the results improved after data augmentation. However, the sensitivity and specificity values for the greater trochanter and Ward's triangle remained low, probably because these two regions were poorly balanced in the original dataset. Thus, even data augmentation and balancing could not fully compensate for the low sensitivity and specificity caused by the data imbalance, although the values improved considerably compared with those before data augmentation. The third experiment used the image segmentation results. We added image segmentation to the best-performing configuration for each of the two osteoporosis classification indexes to test whether it could improve the classification ability of the model. However, after image segmentation was added, the accuracy and F1-score for the total hip did not improve, and the accuracy dropped from the highest value of 74% to 60%, regardless of whether the classification index was the T-score or BMD. We hypothesize that classifying each region of the hip requires feature information from the surrounding area, and image matting removes this surrounding information, which degrades the classification results; the model trained on non-segmented images therefore has a higher classification ability.
Regarding the osteoporosis classification indexes, the experimental results for BMD and the T-score do not differ greatly. The classification accuracy for the total hip is higher than that for the other regions, which is in line with physicians' expectations, while the classification accuracy for Ward's triangle is lower because of the serious data imbalance in the original dataset. Although the sensitivity and specificity of the classification results improved after data augmentation, many results were still below 0.50. Among the three deep learning models, VGG16 performs better than DenseNet121 and ResNet50, indicating that the classification of hip X-ray images does not necessarily require the deeper network structures of DenseNet121 and ResNet50; a plain VGG16 can help solve this classification problem.

Conclusions
Many people are troubled by osteoporosis as they get older. Since the condition is not obvious in the early stages, people often overlook the fact that their BMD is decreasing with age, so it is important to promote the concept of bone protection and early prevention and treatment. In recent years, convolutional neural networks have been widely used in different fields of research, and the number of medical image analysis studies is increasing year by year in areas such as thoracic medicine, dermatology, ophthalmology, orthopedics, and dentistry. Artificial intelligence is expected to promote healthcare in a variety of ways, not only in the diagnosis of patients and the development of medicines but also as a good assistant to the doctor, providing better and more personalized medical services so that people can get better healthcare. Through the analysis of hip X-ray images, this study constructed two sets of deep learning models for the automatic segmentation and classification of X-ray images, which can be used as a reference for osteoporosis assessment and diagnosis.
In the image segmentation results of this study, the U-Net and U-Net++ models produced similar IoU values for the femoral neck, Ward's triangle, and total hip, while the IoU of 0.85 achieved by U-Net++ for the greater trochanter was better than the 0.78 achieved by U-Net. All segmentation IoU values were greater than or equal to 0.5 and can be regarded as valid segmentation results. In the image classification experiments, because the amount of raw data was insufficient, the model complexity was too low and the number of image features was too small, which indirectly led to model underfitting. The accuracy of the experiments can be improved by data augmentation, and data augmentation together with the dropout layer added to the model also helps to prevent overfitting in subsequent experiments. Using the T-score as the basis for classification, the DenseNet121 model without U-Net++ image segmentation achieved the highest accuracy of 74% on the total hip, with an F1-score of 71%. In the comparison of deep learning models, most of the VGG16 accuracies were slightly higher than those of both DenseNet121 and ResNet50, indicating that, instead of a deeper neural network, the simpler VGG16 model can perform well on the problem of hip X-ray image classification. In the best experimental results, the accuracy of total hip classification was 74% for both the T-score and BMD. Using the total hip image as the basis for assessing osteoporosis was more consistent with the orthopedic surgeon's diagnosis of the hip.
The contribution of this study lies in the establishment of an automated X-ray image segmentation model and an automated model for reading X-ray images for osteoporosis, in the hope of providing some assistance to orthopedic surgeons in the diagnosis of osteoporosis. Moreover, the cost of DXA is relatively high. As the middle-aged and elderly population is increasing year by year, the number of people with osteoporosis is also increasing, but DXA is only available in higher-grade hospitals or nursing homes. Patients

Figure 1. Deep learning-based hip X-ray image analysis for predicting osteoporosis.

Figure 2. Image labeling tool and interface.

Image Segmentation
In this study, the labeled X-ray images of the four regions were fed into the U-Net and U-Net++ models for training, and the bit depth of the images was converted from the original 24 bits to 8 bits before model training. U-Net and U-Net++ were chosen because their model structures are simpler, they do not require much time to filter out the remaining noise in the medical images, and they are less likely to overfit on a small image dataset. The binary segmentation prediction results obtained after training the U-Net and U-Net++ models are shown in Figure 3 below.

Figure 4. Image matting.

Table 1. Comparison of data volume before and after data expansion.

True Negative (TN): The diagnosis was negative and symptom-free.

Table 4. Classification basis of the experimental T-score in this study.

Table 5. Results of image segmentation experiments.

Table 6. Experimental results of image classification by parts (original dataset with T-score).

Table 10. Experimental results of image categorization of total hip.