Deep Learning for Orthopedic Disease Based on Medical Image Analysis: Present and Future

Since its development, deep learning has been quickly incorporated into the field of medicine and has had a profound impact. Since 2017, many studies applying deep learning-based diagnostics in the field of orthopedics have demonstrated outstanding performance. However, most published papers have focused on disease detection or classification, and reports in areas such as segmentation and prediction remain less satisfactory. This review introduces research published in the field of orthopedics, classified according to disease from the perspective of orthopedic surgeons, and discusses areas of future research. This paper provides orthopedic surgeons with an overall understanding of artificial intelligence-based image analysis and with the message that medical data should be treated with low prejudice, and it provides developers and researchers with insight into the real-world context in which clinicians are embracing medical artificial intelligence.


Introduction
A convolutional neural network (CNN) is a deep learning algorithm architecture created based on a 1962 study investigating the visual process of feline brains, and it has been applied in a wide range of areas, from autonomous vehicles to medical diagnoses [1].
A traditional CNN consists of an input layer that receives the input information, hidden layers that transform the information received from the input layer (filtering) and emphasize the dominant features (pooling), and an output layer that finally synthesizes and outputs the information.
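To make the filtering and pooling steps concrete, the following is a minimal sketch in plain Python. It is not taken from any of the reviewed studies; the 5 × 5 toy image, the 2 × 2 edge kernel and the function names `conv2d` and `max_pool` are illustrative assumptions.

```python
# Illustrative only: one filtering (convolution) step followed by one pooling step.

def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation computed by CNN convolution layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the strongest response in each window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [max(feature_map[i + a][j + b]
             for a in range(size) for b in range(size))
         for j in cols]
        for i in rows
    ]


# Hypothetical toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical kernel that responds to left-to-right intensity increases.
kernel = [[-1, 1],
          [-1, 1]]

features = conv2d(image, kernel)   # 4 x 4 feature map
pooled = max_pool(features)        # 2 x 2 map after pooling

print(features)
print(pooled)
```

In this toy case the filter responds only where the edge lies, and pooling keeps that response while halving the resolution, which is the sense in which the hidden layers filter and then emphasize features.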
According to the universal approximation theorem, a wide variety of classifications are possible even if the neural network has only a shallow hidden layer, and some pioneering studies have shown that classification and detection improve as the layers constituting the neural network become deeper (deep neural network) [2]. Since 2012, the performance of deep learning in image analysis has rapidly increased with the use of deep neural networks, and this led to a decrease in the classification error rate on the ImageNet benchmark from approximately 25% in 2011 to 3.6% in 2015.
The CNN model was developed using a pipeline in terms of classification and detection [3], and the improved CNN shows excellent judgment, essentially giving the computer a new visual organ. A CNN has thus been expected to be used for medical diagnoses. However, a CNN does not provide any information on the basis of the decision. Therefore, even if a CNN shows an excellent diagnostic ability, it can only be discussed within a limited scope in medicine, where the basis for a judgment is important [4].
This has been pointed out as a technical limitation that reduces the effectiveness of a CNN in various fields other than medicine [5]. Researchers have dubbed this limitation the "black box issue" and have worked to develop "explainable artificial intelligence (XAI)" to look inside the black box [6]. The term "explainable" is used interchangeably with "understandable", "comprehensible" or "interpretable". XAI should improve the explainability of the model without degrading its classification or prediction performance in any way. Various strategies and suitable CNN architectures have been proposed to implement an appropriate XAI [7]. Unfortunately, the black box nature of deep learning has not been completely resolved, but there are some notable achievements [8]. As one of these achievements, in 2016, Zhou et al. introduced class activation mapping, a method for explaining how a CNN makes a decision [9], and this method is widely used in the field of medical artificial intelligence [10]. In a similar context, there are attempts to improve the explainability by modifying existing CNN architectures [11]. Kim et al. modified U-Net, a CNN architecture that is strong in image segmentation, to increase its explainability. They presented an interpretable version of U-Net (SAU-Net) using an attention module in the decoder part [12].
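The idea behind class activation mapping can be sketched in a few lines. In the general formulation, the map for a class is the sum of the final convolutional feature maps, each weighted by that class's weight in the output layer that follows global average pooling. The example below is a plain-Python illustration under that assumption; the two 2 × 2 feature maps, the weights and the function names are hypothetical and do not come from any reviewed model.

```python
# Illustrative sketch of class activation mapping (CAM) on toy data.

def global_average_pool(feature_map):
    """Average all spatial positions of one feature map into a single number."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps: cam[y][x] = sum_k w_k * f_k[y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * fm[y][x] for w, fm in zip(class_weights, feature_maps))
         for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical feature maps from the last convolutional layer.
feature_maps = [
    [[0.0, 1.0], [0.0, 3.0]],   # channel 0
    [[2.0, 0.0], [2.0, 0.0]],   # channel 1
]

# Hypothetical output-layer weights for one class (e.g., "fracture").
fracture_weights = [1.0, -0.5]

cam = class_activation_map(feature_maps, fracture_weights)

# Class score as computed by the network: weights applied to pooled features.
score = sum(w * global_average_pool(fm)
            for w, fm in zip(fracture_weights, feature_maps))

print(cam)
print(score)
```

Without a bias term, the class score equals the spatial average of the map, so the map shows how much each image region contributes to the decision. This is why it can be overlaid on a radiograph as a heat map.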
Hence, studies introducing CNN models for diagnosing and classifying diseases using deep learning have been published in various fields of medicine, including ophthalmology and dermatology [13,14].
This trend is spreading rapidly in the field of orthopedics. Since 2017, when orthopedic disease research using deep learning was first introduced, the number of related papers has increased rapidly, and more than 300 papers in this area have been published. For this review, the search was conducted using PubMed, MEDLINE and Embase, and papers published from 1 January 2017 to 2 November 2021 were screened. The search query was (orthopedic OR orthopaedic) AND (deep learning). Two orthopedic surgeons (S.W.C. and J.H.L.) independently reviewed the full text of the retrieved papers. Of these, 48 studies that both authors judged to be interesting and practical within the clinical context of orthopedic surgery are introduced and classified according to disease. This paper aims to give a vivid picture of how medical artificial intelligence can help orthopedic surgeons treat patients and to show developers and researchers the context in which clinicians are accepting medical artificial intelligence.
The authors introduce the selected papers by classifying them into the following sections: (1) Deep Learning for Fractures, (2) Deep Learning for Osteoarthritis and the Prediction of Arthroplasty Implants, (3) Deep Learning for Joint-Specific Soft Tissue Disease, (4) Miscellaneous and (5) Discussion.

Deep Learning for Fractures
Fractures are the most familiar ailments to orthopedists and the medical area in which deep learning methods were first applied. In 2018, Chung et al. published a CNN model for diagnosing and classifying proximal humerus fractures. Three specialists labeled 1891 anteroposterior shoulder radiographs as normal shoulders (n = 515) and 4 proximal humerus fracture types (greater tuberosity: 346; surgical neck: 514; 3-part: 269; and 4-part: 247) [15]. After labeling, a CNN model (ResNet-152) was trained with a training dataset created through augmentation of the labeled data. The CNN model recorded 96% accuracy for the normal shoulders and proximal humerus fractures, showing a higher accuracy than a general orthopedist (92.8% accuracy). This model showed a top-1 accuracy of 65-86% and an area under the curve (AUC) of 0.90-0.98 for classifying the fracture types. A recently published paper introduced a model with improved classification accuracy. In 2020, Demir et al. introduced a deep learning model to diagnose and classify humerus fractures using the exemplar pyramid method, a novel, stable feature extraction approach which showed a high classification accuracy of 99.12% [16].
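Because top-1 accuracy and AUC are the metrics reported throughout this review, a small worked example may help. The sketch below uses plain Python and made-up predictions; the labels, scores and function names are illustrative assumptions, not data from any cited study. The AUC is computed by its pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
# Illustrative computation of top-1 accuracy and AUC on made-up predictions.

def top1_accuracy(predicted, actual):
    """Fraction of cases whose highest-ranked predicted class is the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical multiclass output: predicted versus true fracture type.
predicted_types = ["normal", "surgical neck", "3-part", "4-part"]
true_types = ["normal", "surgical neck", "4-part", "4-part"]

# Hypothetical binary output: model score for "fracture" and the true label.
fracture_scores = [0.9, 0.8, 0.4, 0.5, 0.1]
fracture_labels = [1, 1, 1, 0, 0]

accuracy = top1_accuracy(predicted_types, true_types)
area = auc(fracture_scores, fracture_labels)

print(accuracy)
print(area)
```

Accuracy depends on a single decision threshold or class choice, whereas the AUC summarizes how well the scores rank fractured above normal cases across all thresholds, which is why the two figures are often reported together.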
Urakawa et al. trained the VGG-16 CNN model using hip plain radiographs (1773 intertrochanteric hip fracture images and 1573 normal hip images) and showed an accuracy of 95.5% [17]. Yamada et al. trained a CNN model (Xception architecture) based on 3123 hip plain and lateral radiography images, and the trained model classified fractures with 98% accuracy, which is higher than that of orthopedists (92.2% accuracy) [18].
For the hip, as with the shoulder, there has been an attempt to classify fractures by training the CNN model. Lee et al. trained a CNN model based on GoogLeNet Inception v3 using 786 anteroposterior pelvic plain radiographs [19]. The model classified proximal femur fractures into type A (trochanteric region), type B (femur neck) and type C (femoral head) according to the AO/OTA classification with an overall accuracy of 86.8%, showing a reasonable result. Lind et al. trained a ResNet-based CNN with anteroposterior and lateral knee radiographs, amounting to 6768 images [20]. The trained CNN model classified knee radiographic images according to the AO/OTA classification system and classified proximal tibia fractures, patellar fractures and distal femur fractures with AUCs of 0.87, 0.89 and 0.89, respectively.
The trained CNN diagnosed and classified fractures at a relatively high level in the large joints of the shoulder, knee and hip. By contrast, a CNN model trained to diagnose and classify fractures in small joints or axial joints showed a relatively low AUC and accuracy. Farda et al. trained a PCANet-based CNN model that classified calcaneal fractures according to the Sanders classification using computed tomography with 5534 datasets [21]. The trained CNN model showed 72% accuracy. In addition, Ozkaya et al. trained a CNN model based on ResNet50 with 390 anteroposterior wrist radiographic images [22]. The AUC of the trained CNN was 0.84, showing a relatively satisfactory result, but it was lower than that of experienced orthopedists.
Langerhuizen et al. compared the scaphoid fracture diagnostic accuracy between a deep learning algorithm and orthopedic surgeons [23]. They trained the VGG16 CNN model with 150 radiographic images of scaphoid fractures and 150 images of normal wrist radiography without a fracture. Of the 150 images with scaphoid fractures, 23 could not be judged from the radiographic images and could only be confirmed through magnetic resonance imaging (MRI). The accuracy of the trained CNN model was 72%, which was lower than that of the orthopedic surgeons (84%). However, the model detected five of six occult scaphoid fractures that were missed by all human observers.
An attempt was also made to diagnose the compression fractures in the spine using a trained CNN. The results showed a significant difference depending on the type of data used for learning. Chen et al. trained a ResNet-based CNN model using plain spine X-rays, and the trained CNN showed an accuracy of 73.59% [24]. By contrast, Yabu et al. presented a CNN model using MRI images as the training data. This model showed a higher accuracy (88%) than that of the surgeons [25].
In summary, fracture diagnosis using artificial intelligence showed a high level of accuracy. The trained CNN model conducted fracture diagnosis (binary classification) with a higher accuracy than fracture classification (multiclass classification), and this gap is expected to decrease as more advanced CNN models are developed.
In classifying fractures, small and axial joints showed a lower accuracy than large joints (Table 1). This may be a limitation of a CNN-based approach, which makes judgments by recognizing the contrast information (e.g., normal margin of the cortical bone and the fracture line or normal joint line) and spatial information of the images. The authors believe that this limitation can be overcome using more powerful CNN models. Most studies on the diagnosis and classification of fractures using deep learning have focused on osteoporotic fractures, and studies on joints in which osteoporotic fractures are infrequent are relatively scarce [26]. This may be because osteoporotic fractures account for a high proportion of all fractures, so sufficient data are available for training a CNN model, and because their fracture patterns are relatively standardized, making them suitable for use in fracture classification.

Deep Learning for Osteoarthritis and Prediction of Arthroplasty Implants
Osteoarthritis is as familiar to orthopedists as fractures. Therefore, several attempts have been made to diagnose and classify osteoarthritis using deep learning algorithms. Xue et al. trained a CNN model based on VGG-16 with 420 plain hip X-rays [27]. This is one of the earliest studies to apply deep learning methods to the orthopedic field, and the trained model diagnosed hip osteoarthritis with an accuracy of 92.8%. Ureten et al. also presented a model for diagnosing hip osteoarthritis using a similar research design, showing an accuracy of 90.2% [28].
Tiulpin et al. trained a CNN model to classify knee osteoarthritis according to the Kellgren-Lawrence grading scale using a Siamese classification CNN [29]. The model trained using plain knee X-rays showed a multiclass accuracy of 66.7%. In addition, Swiecicki et al. trained a Faster R-CNN using plain and lateral knee X-rays from the Multicenter Osteoarthritis Study dataset [30]. The multiclass accuracy of this model was 71.9%, which showed improved performance compared with the previous study conducted by Tiulpin et al. Pedoia et al. trained a DenseNet-based CNN based on MRI-T2 images rather than X-ray data, as used in previous studies, and this model showed a high AUC of 0.83 [31]. Kim et al. trained an SE-ResNet-based CNN model using 4366 knee anteroposterior X-rays as a dataset. Furthermore, they trained the model by adding demographic information (age, sex and body mass index), alignment and metabolic data information that can affect knee osteoarthritis, in addition to image information [32]. The diagnostic performance of the image data with additional patient information showed a significantly higher AUC (Table 2).
Advanced osteoarthritis of the hip or knee often requires arthroplasty. Several studies have introduced a model for classifying arthroplasty implants used by patients with deep learning algorithms. Karnuta et al. trained the InceptionV3 network-based CNN model using anteroposterior knee X-rays with nine different implant models inserted [33] [37].
In summary, as in the case of using deep learning for fractures, binary classification of osteoarthritis has a higher accuracy than multiclass classification. In particular, the CNN-based model for identifying arthroplasty implants of the hip or knee shows a high accuracy. This may be because, unlike human bone, the implant design is highly standardized, demonstrating a clear margin on X-rays and providing clear contrast information to the CNN model. However, the classification of shoulder arthroplasty implants shows a low level of accuracy. This may be because a shoulder anteroposterior X-ray can show a wider range of positions than an anteroposterior radiograph of the knee or hip.

Deep Learning for Joint-Specific Soft Tissue Disease
As for deep learning approaches, an algorithm specialized for detection based on learned images and an algorithm for segmentation by analyzing features have structural differences and have developed into different areas of application [3]. In particular, segmentation is technically difficult in that it is necessary to preserve spatial information that is easily lost in the outer-layer process of synthesizing the results of the CNN model being trained [38]. Recent studies have attempted to overcome these limitations through techniques such as semantic segmentation based on fully convolutional networks (FCNs).
These differences in deep learning algorithms also affect the use of deep learning in the orthopedic field. The deep learning-based studies introduced above are cases of diagnosing and classifying diseases based on X-ray images, and a CNN model specialized for segmentation is not always required [39]. By contrast, for diseases that are diagnosed and classified based on images such as ultrasound or MRI, a satisfactory level of accuracy can be obtained only by using a CNN model specialized for segmentation. For example, a CNN model for diagnosing rotator cuff tears is better suited to inferring such tears from the outline of the normal rotator cuff (segmentation) than to diagnosing them by specifying the location where the tear occurred (regional detection).
Therefore, CNN models for diagnosing soft tissue disease in the orthopedic field have mainly been published after 2018, which was when the segmentation technology began to mature. Kim [10]. In addition, Lee et al. developed a new deep learning architecture using an integrated positive loss function and a pretrained encoder. Using this, the location of the rotator cuff tear can be relatively accurately determined, even when imbalanced and noisy ultrasound images are provided [43].
Recent studies suggesting a CNN model for diagnosing meniscal tears, cartilage lesions and anterior cruciate ligament (ACL) ruptures in the knee joint have also been published. Couteaux et al. presented a model that trains a Mask-RCNN with 1828 T2-weighted 2D Fast Spin-Echo images to distinguish the torn part from the normal area of the meniscus and to classify it according to the location of the tear [44]. This model diagnosed and classified meniscal tears with an AUC of 0.91. Roblot et al. also proposed a model for diagnosing meniscal tears in a similar way, detecting meniscal tears with an AUC of 0.94 [45].
Chang et al. presented a model for diagnosing complete ACL tears by training a U-Net-based CNN using 320 coronal proton density-weighted 2D Fast Spin-Echo images, demonstrating an AUC of 0.97 [46]. In addition, Flannery et al. trained a modified U-Net-based CNN and evaluated the level of segmentation of the model. The segmentation level suggested by the trained model did not show a statistically significant difference from the ground truth (the value actually suggested by an expert) (Figure 3) [47].

Miscellaneous
Concerning bone age, attempts to create a model that automatically predicts a bone's age through the learning of plain X-rays of carpal bones have been conducted since before the first deep learning algorithm was developed. Mahmoodi et al. presented a bone age prediction model with an accuracy of 82% in 2000, using a regression model and a Bayesian estimator [48]. A CNN model using a deep learning algorithm was developed, and it is now possible to predict the bone age with improved accuracy. In addition, Han et al. proposed a model with 97.6% accuracy by training the Inception ResNet v2 model with 5876 hand radiographs [39].
For pediatrics, developmental dysplasia of the hip is one of the most common hip joint disorders in infants and young children, and its diagnosis is difficult owing to the extensive variations in pediatric pelvic anatomy [49]. To create a deep learning algorithm that can diagnose developmental dysplasia of the hip, Zhang et al. trained a CNN model (based on ResNet-101) using 10,219 pelvic anteroposterior radiographs of children. The trained model showed a high AUC of 0.975 [50].
An acute pediatric elbow fracture is also difficult to diagnose, owing to the existence of multiple cartilaginous ossification centers and a highly variable appearance [51]. England et al. trained a CNN using 901 lateral elbow radiographs, and the trained model diagnoses elbow fractures with a high AUC of 0.985 [52].
Central dual-energy X-ray absorptiometry is the reference standard for diagnosing osteoporosis and osteopenia. A CNN model for diagnosing osteopenia and osteoporosis using plain radiography without dual-energy X-ray absorptiometry was recently introduced.
Zhang et al. trained a CNN model with 2564 lumbar X-ray images, and this model showed an AUC of 0.767 and 0.810 for osteoporosis and osteopenia, respectively [53]. Yamamoto et al. trained a CNN with 1131 hip X-rays, and this model diagnosed osteoporosis with an accuracy of 0.885 [54].
For alignment, Pei et al. published an interesting study using a deep learning algorithm to automatically measure the hip-knee-ankle angle. They trained a CNN model with 796 unilateral lower limb X-rays, showing a difference of 0.49° from the ground truth measured directly by orthopedic surgeons [55]. In addition, Rouzrokh et al. trained a CNN model with 600 hip anteroposterior and 600 hip lateral X-rays taken after total hip arthroplasty and programmed this model to automatically derive the acetabular component inclination and version. Compared with the ground truth, this model showed a difference of 1.35° for the inclination and 1.39° for the anteversion [56].
Galbusera et al. presented a CNN model trained using biplanar radiographs of the spine. The model automatically calculated the T4-T12 kyphosis, L1-L5 lordosis, Cobb angle of scoliosis, pelvic incidence, sacral slope and pelvic tilt. Among them, the pelvic tilt showed a difference of 2.7° compared with the ground truth, whereas the L1-L5 lordosis showed a difference of 11.5° from the ground truth [56].
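Once a model has located the relevant landmarks, angle measurements of this kind reduce to simple geometry. The sketch below is an illustration only: it assumes that the hip-knee-ankle angle is taken as the angle at the knee center between the lines to the hip and ankle centers (180° for a straight limb), and the coordinates, angle values and function names are hypothetical. It also shows how a mean absolute difference from expert measurements, as reported in the studies above, could be computed.

```python
import math

# Illustrative geometry for landmark-based angle measurement.

def angle_at(vertex, p1, p2):
    """Angle in degrees at `vertex` between the rays toward p1 and p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


def mean_absolute_difference(predicted, reference):
    """Average absolute gap between model output and expert measurement."""
    gaps = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(gaps) / len(gaps)


# Hypothetical landmark coordinates (in pixels) for a perfectly straight limb.
hip, knee, ankle = (0.0, 0.0), (0.0, 40.0), (0.0, 80.0)
straight_limb_angle = angle_at(knee, hip, ankle)

# Hypothetical model angles versus surgeon-measured angles for three limbs.
model_angles = [178.0, 182.5, 175.0]
surgeon_angles = [178.5, 182.0, 175.5]
error = mean_absolute_difference(model_angles, surgeon_angles)

print(straight_limb_angle)
print(error)
```

Using `atan2` of the cross and dot products rather than `acos` keeps the calculation numerically stable near 0° and 180°, which matters because clinically relevant alignment angles lie close to a straight line.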
Concerning metastasis and infections in the spine, the spine is a joint that receives a high blood supply and is relatively easily exposed to metastasis compared with other joints [57]. Therefore, studies for diagnosing metastatic lesions using deep learning algorithms have mainly focused on the spine. Wang et al. reported that a CNN model trained with sagittal fat-suppressed T2 2D Fast Spin-Echo spine images localized metastatic lesions with a sensitivity of 90% [58]. In addition, Chmelik et al. trained a CNN with sagittal computed tomography images containing 1046 lytic lesions and 1135 sclerotic lesions, and the trained model detected lytic and sclerotic lesions with AUCs of 0.80 and 0.78, respectively [59].
Kim et al. published a CNN model to discriminate between tuberculous and pyogenic spondylitis. They trained the CNN using axial T2-weighted 2D Fast Spin-Echo images, and the trained CNN model divided the two conditions with an AUC of 0.80, with no significant difference from a human reader [60].
As for other applications, in addition to the previously introduced papers, studies using deep learning algorithms in the field of orthopedic surgery have been published.

Discussion
Orthopedics, along with dermatology, ophthalmology and cardiology, is the medical field in which research into deep learning algorithms is most actively conducted. Related research has been explosively increasing since 2017, and this trend is expected to continue until the "new winter", when the development of artificial intelligence will reach its limit.
To date, image analysis studies of orthopedic diseases using deep learning have shown excellent results overall. Several studies have reported that in fractures and osteoarthritis, a trained CNN model has a diagnostic accuracy comparable to that of an expert. The studies also presented satisfactory results for the classification of fractures and osteoarthritis. However, the accuracy of multiclass classification did not reach that of detection, and studies on small joints presented relatively poor results compared with studies on large joints.
Nevertheless, it is expected that this limitation can be overcome for two reasons. First, the CNN model for medical image analysis aims for accurate diagnosis and appropriate classification, and the number of classes required for this purpose is relatively small. Basha et al. showed that when there are few classes, the accuracy can be improved using a CNN model with deeper layers [63]. Therefore, it is expected that the development of a CNN model with a deeper architecture will increase the accuracy of multiclass classification through medical image analysis. Second, medical images are extremely refined data compared with images used to learn road traffic conditions or climate predictions; that is, researchers can relatively easily obtain appropriate image data without noise, such as different heights of traffic lights or flying birds. This means that even with simple data augmentation such as an affine transformation, an appropriate dataset for training the CNN model can be provided. Therefore, the authors expect that the development of CNN models and the accumulation of additional medical images will increase the classification accuracy for fractures and osteoarthritis, which is currently weaker than the accuracy of diagnosis. In the same context, it is also expected that the diagnosis and classification of joint-specific soft tissue disease will be improved, owing to the development of deep learning algorithms advantageous for segmentation. Indeed, several recent studies have completed segmentation at a high level [64,65]. In particular, Hashimoto et al. segmented the psoas major muscle through a U-net-based CNN model, and the trained model showed an average intersection over union (IoU) of 86.6%. U-net is one of the most important semantic segmentation frameworks of CNNs [66] and has the strength of having an architecture that can recognize structural edges.
Therefore, U-net is expected to be widely used for segmentation of medical images [67]. Although not in the field of orthopedics, new CNN architectures based on U-Net are continuously being introduced and reporting notable results [68]. Rundo et al. performed prostate zonal segmentation with USE-Net, incorporating Squeeze-and-Excitation blocks (SE) into U-Net [69]. Yeung et al. showed that the model trained with a dual attention-gated CNN (Focus U-Net), which improved the U-Net, segmented the polyp of the colonoscopy image to a satisfactory level [70].
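The intersection over union (IoU) metric cited above for segmentation quality is straightforward to compute from two binary masks. The following plain-Python sketch uses two made-up 3 × 3 masks; the mask values and the function name are illustrative assumptions.

```python
# Illustrative computation of intersection over union (IoU) for binary masks.

def intersection_over_union(pred_mask, true_mask):
    """Pixels labeled by both masks divided by pixels labeled by either mask."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so define their IoU as 1.0.
    return intersection / union if union else 1.0


# Hypothetical model output (1 = pixel assigned to the structure).
predicted = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]

# Hypothetical expert annotation of the same image.
ground_truth = [[0, 0, 1],
                [0, 1, 1],
                [0, 0, 1]]

iou = intersection_over_union(predicted, ground_truth)
print(iou)
```

Here three pixels are shared and five are covered by at least one mask, so the IoU is 0.6. Unlike pixel accuracy, the IoU ignores the large background area, so it is not inflated when the segmented structure occupies only a small part of the image.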
Studies published in the field of orthopedic surgery have thus far been unable to present a CNN model with a higher level of diagnosis and classification than experts. An in-depth discussion is needed as to whether these results are a problem that can be overcome through data accumulation or the development of a better CNN, or whether they are a natural limitation of a CNN model learned from image data.
The authors offer two approaches. First, experts do not solve problems with image data alone. Experts can utilize information other than images, such as the patient's demographic data, the degree of pain, the nature of the disease and a physical examination, which can affect the disease diagnosis and classification. Indeed, Kim et al. reported that a CNN model trained by adding demographic information (age, sex and body mass index), alignment and metabolic data that could affect knee osteoarthritis showed a statistically significantly higher AUC [32]. Therefore, even if an improved CNN model is developed and high-quality image data are accumulated, there is a possibility that the image analysis-based CNN model using a deep learning algorithm will not reach the level of experts.
Second, despite the opinions presented above, the possibility that CNN models will outperform experts in certain fields cannot be excluded, because the CNN model analyzes images from a different point of view than human beings. Among 150 images of scaphoid fractures, Langerhuizen et al. included 23 scaphoid fracture image data that could only be confirmed through an MRI. The trained CNN model showed a lower level of accuracy than orthopedic surgeons, but it detected five of six occult scaphoid fractures that were missed by all human observers [23]. It is therefore necessary to carefully discuss whether an image analysis model using deep learning can outperform experts.
It is clear that the present CNN models have room for improvement. However, this does not undermine the significance of the studies conducted to date. The currently developed CNN model can reduce the task intensity of the expert reader and can be used for the education of non-expert medical workers, such as medical students or specialists during training [71]. In addition, through a developed CNN model, a pediatrician can roughly estimate a patient's bone age using only X-rays without the help of an orthopedic surgeon.
Stepping away from the accuracy contest between clinicians and CNNs, there are interesting studies that offer more practical help to patients and doctors. Nie et al. converted native medical CT images to higher resolution images through generative adversarial networks (GANs) [72], and this approach has the potential to be extended to MRI images [73]. It could therefore help communities that have no choice but to use low-quality MRI owing to insufficient medical infrastructure, as well as patients for whom high-quality MRI is difficult to access because of its cost.
The authors reviewed deep learning approaches for orthopedic diseases applied through image analysis and found some limitations. First, there are no models approved by the Food and Drug Administration, other than a CNN model for predicting the bone age in children and a model for diagnosing wrist fractures [74]. In other medical departments, several models have been approved by the Food and Drug Administration, starting with a deep learning-based model for the automatic diagnosis of diabetic retinopathy in April 2018.
Second, no prospective studies have been conducted [75]. To improve the quality of research and continue applicable studies, a prospective and randomized trial according to the CONSORT-AI guidelines presented in 2020 will be necessary [76].
Third, recently described deep learning methods have mostly been designed to conduct a single task. To be useful in clinical practice, multiple deep learning algorithms will need to evaluate every possible abnormality. Some efforts have been made to overcome these limitations. For example, Grauhan et al. presented a CNN model for diagnosing fractures, joint dislocation and osteoarthritis through plain shoulder radiographs [77].
Finally, there is a need to reduce expert bias on a given dataset. Orthopedic surgeons have traditionally used ultrasound, computed tomography or MRI to diagnose soft tissue diseases. However, deep learning algorithms often make appropriate judgments beyond human cognition. Kang et al. presented a model for diagnosing subscapularis (SSC) tendon tears with a CNN model trained using axillary lateral radiographs, and the trained model showed an appropriate level of accuracy [78]. Thus, orthopedic surgeons may have the freedom to develop CNN models based on their imagination, free from prejudice.
In conclusion, image analysis using deep learning presents a clear milestone in the field of orthopedics and is experiencing explosive growth. The development of a CNN architecture and the accumulation of refined image data are expected to lead to the development of more sophisticated models. However, it is difficult to predict whether a deep learning model that exceeds the capability of experts can be created. Orthopedic surgeons who want to apply a deep learning algorithm to image analysis need to treat data with low prejudice, present research that meets the newly suggested guidelines and focus on developing models that can multitask.
Author Contributions: Conception and design of study, interpretation of data and approval of the version of the manuscript to be published, S.W.C.; interpretation of data, drafting the manuscript and approval of the version of the manuscript to be published, J.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was supported by the "Regional Innovation Strategy (RIS)" through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2021RIS-001(1345341783)).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.