Detecting Ankle Fractures in Plain Radiographs Using Deep Learning with Accurately Labeled Datasets Aided by Computed Tomography: A Retrospective Observational Study

Abstract: Ankle fractures are common and, compared to other injuries, tend to be overlooked in the emergency department. We aim to develop a deep learning algorithm that can detect not only definite fractures but also obscure fractures. We collected the data of 1226 patients with suspected ankle fractures who underwent both X-rays and CT scans. Using the anteroposterior (AP) and lateral ankle X-rays of 1040 patients with fractures and 186 normal patients, we developed a deep learning model. The data were split into training, validation, and test datasets in a 3/1/1 ratio. Data augmentation and under-sampling techniques were applied as part of the preprocessing. The Inception V3 model was utilized for the image classification. Performance of the model was validated using a confusion matrix and the area under the receiver operating characteristic curve (AUC-ROC). For the AP and lateral trials, the best accuracy and AUC values were 83%/0.91 in AP and 90%/0.95 in lateral. Additionally, the mean accuracy and AUC values were 83%/0.89 for the AP trials and 83%/0.90 for the lateral trials. The reliable dataset resulted in the CNN model providing higher accuracy than in past studies.


Introduction
Orthopedic radiography is one of the most common imaging methods to diagnose fractures. However, fractures, particularly in the foot and ankle, tend to be easily overlooked or misdiagnosed when radiographs are interpreted, especially in the emergency department (ED) [1].
Ankle injuries are a common cause for outpatient visits; hence, it is important that their diagnosis be accurate for further evaluation and treatment. The ankle consists of 3 bones (the tibia, fibula, and talus), 2 joints (ankle and syndesmosis), and 3 sets of ligaments (medial, lateral, and syndesmotic). Owing to the complex structure of the ankle, fractures associated with it are often difficult to identify, raising the rate of misdiagnosis to nearly 4.2% in the ED [2,3].
Artificial intelligence (AI) can potentially provide a solution to this challenge; several studies are presently being undertaken to detect fractures using deep learning technologies [4]. Deep learning is a subdomain of AI wherein a system is trained to imitate the human brain. The convolutional neural network (CNN) is a widely used deep learning algorithm for data processing, especially for 2D images [5].
Previous studies have successfully applied CNNs to detect fractures in radiographs [6][7][8]. Yu et al. [9] designed a CNN algorithm using pelvic radiographs that could detect a femoral neck fracture with 97% accuracy; however, the algorithm detected other types of fractures with lower accuracy, rendering it inadequate for the purpose of our study.
To the best of our knowledge, there are two studies on ankle fractures using deep learning. Santos et al. [10] used data from structured reports of X-ray images of ankle fractures. The dataset included 157 patients, of whom 129 had fractures and 28 did not, and the model exhibited an accuracy of 77% with an area under the curve (AUC) of 0.85. Kitamura et al. [11] performed an extensive study on a larger dataset of 596 images (ankles with and without fractures equally apportioned) with five different CNN architectures; the accuracy peaked at 81%.
Both studies are proof-of-concept research; hence, we considered conducting a more practical study. Large or definite fractures can be easily diagnosed even by a beginner; minor fractures wherein the fracture line is obscured or overlapping are more difficult to detect in the ED as well as the outpatient department, as shown in Figures 1 and 2. Hence, the aim of our study is not only to identify definite fractures but also to avoid overlooking vague minor fractures in the radiographs. To this end, we tried to include the patient data of minor fractures as well as definite fractures where possible. For accurate labeling, we reviewed both X-rays and computed tomography (CT) scans of all the fractures. Additionally, a machine learning expert (Mo, Y.-C.) was consulted to achieve higher performance and accuracy of the proposed model. Our study is the first of its kind in that we labeled the data almost perfectly by means of images of X-rays and CT scans.

Dataset Preparation
The Ethics Committee of Hallym University approved the use of data for the purpose of this study, and the Institutional Review Board waived the requirement for written informed consent (2020-04-032-001). We reviewed patients over 18 years of age diagnosed with an ankle sprain or fracture and selected those who had undergone both X-rays and CT scans of the lower extremity. Exams were reviewed by three senior medical specialists: an orthopedist specializing in the foot and ankle (J.Lee), a radiologist with expertise in the musculoskeletal system, and an emergency physician (J.Kim). We then manually labeled them, with the following results: 1040 patients were diagnosed with fractures, i.e., "abnormal", and 186 patients were without fractures, i.e., "normal". Subsequently, their anteroposterior (AP) and lateral ankle X-rays were extracted. Figure 3 shows the data collection and preparation process. The exclusion criteria included open fractures and patients with a history of surgery, because those cases need thorough examination. No additional patient information, such as age, sex, and medical history, was retained.
Figure 3. Dataset preparation. Both AP and lateral X-rays and images of CT scans were reviewed simultaneously by three specialists and labeled normal or abnormal. The X-rays used were extracted to make up the dataset.
We prepared two datasets for experiments. We conducted a total of five experiments each for the AP and lateral datasets. Each experiment was conducted with the same normal data and a different one of five sets of abnormal data. (https://github.com/pepperfield/Detecting-Ankle-Fractures-in-Plain-Radiographs-Using-Deep-2-Learning-with-Accurately-Labeled-Dataset (accessed on 3 September 2021).)
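The under-sampling scheme described above (one shared normal set, five different abnormal sets, and a 3/1/1 split) could be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of disjoint round-robin subsets, the random seed, and the per-class splitting are all assumptions, since the text does not state how the five abnormal sets were drawn.

```python
import random


def make_experiment_sets(normal_ids, abnormal_ids, n_sets=5, seed=0):
    """Pair the full normal set with each of n_sets disjoint abnormal subsets.

    Assumption: the abnormal cases are shuffled once and dealt round-robin,
    so the subsets are disjoint and differ in size by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(abnormal_ids)
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_sets] for i in range(n_sets)]
    return [(list(normal_ids), subset) for subset in subsets]


def split_3_1_1(ids, seed=0):
    """Shuffle ids and split them into train/validation/test in a 3/1/1 ratio."""
    rng = random.Random(seed)
    items = list(ids)
    rng.shuffle(items)
    n_val = len(items) // 5
    n_test = len(items) // 5
    n_train = len(items) - n_val - n_test  # remainder goes to training
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical patient identifiers: 186 normal, 1040 abnormal.
experiments = make_experiment_sets(range(186), range(1040))
for normal, abnormal in experiments:
    print(len(normal), len(abnormal))
```

With 1040 abnormal cases, each of the five experiments pairs 208 abnormal cases with the 186 normal cases, which brings the two classes close to balance within each experiment.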

Dataset Augmentation
Dataset augmentation helps to enhance the accuracy of a classification task by introducing data diversity to the training dataset without adding further images [12]. As depicted in Figure 5, the following transformations were applied to images at random in each epoch: (1) rotation from −10° to +10°; (2) height/width shift of ±10%; (3) brightness variation of ±10%; (4) zoom in/out by ±10%; (5) horizontal flip with a 50% probability.

Model Training
Inception V3 is a popular CNN-based classifier for 2D images that has shown a high success rate in classifying medical images [13,14]. Here, we trained the Inception V3 model to classify images as "normal" or "abnormal"; we then drew the receiver operating characteristic (ROC) curve and measured the area under the curve (ROC-AUC). Figure 6 shows the overall process.
Figure 6. Model training.
Inception V3 is a convolutional neural network that is 48 layers deep. It is pre-trained on more than a million images from the ImageNet database, and the pre-trained network can classify images into 1000 object categories.
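As a rough illustration of the random transformations listed under Dataset Augmentation, the sketch below draws one set of augmentation parameters per image. It only samples the parameters and does not transform pixels; the function name, dictionary keys, uniform sampling, and the reading of "±10%" as a multiplicative factor between 0.90 and 1.10 are assumptions rather than details given in the text.

```python
import random


def sample_augmentation(rng):
    """Draw one random set of augmentation parameters for a single image."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),   # (1) rotation in degrees
        "width_shift": rng.uniform(-0.10, 0.10),    # (2) fraction of image width
        "height_shift": rng.uniform(-0.10, 0.10),   # (2) fraction of image height
        "brightness": rng.uniform(0.90, 1.10),      # (3) multiplicative factor (assumed)
        "zoom": rng.uniform(0.90, 1.10),            # (4) multiplicative factor (assumed)
        "horizontal_flip": rng.random() < 0.5,      # (5) flip with 50% probability
    }


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
params = [sample_augmentation(rng) for _ in range(1000)]
print(params[0])
```

Because new parameters are drawn every time an image is presented, the network rarely sees exactly the same pixels twice across epochs, even though no new images are added. In TensorFlow's Keras API, `ImageDataGenerator` exposes arguments with similar meanings (`rotation_range`, `width_shift_range`, `height_shift_range`, `brightness_range`, `zoom_range`, `horizontal_flip`); whether the authors used it is not stated.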
The hyper-parameter settings are as follows. The optimizer is Adam and the learning rate is set to 3 × 10⁻⁵. We set the batch size to 8 and the maximum number of epochs to 200. For testing, we adopted the model that achieved the highest validation accuracy during training. Experiments were conducted on a Windows PC with an Intel Core i7 processor at 3.2 GHz, 32 GB RAM, an NVIDIA GeForce RTX 2080 Ti, and TensorFlow 2.1.0.
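The model-selection rule above (keep the weights from the epoch with the highest validation accuracy) can be sketched as below. The function name and the accuracy values are hypothetical, and breaking ties in favor of the earliest epoch is an assumption; the text does not say how ties were handled.

```python
def best_epoch(val_accuracy_per_epoch):
    """Return (epoch, accuracy) for the highest validation accuracy.

    Epochs are 1-indexed. Ties are resolved in favor of the earliest epoch,
    because max() returns the first maximal element it encounters.
    """
    if not val_accuracy_per_epoch:
        raise ValueError("no epochs recorded")
    best_index = max(
        range(len(val_accuracy_per_epoch)),
        key=lambda i: val_accuracy_per_epoch[i],
    )
    return best_index + 1, val_accuracy_per_epoch[best_index]


# Hypothetical validation accuracies for the first four epochs.
history = [0.70, 0.82, 0.79, 0.82]
print(best_epoch(history))  # → (2, 0.82)
```

In Keras, a comparable effect is usually obtained with the `ModelCheckpoint` callback configured with `monitor="val_accuracy"` and `save_best_only=True`; this is offered as a common practice, not as a description of the authors' setup.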

Model Estimation Index
A confusion matrix, as shown in Table 2, is used not only to estimate but also to explain the model performance in the case of imbalanced classes [15]. It is to be noted that metrics like precision, recall, and F1 score are as important as accuracy and AUC in validating the performance of the model. Precision is also known as the positive predictive value, recall is the same as sensitivity, and the F1 score, ranging from 0 to 1, is the harmonic mean of the two. A higher F1 score indicates better overall performance. Here, we applied all the aforementioned metrics to the results of the experiments. The accuracy, precision, recall, and F1 score are given by the following equations:
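In terms of the confusion-matrix counts, these metrics take their standard forms. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, false negatives) are the conventional ones and are assumed here, with "abnormal" (fracture) treated as the positive class.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```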

Results
In the AP trials, AP2 and AP4 achieved the best accuracy of 86%, whereas AP4 achieved the highest AUC of 0.92. In the lateral trials, Lat3 achieved the highest accuracy of 90% and Lat1 achieved the highest AUC of 0.95. Because of the class imbalance of our data, AUC is a more meaningful index than accuracy. Hence, AP4 and Lat1 are the best trained models. In addition, the overall performance of the AP trial was better than that of the lateral trial. The results of the experiments are summarized in Tables 3-6 and Figure 7.
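To illustrate why AUC can be more informative than accuracy when classes are imbalanced, the toy sketch below scores a hypothetical classifier that assigns the same high score to every image. The labels and scores are invented for illustration and are not the study's data; AUC is computed through its pairwise-ranking interpretation (the probability that a random positive is scored above a random negative, with ties counted as one half).

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct predictions when scores >= threshold mean 'abnormal'."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test set: 8 abnormal (1) and 2 normal (0) images,
# scored by a classifier that outputs 0.9 for everything.
labels = [1] * 8 + [0] * 2
scores = [0.9] * 10

print(accuracy(labels, scores))  # → 0.8
print(roc_auc(labels, scores))   # → 0.5
```

The constant classifier reaches 80% accuracy simply by following the majority class, while its AUC of 0.5 shows that it cannot separate the two classes at all.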

Discussion
Clinicians, especially in the ED, tend to overlook minor fractures of the ankle because of the busy environment of the emergency room, particularly before they have gained sufficient experience. Thus, we tried to determine whether machine learning algorithms could help diagnose these fractures. The primary intent of our study was to analyze the feasibility of using AI techniques for identifying ankle fractures. This was assessed using the following criteria: (1) the model should be able to correctly detect minor fractures as well as major fractures; (2) the performance of the model should either equal or surpass human diagnosis. We concluded that the study, in its present capacity, fell short on both aspects. Other studies on pelvic, wrist, and other fractures showed over 90% classification accuracy and an AUC of 0.95 [16][17][18]. Compared with the previous two studies on ankle fractures using deep learning, our study showed improved accuracy (i.e., 83% > 76%) [10,11]. Therefore, we conducted further analysis on the results obtained from our study.
First, this study is the first to use both X-rays and images of CT scans of ankle fractures for the labeling process; previous studies have utilized only X-rays. Moreover, the images were reviewed by three medical professionals, and hence we can confirm that our datasets are reliable. However, they are also quite imbalanced in that there were far fewer normal images than abnormal ones, as patients who underwent CT scans were primarily those suspected of having ankle fractures. Even though we used an under-sampling technique, this is a major limitation of our study. Second, apart from class imbalance, we encountered one other major issue during the experiments that was associated with artifacts, especially splints. Among the X-rays that had a splint, the majority featured fractures, but the model nonetheless classified all images with a splint as fractures (Figure 8). Because no study has been published in this regard, this issue needs to be studied further, as splints are used to treat a wide range of injuries. It is still unclear how to train a model to classify X-rays with a splint.
Figure 8. Non-fracture X-ray with splint. Even though it is normal, in the test session, the deep learning model assigned the X-ray with the splint to the fractured group. The colored area is the region the model perceived as the fractured part.
Third, minor fractures were occasionally accompanied by large fractures; with such multiple fractures, the model may learn to recognize only the large fracture as "abnormal" while treating the minor fracture as "normal". To resolve this problem, we plan to apply an object detection algorithm in our future study, as shown in Figure 9. An object detection model can not only classify images but also localize the lesion with a bounding box (red line in Figure 9).
Lastly, owing to insufficient normal images in the datasets, the hyper-parameters could not be efficiently fine-tuned despite employing under-sampling and data augmentation techniques.
Given the various challenges we encountered in the machine learning tasks, it is difficult to determine precisely why the AP trial demonstrated higher accuracy than the lateral trial, considering that lateral images have more overlapping parts. This question may be resolved by applying "Explainable AI", a promising field that exposes decision-making processes using various algorithms [19,20]. Using this, we aim to train models to make smarter decisions and more accurate predictions in our next study.
In summary, our ultimate objective is to develop a robust AI framework that can help to detect fractures and convey supporting information with an accuracy of over 90%. We intend for the application to be scaled to accommodate CT, MRI, and pediatric patients, both to reduce medical accidents and to assist medical professionals in their evaluations and treatment plans.

Conclusions
Even with the rapid developments in AI, the application of AI in the medical domain has a long way to go. We aimed to develop a deep learning model that can detect ankle fractures of various sizes on X-ray. Even though we did not achieve high performance, we showed better results compared to previous studies. We reviewed the limitations of the study and proposed theoretical solutions that constitute our next work. In the future, we will investigate the feasibility of explainable AI.