1. Introduction
Orthopedic radiography is one of the most common imaging methods used to diagnose fractures. However, fractures, particularly those in the foot and ankle, are easily overlooked or misdiagnosed when radiographs are interpreted, especially in the emergency department (ED) [1].
Ankle injuries are a common cause of outpatient visits; hence, accurate diagnosis is important for further evaluation and treatment. The ankle consists of 3 bones (the tibia, fibula, and talus), 2 joints (ankle and syndesmosis), and 3 sets of ligaments (medial, lateral, and syndesmotic). Owing to this complex structure, ankle fractures are often difficult to identify, and the rate of misdiagnosis in the ED reaches nearly 4.2% [2,3].
Artificial intelligence (AI) can potentially provide a solution to this challenge; several studies on fracture detection using deep learning are presently under way [4]. Deep learning is a subdomain of AI in which multilayer neural networks, loosely modeled on the human brain, are trained on data. The convolutional neural network (CNN) is a widely used deep learning algorithm for data processing, especially for 2D images [5].
Previous studies have successfully applied CNNs to detect fractures on radiographs [6,7,8]. Yu et al. [9] designed a CNN algorithm using pelvic radiographs that could detect femoral neck fractures with 97% accuracy; however, its accuracy decreased for other types of fractures, rendering it inadequate for the purpose of our study.
To the best of our knowledge, there are two studies on ankle fractures using deep learning. Santos et al. [10] used data from structured reports of X-ray images of ankle fractures. The dataset included 157 patients, of whom 129 had fractures and 28 did not, and the model exhibited an accuracy of 77% with an area under the curve (AUC) of 0.85. Kitamura et al. [11] performed a more extensive study on a larger dataset of 596 images (ankles with and without fractures in equal proportion) with five different CNN architectures; the accuracy peaked at 81%.
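The two metrics quoted above can be illustrated with a minimal sketch. It assumes binary labels (1 = fracture, 0 = no fracture) and a model that outputs a score per image; the function names and the toy data are hypothetical and are not taken from the cited studies. AUC is computed here as the fraction of fracture/non-fracture pairs that the scores rank correctly, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking.

    Counts how often a fracture image receives a higher score than a
    non-fracture image; ties count as half. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four images with true labels and model scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two values can differ for the same model.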
Both studies are proof-of-concept research; hence, we aimed to conduct a more practical study. Large or definite fractures can be easily diagnosed even by a beginner, whereas minor fractures in which the fracture line is obscured or overlapping are more difficult to detect in the ED as well as the outpatient department, as shown in Figure 1 and Figure 2. Hence, the aim of our study is not only to identify definite fractures but also to avoid overlooking vague minor fractures on radiographs. To this end, we included patient data on minor fractures as well as definite fractures wherever possible. For accurate labeling, we reviewed both X-rays and computed tomography (CT) scans of all the fractures. Additionally, a machine learning expert (Mo, Y.-C.) was consulted to achieve higher performance and accuracy of the proposed model. Our study is the first of its kind in that the data were labeled with near-complete accuracy by reviewing both X-ray and CT images.
4. Discussion
Clinicians, especially in the ED, tend to overlook minor ankle fractures because of the busy environment, at least until they have gained sufficient experience. Thus, we tried to determine whether machine learning algorithms could help diagnose these fractures. The primary intent of our study was to analyze the feasibility of using AI techniques for identifying ankle fractures. Feasibility was assessed against the following criteria: (1) the model should correctly detect minor fractures as well as major fractures; (2) the performance of the model should equal or surpass human diagnosis. We concluded that the study, in its present form, fell short on both criteria. Other studies on pelvic, wrist, and other fractures showed over 90% classification accuracy and an AUC of 0.95 [16,17,18]. Compared with the previous two studies on ankle fractures using deep learning, however, our study showed improved accuracy (i.e., 83% > 76%) [10,11]. Therefore, we conducted further analysis on the results obtained from our study.
First, this study is the first to use both X-rays and CT images of ankle fractures in the labeling process; previous studies have utilized only X-rays. Moreover, the images were reviewed by three medical professionals; hence, we consider our datasets reliable. However, the datasets are also quite imbalanced, with relatively few normal images compared with abnormal ones, because the patients who underwent CT scans were primarily those being evaluated for ankle fractures. Although we used an under-sampling technique, this imbalance remains a major limitation of our study.
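As an illustration of under-sampling, the following is a minimal sketch of random under-sampling, in which images from the larger class are randomly discarded until both classes are the same size. The exact procedure used in this study is not described in this section, so the function name, the fixed seed, and the file names below are hypothetical.

```python
import random
from collections import defaultdict


def undersample(samples, labels, seed=0):
    """Randomly drop samples from larger classes so that every class
    ends up with as many samples as the smallest class."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Size of the smallest class sets the target size for all classes.
    n_min = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):
            balanced.append((sample, label))

    rng.shuffle(balanced)  # avoid class-ordered batches during training
    return balanced


# Hypothetical example: 7 fracture images and 3 normal images.
images = [f"xray_{i}.png" for i in range(10)]
labels = ["fracture"] * 7 + ["normal"] * 3
balanced = undersample(images, labels)
print(len(balanced))
```

The trade-off is that balancing is achieved by discarding data from the majority class, which reduces the total number of training images.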
Second, apart from class imbalance, we encountered one other major issue during the experiments, which was associated with artifacts, especially splints. Among the X-rays that showed a splint, the majority featured fractures, but the model classified all images with a splint as fractures (Figure 8). Because no study has been published in this regard, and because splints are used to treat a wide range of injuries, this issue needs to be studied further. It is still unclear how to train a model to classify X-rays with splints.
Third, minor fractures were occasionally accompanied by large fractures; with such multiple fractures, the model may recognize only the large fracture as “abnormal” while categorizing the minor fracture as “normal”. To resolve this problem, we plan to apply an object detection algorithm in our future study, as shown in Figure 9. An object detection model can not only classify an image but also localize the lesion within it using a bounding box (red line in Figure 9).
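To make the idea of bounding-box localization concrete, the sketch below computes the intersection over union (IoU) between a predicted box and a reference box. IoU is a common way to score localization; the source does not state which measure the planned object detection study will use, so this choice, the box format (x_min, y_min, x_max, y_max) in pixels, and the coordinates are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a value between 0.0 (no overlap) and 1.0 (identical boxes).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Hypothetical boxes around a minor fracture: labeled box vs. model prediction.
labeled = (120, 200, 180, 260)
predicted = (130, 210, 190, 270)
print(iou(labeled, predicted))
```

Because each fracture receives its own box, a minor fracture next to a large one can be scored separately instead of being absorbed into a single image-level label.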
Lastly, owing to the insufficient number of normal images in the datasets, the hyperparameters could not be efficiently fine-tuned despite the use of under-sampling and data augmentation techniques.
Given the various challenges we encountered in the machine learning tasks, it is difficult to determine precisely why the AP trial demonstrated higher accuracy than the lateral trial, considering that lateral images have more overlapping parts. This question may be resolved by applying “Explainable AI”, a promising field that reveals a model's decision-making process using various algorithms [19,20]. Using this approach, we aim to train models that make smarter decisions and more accurate predictions in our next study.
In summary, our ultimate objective is to develop a robust AI framework that can help detect fractures and convey supporting information with an accuracy of over 90%. We intend for the application to be extended to CT and MRI images and to pediatric patients, so that it not only reduces medical accidents but also assists medical professionals in their evaluations and treatment plans.