Classification and Visualisation of Normal and Abnormal Radiographs: A Comparison between Eleven Convolutional Neural Network Architectures

This paper investigates the classification of radiographic images with eleven convolutional neural network (CNN) architectures (GoogleNet, VGG-19, AlexNet, SqueezeNet, ResNet-18, Inception-v3, ResNet-50, VGG-16, ResNet-101, DenseNet-201 and Inception-ResNet-v2). The CNNs were used to classify a series of wrist radiographs from the Stanford Musculoskeletal Radiographs (MURA) dataset into two classes, normal and abnormal. The architectures were compared across different hyper-parameters against accuracy and Cohen's kappa coefficient. The two best-performing architectures were then explored with data augmentation. Without augmentation, the best results were provided by Inception-ResNet-v2 (mean accuracy = 0.723, mean kappa = 0.506); with augmentation, its results improved significantly (mean accuracy = 0.857, mean kappa = 0.703). Finally, class activation mapping was applied to interpret the activation of the network against the location of an anomaly in the radiographs.


Introduction
Fractures of the wrist and forearm are common injuries, especially among older and frail persons who may slip and extend their arm to protect themselves [1]. In some cases, the person involved may think that they have not injured themselves seriously, and the fractures are ignored and left untreated [2]. These fractures can provoke impairment of wrist movement [3]. In more serious cases, fractures can lead to complications such as ruptured tendons or long-lasting stiffness of the fingers [4] and can impact quality of life [5].
Treatment of fractures through immobilisation and casting is an old, tried-and-tested technique. There are Egyptian records describing the re-positioning of bones, fixing with wood and covering with linen [6], and there are also records of fracture treatment in Iron Age and Roman Britain, where "skilled practitioners" treated fractures and even "minimised the patient's risk of impairment" [7]. The process of immobilisation is now routinely performed in the Accident and Emergency (A&E) departments of hospitals under local anaesthesia and is known as Manipulation under Anaesthesia (MUA) [8], or closed reduction and casting. MUA interventions in many cases represent a significant proportion of the Emergency Department workload. In many hospitals, patients are initially treated with a temporary plaster cast, then return afterwards for the manipulation as a planned procedure. MUA, although simple, is not entirely free of risks. Some of the problems include bruising, tears of the skin, complications related to the local anaesthetic, and discomfort for the patients. It should also be noted that a large proportion of MUA procedures fail: it has been reported that 41% of Colles' fractures treated with MUA required alternative treatment [9]. The alternative to MUA is open surgery, also known as Open Reduction and Internal Fixation (ORIF) [10], which can be performed under local or general anaesthesia [11,12] to manipulate the fractured bones and fixate them with metallic pins, plates or screws. The surgical procedure is more complicated and expensive than MUA. In some cases, it can also lead to serious complications, especially with metallic elements that can interfere with the tendons and cut through subchondral bones [13,14]. ORIF is, however, more reliable as a long-term treatment.
Despite considerable research in the area [8,10,13,15-18], there is no certainty as to which procedure to follow for wrist fractures [19-21]. The main tool to examine wrist fractures is diagnostic imaging, e.g., X-ray or Computed Tomography (CT). The images produced are observed by highly skilled radiologists and radiographers in search of anomalies, and based on experience, they then determine the most appropriate procedure for each case. The volume of diagnostic images has increased significantly [22], and the work overload is further exacerbated by a shortage of qualified radiologists and radiographers, as highlighted by The Royal College of Radiologists [23]. Thus, the possibility of providing computational tools to assess radiographs of wrist fractures is attractive. Traditional analysis of wrist fractures has focused on geometric measurements that are extracted either manually [24-27] or through what is now considered traditional image processing [28]. The geometric measurements that have been of interest are, amongst others: radial shortening [29], radial length [25], volar and dorsal displacements [30], palmar tilt and radial inclination [31], ulnar variance [24], articular step-off [26], and metaphyseal collapse ratio [27]. Non-geometric measurements such as bone density [32,33], as well as other osteoporosis-related measurements, e.g., cortical thickness, internal diameter, and cortical area [34], have also been considered to evaluate bone fragility.
However, in recent years, computational methods have been revolutionised by machine learning and artificial intelligence (AI), especially deep learning architectures [35]. Deep learning is a branch of machine learning in which input data are provided to a model so that it discovers, or learns, the representations required to perform a classification [36]. These models have a large number of levels, far more than the input/hidden/output layers of early configurations, and thus are considered deep. At each level, nonlinear modules transform the representation of the data from the input into a more abstract representation [37].
Deep learning has had a significant impact in many areas of image processing and computer vision. For instance, it has provided outstanding results in difficult tasks such as the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [38], and it has been reported that deep learning architectures have in some cases outperformed expert dermatologists in the classification of skin cancer [39]. Deep learning has been widely applied for segmentation and classification [40-48]. A deep learning system has been compared against radiologists' interpretation for the detection and localisation of distal radius fractures [49]. Diagnostic improvements have also been studied [50], where deep learning supports the medical specialist towards a better outcome in patient care. Automated fracture detection and localisation in wrist radiographs have likewise been shown to be feasible [51].
Notwithstanding their merits, deep learning architectures have several well-known limitations: significant computational power is required, together with large amounts of training data. There is a large number of architectures, and each of them requires a large number of parameters to be fine-tuned. Many publications use one or two of these architectures and compare against a baseline, such as human observers or a traditional image processing methodology. However, a novice user may struggle to select one particular architecture, which in turn may not necessarily be the most adequate for a certain purpose. In addition, one recurrent criticism is their black-box nature [52-55], which implies that it is not always easy to understand why the networks perform in the way they do. One method to address this opacity is through explainable techniques, such as activation maps [56,57], as a tool to explain visually the localisation of class-specific image regions.
In this work, the classification of radiographs into two classes, normal and abnormal, with eleven convolutional neural network (CNN) architectures was investigated. The architectures compared were GoogleNet, VGG-19, AlexNet, SqueezeNet, ResNet-18, Inception-v3, ResNet-50, VGG-16, ResNet-101, DenseNet-201 and Inception-ResNet-v2. This paper extends a preliminary version of this work [58]: here, data augmentation was applied to the two models that provided the best results, that is, ResNet-50 and Inception-ResNet-v2. Furthermore, class activation maps were generated and analysed.
The dataset used to compare the architectures was the Stanford MURA (MUsculoskeletal RAdiographs) dataset [59]. This database contains a large number of radiographs: 40,561 images from 14,863 studies, where each study was manually labelled by radiologists as either normal or abnormal. The radiographs cover seven anatomical regions, namely elbow, finger, forearm, hand, humerus, shoulder and wrist; this paper focuses mainly on the wrist images. The main contributions of this work are the following: (1) an objective comparison of the classification results of 11 architectures, which can help the selection of a particular architecture in future studies, (2) the comparison of the classification with and without data augmentation, which resulted in significantly better results, and (3) the use of class activation mapping to analyse the regions of interest of the radiographs.
The rest of the manuscript is organised as follows. Section 2 describes the materials, that is, the database of radiographs, and the methods, namely the deep learning models that were compared and the class activation mapping (CAM) used to visualise the activated regions. The performance metrics, accuracy and Cohen's kappa coefficient, are described at the end of that section. Section 3 presents the results of all the experiments and the effect of the different hyper-parameters; predicted abnormalities in the radiographic images are also visualised using class activation mapping. The manuscript finishes with a discussion of the results in Section 4.

Materials
The data used to compare the 11 CNNs were obtained from the public dataset MUsculoskeletal RAdiographs (MURA), from a competition organised by researchers from Stanford University [59]. The radiographs, acquired between 2001 and 2012, were manually labelled by board-certified radiologists. The studies (n = 14,656) are divided into training (n = 13,457) and validation (n = 1199) sets. Furthermore, the studies have been allocated to two groups: abnormal (i.e., those radiographs that contained fractured bones, or foreign bodies such as implants, wires or screws, etc.) (n = 5715) and normal (n = 8941). Representative normal cases are illustrated in Figure 1 and abnormal cases in Figure 2. The distribution per anatomical region is shown in Table 1. In this paper, the subset of wrist radiographs was selected; the numbers of normal and abnormal wrist radiographs are presented in Table 2. Notice that these were subdivided into four studies.

Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of deep learning model [35,36]. A typical CNN classification model is composed of two key components. First, features are extracted through a series of convolutional layers with pooling and activation functions; some modern architectures (e.g., ResNet) also include batch normalisation and/or skip connections to mitigate the problem of vanishing gradients during training. Next, these features are input into one or more fully-connected layers to derive the final classification prediction (e.g., an estimated class probability). These class predictions are used to compute the problem-specific loss.
The input of a CNN, i.e., an image to be classified, is transformed through the feature extraction layers to form a set of relevant features required by the network; these features can be regarded as global descriptors of the image. In the fully-connected layers for classification, the relations between the features are learned by an iterative process of weight adjustment. A prediction probability can be deduced at the final layer with the inclusion of an activation function (e.g., the softmax function). At the training stage, a loss (e.g., cross-entropy loss) is computed between the prediction and the ground truth for weight adjustment during backpropagation. At the evaluation stage, the predicted class is inferred as the most probable class using an argmax function, and this can be evaluated against the ground truth for classification accuracy.
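The final classification step described above (softmax, cross-entropy loss and argmax) can be sketched in a few lines of numpy. This is an illustrative sketch, not code from the paper; the logits and the ground-truth label are hypothetical values.

```python
import numpy as np

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(probs, true_class):
    # Negative log-likelihood of the ground-truth class.
    return -np.log(probs[true_class])

# Hypothetical logits from the final fully-connected layer for one radiograph,
# over the two classes (0 = normal, 1 = abnormal).
logits = np.array([0.8, 2.1])
probs = softmax(logits)
predicted = int(np.argmax(probs))          # evaluation: most probable class
loss = cross_entropy(probs, true_class=1)  # training: loss vs. ground truth
```

During training, the gradient of this loss with respect to the network weights drives the iterative weight adjustment; at evaluation, only the argmax is needed.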
A summary of the models listed in Table 3 is as follows. AlexNet [60] is one of the earlier adoptions of deep learning in image classification and won the ILSVRC 2012 competition by significantly outperforming the runner-up. It consists of 5 convolutional layers of various sizes and 3 fully-connected layers, and applies a ReLU activation for nonlinearity. GoogleNet (Inception-v1) [61] introduced the inception module, formed of small-size convolutions, to reduce the number of trainable parameters for better computational utilisation. Despite being a deeper and wider network than AlexNet, its number of trainable parameters was reduced from 60 million (AlexNet) to 4 million. VGG [62] was the runner-up in ILSVRC 2014, which was won by GoogleNet in the same year. It utilises only 3 × 3 convolutions in multiple layers and is deeper than AlexNet. It has a total of 138 million trainable parameters, and thus can be computationally intensive during training. ResNet [63] is formed by a deep network of repetitive residual blocks. These blocks are made up of multiple convolutional layers coupled with a skip connection to learn the residual with respect to the previous block. This allows the network to be very deep, with hundreds of layers. Inception-v3 [64] improves the configuration of the inception module in GoogleNet by replacing the 5 × 5 convolutional layer in one of the branches with two 3 × 3 layers, reducing the number of parameters. SqueezeNet [65] introduced the fire module, which consists of a layer with 1 × 1 convolutions (the squeeze layer) and a second layer with 3 × 3 convolutions (the expand layer); the squeeze layer reduces the number of channels fed into the expand layer. This led to a significant reduction in trainable parameters while maintaining an accuracy similar to AlexNet on the ILSVRC 2012 dataset. DenseNet [66] is composed of multiple dense blocks (small convolutional layers, batch normalisation and ReLU activation).
A transition layer with batch normalisation, 1 × 1 convolution and average pooling is added between the dense blocks. Each block is densely connected with all previous blocks through skip connections. DenseNet has demonstrated full utilisation of the residual mechanism while maintaining model compactness to achieve competitive accuracy. Inception-ResNet-v2 [67] incorporates the advantages of the Inception modules into the residual blocks of a ResNet and achieves even more accurate classification on the ILSVRC 2012 dataset than either ResNet-152 or Inception-v3.
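The residual mechanism shared by ResNet, DenseNet and Inception-ResNet-v2 can be illustrated with a minimal numpy sketch. Fully-connected weight matrices stand in for the convolutional layers, batch normalisation is omitted, and all values are hypothetical; the point is only the skip connection, where the block learns the residual F(x) rather than the full mapping.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two weight layers with a ReLU in between (batch normalisation
    # omitted for brevity). The skip connection adds the input back, so the
    # block only has to learn the residual F(x) = H(x) - x.
    out = relu(x @ w1)
    out = out @ w2
    return relu(out + x)  # skip connection

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))          # hypothetical feature vector
w1 = rng.standard_normal((8, 8)) * 0.01  # hypothetical small weights
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
```

Because the identity path bypasses the weights, gradients can flow directly through the sum, which is what allows these networks to be trained at depths of hundreds of layers.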

Experiments
In this work, we considered the following eleven CNN architectures to classify wrist radiographs into two categories (normal/abnormal): GoogleNet, VGG-19, AlexNet, SqueezeNet, ResNet-18, Inception-v3, ResNet-50, VGG-16, ResNet-101, DenseNet-201 and Inception-ResNet-v2. The details of these are presented in Table 3. The architectures were trained with different numbers of epochs (10, 20, 30) and different mini-batch sizes (16, 32, 64). The experiment pipeline is illustrated in Figure 3. All the architectures were compared under the same conditions, initially without pre- or post-processing, except for resizing of the input images to the input size of each architecture, as the X-ray images presented different sizes. For instance, the images were resized to 224 × 224 for ResNet-50 and 299 × 299 for Inception-ResNet-v2. In the cases where the expected input was a 3-channel image, i.e., an RGB colour image, and the input image was grayscale, the channel was replicated. The dataset was split into 90% for training and 10% for testing. The same hyper-parameters were applied, as described in Tables 4 and 5. Then, for the two architectures that provided the highest accuracy and Cohen's kappa coefficient (ResNet-50 and Inception-ResNet-v2), several modifications were applied regarding, specifically, the use of data augmentation and the CNN training options. The classification with and without augmentation was done to assess the impact that augmentation can have on the results. In addition, visualisation of the network activations with class activation mapping was explored.
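The grayscale-to-RGB channel replication and the 90/10 split described above might be sketched as follows. This is an illustrative numpy sketch, not the paper's Matlab implementation; the image size, number of images and random seed are arbitrary.

```python
import numpy as np

def to_three_channels(gray):
    # Replicate a single-channel radiograph across three channels so it
    # matches the RGB input expected by ImageNet-style architectures.
    return np.repeat(gray[..., np.newaxis], 3, axis=-1)

def train_test_split_indices(n_images, test_fraction=0.10, seed=0):
    # Random 90/10 split, as used in the paper; the seed is arbitrary.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_images)
    n_test = int(round(n_images * test_fraction))
    return order[n_test:], order[:n_test]  # train indices, test indices

gray = np.zeros((224, 224), dtype=np.uint8)   # e.g. input size for ResNet-50
rgb = to_three_channels(gray)
train_idx, test_idx = train_test_split_indices(1000)
```

Resizing each radiograph to the architecture's input size would be done with an image library before the channel replication step.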

Further Processing with Data Augmentation
For the two best-performing architectures, the effect of data augmentation was also evaluated. The following augmentations were applied to each of the training images: (1) rotations of −5° to 5°, (2) vertical and horizontal reflections, (3) shear deformations of −0.05° to 0.05° in the horizontal and vertical directions, and (4) contrast-limited adaptive histogram equalisation (CLAHE) [68]. Translations were not applied, as the training images were already captured with a good range of translational shift.
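A sketch of how such augmentation parameters could be sampled and the reflections applied is shown below. This is illustrative only: the 0.5 probability of applying each reflection is an assumption, and CLAHE is omitted as it requires an image processing library (e.g., skimage.exposure.equalize_adapthist).

```python
import numpy as np

def sample_augmentation(rng):
    # Random parameters drawn from the ranges used in the paper; the
    # reflection probability of 0.5 is an assumption for illustration.
    return {
        "rotation_deg": rng.uniform(-5.0, 5.0),
        "shear_deg_x": rng.uniform(-0.05, 0.05),
        "shear_deg_y": rng.uniform(-0.05, 0.05),
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
    }

def apply_flips(image, params):
    # Apply the sampled reflections; rotation and shear would be applied
    # with an image transformation library and are not implemented here.
    if params["flip_horizontal"]:
        image = np.flip(image, axis=1)
    if params["flip_vertical"]:
        image = np.flip(image, axis=0)
    return image
```

Sampling fresh parameters for each training image at each epoch effectively enlarges the training set without storing additional radiographs.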

Class Activation Mapping
Class activation mapping (CAM) [56] provides a visualisation of the most significant activations for a targeted class; it indicates what exactly the network is focusing its attention on. As shown in the schematic in Figure 4, the class activation map is generated at the output of the last convolutional layer. In this work, it is represented with a rainbow/jet colour map whose intensity spectrum ranges from blue (lowest activation), through green, to red (highest activation).
For the two best-performing models, the CAM representations were generated at the layers "activation_49_relu" for ResNet-50 and "conv_7b_ac" for Inception-ResNet-v2, respectively. The CAMs were up-scaled to the input resolution and overlaid on top of the original radiograph to locate the abnormalities.

Figure 4. Schematic illustration of the X-ray classification process and class activation mapping through layer-wise activation maps across different dense blocks. At each level, a series of feature maps is generated, and the resolution decreases progressively through the blocks. Colours indicate the range of activation: blue corresponds to low activation, red to highly activated features. The final output, visualised using class activation mapping, highlights the area(s) where abnormalities can be located.
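The CAM computation can be sketched as a weighted sum of the last convolutional feature maps, using the final fully-connected weights of the chosen class, followed by up-scaling. This is an illustrative numpy sketch under the original CAM formulation [56], not the paper's Matlab pipeline; the feature-map and weight shapes are hypothetical, and nearest-neighbour up-scaling stands in for bilinear interpolation.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx, upscale=32):
    # feature_maps: (H, W, C) output of the last convolutional layer.
    # fc_weights:   (C, n_classes) weights of the final fully-connected layer.
    # The CAM is the sum of the feature maps weighted by the weights leading
    # to the chosen class, rescaled to [0, 1] for display.
    cam = feature_maps @ fc_weights[:, class_idx]          # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    # Nearest-neighbour up-scaling back towards the input resolution
    # (a real implementation would use bilinear interpolation).
    return np.kron(cam, np.ones((upscale, upscale)))

rng = np.random.default_rng(0)
fmaps = rng.standard_normal((7, 7, 512))   # hypothetical 7 x 7 x 512 maps
weights = rng.standard_normal((512, 2))    # hypothetical FC weights, 2 classes
cam = class_activation_map(fmaps, weights, class_idx=1)
```

The normalised map is then rendered with a jet colour map and alpha-blended over the radiograph.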

Performance Metrics
Accuracy (Ac) was calculated as the proportion of correct predictions among the total number of cases examined, that is, Ac = (TP + TN) / (TP + TN + FP + FN), where TP and TN correspond to positive and negative classes correctly predicted, and FP and FN correspond to false positive and false negative predictions. Cohen's kappa (κ) was also calculated, as it is the metric used to rank the MURA challenge [59,69] and it is considered more robust, since it takes into account the possibility of random agreements. Cohen's kappa was calculated in the following way. With N being the total number of events, the probability of a yes, or TP, is P_Y = ((TP + FP)/N) × ((TP + FN)/N), the probability of a no, or TN, is P_N = ((FN + TN)/N) × ((FP + TN)/N), and the probability of random agreement is P_R = P_Y + P_N; then κ = (Ac − P_R) / (1 − P_R).
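Both metrics can be computed directly from the confusion-matrix counts, as in this minimal sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    # Proportion of correct predictions among all cases examined.
    return (tp + tn) / (tp + tn + fp + fn)

def cohens_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    p_o = accuracy(tp, tn, fp, fn)             # observed agreement
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)  # chance agreement on "yes"
    p_no = ((fn + tn) / n) * ((fp + tn) / n)   # chance agreement on "no"
    p_r = p_yes + p_no                         # total random agreement
    return (p_o - p_r) / (1 - p_r)

# Hypothetical confusion counts for a two-class (abnormal/normal) problem.
ac = accuracy(40, 30, 10, 20)       # 0.7
kappa = cohens_kappa(40, 30, 10, 20)
```

With these hypothetical counts, chance agreement is P_R = 0.5, so κ = (0.7 − 0.5)/(1 − 0.5) = 0.4, noticeably lower than the raw accuracy, which is exactly the correction for random agreement that motivates the metric.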

Implementation Details
Experiments were conducted in the Matlab R2018b IDE with the Deep Learning Toolbox, Image Processing Toolbox and Parallel Computing Toolbox, on a workstation with an Intel Xeon® W-2123 CPU at 3.60 GHz, 16 GB of 2666 MHz DDR4 RAM, a 500 GB SATA 2.5-inch solid-state drive, and an NVIDIA Quadro P620 3 GB graphics card.

Results
The effects of the number of epochs, mini-batch size and data augmentation were evaluated on the classification of wrist radiographs with the eleven CNN architectures. Tables 6 and 7 present the aggregated best results for each architecture in terms of prediction accuracy and Cohen's kappa score, respectively.

Table 6. Accuracy results for the eleven convolutional neural networks used to classify the wrist images in the MURA dataset. The best results for each row are highlighted in italics and the overall best results are highlighted in bold.

For the 11 architectures with no data augmentation, Inception-ResNet-v2 performs best (Ac = 0.723, κ = 0.506). DenseNet-201 fares slightly lower (Ac = 0.717, κ = 0.497). The lowest results were obtained with GoogleNet (Ac = 0.654, κ = 0.381). This potentially indicates better feature extraction with deeper network architectures. Figures 5 and 6 illustrate some classification cases for lateral and postero-anterior views of wrist radiographs.

The comparison between ADAM, SGDM and RMSprop shows no indicative superiority, implying that each of these optimisers was capable of reaching a comparable solution. Increasing the number of epochs beyond 30 yields no improvement in accuracy, indicating that the models have converged. The choice among the attempted mini-batch sizes shows no difference in the results. With data augmentation, the results show significant improvement: accuracy increases by approximately 19% (up by 0.134) and Cohen's kappa by 39% (up by 0.197) for the Inception-ResNet-v2 architecture.

Table 7. Cohen's kappa results for the eleven convolutional neural networks used to classify the wrist images in the MURA dataset. The best results for each row are highlighted in italics and the overall best results are highlighted in bold.

Class activation maps were obtained and overlaid on top of the representative images in Figures 1 and 2. The CAMs obtained for ResNet-50 are shown in Figures 7 and 8, while those for Inception-ResNet-v2 are shown in Figures 9 and 10. In all cases, the CAMs were capable of indicating the region of attention of the two architectures. This is especially valuable for identifying where the abnormalities are in Figures 8 and 10. While both models indicate similar regions of attention, Inception-ResNet-v2 appears to have smaller (i.e., more focused) attention regions than ResNet-50. This may indicate better feature extraction in Inception-ResNet-v2, leading to its better prediction results. Finally, the activation maps corresponding to Figures 5 and 6 are presented in Figure 11.

Discussion
In this paper, eleven CNN architectures for the classification of wrist X-rays were compared, and various hyper-parameters were explored in the experiments. It was observed that Inception-ResNet-v2 provided the best results (Ac = 0.747, κ = 0.548), which were compared with the leaderboard of the MURA challenge, which reports 70 entries. The top three places of the leaderboard were κ = 0.843, 0.834 and 0.833, the lowest score was κ = 0.518, and the best performance for a radiologist was κ = 0.778. Thus, without data augmentation, the results of all the networks were close to the bottom of the table. Data augmentation significantly improved the results, reaching the 25th place of the leaderboard with Ac = 0.869 and κ = 0.728. This result is above the average of the table, confirming the positive effect of data augmentation and approaching human-level performance.
The CAM provides a channel to interpret how a CNN architecture has been trained for feature extraction, and the visualisation of the CAMs on the representative images was interesting in several aspects. First, the activated regions in ResNet-50 appeared more broad-brushed than those of Inception-ResNet-v2; this applied both to the cases without abnormalities (Figures 7 and 9) and those with abnormalities (Figures 8 and 10). Second, the localisation of regions of attention by Inception-ResNet-v2 also appeared more precise than that of ResNet-50. This can be appreciated in several cases, for instance the forearm that contains a metallic implant (b) and the humerus with a fracture (g). Third, the activation in the cases without abnormalities shows a consistent focus on areas where abnormalities would be expected to appear. This suggests that the network has appropriately learned the regions essential to a correct class prediction.
One important point to notice is that all the architectures provided lower results than those at the top of the MURA leaderboard, even when tested with data augmentation. The top three entries in the MURA leaderboard are: (1) base-comb2-xuan-v3 (ensemble) by jzhang Availink, (2) base-comb2-xuan (ensemble), also by jzhang Availink, and (3) muti_type (ensemble model) by SCU_MILAB, with Cohen's kappa values of 0.843, 0.834 and 0.833, respectively. Ensemble models occupy the top 11 places, and the highest-ranked single model is in position 12 with a value of 0.773. Whilst only the wrist subset of the MURA dataset was analysed in this paper, these images are not considered more difficult to classify than those of the other anatomical regions. When data augmentation was applied to the input of the architectures, the results were significantly better, but still lower than the leaderboard winners. We speculate that the following steps could further improve the performance of CNN-based classification:

1. Data Pre-Processing: In addition to a grid search of the hyper-parameters, image pre-processing to remove irrelevant features (e.g., text labels) may help the network to target its attention. Appropriate data augmentations (e.g., rotation, reflection, etc.) allow better pattern recognition to be trained and, in turn, provide higher prediction accuracy.

2. Post-Training Evaluation: The class activation map provides an interpretable visualisation for clinicians and radiologists to understand how a prediction was made. It allows the model to be re-trained with additional data to mitigate any model bias and discrepancy. Having a clear association of the key features with the prediction classes [70] will aid in developing a more trustworthy CNN-based classification, especially in a clinical setting.

3. Model Ensemble: Combining the results of different architectures [71,72] has also shown better results than any individual configuration. This is also observed in the leaderboard of the original MURA competition.

4. Domain Knowledge: Knowledge of anatomy (e.g., bone structure in the elbow or hands [73]) or of the location/orientation of bones [28] can supplement a CNN-based classification to provide further fine-tuning in anomaly detection, as well as to guide the attention of the network towards better results [74].

Conclusions
In this paper, an objective comparison of eleven convolutional neural networks was performed. The architectures were used to classify a large number of wrist radiographs, which were divided into two groups: abnormal, i.e., those containing abnormalities such as fractures or metallic plates, and normal, i.e., healthy. The comparison in Figure 12 shows a gradual improvement of the two metrics, accuracy and Cohen's kappa, with more recent and deeper architectures. The best results were provided by ResNet-50 and Inception-ResNet-v2. Data augmentation was evaluated and was shown to improve the results significantly. Class activation maps were useful to observe the salient regions of each radiograph as it passed through the architectures. Objective comparisons are important, especially for non-experts, who may otherwise adopt one architecture without knowing whether it is the optimal choice for their specific problem.

Figure 12. Illustration of the effect of the number of layers of the architectures on the two metrics used in this paper, accuracy and Cohen's kappa. Each architecture is represented by a circle, except those with augmentation, which are represented by an asterisk. For visualisation purposes, numbers are added that correspond to the order of Table 7 (1 GoogleNet, 2 VGG-19, 3 AlexNet, 4 SqueezeNet, 5 ResNet-18, 6 Inception-v3, 7 ResNet-50, 8 VGG-16, 9 ResNet-101, 10 DenseNet-201, 11 Inception-ResNet-v2, 12 ResNet-50 (augmentation), 13 Inception-ResNet-v2 (augmentation)). Notice the slight improvement provided by deeper networks and the significant improvement that corresponds to data augmentation.

Data Availability Statement: The dataset for this study is publicly available by request from the Stanford Machine Learning Group at https://stanfordmlgroup.github.io/competitions/mura/.

Acknowledgments: A.A. has been supported through a Doctoral Scholarship by The School of Mathematics, Computer Science and Engineering at City, University of London. Karen Knapp from Exeter University is acknowledged for her valuable discussions regarding fractures.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: