Comparing Stacking Ensemble Techniques to Improve Musculoskeletal Fracture Image Classification

Bone fractures are among the main reasons for emergency room admittance and require a rapid response from doctors. Bone fractures can be severe and can lead to permanent disability if not treated correctly and rapidly. Using X-ray imaging in the emergency room to detect fractures is a challenging task that requires an experienced radiologist, a specialist who is not always available. The availability of an automatic tool for image classification can provide a second opinion for doctors operating in the emergency room and reduce the error rate in diagnosis. This study aims to increase the existing state-of-the-art convolutional neural networks’ performance by using various ensemble techniques. In this approach, different CNNs (Convolutional Neural Networks) are used to classify the images; rather than choosing the best one, a stacking ensemble provides a more reliable and robust classifier. The ensemble model outperforms the results of individual CNNs by an average of 10%.


Introduction
The incidence of bone fractures is affected by many factors including age, gender, biology, physiology, and access to treatment and prevention programs [1][2][3]. Impairmentrelated bone fractures contribute to an increase in morbidity and mortality across the age span [3]. The leading causes of bone fractures include osteoporosis [2,4,5] and trauma [6]. Osteoporosis is a chronic bone disease related to the loss of bone density. Bone trauma can be defined as an injury caused to the bone by a force external to the body. When applied to clinical practice, a systematic study of bone fracture data can allow clinicians to compare affected and unaffected patient groups, determine definable and preventable characteristics that predispose patients to skeletal fractures and ensure the provision of appropriate prevention and treatment strategies [7][8][9].
Deep learning is a subfield of artificial intelligence that has gained much attention due to its robust results in many challenging domains, such as machine translation, natural language processing, and computer vision, among others [10]. Convolutional neural networks (CNNs) are part of deep learning domains where at least one layer of the neural network is a convolution layer. Many state-of-the-art results were achieved by using CNNs, especially in the computer vision domain. However, one of the main drawbacks of using CNNs to classify images is the dataset size needed to train them accurately: thousands of images are usually required. This issue limits the usage of CNNs in classifying images in the medical field. However, two main techniques can be used to get over the issue of dataset size: transfer learning and ensemble learning. Transfer learning was introduced to get over the dataset size challenge by training a CNN on a sizeable nonmedical dataset, then fine-tuning the weights to a medical dataset. Ensemble learning is a technique based on the principle of the crowd's wisdom. In ensemble learning, multiple classifiers are trained, and then their results are combined. There are different methods of combining the outputs of different classifiers, such as taking the average of their outputs or using a machine learning algorithm over their predictions-that is, creating a classifier to sort different models' outputs.
The classification of fracture images using CNNs was investigated in many studies. Chung et al. [11] used a private dataset to investigate shoulder image usage to classify humerus fractures using ResNet CNN. In particular, they used a dataset with 256 × 256-pixel images. After fine-tuning the CNN, they applied the CNN model to reach an accuracy of 96%, a performance higher than that of human experts. Rajpurkar et al. [12] investigated CNN's performance in classifying bone fractures using a novel dataset called MURA. The images were rescaled to 320 × 320 pixels, and they compared the performance of their networks to three different radiologists' assessments. The radiologists' performance was more accurate than their results.
Olczak et al. [13] tested CNN on a private dataset with 256,000 images split among wrist, hand, and ankle images. The X-ray images were rescaled to 256 × 256 pixels, and VGG16 achieved the best performance. Additionally, CNN's performance was comparable to two senior orthopedic surgeons. Lindsey et al. [14] studied the effect of having a CNN to classify wrist fractures in the ER to help in fracture diagnosis. They used a private dataset to train a CNN by relying on an extension of U-Net architecture as a classifier. Subsequently, they did a controlled experiment to evaluate the importance of having a CNN to aid doctors' diagnoses. Olczak et al. [13] found a significant decrease of 47% in the misinterpretation rate when using CNN.
Uysal et al. [15] studied the effect of different SOTA CNN architectures on the MURA dataset's shoulder images. The images were scaled to 320 × 320 pixels. The Adam optimizer was chosen, with a learning rate of 0.0001 with 40 epochs for training. The best kappa score was 0.6942, resulting from taking an ensemble of ResNet34, DenseNet169, DenseNet201, and a sub-ensemble of different CNNs.
Guan et al. [16] used 3392 positive images (images with fractures) of the humerus, elbows, finger, hand, and forearm fracture from the MURA dataset to detect bone fractures. The image size was scaled to be 800 pixels for the shorter side and 1333 pixels for the longer side. Subsequently, an improved two-stage RCNN detection method was proposed. However, the original MURA dataset did not contain annotations of exact fracture locations. For this reason, the locations were annotated using the help of three radiological experts with more than 20 years of experience. To mitigate the small dataset's effect, two main geometric augmentation techniques were considered: horizontal flipping and random rotation. The authors used 4 GPU NVIDIA GeForce GTX 1080Ti graphic cards to train the model. The average precision reported in the study was 62.04% [16].
Huynh et al. [17] used the humerus images from the MURA dataset. The images were rescaled to 150 × 150 pixels, and a novel CNN architecture with two convolution layers and one max pooling layer was presented. The learning rate of 0.0001 achieved the best results compared to three different learning rates. Moreover, the proposed model achieved a validation accuracy of 0.684 compared to a kappa score of 0.600 for the MURA dataset. Urinbayev et al. [18] used the MURA dataset as a preprocessing step to determine the organ type before further classification. Thus, the dataset was used in the context of organ type classification and not for fracture classifications. The main aim was to classify chest X-ray images.
Kitamura et al. [19] investigated the performance of three SOTA CNN models on a private dataset. The dataset used was composed of 596 ankle images. The image size was 300 × 300 pixels. Four different geometric augmentation techniques were used to increase the size of the training dataset and to make the models more invariant. The techniques used were random rotation, flipping, brightness variation, and contrast alteration. An NVIDIA GeForce 1080 GTX graphics card was used to perform the experiments, in which the performance of each of the three CNNs and an ensemble composed of the three models were investigated. The performance of the ensemble methods was better than any single model's performance, and the best accuracy recorded was 81%.
Many studies have investigated the effects of ensemble techniques, especially their impacts on CNNs. Chouhan et al. [20] proposed a study to classify pneumonia images and used five different CNNs. By taking the average of the five CNNs instead of using them individually, it was possible to increase the final model's accuracy. A 2% increase in accuracy was achieved by using the average instead of using the best-performing CNN. Even with the robust model introduced by He et al. [21] in 2016, the authors used the ensemble technique to win the ImageNet challenge [22]. Whereas their top model scored a 4.49% error rate, using ensemble techniques allowed them to reduce the error rate to 3.57%. Rajaraman et al. [23] investigated four different approaches to accurately classify tuberculosis from X-ray images. They reported that stacking achieved the best results. Cha et al. [24] studied different techniques for classifying ear diseases by using otoendoscopic images. They compared the performance of nine CNNs and then selected the best two CNNs to build an ensemble model. The ensemble model achieved the best result compared to the other CNNs taken into account. Kandel and Castelli [25] investigated the impact of transfer learning, i.e., using the weights of ImageNet instead of training the network from scratch on X-ray classification to detect fractures. They used the seven MURA datasets [12], and investigated six state-of-the-art CNN architectures: VGG19 [26], InceptionV3 [27], ResNet50 [21], DenseNet [28], Xception [29], and InceptionResNet [30]. Experimental results show that the weights of ImageNet had an impact other than training the CNN from scratch. Moreover, sometimes, by training the networks from scratch, it did not converge at all, given the MURA dataset's limited size [12].
Joshi et al. [31] reviewed several contributions that apply artificial intelligence techniques to fracture detection. Among techniques, they pointed out that only a few works use CNNs for fracture detection. They cited five papers in which CNNs were used with data augmentation for fracture detection. While their work suggests that stacking was used in the past with traditional (fully connected) neural networks, random forests, and support vector machines, no study was cited on the use of stacking with CNNs. Thus, we believe our contribution is timely and provides an interesting contribution to this field.
Our primary goal is to develop a classification framework for bone fracture detection using X-ray images. To reach our goal, we compared different stacking techniques to improve the classification of fractures using X-ray images.
The main contributions of this work are as follows: (1) We evaluate different state-of-the-art CNNs. We also assess their combinations' performance, either in averaging, weighted average, or majority vote. (2) We investigate the usage of the stacking ensemble along with CNN outputs. To do so, we tested the performance of eight different machine learning meta-classifiers. Two main methods are considered: using probability outputs to construct the level-1 ensemble, or using label outputs.
The rest of this paper is organized as follows: In the second section, we describe our methodology. The third section presents our results obtained. The fourth section discusses the main findings of this study. Finally, the fifth section concludes the paper.

Materials and Methods
In Figure 1, we illustrate a schematic diagram of the proposed pipeline, showing how we stacked different machine learning algorithms. In the following subsections, we briefly discuss the methods used in this research and the dataset used to train the CNN.

Conventional Neural Networks
After CNN's success in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC challenge), it became the de facto algorithm for image classification. The difference between CNNs and any other neural network is the presence of the convolution layer. The convolution layer's importance is that it decreases the number of connections required and considers the spatial and temporal information of images. The convolution layer works by applying a window size called the kernel that convolves the image to detect essential features that can be used in the classification task. Equation (1) shows the convolution operation for colored images: where (. ) is the input image, is the color channels, (. ) is the kernel, and , is the output pixel in the ( , ) position. Moreover, the asterisk * operation resembles the convolution operation between the image and the kernel, which equals the dot product's summation ⊙ between different pixels in the ( , ) position.
A CNN mainly consists of two types of layers: primary layers and secondary layers. Primary layers are mandatory layers that constitute the CNN (i.e., convolution layers, pooling layers, and fully connected layers). Secondary layers (e.g., dropout layers and normalization layers) are used to increase CNN performance.
2.1.1. VGG19 Network VGG networks were introduced by Simonyan and Zisserman [26] to participate in the ImageNet challenge. VGG networks are composed of sequential convolutional blocks separated by max-pooling layers. VGG networks come in many variants. We used a 19layered version (i.e., VGG19).

Inceptionv3 Network
The InceptionV3 network was introduced by Szegedy et al. [33] to participate in the ImageNet challenge. The InceptionV3 network has a novel module called the inception module, where convolution layers are connected in parallel and sequentially.

Conventional Neural Networks
After CNN's success in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC challenge), it became the de facto algorithm for image classification. The difference between CNNs and any other neural network is the presence of the convolution layer. The convolution layer's importance is that it decreases the number of connections required and considers the spatial and temporal information of images. The convolution layer works by applying a window size called the kernel that convolves the image to detect essential features that can be used in the classification task. Equation (1) shows the convolution operation for colored images: where I(.) is the input image, c is the color channels, F(.) is the kernel, and O[i, j] is the output pixel in the (i, j) position. Moreover, the asterisk * operation resembles the convolution operation between the image and the kernel, which equals the dot product's summation between different pixels in the (i, j) position.
A CNN mainly consists of two types of layers: primary layers and secondary layers. Primary layers are mandatory layers that constitute the CNN (i.e., convolution layers, pooling layers, and fully connected layers). Secondary layers (e.g., dropout layers and normalization layers) are used to increase CNN performance.
2.1.1. VGG19 Network VGG networks were introduced by Simonyan and Zisserman [26] to participate in the ImageNet challenge. VGG networks are composed of sequential convolutional blocks separated by max-pooling layers. VGG networks come in many variants. We used a 19-layered version (i.e., VGG19).

InceptionV3 Network
The InceptionV3 network was introduced by Szegedy et al. [32] to participate in the ImageNet challenge. The InceptionV3 network has a novel module called the inception module, where convolution layers are connected in parallel and sequentially.

Resnet Network
ResNet was introduced by He et al. [21] in 2015. They proposed a new connection called the residual connection to minimize the vanishing gradient problem that usually happens to very deep networks (i.e., networks with more than 30 layers).

Xception Network
The Xception network was introduced by Chollet [29]. They used depth-wise separable convolutional layers instead of conventional convolutional layers.

Densenet Network
DenseNet was introduced by Huang et al. [28]. They were inspired by the residual connection proposed in the ResNet CNN [21]. Instead of using the residual connection, they densely connected the convolutional layers. In addition, the layers were concatenated instead of added to each other, similar to ResNet.

Machine Learning Algorithms
In this section, we provide a summary of the different machine learning algorithms used in this paper.

Logistic Regression
This is a simple linear model with a binary outcome. Logistic regression models the target variable with a line for two dependent variables or a hyperplane for more than two variables. Equation (2) shows the logistic operation: where p is the probability of success (in our study is the presence of fracture), β 0 is the model intercept, and β i are the regression coefficients.

Bagging and Random Forests
Breiman [33] introduced the bagging algorithm to decrease the high variance of decision trees. "Bagging" stands for bootstrap aggregation, where several decision trees are constructed using sampling with replacement from the dataset. The trees' outputs are averaged in case of regression, and a majority vote is taken in the event of a classification task. Breiman introduced an upgrade to bagging called random forests [34]: to decorrelate the trees, only a subset of the variables is randomly selected in each step to construct a decision tree. Bagging and random forests consider homogenous learners.

AdaBoost
AdaBoost (adaptive boosting) [35] is an ensemble approach that uses an iterative process. The main idea is to give more importance to misclassified data points built by the first weak learner, then build another weak learner that concentrates on these misclassified data points. This iterative nature of the algorithm ensures that each misclassified data point will have a learner specifically built to focus on it. The formula of AdaBoost is presented in Equation (3): where K is the number of weak classifiers, θ k is the weight of the Kth weak classifier, and f k (x) is the Kth weak classifier.

Gradient Boosting Machine
A gradient boosting machine [36] is an iterative ensemble technique. Usually, the weak learners are decision trees. The weak learners have residuals that the subsequent learner tries to minimize. In the stochastic gradient descent algorithm, Friedman [37] proposed building weak learners on samples drawn from a dataset by using sampling with replacement technique to give the model some variability and decrease bias.

Naïve Bayes
Naïve Bayes (NB) is a classifier based on the Bayes theorem, which is calculated by estimating probabilities. The NB classifier assumes that every conditional probability of each feature is independent of each other. According to Bayes' theorem, NB can be calculated using Equation (4): where P(x|y) is the conditional probability of each feature, P(x) is the feature's prior probability, and P(y) is the target class y's prior probability. Generally speaking, NB classifiers have three main subclasses: Gaussian, multinomial, and Bernoulli. In this study, we used the GaussianNB classifier, in which the likelihood of features is assumed to be Gaussian. This can be calculated using Equation (5): where the parameters σ 2 y and µ y are the variance and the mean, respectively. These values are estimated using maximum likelihood. For more information, the reader is referred to the corresponding article [38].

Stacking
Wolpert [39] introduced stacking as an ensemble algorithm that is different from bagging, random forest, and boosting: stacking considers heterogeneous learners. A schematic diagram of the stacking method is presented in Figure 2. There are usually two or more levels of classifiers. The first level is called the zero level, and it contains the base classifiers that take the original inputs. As seen in Figure 2, H 0 is the original dataset, which is the MURA dataset in our case. The zero-level classifiers will produce the H 1 dataset, which will be used in the second level by the meta-classifiers (or level-one classifiers). H 1 is the dataset that will be produced by the base classifiers, which are the CNNs in our case. C B i are the base classifiers that will be used to produce the H 1 dataset. C M i are the meta-classifiers that will be used for classification of the H 1 dataset. H 1 could be probabilities or labels, meaning the output of the C B i that will be used by C M i . We will compare both methods.

Evaluation Metrics
Two evaluation metrics were used to assess each classifier's performance. Below are the summaries for each.

Evaluation Metrics
Two evaluation metrics were used to assess each classifier's performance. Below are the summaries for each.

Accuracy
The percentage of correctly classified images to the total number of images. Can be calculated using Equation (6): where TP is the number of true positive images, TN is the number of truly negative images, FP is the number of false-positive images, and FN is the number of false-negative images.

Kappa Score
The Kappa score [40] measures the agreement between the actual label and the predicted label. It ranges from −1 to +1, where +1 means that the classifier predicts the correct labels and ≤0 means that the classifier is just a random guess. The Kappa score can be calculated using Equation (7): where Agreement Observed is the accuracy of classifier C and Agreement Expected is calculated using Equation (8).
For k mutually exclusive categories with N total data points and n ki the number of times classifier C predicted category k.

Results
As pointed out by [19,42], fine-tuning the CNN architectures will yield better res and converge faster than training a CNN from scratch. In this paper, we fine-tune CNN layers using the ImageNet dataset. All hyperparameters were fixed for all CN during the training process. We used the Adam optimizer [43] with a learning rat 0.0001 for all CNNs. The batch size was 64. All images were rescaled to 96 × 96 pixels early stopping criterion was used to stop the training if no progress occurred fo

Results
As pointed out by [19,41], fine-tuning the CNN architectures will yield better results and converge faster than training a CNN from scratch. In this paper, we fine-tuned all CNN layers using the ImageNet dataset. All hyperparameters were fixed for all CNNs during the training process. We used the Adam optimizer [42] with a learning rate of 0.0001 for all CNNs. The batch size was 64. All images were rescaled to 96 × 96 pixels. An early stopping criterion was used to stop the training if no progress occurred for 50 epochs. Four image augmentation techniques were considered to increase the model performance: zooming, 180 • rotations, and horizontal and vertical flips. We relied on data augmentation to balance the numbers of images in the different classes. All machine learning algorithms' hyperparameters were kept at the default settings of the Scikit-learn Python package. An NVIDIA RTX 2060 graphics card and an Intel i7-10750H CPU were used for training.
Due to the dataset's size, we did not use point estimates. Instead, each base-learner classifier was trained 10 times, then average scores were calculated. To determine the statistical significance of the results, we followed the methods of [43,44], in which a small confidence interval implies statistical significance [45]. For the sake of readability, we provided a summary of the results for all images in Table 1. Full tables may be found in the annex. In the first experiment, we tested different techniques on the humerus images. The first set of experiments tested the performance of various state-of-the-art CNNs on the humerus dataset, which considered level-0 classifiers. The VGG19 architecture achieved the highest score among the five CNNs, κ = 0.6299 ± 2.40%. This was the greatest kappa value and the smallest confidence interval compared to other level-0 classifiers. The ResNet50 network achieved the lowest score, κ = 0.5349 ± 3.67% but had the greatest confidence interval. Subsequently, to assess the performance of statistical metrics, we averaged the predictions of the five CNNs. A majority vote was taken, and the weighted average based on kappa scores was calculated. The three scores were approximately the same, with the average score being slightly higher than the others. The second set of experiments aimed at training a machine learning classifier over the probability outputs of the CNN. The NB classifier achieved the highest score, with a score slightly lower than the one achieved in the first set of experiments. The third set of experiments was similar to the second but considered the label outputs of the CNN. The scores achieved in the third experiment were smaller than the ones achieved in the first and the second sets of experiments. Overall, the highest score was achieved by taking the average of the level-0 classifiers for the humerus images.
We tested the different techniques on the finger fracture images, as we had with the humerus images. The first set of experiments considered the level-0 classifiers. The best CNN was the DenseNet network, κ = 0.4168 ± 3.54%; while the InceptionV3 network achieved the lowest score, κ = 0.3566 ± 2.41%. To assess the performance of simple statistical metrics, we averaged the predictions of the five CNNs. A majority vote was taken, and the weighted average based on kappa scores was calculated. The kappa score of the average vote and the weighted vote were the same and were higher than the majority vote and the DenseNet score from level 0. Overall, in the first set of experiments, the highest score was achieved by considering the CNNs' average vote. In the second set of experiments, each CNN's probability score was used to train a machine learning classifier. The highest score was achieved by the NB classifier, κ = 0.4862 ± 2.42%. The score achieved by the NB classifier was greater than all previous scores. For the third set of experiments, the NB classifier also achieved the highest score; however, it was less than that achieved by using the probability score. Overall, the second set of experiments' performance was better than both the first and the third sets of experiments.
In the first set of experiments for elbow images, the greatest kappa scores were achieved by the VGG19 network, κ = 0.6436 ± 2.19%; and the Xception network, κ = 0.6433 ± 0.91%. The VGG19 network's kappa value was slightly high; however, the Xception network's confidence interval was less than that of the VGG19 network. To assess the performance of simple statistical metrics, the predictions of the five CNNs were averaged, the majority vote was taken, and the weighted average based on kappa scores was calculated. All three kappa scores were similar to the weighted average score (0.6970 ± 0.67%), which was slightly greater than the others with the smallest confidence interval. In the second set of experiments, the probabilities of the CNNs were used to train eight machine-learning classifiers. All scores were approximately similar, with the LR bagging classifier being the highest, κ = 0.6758 ± 1.74%. Overall, the scores of the second set of experiments were lesser than those of the first set of experiments. The third set of experiments' results were lesser than those achieved in the first and second experimental sets. AdaBoost classifier scored the highest, κ = 0.6720 ± 1.78%. Overall, the best kappa score achieved for the elbow images was achieved by taking the weighted average vote of level-0 classifiers.
For wrist images, the best score achieved by a level-0 classifier was by the Xception network, κ = 0.6127% ± 1.59%. To assess the performance of simple statistical metrics, the predictions of the five CNNs were averaged, the majority vote was taken, and the weighted average based on kappa scores was calculated. The highest score was achieved by the weighted average of the level-0 classifiers, κ = 0.6556 ± 0.58%, which was greater than the Xception network's score. In the second set of experiments, the best performer was the logistic regression classifier, κ = 0.6510 ± 1.06%. While this score was greater than the level-0 classifiers' scores, it was still less than the weighted average vote of the level-0 classifiers. The results of the third set of experiments were lesser than both the first and second sets. Overall, in all experiments, the weighted average vote of the level-0 classifiers yielded the best results for wrist images.
For forearm images, the best CNN was DenseNet, κ = 0.5592 ± 2.16%. To assess the performance of simple statistical metrics, the predictions of the five CNNs were averaged, the majority vote was taken, and the weighted average based on kappa scores was calculated. The weighted average votes achieved the highest score, κ = 0.5765 ± 1.85%. The results achieved in the second set of experiments were better than those of the first set, with the NB classifier achieving the highest kappa score, κ = 0.6195 ± 2%. The results achieved in the third set of experiments were as high as the second set, with the NB classifier achieving the highest kappa score as well, κ = 0.6201 ± 1.76%. Overall, the NB classifier using label predictions achieved the best results for forearm images.
For hand images, the VGG19 network achieved the highest score among all CNNs, κ = 0.4358 ± 1.68%. To assess the performance of simple statistical metrics, the predictions of the five CNNs were averaged, the majority vote was taken, and the weighted average based on kappa scores was calculated. The weighted average votes achieved the highest score, κ = 0.4260 ± 2.30%. However, this score was less than the score achieved by the VGG19 network. The results achieved in the second set of experiments were better than those of the first set, with the NB classifier achieving the highest kappa score, κ = 0.5029 ± 1.14%. The results achieved in the third set of experiments were slightly lesser than the second set, with the NB classifier achieving the highest kappa score, κ = 0.4962 ± 2.62%). Overall, the best result was achieved by an NB classifier using probabilities.
For shoulder images, the best performing CNN was the VGG19 network, κ = 0.4638 ± 2.49%. To assess the performance of simple statistical metrics, the predictions of the five CNNs were averaged, the majority vote was taken, and the weighted average based on kappa scores was calculated. The weighted average votes achieved the highest score, κ = 0.4908 ± 2.19%. For the second set of experiments, the highest score was achieved by the GBM classifier, κ = 0.5050 ± 2.09%. The GBM kappa score was greater than both the level-0 classifiers and their average weight score. For the third set of experiments, the bagging classifier achieved the highest score using logistic regression, κ = 0.4882 ± 2.30%. Overall, the best classifier for shoulder images was the GBM classifier trained using probability scores. Appendix A contains all the results for the expriments we performed.

Discussion
In this study, rather than relying on a single CNN classification, we investigated different methods to combine the results of individual CNN networks (level-0 classifiers) in order to classify musculoskeletal X-ray images. We first trained five different CNN networks, then assessed their performance. Afterward, we combined their predictions by taking an average of their votes, a weighted average of their votes, or by taking their majority votes. We called this last step a meta-learner using statistics. Afterward, we examined the eight different machine learning algorithms' performance on top of the predictions made by level-0 classifiers. Two different inputs were used to train the machine learning algorithms-either the probability output of the level-0 classifiers or their label output. In Table 2, we present the differences between the kappa scores of the best performing level-0 classifiers (CNNs) and those of the level-1 classifiers (machine learning algorithms). By comparing the meta-learners using statistics (MLUSs) across the seven different datasets, we observed that their performance was usually better than any single CNN, except for in the hand dataset, whose best-performing CNN (VGG19) achieved a better score than the three combination methods. Across the three methods, the weighted average achieved the best results. Even in the case of the hand dataset, it yielded a better score. It is worth noting that the weights were measured based on kappa scores. The greatest difference between the meta-learners using statistics and the best CNN occurred in the elbow dataset, with an average difference of 8.30%. The least difference occurred in the hand dataset, where the best CNN achieved better results than the combinations by an average of 5.06%.
Comparing the meta-learners using probability (MLUPs) to level 0, we observed that the MLUP classifiers achieved better results than the level-0 CNNs. The most significant difference was spotted in the finger dataset, where the average increase was 12.36% compared to level 0. The least difference occurred in the humerus dataset, in which the average increase was 2.68% compared to level 0. Additionally, across the eight machine learning algorithms, all achieved more accurate results than the best level-0 classifier. The best achieving meta-classifier was NB for four datasets. Comparing MLUPs to MLUSs, MLUPs achieved higher results than MLUSs for four datasets, and MLUSs achieved the best results for the remaining three datasets. However, it is worth noting that for the hand dataset, where the score of the best level-0 classifier achieved better results than all the MLUSs, MLUPs achieved significantly better results than the MLUSs, with an average 9.59% increase in accuracy.
Comparing the meta-learners using labels (MLULs) to level-0 CNNs, we observed that the MLUL meta-classifiers achieved better results than all level-0 CNNs. The most significant difference was in the finger dataset, where the average increase was 8.78% compared to level 0. The least difference occurred in the humerus dataset, where the average increase was 0.61% compared to level 0. However, across the eight machine learning algorithms, some scored lower than the highest level-0 classifiers, such as the SGD bagging classifier for the humerus and shoulder datasets and the SVC bagging classifier for the humerus dataset. The best meta-classifier was the NB classifier, as it achieved the best results five times. Comparing MLULs to MLUSs, MLUSs achieved higher results than MLULs for four datasets. The MLUSs achieved the best results for the remaining three datasets. However, it is worth noting that for the hand dataset, where the score of the best level-0 classifier achieved better results than all MLUSs, MLULs achieved significantly better results than the MLUSs, with an average increase of 7.66%. Comparing MLUPs to MLULs, MLUPs achieved better results than MLULs for all seven datasets.
From our results, it is clear that using stacking achieves better results than using any single classifier on its own. This conclusion matches the results obtained by several works presented in the literature [46]. One of the leading hypotheses for why stacking achieves better results than any single classifier is that combining the outputs of different classifiers can decrease each classifier's error. This diversity is what makes stacking work [47,48], whereas the underlying level-0 classifiers must be different from each other and must make different errors. In our case, each CNN was different from the others both in its number of parameters and its underlying architecture, so each CNN made different errors.
Regarding the computational power needed for each MLUS, MLUP, and MLUL, the least was needed for the MLUSs. Here, no high computational power was needed since the method is all about taking the average or the majority vote. However, for MLUPs and MLULs, greater computational power was needed to train the machine learning algorithms. Nevertheless, it is worth noting that the computational power needed for the machine learning algorithm is less than the power needed to train CNNs by order of magnitude. Full tables on computational requirements can be found in the annex.
Our choice of dataset was successful for two reasons. First, the dataset has seven different types of images. Second, each of the seven datasets has a different size. Thus, it was possible to compare our results across different types of images and different dataset sizes. Based on our results, we conclude that stacking algorithms achieve higher results than level-0 algorithms alone. MLUPs achieved better results than MLULs in all datasets.
Thus, we recommend using MLUPs over MLULs. However, further studies are needed to compare MLUSs to MLUPs and determine which method is better.
Concerning other results presented in the literature, Kitamura et al. [19] investigated an ensemble method's performance built on three CNNs: InceptionV3, ResNet, and Xception. They reported that using an ensemble (where the voting method was used to combine the different CNNs' outputs) achieved better results than using a single model. Moreover, they investigated combining all the models versus combining only the best-achieving models. The performance of the ensemble using all models was better than that of the ensemble built on the best models only [19]. Both our study and Kitamura et al.'s [19] concluded that an ensemble of different CNNs would yield better results than using a single CNN.
Chung et al. [11] used a single model (ResNet) to classify humerus fracture images. They compared the performance of the single model to human experts. However, they neither compared their model to other CNN models nor examined any ensemble models' performance. Olczak et al. [13] compared five different CNN architectures and selected the best model, which was the VGG16 network. They did not examine the performance of any ensemble models. Similarly, Lindsey et al. [14] did not investigate the role of ensemble learning and instead relied on simple learners.
The main difference between our study and Uysal et al.'s [15] is that we used all seven different MURA datasets. Uysal et al. [15] used only the shoulder dataset. Thus, their results cannot be generalized over the different types of fracture images. Additionally, compared to Uysal et al. [15], we repeated each experiment 10 times to increase the significance of our findings. On the other hand, Uysal et al. [15] provided no average results and only presented their findings on a single model. They reported a kappa score exceeding our highest score for the shoulder dataset; we might speculate that this is due to the fact that they considered high-resolution images. All in all, considering the different experimental settings between our work and Uysal et al.'s work, the results obtained for the shoulder dataset are comparable.
The main difference between our study and Huynh et al.'s [17] is that we used all seven MURA datasets, while Huynh et al. [17] used only the humerus dataset. Thus, in this case as well, it is difficult to generalize their findings on all the different types of fracture images. The authors compared their model's validation accuracy score to the MURA study's kappa score [12]. Our score was comparable to that reported by Huynh et al. [17], even if our image sizes were smaller than the sizes they considered. Our score for the humerus dataset was κ = 0.6662 ± 1.60%, compared to κ = 0.684 reported by [17].

Conclusions
Using deep learning techniques in the emergency room can be very helpful for doctors when detecting fractures. In particular, deep learning models can reduce the time needed to classify fracture types. In this study, we discussed several ensemble techniques that can be used to improve musculoskeletal fracture classification performance. In the first set of experiments, we averaged the predictions of five CNNs, calculated a weighted average, and took a majority vote among CNNs to assess the performance of simple statistical metrics. In the second set of experiments, we investigated different machine learning algorithms as meta-classifiers for stacking techniques trained on the probability output of level-0 CNNs. In the third set of experiments, the meta-classifiers were trained on the level-0 CNNs' label output. From our results, we conclude that using stacking algorithms achieves better results than using a single CNN.
However, MLUSs achieved better results than level-0 classifiers in six out of seven datasets. MLUSs have no high computational power requirements. MLUPs and MLULs achieved better results than level-0 classifiers in all datasets but required greater computational power. MLUPs achieved better accuracy than MLULs, so we conclude that MLUPs are better than MLULs. Nevertheless, the question of whether MLUPs or MLUSs are more accurate still requires further investigation.

Data Availability Statement:
The MURA dataset underlying this study is a publicly available dataset available from arXiv:1712.06957.

Conflicts of Interest:
The authors declare no conflict of interest.