Modeling Uncertainty in Fracture Age Estimation from Pediatric Wrist Radiographs

Abstract: In clinical practice, fracture age estimation is commonly required, particularly in children with suspected non-accidental injuries. It is usually done by radiologically examining the injured body part and analyzing several indicators of fracture healing, such as osteopenia, periosteal reaction, and fracture gap width. However, age-related changes in healing timeframes, inter-individual variability in bone density, and significant intra- and inter-operator subjectivity all limit the validity of these radiological clues. To address these issues, we propose and evaluate, for the first time, an automated neural network-based system for estimating the age of a pediatric wrist fracture. Our dataset comprised 3570 medical cases with a skewed distribution toward initial consultations. Each medical case includes a lateral and an anteroposterior projection of a wrist fracture, as well as the patient's age and gender. We propose a neural network-based system with Monte-Carlo dropout-based uncertainty estimation to address dataset skewness. Furthermore, this research examines how each component of the system contributes to the final prediction and interprets different scenarios in the system's predictions in terms of their uncertainty. The examination of the proposed system's components showed that feature-fusion of all available data is necessary to obtain good results. Moreover, incorporating uncertainty estimation into the system increased both accuracy and F1-score to a final 0.906 ± 0.011 on the given task.


Introduction
Knowing the approximate age of a bone fracture is a medically and forensically relevant issue, especially in the context of suspected non-accidental trauma in a child. Physicians often ask radiologists to estimate the age of a specific fracture, a question that, in many situations, still cannot be answered with confidence.
Radiography can help estimate a fracture's age through specific changes in the fracture's appearance and through the presence of reparative processes. These processes include mechanisms such as soft-tissue swelling in the early, and osteopenia or periosteal reaction in the later, phases of healing. In the end, typically after a few months, a fracture is fully remodeled and ceases to be visible. A systematic evaluation of the published literature revealed that the radiographic characteristics of bone healing differ substantially across individual investigations [1]. Radiologically, digital radiography (DR) and computed tomography (CT) are used for fracture detection and, subsequently, for estimating fracture age [2]. In adults, fracture healing proceeds in a relatively uniform manner, so that fracture age can be approximated based on the radiological findings [3]. In children, however, fracture healing depends primarily on patient age, but may also show inter-individual variability. As a result, determining fracture age in pediatric patients is more difficult, and the literature lacks a sufficient quantity of consistent data [3][4][5]. This also leads to a certain degree of skewness in related datasets.
In recent years, researchers have successfully used artificial intelligence (AI), and specifically deep learning (DL) algorithms, to automatically classify medical images [6]. Computer vision (CV) has competed with, and in some studies even exceeded, human experts in fracture detection on X-ray studies [7][8][9]. AI has also achieved striking results in transforming images from one domain to another and in enhancing medical images by, e.g., suppressing plaster casts [10,11]. The backbone of the cited manuscripts, and of many other AI-related studies in medical diagnostics, are neural networks (NNs) [12]. Although NNs are highly successful at numerous tasks, the explainability of their predictions remains their biggest drawback [13,14]. Hence, a good medical decision support system must provide as much information about the origin of its decisions as possible.
To the best of our knowledge, no DL algorithm has yet been developed to estimate or even determine pediatric fracture ages. The wrist is the most common region of pediatric fractures, leading to a sufficient availability of data [15]. We hypothesize that fracture age estimation can be satisfactorily performed by up-to-date convolutional neural networks (CNNs) on pediatric digital wrist radiographs. Therefore, the contributions of our research are as follows:
• This is the first attempt to tackle the issue of estimating pediatric fracture age using AI. Hence, we propose a standard, as well as guidelines, for other researchers to follow;
• By utilizing the Monte-Carlo dropout method, which treats the NN as a Gaussian sampling process, we are able to estimate the uncertainty of the proposed system's decisions, unveiling prediction certainty to increase trustworthiness;
• We propose a novel system based on a CNN combining different features obtained from the medical reports/cases to estimate fracture age. The system is general-purpose, and we believe that its design can be utilized in other research fields as well.

Related Work
As stated in the previous section, to the best of our knowledge, there is no related research investigating the topic of fracture age estimation. Hence, as related work and as the starting point of our research, we reviewed studies dealing with any kind of age/time estimation from medical images. One of the pioneering studies on estimating bone age from hand X-ray images was proposed in [16]. The core idea of the paper revolved around segmenting the short bones from the image, whose growth-related changes can help in age estimation. This idea was followed by Ebner et al. on hand images acquired by magnetic resonance imaging (MRI); however, they utilized more advanced methods based on random regression forests [17]. Similarly, for age estimation based on X-ray images of the hand, Thodberg et al. developed the BoneXpert method for automated determination of skeletal maturity [18]. One of the key events dealing with bone age estimation is "The RSNA Pediatric Bone Age Machine Learning Challenge" [19], whose task was to estimate skeletal age on a curated dataset of 14,236 pediatric hand X-ray images. This challenge showed that NNs can be highly effective for bone age estimation. To summarize, many studies estimate bone age from different body parts (such as the wrist [20] or the ankle joint [21]) and different modalities (such as MRI [22] or X-ray [23]). Some approaches use landmarks, while others search for useful patterns in whole images to estimate bone age (typically by utilizing NNs) [24]. Although our task is to estimate the age of a fracture, the proposed system is greatly inspired by the methods proposed for bone age estimation due to the similar nature of the two tasks.

Materials and Methods
Next, we present the dataset used in our experiments, the modeling task, and the developed software system.

Dataset and Task Formulation
The dataset originated from the Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Austria. It contains 3570 digital radiography (DR) studies of pediatric wrists. All data were anonymized, and the images were stored as 16-bit grayscale Portable Network Graphics (PNG) files. The dataset featured comprehensive annotations, including a patient identifier, an obfuscated study date and time with preserved relative time intervals, and the information of whether the study was a first presentation to the emergency room. DR studies typically contained both anteroposterior (AP) and lateral (LAT) views or projections. In addition, one medical case (compare Figure 1) also included the patient's age and gender (male/female) and the age of the fracture in weeks. Hence, the total number of images in the dataset was 7140. There were 2209 male patients with an average age of 10.85 ± 3.54 years and 1361 female patients with an average age of 9.21 ± 3.17 years. The wrist fracture age distribution in weeks, calculated between follow-up and initial study, is presented in Figure 2a. As can be seen, the number of initial exams, in which the fracture age is 0 weeks, is significantly greater than that of any other fracture age. This can be explained by the fact that not all injuries need X-ray follow-ups. The number of initial exams was 2418, and the total number of all other exams was 1152. To account for data skewness, we decided to follow the guidelines proposed by Krawczyk [25]. First, we split the data into three meaningful groups based on fracture age: the first group represents initial exams (0-week-old fractures), the second group represents roughly half-month-old fractures (1-3 weeks old), and the last group represents fractures older than three weeks. Grouping was done by assigning the same label to all samples belonging to the same group. The described grouping yielded the distribution shown in Figure 2b.
Therefore, the final data distribution used in the research was as follows: group "0 weeks" contained 2418 samples, group "1-3 weeks" 583 samples, and group "3+ weeks" 569 samples. From this point on, we tackled the issue of data skewness by adjusting the loss function with the weighted cross-entropy extension proposed in [26]. Dataset skewness is substantially influenced by the fact that younger children have shorter fracture consolidation times, meaning that older fractures are rarer.
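The three-way grouping can be sketched as follows. The boundary handling (a 3-week-old fracture falls into the "1-3 weeks" group, anything older into "3+ weeks") follows the group descriptions above, and the function name is our own:

```python
def group_fracture_age(weeks: float) -> int:
    """Map fracture age in weeks to a class label:
    0 -> "0 weeks" (initial exam), 1 -> "1-3 weeks", 2 -> "3+ weeks"."""
    if weeks == 0:
        return 0
    if weeks <= 3:
        return 1
    return 2

labels = [group_fracture_age(w) for w in (0, 1, 2, 3, 4, 10)]  # → [0, 1, 1, 1, 2, 2]
```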
Nevertheless, dataset skewness is one of the issues that our system tries to overcome. To enhance the credibility of the system, we decided to estimate the uncertainty of its decision making. To summarize, in this paper we provide a system design that estimates the age of children's wrist fractures based on X-ray images, the patient's age, and gender. In addition, the system learns and reports uncertainty about each prediction, which brings us closer to trustworthy and explainable AI.
Figure 2. Histograms representing the distribution of data used in the research. Subfigure (a) presents the distribution of fracture age over the weeks; subfigure (b) depicts a histogram of the same data after grouping it into three groups.

Proposed System Overview
As mentioned in the related work subsection, the developed system is inspired by "The RSNA Pediatric Bone Age Machine Learning Challenge" [19]. This challenge influenced several papers addressing the subject of age estimation from X-ray images [27,28]. Namely, as mentioned previously, to the best of our knowledge there is no related work on fracture age estimation. Hence, we found a similar problem, bone age estimation from X-ray images of the hand, and took it as the starting point of our proposed system design. All of the solutions presented in the challenge have two things in common. The first is that they use a CNN to extract features from images. The second is supplementing the extracted features with additional information (such as patient age or gender), encoding all the features in a single vector. Based on the values contained in that vector, a classifier NN head (a feed-forward NN) makes predictions. In Figure 3 we present a detailed flowchart of the developed system that produced the best results in the conducted experiments. As can be seen, the proposed system has two parts:
• The first part of the proposed system has two components: NNs (NN1 and NN2) predicting fracture age based on the lateral (P_LAT) or anteroposterior (P_AP) input images. The NNs use EfficientNetB1 (a state-of-the-art NN architecture for classification) as a feature extractor with a custom-developed fully-connected NN head on top of it. The EfficientNetB1 topology (depicted in Figure 4) was the same as proposed in the original paper (the image input size is 240 × 240 pixels) [29], while the fully-connected NN head architecture can be seen in Figure 3. The number of neurons in each dense layer of the NN head is 1024, 1024, 512, and 3, respectively. The dropout rate in the dropout layers was set to 10%. The output of each of the two NNs is a fracture age prediction based on the image it receives as input.
In order to determine whether EfficientNetB1 is the best-performing network for our system, we compared it with other popular deep learning architectures: VGG19 [30], ResNet101 [31], InceptionV3 [32], and Xception [33]. As can be seen in Appendix B, EfficientNetB1 was the best-performing of all tested models, which is why we chose it for our system;
• The second part of the proposed system is a fully connected NN (NN3) that takes as input a vector (size 8 × 1) created from the outputs of NN1 and NN2 and the patient's gender (g) and age (a). The topology of NN3 can also be seen in Figure 3. It is constructed from four fully connected layers with 512, 256, 128, and 64 neurons, respectively. The dropout rate in the dropout layers was the same as in the first part of the proposed system (10%). The output of NN3 is the final prediction of the fracture age from the assembled features. We also employed a decision uncertainty estimation algorithm, discussed in the following subsection.
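A minimal sketch of how the 8 × 1 input vector for NN3 could be assembled from the two three-class softmax outputs plus gender and scaled age. The element ordering and the age-scaling constant are our assumptions, not specified in the paper:

```python
import numpy as np

def build_fusion_vector(p_ap, p_lat, gender, age_years, max_age=18.0):
    """Concatenate NN1/NN2 class probabilities (3 values each), gender
    encoded as 0/1, and age scaled to [0, 1] into the 8-element NN3 input.
    max_age is an assumed scaling constant for a pediatric cohort."""
    return np.concatenate([p_ap, p_lat, [float(gender)], [age_years / max_age]])

# Illustrative values only (not taken from the paper).
v = build_fusion_vector([0.8, 0.15, 0.05], [0.7, 0.2, 0.1], gender=1, age_years=9.0)
```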

Uncertainty Estimation Algorithm
NNs are generally considered black boxes that lack explainability or any certainty measure over their decisions [34,35]. There are two types of uncertainty: epistemic and aleatoric. Epistemic uncertainty is the result of limited data and knowledge, while aleatoric uncertainty arises from the stochasticity of observations. Since we cannot influence the input data, we focus on estimating epistemic uncertainty. The Bayesian approach can be used to some extent to overcome the problem of epistemic uncertainty [36]. Namely, Gal and Ghahramani [37] have shown that dropout can be used as a Bayesian approximation of a model ensemble. In other words, they proved that a NN with dropout layers is mathematically equivalent to a Bayesian approximation of a Gaussian process [38]. With dropout, each subset of neurons acts as one of k = 250 new NNs that form an ensemble. Hence, each NN in the ensemble has its own prediction, denoted nn_i. The process of generating new NNs can be seen as Monte Carlo sampling from the original NN. As shown in Equation (1), to obtain the uncertainty prediction for each of the three classes (c_1, "0 weeks"; c_2, "1-3 weeks"; c_3, "3+ weeks"), we estimate the normal distribution N_c defined by the mean µ_c and standard deviation σ_c of the k = 250 sampled NNs. The µ_c and σ_c are given in Equations (2) and (3), respectively, and are calculated based on the NN1 output P_AP, the NN2 output P_LAT, the patient's gender g, and the patient's age a.
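The equation bodies are missing from this version of the text; based on the description above, Equations (1)-(3) can be reconstructed as follows. This is a hedged reconstruction, and the notation of the original may differ:

```latex
% nn_i^{(c)} denotes the softmax output for class c of the i-th of the
% k = 250 Monte-Carlo dropout samples, each evaluated on (P_AP, P_LAT, g, a).
N_c \sim \mathcal{N}\!\left(\mu_c, \sigma_c^2\right) \quad (1)
\qquad
\mu_c = \frac{1}{k} \sum_{i=1}^{k} nn_i^{(c)} \quad (2)
\qquad
\sigma_c = \sqrt{\frac{1}{k} \sum_{i=1}^{k} \left( nn_i^{(c)} - \mu_c \right)^2} \quad (3)
```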
This leads to two system outputs: the first is the prediction of the entire NN, while the second is the uncertainty output (as depicted in Figure 3). The uncertainty part of the NN is calculated only during prediction. To alert the user to potentially uncertain predictions, we measured the overlap region between the normal distribution curves of the two classes with the highest prediction scores. If the area of overlap between the two top-predicted classes exceeds 50%, the user receives a warning to verify the plausibility of the output. To obtain the overlapping area between two distributions, we used the algorithm proposed by Linacre [39]. The sampling number k = 250, representing the number of NNs used in uncertainty estimation, was chosen empirically based on several trials with different values; we found that using more than 250 samples yields similar results. The overlapping threshold value of 50%, on the other hand, was taken as a standard from related work (similar to the IoU threshold of 0.5 commonly used in object detection tasks).
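As a sketch of the warning criterion, the overlap between the two class distributions can be estimated numerically as the integral of the pointwise minimum of the two pdfs. This is a simple stand-in for the Linacre method cited above; the function name, grid resolution, and example means/stdevs are our own choices:

```python
import numpy as np

def normal_overlap(mu1, s1, mu2, s2, grid=100_000):
    """Numerically estimate the overlapping area (OVL) between two
    normal pdfs: integrate min(pdf1, pdf2) over a covering interval."""
    lo = min(mu1 - 6 * s1, mu2 - 6 * s2)
    hi = max(mu1 + 6 * s1, mu2 + 6 * s2)
    x = np.linspace(lo, hi, grid)
    dx = x[1] - x[0]

    def pdf(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    return float(np.minimum(pdf(x, mu1, s1), pdf(x, mu2, s2)).sum() * dx)

# Warn when the two top class distributions overlap in more than 50%
# of their area (illustrative values, not taken from the paper).
overlap = normal_overlap(0.6, 0.1, 0.4, 0.1)
warn = bool(overlap > 0.5)
```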

The Proposed System Training
Due to the relatively small number of data instances (3570), the proposed system was trained using 5-fold cross-validation [40]. We also tested 10-fold cross-validation, but it yielded nearly the same results as 5-fold cross-validation while being more time-consuming, which is why we ultimately used five folds. Therefore, each of the five disjoint splits of the available dataset contains 2856 cases for system training, leaving 714 cases for system testing. The proposed system comprises three NNs, where the outputs of the first two NNs are part of the input vector for the third NN. Hence, we first trained the first two NNs on the lateral and anteroposterior data. We took the necessary precautions to prevent data leakage: in each fold, all data belonging to the same patient case stayed on the same side of the train/test split. For instance, if a patient case was in the training set of one of the five folds, then the NN1 and NN2 models, as well as NN3, were all trained on that particular case. This way, the conclusions based on the NNs' results can be compared and discussed, because the train and test data of every fold had the same source: the same patient/case. Furthermore, we applied the following data augmentation methods to the training data to enhance its versatility and make the trained NNs more robust. For the NNs trained on images, we used random flipping, random rotation (in the range [−18°, 18°]), random brightness and contrast adjustments, and random cropping of the input image. For the third, fully connected NN (NN3), we did not use any augmentation; the only operation performed on its data was scaling the patients' age to the [0, 1] range. Data augmentation was not performed on the test sets.
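The leakage precaution can be sketched as a patient-grouped split; the implementation below is our own illustration (scikit-learn's `GroupKFold` offers equivalent functionality), not the authors' code:

```python
import random
from collections import defaultdict

def grouped_kfold(patient_ids, k=5, seed=0):
    """Split case indices into k folds so that all cases of one patient
    land in the same fold, preventing train/test leakage across folds."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        by_patient[pid].append(idx)
    patients = list(by_patient)
    random.Random(seed).shuffle(patients)  # reproducible patient order
    folds = [[] for _ in range(k)]
    for i, pid in enumerate(patients):
        folds[i % k].extend(by_patient[pid])
    return folds

# Toy patient list: p1 and p3 each have two cases (e.g., initial + follow-up).
folds = grouped_kfold(["p1", "p1", "p2", "p3", "p3", "p4"], k=2)
```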
All NNs had the same training hyperparameters (learning rate, batch size, and early stopping). We investigated a much larger training-hyperparameter space using grid search, but the conducted experiments resulted in the same training hyperparameters for all three NNs. Therefore, we used the Adam optimizer with learning rate α = 10^−4 (chosen from {10^−1, 10^−2, ..., 10^−6}) and batch size 32 (we also tested batch size 16, but it performed worse). We trained the NNs for 150 epochs with early stopping after 25 consecutive epochs with no improvement in the test loss. The loss function was a weighted cross-entropy tailored for this purpose and motivated by the work presented in [41]. Given the evident data skewness, we increased the weights of the rarer classes. We calculated the weight w_i of each class using Equation (4): w_i is the ratio of the maximum class sample count over the number of samples N_c_i of the class being observed. This procedure resulted in the function W(x) shown in Equation (5). The final weighted cross-entropy loss used for system training is given by Equation (6), where ŷ_i represents the NN3 output class, y_i is the correct class, and W(x) is the function described in Equation (5) that weights the loss.
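A minimal sketch of the class weighting, using the class counts reported above; the exact form of W(x) in Equations (5) and (6) is our hedged interpretation of the description:

```python
import numpy as np

def class_weights(counts):
    """Equation (4): w_i = max_j N_j / N_i, so rarer classes weigh more."""
    counts = np.asarray(counts, dtype=float)
    return counts.max() / counts

def weighted_cross_entropy(y_true, y_pred, weights, eps=1e-12):
    """Weighted cross-entropy over one-hot targets y_true and softmax
    outputs y_pred; each sample is scaled by the weight of its class."""
    w = np.asarray(weights)[np.argmax(y_true, axis=1)]
    return float(np.mean(-w * np.sum(y_true * np.log(y_pred + eps), axis=1)))

# Class counts from the paper: "0 weeks", "1-3 weeks", "3+ weeks".
w = class_weights([2418, 583, 569])
```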
Another approach to addressing data skewness is to generate synthetic data for the underrepresented classes. There are many ways to do so, from the SMOTE algorithm [42] to generative adversarial methods [43]. However, we find these inadequate for the considered problem of fracture age estimation from medical images. Namely, since estimating fracture age is a challenging task even for radiologists, estimating the age of artificial, synthetically generated fractures is even more complex and hard to verify. Furthermore, we would need to generate a consistent pair of images (LAT and AP projections), making data synthesis prone to producing unrealistic cases. Therefore, for this problem, we chose a method that does not alter the data.

Results and Discussion
In this section, we present and discuss the evaluation results of the proposed system and its components. In Appendix A, we provide McNemar's test results between the proposed system and its components in the form of five tables, one for each of the five folds [44]. Furthermore, particular attention is paid to the uncertainty estimation of the proposed system, as well as the interpretation of its results in different situations.

Proposed System Evaluation Results
To evaluate the proposed system and its components (NNs), we used standard classifier evaluation metrics: precision, recall, F1-score, and accuracy. Tables 1-3 provide the results for the proposed system and its components (NN1 and NN2), presented for each of the five folds. We also calculated the mean and standard deviation of each metric, which can be considered an overall score of the NN or system being evaluated. In Table 1 we show the results of the NN1 (AP) component of the system, in Table 2 the results of NN2 (LAT), while Table 3 contains the proposed system's performance results. By evaluating every component of the system separately, we obtained additional information about their importance and impact on the whole system. In line with our expectations, the proposed system performed best on every fold, with an overall F1-score of 0.878 ± 0.018 and accuracy of 0.878 ± 0.017. We can also state that the NN1 (AP) component is more accurate/informative than the NN2 (LAT) component (F1-scores of 0.858 ± 0.023 and 0.835 ± 0.012, respectively). Furthermore, the proposed system obtained ∼2% better results than its best component. The obtained results indicate that the fusion of all available information is necessary to improve overall system performance. To support these claims, we conducted McNemar's test of significance between the proposed system and its components. The results of McNemar's test (displayed in the tables of Appendix A) show a statistically significant difference between the proposed system (NN3) and its components at the p < 0.05 level for folds one and three. On the remaining three folds, there is no significant difference between NN3 and NN1, but there is a significant difference between NN3 and NN2. Therefore, although NN3 obtained the best results on all folds (and overall), NN1 could be sufficient to obtain reasonably good results on some folds.
However, it will not generalize as well as the proposed system. To illustrate the difficulty of the task being solved, we provide the confusion matrices of the proposed system for the five folds in Figure 5. It can easily be noticed that the proposed system mostly confuses the older fracture groups, which indicates the need for the next step: the evaluation of the proposed system's uncertainty. Since we are the first to tackle the issue of fracture age estimation, there is no other research to compare our results with.

Uncertainty Estimation Results
We measured the mean and standard deviation of all σ² values in every fold to estimate the proposed system's uncertainty. The measurements were done for each NN component and for the system as a whole. We also divided the measurements into two separate groups: the first contains the cases where the system or NN component correctly predicted the class, while the second contains the erroneous predictions. The expectation is that the evaluated subject will have smaller uncertainty on correctly predicted cases and higher uncertainty on wrongly predicted ones. By comparing the results in Table 4 (uncertainty on correct predictions) and Table 5 (uncertainty on wrong predictions), we arrive at the following observations:
• Both NNs and the proposed system have higher uncertainty on erroneous predictions than on correct ones. This is the desired behavior, because we want the system to be confident (have the lowest uncertainty) when it is correct and very uncertain when it makes an erroneous prediction. It is also worth noting that the biggest difference between overall uncertainties (mean ± stdev) is achieved by the proposed system (0.140 − 0.063 = 0.077), which indicates that, from this point of view, the system behaves better than its components;
• For correct predictions, the best result was obtained by NN1 (AP input), with an average uncertainty of 0.034 ± 0.014. The worst result was obtained by the proposed system (0.063 ± 0.015). In other words, the proposed system was more uncertain in its correct decisions than any of its components, although it obtained the highest F1-score and accuracy (Table 3). We believe this phenomenon is due to the larger number of inputs/information that the proposed system takes into account;
• For erroneous predictions, the proposed system obtained the best result (the highest average uncertainty, 0.140 ± 0.003).
Analogous to the case of correct predictions, we want the system to be as unsure as possible when making erroneous predictions. Next, we discuss the interpretability of the proposed system's predictions. In Figure 6, we depict six scenarios from the perspective of model uncertainty. On the x-axis, we plot the class probability, while the y-axis shows the decision certainty. In the scenario depicted in Figure 6a, the proposed system has "no doubts" about its decision: it predicted the class "0 weeks" with 92.6% probability and quite high certainty (the correct prediction is printed in boldface in the subfigure's legend). In subfigures (b) and (c), it can be seen that the proposed system gave correct predictions, but the uncertainty of its decisions is more considerable than in subfigure (a). Namely, as the overlap between the normal distributions representing uncertainty in the two classes with the highest probability increases, we can assume that the model could make an erroneous prediction. We set the threshold at which the system reports high uncertainty to 0.5, meaning that the top two distributions overlap in over 50% of their areas. By adjusting this threshold, we set the sensitivity of the proposed system. For instance, in Figure 6e, the proposed system made a correct prediction but reported high uncertainty about it because of its 92.9% overlap with another class. On the other hand, in subfigure (f), the proposed system made an incorrect prediction, but due to the reported uncertainty (an overlap of 70%), we cannot take its decision for granted. In both cases, we cannot be confident about the model's prediction, but we can rule out that the fracture is a fresh one (0 weeks old), due to the model's high certainty regarding the "0 weeks" class prediction. Therefore, the system still helped by eliminating one class.
Finally, we need to be aware that the proposed system is not perfect; it can still make mistakes and deliver completely wrong predictions with high certainty. This case is depicted in Figure 6d.
We also inspected the results of the proposed system with regard to its reported uncertainty. In this case, we considered a prediction correct if the system reported its uncertainty and one of the top two classes was the correct one. As can be seen in Table 6, this adjustment increased the proposed system's accuracy and F1-score by ∼2%. The accuracy and F1-score of the proposed system are now both 0.906 ± 0.011, which, given the skewness of the data and the complexity of the tackled problem, is a solid result. The improvement of the proposed system was also confirmed by McNemar's test, presented in Appendix A: the proposed system was significantly better (at the p < 0.01 level) on all five folds than its components (NN1 and NN2), as well as NN3 (the proposed system without taking the reported uncertainty into account).
Figure 6. Possible outputs of the proposed system concerning uncertainty. Subfigure (a) depicts a scenario with no uncertainty detected. Subfigures (b,c) depict two scenarios with an uncertainty overlap between the top two predictions, but this uncertainty is smaller than the set threshold of 0.5. Subfigure (d) depicts a scenario with no uncertainty but with an incorrect prediction. Subfigures (e,f) depict two scenarios with uncertainty detected between the top two predictions. However, in subfigure (e) the system would predict the correct class, while in subfigure (f) it would predict the incorrect one.
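The adjusted scoring rule described above can be sketched as follows; the function and variable names are our own, not taken from the paper:

```python
def uncertainty_adjusted_correct(pred_top2, true_label, uncertain):
    """A prediction counts as correct if the top-1 class matches, or if
    the system flagged uncertainty and the true class is among the top
    two predicted classes (the adjustment described in the text)."""
    if uncertain:
        return true_label in pred_top2
    return true_label == pred_top2[0]

hits = [
    uncertainty_adjusted_correct([0, 1], 0, False),  # plain correct top-1
    uncertainty_adjusted_correct([1, 2], 2, True),   # uncertain, true in top-2
    uncertainty_adjusted_correct([1, 2], 0, True),   # uncertain, true not in top-2
]
```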

Results Summary
We can summarize the results of our experiments, and our observations based on them, as follows:
• The fusion of the input data in the system increases model accuracy compared to any of its components individually;
• The amount of uncertainty of the proposed system is greater than that of its individual components;
• The uncertainty in the incorrect predictions of the proposed system is higher than the uncertainty in its correct predictions, which is the desirable system behavior;
• Uncertainty estimation can help with output interpretability and can enhance system usability, especially in cases where the data is limited and heavily skewed (as was the case with fracture age estimation). Including the uncertainty in the proposed system's decision increased its average accuracy to ∼90.6%, which can serve as a benchmark for similar research.
Therefore, the standard proposed by this research implies that all available data about the patient should be used to develop a good system for fracture age estimation. In our case, age, gender, and two projections were available, but the patient's anamnesis could also be of great help. The system must focus on each input separately, which is why we used three neural networks: a single neural network that accepts all the data simply would not converge in our case. Furthermore, utilizing the Monte-Carlo dropout method for uncertainty estimation not only increased the accuracy and F1-score but also improved the explainability and plausibility of the whole system. We strongly advise utilizing uncertainty estimation methods when developing computer-aided diagnosis systems intended for real-life practice, since they can only benefit the system and its users.

Conclusions and Future Work
For the first time, we tackled the issue of fracture age estimation by designing an AI system based on CNNs. Because of the skewness in our pediatric wrist radiography dataset, we employed uncertainty estimation as a tool to enhance the proposed system's accuracy and reliability. Thus, the proposed system becomes more white-box-like, providing its decision certainty, which gives the expert using the system a better foundation for their final decision.
In the future, we plan to enhance the proposed system by extracting the regions representing only the fractures and estimating the age of those regions alone (instead of the whole images, as is currently the case). This way, the proposed system would be more precise and explainable. We also plan to address fracture age estimation as a regression problem. Namely, we classified fracture ages into three classes representing clinically relevant intervals in weeks, but fracture age is inherently a continuous quantity. To solve this problem entirely, the system needs to be more resistant to the detected data skewness. To achieve this goal, we aim to expand the current dataset and inspect the effect of the convolutional filters of the neural networks utilized by the system on the final result. It could also be necessary to introduce some kind of memory into our system to account for rarer instances.
Funding: This work has been supported in part by the Croatian Science Foundation (grant number IP-2020-02-3770) and by the University of Rijeka, Croatia (grant numbers uniri-tehnic-18-17 and uniri-tehnic-18-15).

Informed Consent Statement:
The ethics committee of the Medical University of Graz (IRB00002556) approved the study protocol (No. EK 31-108 ex 18/19). Because of the retrospective data analysis, the committee waived the requirement for informed patient or legal representative consent. We performed all study-related methods in accordance with the Declaration of Helsinki and the relevant guidelines and regulations.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

Appendix A. McNemar's Test Evaluation of the Proposed System and Its Components
The following tables show the results of McNemar's test for the proposed system and its components. In the tables, NN1 denotes the EfficientNetB1 model estimating age from AP images, NN2 denotes the EfficientNetB1 model estimating age from LAT images, and NN3 denotes the system estimation, while NN + UNC stands for the system with uncertainty taken into account. Highlighted values in the tables represent statistical significance at the p < 0.05 level. The tables show, respectively, the results for each of the five folds.
Appendix B. Comparison of the Tested Model Architectures
The training hyperparameters for each tested model were the same for the LAT and AP data. It is necessary to mention that we utilized transfer learning for all tested models from ImageNet pre-trained weights [45], and the head of the tested models was always the same (the one depicted in Figure 3). To prevent overfitting, we used early stopping after 25 epochs with no improvement in the test loss, or when the number of trained epochs exceeded 150. In Tables A7 and A8 we present the average and standard deviation of the models' performance on the five test folds. EfficientNetB1 achieved the best scores on the LAT data in all metrics, with an F1-score of 0.835 ± 0.012 and accuracy of 0.834 ± 0.013. On the AP data, the EfficientNetB1 model obtained the best F1-score of 0.858 ± 0.023, while the VGG19 model obtained the best accuracy of 0.859 ± 0.014 (EfficientNetB1 was second best). However, since the F1-score is a more comprehensive metric (it accounts for both false positives and false negatives), we selected EfficientNetB1 as the best model and utilized it in our system. Furthermore, due to the difference in parameter count, EfficientNetB1's training duration and memory usage are considerably lower than those of the VGG19 model, which is another reason to use it.