High-Performance Scaphoid Fracture Recognition via Effectiveness Assessment of Artificial Neural Networks

Abstract: Image recognition through the use of deep learning (DL) techniques has recently become a hot topic in many fields. Especially in bioimage informatics, DL-based image recognition has been successfully used in several applications, such as cancer and fracture detection. However, few previous studies have focused on detecting scaphoid fractures, and the related effectiveness is also not significant. Aimed at this issue, in this paper, we present a two-stage method for scaphoid fracture recognition by conducting an effectiveness analysis of numerous state-of-the-art artificial neural networks. In the first stage, the scaphoid bone is extracted from the radiograph using object detection techniques. Based on the extracted object, several convolutional neural networks (CNNs), with or without transfer learning, are utilized to recognize the segmented object. Finally, analytical details on a real data set are given in terms of various evaluation metrics, including sensitivity, specificity, precision, F1-score, area under the receiver operating characteristic curve (AUC), kappa, and accuracy. The experimental results reveal that the CNNs with transfer learning are more effective than those without. Moreover, DenseNet201 and ResNet101 are found to be more promising than the other methods, on average. According to the experimental results, DenseNet201 and ResNet101 can be recommended as suitable solutions for scaphoid fracture detection within a bioimage diagnostic system.


Introduction
To date, artificial intelligence (AI) has advanced technologies worldwide, successfully being applied in many fields such as industry, commerce, agriculture, smart life, medical informatics, and so on. In these fields, image recognition is widely used to simulate human vision (so-called computer vision). In traditional computer vision, an image can be described by its color, texture, and shape; thereby, the objects in an image can be recognized by machine learning methods. However, the effectiveness of such approaches is limited by incomplete feature analysis. To address such problems, deep learning (DL) has been proposed as a solution, which can be viewed as a set of neural networks based on human neural biology and which can recognize objects by conducting iterative feature filtering. In fact, DL has been shown to be effective for the diagnosis of cancer and fractures.
Although some researchers have made attempts to address the issue above, the prediction results are not satisfactory. Therefore, in this paper, we propose a two-stage scaphoid fracture recognizer based on deep learning techniques. Overall, the contributions of this study over those in the literature can be summarized as follows:
• To increase the precision, the proposed method identifies and segments the scaphoid bone first. Based on this segmentation, the recognition space is reduced to a specific area and the computational cost is also reduced.

• To discover the most effective recognizer, we provide a detailed analysis and conduct a set of experiments considering numerous existing well-known CNNs using data augmentation and transfer learning.
• Afterward, a comprehensive empirical study on a real data set is presented, and an insightful analysis is proposed, in order to make a near-optimal recommendation from the tested CNNs.
• Finally, the proposed method will be further materialized in the bioimage diagnostic system of the Kaohsiung Branch of Chang Gung Hospital, Taiwan.
Basically, this paper can be viewed as an evaluation study, and the major intent behind it is to enhance the existing techniques through the use of segmentation, data augmentation, and transfer learning. By conducting an effectiveness assessment, we can confidently recommend AI networks for the desired purpose. The remainder of this paper is laid out as follows. A review of related works is provided in Section 2. In Section 3, the proposed method for scaphoid fracture detection is presented in detail. The empirical study is described in Section 4, and our conclusions are given in Section 5.


Related Works
Artificial intelligence has recently shifted bioinformatics toward automation and greater effectiveness. Deep learning plays a critical role in this field, with applications in areas such as petroleum engineering [1], computer engineering [2], electrical engineering [3], biomedical engineering [4], software engineering [5], and energy engineering [6]. Although many diseases can be detected successfully by AI, an effective method for recognizing scaphoid fractures is not easy to achieve. This is the main aim of this paper. In this section, works in the literature related to AI-assisted bioinformatics are reviewed by category.

Bioimage Recognition and Deep Learning
To date, bioinformatics has played an important role in risk assessment [7,8], disease detection [9], and healthcare [10], drawing on sources such as bioimages [11], biochemical tests [12], diagnostic reports [13], and so on. Therefore, several AI techniques, including data mining and deep learning, have been proposed for this topic. Among them, artificial neural networks (ANNs) have been widely and effectively used with bioimages. Regarding the chest, Tang et al. [14] tested several convolutional neural networks (CNNs) for abnormality classification, including AlexNet, ResNet, Inception, and DenseNet, among others. Furthermore, CheXNet and VGG19 have been used to extract the features from X-ray images for further classification of pediatric pneumonia [15]. Considering lung CT (computed tomography) images, Francis et al. [16] and Javan et al. [17] took advantage of Gated-SCNN and VGG-XGBoost, respectively, to recognize pulmonary lesions. Besides the lungs and osteosarcoma, deep learning has also been applied to segment intracerebral hemorrhage in CT images [18][19][20]. For the liver, Xia et al. [15] combined traditional multiclassification cross-entropy loss functions to perform better segmentation. Moreover, Yang et al. [21] constructed FCN-DecNet for coarse segmentation by optimization weighting. Other cancer image recognition (e.g., breast) has been performed using a small SE-ResNet module [22].

Musculoskeletal Image Recognition and Deep Learning
Sato et al. [23] and Huang et al. [24] adopted CNNs to detect musculoskeletal abnormalities and to segment osteosarcoma tumors using radiograph and CT images, respectively. Rayan et al. [25] and England et al. [26] utilized CNNs to detect pediatric elbow fractures and effusions. Olczak et al. [27] used five networks to classify four properties of radiographs: fracture, laterality, body part, and exam view. Kim et al. [28] applied transfer learning from CNNs to enhance the prediction of fractures from radiographs. In addition to the above musculoskeletal image recognition, scaphoid fracture detection is the aim of the present paper. At present, scaphoid fractures can be detected manually by radiologists using radiographs and CT images. In terms of radiographs, Nazarova et al. [29] provided a study showing that different angling views can help radiologists to detect scaphoid fractures. In terms of CT images, cone-beam CT [30] improved the prediction effectiveness significantly for a radiologist, in contrast to radiographs. Apart from manual detection, Ozkaya et al. [31] compared the performances of automated recognition (a CNN), emergency department physicians, and orthopedic specialists using radiographs. The related experimental results showed that the experienced specialists performed best, while the performance of the CNN was close to that of the less-experienced specialists.

Comparative Study
Although numerous prior works have been devoted to musculoskeletal image recognition, there remains room for further research on scaphoid fracture detection. Here, a comparative study is conducted, as shown in Table 1, which summarizes the primary points of the previous studies, including the data used, detection by deep learning, number of CNNs, detection on radiographs, transfer learning, data augmentation, number of evaluation metrics, and segmentation, thus indicating the uniqueness of this paper.

Framework
As shown in Table 1, some past studies related to musculoskeletal images exist; however, few have provided a comprehensive assessment of existing CNNs for scaphoid fracture recognition. This issue motivated us to conduct a comparative evaluation of numerous CNNs, providing practical insights for radiologists. The framework of the proposed method is shown in Figure 2, including the offline training and online recognition phases.

• Offline Training: This is a foundational phase that constructs the recognition model through data collection, data preprocessing, and transfer learning. For data preprocessing, the scaphoid bones are extracted from the labeled images. Next, the extracted images are augmented. For transfer learning, the recognition model is constructed by transfer learning from a pretrained model.
• Online Recognition: This phase starts with the input of an unknown image. Then, the scaphoid bone is segmented from the unknown image. Based on the recognition model, the segmented object is classified by convolutional neural networks. In this paper, we test a set of CNNs to reveal their respective effectiveness.

Offline Training
In this section, the data collection, data segmentation, data augmentation, and learning model are discussed.

Data Collection
The experimental data were gathered from Kaohsiung Chang Gung Memorial Hospital, Taiwan, covering 154 adult patients (called the KCGMH data in Section 4); that is, this paper focuses on fractures in adult patients. Overall, the data contained en-face and en-profile images, and the en-face images were chosen as the filtered data first. In the filtered data, there were 178 clearly positive (fracture) instances and 308 negative (normal) instances, as identified by radiologists. Hence, 178 instances were selected as the examined data for each of the positive and negative sets; that is, there were 356 images in total. Finally, 70% of the positive and negative data were randomly selected as the training data, while the rest were taken as the testing data [31]. For the CNNs, 10% of the training data was split off as the validation data. Note that the reason for using these collected data, instead of data sets from previous studies, can be explained by the following concerns. First, medical data are not easy to collect, due to privacy concerns. Second, the data sets of the former methods are not public. Third, without permission from patients, it is not appropriate, from an ethical point of view, to experiment on collected data.
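The splitting protocol above (70% training and 30% testing per class, with 10% of the training data held out for validation) can be sketched as follows; the file names and the `split_data` helper are hypothetical illustrations, not part of the original pipeline:

```python
import random

def split_data(positives, negatives, train_ratio=0.7, val_ratio=0.1, seed=42):
    """Split positive/negative image lists per class: 70% training
    (10% of which becomes validation) and 30% testing."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in (("fracture", positives), ("normal", negatives)):
        items = items[:]          # do not mutate the caller's list
        rng.shuffle(items)
        n_train = int(len(items) * train_ratio)
        train, test = items[:n_train], items[n_train:]
        n_val = int(len(train) * val_ratio)
        val, train = train[:n_val], train[n_val:]
        splits["train"] += [(x, label) for x in train]
        splits["val"] += [(x, label) for x in val]
        splits["test"] += [(x, label) for x in test]
    return splits

# 178 fracture and 178 normal images, as in the KCGMH data
splits = split_data([f"pos_{i}.png" for i in range(178)],
                    [f"neg_{i}.png" for i in range(178)])
```

With 178 images per class, this yields 224 training, 24 validation, and 108 testing instances in total, matching the 356-image data set described above.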

Data Segmentation
Traditionally, a CNN uses a sliding window to filter the features over the whole image. However, in our context, the recognition quality is limited when using the whole image, because the scaphoid bone is only a subarea of the image. Considering this issue, in this paper, the scaphoid bone was extracted from the image before training and testing; that is, the scaphoid bone is regarded as an object, which is segmented to narrow the recognition space from the whole image to a subimage. The object detector used in this paper is YOLO-v4 (You Only Look Once, version 4) [32], an efficient and effective method extended from YOLO-v3 [33]. In the related literature, YOLO-v4 is composed of three main parts, namely a backbone, a neck, and a head, where the backbone is CSPDarknet53 [34], the neck contains SPP [35] and PANet [36], and the head is that of YOLO-v3. Note that, for the segmentation in this paper, the number of training instances was 408 and the accuracy was around 96%. Figure 3a,b show examples of segmentation results for right hands, while Figure 3c,d show those for left hands. All successfully segmented images were considered as the experimental data. In the segmented data, the maximum height and width were 353 and 264 pixels, respectively, while the minimums were 94 and 71 pixels, respectively.
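The cropping step that follows detection can be illustrated with a minimal sketch; the `crop_region` helper, the (x, y, w, h) box format, and the padding margin are assumptions standing in for the actual YOLO-v4 output handling, which is not reproduced here:

```python
def crop_region(image, box, pad=0.1):
    """Crop a row-major 2D pixel array around a detector bounding box
    (x, y, w, h), adding a small padding margin and clamping the
    result to the image bounds."""
    height, width = len(image), len(image[0])
    x, y, w, h = box
    dx, dy = int(w * pad), int(h * pad)
    left, top = max(0, x - dx), max(0, y - dy)
    right, bottom = min(width, x + w + dx), min(height, y + h + dy)
    return [row[left:right] for row in image[top:bottom]]

# e.g., a 300 x 400 radiograph with a detected scaphoid at (120, 150, 80, 100)
img = [[0] * 400 for _ in range(300)]
patch = crop_region(img, (120, 150, 80, 100))
```

The returned subimage is what the recognition stage operates on, narrowing the recognition space as described above.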


Data Augmentation
In real applications, it is not easy to collect sufficient data for better recognition quality. This incurs a problem called overfitting, which indicates that a gap exists between the training and prediction performances. To address this issue, data augmentation is widely adopted as a solution, which transforms the images through resizing, flipping, rotating, scaling, and presentation tuning operations, among others. For this concern, in this paper, the training data were augmented by flipping and rotation. Figure 4 shows examples of original, flipped, and rotated images. Finally, the training data size was enlarged to 1136.
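The two augmentations used here (flipping and rotation by ±15°) can be sketched in pure Python; `hflip`, `rotate`, and `augment` are illustrative helpers operating on 2D pixel arrays, and a real pipeline would more likely use an image library:

```python
import math

def hflip(image):
    """Horizontally flip a row-major 2D pixel array."""
    return [row[::-1] for row in image]

def rotate(image, degrees, fill=0):
    """Rotate a 2D pixel array about its center (nearest-neighbor
    sampling, constant fill outside the original frame)."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # inverse-map each output pixel back into the source image
            sx = cos_t * (c - cx) + sin_t * (r - cy) + cx
            sy = -sin_t * (c - cx) + cos_t * (r - cy) + cy
            si, sj = round(sy), round(sx)
            if 0 <= si < h and 0 <= sj < w:
                out[r][c] = image[si][sj]
    return out

def augment(image):
    """Produce the augmented copies used in this paper: a horizontal
    flip plus rotations of +15 and -15 degrees."""
    return [hflip(image), rotate(image, 15), rotate(image, -15)]

img = [[1, 2], [3, 4]]
augs = augment(img)  # flipped, +15 degree, and -15 degree copies
```

Applying these three transformations to each training image roughly quadruples the training set, consistent with the enlargement to 1136 images reported above.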



Convolutional Neural Networks
Advanced hardware technology has enabled the rapid growth of artificial neural network technology. They can simulate the human brain to recognize images, as is the case for CNNs. The basic idea is that a CNN performs feature filtering by convolutional computation and pooling. Then, through flattening, full connection, and softmax activation, the prediction probabilities for each class are output. Figure 5 depicts the framework containing convolution sets, maximum pooling sets, flattening, a fully connected layer, and a softmax activation function. Figure 6 illustrates the convolution, which indicates that a convolutional map is generated by sliding dot product computations between an image and a filtering map, while the maximum pooling indicates that a map is generated by maximizing the region features from the convolutional map. In this paper, we compare five NNs: VGG [37], ResNet [38], DenseNet [39], InceptionNet [40], and EfficientNet [41].
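The sliding dot-product convolution and region-maximizing pooling described above (Figure 6) can be sketched in a few lines of plain Python; `conv2d` and `maxpool` are toy illustrations, not the framework implementations used in the experiments:

```python
def conv2d(image, kernel):
    """Valid convolution: slide the kernel over the image and take a
    dot product at each position (the operation sketched in Figure 6)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def maxpool(fmap, size=2):
    """Max pooling: keep the maximum of each size x size region."""
    return [[max(fmap[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

img = [[1, 2, 0, 1],
       [0, 1, 3, 1],
       [2, 1, 0, 0],
       [1, 0, 1, 2]]
edge = [[1, 0], [0, -1]]   # a toy 2x2 filter
fmap = conv2d(img, edge)   # 3x3 feature map
pooled = maxpool(fmap)     # 2x2 pooling of the 3x3 map
```

In a full CNN, stacks of such convolution and pooling layers are followed by flattening, a fully connected layer, and softmax activation to produce per-class probabilities, as in Figure 5.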


Table 2 provides the architecture details of the considered CNN models, including the numbers of convolutions and pools, activation functions, and optimization functions. Based on these architectures, our primary intent was to approximate the nearly optimal settings for the different CNN models, as depicted in Table 3. The detailed analysis for determining the best settings is given in Section 4.

Transfer Learning
In addition to data augmentation, transfer learning has been proposed to increase the recognition effectiveness when facing insufficient data. The basic notion is to reuse a pre-trained model to enhance the prediction ability.
In this paper, two paradigms of transfer learning are utilized, namely layer transfer and fine-tuning. For layer transfer, as shown in Figure 7, the network is initialized with the pretrained model. Next, feature filtering is skipped, and only the prediction layer is trained on the target images. For fine-tuning, as shown in Figure 8, the whole network is initialized with the pretrained model. Then, the feature filtering and prediction layers both continue training on the target images; therefore, the network is tuned based on both the pretrained model and the target images. In this paper, fine-tuning-based transfer learning was performed for the VGG networks, while layer transfer was used for the other networks.
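The two paradigms can be illustrated schematically; the toy `build_transfer_model` helper below is a framework-agnostic sketch (in Keras, layer transfer corresponds to setting `trainable = False` on the pretrained base layers before compiling):

```python
def build_transfer_model(pretrained_layers, mode="layer_transfer"):
    """Toy illustration of the two transfer-learning paradigms:
    'layer_transfer' freezes the pretrained feature-filtering layers
    and trains only the new prediction head, while 'fine_tuning'
    keeps every layer trainable but starts from pretrained weights."""
    model = [dict(layer, trainable=(mode == "fine_tuning"))
             for layer in pretrained_layers]
    # a fresh prediction head is trained on the target images either way
    model.append({"name": "softmax_head", "trainable": True})
    return model

pretrained = [{"name": f"conv_block_{i}"} for i in range(1, 5)]
layer_transfer = build_transfer_model(pretrained, "layer_transfer")
fine_tuning = build_transfer_model(pretrained, "fine_tuning")
```

Under layer transfer, only the head is updated during training; under fine-tuning, every layer is updated, at a higher computational cost.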

Online Recognition
This phase works with an unknown image as input. First, the scaphoid bone is anchored and segmented from the unknown image. Next, the segmented image is resized. Finally, as shown in Figures 5-8, the segmented object is classified by the transfer-learned CNN.
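The online phase can be summarized as a short pipeline sketch; the `detect`, `classify`, and `resize` callables are hypothetical stand-ins for the trained YOLO-v4 detector, an image resizer, and the transfer-learned CNN:

```python
def recognize(image, detect, resize, classify):
    """Online recognition sketch: detect and crop the scaphoid from a
    row-major 2D pixel array, resize the patch, then classify it."""
    x, y, w, h = detect(image)                     # bounding box (x, y, w, h)
    patch = [row[x:x + w] for row in image[y:y + h]]
    return classify(resize(patch))                 # per-class probabilities

# toy stand-ins to show the data flow only
img = [[0] * 10 for _ in range(10)]
probs = recognize(img,
                  detect=lambda im: (2, 2, 4, 4),
                  resize=lambda p: p,
                  classify=lambda p: {"fracture": 0.2, "normal": 0.8})
```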


Empirical Study
To determine the recognition quality for scaphoid fractures, we conducted a detailed empirical study. In the following, it is presented in two main aspects, overall comparison and detailed analysis, for the compared deep learning techniques based on CNNs.

Experimental Settings
In the experiments, the adopted CNNs included VGG16, VGG19, ResNet50 (RN50), ResNet101 (RN101), ResNet152 (RN152), DenseNet121 (DN121), DenseNet169 (DN169), DenseNet201 (DN201), Inception-V3 (INv3), and EfficientNetB0 (ENB0). All adopted CNNs were implemented using Keras and TensorFlow 2.3, and the whole evaluation was conducted on a personal computer with 16 GB RAM, an Intel 8-core i7-10700 processor, and an Nvidia GeForce RTX 3070 GPU with 8 GB RAM. These tools were selected because they are popular, free, and easy to obtain. The recognition quality was measured through seven metrics: sensitivity, specificity, precision, accuracy, F1-score, AUC, and kappa, based on a confusion matrix (shown in Table 4). In this matrix, there are four outcomes: true positive (TP), false positive (FP), false negative (FN), and true negative (TN), where positive (P) and negative (N) indicate the numbers of actual positive and negative cases, respectively, in the data. Further, TP and TN indicate the number of hits and the number of correct rejections, respectively, while FP and FN indicate the number of false alarms and the number of misses, respectively. The definitions of the sensitivity, specificity, precision, accuracy, F1-score, and kappa metrics are as follows: sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), precision = TP/(TP + FP), accuracy = (TP + TN)/(P + N), F1-score = (2 × precision × sensitivity)/(precision + sensitivity), and kappa = (p_o − p_e)/(1 − p_e), where p_o is the observed agreement (i.e., the accuracy) and p_e is the expected chance agreement. AUC denotes the area under the receiver operating characteristic curve. In the proceeding experiments, the results are shown in terms of accuracy charts and AUC curves, where the charts and curves were generated using Microsoft Excel and Python (Matplotlib Pyplot), respectively. The major intent of using accuracy as the basic measure was to observe the overall correct rates, covering true positives and true negatives simultaneously, as is widely done in the field of machine learning [42,43]. Based on the accuracy, the other six metrics were further chosen by referring to the references [23][24][25][26][27][28][29][30].
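The metric definitions above can be computed directly from the four confusion-matrix counts; the `metrics` helper below is an illustrative sketch (AUC is excluded, as it requires ranked prediction scores rather than a single matrix), and the example counts are hypothetical:

```python
def metrics(tp, fp, fn, tn):
    """Compute sensitivity, specificity, precision, accuracy,
    F1-score, and Cohen's kappa from confusion-matrix counts."""
    p, n = tp + fn, tn + fp                    # actual positives / negatives
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (p + n)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # kappa: observed agreement (accuracy) corrected for chance agreement
    p_o = accuracy
    p_e = ((tp + fp) * p + (fn + tn) * n) / (p + n) ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f1": f1, "kappa": kappa}

# hypothetical results on a 108-image test set (54 per class)
m = metrics(tp=45, fp=5, fn=9, tn=49)
```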

Comparison of the Compared CNNs without Transfer Learning
In the first evaluation, we examined the compared CNNs without transfer learning. In this evaluation, the training data were augmented, the batch size was 32, and the images were resized to 196 × 196. Table 5 provides the effectiveness of the compared CNNs without transfer learning for the different metrics, delivering a set of observations. First, although the specificity and precision of VGG16 were high, the values of the other metrics were low. This indicates that the correct rate on the negative ground truths was high, and most positive prediction results were correct. In contrast, the sensitivity of VGG19 was the best, but its specificity and kappa were low; that is, the VGG nets were not robust enough. Second, in terms of accuracy, the three DenseNets were better than the others. This indicates that, for both positive and negative predictions, the correct rates of the DenseNets were high. Third, overall, DN201 obtained the best F1-score, kappa, and accuracy. This reveals that DN201 can be recommended in situations without transfer learning. In summary, the performances of all CNNs without transfer learning were not satisfactory. The potential reason for this is the overfitting problem caused by insufficient data. This motivated us to fuse transfer learning into the prediction model, as mentioned above.

Performance Analysis in Different Settings for the Compared CNNs with Transfer Learning
As the evaluation results without transfer learning were unsatisfactory, from the viewpoint of all metrics, the further models with transfer learning were examined. Before the further examinations, the settings for the further models with transfer learning had to be set. For this purpose, in this section, the performance analysis under different settings is demonstrated, through several evaluations.

Impact of Data Augmentation
The first issue to assess, in terms of the problem of insufficient data, is the impact of data augmentation. Figure 9 shows the effectiveness of the different models with different data augmentations. In these results, different data augmentation types had different impacts on each model. On average, the performance ranking of the three augmentation types was rotation 15°, rotation −15°, and flipping. Overall, for any of the individual augmentations, the performance was better than that without augmentation (denoted as "original"). Besides the individual augmentations, a further comparison between the models with and without augmentation was made. Figure 10 reveals the results in a manner similar to that of Figure 9. Four observations on the results are given here. First, the accuracies of the VGG nets and EfficientNet did not improve significantly. Second, the best improvement was for RN101, which reached an improvement rate of approximately 73%. Third, a potential interpretation of the difference in the effectiveness of augmentation across models is that the model architectures differed in their sensitivity to data augmentation, considering the pre-trained model; in detail, the residual idea works well for the ResNets and DenseNets. This is also the main contribution to discerning the differences among the AI nets. Fourth, on the whole, most models were enhanced through data augmentation. Note that this evaluation was carried out based on the settings presented in Table 3.

Impact of Batch Size
The next concern to clarify is the impact of different batch sizes. For this, we conducted a further evaluation of all models using different batch sizes. Table 6 shows the experimental results for the different batch sizes, which can be interpreted from two aspects. First, different models performed best with different batch sizes. Some differences between the best and worst results are obvious, while others are not. For example, the accuracy difference was the biggest for RN152 and the smallest for INv3; that is, the impact of batch size on RN152 was significant, reaching 0.292, while that for INv3 was not clear. Overall, the average difference over all models was 0.114. Second, the accuracy of INv3 reached 0.861 when using a batch size of only 8. This indicates that INv3 does not require a large batch size, and its associated cost, to perform well. Additionally, the batch size ranking, in terms of average accuracies, was 8, 24, 16, and 32. However, the best batch size differed for each model. Note that this evaluation was made using images resized to 196 × 196.


Impact of Resizing
The final setting for the overall comparisons was the resize parameter. This parameter indicates the size to which all images are normalized, and it was needed because the input image sizes were not equal. Table 7 depicts the evaluation results for the different resizes. From the results, we can see that the differences between the best and worst results for each model were not large; that is, the impact of image size is not obvious, as the maximum difference was just 0.139. Although the best performance occurred when resizing to 196 × 196, the best setting for each model differed. Note that this evaluation was made using a batch size of 32.

Overall Comparisons for the Compared CNNs with Transfer Learning
The above experiments determined the best settings for data augmentation, batch size, and resizing. Based on these settings, this subsection details a comparative study of all models with transfer learning. Table 8 shows their effectiveness under the different metrics, which can be read from several viewpoints. First, the ranking of the five network families was DenseNet, InceptionNet, EfficientNet, ResNet, and VGG, with average accuracies of 0.889, 0.889, 0.861, 0.852, and 0.806, respectively. Second, by accuracy, the top three individual models were DN201, DN169, and RN101. Third, the best result for every metric consistently occurred in one of two networks, DN201 or RN101, making them the two most reliable models. Fourth, averaged over all metrics, the top three individual models were DN201, RN101, and INv3, with mean scores of 0.886, 0.882, and 0.879, respectively, echoing the third point that DN201 and RN101 performed in a balanced way across all metrics. Note that * indicates the best performance for each metric (attribute).
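The metrics in Table 8 all derive from the binary confusion-matrix counts (TP, TN, FP, FN). The sketch below computes them with the standard definitions; the counts used are illustrative, not the paper's.

```python
def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion counts."""
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)                 # sensitivity (recall, TP rate)
    spec = tn / (tn + fp)                 # specificity (TN rate)
    prec = tp / (tp + fp)                 # precision
    f1 = 2 * prec * sens / (prec + sens)  # F1-score
    acc = (tp + tn) / n                   # accuracy
    p_o = acc                             # observed agreement
    # chance agreement: predicted/actual marginals for each class
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)       # Cohen's kappa
    return {"sensitivity": sens, "specificity": spec, "precision": prec,
            "f1": f1, "accuracy": acc, "kappa": kappa}

m = metrics(tp=40, tn=45, fp=5, fn=10)    # illustrative counts
```

AUC, the remaining metric of the seven, is computed from the full ROC curve rather than a single confusion matrix, so it is not shown here.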

Experimental Discussion
To assess the recognition quality of the well-known CNNs, a comprehensive evaluation was conducted in this paper, as detailed above. In the following, an overall empirical discussion is given for further analysis.

• Medical data are not easy to collect in real applications. To make the experiment more reliable, the experimental data were gathered from Kaohsiung Chang Gung Memorial Hospital, instead of being crawled from the Web. Due to the resulting shortage of data, overfitting occurred and heavily impacted the recognition quality. To cope with this problem, two operations were adopted: data augmentation and transfer learning. An issue to clarify here is the comparison among learning from the original data, learning with data augmentation (DA), learning with transfer learning (TL), and learning with the fusion of both (TL+DA). This comparison is shown in Figure 11, which summarizes Tables 5 and 8 and Figure 10: for every model, the best accuracy cannot be achieved without fusing transfer learning and data augmentation.
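The data-augmentation half of this fusion expands each segmented scaphoid patch with geometric operations. The sketch below shows two representative operations on a patch stored as a list of rows; the exact operations used in the paper are not specified here, so horizontal flip and 90-degree rotation are assumptions for illustration.

```python
def hflip(img):
    """Horizontal flip: reverse each row."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate 90 degrees clockwise: transpose, then reverse each row."""
    return [list(col)[::-1] for col in zip(*img)]

def augment(img):
    """One original patch -> four training samples."""
    return [img, hflip(img), rot90(img), hflip(rot90(img))]

patch = [[1, 2],
         [3, 4]]
samples = augment(patch)
```

Multiplying each patch into several geometric variants in this way directly addresses the small-data/overfitting problem the bullet describes.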

• In addition to the four observations mentioned in Section 4.4, the models can also be compared in terms of the AUC. Figure 12 shows the AUC comparisons of the models with accuracies larger than 0.9, namely RN50, RN101, DN121, DN201, INv3, and ENB0. The plot can be divided into two regions at the line where the false positive rate (FPR) equals 0.1. Below FPR 0.1, DN121 is better than the others; above FPR 0.1, by contrast, RN101 and DN201 are close to each other and higher than the remaining models. Overall, RN101, DN201, and INv3 are the most promising models in terms of the AUC.

• Besides the AUC, a further comparison showing each model's ability to discriminate among TP, TN, FP, and FN is the hybrid validation, defined as the average of the F1-score, AUC, and kappa metrics. Figure 13 depicts the validation result: the top three models were DN201, DN169, and RN101, although their values were close. This result is consistent with those given above. Considering the overall performances in Table 8 as well, the best two models for recognizing scaphoid fractures are DN201 and RN101.
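The hybrid validation score just defined is simply the unweighted mean of three metrics, so it can be sketched in a few lines. The per-model scores below are illustrative placeholders, not the paper's measured values.

```python
def hybrid_validation(f1, auc, kappa):
    """Hybrid validation: plain average of F1-score, AUC, and kappa."""
    return (f1 + auc + kappa) / 3

# Illustrative (f1, auc, kappa) triples per model
scores = {
    "DN201": hybrid_validation(0.90, 0.95, 0.80),
    "RN101": hybrid_validation(0.89, 0.94, 0.79),
    "VGG16": hybrid_validation(0.80, 0.88, 0.62),
}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Averaging three metrics with different sensitivities (F1 to the positive class, kappa to chance agreement, AUC to ranking quality) is what gives the score its "interdiscrimination" character.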

• In addition to effectiveness, efficiency is another important issue. Figure 14 shows the execution times of the compared models, revealing that DenseNet, InceptionNet, and EfficientNet had higher costs than the other models. Integrating the results of Figures 13 and 14 yields three further viewpoints. First, in terms of effectiveness, DN201, DN169, and RN101 are the three candidate models. Second, in terms of efficiency, VGG16, RN50, and RN152 are the three candidates. Third, from a balanced viewpoint, the top three models are VGG16, RN50, and RN152, because their trade-off between test performance and training cost is better than that of the other models. Nevertheless, DN201 and RN101 remain the recommended models, for three reasons. First, training is not performed frequently. Second, testing quality matters more than training cost in the field of bioinformatics. Third, in the experiments, recognition by every model could be carried out within 1 s. Hence, testing quality was our major consideration in determining the recommendations.

• The above experiments detailed the evaluation results on the KCGMH data. To make the experiments more robust, we further verified the above models on another data set, RSNA [44], which was proposed for a challenge to predict bone age from pediatric hand radiographs. The intent of this verification is to investigate how the recognition models vary across different data sources. The RSNA data set contains bones of different ages, and 538 images were selected as testing data. Because all bones in this data set are normal, specificity (the TN rate) is the target measure. Figure 15 shows the specificity comparisons between the RSNA and KCGMH data for the recognition models. The specificity differences are small (within 5%), even though the data sets were generated by different radiograph-capturing devices; that is, the constructed models are stable in detecting scaphoid fractures.
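Because every RSNA image is fracture-free, specificity on that set reduces to the fraction of images the model leaves labeled as normal, which the sketch below makes explicit. The prediction counts are illustrative, not the paper's results.

```python
def specificity_all_normal(predictions):
    """Specificity on an all-normal test set.
    predictions: list of model outputs, 0 = normal, 1 = fracture."""
    tn = predictions.count(0)   # correctly kept as normal
    fp = predictions.count(1)   # falsely flagged as fracture
    return tn / (tn + fp)

# Illustrative run: 538 test images, 26 false alarms
preds = [0] * 512 + [1] * 26
spec = specificity_all_normal(preds)
```

Comparing this number against a model's KCGMH specificity (and checking the gap stays within a few percent) is the stability check the bullet describes.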

• In summary, the above experimental results provide evidence that DenseNet and ResNet perform better than the other three networks. From these two networks, DN201 and RN101 are further selected as the recommended recognizers, because their overall performances are the best.

Research Limitations
In this paper, we studied numerous CNN models as determinants of scaphoid fracture recognition. However, several limitations of this study need to be declared. First, the recognition models were constructed from a single data source. Second, the data were limited to adult patients, as this paper is only a beginning for scaphoid fracture recognition; more data types will be tested in the future. Third, the data size is not large, because real data are not easy to gather. Fourth, the data came only from patients at Kaohsiung Chang Gung Memorial Hospital, Taiwan. Fifth, all images were captured by the same model of radiograph-capturing device. Sixth, the resolution qualities are similar, because the images were generated by the same devices. Further suggestions for future research are listed in the following section.

Conclusions and Future Works
Over the past few years, deep learning methods have been successfully used in the field of bioimage recognition, yet few studies have focused on the detection of scaphoid fractures. In this paper, the related works were reviewed and compared comprehensively, and the lack of related work inspired us to propose a two-stage method based on state-of-the-art CNNs. In the first stage, the scaphoid bone is segmented from the target image. Next, the segmented patch is augmented by geometric operations, and the recognition model is constructed via transfer learning from pretrained CNNs. Finally, a set of CNNs was evaluated in terms of seven metrics, and a detailed analysis was provided. The main contributions of our work are as follows.

• One of our major intentions was to narrow the recognition focus to the scaphoid bone instead of the whole image. For this, the scaphoid bone was treated as an object to be extracted from the target image, which reduces the prediction cost and significantly increases the effectiveness.

• Based on the segmentation, several CNNs were employed to solve the problem. Data augmentation and transfer learning were fused to enhance the prediction quality, and the experimental results revealed that this fusion could tackle the overfitting problem.
• For insight into the recognition results of the models used, numerous detailed evaluations were conducted, yielding insightful and comprehensive perspectives. Based on the empirical study, the best two models could be recommended, namely DenseNet201 and ResNet101.
Although we presented a set of effective enhancements for the detection of scaphoid fractures, the following issues remain to be addressed in the future:
• First, with more data, the effectiveness can be increased; however, scaphoid fracture data are not easy to collect. For this purpose, we will attempt to gather more data from the other branches of Chang Gung Memorial Hospital in Taiwan, including pediatric patients; that is, pediatric images will be added to broaden the coverage.
• Second, in this paper, the adopted models operated individually. In the future, an ensemble model will be constructed by optimizing their combinations, and a progressive learning model will also be tested to increase the prediction quality.

• Third, in addition to radiographs, CT images will be used as test data, utilizing the ideas proposed in this paper.
Finally, several suggestions for future studies with a similar focus are listed:
• Other bioimage recognition tasks have also been studied by the authors' team, such as the detection of liver and lung tumors. Regardless of the method, segmentation is useful for decreasing the computational cost and increasing the accuracy.

• In real applications, a huge imbalance between positives and negatives exists, especially between cancer and noncancer cases. Consequently, data augmentation and transfer learning are necessary.
• For fracture recognition, although the experimental results indicated that most scaphoid fractures can be recognized, other symptoms, such as hydrops, joint effusion, nonunion, delayed union, avascular necrosis, and arthritis, also need attention, as automated recognition of these symptoms can help doctors determine the required treatments.

Institutional Review Board Statement: The data were approved by Kaohsiung Chang Gung Memorial Hospital, Taiwan, and all operations in this paper were executed according to the ethical standards of the Institutional Review Board, Taiwan.