Improving Patient Safety in the X-ray Inspection Process with EfficientNet-Based Medical Assistance System

Patient safety is a paramount concern in the medical field, and advancements in deep learning and Artificial Intelligence (AI) have opened up new possibilities for improving healthcare practices. While AI has shown promise in assisting doctors with early symptom detection from medical images, there is a critical need to prioritize patient safety by enhancing existing processes. To enhance patient safety, this study focuses on improving the medical operation process during X-ray examinations. In this study, we utilize EfficientNet for classifying the 49 categories of pre-X-ray images. To enhance the accuracy even further, we introduce two novel Neural Network architectures. The classification results are then compared with the doctor’s order to ensure consistency and minimize discrepancies. To evaluate the effectiveness of the proposed models, a comprehensive dataset comprising 49 different categories and over 12,000 training and testing sheets was collected from Taichung Veterans General Hospital. The research demonstrates a significant improvement in accuracy, surpassing a 4% enhancement compared to previous studies.


Introduction
According to the World Health Organization's (WHO) fact sheet on patient safety [1] and the latest TPR 2020 Annual Report [2], a significant number of patients are at risk of experiencing harm due to incorrect medical management [3]. Nevertheless, it is worth noting that close to 50% of these adverse events are preventable [4]. Misidentification of the patient's examination site is a major source of error events in hospital radiology departments, as shown in various examples such as those in Figures 1 and 2. For instance, a doctor may order a right (R't) WRIST anteroposterior (AP) image, but the radiographer may mistakenly take a R't WRIST lateral (LAT) image, or an order for a left (L't) FOOT AP image may be taken as a R't FOOT AP image, which potentially leads to incorrect diagnosis and treatment. Studies conducted by Sadigh et al. [5] at two large US academic hospitals showed that out of 1,717,713 examinations performed during their study period, 67 error reports were identified, with an estimated event rate of 4 per 100,000 examinations. Although the probability of such errors seems low, any error can have serious consequences, leading to delayed diagnosis and treatment for patients.
Patient safety is of utmost importance in medical services, and any measures to prevent such errors should be taken. For example, the outpatient X-ray room of Taichung Veterans General Hospital performs nearly 1000 X-ray images daily, and in such a high-pressure environment, human errors are unavoidable. Therefore, it is crucial to improve medical procedures to address this issue and ensure patient safety.
Since 2013, there has been a significant rise in research focused on machine learning within the health and life sciences domain. Among the most widely adopted applications is the support it provides to doctors in diagnosing and treating patients. Despite the continued reliance of most hospitals on traditional methods, such as PDCA (Plan-Do-Check-Act), for enhancing medical procedures, researchers have introduced groundbreaking solutions [6-10] that utilize image classification to assist in diagnostics. For instance, Gao [7] employed a deep learning approach to classify Fundus Fluorescein Angiography (FFA) images, determining the presence or absence of diabetic retinopathy. Gupta [8] utilized two models, MobileNetV2 and DarkNet19, for classifying patients as either having or not having COVID-19. In [9], Pan et al. employed a deep learning model to categorize fundus images into three groups: normal, macular degeneration and tessellated fundus. The iERM system proposed by Kai et al. [10] is a two-stage Deep Learning system that enhances the grading performance and interpretability of ERM by incorporating human segmentation of key features.
Image classification is a vital task in computer vision, where CNNs play a crucial role [11][12][13][14][15][16][17][18][19][20][21][22][23][24]. They excel at extracting features and enabling accurate classification. CNNs are widely used and are particularly valuable in medical diagnostics. Transfer learning, using pre-trained models, such as ImageNet, is a common technique to enhance performance on new tasks with smaller datasets. Feature Extraction and Fine Tuning are the two main approaches to transfer learning. For this study, the PyTorch Image Models (timm) library by Ross Wightman [25] is used for feature extraction and the fine-tuning of medical images.
The outpatient X-ray room at Taichung Veterans General Hospital handles an average monthly workload of approximately 30,000 X-ray images, which translates to almost 1000 X-ray images generated daily. This workload is managed by only 3-4 radiologists and 1 support personnel, leading to a high-pressure work environment where human errors are likely to occur. After analyzing the causes of errors, including subjective and objective factors such as physical discomfort, unfamiliar inspection sites, long working hours, and high pressure, it was determined that corrections and assistance in the medical process could reduce such errors.
Miao [6] introduced a solution to tackle the mentioned problem, with the goal of classifying 49 categories of X-ray images. These categories encompass 25 sites, 40 categories with direction, and 9 categories without direction. The best result obtained was a testing accuracy of 94.10%, trained with Xception [26]. However, Miao identified two issues that led to low accuracy: the feature gap between positive and lateral positions was too small, and left and right images were difficult to classify. To enhance the accuracy of Miao [6], we used EfficientNet as our CNN model. We also propose two enhanced model architectures, T40P3x4 and T40P2x4A2P2, and implement three optimization strategies to address the issues of low accuracy: using a more robust model, data purification, and data augmentation. By implementing these improvements, the overall system accuracy increased by more than 4.0%, reaching 98.16%. This significant improvement has not only enhanced the quality of medical services but also improved patient safety.

The main contributions of this paper include:
• According to TPR 2020 [2], it is evident that there is a higher likelihood of radiologists causing delays in patient diagnoses. However, the effective utilization of EfficientNet has significantly reduced human errors among radiologists. Compared to previous studies [6], our model demonstrates improved accuracy, and in addition to that, we offer F1-score, Recall, and Precision measurements;
• The 49-category classifier exhibits some misclassifications in certain body parts. To enhance the accuracy of these specific body parts, we propose the two-level classifier T40P3x4. This new approach promises to further improve the overall accuracy to 98%;
• Despite achieving 98% accuracy, the two-level architecture does encounter misclassifications in three body parts of RGB pictures. In order to address this limitation, we present a novel three-level architecture: T40P2x4A2P2. The latest methodology not only boosts the overall accuracy significantly but also efficiently classifies body parts that were previously misclassified using the two-level architecture, leading to a remarkable 98.16% accuracy.
The rest of the paper is organized as follows: In Section 2, we will introduce the datasets, materials and methods used in our study. Section 3 presents the results of our experiments. We will discuss the results, compare them with Miao [6] and talk about future work in Section 4. Finally, in Section 5, we draw our conclusions.

Overview
This study was conducted using the X-ray room located in the outpatient department of Taichung Veterans General Hospital as the primary source of data. The data collection period lasted for 8 months, from September 2021 to April 2022. The images obtained in this study were of the most frequently examined areas, which constituted 80% of the daily workload for radiologists. The 15 examined areas were divided into 40 categories with directional indicators and 9 categories with no directional indicators, resulting in a total of 49 categories. In Figures 3 and 4, the following abbreviations are used: AP and PA indicate anteroposterior and posteroanterior directions, LAT refers to a lateral direction, OBL refers to an oblique direction, STD indicates a standing position, L't represents left, R't represents right, C-spine represents cervical spine, T-spine represents thoracic spine and L-spine represents lumbar spine.

Purification
During the data collection process, it was observed that a small number of images for some sites were different from the majority of images due to special conditions. For instance, in Figure 5, the patient's position cannot be discerned from the image because of factors such as wearing a cast, a brace, or clothing covering the body. Additionally, in Figure 6, the appearance of the same irradiated site can vary significantly between children and adults, or due to certain actions that the patient may be required to perform by the physician. Both of these scenarios can affect the accuracy of the data set and confuse the model's recognition of categories during training. As a result, these special cases were removed in this study to avoid negatively impacting the model's training.
This strategy is also expected to enhance the model's ability to extract precise features for each site on the left and right, thereby improving the challenge of classifying left and right, as mentioned in Miao's paper [6]. Since the number of X-ray types taken by the radiology department varied daily, this study selected 100-300 images for each category after data purification, resulting in a total of 12,152 images. Table 1 shows the number of RGB pictures for each body category. In these 20 body categories, we have highlighted the 11 directional body categories. We further break down the directional body categories into 40 additional categories in addition to the original 9 non-directional body categories. Figure 7 illustrates the data distribution across all these 49 categories. We observe that FEMUR, LEG, and ELBOW have a relatively lower number of RGB pictures, which could potentially lead to misclassification by the classifier. To tackle this issue of limited dataset size, we will employ a data augmentation scheme, as detailed in Section 2.2.2.
Figure 6. Changes in the appearance of the irradiated site. The upper row is the most common way to irradiate that site, and the lower row is the special situation.
Table 1 (fragment). Number of RGB pictures for each body category:
… x 263 263
ANKLE 287 298 x 290 289 x x 1164
ELBOW 167 163 x 155 188 x x 673
FEMUR 125 114 x 144 109 x x 492
FOOT 278 x 285 226 x 239 x 1028
HAND 298 x 298 289 x 296 x 1181
KNEE 296 282 x 300 291 x x 1169
KNEESTD 286 283 x 299 283 x x 1151
LEG 149 134 x 124 105 x x

System Workflow
The system workflow, as illustrated in Figure 8, begins with the system capturing an RGB picture prior to the radiologist performing an X-ray. This picture serves as the input to the classifier introduced in this paper, which aims to identify the specific body part being imaged.
The classifier analyzes the image and produces its classification results, indicating the identified body part. These results are then compared with the doctor's order, which specifies the expected body part for the X-ray.
In the case where the classifier's results align with the doctor's order, indicating a correct identification of the body part, the system generates a notification to inform the radiologist that the process is complete. This notification serves as confirmation that the X-ray is ready to be taken.

However, if there is a mismatch between the classifier's results and the doctor's order, suggesting a potential incorrect identification of the body part, the system generates a warning notification. This notification is sent to alert the radiologist of the discrepancy, prompting further investigation and ensuring the correct body part is imaged before proceeding with the X-ray.
By implementing this workflow, the system enhances the accuracy and reliability of body part identification during X-ray procedures, providing valuable support to radiologists and reducing the risk of misdiagnosis or procedural errors.
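The matching step described above reduces to a simple comparison between the classifier's output and the doctor's order. A minimal sketch follows; the category strings and message texts are illustrative and not the hospital system's actual interface:

```python
def check_against_order(predicted_category: str, ordered_category: str) -> dict:
    """Compare the classifier's predicted body part with the doctor's order.

    Returns a notification for the radiographer: 'complete' when the
    prediction matches the order, 'warning' when they differ.
    """
    if predicted_category == ordered_category:
        return {"status": "complete",
                "message": "Predicted site matches the order; ready to take the X-ray."}
    return {"status": "warning",
            "message": (f"Mismatch: classifier saw '{predicted_category}' "
                        f"but the order is '{ordered_category}'. Please verify.")}

# Example: the order is R't WRIST AP, but the positioned site looks like R't WRIST LAT.
result = check_against_order("R't WRIST LAT", "R't WRIST AP")
```

In the mismatch case the warning would block the exposure until the radiographer re-verifies the site, which is the safety property the workflow is designed to provide.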

Data Augmentation
In order to overcome the issue of insufficient data diversity resulting from the scarcity of medical images, this study has implemented data augmentation in addition to transfer learning. Figure 9 demonstrates four general image augmentation techniques [27]. Given that the relationship between human body sites and their surrounding environment, such as light shades, hospital beds, and medical appliances, plays a crucial role in classification for our task, we have opted to use Flip and Rotation for data augmentation. These techniques are effective in preserving the relationship between different objects in the image, as opposed to Scale and Crop, which may only retain a portion of the image.
In summary, we have utilized rotation as a means of enhancing data generalization during the training process. Prior to importing images for model training, we randomly rotated them by plus or minus 30 degrees and used bicubic interpolation to complement the rotated image. This simulates the potential displacement of the patient's site during an X-ray and enables the model to better learn the nuanced features of each site. As a result, this approach helps to improve the issue of the small feature gap between the positive and lateral positions, as pointed out by Miao [6].

EfficientNet
Mingxing Tan et al. aimed to develop a model scaling method that could optimize both speed and accuracy. To achieve this, they re-examined several dimensions of model scaling proposed by their predecessors, including network depth, width, and image resolution. While previous studies had typically focused on enlarging one of these dimensions to improve accuracy, the authors recognized that these dimensions are mutually influential and proposed EfficientNet [28] through experiments. Specifically, they first formulated the problem definition to explore the relationship between network depth, width, and image resolution in achieving model accuracy. They assumed that the entire network is $N$ and that the $i$-th layer is expressed as:

$$Y_i = \mathcal{F}_i(X_i),$$

where $\mathcal{F}_i$ is the operator, $Y_i$ is the output tensor and $X_i$ is the input tensor. Let $N$ consist of $k$ convolutional layers; then it can be expressed as:

$$N = \mathcal{F}_k \odot \cdots \odot \mathcal{F}_2 \odot \mathcal{F}_1(X_1).$$

In practice, convolutional layers are usually divided into stages with the same architecture, so $N$ can be expressed as follows:

$$N = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right), \quad (1)$$

where $i$ is the stage index, $\mathcal{F}_i^{L_i}$ denotes the convolutional layer $\mathcal{F}_i$ repeated $L_i$ times in the $i$-th stage, and $\langle H_i, W_i, C_i \rangle$ is the shape of the input tensor of stage $i$.
To reduce the search space, the authors established certain constraints, including fixing the basic structure of the network, imposing equal scaling on all layers, and incorporating memory and computation constraints. As a result, the scaling of the network could only be optimized by scaling the baseline network, defined by $\hat{\mathcal{F}}_i, \hat{L}_i, \hat{H}_i, \hat{W}_i, \hat{C}_i$, with constant coefficients:

$$N(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right), \quad (2)$$

where $d$, $w$, $r$ are coefficients for scaling network depth, width and resolution.
After conducting experiments that involved adjusting only one dimension at a time, as well as adjusting all three dimensions simultaneously, the authors proposed a compound scaling method. This method uses a compound coefficient $\phi$ to uniformly scale the network width, depth, and resolution:

$$\text{depth: } d = \alpha^{\phi}, \quad \text{width: } w = \beta^{\phi}, \quad \text{resolution: } r = \gamma^{\phi}, \quad \text{s.t. } \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1, \quad (3)$$

where $\alpha$, $\beta$, $\gamma$ are constants that can be determined by a small grid search.
The authors considered that doubling network depth would double FLOPS, while doubling network width or resolution would quadruple FLOPS; that is, the FLOPS of a regular convolution operation is proportional to $d$, $w^2$, $r^2$. As a result, scaling a CNN with Equation (3) would increase the total FLOPS by $(\alpha \cdot \beta^2 \cdot \gamma^2)^{\phi}$. To keep the total FLOPS increase to approximately $2^{\phi}$, they constrained $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$.
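The compound-scaling arithmetic above can be checked numerically. The constants α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper's grid search:

```python
# Compound scaling: depth d = alpha^phi, width w = beta^phi, resolution r = gamma^phi.
# The EfficientNet grid search found alpha=1.2, beta=1.1, gamma=1.15, chosen so
# that alpha * beta^2 * gamma^2 ~= 2, i.e. FLOPS roughly double per unit of phi.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scaling_coefficients(phi: float):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

flops_factor = alpha * beta**2 * gamma**2   # FLOPS multiplier per unit phi
d, w, r = scaling_coefficients(2)           # e.g., a phi = 2 scaling step
```

Since `flops_factor` is about 1.92, each increment of φ costs roughly 2× FLOPS, which is the budget the B1-B7 family is built around.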
EfficientNet-B0 was generated based on MnasNet [29], and the authors used the compound scaling method to obtain EfficientNet-B1 to EfficientNet-B7. In this study, we chose to use EfficientNet-B3 for our experiments. This is because, compared to B0, B3 increased the accuracy rate by 3.5% with an increase of 6.7 M parameters. In contrast, B4 increased the number of parameters by 7 M compared to B3 but only improved the accuracy rate by 1.3%. Table 2 shows these results. Furthermore, EfficientNet-B3 was found to be a more robust model than Xception [26], which had the best results in Miao's [6] paper. This is because EfficientNet-B3 showed an improvement of more than 2% over Xception on the ImageNet dataset, as also shown in Table 2.

Experiment Setting
In this study, we employ EfficientNet-B3 to perform image classification on 49 categories of RGB images of the body sites that need to undergo an X-ray examination. We first introduce the dataset and training strategy, followed by the evaluation metrics we use.
Datasets: We collected a total of 12,152 images from the X-ray room of the outpatient department at Taichung Veterans General Hospital. We performed data purification to obtain 49 categories and split the data of each category into Training, Validation, and Testing sets at a ratio of 7:1:2. During training, we randomly rotated each image by plus or minus 30 degrees and resized it to 288 × 288. During validation and testing, we resized each image to 320 × 320.
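A per-category 7:1:2 split like the one described can be sketched with the standard library; the file names and seed below are illustrative:

```python
import random

def split_category(files, seed=0):
    """Split one category's image list into train/val/test at a 7:1:2 ratio."""
    files = list(files)
    random.Random(seed).shuffle(files)   # deterministic shuffle for reproducibility
    n = len(files)
    n_train = int(n * 0.7)
    n_val = int(n * 0.1)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# Example with a hypothetical 100-image category:
train, val, test = split_category([f"img_{i:03}.jpg" for i in range(100)])
```

Splitting per category (rather than over the pooled dataset) keeps the 7:1:2 ratio in every one of the 49 classes, which matters given the imbalance noted for FEMUR, LEG, and ELBOW.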
Training Strategy: The hyperparameters for this training strategy are as follows: a batch size of 8, Cross-Entropy loss function, Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 and momentum of 0.9. The pre-trained weights from the ImageNet training are used as the initial weights for transfer learning. The training process involves two steps: feature extraction and fine-tuning. In the feature extraction step, we modify the pre-trained model by replacing the fully connected layer to output predictions for the 49 classes in the new task. We then freeze all the layers in the pre-trained model except for the fully connected layer and train only this layer for 10 epochs. This step allows the model to learn to map the features extracted by the pre-trained layers to the new task. In the fine-tuning step, we unfreeze all the layers in the pre-trained model and train the entire model for 30 epochs. This step allows the model to adjust the pre-trained weights to better fit the new task. During this step, the entire model is updated using the SGD optimizer with the specified learning rate and momentum.
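The two-step schedule above might look as follows in PyTorch. Only the hyperparameters (Cross-Entropy, SGD with lr = 0.001 and momentum = 0.9, 10 + 30 epochs) come from the text; the function and model names are placeholders, and the tiny stand-in network at the end exists only so the sketch is runnable.

```python
import torch
import torch.nn as nn

def run_two_step_training(model, train_loader, head_epochs=10, full_epochs=30):
    """Feature extraction (head only) followed by fine-tuning (all layers)."""
    criterion = nn.CrossEntropyLoss()

    def train_for(epochs):
        # Build the optimizer over the currently trainable parameters only.
        params = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.SGD(params, lr=0.001, momentum=0.9)
        for _ in range(epochs):
            for images, labels in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

    # Step 1: freeze everything except the (replaced) classifier head.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.get_classifier().parameters():
        p.requires_grad = True
    train_for(head_epochs)

    # Step 2: unfreeze all layers and fine-tune end to end.
    for p in model.parameters():
        p.requires_grad = True
    train_for(full_epochs)

# Tiny stand-in network (timm models expose get_classifier() the same way);
# real training would use EfficientNet-B3 and the X-ray dataloader.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 8)
        self.head = nn.Linear(8, 3)

    def forward(self, x):
        return self.head(self.backbone(x))

    def get_classifier(self):
        return self.head

model = TinyModel()
loader = [(torch.randn(2, 4), torch.tensor([0, 1]))]
run_two_step_training(model, loader, head_epochs=1, full_epochs=1)
```

Rebuilding the optimizer at each step ensures it only tracks the parameters that are actually trainable in that phase.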

Evaluation Metrics:
The evaluation of the model includes both global and category-specific performance measures, along with a usability assessment. To evaluate the overall performance, we use the Accuracy metric, which measures the percentage of correctly classified examples in the dataset across all categories. For category-specific performance, the Confusion Matrix provides a visual representation of the model's performance in each category. Additionally, Precision, Recall, and F1-score are calculated for each category to offer a more detailed assessment of performance. These metrics are commonly used in multi-class classification tasks to evaluate the precision (accuracy of positive predictions), recall (sensitivity to true positive examples), and F1-score (harmonic mean of precision and recall) for each category, as in [37].
Figure 10 shows the training accuracy and training loss of EfficientNet-B3 on our 49-category classification task. The training accuracy and training loss serve as crucial indicators of the learning progress of the models. By monitoring these metrics during the training process, the programmer can gain insights into how well the model is learning from the data. A high training accuracy and low training loss typically signify that the model is effectively capturing patterns and generalizing well to the training data. In our experiment, after training the model for 30 epochs, we achieved an impressive final training accuracy (red line) of 99.89% and validation accuracy (green line) of 97.71% for the 49-category classification task. The final training loss was 1%, and the validation loss was 8%, indicating good generalization and effective error minimization during training.
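All of the per-category metrics can be derived directly from the confusion matrix. A minimal sketch follows; the 3-class matrix at the end is hypothetical, not the paper's actual results:

```python
def per_class_metrics(cm):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    cm[i][j] = number of examples whose true class is i and predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, actually other
        fn = sum(cm[k][j] for j in range(n)) - tp   # actually k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

def accuracy(cm):
    """Overall accuracy: correctly classified examples over all examples."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical 3-class confusion matrix for illustration:
cm = [[9, 1, 0],
      [0, 8, 2],
      [1, 0, 9]]
m = per_class_metrics(cm)
```

Low recall for a class (as observed for L't ELBOW LAT) shows up as a large off-diagonal mass in that class's row of the matrix.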

Analysis of Results
From Tables 3 and 4, it is evident that L't ELBOW LAT is the category with the lowest Recall and F1-score and also produces the most misclassifications (eight) among all categories, as shown in Figure 11. It is also observed that all of the categories with the lowest F1-scores come from three sites: ELBOW, FEMUR, and LEG. The FEMUR and LEG categories cover a large area, making it more challenging for the model to learn the complex relationship between the site and its surroundings and to classify the different directions accurately. The ELBOW images, on the other hand, do not capture the real differences between the categories in different directions, and those categories look very similar in appearance, as shown in Figure 12. To address these challenges, the proposed two-stage improvement approach aims to capture subtle feature differences between individual sites more accurately, particularly for the difficult-to-classify sites. The proposed hierarchical classification approach is expected to improve the accuracy by more accurately capturing the feature differences of these challenging sites.

Figure 12. Sites that are difficult to classify.

Two-Stage Improvement  T40P3x4
In the proposed two-stage improvement approach, T40P3x4, the first stage inv merging the 12 categories derived from the 3 sites with the lowest F1-score into 3 ca ries: ELBOW, FEMUR, and LEG, thereby reducing the total categories to 40. In the se stage, a 4-categories classifier is trained for each of the three merged categories to cla  To address these challenges, the proposed two-stage improvement approach aims to capture subtle feature differences between individual sites more accurately, particularly for the difficult-to-classify sites. The proposed hierarchical classification approach is expected to improve the accuracy by more accurately capturing the feature differences of these challenging sites.

Two-Stage Improvement
• T40P3x4 In the proposed two-stage improvement approach, T40P3x4, the first stage merges the 12 categories derived from the 3 sites with the lowest F1-scores into 3 categories, ELBOW, FEMUR, and LEG, thereby reducing the total number of categories to 40. In the second stage, a 4-category classifier is trained for each of the three merged categories to classify images back to their original categories.
The proposed architecture, T40P3x4, consists of two stages, with EfficientNet-B3 used as the model in each stage. Compared to the previous classifier, S49 (single-stage, 49 categories), this new architecture is expected to improve the accuracy of the model by capturing subtle feature differences between individual sites more accurately, as shown in Figure 13.

Figure 13. Architecture of T40P3x4. The words in red indicate the path the input image follows.

The training strategy for the first-stage model of T40P3x4 is similar to that of S49, with some modifications to the data set distribution. The S49 data set is used as the basis for modification; however, because multiple categories are merged in the current classification, directly aggregating the data sets of several categories into one new category could cause data imbalance. Therefore, for each of the three sites merged into one category, we capped the total number of images in the new category at 300 by sampling uniformly, on average, from the Training, Validation, and Testing sets of the four categories being merged.
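The uniform-sampling step can be sketched as follows. This is an illustrative reconstruction, not the authors' actual preprocessing code: each of the four subcategories merged into a new category contributes the same fixed number of images per split, and the file lists are hypothetical.

```python
# Illustrative sketch of the balanced merge: each subcategory contributes
# 53 train / 7 val / 15 test images (75 total), so a merged category of
# four subcategories holds 4 x 75 = 300 images.
import random

PER_SPLIT = {"train": 53, "val": 7, "test": 15}

def build_merged_split(subcategory_files, split, seed=0):
    """Uniformly sample PER_SPLIT[split] images from each subcategory."""
    rng = random.Random(seed)
    merged = []
    for files in subcategory_files.values():
        merged.extend(rng.sample(files, PER_SPLIT[split]))
    return merged

# Hypothetical file lists for the four ELBOW subcategories.
elbow = {c: [f"{c}_{i:03d}.png" for i in range(200)]
         for c in ["R_ELBOW_AP", "R_ELBOW_LAT", "L_ELBOW_AP", "L_ELBOW_LAT"]}

train = build_merged_split(elbow, "train")
print(len(train))   # 4 subcategories x 53 images = 212
```

Sampling the same count from every subcategory keeps the merged ELBOW, FEMUR, and LEG categories internally balanced, which is the stated motivation for not simply concatenating the original data sets.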
So, the subcategories of these 3 sites each contribute 53 training, 7 validation, and 15 testing images to their new category, totaling 75 images; please refer to Table 5 for more details. Figure 14 shows the training accuracy and training loss of the first-stage model of T40P3x4, and Table 6 shows the data distribution of the second-stage input images. The final training accuracy was 99.91%, the training loss was 1%, the validation accuracy was 98.94%, the validation loss was 5%, and the testing accuracy was 98.82%. As for the second-stage classifiers, all achieved a training accuracy above 98.5%, as shown in Table 7. Regarding testing accuracy, ELBOW achieved 94.12%, FEMUR achieved 94.95%, and LEG achieved 87.38%. Combining the classifiers of the two stages, we obtained an overall testing accuracy of 97.99% on the original 49 categories; the confusion matrix is shown in Figure 15. In the following section, we will compare the training results of T40P3x4 and S49.
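Inference under T40P3x4 is a simple routing rule: the stage-1 prediction is final unless it is one of the three merged sites, in which case a dedicated stage-2 classifier resolves the original category. A minimal sketch, assuming each `predict_*` callable wraps a trained EfficientNet-B3 and returns a label string (all names are illustrative, not the authors' actual API):

```python
# Sketch of T40P3x4 two-stage inference routing.
MERGED = {"ELBOW", "FEMUR", "LEG"}

def predict_t40p3x4(image, predict_stage1, predict_stage2):
    """Stage 1: 40-category classifier. If the prediction is a merged
    site, its stage-2 4-category classifier (side x view) refines it;
    otherwise the stage-1 label is final."""
    label = predict_stage1(image)
    if label in MERGED:
        label = predict_stage2[label](image)  # e.g. -> "L't ELBOW LAT"
    return label

# Toy stand-ins for the trained classifiers.
stage1 = lambda img: "ELBOW"
stage2 = {"ELBOW": lambda img: "L't ELBOW LAT",
          "FEMUR": lambda img: "R't FEMUR AP",
          "LEG":   lambda img: "R't LEG LAT"}

print(predict_t40p3x4("xray.png", stage1, stage2))   # L't ELBOW LAT
```

Only images routed to ELBOW, FEMUR, or LEG pay the cost of a second forward pass; the remaining 37 categories are resolved in a single stage.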

Three-Stage Improvement  T40P2x4A2P2
In Table 7, it is evident that T40P3x4 is not suitable for LEG classification. The confusion matrix in Figure 15 indicates a higher probability of misclassifying R't LEG AP, R't LEG LAT, and L't LEG AP. As a result, we propose an alternative architecture, T40P2x4A2P2, as shown in Figure 16. In this architecture, we have divided the LEG parts into two stages. We first classify the AP and LAT as LEG AP/LAT and then classify the R't and L't as LEG R't/L't.
For training, we used the same strategy as for T40P3x4. Figure 17 displays the training accuracy and training loss of the first-stage model of T40P2x4A2P2, and Table 8 shows the final training results. The training accuracy, training loss, and testing accuracy of the LEG AP/LAT classifier are 98.32%, 4%, and 98.08%, respectively, and those of the LEG R't/L't classifier are 97.60%, 4%, and 92.31%, respectively. The testing accuracies for ELBOW, FEMUR, and LEG are 94.12%, 94.95%, and 93.16%, respectively. When we replaced the original T40P3x4 LEG classifier with these two-stage LEG classifiers, we achieved an overall testing accuracy of 98.16% on the original 49 categories, as shown in the confusion matrix in Figure 18. In the following section, we compare the training results of these architectures with those of S49.
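The LEG path of T40P2x4A2P2 decomposes the 4-way decision into two binary ones whose outputs are recombined into the original category label. A minimal sketch, with the `classify_*` callables as illustrative stand-ins for the two trained EfficientNet-B3 LEG classifiers:

```python
# Sketch of the T40P2x4A2P2 LEG branch: one binary classifier decides
# the view (AP vs LAT), another decides the side (R't vs L't), and the
# two outputs are combined into the final LEG category.
def predict_leg(image, classify_view, classify_side):
    view = classify_view(image)   # "AP" or "LAT"
    side = classify_side(image)   # "R't" or "L't"
    return f"{side} LEG {view}"

# Toy stand-ins for the two trained LEG classifiers.
view_clf = lambda img: "AP"
side_clf = lambda img: "L't"

print(predict_leg("xray.png", view_clf, side_clf))   # L't LEG AP
```

Splitting the decision this way lets each classifier specialize in a single visual distinction (view geometry vs. laterality), which is consistent with the improved LEG testing accuracy reported above (93.16% vs. 87.38% for the single 4-category LEG classifier).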

Discussion
In this section, we discuss the results, compare them with those of S49, and explore possibilities for future work.
• Results Comparison Table 9 compares the F1-scores of the 12 categories across S49, T40P3x4, and T40P2x4A2P2. The lower F1-scores of T40P3x4 in the LEG categories can be attributed to the fact that the model's understanding of LEG features may have originated from features learned from other body parts; when a classifier is trained solely on LEG-derived categories, the model may therefore fail to achieve effective learning outcomes. Although T40P3x4 raises the overall accuracy from 97.59% to 98.00%, the performance on LEG does not improve and instead declines. T40P2x4A2P2, on the other hand, improves the F1-scores of the LEG-derived categories and, when using the same first-stage classifier as T40P3x4, raises the overall accuracy from 98.00% to 98.16%. Table 10 compares our three proposed model architectures with the four models used in [6]. The architecture employed in [6] directly classifies the 49 categories, making it identical to our S49 model; however, all three of our architectures achieved higher testing accuracy than the four models in [6], suggesting that the proposed models outperform them in accurately classifying the 49 categories.
• Future Work Table 9 highlights the F1-score improvements achieved by the three model architectures, while Table 10 reports their testing accuracy. To further improve both the F1-scores and the classification accuracy, it is worth employing diverse model architectures or even exploring alternative models such as DenseNet. As a future direction, we propose a thorough investigation of various model architectures to identify the one most suitable for improving the overall classification performance. This approach holds the potential to yield more accurate and dependable results, thereby enhancing the efficiency and effectiveness of the classification system.

Conclusions
Patient safety is crucial in medicine, and reducing delayed diagnoses is vital for achieving this goal. We developed a medical assistance system for X-ray inspections using deep learning techniques and data from Taichung Veterans General Hospital's radiology department. The system effectively minimizes errors and receives positive feedback from users. By implementing this system, we enhance medical processes and improve the quality of services while prioritizing patient safety. The reduction in delayed diagnoses prevents potential harm and fosters a safer medical environment.
The primary focus of this study was the multi-class classification task in X-ray inspections. To address it, we employed EfficientNet [26] for training and testing and introduced data purification and augmentation enhancements customized for the task. Additionally, we proposed two different architectures to further enhance the classification accuracy. By directly applying EfficientNet to the classifier, the system's accuracy increased significantly from 94.10% (as in [6]) to 97.59%. Subsequent experiments fine-tuned the system architecture with two-stage and three-stage classification approaches, yielding overall accuracies of 98.00% and 98.16%, respectively. To further improve classification accuracy, future research should explore new model architectures or different neural networks. These efforts hold the potential to elevate the accuracy even further, advancing the effectiveness of the classification system.