4.3. The Standalone MLFNet Performance Assessment
The main objective of the two experiments was to identify BFs, with the intention of enhancing patient outcomes, streamlining the diagnostic process, and reducing both patient time and costs. In these experiments, 87% of the BFMRX dataset (9246 images) was used for training, 5% (506 images) was set aside for testing, and 8% (828 images) was allocated for validation. The BFMRX dataset has two classes: fractured and non-fractured. Both experiments relied on transfer learning for six DL models—MLFNet, DenseNet-169, EfficientNet-B3, Inception-V3, MobileNet-V2, and ResNet-101. During the initial transfer learning phase, we conducted supervised pre-training on the ImageNet dataset for these six models. Following this, we executed the fine-tuning phase using the training set derived from the BFMRX dataset. At the conclusion of each experiment, we used the testing set of the BFMRX dataset and applied the evaluation metrics (as outlined in Equations (1)–(7)) to assess the performance of the six DL models.
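This two-stage pipeline can be summarized in a short sketch. The snippet below is a minimal illustration assuming a TensorFlow/Keras workflow; the backbone choice, input size, dropout rate, and optimizer settings are illustrative rather than the exact training configuration used in the experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Stage 1: supervised pre-training is inherited by loading ImageNet weights.
# DenseNet-169 is shown; the other backbones are swapped in the same way.
backbone = tf.keras.applications.DenseNet169(
    include_top=False, weights="imagenet", input_shape=(128, 128, 3)
)

# Stage 2: fine-tune on the BFMRX training split with a new binary head.
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),                    # illustrative dropout rate
    layers.Dense(1, activation="sigmoid"),  # fractured vs. non-fractured
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=30)
```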
In the initial experiment, we employed the MLFNet model independently and assessed its performance using the testing subset of the BFMRX dataset. The results for the six DL models, along with their evaluation metrics, are presented in Table 6, Table 7 and Table 8. The average accuracy for each model is as follows: MLFNet achieved an accuracy of 99.60%, DenseNet-169 reached 95.06%, EfficientNet-B3 obtained 93.68%, Inception-V3 recorded 94.27%, MobileNet-V2 reached 97.04%, and ResNet-101 logged 91.90%. These findings indicate that MLFNet exhibited the highest accuracy among all the models assessed.
Table 6 shows exceptional performance for the MLFNet model in distinguishing between fractured and non-fractured cases. The average accuracy was an impressive 99.60%, indicating the model’s nearly flawless classification capability. Both the average specificity and recall were high at 99.63%, demonstrating the model’s strong ability to accurately identify TNs and TPs. The average FNR was notably low at just 0.37%, while the NPV and precision were both at 99.58%, indicating that the model made very reliable predictions. The average F1-score was 99.60%, confirming a nearly perfect balance between precision and recall.
For the fractured class, the model achieved 100% recall, meaning it accurately identified all the fractured cases with 0% false negatives. It also recorded 99.25% specificity, 99.17% precision, and an F1-score of 99.58%, reflecting high confidence and accuracy in detecting fractures. The NPV for this class was perfect at 100%, further supporting its strong performance. For the non-fractured class, the model reached 100% specificity and 99.25% recall, with a slightly higher FNR of 0.75%. Nevertheless, it achieved 100% precision, 99.17% NPV, and an F1-score of 99.63%, demonstrating consistent and dependable classification. Overall, the MLFNet model showed superior accuracy and robustness, making it highly suitable for clinical fracture detection tasks.
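For reference, all the per-class metrics reported in this section (defined in Equations (1)–(7)) can be derived from the four confusion-matrix counts. The following sketch, assuming scikit-learn and binary ground-truth and predicted labels, mirrors that computation:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def binary_metrics(y_true, y_pred):
    # Counts taken with the fractured class as positive.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    recall      = tp / (tp + fn)          # sensitivity / TPR
    specificity = tn / (tn + fp)          # TNR
    fnr         = fn / (fn + tp)          # miss rate
    precision   = tp / (tp + fp)          # PPV
    npv         = tn / (tn + fn)
    f1          = 2 * precision * recall / (precision + recall)
    return dict(accuracy=accuracy, recall=recall, specificity=specificity,
                fnr=fnr, precision=precision, npv=npv, f1=f1)
```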
The average accuracy of the DenseNet-169 model was 95.06%, meaning the model accurately classified most instances. The average specificity and recall were both 95.03%, showing a balanced ability to identify TNs and TPs. The average FNR was 4.97%, indicating a modest number of missed positive cases. Additionally, the average NPV and precision were both 95.05%, demonstrating reliable predictive capability. However, the reported average F1-score of 48.15% seemed to be a typographical or calculation error, as it was inconsistent with the other metrics and the individual class F1-scores.
For the fractured class, the model achieved 95.06% accuracy, with a specificity of 95.52% and a recall of 94.54%, reflecting decent performance in identifying fractured cases. The FNR was 5.46%, and the NPV, precision, and F1-score were 95.17%, 94.94%, and approximately 95.00%, respectively (rounded from the value labeled as 0.95, which likely represented 95%). For the non-fractured class, the model maintained the same accuracy of 95.06%, with a specificity of 94.54%, a recall of 95.52%, an FNR of 4.48%, and strong supporting values in NPV (94.94%), precision (95.17%), and F1-score (95.34%). Overall, the DenseNet-169 model demonstrated consistent and balanced performance, though it was slightly lower than other evaluated models, and the average F1-score needs correction to reflect accurate results.
The average accuracy of the EfficientNet-B3 model was 93.68%, indicating a generally reliable classification rate. Both the average specificity and recall were 93.98%, suggesting a balanced ability to identify TNs and TPs. However, the average FNR was 6.02%, which is notably higher than that of previously analyzed models. On the positive side, the NPV and precision averaged 93.94%, reflecting strong overall predictive reliability. The average F1-score was 93.68%, confirming a fair balance between precision and recall.
For the fractured class, the model achieved 93.68% accuracy, with a recall of 99.16% and a very low FNR of 0.84%, indicating high sensitivity to fractured cases. However, the specificity dropped to 88.81%, and the precision was 88.72%, suggesting a higher incidence of false positives in this category. The NPV was very high at 99.17%, and the F1-score was 93.65%, highlighting strong recall but moderate precision.
In contrast, for the non-fractured class, the model maintained the same accuracy of 93.68%, but with a much higher FNR of 11.19% and lower recall at 88.81%, indicating that it missed more normal cases. Nonetheless, it achieved very high precision at 99.17%, a specificity of 99.16%, and an F1-score of 93.70%, illustrating a reversal of performance trade-offs compared to the fractured class. Overall, while EfficientNet-B3 performed well, it demonstrated a clear precision–recall trade-off between classes, which may require adjustment based on clinical priorities, such as minimizing missed fractures.
The average accuracy of the Inception-V3 model reached 94.27%, indicating a high rate of correct predictions. The model also achieved balanced average specificity and recall values of 94.19%, demonstrating its effectiveness in identifying both TNs and TPs. The average FNR was 5.81%, which is relatively low, while both the NPV and precision averaged 94.31%, highlighting consistent predictive reliability. The average F1-score of 94.24% confirmed a good balance between precision and recall across both classes.
For the fractured class, the model achieved 94.27% accuracy, with a specificity of 95.52% and a recall of 92.86%. This suggests that while it accurately identified most fractured cases, a small number were missed. The FNR was 7.14%, slightly higher than ideal, and the precision, NPV, and F1-score were 94.85%, 93.77%, and 93.84%, respectively—indicating solid performance with room for improvement. For the non-fractured class, the model maintained 94.27% accuracy, with 92.86% specificity and 95.52% recall, reflecting better sensitivity but slightly lower specificity. The FNR was 4.48%, with precision, NPV, and F1-score values of 93.77%, 94.85%, and 94.64%, respectively. Overall, the Inception-V3 model performed reliably but showed slight trade-offs between recall and specificity depending on the class, suggesting potential for further fine-tuning to enhance generalization.
The MobileNet-V2 model achieved an average accuracy of 97.04%, indicating high correctness in its predictions. The average specificity and recall were also 97.04%, meaning the model was equally effective in identifying TNs and TPs. The average FNR was low at 2.96%, while both the NPV and precision were 97.02%, demonstrating consistent and reliable predictive performance. The average F1-score of 97.03% confirmed the model’s well-balanced trade-off between precision and recall.
For the fractured class, the model achieved 97.04% accuracy, with a specificity of 97.01% and a recall of 97.06%, showcasing its excellent ability to correctly identify fractured cases. The FNR was 2.94%, and the NPV, precision, and F1-score were 97.38%, 96.65%, and 96.86%, respectively, indicating slightly better performance in identifying TPs than in avoiding FPs. For the non-fractured class, the accuracy remained at 97.04%, with a specificity of 97.06%, recall of 97.01%, and FNR of 2.99%. The precision was 97.38%, and the NPV was 96.65%, resulting in an F1-score of 97.20%, which showed a slightly higher harmonic balance compared to the fractured class. Overall, the MobileNet-V2 model demonstrated robust, consistent performance with minimal class imbalance and a high level of reliability.
The ResNet-101 model achieved an average accuracy of 91.90%, indicating it correctly classified most samples. Both the average specificity and recall were 92.23%, suggesting a generally effective ability to identify TNs and TPs. The average FNR was 7.77%, indicating some missed positive cases. The NPV and precision were both 92.25%, while the average F1-score was 91.90%, demonstrating a reasonably strong but not optimal balance between precision and recall.
In the fractured class, the model achieved 91.90% accuracy, with a specificity of 86.57% and a recall of 97.90%. This indicates a high sensitivity to detecting fractures, albeit at the expense of a higher false positive rate. The FNR was low at 2.10%, and the precision, NPV, and F1-score were 86.62%, 97.89%, and 91.91%, respectively—highlighting strong recall but slightly weaker precision. For the non-fractured class, the model also maintained 91.90% accuracy, but the performance pattern was reversed: it achieved higher specificity at 97.90% and precision at 97.89%, while the recall dropped to 86.57%, and the FNR rose to 13.43%, suggesting the model missed more non-fractured cases. The NPV was 86.62%, and the F1-score was 91.88%. Overall, while the model performed adequately, it exhibited a recall–precision trade-off between classes, and there is room for improvement to reduce false positives for fractured cases and false negatives for non-fractured ones.
The analysis of computational efficiency and resource consumption among the six models reveals distinct variations, as shown in Table 8.
MLFNet stands out as the lightest option, achieving the quickest training time of 81.44 s, the fastest inference at 12.03 s, and the lowest memory use at approximately 10.3 GB, even though it has 35.07 M parameters. On the other hand, ResNet-101 exhibits the highest computational demands, taking 291.27 s for training, 28.61 s for inference, and requiring the most memory at around 20.9 GB, all while having over 43 million parameters, which underscores its substantial architecture. DenseNet-169 also requires considerable resources, with the longest training time of 419.18 s and a notable parameter count of about 13 million, although its inference speed is comparatively better than that of ResNet-101. EfficientNet-B3 offers a balanced approach, with a moderate training duration of 305 s, lower memory usage at 15.9 GB, and around 11 million parameters, making it an efficient choice with less complexity compared to larger models. Inception-V3 positions itself in the middle, needing about 161 s for training and 19.5 s for inference, alongside a relatively high parameter count of approximately 22 million and memory usage of about 17.4 GB. Lastly, MobileNet-V2 is notable for its efficiency, achieving a training time of 114.57 s, a fast inference of around 14.88 s, and the lowest parameter count at roughly 2.6 million, despite slightly elevated memory consumption of around 18.4 GB.
In summary, MLFNet and MobileNet-V2 are the most resource-efficient models, while ResNet-101 and DenseNet-169 demand significant computational power, with EfficientNet-B3 and Inception-V3 providing a balanced compromise between performance and resource allocation.
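The quantities summarized in Table 8 can be measured with a simple profiling routine. The sketch below is illustrative rather than the exact instrumentation used; it assumes a compiled Keras model, tf.data datasets, and an NVIDIA GPU visible to TensorFlow for the memory probe.

```python
import time
import tensorflow as tf

def profile(model, train_ds, test_ds, epochs=30):
    # Trainable parameter count (reported in millions in Table 8).
    n_params = model.count_params()

    # Wall-clock training time over the full schedule.
    t0 = time.perf_counter()
    model.fit(train_ds, epochs=epochs, verbose=0)
    train_time = time.perf_counter() - t0

    # Wall-clock inference time over the test set.
    t0 = time.perf_counter()
    model.predict(test_ds, verbose=0)
    infer_time = time.perf_counter() - t0

    # Peak GPU memory in GB, if a GPU is present.
    peak_mem = tf.config.experimental.get_memory_info("GPU:0")["peak"] / 1e9
    return n_params, train_time, infer_time, peak_mem
```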
Figure 9 illustrates the training and validation performance of the six DL models over 30 epochs, displaying both loss (on the left) and accuracy (on the right) curves. Extreme instabilities were observed in DenseNet-169, EfficientNet-B3, Inception-V3, MobileNet-V2, and ResNet-101, where the validation loss exhibited abnormally high spikes (up to 1 × 10¹³ in DenseNet-169 and similarly large values in others). These were traced to numerical instabilities caused by high initial LRs and the absence of gradient clipping, occasionally compounded by logging artifacts. Training configurations were adjusted with reduced learning rates, warm-up scheduling, and gradient clipping to mitigate these effects.
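A minimal sketch of these stabilization measures, assuming Keras, is shown below; the warm-up length, base learning rate, and clipping norm are illustrative values rather than the tuned settings.

```python
import tensorflow as tf

class WarmUp(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warm-up to a base learning rate, then held constant."""
    def __init__(self, base_lr=1e-4, warmup_steps=500):
        super().__init__()
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.minimum(self.base_lr * step / self.warmup_steps, self.base_lr)

# clipnorm bounds the global gradient norm, suppressing the loss spikes
# attributed above to exploding gradients.
optimizer = tf.keras.optimizers.Adam(learning_rate=WarmUp(), clipnorm=1.0)
```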
For the MLFNet model, the training loss steadily decreased from around 0.48 to nearly zero, while the validation loss also dropped sharply in the early epochs before stabilizing with minor fluctuations between epochs 10 and 25, remaining low overall. This pattern indicates that the model effectively learned from the training data and maintained good generalization to unseen validation data without significant overfitting.
Regarding accuracy, the MLFNet model experienced a rapid increase during the initial epochs. Training accuracy rose from approximately 75% to nearly 100%, while validation accuracy followed a similar trend, quickly surpassing 95% and remaining stable with minor variations throughout the remaining epochs. The small gap between training and validation accuracy, along with the low and stable validation loss, suggests that the model achieved excellent generalization and high predictive performance. Overall, the close alignment between training and validation metrics demonstrates a well-optimized and robust model, capable of delivering consistent results on both seen and unseen data.
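Curves of the kind shown in Figure 9 are conventionally produced from the Keras training history; a minimal sketch, assuming model.fit() returns the standard History object:

```python
import matplotlib.pyplot as plt

def plot_history(history, title):
    # history.history holds per-epoch loss/accuracy for training
    # and validation, as recorded by model.fit().
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_loss.plot(history.history["loss"], label="training loss")
    ax_loss.plot(history.history["val_loss"], label="validation loss")
    ax_acc.plot(history.history["accuracy"], label="training accuracy")
    ax_acc.plot(history.history["val_accuracy"], label="validation accuracy")
    for ax in (ax_loss, ax_acc):
        ax.set_xlabel("epoch")
        ax.legend()
    fig.suptitle(title)
```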
The training and validation loss plot (left) of the DenseNet-169 model across 30 epochs revealed a significant issue. While the training loss steadily decreased toward zero, the validation loss experienced a drastic spike in the first epoch, reaching approximately 1.4 × 10¹³. This likely indicated numerical instability or error, such as exploding gradients or division by zero. Following this spike, the validation loss quickly dropped to near zero and remained flat, which is not typical behavior and suggests a possible flaw in the loss computation or logging.
In the accuracy plot (right), the training accuracy consistently improved from about 78% to nearly 99%, indicating effective learning. However, the validation accuracy was quite erratic during the first 10 epochs, fluctuating between 40% and 95%. This instability likely stemmed from the issues observed in the validation loss. After the initial fluctuations, the validation accuracy stabilized and aligned more closely with the training accuracy, remaining above 90%, indicating improved generalization. Overall, while the training behavior was smooth, the unusual patterns in validation loss and early validation accuracy pointed to significant early instability in the model training process, possibly due to data irregularities, improper initialization, or learning rate issues. Addressing these factors is essential for ensuring reliable evaluation.
For the EfficientNet-B3 model over 30 epochs, the left panel shows the loss curves and the right panel the accuracy curves. The training loss (in red) remained consistently low and stable throughout the epochs, indicating smooth convergence during training. In contrast, the validation loss (in green) exhibited significant instability, with sharp spikes, particularly around epochs 12, 17, and 19, where the loss exceeded 5000. These sudden increases point to severe overfitting, numerical instability, or data-related issues, such as outliers or mislabeled validation samples.
Looking at accuracy, the training accuracy rose steadily from approximately 80% to nearly 100%, showing effective learning on the training set. However, the validation accuracy fluctuated sharply, especially in the early and middle epochs, and dropped significantly around epoch 21, coinciding with the spike in validation loss. Despite this drop, validation accuracy later recovered and closely aligned with training accuracy, stabilizing above 95% by the final epochs.
Overall, the comparison indicated a major gap between training and validation loss, suggesting that while the model performed well on the training data, its ability to generalize to the validation set was inconsistent and occasionally unreliable. These issues highlight potential problems with model regularization, learning rate tuning, or the quality of the validation data, which need to be addressed for more robust and stable performance.
The Inception-V3 model demonstrates effective but diverging learning patterns over 30 epochs. While loss decreased and accuracy increased in tandem, the training and validation curves separated over time, leading to significant overfitting. Training loss dropped sharply from 1.2 to nearly 0.0, following a near-exponential decay that indicated strong optimization. Validation loss decreased more gradually from 1.0 to 0.4, leveling off after epoch 15 with little further improvement. At the same time, training accuracy increased significantly from 50% to 90%, while validation accuracy rose from 60% to 80%, stabilizing after epoch 20.
The inverse relationship between loss and accuracy was most evident during the early training stages (epochs 0–10). However, a noticeable divergence appeared after epoch 15: training loss continued to fall to 0.0, validation loss remained at 0.4 (five times higher than the training loss), training accuracy reached 90%, and validation accuracy plateaued at 80%.
This resulted in a 10% accuracy gap and a loss gap of 0.4, highlighting substantial overfitting. Validation metrics showed early stabilization at epoch 15 (loss 0.4, accuracy 80%), while training metrics kept improving for another 15 epochs without similar gains in validation. Nonetheless, the coordinated progress in the early phase confirmed that the initial weight updates effectively contributed to performance improvements, with validation accuracy peaking at 80%—a level suitable for diagnostic use. The stalled validation metrics after epoch 15 suggested that early stopping at this point could prevent overfitting while ensuring optimal generalization.
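The early stopping suggested here corresponds to a standard Keras callback; a minimal sketch, with the patience value chosen for illustration:

```python
import tensorflow as tf

# Stop once validation loss has not improved for 5 epochs and restore the
# best weights, approximating a halt near epoch 15 for Inception-V3.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[early_stop])
```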
The MobileNet-V2 model shows a troubling divergence between its loss and accuracy metrics throughout the training epochs. The training and validation loss rose sharply from an initial value of around 0.5 to about 30 by epoch 30, indicating a serious problem with model convergence. This steep increase suggests potential issues such as an excessively high learning rate, model instability, or mismatched data.
On the other hand, both training and validation accuracy dropped significantly, falling from 0.8 to 0.4 within the first three epochs before leveling off at a suboptimal performance level. The rapid decline in accuracy corresponded with the rising loss, confirming that the model did not succeed in generalizing or learning useful patterns. The ongoing gap between training and validation accuracy pointed to overfitting, but the main problem appeared to be severe model divergence, as both metrics worsened together. This inverse relationship—where loss increased while accuracy decreased—highlighted a fundamental failure in the learning process.
The ResNet-101 model presents a serious case of overfitting and divergence, characterized by conflicting trends between its training and validation performance. The training loss steadily decreased to almost zero by epoch 30, while the validation loss skyrocketed to 1 × 10⁷, indicating a critical failure to generalize beyond the training data. This divergence started subtly at epoch 5 and worsened significantly after epoch 10, suggesting issues with optimization (such as a learning rate that is too high or insufficient regularization).
On the other hand, training accuracy approached perfect levels (around 1.0), but validation accuracy plummeted from 0.7 to 0.4 by epoch 30. The contrasting relationship between these metrics was clear: as training loss decreased (theoretically improving the fit), validation loss increased, and validation accuracy dropped to below-random levels. This contradiction confirmed that the model was memorizing noise and outliers in the training set instead of learning generalizable patterns. The widening gap between training and validation accuracy after epoch 5 (exceeding 0.6 by epoch 30) further highlighted the issue of overfitting. The simultaneous surge in validation loss and drop in accuracy pointed to significant flaws in the model design or compatibility with the data.
In Figure 10, the MLFNet model exhibited exceptional performance, achieving a near-perfect accuracy of 99.6% (504/506 correct predictions). Crucially, it demonstrated 100% recall/sensitivity (238/238 true fractures detected), meaning no actual fractures were missed—a critical achievement for medical diagnostics where FNs carry high risks. Precision was 99.2% (238/240), indicating only 2 FPs (healthy cases misclassified as fractured). Specificity was 99.3% (266/268), confirming strong identification of non-fractured cases. The F1-score harmonized precision and recall at 99.6%, reflecting an optimal balance. The absence of FNs suggests the model prioritizes patient safety by erring toward over-caution. While the two false positives might warrant minor tuning (e.g., adjusting decision thresholds), this performance is clinically outstanding overall.
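The threshold adjustment mentioned above can be sketched as follows, assuming sigmoid probability outputs scored on a validation split; the scan range and the F1 criterion are illustrative choices:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_val, p_val):
    """Scan candidate cut-offs on validation scores and return the one
    maximizing F1 (any clinically motivated criterion could substitute)."""
    thresholds = np.linspace(0.1, 0.9, 81)
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in thresholds]
    return thresholds[int(np.argmax(scores))]
```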
DenseNet-169 showed strong, though imperfect, diagnostic performance in classifying fractures. The model achieved an accuracy of 95.1% (481/506 correct predictions). It demonstrated a recall (sensitivity) of 94.5% (225/238 actual fractures detected), meaning 13 fractures were missed—a notable concern for clinical safety. Precision stood at 94.9% (225/237 predicted fractures correct), with 12 FPs (healthy cases flagged as fractured). Specificity was 95.5% (256/268 non-fractures correctly identified), reflecting reliable detection of healthy cases. The F1-score balanced precision and recall at 94.7%. While the model handled non-fractured cases effectively, the 13 false negatives indicated potential risks in under-diagnosis, suggesting further tuning or data refinement was needed to improve sensitivity for critical medical applications.
EfficientNet-B3 favored sensitivity over specificity in classifying fractures. The model achieved a high recall of 99.2% (236/238 actual fractures correctly identified), missing only two true fractures—a critical strength for medical safety. However, it exhibited lower precision of 88.7% (236/266 predicted fractures correct), generating 30 FPs where healthy cases were erroneously flagged as fractured. Specificity was 88.8% (238/268 non-fractures accurately recognized), indicating moderate effectiveness in confirming healthy cases. Overall accuracy reached 93.7% (474/506 correct), while the F1-score balanced these metrics at 93.7%. The model prioritized fracture detection (minimizing false negatives) at the cost of higher false alarms, suggesting it leaned toward caution in clinical trade-offs. Further optimization to reduce false positives was warranted without compromising sensitivity.
Inception-V3 shows effective diagnostic performance with a balanced outcome for detecting fractures. The model achieved an accuracy of 94.3% (477/506 correct predictions), reflecting competent but inconsistent performance. It demonstrated moderate sensitivity (recall) of 92.9% (221/238 actual fractures detected), missing 17 true fractures—a clinically significant shortfall where under-diagnosis posed risks. Conversely, precision was strong at 94.8% (221/233 predicted fractures correct), with only 12 FPs, indicating effective avoidance of unnecessary interventions for healthy cases. Specificity reached 95.5% (256/268 non-fractures identified), surpassing its recall performance. The F1-score balanced these metrics at 93.8%, highlighting a trade-off favoring precision over recall. While the model excelled in confirming non-fractured cases and minimizing false alarms, its missed fractures suggested limitations in reliability for safety-critical applications, warranting further refinement to improve sensitivity.
MobileNet-V2 delivered strong, well-balanced classification performance for fracture detection, with an accuracy of 97.0% (491/506 correct), the highest among the comparative baseline models. It achieved excellent recall (sensitivity) of 97.1% (231/238 actual fractures detected), missing only 7 true fractures—a significant improvement over architectures like InceptionV3 (17 FNs) and DenseNet169 (13 FNs). Precision remained high at 96.6% (231/239 predicted fractures correct), with merely 8 FPs, indicating minimal over-diagnosis. Specificity reached 97.0% (260/268 non-fractures correctly identified), demonstrating balanced capability across both classes. The F1-score settled at 96.8%, reflecting optimal harmony between sensitivity and precision. While not matching HybridSFNet’s perfection (0 FN/FP), MobileNetV2 prioritized clinical safety through low false negatives while maintaining operational efficiency, suggesting it struck a practical balance for real-world deployment.
The ResNet-101 model achieved a high recall of 97.9% (233/238 actual fractures detected), missing only 5 true fractures, which prioritized diagnostic safety by minimizing FNs. However, it exhibited notably low precision of 86.6% (233/269 predicted fractures correct), generating 36 FPs—the highest among recent models (e.g., MobileNetV2: 8 FPs, HybridSFNet: 2 FPs). Specificity was 86.6% (232/268 non-fractures correctly identified), revealing significant challenges in reliably excluding non-fractured cases. Overall accuracy reached 91.9% (465/506 correct), while the F1-score balanced recall and precision at 91.8%. The model leaned heavily toward sensitivity over specificity, resulting in substantial over-referrals (false alarms) that could burden clinical workflows. While its fracture detection capability was robust, the trade-off warranted calibration to reduce false positives for practical deployment.
Figure 11 displays the Receiver Operating Characteristic (ROC) curves for the six DL models. The ROC curve for MLFNet showed nearly perfect classification performance, closely hugging the top-left corner of the plot and achieving an area under the curve (AUC) score of 1.00. This indicates that MLFNet effectively distinguished between fractured and non-fractured X-ray images without compromising sensitivity or specificity. The curve’s steep ascent and maximal AUC highlight both high sensitivity and specificity, demonstrating the robustness of the proposed architecture.
DenseNet-169 also performed well, achieving an AUC of 0.95. Its ROC curve rose sharply toward the top-left corner but fell slightly short of MLFNet’s ideal path. This performance suggests a strong ability to differentiate between the two classes, although some misclassifications could occur compared to MLFNet’s perfect separation.
EfficientNet-B3 matched DenseNet-169 with an AUC of 0.95, following a similar trajectory in its ROC curve, which featured a steep rise initially and a gradual approach toward the upper right. While its performance was commendable, it did not achieve the flawless separation seen with MLFNet.
Inception-V3 outperformed both DenseNet-169 and EfficientNet-B3 with an AUC of 0.96, indicating a slightly better balance between sensitivity and specificity. This suggests it managed borderline cases a bit more effectively, but still did not match MLFNet’s perfect discrimination.
MobileNet-V2 achieved an AUC of 0.97, surpassing DenseNet-169, EfficientNet-B3, and Inception-V3. Its ROC curve indicated near-perfect classification capability, particularly excelling in minimizing false positives. Nevertheless, it still showed minor deviations from the ideal trajectory compared to MLFNet.
ResNet-101 demonstrated the weakest performance among the six models, with an AUC of 0.92. Although its ROC curve steadily rose toward the top-left, it lagged in the early part of the curve, suggesting more false positives at lower thresholds. This indicates that ResNet-101 had comparatively less discriminative power in this classification task.
Overall, MLFNet clearly outperformed the other models, achieving a perfect AUC of 1.00 and an ideal ROC curve shape. MobileNet-V2 followed closely with an AUC of 0.97, while Inception-V3 slightly surpassed both DenseNet-169 and EfficientNet-B3, which both had an AUC of 0.95. ResNet-101 trailed with an AUC of 0.92, indicating the least effective discrimination between classes. Ultimately, MLFNet demonstrated superior discriminative ability, confirming the effectiveness of its tailored architecture for fracture detection.
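The ROC curves and AUC values in Figure 11 follow the standard construction from per-image probability scores; a minimal sketch assuming scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(y_test, y_score, label):
    # y_score holds the predicted probability of the fractured class for
    # each test image; y_test holds the true binary labels.
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc_value = roc_auc_score(y_test, y_score)  # 1.00 means perfect separation
    plt.plot(fpr, tpr, label=f"{label} (AUC = {auc_value:.2f})")
    plt.xlabel("False positive rate (1 - specificity)")
    plt.ylabel("True positive rate (sensitivity)")
    plt.legend()
```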
Figure 12 presents the analysis of precision–recall curves for the six DL models.
MLFNet showed excellent performance, with a precision–recall curve area of 0.99 for the Fractured class and a perfect score of 1.00 for the Non_Fractured class. Both curves remained high across the recall range, indicating consistently high precision even as recall increased. This suggests that MLFNet was highly effective at accurately identifying both fractured and non-fractured cases with minimal false positives.
DenseNet-169 also performed strongly, achieving areas of 0.92 for the Fractured class and 0.93 for the Non_Fractured class. The precision remained high for both classes across most of the recall range, with only a slight drop at higher recall values. This indicates that DenseNet-169 was generally robust in its predictions, maintaining good precision while identifying a significant number of relevant instances.
EfficientNet-B3 demonstrated good performance, with areas of 0.88 for the Fractured class and 0.94 for the Non_Fractured class. The curve for the Non_Fractured class was notably higher and more stable than that for the Fractured class, which showed a more pronounced decline in precision at higher recall levels. This indicates that while EfficientNet-B3 performed well overall, it was more adept at accurately identifying non-fractured cases.
Inception-V3 achieved solid performance, with areas of 0.91 for the Fractured class and 0.92 for the Non_Fractured class. Both curves maintained high precision across a substantial portion of the recall range, indicating a reliable ability to classify instances correctly. Similarly to DenseNet-169, there was a noticeable dip in precision at very high recall, suggesting a trade-off in performance when attempting to capture nearly all relevant instances.
MobileNet-V2 exhibited strong and consistent performance, with areas of 0.95 for the Fractured class and 0.96 for the Non_Fractured class. Both precision–recall curves remained high and stable across the entire recall range, indicating that MobileNet-V2 maintained excellent precision even at high recall levels for both classes. This suggests that it was highly effective and balanced in its classification capabilities.
ResNet-101 showed acceptable performance, with areas of 0.86 for the Fractured class and 0.92 for the Non_Fractured class. The curve for the Fractured class exhibited a more significant drop in precision at higher recall values compared to the other models, indicating a greater challenge in maintaining precision for this class. However, the Non_Fractured class performed considerably better, maintaining higher precision across the recall range.
When comparing the precision–recall curves across all six models, several key observations emerged. MLFNet and MobileNet-V2 consistently demonstrated the strongest performance. MLFNet achieved near-perfect scores, particularly for the Non_Fractured class (area = 1.00), while maintaining exceptionally high precision for Fractured cases (area = 0.99). MobileNet-V2 also showed outstanding and balanced performance, with areas of 0.95 and 0.96 for the Fractured and Non_Fractured classes, respectively.
DenseNet-169 and Inception-V3 exhibited strong, comparable performance, with area scores generally in the low 0.90s for both classes. These models maintained good precision for a significant portion of the recall range but demonstrated a more noticeable drop at very high recall values, suggesting a slight trade-off between precision and recall at extreme ends.
EfficientNet-B3 performed well, especially for the Non_Fractured class (area = 0.94), but its performance for the Fractured class (area = 0.88) was slightly lower and less stable compared to the top performers.
ResNet-101 generally showed the lowest performance among the models, particularly for the Fractured class (area = 0.86), where its precision dropped more significantly at higher recall levels. While its performance for the Non_Fractured class (area = 0.92) was respectable, the disparity between the two classes was more pronounced in ResNet-101 than in the other models.
Overall, MLFNet and MobileNet-V2 stood out for their superior and consistent precision–recall characteristics across both classes, making them the most robust choices for this binary classification task.
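The per-class areas reported in Figure 12 correspond to the area under each precision–recall curve, computed from class-wise probability scores; a minimal sketch assuming scikit-learn:

```python
from sklearn.metrics import precision_recall_curve, auc

def pr_area(y_true, y_score):
    # Precision-recall curve for one class; the "area" values reported in
    # Figure 12 correspond to the area under this curve per class.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)
```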
4.4. The Hybrid MLFNet Performance Assessment
In the second experiment, we incorporated the MLFNet model into hybrid ensemble architectures featuring DenseNet-169, EfficientNet-B3, Inception-V3, MobileNet-V2, and ResNet-101. The results of the second experiment are presented in Table 9 and Table 10. The MLFNet + DenseNet-169 achieved an accuracy of 98.81%, while MLFNet + EfficientNet-B3 reached 98.02%. MLFNet + Inception-V3 recorded an accuracy of 97.83%, MLFNet + MobileNet-V2 attained 97.04%, and MLFNet + ResNet-101 achieved 96.64% on the testing set of the BFMRX dataset. Among the models assessed, MLFNet + DenseNet-169 exhibited the highest accuracy.
In Table 9, the evaluation results show that the MLFNet + DenseNet-169 model performed well and consistently when tested on the BFMRX dataset. The average accuracy was 98.81%, demonstrating the model’s strong ability to correctly classify both fractured and non-fractured cases. The average specificity and recall were 98.76%, indicating that the model effectively identified TNs and TPs. The average FNR was low at 1.24%, while both the NPV and precision were 98.87%, further confirming the model’s reliability in making predictions.
When analyzed by class, the model classified fractured cases with an accuracy of 98.81%, a specificity of 99.63%, and a recall of 97.90%, meaning that only a small number of actual fractured cases were misclassified. The FNR for fractured samples was 2.10%, with both precision and F1-score being high at 99.57% and 98.73%, respectively, indicating a strong balance between sensitivity and precision. For non-fractured cases, the accuracy remained steady at 98.81%, with a slightly lower specificity of 97.90%, but a very low FNR of 0.37%, showing that few fractured cases were incorrectly labeled as non-fractured. The NPV was 99.57%, precision was 98.16%, recall was 99.63%, and F1-score was 98.89%, all of which highlight the model’s strong predictive ability for normal cases. Overall, the model maintained balanced and robust performance across both classes, with particularly high precision and recall values that are essential in medical diagnostic applications.
The evaluation results of the MLFNet + EfficientNet-B3 model showed that it performed strongly and consistently when tested on the dataset. The average accuracy was 98.02%, highlighting the model’s effectiveness in accurately identifying both fractured and non-fractured cases. The average specificity and recall were both 97.99%, indicating the model’s proficiency in detecting TNs and TPs, respectively. Furthermore, the average FNR was low at 2.01%, while the NPV and precision averaged 98.04%, suggesting that the model made highly reliable predictions. The average F1-score was 98.02%, indicating a balanced trade-off between precision and recall.
For the fractured class, the model achieved an accuracy of 98.02%, with a specificity of 98.51% and a recall of 97.48%, reflecting a low misclassification rate for actual fractured cases. The FNR was 2.52%, and the NPV, precision, and F1-score were 97.78%, 98.31%, and 97.89%, respectively, demonstrating strong performance with minimal false negatives. In the non-fractured class, the model also reached 98.02% accuracy, with a slightly lower specificity of 97.48% but a higher recall of 98.51% and a lower FNR of 1.49%. The NPV was 98.31%, the precision was 97.78%, and the F1-score was 98.14%, all indicating excellent performance in identifying normal cases. Overall, the model displayed high and consistent performance across both classes, with minimal disparities, making it well-suited for practical diagnostic applications.
The evaluation results for the MLFNet + Inception-V3 model demonstrated strong and balanced performance across both classes. The average accuracy was 97.83%, indicating the model’s effectiveness in correctly classifying both fractured and non-fractured cases. The average specificity and recall were both 97.85%, highlighting the model’s strong ability to identify true negatives and true positives. Additionally, the FNR averaged 2.15%, which remained low, while both the NPV and precision stood at 97.79%, suggesting highly reliable predictions. The average F1-score of 97.82% confirmed a well-balanced trade-off between precision and recall.
In the fractured class, the model achieved 97.83% accuracy, with 97.39% specificity and 98.32% recall, indicating it correctly detected most fractured cases with few false positives. The FNR was low at 1.68%, and the precision, NPV, and F1-score were 97.10%, 98.49%, and 97.70%, respectively, reflecting consistent and accurate classification. For the non-fractured class, the model also achieved 97.83% accuracy, with 98.32% specificity, 97.39% recall, and a slightly higher FNR of 2.61%. The precision, NPV, and F1-score were 98.49%, 97.10%, and 97.94%, respectively, demonstrating excellent performance in detecting normal cases. Overall, the model exhibited highly reliable and balanced classification capabilities, making it suitable for real-world medical diagnostics.
The evaluation results for the MLFNet + MobileNet-V2 model showed strong performance in classifying both fractured and non-fractured cases. The average accuracy was 97.04%, indicating a high level of correct predictions across the dataset. MLFNet + MobileNet-V2 achieved an average specificity and recall of 97.08%, demonstrating its effectiveness in identifying true negatives and true positives. The average FNR was low at 2.92%, while both the NPV and precision stood at 96.99%, suggesting reliable predictions. The average F1-score of 97.03% reflected a good balance between precision and recall.
For the fractured class, the model achieved 97.04% accuracy, with a specificity of 96.27% and a recall of 97.90%, indicating effective identification of fractured cases with minimal false negatives. The FNR was 2.10%, and the NPV, precision, and F1-score were 98.10%, 95.88%, and 96.88%, respectively, showing solid performance despite slightly lower precision. In the case of non-fractured samples, the model maintained the same accuracy of 97.04%, with a higher specificity of 97.90% and a lower recall of 96.27%. The FNR was slightly higher at 3.73%, while the NPV, precision, and F1-score were 95.88%, 98.10%, and 97.18%, respectively. Overall, the model demonstrated consistent performance across both classes, with a minor trade-off between precision and recall, making it a reliable choice for medical image classification tasks.
The evaluation results for the MLFNet + ResNet-101 model showed strong performance in classifying fractured and non-fractured cases. The average accuracy was 96.64%, indicating that the model successfully classified most samples. Both the average specificity and recall reached 96.66%, demonstrating the model’s effectiveness in identifying true negatives and true positives. The average FNR was low at 3.34%, while the NPV and precision were both 96.60%, indicating reliable predictive performance. The average F1-score was 96.63%, reflecting a good balance between precision and recall.
In the fractured class, the model achieved an accuracy of 96.64%, with a specificity of 96.27% and a recall of 97.06%, highlighting its strong ability to detect actual fracture cases. The FNR was 2.94%, while the NPV, precision, and F1-score were 97.36%, 95.85%, and 96.45%, respectively, showing slightly lower precision but high recall. For the non-fractured class, the model maintained the same accuracy of 96.64%, with higher specificity at 97.06% and slightly lower recall at 96.27%. The FNR increased slightly to 3.73%, while the NPV, precision, and F1-score were 95.85%, 97.36%, and 96.81%, respectively. Overall, the model provided consistent and reliable results across both classes, with only minor differences in class-specific metrics.
Figure 13 shows the training and validation loss (on the left) and training and validation accuracy (on the right) over 30 epochs for the five ensemble models. In the case of the MLFNet + DenseNet-169 model, the analysis indicates that the model converged quickly, with both loss and accuracy stabilizing after about the fifth epoch.
In the loss curve, the training loss steadily decreased from approximately 0.35 to nearly 0.01, demonstrating effective learning from the training data. The validation loss also dropped quickly during the initial epochs and then varied slightly between 0.03 and 0.07, but generally remained low and close to the training loss. This suggests that the model did not experience overfitting and generalized well to new data.
Regarding accuracy, the model showed significant improvement in the first five epochs, with training accuracy increasing from around 85% to over 98%, eventually nearing 100%. Similarly, the validation accuracy rose quickly and stabilized around 98–99%, with minor fluctuations. The small gap between training and validation accuracy further indicates strong generalization and a minimal risk of overfitting.
Overall, the comparison of the accuracy and loss plots reveals a consistent and stable learning process, where the model achieved high accuracy with minimal loss for both training and validation sets, showcasing its robustness and reliability for fracture classification tasks.
For the ensemble MLFNet + EfficientNet-B3, the training loss (red) began at approximately 0.55 and consistently decreased, dropping below 0.05 by the end of training. The validation loss (green) also declined sharply during the early epochs but showed slight fluctuations between 0.1 and 0.2 after epoch 10, indicating some variability in the model’s generalization ability.
In contrast, the training accuracy rose quickly from around 70% to nearly 99%, demonstrating effective learning from the training data. The validation accuracy also improved rapidly, increasing from about 78% to roughly 96%, and remained relatively stable with minor fluctuations throughout the training process. The gap between training and validation accuracy was small, particularly after epoch 10, suggesting that the model did not experience significant overfitting.
In summary, the comparison of the loss and accuracy plots showed that while the model learned effectively (as evidenced by the steady decrease in training loss and increase in training accuracy), the slightly fluctuating validation loss indicated some variability in generalization. Nevertheless, the consistently high validation accuracy confirmed that the model maintained strong predictive performance on unseen data.
The MLFNet + Inception-V3 model exhibited highly effective learning dynamics, with well-synchronized loss reduction and accuracy improvement. Both training and validation loss curves followed a smooth exponential decay, decreasing from approximately 0.4 to near 0.02 over 30 epochs, indicating stable convergence without significant oscillations. The validation loss closely tracked the training loss throughout, maintaining a narrow gap of <0.01 after epoch 15, which demonstrated excellent generalization capability with minimal overfitting.
Concurrently, accuracy metrics showed complementary improvement: training accuracy rose steadily from 80% to 94%, while validation accuracy progressed from 85% to 90%. The validation accuracy plateaued after epoch 20, with only 0.5% fluctuation, indicating model stability. The inverse relationship between loss and accuracy was particularly evident at epoch 10, where loss decreased by 55% (from 0.4 to 0.18) and accuracy increased by 12.5% (from 80% to 90%). This coordinated progression confirmed efficient feature extraction and weight optimization. The terminal metrics at epoch 30 showed near-ideal alignment: training loss (0.02) ≈ validation loss (0.03), and training accuracy (94%) > validation accuracy (90%).
The persistent 4–5% accuracy gap between training and validation in later epochs suggested slight overfitting, though the minimal loss gap (<0.01) confirmed it was well-managed. The validation accuracy stabilized at 90% after epoch 20, while training accuracy continued improving to 94%, reflecting appropriate complexity balancing. These patterns collectively indicated successful model optimization, where loss reduction directly translated to accuracy gains, with validation metrics providing reliable performance estimates for real-world deployment.
The MLFNet + MobileNet-V2 model exhibited effective learning dynamics characterized by a strong inverse correlation between loss reduction and accuracy improvement. Training loss decreased steadily from 0.25 to near 0.00 over 30 epochs, while validation loss followed a parallel trajectory but plateaued at 0.05 after epoch 15, indicating early convergence. Concurrently, training accuracy rose from 88% to 98%, showing continuous improvement throughout the training. Validation accuracy increased more moderately from 90% to 94%, plateauing after epoch 20 with minimal fluctuation.
The inverse relationship was particularly pronounced between epochs 5 and 15, where loss decreased by 80% (from 0.20 to 0.04) and accuracy increased by six percentage points (from 90% to 96%). While training metrics showed near-perfect optimization (0.00 loss, 98% accuracy), the validation metrics demonstrated excellent generalization: terminal validation loss (0.05) remained well above the near-zero training loss, and validation accuracy (94%) was 4 percentage points lower than training accuracy (98%).
The growing divergence after epoch 15—where training loss continued decreasing while validation loss stabilized—suggested mild overfitting. However, the validation accuracy maintained a stable plateau at 94% with only ±0.5% variation in the final 10 epochs. This indicated robust feature extraction despite the overfitting tendency, with the 4% accuracy gap between training and validation representing an acceptable trade-off for generalization capability. The coordinated progression confirmed that loss reduction directly translated to accuracy gains throughout the training process.
The MLFNet + ResNet-101 model exhibited consistent improvement in both loss reduction and accuracy enhancement over 30 epochs, though there were emerging signs of overfitting in the later stages. Training loss decreased steadily from 0.5 to 0.1, following a near-linear trajectory that demonstrated effective optimization. Validation loss initially mirrored this trend but plateaued at 0.15 after epoch 20, revealing early convergence and a growing generalization gap. Concurrently, training accuracy showed robust improvement from 75% to 98%, while validation accuracy increased more moderately from 75% to 92%, plateauing after epoch 25 with only ±0.5% fluctuation.
The inverse relationship between loss and accuracy was particularly pronounced between epochs 5 and 15: loss decreased by 60% (from 0.4 to 0.16), and accuracy increased by 17 percentage points (from 78% to 95%). This strong correlation confirmed that weight updates effectively translated to performance gains. However, diverging trends emerged in the later epochs. After epoch 20, training loss continued to decrease to 0.1, validation loss stalled at 0.15, training accuracy reached 98%, and validation accuracy plateaued at 92%.
The terminal metrics revealed a 7% accuracy gap and a 0.05 loss gap between training and validation, indicating mild overfitting. Despite this, the validation accuracy stabilized at 92% with minimal variance in the final five epochs, confirming reliable generalization. The coordinated early-phase progression demonstrated efficient feature learning, while the later-phase divergence suggested that model complexity could be reduced for better regularization. Overall, the validation metrics (92% accuracy, 0.15 loss) represented clinically viable performance for diagnostic deployment.
In Figure 14, the MLFNet + DenseNet-169 model achieved exceptional performance, attaining an accuracy of 98.8% (500/506 correct predictions). It demonstrated high recall (sensitivity) of 97.9% (233/238 actual fractures detected), missing only 5 true fractures—a clinically robust outcome for safety-critical applications. Precision was nearly perfect at 99.6% (233/234 predicted fractures correct), with just one FP, minimizing unnecessary interventions. Specificity reached 99.6% (267/268 non-fractures correctly identified), highlighting outstanding reliability in confirming healthy cases. The F1-score balanced these metrics at 98.7%. While the 5 FNs fell short of the HybridSFNet’s flawless recall (0 FN), the single false positive represented a significant improvement over models like EfficientNetB3 (30 FPs) and ResNet101 (36 FPs). This performance positioned the model as a top-tier solution, effectively balancing diagnostic safety and operational efficiency for fracture detection.
The MLFNet + EfficientNet-B3 model demonstrated high-performance characteristics with an accuracy of 98.0% (496/506 correct predictions). It achieved a recall of 97.5% (232/238 actual fractures detected), missing 6 true fractures—a marginal increase in FNs. Precision remained exceptional at 98.3% (232/236 predicted fractures correct), with only 4 FPs, reflecting minimal over-diagnosis. Specificity reached 98.5% (264/268 non-fractures correctly identified), nearing perfection in healthy case identification. The F1-score balanced these metrics at 97.9%. While the 6 FNs represented a slight safety gap relative to the optimized version, the 4 false positives still outperformed most peer models (e.g., EfficientNetB3: 30 FPs, ResNet101: 36 FPs). This pre-optimization state already delivered clinically viable results but indicated that threshold refinement could further reduce missed fractures.
The MLFNet + Inception-V3 model achieved outstanding sensitivity (recall) of 98.3% (234/238 actual fractures detected), missing only 4 true fractures—a significant improvement over the standalone InceptionV3 model (17 FNs). Precision remained strong at 97.1% (234/241 predicted fractures correct), though 7 FPs indicated moderate over-referrals of healthy cases. Specificity reached 97.4% (261/268 non-fractures correctly identified), demonstrating robust performance in confirming non-injured cases. Overall accuracy stood at 97.8% (495/506 correct), while the F1-score balanced recall and precision at 97.7%. The MLFNet enhancement substantially elevated InceptionV3’s diagnostic safety by reducing FNs by 76% (from 17 to 4 FNs), though slight FP inflation persisted compared to top performers like HybridMLFNet (1 FP). This represented a clinically viable balance, prioritizing fracture detection while maintaining operational efficiency.
The MLFNet + MobileNet-V2 model achieved a recall of 97.9% (233/238 actual fractures detected), reducing FNs to 5—a slight improvement over standalone MobileNetV2 (7 FNs) and matching ResNet101’s sensitivity. However, it exhibited a lower precision of 95.9% (233/243 predicted fractures correct) due to 10 false positives, exceeding its predecessor’s FP count (standalone MobileNetV2: 8 FPs). Specificity declined to 96.3% (258/268 non-fractures identified) compared to the standalone version’s 97.0%. Overall accuracy remained 97.0% (491/506 correct), identical to the base MobileNetV2, while the F1-score settled at 96.9%. The enhancement prioritized further reduction in missed fractures but introduced more false alarms, suggesting that while diagnostic safety strengthened, operational efficiency moderately decreased relative to the original architecture. This represented a viable clinical trade-off where sensitivity gains outweighed precision costs for critical applications.
The MLFNet + ResNet-101 model achieved a recall of 97.1% (231/238 actual fractures detected), missing 7 true fractures—a slight decline from the base ResNet101’s 97.9% recall (5 FNs). Precision improved significantly to 95.9% (231/241 predicted fractures correct), reducing FPs to 10 from the base model’s 36 FPs. Specificity remained strong at 96.3% (258/268 non-fractures identified). Overall accuracy reached 96.6% (489/506 correct), surpassing the base ResNet101’s 91.9%, while the F1-score rose to 96.5% from 91.8%. The enhancement successfully mitigated the base model’s critical weakness of excessive false positives, cutting FP errors by 72% without substantially compromising sensitivity. This represented a meaningful clinical optimization, balancing fracture detection reliability (low FN) with operational efficiency (reduced false alarms) compared to the original architecture.
Table 11 provides a summary of the architectural setups and output dimensions of various MLFNet + CNN-based hybrid models utilized in the study. Each model, including DenseNet-169, EfficientNet-B3, Inception-V3, MobileNet-V2, and ResNet-101, adhered to a common processing workflow but varied in their backbone feature extractors and the resulting output sizes. All the models commenced with an input layer sized at (1, 128, 128, 3). Feature extraction was performed using pretrained backbones, each yielding different output shapes: DenseNet-169 produced (1, 4, 4, 1664), EfficientNet-B3 generated (1, 4, 4, 1536), Inception-V3 resulted in (1, 2, 2, 2048), MobileNet-V2 gave (1, 4, 4, 1280), and ResNet-101 returned (1, 4, 4, 2048).
These features were then processed through a series of SFNet blocks, along with batch normalization and dropout layers, which gradually reduced the spatial dimensions while enhancing the depth of features. The outputs were subsequently flattened and concatenated, leading to slightly varied fusion shapes across the models: (1, 132,736) for DenseNet-169, (1, 132,608) for EfficientNet-B3, (1, 133,120) for both Inception-V3 and ResNet-101, and (1, 132,352) for MobileNet-V2. Ultimately, each model included fully connected layers with Dense, Dropout, and a final output layer shaped at (1, 1) for binary classification. Overall, the table illustrated the impact of pretrained backbones on the dimensionality of intermediate representations while maintaining a consistent overall architecture.
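The fusion pattern that Table 11 describes can be sketched schematically as follows. This is a reconstruction assuming the Keras functional API, with the SFNet block reduced to a placeholder stage (its internal definition is given elsewhere in the paper), so the resulting fusion width will not match the exact dimensions in Table 11.

```python
import tensorflow as tf
from tensorflow.keras import layers

def sfnet_block(x):
    # Placeholder for the SFNet block defined elsewhere in the paper:
    # halves the spatial size while deepening features, then applies
    # batch normalization and dropout, as Table 11 describes.
    x = layers.Conv2D(256, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(0.3)(x)

inputs = layers.Input(shape=(128, 128, 3))           # input layer, cf. Table 11
backbone = tf.keras.applications.DenseNet169(
    include_top=False, weights="imagenet", input_tensor=inputs
)                                                    # output (4, 4, 1664)

branch_a = layers.Flatten()(sfnet_block(backbone.output))
branch_b = layers.Flatten()(backbone.output)
fused = layers.Concatenate()([branch_a, branch_b])   # flattened fusion vector

x = layers.Dense(256, activation="relu")(fused)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # (1, 1) binary output
model = tf.keras.Model(inputs, outputs)
```

Swapping the DenseNet-169 backbone for EfficientNet-B3, Inception-V3, MobileNet-V2, or ResNet-101 changes only the backbone output shape and, consequently, the width of the fused vector, consistent with the variation across rows in Table 11.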