Article

Anatomical Alignment of Femoral Radiographs Enables Robust AI-Powered Detection of Incomplete Atypical Femoral Fractures

by Doyoung Kwon 1, Jin-Han Lee 2, Joon-Woo Kim 2, Ji-Wan Kim 3, Sun-jung Yoon 4, Sungmoon Jeong 5,6,* and Chang-Wug Oh 2,*

1 School of Artificial Intelligence, Kyungpook National University, Daegu 41566, Republic of Korea
2 Department of Orthopedic Surgery, School of Medicine, Kyungpook National University Hospital, Kyungpook National University, Daegu 07364, Republic of Korea
3 Department of Orthopaedic Surgery, Asan Medical Center, University of Ulsan College of Medicine, Seoul 05505, Republic of Korea
4 Department of Orthopedic Surgery, Jeonbuk National University Medical School, Jeonju 54907, Republic of Korea
5 Department of Medical Informatics, Kyungpook National University, Daegu 41566, Republic of Korea
6 Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu 07364, Republic of Korea
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(22), 3720; https://doi.org/10.3390/math13223720
Submission received: 24 September 2025 / Revised: 15 November 2025 / Accepted: 19 November 2025 / Published: 20 November 2025

Abstract

Incomplete atypical femoral fractures (IAFFs) are subtle and require early diagnosis. However, artificial intelligence models for these fractures often fail in real-world clinical settings due to the “domain shift” problem, where performance degrades when applied to new data sources. This study proposes a data-centric approach to overcome this problem. We introduce an anatomy-based four-step preprocessing pipeline to normalize femoral X-ray images. The pipeline consists of (1) semantic segmentation of the femur, (2) skeletonization and central-axis extraction using RANSAC, (3) rotational alignment to the vertical direction, and (4) cropping of a normalized region of interest (ROI). We evaluate the effectiveness of this pipeline across various one-stage (YOLO) and two-stage (Faster R-CNN) object detection models. On the source domain, the proposed alignment pipeline significantly improves the performance of the YOLO models, with YOLOv10n achieving the best mAP@50–95 of 0.6472. More importantly, in a zero-shot evaluation on a completely new domain (standing AP X-rays), the models trained on aligned data exhibited strong generalization, whereas the models trained on standard data failed completely (mAP = 0); YOLOv10s trained with the proposed method achieved an mAP@50–95 of 0.4616. One-stage detectors showed more consistent performance gains from alignment than two-stage detectors. Normalizing medical images based on their inherent anatomical consistency is a highly effective and efficient strategy for achieving domain generalization. This data-driven paradigm, which simplifies the input to the AI, can yield clinically applicable, robust models without increasing the complexity of the model architecture.

1. Introduction

Atypical femoral fracture (AFF) typically manifests in the subtrochanteric or diaphyseal region of the femur. AFF can occur with long-term use of bisphosphonates and denosumab for the treatment of osteoporosis and has been reported to occur bilaterally in more than 40% of cases [1]. The prevalence of AFF has been reported to be 1.8, 16, and 13 per 100,000 patients after 2, 3, and 10 years of bisphosphonate use, respectively [2,3,4]. As the older adult population grows and more patients receive osteoporosis treatment, the prevalence of AFF is expected to rise. While the goal of osteoporosis treatment is to reduce the risk of femoral fractures, it is paradoxical that these therapies can themselves cause fractures, necessitating a deeper understanding of their pathogenesis and preventive strategies.
Fortunately, AFF has subtle characteristics that can be detected early. Most patients undergoing long-term osteoporosis treatment exhibit cortical buckling as an incomplete fracture, initially due to repeated microfractures and healing in the femur’s lateral cortex [5]. This stage is referred to as incomplete AFF (IAFF). Patients are often asymptomatic at this stage, leading to delayed diagnosis until pain or a complete fracture occurs. Imaging techniques such as MRI, CT, or radioisotope scans can accurately diagnose IAFF [6]; however, these methods are time-consuming, expensive, and involve high doses of radiation. In contrast, X-rays are faster and more accessible, but identifying IAFF features on X-rays is particularly challenging.
The radiographic features of IAFF, such as localized and subtle thickening of the lateral cortex (cortical buckling or beaking) or a transverse radiolucent line, are extremely subtle, making them easy to miss even for experienced orthopedic surgeons [7]. According to one study, only 7% of AFFs are correctly identified upon initial diagnosis, implying that most patients miss the optimal window for treatment. Furthermore, the condition’s rarity means that a general musculoskeletal radiologist may encounter less than one case per year, making it challenging to maintain a high level of clinical suspicion [8]. Since the signs of IAFF are subtle and easily overlooked, reliable, automated image analysis may assist with consistent detection and timely review.
However, in addition to these clinical challenges, building the large-scale datasets essential for artificial intelligence (AI) development also faces practical hurdles. High-quality standing anteroposterior (AP) X-rays, especially of diagnostically challenging IAFF cases, are extremely scarce, making it a significant challenge to secure sufficient data to train AI.
Recent advances in deep learning have demonstrated exceptional performance in various fields, including medical image analysis. In orthopedics, deep learning techniques have been successfully applied to tasks such as fracture classification [9,10,11], predicting the onset of knee arthritis using X-rays [12], and postoperative ultrasound monitoring [13]. However, existing research has several fundamental limitations.
First, many studies have focused on the problem of classification, which determines the presence or absence of a fracture [14]. This approach fails to provide the crucial information for IAFF diagnosis: ‘Where is the fracture located?’ As the diagnosis of IAFF hinges on identifying the precise location of a minute lesion, merely classifying an image as ‘fracture present’ has limited clinical utility [15]. Second, previous studies have often targeted clear, complete fractures or reported high performance within well-controlled datasets [16]. Therefore, a novel approach capable of precisely localizing these subtle lesions in real-world clinical settings is urgently needed.
A more severe problem that these models face when applied in real clinical environments is domain shift [14]. Domain shift refers to the phenomenon where the statistical distribution of the data used for training (the source domain) differs from that of the data to which the model is applied (the target domain). This problem becomes especially severe in medical imaging because hospitals use different X-ray systems, follow different acquisition protocols, serve diverse patient populations, and face hardware aging and minor operator errors. Consequently, a model that shows over 90% accuracy on a specific dataset can experience a drastic drop in performance when applied to data from another hospital (external validation), making it challenging to ensure clinical reliability [17]. This challenge remains a leading barrier to translating medical AI from research into routine clinical practice.
Many existing studies have attempted to solve the domain shift problem by developing more sophisticated and complex AI model architectures. This study, however, shifts the perspective to find and solve the root of the problem within the data itself. We demonstrate that a data-centric approach is a more effective and fundamental solution for creating an environment in which AI can learn more easily. The core idea of this research is a preprocessing technique that leverages the fact that the anatomical structure of the femur is similar across individuals to align all femurs in X-ray images into a consistent orientation and form. Through this transformation, we systematically remove noise irrelevant to diagnosis, such as patient posture, imaging angle, and unnecessary background information, and guide the AI to focus purely on the ‘signal’: the bone’s morphology and incomplete fracture features. Specifically, we reduce intraclass variation in bone size and orientation due to patient positioning and imaging angles, and standardize all images to a similar scale, making the model robust to these nuisance variations. Furthermore, by aligning key anatomical structures to a consistent location, we encourage the model to exploit implicit positional information when learning fracture features. Through this systematic refinement, the AI is provided an ideal learning environment, free from the distraction of confounding variables. Preliminary results from our study show that a model trained on data refined by this methodology successfully detected fractures in data from a completely different domain (e.g., evaluation on standing X-rays after training on femoral X-rays) without any additional training. We interpret these results as preliminary support for achieving domain generalization through data preprocessing alone, without added model complexity. This paper aims to systematically demonstrate how this data-centric approach can dramatically improve AI’s robustness and generalization performance across various object detection models. The main contributions of this study are threefold:
  • We propose a novel preprocessing pipeline for anatomical normalization of femur radiographic images.
  • We demonstrate the profound performance impact of our data-driven approach through a comprehensive analysis of detection architectures.
  • We demonstrate successful zero-shot generalization for IAFF detection in a completely different imaging domain, validating the robustness of our method.

2. Related Works

2.1. AI for AFF and IAFF Diagnosis

Recent studies on AFF classification and detection fall into two main lines: image-only and multimodal approaches that integrate radiographs with structured clinical variables. This section reviews key studies within each of these domains. Image-only approaches for separating AFF from typical femoral fractures (NFF) primarily rely on fine-tuning ImageNet-pretrained CNNs such as VGG [18], Inception [19], and ResNet [20]. A notable example from Zdolsek et al. showed that a ResNet model achieved 94% accuracy following manual view correction, with Class Activation Mapping (CAM) [21] used to visualize salient regions [10]. Building on this trend, Nguyen et al. introduced AFFnet, which augments a ResNet50 backbone with a Box Attention Guide (BAG) module to supervise the network’s scan pattern using bounding-box–derived attention maps, enabling four-way classification [22]. Multimodal work combines imaging with electronic health records (EHRs) to boost predictive power. Schilcher et al. fused radiographs with seven preselected variables (age, sex, osteoporosis diagnosis, rheumatoid disease, corticosteroid therapy, proton-pump inhibitor therapy, and bisphosphonate therapy). They reported that image-only performance (AUC = 0.966) increased to 0.987 with late/probability fusion, while sensitivity rose from 0.796 to 0.903. The study also noted that these gains were larger when fewer images per patient were available [8].
Two complementary strategies have emerged focusing on the detection of IAFF, where radiological cues are often small and ambiguous. First, Kim et al. targeted early IAFF detection with a transfer-learning ensemble trained on 1050 radiographs (100 IAFF / 950 normal). Their method involved edge-enhancing Sobel preprocessing and strong augmentations before fine-tuning EfficientNet [23], MobileNet [24], and DenseNet [25]. A majority-vote ensemble of the top three models reached an AUC of 0.998, with Sobel [26] preprocessing yielding up to a 5-percentage-point accuracy gain. Lesion localization was provided via Score-CAM [7]. Second, Chang et al. proposed CFNet, a context-aware, level-wise feature-fusion architecture tailored to IAFF. It features a Dual Context-aware Complementary Extractor (global view + sequential high-resolution slices), a Level-wise Perspective-preserving Fusion Network, and a self-attention–based Spatial Anomaly Focus Enhancer to suppress false negatives. Across diverse views (AP/ER/IR/LT), CFNet achieved an accuracy of 0.931, F1-score of 0.9456, and AUROC of 0.9692, surpassing prior baselines [27].

2.2. Preprocessing Techniques in Medical Images

Prior studies have used various preprocessing techniques to improve AI model performance. Low-level image processing methods include Sobel filters or Canny edge detection to clarify boundaries [10] and non-local means filters to remove noise [28]. Furthermore, segmentation-based approaches that first isolate specific anatomical structures, such as the femur, using models like UNet are also widely used [29]. Additionally, cropping techniques that remove unnecessary backgrounds or medical device logos have been shown through Explainable AI (XAI) to effectively prevent the model from focusing on irrelevant features [27]. However, most works apply these methods as isolated steps, and few present an integrated framework that addresses fundamental generalization issues such as domain shift [7].

2.3. Domain Generalization Strategy

Research on domain generalization for domain shift typically falls into three directions. Feature-level methods train models to extract domain-invariant features across datasets, using techniques such as distribution alignment and adversarial training. Model-level methods alter the architecture, for example by learning to separate domain-specific from domain-general components [10]. In contrast, data-level methods reduce variability in the input before feeding it to the model; examples include simulating diverse domain styles via augmentation or, as in this study, standardizing images into a canonical form [30]. This approach, which enables the subsequent AI model to naturally acquire generalization capability by improving the fundamental representation of the data without complex model structures or training strategies, holds high potential, especially in medical imaging, where anatomical structures are consistent. This study shows that this data-level approach can be the most direct and efficient solution to the domain shift problem.

3. Methods

3.1. Automated Preprocessing Pipeline for Femoral Alignment

This study proposes a multi-stage preprocessing pipeline that standardizes the anatomical structure of the femur to ensure model generalization across X-ray images from different domains, as illustrated in Figure 1. The first step of the pipeline is to accurately segment the femur region from the input X-ray image.
  • Step 1: Femur Segmentation
Based on the fact that bones exhibit clear contrast with soft tissue in X-ray images, a deep learning-based semantic segmentation model was utilized. Several segmentation models, such as UNet [31], FCN8 (ResNet50) [32], and HRNet [33], were used to generate a binary mask corresponding to the femur region in the image. This mask serves as the basis for accurately extracting bone location and shape information in subsequent steps. The axis-aligned bounding box method struggles to fully express the directionality of a tilted femur, and the oriented bounding box (OBB) has the limitation of being complex to learn; therefore, a mask generation method that enables precise pixel-level separation was adopted. We ultimately chose HRNet because it maintains high-resolution feature maps throughout the network, which leads to more precise segmentation boundaries.
  • Step 2: Femur Axis Extraction
From the binary femur mask generated in Step 1, a principal axis representing the bone’s orientation was extracted. First, a morphological skeletonization algorithm was applied to the mask to extract its 1-pixel-wide medial axis [34]. This process, however, often introduces small, spurious branches (i.e., outliers) that deviate from the main femoral shaft due to minor irregularities in the mask. To robustly identify the principal axis in the presence of these outliers, the resulting skeleton points were first smoothed with a Gaussian blur (kernel size = 5 × 5, σ = 1 ). Then, the RANSAC (Random Sample Consensus) [35] algorithm was applied. To account for the bone’s anatomical curvature, we employed a 2nd-order polynomial (quadratic) model within the RANSAC framework. This approach was chosen for its robustness in identifying the inlier points corresponding to the femoral shaft, while effectively disregarding spurious skeleton branches as outliers. From the resulting quadratic fit, we computed the average slope, which serves as a robust representation of the bone’s overall orientation.
  • Step 3: Axis Alignment via Rotation
All femur images were aligned to the same orientation based on the central axis extracted in Step 2. Using the average slope derived from the 2nd-order RANSAC fit, we calculated the angle between the bone’s central orientation and the image’s vertical direction. An affine transformation was then applied to rotate both the image and its mask by this angle, standardizing all femurs to a vertical alignment. This process, as shown in Figure 2, eliminated orientation inconsistencies arising from slight differences in the shooting angle or patient position.
  • Step 4: Region of Interest Cropping and Normalization
Finally, the image was cropped around the aligned femur region to minimize the influence of irrelevant background information on the model. In Step 3, based on the binary mask rotated with the femur, an inscribed bounding box was calculated that encompassed the entire bone region while minimizing the margins. The central axis of the femur found in Step 2 was aligned to the center of the bounding box. Finally, the original image was cropped based on this bounding box, resulting in a standardized femur image with all data points of similar size and location.
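To make the pipeline concrete, the following is a minimal sketch of Steps 2–4, assuming OpenCV, scikit-image, and scikit-learn as the implementation libraries (the paper does not specify its software stack); the function name and the exact slope/sign conventions are illustrative rather than the authors’ exact implementation.

```python
# Minimal sketch of Steps 2-4 on one image/mask pair.
# Assumes: `image` is the grayscale X-ray and `mask` is the binary femur mask
# (uint8, 0/255) produced by the Step 1 segmentation model.
import cv2
import numpy as np
from skimage.morphology import skeletonize
from sklearn.linear_model import RANSACRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def align_and_crop(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Step 2: smooth (5x5 Gaussian, sigma=1), skeletonize, and fit a quadratic
    # axis x = f(y) with RANSAC so spurious skeleton branches act as outliers.
    smoothed = cv2.GaussianBlur(mask, (5, 5), 1)
    skeleton = skeletonize(smoothed > 127)
    ys, xs = np.nonzero(skeleton)
    axis_model = make_pipeline(PolynomialFeatures(2), RANSACRegressor())
    axis_model.fit(ys.reshape(-1, 1), xs)

    # Approximate the average slope of the fitted axis over the bone extent,
    # i.e., the deviation of the femoral axis from the vertical direction.
    y_top, y_bot = float(ys.min()), float(ys.max())
    x_top, x_bot = axis_model.predict([[y_top]])[0], axis_model.predict([[y_bot]])[0]
    angle_deg = np.degrees(np.arctan2(x_bot - x_top, y_bot - y_top))

    # Step 3: rotate image and mask about the skeleton centroid so the axis is
    # vertical (the rotation sign may need flipping depending on conventions).
    center = (float(xs.mean()), float(ys.mean()))
    rot = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    h, w = mask.shape[:2]
    rot_image = cv2.warpAffine(image, rot, (w, h))
    rot_mask = cv2.warpAffine(mask, rot, (w, h))

    # Step 4: crop a tight bounding box around the rotated femur mask.
    x, y, bw, bh = cv2.boundingRect((rot_mask > 127).astype(np.uint8))
    return rot_image[y:y + bh, x:x + bw]
```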

3.2. Detection Models

In this study, we utilized several structurally different object detection models to comprehensively verify the effectiveness of the proposed preprocessing methodology.

3.2.1. Faster R-CNN

First, we evaluated Faster R-CNN [36], a representative two-stage object detector. Faster R-CNN is characterized by its high accuracy, comprising a Region Proposal Network (RPN) that first proposes suspicious regions and a detector that precisely classifies and repositions the proposed regions. The backbone network, which serves as the feature extractor for this model, employs various models from the ResNet family. Specifically, we used ResNet50, ResNet101, and ResNeXt101 (32x8d, 64x4d) [37], which structures the basic ResNet blocks into multiple parallel paths to enhance efficiency and expressiveness. Additionally, to evaluate the ability to effectively detect objects of various sizes, backbones combined with a Feature Pyramid Network (FPN) [38] (e.g., ResNet50-FPN, ResNet101-FPN) were also included in the experiments. FPNs combine low-level and high-level features to generate feature maps across multiple scales, enabling robust detection of objects ranging from small fractures to relatively large lesions within images. Given the small area of the IAFF target relative to the entire image, we re-tuned the RPN anchor aspect ratios during the initial transfer learning phase. We must clarify that this was not part of the zero-shot generalization step, but rather a one-time setup applied to all baseline models (Standard and Alignment) to adapt the pre-trained detector to the specific geometry of the target. The impact of this tuning is further detailed in our ablation study.
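As a concrete illustration of this anchor re-tuning, the sketch below builds a torchvision Faster R-CNN with a custom RPN anchor generator. The anchor sizes follow Group 2 of the ablation study in Section 5.8 ([8, 16, 32, 64, 128]); the aspect ratios and the use of the torchvision implementation are assumptions, not the authors’ exact configuration.

```python
# Sketch: torchvision Faster R-CNN with re-tuned RPN anchors for a small target.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.rpn import AnchorGenerator


def build_detector(num_classes: int = 2):  # background + IAFF
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    # One anchor size per FPN level (Group 2 of the ablation study); keeping
    # three aspect ratios per location stays compatible with the default RPN head.
    model.rpn.anchor_generator = AnchorGenerator(
        sizes=((8,), (16,), (32,), (64,), (128,)),
        aspect_ratios=((0.5, 1.0, 2.0),) * 5,  # illustrative ratios
    )
    # Replace the classification head for the single IAFF class (+ background).
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model
```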

3.2.2. YOLO

Second, we evaluated one-stage detectors from the YOLO (You Only Look Once) family [39], which is widely known for its high speed and efficiency. YOLO achieves near-real-time detection speeds through a unique architecture that divides the image into grids and directly predicts the bounding box and class of objects in each grid cell. In this study, we utilized multiple generations of YOLO, including v8, v9, v10, and v11, to analyze performance changes resulting from continuous advancements in the architecture [40,41,42]. Furthermore, to evaluate the trade-off between inference speed and accuracy, considering practicality in clinical settings, we utilized lightweight models of different sizes, such as nano (n), small (s), and tiny (t), within each version.
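For reference, a minimal fine-tuning sketch using the Ultralytics API is shown below, plugging in the training settings reported in Section 4.2 and the rect training noted in Section 5.9; the dataset YAML path is a hypothetical placeholder, and the exact training framework used by the authors is not stated in the paper.

```python
# Sketch: fine-tuning a COCO-pretrained YOLO model on the aligned femur crops.
from ultralytics import YOLO

model = YOLO("yolov10n.pt")        # pretrained weights, fine-tuned below
model.train(
    data="iaff_aligned.yaml",      # hypothetical dataset config (single class: IAFF)
    epochs=200,
    imgsz=1024,
    batch=16,
    optimizer="SGD",
    lr0=0.01,
    rect=True,                     # keep the 1024x256 aspect ratio without square padding
)
metrics = model.val()              # reports precision, recall, mAP@50, mAP@50-95
```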

4. Experiments

4.1. Datasets

4.1.1. Femoral X-Ray

For model training and testing, we retrospectively collected femoral X-ray data from patients who visited Kyungpook National University Hospital (KNUH) from August 2010 to November 2022. The dataset consisted of 236 patients (144 normal, 92 IAFF), and a total of 794 images (364 normal, 430 IAFF) were collected. All X-ray data were provided in DICOM format. As shown in Figure 3, each patient contributed between one and six images, including AP (anteroposterior), ER (external rotation), and IR (internal rotation) views of the left and right femurs. For training, the images were resized to 1024 × 1024.

4.1.2. Alignment X-Ray

We prepared alignment data from the collected femoral X-rays through the proposed preprocessing. The aligned images were resized and padded to 1024 pixels in height and 256 pixels in width, while maintaining the image aspect ratio, for use in training and evaluation.
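A minimal sketch of this resize-and-pad step is shown below, assuming OpenCV with zero-padding centered on the canvas; the paper does not specify the padding value or placement, so those details are illustrative.

```python
# Sketch: resize an aligned femur crop to 1024x256 while preserving aspect ratio,
# padding the remainder with zeros (centered).
import cv2
import numpy as np


def resize_and_pad(img: np.ndarray, out_h: int = 1024, out_w: int = 256) -> np.ndarray:
    h, w = img.shape[:2]
    scale = min(out_h / h, out_w / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    canvas = np.zeros((out_h, out_w) + img.shape[2:], dtype=img.dtype)
    y0, x0 = (out_h - new_h) // 2, (out_w - new_w) // 2  # center on the canvas
    canvas[y0:y0 + new_h, x0:x0 + new_w] = resized
    return canvas
```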

4.1.3. Standing AP X-Ray

To evaluate the domain generalization performance of the model trained on aligned femur images, we additionally collected a standing AP X-ray dataset that is completely independent of the source domain. This dataset was collected at Kyungpook National University Hospital and consists of 20 healthy individuals and 20 IAFF patients. As shown in Figure 4, unlike the source domain, which captures only the femur, this dataset covers a much wider field of view, from the toes to the pelvis, providing a completely new type of input for the model. We evaluated only the femoral regions of these standing AP X-rays.

4.2. Training Details

To ensure robust evaluation of model performance, we performed five-fold cross-validation on the source domain dataset. The training, validation, and testing ratios were set to 9:1:1. Stochastic gradient descent (SGD) [43] was used as the optimizer in all experiments, with an initial learning rate of 0.01 and a total of 200 epochs. The batch size varied depending on the model’s structural characteristics: YOLO models were trained with a batch size of 16, whereas Faster R-CNN models used a batch size of 4, because the RPN, a core component of Faster R-CNN, tends to exhibit training instability with larger batch sizes due to the inclusion of excessive background regions (training failed with a batch size of 16).
For data augmentation, we randomly applied horizontal flips, rotations (±10°), brightness shifts (±0.1), and contrast shifts (±0.1).
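The following sketch reproduces this augmentation policy with Albumentations, which also transforms the bounding boxes consistently with the image; the library choice and the application probabilities are assumptions, since the paper only lists the augmentation types and ranges.

```python
# Sketch of the augmentation policy above (flip, ±10° rotation, ±0.1 brightness/contrast).
import albumentations as A

train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Rotate(limit=10, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.1, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = train_transform(image=img, bboxes=boxes, class_labels=labels)
```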
Given the modest size of our dataset (n = 794) for deep learning, we employed a transfer learning strategy to prevent overfitting. Models were not trained from scratch; instead, we initialized their backbones with weights pre-trained on large-scale datasets (e.g., ImageNet or COCO) and then fine-tuned them on our femoral X-ray data. This approach, combined with the rigorous cross-validation protocol described above, ensures a valid and scientifically sound evaluation.
All experiments were conducted on a workstation equipped with an NVIDIA RTX A6000 (48 GB) GPU, 128 GB of RAM, and an AMD EPYC 7513 32-Core Processor.

4.3. Evaluation Metrics

To quantitatively evaluate the performance of the proposed methodology, we utilized several standard metrics widely used in the field of object detection. The fundamental components for understanding these metrics are defined as follows: A True Positive (TP) is a case where the model correctly detects an IAFF. A False Positive (FP) is a case where the model incorrectly identifies a normal case as an IAFF. A False Negative (FN) is a case where the model fails to detect an actual IAFF. These determinations are based on the Intersection over Union (IoU), which measures the degree of overlap between a predicted bounding box and a ground-truth bounding box.
Precision: This metric measures the proportion of true positives among all predictions classified as positive. It focuses on minimizing False Positives, and a high precision value indicates that the model’s positive predictions are highly reliable.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall: This metric, also known as sensitivity, measures the proportion of true positives that were correctly identified among all actual positive cases. It focuses on minimizing False Negatives, and a high recall value indicates that the model misses few positive cases.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1-Score (F1): This is the harmonic mean of Precision and Recall. It is particularly useful for evaluating a model’s overall performance when there is a trade-off between precision and recall.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
mAP (mean Average Precision): This is the primary metric for evaluating the overall performance of object detection models. It is calculated from the area under the Precision-Recall curve, which is summarized as the Average Precision (AP) for a single class. The AP is computed as the weighted sum of precisions at each recall step:
$$AP = \sum_{k=1}^{N} (R_k - R_{k-1}) P_k$$
where $P_k$ and $R_k$ are the precision and recall at the k-th threshold. The mAP is then calculated by averaging the AP scores across all N object classes:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
In this study, since we have a single class (IAFF), the mAP is equivalent to the AP for that class. We evaluated the model’s performance comprehensively by measuring mAP at various IoU thresholds (e.g., mAP@0.5, mAP@0.75) to assess both detection and localization accuracy.
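For clarity, the sketch below computes single-class AP at a fixed IoU threshold following the step-wise summation defined above; the greedy, confidence-ordered matching is a common simplification and may differ in detail from the evaluation code actually used.

```python
# Sketch: single-class AP at a fixed IoU threshold from box predictions.
import numpy as np


def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)


def average_precision(preds, gts, iou_thr=0.5):
    """preds: list of (image_id, score, box); gts: dict image_id -> list of GT boxes."""
    preds = sorted(preds, key=lambda p: -p[1])                 # descending confidence
    matched = {img: [False] * len(b) for img, b in gts.items()}
    n_gt = sum(len(b) for b in gts.values())
    tp, fp = np.zeros(len(preds)), np.zeros(len(preds))
    for i, (img, _, box) in enumerate(preds):
        ious = [iou(box, g) for g in gts.get(img, [])]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= iou_thr and not matched[img][j]:
            tp[i], matched[img][j] = 1, True                   # first match to an unclaimed GT
        else:
            fp[i] = 1
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # AP = sum over recall steps of (R_k - R_{k-1}) * P_k
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```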

5. Results

5.1. Analysis of Data Changes After Preprocessing

The proposed preprocessing pipeline is designed to fundamentally alter the input data distribution, creating a more favorable learning environment for the AI model. In Figure 5, we compare the data distribution before (Femoral X-ray) and after (Alignment X-ray) the alignment process, providing both quantitative and visual summaries of these changes.
The most significant change occurred in the relative size of the target lesions. As shown in the “Box-to-Image Area Ratio” histogram, IAFF lesions in the original dataset occupied a very small portion of the entire image, typically ranging from 0.25% to 1%. This is typical of the small object detection problem, a significant challenge for many detection architectures. After applying the alignment and cropping pipeline, this ratio nearly doubled, moving to a range of 0.5% to 2%. This physical enlargement of the relative size of the targets is a key factor in the performance improvement, detailed in the following sections, allowing the subtle features of IAFF to be more prominent and easier for the model to learn.
Furthermore, the preprocessing improved the uniformity of the dataset. The “Center X” and “Center Y” histograms of the Alignment X-ray are more densely distributed than those of the Femoral X-ray, demonstrating the success of the re-centering step. This allows the model to expect the femur to appear in a consistent location within the frame. Similarly, the “height/width” ratio of the final cropped image exhibits a more normalized Gaussian distribution, which reduces the overall variability in the shape of the input data. By systematically reducing the variance in scale, position, and shape, the model can allocate more capacity to learning the discriminative features of the fracture itself, rather than learning the various ways the femur can be imaged.

5.2. Bone Segmentation

As shown in Table 1, to validate the performance of this segmentation step, we evaluated the HRNet model using 5-fold cross-validation on our segmentation dataset. We used two standard metrics: the Dice Score to measure the volumetric overlap between the predicted mask and the ground truth, and the 95% Hausdorff Distance (HD95) to evaluate the boundary delineation accuracy. Our model achieved a high mean DSC of 0.952 (±0.015) and an HD95 of 31.43, confirming its high accuracy. This robust segmentation performance ensures a reliable mask for the subsequent axis extraction step.
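As a reference for how these two metrics can be computed on binary masks, the following sketch implements the Dice score and a 95th-percentile Hausdorff distance using NumPy and SciPy; the boundary extraction and percentile convention shown are standard choices and are assumptions rather than the authors’ exact implementation.

```python
# Sketch: Dice score (volumetric overlap) and HD95 (boundary accuracy) on binary masks.
# Assumes both masks are non-empty 2D arrays.
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist


def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-9)


def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    # Boundary pixels = mask minus its erosion; HD95 = 95th percentile of the
    # symmetric nearest-neighbour distances between the two boundaries.
    def surface(mask):
        mask = mask.astype(bool)
        return np.argwhere(mask ^ binary_erosion(mask))

    sp, sg = surface(pred), surface(gt)
    d = cdist(sp, sg)  # pairwise Euclidean distances between boundary pixels
    return float(np.percentile(np.concatenate([d.min(axis=1), d.min(axis=0)]), 95))
```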

5.3. Analysis of Preprocessing Effect Within the Source Domain

We evaluated the alignment transform described in Section 4.1 by comparing detector performance before (“Standard”, S) and after (“Alignment”, A) preprocessing on the same source-domain data. Table 2 reports mean ± sd for mAP@50, mAP@75, and mAP@50–95 across folds. Overall, alignment yields consistent improvements for most one-stage (YOLO) models, whereas two-stage (Faster R-CNN) detectors show a heterogeneous response.
For the YOLO family, alignment yielded the greatest gains and the best overall results. YOLOv10n (A) achieved the highest mAP@50–95 of 0.6472 (±0.0513), outperforming its baseline of 0.6198 (±0.0463) by 0.0274 (a relative gain of about 4.4%). Even at the stricter IoU threshold, YOLOv10n (A) led with an mAP@75 of 0.7894 (±0.0428). In terms of coarse IoU accuracy, YOLOv9s (A) achieved the highest mAP@50 of 0.9569 (±0.0292) among all tested models. While some YOLO variants exhibited slight regressions in isolated metrics, the dominant trend was positive, with the highest headline scores achieved with alignment.
Two-stage (Faster R-CNN) models respond less uniformly. Alignment sometimes raises mAP@50 (e.g., ResNet50 FPN: 0.8069 → 0.8388), yet these gains frequently do not carry over to higher-IoU metrics. For example, the same ResNet50 FPN model shows a decline in mAP@50–95 (0.5061 (S) vs. 0.4645 (A)). Conversely, alignment can also be highly beneficial; the highest mAP@50–95 for a two-stage model was achieved by ResNeXt-101-32x8d (A) at 0.5438, a substantial improvement over its 0.3869 (S) baseline. These results indicate that alignment does not uniformly benefit region-proposal pipelines and that its utility for two-stage detectors depends on the backbone and the target IoU regime.
Comparing the best results for each family on the key summary metric (mAP@50–95) reveals the following differences: YOLOv10n (A) at 0.6472 outperforms the best-performing two-stage model (ResNeXt-101-32x8d (A); 0.5438) by 0.1034 (about 10.3 percentage points). In this setting of limited data and small objects, one-stage detectors appear to exploit alignment more effectively, likely because orientation/scale normalization and background reduction preferentially favor dense single-shot heads, whereas the RPN-based pipeline is more sensitive to the resulting shifts in object position and scale distribution.
For applications prioritizing overall detection quality, YOLOv10n with alignment provides the strongest mAP@50–95 in our study. If high recall at lenient IoU is paramount (e.g., screening), YOLOv9s with alignment yields the highest mAP@50. For two-stage models, alignment may still be useful as a data-cleaning step, but its net effect should be validated for the specific backbone and IoU threshold of interest.
As shown in Table 3, for the YOLO variants, alignment generally improves both precision and recall, consequently boosting F1 scores. The strongest overall balance is achieved by YOLOv9s (A) with an F1 score of 0.9369, surpassing its standard counterpart (0.9133) and delivering the highest precision among all models (0.9668 vs. 0.9144 for the standard version). When sensitivity is prioritized, YOLOv8s (A) attains the highest recall (0.9102, up from 0.8732), while also improving F1 (0.9259 vs. 0.9209). Other YOLO models likewise benefit: YOLOv10s (A) boosts F1 to 0.9035 (vs. 0.8762), YOLOv11s (A) to 0.9121 (vs. 0.8756), YOLOv8n (A) to 0.9226 (vs. 0.9064), YOLOv10n (A) to 0.8977 (vs. 0.8827), and YOLOv9t (A) to 0.9194 (vs. 0.9134). A single exception is YOLOv11n (A), which shows a small decline in F1 to 0.9001 (vs. 0.9097), suggesting that very compact backbones can become slightly over-selective after alignment.
For Faster R-CNN, alignment effects are mixed and often reflect a precision–recall trade-off mediated by the RPN and backbone. Modest gains appear for ResNet-50 FPN (F1 boosted to 0.8206 vs. 0.8083) and ResNet-101 FPN (F1 to 0.8108 vs. 0.8036), indicating that, with FPN on ResNet backbones, alignment can enhance both selectivity and coverage. In contrast, some models show significant improvement; ResNeXt-101-32x8d improved substantially (F1 score from 0.6678 to 0.8471). Other backbones show minor changes: ResNeXt-101-64x4d FPN improves slightly (to 0.8597 vs. 0.8548), while ResNeXt-101-64x4d shows only a minor uptick (to 0.8108 vs. 0.8036) and ResNeXt-101-32x8d FPN shows a small gain (0.8133 vs. 0.7993). These outcomes suggest that some two-stage pipelines require retuning of RPN anchors or scales after alignment to avoid recall loss.
Comparing families on the summary metric, the best YOLO result (YOLOv9s (A), F1: 0.9369) exceeds the best Faster R-CNN result (ResNeXt-101-64x4d FPN (A), F1: 0.8597) by +0.0772 absolute. In this low-data, small-object setting, alignment therefore provides a larger net benefit to one-stage detectors. Practically, when balanced accuracy is the objective, YOLOv9s with alignment is the preferred choice; when sensitivity is paramount (e.g., screening), YOLOv8s with alignment offers the highest recall. For two-stage models, alignment can still be advantageous, particularly with ResNet-FPN backbones, but backbone-specific validation and, where necessary, RPN/anchor retuning are recommended to mitigate recall degradation.

5.4. Visualization of Femoral and Alignment X-Ray

Figure 6 demonstrates the visual effects of the alignment preprocessing. In each subpanel, the square crop represents Standard (S), the rectangular crop represents Alignment (A), the green box represents the ground truth, and the blue box represents the model prediction. In the S condition, background noise remains around the IAFF lesion due to overlapping adjacent bone and soft tissue and deviations in the shooting angle, resulting in loose boundaries (false positives, FP) or missed lesions (false negatives, FN). In the A condition, where alignment is applied to the same original image, the femoral axis is normalized and the out-of-field background is removed, so the predicted box converges more tightly around the lesion, substantially correcting misses and over-detections. This qualitative improvement is consistent with the quantitative gains (increases in mAP and F1) observed in Table 2 and Table 3, suggesting that alignment supports accurate localization, especially in the high-IoU regime.

5.5. Error Analysis

Qualitative analysis of the displayed images shows mixed performance with specific error patterns. The model successfully identifies several True Positives (TP), as shown in Figure 7, but also shows notable failures. False Positives (FP) are generated near features that confound the model, such as the surgical implant on the left side of Figure 7b (prediction 0.53) and the normal cortical bone adjacent to the actual fracture on the left side of Figure 7a (prediction 0.33). More importantly, the model exhibits a complete False Negative (FN) on the right side of Figure 7a, failing to detect a subtle but clear Ground Truth (GT) lesion. This is likely due to the model’s limited sensitivity to visible but subtle fractures, and possibly also to subtle local texture differences.

5.6. Zero-Shot Evaluation of Standing AP X-Ray

We verified the generalizability of the representations learned in the source domain by performing zero-shot evaluation on the target domain, standing AP, without any additional training or fine-tuning. Inference was performed with the source-domain weights frozen. Standard (unaligned) input conditions were excluded from the standing AP evaluation because the shooting geometry of the original images (rotation, tilt, and FOV) diverged substantially from the source’s reference coordinate system and scale assumptions; under these conditions, mAP converged to 0, making comparison impossible.
As shown in the performance results in Table 4, the YOLO family showed consistent superiority across all IoU ranges. YOLOv10s recorded the best performance with an mAP@50–95 of 0.4616, followed closely by YOLOv11n (0.4600) and YOLOv10n (0.4510). At the more lenient mAP@50 criterion, YOLOv10s was again the best with 0.8924, and it also had the highest mAP@75 (0.4400), which reflects sensitivity at high IoU. In contrast, the best Faster R-CNN result was an mAP@50–95 of 0.2986 for ResNeXt-101-64×4d FPN, showing a clear gap from the top YOLO models.
The effectiveness of FPN integration was confirmed in an internal comparison of two-stage models. Applying an FPN to a ResNet/ResNeXt backbone (e.g., ResNet-101 → ResNet-101 FPN) resulted in an overall increase in mAP owing to multi-resolution feature integration. However, the improvement was limited in the high-IoU range (@75, @50–95), and in some combinations the accuracy of box convergence was insufficient to bridge the gap with the one-stage detectors. We interpret this as the RPN’s proposal regions and anchor distribution failing to adapt sufficiently to the object/background distribution after alignment preprocessing in an environment, such as standing AP, with a complex background structure and high acquisition variability.
As shown in Table 5, the precision, recall, and F1 indices also reaffirmed the superiority of YOLO. YOLOv11n recorded the highest F1 (0.8531), followed by YOLOv8s (0.8495), YOLOv11s (0.8488), and YOLOv10s (0.8354). Specifically, YOLOv11s provided the most conservative detection (fewest false positives) with a precision of 0.9337, and YOLOv10n was the most sensitive with a recall of 0.8506. The YOLO family therefore offers a spectrum of precision-oriented (v11s), recall-oriented (v10n), and balanced (v11n/v8s/v10s) models, allowing model selection per application.
Regarding the mixed behavior of Faster R-CNN, some FPN combinations (e.g., ResNeXt-101-64×4d FPN, F1: 0.7479) showed meaningful, balanced performance, but the gap with the top YOLO models remained. The pose and scale variations of standing AP, the repetitive skeletal structure, and the small size of the lesion conflicted with the RPN’s anchor scale/aspect ratio assumptions, causing a trade-off in which recall was lost even when precision was improved by alignment. This implies that the Faster R-CNN pipeline is relatively sensitive to preprocessing variations, and high-IoU performance may be limited if the anchor scale/aspect ratio and thresholds are not retuned in the zero-shot transfer setting.
For applications that prioritize overall accuracy (mAP@50–95), YOLOv10s is a suitable first choice, while YOLOv11n is preferable for balanced metrics (F1). For scenarios such as screening that prioritize sensitivity (recall), YOLOv10n is advantageous, and for environments where minimizing false positives (precision) is key, YOLOv11s is advantageous. When adopting a two-stage detector, combining it with an FPN is virtually mandatory, and after alignment and scaling, the RPN anchor scale/aspect ratio and NMS threshold should be retuned to recover high-IoU performance. In summary, under zero-shot conditions on standing AP, one-stage detectors proved relatively robust to pose and background variations, achieving both parameter efficiency and strong performance.

5.7. Visualization of Standing AP X-Ray

Figure 8 shows visualization results inferred with YOLOv10n (the best-performing fold of the experiment) on standing AP. The red squares in the original full-body image indicate the IAFF regions of interest (left and right thigh areas), the green boxes in the zoomed-in panels represent the ground truth, and the blue boxes represent the YOLO predictions (with confidence scores). In (a)–(c), the ground truth and predictions overlap closely even in areas with low soft-tissue contrast or blurred cortical boundaries; in particular, (c) converges stably with high confidence on both sides. In cases (b) and (d), confidence tends to decrease somewhat when the background pattern is complicated by metal fixation or adjacent knee structures, but localization at the region-of-interest level is maintained. Overall, the alignment preprocessing mitigates background variations and magnification differences, confirming that the model consistently captures the IAFF even under zero-shot conditions.

5.8. Ablation Study on Anchor Sizes

To analyze the impact of anchor box configuration on the performance of our proposed ‘Alignment’ method, we conducted an ablation study. Figure 9 illustrates a comparison of mAP@50 performance between the ‘Alignment’ method (solid line) and the ‘standard’ method (dashed line) across three distinct anchor size groups, using six different backbone architectures.
As shown in the results, our ‘alignment’ method demonstrated a substantial performance increase when moving from Group 1 ([4, 8, 16, 32, 64]) to Group 2 ([8, 16, 32, 64, 128]). We attribute this to the ‘alignment’ process, which normalizes the object’s pose and increases its effective scale, leading to a better match with the relatively larger anchor boxes in Group 2.
Conversely, we observed a general performance degradation in Group 3 ([16, 32, 64, 128, 256]) for both methods. This suggests that while ‘alignment’ increases the object’s effective scale, the target objects are not inherently large-scale. The excessively large anchors in Group 3 appear to hinder optimal IoU matching. This study confirms that the anchor configuration of Group 2 provides the optimal setting for our methodology.

5.9. Computational Performance

We measured the computational costs of all models to evaluate their practical feasibility. Table 6 details the average training time per epoch (in seconds) and the maximum GPU VRAM usage (in GB) for both Standard (S) and Alignment (A) inputs. To ensure a fair comparison, all experiments were conducted on an NVIDIA RTX A6000 GPU, and the batch size was fixed at 4 for all models.
The results demonstrate two key findings. First, our alignment pipeline offers a dramatic and consistent reduction in GPU memory (VRAM) usage. By removing unnecessary background and normalizing the input from 1024 × 1024 (S) to a focused 1024 × 256 (A), the memory footprint drops sharply. For example, the ResNet50 FPN model’s VRAM requirement was cut from 11.4 GB to just 3.9 GB. This efficiency gain, observed across all models, highlights that our data-centric approach makes training significantly more accessible.
Second, the impact on per-epoch training time was architecture-dependent. For the Faster R-CNN family, the smaller (A) input size consistently reduced the time per epoch (e.g., ResNet50 FPN: 46.6 s vs. 20.9 s). This is likely due to the RPN generating proposals on a much smaller feature map, reducing the overall computational load.
Conversely, for the YOLO models (v10, v11), the aligned (A) data resulted in a longer time per epoch. We utilized rect=true training to avoid the computational waste associated with padding. However, this forces the model’s neck architecture (e.g., PANet) to process high-aspect-ratio feature maps (e.g., 32 × 8) instead of square-like ones (e.g., 32 × 32). This appears to create an optimization bottleneck: the underlying GPU computations and memory access patterns are likely less efficient when processing these non-square tensors, for which the architecture may not be fully optimized.
Despite variations in training time, the critical takeaway for clinical application is the inference speed. We measured the average inference time and confirmed that all models, when processing the aligned data, achieved inference speeds of less than one second per image, validating their feasibility for real-time deployment.

6. Discussion

This study demonstrated that a data-driven approach leveraging the inherent anatomical consistency of the data can dramatically improve AI’s domain generalization performance without complex model modifications. While femurs vary in length and thickness across individuals, their overall morphological characteristics are remarkably similar. This very characteristic also produces consistent morphological differences between healthy bones and those with incomplete atypical fractures.
This anatomical consistency leads to a clear contrast between bone and soft tissue in X-ray images, enabling a relatively simple segmentation model to accurately capture the overall bone morphology. Thanks to this accurate segmentation, the “center axis” extracted in the subsequent step consistently represented the structural center of the femur across all patient data. Consequently, the “verticalization” process, which rotates the images around this center axis, effectively removes noise caused by differences in imaging angle and patient position, and plays a crucial role in aligning all bones to a standardized shape.
The data-driven approach in this study fundamentally differs from existing model-driven domain generalization studies. Model-centric studies primarily attempt to address differences in data distribution across domains at the algorithmic level. This can require restructuring the model each time a new domain is encountered, or using multiple models (e.g., ensembles) to process several domains simultaneously. In contrast, this study demonstrated that by standardizing the data itself, a single model can operate robustly across multiple domains without additional modification. This suggests that such models can be a much more efficient and scalable solution for deployment in real-world clinical settings.
This study has several primary limitations that must be acknowledged. First, and most significantly, all data, including both the source and target domains, were sourced from a single institution (Kyungpook National University Hospital). Although our zero-shot evaluation on the Standing AP domain showed promising results, validating the model’s generalization and robustness absolutely requires rigorous testing on independent, multi-center external data from diverse medical settings, equipment, and patient populations.
Second, the standing AP X-ray dataset used for our crucial zero-shot evaluation is critically small (20 normal subjects and 20 IAFF subjects). We acknowledge that this small sample size limits the statistical reliability of the generalization claims. While these findings demonstrate the potential of our method, they must be interpreted as preliminary. A follow-up study with a substantially larger and more diverse target dataset is essential to robustly validate these claims.
An interesting finding was the discrepancy in optimal model capacity between domains. While YOLOv10n (nano) achieved the highest mAP@50–95 on the standardized ‘clean’ source domain, YOLOv10s (small) surpassed it in the zero-shot target domain. We attribute this to the classic trade-off between specialization and generalization. The minimal capacity of the nano model was sufficient for, and highly specialized to, the source data, avoiding overfitting. However, the slightly larger capacity of the small model allowed it to learn more robust and complex feature representations, which were not fully utilized on the source data but proved essential for generalizing to the more diverse and complex unseen target domain.

7. Conclusions

Early diagnosis of IAFF is clinically important, but subtle image features and data scarcity have made AI model development challenging. In particular, domain shift has been a major barrier to the clinical application of existing models. This study addressed this issue through a data-centric approach that leverages a distinctive property of medical imaging: the consistency of anatomical structures. The proposed four-step preprocessing pipeline converts diverse femoral X-ray images into a standardized format, enabling a single AI model to perform robustly across multiple domains without additional training. This study highlights the importance of improving the fundamental data representation rather than developing more complex models. Future studies should further validate the generalization performance of the proposed methodology using data from other institutions.

Author Contributions

Conceptualization, S.J. and C.-W.O.; methodology, D.K., J.-H.L. and S.J.; software, D.K.; validation, D.K., J.-H.L. and S.J.; formal analysis, D.K. and J.-H.L.; investigation, D.K., J.-W.K. (Joon-Woo Kim), J.-W.K. (Ji-Wan Kim) and S.-j.Y.; resources, S.J. and C.-W.O.; data curation, J.-H.L., J.-W.K. (Joon-Woo Kim) and C.-W.O.; writing—original draft, D.K. and J.-H.L.; writing—review and editing, J.-H.L., S.J. and C.-W.O.; visualization, D.K.; supervision, S.J. and C.-W.O.; project administration, S.J. and C.-W.O.; funding acquisition, S.J. and C.-W.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: RS-2022-H130593).

Institutional Review Board Statement

This study was approved by the Kyungpook National University Hospital (KNUH) Institutional Review Board under approval number KNUH202402007 on 26 February 2024.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not readily available due to the inclusion of personally identifiable information. Ethical and legal constraints imposed by our institutional review board prevent us from sharing the raw data.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IAFF: Incomplete Atypical Femoral Fracture
RANSAC: Random Sample Consensus
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

References

  1. Schilcher, J.; Aspenberg, P. Incidence of stress fractures of the femoral shaft in women treated with bisphosphonate. Acta Orthop. 2009, 80, 413–415. [Google Scholar] [CrossRef]
  2. Dell, R.M.; Adams, A.L.; Greene, D.F.; Funahashi, T.T.; Silverman, S.L.; Eisemon, E.O.; Zhou, H.; Burchette, R.J.; Ott, S.M. Incidence of atypical nontraumatic diaphyseal fractures of the femur. J. Bone Miner. Res. 2012, 27, 2544–2550. [Google Scholar] [CrossRef] [PubMed]
  3. Adler, R.A.; El-Hajj Fuleihan, G.; Bauer, D.C.; Camacho, P.M.; Clarke, B.L.; Clines, G.A.; Compston, J.E.; Drake, M.T.; Edwards, B.J.; Favus, M.J.; et al. Managing osteoporosis in patients on long-term bisphosphonate treatment: Report of a task force of the American Society for Bone and Mineral Research. J. Bone Miner. Res. 2016, 31, 16–35. [Google Scholar] [CrossRef]
  4. van de Laarschot, D.M.; McKenna, M.J.; Abrahamsen, B.; Langdahl, B.; Cohen-Solal, M.; Guañabens, N.; Eastell, R.; Ralston, S.H.; Zillikens, M.C. Medical management of patients after atypical femur fractures: A systematic review and recommendations from the European Calcified Tissue Society. J. Clin. Endocrinol. Metab. 2020, 105, 1682–1699. [Google Scholar] [CrossRef]
  5. Bégin, M.J.; Audet, M.C.; Chevalley, T.; Portela, M.; Padlina, I.; Hannouche, D.; Ing Lorenzini, K.; Meier, R.; Peter, R.; Uebelhart, B.; et al. Fracture risk following an atypical femoral fracture. J. Bone Miner. Res. 2020, 37, 87–94. [Google Scholar] [CrossRef] [PubMed]
  6. Cheung, A.M.; McKenna, M.J.; van de Laarschot, D.M.; Zillikens, M.C.; Peck, V.; Srighanthan, J.; Lewiecki, E.M. Detection of atypical femur fractures. J. Clin. Densitom. 2019, 22, 506–516. [Google Scholar] [CrossRef]
7. Kim, T.; Moon, N.H.; Goh, T.S.; Jung, I.D. Detection of incomplete atypical femoral fracture on anteroposterior radiographs via explainable artificial intelligence. Sci. Rep. 2023, 13, 10415.
8. Schilcher, J.; Nilsson, A.; Andlid, O.; Eklund, A. Fusion of electronic health records and radiographic images for a multimodal deep learning prediction model of atypical femur fractures. Comput. Biol. Med. 2024, 168, 107704.
9. Tanzi, L.; Vezzetti, E.; Moreno, R.; Moos, S. X-ray bone fracture classification using deep learning: A baseline for designing a reliable approach. Appl. Sci. 2020, 10, 1507.
10. Zdolsek, G.; Chen, Y.; Bögl, H.P.; Wang, C.; Woisetschläger, M.; Schilcher, J. Deep neural networks with promising diagnostic accuracy for the classification of atypical femoral fractures. Acta Orthop. 2021, 92, 394–400.
11. Murphy, E.; Ehrhardt, B.; Gregson, C.L.; von Arx, O.; Hartley, A.; Whitehouse, M.; Thomas, M.; Stenhouse, G.; Chesser, T.; Budd, C.; et al. Machine learning outperforms clinical experts in classification of hip fractures. Sci. Rep. 2022, 12, 2058.
12. Wang, C.T.; Huang, B.; Thogiti, N.; Zhu, W.X.; Chang, C.H.; Pao, J.L.; Lai, F. Successful real-world application of an osteoarthritis classification deep-learning model using 9210 knees—An orthopedic surgeon’s view. J. Orthop. Res. 2023, 41, 737–746.
13. Teng, Y.; Pan, D.; Zhao, W. Application of deep learning ultrasound imaging in monitoring bone healing after fracture surgery. J. Radiat. Res. Appl. Sci. 2023, 16, 100493.
14. Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185.
15. Yıldız Potter, İ.; Yeritsyan, D.; Mahar, S.; Kheir, N.; Vaziri, A.; Putman, M.; Rodriguez, E.K.; Wu, J.; Nazarian, A.; Vaziri, A. Proximal femur fracture detection on plain radiography via feature pyramid networks. Sci. Rep. 2024, 14, 12046.
16. Kuo, R.Y.; Harrison, C.; Curran, T.A.; Jones, B.; Freethy, A.; Cussons, D.; Stewart, M.; Collins, G.S.; Furniss, D. Artificial intelligence in fracture detection: A systematic review and meta-analysis. Radiology 2022, 304, 50–62.
17. Valliani, A.A.; Gulamali, F.F.; Kwon, Y.J.; Martini, M.L.; Wang, C.; Kondziolka, D.; Chen, V.J.; Wang, W.; Costa, A.B.; Oermann, E.K. Deploying deep learning models on unseen medical imaging using adversarial domain adaptation. PLoS ONE 2022, 17, e0273262.
18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
22. Nguyen, H.H.; Le, D.T.; Shore-Lorenti, C.; Chen, C.; Schilcher, J.; Eklund, A.; Zebaze, R.; Milat, F.; Sztal-Mazer, S.; Girgis, C.M.; et al. AFFnet—A deep convolutional neural network for the detection of atypical femur fractures from anteroposterior radiographs. Bone 2024, 187, 117215.
23. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
25. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
26. Kanopoulos, N.; Vasanthavada, N.; Baker, R.L. Design of an image edge detection filter using the Sobel operator. IEEE J. Solid-State Circuits 1988, 23, 358–367.
27. Chang, J.; Lee, J.; Kwon, D.; Lee, J.H.; Lee, M.; Jeong, S.; Kim, J.W.; Jung, H.; Oh, C.W. Context-Aware Level-Wise Feature Fusion Network with Anomaly Focus for Precise Classification of Incomplete Atypical Femoral Fractures in X-Ray Images. Mathematics 2024, 12, 3613.
28. Spanos, N.; Arsenos, A.; Theofilou, P.A.; Tzouveli, P.; Voulodimos, A.; Kollias, S. Complex Style Image Transformations for Domain Generalization in Medical Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5036–5045.
29. Erdaş, Ç.B. Automated fracture detection in the ulna and radius using deep learning on upper extremity radiographs. Jt. Dis. Relat. Surg. 2023, 34, 598.
30. Yoon, J.S.; Oh, K.; Shin, Y.; Mazurowski, M.A.; Suk, H.I. Domain generalization for medical image analysis: A review. Proc. IEEE 2024, 112, 1583–1609.
31. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, Munich, Germany, 5–9 October 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
32. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
33. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 3349–3364.
34. Van der Walt, S.; Schönberger, J.L.; Nunez-Iglesias, J.; Boulogne, F.; Warner, J.D.; Yager, N.; Gouillart, E.; Yu, T. scikit-image: Image processing in Python. PeerJ 2014, 2, e453.
35. Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395.
36. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
37. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
38. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
39. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
40. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
41. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Zhang, Z.; Lin, Z.; Wu, Z.; Liu, J. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
42. Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO. 2023. Available online: https://www.ultralytics.com/events/yolovision/2023 (accessed on 15 January 2025).
43. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407.
Figure 1. Overview of the entire data preprocessing, training, and inference process.
Figure 2. Example of the principal-axis estimation performed after bone segmentation. In the second panel, the white pixels are the output of the skeletonization algorithm; the red curve is the axis obtained by fitting a second-order polynomial with RANSAC, from which the average gradient is calculated.
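To make this step concrete, the following is a minimal sketch of how such an axis estimate can be computed with scikit-image [34], scikit-learn's RANSAC regressor [35], and NumPy. It is an illustrative reconstruction rather than the authors' exact implementation; the function name, residual threshold, and sampling grid are assumptions.

```python
import numpy as np
from skimage.morphology import skeletonize
from sklearn.linear_model import RANSACRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def estimate_shaft_angle(mask: np.ndarray) -> float:
    """Estimate the femoral shaft orientation (degrees from vertical) from a binary femur mask."""
    # 1. Reduce the segmented femur to a one-pixel-wide skeleton.
    skeleton = skeletonize(mask.astype(bool))
    ys, xs = np.nonzero(skeleton)

    # 2. Robustly fit x = f(y) with a 2nd-order polynomial inside RANSAC,
    #    so that spurious skeleton branches are rejected as outliers.
    model = make_pipeline(
        PolynomialFeatures(degree=2),
        RANSACRegressor(residual_threshold=5.0, random_state=0),
    )
    model.fit(ys.reshape(-1, 1), xs)

    # 3. The average gradient dx/dy along the fitted axis gives the tilt with respect to vertical.
    y_grid = np.linspace(ys.min(), ys.max(), 200)
    x_fit = model.predict(y_grid.reshape(-1, 1))
    mean_slope = float(np.mean(np.gradient(x_fit, y_grid)))
    return float(np.degrees(np.arctan(mean_slope)))
```

The image can then be rotated by the negative of this angle (for example with scipy.ndimage.rotate) so that the femoral shaft points vertically before the region of interest is cropped.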
Figure 3. Example images of the views that can appear when a femoral X-ray is taken: (a) AP, (b) ER, and (c) IR views. The red boxes indicate the IAFF.
Figure 4. Standing X-ray showing the right and left femoral areas highlighted by red boxes, representing IAFF.
Figure 5. IAFF bounding box parameters—position (x, y), width (w), height (h), and area ratio r = wh/(WH)—for (a) femoral X-ray and (b) alignment-registered femoral X-ray.
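As a small numerical illustration of these quantities, the snippet below computes the normalized center position, size, and area ratio r = wh/(WH) of a bounding box; the pixel coordinates and image size used in the example are hypothetical.

```python
def bbox_stats(x_min, y_min, x_max, y_max, W, H):
    """Normalized center (x, y), size (w, h), and area ratio r = wh / (WH) of a box."""
    w, h = x_max - x_min, y_max - y_min
    return {
        "x": (x_min + w / 2) / W,   # normalized center x
        "y": (y_min + h / 2) / H,   # normalized center y
        "w": w / W,                 # normalized width
        "h": h / H,                 # normalized height
        "r": (w * h) / (W * H),     # fraction of the image covered by the box
    }

# Hypothetical IAFF box inside a 1024 x 2048 aligned femoral crop.
print(bbox_stats(430, 900, 590, 1010, W=1024, H=2048))
```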
Figure 6. Visualization comparison on femoral X-rays: Standard (S) vs. Alignment (A). Blue = prediction, green = Ground Truth (IAFF).
Figure 7. Examples of false-positive and false-negative detections: (a) false positive on a Standard image; (b) false negative on an Alignment image.
Figure 8. Visualization of Standing AP X-ray results from the best YOLOv10n model. The red bounding box indicates the femoral portion in the Standing AP view.
Figure 9. Ablation study on RPN anchor sizes. Performance (mAP@50) of ‘alignment’ (solid blue) vs. ‘standard’ (dashed orange) methods across three anchor size groups (group 1, 2, 3) for six different backbone models. Group 2 ([8, 16, 32, 64, 128]) achieves the best performance for the ‘alignment’ method.
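For readers who want to reproduce this kind of ablation, a custom anchor configuration can be passed to a torchvision Faster R-CNN roughly as follows. This is a sketch under the assumption of a recent torchvision release (weights-based constructors); it does not reproduce the training hyperparameters of the study.

```python
import torchvision
from torchvision.models.detection.rpn import AnchorGenerator

# Group 2 anchor sizes from the ablation ([8, 16, 32, 64, 128]):
# one size per FPN level, three aspect ratios per location.
anchor_generator = AnchorGenerator(
    sizes=((8,), (16,), (32,), (64,), (128,)),
    aspect_ratios=((0.5, 1.0, 2.0),) * 5,
)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None,                          # or load pretrained weights as needed
    num_classes=2,                         # background + IAFF
    rpn_anchor_generator=anchor_generator,
)
```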
Table 1. Results of the bone segmentation.

Model | Dice Score | HD95
FCN8 (ResNet50) | 0.8793 (±0.0088) | 162.86 (±18.86)
U-Net | 0.9200 (±0.0101) | 156.21 (±16.87)
HRNet w32 | 0.9566 (±0.0029) | 31.43 (±11.05)
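For reference, the two segmentation metrics can be computed from binary masks along the following lines; this is an illustrative sketch in pixel units (no physical spacing correction) and not necessarily the exact evaluation code used in the study.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between mask boundaries (pixels)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_border = pred & ~binary_erosion(pred)
    gt_border = gt & ~binary_erosion(gt)
    # Distance from each boundary pixel of one mask to the nearest boundary pixel of the other.
    d_pred_to_gt = distance_transform_edt(~gt_border)[pred_border]
    d_gt_to_pred = distance_transform_edt(~pred_border)[gt_border]
    return float(np.percentile(np.concatenate([d_pred_to_gt, d_gt_to_pred]), 95))
```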
Table 2. Detection accuracy under Standard (S) vs. Alignment (A) preprocessing. Results (mean ± sd) are reported for mAP@50, mAP@75, and mAP@50–95.

Model Type | Backbone or Version | Data Type | mAP@50 | mAP@75 | mAP@50–95
Faster R-CNN | ResNet50 | S | 0.7089 (±0.1213) | 0.3576 (±0.1261) | 0.3277 (±0.0828)
Faster R-CNN | ResNet50 | A | 0.744 (±0.0613) | 0.4056 (±0.1918) | 0.4028 (±0.0998)
Faster R-CNN | ResNet101 | S | 0.7545 (±0.0857) | 0.4021 (±0.1108) | 0.4104 (±0.0696)
Faster R-CNN | ResNet101 | A | 0.772 (±0.0923) | 0.3926 (±0.0794) | 0.4092 (±0.0538)
Faster R-CNN | ResNet50 FPN | S | 0.7893 (±0.0604) | 0.3585 (±0.0914) | 0.4264 (±0.0577)
Faster R-CNN | ResNet50 FPN | A | 0.8069 (±0.0979) | 0.559 (±0.1034) | 0.5061 (±0.0612)
Faster R-CNN | ResNet101 FPN | S | 0.8361 (±0.1302) | 0.4807 (±0.1393) | 0.4645 (±0.0935)
Faster R-CNN | ResNet101 FPN | A | 0.8388 (±0.1115) | 0.5054 (±0.1339) | 0.4867 (±0.0792)
Faster R-CNN | ResNeXt-101-64x4d | S | 0.8015 (±0.0716) | 0.4017 (±0.1644) | 0.3949 (±0.1033)
Faster R-CNN | ResNeXt-101-64x4d | A | 0.8147 (±0.0619) | 0.4229 (±0.1309) | 0.4075 (±0.0715)
Faster R-CNN | ResNeXt-101-32x8d | S | 0.7471 (±0.0545) | 0.3901 (±0.0595) | 0.3999 (±0.059)
Faster R-CNN | ResNeXt-101-32x8d | A | 0.7675 (±0.1154) | 0.3484 (±0.1155) | 0.3869 (±0.067)
Faster R-CNN | ResNeXt-101-64x4d FPN | S | 0.8508 (±0.0451) | 0.5177 (±0.1067) | 0.4825 (±0.0577)
Faster R-CNN | ResNeXt-101-64x4d FPN | A | 0.8524 (±0.0933) | 0.6498 (±0.15) | 0.5438 (±0.0511)
Faster R-CNN | ResNeXt-101-32x8d FPN | S | 0.8078 (±0.0375) | 0.335 (±0.1452) | 0.4227 (±0.0782)
Faster R-CNN | ResNeXt-101-32x8d FPN | A | 0.8143 (±0.0849) | 0.5314 (±0.1442) | 0.4824 (±0.0659)
YOLO | v8n | S | 0.9207 (±0.0443) | 0.6837 (±0.0427) | 0.5927 (±0.0173)
YOLO | v8n | A | 0.9392 (±0.0344) | 0.6902 (±0.1358) | 0.5881 (±0.0538)
YOLO | v8s | S | 0.9219 (±0.0404) | 0.6323 (±0.1303) | 0.5746 (±0.0378)
YOLO | v8s | A | 0.9521 (±0.035) | 0.7848 (±0.0502) | 0.645 (±0.0296)
YOLO | v9t | S | 0.9201 (±0.031) | 0.7102 (±0.1565) | 0.5985 (±0.0558)
YOLO | v9t | A | 0.9442 (±0.0428) | 0.7421 (±0.0598) | 0.6148 (±0.0225)
YOLO | v9s | S | 0.9378 (±0.0412) | 0.693 (±0.1176) | 0.5955 (±0.0452)
YOLO | v9s | A | 0.9569 (±0.0292) | 0.701 (±0.0755) | 0.6246 (±0.0224)
YOLO | v10n | S | 0.9105 (±0.0463) | 0.742 (±0.1141) | 0.6198 (±0.0463)
YOLO | v10n | A | 0.9246 (±0.0449) | 0.7894 (±0.0428) | 0.6472 (±0.0513)
YOLO | v10s | S | 0.9076 (±0.0235) | 0.5941 (±0.0744) | 0.5618 (±0.0172)
YOLO | v10s | A | 0.9352 (±0.0306) | 0.7525 (±0.0693) | 0.627 (±0.0446)
YOLO | v11n | S | 0.9207 (±0.0377) | 0.7216 (±0.1192) | 0.6181 (±0.0524)
YOLO | v11n | A | 0.935 (±0.0661) | 0.7727 (±0.098) | 0.6389 (±0.0708)
YOLO | v11s | S | 0.9096 (±0.0582) | 0.7072 (±0.1499) | 0.6171 (±0.0765)
YOLO | v11s | A | 0.9499 (±0.0164) | 0.7775 (±0.1415) | 0.6377 (±0.0507)
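The YOLO rows of Table 2 can be reproduced in spirit with the Ultralytics API [42]; the weight file and dataset YAML names below are placeholders for the alignment-preprocessed IAFF dataset and are therefore assumptions.

```python
from ultralytics import YOLO

model = YOLO("yolov10n.pt")                    # pretrained starting point
model.train(data="iaff_aligned.yaml", epochs=100, imgsz=640)

metrics = model.val(data="iaff_aligned.yaml")  # evaluation on the validation split
print(metrics.box.map50, metrics.box.map75, metrics.box.map)  # mAP@50, mAP@75, mAP@50-95
```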
Table 3. Detection accuracy under Standard (S) vs. Alignment (A) preprocessing. Results (mean ± sd) are reported for Precision, Recall, and F1.

Model Type | Backbone or Version | Data Type | Precision | Recall | F1 Score
Faster R-CNN | ResNet50 | S | 0.809 (±0.0909) | 0.654 (±0.1246) | 0.7144 (±0.0812)
Faster R-CNN | ResNet50 | A | 0.8939 (±0.0719) | 0.682 (±0.1096) | 0.7685 (±0.0751)
Faster R-CNN | ResNet101 | S | 0.8757 (±0.0526) | 0.706 (±0.0847) | 0.7786 (±0.0575)
Faster R-CNN | ResNet101 | A | 0.8521 (±0.112) | 0.732 (±0.1055) | 0.7799 (±0.0715)
Faster R-CNN | ResNet50 FPN | S | 0.8397 (±0.1061) | 0.756 (±0.0814) | 0.7914 (±0.0693)
Faster R-CNN | ResNet50 FPN | A | 0.94 (±0.0679) | 0.714 (±0.1207) | 0.8083 (±0.096)
Faster R-CNN | ResNet101 FPN | S | 0.8539 (±0.1715) | 0.798 (±0.132) | 0.8163 (±0.1273)
Faster R-CNN | ResNet101 FPN | A | 0.9508 (±0.0582) | 0.738 (±0.145) | 0.8206 (±0.0885)
Faster R-CNN | ResNeXt-101-64x4d | S | 0.8767 (±0.0679) | 0.744 (±0.0945) | 0.8036 (±0.0783)
Faster R-CNN | ResNeXt-101-64x4d | A | 0.9623 (±0.0454) | 0.704 (±0.0757) | 0.8108 (±0.0533)
Faster R-CNN | ResNeXt-101-32x8d | S | 0.7637 (±0.4306) | 0.596 (±0.3346) | 0.6678 (±0.3784)
Faster R-CNN | ResNeXt-101-32x8d | A | 0.9216 (±0.0836) | 0.79 (±0.0752) | 0.8471 (±0.053)
Faster R-CNN | ResNeXt-101-64x4d FPN | S | 0.9303 (±0.0507) | 0.796 (±0.0669) | 0.8548 (±0.0265)
Faster R-CNN | ResNeXt-101-64x4d FPN | A | 0.9375 (±0.0512) | 0.802 (±0.1126) | 0.8597 (±0.0649)
Faster R-CNN | ResNeXt-101-32x8d FPN | S | 0.8745 (±0.0556) | 0.768 (±0.0587) | 0.7993 (±0.0301)
Faster R-CNN | ResNeXt-101-32x8d FPN | A | 0.8751 (±0.0845) | 0.74 (±0.1143) | 0.8133 (±0.0802)
YOLO | v8n | S | 0.946 (±0.0755) | 0.8738 (±0.0803) | 0.9064 (±0.0633)
YOLO | v8n | A | 0.954 (±0.0565) | 0.8959 (±0.0427) | 0.9226 (±0.0284)
YOLO | v8s | S | 0.9302 (±0.0378) | 0.8405 (±0.0934) | 0.8797 (±0.0387)
YOLO | v8s | A | 0.9433 (±0.042) | 0.9102 (±0.0408) | 0.9259 (±0.0334)
YOLO | v9t | S | 0.9474 (±0.0566) | 0.8843 (±0.0324) | 0.9134 (±0.024)
YOLO | v9t | A | 0.9475 (±0.0599) | 0.8941 (±0.0778) | 0.9194 (±0.065)
YOLO | v9s | S | 0.9144 (±0.0472) | 0.9135 (±0.0569) | 0.9133 (±0.0443)
YOLO | v9s | A | 0.9668 (±0.0324) | 0.9094 (±0.0505) | 0.9369 (±0.0388)
YOLO | v10n | S | 0.8952 (±0.0541) | 0.8732 (±0.0752) | 0.8827 (±0.052)
YOLO | v10n | A | 0.913 (±0.0641) | 0.8843 (±0.066) | 0.8977 (±0.0592)
YOLO | v10s | S | 0.9045 (±0.052) | 0.8511 (±0.0386) | 0.8762 (±0.0343)
YOLO | v10s | A | 0.9237 (±0.0406) | 0.8849 (±0.062) | 0.9035 (±0.0485)
YOLO | v11n | S | 0.9365 (±0.0388) | 0.8862 (±0.0676) | 0.9097 (±0.0448)
YOLO | v11n | A | 0.9178 (±0.0831) | 0.8856 (±0.0629) | 0.9001 (±0.0617)
YOLO | v11s | S | 0.9026 (±0.0761) | 0.8532 (±0.0897) | 0.8756 (±0.0745)
YOLO | v11s | A | 0.9279 (±0.0347) | 0.8986 (±0.0422) | 0.9121 (±0.0233)
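Precision, recall, and F1 are related by the harmonic mean F1 = 2PR/(P + R). Applying the formula to the mean precision and recall of one row (YOLOv8s with alignment) gives roughly 0.9265 versus the tabulated 0.9259; the small gap presumably reflects that the table averages per-fold F1 scores rather than applying the formula to averaged precision and recall.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Mean precision/recall for YOLOv8s with alignment (Table 3).
print(round(f1(0.9433, 0.9102), 4))  # ~0.9265 (table reports 0.9259 from per-fold averaging)
```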
Table 4. Detection accuracy on Standing AP X-rays. Results (mean ± sd) are reported for mAP@50, mAP@75, and mAP@50–95.

Model Type | Backbone or Version | mAP@50 | mAP@75 | mAP@50–95
Faster R-CNN | ResNet50 | 0.4576 (±0.1223) | 0.2475 (±0.0428) | 0.1609 (±0.0252)
Faster R-CNN | ResNet101 | 0.5155 (±0.1222) | 0.2486 (±0.0347) | 0.1527 (±0.0504)
Faster R-CNN | ResNet50 FPN | 0.6507 (±0.2082) | 0.2694 (±0.0272) | 0.2479 (±0.078)
Faster R-CNN | ResNet101 FPN | 0.7209 (±0.2081) | 0.301 (±0.0321) | 0.2772 (±0.0612)
Faster R-CNN | ResNeXt-101-64x4d | 0.6776 (±0.1077) | 0.2937 (±0.05) | 0.2629 (±0.0514)
Faster R-CNN | ResNeXt-101-32x8d | 0.7259 (±0.0939) | 0.2973 (±0.0787) | 0.2594 (±0.0681)
Faster R-CNN | ResNeXt-101-64x4d FPN | 0.7776 (±0.1077) | 0.2937 (±0.05) | 0.2629 (±0.0514)
Faster R-CNN | ResNeXt-101-32x8d FPN | 0.7959 (±0.0939) | 0.2973 (±0.0787) | 0.2594 (±0.0681)
YOLO | v8n | 0.8594 (±0.0696) | 0.3155 (±0.0978) | 0.4075 (±0.0414)
YOLO | v8s | 0.8892 (±0.0445) | 0.215 (±0.0317) | 0.3768 (±0.0183)
YOLO | v9t | 0.8447 (±0.0748) | 0.2574 (±0.0436) | 0.396 (±0.0354)
YOLO | v9s | 0.881 (±0.0528) | 0.2425 (±0.0825) | 0.3868 (±0.0543)
YOLO | v10n | 0.8687 (±0.0433) | 0.3602 (±0.0971) | 0.451 (±0.0238)
YOLO | v10s | 0.8924 (±0.0531) | 0.44 (±0.0458) | 0.4616 (±0.0226)
YOLO | v11n | 0.8862 (±0.0247) | 0.4602 (±0.1015) | 0.46 (±0.0371)
YOLO | v11s | 0.884 (±0.0148) | 0.2861 (±0.0458) | 0.4154 (±0.0148)
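For COCO-style mAP values of this kind, a standard way to compute them outside a detection framework is pycocotools; the ground-truth and prediction file names below are hypothetical COCO-format JSON files and the snippet is only a sketch of the evaluation protocol, not the study's own pipeline.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("standing_ap_gt.json")           # ground-truth boxes in COCO format
coco_dt = coco_gt.loadRes("predictions.json")   # model detections in COCO results format

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
# evaluator.stats[0] = mAP@50-95, stats[1] = mAP@50, stats[2] = mAP@75
```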
Table 5. Detection accuracy on Standing AP X-rays. Results (mean ± sd) are reported for Precision, Recall, and F1.

Model Type | Backbone or Version | Precision | Recall | F1 Score
Faster R-CNN | ResNet50 | 0.4678 (±0.1867) | 0.444 (±0.1876) | 0.4503 (±0.1746)
Faster R-CNN | ResNet101 | 0.4605 (±0.052) | 0.432 (±0.1475) | 0.4289 (±0.0668)
Faster R-CNN | ResNet50 FPN | 0.6667 (±0.1012) | 0.601 (±0.0976) | 0.6423 (±0.1444)
Faster R-CNN | ResNet101 FPN | 0.7166 (±0.1422) | 0.766 (±0.0422) | 0.7444 (±0.0511)
Faster R-CNN | ResNeXt-101-64x4d | 0.8015 (±0.1225) | 0.678 (±0.0977) | 0.721 (±0.0686)
Faster R-CNN | ResNeXt-101-32x8d | 0.755 (±0.1168) | 0.73 (±0.1237) | 0.7305 (±0.0743)
Faster R-CNN | ResNeXt-101-64x4d FPN | 0.8051 (±0.1315) | 0.698 (±0.0986) | 0.741 (±0.0786)
Faster R-CNN | ResNeXt-101-32x8d FPN | 0.755 (±0.1168) | 0.73 (±0.1237) | 0.7305 (±0.0743)
YOLO | v8n | 0.8188 (±0.0241) | 0.8352 (±0.0769) | 0.8258 (±0.0459)
YOLO | v8s | 0.9102 (±0.0355) | 0.8036 (±0.0975) | 0.8495 (±0.0403)
YOLO | v9t | 0.7949 (±0.0899) | 0.8369 (±0.1034) | 0.8099 (±0.0688)
YOLO | v9s | 0.8698 (±0.1091) | 0.8074 (±0.1066) | 0.8325 (±0.0821)
YOLO | v10n | 0.7986 (±0.1001) | 0.8506 (±0.0462) | 0.8212 (±0.0631)
YOLO | v10s | 0.8733 (±0.096) | 0.8049 (±0.03) | 0.8354 (±0.0453)
YOLO | v11n | 0.8832 (±0.0516) | 0.8257 (±0.0234) | 0.8531 (±0.0329)
YOLO | v11s | 0.9337 (±0.0393) | 0.7782 (±0.0228) | 0.8488 (±0.0284)
Table 6. Computational training performance. Comparison of average time per epoch (seconds) and max GPU VRAM (GB) for Standard (S) vs. Alignment (A) inputs.

Model Type | Backbone or Version | Parameters | Data Type | Training Time (s/epoch) | GPU Memory (GB)
Faster R-CNN | ResNet50 | 78.99M | S | 48.3 | 15.5
Faster R-CNN | ResNet50 | 78.99M | A | 33.1 | 8.6
Faster R-CNN | ResNet101 | 97.98M | S | 61.1 | 19
Faster R-CNN | ResNet101 | 97.98M | A | 34.3 | 9.6
Faster R-CNN | ResNet50 FPN | 40.9M | S | 46.6 | 11.4
Faster R-CNN | ResNet50 FPN | 40.9M | A | 20.9 | 3.9
Faster R-CNN | ResNet101 FPN | 59.89M | S | 56.7 | 14.9
Faster R-CNN | ResNet101 FPN | 59.89M | A | 27.1 | 5.3
Faster R-CNN | ResNeXt-101-64x4d | 136.89M | S | 97.9 | 27
Faster R-CNN | ResNeXt-101-64x4d | 136.89M | A | 44.8 | 12.1
Faster R-CNN | ResNeXt-101-32x8d | 142.22M | S | 130.6 | 28.6
Faster R-CNN | ResNeXt-101-32x8d | 142.22M | A | 51.6 | 13.7
Faster R-CNN | ResNeXt-101-64x4d FPN | 98.79M | S | 93 | 24.8
Faster R-CNN | ResNeXt-101-64x4d FPN | 98.79M | A | 34.8 | 8
Faster R-CNN | ResNeXt-101-32x8d FPN | 104.13M | S | 126.2 | 25.2
Faster R-CNN | ResNeXt-101-32x8d FPN | 104.13M | A | 42.6 | 8.1
YOLO | v8n | 3M | S | 10.8 | 1.9
YOLO | v8n | 3M | A | 9.9 | 1.2
YOLO | v8s | 11.2M | S | 10.9 | 2.6
YOLO | v8s | 11.2M | A | 10.8 | 1.9
YOLO | v9t | 2M | S | 20 | 2.3
YOLO | v9t | 2M | A | 20.7 | 1.4
YOLO | v9s | 7.1M | S | 21.6 | 3.8
YOLO | v9s | 7.1M | A | 21.6 | 2.3
YOLO | v10n | 2.3M | S | 36 | 2.5
YOLO | v10n | 2.3M | A | 75.6 | 1.6
YOLO | v10s | 7.2M | S | 39.6 | 4
YOLO | v10s | 7.2M | A | 86.4 | 2.4
YOLO | v11n | 2.6M | S | 28.8 | 2.3
YOLO | v11n | 2.6M | A | 68 | 1.3
YOLO | v11s | 9.4M | S | 32.4 | 3.5
YOLO | v11s | 9.4M | A | 72 | 2.1
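Per-epoch time and peak GPU memory of the kind reported in Table 6 can be measured in PyTorch roughly as follows; the train_one_epoch callable is a placeholder for the actual training loop, and the measurement convention (allocator peak vs. driver-reported usage) may differ from the one used in the study.

```python
import time
import torch

def timed_epoch(train_one_epoch, model, loader, optimizer, device="cuda"):
    """Run one training epoch and report wall-clock seconds and peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    train_one_epoch(model, loader, optimizer)   # user-supplied training loop
    torch.cuda.synchronize(device)
    elapsed_s = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return elapsed_s, peak_gb
```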
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
