3.2. Methodology
To identify the most effective architecture for the proposed musculoskeletal X-ray analysis framework, a comparative experimental stage was first established for body-part classification using several state-of-the-art deep learning models, as shown in Figure 1. Specifically, DenseNet-201 [19], ResNet-101 [20], VGG-19 [21], Inception-V3 [22], EfficientNet-B0 [23], and the proposed hybrid Xception-Swin architecture were trained and evaluated under the same experimental setting to classify radiographs into the seven anatomical categories: elbow, finger, forearm, hand, humerus, shoulder, and wrist.
Component (2) in Figure 1, labeled stage definition with stratified splitting, defines how the MURA dataset is partitioned into training, validation, and test subsets for each of the two stages of the framework. Stratified splitting ensures that the class proportions of normal and abnormal studies are preserved across all subsets, which is important for imbalanced medical imaging datasets. The available training data are split into 70% for training and 30% for validation, and the official MURA test set is kept fully independent. This step is unrelated to the architectural comparison: it applies uniformly to all models and both stages, and its purpose is to define the data organization before training begins, not to perform any model selection.
The comparative model-selection stage uses the five CNN baselines alongside the proposed hybrid to determine the best-performing architecture before committing it to the full seven-body-part fracture classification pipeline. All six models in this comparison were initialized with ImageNet-pretrained weights and then fine-tuned on the MURA body-part classification task; no model was used without fine-tuning. The pretrained weights provided a strong initialization that reduced training time and improved generalization on the relatively small MURA subsets, while the task-specific fine-tuning adapted the representations to musculoskeletal radiograph anatomy. This transfer learning approach is standard practice for medical image classification tasks where training from scratch on limited data would result in poor generalization. The selected comparison models represent diverse and widely used deep learning families, including conventional convolutional networks, residual learning, dense connectivity, multi-scale convolutional design, and compound-scaled architectures, all of which have shown strong utility in medical image analysis. Their inclusion provided a robust benchmark for assessing whether a hybrid convolution-transformer framework could offer measurable advantages over standard CNN-based approaches.
Algorithm S1 (from the Supplementary File of Algorithm Pseudocode) provides a consolidated pseudocode description of the complete proposed framework corresponding to the seven components illustrated in Figure 1. The algorithm is organized to follow the same sequential structure as the figure: data preprocessing and partitioning, comparative architecture screening for body-part classification, hybrid model construction and training, body-part-wise abnormality detection, and explainability analysis. Readers may refer to Algorithm S1 in the Supplementary Materials, alongside Figure 1, for a precise operational description of each step.
Based on the comparative results obtained in the body-part classification task, the proposed Xception-Swin hybrid model demonstrated the best overall performance among all evaluated architectures. It achieved the highest performance measures, indicating superior discriminative ability, stronger class-balanced prediction, and higher agreement beyond chance across the seven anatomical classes. In addition, the confusion matrix and ROC analysis showed that the model effectively separated closely related musculoskeletal regions with only limited inter-class confusion. These findings established the hybrid Xception-Swin framework as the most suitable architecture for the study. Therefore, it was selected as the final proposed model and subsequently used as the backbone for transfer learning-based body-part-wise fracture classification, optimization, testing, calibration analysis, and explainability assessment.
Following this comparative screening stage, all remaining experiments adhered to the general methodological framework described below. Standardized settings were maintained across models wherever applicable, while model-specific parameters, such as input resolution, backbone-dependent preprocessing, and architecture-specific implementation details, were adjusted according to the requirements of each network. After selecting the best-performing architecture through body-part classification, the same hybrid Xception-Swin framework was adopted in a transfer learning setting for the subsequent abnormality detection experiments across the individual anatomical subsets. In this design, knowledge learned during anatomical discrimination provided a strong initialization for body-part-wise abnormality recognition, thereby improving convergence and enabling the model to adapt more effectively to fracture-related patterns within each region.
The proposed method is a hybrid deep learning framework for fracture classification from musculoskeletal X-ray images. The specific pairing of Xception and Swin-Tiny is motivated by the diagnostic requirements of musculoskeletal radiograph analysis rather than a generic CNN-Transformer combination: Xception’s depthwise separable convolutions are particularly effective for identifying fine-grained cortical discontinuities, fracture line textures, and subtle periosteal irregularities, while Swin-Tiny’s shifted-window hierarchical attention enables the model to attend to long-range joint morphology, bone contour alignment, and structural relationships across spatially distant image regions. These two types of evidence are complementary in fracture assessment but are insufficiently captured by either architecture alone.
The selection of Swin-Tiny over other Swin configurations was driven by three practical considerations specific to this task. First, the MURA body-part-specific subsets are relatively small in size, with the humerus and forearm subsets containing only a few hundred training studies. Larger Swin variants such as Swin-Small and Swin-Base have substantially more parameters (49M and 88M, respectively, compared with 28M for Swin-Tiny) and would be at substantially higher risk of overfitting on these limited training sets without extensive regularization. Second, the input resolution of 224 × 224 used throughout this study aligns with the Swin-Tiny Patch4 Window7 configuration, making it the most compatible variant without requiring resolution adjustments. Third, Swin-Tiny achieves strong ImageNet performance comparable to ResNet-50 while operating at a fraction of the computational cost of Swin-Base, making it a practical choice for a dual-backbone architecture where two backbones must be jointly loaded and fine-tuned within an 8 GB VRAM constraint.
Regarding the consideration of other Transformer-based backbones: ViT (Vision Transformer) was considered but not selected because standard ViT variants require large-scale pretraining data to perform well and lack the hierarchical feature extraction that is beneficial for multi-scale medical image analysis. DeiT addresses the data efficiency concern but still produces non-hierarchical single-resolution features, which limits its utility in the multi-scale fusion variant of the proposed architecture. Swin Transformer was selected over these alternatives because its hierarchical shifted-window design produces multi-scale feature maps at four resolution levels, directly enabling the multi-scale spatial fusion path that is one of the two fusion strategies in the proposed dual-path design.
The framework integrates Xception and Swin-Tiny in a unified architecture, where Xception extracts fine local patterns such as cortical disruptions, fracture lines, and subtle textural irregularities, while Swin-Tiny captures long-range structural dependencies through hierarchical shifted-window self-attention. To further strengthen the learning process, both branches are initialized through transfer learning, allowing the model to start from informative pretrained representations rather than random initialization. In the first stage, this transfer learning-enabled hybrid model learns discriminative anatomical representations for body-part classification. In the second stage, the same architecture is fine-tuned for each anatomical subset to distinguish abnormal/fracture from normal/non-fracture cases. The complete pipeline therefore consists of image enhancement, input standardization, stratified train-validation splitting, imbalance-aware training, transfer learning-based hybrid feature extraction and fusion, model optimization, quantitative evaluation, and explainability analysis.
The preprocessing and augmentation details are presented in full for reproducibility. Readers primarily interested in model architecture and results may refer to Table 2, Table 3, and Table 4 as concise summaries and skip the detailed descriptions without loss of continuity.
The preprocessing stage was designed as a radiology-informed pipeline tailored to the specific visual characteristics of musculoskeletal X-ray images, rather than a generic image processing sequence. While individual techniques such as CLAHE [24] are well established, their specific configuration, ordering, and combination in this pipeline are motivated by the diagnostic properties of the MURA dataset. Fracture evidence in musculoskeletal radiographs is characteristically subtle and low-contrast, particularly in regions such as the finger, forearm, and wrist where cortical discontinuities can span only a few pixels. Standard global histogram equalization is unsuitable here because it amplifies background noise uniformly and can obscure fine osseous boundaries. CLAHE was therefore selected over alternatives because its tile-wise adaptive operation with a clip limit of 2.0 and tile grid size of 8 × 8 enhances local trabecular and cortical contrast while actively suppressing noise amplification through histogram clipping, as detailed in Table 2. Following CLAHE, fast non-local means denoising with h = 7, template window size = 7, and search window size = 21 was applied to suppress acquisition noise while preserving structural edges, an important consideration because over-smoothing at this stage can eliminate fine fracture lines. A subsequent mild unsharp masking step with amount = 0.8, radius = 1.2, and threshold = 3 was then used to selectively sharpen diagnostically meaningful edges. The threshold parameter specifically prevents amplification of insignificant pixel fluctuations introduced by the preceding denoising step, making this a complementary rather than redundant operation. After enhancement, each image was converted into a three-channel representation, resized to 224 × 224, and normalized with ImageNet statistics, namely mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], ensuring compatibility with the pretrained Xception and Swin-Tiny backbones. The specific contribution of this pipeline is therefore not in any individual technique but in the principled, task-specific configuration and sequencing of the full pipeline for musculoskeletal fracture analysis.
To preserve class distribution during model development, the original training data were divided using stratified splitting, where 30% of the available training samples were reserved for validation and the remaining 70% were used for training, as described in Table 3. A fixed random seed of 42 was used to ensure reproducibility. The independent test set was kept completely separate from all optimization and model selection steps. Validation and test images underwent deterministic processing only, typically consisting of either Resize(256) followed by CenterCrop(224) or direct resizing to 224 × 224, depending on the training implementation, followed by tensor conversion and ImageNet normalization.
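The stratified 70/30 split described above can be illustrated with a minimal sketch. This is not the authors' implementation, which is not given in the text; the function name `stratified_split` and the toy label list are assumptions made for illustration, and only the 30% validation fraction and the seed of 42 come from the description.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.30, seed=42):
    """Split sample indices so each class keeps the same train/val proportion.

    `labels` is a list of class labels, one per image. The function name and
    signature are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy data: 60 normal (0) and 40 abnormal (1) images.
labels = [0] * 60 + [1] * 40
train_idx, val_idx = stratified_split(labels)
val_abnormal = sum(labels[i] for i in val_idx)
```

In this toy case the validation subset keeps the 60/40 class ratio of the full set, which is the property that stratification is meant to guarantee.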
The evaluation in this study is conducted at the image level, meaning that each radiograph is classified independently as normal or abnormal. The original MURA benchmark [6] evaluates at the study level, where the predictions for all views belonging to a single patient study are aggregated into one diagnosis. The MURA baseline does this by averaging the per-view abnormality probabilities; majority voting and max-pooling, under which a study is classified as abnormal if any view is predicted abnormal, are common alternatives. This distinction has two practical consequences. First, image-level evaluation results in a larger effective sample size (3197 images across the public validation set versus approximately 1199 studies), which means the confusion matrix counts, confidence intervals, and kappa values reported here are not directly comparable to study-level metrics. Second, in clinical practice a single study yields one diagnosis regardless of how many views it contains, so image-level accuracy does not directly reflect clinical diagnostic accuracy. Study-level aggregation is therefore identified as a future methodological improvement. The current image-level results are reported consistently across all body parts for transparency and comparability with prior studies that also report image-level metrics on the MURA public validation set.
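To make the difference between image-level and study-level evaluation concrete, the following sketch aggregates per-image abnormality probabilities into one decision per study. It is a hypothetical helper, not part of the reported pipeline: the function name, the study identifiers, the probabilities, and the 0.5 threshold are all assumptions for illustration. The `mean` rule averages the per-view probabilities, and the `max` rule flags a study as abnormal when any single view exceeds the threshold.

```python
from collections import defaultdict


def aggregate_studies(image_probs, study_ids, rule="mean", threshold=0.5):
    """Collapse per-image abnormality probabilities into one label per study."""
    grouped = defaultdict(list)
    for prob, study in zip(image_probs, study_ids):
        grouped[study].append(prob)

    decisions = {}
    for study, probs in grouped.items():
        if rule == "mean":
            score = sum(probs) / len(probs)  # average over all views
        elif rule == "max":
            score = max(probs)               # abnormal if any view is abnormal
        else:
            raise ValueError(f"unknown rule: {rule}")
        decisions[study] = int(score > threshold)
    return decisions


# Hypothetical predictions: study s1 has three views, study s2 has two.
probs = [0.2, 0.7, 0.3, 0.9, 0.8]
studies = ["s1", "s1", "s1", "s2", "s2"]
by_mean = aggregate_studies(probs, studies, rule="mean")
by_max = aggregate_studies(probs, studies, rule="max")
```

Study s1 shows why the two rules can disagree: one confident view is enough under `max`, whereas `mean` lets the other two views outvote it.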
Since fracture datasets are naturally imbalanced, the training pipeline explicitly addressed imbalance through weighted sampling and, in one implementation, class-weighted loss. Class frequencies were computed from the training subset, and inverse-frequency sample weights were assigned to construct a WeightedRandomSampler with replacement enabled and num_samples equal to the training set size. This increased the effective exposure of minority-class samples during each epoch. In the alternative hybrid implementation, class imbalance was also handled with class-weighted cross-entropy loss, ensuring that underrepresented classes contributed more strongly to gradient updates. These mechanisms reduced majority-class bias and improved sensitivity to abnormal cases.
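The inverse-frequency weighting described above can be sketched in a few lines. This is an illustrative reconstruction under assumptions: the helper name and the toy 75/25 class split are not from the paper. In a PyTorch pipeline, the resulting list would typically be passed to `torch.utils.data.WeightedRandomSampler` with `replacement=True` and `num_samples` equal to the training set size, as the text states.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Assign each sample a weight of 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy training subset: 75 normal (0) and 25 abnormal (1) images.
labels = [0] * 75 + [1] * 25
weights = inverse_frequency_weights(labels)

# Share of the total sampling probability that each class receives.
total = sum(weights)
class_mass = {
    cls: sum(w for w, y in zip(weights, labels) if y == cls) / total
    for cls in sorted(set(labels))
}
```

Because every class ends up with the same total weight, a sampler drawing with these weights is expected to see both classes equally often per epoch, even though the minority class has a third as many images.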
To improve generalization, online data augmentation was applied only to training images, as detailed in Table 4. In the multi-run hybrid setup, the augmentation sequence included RandomResizedCrop (224, scale = 0.85–1.0), RandomHorizontalFlip (p = 0.5), RandomRotation (±10°), RandomAffine (translation = 0.05, scale = 0.95–1.05), and ColorJitter (brightness = 0.10, contrast = 0.15). In the more extensive hybrid implementation, stronger augmentation was used, including RandomResizedCrop (224, scale = 0.80–1.0), RandomHorizontalFlip (p = 0.5), RandomVerticalFlip (p = 0.15), RandomRotation (±15°), RandomAffine (translation = 0.06, shear = 8°), ColorJitter (brightness = 0.20, contrast = 0.20, saturation = 0.10, hue = 0.03), and RandomGrayscale (p = 0.03). Additional tensor-level perturbations included Gaussian noise with standard deviation 0.01–0.06 and probability 0.6, salt-and-pepper noise with corruption amount 0.002–0.012 and probability 0.5, and RandomErasing (p = 0.20, scale = 0.002–0.05, ratio = 0.3–3.3).
These augmentations were intentionally bounded to preserve anatomical plausibility while increasing training diversity. The augmentation design is not a generic image augmentation strategy but is specifically constrained by the clinical properties of musculoskeletal radiographs. The vertical flipping probability was set to p = 0.15 rather than p = 0.5 because inverted radiographs are occasionally encountered in clinical practice but are not the norm, and over-applying this transformation would produce unrealistic training examples. Rotation was bounded to ±15° because larger rotations would misrepresent the acquisition geometry of upper-extremity X-rays. Hue jitter was limited to 0.03 because musculoskeletal radiographs carry diagnostic information primarily through intensity and texture rather than color. RandomErasing was applied at a very small scale (0.002–0.05) to simulate localized acquisition artifacts such as soft tissue overlaps and metallic implant shadows, which are specifically relevant in musculoskeletal imaging. Each augmentation parameter was therefore selected based on clinical plausibility rather than optimized as a general-purpose setting. The full augmentation settings are provided in Table 4 for reproducibility. Readers may treat this table as a technical reference rather than a primary contribution of the paper.
The proposed hybrid architecture combined pretrained Xception and Swin-Tiny backbones, as described in Table 5. In one implementation, both networks produced global pooled embeddings that were projected into a common latent space of 512 dimensions. Fusion was configured as attention-based fusion, where the projected Xception and Swin embeddings were treated as tokens and passed through a MultiheadAttention [25] block with 8 heads and dropout = 0.1. The fused representation was then normalized and passed through a classifier consisting of Linear → GELU → Dropout (0.3) → Linear, with an intermediate hidden layer of 256 units. In the other hybrid implementation, the model extracted multi-scale feature maps from Swin-Tiny using out_indices = (0, 1, 2, 3) and from Xception using out_indices = (2, 3). These feature maps were projected through 1 × 1 convolutions into a common channel space of 128 channels, spatially aligned, concatenated, and passed through sequential Conv-BatchNorm-ReLU fusion blocks. After global average pooling, the fused descriptor was regularized with dropout = 0.4 and fed to a final fully connected classification layer.
Both implementations follow the same methodological principle: local CNN-derived and global transformer-derived representations are fused to improve fracture classification. Other hybrid combinations, such as ResNet plus Swin or DenseNet plus ViT, were not experimentally compared in this study. This absence of direct comparison is a limitation, and the rationale for the chosen pairing is therefore based on architectural reasoning rather than exhaustive empirical ablation. Specifically, Xception was preferred over ResNet as the CNN branch because its depthwise separable convolutions produce more parameter-efficient local feature extraction at the same depth, which is advantageous when jointly training with a transformer branch under GPU memory constraints. DenseNet was not selected as the CNN branch because its dense connectivity pattern produces feature maps with strong inter-layer redundancy, which may conflict with the attention mechanism in the fusion layer that is designed to selectively weight distinct CNN and Transformer representations. These design decisions represent deliberate architectural choices rather than arbitrary defaults, and a systematic ablation comparing alternative hybrid combinations is identified as a direction for future work.
Training was carried out using transfer learning with ImageNet-pretrained weights, as summarized in Table 6. The second stage adopts a three-phase fine-tuning strategy to stabilize optimization of the dual-backbone model during body-part-wise abnormality detection. This design choice is motivated by a concrete problem: Xception and Swin-Tiny have fundamentally different internal representations, learning dynamics, and gradient scales. If both backbones are unfrozen simultaneously from the start, the randomly initialized fusion and classifier layers produce large gradients that destabilize the pretrained backbone weights before they can contribute meaningful representations. The three-phase strategy prevents this by decoupling head adaptation from backbone adaptation and by sequentially introducing backbone parameters only after the fusion layer has converged to a stable operating point. In the attention-based hybrid implementation, a staged fine-tuning strategy was therefore applied as follows. During Phase 1, both backbones were frozen and only the head (the fusion and classifier layers) was trained, allowing the fusion mechanism to learn a stable mapping between the two fixed representation spaces. During Phase 2, the final Xception blocks and the last Swin layer were unfrozen for partial adaptation, enabling the backbones to adjust their highest-level representations toward fracture-relevant musculoskeletal features while retaining the lower-level pretrained structure. During Phase 3, the entire model was unfrozen for full fine-tuning with a lower backbone learning rate, ensuring global parameter refinement without catastrophic forgetting of pretrained representations.
Optimization used AdamW [26] with lr_backbone = 1 × 10⁻⁵, lr_head = 1 × 10⁻⁴, and weight decay = 1 × 10⁻⁴. Training ran for 100 epochs, with batch size = 8, label smoothing = 0.1, and CosineAnnealingLR scheduling with eta_min = 1 × 10⁻⁷. In the multi-scale fusion implementation, AdamW was also used, with learning rate = 1 × 10⁻⁴, weight decay = 1 × 10⁻⁴, batch size = 8, and 50 epochs, again with cosine annealing. Automatic mixed precision (AMP) was enabled on CUDA devices in both settings to reduce memory use and improve training efficiency.
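The effect of cosine annealing on the two learning rates can be illustrated with the closed-form schedule, η_t = η_min + ½(η_0 − η_min)(1 + cos(πt / T_max)). The sketch below evaluates this expression directly instead of calling a training framework. The value T_max = 100 is an assumption (the schedule length is not stated in the text and is taken here to equal the number of epochs), and the function name is illustrative.

```python
import math


def cosine_annealing_lr(epoch, base_lr, eta_min=1e-7, t_max=100):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rates at the start, midpoint, and end of a 100-epoch run.
checkpoints = (0, 50, 100)
head_lrs = [cosine_annealing_lr(e, base_lr=1e-4) for e in checkpoints]
backbone_lrs = [cosine_annealing_lr(e, base_lr=1e-5) for e in checkpoints]
```

Both parameter groups follow the same curve shape, so the tenfold gap between head and backbone learning rates narrows only as both approach eta_min at the end of training.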
To avoid overfitting, early stopping was applied based on validation performance, as detailed in Table 6. In the attention-based hybrid implementation, early stopping used patience = 12 and min_delta = 1 × 10⁻⁴. In the multi-scale fusion implementation, the stopping parameters were patience = 7 and min_delta = 1 × 10⁻⁴. The best model was selected according to validation performance, using validation AUC when available and validation F1-score as a fallback. This ensured that the final test results corresponded to the most generalizable checkpoint rather than the last epoch.
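A minimal sketch of patience-based early stopping with a minimum improvement threshold is given below. It assumes a metric where higher is better, such as validation AUC; the class name, its interface, and the toy score sequence are illustrative assumptions and are not taken from the authors' code. A patience of 3 is used in the example only to keep the trace short.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=12, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record one validation score; return True when training should stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            # Genuine improvement: remember this checkpoint and reset the counter.
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation AUC per epoch.
val_auc = [0.80, 0.85, 0.85005, 0.84, 0.83, 0.86]
stopper = EarlyStopping(patience=3, min_delta=1e-4)
stopped_at = None
for epoch, score in enumerate(val_auc):
    if stopper.step(score, epoch):
        stopped_at = epoch
        break
```

The gain at epoch 2 is smaller than min_delta and therefore counts as no improvement, so training halts at epoch 4 and the checkpoint from epoch 1 is the one retained.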
The combination of differential learning rates for backbone and head, cosine annealing scheduling, label smoothing, and phase-gated unfreezing collectively addresses the specific optimization challenges of a dual-backbone hybrid trained on small body-part-specific subsets with class imbalance, and represents a principled training design rather than a default fine-tuning procedure.
Performance was assessed using multiple complementary metrics. Standard classification metrics included accuracy, precision, recall, F1-score, and AUC, with additional use of macro F1, weighted F1, balanced accuracy, average precision, and Cohen’s kappa in some experiments. Confusion matrices were generated to quantify true positives, true negatives, false positives, and false negatives, from which sensitivity and specificity were derived. In addition to discrimination, model calibration was analyzed using expected calibration error (ECE) with 15 bins. Reliability diagrams were produced by comparing confidence and empirical accuracy across bins. In this context, metrics such as accuracy, F1, precision, recall, AUC, sensitivity, and specificity are interpreted as higher is better (↑), whereas loss and ECE are interpreted as lower is better (↓).
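The expected calibration error with 15 bins can be sketched as follows. This is an illustrative implementation under assumptions: equal-width confidence bins over [0, 1] with left-closed edges, and the predicted-class probability used as the confidence. The exact binning convention used in the study is not stated, and the function name and toy predictions are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted mean gap between average confidence and accuracy across bins."""
    bin_conf = [0.0] * n_bins
    bin_hits = [0.0] * n_bins
    bin_count = [0] * n_bins

    for conf, hit in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bin_conf[b] += conf
        bin_hits[b] += hit
        bin_count[b] += 1

    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if bin_count[b] == 0:
            continue  # empty bins contribute nothing
        avg_conf = bin_conf[b] / bin_count[b]
        accuracy = bin_hits[b] / bin_count[b]
        ece += (bin_count[b] / n) * abs(accuracy - avg_conf)
    return ece


# Four hypothetical predictions: (confidence, 1 if correct else 0).
confidences = [0.95, 0.95, 0.65, 0.65]
correct = [1, 1, 1, 0]
ece = expected_calibration_error(confidences, correct)
```

The per-bin average confidence and accuracy computed inside the loop are the same quantities that a reliability diagram plots against each other.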
Finally, explainability analysis was performed to examine whether the hybrid model focused on clinically meaningful regions, as detailed in Table 7. The explainability pipeline used Grad-CAM [27], Grad-CAM++ [28], and occlusion sensitivity [29] on the independent test set. Input images remained at 224 × 224, while explanation figures were exported at a 512-pixel output size and 300 DPI for improved visual quality. Up to 5 samples per class were selected for qualitative analysis. Grad-CAM and Grad-CAM++ were computed by attaching hooks to the last convolutional feature layer and backpropagating class gradients, whereas occlusion sensitivity iteratively masked local patches to measure changes in confidence.
The resulting heatmaps were overlaid on the original radiographs to confirm that the model attended primarily to fracture-relevant bone regions. The role of the explainability analysis in this study is not to introduce novel XAI algorithms but to serve two specific functional purposes: first, to verify that the dual-path hybrid model attends to clinically meaningful anatomy rather than background artifacts or acquisition markers, which is a necessary quality check for any model intended for clinical decision support; and second, to provide cross-validated interpretation by comparing gradient-based maps from Grad-CAM and Grad-CAM++ with perturbation-based maps from occlusion sensitivity, so that convergent regions can be identified with higher confidence than any single method would allow. These two purposes are directly relevant to clinical trustworthiness and are not served by the single-method XAI approaches used in most prior studies on the MURA dataset.
To provide a quantitative measure of cross-method agreement, pairwise Spearman rank correlations were computed between the three heatmap activation maps (Grad-CAM, Grad-CAM++, and occlusion sensitivity) for each test sample and averaged across all body parts. This metric quantifies whether the three methods consistently identify the same image regions as important, without requiring pixel-level fracture annotations. Higher correlation between gradient-based and perturbation-based methods specifically indicates that highlighted regions are not only visually salient but also functionally important for the model’s prediction.
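The cross-method agreement measure can be illustrated with a small, dependency-free sketch of the Spearman rank correlation between two heatmaps. It assumes the maps have already been resized to a common resolution and are supplied as nested lists; the helper names and the 2 × 2 toy maps are hypothetical. Tied values receive average ranks, and a constant map is not handled because its rank variance is zero.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(map_a, map_b):
    """Spearman correlation between two equally sized 2D heatmaps."""
    ra = average_ranks([v for row in map_a for v in row])
    rb = average_ranks([v for row in map_b for v in row])
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(ra, rb))
    var_a = sum((a - mean_a) ** 2 for a in ra)
    var_b = sum((b - mean_b) ** 2 for b in rb)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 2 x 2 activation maps for one test image.
grad_cam = [[0.1, 0.4], [0.7, 0.9]]
occlusion = [[0.0, 0.3], [0.5, 0.8]]
inverted = [[0.9, 0.7], [0.4, 0.1]]
rho_agree = spearman(grad_cam, occlusion)
rho_disagree = spearman(grad_cam, inverted)
```

Because only the ordering of pixels matters, two maps with different value scales but the same ranking of regions score 1, which is why a rank-based measure suits heatmaps produced by different attribution methods.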
Beyond this cross-method agreement measure, the explainability analysis in this study is qualitative: the heatmaps are not compared quantitatively against ground-truth fracture locations. A quantitative overlap metric such as the Dice Similarity Coefficient between heatmap activation regions and annotated fracture locations, as used by Lysdahlgaard [14], could not be computed because pixel-level fracture annotations are not available in the MURA dataset, which provides only study-level normal or abnormal labels without bounding box or segmentation annotations. Conducting such a localization-based XAI evaluation would therefore require either a separate annotated dataset or an independent clinical reader study, both of which are identified as future work directions.