1. Introduction
Shoulder radiography remains a fundamental modality for initial diagnosis in the musculoskeletal domain. The true anteroposterior (true-AP) view plays a particularly crucial role in evaluating the glenohumeral joint and subacromial space, serving as an essential tool for detecting pathological findings such as joint space narrowing, Hill–Sachs lesions, and changes in acromiohumeral distance [1,2]. Despite its diagnostic importance, obtaining consistently adequate image quality presents significant challenges in clinical practice. Even slight deviations in radiographic positioning parameters, including trunk rotation, X-ray tube angulation, and upper extremity rotation, can cause anatomical structure overlap, compromising diagnostic accuracy and necessitating retake examinations [3].
In general projection radiography, retake rates of approximately 10% have been reported, varying by facility and study period [4,5]. Although shoulder-specific breakdowns are limited, suboptimal positioning is a common contributor, and appropriate positioning remains essential for shoulder projections, including the true anteroposterior view. This elevated retake rate represents a multifaceted problem in clinical radiology: it increases cumulative patient radiation exposure, reduces examination efficiency, prolongs patient wait times, and results in inefficient utilization of medical resources. The cumulative effect of these issues impacts both patient care quality and healthcare system efficiency, highlighting the need for systematic approaches to quality improvement in shoulder radiography [4].
Traditional quality assurance in radiography has relied primarily on the empirical judgment of experienced radiographers and retrospective quality audits. While these approaches have served as the cornerstone of radiographic quality control, they possess inherent limitations. The subjective nature of visual assessment leads to inter-observer variability, even among experienced professionals. Additionally, the dependency on human resources constrains scalability and real-time feedback capabilities, particularly in high-volume clinical settings or facilities with limited expert staffing [6,7]. These limitations underscore the need for objective, automated quality assessment systems that can provide consistent and immediate feedback.
Recent advances in deep learning technology have demonstrated promising applications in medical image quality assessment. Convolutional neural network (CNN)-based systems have achieved significant success in chest radiography and mammography, where automated quality evaluation has shown clinical utility in detecting positioning errors, identifying artifacts, and ensuring diagnostic adequacy [8,9]. These successes suggest the potential for similar applications in musculoskeletal imaging. However, the translation of these technologies to musculoskeletal radiography, particularly shoulder imaging, remains relatively unexplored.
The availability of large-scale annotated datasets has been a critical factor in advancing deep learning applications in medical imaging. The MURA (musculoskeletal radiographs) dataset, comprising 40,561 upper extremity radiographs from 12,173 patients across seven anatomical regions, represents a valuable resource for developing and validating AI systems in musculoskeletal imaging [10]. In upper-limb radiography, deep-learning studies have primarily focused on abnormality detection, fracture identification, and disease classification [10,11,12]. The specific challenge of automated quality assessment for immediate retake determination—a practical need at the point of image acquisition—has received limited attention in the literature.
The implementation of AI systems in clinical radiology faces a critical requirement beyond performance metrics: explainability. Healthcare professionals require understanding of AI decision-making processes to trust and effectively utilize these systems [13,14]. The “black box” nature of deep learning models poses a significant barrier to clinical adoption, particularly in scenarios where AI recommendations directly influence patient care decisions. Visualization techniques such as gradient-weighted class activation mapping (Grad-CAM) have emerged as valuable tools for interpreting CNN decisions by highlighting image regions that contribute to model predictions [15]. However, the application of these explainability methods in musculoskeletal radiography quality assessment remains largely unexplored.
A key gap in current explainability research is the lack of quantitative validation against expert judgment. While many studies present qualitative visualizations of model attention, few have established objective metrics to measure the concordance between AI-generated attention maps and the regions that expert radiographers consider diagnostically relevant [16]. This gap is particularly pronounced in shoulder radiography, where the complex anatomical relationships and subtle positioning variations require precise localization of assessment criteria. The development of quantitative explainability metrics that align with expert consensus represents a critical need for advancing clinically acceptable AI systems.
The anatomical complexity of the shoulder joint presents unique challenges for automated quality assessment. Unlike chest radiography, where standardized landmarks are relatively consistent across patients, shoulder anatomy exhibits significant individual variation in bone morphology, joint angles, and soft tissue profiles. This variability necessitates approaches that can adaptively focus on relevant anatomical structures while maintaining robustness to patient-specific variations. The concept of anatomical localization—explicitly identifying and focusing on specific joint structures—mirrors the visual workflow of experienced radiographers who systematically evaluate specific anatomical relationships when assessing image quality.
Current object detection technologies, particularly those based on the YOLO (you only look once) family of architectures, have demonstrated remarkable success in identifying and localizing anatomical structures in medical images [17,18]. The evolution to YOLOX represents further improvements in detection accuracy and computational efficiency [19]. Similarly, advanced CNN architectures combining inception modules with residual connections, such as Inception-ResNet-v2, have shown superior performance in medical image classification tasks [20,21,22,23]. The integration of these complementary technologies—object detection for anatomical localization and sophisticated CNNs for quality classification—presents an opportunity to develop systems that mirror human visual assessment workflows. Rather than proposing a new algorithm, we present a task-tailored two-stage pipeline that adapts established detection and classification models to shoulder true-AP quality assessment and integrates Grad-CAM-based explanations.
The clinical workflow implications of automated quality assessment extend beyond technical performance. An effective system must provide immediate feedback at the point of image acquisition, enabling radiographers to make informed decisions about retake necessity before patient departure. This requirement necessitates not only accurate assessment, but also computational efficiency compatible with clinical imaging systems. Furthermore, the system must provide interpretable outputs that radiographers can understand and verify, supporting rather than replacing professional judgment [24,25].
The purpose of this study is to develop and validate an automated quality assessment system for shoulder true-AP radiographs that addresses these multifaceted challenges. Our approach employs a two-stage deep learning pipeline designed to mirror the human visual assessment process: first localizing the relevant anatomical region, then performing focused quality evaluation. This methodology aims to achieve two critical objectives: (1) providing objective, consistent quality assessment that reduces unnecessary retake examinations; and (2) ensuring clinical interpretability through quantitative explainability metrics that align with expert judgment. By addressing these objectives, this research seeks to establish a foundational framework for AI-assisted quality control in shoulder radiography, with potential implications for broader applications in musculoskeletal imaging.
2. Materials and Methods
2.1. Study Design and Workflow Overview
This retrospective study utilized the publicly available MURA v1.1 (musculoskeletal radiographs) dataset [10]. The dataset comprises 40,561 upper-extremity radiographs from 12,173 patients, covering seven anatomical regions: elbow, finger, forearm, hand, humerus, shoulder, and wrist. For this analysis, we focused exclusively on anteroposterior shoulder radiographs from the “SHOULDER” folder within the MURA dataset. We initially identified 2956 shoulder AP radiographs. Of these, 59 images were excluded due to negative–positive inversion, excessive metallic implants, extreme over- or under-exposure, or presumed fluoroscopic acquisition. Three radiographers independently reviewed all remaining images and assigned OK/NG quality labels by consensus. We then constructed a class-balanced dataset of 2800 shoulder radiographs (1400 OK and 1400 NG) for model development. Pre-adjudication inter-rater agreement statistics were not recorded for all cases.
All images were provided in PNG format with varying dimensions. As preprocessing steps, rectangular images were zero-padded to create square dimensions, then resized to 512 × 512 pixels. Single-channel images were replicated across three channels for consistency. Image quality assessment criteria were established through consensus among three radiographers. Images showing clear glenohumeral joint space without excessive overlap between the humeral head and glenoid were classified as “OK” (acceptable quality), while those with significant anatomical overlap obscuring the joint space were classified as “NG” (requiring retake) (Figure 1).
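As an illustration, the following MATLAB sketch reproduces this preprocessing under stated assumptions (an input raster array img; the function name preprocessShoulder is hypothetical); it is not the exact implementation used in the study.

```matlab
% Minimal preprocessing sketch: zero-pad to a square, resize to 512 x 512,
% and replicate single-channel images across three channels.
function out = preprocessShoulder(img)
    [h, w, c] = size(img);
    s = max(h, w);
    padded = zeros(s, s, c, 'like', img);   % zero-pad the shorter dimension
    padded(1:h, 1:w, :) = img;
    out = imresize(padded, [512 512]);      % square resize to the working resolution
    if size(out, 3) == 1
        out = repmat(out, [1 1 3]);         % grayscale -> three identical channels
    end
end
```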
The analytical framework employed a two-stage deep learning pipeline. Stage one utilized YOLOX [19] for glenohumeral joint detection, with the region of interest defined as encompassing the acromion tip, humeral head, and glenohumeral joint space (Figure 2). Ground truth bounding boxes were established through consensus annotation by three radiographers. Stage two employed Inception-ResNet-v2 [20] for binary quality classification (OK/NG), processing both whole images and locally cropped regions extracted using stage-one detection boxes, both of which were resized to 299 × 299 pixels to match the input resolution of Inception-ResNet-v2. The overall workflow of our two-stage pipeline is illustrated in Figure 3. Explainability was implemented via Grad-CAM for both classifiers; maps were overlaid and summarized using coverage metrics as detailed in Section 3.3.
Data partitioning followed a 5-fold cross-validation scheme. Each fold contained 2240 training images (1120 OK, 1120 NG) and 560 test images (280 OK, 280 NG). Offline data augmentation was applied to training data through brightness scaling (0.50, 0.75, 1.00, 1.25, 1.50) and horizontal flipping, resulting in a 10-fold increase in training samples. No online augmentation was performed during training. Computational resources included two NVIDIA RTX A6000 GPUs (NVIDIA, Santa Clara, CA, USA) for parallel processing. Model development, training, evaluation, and software integration were performed using MATLAB (version 2025a; MathWorks, Inc., Natick, MA, USA). No deployment-context runtime profiling was conducted in this study.
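A minimal MATLAB sketch of this offline augmentation, assuming a preprocessed training image img (illustrative only, not the authors’ exact code), is shown below; the five brightness scales combined with horizontal flipping yield the ten variants per training image.

```matlab
% Offline augmentation sketch: 5 brightness scales x {original, horizontally flipped} = 10 images.
scales = [0.50 0.75 1.00 1.25 1.50];
augmented = cell(1, 2 * numel(scales));
k = 0;
for s = scales
    bright = im2uint8(im2double(img) * s);      % brightness scaling, clipped to the valid range
    k = k + 1; augmented{k} = bright;           % brightness-scaled image
    k = k + 1; augmented{k} = flip(bright, 2);  % horizontally flipped counterpart
end
```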
2.2. Object Detection for Glenohumeral Joint Localization
Detection principle: we adopted an anchor-free, one-stage YOLOX detector with a decoupled head that jointly predicts objectness/class probabilities and bounding-box geometry on dense, multi-scale feature maps. During inference, per-cell predictions are filtered by confidence and consolidated with non-maximum suppression (NMS) to yield the glenohumeral joint box.
Detector configuration and training protocol: we developed a single-class detector (class label “ShoulderJoint”) to localize the glenohumeral joint region on shoulder anteroposterior radiographs. The complete dataset of 2800 images (1400 OK, 1400 NG) was stratified by quality label and divided into five equal subsets for 5-fold cross-validation, with 2240 training images and 560 test images per fold. The model employed a CSPDarknet53 backbone with a feature pyramid network (FPN) for multi-scale feature fusion, initialized from COCO pre-trained weights [26]. Following the YOLOX specification, the loss comprised complete IoU (CIoU) for bounding-box regression and binary cross-entropy for objectness and class predictions. Optimization used stochastic gradient descent (SGD) with momentum 0.9, weight decay 5 × 10⁻⁴, an initial learning rate of 0.01, and a cosine annealing schedule; training ran for 3 epochs with batch size 128. No Bayesian optimization or systematic hyperparameter search was conducted; these settings were fixed a priori based on prior reports and preliminary sanity checks. Inference used the default YOLOX post-processing (confidence filtering and NMS).
Performance was evaluated on the test set of each fold using average precision at an IoU threshold of 0.5 (AP@0.5) and the Dice similarity coefficient (DSC = 2|A∩B|/(|A| + |B|)) between predicted and ground-truth bounding boxes, with the DSC averaged across all test images. This detector provides the region-of-interest crop for the local classifier (Section 2.3); the end-to-end inference protocol is described in Section 2.4.
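For reference, the box-level metrics can be computed as in the MATLAB sketch below, which assumes hypothetical variable names (predBox, gtBox, detectionResults, groundTruthBoxes) and boxes in [x y width height] format.

```matlab
% Dice similarity coefficient between a predicted and a ground-truth bounding box.
interArea = rectint(predBox, gtBox);    % overlap area of the two rectangles
dsc = 2 * interArea / (predBox(3) * predBox(4) + gtBox(3) * gtBox(4));

% AP@0.5 over a fold's detections (Computer Vision Toolbox).
ap50 = evaluateDetectionPrecision(detectionResults, groundTruthBoxes, 0.5);
```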
2.3. Image Classification for Quality Assessment
A binary classifier was developed to assess radiographic quality (OK/NG) in shoulder anteroposterior images. The dataset of 2800 images (1400 OK and 1400 NG) was stratified by quality labels and divided into five subsets for cross-validation. Each fold contained 2240 training images (1120 OK, 1120 NG) and 560 test images (280 OK, 280 NG). The aforementioned offline augmentation expanded the training set by 10-fold.
The classification architecture employed Inception-ResNet-v2, leveraging its hybrid inception modules and residual connections for enhanced feature extraction. The network was modified with a two-class softmax output layer for binary classification. Training utilized a batch size of 128 for 3 epochs with Adam optimizer, initial learning rate 0.001, and categorical cross-entropy loss. During inference, the class with maximum probability was assigned as the predicted label.
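A minimal MATLAB sketch of this adaptation and training configuration is given below; it assumes the pretrained inceptionresnetv2 support package and an augmented 299 × 299 × 3 training datastore imdsTrain, and the final-layer names may vary by release (inspect lgraph.Layers before running). It is illustrative rather than the exact code used in this study.

```matlab
% Adapt Inception-ResNet-v2 to a two-class (OK/NG) output and set the training options.
net    = inceptionresnetv2;                 % pretrained network (support package required)
lgraph = layerGraph(net);
% Replace the 1000-class ImageNet head with a 2-class head; confirm layer names in lgraph.Layers.
lgraph = replaceLayer(lgraph, 'predictions', fullyConnectedLayer(2, 'Name', 'fc_okng'));
lgraph = replaceLayer(lgraph, 'ClassificationLayer_predictions', classificationLayer('Name', 'out_okng'));

opts = trainingOptions('adam', ...
    'InitialLearnRate', 1e-3, ...
    'MiniBatchSize', 128, ...
    'MaxEpochs', 3, ...
    'Shuffle', 'every-epoch', ...
    'Verbose', false);

% trainedNet = trainNetwork(imdsTrain, lgraph, opts);   % imdsTrain: training image datastore
```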
Evaluation metrics included recall, precision, F1 score, and area under the receiver operating characteristic curve (AUC). To address potential class imbalance effects, all classification metrics except AUC were calculated using macro-averaging. Metrics were computed for individual folds and summarized as mean ± standard deviation across folds.
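The macro-averaged metrics follow directly from the per-fold confusion matrix, as in the sketch below (hypothetical variables yTrue, yPred, and scoresNG, the predicted probability of the NG class).

```matlab
% Macro-averaged recall, precision, and F1 from a 2 x 2 confusion matrix (classes: OK, NG).
C = confusionmat(yTrue, yPred);             % rows: true class, columns: predicted class
recallPer    = diag(C) ./ sum(C, 2);        % per-class recall
precisionPer = diag(C) ./ sum(C, 1)';       % per-class precision
f1Per = 2 * recallPer .* precisionPer ./ (recallPer + precisionPer);
macroRecall    = mean(recallPer);
macroPrecision = mean(precisionPer);
macroF1        = mean(f1Per);

% ROC AUC using the NG-class scores (Statistics and Machine Learning Toolbox).
[~, ~, ~, auc] = perfcurve(yTrue, scoresNG, 'NG');
```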
2.4. Integrated System Development and Explainability Assessment
We operationalized the two-stage design into a fixed inference protocol that links detection and classification (Figure 3). This system mimics the radiographer’s visual workflow by first localizing the anatomical region of interest, then performing focused quality assessment. The evaluation framework rigorously followed the 5-fold cross-validation structure to ensure unbiased assessment.
For each test image in each fold, the workflow proceeded as follows: (1) whole image classification using the fold-specific model, generating Grad-CAM visualizations for the predicted class; (2) glenohumeral joint detection using the fold-specific detector; (3) extraction of the local region using detected bounding boxes; and (4) local image classification using a separately trained fold-specific model on cropped images, generating corresponding Grad-CAM visualizations. This approach enabled paired comparison of whole-image and local-image attention patterns for identical cases.
Explainability validation focused on quantifying agreement between Grad-CAM attention regions and expert-defined anatomical relevance. Three radiographers established consensus assessments through the following protocol: Grad-CAM activations were normalized to [0,1] range, and contour lines at thresholds τ = 0.50 and τ = 0.75 were evaluated for overlap with the glenohumeral joint region. Each image–model combination (whole/local) received binary scoring (0/1), indicating presence or absence of joint coverage. Mean scores across all test images defined the joint ROI coverage metrics: Coverage@50 (τ = 0.50) and Coverage@75 (τ = 0.75). Assessment was performed blind to classification outcomes and confidence scores to minimize observation bias.
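The sketch below illustrates, under assumptions (a trained network trainedNet, an input image img, and the predicted label predictedLabel), how the normalized Grad-CAM map and its τ = 0.50/0.75 contours can be generated for the overlay that the radiographers scored; the binary coverage scoring itself was performed visually by the raters.

```matlab
% Generate a Grad-CAM map for the predicted class, normalize it to [0, 1],
% and overlay the tau = 0.50 and tau = 0.75 contour lines for visual scoring.
scoreMap = gradCAM(trainedNet, img, predictedLabel);     % Deep Learning Toolbox gradCAM
camNorm  = (scoreMap - min(scoreMap(:))) ./ (max(scoreMap(:)) - min(scoreMap(:)) + eps);
figure; imshow(img, []); hold on;
contour(camNorm, [0.50 0.50], 'LineColor', 'y', 'LineWidth', 1.5);   % tau = 0.50 contour
contour(camNorm, [0.75 0.75], 'LineColor', 'r', 'LineWidth', 1.5);   % tau = 0.75 contour
hold off;
```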
Statistical analysis employed a two-tier approach. Primary analysis used McNemar’s test with continuity correction to evaluate paired binary outcomes (whole vs. local coverage) within each fold-class combination. Secondary analysis aggregated fold-level differences (d = Local − Whole) by quality class (OK/NG), applying paired t-tests to estimate mean differences with 95% confidence intervals. All statistical tests used two-sided significance level α = 0.05. This comprehensive evaluation framework determined whether localized or whole-image models provide more anatomically relevant attention patterns for quality assessment decisions.
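A minimal sketch of these tests, assuming hypothetical variables (per-image logical coverage vectors wholeHit and localHit for one fold-class combination, and a vector dFold of fold-level Local − Whole differences), is given below.

```matlab
% McNemar's test with continuity correction on paired binary coverage outcomes.
b = sum(wholeHit & ~localHit);          % discordant pairs: whole-image covers, local does not
c = sum(~wholeHit & localHit);          % discordant pairs: local covers, whole-image does not
chi2 = (abs(b - c) - 1)^2 / (b + c);    % continuity-corrected test statistic
pMcNemar = 1 - chi2cdf(chi2, 1);        % p-value, 1 degree of freedom

% Paired t-test on fold-level differences d = Local - Whole (95% CI for the mean difference).
[~, pT, ci] = ttest(dFold);
```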
4. Discussion
Our study demonstrates the successful development and validation of a two-stage deep learning system for automated quality assessment in shoulder true-AP radiographs. The integration of anatomical localization through object detection with focused quality classification represents a methodologically novel approach that mirrors the visual workflow of experienced radiographers. The exceptional detection performance (AP@0.5 = 1.00, mean DSC = 0.967) and robust classification metrics (AUC = 0.977, F1 = 0.943) establish the technical feasibility of automated quality assessment in shoulder radiography.
The near-perfect glenohumeral joint detection achieved in our study surpasses previously reported performance for anatomical localization in musculoskeletal radiographs. While previous studies using YOLO-based architectures for skeletal structure detection reported AP@0.5 values ranging from 0.85 to 0.93 [27], our refined approach with YOLOX and carefully curated annotations achieved consistent perfect detection across all folds. This improvement likely reflects both architectural advances in YOLOX and the focused nature of single-joint detection compared to multi-structure localization tasks. The high Dice coefficients (>0.96) indicate precise boundary delineation, critical for ensuring that subsequent quality assessment focuses on diagnostically relevant anatomy.
Our shoulder-specific system achieved a mean AUC of 0.977 on the held-out test set, indicating strong discriminative performance in absolute terms. While prior reports in other radiographic QA domains (e.g., chest radiography) have reported AUCs around 0.89–0.94 [28,29], these figures are not directly comparable to our setting due to differences in anatomy, label definitions (binary vs. multi-grade), dataset composition, and evaluation protocols; we therefore cite them only as background context. Within our task, the observed balance of precision and recall reflects threshold selection on the internal test set; any inference about clinical retake rates would require prospective validation.
The most significant contribution of our work lies in the quantitative demonstration of improved explainability through anatomical localization. ‘XAI-enhanced’ in our context refers to the systematic use of Grad-CAM and coverage analyses within the two-stage pipeline rather than the proposal of a new XAI algorithm. The substantial improvements in Coverage@50 (+0.141 to +0.229) and Coverage@75 (+0.449 to +0.469) metrics for local versus whole-image models provide empirical evidence that focused analysis yields more clinically relevant attention patterns. This finding aligns with cognitive studies in radiology showing that expert radiographers employ systematic search patterns focused on specific anatomical regions rather than global image assessment [30,31]. Previous explainability studies in medical imaging have primarily relied on qualitative visual assessment of attention maps [32]; our quantitative coverage metrics, combined with the visual evidence in Figure 4, establish an objective framework for evaluating and comparing explainability across different architectural approaches.
The consistency of explainability improvements across both OK and NG image categories strengthens the clinical validity of our approach. Interestingly, NG images showed slightly lower coverage improvements compared to OK images, possibly reflecting the greater anatomical variability in malpositioned radiographs. This observation underscores the value of explicit anatomical localization in handling positioning variations, a critical requirement for clinical deployment where patient positioning cannot be perfectly standardized.
The visual evidence presented in Figure 4 further substantiates our quantitative findings. The comparison between whole-image and local Grad-CAM visualizations reveals distinct patterns of model attention. In whole-image classification, activation patterns tend to be diffuse and often extend beyond the glenohumeral joint region, potentially incorporating clinically irrelevant areas such as soft tissue shadows or rib cage structures. In contrast, the local model’s Grad-CAM activations demonstrate concentrated focus on the anatomically critical region, specifically the glenohumeral joint space, humeral head, and glenoid fossa. This focused attention pattern is consistent across both OK and NG cases, though NG cases show slightly more dispersed activations, likely reflecting the anatomical distortion present in malpositioned radiographs. The visual concordance between local model attention and expert-defined regions of interest provides intuitive validation of our coverage metrics and supports the clinical interpretability of the system. Related work on interpretable hybrid architectures in other mission-critical domains further underscores the value of combining transparent attention mechanisms with task-specific modeling [33].
Generalizability and reuse: the proposed two-stage paradigm—task-specific joint/landmark localization followed by quality classification with Grad-CAM explanations—is directly reusable for other anatomies (e.g., hip, knee, wrist) by (i) specifying anatomy-appropriate image-quality criteria, (ii) annotating the target region(s) for detection, and (iii) fine-tuning the classifier on curated OK/NG labels. Only minor code changes are required beyond detector/classifier configuration and threshold calibration; however, anatomy-specific external validation will be necessary to confirm robustness across scanners, protocols, and institutions.
Several limitations warrant consideration in interpreting our findings. First, this study utilized a single publicly available dataset (MURA), which, while comprehensive, may not fully represent the imaging characteristics and quality variations encountered across different institutions and equipment manufacturers. Because the study relies on a single-source dataset with internal splits and was designed to characterize method behavior under controlled conditions, its findings should not be interpreted as evidence of transportability; confirming robustness will require external, multi-institutional validation across scanners, digital radiography systems, exposure parameters, imaging protocols, and patient populations. Second, our binary quality classification (OK/NG) represents a simplification of the continuous spectrum of image quality encountered in clinical practice. Radiographers often make nuanced decisions based on specific clinical indications and patient factors not captured in our model. This deliberate focus on the operational decision boundary does not capture intermediate grades; future prospective work should adopt validated multi-grade schemes with consensus protocols and inter-rater reliability assessment to reflect the continuous nature of quality. Third, the ground truth annotations for both detection boxes and quality labels were established through consensus among three radiographers from a single institution, potentially introducing institutional bias in quality standards. Inter-observer variability in quality assessment, a well-documented challenge in radiography [34], was not formally quantified in our study. Because individual pre-adjudication labels were not retained for all cases, inter-rater agreement (e.g., κ statistics) could not be computed; prospective studies will log rater-wise labels to quantify agreement prior to consensus. Fourth, the computational requirements of the two-stage pipeline, while suitable for modern clinical workstations, may present implementation challenges in resource-constrained settings. Fifth, we did not perform a systematic hyperparameter search or architecture sweep (e.g., Bayesian optimization or grid/random search), so our results may not reflect the best-achievable configuration; future work will incorporate structured optimization and ablation studies (backbone/scale variants, learning-rate schedules, post-processing thresholds) to characterize sensitivity and maximize performance. Sixth, we did not perform a head-to-head comparison with state-of-the-art methods on an identical dataset; this lack of a direct benchmark reflects the absence of a matched shoulder true-AP dataset and the non-transferability of existing public implementations to our binary-label setting. In addition, although the pipeline is designed to transfer to other body parts with modest adaptation, performance will depend on anatomy-specific quality definitions and data distributions, so external validation per anatomy is warranted. We also did not perform deployment-context latency benchmarking; because end-to-end time is strongly influenced by I/O, preprocessing, and system scheduling, real-time feasibility should be established prospectively on the target clinical stack. Finally, our explainability metrics, while quantitative, still require human interpretation and may not capture all aspects of clinical decision-making in quality assessment. Grad-CAM offered a practical, architecture-agnostic choice that we quantified via coverage, but explanation stability remains method-dependent; benchmarking alternative methods (e.g., Score-CAM, SmoothGrad, LIME) and reporting stability/uncertainty summaries constitute important future methodological work.
The clinical implications of our findings extend beyond technical performance metrics. The demonstrated alignment between AI attention patterns and expert judgment suggests that our system could serve as an effective training tool for novice radiographers, providing visual feedback on anatomical regions critical for quality assessment. Selective quality review workflows—in which only images flagged as potentially problematic receive manual inspection—could significantly reduce the cognitive burden on radiographers while maintaining high quality standards.
Future research directions should address the identified limitations while expanding the scope of application. Multi-institutional validation studies incorporating diverse imaging equipment and protocols would strengthen evidence for generalizability. Development of continuous quality scoring systems, potentially incorporating multiple quality dimensions, could provide more nuanced assessment aligned with clinical practice. Beyond the immediate accept-versus-retake decision modeled here, clinically calibrated multi-grade quality scales (e.g., good/acceptable/unacceptable) warrant prospective development with harmonized criteria and reliability assessment to capture meaningful gradations. Integration with radiographic positioning feedback systems could enable real-time guidance for technologists during image acquisition. Extension of the methodology to other shoulder projections and anatomical regions would demonstrate the broader applicability of the two-stage localization–classification framework.