Open Access Article

Peer-Review Record

Quantifying the Contribution of Bone Morphology to Implant Selection in Shoulder Arthroplasty Using CT-Based Deep Learning

Bioengineering 2026, 13(5), 574; https://doi.org/10.3390/bioengineering13050574

by Andrea Moglia¹, Luca Marsilio¹,*, Matteo Rossi¹, Alfonso Manzotti², Luca Mainardi¹ and Pietro Cerveri¹,³,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 11 April 2026 / Revised: 5 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue AI and Data Science in Biomedicine: Powering the Next Generation of Diagnostics and Therapies)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript studies whether CT-derived bone morphology alone contains useful information for shoulder arthroplasty implant selection. The authors combine a previously developed CEL-UNet segmentation module with an extended ArthroNet+ multi-task classifier to predict osteophyte severity, joint-space narrowing, humeroscapular alignment, and implant type in a multicenter cohort of 600 patients.

Several possible issues:

1. Section 2.4 states that the test set contains 50 anatomical and 50 reverse cases, but Table 5 reports ArthroNet+ sensitivities of about 61% for anatomical implants and 91% for reverse implants. On a balanced 100-case test set, these sensitivities would imply an accuracy of about 76%, not about 87%. A similar inconsistency appears between the IT per-class recalls in Table 3 and the reported IT accuracy of 0.865. These numbers should be carefully rechecked and clarified.

2. The paper says that surgeons and model were evaluated under identical morphology-only constraints, but the actual inputs are not the same: ArthroNet+ operates on a CT subregion of the glenohumeral joint, whereas surgeons were shown only 3D reconstructed bone surfaces generated from the segmentation module. In addition, Table 5 includes an Original Surgeons row, which is not a real benchmark because it corresponds to the historical surgical choices used as reference labels.

3. Section 2.4 defines a 500/100 train/test split, but Section 2.5.2 later says that a subset of 100 cases was randomly selected from the available cohort for surgeon evaluation. The manuscript should explicitly state whether these are exactly the held-out test cases. If not, the human-AI comparison could include cases outside the independent test set.

4. Although the cohort is multicentric, the manuscript reports only one 500/100 split, with no cross-validation, no site-wise analysis, and no external validation. The test set is also balanced 50/50 even though the full cohort is about 25% anatomical and 75% reverse, so the reported accuracy does not reflect real clinical prevalence.

5. The manuscript states that implant labels correspond to surgical procedures performed in clinical practice and therefore reflect individual surgeon decisions rather than consensus-based or objectively validated labels. This means the model is learning historical clinical decision patterns rather than optimal implant selection. That framing is acceptable for the stated hypothesis, but it should be emphasized even more clearly and should limit any stronger translational claims.

6. Section 2.2 states that the segmentation and pathology-staging pipeline is reused from previous work and that the main novelty is the implant-type branch plus the controlled human-AI evaluation. However, the manuscript does not clearly distinguish previously established components from the new contribution, especially regarding the segmentation part. In addition, the dataset is private and available only on request, which makes independent verification difficult.
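
The arithmetic behind comment 1, and the prevalence concern in comment 4, can be checked with a minimal sketch. It assumes only the figures quoted above (sensitivities of about 0.61 and 0.91, a 50/50 test set, and a cohort that is roughly 25% anatomical and 75% reverse). The function name `accuracy_from_recalls` and the 25/75 case counts are illustrative and do not come from the manuscript.

```python
def accuracy_from_recalls(recalls, class_counts):
    """Overall accuracy implied by per-class recalls and class sizes.

    Accuracy is the recall of each class weighted by that class's share
    of the evaluated cases.
    """
    total = sum(class_counts)
    correct = sum(r * n for r, n in zip(recalls, class_counts))
    return correct / total


# Approximate sensitivities quoted in comment 1: anatomical, reverse.
recalls = [0.61, 0.91]

# Balanced test set described in Section 2.4 (50 anatomical, 50 reverse).
balanced = accuracy_from_recalls(recalls, [50, 50])

# Hypothetical 25/75 mix mirroring the cohort prevalence noted in comment 4.
prevalence_weighted = accuracy_from_recalls(recalls, [25, 75])

print(f"balanced 50/50 accuracy:   {balanced:.3f}")
print(f"25/75 prevalence accuracy: {prevalence_weighted:.3f}")
```

With these inputs the balanced figure is 0.76, which is the value the reviewer contrasts with the reported accuracy of about 0.87. The same recalls give 0.835 under a 25/75 mix, which shows how strongly the headline accuracy depends on the class ratio of the evaluated cases.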

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Overall, this is a clinically relevant and well-structured study. The manuscript addresses a focused question, and its combination of automated CT analysis with a controlled surgeon comparison makes the work informative and meaningful. The results are of interest, and the authors are appropriately cautious in interpreting the model as a tool for quantifying morphology-based decision signals rather than replacing real clinical decision making. In my opinion, the manuscript can be considered for publication after minor revision.

The authors are encouraged to further harmonize the wording across the Abstract, Discussion, and Conclusions, and make it explicit that the model predicts observed implant choices rather than an objectively optimal implant selection. The manuscript already notes that implant labels were derived from real surgical decisions rather than a consensus-based gold standard. This point deserves clearer and more consistent emphasis throughout the paper, so that readers do not misinterpret the reported 86.5 percent performance as clinical decision accuracy in the strict sense.
At present, the implant type task is mainly presented through overall accuracy. However, the class-wise results show that the model performs substantially better for reverse implants than for anatomical implants. It would be helpful to reflect this asymmetry more clearly in the Abstract and in the summary of the Results, so that the overall accuracy does not obscure the imbalance in class-specific performance.
The authors are encouraged to provide the detailed distributions of OS, JS, and HSA categories in both the training and test sets. Although the manuscript reports the implant type ratio and notes that the intermediate grades of OS and JS are more difficult to classify, the absence of category-wise sample numbers makes it difficult for readers to judge how strongly the results may be influenced by label distribution.
The authors may consider adding a brief explanation of the high disagreement cases. The manuscript already uses entropy to illustrate case-level variability and provides examples of full agreement and maximal disagreement. However, a short note in the main text or figure legend describing whether these highly disputed cases share certain morphological features would help readers better understand the source of uncertainty under morphology-only conditions.
The authors may also consider citing the following review: Jess R, Ling T, Xiong Y, Wright CJ, Zhao F. Mechanical environment for in vitro cartilage tissue engineering assisted by in silico models. Biomaterials Translational. 2023, 4(1): 18-26. This review systematically discusses how mechanical cues and computational modeling can be integrated to guide cartilage tissue engineering studies and may help broaden the background section.
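
As a rough illustration of the case-level entropy mentioned above, the sketch below computes the Shannon entropy of the implant choices made for a single case. It assumes entropy in bits over the surgeons' votes; the manuscript's exact definition is not given in this record, and the panel size of four and the function name `vote_entropy` are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (in bits) of the label distribution for one case.

    0.0 means every rater chose the same implant type; for two classes the
    maximum is 1.0, reached when the votes are split evenly.
    """
    counts = Counter(votes)
    n = len(votes)
    # p * log2(1/p) avoids returning a negative zero for unanimous votes.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical panel of four surgeons rating two cases.
full_agreement = vote_entropy(["reverse", "reverse", "reverse", "reverse"])
max_disagreement = vote_entropy(["anatomical", "anatomical", "reverse", "reverse"])

print(full_agreement)    # unanimous case
print(max_disagreement)  # evenly split case
```

Ranking cases by this value is one simple way to pick out the highly disputed cases whose shared morphological features the comment above asks the authors to describe.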

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

I have reviewed the manuscript titled “Quantifying the Contribution of Bone Morphology to Implant Selection in Shoulder Arthroplasty Using CT-Based Deep Learning” and find that it presents a well-structured and clinically relevant investigation into the role of osseous morphology in implant selection through a unified deep learning framework. The integration of segmentation (CEL-UNet) and multi-task classification (ArthroNet+) alongside a controlled human-AI comparison constitutes a meaningful contribution, particularly in isolating morphology-driven decision signals, and the experimental design is clearly articulated and supported by quantitative results (e.g., segmentation Dice scores up to 0.99 and implant prediction accuracy of 86.5%).

However, despite these strengths, several methodological and interpretative limitations require substantial revision before the work can be considered for publication at an international level. The study relies on implant labels derived from individual surgeon decisions rather than a validated consensus or outcome-based ground truth, which introduces label noise and potential bias that is insufficiently addressed in the current analysis. The absence of critical demographic and clinical variables (e.g., sex, BMI, rotator cuff status) limits both the generalizability and the interpretability of the findings, especially given that the central claim concerns the sufficiency of morphology alone.

Furthermore, while the authors emphasize the hypothesis-driven nature of the study, the manuscript lacks a rigorous statistical validation of this hypothesis beyond accuracy comparisons; additional analyses (e.g., calibration curves, confidence intervals, or decision-curve analysis) would strengthen the claims regarding clinical relevance. The class imbalance in implant types, although partially mitigated by the weighted loss, still results in markedly asymmetric performance (notably poor sensitivity for anatomical implants), and alternative strategies such as stratified evaluation or cost-sensitive learning should be explored. Additionally, the comparison with surgeons, although interesting, is limited by the single-center cohort and relatively small sample size (n=100), and the low Fleiss’ κ (~0.15) should be more critically interpreted in light of experimental constraints rather than implicitly highlighting model superiority.

The manuscript would also benefit from improved transparency and reproducibility, as there is no code or data sharing plan provided (e.g., via GitHub or Zenodo), which is increasingly expected for deep learning studies. Importantly, no ablation study is presented to quantify the contribution of each component (e.g., segmentation quality, individual pathology tasks, or multi-task learning design) to the final implant prediction performance, making it difficult to assess the true methodological novelty. Similarly, the absence of validation on an external dataset raises concerns about robustness and generalizability across imaging protocols and populations. The paper would also benefit from the inclusion of a state-of-the-art comparison table summarizing similar approaches in shoulder arthroplasty planning to better contextualize the contribution. Moreover, given the clinical nature of the task, incorporating explainable AI techniques (e.g., saliency maps or attention mechanisms) would significantly enhance interpretability and trustworthiness, particularly in borderline cases highlighted in the results.

The figures and tables, while informative (e.g., Tables 1-5 and Figures 4-7), could be improved in readability and clarity, especially regarding labeling consistency and resolution. Finally, although limitations are discussed, a more explicit “novelties and contributions” section and a clearer articulation of future research directions, particularly regarding multimodal integration, would strengthen the manuscript’s positioning.

Overall, the study addresses an important question and proposes a promising framework, but substantial methodological clarification, additional validation, and improved transparency are required to support its claims. I recommend major revision.
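
For context on the agreement statistic cited in this report, the following is a minimal sketch of the standard Fleiss' κ computation. The input layout (one row per case, one column per implant type, each cell holding the number of raters who chose that type) and the toy tables are assumptions for illustration, not the study's data.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-case category counts.

    ratings[i][j] is the number of raters who assigned case i to category j.
    Every row must sum to the same number of raters (at least two).
    """
    n_cases = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each case needs the same number (>= 2) of ratings")

    n_categories = len(ratings[0])
    # Share of all assignments that fell in each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per case: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_cases        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy tables with four hypothetical raters; columns: anatomical, reverse.
perfect = fleiss_kappa([[4, 0], [0, 4]])
split = fleiss_kappa([[2, 2], [2, 2]])
print(perfect, split)
```

Values near 1 indicate agreement well above chance and values near 0 indicate agreement close to chance, which is the scale against which the reported κ of about 0.15 is being read.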

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have satisfactorily addressed my comments. The revised manuscript is clearer, the main numerical inconsistency has been corrected, and the limitations of the approach are now better explained.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have completely addressed all my comments, and I have no further concerns. Therefore, I recommend accepting the paper.
