Differential Diagnosis of Parotid Tumors on Ultrasound: Interobserver Variability and Examiner-Specific Decision Rules—A Machine Learning Approach
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study investigates interobserver variability in ultrasound-based diagnosis of parotid gland tumors and introduces an interpretable machine-learning surrogate framework to model examiner-specific diagnostic patterns. The authors evaluate diagnostic performance, quantify agreement using Cohen’s κ, and examine the relationship between surrogate complexity and examiner performance. The topic is clinically relevant and timely, particularly given the operator dependence of ultrasound and the increasing importance of explainable AI in medical imaging. While the study is well motivated and methodologically interesting, several important issues should be addressed to improve scientific rigor, clarity, and clinical impact.
- The abstract should explicitly state the type of surrogate model(s) used (e.g., decision trees), and clarify whether cross-validation or any external validation was applied.
- The manuscript should more clearly explain the intended clinical role of the surrogate models, particularly whether they are designed to support examiner training, improve reporting standardization, or assist less-experienced clinicians.
- The authors exclusively use decision-tree learning. Justification is required, along with discussion of alternative interpretable classifiers (e.g., rule-based models, logistic regression, or generalized additive models) and why they were not explored.
- The single-center, retrospective design and reliance on a single ultrasound platform limit external validity. Validation on multicenter datasets and across different ultrasound vendors would significantly strengthen the robustness and clinical applicability of the findings.
- The absence of image harmonization or acquisition normalization may introduce confounding variability that affects both interobserver agreement and surrogate learning. Some form of normalization or stratified analysis is recommended.
- The wide heterogeneity in examiner experience likely influences both diagnostic performance and surrogate complexity. Stratified analyses or experience-adjusted modeling would help disentangle expertise effects from true interobserver variability.
- Disabling tree pruning substantially increases the risk of overfitting, particularly given the relatively small dataset. Comparison with pruned trees, regularized models, or cross-validated surrogates would improve methodological robustness.
- Reliance on a single train–test split limits statistical reliability. Repeated cross-validation or bootstrapping should be used to obtain more stable performance estimates.
- Since the primary goal of surrogates is to replicate examiner decision-making, metrics quantifying agreement between surrogate predictions and examiner labels (e.g., accuracy, κ) must be reported. Currently, performance is only evaluated against histopathology, which does not directly assess surrogate fidelity.
- Explicit feature importance analysis is needed to clarify the relative contribution of clinical versus imaging variables. Sensitivity analyses excluding strong clinical predictors (e.g., lymph node status, facial nerve function) would help rule out shortcut learning.
- Table 2 shows relatively low surrogate accuracies (48.0%–78.6%), and some models demonstrate extreme sensitivity–specificity imbalance. These limitations should be explicitly discussed, including their implications for interpreting examiner-specific diagnostic pathways.
- While pairwise DeLong tests were applied to examiner performance, no statistical comparisons are reported for surrogate models. It is unclear whether performance differences between surrogates are statistically meaningful.
- Figure 3(a) should include AUC values directly on the ROC curves.
- Figure resolutions require improvement.
- Summarizing examiner performance metrics in a table would significantly improve readability without removing existing figures.
- The manuscript would benefit from analyzing misclassification patterns across tumor subtypes to provide deeper clinical insights and guide potential AI-assisted training strategies.
- The discussion should more clearly elaborate on how these findings could inform structured reporting systems, examiner training programs, and future explainable AI model development.
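The distinction drawn in the comments above between surrogate fidelity (agreement with the examiner's labels) and diagnostic accuracy (agreement with histopathology) can be illustrated with a minimal Python sketch. The labels below are hypothetical toy values, not study data, and the helper names `cohen_kappa` and `accuracy` are assumptions introduced only for this example.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of cases on which the two label sources agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the marginal label frequencies of each source.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both sources use one identical label throughout;
        # returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return (observed - expected) / (1 - expected)


def accuracy(pred, ref):
    """Proportion of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(pred, ref)) / len(ref)


# Hypothetical toy labels (1 = malignant, 0 = benign); NOT data from the study.
histology = [1, 0, 0, 1, 0, 0, 1, 0]  # reference standard
examiner = [1, 0, 1, 1, 0, 0, 0, 0]   # labels the surrogate is trained to imitate
surrogate = [1, 0, 1, 1, 0, 1, 0, 0]  # surrogate model output

# Fidelity: how closely the surrogate reproduces the examiner.
fidelity_acc = accuracy(surrogate, examiner)
fidelity_kappa = cohen_kappa(surrogate, examiner)

# Diagnostic accuracy: how well the surrogate matches histopathology.
diagnostic_acc = accuracy(surrogate, histology)

print(f"fidelity accuracy:   {fidelity_acc:.3f}")
print(f"fidelity kappa:      {fidelity_kappa:.3f}")
print(f"diagnostic accuracy: {diagnostic_acc:.3f}")
```

In this toy example the surrogate reproduces the examiner closely while matching histopathology less well, which is why the two quantities are best reported separately.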
Author Response
Please see the attachment.
Reviewer 2 Report
Comments and Suggestions for Authors
Major Concerns
- The monocentric, retrospective design substantially limits external validity. Although acknowledged in the limitations, the manuscript does not sufficiently discuss how institutional practice patterns and examiner training environment may have shaped decision rules and agreement metrics.
- The cohort is histologically imbalanced, with a small number of rare entities (e.g., cysts, angioma). While the authors state that subtype classification was not the goal, this imbalance may still bias examiner heuristics and surrogate structures toward dominant entities.
- Exclusion of 11 cases due to image quality is briefly mentioned, but no sensitivity analysis is provided to assess whether exclusion may have influenced observer variability or performance metrics.
- Examiner background is heterogeneous in both specialty and ultrasound training; however, experience level, DEGUM certification, and specialty are tightly intertwined. This makes it difficult to disentangle whether observed performance differences reflect experience, specialty, or prior exposure to structured ultrasound curricula.
- While examiners were blinded to histopathology, it remains unclear whether they were aware of the study aim focusing on interobserver variability and surrogate modeling, which could have influenced decision-making behavior.
- Several descriptors (e.g., echogenicity, internal texture, acoustic phenomena) are inherently subjective, yet no operational definitions or visual reference standards are provided within the main manuscript. This limits reproducibility and may partly explain the wide κ ranges reported.
- Lesion size is categorized (<1 cm, 1–2 cm, >2 cm) rather than treated as a continuous variable. The rationale for categorization and potential information loss is not discussed.
- Clinical metadata (e.g., smoking status, tumor history, facial nerve palsy) are incorporated into the surrogate models, but their availability and reliability in routine ultrasound referral settings are not addressed.
- Interobserver agreement is assessed using pairwise Cohen’s κ; however, no global multi-rater agreement metric (e.g., Fleiss’ κ or ICC where applicable) is provided, which would have strengthened the overall assessment of variability.
- The interpretation of κ values relies on Landis–Koch thresholds without acknowledging their known limitations and context dependency, particularly in imbalanced datasets.
- Multiple hypothesis testing is addressed for AUC and sensitivity/specificity comparisons, but no correction strategy is applied to the large number of pairwise κ comparisons across descriptors.
- Accuracy and AUC are emphasized, yet the clinical implications of the observed sensitivity–specificity trade-offs are not sufficiently discussed, particularly the high false-positive rates seen in lower-performing examiners.
- Confidence intervals are reported, but overlap between examiners is not systematically interpreted, which weakens claims of meaningful performance separation beyond Examiner 3.
- Decision trees are trained with pruning intentionally disabled, but the potential for overfitting, especially given the modest sample size and categorical predictors, is not quantitatively evaluated (e.g., via cross-validation or stability analysis).
- Surrogate performance is assessed against histopathology, although the models are explicitly trained to replicate examiner labels. This dual evaluation framework may confuse readers and should be more clearly justified.
- The link between surrogate complexity and examiner performance is described qualitatively; however, no formal complexity metric (e.g., tree depth, number of nodes) is statistically tested against performance measures.
- While the manuscript argues that compact decision rules reflect superior expertise, alternative explanations such as label noise, cognitive bias, or dataset-specific shortcut learning are only briefly mentioned and not critically explored.
- The discussion occasionally approaches normative statements about “better” diagnostic strategies, despite the descriptive nature of the surrogate models. Causal interpretations should be more cautiously framed.
- The potential role of these surrogates in training or quality assurance is proposed, but no concrete implementation pathway or validation framework is outlined.
- No external validation cohort is included, either for examiner performance benchmarking or for surrogate model robustness. This limits the applicability of findings beyond the study setting.
- The manuscript does not address how the proposed approach would perform with different ultrasound systems, acquisition protocols, or examiner populations.
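As an illustration of the global multi-rater agreement metric mentioned in the concerns above, a minimal Python sketch of Fleiss' κ follows. The rating counts are hypothetical toy values, not study data, and the function name `fleiss_kappa` and its input layout (one row per case, one column per category) are assumptions made for this example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa.

    counts[i][j] is the number of raters who assigned case i to category j.
    Every case must be rated by the same number of raters (at least two).
    """
    n_cases = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each case needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Per-case agreement: proportion of rater pairs that agree on that case.
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    mean_agreement = sum(per_case) / n_cases

    # Chance agreement from the overall proportion of ratings in each category.
    category_share = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    chance_agreement = sum(p * p for p in category_share)
    if chance_agreement == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (mean_agreement - chance_agreement) / (1 - chance_agreement)


# Hypothetical toy example: 4 cases, 3 raters, 2 categories (benign, malignant).
toy_counts = [
    [3, 0],  # all three raters chose "benign"
    [0, 3],  # all three raters chose "malignant"
    [2, 1],  # split decision
    [1, 2],  # split decision
]
toy_kappa = fleiss_kappa(toy_counts)
print(f"Fleiss' kappa: {toy_kappa:.3f}")
```

Unlike pairwise Cohen's κ, this yields a single summary value across all raters for a given descriptor.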
The study addresses a relevant and innovative question by externalizing examiner-dependent decision patterns in parotid ultrasound. However, methodological clarifications, stronger statistical justification, and more cautious interpretation are needed to ensure that conclusions remain proportional to the descriptive and exploratory nature of the analysis.
Author Response
Please see the attachment.
Reviewer 3 Report
Comments and Suggestions for Authors
The description of the machine learning process (with KNIME) and statistical analysis is thorough. A brief explanation in the text of why tree pruning is turned off, despite the risk of overfitting in a small sample, would improve the technical section.
Author Response
Please see the attachment.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have satisfactorily addressed all my concerns. Please provide a clean version of the updated manuscript, ensuring all previous deletions and track changes are removed.
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for the thorough and constructive revision. The methodological clarifications and statistical improvements have significantly strengthened the manuscript. The study is now clear, balanced, and ready for publication. Congratulations on the successful revision.

