Impact of Imaging Modality on AI-Based Detection of Incidental Maxillary Sinus Pathology: Comparison of Panoramic Radiography and CBCT
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This retrospective paired-design study compares the diagnostic performance of a commercial dental AI platform (Diagnocat) for detecting maxillary sinus abnormalities on panoramic radiographs (OPG) versus cone-beam computed tomography (CBCT) images acquired from the same patients. The study addresses a clinically relevant gap, namely whether AI outputs remain consistent across imaging modalities of fundamentally different dimensionality. The paired design and adherence to STARD reporting are methodological strengths. However, several issues related to reference standard construction, statistical analysis, and interpretation warrant revision before the manuscript can be considered suitable for publication.
Major Revisions
The study exhibits reference standard bias because the consensus CBCT reading serves as the reference standard against which the AI is evaluated on both CBCT and OPG, which creates an inherent disadvantage for OPG because pathology visible only on CBCT will systematically produce false negatives for the OPG arm while being correctly classified for CBCT; therefore, the authors should explicitly acknowledge this verification bias and discuss how it inflates the apparent performance gap between modalities.
There is an absence of formal paired statistical comparison as the study's primary aim is to compare AI performance across modalities, yet no inferential test for paired proportions (e.g., McNemar's test or generalized estimating equations accounting for within-patient clustering) is applied to directly compare accuracy, recall, or other metrics between OPG and CBCT, and since overlapping confidence intervals alone do not constitute a formal comparison, a direct paired test should be reported.
Due to category-specific analysis limitations, the AI provides only a binary "any abnormality" output, so the category-specific analyses (mucosal thickening, polyps/cysts, free fluid) do not reflect the AI's ability to classify lesion type but merely whether the lesion was conspicuous enough to trigger a general alert; this distinction should be stated more prominently, as the current presentation may mislead readers into interpreting these as type-specific diagnostic accuracies.
Regarding the clinical interpretation of recall on CBCT, a recall of approximately 54% on CBCT means that nearly half of true sinus abnormalities are missed, so the discussion should address more directly whether this level of sensitivity is clinically acceptable for an opportunistic screening tool, including the potential consequences of false-negative reassurance.
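The paired comparison requested in the second major point can be illustrated with a minimal sketch of an exact McNemar test. This is not the authors' analysis: the function names and the example counts are hypothetical, and the sketch assumes one paired observation per unit, so it does not handle within-patient clustering (for which the reviewer suggests generalized estimating equations).

```python
from math import comb


def discordant_counts(correct_opg, correct_cbct):
    """Count paired cases where exactly one modality gave a correct AI result.

    Both arguments are equal-length sequences of booleans (hypothetical data):
    True means the AI output agreed with the reference standard for that case.
    """
    pairs = list(zip(correct_opg, correct_cbct))
    only_opg = sum(1 for opg, cbct in pairs if opg and not cbct)
    only_cbct = sum(1 for opg, cbct in pairs if cbct and not opg)
    return only_opg, only_cbct


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis the discordant pairs split 50/50, so the
    smaller count follows a Binomial(b + c, 0.5) distribution.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts, not taken from the manuscript:
# 1 case correct on OPG only, 9 cases correct on CBCT only.
p_value = mcnemar_exact(1, 9)
```

Only the discordant pairs enter the test; cases where the AI was right or wrong on both modalities carry no information about the difference between OPG and CBCT.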
Minor Revisions
The sample size justification targets CI half-width precision but does not address statistical power for between-modality comparisons, which is the study's primary objective, and this should be clarified.
Observer qualifications differ ("5 years" vs. "more than five years"), and reporting exact experience for both observers would improve transparency.
The AI platform version and whether any software updates occurred during the data collection period (January 2023–March 2025) should be specified, given that cloud-based platforms may change algorithms without user notification.
The 2 mm threshold for pathological mucosal thickening is referenced but not uniformly adopted in the literature, so a brief justification with supporting citations would strengthen this methodological choice.
Several figures (Figures 2, 3) use bar charts with error bars that are difficult to interpret at the presented scale, and it would be clearer to tabulate these data or use forest plots for visualization of point estimates and confidence intervals.
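The distinction raised in the first minor point, precision versus power, can be made concrete with a small sketch of a precision-based sample size calculation. It assumes the simple normal-approximation (Wald) interval for a single proportion; the function name and the input values are illustrative assumptions and do not reproduce the manuscript's calculation.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_half_width(p, half_width, confidence=0.95):
    """Smallest n for which a Wald CI around proportion p has the given half-width.

    Uses n = z^2 * p * (1 - p) / d^2, where z is the standard normal quantile
    for the chosen two-sided confidence level and d is the target half-width.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / half_width ** 2)


# Illustrative inputs only: an expected proportion of 0.5 and a target
# half-width of 0.10 at 95% confidence.
n_required = sample_size_for_half_width(0.5, 0.10)
```

A calculation of this kind fixes how tightly each metric is estimated within one modality. It says nothing about the probability of detecting a given difference between OPG and CBCT, which would require a separate power calculation for paired proportions.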
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding corrections highlighted in the re-submitted files.
1. The study exhibits reference standard bias because the consensus CBCT reading serves as the reference standard against which the AI is evaluated on both CBCT and OPG, which creates an inherent disadvantage for OPG because pathology visible only on CBCT will systematically produce false negatives for the OPG arm while being correctly classified for CBCT; therefore, the authors should explicitly acknowledge this verification bias and discuss how it inflates the apparent performance gap between modalities.
We thank the reviewer for this insightful observation regarding reference standard bias. We agree that using CBCT as the reference standard for OPG creates an inherent disadvantage for the 2D modality. As suggested, we have explicitly acknowledged this verification bias in our Discussion/limitations section (page 17, lines 528-534). We believe this addition provides a more nuanced and honest interpretation of the AI’s performance across different imaging modalities.
2. There is an absence of formal paired statistical comparison as the study's primary aim is to compare AI performance across modalities, yet no inferential test for paired proportions (e.g., McNemar's test or generalized estimating equations accounting for within-patient clustering) is applied to directly compare accuracy, recall, or other metrics between OPG and CBCT, and since overlapping confidence intervals alone do not constitute a formal comparison, a direct paired test should be reported.
and
The sample size justification targets CI half-width precision but does not address statistical power for between-modality comparisons, which is the study's primary objective, and this should be clarified.
We greatly appreciate this constructive suggestion and have now performed formal paired statistical analyses. McNemar's test for paired proportions has been applied to directly compare AI performance across the two imaging modalities. Please see the new subsection 3.4 in the Results section of the manuscript (pages 14-15, lines 419-434; Tables 6 and 7).
Because the study was primarily designed as an estimation study, we retained a precision-based sample-size justification and clarified this in the Methods (page 4, line 135).
3. Due to category-specific analysis limitations, the AI provides only a binary "any abnormality" output, so the category-specific analyses (mucosal thickening, polyps/cysts, free fluid) do not reflect the AI's ability to classify lesion type but merely whether the lesion was conspicuous enough to trigger a general alert; this distinction should be stated more prominently, as the current presentation may mislead readers into interpreting these as type-specific diagnostic accuracies.
We agree that this distinction should be stated more prominently. As suggested, we have added this information in AI Evaluation, point 2.5 (page 5, lines 170-172). Similar explanations appear in Statistical evaluation, point 2.7 (page 6, lines 212-218), and in Discussion/limitations (page 17, lines 516-519).
4. Regarding the clinical interpretation of recall on CBCT, a recall of approximately 54% on CBCT means that nearly half of true sinus abnormalities are missed, so the discussion should address more directly whether this level of sensitivity is clinically acceptable for an opportunistic screening tool, including the potential consequences of false-negative reassurance.
While CBCT outputs may miss a proportion of subtle lesions, our results indicate that a mucosal thickness greater than 5 mm3 and a polyp/retention cyst volume above 400.4 mm³ should be correctly diagnosed by the AI. We have added information on the clinical implications of a recall of approximately 54% to the Discussion section (page 15, lines 447-452). In conclusion, a negative result on this opportunistic screening tool does not rule out every sinus pathology, and clinical correlation remains mandatory (page 18, lines 544-545).
Minor Revisions
5. Observer qualifications differ ("5 years" vs. "more than five years"), and reporting exact experience for both observers would improve transparency.
We have added to the text that the second observer has 12 years of experience in dentomaxillofacial imaging (page 5, lines 178-179).
6. The AI platform version and whether any software updates occurred during the data collection period (January 2023–March 2025) should be specified, given that cloud-based platforms may change algorithms without user notification.
Independent evaluations of the software, such as those analyzing dental treatment signs, used a single consistent version, Diagnocat 1.0, within a 10-day interval (April 10-20, 2025), specifically to ensure that no version updates occurred during the study period (page 4, lines 164-166).
7. The 2 mm threshold for pathological mucosal thickening is referenced but not uniformly adopted in the literature, so a brief justification with supporting citations would strengthen this methodological choice.
As requested, we have added further citations to support this established radiological criterion (page 5, line 192).
8. Several figures (Figures 2, 3) use bar charts with error bars that are difficult to interpret at the presented scale, and it would be clearer to tabulate these data or use forest plots for visualization of point estimates and confidence intervals.
We have changed Figures 2 and 3 to forest plots; the underlying data are tabulated in Tables 1 and 2.
Reviewer 2 Report
Comments and Suggestions for Authors
Overall, I recommend Major Revision. While the paired-image design offers valuable clinical insights into AI performance across modalities, significant methodological gaps regarding lesion categorization and comparison with contemporary literature must be addressed.
- Methodological ambiguity in category-specific AI outputs
The study acknowledges that the AI platform provides only binary "any abnormality" alerts, yet it performs category-specific analyses. The authors should clarify how the absence of specific labels from the AI affects the validity of these sub-group performance metrics.
- Inconsistent reporting of diagnostic performance metrics
There is a discrepancy between the recall and accuracy values reported in the "Sample size and precision" section versus the Results tables. Ensuring numerical consistency across all sections is vital for the reproducibility and reliability of the findings.
- Incomplete justification for the 30-day imaging interval
While a 30-day window is common, the authors should more rigorously discuss how transient physiological changes in the maxillary sinus might impact the consensus reference, potentially introducing bias in the comparison between OPG and CBCT datasets.
- Limited discussion on the "Generalist" AI performance gap
The discussion attributes lower accuracy to the "generalist" nature of the software. However, the manuscript lacks a detailed error analysis exploring whether specific anatomical superimpositions on OPGs systematically lead to the observed "chance-level" performance.
- Insufficient engagement with recent AI advances and models
The literature review overlooks several dental AI developments and Vision-Language Models like Gemini-3 and Dynasmile in the last two years. Integrating these recent advances would significantly broaden the study's scope and better contextualize the performance of commercial platforms.
Author Response
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding highlighted changes in the re-submitted files.
- Methodological ambiguity in category-specific AI outputs
The study acknowledges that the AI platform provides only binary "any abnormality" alerts, yet it performs category-specific analyses. The authors should clarify how the absence of specific labels from the AI affects the validity of these sub-group performance metrics.
We agree that this distinction should be stated more prominently. As suggested, we have added this information in AI Evaluation, point 2.5 (page 5, lines 170-172). Similar explanations appear in Statistical evaluation, point 2.7 (page 6, lines 212-218), and in Discussion/limitations (page 17, lines 516-519).
- Inconsistent reporting of diagnostic performance metrics
There is a discrepancy between the recall and accuracy values reported in the "Sample size and precision" section versus the Results tables. Ensuring numerical consistency across all sections is vital for the reproducibility and reliability of the findings.
We apologize for the incorrect data in Table 3; all data are now consistent (Table 3, page 11).
- Incomplete justification for the 30-day imaging interval
While a 30-day window is common, the authors should more rigorously discuss how transient physiological changes in the maxillary sinus might impact the consensus reference, potentially introducing bias in the comparison between OPG and CBCT datasets.
We appreciate you highlighting this issue. We agree that it requires more in-depth discussion. We address it in the Materials and Methods section (page 4, lines 126-131) and have highlighted it in Discussion/limitations (page 17, lines 524-528).
- Limited discussion on the "Generalist" AI performance gap
The discussion attributes lower accuracy to the "generalist" nature of the software. However, the manuscript lacks a detailed error analysis exploring whether specific anatomical superimpositions on OPGs systematically lead to the observed "chance-level" performance.
We have added the OPG limitations to the Introduction (page 2, lines 65-69) and the Discussion (page 16, lines 481-485).
An in-depth analysis of specific anatomical superimpositions on OPGs is beyond the scope of the current study; however, it represents a compelling avenue for future research in a separate publication.
- Insufficient engagement with recent AI advances and models
The literature review overlooks several dental AI developments and Vision-Language Models like Gemini-3 and Dynasmile in the last two years. Integrating these recent advances would significantly broaden the study's scope and better contextualize the performance of commercial platforms.
Diagnocat was selected for this study because it is widely deployed in our setting and because it supports both 2D and 3D dental imaging within a unified workflow (page 3, lines 106-107). Although our further research includes a comparative evaluation of multiple AI platforms, in the future we want to focus on AI solutions designed specifically for clinical practitioners. We have added that the present study is intentionally a real-world evaluation of a deployed clinical platform rather than a benchmark of all AI architectures (page 2, lines 79-81).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have undertaken a thorough and substantive revision of the manuscript. All four major concerns have been adequately addressed: (1) reference standard bias is now explicitly acknowledged in the Limitations section with appropriate mechanistic explanation; (2) formal paired statistical comparisons using McNemar’s test have been added as a dedicated results subsection with comprehensive tabular reporting; (3) the category-specific analysis limitations are now prominently clarified at multiple points including table titles, footnotes, and a worked methodological example; and (4) the clinical implications of ~54% recall on CBCT are directly and responsibly discussed, with clear warnings against false-negative reassurance. Among the minor points, observer qualifications, AI platform versioning, the 2 mm threshold justification, and figure improvements have all been satisfactorily addressed, with only minor residual gaps in the a priori power calculation and Figure 2 formatting. The revised manuscript now presents a methodologically sound, transparently reported, and clinically responsible evaluation of a commercial dental AI platform. The scientific contribution, demonstrating that AI diagnostic performance is fundamentally modality-dependent and that a generalist AI platform functions primarily as a conspicuity detector, is clearly articulated and well supported by the data.
Author Response
We sincerely thank the reviewer for the constructive and insightful evaluation of our revised manuscript.
Residual gaps in the a priori power calculation and Figure 2 formatting: We have expanded the description of our a priori sample size and power analysis in the Materials and Methods section (page 4, lines 135-154). We have also reformatted Figure 2.
Reviewer 2 Report
Comments and Suggestions for Authors
The quality is good.
Author Response
We sincerely thank the reviewer for the constructive and insightful evaluation of our revised manuscript.
