Review Reports
- Wataru Miyazawa 1,
- Masahiro Takahashi 1,* and
- Katsuhiko Noda 2
- et al.
Reviewer 1: Yulia O. Bobrova
Reviewer 2: Anonymous
Reviewer 3: Biswajit Brahma
Reviewer 4: Anonymous
Round 1
Reviewer 1 Report (Previous Reviewer 3)
Comments and Suggestions for Authors
In this manuscript, the authors present a proof-of-concept study using artificial intelligence for intraoperative detection of residual cholesteatoma in surgical videos. This work addresses an important clinical problem in medicine and represents a significant step toward AI-based surgical guidance.
This study is groundbreaking in that it focuses on surgical video analysis for cholesteatoma detection, going beyond static image analysis. The goal of the study (reducing recurrence rates by preventing unrecognized residual disease) is of significant importance for improving surgical outcomes.
The detailed description of the methodology is noteworthy despite the limitations of the data. The authors acknowledge the fundamental difficulty of working with a rare disease and a small dataset (144 videos from 88 patients), and they employ several sophisticated strategies to reduce overfitting and improve robustness. The article also uses a carefully balanced 6-fold cross-validation scheme to ensure robust evaluation of model performance across different patient subgroups. Finally, the ensemble modeling approach (training 288 models and using ensemble predictions) effectively reduces variance and provides a more stable estimate of model performance than a single model.
To improve the manuscript, I recommend the authors draw their attention to the following points:
1. The model was trained and tested on data from a single institution, raising concerns about its generalizability to different surgical populations, equipment, and techniques. Performance may drop significantly when applied externally, a common challenge for AI models in medicine, especially those trained on limited data. I advise describing the limitations of the method in more detail.
2. The manuscript does not specify technical requirements and limitations for the images used, since the neural network was trained and tested on images from a single source.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
I have no further comments on this revised manuscript.
Author Response
We sincerely thank the reviewer for the positive evaluation of our revised manuscript and for their support. We appreciate the time and effort dedicated to reviewing our work.
Reviewer 3 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
- How does the AI handle differences in video quality, lighting, or patient anatomy, especially with so few cases used for training?
- Was there any validation on an external dataset or cross-institutional data to assess the model's performance beyond the training environment?
- Since surgeons use touch and real-time judgment, how reliable is AI that only uses video to find lesions—especially in cases with bleeding or unclear views?
- Since surgeons rely on tactile feedback and real-time context during surgery, how effective is an AI system that relies only on visual information, especially in cases with bleeding or unclear views?
- Visual aids are essential to support the study. The authors should add additional diagrams to support Section 2.
- The novelty of the work should be explicitly stated and directly aligned with the study's objectives.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report (New Reviewer)
Comments and Suggestions for Authors
This work proposes an artificial intelligence-based system for detecting residual cholesteatoma in surgical videos. Deep neural networks are used to analyze surgical videos captured with both endoscopic and microscopic modalities, in an attempt to overcome the limitations of a small evaluation dataset. Major comments are given as follows.
1. The dataset size limits the generalizability for a rare condition like cholesteatoma.
2. A direct comparison between the performance of the AI system and experienced surgeons is missing.
3. While the study provides proof-of-concept evidence, the technical feasibility discussion of real-time AI implementation during surgery should also be included.
4. The analysis of false-positive and false-negative cases is weak and does not sufficiently address their underlying causes.
5. The discussion of diagnostic performance after video editing does not clearly explain how these edits could be standardized in a real-world setting.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The paper doesn't contain any background study. A thorough literature review is necessary to establish a stronger foundation for the presented work. Clearly mention the problem statement and the research objective.
- The dataset contains 144 videos from 88 patients; justify how the theory will establish any conclusion with this limited dataset.
- The paper talks about a Novel AI model to evaluate the diagnosis accuracy from endoscopic and microscopic footage. However, there's no reference to the proposed model. No derivation, no algorithm.
- The result section is missing the comparison with existing establishment. Please justify the outcomes.
- References are not in the proper format and look as if they were copy-pasted from a generative AI platform.
Author Response
Comments 1: The paper doesn't contain any background study. A thorough literature review is necessary to establish a stronger foundation for the presented work. Clearly mention the problem statement and the research objective.
Response 1: We sincerely thank the reviewer for this valuable comment. We agree that the original version of the manuscript lacked sufficient background to clearly position our study within the existing body of research. In the revised manuscript, we have substantially expanded the Introduction section to provide a more comprehensive literature review and to clarify both the clinical problem and the objective of our study (p.2, lines 39-68).
Specifically, we:
- Summarized recent applications of AI in otorhinolaryngology, including diagnostic imaging, surgical skill assessment, and intraoperative guidance [1] (p.2, lines 39-41).
- Cited reviews on AI in surgical video analysis, which highlight the global trend toward workflow recognition, skill evaluation, and real-time support [2] (p.2, lines 41-43).
- Added evidence of real-time augmentation of nasal endoscopy using AI [3], underscoring the feasibility of intraoperative video-based systems (p.2, lines 43-44).
- Contrasted these advances with prior studies in otology, which have mainly focused on static images (otoscopic photographs and CT scans) [4] (p.2, lines 44-47). We noted that only one study has evaluated intraoperative images, using a small dataset and reporting limited sensitivity and specificity [5][6] (p.2, lines 47-50).
- Highlighted concerns regarding the generalizability of AI models trained on small datasets [7], which is directly relevant to rare diseases such as cholesteatoma (p.2, lines 50-55).
- Clarified that although clinically applicable AI systems require large and diverse datasets, proof-of-concept studies remain valuable for elucidating technical requirements when data are limited (p.2, lines 56-60).
Finally, we explicitly stated the aim and novelty of our study as follows (p.2, lines 61-68):
“The aim of this study was to evaluate the performance of deep learning models trained on a limited dataset of postoperative surgical videos and to identify design requirements necessary for reliable lesion discrimination. The novelty of this study does not lie in the backbone architecture itself, but rather in integrating dual-modality surgical videos and adopting systematic strategies to enhance diagnostic stability under constrained data conditions. Through this approach, this study provides proof-of-concept insights that may inform future efforts toward developing AI-based intraoperative support systems for cholesteatoma surgery.”
We believe these additions and clarifications address the reviewer’s concern and strengthen the foundation of the study.
Comments 2: The dataset contains 144 videos from 88 patients; justify how the theory will establish any conclusion with this limited dataset.
Response 2: We thank the reviewer for this important comment. We agree that the dataset in this study (144 videos from 88 patients) is relatively small. This limitation is largely due to the rarity of cholesteatoma, which has an estimated prevalence of approximately 1 in 25,000 individuals, making it inherently difficult to assemble large-scale video datasets in a single institution. To address this limitation, we have made the following clarifications and revisions: In the Introduction, we now explicitly mention the rarity and prevalence of cholesteatoma to provide context for the limited dataset (p.2, lines 56-60).
We emphasize that the present study is not intended to establish a clinically deployable model, but rather to serve as a proof-of-concept investigation. In the revised Methods, we clarified that we employed extensive data augmentation (1,440-fold, including random rotation, inversion, resizing, and color/contrast adjustment) and a six-fold cross-validation strategy to maximize the utility of the limited dataset while minimizing bias. To evaluate stability, we trained 288 models across multiple random splits and used ensemble prediction as an experimental strategy. We acknowledge that relying on 23 models is not clinically feasible; however, this approach allowed us to quantify variability and identify design requirements under constrained data conditions (Figure 5, revised).
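For illustration, a minimal sketch of an augmentation pipeline of the kind described above (random rotation, inversion, resizing, and color/contrast adjustment) is shown below; the library choice (torchvision) and the specific parameter values are assumptions for illustration, not the settings used in the manuscript.

```python
# Illustrative sketch only: an augmentation pipeline of the kind described above
# (random rotation, inversion, resizing, color/contrast adjustment). Parameter
# values are placeholders, not the settings used in the study.
from torchvision import transforms

train_augmentation = transforms.Compose([
    transforms.RandomRotation(degrees=180),                # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                # inversion (horizontal)
    transforms.RandomVerticalFlip(p=0.5),                  # inversion (vertical)
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # resizing / random crop
    transforms.ColorJitter(brightness=0.2, contrast=0.2,   # color / contrast adjustment
                           saturation=0.2),
    transforms.ToTensor(),
])
```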
We clarified in the Discussion that future clinical translation will require larger, multi-institutional datasets and the development of robust single models that can achieve comparable performance without reliance on large ensembles (p.18, lines 771-776).
We believe these revisions address the reviewer’s concern by clarifying the purpose of the study, providing justification for the limited dataset, and appropriately positioning the findings as an exploratory step toward the development of clinically practical AI systems for cholesteatoma surgery.
Comments 3: The paper talks about a Novel AI model to evaluate the diagnosis accuracy from endoscopic and microscopic footage. However, there's no reference to the proposed model. No derivation, no algorithm.
Response 3: We thank the reviewer for this important comment. In the revised manuscript, we have clarified the proposed model by adding schematic workflow diagrams (Figures 5–7, revised) and representative input images (Figures 1–2, revised) to illustrate the algorithmic process in detail. Furthermore, Section 2.4 (Methods, p.7, lines 237-242) now explicitly describes the backbone network (MobileNetV2), and Supplementary Table S4 (revised) provides the full network structure and parameters. We also emphasize that the novelty of this study lies not in the backbone architecture itself, but in its application to dual-modality surgical videos, the introduction of a two-step Before/After editing framework, and the use of extensive augmentation with repeated cross-validation and a large-scale 23-model ensemble to stabilize performance under limited data conditions. We believe these additions sufficiently address the reviewer’s concern.
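For readers less familiar with this pattern, the sketch below shows the generic way a MobileNetV2 backbone is adapted for binary frame classification (residual cholesteatoma vs. normal mucosa); it is not the authors' exact network, whose full structure and parameters are reported in Supplementary Table S4.

```python
# Generic sketch of a MobileNetV2 backbone with a two-class head (residual
# cholesteatoma vs. normal mucosa). This is not the authors' exact network;
# their full structure and parameters are given in Supplementary Table S4.
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights=None)                 # MobileNetV2 backbone
model.classifier[1] = nn.Linear(model.last_channel, 2)    # replace the final layer with a 2-class head
```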
Comments 4: The result section is missing the comparison with existing establishment. Please justify the outcomes.
Response 4: We thank the reviewer for this constructive suggestion. Although this study did not include a direct comparison between the model’s performance and the diagnostic accuracy of experienced surgeons, we recognize that such analyses would be useful for validating the clinical relevance of the system and represent an important direction for future research.
In the revised Discussion (p.17, lines 603-606), we now explicitly note that our results should not be interpreted as directly comparable with existing studies, since differences in datasets, methodologies, and evaluation settings make straightforward comparisons difficult. Instead, we emphasize that the present work should be regarded as an exploratory proof-of-concept study that clarifies technical requirements for future systems. We also highlight that future studies should consider multi-institutional collaborations (p.18, lines 759-768) to validate the generalizability of our findings and to directly compare AI performance with established diagnostic benchmarks and clinical expertise.
We believe these revisions address the reviewer’s concern while positioning the study appropriately as a foundation for future investigations.
Comments 5: References are not in the proper format and look as if they were copy-pasted from a generative AI platform.
Response 5: We appreciate the reviewer’s comment regarding the reference formatting. As mentioned in our correspondence with the Assistant Editor, the inconsistencies were due to initial manual entry errors before we adopted Zotero with the official MDPI citation style. We have since thoroughly revised the reference list, correcting mismatches in titles, DOIs, and author order. The updated manuscript with the corrected references has already been submitted to the editorial office, and we believe this resolves the concern.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This study presents a novel application of artificial intelligence (AI) for detecting cholesteatoma residues in surgical videos, addressing a critical challenge in otolaryngology. Several areas require clarification and further discussion to strengthen the manuscript.
1. The study acknowledges the rarity of cholesteatoma, but the sample size of 144 cases from 88 patients remains relatively small for training robust AI models. Could the authors discuss potential strategies to mitigate the limitations of a small dataset, such as transfer learning or synthetic data augmentation?
2. Were there any efforts to ensure the dataset included a diverse range of cholesteatoma presentations (e.g., varying stages, sizes, or locations)? This could impact the generalizability of the results.
3. The diagnostic accuracy of ~80% is promising, but the manuscript would benefit from a deeper discussion of the false positives and false negatives. What were the common characteristics of cases where the model underperformed?
4. The study mentions significant variability across training runs due to the limited dataset. How was this variability quantified, and what steps were taken to ensure the reported results are representative?
5. The authors highlight the potential for real-time intraoperative use, but the study is described as a proof-of-concept. What are the key technical and clinical hurdles that need to be addressed before this system can be deployed in real-world surgical settings?
6. How does the AI system’s performance compare to the diagnostic accuracy of experienced surgeons? A comparative analysis would strengthen the clinical relevance of the findings.
Author Response
Please note that the detailed responses with all revisions highlighted (in red) are provided in the attached Word file.
The plain text of the responses (without highlights) is entered below.
Comments 1: The study acknowledges the rarity of cholesteatoma, but the sample size of 144 cases from 88 patients remains relatively small for training robust AI models. Could the authors discuss potential strategies to mitigate the limitations of a small dataset, such as transfer learning or synthetic data augmentation?
Response 1: We thank the reviewer for this valuable comment. We agree that the dataset in this study (144 videos from 88 patients) is relatively small, which is largely due to the rarity of cholesteatoma (estimated prevalence of approximately 1 in 25,000). To mitigate this limitation, we employed extensive data augmentation (approximately 1,440-fold, including random rotation, inversion, resizing, and color/contrast adjustment) and a six-fold cross-validation strategy to maximize the utility of the available dataset. In addition, we trained 288 models across multiple random splits and used ensemble prediction as an experimental strategy to quantify variability and identify design requirements under constrained data conditions.
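As a purely illustrative aid, the sketch below shows one way a patient-grouped six-fold split could be implemented so that videos from the same patient never appear in both training and test folds; scikit-learn's GroupKFold and the placeholder data are assumptions for illustration, not the authors' exact procedure.

```python
# Illustrative sketch: six-fold cross-validation grouped by patient, so that
# videos from the same patient never fall in both the training and test folds.
# All data below are placeholders, not the study's actual dataset.
from sklearn.model_selection import GroupKFold

video_paths = [f"video_{i:03d}.mp4" for i in range(144)]   # placeholder names for 144 videos
labels      = [i % 2 for i in range(144)]                   # placeholder binary labels
patient_ids = [i // 2 for i in range(144)]                  # placeholder: ~2 videos per patient

gkf = GroupKFold(n_splits=6)
for fold, (train_idx, test_idx) in enumerate(gkf.split(video_paths, labels, groups=patient_ids)):
    train_videos = [video_paths[i] for i in train_idx]
    test_videos  = [video_paths[i] for i in test_idx]
    # ... train one or more models on train_videos, evaluate on test_videos ...
    print(f"fold {fold}: {len(train_videos)} training videos, {len(test_videos)} test videos")
```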
While we did not apply transfer learning in the present study, we agree that it represents a promising approach for future work to further address the challenges of small datasets. Similarly, synthetic data generation strategies could be valuable in expanding training material and improving generalizability. We have revised the Discussion to clarify these points and to emphasize that future clinical translation will require larger, multi-institutional datasets and robust single models that can achieve high performance without reliance on large ensembles (p.18, lines 771-776).
Comments 2: Were there any efforts to ensure the dataset included a diverse range of cholesteatoma presentations (e.g., varying stages, sizes, or locations)? This could impact the generalizability of the results.
Response 2: We thank the reviewer for this important comment. The dataset included a wide age range, from children to older adults, and also incorporated both congenital and acquired cholesteatoma cases. In addition, the study utilized both endoscopic and microscopic videos, which provided diverse visual perspectives and imaging characteristics (Methods, Section 2.1, p.2, lines 72–86).
Regarding stage, size, and location, we note that cholesteatoma generally presents with similar morphological features irrespective of these factors. For surgical decision-making, what is most critical is the ability to distinguish residual cholesteatoma epithelium from surrounding normal mucosa within the operative field, which is not fundamentally altered by the stage or location of the disease.
We therefore believe that the dataset captured the clinically relevant diversity necessary for this proof-of-concept study. Nevertheless, we agree that further validation using larger, multi-institutional datasets will be important to ensure the generalizability of our findings.
Comments 3: The diagnostic accuracy of ~80% is promising, but the manuscript would benefit from a deeper discussion of the false positives and false negatives. What were the common characteristics of cases where the model underperformed?
Response 3: We thank the reviewer for this insightful comment. In the revised Discussion, we have expanded the description of false-positive and false-negative cases (p.17, lines 556-571). Specifically, we noted that false negatives often involved surgical fields with significant bleeding or very thin residual epithelium, but no obvious volumetric lesions were overlooked. False positives were also associated with bleeding fields or frames in which the absence of disease could not be confidently excluded, although subsequent follow-up confirmed no recurrence in these cases. We further emphasized that surgeons rely on tactile feedback and the overall intraoperative context, whereas AI depends solely on visual information, making visually ambiguous cases difficult for both clinicians and AI. Finally, we suggested that incorporating temporal continuity and multimodal intraoperative information may help to improve diagnostic performance in future model development. We believe these clarifications address the reviewer’s concern.
Comments 4: The study mentions significant variability across training runs due to the limited dataset. How was this variability quantified, and what steps were taken to ensure the reported results are representative?
Response 4: We thank the reviewer for this important comment. To address this point, we have added Supplementary Figures S1–S4, which show the variability in model performance across training runs. In the revised Methods (p.8, Section 2.6, Evaluation), we now refer readers to these supplementary materials. We believe this addition clarifies how variability was considered and provides greater transparency to the reported results.
Comments 5: The authors highlight the potential for real-time intraoperative use, but the study is described as a proof-of-concept. What are the key technical and clinical hurdles that need to be addressed before this system can be deployed in real-world surgical settings?
Response 5: We thank the reviewer for this important comment. In the revised Discussion, we have clarified the key technical and clinical hurdles that must be overcome before intraoperative deployment (p.18, lines 769-774). Specifically, we noted that the present study employed a large ensemble of models as an experimental strategy to stabilize performance under limited data conditions (Figure 5). While useful for proof-of-concept, such an approach is not clinically feasible. Future development will require the construction of robust single models trained on larger multi-institutional datasets, together with achieving real-time processing capability and seamless integration into the surgical workflow. We believe these clarifications address the reviewer’s concern by outlining the necessary steps for translation into real-world surgical practice.
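For illustration of the ensemble strategy referred to above, the sketch below averages the predicted class probabilities of several independently trained models; the averaging rule is an assumption for illustration, not necessarily the authors' exact aggregation method.

```python
# Illustrative sketch: ensemble prediction by averaging the softmax outputs of
# several independently trained models. The averaging rule shown here is an
# assumption for illustration, not necessarily the authors' exact procedure.
import torch

def ensemble_predict(models, frame_batch):
    """Average class probabilities for one batch of frames over all models."""
    probs = []
    with torch.no_grad():
        for model in models:
            model.eval()
            probs.append(torch.softmax(model(frame_batch), dim=1))
    return torch.stack(probs).mean(dim=0)  # shape: (batch_size, num_classes)
```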
Comments 6: How does the AI system’s performance compare to the diagnostic accuracy of experienced surgeons? A comparative analysis would strengthen the clinical relevance of the findings.
Response 6: We thank the reviewer for this constructive suggestion. Although this study did not include a direct comparison between the model’s performance and the diagnostic accuracy of experienced surgeons, we recognize that such analyses would be useful for validating the clinical relevance of the system and represent an important direction for future research. At the same time, the present findings suggest that AI support may help reduce the risk of overlooked lesions and serve as a valuable aid for a wide range of surgeons. Therefore, future studies should aim to verify the generalizability of these findings through multi-institutional collaborations, and we emphasize the importance of considering direct comparisons with established diagnostic benchmarks and clinical expertise in future validation efforts (p.18, lines 759-768).
We believe these revisions address the reviewer’s concern while positioning the study appropriately as a foundation for future investigations.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
This article describes the possibility of using AI to improve the accuracy of diagnosing cholesteatoma residues. The article has high scientific and medical-technical significance.
The article is written in accordance with the basic rules of a scientific manuscript. The main characteristics of the neural network efficiency assessment applicable to the analysis of medical video images are analyzed. It is especially worth noting that the dataset used in the study was provided by a medical organization affiliated with the authors.
However, to improve the manuscript, I suggest the authors pay attention to the following points:
1. To confirm the effectiveness of the proposed methods, I recommend expanding the analysis of references to similar studies. Perhaps the authors could compare the accuracy and specificity indicators obtained in the proposed study with them.
2. It would be useful to supplement the article with image examples and more accurate descriptions of the reference, diagnostically significant parameters that the study is aimed at finding
3. The MobileNet-V2 architecture is used in the work. However, the authors did not provide a rationale for the choice of a specific architecture and, possibly, a comparison with similar ones.
Author Response
Please note that the detailed responses with all revisions highlighted (in red) are provided in the attached Word file.
The plain text of the responses (without highlights) is entered below.
Comments 1: To confirm the effectiveness of the proposed methods, I recommend expanding the analysis of references to similar studies. Perhaps the authors could compare the accuracy and specificity indicators obtained in the proposed study with them.
Response 1: We thank the reviewer for this valuable suggestion. In the revised Introduction, we expanded the background to more clearly situate our work within the existing literature (p.2, lines 47-55). We added references noting that high diagnostic accuracy has been demonstrated for differentiating cholesteatoma using otoscopic images (Koyama et al., 2024), whereas only one study has evaluated intraoperative images, which involved a small dataset and reported limited sensitivity and specificity (Tseng et al., 2023; Miwa et al., 2022). By contrast, our study is the first to employ both microscopic and endoscopic surgical videos without manual annotation, thereby more directly reflecting real intraoperative conditions. We believe these additions clarify the novelty of our approach and address the reviewer’s concern.
Comments 2: It would be useful to supplement the article with image examples and more accurate descriptions of the reference, diagnostically significant parameters that the study is aimed at finding.
Response 2: We thank the reviewer for this helpful suggestion. In the revised manuscript, we have added representative input images (Figures 1–2, revised) and schematic workflow diagrams (Figures 5–7, revised) to illustrate the algorithmic process and diagnostically relevant parameters more clearly. We believe these additions improve the clarity of the manuscript and directly address the reviewer’s concern.
Comments 3: The MobileNet-V2 architecture is used in the work. However, the authors did not provide a rationale for the choice of a specific architecture and, possibly, a comparison with similar ones.
Response 3: We thank the reviewer for this important comment. In the revised Methods (Section 2.4, Neural Network), we have clarified the rationale for selecting MobileNetV2. Specifically, we noted that this architecture is lightweight, well-suited for training with relatively small datasets, and has been successfully applied in previous medical imaging studies. In addition, we considered the practical constraints of this proof-of-concept investigation: testing multiple heavier architectures would have required substantial computational cost, and our preliminary experience indicated that such models often lead to overfitting when trained on limited datasets. These factors supported our decision to adopt a lightweight model that balances efficiency and generalizability. We believe this clarification addresses the reviewer’s concern.
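To make the "lightweight" argument concrete, the sketch below compares trainable parameter counts of MobileNetV2 against a heavier common backbone; ResNet-50 is used here only as a familiar reference point, and the manuscript itself does not report this comparison.

```python
# Illustrative comparison of trainable parameter counts; ResNet-50 is shown
# only as a familiar reference point for what counts as a "heavy" backbone.
from torchvision import models

def n_params(model):
    return sum(p.numel() for p in model.parameters())

mobilenet = models.mobilenet_v2(weights=None)
resnet = models.resnet50(weights=None)
print(f"MobileNetV2: {n_params(mobilenet) / 1e6:.1f} M parameters")
print(f"ResNet-50:   {n_params(resnet) / 1e6:.1f} M parameters")
```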
Author Response File: Author Response.docx