Article
Peer-Review Record

Digital Pathology with AI for Cervical Biopsies: Diagnostic Accuracy at the CIN2+ Threshold

Cancers 2025, 17(23), 3808; https://doi.org/10.3390/cancers17233808
by Anja Kristin Andreassen 1, Elin Mortensen 2,3, Roy Stenbro 4, Øistein Sørensen 4 and Sveinung Wergeland Sørbye 3,*
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 5 November 2025 / Revised: 24 November 2025 / Accepted: 26 November 2025 / Published: 27 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Dear authors,

The study evaluates the performance of an AI system, EagleEye, in diagnosing cervical biopsies, particularly at the CIN2+ threshold. It demonstrates that while EagleEye achieved high sensitivity (93.3%) and moderate specificity (71.8%), it serves as a supportive tool rather than a replacement for pathologists. The AI system effectively highlighted suspicious areas, improving detection of clinically relevant lesions, especially in challenging borderline cases. However, it struggled with glandular lesions, necessitating expert review to correct potential misclassifications. Overall, the findings underscore the importance of a human-in-the-loop approach for accurate cervical biopsy diagnostics.
To enhance the paper, consider the following suggestions:

1. Broaden the Dataset: Include a more diverse range of cases to reduce spectrum bias and improve generalizability.
2. Optimize AI Thresholds: Implement formal ROC or decision-curve analyses for better calibration of AI operating points.
3. Quantify Workflow Impact: Measure reading times and workload changes to assess AI's effect on pathologist efficiency.
4. Enhance Visual Outputs: Improve interpretability of AI results by providing clearer visual aids, such as heatmaps.
5. Expand Limitations Discussion: Elaborate on potential biases and the implications of using a single-center dataset.
6. Clarity and Structure: The article lacks clear section headings, making navigation difficult.
7. Statistical Reporting: Some statistical results are presented without adequate context or explanation, potentially confusing readers.
8. Terminology Consistency: Inconsistent use of terms (e.g., "CIN2+" vs. "CIN2") may lead to misunderstandings.
9. Visual Aids: Absence of figures or diagrams to illustrate key findings limits comprehension.
10. References: Some references lack complete citation details, affecting credibility.
11. Conclusion Depth: Conclusions are somewhat superficial and could be expanded to discuss implications more thoroughly.
Comments on the Quality of English Language

Overall, the manuscript is written in clear and professional scientific English. The structure, terminology, and tone are appropriate for a high-impact medical journal, and the authors communicate complex methodological and clinical concepts effectively. However, the text is very dense, and some sections would benefit from streamlining and minor editorial polishing to improve readability and flow.

Specific observations:

  • Grammar and syntax are generally correct, with only occasional long or overly complex sentences that could be simplified for clarity.

  • Consistency of terminology is good, though a few repeated explanations and lengthy sentences in the Introduction and Discussion could be tightened.

  • Acronyms are correctly defined, but frequent use may challenge readers unfamiliar with digital pathology; selective rephrasing or reducing redundancy would help.

  • Transitions between paragraphs can be improved in several places to enhance narrative flow.

  • The manuscript contains no major spelling or typographical errors.

Author Response

Reviewer 1

The study evaluates the performance of an AI system, EagleEye, in diagnosing cervical biopsies, particularly at the CIN2+ threshold. It demonstrates that while EagleEye achieved high sensitivity (93.3%) and moderate specificity (71.8%), it serves as a supportive tool rather than a replacement for pathologists. The AI system effectively highlighted suspicious areas, improving detection of clinically relevant lesions, especially in challenging borderline cases. However, it struggled with glandular lesions, necessitating expert review to correct potential misclassifications. Overall, the findings underscore the importance of a human-in-the-loop approach for accurate cervical biopsy diagnostics.

To enhance the paper, consider the following suggestions:

Comments 1: Broaden the Dataset: Include a more diverse range of cases to reduce spectrum bias and improve generalizability.

Response 1: We agree that a broader and more heterogeneous dataset would further reduce spectrum bias and improve generalizability. However, the aim of the present study was more limited: to evaluate whether EagleEye AI can assist pathologists in distinguishing lesions that require treatment (CIN2, CIN3, ACIS and invasive carcinoma) from lesions that can be followed conservatively (normal epithelium and CIN1) within a routine cervical cancer screening setting. In line with this aim, we included all relevant diagnostic categories from the WHO classification for cervical biopsies (normal, CIN1, CIN2, CIN3, ACIS and cervical cancer), but the number of cases in each category was constrained by the single-centre design, the study period, and the fact that this work was conducted as a master’s thesis with limited time and resources. Because most cervical biopsies at our hospital are originally reported by the main pathologist (P2), we also had to specifically search the archives to identify suitable cases reported by another pathologist (P1) to enable the planned comparisons between P1, P2 and EagleEye AI. We have now clarified these limitations in the Methods and Discussion and explicitly acknowledge that the relatively small, single-centre dataset may limit external generalizability, and that larger multi-centre studies with a more diverse case mix are needed. Our department has recently implemented new high-volume scanners and a DICOM-based workflow, and we are planning a follow-up study with a re-trained EagleEye AI model on a larger and more diverse dataset, but this lies beyond the scope of the present work.

Comments 2: Optimize AI Thresholds: Implement formal ROC or decision-curve analyses for better calibration of AI operating points.

Response 2: We agree that ROC and decision-curve analyses are valuable tools for exploring alternative operating points and formally assessing clinical net benefit. However, in the current implementation of EagleEye AI there is no explicit, tunable decision threshold at the slide level. The model is trained on millions of tiles from thousands of cervical biopsies (WSI), where each tile is classified by EagleEye AI and by pathologists, with an overall agreement of about 93% at the tile level. EagleEye AI does not “know” which lesions should be treated or not; it simply assigns a diagnostic category to each tile based on similarity to large numbers of tiles with the same diagnosis in the training set. Slide-level output is then derived from these tile-level classifications according to a predefined decision rule for CIN2+ versus <CIN2, rather than a continuous probability score where one could move an operating point along an ROC curve. We cannot retrospectively change how EagleEye AI classifies tiles or calibrate a slide-level probability without re-designing and re-training the model. Within these constraints, we therefore report sensitivity and specificity at the pre-specified CIN2+ threshold and have added a short paragraph in the Discussion to clarify this limitation and to state that future versions of the model could be developed with probabilistic outputs that would allow formal ROC and decision-curve analyses.
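
For illustration, a minimal sketch of the kind of count-based tile-to-slide aggregation described above (the label set, tile cut-off, and function names are illustrative assumptions, not the actual EagleEye implementation):

```python
from collections import Counter

# Illustrative only: the label set and the count-based cut-off are assumptions,
# not the actual EagleEye decision rule.
HIGH_GRADE = {"CIN2", "CIN3", "ACIS", "Carcinoma"}

def slide_level_call(tile_labels, min_high_grade_tiles=1):
    """Derive a binary slide-level call (CIN2+ vs <CIN2) from categorical
    tile-level classifications. Note there is no continuous slide-level
    probability that could be moved along an ROC curve."""
    counts = Counter(tile_labels)
    n_high = sum(counts[label] for label in HIGH_GRADE)
    return "CIN2+" if n_high >= min_high_grade_tiles else "<CIN2"

# Example: a biopsy with mostly normal tiles and a small CIN3 focus is flagged.
print(slide_level_call(["Normal"] * 40 + ["CIN1"] * 5 + ["CIN3"] * 3))  # CIN2+
```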

Comments 3: Quantify Workflow Impact: Measure reading times and workload changes to assess AI's effect on pathologist efficiency.

Response 3: We agree that quantitative data on reading times and workload would be valuable to assess the effect of EagleEye AI on pathologist efficiency. However, this was not feasible within the design and setting of the present study. We retrieved previously diagnosed cervical biopsy slides from the archives, re-evaluated them blindly in the microscope (P2), scanned the slides, re-evaluated them on screen after a 4-week wash-out without AI support, uploaded the files to EagleEye AI, and finally reviewed EagleEye AI heatmaps and diagnoses after another 4-week wash-out (P2). At all stages, P2 was blinded to clinical information and all other results. This multi-step research workflow does not reflect a realistic routine diagnostic pathway and therefore does not allow meaningful measurement of time savings or workload reduction.

At the time the study was conducted, our department was not fully digitized and the routine workflow was still based on analogue slide reading in the microscope. We are now in the process of transitioning to a fully digital workflow in which all slides are scanned and evaluated on screen. When EagleEye AI is fully integrated into our laboratory information system and digital viewer, we anticipate that AI-generated heatmaps and provisional diagnoses will be available at the time of primary slide review, allowing the pathologist to focus on the most relevant regions of interest instead of screening the entire WSI. This is expected to reduce the risk of overlooked CIN2+ lesions and may also decrease the need for ancillary tests such as p16 immunostaining in cases where the pathologist’s interpretation and EagleEye AI are concordant. Furthermore, current routines in many laboratories require that all high-grade lesions (CIN2+) are reviewed by a second pathologist; in the future, it may be possible to evaluate whether concordance between a pathologist and EagleEye AI can safely substitute a routine second human review in selected cases. We have added a paragraph in the Discussion to clarify that measuring reading times and workload changes was beyond the scope of this study, and that prospective implementation studies will be needed to formally quantify the impact of EagleEye AI on diagnostic efficiency and workflow.

Comments 4: Enhance Visual Outputs: Improve interpretability of AI results by providing clearer visual aids, such as heatmaps.

Response 4: We agree that clear visual outputs are important for interpretability and practical use of AI in diagnostic pathology. The current EagleEye AI system already generates colour-coded heatmaps that highlight the most suspicious (“hot spot”) areas of each cervical biopsy. In addition, all tiles from a given biopsy can be displayed in an ordered list, from those most consistent with high-grade lesions to those representing low-grade changes or non-lesional tissue. By moving the cursor over the thumbnail heatmap or using the arrow keys, the pathologist can navigate seamlessly between tiles and immediately view each region at high magnification in the corresponding WSI.

To address your comment, we have clarified this functionality in the Methods/Results and improved the quality and labelling of the example figures in the revised manuscript (including representative heatmaps and tile views), in order to better illustrate how EagleEye AI guides the pathologist towards the most relevant diagnostic areas.

Comments 5: Expand Limitations Discussion: Elaborate on potential biases and the implications of using a single-center dataset.

Response 5: We agree that the limitations related to potential biases and the single-center design should be more clearly stated. First, this was a retrospective study performed at a single university hospital in Norway with its own screening routines, case mix, HPV vaccination coverage and referral patterns. The distribution of diagnoses, the prevalence of CIN2+ and the proportion of borderline or equivocal cases may therefore differ from other institutions and countries, which limits the external generalizability of our findings. Second, the study dataset was not a consecutive screening cohort but was enriched for CIN2+ and diagnostically challenging cases in order to address the main research question. This case selection may introduce spectrum bias and affects the absolute values of sensitivity, specificity and predictive values, although it does not invalidate the comparative assessment of EagleEye AI versus pathologists within this setting.

Third, most cervical biopsies in our department are routinely reported by a small number of experienced gynecologic pathologists, and our results may not be directly transferable to laboratories with a different level of subspecialisation or workload. In addition, EagleEye AI was developed and trained on cervical biopsy slides from the same institution, using the same staining protocols and (at the time) the same scanner type. Even though the training and test sets were disjoint at the slide level, this introduces a risk that the model has, to some extent, learned centre-specific features, and we have not performed an independent external validation in another laboratory or on slides scanned with a different platform. Finally, the overall sample size, and especially the number of glandular lesions (ACIS and adenocarcinomas), was limited, leading to relatively wide confidence intervals for some estimates and a higher degree of uncertainty for subgroup analyses.

We have now expanded the Limitations section to explicitly discuss these sources of bias and to emphasise that larger, prospective multi-centre studies with external validation across different laboratories, scanners and case mixes are needed before EagleEye AI can be recommended for widespread implementation.

Comments 6: Clarity and Structure: The article lacks clear section headings, making navigation difficult.

Response 6: We appreciate this comment and agree that clear structure and navigation are important for readability. Our manuscript is organized according to the standard Cancers template, with the main sections Introduction, Materials and Methods, Results, Discussion and Conclusions, and with multiple subheadings that separate the description of the study population, slide preparation and scanning, the EagleEye AI system, statistical analyses, and the main outcome measures.

To further improve clarity, we have now: (i) checked that all section and subsection headings follow a consistent hierarchy and formatting throughout the manuscript; (ii) slightly reworded some subheadings to make them more descriptive (for example by explicitly mentioning “EagleEye AI” or the “CIN2+ threshold” where relevant); and (iii) added a short roadmap paragraph at the end of the Introduction that explicitly points the reader to the Methods (Section 2), Results (Section 3) and the Discussion/Conclusions (Sections 4–5). We hope these minor adjustments will make the manuscript easier to navigate for readers.

Comments 7: Statistical Reporting: Some statistical results are presented without adequate context or explanation, potentially confusing readers.

Response 7: We agree that statistical results should be reported with sufficient context to be easily interpretable. In this manuscript, our primary focus has been on a small set of standard measures—pairwise agreement quantified by Cohen’s κ with 95% confidence intervals and McNemar’s test, and diagnostic accuracy at the CIN2+ threshold expressed as sensitivity, specificity and 95% confidence intervals—always in relation to a clearly specified reference (P1 or EE+P2) and with explicit numerators and denominators (e.g., 56/60 CIN2+ detected, 28/39 <CIN2 correctly classified). To address your concern more explicitly, we have now: (i) expanded the “Statistical Analysis” subsection to briefly explain the purpose of each metric (agreement vs. accuracy vs. reclassification) and how it relates to the study aims; (ii) added short interpretative sentences in the Results immediately after key numerical estimates (for example, explicitly stating that the observed trade-off for EagleEye AI is an increased sensitivity for CIN2+ at the cost of more false positives near the CIN1/CIN2 threshold); and (iii) checked that all abbreviations (e.g., κ, CI) are defined at first use and that table titles and footnotes clearly describe what each statistic represents. We hope these clarifications make the statistical reporting more transparent and reduce any potential confusion for readers who are less familiar with diagnostic accuracy studies.
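
For transparency, the reported point estimates follow directly from the stated numerators and denominators; the sketch below recomputes them with Wilson score intervals (one common choice for binomial confidence intervals; the exact CI method used in the manuscript may differ):

```python
from math import sqrt

def wilson_ci(k, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Counts stated above at the CIN2+ threshold:
# 56/60 CIN2+ detected (sensitivity), 28/39 <CIN2 correctly classified (specificity).
for label, k, n in [("sensitivity", 56, 60), ("specificity", 28, 39)]:
    lo, hi = wilson_ci(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```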

Comments 8: Terminology Consistency: Inconsistent use of terms (e.g., "CIN2+" vs. "CIN2") may lead to misunderstandings.

Response 8: We thank the reviewer for this comment and agree that precise and consistent terminology is essential. In our manuscript, “CIN2” is always used as a specific histological diagnosis according to the WHO classification, while “CIN2+” is not a diagnosis in itself but a threshold used throughout the study to distinguish low-grade lesions (normal/CIN1) from high-grade lesions (CIN2, CIN3, ACIS and invasive carcinoma) that are considered clinically relevant and generally require treatment. We use “CIN2+” consistently in this sense when reporting diagnostic accuracy, agreement and reclassification at the high-grade threshold, whereas “CIN2” is only used when referring to that specific diagnostic category (e.g., in one of the tables). To avoid any possible misunderstanding, we have now added an explicit definition of “CIN2+” in the Methods (Study outcomes/Definitions) and have carefully checked the manuscript to ensure that the distinction between the diagnosis CIN2 and the threshold CIN2+ is consistently maintained throughout.

Comments 9: Visual Aids: Absence of figures or diagrams to illustrate key findings limits comprehension.

Response 9: We agree that visual aids can help readers better understand the key findings. In the revised manuscript we now include two figures in addition to the eight tables with detailed diagnostic accuracy and agreement results. Figure 1 is an illustration of the EagleEye AI user interface for a cervical biopsy, showing the overview heatmap, class-distribution panel and tile view that the pathologist uses when reviewing AI output. Figure 2 is a Venn diagram that illustrates concordance and discordance between all readers. We believe that these figures, together with the tables, make both the AI workflow and the main results easier to grasp for readers.


Comments 10: References: Some references lack complete citation details, affecting credibility.

Response 10: We agree that complete and accurate references are important for the credibility of the manuscript. We have therefore carefully re-checked the entire reference list against the Cancers author guidelines. All references now include full citation details (authors, title, journal or book/report, year, volume, page range or article number, and DOI or stable URL with access date for online documents). We did not identify any references lacking essential information, but we corrected a few minor typographical and formatting issues to ensure consistency. If there are specific references that the reviewer considers incomplete, we would be grateful for further details so that we can amend these accordingly.

Comments 11: Conclusion Depth: Conclusions are somewhat superficial and could be expanded to discuss implications more thoroughly.

Response 11: We agree that the Conclusions section is an important place to summarise the main findings and discuss their clinical and methodological implications. We have therefore expanded and refined the Conclusions to (i) more explicitly link the numerical results (sensitivity/specificity, reclassification of 11 additional CIN2+ cases, agreement metrics) to known areas of human variability in cervical histopathology; (ii) spell out the practical role of EagleEye as a “safety-net” decision-support tool that can enhance CIN2+ detection, support more consistent triage decisions and potentially optimise the targeted use of p16 by focusing attention on the most suspicious foci; and (iii) provide a clearer roadmap for future work, including prospective multi-site validation on unselected cohorts, formal quantification of workflow effects (reading time, time-to-diagnosis, prioritisation of high-risk cases) and further model development for glandular lesions while preserving interpretable outputs. We hope that the revised Conclusions now provide a deeper and more forward-looking discussion of the implications of our findings.

Comments 12: Quality of English Language

Overall, the manuscript is written in clear and professional scientific English. The structure, terminology, and tone are appropriate for a high-impact medical journal, and the authors communicate complex methodological and clinical concepts effectively. However, the text is very dense, and some sections would benefit from streamlining and minor editorial polishing to improve readability and flow.

Response 12: We thank the reviewer for the positive assessment of the English language and overall clarity. In response to these helpful suggestions, we have carefully re-read the manuscript and performed minor editorial polishing to improve readability and flow. Specifically, we have simplified several long or complex sentences, reduced some redundancies in the Introduction and Discussion, slightly reduced the density of acronyms where possible, and improved transitions between selected paragraphs. We hope that these revisions further enhance the clarity and accessibility of the manuscript.

 

Reviewer 2 Report

Comments and Suggestions for Authors

In this full-length article, Andreassen et al. report a deep-learning–based digital pathology system (EagleEye) for detecting CIN2+ in cervical punch biopsies and compare its performance across multiple diagnostic conditions: the original pathologist sign-out (P1), an experienced gynecologic pathologist under microscope and WSI conditions (P2), EagleEye alone, and an AI-assisted human-in-the-loop workflow (EE+P2). The authors also include an exploratory evaluation of spatial concordance between p16 immunostaining and AI-highlighted regions. The study is well designed for its stated purpose, with strengths in transparent reporting, a detailed breakdown of reclassification patterns, and a clear focus on the AI’s intended role as decision support rather than autonomous diagnosis. The manuscript is generally clearly written, methodologically sound, and addresses an important gap: human-in-the-loop evaluation of cervical biopsy AI, including glandular disease (ACIS). However, there are several key concerns related to spectrum bias, internal validation, limited independence between developers and evaluators, and underdeveloped statistical analyses for threshold optimization. Clarification and tempering of claims are needed.

Here is the list of major and minor comments:

Major comments:

  1. Spectrum-balanced sampling and prevalence bias require stronger justification and clearer limitations: The dataset is intentionally enriched (≈60–70% CIN2+, 20% ACIS), which substantially inflates PPV and NPV and may distort κ. While limitations are acknowledged, the manuscript often interprets metrics (e.g., incremental case-finding of 11 CIN2+) as if they might approximate real clinical gain. The authors should more explicitly emphasize that the prevalence is far from any real-world biopsy cohort; reframe sensitivity/specificity findings as internal performance estimates only; and avoid implying that the “11 recovered CIN2+ cases” reflect likely clinical impact. A dedicated paragraph in the Discussion or Limitations is needed.
  2. Independence and conflict-of-interest concerns: A major reader (P2) contributed to training labels and participated in software development. Although the authors mention this, the manuscript needs a more explicit discussion of how this affects generalizability and potential unintentional calibration bias; a clear justification for why a developer-pathologist served as the blinded evaluator instead of an independent external reader; and clarification of whether any of the exact study slides overlapped with P2’s prior annotation work. This issue may not invalidate the work, but transparency must improve.
  3. Question of whether EE+P2 can serve as a “reference”: The manuscript uses EE+P2 as a “secondary, augmented reference,” but this risks incorporation bias because the AI is part of the reference. This should be clarified: EE+P2 cannot be considered a comparator for evaluating P1 accuracy, and analyses using EE+P2 should be presented strictly as case-finding exploration, not diagnostic accuracy. The language surrounding these results should be revised for caution.
  4. Lack of statistical evaluation for human-in-the-loop improvement: Although the manuscript comments on increased sensitivity with AI assistance, it does not evaluate significance or quantify improvement in a more formal manner (e.g., McNemar’s test on P2 digital vs EE+P2). Adding such comparisons would strengthen claims about workflow benefits.
  5. No efficiency or workload analysis: The authors note that reading time and workload were not measured; this significantly limits the conclusions on clinical utility. At minimum, discuss how the absence of efficiency data limits interpretation, and consider including preliminary qualitative impressions from P2 (if available).

Minor comments:

  1. Figure clarity: Suggest revising the Venn diagram to ensure legibility and improving color contrast.
  2. Terminology consistency: Several labels vary slightly (e.g., “machine,” “model,” “algorithm,” “EagleEye”). Use consistent terminology throughout.

Author Response

Reviewer 2

In this full-length article, Andreassen et al. report a deep-learning–based digital pathology system (EagleEye) for detecting CIN2+ in cervical punch biopsies and compare its performance across multiple diagnostic conditions: the original pathologist sign-out (P1), an experienced gynecologic pathologist under microscope and WSI conditions (P2), EagleEye alone, and an AI-assisted human-in-the-loop workflow (EE+P2). The authors also include an exploratory evaluation of spatial concordance between p16 immunostaining and AI-highlighted regions. The study is well designed for its stated purpose, with strengths in transparent reporting, a detailed breakdown of reclassification patterns, and a clear focus on the AI’s intended role as decision support rather than autonomous diagnosis. The manuscript is generally clearly written, methodologically sound, and addresses an important gap: human-in-the-loop evaluation of cervical biopsy AI, including glandular disease (ACIS). However, there are several key concerns related to spectrum bias, internal validation, limited independence between developers and evaluators, and underdeveloped statistical analyses for threshold optimization. Clarification and tempering of claims are needed.

Here is the list of major and minor comments:

Major comments:

Comments 1: Spectrum-balanced sampling and prevalence bias require stronger justification and clearer limitations: The dataset is intentionally enriched (≈60–70% CIN2+, 20% ACIS), which substantially inflates PPV and NPV and may distort κ. While limitations are acknowledged, the manuscript often interprets metrics (e.g., incremental case-finding of 11 CIN2+) as if they might approximate real clinical gain. The authors should more explicitly emphasize that the prevalence is far from any real-world biopsy cohort; reframe sensitivity/specificity findings as internal performance estimates only; and avoid implying that the “11 recovered CIN2+ cases” reflect likely clinical impact. A dedicated paragraph in the Discussion or Limitations is needed.

Response 1: We agree that the deliberately spectrum-balanced sampling, with a high prevalence of CIN2+ and ACIS, introduces substantial spectrum and prevalence bias and that this must be stated very clearly. The study was designed as a single-centre, method-development master’s thesis with constrained time and resources. Under contemporary HPV-based screening, the vast majority of cervical punch biopsies are normal or CIN1, with fewer CIN2/CIN3 and very few ACIS or invasive cancers; these high-grade lesions are expected to become even rarer as vaccinated cohorts enter screening. A purely consecutive cohort would therefore have yielded only a small number of CIN2+ and almost no glandular or invasive cases, making it difficult to meaningfully evaluate EagleEye across the full histologic spectrum or to stress-test its behaviour at the CIN2+ threshold. For this reason, we intentionally enriched the sample to include all relevant diagnostic categories (normal, CIN1, CIN2, CIN3, ACIS, invasive carcinoma) in approximately equal numbers.

In response to the reviewer’s comment, we have further clarified this rationale in the Methods (Study Design and Case Selection) and have strengthened the discussion of limitations in the Strengths and Limitations subsections. Specifically, we now state that: (i) sensitivity and specificity estimates should be interpreted as internal performance parameters under spectrum-enriched conditions rather than as externally valid operating characteristics; (ii) PPV and NPV are study-internal and will be lower in real-world biopsy cohorts with 5–10% CIN2+ prevalence; and (iii) the “11 recovered CIN2+ cases” reflect incremental case-finding within this enriched sample and should not be taken as the expected absolute gain in routine clinical practice. We have also tempered the wording in the Results and Discussion to avoid implying direct real-world clinical impact and to emphasise that prospective, multi-centre validation on unselected screening/referral cohorts is required before any conclusions about clinical benefit can be drawn.
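
The prevalence dependence of the predictive values can be made concrete with Bayes’ rule, using the internal sensitivity and specificity estimates quoted above and prevalences spanning the enriched study mix and a routine 5–10% setting (an illustrative calculation only, not a claim about any specific cohort):

```python
def predictive_values(sens, spec, prev):
    """Bayes' rule: PPV and NPV at a given CIN2+ prevalence."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

sens, spec = 0.933, 0.718  # internal estimates at the CIN2+ threshold
for prev in (0.60, 0.10, 0.05):  # enriched study mix vs. a routine 5-10% range
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```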

Comments 2: Independence and conflict-of-interest concerns: A major reader (P2) contributed to training labels and participated in software development. Although the authors mention this, the manuscript needs a more explicit discussion of how this affects generalizability and potential unintentional calibration bias; a clear justification for why a developer-pathologist served as the blinded evaluator instead of an independent external reader; and clarification of whether any of the exact study slides overlapped with P2’s prior annotation work. This issue may not invalidate the work, but transparency must improve.

Response 2: We thank the reviewer for raising these important points regarding independence, potential calibration bias, and transparency. EagleEye AI was initially developed as an in-house decision-support tool beginning in 2020. In the early development phase, P2 contributed substantially to training labels: first by annotating epithelial regions and then by assigning WHO-based diagnoses to individual tiles. After several thousand tiles had been annotated and incorporated into iterative training cycles, the model reached sufficient accuracy to automatically identify epithelium, generate heatmaps, and present tiles ordered from most severe lesions to normal epithelium. Over subsequent years, almost all cervical biopsies from our laboratory were scanned, processed by EagleEye, and double-checked by P2, while all high-grade (CIN2+) cases continued to undergo independent co-signing by another pathologist based on analogue microscope review, in accordance with local QA routines.

For the present study, however, the 99 slides (≈5,000 tiles) were not part of any training or tuning of EagleEye AI. None of these WSIs were used during model development, and P2 did not create any tile-level annotations for them. All visible pen marks from the original sign-out (P1) were physically removed from the slides before scanning. There is therefore no slide-level overlap between the 9,000+ previously scanned and annotated biopsies used in development and the 99 study slides, although they do share the same laboratory, H&E protocol, scanner type, and file format.

We agree that P2’s dual role—as a contributing developer and as an expert reader—may introduce a risk of unintentional calibration bias and limits the independence of the evaluation. We chose P2 as the blinded evaluator because the primary aim of this master’s thesis was an internal method-development study at a single centre, and we wanted to (i) benchmark EagleEye against the routine sign-out (P1) and (ii) explore how an experienced gynecologic pathologist at our own institution would interact with the AI in a human-in-the-loop workflow. To mitigate bias, we implemented strict blinding and washout procedures: P2 was not the original reporting pathologist, had no access to prior reports, clinical information, cytology, HPV results, deeper levels or p16, and was blinded to P1’s diagnoses and to all EagleEye outputs during the microscope and digital WSI readings, with ≥4-week washout and independently randomized case order between sessions. Only in the EE+P2 condition did P2 see EagleEye outputs, and even then remained blinded to P1.

In response to the reviewer’s comment, we have (i) clarified in the Methods (AI system and Human readings subsections) that P2 contributed to training labels during early development but had no annotation role for the 99 study slides, which were disjoint from the training set; (ii) expanded the Limitations section to explicitly state that the involvement of a developer-pathologist as reader may favour internal validity but limits independence and generalizability, and that truly external, multi-centre validation with independent readers is required before widespread adoption; and (iii) added a statement in the Conflict of Interest/Author Contributions sections to transparently disclose P2’s role in model development and the absence of commercial funding or ownership related to EagleEye AI.

Comments 3: Question of whether EE+P2 can serve as a “reference”: The manuscript uses EE+P2 as a “secondary, augmented reference,” but this risks incorporation bias because the AI is part of the reference. This should be clarified: EE+P2 cannot be considered a comparator for evaluating P1 accuracy, and analyses using EE+P2 should be presented strictly as case-finding exploration, not diagnostic accuracy. The language surrounding these results should be revised for caution.

Response 3: We agree that EE+P2 cannot be regarded as an independent reference standard and that its use as a comparator introduces verification/incorporation bias, because the AI output is part of the construct being evaluated. Our intention was not to present EE+P2 as a true “gold standard,” but to use it as an internal, augmented comparator to explore how many additional CIN2+ cases might be surfaced when an experienced gynecologic pathologist reviews the WSI with AI guidance compared with routine sign-out by P1.

To address this more clearly, we have revised the manuscript in several places. In the Methods (Reference standards and endpoints), we now explicitly state that EE+P2 is not treated as a definitive truth standard, that it incorporates the intervention under study, and that any comparisons against EE+P2 are subject to verification/incorporation bias and should be interpreted as case-finding rather than diagnostic accuracy. In the Results, the section describing P1 relative to EE+P2 has been reworded to emphasise that this is an exploratory analysis of incremental case-finding, not an assessment of P1 “accuracy” against an independent reference. In the Discussion and Limitations, we further stress that all sensitivity/specificity estimates and the observed “recovery” of 11 additional CIN2+ cases reflect internal performance under spectrum-enriched conditions and do not directly translate into real-world clinical impact.

Comments 4: Lack of statistical evaluation for human-in-the-loop improvement: Although the manuscript comments on increased sensitivity with AI assistance, it does not evaluate significance or quantify improvement in a more formal manner (e.g., McNemar’s test on P2 digital vs EE+P2). Adding such comparisons would strengthen claims about workflow benefits.

Response 4: We agree that formal statistical comparisons are important to quantify the added value of AI assistance. In the revised manuscript, we have now explicitly evaluated the difference between P2-digital and EE+P2 at the CIN2+ threshold using McNemar’s test. The 2×2 table comparing P2-digital with EE+P2 (Table 4) shows 5 discordant “upgrades” and 4 “downgrades”; McNemar’s exact test was non-significant (p = 1.00), indicating no systematic directional shift once an experienced gynecologic pathologist has already reviewed the WSI without AI support. We have added this result and a short interpretative paragraph in the Results (new subsection 3.3.3) and tempered our wording in the Discussion to avoid implying a statistically demonstrable gain over P2-digital.
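
For reference, the exact McNemar test reduces to a two-sided binomial test on the discordant pairs; the short sketch below (function name is illustrative) reproduces p = 1.00 from the 5 upgrades and 4 downgrades reported above:

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact (binomial) McNemar test on the discordant pair counts b and c."""
    n = b + c
    k = max(b, c)
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tail)

# 5 "upgrades" and 4 "downgrades" between P2-digital and EE+P2 (Table 4):
print(mcnemar_exact_p(5, 4))  # 1.0, matching the reported p = 1.00
```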

At the same time, we believe the human-in-the-loop workflow remains clinically relevant when viewed against routine sign-out (P1). Using the AI-assisted consensus (EE+P2) as a methodological comparator, P1 showed 83.8% sensitivity and 100% specificity, reflecting 11 CIN2+ cases that were only identified when an experienced gynecologic pathologist reviewed the WSI with AI guidance. We now emphasize that these findings should be interpreted as internal evidence of incremental case-finding compared with routine sign-out, rather than as a statistically significant improvement over expert digital reading, and we explicitly state that EE+P2 is used solely as a methodological construct rather than as an independent gold standard.

Comments 5: No efficiency or workload analysis: The authors note that reading time and workload were not measured; this significantly limits the conclusions on clinical utility. At minimum, discuss how the absence of efficiency data limits interpretation, and consider including preliminary qualitative impressions from P2 (if available).

Response 5: We agree that the absence of quantitative efficiency and workload data is an important limitation and that it restricts the conclusions we can draw about clinical utility. As now stated explicitly in the Methods (Section 2.8) and Discussion (Sections 4.7 and 4.9), reading time, time-to-sign-out, and perceived workload/stress were not pre-specified endpoints and were not recorded, so we do not claim any measurable gain (or loss) in speed or workload from EagleEye in the present study. In the revised Discussion, we instead frame potential workflow effects as hypotheses: that AI-guided hotspot highlighting may reduce the risk of overlooked CIN2+ lesions, increase the pathologist’s confidence when EagleEye and the human reader agree, and potentially decrease reliance on ancillary p16 staining and routine second reads for high-grade lesions—particularly in settings with less experienced readers—provided that high-grade diagnoses remain subject to expert confirmation. We have added explicit wording to the Limitations to emphasize that these points are speculative and that prospective implementation studies, with instrumented logging of reading time, micro-interactions, and standardized workload measures, will be required to formally quantify any impact of EagleEye on efficiency and human factors.

Minor comments:

Comments 6: Figure clarity: Suggest revising the Venn diagram to ensure legibility and improving color contrast.

Response 6: We thank the reviewer for this suggestion. In response, we have experimented with several alternative layouts for the Venn diagram and adjusted font size, label placement, and colour contrast to improve legibility, while preserving the underlying counts and relationships between the four readers. After comparing the alternatives, the authors agreed that the current configuration, with these minor visual refinements, provided the clearest overall presentation and we have therefore retained this version in the revised manuscript.

Comments 7: Terminology consistency: Several labels vary slightly (e.g., “machine,” “model,” “algorithm,” “EagleEye”). Use consistent terminology throughout.

Response 7: We agree that consistent terminology is important. In the revised manuscript, we have systematically reviewed and harmonized the wording throughout. We now refer to the system consistently as “EagleEye” in the text, tables, and figure legends (including instances that previously used alternative labels), and use the generic term “algorithm” only when explicitly contrasting human versus AI conditions (e.g., “pathologist vs algorithm”). Terms such as “machine” and “model” have been removed or replaced accordingly to maintain consistent nomenclature.

Reviewer 3 Report

Comments and Suggestions for Authors

I read this manuscript with great interest. It addresses a very interesting topic: the use of artificial intelligence in the diagnosis of cervical pathology. The results obtained using artificial intelligence are very promising, but the authors of the manuscript rightly note that the results must be evaluated by an experienced pathomorphologist. In my opinion, the conclusions section is too long; some of the text should be moved to the discussion section. The conclusions section should emphasize the main result summarizing the manuscript – please correct this. In my opinion, the work submitted to me for evaluation fully deserves publication after minor editorial corrections.

Author Response

Reviewer 3

I read this manuscript with great interest. It addresses a very interesting topic: the use of artificial intelligence in the diagnosis of cervical pathology. The results obtained using artificial intelligence are very promising, but the authors of the manuscript rightly note that the results must be evaluated by an experienced pathomorphologist.

Comments 1: The conclusions section is too long; some of the text should be moved to the discussion section. The conclusions section should emphasize the main result summarizing the manuscript – please correct this. In my opinion, the work submitted to me for evaluation fully deserves publication after minor editorial corrections.

Response 1: We thank the reviewer for this helpful suggestion and agree that the Conclusions section should focus more tightly on the main findings, with forward-looking implementation aspects moved to the Discussion. In the revised manuscript, we have shortened the Conclusions by (i) removing the detailed description of potential workflow effects and implementation steps (previously starting from “In practice, such a system could contribute…” and including the subsequent list of prerequisites for broad clinical adoption) and (ii) moving these elements to Section 4.7 (“Implications and future work”). The Conclusions now emphasize the central results of the study—diagnostic performance at the CIN2+ threshold, the human-in-the-loop nature of EagleEye, and the main implications for case-finding—while pointing more briefly to the need for external validation and prospective workflow studies.

Revised Conclusions:

“5. Conclusions
In this spectrum-enriched series of 99 digitized cervical biopsies, a human-in-the-loop AI workflow (EagleEye) achieved high case-finding for treatment-relevant disease at the CIN2+ threshold while pathologists retained full diagnostic authority. Compared with the original sign-out (P1), EagleEye reached 93.3% sensitivity (specificity 71.8%); using the AI-assisted read (EE+P2) as a methodological comparator, P1 showed 83.8% sensitivity and 100% specificity, indicating that AI guidance helped surface additional CIN2+ lesions mainly in borderline P1-Normal/CIN1 biopsies and subtle invasive candidates within P1-CIN3. Agreement across readers and conditions was substantial to almost perfect (percent agreement 89–94%; κ up to 0.86), and diagnostic performance was strongest for squamous CIN, with ACIS highlighting the intended dependence on expert adjudication in a human-in-the-loop design.

Taken together, these findings support EagleEye as a safety-net decision-support tool that can reduce the risk of missed high-grade lesions and promote more consistent grading of cervical biopsies, while final diagnostic responsibility remains with the pathologist. External multi-site validation and prospective workflow studies will be needed to confirm generalizability and quantify the impact of EagleEye on routine diagnostic pathways before broad clinical implementation.”

 
