Automated Assessment of Ki-67 Labeling Index Using Cell-Level Detection and Classification in Whole-Slide Images
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This study addresses a very important issue in histopathological cancer diagnostics. The Ki-67 labeling index and its routine evaluation have long been subjects of heated debate within the pathology and oncology communities. Any attempt to develop a tool that may support pathologists in this task is therefore highly valuable.
This appears to be a well-organized and rigorously conducted study that makes a meaningful contribution to the field. The inclusion of 320 cases is one of the study’s major strengths. The potential practical value of the study in clinical practice is highly commendable. It is also worth noting that the authors clearly describe the study limitations in the Discussion section.
The manuscript is well written and easy to read. The quality of English is more than satisfactory, and the most relevant information is clearly presented. However, in my opinion, the Methods section is somewhat disorganized and requires clarification and reorganization.
The authors state that the study group consists of 320 retrospective histopathological specimens; however, there is insufficient information on how the dataset was structured during the study. How many cases were allocated to the training set and how many to the testing set?
“The classification dataset was constructed by expert manual annotation of individual nuclei on Ki-67–stained images” (Rows 110–111) — were all 320 cases manually annotated by three experts, or did this apply only to the training set? If only the training set was manually annotated, how was the testing set evaluated? Was it assessed visually by eye-balling, according to routine diagnostic practice? The statement “according to routine diagnostic practice” (Rows 127–128) requires further clarification.
Moreover, the definition of “image” (Row 111) should be specified. Does this refer to regions of interest (ROIs) selected from whole-slide images? What software was used for expert annotation — integrated scanner software or third-party software?
What criteria were applied to determine the threshold for Ki-67 positivity? Were all stained nuclei included, regardless of staining intensity (including weak staining)?
The Methods section begins with the statement: “The proposed method consists of two sequential steps: (1) automatic detection of individual cell nuclei using a deep learning–based instance segmentation algorithm, and (2) classification of detected nuclei.” In my opinion, placing the “Overview of the analysis pipeline” at the beginning of the Methods section may be misleading. The section should instead begin with a detailed description of dataset formation as a ground truth, as this forms the basis for algorithm development.
“Annotations were reviewed to ensure consistency…” (Rows 113–114) — this requires more detailed explanation. Additionally, there is no information regarding the models of slide scanners used to generate the dataset.
As an interdisciplinary study based on medical material, the manuscript requires additional methodological detail. Since the dataset consists of retrospective archived slides, the years of diagnosis should be specified. Basic information regarding the immunohistochemistry protocol (antibody clone(s), automated vs. manual staining) should also be provided. In my opinion, reporting only cancer localization is insufficient; at minimum, histological tumor types should be described.
“The cell classification model was evaluated on a test set consisting of 71K positive and 170K negative [nuclei]” (Rows 144–145) — do these numbers refer to the total number of annotated nuclei?
In summary, in my opinion, the manuscript may be suitable for publication in Diagnostics after addressing the above-mentioned issues.
Author Response
This study addresses a very important issue in histopathological cancer diagnostics. The Ki-67 labeling index and its routine evaluation have long been subjects of heated debate within the pathology and oncology communities. Any attempt to develop a tool that may support pathologists in this task is therefore highly valuable.
This appears to be a well-organized and rigorously conducted study that makes a meaningful contribution to the field. The inclusion of 320 cases is one of the study’s major strengths. The potential practical value of the study in clinical practice is highly commendable. It is also worth noting that the authors clearly describe the study limitations in the Discussion section.
The manuscript is well written and easy to read. The quality of English is more than satisfactory, and the most relevant information is clearly presented. However, in my opinion, the Methods section is somewhat disorganized and requires clarification and reorganization.
The authors state that the study group consists of 320 retrospective histopathological specimens; however, there is insufficient information on how the dataset was structured during the study. How many cases were allocated to the training set and how many to the testing set?
Response: We’ve added a statement that the split is 45% for training, 5% for validation, and 50% for testing.
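The 45/5/50 split described in this response could be sketched as follows. This is a minimal illustration, assuming the split is made at case level with a seeded random shuffle; the case IDs, the function name `split_cases`, and the seed are hypothetical and not taken from the manuscript.

```python
import random


def split_cases(case_ids, train_frac=0.45, val_frac=0.05, seed=0):
    """Shuffle case IDs and split them into train/val/test subsets.

    Assumption: the split is done per case, so all patches from one
    case land in the same subset. The remainder after train and val
    goes to the test subset.
    """
    ids = sorted(case_ids)          # sort first so the result is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_frac)
    n_val = round(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


# Hypothetical identifiers for the 320 cases mentioned in the review.
cases = [f"case_{i:03d}" for i in range(320)]
splits = split_cases(cases)
print({name: len(members) for name, members in splits.items()})
```

Splitting by case rather than by patch or nucleus avoids the same slide contributing to both training and testing, which would otherwise inflate the measured performance.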
“The classification dataset was constructed by expert manual annotation of individual nuclei on Ki-67–stained images” (Rows 110–111) — were all 320 cases manually annotated by three experts, or did this apply only to the training set? If only the training set was manually annotated, how was the testing set evaluated? Was it assessed visually by eye-balling, according to routine diagnostic practice? The statement “according to routine diagnostic practice” (Rows 127–128) requires further clarification.
Response: For training and testing the AI classifier, all images were annotated, in both the training and test sets.
For the evaluation of the LI by three pathologists, a total of 20 unannotated image patches (512x512 pixels) were used.
“Routine diagnostic practice” refers to each pathologist’s usual method of assessing the LI.
Moreover, the definition of “image” (Row 111) should be specified. Does this refer to regions of interest (ROIs) selected from whole-slide images? What software was used for expert annotation — integrated scanner software or third-party software?
Response: We’ve clarified that an image is a 512x512 patch corresponding to an ROI. We used in-house web-based software for annotation.
What criteria were applied to determine the threshold for Ki-67 positivity? Were all stained nuclei included, regardless of staining intensity (including weak staining)?
Response: Yes, all stained nuclei were included, regardless of staining intensity.
The Methods section begins with the statement: “The proposed method consists of two sequential steps: (1) automatic detection of individual cell nuclei using a deep learning–based instance segmentation algorithm, and (2) classification of detected nuclei.” In my opinion, placing the “Overview of the analysis pipeline” at the beginning of the Methods section may be misleading. The section should instead begin with a detailed description of dataset formation as a ground truth, as this forms the basis for algorithm development.
Response: We’ve added a mention that an annotated dataset was first constructed to train the AI-based method. Later subsections expand on this.
“Annotations were reviewed to ensure consistency…” (Rows 113–114) — this requires more detailed explanation. Additionally, there is no information regarding the models of slide scanners used to generate the dataset.
Response: We’ve added that the annotations were reviewed by a senior pathologist. This was simply to make sure that annotations from different pathologists met the requirements. We’ve also added that the scanner is a Leica Aperio AT2.
As an interdisciplinary study based on medical material, the manuscript requires additional methodological detail. Since the dataset consists of retrospective archived slides, the years of diagnosis should be specified. Basic information regarding the immunohistochemistry protocol (antibody clone(s), automated vs. manual staining) should also be provided. In my opinion, reporting only cancer localization is insufficient; at minimum, histological tumor types should be described.
Response: we’ve added the following:
antibody clone: MIB-1
Automated staining (Dako automated stainer)
breast cancers: invasive ductal carcinoma; lung cancers: adenocarcinoma and squamous cell carcinoma; stomach cancers: adenocarcinoma; colon cancers: adenocarcinoma; uterine cancers: endometrial carcinoma
“The cell classification model was evaluated on a test set consisting of 71K positive and 170K negative [nuclei]” (Rows 144–145) — do these numbers refer to the total number of annotated nuclei?
Response: Yes. We’ve added a clarification.
In summary, in my opinion, the manuscript may be suitable for publication in Diagnostics after addressing the above-mentioned issues.
Response: Thank you so much.
Reviewer 2 Report
Comments and Suggestions for Authors
The following comments are to be incorporated into the revised manuscript.
- The introduction section of the manuscript should include the motivation behind the research.
- The literature section of the manuscript should be more informative and include the latest studies.
- The author has to include data collection protocols.
- In the methodology section, the author mentions that CNN model is used for the classification task but no CNN model is discussed in the methodology. Therefore, the author has to discuss the CNN model configurations and training parameters.
- The author has to explain the parameters used for the result section.
- For each used performance evaluation parameter, the author has to give brief details and descriptions that help the readers.
- Under the result section, the author has to include a comparative analysis with SoTA.
Author Response
The following comments are to be incorporated into the revised manuscript.
1. The introduction section of the manuscript should include the motivation behind the research.
Response: We mention our motivation in the introduction as part of “our primary objective”. We’ve separated it into its own paragraph to make it easier to see.
2. The literature section of the manuscript should be more informative and include the latest studies.
Response: We have added a few more references.
3. The author has to include data collection protocols.
Response: We’ve added some additional information regarding the data.
4. In the methodology section, the author mentions that CNN model is used for the classification task but no CNN model is discussed in the methodology. Therefore, the author has to discuss the CNN model configurations and training parameters.
Response: The cell detection is performed by a pre-existing cell detection model, which we reference. The main bulk of the task is the cell detection. We’ve added a mention of the simple classification CNN model.
5. The author has to explain the parameters used for the result section.
Response: If “parameters” means the statistical evaluation parameters, they are described in the statistical analysis section.
6. For each used performance evaluation parameter, the author has to give brief details and descriptions that help the readers.
Response: The evaluation parameters mentioned in the statistical analysis section are all standard metrics used in statistical analysis and are common knowledge.
7. Under the result section, the author has to include a comparative analysis with SoTA.
Response: The aim of this study was to compare the performance against manual reading of pathologists, which is the current standard. This study is not a machine learning algorithm paper where the aim is to improve on the state of the art.
Reviewer 3 Report
Comments and Suggestions for Authors
This study primarily investigates the automated, AI-based assessment of the Ki-67 proliferation index. I believe the study has significant shortcomings and therefore needs to be revised according to the following points:
- The abstract provides no information on how the AI-assisted system that classifies cell nuclei as Ki-67 positive or negative was designed. The abstract should be updated with this in mind.
- The abstract does not clearly highlight the difference and unique contribution of this study to similar AI-based Ki-67 studies in the literature. This contribution must be clarified.
- The sentence, "Although this approach is well established, it is inherently subjective and time-consuming [4–7]," is scientifically inadequate. What is the contribution of the references given in this sentence? The sentence should be strengthened, and the excess references should be removed.
- The literature review related to the problem addressed is limited to the sentence, "Recent advances in computational pathology have enabled reliable detection and classification of individual cell nuclei on histopathological images using deep learning–based methods [9–13]." This is unacceptable. Previous studies must be presented critically, and then the contribution of the proposed method in this study should be emphasized.
- The introduction is very weak. The clinical significance of Ki-67 and the limitations of manual assessment are examined. However, the question "How does this study differ methodologically from previous studies?" remains unanswered. A paragraph answering this question must be added to the introduction.
- It is recommended that the contributions to the literature provided by the proposed method be listed at the end of the introduction.
- The visual flowchart on the left in Figure 1 is not connected to any sections. The flowchart should be revised.
- How were the training, validation, and test datasets separated during the experiments?
- How were the ROIs selected in the "Whole-slide image preparation" section? This should be explained.
- The parameters with which the StarDist model was fine-tuned are not specified. This should be explained.
- To ensure the reproducibility of the experiments, the training time, optimization method, learning rate, number of epochs, and augmentation strategies should be explained.
- Test set numbers are given (71K positive, 170K negative), but it is not specified from how many cases these cells came from, and from which tumor types they were selected. This should be explained.
- Cut-off values such as <10%, 10–20%, >20% are used, but it should be explained why these thresholds were chosen or whether they vary depending on the tumor type.
Author Response
This study primarily investigates the automated, AI-based assessment of the Ki-67 proliferation index. I believe the study has significant shortcomings and therefore needs to be revised according to the following points:
- The abstract provides no information on how the AI-assisted system that classifies cell nuclei as Ki-67 positive or negative was designed. The abstract should be updated with this in mind.
Response: The aim of this study was to compare against pathologists. The AI model consists of a pre-existing cell detection algorithm, which we reference, with an added simple CNN classifier to classify each cell. We’ve added a bit more detail about the classifier being a lightweight CNN.
- The abstract does not clearly highlight the difference and unique contribution of this study to similar AI-based Ki-67 studies in the literature. This contribution must be clarified.
Response: the main contribution is the evaluation against pathologists which we mention.
- The sentence, "Although this approach is well established, it is inherently subjective and time-consuming [4–7]," is scientifically inadequate. What is the contribution of the references given in this sentence? The sentence should be strengthened, and the excess references should be removed.
Response: all those references mention the fact that there is some subjectivity involved despite the manual approach being common practice. We’ve replaced “well established” with “common practice”.
- The literature review related to the problem addressed is limited to the sentence, "Recent advances in computational pathology have enabled reliable detection and classification of individual cell nuclei on histopathological images using deep learning–based methods [9–13]." This is unacceptable. Previous studies must be presented critically, and then the contribution of the proposed method in this study should be emphasized.
- The introduction is very weak. The clinical significance of Ki-67 and the limitations of manual assessment are examined. However, the question "How does this study differ methodologically from previous studies?" remains unanswered. A paragraph answering this question must be added to the introduction.
Response: This paper is not an AI methods paper. The sole aim was to evaluate an off-the-shelf AI method against pathologists. There is no technical novelty, and the aim is simply to add further data about the performance of an AI-based method against pathologists for the evaluation of the Ki-67 LI.
We have added the following paragraph:
We do not propose a new AI architecture; instead, we treat the AI system as an off-the-shelf tool and focus on a clinically grounded evaluation framework: a traceable, cell-level output (per-nucleus positivity) and benchmarking of AI-derived Ki-67 LI against a multi-pathologist reference with performance interpreted in relation to inter-observer agreement. This design directly addresses the practical question of whether an existing AI system performs within the variability expected in routine practice.
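As a rough illustration of the evaluation described in the paragraph above, the sketch below derives a Ki-67 LI from per-nucleus positivity labels and sets the AI's deviation against the spread among pathologists. It assumes the LI is the percentage of positive nuclei among all detected nuclei, and that the multi-pathologist reference is the mean of the individual readings. The manuscript's actual reference definition and agreement statistics are not given in this record, so the function names and the numbers are hypothetical.

```python
from statistics import mean


def ki67_labeling_index(labels):
    """Return the Ki-67 LI in percent from per-nucleus labels (True = positive)."""
    labels = list(labels)
    if not labels:
        raise ValueError("no nuclei detected; the LI is undefined")
    return 100.0 * sum(bool(x) for x in labels) / len(labels)


def compare_to_pathologists(ai_li, pathologist_lis):
    """Relate the AI error to the range of pathologist readings.

    Assumption: the reference is the mean of the pathologists' LI values,
    and inter-observer variability is summarised by their range.
    """
    reference = mean(pathologist_lis)
    return {
        "reference": reference,
        "ai_error": abs(ai_li - reference),
        "observer_range": max(pathologist_lis) - min(pathologist_lis),
    }


# Hypothetical patch: 30 positive and 170 negative nuclei.
patch_labels = [True] * 30 + [False] * 170
ai_li = ki67_labeling_index(patch_labels)

# Hypothetical readings from three pathologists for the same patch.
result = compare_to_pathologists(ai_li, [12.0, 18.0, 15.0])
print(ai_li, result)
```

An AI error smaller than the observer range would suggest, for that patch, that the AI falls within the variability seen between pathologists. A full analysis would use an agreement statistic over all patches rather than a single comparison.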
- It is recommended that the contributions to the literature provided by the proposed method be listed at the end of the introduction.
Response: We’ve isolated the primary objective of the paper into a separate paragraph and expanded it with the above paragraph.
- The visual flowchart on the left in Figure 1 is not connected to any sections. The flowchart should be revised.
Response: Figure 1 about the AI-based computation of LI is already referenced in the text. Nonetheless, we’ve updated the caption to make it clear.
- How were the training, validation, and test datasets separated during the experiments?
Response: We’ve added a statement that the split is 45% for training, 5% for validation, and 50% for testing.
- How were the ROIs selected in the "Whole-slide image preparation" section? This should be explained.
Response: we’ve moved it to the annotation section and added the following:
Regions of interest (ROIs) containing representative tumour tissue were selected by an experienced pathologist for annotation, making sure that an ROI does not include artefacts, non-tumour tissue, and large areas of background. The size of a given ROI image patch was 512x512 pixels.
- The parameters with which the StarDist model was fine-tuned are not specified. This should be explained.
Response: The model was simply fine-tuned with additional data using exactly the same parameters as described in the original StarDist paper. We’ve added a mention of this.
- To ensure the reproducibility of the experiments, the training time, optimization method, learning rate, number of epochs, and augmentation strategies should be explained.
Response: we’ve added a paragraph on this
- Test set numbers are given (71K positive, 170K negative), but it is not specified from how many cases these cells came from, and from which tumor types they were selected. This should be explained.
Response: We’ve added a statement that they came from the 50% of cases allocated to testing.
- Cut-off values such as <10%, 10–20%, >20% are used, but it should be explained why these thresholds were chosen or whether they vary depending on the tumor type.
Response: We have added the following:
For the threshold-based analyses, Ki-67 LI values were grouped into three ordinal categories using commonly applied clinical cut-offs (<10%, 10–20%, and >20%). These thresholds were chosen to reflect typical low/intermediate/high proliferation strata similar to ranges used in some prior studies \cite{ahn2015differences, nielsen2021assessment}. The optimal Ki-67 thresholds are not universal and may vary by tumour type, clinical context, and institutional practice; therefore, the cut-off analysis in this study is intended as a pragmatic sensitivity analysis rather than as a tumour-specific recommendation.
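The three-way grouping described above could be expressed as in the following sketch. It assumes that the boundary values 10% and 20% both fall in the intermediate category, as the notation "<10%, 10–20%, >20%" implies. The function names and example values are hypothetical.

```python
def ki67_category(li_percent):
    """Map a Ki-67 LI (percent) to an ordinal category.

    Assumption: 10% and 20% themselves belong to "intermediate".
    """
    if not 0.0 <= li_percent <= 100.0:
        raise ValueError("LI must be a percentage between 0 and 100")
    if li_percent < 10.0:
        return "low"
    if li_percent <= 20.0:
        return "intermediate"
    return "high"


def category_agreement(ai_lis, reference_lis):
    """Fraction of items where AI and reference fall in the same category."""
    if len(ai_lis) != len(reference_lis) or not ai_lis:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        ki67_category(a) == ki67_category(r)
        for a, r in zip(ai_lis, reference_lis)
    )
    return matches / len(ai_lis)


# Hypothetical LI values for four patches.
ai_values = [4.0, 12.5, 20.0, 35.0]
reference_values = [6.0, 9.0, 18.0, 40.0]
agreement = category_agreement(ai_values, reference_values)
print([ki67_category(v) for v in ai_values], agreement)
```

Because small LI differences near a cut-off can flip the category, category agreement is best read alongside the continuous LI error rather than in place of it.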
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The responses provided by the authors after the first revision, as well as the changes they claim to have made in the manuscript, are not sufficient, and the overall quality of the paper remains low. The following issues must be addressed:
- The statement, “We’ve added a bit more detail about the classifier being a lightweight CNN.” is insufficient. What exactly is the pre-existing cell detection algorithm? Simply providing a citation is not enough; the integration process must be clearly explained.
- The response, “the main contribution is the evaluation against pathologists which we mention.” is highly inadequate. There are already numerous AI vs. pathologist comparison studies in the literature. Therefore, unless it is clearly articulated why this study is novel, why it is necessary, and which specific gap it addresses, the claim of originality remains weak. This statement must be substantially strengthened.
- In the previous revision, it was explicitly stated that so many references were not necessary for the sentence, “Although this approach is well established, it is inherently subjective and time-consuming [4–7],” and that some should be removed. However, this issue has not been addressed.
- Previous studies in the literature that compare AI and pathologists should be critically analyzed, and the specific contributions of the present study should be clearly articulated in relation to them.
- The split of “45% train, 5% val, 50% test” is unusual. The test proportion is excessively high. Simply adding these ratios to the manuscript is insufficient. A clear justification must be provided explaining why this specific distribution was chosen.
Author Response
[Reviewer 3]
The responses provided by the authors after the first revision, as well as the changes they claim to have made in the manuscript, are not sufficient, and the overall quality of the paper remains low. The following issues must be addressed:
- The statement, “We’ve added a bit more detail about the classifier being a lightweight CNN.” is insufficient. What exactly is the pre-existing cell detection algorithm? Simply providing a citation is not enough; the integration process must be clearly explained.
Response: The pre-existing cell detection algorithm is already described in detail in the referenced paper, and we have directly used its publicly available code: https://github.com/stardist/stardist. We have added a reference to the code.
We did not make any other modifications to this method or their training protocol. The implementation details of that method are dense and are not the main contribution of this paper.
- The response, “the main contribution is the evaluation against pathologists which we mention.” is highly inadequate. There are already numerous AI vs. pathologist comparison studies in the literature. Therefore, unless it is clearly articulated why this study is novel, why it is necessary, and which specific gap it addresses, the claim of originality remains weak. This statement must be substantially strengthened.
Response: This study compares the performance of pathologists against an off-the-shelf cell detection algorithm combined with a simple CNN classifier. It is true that there are many similar studies in the literature, so we make no claim of novelty or originality. What we aim for is scientific soundness and adding to the scientific evidence that corroborates previous studies.
- In the previous revision, it was explicitly stated that so many references were not necessary for the sentence, “Although this approach is well established, it is inherently subjective and time-consuming [4–7],” and that some should be removed. However, this issue has not been addressed.
Response: We have reduced them to two references. We nonetheless believe that the main argument, that the existing practice is time-consuming and subjective, needs to be substantiated, hence the need for multiple references.
- Previous studies in the literature that compare AI and pathologists should be critically analyzed, and the specific contributions of the present study should be clearly articulated in relation to them.
Response: Our aim in this study is not to conduct a comprehensive comparative review of existing literature, but rather to assess the performance of an off-the-shelf algorithm against pathologists. As stated, there is no particular methodological novelty in our study. However, it does provide new information about the performance of pathologists against an off-the-shelf algorithm for the assessment of Ki-67 apart from. This journal does not emphasize method novelty as a pre-condition for publication.
- The split of “45% train, 5% val, 50% test” is unusual. The test proportion is excessively high. Simply adding these ratios to the manuscript is insufficient. A clear justification must be provided explaining why this specific distribution was chosen.
Response: The larger the test set, the higher the confidence in the measured performance. Classifying cells as positive or negative is an extremely simple task, as it is highly correlated with colour. In our initial experiment, even a simple classifier based on a colour threshold was enough to achieve 91% AUC. For the CNN, beyond a certain point, adding further training images yielded no improvement in performance.
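The claim that a larger test set gives higher confidence can be made concrete with the normal-approximation confidence interval for a proportion, whose half-width shrinks as 1/sqrt(n). The sketch below is illustrative only: the accuracy of 0.95 is a hypothetical value, and 241,000 is simply the sum of the 71K positive and 170K negative test nuclei quoted in the reports.

```python
import math


def ci_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation 95% CI for a proportion.

    p: observed proportion (e.g. accuracy), n: number of test samples.
    Assumption: samples are independent, which is optimistic here because
    nuclei from the same case are correlated.
    """
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical accuracy of 0.95 evaluated on test sets of increasing size.
sizes = (1_000, 10_000, 241_000)
half_widths = [ci_half_width(0.95, n) for n in sizes]
for n, hw in zip(sizes, half_widths):
    print(f"n={n:>7}: +/- {100 * hw:.3f} percentage points")
```

Since nuclei within one slide are not independent, the effective sample size is closer to the number of test cases than to the number of nuclei, so these widths should be read as a lower bound on the true uncertainty.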
Round 3
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have carefully addressed all the issues and made the necessary revisions to the manuscript. The paper is acceptable in its current form.

