Article
Peer-Review Record

An Empirical Evaluation of Nuclei Segmentation from H&E Images in a Real Application Scenario

Appl. Sci. 2020, 10(22), 7982; https://doi.org/10.3390/app10227982
by Lorenzo Putzu * and Giorgio Fumera
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 30 September 2020 / Revised: 25 October 2020 / Accepted: 6 November 2020 / Published: 10 November 2020
(This article belongs to the Special Issue Image Processing Techniques for Biomedical Applications)

Round 1

Reviewer 1 Report

Review of the manuscript, “An Empirical Evaluation of Nuclei Segmentation from H&E Images in a Real Application Scenario” by Putzu and Fumera

 

In this work the authors describe a “real-world” comparison of five different quantitative approaches for nuclear segmentation of publicly available digital images typical of those used for histopathological evaluation, hematoxylin & eosin (H&E) stained sections of paraffin-embedded, formalin-fixed tissue.  The objective of the work is to evaluate the performance gap of segmentation methods when annotation of gold-standard nuclear segmentation does or does not exist.

 

The authors compare the output of five computational approaches applied to two datasets using a set of metrics, which have been previously described and made publicly available in the form of Matlab code by another group. The primary results are tables of the scoring metric values for each algorithm as it was applied to the different datasets, either within the same dataset or between the two datasets. While the quantitative assessment of the capabilities of the different computational approaches applied toward nuclear segmentation is useful, an important feature of this work is the process by which these comparisons were made. However, since the sections describing the methods were severely inadequate and no computational resources were provided (e.g., access to raw data, code used for performing data manipulation, links to the publicly available segmentation algorithms, any specific hardware requirements needed to perform the studies, etc.), assessment of the validity of the authors’ findings is not possible.

 

For example, the cited URL for the MIC17 dataset (http://miccai.cloudapp.net/competitions/) is no longer valid. Moreover, the presumed site (http://www.miccai.org/events/challenges/) does not have a link to the indicated challenge data, and the URL the authors cite for the BNS dataset (https://peterjacknaylor.github.io/) does not contain any image data. The authors state that data augmentation methods were used to expand the training data, including random cropping, vertical and horizontal flipping, however, no more details were provided. What was the total number of images used for training? What were the sizes of the cropped images? Finally, it is unclear why the authors do not provide (or make available) all data and code used to generate the results. The data and code require minimal storage space and there are plenty of publicly available platforms for sharing code and data.

 

In summary, while this manuscript may provide some useful information about the accuracy of different machine learning algorithms applied to nuclear segmentation of H&E histopathology digital images, the lack of details as to how the evaluations were performed makes it very hard to assess the validity of the authors’ findings.

Author Response

In this work the authors describe a “real-world” comparison of five different quantitative approaches for nuclear segmentation of publicly available digital images typical of those used for histopathological evaluation, hematoxylin & eosin (H&E) stained sections of paraffin-embedded, formalin-fixed tissue. The objective of the work is to evaluate the performance gap of segmentation methods when annotation of gold-standard nuclear segmentation does or does not exist. The authors compare the output of five computational approaches applied to two datasets using a set of metrics, which have been previously described and made publicly available in the form of Matlab code by another group. The primary results are tables of the scoring metric values for each algorithm as it was applied to the different datasets, either within the same dataset or between the two datasets. While the quantitative assessment of the capabilities of the different computational approaches applied toward nuclear segmentation is useful, an important feature of this work is the process by which these comparisons were made. However, since the sections describing the methods were severely inadequate and no computational resources were provided (e.g., access to raw data, code used for performing data manipulation, links to the publicly available segmentation algorithms, any specific hardware requirements needed to perform the studies, etc.), assessment of the validity of the authors’ findings is not possible.

Thank you for this comment. To ensure the reproducibility of the experiments, we have inserted the links to all the repositories from which we extracted the code used in this work. We have also added the hardware information to give an idea of the resources needed to run this code, although we doubt that this information is necessary to judge either the soundness of our experiments or the validity of our findings. The hardware configuration can certainly affect computation times, but that is not the kind of comparison we intended to make in this work.

For example, the cited URL for the MIC17 dataset (http://miccai.cloudapp.net/competitions/) is no longer valid. Moreover, the presumed site (http://www.miccai.org/events/challenges/) does not have a link to the indicated challenge data, and the URL the authors cite for the BNS dataset (https://peterjacknaylor.github.io/) does not contain any image data.

Thank you for pointing this out. Indeed, the URL of the MIC17 data set is currently not responding, as also reported here: https://www.kaggle.com/c/data-science-bowl-2018/discussion/47572. It must be noted, however, that even though we cannot state it in the manuscript, owing to the absence of a specific license allowing us to redistribute the data set directly, all the images and ground truth of the data sets used are available in at least one of the reported public repositories. For the URL of the BNS data set, we mistakenly inserted the homepage of Peter Naylor instead of the page for the BNS data set, which can be found at https://peterjacknaylor.github.io/PeterJackNaylor.github.io/2017/01/15/Isbi/ or downloaded directly from https://members.cbio.mines-paristech.fr/~pnaylor/Downloads/BNS.zip. We have corrected this information in the manuscript.

The authors state that data augmentation methods were used to expand the training data, including random cropping, vertical and horizontal flipping, however, no more details were provided. What was the total number of images used for training? What were the sizes of the cropped images? Finally, it is unclear why the authors do not provide (or make available) all data and code used to generate the results. The data and code require minimal storage space and there are plenty of publicly available platforms for sharing code and data.

Thank you for this remark. As stated above, our intention is to favour the reproducibility of the experiments, but for fairness we prefer that the original authors receive credit for their work; therefore, rather than creating a new repository, we have added the links to the repositories from which we took the data augmentation procedure. We have also included in the manuscript all the missing details of the data augmentation procedure, which enlarges the original training sets twenty-fold, so the total numbers of training images for MIC17 and BNS are 480 and 460, respectively. Random crops of size 224×224 are created and then flipped vertically and horizontally, each with a probability of 50%, as illustrated in the sketch below.
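For illustration only, a minimal sketch of such a paired augmentation is given below. It assumes torchvision-style tooling and illustrative function names; it is not the code from the original repositories, which remain the authoritative reference.

```python
# Minimal sketch of the augmentation described above: 20 random 224x224 crops
# per training image, each flipped horizontally and vertically with probability
# 0.5. Illustrative names and tooling (torchvision) are assumptions; the same
# random crop and flips are also applied to the ground-truth mask, as
# segmentation training requires.
import random

import torchvision.transforms.functional as TF
from torchvision.transforms import RandomCrop


def augment_pair(image, mask, crop_size=224, copies=20):
    """Yield (image, mask) pairs transformed with identical random parameters."""
    for _ in range(copies):
        i, j, h, w = RandomCrop.get_params(image, (crop_size, crop_size))
        img, msk = TF.crop(image, i, j, h, w), TF.crop(mask, i, j, h, w)
        if random.random() < 0.5:  # horizontal flip with 50% probability
            img, msk = TF.hflip(img), TF.hflip(msk)
        if random.random() < 0.5:  # vertical flip with 50% probability
            img, msk = TF.vflip(img), TF.vflip(msk)
        yield img, msk
```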

In summary, while this manuscript may provide some useful information about the accuracy of different machine learning algorithms applied to nuclear segmentation of H&E histopathology digital images, the lack of details as to how the evaluations were performed makes it very hard to assess the validity of the authors’ findings.

Thank you for this comment. We have rewritten some sentences and added others in order to clarify some details of our work.

Reviewer 2 Report

This paper evaluated cell nuclei segmentation methods from histopathology tissue sections. They show the methods overall perform well, but are impacted by variation in the colors and scale of the sections. They suggest future work should consider a pre-processing procedure on target images to make them more similar to the source images used for training the models. Overall, the paper is well-written and concise, but the results did not show statistical analysis, so it was hard to compare data between methods. The data were also not visualized in graphs, and instead just listed in tables, which makes the values difficult to interpret. The conclusions of the paper are limited, and propose future work.
  1. Line 211 - There should be more discussion on the meaning of these values and how to interpret them.
  2. Table 1 - There are no units of these values. It appears F1, P, R, and ODI are a scaling parameter <1, but the scaling of OHD is not clear. They said that OHD "measures the distance from ground truth,... so lower OHD means better segmentation accuracy." Should OHD be normalized to the same-data set? Are there statistically significant differences between these methods on target test data sets? It appears the DIMANN performs well for both same-set and target test data, except for OHD, P, and F1 for same-set. What does this mean? More discussion and interpretation is required.
  3. Line 235-236 - Please interpret these values and results for the reader before jumping to conclusions about the data. Are these differences statistically significant?
  4. Table 2 - Are these differences statistically significant?  What is the variability of the data?

Author Response

This paper evaluated cell nuclei segmentation methods from histopathology tissue sections. They show the methods overall perform well, but are impacted by variation in the colors and scale of the sections. They suggest future work should consider a pre-processing procedure on target images to make them more similar to the source images used for training the models. Overall, the paper is well-written and concise, but the results did not show statistical analysis, so it was hard to compare data between methods. The data were also not visualized in graphs, and instead just listed in tables, which makes the values difficult to interpret. The conclusions of the paper are limited, and propose future work.

Thank you for your comment and the suggestion. We performed a statistical analysis to compare the results obtained in the same-data set and cross-data set settings, but, since space constraints make it impossible to report the whole analysis in the form of graphs, we have reported only the most representative graphs of this analysis. We agree that in many cases graphs could aid data interpretation, but in this case, where numerous evaluation metrics have been used, the tables allow all the data to be grouped in a compact yet easily comparable form, whereas graphs would be dispersive and would not help the interpretation of the data.


Line 211 - There should be more discussion on the meaning of these values and how to interpret them.


Thanks for this suggestion. We have added more details and equations for the used metrics.

Table 1 - There are no units of these values. It appears F1, P, R, and ODI are a scaling parameter <1, but the scaling of OHD is not clear. They said that OHD "measures the distance from ground truth,... so lower OHD means better segmentation accuracy." Should OHD be normalized to the same-data set? Are there statistically significant differences between these methods on target test data sets? It appears the DIMANN performs well for both same-set and target test data, except for OHD, P, and F1 for same-set. What does this mean? More discussion and interpretation is required.

We have included further details on the metrics used, explaining their purpose, their range of values and their meaning. As mentioned above, we performed a statistical analysis and confirmed that there are statistically significant differences between the segmentation methods in both the same-data set and cross-data set scenarios.
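For reference, the generic forms that precision (P), recall (R), F1, the Dice index and the Hausdorff distance commonly take are sketched below; ODI and OHD are object-level aggregations of the last two quantities over matched nuclei, and their exact formulation follows the equations now added to the manuscript, which may differ in detail from this sketch.

```latex
% Generic definitions; TP, FP and FN count matched, spurious and missed nuclei.
\[
  P = \frac{TP}{TP + FP}, \qquad
  R = \frac{TP}{TP + FN}, \qquad
  F_1 = \frac{2PR}{P + R}
\]
% Dice overlap between a segmented object S and its ground-truth object G,
% with values in [0, 1] (higher is better):
\[
  D(S, G) = \frac{2\,|S \cap G|}{|S| + |G|}
\]
% Hausdorff distance between S and G, measured in pixels and unbounded above
% (lower is better):
\[
  H(S, G) = \max\!\left\{ \sup_{s \in S} \inf_{g \in G} d(s, g),\;
  \sup_{g \in G} \inf_{s \in S} d(s, g) \right\}
\]
```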

Line 235-236 - Please interpret these values and results for the reader before jumping to conclusions about the data. Are these differences statistically significant?

We have included further details to help readers interpret the results and understand these sentences.

Table 2 - Are these differences statistically significant? What is the variability of the data?

In this case as well, we performed a statistical analysis and confirmed that there are statistically significant differences between the segmentation methods in both the same-data set and cross-data set scenarios.
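Purely as an illustration of the kind of paired comparison that can support such a claim, a minimal sketch is given below; the specific test, metric and values are assumptions made for the example, not the analysis reported in the manuscript.

```python
# Illustrative only: one possible paired test (Wilcoxon signed-rank) applied to
# per-image F1 scores of the same method in the same-data set and cross-data set
# settings. The scores below are hypothetical placeholders, not reported results.
from scipy.stats import wilcoxon

same_dataset_f1  = [0.81, 0.78, 0.84, 0.79, 0.82, 0.80]
cross_dataset_f1 = [0.72, 0.69, 0.75, 0.70, 0.74, 0.71]

stat, p_value = wilcoxon(same_dataset_f1, cross_dataset_f1)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
```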

Round 2

Reviewer 2 Report

The authors have sufficiently addressed my comments.
