Article
Peer-Review Record

Robust Resolution-Enhanced Prostate Segmentation in Magnetic Resonance and Ultrasound Images through Convolutional Neural Networks

by Oscar J. Pellicer-Valero 1,*, Victor Gonzalez-Perez 2, Juan Luis Casanova Ramón-Borja 3, Isabel Martín García 2, María Barrios Benito 2, Paula Pelechano Gómez 2, José Rubio-Briones 3, María José Rupérez 4 and José D. Martín-Guerrero 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Appl. Sci. 2021, 11(2), 844; https://doi.org/10.3390/app11020844
Submission received: 29 December 2020 / Revised: 13 January 2021 / Accepted: 15 January 2021 / Published: 18 January 2021

Round 1

Reviewer 1 Report


The authors employ CNNs for the segmentation of prostate images acquired from MRI and US.
They employ checkpoint ensembling and Neural Resolution Enhancement, and have used five different datasets to train and validate their models.

They have shown that their model achieves strong levels of performance.

 


Introduction
Line 34-38: Does "in their experiments" refer to the current study, or to another one?
If it is the former, it should not be placed in the introduction; if it refers to the latter, it should be cited with its proper reference.

Line 305-306: To assess the performance of their models, ABD, DSC and HD95 were employed.
In the discussion section, at lines 509-510, the authors claim that their models improve in terms of accuracy, robustness, and generalizability.

Do the performance parameters really reflect accuracy? If the authors want to claim accuracy, other performance parameters should be included in the results section. If not, then lines 509-510 should be rephrased.
Caution is needed in interpreting the results.
Since the study was based solely on segmentation and not disease identification, the clinical utility of the application should be explored in more detail in the discussion section.

Line 334-335
Two paired t-tests were used to confirm the differences in the DSC.
Did the values follow a normal distribution?
Since these values are skewed, using a paired t-test does not seem appropriate.
I would suggest using logit-transformed values and then applying the statistical methods.

Author Response

The authors employ CNNs for the segmentation of prostate images acquired from MRI and US. They employ checkpoint ensembling and Neural Resolution Enhancement, and have used five different datasets to train and validate their models. They have shown that their model achieves strong levels of performance.

 

Introduction

Line 34-38: Does "in their experiments" refer to the current study, or to another one?

If it is the former, it should not be placed in the introduction; if it refers to the latter, it should be cited with its proper reference.

In this paragraph we are trying to put forward the argument that inter-observer variability is high in manual prostate segmentation. To do so, we searched the literature for prostate segmentation papers in which a comparison between experts is performed, but we were only able to find one, which is cited in that paragraph [3]. Since we could not find more papers on the subject, we thought that adding the expert comparison performed on our data would not only support the comparison from [3], but also our own results (which are similar to theirs in this comparison), as well as the original argument (since the observed agreement between experts is not very high).

[3] Shahedi, M.; Halicek, M.; Li, Q.; Liu, L.; Zhang, Z.; Verma, S.; Schuster, D.M.; Fei, B. A semiautomatic approach for prostate segmentation in MR images using local texture classification and statistical shape modeling. In Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling; Fei, B., Linte, C.A., Eds.; SPIE, 2019; Vol. 10951, p. 91. doi:10.1117/12.2512282.

 

Line 305-306: To assess the performance of their models, ABD, DSC and HD95 were employed.

In the discussion section, at lines 509-510, the authors claim that their models improve in terms of accuracy, robustness, and generalizability.

Do the performance parameters really reflect accuracy? If the authors want to claim accuracy, other performance parameters should be included in the results section. If not, then lines 509-510 should be rephrased.

Thank you for the comment. As you mention, accuracy is employed improperly in the conclusion of the paper, since it is not one of the metrics considered. We agree that it is misleading, and hence we have changed the word “accuracy” to “performance” in the conclusion.

 

Caution is needed in interpreting the results.

Since the study was based solely on segmentation and not disease identification, the clinical utility of the application should be explored in more detail in the discussion section.

The last four paragraphs of Section 4, Discussion, explore the clinical utility of the results in sufficient depth. In particular, the paragraph at lines 486-491 introduces the kinds of procedures where this model might be useful. The next paragraph discusses the application of the MR and US segmentation models to the previously described procedures. Finally, the last two paragraphs describe two practical applications of Neural Resolution Enhancement. We therefore believe the extent of the discussion to be adequate, since all direct applications derived from the developed work are already explored.

 

Line 334-335

Two paired t-tests were used to confirm the differences in the DSC. Did the values follow a normal distribution? Since these values are skewed, using a paired t-test does not seem appropriate. I would suggest using logit-transformed values and then applying the statistical methods.

We agree that normality should have been assessed before applying the paired t-test. As per your suggestion, we have done so now. A paragraph has been added to Section 2.10 (lines 311-315) to reflect these findings: “When comparing these metrics among groups, the Wilcoxon signed-rank test was employed, which is the non-parametric equivalent of the paired t-test. The Wilcoxon test was needed due to the distribution of the metrics in the test set not being normal (p-value <= 0.001 using D'Agostino and Pearson's normality test for DSC, ABD and HD95 results).”

We decided to use the non-parametric Wilcoxon test instead of your suggestion of applying the logit transform because this transform can only be applied to values in the range [0,1], but the ABD and HD95 metrics do not lie in that range. Furthermore, even if DSC does lie in the range [0,1], it is not a probability, and hence applying this transform to it would be arguable.
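
For reference, the procedure described above can be sketched in a few lines of Python with SciPy. This is a minimal illustration with random placeholder data, not our actual evaluation code:

```python
# A minimal sketch (not the authors' code) of the testing procedure: check
# normality of the paired differences with D'Agostino and Pearson's test,
# then fall back to the Wilcoxon signed-rank test when normality fails.
# The DSC arrays below are random placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dsc_model_a = rng.beta(8, 2, size=40)  # hypothetical per-patient DSC values
dsc_model_b = rng.beta(7, 2, size=40)  # hypothetical competing model

differences = dsc_model_a - dsc_model_b

# D'Agostino and Pearson's omnibus normality test.
_, p_normal = stats.normaltest(differences)

if p_normal > 0.05:
    # Differences look normal: the paired t-test is appropriate.
    _, p_value = stats.ttest_rel(dsc_model_a, dsc_model_b)
else:
    # Non-parametric equivalent of the paired t-test.
    _, p_value = stats.wilcoxon(dsc_model_a, dsc_model_b)

print(f"normality p = {p_normal:.3f}, comparison p = {p_value:.3f}")
```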

All sections where the t-test was employed have been recalculated with the Wilcoxon test. However, the only changes in statistical significance (with respect to the t-test) occurred in lines 375-382, which have been rewritten to reflect them: “When comparing our model against each of the others with a Wilcoxon test, only the first contender (MSD-Net) was found to be significantly (p-value <= 0.01) better in all metrics, while the fourth contender (nnU-Net) was better in terms of DSC (p-value = 0.037) and ABD (p-value = 0.030), but not HD95 (p-value = 0.439). The nnU-Net [37] is a very recent and interesting method that tries to automate the process of adapting a CNN architecture to a new dataset by making use of a sensible set of heuristics. Regarding the MSD-Net, unfortunately, its specifics are yet to be published as of the writing of this paper.”

[37] Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2020. doi:10.1038/s41592-020-01008-z.

Reviewer 2 Report

The approach proposed in the paper is one of several already known methods for segmenting MR images of the prostate. The obtained result is not qualitatively better than those already reported by other researchers. It is further proof that convolutional networks are good at analyzing images, but it also proves that medical image segmentation is not a simple task. In addition to the expert knowledge that this task requires, the quality of the acquired images is equally important. It is in connection with this problem that the weakness of this work is revealed: since experts differ from each other in their results, it should first be determined how much they agree with each other. In other words, it seems to me that the number of experts involved in creating the training set was too small. I would also be much more careful in assessing the practical applicability of the proposed method. Methods unrelated to convolutional networks are much more practical: they do not require such large training datasets, nor do they pose the technical problems of training.

Author Response

We thank you for your comment. As you suggest, it would be ideal to have many more MRs segmented by several experts each, so that an average segmentation could be used as ground truth and a much more in-depth inter-expert performance assessment could be made. However, obtaining such segmentations is, as argued in this paper, hard and laborious, and for this reason the vast majority of prostate segmentation papers do not perform any kind of inter-expert comparison. In fact, we could only find one paper [3], in addition to ours, where such a comparison is performed. Despite this limitation, we are able to show (albeit on a limited set of patients) that the predicted segmentations improve significantly upon inter-expert agreement.

Regarding the practical application, we believe that, given the previous argument, a model like this one could already be employed in clinical practice. Even if the radiologist or urologist still had to check the segmentation and correct it in case of error, it would remain a fantastic tool for speeding up the procedure. Finally, we agree that both traditional segmentation models (e.g., statistical shape models) and deep learning-based ones (e.g., CNNs) have their merits, and both may be useful depending on the application. For the particular application of prostate segmentation, however, CNNs have been ubiquitous over the last five years (e.g., the Promise12 leaderboard is full of CNN entries: https://promise12.grand-challenge.org/evaluation/challenge/leaderboard/), not only because they are the new fad, but also because they have achieved better performance than their more traditional counterparts.

 

[3] Shahedi, M.; Halicek, M.; Li, Q.; Liu, L.; Zhang, Z.; Verma, S.; Schuster, D.M.; Fei, B. A semiautomatic approach for prostate segmentation in MR images using local texture classification and statistical shape modeling. In Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling; Fei, B., Linte, C.A., Eds.; SPIE, 2019; Vol. 10951, p. 91. doi:10.1117/12.2512282.

Reviewer 3 Report

This manuscript proposes a model for accurate Magnetic Resonance (MR) and 3D Ultrasound (US) prostate image segmentation. Techniques including deep supervision, 3D data augmentation, a cyclic learning rate, checkpoint ensembling and a hybrid DenseNet-ResNet architecture are applied. Neural Resolution Enhancement triples the resolution along the z-axis and improves the resolution of the output mask. The general model was trained on five MR datasets and can be applied to other tasks, such as US images, by transfer learning. Experimental results show its effectiveness and robustness across all the test datasets.

 

Minor comments:

  1. Some typos / sentences that are difficult to understand, e.g.:

line 188-189: Simply put, Deep Supervision consists in forcing the CNN to make predictions at several points along the CNN.

line 417: the only experiments were significance was found

  2. Please add some explanation of why, in Equation (1), you constrain 98% of the voxels to fall within the range [0, 1], rather than any other percentage.
  3. It’s better to add some description of the subfigures in Figure 7.

Author Response

Some typos / sentences that are difficult to understand, e.g.:

line 188-189: Simply put, Deep Supervision consists in forcing the CNN to make predictions at several points along the CNN.

We agree, and the sentence has been reworded to: “Unlike regular CNNs, which predict the segmentation mask from the last layer only, deeply supervised CNNs attempt to predict it from several intermediate layers as well.”
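
To make the reworded idea concrete, here is a minimal PyTorch sketch of deep supervision on a toy 3D network. It is purely illustrative (all layer names and loss weights below are ours), not the hybrid DenseNet-ResNet architecture of the paper:

```python
# Toy example of deep supervision: auxiliary 1x1x1 heads predict the mask
# from intermediate layers, and their down-weighted losses are added to the
# loss of the final prediction. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDeeplySupervisedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.level1 = nn.Conv3d(1, 8, 3, padding=1)  # early features
        self.level2 = nn.Conv3d(8, 8, 3, padding=1)  # later features
        self.final = nn.Conv3d(8, 1, 1)              # main segmentation head
        self.aux1 = nn.Conv3d(8, 1, 1)               # auxiliary head, level 1
        self.aux2 = nn.Conv3d(8, 1, 1)               # auxiliary head, level 2

    def forward(self, x):
        f1 = F.relu(self.level1(x))
        f2 = F.relu(self.level2(f1))
        # Predict from intermediate layers as well as from the last one.
        return self.final(f2), self.aux2(f2), self.aux1(f1)

model = ToyDeeplySupervisedNet()
image = torch.randn(1, 1, 16, 64, 64)  # (batch, channels, z, y, x)
mask = torch.randint(0, 2, (1, 1, 16, 64, 64)).float()

main_out, aux2_out, aux1_out = model(image)
bce = nn.BCEWithLogitsLoss()
# Auxiliary losses are down-weighted relative to the main one.
loss = (bce(main_out, mask)
        + 0.5 * bce(aux2_out, mask)
        + 0.25 * bce(aux1_out, mask))
loss.backward()
```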

 

line 417: the only experiments were significance was found

As you mention, the whole paragraph seems confusing, so we have reworded it, and we have changed that particular sentence to: “In fact, out of all the experiments conducted in this Section, only these two were found to make a statistically significant difference, […]”.

 

Please add some explanation of why, in Equation (1), you constrain 98% of the voxels to fall within the range [0, 1], rather than any other percentage.

It has no deep scientific justification; it is simply an intensity normalization method that seemed to work well in validation when compared to others, such as standardization (i.e., subtracting the mean and dividing by the standard deviation). The intuition behind the particular formula comes from the idea that, when working with medical images, it is common to saturate the intensities between two given percentiles in order to increase the contrast. This formula does the same, but without saturating the intensities, so that no information is lost. The percentage was found empirically by looking at the histograms of several images. We consider this to be a semi-arbitrary, yet reasonable, engineering choice for the model, and as such no further explanation was provided. The same would be true had we chosen to standardize the image instead.
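
As an illustration of this kind of normalization (a sketch under the assumption that the 1st and 99th percentiles are used; the exact formula is Equation (1) in the paper), linearly mapping those two percentiles to 0 and 1 places roughly 98% of the voxels in [0, 1] while leaving the remaining intensities unclipped:

```python
# Percentile-based intensity normalization sketch: rescale so that ~98% of
# voxels fall in [0, 1] without saturating the ones outside that range.
import numpy as np

def percentile_normalize(volume, low=1.0, high=99.0):
    p_low, p_high = np.percentile(volume, [low, high])
    return (volume - p_low) / (p_high - p_low)

volume = np.random.normal(200.0, 50.0, size=(32, 256, 256))  # dummy MR volume
normalized = percentile_normalize(volume)
inside = np.mean((normalized >= 0) & (normalized <= 1))
print(f"fraction of voxels in [0, 1]: {inside:.3f}")  # ~0.98
```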

 

It’s better to add some description of the subfigures in Figure 7.

As per your suggestion, we have updated the caption:

"Worst HD95 (7.2mm) of all MR test datasets. Two slices from the apex (left and center) are missed by the algorithm, hence amounting to a minimum of 2 × 3mm of error, plus some extra mm from the segmentation errors commited in the third slice (right)."

Round 2

Reviewer 1 Report

I deem the paper to be acceptable for publication.

All issues raised during the revision were adequately addressed. 
