Article
Peer-Review Record

Machine Learning Algorithm Accuracy Using Single- versus Multi-Institutional Image Data in the Classification of Prostate MRI Lesions

by Destie Provenzano 1, Oleksiy Melnyk 2, Danish Imtiaz 2, Benjamin McSweeney 2, Daniel Nemirovsky 2, Michael Wynne 2, Michael Whalen 2, Yuan James Rao 2, Murray Loew 1 and Shawn Haji-Momenian 2,*
Reviewer 1:
Reviewer 2:
Reviewer 3:
Reviewer 4: Anonymous
Appl. Sci. 2023, 13(2), 1088; https://doi.org/10.3390/app13021088
Submission received: 21 December 2022 / Revised: 10 January 2023 / Accepted: 12 January 2023 / Published: 13 January 2023
(This article belongs to the Special Issue Applications of Radiomics and Deep Learning in Medical Image Analysis)

Round 1

Reviewer 1 Report

This paper trained classification models based on a) ProstateX-2 data, b) local institutional data, and c) combined ProstateX-2 and local data. These models were then tested with the same or different data sources. The results showed that accurate prostate cancer classification algorithms trained on single-institutional image data performed significantly better on the same testing data than on other or combined data sources. My impression is that the paper basically reports a negative experimental result; however, it is worth knowing among researchers. The results suggest over-fitting of the neural networks to the different datasets, and several factors might be important in analyzing why.

One is that the overall amount of data seems small, and this may need to be discussed.

Another is the statistical power; the paper computed p-values and AUCs. I suggest that a statistician help improve the writing and analysis of the results (such as what kind of p-values are computed, and what is the statistical power?). Special attention can be given to the analysis of the results and statistics, because the results tell us that it is not possible to train a model using one or two datasets and apply the model to some new data.

Minor: columns of the tables can be aligned better.

Is it machine-learning or neural network classifiers? Please make this clear, thanks.

Author Response

Dear Reviewer,

We would like to begin by thanking you for your prompt and careful review of our paper. We really appreciate you taking the time to help us make this paper as clear and polished as it can be, and we wanted to reiterate that before we begin our response to your comments. Thank you again.

  • This paper trained classification models based on a) ProstateX-2 data, b) local institutional data, and c) combined ProstateX-2 and local data. These models were then tested with the same or different data sources. The results showed that accurate prostate cancer classification algorithms trained on single-institutional image data performed significantly better on the same testing data than on other or combined data sources. My impression is that the paper basically reports a negative experimental result; however, it is worth knowing among researchers. The results suggest over-fitting of the neural networks to the different datasets, and several factors might be important in analyzing why.

Thank you so much for this summary of the results. This is indeed, unfortunately, a negative result, but an important one to present to the ML/AI community given the variety of studies presented for clinical medicine. Generalizability is a very important problem in ML research, and we are happy this point was conveyed in our paper.

  • One is that the overall amount of data seems small, and this may need to be discussed.

Thank you so much for this point about the small sample size. This is indeed a common problem across many clinical studies. We have included a brief point in the discussion to address this (see below, or Page 20, Lines 406–409):

Another limitation is that the sample sizes of both the original publicly available dataset and our local dataset are small. Ideally thousands of scans, if not every scan potentially available at the time, would be used to train the model. However, due to practical limitations the cohort was limited by the scan availability of each.

  • Another is the statistical power; the paper computed p-values and AUCs. I suggest that a statistician help improve the writing and analysis of the results (such as what kind of p-values are computed, and what is the statistical power?). Special attention can be given to the analysis of the results and statistics, because the results tell us that it is not possible to train a model using one or two datasets and apply the model to some new data.

Thank you so much for this point. The p-values were computed at p < 0.01 using a 2-tailed t-test, as detailed in the methods at lines 239–244. However, as this point was not clear, we have included the following addition to the table legend to ensure it is clearer in the results (Page 8, Lines 262–264):

T-tests compared the AUCs and accuracies of algorithms trained and tested using the same image data source (labeled *) to those using different training and testing sources (labeled †), with statistically significant differences underlined by 2-tailed t-test (p < 0.01)

Thank you for your suggestion of a power analysis; this is definitely a good point. The statistical power (1 − β) of this study with 63 participants is roughly 80%, as a sample size of 55 was required to ensure an alpha of 0.01 (p-value using a 2-tailed t-test) with our given distributions. Thank you so much for this last point. The authors would like to clarify that we hope the main takeaway is not that one or two datasets cannot be used to train a model that generalizes well, but rather that any one or two datasets used to train a model must be representative of the population in question if the model is to be generalizable. For example, a heterogeneous dataset could very well, on its own, be used to model the general population of prostate cancer (PCA) within all of the continental United States; however, a homogeneous one, as is unfortunately so commonly the case with open-source datasets, does not work as well. To clarify this we have included the following line in the discussion and hope this satisfies the reviewer (Page 10, Lines 361–363):

These results do not aim to limit the development of models on one or two datasets, but rather to encourage consideration of the heterogeneity and patient population when applying clinical models.
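For illustration, below is a minimal sketch of the comparison described above (a two-tailed t-test at p < 0.01 on per-fold AUCs from 5-fold cross-validation, plus a rough power calculation); the fold AUC values, effect size, and use of scipy/statsmodels are illustrative assumptions, not taken from the manuscript.

    # Illustrative sketch only: two-tailed t-test on per-fold AUCs from 5-fold CV,
    # as described in the response above. The AUC values below are made up.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.power import TTestIndPower

    auc_same_source = np.array([0.95, 0.93, 0.96, 0.94, 0.92])   # trained and tested on the same institution
    auc_cross_source = np.array([0.48, 0.41, 0.52, 0.44, 0.39])  # trained on one institution, tested on the other

    t_stat, p_value = stats.ttest_ind(auc_same_source, auc_cross_source)  # two-tailed by default
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at p < 0.01: {p_value < 0.01}")

    # Rough power check in the spirit of the response: sample size needed per group
    # for a chosen effect size at alpha = 0.01 and 80% power (effect size is illustrative).
    n_required = TTestIndPower().solve_power(effect_size=0.8, alpha=0.01, power=0.80)
    print(f"Required n per group: {n_required:.1f}")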

  • Minor: columns of the tables can be aligned better.

Thank you so much for this point; the column headers have been realigned with the table.

  • Is it machine-learning or neural network classifiers? Please make this clear, thanks.

Thank you so much for your consideration. We have updated the introduction to clarify this. The model type used was a ResNet algorithm, which is a type of convolutional neural network introduced by He et al. in 2015. See: He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.

Although there are many definitions of machine learning currently available, for this study we used the original 1959 IBM definition of Machine Learning that considers deep learning and convolutional neural networks to be a subset of the broader field of ML. See: Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development. 3 (3): 210–229. CiteSeerX 10.1.1.368.2254. doi:10.1147/rd.33.0210

This definition has been more recently reiterated by the common sources used to teach ML by Sarle, Kohavi, Alpaydin, James, and Bishop. See:

Sarle, Warren S. (1994). "Neural Networks and statistical models". SUGI 19: proceedings of the Nineteenth Annual SAS Users Group International Conference. SAS Institute. pp. 1538–50. ISBN 9781555446116. OCLC 35546178

Alpaydin, Ethem (2010). Introduction to Machine Learning. London: The MIT Press. ISBN 978-0-262-01243-0. Retrieved 4 February 2017.

Kohavi, R. and Provost, F. (1998) Glossary of terms. Machine Learning—Special Issue on Applications of Machine Learning and the Knowledge Discovery Process. Machine Learning, 30, 271-274.
https://doi.org/10.1023/A:1017181826899

Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. vii.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer, ISBN 978-0-387-31073-2

We’ve included the following line in the introduction to address this (Page 2, Lines 64 – 66):

The highly accurate model used was a type of machine learning algorithm called a convolutional neural network: the Residual Neural Network (ResNet).
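For readers less familiar with this model family, below is a minimal sketch of how a ResNet-50 could be configured as a binary hrPCA classifier in PyTorch/torchvision; the dummy data and hyperparameters are illustrative assumptions, and this is not the authors’ actual training code.

    # Illustrative sketch only: a ResNet-50 configured for binary lesion classification
    # (hrPCA vs. low-risk/benign). Not the authors' actual configuration.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet50()                      # randomly initialized ResNet-50 backbone
    model.fc = nn.Linear(model.fc.in_features, 2)  # replace the 1000-class head with 2 classes

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One training step on a dummy batch: single-channel 512 x 512 slices replicated
    # to 3 channels to match the ResNet input convention.
    images = torch.randn(4, 1, 512, 512).repeat(1, 3, 1, 1)
    labels = torch.tensor([0, 1, 0, 1])

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()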

Reviewer 2 Report

 

This is an important study highlighting the accuracy of machine learning algorithms used to analyze prostate cancer lesions from different institutions.

Strengths of the study include a large, robust data set with an independent gold standard, and a solid statistical analysis. Scientific studies of artificial intelligence, and specifically deep learning, applied to imaging-based diagnosis are currently important and relevant.

There are, however, some observations to be done before the publication of this manuscript.

 

Abstract:

Spell out all acronyms the first time that they are used (e.g.: mpMRI; AUC)

Keywords: Please check your keywords to ensure that they are MeSH terms.

 

 

Materials and Methods

 

Please show the full name before every abbreviation for the first time. (e.g.: MR, DICOM).

 

There must be much more detail about how the images were segmented and calculated. Please, include figures that show the process for those of us who have not done this by ourselves.

 

Give more information about tumor volume and its differential impact on the classification stability.

Author Response

Dear Reviewer:

We would like to begin by thanking you for your careful review and your prompt time and attention regarding our submission. We really do appreciate the time and effort required of reviewers to read a manuscript and ensure it is in the best state it can be, and we wanted to reiterate our thanks before we begin our response to your comments. Thank you again.

  • This is an important study highlighting the accuracy of machine learning algorithms used to analyze prostate cancer lesions from different institutions. Strengths of the study include a large, robust data set with an independent gold standard, and a solid statistical analysis. Scientific studies of artificial intelligence, and specifically deep learning, applied to imaging-based diagnosis are currently important and relevant. There are, however, some observations to be done before the publication of this manuscript.

 Thank you so much for your feedback. We feel this is an important study as well.

  • Abstract: Spell out all acronyms the first time that they are used (e.g.: mpMRI; AUC)

Thank you so much for your feedback. This is definitely important to include and we’ve updated the abstract to spell out the terms. Please see below or review page 1 lines 16 – 31 for this update:

Abstract: Background: Recent studies report high accuracies when using machine learning (ML) algorithms to classify prostate cancer lesions on publicly available datasets. However, it is unknown if these algorithms generalize well to data from different institutions. Methods: This was a retrospective study using multi-parametric Magnetic Resonance Imaging (mpMRI) data from our institution (63 mpMRI lesions) and the ProstateX-2 challenge, a publicly-available annotated image set (112 mpMRI lesions). Residual Neural Network (ResNet) algorithms were trained to classify lesions as high risk (hrPCA) or low-risk/benign. Algorithms were trained on a) ProstateX-2 data, b) local institutional data, and c) combined ProstateX-2 and local data. The algorithms were then tested on a) ProstateX-2, b) local and c) combined ProstateX-2 and local data. Results: Algorithms trained on either local or ProstateX-2 image data had high Area Under the ROC Curve (AUC)s (0.82-0.98) in the classification of hrPCA when tested on their own respective populations. AUCs decreased significantly (0.23-0.50, p < 0.01) when algorithms were tested on image data from the other institution. Algorithms trained on image data from both institutions re-achieved high AUCs (0.83-0.99). Conclusion: Accurate prostate cancer classification algorithms trained on single-institutional image data performed poorly when tested on outside-institutional image data. Heterogeneous multi-institutional training image data will likely be required to achieve broadly-applicable mpMRI algorithms.

 

  • Keywords: Please check your keywords to ensure that they are MeSH terms.

Thank you for this important consideration; the keywords have been reviewed to ensure they are MeSH terms. To use appropriate terms, the keywords were reduced to four and verified.

Keywords: Machine Learning; Prostate Cancer; Magnetic Resonance Imaging; Artificial Intelligence

  • Materials and Methods Please show the full name before every abbreviation for the first time. (e.g.: MR, DICOM).

Thank you so much for this important consideration; we’ve updated the methods to reflect the full name of each term before its use as an abbreviation. Please see lines 98, 102, 105, 107, 169 or the text below:

…Gleason scores (GS).

 … a) small field of view (FOV) axial T2 (Transverse Relaxation Time) turbo-spin echo sequence

… the Apparent Diffusion Coefficient (ADC) map was

… three-dimensional centroid Digital Imaging and Communications in Medicine (DICOM)  

… the Picture Archiving and Communication (PACS) system

  • There must be much more detail about how the images were segmented and calculated. Please, include figures that show the process for those of us who have not done this by ourselves.

Thank you so much for this important consideration. We have now included a figure as Supplement 1 detailing this process and giving a quick overview of how images are segmented and calculated. We reference this on page 5, line 171 of the manuscript:

Overview of this process is detailed in Supplement 1.

The perimeter of the index cancer of the 41 enrolled lesions was traced in the Picture Archiving and Communication (PACS) system on the axial T2 sequence by a single body-fellowship trained radiologist (SHM) with 7 years of prostate MRI experience, and also exported.
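For readers who have not performed this themselves, below is a hedged sketch of one way a 2D patch around an annotated lesion centroid could be extracted from a DICOM slice; the file name, centroid coordinates, and patch size are hypothetical examples, and Supplement 1 remains the overview of the actual pipeline.

    # Illustrative sketch only: crop a 2D patch around a lesion centroid from a DICOM slice.
    # The path, (row, column) centroid, and patch size below are hypothetical.
    import numpy as np
    import pydicom

    ds = pydicom.dcmread("example_t2_axial_slice.dcm")   # hypothetical file name
    pixels = ds.pixel_array.astype(np.float32)           # e.g. a 512 x 512 matrix

    centroid_row, centroid_col = 260, 310                # hypothetical lesion centroid
    half = 32                                            # 64 x 64 patch around the centroid

    patch = pixels[centroid_row - half:centroid_row + half,
                   centroid_col - half:centroid_col + half]

    # Simple min-max scaling so the patch can be fed to a CNN expecting inputs in [0, 1].
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)
    print(patch.shape)  # (64, 64)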

 

  • Give more information about tumor volume and its differential impact on the classification stability.

Thank you so much for this important point. Sub-analysis of model performance is such an important part of validating these results, and we appreciate the call to look at tumor volume. Tumor volume is certainly an interesting factor to include. Visual inspection suggests that tumor volumes in the Local dataset were larger on average than in the PX2 dataset; however, we only have tumor volumes for the Local dataset. As the number of samples in the testing set for each cross-validated fold was already small, the authors found that, by splitting the tumor volume into ranges, we did not have enough data to make any statistically significant conclusions. (We also looked at race and had a similar problem.) We did, however, include a sub-analysis by prostate zone in the now-labeled Table 4 to attempt to account for differences within the prostate, as prior publications have shown this to have an effect. We included the range, mean, and standard deviation of tumor volume in the methods on page 4, lines 138–142, to address the need to include some information on volume in the manuscript. See below:

Based on these stringent criteria, the index cancers of 41 of the 153 prostatectomy patients were confidently correlated with surgical pathology results and entered into this study. Tumor volume for the local dataset ranged from 0.00 to 82.90 cc (mean 10.63 ± 15.31). The patient demographics and tumor GG and zonal distribution are summarized in Table 1.
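As a hedged illustration of how a lesion volume in cc could be derived from a traced segmentation (the manuscript does not spell out the exact computation here), assuming a binary mask and DICOM voxel spacing:

    # Illustrative sketch only: lesion volume in cc from a binary segmentation mask and voxel spacing.
    # The mask and spacing values are hypothetical and not necessarily the authors' method.
    import numpy as np

    mask = np.zeros((20, 512, 512), dtype=bool)   # hypothetical 3D segmentation (slices, rows, cols)
    mask[8:12, 250:270, 300:325] = True           # toy lesion region

    row_spacing_mm, col_spacing_mm = 0.5, 0.5     # from DICOM PixelSpacing (hypothetical values)
    slice_thickness_mm = 3.0                      # from DICOM SliceThickness (hypothetical value)

    voxel_volume_mm3 = row_spacing_mm * col_spacing_mm * slice_thickness_mm
    volume_cc = mask.sum() * voxel_volume_mm3 / 1000.0   # 1 cc = 1000 mm^3
    print(f"Lesion volume: {volume_cc:.2f} cc")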

Reviewer 3 Report

The authors used different combinations of public and local datasets as training and test sets, and compared the performance of different models obtained by ResNet50 on classification problems with different data. The authors' discussion of these results concludes that accurate prostate cancer classification algorithms trained on mono-centric image data perform poorly when tested on external datasets. Algorithms that are widely accepted require heterogeneous multi-center data.

Comments:

 

  1. It is suggested that the authors change the “algorithm” to “model”. Because the authors do not propose or improve any algorithm, they use different data to train the existing ResNet algorithm and get different models.
  2. Formatting/writing issues
    1. What is GS? p4-110
    2. What is csPCA in Table 1? What’s the difference with hrPCA?
    3. What is T2 and ADC? Give the full name when the abbreviation is first used.
    4. Rename and keep consistent the names of your models in Figures and Tables. Also, these names are so confusing. Try to use a dash or superscript/subscript letters. p6-200, Figure 1, Table 1 & 2
    5. etc.
  3. Why not use multiplanar and dynamic contrast-enhanced sequences? as far as I know, multiplanar would improve the image quality. p3-93
  4. In the local dataset, the authors excluded the multi-focal or bilateral areas. how about the public database? p4-117
  5. What is the standard deviation of the mean AUC and accuracy in your 5x CV? Please show in your Table 2.
  6. How did you split the data for different models? For example, algorithm-PX2, tests in local data. training: 80%PX2, test: 20% PX2 +100% local? Or training: 80%PX2, test: 20% PX2 +20% local? And how many patients (px and local) are in each fold? please give more details. as well as Model-PX2-PXL
  7. Is there any explanation for the results in Tables 1 and 2 that for both training and testing of the same dataset (PX2 and PX2, Local and Local, PXL and PXL), all results did not reach the significant level?
  8. The size of the database is also a variable. The two datasets together have 175 samples. Has the result of the model-PXL been tested with a random selection of 87 data (50/50) from both datasets?
  9. The authors cite other articles' ideas in the discussion section and conclude with their own arguments. In each argument, please indicate which specific data performance from the experimental results of this paper supports that argument.
  10. What is the generalizability of the ResNet50? Are there any requirements for the input data size? Are algorithms with fewer parameters being considered? ResNet32 18? Or other popular image classification models, such as VGG16, Inceptionv3

Author Response

Dear Reviewer,

We would like to begin by thanking you for your prompt review, time, and attention regarding our manuscript. We cannot thank you enough for taking the time to provide a careful review and ensure this paper can be the best it can be. Reviewing an article is certainly a time-consuming process, and we would like to thank you again before we begin our response to your comments.

  • The authors used different combinations of public and local datasets as training and test sets, and compared the performance of different models obtained by ResNet50 on classification problems with different data. The authors' discussion of these results concludes that accurate prostate cancer classification algorithms trained on mono-centric image data perform poorly when tested on external datasets. Algorithms that are widely accepted require heterogeneous multi-center data.

Thank you for this summary! This was what we hoped one would take from the paper.  

  1. It is suggested that the authors change the “algorithm” to “model”. Because the authors do not propose or improve any algorithm, they use different data to train the existing ResNet algorithm and get different models.

Thank you so much for this important consideration. We have updated the term algorithm to model throughout the paper where appropriate. In particular, whenever referring to the algorithm itself (for example, a ResNet algorithm) we have left the term algorithm; when referring to a “trained algorithm” (i.e., a model) we have updated the term to model. There were 119 references to algorithm in the paper, which have been updated accordingly in red within the manuscript.

  2. Formatting/writing issues
    1. What is GS? p4-110
    2. What is csPCA in Table 1? What’s the difference with hrPCA?
    3. What is T2 and ADC? Give the full name when the abbreviation is first used.
    4. Rename and keep consistent the names of your models in Figures and Tables. Also, these names are so confusing. Try to use a dash or superscript/subscript letters. p6-200, Figure 1, Table 1 & 2
    5. etc.

2.1 and 2.3 Thank you so much for this important consideration; we’ve updated the methods to reflect the full name of each term before its use as an abbreviation. Please see lines 98, 102, 105, 107, 169 or the text below:

…Gleason scores (GS).

 … a) small field of view (FOV) axial T2 (Transverse Relaxation Time) turbo-spin echo sequence

… the Apparent Diffusion Coefficient (ADC) map was

… three-dimensional centroid Digital Imaging and Communications in Medicine (DICOM)  

… the Picture Archiving and Communication (PACS) system

2.2 Thank you so much for drawing attention to this oversight. Clinically significant prostate cancer (csPCA) is another term for hrPCA; however, for the purposes of this manuscript it has been updated to hrPCA only. See Table 1: csPCA -> hrPCA

2.4 Thank you so much for this important consideration. The references to each algorithm have been updated to model, as per your earlier comment, and the naming has been made consistent in the Figures and Tables throughout.

  3. Why not use multiplanar and dynamic contrast-enhanced sequences? as far as I know, multiplanar would improve the image quality. p3-93

Thank you so much for bringing attention to this important consideration. It is well recognized in the radiology literature that dynamic contrast-enhanced sequences are not highly useful in the classification of prostate cancer. As a result, the radiology PI-RADS scoring system has markedly downgraded the use of dynamic contrast-enhanced sequences given their lower utility. This study was not intended to perform three-dimensional image analysis, and multiplanar images were not used. ADC map images are only obtained in the axial plane, so multiplanar classification is not even possible using the ADC map. Studies have also shown that the axial image with the largest surface area of the lesion is sufficiently "representative" of the entire 3D volume of the lesion and can serve as its surrogate in image analysis. As such, to be consistent with how a radiologist would approach this problem and to ensure these models are clinically applicable, the sequences were chosen accordingly.

  4. In the local dataset, the authors excluded the multi-focal or bilateral areas. how about the public database? p4-117

Thank you so much for addressing the use of multi-focal lesions. While the ProstateX-2 dataset consisted of scans with more than one lesion within the gland, unlike our local dataset, lesion analysis was performed on a per-lesion basis and not on a per-gland basis. This was done by using the provided (i, j, k) coordinates of the centroid of each lesion, which are publicly available on the ProstateX/X-2 challenge website. The local prostate dataset consisted of only single dominant cancers in the gland to ensure 100% accurate radiology-surgical pathology correlation.

  5. What is the standard deviation of the mean AUC and accuracy in your 5x CV? Please show in your Table 2.

Thank you so much for your consideration regarding the standard deviation of the results tables. After much formatting we decided to include the standard deviations in supplemental tables to ensure Tables 3 and 4 remain readable. We have updated the text and supplement to include this information. Please see the new Supplementary Tables 3 and 4 for this update in the results, or page 7, lines 256–257 and page 8, lines 292–293, with this new information (text below).

Standard Deviations for each cross validated model run are available in Supplementary Table 3.

Standard Deviations for each cross validated model run are available in Supplementary Table 4.

 

  6. How did you split the data for different models? For example, algorithm-PX2, tests in local data. training: 80%PX2, test: 20% PX2 +100% local? Or training: 80%PX2, test: 20% PX2 +20% local? And how many patients (px and local) are in each fold? please give more details. as well as Model-PX2-PXL

This is an excellent point that should be included, and the authors thank the reviewer for bringing it to their attention. An additional table, the new Table 2, has now been included that details the full breakdown of each combination and the total patients in each. All models used an 80/20 training/testing split to allow cross-validation. The combined PX2-PXL fold also used an 80/20 training/testing split, but ensured that an equal proportion of both PX2 and Local patients was included in the training and testing samples. This means that the final testing sample contained both 20% of the PX2 data and 20% of the Local data. To address this, please see the new Table 2 (page 6, lines 229–231 and the table after Figure 2, or the text below).

The total patients included in each model are listed in Table 2.

Table 2: Total patients included in each of the initial datasets trained and tested. The total included images for each imaging type (T2, ADC, or T2 + ADC) were the same for each training/testing combination for the model.
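To make the described split concrete, below is a minimal sketch in which each source is split 80/20 separately and then combined, so that the test set holds roughly 20% of ProstateX-2 and 20% of Local cases; the placeholder arrays and the use of scikit-learn are illustrative assumptions, not the authors’ code.

    # Illustrative sketch only: the combined PX2 + Local split described above.
    # Each source is split 80/20 separately, so the test set holds ~20% of each source.
    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder arrays standing in for the real image data and labels.
    x_px2, y_px2 = np.random.rand(112, 64, 64), np.random.randint(0, 2, 112)
    x_loc, y_loc = np.random.rand(63, 64, 64), np.random.randint(0, 2, 63)

    px2_tr, px2_te, y_px2_tr, y_px2_te = train_test_split(
        x_px2, y_px2, test_size=0.2, stratify=y_px2, random_state=0)
    loc_tr, loc_te, y_loc_tr, y_loc_te = train_test_split(
        x_loc, y_loc, test_size=0.2, stratify=y_loc, random_state=0)

    x_train = np.concatenate([px2_tr, loc_tr])
    y_train = np.concatenate([y_px2_tr, y_loc_tr])
    x_test = np.concatenate([px2_te, loc_te])
    y_test = np.concatenate([y_px2_te, y_loc_te])
    print(len(x_train), len(x_test))   # roughly 80% / 20% of the 175 lesions

With 112 ProstateX-2 and 63 Local lesions, such a split yields roughly 89 + 50 training and 23 + 13 testing lesions, consistent with the counts discussed later in this record.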

 

  7. Is there any explanation for the results in Tables 1 and 2 that for both training and testing of the same dataset (PX2 and PX2, Local and Local, PXL and PXL), all results did not reach the significant level?

Thank you so much for pointing out this potential point of confusion. All of the data individually on the training/testing runs reached a significance level through cross-validation. What the underlined records show, however, is the difference between the original model and the subsequent models (for example PX2/PX2 vs PX2/Local). As PX2/PX2 would not be statistically significantly different from itself, it is indicated by an asterisk and not underlined. To clarify this we’ve updated the table description to include mention of the 2-tailed t-test (see page 7, lines 262–264 or the text below):

to those using different training and testing sources (labeled †), with statistically significant differences underlined by 2-tailed t-test (p<0.01).  

  8. The size of the database is also a variable. The two datasets together have 175 samples. Has the result of the model-PXL been tested with a random selection of 87 data (50/50) from both datasets?

Thank you so much for this consideration. Unfortunately, as our local database consisted of 63 samples, we were not able to do an 87/87 split of each. However, the PXL models did use data from both to attempt to address this. It is an excellent point, and we hope the take-home of our paper is that models trained on heterogeneous or multi-institutional datasets perform better than those trained on a single-institution or homogeneous dataset. In future studies involving this data we hope to incorporate even more data to achieve the 87/87 split as you have suggested; it is an excellent next step.

  9. The authors cite other articles' ideas in the discussion section and conclude with their own arguments. In each argument, please indicate which specific data performance from the experimental results of this paper supports that argument.

Thank you for this important consideration. The following lines have been updated with relevant supporting data (please see the discussion or review the text below):

A similar pattern was also demonstrated when model performance was assessed using lesions only from the peripheral or transitional zones (PZ: 0.91 – 0.44, TZ: 0.93 – 0.61).

ModelPXLT2s had higher AUCs when tested on single-institution image data compared to multi-institutional data (0.92 – 0.83).

ModelPXLADCs had higher AUCs when tested on multi-institutional image data as compared to single-institutional data (0.98 – 0.85).

The single-institutionally-trained models’ AUCs were slightly higher when using the T2 sequence as compared to the ADC map, which could be a function of its higher resolution (0.93 – 0.96 vs 0.91 – 0.82).

Such truly benign lesions were unlikely to be within the ProstateX-2 image data set and modelPX2 was not trained on them, which also likely impacted modelPX2’s performance on local institutional image data (T2 AUC 0.44).

 

  10. What is the generalizability of the ResNet50? Are there any requirements for the input data size? Are algorithms with fewer parameters being considered? ResNet32 18? Or other popular image classification models, such as VGG16, Inceptionv3

Thank you for bringing this to the authors’ attention. The ResNet50 framework itself can be applied to any problem. The input data size used was 512 by 512, as this was the default matrix size for a DICOM image (page 6, lines 174–182 of the methods). Algorithms with fewer parameters were considered: multiple algorithms from the tensorflow and pytorch packages were tested before settling on ResNet, which beat the others’ performance considerably. This study first attempted to emulate the four winners of the original ProstateX challenge and did review InceptionV3 and VGG16; however, these were very slow and did not perform as well. It may seem counterintuitive, as VGG16 has only 16 layers whereas the ResNet is much deeper, but ResNet’s skip connections allowed for a much faster training time and surprisingly better performance (VGG16 does have roughly 138 million parameters, which could be part of the problem). InceptionV3 has far fewer parameters than VGG16 but was also unable to match the performance of the ResNet. To address this, the authors have updated the methods to note that we did try other models; see page 5, lines 186–187 or the text below:

The ResNet model was selected after training and testing several other common ML frameworks due to its increased performance and speed.
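As a hedged illustration of the parameter-count comparison mentioned in this response (the torchvision reference implementations are assumed; exact counts depend on the library version):

    # Illustrative sketch only: compare parameter counts of the candidate architectures discussed above.
    from torchvision import models

    def count_params(model):
        # Total number of trainable parameters.
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    for name, ctor in [("ResNet-50", models.resnet50),
                       ("VGG16", models.vgg16),
                       ("Inception v3", models.inception_v3)]:
        net = ctor()  # randomly initialized reference implementation
        print(f"{name}: {count_params(net) / 1e6:.1f}M parameters")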

Reviewer 4 Report

The authors use Resnet as a model for classification of prostate MRI lesions, based on heterogeneous multi-institutional training image data. My remarks to improve the presentation of this paper:

-abstract to reformulate
- Very brief introduction that will be rewritten.
-This paper needs a related work section and a comparison between the works proposed for the classification of prostate cancer lesion.
- The summary is disorganized; I believe you should have two sections: one for the database and one that describes your approach with well-defined steps.
- The title of the machine learning article, as well as the introduction level, talk about ML, but I later discover that you work with the RESNET model, despite the fact that it is a deep learning model; I find this very surprising.
- table titles in paragraph form; try to include a paragraph and a reference for each table, as well as a meaningful title for each.
- very short conclusion
- I can't find your real contribution for this task, so a state of the art and a section describing your contribution are very necessary.

Author Response

We would like to begin by thanking you for your prompt and careful review of our paper. We really appreciate you taking the time to help us make this paper as clear and polished as it can be, and we wanted to reiterate that before we begin our response to your comments. Thank you again.

 

The authors use Resnet as a model for classification of prostate MRI lesions, based on heterogeneous multi-institutional training image data. My remarks to improve the presentation of this paper:

Thank you so much for this summary; this was indeed what we hoped would be the take-home of this study.

  • -abstract to reformulate

Thank you so much for this point. We’ve included more descriptive terms in the abstract and expanded all abbreviated terms (see below or page 1, lines 16–31):

Abstract: Background: Recent studies report high accuracies when using machine learning (ML) algorithms to classify prostate cancer lesions on publicly available datasets. However, it is unknown if these models once trained generalize well to data from different institutions. Methods: This was a retrospective study using multi-parametric Magnetic Resonance Imaging (mpMRI) data from our institution (63 mpMRI lesions) and the ProstateX-2 challenge, a publicly-available annotated image set (112 mpMRI lesions). Residual Neural Network (ResNet) algorithms were trained to classify lesions as high risk (hrPCA) or low-risk/benign. Models were trained on a) ProstateX-2 data, b) local institutional data, and c) combined ProstateX-2 and local data. The models were then tested on a) ProstateX-2, b) local and c) combined ProstateX-2 and local data. Results: Models trained on either local or ProstateX-2 image data had high Area Under the ROC Curve (AUC)s (0.82-0.98) in the classification of hrPCA when tested on their own respective populations. AUCs decreased significantly (0.23-0.50, p < 0.01) when models were tested on image data from the other institution. Models trained on image data from both institutions re-achieved high AUCs (0.83-0.99). Conclusion: Accurate prostate cancer classification models trained on single-institutional image data performed poorly when tested on outside-institutional image data. Heterogeneous multi-institutional training image data will likely be required to achieve broadly-applicable mpMRI models.

  • - Very brief introduction that will be rewritten.


Thank you for bringing the brevity of the introduction and the need for revision to the authors’ attention. The introduction is modeled after the MDPI formatting guidelines for this journal. We attempted to format and draft our introduction to match other papers on this topic, such as the following:

Gaudiano C, Mottola M, Bianchi L, Corcioni B, Cattabriga A, Cocozza MA, Palmeri A, Coppola F, Giunchi F, Schiavina R, Fiorentino M, Brunocilla E, Golfieri R, Bevilacqua A. Beyond Multiparametric MRI and towards Radiomics to Detect Prostate Cancer: A Machine Learning Model to Predict Clinically Significant Lesions. Cancers. 2022; 14(24):6156. https://doi.org/10.3390/cancers14246156

Liu J, Niraj M, Wang H, Zhang W, Wang R, Kadier A, Li W, Yao X. Down-Regulation of lncRNA MBNL1-AS1 Promotes Tumor Stem Cell-like Characteristics and Prostate Cancer Progression through miR-221-3p/CDKN1B/C-myc Axis. Cancers. 2022; 14(23):5783. https://doi.org/10.3390/cancers14235783

However, to address this concern, we have rewritten the introduction to add a section, per your suggestion below, comparing this work to related work and the state of the art, and we have included more information regarding the ResNet type of model. We have also updated the introduction to include a description of each term before its use as an abbreviation in the paper.

Please see lines 36 – 57 and 57 – 72 on page 1 or the text below for the additional updates.

Over the last decade, there have been significant advancements in multiparametric prostate MRI (mpMRI) [1, 2] and machine learning (ML) applications in mpMRI [3-5]. While mpMRI has high sensitivity and specificity for the detection of prostate cancer, accurate discrimination between high-risk prostate cancer (hrPCA, defined as Gleason grade ≥ 4+3 in this study and others[6, 7]) and low-grade/benign prostate lesions remains challenging and is paramount for clinical management [8, 9]. Methods to distinguish hrPCA from low-grade/benign PCA are important as low-grade/ benign prostate lesions can be managed with active surveillance instead of invasive treatment. Many recent publications report highly accurate machine learning algorithms for the classification of prostate lesions on MRI, with area under the receiver operating characteristic curve (AUCs) of 80 – 98%, that appear to successfully address this diagnostic obstacle [10].

 

 

While ProstateX and other mpMRI ML results appear promising, caution is warranted as the majority of these studies are single-institutional studies, often using a single MRI scanner manufacturer [15-17]. A systematic review of ML algorithms in mpMRI noted the paucity of multi-institutional studies [18]. The need for multi-institutionally trained models using heterogeneous image data is being recognized [19, 20], spurring the development of the field of federated learning [21]. The purpose of this study was to determine the efficacy of highly accurate ML classification models trained on prostate image data from one institution and tested on image data from another institution. The highly accurate model used was a type of machine learning algorithm called a convolutional neural network: the Residual Neural Network (ResNet). The broader impact of single- and multi-institutional training data on model performance was also assessed.

 

The results of this study serve to identify the need for heterogeneous or multi-institutional training datasets for broadly applicable clinical models. Additionally, this study draws attention to an important concern, which is potential lack of generalizability of models trained on single-institutional or homogeneous datasets. 

 

  • -This paper needs a related work section and a comparison between the works proposed for the classification of prostate cancer lesion.

Thank you so much for this important concern. To distinguish our work from related works, we added the following text to the introduction (as referenced above in response to your other point), lines 57–72:

The results of this study serve to identify the need for heterogeneous or multi-institutional training datasets for broadly applicable clinical models. Additionally, this study draws attention to an important concern, which is potential lack of generalizability of models trained on single-institutional or homogeneous datasets. 

In addition lines 43 – 62 of the introduction discuss other related prostate cancer studies (See below for the text itself):

Many recent publications report highly accurate machine learning algorithms for the classification of prostate lesions on MRI, with area under the receiver operating characteristic curve (AUCs) of 80 – 98%, that appear to successfully address this diagnostic obstacle [10].

 

Recently, a consortium of the American Association of Physicists in Medicine (AAPM), the SPIE (the International Society for Optics and Photonics), and the National Cancer Institute (NCI) conducted ProstateX and ProstateX-2 Challenges [11]. They published publicly-available image datasets of annotated mpMRI lesions (as identified by radiologist) and their subsequent MRI-guided biopsy results [12], asking challenge participants to classify lesions as hrPCA or “benign” and to predict lesion Gleason grade. Top trained algorithms (models) developed using the challenge image data also reached accuracies of > 90% in the classification of hrPCA versus “benign” lesions [11, 13, 14].

 

While ProstateX and other mpMRI ML results appear promising, caution is warranted as the majority of these studies are single-institutional studies, often using a single MRI scanner manufacturer [15-17]. A systematic review of ML algorithms in mpMRI noted the paucity of multi-institutional studies [18]. The need for multi-institutionally trained models using heterogeneous image data is being recognized [19, 20], spurring the development of the field of federated learning [21].

And lines 327 – 346 of the discussion additionally reference similar studies and the importance of multi-institutional data (see below for text):

MR signal intensity is not correlated with an absolute standard reference, and is dependent on MR hardware, tissue characteristics, pulse sequence, method of k-space filling, etc.  Standardization of MRI signal intensity scale has been an ongoing challenge[36], and may be able to further minimize multi-scanner and multi-institutional image differences. Signal intensity normalization techniques have been shown to impact prostate cancer radiomics [37]. Sunoqrot et al. also reported improved prostate MRI lesion classification (as benign or malignant) in multi-institutional image data following an automated T2-weighted image normalization using both fat and muscle[38]. MR signal intensity normalization was not performed in this study given the already high performance of the models.

Many of the published machine learning studies with highly accurate models are based on single-institution/single-scanner or single-institution/multi-scanner studies.  One systematic review on the performance of machine learning applications in the classification of hrPCA found 66% (18/27 studies) were performed at a single institution on a single scanner [39]. In this same study, 4/27 studies were performed on more than one scanner from the same vendor, 2/27 were performed on scanners by two vendors, and only one study used multi-institutional image data. Another meta-analysis of 12 studies using machine learning for the identification of hrPCA similarly showed that all studies originated from a single institution or image data repository (the Cancer Imaging Archive) [40].

Lines 363 – 366 and 381-382 of the discussion reference similar studies done with federated learning, highlighting the importance of this study as it showcases the need for multi-institutional data and provides support for future studies using federated learning (see below for text).

All these factors have led to the development of the field of federated machine learning, which provides the computing architecture for the construction of multi-institutional models using de-centralized and de-identified patient data [46]. Federated learning in medical algorithms is in its early stages and requires additional inter-institutional computing infrastructure or use of commercial platforms [47-49].

Finally, lines 411–415 of the discussion reference similar studies that explain our rationale for defining hrPCA as we did, based on other similar studies and treatment guidelines for prostate cancer (see below).

While each Gleason score has defined features, some inter-observer variability and subjectivity is recognized in the pathology literature [53]. This study defined hrPCA as GS ≥ 4+3, similar to other major studies[6, 7], in order to identify patients who would definitively benefit from treatment; other studies have defined clinically-significant prostate cancer as GS ≥ 3+4[54].

 

  • - The summary is disorganized; I believe you should have two sections: one for the database and one that describes your approach with well-defined steps.

Thank you for bringing this to the authors’ attention; this is a very important point. To clarify the organization of the methods, we have updated it to use the following section headers (see pages 2–6 or the text below):

2.1 ProstateX-2 Patient Population:

2.2 ProstateX-2 MR Imaging & Image Data:

2.3 Local Institutional Patient Population:

2.4 Local Institutional MR Imaging & Image Data:

2.5 Image Preparation & Lesion Segmentation:

2.6 Convolutional Neural Network Training & Testing: 

  • - The title of the machine learning article, as well as the introduction level, talk about ML, but I later discover that you work with the RESNET model, despite the fact that it is a deep learning model; I find this very surprising.

Thank you so much for pointing out this distinction! We have updated the introduction to clarify this. The model type used was a ResNet algorithm, which is a type of convolutional neural network introduced by He et al. in 2015. See: He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.

Although there are many definitions of machine learning currently available, for this study we used the original 1959 IBM definition of Machine Learning that considers deep learning and convolutional neural networks to be a subset of the broader field of ML. See: Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development. 3 (3): 210–229. CiteSeerX 10.1.1.368.2254. doi:10.1147/rd.33.0210

This definition has been more recently reiterated by the common sources used to teach ML by Sarle, Kohavi, Alpaydin, James, and Bishop. See:

Sarle, Warren S. (1994). "Neural Networks and statistical models". SUGI 19: proceedings of the Nineteenth Annual SAS Users Group International Conference. SAS Institute. pp. 1538–50. ISBN 9781555446116. OCLC 35546178

Alpaydin, Ethem (2010). Introduction to Machine Learning. London: The MIT Press. ISBN 978-0-262-01243-0. Retrieved 4 February 2017.

Kohavi, R. and Provost, F. (1998) Glossary of terms. Machine Learning—Special Issue on Applications of Machine Learning and the Knowledge Discovery Process. Machine Learning, 30, 271-274.
https://doi.org/10.1023/A:1017181826899

Gareth James; Daniela Witten; Trevor Hastie; Robert Tibshirani (2013). An Introduction to Statistical Learning. Springer. p. vii.

Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer, ISBN 978-0-387-31073-2

We’ve included the following line in the introduction to address this (Page 2, Lines 64 – 66):

The highly accurate model used was a type of machine learning algorithm called a convolutional neural network: the Residual Neural Network (ResNet).

 

  • - table titles in paragraph form; try to include a paragraph and a reference for each table, as well as a meaningful title for each.

Thank you for drawing our attention to the paragraph form of the table titles. The table descriptions were formatted according to the MDPI guidelines for this journal and were created based on the template provided by the journal for manuscript formatting. However, to attempt to address this, we’ve reviewed the text and added the following lines to ensure all tables and figures are referred to within the text:

Page 6 Lines 219 – 220: The combinations of algorithm training and testing sets are summarized in Figure 2. The total patients included in each model are listed in Table 2.

Page 7 Lines 253 – 255: These results and corresponding model accuracies are summarized in Table 3. Standard Deviations for each cross validated model run are available in Supplementary Table 3.

Page 8 Lines 290 – 292: These results are summarized in Table 4. Standard Deviations for each cross validated model run are available in Supplementary Table 4.

  • - very short conclusion

Thank you for bringing the brevity of the conclusion to the authors’ attention. The conclusion is modeled after the MDPI formatting guidelines for this journal (it is listed as an optional section). We attempted to format and draft our conclusion to match other papers on this topic, such as the following, which either included no conclusion or a short one.

Gaudiano C, Mottola M, Bianchi L, Corcioni B, Cattabriga A, Cocozza MA, Palmeri A, Coppola F, Giunchi F, Schiavina R, Fiorentino M, Brunocilla E, Golfieri R, Bevilacqua A. Beyond Multiparametric MRI and towards Radiomics to Detect Prostate Cancer: A Machine Learning Model to Predict Clinically Significant Lesions. Cancers. 2022; 14(24):6156. https://doi.org/10.3390/cancers14246156

Liu J, Niraj M, Wang H, Zhang W, Wang R, Kadier A, Li W, Yao X. Down-Regulation of lncRNA MBNL1-AS1 Promotes Tumor Stem Cell-like Characteristics and Prostate Cancer Progression through miR-221-3p/CDKN1B/C-myc Axis. Cancers. 2022; 14(23):5783. https://doi.org/10.3390/cancers14235783

However to address this concern, we added a line discussing the specific contribution to the literature. Please see below or lines 429 – 438 of the conclusion.

Accurate prostate cancer classification algorithms that were trained on single-institutional image data performed poorly when tested on outside-institutional image data; they required training on both image data sets to re-achieve high accuracy. While recent publications have reported high-performing ML models for the classification of hrPCA, most utilize models trained on “homogeneous” single-institution-trained image data. This study has demonstrated that generalizable models require heterogeneous and ideally multi-institutional datasets.  Heterogeneous multi-institutional training image data, perhaps through a federated learning system, will likely be required to achieve broadly-applicable models.

  • - I can't find your real contribution for this task, so a state of the art and a section describing your contribution are very necessary.

Thank you for drawing this concern to the authors’ attention. This is certainly an important point. To address it, the authors have included additional text in the conclusion, additional text in the discussion, and a paragraph in the introduction, as also referenced above. Please see below for the updates:

Introduction:

While ProstateX and other mpMRI ML results appear promising, caution is warranted as the majority of these studies are single-institutional studies, often using a single MRI scanner manufacturer [15-17]. A systematic review of ML algorithms in mpMRI noted the paucity of multi-institutional studies [18]. The need for multi-institutionally trained models using heterogeneous image data is being recognized [19, 20], spurring the development of the field of federated learning [21]. The purpose of this study was to determine the efficacy of highly accurate ML classification models trained on prostate image data from one institution and tested on image data from another institution. The highly accurate model used was a type of machine learning algorithm called a convolutional neural network: the Residual Neural Network (ResNet). The broader impact of single- and multi-institutional training data on model performance was also assessed.

 

The results of this study serve to identify the need for heterogeneous or multi-institutional training datasets for broadly applicable clinical models. Additionally, this study draws attention to an important concern, which is potential lack of generalizability of models trained on single-institutional or homogeneous datasets. 

 

Discussion: These results do not aim to limit the development of models on one or two datasets, but rather to encourage consideration of the heterogeneity and patient population when applying clinical models.

 

Conclusion: Accurate prostate cancer classification algorithms that were trained on single-institutional image data performed poorly when tested on outside-institutional image data; they required training on both image data sets to re-achieve high accuracy. While recent publications have reported high-performing ML models for the classification of hrPCA, most utilize models trained on “homogeneous” single-institution-trained image data. This study has demonstrated that generalizable models require heterogeneous and ideally multi-institutional datasets.  Heterogeneous multi-institutional training image data, perhaps through a federated learning system, will likely be required to achieve broadly-applicable models.

 

Round 2

Reviewer 1 Report

I think the reply mostly answered the original comments.

One note is that the t-test is not proper for comparing AUC or ROC. For AUC, how many values do you have? We have used DeLong tests before; the authors may consider it depending on their experience.

Otherwise I feel the paper is ready for acceptance.

Author Response

Thank you so much for your additional comments and review!

The t-test in question was used to determine the difference in means for the 5-fold cross validated results between two models. As such the t-test was appropriate and only used to compare the means of the 5 different folds for two separate final model results.

This methodology has been detailed in common ML teaching texts such as Alpaydin, E. (2016). Machine learning. MIT Press. 

But in particular, this analysis by Varma and Simon shows how this methodology of the re-sampled 5-fold cross validated sample used with a t-test helps eliminate bias when comparing models and analyzing results:

Varma, S., Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91 (2006). https://doi.org/10.1186/1471-2105-7-91

Thank you again!

Reviewer 2 Report

The manuscript is now ready for publication.

Author Response

Thank you so much! The authors appreciate your review.

Reviewer 3 Report

In the previous author's response, the answer to question 8 (The size of the database is also a variable. The two datasets together have 175 samples. Has the result of the model-PXL been tested with a random selection of 87 data (50/50) from both datasets?) is not convincing enough.

As the authors state, the audience will learn from the article that image classification models trained on monocentric image data test poorly on external datasets. The emphasis is on the fact that widely accepted algorithms require heterogeneous polycentric data, not on the size of the data volume. So it is important to control the dataset-size variable in the comparison tests.

The authors are also fully capable of completing this comparison test. Such a data allocation was achieved when training Model-Loc and testing on PXL (training 50 (80%), test 36 (20%)) in Table 2. The authors should be able to train/test the model in Model-PXL with the same amount of data.

 

So the authors need additional experiments to prove that multicenter data prediction is indeed better than monocenter data prediction with the same amount of data, not in the "next step".

Author Response

Thank you so much for your additional comments, we have greatly enjoyed working with you so far and appreciate your time and attention. In regards to your comments:

There were only 63 lesions in the local dataset, with 112 lesions in the open-source PX-2 dataset. So there are 24 too few lesions to do an 87/87 split without extensive additional data collection or inclusion of the same data in both the training and testing set. The authors also hope we can demonstrate in this response that the additional data is not necessary and that the current test as it stands is appropriate.

The Loc -> PXL test was trained on 50 local lesions, with a 13 Local / 23 PX-2 split in the testing set to compose the 36. So the training set still drew only on the 63 total lesions from the local dataset, with additional data from the PX-2 dataset included in the testing set. To address this point of confusion, we hope the new Table 2 inserted into the revised manuscript helps to better indicate this data breakdown for future readers. Table 2 has been inserted after line 242 and shows the total data in each of the training and testing datasets.

In the combined folds of the testing runs, 80% of the PX-2 data was combined with 80% of the local data and tested on a combined 20% of the PX-2 data and 20% of the local data. The reason the 80/20 training/testing split was consistently performed was to ensure that 5-fold cross-validation could be used. The use of 5-fold cross-validation (and a shuffle/randomization test for each fold of the cross-validation) was very important to ensure the results are statistically significant, defensible, and able to be repeated in future studies. This methodology has been previously detailed in the analysis by Varma and Simon and in various ML texts such as that by Alpaydin (Varma, S., Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91 (2006). https://doi.org/10.1186/1471-2105-7-91; and Alpaydin, E. Introduction to Machine Learning, Adaptive Computation and Machine Learning series, MIT Press, Cambridge, MA, 3rd edition (2014)). This methodology was ultimately selected as it helps to reduce bias in ML model results.

Additionally, while performing these tests, the authors found that the current results held without additional data collection: across the board, models trained on the combined dataset performed better than those trained on either dataset alone. This demonstrates that a large amount of additional data is not required to reduce bias/overfitting in the model. By adding only 50 local lesions to the 89 PX-2 training lesions, the combined-data models far outperformed the single-dataset models when tested on the other institution's data (modelPXL achieved AUCs of 0.85-0.99 on the individual datasets, compared with 0.23-0.55 for PX2 -> Local and 0.41-0.54 for Local -> PX2). Testing any single-dataset model (PX2 or Local) on the combined dataset (PXL) also yielded performance between those two results, which is consistent with these findings.

When initially training the models, the authors explored a variety of splits (over-representation of local data in training, over-representation of PX-2 data in training, etc.) and ultimately settled on the current breakdown (80% of PX-2, 80% of local) for the following reasons: 1) even adding just a small proportion of local data to the PX-2 training data served to make the model more generalizable, and 2) keeping the 80/20 split allowed the authors to continue to perform cross-validation in addition to the shuffle test, which allowed for tests of generalizability and statistical significance.

As the current results show that adding even a small proportion of local data to the PX-2 data makes the model more generalizable, adding more data would be unlikely to change these conclusions: a heterogeneous training set would still yield more generalizable results than a homogeneous one.

As this is an important point we have included the following text in the discussion (See page 11 lines 393 - 394 or the text below):

The multi-institutionally-trained models (modelPXL) maintained relatively high AUCs (0.85-0.99) when tested on single-institutional image data, although there were statistically significant differences in performance between multi- and single-institutional testing sets (p<0.01). ModelPXLT2s had higher AUCs when tested on single-institution image data compared to multi-institutional data (0.92 vs. 0.83). ModelPXLADCs had higher AUCs when tested on multi-institutional image data compared to single-institutional data (0.98 vs. 0.85). Additionally, it is important to note that these increases in performance occurred with only a small amount of additional data (50 local lesions added to the 89 PX-2 lesions), which shows that even a small proportion of heterogeneous training data serves to make a model more generalizable. These findings also have important implications for federated learning: they suggest that training on heterogeneous multi-institutional image data may have associated costs or benefits for model performance when tested on single-institution image data. A few published studies using algorithms with federated learning models have outperformed local institutional algorithms in prostate segmentation [50, 51]. These results do not aim to discourage the development of models on one or two datasets; rather, they are intended to encourage consideration of dataset heterogeneity and patient population when applying clinical models.

Thank you again for your time and attention; we hope this clears up this point of confusion! We look forward to your response and hope you will consider this manuscript ready for publication.

Reviewer 4 Report

I thank the authors for their efforts to improve the previous version and wish them continued success.
I have a few remarks to improve the presentation of your paper:
1/ Add a paragraph at the end of the introduction as a summary of the rest of the paper.
2/ It would be great if you could include some lines about the use of deep learning and machine learning in the medical field in general before discussing the prostate, and, if possible, include some recent references such as:

*/ Ammar, Lassaad Ben, Karim Gasmi, and Ibtihel Ben Ltaifa. "ViT-TB: Ensemble Learning Based ViT Model for Tuberculosis Recognition." Cybernetics and Systems (2022): 1-20.

*/ Gasmi, Karim. "Hybrid Deep Learning Model for Answering Visual Medical Questions." The Journal of Supercomputing (2022): 1-18.

 

3/ Add a few lines between Sections 2 and 2.2.
4/ Divide the Section 3 results into two subsections based on the tables.
5/ Add a subsection 3.1 to describe the evaluation criteria used, with mathematical formulas.
6/ Add some future work at the end of the conclusion.

Author Response

 

The authors would like to thank the reviewer for the additional remarks. To address these:

1) A paragraph has been added to the end of the introduction with a summary of the rest of the paper (See page 2 lines 69 – 74 or the text below):

This study tested the effects of single- versus multi-institutional training data by training a series of models to classify high-risk prostate cancer (hrPCA) on the open-source ProstateX-2 dataset, a local institutional dataset, and a combined dataset including data from both. The models were then tested on corresponding data from each of the three subsets and compared to one another through statistical testing. Sub-analysis was performed on the PZ and TZ regions of the prostate. Finally, the results were analyzed and discussed in the broader context of the need for heterogeneous datasets.

 

2) The following lines have been added to the discussion with the above sources cited to discuss the importance of the use of heterogeneous datasets in the broader context of clinical medicine (see page 12 lines 451 – 454 or the text below):

These results are not only important for the classification of hrPCA but also for the broader context of machine learning in clinical medicine. Machine learning has been applied to many other problems in medicine [59, 60]. The use of heterogeneous training data is important to improve the generalizability and utility of these clinical models.

 

  1. Ammar, Lassaad Ben, Karim Gasmi, and Ibtihel Ben Ltaifa. "ViT-TB: Ensemble Learning Based ViT Model for Tuberculosis Recognition." Cybernetics and Systems (2022): 1-20.
  2. Gasmi, Karim. "Hybrid Deep Learning Model for Answering Visual Medical Questions." The Journal of Supercomputing (2022): 1-18.

3) A few lines have been added between sections 2 and 2.2

4) The Section 3 results have been divided into two subsections based on the tables (3.1 Model Results for Classification of hrPCA on PX2, Local, and PXL Data; 3.2 Sub-analysis Results on PZ or TZ Lesions).

 

5) A subsection has been added to the Supplement with the relevant equations used in the methods and evaluation criteria. A new Supplementary Table 4 has been inserted (see the red text in the manuscript for reference), and the previous Supplementary Tables 4 and 5 have been renumbered to 5 and 6 to accommodate this.
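The exact equations added to the supplement are not reproduced in this record; as a point of reference, the standard definitions of common classification evaluation criteria (stated here as an assumption of what such a subsection typically contains) are:

    \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
    \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad
    \mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad
    \mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\, \mathrm{d}(\mathrm{FPR})

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively, and TPR and FPR are the true and false positive rates along the ROC curve.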

 

6) Future work has been added to the conclusion (See lines 464 – 465 or the text below):

Heterogeneous multi-institutional training image data, perhaps obtained through a federated learning system, will likely be required to achieve broadly applicable models. Future work on the classification of hrPCA from prostate MRI should focus on the use of heterogeneous training data to create and validate new models.

 

We hope these updates sufficiently address the reviewer's comments and thank you again for your time and attention. We have greatly enjoyed working with you.
