Article
Peer-Review Record

Outcome Prediction for SARS-CoV-2 Patients Using Machine Learning Modeling of Clinical, Radiological, and Radiomic Features Derived from Chest CT Images

Appl. Sci. 2022, 12(9), 4493; https://doi.org/10.3390/app12094493
by Lorenzo Spagnoli 1,2, Maria Francesca Morrone 1,2, Enrico Giampieri 3,4, Giulia Paolani 1,2, Miriam Santoro 1,2, Nico Curti 4, Francesca Coppola 5, Federica Ciccarese 5, Giulio Vara 5, Nicolò Brandi 5, Rita Golfieri 5, Michele Bartoletti 6, Pierluigi Viale 6 and Lidia Strigari 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 11 March 2022 / Revised: 21 April 2022 / Accepted: 24 April 2022 / Published: 28 April 2022
(This article belongs to the Special Issue Human and Artificial Intelligence)

Round 1

Reviewer 1 Report

The authors built various machine learning models which use radiomic features from chest computed tomography (CT) images—derived semi-automatically—to predict mortality in a high-risk COVID-19-positive group. They also trained models using clinical and radiological variables. Use of a relatively large dataset of 436 chest CT images from positive patients was made possible by the application of semi-automated segmentation tools. The authors find that models using only radiomic features performed similarly to those using clinical variables, with models utilizing either data source outperforming models utilizing variables derived from radiological assessment for all three architectures investigated. They conclude that features based on semi-automatically segmented CT images, combined with machine learning approaches, can aid the identification of patients at elevated mortality risk, which could support clinicians in caring for COVID-19 patients.

The work is an interesting and timely application of machine learning. While their methods seem generally sound, there are several aspects which should be clarified, reexamined, and/or justified. Interpretation of the results is also illuminating and considered in a practical manner, but some result figures should be improved and additional analyses should be carried out (though I would not expect these to be extensive or particularly burdensome for the authors). Overall, I believe this work holds promise for publication and—with revision—will contribute moderately to the literature.

I provide detailed comments below, grouped by section.

 

FEATURED APPLICATION

The “Featured Application” statement seems to contain grammatical errors (perhaps just a key word missing) and is rather confusing. (The confusion may just be due to the missing word or other error.)

After reading the entire manuscript and returning to this “Featured Application” statement, it remains unclear to me how “this methodology might be used to…refine existing Machine Learning models based on radionics’ analysis” or what aspect of your methodology you consider to be “fast, objective and reusable.” Are you simply suggesting that extracting features using semi-automated segmentation methods could be used in developing other machine learning models? If so, that hardly seems to be a novel development of your work, nor is it one of the important findings or areas of focus. I would urge the authors to reconsider the entirety of this statement to better align with the novelty and focus of the rest of the manuscript.

 

ABSTRACT

Under results, you report that ML models outperformed “the radiological assessment.” Do you mean the ML models that used radiological features, or human radiological assessment and prediction? (I didn’t see any comparison in the rest of the manuscript comparing performance to human radiological assessment, so this was confusing to me.)

 

INTRODUCTION

I find the report that chest CT has a sensitivity of 97% to be misleading, since the cited study found that it had a specificity of only 25% and an accuracy of 68%. Reporting only the sensitivity may be interpreted to imply a high degree of accuracy rather than a method offering accuracy marginally better than chance with a strong bias towards overdiagnosis.
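
For context, with accuracy = sensitivity × prevalence + specificity × (1 − prevalence), the cited figures of 97% sensitivity and 25% specificity reproduce the reported 68% accuracy only at a disease prevalence of roughly 60% (0.25 + 0.72 × 0.60 ≈ 0.68); at lower prevalence the accuracy would drop further still, which underscores the bias towards overdiagnosis.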

 

MATERIALS AND METHODS

Were the distributions of patient classes equivalent between the three image acquisition systems? (I.e. was the proportion of patients who died the same for all three?) There have been examples where the imaging system used was itself a predictor of outcome (i.e. patients who were disposed towards a certain outcome were more likely to be imaged with a certain system), so it is important to confirm that such bias was not introduced here.
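
For instance, a simple contingency-table check along the following lines would suffice; this is only a sketch, and the counts and scanner grouping are hypothetical placeholders, not your data:

    # Hypothetical check: does the proportion of deceased patients differ across
    # the three acquisition systems? (Counts below are placeholders.)
    from scipy.stats import chi2_contingency

    # rows: acquisition system; columns: [alive, dead]
    observed = [[150, 30],
                [120, 25],
                [90, 21]]

    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")  # a large p-value would argue against a scanner-outcome association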

The authors report that patients were redirected from neighboring hospitals. Presumably these patients would be the most at-risk among all hospitalized patients. Consequently, is it reasonable to generalize the results to the broader hospitalized patient population, or only those who are sufficiently ill to warrant direction to a major university hospital?

It appears that there may be some skew between follow-up lengths for the “dead” and “alive” patient classes. Was there any difference between the classes? Differences in the speed of disease or symptom progression, delays in diagnosis, or other factors may have resulted in patients who later died receiving CT closer to their diagnosis than those who ended up surviving. If this is the case, could any of the radiomic or radiological findings be explained (at least in part) or biased by differences in imaging timing relative to disease progression? A non-parametric comparison along the lines sketched below would settle this quickly.
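
This is only an illustrative sketch; the interval arrays are hypothetical placeholders:

    # Hypothetical check: compare follow-up (or diagnosis-to-CT) intervals, in days,
    # between deceased and surviving patients. The arrays below are placeholders.
    import numpy as np
    from scipy.stats import mannwhitneyu

    followup_dead = np.array([3, 5, 7, 10, 14, 21])
    followup_alive = np.array([12, 18, 25, 30, 34, 40])

    stat, p_value = mannwhitneyu(followup_dead, followup_alive, alternative="two-sided")
    print(f"U = {stat:.1f}, p = {p_value:.3f}")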

The authors report that data were collected over 13 months, specifically a period of particularly rapid evolution both in clinical care for COVID-19 patients and in the virus itself.

  • How can the authors be sure that imaging criteria and protocols were consistent throughout that period?
  • How can the authors be sure that changes in clinical care over time did not impact the predictive radiomic and clinical factors? (For example, were changes in mortality over time assessed? Were data segmented temporally to check whether findings and/or accuracy were consistent across the entire period?)
  • Different variants have been shown to present differently, both symptomatically and by antigen testing. Presumably there could be radiological and/or radiomic differences in presentation as well. How can the authors be certain that their findings are generalizable across variants?

The authors note in their “Institutional Review Board Statement” (but not elsewhere) that the study was prospective. I believe indicating this in the methods section would provide additional clarity.

Table 1

  • It is reported that the minimum days of hospitalization is 0. Does this not violate the eligibility/inclusion criteria (which seem to require hospital admission)?
  • The “Yes” entry for “Bilateral involvement” has been cut off.
  • Units are not reported for respiratory rate (presumably breaths per minute).
  • The authors likely intend to report the variable as “Sex” rather than “Gender” (this applies in the text as well, e.g. line 127), as confirmed by the feature name (in Figure 4).
  • [Entirely optional recommendation] The authors could easily consolidate the table into two columns, as in most published tables reporting clinical characteristics of patients.

Why was a binary fever metric used instead of using temperature as a continuous variable? Similarly, would blood pressure not be more informative than binary hypertensive state?

Is the image segmentation workflow validated for patients with respiratory disease? Could the disease state (severity, stage of progression, etc.) alter the image in ways that impact the accuracy of the segmentation process? Was there any manual review or check of the automated workflow output?

[Entirely optional recommendation] The default parameters for radiomic feature definition and extraction (lines 107-112) are not critical and could be moved to supplementary materials.

The authors did not investigate predictive performance for ICU admission, just death. Why was ICU admission not also assessed? Since ICU admission is not analyzed anywhere, what is the purpose of reporting the information in Table 2?

How were the architecture and hyperparameters determined?

If the authors are using machine learning—particularly deep neural networks—why were convolutional neural networks (CNNs) not applied directly to the images? As I’m sure the authors are aware, CNNs are often highly effective in image analysis applications. The reported method instead extracts predetermined features and then provides those to a neural network. One of the fundamental advantages of neural networks is that important features can be learned rather than crafted, but in this case a predetermined set of crafted radiomic features limits the information considered. (Admittedly, pre-selecting the features does make the results more interpretable, and allows for the “most important feature” analysis that the authors carry out. Was this the motivation?) The sketch below illustrates the contrast I have in mind.
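
This is a minimal, self-contained sketch with synthetic arrays; the feature count, network sizes, and variable names are placeholders, not the authors’ settings:

    # (a) Pipeline of the kind described: a fixed, predefined set of radiomic features
    #     per patient (e.g. as extracted by pyradiomics from the segmented lung volume)
    #     is fed to a conventional classifier; the network never sees the image itself.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X_radiomic = rng.normal(size=(436, 107))   # hand-crafted features per patient (placeholder values)
    y = rng.integers(0, 2, size=436)           # mortality outcome (placeholder values)
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_radiomic, y)

    # (b) The alternative raised above would be a convolutional network applied directly
    #     to the CT voxels, which learns its own features but typically demands more data
    #     and is harder to interpret.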

 

RESULTS

You assert earlier that “this methodology might be used to identify early predictors from CT chest images,” but each method/architecture identifies mostly different important features. If each method considers different features to be the important ones, how can you conclude that the method can identify predictors? Which is most reliable in identifying predictors?

On the basis of the training and testing result discrepancies reported in Table 3, it is abundantly clear that the Random Forest classifier was over-fit to the training data, and likely had an architecture with too much complexity or too many degrees of freedom relative to the dataset size. Why weren’t a simpler model, a shorter training period, or other overfitting mitigation strategies used (for example, along the lines sketched below)?
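
This is an illustrative sketch only, with synthetic data; the constraint values are not prescriptive:

    # Illustrative overfitting mitigation for a random forest: limit tree depth and leaf
    # size, then judge performance by cross-validated AUC rather than training AUC.
    # The data below are synthetic placeholders.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(436, 50))
    y = rng.integers(0, 2, size=436)

    rf = RandomForestClassifier(n_estimators=300, max_depth=4,
                                min_samples_leaf=10, random_state=0)
    cv_auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
    print(f"cross-validated AUC: {cv_auc.mean():.2f} +/- {cv_auc.std():.2f}")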

In Figure 4, the fever variable name is in Italian and may cause confusion.

Figure 5

  • The figure is rather messy and unappealing, with labels, legends, and titles retaining what appear to be original programming variable names rather than reader-friendly descriptions.
  • The vertical axis shows the overall fraction, not a percentage (which would be 100 times the displayed values).
  • It’s a bit unclear what the blue and pink data represent. I believe you’re saying that blue is the performance for the model which is able to consider both clinical and radiomic data simultaneously as variables, while pink indicates the cases where the individual clinical-only and radiomic-only models are both correct. Can you confirm and clarify?

Why is the analysis corresponding to Figure 5 only performed for the Lasso models? Are the conclusions about over- and under-estimation on the basis of age and feature sets consistent across all three architectures?

 

DISCUSSION

You raise an interesting point about the high prevalence of smoking history among your patient population despite its lack of mortality prediction power (which seems to suggest that a history of smoking increases the chances of becoming seriously ill and requiring hospitalization but may not be a useful differentiating factor among those hospitalized). Is it possible that “history of smoking” could manifest in other changes (e.g. elevated respiratory rate, hypertension, etc.) that make it a somewhat redundant feature, thereby making it appear irrelevant despite having a causal relationship with other features that were found to be relevant? Also, did history of smoking vary by age (the most important predictor) or sex? If these (non-causal) factors were correlated, it could similarly diminish the apparent relevance of smoking history. Correlation analyses among your variables, along the lines sketched below, would be informative.
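
A minimal sketch of the kind of check I mean (the dataframe, column names, and values are hypothetical placeholders, not the study data):

    # Hypothetical correlation check among clinical variables; Spearman rank correlation
    # copes reasonably with a mix of binary and continuous variables.
    import pandas as pd

    df = pd.DataFrame({
        "age": [55, 71, 63, 80, 47],
        "smoking_history": [1, 1, 0, 1, 0],
        "respiratory_rate": [18, 24, 20, 28, 16],
        "hypertension": [0, 1, 1, 1, 0],
    })
    print(df.corr(method="spearman").round(2))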

The WHO website should be cited as a reference, and not embedded within the manuscript text (lines 258-259).

Is it possible to relate any of the most relevant radiomic variables to pathological findings or states?

The authors highlight “how disorder and inhomogeneity in the gray levels are related to the damage in the lungs as well as to the age of the patient” in Figure S2. Would it be possible to report the values of some of the key radiomic variables for each of these images (perhaps in the caption)? It would be interesting and informative to be able to compare the quantitative values and associate each with the corresponding qualitative visual appearance.

I believe the discussion would be aided by a comparison of performance between the developed models and human radiologist interpretation/performance, if possible.

The authors may consider strengthening their discussion by acknowledging and addressing limitations of the presented work and potential future work.

 

AUTHOR CONTRIBUTIONS

The introductory statement (lines 306-307) should not have been retained from the template.

The task of “writing—original draft preparation” is a single task and should not have been split into “writing” and “original draft preparation.” It is also perplexing that different authors were listed for these two tasks when split.

 

REFERENCES

There seem to be several anomalies and likely errors within references/citations, including 1, 13, 14, 20, mostly related to author names. The authors should closely review their references when revising their manuscript.

 

SUPPLEMENTARY MATERIALS

Table S1

  • No units are reported.
  • The Table contents need to be better described. What is the “follow-up” (i.e. time between what events)? What are waves 1 and 2?
  • It’s difficult to determine without understanding what these values represent, but is 0 a viable follow-up length? (The minimum is reported as 0 for three of these groups.)
  • Are differences statistically significant?
  • Table S1 is never referenced in the main manuscript, so it’s unclear which section it corresponds to or supplements.

It appears that Figures S1 and S2 have been swapped relative to their captions.

When I open the document, the third page is entirely empty/blank.

Figure S3

  • The plots should likely include vertical axis labels.
  • How or why were these two particular radiomic features selected for investigation?
  • What trends or findings are you trying to illustrate or convey?

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The work represents an interesting prospective study on the semi-automatic segmentation of chest CT images of SARS-CoV-2 patients.

In your population of 436 patients you reported that 363 (83%) were obese. This seems a high prevalence; is this figure verified? Could it be a source of bias?

In Figure 4, “Febbre” should be corrected to “fever”, I suppose.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Radiomic, Radiological, and Clinical data were used for predicting death caused by COVID-19.


Major points

 

“the range of pixel values representing the vascular tree (from -400 HU to about 1000 HU).” Consolidation caused by COVID-19 may be ignored in this setting. Consolidation is associated with severe COVID-19, compared with GGO. Is this setting acceptable?

 

According to the Testing AUC in Table 3, CT (radiomic and radiological data) is not useful for predicting death caused by COVID-19, compared with clinical data. I cannot accept the description in the abstract and the paper’s conclusion.


Minor points

 

Time interval between RT-PCR and CT should be included in the paper.

 

“Bilateral involvement    2.4%) — 33 (7.6%)” This entry is broken.

 

“Deep Neural Network”: I think the authors did not use a DNN. The authors used a multi-layer perceptron. Please rewrite it.

 

What is Febbre in Figure 4?

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have sufficiently addressed my comments, and I appreciate their efforts in responding to my questions and revising the manuscript. I believe the manuscript has been significantly improved.

Some of the revisions to the manuscript appear to have introduced new errors, which I hope the authors will check, and if necessary address, prior to publication:

  • In Section 2.1 (and possibly the first paragraph of 2.2), the new text utilizes acronyms which aren’t defined.
  • On line 131, did you mean to write “nuance” rather than “nuisance”?
  • The revised Figure S2 caption seems not to have been finalized and is internally inconsistent. (Two different sets of entropy values are reported for the four images.)
  • The captions of Figures 5 and S4 both reference a “pink colored histogram,” but there appears not to be any such graph in either. It seems that the description perhaps instead refers to the blue histogram in the final panel. Either the description should be updated or, preferably (to avoid confusion with the blue bars in the first two panels), the color of the histogram should be changed accordingly.
  • In the discussion (line 339), you reference Figure S4, but it does not seem to be relevant in that context.
  • The revised “Institutional Review Board Statement” includes text from the manuscript that doesn’t make sense in that context (i.e. “…are reported in the following paragraphs.”)
  • It does not seem appropriate to indicate that you are preparing a separate manuscript in the methods section of this paper (lines 195-196).

I was also somewhat disappointed to read that the authors are preparing a separate predictive analysis on radiomics features for ICU admissions rather than including it in this manuscript. Assuming that these efforts utilize the same dataset, methods, etc., splitting this work into two manuscripts risks weakening both and unnecessarily introducing redundancy into the literature.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

"We revised the conclusion specifying that the ML approach based on radiomic approach does not substitute that obtained from clinical information but works as a second expert opinion only based on features extracted from chest CT images stored in the RIS/PACS to refer a patient to further clinical investigation."

 

The Featured Application, the conclusions in the Abstract, and the conclusions in the main text of the revision do not match my previous comment and the authors’ response.

 

Author Response

Please see the attachment.

Author Response File: Author Response.docx
