Can Artificial Intelligence Interpret Pulmonary Function Tests and Predict Prolonged Air Leaks After Lung Resection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Dear Authors,
I read with interest the manuscript assessing the use of AI to predict postoperative PALs.
The topic is interesting and timely. The English is clear. The background is well reported, although it should be strengthened with more relevant references. The aim is clear and well presented in the introduction. The Materials and Methods are well explained and need only minor changes. The statistical section should be expanded to describe which test was used for each variable. The presentation of the results should be improved for better comprehension. In particular, the authors state that AI can access and utilize data via scanned models, but no data are presented to support this. The discussion, in my opinion, should focus on the comparative studies (references in Table 2), explaining which variables were used by the other studies to allow a more accurate comparison.
I have some comments and suggestions:
In the introduction (lines 65-69), the authors cite only two references although they state that "several" models have been created. Please revise accordingly.
Line 54: "characterized" should be changed to "defined as".
Lines 114-117: it is not clear whether 75 is the number of PFT features alone or of PFT and clinical features combined. Moreover, the authors should state how many features were initially extracted and how many were removed due to missing data.
Line 158: the authors use the term "final validation cohort", probably to refer to the "internal validation cohort". Please define this cohort of patients more clearly in lines 141-142 and in line 158.
The description of the statistical analysis should be expanded to state which tests were used for the different types of variables (e.g., the t-test is for parametric variables; what did you use for non-parametric variables? Did you assess whether the clinical variables were normally distributed? How did you compare discrete variables?).
Lines 215-220 should be moved to the Discussion section rather than remaining in the Results.
Line 221: "The model consisted of 321 total trainable parameters". This line is confusing. It is not clear whether the model uses parameters other than the 10 identified after FSA. Please clarify.
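For context on this comment, a neural network's "trainable parameters" are its weights and biases, which are distinct from its input features, so a model with only 10 inputs can still have hundreds of them. The sketch below counts them for a fully connected network. The layer widths (10, 16, 8, 1) are purely hypothetical and chosen for illustration; the manuscript's actual architecture is not stated in this report.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of each layer, starting with the input.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-output connection
        total += n_out         # one bias per output unit
    return total


# Hypothetical widths: 10 input features, two hidden layers, 1 output.
hypothetical_layers = [10, 16, 8, 1]
n_params = dense_param_count(hypothetical_layers)
print(n_params)
```

This hypothetical configuration happens to total 321, which shows only that such a count is compatible with 10 input features; it is not a claim about the authors' model.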
Lines 254-256: the authors refer to studies that use ML to predict postoperative complications but do not mention or discuss the studies that use ML to predict postoperative PALs. Please add a brief presentation and discussion of those models.
Lines 257-259 repeat the background and should be removed for easier reading.
The absence of an external validation cohort is one of the limitations of the study and should be added.
The model uses congestive heart failure, prior cardiothoracic surgery, mitral valvular heart disease, and tricuspid valvular heart disease. Don't you think that these variables are correlated?
Did you try to build a "PFT-only" model? What accuracy does it reach?
Author Response
Please see the attachment
Reviewer 2 Report
Comments and Suggestions for Authors
The model is developed and validated on a single-institution data set of 420 patients. This is a major limitation of the study, and the authors should provide a rationale for why the results may be generalized. SHAP values and clinical interpretation of the key predictors are critical for clinical relevance and usage; these should be added to the discussion. The justification for choosing FSA over LASSO and other embedded methods needs to be discussed. Cross-validation stability needs to be discussed. The imputation method used to account for missing values needs to be justified. Sensitivity would need to be considerably higher for the model to be considered a clinical risk prediction tool. The heterogeneity across data sets needs to be addressed. Table 2 only compares with prior studies, whereas a discussion of the differences in datasets and populations is missing. Clear definitions of the variables need to be incorporated. Figure 1 needs detailed labeling, and confidence intervals for the AUC should be added to the ROC curve (Fig. 2). Statistical reporting is needed, such as p-values for the comparison of models. Ensure formatting consistency throughout.
Author Response
Please see the attachment
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Editor and Authors,
Thank you for asking me to review this manuscript titled “Can Artificial Intelligence Interpret Pulmonary Function Tests and Predict Prolonged Air Leaks After Lung Resection” by Mr. Omar Zahra and his preceptors.
First let me comment that as a thoracic surgeon and educator myself it is always a pleasure to see and evaluate works from medical students.
This manuscript addresses an important surgical problem which is the prediction of prolonged air leak (PAL) following pulmonary resections using an AI-based evaluation of pulmonary function testing values (PFTs). The study is novel and relevant, has an adequate methodology and is potentially important, particularly due to its integration of OCR-extracted PFT data with machine learning.
The manuscript is well-structured and clearly written. However, there are significant limitations that must be addressed before publication!
Specifically:
In the introduction, the statement on line 72 about AI outperforming pulmonologists in the interpretation of PFT results is a bit extreme; although there might be a single article proposing this, I would be hesitant to include such a strong claim!
Was patient selection consecutive, or were there missing patients who were not included?
Why was 2016 to 2023 selected as the recruitment period, given that this was 3 years ago? Why not choose a more recent period, since prolonged air leak is an immediate complication and does not require long-term survival data?
The inclusion of pneumonectomies and sleeve resections in the analysis compares unequal parameters, since these involve large bronchial stump closures and bronchial anastomoses, which are a different tissue from parenchyma; thus the assumption of homogeneity of variance is not met! I suggest removing these patients (there are only 10!) from the analysis, as a prolonged air leak from a bronchopleural fistula is a very different pathology from a parenchymal staple-line leak! Did these patients have air leaks?
Which PFTs and variables were measured initially and which were used in the model analysis? How were the 10 predictive variables selected by the training model’s algorithm? What were the selection parameters used?
The model is trained and validated only on a single-institution dataset (n=420) and no external or temporal validation cohort is used. This is a limitation of the study which limits its potential applicability in other patient cohorts and needs to be acknowledged!
How exactly were missing data handled? The description on page 3 (line 116) is oversimplified, and it is known that mean imputation of missing values can introduce bias and ignores the data distribution and uncertainty. Please discuss this as a limitation!
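One way to see the concern with mean imputation is that every filled-in value sits exactly at the observed mean, so the spread of the variable is artificially reduced. The sketch below illustrates this on synthetic numbers only; the distribution, sample size, and 30% missingness rate are assumptions made for illustration and do not come from the manuscript.

```python
import random
import statistics

random.seed(0)

# Synthetic "complete" measurements (hypothetical values, not study data).
complete = [random.gauss(80.0, 15.0) for _ in range(200)]

# Remove roughly 30% of the values completely at random.
observed = [x if random.random() > 0.3 else None for x in complete]
present = [x for x in observed if x is not None]

# Mean imputation: replace every missing value with the observed mean.
fill_value = statistics.fmean(present)
imputed = [fill_value if x is None else x for x in observed]

sd_observed = statistics.stdev(present)
sd_imputed = statistics.stdev(imputed)

print(f"SD of observed values:    {sd_observed:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")
```

The mean is unchanged by construction, but the standard deviation after imputation is smaller than that of the observed values, which is the kind of distortion that multiple imputation or model-based approaches are intended to avoid.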
Because of the small dataset used (420 patients with 50 events), there is a potential for overfitting the models! Specifically, the events-per-variable ratio is low, so forward selection combined with the neural network analysis increases the overfitting risk! Please report the cross-validation strategy (k-fold?) and the regularization methods used (dropout, L2, etc.).
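The events-per-variable (EPV) concern can be made concrete with the numbers quoted in this report: 50 events, 10 selected predictors, and the 75 features mentioned for lines 114-117. The short sketch below is only arithmetic; treating 75 as the number of candidate features is an assumption, since the report itself notes that this figure is ambiguous.

```python
def events_per_variable(n_events, n_variables):
    """Return the ratio of outcome events to candidate predictors."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return n_events / n_variables


# 50 PAL events and 10 predictors retained after feature selection.
epv_selected = events_per_variable(50, 10)

# Assumption: 75 candidate features were screened before selection.
epv_candidates = events_per_variable(50, 75)

print(f"EPV, selected predictors: {epv_selected:.1f}")
print(f"EPV, candidate features:  {epv_candidates:.2f}")
```

Both values fall below the commonly cited rule of thumb of roughly 10 events per variable, and the ratio computed on the candidate features is the more relevant one when selection is data-driven.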
Please rephrase line 185 to state that demographic and clinical variables are presented and compared in Table 1. The current wording is a bit uncommon in scientific writing!
It is well known that forward stepwise selection (FSA) can be unstable and prone to selection bias! Why were more robust approaches such as LASSO / Elastic Net and/or recursive feature elimination not used?
Confidence intervals for the AUC, sensitivity, and specificity are missing, and no calibration metrics (e.g., calibration curve, Brier score) are provided.
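The metrics requested here can be computed with very little code. The sketch below shows a Brier score and a percentile bootstrap confidence interval for the AUC, using only the standard library. The labels and predicted probabilities are synthetic, and the function names, number of resamples, and seed are illustrative assumptions rather than the authors' method.

```python
import random


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auc(y_true, y_prob):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # a resample needs both classes
            stats.append(auc(ys, [y_prob[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic example (not study data): 1 = PAL, 0 = no PAL.
y = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.15, 0.9]

point_auc = auc(y, p)
ci_low, ci_high = bootstrap_auc_ci(y, p)
print(f"AUC = {point_auc:.3f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
print(f"Brier score = {brier_score(y, p):.3f}")
```

With a sample this small the interval is very wide, which is exactly why reporting it alongside the point estimate matters.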
The predominantly white population (78.8%) limits the applicability of the findings to more diverse populations!
Why is insurance information important, and why has it been presented and included in the comparison? How does it affect the outcome of a patient's procedure? Do private patients receive better quality care?
In terms of interpretation of the results, the reported AUC (0.83) is promising but likely optimistic, because only internal validation was performed and the dataset was small. A sensitivity of 60% may be insufficient for clinical screening alone!
The most interesting finding of the study is that the AI selected non-traditional PFT variables, for example DLCO/VA rather than FEV1, as predictors of PAL! This may reflect the assumption that DLCO better reflects tissue quality and parenchymal perfusion, as well as injury response and healing rates, which are important factors in the development of post-resection PAL! This is something that can be further explored in the discussion (the authors have only briefly alluded to it in lines 287-288)!
In conclusion, this is an interesting study, but it needs some significant improvement before it is ready for publication. However, I am looking forward to evaluating the revision of the work and encourage the main author and his preceptors to continue their efforts!!
Author Response
Please see the attachment.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Great improvement
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Editor and Authors,
It was my pleasure to review this revised manuscript which has been re-submitted for assessment. The authors have now substantially improved the manuscript and have addressed most of the reviewer comments adequately. The lack of external validation has been properly acknowledged but still remains the largest scientific limitation! This is acceptable for an exploratory ML study, but still weakens its impact.
Consequently, the paper is now acceptable for publication. I congratulate the young authors for a job well done!
Kind regards.
