Management of Severe COVID-19 Diagnosis Using Machine Learning

Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The abstract and methods state that 257 patients were enrolled. However, 226 observations were used for modeling, with 31 excluded for “incompleteness identified by automated classifiers”, and missingness is later described as less than 1% in several variables. The criteria for incompleteness, the exact rules used by automated classifiers to flag records, and the distribution of missingness across features are not described clearly. How were incomplete records flagged by automated classifiers in terms of explicit thresholds and rules?
- Target label definitions appear inconsistent across sections and figures. Severity is encoded as 0 for mild, 1 for moderate, and 2 for severe in the methods. However, the decision tree description refers to 0 as negative outcome, 1 as positive outcome, and 2 as recovery or mild course. Why?
- Metrics and validation strategy are not sufficient. The dataset is imbalanced across the three severity classes, yet only overall accuracy and multiclass AUC are emphasized, with AUC values reported as 1.000 for all top models and without per-class recall, precision, macro versus micro aggregation, or calibration assessment. External validation is not performed, and no learning curves or variance across folds are shown beyond a mean and standard deviation. Please consider reporting per-class metrics and confusion matrices for every model that is compared.
- A key limitation is the absence of any demographic analysis of communication or behavioral signals that ran in parallel to the clinical markers. The modeling ignores whether demographic factors connect to information exposure or content dynamics that often co-vary with health outcomes and care seeking. Several prior works, such as https://doi.org/10.3390/computers12110221 and https://doi.org/10.1016/j.ipm.2021.102541, have highlighted the role of demographic factors when performing similar studies. If performing a diversity-based analysis is not feasible at this point, it is suggested that the authors review a few such works and state this as a future scope of work.
- Imputation and exclusion choices are not justified at the variable level. Median substitution is reported for variables that appear to be categorical indicators, and the paper does not explain the coding scheme or whether mode imputation or model-based approaches were considered. Records were excluded for incompleteness using automated classifiers without a description of which features were incomplete. Also, why was exclusion preferred to imputation in those cases? Furthermore, what motivated the use of median substitution for binary family history variables? Please explain.
Author Response
Comment 1: The abstract and methods state that 257 patients were enrolled. However, 226 observations were used for modeling, with 31 excluded for “incompleteness identified by automated classifiers”, and missingness is later described as less than 1% in several variables. The criteria for incompleteness, the exact rules used by automated classifiers to flag records, and the distribution of missingness across features are not described clearly. How were incomplete records flagged by automated classifiers in terms of explicit thresholds and rules?
Response 1: We cordially appreciate your hard work and the valuable time you spent reviewing our manuscript. We thank the reviewer for pointing out this lack of clarity. We have now added a detailed description in the Materials and Methods section to explain how incomplete records were flagged. Specifically, records were excluded when critical variables (e.g., IL-6, ALT) or more than a predefined proportion of essential features were missing, whereas isolated missing values (<1% across several categorical variables) were imputed. This clarification ensures that the criteria for data exclusion and imputation are now transparent.
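For illustration, the exclusion rule described in this response can be sketched as follows. This is a minimal sketch under stated assumptions: the critical-variable set and the missingness threshold below are placeholders, since the actual values are defined in the revised manuscript, not here.

```python
import math

# Assumptions for the sketch only; the manuscript specifies the actual rule.
CRITICAL_VARS = {"IL6", "ALT"}
MAX_MISSING_FRACTION = 0.2  # hypothetical "predefined proportion"

def is_missing(value):
    """Treat None and NaN as missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def flag_incomplete(record):
    """Return True if a patient record should be excluded rather than imputed."""
    missing = {name for name, value in record.items() if is_missing(value)}
    if missing & CRITICAL_VARS:  # any critical variable absent -> exclude
        return True
    # otherwise exclude only when too large a share of features is missing
    return len(missing) / len(record) > MAX_MISSING_FRACTION
```

Isolated gaps in non-critical variables fall below the threshold and are left to the imputation step described above.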
Comment 2: Target label definitions appear inconsistent across sections and figures. Severity is encoded as 0 for mild, 1 for moderate, and 2 for severe in the methods. However, the decision tree description refers to 0 as negative outcome, 1 as positive outcome, and 2 as recovery or mild course. Why?
Response 2: We thank the reviewer for noting this inconsistency. The target label definitions were indeed intended to follow a consistent scheme (0 = mild, 1 = moderate, 2 = severe). The description in the Results section accompanying Figure 4 mistakenly referred to alternative labels (“negative/positive outcome”), which we have now corrected to align with the definitions provided in the Methods.
Comment 3: Metrics and validation strategy are not sufficient. The dataset is imbalanced across the three severity classes, yet only overall accuracy and multiclass AUC are emphasized, with AUC values reported as 1.000 for all top models and without per-class recall, precision, macro versus micro aggregation, or calibration assessment. External validation is not performed, and no learning curves or variance across folds are shown beyond a mean and standard deviation. Please consider reporting per-class metrics and confusion matrices for every model that is compared.
Response 3: We thank the reviewer for this valuable suggestion. We agree that reporting only overall accuracy and multiclass AUC was insufficient given the class imbalance in our dataset. In the revised manuscript, we have added per-class precision, recall, and F1-score metrics for all classifiers (Supplementary Tables S1–S2) and included confusion matrices for the top-performing models (ExtraTreesClassifier, RandomForestClassifier, and DecisionTreeClassifier) as Supplementary Figures S1–S10. This provides a more granular evaluation of model behavior across severity classes. While external validation and calibration analyses remain outside the scope of the present study, we acknowledge them as directions for future work.
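The per-class reporting added in the revision can be reproduced with standard scikit-learn utilities; the sketch below uses toy labels (0 = mild, 1 = moderate, 2 = severe) purely for illustration, not the study's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels for illustration only (0 = mild, 1 = moderate, 2 = severe).
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows = true class, columns = predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

# average=None yields one value per class instead of a single aggregate.
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], average=None, zero_division=0
)
print(cm)
print("per-class recall:", rec)
```

Reporting `average=None` alongside macro and micro aggregates makes class-imbalance effects visible that a single accuracy figure hides.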
Comment 4: A key limitation is the absence of any demographic analysis of communication or behavioral signals that ran in parallel to the clinical markers. The modeling ignores whether demographic factors connect to information exposure or content dynamics that often co vary with health outcomes and care seeking. Several prior works, such as https://doi.org/10.3390/computers12110221 and https://doi.org/10.1016/j.ipm.2021.102541 have highlighted the role of demographic factors when performing similar studies. If performing a diversity-based analysis is not feasible at this point, it is suggested that the authors review a few such works and state this as a future scope of work.
Response 4: Thank you for pointing this out. We have reviewed the works you indicated and greatly appreciate the valuable insights they provided. In our conclusions, we incorporated a note on the absence of demographic analysis of communicative and behavioral signals and specified our intention to integrate such an analysis into the design of our subsequent studies.
Comment 5: Imputation and exclusion choices are not justified at the variable level. Median substitution is reported for variables that appear to be categorical indicators, and the paper does not explain the coding scheme or whether mode imputation or model-based approaches were considered. Records were excluded for incompleteness using automated classifiers without a description of which features were incomplete. Also, why was exclusion preferred to imputation in those cases? Furthermore, what motivated the use of median substitution for binary family history variables? Please explain.
Response 5: We thank the reviewer for highlighting this important issue. We have revised the Materials and Methods section to clarify the rationale behind our imputation and exclusion strategies. Specifically, we now explain that binary family history variables were coded as 0/1 indicators and imputed using median substitution, which in practice is equivalent to mode imputation and justified by the very low (<1%) missingness. By contrast, the 31 excluded records lacked several critical biochemical variables (e.g., IL-6, ALT, AST, lymphocytes, platelets), which could not be reliably imputed given their central role in predicting severity. Exclusion was therefore preferred over imputation to preserve data integrity and avoid bias. All changes in the manuscript are highlighted in green for better visibility.
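The equivalence claimed in this response can be verified in a few lines. The sketch below uses hypothetical 0/1 family-history values: for a binary variable whose majority class holds more than half of the observed values, the median and the mode coincide, so median substitution fills gaps with the majority category.

```python
from statistics import median, mode

# Hypothetical 0/1-coded family-history indicator (missing entries dropped).
observed = [0, 0, 0, 0, 1, 1, 0, 1, 0]

# With the majority class covering more than half the values, the median of
# a 0/1 variable is that majority value, i.e., the mode.
fill_value = median(observed)
print(fill_value, mode(observed))  # both give the majority category, 0
```

The equivalence breaks only in the exact 50/50 case, where the median of a 0/1 sample is 0.5; with <1% missingness this edge case has negligible practical effect.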
Reviewer 2 Report
Comments and Suggestions for AuthorsI read with interest the work from Sydorchuck et al. In their work, the authors used machine learning models to identify variables to predict COVID-19 severity. This is a rigorous work that seeks to identify variables as predictors of severity for COVID-19. The paper is well-written and scientifically sound. I have a series of comments for the authors to consider in a further iteration of their work:
The abstract needs some work. As currently written, it launches straight into the methodology; some context and the objective of the work should be included.
The introduction touches on some important topics but the last paragraph lacks a formal description of the objectives of the research.
methods:
I have a series of questions regarding the cohort:
- line 114: the authors mention study participants with confirmed COVID-19. Was that confirmation done by PCR? Antigen testing?
- What are the inclusion criteria for their groups (mild and moderate-to-severe)?
- The paragraph on lines 118-138 presents results and should not be included in the methods.
This reviewer is confused whether the authors are dealing with a binary or multiclass classification problem in Section 3 (Model development and evaluation). The authors describe their participants in two classes (lines 114-117; 197 moderate to severe, 60 mild). Later, in line 180, the authors mention the outcome variable (severity) comprised three classes.
results:
- This reviewer in particular does not appreciate the authors using the symbol ‘↔’ in their correlation analysis given that the authors are not describing a logical biconditional as the mathematical/logical function of the symbol.
- I am particularly not convinced the authors achieved dataset optimization for machine learning by excluding some highly correlated variables, given that the authors never presented and compared results from their models tested with and without the exclusion of the correlated variables.
- It is not clear to this reviewer how the authors did the correlation analysis depicted in Figure 2 if there are categorical variables (including the target). The figure not having a label does not help in further identifying the methodological approach used.
- This reviewer does not understand why the authors built the decision tree depicted in figure 4. Later in the discussion, the authors mention that they chose a decision tree for interpretability, which is true. However, some of the aspects of a ‘black box’ classifier can be post hoc analyzed with model interpreters (SHAP, LIME, etc.). Given the appropriate performance of the decision tree compared to other methods, this reviewer is convinced by the use of the tree in Figure 4. However, the authors need to justify in the results section why they chose the tree.
- The authors identified ‘depression’ (categorical variable) as an important feature in their models (high up in the tree and overall high importance in Figure 3). Let’s consider a patient with IL6 <= 54.72 and depression = 0. A leaf node is reached and the patient gets assigned a class 0. What is the clinical interpretation of such a phenomenon?
Finally, the authors did not use any external data to validate their models. While this reviewer recognizes it is not trivial to have independent observations that would fit the data preprocessing stages depicted in this work, it is important for the authors at least to recognize it as a limitation of their work.
Minor
- The title and axes of Figure 2 contain characters in Cyrillic. Please ensure English readability.
- SNP in the abbreviation list is duplicated
- In the data availability statement, the authors claim the data is contained within the article. I could not obtain it.
- What is the version of scikit-learn used?
Author Response
Comment 1: The abstract needs some work. As currently written, it launches straight into the methodology; some context and the objective of the work should be included.
Response 1: Thank you very much for your hard work and valuable comments. Information regarding the relevance and objectives of the study has been incorporated into the revised abstract.
Comment 2: The introduction touches on some important topics but the last paragraph lacks a formal description of the objectives of the research.
Response 2: We appreciate your helpful comment. The introduction has been updated to include a formal description of the research objectives. All changes in the manuscript's text are highlighted in green.
Comment 3: I have a series of questions regarding the cohort: line 114: the authors mention study participants with confirmed COVID-19. Was that confirmation done by PCR? Antigen testing?
Response 3: Thank you very much for this valuable observation. The diagnosis of COVID-19 in all study participants was confirmed by PCR testing. This information has been incorporated into the Data acquisition and preprocessing section.
Comment 4: What are the inclusion criteria for their groups (mild and moderate-to-severe)?
Response 4: Thank you very much for this remark. We have added the criteria for allocating patients into groups with mild and moderate-to-severe disease to the Data acquisition and preprocessing section.
Comment 5: The paragraph on lines 118-138 presents results and should not be included in the methods.
Response 5: We cordially appreciate your valuable comment. We opted to retain this information in the Data acquisition and preprocessing section, as we believe it contributes to a more coherent structure and improves the clarity and readability of the manuscript for the readers.
Comment 6: This reviewer is confused whether the authors are dealing with a binary or multiclass classification problem in Section 3 (Model development and evaluation). The authors describe their participants in two classes (lines 114-117; 197 moderate to severe, 60 mild). Later, in line 180, the authors mention the outcome variable (severity) comprised three classes.
Response 6: We addressed a multiclass classification task with three severity levels of COVID-19 (mild, moderate, severe). This statement was added at the beginning of the "Model development and evaluation" section. For better understanding, we clarified in several places in the article that we are dealing with a multiclass classification task.
Abstract: instead of "197 had moderate-to-severe disease and 60 presented with mild disease" (n = 257), we now state: "226 patients with confirmed COVID-19 (54 moderate, 142 severe, and 30 with mild disease). The target variable was disease severity (mild, moderate, severe)."
Materials and Methods (Data acquisition and preprocessing): we rephrased the passage and added a dataset table for clarification. Before rephrasing: "This cohort study enrolled 257 patients with confirmed COVID-19, of whom 197 had moderate-to-severe disease and 60 presented with mild disease." After rephrasing: "This cohort study enrolled 257 patients with confirmed COVID-19, of whom 197 had moderate-to-severe disease (including 55 with moderate and 142 with severe disease) and 60 presented with mild disease, resulting in three severity classes: mild (n = 60), moderate (n = 55), and severe (n = 142). The target variable was disease severity (mild, moderate, severe). … After excluding 31 incomplete records, the final dataset included 226 observations with the following class/severity distribution: mild (n = 30, class 0), moderate (n = 54, class 1), and severe (n = 142, class 2) (Table 1)." All changes are highlighted in green.
Comment 7: This reviewer in particular does not appreciate the authors using the symbol ‘↔’ in their correlation analysis given that the authors are not describing a logical biconditional as the mathematical/logical function of the symbol.
Response 7: We appreciate the reviewer's attention to detail regarding the use of the symbol ‘↔’ in the correlation analysis section. We agree that this symbol is typically reserved for logical biconditionals and was not intended to imply such in our context. To address this, we have replaced ‘↔’ with "between ... and ..." throughout the relevant paragraph in Section 3 (Results). This change enhances clarity and avoids any potential misinterpretation.
Comment 8: I am particularly not convinced the authors achieved dataset optimization for machine learning by excluding some highly correlated variables, given that the authors never presented and compared results from their models tested with and without the exclusion of the correlated variables.
Response 8: We thank the reviewer for this insightful comment and agree that empirical validation of the feature exclusion step is essential to demonstrate dataset optimization. To address this, we conducted an additional comparative analysis by evaluating the top 10 classifiers on the full dataset (prior to exclusion of highly correlated variables) versus the optimized dataset. The results, now included in an extended Table 2 and a new paragraph in Section 3 (Results), show that models on the optimized dataset achieve comparable or slightly superior performance (e.g., ExtraTreesClassifier: 0.974 ± 0.022 accuracy on optimized vs. 0.965 ± 0.028 on full), with improved stability (lower SD). This confirms that exclusion reduced redundancy and multicollinearity without negatively impacting predictive accuracy, while enhancing model interpretability—a key consideration in clinical applications. The revised manuscript includes these updates, with tracked changes for reference.
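The comparison described in this response can be sketched with synthetic data. The block below is illustrative only: it stands in for the clinical table with a generated dataset of the same sample size, appends perfectly correlated duplicate columns to mimic the "full" feature set, and compares cross-validated accuracy with and without them. The specific scores depend on the generated data and are not the study's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the clinical dataset (226 patients, 3 classes).
X, y = make_classification(n_samples=226, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

# "Full" set: append duplicated (perfectly correlated) columns.
X_full = np.hstack([X, X[:, :3]])
X_reduced = X  # "optimized" set without the redundant columns

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
full_scores = cross_val_score(clf, X_full, y, cv=5)
reduced_scores = cross_val_score(clf, X_reduced, y, cv=5)
print(f"full:    {full_scores.mean():.3f} ± {full_scores.std():.3f}")
print(f"reduced: {reduced_scores.mean():.3f} ± {reduced_scores.std():.3f}")
```

Reporting the mean and fold-to-fold standard deviation for both feature sets, as the extended Table 2 does, is what supports the claim of comparable accuracy with improved stability.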
Comment 9: It is not clear to this reviewer how the authors did the correlation analysis depicted in Figure 2 if there are categorical variables (including the target). The figure not having a label does not help in further identifying the methodological approach used.
Response 9: We appreciate the reviewer's feedback on the clarity of the correlation analysis in Figure 2, particularly regarding the handling of categorical variables and the lack of labels. To address this, we have revised the figure caption and surrounding text in Section 3 (Results) to explicitly reference the methodological approach detailed in the Methods section: Pearson’s correlation for continuous pairs, Cramér’s V for categorical pairs, and the correlation ratio (η) for mixed pairs.
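The three association measures named in this response can be implemented compactly. The sketch below is a generic implementation of Cramér's V (categorical-categorical) and the correlation ratio η (categorical-continuous); Pearson's correlation for continuous pairs is available directly as `np.corrcoef`. It is not the authors' code, only an illustration of the stated methodology.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical arrays (0 = independent, 1 = perfect)."""
    table = np.asarray(
        [[np.sum((x == a) & (y == b)) for b in np.unique(y)] for a in np.unique(x)]
    )
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

def correlation_ratio(categories, values):
    """Correlation ratio eta for a categorical vs. a continuous variable."""
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    ss_between = sum(
        len(group) * (group.mean() - grand_mean) ** 2
        for group in (values[categories == c] for c in np.unique(categories))
    )
    ss_total = ((values - grand_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total)
```

A perfectly associated categorical pair yields V = 1, and groups with no within-group variance yield η = 1, which makes the two measures easy to sanity-check on toy data.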
Comment 10: This reviewer does not understand why the authors built the decision tree depicted in figure 4. Later in the discussion, the authors mention that they chose a decision tree for interpretability, which is true. However, some of the aspects of a ‘black box’ classifier can be post hoc analyzed with model interpreters (SHAP, LIME, etc.). Given the appropriate performance of the decision tree compared to other methods, this reviewer is convinced by the use of the tree in Figure 4. However, the authors need to justify in the results section why they chose the tree.
Response 10: We thank the reviewer for this valuable comment. We agree that the rationale for presenting the decision tree in Figure 4 should be clearly described in the Results section. Accordingly, we have revised the Results section to explain that, although ensemble classifiers such as ExtraTrees and RandomForest achieved the highest predictive accuracy, we additionally present a single decision tree because it also demonstrated high accuracy (93.8%) and provided 100% class probability in several terminal nodes. Most importantly, the decision tree offers a transparent, rule-based structure that clinicians can easily interpret and apply in practice. This justification has now been added directly after Table 2 in the Results section.
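The transparency argument made here is easy to demonstrate: a fitted tree can be printed as explicit if/else rules. The sketch below uses a generic public dataset; the feature names and thresholds in the paper's Figure 4 (e.g., IL-6, depression) are specific to the authors' cohort and are not reproduced here.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Generic illustration on a public dataset, not the study's clinical data.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# export_text renders the tree as human-readable threshold rules.
rules = export_text(
    tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]
)
print(rules)
```

Unlike SHAP or LIME explanations of an ensemble, these rules are the model itself, which is the interpretability advantage the response claims for Figure 4.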
Comment 11: The authors identified ‘depression’ (categorical variable) as an important feature in their models (high up in the tree and overall high importance in Figure 3). Let’s consider a patient with IL6 <= 54.72 and depression = 0. A leaf node is reached and the patient gets assigned a class 0. What is the clinical interpretation of such a phenomenon?
Response 11: We thank the reviewer for raising this important point. We agree that the clinical meaning of the decision tree split involving depression requires clarification. We have added an explanation in the Discussion to highlight that depression is associated with systemic inflammation and altered cytokine profiles, including elevated IL-6 levels. Thus, patients without depression and with lower IL-6 values were more likely to be classified into the mild category (class 0). This reflects the combined influence of psychosomatic and immunological mechanisms rather than a direct causal link, and we now explicitly discuss this clinical interpretation.
Comment 12: Finally, the authors did not use any external data to validate their models. While this reviewer recognizes it is not trivial to have independent observations that would fit the data preprocessing stages depicted in this work, it is important for the authors at least to recognize it as a limitation of their work.
Response 12: We thank the reviewer for this important observation. We agree that the absence of external validation is a limitation of our study. We have now explicitly acknowledged this point in the Conclusions section, emphasizing that future work should include external validation on independent cohorts to confirm the generalizability of our findings.
Comment 13: The title and axes of Figure 2 contain characters in Cyrillic. Please ensure English readability.
Response 13: Thank you; this was a technical error. It has been corrected and re-checked with Grammarly.
Comment 14: SNP in the abbreviation list is duplicated.
Response 14: Thank you for your attention; this was a technical error, which has now been corrected.
Comment 15: In the data availability statement, the authors claim the data is contained within the article. I could not obtain it.
Response 15: The statement was changed to "The data presented in this study is available on request from the corresponding author."
Comment 16: What is the version of scikit-learn used?
Response 16: We sincerely appreciate the reviewer for this remark. We have now specified the exact version of the scikit-learn library used in our analyses. This information has been added to the Materials and Methods section (Model development and evaluation).
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have revised their paper as per most of my comments from the previous review round. I do not have any additional comments at this point.
Author Response
We express our sincerest gratitude for your incredible work to improve our manuscript. Thank you very much!
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the work in providing the revised manuscript; it definitely improved the quality of the submission. While most of the queries I initially proposed to the authors have been satisfactorily answered, I have some follow-up questions on how the authors handled my original comments with the following numbers (first round of reviews):
Comment 6: I appreciate the willingness of the authors to revise their manuscript with regard to clarifying the binary/multiclass problem. Now it is clear for readers that the work is multiclass. However, I did not find the clinical parameters that define the 'severe' class in the 'Data acquisition and preprocessing' section.
Comment 11: I appreciate the authors' further effort to clarify the variable 'depression' and its relationship with assigning a class. The conservative claims the authors make about correlation rather than causation are appropriate. However, the authors mention in the text: 'Depression is known to be associated with chronic systemic inflammation and dysregulation of cytokine pathways, including elevated IL-6 levels reported in prior studies'. I could not find the references to such studies.
Comment 12: I appreciate the authors recognizing the lack of external data as a limitation of their work. A practical tip for the authors would be not to place the limitations paragraph as the very last paragraph of their work. This reviewer sees this as detrimental to the hard work employed in this manuscript.
Author Response
We sincerely appreciate the reviewer for valuable comments and remarks.
Comment 6: I appreciate the willingness of the authors to revise their manuscript with regard to clarifying the binary/multiclass problem. Now it is clear for readers that the work is multiclass. However, I did not find the clinical parameters that define the 'severe' class in the 'Data acquisition and preprocessing' section.
Response 6: Thank you very much for this important observation. The clinical parameters delineating the moderate and severe categories were incorporated into the “Data acquisition and preprocessing” section (highlighted in blue):
Moderate COVID-19 was defined by the presence of fever, dry cough, dyspnea, and tachypnea (20–30 breaths/min) without clinical signs of severe pneumonia, including oxygen saturation (SpO₂) ≥90%. Severe disease was characterized by the above-mentioned clinical manifestations in combination with at least one of the following: respiratory rate ≥30 breaths/min, blood oxygen saturation ≤90%, PaO₂/FiO₂ ratio <300, or pulmonary infiltrates involving >50% of lung fields. In all cases, pneumonia was confirmed by instrumental diagnostic methods, including computed tomography (CT) or chest radiography.
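For illustration, the quoted criteria can be encoded as a simple rule. The sketch below is an editorial illustration, not the authors' code; note that the quoted definitions overlap at exactly SpO2 = 90% (moderate requires SpO2 ≥ 90%, severe includes SpO2 ≤ 90%), and this sketch resolves the boundary in favor of 'severe'.

```python
def classify_pneumonia_severity(resp_rate, spo2, pf_ratio, infiltrate_pct):
    """Sketch of the moderate/severe criteria quoted above.

    Assumes the clinical manifestations common to both categories (fever,
    dry cough, dyspnea, tachypnea) are already present. The SpO2 = 90%
    boundary is resolved here as 'severe'.
    """
    severe_signs = (
        resp_rate >= 30          # respiratory rate >= 30 breaths/min
        or spo2 <= 90            # blood oxygen saturation <= 90%
        or pf_ratio < 300        # PaO2/FiO2 ratio < 300
        or infiltrate_pct > 50   # infiltrates in > 50% of lung fields
    )
    return "severe" if severe_signs else "moderate"

print(classify_pneumonia_severity(24, 94, 320, 30))  # no severe sign -> moderate
```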
Comment 11: I appreciate the authors' further effort to clarify the variable 'depression' and its relationship with assigning a class. The conservative claims the authors make about correlation rather than causation are appropriate. However, the authors mention in the text: 'Depression is known to be associated with chronic systemic inflammation and dysregulation of cytokine pathways, including elevated IL-6 levels reported in prior studies'. I could not find the references to such studies.
Response 11: References [36, 37] to relevant studies were incorporated to substantiate this information (highlighted in blue).
Comment 12: I appreciate the authors recognizing the lack of external data as a limitation of their work. A practical tip for the authors would be not to place the limitations paragraph as the very last paragraph of their work. This reviewer sees this as detrimental to the hard work employed in this manuscript.
Response 12: We are very grateful for the constructive practical recommendation. Accordingly, the paragraph addressing study limitations was relocated to the Discussion section (highlighted in blue).