Machine Learning Approaches to Identify Patient Comorbidities and Symptoms That Increased Risk of Mortality in COVID-19

Providing appropriate care for people suffering from COVID-19, the disease caused by the pandemic SARS-CoV-2 virus, is a significant global challenge. Many individuals who become infected have pre-existing conditions that may interact with COVID-19 to increase symptom severity and mortality risk. COVID-19 patient comorbidities are therefore likely to be informative regarding the individual risk of severe illness and mortality. Determining the degree to which comorbidities are associated with severe symptoms and mortality would thus greatly assist in COVID-19 care planning and provision. To assess this, we performed a meta-analysis of published global literature and a machine learning predictive analysis using an aggregated COVID-19 global dataset. Our meta-analysis identified chronic obstructive pulmonary disease (COPD), cerebrovascular disease (CEVD), cardiovascular disease (CVD), type 2 diabetes, malignancy, and hypertension as most significantly associated with COVID-19 severity in the current published literature. Machine learning classification using novel aggregated cohort data similarly found COPD, CVD, chronic kidney disease (CKD), type 2 diabetes, malignancy, and hypertension, as well as asthma, to be the most significant features for classifying those deceased versus those who survived COVID-19. While age and gender were the most significant predictors of mortality, among symptom–comorbidity combinations, Pneumonia–Hypertension, Pneumonia–Diabetes, and Acute Respiratory Distress Syndrome (ARDS)–Hypertension showed the most significant associations with COVID-19 mortality. These results highlight the patient cohorts most likely to be at risk of COVID-19-related severe morbidity and mortality, which has implications for prioritization of hospital resources.


Introduction
As of the end of May 2020, over 6 million cases of SARS-CoV-2 infection had been confirmed globally, and over 3,696,000 deaths had been attributed to the associated disease, COVID-19 [1]. Asymptomatic human-to-human spread remains a challenging aspect of the viral containment effort, unlike with the previous pandemic coronaviruses SARS and MERS, which showed co-occurrence of symptoms with infectiousness [2]. COVID-19 epidemiological data suggest that elderly people are most at risk of developing severe symptoms [3], although severe symptoms and mortality occur in all age groups. The more prominent symptoms include high fever, cough and sputum production, headache, hemoptysis, and diarrhea, and as the infection worsens, an acute respiratory distress syndrome can develop that requires intensive care management. Identifying those most at risk of severe symptoms and death remains a research priority to aid early and appropriate allocation of resources and targeted patient management. As more population data are released, predictive analytic methods may be able to provide such information for patients based on their clinical characteristics.
Reports are emerging that many of the patients most affected by COVID-19 also present with significant comorbidities. A recent study by Richardson et al. [4] describing 5,700 confirmed COVID-19 cases reported that many of these patients were suffering from hypertension (56.6%), obesity (41.7%), or type 2 diabetes (33.8%) at the time of their infection; these rates are greater than the respective prevalence of each condition in the population, which suggests a link to SARS-CoV-2 effects on metabolic and vascular systems. This indicates that the comorbidities an individual has may provide crucial prognostic information if SARS-CoV-2 infection co-occurs. There are also data emerging that suggest significant heterogeneity in disease presentation. Xu et al. [5] described clinical characteristics (including laboratory and chest radiography data) from 62 Chinese COVID-19 patients that differed from those described by Guan et al. in another Chinese region [6]. The reasons for this heterogeneity in presentation remain unclear, but the relative incidence of comorbidities (and other clinical features) in different patient cohorts provides one possible explanation. The nature and strength of comorbidity interaction with COVID-19 may also provide important clues to the mechanisms of their interaction and how this may be countered.
To address these issues, we used three approaches to analyze the currently available clinical information. Firstly, we conducted a meta-analysis of available retrospective cohort studies of COVID-19 patient data that focused on comorbidity and selected clinical features. Secondly, we obtained and aggregated a novel COVID-19 dataset of 481,289 patients from across 141 different countries [7], and identified significant comorbidity associations. Thirdly, we applied machine learning algorithms to this novel aggregated data to identify the comorbidities that best classify mortality. These three approaches enabled us to thoroughly assess the comorbidities and clinical features most significantly associated with mortality in COVID-19 patients.

Meta-analysis of published clinical reports of COVID-19 disease
Initially, our meta-analysis search terms identified a total of 195 relevant articles. From these articles, we excluded 96 duplicate references and considered the remaining 99. By careful screening of the title and abstract, we excluded 34 articles based on the exclusion criteria given in the Methods (e.g., we did not include case reports or review reports), and we only considered full-text papers that examined comorbidity and clinical symptoms in COVID-19 patients; these are listed in Table 1. Finally, for the remaining articles, we reviewed the full text and removed a further 36 studies because they were either reviews or editorials lacking clinical details. Twenty-six articles eventually met the inclusion criteria for our meta-analysis. A flow diagram of the literature screening is shown in Figure 1.
A total of 13,400 COVID-19 patients from twenty-six studies [4][5][6] were thus included in our meta-analysis. Most of the studies were conducted in China (24), one was from the USA, and another was from Italy. The mean age of the full sample was 54.5 years, with 8,149 (60.81%) males and 39.19% females (Table 1). Of these, 2,964 patients (22.11%) developed a severe condition, were admitted to the ICU, or died (Table 1). Note that, for calculating prevalence, we considered the full data set from all 26 publications. However, because many publications did not stratify patients by degree of severity, we considered only 11 publications in the analysis assessing the effect of symptoms and comorbidities on COVID-19 disease severity or death.

Publication bias
In parallel to the meta-analysis of data, we also conducted an analysis of publication bias for all symptoms and comorbidities. Table 4 shows the results of possible publication biases, which were assessed using funnel plots and Egger's test (for details, see Supplementary Figure 3). The results of Egger's test (p > 0.05) suggest that, except for the symptom of anorexia, there were no significant publication biases in the variables analyzed.
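Egger's test is a regression check for funnel-plot asymmetry. The following is a minimal sketch in standard-library Python, not the analysis code used in this study (the statistical analysis was performed in R); the function name `egger_test` and the example effect sizes are hypothetical. Each study's standardized effect (effect divided by its standard error) is regressed on its precision (the reciprocal of the standard error), and an intercept far from zero suggests small-study effects such as publication bias.

```python
import math


def egger_test(effects, std_errors):
    """Egger's regression test for funnel-plot asymmetry.

    Regresses each study's standardized effect (effect / SE) on its
    precision (1 / SE). Returns (intercept, se_intercept, t_stat, df).
    """
    n = len(effects)
    if n < 3:
        raise ValueError("Egger's test needs at least three studies")
    x = [1.0 / se for se in std_errors]                  # precision
    y = [e / se for e, se in zip(effects, std_errors)]   # standardized effect
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance of the ordinary least-squares fit.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    df = n - 2
    se_intercept = math.sqrt(rss / df * (1.0 / n + mean_x ** 2 / sxx))
    t_stat = intercept / se_intercept if se_intercept > 0 else float("nan")
    return intercept, se_intercept, t_stat, df


# Hypothetical log odds ratios and standard errors for five studies.
log_ors = [0.90, 0.75, 1.10, 0.60, 1.30]
ses = [0.20, 0.25, 0.35, 0.15, 0.50]
intercept, se_int, t_stat, df = egger_test(log_ors, ses)
print(f"intercept={intercept:.3f}, t={t_stat:.3f}, df={df}")
```

The p-value follows from comparing the t statistic with a t distribution on the returned degrees of freedom; that step is left out here because it needs a statistics library.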

Clinical characteristics of patients in aggregated recently generated COVID-19 patient datasets
Following our meta-analysis of the published literature, we also sought to assess recent COVID-19 clinical case data available from open-source online repositories; this allowed us to apply additional novel predictive machine learning methods to COVID-19 data, complementing our meta-analysis of the published literature. Data were obtained from two different large data repositories and processed as detailed in the Methods section. After filtering to include only cases with sufficiently detailed clinical information, as well as case mortality information, we obtained a total of 1,143 patient cases for analysis. Table 5 displays summary statistics of these 1,143 patients stratified by survival/mortality outcomes. The analysis found that of the 1,143 patients, 86.61% had no comorbidities, whereas 5.34% and 7.87% of patients had exactly one or more than one comorbidity, respectively. The most common coexisting comorbidities were hypertension (8.66%), diabetes (7.44%), cardiovascular disease (3.5%), and kidney disease (1.75%). In contrast, malignancy of any kind (0.87%), asthma (0.87%), COPD (0.61%), chronic lung disease (0.61%), cerebrovascular disease (0.44%), surgical history (0.26%), neurodegenerative disease (0.17%), infectious disease (0.17%), and liver disease (0.17%) were far less likely to co-occur with COVID-19 in this dataset. Analysis of clinical symptomatology found that the most common clinical presentation of patients with COVID-19 was fever (14.17%), followed by cough (12.42%), pneumonia (6.47%), acute respiratory distress syndrome (5.69%), dyspnea (3.06%), fatigue (2.19%), septic shock (1.49%), headache (0.96%), myalgia (0.79%), diarrhea (0.61%), and nausea (0.26%).
Table 5 also shows the status of patients who were deceased. The selected 1,143 patients included 319 (27.91%) deceased, of whom 32.60% were female and 61.76% were male. The median age of the deceased patients was 51 years (IQR 36 to 66 years). A majority of the deceased patients (67.08%) had no comorbidities in this dataset; 10.97% had one comorbidity, while 21.94% had more than one. In the deceased patient subgroup, the rate of comorbidities was significantly higher than in patients who survived. The comorbidities most frequently seen in COVID-19 patients who did not survive their infection included type 2 diabetes (19.12%), cardiovascular disease (6.27%), and kidney disease (4.08%). However, while the other comorbidities we studied (see Table 5) were less frequently observed in COVID-19 patients, when they did co-occur, they did so only in patients who had died (Table 5). Descriptive analysis of the symptoms in the deceased COVID-19 patients found that the most frequent symptoms were pneumonia (21.32%), fever (12.85%), cough (11.60%), acute respiratory distress syndrome (9.72%), and septic shock (4.70%) (Table 5).

Supervised machine learning identifies the most significant COVID-19 comorbidities
To predict significant COVID-19 comorbidities, and to compare with our meta-analysis of the published literature, we designed and performed a machine learning analysis of our 1,143-patient dataset. We applied six different machine learning algorithms (Random Forest, Decision Tree, GBM, XGB, SVM, and LGBM) to identify the best predictors of COVID-19 patient mortality among the comorbidities and symptoms. We achieved a classification accuracy of > 80% with all six approaches for comorbidities and mortality; specifically, 83% for Decision Tree, 84% for GBM, 86% for XGB, 87% for Random Forest and SVM, and 88% for LGBM. These methods also achieved accuracy of > 85% for symptoms with all six approaches, with GBM and LGBM showing 90% accuracy. Accuracy metrics, including precision, recall (sensitivity), F1 score, area under the ROC curve (AUC), and log loss values, are shown in Supplementary Table 1 for symptoms data and in Supplementary Table 2 for comorbidity data. The coefficient values for the symptom features are given in Supplementary Table 3, and those for the comorbidity features in Supplementary Table 4. Our results indicate that age and gender are the most significant predictors of mortality. We compared the most significant symptom and comorbidity features found by the different algorithms and obtained similar predictions. Figure 2 presents the significance level for symptoms and diseases. After calculating the coefficient values for every algorithm, we placed the symptoms and diseases on the same scale by quantile normalization and plotted the average normalized values in Figure 2.
The most significant symptoms were pneumonia, acute respiratory distress syndrome (ARDS), dyspnea, fever and cough (Supplementary Table 3) and the most significant comorbidities found were hypertension, diabetes and metabolic diseases, chronic kidney disease, cardiovascular disease, chronic obstructive pulmonary disease (COPD), asthma and malignancy in this cohort (Supplementary Table 4).
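The step of placing per-algorithm coefficient values on a common scale by quantile normalization and then averaging them can be sketched as follows. The text does not give the exact procedure, so this sketch assumes the standard form of quantile normalization (each algorithm's sorted scores are replaced by the rank-wise mean across algorithms, with ties broken by feature order); the function names and the importance values are hypothetical.

```python
def quantile_normalize(scores):
    """scores maps algorithm name -> {feature: raw importance}.

    Returns the same structure with every algorithm's values forced onto
    a shared distribution (the rank-wise mean across algorithms).
    """
    algorithms = list(scores)
    features = list(scores[algorithms[0]])
    # Sort each algorithm's values, then average across algorithms per rank.
    sorted_cols = [sorted(scores[a][f] for f in features) for a in algorithms]
    rank_means = [
        sum(col[r] for col in sorted_cols) / len(algorithms)
        for r in range(len(features))
    ]
    normalized = {}
    for a in algorithms:
        # Features ordered from lowest to highest raw score for this algorithm.
        order = sorted(features, key=lambda f: scores[a][f])
        normalized[a] = {f: rank_means[r] for r, f in enumerate(order)}
    return normalized


def average_importance(normalized):
    """Average the normalized score of each feature across algorithms."""
    algorithms = list(normalized)
    features = list(normalized[algorithms[0]])
    return {
        f: sum(normalized[a][f] for a in algorithms) / len(algorithms)
        for f in features
    }


# Hypothetical raw importances from two algorithms on different scales.
raw = {
    "random_forest": {"pneumonia": 0.40, "fever": 0.25, "cough": 0.10},
    "gbm": {"pneumonia": 120.0, "fever": 30.0, "cough": 45.0},
}
for feature, value in average_importance(quantile_normalize(raw)).items():
    print(feature, round(value, 3))
```

Because only ranks survive normalization, an algorithm that reports importances on a much larger numeric scale does not dominate the average.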

Significant pairs of interacting comorbidities and symptoms associated with death in COVID-19
One of the unique findings of this study is the identification of significant pairs of comorbidities and symptoms that are associated with death among COVID-19 patients. To identify symptom-comorbidity interactions, we applied Fisher's exact test. The negative logarithms of the p-values obtained from the tests are presented in Figure 3. We observed that the symptom-comorbidity combinations of Pneumonia-Hypertension, Pneumonia-Diabetes, and ARDS-Hypertension had the most significant effects on mortality in COVID-19 patients (Figure 3).
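As an illustration of how one symptom-comorbidity pair could be scored, the sketch below computes a two-sided Fisher's exact test for a single 2x2 table and reports the negative base-10 logarithm of the p-value. It is a standard-library Python sketch under stated assumptions, not the study's R code: the table layout (pair present or absent against deceased or recovered), the base of the logarithm, the function name, and the counts are all hypothetical.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = math.comb(row1 + row2, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: rows = pair present / absent,
# columns = deceased / recovered.
p_value = fisher_exact_two_sided(12, 3, 20, 40)
print(f"p = {p_value:.4g}, -log10(p) = {-math.log10(p_value):.2f}")
```

Repeating this over every symptom-comorbidity pair yields one score per pair, with larger values marking stronger associations with death.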
Taken together, these data provide a comprehensive analysis of the current published literature, as well as a novel machine learning classification analysis using recently aggregated data to identify significant comorbidities and symptom relationships relating to death from COVID-19 disease.

Discussion
The recent and continuing spread of SARS-CoV-2 has vastly outpaced the ability of many public health care systems around the world to respond and manage. There are many examples from even advanced economies where medical professionals have had to make distressing decisions about prioritization of insufficient care resources. This highlights the critical need for fast and accurate classification of those patients most at risk of severe disease or fatality to best allocate hospital resources during times of crisis.
To this end, we have performed a number of analyses to assess how disease outcome is related to a range of patient comorbidities and clinical features. Firstly, we investigated published COVID-19 clinical data using a conventional meta-analysis. We found almost no evidence of publication bias in these data, and few grey literature sources of use to our study. This may reflect the current strong imperative to rapidly publish any available studies. Our meta-analysis identified COPD, CEVD, CVD, diabetes, malignancy, and hypertension as most significantly associated with COVID-19 severity in the current published literature.
We also obtained and analyzed aggregated COVID-19 patient data (not derived from published clinical trials or retrospective studies) using statistical and machine learning methods. We found that patients most at risk of dying from COVID-19 had particular comorbidities and patient features, most of which were also seen in our meta-analysis. Our machine learning analysis of this patient dataset for the classification of deceased versus recovered COVID-19 patients identified COPD, CVD, CKD, diabetes, malignancy, hypertension, and asthma as most significant. These results provide detailed insights into the strength of the relationship between these factors and patients' risk of dying from COVID-19, identifying prognostic factors by largely independent means. This may lead to identification of disease mechanisms of interest by considering pathways that may be common to these comorbidities. Such considerations have already been made, with several studies reporting strong evidence for a link between SARS-CoV-2 actions and vascular damage [41]. Further, given that the angiotensin-converting enzyme 2 (ACE-2) receptor is used by the virus for entry into host cells, it has been suggested that the already strained ACE-2-Ang-(1-7)-Mas axis in metabolic disorders may result in respiratory compromise [42]. The role of upregulation of ACE-2 receptors by ACE inhibitors and angiotensin II receptor blockers used in the management of hypertension, diabetes, and CKD [43] also requires further exploration in elucidating the metabolic pathways that underpin the relationship between these comorbidities and increased SARS-CoV-2-related severe morbidity and mortality.
It is likely that many different interacting factors make the co-incidence of COVID-19 and comorbidities greatly detrimental to patient outcome. Using machine learning classification methods, we found that age and gender are the most significant predictors of COVID-19 mortality. Indeed, it is likely that in many cohorts, age is strongly associated with the co-occurrence of significant comorbidities, as these tend to be age-related diseases. Nevertheless, comorbidities analyzed here such as diabetes, hypertension, and asthma do occur across age categories, suggesting that mortality in COVID-19 is affected by other characteristics yet to be identified; differences in environment and/or genetic predisposition are likely relevant factors for future consideration.
Mechanistically, the association between lung-related comorbidities such as COPD and COVID-19 disease severity is an expected outcome of this study. COPD is a chronic lung condition, often caused by a patient history of smoking. Patients with COPD present with pulmonary damage and chronic breathing difficulty; thus, the co-occurrence of a severe lower respiratory viral infection and pneumonia is a significant challenge, particularly in the elderly. In contrast, the association of severe COVID-19 disease with conditions such as vascular diseases (CVD, CEVD) and diabetes is perhaps more complex. Data are emerging, however, that suggest SARS-CoV-2 infection is associated with a severe inflammatory storm that can result in vascular inflammation, as well as myocarditis. Thus, cardiovascular and metabolic diseases are likely compounding the impact of COVID-19, perhaps presenting a therapeutic opportunity for broad-spectrum anti-inflammatory medications, although data on efficacy remain to be acquired.
An important consideration remains the limitations of the available data for predictive analyses. COVID-19 remains a relatively recent phenomenon, and thus the data may contain biases that cannot as yet be circumvented. For example, the fact that the majority of data come from mainland China introduces biases related to population genetics as well as environmental effects that will not be observed in similar European datasets. Nevertheless, our cohort data from 1,143 patients come from repository data acquired from across 141 countries; thus, systematic biases of this kind should be minimal. Additionally, however, there may be unidentified reporting biases in global hospital data due to severe under-resourcing and staff shortages in some locations, necessitating priority reporting. Over the coming months, more data will become available from more diverse nations and population groups that will enable fuller investigation of these issues.

Conclusion
In summary, we have performed a comprehensive meta-analysis of available published literature, as well as a novel machine learning analysis of a separate cohort of COVID-19 patients. We identified significant comorbidities and COVID-19 patient symptoms that are important to consider when assessing patient needs, something that remains critical at a time when hospitals are often understaffed and under-resourced. The data suggest that the comorbidities most implicated in severe COVID-19 are lung-related, such as COPD and asthma, as well as vascular-related conditions, such as CVD and CEVD. Thus, it is critical that at-risk populations be prioritized in efforts around social isolation and resource allocation during this pandemic. As data continue to be accrued, it will become possible to answer questions regarding gender and age-related comorbidity relationships, including medication history as well as population genetics and environmental effects, that may be relevant to treatment optimization.

Methods
This study has two parts: (i) a meta-analysis of previously published literature, and (ii) a machine learning analysis of patient-level cohort data.

Meta-analysis of published data

Search strategy and study selection
Potential and relevant studies were extracted by conducting a systematic search of the PubMed (Medline), Springer, Web of Science, EMBASE, and Cochrane Library databases for the period from January 1, 2019, to April 20, 2020. The following keywords were used for database screening: '2019-nCoV', '2019 novel coronavirus', 'COVID-19', 'clinical characteristics and symptoms of coronavirus'. Databases were also searched using comorbidity combinations for all comorbidities studied, with the following structure: "COVID-19 and diabetes", "COVID-19 and hypertension", "COVID-19 and COPD" and related terms. The reference lists of the selected articles were manually screened to identify missing studies. All articles selected for the meta-analysis were written in English, and all search procedures were independently performed by two investigators (MMA and AT).
For this study, articles that described the clinical characteristics of COVID-19 patients were included, particularly symptoms and comorbidities, along with their prevalence and specific information on the distribution of patients on the basis of severity. Key exclusion criteria were: (a) duplicate publications; (b) case reports, reviews, editorials, and letters; and (c) studies that failed to provide sufficient information on clinical patient characteristics, as judged by the two investigators.

Data extraction for statistical analysis
The two investigators who performed the literature screening also extracted the data independently from the selected studies. Differences in the chosen literature were reconciled by discussion and screening by a third investigator (MAM). We extracted the following variables: first author name, year of publication, number of patients, age, sex, number of patients suffering severe disease (note that patients were not stratified based on the degree of comorbidity severity or symptom severity), number of non-severe patients where these were reported, patient survival, patients needing intensive care unit (ICU) support, and the prevalence of multiple symptoms and comorbidities. The definition of 'severe' was clearly described in some articles but not in all. We maintained the case definitions as defined by the original authors. Odds ratios (ORs) were calculated to compare the occurrence of clinical symptoms and comorbidities in severe patients with that in non-severe patients. The degree of variability across studies (heterogeneity) was assessed by I² and Cochran's Q test [11]. Because heterogeneity was present across studies, random-effects models were used to estimate the average effect of each variable along with its precision, which can provide a more accurate estimate of the 95% confidence interval (CI).
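The pooling steps described above (per-study odds ratios, Cochran's Q, I², and a random-effects estimate with a 95% CI) can be sketched as follows. The text does not name the between-study variance estimator, so this sketch assumes the commonly used DerSimonian-Laird estimator and a 0.5 continuity correction for zero cells; the function names and the study counts are hypothetical, and the code is standard-library Python rather than the R code used for the analysis.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log OR and its variance for one study.

    a, b: severe patients with / without the feature;
    c, d: non-severe patients with / without the feature.
    """
    if 0 in (a, b, c, d):
        # Assumed continuity correction so the ratio stays defined.
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def random_effects(log_ors, variances):
    """DerSimonian-Laird random-effects pooling of log odds ratios."""
    k = len(log_ors)
    w = [1 / v for v in variances]                      # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, log_ors)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_ors))  # Cochran's Q
    df = k - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # I² in percent
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0      # between-study variance
    w_star = [1 / (v + tau2) for v in variances]         # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, log_ors)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return {
        "or": math.exp(pooled),
        "ci95": (math.exp(pooled - 1.96 * se), math.exp(pooled + 1.96 * se)),
        "q": q,
        "i2": i2,
        "tau2": tau2,
    }


# Hypothetical counts for one comorbidity in three studies.
studies = [(30, 70, 20, 180), (12, 28, 15, 95), (45, 60, 40, 210)]
ys, vs = zip(*(log_odds_ratio(*s) for s in studies))
print(random_effects(ys, vs))
```

When the studies agree closely, Q falls below its degrees of freedom, the between-study variance is truncated to zero, and the result coincides with the fixed-effect estimate.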

Data collection
We obtained publicly available anonymized clinical data derived from both non-hospitalized and hospitalized COVID-19-positive patients; patient diagnoses were based on WHO guidelines [12]. The cases were captured between February 14, 2020, and April 31, 2020. Real-time data were collected from open-source COVID-19 data repositories [13,14]. The data obtained came from a total of 481,289 individual patient clinical records from 141 countries. Summary descriptive statistics for these clinical data are shown in Table 5. The clinical attributes collected included clinical symptoms and signs, details of any comorbidities, date of admission to hospital, date of confirmation of COVID-19, date of death or hospital release, details of other associated disease outcomes, as well as demographic data; the latter included age, gender, travel history, and location (e.g., city, province, and country) of the patient. From these data, we filtered for select criteria, e.g., patients who were deceased or who had recovered and been released from hospital. We also excluded patients for whom data relating to mortality or recovery from infection were not included. The final filtered dataset included 1,143 COVID-19 patients with detailed clinical information, of whom 319 were reported as deceased and 824 as recovered.

Selection of significant variables
The focus of this study was to analyze the mortality and survival rates in our filtered 1,143-patient dataset and to relate these rates to comorbidity incidences. We thus considered respondent age (continuous), sex (male, female), travel history, and the commonly occurring comorbidities, both individually and occurring in multiples. The comorbidities studied included cardiovascular disease (CVD), chronic obstructive pulmonary disease (COPD), cerebrovascular disease (CEVD), chronic kidney disease (CKD), chronic lung disease (CLD), neurodegenerative disease, hypertension, diabetes (type 2), malignancies, infectious diseases, surgical history, asthma, and liver disease. Additionally, we included several clinical symptoms for analysis, including the incidence of fever, cough, pneumonia, acute respiratory distress syndrome (ARDS), dyspnea, fatigue, septic shock, headache, myalgia, diarrhea, and nausea, in order to support prediction at an early stage and to identify their relationship with severity or death. We assessed the influence of these variables on the probability of mortality among patients with a positive diagnosis of SARS-CoV-2 infection.

Statistical analysis
Continuous variables were summarized by the median along with the interquartile range (IQR), and compared using the Mann-Whitney U test [15]. The frequencies of categorical variables were presented as percentages and compared with a chi-square test [16]. Moreover, Fisher's exact test [17] was applied to low-frequency cells. A two-sided α (type-I error) less than 0.05 was considered as a measure of statistical significance. All statistical analysis was performed in the R statistical computing environment (version 3.6.1).
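For readers who want to see the two main comparisons spelled out, the following standard-library Python sketch implements a Mann-Whitney U test and a chi-square test for a 2x2 table. It is illustrative only and is not the R code used for the analysis. Both functions rely on simplifying assumptions that should be noted: the Mann-Whitney p-value uses the normal approximation without tie or continuity correction, and the chi-square test omits Yates's correction. The function names and all data values are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for sample x and a two-sided normal-approximation p-value."""
    combined = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1      # tied values share the mean rank
        i = j + 1
    n1, n2 = len(x), len(y)
    r1 = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for observed, r, col in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        expected = rows[r] * cols[col] / n
        chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom the survival function reduces to erfc.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical ages of deceased and recovered patients.
u_stat, p_age = mann_whitney_u([51, 66, 70, 58, 49], [36, 42, 29, 55, 40])
# Hypothetical counts: comorbidity present / absent by deceased / recovered.
chi2_stat, p_comorbidity = chi_square_2x2(25, 15, 30, 70)
print(p_age < 0.05, p_comorbidity < 0.05)
```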

Machine learning algorithms
In this study, six supervised machine learning algorithms were applied to identify the minimum number of symptoms and comorbidities that were predictive of COVID-19 mortality. These algorithms were Random Forest, Decision Tree, Gradient Boosting Machine (GBM), XGBoost (XGB), Support Vector Machine (SVM), and Light Gradient Boosting Machine (LGBM). We extracted the required variables from the raw data, and then performed data cleaning and scaling to pre-process the collected data. Imputation techniques were used to address the missing (2.2%) age and gender values; in particular, missing ages were imputed using random values selected from the age IQR, and gender was imputed randomly according to the male and female ratios present in the full dataset. The data were randomly split into training (80% of individuals) and testing (20% of individuals) sets to perform machine learning prediction and validation. To measure accuracy, several measures were employed, including precision, recall (sensitivity), F1 score, area under the receiver operating characteristic (ROC) curve (AUC), and log loss. After achieving high accuracy with the model training, we extracted the symptom and comorbidity features with the highest impact on classifying mortality in patients with a positive COVID-19 infection.
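The pre-processing and evaluation steps around the classifiers can be sketched as follows. This is a standard-library Python illustration under stated assumptions, not the pipeline used in the study: drawing imputed ages uniformly between the first and third quartiles is one reading of "random values selected from the age IQR", the record fields (`age`, `gender`) and function names are hypothetical, the classifier itself is left out (any of the six algorithms would be trained on the training split), and AUC is omitted for brevity.

```python
import math
import random
import statistics


def impute(records, rng):
    """Fill missing age from the observed IQR and missing gender by observed ratio."""
    ages = [r["age"] for r in records if r["age"] is not None]
    q1, _, q3 = statistics.quantiles(ages, n=4)
    genders = [r["gender"] for r in records if r["gender"] is not None]
    levels = sorted(set(genders))
    weights = [genders.count(g) for g in levels]
    filled = []
    for record in records:
        record = dict(record)               # leave the input untouched
        if record["age"] is None:
            record["age"] = rng.uniform(q1, q3)
        if record["gender"] is None:
            record["gender"] = rng.choices(levels, weights=weights)[0]
        filled.append(record)
    return filled


def train_test_split(records, rng, test_fraction=0.2):
    """Random 80/20 split by default; returns (train, test)."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def classification_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall, F1, and log loss for binary labels (1 = deceased)."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    eps = 1e-15                             # keep log() away from 0 and 1
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "log_loss": log_loss}
```

Passing an explicitly seeded `random.Random` instance as `rng` keeps the imputation and the split reproducible across runs.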

Figure 1: Flow diagram of literature search for including studies in meta-analysis

CVD = Cardiovascular disease; COPD = Chronic obstructive pulmonary disease; CEVD = Cerebrovascular disease; CKD = Chronic kidney disease; CLD = Chronic lung disease; ARDS = Acute respiratory distress syndrome
Figure 2: Meta-analysis of severity of comorbidities and symptoms in COVID-19 fatalities
Supplementary Table 5: Assessing association between comorbidity and symptoms using Fisher's exact test of deceased patients