Assessing Smoking Status and Risk of SARS-CoV-2 Infection: A Machine Learning Approach among Veterans

The role of smoking in the risk of SARS-CoV-2 infection is unclear. We used a retrospective cohort design and data from veterans’ Electronic Medical Records to assess the impact of smoking on the risk of SARS-CoV-2 infection. Veterans tested for the SARS-CoV-2 virus from 02/01/2020 to 02/28/2021 were classified as Never Smokers (NS), Former Smokers (FS), or Current Smokers (CS). We report odds ratios adjusted (aOR) for potential confounders identified by a cascade machine learning algorithm. We found a 19.6% positivity rate among 1,176,306 veterans tested for SARS-CoV-2 infection. The positivity proportion among NS (22.0%) was higher than among FS (19.2%) and CS (11.5%). The adjusted odds of testing positive for CS (aOR: 0.51; 95% CI: 0.50, 0.52) and FS (aOR: 0.89; 95% CI: 0.88, 0.90) were significantly lower compared with NS. Four pre-existing conditions (dementia, lower respiratory infections, pneumonia, and septic shock) were associated with a higher risk of testing positive, whereas use of the decongestant drug phenylephrine and a history of cancer were associated with a lower risk. CS and FS had lower risks of testing positive for SARS-CoV-2 than NS. These findings highlight our evolving understanding of the role of smoking status in the risk of SARS-CoV-2 infection.


Introduction
As of 1 July 2022, approximately 548 million cases and approximately 6.3 million deaths worldwide have been attributed to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections [1]. Our current understanding is that COVID-19 is a multifactorial disease caused by the SARS-CoV-2 virus, in which infection and morbidity depend on multiple factors [2]. Smoking likely plays an important role in the COVID-19 pandemic, but exactly what that role is remains unclear.

SARS-CoV-2 Test Positivity, Smoking, and Other Variables
The primary outcome, SARS-CoV-2 test positivity, was determined using a reverse transcription polymerase chain reaction (RT-PCR) SARS-CoV-2 test. We used the first date of a positive test to construct the cohort and, therefore, included only the first positive test result. The main exposure, smoking status, was obtained from the VHA Electronic Medical Record (EMR) Health Factors (HF) dataset. The HF table contains longitudinal data on patients routinely generated by clinical visits and stored within the EMR, including smoking status and history. The smoking status data were mapped to distinct categories (Current, Former, Never, and Unknown Smokers) based on the most recent updates in the HF dataset. Prior data were used to resolve any discrepancies. Specifically, any Current or Former Smoker who was later recorded as a Never Smoker was classified as a Former Smoker. Prior research reported high agreement between records in the EMR and self-reported smoking status gathered from questionnaires, with reported kappa statistics ranging from 0.66 to 0.74 [14][15][16]. We also included demographic variables (age), clinical characteristics (BMI), pre-existing conditions (hypertension, cancer, diabetes mellitus, etc.), and pre-infection medications. Patients' smoking status, pre-existing conditions, and pre-infection medications were obtained for the 2 years prior to the index date.
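The discrepancy rule described above can be illustrated with a minimal Python sketch. The function name `resolve_smoking_status` and the list-of-strings input are hypothetical, and the handling of Unknown entries (falling back to the most recent known entry) is an assumption that the text does not specify.

```python
from typing import List


def resolve_smoking_status(history: List[str]) -> str:
    """Map a chronologically ordered smoking history to one category.

    `history` lists Health Factors entries, oldest first, each one of
    "Current", "Former", "Never", or "Unknown" (hypothetical encoding).
    """
    # Assumption: Unknown entries are ignored when a known entry exists.
    known = [status for status in history if status != "Unknown"]
    if not known:
        return "Unknown"

    latest = known[-1]
    # Rule from the text: a Current or Former Smoker who later appears
    # as a Never Smoker is classified as a Former Smoker.
    ever_smoked = any(s in ("Current", "Former") for s in known[:-1])
    if latest == "Never" and ever_smoked:
        return "Former"
    return latest
```

For example, a history of Current followed by Never resolves to Former, whereas a history that only ever records Never stays Never.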

Imputation Process
Smoking status was missing for 10.0% of all veterans and for 10.7% of those who tested positive. We used Multiple Imputations by Chained Equations in R [17] to impute unknown smoking status as Current, Former, or Never and to impute missing data for age and BMI. To increase the generalizability of the imputation model, all available covariates without missingness were included, and five datasets were imputed using five iterations.

Identifying the Most Important Variables Using a Machine Learning Selection Process
We illustrate the selection of the most important variables in Figure 2, as previously described [6]. We applied a cascade of machine learning approaches in four sequential steps to identify the most important predictors of a positive SARS-CoV-2 test, using both imputed and unimputed datasets. The variables were curated from the full breadth of the EMR (i.e., demographic, clinical, pre-existing conditions, and pre-infection medications). We started with 165 initial variables. First, we removed any variable with a prevalence of less than 1%; this step reduced the dimensionality to 119 variables. Second, a univariate filter method (a chi-squared test used to evaluate the significance of each independent variable with respect to the target variable) excluded variables not statistically associated with testing positive for SARS-CoV-2 infection at p < 0.05. We retained 108 variables. Third, we applied an embedded method, the least absolute shrinkage and selection operator (LASSO) [18], with 10-fold cross-validation to select the most important variables; this step kept 76 variables. LASSO is a regression model that adds a regularization penalty to shrink the coefficients of less contributory variables to zero. The 10-fold cross-validation split the dataset into ten equal parts. Nine parts were used for training and one part for validation. The process was repeated ten times. The variables that were retained most frequently in predictive models for SARS-CoV-2 infection were selected as the most important variables. Finally, we applied a wrapper method, sequential forward selection (SFS), with five-fold cross-validation. This final step selected 12 variables.
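The first two steps of this cascade (the 1% prevalence filter and the univariate chi-squared filter) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: all variables are coded as binary 0/1 lists, the chi-squared test is the Pearson test on a 2x2 table with one degree of freedom and no continuity correction, and the function names and toy data are hypothetical rather than taken from the study code.

```python
import math
from typing import Dict, List


def chi2_p_value(x: List[int], y: List[int]) -> float:
    """Pearson chi-squared p-value (1 df) for two binary 0/1 lists."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 df, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def filter_variables(data: Dict[str, List[int]], outcome: List[int],
                     min_prevalence: float = 0.01,
                     alpha: float = 0.05) -> List[str]:
    """Keep variables that pass the prevalence and chi-squared filters."""
    kept = []
    for name, column in data.items():
        if sum(column) / len(column) < min_prevalence:
            continue  # step 1: prevalence below 1%
        if chi2_p_value(column, outcome) >= alpha:
            continue  # step 2: not associated with the outcome at p < 0.05
        kept.append(name)
    return kept


# Toy data (hypothetical): 200 patients, half of whom tested positive.
outcome = [1] * 100 + [0] * 100
data = {
    "strong": [1] * 80 + [0] * 20 + [1] * 20 + [0] * 80,  # associated
    "rare": [1] + [0] * 199,                              # prevalence 0.5%
    "noise": [1, 0] * 100,                                # no association
}
kept = filter_variables(data, outcome)
```

In this toy example only the variable named `strong` survives both filters; the later LASSO and SFS steps would then operate on the surviving variables.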
SFS starts with an empty set of variables and identifies the best single variable (the one associated with the best-performing single-variable regression model based on a p-value selection criterion); next, it evaluates all pairs formed by the best variable and each of the remaining variables and selects the best pair. It continues to add one new variable at a time to the previously selected variables until adding a variable no longer reduces the criterion value by at least 0.05. The full lists and sets of variables selected at each step are presented in the Supplementary Materials (Table S2a for imputed data and Table S2b for unimputed data). The analysis of unimputed data after removing patients with unknown smoking status provided an identical list of features at each step.
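The greedy logic of SFS can be sketched generically in Python. The sketch below is not the study's implementation: `criterion` is a hypothetical callable that returns a value to minimize for a given subset (in the study, a p-value-based criterion from cross-validated regression models), and the stopping rule assumes that selection ends when the best addition improves the criterion by less than `min_gain`.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    candidates: Sequence[str],
    criterion: Callable[[List[str]], float],
    min_gain: float = 0.05,
) -> List[str]:
    """Greedily add the variable that lowers `criterion` the most."""
    selected: List[str] = []
    remaining = list(candidates)
    best = criterion(selected)
    while remaining:
        # Score every one-variable extension of the current subset.
        scored = [(criterion(selected + [v]), v) for v in remaining]
        new_best, winner = min(scored, key=lambda pair: pair[0])
        if best - new_best < min_gain:
            break  # assumed stopping rule: improvement too small
        selected.append(winner)
        remaining.remove(winner)
        best = new_best
    return selected


# Toy criterion (hypothetical): each variable lowers the value by a fixed amount.
weights = {"age": 0.5, "pneumonia": 0.3, "noise": 0.01}


def toy_criterion(subset: List[str]) -> float:
    return 1.0 - sum(weights[v] for v in subset)


chosen = sequential_forward_selection(list(weights), toy_criterion)
```

With the toy criterion, `age` is added first, then `pneumonia`, and the search stops before `noise` because its improvement falls below the threshold.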

Statistics
Descriptive analyses were conducted to summarize the cohort characteristics by smoking status using the unimputed data. Mean and standard deviation (±SD) were reported for continuous variables; frequency count (N) and percentage (%) were computed for categorical variables; and Chi-square statistics and ANOVA (p-values) were used to evaluate differences among the smoking-status groups. The associations between smoking status and a positive SARS-CoV-2 test were assessed using two binary logistic regression models to estimate the odds ratios (OR) and 95% confidence intervals (95% CI). The first model adjusted for the most important covariates identified through the feature selection process described above (age (older age), BMI (being overweight and being underweight), cancer, dementia, Hispanic or Latino, lower respiratory infections, pneumonia, sex (male), septic shock, and current smoking). The second model used the same covariates but stratified the patients by age (age <65 versus ≥65). The imputed dataset was used for the primary analysis and the unimputed dataset for the sensitivity analysis. Statistical analyses and machine learning were performed using Python 3.8.3.
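To make the odds ratio and its confidence interval concrete, the following Python sketch computes an unadjusted OR with a Wald 95% CI from a 2x2 table. It is only an illustration: the study's estimates come from multivariable logistic regression, the function name is hypothetical, and the counts per 1,000 patients are invented to match the positivity proportions reported in the abstract (11.5% for Current Smokers, 22.0% for Never Smokers).

```python
import math
from typing import Tuple


def odds_ratio_ci(exp_pos: int, exp_neg: int,
                  ref_pos: int, ref_neg: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio with a Wald confidence interval.

    exp_* are counts in the exposed group, ref_* in the reference group.
    """
    odds_ratio = (exp_pos * ref_neg) / (exp_neg * ref_pos)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / exp_pos + 1 / exp_neg + 1 / ref_pos + 1 / ref_neg)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts per 1,000 patients in each group.
crude_or, ci_low, ci_high = odds_ratio_ci(
    exp_pos=115, exp_neg=885,   # Current Smokers: 11.5% positive
    ref_pos=220, ref_neg=780,   # Never Smokers: 22.0% positive
)
```

The crude value obtained this way is not adjusted for any covariate, so it should not be expected to match the adjusted estimates reported in the paper.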

Participant Characteristics Overall and by Smoking Status
The 1,176,306 veterans who were tested for SARS-CoV-2 infection at the VHA during the study period had a positivity rate of 19.6% (Table 1). They were predominantly men (86.5%), white (65.6%), and not Hispanic or Latino (84.9%). They had a mean age of 60.4

Most Important Variables
The most important variable selection process started with 165 initial features and ended with the 12 most important variables ( Figure 2

SARS-CoV-2 Test Positivity and Pre-Existing Respiratory Illnesses
Among the 15 pre-existing respiratory illnesses in our dataset, the most prevalent were chronic lung disease, obstructive sleep apnea, chronic obstructive pulmonary disease, lower respiratory infection, bronchitis, asthma, pneumonia, and acute respiratory failure (Supplementary Table S4). Ten of these variables were selected using univariate analysis, the LASSO step kept six, and only two remained after the final step as important risk factors: lower respiratory infections (aOR 1.09; 95% CI: 1.07, 1.11; p < 0.0001) and pneumonia (aOR 1.20; 95% CI: 1.17, 1.22; p < 0.0001) (Table 3, Supplementary Table S2a,b).

SARS-CoV-2 Test Positivity and Smoking Status
In the univariate analysis, the unadjusted odds of testing positive for SARS-CoV

SARS-CoV-2 Test Positivity and Smoking Status Stratified by Age
To address the marked difference in the risk between younger and older veterans, veterans were stratified by age (age <65 versus age ≥65), and risk estimates were adjusted for the top 12 variables, except for age ≥85 (

Discussion
We analyzed data from a large cohort and found that current smoking is associated with a lower risk of testing positive among veterans tested for SARS-CoV-2. Why current smoking is associated with a lower risk of testing positive for SARS-CoV-2 remains unclear. Previous epidemiologic studies have reported an apparent protective effect of smoking against testing positive for SARS-CoV-2, although some of these studies were not designed to examine the effect of smoking status. In our analysis, smoking was the main variable of interest. Our findings extend previously reported findings of lower risks of testing positive for SARS-CoV-2 in Current Smokers (OR: 0.52) and Former Smokers (OR: 0.92) in the same population of veterans using a smaller cohort (n = 88,747) tested early in the pandemic [9]. Our results are also consistent with a decreased risk of SARS-CoV-2 infection (hazard ratio range 0.40 to 0.48) observed among heavy, moderate, and light smokers in a cohort study of 8.28 million participants, 19,486 of whom tested positive [19].
A systematic review of 233 studies [20] reported a lower prevalence of smokers among individuals who tested positive. A subsequent meta-analysis of studies with detailed smoking status (Current, Former, and Never Smokers) also demonstrated a reduced risk of infection in Current Smokers (RR: 0.74) compared with Never Smokers. Despite the number of studies available, only one was rated as good quality. Jose et al. [21] described the risk in 69,264 patients, including reduced risks of testing positive in patients who smoked cigarettes only and in patients who both smoked and vaped relative to non-smokers/non-vapers, but a similar risk among those who vaped only. These findings suggest that the lower risk of testing positive for SARS-CoV-2 is unique to cigarette smoking. In a cohort of 22,914 veterans with cancer, Current Smokers had a significantly lower prevalence (5.3%) of positive SARS-CoV-2 tests compared with Former and Never Smokers combined (9.5%) [22].
By contrast, a recent study triangulating observational analyses (OA) and Mendelian randomization (MR) reported that former smoking (37%) increased the risks of infection, hospitalization, and death in OA; current smoking (3%) only increased the risks of hospitalization and death; in MR, both smoking initiation and heaviness increased the risks of all three outcomes (OR range 1.45 to 10.02), indicative of a causal effect of smoking on COVID-19 severity [23]. Likewise, data from two survey-based studies documented higher risks of infection among Current Smokers [24,25]. Further, two meta-analyses of hospital-based studies [20,26] revealed that Current and Former Smokers are more likely to experience severe COVID-19 complications, such as hospitalization, disease severity requiring ICU, and death. Our previous study found that Former Smokers had an increased risk of in-hospital mortality, whereas the risk was similar between Current and Never Smokers [6].
Two other individual factors were associated with a reduced risk of testing positive. Patients with a history of cancer had a significantly reduced risk (aOR 0.72; 95% CI: 0.71, 0.73). Previous studies also reported a reduced risk of infection among cancer patients [7,11]. It is possible that cancer patients are more likely to get tested when asymptomatic; thus, the reduced risk may be related to care-seeking behaviors and screening for COVID-19. People who had a prescription for phenylephrine before testing were also at lower risk (aOR 0.76; 95% CI: 0.73, 0.79) of testing positive for SARS-CoV-2. Phenylephrine is a decongestant medication generally prescribed to treat stuffy nose, cough, pain, and fever. It is possible that people with regular upper respiratory symptoms who use this decongestant are more likely to get tested than other people.
Unlike previous epidemiological studies that rely on a predetermined set of covariates, our larger sample size permitted application of a machine learning approach to take full advantage of each individual variable in the EMR. This explains why our model identified pre-existing conditions and pre-infection medications as independent factors associated with positive SARS-CoV-2 test results, despite their relatively low frequency in the sample. Other known comorbidities and risk factors for severe COVID-19 [34][35][36][37], such as diabetes, cardiovascular disease, hypertension, and chronic kidney disease, were selected at intermediate steps during the variable selection process but were not retained as the most important variables in the final sequential forward selection step.
Knowledge about physiological mechanisms that could confer lower risk for SARS-CoV-2 infection among smokers remains underdeveloped. Prior studies by us [38] and others [39,40] have shown that smoking upregulates ACE-2 receptors in lung tissue, particularly in goblet cells [38], which could make Current Smokers more susceptible to lung infections from SARS-CoV-2; this is not what we found in this study. Nicotine exposure upregulates many subtypes of nicotinic acetylcholine receptors. Upregulation of the α7 pentamer could have an anti-inflammatory effect, whereas α3β4 pentamer upregulation increases particle transport speed, which could help flush pathogens. Studies of vapers, who are exposed to nicotine but avoid many toxic substances in tobacco cigarettes, have failed to demonstrate altered risks of SARS-CoV-2 infection according to vaping status [41], which argues against nicotine itself playing a substantial role [42] in protecting smokers from SARS-CoV-2 infections. Similarly, a clinical trial evaluating nicotine application in hospitalized COVID-19 patients showed no benefit [43]. An alternate hypothesis is that upregulation of the RAS pathway due to toxicants in cigarette smoke eventually leads to the apoptosis of alveolar epithelial cells and the proliferation of fibroblasts, reducing targets for COVID-19. One RAS pathway member upregulated in smokers, Angiotensin 1-7, has anti-proliferative and anti-inflammatory activities, unlike other RAS members that are proinflammatory. A homeostatic balance between inflammatory and anti-inflammatory components of the RAS pathway could influence protection from COVID-19 among smokers, but further studies are needed to understand the specific protection from COVID-19 that seems to be provided to Current Smokers but not Former Smokers [44].

Strengths and Weaknesses
Recent critiques have argued that studies reporting an inverse correlation between smoking and testing positive have several methodological flaws, such as not having smoking as the primary exposure, low prevalence of smokers in cohorts, lacking explicit data on smoking status, not adjusting for confounders associated with smoking, and unexpectedly low percentages of chronic obstructive pulmonary disease and cardiovascular disease, particularly among smokers, indicating potential selection bias [45,46]. Veterans are known to use tobacco products at a higher rate than non-veterans [47][48][49]. According to a 2016 report, 32
Major strengths of this analysis include the large cohort with a national scope assessing the association between smoking status and testing positive for SARS-CoV-2. The VHA is known to have a high agreement of coding smoking status in the EMR. Notably, we imputed missing smoking status and obtained similar results between unimputed and imputed datasets. To avoid confounding with vaccination, we restricted our analysis to the period prior to widespread vaccination. We used machine learning to select the most important variables that we adjusted for in the multivariate analysis. In future studies, application of a broader suite of machine learning tools could identify additional factors and interactions among them that predict risk for COVID-19 infections [51]. A recent review of many different machine learning applications found that the regularized logistic regression we used is widely used and has accuracy comparable to that of many other approaches [52].
Our study has some limitations. The study population is veterans, mostly men seeking care in the VHA, with a higher burden of comorbidities compared to the general US population. There is potential for selection bias; some of our findings were likely affected by differences in accessing care at the VHA. It is possible that smokers who were sick and tested positive were getting their care outside the VHA, while less sick smokers were coming to the VHA and were less likely to test positive. These socially determined behaviors could partially explain the discrepant association between smoking and contracting SARS-CoV-2. However, in an age-stratified analysis to create subgroups with more similar healthcare-seeking behaviors, the protective effects of current smoking were comparable in older and younger veterans. The lack of association between a positive SARS-CoV-2 test and Former Smokers among older veterans could reflect a longer period of abstention in older patients; thus, their risk becomes similar to that of Never Smokers.

Conclusions
This study suggests that current smoking is associated with a decreased risk of testing positive for SARS-CoV-2 among veterans receiving their testing at VHA facilities. Further research should continue to explore the relationship between smoking and SARS-CoV-2 infection and outcomes to develop and deliver clear messages about risk and harm mitigation.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/healthcare10071244/s1, Table S1. Full list of all patients' characteristics; Table S2a,b. List of variables studied and results of most important variable selection from prevalence, Univariate Analysis, LASSO, and sequential forward stepwise selection for Imputed data; List of variables studied and results of feature selection (FS) from prevalence, Univariate Analysis, LASSO, and sequential forward stepwise selection for Unimputed data; Table S3. Prevalence of patient characteristics by testing status; Table S4. Pre-existing respiratory diseases; Table S5. Association between smoking status and COVID-19 positivity stratified by age for imputed and unimputed data.

Data Availability Statement:
The data are available only behind the VHA firewall and cannot leave the VHA electronic health records system. Any request for data access requires an official approval process.