Factors Associated with E-Cigarette Use in U.S. Young Adult Never Smokers of Conventional Cigarettes: A Machine Learning Approach

E-cigarette use is increasing among young adult never smokers of conventional cigarettes, but the awareness of the factors associated with e-cigarette use in this population is limited. The goal of this work was to use machine learning (ML) algorithms to determine the factors associated with current e-cigarette use among US young adult never cigarette smokers. Young adult (18–34 years) never cigarette smokers from the 2016 and 2017 Behavioral Risk Factor Surveillance System (BRFSS) who reported current or never e-cigarette use were used for the analysis (n = 79,539). Variables associated with current e-cigarette use were selected by two ML algorithms (Boruta and Least absolute shrinkage and selection operator (LASSO)). Odds ratios were calculated to determine the association between e-cigarette use and the variables selected by the ML algorithms, after adjusting for age, gender and race/ethnicity and incorporating the BRFSS complex design. The prevalence of e-cigarette use varied across states. Factors previously reported in the literature, such as age, race/ethnicity, alcohol use, depression, as well as novel factors associated with e-cigarette use, such as disabilities, obesity, history of diabetes and history of arthritis were identified. These results can be used to generate further hypotheses for research, increase public awareness and help provide targeted e-cigarette education.


Introduction
There has been a rapid increase in the use of e-cigarettes among youth and young adults in the US [1][2][3]. E-cigarettes include devices that allow users to vaporize and inhale an aerosol that typically contains nicotine, flavorings and other additives [4]. The long-term effects of e-cigarette use remain largely unknown, but e-cigarette aerosols contain toxins that can affect health [5][6][7][8]. There is increasing evidence that e-cigarettes may be associated with an increased risk of oral diseases [9,10], prediabetes [11], depression [12,13], asthma, chronic obstructive pulmonary disease (COPD) and respiratory symptoms [14][15][16][17][18]. Recently, the Centers for Disease Control and Prevention (CDC) reported multiple cases of e-cigarette or vaping product use-associated lung injury (EVALI), some of which resulted in deaths [19]. Tetrahydrocannabinol (THC)-containing e-cigarettes, or e-cigarette cartridges containing vitamin E acetate, were likely responsible for these clusters of EVALI [20]. This highlights the fact that e-cigarettes may unknowingly contain potentially harmful substances. E-cigarettes have been associated with marijuana, non-prescribed drug use and subsequent cigarette smoking, which may be explained by confounding due to common liability such as shared genetic vulnerability or environmental factors [21][22][23][24][25][26].
E-cigarette use is prevalent among smokers of conventional cigarettes [27], but e-cigarette use by never smokers (sole e-cigarette use) is also rising [2]. In 2016, 15% of all e-cigarette users (an estimated 1.9 million U.S. adults) were sole e-cigarette users and approximately 1.2 million of them were less than 25 years old [28]. Moreover, results from two different national surveys conducted in 2015 and 2016 showed that 40% and 44%, respectively, of current e-cigarette users aged 18-24 years were sole e-cigarette users [27,29]. E-cigarettes may be safer than cigarettes for smokers [30], but never smokers who use e-cigarettes likely receive little benefit [31]. Studies have shown that the perception of e-cigarettes and the motivation for e-cigarette use vary with cigarette smoking status [31,32]; therefore, factors unique to never smokers need to be identified.
Young adults (18-34 years old) are more likely than older adults to report current e-cigarette use [29,33], and a significant percentage of young adults, especially 18-24-year-olds, report sole e-cigarette use [2,34], but there is a paucity of research on the factors associated with e-cigarette use in this population [29]. Identifying the factors associated with e-cigarette use in young adults is critical, in light of a recent study that showed that 76% of the EVALI patients were <35 years old [35]. Knowledge of these factors is also important for regulatory authorities, because the recent FDA decision to reduce the nicotine content of combustible cigarettes may deter some individuals from initiating cigarette smoking and lead them to switch instead to e-cigarettes and other noncombustible tobacco products [36]. The factors associated with e-cigarette use can be identified using machine learning (ML) techniques.
There has been an increase in the application of ML techniques to medicine and other research areas [37], but ML techniques are rarely used in tobacco research. ML is a natural extension of traditional statistical approaches that becomes increasingly valuable as the amount of data and the dimensionality of the dataset increase [38]. As the number of variables to be considered increases, identifying all the variables associated with an outcome and determining the variables to be included in models becomes increasingly difficult to implement properly using standard statistical methods [38][39][40]. ML techniques can be used to identify variables associated with an outcome as the number of variables increases. ML techniques have been applied to survey data to identify variables that are associated with different psychological and disease conditions [41][42][43][44][45][46].
Variables with known relationships or exploratory guesses are typically used to identify factors associated with e-cigarette use. This approach may lead to the exclusion of important variables that can improve our understanding of e-cigarette use in young adults. ML techniques can reduce this limitation by automatically identifying variables associated with e-cigarette use. The goal of this study is to use ML techniques to identify demographic, behavioral and health factors associated with current e-cigarette use in a representative population of young adult never smokers in the US. This is especially important because of the rapidly changing field of e-cigarette use by young adult never smokers and the potential gaps in understanding the factors associated with e-cigarette use in this population. These identified factors may be used in other models that include e-cigarette use to reduce bias due to confounding. This study will inform the work of researchers, physicians, and regulatory authorities seeking to develop programs to better target young adults at risk of sole e-cigarette use.

Materials and Methods
The 2016 and 2017 cross-sectional Behavioral Risk Factor Surveillance System (BRFSS) survey data were used for the analysis [47,48]. The BRFSS is a combined project between CDC and all the states in the US and participating territories. Data in the BRFSS are self-reported and collected using landlines and cellphones. The BRFSS is designed to collect data on demographics, chronic health conditions, health-related risk behaviors and the use of preventive services from the noninstitutionalized adult population (≥18 years) residing in the US and participating territories. The BRFSS includes a core set of questions that is used by all the states and optional modules that can be included by the different states. Core questions include questions about current health-related perceptions, conditions, and behaviors, as well as demographic questions. The core component includes the annual core, comprising questions asked each year of all the participants, and rotating core questions that are included in even- and odd-numbered years. More information about the BRFSS design can be found elsewhere [49,50].

Study Population
Data from the annual core questions from the 2016 and 2017 BRFSS survey were combined as detailed in other reports [51,52] and used for the analysis. Participants were included in the analysis if they were young adults (18-34 years), were never cigarette smokers and were either current or never e-cigarette users. E-cigarette use was determined using these two questions: "Have you ever used an e-cigarette or other electronic vaping product, even just one time, in your entire life?" and "Do you now use e-cigarettes or other electronic "vaping" products every day, some days, or not at all?". Never e-cigarette users reported having never used an e-cigarette, and current e-cigarette users reported currently using e-cigarettes every day or some days. Never cigarette smokers reported having smoked fewer than 100 cigarettes in their entire life.
There were 148,618 young adults (18-34 years). E-cigarette use and smoking status could not be ascertained for participants who reported "Don't know/Refused/Missing" for e-cigarette use (n = 7585) and cigarette use (n = 6995). These participants were removed from the analysis. Additionally, participants who were current or former cigarette smokers (n = 44,418) and/or former e-cigarette users (n = 39,268) were removed from the analysis.
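The inclusion rules described above can be sketched as a small classifier. This is an illustrative sketch only: the `Respondent` fields and the `classify` function are hypothetical names introduced here, not BRFSS variable names, and the study's own analysis was carried out in R.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Respondent:
    # Hypothetical field names; None stands for "Don't know/Refused/Missing".
    age: int
    smoked_100_cigarettes: Optional[bool]
    ever_used_ecig: Optional[bool]
    ecig_now: Optional[str]  # "every day", "some days", "not at all" or None


def classify(r: Respondent) -> Optional[str]:
    """Return "current" or "never" for included respondents, else None."""
    if not 18 <= r.age <= 34:
        return None  # outside the young adult age range
    if r.smoked_100_cigarettes is None or r.smoked_100_cigarettes:
        return None  # smoking status unknown, or current/former smoker
    if r.ever_used_ecig is None:
        return None  # e-cigarette status unknown
    if not r.ever_used_ecig:
        return "never"
    if r.ecig_now in ("every day", "some days"):
        return "current"
    return None  # former e-cigarette user, or current use unknown
```

Respondents mapped to `None` correspond to the groups removed from the analysis; the remaining two labels form the binary outcome.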

Data Preprocessing
Annual core questions that were the same in 2016 and 2017 surveys were selected as variables for the analysis. Variables that were used to create other variables and variables not related to health perceptions, conditions, behaviors, or demographics (such as imputation flags, weights, and stratum) were removed from the analysis. Missing data that could be ascertained from other variables (e.g., questions that were not asked based on response to a previous question) were replaced with the appropriate categorical value. Categorical variables where participants selected "Don't know/Not sure/Refused/Missing" were converted to a new categorical value. This was done to remove the missingness in the data [53]. Current and never e-cigarette use was combined to create a binary outcome for this analysis. After preprocessing the data, 47 variables and the outcome were selected as input for the ML algorithm.
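The two recoding rules in this section (filling values implied by a skip pattern, and turning non-response into its own category) might look like the sketch below. The field names and the specific skip pattern are assumptions chosen for illustration; they are not taken from the BRFSS codebook.

```python
# Answers treated as non-response (assumed labels, for illustration only).
MISSING_ANSWERS = {"Don't know", "Not sure", "Refused"}
MISSING_LABEL = "Don't know/Not sure/Refused/Missing"


def preprocess(record: dict) -> dict:
    """Return a copy of one survey record with missingness removed."""
    out = dict(record)
    # Assumed skip pattern: a respondent who has never used an e-cigarette
    # is not asked about current use, so the answer can be inferred.
    if out.get("ever_used_ecig") == "No" and out.get("ecig_now") is None:
        out["ecig_now"] = "Not at all"
    # Any remaining non-response becomes an explicit category.
    return {
        key: MISSING_LABEL if value is None or value in MISSING_ANSWERS else value
        for key, value in out.items()
    }
```

Treating non-response as a category keeps every respondent in the dataset, at the cost of adding one level to each affected variable.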

Statistical Analysis Step 1: Initial Variable Selection
Boruta [54] and the least absolute shrinkage and selection operator (LASSO) [55,56] were used to select the variables that were associated with current e-cigarette use. These two algorithms will select different sets of variables, thereby reducing the likelihood of important variables being omitted. Boruta and LASSO have been used for variable selection for various types of data, such as survey, medical and genomic data [57][58][59][60][61][62][63][64].
Boruta is a wrapper built around the random forest classification algorithm. Random forest is an ensemble method where classification is performed by voting on multiple unbiased weak decision trees. Random forest can deal with nonlinear and complex relationships between the variables and the outcome. Furthermore, random forest considers the impact of each predictor variable individually, as well as in multivariate interactions with other predictor variables [65]. Boruta works by adding randomness to the data and creating randomized variables called "shadow" features. In each iteration of the algorithm, features that achieve higher importance (Z score) than the shadow features are counted. Variables with significantly larger importance values than the shadow variables are declared important variables, and the others are declared unimportant variables. The algorithm works to find all the relevant/important variables in the data. The important variables are those significantly correlated with the outcome. A detailed description of Boruta can be found elsewhere [54].
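The shadow-feature comparison can be illustrated with a deliberately simplified sketch. Two loud assumptions apply: the importance measure here is the absolute Pearson correlation with the outcome, standing in for the random forest Z score that Boruta actually uses, and the sketch only counts "hits" without the statistical test Boruta applies to them. All function and variable names are hypothetical.

```python
import math
import random


def abs_correlation(xs, ys):
    """Stand-in importance: absolute Pearson correlation (NOT random forest)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def shadow_hits(features, outcome, n_iter=50, seed=0):
    """Count how often each feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_scores = []
        for column in features.values():
            shadow = list(column)
            rng.shuffle(shadow)  # shuffling breaks any link to the outcome
            shadow_scores.append(abs_correlation(shadow, outcome))
        best_shadow = max(shadow_scores)
        for name, column in features.items():
            if abs_correlation(column, outcome) > best_shadow:
                hits[name] += 1
    return hits


# Toy data: "signal" tracks the outcome closely, "noise" is nearly unrelated.
outcome = [i % 2 for i in range(200)]
features = {
    "signal": [2 * y + 0.1 * (i % 3) for i, y in enumerate(outcome)],
    "noise": [(2 * i) % 13 for i in range(200)],
}
hits = shadow_hits(features, outcome)
```

A feature that consistently beats the best shadow accumulates hits in nearly every iteration, while an unrelated feature does so only occasionally.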
The LASSO algorithm puts a constraint on the sum of the absolute values of the logistic regression model parameters by applying a shrinking (regularization) process that penalizes the coefficients of the regression variables and shrinks the least important variables to zero. The tuning parameter λ controls the strength of the penalty. A detailed description about LASSO can be found elsewhere [55].
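For reference, the penalized problem described above can be written as follows for a binary outcome. The notation (N respondents, outcome y_i in {0, 1}, predictor vector x_i, intercept β_0, p coefficients β_j) is introduced here for illustration and follows the form given in the glmnet documentation for the binomial family with alpha = 1.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
  - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the negative average log-likelihood of the logistic model, and the second is the absolute-value penalty whose strength is set by λ. Larger values of λ push more coefficients to exactly zero.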
To avoid the errors and limitations due to a single application of an ML algorithm, and to reduce the sensitivity of the variable selection methods to small perturbations in the data [66,67], 100 iterations of Boruta and 300 iterations of LASSO with random samples consisting of 80% of the original data were performed. The features selected were stable at this number of iterations. More bootstrap iterations of LASSO were performed, because LASSO is computationally less expensive than Boruta. For LASSO, during each bootstrap iteration, a tenfold cross-validation was used to select the lambda (λ_m) that produced the minimum mean cross-validation error [56,68]. Variables with a non-zero coefficient at λ_m for at least one category other than "Don't know/Not sure/Refused/Missing" were selected. For both ML algorithms, the variables that were selected in ≥90% of the bootstrap iterations were identified as significant variables. The variables selected by either of the two algorithms were used as input to the final variable selection method.
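The resampling and thresholding logic in this step can be sketched independently of the selection algorithm. In the sketch below, each run is represented simply by the set of variable names it selected; the function names and the toy variable names are hypothetical.

```python
import random
from collections import Counter


def subsample_indices(n_rows, fraction=0.8, rng=None):
    """Draw row indices for one iteration: a random 80% of the data."""
    rng = rng or random.Random()
    return rng.sample(range(n_rows), k=int(n_rows * fraction))


def stable_variables(selections, threshold=0.90):
    """Keep variables selected in at least `threshold` of the iterations."""
    counts = Counter(v for selected in selections for v in set(selected))
    n_iter = len(selections)
    return {v for v, count in counts.items() if count / n_iter >= threshold}


# Toy example with 10 iterations per algorithm (the study used 100 and 300).
boruta_runs = [{"age", "state", "alcohol"}] * 9 + [{"age", "state"}]
lasso_runs = [{"age", "obesity"}] * 8 + [{"age"}] * 2

# Variables passing the threshold in either algorithm go to the final step.
candidates = stable_variables(boruta_runs) | stable_variables(lasso_runs)
```

Taking the union rather than the intersection reflects the stated aim of not omitting a variable that only one of the two algorithms detects.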

Statistical Analysis Step 2: Final Variable Selection
Multivariable logistic regression was used to examine the association between e-cigarette use and the variables selected from either Boruta or LASSO, after controlling for gender, age and race/ethnicity, which are considered to be non-modifiable demographic exposures [69]. There were no statistical adjustments for the association between these non-modifiable demographic exposures and e-cigarette use [69]. Creating multivariable logistic regression models for each selected feature and adjusting for only the non-modifiable demographic exposures (gender, age, and race/ethnicity) will independently identify the factors associated with e-cigarette use. Also, in order to make the results representative of the United States noninstitutionalized young adult never smoker population, the BRFSS complex design was incorporated into the analysis, to account for the probability of selection and adjust for nonresponse bias and non-coverage errors [51,52]. The BRFSS complex data weights and analysis for the subpopulations were calculated as detailed elsewhere [70]. All analyses were conducted using R version 3.6.1, R Foundation for Statistical Computing: Vienna, Austria, 2019 [71].
The Boruta package [54] was used for Boruta, the glmnet package [56] was used for LASSO, and the survey package [72] was used for the multivariable logistic regression. The default parameters for Boruta were used, including mtry = square root of the number of predictor variables and ntree = 500. These are sufficient in most cases, since random forest performance has a weak dependence on its parameters [54]. The exception was maxRuns, which was increased to 250 to prevent the algorithm from ending prematurely, since an early stop leaves more features classified as tentative [54]. For LASSO, cv.glmnet in the glmnet package was used. Family was set to binomial and all the default parameters of cv.glmnet were used, including nfolds = 10 and alpha = 1 [56].
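As a reminder of how a fitted logistic regression coefficient maps to the odds ratios reported in the Results, the sketch below exponentiates a coefficient and a Wald-type interval. The 1.96 multiplier (a 95% normal-approximation interval) is an assumption made here; the study obtained its estimates from design-based models in the R survey package, and the text does not state how its intervals were computed.

```python
import math


def odds_ratio(beta, se, z=1.96):
    """Return (OR, lower, upper) from a log-odds coefficient and its SE.

    Assumes a Wald interval on the log-odds scale; z = 1.96 gives roughly
    95% coverage under a normal approximation.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )
```

A coefficient of zero corresponds to an odds ratio of 1 (no association), and an interval that excludes 1 indicates a statistically significant association at the chosen level.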

Results
There were 79,539 young adult never cigarette smokers; 3,146 were current e-cigarette users and 76,393 were never e-cigarette users. Among young adult never smokers, 55
Table footnotes:
1 Includes participants who are divorced or widowed or separated.
2 Includes participants who are out of work or unable to work, homemakers or retired.
3 Defined as ≥4 drinks for females and ≥5 drinks for males on 1 occasion in the past 30 days.
4 Defined as ≥7 drinks for females and ≥14 drinks for males per week.
5 Participants answered "yes" to whether any of the following happened in the past year: intravenous drug use, treatment for sexually transmitted or venereal disease, received money or drugs in exchange for sex, had anal sex without a condom or had four or more sex partners.
6 Participants answered "yes" to "Are you blind or do you have serious difficulty seeing, even when wearing glasses?"
7 Participants answered "yes" to "Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions?"
8 Participants answered "yes" to "Do you have serious difficulty walking or climbing stairs?"
9 Participants answered "yes" to "Do you have difficulty dressing or bathing?"
10 Participants answered "yes" to "Because of a physical, mental, or emotional condition, do you have difficulty doing errands alone such as visiting a doctor's office or shopping?"
11 Participants answered "yes" to "Has a doctor, nurse, or other health professional ever told you that you had some form of arthritis, rheumatoid arthritis, gout, lupus, or fibromyalgia?" (Arthritis diagnoses include: rheumatism, polymyalgia rheumatica; osteoarthritis (not osteoporosis); tendonitis, bursitis, bunion, tennis elbow; carpal tunnel syndrome, tarsal tunnel syndrome; joint infection, Reiter's syndrome; ankylosing spondylitis; spondylosis; rotator cuff syndrome; connective tissue disease, scleroderma, polymyositis, Raynaud's syndrome and vasculitis (giant cell arteritis, Henoch-Schonlein purpura, Wegener's granulomatosis, polyarteritis nodosa).)
After the initial variable selection, 38 variables were selected by Boruta and 27 variables were selected by LASSO as significantly associated with e-cigarette use. The two algorithms had 26 selected variables in common. State/territory of residence was selected by both algorithms as significantly associated with e-cigarette use; therefore, the prevalence of sole e-cigarette use in the different states and US territories for 2016 and 2017 was calculated and is shown in Figure 1 and Table S1.
Guam had the highest prevalence of sole e-cigarette use by young adults, while Puerto Rico had the lowest prevalence of sole e-cigarette use by young adults. Among the US states, sole e-cigarette use by young adults was more prevalent in Michigan and Wyoming, and less prevalent in South Dakota.
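A weighted prevalence of this kind is, at its simplest, the sum of the survey weights of current users divided by the sum of the weights of all included respondents in each state. The sketch below shows only that point estimate; it ignores strata and sampling units, which a full complex-design analysis needs for variance estimation, and the function name and row layout are hypothetical.

```python
from collections import defaultdict


def weighted_prevalence(rows):
    """rows: iterable of (state, survey_weight, is_current_user) tuples.

    Returns {state: weighted proportion of current users}. Point estimate
    only; no standard errors or confidence intervals are computed.
    """
    users = defaultdict(float)
    total = defaultdict(float)
    for state, weight, is_current_user in rows:
        total[state] += weight
        if is_current_user:
            users[state] += weight
    return {state: users[state] / total[state] for state in total}
```

Because each respondent counts in proportion to the number of people they represent, the weighted figure can differ noticeably from the raw sample proportion.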
The results of the multivariable logistic regression are shown in Table 2. Three univariate logistic regressions (one for each of the following: age, gender and race/ethnicity) and 34 different multivariable logistic regressions (one for each of the selected features adjusted for age, gender and race/ethnicity) were performed. Table 2 shows the odds ratio for each selected feature after adjusting for age, gender and race/ethnicity. Variables selected by both algorithms and unique variables selected by each of the algorithms are also shown in Table 2.
Odds of e-cigarette use decreased with increasing age. Females, black non-Hispanic, other races non-Hispanic and Hispanics compared to white non-Hispanics, students compared to participants who were currently employed, and participants who had a flu shot in the past year were less likely to use e-cigarettes.
Participants who were not currently married, participants whose highest level of completed education was high school graduation compared to those who did not graduate from high school; participants who currently rent or have other arrangements, participants who could not see a doctor because of cost in the past 12 months and those who reported internet use in the past 30 days had increased odds of e-cigarette use.
Participants who were obese, who reported poor physical or mental health, who reported current smokeless tobacco use, alcohol consumption including binge drinking and heavy drinking and risky behaviors (such as occasionally driving without seatbelts, engaging in HIV risky behaviors and ever being tested for HIV) had increased odds of e-cigarette use. Additionally, participants who reported vision disability, cognitive disability, independent living disability and self-care disability had increased odds of e-cigarette use. Compared with persons without the respective chronic health conditions, participants who reported a history of arthritis, diabetes or depressive disorder and participants who currently have asthma also had increased odds of e-cigarette use.

Discussion
We used an ML approach to identify previously reported as well as unreported factors associated with sole e-cigarette use in US young adults. Sole e-cigarette use differed across states. Demographic factors such as age, gender and race and other factors such as use of smokeless tobacco, alcohol consumption, engaging in risky behaviors, reporting poor mental and physical health, disabilities and chronic health conditions were associated with sole e-cigarette use.
Some of the variables selected by the algorithms have been reported previously for adult sole e-cigarette users. Mirbolouk et al. reported that adult sole e-cigarette use differed across states and the prevalence of sole e-cigarette use was highest among males and persons aged 18 to 24 years [28]. Additionally, participants who used the internet, were binge drinkers, engaged in HIV risky behaviors and reported at least 1 day with mental distress had a higher prevalence of sole e-cigarette use than non-users [28]. In another study looking at e-cigarette use in adult never smokers (never smokers included current smokers who were not smokers a year ago), black people and Hispanics had decreased odds of current and regular e-cigarette use, while unmarried participants had increased odds of current and regular e-cigarette use [73]. E-cigarette use has also been shown to be associated with alcohol use and alcohol use disorder in nonsmokers of cigarettes [74]. Associations with asthma [18] and depression [13] have also been reported for sole e-cigarette use. Thus, our ML approach agrees with the literature confirming some known factors associated with sole e-cigarette use.
Additionally, our study extends the literature on sole e-cigarette use, by identifying several new factors associated with increased odds of sole e-cigarette use. The new factors identified include vision, cognitive, self-care and independent living disabilities. Obesity, risky behaviors (driving without a seat belt and ever being tested for HIV) and chronic conditions (history of diabetes and arthritis) were also identified as associated with e-cigarette use. Additionally, home ownership and having had a flu vaccine were also identified to be associated with e-cigarette use. Further research is needed to validate these findings and to explore the nature of these associations. Some of the identified characteristics of sole e-cigarette use have been shown in cigarette smokers [75][76][77][78][79][80], which may indicate a similarity in some behavioral predictors of cigarette and sole e-cigarette use.
Most of the variables selected by LASSO were also selected by Boruta, thereby independently confirming an association between those variables and e-cigarette use. Boruta, however, selected more variables because it is a heuristic algorithm designed to find all relevant variables, including weakly relevant variables [54]. Additionally, the differences found could be due to non-linear relationships or interactions between the variables and outcomes. Some of the initial variables selected by the ML algorithms were not statistically significant after adjusting for confounders (age, gender and race/ethnicity) and the BRFSS complex design method. This may be due to the fact that the ML algorithms cannot accommodate the BRFSS complex design that adjusts for demographic differences between sampled individuals and the population they represent. Therefore, while the features were significant in the sample used for the ML algorithms, they may not have been statistically significant in the US population of never smokers. Additionally, the relationship between the selected variables and e-cigarette use may not be adequately explained by a multivariable logistic regression model. Other limitations of the ML algorithms include the fact that Boruta is computationally expensive, especially for large datasets, and LASSO has no grouping property, and as such, tends to select only one variable from a group of highly correlated variables [54,81].
Our ML approach reduces the dependence on known information and exploratory hypotheses, which are commonly used to select features that are associated with an outcome or are included in regression models. By automatically selecting features associated with an outcome, our approach reduces the possibility of missing important or previously unreported features. Furthermore, our ML approach may be used to identify features associated with an outcome as the dimension of the data increases, which is common in larger survey data. Our results show the utility of the ML approach. We were able to identify previously reported features, as well as novel features that were associated with current e-cigarette use in never smokers.
The strength of the study was the large number of participants available for the analysis, who were nationally representative of US non-institutionalized young adult never smokers. The limitations include the cross-sectional nature of the analysis, the inability to establish causality, and a lack of biochemical confirmation of e-cigarette and conventional cigarette use, which may lead to underreporting of use and bias the results of the analysis. Furthermore, since the data are based on self-report, there is the potential for recall bias and diagnosis misclassification bias by the participants. Our approach may have been affected by multiplicity, as we tested multiple factors associated with e-cigarette use. Additionally, the data are unbalanced and the outcome is sparse, which can affect the detection of some of the features associated with e-cigarette use. Moreover, the features not selected by the ML algorithms may be associated with e-cigarette use; however, those features have not been previously reported as features associated with e-cigarette use in young adult never cigarette smokers. Furthermore, we reduced the risk of missing important features by using two different ML algorithms.

Conclusions
We were able to use machine learning algorithms to identify the factors associated with e-cigarette use in a nationally representative population of young adult never smokers. We were able to identify factors previously reported in the literature, as well as novel factors associated with e-cigarette use. Our ML approach reduces the dependence on known information and exploratory hypotheses, and reduces the possibility of missing important or previously unreported factors. Our findings may guide researchers, policymakers and health care providers, generate further hypotheses for research, increase public awareness and help provide targeted education on e-cigarette use in young adult never smokers. E-cigarette products are rapidly changing, and monitoring their use patterns is a high priority for policymakers [82]. Future studies are required in order to understand the state level differences and the implications of e-cigarette use in participants with disabilities, high risk behaviors and chronic conditions.

Supplementary Materials: The following are available online at http://www.mdpi.com/1660-4601/17/19/7271/s1. Table S1: State and US territory-specific prevalence of current e-cigarette use among young adults who are never-smokers of conventional cigarettes.

Conflicts of Interest:
The authors declare no conflict of interest.