Gene Environment Interactions and Predictors of Colorectal Cancer in Family-Based, Multi-Ethnic Groups

For the personalization of polygenic/omics-based health care, the purpose of this study was to examine the gene–environment interactions and predictors of colorectal cancer (CRC) by including five key genes in the one-carbon metabolism pathways. In this proof-of-concept study, we included a total of 54 families and 108 participants, 54 CRC cases and 54 matched family/friend controls, representing four major racial-ethnic groups in southern California (White, Asian, Hispanic, and Black). We used three phases of data analytics: exploratory, family-based analyses adjusting for the dependence within the family for sharing genetic heritage; the ensemble method; and generalized regression models for predictive modeling, with a machine learning validation procedure to validate the results for enhanced prediction and reproducibility. The results revealed that despite the family members sharing genetic heritage, the CRC group had greater combined gene polymorphism rates than the family controls (p < 0.05) on MTHFR C677T, MTR A2756G, MTRR A66G, and DHFR 19 bp, but not on MTHFR A1298C. The four racial groups presented different polymorphism rates for four genes (all p < 0.05), the exception being MTHFR A1298C. Following the ensemble method, the most influential factors were identified, and the best predictive models were generated by using the generalized regression models, with Akaike’s information criterion and leave-one-out cross validation methods. Body mass index (BMI) and gender were consistent predictors of CRC in both models, whether individual genes or total polymorphism counts were used, and alcohol use was interactive with BMI status. BMI status was also interactive with both gender and MTHFR C677T gene polymorphism, and exposure to environmental pollutants was an additional predictor. These results point to the important roles of environmental and modifiable factors in relation to gene–environment interactions in the prevention of CRC.


Introduction
Colorectal cancer (CRC) is a cancer that is preventable by modifying environmental and lifestyle interventions for human ecological development [1][2][3][4][5][6]. Well-defined environmental interventions may improve cancer treatment effects, prevent cancer progression and increase survival through epigenetic mechanisms with gene environment interactions [1,4,5]. Approximately 70% of CRC is related to environmental and lifestyle factors, while about 30% of CRC risk is inheritable with 5% being highly aggressive in cancer progression for metastatic penetrance [7][8][9]. Hence, the most common risks

Results
We used three phases of data analytics, including exploratory family-based analyses adjusting for dependence within the family for sharing genetic heritage [44]. In the first stage of data visualization and understanding, we used bootstrap forest, also known as bagging (i.e. bootstrap aggregating), which is one of the most popular ensemble methods [37][38][39][40]. The ensemble methods are based on the logic of resampling, which is a well-known remedy for small-sample studies. For example, while developing the bootstrapping method in 1983, Diaconis and Efron had only 15 observations [45]. In resampling, the sample is treated as a virtual population and then different subsets are randomly drawn from the sample for multiple analyses. Bias can be observed and corrected by such repeated analyses of random subsets [46]. In the second stage, our strategy was to identify the most influential predictors within three categories of genetic factors, demographic/environmental factors, and lifestyle factors for dimension reduction. We also used generalized regression models for predictive modeling with machine learning validation procedures [47], including significant variables and variables with significant interactions identified through the data visualization of bi-variate interaction profilers, to validate the results for enhanced prediction and reproducibility.

Characteristics of Study Participants
We recruited a total of 54 families, 108 participants, 54 CRC cases and 54 matched family controls. We attempted to match the groups on various demographic factors for this family-based study. The family control group had a younger age because many of the available family members were the offspring of the cancer patients. Table 1 presents the comparisons of key demographic [48], lifestyle health metrics [49,50], and environmental factors [51,52] between the two groups. Parameters that were significantly different between the control and cancer groups included age, gender, and exposure to pollutants (all p < 0.05), adjusted for associated blood-related family members [44]. As this was a proof-of-concept study, additional adjustment of p-values for multiple testing was not used for the exploratory analyses of related factors.
Other noteworthy factors included daytime sleepiness: cancer patients reported an average of 0.4 more sleepy days than the family controls. Physical inactivity was associated with an elevated risk of cancer (cancer patients were active for an average of 11 min less per week than the control group). However, most participants were sedentary; only two (3.7%) of the control group and one (1.9%) of the cancer group met the recommended 150 min or more of physical activity in this study. Additionally, alcohol use was associated with a higher risk of cancer (14.9% more use in the cancer group than in the control group).
These demographic, lifestyle, and environmental factors were compared across the racial-ethnic subgroups (Table 2). The Hispanic and the Black samples had higher body mass index (BMI), with more than 50% of the Hispanic and the Black samples being obese, compared with 29.4% of the White and 2.4% of the Asian samples (p < 0.0001). Additionally, more Whites than participants in the other three racial groups drank alcohol (p = 0.0001).
Between the two groups, the total gene polymorphism scores of the five chosen genes in the folate methylation pathways ranged from zero to six, with a possible maximum score of 10 if each of the five genes had homozygous polymorphisms. MTHFR enzyme deficiency was calculated as a composite score of the MTHFR C677T and MTHFR A1298C polymorphisms, combining the loss of enzyme function from MTHFR C677T (a loss of 35% for each of the two T polymorphic alleles) and MTHFR A1298C (a loss of 15% for each of the two C polymorphic alleles) [15]. To decrease the degrees of freedom and increase the power of the statistical testing, the total polymorphism score was recoded into two groups using the median split (<4 versus ≥4). Increased polymorphism of the five genes combined was associated with an increased risk of CRC (p < 0.05), while no significant difference between the control and cancer groups was noted for any gene alone or for the composite MTHFR enzyme deficiency score (Table 3). There was a general trend for the cancer group to have more polymorphisms and a lower percentage of wild-type alleles for MTHFR C677T, MTR A2756G, MTRR A66G, and DHFR 19bp; the exception was MTHFR A1298C, for which the control group had more polymorphisms and fewer wild-type alleles than the CRC group.
There were significant differences in the presentation of all five gene polymorphisms across the four racial-ethnic groups (all p < 0.05, Table 4). In general, the Asian and the White samples had more polymorphisms on these five genes combined than the Hispanic and the Black samples. For the individual genes, the Hispanic and the White samples had higher MTHFR enzyme deficiencies (average of 36%) resulting from polymorphisms of MTHFR C677T and MTHFR A1298C compared to the Asian (27%) and the Black (0%) subgroups.
The Asian (88%) and the Black (78%) samples had higher DHFR 19 bp deletions than the White (59%) and the Hispanic (48%) samples. The White (79%) and the Black (67%) samples had higher MTRR A66G polymorphisms than the Asian (52%) and the Hispanic (26%) samples. Furthermore, the Black (56%) and the White (41%) samples had higher levels of polymorphisms on the MTR A2756G gene than the Asian (29%) and the Hispanic (9%) subgroups.
The distributions of the polymorphisms on these five genes for the control and cancer groups and the four racial-ethnic subgroups are further presented in Table 5. We tested these five genes for Hardy-Weinberg equilibrium (HWE), a population-genetics check on genotype distributions [53]; departures can be associated with factors such as population migration or stratification and disease association. MTHFR A1298C and DHFR 19bp deviated significantly from HWE (both p < 0.05), although the deviation was not significant within any single racial-ethnic subgroup for these two genes. We further checked population-based allele frequencies across the ethnic groups to provide a reference distribution for comparison with our findings (Table 5, http://useast.ensembl.org/index.html; https://www.cdc.gov/genomics/population/genvar/frequencies/mthfr.htm).
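As an illustration of the kind of check described above, the following minimal sketch computes a one-degree-of-freedom chi-square test for HWE from genotype counts. It is not the procedure used in the study software; the function name and the genotype counts are hypothetical and are not study data.

```python
import math

def hwe_chi_square(n_wild, n_het, n_hom):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (1 df).

    n_wild, n_het, n_hom: observed counts of wild-type, heterozygous,
    and homozygous-variant genotypes.
    """
    n = n_wild + n_het + n_hom
    p = (2 * n_wild + n_het) / (2 * n)   # wild-type allele frequency
    q = 1.0 - p                          # variant allele frequency
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_wild, n_het, n_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts (not study data): an excess of homozygotes.
chi2, p_value = hwe_chi_square(40, 20, 40)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3g}")
```

A small p-value suggests a departure from HWE; with small subgroup samples, an exact test is often preferred over this chi-square approximation.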

Most Influential Predictors per Category-The Ensemble Method
Influential predictors were identified in three categories: genetic factors, demographic/environmental factors, and lifestyle factors (as indicated by health metrics) [48,49]. Decision tree models were built with the bootstrap forest method, and the most influential variables were then selected from the rank order of their column contributions [28][29][30][31]. The column contribution is presented using the G² statistic, which is derived from the conventional likelihood ratio chi-square statistic; chi-square is a test of goodness-of-fit between the expected count and the actual count. By the same token, G² indicates how well the expected count and the actual count classified into that group fit with each other.
The most crucial genetic predictors of cancer (Table 6) appeared to be MTRR A66G polymorphism and MTHFR deficiency. In the rank order of importance among the 10 demographic/environmental factors (Table 7), BMI ranked the highest, followed on the next level by marital status and race, then by exposure to pollution and gender, then by health insurance coverage and air quality in the community, and finally by convenience of access to health care, air quality in the home, and a tobacco smoker in the home. Our exploration found that age alone trumped all other potential predictors. However, this result is not informative, because it is well known that older people are more vulnerable to chronic health issues leading to cancer, and it cannot lead to any actionable item because nothing can be done to reverse aging. Thus, age was excluded from the exploratory analysis to allow other potential predictors to emerge. Among the 16 lifestyle/health metrics variables (Table 8), there was a sharp drop in G² after the first six rows; therefore, stress, physical activity minutes, time using alcohol, spiritual support, sleepiness, and functional role were considered the most important predictors.
In the second stage, dimension reduction, our strategy was to identify the most influential predictors within the three categories of genetic factors, demographic/environmental factors, and lifestyle factors (as indicated by health metrics). To select the most influential predictors within each category, we used the criteria of column contribution and variable importance. Both the ensemble method and the regression methods were run to identify potential predictors in each group and in each category. The misclassification rates of both models were compared to verify the function of a predictive model according to the genetic, demographic/environmental, and lifestyle categories.
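To make the G² statistic concrete, the sketch below computes the likelihood ratio chi-square, G² = 2 Σ O ln(O/E), for a contingency table such as the case/control counts on the two sides of a single tree split. This is only the textbook formula under an independence assumption for the expected counts; how the study software aggregates G² into column contributions is not shown here, and the example counts are hypothetical.

```python
import math

def g_squared(table):
    """Likelihood-ratio chi-square G^2 = 2 * sum(O * ln(O / E)).

    table: list of rows of observed counts, e.g. rows are the two branches
    of a split and columns are control/cancer. Expected counts E assume
    that rows and columns are independent.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if observed > 0:  # an empty cell contributes nothing to the sum
                g2 += observed * math.log(observed / expected)
    return 2.0 * g2

# Hypothetical split (not study data): rows = branch, columns = (control, cancer).
print(round(g_squared([[30, 12], [24, 42]]), 3))
```

A split that separates cases from controls well yields a large G², while a split that leaves both branches with the same case/control mix yields a G² of zero.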
For this sample, the random forest models outperformed the original logistic regression analyses for all three domains of factors, as presented by lower misclassification rates (Table 9).

Predictors for Gene-Environment Interaction
The most significant variables for gene-environment interactions were then taken into consideration simultaneously, and Table 10 presents the rank order of important factors by G² and portion from the combined bootstrap forest analyses of all three domain factors. G² is based on LogWorth and the likelihood ratio chi-square statistics, whereas portion is counted by how often the variable recurs in the repeated analyses. It is important to point out that, as with the scree plot in factor analysis, the decision to adopt the most important predictors is based on the overall pattern, i.e., how the variable stands out in G² relative to others, not on an absolute cut-off like the alpha level. It is noteworthy that the top four predictors are modifiable (BMI, physical activity, sleepiness, and spiritual support). Genetic factors (MTHFR deficiency and MTRR A66G polymorphism), which are non-malleable, rank number five and number nine for the total sample.
The role of important predictors in cancer was further examined by racial-ethnic subgroup to explore potential actionable factors for each group. Table 11 indicates that for Asians (n = 42) the number one predictor was sleepiness, followed by stress levels, then MTRR A66G polymorphism and physical activity levels. The outstanding G² suggests that sleepiness and stress trumped all other factors in predicting cancer for Asians. For Hispanics (n = 23) the top predictor was spiritual support, which trumped all other factors, as shown in Table 12. For Whites (Table 13, n = 34), the most important variables were physical activity, BMI, and alcohol use. Because there were only nine Black participants, there was not enough variation for resampling to construct a model using the bootstrap forest method.

Predictive Modeling for Gene-Environment Interactions-Generalized Regression Analysis
Using the most influential variables identified in Section 2.2, two generalized regression models were developed using leave-one-out cross-validation methods to predict the probability of cancer. Generalized regression is also known as penalized regression. As the name implies, the modeling process penalizes complicated models to avoid overfitting. Hence, compared with conventional regression modeling, generalized regression tends to yield an optimal model. In each case, the models were first compared to a logistic regression model with validation as a baseline. For model one, the parameter estimates along with the associated p-values for the baseline logistic regression with validation are shown in the left panel of Table 14, including the significant interaction term (BMI interacting with alcohol use) in addition to the total gene polymorphism score and other significant parameters. The regularized parameters remaining in the generalized regression elastic net models, one validated with Akaike's information criterion with correction (AICc) and one with leave-one-out cross-validation, are shown in the middle and right panels of Table 14, with the predictor alcohol use eliminated from the model, as indicated by the zero value for the estimate.
Table 14. Baseline logistic regression model and generalized regression elastic net models on the predictors of colorectal cancer from gene-environment interactions (of total gene polymorphisms).
The predictive performance of the generalized regression elastic net models can be characterized by examining the receiver operating characteristic (ROC) curve and the misclassification rates (Figure 1). The misclassification rate for the baseline logistic regression in the left panel (0.3714) was higher than those of the other two methods (0.2963 and 0.2804). The elastic net validation models thus outperformed the original logistic regression model on predictive accuracy, as shown by lower misclassification rates.
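The leave-one-out procedure and the misclassification rate can be sketched as follows. To keep the example self-contained, a toy nearest-class-mean rule on a single predictor stands in for the model; it is an assumption made for illustration only and is not the elastic net used in the study, and the data are hypothetical.

```python
def nearest_mean_fit(xs, ys):
    """Toy stand-in model: store the mean predictor value of each class."""
    means = {}
    for label in set(ys):
        values = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(values) / len(values)
    return means

def nearest_mean_predict(means, x):
    """Predict the class whose stored mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))

def loo_misclassification(xs, ys):
    """Leave-one-out cross-validation: refit without each observation in turn,
    predict the held-out observation, and return the share predicted wrongly."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = nearest_mean_fit(train_x, train_y)
        if nearest_mean_predict(model, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)

# Hypothetical predictor values and labels (0 = control, 1 = cancer).
values = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 0]
print(loo_misclassification(values, labels))
```

Because every observation is held out exactly once, the procedure uses all of a small sample for both fitting and validation, which is why it suits proof-of-concept sample sizes.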
The ROC curves are shown in Figure 1: the baseline logistic regression model in the left panel had an area under the curve (AUC) of 0.7817, and the generalized regression elastic net AICc and leave-one-out models in the middle and right panels had an AUC of 0.7652. In the elastic net models, alcohol use was the variable to leave out; however, BMI and alcohol had significant interactions. Therefore, as the base of the interactive variable, the BMI variable must remain in the model.
In a similar way to the previous model, in the second model we used elastic net models with AICc validation and with leave-one-out validation, together with a baseline logistic regression model with a validation column, by including the individual gene parameters and the significant interaction terms (gender with BMI, MTHFR C677T with BMI). The parameters for the logistic regression are shown in Table 15, and the ROC curves with AUCs are shown in Figure 2. As before, the generalized regression elastic net models outperformed the baseline logistic regression model with better predictive accuracy (lower misclassification rates and larger AUCs). In the elastic net model, BMI was the variable to leave out; however, BMI and gender status as well as BMI and MTHFR C677T polymorphism had significant interactions. Therefore, the BMI variable must remain in the model.
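For readers less familiar with the two criteria named above, the sketch below shows the standard definitions: the AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, and AICc as AIC plus a small-sample correction. These are textbook formulas, not output from the study software; the scores, labels, and log-likelihood are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the share of (case, control) pairs in which the case has the
    higher predicted score; tied pairs count as one half."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def aicc(log_likelihood, n_params, n_obs):
    """Akaike's information criterion with small-sample correction:
    AICc = 2k - 2 ln(L) + 2k(k + 1) / (n - k - 1)."""
    aic = 2 * n_params - 2 * log_likelihood
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)

# Hypothetical predicted probabilities and labels (1 = cancer, 0 = control).
print(roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
# Hypothetical fit: log-likelihood -50 with 3 parameters and 108 observations.
print(round(aicc(-50.0, 3, 108), 4))
```

Lower AICc favors a model, and the correction term grows as the number of parameters approaches the sample size, which penalizes complex models more heavily in small samples.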

In both predictive models of CRC, whether total gene polymorphisms or individual genes were included as the genetic factors in the gene-environment interactions, gender (more men than women in the CRC group compared to the control group) and BMI status (more overweight and obese status in the CRC group than in the control group) were consistent predictors. In the model where the total gene polymorphism score was used for the prediction of CRC, alcohol use (more use in the CRC group than in the control group) was interactive with BMI status. In the model where the single genes were included, the BMI variable was interactive with both gender and MTHFR C677T polymorphism, and exposure to pollution was an additional predictor of CRC. These predictive models were also run for each racial-ethnic subgroup. However, we did not observe stable results because of the limited number of samples per subgroup. Therefore, the subgroup analyses of the predictors of CRC from gene-environment interactions are not presented.


Discussion
We presented the gene-environment interactions and predictors of CRC by including key genes in the one-carbon metabolism pathways, with environmental and lifestyle factors, by using various analytics to validate the findings across the methods. Using the ensemble method, the most influential factors included gene polymorphisms of MTRR A66G and MTHFR, and lifestyle factors such as BMI, exposure to pollutants, and gender. Using the most influential factors, the two best predictive models were also generated using the generalized regression models and leave-one-out cross validation methods. With the machine learning approach, these models included a random validation dataset to yield more reliable prediction. For the prediction of CRC, BMI status and gender were consistent predictors in the models. The use of alcohol (more use in the CRC group) interacted with BMI status in predicting CRC. BMI status was also interactive with both gender and MTHFR C677T polymorphism in predicting CRC. Also, the exposure to pollutants was an additional predictor of CRC.
While previous studies have presented gene-environment interactions, associating genes in the one-carbon metabolism pathways with folate deficiency [24,25,27] and CRC [24,27], new predictive modeling and validation analytics with interactions have become readily available for convenient use through SAS JMP programming (SAS Institute, Cary, NC, USA). Therefore, we included the gene-environment interactions between the modifiable factors and the genes in our analytic approach to examine potential epigenetic mechanisms. Overall, the CRC group had more combined gene polymorphisms than the control group, including MTHFR C677T, MTR A2756G, MTRR A66G, and DHFR 19bp, but not MTHFR A1298C. Additional modifiable factors for CRC risk included BMI status, exposure to pollutants, and alcohol use.
We presented the distributions of the genotype alleles for five genes in the one-carbon metabolism pathway for four racial-ethnic groups. In addition to the four gene polymorphisms (MTHFR C677T and A1298C, MTR A2756G, and MTRR A66G) that were presented for CRC cases [24,27] and in numerous meta-analyses [10][11][12][13], we included DHFR 19 bp deletion as an additional gene in the folate-metabolism pathway. DHFR 19 bp in the folate methylation pathway has not been presented before for CRC cases in various ethnic groups. These four ethnic groups presented different polymorphism patterns for these five genes.
As a proof-of-concept study examining gene-environment interactions for cancer prevention, we used the ensemble method, as it is a well-known remedy for small-sample studies that validates the analyses on random subsets of the sample [45]. We further used the generalized regression method, integrating significant parameters and bivariate interactions, to maximize model quality with the simplest optimal model. We did not have a sufficient number of subjects in the ethnic subgroups, especially the Black sample, to identify the most influential predictors or to run subgroup analyses using the generalized regression model. Therefore, further studies with larger samples are needed to validate these findings for various ethnic groups.
We presented the very first study cross-validating the findings using both conventional inferential statistics and the ensemble method to predict the risk of CRC. While there are limitations to family-based, case-control designs because of genetic associations among the family members, we used the family-based analysis technique to explore and control for the family associations. Despite these limitations, including family members in community-based studies has methodological advantages. First, the inclusion of family members can reinforce the active participation of the family as an ecological unit and yield more reliable reporting of modifiable lifestyle or environmental parameters [54,55]. Involving family members in a community-based study can also facilitate support from family members for patients, with a heightened awareness within the family unit of the importance of modifiable lifestyles, thus helping them to adopt healthier lifestyles. The validity of research observations is also strengthened in that patient lifestyles are better monitored with the increased awareness of the family unit. Therefore, the rigor and reliability of the data are enhanced, supporting sustainable interventions with behavioral improvements.
In addition to the genetic factors, our results point to a list of modifiable lifestyle and environmental factors [33][34][35][36] in relation to the gene-environment interactions for the prevention of CRC. The top modifiable factors included BMI status, environmental pollution, and alcohol use. Recent studies, including meta-prediction studies that examined gene-environment interactions, consistently reported that increased air pollution is associated with increased gene polymorphism and with trends toward increased disease risk across various disease conditions, especially for MTHFR C677T polymorphisms and genes in the methylation pathways [28][29][30][31][32][33][34][35]. Environmental toxicants such as air pollution and smoking can induce oxidative stress and dysregulate reactive oxygen species [28][29][30]. Studies suggested that exposure to oxidative stress damaged cellular DNA, leading to mutations, genomic instability, and ultimately malignancy [28][29][30]. Building on these findings, future studies may focus on the epigenetics of methyl donors to detoxify the hazards from environmental pollution, with healthy lifestyles and weight-based interventions to prevent CRC. Additionally, future research can be designed to examine environmental pollutants and lifestyles with gene-environment interactions in CRC prevention.

Study Population and Setting
We included 108 participants, 54 CRC cases and 54 matched family/friend controls, by accessing the California Cancer Registry (CCR) database and by adding cases through referrals from the participants. The study was approved by the appropriate Human Subjects Institutional Review Boards (IRB): the California State Committee for the Protection of Human Subjects for data access through the CCR (CPHS-12-12-1007, approved 2013-2019), and the local educational institutions (Azusa Pacific University, approved 2013-2015; Augusta University, 806069-7, approved 2015-2018). To qualify for the study, CRC cases had to: (1) not be at the terminal stage of cancer with death expected within six months, (2) be 18-80 years of age, and (3) have a family member living with or near the case for over one year. Family members had to: (1) be 18-80 years of age, (2) not have CRC, and (3) not be at the terminal stage of another illness with death expected within six months. Both the case and the family member had to have adequate cognitive and mental capacities, and be willing to participate in the interviews and provide a biological sample for genotyping. The CRC cases were survivors, having been diagnosed with CRC for at least two years by the time the CCR released their data. CRC cases and their families were screened based on the inclusion criteria.
Given that a diverse racial-ethnic population resides in southern California, we aimed to recruit at least five families per racial-ethnic group, representing the proportions of various populations in southern California. Following the approval by the IRBs, CRC cases were screened and randomly selected by systematic stratification based on the racial-ethnic groups from the roster databases provided by the CCR. The qualified cases were contacted through the established procedures required by the CCR, with an introduction letter followed by phone contact. Family/friend members residing with or near the CRC cases were recruited along with the CRC cases. Most families were visited at their homes for data collection, while a few families visited the campus to participate in data collection.

Demographic/Environmental and Lifestyle Data
Participants were interviewed with items from standardized instruments for health-related lifestyle status [33], following the framework of My Own Health Report (MOHR). The MOHR project included a web-based survey with a list of health metrics including health behaviors and lifestyles. The intent of the MOHR project was to harmonize the national health metrics databases with a minimum dataset in primary care settings. For this project, the health metrics elements from the MOHR project were used to evaluate lifestyles in relation to the polygenic one-carbon metabolism pathways. Family history, functional capacities, cancer risks and activities, and demographics were collected using items summarized from the Centers for Disease Control and Prevention (CDC) 1999-2012 National Health and Nutrition Examination Survey and National Health Interview Survey [50]. Community environment and health were collected using the items listed in the integrated prevention frameworks of the Institute of Medicine [51] and the World Health Organization [52] for cancer prevention. The family pedigrees were completed with family history data using the standard process established by the Coalition for Health Professional Education in Genetics [48].

Genotyping Data
Data sent to the laboratories were de-identified. Laboratory staff members were blinded to the case-control and other status of the samples to enhance the objectivity of the laboratory analyses. Following data collection, the specimens were stored on ice and sent to the laboratory via express mail in containers with dry ice. Once they arrived at the laboratory, specimens were kept frozen in a deep freezer at −80 °C until analysis.
Genotyping procedures have been described elsewhere [56,57]. Briefly, genomic DNA was isolated from salivary samples using the SK-1 swab and Isohelix collection tubes with dry capsules (Boca Scientific, Boca Raton, FL, USA), and/or from peripheral blood samples using the Qiagen Blood DNA Kit (Qiagen Inc., Valencia, CA, USA). The Taqman technique [56] was used for genotyping of the gene polymorphisms using allele-specific fluorescent probes with a StepOnePlus™ real-time polymerase chain reaction system (Thermo Fisher Scientific, Waltham, MA, USA). Quality control was strictly conducted, with four duplicate positive controls and four negative controls loaded in each 96-well plate. Additionally, genotyping assays were repeated on 10% of the samples, for which duplicate salivary and blood specimens were available, and the genotyping results were in 100% agreement for the repeated tests. The genotyping results for the five genes were shared with the participants as soon as they became available, within six months of data collection.
MTHFR enzyme deficiency was calculated by adding up the total loss of enzymatic function from both MTHFR C677T and A1298C polymorphisms: 35% for 677 CT and 70% for 677 TT polymorphisms, and 15% for 1298 AC and 30% for 1298 CC variants [20,21,58]. The total gene polymorphism score across the five genes was computed with a possible range of 0-10, scoring one for a heterozygous and two for a homozygous polymorphism on each of the five genes included in this study.
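The two scores described above can be written out as a short sketch. The percentages and the 0-10 range come from the text, and the cut-off of 4 is the median split reported in the Results; the function names and genotype string encoding are assumptions made for illustration.

```python
# Percent loss of MTHFR enzyme function per genotype, as stated in the text.
C677T_LOSS = {"CC": 0, "CT": 35, "TT": 70}
A1298C_LOSS = {"AA": 0, "AC": 15, "CC": 30}

def mthfr_deficiency(c677t, a1298c):
    """Composite percent loss of MTHFR function from both polymorphisms."""
    return C677T_LOSS[c677t] + A1298C_LOSS[a1298c]

def total_polymorphism_score(variant_allele_counts):
    """Sum of variant alleles over the five genes (range 0-10).

    Each entry is 0 (wild type), 1 (heterozygous), or 2 (homozygous).
    """
    if len(variant_allele_counts) != 5:
        raise ValueError("expected one value for each of the five genes")
    if any(count not in (0, 1, 2) for count in variant_allele_counts):
        raise ValueError("each gene must be scored 0, 1, or 2")
    return sum(variant_allele_counts)

def high_polymorphism(score, cutoff=4):
    """Median split reported in the Results: scores >= 4 versus < 4."""
    return score >= cutoff

# Hypothetical participant: 677 CT, 1298 AC, plus three further genes.
print(mthfr_deficiency("CT", "AC"))               # 35 + 15
score = total_polymorphism_score([1, 1, 0, 2, 1])
print(score, high_polymorphism(score))
```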

Data Analysis
Our data analysis combined exploratory family-based analysis [44], which adjusts for the effects of shared genetic heritage within the family, with three stages of data visualization and understanding, data reduction, and model building, using JMP Pro 13 (SAS Institute, Cary, NC, USA) [59,60]. In the first stage of data visualization and understanding, we used bootstrap forest, also known as bagging (i.e., bootstrap aggregating), which is one of the most popular ensemble methods [24-27]. The ensemble methods are based on the logic of resampling, which is a well-known remedy for small-sample studies [45]. In resampling, the sample is treated as a virtual population, and different subsets are then randomly drawn from the sample for multiple analyses. Bias can be observed and corrected by such repeated analyses on random subsets [46].
The ensemble method is a resampling technique that synthesizes analyses of many subsets of the original data. This approach is superior to conventional regression modeling because ordinary least squares regression or logistic regression analyses tend to yield an overfitted model. Numerous studies have confirmed that the ensemble approach outperforms any single model, such as regression or univariate statistics [61-63]. In addition, conventional statistical procedures are limited by the sample size. If the number of parameters to be estimated exceeds the degrees of freedom, the regression model would be highly unstable. The ensemble method is based on machine learning, in which datasets are partitioned and analyzed by different models. Each model is considered a weak learner, and the final solution is a synthesis of all these weak learners. When different models are generated by resampling, inevitably some are high-bias models (underfit) while others are high-variance models (overfit). Each model also carries a certain degree of sampling bias, but in the ensemble these errors tend to cancel each other out [62].
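The resampling logic behind bagging can be sketched as follows. This is a simplified illustration with hypothetical data and names, using one-variable decision stumps as the weak learners; it is not the bootstrap forest implementation in JMP Pro, which grows full decision trees.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Weak learner: the single threshold on x that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for above in (0, 1):  # label predicted when x >= threshold
            errors = sum(
                (above if x >= threshold else 1 - above) != y for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above


def bagged_classifier(rows, n_models=51, seed=0):
    """Fit one stump per bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_models):
        # Bootstrap: draw len(rows) observations with replacement.
        sample = [rng.choice(rows) for _ in rows]
        stumps.append(fit_stump(sample))

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical data: (BMI, case status), invented for illustration only.
data = [(21, 0), (22, 0), (23, 0), (24, 0), (27, 1), (29, 1), (31, 1), (33, 1)]
predict = bagged_classifier(data)
low_bmi_prediction = predict(20)
high_bmi_prediction = predict(35)
```

Each stump sees a slightly different bootstrap sample, so individual stumps may place the threshold poorly, but the majority vote is far more stable than any single stump.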
In the second stage, dimension reduction, our strategy was to identify the most influential predictors within the three categories of genetic factors, demographic/environmental factors, and lifestyle factors (as indicated by health metrics). To select the most influential predictors within each category, we used the criteria of column contribution and variable importance. Both the ensemble method and the regression methods were run to identify potential predictors in each category. The misclassification rates of the two models were compared to verify the performance of the predictive model in each of the genetic, demographic/environmental, and lifestyle categories. As shown in Table 9, the bootstrap forest model outperformed the original logistic regression model in all three domains, with lower misclassification rates per category. Using the bootstrap forest ensemble method, G² and the portion of column contribution per variable were used to present the rank order of importance.
In the final stage of model prediction, we used generalized regression to obtain a smaller prediction error [59]. The most significant variables and significant interactions were visualized using the interaction profilers for bivariate interactions of the three categories of variables, and the final set of significant variables was selected for the tested models. The prediction profiler enables the analyst to ask "what if" questions. Specifically, the analyst manipulates the levels of the included variables to see how the model changes. By doing so, we can understand how the interaction of various factors affects the outcome and the sensitivity of the model. Generalized regression is also known as penalized regression, meaning that the variable selection process penalizes complexity. To obtain the optimal model, the algorithm imposes a penalty on the model when redundant predictors are included. The index of complexity is AIC or AICc [64-66], developed by Hirotsugu Akaike [67,68], which is in alignment with Ockham's razor: all things being equal, the simplest model tends to be the best one, and simplicity is a function of the number of adjustable parameters. Specifically, AIC is a fitness index for trading off the complexity of a model against how well the model fits the data. The general form of AIC is AIC = 2k − 2 ln(L), where k is the number of parameters and L is the maximized value of the likelihood function of the estimated model. Increasing the number of free parameters to be estimated improves the model fitness; however, the model might become unnecessarily complex. To reach a balance between fitness and parsimony, AIC not only rewards goodness of fit but also includes a penalty against over-fitting and complexity. Hence, the optimal model is the one with the lowest AIC value.
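The trade-off between fit and parsimony can be made concrete with a short sketch. The log-likelihood values and parameter counts below are hypothetical, chosen only to show how the penalty term works; they are not results from this study.

```python
def aic(log_likelihood, k):
    """Akaike's information criterion: AIC = 2k - 2 ln(L).

    log_likelihood is ln(L), the maximized log-likelihood of the model,
    and k is the number of estimated parameters.
    """
    return 2 * k - 2 * log_likelihood


# Hypothetical comparison: the larger model fits slightly better
# (higher log-likelihood) but uses three more parameters.
aic_small = aic(log_likelihood=-60.0, k=3)   # 2*3 + 120
aic_large = aic(log_likelihood=-58.5, k=6)   # 2*6 + 117

preferred = "small" if aic_small < aic_large else "large"
```

Here the smaller model has AIC = 126 and the larger model has AIC = 129, so the smaller model is preferred: the gain in likelihood from the extra parameters does not offset the penalty.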
Since AIC attempts to find the model that best explains the data with a minimum of free parameters, it is considered an approach favoring simplicity. In this sense, AIC is better than R², which never decreases as additional variables enter the model and thus favors complexity; adjusted R² only partly corrects for this tendency. By contrast, AIC does not necessarily improve when variables are added. Rather, it varies based upon the composition of the predictors, and thus it is a better indicator of model quality [47]. Burnham and Anderson recommend replacing AIC with AICc [64,65], especially when the sample size is small and the number of parameters is large. Because AICc converges to AIC as the sample size grows, AICc can be used regardless of the sample size and the number of parameters. JMP Pro offers several classes of estimation methods, including lasso, forward selection, and elastic net [69], and several validation methods. We chose AICc validation and leave-one-out cross validation because of their effectiveness for small data sets [70]. Model performance was assessed using the misclassification rate (smaller is better), AICc, and the area under the ROC curve.
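The small-sample correction and the leave-one-out procedure can be sketched as follows. The sketch uses the standard correction AICc = AIC + 2k(k + 1)/(n − k − 1); the data, the nearest-class-mean classifier, and all function names are hypothetical stand-ins for illustration, not the generalized regression models fitted in JMP Pro.

```python
def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return 2 * k - 2 * log_likelihood + (2 * k * (k + 1)) / (n - k - 1)


def fit_nearest_mean(rows):
    """Toy classifier: assign x to the class whose mean of x is closest."""
    means = {}
    for label in {y for _, y in rows}:
        values = [x for x, y in rows if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def loocv_misclassification(rows, fit):
    """Leave-one-out cross validation: hold out each row once, refit, and test."""
    errors = 0
    for i, (x, y) in enumerate(rows):
        training = rows[:i] + rows[i + 1:]
        predict = fit(training)
        errors += predict(x) != y
    return errors / len(rows)


# The correction shrinks toward zero as n grows (hypothetical lnL and k).
small_n = aicc(log_likelihood=-60.0, k=3, n=108)
large_n = aicc(log_likelihood=-60.0, k=3, n=1_000_000)

# Hypothetical data: (BMI, case status), with two overlapping observations.
data = [(21, 0), (22, 0), (23, 0), (28, 0), (27, 1), (29, 1), (31, 1), (33, 1)]
loo_rate = loocv_misclassification(data, fit_nearest_mean)
```

With n = 108 and k = 3 the correction adds 24/104 (about 0.23) to the AIC of 126, whereas with a very large n it is negligible. In the toy data, the two overlapping observations are misclassified when held out, giving a leave-one-out misclassification rate of 2/8 = 0.25.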