Personalized Nutrition—Genes, Diet, and Related Interactive Parameters as Predictors of Cancer in Multiethnic Colorectal Cancer Families

To inform personalized nutrition, this study examined five key genes in the folate metabolism pathway, dietary parameters, and related interactive parameters as predictors of colorectal cancer (CRC) in multiethnic families, with diet quality measured by the healthy eating index (HEI). The five genes included methylenetetrahydrofolate reductase (MTHFR) 677 and 1298, methionine synthase (MTR) 2756, methionine synthase reductase (MTRR) 66, and dihydrofolate reductase (DHFR) 19bp, and they were used to compute a total gene mutation score. We included 53 families: 53 CRC patients and 53 paired family/friend members from diverse population groups in Southern California. We analyzed the multidimensional data using the ensemble bootstrap forest method to identify variables of importance within the genetic, demographic, and dietary domains and thereby achieve dimension reduction. We then constructed predictive generalized regression (GR) models with a supervised machine learning validation procedure, with cancer status specified as the target variable, to validate the results and enhance prediction and reproducibility. The CRC group had higher total gene mutation scores than the family members (p < 0.05). Using GR with Akaike's information criterion and Leave-One-Out cross validation, the HEI interacted with thiamine (vitamin B1), which is a new finding for the literature. The natural food sources of thiamine include whole grains, legumes, and some meats and fish, which HEI scoring counts as healthy portions (as opposed to the portions to be limited: salt, saturated fat, and empty calories). Additional predictors of CRC included age, gender, and the interaction of MTHFR 677 with overweight status (measured by body mass index), with the cancer group having more men and more overweight cases. The HEI score was significant when split at the median score of 77 into higher and lower scores, as confirmed through the machine-learning recursive tree method and predictive modeling, although an HEI score greater than 80 is the US national standard for a good diet. The HEI and healthy eating are modifiable factors for healthy living in relation to dietary parameters and cancer prevention, and they can be used for personalized nutrition in the precision-based healthcare era.


Introduction
Chronic inflammation is a major risk factor affecting colon and rectum health and is therefore important in the prevention of colorectal cancer (CRC) [1][2][3][4][5][6]. CRC is the most preventable cancer among men and women worldwide [7]. The most significant contributing factors in CRC development have been recognized ethnic groups. We analyzed multidimensional data using the ensemble method [56][57][58][59] to identify variables of importance within the genetic, demographic, and dietary domains. We then constructed predictive generalized regression (GR) models with a supervised machine learning validation procedure, with cancer status specified as the target variable, to validate the results for enhanced prediction and reproducibility [60][61][62][63].

Study Population and Setting
A total of 106 human subjects participated and completed the dietary data instruments: 53 CRC patients and 53 paired family/friend members. We accessed the California Cancer Registry (CCR) database and obtained additional cases through referrals by the participants. The human subjects protocol was approved by the appropriate Institutional Review Boards (IRBs): the California State Committee for the Protection of Human Subjects for data access through the CCR (CPHS-12-12-1007, approved 2013-2019) and the local educational institutions (Azusa Pacific University, approved 2013-2015; Augusta University, 806069-7, approved 2015-2018) [55] (see Supplementary file for Informed Consent Form). The inclusion criteria have been reported previously [55] and are summarized as follows: participants had to (a) be expected to live for at least 6 months; (b) be 18-80 years of age; (c) have a family/friend member nearby to act as the case and family/friend pair; (d) have adequate cognitive and mental capacities; and (e) be willing to participate in the interviews and provide biological samples for the genotyping data collection.

Demographic and Genetic Measurements
The measurements and instruments used in this study have been reported previously [55], including the health-related lifestyle and dietary status [64], family history, functional capacities, cancer risks and activities, demographics [65], and family pedigrees (www.nchpeg.org) [66]. The five genes in the folate metabolism-related pathway included in the study were the MTHFR gene polymorphisms C677T (rs1801133) and A1298C (rs1801131), which are involved with MTHFR enzyme function and can elevate homocysteine levels [28][29][30][31]; DHFR 19 base pairs (19 bp) (rs70991108), which is involved in the conversion of folic acid into methylenetetrahydrofolate (MTHF), the usable folate form [32,33]; and MTR A2756G (rs1805087) and MTRR A66G (rs1801394), which convert/recycle homocysteine back to usable MTR for the methylation cycle [34][35][36][37]. Gene mutations in folate metabolism-related pathways could lead to the loss of functions related to the methylation process [55]. The total gene polymorphism score for the five chosen genes in the folate methylation pathways ranged from 0 to a maximum of 10, reached if each of the five genes had homozygous polymorphisms. MTHFR enzyme deficiency was calculated by combining the loss of enzyme function from MTHFR C677T (a loss of 35% for each of the two T polymorphic alleles) and MTHFR A1298C (a loss of 15% for each of the two C polymorphic alleles) to give a composite score for the two polymorphisms [55,67]. The genotyping procedures, which used the TaqMan technique, have been described previously [55,68,69].
The HEI, issued by the US Department of Agriculture (USDA) and based on the Dietary Guidelines for Americans (DGA) standards for a healthy lifestyle, includes items for healthy portions of various quality food groups and limited portions of unhealthy food groups. The HEI is composed of 12 scored components, which include the 5 major food groups: fruit (total and whole), vegetables (total and greens/beans), grains (total and whole), dairy, and protein, along with oils and nuts, in addition to limits on the intake of saturated fats, sodium, and empty calories. The total HEI score is the sum of the components and ranges from a minimum of 0 to a maximum of 100. A score of 0-50 indicates a poor diet; 51-80 indicates a moderate diet quality that needs improvement; and a score greater than 80 indicates a good diet [42]. The recommended daily intakes (RDI) are issued by the Food and Nutrition Board of the Institute of Medicine, which recommends the average daily levels of intake that are sufficient to meet the nutrient requirements of most healthy people based on gender and age [43]. Macronutrients include carbohydrates, protein, total and saturated fat, and cholesterol. Micronutrients include the B vitamins (B9 (folate), B1 (thiamine), B2 (riboflavin), B3 (niacin), B6, and B12), vitamins A, C, D, and E, calcium, magnesium, iron, zinc, and methionine [75].
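The HEI cut-offs quoted above translate directly into a small classifier. The sketch below is illustrative only; the function name and category labels are hypothetical, and it covers only the total score, not the 12 component scores.

```python
def hei_category(score):
    """Classify a total HEI score (0-100) using the cut-offs quoted in the text."""
    if not 0 <= score <= 100:
        raise ValueError("total HEI score must lie between 0 and 100")
    if score <= 50:
        return "poor"
    if score <= 80:
        return "needs improvement"
    return "good"


print(hei_category(77))  # needs improvement
print(hei_category(81))  # good
```

Note that a score of 77 falls in the "needs improvement" band, below the threshold of greater than 80 for a good diet.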

Data Analysis
The details of data analysis have been presented previously [55] and are summarized in the following text. We employed various methods, including the visualization and identification of data patterns related to family dependence [76], the ensemble method to identify variables of importance for the dimension reduction of multidimensional data, and predictive model building using JMP Pro 13 (SAS Institute, Cary, NC, USA) [77,78]. Influential predictors were identified using bootstrap forest prediction modelling in three categories: genetic, demographic and lifestyle, and dietary intake factors. Column contribution and variable importance were examined within each category. From the rank order of column contributions, the most influential variables were selected using the bootstrap forest method as variables of significance [56][57][58][59][77,78]. The column contribution was presented using the G² statistic for classification accuracy, which is derived from the conventional likelihood ratio χ² statistic. However, unlike χ² analysis, G² results are not subject to sample size effects. χ² is a test of goodness-of-fit between the expected count and the actual count. By the same token, G² indicates how well the expected count and actual count are classified into those groups. Ensemble methods included bootstrap forest and recursive trees [45][46][47][48], which are suited for small-sample studies [79], with a machine learning approach [80]. Ensemble methods have been shown to outperform single models, including regression or univariate statistics [81,82]. The misclassification rates of each model were compared to verify the function of a predictive model for the genetic, demographic, and dietary categories.
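To make the two statistics mentioned above concrete, the sketch below computes the textbook likelihood-ratio G² and the Pearson χ² for a contingency table of observed counts. It is an illustration under stated assumptions, not the JMP Pro implementation: how JMP accumulates G² into column contributions across the trees of a bootstrap forest is not reproduced here, and the counts in the example are made up rather than taken from the study.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def g_squared(table):
    """Likelihood-ratio statistic: G^2 = 2 * sum(O * ln(O / E))."""
    expected = expected_counts(table)
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute 0 (limit of O * ln O as O -> 0)
    )


def pearson_chi_squared(table):
    """Pearson statistic: X^2 = sum((O - E)^2 / E)."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical 2x2 table: rows = predictor level, columns = control/cancer.
counts = [[30, 23], [20, 33]]
print(g_squared(counts), pearson_chi_squared(counts))
```

For a table whose rows and columns are exactly independent, both statistics are zero.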
We then utilized GR with supervised machine learning validation, because the target variable had been specified, to obtain a smaller prediction error [77]. The index of complexity, Akaike Information Criterion with correction (AICc), was used [83][84][85][86][87] to test the fitness of the models, with smaller AICc values indicating optimal models. AICc outperformed the R² and adjusted R² methods, which tend to favor complexity for the model quality [65]. We used the Elastic Net [88] and validation methods including the AICc validation and Leave-One-Out (LOO) cross validation methods due to their effectiveness on small data sets [89]. We assessed the model performance using the misclassification rate (smaller is better), AICc, and the area under the receiver operating characteristic (ROC) curve (AUC). The primary criterion was the fitness indicator with AICc to counteract the common problem in traditional statistics: overfitting. A well-predicted model might be an overfitted model, and thus, predictive accuracy is the secondary criterion and was determined using the misclassification rate and AUC.
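The three performance criteria named above can be written out in a few lines. The sketch below uses the standard definitions (AICc = AIC + 2k(k + 1)/(n − k − 1), where AIC = 2k − 2 ln L, k is the number of estimated parameters, and n is the sample size) and a rank-based AUC. It is a minimal illustration, not the JMP Pro computation; the function names, the 0.5 classification threshold, and the toy inputs are assumptions.

```python
def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for a model with k parameters and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


def misclassification_rate(y_true, p_hat, threshold=0.5):
    """Share of cases whose predicted class (p >= threshold) differs from the truth."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(y_true, p_hat))
    return wrong / len(y_true)


def auc(y_true, p_hat):
    """Probability that a random case is scored above a random control (ties count 0.5)."""
    cases = [p for y, p in zip(y_true, p_hat) if y == 1]
    controls = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum(
        1.0 if pc > pn else 0.5 if pc == pn else 0.0
        for pc in cases
        for pn in controls
    )
    return wins / (len(cases) * len(controls))


# Toy labels and predicted probabilities (not study data).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
print(misclassification_rate(y, p))  # 0.25
print(auc(y, p))                     # 0.75
```

The correction term grows as k approaches n, which is why AICc penalizes complex models more heavily than AIC in small samples.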
GR is also known as penalized regression, meaning that the variable selection process penalizes complexity. To get the optimal model, the algorithm imposes a penalty on the model when redundant predictors are included. When there are several collinear predictors, the least absolute shrinkage and selection operator (LASSO) selects just one and ignores the others, or zeroes out some regression coefficients. The Ridge method counteracts collinearity and variance inflation by shrinking the regression coefficients towards zero, but not exactly to zero. The Elastic Net method combines the penalties of both the LASSO and Ridge approaches. While LASSO might shrink the coefficient of an unimportant variable all the way down to zero and Ridge just shrinks it towards zero, Elastic Net is in the middle, and thus, it tends to yield the optimal model by balancing variance and bias. With the use of early stopping, Elastic Net is suitable for handling a data set with many variables and few observations. In Elastic Net, a stage-wise algorithm called LARS-EN (least angle regression of Efron et al., 2004) [90,91] efficiently finds the best solution path. Unlike linear least squares, when estimating the unknown parameters in a linear regression model, GR can simply zero out certain unused predictors [92][93][94][95]. In a linear regression model, the p-values could at most be 0.9999, but not exactly 1. However, when all permutations are exhausted, as is done in an exact test, the probability could be exactly 1. In a similar vein, GR exhausts different paths to find the best model. When the full model has a mixture of important and unused predictors, the p-value cannot be 1. However, when the data can be perfectly described by the restricted model that results from path searching, the probability of observing the data can be 1.
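The relationship among the three penalties can be written compactly. The expression below uses one common parameterization of the Elastic Net objective and is offered as a sketch; the exact parameterization used by JMP Pro, and the weights added by the adaptive variants, may differ. Here \ell(\beta) denotes the log-likelihood, \lambda \ge 0 the overall penalty strength, \alpha \in [0, 1] the mixing weight, and p the number of predictors; all symbols are notation introduced for this illustration.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\Bigl\{ -\ell(\beta) \;+\; \lambda \Bigl[\, \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
\;+\; \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \Bigr] \Bigr\}
```

Setting \alpha = 1 leaves only the absolute-value term (LASSO, which can set coefficients exactly to zero), \alpha = 0 leaves only the squared term (Ridge, which shrinks coefficients towards zero without eliminating them), and intermediate values give the Elastic Net compromise described above.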
When developing a GR model for prediction, the first type of model presented in JMP Pro 13 is a logistic regression (LR) model, because it is the default estimation method. After this default method, other model launches can be pursued by choosing a variety of estimation methods (LASSO, Elastic Net, and others) and associated validation methods (a validation column, minimum AICc, LOO validation, and others) [90,96,97]. We chose the AICc validation and LOO cross validation methods because of their effectiveness for small data sets [98]. In effect, the default LR method could be characterized as an explanatory model, whereas the other GR estimation methods might best be characterized as predictive models. An explanatory model is typically used to explain the associations between the model parameters and the model response in order to test causal hypotheses, whereas a predictive model is used to predict future observations [99]. The nature of the model objectives (causal versus predictive) directly influences the underlying algorithms, which can lead to different results from models using the same set of initial parameters. Typically, with an explanatory model, the set of statistically significant parameters is identified for a final model. The predictive model using GR pursues methods to shrink coefficients towards 0, in part to guard against overfitting the model. For model prediction in GR analysis, continuous variables are recoded into new dichotomous variables, grouped by either median distribution or known score criteria, such as those related to healthy eating.
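The recoding step described in the last sentence can be sketched as follows. This is a minimal illustration, not the study's procedure: whether a value equal to the cut-off belongs to the upper or lower group is not stated in the text, so placing it in the upper group is an assumption, and the function name and sample scores are hypothetical.

```python
from statistics import median


def dichotomize(values, cutoff=None):
    """Recode a continuous variable as 0/1 around a cut-off.

    If no cut-off is given, the sample median is used. Values equal to the
    cut-off are placed in the upper group (an assumption, see lead-in).
    Returns the cut-off actually used and the list of 0/1 flags.
    """
    if cutoff is None:
        cutoff = median(values)
    return cutoff, [1 if v >= cutoff else 0 for v in values]


# Made-up HEI scores for illustration only.
scores = [60, 70, 77, 85, 90]
print(dichotomize(scores))             # (77, [0, 0, 1, 1, 1])
print(dichotomize(scores, cutoff=80))  # (80, [0, 0, 0, 1, 1])
```

The second call shows the alternative of grouping by a known score criterion instead of the sample median.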
The interactive prediction profilers were used to visualize the direction of association between two parameters (a predictor or factor with the outcome variable of healthy eating status or health outcomes in the profiler) or among three parameters (set of interactive variables with non-parallel distribution in addition to the outcome status of healthy eating or health outcomes in the interactive profilers). The visualization of the interactive profilers enables the analyst to ask "what-if" questions. Specifically, the analyst manipulates the levels of included variables to see how the model changes. By doing so, we can understand how the interaction of various factors affects the outcome and the sensitivity of the model.

Table 1 presents the comparisons of the key demographic factors between the control and cancer groups. The significantly different parameters between the control and cancer groups included gender, age, and total number of gene polymorphism mutations (all p < 0.05). We previously reported the distribution of the polymorphisms for the control and cancer groups and the four racial-ethnic subgroups [55] using the Hardy-Weinberg equilibrium (HWE) analysis. The total gene mutation score presented a median split between <4 and ≥4 for this sample and was significantly increased for the CRC group compared to the family/friend controls (p < 0.05) (Table 1). The comparisons of demographic factors across the racial-ethnic subgroups are presented in Supplementary Table S1. Based on the body mass index (BMI) measurement, more than 50% of Hispanic and Black participants in this study were obese, a much greater proportion than in the White (29%) and Asian (0%) samples (p < 0.0001). More Whites in this study drank alcohol than the other three racial groups (p = 0.0007).
In regard to the total gene mutation score on the five genes in the folate metabolism-related pathway, more Asian and White participants had greater total gene mutation scores than Hispanic and Black participants (see Supplementary Table S1).

Dietary Parameters
In the comparisons of dietary parameters, no items were significantly different between the control and case groups in the HEI (Table 2) or RDI (Table 3). However, in terms of the differences in HEI parameters between racial groups, Asians had greater total fruit intakes (2.3 cups) and whole fruit intakes (1.6 cups) compared to the other three racial groups (both p < 0.001). Caucasians had the next highest fruit intakes (1.3 cups of total fruit and 1.01 cups of whole fruit), and Hispanics and African Americans had similarly low fruit intakes (see Supplementary Table S2). Another significant difference between racial groups was sodium intake (p < 0.05). While all racial groups consumed more than the RDI level for sodium, Asians had the highest sodium intake at 3.79 g, followed by Hispanics with 3 g, then Caucasians with 2.8 g, and African Americans with 2 g (see Supplementary Table S2). In all four racial groups, more than half of the sample ate more than 45% of the RDI for carbohydrates (p < 0.05), with Asians having the highest proportion (85%), followed by African Americans (77.8%), Hispanics (65.2%), and lastly, Caucasians (55.9%). Another significant dietary parameter was total fat. Hispanics had the highest proportion (52.2%) consuming more than 35% of their total calories from fat and thus exceeding the RDI, followed by Caucasians (47.1%), then African Americans (33.3%), and Asians (15%). For saturated fat, more African Americans (77.8%), Caucasians (61.8%), and Hispanics (56.5%) than Asians (35%) consumed more than the RDI (p < 0.05) (see Supplementary Table S3).

Most Influential Predictors of Variables of Importance
Through the identification of the variables of importance, the most crucial predictors from the genetic, demographic, and dietary categories were identified. In terms of dietary parameters, all individual parameters involved in the HEI and RDI were tested. An HEI score of 77, the median split for this study sample, was used as the significant dietary predictor (instead of an HEI score of 80). The most crucial dietary variables of importance appeared in rank order (see Supplementary Table S4) as the total vegetable intake [10 ounce (oz)], followed by the total folate intake (100% RDI), vitamin B12 (150% RDI), total grains (4 oz), and HEI (median score 77). The most crucial genetic predictor was identified as the total number of gene polymorphism mutations (≥4) for all five genes combined. The significant demographic factors included gender and body weight. For all domains, Table 4 presents the rank order of the 10 predictors, including the demographic characteristics of age, gender, and overweight status (BMI status); two genetic parameters, including the total polymorphism score and MTHFR 677; and five dietary parameters, including the total vegetable intake (10 oz), total folate intake (100% RDI), HEI (score of 77), vitamin B12 (150% RDI), and thiamine (100% RDI).

Predictors of Cancer from Genes, Diet, and Interactive Parameters
Figure 1a illustrates the profiler of the five genes, the MTHFR enzyme deficiency score, and the total gene polymorphism mutation score in association with the CRC risk, and Figure 1b shows examples of key interaction profiles of these gene parameters with the CRC risk. Notably, while the MTHFR 677 and 1298 gene polymorphisms had downward trend associations with the CRC risk, the MTHFR enzyme deficiency score showed an upward or positive correlation with the CRC risk (Figure 1a). The interaction profilers for the associations of these seven gene parameters with CRC risk, as presented in Figure 1b, were all parallel lines, indicating no two-way interactions for these seven gene parameters in association with the CRC risk. Figure 2a presents the profiler of HEI, thiamine, the total gene mutation score of the five genes, MTHFR 677 polymorphism mutations, overweight BMI status, gender, and age as predictors for CRC, and Figure 2b presents the interaction profiles of four selected factors as examples. The lines of association with the CRC risk crossed and were non-parallel for the interaction between HEI and thiamine. Supplementary Figure S1a presents the profilers of these parameters with vegetable intake and the interaction profiles of the remaining parameters. The lines of association with CRC risk crossed and were non-parallel for overweight BMI status, with gender and BMI interacting with the MTHFR 677 polymorphism (see Supplementary Figure S1b) as gene-environment interactions.

Predictive Model
Using the most influential variables (Table 4), two GR models were developed, one with AICc validation and one with LOO cross validation, to predict the probability of CRC. GR is also known as penalized regression. As the name implies, the modeling process penalizes complicated models to avoid overfitting. Hence, compared with conventional regression modeling methods, such as LR, GR tends to yield a more parsimonious model. In each case, the models were first compared to the conventional baseline LR model through validation. The parameter estimates along with the associated p-values for the baseline LR results with validation are shown in the left panel of Table 5 and Supplementary Table S5, including the parameter estimates for effect sizes and 95% confidence intervals (CI). The two GR models, which used the Adaptive Elastic Net method with AICc validation and the Adaptive Elastic Net method with LOO cross validation, are shown in the middle and right panels of Table 5 and Supplementary Table S5. In Supplementary Table S5, a seven-factor model with a baseline conventional LR model was constructed with two significant interactions (thiamine with HEI 77, and gender with overweight status as measured by BMI), the four individual parameters involved in these interactions, and three additional individual factors: the total polymorphism score, age (median: 56), and vegetable intake (10 oz) (all p < 0.05 except vegetable intake, p < 0.1). While the effect of overweight status was not significant, it had to be included in the models because of its interaction with gender. The GR LOO validation model was the best model, with the lowest misclassification rate (0.22) and the highest AUC (0.85, Supplementary Figure S2).
In regard to significant parameters, both GR models presented the HEI (score of 77) and thiamine (100% RDI), and possibly vegetable intake (10 oz), as modifiable factors, in addition to the total polymorphisms of five genes in the one-carbon metabolism (OCM) pathway and the demographic characteristics of age and gender, as predictors of cancer. While the total polymorphism score was a significant parameter for both GR models, it was not significant for the conventional LR model.
When MTHFR 677 was added into the predictive model (Table 5) to give an eight-factor model, the same significant interaction terms were noted as associated factors. The misclassification rate for the Elastic Net LOO validation, shown in the right panel of Table 5, was the lowest at 0.21, and the baseline LR (left panel) had a similarly low rate of 0.22, whereas the AICcs were similar to those of the earlier model shown in Table S5. The Elastic Net LOO validation outperformed the LR model with a lower misclassification rate, a higher AUC, and the identification of more significant parameters, again leaving out overweight status due to its parameter estimate of "0" and p-value of "1". The AUCs (Figure 3) were 0.86 for the Elastic Net LOO model (right panel) and 0.85 for both the Elastic Net AICc validation model (middle panel) and the LR model (left panel). Vegetable intake (10 oz) was a significant parameter in the GR LOO model, whereas the interaction of MTHFR 677 with overweight BMI status approached significance (p = 0.059). Only four out of 11 parameters (three interactions and eight individual factors), including only one interaction term (HEI and thiamine), were significant in the LR model, compared to eight out of 11 tested parameters in both GR models.
To illustrate the effects of different factors on these predictive models, Table S6 presents a series of models that progressively include the additional factors presented in Table 5. The p-values for the significance of the parameter estimates, the misclassification rates, AICcs, and AUCs for the individual variables (i.e., HEI, thiamine, overweight BMI status, gender, total gene polymorphism mutation score, age, vegetable intake, MTHFR 677, and total folate intake) and their significant interactions were included in these illustrative progressions. As shown in Table S6, the models presented in Table 5 performed best among the GR models tested: the GR LOO model had the lowest misclassification rate (0.21 versus 0.24 in the models with one more or one fewer factor), and the GR AICc validation model had the best AICc. Adding folate intake as an additional parameter to give a nine-factor model increased the misclassification rates for the LR and GR LOO models, and folate did not reach significance as a parameter (see Supplementary Table S6).

[Panel headings: Logistic Regression with Validation (left); Elastic Net with AICc Validation (middle); Elastic Net with Leave-One-Out (right)]

Discussion
Using supervised machine-learning analytics, we presented a predictive modeling study that improved prediction accuracy and model fit and identified significant predictors, including interaction terms. We found the significant predictors of CRC and built prediction models using the identified predictors of importance. We observed a composite of five key genes in the OCM pathway; the dietary parameters of thiamine and an HEI score of 77 and their interactions; and age, gender, and overweight status and their interactions as predictors of cancer in multiethnic CRC families. In addition, through the dimension reduction approach, which recognizes the variables of importance, the best predictive model was generated using the GR models with the Elastic Net AICc validation and LOO cross validation methods. We observed the HEI as a modifiable dietary factor and OCM-related genetic factors as independent factors for CRC risk in this study. In addition to the HEI, other significant dietary predictors found in this study included thiamine and vegetable intake, which are converging dietary risk factors for CRC and demonstrate that the findings related to the HEI dietary parameters presented as a composite score were not due to chance. Additionally, the prediction models presented in this study were better than the conventional models presented in previous studies at identifying potential interactive parameters, improving accuracy (lower misclassification rate and higher AUC), and recognizing the fitness of models with AICc. To our knowledge, no previous studies have validated their predictions with added criteria to achieve rigor and reproducibility in their results. While aging and demographic characteristics such as gender might not be modifiable in the prevention of cancer, it is promising that dietary parameters play significant roles in cancer prediction (as shown through the supervised machine learning based GR models with validation). Healthy eating as a modifiable habit is particularly promising because of its beneficial effect against mutated genes in the OCM pathways, which place a patient at a higher risk of cancer.
The HEI interacted with thiamine (vitamin B1), which is a new finding for the literature. Thiamine was tested as part of the RDI analysis, and the natural food sources of thiamine include whole grains, legumes, and some meats and fish, which HEI scoring counts as part of a healthy diet (in contrast to the components it limits: salt, saturated fat, and empty calories). Both gender and the MTHFR 677 polymorphism interacted with overweight BMI status in the prediction of CRC, with the cancer group having more men and more overweight cases. While previous studies tested the association of higher HEI scores with lower CRC risk [47][48][49][50][51][52], we further documented the HEI scale using a median split (a score of 77, versus the standard cutoff of 80) for the best model in predicting CRC risk in the diverse sample used in this study. The HEI score was significantly split at the median score of 77 into higher and lower scores, as confirmed through the machine-learning recursive tree method and the predictive modeling, while an HEI score of greater than 80 is the set value for a good diet according to the US national standard [44][45][46]. The results showed that the HEI and healthy eating are modifiable factors for healthy living, in addition to the genes in the OCM pathway. Personalized nutrition can be planned when patients present increased gene mutations in the OCM pathway, particularly by heightening awareness of supplying methyl donors to improve health outcomes.
CRC comprises a group of molecularly heterogeneous diseases that are characterized by a range of genomic and epigenomic alterations [38]. Therefore, genes, diet, and interactive parameters may increase the risk of CRC with specific molecular features. For example, a recent study demonstrated an association between pro-inflammatory diets, such as those including red and processed meats, refined grains, and carbonated drinks, and a higher risk for CRC subtypes with absent or low lymphocytic reactions than for CRC subtypes with high lymphocytic reactions in the tumor microenvironment. The pro-inflammatory diet-associated CRC subtype was shown to be hypermutated CRC with microsatellite instability (MSI), the CpG (cytosine and guanine separated by only one phosphate group) island methylation phenotype (CIMP), and the BRAF wild-type phenotype [38,39]. While previous studies presented gene-environment interactions involving genes in the OCM pathway [73,74,77] in relation to CRC prevention [73,77], we applied new GR predictive modeling and validation analytics methods using JMP Pro (SAS Institute, Cary, NC, USA). We used supervised machine-learning-based analytics with the target variable specified as cancer status, and we included the ensemble methods and the GR Elastic Net methods, which are well-known remedies for small-sample studies, to validate the analyses using random subsets of samples [96] in the best-fit models. These analytics presented converging parameters for the reproducibility and rigor of the predictive modeling. Although some family participants in this study shared genetic heritage with the cancer cases, the CRC group had more combined gene mutations in the OCM pathway than the control group in this family-based study. The finding that healthy eating is a modifiable factor for cancer prevention is promising and encouraging to families with a history of CRC.
Our sample size was limited, with a total of 106 participants: 53 CRC cases and 53 matched family/friend controls. For the predictive modeling construction using the GR Elastic Net LOO model, we did not have a sufficient number of samples from any of the four racial-ethnic subgroups to generate stable subgroup results. Elastic Net models and machine learning techniques (classification tree/bootstrap random forest) are designed to build a parsimonious predictive model by selecting variables of importance or applying shrinkage penalties to variables of less significance. For small sample sizes, as in this article, they should serve the intended purpose. Data-driven selection approaches like LASSO or random forest are not stochastic, a factor that conventional model inference requires for its sampling distribution. While penalized models such as LASSO could provide estimates with less variance, they may also introduce a certain degree of bias into the parameter estimates [98]. For valid parameter estimation in our small dataset, we included bootstrapping and conventional LR with parameter estimates for effect sizes and confidence intervals, as recommended previously [98]. The Elastic Net method is suitable for handling data sets with many variables and few observations. In the Elastic Net method, a stage-wise algorithm called LARS-EN [90,91] efficiently finds the best solution path, and it is more likely to balance variance and bias than other methods. In summary, future studies with larger samples are needed to generate stable results and to further validate these findings for various racial-ethnic groups. Caution is warranted when interpreting the results of this study for various ethnic groups, as there is potential for inflated Type I errors due to multiple testing of the models and not adjusting p-values for the small sample sizes.
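For orientation, the shrinkage penalty and the small-sample fit criterion discussed above can be written in their common textbook forms. This is a sketch under assumptions: the symbols (log-likelihood $\ell$, penalty weight $\lambda$, mixing parameter $\alpha$, number of estimated parameters $k$, sample size $n$) are introduced here for illustration, and the exact parameterization used by the software in this study may differ.

```latex
% Elastic Net estimate for a likelihood-based model (one common parameterization):
% alpha = 1 gives the LASSO penalty, alpha = 0 gives the ridge penalty.
\hat{\beta} = \arg\min_{\beta}
  \left\{ -\frac{1}{n}\,\ell(\beta)
  + \lambda \left[ \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right] \right\}

% Corrected Akaike information criterion; the last term is the small-sample
% correction, which grows as k approaches n.
\mathrm{AICc} = -2\,\ell(\hat{\beta}) + 2k + \frac{2k(k+1)}{n-k-1}
```

The penalty term is the source of the bias-variance trade-off noted above: larger $\lambda$ shrinks coefficients toward zero, which lowers variance at the cost of some bias in the estimates.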
Further studies involving gene-environment/diet interactions using larger diverse samples should be designed to validate these findings.
In summary, we examined genetic, demographic, and dietary parameters and related interactions in preparation for the precision-based healthcare era for cancer prevention and to improve health outcomes for personalized nutrition. We used a cross-validation approach to predict the risk of CRC from individual parameters and related interactions in relation to OCM and inflammatory pathways. For family-centered healthcare, the family-based design can provide further evidence on the most efficient and effective interventions to prevent cancer, as family members can help to provide more accurate monitoring and sustained eating habits [56,97]. Future studies may focus on the epigenetics of methyl donors from healthy eating related to folate metabolism and its mechanisms to achieve healthy living and cancer prevention.