Abstract
Complex diseases such as type 2 diabetes are influenced by both environmental and genetic risk factors, leading to a growing interest in identifying gene–environment (G × E) interactions. A three-step variable selection method for single-index varying-coefficients models was proposed in recent research. This method selects varying and constant-effect genetic predictors, as well as non-zero loading parameters, to identify genetic factors that interact linearly or nonlinearly with a mixture of environmental factors to influence disease risk. In this paper, we extend this approach to a binary response setting given that many complex human diseases are binary traits. We also establish the oracle property for our variable selection method, demonstrating that it performs as well as if the correct sub-model were known in advance. Additionally, we assess the performance of our method through finite-sample simulations with both continuous and discrete gene variables. Finally, we apply our approach to a type 2 diabetes dataset, identifying potential genetic factors that interact with a combination of environmental variables, both linearly and nonlinearly, to influence the risk of developing type 2 diabetes.
Keywords:
gene–environment interaction (G × E); generalized single-index varying-coefficient models (gSIVCM); mixture of exposures; nonlinear G × E; synergistic G × E MSC:
62J99; 62P10; 62G99
1. Introduction
The identification of gene–environment (G × E) interactions for complex traits has been a longstanding challenge in genetic studies. Ottman [1] defined G × E interaction as “different effect of a genotype on disease risk in persons with different environmental exposures”. Traditionally, G × E interactions have been studied using a single environmental factor, as incorporating multiple environmental factors exponentially increases model complexity. This can lead to biased estimates and large standard errors, a phenomenon known as the curse of dimensionality. Moreover, an increasing number of epidemiological studies have shown that disease risk can be influenced by simultaneous exposure to multiple environmental factors [2,3]. However, little is known about how multiple environmental factors, when considered collectively, interact with genetic factors to affect disease outcomes. Investigating this area could provide valuable insights and inform strategies for future disease prevention.
Guan et al. [4] proposed a single-index varying-coefficients model in the form to address non-linear gene–environment interactions. However, this model is limited to continuous phenotypes. In this paper, we generalize the model to handle binary phenotype data, given that many complex human diseases are binary traits in nature. Specifically, we propose a generalized single-index varying-coefficients model (gSIVCM):
where represents the unknown gene effect of , modeled as a non-parametric function modulated by its loading index . Here, X represents q-dimensional environmental exposures, and captures the joint effect of q environmental factors. The effect of these factors on the response is modeled through the function. In this model, we allow the effect of on the risk of Y to vary across multiple X through the function.
If , then has no effect on Y. If , where is a constant, the effect of on Y is constant, and there is no interaction between and the mixture of X. If , we conclude that interacts with the environmental mixture, and the form of the interaction effect is captured by the non-parametric function .
The unique structure of gSIVCM enables us to capture how a mixture of multiple environmental factors interacts with genetic factors. Moreover, model (1) is highly flexible and can accommodate a wide range of models. For example, when and , it reduces to a generalized varying-coefficients model; when and , it becomes a standard generalized single-index model.
When both the number of gene variables and the number of environmental variables X is large, the complexity of model (1) introduces unique challenges for variable selection, particularly due to the nonlinear, non-parametric structure of the function and its unknown loading parameter . Guan et al. [4] recently proposed a three-step variable selection approach for single-index varying-coefficients model (SIVCM), which classifies the non-parametric gene effect into varying, constant, and zero effects, while also identifying non-zero loading parameters. In this paper, we extend this variable selection approach to gSIVCM. Instead of penalizing the squared error loss as in SIVCM, we implement a penalized log-likelihood method tailored to our model.
Variable selection has been a central topic in modern statistical research. The core idea is to add a penalty term to the optimization function. Different choices of penalty functions yield estimators with varying properties. Fan and Li [5] introduced three key criteria for a penalized estimator: sparsity, unbiasedness, and continuity. They also characterized the oracle property, which ensures that the model can (1) consistently identify the true subset of relevant variables and (2) estimate their coefficients as accurately as if the true model were known in advance. This has become a benchmark for evaluating new penalized estimators. Examples of penalized functions include Bridge regression [6], the Least Absolute Shrinkage and Selection Operator (LASSO) [7], Adaptive LASSO [8], Smoothly Clipped Absolute Deviation (SCAD) [5], and Minimax Concave Penalty (MCP) [9]. While LASSO is well-regarded for its simple formulation and efficient algorithm (e.g., LARS), it does not possess the oracle property. In contrast, Adaptive LASSO, SCAD, and MCP all satisfy the oracle property. For our model, we chose the MCP penalty due to its desirable theoretical and practical properties.
The rest of the paper is organized as follows: Section 2 introduces the proposed variable selection method, including the formulation of the penalized log-likelihood, the iterative optimization procedure, and the selection of tuning parameters and initial values for . Section 3 discusses the asymptotic properties of the proposed method. In Section 4, we evaluate the performance of our method through several finite-sample simulations. Finally, in Section 5, we apply our method to a type 2 diabetes dataset, followed by conclusion and discussion.
2. Statistical Method
Throughout the paper, superscript T is used to denote matrix transpose, is used to denote norm, is used to denote the natural logarithm of a. For notation simplicity, denotes the norm throughout the paper. For the sake of simplicity, we use constant and non-zero constant interchangeably.
2.1. Model Setup
The generalized single-index varying-coefficients model for the binary case is set as follows:
where represents the binary response variable, and n denotes the sample size. The set comprises unknown real-valued continuous functions. contains environmental variables, and represents the loading parameters of the model. where and is a continuous or discrete vector of length n for . Consequently, for , represents the effect of on Y on a log-odds scale, while serves as the intercept term.
For simplicity of notation, we denote . Thus, model (1) can be rewritten as follows:
2.2. Estimation Method
Our goal is to estimate and select unknown functions and unknown loading parameter . For the sake of identifiability, we assume , , and that cannot take the form .
We first approximate the non-parametric function with B-spline basis functions. Without loss of generality, we assume , denote K as the number of internal knots and h as the degree of the B-spline basis function. For instance, indicates linear splines, represents quadratic splines, and represents cubic splines. From standard B-spline theory, let be internal knots satisfying , and let be the left-closed, right-open interval for , and the closed interval . Let be a collection of functions f defined on satisfying: (i) the restriction of f to is a polynomial of degree h or less for ; (ii) f is times continuously differentiable on .
Let . Then, by Schumaker [10], we have a normalized B-spline basis function for . There exists a linear transformation matrix such that
where each component of is a function of u. Then, for , we approximate by
where and . With the B-spline approximation, in (2) can be approximated by where
Hence, the selection of the non-parametric function is transformed to the selection of its B-spline coefficients . Note that the transformation allows us to separate the constant effect of from its varying effect on the response.
- If , then is a varying effect predictor.
- If and , then is a constant effect predictor.
- If and , then has no effect on the response.
Given the binary response, we adopt the penalized log-likelihood approach, and the log-likelihood function is defined as follows:
where is the ith subject of . The penalized log-likelihood objective function is then defined as follows:
where are penalty functions for , , and , respectively. The indicator function equals 1 if the condition in the parentheses is satisfied, and 0 otherwise.
Note that the construction of the penalty term in function (4) reflects an “order” in selecting the effect of : first, the model determines whether has a varying effect; if not, it then decides whether has a non-zero constant effect or no effect at all. Furthermore, we do not penalize the intercept term or due to model constraints.
For the penalty function, we use the MCP penalty proposed by Zhang [9]:
with regularization parameters and . Following Breheny and Huang [11], defaults to 3. We acknowledge that the MCP penalty, while offering theoretical advantages such as unbiased variable selection and effective sparsity, may pose computational challenges, particularly for large datasets. This can be efficiently implemented using coordinate descent algorithms which are well-suited for convex sub-problems and take advantage of sparsity to reduce computational overhead (e.g., [11]). For large-scale datasets, parallelization and optimization techniques, such as warm starts and active set updates, can significantly improve performance. For cases where computational cost is prohibitive, alternative penalties, such as the SCAD or LASSO penalty, may be considered, although they come with their own trade-offs.
2.3. The Estimation Step
To optimize function (4), we follow the approach proposed by Guan et al. [4] and adopt a three-step iterative method.
Step 1: Given a preliminary estimator of , denoted by , we obtain the first-step estimation of , denoted by , via a group penalized regression:
where
Step 1 classifies into two categories: varying (V) or non-varying (NV). Specifically, if , and if .
Step 2: In Step 2, we refine the selection of for functions in the non-varying category identified in Step 1. Specifically, we select the non-zero constant effects and classify the non-parametric functions into constant (C) and zero (0). This is achieved by penalizing only when for . No penalty is applied to . Additionally, is excluded from the model when , i.e., .
The Step 2 estimator is obtained via penalized regression:
where
and is the ith element of , defined as follows:
After Steps 1 and 2, we obtain the estimator based on and classify for into V, C, or 0. The next step is to estimate the loading parameter given .
Step 3: The estimator is obtained via penalized regression:
where
and is the ith element of , defined as follows:
Finally, set and iterate Steps 1 to 3 until convergence.
Remark: For Steps 1 and 2, we use the block coordinate descent algorithm for group penalties. For Step 3, we employ the local quadratic approximation (LQA) algorithm proposed by Fan and Li [5]. For further details, readers are referred to the Appendix. This iterative approach requires selecting tuning parameters , the order h, and the number of internal knots K for the B-spline approximation, as well as an appropriate initial value for .
2.4. Selection of Tuning Parameters
2.4.1. Selection of Tuning Parameters
We propose using the Bayesian Information Criterion (BIC) [12] to select the tuning parameters.
Step 1: We select as the minimizer of
where is the minimizer of defined above, is chosen as the estimator from the previous iteration, and is the total number of non-zero coefficients when is the penalized parameter.
Step 2: We select as the minimizer of
where is the minimizer of defined above, is chosen as the estimator from the previous iteration, and is the total number of non-zero coefficients when is the penalized parameter.
Step 3: We select as the minimizer of
where is the minimizer of , is the minimizer of the B-spline coefficients from Step 2, and is the total number of non-zero coefficients when is the penalized parameter.
The parameters are searched over a grid of exponentially decreasing values, with a minimum value of and the maximum value set such that all penalized estimators are zero. We use 100 tuning parameters in the search.
2.4.2. Selection of Order h and Number of Internal Knots K
Since h represents the order of the B-spline basis function, higher degrees introduce more complex interactions and collinearity between environmental factors and genetic predictors. We search for the optimal h in the set . For K, only when (where n is the sample size, r is the smoothness of , and ), the selection approach achieves oracle properties. We therefore search for the optimal K in the neighborhood of , denoted by . In our simulations, .
We fit the following intercept-only model using the B-spline approximation:
We denote its estimator as and . Let . The optimal K and h are selected as the values minimizing
For the iterative algorithm proposed above, we require a reasonable initial value for (denoted by ) to begin. We obtain by fitting model (5) with the selected K and h.
2.5. Theoretical Properties
We study the properties of the penalized likelihood estimator. Let and represent the true values of and , respectively, and let denote the true value of the B-spline coefficients . Assume without loss of generality that for , for , is varying for , non-zero constant for , and zero for . The following theorem establishes the consistency of the penalized least square estimators.
Theorem 1.
Assume that the regularity conditions (A1)–(A7) in the Appendix hold, and that the number of knots satisfies . Then:
- (i)
- ;
- (ii)
- ,
where , r is defined in condition (A2) of the supplemental material, and denotes the first derivative of the penalty function . Furthermore, under additional regularity conditions outlined in the Appendix, we can show that the estimator possesses sparsity.
Theorem 2.
Assume the regularity conditions (A1)–(A7) in the Appendix hold and . Let . If and as , then with probability approaching 1:
- (i)
- for ;
- (ii)
- for , where is some non-zero constant;
- (iii)
- for .
Theorems 1 and 2 indicate that our penalized likelihood estimator is consistent and possesses oracle properties. That is, for the index parameters , the method can accurately select all non-zero , and the estimates of the non-zero parameters are close to true with the difference between the two approaching zero rapidly. For the varying coefficient functions , the method can accurately classify them into three categories: (1) varying functions, (2) non-zero constant functions, and (3) zero functions. Moreover, the difference between the true and the estimated function also converges to zero quickly. The proofs are given in the Appendix.
3. Simulation
We evaluated the performance of our model via finite-sample simulations. The performance is assessed in the following ways: (1) classification accuracy of , denoted as the oracle percentage; (2) IMSE of the estimated m-function; (3) selection accuracy of ; and (4) estimation accuracy of (MSE). A total of R simulations are conducted for all cases.
The oracle percentage of is defined as the proportion of correct classifications across R simulations. For instance, if is a varying function and it is classified as varying in g simulations, then the oracle percentage for is .
The IMSE of is given by
where is the number of points for estimating the MSE of the predicted function; and are estimators of the B-spline coefficients for the r-th simulation using the proposed approach; is the estimator of the loading parameter for the r-th simulation; and corresponds to the quantile within the range of . In our simulations, is set to 100.
The oracle percentage of is defined as the proportion of correct selections of across R simulations. For example, if and is selected as non-zero in g simulations, then the oracle percentage for is .
The MSE of is computed as
where is the estimator for in the r-th simulation. This represents the average MSE for .
The simulation data were generated based on the model (1). The environmental variable X was drawn from a distribution. For the loading parameter , we set and all other were set to zero. We evaluated the performance of the proposed approach with both continuous (e.g., gene expressions) and discrete (e.g., SNPs) predictors G.
3.1. Continuous G
In the continuous case, the non-parametric functions were set as follows:
The predictors G were generated from distribution. We conducted simulations to evaluate the model’s performance under , and .
Table 1 presents the selection and estimation accuracy for with continuous predictors. Across all cases, the selection accuracy was close to for varying, constant, and zero-effect coefficients. The IMSE of the proposed model was in the order of or for varying and constant effect predictors. As the model dimension p increased (from 50 to 100), a slight increase in the model IMSE was observed. Conversely, as the sample size n increased (from 1000 to 2000), both the model IMSE and the oracle IMSE decreased, aligning with the asymptotic properties of the proposed model. These results suggest that the proposed variable selection approach performs well in both selection and estimation accuracy for the non-parametric functions .
Table 1.
Selection and estimation accuracy of function for continuous predictors.
Table 2 presents the selection and estimation accuracy for the loading parameter . In all scenarios, the selection accuracy for non-zero loading parameters () was nearly , with MSE values in the order of to . For zero loading parameters (), the selection accuracy was approximately when . As the sample size increased to , the oracle percentages improved to , with MSE values in the order of to . Overall, the MSEs improve as the sample size increases, and the model MSE is very close to the oracle MSE, indicating strong estimation performance for the loading parameters.
Table 2.
Selection and estimation accuracy of the loading parameters for continuous predictors.
3.2. Discrete G
We extended our evaluation to examine the performance of the proposed model with discrete predictors, G. Single nucleotide polymorphism (SNP) data are one of the most commonly used types of genetic data. SNPs take values of 0, 1, and 2, representing the genotypes aa, Aa, and AA, respectively. Additionally, SNPs exhibit a wide range of minor allele frequencies (MAF), making it crucial for our simulations to incorporate these characteristics.
To reflect these properties, G was simulated using the following probability distribution function:
where denotes the jth predictor for the ith subject, with and , and is the MAF of the minor allele A.
The gene effect functions, , were set in Table 3.
Table 3.
Function and the MAF of the associated SNP.
In this setup, predictors exhibit varying and constant effects with . The purpose of setting varying MAFs is to check the selection and estimation performance under different MAFs. For zero-effect predictors , their values are uniformly distributed within the range . The environmental variables X were generated from a unif distribution. Finally, Y was generated according to the model specified in (1). We evaluated the performance of the proposed model through simulations, considering , , and .
Table 4 presents the selection and estimation accuracy of the non-parametric function for discrete G. We observed that the sample size n and MAF () of were the primary factors influencing the performance of the proposed model. To better visualize their impact, we present Figure 1.
Table 4.
Selection and estimation accuracy of for discrete predictors.
Figure 1.
Selection and estimation accuracy of function for discrete predictors under different sample sizes, data dimensions, and minor allele frequencies.
As the sample size increased (from 1000 to 2000), the model’s performance improved significantly. For instance, the oracle percentages for increased from approximately to nearly , and the corresponding IMSE decreased substantially. These results align with the asymptotic theory of the proposed model. Conversely, as the MAF for decreased (from 0.5 to 0.1), we observed a decline in performance, reflected in both oracle percentages and model IMSE. For example, in the case where , the oracle percentages for (), (), and () were approximately , , and , respectively. The IMSE increased correspondingly, from 0.4 () to 0.5 () and then to 1.3 (). This is a common phenomenon in genetic studies as smaller MAFs provide less data information to estimate the corresponding function, leading to poor estimation and selection performance.
Table 5 demonstrates the selection and estimation results for the loading parameter . We observed that the sample size n was the primary factor influencing model performance. When the sample size was large (), the oracle percentage for non-zero loading covariates () was , and the oracle percentage for zero loading covariates () was approximately . The MSE for was in the order of to . In contrast, when the sample size was relatively small (), the oracle percentage for non-zero loading covariates remained at , but the oracle percentage for zero loading parameters decreased to around . Comparing the cases where and , we saw a slight reduction in selection accuracy for zero loading parameters when . This is expected, as model performance typically declines with increased model complexity. As the sample size increased to 2000, this difference in performance was not significant.
Table 5.
Selection and estimation accuracy of the loading parameters for discrete predictors.
Based on simulation results with both continuous and discrete G variables, we observed the following key findings about the proposed model. First, the model demonstrates robust performance with large sample sizes ( or 2000). Second, when , the false positive rate for the loading parameter was around 5%. This may reflect an inherent limitation of the LQA algorithm that cannot shrink zero parameters to exactly zero when the sample is small. Third, the model exhibits superior performance for SNP variants with larger MAFs (e.g., or 0.5) compared to SNP variants with smaller MAF (). This enhanced performance likely stems from the higher amount of data information content provided by SNPs with higher MAFs. Similar phenomenon is commonly observed in other genetic association studies. Overall, the simulation results demonstrate that the method performs reasonably well under finite-sample conditions.
4. A Case Study
We demonstrated the applicability of our proposed model using a type 2 diabetes dataset containing genotypes (SNPs), environmental factors, and the phenotype (presence of type 2 diabetes). This dataset comprises two nested case–control cohort studies: the Nurses’ Health Study (NHS) and the Health Professionals Follow-Up Study (HPFS) from the Gene–Environment Association Studies Consortium (GENVEA). Detailed descriptions of these cohorts can be found in Colditz and Hankinson [13] and Rimm et al. [14]. Initially, the dataset included 3391 females (NHS) and 2599 males (HPFS).
After data cleaning, which involved removing subjects with mismatched genotypes and phenotypes, SNPs with more than missing data, SNPs with MAF , and SNPs deviating from the Hardy–Weinberg equilibrium (p-value ), the final dataset included 5865 subjects (2494 males and 3371 females), with 2733 cases and 3132 controls, and 655,002 SNPs. The dataset also contained 12 continuous environmental factors, such as height, weight, age, and alcohol consumption. We fit a marginal logistic regression model for all 12 factors and, selected 5 environmental factors for the analysis: total physical activity (, denoted as act), BMI (), alcohol intake (, denoted as alcohol), heme iron intake (, denoted as heme), and glycemic load (, denoted as gl). Thus, for the fitted model defined in (1), .
Based on SNP locations, we mapped all SNPs to known genes and selected genes containing more than 30 SNPs, resulting in a total of 2178 genes. For each gene, we applied the proposed variable selection approach to identify significant SNPs and their effects. To ensure identifiability, the first element of the loading parameter was constrained to be a non-zero positive value. We fit the proposed model five times for each gene, varying the environmental factor used as the first element each time. An SNP was deemed significant if it was identified as either varying or constant across all environmental factor orders, indicating a strong signal.
In total, our model identified 13 varying-effect SNPs and 26 constant-effect SNPs. Here, we present one of the selected varying-effect SNPs as an example. Please refer to Table A1 and Table A2 in the Appendix for the complete list of selected varying- and constant-effect SNPs. Previous studies [15,16] have suggested that the gene TCF7L2 is associated with type 2 diabetes across multiple populations. Specifically, Sale et al. [15] reported a strong association between type 2 diabetes and SNPs rs7903146 and rs7901695 within this gene. Consistent with these findings, our model identified SNP rs7901695 as a constant-effect predictor, indicating no interaction with the five environmental factors (note that SNP rs7903146 was not observed in this dataset).
Figure 2 shows plots of the marginal total environmental effect and the interaction effect of the SNP rs6537663 in gene TCF7L2, with heme iron intake as the first loading covariate. As the index increases, we observed that the marginal effect initially decreases, then increases, and subsequently exhibits a rapid decrease as the total effect of the five environmental factors increases. For the interaction effect, it fluctuates around zero as the total effect of the five environmental variables increases, indicating that the SNP is unresponsive (or insensitive) to changes in these variables. However, as the index continues to increase, the SNP reacts to environmental changes, with a dramatic increase in risk for type 2 diabetes beyond a certain threshold. This estimated effect suggests that the genetic sensitivity of the SNP to the total effect of the five environmental variables follows a threshold model. This finding has practical implications: most individuals tolerate daily environmental changes, including dietary variations, without adverse effects. However, when such changes exceed a certain limit, the risk of disease may increase.
Figure 2.
Plot of varying effects on a log-odds scale for SNP rs6537663 in Gene TCF7L2.
Table 6 presents the selection and estimation results of the loading parameters , with heme iron intake as the first loading covariate. The model selects all loading parameters except for alcohol consumption (alcohol). Notably, we observed that body mass index (BMI) has the largest effect, which aligns with practical knowledge, as BMI is positively associated with type 2 diabetes and is a well-established risk factor for the disease. Notably, the sign of the loading parameters aligns with known associations in the literature. For example, Hu et al. [17] reported that higher physical activity is linked to a lower risk of type 2 diabetes, while Field et al. [18] demonstrated that high BMI increases the risk of type 2 diabetes. Additionally, Rajpathak et al. [19] found that high heme iron intake is associated with an increased risk of type 2 diabetes, and Salmerón et al. [20] showed that a high glycemic load is linked to a higher risk of type 2 diabetes. The signs of the estimated loading coefficients are consistent with these established findings.
Table 6.
Estimation and selection result for .
5. Discussion
G × E interactions have been extensively studied in the literature, leading to the development of numerous statistical models. In this paper, we propose a three-step iterative variable selection approach for the generalized single-index varying-coefficients model (gSIVCM) with a binary response. Our goal is to identify varying, constant, and zero-effect genes, as well as to select non-zero environmental factors that interact with varying-effect genes. Biologically, our approach is attractive as it provides a novel perspective on G × E interactions. The flexibility of our model allows for the detection of non-linear interactions, making it particularly suitable when gene effects are non-linearly influenced by simultaneous exposure to multiple environmental factors. Statistically, gSIVCM reduces the dimensionality of the model by treating multiple indices X as a single index, which significantly alleviates the curse of dimensionality when multiple environmental factors interact with gene effects.
For the theorem presented in this work, we note that while the regularity conditions provide theoretical guarantees for the method’s performance, their applicability in real-world settings may be uncertain. Directly checking these conditions is typically infeasible due to the complexity and high-dimensional nature of real-world data. Despite these challenges, we emphasize that the method has shown strong empirical performance in various settings, including simulations and case studies. This suggests that the assumptions, while idealized, may approximate practical scenarios sufficiently well in many cases. We propose that future research could focus on developing diagnostic tools to assess whether the underlying assumptions are reasonable for a given dataset or on extending the method to relax these conditions.
Our work builds upon the framework introduced by Guan et al. [4] with a three-step variable selection approach for SIVCM with continuous responses. We extended our previous work to binary responses, broadening its applicability to a wider range of biological studies. For continuous responses, researchers are often interested in specific quantiles of the response, such as birth weight, rather than the mean. It is a natural progression to extend our variable selection approach to quantile regression settings, which we plan to explore in future studies. In addition, the methodology proposed in this paper can be readily generalized to include other discrete response variables with appropriate use of link functions. Another interesting direction for future research would be to adapt the selection procedure developed in the multivariate framework to the functional data framework. This idea has been explored in a general context by Aneiros et al. [21] and more specifically in the context of partial single-index modeling by Novo et al. [22]. Additionally, a broader survey of applications for this concept can be found in Aneiros et al. [23]. While this paper focuses on the multivariate framework, exploring these connections in the functional data setting could open new avenues for methodological advancements.
While the proposed method demonstrates strong empirical performance, several limitations warrant discussion. First, the computational cost of fitting high-dimensional datasets can be substantial, particularly when dealing with large-scale genomic or pathway-based analyses. Second, the success of the method depends on the validity of the regularity conditions, which may not always hold in noisy or sparsely sampled data. For example, dependencies among predictors or unmeasured confounding factors could affect the accuracy of variable selection. Finally, the interpretability of selected indices and varying coefficients may require domain expertise, limiting accessibility for practitioners without a statistical background. To further validate the robustness and utility of the method, additional testing on datasets with diverse characteristics, such as varying levels of noise, sample sizes, or population structures, would be beneficial. These studies would help establish the method’s adaptability to different study designs and its potential limitations in specific scenarios.
Among the listed genes in Table A1 in the appendix, ABCA1 and NTRK2 show strong evidence of association with type 2 diabetes (T2D), with ABCA1 influencing lipid metabolism and insulin secretion [24,25] and NTRK2 regulating energy balance and glucose metabolism via BDNF signaling [26]. Genes such as GALNT2, PTK2B, and RBM15-AS1 are linked to metabolic pathways or inflammation, indirectly influencing T2D risk [27,28,29]. Other genes, including LARS2, UNC5C, and SCAI, have limited or emerging evidence, often tied to mitochondrial dysfunction, apoptosis, or cellular stress, highlighting the need for further investigation into their roles in T2D pathogenesis [30,31,32]. For genes listed in Table A2, TCF7L2 stands out as a well-established genetic risk factor for T2D. Variants in TCF7L2 are strongly associated with impaired insulin secretion and glucose metabolism, contributing significantly to T2D susceptibility [33]. Another important gene is GALNT2, which has been implicated in lipid and glucose metabolism; polymorphisms in this gene are linked to alterations in glycemic traits and may indirectly influence T2D risk [27]. For the other genes, though no strong evidence directly linking them to T2D has been identified in the current literature, they may still play roles in metabolic or cellular pathways relevant to diabetes pathophysiology, warranting further investigation. It is worth noting that the findings generated by our algorithm are intended to serve as a statistical foundation for further exploration by experts in the field.
In this work, we demonstrated the model’s applicability through a gene-based analysis. Our method can also be extended to pathway-based analysis. In the human genome, pathways typically include a diverse array of genes, with each gene containing hundreds to thousands of SNPs. A pathway-based SNP-level analysis can be conducted by modeling SNPs within a pathway as genetic variables. Alternatively, a pathway-based gene-level analysis can be performed by summarizing the genomic information of each gene into a few principal components (PCs) using principal component analysis (PCA) [34] or sparse principal component analysis (sPCA) [35]. The proposed method can then be applied to select significant PCs within a pathway, facilitating the identification of key interactions between genes and environmental mixtures. By accounting for the genomic structure, this pathway-based approach has the potential to yield more interpretable and biologically meaningful results. Moreover, the ability to extend gSIVCM to quantile regression and functional data broadens its applicability significantly. For instance, quantile regression could allow researchers to explore G × E interactions at various points in the response distribution, such as studying low birth weight or extreme phenotypes in epidemiological studies. Similarly, applying the method to functional data could be transformative in fields like neuroscience or metabolomics, where variables are often measured continuously over time or space.
Author Contributions
Conceptualization, Y.C.; methodology, S.G. and X.L.; software, S.G.; validation, S.G.; formal analysis, S.G.; investigation, S.G.; data curation, S.G.; writing—original draft preparation, S.G.; writing—review and editing, Y.C.; visualization, S.G. and X.L.; supervision, Y.C.; funding acquisition, Y.C. and X.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Institutes of Health (NIH) grant number R21HG010073 (to Y. Cui) and by the National Natural Science Foundation of China grant number 12271329 and 72331005 (to X. Liu). The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Data Availability Statement
The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000091.v2.p1 (accessed on 20 December 2024). The R code used to implement the method can be accessed at https://github.com/kwan8911/gSIVCM (accessed on 20 December 2024).
Acknowledgments
The authors wish to thank the five anonymous reviewers for their insightful comments and suggestions, which significantly improved the manuscript. Funding support for the GWAS of Gene and Environment Initiatives in Type 2 Diabetes was provided through the NIH Genes, Environment and Health Initiative [GEI] (U01HG004399).
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A
Appendix A.1. Computational Algorithm
Step 1 and Step 2 follow directly from the group coordinate descent algorithm shown in the main text, and we omit the details here.
Step 3:
Since this is a penalized likelihood problem, we adopt the local quadratic approximation (LQA) approach proposed by Fan and Li [5]. Let denote the most recent value of . By applying Taylor expansion at , we obtain the following:
where
and
and
Let
We update as follows:
The updated value of is then obtained through standardization:
Finally, we iterate this process until convergence.
Appendix A.2. Proofs of Theorems
Appendix A.2.1. Notations
Let the space of Lipschitz continuous functions for any fixed constant c be denoted as . Define as the space of p-th order smooth functions.
Appendix A.2.2. Some Regularity Conditions
- (A1)
- The density function of the random variable is bounded away from 0 on , where is the compact support of X. There exists a constant such that .
- (A2)
- For , the non-parametric function with .
- (A3)
- .
- (A4)
- The matrix is positive definite, and each element of .
- (A5)
- Define
For and , it holds that as .
- (A6)
- (A7)
- Let be internal points of , where and . Define , , . There exists a constant such that , and .
Appendix A.2.3. Proof of Theorem 1
Let , and we have , hence the restriction and is equivalent to . To show the consistency of , it is enough to show the consistency of .
Define , , , and , where and correspond to the B-spline coefficients . Similarly, , where corresponds to .
To establish the consistency of and , we need to show that , ∃ a sufficiently large C, such that
where
and is the log-likelihood function defined above. If (A1) holds, it implies that with probability at least , there exists a local maximum within the ball . Hence, there exists a local maximizer such that .
Define
Since for , for , and for , we have the following:
Applying Taylor expansion at and simplifying the bounds for , we derive that
where is the Fisher information matrix. By standard arguments of likelihood theory, is of the order , is of the order , and for sufficiently large C, dominates uniformly in . Furthermore, we have
Since , we conclude that is dominated by .
For and , we have the following:
and
Similarly, and are dominated by . Thus, by choosing a sufficiently large C, we obtain . Therefore, the consistency of the penalized least squares estimator is proven.
Appendix A.2.4. Proof of Theorem 2
For (i), for ease of notation, let , where and . Since , it follows that for sufficiently large n. By Theorem 1, it suffices to show that for ,
and for , there exists a small constant such that as , with probability approaching 1, for , we have the following:
We start with
Then, expanding at using a Taylor expansion, we obtain the following:
where lies between and . After simplification, we derive the following:
Since and , the sign of is completely determined by . Thus, we conclude for .
For (ii) & (iii), by applying similar arguments as in part (i), it follows that with probability approaching 1, for and for . Using and the fact that , we have the following:
where is a constant, and
This completes the proof.
Appendix A.3. Additional Real Data Analysis Results
Table A1 lists all the varying-effect SNPs selected, along with the gene IDs to which they were mapped. Similarly, Table A2 provides a summary of all the constant-effect SNPs selected, along with their corresponding gene IDs.
Table A1.
List of SNPs with varying effects.
Table A1.
List of SNPs with varying effects.
| Gene ID | Gene Symbol | SNP ID |
|---|---|---|
| GeneID:440600 | RBM15-AS1 | rs6537663 |
| GeneID:2590 | GALNT2 | rs6666516 |
| GeneID:729993 | SHISA9 | rs1015431 |
| GeneID:54768 | HYDIN | rs4788621 |
| GeneID:117532 | TMC2 | rs7509377 |
| GeneID:758 | MPPED1 | rs5766384 |
| GeneID:23395 | LARS2 | rs4311249 |
| GeneID:647107 | LINC01192 | rs2404825 |
| GeneID:8633 | UNC5C | rs3775049 |
| GeneID:2185 | PTK2B | rs6557991 |
| GeneID:4915 | NTRK2 | rs6559870 |
| GeneID:19 | ABCA1 | rs4742969 |
| GeneID:286205 | vSCAI | rs2416996 |
Table A2.
List of SNPs with constant effect.
Table A2.
List of SNPs with constant effect.
| Gene ID | Gene Symbol | SNP ID |
|---|---|---|
| GeneID:114827 | FHAD1 | rs3815792 |
| GeneID:2899 | GRIK3 | rs12118788 |
| GeneID:260425 | MAGI3 | rs11102660 |
| GeneID:9857 | CEP350 | rs2293990 |
| GeneID:2590 | GALNT2 | rs9308482 |
| GeneID:6934 | TCF7L2 | rs7901695 |
| GeneID:55742 | PARVA | rs7101596 |
| GeneID:867 | CBL | rs4489755 |
| GeneID:10867 | TSPAN9 | rs740771 |
| GeneID:57494 | RIMKLB | rs11047510 |
| GeneID:196385 | DNAH10 | rs11058132 |
| GeneID:64328 | XPO4 | rs1961415 |
| GeneID:23348 | DOCK9 | rs7326971 |
| GeneID:23348 | DOCK9 | rs7991210 |
| GeneID:57099 | AVEN | rs16962542 |
| GeneID:11060 | WWP2 | rs16970994 |
| GeneID:25780 | RASGRP3 | rs6708570 |
| GeneID:100505498 | LOC100505498 | rs6730602 |
| GeneID:117532 | TMC2 | rs11696526 |
| GeneID:29780 | PARVB | rs5765571 |
| GeneID:25814 | ATXN10 | rs713999 |
| GeneID:9620 | CELSR1 | rs11090812 |
| GeneID:23429 | RYBP | rs17009630 |
| GeneID:80254 | CEP63 | rs11710699 |
| GeneID:8633 | UNC5C | rs10516957 |
| GeneID:157680 | VPS13B | rs1788161 |
References
- Ottman, R. Gene-environment interaction: Definitions and study designs. Prev. Med. 1996, 25, 764. [Google Scholar] [CrossRef] [PubMed]
- Carpenter, D.O.; Arcaro, K.; Spink, D.C. Understanding the human health effects of chemical mixtures. Environ. Health Perspect. 2002, 110 (Suppl. S1), 25. [Google Scholar] [CrossRef]
- Sexton, K.; Hattis, D. Assessing cumulative health risks from exposure to environmental mixtures-three fundamental questions. Environ. Health Perspect. 2007, 115, 825–832. [Google Scholar] [CrossRef] [PubMed]
- Guan, S.; Zhao, M.; Cui, Y. Variable selection for single-index varying-coefficients models with applications to synergistic G × E interactions. Electron. J. Stat. 2023, 17, 823–857. [Google Scholar] [CrossRef]
- Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
- Frank, L.E.; Friedman, J.H. A statistical view of some chemometrics regression tools. Technometrics 1993, 35, 109–135. [Google Scholar] [CrossRef]
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
- Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
- Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942. [Google Scholar] [CrossRef]
- Schumaker, L. Spline Functions: Basic Theory; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
- Breheny, P.; Huang, J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 2011, 5, 232–253. [Google Scholar] [CrossRef]
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Colditz, G.A.; Hankinson, S.E. The Nurses’ Health Study: Lifestyle and health among women. Nat. Rev. Cancer 2005, 5, 388–396. [Google Scholar] [CrossRef]
- Rimm, E.B.; Giovannucci, E.L.; Willett, W.C.; Colditz, G.A.; Ascherio, A.; Rosner, B.; Stampfer, M.J. Prospective study of alcohol consumption and risk of coronary disease in men. Lancet 1991, 338, 464–468. [Google Scholar] [CrossRef] [PubMed]
- Sale, M.M.; Smith, S.G.; Mychaleckyj, J.C.; Keene, K.L.; Langefeld, C.D.; Leak, T.S.; Freedman, B.I. Variants of the transcription factor 7-like 2 (TCF7L2) gene are associated with type 2 diabetes in an African-American population enriched for nephropathy. Diabetes 2007, 56, 2638–2642. [Google Scholar] [CrossRef]
- Grant, S.F.; Thorleifsson, G.; Reynisdottir, I.; Benediktsson, R.; Manolescu, A.; Sainz, J.; Styrkarsdottir, U. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. 2006, 38, 320–323. [Google Scholar] [CrossRef]
- Hu, F.B.; Sigal, R.J.; Rich-Edwards, J.W.; Colditz, G.A.; Solomon, C.G.; Willett, W.C.; Manson, J.E. Walking compared with vigorous physical activity and risk of type 2 diabetes in women: A prospective study. J. Am. Med. Assoc. 1999, 282, 1433–1439. [Google Scholar] [CrossRef] [PubMed]
- Field, A.E.; Coakley, E.H.; Must, A.; Spadano, J.L.; Laird, N.; Dietz, W.H.; Rimm, E.; Colditz, G.A. Impact of overweight on the risk of developing common chronic diseases during a 10-year period. Arch. Intern. Med. 2001, 161, 1581–1586. [Google Scholar] [CrossRef] [PubMed]
- Rajpathak, S.N.; Crandall, J.P.; Wylie-Rosett, J.; Kabat, G.C.; Rohan, T.E.; Hu, F.B. The role of iron in type 2 diabetes in humans. Biochim. Biophys. Acta (BBA) Gen. Subj. 2009, 1790, 671–681. [Google Scholar] [CrossRef] [PubMed]
- Salmerón, J.; Ascherio, A.; Rimm, E.B.; Colditz, G.A.; Spiegelman, D.; Jenkins, D.J.; Stampfer, M.J.; Wing, A.L.; Willett, W.C. Dietary fiber, glycemic load, and risk of NIDDM in men. Diabetes Care 1997, 20, 545–550. [Google Scholar] [CrossRef] [PubMed]
- Aneiros, G.; Vieu, P. Partial linear modelling with multi-functional covariates. Comput. Stat. 2015, 30, 647–671. [Google Scholar] [CrossRef]
- Novo, S.; Aneiros, G.; Vieu, P. Automatic and location-adaptive estimation in functional single-index regression. TEST 2021, 30, 481–504. [Google Scholar] [CrossRef]
- Aneiros, G.; Horová, I.; Hušková, M.; Vieu, P. On functional data analysis and related topics. J. Multivar. Anal. 2022, 189, 104861. [Google Scholar] [CrossRef]
- Kruit, J.K.; Wijesekara, N.; Westwell-Roper, C.; Bruce, J.; Zhao, M.; Kahn, S.E.; Hayden, M.R. Loss of ATP-binding cassette transporter A1 in mice impairs glucose tolerance and insulin secretion in vivo. Diabetes 2012, 61, 3166–3177. [Google Scholar]
- Singh, J.; Kumar, V.; Aneja, A.; Gupta, S.; Dhingra, S. Genetic polymorphisms in ABCA1 (rs2230806 and rs1800977) and LIPC (rs2070895) genes and their association with the risk of type 2 diabetes: A case-control study. Int. J. Diabetes Dev. Ctries. 2022, 42, 227–235. [Google Scholar] [CrossRef]
- Xu, B.; Li, X.; Nian, K.; Li, J.; Li, M.; Zheng, W.; Han, X. BDNF and NTRK2 in metabolic regulation. Nat. Med. 2019, 25, 697–707. [Google Scholar]
- Weissglas-Volkov, D.; Aguilar-Salinas, C.A.; Nikkola, E.; Tusie-Luna, T.; Cruz-Bautista, I.; Arellano-Campos, O.; Pajukanta, P. GALNT2 polymorphisms and their role in lipid and glucose metabolism. Hum. Mol. Genet. 2011, 20, 1632–1640. [Google Scholar]
- Wang, Y.; Wang, J.; He, Z.; Tang, X.; Zheng, Y.; Liu, H. Inflammatory pathways in T2D: A focus on, P.T.K.2.B. Immunometabolism 2021, 3, e210006. [Google Scholar]
- Wang, X.; Zhang, J.; Li, Y.; Wang, Z.; Sun, Q.; Zhou, H.; Fan, Y. RBM15 regulates hepatic insulin sensitivity in GDM offspring. Mol. Med. 2023, 29, 45. [Google Scholar]
- van Berge, L.; Keating, D.J.; Greber, B.; O’Neill, C.; Lang, J. Mitochondrial dysfunction in metabolic syndrome. Front. Endocrinol. 2019, 10, 607. [Google Scholar]
- Morrison, J.L.; Gallas, P.R.; Zheng, W.; Mohtashami, S.; Wong, T. UNC5C and its relevance to beta-cell apoptosis. Endocrinology 2020, 161, bqaa064. [Google Scholar]
- Smith, A.R.; Jones, K.M.; Patel, D.; Thomas, C.; Reid, L. The role of SCAI in cellular stress pathways. J. Cell Sci. 2019, 132, jcs231829. [Google Scholar]
- Gloyn, A.L.; Braun, M.; Rorsman, P. Type 2 diabetes susceptibility gene TCF7L2 and its role in beta-cell function. Diabetes 2009, 58, 800–802. [Google Scholar] [CrossRef]
- Jolliffe, I. Principal Component Analysis; John Wiley Sons, Ltd.: Hoboken, NJ, USA, 2002. [Google Scholar]
- Zou, H.; Hastie, T.; Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 2006, 15, 265–286. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).