Integrating Multi–Omics Data for Gene-Environment Interactions

Gene-environment (G×E) interaction is critical for understanding the genetic basis of complex disease beyond genetic and environment main effects. In addition to existing tools for interaction studies, penalized variable selection emerges as a promising alternative for dissecting G×E interactions. Despite the success, variable selection is limited in terms of accounting for multidimensional measurements. Published variable selection methods cannot accommodate structured sparsity in the framework of integrating multiomics data for disease outcomes. In this paper, we have developed a novel variable selection method in order to integrate multi-omics measurements in G×E interaction studies. Extensive studies have already revealed that analyzing omics data across multi-platforms is not only sensible biologically, but also resulting in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expressions through sparse dimensionality reduction, and link the disease outcomes to multiple effects in the integrative G×E studies through accommodating a sparse bi-level structure. The simulation studies show the integrative model leads to better identification of G×E interactions and regulators than alternative methods. In two G×E lung cancer studies with high dimensional multi-omics data, the integrative model leads to an improved prediction and findings with important biological implications.


Introduction
Gene-environment interactions reveal how the changes in environmental exposures mediate the contribution of genetic factors in order to influence the variations in disease traits, which makes it critical in understanding the comprehensive genetic architecture of complex diseases [1,2]. Traditionally, G×E interaction studies have mainly been conducted within the framework of genetic association studies in order to hunt down the important main and interaction effects that are associated with the disease phenotypes [3,4].
Most of the existing G×E studies are one-dimensional, in that the interactions between environmental factors and one type of genetic factor (such as gene expression or SNPs) have been considered. In the multi-omics era, there is a pressing need to account for multiplatform measurements in G×E studies. Consider a G×E analysis with environmental factors and gene expression (GE) as the G factors. In addition, DNA methylation (DM) and copy number alterations (CNA), which are the regulators of the genetic factors, are also available. A typical G×E analysis only focuses on the interaction effects that involve the G factor (GE) and ignores its regulators, losing the extra power of elucidating the genetic basis of complex disease while using multi-level omics data.
Integrating multi-omics data for prognostic outcomes has mainly been conducted using parallel and horizontal integration strategies [5]. With the parallel integration, different types of omics measurements are treated equally, and important associations between these measurements and the prognostic outcome are identified in a joint model. On the other hand, the hierarchical integration fully accounts for the regulatory information by accommodating the indirect effects of regulators, such as DM and CNA, on the prognostic outcomes that are mediated through GEs. Meanwhile, the direct effects of regulators on the cancer outcomes, which have not been captured by GEs through other mechanisms, such as post-transcriptional regulations, should also been taken into consideration.
Given the availability of multi-omics features, the major limitation of existing G×E interaction studies lies in the incapability of integrating regulators in the interaction model under prognostic outcomes, which has motivated us to develop a two stage integrative model for G×E interaction analysis while using multi-level cancer omics data. At the first stage, the sparse regulatory relationship has been determined through penalization, where the linear regulatory modelling [6], or LRM, has been adopted in order to identify the sets of regulators that influence the sets of GEs, as well as the residuals of gene expression and residuals of regulators that cannot be captured by the LRMs. At the second stage, the LRMs and both types of residuals are treated as direct effects on cancer outcomes in the G×E model, and penalization has been conducted in order to identify the important main and interaction effects.
In the past decade, the effectiveness of regularization for G×E interaction studies has been increasingly witnessed [7]. Extension of the technique for an integrated interaction study is not trivial. Our method significantly advances from existing integration studies not tailored for interaction structures and interaction analysis ignoring the multidimensional omics measurements. Extensive simulation studies, have been performed to demonstrate the advantage of the proposed method over multiple alternatives. Our method leads to main and interaction effects with sensible biological implications and improved prediction performance in two case studies of the lung cancer data (LUSC and LUAD) from TCGA.

Method
Let Y n×1 denote cancer outcome, E n×q = (E 1 , · · · , E q ) denote the q environmental factors, G n×p g = (G 1 , · · · , G p g ) denote the p g gene expressions, and R n×p r = (R 1 , · · · , R p r ) denote the p r regulators. Suppose that we have two measurements for the regulators, p r 1 DM and p r 2 CNA, then we can obtain R n×p r by stacking the measurements together with p r = p r 1 + p r 2 . Next, we describe the overall analysis framework and integrative model.

Analysis Framework
First, consider a G×E model in the multi-omics scenario, where the regulators of the G factors are also included, in addition to the main and interaction effects.
where α k , β j , and η jk are the regression coefficients for the kth environmental factor, jth gene expression and their interactions, respectively. Besides, γ t is the regression coefficient for the tth regulator and is the random error. Model (1) shares the spirit of parallel integration by treating the genetic factor and its regulators equally. Although such a strategy has shown to be effective in several studies, a more attractive alternative is to conduct vertical integration via accounting for the regulatory information among the different levels of omics measurements [5]. Typically, integrating multi-omics data in a main effect model with prognostic outcomes consists of two steps. At the first step, the sparse regulatory relationship can be identified, which leads to gene expressions that are modulated and not modulated by regulators, which can then be linked to clinical outcomes at the second step [6,8]. Specifically, Zhu et al. [6] proposed the linear regulatory model (LRM) to pinpoint the set of regulators that affect the corresponding set of GEs. Subsequently, the clinical model incorporates the GEs, residual GEs, and residual regulators. In this study, we extend the LRM to investigate the G×E interactions in the presence of multi-level omics measurements. In particular, the prognostic model at the second stage consists of : (1) a low dimensional environmental factors; (2) regulated GEs in the form of LRMs from the first stage and their interactions with those environmental factors; (3) Residual GEs and their interactions with environmental factors; and, (4) the residual effects of regulators.

Stage 1: The Linear Regulatory Model (LRM)
Denote g = (g 1 , · · · , g p g ) as the p g gene expressions and denote r = (r 1 , · · · , r p r ) as the p r regulators. The LRM can be expressed as where a is the intercept, V = (v 1 , · · · , v L ) and U = (u 1 , · · · , u L ) both contain L columns of loading vectors (v l and u l for l ∈ {1, · · · , L}). Denote L as the total number of LRMs.
Here, we assume U and V have orthogonal columns, such that u l ⊥u l , v l ⊥v l , for l = l . With this assumption, no overlap between gene expressions and regulators exists in LRM. We expect that different LRMs represent different regulated relationship between gene expressions and regulators [9]. In addition, v l and u l are assumed as sparse loading vectors, as only a small number of gene expressions is regulated by, at most, a small number of regulators [10]. For the jth gene expression, j = 1, · · · , p g , we right multiply V to both sides in order to simplify Equation (2). Afterwards, the LRM can be formulated as a regression model with response variable g j and predictors r: where a j is an intercept and θ j is the regression coefficient vector. Equation (3) indicates that one gene expression is regulated by a number of regulators. We impose sparsity on θ j through penalization to identify a sparse regulatory relationship. Subsequently, the penalized regression model can be written as where λ is the tuning parameter. The LASSO is adopted for its computational simplicity and satisfactory performance [11]. Equation (4) leads to a regularized estimate of θ j , which indicates that one gene expression is regulated by a limited amount of regulators. Next, we further investigate the relationship between sets of gene expressions and regulators through singular value decomposition (SVD). The regression model (3) can be collectively written as where a is the vector of the intercept, g 1×p g = (g 1 , · · · , g p g ), r 1×p r = (r 1 , · · · , r p r ), and Θ p r ×p g = (θ 1 , · · · , θ p g ) is the transition matrix. The SVD is performed on the transition matrix in order to separate the regression coefficients representing gene expression and regulators: where D = diag(d 1 , · · · , d L ) is a diagonal matrix with L diagonal elements. The diagonal matrix D can account for the dissimilarity among loading vectors in terms of different scaling factors. Subsequently, we can obtain the estimated coefficients for gene expression and regulators by decomposing the estimated transition matrixΘ. Under the sparse condition, one gene expression is only regulated by a few of regulators, and one regulator affects a few of gene expressions [10]. In order to impose sparsity, we adopt the sparse SVD method that was developed by Lee et al. (2010) [12], where sparse singular vectors that correspond to the largest singular values are recursively obtained. Consider the first largest singular value (d 1 , u 1 , v 1 ), then the regularized sparse SVD can be expressed as where · F is the Frobenius norm. Tuning parameter λ is the same for u 1 and v 1 for computation efficiency. Here d 1 is treated as the scaling factor. After estimating (d 1 , u 1 , v 1 ), we updateΘ =Θ −d 1û1v 1 and recursively update (d l , u l , v l ), for l = 2, · · · , L in a similar manner. With sparse SVD, we can decompose the coefficient and impose sparsity on p z and p x for every LRM. The standard LASSO is not applicable within the current LRM formulation, since the shrinkage has been imposed on scaled singular vectors.

Stage 2: The Penalized G×E Interaction Model
Now, we integrate multiomics measurements for G×E interactions. The regulated GEs, residual GEs, as well as residual regulators can be obtained through LRMs. The G factors are represented by regulated GEs and residual GEs, which are involved in the interaction with dimensional environmental factors. The partition of gene expressions into regulated and non-regulated components proceeds, as follows. The L sets of regulated gene expressions (GV) are equivalent to the corresponding sets of regulators (RU). We include the L sets of regulated GEs (GV) in the G×E model , since gene expressions are more directly related to cancer outcomes. The residual GEs, i.e., the non-regulated GEs that cannot be captured by LRMs, is denoted asG n×p g . The G factors, consisting of both GV andG, interact with q environmental factors. Denote W j = (G j V j , G j V j E 1 , · · · , G j V j E q ,G j ,G j E 1 , · · · ,G j E q ), (j = 1, · · · , p g ). Subsequently, W j corresponds to the interaction with respect to the jth GE. We only consider the main effect of residual regulators, because the influences of regulators on cancer outcomes are mostly mediated by gene expressions, and investigating its interactions with environmental factors is not of interest.
The quantifications of the residualsG andR are conducted through perpendicular projection operation. Because both can be calculated in the same manner, we takeG as an example. For the jth gene expression, define S j as the set of all LRMs that contains the j th gene expression. If S j is empty, then the jth gene expression is not regulated, which results inG j = G j . If S j is not empty, we denote V S j as the sub-matrix of V that only contains columns (LRMs) of the jth gene expression. Following the perpendicular projection operation, we calculate the residual asG j = (I − GV S j ((GV S j ) (GV S j )) −1 (GV S j ) )G j , which is the projection of G j onto the orthogonal space of GV S j .
Consider n subjects, p g gene expressions, and L LRMs. Subsequently, all of the main and interaction effects can be collectively written as where X 1 = (GV, GVE 1 , · · · , GVE q ) denotes the main effects of regulated GEs and their interactions with the environmental factors. Similarly, the effects that correspond to residual GEs are defined as X 2 = (G,GE 1 , · · · ,GE q ). Subsequently, we consider the following penalized regression models for G×E interactions: where X 1l = (GV l , GV l E 1 , · · · , GV l E q ), (l = 1, · · · , L) represents the lth LRM and its interaction with q environmental factors, and denotes the main and interaction effects with respect to the jth residual GEs. Here, b 1l and b 2j are the corresponding regression coefficients for X 1l and X 2j . γ t is the coefficients for R t (t = 1, · · · , p r ), the residual of regulators. P i (·; λ i ), (i = 1, 2, 3), is the penalty function with λ i as the tuning parameter to impose sparsity. The three tuning parameters are set as the same because regression coefficients from the three components are on a similar scale, and different tunings dramatically increase the computational cost. Regularized identification in G×E interaction studies demands tailored penalty functions [7]. For instance, b 1l stands for all the main and interaction effects with respect to the lth LRM. The selection of b 1l on the group levels determines whether the lth LRM has any effect at all. If so, then selection of the individual effects within the group further determines the main and/or interactions that are associated with the cancer outcome. Therefore, penalized selection should accommodate the bi-level (or sparse group) structure. To be consistent with the analysis in stage 1, we still adopt LASSO as the baseline penalty function. Specifically, we have where P 1 (b 1l ; λ 1 ) and P 2 (b 2j ; λ 2 ) are sparse group LASSO. The L1 norm and L2 norm ( · 2 ) result in penalized identification on the individual and group level, respectively. The sparse group regularization has been adopted for the bi-level selection of main and interaction effects on the individual and group level simultaneously. Its advantage over LASSO in G×E studies has been demonstrated in multiple studies [7]. A corresponding price paid is computational cost, as different bi-level regularization usually demands different tunings. Because we only consider the main effect of residuals of regulators, the L1 norm penalty is adopted for γ t (t = 1, · · · , p r ). Because the number of environmental factors is usually low, the selection of them is not of interest. They are pre-determined with evidence of being associated with cancer from previous studies. The proposed regularization respects a weak hierarchy between main and interaction effects as the penalty has not been imposed on the environmental main effects. Accordingly, once an interaction effect is selected, at least one of the two corresponding main effects will be in the model.

Computation
The Equation (8) can be expressed as: are the coefficient vectors for the main and interaction effects of the regulated and residual GEs, respectively. In addition, γ p r ×1 = (γ 1 , · · · , γ p r ) is the coefficient vector for residual regulators. The integrative analysis consists of two steps. In the first step, the loading matrices U and V are estimated through the construction of LRMs. The jth column ofΘ, which is denoted asθ j , (j = 1, · · · , p g ), is estimated by minimizing Equation (4). For l = 1, · · · , L, the singular vectors that correspond to the largest singular values, (û l ,v l ,d l ), are conducted through the rank-1 sparse SVD onΘ. The rank-1 sparse SVD is recursively performed for l = 1, · · · , L, by updatingΘ (l+1) =Θ (l) −û ldlv l at each l. In the second step, the shrinkage estimate of the regression coefficients can be obtained in the G×E model, where GV, RU, residuals of gene expressions (G), and residuals of regulators (R) are calculated accordingly. At the kth iteration, the vector of estimated regression coefficients for all of the environmental factors is computed byα by minimizing Equation (9). The iteration stops until convergence. Algorithm 1 shows the outline of algorithm:

Algorithm 1 The Integrative analysis for G×E Interaction
Step 1: Estimate the loading matrices of LRMs U and V: construct LRMs.
for l = 1, · · · , L do (b) Apply rank-1 sparse SVD onΘ to obtain the singular vectors corresponding to largest singular values (u l , v l , d l ).
until convergence LASSO is adopted in order to conduct the selection of important LRMs from the first stage. At the second stage, a sparse group LASSO has been formulated to accommodate the identification of main and interaction effects on both the group and individual level. We conjecture that other penalization methods, such as adaptive LASSO [13], SCAD [14], and MCP [15], are also applicable in our framework. For example, MCP can be adopted in order to identify sparse regulatory relationship from the first stage, and a sparse group MCP is also tailored for the identification of important G×E interactions in the clinical model. We do not compare the performances of different baseline penalization methods within our framework, as it is not the main interest here.
At the first step, we only use one tuning parameter λ for conducting sparse SVD, due to the similarity in scales between GE and its regulators. The three tuning parameters, λ 1 , λ 2 , λ 3 , have been used in the second step, where λ 1 and λ 2 determine the sparsity of main and interaction effects with respect to the regulated and unregulated GEs correspondingly, and λ 3 controls the sparsity of the residuals from regulators. We choose the optimal tuning parameters using five-fold cross-validation in both the simulation study and real data analysis. The analysis has been implemented with statistical software R (version 3.6.3). In simulation, the average CPU time of running one replicated simulated data (n = 500, p g = p r = 200, q = 4) is 23.1 min. on a regular desktop PC. The R codes are available from the corresponding author.

Simulation
We perform simulation in order to evaluate the utility of the proposed method integrative G×E model, termed IGE. In addition, we consider three alternative methods: (1) the S-LASSO selects gene expressions and regulators separately using LASSO. (2) The J-LASSO selects gene expressions and regulators that are based on LASSO simultaneously.
(3) ColReg, the collaborative regression [16], identifies important GEs and regulators jointly in terms of explaining similar variation under the cancer outcome.
We generate the data, as follows. First, each row of R is independently generated from a multivariate normal distribution with mean zero and one of the four covariance structures: (i) AR-1 structure with correlation coefficient 0.25 |i−j| for the ith and jth regulators; (ii) banded correlation structure, where the ith and jth regulators have ρ = 0.33 if |i − j| = 1 and ρ = 0 otherwise; (iii) the covariance that was extracted from TCGA lung squamous cell carcinoma (LUSC) data in Section 4; and, (iv) the covariance structure of the lung adenocarcinoma (LUAD) from Section 4.
Choose L = 20 for the number of LRMs between gene expression and regulators. For l = 1, · · · , 20, u l or v l is randomly assigned five non-zero entries, with values being generated from unif [2,4]. Subsequently, Θ is computed as ∑ 20 l=1 u l v l and G is generated as G = RΘ + ε, where each row of matrix ε is independently generated from a multivariate normal distribution with mean zero and the same covariance structure as R. To generate the cancer outcome, each row of E is generated independently from a multivariate normal distribution with marginal mean zero and AR-1 structure, where the ith and jth components have correlation coefficient 0.5 |i−j| . Subsequently, we generate the response from model (1) under standard normal errors.
200 gene expression, 200 regulators, and four environmental factors are simulated with two different sample sizes, 500 and 1000. We randomly select 30 gene expressions to assign non-zeros effects in model (1). For every selected gene expression, four non-zero entries are randomly assigned to the coefficients of G factor or its corresponding G×E interactions. Those values are generated from unif [0.25, 0.5] and unif [0.5, 1] for weak and strong coefficient signals, respectively. The coefficients of regulators are randomly assigned with 30 non-zero coefficients being generated from unif [1,2]. The coefficients of environmental factors are generated from unif [2,3].
For a comprehensive evaluation, we consider a sequence of tuning parameter values (from 0 to 3, total 100 lambda values) and then use the receiver operating characteristic (ROC) curve and partial area under the ROC curve (PAUC) to compare the different methods. The total simulation replication is 100. All of the PAUCs are tabulated in Tables 1 and 2. Figures 1 and 2 show the ROC curves for the AR-1 structure and estimated covariance from LUSC. Appendix A provides other scenarios of ROC curves, respectively.
We consider using the receiver operating characteristic (ROC) curve and the partial area under the ROC curve (PAUC) to compare different methods. Total simulation replicates is 100. Tables 1 and 2 tabulate all of the PAUCs. Figures 1 and 2 show the ROC curves for AR-1 structure and estimated covariance from LUSC. Appendix A provides the ROC curves in other scenarios. For all simulation scenarios, the proposed method has higher PAUCs than the alternative methods. For example, in Table 1     In addition, the proposed method outperforms the alternatives when the correlation is extracted from real data. For example, in Table 1, with estimated covariance from LUSC and weak signals, the proposed method has close PAUCs in both G and G×E and regulators, 0.59 (sd 0.09) and 0.55 (sd 0.15). Other methods have low accuracy in identifying main and interaction effects. In particular, J-LASSO, S-LASSO, and ColReg have PAUCs 0.42 (sd 0.05) and 0.19 (sd 0.06), 0.39 (sd 0.04) and 0.21 (sd 0.06), and 0.28 (sd 0.04) and 0.21 (sd 0.07), respectively. When magnitude of the signals and sample size increase (e.g., with LUSC and strong signals), the proposed method still have the best performance in identification. Overall, the IGE model has much higher identification accuracy than other methods across different simulation settings by borrowing strength from accounting for regulatory relationship and bi-level selection in G×E interaction studies.

Analysis of TCGA Data
Lung cancer is a top rank common cancer for both men and women. In this section, we apply the proposed method as well as the alternatives on lung adenocarcinoma (LUAD) data and lung squamous cell carcinoma (LUSC) data from the Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/). At present, LUAD is the most common lung cancer subtype among non-smokers and women, although it has been shown that smoking may increase the risk of LUAD [17,18]. On the other hand, LUSC is closely associated with smoking, and it is more common in men than in women [19]. LUAD grows more slowly with smaller masses than LUSC of the same stage, but LUAD tends to initiate metastasis at the early stages [20].
The processed level 3 data have been downloaded from TCGA data portal while using package cgdsr. We match the multi-omics measurements with the clinical/environmental variables and survival outcome. LUSC and LUAD has 344 and 426 subjects, correspondingly. We first conduct screenings to reduce dimensionality, so the regularization methods can be appropriately applied. Here, we select the top 200 mRNA with the largest marginal variances. As we matched the CNA and Methylation profiles with same mRNA, the corresponding 200 measurements on CNA and Methylation are selected at the same time. We select age, gender, smoking pack years, and pathologic tumor stage as environmental variables. The accelerated failure time (AFT) model (Appendix B) has been adopted in order to link the omics and clinical measurements to survival outcomes.

Lung Adenocarcinoma (LUAD) Data
The proposed method identifies eight LRMs with one residual effect of gene expression (mRNA) and 14 residual effects of regulators (DM and CNA). Additionally, the proposed method results in the identification of seven LRM×E interactions and 11 G×E interactions from mRNA residual effects. Table 3 provides the identified main effects of LRMs, residual GEs, and regulators. We can observe that LRMs does not contain effects from methylation, while most of the residual effects in regulators are from methylation.The identification results have important biological implications. As a representative example, gene PIK3R2 is identified by 6 different LRMs. From a recent study [21], PIK3R2 is significantly associated with lung adenocarcinoma and its pathway plays a critical role in the progress of LUAD. Besides, gene STK3 is identified by five different LRMs. STK3 belongs to a large family of serine/threonine kinases, which are implicated in the regulation of signaling pathways involved in cell growth, differentiation and death. [22,23]. The identified LRMs are also meaningful. For example, we observe the regulatory relationship between PIK3R2 and NEK2 from both LRM #1 and #6. One of the recent studies shows that this natural downstream regulation is significantly related to cancer outcome [24]. Among all of the residual effects, we observe that most of them are from methylation. For example, SLC2A1, ECT2, TNS4, DKK1, and GNPNAT1 are found to be associated with the survival of lung cancer patients [25][26][27][28][29]. Table 3. Analysis of the the Cancer Genome Atlas (TCGA) lung adenocarcinoma (LUAD) data: linear regulatory models (LRMs) and residual effects for gene expression and regulators with the estimated coefficient or loadings in the parentheses.  Table 4 provides the identification results for interaction effects. The proposed method selects variables with a sparse group nature. There are five LRMs interacting with environments. The first and fourth LRMs interact with two environment factors, and the second, third, and fifth interact with one environment factor. Additionally, the proposed method can identify a total of 11 interactions involving mRNA residual effects. Note that, here, the G factor is no longer in the usual sense from existing G×E studies. The G factor are represented by the LRMs and residual mRNAs that correspond to the regulated and un-regulated G factors, respectively. In terms of prediction, we adopt a random sampling approach. More specifically, we randomly select 30% data as a test set and the remaining as a training set. The estimates are generated using the training set only and the predictions are made based on the testing set. We dichotomize the predicted response at the median, create two risk groups, and compute log-rank statistics, which measure the difference in survival between the two groups. Larger log-rank test statistic indicates better predictive performance. The procedure is repeated 100 times to avoid extreme splits. The average log-rank test statistics are 5.97 (IGE, sd 0.35), 4.76 (S-LASSO, sd 0.25), 4.60 (J-LASSO, sd 0.08), and 3.74 (ColReg, sd 0.26), respectively. The proposed method has the largest log-rank statistic, hence the best prediction performance.

Lung Squamous Cell Carcinoma (LUSC) Data
The proposed method identifies eight LRMs with two residual effects from GEs and 17 residual effects from regulators (DM and CNA). The interactions involve seven LRMs and 26 mRNAs. Table 5 provides the identified main effects using the proposed method. As aforementioned, we aim to find a sparse relationship between gene expressions and regulators. Therefore, a small subset of regulators are related to genes and vice versa. Table 6 provides the identifications of G×E interaction effects. There's one LRM not interacting with any other environmental factors. The findings have important implications. For instance, gene RNF24 is identified by 2 different LRMs (#1, #2). RNF24 is a membrane protein, which interacts with TRPC protein [30]. A recent study shows that RNF24 acts as one of the important factors for the prognosis of carcinoma [31]. RNF24 is also shown to be correlated with the occurrence of esophageal adenocarcinoma [32]. For DM, RGP1 is identified by three different LRMs (#4, #6, #7). RGP1 belongs to the regulation of guanosine diphosphate (GDP) reaction exchange, and it acts as a prognostic factor in cancer, according to Anand (2020) [33]. For CNA, CD163L1 is identified by three different LRMs (#1, #4, #8), and it can be used as a significant biomarker of cancer [34]. The identified LRMs are also meaningful. For example, the regulatory relationship between NCOR2 and TCTN2 can be identified in LRM #7. This result has also been observed in a regulatory network analysis [35].
Among all of the residual effects, LRAT, PLEKHA6, ACOT7, KLK6, PLEKHB1, FGFRL1, and FPR2 are associated with prognosis of LUSC patients from existing studies [36][37][38][39][40][41]. Table 5. Analysis of the TCGA LUSC data: LRMs and residual effects for gene expression and regulators with the estimated coefficient or loadings in the parentheses. We adopt a random sampling approach and apply log-rank test for assessment in order to evaluate prediction. We adopt the similar procedure as previous real data analysis section. After repeating 100 times, the average log-rank test statistics are 33.20 (IGE, sd 2.32), 25.06 (S-LASSO, sd 1.84), 24.41 (J-LASSO, sd 2.13), and 27.88 (ColReg, sd 2.45), respectively. The proposed method has superior prediction performance over alternatives.

Discussion
We have conducted an integrative gene-environment interaction analysis for multidimensional omics data based on the proposed two-step variable selection model. Specifically, at the first step, sparse regulatory relationship between the G factor and its regulators have been pinpointed via penalization, which leads to effects that can be directly linked to the prognostic outcomes. At the second step, a G×E prognostic model has been considered, where the G factor that is involved in the interaction consists of regulated (corresponding to the LRM) and unregulated (i.e., the residual GE) components. Besides, the residuals of the regulator are also included. The integrative G×E analysis fully takes the advantage of the multi-omics measurements, which distinguishes itself from most of the published studies.
Traditionally, statistical testing based marginal analysis has dominated the G×E studies. The paradigm shift to the joint analysis has been mainly motivated by the gene set and pathway-based association analysis [42][43][44][45]. Recently, the effectiveness of regularized variable selection has been recognized not only in joint G×E studies when a large number of genetic factors are involved [7], but also in multi-level omics integrations [5]. Therefore, it has been adopted here.
This study can be improved by the following aspects. Because strong correlations have been widely observed in among omics measurements, network based penalization can be imposed to accommodate the correlations among regulators at the first stage [46][47][48].
Besides, robustness can be incorporated at the first stage to model the regulatory relationship between GE and its regulators [49], and in the second stage for a robust prognostic model [50,51]. Accounting for the form of environmental factors has received considerable attention in G×E studies, which results in the development of a wide range of nonparametric [52][53][54] and semiparametric [55][56][57] methods. However, in integrative G×E studies, capturing the nonlinear form of interaction is challenging. In this study, we focus on prognostic outcomes. With other types of outcomes, such as the longitudinal phenotypes [58,59], the G×E model in the second stage can be modified accordingly. Funding: This study received no external funding. It has been partly supported by an innovative research award from KSU Johnson Cancer Research Center and a KSU Faculty Enhancement Award.

Institutional Review Board Statement:
This study is a secondary data analysis. The dataset can be freely downloaded through TCGA data portal. The IRB is not required for accessing and using the data. The patient information has been de-identified from the dataset used in this study.

Informed Consent Statement:
Not applicable due to the reason specified above.

Data Availability Statement:
The datasets used for the analyses described in this manuscript has been downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/) and are available to the general public without restricted access.