Penalized Variable Selection for Lipid–Environment Interactions in a Longitudinal Lipidomics Study

Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid–environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid–environment interactions in a longitudinal lipidomics study. An efficient Newton–Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.


Introduction
Longitudinal data are frequently observed in a diversity of scientific research areas, including economics, biomedical studies, and clinical trials. A common characteristic of the longitudinal data is that the same subject is measured repeatedly over a certain period of time; thus, the repeated measurements are correlated. Many modeling techniques have been proposed to accommodate the multivariate correlated nature of the data [1,2]. The emergence of new types of data has brought constant challenges to the development of novel statistical methods for longitudinal studies. One representative example is the high-dimensional data where the number of variables is much larger than the sample size. As penalization has been demonstrated as an effective way for conducting variable selection in linear and generalized linear models with a univariate response [3,4], substantial efforts have been devoted to developing penalized variable selection methods with longitudinal responses, such [5][6][7], among many others.
This study was partially motivated by overcoming the limitations of existing penalization methods in order to analyze the high-dimensional lipidomics data from longitudinal studies. Lipids are a broad group of biomolecules in eukaryotic membranes, involved in various critical biological roles such as energy storage, cellular membrane structure, or cell signaling and homeostasis [8][9][10][11]. Lipid metabolism has been found to be associated with several diseases, especially chronic diseases such as diabetes, cancer, inflammatory disease, and Alzheimer [12][13][14].
The lipid data were obtained from our previous work on the lipid changes in weight controlled CD-1 mice [15]. In the current study, the phenotype of interest is the body weight of experimental animals, which was measured every week for 10 weeks. The environmental factor was exercise and/or dietary restriction, which had four different levels, control (ad libitum feeding and sedentary), AE (exercise and ad libitum feeding), PE (exercise and pair feeding), and DCR (sedentary and 20% dietary calorie restriction). Both triacylglycerol (TG) and diacylglycerol (DG) profiles in the plasma were measured using electrospray ionization MS/MS [15]. Here, we focused on the DG profiles and treated them as lipid factors. Besides the lipid main effects, we were particularly interested in investigating the interactions between lipids and environment/treatment effects, which will shed novel insight in the understanding of weight changes in a longitudinal setting beyond studies solely focusing on the main lipidomics effects. With the control as the baseline, we created a group of three dummy variables to represent the four levels of the treatment factor that can be treated as environmental factors in general. The product between the dummy variable group and lipid denotes the lipid-environment interactions. The formulation of the interaction group in our study shared the spirit of group LASSO, which was primarily motivated by the selection of important dummy variable groups from ANOVA problems [16]. As the total number of main and interaction effects was much larger than the sample size, penalized variable selection was a natural choice to identify the important subset of effects. Such methods for G×E interactions, including [17,18], however, cannot be adopted for the longitudinal studies.
On the other hand, existing penalization methods in longitudinal studies have been mostly developed for the identification of important main effects only. For instance, Wang et al. [5] proposed the penalized generalized estimating equation (PGEE) to select predictors that are associated with the longitudinal response. Ma et al. [6] considered the selection of important predictors and estimation of non-parametric effects through splines for repeated measures data. Cho and Qu [7] developed a penalized quadratic inference function (PQIF) method to conduct variable selection on main effects. Fan et al. [19] developed robust variable selection through a penalized robust estimating equation to incorporate the correlation structure for repeated measurements. These studies have ignored the interaction effects and cannot be adopted to analyze our data directly. In addition, our limited search also suggests that user-friendly software packages for variable selection methods in longitudinal studies have been relatively underdeveloped. For penalization methods, only two R packages (PGEE and pgee.mixed) are available, and both packages have focused on the selection of important main effects. The codes for most studies in this area have not even been made publicly available.
To accommodate simultaneously the selection of individual and group structure corresponding to the main lipid effect and interaction effect respectively, we propose a novel penalized variable selection method for longitudinal clustered data. Our method significantly advances the existing penalization methods by considering the interaction effects. Through incorporating the group structure, selection of both main and interaction effects can be efficiently conducted within the generalized estimating equation framework [20]. Furthermore, to facilitate fast computation and reproducible research, we implement the proposed and benchmark methods in the R package (interep https://cran.r-project.org/ package=interep) [21]. The software package is open-source, and the core module has been developed in C++. The advantage of our method over alternatives has been demonstrated in extensive simulation studies. Analysis of the motivating dataset yields findings with important implications.

Data and Model Settings
Consider a longitudinal study with n subjects and k i observations measured repeatedly across time for the ith subject (1 i n). Without loss of generality, we set k i = k. Y ij denotes the response for the ith subject at time j (1 j k). X ij = (X ij1 , ..., X ijp ) is the p-dimensional vector of lipid factors. In our study, E ij = (E ij1 , ..., E ijq ) denotes the q-dimensional treatment factor. Suppose that the lipid factors, treatment factors, and their interactions are associated with the longitudinal phenotype through the following model: vectors that represent all the main and interaction effects. The interactions between lipids and treatment factors are modeled through X ij ⊗ E ij , the Kronecker product of the p-dimensional vector X ij , and the q-dimensional vector E ij within the following form: which is a pq-dimensional vector. β 0 is the intercept. β 1 , β 2 , and β 3 are unknown coefficient vectors of dimensions q, p, and pq, respectively. We assume that the observations are dependent within the same subject, and independent if they are from different subjects. i = ( i1 , ..., ik i ) T follows a multivariate normal distribution N k (0, Σ i ), with Σ i as the covariance matrix for the repeated measure of the ith subject across the k time points.

Generalized Estimating Equations
The joint likelihood function for longitudinally clustered response Y ij is generally difficult to specify. Liang and Zeger [20] developed the generalized estimating equations (GEE) method to account for the intra-cluster correlation. It models the marginal instead of the conditional distribution given the previous observations and only requires a working correlation structure for Y ij to be specified.
V i is often unknown in practice and difficult to estimate especially when the number of variance components is large. In GEE, the covariance matrix V i is specified through a simplified is characterized by a finite-dimensional nuisance parameter η. Commonly adopted correlation structures for R(η) can be independent, AR(1), and exchangeable, among others. Liang and Zeger [20] showed that if η can be consistently estimated, the GEE estimator of the regression coefficient is consistent. Furthermore, the consistency holds even when the working correlation structure is misspecified.

Penalized Identification
When the dimensionality of lipid factors is high, the total number of main and interaction effects is even higher. However, only a small subset of important effects is associated with the phenotype, which naturally leads to a variable selection problem. Penalized GEE based methods, including Wang et al. [5] and Ma et al. [6], have been proposed for conducting selection under correlated longitudinal responses. However, those published studies focus on the main effects and ignore the interactions. As shown in (1), the lipid-environment interactions are modeled on the group level, that is the interaction between all the q treatment factors and the hth lipidomics measurement (1 h p). Such a group structure cannot be accommodated by variable selection methods from existing longitudinal studies. This fact motivates us to develop a method for the interaction analysis of repeated measures data, termed as interep, with the following penalized generalized estimating equation: where U(β) is the score equation in GEE and ρ (·) is the first derivative of the minimax concave penalty (MCP) [22]. Since the environmental factors are usually of low dimension and are predetermined as important ones, they are not subject to penalized selection. U(β) is defined as: and the MCP can be expressed as: where λ is the tuning parameter and γ is the regularization parameter. The first derivative function of the MCP penalty is: MCP can be adopted for the regularized selection on both individual and group level effects. It is fast, continuous, and nearly unbiased [22]. In (3), the vector β 2 = (β 21 , ..., β 2p ) denotes the regression parameters for all the p lipid factors. β 3 = (β 31 , .., β 3p ) , which denotes the regression parameters for lipid-environment interactions, is a vector of length pq. β 3h is a vector of length q (h = 1, 2, ..., p), corresponding to the interactions between the hth lipid feature and the environment factors. With the control as the baseline, the environment factors have been formulated as a group of dummy variables. With high-dimensional main and interaction effects, penalization is critical for the identification of important effects out of the large number of candidates. In the penalized generalized estimating equation (3), the first penalty term adopts MCP directly to conduct the selection of main lipid effects on the individual level. The second penalty, in the forms of group MCP, imposes shrinkage on the product between the lipid factors and dummy variable group, which corresponds to the lipid-environment interactions. The group level selection of interaction effects is consistent with the mechanism of creating the dummy variable group of environmental factors. Note that such a rationale of formulating the penalized generalized estimating Equation (3) is deeply rooted in group LASSO [16].
In particular, λ 1 and λ 2 in (3) are tuning parameters. ρ (||β 3h || Σ h ; √ qλ 2 , γ) is the group MCP penalty that corresponds to the interactions between the hth (h = 1, 2, ..., p) lipid factor and the q environment factors. The empirical norm ||β 3h || Σ h is defined as: , and it contains the q columns in Z that correspond to the interactions from the hth lipid factor with the q environment factors. A variety of penalized variable selection methods for high-dimensional longitudinal data have been developed in the past two decades for analyzing high-dimensional omics data, such as gene expressions, single nucleotide polymorphisms (SNPs), and copy number variations (CNVs) [5,6]. However, lipidomics data have been rarely investigated by using high-dimensional variable selection methods. We developed a package, (interep https://cran.r-project.org/package=interep) that incorporates our recently developed penalization procedures to conduct interaction analysis for high-dimensional lipidomics data with repeated measurements [21].
Remark: The uniqueness of the proposed study lies in accounting for the group structure of lipid-environment interactions through penalized identification. Therefore, the main lipid effects and lipid-environment interactions are penalized on individual and group levels, separately, which leads to a formulation of both MCP and group MCP penalties. Although our model has been motivated from a specific lipidomics profiling study in weight controlled mice [15], it can be readily extended to accommodate more general cases in interaction studies where the environmental factors are not dummy variables formulated from the ANOVA setting. In such a case, for each lipid factor, the main lipid effects and lipid-environment interactions form a group, with the leading component of the group being a vector of 1s. As not all the effects in the group are expected to be associated with the phenotype, a sparse group type of variable selection is demanded. Such a formulation has been investigated in survival analysis [23], but not in longitudinal studies yet. With a simple modification of our model to penalize the main and interaction effects on the individual and group level simultaneously, the proposed one becomes a penalized sparse group GEE model and can be adopted to handle general environmental factors in high-dimensional cancer genomics studies.

Computational Algorithms
We developed an efficient Newton-Raphson type of algorithm to obtain the penalized estimateβ. Starting with an initialized value, we can solve the penalized GEE iteratively. The estimatedβ (d+1) in the (d + 1)th iteration can be solved as: where U (d) is the score function expressed in terms ofβ (d) at the dth iteration and T (d) is the corresponding first derivative function of U (d) : which is also a function ofβ (d) . The MCP penalty was imposed on both the individual level (main lipid effects) and group level (lipid-environment interactions). Therefore, W (d) is a diagonal matrix that contains the first derivative of the MCP penalty for the lipid factors and the first derivative of the group MCP penalty for the lipid-environment interactions. We define W (d) as: where is a small positive number set to 10 −6 to avoid the numerical instability when the denominator is zero. The first (1 + q) elements on the diagonal of W are zero, suggesting that there is no shrinkage imposed on the coefficients for the intercept and the environmental factors. We can use nWβ and nW to approximate the first derivative function of MCP in the penalized score equation and the second derivative function of the MCP penalty, respectively. Given a fixed tuning parameter, the regression parameterβ (d+1) can be updated iteratively till convergence. The stopping criterion is that the L1 norm for the L1 difference between two consecutive iterations is less than 10 −3 , and convergence can usually be achieved within 10 iterations.
There are two tuning parameters λ 1 and λ 2 and a regularization parameter γ. λ 1 controls the sparsity of lipid factors, and λ 2 determines sparsity among lipid-environment interactions. We chose the optimal tuning parameters λ 1 and λ 2 using five-fold cross-validation in both the simulation study and real data analysis. The regularization parameter γ was obtained via a data driven approach. In our numerical study, we examined a sequence of values, such as 1.8, 3, 4.5, 6, and 10, suggested by published studies, and found that the results were not sensitive to the choice of the value of γ, and then set the value at 3. We split the dataset into five equally sized subsets and took four of them as the training dataset, leaving the last subset as the testing dataset. The penalized estimates were obtained from the training data, and then, prediction performance was evaluated on the testing data. A joint search over a two-dimensional grid of (λ 1 , λ 2 ) was conducted to find the optimal pair of tuning parameters.
In our study, we considered the methods considering both lipid main effects and lipid-environment interactions with exchangeable working correlation (A1), AR(1) working correlation (A2), and independence working correlation (A3). For comparison with the methods that cannot accommodate the identification of lipid-environment interactions, we also included A4-A6, which incorporate the exchangeable, AR(1), and independence working correlation, respectively. The alternative methods A4-A6 do not ignore the interaction effects. Instead, they treat the interaction effects individually, so the group structure considered in A1-A3 does not exist. We computed the CPU running time for 100 replicates of simulated lipidomics data with n = 250, ρ = 0.8, p = 75 (with a total dimension of 304) and fixed tuning parameters on a regular laptop for A1-A6, which can be implemented using our developed package: (interep https://cran.r-project.org/package=interep) [21]. The CPU running time in seconds was 48.8 (A1), 40.2 (A2), 29.0 (A3), 49.3 (A4), 39.7 (A5), and 27.9 (A6), respectively.

Simulation
We evaluated the performance of all six methods (A1-A6) through extensive simulation studies. Among them, A1-A3 were developed for accommodating the interaction structures with different working correlations, while A4-A6 were only focused on the identification of main effects so the structure of the group level interaction effects were not respected. Note that there are existing studies that can also achieve the selection of main effects in longitudinal studies. For example, Wang et al. [5] adopted the smoothly clipped absolute deviation (SCAD) penalty for conducting the selection of main effects. Since the MCP is incorporated as the baseline penalty in A1-A3, A4-A6 have thus been developed based on MCP and used as benchmark methods for comparison.
The responses were generated from the model (2) with sample size n = 250 and 500. The number of time points k was set to five. The dimensions for lipid factors X ij were p = 75, 150 and 300. With q = 3 for E ij , we first simulated a vector of length n from the standard normal distribution. A group of three binary dummy variables for environmental factors could then be generated after dichotomizing the vector at the 30th and 70th percentiles. In addition, the lipids were simulated from a multivariate normal distribution with mean zero and the AR1 covariance matrix with marginal variance one and auto-correlation coefficient 0.5. We simulated the random error from a multivariate normal distribution by assuming a zero mean vector and an AR1 covariance structure with ρ = 0.5 and 0.8. Note that when considering the interactions, the actual dimensionality was much larger than p. For instance, given n = 250, p = 150, and q = 3, the total dimension for all the main and interaction effects was 604.
The coefficients were simulated from U[0.4, 0.8] for 17 nonzero effects, consisting of the intercept, 3 environmental dummy variables, 4 lipid main effects, and 3 groups of lipid-environment interactions (9 interaction effects). We generated 100 replicates for the four settings: (1) n = 250 and p = 75, (2) n = 250 and p = 150, (3) n = 500 and p = 150, and (4) n = 500 and p = 300. All the rest of the coefficients were set to zero. For each setting, we considered two correlation coefficients (ρ = 0.5 and 0.8) for the random error. The number of true positives (TP) and false positives (FP) was recorded.
In addition to identification results, we also calculated the estimation accuracy in terms of the difference between estimated and true coefficients. In particular, the mean squared error corresponding to the true nonzero coefficients and true zero coefficients (for noisy effects) were termed as MSE and NMSE, respectively. The total mean squared error for the coefficient vector, or TMSE, is computed as: is the estimated value of β in the rth simulated dataset. MSE and NMSE were calculated in a similar way as for TMSE.
Identification results of the six methods (A1-A6) are tabulated in Tables 1-4. In general, A1-A3, which account for both the lipid main effects and lipid-environment interactions, had better performance than A4-A6, which only accommodated the main effects. For example, in Table 1, given n = 250, ρ = 0.5, p = 75, the actual dimension is 304. A1 identified 14.5 (sd 1.9) nonzero effects out of all the 17 true positives, with a relatively small number of false positives of 4.8 (sd 3.1). On the other hand, A4 identified a smaller number of true positives, 1.3 (sd 1.5), with a larger number of false positives, 6.6 (sd 4.2). Among the identified effects, A1 identified 7.4 (sd 1.5) interactions, with 3.1 (sd 2.6) false positives. A4 identified a smaller TP of 6.1 (sd 1.1) and a higher FP of 5.1 (sd 3.3) of the lipid-environment interactions. We could observe that the difference in identification performance between A1 and A4 came mainly from the interaction effects, which was due to the fact that A4 could not accommodate the group level selection corresponding to the lipid-environment interactions. As the dimension increased, A1 outperformed A4 more significantly. For instance, in Table 4, the overall dimension for n = 500, ρ = 0.8, p = 300 is 1204. A1 had a TP of 15.9 (sd 1.2) and an FP of 3 (sd 2.6), while A4 had a smaller TP 14.5 (sd 1.2) and a higher FP 4.5 (sd 3.0). Figures 1 and 2 are plotted based on the identification results from Tables 1-4. We can observe that overall, A1-A3 outperformed A4-A6 with a higher TP and a lower FP under each setting.         = 0.5 = 0.8    In terms of estimation accuracy, A1-A3 also had a better performance compared with A4-A4, as shown in Tables 5 and 6. For the panel corresponding to n = 250, ρ = 0.5, and p = 75 in Table 5, the mean squared error for the nonzero coefficients of A1 was 0.1055, which was less than half of that of A4 (0.2321). Besides, A1 also had a smaller total mean squared error (TMSE). All the pieces of evidence suggested that A1 had higher estimation accuracy than A4. We can observe the pattern for the rest of the four methods. As the dimension increased to n = 500, ρ = 0.8, and p = 300 (so the total dimension was 1204) in Table 6, the MSE of A1 (0.0688) was also smaller than that of A4 (0.1949). There were no obvious differences in NMSE among these settings.
Another important conclusion we make from the simulation study is that, for the methods that differ only in working correlation, i.e., A1 (exchangeable), A2 (AR1), and A3 (independence), there was no significant difference in terms of either identification or estimation accuracy, as shown by Tables 1-6, as well as Figures 1 and 2. Such an observation suggests that the proposed methods under the GEE framework were robust to the misspecification of the working correlation, and this is consistent with the conclusions from main effects only models in longitudinal studies [7].  To mimic the sample size and number of lipid factors in the case study, we also conducted a simulation in settings with n = 60, p = 30, and q = 3. Therefore, the overall dimension of main and interaction effects was 124. The coefficients were generated from U[1.4,1.8] for 17 nonzero effects. The identification and prediction results are summarized in Tables A1 and A2 in the Appendix A, respectively. Consistent patterns were observed. For example, in terms of identification, under ρ = 0.5, A1 had a higher TP of 13.6 (sd 2.5) compared to the 11.1 (sd 2.6) of A4, and a lower FP of 4.7 (sd 2.7), compared to the FP of 5.4 (sd 2.8) identified by A4.
Evaluation of all the methods, especially A1-A3, was also conducted when the true underlying model was misspecified. We generated the response (phenotype) from a main effect only model with eight true main effects when n = 250, p = 75, ρ = 0.8 with a total dimension of 304. Results are provided in Table A3. When the interaction effects did not exist, A1 had only identified a very small number of false interaction effects, with 0.7 (sd 1.7) false positives. A2-A6 performed similarly in terms of identifying false interaction effects. All six methods identified a comparable number of true main effects. Overall, all methods had similar performance in identification, as well as prediction, when the data generating model had only main effects. Such a phenomenon is reasonable by further examining the results in Table 1. We found that the major difference between A1-A3 and A4-A6 was due to the identification of interaction effects. Therefore, when only main effects were present, all the methods had comparable performances.
Penalized regression and hypothesis testing are two related, but distinct aspects in statistical analysis. The proposed study was not aimed at developing test statistics, computing the power functions, and assessing the control of type 1 error, so these statistical test related results are not available, just like most of the studies on penalized regression. Recently, efforts devoted to bridging the two areas have been mainly restricted to linear models under high-dimensional settings [24][25][26]. Extensions to interaction models have not been reported so far. In particular, we are not aware of results reported for longitudinal models. Nevertheless, we conducted the simulation by assuming the null model and tabulate the identification results in Table A4. The results should be interpreted as identification with misspecified models. As we observed, under the null model, all six methods led to a very small number of false positives.
To assess the consistency of variable selection in longitudinal settings, we carried out the stability selection [27] under n = 250, p = 75, and ρ = 0.8. Each time, we selected 200 out of the total of 250 subjects without replacement and then conducted selection. The process was repeated 100 times, which yielded a proportion of selected effects. Larger proportions of being selected suggested stable results. Stability selection is well known for assessing the stability of penalized selection, and it alleviates the concern that the effects have only been identified by chance. We investigate the selection proportions of the 17 true main and interaction effects for all six methods in Table A5. A1 identified 14 true effects with proportions above 70%, which is consistent with the results shown in the lower panel of Table 1, where 13.7 TPs (sd 2.3) were identified. Such a consistent pattern can be observed across all six methods.
Although no consensus on the optimal criterion of selecting tuning parameters has been reached so far, cross-validation is perhaps the most well accepted criterion to select tuning parameters in the community of high-dimensional data analysis [3,4]. To further justify its appropriateness, under the setting of n = 250 and p = 75, we performed the analysis by selecting tuning parameters using an independently generated testing dataset with a sample size of 1000 and p = 75. The models were fitted on the training dataset, and prediction was assessed based on the independently generated testing dataset, so no data were used in training the model. The identification and prediction results are tabulated in Tables A6 and A7, respectively. A comparison to Tables 1 and 5 demonstrates that the results obtained by cross-validation and validation were very close.

Real Data Analysis
We applied the proposed and alternative methods on a dataset from one of our previous studies in animal models [15]. In the study, 60 female CD-1 mice were assigned to four different treatment groups, which were control (ad libitum feeding and sedentary), AE (exercise and ad libitum feeding), PE (exercise and pair feeding), and DCR (sedentary and 20% dietary calorie restriction). The phenotype of interest was mice's body weight, which was measured every week for 10 weeks. Mice were sedentary and given ad libitum feeding in the control group, where they could eat as much as they wanted without doing treadmill exercises. In the AE group, mice received ad libitum feeding and ran on the treadmill every day at a speed of 0.5 mph, 1 hour per day, and 5 days a week, while mice in the PE group did the same exercise, but were given the same amount of diet as the mice in the control group. Mice in the DCR group had 20% less calorie intake than the control group, but they had the same intake of protein, vitamins, and minerals. The composition of 176 plasma neutral lipid species of interest was measured. In the current study, we only focused on diacylglycerols. In addition, the diacylglycerol lipid species that have a majority of samples lower than the detection limits were excluded so there were 31 diacylglycerols. In total, there were 31 lipid main effects and 93 lipid-environment interactions.
Using the method A1 (interep with the exchangeable working correlation) as shown in Table 7, we identified seven lipid species that had different effects in weight control of mice (AE, PE, or DCR) on body weight compared to those of the control mice. Among them, C20:1/16:1 and C20:1/20:4 had negative interactions in AE mice, where C denotes carbon. For the lipid species of C20:1/16:1, C 39 H 76 O 5 N, the regression coefficient was −2.9145 for AE mice. That is, mice with an increased amount of C20:1/16:1 tended to have a lower body weight compared to that of the control. In the AE mice, both C16:0/C16:0 and C22:6/C18:1 had strong positive associations with body weights. It is interesting that C16:0/C16:0 were negatively associated with body weight in both PE and DCR mice. C16:0 is also called palmitic acid and is one of most common saturated fatty acids. Increased consumption of palmitic acid is associated with higher risk of cardiovascular disease, type 2 diabetes, and cancer [28]. The negative association of C16:0/16:0 and body weight in DCR and PE suggests that when the calories of the diet are restricted, the accumulation of saturated fat in the body actually decreased compared to the control. Another lipid that is negatively associated with body weight in DCR and PE mice is C18:1/16:1. The lipids that were positively associated with body weight in PE were C18:2/C16:1, C20:1/C16:1, and C22:6/C18:1. All species contain unsaturated fatty acids. Among them, C22:6 is one of the omega-3 polyunsaturated fatty acids (PUFA). In DCR, the two lipids that were positively associated with body weight were C18:2/16:1 and C20:1/20:4. Both fatty acids C18:2 and C20:4 were PUFA. The results seem to be consistent with our previous finding that exercise with paired feeding may increase the amount of PUFA in phospholipids in mice skin [29]. Table 7. Real data analysis result from method A1 (method accommodating the lipid-environment interactions with exchangeable working correlation). In addition, we adopted A4 to analyze the lipid data. A4 also had the exchangeable working correlation, but it could not conduct group level selection of the lipid-environment interactions. The identification results are tabulated in Table 8. Note that the selection of interactions with individual dummy environment factors was not consistent with the formulation of the lipid-environment interactions. In terms of prediction, A1 had a smaller prediction error (4.04) than that of A4 (4.97).

Discussion
Investigation of the potential roles of lipids in the regulation and control of cellular function and the interactions between lipids and environmental factors are very important in the understanding of physiology and disease processes. Traditionally, the analyses mostly focus on the total amount of a particular type of lipid, such as total triglyceride, total cholesterol, and omega-3 fatty acid. With the recent advances in instrumental technology, it is feasible to analyze quantitatively a broad range of lipid species in a single platform [13,15,[30][31][32]. The vast arrays of data generated in lipid profiling studies bring challenges to the statistical analysis of lipidomics data [33][34][35].
In this study, we proposed a penalized variable selection method to identify important lipid-environmental effects in longitudinal studies. Some statistical methods have already been reported for lipidomics studies, including the marginal test and variable selection methods [15,32,34,35]; however, they cannot be directly extended to longitudinal studies. On the other hand, existing variable selection methods for longitudinal data have been predominately developed for the identification of main effects and cannot accommodate the group level interaction structure unique to our studies. Both the simulation and case study have convincingly demonstrated the merit of the proposed interep over alternatives.
We selected tuning parameters based on cross-validation. A further investigation of different tuning criteria is interesting, but beyond the scope of this study, especially given the fact that many well known variable selection methods in longitudinal studies, such as [5], have been conducted using cross-validation. To facilitate a fair cross-comparison with existing relevant studies, we believe it is reasonable to adopt cross-validation to choose tuning parameters. Note that the aforementioned stability selection analysis also partially justifies the usage of cross-validation. We acknowledge that other criteria for selecting tunings, such as double cross-validation [36], could be a potential reliable choice. However, as it is not a widely accepted tuning criterion for high-dimensional data analysis and has not been adopted in any longitudinal studies so far, we postpone the investigation to the future.
Interaction studies have been historically pursued by statisticians [37]. Within the high-dimensional scenario, accounting for such a complex structure, in both gene-gene (G × G) and gene-environment (G × E) interaction studies, is challenging, but also rewarding [38]. The proposed study is among the first to investigate penalized identification of lipid-environment interactions in longitudinal studies. Both the simulation study and case study yielded interesting findings. G × G interaction is computationally more challenging than G × E interactions since both main effects involved in the interactions are of high dimensionality. Following the representative G × G interaction studies [39,40], we can extend the proposed study to lipid-lipid interactions, which has not been investigated in longitudinal studies so far. Besides, when multi-omics measurements are available, it is also of great interest to examine interaction effects through multi-omics integration studies in the longitudinal setting [41,42].
The proposed model can also be estimated using the quadratic inference functions (QIF). GEE relies on the working correlation matrix R(η), and it enables us to find the consistent estimator of the regression parameter if consistent estimators of the nuisance parameters η can be obtained. However, consistent estimators of η do not always exist in some cases. QIF has been proposed to avoid explicit estimation of the nuisance parameters by assuming the inverse of the working correlation matrix R(η) can be approximated by a linear combination of a class of base matrices [7,43]. Thus, QIF is robust to the misspecification of the working correlation.
In this paper, we are interested in the identification of lipid-treatment (or environment) interactions through penalization. The success of set based analysis, including those for the gene set [44] and SNP set [45,46], has tremendously motivated the development of statistical methods for G × E interactions from marginal analyses ( [47,48]) to penalization methods [17,18,49]. Our model can be potentially extended in the following aspects. First, as data contamination and outliers have been widely observed in repeated measurements, robust variable selection methods in G × E interaction studies [23,[50][51][52] can be extended to longitudinal settings. Second, recently, multiple Bayesian methods have been proposed for pinpointing important G × E interaction effects [53][54][55]. Within the framework of analyzing repeated measurements, Bayesian variable selection for interactions has not been extensively examined. Investigations of all these possible directions will be postponed to the near future.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:       Mean (sd) based on 100 replicates. A1-A3: methods accommodating the lipid-environment interactions with exchangeable, AR(1), and independence working correlations, respectively. A4-A6: methods not accommodating the lipid-environment interactions with exchangeable, AR(1), and independence working correlations, respectively.