Article

Prioritizing Longitudinal Gene–Environment Interactions Using an FDR-Assisted Robust Bayesian Linear Mixed Model

Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(11), 728; https://doi.org/10.3390/a18110728
Submission received: 1 September 2025 / Revised: 14 November 2025 / Accepted: 17 November 2025 / Published: 19 November 2025

Abstract

Analysis of longitudinal data in high-dimensional gene–environment interaction studies has been extensively conducted using variable selection methods. Despite their success, these studies have been consistently challenged by the lack of uncertainty quantification procedures for identifying main and interaction effects when longitudinal phenotypes follow heavy-tailed distributions due to disease heterogeneity. In this article, to improve the statistical rigor of variable selection-based G × E analysis, we propose a robust Bayesian linear mixed-effect model with a false discovery rate (FDR) control procedure to tackle these challenges. The Bayesian mixed model adopts a robust likelihood function to account for skewness in longitudinal phenotypic measurements, and it imposes spike-and-slab priors to detect important main and interaction effects. Leveraging the parallelism between spike-and-slab priors and the Bayesian approach to hypothesis testing, we perform variable selection and uncertainty quantification through a Bayesian FDR-assisted procedure. Numerical analyses have demonstrated the advantage of our proposal over alternative approaches. A case study of a longitudinal cancer prevention study with high-dimensional lipid measures yields main and interaction effects with important biological implications.

1. Introduction

Gene–environment interactions provide additional insights in explaining variations in complex disease traits beyond main genetic and environmental effects [1]. With large-scale omics features, the dissection of gene–environment interactions becomes a high-dimensional data analysis problem, which can generally be addressed through two frameworks: hypothesis testing and variable selection [2,3,4]. Hypothesis testing-based methods are usually of a marginal nature, analyzing one or a small number of markers at a time through statistical tests, with a p-value generated for each test to perform statistical inference. Multiple testing procedures then follow to detect significant interactions. Due to the computational efficiency of conducting each test and the inferential power to identify significant interactions, this branch of methods dominates in genome-wide association studies (GWAS) [5,6], including biobank-scale studies [7,8,9,10]. On the other hand, variable selection methods, especially those based on regularization, have achieved success in addressing challenging issues that are not widely considered in testing-based studies, including accommodating structured sparsity, such as the hierarchy between main and interaction effects, and accounting for robustness of the G × E analysis [4].
Due to the heterogeneity of complex diseases, disease phenotypes generally follow heavy-tailed distributions with outliers. Robust penalization methods have therefore been developed to safeguard against outliers and improve accuracy in identifying important omics features in bioinformatics studies [11]. From a modeling perspective, robust regularized variable selection takes the form of "robust unpenalized loss function + penalty function", where the robustness comes from loss functions that can downweight the influence of outlying observations. Typical choices of robust loss functions include the Huber loss [12], quantile check loss [13], and rank-based loss [14], among many others [11]. Subsequently, corresponding robust penalization methods have been widely developed [15,16,17].
Addressing robustness is also a critical issue in gene–environment interaction studies, which usually involve analyzing complex disease traits [4]. In the literature, robust penalization methods have been extensively developed for interaction studies [18,19,20,21], where robust loss functions are adopted to accommodate data heterogeneity and penalty functions tailored for interaction analysis have been developed. Although these studies have yielded gene–environment interactions with meaningful biological implications, there is a lack of statistical inference procedures that can lead to uncertainty quantification measures. This is partly attributable to the difficulty of developing inference measures with robust loss functions, which are often non-differentiable and violate the standard assumptions required for conventional inference procedures under non-robust sparse models with Gaussian model errors [22,23,24,25]. From both frequentist and Bayesian perspectives, several recent studies have proposed uncertainty quantification measures for G × E analysis [26,27,28], paralleling the trend of improving the statistical rigor of sparse models via high-dimensional inference procedures [22,23,24]. In particular, robust fully Bayesian analysis has been considered since robust Bayesian inference utilizing posterior samples drawn from MCMC can be readily performed [29,30,31].
In longitudinal gene–environment interaction studies where disease phenotypes are repeatedly measured across a sequence of time points, overcoming the aforementioned challenge in statistical inference is further complicated by the defining characteristics of longitudinal data, whereby repeated measures are correlated and demonstrate subject-specific clustering patterns [32]. In the literature, the generalized estimating equation (GEE) is one of the major frameworks for longitudinal data analysis [33], and its extensions to high-dimensional settings have been widely considered. For example, Wang et al. (2012) [34] have developed penalized GEE with valid inference through confidence intervals that can be validated on finite samples. However, vulnerability to outliers is a well-known disadvantage of GEE [35]. Consequently, the main effect model-based penalized GEE and its inference procedures cannot be directly applied when robustness in interaction studies is a concern. Moreover, addressing data heterogeneity through standard transformation procedures, such as the Box–Cox transformation, is not very effective for longitudinal data: applying transformations to repeated measures distorts the correlation structure among the original longitudinal observations. Therefore, performing a data transformation step prior to using non-robust GEE-based methods is generally not preferable in longitudinal studies. In spite of attempts to improve robustness through quadratic inference functions [36] in tailored G × E analyses [37,38], uncertainty quantification measures are not available with these variable selection procedures, and robust analysis methods with inferential guarantees remain in pressing need for longitudinal studies.
In addition to GEE, the mixed-effect model is another major framework for analyzing longitudinal data [35,39]. Since robustifying fully Bayesian analysis in cross-sectional studies has produced valid inference procedures in the presence of data heterogeneity [25], in longitudinal studies, we leverage robust, fully Bayesian methods for gene–environment interaction analyses. Specifically, we adopt a robust Bayesian linear mixed-effect model that captures interconnections among repeated measures by incorporating random effects [35,40], and models G × E interactions through fixed effects. Our primary interest in this study is prioritizing, or ranking, important main omics effects and G × E interactions. Consequently, although a plethora of shrinkage priors exist [41], we focus on two-component spike-and-slab priors, which account for large effects through slab components in terms of continuous densities such as the normal or Laplace distributions, while shrinking regression coefficients corresponding to small effects to zero via a spike component given by a point mass at zero [42,43]. The two-group nature of spike-and-slab priors naturally results in a posterior inclusion probability (PIP) indicating the proportion of MCMC iterations in which a regression coefficient representing a main or interaction effect is sampled from the slab component. Therefore, a larger PIP suggests stronger evidence supporting the significance of the effect and can be used as a ranking tool to prioritize the importance of detected effects in longitudinal interaction studies.
In a cross-sectional G × E analysis of type 2 diabetes data with high-dimensional single nucleotide polymorphism (SNP) features, Lu et al. (2021) [29] have developed a robust marginal analysis method to generate a ranked list of G × E interactions using PIPs. In frequentist studies, marginal penalization methods that select important markers based on non-zero shrinkage estimates have also been considered [44]. Compared to the frequentist regularization method, Lu et al. (2021) [29] offers the advantages of robustness and uncertainty quantification through PIPs. Although the results are promising, a caveat is the lack of a valid rule for determining cut-offs to declare identified effects important.
With spike-and-slab priors, the median probability criterion has been shown to be effective for variable selection on both theoretical and empirical grounds [25,45,46]. As our primary goal is to produce a ranked list of appropriate length, it is reasonable to adopt an adaptive threshold. In longitudinal gene–environment interaction studies, we consider calculating the threshold for labeling significant main and interaction effects utilizing Bayesian false discovery rates (FDRs) [47,48]. Suppose that $\text{PIP}_j$ is the posterior inclusion probability of the slab component for the regression coefficient corresponding to the jth main or interaction effect; then $(1 - \text{PIP}_j)$ indicates the probability of a false positive if the jth effect is claimed as important but is actually not associated with the disease trait. Therefore, $(1 - \text{PIP}_j)$ can be interpreted as an estimate of the Bayesian q-value or local FDR [49,50,51]. Morris et al. (2008) [48] and Zhang et al. (2014) [47], among others, have provided procedures to determine the cut-off adaptively with FDR control.
Combined, our proposal consists of the following two steps. First, we identify longitudinal gene–environment interactions and corresponding PIPs with the robust Bayesian linear mixed model that accommodates trait heterogeneity, intra-subject clustering of repeated measures, and the high dimensionality of longitudinal omics data. Then, the Bayesian FDR procedure described above is applied to adaptively choose a threshold to prioritize main and interaction effects and to claim significance of findings. It is worth noting that Lu et al. (2021) [29] have used ROC curves and the top 100 ranked effects to summarize variable selection results. While informative, ROC curves are only feasible in simulations where the ground truth is available, and the choice of the top 100 effects is somewhat arbitrary. In comparison, the adaptive threshold determined in our study is well established in the published literature. Extensive numerical experiments are performed to demonstrate the utility of the proposed methods. In addition, we analyze longitudinal data with high-dimensional lipidomics features and repeatedly measured body weights of CD-1 mice, retrieved from a cancer prevention study [52].

2. Materials and Methods

2.1. The Robust Linear Mixed Model

In a longitudinal gene–environment interaction study with n subjects and k repeated measures per subject, we denote $Y_{ij}$ as the phenotype of interest for the ith subject measured at the jth time point ($1 \le i \le n$, $1 \le j \le k$). At time point j, let $G_{ij} = (G_{ij1}, \ldots, G_{ijp})^\top$ be the p-dimensional vector of genetic factors and $E_{ij} = (E_{ij1}, \ldots, E_{ijq})^\top$ be the q-dimensional vector of environmental factors. Then, we consider the following robust longitudinal gene–environment interaction mixed-effect model:
$$Y_{ij} = T_{ij}^\top \beta_0 + G_{ij}^\top \beta_1 + E_{ij}^\top \beta_2 + (G_{ij} \otimes E_{ij})^\top \beta_3 + U_{ij}^\top \alpha_i + \varepsilon_{ij},$$
where the fixed effects include a 3-by-1 vector $\beta_0$ associated with the baseline time effects $T_{ij} = (1, j, j^2)^\top$, while $\beta_1$, $\beta_2$, and $\beta_3$, which are the p-, q-, and pq-dimensional regression coefficient vectors, represent the genetic and environmental main effects as well as their interactions, respectively. Specifically, the longitudinal gene–environment interactions can be expressed as a Kronecker product in terms of a pq-dimensional vector:
$$G_{ij} \otimes E_{ij} = \big(G_{ij1}E_{ij1},\ G_{ij1}E_{ij2},\ \ldots,\ G_{ij1}E_{ijq},\ G_{ij2}E_{ij1},\ \ldots,\ G_{ijp}E_{ijq}\big)^\top.$$
The mixed model (1) accommodates correlations among repeated measures through the random effects $\alpha_i$. Under the random intercept-and-slope model, $\alpha_i$ is a 2-by-1 vector associated with $U_{ij} = (1, j)^\top$, where j is the time point ($1 \le j \le k$), so that $U_{ij}^\top \alpha_i$ is a scalar in model (1). Under the random intercept model, $U_{ij} = 1$, and $\alpha_i$ reduces to a scalar accordingly. The model errors $\varepsilon_{ij}$ are assumed to be independent and to follow heavy-tailed distributions.
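As a concrete illustration of the interaction term in model (1), the Kronecker product $G_{ij} \otimes E_{ij}$ can be formed with NumPy's `np.kron`. The sketch below is not part of the paper's code; the dimensions (p = 3, q = 2) and random inputs are illustrative.

```python
import numpy as np

# Illustrative sketch: forming the interaction vector G_ij ⊗ E_ij
# for one subject-time pair with p = 3 genetic and q = 2
# environmental factors (values are random stand-ins).
rng = np.random.default_rng(0)
p, q = 3, 2
G_ij = rng.normal(size=p)  # genetic factors at time j
E_ij = rng.normal(size=q)  # environmental factors at time j

# np.kron lists the products in the same order as the paper:
# (G1*E1, ..., G1*Eq, G2*E1, ..., Gp*Eq)
GE_ij = np.kron(G_ij, E_ij)
print(GE_ij.shape)  # (6,)
```

Stacking these pq-dimensional vectors over all (i, j) pairs yields the interaction block of the fixed-effect design matrix.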
Under standard, low-dimensional mixed-effects models, model selection between a random-intercept model and a random intercept-and-slope model can be performed using information criteria such as AIC or BIC, or by comparing marginal likelihoods [35]. However, in high-dimensional settings, these model selection criteria are confounded by the subset of selected features, and the issue therefore remains an open problem. We adopt the random intercept-and-slope model throughout this study and justify our choice based on the case study.
Model (1) differs from published works such as [37,53] in that it does not arise from a repeated-measure one-way ANOVA design. Therefore, unlike the group-based dummy variable structure used for environmental factors in [53], the E i j terms in model (1) do not have such an implication, as they are not the result of dummy coding for categorical predictors.

2.2. A Bayesian Formulation of the Robust Linear Mixed Model

To ensure the robustness of model (1), we assume that the model errors $\varepsilon_{ij}$ follow independent and identically distributed Laplace distributions. Specifically,
$$f(\varepsilon_{ij} \mid \tau) = \frac{\tau}{4} \exp\Big\{-\frac{\tau}{2} |\varepsilon_{ij}|\Big\}, \quad i = 1, \ldots, n,\ j = 1, \ldots, k,$$
where the scale and location parameters of the Laplace distribution are $2/\tau$ and 0, respectively. The $L_1$ norm in the exponent of the kernel, $|\varepsilon_{ij}|$, is critical for a robust linear mixed model, since this formulation downweights the influence of outliers and heavy-tailed model errors. Given the fixed and random effects in model (1), we can obtain the conditional distribution of $Y_{ij}$ as
$$f(Y_{ij} \mid \mu_{ij}, \alpha_i, U_{ij}, \tau) = \frac{\tau}{4} \exp\Big\{-\frac{\tau}{2} \big|Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i\big|\Big\},$$
where the mean function is defined as $\mu_{ij} = T_{ij}^\top \beta_0 + G_{ij}^\top \beta_1 + E_{ij}^\top \beta_2 + (G_{ij} \otimes E_{ij})^\top \beta_3$. As Laplace model errors can be expressed as $\varepsilon_{ij} = \tau^{-1} \xi \sqrt{v_{ij}}\, z_{ij}$, where $v_{ij} \sim \text{Exp}(1)$, $z_{ij} \sim N(0, 1)$, and $\xi = \sqrt{8}$ [54], we formulate the linear mixed model (1) through the following hierarchical structure:
$$Y_{ij} = \mu_{ij} + U_{ij}^\top \alpha_i + \xi \tau^{-1/2} \sqrt{\tilde v_{ij}}\, z_{ij},$$
$$\tilde v_{ij} \mid \tau \overset{ind}{\sim} \tau \exp(-\tau \tilde v_{ij}), \quad z_{ij} \overset{ind}{\sim} N(0, 1), \quad i = 1, \ldots, n \text{ and } j = 1, \ldots, k.$$
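The exponential-normal scale mixture above can be checked numerically. The sketch below (illustrative, not the authors' code) draws from the mixture with the constant $\xi = \sqrt{8}$ and confirms that the resulting errors reproduce the variance $8/\tau^2$ implied by the Laplace density $f(\varepsilon \mid \tau) = (\tau/4)\exp\{-(\tau/2)|\varepsilon|\}$.

```python
import numpy as np

# Sketch: Monte Carlo check of the scale-mixture representation.
# With v ~ Exp(1), z ~ N(0,1), and xi = sqrt(8),
# eps = xi * tau^{-1} * sqrt(v) * z should have the same variance,
# 8 / tau^2, as the Laplace error with scale 2 / tau.
rng = np.random.default_rng(1)
tau, xi, N = 2.0, np.sqrt(8.0), 1_000_000

v = rng.exponential(1.0, size=N)     # exponential mixing variable
z = rng.standard_normal(N)           # standard normal draw
eps = xi / tau * np.sqrt(v) * z      # scale-mixture error

print(eps.var(), 8 / tau**2)         # the two should agree closely
```

The equivalent hierarchy in the text uses $\tilde v_{ij} = v_{ij}/\tau$, which is why the rate-$\tau$ exponential appears there together with $\tau^{-1/2}$.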

2.3. Robust Sparse Bayesian Linear Mixed Model

Under the above mixed model for longitudinal gene–environment interactions, the high dimensionality of the genetic factors $G_{ij}$ necessitates variable selection to identify main and interaction effects that are associated with disease phenotypes. The low-dimensional environmental factors $E_{ij}$ are not subject to selection, and we only consider selecting genetic factors and gene–environment interactions. Therefore, our proposed model respects the weak hierarchical structure between main and interaction effects [4]. From a frequentist variable selection perspective, the regularized loss function for the above gene–environment interaction problem can be expressed as
$$\sum_{i=1}^{n} \sum_{j=1}^{k} \big|Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i\big| + \lambda_1 \sum_{m=1}^{p} |\beta_{1m}| + \lambda_2 \sum_{s=1}^{pq} |\beta_{3s}|,$$
where $\lambda_1, \lambda_2 > 0$ are tuning parameters for penalized selection of the main and interaction effects. To enable uncertainty quantification in the presence of heavy-tailed model errors, we consider a Bayesian approach, imposing appropriate shrinkage priors on the p- and pq-dimensional fixed effects $\beta_1$ and $\beta_3$, respectively. While variable selection is the focus here, we acknowledge that statistical testing-based approaches can also be developed within mixed-effect models for association analysis [55,56].
We briefly outline the prior elicitation procedure as follows. For the main effects $\beta_{1m}$ ($m = 1, \ldots, p$), we first consider assigning Laplacian shrinkage priors such that $\pi(\beta_{1m} \mid \tau, \lambda_1) = (\tau \lambda_1 / 2) \exp\{-\tau \lambda_1 |\beta_{1m}|\}$. For a more concise representation, let $\eta_1 = \tau \lambda_1$, so the Laplacian shrinkage priors can be hierarchically expressed as a scale mixture of normals [57]:
$$\beta_{1m} \mid r_{1m} \overset{ind}{\sim} N(0, r_{1m}), \quad r_{1m} \overset{ind}{\sim} \frac{\eta_1^2}{2} \exp\Big\{-\frac{\eta_1^2}{2} r_{1m}\Big\}.$$
Since Laplacian priors do not yield exactly zero posterior estimates for the fixed effects, we further incorporate an extra layer of point-mass spike-and-slab priors into the Laplacian hierarchy as
$$\beta_{1m} \mid \phi_{1m}, r_{1m} \overset{ind}{\sim} \phi_{1m} N(0, r_{1m}) + (1 - \phi_{1m}) \delta_0(\beta_{1m}), \quad \phi_{1m} \overset{ind}{\sim} \text{Bernoulli}(\pi_1), \quad r_{1m} \overset{ind}{\sim} \text{Gamma}\Big(1, \frac{\eta_1^2}{2}\Big),$$
where the spike component $\delta_0(\beta_{1m})$ is a point mass at 0, promoting exact sparsity in estimating the main effects, while the slab component $N(0, r_{1m})$ models large signals. The binary indicator $\phi_{1m} \in \{0, 1\}$ specifies whether the spike or slab component is active. When $\phi_{1m} = 1$, the prior in Equation (2) reduces to the univariate Laplace prior, suggesting that the mth genetic variant has a main effect; otherwise, if $\phi_{1m} = 0$, there is no main effect. Similarly, we impose the following priors on $\beta_{3s}$ ($s = 1, \ldots, pq$), the regression coefficients representing interaction effects:
$$\beta_{3s} \mid \phi_{3s}, r_{3s} \overset{ind}{\sim} \phi_{3s} N(0, r_{3s}) + (1 - \phi_{3s}) \delta_0(\beta_{3s}), \quad \phi_{3s} \overset{ind}{\sim} \text{Bernoulli}(\pi_2), \quad r_{3s} \overset{ind}{\sim} \text{Gamma}\Big(1, \frac{\eta_2^2}{2}\Big),$$
where the latent mixture proportion $\pi_2 \in (0, 1)$, and the same rationale as for the priors on the main effects applies.
Hyperpriors are assigned to retain conjugacy: $\pi_1 \sim \text{Beta}(a_1, b_1)$ and $\pi_2 \sim \text{Beta}(a_2, b_2)$. In addition, Gamma priors are assumed as $\eta_1^2 \sim \text{Gamma}(c_1, d_1)$, $\eta_2^2 \sim \text{Gamma}(c_2, d_2)$, and $\tau \sim \text{Gamma}(w_1, w_2)$; an inverse-Gamma prior $\phi^2 \sim \text{Inverse-Gamma}(e_1, e_2)$ is assigned to the random-effect variance. We set the hyperparameters as $a_1 = b_1 = a_2 = b_2 = c_1 = d_1 = c_2 = d_2 = w_1 = w_2 = e_1 = e_2 = 1$.
For the random effects $\alpha_i$, a multivariate normal prior under the random intercept-and-slope model is adopted: $\alpha_i \sim \text{MVN}(0, \phi^2 I)$.
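To make the prior hierarchy concrete, the following sketch draws from the spike-and-slab prior in Equation (2); the hyperparameter values ($\pi_1 = 0.1$, $\eta_1^2 = 4$) are illustrative stand-ins, not values used in the paper.

```python
import numpy as np

# Sketch: draws from the spike-and-slab prior
# beta | phi, r ~ phi * N(0, r) + (1 - phi) * delta_0, with
# phi ~ Bernoulli(pi1) and r ~ Gamma(1, eta1^2/2), i.e. an
# Exponential with rate eta1^2/2 (illustrative hyperparameters).
rng = np.random.default_rng(2)
pi1, eta1_sq = 0.1, 4.0

def draw_beta(rng, pi1, eta1_sq):
    phi = rng.random() < pi1               # slab indicator phi ~ Bern(pi1)
    if not phi:
        return 0.0                         # spike: exact zero
    r = rng.exponential(2.0 / eta1_sq)     # Gamma(1, eta1^2/2): scale = 2/eta1^2
    return rng.normal(0.0, np.sqrt(r))     # slab: N(0, r)

draws = np.array([draw_beta(rng, pi1, eta1_sq) for _ in range(10_000)])
# Roughly a fraction pi1 of the draws should be non-zero.
print(np.mean(draws != 0.0))
```

The exact zeros produced by the spike component are what make the posterior inclusion indicators, and hence the PIPs, well defined.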

2.4. Robust Variable Selection with Bayesian False Discovery Rates

With spike-and-slab priors [42,43], Bayesian variable selection can be performed using all the MCMC samples. Consider the $(p + pq)$ main and interaction effects with the hierarchical structures specified in (2) and (3), respectively, and assume that G posterior samples have been drawn after excluding burn-ins. At the gth MCMC iteration, the uth main or interaction effect is selected if its binary indicator $\phi_u^{(g)} = 1$ ($1 \le u \le p + pq$); otherwise, it is not selected, i.e., $\phi_u^{(g)} = 0$. Hence, the posterior inclusion probability (PIP) of selecting the uth main or interaction effect across G MCMC iterations is
$$p_u = \frac{1}{G} \sum_{g=1}^{G} \phi_u^{(g)}, \quad 1 \le u \le p + pq.$$
With a slight abuse of notation, we use $\beta_u$ to represent the uth main or interaction effect. Under a two-group spike-and-slab formulation, variable selection can be viewed as Bayesian hypothesis testing of $H_{0u}: \beta_u = 0$ versus $H_{1u}: \beta_u \neq 0$. The complement of the PIP, $1 - p_u = \Pr(H_{0u} \mid \text{data})$, is the Bayesian local false discovery rate (local FDR) for the uth effect, i.e., the posterior probability that the effect is null. This quantity represents the expected contribution of the uth main or interaction effect to the global false discovery proportion if it is declared significant.
Let $\gamma \in (0, 1)$ be a threshold applied to the PIPs. All main and interaction effects satisfying $p_u > \gamma$ are declared discoveries. For this selected set of effects, the Bayesian FDR is defined as the average of their local FDRs:
$$\text{FDR}(\gamma) = \frac{\sum_{u=1}^{p+pq} (1 - p_u)\, I(p_u > \gamma)}{\sum_{u=1}^{p+pq} I(p_u > \gamma)},$$
where the denominator is the total number of declared significant genetic main effects and G × E interactions, while the numerator sums their probabilities of being false discoveries (i.e., their posterior null probabilities). Thus, $\text{FDR}(\gamma)$ is the expected proportion of false discoveries among all declared significant effects. This Bayesian FDR definition has been widely adopted in the published literature [47,48,58].
Instead of using a fixed cut-off such as the median probability threshold $\gamma = 0.5$ [25,45], we determine an adaptive threshold that yields a desired global FDR level $c \in (0, 1)$. Let the sorted PIPs be $p_{[1]} \ge p_{[2]} \ge \cdots \ge p_{[p+pq]}$. These ordered PIPs serve as candidate thresholds. If we select the top u effects, the Bayesian FDR corresponding to the threshold $\gamma = p_{[u]}$ becomes
$$\text{FDR}(p_{[u]}) = 1 - \frac{1}{u} \sum_{i=1}^{u} p_{[i]}.$$
Because the PIPs are sorted in decreasing order, adding an additional effect (moving from u to $u + 1$) reduces the average PIP among the selected effects, thereby increasing the expected false discovery proportion; consequently, $\text{FDR}(p_{[u]})$ increases monotonically with u. The adaptive procedure therefore identifies the largest set of effects for which the expected false discovery proportion does not exceed the desired level. Formally, we choose
$$u^* = \max\big\{u : \text{FDR}(p_{[u]}) \le c\big\}, \qquad \hat\gamma = p_{[u^*]}.$$
The resulting cut-off $\hat\gamma$ ensures that the expected proportion of false discoveries among the selected main and interaction effects is controlled at level c. In this way, the threshold is determined directly from the posterior evidence and adapts automatically to the sparsity level and signal strength in the data.
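The adaptive thresholding step can be implemented in a few lines. The sketch below (with made-up PIP values for illustration; not the authors' code) sorts the PIPs in decreasing order, computes $\text{FDR}(p_{[u]})$ cumulatively, and returns the largest qualifying selected set.

```python
import numpy as np

# Sketch: adaptive PIP threshold under a target Bayesian FDR level c.
# In practice `pips` would be computed from the MCMC inclusion
# indicators; here they are illustrative values.
def fdr_threshold(pips, c=0.05):
    """Return (threshold, selected indices) such that the Bayesian
    FDR of the selected set does not exceed c."""
    order = np.argsort(pips)[::-1]          # indices, PIPs decreasing
    sorted_pips = pips[order]
    # FDR(p_[u]) = 1 - mean of the top-u PIPs
    fdr = 1.0 - np.cumsum(sorted_pips) / np.arange(1, len(pips) + 1)
    ok = np.where(fdr <= c)[0]
    if ok.size == 0:                        # nothing passes the level
        return None, np.array([], dtype=int)
    u_star = ok.max() + 1                   # largest qualifying set size
    return sorted_pips[u_star - 1], order[:u_star]

pips = np.array([0.99, 0.97, 0.95, 0.60, 0.30, 0.05])
gamma_hat, selected = fdr_threshold(pips, c=0.05)
print(gamma_hat, sorted(selected))
```

For the illustrative PIPs, the top three effects have an average PIP of 0.97, so their Bayesian FDR is 0.03; adding the fourth effect pushes it above 0.05, and the procedure stops at $u^* = 3$.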

2.5. Gibbs Sampling Algorithm

Based on the priors elicited in the section above, we can write the joint density as follows:
$$\begin{aligned}
&\pi(\beta_0, \beta_1, \beta_2, \beta_3, r_{1m}, r_{3s}, \eta_1^2, \eta_2^2, \pi_1, \pi_2, \tau \mid Y_{ij}) \\
&\propto \tau^{\frac{nk}{2}} \prod_{i=1}^{n} \prod_{j=1}^{k} \tilde v_{ij}^{-\frac12} \times \exp\Big\{-\frac{\tau}{2\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \big(Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i\big)^2 / \tilde v_{ij}\Big\} \\
&\quad \times \exp\Big\{-\frac12 \beta_0^\top \Sigma_0^{-1} \beta_0\Big\} \times \exp\Big\{-\frac12 \beta_2^\top \Sigma_1^{-1} \beta_2\Big\} \\
&\quad \times \prod_{m=1}^{p} \Big[(1 - \pi_1)(2\pi r_{1m})^{-1/2} \exp\Big\{-\frac{\beta_{1m}^2}{2 r_{1m}}\Big\} I\{\beta_{1m} \neq 0\} + \pi_1 \delta_0(\beta_{1m})\Big] \\
&\quad \times \tau^{nk} \exp\Big\{-\tau \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}\Big\} \times \tau^{w_1 - 1} \exp(-w_2 \tau) \\
&\quad \times (\eta_1^2)^{c_1 - 1} \exp(-d_1 \eta_1^2) \times \prod_{m=1}^{p} \frac{\eta_1^2}{2} \exp\Big\{-\frac{\eta_1^2}{2} r_{1m}\Big\} \times \pi_1^{a_1 - 1} (1 - \pi_1)^{b_1 - 1} \\
&\quad \times \prod_{s=1}^{pq} \Big[(1 - \pi_2)(2\pi r_{3s})^{-1/2} \exp\Big\{-\frac{\beta_{3s}^2}{2 r_{3s}}\Big\} I\{\beta_{3s} \neq 0\} + \pi_2 \delta_0(\beta_{3s})\Big] \\
&\quad \times (\eta_2^2)^{c_2 - 1} \exp(-d_2 \eta_2^2) \times \prod_{s=1}^{pq} \frac{\eta_2^2}{2} \exp\Big\{-\frac{\eta_2^2}{2} r_{3s}\Big\} \times \pi_2^{a_2 - 1} (1 - \pi_2)^{b_2 - 1} \\
&\quad \times (\phi^2)^{-n} \prod_{i=1}^{n} \prod_{h=1}^{2} \exp\Big\{-\frac{\alpha_{ih}^2}{2\phi^2}\Big\} \times (\phi^2)^{-e_1 - 1} \exp(-e_2 / \phi^2).
\end{aligned}$$
We obtain the full conditional distribution as follows:
  • Define $\mu_{ij}^{(0)} = \mu_{ij} + U_{ij}^\top \alpha_i - T_{ij}^\top \beta_0$. Then, the full conditional distribution of $\beta_0$ is $\text{MVN}(\mu_{\beta_0}, \Sigma_{\beta_0})$ with mean
$$\mu_{\beta_0} = \Sigma_{\beta_0} \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} T_{ij} \big(Y_{ij} - \mu_{ij}^{(0)}\big)\Big)$$
and covariance
$$\Sigma_{\beta_0} = \Big(\Sigma_0^{-1} + \frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} T_{ij} T_{ij}^\top\Big)^{-1}.$$
  • Denote the partial residual after excluding the environmental main effects as $\mu_{ij}^{(2)} = \mu_{ij} + U_{ij}^\top \alpha_i - E_{ij}^\top \beta_2$. Then, the full conditional distribution of $\beta_2$ is $\text{MVN}(\mu_{\beta_2}, \Sigma_{\beta_2})$ with mean
$$\mu_{\beta_2} = \Sigma_{\beta_2} \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} E_{ij} \big(Y_{ij} - \mu_{ij}^{(2)}\big)\Big)$$
and covariance
$$\Sigma_{\beta_2} = \Big(\Sigma_1^{-1} + \frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} E_{ij} E_{ij}^\top\Big)^{-1}.$$
  • To obtain the full conditional distribution of $\beta_{1m}$, which represents the mth main genetic effect, we first define $\mu_{ij}^{(1m)} = \mu_{ij} + U_{ij}^\top \alpha_i - G_{ijm} \beta_{1m}$ and $l_{1m} = \pi(\beta_{1m} = 0 \mid \text{rest})$. Then, the full conditional distribution of $\beta_{1m}$ given the rest of the parameters is a spike-and-slab distribution:
$$\beta_{1m} \mid \text{rest} \sim (1 - l_{1m})\, N(\mu_{\beta_{1m}}, \Sigma_{\beta_{1m}}) + l_{1m}\, \delta_0(\beta_{1m}),$$
where
$$\mu_{\beta_{1m}} = \Sigma_{\beta_{1m}} \frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} G_{ijm} \big(Y_{ij} - \mu_{ij}^{(1m)}\big), \quad \Sigma_{\beta_{1m}} = \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} G_{ijm}^2 + \frac{1}{r_{1m}}\Big)^{-1},$$
and the posterior mixture proportion is
$$l_{1m} = \frac{\pi_1}{\pi_1 + (1 - \pi_1)\, (r_{1m})^{-\frac12} (\Sigma_{\beta_{1m}})^{\frac12} \exp\Big\{\frac12 \Sigma_{\beta_{1m}} \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} G_{ijm} (Y_{ij} - \mu_{ij}^{(1m)})\Big)^2\Big\}}.$$
The full conditional distribution of $\beta_{1m}$ is thus a mixture of a normal distribution and a point mass at 0. At each MCMC iteration, $\beta_{1m}$ is drawn from $N(\mu_{\beta_{1m}}, \Sigma_{\beta_{1m}})$ with probability $(1 - l_{1m})$ and set to 0 otherwise. If $\beta_{1m} = 0$, then $\phi_{1m} = 0$; otherwise, $\phi_{1m} = 1$.
  • We next present the full conditional distribution of $\beta_{3s}$, the effect size of $GE_{ijs}$, the interaction between the omics features and environmental factors ($s = 1, \ldots, pq$). With the partial residual $\mu_{ij}^{(3s)} = \mu_{ij} + U_{ij}^\top \alpha_i - GE_{ijs} \beta_{3s}$ and $l_{3s} = \pi(\beta_{3s} = 0 \mid \text{rest})$, the full conditional distribution of $\beta_{3s}$ can be expressed as
$$\beta_{3s} \mid \text{rest} \sim (1 - l_{3s})\, N(\mu_{\beta_{3s}}, \Sigma_{\beta_{3s}}) + l_{3s}\, \delta_0(\beta_{3s}),$$
where the mean and variance are
$$\mu_{\beta_{3s}} = \Sigma_{\beta_{3s}} \frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} GE_{ijs} \big(Y_{ij} - \mu_{ij}^{(3s)}\big), \quad \Sigma_{\beta_{3s}} = \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} GE_{ijs}^2 + \frac{1}{r_{3s}}\Big)^{-1},$$
respectively, and the posterior mixture proportion is
$$l_{3s} = \frac{\pi_2}{\pi_2 + (1 - \pi_2)\, (r_{3s})^{-\frac12} (\Sigma_{\beta_{3s}})^{\frac12} \exp\Big\{\frac12 \Sigma_{\beta_{3s}} \Big(\frac{\tau}{\xi^2} \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij}^{-1} GE_{ijs} (Y_{ij} - \mu_{ij}^{(3s)})\Big)^2\Big\}}.$$
  • The full conditional distributions of $(r_{1m})^{-1}$ and $(r_{3s})^{-1}$ are
$$(r_{1m})^{-1} \mid \text{rest} \sim \begin{cases} \text{Inverse-Gamma}\big(1, \frac{\eta_1^2}{2}\big) & \text{if } \beta_{1m} = 0, \\ \text{Inverse-Gaussian}\big(\sqrt{\eta_1^2 / \beta_{1m}^2},\ \eta_1^2\big) & \text{if } \beta_{1m} \neq 0, \end{cases}$$
and
$$(r_{3s})^{-1} \mid \text{rest} \sim \begin{cases} \text{Inverse-Gamma}\big(1, \frac{\eta_2^2}{2}\big) & \text{if } \beta_{3s} = 0, \\ \text{Inverse-Gaussian}\big(\sqrt{\eta_2^2 / \beta_{3s}^2},\ \eta_2^2\big) & \text{if } \beta_{3s} \neq 0. \end{cases}$$
  • The full conditional distributions of $\eta_1^2$ and $\eta_2^2$ are
$$\eta_1^2 \mid \text{rest} \sim \text{Gamma}\Big(c_1 + p,\ d_1 + \sum_{m=1}^{p} \frac{r_{1m}}{2}\Big)$$
and
$$\eta_2^2 \mid \text{rest} \sim \text{Gamma}\Big(c_2 + pq,\ d_2 + \sum_{s=1}^{pq} \frac{r_{3s}}{2}\Big).$$
  • The full conditional distributions of $\pi_1$ and $\pi_2$ are
$$\pi_1 \mid \text{rest} \sim \text{Beta}\Big(a_1 + \sum_{m=1}^{p} I\{\beta_{1m} = 0\},\ b_1 + \sum_{m=1}^{p} I\{\beta_{1m} \neq 0\}\Big)$$
and
$$\pi_2 \mid \text{rest} \sim \text{Beta}\Big(a_2 + \sum_{s=1}^{pq} I\{\beta_{3s} = 0\},\ b_2 + \sum_{s=1}^{pq} I\{\beta_{3s} \neq 0\}\Big).$$
  • The full conditional distribution of $\tilde v_{ij}^{-1}$ is
$$\tilde v_{ij}^{-1} \mid \text{rest} \sim \text{Inverse-Gaussian}\Big(\sqrt{\frac{2\xi^2}{(Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i)^2}},\ 2\tau\Big).$$
  • The full conditional distribution of $\tau$ is
$$\tau \mid \text{rest} \sim \text{Gamma}\Big(w_1 + \frac{3nk}{2},\ w_2 + \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde v_{ij} + \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{\tilde v_{ij}^{-1} (Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i)^2}{2\xi^2}\Big).$$
  • The full conditional distribution of the random effects $\alpha_i$ is
$$\alpha_i \mid \text{rest} \sim \text{MVN}(\mu_{\alpha_i}, \Sigma_{\alpha_i}),$$
where
$$\mu_{\alpha_i} = \frac{\tau}{\xi^2} \Sigma_{\alpha_i} \sum_{j=1}^{k} \frac{U_{ij} (Y_{ij} - \mu_{ij})}{\tilde v_{ij}} \quad \text{and} \quad \Sigma_{\alpha_i} = \Big(\frac{\tau}{\xi^2} \sum_{j=1}^{k} \frac{U_{ij} U_{ij}^\top}{\tilde v_{ij}} + \frac{1}{\phi^2} I\Big)^{-1}.$$
  • Finally, the full conditional distribution of $\phi^2$ is
$$\phi^2 \mid \text{rest} \sim \text{Inverse-Gamma}\Big(n + e_1,\ e_2 + \frac12 \sum_{i=1}^{n} \alpha_i^\top \alpha_i\Big).$$
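As an illustration of how the spike-and-slab conditionals are used inside the sampler, the sketch below performs one update of a single main effect $\beta_{1m}$. The inputs (`b` for the weighted cross-product sum, `precision_data` for the weighted sum of squares) are hypothetical summary statistics standing in for the double sums above; this is not the authors' implementation.

```python
import numpy as np

# Sketch: one Gibbs update of beta_1m from its spike-and-slab full
# conditional. Here
#   b              stands in for (tau/xi^2) * sum_ij v_ij^{-1} G_ijm (Y_ij - mu_ij^(1m)),
#   precision_data stands in for (tau/xi^2) * sum_ij v_ij^{-1} G_ijm^2.
def update_beta_1m(rng, b, precision_data, r_1m, pi1):
    Sigma = 1.0 / (precision_data + 1.0 / r_1m)   # slab variance
    mu = Sigma * b                                # slab mean
    # posterior spike probability l_1m
    slab_weight = (1 - pi1) * np.sqrt(Sigma / r_1m) * np.exp(0.5 * Sigma * b**2)
    l_1m = pi1 / (pi1 + slab_weight)
    if rng.random() < l_1m:
        return 0.0                                # spike: set beta to 0
    return rng.normal(mu, np.sqrt(Sigma))         # slab: normal draw

rng = np.random.default_rng(3)
beta = update_beta_1m(rng, b=25.0, precision_data=10.0, r_1m=1.0, pi1=0.5)
print(beta)
```

With a strong illustrative signal (`b = 25`), the slab weight dominates and the update returns a draw near the conditional mean $\Sigma_{\beta_{1m}} b$; with `b` near 0, the spike probability grows and the coefficient is frequently zeroed out.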
The Gibbs sampler, as a Markov chain Monte Carlo (MCMC) method, generates samples from the posterior distribution of a set of parameters by iteratively sampling from the full conditional distribution of each parameter given the current values of the others. We summarize the Gibbs sampling procedure in Algorithm 1 based on the derived full conditional distributions. The algorithm begins by initializing all parameters and then proceeds through a sequence of steps where, at each iteration, each parameter is sampled from its full conditional distribution, conditioned on the most recent values of the other parameters. This process is repeated for a large number of iterations (e.g., 10,000 in our numerical studies). The rationale behind Gibbs sampling is that, although the joint posterior distribution of all the parameters may be complex, the full conditional distributions are easy to sample from directly. By repeating this process for many iterations, the algorithm generates posterior samples that can be used to approximate the joint posterior distribution of the model parameters.
Algorithm 1. Gibbs Sampler for Bayesian Inference
  • Initialize: Set initial values for β 0 , β 1 , β 2 , β 3 , r 1 m , r 3 s , η 1 2 , η 2 2 , π 1 , π 2 , τ .
  • Set iteration counter t = 0 .
  • While $t < 10{,}000$ do:
    (3.1) Sample $\beta_0 \sim \text{MVN}(\mu_{\beta_0}, \Sigma_{\beta_0})$.
    (3.2) Sample $\beta_2 \sim \text{MVN}(\mu_{\beta_2}, \Sigma_{\beta_2})$.
    (3.3) Sample $\beta_{1m} \sim (1 - l_{1m}) N(\mu_{\beta_{1m}}, \Sigma_{\beta_{1m}}) + l_{1m} \delta_0(\beta_{1m})$.
    (3.4) Sample $\beta_{3s} \sim (1 - l_{3s}) N(\mu_{\beta_{3s}}, \Sigma_{\beta_{3s}}) + l_{3s} \delta_0(\beta_{3s})$.
    (3.5) Sample $(r_{1m})^{-1} \sim \text{Inverse-Gamma}\big(1, \frac{\eta_1^2}{2}\big)$ if $\beta_{1m} = 0$, or $\text{Inverse-Gaussian}\big(\sqrt{\eta_1^2 / \beta_{1m}^2}, \eta_1^2\big)$ otherwise.
    (3.6) Sample $(r_{3s})^{-1} \sim \text{Inverse-Gamma}\big(1, \frac{\eta_2^2}{2}\big)$ if $\beta_{3s} = 0$, or $\text{Inverse-Gaussian}\big(\sqrt{\eta_2^2 / \beta_{3s}^2}, \eta_2^2\big)$ otherwise.
    (3.7) Sample $\eta_1^2 \sim \text{Gamma}\big(c_1 + p, d_1 + \sum_{m=1}^{p} \frac{r_{1m}}{2}\big)$.
    (3.8) Sample $\eta_2^2 \sim \text{Gamma}\big(c_2 + pq, d_2 + \sum_{s=1}^{pq} \frac{r_{3s}}{2}\big)$.
    (3.9) Sample $\pi_1 \sim \text{Beta}\big(a_1 + \sum_{m=1}^{p} I\{\beta_{1m} = 0\}, b_1 + \sum_{m=1}^{p} I\{\beta_{1m} \neq 0\}\big)$.
    (3.10) Sample $\pi_2 \sim \text{Beta}\big(a_2 + \sum_{s=1}^{pq} I\{\beta_{3s} = 0\}, b_2 + \sum_{s=1}^{pq} I\{\beta_{3s} \neq 0\}\big)$.
    (3.11) Sample $\tilde v_{ij}^{-1} \sim \text{Inverse-Gaussian}\big(\sqrt{2\xi^2 / (Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i)^2}, 2\tau\big)$.
    (3.12) Sample $\tau \sim \text{Gamma}\big(w_1 + \frac{3nk}{2}, w_2 + \sum_{i,j} \tilde v_{ij} + \sum_{i,j} \frac{\tilde v_{ij}^{-1} (Y_{ij} - \mu_{ij} - U_{ij}^\top \alpha_i)^2}{2\xi^2}\big)$.
    (3.13) Sample $\alpha_i \sim \text{MVN}(\mu_{\alpha_i}, \Sigma_{\alpha_i})$.
    (3.14) Sample $\phi^2 \sim \text{Inverse-Gamma}\big(n + e_1, e_2 + \frac12 \sum_{i=1}^{n} \alpha_i^\top \alpha_i\big)$.
  • Increment t = t + 1 .
Output: Posterior samples for β 0 , β 1 , β 2 , β 3 , r 1 m , r 3 s , η 1 2 , η 2 2 , π 1 , π 2 , τ , α i , ϕ 2 .

3. Results

3.1. Simulation

We assess the performance of the FDR-assisted robust Bayesian linear mixed model with spike-and-slab priors (RBLSS) against three alternative methods. A direct competitor is its non-robust counterpart, the FDR-assisted Bayesian linear mixed model with spike-and-slab priors, termed BLSS. We also consider a robust Bayesian linear mixed model without spike-and-slab priors (RBL), in which Laplacian shrinkage priors are placed on the high-dimensional genetic main effects and interactions. The non-robust variant of RBL is termed BL.
The mixed model (1) has been adopted to generate longitudinal phenotypes with sample size n = 500 and three time points (i.e., k = 3). The number of genetic factors is p = 50 and 100, with q = 3 environmental factors, leading to a total dimension of main genetic effects and interactions of p + pq = 200 and 400, respectively. Environmental factors are simulated from multivariate normal distributions with marginal mean 0, marginal variance 1, and an AR(1) correlation with ρ = 0.5. The non-zero fixed effects in model (1), consisting of those not subject to selection (i.e., $\beta_0$ and $\beta_2$) as well as those corresponding to important main and interaction effects in $\beta_1$ and $\beta_3$, are generated from a uniform distribution Unif[0.4, 0.8]. A total of 8 main G effects and 12 G × E interactions are assumed to be associated with the longitudinal response. Under the random intercept-and-slope model (1), each component of the random effects is simulated from a standard normal distribution. The random errors are generated from the following distributions: (1) N(0, 1) (Error 1); (2) a t-distribution with 2 degrees of freedom, t(2) (Error 2); (3) LogNormal(0, 2) (Error 3); and (4) Laplace(0, $\sqrt{2}$) (Error 4). Except for N(0, 1), all are heavy-tailed model errors.
In addition, we simulate genetic factors according to the three settings below. In Setting 1, continuous genetic factors are generated from the same multivariate Gaussian distribution adopted for the environmental factors: an AR-1 structure governs the gene correlation, under which genes j and k have correlation ρ | j − k | with ρ = 0.5, and the variance is 1. Since single nucleotide polymorphism (SNP) data are categorical, we further consider two more realistic settings to generate high-dimensional genetic factors.
Specifically, in Setting 2, we mimic the categorical nature of SNP data by discretizing the gene expression values at the 1/3 and 2/3 quantiles, with the 3 levels (0, 1, 2) corresponding to genotypes (0 = aa, 1 = Aa, and 2 = AA). In Setting 3, SNP data are simulated based on a pairwise linkage disequilibrium (LD) pattern. Given the LD and the minor allele frequencies (MAFs) of two adjacent SNPs, such as SNP 1 and SNP 2, one can compute the four haplotype frequencies, which lead to a conditional genotype probability matrix of SNP 2 given SNP 1. SNP genotypes can then be simulated sequentially from the conditional distribution of the current genotype given the previous one. Please refer to Ren et al. (2023) [30] for more details about the data-generating procedure.
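The Setting 2 discretization can be sketched as below; `to_genotypes` is a hypothetical helper that cuts each continuous gene at its empirical 1/3 and 2/3 quantiles.

```python
import numpy as np

def to_genotypes(x):
    """Map a continuous gene vector to genotypes 0/1/2 (aa/Aa/AA)
    using the empirical 1/3 and 2/3 quantiles as cut points."""
    q1, q2 = np.quantile(x, [1 / 3, 2 / 3])
    # np.digitize returns 0 below q1, 1 between q1 and q2, 2 at or above q2
    return np.digitize(x, [q1, q2])

rng = np.random.default_rng(2)
x = rng.standard_normal(600)   # one simulated continuous gene
g = to_genotypes(x)            # roughly 200 observations per genotype level
```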
For the two methods with spike-and-slab priors, we apply the Bayesian FDR procedure described in Section 2.4 with a global FDR of 0.05 after model fitting. For the methods without spike-and-slab priors, 95% marginal credible intervals are used to identify important findings. Variable selection accuracy is assessed by the number of true positives (TPs), false positives (FPs), the F1 score, and the Matthews correlation coefficient (MCC), defined as F 1 = 2 T P / ( 2 T P + F P + F N ) and MCC = ( T P · T N − F P · F N ) / √ ( ( T P + F P ) ( T P + F N ) ( T N + F P ) ( T N + F N ) ), respectively. In terms of estimation accuracy, we compute the ℓ 1 distance between the estimated and true fixed effects among the selected main and interaction effects.
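The two selection metrics can be computed directly from the selected and true index sets; `selection_metrics` below is a hypothetical helper, not part of the authors' software.

```python
import math

def selection_metrics(selected, truth, total):
    """F1 score and MCC for a variable selection result over `total` candidates."""
    sel, tru = set(selected), set(truth)
    tp = len(sel & tru)
    fp = len(sel - tru)
    fn = len(tru - sel)
    tn = total - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return f1, mcc

# e.g., 2 true hits, 1 false positive, 1 miss among 10 candidate effects
f1, mcc = selection_metrics({0, 1, 2}, {0, 1, 3}, total=10)
```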
Table 1 shows the performance in terms of the above metrics under Setting 1 across the four error distributions. With 50 genetic factors and 3 environmental variables, there are a total of 200 main and interaction effects. Under the N ( 0 , 1 ) model error, the two FDR-assisted methods, RBLSS and BLSS, are the top-performing approaches, with BLSS showing a slight advantage. Specifically, BLSS achieves a lower average FP of 0.80 (sd 0.86) compared to 1.36 (sd 1.12) for RBLSS, while maintaining a similar number of true positives. BLSS also attains the highest F1 score and Matthews correlation coefficient (MCC), both approximately 0.96, outperforming the other methods, particularly RBL and BL. Additionally, it yields the lowest estimation error of 2.10 (sd 0.53). In the presence of heavy-tailed model errors, Table 1 indicates that the advantage shifts to RBLSS. For example, under t(2) error, RBLSS has the highest F1 score of 0.95 (sd 0.03) and MCC of 0.94 (sd 0.04), as well as the lowest estimation error, 2.68 (sd 0.61). In terms of variable selection, RBL is the second-best method, achieving both an F1 score and MCC of around 0.90. However, it yields a much larger estimation error of 15.41 (sd 0.99). When the number of main genetic effects increases to 100 under Setting 1, Table 2 shows that the pattern remains consistent with that in Table 1. The FDR-assisted RBLSS outperforms the other three methods in both variable selection and estimation.
Table 1 shows the following patterns that warrant more thorough discussion. First, under N ( 0 , 1 ) error, the non-robust BLSS is comparable to, or even slightly outperforms, the robust RBLSS, provided that the variability in the point estimates of the F1 score, MCC, and estimation error is set aside. In general, the suboptimal performance of robust statistical methods under Gaussian model errors is well documented, as these methods penalize or downweight extreme observations that are infrequent in such cases [12]. This unnecessary downweighting of observations decreases efficiency, reflecting the trade-off between robustness and efficiency.
Table 1 also shows that, under Laplace(0, √2) error, the advantage of RBLSS over BLSS diminishes compared to the case with t ( 2 ) error. The main reason is that the Laplace(0, √2) distribution generates fewer extreme observations than the t ( 2 ) distribution and behaves more similarly to the N ( 0 , 1 ) distribution. Specifically, the Laplace distribution has a sharp peak and exponentially decaying tails, making it concentrated around the mean, similar to N ( 0 , 1 ) , and less likely to produce extreme values than the t ( 2 ) distribution, which has a wider, flatter peak and polynomially decaying tails.
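This tail comparison can be made concrete with closed-form two-sided tail probabilities. The t(2) survival function has the closed form P(T > x) = (1/2)(1 − x/√(x² + 2)); the Laplace scale √2 is an assumption matching the error label.

```python
import math

def two_sided_tail(x):
    """P(|X| > x) for N(0,1), t(2), and Laplace(0, sqrt(2))."""
    normal = math.erfc(x / math.sqrt(2.0))    # Gaussian tail
    t2 = 1.0 - x / math.sqrt(x * x + 2.0)     # polynomial decay
    laplace = math.exp(-x / math.sqrt(2.0))   # exponential decay
    return normal, t2, laplace

# Far in the tails, the polynomially decaying t(2) dominates both:
nrm, t2, lap = two_sided_tail(10.0)           # ~1.5e-23, ~9.9e-3, ~8.5e-4
```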
Furthermore, Table 1 and additional numerical results demonstrate that regardless of model errors, RBL and BL, the methods utilizing Laplacian shrinkage priors instead of spike-and-slab priors for shrinkage estimation, yield consistently higher estimation error. This is because Laplacian shrinkage priors cannot shrink regression coefficients to exactly zero. As a result, for a sparse regression coefficient vector, they produce much larger estimation errors (and false positive signals) compared to RBLSS and BLSS, which can achieve exact zeros using spike-and-slab priors. Such a phenomenon suggests that, in certain high-dimensional settings, the choice of an aggressive sparsity-inducing prior can outweigh the impact of selecting a robust likelihood in terms of model performance.
We have also visualized the results in Table 1 and Table 2 by generating box-plots of the F1 score, MCC, and estimation error across 100 replicates in Figure 1, Figure 2 and Figure 3, respectively. In the box-plots, the central line represents the median, the box denotes the interquartile range, and the whiskers and points indicate variability and potential outliers. Figure 2 and Figure 3 demonstrate that, for the non-robust BLSS and BL, variable selection accuracy in terms of the F1 score and MCC deteriorates dramatically under t(2) error, which is more heavy-tailed than the other three model errors. Although the two non-robust methods show slight improvement under logN ( 0 , 1 ) (Error 3) and Laplace ( 0 , √2 ) (Error 4), which are less heavy-tailed than t(2), the advantage of the robust methods, especially RBLSS, is still evident. Comparing RBL and BLSS, Figure 1, Figure 2 and Figure 3 reveal an interesting phenomenon: under Laplace ( 0 , √2 ) (Error 4) and p = 100, the non-robust BLSS can even dominate the robust RBL in both identification and estimation. In fact, even under heavy-tailed model errors, it is not surprising to observe better estimation performance from non-robust spike-and-slab models than from robust models without spike-and-slab priors, as the former induce exact sparsity in posterior estimates and are more consistent with the sparse nature of high-dimensional omics data. It is notable, however, that BLSS also holds an advantage over RBL in variable selection.
Setting 2 differs from Setting 1 in that all high-dimensional omics features are categorical, making it more similar in structure to SNP data. Results under Setting 2 are shown in Table A1 and Table A2, with Figure A1, Figure A2 and Figure A3 in Appendix A. The superior performance of RBLSS over the alternatives in variable selection and estimation under heavy-tailed model errors remains consistent. The previously noted pattern reappears in the comparison between BLSS and RBL: as indicated in Table A2 and Figure A1, Figure A2 and Figure A3, the non-robust BLSS dominates the robust RBL in both identification and shrinkage estimation under heavy-tailed model errors such as logN(0, 1) and Laplace(0, √2). In Appendix A, Table A3 and Table A4 demonstrate the performance metrics evaluated under Setting 3, where LD correlations have been adopted to simulate SNP data. The same conclusion about the superiority of RBLSS can be drawn.
We perform additional analyses to illustrate that the proposed method achieves satisfactory finite-sample FDR control. In our study, we follow Müller et al. (2007) [58] to define the Bayesian FDR, as shown in Equation (4) in Section 2.4. At the 0.05 FDR level, for all three simulation settings, we have computed the Bayesian FDR of RBLSS and BLSS using Equation (4) over 100 replicates and generated the corresponding violin plots, shown in Figure 4, as well as Figure A4 and Figure A5 in Appendix A, respectively. These plots indicate that finite-sample FDR control has been achieved. Moreover, a sensitivity analysis has been conducted to show that RBLSS is insensitive to different choices of hyperparameters. Please refer to Table A5 and Table A6 in Appendix A for more details.

3.2. Case Study

We analyze the longitudinal lipidomics data generated from weight-controlled CD-1 mice in a repeated-measures one-way ANOVA study [52]. The phenotype of interest is the weight of 60 female CD-1 mice measured weekly over 11 weeks. King et al. (2015) [52] profiled 176 plasma neutral lipid species, focusing on diacylglycerols. After excluding species with measurements below detection limits in most samples, the dataset contains 31 lipid features that are used as genetic factors. The environmental factor is the treatment regarding exercise and/or dietary restriction, with the following four treatment groups: control (sedentary with ad libitum feeding), AE (treadmill exercise with ad libitum feeding), PE (treadmill exercise with pair feeding), and DCR (sedentary with 20% dietary calorie restriction). With 31 lipid (or G) factors and 1 environmental factor, there are an additional 31 G × E interactions. Our goal is to identify top-ranked main and interaction effects with FDR control. A comparison based on the Bayesian Information Criterion (BIC), excluding lipid features, has shown that the random intercept-and-slope model (BIC = 2496.4) outperforms the random intercept model (BIC = 2822.7). Accordingly, the random intercept-and-slope model has been adopted for the follow-up analysis.
Here, we apply the FDR-assisted RBLSS and BLSS to detect important effects. Table 3 shows the top-ranked main and interaction effects identified by RBLSS, with their posterior inclusion probabilities (PIPs), at an FDR of α = 0.05. In total, there are two main effects and four interaction effects. The notation for a phospholipid is interpreted as follows: for example, C18:2/20:4 refers to a phospholipid consisting of two fatty acid chains, one with 18 carbons and two double bonds, and the other with 20 carbons and four double bonds.
Table 3 provides additional insight regarding the findings reported in Fan et al. [53], where the four treatment groups, control, AE, PE, and DCR, are modeled as three binary indicators and lipid-by-treatment interactions are represented as group-level interactions. In contrast, in this study, treatment is modeled as a single covariate, with G × E interactions represented by individual predictors. This modeling strategy facilitates FDR-assisted variable selection, as both main and interaction effects are treated equally as individual predictors in our model. Due to the difference in modeling the environmental factor, we do not expect the findings of the two studies to be identical. However, we believe that main or interaction effects overlapping between the two studies warrant particular attention, as they are strongly associated with the longitudinal response regardless of whether treatment is modeled with group indicators or as a single covariate. The interaction between C18:2/20:4 and the environmental factor has a PIP of 0.9248, which is consistent with the group-level modeling of treatment in Fan et al. (2025) [53], where a group-level G × E interaction involving C18:2/20:4 has also been reported using the median thresholding rule [45].
The phospholipids identified by RBLSS with high posterior inclusion probabilities for main effects, C18:2/16:1 (PIP = 0.9597) and C18:2/18:1 (PIP = 0.9792), are broadly involved in lipid metabolism, membrane composition, and energy regulation. These processes are expected to vary across the four treatment groups in our study. Phospholipids containing linoleic acid (C18:2) and related acyl chains are linked to pathways responsive to changes in energy intake and physical activity, including shifts in membrane remodeling and metabolic signaling [59,60]. Thus, the fact that RBLSS selects C18:2/16:1 and C18:2/18:1 as the top main effects is consistent with broad biological expectations about lipids that differentiate overall metabolic states across these treatment conditions.
In addition, RBLSS identifies several lipid–treatment interaction effects with large PIPs, including C22:7/16:0 (0.9087), C16:0/18:1 (0.9501), C18:3/18:1 (1.0000), and C18:2/20:4 (0.9248). These interactions indicate that the associations between these lipid species and body weight differ across the four groups rather than being constant shifts. Long-chain and polyunsaturated species, such as those involving C18:3, C20:4, and C22:7, are known to participate in adaptive responses to exercise and caloric restriction, while saturated and monounsaturated chains like 16:0 and 18:1 are central to membrane and energy-storage dynamics [59,61]. While a full mechanistic explanation is beyond the scope of this paper, the fact that RBLSS detects plausible lipid main effects and lipid–treatment interactions under realistic experimental conditions indicates that the method effectively captures biologically meaningful structure in high-dimensional lipidomics data.
We have listed the significant findings obtained by the non-robust BLSS at the same FDR level of α = 0.05 in Table 4. In total, there are one important main effect (C20:6/16:0) and two important G × E interactions, involving phospholipids C18:3/18:1 and C18:2/20:4, respectively. We notice that the non-robust method pinpoints fewer significant effects than the robust RBLSS. This pattern aligns with our simulation studies, where BLSS loses power relative to RBLSS and identifies fewer true signals in the presence of heavy-tailed errors.

4. Discussion

The pros and cons of hypothesis testing and variable selection, the two major frameworks for G × E analysis, carry over from cross-sectional to longitudinal studies. While penalized variable selection methods are generally more flexible in handling defining characteristics of G × E, such as hierarchical structures between main and interaction effects [20,62], resilience to outliers through robustification, and diverse forms of environmental factors (and thus G × E interactions) [4], statistical test-based approaches excel at providing rigorous inference, typically in terms of p-values, such as those in GWAS [32,63]. In this article, we propose to apply an FDR-assisted procedure with a robust Bayesian linear mixed model to prioritize important main and interaction effects in the presence of longitudinal traits with outliers, thereby offering a new approach to improve the rigor of variable selection-based interaction analysis. This method builds on previous work exploring FDR and Bayesian multiple comparison rules [58] and related studies [48,51]. Our numerical studies show that it is promising to couple FDR-assisted procedures with robust Bayesian interaction analysis using mixed-effect models.
Our study can be extended in the following aspects. First, the proposed interaction model respects a weak hierarchical structure without imposing shrinkage estimation on environmental effects [20,21]. To impose strong hierarchy, we conjecture that the strategy implemented or discussed in [20,30] can be generalized to sparse Bayesian mixed-effect models. Such a strategy applies heavier penalization to interaction effects to ensure that, whenever an interaction is selected, the corresponding genetic main effects are also included, thereby preserving strong hierarchy. Second, the current model assumes independence among main and interaction effects. It is intriguing to explore its performance in GWAS-like settings where genetic features are highly correlated due to linkage disequilibrium. Benjamini and Yekutieli (2001) [64] suggest that FDR-based approaches lead to inferior inference when dependence among the data is ignored. To better accommodate correlations among genetic predictors, we conjecture that conducting Bayesian FDR analysis after incorporating network information or Markov random field priors will further improve model performance with correlated features [65]. Last but not least, our proposal suffers from the typical disadvantage of fully Bayesian analysis: it is computationally intensive and not scalable to genome-scale applications. Variational Bayesian methods, among others, can substantially accelerate computation and make the approach more suitable for large-scale genomics studies.

Author Contributions

Conceptualization, X.L., K.F. and C.W.; methodology, X.L., K.F. and C.W.; software, X.L. and K.F.; validation, X.L. and K.F.; formal analysis, X.L. and K.F.; investigation, X.L., K.F. and C.W.; resources, K.F. and C.W.; data curation, X.L., K.F. and C.W.; writing—original draft preparation, X.L. and C.W.; writing—review and editing, X.L., K.F. and C.W.; visualization, X.L.; supervision, C.W.; project administration, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by an Innovative Research Award from the Johnson Cancer Research Center at Kansas State University.

Institutional Review Board Statement

This is a secondary data analysis of mice lipids data obtained from a published study. The use or access of this dataset does not require Institutional Review Board (IRB) approval.

Data Availability Statement

The lipidomics data analyzed in the case study are available upon request from the corresponding author.

Acknowledgments

We thank the editor and the four anonymous reviewers for their careful review and constructive comments, which have significantly improved this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANOVA: Analysis of Variance
FDR: False Discovery Rate
GWAS: Genome-wide association studies
MCMC: Markov chain Monte Carlo
PIP: Posterior inclusion probability

Appendix A

Appendix A.1. Additional Numerical Results

Table A1. Assessment under Setting 2 with ( n , p , q , k ) = ( 400 , 50 , 3 , 3 ) and 100 replicates.
Error  Method  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) RBLSS19.78 (0.46)1.52 (1.23)0.96 (0.03)0.95 (0.03)2.01 (0.55)
RBL19.48 (0.68)3.30 (1.84)0.91 (0.04)0.90 (0.05)14.14 (1.05)
BLSS19.88 (0.33)0.94 (0.84)0.97 (0.02)0.97 (0.02)1.72 (0.41)
BL19.66 (0.56)1.96 (1.68)0.95 (0.04)0.94 (0.04)13.35 (1.00)
t ( 2 ) RBLSS17.52 (1.37)0.62 (0.81)0.92 (0.05)0.91 (0.05)2.98 (0.78)
RBL16.48 (1.54)1.04 (1.01)0.88 (0.05)0.87 (0.05)14.56 (1.06)
BLSS12.12 (3.33)0.28 (0.54)0.73 (0.14)0.75 (0.12)5.22 (1.76)
BL12.76 (2.43)3.04 (2.51)0.71 (0.11)0.69 (0.12)23.70 (6.15)
l o g N ( 0 , 1 ) RBLSS19.36 (0.83)0.86 (0.81)0.96 (0.03)0.96 (0.04)2.22 (0.65)
RBL18.56 (0.97)2.10 (1.46)0.91 (0.04)0.90 (0.04)14.13 (1.09)
BLSS16.96 (1.46)0.44 (0.73)0.91 (0.05)0.90 (0.05)3.56 (1.01)
BL16.40 (1.59)2.66 (1.89)0.84 (0.06)0.82 (0.07)20.08 (2.44)
L a p l a c e ( 0 , 2 )RBLSS17.92 (1.23)1.10 (0.89)0.92 (0.05)0.91 (0.05)3.18 (0.77)
RBL16.66 (1.36)1.96 (1.28)0.86 (0.05)0.85 (0.06)15.82 (1.02)
BLSS17.16 (1.33)0.72 (0.57)0.90 (0.04)0.90 (0.05)3.31 (0.72)
BL16.42 (1.16)2.62 (1.72)0.84 (0.05)0.83 (0.06)19.22 (1.39)
Table A2. Assessment under Setting 2 with ( n , p , q , k ) = ( 400 , 100 , 3 , 3 ) and 100 replicates.
Error  Method  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) RBLSS19.40 (0.67)1.84 (1.58)0.94 (0.04)0.94 (0.04)2.29 (0.65)
RBL16.88 (1.19)2.20 (1.47)0.86 (0.04)0.86 (0.05)23.05 (1.42)
BLSS19.48 (0.65)0.76 (0.82)0.97 (0.03)0.97 (0.03)1.91 (0.47)
BL17.48 (1.11)1.06 (1.11)0.91 (0.04)0.90 (0.04)20.96 (1.34)
t ( 2 ) RBLSS18.16 (1.13)0.70 (0.81)0.93 (0.04)0.93 (0.04)2.84 (0.85)
RBL15.06 (1.68)1.00 (0.97)0.83 (0.06)0.83 (0.06)24.32 (1.44)
BLSS11.90 (4.57)0.14 (0.40)0.71 (0.23)0.75 (0.17)5.65 (2.60)
BL8.80 (3.57)1.46 (1.85)0.56 (0.20)0.60 (0.17)42.45(17.73)
l o g N ( 0 , 1 ) RBLSS19.34 (0.66)0.94 (0.93)0.96 (0.03)0.96 (0.03)2.21 (0.55)
RBL17.06 (1.38)1.62 (1.24)0.88 (0.05)0.88 (0.05)24.31 (1.57)
BLSS17.66 (1.39)0.66 (0.75)0.92 (0.05)0.92 (0.05)3.47 (1.01)
BL13.64 (1.85)1.88 (2.13)0.77 (0.08)0.77 (0.08)34.34 (4.74)
L a p l a c e ( 0 , 2 )RBLSS17.04 (1.62)1.10 (1.05)0.89 (0.06)0.89 (0.06)3.50 (0.98)
RBL14.18 (1.47)1.34 (1.14)0.80 (0.05)0.80 (0.05)25.21 (1.52)
BLSS15.54 (1.79)0.48 (0.79)0.86 (0.06)0.86 (0.06)3.87 (0.85)
BL12.46 (1.82)1.32 (1.11)0.74 (0.08)0.74 (0.08)32.00 (2.15)
Figure A1. F1 scores under Setting 2 with p = 50 (upper panel) and p = 100 (lower panel), corresponding to Table A1 and Table A2, respectively. Errors 1–4 correspond to N ( 0 , 1 ) , t(2), LogNormal(0, 2), and Laplace(0, 2 ), respectively.
Figure A2. MCC under Setting 2 with p = 50 (upper panel) and p = 100 (lower panel), corresponding to Table A1 and Table A2, respectively. Errors 1–4 correspond to N ( 0 , 1 ) , t(2), LogNormal(0, 2), and Laplace(0, 2 ), respectively.
Figure A3. Estimation error under Setting 2 with p = 50 (upper panel) and p = 100 (lower panel), corresponding to Table A1 and Table A2, respectively. Errors 1–4 correspond to N ( 0 , 1 ) , t(2), LogNormal(0, 2), and Laplace(0, 2 ), respectively.
Table A3. Assessment under Setting 3 with ( n , p , q , k ) = ( 400 , 50 , 3 , 3 ) and 100 replicates.
Error  Method  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) RBLSS19.56 (0.73)1.76 (0.85)0.95 (0.03)0.94 (0.04)2.30 (0.60)
RBL19.00 (1.07)3.68 (1.99)0.89 (0.06)0.88 (0.06)14.39 (1.07)
BLSS19.84 (0.37)0.94 (0.74)0.97 (0.02)0.97 (0.02)1.90 (0.45)
BL19.02 (0.98)2.00 (1.32)0.93 (0.04)0.92 (0.04)13.54 (0.88)
t ( 2 ) RBLSS17.28 (1.69)0.70 (0.79)0.91 (0.06)0.90 (0.06)3.28 (0.88)
RBL15.88 (1.73)1.42 (1.18)0.85 (0.06)0.84 (0.06)15.22 (1.25)
BLSS11.22 (4.16)0.28 (0.50)0.69 (0.21)0.73 (0.13)5.90 (1.85)
BL11.48 (3.56)2.66 (2.87)0.66 (0.18)0.66 (0.14)24.50 (4.82)
l o g N ( 0 , 1 ) RBLSS19.08 (0.90)0.78 (0.91)0.96 (0.03)0.95 (0.03)2.38 (0.70)
RBL18.10 (1.16)1.46 (1.39)0.92 (0.05)0.91 (0.05)14.58 (1.28)
BLSS15.94 (2.03)0.54 (0.86)0.87 (0.07)0.87 (0.07)4.26 (1.19)
BL15.50 (1.63)2.22 (1.80)0.82 (0.07)0.81 (0.08)21.61 (3.12)
L a p l a c e ( 0 , 2 )RBLSS17.88 (1.45)0.76 (0.69)0.92 (0.05)0.92 (0.05)3.24 (0.92)
RBL16.68 (1.57)1.48 (1.33)0.87 (0.06)0.86 (0.06)15.97 (1.21)
BLSS16.60 (1.55)0.38 (0.60)0.90 (0.05)0.89 (0.05)3.61 (0.91)
BL15.96 (1.59)2.20 (1.54)0.84 (0.06)0.82 (0.07)20.12 (1.66)
Table A4. Assessment under Setting 3 with ( n , p , q , k ) = ( 400 , 100 , 3 , 3 ) and 100 replicates.
Error  Method  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) RBLSS19.34 (0.69)1.40 (1.12)0.95 (0.03)0.95 (0.04)2.24 (0.58)
RBL17.38 (1.18)2.40 (1.65)0.87 (0.05)0.87 (0.05)23.69 (1.40)
BLSS19.58 (0.61)0.48 (0.65)0.98 (0.02)0.98 (0.03)1.84 (0.44)
BL17.98 (1.20)1.00 (0.97)0.92 (0.04)0.92 (0.04)21.22 (1.17)
t ( 2 ) RBLSS17.28 (1.40)0.78 (1.06)0.91 (0.05)0.90 (0.05)3.32 (0.93)
RBL13.58 (1.93)1.20 (1.37)0.78 (0.07)0.78 (0.07)25.01 (1.93)
BLSS10.44 (3.39)0.22 (0.46)0.66 (0.17)0.69 (0.14)6.52 (2.15)
BL8.24 (3.01)1.84 (2.38)0.53 (0.17)0.57 (0.15)43.24(14.10)
l o g N ( 0 , 1 ) RBLSS19.32 (0.91)0.70 (0.84)0.97 (0.03)0.96 (0.03)2.30 (0.66)
RBL15.58 (2.04)1.26 (1.19)0.84 (0.07)0.84 (0.07)24.06 (1.51)
BLSS15.30 (2.73)0.48 (0.68)0.85 (0.10)0.85 (0.09)4.55 (1.69)
BL11.90 (2.24)1.24 (1.30)0.71 (0.09)0.72 (0.09)34.67 (4.79)
L a p l a c e ( 0 , 2 )RBLSS15.94 (1.60)0.92 (0.80)0.86 (0.06)0.86 (0.06)4.03 (1.02)
RBL12.56 (1.64)1.36 (1.24)0.74 (0.07)0.74 (0.07)25.56 (1.63)
BLSS14.22 (1.79)0.50 (0.68)0.82 (0.07)0.82 (0.06)4.66 (1.06)
BL10.68 (1.96)1.10 (1.07)0.67 (0.09)0.68 (0.08)32.21 (2.23)

Appendix A.2. Additional Results on FDR Control

Figure A4. Violin plots of the Bayesian FDR for RBLSS and BLSS using a 5% FDR threshold under Setting 2 with p = 100 (upper panel) and p = 50 (lower panel). Errors 1–4 correspond to N ( 0 , 1 ) , t(2), LogNormal(0, 2), and Laplace(0, 2 ), respectively.
Figure A5. Violin plots of the Bayesian FDR for RBLSS and BLSS using a 5% FDR threshold under Setting 3 with p = 100 (upper panel) and p = 50 (lower panel). Errors 1–4 correspond to N ( 0 , 1 ) , t(2), LogNormal(0, 2), and Laplace(0, 2 ), respectively.

Appendix A.3. Sensitivity Analysis

We demonstrate that RBLSS is insensitive to the choice of hyperparameters. For the conjugate Beta priors on π 1 and π 2 , we examine five sets of hyperparameters: (1) Beta(0.5, 0.5), (2) Beta(1, 1), (3) Beta(2, 2), (4) Beta(1, 5), and (5) Beta(5, 1), corresponding to symmetric (1–3), right-skewed (4), and left-skewed (5) Beta distributions. Table A5 indicates that RBLSS's performance in variable selection and estimation is close to that under the Beta(1, 1) prior used in the numerical studies. In addition, we have also investigated the sensitivity of the Gamma priors on η 1 2 , η 2 2 and τ 2 . Specifically, we have explored various combinations of shape and rate parameters {(0.1, 1), (1, 1), (5, 1), (1, 2), (1, 5)}, with the corresponding results presented in Table A6. Again, it is evident that the performance of RBLSS remains consistent with that obtained under the Gamma(1, 1) prior used in our numerical studies.
Table A5. Hyperparameter sensitivity analysis of Beta priors on π 1 and π 2 in RBLSS under Setting 1 with ( n , p , q , k ) = ( 400 , 50 , 3 , 3 ) and 100 replicates.
Error  Prior  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) Beta ( 0.5 , 0.5 ) 19.24 (0.98)1.60 (1.31)0.94 (0.04)0.94 (0.05)2.39 (0.64)
Beta ( 1 , 1 ) 19.18 (0.96)1.42 (1.03)0.94 (0.04)0.94 (0.04)2.40 (0.59)
Beta ( 2 , 2 ) 19.32 (0.89)1.50 (1.15)0.95 (0.04)0.94 (0.04)2.40 (0.61)
Beta ( 1 , 5 ) 19.26 (0.99)1.44 (1.01)0.95 (0.04)0.94 (0.04)2.37 (0.57)
Beta ( 5 , 1 ) 19.36 (0.85)1.92 (1.24)0.94 (0.04)0.93 (0.04)2.46 (0.67)
t ( 2 ) Beta ( 0.5 , 0.5 ) 18.70 (1.15)0.70 (0.93)0.95 (0.04)0.94 (0.04)2.70 (0.66)
Beta ( 1 , 1 ) 18.58 (1.20)0.66 (0.75)0.95 (0.04)0.94 (0.04)2.74 (0.70)
Beta ( 2 , 2 ) 18.78 (0.93)0.72 (0.90)0.95 (0.03)0.95 (0.04)2.66 (0.61)
Beta ( 1 , 5 ) 18.50 (1.05)0.76 (0.89)0.94 (0.04)0.94 (0.04)2.80 (0.62)
Beta ( 5 , 1 ) 18.90 (0.97)1.04 (1.24)0.95 (0.04)0.94 (0.05)2.80 (0.79)
l o g N ( 0 , 1 ) Beta ( 0.5 , 0.5 ) 19.10 (1.02)0.88 (0.82)0.96 (0.04)0.95 (0.04)2.43 (0.69)
Beta ( 1 , 1 ) 19.14 (0.88)0.86 (0.93)0.96 (0.03)0.95 (0.04)2.41 (0.60)
Beta ( 2 , 2 ) 19.16 (1.09)0.90 (0.95)0.96 (0.04)0.95 (0.05)2.39 (0.74)
Beta ( 1 , 5 ) 19.18 (0.85)0.72 (0.95)0.96 (0.04)0.96 (0.04)2.40 (0.77)
Beta ( 5 , 1 ) 19.26 (0.88)1.22 (0.93)0.95 (0.03)0.95 (0.04)2.46 (0.70)
L a p l a c e ( 0 , 2 ) Beta ( 0.5 , 0.5 ) 16.46 (2.17)0.88 (1.06)0.88 (0.08)0.87 (0.09)3.55 (1.16)
Beta ( 1 , 1 ) 16.22 (2.06)0.78 (1.00)0.87 (0.08)0.87 (0.08)3.65 (1.17)
Beta ( 2 , 2 ) 16.48 (1.92)0.72 (0.78)0.88 (0.07)0.88 (0.07)3.51 (1.06)
Beta ( 1 , 5 ) 16.28 (1.95)0.72 (0.78)0.88 (0.07)0.87 (0.07)3.59 (1.00)
Beta ( 5 , 1 ) 16.70 (1.95)1.14 (0.95)0.88 (0.07)0.87 (0.08)3.69 (1.18)
Table A6. Hyperparameter sensitivity analysis of Gamma priors on η 1 2 , η 2 2 and τ 2 in RBLSS under Setting 1 with ( n , p ) = ( 400 , 50 ) and 100 replicates.
Error  Prior  TP  FP  F1  MCC  Estimation Error
N ( 0 , 1 ) Gamma ( 0.1 , 1 ) 19.28 (0.88)1.34 (1.17)0.95 (0.04)0.94 (0.05)2.35 (0.63)
Gamma ( 1 , 1 ) 19.18 (0.96)1.42 (1.03)0.94 (0.04)0.94 (0.04)2.40 (0.59)
Gamma ( 5 , 1 ) 19.36 (0.78)1.94 (1.30)0.94 (0.03)0.93 (0.04)2.41 (0.55)
Gamma ( 1 , 2 ) 19.24 (0.87)1.26 (1.01)0.95 (0.03)0.95 (0.04)2.29 (0.53)
Gamma ( 1 , 5 ) 18.92 (1.05)1.02 (1.02)0.95 (0.04)0.94 (0.05)2.38 (0.69)
t ( 2 ) Gamma ( 0.1 , 1 ) 18.72 (1.03)0.68 (1.00)0.95 (0.04)0.95 (0.04)2.73 (0.68)
Gamma ( 1 , 1 ) 18.58 (1.20)0.66 (0.75)0.95 (0.04)0.94 (0.04)2.74 (0.70)
Gamma ( 2 , 2 ) 18.66 (1.24)1.12 (1.14)0.94 (0.05)0.93 (0.05)2.85 (0.79)
Gamma ( 1 , 5 ) 18.44 (1.13)0.66 (0.80)0.94 (0.04)0.94 (0.04)2.80 (0.73)
Gamma ( 5 , 1 ) 18.24 (1.13)0.46 (0.71)0.94 (0.04)0.94 (0.04)2.86 (0.75)
l o g N ( 0 , 1 ) Gamma ( 0.1 , 1 ) 19.26 (0.88)0.78 (0.82)0.96 (0.03)0.96 (0.04)2.37 (0.64)
Gamma ( 1 , 1 ) 19.14 (0.88)0.86 (0.93)0.96 (0.03)0.95 (0.04)2.41 (0.60)
Gamma ( 2 , 2 ) 19.16 (0.82)1.08 (1.01)0.95 (0.03)0.95 (0.04)2.42 (0.69)
Gamma ( 1 , 5 ) 19.02 (0.96)0.84 (0.87)0.95 (0.03)0.95 (0.04)2.44 (0.63)
Gamma ( 5 , 1 ) 18.90 (1.28)0.60 (0.73)0.96 (0.04)0.95 (0.05)2.45 (0.84)
L a p l a c e ( 0 , 2 ) Gamma ( 0.1 , 1 ) 16.42 (2.00)0.84 (0.93)0.88 (0.07)0.87 (0.08)3.65 (1.25)
Gamma ( 1 , 1 ) 16.22 (2.06)0.78 (1.00)0.87 (0.08)0.87 (0.08)3.65 (1.17)
Gamma ( 2 , 2 ) 16.82 (1.91)1.02 (0.89)0.89 (0.07)0.88 (0.07)3.51 (1.08)
Gamma ( 1 , 5 ) 16.14 (2.06)0.78 (0.95)0.87 (0.08)0.86 (0.08)3.75 (1.18)
Gamma ( 5 , 1 ) 15.82 (1.87)0.76 (0.82)0.86 (0.06)0.86 (0.07)3.90 (1.13)

Figure 1. F1 scores under Setting 1 with p = 100 (upper panel) and p = 50 (lower panel), corresponding to Table 2 and Table 1, respectively. Errors 1–4 correspond to N(0, 1), t(2), LogNormal(0, 2), and Laplace(0, 2), respectively.
Figure 2. MCC under Setting 1 with p = 100 (upper panel) and p = 50 (lower panel), corresponding to Table 2 and Table 1, respectively. Errors 1–4 correspond to N(0, 1), t(2), LogNormal(0, 2), and Laplace(0, 2), respectively.
Figure 3. Estimation error under Setting 1 with p = 100 (upper panel) and p = 50 (lower panel), corresponding to Table 2 and Table 1, respectively. Errors 1–4 correspond to N(0, 1), t(2), LogNormal(0, 2), and Laplace(0, 2), respectively.
Figure 4. Violin plots of the Bayesian FDR for RBLSS and BLSS using a 5% FDR threshold under Setting 1 with p = 100 (upper panel) and p = 50 (lower panel). Errors 1–4 correspond to N(0, 1), t(2), LogNormal(0, 2), and Laplace(0, 2), respectively.
Table 1. Assessment under Setting 1 with (n, p, q, k) = (400, 50, 3, 3) and 100 replicates.
Error | Method | TP | FP | F1 | MCC | Estimation Error
N(0, 1) | RBLSS | 19.38 (0.78) | 1.36 (1.12) | 0.95 (0.03) | 0.95 (0.04) | 2.21 (0.55)
 | RBL | 18.66 (1.12) | 3.10 (1.99) | 0.89 (0.05) | 0.88 (0.05) | 14.08 (1.09)
 | BLSS | 19.32 (0.84) | 0.80 (0.86) | 0.96 (0.03) | 0.96 (0.04) | 2.10 (0.53)
 | BL | 18.76 (1.10) | 2.04 (1.71) | 0.92 (0.04) | 0.91 (0.05) | 13.29 (1.13)
t(2) | RBLSS | 18.66 (0.94) | 0.72 (0.73) | 0.95 (0.03) | 0.94 (0.04) | 2.68 (0.61)
 | RBL | 17.32 (1.30) | 1.30 (1.13) | 0.90 (0.05) | 0.89 (0.05) | 15.41 (0.99)
 | BLSS | 11.68 (3.99) | 0.42 (0.64) | 0.71 (0.18) | 0.72 (0.15) | 6.27 (2.44)
 | BL | 12.62 (3.52) | 2.96 (2.44) | 0.70 (0.16) | 0.68 (0.15) | 27.65 (8.11)
logN(0, 1) | RBLSS | 19.10 (1.04) | 0.88 (0.82) | 0.96 (0.04) | 0.95 (0.04) | 2.40 (0.72)
 | RBL | 18.08 (1.18) | 1.56 (1.39) | 0.91 (0.05) | 0.90 (0.05) | 14.65 (1.35)
 | BLSS | 16.06 (2.11) | 0.46 (0.81) | 0.88 (0.08) | 0.87 (0.07) | 3.95 (1.15)
 | BL | 15.18 (1.99) | 2.12 (1.86) | 0.81 (0.07) | 0.80 (0.07) | 20.49 (2.36)
Laplace(0, 2) | RBLSS | 16.38 (2.01) | 0.90 (0.99) | 0.88 (0.08) | 0.87 (0.08) | 3.61 (1.09)
 | RBL | 15.50 (2.03) | 1.46 (1.03) | 0.84 (0.07) | 0.83 (0.08) | 15.70 (1.44)
 | BLSS | 14.12 (2.62) | 0.36 (0.60) | 0.81 (0.09) | 0.81 (0.09) | 4.21 (1.33)
 | BL | 14.40 (2.19) | 2.06 (1.63) | 0.79 (0.09) | 0.77 (0.09) | 19.80 (1.70)
Table 2. Assessment under Setting 1 with (n, p, q, k) = (400, 100, 3, 3) and 100 replicates.
Error | Method | TP | FP | F1 | MCC | Estimation Error
N(0, 1) | RBLSS | 19.16 (0.84) | 1.56 (1.37) | 0.94 (0.04) | 0.94 (0.05) | 2.46 (0.75)
 | RBL | 17.02 (1.53) | 2.34 (1.61) | 0.86 (0.05) | 0.86 (0.05) | 23.47 (1.41)
 | BLSS | 19.40 (0.73) | 0.58 (0.76) | 0.97 (0.02) | 0.97 (0.02) | 2.06 (0.48)
 | BL | 17.38 (1.34) | 1.04 (1.18) | 0.90 (0.05) | 0.90 (0.05) | 21.15 (1.28)
t(2) | RBLSS | 17.28 (1.33) | 0.46 (0.68) | 0.91 (0.04) | 0.91 (0.04) | 3.26 (0.80)
 | RBL | 13.68 (1.87) | 0.66 (0.92) | 0.79 (0.07) | 0.80 (0.06) | 24.41 (1.25)
 | BLSS | 10.16 (3.37) | 0.48 (0.84) | 0.65 (0.17) | 0.68 (0.14) | 6.86 (2.29)
 | BL | 7.86 (2.84) | 1.58 (2.14) | 0.52 (0.17) | 0.56 (0.15) | 44.21 (14.99)
logN(0, 1) | RBLSS | 19.12 (0.90) | 0.64 (0.78) | 0.96 (0.04) | 0.96 (0.04) | 2.31 (0.71)
 | RBL | 16.42 (1.46) | 0.96 (0.95) | 0.88 (0.05) | 0.88 (0.05) | 23.73 (1.66)
 | BLSS | 16.04 (1.68) | 0.40 (0.67) | 0.88 (0.06) | 0.88 (0.05) | 4.06 (0.95)
 | BL | 12.28 (1.95) | 1.08 (1.24) | 0.73 (0.08) | 0.74 (0.08) | 33.93 (3.81)
Laplace(0, 2) | RBLSS | 17.00 (1.50) | 0.86 (0.99) | 0.90 (0.05) | 0.89 (0.05) | 3.67 (1.06)
 | RBL | 14.16 (1.84) | 1.20 (1.29) | 0.80 (0.07) | 0.80 (0.07) | 25.37 (1.50)
 | BLSS | 15.18 (1.80) | 0.36 (0.69) | 0.85 (0.06) | 0.85 (0.06) | 4.27 (1.21)
 | BL | 12.12 (2.22) | 1.08 (0.97) | 0.73 (0.09) | 0.73 (0.08) | 32.41 (2.11)
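For reference, the F1 and MCC columns in Tables 1 and 2 summarize selection accuracy per replicate from the confusion counts of the selected variable set. The following minimal sketch shows how the two scores are computed; the counts are illustrative and are not taken from the tables.

```python
import math

def f1_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) from the confusion counts of a variable-selection run."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # MCC balances all four confusion-matrix cells.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Illustrative run: 20 true effects, 19 recovered, 1 false positive among 80 nulls.
f1, mcc = f1_mcc(tp=19, fp=1, fn=1, tn=79)  # → f1 = 0.95, mcc = 0.9375
```

With roughly 20 true effects per setting, a TP of 19.38 with FP of 1.36 (Table 1, RBLSS under normal error) translates into F1 and MCC near 0.95, matching the reported columns.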
Table 3. PIP of top-ranked main and interaction effects identified by RBLSS with an FDR of α = 0.05.
Lipid | Main PIP | Interaction PIP
C18:2/16:1 | 0.9597 | -
C18:2/18:1 | 0.9752 | -
C22:7/16:0 | - | 0.9087
C16:0/18:1 | - | 0.9501
C18:3/18:1 | - | 1.0000
C18:2/20:4 | - | 0.9248
Table 4. PIP of top-ranked main and interaction effects identified by BLSS with an FDR of α = 0.05.
Lipid | Main PIP | Interaction PIP
C20:6/16:0 | 0.9863 | -
C18:3/18:1 | - | 1.0000
C18:2/20:4 | - | 0.9578
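The selections in Tables 3 and 4 come from thresholding posterior inclusion probabilities (PIPs) so that the average posterior probability of a false discovery stays below α = 0.05. The sketch below illustrates the standard PIP-based Bayesian FDR rule; the PIP vector is illustrative, and `bayesian_fdr_select` is a hypothetical helper name, not part of the paper's software.

```python
def bayesian_fdr_select(pips, alpha=0.05):
    """Select indices whose cumulative mean local error rate (1 - PIP) is <= alpha."""
    # Rank effects from most to least probable (smallest 1 - PIP first).
    order = sorted(range(len(pips)), key=lambda j: 1.0 - pips[j])
    selected, cum_err = [], 0.0
    for rank, j in enumerate(order, start=1):
        cum_err += 1.0 - pips[j]
        # The running mean of (1 - PIP) estimates the Bayesian FDR of the set.
        if cum_err / rank <= alpha:
            selected.append(j)
        else:
            break
    return selected

pips = [1.0000, 0.9752, 0.9597, 0.9501, 0.9248, 0.9087, 0.40]
print(bayesian_fdr_select(pips, alpha=0.05))  # → [0, 1, 2, 3, 4, 5]
```

Note that PIPs somewhat below 0.95 (e.g., 0.9087 above) can still enter the selected set, because the rule controls the average, not the per-effect, posterior error.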

Li, X.; Fan, K.; Wu, C. Prioritizing Longitudinal Gene–Environment Interactions Using an FDR-Assisted Robust Bayesian Linear Mixed Model. Algorithms 2025, 18, 728. https://doi.org/10.3390/a18110728
