Quantile Regression Approach for Analyzing Similarity of Gene Expressions under Multiple Biological Conditions

: Temporal gene expression data contain ample information to characterize gene function and are now widely used in bio-medical research. A dense temporal gene expression usually shows various patterns in expression levels under different biological conditions. The existing literature investigates the gene trajectory using the mean function. However, temporal gene expression curves usually show a strong degree of heterogeneity under multiple conditions. As a result, rates of change for gene expressions may be different in non-central locations and a mean function model may not capture the non-central location of the gene expression distribution. Further, the mean regression model depends on the normality assumptions of the error terms of the model, which may be impractical when analyzing gene expression data. In this research, a linear quantile mixed model is used to ﬁnd the trajectory of gene expression data. This method enables the changes in gene expression over time to be studied by estimating a family of quantile functions. A statistical test is proposed to test the similarity between two different gene expressions based on estimated parameters using a quantile model. Then, the performance of the proposed test statistic is examined using extensive simulation studies. Simulation studies demonstrate the good statistical performance of this proposed test statistic and show that this method is robust against normal error assumptions. As an illustration, the proposed method is applied to analyze a dataset of 18 genes in P. aeruginosa , expressed in 24 biological conditions. Furthermore, a minimum Mahalanobis distance is used to ﬁnd the clustering tree for gene expressions.


Introduction
Recently, many researchers have focus on the analysis of gene expression data. The gene expression process records measurements of expression under various biological conditions over a specific time period. At present, micro-array experiments are widely being used to generate rapidly vast amounts of data on gene expression under various biological conditions. The analysis of temporal gene expression is now becoming of great interest to scientists to understand the complex mechanism of gene profiles and characterize gene expressions. Moreover, this analysis also helps bio-medical scientists detect the genes responsible for early cancer (Fang et al. [1]). Genes are generally expressed by transcription into RNA, and then this transcript might be translated into protein. Usually, the gene expression process conveys RNA information as numerical outcomes. For gene expression data, specific biological characteristics (for example, RNA information) are usually measured during a predetermined time interval under different experimental conditions. Several statistical methods to analyze gene expression data have been considered, including clustering, fold expression changes, ANOVA, etc. Draghici and Kulaeva [2] Stats 2022, 5 discussed a noise sampling method based on ANOVA for the selection of differentially regulated genes. They compared their results with the fold change method and discussed the risk of obtaining a false positive when strictly observing fold change. Eisen et al. [3] used a cluster analysis method to compare genes according to similarity. Li et al. [4] introduced a time-lagged correlation coefficient to assess the relationship between genes and used the linear mixed-effect model with splines for gene clustering. Yeung and Ruzzo [5] applied the principal component analysis to analyze gene expression data. They also studied the effectiveness of different clustering algorithms to capture cluster structure. Fang et al. [1] discussed some limitations of the fold expression method, suggesting that genetic information might be lost using this method. One defining feature of gene expression analysis is its time-dependency on the expression levels for a given gene at multiple times. Therefore, it is very important to incorporate the correlation structure within gene expressions. Kerr et al. [6], Storey and Tibshirani [7] studied differentially expressed genes from a single timepoint. Meanwhile, Tusher et al. [8] developed a temporal gene expression model based on one condition. Fang et al. [1] proposed a non-linear regression model to mark the relative change rates of genes. Their study uses changeable variance and covariance structure to model individual gene expression trajectory. Deng et al. [9] investigated the effects of different biological conditions on gene expressions. Under a given condition, Deng et al. [9] used log-normal distribution properties to characterize the variance function of genes and proposed a statistical test approach to test the equality of the variance function for different conditions. Deng et al. [10] studied the threshold points of the gene expressions and constructed test statistics to detect the threshold points.
At present, the available literature only considers using the mean function when analyzing gene expression trajectory. However, when gene expression data are skewed or over-dispersed, the mean function can be affected by outlying observations. Normality is usually assumed as the error term for estimation of the parameters in the model, and this assumption may be improper in practical instances. In addition, the temporal gene expression curve generally shows a strong degree of heterogeneity between multiple biological conditions. As a result, the rates of change for gene expressions may be different in non-central locations, and the mean model can only characterize the central location of the gene expression distribution. Thus, mean function may not properly capture the gene expression trajectory. However, by fitting the gene expression curves at different quantile vaules, the quantile regression approach can be used to completely examine the gene expression rate. Many researchers have explored statistical methods for the analysis of the quantile regression model. Huang and Lee [11] predicted the quantiles of daily Standard&Poor's 500 (S&P 500) returns by incorporating the high-frequency information through combining forecasts into one model. Most recently, Gallardo et al. [12] proposed a parametric quantile regression model for asymmetric response variables. Jung et al. [13] applied the multiple quantile regression method to the estimation of the spatial distribution of soil moisture. Chen et al. [14] studied estimation and inference for linear quantile regression models with generated regressors using a practical, two-step estimation procedure. Nevertheless, to our knowledge, there is no literature examining the gene expression data using the quantile regression model.
The purpose of this research is to apply the quantile regression (QR) model to the analysis of gene expression data and propose a test statistic for the examination of the similarity between gene expressions by comparing the quantile regression coefficients. The remainder of this article is organized as follows. In Section 2, the quantile regression model is proposed for gene expression data and the statistical inference is given for this model. The simulation study is performed in Section 3. An application for gene expression data is presented in Section 4, with the discussion in Section 5.

Quantile Model for Gene Expression Data
In this section, this research methodology is introduced to analyze the gene expression data. In most cases, for gene expression data, specific characteristics were measured at discrete timepoints and measurements were taken under different biological conditions. As gene expression characteristics were measured over a series of timepoints and repeated under different circumstances, the longitudinal data analysis method can be applied to analyze discrete gene expression data. On the other hand, gene expression measurements show a variety of patterns under different biological conditions over specific times. Therefore, the linear quantile mixed model may be appropriate for the analysis of gene expression data. Let Y i (t) be the observed measurement of specific gene expression under the ith condition at time t. To determine the gene trajectory at different quantiles τ (0 < τ < 1), the following quantile model for Y i (t) can be considered.
where, Q (τ) (t) is the quantile curve at time t, is the random noise with zero τth quantile and N denotes the number of conditions. Note that we are unable to observe Y i (t) for all timepoints, only at the very specific occasions t ij at which measurements with errors were taken. Therefore, the observed longitudinal data for a specific gene under ith biological condition consist of the measurements y i = {y i (t ij ); j = 1, · · · k i }, where k i denotes the number of observed timepoints. Note that, for gene expression data, there is no available covariate. However, there are many methods in the literature that can be used to estimate the quantile function Q (τ) (t), including kernel, local polynomial, smoothing splines, regression splines and wavelet-based methods, among others. One simple and straightforward basis is the polynomial basis {1, t, . . . , t p−1 }, in which the true quantile function of gene trajectory expression is modeled as polynomials of degree p − 1. In this article, the method proposed in Donoho and Johnstone [15] and further developed in Zhang [16] is adopted. In this spline method, Q (τ) (t) is approximated using the linear combination of a set of truncated power basis functions. Given a sequence of K interior knots 0 < κ 1 < κ 2 < · · · < κ K < T where T is the end time of observations, the regression spline basis functions of order p are 1, t, t 2 , . . . , t p , (t − κ 1 ) p + , . . . , (t − κ K ) p + . Denoting the vector of r(= K + 1 + p) basis functions by the regression spline smoothing is used to model Q (τ) (t) using the linear combination of the basis functions B(t), and the linear quantile mixed model can be written as where the basis function for the fixed effects parameter is denoted as B(t) = (1, t, t 2 , . . . , t q , (t − κ 1 ) q + , . . . , (t − κ K ) q + ) and Z(t) is considered as basis function for random effects parameter, which could be the q-dimensional sub-vector of B(t)(q ≤ r). In particular, basis functions for the random intercept model and random slope model can be written as follows: For model formulation, gene expression data are considered longitudinal data in the form {Y i (t ij ), B(t ij ) , Z(t ij ) }, for n biological conditions and k i time occasions, i.e., i = 1, 2, · · · , n and j = 1, 2, · · · , k i . Now, define . . .

Estimation of Parameters
Based on the above discussion, the joint density of (Y, U (τ) ) can be written in terms of the τth quantile as follows: For the random intercept model (q = 1), the design matrix for random effects can be written as Z i = (1, 1, · · · , 1) ; i = 1, 2, · · · , n. Let, R q denote q−dimensional Euclidean space. Now, the marginal likelihood can be derived from Equation (5) and written as Moreover, the marginal log-likelihood function can also be written as Since the distribution of Y(t ij ) is assumed to follow an asymmetric Laplace distribution, τth quantile of Y(t ij ) can be estimated using the asymmetric Laplace distribution with location parameter µ i , common scale σ (τ) parameter and known skew parameter (τ). To estimate parameters from joint density f (y, u (τ) ), expressed in Equation (5), it is important to compute the following integral, which is also known as the marginal density of Y i .
Geraci and Bottai [17] develop the gradient search (gs) and the derivative free (df) optimization algorithm to maximize likelihood function in Equation (10). This optimization starts with a parameter value and then searches the positive semi-line for a new parameter value where likelihood is larger. This algorithm works until the likelihood change is sufficiently small or less than the pre-specified tolerance (δ) level. This algorithm begins 0 . The derivative-free optimization algorithm is similar to the gradient search method. This method alternates a loop for θ (τ) and then updates σ (τ) . Now, from the approximate likelihood in Equation (10), the fixed effects parameter β (τ) , the random effects parameter α (τ) and the error term parameter σ (τ) can be estimated by using the algorithm given in Geraci and Bottai [17]. Further, the estimate of quantile function Q (τ) (t) can be written as followŝ which can be expressed as (Geraci and Botai [17]) From the asymptotic normality ofQ (τ) (t), the approximate (1 − α)100% confidence interval for the quantile function Q (τ) (t) can be constructed as follows: where z α 2 is the upper 100 (1 − α 2 )% percentile of standard normal distribution. Although the asymptotic covariance matrix forÛ (τ) has a closed form (12), there is no expression for the covariance matrix of estimatorβ (τ) , and thus we are unable to find the expression for the estimate of covariance cov(β (τ) ,Û (τ) ). However, bootstrap is a very powerful method and can be used in the estimation of covariance matrix for estimators (β (τ) ,Û (τ) ). Here, we use a block bootstrap approach to find the estimate for cov(β (τ) ,Û (τ) ).
The procedures are as follows. 1.
Find the estimated values for the parameters β (τ) , α (τ) and σ (τ) and then calculate the values ofÛ (τ) by using formula (11) from each bootstrap sample and denote the obtained values asβ r ) , r = 1, 2, . . . , R and calculate the sample means of R bootstrap estimates for fixed effects parameters and random effects predictors φ (τ) = (β (τ) , U (τ) ) and ϑ (τ) 4. Now, the bootstrap estimator for covariance matrix of estimators (β (τ) ,Û (τ) ) can be written asV Furthermore, the estimates of model parameters and covariance matrix of estimators are computed by using the 'lqmm' package developed by Geraci and Bottai [17] in the R programming environment. Generally, this 'lqmm' package is used to estimate conditional quantile functions with random effects in linear quantile mixed models.

Test of the Similarity of Quantile Functions for Two Gene Expressions
The other purpose of this research is to identify gene similarity based on quantile functions. Using the results obtained in Section 2.1, the estimates of quantile functions for gene expressions can be obtained. The estimated quantile functions for g genes can be expressed asQ Two genes, h and s, are said to be similar if their quantile curve expressions are equal, i.e.,Q Therefore, the proposed hypothesis is Now, suppose that all quantile functions share the same truncated power basis functions B(t) and Z(t). Then, the τ-quantile curves of gene expressions depend on the fixed effects parameter β (τ) and the random effects components U (τ) , and testing the similarity of two gene expression is equivalent to testing the equality of the corresponding fixed effects parameters and random effects components. Further, the predictors of random effects components depend on the estimates of parameters β (τ)T , α (τ) , σ (τ) . Thus, instead of testing the hypothesis (17), one can test the following hypothesis related to the fixed effects parameters and the random effects components.
Based on parameter estimates and its covariance matrix, the following asymptotic statistic for testing the the hypothesis H 0 : (β where, are the estimated variance-covariance matrices of the estimators (β s ), respectively, which can be computed from (16). When H 0 holds, χ 2(τ) hs in Equation (19) has asymptotic chi-squared distribution with (r + q + 1) degrees of freedom, where r is known as the dimension number of β (τ) for fixed effects parameters and q is the dimension number of α (τ) for random effects components.
Furthermore, since the quantile function of gene expression is determined by rdimensional fixed parameters, q-dimensional random effects components, and scale parameter σ (τ) , the pattern of gene expression h can be induced to be a random point in the (r + q + 1) dimensional Euclidean space, which has the asymptotic multivariate normal distribution with mean (β From this point of view, the statistic χ 2(τ) hs in (19) is also the Mahalanobis distance be- s ). In terms of the minimum Mahalanobis distance, we can obtain the clustering tree for g gene expressions.

Simulation
In this section, we examine the performance of the estimation and chi-square test (19) based on the proposed quantile model for analyzing temporal gene expression data. Extensive simulation studies are carried out to evaluate the accuracy of the estimated quantile curves and the power of the proposed chi-square test (19). The random intercept QR model and random slope QR model are chosen with five different error terms, in which the error terms are chosen from symmetric distribution (normal), symmetric and heavy tailed distribution (Laplace and Student t) and skewed distribution (skew normal and skew t). Hence, there are ten different scenarios to examine through the simulation study. The steps are given below. • Thirty-five equally spaced timepoints between [0, 1] are considered for 50 samples. The data were generated using the following mixed model. The true model is assumed as follows: where, f 0 (t) = exp 5t 1+t 3 , i = 1, · · · , 50. • Random effects components are assumed to follow a normal distribution with a mean of zero and standard deviation of two, i.e., (U i ∼ N(0, 4)). • The following i.i.d errors are considered for true models Equation (20).

Model and Parameter Estimation
For model generalization, the following quantile mixed model is considered: where, B(t) and Z(t) are the basis functions for fixed and random effects components, respectively. The parameter estimation procedure is illustrated below.
• Generate data for n = 50 samples using Equation (20) . • Basis functions for Model (21) are considered as: 1. Random intercept model: Random slope model: Forthe random slope model, we consider normal random effects with a diagonal variance-covariance matrix. A Gauss-Hermite quadrature with seven noves is considered to approximate the marginal log-likelihood of Equation (10). • All parameters are estimated at median (τ = 0.50). • A total of 500 bootstrap replications are considered for the estimation of the covariance matrix of the estimators (β (τ) ,Û (τ) ).
Note that, for simplicity, the interior knots are not added in the basis function B(t) in this simulation study. The simulations were also conducted in the scenario for the basis function B(t) with the interior knots and the results are similar to those without the interior knots. Furthermore, the simulations for the other values of quantile τ are omitted.

Parameter Estimates of Random Intercept QR Model
Parameter estimates for the 50th quantile (τ = 0.50) of the random intercept QR model are reported below for five different scenarios. The estimated median function, along with the confidence interval of each random intercept QR model, is presented in Figures 1-5. Again, the estimated 50th (τ = 0.50) quantile function is also compared with sample median function in each figure.

Parameter Estimates of Random Slope QR Model
Now, the estimates of parameters for the 50th quantile (τ = 0.50) of the random slope QR model are reported below for five different scenarios and estimated median functions; sample median functions, along with confidence intervals, are shown in Figures 6-10

Power Analysis of Proposed Test Statistic
In this section, the power of the chi-square test presented in Equation (19) is evaluated by considering the alternative model. A constant number, m is added to this alternative model.
The alternative model considered in simulation is as follows: where, m ∈ {0, 0.50, 1.00, 1.50, 2.00, 2.50, 3.00}. The L 2 distance between true model and alternative model is m. Here, we assume that random effects (U i ) and error ( i ) follow the same distribution. The power analysis is carried out for both the random slope QR model and random intercept QR model. All simulation results are performed for a sample size of n = 30, 50 and 75 at a level of significance of α = 0.05 and quantile value of τ = 0.25, 0.50, 0.75 with 1000 replications. Tables 1-3 report empirical powers for five random intercept QR models and five random slope models with the quantiles τ = 0.25, 0.5 and 0.75. Power analysis results demonstrate that the statistical performance of the proposed chi-square test statistic is good for both random intercept models and random slope models at a 5% level of significance. This indicates that this test statistic can be applied to determine the similarity of two gene expressions.

Application: Gene Expression Data
In this section, the linear quantile mixed model given in Section 2 is applied to analyze a real dataset of 18 gene expressions in P. aeruginosa expressed in 24 conditions. Some descriptive measures are discussed at the beginning of this section. Towards the end of this section, the similarities between genes are investigated using the proposed test statistic (Equation (19)) based on quantile functions.

Data
A description of 18 genes is reported in Table 4. The data observed for these genes are first analyzed by Fang et al. [1] and then by Deng et al. [10] Initially, polymerase chain reaction (PCR) was performed to amplify regions of P. aeruginosa virulence factors. The primers were synthesized using PAO1 genome data according to Duan [18]. PAO1 chromosomal DNA was used as a PCR template and amplified promoter regions were cloned into XhoI-BamHI restriction sites of the plasmid PMS402. Then plasmids were inserted into PAO1 using electroporation. More details regarding DNA manipulation, PCR and transformation procedures can be found in Duan [18]. The promoter activity was measured as counts per second (CPS) of light production using a Victor2 Multilabel counter. TSBDC minimal medium containing EDTA (400 µg/mL) and (50 µg/mL) Fecl 3 were used to assay gene expression. The reporter strains were grown overnight and the resulting culture was diluted into 1:200 proportion in a 96-well microtiter plate. Then, the promoter activity of the virulence factors was measured under 24 biological conditions every 30 min for 21 h. These genes were considered as quorum-sensing or quorum-sensing-regulated genes, which play an important role in bio-films formation. Thus, this dataset consists of 18 genes in P. aeruginosa (Table 4) and each gene was observed at 43 consecutive timepoints under 24 conditions. Therefore, the data for each gene consist of 1032 observations and, in total, 18,576 observations are measured for 18 genes.

Exploratory Analysis
In this section, a summary is provided of the selected gene expression data. Gene expressions, g(t) are obtained on a log scale under 24 conditions. Figure 11 exhibits gene expression curves of genes PA2975(rluc) and PA0573, respectively. From this figure, it is evident that the expressions for genes PA2975(rluc) and PA0573 are very different under different biological conditions. Figure 12 shows a histogram of expressions on PA2975(rluc) and PA0573, respectively. This figure reveals that teh data are severely skewed (negative) and show non-normality signs. Figures 13 and 14 present the distributions of expressions for PA2975(rluc) and PA0573 with respect to time and conditions, respectively. Both boxplots demonstrate that heterogeneity is present in both PA2975(rluc)(B3) and PA0573(E6) gene expressions. More importantly, both parts of the figure show that the gene expression dataset consists of outliers. Hence, Figures 12-14 suggest that the normality assumption for linear mixed model may be inappropriate for analyzing gene expression data. For this reason, the quantile models are going to be used to determine gene trajectory in terms of different quantile values. This method might provide more significant and insightful results than the mean regression model.

Model and Parameter Estimates
Let Y i (t) be the measurements for condition i at time t. The linear quantile mixed model is considered for estimation purposes. Since no additional covariates are available, polynomial basis functions are considered for fixed effects in this analysis.
i (t); (i = 1, · · · , 24) (t = 0, 0.5, · · · , 21). (23) Random Intercept Model: B(t) = (1, t, t 2 , t 3 , t 4 ) and Z(t) = (1). Random Slope Model: For each gene expression, the same fixed effects and random effects structures are considered when estimating the parameters in both the random intercept model and random slope model. However, AIC is found to be lower for the random intercept model than for the random slope model. Thus, the random intercept model is used for further analysis, instead of the random slope model. Table 5 reports the values of estimated parameters for quantile functions with quantile τ = 0.25, 0.50, 0.75 for PA2975(rluc)(B3) and PA0573(E6). Figures 15 and 16 show the estimated quantile functions and confidence intervals for genes PA2975(rluc)(B3) and PA0573(E6), respectively. Figures 17 and 18 present the estimated quantile functions, along with the sample quantile functions, for gene PA2975(rluc)(B3) and PA0573(E6). Moreover, Figure 19 shows the estimated median functions for 18 genes in P. aeruginosa.

Test of Gene Similarity
To find the similarity between genes, pairwise comparisons of all the gene expressions of 18 genes in P. aeruginosa were investigated in terms of their quantile functions, since gene expression curves depend on fixed effects parameters and random effects components. For τ = 0.50 the following hypotheses were tested.

Concluding Remarks
Gene expression analysis usually tracks the expression values of a large number of genes simultaneously under different biological conditions, and the rate of change may not be consistent at different quantiles. This research uses a linear quantile mixed effect model to analyze the gene expression trajectory. Gene expression models are constructed using basis functions and a statistical test is proposed to determine the similarity of the genes based on estimated fixed effects parameters and random effects components from the linear quantile mixed model. In simulation studies, both the random intercept and random slope models are considered to examine the performance of proposed estimation and test statistics. The simulation results indicate that the chi-square test statistic performs well in all circumstances to test the similarity of quantile functions. Finally, as an illustration, a linear quantile mixed model is applied to a real dataset of 18 genes in P. aeruginosa expressed in 24 conditions, and the similarity between genes is investigated using the proposed test statistic based on quantile functions. Moreover, a clustering tree is also shown using a complete linkage method. This research suggests that this proposed test statistic may help bio-medical scientists test the similarity between genes in terms of their quantile functions. In a simulation study, instead of true quantile functions, estimated quantile functions are compared with sample quantile functions, which is a limitation of this research. This research can be further extended by considering a changeable variance and covariance structure for the gene expression data.