Quantile Regression Approach for Analyzing Similarity of Gene Expressions under Multiple Biological Conditions

Deng, Dianliang; Chowdhury, Mashfiqul Huq

doi:10.3390/stats5030036


by Dianliang Deng ¹,*,† and Mashfiqul Huq Chowdhury ²,†

¹ Department of Mathematics and Statistics, University of Regina, Regina, SK S4S 0A2, Canada

² Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail 1902, Bangladesh

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Stats 2022, 5(3), 583-605; https://doi.org/10.3390/stats5030036

Submission received: 6 June 2022 / Revised: 27 June 2022 / Accepted: 29 June 2022 / Published: 2 July 2022


Abstract:

Temporal gene expression data contain ample information to characterize gene function and are now widely used in bio-medical research. Dense temporal gene expression data usually show various patterns in expression levels under different biological conditions. The existing literature investigates the gene trajectory using the mean function. However, temporal gene expression curves usually show a strong degree of heterogeneity under multiple conditions. As a result, rates of change for gene expressions may be different in non-central locations, and a mean function model may not capture the non-central locations of the gene expression distribution. Further, the mean regression model depends on the normality assumption for the error terms, which may be unrealistic when analyzing gene expression data. In this research, a linear quantile mixed model is used to find the trajectory of gene expression data. This method enables the changes in gene expression over time to be studied by estimating a family of quantile functions. A statistical test is proposed to test the similarity between two different gene expressions based on the parameters estimated from the quantile model. The performance of the proposed test statistic is then examined using extensive simulation studies, which demonstrate its good statistical performance and show that the method is robust to departures from the normal error assumption. As an illustration, the proposed method is applied to analyze a dataset of 18 genes in P. aeruginosa, expressed in 24 biological conditions. Furthermore, a minimum Mahalanobis distance is used to find the clustering tree for gene expressions.

Keywords:

chi-square test; classification; linear mixed model; Mahalanobis distance; quantile analysis; temporal gene expressions

1. Introduction

Recently, many researchers have focused on the analysis of gene expression data. The gene expression process records measurements of expression under various biological conditions over a specific time period. At present, micro-array experiments are widely used to rapidly generate vast amounts of data on gene expression under various biological conditions. The analysis of temporal gene expression is of growing interest to scientists seeking to understand the complex mechanisms of gene profiles and to characterize gene expressions. Moreover, this analysis also helps bio-medical scientists detect the genes responsible for early cancer (Fang et al. [1]). Genes are generally expressed by transcription into RNA, and this transcript might then be translated into protein. Usually, the gene expression process conveys RNA information as numerical outcomes. For gene expression data, specific biological characteristics (for example, RNA information) are usually measured during a predetermined time interval under different experimental conditions.

Several statistical methods to analyze gene expression data have been considered, including clustering, fold expression changes, ANOVA, etc. Draghici and Kulaeva [2] discussed a noise sampling method based on ANOVA for the selection of differentially regulated genes. They compared their results with the fold change method and discussed the risk of obtaining false positives when relying strictly on fold change. Eisen et al. [3] used a cluster analysis method to compare genes according to similarity. Li et al. [4] introduced a time-lagged correlation coefficient to assess the relationship between genes and used the linear mixed-effect model with splines for gene clustering. Yeung and Ruzzo [5] applied principal component analysis to gene expression data. They also studied the effectiveness of different clustering algorithms in capturing cluster structure. Fang et al. [1] discussed some limitations of the fold expression method, suggesting that genetic information might be lost using this method. One defining feature of gene expression analysis is the time dependency of the expression levels of a given gene measured at multiple times. Therefore, it is very important to incorporate the correlation structure within gene expressions. Kerr et al. [6] and Storey and Tibshirani [7] studied differentially expressed genes from a single timepoint. Meanwhile, Tusher et al. [8] developed a temporal gene expression model based on one condition. Fang et al. [1] proposed a non-linear regression model to mark the relative change rates of genes. Their study uses a changeable variance and covariance structure to model individual gene expression trajectories. Deng et al. [9] investigated the effects of different biological conditions on gene expressions. Under a given condition, Deng et al. [9] used log-normal distribution properties to characterize the variance function of genes and proposed a statistical test of the equality of the variance function across conditions. Deng et al. [10] studied the threshold points of the gene expressions and constructed test statistics to detect the threshold points.

At present, the available literature only considers using the mean function when analyzing gene expression trajectory. However, when gene expression data are skewed or over-dispersed, the mean function can be affected by outlying observations. Normality of the error term is usually assumed for estimation of the parameters in the model, and this assumption may be improper in practical instances. In addition, the temporal gene expression curve generally shows a strong degree of heterogeneity between multiple biological conditions. As a result, the rates of change for gene expressions may be different in non-central locations, and the mean model can only characterize the central location of the gene expression distribution. Thus, the mean function may not properly capture the gene expression trajectory. However, by fitting the gene expression curves at different quantile values, the quantile regression approach can be used to examine the gene expression rate more completely. Many researchers have explored statistical methods for the analysis of the quantile regression model. Huang and Lee [11] predicted the quantiles of daily Standard & Poor’s 500 (S&P 500) returns by incorporating high-frequency information through combining forecasts into one model. Most recently, Gallardo et al. [12] proposed a parametric quantile regression model for asymmetric response variables. Jung et al. [13] applied the multiple quantile regression method to the estimation of the spatial distribution of soil moisture. Chen et al. [14] studied estimation and inference for linear quantile regression models with generated regressors using a practical, two-step estimation procedure. Nevertheless, to our knowledge, there is no literature examining gene expression data using the quantile regression model.

The purpose of this research is to apply the quantile regression (QR) model to the analysis of gene expression data and propose a test statistic for the examination of the similarity between gene expressions by comparing the quantile regression coefficients. The remainder of this article is organized as follows. In Section 2, the quantile regression model is proposed for gene expression data and the statistical inference is given for this model. The simulation study is performed in Section 3. An application for gene expression data is presented in Section 4, with the discussion in Section 5.

2. Quantile Model for Gene Expression Data

In this section, the methodology used to analyze the gene expression data is introduced. In most cases, for gene expression data, specific characteristics are measured at discrete timepoints, and measurements are taken under different biological conditions. As gene expression characteristics are measured over a series of timepoints and repeated under different circumstances, longitudinal data analysis methods can be applied to analyze discrete gene expression data. On the other hand, gene expression measurements show a variety of patterns under different biological conditions over specific times. Therefore, the linear quantile mixed model may be appropriate for the analysis of gene expression data. Let $Y_{i}(t)$ be the observed measurement of a specific gene expression under the $i$th condition at time $t$. To determine the gene trajectory at different quantiles $τ$ $(0 < τ < 1)$, the following quantile model for $Y_{i}(t)$ can be considered:

$$Y_{i}(t) = Q^{(τ)}(t) + ϵ_{i}^{(τ)}(t); \quad i = 1, \dots, N, \tag{1}$$

where $Q^{(τ)}(t)$ is the quantile curve at time $t$, $ϵ_{i}^{(τ)}(t)$ is the random noise with zero $τ$th quantile, and $N$ denotes the number of conditions. Note that $Y_{i}(t)$ cannot be observed at all timepoints, but only at the specific occasions $t_{ij}$ at which measurements with errors were taken. Therefore, the observed longitudinal data for a specific gene under the $i$th biological condition consist of the measurements $y_{i} = \{y_{i}(t_{ij});\ j = 1, \dots, k_{i}\}$, where $k_{i}$ denotes the number of observed timepoints. Note that, for gene expression data, there is no available covariate. However, there are many methods in the literature that can be used to estimate the quantile function $Q^{(τ)}(t)$, including kernel, local polynomial, smoothing spline, regression spline and wavelet-based methods, among others. One simple and straightforward basis is the polynomial basis $\{1, t, \dots, t^{p-1}\}$, in which the true quantile function of the gene expression trajectory is modeled as a polynomial of degree $p - 1$. In this article, the method proposed in Donoho and Johnstone [15] and further developed in Zhang [16] is adopted. In this spline method, $Q^{(τ)}(t)$ is approximated using a linear combination of a set of truncated power basis functions. Given a sequence of $K$ interior knots $0 < κ_{1} < κ_{2} < \dots < κ_{K} < T$, where $T$ is the end time of observations, the regression spline basis functions of order $p$ are $1, t, t^{2}, \dots, t^{p}, (t - κ_{1})_{+}^{p}, \dots, (t - κ_{K})_{+}^{p}$, where $(x)_{+} = \max(x, 0)$. Denoting the vector of $r$ $(= K + 1 + p)$ basis functions by

$$B(t) = (1, t, t^{2}, \dots, t^{p}, (t - κ_{1})_{+}^{p}, \dots, (t - κ_{K})_{+}^{p})^{⊤},$$

regression spline smoothing models $Q^{(τ)}(t)$ as a linear combination of the basis functions $B(t)$, and the linear quantile mixed model can be written as

$$Y_{i}(t) = [B(t)]^{⊤} β^{(τ)} + [Z(t)]^{⊤} U_{i}^{(τ)} + ϵ_{i}^{(τ)}(t), \tag{2}$$

where the basis function for the fixed effects parameter is $B(t) = (1, t, t^{2}, \dots, t^{p}, (t - κ_{1})_{+}^{p}, \dots, (t - κ_{K})_{+}^{p})^{⊤}$ and $Z(t)$ is the basis function for the random effects parameter, which could be a $q$-dimensional sub-vector of $B(t)$ $(q \leq r)$. In particular, basis functions for the random intercept model and random slope model can be written as follows:

Random intercept model: $Z(t) = (1)$;
Random slope model: $Z(t) = (1, t)^{⊤}$.
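
As a rough illustration of the basis construction described above, the following Python sketch evaluates $B(t)$ and $Z(t)$ on a grid of timepoints. The function names, the example knots and the use of NumPy are assumptions made for this example only; they are not part of the original analysis, which was carried out in R.

```python
import numpy as np

def truncated_power_basis(t, p, knots):
    """Rows are B(t)^T = (1, t, ..., t^p, (t-k_1)_+^p, ..., (t-k_K)_+^p).

    Hypothetical helper; assumes p >= 1.
    """
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    poly = np.vander(t, p + 1, increasing=True)            # columns 1, t, ..., t^p
    trunc = np.clip(t[:, None] - knots[None, :], 0.0, None) ** p  # (t - kappa)_+^p
    return np.hstack([poly, trunc])

def random_effects_basis(t, slope=False):
    """Rows are Z(t)^T: (1) for a random intercept, (1, t) for a random slope."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    if slope:
        cols.append(t)
    return np.column_stack(cols)

# Example: cubic basis (p = 3) with K = 2 illustrative interior knots
t = np.linspace(0.0, 1.0, 5)
B = truncated_power_basis(t, p=3, knots=[0.25, 0.5])
Z = random_effects_basis(t, slope=True)
print(B.shape, Z.shape)  # r = K + 1 + p columns for B, q columns for Z
```

Each row of `B` corresponds to one observed timepoint, so stacking the rows for condition $i$ gives the design matrix for that condition.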

For model formulation, gene expression data are considered longitudinal data in the form $\{Y_{i}(t_{ij}), B(t_{ij})^{⊤}, Z(t_{ij})^{⊤}\}$, for $n$ biological conditions and $k_{i}$ time occasions, i.e., $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, k_{i}$. Now, define

$$Y_{i} = \begin{pmatrix} Y_{i}(t_{i1}) \\ Y_{i}(t_{i2}) \\ ⋮ \\ Y_{i}(t_{ik_{i}}) \end{pmatrix}, \quad B_{i} = \begin{pmatrix} B^{⊤}(t_{i1}) \\ B^{⊤}(t_{i2}) \\ ⋮ \\ B^{⊤}(t_{ik_{i}}) \end{pmatrix}, \quad Z_{i} = \begin{pmatrix} Z^{⊤}(t_{i1}) \\ Z^{⊤}(t_{i2}) \\ ⋮ \\ Z^{⊤}(t_{ik_{i}}) \end{pmatrix}, \quad ϵ_{i}^{(τ)} = \begin{pmatrix} ϵ_{i1}^{(τ)} \\ ϵ_{i2}^{(τ)} \\ ⋮ \\ ϵ_{ik_{i}}^{(τ)} \end{pmatrix}$$

and $β^{(τ)} = (β_{1}^{(τ)}, \dots, β_{r}^{(τ)})^{⊤}$, $U_{i}^{(τ)} = (U_{i1}^{(τ)}, \dots, U_{iq}^{(τ)})^{⊤}$. Then, the matrix form of model (2) is

$$Y_{i} = B_{i} β^{(τ)} + Z_{i} U_{i}^{(τ)} + ϵ_{i}^{(τ)}.$$

Now, for a longitudinal setup, assume that the $Y_{i}$, conditional on the random effects $U_{i}^{(τ)}$, are independently distributed and that this conditional distribution is the asymmetric Laplace (AL) distribution with location parameter $μ_{i}^{(τ)} = B_{i} β^{(τ)} + Z_{i} U_{i}^{(τ)}$ and scale parameter $σ^{(τ)}$. Therefore, it can be written as

$$Y_{i} \mid U_{i}^{(τ)} \sim AL(B_{i} β^{(τ)} + Z_{i} U_{i}^{(τ)}, σ^{(τ)}, τ),$$

where $β^{(τ)}$ denotes the $r \times 1$ vector of fixed-effect parameters and $τ$ is the quantile level. Assume that the random effect vectors $U_{i}^{(τ)}$ are zero $τ$-quantile vectors and are independent of the error term $ϵ_{i}^{(τ)}$ of the model ($U_{i}^{(τ)} ⊥ ϵ_{i}^{(τ)}$). Moreover, assume that the random effect vectors $U_{i}^{(τ)}$ are distributed with the density function $f(u_{i}^{(τ)} \mid Ψ^{(τ)})$, where $Ψ^{(τ)}$ is the (symmetric positive definite) variance–covariance matrix. All parameters depend on the skewness parameter $τ$ $(0 < τ < 1)$. Now, the $τ$th linear quantile mixed model can be written as

$$Y = μ^{(τ)} + ϵ^{(τ)}, \tag{3}$$

where $Y = (Y_{1}^{⊤}, Y_{2}^{⊤}, \dots, Y_{n}^{⊤})^{⊤}$, $μ^{(τ)} = (μ_{1}^{(τ)⊤}, μ_{2}^{(τ)⊤}, \dots, μ_{n}^{(τ)⊤})^{⊤}$, and the i.i.d. components of the error $ϵ^{(τ)} = (ϵ_{1}^{(τ)⊤}, ϵ_{2}^{(τ)⊤}, \dots, ϵ_{n}^{(τ)⊤})^{⊤}$ follow an asymmetric Laplace distribution. Symbolically, $ϵ_{ij}^{(τ)} \sim AL(0, σ, τ)$, $i = 1, 2, \dots, n$; $j = 1, 2, \dots, k_{i}$. Now, in matrix notation, the $τ$th linear quantile of the response $Y$, denoted as $μ^{(τ)}$, can also be expressed as

$$μ^{(τ)} = B β^{(τ)} + Z_{\oplus} U^{(τ)}, \tag{4}$$

where $U^{(τ)} = (U_{1}^{(τ)⊤}, U_{2}^{(τ)⊤}, \dots, U_{n}^{(τ)⊤})^{⊤}$, $Z_{\oplus} = \bigoplus_{i=1}^{n} Z_{i}$ and $B = (B_{1}^{⊤}, B_{2}^{⊤}, \dots, B_{n}^{⊤})^{⊤}$.
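
To make the stacked notation in Equation (4) concrete, the sketch below builds the direct sum $Z_{\oplus}$ as a block-diagonal matrix and checks that the stacked form agrees with the condition-by-condition form $B_{i} β^{(τ)} + Z_{i} U_{i}^{(τ)}$. The helper `direct_sum`, the toy timepoints and the random parameter values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def direct_sum(blocks):
    """Block-diagonal matrix whose diagonal blocks are the given matrices (hypothetical helper)."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

rng = np.random.default_rng(0)
times = [np.array([0.0, 0.5, 1.0]), np.array([0.2, 0.8])]           # k_1 = 3, k_2 = 2
B_list = [np.vander(t, 4, increasing=True) for t in times]          # cubic polynomial basis
Z_list = [np.column_stack([np.ones_like(t), t]) for t in times]     # random slope: Z(t) = (1, t)
beta = rng.normal(size=4)                                           # toy fixed effects
U_list = [rng.normal(size=2) for _ in times]                        # toy random effects

# Condition-by-condition location parameters mu_i = B_i beta + Z_i U_i
mu_blockwise = np.concatenate(
    [B @ beta + Z @ U for B, Z, U in zip(B_list, Z_list, U_list)]
)
# Stacked form of Equation (4): mu = B beta + Z_plus U
mu_stacked = np.vstack(B_list) @ beta + direct_sum(Z_list) @ np.concatenate(U_list)
print(np.allclose(mu_blockwise, mu_stacked))
```

The block-diagonal layout is what keeps each condition's random effects from entering the location parameter of any other condition.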

2.1. Estimation of Parameters

Based on the above discussion, the joint density of $(Y, U^{(τ)})$ can be written in terms of the $τ$th quantile as follows:

$$\begin{aligned} f(y, u^{(τ)} \mid β^{(τ)}, Ψ^{(τ)}, σ^{(τ)}) &= f(y \mid β^{(τ)}, u^{(τ)}, σ^{(τ)}) \, f(u^{(τ)} \mid Ψ^{(τ)}) \\ &= \prod_{i=1}^{n} f(y_{i} \mid β^{(τ)}, u_{i}^{(τ)}, σ^{(τ)}) \, f(u_{i}^{(τ)} \mid Ψ^{(τ)}). \end{aligned} \tag{5}$$

For the random intercept model $(q = 1)$, the design matrix for random effects can be written as $Z_{i} = (1, 1, \dots, 1)^{'}$; $i = 1, 2, \dots, n$. Let $R^{q}$ denote the $q$-dimensional Euclidean space. Now, the marginal likelihood can be derived from Equation (5) and written as

$$L_{i}(β^{(τ)}, σ^{(τ)}, Ψ^{(τ)} \mid y) = \int_{R^{q}} f(y_{i} \mid β^{(τ)}, u_{i}^{(τ)}, σ^{(τ)}) \, f(u_{i}^{(τ)} \mid Ψ^{(τ)}) \, d u_{i}. \tag{6}$$

Moreover, the marginal log-likelihood function can also be written as

$$l_{i}(β^{(τ)}, σ^{(τ)}, Ψ^{(τ)} \mid y) = \log L_{i}(β^{(τ)}, σ^{(τ)}, Ψ^{(τ)} \mid y). \tag{7}$$

Since the distribution of $Y(t_{ij})$ is assumed to be asymmetric Laplace, the $τ$th quantile of $Y(t_{ij})$ can be estimated using the asymmetric Laplace distribution with location parameter $μ_{ij}^{(τ)} = μ^{(τ)}(t_{ij}) = [B(t_{ij})]^{⊤} β^{(τ)} + [Z(t_{ij})]^{⊤} U_{i}^{(τ)}$, common scale parameter $σ^{(τ)}$ and known skew parameter $τ$. To estimate the parameters from the joint density $f(y, u^{(τ)})$ expressed in Equation (5), it is necessary to compute the following integral, which is the marginal density of $Y_{i}$:

$$f(y_{i} \mid β^{(τ)}, σ^{(τ)}, Ψ^{(τ)}) = σ_{k_{i}}^{(τ)} \int_{R^{q}} \exp\left\{-\frac{1}{σ} ρ_{τ}(y_{i} - μ_{i}^{(τ)})\right\} f(u_{i} \mid Ψ^{(τ)}) \, d u_{i}, \tag{8}$$

where $σ_{k_{i}}^{(τ)} = \left[\frac{τ(1 - τ)}{σ}\right]^{k_{i}}$, $ρ_{τ}(v) = v\{τ - I(v < 0)\}$ is the check loss function for a scalar residual $v$, and

$$ρ_{τ}(y_{i} - μ_{i}^{(τ)}) = \sum_{j=1}^{k_{i}} ρ_{τ}(y_{ij} - μ_{ij}^{(τ)}) = \sum_{j=1}^{k_{i}} ρ_{τ}\left(y(t_{ij}) - [B(t_{ij})]^{⊤} β^{(τ)} - [Z(t_{ij})]^{⊤} U_{i}^{(τ)}\right).$$
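
The integrand in Equation (8) can be sketched numerically as follows. The code assumes the usual check loss $ρ_{τ}(v) = v\{τ - I(v < 0)\}$ and the asymmetric Laplace density with normalizing constant $τ(1 - τ)/σ$; the function names are hypothetical, and the final lines only verify numerically that the density integrates to one and places probability $τ$ below the location parameter.

```python
import numpy as np

def check_loss(v, tau):
    """Check (pinball) loss rho_tau(v) = v * (tau - 1{v < 0}), elementwise."""
    v = np.asarray(v, dtype=float)
    return v * (tau - (v < 0))

def al_density(y, mu, sigma, tau):
    """Asymmetric Laplace density with location mu, scale sigma and skew tau."""
    return tau * (1 - tau) / sigma * np.exp(-check_loss(np.asarray(y) - mu, tau) / sigma)

def al_conditional_density(y_i, mu_i, sigma, tau):
    """Joint density of k_i independent AL observations given the random effects,
    i.e. [tau(1-tau)/sigma]^{k_i} * exp{-(1/sigma) * sum_j rho_tau(y_ij - mu_ij)}."""
    resid = np.asarray(y_i, dtype=float) - np.asarray(mu_i, dtype=float)
    k_i = resid.size
    return (tau * (1 - tau) / sigma) ** k_i * np.exp(-check_loss(resid, tau).sum() / sigma)

# Numerical sanity check on a fine grid (simple Riemann sum)
tau, sigma, mu = 0.25, 1.0, 0.0
grid = np.linspace(-60.0, 60.0, 120001)
dx = grid[1] - grid[0]
dens = al_density(grid, mu, sigma, tau)
total_mass = dens.sum() * dx              # should be close to 1
mass_below = dens[grid < mu].sum() * dx   # should be close to tau
print(round(total_mass, 3), round(mass_below, 3))
```

The second check is the reason the asymmetric Laplace likelihood is used here: its location parameter is the $τ$th quantile, so maximizing this likelihood in the location is equivalent to minimizing the check loss.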

Now, the log-likelihood function for $n$ conditions can be expressed as

$$l(β^{(τ)}, σ^{(τ)}, Ψ^{(τ)} \mid y) = \sum_{i=1}^{n} \left[\log(σ_{k_{i}}^{(τ)}) + \log \int_{R^{q}} \exp\left\{-\frac{1}{σ} ρ_{τ}(y_{i} - μ_{i}^{(τ)})\right\} f(u_{i} \mid Ψ^{(τ)}) \, d u_{i}\right]. \tag{9}$$

The integral in this likelihood can be evaluated numerically by applying Gaussian quadrature (Gauss–Hermite or Gauss–Laguerre quadrature), as proposed by Geraci and Bottai [17]. Assuming normal random effects $(U_{i} \sim N(0, Ψ^{(τ)}))$ in Equation (9), the Gauss–Hermite quadrature can be applied to approximate the likelihood function with nodes $ν_{m_{1}, \dots, m_{q}} = (ν_{m_{1}}, \dots, ν_{m_{q}})^{'}$ and weights $w_{m_{l}}$, $l = 1, 2, \dots, q$. The integer $M$ determines the number of points over the real line for each of the $q$ one-dimensional integrals. The covariance matrix of the random effects is reparameterized by parameters $α^{(τ)}$, i.e., $Ψ(α^{(τ)})$, and the parameters characterized by $β^{(τ)}$ and $α^{(τ)}$ are denoted by $θ^{(τ)} = (β^{(τ)}, α^{(τ)})^{T}$. Finally, Equation (9) leads to the following approximate log-likelihood:

$$l_{app}(θ^{(τ)}, σ^{(τ)} \mid y) = \sum_{i=1}^{n} \log \left\{\sum_{m_{1}=1}^{M} \dots \sum_{m_{q}=1}^{M} f\left(y_{i} \mid β^{(τ)}, σ^{(τ)}, [Ψ^{'}(α^{(τ)})]^{\frac{1}{2}} ν_{m_{1}, \dots, m_{q}}\right) \prod_{l=1}^{q} w_{m_{l}}\right\}. \tag{10}$$
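
A minimal sketch of the quadrature idea behind Equation (10), restricted to the random intercept case $(q = 1)$ with $U_{i} \sim N(0, ψ^{2})$, is given below. It uses NumPy's `hermgauss` nodes and weights, which are defined for the weight function $e^{-x^{2}}$, so the nodes are rescaled by $\sqrt{2}ψ$ and the sum is divided by $\sqrt{π}$. The function names and toy data are assumptions for illustration and do not reproduce the `lqmm` implementation.

```python
import numpy as np

def check_loss(v, tau):
    v = np.asarray(v, dtype=float)
    return v * (tau - (v < 0))

def gh_normal_expectation(g, psi, M=7):
    """Approximate E[g(U)] for U ~ N(0, psi^2) with M Gauss-Hermite nodes."""
    x, w = np.polynomial.hermite.hermgauss(M)   # rule for integrals against exp(-x^2)
    u = np.sqrt(2.0) * psi * x                  # change of variables u = sqrt(2) * psi * x
    return sum(wk * g(uk) for wk, uk in zip(w, u)) / np.sqrt(np.pi)

def marginal_loglik_random_intercept(y_list, B_list, beta, sigma, psi, tau, M=7):
    """Approximate log-likelihood: sum over conditions of the log of the
    quadrature approximation to the marginal density in Equation (8)."""
    total = 0.0
    for y_i, B_i in zip(y_list, B_list):
        k_i = len(y_i)
        fixed = B_i @ beta

        def cond_density(u, y_i=y_i, fixed=fixed, k_i=k_i):
            resid = y_i - fixed - u             # random intercept: Z_i u = u for every j
            return (tau * (1 - tau) / sigma) ** k_i * np.exp(-check_loss(resid, tau).sum() / sigma)

        total += np.log(gh_normal_expectation(cond_density, psi, M))
    return total

# Toy usage with three conditions and five timepoints each (illustrative values)
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 5)
B_i = np.vander(t, 4, increasing=True)
beta = np.array([1.0, 0.5, 0.0, 0.0])
y_list = [B_i @ beta + rng.normal(0, 2) + rng.laplace(0, 1, size=5) for _ in range(3)]
ll = marginal_loglik_random_intercept(y_list, [B_i] * 3, beta, sigma=0.5, psi=2.0, tau=0.5)
print(ll)
```

With only a handful of nodes the approximation can be rough when the integrand is sharply peaked, so the number of nodes $M$ is a tuning choice rather than a fixed constant.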

Geraci and Bottai [17] developed a gradient search (gs) algorithm and a derivative-free (df) optimization algorithm to maximize the likelihood function in Equation (10). The gradient search starts from an initial parameter value and then searches the positive semi-line for a new parameter value at which the likelihood is larger. The algorithm iterates until the change in the likelihood is smaller than a pre-specified tolerance level ($δ$). Estimation begins by initializing $β^{(τ)} = β_{0}^{(τ)}$; $α^{(τ)} = α_{0}^{(τ)}$; $σ^{(τ)} = σ_{0}^{(τ)}$. The derivative-free optimization algorithm is similar to the gradient search method. This method alternates a loop for $θ^{(τ)}$ and then updates $σ^{(τ)}$.
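
The flavor of such a search can be conveyed with a much simpler problem. The sketch below is not the `lqmm` algorithm: it is a generic subgradient search with an adaptive step, applied to the fixed-effects-only check loss, and it assumes a step expansion factor of 1.25 and a contraction factor of 0.5 chosen purely for illustration. A step is accepted only when the objective improves, and iteration stops once the improvement falls below a tolerance.

```python
import numpy as np

def check_loss(v, tau):
    v = np.asarray(v, dtype=float)
    return v * (tau - (v < 0))

def gradient_search(y, B, tau, theta0, step=1.0, tol=1e-12, max_iter=10_000):
    """Illustrative adaptive-step subgradient search for the mean check loss."""
    theta = np.asarray(theta0, dtype=float).copy()

    def loss(th):
        return check_loss(y - B @ th, tau).mean()

    def subgrad(th):
        resid = y - B @ th
        return -B.T @ (tau - (resid < 0)) / len(y)

    current = loss(theta)
    for _ in range(max_iter):
        candidate = theta - step * subgrad(theta)
        new = loss(candidate)
        if new < current:                 # improvement: accept and enlarge the step
            change = current - new
            theta, current = candidate, new
            step *= 1.25
            if change < tol:              # objective change below tolerance: stop
                break
        else:                             # no improvement: shrink the step and retry
            step *= 0.5
            if step < 1e-14:
                break
    return theta

# Intercept-only example: the minimizer of the check loss is the sample tau-quantile
y = np.linspace(0.0, 10.0, 201)
B = np.ones((201, 1))
est = gradient_search(y, B, tau=0.25, theta0=[y.mean()])
print(est[0], np.quantile(y, 0.25))
```

In the mixed model the same accept-or-shrink logic would be applied to the approximate log-likelihood in Equation (10) rather than to the plain check loss.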

Now, from the approximate likelihood in Equation (10), the fixed effects parameter $β^{(τ)}$, the random effects parameter $α^{(τ)}$ and the error term parameter $σ^{(τ)}$ can be estimated by using the algorithm given in Geraci and Bottai [17]. Further, the estimate of the quantile function $Q^{(τ)}(t)$ can be written as

$$\hat{Q}^{(τ)}(t) = [B(t)]^{⊤} \hat{β}^{(τ)} + [Z(t)]^{⊤} \hat{U}^{(τ)},$$

where $\hat{β}^{(τ)}$ is the estimate of $β^{(τ)}$ and $\hat{U}^{(τ)}$ is the estimated best linear predictor of $U^{(τ)}$, which can be expressed as (Geraci and Bottai [17])

$$\hat{U}^{(τ)} = \hat{Ψ}^{(τ)} Z^{⊤} \hat{Σ}^{-1} \left\{Y - B \hat{β}^{(τ)} - \hat{E}(ϵ^{(τ)})\right\}, \tag{11}$$

where $\hat{Ψ}^{(τ)} = Ψ(\hat{α}^{(τ)})$, $Z = (Z_{1}^{⊤}, \dots, Z_{n}^{⊤})^{⊤}$, $\hat{Σ} = \widehat{cov}(Y) = Z \hat{Ψ}^{(τ)} Z^{⊤} + \widehat{cov}(ϵ^{(τ)}) = Z \hat{Ψ}^{(τ)} Z^{⊤} + diag(\widehat{var}(ϵ_{ij}^{(τ)}))$ with

$$\widehat{var}(ϵ_{ij}^{(τ)}) = \frac{(\hat{σ}^{(τ)})^{2} (τ^{2} + (1 - τ)^{2})}{τ^{2} (1 - τ)^{2}}, \quad j = 1, \dots, k_{i}; \ i = 1, \dots, n,$$

and $\hat{E}(ϵ^{(τ)}) = (\hat{E}(ϵ_{11}^{(τ)}), \dots, \hat{E}(ϵ_{1k_{1}}^{(τ)}); \dots; \hat{E}(ϵ_{n1}^{(τ)}), \dots, \hat{E}(ϵ_{nk_{n}}^{(τ)}))$ with

$$\hat{E}(ϵ_{ij}^{(τ)}) = \frac{\hat{σ}^{(τ)} (1 - 2τ)}{τ (1 - τ)}$$

for $j = 1, \dots, k_{i}$; $i = 1, \dots, n$. Furthermore, the asymptotic covariance matrix for $\hat{U}^{(τ)}$ can be derived as

$$cov(\hat{U}^{(τ)}) = \hat{Ψ}^{(τ)} Z^{⊤} \hat{Σ}^{-1} Z \hat{Ψ}^{(τ)} Z^{⊤} \hat{Σ}^{-1} Z \hat{Ψ}^{(τ)}. \tag{12}$$
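
The predictor in Equation (11) can be sketched for a single condition as below. This is a hedged, per-condition reading of the formula, with $\hat{Σ}_{i} = Z_{i} \hat{Ψ} Z_{i}^{⊤} + \widehat{var}(ϵ) I$, and it assumes the standard asymmetric Laplace moments, mean $σ(1 - 2τ)/\{τ(1 - τ)\}$ and variance $σ^{2}\{τ^{2} + (1 - τ)^{2}\}/\{τ^{2}(1 - τ)^{2}\}$. All function names and numerical inputs are illustrative.

```python
import numpy as np

def al_error_mean(sigma, tau):
    """Mean of an AL(0, sigma, tau) error."""
    return sigma * (1 - 2 * tau) / (tau * (1 - tau))

def al_error_var(sigma, tau):
    """Variance of an AL(0, sigma, tau) error (note the squared scale)."""
    return sigma ** 2 * (tau ** 2 + (1 - tau) ** 2) / (tau ** 2 * (1 - tau) ** 2)

def blp_random_effects(y_i, B_i, Z_i, beta_hat, Psi_hat, sigma_hat, tau):
    """Per-condition best linear predictor of U_i, following the form of Equation (11)."""
    Psi_hat = np.atleast_2d(Psi_hat)
    k_i = len(y_i)
    Sigma_i = Z_i @ Psi_hat @ Z_i.T + al_error_var(sigma_hat, tau) * np.eye(k_i)
    centered = y_i - B_i @ beta_hat - al_error_mean(sigma_hat, tau)
    return Psi_hat @ Z_i.T @ np.linalg.solve(Sigma_i, centered)

# Toy random intercept example at the median (tau = 0.5), where the error mean is zero
t = np.linspace(0.0, 1.0, 4)
B_i = np.vander(t, 4, increasing=True)
Z_i = np.ones((4, 1))
beta_hat = np.array([1.0, 0.5, 0.0, 0.0])
y_i = B_i @ beta_hat + 1.8            # every residual equals 1.8
u_hat = blp_random_effects(y_i, B_i, Z_i, beta_hat, Psi_hat=4.0, sigma_hat=0.5, tau=0.5)
print(u_hat)
```

In this toy case the predictor is the common residual shrunk toward zero by the factor $kψ^{2}/(kψ^{2} + \widehat{var}(ϵ))$, which is the usual shrinkage behavior of a random intercept predictor.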

Asymptotically, $\hat{Q}^{(τ)}(t) \sim N(Q^{(τ)}(t), \hat{V}_{Q^{(τ)}}(t))$ with

$$\hat{V}_{Q^{(τ)}}(t) = \widehat{var}(\hat{Q}^{(τ)}(t)) = (B^{⊤}(t), Z^{⊤}(t)) \, \widehat{cov}(\hat{β}^{(τ)}, \hat{U}^{(τ)}) \, (B^{⊤}(t), Z^{⊤}(t))^{⊤}.$$

From the asymptotic normality of $\hat{Q}^{(τ)}(t)$, the approximate $(1 - α)100\%$ confidence interval for the quantile function $Q^{(τ)}(t)$ can be constructed as follows:

$$\hat{Q}^{(τ)}(t) \pm z_{\frac{α}{2}} \sqrt{\hat{V}_{Q^{(τ)}}(t)}, \tag{13}$$

where $z_{\frac{α}{2}}$ is the $100(1 - \frac{α}{2})$th percentile of the standard normal distribution. Although the asymptotic covariance matrix for $\hat{U}^{(τ)}$ has the closed form (12), there is no closed-form expression for the covariance matrix of the estimator $\hat{β}^{(τ)}$, and thus no expression is available for the estimate of the covariance $\widehat{cov}(\hat{β}^{(τ)}, \hat{U}^{(τ)})$. However, the bootstrap is a powerful method that can be used to estimate the covariance matrix of the estimators $(\hat{β}^{(τ)}, \hat{U}^{(τ)})$. Here, we use a block bootstrap approach to find the estimate $\widehat{cov}(\hat{β}^{(τ)}, \hat{U}^{(τ)})$.

The procedure is as follows.

1. Obtain $R$ bootstrap samples from the original data $\{Y_{i}(t_{ij}), B(t_{ij}), Z(t_{ij});\ j = 1, 2, \dots, k_{i};\ i = 1, 2, \dots, n\}$.
2. From each bootstrap sample, find the estimated values for the parameters $β^{(τ)}, α^{(τ)}$ and $σ^{(τ)}$, and then calculate the value of $\hat{U}^{(τ)}$ by using formula (11). Denote the obtained values as $\hat{β}_{1}^{(τ)}, \dots, \hat{β}_{R}^{(τ)}$; $\hat{α}_{1}^{(τ)}, \dots, \hat{α}_{R}^{(τ)}$; $\hat{σ}_{1}^{(τ)}, \dots, \hat{σ}_{R}^{(τ)}$ and $\hat{U}_{1}^{(τ)}, \dots, \hat{U}_{R}^{(τ)}$.
3. Set $\hat{ϕ}_{r}^{(τ)} = (\hat{β}_{r}^{(τ)⊤}, \hat{U}_{r}^{(τ)⊤})^{⊤}$ and $\hat{ϑ}_{r}^{(τ)} = (\hat{β}_{r}^{(τ)⊤}, \hat{α}_{r}^{(τ)}, \hat{σ}_{r}^{(τ)})^{⊤}$, $r = 1, 2, \dots, R$, and calculate the sample means of the $R$ bootstrap estimates of $ϕ^{(τ)} = (β^{(τ)⊤}, U^{(τ)⊤})^{⊤}$ (fixed effects parameters and random effects predictors) and $ϑ^{(τ)} = (β^{(τ)⊤}, α^{(τ)}, σ^{(τ)})^{⊤}$:

$$\bar{ϕ}^{(τ)} = \frac{1}{R} \sum_{r=1}^{R} \hat{ϕ}_{r}^{(τ)}, \quad \bar{ϑ}^{(τ)} = \frac{1}{R} \sum_{r=1}^{R} \hat{ϑ}_{r}^{(τ)}. \tag{14}$$
4. The bootstrap estimators of the covariance matrices of $(\hat{β}^{(τ)}, \hat{U}^{(τ)})$ and $(\hat{β}^{(τ)}, \hat{α}^{(τ)}, \hat{σ}^{(τ)})$ can then be written as

$$\hat{V}_{ϕ^{(τ)}} = \widehat{cov}(\hat{β}^{(τ)}, \hat{U}^{(τ)}) = \frac{1}{R - 1} \sum_{r=1}^{R} (\hat{ϕ}_{r}^{(τ)} - \bar{ϕ}^{(τ)}) (\hat{ϕ}_{r}^{(τ)} - \bar{ϕ}^{(τ)})^{⊤}, \tag{15}$$

$$\hat{V}_{ϑ^{(τ)}} = \widehat{cov}(\hat{β}^{(τ)}, \hat{α}^{(τ)}, \hat{σ}^{(τ)}) = \frac{1}{R - 1} \sum_{r=1}^{R} (\hat{ϑ}_{r}^{(τ)} - \bar{ϑ}^{(τ)}) (\hat{ϑ}_{r}^{(τ)} - \bar{ϑ}^{(τ)})^{⊤}. \tag{16}$$
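
The bootstrap steps above, together with the interval in Equation (13), can be sketched as follows. Two assumptions are made loudly here: the block bootstrap is taken to resample whole conditions with replacement, and, because no linear quantile mixed model fitter is available in this sketch, an ordinary least-squares polynomial fit (`stand_in_estimator`) replaces the actual quantile mixed model estimator. Only the resampling and covariance bookkeeping mirror Equations (14) and (15).

```python
import numpy as np

def block_bootstrap_cov(blocks, estimator, R=500, seed=0):
    """Resample whole conditions with replacement, re-estimate, and return the
    bootstrap mean (Equation (14)) and covariance (Equation (15)) of the estimates."""
    rng = np.random.default_rng(seed)
    n = len(blocks)
    estimates = np.asarray([
        estimator([blocks[i] for i in rng.integers(0, n, size=n)])
        for _ in range(R)
    ])
    mean = estimates.mean(axis=0)
    centered = estimates - mean
    return mean, centered.T @ centered / (R - 1)

def stand_in_estimator(blocks):
    """Placeholder estimator (pooled cubic least squares); NOT the LQMM fit."""
    t = np.concatenate([b[0] for b in blocks])
    y = np.concatenate([b[1] for b in blocks])
    coef, *_ = np.linalg.lstsq(np.vander(t, 4, increasing=True), y, rcond=None)
    return coef

# Toy data: 20 conditions, 10 timepoints each, with a random intercept per condition
rng = np.random.default_rng(42)
t = np.linspace(0.0, 1.0, 10)
blocks = [(t, np.exp(5 * t / (1 + t ** 3)) + rng.normal(0, 2) + rng.normal(0, 1, size=10))
          for _ in range(20)]

mean_hat, V_hat = block_bootstrap_cov(blocks, stand_in_estimator, R=200)

# Pointwise 95% interval at t0, in the spirit of Equation (13)
t0 = 0.5
b0 = np.array([1.0, t0, t0 ** 2, t0 ** 3])
center = b0 @ stand_in_estimator(blocks)
half_width = 1.959964 * np.sqrt(b0 @ V_hat @ b0)   # z_{0.025}
lower, upper = center - half_width, center + half_width
print(lower, upper)
```

Resampling whole conditions keeps the within-condition correlation intact in every bootstrap sample, which is the reason for bootstrapping blocks rather than individual observations.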

Furthermore, the estimates of model parameters and covariance matrix of estimators are computed by using the ‘lqmm’ package developed by Geraci and Bottai [17] in the R programming environment. Generally, this ‘lqmm’ package is used to estimate conditional quantile functions with random effects in linear quantile mixed models.

2.2. Test of the Similarity of Quantile Functions for Two Gene Expressions

The other purpose of this research is to identify gene similarity based on quantile functions. Using the results obtained in Section 2.1, the estimates of quantile functions for gene expressions can be obtained. The estimated quantile functions for $g$ genes can be expressed as

$$\hat{Q}_{h}^{(τ)}(t) = [B(t)]^{⊤} \hat{β}_{h}^{(τ)} + [Z(t)]^{⊤} \hat{U}_{h}^{(τ)}; \quad h = 1, 2, \dots, g.$$

Two genes, $h$ and $s$, are said to be similar if their quantile curves are equal, i.e.,

$$Q_{h}^{(τ)}(t) = Q_{s}^{(τ)}(t); \quad t \in [0, T], \ T > 0.$$

Therefore, the proposed hypothesis is

$$H_{0} : Q_{h}^{(τ)}(t) = Q_{s}^{(τ)}(t) \quad \text{vs} \quad H_{1} : Q_{h}^{(τ)}(t) \neq Q_{s}^{(τ)}(t). \tag{17}$$

Now, suppose that all quantile functions share the same truncated power basis functions $B(t)$ and $Z(t)$. Then, the $τ$-quantile curves of gene expressions depend on the fixed effects parameter $β^{(τ)}$ and the random effects components $U^{(τ)}$, and testing the similarity of two gene expressions is equivalent to testing the equality of the corresponding fixed effects parameters and random effects components. Further, the predictors of the random effects components depend on the estimates of the parameters $β^{(τ)⊤}, α^{(τ)}, σ^{(τ)}$. Thus, instead of testing the hypothesis (17), one can test the following hypothesis related to the fixed effects parameters and the random effects components:

$$\begin{aligned} H_{0} &: (β_{h}^{(τ)⊤}, α_{h}^{(τ)}, σ_{h}^{(τ)}) = (β_{s}^{(τ)⊤}, α_{s}^{(τ)}, σ_{s}^{(τ)}) \quad \text{vs} \\ H_{1} &: (β_{h}^{(τ)⊤}, α_{h}^{(τ)}, σ_{h}^{(τ)}) \neq (β_{s}^{(τ)⊤}, α_{s}^{(τ)}, σ_{s}^{(τ)}). \end{aligned} \tag{18}$$

Based on the parameter estimates and their covariance matrices, the following asymptotic statistic for testing the hypothesis $H_{0} : (β_{h}^{(τ)⊤}, α_{h}^{(τ)}, σ_{h}^{(τ)}) = (β_{s}^{(τ)⊤}, α_{s}^{(τ)}, σ_{s}^{(τ)})$; $(h \neq s)$ is given by

$$χ_{hs}^{2(τ)} = \begin{pmatrix} \hat{β}_{h}^{(τ)} - \hat{β}_{s}^{(τ)} \\ \hat{α}_{h}^{(τ)} - \hat{α}_{s}^{(τ)} \\ \hat{σ}_{h}^{(τ)} - \hat{σ}_{s}^{(τ)} \end{pmatrix}^{⊤} Ξ \begin{pmatrix} \hat{β}_{h}^{(τ)} - \hat{β}_{s}^{(τ)} \\ \hat{α}_{h}^{(τ)} - \hat{α}_{s}^{(τ)} \\ \hat{σ}_{h}^{(τ)} - \hat{σ}_{s}^{(τ)} \end{pmatrix}, \tag{19}$$

where $Ξ = \left(\hat{V}^{h}_{β_{h}^{(τ)⊤}, α_{h}^{(τ)}, σ_{h}^{(τ)}} + \hat{V}^{s}_{β_{s}^{(τ)⊤}, α_{s}^{(τ)}, σ_{s}^{(τ)}}\right)^{-1}$, and $\hat{V}^{h}_{β_{h}^{(τ)⊤}, α_{h}^{(τ)}, σ_{h}^{(τ)}}$ and $\hat{V}^{s}_{β_{s}^{(τ)⊤}, α_{s}^{(τ)}, σ_{s}^{(τ)}}$ are the estimated variance–covariance matrices of the estimators $(\hat{β}_{h}^{(τ)⊤}, \hat{α}_{h}^{(τ)}, \hat{σ}_{h}^{(τ)})$ and $(\hat{β}_{s}^{(τ)⊤}, \hat{α}_{s}^{(τ)}, \hat{σ}_{s}^{(τ)})$, respectively, which can be computed from (16). When $H_{0}$ holds, $χ_{hs}^{2(τ)}$ in Equation (19) has an asymptotic chi-squared distribution with $(r + q + 1)$ degrees of freedom, where $r$ is the dimension of $β^{(τ)}$ for the fixed effects parameters and $q$ is the dimension of $α^{(τ)}$ for the random effects components.
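
The statistic in Equation (19) amounts to a quadratic form in the difference of the two parameter estimates. The sketch below computes it together with an approximate p-value. The chi-squared upper tail is evaluated with a plain series for the regularized incomplete gamma function so that the example needs only NumPy and the standard library; this series is adequate for moderate values of the statistic but is not a production-quality tail routine. All names and inputs are hypothetical.

```python
import math
import numpy as np

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared(df) variable via a power series
    for the regularized lower incomplete gamma function (illustrative only)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 10_000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def similarity_test(theta_h, V_h, theta_s, V_s):
    """Quadratic-form statistic of Equation (19) and its asymptotic p-value.
    theta_* stacks (beta, alpha, sigma); V_* are their bootstrap covariances."""
    d = np.asarray(theta_h, dtype=float) - np.asarray(theta_s, dtype=float)
    stat = float(d @ np.linalg.solve(np.asarray(V_h) + np.asarray(V_s), d))
    df = d.size                      # r + q + 1 in the notation of the text
    return stat, df, chi2_sf(stat, df)

# Toy example with two-dimensional parameter vectors
V = 0.5 * np.eye(2)
stat, df, p_value = similarity_test([1.0, 0.0], V, [0.0, 0.0], V)
print(stat, df, p_value)
```

Summing the two covariance matrices treats the estimates for genes $h$ and $s$ as independent, which is how $Ξ$ is defined in the text.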

Furthermore, since the quantile function of gene expression is determined by the $r$-dimensional fixed parameters, the $q$-dimensional random effects components, and the scale parameter $σ^{(τ)}$, the pattern of gene expression $h$ can be represented as a random point in the $(r + q + 1)$-dimensional Euclidean space, which has an asymptotic multivariate normal distribution with mean $(β_{h}^{(τ)}, α_{h}^{(τ)}, σ_{h}^{(τ)})$ and covariance matrix $\hat{V}^{h}_{β_{h}^{(τ)}, α_{h}^{(τ)}, σ_{h}^{(τ)}}$ for $h = 1, \dots, g$. From this point of view, the statistic $χ_{hs}^{2(τ)}$ in (19) is also the Mahalanobis distance between the random vectors $(\hat{β}_{h}^{(τ)}, \hat{α}_{h}^{(τ)}, \hat{σ}_{h}^{(τ)})$ and $(\hat{β}_{s}^{(τ)}, \hat{α}_{s}^{(τ)}, \hat{σ}_{s}^{(τ)})$. In terms of the minimum Mahalanobis distance, we can obtain the clustering tree for $g$ gene expressions.
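
One possible reading of "minimum Mahalanobis distance" clustering is single-linkage agglomeration on the pairwise statistics from Equation (19). The text does not spell out the linkage rule, so the sketch below should be taken as an assumption-laden illustration rather than the authors' procedure; the function names and toy inputs are hypothetical.

```python
import numpy as np

def mahalanobis_matrix(thetas, covs):
    """Pairwise distances d' (V_h + V_s)^{-1} d between parameter estimates."""
    g = len(thetas)
    D = np.zeros((g, g))
    for h in range(g):
        for s in range(h + 1, g):
            d = np.asarray(thetas[h], dtype=float) - np.asarray(thetas[s], dtype=float)
            D[h, s] = D[s, h] = d @ np.linalg.solve(covs[h] + covs[s], d)
    return D

def single_linkage_merges(D):
    """Repeatedly merge the two clusters with the smallest between-cluster
    minimum distance; returns the merge history (a simple clustering tree)."""
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), float(dist)))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy example: genes 0 and 1 are close, gene 2 is far from both
thetas = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([3.0, 3.0])]
covs = [0.5 * np.eye(2)] * 3
D = mahalanobis_matrix(thetas, covs)
merges = single_linkage_merges(D)
for left, right, dist in merges:
    print(left, right, round(dist, 4))
```

The merge history can be drawn as a dendrogram, with the merge distances giving the heights of the joins.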

3. Simulation

In this section, we examine the performance of the estimation and chi-square test (19) based on the proposed quantile model for analyzing temporal gene expression data. Extensive simulation studies are carried out to evaluate the accuracy of the estimated quantile curves and the power of the proposed chi-square test (19). The random intercept QR model and random slope QR model are chosen with five different error terms, in which the error terms are chosen from symmetric distribution (normal), symmetric and heavy tailed distribution (Laplace and Student t) and skewed distribution (skew normal and skew t). Hence, there are ten different scenarios to examine through the simulation study. The steps are given below.

Thirty-five equally spaced timepoints in $[0, 1]$ are considered for 50 samples. The data are generated from the following true mixed model:

$$Y_{i}(t) = f_{0}(t) + U_{i}(t) + ϵ_{i}(t); \quad t \in [0, 1], \tag{20}$$

where $f_{0}(t) = \exp\left(\frac{5t}{1 + t^{3}}\right)$, $i = 1, \dots, 50$.
Random effects components are assumed to follow a normal distribution with a mean of zero and a standard deviation of two, i.e., $U_{i} \sim N(0, 4)$.
The following i.i.d. errors are considered for the true model in Equation (20).
1. Model 1: $ϵ_{i}(t) \sim$ Laplace(0,1)
2. Model 2: $ϵ_{i}(t) \sim$ Normal(0,1)
3. Model 3: $ϵ_{i}(t) \sim$ Skew Normal(0,1,1)
4. Model 4: $ϵ_{i}(t) \sim$ Skew t(0,1,1,4)
5. Model 5: $ϵ_{i}(t) \sim$ t(3)
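
A data-generating sketch for the true model in Equation (20) is given below. Several parameterizations are assumptions on our part: Laplace(0,1) is read as location 0 and scale 1, Skew Normal(0,1,1) as location 0, scale 1 and shape 1, and Skew t(0,1,1,4) as a skew normal variate with shape 1 divided by an independent $\sqrt{χ^{2}_{4}/4}$ variate. The random effect is taken to be a random intercept that is constant over time within a sample.

```python
import numpy as np

def f0(t):
    """True mean curve of Equation (20)."""
    t = np.asarray(t, dtype=float)
    return np.exp(5 * t / (1 + t ** 3))

def skew_normal(shape, size, rng):
    """Standard skew normal via delta*|Z0| + sqrt(1 - delta^2)*Z1."""
    delta = shape / np.sqrt(1 + shape ** 2)
    return delta * np.abs(rng.normal(size=size)) + np.sqrt(1 - delta ** 2) * rng.normal(size=size)

def draw_errors(model, size, rng):
    """Errors for Models 1 to 5 under the parameterizations assumed above."""
    if model == 1:
        return rng.laplace(0.0, 1.0, size)
    if model == 2:
        return rng.normal(0.0, 1.0, size)
    if model == 3:
        return skew_normal(1.0, size, rng)
    if model == 4:
        nu = 4
        return skew_normal(1.0, size, rng) / np.sqrt(rng.chisquare(nu, size) / nu)
    if model == 5:
        return rng.standard_t(3, size)
    raise ValueError("model must be an integer from 1 to 5")

def simulate(model, n=50, k=35, sd_u=2.0, seed=0):
    """Return timepoints t (length k) and an n-by-k matrix of responses."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, k)
    U = rng.normal(0.0, sd_u, size=(n, 1))          # U_i ~ N(0, 4), one per sample
    Y = f0(t)[None, :] + U + draw_errors(model, (n, k), rng)
    return t, Y

t, Y = simulate(model=1)
print(t.shape, Y.shape)
```

Calling `simulate` with `model` set to 1 through 5 gives the five error scenarios; each row of `Y` is one sample observed at the 35 timepoints.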

3.1. Model and Parameter Estimation

For model generalization, the following quantile mixed model is considered:

$$Y_{i}(t) = B^{⊤}(t) β^{(τ)} + Z^{⊤}(t) U_{i} + ϵ_{i}^{(τ)}, \tag{21}$$

where $B(t)$ and $Z(t)$ are the basis functions for the fixed and random effects components, respectively. The parameter estimation procedure is as follows.

Generate data for $n = 50$ samples using Equation (20).
Basis functions for Model (21) are considered as:
1. Random intercept model: $B(t) = (1, t, t^{2}, t^{3})^{⊤}$, $Z(t) = (1)$.
2. Random slope model: $B(t) = (1, t, t^{2}, t^{3})^{⊤}$, $Z(t) = (1, t)^{⊤}$.
For the random slope model, we consider normal random effects with a diagonal variance–covariance matrix. A Gauss–Hermite quadrature with seven nodes is used to approximate the marginal log-likelihood in Equation (10).
All parameters are estimated at the median ($τ = 0.50$).
A total of 500 bootstrap replications are used to estimate the covariance matrix of the estimators $(\hat{β}^{(τ)}, \hat{U}^{(τ)})$.

Note that, for simplicity, interior knots are not added to the basis function $B(t)$ in this simulation study. The simulations were also conducted for the basis function $B(t)$ with interior knots, and the results are similar to those without interior knots. Furthermore, the simulations for the other values of the quantile $τ$ are omitted.

3.2. Simulation Results for Parameter Estimation

3.2.1. Parameter Estimates of Random Intercept QR Model

Parameter estimates for the 50th quantile ($τ = 0.50$) of the random intercept QR model are reported below for five different scenarios. The estimated median function, along with the confidence interval of each random intercept QR model, is presented in Figures 1–5. The estimated 50th ($τ = 0.50$) quantile function is also compared with the sample median function in each figure.

Estimates of parameters in Model 1: ${\hat{β}}^{(τ)} = {[1.19, - 5.06, 68.86, - 53.29]}^{⊤}$ , ${\hat{α}}^{(τ)} = 1.23$ , ${\hat{σ}}^{(τ)} = 0.56$ , ${\hat{Ψ}}^{(τ)} = 1.52$ .
Estimates of parameters in Model 2: ${\hat{β}}^{(τ)} = {[1.68, - 7.42, 74.74, - 56.87]}^{⊤}$ , ${\hat{α}}^{(τ)} = 1.01$ , ${\hat{σ}}^{(τ)} = 0.43$ , ${\hat{Ψ}}^{(τ)} = 1.02$ .
Estimates of parameters in Model 3: ${\hat{β}}^{(τ)} = {[2.03, - 7.46, 74.81, - 57.17]}^{⊤}$ , ${\hat{α}}^{(τ)} = 1.14$ , ${\hat{σ}}^{(τ)} = 0.36$ , ${\hat{Ψ}}^{(τ)} = 1.29$ .
Estimates of parameters in Model 4: ${\hat{β}}^{(τ)} = {[1.88, - 7.85, 76.28, - 57.92]}^{⊤}$ , ${\hat{α}}^{(τ)} = 1.41$ , ${\hat{σ}}^{(τ)} = 0.49$ , ${\hat{Ψ}}^{(τ)} = 1.98$ .
Estimates of parameters in Model 5: ${\hat{β}}^{(τ)} = {[1.29, - 8.70, 76.78, - 57.78]}^{⊤}$ , ${\hat{α}}^{(τ)} = 1.42$ , ${\hat{σ}}^{(τ)} = 0.57$ , ${\hat{Ψ}}^{(τ)} = 2.01$ .

3.2.2. Parameter Estimates of Random Slope QR Model

Now, the estimates of parameters for the 50th quantile ($τ = 0.50$) of the random slope QR model are reported below for five different scenarios. The estimated median functions and sample median functions, along with confidence intervals, are shown in Figures 6–10.

Estimates of parameters in Model 1: ${\hat{β}}^{(τ)} = {[1.27, - 5.08, 68.86, - 53.31]}^{⊤}$ , ${\hat{α}}^{(τ)} = {[1.35, 0.67]}^{⊤}$ , ${\hat{σ}}^{(τ)} = 0.54$ , ${\hat{Ψ}}_{intercept}^{(τ)} = 1.83, {\hat{Ψ}}_{time}^{(τ)} = 0.45$ .
Estimates of parameters in Model 2: ${\hat{β}}^{(τ)} = {[0.98, - 6.43, 72.80, - 55.64]}^{⊤}$ , ${\hat{α}}^{(τ)} = {[1.29, 0.68]}^{⊤}$ , ${\hat{σ}}^{(τ)} = 0.41$ , ${\hat{Ψ}}_{intercept}^{(τ)} = 1.68, {\hat{Ψ}}_{time}^{(τ)} = 0.46$ .
Estimates of parameters in Model 3: ${\hat{β}}^{(τ)} = {[1.92, - 7.43, 74.85, - 57.15]}^{⊤}$ , ${\hat{α}}^{(τ)} = {[1.79, 0.73]}^{⊤}$ , ${\hat{σ}}^{(τ)} = 0.35$ , ${\hat{Ψ}}_{intercept}^{(τ)} = 3.20, {\hat{Ψ}}_{time}^{(τ)} = 0.53$ .
Estimates of parameters in Model 4: ${\hat{β}}^{(τ)} = {[2.30, - 7.83, 76.31, - 57.87]}^{⊤}$ , ${\hat{α}}^{(τ)} = {[1.67, 1.06]}^{⊤}$ , ${\hat{σ}}^{(τ)} = 0.46$ , ${\hat{Ψ}}_{intercept}^{(τ)} = 2.79, {\hat{Ψ}}_{time}^{(τ)} = 1.13$ .

Estimates of parameters in Model 5: ${\hat{β}}^{(τ)} = {[1.25, - 8.53, 76.79, - 57.92]}^{⊤}$ , ${\hat{α}}^{(τ)} = {[1.47, 0.53]}^{⊤}$ , ${\hat{σ}}^{(τ)} = 0.57$ , ${\hat{Ψ}}_{intercept}^{(τ)} = 2.16, {\hat{Ψ}}_{time}^{(τ)} = 0.29$ .

3.3. Power Analysis of Proposed Test Statistic

In this section, the power of the chi-square test presented in Equation (19) is evaluated under an alternative model, which is obtained by adding a constant m to the true model.

The alternative model considered in simulation is as follows:

$$[Y_{i}(t)]_{new} = \exp\left(\frac{5t}{1 + t^{3}}\right) + U_{i}(t) + ϵ_{i}(t) + m \tag{22}$$

where $m \in \{0, 0.50, 1.00, 1.50, 2.00, 2.50, 3.00\}$. The $L_{2}$ distance between the true model and the alternative model is m. Here, we assume that random effects ($U_{i}$) and error ($ϵ_{i}$) follow the same distribution. The power analysis is carried out for both the random slope QR model and the random intercept QR model. All simulations are performed for sample sizes of $n = 30$, 50 and 75 at a significance level of $α = 0.05$ and quantile values of $τ = 0.25$, 0.50 and 0.75, with 1000 replications. Table 1, Table 2 and Table 3 report empirical powers for five random intercept QR models and five random slope models at the quantiles $τ = 0.25$, 0.50 and 0.75. Power analysis results demonstrate that the statistical performance of the proposed chi-square test statistic is good for both random intercept models and random slope models at a 5% level of significance. This indicates that this test statistic can be applied to determine the similarity of two gene expressions.
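The logic of an empirical power calculation can be sketched as follows. This is not the test of Equation (19): for self-containment, a simple stand-in chi-square statistic with one degree of freedom (the squared standardized difference of grand means) replaces it. The time grid, the standard normal distributions of $U_{i}$ and $ϵ_{i}(t)$, and all function names are assumptions made for illustration.

```python
import math
import random

TIMES = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical time grid
MEAN_CURVE = [math.exp(5.0 * t / (1.0 + t ** 3)) for t in TIMES]
CHI2_1_CRIT = 3.841  # 95th percentile of the chi-square law with 1 df


def simulate_group(n, m, rng):
    """Draw n curves from Equation (22) with a random intercept.

    U_i ~ N(0, 1) and eps_i(t) ~ N(0, 1) are assumptions of this sketch.
    """
    group = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        group.append([mu + u + rng.gauss(0.0, 1.0) + m for mu in MEAN_CURVE])
    return group


def grand_mean(group):
    return sum(sum(curve) for curve in group) / (len(group) * len(TIMES))


def standin_statistic(group_a, group_b):
    """Stand-in test: squared z-statistic for equal grand means, using the
    known variance 2 * (1 + 1/T) / n implied by the assumptions above."""
    n, n_times = len(group_a), len(TIMES)
    diff = grand_mean(group_b) - grand_mean(group_a)
    variance = 2.0 * (1.0 + 1.0 / n_times) / n
    return diff * diff / variance


def empirical_power(n, m, n_rep=1000, seed=0):
    """Share of replications in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        true_group = simulate_group(n, 0.0, rng)
        shifted_group = simulate_group(n, m, rng)
        if standin_statistic(true_group, shifted_group) > CHI2_1_CRIT:
            rejections += 1
    return rejections / n_rep


shifts = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
power_by_shift = {m: empirical_power(30, m) for m in shifts}
```

At $m = 0$ the rejection rate estimates the size of the test and should be close to the nominal level; for larger m it estimates the power.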

4. Application: Gene Expression Data

In this section, the linear quantile mixed model given in Section 2 is applied to analyze a real dataset of 18 gene expressions in P. aeruginosa expressed in 24 conditions. Some descriptive measures are discussed at the beginning of this section. Towards the end of this section, the similarities between genes are investigated using the proposed test statistic (Equation (19)) based on quantile functions.

4.1. Data

A description of the 18 genes is reported in Table 4. The data observed for these genes were first analyzed by Fang et al. [1] and then by Deng et al. [10]. Initially, polymerase chain reaction (PCR) was performed to amplify regions of P. aeruginosa virulence factors. The primers were synthesized using PAO1 genome data according to Duan et al. [18]. PAO1 chromosomal DNA was used as a PCR template and amplified promoter regions were cloned into XhoI-BamHI restriction sites of the plasmid PMS402. Then, plasmids were inserted into PAO1 using electroporation. More details regarding DNA manipulation, PCR and transformation procedures can be found in Duan et al. [18]. The promoter activity was measured as counts per second (CPS) of light production using a Victor2 Multilabel counter. TSBDC minimal medium containing EDTA (400 μg/mL) and FeCl$_{3}$ (50 μg/mL) were used to assay gene expression. The reporter strains were grown overnight and the resulting culture was diluted 1:200 in a 96-well microtiter plate. Then, the promoter activity of the virulence factors was measured under 24 biological conditions every 30 min for 21 h. These genes were considered as quorum-sensing or quorum-sensing-regulated genes, which play an important role in biofilm formation. Thus, this dataset consists of 18 genes in P. aeruginosa (Table 4) and each gene was observed at 43 consecutive timepoints under 24 conditions. Therefore, the data for each gene consist of 1032 observations and, in total, 18,576 observations are measured for 18 genes.
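The dataset dimensions stated above follow directly from the sampling design, as the short check below shows (variable names are illustrative).

```python
# Measurements every 0.5 h from 0 h to 21 h inclusive.
n_timepoints = int(21 / 0.5) + 1
n_conditions = 24
n_genes = 18

obs_per_gene = n_timepoints * n_conditions  # observations for one gene
obs_total = obs_per_gene * n_genes          # observations for all genes
```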

4.2. Exploratory Analysis

In this section, a summary of the selected gene expression data is provided. Gene expressions, $g(t)$, are obtained on a log scale under 24 conditions. Figure 11 exhibits the gene expression curves of genes PA2975(rluc) and PA0573, respectively. From this figure, it is evident that the expressions for genes PA2975(rluc) and PA0573 are very different under different biological conditions. Figure 12 shows histograms of the expressions of PA2975(rluc) and PA0573, respectively. This figure reveals that the data are severely negatively skewed and show signs of non-normality. Figure 13 and Figure 14 present the distributions of expressions for PA2975(rluc) and PA0573 with respect to time and conditions, respectively. Both box-plots demonstrate that heterogeneity is present in both PA2975(rluc)(B3) and PA0573(E6) gene expressions. More importantly, both parts of each figure show that the gene expression dataset contains outliers. Hence, Figure 12, Figure 13 and Figure 14 suggest that the normality assumption of the linear mixed model may be inappropriate for analyzing gene expression data. For this reason, quantile models are used to determine the gene trajectory at different quantile values. This method might provide more insightful results than the mean regression model.

4.3. Model and Parameter Estimates

Let $Y_{i}(t)$ be the measurement for condition i at time t. The linear quantile mixed model is considered for estimation purposes. Since no additional covariates are available, polynomial basis functions are considered for fixed effects in this analysis.

$$Y_{i}(t) = [B(t)]^{⊤} β^{(τ)} + [Z(t)]^{⊤} U^{(τ)} + ϵ_{i}^{(τ)}(t); \quad i = 1, \dots, 24; \quad t = 0, 0.5, \dots, 21. \tag{23}$$

Random Intercept Model: $B (t) = {(1, t, t^{2}, t^{3}, t^{4})}^{⊤}$ and $Z (t) = (1)$ .
Random Slope Model: $B (t) = {(1, t, t^{2}, t^{3}, t^{4})}^{⊤}$ and $Z (t) = {(1, t)}^{⊤}$ .
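The two design structures can be written out explicitly. The sketch below builds $B(t)$ and $Z(t)$ on the observation grid and evaluates the linear predictor of Equation (23) without the error term; the function names and the coefficient values in the example are hypothetical, not estimates from this study.

```python
def fixed_basis(t, degree=4):
    """Polynomial basis B(t) = (1, t, t^2, ..., t^degree)."""
    return [t ** k for k in range(degree + 1)]


def random_design(t, random_slope=False):
    """Z(t) = (1) for the random intercept model, (1, t) for random slope."""
    return [1.0, t] if random_slope else [1.0]


def linear_predictor(t, beta, u, random_slope=False):
    """B(t)' beta + Z(t)' u, i.e. Equation (23) without the error term."""
    fixed = sum(b * c for b, c in zip(fixed_basis(t), beta))
    rand = sum(z * c for z, c in zip(random_design(t, random_slope), u))
    return fixed + rand


# Observation grid: t = 0, 0.5, ..., 21.
time_grid = [0.5 * k for k in range(43)]

# Hypothetical coefficients, for illustration only.
value = linear_predictor(2.0, [1.0, 0.0, 0.0, 0.0, 0.5], [0.25])
```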

For each gene expression, the same fixed effects and random effects structures are considered when estimating the parameters in both the random intercept model and random slope model. However, AIC is found to be lower for the random intercept model than for the random slope model. Thus, the random intercept model is used for further analysis, instead of the random slope model. Table 5 reports the values of estimated parameters for quantile functions with quantile $τ = 0.25, 0.50, 0.75$ for PA2975(rluc)(B3) and PA0573(E6). Figure 15 and Figure 16 show the estimated quantile functions and confidence intervals for genes PA2975(rluc)(B3) and PA0573(E6), respectively. Figure 17 and Figure 18 present the estimated quantile functions, along with the sample quantile functions, for gene PA2975(rluc)(B3) and PA0573(E6). Moreover, Figure 19 shows the estimated median functions for 18 genes in P. aeruginosa.

4.4. Test of Gene Similarity

To find the similarity between genes, pairwise comparisons of all the gene expressions of 18 genes in P. aeruginosa were investigated in terms of their quantile functions, since gene expression curves depend on fixed effects parameters and random effects components. For $τ = 0.50$, the following hypotheses were tested.

$$\begin{aligned} H_{0} &: (β_{h}^{(τ)}, U_{h}^{(τ)}) = (β_{s}^{(τ)}, U_{s}^{(τ)}) \quad \text{vs} \\ H_{1} &: (β_{h}^{(τ)}, U_{h}^{(τ)}) \neq (β_{s}^{(τ)}, U_{s}^{(τ)}) \end{aligned} \tag{24}$$

where $β_{h}^{(τ)}$ and $β_{s}^{(τ)}$ are the sets of parameters $(β_{0}, β_{1}, β_{2}, β_{3}, β_{4})$ for the hth gene and the sth gene. When $H_{0}$ holds, the statistic $χ_{hs}^{2(τ)}$ in Equation (19) has an asymptotic chi-squared distribution with $(r + 1)$ degrees of freedom, where r is known as the basis of the regression model. Table 6 reports the Mahalanobis distance between 18 genes using the results of parameter estimates and chi-square statistic for $τ = 0.50$. Moreover, considering a significance level of $α = 0.05$, Table 7 reports gene similarity. Figure 20 presents the clustering tree of 18 genes in P. aeruginosa. Based on Figure 20 and Table 7, it can be said that PA1841(H3) shows no significant difference to PA2975(rluc)(B3), PA4991(B4), PA0573(E6), PA2997(F5) and PA1748(G5). In addition, the quantile function of PA3771(G6) shows no difference to the quantile function of PA0287(C4), PA0573(E6), PA0649(G2). Furthermore, PA3212(F3) shows similarities with PA5283(A6) and PA1875(E5).
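A pairwise comparison of this kind can be sketched with a generic Wald-type (Mahalanobis) statistic. The exact statistic is the one defined in Equation (19); the form below, which assumes independent estimators so that their covariance matrices add, is only an illustrative stand-in, and all numerical values are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def wald_statistic(theta_h, theta_s, cov_h, cov_s):
    """d' (cov_h + cov_s)^{-1} d with d = theta_h - theta_s.

    Assumes the two estimators are independent (an assumption of this
    sketch), so the covariance of the difference is the sum.
    """
    d = [a - b for a, b in zip(theta_h, theta_s)]
    p = len(d)
    pooled = [[cov_h[i][j] + cov_s[i][j] for j in range(p)] for i in range(p)]
    w = solve(pooled, d)
    return sum(di * wi for di, wi in zip(d, w))


# Hypothetical two-parameter example.
theta_h, theta_s = [1.0, 2.0], [0.0, 0.0]
cov_h = [[0.5, 0.0], [0.0, 1.0]]
cov_s = [[0.5, 0.0], [0.0, 1.0]]
stat = wald_statistic(theta_h, theta_s, cov_h, cov_s)

CHI2_2_CRIT = 5.991  # 95th percentile of the chi-square law with 2 df
similar = stat <= CHI2_2_CRIT  # True means H0 is not rejected at 5%
```

Computing such a statistic for every pair of genes gives a distance matrix of the kind that a hierarchical clustering procedure takes as input.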

5. Concluding Remarks

Gene expression analysis usually tracks the expression values of a large number of genes simultaneously under different biological conditions, and the rate of change may not be consistent at different quantiles. This research uses a linear quantile mixed effect model to analyze the gene expression trajectory. Gene expression models are constructed using basis functions and a statistical test is proposed to determine the similarity of the genes based on estimated fixed effects parameters and random effects components from the linear quantile mixed model. In the simulation studies, both the random intercept and random slope models are considered to examine the performance of the proposed estimation procedure and test statistic. The simulation results indicate that the chi-square test statistic performs well in all the scenarios considered when testing the similarity of quantile functions. Finally, as an illustration, a linear quantile mixed model is applied to a real dataset of 18 genes in P. aeruginosa expressed in 24 conditions, and the similarity between genes is investigated using the proposed test statistic based on quantile functions. Moreover, a clustering tree is constructed using a complete linkage method. This research suggests that the proposed test statistic may help bio-medical scientists test the similarity between genes in terms of their quantile functions. A limitation of this research is that, in the simulation study, the estimated quantile functions are compared with sample quantile functions rather than with the true quantile functions. This research can be further extended by considering a changeable variance and covariance structure for the gene expression data.

Author Contributions

Conceptualization, D.D. and M.H.C.; methodology, D.D. and M.H.C.; software, D.D. and M.H.C.; validation, D.D. and M.H.C.; formal analysis, D.D. and M.H.C.; investigation, D.D. and M.H.C.; resources, D.D.; data curation, D.D. and M.H.C.; writing—original draft preparation, D.D. and M.H.C.; writing—review and editing, D.D. and M.H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Restrictions apply to the availability of these data. The data were obtained from another researcher and are available with the permission of that researcher.

Acknowledgments

The authors are very grateful to the editor, associate editor and three referees for their careful reading and valuable comments, which have greatly improved this paper. The first author of this work is partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Fang, H.B.; Deng, D.; Tian, G.L.; Shen, L.; Duan, K.; Song, J. Analysis for temporal gene expressions under multiple biological conditions. Stat. Biosci. 2012, 4, 282–299.
2. Draghici, S.; Kulaeva, O.; Hoff, B.; Petrov, A.; Shams, S.; Tainsky, M.A. Noise sampling method: An ANOVA approach allowing robust selection of differentially regulated genes measured by DNA microarrays. Bioinformatics 2003, 19, 1348–1359.
3. Eisen, M.B.; Spellman, P.T.; Brown, P.O.; Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 1998, 95, 14863–14868.
4. Li, H.; Luan, Y.; Hong, F.; Li, Y. Statistical methods for analysis of time course gene expression data. Front. Biosci. 2002, 7, a90–a98.
5. Yeung, K.Y.; Ruzzo, W.L. Principal component analysis for clustering gene expression data. Bioinformatics 2001, 17, 763–774.
6. Kerr, M.K.; Martin, M.; Churchill, G.A. Analysis of variance for gene expression microarray data. J. Comput. Biol. 2000, 7, 819–837.
7. Storey, J.D.; Tibshirani, R. Statistical methods for identifying differentially expressed genes in DNA microarrays. Methods Mol. Biol. 2003, 224, 149–157.
8. Tusher, V.G.; Tibshirani, R.; Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. USA 2001, 98, 5116–5121.
9. Deng, D.; Jahromi, K.R.; Zhou, Z. Influence of biological conditions to temporal gene expression based on variance analysis. In JSM Proceedings; American Statistical Association: Alexandria, VA, USA, 2017; pp. 786–800.
10. Deng, D.; Fang, H.-B.; Jahromi, K.R.; Song, J.; Tan, M. Detection of threshold points for gene expressions under multiple biological conditions. Stat. Interface 2017, 10, 643–655.
11. Huang, H.; Lee, T.-H. Forecasting Value-at-Risk Using High-Frequency Information. Econometrics 2013, 1, 127–140.
12. Gallardo, D.I.; Bourguignon, M.; Galarza, C.E.; Gómez, H.W. A Parametric Quantile Regression Model for Asymmetric Response Variables on the Real Line. Symmetry 2020, 12, 1938.
13. Jung, C.; Lee, Y.; Lee, J.; Kim, S. Performance Evaluation of the Multiple Quantile Regression Model for Estimating Spatial Soil Moisture after Filtering Soil Moisture Outliers. Remote Sens. 2020, 12, 1678.
14. Chen, L.; Galvao, A.; Song, S. Quantile Regression with Generated Regressors. Econometrics 2021, 9, 16.
15. Donoho, D.L.; Johnstone, J.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455.
16. Zhang, J.T. Order-dependent Thresholding with Applications to Regression Splines. In Contemporary Multivariate Analysis and Design of Experiments; World Scientific Publishing Co. Pte. Ltd.: Singapore, 2005; pp. 397–425.
17. Geraci, M.; Bottai, M. Linear quantile mixed models. Stat. Comput. 2014, 24, 461–479.
18. Duan, K.; Dammel, C.; Stein, J.; Rabin, H.; Surette, M.G. Modulation of Pseudomonas aeruginosa gene expression by host microflora through interspecies communication. Mol. Microbiol. 2003, 50, 1477–1491.

Figure 1. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 1.

Figure 2. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 2.

Figure 3. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 3.

Figure 4. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 4.

Figure 5. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 5.

Figure 6. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 1.

Figure 7. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 2.

Figure 8. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 3.

Figure 9. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 4.

Figure 10. Estimated median function and C.I. (left); Estimated median and sample median (right) for Model 5.

Figure 11. Gene expressions of PA2975(rluc)(B3) (left) and PA0573(E6) (right).

Figure 12. Histograms of gene expressions of PA2975(rluc)(B3) (left) and PA0573(E6) (right).

Figure 13. Box-plots of gene expressions PA2975(rluc)(B3) regarding time (left) and conditions (right).

Figure 14. Box-plots of gene expressions PA0573(E6) regarding time (left) and conditions (right).

Figure 15. Confidence intervals of gene PA2975(rluc)(B3).

Figure 16. Confidence intervals of gene PA0573(E6).

Figure 17. Quantile functions and sample quantile functions of PA2975(rluc)(B3): lower quartile (left); median (center); upper quartile (right).

Figure 18. Quantile functions and sample quantile functions of PA0573(E6): lower quartile (left); median (center); upper quartile (right).

Figure 19. Estimated median functions of 18 genes in P. aeruginosa.

Figure 20. The clustering tree of 18 genes in P. aeruginosa.