A New Distribution Family for Microarray Data †

The traditional approach with microarray data has been to apply transformations that approximately normalize them, with the drawback of losing the original scale. The alternative standpoint taken here is to search for models that fit the data, characterized by the presence of negative values, preserving their scale; one advantage of this strategy is that it facilitates a direct interpretation of the results. A new family of distributions named gpower-normal indexed by p∈R is introduced and it is proven that these variables become normal or truncated normal when a suitable gpower transformation is applied. Expressions are given for moments and quantiles, in terms of the truncated normal density. This new family can be used to model asymmetric data that include non-positive values, as required for microarray analysis. Moreover, it has been proven that the gpower-normal family is a special case of pseudo-dispersion models, inheriting all the good properties of these models, such as asymptotic normality for small variances. A combined maximum likelihood method is proposed to estimate the model parameters, and it is applied to microarray and contamination data. R codes are available from the authors upon request.


Introduction
While analysing microarray intensity measurements, it is usual to find asymmetric distributions with some negative values and the purpose of this article is to model data with these characteristics.
The traditional approach with microarray data has been to apply transformations that approximately normalize them, with the drawback of losing the original scale. The initial transformation applied was log 2 ; it allows working with log-ratios which have a simple and intuitive meaning for biologists (see for example [1,2]). This transformation usually works well for high values but not for zero, and low ones. Further, it cannot be applied to negative values. To avoid these drawbacks, [3,4] suggested the generalized logarithm transformation (glog), that allows negative values and this transformation is obtained from a multiplicative-additive linear error model for the data, through a Taylor approximation.
On the other hand, the glog transformation usually works well for low values but it is too severe for high ones. The next improvement was introduced by [5], who defined transformations on a real supported data family named generalized power transformations (gpower): The gpower transformations extend the glog transformation continuously, in the same sense as the Box-Cox family [6] extends the natural logarithm. A related problem presented by microarray data when they are log transformed is the intensity-dependent biases observed in the MA (Minus Average) plots (see [7]). These plots display pairwise comparisons of log intensities (Int1 and Int2) between microarrays. More specifically the vertical axis gives the log ratio or difference of the intensities in a log scale M = log 2 (Int1/Int2) = log 2 (Int1) − log 2 (Int2) and the horizontal axis is the average log intensities A = 1/2log 2 (Int1 × Int2) = 1/2(log 2 (Int1) + log 2 (Int2)). Even when intensities from a controlled spike-in experiment are expected to show a horizontal cloud centred on the M = 0 axis from low to high intensities, non-horizontal structures are apparent. Several approaches have been taken to account for the observed biases (see for example [2,[8][9][10]) , explain and correct the observed MA plot intensity-dependent biases through linear transformations.
To avoid transformations, the alternative standpoint taken here is to search for models that fit the data, preserving their scale. One advantage of this strategy is that it facilitates a direct interpretation of the results. In this direction, [11] showed that data that become normal after a glog transformation belong to what they called the glog-normal distribution family.
In this paper, we extend their results by characterizing those distributions that become normal (or truncated normal) after a gpower transformation. We introduce the gpower-normal family; this family of distributions should be fitted to gene intensities that have been previously calibrated with an affine transformation, according to the Bengtsson and Hössjer proposal [10].
For positive data, a study has been carried out by [12], who analysed the power normal family, related to Box-Cox transformation. The improvement of the gpower-normal family over the glog-normal family is that it can account for lighter asymmetries where the glog transformation is too strong (see [1,5]); also, it has support on the whole real line, not being restricted to positive values.
In Section 2, we describe the development of a new probability model to be used as a statistical tool for microarrays data analysis. Gpower-normal models are defined and their main properties are demonstrated; their relation with pseudo-dispersion models is studied and expressions for the moments and quantiles are obtained. Then, a combined maximum likelihood method is described to obtain estimators of the parameters and it is applied in subsequent sections. For the purpose of illustration, in Section 3, we show several density functions, their quantiles and real data applications. Finally, discussion and conclusions are presented in Sections 4 and 5 respectively. We have placed some proofs in the Appendices so as not to break the flow of the narrative.

Methods
The methodological idea within this section is to obtain a model for microarray data in their original scale. With this purpose, we describe a new tool and its implementation.

Gpower-normal Distribution
With the goal mentioned above, a new distribution family named gpower-normal is presented and its main properties are stated. We are considering data that, when transformed by gpower, become normally distributed.
The next theorem gives the main property of gpower-normal variables: after a gpower transformation they become truncated normals (TN). Recall that if X is a TN variable, its density is given by f X x, µ X , σ 2 = 1 and we will denote that X ∼ TN µ X , σ 2 , a, b (see Dhrymes [13]).
, then the transformed variable X = gpower (Y; p) has a truncated normal distribution TN µ X , The proof of this theorem can be seen in Appendix A. Figure 1 shows the flexibility of this distribution family across different values of the parameters, including symmetric and heavy tailed densities.

Relationship between Gpower-normal Models and Pseudo-dispersion Models
Gpower-normal models are a special case of a general family of distributions called pseudo-dispersion models defined by [14]. It is proven in Appendix B that the densities defined by (2) belong to this family. Expressions for their deviance and unit variance functions are also obtained. From a theoretical point of view, this is interesting because there are very few examples of pseudo-dispersion models known in the specialized literature.

Quantiles and Moments
A straightforward method to obtain quantiles such as the median and quartiles is considered. These quantiles will be useful in the graphical examination of the model fit to a data set, through quantile-quantile plots.
Let Y be a gpower-normal random variable and X = gpower (Y; p) the transformed variable distributed as TN µ X , σ 2 , −1/p, ∞ . Let x α be the α-quantile for X, to obtain an expression for the quantiles of a truncated normal distribution we proceeded as follows: is the corresponding normal variable with cumulative distribution F X 0 and Φ is the cumulative distribution of the standard normal. Now, clearing up and its value can be obtained from the standard normal distribution. This procedure will be applied in Section 3.2.
Expressions for the moments of a gpower-normal family can be expressed in terms of the truncated normal density function (see Appendix C). However, these expressions are not easy to handle, and their convergence for different values of the parameters remains to be analysed. In order to avoid the difficulties in moment calculations (e.g., mean and variance) for models given in Definition 1, we propose the alternative use of quantiles.

Parameter Estimation
The gpower-normal models have three parameters to be estimated. They are related to the corresponding TN model parameters as it has been detailed in Section 2.1. We propose a combined profile likelihood and maximum likelihood approach to estimate the parameters. The five steps of the proposed estimation approach are described below:

1.
Given a data set represented by vector y = (y 1 , y 2 , . . ., y n ), to obtain a profile likelihood for the power p, we consider a grid of values p 0 , p 1 , . . ., p k .

2.
For each p j , 1 ≤ j ≤ k the transformed data x p j are calculated as x p j = gpower y, p j . 3.
Then, for each p j , the corresponding µ p j and σ p j are estimated, maximizing the likelihood function of the truncated normal variable.

4.
Then, p j , µ p j and σ p j are used to obtain the log-likelihood function of y whose density was given by (2):

5.
Finally, p is chosen as the one that maximizes the log-likelihood in the grid: The method described above is applied in Section 3.2 and an implementation has been written in R language [15]. The codes are available from the authors upon request.

Results
To highlight the potential of the distribution family to model data with different skewness and kurtosis, we present several density functions and their quantiles. Also, the model fit is illustrated with real data sets.

Some Examples of Gpower-normal Densities
By way of example, Figure 1 shows the flexibility of this distribution family across different values of the parameters, including symmetric and heavy tailed densities. Graphic representations of these densities are exhibited for two different values of p (0.05 and 0.5) and three different values of σ (1, 5 and 10); always with µ = 0.
It can be seen that kurtosis decreases as σ grows up and asymmetry grows with p. Table 1 contains some quantile values corresponding to the distributions displayed in Figure 1 confirming the observations made above for those displays.

Real Data Applications
As it was mentioned in Section 1, the proposed density family was originally motivated by the modelling of microarray intensities, but its application is more general. Here, we present some examples: the first one corresponds to intensities of microarray data, the second one to concentrations of ammonia and the third one to magnetic contamination indices. For these examples, the parameter estimators are obtained by the method described in Section 2.4. Then, the data fit to the corresponding estimated model is shown graphically in three ways: (1) by the overlap of the data histograms with the density curves, (2) by the overlap of the empirical distribution curve with the adjusted model cumulative distribution function, and (3) by quantile-quantile (Q-Q) plots, that display the ordered data in the horizontal axis and the corresponding quantiles of the estimated model distribution.
It is also evaluated applying Kolmogorov-Smirnov tests to compare the estimated gpower-normal and the glog-normal distributions. The resulting p-values should be taken as descriptive comparisons of the two distributions and as a complement of the Q-Q plots. Example 1. The first set of data corresponds to 30 intensities of one gene and they can be seen in Appendix D. These data were selected from the Yale University MAQC project and downloaded from [16].
The values forp, µ, and σ were estimated by the profile likelihood method (Section 2.4), resulting p = 0.23, µ = 1052 and σ = 3.46. From these estimations, the 0.25, 0.50 and 0.75 quantiles were obtained according to expression (3) resulting in 690, 1052 and 1545, respectively. As can be seen in Figure 2, this set of data is well fitted by the gpower-normal model with the parameters given above. It is confirmed by the Kolmogorov-Smirnov test, with a p-value = 0.7057 for the gpower-normal model and 0.4934 for the glog-normal model.

Histogram and Density
Gene intensity  As in the previous examples, it can be seen in Figure 5 that the fitted gpower-normal model with the parameters given above fits them quite well. The fit was evaluated applying a Kolmogorov-Smirnov test (p-value = 0.57 for the gpower-normal model and 0.29 for the glog-normal model).

Discussion
A new family of distributions named gpower-normal, indexed by p ∈ R, has been defined. Variables whose distribution belongs to this family become normal or truncated normal when a gpower transformation is applied. From a practical point of view, the truncation is very often negligible, such is the case of data in Example 1. The estimated parameters were p = 0.23 , µ = 1052 and σ = 3.516 and thus the truncation constant given in Definition 1 is almost 1, meaning that no truncation is necessary in this case. Similar results were obtained for Examples 2 and 3.
The gpower-normal family can model data that include non-positive values, as required for microarray analysis.
To estimate the model parameters, a combined maximum likelihood method is proposed and it is successfully applied to real data. It enables direct calculations of quantiles for which simple expressions are given. Thus, position and scale measures (i.e., the median and the interquartile range) can be easily obtained, overcoming the difficulties found in moment calculations.
This allows the use of the estimated distribution medians as summary measures for gene intensities. Therefore, to compare the gene intensities in two biological conditions, it seems adequate to compare the corresponding medians of the estimated distributions. This paper extends the results previously presented by Leiva and coauthors [11] and therefore the new distribution family offers a larger set of models. Considering the Leiva et al. proposal as a standard alternative, the new family can fit data for which their proposal might be not flexible enough. As expected, in all examples, the Kolmogorov-Smirnov tests indicated a closer fit of the data to the gpower-normal distribution in comparison with the glog-normal fit.
In addition, it has been proven that the gpower-normal family is a special case of pseudo-dispersion models, inheriting all the good properties of these models, such as asymptotic normality for small variances. It should be pointed out that very few examples of pseudo-dispersion models have been reported in the literature. The obtained p-values should be considered as descriptive values only and as a complement to the Q-Q plot, which provides a visual impression of the fit.

Conclusions
We have presented a family of distributions to model asymmetric data with positive and negative values. It allows investigators to model microarray data, preserving their original scale. Although the construction of this family of distributions was motivated by microarray data, it is important to notice that the model is also useful in other environments, such as chemical and magnetical contamination data. A method has been given to estimate quantiles and in our examples, specific quantile-quantile plots were effective tools for a visual evaluation of the data fit to the estimated models.

Appendix A. Proof of Theorem 1
In this Theorem, it is shown that a gpower-normal variable becomes normal or truncated normal when its corresponding gpower transformation is applied.
Proof. Let x = gpower (y; p), then when p = 0 and it is easy to see that ∂y ∂x for p > 0, x > −1/p and the derivative is positive, which means that gpower (y; p) is a monotonically increasing function. Now, X = gpower (Y) is considered as a function of a random variable Y ∼ GPN(µ, σ, p). The density for X is obtained as follows: that is precisely the density of a TN µ X , σ 2 , −1/p, ∞ random variable. For p < 0, it can be proved analogously that For p = 0, gpower (x, 0) = ln y + y 2 + 1 = arc sinh (y). The inverse transformation is Again, the derivative is positive ∀y, meaning that the function is monotonically increasing and the density of X becomes that is precisely the density of a N µ X , σ 2 random variable.

Appendix B. Relationship between Gpower-normal Models and Pseudo-dispersion Models
We first define proper dispersion models and then extend them to pseudo-dispersion models.

Definition 2.
Given an open interval Ω ⊆ R, a proper dispersion model is a family of random variables Y with density functions of the form where µ ∈ Ω, σ > 0, d (y, y) = 0 ∀y ∈ Ω and d (y, µ) > 0 ∀y = µ (which means that d is a deviance). The constant c σ 2 ensures that this is a density function and V () is the variance function, related with d by Proper dispersion models are characterized by the existence of vector functions s and α such that their deviance can be factorized as d(y, µ) = s(y) T α(µ).
The above definition can be extended, by allowing c () todepend also on µ and the new density is On the other hand, if it satisfies that σc µ, σ 2 −→ (2π) −1/2 , when σ → 0 for all y, it is called a pseudo-dispersion model. A remarkable property of these models is that they are asymptotically normal [14].
It will be proven now that gpower-normal models are pseudo-dispersion models.
Theorem 2. Models given in Definition 1 are pseudo-dispersion models.

Appendix C. Moments of Gpower-normal Family
Expressions for the moments of a gpower-normal family are obtained here, and it is shown that they can be expressed in terms of the truncated normal density function. GPN(µ, σ, p), then its moments can be expressed as Proof. Y can be represented as the inverse gpower transformation of a truncated normal variable X: if p = 0 now, the expectation of Y n can be expressed for p = 0 as