A Bimodal Extension of the Log-Normal Distribution on the Real Line with an Application to DNA Microarray Data

Abstract: A bimodal double log-normal distribution on the real line is proposed using the random sign mixture transform. Its associated statistical inferences are derived. Its parameters are estimated by the maximum likelihood method. The performance of the estimators and the corresponding confidence intervals is checked by simulation studies. Application of the proposed distribution to a real data set from a DNA microarray is presented.


Introduction
The log-normal distribution is perhaps the most popular model for skewed data [1]. However, it is defined only on the positive real line, whereas many of its application areas involve data spanning the entire real line. One example is the modeling of stock returns, for which the log-normal distribution is a popular model even though returns can be positive (profits) or negative (losses). Other application areas of log-normal distributions involving data spanning the entire real line are discussed later on. Hence, a double log-normal distribution, supported on the entire real line, is needed.
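We follow the procedure presented in [2] to construct a double log-normal (DLN) distribution. Consider the following two transforms [2]:
(i) The random sign transform (RST) given by $W = (2Y - 1)X$;
(ii) The random sign mixture transform (RSMT) given by
$$Z = Y X_1 - (1 - Y) X_2,$$
where $Y$ is a Bernoulli random variable (RV) with parameter $\beta$ and $X$, $X_1$ and $X_2$ are non-negative RVs independent of $Y$. Writing $\bar{\beta} = 1 - \beta$, the probability density function (PDF) of $W$ is
$$f_W(w; \beta, \theta) = \begin{cases} \bar{\beta}\, f_X(|w|; \theta), & w < 0, \\ \beta\, f_X(w; \theta), & w \ge 0. \end{cases}$$
Similarly, the PDF of $Z$ is $\bar{\beta}\, f_{X_2}(|z|)$ for $z < 0$ and $\beta\, f_{X_1}(z)$ for $z \ge 0$.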
Several double continuous distributions based on the RSMT have been proposed; one example is the double inverse Gaussian distribution of [16].
We construct the DLN distribution using the RSMT, i.e., as the distribution of $Z$ when $X_1$ and $X_2$ independently follow log-normal (LN) distributions.
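The construction can be illustrated by direct simulation. The following is a minimal sketch, assuming the RSMT takes the form Z = Y X1 - (1 - Y) X2, so that the positive branch is drawn with probability β; the function name `sample_dln` and the parameter values are illustrative only and do not come from the paper.

```python
import random

def sample_dln(n, beta, mu1, sigma1, mu2, sigma2, rng):
    """Draw n values of Z = Y*X1 - (1 - Y)*X2 (assumed RSMT form)."""
    draws = []
    for _ in range(n):
        y = 1 if rng.random() < beta else 0  # Bernoulli(beta) sign indicator
        if y == 1:
            draws.append(rng.lognormvariate(mu1, sigma1))   # positive branch: X1
        else:
            draws.append(-rng.lognormvariate(mu2, sigma2))  # negative branch: -X2
    return draws

rng = random.Random(12345)
z = sample_dln(20000, beta=0.3, mu1=0.0, sigma1=0.5, mu2=1.0, sigma2=0.4, rng=rng)

# Under the assumed form, the share of negative draws estimates 1 - beta.
frac_negative = sum(v < 0 for v in z) / len(z)
print(frac_negative)
```

Under this convention the proportion of negative draws should be close to 1 - β, which is the value of the CDF at the origin.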
The remainder of this paper is organized as follows. In Section 2, the statistical properties of the DLN distribution are presented. The maximum likelihood estimates (MLEs) of the parameters and their asymptotic distributions are studied in Section 3. Simulations to check the finite sample performance of the estimators of the parameters and the corresponding confidence intervals are presented in Section 4. An application of the proposed double distribution to a real data set from a DNA microarray is presented in Section 5. Finally, the conclusions and comments are stated in Section 6.

Statistical Properties
We present the statistical properties of the DLN distribution in this section.

Probability Density Function
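The PDF of the DLN distribution is
$$f_Z(z) = \begin{cases} \bar{\beta}\, f_{X_2}(|z|), & z < 0, \\ \beta\, f_{X_1}(z), & z > 0, \end{cases}$$
where $\bar{\beta} = 1 - \beta$ and, for $-\infty < \mu_1, \mu_2 < \infty$ and $\sigma_1, \sigma_2 > 0$,
$$f_{X_j}(x) = \frac{1}{x \sigma_j \sqrt{2\pi}} \exp\left\{ -\frac{(\ln x - \mu_j)^2}{2\sigma_j^2} \right\}, \quad x > 0, \; j = 1, 2,$$
are the PDFs of the LN distributions. Figure 1 shows the bimodality of the PDF of the DLN distribution for selected parameter values. The DLN distribution has two modes given by $-m_2$ and $m_1$, where $m_j = \exp(\mu_j - \sigma_j^2)$, $j = 1, 2$, are the modes of the LN distributions.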
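As a numerical illustration, the density and its two modes can be evaluated directly. This is a sketch under the assumption that the DLN density equals (1 - β) times the LN(µ2, σ2) density at |z| for z < 0 and β times the LN(µ1, σ1) density for z > 0; the helper names and parameter values are hypothetical.

```python
import math

def ln_pdf(x, mu, sigma):
    """Log-normal density at x (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    u = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * u * u) / (x * sigma * math.sqrt(2.0 * math.pi))

def dln_pdf(z, beta, mu1, sigma1, mu2, sigma2):
    """Assumed DLN density: beta * LN1 on z > 0, (1 - beta) * LN2 on z < 0."""
    if z > 0:
        return beta * ln_pdf(z, mu1, sigma1)
    if z < 0:
        return (1.0 - beta) * ln_pdf(-z, mu2, sigma2)
    return 0.0

def dln_modes(mu1, sigma1, mu2, sigma2):
    """Negative and positive modes, using the LN mode exp(mu - sigma^2)."""
    return -math.exp(mu2 - sigma2 ** 2), math.exp(mu1 - sigma1 ** 2)

params = dict(beta=0.4, mu1=0.0, sigma1=0.5, mu2=1.0, sigma2=0.4)
neg_mode, pos_mode = dln_modes(0.0, 0.5, 1.0, 0.4)

# Midpoint-rule check that the density integrates to about 1 over [-60, 60].
h = 0.001
total = sum(dln_pdf(-60.0 + (i + 0.5) * h, **params) for i in range(120000)) * h
print(neg_mode, pos_mode, total)
```

The midpoint rule is used only as a sanity check on the normalization; it is not part of the paper's methodology.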

Cumulative Distribution Function
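The cumulative distribution function (CDF) of the DLN distribution is
$$F_Z(z) = \begin{cases} \bar{\beta}\, [1 - F_{X_2}(|z|)], & z < 0, \\ \bar{\beta} + \beta\, F_{X_1}(z), & z \ge 0, \end{cases}$$
where $\bar{\beta} = 1 - \beta$,
$$F_{X_j}(x) = \Phi\left( \frac{\ln x - \mu_j}{\sigma_j} \right), \quad x > 0, \; j = 1, 2,$$
are the CDFs of the LN distributions and $\Phi(\cdot)$ is the CDF of the standard normal distribution. Figure 2 shows the CDF of the DLN distribution for selected parameter values. We can observe that $F_Z(0) = \bar{\beta} = 1 - \beta$ and hence $F_Z(0)$ decreases as $\beta$ increases.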

Hazard Rate Function
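The survival function (SF) of the DLN distribution is
$$S_Z(z) = \begin{cases} 1 - \bar{\beta}\, \bar{F}_{X_2}(|z|), & z < 0, \\ \beta\, \bar{F}_{X_1}(z), & z \ge 0, \end{cases}$$
where $\bar{\beta} = 1 - \beta$ and
$$\bar{F}_{X_j}(x) = 1 - \Phi\left( \frac{\ln x - \mu_j}{\sigma_j} \right), \quad x > 0, \; j = 1, 2,$$
are the SFs of the LN distributions. The hazard rate function (HRF) of the DLN distribution is
$$h_Z(z) = \frac{f_Z(z)}{S_Z(z)} = \begin{cases} \dfrac{\bar{\beta}\, f_{X_2}(|z|)}{1 - \bar{\beta}\, \bar{F}_{X_2}(|z|)}, & z < 0, \\[2ex] \dfrac{f_{X_1}(z)}{\bar{F}_{X_1}(z)}, & z > 0, \end{cases} \tag{5}$$
where $f_{X_j}$ denotes the PDF of the LN distribution with parameters $(\mu_j, \sigma_j)$. Figure 3 shows the HRF of the DLN distribution for selected parameter values. This figure shows that the HRF of the DLN distribution can be bimodal with one mode on each side of the origin.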
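The CDF, SF and HRF can be computed with the error function from the standard library. The sketch below assumes the sign convention in which the negative branch carries probability 1 - β, so that the CDF at the origin equals 1 - β; all function names are illustrative.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ln_pdf(x, mu, sigma):
    u = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * u * u) / (x * sigma * math.sqrt(2.0 * math.pi))

def ln_cdf(x, mu, sigma):
    return norm_cdf((math.log(x) - mu) / sigma) if x > 0 else 0.0

def dln_cdf(z, beta, mu1, sigma1, mu2, sigma2):
    """Assumed DLN CDF: mass 1 - beta on the negative half-line."""
    if z < 0:
        return (1.0 - beta) * (1.0 - ln_cdf(-z, mu2, sigma2))
    return (1.0 - beta) + beta * ln_cdf(z, mu1, sigma1)

def dln_hazard(z, beta, mu1, sigma1, mu2, sigma2):
    """Hazard rate f_Z(z) / S_Z(z) for z != 0."""
    if z > 0:
        dens = beta * ln_pdf(z, mu1, sigma1)
    elif z < 0:
        dens = (1.0 - beta) * ln_pdf(-z, mu2, sigma2)
    else:
        return 0.0
    return dens / (1.0 - dln_cdf(z, beta, mu1, sigma1, mu2, sigma2))

p = dict(beta=0.4, mu1=0.0, sigma1=0.5, mu2=1.0, sigma2=0.4)
for z in (-3.0, -1.0, 0.5, 2.0):
    print(z, dln_cdf(z, **p), dln_hazard(z, **p))
```

For z > 0 the factor β cancels in the ratio, so the hazard rate on the positive half-line coincides with that of the LN distribution with parameters (µ1, σ1).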

Moments and Associated Measures
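The $r$th raw moment of the DLN distribution is
$$E(Z^r) = \beta\, E(X_1^r) + (-1)^r\, \bar{\beta}\, E(X_2^r),$$
where $\bar{\beta} = 1 - \beta$ and
$$E(X_j^r) = \exp\left( r\mu_j + \tfrac{1}{2} r^2 \sigma_j^2 \right), \quad j = 1, 2,$$
are the $r$th moments of the LN distributions.
In particular, the first four raw moments of $Z$ are
$$E(Z) = \beta e^{\mu_1 + \sigma_1^2/2} - \bar{\beta} e^{\mu_2 + \sigma_2^2/2}, \qquad E(Z^2) = \beta e^{2\mu_1 + 2\sigma_1^2} + \bar{\beta} e^{2\mu_2 + 2\sigma_2^2},$$
$$E(Z^3) = \beta e^{3\mu_1 + 9\sigma_1^2/2} - \bar{\beta} e^{3\mu_2 + 9\sigma_2^2/2}, \qquad E(Z^4) = \beta e^{4\mu_1 + 8\sigma_1^2} + \bar{\beta} e^{4\mu_2 + 8\sigma_2^2}.$$
The variance, skewness and kurtosis of the DLN distribution can be obtained using the well-known expressions
$$\mathrm{Var}(Z) = E(Z^2) - [E(Z)]^2, \qquad \mathrm{Skew}(Z) = \frac{E(Z^3) - 3E(Z)E(Z^2) + 2[E(Z)]^3}{[\mathrm{Var}(Z)]^{3/2}},$$
$$\mathrm{Kurt}(Z) = \frac{E(Z^4) - 4E(Z)E(Z^3) + 6[E(Z)]^2 E(Z^2) - 3[E(Z)]^4}{[\mathrm{Var}(Z)]^2},$$
upon substituting for the raw moments. Figure 4 shows the mean, variance, skewness and kurtosis of the DLN distribution as a function of $\beta$ for selected values of $(\mu_1, \sigma_1, \mu_2, \sigma_2)$. We can observe that the skewness can be negative or positive, i.e., the DLN distribution can be skewed to the left or skewed to the right.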
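The moment-based measures are straightforward to evaluate numerically. The following sketch assumes the raw moments take the form β E(X1^r) + (-1)^r (1 - β) E(X2^r), with the LN moments exp(rµ + r²σ²/2); the function names are hypothetical.

```python
import math

def ln_raw_moment(r, mu, sigma):
    """r-th raw moment of the log-normal distribution."""
    return math.exp(r * mu + 0.5 * r * r * sigma * sigma)

def dln_raw_moment(r, beta, mu1, sigma1, mu2, sigma2):
    """Assumed r-th raw moment of the DLN distribution."""
    return (beta * ln_raw_moment(r, mu1, sigma1)
            + (-1) ** r * (1.0 - beta) * ln_raw_moment(r, mu2, sigma2))

def dln_summary(beta, mu1, sigma1, mu2, sigma2):
    """Mean, variance, skewness and kurtosis from the first four raw moments."""
    m1, m2, m3, m4 = (dln_raw_moment(r, beta, mu1, sigma1, mu2, sigma2)
                      for r in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    skew = (m3 - 3.0 * m1 * m2 + 2.0 * m1 ** 3) / var ** 1.5
    kurt = (m4 - 4.0 * m1 * m3 + 6.0 * m1 ** 2 * m2 - 3.0 * m1 ** 4) / var ** 2
    return m1, var, skew, kurt

# Symmetric case: equal LN components and beta = 0.5.
mean, var, skew, kurt = dln_summary(0.5, 0.0, 0.5, 0.0, 0.5)
print(mean, var, skew, kurt)

# Skewness as a function of beta for one illustrative parameter set.
for b in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(b, dln_summary(b, 0.0, 0.5, 1.0, 0.4)[2])
```

In the symmetric case the odd raw moments vanish, so the mean and skewness are zero and the variance reduces to the second raw moment of the LN component.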

Harmonic Mean
The harmonic mean of an RV $V$ is defined as
$$H_V = \left[ E\left( \frac{1}{V} \right) \right]^{-1}.$$

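Proposition 1. The harmonic mean of the RSMT $Z$ is
$$H_Z = \left[ \frac{\beta}{H_{X_1}} - \frac{\bar{\beta}}{H_{X_2}} \right]^{-1},$$
where $\bar{\beta} = 1 - \beta$ and $H_{X_j} = [E(1/X_j)]^{-1}$, $j = 1, 2$.
Proof. Since
$$E\left( \frac{1}{Z} \right) = \beta\, E\left( \frac{1}{X_1} \right) - \bar{\beta}\, E\left( \frac{1}{X_2} \right),$$
the proposition follows.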

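Corollary 1. The harmonic mean of the DLN distribution is
$$H_Z = \left[ \frac{\beta}{H_{X_1}} - \frac{\bar{\beta}}{H_{X_2}} \right]^{-1},$$
where $\bar{\beta} = 1 - \beta$ and
$$H_{X_j} = \exp\left( \mu_j - \tfrac{1}{2} \sigma_j^2 \right), \quad j = 1, 2,$$
are the harmonic means of the LN distributions.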
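A short numerical sketch of this corollary follows. It assumes the form [β/H1 - (1 - β)/H2]^(-1) with LN harmonic means exp(µ - σ²/2); the function names are illustrative, and the error raised when the bracket vanishes is a design choice made here, not something stated in the paper.

```python
import math

def ln_harmonic_mean(mu, sigma):
    """Harmonic mean of the log-normal distribution: 1 / E(1/X)."""
    return math.exp(mu - 0.5 * sigma ** 2)

def dln_harmonic_mean(beta, mu1, sigma1, mu2, sigma2):
    """Assumed DLN harmonic mean: 1 / (beta/H1 - (1 - beta)/H2)."""
    inv = (beta / ln_harmonic_mean(mu1, sigma1)
           - (1.0 - beta) / ln_harmonic_mean(mu2, sigma2))
    if inv == 0.0:
        raise ValueError("E(1/Z) is zero, so the harmonic mean is undefined")
    return 1.0 / inv

h_z = dln_harmonic_mean(0.4, 0.0, 0.5, 1.0, 0.4)
print(h_z)
```

Because E(1/Z) is a difference of two positive terms, the harmonic mean can be positive or negative, and it is undefined when the two terms cancel.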

Entropies
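Entropies are measures of a system's variation, instability or unpredictability. For an RV $V$ with PDF $f_V(v)$, the following are two well-known entropies:
1. Tsallis entropy [17]:
$$T_q(V) = \frac{1}{q - 1} \left[ 1 - \int_{-\infty}^{\infty} f_V^q(v)\, dv \right], \quad q > 0, \; q \ne 1;$$
2. Shannon entropy [18]:
$$S(V) = -\int_{-\infty}^{\infty} f_V(v) \ln f_V(v)\, dv.$$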

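Proposition 2. The Tsallis entropy of the RSMT $Z$ is
$$T_q(Z) = \frac{1 - \beta^q - \bar{\beta}^q}{q - 1} + \beta^q\, T_q(X_1) + \bar{\beta}^q\, T_q(X_2),$$
where $\bar{\beta} = 1 - \beta$ and $T_q(\cdot)$ denotes the Tsallis entropy of order $q$.
Proof. See [16].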

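Proposition 3. The Tsallis entropy of the LN distribution with parameters $(\mu, \sigma)$ is
$$T_q(X) = \frac{1}{q - 1} \left[ 1 - \frac{(2\pi\sigma^2)^{(1-q)/2}}{\sqrt{q}} \exp\left\{ (1 - q)\mu + \frac{(1 - q)^2 \sigma^2}{2q} \right\} \right], \quad q > 0, \; q \ne 1.$$
Proof. Since
$$\int_0^{\infty} f_X^q(x)\, dx = \frac{(2\pi\sigma^2)^{(1-q)/2}}{\sqrt{q}} \exp\left\{ (1 - q)\mu + \frac{(1 - q)^2 \sigma^2}{2q} \right\},$$
the proposition follows.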

Corollary 3. The Shannon entropy of the LN distribution with parameters $(\mu, \sigma)$ is
$$S(X) = \mu + \tfrac{1}{2} \ln\left( 2\pi e \sigma^2 \right).$$
The proof of Proposition 4 follows directly from Propositions 2 and 3. Note that the Tsallis and Shannon entropies can be negative for continuous distributions.
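The entropy expressions can be checked numerically. The sketch below assumes the Tsallis entropy is parametrized by an order q as (1 - ∫f^q)/(q - 1), uses the closed form of ∫f^q for the LN distribution obtained by a Gaussian integral, and combines the two branches of the RSMT; the symbol q, the function names and the DLN Shannon expression (taken as the q → 1 limit) are assumptions made here rather than statements from the paper.

```python
import math

def ln_tsallis(q, mu, sigma):
    """Tsallis entropy of LN(mu, sigma) for q > 0, q != 1."""
    integral = ((2.0 * math.pi * sigma ** 2) ** ((1.0 - q) / 2.0) / math.sqrt(q)
                * math.exp((1.0 - q) * mu + (1.0 - q) ** 2 * sigma ** 2 / (2.0 * q)))
    return (1.0 - integral) / (q - 1.0)

def ln_shannon(mu, sigma):
    """Shannon entropy of LN(mu, sigma)."""
    return mu + 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def dln_tsallis(q, beta, mu1, sigma1, mu2, sigma2):
    """Tsallis entropy of the DLN, combining the two RSMT branches."""
    bq, cq = beta ** q, (1.0 - beta) ** q
    return ((1.0 - bq - cq) / (q - 1.0)
            + bq * ln_tsallis(q, mu1, sigma1)
            + cq * ln_tsallis(q, mu2, sigma2))

def dln_shannon(beta, mu1, sigma1, mu2, sigma2):
    """Limit of dln_tsallis as q -> 1 (requires 0 < beta < 1)."""
    return (-beta * math.log(beta) - (1.0 - beta) * math.log(1.0 - beta)
            + beta * ln_shannon(mu1, sigma1)
            + (1.0 - beta) * ln_shannon(mu2, sigma2))

p = dict(beta=0.4, mu1=0.0, sigma1=0.5, mu2=1.0, sigma2=0.4)
print(dln_tsallis(2.0, **p), dln_tsallis(1.000001, **p), dln_shannon(**p))
print(ln_shannon(-3.0, 0.1))  # a negative entropy for a concentrated LN distribution
```

Evaluating the Tsallis entropy at q close to 1 gives a quick consistency check against the Shannon expressions, and the last line illustrates that differential entropies can be negative.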

Maximum Likelihood Estimation
In this section, MLEs of the parameters of the DLN distribution and their asymptotic distributions are derived.
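A minimal sketch of how such estimates could be computed is given below. It assumes the density is β times an LN(µ1, σ1) density on the positive half-line and (1 - β) times an LN(µ2, σ2) density of |z| on the negative half-line, in which case the log-likelihood separates into a Bernoulli part and two LN parts and each part can be maximized in closed form; this is an illustration under that assumption, not necessarily the authors' procedure, and the function name `dln_mle` is hypothetical.

```python
import math
import random

def dln_mle(data):
    """Closed-form MLEs under the assumed DLN density (zeros are ignored)."""
    logs_pos = [math.log(v) for v in data if v > 0]   # logs of positive observations
    logs_neg = [math.log(-v) for v in data if v < 0]  # logs of |negative observations|
    n = len(logs_pos) + len(logs_neg)

    def mean_and_sd(logs):
        m = sum(logs) / len(logs)
        # Divisor len(logs), not len(logs) - 1: this is the MLE of sigma.
        s = math.sqrt(sum((t - m) ** 2 for t in logs) / len(logs))
        return m, s

    mu1, sigma1 = mean_and_sd(logs_pos)
    mu2, sigma2 = mean_and_sd(logs_neg)
    return {"beta": len(logs_pos) / n, "mu1": mu1, "sigma1": sigma1,
            "mu2": mu2, "sigma2": sigma2}

# Simulated data with illustrative parameter values.
rng = random.Random(2023)
truth = {"beta": 0.4, "mu1": 0.0, "sigma1": 0.5, "mu2": 1.0, "sigma2": 0.4}
data = [rng.lognormvariate(truth["mu1"], truth["sigma1"]) if rng.random() < truth["beta"]
        else -rng.lognormvariate(truth["mu2"], truth["sigma2"]) for _ in range(20000)]
est = dln_mle(data)
print(est)
```

Under this factorization the estimate of β is simply the proportion of positive observations, so it does not depend on which non-negative distribution is used for the two branches.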
The simulation results lead to the following conclusions:
1. Figure 7 shows that the absolute biases of the MLEs are small and approach zero as $n$ increases.
2. Figure 8 shows that the MSEs of the MLEs are small and decrease as $n$ increases.
3. Figure 9 shows that the coverage probabilities of the 95% confidence intervals are close to the nominal level.
These conclusions show that the MLEs of the DLN distribution are well behaved for point as well as interval estimation.
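A small Monte Carlo experiment of this kind can be sketched as follows. The sketch assumes Wald-type 95% intervals, with β estimated by the proportion of positive observations and µ1 by the mean of the logs of the positive observations; the sample size, number of replications, seed and parameter values are illustrative and are not those used in the paper.

```python
import math
import random

Z95 = 1.959963984540054  # 0.975 quantile of the standard normal distribution

def wald_intervals(n, beta, mu1, sigma1, rng):
    """Simulate one sample of size n and return 95% Wald intervals for beta and mu1.

    Only the signs and the logs of the positive observations matter for these
    two parameters, so the negative branch is not generated.
    """
    logs_pos = [rng.gauss(mu1, sigma1) for _ in range(n) if rng.random() < beta]
    n1 = len(logs_pos)
    beta_hat = n1 / n
    mu_hat = sum(logs_pos) / n1
    s_hat = math.sqrt(sum((t - mu_hat) ** 2 for t in logs_pos) / n1)
    half_beta = Z95 * math.sqrt(beta_hat * (1.0 - beta_hat) / n)
    half_mu = Z95 * s_hat / math.sqrt(n1)
    return ((beta_hat - half_beta, beta_hat + half_beta),
            (mu_hat - half_mu, mu_hat + half_mu))

def coverage(reps, n, beta, mu1, sigma1, seed=2024):
    """Fraction of replications whose intervals contain the true values."""
    rng = random.Random(seed)
    hit_beta = hit_mu = 0
    for _ in range(reps):
        (lo_b, hi_b), (lo_m, hi_m) = wald_intervals(n, beta, mu1, sigma1, rng)
        hit_beta += lo_b <= beta <= hi_b
        hit_mu += lo_m <= mu1 <= hi_m
    return hit_beta / reps, hit_mu / reps

cov_beta, cov_mu1 = coverage(reps=2000, n=200, beta=0.4, mu1=0.0, sigma1=0.5)
print(cov_beta, cov_mu1)
```

Empirical coverages near 0.95 would be in line with the conclusion above; repeating the experiment over a grid of n values gives the kind of curves summarized in the figures.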

Application
In this section, we apply the proposed DLN distribution to a real data set from a DNA microarray reported in [19]. According to Wikipedia, "A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome". The data labelled as "SID 377353, ESTs
Figure 10 shows the histogram of the data, which indicates bimodality around the origin. For the sake of comparing the bimodal DLN distribution with other bimodal distributions, we consider the double inverse Gaussian (DIG) distribution proposed in [16]. The PDF of the DIG distribution is
$$f_Z(z) = \begin{cases} \bar{\beta}\, g_2(|z|), & z < 0, \\ \beta\, g_1(z), & z > 0, \end{cases}$$
where $\bar{\beta} = 1 - \beta$ and
$$g_j(x) = \sqrt{\frac{\lambda_j}{2\pi x^3}} \exp\left\{ -\frac{\lambda_j (x - \nu_j)^2}{2 \nu_j^2 x} \right\}, \quad x > 0, \; \nu_j, \lambda_j > 0, \; j = 1, 2, \tag{9}$$
are the PDFs of inverse Gaussian distributions.
Table 1 gives the MLEs, their standard errors (S.E.s), estimated log-likelihoods and the Kolmogorov-Smirnov (KS), Anderson-Darling (AD) and Cramér-von Mises (CVM) goodness-of-fit tests of the fitted DIG and DLN distributions. This table shows that the MLE of $\beta$ and its S.E. are the same for both the fitted DIG and DLN distributions, since the Bernoulli parameter $\beta$ is estimated independently in the RSMT. In addition, this table shows that the MLEs of $\mu_1$ and $\mu_2$ in the fitted DLN distribution are both negative.
Table 1 shows that the three goodness-of-fit tests have much smaller (larger) test statistics for the fitted DLN (DIG) distribution. This table also shows that the three goodness-of-fit tests reject (do not reject) the DIG (DLN) distribution for the given data. This conclusion is supported by the diagnostic plots in Figures 11 and 12.
In these figures, (i) the PDF and CDF plots indicate, in an informal way, that the fitted DIG (DLN) distribution may not (may) be suitable for the given data; (ii) the quantile-quantile (Q-Q) plots show that both the fitted DIG and DLN distributions describe the tails of the distribution inappropriately; (iii) the probability-probability (P-P) plots show that the fitted DIG (DLN) distribution inappropriately (appropriately) describes the center of the distribution.
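The KS statistic underlying such a comparison can be sketched as follows. Since the microarray data are not reproduced here, the example uses simulated data and compares a correctly specified DLN CDF with a deliberately misspecified one; the CDF convention (mass 1 - β on the negative half-line), the function names and the parameter values are assumptions, and no p-values are computed.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dln_cdf(z, beta, mu1, sigma1, mu2, sigma2):
    """Assumed DLN CDF with mass 1 - beta on the negative half-line."""
    if z < 0:
        return (1.0 - beta) * (1.0 - norm_cdf((math.log(-z) - mu2) / sigma2))
    if z == 0:
        return 1.0 - beta
    return (1.0 - beta) + beta * norm_cdf((math.log(z) - mu1) / sigma1)

def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov distance between the empirical CDF and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

rng = random.Random(7)
true = dict(beta=0.4, mu1=0.0, sigma1=0.5, mu2=1.0, sigma2=0.4)
data = [rng.lognormvariate(true["mu1"], true["sigma1"]) if rng.random() < true["beta"]
        else -rng.lognormvariate(true["mu2"], true["sigma2"]) for _ in range(2000)]

d_good = ks_statistic(data, lambda z: dln_cdf(z, **true))
wrong = dict(true, beta=0.8)  # misspecified sign probability
d_bad = ks_statistic(data, lambda z: dln_cdf(z, **wrong))
print(d_good, d_bad)
```

A much smaller statistic for one candidate model than for another is the pattern that Table 1 reports for the DLN relative to the DIG distribution; the P-P plot conveys the same information graphically.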

Figure panel: P-P plot (axes: theoretical probabilities and empirical probabilities).

Conclusions and Comments
We have proposed a bimodal distribution on the real line, referred to as the double log-normal distribution. We have derived its statistical properties, including the probability density, cumulative distribution and hazard rate functions, the moments and associated measures and harmonic mean, as well as Tsallis and Shannon entropies. Additionally, maximum likelihood estimates of the parameters and their asymptotic distribution are provided. Simulation studies showed that the maximum likelihood estimation performed well in terms of the bias, mean squared error and coverage probability of confidence intervals. Application to a DNA microarray data set showed that the proposed distribution is flexible and competitive for modeling bimodal data around the origin.
Instead of the log-normal distribution, one can consider the length-biased log-normal distribution developed in [20]. It would be interesting to formulate a double length-biased log-normal distribution.

Data Availability Statement: The data are given in the paper. The code used can be obtained from the corresponding author.