Modelling Asymmetric Data by Using the Log-Gamma-Normal Regression Model

Roger Tovar-Falón; Guillermo Martínez-Flórez; Heleno Bolfarine

doi:10.3390/math10071199

¹ Departamento de Matemáticas y Estadística, Facultad de Ciencias Básicas, Universidad de Córdoba, Monteria 230002, Colombia

² Departamento de Estatística, Universidade de São Paulo, Sao Paulo 05508-090, Brazil

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Mathematics 2022, 10(7), 1199; https://doi.org/10.3390/math10071199

Abstract

In this paper, we propose a linear regression model in which the error term follows a log-gamma-normal (LGN) distribution. The assumption of an LGN distribution gives the flexibility to accommodate skewness to the left and to the right. Kurtosis greater or smaller than that of the normal model can also be accommodated. The regression model for censored asymmetric data is also considered (censored LGN model). Parameter estimation is implemented using the maximum likelihood approach, and a small simulation study is conducted to evaluate parameter recovery. The main conclusion is that the approach is very satisfactory for moderate and large sample sizes. Results for two applications of the proposed model to real datasets are provided for illustrative purposes.

Keywords:

log-gamma-normal distribution; linear regression models; asymmetric data; censored data; maximum likelihood estimators

MSC:

60E05

1. Introduction

Regression models are among the statistical techniques most frequently used in data analysis in any area of knowledge, especially when there is interest in studying the relationship between a dependent variable (response) and two or more independent (explanatory) variables. In this sense, a regression model with a response variable following a normal distribution is perhaps the best known in the literature, and could be considered one of the most widely used; however, the assumption of normality may not be adequate for the dataset under analysis, since the data may present degrees of skewness or kurtosis that are not within the range covered by the normal model. Consequently, inferences made from the fitted model may not have statistical validity, and erroneous conclusions may be reached. A solution to the problem of the assumption of normality of the variable of interest is the use of transformations, although it is well known that this solution makes it difficult to interpret results since the data are no longer on the original measurement scale. As an alternative, many authors have introduced new families of distributions that are capable of capturing degrees of skewness and kurtosis greater than those that the normal distribution can capture.

One of the most important works in the context of data with a high degree of asymmetry is that of Azzalini [1], who introduced what is known in the statistical literature as the skew-normal (SN) model. The main characteristic of the SN model is its ability to fit degrees of asymmetry (on the left and right) greater than those of the normal model; however, it is not the best model in terms of capturing high degrees of kurtosis. Relating to the latter, the power-normal (PN) model introduced by Durrans [2] has the particularity of fitting data with a higher degree of kurtosis than the normal and SN models but with a smaller range of asymmetry. The SN and PN models have been studied extensively by many authors, and different extensions of these models have been considered. In Gupta and Gupta [3], for example, the authors showed the existing practical problems when the asymmetry parameter of the SN model is estimated, and they proposed an alternative model named the PN model. The authors also investigated the closeness between the proposed model and the SN model. In Pewsey et al. [4], the authors presented the general results of the likelihood-based inference for the family of power distributions, with particular emphasis on the case of the PN model, complementing the work of Gupta and Gupta [3]. In Martínez-Flórez et al. [5], the authors introduced a new model that generalized both the SN model by Azzalini [1] and the PN model by Durrans [2]. The new model, which is called power-skew-normal (PSN), has the particularity of fitting data with degrees of asymmetry greater than those of the SN model, and is also capable of capturing degrees of kurtosis greater than those of the PN model. Furthermore, the authors showed that the information matrix of the new model is non-singular, which permits carrying out hypothesis tests on the asymmetry parameters based on likelihood-ratio statistics. On the other hand, Martínez-Flórez et al. [6] generalized the log-normal (LN) model from the SN and PN models. In addition, these new proposals contain the LN model as a particular case, and they are more flexible regarding skewness and kurtosis to fit positive data.

Alternatives for fitting asymmetric data with a high degree of kurtosis were reported by other authors such as Tovar-Falón et al. [7], who introduced a new model that generalizes the skew-t model of Azzalini and Capitanio [8] and power-t of Zhao and Kim [9]. Here, the inference was carried out from a classical perspective using the maximum likelihood method. This new model also has, as particular cases, the PSN, SN, PN, Student-t and normal models. In Tung et al. [10], the authors considered a mixture class of log-F distributions to characterize asymmetric distributions by integrating it into a pH acceleration model. The authors studied the impact of the new model in the presence of misspecification of particle size distribution.

Models for asymmetric data with high degrees of skewness and kurtosis, and presenting more than one mode and censored data, were also considered. More details about these topics can be found in Martínez-Flórez et al. [11] and Martínez-Flórez et al. [12], respectively. In addition, all the aforementioned models are easily extensible to the situations of regression models, including cases in which the data show censoring in some value; see Sahu et al. [13], Martínez-Flórez et al. [14].

In Amini et al. [15], the authors introduced a new family of distributions useful for modelling asymmetric data. This new family of continuous distributions is generated by a distribution F and two positive real parameters

δ

and

γ

, which control the skewness and tail weight of the distribution. The probability density function (PDF) of this family is given by

g (x; δ, γ) = \frac{γ^{δ}}{Γ (δ)} {[- ln F (x)]}^{δ - 1} {[F (x)]}^{γ - 1} f (x),

(1)

where

δ, γ \in R^{+}

and

Γ (\cdot)

is the complete gamma function,

F (\cdot)

is the cumulative distribution function (CDF) of X, and

f (\cdot)

is the associated PDF. In that work, the authors studied the main properties of the distribution and addressed the estimation of the unknown parameters of the model using the likelihood approach.

From the

F (x)

generator, the authors studied some particular cases of the family, among which are those based on the exponential, Weibull, power, Pareto, extreme value and Gumbel distributions. If

δ = 1

and

F (\cdot) = Φ (\cdot)

and

f (\cdot) = ϕ (\cdot)

in model (1), i.e., the CDF and PDF of the standard normal distribution, respectively, the model in (1) is reduced to the PN model by Durrans [2]. Hence, the model in (1) is an extension of the PN model. In Cordeiro et al. [16], the authors studied in detail the properties of the log-gamma-generated family of distributions introduced by Amini et al. [15] and presented some applications of this family. Other particular cases of the model introduced by Amini et al. [15] correspond to the generalized gamma and log-gamma distributions, which have been extensively studied by many authors; see Prentice [17], Lawless [18], Young and Bakir [19], Ortega et al. [20, 21] among others.

The main goal of this article is to focus on the study of the regression model under the assumption that the errors follow a log-gamma-normal (LGN) distribution, which is obtained by taking

F (x) = Φ (x)

and

f (x) = ϕ (x)

in model (1). We also consider the regression model for censored data, and we conduct the parameter estimation using the maximum likelihood approach and its large-sample properties.

Although there are many works in the literature related to the generalized log-gamma distribution, our proposal is based on the family of distributions presented by Amini et al. [15], which is known in the literature as log-gamma-generated. Within this family, we focus on the case in which the generating distribution is the normal distribution, which yields the log-gamma-normal distribution; this distribution does not correspond to any of the distributions previously mentioned. In our proposal, we replace the assumption that the errors in the multiple linear regression model follow a normal distribution with the assumption that they follow a log-gamma-normal distribution. It is also important to note that the generalized log-gamma and log-gamma distributions are also particular cases of the family introduced by Amini et al. [15] (assuming a gamma distribution instead of the standard normal distribution), but we do not consider these cases in our proposal.

In addition to carrying out the estimation process of the parameters in the model, we present two applications using real datasets. The first dataset was previously analyzed by Zhang and Davidian [22], and the second dataset is related to a study on the abundance of beryllium scaled to the Sun’s abundance. For the particular case of these datasets, the model fits well, and therefore we can conclude that, apart from the existence of statistical literature for the analysis of asymmetric data, our proposal is a viable alternative that competes with existing models. The main contribution of this model is that the trend of the dataset under examination is better explained using a model with log-gamma type errors instead of one with asymmetric errors using another distribution.

The article is organized as follows. In Section 2, we define the family of LGN distributions and discuss some of its properties. In Section 3, the LGN regression model is defined and its properties are studied; inference is implemented using the maximum likelihood approach. The censored LGN model for dealing with censored data by maximum likelihood estimation is discussed in Section 4. In Section 5, the results of a small-scale simulation study reveal the good performance of the estimation approach. In Section 6, two real data applications are considered, revealing that the datasets in question are better fitted by the LGN model than by the PN and SN models.

2. Log-Gamma-Normal Distribution

In this section, we define the LGN model, which is obtained from the family given in (1) by taking the CDF of the standard normal distribution, and we study some basic properties.

Definition 1.

The random variable X is said to have an LGN distribution if X has PDF given by

f_{L G N} (x; δ, γ) = \frac{γ^{δ}}{Γ (δ)} {[- ln Φ (x)]}^{δ - 1} {[Φ (x)]}^{γ - 1} ϕ (x), x \in R,

(2)

where

δ, γ \in R^{+}

,

Γ (a) = \int_{0}^{\infty} u^{a - 1} e^{- u} d u

is the gamma function, and the functions

ϕ (\cdot)

and

Φ (\cdot)

are the PDF and CDF of the standard normal distribution, respectively.

A random variable with LGN distribution is denoted for short by

X \sim LGN (δ, γ)

. One can note that the function (2) is a proper PDF since

f_{L G N} (x; δ, γ) \geq 0

for all

x \in R

and

δ, γ \in R^{+}

. Moreover, letting

y = - ln Φ (x)

, it follows that

\int_{R} \frac{γ^{δ}}{Γ (δ)} {[- ln Φ (x)]}^{δ - 1} {[Φ (x)]}^{γ - 1} ϕ (x) d x = \int_{0}^{+ \infty} \frac{γ^{δ}}{Γ (δ)} y^{δ - 1} e^{- γ y} d y = 1 .
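The normalization argument above can be checked numerically. The following is a minimal sketch, assuming only the Python standard library; the function names `lgn_pdf` and `trapezoid`, the integration range and the parameter values are illustrative choices and do not come from the paper.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, N(0, 1)


def lgn_pdf(x, delta, gamma):
    """Density (2) of LGN(delta, gamma), assembled on the log scale."""
    cdf = _STD_NORMAL.cdf(x)
    if cdf <= 0.0 or cdf >= 1.0:
        # Far tails, where Phi(x) rounds to 0 or 1 in floating point.
        return 0.0
    log_cdf = math.log(cdf)
    log_density = (
        delta * math.log(gamma)
        - math.lgamma(delta)
        + (delta - 1.0) * math.log(-log_cdf)
        + (gamma - 1.0) * log_cdf
        + math.log(_STD_NORMAL.pdf(x))
    )
    return math.exp(log_density)


def trapezoid(func, lower, upper, steps=20000):
    """Composite trapezoidal rule on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width


# Illustrative parameter values; the integral should be close to 1.
area = trapezoid(lambda x: lgn_pdf(x, 1.5, 0.75), -8.0, 8.0)
print(area)
```

Working on the log scale avoids overflow in the individual factors; the truncation to [-8, 8] is an assumption that is adequate for moderate parameter values but not for extreme ones.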

Figure 1 depicts some shapes of LGN distribution for some selected values of the parameters

δ

and

γ

. It can be seen that the parameters

δ

and

γ

affect both the skewness and kurtosis of the model, and hence, the LGN distribution is more flexible for fitting data that may be skewed as well as having thinner or thicker tails than the normal, SN and PN distributions.

Figure 1. PDF of the LGN distribution: (a)

δ = 0.6

and

γ = 0.30

(dotted line), 0.60 (dashed line), 1.0 (dotted–dashed line), 3.0 (long dashed line) and 6.0 (solid line). (b)

γ = 1.5

and

δ = 0.30

(dotted line), 0.60 (dashed line), 1.0 (dotted–dashed line), 2.0 (long dashed line) and 5.0 (solid line).

The LGN distribution reduces to some specific distributions as special cases for specified values of the parameters

δ

and

γ

; some of them are available in the literature and have been widely studied.

Proposition 1.

Let

X \sim L G N (δ, γ)

(i): if $δ = γ = 1$ , then $X \sim N (0, 1)$ ,
(ii): if $δ = 1$ , then $X \sim P N (γ)$ ,
(iii): if $δ = 1$ and $γ = 2$ , then $X \sim S N (1)$ .

Proof.

Demonstration of (i)–(iii) is immediate from the definition of LGN distribution. □
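The reductions in Proposition 1 can be written out explicitly as a short check. Setting δ = 1 in (2) gives Γ(1) = 1 and a unit factor for the power of −ln Φ(x), so that

```latex
f_{L G N} (x; 1, γ) = γ {[Φ (x)]}^{γ - 1} ϕ (x),
\qquad
f_{L G N} (x; 1, 1) = ϕ (x),
\qquad
f_{L G N} (x; 1, 2) = 2 ϕ (x) Φ (x) .
```

The first expression is the PN density with parameter γ, the second is the standard normal density, and the third is the SN density 2ϕ(x)Φ(λx) evaluated at λ = 1.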

2.1. Moments

Measures of skewness and kurtosis can be obtained from the moments of the LGN distribution. The following proposition gives an expression for the kth moment of the random variable

X \sim LGN (δ, γ)

which does not have a closed form.

Proposition 2.

Let

X \sim L G N (δ, γ)

then

E [X^{k}] = E [{(Φ^{- 1} (e^{- W}))}^{k}], for k = 1, 2, \dots,

(3)

where

Φ^{- 1} (\cdot)

is the inverse of the CDF

Φ (\cdot)

and the random variable W follows a gamma distribution with shape parameter δ and rate parameter γ.

Proof.

We have by definition that

E [X^{k}] = \int_{R} x^{k} \frac{γ^{δ}}{Γ (δ)} {[- ln Φ (x)]}^{δ - 1} {[Φ (x)]}^{γ - 1} ϕ (x) d x .

Letting

W = - ln Φ (X)

, then

X = Φ^{- 1} (e^{- W})

, it follows that

E [X^{k}] = \int_{0}^{+ \infty} {(Φ^{- 1} (e^{- w}))}^{k} \frac{γ^{δ}}{Γ (δ)} w^{δ - 1} e^{- γ w} d w,

which is the expected value of the function

{[Φ^{- 1} (e^{- W})]}^{k}

, where W follows a

Gamma (δ, γ)

distribution. □

Based on moments (3), one can obtain the skewness

(\sqrt{β_{1}})

and the kurtosis

(β_{2})

coefficients of the LGN model using the following expressions

\sqrt{β_{1}} = \frac{μ_{3} - 3 μ_{1} μ_{2} + 2 μ_{1}^{3}}{{(μ_{2} - μ_{1}^{2})}^{3 / 2}},

and

β_{2} = \frac{μ_{4} - 4 μ_{1} μ_{3} + 6 μ_{2} μ_{1}^{2} - 3 μ_{1}^{4}}{{(μ_{2} - μ_{1}^{2})}^{2}}

respectively, where

μ_{k} = E [X^{k}]

for

k = 1, \dots, 4

. The skewness and kurtosis coefficients for values of

δ

and

γ

ranging between 0.1 and 200 were calculated for the LGN model by numerical integration, using the integrate function in R (R Development Core Team [23]). It was found that

\sqrt{β_{1}} \in [- 1.0190, 1.0143]

and

β_{2} \in [1.7170, 4.9356]

. The given intervals contain the corresponding intervals of skewness and kurtosis coefficients of the SN model, which are

(- 0.9953, 0.9953)

and

[3.0000, 3.8692)

, respectively, and the PN model, which are

[- 0.6115, 0.9007]

and

[1.7170, 4.3556]

, respectively. More details can be found in Pewsey et al. [4]. The previous results illustrate the fact that the LGN model contains models with greater (and smaller) asymmetry degree than both the SN and PN models.
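The skewness and kurtosis coefficients above can be approximated with a few lines of code. The sketch below is an assumption-laden illustration rather than the authors' R code: it uses a plain trapezoidal rule from the Python standard library instead of R's integrate, and the names `raw_moment` and `skewness_kurtosis` are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def lgn_pdf(x, delta, gamma):
    """Density (2) of LGN(delta, gamma), assembled on the log scale."""
    cdf = _STD_NORMAL.cdf(x)
    if cdf <= 0.0 or cdf >= 1.0:
        return 0.0  # Phi(x) has rounded to 0 or 1 in floating point
    log_cdf = math.log(cdf)
    return math.exp(
        delta * math.log(gamma) - math.lgamma(delta)
        + (delta - 1.0) * math.log(-log_cdf)
        + (gamma - 1.0) * log_cdf
        + math.log(_STD_NORMAL.pdf(x))
    )


def raw_moment(k, delta, gamma, lower=-10.0, upper=10.0, steps=20000):
    """Approximate E[X^k] with the composite trapezoidal rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lower + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x**k * lgn_pdf(x, delta, gamma)
    return total * width


def skewness_kurtosis(delta, gamma):
    """Return (sqrt(beta_1), beta_2) computed from the first four raw moments."""
    m1, m2, m3, m4 = (raw_moment(k, delta, gamma) for k in range(1, 5))
    variance = m2 - m1**2
    skewness = (m3 - 3.0 * m1 * m2 + 2.0 * m1**3) / variance**1.5
    kurtosis = (m4 - 4.0 * m1 * m3 + 6.0 * m2 * m1**2 - 3.0 * m1**4) / variance**2
    return skewness, kurtosis


# delta = gamma = 1 is the standard normal case (skewness 0, kurtosis 3).
print(skewness_kurtosis(1.0, 1.0))
```

The fixed integration range is an assumption; for very small or very large δ and γ the mass moves far into the tails, where an adaptive integrator would be a better choice.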

2.2. Distribution Function

In this section, we present the explicit formula for the CDF of LGN distribution.

Proposition 3.

Let

X \sim L G N (δ, γ)

, then

F_{L G N} (x) = \frac{Γ (δ, - γ ln Φ (x))}{Γ (δ)},

(4)

where

Γ (a) = \int_{0}^{\infty} u^{a - 1} e^{- u} d u

is the gamma function and

Γ (a, x) = \int_{x}^{\infty} u^{a - 1} e^{- u} d u

is the upper incomplete gamma function.

Proof.

The CDF of the LGN distribution is obtained as follows:

\begin{matrix} F_{L G N} (x) & = \int_{- \infty}^{x} f_{L G N} (t) d t \\ & = \int_{- \infty}^{x} \frac{γ^{δ}}{Γ (δ)} {[- ln Φ (t)]}^{δ - 1} {[Φ (t)]}^{γ - 1} ϕ (t) d t \\ & = - \int_{+ \infty}^{- ln Φ (x)} \frac{γ^{δ}}{Γ (δ)} s^{δ - 1} e^{- γ s} d s, \quad by s = - ln Φ (t) \\ & = \int_{- ln Φ (x)}^{+ \infty} \frac{γ^{δ}}{Γ (δ)} s^{δ - 1} e^{- γ s} d s \\ & = \frac{Γ (δ, - γ ln Φ (x))}{Γ (δ)} \end{matrix}

□
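As an illustration of formula (4), the CDF can be evaluated with the regularized incomplete gamma function. The sketch below assumes only the Python standard library, so the lower regularized incomplete gamma function is computed from its power series; `regularized_lower_gamma` and `lgn_cdf` are hypothetical helper names, and a library routine would normally be preferred in practice.

```python
import math
from statistics import NormalDist


def regularized_lower_gamma(shape, x, max_terms=10000, tol=1e-15):
    """P(shape, x) = gamma(shape, x) / Gamma(shape), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= x / (shape + n)
        total += term
        if term < tol * total:
            break
    log_prefactor = shape * math.log(x) - x - math.lgamma(shape)
    return min(1.0, total * math.exp(log_prefactor))


def lgn_cdf(x, delta, gamma):
    """CDF (4): upper regularized incomplete gamma at -gamma * ln Phi(x)."""
    phi_cdf = NormalDist().cdf(x)
    if phi_cdf <= 0.0:
        return 0.0
    if phi_cdf >= 1.0:
        return 1.0
    return 1.0 - regularized_lower_gamma(delta, -gamma * math.log(phi_cdf))


# With delta = 1 the LGN CDF reduces to the PN CDF, Phi(x) ** gamma.
print(lgn_cdf(0.3, 1.0, 2.5), NormalDist().cdf(0.3) ** 2.5)
```

The comparison in the last line uses the fact that the upper regularized incomplete gamma function with shape 1 is exp(-u), which recovers the PN case of Proposition 1.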

It can be shown that (see Cordeiro et al. [16]) for the density function given in (2), the quantile function is given by:

Q (u) = Φ^{- 1} (exp \{- γ^{- 1} Q^{- 1} (δ, u)\}),

where

Q^{- 1} (δ, u)

is the inverse function of

Q (δ, u) = Γ (δ, u) / (Γ (δ))

.

The inversion method can be used to generate a random variable with LGN distribution. Thus, let

δ, γ \in R^{+}

and U be a random variable with uniform distribution, namely

U \sim U (0, 1)

. Then, the random variable X with distribution

LGN (δ, γ)

can be obtained by letting

X = Φ^{- 1} (e^{- F^{- 1} (1 - U, δ, γ)}),

where

Φ^{- 1} (\cdot)

and

F^{- 1} (\cdot, δ, γ)

are the inverses of the CDF of the normal and gamma distributions, respectively. The survival and hazard functions for the LGN distribution can be obtained from (2) and (4), and they are given by

S (t) = Γ (- ln Φ (t); δ, γ),

and

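r (t) = \frac{\frac{γ^{δ}}{Γ (δ)} {[- ln Φ (t)]}^{δ - 1} {[Φ (t)]}^{γ - 1} ϕ (t)}{Γ (- ln Φ (t); δ, γ)},

respectively, where

Γ (\cdot; δ, γ)

denotes the CDF of the gamma distribution with shape parameter δ and rate parameter γ.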
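The stochastic representation behind the inversion method can be sketched in code. The example below is a simplification under stated assumptions: instead of inverting the gamma CDF at 1 − U, it draws the gamma variable W directly with the standard library generator, which yields the same distribution for X; the function name `sample_lgn` and the clamping constants are illustrative choices.

```python
import math
import random
from statistics import NormalDist, fmean


def sample_lgn(size, delta, gamma, rng):
    """Draw from LGN(delta, gamma) as X = Phi^{-1}(exp(-W)), W ~ Gamma(delta, rate=gamma)."""
    std_normal = NormalDist()
    draws = []
    for _ in range(size):
        # random.gammavariate takes (shape, scale); the scale is 1 / rate.
        w = rng.gammavariate(delta, 1.0 / gamma)
        # Keep the probability strictly inside (0, 1) so inv_cdf is defined.
        p = min(max(math.exp(-w), 1e-300), 1.0 - 1e-16)
        draws.append(std_normal.inv_cdf(p))
    return draws


rng = random.Random(2022)  # arbitrary seed, for reproducibility only
sample = sample_lgn(1000, 1.5, 0.75, rng)
print(fmean(sample))
```

Passing the random generator explicitly keeps the sketch reproducible; with δ = γ = 1 the draws should behave like a standard normal sample.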

2.3. Location-Scale Extension

Let

X \sim L G N (δ, γ)

. The location-scale extension of the random variable X is defined using the transformation

Y = ξ + σ X

, where

ξ \in R

and

σ \in R^{+}

. The corresponding PDF of Y is given by

\begin{matrix} f_{L G N} (y; ξ, σ, δ, γ) & = \frac{γ^{δ}}{σ Γ (δ)} {[- ln Φ (\frac{y - ξ}{σ})]}^{δ - 1} {[Φ (\frac{y - ξ}{σ})]}^{γ - 1} ϕ (\frac{y - ξ}{σ}), \end{matrix}

(5)

where

ξ

is a location parameter and

σ

is a scale parameter. The random variable Y that has a distribution with density function given in Equation (5) is denoted as

Y \sim L G N (ξ, σ, δ, γ)

.

The previous location-scale representation can be extended to the case where the response variable depends on regressor variables, say

Z_{1}, \dots, Z_{p}

, through the relationship

ξ_{i} = z_{i}^{⊤} β

; where

β = {(β_{0}, β_{1}, \dots, β_{p})}^{⊤}

is an unknown vector of regression coefficients and

z = {(1, z_{1}, \dots, z_{p})}^{⊤}

is a vector of known regressors correlated with the response vector.

The rth moment of a variable

Y \sim LGN (ξ, σ, δ, γ)

can be obtained using the formula

E [Y^{r}] = \sum_{l = 0}^{r} \binom{r}{l} ξ^{l} σ^{r - l} E [X^{r - l}], r = 1, 2, \dots,

where

X \sim LGN (δ, γ)

.

Proof.

Let

X \sim LGN (δ, γ)

, then, for

Y = ξ + σ X

and

r = 1, 2, \dots

we have

\begin{matrix} E [Y^{r}] & = E [{(ξ + σ X)}^{r}] \\ & = E [\sum_{l = 0}^{r} \binom{r}{l} ξ^{l} {(σ X)}^{r - l}] \\ & = \sum_{l = 0}^{r} \binom{r}{l} ξ^{l} σ^{r - l} E [X^{r - l}] . \end{matrix}

In the second line, the binomial theorem is used. □
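The binomial expansion for the moments of the location-scale variable translates directly into code. This is a minimal sketch; the function name `location_scale_moment` and the convention that the list `standard_moments` stores E[X^k] at index k (with a leading 1 for k = 0) are assumptions made for the example.

```python
from math import comb


def location_scale_moment(r, xi, sigma, standard_moments):
    """E[Y^r] for Y = xi + sigma * X, given standard_moments[k] = E[X^k]."""
    return sum(
        comb(r, l) * xi**l * sigma ** (r - l) * standard_moments[r - l]
        for l in range(r + 1)
    )


# Illustration with the standard normal moments E[X^k], k = 0, ..., 4.
normal_moments = [1.0, 0.0, 1.0, 0.0, 3.0]
print(location_scale_moment(2, 2.0, 0.5, normal_moments))
```

For the LGN case the entries of `standard_moments` would be the numerically computed moments of Proposition 2 rather than the normal values used here.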

3. Log-Gamma-Normal Regression Model

Regression models have been a statistical technique widely used in many areas of knowledge to explain the behavior of a response variable, say Y, as a function of other variables called regressors, say

Z_{1}, \dots, Z_{p}

, and a vector of unknown parameters called regression coefficients denoted by

β

. Specifically, for a random sample of n individuals indexed by

i = 1, \dots, n

, we have

y_{i} = z_{i}^{⊤} β + ε_{i} for i = 1, 2, \dots, n;

(6)

where

ε_{i}

is a random variable (random error) with certain PDF, the most common being the normal distribution assumption, i.e.,

ε_{i} \sim N (0, σ^{2})

. Given the multiple departures from the normality assumption and the actual behavior of the random variable

ε_{i}

, this assumption has been replaced in numerous instances by other more realistic ones, usually looking for distributions to fit data with higher or lower skewness and/or kurtosis than that allowed by the normal distribution. Notable inferential mistakes are made (invalid results) when we work under the normal assumption and this assumption is not true. In some cases, a simple transformation helps to solve this problem, but this strategy typically has problems of interpretability of the results or the coefficients of the model.

Now, we replace the normal assumption with the LGN assumption for the random error term

ε_{i}

, so we suppose that

ε_{i} \sim LGN (0, σ, δ, γ)

and this leads to

Y_{i} \sim LGN (z_{i}^{⊤} β, σ, δ, γ)

for

i = 1, \dots, n

. The case

δ = γ = 1

corresponds to the ordinary normal regression model. Using the least squares method, we obtain the estimators

\tilde{β} = {(Z^{⊤} Z)}^{- 1} Z^{⊤} y

, which are not unbiased for the regression coefficients; however, with the correction

{\tilde{β}}_{0}^{*} = {\tilde{β}}_{0} + \hat{E} [ε]

, where the last term represents the estimated expected value of the random variable

ε

, we can obtain unbiased estimators of the parameters.

Estimation Using Maximum Likelihood Method

We initially define some quantities:

Z

is a matrix

n \times (p + 1)

where rows

z_{i}

correspond to observations for the ith individual for p independent variables;

y

is an

n \times 1

vector whose ith entry is the response for the ith individual; and

β = {(β_{0}, β_{1}, \dots, β_{p})}^{⊤}

is an unknown vector of regression coefficients. Thus, given a random sample of size n, say

y = {(Y_{1}, \dots, Y_{n})}^{⊤}

, where

Y_{i} \sim LGN (z_{i}^{⊤} β; σ, δ, γ)

for

i = 1, \dots, n

; the log-likelihood function for the vector

θ = {(β^{⊤}, σ, δ, γ)}^{⊤}

can be written as follows:

\begin{matrix} ℓ (θ; y) & = n [δ ln (γ) - ln (Γ (δ)) - ln (σ) - 0.5 ln (2 π)] - \frac{1}{2 σ^{2}} {(y - Z β)}^{⊤} (y - Z β) \\ + (δ - 1) \sum_{i = 1}^{n} ln [- ln Φ (\frac{y_{i} - z_{i}^{⊤} β}{σ})] + (γ - 1) \sum_{i = 1}^{n} ln [Φ (\frac{y_{i} - z_{i}^{⊤} β}{σ})] . \end{matrix}

(7)
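The log-likelihood (7) can be evaluated term by term. The sketch below assumes plain Python lists for the design matrix and the response, and moderate standardized residuals so that Φ(x_i) stays strictly between 0 and 1 in floating point; the name `lgn_loglik` and the demonstration data are hypothetical.

```python
import math
from statistics import NormalDist


def lgn_loglik(beta, sigma, delta, gamma, Z, y):
    """Log-likelihood (7); Z is a list of rows (1, z_1, ..., z_p), y a list of responses."""
    std_normal = NormalDist()
    n = len(y)
    total = n * (
        delta * math.log(gamma)
        - math.lgamma(delta)
        - math.log(sigma)
        - 0.5 * math.log(2.0 * math.pi)
    )
    for row, response in zip(Z, y):
        # Standardized residual x_i = (y_i - z_i' beta) / sigma.
        x = (response - sum(b * z for b, z in zip(beta, row))) / sigma
        log_cdf = math.log(std_normal.cdf(x))
        total += -0.5 * x * x
        total += (delta - 1.0) * math.log(-log_cdf) + (gamma - 1.0) * log_cdf
    return total


# Made-up data for illustration only.
Z_demo = [[1.0, 0.2], [1.0, 0.5], [1.0, 0.9]]
y_demo = [2.1, 2.7, 2.6]
print(lgn_loglik([2.0, 1.0], 0.5, 0.75, 1.5, Z_demo, y_demo))
```

A function of this form can be passed, with its sign flipped, to a general-purpose numerical optimizer, which is the role optim plays in the paper.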

After taking the first partial derivatives of the log-likelihood function (7) regarding the parameters of interest and setting them equal to zero, we obtain the following score equations:

\begin{matrix} U (β) & = \frac{1}{σ^{2}} Z^{⊤} (y - Z β) - \frac{1}{σ} Z^{⊤} [(δ - 1) U + (γ - 1) I_{n}] Δ_{1} = 0, \end{matrix}

(8)

\begin{matrix} U (σ) = - \frac{n}{σ} + \frac{1}{σ^{3}} {(y - Z β)}^{⊤} (y - Z β) \\ - \frac{1}{σ^{2}} {(y - Z β)}^{⊤} [(δ - 1) U + (γ - 1) I_{n}] Δ_{1} = 0, \end{matrix}

(9)

\begin{matrix} U (δ) & = n ln (γ) - n ψ (δ) + \sum_{i = 1}^{n} ln [- ln Φ (x_{i})] = 0, \end{matrix}

(10)

\begin{matrix} U (γ) & = \frac{n δ}{γ} + \sum_{i = 1}^{n} ln [Φ (x_{i})] = 0, \end{matrix}

(11)

where

Δ_{1} = {(v_{1}, \dots, v_{n})}^{⊤}

, and

U = diag \{1 / u_{1}, \dots, 1 / u_{n}\}

with

v_{i} = ϕ (x_{i}) / Φ (x_{i})

and

u_{i} = ln [Φ (x_{i})]

; and

x_{i} = (y_{i} - z_{i}^{⊤} β) / σ

for

i = 1, \dots, n

;

I_{n}

is the identity matrix of order n, and

ψ (\cdot)

is the digamma function.
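As a worked consequence of the score equations, (11) can be solved in closed form for γ when β, σ and δ are held fixed, and substituting the result into (10) leaves a single equation in δ. This is a sketch that follows directly from (10) and (11); it is not presented as the estimation procedure used by the authors.

```latex
\hat{γ} (β, σ, δ) = - \frac{n δ}{\sum_{i = 1}^{n} ln [Φ (x_{i})]},
\qquad
ψ (δ) - ln (δ) = \frac{1}{n} \sum_{i = 1}^{n} ln [- ln Φ (x_{i})] - ln \Big(- \frac{1}{n} \sum_{i = 1}^{n} ln [Φ (x_{i})]\Big) .
```

Since ln Φ(x_i) < 0 for every i, the solution for γ is positive. These are the usual likelihood equations of a gamma sample formed by the values −ln Φ(x_i).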

The elements of the observed information matrix for the parameter

θ = {(β^{⊤}, σ, δ, γ)}^{⊤}

are easily computed by taking second partial derivatives, obtaining:

\begin{matrix} j_{β β^{⊤}} & = \frac{1}{σ^{2}} Z^{⊤} Z + \frac{δ - 1}{σ^{2}} Z^{⊤} Δ_{2} Z + \frac{γ - 1}{σ^{2}} Z^{⊤} Δ_{3} Z, \\ j_{β σ} & = \frac{2}{σ^{3}} Z^{⊤} (y - Z β) + \frac{δ - 1}{σ^{2}} Z^{⊤} Δ_{4} + \frac{γ - 1}{σ^{2}} Z^{⊤} Δ_{5}, \\ j_{σ σ} & = - \frac{n}{σ^{2}} + \frac{3}{σ^{2}} \sum_{i = 1}^{n} x_{i}^{2} + \frac{δ - 1}{σ^{2}} \sum_{i = 1}^{n} \{- 2 \frac{v_{i}}{u_{i}} x_{i} + \frac{v_{i}^{2}}{u_{i}^{2}} x_{i}^{2} + \frac{v_{i} x_{i}^{3} + v_{i}^{2} x_{i}^{2}}{u_{i}}\} \\ & + \frac{γ - 1}{σ^{2}} \sum_{i = 1}^{n} [- 2 v_{i} x_{i} + v_{i} x_{i}^{3} + x_{i}^{2} v_{i}^{2}], \\ j_{β γ} & = \frac{1}{σ} Z^{⊤} Δ_{1}, j_{β δ} = \frac{1}{σ} Z^{⊤} Δ_{6}, j_{σ δ} = \frac{1}{σ} X^{⊤} Δ_{6}, j_{σ γ} = \frac{1}{σ} X^{⊤} Δ_{1}, \\ j_{δ δ} & = n ψ_{1} (δ), j_{δ γ} = - \frac{n}{γ}, j_{γ γ} = \frac{n δ}{γ^{2}}, \end{matrix}

where

Δ_{2} = diag \{v_{i}^{2} / u_{i}^{2} + (v_{i} x_{i} + v_{i}^{2}) / u_{i}\}

and

Δ_{3} = diag \{v_{i} x_{i} + v_{i}^{2}\}

with

i = 1, \dots, n

;

Δ_{4} = {(a_{1}, \dots, a_{n})}^{⊤}

,

Δ_{5} = {(b_{1}, \dots, b_{n})}^{⊤}

, and

Δ_{6} = {(c_{1}, \dots, c_{n})}^{⊤}

with

a_{i} = - v_{i} / u_{i} + v_{i}^{2} x_{i} / u_{i}^{2} + (v_{i} x_{i}^{2} + v_{i}^{2} x_{i}) / u_{i}

,

b_{i} = v_{i} x_{i}^{2} + v_{i}^{2} x_{i} - v_{i}

,

c_{i} = v_{i} / u_{i}

for

i = 1, \dots, n

,

ψ_{1} (\cdot)

is the trigamma function and

X = {(x_{1}, \dots, x_{n})}^{⊤}

with

x_{i} = (y_{i} - z_{i}^{⊤} β) / σ

for

i = 1, \dots, n

.

The Fisher information matrix

I (θ)

can be obtained numerically, calculating

n^{- 1}

times the expected value of the observed information matrix. When

δ = γ = 1

, we obtain the case of the normal distribution

N (0, σ^{2})

for the random variable

ε_{i}

. Using numerical approximation, the determinant of the Fisher information matrix is

\det (I (θ)) = \det (Z^{⊤} Z) [- 0.3137 \det (Z^{⊤} Z) + 0.3093 \sum_{j = 0}^{p} {\bar{z}}_{j}^{2}]

, where

\det (\cdot)

denotes the determinant function of a matrix and

{\bar{z}}_{j}

denotes the mean in the sample of the variable

Z_{j}

. Thus, the determinant of the information matrix is different from zero, and the information matrix is non-singular, which ensures the conditions needed to apply the asymptotic normal approximation to the distribution of the maximum likelihood estimator of

θ

. Here, the covariance matrix of

\hat{θ}

is the inverse of the Fisher information matrix, i.e.,

Σ_{\hat{θ}} = I^{- 1} (θ)

.

Approximation

N_{p + 4} (θ, Σ_{\hat{θ}})

can be used to construct confidence intervals for

θ_{r}

, which are given by

{\hat{θ}}_{r} \mp z_{1 - α / 2} \sqrt{\hat{σ} ({\hat{θ}}_{r})}

, where

\hat{σ} ({\hat{θ}}_{r})

corresponds to the rth diagonal element of the matrix

Σ_{\hat{θ}}

and

z_{1 - α / 2}

denotes the

100 (1 - α / 2) \%

quantile of the standard normal distribution.
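The interval construction above amounts to a few arithmetic steps. The following minimal sketch assumes that the estimate and its estimated variance (the corresponding diagonal element of the covariance matrix) are already available; the function name `wald_interval` and the numerical values are illustrative only.

```python
from statistics import NormalDist


def wald_interval(estimate, variance, alpha=0.05):
    """Asymptotic confidence interval: estimate -/+ z_{1 - alpha/2} * sqrt(variance)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * variance**0.5
    return estimate - half_width, estimate + half_width


# Hypothetical estimate 1.0 with estimated variance 0.04.
lower, upper = wald_interval(1.0, 0.04)
print(lower, upper)
```

For parameters restricted to the positive half-line, such as σ, δ and γ, an interval of this form may extend below zero in small samples; that limitation belongs to the normal approximation, not to the sketch.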

4. Censored LGN Model

Models for censored data are common in economic research, medicine, biology, and survival analysis. Usually, this type of data is analyzed using the Tobit model (see Tobin [24]), also known as the censored normal (CN) model. In some cases, the tails of the distribution of the random errors are heavier or lighter than the tails of the normal distribution; the Tobit model then does not estimate the probability in the censored part very well, and this leads to poor estimates. In these cases, another distribution must be assumed for the errors; especially in the case of asymmetric errors, one can work with the power-normal Tobit (PNT) model (see Martínez-Flórez et al. [25]), the censored SN model, or any other model that fits the degree of asymmetry and the kurtosis of the errors. We now extend the LGN regression model to censored data, which we call the censored LGN (CLGN) regression model.

Censored LGN Variable

Consider a random variable

Y^{*} \sim LGN (ξ, σ, δ, γ)

and let

\{y_{1}^{*}, y_{2}^{*}, \dots, y_{n}^{*}\}

be a random sample of size n of

Y^{*}

. Let T be a censoring value for the

Y^{*}

variable. The CLGN random variable Y is defined as

y_{i} = \{\begin{matrix} y_{i}^{*}, & if y_{i}^{*} > T, \\ T, & if y_{i}^{*} \leq T, \end{matrix}

for

i = 1, \dots, n

. We use the notation

CLGN (ξ, σ, δ, γ)

. Consequently, the probability mass at the value T is

\Pr (y_{i} = T) = \Pr (y_{i}^{*} \leq T) = 1 - Γ (- ln Φ (z_{T i}); δ, γ)

, where

z_{T i} = (T - ξ) / σ

. For

y_{i}^{*} > T

, the distribution of the variable Y is

LGN (ξ, σ, δ, γ)

. Although the threshold T in the formulation above need not be zero, it can be shifted to zero by working with

y_{i}^{*} - T .

Hence, there is no loss of generality in taking

T = 0

.

When we have regressor variables, say

Z_{1}, \dots, Z_{p}

, through the relationship

ξ_{i} = z_{i}^{⊤} β

, where

β = {(β_{0}, β_{1}, \dots, β_{p})}^{⊤}

is an unknown vector of regression coefficients, and

z = {(1, z_{1}, \dots, z_{p})}^{⊤}

is a vector of known regressors correlated with the response vector, we have a CLGN regression model defined by the random variable

y_{i} = max {y_{i}^{*}, T}

, with

y_{i}^{*} = z_{i}^{⊤} β + ε_{i}

,

i = 1, \dots, n;

i.e.,

y_{i} = \{\begin{matrix} z_{i}^{⊤} β + ε_{i}, & if z_{i}^{⊤} β + ε_{i} > 0, \\ 0, & otherwise . \end{matrix}

(12)

For a sample of size n,

y = {(y_{1}, \dots, y_{n})}^{⊤}

, where

Y_{i} \sim CLGN (z_{i}^{⊤} β; σ, δ, γ)

for

i = 1, \dots, n

; the log-likelihood function for the vector

θ = {(β^{⊤}, σ, δ, γ)}^{⊤}

is given by

\begin{matrix} ℓ (θ; y) & = \sum_{0} ln [1 - Γ (- ln Φ (x_{T i}); δ, γ)] + \sum_{1} [ln (\frac{γ^{δ}}{σ Γ (δ)}) + (δ - 1) ln [- ln Φ (x_{i})]] \\ & + \sum_{1} [- \frac{1}{2} ln (2 π) - \frac{1}{2} x_{i}^{2} + (γ - 1) ln [Φ (x_{i})]], \end{matrix}

where

\sum_{0}

and

\sum_{1}

denote the sum in the censored part and uncensored part, respectively;

x_{i} = (y_{i} - z_{i}^{⊤} β) / σ

and

x_{T i} = (T - z_{i}^{⊤} β) / σ

.
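A censored log-likelihood of this kind can be sketched as follows, with each censored observation contributing the logarithm of the probability mass at T and each uncensored observation contributing the log-density (5). The code assumes only the Python standard library, moderate standardized values, and a series evaluation of the gamma CDF; `regularized_lower_gamma`, `clgn_loglik` and the demonstration data are hypothetical.

```python
import math
from statistics import NormalDist


def regularized_lower_gamma(shape, x, max_terms=10000, tol=1e-15):
    """P(shape, x) = gamma(shape, x) / Gamma(shape), via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    for n in range(1, max_terms):
        term *= x / (shape + n)
        total += term
        if term < tol * total:
            break
    return min(1.0, total * math.exp(shape * math.log(x) - x - math.lgamma(shape)))


def clgn_loglik(beta, sigma, delta, gamma, Z, y, threshold=0.0):
    """Log-likelihood of the censored LGN regression model with left-censoring at threshold."""
    std_normal = NormalDist()
    total = 0.0
    for row, response in zip(Z, y):
        mean = sum(b * z for b, z in zip(beta, row))
        if response <= threshold:
            # Censored part: ln Pr(Y* <= T) = ln[1 - G(-ln Phi(x_T))],
            # with G the gamma CDF (shape delta, rate gamma).
            x_t = (threshold - mean) / sigma
            gamma_cdf = regularized_lower_gamma(
                delta, -gamma * math.log(std_normal.cdf(x_t))
            )
            total += math.log(1.0 - gamma_cdf)
        else:
            # Uncensored part: log of the LGN density (5).
            x = (response - mean) / sigma
            log_cdf = math.log(std_normal.cdf(x))
            total += (
                delta * math.log(gamma) - math.lgamma(delta) - math.log(sigma)
                + (delta - 1.0) * math.log(-log_cdf) + (gamma - 1.0) * log_cdf
                - 0.5 * x * x - 0.5 * math.log(2.0 * math.pi)
            )
    return total


# Made-up data; the first response is censored at zero.
Z_demo = [[1.0, 0.1], [1.0, 0.4], [1.0, 0.8]]
y_demo = [0.0, 0.7, 1.3]
print(clgn_loglik([0.2, 1.0], 0.5, 0.75, 1.5, Z_demo, y_demo))
```

With δ = γ = 1 the function reduces to the Tobit log-likelihood, which gives a convenient sanity check before handing it to a numerical optimizer.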

Special cases from model (12) occur when

δ = γ = 1

, so the Tobit model follows (see Tobin [24]) and with

δ = 1

the Tobit PN model follows (see Martínez-Flórez et al. [25]). Parameter estimation can be performed by the maximum likelihood method, i.e., by maximizing the function

ℓ (θ; y)

, whose solution using iterative numerical methods leads to the maximum likelihood estimator (MLE) of the model.

5. Simulation Study

To study the performance of the MLE

\hat{θ} = {({\hat{β}}^{⊤}, \hat{σ}, \hat{δ}, \hat{γ})}^{⊤}

of parameter vector

θ = {(β^{⊤}, σ, δ, γ)}^{⊤}

, we conducted a Monte Carlo simulation study with small and moderate samples. In the study, we generated 5000 samples of sizes

n =

50, 100, 200 and 500, and we considered the LGN model. The following parameter values were taken:

δ, γ = 0.75

, 1.50;

β = {(β_{0}, β_{1})}^{⊤} = {(2.0, 1.0)}^{⊤}

and we took

σ = 0.50

.

We considered a linear model with a single covariate Z whose values were generated according to a uniform distribution

U (0, 1)

. We also took errors

ε_{i} \sim LGN (0, σ, δ, γ)

. To evaluate the performance of the point estimators, we considered the bias

(Bias)

, the relative bias (RB), defined as the absolute value of the bias divided by the true parameter value, and the root mean squared error

R M S E = \sqrt{MSE}

, where the MSE, averaged over all samples, equals the squared bias plus the variance. Maximum likelihood parameter estimates were computed using the optim function in the statistical package R (R Development Core Team [23]).
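The three performance measures can be computed from the Monte Carlo replicates of one estimator as in the sketch below. The function name `summarize_estimates` and the three replicate values are hypothetical and only illustrate the definitions; they are not results from the paper.

```python
from statistics import fmean


def summarize_estimates(estimates, true_value):
    """Return (bias, relative bias, RMSE) of Monte Carlo estimates of one parameter."""
    bias = fmean(estimates) - true_value
    relative_bias = abs(bias) / abs(true_value)
    # The MSE is the average squared error, i.e. squared bias plus variance.
    mse = fmean([(estimate - true_value) ** 2 for estimate in estimates])
    return bias, relative_bias, mse**0.5


# Hypothetical replicates of an estimator whose true value is 2.0.
print(summarize_estimates([1.9, 2.1, 2.3], 2.0))
```

In a study like the one described, this function would be applied separately to the 5000 estimates of each parameter for every sample size.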

Table 1 and Table 2 present the results of the simulation study. It can be seen from the tables that the RMSEs of the MLEs for

β_{0}

,

β_{1}

,

σ

,

δ

and

γ

decrease as the sample size increases, which is expected since the estimators are consistent. The relative bias of the MLEs also decreases as the sample size increases. The MLEs of

β_{0}

are unstable because this parameter is affected by the asymmetry parameter; however, its MLE becomes more stable as the sample size becomes larger. It can also be seen that when the parameter

γ

increases, the bias of the MLEs of

β_{0}

,

δ

and

γ

is larger. The main conclusion is that we are quite safe working with the MLEs if sample sizes are greater than 100.

Table 1. Performance evaluation for the MLE of

β_{0}

,

β_{1}

,

σ

,

δ

and

γ

under LGN model for

δ = 0.75

and

γ = 0.75, 1.50

.

Table 2. Performance evaluation for the MLE of

β_{0}

,

β_{1}

,

σ

,

δ

and

γ

under LGN model for

δ = 1.50

and

γ = 0.75, 1.50

.

6. Real Data Applications

6.1. Application 1

We consider a dataset related to longitudinal data on cholesterol levels collected as part of the famed Framingham heart study. The file includes information for

n = 200

randomly selected individuals, reported in Zhang and Davidian [22]. The considered variables were the cholesterol level

(Y)

, the age of the individual at baseline

(Z_{1})

and the gender indicator (

Z_{2}

) (

0 =

female,

1 =

male). For this application, we take only the observations in the second period of time of the measurement

(n = 176)

. Table 3 presents the summary statistics, including measures of skewness and kurtosis, for the cholesterol data. Clearly, the values of the skewness and kurtosis for the cholesterol data justify using an asymmetric model such as the PN, SN or LGN model.

Table 3. Summary statistics for cholesterol levels for 176 subjects of the Framingham cholesterol study.

A model with normally distributed errors was fitted first. The Shapiro–Wilk normality test gives a test statistic of $W = 0.9599$ with p-value $= 6.254 \times 10^{- 5}$, so the normality of the errors is rejected. We then fitted linear regression models with errors following an asymmetric distribution, namely the SN, PN and LGN distributions. To estimate the parameters of these models, we used the optim function available in R [23].

Table 4 presents the MLEs of the parameters of the fitted models. As initial values, we took the estimates from the normal model obtained with the lm function in R [23]. For $δ$ and $γ$ (and in some cases $σ$), we took the estimates obtained under the SN, PN and LGN location-scale models fitted to the variable $Y$. From the table, the age at baseline ($Z_{1}$) is not significant, and the cholesterol level depends solely on gender in the normal, PN and SN models. For the LGN model, the cholesterol level depends on both the gender and the age of the individual at baseline.

Table 4. Estimates and standard error (SE) for normal, PN, SN and LGN linear regression models fitted to cholesterol data.

The considered linear model was

Y_{i} = β_{0} + β_{1} Z_{1 i} + β_{2} Z_{2 i} + ε_{i}, i = 1, 2, \dots, 176 .

To compare the normal, PN and LGN models, which are nested, we used the Akaike information criterion (AIC) of Akaike [26], the corrected Akaike information criterion (AICc) and the Bayesian information criterion (BIC) of Schwarz [27], defined as

\begin{aligned} \mathrm{AIC} & = - 2 \hat{ℓ} (\cdot) + 2 k, \\ \mathrm{AICc} & = \mathrm{AIC} + \frac{2 k (k + 1)}{n - (k + 1)}, \\ \mathrm{BIC} & = - 2 \hat{ℓ} (\cdot) + k \log (n), \end{aligned}

where $\hat{ℓ} (\cdot)$ is the maximized log-likelihood, $k$ is the number of unknown parameters in the considered model and $n$ is the sample size. The best model is the one with the smallest AIC, AICc or BIC.
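The three criteria are simple functions of the maximized log-likelihood, so they are easy to check by hand. The Python sketch below evaluates them for the LGN regression model with $k = 6$ parameters and $n = 176$. The log-likelihood value is an assumption: it is not reported in the paper and is back-computed here from the AIC of the LGN model in Table 5. The function name is illustrative.

```python
import math


def information_criteria(loglik, k, n):
    """Return (AIC, AICc, BIC) for a maximized log-likelihood, k parameters, n observations."""
    aic = -2.0 * loglik + 2.0 * k
    aicc = aic + (2.0 * k * (k + 1)) / (n - (k + 1))
    bic = -2.0 * loglik + k * math.log(n)
    return aic, aicc, bic


if __name__ == "__main__":
    # LGN regression: beta0, beta1, beta2, sigma, delta, gamma -> k = 6; n = 176.
    # Assumption: log-likelihood recovered from the tabulated AIC, not taken from the paper.
    loglik_lgn = -(1775.356 - 2 * 6) / 2
    aic, aicc, bic = information_criteria(loglik_lgn, k=6, n=176)
    print(f"AIC = {aic:.3f}, AICc = {aicc:.3f}, BIC = {bic:.3f}")
```

With these inputs, the AICc and BIC agree to three decimals with the values listed for the LGN model in Table 5 (1775.853 and 1794.379).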

For the normal, PN, SN and LGN models, the scaled residuals $e_{i} = (y_{i} - z_{i}^{⊤} \hat{β}) / \hat{σ}$ were computed; they are presented in Figure 2 and Figure 3.

Figure 2. Histogram for scaled residuals for (a) Normal model, (b) PN model, (c) SN model, and (d) LGN model fitted to the cholesterol data.

Figure 3. QQplots for (a) Normal model, (b) PN model, (c) SN model, and (d) LGN model fitted to the cholesterol data.

The normality assumption for the errors can be tested through the hypotheses

H_{01} : (δ, γ) = (1, 1) versus H_{11} : (δ, γ) \neq (1, 1),

using the likelihood-ratio (LR) statistic $- 2 \log (Λ_{1}) = - 2 (ℓ_{N} (\hat{θ}) - ℓ_{LGN} (\hat{θ}))$, which for the dataset under study gives

- 2 \log (Λ_{1}) = 18.228,

so that the p-value is $< 0.05$, a strong indication against the null hypothesis.

Similarly, the assumption of a PN distribution for the errors can be tested through the hypotheses

H_{02} : δ = 1 versus H_{12} : δ \neq 1,

using the LR statistic $- 2 \log (Λ_{2}) = - 2 (ℓ_{PN} (\hat{θ}) - ℓ_{LGN} (\hat{θ}))$, which gives

- 2 \log (Λ_{2}) = 6.622,

so that the p-value is $< 0.05$, again an indication against the null hypothesis.
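The p-values of the two LR tests can be approximated as follows. This Python sketch assumes the usual large-sample result that the LR statistic is asymptotically chi-square with degrees of freedom equal to the number of restricted parameters (2 for $H_{01}$, 1 for $H_{02}$). It relies on the closed-form upper tails $\exp (- x / 2)$ for 2 degrees of freedom and $\mathrm{erfc} (\sqrt{x / 2})$ for 1 degree of freedom, so only the standard library is needed; the function name is illustrative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


if __name__ == "__main__":
    p_normal_vs_lgn = chi2_sf(18.228, df=2)  # H01: (delta, gamma) = (1, 1)
    p_pn_vs_lgn = chi2_sf(6.622, df=1)       # H02: delta = 1
    print(f"normal vs LGN: p = {p_normal_vs_lgn:.2e}")
    print(f"PN vs LGN:     p = {p_pn_vs_lgn:.4f}")
```

Under these assumptions both p-values fall below 0.05 (roughly $1.1 \times 10^{- 4}$ and 0.010), in line with the conclusions stated above.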

Table 5 presents the AIC, AICc and BIC criteria for the normal, PN, SN and LGN models. According to these criteria, the model that best fits the dataset is the SN model, since it has the lowest AIC, AICc and BIC values, followed by the LGN model. However, recall that the SN model has a singular information matrix when the asymmetry parameter $λ$ is zero; therefore, hypothesis tests about the model parameters based on likelihood-ratio statistics, for example a test of the significance of the asymmetry parameter in the SN model, are not supported by large-sample theory. This is a disadvantage relative to the LGN model, whose information matrix was shown to be non-singular. In addition, as mentioned in Section 2, the LGN model covers wider ranges of asymmetry and kurtosis than the SN model, so in practice it may be preferable in certain situations.

Table 5. AIC, AICc, and BIC for normal, PN, SN and LGN linear models.

This discussion illustrates that the final selection of a model is often simply a matter of choice. The LGN model can be considered appropriate if we want to use a model with which we can carry out hypothesis tests about the parameters, especially those associated with skewness and kurtosis in the model. In any case, the final choice must be duly justified.

For non-nested models, we used the generalized LR test studied by Vuong [28], which was derived to compare competing models that are strictly non-nested. Let $F_{θ}$ and $G_{ζ}$ be two non-nested models, with $f (y_{i} ∣ x_{i}, θ)$ and $g (y_{i} ∣ x_{i}, ζ)$ the corresponding densities. The LR statistic for comparing the two models is given by

LR (\hat{θ}, \hat{ζ}) = \sum_{i = 1}^{n} \log \frac{f (y_{i} ∣ x_{i}, \hat{θ})}{g (y_{i} ∣ x_{i}, \hat{ζ})},

which does not follow a chi-square distribution. To overcome this problem, Vuong [28] proposed an alternative approach based on the Kullback–Leibler information criterion [29]. Based on the distance between each model and the true data-generating process, namely the model $h^{0} (y ∣ x)$, he arrived at the statistic

T_{LR, NN} = \frac{1}{\sqrt{n}} \frac{LR (\hat{θ}, \hat{ζ})}{\hat{w}},

(13)

where

{\hat{w}}^{2} = \frac{1}{n} \sum_{i = 1}^{n} \left( \log \frac{f (y_{i} ∣ x_{i}, \hat{θ})}{g (y_{i} ∣ x_{i}, \hat{ζ})} \right)^{2} - \left( \frac{1}{n} \sum_{i = 1}^{n} \log \frac{f (y_{i} ∣ x_{i}, \hat{θ})}{g (y_{i} ∣ x_{i}, \hat{ζ})} \right)^{2} .

For strictly non-nested models, the statistic (13) converges in distribution to a standard normal distribution under the null hypothesis that the models are equivalent. Thus, the null hypothesis is not rejected if $| T_{LR, NN} | \leq z_{p / 2}$. On the other hand, at significance level $p$ we reject the null hypothesis of equivalence in favor of model $F_{θ}$ being better (or worse) than model $G_{ζ}$ if $T_{LR, NN} > z_{p}$ (or $T_{LR, NN} < - z_{p}$).
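A compact way to see how this test works is to compute it from the pointwise log-densities of two fitted models. The Python sketch below uses the standard form of Vuong's statistic, $n^{- 1 / 2} \sum_{i} d_{i} / \hat{w}$ with $d_{i} = \log f_{i} - \log g_{i}$ and $\hat{w}^{2}$ the sample variance of the $d_{i}$ (divisor $n$). The decision rule uses the two-sided critical value $z_{p / 2}$, which is a choice made for this illustration. The two competing densities (normal with mean 0 and normal with mean 1) and all function names are toy assumptions, not the LGN, PN or SN fits of the paper.

```python
import math
import random
from statistics import NormalDist


def vuong_statistic(logf, logg):
    """Vuong's statistic from the pointwise log-densities of two fitted models."""
    n = len(logf)
    d = [a - b for a, b in zip(logf, logg)]
    mean_d = sum(d) / n
    w2 = sum(v * v for v in d) / n - mean_d ** 2
    if w2 <= 0.0:
        raise ValueError("log-density differences have zero variance")
    return sum(d) / (math.sqrt(n) * math.sqrt(w2))


def vuong_decision(t, level=0.05):
    """Two-sided decision at the given level (illustrative rule)."""
    z = NormalDist().inv_cdf(1.0 - level / 2.0)
    if t > z:
        return "F preferred"
    if t < -z:
        return "G preferred"
    return "not distinguished"


def normal_logpdf(y, mu):
    """Log-density of a unit-variance normal distribution with mean mu."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2


if __name__ == "__main__":
    rng = random.Random(0)
    y = [rng.gauss(0.0, 1.0) for _ in range(200)]   # data generated under model F
    logf = [normal_logpdf(v, 0.0) for v in y]       # toy model F
    logg = [normal_logpdf(v, 1.0) for v in y]       # toy model G
    t = vuong_statistic(logf, logg)
    print(f"T = {t:.3f} -> {vuong_decision(t)}")
```

In practice, `logf` and `logg` would hold the log-densities of each observation evaluated at the MLEs of the two competing regression models.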

We now use Vuong's test to compare the LGN versus SN and PN versus SN models fitted to the data, since each pair consists of non-nested models. Let $f (y_{i} ∣ x_{i}, \hat{θ})$ be the LGN model and $g (y_{i} ∣ x_{i}, \hat{ζ})$ the SN model. The generalized LR test statistic is $T_{LR, NN} = 33.981$. For PN versus SN, the generalized LR test statistic is $T_{LR, NN} = 30.072$. Therefore, according to the generalized LR statistic, the LGN and PN models are significantly superior to the SN model. Hence, the LGN model is preferred over the normal, PN and SN models.

6.2. Application 2

For the second application, we consider a dataset consisting of measurements for 68 solar-type stars. These data were previously described and analyzed by Santos et al. [30] and Tovar-Falón et al. [31]. The dataset is available in the astrodatR package for R [23] under the name Stellar Abundances. In this application, the response variable is $\log N (Be)$, which represents the log of the abundance of beryllium scaled to the Sun's abundance, i.e., the Sun has $\log N (Be) = 0.0$. The explanatory variable is $Teff / 1000$, the effective stellar surface temperature (in Kelvin) divided by 1000.

In astronomy, objects such as stars, galaxies or X-ray sources, among others, are observed in some new wave bands. Some of these objects can go unnoticed due to limited sensitivities, leading to upper limits in the measurement of their luminosity (see Feigelson [32]). For the dataset, 14 observations (19.35%) were censored at 0.0, i.e., 12 beryllium measurements were not detected.

We fitted the censored normal (CN) or Tobit model using the censReg function in R [23]. We also fitted the censored power-normal (CPN) and censored LGN (CLGN) models. Table 6 shows the MLEs for the fitted models. The initial values for the parameters $β_{k}$ were taken from the CN model estimates returned by censReg. The output shows that the explanatory variable $X$ is significant in all the considered models.
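For reference, the log-likelihood of the CN (Tobit) model with left-censoring at a known point can be written in a few lines. The Python sketch below assumes the standard Tobit form: an uncensored observation contributes $\log φ ((y_{i} - μ_{i}) / σ) - \log σ$ and an observation censored at $c$ contributes $\log Φ ((c - μ_{i}) / σ)$, with $μ_{i} = β_{0} + β_{1} x_{i}$. The function name, the argument layout and the toy data are illustrative; this is not the code used by censReg or by the authors.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for its pdf and cdf


def tobit_loglik(beta0, beta1, sigma, x, y, censored, c=0.0):
    """Log-likelihood of a simple linear Tobit model, left-censored at c."""
    total = 0.0
    for xi, yi, is_censored in zip(x, y, censored):
        mu = beta0 + beta1 * xi
        if is_censored:
            # Censored point: only P(Y <= c) is known.
            total += math.log(_STD_NORMAL.cdf((c - mu) / sigma))
        else:
            z = (yi - mu) / sigma
            total += math.log(_STD_NORMAL.pdf(z)) - math.log(sigma)
    return total


if __name__ == "__main__":
    # Toy values, for illustration only (not the stellar dataset).
    x = [5.5, 5.8, 6.0, 6.2]
    y = [0.0, 0.9, 1.1, 1.2]
    censored = [True, False, False, False]
    # Evaluated at the CN estimates reported in Table 6.
    print(tobit_loglik(-0.9450, 0.3224, 0.3147, x, y, censored))
```

Maximizing this function over $(β_{0}, β_{1}, σ)$ with a numerical optimizer gives the CN fit; the CPN and CLGN models replace the normal density and distribution function by their PN and LGN counterparts.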

Table 6. Estimates (standard error) for CN, CPN, and CLGN linear models.

Table 7 contains the AIC, AICc and BIC values for the fitted models; the CLGN model presents the best fit. Figure 4a–c show the histogram, the CDF and the QQ-plot of the scaled residuals of the CLGN model for the uncensored observations, which indicate a good fit of the CLGN model.

Table 7. AIC, AICc and BIC for CN, CPN and CLGN linear models.

Figure 4. (a) histogram for scaled residuals CLGN model, (b) CDF of the scaled residuals CLGN model (c) qqplot for scaled residuals CLGN regression model.

We compare the normal and PN models against the LGN model. For the hypothesis tests

(δ, γ) = (1, 1) versus (δ, γ) \neq (1, 1)

and

δ = 1 versus δ \neq 1,

we have $- 2 \log (Δ_{1}) = 8.4432$ and $- 2 \log (Δ_{2}) = 20.1476$, both with p-value $< 0.05$. Both null hypotheses are therefore rejected, and the LGN model performs better than the normal and PN models.

7. Conclusions

In this paper, we have proposed the asymmetric LGN distribution to give flexibility to the error term in linear regression models. The LGN distribution is based on the log-gamma-generated families of distributions of Amini et al. [15]. The new model extends the PN family of distributions and covers wider ranges of asymmetry and kurtosis, which gives it greater flexibility. The ordinary Tobit model of Tobin [24] and the Tobit power-normal model of Martínez-Flórez et al. [25] are special cases of the extension of the LGN model to censored data. Parameters were estimated by maximum likelihood; the Fisher information matrix was derived and shown numerically to be non-singular, which guarantees valid large-sample results for the likelihood-ratio statistics. Two applications to real data show that the proposed model can be a useful alternative to existing models such as the normal, power-normal, Tobit normal and Tobit power-normal models. In addition, given the non-singularity of its information matrix and its wider ranges of asymmetry and kurtosis, it may be a better alternative to the skew-normal distribution.

Author Contributions

Conceptualization, R.T.-F. and G.M.-F.; Methodology, R.T.-F., G.M.-F. and H.B.; Data curation, G.M.-F.; Formal analysis, R.T.-F., G.M.-F. and H.B.; Investigation, R.T.-F., G.M.-F. and H.B.; Resources, R.T.-F. and G.M.-F.; Software, R.T.-F. and G.M.-F.; Supervision, H.B.; Validation, G.M.-F. and R.T.-F.; Visualization, R.T.-F. and G.M.-F.;Writing—original draft, R.T.-F., G.M.-F. and H.B.; Writing—review and editing, R.T.-F., G.M.-F. and H.B. All authors have read and agreed to the published version of the manuscript.

Funding

Resolución de Problemas de Situaciones Reales Usando Análisis Estadístico a través del Modelamiento Multidimensional de Tasas y Proporciones; Esquemas de Monitoreamiento para Datos Asimétricos no Normales y una Estrategia Didáctica para el Desarrollo del Pensamiento Lógico-Matemático. Universidad de Córdoba, Colombia, Code FCB-05-19.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Details about data available are given in Section 6.

Acknowledgments

G. Martínez-Flórez and R. Tovar-Falón acknowledge the support given by Universidad de Córdoba, Montería, Colombia.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Azzalini, A. A class of distributions which includes the normal ones. Scand. J. Stat. 1985, 12, 171–178.
2. Durrans, S.R. Distributions of fractional order statistics in hydrology. Water Resour. Res. 1992, 28, 1649–1655.
3. Gupta, R.D.; Gupta, R.C. Analyzing skewed data by power normal model. Test 2008, 17, 197–210.
4. Pewsey, A.; Gómez, H.W.; Bolfarine, H. Likelihood-based inference for power distributions. Test 2012, 21, 775–789.
5. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Skew-normal alpha-power model. Statistics 2014, 48, 1414–1428.
6. Martínez-Flórez, G.; Vergara-Cardozo, S.; González, L.M. The family of log-skew-normal alpha-power distributions using precipitation data. Rev. Colomb. Estad. 2013, 36, 43–57.
7. Tovar-Falón, R.; Bolfarine, H.; Martínez-Flórez, G. The Asymmetric Alpha-Power Skew-t Distribution. Symmetry 2020, 12, 82.
8. Azzalini, A.; Capitanio, A. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew-t distribution. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2003, 65, 367–389.
9. Zhao, J.; Kim, H.M. Power-t distributions. Commun. Stat. Appl. Methods 2016, 23, 321–334.
10. Tung, H.P.; Tseng, S.T.; Hsu, N.J.; Hou, Y.T. A generalized pH acceleration model of nano-sol products and the effects of model misspecification on shelf-life prediction. IISE Trans. 2022, 54, 496–504.
11. Martínez-Flórez, G.; Tovar-Falón, R.; Jiménez-Narváez, M. Likelihood-Based Inference for the Asymmetric Beta-Skew Alpha-Power Distribution. Symmetry 2020, 12, 613.
12. Martínez-Flórez, G.; Tovar-Falón, R.; Martínez-Guerra, M. The Censored Beta-Skew Alpha-Power Distribution. Symmetry 2021, 13, 1114.
13. Sahu, S.K.; Dey, D.K.; Branco, M.D. A new class of multivariate skew distributions with applications to Bayesian regression models. Can. J. Stat. 2003, 31, 129–150.
14. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. Asymmetric regression models with limited responses with an application to antibody response to vaccine. Biom. J. 2013, 55, 156–172.
15. Amini, M.; MirMostafaee, S.M.T.K.; Ahmani, J. Log–gamma–generated families of distributions. Statistics 2014, 48, 913–926.
16. Cordeiro, G.M.; Bourguignon, M.; Ortega, E.M.M.; Ramires, T.G. General mathematical properties, regression and applications of the log-gamma-generated family. Commun. Stat. Theory Methods 2018, 47, 1050–1070.
17. Prentice, R.L. A log-gamma model and its maximum likelihood estimation. Biometrika 1974, 61, 539–544.
18. Lawless, J.F. Inference in the generalized gamma and log gamma distributions. Technometrics 1980, 22, 409–419.
19. Young, D.H.; Bakir, S.T. Bias correction for a generalized log-gamma regression model. Technometrics 1987, 29, 183–191.
20. Ortega, E.M.M.; Bolfarine, H.; Paula, G.A. Influence diagnostics in generalized log-gamma regression models. Comput. Stat. Data Anal. 2003, 42, 165–186.
21. Ortega, E.M.M.; Cancho, V.G.; Paula, G.A. Generalized log-gamma regression models with cure fraction. Lifetime Data Anal. 2009, 15, 79.
22. Zhang, D.; Davidian, M. Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics 2001, 57, 795–802.
23. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021; Available online: http://www.R-project.org (accessed on 31 July 2021).
24. Tobin, J. Estimation of relationship for limited dependent variables. Econometrica 1958, 26, 24–36.
25. Martínez-Flórez, G.; Bolfarine, H.; Gómez, H.W. The alpha–power tobit model. Commun. Stat. Theory Methods 2013, 42, 633–643.
26. Akaike, H. A new look at statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–722.
27. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
28. Vuong, Q.H. Likelihood ratio tests for models selection and non–nested hypotheses. Econometrica 1989, 57, 307–333.
29. Kleiber, C.; Zeileis, A. Applied Econometrics with R, 1st ed.; Springer: New York, NY, USA, 2008.
30. Santos, N.; López, R.G.; Israelian, G.; Mayor, M.; Rebolo, R.; García-Gil, A.; De Taoro, M.P.; Randich, S. Beryllium abundances in stars hosting giant planets. Astron. Astrophys. 2002, 386, 1028–1038.
31. Tovar-Falón, R.; Bolfarine, H.; Martínez-Flórez, G. The Asymmetric Power-Student-t Model for Censored and Truncated Data. Acad. Bras. Cienc. 2021, 93, e20190920.
32. Feigelson, E.D. astrodatR: Astronomical Data. R Package v. 0.1. Available online: https://cran.r-project.org/web/packages/astrodatR/ (accessed on 31 July 2021).

Figure 1. PDF of the LGN distribution: (a) $δ = 0.6$ and $γ = 0.30$ (dotted line), 0.60 (dashed line), 1.0 (dotted–dashed line), 3.0 (long dashed line) and 6.0 (solid line). (b) $γ = 1.5$ and $δ = 0.30$ (dotted line), 0.60 (dashed line), 1.0 (dotted–dashed line), 2.0 (long dashed line) and 5.0 (solid line).

Table 1. Performance evaluation for the MLE of $β_{0}$, $β_{1}$, $σ$, $δ$ and $γ$ under the LGN model for $δ = 0.75$ and $γ = 0.75, 1.50$.

		$δ = 0.75$, $γ = 0.75$			$δ = 0.75$, $γ = 1.50$
n	$\hat{θ}$	Bias	RB (%)	RMSE	Bias	RB (%)	RMSE
50	${\hat{β}}_{0}$	0.027	1.3	0.625	0.093	4.6	0.605
	${\hat{β}}_{1}$	0.003	0.3	0.207	0.003	0.3	0.171
	$\hat{σ}$	−0.034	6.8	0.194	−0.049	9.8	0.193
	$\hat{δ}$	0.124	16.6	0.586	0.243	32.4	1.187
	$\hat{γ}$	0.439	58.5	1.898	0.464	30.9	2.304
100	${\hat{β}}_{0}$	0.024	1.2	0.489	0.047	2.4	0.455
	${\hat{β}}_{1}$	0.001	0.1	0.143	0.001	0.1	0.119
	$\hat{σ}$	−0.021	4.2	0.166	−0.019	3.8	0.165
	$\hat{δ}$	0.112	15.0	0.526	0.208	27.7	0.953
	$\hat{γ}$	0.291	38.8	1.242	0.414	27.6	1.946
200	${\hat{β}}_{0}$	0.010	0.5	0.344	0.020	1.0	0.308
	${\hat{β}}_{1}$	0.001	0.1	0.102	0.001	0.1	0.085
	$\hat{σ}$	−0.018	3.7	0.142	−0.017	3.4	0.155
	$\hat{δ}$	0.111	14.8	0.450	0.162	21.6	0.760
	$\hat{γ}$	0.214	28.5	0.908	0.393	26.2	1.639
500	${\hat{β}}_{0}$	0.004	0.2	0.163	0.006	0.3	0.157
	${\hat{β}}_{1}$	0.001	0.1	0.051	0.001	0.0	0.042
	$\hat{σ}$	−0.004	0.3	0.108	−0.006	1.2	0.126
	$\hat{δ}$	0.072	9.6	0.321	0.085	11.3	0.430
	$\hat{γ}$	0.124	16.5	0.543	0.281	18.7	1.108

Table 2. Performance evaluation for the MLE of $β_{0}$, $β_{1}$, $σ$, $δ$ and $γ$ under the LGN model for $δ = 1.50$ and $γ = 0.75, 1.50$.

		$δ = 1.50$, $γ = 0.75$			$δ = 1.50$, $γ = 1.50$
n	$\hat{θ}$	Bias	RB (%)	RMSE	Bias	RB (%)	RMSE
50	${\hat{β}}_{0}$	0.018	0.9	0.673	0.095	4.7	0.660
	${\hat{β}}_{1}$	−0.003	0.3	0.180	−0.002	0.2	0.144
	$\hat{σ}$	−0.032	6.3	0.273	−0.069	13.9	0.188
	$\hat{δ}$	0.316	21.1	1.747	0.269	17.9	1.364
	$\hat{γ}$	0.593	79.5	2.210	0.422	28.2	2.355
100	${\hat{β}}_{0}$	0.013	0.6	0.522	0.058	2.9	0.540
	${\hat{β}}_{1}$	−0.001	0.1	0.126	−0.001	0.1	0.100
	$\hat{σ}$	−0.017	3.3	0.244	−0.026	5.2	0.159
	$\hat{δ}$	0.308	20.6	1.527	0.257	17.1	1.180
	$\hat{γ}$	0.446	59.5	1.647	0.415	27.7	2.016
200	${\hat{β}}_{0}$	0.012	0.6	0.387	0.034	1.7	0.406
	${\hat{β}}_{1}$	−0.001	0.1	0.090	−0.001	0.1	0.072
	$\hat{σ}$	−0.014	2.8	0.198	−0.014	2.7	0.135
	$\hat{δ}$	0.246	16.4	1.222	0.231	15.4	0.991
	$\hat{γ}$	0.309	41.2	1.281	0.344	22.9	1.602
500	${\hat{β}}_{0}$	0.011	0.5	0.212	0.001	0.1	0.213
	${\hat{β}}_{1}$	−0.001	0.1	0.045	−0.001	0.1	0.036
	$\hat{σ}$	−0.003	0.6	0.124	−0.001	0.1	0.102
	$\hat{δ}$	0.150	10.0	0.757	0.131	8.7	0.628
	$\hat{γ}$	0.139	18.6	0.657	0.240	16.0	1.043

Table 3. Summary statistics for cholesterol levels for 176 subjects of the Framingham cholesterol study.

Mean	SD	$\sqrt{b_{1}}$	$b_{2}$
224.597	41.242	0.896	3.594

Table 4. Estimates and standard error (SE) for normal, PN, SN and LGN linear regression models fitted to cholesterol data.

	Normal	PN	SN	LGN
$β_{0}$	150.6 (15.9)	71.1 (25.0)	128.1 (14.0)	90.9 (0.2)
$β_{1}$	−9.3 (5.8)	−7.7 (5.5)	−5.7 (5.0)	−5.3 (2.6)
$β_{2}$	1.9 (0.4)	1.6 (0.4)	1.3 (0.4)	1.1 (0.2)
$σ$	38.4 (2.1)	61.8 (6.1)	60.3 (4.8)	29.2 (0.1)
$λ$	-	-	3.6 (1.1)	-
$δ$	-	-	-	0.2 (0.1)
$γ$	-	8.2 (3.5)	-	9.2 (4.1)

Table 5. AIC, AICc, and BIC for normal, PN, SN and LGN linear models.

Criteria	Normal Model	PN Model	SN Model	LGN Model
AIC	1789.584	1779.979	1773.770	1775.356
AICc	1789.817	1780.331	1774.123	1775.853
BIC	1799.096	1795.832	1789.622	1794.379

Table 6. Estimates (standard error) for CN, CPN, and CLGN linear models.

Parameters	CN Model	CPN Model	CLGN Model
$β_{0}$	−0.9450 (0.5854)	−1.6054 (0.6910)	−1.1897 (0.4331)
$β_{1}$	0.3224 (0.1023)	0.5222 (0.1208)	0.3520 (0.0732)
$σ$	0.3147 (0.0281)	0.0813 (0.0688)	0.3268 (0.0326)
$α$	-	-	2.0123 (0.5762)
$λ$	-	0.0280 (0.0477)	4.5803 (0.9748)

Table 7. AIC, AICc and BIC for CN, CPN and CLGN linear models.

Criteria	CN Model	CPN Model	CLGN Model
AIC	50.2585	63.9628	−25.8152
AICc	50.6335	64.5977	−24.8474
BIC	62.9170	72.8408	−14.7176

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
