A New Generalization of the Truncated Gumbel Distribution with Quantile Regression and Applications

Abstract: In this article, we introduce a new model with positive support. The model extends the truncated Gumbel distribution by incorporating a shape parameter that provides greater flexibility. It is parameterized in terms of the p-th quantile of the distribution so that quantile regression can be performed. An extensive simulation study demonstrates the good performance of the maximum likelihood estimators in finite samples. Finally, two applications to real datasets related to the level of beta-carotene and body mass index are presented.


Introduction
The Gumbel distribution, also known as the type-I generalized extreme value distribution, is commonly used to model data with extreme observations. This distribution and its extensions have a wide range of applications in disciplines such as hydrology, economics, finance, climatology and seismology. The probability density function (pdf), the cumulative distribution function (cdf) and the quantile function of a random variable X that follows the Gumbel distribution are given by $g(x; \mu, \sigma) = \frac{1}{\sigma}\exp\left\{-\frac{x-\mu}{\sigma} - \exp\left(-\frac{x-\mu}{\sigma}\right)\right\}$, $G(x; \mu, \sigma) = \exp\left\{-\exp\left(-\frac{x-\mu}{\sigma}\right)\right\}$ and $Q(p; \mu, \sigma) = \mu - \sigma\log(-\log p)$, respectively, where x ∈ R, 0 < p < 1, µ ∈ R and σ > 0. Applications of this model in different scenarios can be found in Bhaskaran et al. [1], Gurung et al. [2], Purohit et al. [3], Li et al. [4] and Kang et al. [5].
Several extensions of the Gumbel distribution have recently been proposed in the literature. Hossam et al. [6] presented a statistical model that combines the new alpha power transformation method and the Gumbel distribution. Watthanawisut and Bodhisuwan [7] proposed a new extension of the so-called Topp-Leone Gumbel distribution that is used to model minimum flow data. Fayomi et al. [8] presented the exponentiated Gumbel-G family of distributions and explored a special case called EGuNH. Nagode et al. [9] introduced a three-parameter Gumbel distribution, which was applied to rope failure data. Oseni and Okasha [10] derived the Gumbel-geometric distribution, which was applied to precipitation and maximum annual wind speed data. Note that none of these extensions considers a regression framework; their main objective is to fit univariate data.
It is evident that regression models have become relevant tools in the era of Data Science. Among them, the so-called quantile regression models (introduced by Koenker and Bassett [11]) are an alternative to the usual regression techniques, in which the mean response conditional on the values of covariates (or explanatory variables) is estimated. Quantile regression models allow us to measure the effects of covariates at different quantiles of the response variable distribution. Thus, they provide an analysis across the entire conditional distribution, as can be seen in Cade et al. [12], Koenker [13] and Wei et al. [14]. The mean, as the only summary measure, is generally quite poor for assessing risk, as it is greatly affected by the presence of outlying observations. Outliers can be quite rare, but at the same time, they can be enough to cause serious problems when analyzing the information obtained; see, for example, Gómez-Déniz et al. [15], who analyzed extreme values in insurance companies. To our knowledge, there are no studies on quantile regression models based on the Gumbel distribution. Thus, the objectives of this work were to introduce a new generalization of the truncated Gumbel distribution and then establish a quantile regression model based on this novel generalization. To do this, the new truncated Gumbel generalization was reparameterized by incorporating a parameter that represents the quantile. We should note that the proposed generalization was achieved by considering the work of Neamah and Qasim [16] and the transformation provided by Cooray and Ananda [17]. The latter authors developed an extension of the half-normal (HN) distribution through the relation Y = βX^{1/α}, where X ∼ HN(1).
The rest of the paper is organized as follows. In Section 2, we introduce our proposal, the generalized truncated Gumbel (GTG) distribution, and present several important properties of this new model. In Section 3, inference is performed, including some initial points to obtain maximum likelihood (ML) estimators and the observed Fisher information matrix for the proposed model. In Section 4, the reparameterized model in terms of a quantile is presented. In Section 5, we discuss the simulation study carried out to analyze the performance of the ML estimators in finite samples for the proposed model without and with covariates. In Section 6, two real-data applications are presented to illustrate the proposed models, without and with covariates. Finally, in Section 7, some concluding comments are presented.

Generalized Truncated Gumbel Distribution
Neamah and Qasim [16] derived a new model with positive support from the Gumbel distribution by truncating its pdf from the left. We will refer to the resulting model as the truncated Gumbel (TG) distribution, which is defined on the interval (0, ∞). Considering the reparametrization λ = µ/σ, the pdf of the TG distribution can be written as follows: $f(y; \beta, \lambda) = \frac{g(y/\beta - \lambda)}{\beta\,[1 - G(-\lambda)]}$, y > 0, where β > 0 is a scale parameter, λ ∈ R is a shape parameter, and g(u) = exp(−u − exp(−u)) and G(u) = exp(− exp(−u)) are the pdf and cdf of the standard Gumbel distribution, respectively.
In this work, we considered the transformation developed by Cooray and Ananda [17] to extend the TG distribution. That is, we considered the transformation Z = βY^{1/α}, where Y ∼ TG(1, λ). We will refer to this extension as the generalized truncated Gumbel (GTG) distribution. Important functions, such as the pdf, cdf, hazard and quantile functions of the GTG distribution, are provided below.
Proof. Both functions are obtained immediately from their definitions.
Figure 1 shows the pdf, cdf and hazard function for the GTG(1, λ, α) model, considering some combinations of λ and α. We observe that the GTG model can have decreasing or unimodal shapes for the pdf, whereas the hazard function can have decreasing or increasing shapes. Also, we observe that for some combinations of λ and α, the cdf increases rapidly, although all of them tend to 1 as z increases.
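The functions mentioned above follow from the transformation Z = βY^{1/α}. The sketch below is a minimal Python illustration, assuming that TG(1, λ) is the Gumbel distribution with location λ and unit scale truncated to (0, ∞). The function names are hypothetical, and the closed forms are derived from that assumption rather than copied from the propositions.

```python
import math

def trunc_mass(lam):
    """1 - G(-lam): probability that a Gumbel(lam, 1) variable is positive."""
    return -math.expm1(-math.exp(lam))

def gtg_sf(z, beta, lam, alpha):
    """Survival function 1 - F(z) of GTG(beta, lam, alpha), z > 0."""
    v = (z / beta) ** alpha - lam
    return math.expm1(-math.exp(-v)) / math.expm1(-math.exp(lam))

def gtg_cdf(z, beta, lam, alpha):
    return 1.0 - gtg_sf(z, beta, lam, alpha)

def gtg_pdf(z, beta, lam, alpha):
    """pdf: (alpha/beta) (z/beta)^(alpha-1) g(v) / (1 - G(-lam))."""
    t = (z / beta) ** alpha
    v = t - lam
    return alpha * t / z * math.exp(-v - math.exp(-v)) / trunc_mass(lam)

def gtg_hazard(z, beta, lam, alpha):
    return gtg_pdf(z, beta, lam, alpha) / gtg_sf(z, beta, lam, alpha)

def gtg_quantile(p, beta, lam, alpha):
    """Inverse of gtg_cdf for 0 < p < 1."""
    w = lam - math.log(-math.log1p(-(1.0 - p) * trunc_mass(lam)))
    return beta * w ** (1.0 / alpha)
```

The survival function is written with `expm1` and `log1p` so that very negative values of λ, where 1 − G(−λ) is close to zero, do not lose precision.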

Mode
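The shape of the pdf of Z ∼ GTG(β, λ, α) can be examined through its critical points. Computing the first derivative of log(f(z)) with respect to z, where $f(z) = \frac{\alpha z^{\alpha-1} g(v)}{\beta^{\alpha}[1 - G(-\lambda)]}$ is the pdf of the GTG model, we obtain that $\frac{\partial \log f(z)}{\partial z} = \frac{\alpha - 1}{z} - \frac{\alpha}{z}\left(\frac{z}{\beta}\right)^{\alpha}\left(1 - e^{-v}\right)$, where $v = (z/\beta)^{\alpha} - \lambda$. By equating the previous expression to 0, we obtain that $\alpha\left(\frac{z}{\beta}\right)^{\alpha}\left(1 - e^{-v}\right) = \alpha - 1$, (5) from which the mode of Z can be obtained numerically. The nature of the critical points is determined by ∂² log(f(z))/∂z² = u(z), where u(z) is given by $u(z) = -\frac{\alpha - 1}{z^{2}} - \frac{\alpha(\alpha - 1)\, z^{\alpha - 2}\left(1 - e^{-v}\right)}{\beta^{\alpha}} - \frac{\alpha^{2} z^{2\alpha - 2} e^{-v}}{\beta^{2\alpha}}$. Depending on whether u(z_0) < 0 or u(z_0) > 0, where z = z_0 is a solution of Equation (5), the critical point is a local maximum or a local minimum, respectively. Figure 2 shows the shape of u(z) for β = 1 and selected values of λ and α. From here, we observe that the pdf of the GTG distribution tends to zero as z → ∞, for both positive and negative values of λ.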
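As an illustration of how the mode can be found numerically, the sketch below solves the stationarity condition by bisection and then checks the sign of the second derivative of log f. It assumes the GTG pdf is proportional to z^{α−1} g((z/β)^α − λ) and restricts attention to α > 1, where a sign change is guaranteed. The function names are hypothetical.

```python
import math

def phi(t, lam, alpha):
    """z * d log f / dz, written in t = (z/beta)^alpha; roots are stationary points."""
    return (alpha - 1.0) - alpha * t * (1.0 - math.exp(lam - t))

def gtg_mode(beta, lam, alpha):
    """Bisection for a stationary point of the pdf (a local maximum)."""
    if alpha <= 1.0:
        raise ValueError("the bracketing used here assumes alpha > 1")
    lo, hi = 1e-12, 1.0
    while phi(hi, lam, alpha) > 0.0:      # expand until the derivative is negative
        hi *= 2.0
    for _ in range(200):                  # invariant: phi(lo) > 0 >= phi(hi)
        mid = 0.5 * (lo + hi)
        if phi(mid, lam, alpha) > 0.0:
            lo = mid
        else:
            hi = mid
    return beta * (0.5 * (lo + hi)) ** (1.0 / alpha)

def u(z, beta, lam, alpha):
    """Second derivative of log f(z) with respect to z."""
    t = (z / beta) ** alpha
    e = math.exp(lam - t)                 # e^{-v}
    return (-(alpha - 1.0) - alpha * (alpha - 1.0) * t * (1.0 - e)
            - alpha ** 2 * t ** 2 * e) / z ** 2
```

Because the bisection keeps a positive derivative on the left and a non-positive one on the right, the point it returns is a local maximum. When α ≤ 1, the pdf may be decreasing and a different search would be needed.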
Proof. It follows from a direct computation, by applying the definition of the quantile function.
Corollary 1. The quartiles of the GTG distribution are as follows: 1.
Proof. It is immediate from Proposition 3.

Moments
Proposition 4. Let Z ∼ GTG(β, λ, α) and n be a positive integer. Then, the n-th moment of Z is given by $E(Z^{n}) = \beta^{n} E(Y^{n/\alpha})$, where Y ∼ TG(1, λ). The (n/α)-th moment of Y can be computed by following the properties presented in Neamah and Qasim [16].
Corollary 2. If Z ∼ GTG(β, λ, α), then the first four moments and the variance of Z are obtained as follows: 1.
Proof. It is immediate from Proposition 4.
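Corollary 3. Let Z ∼ GTG(β, λ, α). Then, the skewness coefficient ($\sqrt{\beta_1}$) and the kurtosis coefficient ($\beta_2$) are given by $\sqrt{\beta_1} = \frac{\mu_3 - 3\mu_1\mu_2 + 2\mu_1^{3}}{(\mu_2 - \mu_1^{2})^{3/2}}$ and $\beta_2 = \frac{\mu_4 - 4\mu_1\mu_3 + 6\mu_1^{2}\mu_2 - 3\mu_1^{4}}{(\mu_2 - \mu_1^{2})^{2}}$, where $\mu_n = E(Z^{n})$.

Proof. The expressions above are obtained using the definitions of the skewness and kurtosis coefficients; that is, $\sqrt{\beta_1} = E[(Z - \mu_1)^{3}]/[\mathrm{Var}(Z)]^{3/2}$ and $\beta_2 = E[(Z - \mu_1)^{4}]/[\mathrm{Var}(Z)]^{2}$, where $\mathrm{Var}(Z) = \mu_2 - \mu_1^{2}$.

Remark 2. Proposition 4 shows that the moments of the GTG distribution basically depend on the moments of the TG(1, λ) model. Plots for the expected value, variance, skewness and kurtosis coefficients of the GTG(1, λ, α) model are given in Figure 3 for different values of the λ and α parameters. The bottom plots in Figure 3 reflect the effect of the α parameter: a lower value of α produces higher values of the skewness and kurtosis coefficients. This fact can also be appreciated in Tables 1 and 2.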
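The moment relation in Proposition 4 can be checked numerically. The sketch below is a hedged illustration that evaluates E(Y^{n/α}) for Y ∼ TG(1, λ) with a composite Simpson rule instead of the closed-form results of Neamah and Qasim [16]. It assumes the TG(1, λ) pdf is g(y − λ)/[1 − G(−λ)] on (0, ∞); the helper names and the integration cut-off are choices made for this example.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

def tg_moment(r, lam):
    """E(Y^r) for Y ~ TG(1, lam), by numerical integration."""
    mass = -math.expm1(-math.exp(lam))            # 1 - G(-lam)
    def integrand(y):
        w = y - lam
        return y ** r * math.exp(-w - math.exp(-w)) / mass
    upper = max(lam, 0.0) + 60.0                  # the tail beyond this is negligible
    return simpson(integrand, 0.0, upper)

def gtg_moment(n, beta, lam, alpha):
    """n-th raw moment of GTG(beta, lam, alpha): beta^n E(Y^(n/alpha))."""
    return beta ** n * tg_moment(n / alpha, lam)

def skew_kurt(beta, lam, alpha):
    """Skewness and kurtosis coefficients from the first four raw moments."""
    m1, m2, m3, m4 = (gtg_moment(k, beta, lam, alpha) for k in (1, 2, 3, 4))
    var = m2 - m1 ** 2
    skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / var ** 1.5
    kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4) / var ** 2
    return skew, kurt
```

A convenient sanity check is the limit λ → −∞, where TG(1, λ) approaches the unit exponential distribution, so that the GTG moments approach the Weibull moments β^n Γ(1 + n/α). For fractional n/α the integrand is not smooth at zero, and the Simpson rule converges more slowly.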

Bonferroni Curves
In different disciplines, such as socio-economics and public health sciences, it is often necessary to compare and analyze the inequality of non-negative distributions. Generally, Bonferroni curves are used as graphical methods to carry out this comparison or analysis (see Bonferroni [18] and Arcagni and Porro [19] for further discussion of these curves). The following result shows the expressions of these curves for the GTG model. where ) k e −t dt, and $v = \exp(-(q/\beta)^{\alpha} + \lambda)$.
Proof. The expression above is obtained using the definition of the Bonferroni curve; that is, $B_F(p) = \frac{1}{p\mu}\int_{0}^{q} z f(z)\,dz$, where µ is the expected value of the corresponding non-negative random variable, and q = F^{−1}(p).
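The definition used in the proof can also be evaluated numerically. The sketch below is an illustration under the same assumption as before about the TG(1, λ) pdf: after the change of variable y = (z/β)^α the scale β cancels, so the curve depends only on λ and α. It uses a Simpson rule rather than the closed-form expression of the proposition, and the function names are hypothetical.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * f(a + i * h)
    return total * h / 3.0

def bonferroni(p, lam, alpha):
    """Bonferroni curve B(p) of GTG(beta, lam, alpha); beta cancels out."""
    mass = -math.expm1(-math.exp(lam))                     # 1 - G(-lam)
    def integrand(y):                                      # y^(1/alpha) * f_Y(y)
        w = y - lam
        return y ** (1.0 / alpha) * math.exp(-w - math.exp(-w)) / mass
    # (q/beta)^alpha, where q is the p-th quantile of Z
    w_p = lam - math.log(-math.log1p(-(1.0 - p) * mass))
    partial = simpson(integrand, 0.0, w_p)
    full = simpson(integrand, 0.0, max(lam, 0.0) + 60.0)   # proportional to E(Z)
    return partial / (p * full)
```

In the limit λ → −∞ with α = 1, the model approaches the exponential distribution, whose Bonferroni curve is 1 + (1 − p) log(1 − p)/p. This gives a simple reference value for checking the code.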

Inference for the GTG Distribution
In this section, we discuss the maximum likelihood (ML) approach for parameter estimation in the GTG model.

Maximum Likelihood Estimators
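Let z_1, z_2, ..., z_n be a random sample of size n from the GTG(β, λ, α) model. Then, the log-likelihood function for θ = (β, λ, α) is given by $\ell(\theta) = n\left[\log\alpha - \alpha\log\beta + \lambda - \log(1 - G(-\lambda))\right] + (\alpha - 1)\sum_{i=1}^{n}\log z_i - \sum_{i=1}^{n}\left(\frac{z_i}{\beta}\right)^{\alpha} - \sum_{i=1}^{n} e^{-v_i}$, where $v_i = (z_i/\beta)^{\alpha} - \lambda$. Therefore, the score assumes the form S(θ) = (S_β(θ), S_λ(θ), S_α(θ)), where $S_\beta(\theta) = -\frac{n\alpha}{\beta} + \frac{\alpha}{\beta}\sum_{i=1}^{n}\left(\frac{z_i}{\beta}\right)^{\alpha}\left(1 - e^{-v_i}\right)$, (9) $S_\lambda(\theta) = n - \frac{n\,g(-\lambda)}{1 - G(-\lambda)} - \sum_{i=1}^{n} e^{-v_i}$ (10) and $S_\alpha(\theta) = \frac{n}{\alpha} + \sum_{i=1}^{n}\log\left(\frac{z_i}{\beta}\right)\left[1 - \left(\frac{z_i}{\beta}\right)^{\alpha}\left(1 - e^{-v_i}\right)\right]$. (11) The ML estimators are then obtained by numerically solving the equation S(θ) = 0_3, where 0_p denotes a vector of zeros of length p. Solutions of Equations (9)-(11) can be obtained using numerical procedures in R [20], such as the Newton-Raphson method. To initialize the numerical algorithm that solves S(θ) = 0_3, we propose an initial point for the vector θ in the next subsection.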
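For readers who want to experiment, the following is a minimal sketch of the ML fit in Python rather than R. It is not the authors' implementation: it assumes the GTG pdf implied by Z = βY^{1/α} with Y ∼ TG(1, λ), maximizes the log-likelihood directly with SciPy's Nelder-Mead method instead of Newton-Raphson, and works on (log β, λ, log α) so that the positivity constraints hold automatically. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def gtg_negloglik(theta, z):
    """Negative log-likelihood; theta = (log beta, lambda, log alpha)."""
    beta, lam, alpha = np.exp(theta[0]), theta[1], np.exp(theta[2])
    t = (z / beta) ** alpha
    log_mass = np.log(-np.expm1(-np.exp(lam)))        # log(1 - G(-lambda))
    loglik = (z.size * (np.log(alpha) - alpha * np.log(beta) + lam - log_mass)
              + (alpha - 1.0) * np.sum(np.log(z))
              - np.sum(t) - np.sum(np.exp(lam - t)))
    return -loglik

def gtg_rvs(n, beta, lam, alpha, rng):
    """Inverse-transform draws, used here only to create example data."""
    p = rng.uniform(size=n)
    mass = -np.expm1(-np.exp(lam))
    w = lam - np.log(-np.log1p(-(1.0 - p) * mass))
    return beta * w ** (1.0 / alpha)

def fit_gtg(z, start):
    """Return ((beta, lambda, alpha) estimates, maximized log-likelihood)."""
    theta0 = np.array([np.log(start[0]), start[1], np.log(start[2])])
    res = minimize(gtg_negloglik, theta0, args=(z,), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-8, "maxiter": 5000})
    est = np.array([np.exp(res.x[0]), res.x[1], np.exp(res.x[2])])
    return est, -res.fun
```

A call such as `fit_gtg(z, start=(1.0, 1.0, 1.0))` needs a starting point; in practice, the quartile-based values discussed next are a natural choice. Standard errors would require the observed information matrix, which this sketch does not compute.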

Initial Points
In this subsection, we propose quantile-based estimators for the GTG distribution. These estimators are an alternative to the moment estimators and are intended to serve as initial values for computing the ML estimators of the GTG distribution.
Let q_1, q_2 and q_3 be the sample quartiles based on z_1, z_2, ..., z_n. Initial values for θ can be obtained by equating the sample quartiles with the theoretical quartiles. The resulting equations are given by q_1 = Q(0.25; β, α, λ), q_2 = Q(0.5; β, α, λ) and q_3 = Q(0.75; β, α, λ).
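The solutions for β and α, say $\tilde{\beta}$ and $\tilde{\alpha}$, can be expressed in closed form in terms of $\tilde{\lambda}$ (the solution for λ), whereas $\tilde{\lambda}$ is obtained by numerically solving a non-linear equation. Therefore, the initial point based on this method is given by $\tilde{\theta}_{quart} = (\tilde{\beta}, \tilde{\lambda}, \tilde{\alpha})$.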
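One possible way to carry out this quartile matching is sketched below. It is an assumption-laden illustration, not the authors' exact equations: it uses the quantile function implied by Z = βY^{1/α}, eliminates β and α by taking ratios of quartiles, and solves the remaining equation in λ by a grid scan followed by bisection. The equation in λ may have more than one root, so the code returns the first one found, or the closest grid point if no sign change occurs.

```python
import math

def w(p, lam):
    """(Q(p)/beta)^alpha, which depends only on p and lambda."""
    mass = -math.expm1(-math.exp(lam))                # 1 - G(-lambda)
    return lam - math.log(-math.log1p(-(1.0 - p) * mass))

def gtg_quantile(p, beta, lam, alpha):
    return beta * w(p, lam) ** (1.0 / alpha)

def quartile_start(q1, q2, q3, lam_lo=-10.0, lam_hi=10.0, steps=400):
    """Initial (beta, lambda, alpha) from the three sample quartiles."""
    target = math.log(q2 / q1) / math.log(q3 / q1)

    def r(lam):
        w1, w2, w3 = w(0.25, lam), w(0.5, lam), w(0.75, lam)
        return math.log(w2 / w1) / math.log(w3 / w1) - target

    grid = [lam_lo + (lam_hi - lam_lo) * i / steps for i in range(steps + 1)]
    vals = [r(x) for x in grid]
    lam = min(zip(grid, vals), key=lambda gv: abs(gv[1]))[0]   # fallback value
    for a, b, fa, fb in zip(grid, grid[1:], vals, vals[1:]):
        if fa * fb <= 0.0:                    # bracket found: refine by bisection
            for _ in range(100):
                m = 0.5 * (a + b)
                fm = r(m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            lam = 0.5 * (a + b)
            break
    w1, w2, w3 = w(0.25, lam), w(0.5, lam), w(0.75, lam)
    alpha = math.log(w3 / w1) / math.log(q3 / q1)
    beta = q2 / w2 ** (1.0 / alpha)
    return beta, lam, alpha
```

The default search range for λ and the grid size are arbitrary choices for this example. When a root is found, the returned values reproduce all three quartiles exactly, which is all that is needed from a starting point.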

GTG Quantile Regression Model
For the GTG model, the mean has a complicated form, so a mean-parameterized version of the model is not advisable. On the other hand, in a context of heterogeneous observations, quantile regression is a more appropriate tool for analyzing data in the presence of covariates because it allows for a complete description of the distribution of the response variable (not just a particular measure, as is the case in mean regression).
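Specifically, for the GTG model, letting τ denote the p-th quantile of the distribution, we obtain the equation τ = Q(p; β, α, λ), τ ∈ (0, ∞). Solving this equation for β, we obtain $\beta = \tau\exp\{-h(\lambda, p)/\alpha\}$, with $h(\lambda, p) = \log\{\lambda - \log(-\log[G(-\lambda) + p(1 - G(-\lambda))])\}$. Thus, we can reparameterize the pdf and cdf of the GTG model as $f(z; \tau, \alpha, \lambda) = \frac{\alpha\,e^{h(\lambda, p)}}{\tau[1 - G(-\lambda)]}\left(\frac{z}{\tau}\right)^{\alpha - 1} g\left(e^{h(\lambda, p)}\left(\frac{z}{\tau}\right)^{\alpha} - \lambda\right)$ and $F(z; \tau, \alpha, \lambda) = \frac{G\left(e^{h(\lambda, p)}(z/\tau)^{\alpha} - \lambda\right) - G(-\lambda)}{1 - G(-\lambda)}$, respectively, where z > 0, α > 0, τ ∈ (0, ∞), and 0 < p < 1 is fixed. We refer to this model as the reparameterized GTG (RGTG) model. A set of q known covariates $\mathbf{z}_i^{\top} = (z_{i1}, z_{i2}, \ldots, z_{iq})$ related to the p-th quantile τ_i of the i-th individual can be introduced in the model as follows: $\psi(\tau_i) = \mathbf{z}_i^{\top}\beta(p)$, i = 1, ..., n, where β(p) = (β_1(p), β_2(p), ..., β_q(p))^⊤ is a q-dimensional vector of unknown regression parameters (q < n), and ψ(·) is a link function, which is continuous, invertible and at least twice differentiable. A natural choice in this context is the logarithm link, i.e., ψ(u) = log(u). With this framework, the corresponding log-likelihood function for the RGTG quantile regression model is given by $\ell(\beta(p), \alpha(p), \lambda(p)) = k_n(p) + \sum_{i=1}^{n}\left[(\alpha(p) - 1)\log z_i - \alpha(p)\log\tau_i - e^{h(\lambda(p), p)}\left(\frac{z_i}{\tau_i}\right)^{\alpha(p)} - \exp\left\{\lambda(p) - e^{h(\lambda(p), p)}\left(\frac{z_i}{\tau_i}\right)^{\alpha(p)}\right\}\right]$, where z_i denotes the observed response of the i-th individual and k_n(p) = n(log(α(p)) + h(λ(p), p) − log(1 − G(−λ(p))) + λ(p)). The estimates of the regression parameters are obtained by directly maximizing this function.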
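To make the reparameterization concrete, the sketch below codes the RGTG cdf, log-pdf and regression log-likelihood with the logarithm link. It is a hedged illustration: the expression used for h(λ, p) is derived here from the GTG quantile function under the assumption that TG(1, λ) is the Gumbel distribution with location λ truncated at zero, and the function names are hypothetical.

```python
import math

def h(lam, p):
    """log of (tau/beta)^alpha, so that beta = tau * exp(-h/alpha)."""
    mass = -math.expm1(-math.exp(lam))                # 1 - G(-lambda)
    return math.log(lam - math.log(-math.log1p(-(1.0 - p) * mass)))

def rgtg_cdf(z, tau, lam, alpha, p):
    """cdf of the RGTG model; by construction rgtg_cdf(tau, ...) equals p."""
    t = math.exp(h(lam, p)) * (z / tau) ** alpha      # (z/beta)^alpha
    return 1.0 - math.expm1(-math.exp(lam - t)) / math.expm1(-math.exp(lam))

def rgtg_logpdf(z, tau, lam, alpha, p):
    t = math.exp(h(lam, p)) * (z / tau) ** alpha
    log_mass = math.log(-math.expm1(-math.exp(lam)))
    return (math.log(alpha) + h(lam, p) - log_mass + lam
            + (alpha - 1.0) * math.log(z) - alpha * math.log(tau)
            - t - math.exp(lam - t))

def rgtg_loglik(coef, lam, alpha, p, z, X):
    """Log-likelihood with log link: tau_i = exp(x_i' coef)."""
    total = 0.0
    for z_i, x_i in zip(z, X):
        tau_i = math.exp(sum(b * x for b, x in zip(coef, x_i)))
        total += rgtg_logpdf(z_i, tau_i, lam, alpha, p)
    return total
```

The negative of `rgtg_loglik` can be passed to any general-purpose optimizer over (coef, λ, log α), repeating the fit for each quantile level p of interest.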

Simulation Study
In this section, we present two simulation studies related to assessing the performances of the ML estimators for the GTG model and the RGTG quantile regression model.

Without Covariates
We carried out a simulation study to evaluate the performance of the ML estimators given in Section 3.1. We generated random values from the GTG(β, λ, α) distribution with Algorithm 1.
Algorithm 1 Simulating values from the GTG(β, λ, α) distribution

We used the following settings to evaluate the finite-sample behavior of the ML estimators of the GTG model. For β, we fixed three values: 1, 2 and 3; for λ, two values: 2 and 3; for α, two values: 1 and 2; and for the sample size n, four values: 150, 300, 600 and 1000. For each combination of β, λ, α and n, we simulated 1000 replicates of that size and calculated the ML estimates and their standard errors. Table 3 summarizes the mean of the estimated biases (bias), the mean of the estimated standard errors (SE) and the square root of the estimated mean squared errors (RMSE). Each estimated coverage probability (CP) was obtained from the asymptotic distribution of the ML estimator at a 95% confidence level.

Note that as the sample size increases, the bias, SE and RMSE decrease, which suggests that the ML estimators of the GTG model behave acceptably even in finite samples. Moreover, the SE and RMSE terms become closer as the sample size increases, suggesting that the variance of the estimators is well estimated. Finally, the CP terms come closer to the nominal value as n increases, which suggests that the normal approximation for the ML estimators of the GTG model is reasonable, even in finite samples.
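One simple way to simulate from the GTG distribution is inverse-transform sampling, sketched below. This may or may not coincide step by step with Algorithm 1; it assumes the quantile function implied by Z = βY^{1/α} with Y ∼ TG(1, λ), and the function names are hypothetical.

```python
import math
import random

def gtg_quantile(p, beta, lam, alpha):
    """Quantile function of GTG(beta, lam, alpha) for 0 < p < 1."""
    mass = -math.expm1(-math.exp(lam))                     # 1 - G(-lambda)
    y = lam - math.log(-math.log1p(-(1.0 - p) * mass))     # quantile of TG(1, lam)
    return beta * max(y, 0.0) ** (1.0 / alpha)             # guard against rounding below 0

def gtg_sample(n, beta, lam, alpha, seed=None):
    """Draw n values by applying the quantile function to uniform variates."""
    rng = random.Random(seed)
    return [gtg_quantile(rng.random(), beta, lam, alpha) for _ in range(n)]
```

For example, `gtg_sample(150, 1.0, 2.0, 1.0, seed=1)` gives one replicate for the smallest sample size in the design above. Repeating the draw and the ML fit 1000 times per setting would yield Monte Carlo estimates of the bias, SE, RMSE and CP.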
Figures A1-A5 in Appendix A.1 display the standard deviation (SD), mean of SE, RMSE, mean of the relative bias (RB) and CP of the 95% asymptotic confidence intervals of the estimates, under different sample sizes.It can be observed in Figures A1-A4 that the first four measures decrease as the sample size n increases, as expected in standard asymptotic theory.Finally, the CPs in Figure A5 indicate convergence to the nominal values.

Applications
In this section, we present two applications to real datasets to illustrate the performance of the GTG model and the RGTG quantile regression model. The first application is related to the beta-carotene levels in a sample of individuals. The second application involves explaining the body mass index (BMI) of Mexican American individuals in terms of their waist circumference and sex. The BMI is calculated by dividing a person's weight in kilograms by the square of their height in meters (kg/m²). Furthermore, it is a tool that health organizations use to monitor and plan public health programs.

Without Covariates
We considered the retinol plasma database available at http://lib.stat.cmu.edu/datasets/Plasma_Retinol (accessed on 5 April 2024), which focuses on understanding the determinants of the plasma levels of retinol, beta-carotene and other carotenoids. The main variable of interest is BETADIET, which represents the amount of beta-carotene consumed daily by each individual in micrograms (mcg). The importance of analyzing this variable lies in its direct relationship with the plasma levels of beta-carotene (BETAPLASMA), a key nutrient with antioxidant properties and a precursor of vitamin A.

It can be seen from Figure 5 that the GTG model fits better than the TG, Weibull and slash truncation positive normal (STPN) (see Gómez et al. [22]) models, and that the fitted GTG cdf closely follows the empirical cdf. Based on the Akaike information criterion (AIC) (see Akaike [23]) and the Bayesian information criterion (BIC) (see Schwarz [24]) given in Table 5, the GTG model is also preferred among the fitted models for this dataset.

With Covariates
In this application, we fit the quantile regression model to a dataset provided by Cortés et al. [25]. The dataset comprises the body mass index (BMI) measured in kg/m², waist circumference (Waist) in centimeters, age in years and Sex (1 for female and 0 for male) of 1743 individuals who self-identified as Mexican American in the National Health and Nutrition Examination Survey (NHANES) conducted between 2017 and 2018.
Here, we assumed that the BMI follows an RGTG distribution, denoted as Z_i ∼ RGTG(p; ρ_ip, α_p, λ_p). Accordingly, the quantile ρ_ip is related to the covariates through a regression structure in which β_1(p), β_2(p), β_3(p), λ(p) and ν(p) are the parameters to be estimated, for i = 1, ..., 1743 and p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. The parameter estimates and their corresponding standard errors are presented in Table 6, where it is observed that all estimates are statistically significant at all quantiles p. We present the interpretation of the regression coefficients in relation to the median; for this problem, the interpretations are also valid for the other quantiles considered:

• The estimated median BMI for females, assuming a waist circumference of 0, is exp(2.076) ≈ 7.973.

Additionally, to assess the model's adequacy, we calculated the normalized quantile residuals (NQRs) along with their respective envelopes; see Dunn and Smyth [26]. These can be observed graphically in Figure A6 of Appendix A.2, where observation #1267 is highlighted as atypical. We also obtained the generalized Cook's distance and likelihood displacement measures for p = 0.9; see Figure 6. Here, we highlight that the potentially influential observations are #264, #486, #516, #1267 and #1299.
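For a continuous response, the normalized quantile residual of Dunn and Smyth [26] is the standard normal quantile of the fitted cdf evaluated at the observation. The sketch below illustrates this for the RGTG model; it assumes the reparameterized cdf implied by the GTG quantile function, and the function names and the clipping constant are choices made for this example. The envelopes are not computed.

```python
import math
from statistics import NormalDist

def rgtg_cdf(z, tau, lam, alpha, p):
    """cdf of the RGTG model, where tau is the p-th quantile."""
    mass = -math.expm1(-math.exp(lam))                     # 1 - G(-lambda)
    w_p = lam - math.log(-math.log1p(-(1.0 - p) * mass))   # (tau/beta)^alpha
    t = w_p * (z / tau) ** alpha                           # (z/beta)^alpha
    return 1.0 - math.expm1(-math.exp(lam - t)) / math.expm1(-math.exp(lam))

def nqr(z, tau, lam, alpha, p):
    """Normalized quantile residuals r_i = Phi^{-1}(F(z_i; fitted parameters))."""
    std_normal = NormalDist()
    out = []
    for z_i, tau_i in zip(z, tau):
        prob = rgtg_cdf(z_i, tau_i, lam, alpha, p)
        prob = min(max(prob, 1e-15), 1.0 - 1e-15)          # keep inv_cdf inside (0, 1)
        out.append(std_normal.inv_cdf(prob))
    return out
```

If the model is adequate, these residuals should look like a standard normal sample. An observation equal to its fitted p-th quantile receives the residual Φ^{-1}(p), for instance 0 at the median.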
Table 7 presents a classification (https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmi_dis.htm (accessed on 5 April 2024)) of overweight and obesity by sex, BMI, waist circumference, and risk of diseases (type 2 diabetes, hypertension, and cardiovascular disease) for the highlighted observations in the residuals, generalized Cook's distance and likelihood displacement, considering p = 0.9 (obese individuals).We observe from the table that observations #264 and #1267 indicate a very poor health status, emphasizing the importance of their detection.

Final Discussion
In this work, we proposed a new distribution called the generalized truncated Gumbel distribution. The model has several interesting properties, mainly associated with having a cumulative distribution function and a quantile function in closed form. For this reason, an extension of the model was proposed for use in the context of quantile regression. Future work will extend the model to include random effects.

Figure 2. Shape of u(z) for β = 1 and selected values of α and λ.

Figure 5. (a) GTG, TG, Weibull and STPN models fitted by the maximum likelihood method for BETADIET. (b) Empirical (black) and fitted GTG (blue) cdf for the BETADIET dataset.

Figure 6. (a) Generalized Cook's distance and (b) likelihood displacement for the NHANES dataset.

Table 3. Estimated bias, SE, RMSE and CP for ML estimators in finite samples from the GTG model.

Table 4. Descriptive statistics of the amount of beta-carotene consumed daily by each individual in micrograms (BETADIET).

Table 5. Estimated parameters and their standard errors (in parentheses) for the GTG, TG, Weibull and STPN models for the BETADIET dataset. The AIC and BIC criteria are also presented.
= 3.504 and exp(ν(p)) = exp(1.309) ≈ 3.702 are the estimates of the shape parameters associated with the median BMI.

Table 6. ML estimates of the parameters and their corresponding standard errors (in parentheses) for the RGTG quantile regression model.

Table 7. Classification of overweight and obesity by BMI, waist circumference and associated disease risks.