Extended Exponential Regression Model: Diagnostics and Application to Mineral Data

Abstract: In this paper, we reparameterize the extended exponential model in terms of its mean in order to include covariates and facilitate the interpretation of the coefficients. The model is compared with common models defined on the positive line, also reparameterized in terms of the mean. Parameter estimation is based on the expectation-maximization algorithm. Furthermore, we discuss residuals and influence diagnostic tools. A simulation study assessing parameter recovery is presented. Finally, an application to a real data set illustrates the advantages of the model.


Introduction
Models with positive support are widely used in the literature, for example in survival analysis, reliability, and regression modeling. In this context, the most common models are the exponential (E), Weibull (W), gamma (G), and Birnbaum-Saunders (BS) distributions, together with generalizations of these distributions, for instance, for limiting cases as illustrated in Fisher and Tippett [1]. A well-known generalization is the one introduced by Gómez et al. [2], named the extended exponential (EE) model, with probability density function (PDF)
$$f(y) = \frac{\alpha^2 (1 + \beta y) \exp(-\alpha y)}{\alpha + \beta}, \quad y > 0, \qquad (1)$$
where $\alpha > 0$ and $\beta \geq 0$.
The motivation for this model arises from a mixture of the E distribution with rate $\alpha$ and the G distribution with shape 2 and rate $\alpha$, where the mixture probabilities are $\alpha/(\alpha + \beta)$ and $\beta/(\alpha + \beta)$, respectively. Another model with a similar motivation is the Lindley (L) distribution; see Ghitany et al. [3]. The exponentiated generalized EE model was proposed by Andrade et al. [4], and Rasekhi et al. [5,6] introduced a generalization and a discrete version of the EE model. Remark 1 illustrates the flexibility of the model proposed by Gómez et al. [2]: with only two parameters, it has three well-known distributions as special cases. The mean and variance of the EE($\alpha$, $\beta$) model are
$$E(Y) = \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)} \quad \text{and} \quad \mathrm{Var}(Y) = \frac{\alpha^3 + 5\alpha^2\beta + 6\alpha\beta^2 + 2\beta^3}{\alpha^5 + 3\alpha^4\beta + 3\alpha^3\beta^2 + \alpha^2\beta^3},$$
respectively.
The rest of the paper proceeds as follows. In Section 2, we introduce a new parameterization of the EE distribution that is indexed by the mean and mixture parameters. Section 3 presents the EE regression model with varying mean and addresses parameter estimation by maximum likelihood (ML) using the expectation-maximization (EM) algorithm. In addition, diagnostic measures are discussed. In Section 4, numerical results on the performance of the estimators are presented and discussed, followed by an application to real data that shows the usefulness of the proposed model. Concluding remarks are given in Section 5.
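The mixture representation and the moment formulas of the EE model stated above can be checked numerically. The following is a minimal Python sketch, not the authors' code: the function names (`ee_pdf`, `ee_mixture_pdf`, `ee_mean`, `ee_var`, `ee_sample`) are hypothetical, the mean is the one implied by the two mixture components, and the variance is written with the common factor $(\alpha + \beta)$ cancelled from the expression in the text.

```python
import math
import random


def ee_pdf(y, alpha, beta):
    """EE density of Equation (1): alpha^2 (1 + beta*y) exp(-alpha*y) / (alpha + beta)."""
    return alpha ** 2 * (1.0 + beta * y) * math.exp(-alpha * y) / (alpha + beta)


def ee_mixture_pdf(y, alpha, beta):
    """The same density written as a two-component mixture."""
    w_exp = alpha / (alpha + beta)                    # weight of the exponential component
    exp_pdf = alpha * math.exp(-alpha * y)            # E with rate alpha
    gam_pdf = alpha ** 2 * y * math.exp(-alpha * y)   # G with shape 2 and rate alpha
    return w_exp * exp_pdf + (1.0 - w_exp) * gam_pdf


def ee_mean(alpha, beta):
    """Mean implied by the mixture: w/alpha + (1 - w) * 2/alpha."""
    return (alpha + 2.0 * beta) / (alpha * (alpha + beta))


def ee_var(alpha, beta):
    """Variance, simplified by cancelling (alpha + beta) from numerator and denominator."""
    num = alpha ** 2 + 4.0 * alpha * beta + 2.0 * beta ** 2
    return num / (alpha ** 2 * (alpha + beta) ** 2)


def ee_sample(n, alpha, beta, rng):
    """Draw n values by first choosing the mixture component."""
    w_exp = alpha / (alpha + beta)
    draws = []
    for _ in range(n):
        shape = 1.0 if rng.random() < w_exp else 2.0
        # random.gammavariate takes (shape, scale); the scale is 1 / rate
        draws.append(rng.gammavariate(shape, 1.0 / alpha))
    return draws
```

Comparing `ee_pdf` with `ee_mixture_pdf` on a grid, and the sample mean of `ee_sample` with `ee_mean`, gives a quick sanity check of the representation.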

An EE Distribution Parameterized by Its Mean and Mixture Parameters
Regression models are typically constructed to model the mean of a distribution. However, the PDF of the EE distribution given in (1) is indexed by $\alpha$ and $\beta$. In this section, we therefore consider a new parameterization of the EE distribution in terms of its mean and its mixture proportion, say $\mu > 0$ and $\pi \in [0, 1]$, respectively. Under this parameterization, the PDF in Equation (1) takes the form referred to as (2), with $y > 0$, $\mu > 0$ and $\pi \in [0, 1]$. Henceforth, a random variable (RV) with PDF as in (2) is said to follow the reparameterized extended exponential model, denoted REE($\mu$, $\pi$). With this parameterization, and based on the results in Gómez et al. [2], the mean of the distribution is $\mu$ and its variance is determined by $\mu$ and the coefficient of variation (CV), which is a function of $\pi$ alone. Figure 1 displays plots of the PDF in (2) for some parameter values. The distribution is very flexible and can be an interesting alternative to other distributions with positive support. Table 1 summarizes two indices, the skewness and kurtosis, for the reparameterized gamma (RGA), reparameterized Birnbaum-Saunders (RBS) and REE distributions. Readers interested in reparameterized regression models are referred to Santos-Neto et al. [7] and Bourguignon et al. [8,9]. We highlight that, for models reparameterized in terms of the mean, the regression coefficients can be compared directly.
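As an illustration of how a mean-based parameterization can be obtained, the following derivation is a sketch under one explicit assumption: $\pi$ is taken to be the weight of the exponential component, $\pi = \alpha/(\alpha + \beta)$. The convention used for (2) may instead attach $\pi$ to the gamma component, in which case $\pi$ and $1 - \pi$ swap roles below.

```latex
\begin{align*}
\mu &= \frac{\alpha + 2\beta}{\alpha(\alpha + \beta)}
     = \frac{1}{\alpha}\left(1 + \frac{\beta}{\alpha + \beta}\right)
     = \frac{2 - \pi}{\alpha}
     \quad\Longrightarrow\quad \alpha = \frac{2 - \pi}{\mu}, \\
\mathrm{Var}(Y) &= \frac{\alpha^2 + 4\alpha\beta + 2\beta^2}{\alpha^2(\alpha + \beta)^2}
     = \frac{\pi^2 + 4\pi(1 - \pi) + 2(1 - \pi)^2}{\alpha^2}
     = \frac{2 - \pi^2}{\alpha^2}
     = \mu^2\,\frac{2 - \pi^2}{(2 - \pi)^2}, \\
\mathrm{CV}(\pi) &= \frac{\sqrt{2 - \pi^2}}{2 - \pi}.
\end{align*}
```

Under this assumed convention, $\pi = 1$ gives $\mathrm{CV} = 1$ (the exponential case) and $\pi = 0$ gives $\mathrm{CV} = 1/\sqrt{2}$ (the gamma case with shape 2), so the CV depends on $\pi$ only and not on $\mu$.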

REE Regression Model
Let $Y_1, \ldots, Y_n$ be $n$ independent RVs, where each $Y_i$, $i = 1, \ldots, n$, follows the PDF given in (2) with mean $\mu_i$ and mixture proportion parameter $\pi$. Suppose the mean parameter of $Y_i$ satisfies the functional relation
$$g_1(\mu_i) = \eta_{1i} = z_i^\top \gamma, \qquad (3)$$
where $\gamma = (\gamma_0, \ldots, \gamma_p)^\top$ is a vector of unknown regression coefficients, $\gamma \in \mathbb{R}^{p+1}$, with $p < n$, $\eta_{1i}$ is a linear predictor and $z_i = (1, z_{i1}, \ldots, z_{ip})^\top$ contains the observations on $p$ known regressors, for $i = 1, \ldots, n$. Furthermore, we assume that the covariate matrix $Z = (z_1, \ldots, z_n)^\top$ has rank $p$. The link function $g_1 : \mathbb{R} \to \mathbb{R}^+$ in (3) must be strictly monotone, positive, and at least twice differentiable, such that $\mu_i = g_1^{-1}(z_i^\top \gamma)$, with $g_1^{-1}(\cdot)$ being the inverse function of $g_1(\cdot)$. Finding the ML estimate of the parameter vector by direct maximization of the log-likelihood can be a hard task. Taking into account the mixture representation of the REE model, we develop an estimation procedure based on the EM algorithm; see Dempster et al. [10] for details about this algorithm.
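The regression structure above maps a linear predictor to a positive mean through the inverse link. The short Python sketch below illustrates this with a log link; the choice of the log link is an assumption made only for illustration (the text requires only a strictly monotone, twice differentiable link), and the function names are hypothetical.

```python
import math


def linear_predictor(z_row, gamma):
    """eta_i = gamma_0 + gamma_1 * z_i1 + ... + gamma_p * z_ip.

    z_row holds the p regressor values without the leading 1.
    """
    if len(z_row) != len(gamma) - 1:
        raise ValueError("gamma must have one more entry than z_row")
    return gamma[0] + sum(g * z for g, z in zip(gamma[1:], z_row))


def mean_from_log_link(z_row, gamma):
    """mu_i = exp(eta_i): inverse of the (assumed) log link, always positive."""
    return math.exp(linear_predictor(z_row, gamma))
```

With a log link, a unit increase in a regressor multiplies the mean by the exponential of its coefficient, which is one reason mean-based parameterizations ease interpretation.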

EM Algorithm
Considering the mixture representation of the REE distribution, $Y_i$ admits the hierarchical representation referred to as (4), in which RGA($\mu$, $\phi$) denotes the reparameterized gamma distribution with PDF $f(y; \mu, \phi) \propto y^{\phi - 1} \exp\{-\phi y/\mu\}$ and B($\pi$) denotes the Bernoulli distribution with success probability equal to $\pi$. Under this setting, $D_{obs} = (Y, Z)$ represents the observed data, $X = (x_1, \ldots, x_n)$ represents the unobserved (latent) data and $D_{comp} = (Y, Z, X)$ denotes the complete data, where $Y = (y_1, \ldots, y_n)$ and $Z = (z_1, \ldots, z_n)$. Let $\ell_c(\psi; D_{comp})$ denote the complete log-likelihood for $\psi = (\gamma, \pi)$, let $\psi^{(k)}$ be the estimate of $\psi$ at the $k$-th iteration of the EM algorithm, and denote the conditional expectation of $\ell_c(\psi; D_{comp})$ given the observed data by $Q(\psi \mid \psi^{(k)})$. Evaluating $Q(\psi \mid \psi^{(k)})$ requires the conditional distribution of $X_i \mid D_{obs}; \psi$, for $i = 1, \ldots, n$, which is derived in the following proposition.
Proposition 1. For the REE model, the distribution of $X_i \mid D_{obs}; \psi$ in the hierarchical representation in (4) is a Bernoulli distribution whose success probability is the posterior probability $P(X_i = 1 \mid D_{obs}; \psi)$.
Proof. The result follows by applying Bayes' theorem to $P(X_i = x_i \mid D_{obs}; \psi)$.

Corollary 1.
The expected values needed to evaluate the Q-function follow directly from Proposition 1.
In general, Algorithm 1 comprises three steps: the E-step, M-step I, which updates the mean parameters, and M-step II, which updates $\pi^{(k)}$. The E, M-I and M-II steps are alternated repeatedly until a suitable convergence rule is satisfied, e.g., until the difference between successive values of the estimates is less than a tolerance value.
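To make the alternation of E and M steps concrete, the following Python sketch runs an EM iteration for the case without covariates. It is an illustration under stated assumptions rather than the algorithm of the paper: $\pi$ is assumed to be the weight of the exponential component, the model is fitted in the $(\alpha, \pi)$ form of the mixture with a single joint M-step (instead of the two conditional M-steps described above), and the mean is recovered afterwards as $\mu = (2 - \pi)/\alpha$. The names `draw_mixture`, `ee_loglik` and `em_fit` are hypothetical.

```python
import math
import random


def draw_mixture(n, alpha, pi, rng):
    """Sample from pi * E(rate alpha) + (1 - pi) * G(shape 2, rate alpha)."""
    return [rng.gammavariate(1.0 if rng.random() < pi else 2.0, 1.0 / alpha)
            for _ in range(n)]


def ee_loglik(ys, alpha, pi):
    """Observed-data log-likelihood; density is alpha*exp(-alpha*y)*(pi + (1-pi)*alpha*y)."""
    return sum(math.log(alpha) - alpha * y + math.log(pi + (1.0 - pi) * alpha * y)
               for y in ys)


def em_fit(ys, pi0=0.5, tol=1e-8, max_iter=2000):
    """Return (mu_hat, pi_hat, log-likelihood trace) for data without covariates."""
    n = len(ys)
    ybar = sum(ys) / n
    pi = pi0
    alpha = (2.0 - pi) / ybar
    trace = [ee_loglik(ys, alpha, pi)]
    for _ in range(max_iter):
        # E-step: posterior probability that y_i came from the exponential component
        x_hat = [pi / (pi + (1.0 - pi) * alpha * y) for y in ys]
        # M-step: closed-form maximizers of the expected complete log-likelihood
        pi_new = sum(x_hat) / n
        alpha_new = (2.0 - pi_new) / ybar
        trace.append(ee_loglik(ys, alpha_new, pi_new))
        converged = abs(pi_new - pi) < tol and abs(alpha_new - alpha) < tol
        pi, alpha = pi_new, alpha_new
        if converged:
            break
    mu = (2.0 - pi) / alpha  # mean implied by the assumed parameterization
    return mu, pi, trace
```

In this sketch the estimate of $\mu$ coincides with the sample mean at every iteration, and the log-likelihood trace is non-decreasing, which is a convenient check on any EM implementation.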

Remark 2.
When no covariates are available, the update of $\mu$ in M-step I is available in closed form.
We carry out inference for $\psi$ in the REE regression model using the asymptotic distribution of the ML estimator $\hat\psi$ obtained by the EM algorithm. This estimator is consistent and has a multivariate normal asymptotic distribution with asymptotic mean $\psi$ and asymptotic covariance matrix $\Sigma_\psi$, which can be obtained from the corresponding expected Fisher information matrix $I(\psi)$. Hence, the asymptotic normality result referred to as (7) holds as $n \to \infty$, where $\xrightarrow{D}$ means "convergence in distribution" and $J(\psi) = \lim_{n \to \infty} \frac{1}{n} I(\psi)$. Note that $I(\hat\psi)$ is a consistent estimator of the asymptotic covariance matrix of $\hat\psi$. According to these results, an approximate $100 \times [1 - \tau]\%$ confidence region for $\psi$ obtained from (7) is
$$\left\{\psi : (\hat\psi - \psi)^\top \hat\Sigma_\psi^{-1} (\hat\psi - \psi) \leq \chi^2_{p+2;\,1-\tau}\right\},$$
where $\chi^2_{p+2;\,1-\tau}$ denotes the $[1 - \tau] \times 100$th percentile of the chi-squared distribution with $p + 2$ degrees of freedom and $\hat\Sigma_\psi$ is an estimate of $\Sigma_\psi$.
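The asymptotic normality of the ML estimator also yields marginal Wald intervals for individual parameters, which is the kind of interval typically used when coverage probabilities are computed. The Python sketch below shows this for a single parameter; it is a simplified, one-dimensional companion to the joint confidence region, and the function name `wald_interval` and the numbers in the usage note are hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Two-sided Wald interval: estimate +/- z * SE, with z the normal quantile."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error
```

For example, `wald_interval(1.0, 0.1)` gives an interval of half-width about 0.196 around the estimate; the standard error would come from the estimated covariance matrix of the fitted model.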

Diagnostic Analysis
Diagnostic analysis is an important tool for detecting influential cases and evaluating their effect on the model assumptions. In this subsection, we use the local influence approach to detect observations that, under small perturbations of the model, exert a large influence on the ML estimators. There are basically two approaches to detecting observations that seriously influence the results of a statistical analysis: (A1) the case-deletion approach, in which the impact of deleting an observation on the estimators is assessed directly by metrics such as the likelihood distance and Cook's distance; see Cook [11]; and (A2) the local influence approach, which assesses how the model outputs respond to various minor perturbations of the model inputs; see Cook [12]. Zhu and Lee [13] introduced a unified approach for local influence analysis of general statistical models with missing data on the basis of the Q-function of EM-type algorithms. Alternative methodologies for diagnostic analysis can be found in Bolboaca and Jantschi [14] and Jantschi [15].
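The case-deletion idea in (A1) can be sketched in a few lines of Python. The example below computes the classical likelihood displacement by refitting after each deletion; to keep it short, it uses a plain exponential model with a closed-form ML estimate as a stand-in for the REE regression model, so it illustrates the principle only and is not the Q-function-based measure used in this paper. All function names are hypothetical.

```python
import math


def exp_loglik(ys, rate):
    """Log-likelihood of an exponential sample with the given rate."""
    return len(ys) * math.log(rate) - rate * sum(ys)


def exp_mle(ys):
    """Closed-form ML estimate of the exponential rate."""
    return len(ys) / sum(ys)


def likelihood_displacement(ys):
    """LD_i = 2 * [l(theta_hat) - l(theta_hat_(i))], both evaluated on the full data."""
    full_loglik = exp_loglik(ys, exp_mle(ys))
    displacements = []
    for i in range(len(ys)):
        reduced = ys[:i] + ys[i + 1:]       # data set with case i removed
        rate_without_i = exp_mle(reduced)   # refit without case i
        displacements.append(2.0 * (full_loglik - exp_loglik(ys, rate_without_i)))
    return displacements
```

Because the full-data estimate maximizes the full-data log-likelihood, every displacement is non-negative, and unusually large values point to cases whose removal changes the fit substantially.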

Case Deletion Measures
Case-deletion diagnostics assess the effect of dropping the $i$th case from the data set. Here, we develop diagnostic measures with the whole vector $(y_i, x_i)$ deleted and denote the resulting quantities by the subscript "(i)". The classic measures are Cook's distance and the likelihood displacement. Following this approach, Lee and Xu [16] introduced analogues of the generalized Cook's distance ($\mathrm{GCD}^c_i$) and the likelihood displacement ($\mathrm{LD}^c_i$) based on the Q-function.

Perturbation Schemes
In this subsection, we consider three different perturbation schemes for the REE regression model.

Response Perturbation
Here we assume an additive perturbation of the response vector $y = (y_1, \ldots, y_n)$, such that $y(\omega) = y + \omega s_z$, where $s_z$ is a scale factor equal to the estimated standard deviation of $y$. The perturbed Q-function is then obtained by replacing $y$ with $y(\omega)$ in the Q-function.

Covariate Perturbation
Here, we are interested in perturbing a specific explanatory variable. We consider an additive perturbation of the explanatory variable by setting $z_r(\omega) = z_r + s_r \omega$, with $r = 1, \ldots, p$, where $s_r$ is a scale value typically given by the standard deviation of $z_r$. Under this scheme, the perturbed Q-function is obtained by replacing $z_r$ with $z_r(\omega)$ in the Q-function.

Residual Analysis
In order to check the goodness-of-fit of the REE regression model, we propose two types of residuals: the randomized quantile (RQ) and the generalized Cox-Snell (GCS) residuals, proposed in [17,18]. In their definitions, $\hat\mu_i$ is the fitted mean of the $i$th observation obtained from the regression structure (3) evaluated at the ML estimates, $\Phi^{-1}$ is the inverse function of the N(0, 1) cumulative distribution function (CDF) and $S_Y(y_i; \hat\mu_i, \hat\pi)$ is the estimated CDF of the RV given in (2). If the REE model is correctly specified, then the RQ residual has a N(0, 1) distribution, whereas the GCS residual has an exponential distribution with parameter equal to one.
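The two residuals described above can be sketched in Python for the EE distribution in its original $(\alpha, \beta)$ form, whose survival function follows by integrating the PDF in (1). This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the quantile residual is computed as $\Phi^{-1}$ of the CDF (no randomization is needed for a continuous response), and the Cox-Snell residual as minus the log of the survival function. In a regression fit, $\alpha$ and $\beta$ would be replaced by the values implied by the fitted $\hat\mu_i$ and $\hat\pi$.

```python
import math
import random
from statistics import NormalDist


def ee_survival(y, alpha, beta):
    """S(y) = (alpha + beta + alpha*beta*y) * exp(-alpha*y) / (alpha + beta)."""
    return (alpha + beta + alpha * beta * y) * math.exp(-alpha * y) / (alpha + beta)


def quantile_residual(y, alpha, beta):
    """Phi^{-1}(F(y)); approximately N(0, 1) when the model is correct."""
    u = 1.0 - ee_survival(y, alpha, beta)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)


def cox_snell_residual(y, alpha, beta):
    """-log S(y); approximately unit exponential when the model is correct."""
    return -math.log(ee_survival(y, alpha, beta))


def ee_sample(n, alpha, beta, rng):
    """Draw from the EE model through its two-component mixture."""
    w_exp = alpha / (alpha + beta)
    return [rng.gammavariate(1.0 if rng.random() < w_exp else 2.0, 1.0 / alpha)
            for _ in range(n)]
```

Applied to data simulated from the same model, the Cox-Snell residuals should average close to one and the quantile residuals close to zero, which mirrors the target distributions used in the QQ plots.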

Simulation Study
In this section, we carry out a simulation study to evaluate the performance of the ML estimators of the REE regression model parameters. For each individual, we consider two covariates ($p = 2$): one dichotomous, drawn from the B(0.5) distribution, and one continuous, drawn from the standard normal distribution. These covariates enter the model through $\mu_i$, following Equation (3). In addition, we consider four values for $\pi$: 0.2, 0.5, 0.75, and 0.95, and three combinations for the vector of regression parameters $\gamma = (\gamma_0, \gamma_1, \gamma_2)$: (1, 0.5, 0.01), (1, −1, −0.01) and (2, −0.5, 0.02). We also consider four sample sizes: 50, 100, 200, and 500. Each combination of $\pi$, $\gamma$, and $n$ was replicated 10,000 times. For each scenario and each parameter, we computed the mean bias, the mean of the estimated standard errors (SE), the root mean squared error (RMSE), and the 95% coverage probability (CP) based on the asymptotic distribution of the ML estimators. Results are presented in Tables 2 and 3. In all scenarios, the bias for each parameter is acceptable and decreases as the sample size increases. The SE and RMSE values are also close to each other, and their difference shrinks as $n$ increases, suggesting that the variance of the estimators is well estimated. Finally, the CPs are close to the nominal value (95%), especially for larger $n$, which suggests that the distribution of the estimators is well approximated by the normal distribution, even for moderate sample sizes. Simulation and application codes were written in the R programming language [19] and run on a machine with the Windows 10 operating system, 16 GB of RAM, and a 64-bit Intel Core i7 processor.
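One replicate of the simulation design described above can be generated as in the following Python sketch. It is not the authors' R code and rests on two assumptions that the text does not confirm: a log link for the mean, and $\pi$ taken as the weight of the exponential component, so that the common rate is $(2 - \pi)/\mu_i$. The function name `simulate_ree_regression` is hypothetical.

```python
import math
import random


def simulate_ree_regression(n, gamma, pi, rng):
    """Return a list of (y, z1, z2) triples for one simulated data set."""
    data = []
    for _ in range(n):
        z1 = 1.0 if rng.random() < 0.5 else 0.0  # dichotomous covariate, B(0.5)
        z2 = rng.gauss(0.0, 1.0)                 # continuous covariate, N(0, 1)
        mu = math.exp(gamma[0] + gamma[1] * z1 + gamma[2] * z2)  # assumed log link
        rate = (2.0 - pi) / mu                   # assumed mapping from (mu, pi)
        shape = 1.0 if rng.random() < pi else 2.0  # exponential or gamma component
        y = rng.gammavariate(shape, 1.0 / rate)    # arguments are (shape, scale)
        data.append((y, z1, z2))
    return data
```

Under these assumptions the conditional mean of each response equals $\mu_i$, so the ratio of the responses to their means should average close to one in a large simulated sample.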

Applications
In this section, we present a real data application in which the REE model performs well in comparison with other common models in the literature.

Exploratory Data Analysis to the Mineral Data Set
We illustrate the proposed methodology by applying it to a real-world mineral data set. These data were obtained from the Department of Mining of the University of Atacama, Chile, to study the concentration of some ores in the soil. The data set consists of 86 measurements of the concentrations of vanadium and thorium. We consider a regression model to explain the quantity of vanadium (V) in terms of the quantity of thorium (Th). The study of the concentration of ores in the soil is a matter of public health, since it makes it possible to detect, for example, whether tributary water may be contaminated with heavy metals. Similar works can be found in Gómez et al. [20], Bolfarine et al. [21], Olmos et al. [22] and Reyes et al. [23], who studied the concentration of nickel and zinc in the soil. Table 4 provides a descriptive summary of the observed vanadium concentration that includes the median (MD), mean ($\bar{y}$), SD, CV, skewness (CS), kurtosis (CK), and the minimum ($y_{(1)}$) and maximum ($y_{(n)}$) values. The unit of measurement of the concentration (response variable) is parts per million (ppm). From this table, note the positively skewed nature of the data distribution, which is confirmed by the histogram in Figure 2 (left). In addition, Figure 2 (right) indicates a relationship between the quantity of vanadium and the quantity of thorium, which motivates the use of the REE regression model to study these variables.

Estimation and Model Checking
We consider the REE regression model with the structure $\eta_i = \gamma_0 + \gamma_1 z_{i1}$. Table 5 provides the estimation and hypothesis testing results for the REE regression model fitted to the mineral data. Results for the RGA and RBS regression models are also given in this table, together with their AIC (Akaike [24]), BIC (Schwarz [25]), and log-likelihood values. Figure 3 shows the QQ plots with simulated envelopes for the GCS and RQ residuals. These plots allow us to check graphically whether the GCS and RQ residuals follow the EXP(1) and N(0, 1) distributions, respectively. From Figure 3, note that these residuals show good agreement with their corresponding target distributions.
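The information criteria used to compare the fitted models are simple functions of the maximized log-likelihood. The Python sketch below states the standard definitions; the function names are hypothetical, and the log-likelihood value in the usage note is a placeholder rather than a result from Table 5.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2.0 * n_params - 2.0 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian (Schwarz) information criterion: k * log(n) - 2 * log-likelihood."""
    return n_params * math.log(n_obs) - 2.0 * loglik
```

For the structure considered here, the REE model has three parameters ($\gamma_0$, $\gamma_1$ and $\pi$) and the sample has 86 observations, so a call would look like `bic(loglik, 3, 86)`, with `loglik` taken from the fitted model. Smaller values indicate a better trade-off between fit and complexity.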

Diagnostic Analysis
Based on the estimation and model validation results presented previously, we conducted a diagnostic analysis for the REE regression model, the model favored by that analysis. The diagnostic analysis is based on global and local influence. First, Figure 4 presents the case-deletion measures $\mathrm{GCD}^c_i(\psi)$ and $\mathrm{LD}^c_i(\psi)$ discussed in Section 3.2. From this figure, the $\mathrm{LD}^c_i(\psi)$ statistic indicates that cases #25, #44, and #69 are potentially influential. On the other hand, the $\mathrm{GCD}^c_i(\psi)$ statistic flags case #44 as potentially influential. Index plots of $C_i$ for $\pi$ and $\gamma$ under the case-weight, response, and covariate perturbation schemes are displayed in Figure 5. Note that case #44 is detected as potentially influential on $\pi$ and $\gamma$ under the three perturbation schemes. In order to check the impact of the detected influential cases on the model inference, we remove them and re-estimate the parameters as well as their corresponding SEs. Table 6 shows the parameter estimates and their corresponding estimated SEs without observation #44. In addition, p-values based on t-tests are shown for the regression coefficients.

Conclusions
In the present paper, the reparameterization of the EE model in terms of its mean motivated us to propose a regression model for positive data. The maximum likelihood method, implemented through the EM algorithm, is employed to estimate the model parameters. An application to real data with covariates showed, according to the AIC and BIC criteria and the residual analysis, that the REE model fits this data set better than the other reparameterized models considered.
The main contribution of this paper is the development of EM algorithms for maximum likelihood estimation, together with the application of Zhu and Lee's approach to case-deletion measures and local influence diagnostics in linear regression models with REE errors. Closed-form expressions are obtained for the E and M steps of the EM algorithm, for the observed information matrix, for the Hessian matrix of the Q-function, and for the $\Delta$ matrix under three perturbation schemes. The models can be fitted using standard software packages such as R (code available upon request).