Bayesian Reference Analysis for the Generalized Normal Linear Regression Model

This article proposes a Bayesian reference analysis to estimate the parameters of the generalized normal linear regression model. It is shown that the reference prior leads to a proper posterior distribution, while the Jeffreys prior returns an improper one. Posterior inference is carried out via Markov chain Monte Carlo (MCMC). Furthermore, diagnostic techniques based on the Kullback–Leibler divergence are used. The proposed method is illustrated using artificial data and real data on the height and diameter of Eucalyptus clones from Brazil.


Introduction
Although the normal distribution is used in many fields for symmetrical data modeling, when the data come from a distribution with lighter or heavier tails, the assumption of normality becomes inappropriate. Such circumstances show the need for more flexible models such as the Generalized Normal (GN) distribution [1], which encompasses various distributions such as the Laplace, normal, and uniform distributions.
In the presence of covariates, the normal linear regression model can be used to investigate and model the relationship between variables, assuming that the observations follow a normal distribution. However, it is well known that a normal linear regression model can be influenced by the presence of outliers [2,3]. In these circumstances, as discussed above, it is necessary to use models in which the error distribution presents heavier or lighter tails than normal such as the GN distribution.
Due to its flexibility, the GN distribution is considered a tool to reduce the impact of outliers and obtain robust estimates [4][5][6]. This distribution has been used in different contexts, with different parameterizations, but the main difficulty in adopting the GN distribution has been computational: there is no closed-form expression for the estimator of the shape parameter, so its estimation must be carried out through numerical methods. In the classical context, however, these methods suffer from convergence problems, as demonstrated in the studies [7,8].
In the Bayesian context, the methods presented in the literature for the estimation of the parameters of the GN distribution are restricted and applied to particular cases. West [9] proved that a scale mixture of the normal distribution could represent the GN distribution. Choy and Smith [10] used the prior GN distribution for the location parameter in the Gaussian model. They obtained the summaries of the posterior distribution, estimated through the Laplace method, and examined their robustness properties. Additionally, the authors used the representation of the GN distribution as a scale mixture of the normal distribution in random effect models and considered the Markov Chain Monte Carlo method (MCMC) to obtain the posterior summaries of the parameter of interest.
Bayesian procedures for regression models with GN errors have been discussed earlier. Box and Tiao [4] used the GN distribution from the Bayesian approach, proposing robust regression models as an alternative to the assumption of normality of the errors in regression models. Salazar et al. [11] considered an objective Bayesian analysis for exponential power regression models, i.e., a reparametrized version of the GN distribution. They derived the Jeffreys prior and showed that such a prior results in an improper posterior. To overcome this limitation, they considered a modified version of the Jeffreys prior under the assumption of independence of the parameters. This assumption does not hold, since the scale and shape parameters are correlated under the proposed distribution. Additionally, the use of the Jeffreys prior is not appropriate in many cases and can cause strong inconsistencies and marginalization paradoxes (see Bernardo [12], p. 41).
Reference priors, also called objective priors, can be used to overcome this problem. The method was introduced by Bernardo [13] and enhanced by Berger and Bernardo [14]. Under this methodology, the information provided by the data dominates the prior information, so the prior distribution has only a vague influence. The reference prior is derived by maximizing the expected Kullback-Leibler (KL) divergence between the posterior and prior distributions. The resulting reference prior affords a posterior distribution with attractive features, such as consistent marginalization, one-to-one invariance, and consistent sampling properties [12]. Applications of reference priors to other distributions can be seen in [15][16][17][18][19][20].
In this paper, we considered the reference approach for estimating the parameters of the GN linear regression model. We showed that the reference prior led to a proper posterior distribution, whereas the Jeffreys prior brought an improper one and should not be used. The posterior summaries were obtained via Markov Chain Monte Carlo (MCMC) methods. Furthermore, diagnostic techniques based on the Kullback-Leibler divergence were used.
To exemplify the proposed model, we considered 1309 observations on the height and diameter of Eucalyptus clones (more details are given in Section 8). For these data, a linear regression model assuming a normal distribution for the residuals was not adequate, so we used the GN linear regression approach. This paper is organized as follows. Section 2 presents the GN distribution with some special cases, and Section 3 introduces the GN linear regression model. Section 4 derives the reference and Jeffreys priors for the GN linear regression model, and Section 5 describes the Metropolis-Hastings algorithm used to sample from the posterior. The model selection criteria are discussed in Section 6, along with their applications. Section 7 presents the proposed method for detecting influential cases under the Bayesian reference prior approach through the Kullback-Leibler divergence. Section 8 presents applications with artificial and real data. Finally, we discuss the conclusions in Section 9.

Generalized Normal Distribution
The Generalized Normal distribution (GN distribution) has been referred to in the literature under different names and parametrizations such as the exponential power distribution or the generalized error distribution. The first formulation of this distribution [1] as a generalization of the normal distribution was characterized by the location, scale, and shape parameters.
Here, we considered the form presented by Nadarajah [21]. A random variable Y follows the GN distribution if its probability density function (pdf) is:

f(y) = s / (2τ Γ(1/s)) exp{−(|y − µ|/τ)^s},  y ∈ ℝ,  (1)

where the parameter µ is the mean, τ is the scale factor, and s is the shape parameter. In particular, the mean, variance, and coefficient of kurtosis of Y are given by E(Y) = µ, Var(Y) = τ² Γ(3/s)/Γ(1/s), and Γ(1/s)Γ(5/s)/Γ(3/s)², respectively. The GN distribution characterizes leptokurtic distributions if s < 2 and platykurtic distributions if s > 2. In particular, the GN distribution reduces to the Laplace distribution when s = 1 and to the normal distribution when s = 2 and τ = √2 σ, where σ is the standard deviation; when s → ∞, the pdf in (1) converges to a uniform distribution.
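This parameterization matches SciPy's `gennorm` distribution, so the special cases above can be checked numerically. The sketch below is an illustrative verification, not part of the original analysis:

```python
import numpy as np
from scipy.stats import gennorm, laplace, norm

x = np.linspace(-4.0, 4.0, 9)

# s = 2 with tau = sqrt(2) * sigma recovers the normal distribution
sigma = 1.0
assert np.allclose(gennorm.pdf(x, 2, scale=np.sqrt(2) * sigma),
                   norm.pdf(x, scale=sigma))

# s = 1 recovers the Laplace distribution
assert np.allclose(gennorm.pdf(x, 1), laplace.pdf(x))
```

The checks confirm term-by-term agreement of the densities at several points, illustrating the Laplace (s = 1) and normal (s = 2) special cases.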
This distribution is symmetric about the mean and unimodal. Moreover, it allows a more flexible fit for the kurtosis than the normal distribution. The ability of the GN distribution to provide a precise fit to the data therefore depends on its shape parameter.
Zhu and Zinde-Walsh [22] proposed a reparameterization of the asymmetric exponential power distribution that allows one to observe the effect of the shape parameter on the distribution; we adapted a similar reparameterization for the GN distribution, and it is used throughout the paper. Figure 1 shows the density functions of the GN distribution in (2) for various parameter values.

Generalized Normal Linear Regression Model
The GN linear regression model is defined as:

Y i = x i β + ε i ,  i = 1, . . . , n,  (3)

where Y i is the response for the ith case, x i = (x i1 , . . . , x ip ) contains the values of the explanatory variables, β = (β 1 , . . . , β p ) is the vector of regression coefficients, and ε i is the random error, which follows a GN distribution with mean zero, scale parameter σ, and shape parameter s.
Therefore, the likelihood function is: The log-likelihood function (4) is given by: The first-order derivatives of the log-likelihood function in (5) are given by: where Ψ(s) = Γ′(s)/Γ(s) is the digamma function. The Fisher information matrix is obtained from the score function; this matrix is needed to derive the reference priors for the model parameters. The following proposition gives the elements of the Fisher information matrix for the model in (3).
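The log-likelihood can be sketched numerically. The code below assumes the error density f(y) = s/(2σΓ(1/s)) exp{−(|y − x i β|/σ)^s}, which is the form implemented by SciPy's `gennorm`; the function name is our own:

```python
import numpy as np
from scipy.stats import gennorm

def gn_loglik(beta, sigma, s, y, X):
    """Log-likelihood of the GN linear regression model (illustrative sketch).

    beta  : (p,) regression coefficients
    sigma : scale parameter (> 0)
    s     : shape parameter (> 1 for the reference analysis)
    y, X  : response vector and (n, p) design matrix
    """
    mu = X @ beta  # mean of each observation
    return gennorm.logpdf(y, s, loc=mu, scale=sigma).sum()
```

A quick sanity check: with zero residuals, σ = 1, and s = 2, each term equals log(1/√π), the log of the GN density at its mode.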

Proposition 1.
Let I(θ) be the Fisher information matrix, with θ = (β, σ, s). The elements of the Fisher information matrix, with I ij (θ) = I ji (θ) and θ j the jth element of θ, are given by, x i x i , where s > 1 and Ψ′(s) = ∂Ψ(s)/∂s is the trigamma function. The restriction s > 1 ensures that the elements I ij (θ), calculated for i, j = 1, 2, 3, are finite and that the information matrix I(θ) is positive definite.
For details of the proof, see Proposition 5 in Zhu and Zinde-Walsh [22]. The Fisher information matrix is then given by: The corresponding inverse of the Fisher information matrix is given by: The matrix in (9) coincides with the Fisher information matrix found by Salazar et al. [11] due to the one-to-one invariance property.

Objective Bayesian Analysis
An important class of objective priors was introduced by Bernardo [13] and later developed by Berger and Bernardo [14]. This class of priors is known as reference priors. A vital feature of the Berger-Bernardo method is the specific treatment given to the parameters of interest and the nuisance parameters. In the presence of nuisance parameters, the reference prior must be constructed using an ordered parameterization: the parameter of interest is selected, and the procedure below is followed (see Bernardo [12] for a detailed discussion).
Then, it holds that the conditional reference priors can be represented as: and: where dω j = dω j × · · · × dω m and π R (ω i |θ, ω 1 , . . . , ω i−1 ), i = 1, . . . , m, are all proper. A compact approximation is required for the corresponding integrals only if any of those conditional reference priors is improper.

Reference Prior
The parameter vector (β, σ, s) is ordered and divided into three distinct groups according to their inferential importance. We considered the case in which β is the parameter of interest and σ and s are the nuisance parameters. To obtain a joint reference prior for β, σ, and s, the following ordered parameterization was adopted. Considering the Fisher matrix in (9), the inverse Fisher matrix in (10), and Corollary 1, a joint reference prior for the ordered parameterization is given by: where β ∈ ℝ^p , σ ∈ ℝ^+ , and s > 1.
Using the likelihood function (4) and the joint reference prior (11), we obtain the joint posterior distribution for β, σ, and s. The conditional posterior densities are given by: The densities in (13) and (15) do not belong to any known parametric family, while the density in (14) can be reduced to an inverse-gamma form by the transformation λ = σ^s . The posterior summaries were obtained via Markov chain Monte Carlo (MCMC) methods; the posterior densities were sampled by applying the Metropolis-Hastings algorithm; see, e.g., Chen et al. [23].

A Problem with the Jeffreys Prior
The use of the Jeffreys prior in the multiparametric case is often controversial. Bernardo ([12], p. 41) argued that the use of the Jeffreys prior is not appropriate in many cases and can cause strong inconsistencies and marginalization paradoxes. This prior is obtained from the square root of the determinant of the Fisher information matrix in (9) and is given by: Such a prior was also presented in Salazar et al. [11]. Both the Jeffreys prior and the reference prior belong to the family of prior distributions of the form: which are usually improper, where c is a hyperparameter and π(s) is the prior related to the shape parameter. The Jeffreys prior and the reference prior are of the form (17) with, respectively, The posterior distribution associated with the prior in (17) is proper if: where L(s|y) is the integrated likelihood for s, given by:
Corollary 2. The marginal prior for s given in Equations (18) and (19) is a continuous function of s.
Proof. See Appendix A.
Proposition 3. The reference prior given in (11) yields a proper posterior distribution, and the Jeffreys prior given in (16) leads to an improper posterior distribution.
Proof. See Appendix A.
Therefore, the Jeffreys prior leads to an improper posterior distribution and cannot be used in a Bayesian analysis. Another objective prior, known as the maximum data information prior, could be considered [16,[24][25][26]; however, such a prior is not invariant under one-to-one transformations, which limits its use. Additionally, our main aim was to consider objective priors. We avoided the use of normal or gamma priors due to their lack of invariance under reparameterization. Moreover, such priors may depend on hyperparameters that are not easy to elicit for the GN distribution, and the posterior estimates may vary depending on the information included. Finally, Bernardo [12] pointed out that the use of "flat" priors to represent "non-informative" beliefs should be strongly discouraged, because they often conceal inappropriate and unjustified assumptions that can easily have a strong influence on the analysis, or even invalidate it.

Metropolis-Hastings Algorithm
Here, the Metropolis-Hastings algorithm is considered to sample µ, σ, and s. In this case, the following conditional distributions are considered: π R (µ|σ, s, y), π R (σ|µ, s, y), and π R (s|µ, σ, y), respectively. Since µ ∈ ℝ, σ > 0, and s > 1, we considered the change of variables ω = (ω 1 , ω 2 , ω 3 ) = (µ, log(σ), log(s − 1)). This transformation maps the parametric space to ℝ³, which allows us to sample from the posterior distribution more efficiently. Accounting for the Jacobian of the transformation, the posterior distribution is given by: The Metropolis-Hastings algorithm uses a random-walk proposal; for instance, for the parameter ω 1 the transition q(ω 1 , ω * 1 ) is obtained by generating ω * 1 = ω 1 + kz, where z ∼ N(0, η 2 ) and k is a constant that controls the acceptance rate. Here, η 2 is the corresponding diagonal element of the covariance matrix of the joint posterior distribution of ω, evaluated at the maximum a posteriori estimate of π R (ω|y).
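The random-walk update described above can be sketched as follows. This is an illustrative implementation on the unconstrained scale; the function name, the generic `log_post` argument, and the per-coordinate step vector (playing the role of k·η) are our own conventions, not the authors' code:

```python
import numpy as np

def rw_metropolis(log_post, omega0, step, n_iter, rng=None):
    """Random-walk Metropolis sampler on the unconstrained scale (sketch).

    log_post : function returning the log posterior density at omega
    omega0   : starting point (1-D array)
    step     : per-coordinate proposal scales (k * eta in the text)
    """
    rng = np.random.default_rng(rng)
    omega = np.asarray(omega0, dtype=float)
    lp = log_post(omega)
    chain = np.empty((n_iter, omega.size))
    accepted = 0
    for t in range(n_iter):
        # propose omega* = omega + k * z, z ~ N(0, eta^2)
        prop = omega + step * rng.standard_normal(omega.size)
        lp_prop = log_post(prop)
        # accept with probability min(1, pi(prop) / pi(omega))
        if np.log(rng.uniform()) < lp_prop - lp:
            omega, lp = prop, lp_prop
            accepted += 1
        chain[t] = omega
    return chain, accepted / n_iter
```

For the GN model, `log_post` would be the log of the transformed reference posterior π R (ω|y); the returned acceptance rate is what k is tuned against.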
The computational stability was improved considering the logarithm scale. The steps to sample from the posterior distribution are:
After generating the values of ω, we computed (µ, σ = exp(ω 2 ), s = 1 + exp(ω 3 )). It was assumed that η has the same value in all steps. The value of k is chosen so that the acceptance rate lies between 20% and 80% [27]. To confirm the convergence of the chains, the Geweke diagnostic [28] was used, together with graphical analysis.

Selection Criteria for Models
In the Bayesian context, a variety of criteria can be adopted to select the best fit among a collection of models. This paper considered the following criteria: the Deviance Information Criterion (DIC) defined by Spiegelhalter et al. [29] and the Expected Bayesian Information Criterion (EBIC) proposed by Brooks [30]. These criteria are based on the posterior mean deviance, E(D(ω)), which is estimated by D̄ = (1/Q) ∑ Q q=1 D(ω (q) ), where the index q indicates the qth realization of a total of Q realizations and D(ω) = −2 ∑ n i=1 log( f (y i |ω)), with f (·) the probability density of the GN distribution. The DIC and EBIC criteria are given by: Another widely used criterion for model selection is the Conditional Predictive Ordinate (CPO). A detailed description of this selection criterion and the CPO statistic, as well as the applicability of the method for selecting models, can be found in Gelfand et al. [31,32]. Let D denote the complete data set and D (−i) denote the data with the ith observation excluded. Let π(ω|D (−i) ), i = 1, . . . , n, denote the posterior density of ω given D (−i) . Thus, we can define the CPO of the ith observation, CPO i , by: where f (y i |ω) is the probability density function. High values of CPO i indicate a better model. An estimate of CPO i can be obtained using an MCMC sample from the posterior distribution of ω given D, π(ω|D). For this, let ω 1 , . . . , ω Q be a sample of size Q from π(ω|D). A Monte Carlo approximation of CPO i [23] is given by: A summary statistic of the CPO i 's is B = ∑ n i=1 log( CPO i ), wherein the higher the value of B, the better the fit of the model. To illustrate the proposed methodology, a comparison between the normal and GN models is presented in Section 8.
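The Monte Carlo approximation of CPO i is the harmonic mean of the likelihood values f (y i |ω q ) over the posterior sample. The sketch below computes it on the log scale for numerical stability; the function name and the (Q, n) matrix layout are our own conventions:

```python
import numpy as np
from scipy.special import logsumexp

def cpo(loglik_matrix):
    """Monte Carlo estimate of CPO_i from an MCMC sample (sketch).

    loglik_matrix : (Q, n) array with entry (q, i) = log f(y_i | omega^(q)).
    Returns the n-vector of CPO_i via the harmonic-mean identity
        CPO_i = ( (1/Q) * sum_q 1 / f(y_i | omega^(q)) )^(-1).
    """
    Q = loglik_matrix.shape[0]
    # log CPO_i = log Q - logsumexp_q( -loglik[q, i] )
    log_cpo = np.log(Q) - logsumexp(-loglik_matrix, axis=0)
    return np.exp(log_cpo)
```

The summary statistic B is then `np.log(cpo(L)).sum()`, with higher values indicating a better-fitting model.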

Bayesian Case Influence Diagnostics
One way to observe the influence of observations on the model fit is via global diagnostics; for instance, one can remove some cases from the analysis and examine the effect of the removal [33]. Case influence diagnostics from a Bayesian perspective are based on the Kullback-Leibler (K-L) divergence. Let K(π, π (−i) ) denote the K-L divergence between π and π (−i) , where π denotes the posterior distribution of ω for the full data and π (−i) denotes the posterior distribution of ω without the ith case. Thus, K(π, π (−i) ) measures the effect of deleting the ith case on the joint posterior distribution of ω. The calibration can be obtained by solving the equation: where B(p) denotes the Bernoulli distribution with success probability p [34] and p i is the calibration measure of the K-L divergence. After some algebraic manipulation, the obtained expression is: which is bounded by 0.5 ≤ p i ≤ 1. Therefore, if p i is significantly higher than 0.5, then the ith case is influential. The posterior expectation in (22) can also be written in the form: where E ω|D (·) denotes the expectation with respect to the posterior π(ω|D). Thus, (23) can be estimated using MCMC samples from the posterior π(ω|D): if ω 1 , . . . , ω Q is a sample of size Q from π(ω|D), then:
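Assuming the standard calibration identity for the Bernoulli comparison, K(π, π (−i) ) = −(1/2) log[4 p i (1 − p i )] (the usual form associated with [34]; we reconstruct it here as an assumption), p i can be computed directly from the divergence:

```python
import numpy as np

def kl_calibration(kl):
    """Calibration p_i of the K-L divergence (illustrative sketch).

    Solves K = -0.5 * log(4 p (1 - p)) for the root in [0.5, 1]:
        p_i = 0.5 * (1 + sqrt(1 - exp(-2 K))).
    K = 0 gives p_i = 0.5 (no influence); large K pushes p_i toward 1.
    """
    kl = np.asarray(kl, dtype=float)
    return 0.5 * (1.0 + np.sqrt(1.0 - np.exp(-2.0 * kl)))
```

A case with p i significantly above 0.5 is flagged as influential, matching the rule stated above.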

Applications
In this section, the proposed method is illustrated using artificial and real data.

Artificial Data
An artificial sample of size n = 500 was generated in accordance with (3), with p = 2, x i = (1, x i1 ), x i1 ∼ N(2.5, 1), β = (2, −1.5), σ = 1, and s = 2.5. The posterior samples were generated by the Metropolis-Hastings algorithm, implemented in the R software [35]. A single chain of 300,000 iterations was run for each parameter, and the first 150,000 iterations were discarded as burn-in to reduce correlation problems. A thinning interval of 15 was used, resulting in a final sample of size 10,000. The convergence of the chain was verified by the criterion proposed by Geweke [28]. Table 1 shows the posterior summaries for the parameters of the GN linear regression model. The estimates were close to the true values, and the 95% HPD credible intervals covered the true values of the parameters. We used the same simulated sample to investigate the ability of the K-L divergence measure to detect influential observations in the GN linear regression model. We selected Cases 50 and 250 for perturbation. For each of these two cases, and also considering both cases simultaneously, the response variable was perturbed as follows: y i = y i + 5S y , where S y is the standard deviation of y i . The MCMC estimates were obtained as in the last section. Note that, due to the invariance property, µ can be computed for the standard GN distribution using the Bayes estimates of β 1 and β 2 , that is, µ = x i β.
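The data-generating step can be sketched as follows. This is an illustrative reconstruction with an arbitrary seed, not the authors' original script (which used R); SciPy's `gennorm` supplies the GN errors:

```python
import numpy as np
from scipy.stats import gennorm

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
n = 500

# design matrix: intercept plus x_i1 ~ N(2.5, 1)
x1 = rng.normal(2.5, 1.0, size=n)
X = np.column_stack([np.ones(n), x1])

beta = np.array([2.0, -1.5])

# GN errors with mean 0, scale sigma = 1, and shape s = 2.5
eps = gennorm.rvs(2.5, scale=1.0, size=n, random_state=rng)

y = X @ beta + eps
```

The resulting pair (y, X) plays the role of the artificial data set to which the reference posterior is then fit.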
To reveal the impact of the influential observations on the estimates of β 1 , β 2 , σ, and s, we calculated the Relative Variation (RV), obtained by RV = (θ̂ − θ̂ 0 )/θ̂ × 100%, where θ̂ and θ̂ 0 are the posterior means of the model parameters under the original data and the perturbed data, respectively. Table 2 shows the posterior estimates for the artificial data and the RVs of the parameter estimates with respect to the original simulated data. The data set denoted by (a) is the original simulated data set without perturbation, and the data sets denoted by (b) to (d) result from perturbations of the original simulated data set. Higher values of the relative variations for the parameters σ and s revealed the presence of influential points in the data set, although the estimate of s did not differ as much in the perturbed Cases (c) and (d). Considering the samples generated from the posterior distribution of the GN linear regression model parameters, we estimated the K-L divergence measures and their respective calibrations for each of the cases considered (a-d), as described in Section 7. The results in Table 3 show that, for the data without perturbation (a), the selected cases were not influential, because they had small values of K(π, π (−i) ) and calibrations close to 0.577. However, when the data were perturbed (b-d), the values of K(π, π (−i) ) were larger, and their calibrations were close or equal to one, indicating that these observations were influential.

Real Data Set
In order to illustrate the proposed methodology, recall the Brazilian Eucalyptus clones data set on the height (in meters) and the diameter (in centimeters) of Eucalyptus clones. The data belong to a large pulp and paper company from Brazil. As a strategy for raising the profitability of the forestry enterprise and keeping pulp and paper production under control, the company needs to maintain intensive Eucalyptus clone silviculture. The height of the trees is an important measure for selecting different clone species. Moreover, it is desirable to have trees of similar heights, possibly with slight variation, and consequently with a distribution function with lighter tails.
The objective is to relate the tree's diameter (explanatory variable) to its height (response variable). The GN and normal linear regression models were fit to the data via the Bayesian reference approach. The posterior samples were generated by the Metropolis-Hastings algorithm, as in the simulation study: a single chain of 300,000 iterations for each parameter, a burn-in of 150,000 iterations, and a thinning interval of 15, resulting in a final sample of size 10,000. The Geweke criterion verified the convergence of the chain. Table 4 shows the posterior summaries for the parameters of both distributions and the model selection criteria. The GN linear regression model was the most suitable to represent the data, as it performed better than the normal linear regression model under all the criteria used. For the chosen fit, note that β 2 was significant; thus, for every one-unit increase in diameter, the average Eucalyptus height increased by 0.95 meters. In particular, the analysis of the shape parameter (s > 2) provided strong evidence of a platykurtic distribution for the errors, which favored the GN linear regression model. This was further confirmed by graphical analyses of the quantile residuals of the GN linear regression model presented in Figure 2d. Figures 2a and 3a show the scatter plot of the data and the adjusted normal and GN linear regression models. It was observed that, on average, the estimated heights were close to those observed, indicating that the models considered had a good fit. The plots of the residuals against the fit values and against the observations were also quite similar for both models. The presence of heteroscedasticity (see Figures 2b and 3b) was noted, as well as a quadratic trend (see Figures 2c and 3c) for the height-to-diameter ratio of the Eucalyptus.
The quantile-quantile plots of the GN and normal distributions for the residuals of the models are presented in Figures 2d and 3d, respectively. In the normal linear regression model, many points in the tails were far from the line, indicating an inadequate specification of the error distribution. On the other hand, under the proposed GN approach, we observed a good fit, as the points followed the line, indicating that the theoretical residuals were close to the observed residuals. Therefore, there was evidence that the chosen model outperformed the normal linear regression model in fitting the data. To investigate the influence of the height and diameter data of Eucalyptus on the fit of the chosen generalized normal linear regression model, we calculated the K-L divergence measures and their respective calibrations. Figure 4 shows the index plot of K(π, π (−i) ) for the height and diameter data of Eucalyptus. Note that Cases 335, 479, and 803 exhibited higher values of the K-L divergence than the other observations. The K-L divergences and calibrations of the three observations with the highest calibration values are presented in Table 5. Observation 803 was possibly an influential case, and it also appeared as an outlier in the visual analysis. To assess whether this observation altered the parameter estimates of the GN linear regression model, we carried out a sensitivity analysis. Table 6 shows the new estimates of the model parameters after excluding the case with the greatest calibration value, together with the relative variations (RV) of these estimates for the Eucalyptus data.
Here, the relative variations were obtained by RV = (θ̂ − θ̂ 0 )/θ̂ × 100%, where θ̂ and θ̂ 0 are the posterior means of the model parameters obtained from the original data and from the data without the influential observation, respectively. We noted a slight change in the RV of the parameter s when we excluded the influential observation; however, such a change was insignificant, indicating that the GN linear regression model was not affected by the influential cases. Overall, regardless of the measurement unit or scale, in both the synthetic and the real data sets the visual representations corroborated the adequacy of the GN model: the quantile-quantile plots, the residuals-versus-fits plots, and the standardized errors showed no remaining pattern, suggesting equal error variances and no outliers.

Discussion
In this paper, we presented the generalized normal linear regression model from objective Bayesian analysis. The Jeffreys prior and reference prior for the generalized normal model were discussed in detail. We proved that the Jeffreys prior leads to an improper posterior distribution and cannot be used in a Bayesian analysis. On the other hand, the reference prior leads to a proper posterior distribution.
The parameter estimates were based on a Bayesian reference analysis procedure via MCMC. Diagnostic techniques based on the Kullback-Leibler divergence were built for the generalized normal linear regression model. Studies with artificial and real data were performed to verify the adequacy of the proposed inferential method.
The application to a real data set showed that the generalized normal linear regression model outperformed the normal linear regression model under all the model selection criteria. Furthermore, in the studies with artificial and real data, the Kullback-Leibler divergence effectively detected the points that were influential on the fit of the generalized normal linear regression model. The removal of such influential points from the real data set showed that the generalized normal model was not affected by influential observations. This result corroborates the view of the generalized normal distribution as a tool for reducing the impact of outliers and achieving robust estimates. The proposed methodology showed consistent marginalization and sampling properties and thus overcame the difficulties of estimating the parameters of this important regression model. Moreover, by adopting the reference (objective) prior, we obtained one-to-one invariant results under the Bayesian paradigm, enabling the GN distribution to be used in a practical form.
Further works can explore a number of extensions of this study. For instance, the method developed in this article may be applied to other regression models such as the Student t regression model and the Birnbaum-Saunders regression model, among others. Additionally, other generalizations of the normal distribution could be considered [36][37][38].

Conflicts of Interest:
The authors declare no conflict of interest.

Proof of Proposition 3.
Considering the reference prior given in Corollary 2, the result of Corollary 3, and Condition (20), it follows that the reference posterior distribution is proper if: Thus, we have: Therefore, the reference prior leads to a proper posterior distribution.
Considering the Jeffreys prior given in Corollary 2, the result of Corollary 3, and Condition (20), it follows that the Jeffreys posterior distribution would be proper if: Since this condition fails, the Jeffreys prior returns an improper posterior, completing the proof.