Estimation of Beta-Pareto Distribution Based on Several Optimization Methods

Abstract: This paper is concerned with the maximum likelihood estimators of the Beta-Pareto distribution introduced in Akinsete et al. (2008), which comes from mixing two probability distributions, Beta and Pareto. Since these estimators cannot be obtained explicitly, we use nonlinear optimization methods that provide them numerically. The methods we investigate are the Newton-Raphson method, the gradient method and the conjugate gradient method; for the conjugate gradient method we use the Fletcher-Reeves variant. The corresponding algorithms are developed and the performance of the methods is confirmed by an extensive simulation study. In order to compare several competing models, namely Generalized Beta-Pareto, Beta, Pareto, Gamma and Beta-Pareto, model selection criteria are used. We firstly consider completely observed data and, secondly, we assume the observations to be right censored and derive the same type of results.


Introduction
In this work we are interested in the four-parameter distribution called the Beta-Pareto (BP) distribution, introduced recently by Akinsete et al. (2008) [1]. This distribution generalizes several models, such as the Pareto, log-beta, exponential and arcsine distributions, in the sense that these distributions can be obtained as special cases of the BP distribution by transforming the variable or by setting special values of the parameters (cf. Reference [1]).
It is well known that the family of Pareto distributions and the corresponding generalizations have been extensively used in various fields of application, such as income data [2], environmental studies [3] or economical-social studies [4]. Among the generalizations of the Pareto distribution, we can cite the Burr distribution [5], the power function distribution [6] and the logistic distribution. It is also important to stress that heavy-tailed phenomena can be successfully modelled by means of (generalized) Pareto distributions [7]. The BP model inherits this feature, being based on the Pareto distribution, which is known to be heavy tailed, as shown in Reference [1] using both theoretical and applied arguments. This is important in practice because the model can describe skewed data better than other distributions proposed in the statistical literature; a classical example of this kind is given by the exceedances of the Wheaton River flood data, which are highly skewed to the right.
All these reasons show the great importance of statistically investigating any generalization of the Pareto distribution. Thus the purpose of our work is to provide algorithmic methods for practically computing the maximum likelihood estimators (MLEs) of the parameters of the BP distribution, to carry out intensive simulation studies in order to investigate the quality of the proposed estimators, to consider model selection criteria in order to choose among candidate models, and also to take into account right-censored data, not only completely observed data. Note that right-censored data are often observed in reliability studies and survival analysis, where experiments are generally conducted over a fixed time period. At the end of the study, we generally cannot observe the failure times of all the products, or the death (or remission) times of all the patients, due to loss to follow-up, competing risks (death from other causes) or any combination of these. In such cases, we do not always have the opportunity of observing these survival times, observing only the censoring times. Consequently, for this type of data and applications it is of crucial importance to develop a methodology that allows estimation in the presence of right censoring.
Generally, numerical methods are required when the MLEs of the unknown parameters of a model cannot be obtained explicitly. For the BP distribution, we propose the use of several nonlinear optimization methods: the Newton-Raphson method, the gradient method and the conjugate gradient method. For the Newton-Raphson method, the approximation of the Hessian matrix is widely studied in the literature; in this work, we are interested in the BFGS method (from Broyden-Fletcher-Goldfarb-Shanno), the DFP method (from Davidon-Fletcher-Powell) and the SR1 method (Symmetric Rank 1). For the conjugate gradient method we use the Fletcher-Reeves variant.
It is well known that the gradient and conjugate gradient methods are often preferable to the Newton-Raphson method. Nonetheless, most statistical studies use the Newton-Raphson method instead of the gradient or conjugate gradient methods, perhaps because it is much easier to put into practice. Our interest is to implement the gradient method and the conjugate gradient method in our framework of BP estimation, presenting the Newton-Raphson method for comparison purposes only.
The structure of the article is as follows. In the next section we introduce the BP distribution and we give some elements of MLE for this model. Section 3 is devoted to the numerical optimization methods used for obtaining the MLEs of the parameters of interest. Firstly, we briefly recall the numerical methods that we use (Newton-Raphson, gradient and conjugate gradient). Secondly, we present the corresponding algorithm for the conjugate gradient method, which is the most complex one. We end this section by investigating through simulations the accuracy of these three methods. In Section 4 we use model selection criteria (AIC, BIC, AICc) in order to choose between several competing models, namely the Generalized Beta-Pareto, Beta, Pareto, Gamma and BP models. In Section 5 we assume that we have right-censored observations at our disposal and we derive the same type of results as for complete observations.

BP Distribution and MLEs of the Parameters
A random variable X has a BP distribution with parameters α, β, θ, k > 0 if its probability density function is

f(x; α, β, k, θ) = [k / (θ B(α, β))] [1 − (θ/x)^k]^(α−1) (θ/x)^(kβ+1), x ≥ θ,

where B(α, β) = Γ(α)Γ(β)/Γ(α + β), Γ being the Gamma function. Consequently, the support of X is [θ, +∞). The corresponding cumulative distribution function can be written as

F(x) = I_{1−(θ/x)^k}(α, β), x ≥ θ,

where I_z(α, β) denotes the regularized incomplete beta function. In Figures 1-5, the pdf, survival, cdf, hazard and cumulative hazard functions are presented for several values of the parameters.

Let us now consider an i.i.d. (independent and identically distributed) sample (x_1, x_2, x_3, ..., x_n) of a random variable X following a BP distribution with density f(x; α, β, k, θ). The corresponding log-likelihood function can be written as

l(α, β, k, θ | x_1, ..., x_n) = n ln k − n ln θ − n ln B(α, β) + (α − 1) Σ_{i=1}^n ln[1 − (θ/x_i)^k] + (kβ + 1) Σ_{i=1}^n ln(θ/x_i).   (1)

As usual, when we need to see the log-likelihood function as a function of the parameters α, β, k, θ only, we shall write l(α, β, k, θ) instead of l(α, β, k, θ | x_1, x_2, . . . , x_n). The score equations are thus given by

∂l(α, β, k, θ)/∂α = −n[Ψ(α) − Ψ(α + β)] + Σ_{i=1}^n ln[1 − (θ/x_i)^k] = 0,   (2)

∂l(α, β, k, θ)/∂β = −n[Ψ(β) − Ψ(α + β)] + k Σ_{i=1}^n ln(θ/x_i) = 0,   (3)

∂l(α, β, k, θ)/∂k = n/k − (α − 1) Σ_{i=1}^n (θ/x_i)^k ln(θ/x_i) / [1 − (θ/x_i)^k] + β Σ_{i=1}^n ln(θ/x_i) = 0,   (4)

where the digamma function Ψ is defined by Ψ(x) := Γ'(x)/Γ(x), x > 0. As x ≥ θ, the MLE of the parameter θ is the first order statistic x_(1), and the MLEs of α, β and k are obtained by solving the system (2)-(4).
Using Equations (2)-(4), the elements of the Fisher information matrix can be computed; see Reference [1]. There are no closed-form solutions to these equations, so numerical methods are required.
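To make the density and log-likelihood above concrete, here is a minimal Python sketch (the paper's own simulations were done in R; the function names `bp_logpdf` and `bp_loglik` are ours, not from the paper). It evaluates ln f(x; α, β, k, θ) via `lgamma` for ln B(α, β), and plugs in θ̂ = x_(1), shifted infinitesimally below the minimum so the density is positive at every observation.

```python
import math

def bp_logpdf(x, alpha, beta, k, theta):
    """Log-density of the Beta-Pareto distribution on [theta, +inf)."""
    if x <= theta:
        return float("-inf")  # outside the support (density vanishes at the boundary)
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    r = theta / x
    return (math.log(k) - math.log(theta) - log_B
            + (alpha - 1) * math.log1p(-r**k)   # ln[1 - (theta/x)^k]
            + (k * beta + 1) * math.log(r))     # (k*beta + 1) * ln(theta/x)

def bp_loglik(alpha, beta, k, data):
    # MLE of theta is the first order statistic; shift it slightly below
    # min(data) so every observation has strictly positive density
    theta = min(data) * (1 - 1e-9)
    return sum(bp_logpdf(x, alpha, beta, k, theta) for x in data)
```

A quick sanity check of such a sketch is to verify numerically that exp(bp_logpdf) integrates to 1 over [θ, +∞) for fixed parameter values.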

Numerical Optimization for Solving the Score Equations
In this section we are interested in proposing numerical methods for solving the score equations of the MLEs of the BP parameters. We will consider three methods: the Newton-Raphson's method, the gradient method and the conjugate gradient method. Firstly, we will recall the general lines of these optimization methods. Secondly, we will give a detailed description of the application of the most complex method, namely of the conjugate gradient one, for computing the MLEs of the BP distribution. We close the section by presenting numerical results of the implementation of these methods in our BP framework.

General Description of Optimization Methods
Let us first recall the main lines of the three numerical methods that we have considered. The objective of these methods is to minimize/maximize a function l : R^p → R; in our case, p = 3 (the three parameters of the BP distribution to be estimated, α, β, k). As mentioned in the previous section, although the BP distribution has 4 parameters (α, β, k and θ), the MLE of the fourth parameter θ is immediately obtained as the first order statistic, θ̂ = x_(1). For this reason, although the function l has 4 arguments, only 3 are in fact involved in the optimization.

Newton-Raphson Method
Newton's method: Let us denote by V := (α, β, k, θ) a current point of the function l, and let ∇l(V_k) be the gradient and H_k the Hessian matrix at an iterate V_k of the current point V. Newton's method consists in taking the descent direction

d_k = −H_k^{−1} ∇l(V_k)

and updating V_{k+1} = V_k + λ_k d_k. Quasi-Newton methods: Often in practice, the inverse of the Hessian, H_k^{−1}, is very difficult to evaluate when the function l is not analytic, while the gradient is always more or less accessible (by inverse methods). As the Hessian cannot be computed exactly, we try to evaluate an approximation. Among the methods that approximate the Hessian, three are retained here: the BFGS method (for Broyden-Fletcher-Goldfarb-Shanno), the DFP method (for Davidon-Fletcher-Powell) and the SR1 method (for Symmetric Rank 1) [8].
We introduce the notations s_k = V_{k+1} − V_k and y_k = ∇l(V_{k+1}) − ∇l(V_k), and we choose H_0 a positive definite matrix; for convexity reasons the identity matrix is usually chosen.
Update of the Hessian by the BFGS method: In this approach, the approximation of the Hessian is updated by

H_{k+1} = H_k + (y_k y_k^T)/(y_k^T s_k) − (H_k s_k s_k^T H_k)/(s_k^T H_k s_k).
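As a hedged illustration of the BFGS update, the formula above can be sketched in a few lines of dependency-free Python (the helper names `matvec`, `dot` and `bfgs_update` are ours). The defining property, checked below, is the secant condition H_{k+1} s_k = y_k, which is what makes the quasi-Newton iteration mimic Newton's method without computing the exact Hessian.

```python
def matvec(M, v):
    # matrix-vector product for a matrix stored as a list of rows
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bfgs_update(H, s, y):
    """BFGS update: H+ = H + (y y^T)/(y^T s) - (H s s^T H)/(s^T H s)."""
    Hs = matvec(H, s)
    ys, sHs = dot(y, s), dot(s, Hs)
    n = len(s)
    return [[H[i][j] + y[i] * y[j] / ys - Hs[i] * Hs[j] / sHs
             for j in range(n)] for i in range(n)]
```

Note that the update requires y_k^T s_k > 0 (the curvature condition), which a well-designed line search guarantees.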

Gradient Method
This algorithm is based on the fact that, in the vicinity of a point V, the function l decreases most strongly in the direction opposite to the gradient of l, −∇l(V). Let us fix an arbitrary ε > 0. The algorithm can be described as follows:
1. Step 0 (Initialization): choose V_0 = (α_0, β_0, k_0, θ_0) satisfying the constraints on the parameters of the BP distribution; set k = 0 and go to Step 1.
2. Step 1: compute d_k = −∇l(V_k); if ‖∇l(V_k)‖ ≤ ε, then STOP; if not, go to Step 2.
3. Step 2: compute λ_k, the solution of the one-dimensional problem min_{λ>0} l(V_k + λ d_k); set V_{k+1} = V_k + λ_k d_k, set k = k + 1 and go to Step 1.
The algorithm that uses this direction of descent is called the gradient algorithm or the steepest descent algorithm. The number of iterations it requires depends on the regularity of the function l. In practice, we often observe that −∇l(V) is a good direction of descent, but the convergence to the solution is generally slow. Despite its modest numerical performance, this algorithm is worth studying.
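The steps above can be sketched as follows; this is a minimal Python illustration (the paper's implementation is in R), in which an Armijo backtracking loop stands in for the exact line search of Step 2.

```python
def gradient_descent(f, grad, x0, eps=1e-6, max_iter=10000):
    """Steepest descent with an Armijo backtracking line search."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 <= eps:          # Step 1 stopping rule
            break
        lam, fx = 1.0, f(x)
        # shrink the step until the sufficient-decrease condition holds
        while f([xi - lam * gi for xi, gi in zip(x, g)]) > fx - 1e-4 * lam * gnorm2:
            lam *= 0.5
            if lam < 1e-14:
                break
        x = [xi - lam * gi for xi, gi in zip(x, g)]
    return x

# example: minimize f(x, y) = (x - 1)^2 + 4 (y + 2)^2
f = lambda v: (v[0] - 1) ** 2 + 4 * (v[1] + 2) ** 2
grad = lambda v: [2 * (v[0] - 1), 8 * (v[1] + 2)]
```

On this ill-conditioned quadratic the iterates zig-zag toward the minimum (1, −2), illustrating the slow convergence mentioned above.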

Conjugate Gradient Method
The conjugate gradient method is one of the most famous and widely used methods for solving minimization problems, mostly for large-scale ones. This method was proposed in 1952 by Hestenes and Stiefel for the minimization of strictly convex quadratic functions [9].
Many mathematicians have extended this method to the nonlinear (non-quadratic) case. This was done for the first time in 1964 by Fletcher and Reeves [10], then in 1969 by Polak and Ribière [11]; another variant was studied in 1987 by Fletcher [12]. The strategy adopted is to use the recursive sequence

V_{k+1} = V_k + λ_k d_k,

where λ_k is a properly chosen positive real constant called the "step" and d_k is a non-zero real vector called the "direction". Since the conjugate gradient algorithm is applied here to non-quadratic functions, we should note that nonlinear conjugate gradient methods are not unique. Algorithm for the conjugate gradient method: The algorithm is initialized by Step 0 of the simple gradient method.
0. Set d_0 = −∇l(V_0). As long as a convergence criterion is not satisfied:
1. Determine a step λ_k by some line search method and compute the new iterate V_{k+1} = V_k + λ_k d_k.
2. Evaluate the new gradient ∇l(V_{k+1}).
3. Compute the real ω_{k+1} by some method, for example that of Fletcher and Reeves (see below).
4. Construct the new descent direction d_{k+1} = −∇l(V_{k+1}) + ω_{k+1} d_k.
Several methods exist for computing the term ω_{k+1}; we will be concerned in the sequel with the methods of Fletcher-Reeves, of Polak-Ribière, and of Hestenes-Stiefel (a variant of the method of Fletcher-Reeves). It should be noted that this last method is particularly effective when the function is quadratic and the line search is carried out exactly. The corresponding ω_{k+1} for these methods is given by:

Fletcher-Reeves:

ω_{k+1} = ‖∇l(V_{k+1})‖² / ‖∇l(V_k)‖².   (5)

Polak-Ribière:

ω_{k+1} = ∇l(V_{k+1})^T [∇l(V_{k+1}) − ∇l(V_k)] / ‖∇l(V_k)‖².

Hestenes-Stiefel:

ω_{k+1} = ∇l(V_{k+1})^T [∇l(V_{k+1}) − ∇l(V_k)] / d_k^T [∇l(V_{k+1}) − ∇l(V_k)].

In all cases we have the recurrence V_{k+1} = V_k + λ_k d_k, where the descent step λ_k is obtained with an exact or inexact line search, and the descent vector d_k is obtained by the conjugate gradient recurrence d_{k+1} = −∇l(V_{k+1}) + ω_{k+1} d_k, with d_0 = −∇l(V_0).
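Putting steps 0-4 together, a nonlinear conjugate gradient iteration with the Fletcher-Reeves coefficient can be sketched as follows (a Python illustration, not the authors' R code; the inexact Armijo line search and the restart safeguard are standard practical additions to the textbook scheme).

```python
def cg_fletcher_reeves(f, grad, x0, eps=1e-8, max_iter=500):
    """Nonlinear conjugate gradient (Fletcher-Reeves) for minimization."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]                          # step 0: d0 = -grad l(V0)
    for _ in range(max_iter):
        gnorm2 = sum(gi * gi for gi in g)
        if gnorm2 ** 0.5 <= eps:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0:                             # safeguard: restart if d is not a descent direction
            d = [-gi for gi in g]
            slope = -gnorm2
        lam, fx = 1.0, f(x)                        # step 1: Armijo backtracking line search
        while f([xi + lam * di for xi, di in zip(x, d)]) > fx + 1e-4 * lam * slope:
            lam *= 0.5
            if lam < 1e-14:
                break
        x = [xi + lam * di for xi, di in zip(x, d)]
        g_new = grad(x)                            # step 2: new gradient
        omega = sum(gi * gi for gi in g_new) / gnorm2        # step 3: Fletcher-Reeves (5)
        d = [-gn + omega * di for gn, di in zip(g_new, d)]   # step 4: new direction
        g = g_new
    return x
```

With an exact line search on a strictly convex quadratic, this scheme terminates in at most N steps in N dimensions; with the inexact search used here, it still converges quickly in practice.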

Optimization Methods for Computing the MLEs of the BP Distribution
At this stage, we are interested in describing the application of the three optimization methods for computing the MLEs of the BP distribution. In the sequel we present in detail only the most complex one, namely the conjugate gradient method. Following the same lines, the other two methods can also be adapted for computing the MLEs of the BP distribution. Note also that numerical results for all three methods are given in the next section.
So, our purpose is to numerically determine the MLEs of the BP parameters by calculating the maximum of the log-likelihood function l(α, β, k, θ) given in (1).
As we have previously mentioned, the MLE of the parameter θ is the first order statistic, so, in the sequel, the iterates act only on the three remaining parameters, with θ fixed at x_(1). To compute the MLEs of the parameters, we proceed as follows:
1. Step 0 (Initialization): choose V_0 = (α_0, β_0, k_0) satisfying the constraints on the parameters of the BP distribution; set i = 0.
2. Step 1: compute the gradient ∇l(V_i); if ‖∇l(V_i)‖ ≤ ε, then STOP. Else, go to the next step.
3. Step 2: compute the direction of descent; when we use the method of Fletcher-Reeves (5), we obtain d_i = −∇l(V_i) + ω_i d_{i−1}, with d_0 = −∇l(V_0).

Determination of a step λ 0
We can find λ_0 by exact or inexact line search methods. In our case, the use of the exact line search helps us to obtain fast convergence. The exact line search method consists in solving the problem max_{λ>0} l(V_0 + λ d_0). For this, we look for the value of λ_0 which cancels the first derivative of the function φ(λ) := l(V_0 + λ d_0), that is, which solves φ'(λ_0) = 0; from this we can deduce the exact value of λ_0. If λ_0 is positive, we accept it and go to the next step. If not, we use inexact line search methods to obtain an approximation of the optimal value λ_0 and then go to the next step. The main inexact line search methods are those of Armijo [13], Goldstein [14], Wolfe [15] and strong Wolfe [16].

Construction of the vector V

The new iterate is V_{i+1} = V_i + λ_i d_i, where λ_i was previously obtained. Set i = i + 1 and go to Step 1.
Based on Zoutendijk's theorem [17] and globally convergent Riemannian conjugate gradient method [18], the convergence of the algorithm is ensured.
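To illustrate the whole procedure end to end, here is a hedged Python sketch (the paper's study used R): it simulates BP data by inverse transform (X = θ(1 − B)^{−1/k} with B ~ Beta(α, β), since F(x) = I_{1−(θ/x)^k}(α, β)), fixes θ̂ at the first order statistic, and maximizes the log-likelihood over (α, β, k) by a simple gradient ascent with a numerical gradient and a log-reparameterization that keeps the parameters positive. All function names are ours; a conjugate gradient variant would add the ω_{k+1} term of (5) to the search direction.

```python
import math
import random

def bp_logpdf(x, alpha, beta, k, theta):
    if x <= theta:
        return float("-inf")
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    r = theta / x
    return (math.log(k) - math.log(theta) - log_B
            + (alpha - 1) * math.log1p(-r**k) + (k * beta + 1) * math.log(r))

def simulate_bp(n, alpha, beta, k, theta, rng=random):
    # inverse transform: X = theta * (1 - B)^(-1/k), B ~ Beta(alpha, beta)
    return [theta * (1 - rng.betavariate(alpha, beta)) ** (-1.0 / k) for _ in range(n)]

def fit_bp(data, iters=200, h=1e-5):
    theta = min(data) * (1 - 1e-9)        # theta_hat: first order statistic
    def loglik(p):                        # p = (ln alpha, ln beta, ln k)
        if max(p) > 50:
            return float("-inf")          # guard against absurdly large trial steps
        a, b, k = (math.exp(v) for v in p)
        return sum(bp_logpdf(x, a, b, k, theta) for x in data)
    p = [0.0, 0.0, 0.0]                   # start from alpha = beta = k = 1
    for _ in range(iters):
        g = []
        for i in range(3):                # central-difference numerical gradient
            q1, q2 = p[:], p[:]
            q1[i] += h
            q2[i] -= h
            g.append((loglik(q1) - loglik(q2)) / (2 * h))
        lam, fp = 0.1, loglik(p)
        while loglik([pi + lam * gi for pi, gi in zip(p, g)]) < fp and lam > 1e-12:
            lam *= 0.5                    # backtrack until the step is an ascent
        if lam <= 1e-12:
            break                         # no ascent step found: stop
        p = [pi + lam * gi for pi, gi in zip(p, g)]
    a, b, k = (math.exp(v) for v in p)
    return (a, b, k, theta), loglik(p)
```

This is a sketch of the idea rather than a production fitter: with analytic score equations (2)-(4), the numerical gradient would be replaced by the exact one.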

Numerical Results
We carried out an extensive simulation study using the R software. In the sequel we present the results obtained by means of the three optimization methods.
In Tables 1-3, the MLEs of the parameters α, β and k of the BP distribution are presented, together with the standard deviations (SDs), for the three optimization methods. Note the convergence of the estimators obtained with the gradient and conjugate gradient methods, and note also that Newton's method did not converge, the objective function being non-quadratic.
Tables 4-6 present the MLEs of the parameters for the conjugate gradient and gradient methods, as well as the bias and the mean square errors (MSEs) of the estimators. Note that the MSEs of the estimators are very small, below 10^{−4}. The interest of the conjugate gradient method comes from the fact that it converges quickly towards the minimum; one can show that in N dimensions it needs at most N steps when the function is exactly quadratic. The drawback of Newton's method is that it requires the Hessian of the function in order to determine the descent step. We have seen that the conjugate gradient algorithm chooses the descent directions through V = (α, β, k, θ) optimally.
In Figures 6-8 we also present the evolution of the MSEs of the computed estimators. The three numerical methods are carried out for m = 10,000 iterations.
As we see in these figures, the MLEs of the parameters α, β and k obtained by the Newton, gradient and conjugate gradient methods are √n-consistent, and the conjugate gradient method gives the best results. In this way, we have numerically checked the √n-consistency of the MLEs predicted by the general theory (consistency and asymptotic normality), by verifying that the mean square errors decrease at the corresponding rate, which confirms the theory behind the methods used.

Model Selection
We also want to take into account model selection criteria in order to choose among several candidate models: the BP, Beta, Pareto, Gamma and Generalized Beta-Pareto distributions. The Generalized BP (GBP) distribution is a 5-parameter distribution introduced in Reference [19], with a different form given in Reference [20]; its density, defined by (8), depends on the parameters α, β, θ, c, k > 0. For the model selection problem, we considered the general information criterion (GIC)

GIC = −2 ln L + I · K,

where L is the likelihood, K is the number of parameters of the model and I is an index for the penalty of the model. The following well-known criteria are obtained for different values of I: AIC = −2 ln L + 2K (for I = 2), BIC = −2 ln L + K ln n (for I = ln n), and the corrected criterion AICc = AIC + 2K(K + 1)/(n − K − 1), where n is the sample size. We have simulated data according to the BP, Pareto (P), Gamma (G) and GBP distributions. We have computed the three criteria AIC, BIC and AICc and recorded in Tables 7-10 the number of times out of 1000 iterations that each of the models is selected (minimum value of the corresponding criterion).
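As a small illustration of these criteria (the helper names below are ours, not the paper's code), the GIC family and the selection-by-minimum rule can be written as:

```python
import math

def gic(loglik, K, I):
    """General information criterion: GIC = -2 ln L + I * K."""
    return -2.0 * loglik + I * K

def aic(loglik, K):
    return gic(loglik, K, 2.0)             # I = 2

def bic(loglik, K, n):
    return gic(loglik, K, math.log(n))     # I = ln n

def aicc(loglik, K, n):
    # small-sample correction of the AIC
    return aic(loglik, K) + 2.0 * K * (K + 1) / (n - K - 1)

def select(criterion_by_model):
    """criterion_by_model: dict name -> criterion value; pick the minimum."""
    return min(criterion_by_model, key=criterion_by_model.get)
```

For example, a 5-parameter model (like the GBP) pays a penalty of 10 under AIC versus 6 for the 3-free-parameter fits, which is consistent with the small-sample behaviour discussed below.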
Several remarks need to be made. First, since the support of a BP distribution is [θ, +∞), it makes sense to compare data from a BP distribution with data from the Gamma, Pareto and GBP distributions. For example, when θ is close to 0, the support of the BP distribution is close to that of the Gamma distribution. The BP distribution reduces to the Pareto distribution when α = β = 1, and in this case the supports coincide.
Second, note that we have added the Beta (B) distribution as one of the candidate distributions to be chosen by the information criteria. Although this is not a "real" candidate (since the support is (0, 1)), we wanted to check also the model selection criteria in the presence of an "unusual" model. Clearly, it does not make sense to compare data from BP, GBP, Gamma or Pareto distribution, on the one side, with data from Beta distribution, on the other side. Nonetheless, it is important to have an idea about how the model criteria behave in this case.
We can notice that all criteria choose the correct model in all four situations, for moderate or even small sample sizes.
We can also remark that, when the underlying model is the GBP, for small values of the sample size n (less than 50), the criteria fail in choosing the correct model; nonetheless, starting from reasonable values of the sample size n (around 50 or 100), the correct model is chosen by all criteria in most of the cases. This phenomenon could be generated by the fact that the number of parameters in the GBP model is the highest one, 5, which has an influence on the information criteria.
We have also noticed a strange phenomenon: again when the underlying model is the GBP, for small values of the sample size n, the model preferred by the criteria is the Beta model instead of the GBP, that is to say the "unusual" model, as previously explained. The effect is more accentuated for the BIC criterion, so it is surely related to the penalization due to the number of parameters.
Another remark related to the difference between the BP and GBP models needs to be made. Comparing Tables 7 and 10, we can notice an asymmetry in the model selection criteria when the underlying true model is the BP, respectively the GBP model. On the one hand, when the underlying model is the GBP (Table 7), the BP model is chosen by all criteria in 15-20% of cases for small sample sizes. On the other hand, when the underlying model is the BP (Table 10), the GBP model is never chosen. Although we think that this phenomenon could be related to the number of parameters of each model, to the presence of the other candidate models investigated by the criteria (Pareto, Beta, Gamma), or to the particular case considered in our simulations, we do not have a complete explanation for it.

MLEs for Right-Censored Data
Let us consider that the random right censorship C is non-informative. Roughly speaking, censoring is non-informative when the censoring distribution provides no information about the lifetime distribution. If the censoring mechanism is independent of the survival time, the censoring is non-informative; in practice, non-informative censoring almost always means independent censoring. It is well known that if the censoring variable C depends on the lifetime X, we run into the problem of non-identifiability: starting from the observations, we cannot make inference about X or C, because several different distributions for (X, C) could produce the same distribution for the observations. For these reasons, researchers almost always assume independence between censoring and lifetime. In most practical examples this makes sense; for instance, if the lifetime represents the remission time of patients and the censoring comes from loss to follow-up (patients moving to another town, for instance), it is natural to assume independence between censoring and lifetime.
In our case, suppose that the variables X and C have probability density functions f and g and survival functions S and G, respectively; all the information is contained in the couple (T_j, δ_j), where T_j = min(X_j, C_j) is the observed time and δ_j = 1_(X_j ≤ C_j) is the censorship indicator. The contribution to the likelihood of individual j is thus

L_j = [f(t_j | V) G(t_j)]^{δ_j} [g(t_j) S(t_j | V)]^{1−δ_j}.

Note that the term [f(t_j | V) G(t_j)]^{δ_j} corresponds to the case δ_j = 1, that is to say to the case when the lifetime is observed; in this situation, the contribution to the likelihood of the lifetime X is f(t_j | V), while the contribution to the likelihood of the censorship is G(t_j). The second term has an analogous interpretation in the case δ_j = 0, that is to say in the case when the censorship is observed. We also assume that there are no common parameters between the censoring and the lifetime; consequently, the parameter V does not appear in the distribution of the censorship. The useful part of the likelihood (for obtaining the MLEs of interest, that is, the MLEs of the BP distribution) is then reduced to

L = ∏_{j=1}^m f(t_j | V)^{δ_j} S(t_j | V)^{1−δ_j},

and the log-likelihood is

l(V) = Σ_{j=1}^m [δ_j ln f(t_j | V) + (1 − δ_j) ln S(t_j | V)],

where x ≥ θ and α, β, k, θ > 0. Consequently, the score equations for the censored case are obtained by differentiating this log-likelihood with respect to α, β and k, and they are solved by the same numerical methods as before.
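The reduced censored log-likelihood above can be sketched generically as follows. For simplicity this Python illustration plugs in an exponential lifetime and censoring model, purely as a stand-in (the paper's lifetime model is the BP distribution, whose survival function S is the complement of the regularized incomplete beta cdf); the function names are ours.

```python
import math
import random

def censored_loglik(times, deltas, logpdf, logsf):
    # l = sum_j [ delta_j * ln f(t_j) + (1 - delta_j) * ln S(t_j) ]
    return sum(d * logpdf(t) + (1 - d) * logsf(t)
               for t, d in zip(times, deltas))

def simulate_censored(n, draw_lifetime, draw_censoring):
    # observe T_j = min(X_j, C_j) and delta_j = 1 if X_j <= C_j, else 0
    obs = []
    for _ in range(n):
        x, c = draw_lifetime(), draw_censoring()
        obs.append((min(x, c), 1 if x <= c else 0))
    return obs

# illustration with an exponential lifetime of rate 0.5
rate = 0.5
logpdf = lambda t: math.log(rate) - rate * t   # ln f(t)
logsf = lambda t: -rate * t                    # ln S(t)
```

Replacing `logpdf` and `logsf` by the BP log-density and log-survival functions yields the objective maximized by the gradient and conjugate gradient methods in the censored setting.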

Numerical Results for BP Parameter Estimation with Censored Data
To show the efficiency of the MLEs obtained by the gradient and conjugate gradient methods when data are right censored, an extensive simulation study was carried out. The results of the MLEs and their SDs are summarized in Tables 11 and 12.

Table 11. Results of simulation with samples of censored data using the conjugate gradient method with m = 10,000, α = 13, β = 23, k = 18, θ = 0.15.

To evaluate and compare the performance of the estimators obtained by the proposed methods, we present in Tables 13 and 14 the bias and MSEs of the estimators.

Model Selection for Censored Data
In this section, the censored data are used with the AIC, BIC and AICc criteria to choose the best statistical model for our statistical population. These results are presented in Tables 15-18. As for the uncensored data, we can conclude that the information criteria choose the correct model even for small or medium values of the sample size n. Note that similar remarks as the ones done in Section 4 for model selection for complete (uncensored) data hold true also here.