Abstract
This paper is concerned with the maximum likelihood estimators of the Beta-Pareto distribution introduced by Akinsete et al. (2008), which arises from mixing the Beta and Pareto distributions. Since these estimators cannot be obtained in closed form, we use nonlinear optimization methods to compute them numerically. The methods we investigate are the Newton-Raphson method, the gradient method and the conjugate gradient method; for the conjugate gradient method we use the Fletcher-Reeves variant. The corresponding algorithms are developed and the performance of the methods is confirmed by an extensive simulation study. In order to compare several competing models, namely the generalized Beta-Pareto, Beta, Pareto, Gamma and Beta-Pareto models, model selection criteria are used. We first consider completely observed data; then the observations are assumed to be right censored and we derive the same type of results.
1. Introduction
In this work we are interested in the four-parameter Beta-Pareto (BP) distribution, introduced recently by Akinsete et al. (2008) [1]. This distribution generalizes several models, such as the Pareto, log-beta, exponential and arcsine distributions, in the sense that these distributions can be recovered as special cases of the BP distribution, either by transforming the variable or by setting special values of the parameters (cf. Reference [1]).
It is well known that the Pareto family of distributions and its generalizations have been used extensively in various fields of application, such as income data [2], environmental studies [3] or socio-economic studies [4]. Among the generalizations of the Pareto distribution we can cite the Burr distribution [5], the power function distribution [6] and the logistic distribution. It is also important to stress that heavy-tailed phenomena can be successfully modelled by means of (generalized) Pareto distributions [7]. Note that the BP model is based on the Pareto distribution, which is known to be heavy tailed, as was shown in Reference [1] using both theoretical and applied arguments. This matters in practice because the BP distribution can describe skewed data better than other distributions proposed in the statistical literature. A classical example of this kind is given by the exceedances of the Wheaton River flood data, which are highly skewed to the right.
All these reasons show the great importance of statistically investigating any generalization of Pareto distributions. Thus the purpose of our work is to provide algorithmic methods for computing the maximum likelihood estimators (MLEs) of the parameters of the BP distribution, to carry out intensive simulation studies investigating the quality of the proposed estimators, to use model selection criteria to choose among candidate models, and to handle right-censored data in addition to completely observed data. Note that right-censored data often occur in reliability studies and survival analysis, where experiments are generally conducted over a fixed time period. At the end of the study, we typically cannot observe the failure times of all the products, or the death (or remission) times of all patients, due to loss to follow-up, competing risks (death from other causes) or a combination of these; for such individuals only the censoring times are observed. Consequently, for this type of data it is of crucial importance to develop a methodology that allows estimation in the presence of right censoring.
Generally, numerical methods are required when the MLEs of the unknown parameters of a model cannot be obtained explicitly. For the BP distribution, we propose the use of several nonlinear optimization methods: the Newton-Raphson method, the gradient method and the conjugate gradient method. For the Newton-Raphson method, the approximation of the Hessian matrix has been widely studied in the literature; in this work we consider the BFGS method (from Broyden-Fletcher-Goldfarb-Shanno), the DFP method (from Davidon-Fletcher-Powell) and the SR1 method (Symmetric Rank 1). For the conjugate gradient method we use the Fletcher-Reeves variant.
It is well known that the gradient and conjugate gradient methods behave better than the Newton-Raphson method in many settings. Nonetheless, most statistical studies use the Newton-Raphson method instead of the gradient or conjugate gradient methods, perhaps because it is much easier to put into practice. Our interest is to implement the gradient and conjugate gradient methods in our framework of BP estimation, and to present the Newton-Raphson method for comparison purposes only.
The structure of the article is as follows. In the next section we introduce the BP distribution and give some elements of maximum likelihood estimation for this model. Section 3 is devoted to the numerical optimization methods used for obtaining the MLEs of the parameters of interest. First, we briefly recall the numerical methods that we use (Newton-Raphson, gradient and conjugate gradient). Second, we present the corresponding algorithm for the conjugate gradient method, which is the most complex one. We end this section by investigating through simulations the accuracy of these three methods. In Section 4 we use model selection criteria (AIC, BIC, AICc) in order to choose between several competing models, namely the Generalized Beta-Pareto, Beta, Pareto, Gamma and BP models. In Section 5 we assume that we have right-censored observations at our disposal and we derive the same type of results as for complete observations.
2. BP Distribution and MLEs of the Parameters
A random variable X has a BP distribution with parameters α, β, k, θ > 0 if its probability density function is

f(x) = k/(θ B(α, β)) (1 − (x/θ)^(−k))^(α−1) (x/θ)^(−kβ−1), x ≥ θ,

with B(α, β) = Γ(α)Γ(β)/Γ(α + β), Γ being the Gamma function. Consequently, the support of X is [θ, ∞). The corresponding cumulative distribution function can be written as

F(x) = I_z(α, β), where z = 1 − (x/θ)^(−k)

and I_z denotes the regularized incomplete Beta function.
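To fix ideas, the density and distribution function above can be evaluated numerically. The sketch below is in Python (the simulations in this paper were run in R); the parametrisation (α, β, k, θ) follows the reconstruction above, and the function names `bp_pdf` and `bp_cdf` are illustrative, not the authors' code.

```python
import numpy as np
from scipy.special import beta as beta_fn, betainc

def bp_pdf(x, a, b, k, theta):
    """BP density: k/(theta*B(a,b)) * (1-(x/theta)^(-k))^(a-1) * (x/theta)^(-k*b-1)."""
    x = np.asarray(x, dtype=float)
    return (k / (theta * beta_fn(a, b))
            * (1.0 - (x / theta) ** (-k)) ** (a - 1.0)
            * (x / theta) ** (-k * b - 1.0))

def bp_cdf(x, a, b, k, theta):
    """BP cdf: regularized incomplete Beta function at z = 1-(x/theta)^(-k)."""
    z = 1.0 - (np.asarray(x, dtype=float) / theta) ** (-k)
    return betainc(a, b, z)
```

For instance, with α = 2, β = 3, k = 2, θ = 1 the density integrates to one over [θ, ∞), which gives a quick sanity check of the formulas.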
In Figure 1, Figure 2, Figure 3, Figure 4 and Figure 5, the pdf, survival, cdf, hazard and cumulative hazard functions are presented for several values of the parameters.
Figure 1.
Density of the BP distribution.
Figure 2.
Survival function of the BP distribution.
Figure 3.
Cumulative distribution function of the BP distribution.
Figure 4.
Hazard rate of the BP distribution.
Figure 5.
Cumulative hazard rate of the BP distribution.
Let us now consider an i.i.d. (independent and identically distributed) sample x_1, ..., x_n of a random variable X following a BP distribution with density f. The corresponding log-likelihood function can be written as

l(α, β, k, θ) = n ln k − n ln θ − n ln B(α, β) + (α − 1) Σ_{i=1}^n ln(1 − (x_i/θ)^(−k)) − (kβ + 1) Σ_{i=1}^n ln(x_i/θ). (1)

As usual, when we need to see the log-likelihood function as a function of the parameters only, we shall write l(V), with V = (α, β, k, θ), instead of l(x_1, ..., x_n; V). The score equations are thus given by

∂l/∂α = n ψ(α + β) − n ψ(α) + Σ_{i=1}^n ln(1 − (x_i/θ)^(−k)) = 0, (2)

∂l/∂β = n ψ(α + β) − n ψ(β) − k Σ_{i=1}^n ln(x_i/θ) = 0, (3)

∂l/∂k = n/k + (α − 1) Σ_{i=1}^n (x_i/θ)^(−k) ln(x_i/θ) / (1 − (x_i/θ)^(−k)) − β Σ_{i=1}^n ln(x_i/θ) = 0, (4)

where the digamma function ψ is defined by ψ(x) = Γ′(x)/Γ(x).
As l is increasing in θ on (0, x_(1)], the MLE of the parameter θ is the first order statistic x_(1), and the MLEs of α, β and k are obtained by solving the system (2)–(4).
There are no closed formulas for solving these equations, so numerical methods are required.
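For concreteness, the log-likelihood (1) and the score functions (2)–(4) can be transcribed directly into code. The snippet below is a Python sketch (the paper's own experiments used R) under the parametrisation assumed above; `bp_loglik` and `bp_score` are illustrative names.

```python
import numpy as np
from scipy.special import betaln, digamma

def bp_loglik(x, a, b, k, theta):
    """Log-likelihood (1) of an i.i.d. BP sample."""
    u = np.log(x / theta)             # ln(x_i / theta) >= 0
    w = np.log1p(-np.exp(-k * u))     # ln(1 - (x_i/theta)^(-k))
    n = x.size
    return (n * np.log(k) - n * np.log(theta) - n * betaln(a, b)
            + (a - 1.0) * w.sum() - (k * b + 1.0) * u.sum())

def bp_score(x, a, b, k, theta):
    """Score functions (2)-(4): partial derivatives w.r.t. alpha, beta, k."""
    n = x.size
    u = np.log(x / theta)
    z = np.exp(-k * u)                # (x_i/theta)^(-k)
    s_a = n * (digamma(a + b) - digamma(a)) + np.log1p(-z).sum()
    s_b = n * (digamma(a + b) - digamma(b)) - k * u.sum()
    s_k = n / k + (a - 1.0) * (z * u / (1.0 - z)).sum() - b * u.sum()
    return np.array([s_a, s_b, s_k])
```

A finite-difference check of the score against the log-likelihood is a convenient way to validate such a transcription before feeding it to an optimizer.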
3. Numerical Optimization for Solving the Score Equations
In this section we are interested in proposing numerical methods for solving the score equations of the MLEs of the BP parameters. We will consider three methods: the Newton-Raphson’s method, the gradient method and the conjugate gradient method. Firstly, we will recall the general lines of these optimization methods. Secondly, we will give a detailed description of the application of the most complex method, namely of the conjugate gradient one, for computing the MLEs of the BP distribution. We close the section by presenting numerical results of the implementation of these methods in our BP framework.
3.1. General Description of Optimization Methods
Let us first recall the main lines of the three numerical methods that we have considered. The objective of these methods is to minimize/maximize a function l(V), in our case with V = (α, β, k), the three parameters of the BP distribution to be estimated. As mentioned in the previous section, although the BP distribution has 4 parameters (α, β, k and θ), the MLE of the fourth parameter θ is immediately obtained as the first order statistic x_(1). For this reason, although the function l has 4 arguments, only 3 are in fact involved in the optimization.
3.1.1. Newton-Raphson’s Method
Newton’s method: Let us denote by V_t the current point at iteration t, and by ∇l(V_t) and ∇²l(V_t) the gradient and the Hessian of the function l at V_t. Newton’s method consists in taking the descent direction

d_t = −[∇²l(V_t)]^(−1) ∇l(V_t).
Quasi-Newton methods: In practice, the inverse of the Hessian, [∇²l(V_t)]^(−1), is often very difficult to evaluate when the function l is not analytic, whereas the gradient is usually more or less accessible (by inverse methods). As the Hessian cannot be computed exactly, we try to evaluate an approximation. Among the methods that approximate the Hessian, three are retained here: the BFGS method (for Broyden-Fletcher-Goldfarb-Shanno), the DFP method (for Davidon-Fletcher-Powell) and the SR1 method (for Symmetric Rank 1) [8].
We introduce the notation s_t = V_{t+1} − V_t and y_t = ∇l(V_{t+1}) − ∇l(V_t), and we choose B_0 a positive definite matrix; for convexity reasons the identity matrix is usually chosen.
Update of the Hessian by the BFGS method: In this approach, the approximation B_{t+1} of the Hessian is given by

B_{t+1} = B_t + (y_t y_tᵀ)/(y_tᵀ s_t) − (B_t s_t s_tᵀ B_t)/(s_tᵀ B_t s_t).
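As a minimal sketch, one BFGS update of the Hessian approximation B_t, with s_t and y_t as defined above, can be written in Python (the function name is illustrative):

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B, given the
    step s = V_{t+1} - V_t and gradient change y = g_{t+1} - g_t."""
    Bs = B @ s
    return (B
            + np.outer(y, y) / (y @ s)      # rank-one correction from y
            - np.outer(Bs, Bs) / (s @ Bs))  # removes old curvature along s
```

By construction the updated matrix satisfies the secant condition B_{t+1} s_t = y_t and stays symmetric, which is what makes it a usable Hessian surrogate.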
3.1.2. Gradient Method
This algorithm is based on the fact that, in the vicinity of a point V, the function l decreases most strongly in the direction opposite to the gradient ∇l(V).
Let us fix an arbitrary precision ε > 0. The algorithm can be described as follows:
- Step 0 (Initialization): choose a starting point V_0 that satisfies the positivity conditions on the parameters of the BP distribution; set t = 0 and go to Step 1.
- Step 1: Compute ∇l(V_t); if ‖∇l(V_t)‖ ≤ ε, then STOP. If not, go to Step 2.
- Step 2: Compute the step λ_t as the solution of the one-dimensional optimization of l(V_t − λ∇l(V_t)) in λ > 0. Set V_{t+1} = V_t − λ_t ∇l(V_t), t = t + 1, and go to Step 1.
The algorithm that uses this direction of descent is called the gradient algorithm or the steepest descent algorithm. The number of iterations it requires depends on the regularity of the function l. In practice, we often observe that −∇l(V_t) is a good direction of descent, but the convergence to the solution is generally slow. Despite its poor numerical performance, this algorithm is worth studying.
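The loop structure of the gradient algorithm can be sketched on a toy quadratic rather than the BP likelihood (Python sketch; the step size and names are illustrative):

```python
import numpy as np

def steepest_descent(grad, v0, step=0.04, tol=1e-8, max_iter=10000):
    """Move against the gradient with a fixed step until the gradient is small."""
    v = np.asarray(v0, dtype=float)
    for _ in range(max_iter):
        g = grad(v)
        if np.linalg.norm(g) <= tol:   # Step 1 stopping rule
            break
        v = v - step * g               # Step 2 update
    return v

# Example: minimise f(x, y) = (x-1)^2 + 10*(y+2)^2.
grad_f = lambda v: np.array([2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)])
```

With a fixed step below the inverse curvature the iterates contract geometrically, but, as noted above, convergence is slow when the curvature differs strongly across directions.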
3.1.3. Conjugate Gradient Method
The conjugate gradient method is one of the most famous and widely used methods for solving minimization problems, especially large-scale ones. It was proposed in 1952 by Hestenes and Stiefel for the minimization of strictly convex quadratic functions [9].
Many mathematicians have extended this method to the nonlinear (non-quadratic) case. This was done for the first time in 1964 by Fletcher and Reeves [10], then in 1969 by Polak and Ribière [11]; another variant was studied in 1987 by Fletcher [12]. The strategy adopted is to use a recursive sequence

V_{t+1} = V_t + λ_t d_t,

where λ_t is a properly chosen positive real constant called the “step” and d_t is a non-zero real vector called the “direction”. As the conjugate gradient algorithm is applied to nonlinear functions, we should note that nonlinear conjugate gradient methods are not unique.
Algorithm for the conjugate gradient method:
The algorithm is initialized by Step 0 of the simple gradient method; we write g_t = ∇l(V_t).
0. As long as a convergence criterion is not verified, repeat steps 1–5.
1. Determination of a step λ_t by some line search method; computation of the new iterate V_{t+1} = V_t + λ_t d_t.
2. Evaluation of the new gradient g_{t+1} = ∇l(V_{t+1}).
3. Computation of the real β_t by some method, for example Fletcher and Reeves (see below).
4. Construction of the new descent direction d_{t+1} = −g_{t+1} + β_t d_t.
5. Increment t and return to step 1.
Several methods exist for computing the term β_t; in the sequel we will be concerned with the methods of Fletcher-Reeves, of Polak-Ribière and of Hestenes-Stiefel. It should be noted that this last method is particularly effective in the case when the function is quadratic and when the line search is carried out exactly. The corresponding β_t for these methods is given by:

Fletcher-Reeves: β_t = ‖g_{t+1}‖² / ‖g_t‖², (5)

Polak-Ribière: β_t = g_{t+1}ᵀ (g_{t+1} − g_t) / ‖g_t‖²,

Hestenes-Stiefel: β_t = g_{t+1}ᵀ (g_{t+1} − g_t) / d_tᵀ (g_{t+1} − g_t).
Noting that d_0 = −g_0, we have the recurrence

V_{t+1} = V_t + λ_t d_t,

where the descent step λ_t is obtained with an exact or inexact line search. The descent vector d_t is obtained by using the conjugate gradient recurrence formula

d_{t+1} = −g_{t+1} + β_t d_t,

where β_t is given by one of the formulas above.
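The recurrences above can be sketched on a strictly convex quadratic, where the exact line-search step has a closed form. The Python sketch below uses the Fletcher-Reeves coefficient; the quadratic test problem is an illustration, not the BP likelihood.

```python
import numpy as np

def cg_fletcher_reeves(A, b, v0, tol=1e-10):
    """Fletcher-Reeves CG on the quadratic f(v) = 0.5 v^T A v - b^T v,
    with the exact line-search step lam = -(g.d)/(d.A.d)."""
    v = np.asarray(v0, dtype=float)
    g = A @ v - b                  # gradient of the quadratic
    d = -g                         # initial direction d_0 = -g_0
    for _ in range(len(b)):        # at most N steps for a quadratic
        if np.linalg.norm(g) <= tol:
            break
        lam = -(g @ d) / (d @ (A @ d))
        v = v + lam * d
        g_new = A @ v - b
        beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves coefficient
        d = -g_new + beta * d
        g = g_new
    return v
```

On a quadratic with exact line search the method terminates after at most N iterations, in line with Hestenes and Stiefel's original setting.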
3.2. Optimization Methods for Computing the MLEs of the BP Distribution
At this stage, we are interested in describing the application of the three optimization methods for computing the MLEs of the BP distribution. In the sequel we present in detail only the most complex one, namely the conjugate gradient method. Following the same lines, the other two methods can also be adapted for computing the MLEs of the BP distribution. Note also that we will give in the next section numerical results for all three methods.
So, our purpose is to numerically determine the MLEs of the BP parameters by calculating the maximum of the log-likelihood function given in (1).
As we have previously mentioned, the MLE of the parameter θ is the first order statistic x_(1); so, in the sequel, the iterates keep θ fixed at x_(1).
To compute the MLEs of the parameters, we proceed as follows:
1. Step 0: Initialization. Choose a starting point V_0 = (α_0, β_0, k_0) satisfying the positivity constraints, a precision ε > 0, and set t = 0.
2. Step 1: Computation of the gradient g_t = ∇l(V_t).
If ‖g_t‖ ≤ ε, then stop: V_t is the optimal vector.
Else, go to the next step.
3. Step 2: Computation of the direction of descent.
If t = 0, take d_0 = −g_0.
If t ≥ 1, using the method of Fletcher-Reeves (5), we will have d_t = −g_t + β_{t−1} d_{t−1}, with β_{t−1} = ‖g_t‖²/‖g_{t−1}‖².
4. Step 3: Computation of V_{t+1} = V_t + λ_t d_t,
such that the step λ_t is determined as follows.
Determination of the step λ_t:
We can find λ_t with exact and inexact line search methods. In our case, the use of the exact line search helps us to have a fast convergence. The exact line search method solves the one-dimensional optimization of φ(λ) = l(V_t + λ d_t).
For this, we look for the value of λ which cancels the first derivative of the function φ.
Thus we can deduce the exact value of λ_t.
If λ_t is positive, we accept it and go to the next step. If not, we use inexact line search methods to obtain an approximation of the optimal value and go to the next step. The main inexact line search methods are those of Armijo [13], of Goldstein [14], of Wolfe [15] and of strong Wolfe [16].
Construction of the vector V_{t+1} = V_t + λ_t d_t,
where λ_t was previously obtained.
Set t = t + 1 and go to Step 1.
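The inexact line searches mentioned in Step 3 can be illustrated with Armijo backtracking, the simplest of the four; the Python sketch below is for minimisation along a descent direction d (names are illustrative):

```python
import numpy as np

def armijo_step(f, grad_f, v, d, lam0=1.0, c=1e-4, rho=0.5, max_halvings=50):
    """Shrink the step until the sufficient-decrease (Armijo) condition holds."""
    fv, slope = f(v), grad_f(v) @ d   # slope must be negative for a descent d
    lam = lam0
    for _ in range(max_halvings):
        if f(v + lam * d) <= fv + c * lam * slope:
            return lam
        lam *= rho
    return lam
```

The Goldstein and (strong) Wolfe conditions add lower bounds or curvature tests on top of this sufficient-decrease check.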
Based on Zoutendijk’s theorem [17] and on the globally convergent Riemannian conjugate gradient method [18], the convergence of the algorithm is ensured.
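As an end-to-end sketch of the whole procedure, one can simulate a BP sample by inversion of the cdf, fix the estimate of θ at (just below) the first order statistic, and maximise the log-likelihood with a conjugate gradient routine. The Python code below uses SciPy's `minimize(..., method="CG")` as a stand-in for the algorithm above (the paper's own implementation was in R); the log-parametrisation and all names are illustrative assumptions.

```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta as beta_rv
from scipy.optimize import minimize

rng = np.random.default_rng(42)

def bp_sample(n, a, b, k, theta):
    """Inversion: F(x) = I_z(a, b) with z = 1-(x/theta)^(-k),
    so x = theta * (1 - Q)^(-1/k) with Q a Beta(a, b) draw."""
    q = beta_rv.rvs(a, b, size=n, random_state=rng)
    return theta * (1.0 - q) ** (-1.0 / k)

def neg_loglik(p, x, theta):
    a, b, k = np.exp(p)            # log-parametrisation keeps a, b, k > 0
    u = np.log(x / theta)
    n = x.size
    return -(n * np.log(k) - n * np.log(theta) - n * betaln(a, b)
             + (a - 1.0) * np.log1p(-np.exp(-k * u)).sum()
             - (k * b + 1.0) * u.sum())

x = bp_sample(3000, 2.0, 3.0, 2.0, 1.0)
theta_hat = x.min() * (1.0 - 1e-9)   # MLE of theta: just below the sample minimum
res = minimize(neg_loglik, np.log([2.0, 3.0, 2.0]), args=(x, theta_hat),
               method="CG")
a_hat, b_hat, k_hat = np.exp(res.x)
```

Shifting the estimate of θ slightly below the sample minimum keeps the term ln(1 − (x_(1)/θ)^(−k)) finite; starting the search at the true parameters, the optimizer can only improve the log-likelihood on the simulated sample.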
3.3. Numerical Results
We carried out an extensive simulation study using the statistical software R. In the sequel we present the results obtained by means of the three optimization methods.
In Table 1, Table 2 and Table 3, the MLEs of the parameters α, β and k of the BP distribution are presented, together with their standard deviations (SDs), for the three optimization methods. Note the convergence of the estimators obtained with the gradient and conjugate gradient methods, and note also that Newton’s method does not converge for this non-quadratic function.
Table 1.
Results of simulation with conjugate gradient method for 10,000 iterations, with
Table 2.
Results of simulation with gradient method for 10,000 iterations, with .
Table 3.
Results of simulation with Newton’s method for 10,000 iterations, with .
Table 4, Table 5 and Table 6 present the MLEs of the parameters for the conjugate gradient and gradient methods, as well as the bias and the mean square errors (MSEs) of the estimators. Note that the MSEs of the estimators take very small values.
Table 4.
MLEs of α with the conjugate gradient method and gradient method.
Table 5.
MLEs of β with the conjugate gradient method and gradient method.
Table 6.
MLEs of k with the conjugate gradient method and gradient method.
The interest of the conjugate gradient method comes from the fact that it converges quickly towards the minimum; one can show that in N dimensions it needs at most N computation steps if the function is exactly quadratic. The drawback of Newton’s method is that it requires knowledge of the Hessian of the function in order to determine the descent step. We have seen that the conjugate gradient algorithm chooses the descent directions optimally through the coefficients β_t.
In Figure 6, Figure 7 and Figure 8 we also present the evolution of the MSEs of the computed estimators. The three numerical methods are carried out for 10,000 iterations.
Figure 6.
The MSE of α computed with Newton’s, gradient and conjugate gradient methods.
Figure 7.
The MSE of β computed with Newton’s, gradient and conjugate gradient methods.
Figure 8.
The MSE of k computed with Newton’s, gradient and conjugate gradient methods.
As we see in these figures, the MLEs of the parameters α, β and k obtained by Newton’s, gradient and conjugate gradient methods are √n-consistent, and the conjugate gradient method gives the best results. In this way, we have numerically checked the well-known properties of MLEs, namely consistency and asymptotic normality, by verifying that the mean square errors decrease at the expected √n rate.
4. Model Selection
We also want to take model selection criteria into account in order to choose among several candidate models: the BP, Beta, Pareto, Gamma and Generalized Beta-Pareto distributions. The Generalized BP (GBP) distribution is a 5-parameter distribution introduced in Reference [19], with a different form given in Reference [20]. The density of this distribution is defined by
with the parameters
For the model selection problem, we considered the general information criterion (GIC)

GIC = −2 ln L + I · K,

where L is the likelihood, K is the number of parameters of the model, and I is an index for the penalty of the model. The following well-known criteria are obtained for different values of I:

AIC = −2 ln L + 2K, BIC = −2 ln L + K ln n,

and

AICc = AIC + 2K(K + 1)/(n − K − 1),

where n is the sample size.
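These criteria are straightforward to code; the following Python helpers follow the GIC form above (function names are illustrative):

```python
import numpy as np

# GIC = -2 ln L + I * K, with I = 2 for AIC and I = ln n for BIC;
# AICc adds the small-sample correction term.
def aic(loglik, K):
    return -2.0 * loglik + 2.0 * K

def bic(loglik, K, n):
    return -2.0 * loglik + np.log(n) * K

def aicc(loglik, K, n):
    return aic(loglik, K) + 2.0 * K * (K + 1.0) / (n - K - 1.0)
```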
We have simulated data according to the BP, Pareto (P), Gamma (G) and GBP distributions. We have computed the three criteria AIC, BIC and AICc and recorded in Table 7, Table 8, Table 9 and Table 10 the number of times, out of 1000 iterations, that each model is selected (i.e., attains the minimum value of the corresponding criterion).
Table 7.
Smaller AIC/BIC/AICc scores in 1000 simulations from a GBP distribution with .
Table 8.
Smaller AIC/BIC/AICc scores in 1000 simulations from a Pareto distribution with .
Table 9.
Smaller AIC/BIC/AICc scores in 1000 simulations from a Gamma distribution with .
Table 10.
Smaller AIC/BIC/AICc scores in 1000 simulations from a BP distribution with .
Several remarks need to be made. First, since the support of a BP distribution is [θ, ∞), it makes sense to compare data from the BP distribution with data from the Gamma, Pareto and GBP distributions. For example, when θ is close to 0, the support of the BP distribution is close to the support (0, ∞) of the Gamma distribution. The BP distribution reduces to the Pareto distribution in the case where α = β = 1, and the support is the same.
Second, note that we have added the Beta (B) distribution as one of the candidate distributions to be chosen by the information criteria. Although this is not a “real” candidate (since its support is (0, 1)), we wanted to check the model selection criteria in the presence of an “unusual” model. Clearly, it does not make sense to compare data from the BP, GBP, Gamma or Pareto distributions, on the one hand, with data from the Beta distribution, on the other hand. Nonetheless, it is important to have an idea of how the model selection criteria behave in this case.
We can notice that all criteria choose the correct model in all 4 situations, for moderate or even small values of the sample size.
We can also remark that, when the underlying model is the GBP, the criteria fail to choose the correct model for small values of the sample size n (less than 50); nonetheless, starting from reasonable values of n (around 50 or 100), the correct model is chosen by all criteria in most cases. This phenomenon could be explained by the fact that the GBP model has the largest number of parameters, 5, which influences the information criteria.
We have also noticed a strange phenomenon: again when the underlying model is the GBP, for small sample sizes the model preferred by the criteria is the Beta model instead of the GBP, that is to say the “unusual” model discussed above. This effect is more accentuated for the BIC criterion, so it is surely related to the penalization of the number of parameters.
Another remark concerns the difference between the BP and GBP models. Comparing Table 7 and Table 10, we notice an asymmetry in the model selection criteria when the underlying true model is the BP, respectively the GBP model. On the one hand, when the underlying model is the GBP (Table 7), the BP model is chosen by all criteria in 15–20% of cases for small sample sizes. On the other hand, when the underlying model is the BP (Table 10), the GBP model is never chosen. Although we think that this phenomenon could be related to the number of parameters of each model, to the presence of the other candidate models (Pareto, Beta, Gamma), or to the particular cases considered in our simulations, we do not have a complete explanation for it.
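The counting experiment behind Tables 7–10 can be sketched as follows. To keep the snippet short it uses two stand-in candidates (Gamma vs. exponential, fitted with SciPy) instead of the paper's five models; the structure (simulate, fit each candidate by maximum likelihood, keep the smallest AIC) is the same.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def select_model(x):
    """Fit each candidate by ML (location pinned at 0) and return the AIC winner."""
    scores = {}
    for name, dist, n_free in [("gamma", stats.gamma, 2), ("expon", stats.expon, 1)]:
        params = dist.fit(x, floc=0)
        loglik = dist.logpdf(x, *params).sum()
        scores[name] = -2.0 * loglik + 2.0 * n_free   # AIC with n_free parameters
    return min(scores, key=scores.get), scores

# One replication: data truly come from a Gamma distribution.
x = stats.gamma.rvs(3.0, scale=2.0, size=300, random_state=rng)
winner, scores = select_model(x)
```

Repeating this over 1000 simulated samples and tallying the winners reproduces the format of the tables.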
5. MLEs for Right-Censored Data
Let us consider that the random right censorship C is non-informative; roughly speaking, this means that the censoring distribution provides no information about the lifetime distribution. If the censoring mechanism is independent of the survival time, then the censoring is non-informative; in practice, non-informative censoring almost always means independent censoring. It is well known that if the censoring variable C depends on the lifetime X, we run into a non-identifiability problem: starting from the observations, we cannot make inference about X or C, because several different joint distributions for (X, C) could yield the same distribution for the observations. For these reasons, researchers almost always assume independence between censoring and lifetime. In most practical examples this makes sense; for instance, if the lifetime represents the remission time of patients and the censoring comes from loss to follow-up (patients moving to another town, for instance), it is natural to assume independence between censoring and lifetime.
In our case, suppose that the variables X and C have probability density functions f and g and survival functions S and G, respectively; all the information is contained in the couple (T_j, δ_j), where T_j = min(X_j, C_j) is the observed time and δ_j = 1_{X_j ≤ C_j} is the censorship indicator. So, the contribution to the likelihood of the individual j is

[f(t_j) G(t_j)]^{δ_j} [g(t_j) S(t_j)]^{1−δ_j}.

Note that the term f(t_j) G(t_j) corresponds to the case δ_j = 1, that is to say to the case when the lifetime is observed; in this situation, the contribution to the likelihood of the lifetime X is f(t_j), while the contribution of the censorship is G(t_j). The second term has an analogous interpretation in the case δ_j = 0, that is to say when the censoring time is observed.
We also assume that there are no common parameters between the censoring and the lifetime distributions; consequently, the parameter vector V does not appear in the distribution of the censorship. The useful part of the likelihood (for obtaining the MLEs of interest, that is, the MLEs of the BP distribution) is then reduced to

L(V) = Π_{j=1}^n f(t_j; V)^{δ_j} S(t_j; V)^{1−δ_j},

and the log-likelihood is

l(V) = Σ_{j=1}^n [δ_j ln f(t_j; V) + (1 − δ_j) ln S(t_j; V)],

where f and S are the density and the survival function of the BP distribution. Consequently we get the explicit log-likelihood of the right-censored sample.
Setting the corresponding auxiliary quantities, as in the complete-data case, we obtain the score functions.
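Under the reconstruction above, the censored log-likelihood is a δ-weighted mix of log-density and log-survival terms. A Python sketch follows (parametrisation and names assumed as before):

```python
import numpy as np
from scipy.special import betaln, betainc

def bp_logpdf(t, a, b, k, theta):
    """Log-density of the BP distribution."""
    u = np.log(t / theta)
    return (np.log(k) - np.log(theta) - betaln(a, b)
            + (a - 1.0) * np.log1p(-np.exp(-k * u)) - (k * b + 1.0) * u)

def bp_logsurv(t, a, b, k, theta):
    """Log-survival function ln S(t) = ln(1 - F(t))."""
    z = 1.0 - (t / theta) ** (-k)
    return np.log1p(-betainc(a, b, z))

def censored_loglik(t, delta, a, b, k, theta):
    """Sum of delta_j * ln f(t_j) + (1 - delta_j) * ln S(t_j)."""
    return (delta * bp_logpdf(t, a, b, k, theta)
            + (1 - delta) * bp_logsurv(t, a, b, k, theta)).sum()
```

When δ_j = 1 for all j the expression collapses to the complete-data log-likelihood of Section 2.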
5.1. Conjugate Gradient Method for Parameter MLEs of the BP Distribution with Right-Censored Data
As we have already mentioned, the MLE of the parameter θ is the first order statistic. Let us fix an arbitrary precision ε > 0. The algorithm based on the conjugate gradient method that we propose is as follows.
1. Step 0: Initialization. Choose a starting point V_0 = (α_0, β_0, k_0) satisfying the positivity constraints and set t = 0.
2. Step 1: Computation of the gradient g_t = ∇l(V_t).
If ‖g_t‖ ≤ ε, then stop: V_t is the optimal vector. If not, go to the next step.
3. Step 2: Computation of the direction of descent.
If t = 0, then d_0 = −g_0.
If t ≥ 1, then using the Fletcher-Reeves method we get d_t = −g_t + β_{t−1} d_{t−1}, with β_{t−1} = ‖g_t‖²/‖g_{t−1}‖².
4. Step 3: Computation of V_{t+1} = V_t + λ_t d_t,
such that the step λ_t is determined as follows.
Computation of a step λ_t: We can find λ_t with exact and inexact line search methods. In this case, the use of the exact line search helps us to have a fast convergence. The exact line search method solves the one-dimensional optimization of φ(λ) = l(V_t + λ d_t).
To solve this problem, we must find the value of λ such that φ′(λ) = 0.
Consequently, we obtain the exact value of λ_t, set V_{t+1} = V_t + λ_t d_t, t = t + 1, and go to Step 1.
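The censored-data estimation can be sketched end to end in the same way as in the complete-data case: simulate BP lifetimes, censor them with independent times, and maximise the censored log-likelihood with a conjugate gradient routine. Again, SciPy's CG implementation stands in for the algorithm above, and the censoring mechanism (θ plus an exponential variable) is purely illustrative.

```python
import numpy as np
from scipy.special import betaln, betainc
from scipy.stats import beta as beta_rv
from scipy.optimize import minimize

rng = np.random.default_rng(1)

a0, b0, k0, th0 = 2.0, 3.0, 2.0, 1.0
q = beta_rv.rvs(a0, b0, size=2000, random_state=rng)
x = th0 * (1.0 - q) ** (-1.0 / k0)             # BP lifetimes by inversion
c = th0 + rng.exponential(2.0, size=2000)      # independent censoring times
t = np.minimum(x, c)
delta = (x <= c).astype(float)                 # 1 = lifetime observed

def neg_loglik(p, t, delta, theta):
    a, b, k = np.exp(p)                        # log-parametrisation
    u = np.log(t / theta)
    z = np.exp(-k * u)                         # (t/theta)^(-k)
    logf = (np.log(k) - np.log(theta) - betaln(a, b)
            + (a - 1.0) * np.log1p(-z) - (k * b + 1.0) * u)
    logS = np.log1p(-betainc(a, b, 1.0 - z))
    return -(delta * logf + (1.0 - delta) * logS).sum()

theta_hat = t.min() * (1.0 - 1e-9)             # just below the smallest time
res = minimize(neg_loglik, np.log([a0, b0, k0]), args=(t, delta, theta_hat),
               method="CG")
```

Starting the search at the true parameter values, the optimizer can only improve the censored log-likelihood on the simulated sample.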
5.2. Numerical Results for BP Parameter Estimation with Censored Data
To show the efficiency of the MLEs obtained by the gradient and conjugate gradient methods when data are right censored, an extensive simulation study was carried out. The MLEs and their SDs are summarized in Table 11 and Table 12.
Table 11.
Results of simulation with samples of censored data using the conjugate gradient method with .
Table 12.
Results of simulation with samples of censored data using the gradient method with 10,000, .
To evaluate and compare the performance of the estimators obtained by the proposed methods, we present in Table 13 and Table 14 the bias and MSEs of the estimators.
Table 13.
Gradient method.
Table 14.
Conjugate gradient method.
5.3. Model Selection for Censored Data
In this section, censored data are used with the AIC, BIC and AICc criteria to choose the best statistical model for the population under study. These results are presented in Table 15, Table 16, Table 17 and Table 18. As for uncensored data, we can conclude that the information criteria choose the correct model even for small or medium sample sizes. Note that remarks similar to those made in Section 4 for complete (uncensored) data also hold here.
Table 15.
Smaller AIC/BIC/AICc scores in 1000 simulations from a censored Pareto distribution with .
Table 16.
Smaller AIC/BIC/AICc scores in 1000 simulations from a censored Gamma distribution with .
Table 17.
Smaller AIC/BIC/AICc scores in 1000 simulations from a censored BP distribution with .
Table 18.
Smaller AIC/BIC/AICc scores in 1000 simulations from a censored GBP distribution with .
6. Conclusions
In this paper, we have developed different optimization methods to determine the maximum likelihood estimators of the four parameters of the Beta-Pareto distribution. The results obtained show that all the methods used give √n-consistent estimators for both complete and right-censored data samples, and that the conjugate gradient method in particular gives the best results. Using classical model selection criteria, our study shows that this model can be used instead of several alternative models such as the Beta, Pareto, Gamma and Generalized Beta-Pareto distributions. Another important contribution of our work is the use of right-censored data as input for the estimation procedure and the model selection techniques.
Author Contributions
All authors have equally contributed to different stages of the article. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Akinsete, A.; Famoye, F.; Lee, C. The Beta-Pareto distribution. Statistics 2008, 42, 547–563.
- Asmussen, S. Applied Probability and Queues; Springer: New York, NY, USA, 2003.
- Zhang, J. Likelihood moment estimation for the generalized Pareto distribution. Aust. N. Z. J. Stat. 2007, 49, 69–77.
- Levy, M.; Levy, H. Investment talent and the Pareto wealth distribution: Theoretical and experimental analysis. Rev. Econ. Stat. 2003, 85, 709–725.
- Burroughs, S.M.; Tebbens, S.F. Upper-truncated power law distributions. Fractals 2001, 9, 209–222.
- Newman, M.E. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 2005, 46, 323–351.
- Choulakian, V.; Stephens, M.A. Goodness-of-fit tests for the generalized Pareto distribution. Technometrics 2001, 43, 478–484.
- Conn, A.R.; Gould, N.I.M.; Toint, P.L. Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 1991, 50, 177–195.
- Hestenes, M.R.; Stiefel, E. Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 1952, 49, 409–436.
- Fletcher, R.; Reeves, C.M. Function minimization by conjugate gradients. Comput. J. 1964, 7, 149–154.
- Polak, E.; Ribière, G. Note sur la convergence de méthodes de directions conjuguées. ESAIM Math. Model. Numer. Anal. 1969, 3, 35–43. (In French)
- Fletcher, R. Practical Methods of Optimization, 2nd ed.; Wiley: Hoboken, NJ, USA, 2000.
- Armijo, L. Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 1966, 16, 1–3.
- Goldstein, A.A. On steepest descent. J. Soc. Ind. Appl. Math. Ser. A Control 1965, 3, 147–151.
- Wolfe, P. Convergence conditions for ascent methods. SIAM Rev. 1969, 11, 226–235.
- Fletcher, R.; Powell, M.J.D. A rapidly convergent descent method for minimization. Comput. J. 1963, 6, 163–168.
- Sato, H.; Iwai, T. A new, globally convergent Riemannian conjugate gradient method. Optimization 2013, 64, 1011–1031.
- Dai, Y.H.; Yuan, Y. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 1999, 10, 177–182.
- Nassar, M.M.; Nada, N.K. The beta generalized Pareto distribution. J. Stat. Adv. Theory Appl. 2011, 6, 1–17.
- Mahmoudi, E. The beta generalized Pareto distribution with application to lifetime data. Math. Comput. Simul. 2011, 81, 2414–2430.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).