Model Selection Criteria on Beta Regression for Machine Learning

: Beta regression models are a class of supervised learning tools for regression problems with univariate and limited response. Current ﬁtting procedures for beta regression require variable selection based on (potentially problematic) information criteria. We propose model selection criteria that take into account the leverage, residuals, and inﬂuence of the observations, both to systematic linear and nonlinear components. To that end, we propose a Predictive Residual Sum of Squares (PRESS)-like machine learning tool and a prediction coefﬁcient, namely P 2 statistic, as a computational procedure. Monte Carlo simulation results on the ﬁnite sample behavior of prediction-based model selection criteria P 2 are provided. We also evaluated two versions of the R 2 criterion. Finally, applications to real data are presented. The new criterion proved to be crucial to choose models taking into account the robustness of the maximum likelihood estimation procedure in the presence of inﬂuential cases.


Introduction
The class of nonlinear beta regression models was proposed by [1] and extended to situations in which the data include zeros and/or ones by [2,3].Shortly thereafter, [4] developed for the class of nonlinear beta regression model residuals and measures of local influence.Local influence proposed by [5] is a decisive scheme to select the model that fit well a dataset, takes into account that the estimation process is robust in influential cases.Indeed, the final conclusion about model selection should consider the analysis of influence.The model selection is a crucial step in data analysis, since all inference performance is based on the selected model.[6] evaluated the behavior of different model selection criteria in a beta regression model, such as the Akaike Information Criterion (AIC) [7], Schwarz Bayesian Criterion (SBC) [8] and various approaches based on pseudo-R 2 .
However, it is common for models selected by the usual selection criteria to present poorly fitted or influential observations.Indeed, the best models selected by the usual criteria do not always present residual plots that validate the goodness-of-fit.Also, current fitting procedures for beta regression are infeasible for high-dimensional data setups and require variable selection based on (potentially problematic) information criteria.Furthermore, the usual selection criteria do not offer any insight about the quality of the predictive values.In this context, [9] proposed the PRESS (Predictive Residual Sum of Squares) statistic, which can be used as a measure of the predictive power of a model.[10] proposed a coefficient of prediction based on PRESS, namely P 2 that is similar to the R 2 approach.The P 2 statistic can be used to select models from a predictive perspective [11].Moreover, the PRESS statistic presents a close relationship with the Cook distance [12] and local influence measures [5], as we shall present here.Hence, the P 2 statistic is a selection criterion that takes into account the impact of the observations poorly fitted by the model, observations with atypical residuals, and cases that exert a disproportional effect on the model estimation process, even affect the inference conclusions, influential cases.
Our main goal is to introduce a PRESS-like machine learning tool and its associated P 2 statistic, as a coefficient of prediction for the class of nonlinear beta regression models.The P 2 statistic is used as a selection criterion for beta regression.We carried out Monte Carlo simulations to evaluate the behavior of the P 2 measure, as well as the behavior of two usual R 2 -like criterion, namely: the R 2 LR [6] and R 2 FC [13].We considered a variety of simulation scenarios, as different ranges for the response mean, several sample sizes and values to the precision parameter, five link functions and different model misspecifications.Finally, we present and discuss two applications to real data.
The simulation and application data set results showed that small values of the new criterion are an indication that the robustness of the maximum likelihood estimation procedure of the model in the presence of influential points is worthy of further investigation.This information could not be accessed by usual selection criteria.However, the issue about variability is still better accessed by R 2 -like criteria.Thus, the best machine learning strategy is to use the three criteria discussed here to choose the best model, once each one holds on different information.

P 2 Criterion
Consider the linear model Y = Xβ + ε as the supervised learning procedure, where Y is a vector n × 1 of the responses, X is a known matrix of covariates (measured features) of dimension n × p of full rank, β is the parameter vector of dimension p × 1 and ε is a vector n × 1 of errors.We have the least-squares estimators: β = (X X) −1 X y, the residual r t = y t − x t β and the predicted value y t = x t β, where x t = (x t1 , . . ., x tp ), and t = 1, . . ., n.Notice that we found these quantities in one shot, without doing any sort of iterative optimization.Let β (t) be the least-squares estimate of β without the tth observation and y (t) = x t β (t) be the predicted value of the case deleted, such that r (t) = y t − y (t) is the prediction error.Thus, for multiple regression, the classic Predictive Residual Sum of Squares statistic named here as PRESS C is given by where r t is the ordinary residual obtained by regressing y on X and h tt is the tth diagonal element of the projection matrix X(X X) −1 X of this regression.Now, let y 1 , . . ., y n be independent random variables, such that each y t , for t = 1, . . ., n, is beta distributed denoted by y t ∼ B(µ t , φ t ), i.e., each y t has density function given by where 0 < µ t < 1 and φ t > 0. Here, E(y t ) = µ t and Var(y Ref. [1] proposed the class of nonlinear beta regression models in which the mean of y t and the precision parameter can be written as where β = (β 1 , . . ., β k ) and γ = (γ 1 , . . ., γ q ) are, respectively, k × 1 and q × 1 vectors of unknown parameters (β ∈ IR k ; γ ∈ IR q ), η 1t and η 2t are the nonlinear predictors, x t = (x t1 , . . ., x tk 1 ) and z t = (z t1 , . . ., z tq 1 ) are vectors of covariates (i.e., vectors of known variables), t = 1, . . ., n, k 1 ≤ k, q 1 ≤ q and k + q < n.Both g(•) and h(•) are strictly monotonic and twice differentiable link functions.Furthermore, f i (•), i = 1, 2, are differentiable and continuous functions, such that the matrices J 1 = ∂η 1 /∂β and J 2 = ∂η 2 /∂γ have full rank (their ranks are equal to k and q, respectively).The model's parameters can be estimated by maximum likelihood (ML).In the Appendix, we present the log-likelihood function, the score vector and Fisher's information matrix for the nonlinear beta regression model.Model ( 2)-( 3) embodies the beta regression linear model with varying dispersion when the predictors are linear functions of the parameters.In this case, g(µ t ) = η 1t = x t β and h(φ t ) = z t γ.If, in addition, the predictor for µ t is linear and φ t is constant through the observations, we arrive at the beta regression model defined [13].
The beta regression likelihood is inherently nonlinear and there are no closed form expressions for the ML estimators, and their computations should be performed numerically using a nonlinear optimization algorithm for machine learning, e.g., some form of iterative Newton's method (Newton-Raphson, Fisher's scoring, BHHH, etc.).To propose a PRESS-like statistic for beta regression, we shall explore the relationship between the Fisher iterative ML scheme and a weighted least square regression.This regression considers pseudo variables as proposed [14] to build a Cook-like distance that have been used in several classes of regression models.Fisher's scoring iterative scheme used for estimating β, both to linear and nonlinear regression model, can be written as where m = 0, 1, 2, . . .are the iterations which are carried out until convergence.The convergence happens when the difference |β (m+1) − β (m) | is less than a small, previously specified constant.From Appendix's expressions (A1), (A2) and from (4), it follows that the mth scoring iteration for β, in a class of nonlinear regression model is defined as where the tth elements of the vectors y * and µ * are . ., w n ), w t given in Appendix, expression (A3).Furthermore, Φ, W, J 1 , T and µ * are evaluated at β m .Ref.
[14] suggests that we rewrite the iterative process in (5) by defining the following vector ) such that the equation in (5) becomes Upon convergence, we may write the ML estimator of β as Here, W, Φ, T and J 1 are the matrices W, Φ, T and J 1 , respectively, evaluated at the ML estimators of β and γ.We note that β in (8) can be viewed as the least-squares estimator of β obtained by regressing the pseudo observation vector Since we had the expression of β in (8), several quantities related to the pseudo regression in (9) may be obtained, as the ordinary residual, the projection matrix, the β (t) and the prediction error.Following [14] we have that in which J 1t is the tth row of the J 1 matrix, h * tt and r β t , are respectively, the tth diagonal element of the projection matrix and the tth ordinary residual of the pseudo regression in (8) given by where u 1,t the tth element of the vector u 1 and v t are given in ( 8) and (A3-Appendix), respectively.We note that µ * = E(y * ), v t = Var(y * t ) and r β t in (11) is the standardized weighted residual 1 [15].In what follows we able to define y † (t) = φ 1/2 t w 1/2 t J 1t β (t) and the prediction error Plugging (10) quantity in (12) we then obtain that r † (t) is Finally, for the nonlinear beta regression models, the PRESS-like statistic becomes as It is noteworthy that based on (11) the expression in (13) may be written as in which It is important to emphasize that r β p,t in the PRESS expression given in (14) is the standardized weighted residual 2 proposed by [15] which outperforming the others beta regression residuals in their numerical evaluation.In special, outperforming the ordinary residual (y t − µ t )/ Var(y t ).The weighted residuals are constructed using the difference between the logit of the responses and their fitted means, the main qualities of beta distribution.The same holds to the proposal for the PRESS-like statistic to the class of beta regression.Indeed, the PRESS also has relationships with influence measures [16].Ref. [12] use a version of the likelihood displacement [17] to build the Cook's distance that measures the impact of a given observation on the parameter estimates of the mean sub-model by removing it from the data.Based on approximations for the likelihood displacement, the Cook-like distances have been proposed for several classes of regression models.Focus on beta regression models [18] obtain an approximation for the version of likelihood displacement to build the Cook's distance given by plugging (16) measure in (13) we thus obtain that in which LD t is the Cook-like distance to the class of beta regression.Furthermore, Ref. [19] shown that LD t ≈ C t in which C t is the total local influence of observation t, defined in (A4).Thus, PRESS ≈ ∑ n t=1 C t h * tt , which clarify the relationship of this statistic with the local influence measures.Ref. [19] also suggest that observations such that C t > 2 ∑ n t=1 C t /n can be taken to be individually influential.We shall take the same threshold pattern to highlight influential observations on the index plots of Cook-like distances.When the predictors in (3) are linear functions of the parameters, i.e., g(µ t ) = x t β and h(φ t ) = z t γ, the expression in (17) also represents the PRESS-like statistic for the class of the linear beta regression models with p = k + q unknown regression parameters.
Considering the same approach to build the determination coefficient R 2 in linear models, Ref.
[10] define a prediction measure based on the PRESS C statistic in (1) as P 2 C = 1 − PRESS C /SST (t) , where SST (t) = ∑ n t=1 (y (t) − y (t) ) 2 , with y (t) being the arithmetic mean for n − 1 values of y t .Based on this idea, for beta regression models we must use the pseudo regression procedure defined in (9) to build the P 2 -like coefficient as where SST † (t) = ∑ n t=1 (y † − y † (t) ) 2 , with y † (t) being the arithmetic average for n − 1 values of the vector y † = Φ 1/2 W 1/2 u 1 by excluding the tth observation.It can be shown that SST † (t) = (n/(n − p)) 2 SST † , wherein p is the number of model parameters and SST is the Total Sum of Squares for the full data.For the class of nonlinear beta regression models, , where y † is the arithmetic average of the y † t , t = 1, . . ., n and p = k + q.Please note that P 2 given in ( 18) is not a positive quantity.Indeed, the PRESS/SST † (t) is a positive quantity, thus the P 2 take values in (−∞; 1].The closer to one, the better is the predictive power of the model. To compare the behaviors of the P 2 defined in (18) and R 2 -like criteria we consider at the outset two versions of pseudo-R 2 based on the likelihood ratio.The first one was proposed by [20] as , where L null is the ML achievable (saturated model) and L f it is the likelihood achieved by the model under investigation.The second version is a proposal of [6] that takes into account the inclusion of covariates both in the mean and in the precision sub-models, is given by: R 2 , where α ∈ [0, 1] and δ > 0. Based on simulation presented by the authors we chose α = 0.4 and δ = 1.We also consider the R 2 FC , which is defined as the square of the sample coefficient of correlation between g(y) and η 1 [13], and its penalized version based on [6] given by R where k 1 and q 1 are, respectively, the number of covariates of the mean and dispersion sub-models.By analogy, we define the penalized version of P 2 given by P

Simulation Study
The Monte Carlo experiments present in this section were carried out using both fixed and varying dispersion beta regressions as data generating processes, as well as linear and nonlinear models.All simulations were carried out using the Ox matrix programming language [21].The number of Monte Carlo replications is 10,000.Our goal is simultaneously to assess the performance of the P 2 , R 2 FC and R 2 LR criteria, and, additionally, which values, on average, these statistics could assume under different data settings and features of the regression model.To that end, at the outset, we present the average values of the statistics as the arithmetic mean of the Monte Carlo replicas.Also, we provide information about the distributions of the statistics by a boxplot analysis.
Since the upper limits of all statistics are equal to one, a performance evaluation criterion for these measures is that their values go to one if the model is correctly specified and far from one otherwise.The mean values of the statistics are especially useful when the scenarios considered in the simulations occur in the real data analysis.

Linear Setting: Fixed Dispersion, Omitted Covariates and Link Functions
Table 1 shows the mean values of the statistics obtained by simulation of the constant dispersion beta regression model that involves a systematic component for the mean given by that is based on logit link function.The covariate values were independently obtained as random draws of the following distributions: X ti ∼ U(0, 1), i = 2, . . ., 5 and were kept fixed throughout the experiment.The precisions, the sample sizes and the range of mean response are, respectively, φ = (20, 50, 150, 400, 1000), n = (40, 80, 120, 400), µ ∈ (0.005, 0.12), µ ∈ (0.90, 0.99) and µ ∈ (0.20, 0.88).
Under the model specification given in (19) we investigate the behavior of the statistics by omitting covariates.In this case, we considered the Scenarios 1, 2, and 3, in which are omitted three, two, and one covariate, respectively.In a fourth scenario, the estimated model is correctly specified.The results in Table 1 show that the mean values of all statistics increase as important covariates are included in the model and the value of φ increases.On the other hand, as the size of the sample increases, the model misspecification is evidenced by lower values of the statistics (Scenarios 1, 2, and 3).It shall be noted that the mean values for all statistics are considerably larger when µ ∈ (0.20, 0.88).Additionally, their values approach one when the estimated model is closest to the true model.For instance, in Scenario 4 for n = 40, φ = 150 the values of P 2 and R 2 LR are, respectively, 0.936 and 0.947.The behavior of the statistics for finite sample size changes substantially when µ ∈ (0.90; 0.99).It is noteworthy the reduction of its mean values, in special to the P 2 criterion when µ ≈ 1 revealing the difficulty in fitting the model in this range of µ.Even under true specification (Scenario 4) the P 2 criterion identifies unmistakably some problem in the model-fitting when µ ≈ 1.For instance, when n = 80 and φ = 50, we have P 2 c = −0.007and R 2 LR c = 0.542.The same feature occurs when µ ∈ (0.005, 0.12).In what follows, we shall investigate the empirical distributions of the statistics: FC and R 2 FC c under the correctly specified modeling (scenario 4) in Table 1, for n = 40 and φ = 150.These results are shown using boxplots of 10,000 values of the statistics obtained from the Monte Carlo simulations (Figure 1).The mean value of the statistic replications is represented by a dot on the side of each boxplot.In panels (a), (b) and (c) we present the boxplots for µ ≈ 0, µ scattered on the standard unit interval and for µ ≈ 0, respectively.−0.5 0.0 0.5 Correct specification (a) 0.86 0.88 0.90 0.92 0.94 0.96 0.98 −0.5 0.0 0.5 Figure 1 shows that the means and medians of all statistics are close, thus the mean values of the statistics adequately represent their behavior in these scenarios.We also notice that both P 2 and R 2 criteria are so small, for models correctly specified when µ is close to the boundaries of the standard unit interval (Figure 1).However, it is noteworthy how the P 2 values are substantially smaller than the R 2 -like criterion values.
When the mean response is concentrated on the boundaries of the standard unit interval, even when the model is correctly specified, the statistic of prediction assumes negative values, panel (a) and panel (c).Based on panel (b) (µ ∈ (0.20, 0.88)), it can be seen that when the response mean response is scattered on the standard unit interval, the behavior of the prediction statistic is very different, with values much more concentrated nearby one.The same behavior occurs for the goodness-of-fit measures.R 2 FC and R 2 LR .In Figure 2 we consider a misspecification problem (three omitted covariates).For illustration, we consider only φ = 50 and n = 40, 80, 120, 400, µ ∈ (0.20, 0.88).We notice that when three covariates are omitted, with the increasing of sample size, the replication values of the statistics tend to concentrate at small values, as expected due to the misspecification problem.
We notice that typically the mean and the median of the 10,000 values of the statistic is closed, confirming the usefulness of the mean values to describe these measures.When n = 400 (panel d), the values of all statistics tend to concentrate around a value far from 1, i.e., around 0.3, and 0.4.It behaves noteworthy as the prediction and determination coefficients behave equally in this scenario (µ ∈ (0.20, 0.88)).
In Figure 3 we consider µ ∈ (0.01, 0.20) and the model is estimated correctly, n = 40 and φ = (50, 150, 400, 1000).We notice that the values of the R 2 -like statistics become more concentrated and closer to one as the value of φ increases.Nonetheless, the behavior of P 2 statistics is quite different.Even when φ = 400 this measure displays negative values (panel(c)).These observations that present P 2 negative values are cases, poorly fitted by the model and potential influential cases.It is noteworthy that cases poorly fitted by the model can befall in despite of that φ = 1000 (Figure 3d).The statistics present the same feature when µ ≈ 1.
Incorrect specification Incorrect specification n = 400  To summarize, at the outset, we shall consider the response mean around 0.5.When the model is correctly specified, the P 2 have their values close to one, especially when the model precision or sample size increases.When the proposed model omits important covariates, the P 2 values tend to depart considerably of one and stay below 0.5.The measure R 2 FC and R 2 LR present similar behavior.On the other hand, when the mean of the response is concentrated near zero or one, the P 2 values differ considerably from one, taking negative values even when the model is correctly specified, revealing as it is difficult make prediction close to the boundaries of the unit interval.
Indeed, scenarios in which the model present large dispersion and a substantial concentration of values on one of the boundaries of the standard unit interval tend to present influential observations.In these situations, Ref. [22] argue that for the beta regression models the ML parameter estimation based on the BFGS nonlinear optimization algorithm proved to be typically not robust in influential cases.The P 2 criterion is based on the PRESS-like statistic which presents a relationship both with residuals and influence measures.In this sense, this new criterion that we proposed for the beta regression models outperforms the R 2 LR and R 2 FC in identifying problems on fit the model when µ ≈ 0 or µ ≈ 1 and the precision is not so large.However, this fact does not disable the use of R 2 -like statistics.The P 2 criterion can be viewed as a measure of model bias whereas the R 2 is a quantifier of the model variance.What we emphasize is that we must also consider the P 2 criterion to select the model that best fit a dataset.In the applications we shall present results that show as the R 2 and P 2 criteria contain different and important information about the model-fitting.
Another important question is the link function to the mean sub-model.All simulations, we carried out until now were based on logit link function.In what follows, we present Monte Carlo simulation results in which we consider other link functions, namely: probit, complementary log-log, log-log, and Cauchy, respectively defined as and g(µ) = tan{π(µ − 0.5)}.It is important to emphasize that the same link function is used both to generate the response observations and to fit the model.Our goal is to evaluate the performance of each link function on different ranges of mean and dispersion response, in special we aim to identifying if the link function is related to the problems in fitting the beta regression model when the response is close of the boundaries of the standard unit interval.Thus, we must fit the model correctly.
The results presented in Table 2 showed that when the response mean is close to one, the use of complementary log-log function leads to models with better predictive power as well as better goodness-of-fit.On the other hand, if the mean is close to zero the best results are provided by the log-log link function.When the mean is scattered on the standard unit interval both the probit and logit functions perform well.The Cauchy model performance well only when µ ∈ (0.20, 0.80).Thereby, we can deduce that the link function is related with the small values the P 2 criteria when µ ≈ 0 and µ ≈ 1 displayed in Table 1, since all scenarios were fitted by the logit model.Thus, the appropriate link function can improve the robustness of the ML estimation procedure of the beta models in the presence of influential points.It is noteworthy that these conclusions are supported on the P 2 criterion.
In Table 3, we present the mean values for 10,000 statistic replications.In this table, we report the results only for n = 40.Next, we presented boxplots for the 10,000 statistic replications to other sample sizes.We are considering µ close to 0.5.We notice based on Table 3 that under model misspecification the statistics display smaller values in comparison with Scenario 8 (correct specification), in which as greater is λ greater are the values of the statistics, as expected.When the dispersion is postulated as fixed, as the intensity degree of nonconstant dispersion increases, the mean values of the statistics decreases, which correctly points out for the model misspecification.It is noteworthy that under correct model specification the values of three statistics are so different.In special the P 2 values are greater than the values of R 2 -like criteria.Furthermore, the values of the R 2 FC are considerably smaller than the values of the R 2 LR , in special when λ increases.For example, taking λ = 20, 50, 200, n = 40 we have R 2 LR = (0.796, 0.816, 0.840) and R 2 FC = (0.649, 0.627, 0.500) (Table 3-Scenario 8). Figure 4 supports this evidence.When λ and the sample size increase, for example n = 80 and n = 120, the values of P 2 criterion tend to concentrate close to one, whereas the values of R 2 LR and R 2 FC tend to concentrate below 0.8 and 0.6, respectively.We shall focus on n = 40, Figure 4e the true intensity degree of nonconstant dispersion is close to 200, with φ max ≈ 260 and φ min ≈ 2, whereas λ ≈ 400 with φ max = 730 and φ min ≈ 1.9, that is a substantial distortion of the true intensity degree of nonconstant dispersion.Indeed, it is a substantial distortion of the true variance of the response observations.Since the R 2 -like criteria, select the model that can better explain the variability of the response, it is plausible that these measures present lower values when the distortions between the true and estimated variances of the response variable are so large.Please note that the R 2 LR c takes several values smaller than 0.6 and the R 2 FC c even takes negative values whereas overall the values of P 2 c are greater than 0.6 (Figure 4e).Additionally, in this sense the R 2 FC criterion proved to be more rigorous than the R 2 LR criterion.This is a strong evidence that models with small R 2 FC and high R 2 LR values are worthy of further investigation.Indeed, the best fitted model should display high and close values of the three criteria and of their penalized versions.0.0 0.2 0.4 0.6 0.8 1.0 Correct specification (a) 0.0 0.2 0.4 0.6 0.8 1.0 Correct specification (b) 0.0 0.2 0.4 0.6 0.8 1.0 Correct specification (c) 0.0 0.2 0.4 0.6 0.8 1.0

Nonlinear Setting
In what follows, we shall present Monte Carlo experiments for the class of nonlinear beta regression models.The numerical results were obtained using the following beta regression model as data generating processes: β 5 , t = 1, . . ., n, x t2 ∼ U(1, 2), x t3 ∼ U(4.5, 34.5) and φ were kept fixed throughout the experiment.Here we use the starting value scheme for the estimation by ML proposed by [23].The precision and the sample size are respectively φ = (20, 50, 150, 400), n = (20, 40, 60).Additionally, β = (1.0,1.9, −2.0, 3.4, 7.2) that yields µ ∈ (0.36, 0.98).To evaluate the criterion performance on account of nonlinearity negligence, we consider the following model specification: log All results are based on 10,000 Monte Carlo replications and for each replication.
We evaluated the behavior of the statistics both under model misspecification and under model correct specification.The results displayed in Table 4 reveal that all statistics present values smaller when the model is missspecified.For example, fixing the precision value of φ = 400, for n = 20, we have values of P 2 , R 2 LR and R 2 FC equal to 0.576, 0.700, 0.637, respectively.For n = 40 and n = 60 the values of the statistics are 0.568, 0.698, 0.634 and 0.562, 0.698, 0.633, respectively.We simulated other nonlinear patterns to the sub-model mean predictor, and in some simulations the three criteria did not present smaller values of the feasible linear model than to the nonlinear model correctly specified.

Fluid Catalytic Cracking
The first application employs real data from the graduation work of [24], from Chemistry Department of the Colombia National University.It is based on the Fluid Catalytic Cracking (FFC) process, considered the heart of a gasoline refinery.[24] explains that the FCC process is used to convert hydrocarbons of high molecular weight into small molecules of high commercial value, through the contact of hydrocarbons with a catalyst.The zeolite USY is the major catalyst of the process.The FCC process also involves the vanadium element, steam, and temperature.However, the vanadium on the catalyst decreases gasoline production.
Is special, the vanadium affects the crystallinity of zeolite USY depending on steam concentration and of the temperature during the process.The aim here is modeling the percentage of crystallinity of zeolite USY (y), based on different concentrations of vanadium (x 2 ) and steam (x 3 ), and two values of the process temperature (x 4 ).Typically, the higher the vanadium and steam concentrations, the lower the percentage of crystallinity.[22] modeled these data.At the outset, the authors fitted several linear beta regression models and carried out the residual analysis which made clear the nonlinear trend.Thus, the authors modeling these data using a logit nonlinear beta model defined as t = 1 . . ., 28.We fitted the model in (20) considering five link functions, namely: logit, probit, log-log, complementary log-log (C-Log-Log) and Cauchy.We shall present only the logit and complementary log-log model inferences (Table 5).Similar results are obtained by use the probit, log-log, and Cauchy link functions.However, we report that the parameter γ 2 was significantly different from zero, at the usual nominal levels, only for the model with the Cauchy link function.On the other hand, to the models with C-Log-Log and logit link functions the parameter γ 2 , is far to be significantly different from zero, p-value equal to 0.404 and 0.200, respectively.23,27 del.0.000 0.020 0.000 0.025 0.000 0.000 0.612 0.000 0.061 0.000 0.210 0.000 0.000 0.012 Nevertheless, we computed the selection criteria for the five models and presented in Table 6.The results in this table evidence that the values of the statistics are overall low, except to complementary log-log model.Furthermore, the lower values of the P 2 statistic when compared with the values of R 2 -like criteria, is special to logit model, is an indication of same misspecification on the fitted models.Thus, from now on we shall focus on the complementary log-log and logit models.
In what follows, we shall perform residual and influence diagnostics for the fitted models based on (20) and using the logit and complementary log-log link functions, see Figures 5 and 6, respectively.The index plots of the Cook-like distances identify the observations {10, 12, 16, 24} as influential, for the two link function models.Furthermore, the case {20} is worthy of further investigation for logit model, Figure 5c,d.However, the most important information is provided by the normal probability plot for the logit model in Figure 5b.Here there are two points on the boundaries of envelope bands, cases 22 and 28.Typically, these are influential cases.6)).In Figure 6 we carried out the local influence analysis based on the perturbation simultaneous of the covariate vanadium, which is present both in the mean and dispersion predictor.Thus, we fit the models after the exclusions.We take advantage of the information in Table 5, where we present the relative changes (%) in parameter estimates and standard error estimates, as well as the p-values after the exclusions that most affected the model-fitting.From Table 5 we note that the estimation process of the complementary log-log model proved to be more robust to influential cases than the estimation process of the logit model.
The set {10, 20, 22, 28} impacts the estimates of β 2 and β 4 for both models.However, the complementary log-log model ensures the same inference conclusions whereas to the logit model these parameters become non-significantly different from zero at the 5% level, in special β 4 (p-value = 0.086).For the logit model, the set {13, 20, 23, 27} is still more influential.The exclusion of this set strongly affects the estimates of γ 2 , such that the dispersion becomes varying and β 4 and β 2 become non-significant at the 20% and 5% levels, respectively.This fact is a strong evidence that the parameter estimates of the full data logit model are biased.The P 2 selection criterion was able in identifying this bias pattern, what explains the low value of this criterion to the logit model, even reaching a negative value for their penalized version.Furthermore, this bias pattern is due to the non-robustness of the ML estimation process in the presence of influential cases.We note also that the cases 20 and 28 highlighted on the limits of envelope bands proved being influential cases revealing the importance of to evaluate this plot carefully.Thus, the nonlinear model with varying dispersion does not seem to be a good option to these data.On the other hand, the fit of a nonlinear model with fixed precision, based on complementary log-log link function presents satisfactory values of the all selection criteria (Table 6).
The residual plots in Figure 7a,b support this conclusion, since to the complementary log-log model all residuals are randomly scattered within the envelope bands whereas to the logit model there is the case 5 as potentially influential.However, the cases highlighted as worthy of further investigation by the total local influence did not change the inference conclusions, is despite to yield greater changes in the parameter estimates of the logit model than to the complementary log-log model.Figure 7c,d.For this data set, y ∈ (0.64.0.96) with a median close to 0.81, and the estimated values of φ are quite similar for all link function models, close to 65.In this scenario we verified by Monte Carlo simulation that the models based on complementary log-log functions provide the highest values of all selection criteria.Thus, the application only confirms the simulation results.

Simultaneity Factor
The second application relates to the distribution of natural gas for home usage in São Paulo, Brazil.The real data were obtained from the Instituto de Pesquisas Tecnológicas-IPT (https://www.ipt.br/) and In the Figure 8d,e we present the normal probability plots with simulated envelopes to varying dispersion models based on the log-log, the logit, and the Cauchy link functions, respectively.These plots reveal that the log-log model yields the best fit whereas the Cauchy model yields the worst fit.This is the same conclusion provides by the P 2 criterion, whereas by the R 2 LR criterion all link functions could provide a good fit, even the Cauchy link function.For instance, to the logit and Cauchy models the (R 2 LR , R 2 LR c ) are, respectively, (0.74, 0.71) and (0.67, 0.62).
The performance of the R 2 LR is proved to be poor when we look the Figure 8f, which clarify unmistakably lack of fit of the Cauchy regression.Even the selection of the logit model would not be appropriate since there are ranges of residuals not randomly distributed across the envelope bands (Figure 8e).We note that the P 2 reaches negative values and the R 2 FC is able in identifying some problem on the model variability, whether the Cauchy function is used (R 2 FC = 0.39).Whether a practitioner does not take into account the other statistics beyond of the R 2 LR criterion one could select both logit and Cauchy model to fit the data.This conclusion would be quite counter for what is proved by residual plots and inference results.Please note that the R 2 LR criterion presents a close relation with AIC-like criteria.Thus, we must be careful in using a criterion to choose a model even the usual and classical criteria.
Although we must emphasize that P 2 -like criteria must be used jointly with the R 2 -like criteria.We shall focus on the normal probability plot of the log-log fit (Figure 8d).It is possible to note two points out of the envelope bands just as a slight linear tendency on the residual distribution close of these two points.This pattern explains the discrepancy between the values of the R 2 -like criteria and the value of P 2 criterion, which are equal 0.7, and 0.9, respectively (Table 7).This pattern suggests some problem in the dispersion sub-model or in the distribution of probability postulated for the response.Thus, we decide fit other beta regression models, considering different link functions also for the dispersion sub-model, just as different functions for the computed power beyond of the logarithm function, as covariates.The best fit is still the one provides by the beta regression model defined in (21) considering the log-log link function.In the future we can consider other distributions to fit this data as the simplex distribution that is a dispersion model and can provide a better fit.
However, the beta regression model defined in (21) and based on log-log function is useful for modeling the data on simultaneity factor.This example clarifies how it is important to consider both prediction criteria and different versions of the R 2 criteria to select the best model to fit a dataset.

Conclusions and Future Work
In this paper, we develop the P 2 criterion based on the PRESS-like machine learning tool for the class of beta regression models.We presented results of Monte Carlo simulations carried out to evaluate the performance the P 2 criterion and of the versions R 2 LR and R 2 FC of the R 2 criterion, under correct and incorrect model specifications.We consider different scenarios, including omission of covariates, negligence of varying dispersion and misspecification of nonlinearity.Two applications using real data were performed.
Both the simulation results and applications yield important conclusions.When the mean response is scattered on the standard unit interval, the P 2 and R 2 coefficients perform similarly well, and both enable to identify usual model misspecification.On the other hand, it is noteworthy that when the response values are close to one or zero the P 2 criterion outperformed the R 2 -like criteria in identifying problems on the model-fitting.Generally, these behaviors are related to influential observations and appropriated link functions for each range of response on standard unit interval.We notice that the log-log function models yield the best fits when the response is closer to zero, whereas the complementary log-log models yield the best fits when the response values are closer to one.These last conclusions were only supported by the P 2 criterion, but proved by the residual and influence analyses and by inference results.Another important conclusion is the poor performance of the R 2 LR criterion for beta regression models when the response is close to one of the standard unit interval boundaries.The R 2 FC outperforms the R 2 LR in identifying problems on the model variability on these ranges of the response variable.This conclusion is supported by the normal probability plots with simulated envelopes used in the real application.
Our proposed criterion proved to be very successful, since it selects the same models selected by the residual analysis, by the influence diagnostics and inference results.Despite this fact the normal probability plots with simulated envelopes reveal that questions about the model variability or the response distribution must be accessed the R 2 -like criteria.
Therefore, to the class of beta regression models the best strategy to select the best model to fit a dataset is jointly used the P 2 and R 2 FC criteria.When the two criteria being simultaneously close to one, better shall be the fitted model.Further work will be devoted to the theoretical properties of the P 2 statistic, and revisited statistical analysis, including post-Hoc analysis [26][27][28] with the Tukey's honestly significant difference test, and their p-values adjusted via false discovery rate [29] to highlight the existence of significant differences between the proposed and classical algorithms.

Figure 5 .
Figure 5. Residual and Cook-like distance plots.Varying dispersion model.Data on FCC.(a) Envelope band of weight residual and link function C-Log-Log; (b) Envelope band of weight residual and link function Logit; (c) Cook-like distance and link function C-Log-Log ; (d) Cook-like distance and link function C-Log-Log.

Figure 7 .
Figure 7. Residual and total local influence plots.Constant dispersion.Data on FCC.(a) Envelope band of weight residual and link function C-Log-Log; (b) Envelope band of weight residual and link function Logit; (c) Cook-like distance and link function C-Log-Log ; (d) Cook-like distance and link function C-Log-Log.

Figure 8 .
Figure 8. Diagnostic plots.Data on simultaneity factor.(a) Boxplot of the response; (b) Cook-like distance and link function Log-Log with constant dispersion; (c) Cook-like distance and link function Log-Log with varying dispersion; (d) Envelope band of weight residual and link function Log-Log with varying dispersion; (e) Envelope band of weight residual and link function Logit with varying dispersion; (f) Cook-like distance and link function Cauchy with varying dispersion.

Table 5 .
Parameter estimates, standard errors (s.e.), relative changes in estimates and in standard errors due to cases exclusions and respective p-values.Varying dispersion model.

Table 8 .
Parameter estimates, standard errors (s.e.) and p-values.Data on simultaneity factor.