A Bayesian Measure of Model Accuracy

Ensuring that the proposed probabilistic model accurately represents the problem is a critical step in statistical modeling, as choosing a poorly fitting model can have significant repercussions on the decision-making process. The primary objective of statistical modeling often revolves around predicting new observations, highlighting the importance of assessing the model's accuracy. However, current methods for evaluating predictive ability typically involve model comparison, which may not guarantee a good model selection. This work presents an accuracy measure designed for evaluating a model's predictive capability. This measure, which is straightforward and easy to understand, includes a decision criterion for model rejection. The development of this proposal adopts a Bayesian perspective of inference, elucidating the underlying concepts and outlining the necessary procedures for application. To illustrate its utility, the proposed methodology was applied to real data, facilitating an assessment of its practicality in real-world scenarios.


Introduction
For effective decision making, a thorough understanding of the problem is crucial. However, this understanding often requires dealing with a significant amount of data, which, due to their volume, present complex relationships. In such scenarios, recognizing crucial data relationships may not be straightforward, necessitating the application of analytical methodologies. Statistical modeling stands as a valuable asset in this context, streamlining complex events through the lens of hypothetical probabilistic models. These models find validation through rigorous empirical observation, enhancing their reliability and utility. Therefore, it is essential to verify that the chosen model adequately represents the problem of interest. Failure to specify the model correctly can compromise the quality of information obtained, leading to inaccuracies and, ultimately, erroneous conclusions. Various methods exist to evaluate the quality of a model, but most involve subjective classification criteria or complex elaboration, deterring their use in practical applications. Hence, this work introduces a proposal for a Bayesian methodology to evaluate the quality of a statistical model based on its predictive ability, that is, the model's effectiveness in predicting values for new instances of the problem at hand. The advantage of this proposal lies in its simplicity. By focusing on the model's predictive capacity, it does not rely solely on its fit to existing data. This approach streamlines its application and promotes its use in decision-making scenarios.
The proposal outlined in this work is a modification of an external validation approach proposed by [1], which lacks an objective criterion for assessing the model's quality. Additionally, this methodology shares a similar logic with the Log Pseudo Marginal Likelihood (LPML) [2], but with distinct objectives. Whereas the LPML compares models, the aim of this work's proposal is to determine whether a model can accurately predict a new observation. The behavior of the accuracy measure was examined through simulated applications in generalized linear models with exponential distribution.
The goals of this work were to introduce a proposal for a Bayesian methodology for assessing model adequacy based on its predictive ability, to investigate the performance of this methodology in generalized linear models with exponential distribution, and to devise a straightforward criterion for assessing the quality of a model. The proposed classification criterion was derived from simulated data and was demonstrated using a real dataset from the literature. All simulations and analyses were conducted using the open-source software R [3].

Assessment of the Quality of Statistical Models
The assessment of model quality is an area with extensive literature within the frequentist paradigm, with numerous techniques available for objective evaluation. For example, D'Agostino's book [4] provides an overview of the most important classical goodness-of-fit tests. Conversely, within the Bayesian framework, the literature is relatively recent, and the existing methods are often restrictive or complex, resulting in this fundamental aspect of statistical analysis occasionally being neglected.
In a Bayesian context, evaluating the quality of a model does not rely on the adequacy of the likelihood function used, unlike the classical approach. Instead, it depends on the suitability of the posterior distribution, as any relevant inference for the problem stems from it. Additionally, some authors propose that the quality of a Bayesian model should be judged based on its predictive distribution. If the data do not align with the predictive distribution, it is anticipated that the model is not appropriate [5].
Furthermore, additional methods for evaluating the quality of a model in a Bayesian framework are discussed in [6]. Some of the commonly used techniques include Dirichlet Processes [7], the posterior predictive check [1], the Log Pseudo Marginal Likelihood [2], leave-one-out (LOO) cross-validation, and the Widely Applicable Information Criterion (WAIC) [8].
Dirichlet Processes [7] are utilized in estimating a non-parametric Bayesian model, which is subsequently compared to the proposed model using the Bayes Factor [9] to assess whether their difference is significant. This method serves as a model-fitting technique that employs the difference between the values estimated by the proposed model and those by the non-parametric model as a quality criterion. Nonetheless, it has the drawback of requiring a complex development process.
The posterior predictive check [1] assesses whether a statistic T from the model is consistent with the empirically observed data. This method entails a dual use of the data, as they are utilized both in the model estimation process and for comparison with the test statistic. The subjectivity in defining the statistic T is a notable critique of this approach, as it must be adapted to each specific problem.
The leave-one-out (LOO) cross-validation and the Widely Applicable Information Criterion (WAIC) [8] are methods that estimate pointwise out-of-sample prediction accuracy. According to Vehtari et al. [10], these methods have seen less use in practice because they involve additional computational steps. To mitigate this problem, they presented an optimized computation method for LOO using Pareto-smoothed importance sampling.
The Log Pseudo Marginal Likelihood (LPML) method [2] involves assessment using the Conditional Predictive Ordinate, which represents the predictive density of an observation under the model estimated without it. This approach selects models based on their predictive capacity, computing a statistic that indicates the optimal model to use. However, the obtained statistic does not enable us to determine whether the utilized model is a good fit; rather, it only indicates whether it is superior to the others with which it was compared.
Gelman et al. [1] proposed another method based on external validation in which a predictive interval of probability 0.5 is computed for observations not utilized in the modeling process. This involves assessing the number of observations falling within these intervals, as it should closely align with the defined 50% credibility ([1], p. 142). Despite its intuitive nature, this approach is not widely adopted due to its subjective rejection criterion, which can vary based on the user's perspective on quality assessment. In this context, this work aims to adapt the external validation methodology proposed by [1], as it provides an intuitive approach to assessing the accuracy of a model. To achieve this goal, adjustments will be made to certain steps of the method, facilitating the establishment of an objective criterion for evaluating the model's quality.

Proposal for Analysis of Predictive Capacity
This study proposes an adaptation of the external validation approach suggested by [1] to evaluate a model's quality through the predictive capacity of its posterior distribution. The use of the posterior distribution ensures the suitability of the final model, as methods that solely assess the likelihood function may not ensure the appropriateness of the prior distribution used, potentially compromising the final model's outcomes. The proposed method employs the leave-one-out (LOO) technique to assess the model's ability to accurately predict new observations. The procedure consists of calculating the proportion of correctly predicted values and using it as a quality statistic: it checks whether the observed proportion is feasible given the chosen credible level and rejects the model when that proportion is unlikely. This idea is consistent with the emphasis on model prediction analysis that is common to both schools of thought of 20th-century statistics (see [11] for more details) and can be efficiently implemented in a great variety of statistical models.

The Accuracy Measure
Let C_i be a credible interval for the predicted value considering a fitted model without observation i from the sample, where i = 1, 2, . . ., n. If the value y_i falls within the predicted interval, it is classified as a correct prediction (u_i = 1); otherwise, it is classified as an error (u_i = 0), i.e.,

u_i = 1 if y_i ∈ C_i, and u_i = 0 otherwise. (1)

Thus, the proportion of correct predictions is given by

κ = (1/n) ∑_{i=1}^{n} u_i. (2)

The LOO technique prevents the double use of data, unlike the Posterior Predictive Check. Moreover, employing interval estimators simplifies specifying an expected proportion of accurate predictions for the model, since for a γ × 100% credible interval this proportion should be close to γ, 0 < γ < 1. Consequently, a κ value far from γ suggests the model lacks good predictive capacity and is not suitable for representing the problem. Therefore, the proposed accuracy measure is determined by the difference between the proportion of accurate predictions and the credible level of the interval:

∆ = κ − γ. (3)

The value of ∆ ranges from −γ to 1 − γ, and ∆ = 0 indicates good model accuracy. It is important to note that a proportion of correct predictions significantly higher than the credibility used (∆ > 0) is not beneficial, as it indicates imprecision in the predictive interval. Conversely, the more negative the value of ∆, the stronger the indication that the model has low predictive capacity. The proposed method shares a similar rationale with the Log Pseudo Marginal Likelihood (LPML), but it serves different objectives. Whereas the LPML is used for model comparison, the aim here is to determine whether a model can effectively predict a new observation.
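In code, Equations (1)–(3) reduce to a few lines. A minimal Python sketch (the paper's own computations are carried out in R; the function name is ours):

```python
def accuracy_measure(u, gamma):
    """Proportion of correct predictions and the accuracy measure.

    u     -- list of indicators u_i from Equation (1): 1 if y_i fell in C_i
    gamma -- credible level of the intervals
    """
    kappa = sum(u) / len(u)   # Equation (2): proportion of correct predictions
    delta = kappa - gamma     # Equation (3): accuracy measure
    return kappa, delta

# 47 correct predictions out of n = 100 with gamma = 0.5:
kappa, delta = accuracy_measure([1] * 47 + [0] * 53, 0.5)  # kappa = 0.47, delta = -0.03
```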

Decision Criterion
We can construct a hypothesis test for the methodology, providing an objective approach to determine whether there is evidence that the model used lacks good predictive capability for the given problem. Consider the following hypotheses:

H: The model has good predictive capability. H_a: The model does not have good predictive capability. (4)

Hypothesis (4) can be tested using Bayesian hypothesis testing (see, for example, [12,13]) to determine whether the proportion of correct predictions, κ, is equal to the credibility γ (i.e., ∆ = 0). Thus, hypothesis (4) can be reformulated as:

H: κ = γ versus H_a: κ ≠ γ. (5)

Assuming a prior distribution κ ∼ Beta(a₁, a₂) and u_i|κ ∼ Bernoulli(κ), we obtain the posterior distribution of the proportion of correct predictions, given the observations, as κ|u ∼ Beta(A₁, A₂), where A₁ = a₁ + ∑_i u_i and A₂ = a₂ + n − ∑_i u_i. Here, u_i (i = 1, 2, . . ., n) is given by Equation (1). Moreover, hypothesis (5) can be tested using the evidence value (e-value) of a Full Bayesian Significance Test (FBST) [12]. The e-value for testing hypothesis (5) can be obtained through Monte Carlo simulation following the steps of Algorithm 1.
Note: In Steps 2 and 3, B(A₁, A₂) = ∫₀¹ z^(A₁−1) (1 − z)^(A₂−1) dz is the beta function.
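The Monte Carlo computation of Algorithm 1 can be sketched as follows (shown in Python rather than R; the function name is ours, the posterior parameters A₁ and A₂ follow the definitions above, and densities are compared through the Beta kernel, since the normalizing constant cancels):

```python
import math
import random

def fbst_evalue(u, gamma=0.5, a1=1.0, a2=1.0, m=200_000, seed=1):
    """FBST e-value for H: kappa = gamma under the posterior Beta(A1, A2).

    The tangential set T collects the points with posterior density higher
    than the density at gamma; the e-value is the posterior mass outside T,
    estimated here by Monte Carlo.
    """
    rng = random.Random(seed)
    A1 = a1 + sum(u)               # posterior parameters given the u_i
    A2 = a2 + len(u) - sum(u)

    def log_kernel(x):             # log Beta(A1, A2) density up to a constant
        return (A1 - 1) * math.log(x) + (A2 - 1) * math.log(1 - x)

    ref = log_kernel(gamma)        # density (kernel) at the null point
    hits = sum(log_kernel(rng.betavariate(A1, A2)) <= ref for _ in range(m))
    return hits / m

# Exponential example from the text: n = 100, 47 correct predictions,
# Beta(1, 1) prior; the reported e-value is 0.545.
ev = fbst_evalue([1] * 47 + [0] * 53)
```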
In this work, we opted for the level γ = 0.5 since it results in symmetry between the lower and upper deviations. Note that this symmetry does not hold for γ ≠ 0.5. For γ = 0.95, the situation where the proportion of correct predictions is less than the credible level (∆ < 0) is less concerning than when the proportion of correct predictions is greater than the credible level (∆ > 0).
According to the FBST, hypothesis H is rejected, meaning the proportion of correct predictions differs from 0.5 (or the model does not exhibit good predictive capability), if e-value < α. Here, α is the "critical value", fixed or obtained from elicited loss functions [14].
Alternatively, according to the methodology outlined in this work, we reject the null hypothesis H if |∆_obs| > ∆_critical, where ∆_critical depends on the critical value α and the sample size n.
To establish the critical points for the rejection criterion, samples ranging from n = 10 to 500 were generated. To determine the critical points for other values of n, a least squares regression was performed for the errors ξ = |∆|, using the square root of the sample size as the explanatory variable. Notice that the adopted value γ = 0.5 results in symmetry in the error ξ. This regression was adjusted to allow interpolation and extrapolation for n > 40. The regression model adopted was ξ = β₁/√n. The estimated parameters of the regression curves for α = 0.01, 0.05, 0.1, and 0.2 were, respectively, β₁ = 1.261, β₁ = 0.966, β₁ = 0.812, and β₁ = 0.633. The values of ∆_critical for α = 0.01, 0.05, 0.1, and 0.2 were obtained from the FBST procedure considering a Beta(1, 1) prior distribution and M = 1,000,000 Monte Carlo replicates. Figure 1 displays the curve fits for the errors as a function of sample size for different values of α. These graphs demonstrate satisfactory fits, indicating that the regression equations aptly represent the errors. Table 1 presents the values of ∆_critical for n = 10 to 40 as well as its approximation for n > 40.
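For n > 40, the approximation ∆_critical = β₁/√n is immediate to compute; a short sketch with the coefficients estimated above:

```python
import math

# Estimated coefficients beta_1 of the fitted curves xi = beta_1 / sqrt(n),
# as reported in the text for each significance level alpha.
BETA1 = {0.01: 1.261, 0.05: 0.966, 0.10: 0.812, 0.20: 0.633}

def delta_critical(n, alpha=0.05):
    """Approximate critical value for |Delta| when n > 40 (gamma = 0.5)."""
    return BETA1[alpha] / math.sqrt(n)

# Example used later in the text: n = 100, alpha = 0.05 -> 0.966 / 10
crit = delta_critical(100, 0.05)
```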

Exponential Distribution
Consider, for example, the exponential distribution, widely used in fields such as health and reliability. This distribution was chosen for its single parameter, which simplifies the comprehension of the proposed methodology. Let X₁, X₂, . . ., Xₙ be a sample from X, which follows an exponential distribution with mean 1/θ, i.e., X|θ ∼ Exponential(θ). Assuming a priori θ ∼ Gamma(a, b), a, b > 0, we obtain the posterior distribution θ|X ∼ Gamma(a + n, b + ∑_i x_i). Thus, writing A = a + n and B = b + ∑_i x_i, the predictive density function of a new observation Y|X is given by:

f(y|X) = A B^A / (B + y)^(A+1), y > 0. (6)

Therefore, the quantile q of Y|X is

y_q = B[(1 − q)^(−1/A) − 1],

resulting in the following equal-tailed γ × 100% credible interval for the predicted value y, given the sample X₁, X₂, . . ., Xₙ:

[ y_(1−γ)/2 , y_(1+γ)/2 ] = [ B(((1 + γ)/2)^(−1/A) − 1), B(((1 − γ)/2)^(−1/A) − 1) ].

The percentage of correct predictions and the proposed accuracy measure can be obtained through the steps outlined in Algorithm 2.
1. Set i = 1;
2. Create sample S_i by removing observation i from the complete dataset;
3. From S_i, obtain the credible interval C_i for a new observation;
4. Check whether observation i, removed from the sample, lies within the predicted interval:
   (a) if observation i lies within the credible interval, set u_i = 1;
   (b) if observation i does not lie within the credible interval, set u_i = 0;
5. If i < n, set i = i + 1 and return to Step 2;
6. Calculate the proportion of correct predictions, κ, using Equation (2);
7. Calculate the accuracy measure, ∆, using Equation (3).
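For the exponential model above, Algorithm 2 can be implemented directly, since the predictive quantiles are available in closed form. A Python sketch (function names are ours; the diffuse prior a = b = 100⁻¹ mirrors the example in the text):

```python
import math
import random

def lomax_quantile(p, A, B):
    """Quantile of the predictive distribution of a new observation when the
    posterior is Gamma(A, B): F(y) = 1 - (B / (B + y))**A (a Lomax law)."""
    return B * ((1 - p) ** (-1 / A) - 1)

def loo_accuracy(x, a=0.01, b=0.01, gamma=0.5):
    """Algorithm 2 for the exponential model with a Gamma(a, b) prior:
    leave each x_i out, build the equal-tailed gamma credible interval for
    a new observation, and check whether x_i falls inside it."""
    n, s = len(x), sum(x)
    u = []
    for xi in x:
        A, B = a + n - 1, b + s - xi                 # posterior without x_i
        lo = lomax_quantile((1 - gamma) / 2, A, B)   # lower interval limit
        hi = lomax_quantile((1 + gamma) / 2, A, B)   # upper interval limit
        u.append(1 if lo <= xi <= hi else 0)         # Equation (1)
    kappa = sum(u) / n                               # Equation (2)
    return kappa, kappa - gamma                      # Equation (3)

# Well-specified case: exponential data scored by the exponential model.
random.seed(7)
x = [random.expovariate(1.5) for _ in range(100)]
kappa, delta = loo_accuracy(x)
```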
In situations where obtaining the predictive distribution is not feasible, it can be numerically estimated using MCMC (Markov Chain Monte Carlo) [15]. To obtain a numerical approximation of the credible interval mentioned in Step 3, the following procedure can be used:

i. Draw a sample θ^[1], θ^[2], . . ., θ^[J] from the posterior distribution (e.g., via MCMC);
ii. For each θ^[j], draw a value y_i^[j] from the sampling distribution of a new observation, so that y_i^[1], y_i^[2], . . ., y_i^[J] is a sample from the predictive distribution;
iii. The limits of the equal-tailed credible interval for a new observation y_i are given by the quantiles (1 − γ)/2 and (1 + γ)/2 of y_i^[1], . . ., y_i^[J]. Alternatively, the HPD (highest posterior density) interval can be obtained from y_i^[1], y_i^[2], . . ., y_i^[J] by using the emp.hpd command from the TeachingDemos package in R [3].
A drawback of the proposed method is its high computational cost, as it requires estimating a model for each observation in the sample, which can be inefficient for large datasets. In such situations, ref. [10] presented an optimized computation method for LOO using Pareto-smoothed importance sampling. This method effectively manages the importance weights and is conveniently implemented in the loo package within the R programming environment [3].
Figure 2 depicts data generated from a sample of size n = 100 from an exponential distribution and its predictive intervals (HPD and equal-tailed) with 50% credibility (γ = 0.5) calculated from Equation (6). The analysis was performed considering a diffuse prior Gamma(a = 100⁻¹, b = 100⁻¹) for θ. In an asymmetric distribution, as is the case of Equation (6), the HPD and equal-tailed intervals will present distinct regions despite having the same credibility (Figure 2). The fact that the HPD interval does not contain the mean is due to the credibility used and the asymmetry of the predictive distribution, as the HPD interval depends on the mode rather than the mean of the distribution. In this example, the exponential distribution exhibited a good predictive fit, with the proportions of correct predictions being 47% and 53% for the 50% equal-tailed and HPD credible intervals, respectively. Note that both types of intervals resulted in |∆_obs| = 0.030 < ∆_critical = 0.966/√n = 0.097 (Table 1; n = 100; α = 0.05), which leads to non-rejection of the hypothesis that the exponential model has good capacity to predict future data. For the observed accuracy rate κ = 0.47 (and κ = 0.53), the FBST yielded an e-value of 0.545, also leading to non-rejection of the hypothesis for α = 0.05.

Poisson Distribution
Let X₁, X₂, . . ., Xₙ be a sample from X, which follows a Poisson distribution with mean θ, i.e., X|θ ∼ Poisson(θ). Assuming a priori θ ∼ Gamma(a, b), a, b > 0, we obtain the posterior distribution θ|X ∼ Gamma(a + ∑_i x_i, b + n). Thus, the predictive distribution of a new observation Y|X is given by a Gamma-Poisson distribution with parameters A = a + ∑_i x_i and B = b + n. Therefore, the lower and upper limits of the equal-tailed γ × 100% credible interval for the predicted value y, given the sample X₁, X₂, . . ., Xₙ, can be obtained, respectively, by L₁ = sup{y : F(y|X) ≤ (1 − γ)/2} and L₂ = inf{y : F(y|X) ≥ (1 + γ)/2}, where F(y|X) = ∑_{k=0}^{y} f(k|X) is the cumulative predictive distribution.
As an example, consider that Y ∼ Gamma-Poisson(502, 30). In this case, the limits of the equal-tailed 50% credible interval are given by L₁ = 13 and L₂ = 19. Figure 3 presents the cumulative distribution function of Y. It is important to emphasize that since the predictive distribution is discrete, the interval may not have credibility exactly equal to γ; in fact, the real credibility of the interval will be greater than or equal to γ. Therefore, the test proposed in this paper will be approximate in cases where the predictive distribution is discrete. An alternative in these cases is to consider in hypothesis test (5) the average credibility of the n intervals obtained in the LOO steps.
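These numbers can be checked by accumulating the predictive probability mass function; a Python sketch using only the standard library (function names are ours; the pmf is evaluated on the log scale via lgamma):

```python
import math

def gamma_poisson_logpmf(y, A, B):
    """log f(y) of the Gamma-Poisson predictive with A = a + sum(x), B = b + n."""
    return (math.lgamma(A + y) - math.lgamma(A) - math.lgamma(y + 1)
            + A * math.log(B / (B + 1)) - y * math.log(B + 1))

def equal_tailed_interval(A, B, gamma=0.5, y_max=100_000):
    """L1 = sup{y: F(y) <= (1-gamma)/2}, L2 = inf{y: F(y) >= (1+gamma)/2},
    and the real credibility F(L2) - F(L1 - 1) of the interval [L1, L2]."""
    lo_p, hi_p = (1 - gamma) / 2, (1 + gamma) / 2
    F, total, y = [], 0.0, 0
    while total < hi_p and y <= y_max:    # accumulate the CDF until F(y) >= hi_p
        total += math.exp(gamma_poisson_logpmf(y, A, B))
        F.append(total)
        y += 1
    L2 = len(F) - 1                       # first y with F(y) >= (1 + gamma)/2
    L1 = max((k for k in range(len(F)) if F[k] <= lo_p), default=0)
    cred = F[L2] - (F[L1 - 1] if L1 > 0 else 0.0)
    return L1, L2, cred

# Example from the text: Y ~ Gamma-Poisson(502, 30) with gamma = 0.5.
L1, L2, cred = equal_tailed_interval(502, 30)   # L1 = 13, L2 = 19, cred ~ 0.602
```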
Figure 4 depicts data generated from a sample of size n = 30 from a Negative Binomial distribution and its predictive equal-tailed intervals with 50% credibility estimated by a Poisson model. The analysis was performed considering a diffuse prior Gamma(a = 100⁻¹, b = 100⁻¹) for θ. As expected, the Poisson model exhibited a poor predictive fit, with the proportion of correct predictions being 30% for the 50% equal-tailed credible intervals. This observed proportion of correct predictions resulted in |∆_obs| = 0.2 > ∆_critical = 0.183 (Table 1; n = 30; α = 0.05), which leads to rejection of the hypothesis that the Poisson model has good capacity to predict future data. For the observed accuracy rate κ = 0.3, the FBST yielded an e-value of 0.023, also leading to rejection of the Poisson model for α = 0.05. In addition, the FBST of hypothesis (5) considering κ* = 0.632 (the average credibility of the n = 30 intervals obtained in the LOO steps) yielded an e-value < 0.001, also leading to rejection of the Poisson model for α = 0.05.

Simulation Study
In this section, we present a simulation study to verify whether factors such as the nature of the covariates (numeric or categorical), the number of model parameters, or the sample size used for estimation could influence the behavior of the proportion of correct predictions, κ, making it crucial to investigate their effects when determining the critical value. To assess which of these factors truly impact the value of κ, simulation studies were conducted using exponential regression models. To examine the effects of possible interactions among the factors, samples of size n were simulated, considering four scenarios of parameters with one to five predictors each, resulting in twenty distinct scenarios. For each of these scenarios, 1000 samples were generated, totaling 20,000 samples. The methodology was then applied to each of these generated samples. The simulated values of n ranged from 10 to 40 and then 50, 60, . . ., 150. Given the ease of obtaining and interpreting results, this study uses only the equal-tailed interval to define the κ value; therefore, simulations were conducted exclusively with equal-tailed intervals. The flowchart depicted in Figure 5 illustrates the structure used in the simulation, along with the scenarios of parameters utilized. The scenarios were chosen to maximize the diversity of parameter values and covariates used. In Figures 6 and 7, we can observe the mean and standard deviation of the κ values for each of the M = 20,000 Monte Carlo replicates across the four considered scenarios of factors. It is noticeable that in small sample sizes, significant disparities were observed among models with varying numbers of covariates. Simulations with higher numbers of covariates exhibited higher means and deviations compared to the others. This outcome is expected due to model saturation with small samples, attempting to estimate numerous parameters with limited observations, resulting in lower
predictive capacity of the adjusted model. However, as the sample size increases, differences based on the number of covariates diminish, and all models converge to the same value in terms of both mean and standard deviation. It is evident that when the number of covariates is small relative to the sample size (roughly below 20% of n), the value of κ is not affected by this factor. The various scenarios of parameters used did not affect the value of κ, as it showed well-distributed values across the simulations. This suggests that the types of covariates do not significantly influence the model's accuracy percentage. Another expected outcome is the convergence of the standard deviation of the proportion of correct predictions to zero, with the mean converging to 0.5. This occurs because, as the sample size increases, there is a greater concentration of correct predictions around the chosen credible level.
To assess whether all simulated proportions of correct predictions exhibited symmetrical behavior, the skewness coefficient was calculated by number of covariates, type of combination, and sample size. Based on the results of the cross-simulations, it was determined that only the sample size factor has an impact on the proportion of correct predictions, κ. As a result, only the sample size, n, will be used to establish the model rejection criterion.
Figure 9 displays the average and standard deviation of the simulated accuracy proportions, κ, by sample size. Each data point on the graph represents 20,000 simulations, enhancing the precision of the estimates. The average values of κ are consistently centered around 0.5, reflecting the chosen credible level. Additionally, as the sample size increases, the standard deviation tends to zero, as seen in the previous results. Skewness was also calculated for the aggregated proportion of correct predictions based solely on the sample size, with the results displayed in Figure 10. These coefficients are very close to 0, indicating approximately symmetric distributions. The findings from this simulation study indicate that only the sample size factor needs to be considered in formulating the rejection criterion, showing that the critical points presented in Table 1 are valid regardless of the type and number of explanatory variables in the model.

Illustrative Example
The Leukemia dataset, as presented by [16], contains information on the time of death (in weeks) and the white blood count for two groups of leukemia patients, totaling 33 observations.The data are presented in Table 2.
In this application, the proposed model was an exponential regression with parameter θ = exp{−(β₀ + β₁ × WBC + β₂ × AG)}, where WBC represents the white blood cell count (in units of 10,000) and AG denotes the presence of Auer rods and/or significant granulation of the leukemic cells in the bone marrow at the time of diagnosis (AG Present = 1 and AG Absent = 0). The proposed methodology was applied to the Leukemia dataset using a diffuse prior N(µ = 0, σ² = 100²) for β₀, β₁, and β₂. This involved generating 1,000,000 samples with a thinning interval of 5 and a burn-in of 10,000 in the MCMC process. The results of the LOO technique for assessing predictive capacity are presented in Table 3 and Figure 11. From Figure 11, it is evident that the predictive capacity for individuals with AG Present was unsatisfactory, as a significant number of points lie outside the 50% credible interval. This indicates a potential poor fit of the model to the data. In this application, the observed accuracy rate was κ = 11/33 = 0.333 (Table 3), resulting in ∆_obs = 0.333 − 0.5 = −0.167. Referring to Table 1 for n = 33, we find ∆_critical = 0.152 for α = 0.05. Thus, with 95% credibility, we reject the hypothesis that the exponential model used has good predictive capacity for the problem, since |∆_obs| > ∆_critical. For the observed accuracy rate κ = 0.333 (and n = 33), the FBST yielded an e-value of 0.048, also leading to rejection of the hypothesis for α = 0.05. It is noteworthy that both decision criteria fell into the rejection region, demonstrating the suitability of the decision criterion based on the critical values for ∆ presented in Table 1.
It is important to mention that the choice of prior distribution directly impacts the posterior distribution, and its misspecification can result in low model accuracy. As an example, consider changing the prior distribution of β₀ to an informative prior N(µ = 0, σ² = 1) while keeping the same diffuse prior N(µ = 0, σ² = 100²) for β₁ and β₂. With this change, we obtain an observed accuracy rate of κ = 10/33 = 0.303, resulting in ∆_obs = −0.197. This result indicates that the proposed accuracy measure identified the loss of accuracy due to misspecification of the prior distribution.
In addition, a measure usually used to assess the accuracy of a model is the Root Mean Squared Error (RMSE), defined by

RMSE = √[ (1/n) ∑_{i=1}^{n} (y_i − Ŷ_i)² ].

Here, Ŷ_i is the point estimate of the predicted value of individual i, i = 1, 2, . . ., n, defined as the mean of the predictive distribution. When considering the diffuse prior N(µ = 0, σ² = 100²) for β₀, β₁, and β₂ (results in Table 3), we obtain RMSE = 40.290. However, when replacing the prior distribution of β₀ with the misspecified informative prior N(µ = 0, σ² = 1), the value increases to RMSE = 122.303. When comparing the accuracy of the two models using the RMSE, the first model is clearly more accurate. However, the RMSE fails to identify that even the "more accurate" model does not present good predictive capacity for the data in Table 2, as shown by the accuracy measure proposed in this paper.
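For reference, the RMSE is computed as follows (a Python sketch; the observed values and predictive means below are hypothetical, not those of Table 3):

```python
import math

def rmse(y_obs, y_pred):
    """Root Mean Squared Error between observed values y_i and the point
    estimates (predictive means) Y_hat_i."""
    n = len(y_obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n)

# Hypothetical survival times (weeks) and predictive means:
val = rmse([3.0, 1.0, 4.0], [2.0, 1.0, 6.0])
```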

Discussion
This study presented an adaptation of a methodology based on external validation proposed by [1], which, despite its simplicity and intuitiveness, lacked an objective way to validate models. The adaptation enabled the definition of an accuracy measure together with a model rejection criterion, providing an objective way to validate models; previously, the discrimination could vary depending on the researcher's perspective. The development of this proposal was carried out from a Bayesian perspective of inference, elucidating the concepts used in its formulation and outlining the necessary steps for its application.

Figure 2. Observed values and 50% credible intervals of a new predicted observation in the exponential model.

Figure 3. Cumulative distribution function of the Gamma-Poisson distribution with the lower and upper limits of the 50% credible interval. The real credibility of the interval is 60.2%.

Figure 4. Observed values and 50% credible intervals (region in red) of a new predicted observation in the Poisson model. The average credibility of the intervals is 63.2%.

Figure 6. Average proportion of correct predictions, κ, by scenario and number of covariates.
Figure 7. Standard deviation of the proportion of correct predictions, κ, by scenario and number of covariates.

Figure 8. Skewness of κ by number of covariates.

Figure 9. Average and standard deviation of κ by sample size.

Figure 11. The 50% credible intervals obtained by the LOO technique.

Table 3. Results of the leave-one-out (LOO) technique for the data in Table 2 (exponential regression model).