Article

Testing for Bias in Forecasts for Independent Multinomial Outcomes

Econometric Institute, Erasmus School of Economics, Burgemeester Oudlaan 50, 3062PA Rotterdam, The Netherlands
* Author to whom correspondence should be addressed.
Forecasting 2025, 7(1), 4; https://doi.org/10.3390/forecast7010004
Submission received: 27 November 2024 / Revised: 7 January 2025 / Accepted: 10 January 2025 / Published: 13 January 2025
(This article belongs to the Special Issue Feature Papers of Forecasting 2024)

Abstract

This paper deals with a test for forecast bias in predicting independent multinomial outcomes where the predictions are probabilities. The new Likelihood Ratio (and Wald) test extends the familiar Mincer–Zarnowitz regression to a multinomial logit model instead of a linear regression. The test is evaluated using various simulation experiments, which indicate that the size and power properties are good, even for small sample sizes, in the sense that the empirical size is close to the nominal 5% level and the power quickly reaches 1. We implement the test in an empirical setting on brand choice by individual households.

1. Introduction and Motivation

This paper addresses the evaluation of predictions from multinomial models. Such an evaluation is relevant to assess the quality of these models, which are often applied in consumer choice analysis, marketing research, transportation research, and more.
There are various tests for the accuracy of multinomial models. There are tests for goodness of fit, like the Pearson $\chi^2$ test and the Brier score. Also, there are tests based on the prediction–realization table and on cross-entropy loss (see [1] for a recent survey). These tests all concern a measure of fit. In contrast, in this paper, we propose a Likelihood Ratio (and Wald) test on forecast bias in predicting independent multinomial outcomes, where each outcome can be $1, 2, \ldots, J$ and where the predictions are probabilities. Such multinomial outcomes are frequently observed in transportation mode choice and brand choice. The probabilities are given from the outset and can be based on an econometric model, like a multinomial probit or multinomial logit model, but they can also be created by experts, who do not necessarily rely on an econometric model. This is an advantage of our simple test.
A useful tool to test for bias in forecasts for continuous variables is the so-called Mincer–Zarnowitz regression of [2]. If realizations $y_i$ and forecasts $\hat{y}_i$ are continuous variables, and these can be cross-sectional data or time series data, where the forecast sample is $i = 1, 2, \ldots, N$, then the auxiliary regression
$$y_i = \alpha + \beta \hat{y}_i + \varepsilon_i,$$
where $\varepsilon_i$ is a mean-zero uncorrelated error process, can be used to test for bias. The parameters can be estimated using Ordinary Least Squares (OLS). The Wald-type test of interest concerns the null hypothesis that $\alpha = 0$ and $\beta = 1$, jointly. Under the null hypothesis, there is no forecast bias.
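As a minimal sketch of this familiar check, the regression and the joint Wald test can be computed with OLS algebra alone. The data below are simulated for illustration and are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate unbiased continuous forecasts: y_i = y_hat_i + noise
N = 500
y_hat = rng.normal(size=N)
y = y_hat + rng.normal(scale=0.5, size=N)

# OLS of y on a constant and the forecast
X = np.column_stack([np.ones(N), y_hat])
XtX_inv = np.linalg.inv(X.T @ X)
coef = XtX_inv @ X.T @ y                  # (alpha_hat, beta_hat)
resid = y - X @ coef
sigma2 = resid @ resid / (N - 2)
V = sigma2 * XtX_inv                      # covariance matrix of the OLS estimator

# Wald test of H0: alpha = 0 and beta = 1 (asymptotically chi-squared with 2 df)
d = coef - np.array([0.0, 1.0])
W = d @ np.linalg.inv(V) @ d
```

Under the null hypothesis, values of $W$ above 5.99 (the 5% critical value of the $\chi^2(2)$ distribution) indicate forecast bias.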
In this paper, we propose a similar test but now for independent multinomial outcomes, which in a sense extends the test for binomial outcomes in [3]. There are realizations that can be either $1, 2, \ldots,$ or $J$, and the predictions are estimated probabilities of each of the $J$ outcomes. The question now is whether these predictions based on the probabilities are unbiased or not.
The new test is as easy to implement as the Mincer–Zarnowitz regression, where now a multinomial logit model is used instead of a linear regression. This article proceeds in Section 2 with the proposed test. In Section 3, the test is evaluated using various simulation experiments. Section 4 applies the test to an empirical setting on brand choice by individual households. Section 5 concludes the article.

2. The Main Idea

Suppose there is a multinomial model for $J$ discrete choices, made by $N$ individuals, and that this model generates fitted probabilities $\hat{p}_{ij}$ for $j = 1, 2, \ldots, J$ and $i = 1, 2, \ldots, N$. Consider the estimated odds of the probability of choosing $j$ versus $k$, that is,
$$\frac{\Pr[Y_i = j]}{\Pr[Y_i = k]} = \frac{\hat{p}_{ij}}{\hat{p}_{ik}} = \frac{\exp(\ln \hat{p}_{ij})}{\exp(\ln \hat{p}_{ik})} \tag{1}$$
for all $j, k \in \{1, 2, \ldots, J\}$. The estimated odds imply that
$$\Pr[Y_i = j] = \frac{\exp(\ln \hat{p}_{ij})}{\sum_{l=1}^{J} \exp(\ln \hat{p}_{il})}. \tag{2}$$
In case of a predictive bias, the odds ratios are incorrect.
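Equation (2) is simply the statement that any probability vector is the softmax of its own log-probabilities; a short numerical check (my own illustration) makes this concrete:

```python
import numpy as np

# The softmax of the log-probabilities recovers the probabilities themselves,
# which is exactly the representation used in Equation (2)
p_hat = np.array([0.1, 0.2, 0.3, 0.4])
recovered = np.exp(np.log(p_hat)) / np.exp(np.log(p_hat)).sum()

assert np.allclose(recovered, p_hat)
```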
To describe potential bias, we extend (1) by adding intercepts and a slope parameter (as an alternative to 1), which results in
$$\frac{\Pr[Y_i = j]}{\Pr[Y_i = k]} = \frac{\exp(\alpha_j + \beta \ln \hat{p}_{ij})}{\exp(\alpha_k + \beta \ln \hat{p}_{ik})}. \tag{3}$$
This implies that the choice probabilities in (2) can be written as
$$\Pr[Y_i = j] = \frac{\exp(\alpha_j + \beta \ln \hat{p}_{ij})}{\sum_{l=1}^{J} \exp(\alpha_l + \beta \ln \hat{p}_{il})} \tag{4}$$
with the assumption that $\alpha_J = 0$ for identification, where the reference category $J$ can be any of the choices.
Under the joint null hypothesis
$$\alpha_1 = \alpha_2 = \cdots = \alpha_{J-1} = 0 \quad \text{and} \quad \beta = 1, \tag{5}$$
there is no bias in the estimated probabilities. We can use this result to test for bias in prediction probabilities by estimating the parameters in a multinomial logit (MNL) model in which the $Y$ variables are the out-of-sample realizations and the $\hat{p}$ are the predicted probabilities from a statistical model or from the judgment of experts. We can use a Likelihood Ratio (LR) test or a Wald test for the composite null hypothesis in (5).
The Likelihood Ratio test statistic is given by
$$LR = -2\left(\ell(0, \ldots, 0, 1) - \ell(\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta})\right), \tag{6}$$
where the log-likelihood function is given by
$$\ell(\alpha_1, \ldots, \alpha_{J-1}, \beta) = \sum_{i=1}^{N} \left( \sum_{j=1}^{J} (\alpha_j + \beta \ln \hat{p}_{ij})\, I[y_i = j] - \ln \sum_{l=1}^{J} \exp(\alpha_l + \beta \ln \hat{p}_{il}) \right) \tag{7}$$
and where $\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta}$ are the unrestricted Maximum Likelihood estimates.
Let $\theta = (\alpha_1, \ldots, \alpha_{J-1}, \beta)'$. The Wald test statistic is given by
$$W = (\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta} - 1)\, V(\hat{\theta})^{-1}\, (\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta} - 1)', \tag{8}$$
where $V(\hat{\theta})$ denotes the covariance matrix of the Maximum Likelihood estimator,
$$V(\theta) = -\left( \frac{\partial^2 \ell(\theta)}{\partial \theta\, \partial \theta'} \right)^{-1}, \tag{9}$$
evaluated at the Maximum Likelihood estimates $\hat{\theta} = (\hat{\alpha}_1, \ldots, \hat{\alpha}_{J-1}, \hat{\beta})'$.
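The LR variant of the test is straightforward to code. The sketch below uses my own function and variable names; it maximizes the log-likelihood of the MNL bias model in (4) with `scipy.optimize.minimize` and refers the statistic to a $\chi^2$ distribution with $J$ degrees of freedom (the $J-1$ intercepts plus the slope):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def loglik(theta, y, p_hat):
    """Log-likelihood of the MNL bias model.
    theta = (alpha_1, ..., alpha_{J-1}, beta); alpha_J = 0 for identification."""
    N = p_hat.shape[0]
    alpha = np.append(theta[:-1], 0.0)
    eta = alpha + theta[-1] * np.log(p_hat)            # N x J linear index
    log_prob = eta - np.logaddexp.reduce(eta, axis=1, keepdims=True)
    return log_prob[np.arange(N), y].sum()

def lr_bias_test(y, p_hat):
    """LR test of H0: alpha_1 = ... = alpha_{J-1} = 0 and beta = 1 (J df)."""
    J = p_hat.shape[1]
    theta0 = np.append(np.zeros(J - 1), 1.0)           # restricted parameter value
    res = minimize(lambda t: -loglik(t, y, p_hat), theta0, method="BFGS")
    lr = 2.0 * (-res.fun - loglik(theta0, y, p_hat))
    return lr, chi2.sf(lr, df=J)

# Example with unbiased forecasts: the predicted probabilities are the true ones
rng = np.random.default_rng(1)
p_hat = rng.dirichlet(np.ones(4), size=400)
y = np.array([rng.choice(4, p=row) for row in p_hat])
lr, pval = lr_bias_test(y, p_hat)
```

With unbiased forecasts, the null should be rejected in roughly 5% of replications at the 5% level; rejection rates well above that signal forecast bias.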
We can also implement partial tests to examine the absence of relative bias, for example, by considering the null hypothesis $\alpha_1 = \alpha_2 = \cdots = \alpha_{J-1} = 0$. As we consider a joint test for zero restrictions on all $\alpha$ parameters, the proposed test statistic is independent of the alternative chosen for identification.
The model in (4) can also be extended to include choice-category-specific parameters $\beta_j$, as in
$$\Pr[Y_i = j] = \frac{\exp(\alpha_j + \beta_j \ln \hat{p}_{ij})}{\sum_{l=1}^{J} \exp(\alpha_l + \beta_l \ln \hat{p}_{il})}. \tag{10}$$
No identification restrictions on the $\beta_j$ parameters are necessary, as the explanatory variables differ across the alternatives. This extension allows for a more subtle analysis of the sources of bias. However, as the number of relevant parameter restrictions increases, this may come at the expense of the power of the test.

3. Simulations

To analyze whether our proposed test is useful in practically relevant cases, we now perform various simulation experiments. As the true Data-Generating Process (DGP), we simulate probabilities from a Dirichlet distribution, that is,
$$(p_{i1}, \ldots, p_{iJ}) \sim \mathrm{Dir}(1, \ldots, 1)$$
for $i = 1, 2, \ldots, N$. Hence, for each individual, we have different probabilities. Given these probabilities, we generate $N$ true values $Y_i$. Next, we create predictive probabilities using
$$\hat{p}_{ij} = \frac{\exp(a\, I[j = J] + b \ln p_{ij})}{\sum_{l=1}^{J} \exp(a\, I[l = J] + b \ln p_{il})}$$
for $j = 1, 2, \ldots, J = 4$ and $i = 1, 2, \ldots, N$ for different parameters $a$ and $b$, where $I[\cdot]$ is an indicator function. For $a = 0$ and $b = 1$, there is no bias in the forecast probabilities. Next, we estimate the parameters in the MNL model in (4) with the assumption that $\alpha_J = 0$, using Maximum Likelihood Estimation (MLE) as described in Chapter 6 of [4], amongst others. Finally, we compute the LR test values for the following three null hypotheses
$$\alpha_1 = \alpha_2 = \cdots = \alpha_{J-1} = 0 \ \text{ and } \ \beta = 1 \quad (\text{LRab})$$
$$\alpha_1 = \alpha_2 = \cdots = \alpha_{J-1} = 0 \quad (\text{LRa})$$
$$\beta = 1 \quad (\text{LRb})$$
where we adopt the 5% significance level. The number of replications is 10,000. The results are presented in Figure 1, Figure 2 and Figure 3.
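One replication of this DGP can be sketched as follows (the variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 50, 4
a, b = 0.5, 1.0                     # a != 0 tilts the forecasts toward category J

# True probabilities from a Dir(1, ..., 1) and the realized choices
p = rng.dirichlet(np.ones(J), size=N)
y = np.array([rng.choice(J, p=row) for row in p])

# Distorted forecast probabilities, as in the DGP above
bias = a * (np.arange(J) == J - 1)  # a * I[j = J]
eta = bias + b * np.log(p)
p_hat = np.exp(eta) / np.exp(eta).sum(axis=1, keepdims=True)
```

Estimating the MNL model in (4) on each simulated pair of `y` and `p_hat`, and recording the LR rejections over many replications, yields the rejection frequencies plotted in Figures 1–3.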
Figure 1a presents the power plots of the three LR tests for the case where $b = 1$ and the sample size is $N = 50$, with different values of $a$ on the horizontal axis. Figure 1b presents the power plots of the three LR tests for the case where $a = 0$ and the sample size is $N = 50$, with different values of $b$ on the horizontal axis. Note that, for typical applications in marketing research, consumer choice modeling, transportation choice, and more, the sample size $N = 50$ can be considered quite small.
Figure 1a,b show that, even for this small sample, the empirical size is appropriately close to 5%, and the power increases as $a$ and $b$, respectively, move further away from their values under the null. Furthermore, we see that the LRab test has less power than the LRb test when the true $\alpha = 0$. However, the loss in power of the LRab test relative to the LRa test is very small when the true $\beta = 1$.
Figure 2a,b consider the same settings but now for sample size $N = 100$, whereas Figure 3a,b concern sample size $N = 250$. Comparing these figures with Figure 1a,b, where we looked at sample size $N = 50$, we see that the power curves of the tests become steeper as the sample size increases, and hence the power quickly converges to 1.

4. Illustration

As an illustration, we consider an optical scanner panel data set on purchases of four brands of saltine crackers in the Rome (Georgia) market, collected by Information Resources Incorporated. The data set contains 3292 purchases of crackers made by 136 households over about two years. The data are also analyzed in Chapter 6 of [4] (the data are available from the authors, and the code amounts to a standard module on multinomial logit model estimation). The brands are called Private label, Sunshine, Keebler, and Nabisco. For each purchase, we have the actual price of the purchased brand, the shelf price of the other brands, and four times two dummy variables which indicate whether the brands were on display or featured. To describe brand choice, we consider the conditional logit model, where we include as explanatory variables per category the price of the brand and three 0/1 dummy variables indicating whether a brand was on display only, featured only, or jointly on display and featured. To allow for out-of-sample evaluation of the model, we hold out the last purchase of each household from the estimation sample. Hence, we have 3156 observations for parameter estimation and use the estimated model to provide 136 forecast probabilities for the out-of-sample purchases. So, $J = 4$ and $N = 136$.
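The holdout construction (the last purchase of each household) can be sketched with pandas; the toy frame below is illustrative only, not the actual scanner data:

```python
import pandas as pd

# Toy purchase history: one row per purchase, time-ordered within each household
df = pd.DataFrame({
    "household": [1, 1, 1, 2, 2, 3],
    "brand": ["Nabisco", "Keebler", "Sunshine", "Nabisco", "Private label", "Keebler"],
})

# Holdout = the last purchase per household; the rest is the estimation sample
holdout = df.groupby("household").tail(1)
estimation = df.drop(holdout.index)
```

Applied to the full data set, this split yields the 136 holdout purchases and 3156 estimation observations described above.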
We obtain the following LR test values:
LRab = 1.67
LRa = 1.57
LRb = 0.06
which are all well below the corresponding 5% critical values of the $\chi^2$ distribution (9.49, 7.81, and 3.84 for 4, 3, and 1 degrees of freedom, respectively). This suggests that the predicted probabilities from our MNL model do not entail biased forecasts.
When we allow for four different β values, that is, as in (10), we obtain the LR test values
LRab = 2.22
LRa = 3.24
LRb = 2.83
Again, we see that the MNL model, as specified in [4], delivers unbiased forecasts. Hence, both specifications provide unbiased forecasts.

5. Conclusions

We have proposed a simple-to-implement Likelihood Ratio (and Wald) test for forecast bias in case the predictions concern probabilities of independent multinomial outcomes. The test is independent of the origin of the predictions. With simulations, we have shown that the test has proper empirical size and that the empirical power quickly increases with growing sample size. An illustration showed the ease of use of the test.

Author Contributions

Conceptualization, P.H.F. and R.P.; methodology, P.H.F. and R.P.; software, R.P.; validation, P.H.F. and R.P.; formal analysis, R.P.; investigation, P.H.F. and R.P.; resources, P.H.F. and R.P.; data curation, R.P.; writing—original draft preparation, P.H.F. and R.P.; writing—review and editing, P.H.F.; visualization, R.P.; supervision, P.H.F.; project administration, P.H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data can be obtained from the authors upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. de Jong, V.M.T.; Eijkemans, M.J.C.; van Calster, B.; Timmerman, D.; Moons, K.G.M.; Steyerberg, E.W.; van Smeden, M. Sample size consideration and predictive performance of multinomial logistic prediction models. Stat. Med. 2019, 38, 1601–1619.
  2. Mincer, J.; Zarnowitz, V. The evaluation of economic forecasts. In Economic Forecasts and Expectations; Mincer, J., Ed.; National Bureau of Economic Research: New York, NY, USA, 1969.
  3. Franses, P.H. Testing for bias in forecasts for independent binary outcomes. Appl. Econ. Lett. 2021, 28, 1336–1338.
  4. Franses, P.H.; Paap, R. Quantitative Models in Marketing Research; Cambridge University Press: Cambridge, UK, 2001.
Figure 1. (a) The power curve for b = 1 for sample size N = 50, where the values of a are on the horizontal axis. (b) The power curve for a = 0 for sample size N = 50, where the values of b are on the horizontal axis.
Figure 2. (a) The power curve for b = 1 for sample size N = 100, where the values of a are on the horizontal axis. (b) The power curve for a = 0 for sample size N = 100, where the values of b are on the horizontal axis.
Figure 3. (a) The power curve for b = 1 for sample size N = 250, where the values of a are on the horizontal axis. (b) The power curve for a = 0 for sample size N = 250, where the values of b are on the horizontal axis.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
