# Principles of Bayesian Inference Using General Divergence Criteria

## Abstract

## 1. Introduction

1. Proceed as though the model class does contain the true sample distribution and conduct a posteriori sensitivity analysis.
2. Modify the model class in order to improve its robustness properties.
3. Abandon the given parametric class and appeal to more data-driven techniques.
4. Acknowledge that the model class is only approximate, but is the best available, and seek to infer the model parameters in a way that is most useful to the decision maker.

#### The Data Generating Process and the M-Open World

## 2. Extending Bayesian MDE for the M-Open World

#### 2.1. Why the Current Justification is not Enough

#### 2.2. Principled Justification for KL in M-Open World

#### 2.2.1. Moving Away from KL-Divergence in the M-Open World

#### 2.3. Principled Bayesian Minimum Divergence Estimation

#### 2.3.1. The Likelihood Principle and Bayesian Additivity

#### 2.3.2. A Note on Calibration

## 3. Possible Divergences to Consider

#### 3.1. Total Variation Divergence

#### 3.2. Hellinger Divergence

#### 3.3. $\alpha \beta $-Divergences

#### 3.3.1. Alpha Divergence

#### 3.3.2. Beta Divergence

#### 3.3.3. The S-Hellinger Divergence

#### 3.4. Comparison

#### 3.5. Density Estimation

## 4. Illustrations

#### 4.1. M-Open Robustness

#### Simple Inference

#### 4.2. Regression under Heteroscedasticity

#### 4.3. Time Series Analysis

- an AR(3) with $\mu = (0.25, 0.4, 0.2, 0.3)$;
- an AR(1) with $\mu = (0, 0.9)$ and GARCH(1,1) errors with $\omega = 2$, $\alpha_1 = 0.99$, $\beta_1 = 0.01$;
- an AR(1) with $\mu = (0, 0.9)$ and GARCH(1,1) errors with $\omega = 1$, $\alpha_1 = 0.75$, $\beta_1 = 0.01$.
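As a rough illustration of the two misspecified settings listed above, the following sketch simulates an AR(1) series with GARCH(1,1) innovations. It rests on several assumptions not stated in the list: $\mu = (0, 0.9)$ is read as (intercept, lag-1 coefficient), the innovations are conditionally Gaussian, the conditional variance follows the usual recursion $h_t = \omega + \alpha_1 \epsilon_{t-1}^2 + \beta_1 h_{t-1}$, and a burn-in period is discarded. The function name `simulate_ar1_garch11` and its arguments are hypothetical.

```python
import numpy as np

def simulate_ar1_garch11(n, intercept, phi, omega, alpha1, beta1,
                         seed=0, burn_in=500):
    """Simulate y_t = intercept + phi * y_{t-1} + eps_t, where eps_t has
    GARCH(1,1) conditional variance (assumed parametrisation, see lead-in)."""
    rng = np.random.default_rng(seed)
    total = n + burn_in
    y = np.zeros(total)
    eps_prev = 0.0
    h_prev = omega  # arbitrary positive start; the burn-in washes it out
    for t in range(1, total):
        # Conditional variance from the previous shock and previous variance
        h = omega + alpha1 * eps_prev**2 + beta1 * h_prev
        eps = np.sqrt(h) * rng.standard_normal()
        y[t] = intercept + phi * y[t - 1] + eps
        eps_prev, h_prev = eps, h
    return y[burn_in:]  # drop the burn-in samples

# The two GARCH settings from the list above
high_vol = simulate_ar1_garch11(200, 0.0, 0.9, omega=2.0, alpha1=0.99, beta1=0.01)
low_vol = simulate_ar1_garch11(200, 0.0, 0.9, omega=1.0, alpha1=0.75, beta1=0.01)
```

Note that in the first setting $\alpha_1 + \beta_1 = 1$, so under this parametrisation the innovations have no finite unconditional variance, which is consistent with that case being the more volatile of the two.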

#### 4.4. Application to High Dimensions

## 5. Discussion

## Supplementary Materials

## Author Contributions

## Acknowledgments

## Conflicts of Interest

## Appendix A. M-Closed Efficiency

**Table A1.** Posterior mean MSE values when estimating $\mathcal{N}({\mu}_{i},{\sigma}_{i}^{2})$ from data sets of size $n = 50, 100, 200, 500$ simulated from the model, under the Bayesian minimum divergence technology.

MSE | n | KL, ${\mu}_{1}=0$ | KL, ${\mu}_{2}=15$ | Hell, ${\mu}_{1}=0$ | Hell, ${\mu}_{2}=15$ | TV, ${\mu}_{1}=0$ | TV, ${\mu}_{2}=15$ | alpha, ${\mu}_{1}=0$ | alpha, ${\mu}_{2}=15$ | Power, ${\mu}_{1}=0$ | Power, ${\mu}_{2}=15$ |
---|---|---|---|---|---|---|---|---|---|---|---|
$\mu$ | 50 | 1.89 | 4.43 | 1.50 | 16.88 | 2.68 | 6.77 | 2.03 | 4.29 | 0.76 | 93.80 |
$\mu$ | 100 | 0.64 | 1.29 | 0.57 | 5.19 | 1.20 | 1.83 | 0.67 | 1.25 | 0.44 | 33.80 |
$\mu$ | 200 | 0.35 | 0.47 | 0.32 | 1.50 | 0.48 | 0.50 | 0.35 | 0.47 | 0.31 | 7.83 |
$\mu$ | 500 | 0.15 | 0.16 | 0.14 | 0.33 | 0.28 | 0.31 | 0.15 | 0.16 | 0.20 | 1.42 |
$\sigma$ | 50 | 1.11 | 1.17 | 1.51 | 1.28 | 3.47 | 6.09 | 1.13 | 1.13 | 8.23 | 67.94 |
$\sigma$ | 100 | 0.60 | 0.61 | 0.98 | 0.87 | 1.30 | 1.61 | 0.65 | 0.64 | 1.99 | 10.64 |
$\sigma$ | 200 | 0.23 | 0.23 | 0.57 | 0.52 | 0.42 | 0.44 | 0.30 | 0.29 | 0.46 | 1.09 |
$\sigma$ | 500 | 0.10 | 0.10 | 0.26 | 0.26 | 0.29 | 0.27 | 0.12 | 0.13 | 0.18 | 0.21 |

**Table A2.** Sums of posterior mean over-estimation (positive) or under-estimation (negative) when estimating $\mathcal{N}({\mu}_{i},{\sigma}_{i}^{2})$ from data sets of size $n = 50, 100, 200, 500$ simulated from the model, under the Bayesian minimum divergence technology.

Parameter | n | KL, ${\mu}_{1}=0$ | KL, ${\mu}_{2}=15$ | Hell, ${\mu}_{1}=0$ | Hell, ${\mu}_{2}=15$ | TV, ${\mu}_{1}=0$ | TV, ${\mu}_{2}=15$ | alpha, ${\mu}_{1}=0$ | alpha, ${\mu}_{2}=15$ | Power, ${\mu}_{1}=0$ | Power, ${\mu}_{2}=15$ |
---|---|---|---|---|---|---|---|---|---|---|---|
$\mu$ | 50 | −12 | −32 | −10 | −50 | −5 | −37 | −12 | −32 | −8 | −50 |
$\mu$ | 100 | −16 | −40 | −14 | −50 | −5 | −22 | −18 | −40 | −12 | −50 |
$\mu$ | 200 | −2 | −24 | −8 | −46 | 1 | −10 | −2 | −26 | −2 | −50 |
$\mu$ | 500 | −6 | −16 | −6 | −42 | 1 | 3 | −10 | −18 | −8 | −48 |
$\sigma$ | 50 | 4 | 6 | −26 | −10 | 25 | 27 | −8 | −6 | 46 | 50 |
$\sigma$ | 100 | 0 | 0 | −36 | −32 | 19 | 12 | −14 | −14 | 38 | 48 |
$\sigma$ | 200 | −2 | −2 | −40 | −40 | 23 | 22 | −24 | −24 | 28 | 40 |
$\sigma$ | 500 | −2 | −2 | −36 | −40 | 27 | 21 | −28 | −28 | 14 | 20 |

**Figure A1.** A comparison of the magnitude of the scores associated with the KL-divergence (red), the Hellinger-divergence (blue; $\alpha = 0.5$, then divided by 4) and the alpha-divergence (green; $\alpha = 0.6$ solid, $\alpha = 0.75$ dashed, $\alpha = 0.85$ dotted) when quoting probability $x$ for an event with probability $g$. As $\alpha$ increases, the shape of the alpha-divergence score function tends towards that of the KL-divergence.

## References

1. Bernardo, J.M.; Smith, A.F. Bayesian Theory; Wiley: Hoboken, NJ, USA, 2001.
2. Walker, S.G. Bayesian inference with misspecified models. J. Statist. Plan. Inference **2013**, 143, 1621–1633.
3. Bissiri, P.; Holmes, C.; Walker, S.G. A general framework for updating belief distributions. J. R. Statist. Soc. Ser. B (Statist. Methodol.) **2016**, 78, 1103–1130.
4. Box, G.E. Sampling and Bayes’ inference in scientific modelling and robustness. J. R. Statist. Soc. Ser. A (Gen.) **1980**, 383–430.
5. Berger, J.O.; Moreno, E.; Pericchi, L.R.; Bayarri, M.J.; Bernardo, J.M.; Cano, J.A.; De la Horra, J.; Martín, J.; Ríos-Insúa, D.; Betrò, B.; et al. An overview of robust Bayesian analysis. Test **1994**, 3, 5–124.
6. Watson, J.; Holmes, C. Approximate models and robust decisions. Statist. Sci. **2016**, 31, 465–489.
7. Huber, P.J.; Ronchetti, E. Robust Statistics, Series in Probability and Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 1981.
8. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; John Wiley & Sons: Hoboken, NJ, USA, 2011; Volume 114.
9. Greco, L.; Racugno, W.; Ventura, L. Robust likelihood functions in Bayesian inference. J. Statist. Plan. Inference **2008**, 138, 1258–1270.
10. Goldstein, M. Bayes Linear Analysis; Wiley StatsRef: Statistics Reference Online; CRC Press: Boca Raton, FL, USA, 1999.
11. Owen, A. Empirical likelihood for linear models. Ann. Statist. **1991**, 19, 1725–1747.
12. Lazer, D.; Kennedy, R.; King, G.; Vespignani, A. The parable of Google Flu: Traps in big data analysis. Science **2014**, 343, 1203–1205.
13. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011.
14. Miller, J.W.; Dunson, D.B. Robust Bayesian inference via coarsening. arXiv, 2015; arXiv:1506.06101.
15. Goldstein, M. Influence and belief adjustment. In Influence Diagrams, Belief Nets and Decision Analysis; Wiley: Hoboken, NJ, USA, 1990; pp. 143–174.
16. Hooker, G.; Vidyashankar, A.N. Bayesian model robustness via disparities. Test **2014**, 23, 556–584.
17. Ghosh, A.; Basu, A. Robust Bayes estimation using the density power divergence. Ann. Inst. Statist. Math. **2016**, 68, 413–437.
18. Ghosh, A.; Basu, A. General Robust Bayes Pseudo-Posterior: Exponential Convergence results with Applications. arXiv, 2017; arXiv:1708.09692.
19. O’Hagan, A.; Buck, C.E.; Daneshkhah, A.; Eiser, J.R.; Garthwaite, P.H.; Jenkinson, D.J.; Oakley, J.E.; Rakow, T. Uncertain Judgements: Eliciting Experts’ Probabilities; John Wiley & Sons: Hoboken, NJ, USA, 2006.
20. Winkler, R.L.; Murphy, A.H. Evaluation of subjective precipitation probability forecasts. In Proceedings of the First National Conference on Statistical Meteorology, Albany, NY, USA, 28 April–1 May 1968; American Meteorological Society Boston: Boston, MA, USA, 1968; pp. 148–157.
21. Grünwald, P.D.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Statist. **2004**, 32, 1367–1433.
22. Zellner, A. Optimal information processing and Bayes’s theorem. Am. Statist. **1988**, 42, 278–280.
23. Celeux, G.; Jewson, J.; Josse, J.; Marin, J.M.; Robert, C.P. Some discussions on the Read Paper “Beyond subjective and objective in statistics” by A. Gelman and C. Hennig. arXiv, 2017; arXiv:1705.03727.
24. Gelman, A.; Hennig, C. Beyond subjective and objective in statistics. J. R. Statist. Soc. Ser. A (Statist. Soc.) **2015**, 180, 967–1033.
25. Goldstein, M. Subjective Bayesian analysis: Principles and practice. Bayesian Anal. **2006**, 1, 403–420.
26. Park, C.; Basu, A. The generalized Kullback-Leibler divergence and robust inference. J. Statist. Comput. Simul. **2003**, 73, 311–332.
27. Bhandari, S.K.; Basu, A.; Sarkar, S. Robust inference in parametric models using the family of generalized negative exponential disparities. Aust. N. Z. J. Statist. **2006**, 48, 95–114.
28. Smith, J.Q. Bayesian Decision Analysis: Principles and Practice; Cambridge University Press: Cambridge, UK, 2010.
29. Devroye, L.; Gyorfi, L. Nonparametric Density Estimation: The L1 View; John Wiley & Sons Incorporated: Hoboken, NJ, USA, 1985; Volume 119.
30. Beran, R. Minimum Hellinger distance estimates for parametric models. Ann. Statist. **1977**, 5, 445–463.
31. Smith, J. Bayesian Approximations and the Hellinger Metric. Unpublished work. 1995.
32. Cichocki, A.; Cruces, S.; Amari, S.I. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy **2011**, 13, 134–170.
33. Ghosh, A.; Harris, I.R.; Maji, A.; Basu, A.; Pardo, L. A generalized divergence for statistical inference. Bernoulli **2017**, 23, 2746–2783.
34. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hung. **1967**, 2, 299–318.
35. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: Berlin/Heidelberg, Germany, 2012; Volume 28.
36. Cressie, N.; Read, T.R. Multinomial goodness-of-fit tests. J. R. Statist. Soc. Ser. B (Methodol.) **1984**, 46, 440–464.
37. Sason, I.; Verdú, S. Bounds among f-divergences. IEEE Trans. Inf. Theory **2015**. submitted.
38. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika **1998**, 85, 549–559.
39. Dawid, A.P.; Musio, M.; Ventura, L. Minimum scoring rule inference. Scand. J. Statist. **2016**, 43, 123–138.
40. Kurtek, S.; Bharath, K. Bayesian sensitivity analysis with the Fisher–Rao metric. Biometrika **2015**, 102, 601–616.
41. Silverman, B.W. Density Estimation for Statistics and Data Analysis; CRC Press: Boca Raton, FL, USA, 1986; Volume 26.
42. Tamura, R.N.; Boos, D.D. Minimum Hellinger distance estimation for multivariate location and covariance. J. Am. Statist. Assoc. **1986**, 81, 223–229.
43. Epanechnikov, V.A. Non-parametric estimation of a multivariate probability density. Theory Probab. Appl. **1969**, 14, 153–158.
44. Rosenblatt, M. On the maximal deviation of k-dimensional density estimates. Ann. Probab. **1976**, 4, 1009–1015.
45. Abramson, I.S. On bandwidth variation in kernel estimates-a square root law. Ann. Statist. **1982**, 10, 1217–1223.
46. Hwang, J.N.; Lay, S.R.; Lippman, A. Nonparametric multivariate density estimation: A comparative study. IEEE Trans. Signal Process. **1994**, 42, 2795–2810.
47. Ram, P.; Gray, A.G. Density estimation trees. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011; ACM: New York, NY, USA, 2011; pp. 627–635.
48. Lu, L.; Jiang, H.; Wong, W.H. Multivariate density estimation by bayesian sequential partitioning. J. Am. Statist. Assoc. **2013**, 108, 1402–1410.
49. Li, M.; Dunson, D.B. A framework for probabilistic inferences from imperfect models. arXiv, 2016; arXiv:1611.01241.
50. Carpenter, B.; Gelman, A.; Hoffman, M.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.A.; Guo, J.; Li, P.; Riddell, A. Stan: A probabilistic programming language. J. Statist. Softw. **2016**, 20, 1–37.
51. Hansen, B.E. Nonparametric Conditional Density Estimation. Unpublished work. 2004.
52. Filzmoser, P.; Maronna, R.; Werner, M. Outlier identification in high dimensions. Comput. Statist. Data Anal. **2008**, 52, 1694–1711.

**Figure 1.** The influence [40] of removing one of 1000 observations from a $t(4)$ distribution when fitting a $\mathcal{N}(\mu ,{\sigma}^{2})$ under the different divergences. Subfigure (**a**) presents the KL-divergence (red), Hell-divergence (blue) and TV-divergence (pink); (**b**) presents the alpha-Bayes ($\alpha = 1, 0.95, 0.75, 0.5$); and (**c**) presents the power-Bayes ($\alpha = 0, 0.05, 0.25, 0.5$). These values of alpha mean the likelihood is raised to the same power across the alpha and power loss functions. The plots demonstrate increasing influence for observations in the tails under the KL-divergence, and decreasing influence for outlying observations under the robust divergences.

**Figure 2.** **Top**: Posterior predictive distributions (smoothed from a sample) arising from Bayesian minimum divergence estimation fitting a normal distribution $\mathcal{N}(\mu ,{\sigma}^{2})$ to an $\epsilon$-contaminated normal $0.99\,\mathcal{N}(0,1)+0.01\,\mathcal{N}(5,{5}^{2})$ (**left**), a t-distribution $t(4)$ (**middle**) and the tracks1 dataset (**right**) using the KL-Bayes (red), Hell-Bayes (blue), TV-Bayes (pink), alpha-Bayes (green) and power-Bayes (orange). **Bottom**: Log-density plots (smoothed from a sample) (**left**) and Normal QQ plots (**middle**) for the $\epsilon$-contaminated normal dataset. The lower right panel plots the posterior predictive distributions (smoothed from a sample) from alternative models using the KL-Bayes: Normal (red), t (blue), logNormal (green) and gamma (orange).
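To make the $\epsilon$-contaminated example in Figure 2 concrete, the sketch below draws from $0.99\,\mathcal{N}(0,1)+0.01\,\mathcal{N}(5,5^2)$ and computes the exact mean and variance of that mixture. The Gaussian that is closest to the mixture in KL-divergence matches these two moments, so the calculation indicates how far a 1% contamination can move a KL-based fit. The function names are hypothetical and the code is only an illustration, not the procedure used to produce the figure.

```python
import numpy as np

def sample_contaminated_normal(n, eps=0.01, seed=0):
    """Draw n points from (1 - eps) * N(0, 1) + eps * N(5, 5^2)."""
    rng = np.random.default_rng(seed)
    is_outlier = rng.random(n) < eps          # mixture component indicator
    clean = rng.normal(0.0, 1.0, size=n)
    outliers = rng.normal(5.0, 5.0, size=n)   # second argument is the sd
    return np.where(is_outlier, outliers, clean), is_outlier

def mixture_moments(eps=0.01):
    """Exact mean and variance of the contaminated normal mixture."""
    mean = (1 - eps) * 0.0 + eps * 5.0
    second_moment = (1 - eps) * (1.0 + 0.0**2) + eps * (5.0**2 + 5.0**2)
    return mean, second_moment - mean**2

def gaussian_mle(x):
    """Gaussian maximum likelihood estimates: sample mean and variance."""
    return x.mean(), x.var()

x, is_outlier = sample_contaminated_normal(1000)
print(mixture_moments())   # exact moments of the mixture
print(gaussian_mle(x))     # finite-sample counterpart
```

With `eps = 0.01` the mixture has mean 0.05 and variance 1.4875, so the moment-matching Gaussian has a variance almost 50% larger than that of the $\mathcal{N}(0,1)$ component generating 99% of the data.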

**Figure 3.** A data set simulated from the heteroscedastic linear model $y\sim \mathcal{N}(X\mathit{\beta},\sigma {({X}_{1})}^{2})$, with $\sigma ({X}_{1})=\exp(2{X}_{1}/3)$ and $p=1$.
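A data set of the kind described in Figure 3 can be sketched as follows. The covariate distribution and the value of $\mathit{\beta}$ are not given in the caption, so standard normal covariates and $\mathit{\beta} = 1$ are assumptions made purely for illustration, and the function names are hypothetical. The second function fits the homoscedastic linear model by least squares, which is the Gaussian maximum likelihood fit.

```python
import numpy as np

def simulate_heteroscedastic(n=200, beta=(1.0,), seed=0):
    """Simulate y ~ N(X beta, sigma(X_1)^2) with sigma(X_1) = exp(2 X_1 / 3).
    Covariates are standard normal (an assumption, see lead-in)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    X = rng.standard_normal((n, beta.size))
    sigma = np.exp(2.0 * X[:, 0] / 3.0)       # noise sd depends on X_1 only
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, sigma

def fit_homoscedastic(X, y):
    """Least-squares coefficients and the constant-variance estimate."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return beta_hat, np.mean(resid**2)

def average_noise_variance():
    """E[exp(4 X_1 / 3)] for X_1 ~ N(0, 1), i.e. exp((4/3)^2 / 2)."""
    return np.exp(8.0 / 9.0)

X, y, sigma = simulate_heteroscedastic()
beta_hat, sigma2_hat = fit_homoscedastic(X, y)
print(beta_hat, sigma2_hat, average_noise_variance())
```

Under the standard normal assumption the noise variance averaged over $X_1$ is $\exp(8/9) \approx 2.43$, even though the noise variance at $X_1 = 0$ is 1. A single constant variance fitted by maximum likelihood therefore has to be large enough to cover the noisiest region of the covariate space.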

**Figure 4.** **Top**: One-step-ahead posterior predictions arising from Bayesian minimum divergence estimation fitting autoregressive (AR) models with the correctly chosen lags to an AR(3) with no additional error (**left**), an AR(1) with GARCH(1,1) errors, $\alpha =0.99$, $\omega =2$ (**middle**) and an AR(1) with GARCH(1,1) errors, $\alpha =0.75$, $\omega =1$ (**right**) using the KL-Bayes (red), Hell-Bayes (blue), TV-Bayes (pink), alpha-Bayes (green) and power-Bayes (orange). **Bottom**: The difference in one-step-ahead posterior squared prediction errors between the KL-Bayes and the Hell-Bayes. When the model is correct, all of the methods appear to perform similarly. Under misspecification, the Hellinger-Bayes does a much better job of correctly capturing the underlying dependence in the data.

**Table 1.** Posterior mean values for the variance of a standard linear model fitted to the heteroscedastic linear model $y\sim \mathcal{N}(X\mathit{\beta},\sigma {({X}_{1})}^{2})$, with $\sigma ({X}_{1})=\exp(2{X}_{1}/3)$, to be interpreted in terms of the precision of the predictive distribution minimising the respective divergence criteria, across $N=50$ repeats from datasets of size $n=200$ with the dimension of $\mathit{\beta}$ being $p = 1, 5, 10, 15, 20$, under the Bayesian minimum divergence technology. The KL-divergence fits a much higher predictive variance in order to accommodate the misspecification in the variance of the model.

${\widehat{\mathit{\sigma}}}^{2}$ | KL | Hell | TV | alpha | Power |
---|---|---|---|---|---|
p = 1 | 2.34 | 0.78 | 0.62 | 1.19 | 0.96 |
p = 5 | 2.36 | 0.51 | 0.47 | 0.95 | 0.98 |
p = 10 | 2.34 | 0.47 | 0.56 | 0.89 | 1.13 |
p = 15 | 2.41 | 0.49 | 0.94 | 0.83 | 1.19 |
p = 20 | 2.38 | 0.50 | 1.19 | 0.82 | 1.27 |

**Table 2.** Mean squared errors (MSE) for the posterior means of the parameters $\mathit{\beta}$ of a standard linear model fitted to the heteroscedastic linear model $y\sim \mathcal{N}(X\mathit{\beta},\sigma {({X}_{1})}^{2})$, with $\sigma ({X}_{1})=\exp(2{X}_{1}/3)$, along with MSE values for a test set simulated without error, across $N=50$ repeats from datasets of size $n=200$ with the dimension of $\mathit{\beta}$ being $p = 1, 5, 10, 15, 20$, under the Bayesian minimum divergence technology. The alternative divergences estimate the parameters of the underlying mean more accurately, which allows them to perform better predictively.

MSE | KL, $\mathit{\beta}$ | KL, Test | Hell, $\mathit{\beta}$ | Hell, Test | TV, $\mathit{\beta}$ | TV, Test | alpha, $\mathit{\beta}$ | alpha, Test | Power, $\mathit{\beta}$ | Power, Test |
---|---|---|---|---|---|---|---|---|---|---|
p = 1 | 0.03 | 3.25 | 0.03 | 3.42 | 0.04 | 4.51 | 0.03 | 2.57 | 0.02 | 1.92 |
p = 5 | 0.09 | 7.92 | 0.06 | 5.61 | 0.05 | 5.12 | 0.04 | 4.32 | 0.04 | 3.56 |
p = 10 | 0.15 | 14.70 | 0.09 | 9.63 | 0.10 | 10.43 | 0.08 | 8.64 | 0.08 | 8.40 |
p = 15 | 0.24 | 22.93 | 0.14 | 13.45 | 0.16 | 15.47 | 0.12 | 11.65 | 0.12 | 11.38 |
p = 20 | 0.29 | 25.88 | 0.19 | 17.71 | 0.18 | 16.46 | 0.16 | 14.79 | 0.16 | 14.89 |

**Table 3.** Root mean squared errors (RMSE) for the KL-Bayes and Hell-Bayes posterior mean predictions when fitting an autoregressive (AR) model to an AR(3) with no additional error, an AR(1) with high volatility GARCH(1,1) errors, and an AR(1) with lower volatility GARCH(1,1) errors, for 100 test data points from the underlying AR model.

RMSE | Correctly Specified | High Volatility | Low Volatility |
---|---|---|---|
KL | 0.49 | 1.87 | 1.21 |
Hell | 0.48 | 1.07 | 0.94 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jewson, J.; Smith, J.Q.; Holmes, C.
Principles of Bayesian Inference Using General Divergence Criteria. *Entropy* **2018**, *20*, 442.
https://doi.org/10.3390/e20060442
