# Computing Entropies with Nested Sampling

## Abstract

## 1. Introduction

#### Notation and Conventions

## 2. Entropies in Bayesian Inference

- Samples can be generated from the prior $p\left(\mathit{\theta}\right)$;
- Simulated datasets can be generated from $p\left(\mathit{d}\,|\,\mathit{\theta}\right)$ for any given value of $\mathit{\theta}$;
- The likelihood, $p\left(\mathit{d}\,|\,\mathit{\theta}\right)$, can be evaluated cheaply for any $\mathit{d}$ and $\mathit{\theta}$. Usually, it is the log-likelihood that is actually implemented, for numerical reasons.
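
The three capabilities listed above can be pictured as a small programming interface. The following is a minimal sketch, assuming a toy model that does not come from the paper: a standard normal prior on a scalar $\mathit{\theta}$ and independent normal observations with a known noise level. The names `sample_prior`, `simulate_data`, `log_likelihood` and `SIGMA` are hypothetical and chosen only for illustration.

```python
import math
import random

SIGMA = 1.0  # assumed known noise standard deviation (illustrative only)


def sample_prior(rng):
    """Draw theta from the prior p(theta); here a standard normal."""
    return rng.gauss(0.0, 1.0)


def simulate_data(theta, rng, num_points=5):
    """Draw a dataset d from p(d | theta): independent normal observations."""
    return [rng.gauss(theta, SIGMA) for _ in range(num_points)]


def log_likelihood(data, theta):
    """Evaluate log p(d | theta) for any dataset and any parameter value."""
    log_norm = -0.5 * math.log(2.0 * math.pi * SIGMA**2)
    return sum(log_norm - 0.5 * ((x - theta) / SIGMA) ** 2 for x in data)


rng = random.Random(1)
theta = sample_prior(rng)
data = simulate_data(theta, rng)
print("log-likelihood at the true theta:", log_likelihood(data, theta))
```

Working with the logarithm avoids numerical underflow: a product of many small densities quickly rounds to zero in floating point, whereas the corresponding sum of logs stays representable.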

#### 2.1. The Relevance of Data

#### 2.2. Mutual Information

## 3. Nested Sampling

#### The Sequence of X-Values

## 4. The Algorithm

**Algorithm 1.** The algorithm which estimates the expected value of the depth: $-\int p\left({\mathit{\theta}}_{\mathrm{ref}}\right)\int p\left(\mathit{\theta}\right)\log\left[P\left(d(\mathit{\theta};{\mathit{\theta}}_{\mathrm{ref}})<r\,|\,{\mathit{\theta}}_{\mathrm{ref}}\right)\right]\,d\mathit{\theta}\,d{\mathit{\theta}}_{\mathrm{ref}}$, that is, minus the expected value of the log-probability of a small region near ${\mathit{\theta}}_{\mathrm{ref}}$, which can be converted to an estimate of an entropy or differential entropy. The part highlighted in green is standard Nested Sampling with quasi-prior $p\left(\mathit{\theta}\right)$ and quasi-likelihood given by minus a distance function $d(\mathit{\theta};{\mathit{\theta}}_{\mathrm{ref}})$. The final result, $\widehat{{h}_{\mathrm{final}}}$, is an estimate of the expected depth.

| Step | Comment |
| --- | --- |
| Set the numerical parameters: | |
| $N\in \{1,2,3,\dots \}$ | the number of Nested Sampling particles to use |
| $r\ge 0$ | the tolerance |
| $\widehat{\mathbf{h}}\leftarrow \{\}$ | Initialise an empty list of results |
| **while** more iterations desired **do** | |
| $k\leftarrow 0$ | Initialise counter |
| Generate ${\mathit{\theta}}_{\mathrm{ref}}$ from $p\left(\mathit{\theta}\right)$ | Generate a reference point |
| Generate ${\left\{{\mathit{\theta}}_{i}\right\}}_{i=1}^{N}$ from $p\left(\mathit{\theta}\right)$ | Generate initial NS particles |
| Calculate ${d}_{i}\leftarrow d({\mathit{\theta}}_{i};{\mathit{\theta}}_{\mathrm{ref}})$ for all $i$ | Calculate distance of each particle from the reference point |
| ${i}^{\ast}\leftarrow \mathrm{argmax}(\lambda i\to {d}_{i})$ | Find the worst particle (greatest distance) |
| ${d}_{\mathrm{max}}\leftarrow {d}_{{i}^{\ast}}$ | Find the greatest distance |
| **while** ${d}_{\mathrm{max}}>r$ **do** | |
| Replace ${\mathit{\theta}}_{{i}^{\ast}}$ with ${\mathit{\theta}}_{\mathrm{new}}$ from $p\left(\mathit{\theta}\mid d(\mathit{\theta};{\mathit{\theta}}_{\mathrm{ref}})<{d}_{\mathrm{max}}\right)$ | Replace worst particle |
| Calculate ${d}_{{i}^{\ast}}\leftarrow d({\mathit{\theta}}_{\mathrm{new}};{\mathit{\theta}}_{\mathrm{ref}})$ | Calculate distance of new particle from reference point |
| ${i}^{\ast}\leftarrow \mathrm{argmax}(\lambda i\to {d}_{i})$ | Find the worst particle (greatest distance) |
| ${d}_{\mathrm{max}}\leftarrow {d}_{{i}^{\ast}}$ | Find the greatest distance |
| $k\leftarrow k+1$ | Increment counter $k$ |
| **end while** | |
| $\widehat{\mathbf{h}}\leftarrow \mathrm{append}\left(\widehat{\mathbf{h}},k/N\right)$ | Append latest estimate to results |
| **end while** | |
| $\widehat{{h}_{\mathrm{final}}}\leftarrow \frac{1}{\mathrm{num}\_\mathrm{iterations}}\sum \widehat{\mathbf{h}}$ | Average results |
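
The loop structure of Algorithm 1 can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the prior is a toy Uniform(0, 1) distribution, the distance is $|\mathit{\theta}-{\mathit{\theta}}_{\mathrm{ref}}|$, and the constrained prior is sampled exactly as a uniform draw on an interval, where a realistic problem would typically require MCMC. The function name `estimate_depth` and its arguments are hypothetical.

```python
import math
import random


def estimate_depth(num_particles, tol, num_reps, rng):
    """Estimate the expected depth for a toy Uniform(0, 1) prior (assumption)."""
    results = []
    for _ in range(num_reps):
        theta_ref = rng.random()  # reference point drawn from the prior
        particles = [rng.random() for _ in range(num_particles)]
        dists = [abs(t - theta_ref) for t in particles]

        # Worst particle = greatest distance from the reference point.
        worst = max(range(num_particles), key=dists.__getitem__)
        d_max = dists[worst]

        k = 0
        while d_max > tol:
            # Exact draw from the prior restricted to |theta - theta_ref| < d_max.
            # This shortcut only works because the toy prior is uniform.
            lo = max(0.0, theta_ref - d_max)
            hi = min(1.0, theta_ref + d_max)
            particles[worst] = rng.uniform(lo, hi)
            dists[worst] = abs(particles[worst] - theta_ref)

            worst = max(range(num_particles), key=dists.__getitem__)
            d_max = dists[worst]
            k += 1

        results.append(k / num_particles)  # depth estimate for this rep
    return sum(results) / len(results)


rng = random.Random(0)
tol = 0.01
depth = estimate_depth(num_particles=20, tol=tol, num_reps=200, rng=rng)

# Closed-form depth for this toy problem, including edge effects.
exact = -math.log(2 * tol) + 2 * tol * (1 - math.log(2))
print(f"estimated depth = {depth:.3f}, closed form = {exact:.3f}")
```

For this toy prior the depth has the closed form $-\log(2r)+2r(1-\log 2)$, where the second term accounts for reference points lying within $r$ of an edge, so the estimate can be checked directly. Adding $\log(2r)$, the log-length of an interval of radius $r$, converts the depth into an approximate differential entropy, which is zero for Uniform(0, 1).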

## 5. Example 1: Entropy of a Prior for the Data

## 6. Example 2: Measuring the Period of an Oscillating Signal

#### 6.1. Assumptions

#### 6.2. Results

**precisely**?”, to which the mutual information relates? Most practicing scientists would not feel particularly informed to learn that the vast majority of possibilities had been ruled out, if the posterior still consisted of several widely separated modes! Perhaps, in some applications, a more appropriate question is “what is the value of $T$ to within $\pm 10\%$?”, or something along these lines. See Appendix A for more about this issue.

## 7. Example 3: Data with Pareto Distribution

## 8. Computational Cost and Limitations

## Acknowledgments

## Conflicts of Interest

## Appendix A. Precisional Questions

- $x\in \{1,2,3\}$ and anything that implies it,
- $x\in \{2,3,4\}$ and anything that implies it,
- $x\in \{3,4,5\}$ and anything that implies it,
- and so on.

#### Continuous Case

## Appendix B. Software

The code is available at https://github.com/eggplantbren/InfoNest and can be obtained using the following `git` command, executed in a terminal:

    git clone https://github.com/eggplantbren/InfoNest

The following will compile the code and execute the first example from the paper:

    cd InfoNest/cpp
    make
    ./main

The algorithm will run for 1000 ‘reps’, i.e., 1000 samples of ${\mathit{\theta}}_{\mathrm{ref}}$, which is time consuming. Output is saved to `output.txt`. At any time, you can execute the Python script `postprocess.py` to get an estimate of the depth:

    python postprocess.py

`postprocess.py` estimates the depth with a tolerance of $r={10}^{-3}$. This value can be changed by calling the `postprocess` function with a different value of its argument `tol`. E.g., `postprocess.py` can be edited so that its last line is `postprocess(tol=0.01)` instead of `postprocess()`.

`main.cpp`. The default problem is the first example from the paper. For Example 2, since it is a conditional entropy that is being estimated (which requires a slight modification to the algorithm), an additional argument `InfoNest::Mode::conditional_entropy` must be passed to the `InfoNest::execute` function.

**Figure 1.** To evaluate the log-probability (or density) of the blue probability distribution near the red point, Nested Sampling can be used, with the blue distribution playing the role of the “prior” in NS, and the Euclidean distance from the red point (illustrated with red contours) playing the role of the negative log-likelihood. Averaging over selections of the red point gives an estimate of the entropy of the blue distribution.

**Figure 2.** A signal with true parameters $A=1$, $\tau =-0.5$, and $\varphi =0$, observed with noise standard deviation 0.1 with the even (gold points) and uneven (green points) observing strategies.

**Figure 3.** The joint distribution for the log-sum of the first half of the ‘Pareto data’ and the second half. The goal is to calculate the entropy of the marginal distributions, the entropy of the joint distribution, and hence the mutual information. The distribution is quite heavy tailed despite the logarithms, and extends beyond the domain shown in this plot.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

Brewer, Brendon J.
2017. "Computing Entropies with Nested Sampling" *Entropy* 19, no. 8: 422.
https://doi.org/10.3390/e19080422