# Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks

## Abstract

## 1. Introduction

## 2. Background

#### 2.1. Analytic Q-Aggregates for Signed Linear Functions

#### 2.2. Monte Carlo Estimators for More Complex Q-Aggregates

#### 2.3. PAC–Bayesian Approach

**Theorem 1.**

## 3. The Partial Aggregation Estimator

#### 3.1. Reduced Variance Estimates

**Proposition 1.**

**Proof.**

**Proposition 2.**

**Proof.**

#### 3.2. Single Hidden Layer

## 4. Aggregating Signed-Output Networks

#### 4.1. Lower Variance Estimates of Aggregated Sign-Output Networks

**Proposition 3.**

**Proof.**

**Proposition 4.**

**Proof.**

#### 4.2. All Sign Activations

## 5. PAC–Bayesian Objectives with Signed-Outputs

**Proposition 5.**

**Proof.**

**Theorem 2.**

**fix-$\lambda$**”) or simultaneously optimised throughout training for automatic regularisation tuning (“**optim-$\lambda$**”), we obtain a gradient descent objective:

## 6. Experiments

**sign**, sigmoid (**sgmd**) or **relu** activations, before a single-unit final layer with sign activation. $Q$ was chosen as an isotropic, unit-variance normal distribution with initial means drawn from a truncated normal distribution of variance $0.05$. The data-free prior $P$ was fixed equal to the initial $Q$, as motivated by Dziugaite and Roy [4] (Section 5 and Appendix B).
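As a rough illustration of this setup, the sketch below draws initial means from a truncated normal and evaluates the KL divergence between two isotropic unit-variance Gaussians, which reduces to $\frac{1}{2}\|\mu_Q-\mu_P\|^2$. The function names, the truncation at two standard deviations, and the reading of $0.05$ as the pre-truncation variance are assumptions made for this example only; they are not taken from the text.

```python
import math
import random


def truncated_normal(rng, std, bound=2.0):
    """Rejection-sample N(0, std^2), redrawing anything beyond bound * std.

    The cut-off of two standard deviations is an assumption of this sketch.
    """
    while True:
        x = rng.gauss(0.0, std)
        if abs(x) <= bound * std:
            return x


def init_means(n_weights, variance=0.05, seed=0):
    """Initial means for Q; `variance` is taken as the pre-truncation variance."""
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return [truncated_normal(rng, std) for _ in range(n_weights)]


def kl_unit_gaussians(mu_q, mu_p):
    """KL(N(mu_q, I) || N(mu_p, I)) = 0.5 * ||mu_q - mu_p||^2."""
    return 0.5 * sum((q - p) ** 2 for q, p in zip(mu_q, mu_p))


mu_p = init_means(5)          # prior means, fixed for the whole run
mu_q = list(mu_p)             # posterior starts at the prior
kl_at_init = kl_unit_gaussians(mu_q, mu_p)   # exactly zero at initialisation

mu_q[0] += 1.0                # move one posterior mean by one unit
kl_after = kl_unit_gaussians(mu_q, mu_p)     # approximately 0.5
```

Because both distributions share the identity covariance, the KL term depends only on how far training moves the means away from their initial values.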

**fix-$\lambda$** and **optim-$\lambda$** from Section 5 were used for batch-size 256 gradient descent with Adam [17] for 200 epochs. Every five epochs, the bound (for a minimising $\lambda$) was evaluated using the entire training set; the learning rate was then halved if the bound was unimproved from the previous two evaluations. Hyperparameters were selected by the best bound achieved in these evaluations, through a grid search over initial learning rates $\in \{0.1, 0.01, 0.001\}$ and sample sizes $T \in \{1, 10, 50, 100\}$. Once these were selected, training was repeated 10 times to obtain the values in Table 1.
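The learning-rate rule can be sketched as follows. The phrase "unimproved from the previous two evaluations" admits more than one reading; this example assumes the rate is halved whenever the latest bound is not lower than the better of the two preceding evaluations. The function name and the bound values are hypothetical and purely illustrative.

```python
def update_learning_rate(lr, bound_history):
    """Halve lr when the latest bound does not beat either of the two before it.

    bound_history holds one bound value per evaluation (every five epochs),
    oldest first. With fewer than three evaluations nothing changes.
    """
    if len(bound_history) < 3:
        return lr
    latest = bound_history[-1]
    previous_two = bound_history[-3:-1]
    if latest >= min(previous_two):
        return lr / 2.0
    return lr


lr = 0.1
history = []
lrs = []
# Made-up bound values, one per evaluation.
for bound in [0.40, 0.30, 0.31, 0.29, 0.30, 0.30]:
    history.append(bound)
    lr = update_learning_rate(lr, history)
    lrs.append(lr)
```

With this toy sequence the rate is halved at the third, fifth and sixth evaluations, where the bound fails to improve, and is left alone at the fourth, where it does.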

**optim-$\lambda$** was optimised through Theorem 2 on alternate mini-batches with SGD and a fixed learning rate of $10^{-4}$ (whilst still using the objective (12) to avoid effectively scaling the learning rate with respect to empirical loss by the varying $\lambda$). After preliminary experiments in **fix-$\lambda$**, we set $\lambda = m = 60{,}000$, the training set size, as is common in Bayesian deep learning.

**reinforce**, which uses the fix-$\lambda$ objective without partial aggregation, forcing the use of REINFORCE gradients everywhere; **mlp**, an unregularised non-stochastic relu neural network with tanh output activation; and the PBGNet model (**pbg**) from Letarte et al. [6]. For the latter, a misclassification error bound obtained through $\ell_{0-1} \le 2\ell_{\mathrm{lin}}$ must be used, as their test predictions were made through the sign of a prediction function $\in [-1,+1]$, not $\in \{+1,-1\}$. Further, despite significant additional hyperparameter exploration, we were unable to train a three-layer network through the PBGNet algorithm directly comparable to our method, likely because of the exponential KL penalty (in their Equation 17) within that framework; to enable comparison, we therefore allowed the number of hidden layers in this scenario to vary $\in \{1,2,3\}$. Other baseline tuning and setup were similar to the above; see Appendix A for more details.
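The inequality $\ell_{0-1} \le 2\ell_{\mathrm{lin}}$ can be checked numerically. The sketch below assumes the usual linear loss $\ell_{\mathrm{lin}}(f(x), y) = \frac{1}{2}(1 - y f(x))$ for $f(x) \in [-1,+1]$ and $y \in \{-1,+1\}$, a definition that does not appear in this excerpt, and counts a score of exactly zero as an error. Whenever $y f(x) \le 0$ the linear loss is at least $\frac{1}{2}$, so doubling it covers the misclassification loss.

```python
def linear_loss(score, label):
    """Assumed linear loss 0.5 * (1 - y * f(x)) for a score in [-1, +1]."""
    return 0.5 * (1.0 - label * score)


def zero_one_loss(score, label):
    """Misclassification loss of sign(score); a zero score counts as an error."""
    return 0.0 if label * score > 0 else 1.0


# Scan a grid of scores in [-1, +1] against both labels.
scores = [i / 100.0 for i in range(-100, 101)]
violations = [
    (s, y)
    for s in scores
    for y in (-1, 1)
    if zero_one_loss(s, y) > 2.0 * linear_loss(s, y)
]

# The factor 2 cannot be lowered: at a score of zero both sides equal one.
tight_case = 2.0 * linear_loss(0.0, 1)
```

No grid point violates the inequality, and the zero-score case shows that the constant 2 is attained.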

**reinforce** draws a new set of weights for every test example, equivalent to the evaluation of the other models; but doing so during training, with multiple parallel samples, is prohibitively expensive. Two different approaches to straightforward, not partially-aggregated, gradient estimation for this case suggest themselves, arising from different approximations to the Q-expected loss of the minibatch, $B\subseteq S$ (with data indices $\mathcal{B}$). From the identities

## 7. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Further Experimental Details

#### Appendix A.1. Aggregating Biases with the Sign Function

#### Appendix A.2. Dataset Details

#### Appendix A.3. Hyperparameter Search for Baselines

#### Appendix A.4. Final Hyperparameter Settings

**Table A1.** Chosen hyperparameter settings and additional details for results in Table 1. Best hyperparameters were chosen by bound if available and non-vacuous, otherwise by best training linear loss through a grid search as described in Section 6 and Appendix A.3. Run times are rounded to nearest 5 min.

| | mlp | pbg | Reinforce (sign) | Reinforce (relu) | Fix-$\lambda$ (sign) | Fix-$\lambda$ (relu) | Fix-$\lambda$ (sgmd) | Optim-$\lambda$ (sign) | Optim-$\lambda$ (relu) | Optim-$\lambda$ (sgmd) |
|---|---|---|---|---|---|---|---|---|---|---|
| Init. LR | 0.001 | 0.01 | 0.1 | 0.1 | 0.01 | 0.1 | 0.1 | 0.01 | 0.1 | 0.1 |
| Samples, $T$ | - | 100 | 100 | 100 | 100 | 50 | 10 | 100 | 100 | 10 |
| Hid. Layers | 3 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Hid. Size | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| Mean KL | - | 2658 | 15,020 | 13,613 | 2363 | 3571 | 3011 | 5561 | 3204 | 4000 |
| Runtime/min | 10 | 5 | 40 | 40 | 35 | 30 | 25 | 35 | 30 | 25 |

- Initial Learning Rate $\in \{0.1,0.01,0.001\}$.
- Training Samples $\in \{1,10,50,100\}$.
- Hidden Size $=100$.
- Batch Size $=256$.
- Fix-$\lambda $, $\lambda =m=60,000$.
- Number of Hidden Layers $=3$ for all models, except PBGNet $\in \{1,2,3\}$.
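The search space listed above, together with the selection rule from the caption of Table A1, can be sketched as below. The function names, the dictionary layout, and the example result values are hypothetical, and "non-vacuous" is taken here to mean a bound strictly below 1 on the 0–1 error.

```python
from itertools import product

LEARNING_RATES = (0.1, 0.01, 0.001)
SAMPLE_SIZES = (1, 10, 50, 100)


def grid(hidden_layers=(3,)):
    """Enumerate configurations; pass (1, 2, 3) for the PBGNet baseline."""
    return [
        {"lr": lr, "T": t, "layers": n}
        for lr, t, n in product(LEARNING_RATES, SAMPLE_SIZES, hidden_layers)
    ]


def select_best(results):
    """Pick by lowest non-vacuous bound, else by lowest training linear loss.

    Each result is a dict with 'bound' (float in [0, 1], or None when no
    bound is available) and 'train_linear' (float).
    """
    non_vacuous = [
        r for r in results if r["bound"] is not None and r["bound"] < 1.0
    ]
    if non_vacuous:
        return min(non_vacuous, key=lambda r: r["bound"])
    return min(results, key=lambda r: r["train_linear"])


# Made-up results, for illustration only.
with_bounds = [
    {"name": "a", "bound": None, "train_linear": 0.02},
    {"name": "b", "bound": 0.22, "train_linear": 0.09},
    {"name": "c", "bound": 0.16, "train_linear": 0.06},
]
all_vacuous = [
    {"name": "d", "bound": 1.0, "train_linear": 0.26},
    {"name": "e", "bound": 1.0, "train_linear": 0.19},
]
best_bounded = select_best(with_bounds)["name"]
best_fallback = select_best(all_vacuous)["name"]
```

Under these assumptions the default grid holds 12 configurations, and letting the number of hidden layers vary over three values triples that to 36.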

#### Appendix A.5. Implementation and Runtime

## References

1. Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; Wierstra, D. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 1613–1622.
2. Langford, J.; Caruana, R. (Not) Bounding the True Error. In Advances in Neural Information Processing Systems 14; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; pp. 809–816.
3. Zhou, W.; Veitch, V.; Austern, M.; Adams, R.P.; Orbanz, P. Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
4. Dziugaite, G.K.; Roy, D.M. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, NSW, Australia, 11–15 August 2017.
5. Germain, P.; Lacasse, A.; Laviolette, F.; Marchand, M. PAC-Bayesian Learning of Linear Classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09), Montreal, QC, Canada, 14–18 June 2009; pp. 1–8.
6. Letarte, G.; Germain, P.; Guedj, B.; Laviolette, F. Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 6872–6882.
7. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2013**, arXiv:1312.6114.
8. Wen, Y.; Vicol, P.; Ba, J.; Tran, D.; Grosse, R. Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
9. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. **1992**, 8, 229–256.
10. Catoni, O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. IMS Lect. Notes Monogr. Ser. **2007**, 56, 1–163.
11. Mohamed, S.; Rosca, M.; Figurnov, M.; Mnih, A. Monte Carlo Gradient Estimation in Machine Learning. arXiv **2019**, arXiv:1906.10652.
12. Langford, J.; Seeger, M. Bounds for Averaging Classifiers. 2001. Available online: https://www.cs.cmu.edu/~jcl/papers/averaging/averaging_tech.pdf (accessed on 4 June 2021).
13. Seeger, M.; Langford, J.; Megiddo, N. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the 18th International Conference on Machine Learning, Williamstown, MA, USA, 28 June–1 July 2001; pp. 290–297.
14. Germain, P.; Bach, F.; Lacoste, A.; Lacoste-Julien, S. PAC-Bayesian Theory Meets Bayesian Inference. In Advances in Neural Information Processing Systems 29; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 1884–1892.
15. Knoblauch, J.; Jewson, J.; Damoulas, T. Generalized Variational Inference: Three Arguments for Deriving New Posteriors. arXiv **2019**, arXiv:1904.02063.
16. Kingma, D.P.; Salimans, T.; Welling, M. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 2575–2583.
17. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980.
18. Dziugaite, G.K.; Roy, D.M. Data-dependent PAC–Bayes priors via differential privacy. In Advances in Neural Information Processing Systems 31; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 8430–8441.
19. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. arXiv **2015**, arXiv:1603.04467.

**Table 1.** Average (from ten runs) binary-MNIST losses and bounds ($\delta =0.05$) for the best epoch and optimal hyperparameter settings of various algorithms. Hyperparameters and epochs were chosen by bound if available and non-vacuous, otherwise by training linear loss. Bold numbers indicate the best values and standard deviation is reported in italics.

| | mlp | pbg | Reinforce (sign) | Reinforce (relu) | Fix-$\lambda$ (sign) | Fix-$\lambda$ (sgmd) | Fix-$\lambda$ (relu) | Optim-$\lambda$ (sign) | Optim-$\lambda$ (sgmd) | Optim-$\lambda$ (relu) |
|---|---|---|---|---|---|---|---|---|---|---|
| Train Linear | 0.78 | 8.72 | 26.0 | 18.6 | 8.77 | 7.60 | 6.35 | 6.71 | 6.47 | 5.41 |
| error, $1\sigma$ | 0.08 | 0.08 | 0.8 | 1.4 | 0.04 | 0.19 | 0.10 | 0.11 | 0.18 | 0.16 |
| Test 0–1 | 1.82 | 5.26 | 25.4 | 17.9 | 8.73 | 7.88 | 6.51 | 6.85 | 6.84 | 5.61 |
| error, $1\sigma$ | 0.16 | 0.18 | 1.0 | 1.5 | 0.23 | 0.30 | 0.19 | 0.27 | 0.21 | 0.20 |
| Bound 0–1 | - | 40.8 | 100 | 100 | 21.7 | 18.8 | 15.5 | 22.6 | 19.3 | 16.0 |
| error, $1\sigma$ | - | 0.2 | 0.0 | 0.0 | 0.04 | 0.17 | 0.04 | 0.03 | 0.31 | 0.05 |

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Biggs, F.; Guedj, B.
Differentiable PAC–Bayes Objectives with Partially Aggregated Neural Networks. *Entropy* **2021**, *23*, 1280.
https://doi.org/10.3390/e23101280
