Proceeding Paper

Global Variance as a Utility Function in Bayesian Optimization †

Max-Planck-Institut für Plasmaphysik, 85748 Garching, Germany
* Author to whom correspondence should be addressed.
Presented at the 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, online, 4–9 July 2021.
Phys. Sci. Forum 2021, 3(1), 3; https://doi.org/10.3390/psf2021003003
Published: 5 November 2021

Abstract

A Gaussian-process surrogate model based on already acquired data is employed to approximate an unknown target surface. In order to locate the next function evaluations in parameter space optimally, a whole variety of utility functions is at one's disposal. However, a good choice of a specific utility, or of a certain combination of utilities, provides the fastest way to determine a best surrogate surface or its extremum with the lowest amount of additional data possible. In this paper, we propose to consider the global (integrated) variance as a utility function, i.e., to integrate the variance of the surrogate over a finite volume in parameter space. It turns out that this utility not only complements the tool set for fine-tuning investigations in a region of interest but also expedites the optimization procedure in toto.

1. Introduction

In many experimental or theoretical approaches the effort of acquiring data may be costly, time consuming or both. The goal is to gain insight into either the overall or the extremal behaviour of a target quantity with respect to a set of parameters. If insight into the functional dependencies between target and parameters can only be obtained from a computationally expensive function, which may be considered a black-box function, it is instructive to employ surrogate modelling: already acquired data serve as a starting basis for establishing a surrogate surface in parameter space, which then gets explored by Bayesian optimization [1]. An overall survey of Bayesian optimization may be found in [2], though it concentrates on an expected improvement (EI) utility and considers noise in the data only in the last paragraph, again by concentrating on EI. In contrast to this nice study, we propose to alternate between the different utilities at hand. Moreover, a fast information-theoretic approach to Bayesian optimization is shown in [3], though that approach approximates any black-box function by a parabolic form, which differs from our ansatz of leaving the black-box function untouched. Interesting insights into multi-objective Bayesian optimization are provided by [4], which considers "multi-objective" in the sense of seeking the extrema (each freely chosen to be a maximum or a minimum) of a set of single-objective functions. The present paper, however, concentrates on finding a common extremum depending on multiple parameters.
For the surrogate modelling we use the Gaussian process method (GP) [5], whose early stages date back to the middle of the last century, with the very first efforts in the geosciences [6] tackling the problem of surrogate modelling by so-called kriging [7]. Since then, GP has been much appreciated in the fields of neural networks and machine learning [8,9,10,11,12], and further work showed the applicability of active data selection via variance-based criteria [13,14]. Our implementation of the GP method in this paper was introduced in [15] and follows in notation, apart from small amendments, the very instructive book of Rasmussen & Williams [5].
While in a previous work [16] we investigated the performance of utility functions for the expected improvement of an additional data point or for a data point with the maximal variance, in this paper we would like to introduce the global variance, i.e., the integral of the variance of the target surrogate over a region of interest, evaluated with respect to a newly added data point. It is a substantial advantage of the Gaussian process method that such a task may be tackled simply on the basis of already acquired data, i.e., before new data have to be determined.

2. Global Variance for Gaussian Process-Based Model

In the following we concisely report the formulas leading to the results in this paper. For a thorough discussion of Gaussian processes please refer to the above mentioned papers, especially to the book of Rasmussen & Williams [5].
The problem of predicting function values in a multi-dimensional space supported by given data is a regression problem for a non-trivial function of unknown shape. Given are $n$ target data $\mathbf{y}$ for input data vectors $\mathbf{x}_i$ of dimension $N_{\mathrm{dim}}$, with matrix $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, written as
$$\mathbf{y} = \begin{pmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{pmatrix},\qquad \mathbf{x}_i = \begin{pmatrix} x_{i1}\\ x_{i2}\\ \vdots\\ x_{iN_{\mathrm{dim}}} \end{pmatrix},\qquad X = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{n1}\\ x_{12} & x_{22} & \cdots & x_{n2}\\ \vdots & \vdots & & \vdots\\ x_{1N_{\mathrm{dim}}} & x_{2N_{\mathrm{dim}}} & \cdots & x_{nN_{\mathrm{dim}}} \end{pmatrix}. \tag{1}$$
We assume that the target data $y_i$ are blurred by Gaussian noise with variance $\sigma_{d_i}^2$. Further, we assume that the black-box function connecting input $X$ and target $\mathbf{y}$ is at least uniformly continuous, which justifies describing the target surface with a surrogate from the Gaussian process method. Regardless of any experimental data or physics background, all quantities throughout this paper are treated as dimensionless.
The decisive quantity of a Gaussian process is the covariance function $k$ for two vectors $\mathbf{x}_p$ and $\mathbf{x}_q$, defined by
$$k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2 \exp\left[-\frac{1}{2}\left(\frac{\mathbf{x}_p-\mathbf{x}_q}{\lambda}\right)^2\right], \tag{2}$$
with the signal variance $\sigma_f^2$ and length scale $\lambda$. A covariance matrix $(K)_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ collects the covariances of all input data $X$. The GP method describes a target value $f_*$ at test input vector $\mathbf{x}_*$ by a normal distribution $p(f_*|X, \mathbf{y}, \mathbf{x}_*) \sim \mathcal{N}\left(\bar{f}_*, \mathrm{var}(\mathbf{x}_*)\right)$ with mean $\bar{f}_* = \mathbf{k}_*^T\left[K + \sigma_n^2\Delta\right]^{-1}\mathbf{y}$ and variance $\mathrm{var}(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T\left[K + \sigma_n^2\Delta\right]^{-1}\mathbf{k}_*$, where the term $\sigma_n^2\Delta$ represents the degree of information in the data. While $\Delta$ is the diagonal matrix of the given variances $\sigma_{d_i}^2$, the variance $\sigma_n^2$ accounts for an overall noise in the data. The full covariance matrix $M$ of the Gaussian process is then
$$M = K + \sigma_n^2\Delta. \tag{3}$$
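For illustration, the following is a minimal sketch (ours, not the authors' code) of this prediction step in Python, assuming the squared-exponential covariance of Equation (2) and the noise model of Equation (3); all variable and function names are ours.

```python
import numpy as np

def se_cov(XA, XB, lam, sigma_f):
    """Squared-exponential covariance of Eq. (2) between two sets of input vectors."""
    d2 = ((XA[:, None, :] - XB[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    return sigma_f**2 * np.exp(-0.5 * d2 / lam**2)

def gp_predict(X, y, x_star, lam, sigma_f, sigma_n, sigma_d):
    """Posterior mean and variance of the surrogate at a single test input x_star."""
    M = se_cov(X, X, lam, sigma_f) + sigma_n**2 * np.diag(sigma_d**2)   # Eq. (3)
    k_star = se_cov(X, x_star[None, :], lam, sigma_f)[:, 0]
    mean = k_star @ np.linalg.solve(M, y)
    var = sigma_f**2 - k_star @ np.linalg.solve(M, k_star)              # k(x_*, x_*) = sigma_f^2
    return mean, var
```

Here X has shape (n, N_dim) and sigma_d collects the per-datum standard deviations entering the diagonal matrix Δ.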
In Bayesian probability theory the three parameters $\boldsymbol{\theta} = (\lambda, \sigma_f, \sigma_n)^T$ are considered to be hyper-parameters, which show up in the marginal likelihood as
$$\log p(\mathbf{y}|\boldsymbol{\theta}) = \mathrm{const} - \frac{1}{2}\,\mathbf{y}^T\left[K(\lambda, \sigma_f) + \sigma_n^2\Delta\right]^{-1}\mathbf{y} - \frac{1}{2}\log\left|K(\lambda, \sigma_f) + \sigma_n^2\Delta\right|. \tag{4}$$
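A short sketch of evaluating Equation (4), up to its constant term, for a trial set of hyper-parameters is given below (assuming $M = K(\lambda, \sigma_f) + \sigma_n^2\Delta$ has been built beforehand, e.g. with the se_cov helper above); whether the hyper-parameters are then optimized or marginalized is left to the user.

```python
import numpy as np

def neg_log_marginal_likelihood(y, M):
    """Negative of Eq. (4), up to the constant, for M = K(lam, sigma_f) + sigma_n^2 * Delta."""
    L = np.linalg.cholesky(M)                            # stable access to M^{-1} and |M|
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # M^{-1} y
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))  # 0.5*log|M| = sum(log(diag(L)))
```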
In [16] we showed that, for a sufficiently large data base, the target surrogate is well described by using the expectation values of the hyper-parameters in the formulas for $f_*$ and $\mathrm{var}(f_*)$, at least well enough to determine a global optimum in a region of interest (RoI). The global optimum is found by employing utility functions, such as the expected improvement $U_{\mathrm{EI}}(\mathbf{x}_*) = \langle I\rangle = \int_{f_{\max}}^{\infty} f_*\, p(f_*|X, \mathbf{y}, \mathbf{x}_*)\,\mathrm{d}f_*$ and the maximum variance $U_{\mathrm{MV}}(\mathbf{x}_*) = \mathrm{var}(f_*)$. For both, the respective maximum at $\mathbf{x}_*^{\max} = \arg\max_{\{\mathbf{x}_*\}} U_{\mathrm{EI/MV}}$ is sought. While the first utility ($U_{\mathrm{EI}}$) evaluates the possible information gain from a new data point, the second ($U_{\mathrm{MV}}$) simply identifies the vector in input space with the largest variance of the target surrogate.
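A hedged sketch of these two utilities for a candidate point, given the posterior mean and variance of the surrogate, follows; the closed form used for $U_{\mathrm{EI}}$ is our reading of the truncated-mean expression written above and is not necessarily the exact form used in [16].

```python
import numpy as np
from scipy.stats import norm

def u_ei(mean, var, f_max):
    """Expected improvement utility: integral of f_* p(f_*) from f_max to infinity."""
    s = np.sqrt(var)
    z = (f_max - mean) / s
    return mean * (1.0 - norm.cdf(z)) + s * norm.pdf(z)

def u_mv(mean, var):
    """Maximum-variance utility: simply the posterior variance of the surrogate."""
    return var

# The next evaluation point x_*^max is the argmax of the chosen utility over a set
# of candidate test inputs, e.g. a grid or random samples within the RoI.
```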
In order to assess the implications of an additional data point for the surrogate, we propose a further utility function, the global variance, defined on the multi-dimensional RoI $[-1, 1]$ by
$$U_{\mathrm{GV}} = \int_{-1}^{1} \mathrm{var}(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{5}$$
The exact integration, shown in Appendix A, leads to
$$U_{\mathrm{GV}}^{\mathrm{exact}} = 2^{N_{\mathrm{dim}}}\sigma_f^2 - \sigma_f^4\left(\frac{\sqrt{\pi}}{2}\,\lambda\right)^{N_{\mathrm{dim}}}\sum_{ij}^{n}\left(M^{-1}\right)_{ij}\, e^{-\frac{1}{4\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)}\cdot\prod_{k}^{N_{\mathrm{dim}}}\left[\mathrm{erf}\left(\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right)-\mathrm{erf}\left(-\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right)\right]. \tag{6}$$
Though Equation (6) represents the exact result, in actual computation runs the evaluation of the error function may turn out to be substantially time consuming compared to the total expenditure. Therefore, we introduce two alternatives to the exact integration of Equation (5).
The first one is an approximation. Since the integrand in Equation (5) contributes only negligibly outside of the RoI, we shift the lower and upper integration bounds to $\mp\infty$ and obtain from simple Gaussian integrals
$$U_{\mathrm{GV}}^{\mathrm{inf}} \approx -\sigma_f^4\left(\sqrt{\pi}\,\lambda\right)^{N_{\mathrm{dim}}}\sum_{ij}^{n}\left(M^{-1}\right)_{ij}\exp\left[-\frac{1}{4\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)\right]. \tag{7}$$
We dropped the first term, the integral over $\sigma_f^2\,\mathrm{d}\mathbf{x}$, for being infinite; in any case it is a constant contribution regardless of changes in the input $X$. Although the utility $U_{\mathrm{GV}}^{\mathrm{inf}}$ in Equation (7) is only an approximation, it is much easier to handle numerically and its computation is much faster than that of Equation (6).
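A sketch of both global-variance utilities, as reconstructed in Equations (6) and (7), is given below; M_inv denotes the inverse of the full covariance matrix built from the inputs X, which here already include the proposed new point, and all names are ours.

```python
import numpy as np
from scipy.special import erf

def u_gv_exact(X, M_inv, lam, sigma_f):
    """Exact RoI integral of the variance, Eq. (6), for X of shape (n, N_dim)."""
    n, n_dim = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)          # |x_i - x_j|^2
    s = (X[:, None, :] + X[None, :, :]) / (2.0 * lam)            # (x_ik + x_jk) / (2 lam)
    box = np.prod(erf(1.0 / lam - s) - erf(-1.0 / lam - s), axis=-1)
    summed = np.sum(M_inv * np.exp(-d2 / (4.0 * lam**2)) * box)
    return 2.0**n_dim * sigma_f**2 - sigma_f**4 * (0.5 * np.sqrt(np.pi) * lam)**n_dim * summed

def u_gv_inf(X, M_inv, lam, sigma_f):
    """Infinite-bound approximation, Eq. (7), dropping the (infinite) constant term."""
    n, n_dim = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return -sigma_f**4 * (np.sqrt(np.pi) * lam)**n_dim * np.sum(M_inv * np.exp(-d2 / (4.0 * lam**2)))
```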
A second, more sophisticated approach is to insert an enveloping Gaussian function with adjustable location $\mathbf{x}_G$ (guiding center) and width $\sigma_G$ into the integral of Equation (5). Again the integration limits are shifted to $\pm\infty$; however, this time the enveloping Gaussian function takes care of the integrability and we get
$$U_{\mathrm{GV}}^{\mathrm{env}} = \int_{-\infty}^{\infty} \mathrm{var}(\mathbf{x}) \left(\frac{1}{\sqrt{2\pi\sigma_G^2}}\right)^{N_{\mathrm{dim}}} \exp\left[-\frac{1}{2\sigma_G^2}\left(\mathbf{x}-\mathbf{x}_G\right)^T\left(\mathbf{x}-\mathbf{x}_G\right)\right]\mathrm{d}\mathbf{x} = \sigma_f^2 - \sigma_f^4\left(\sqrt{\frac{\lambda^2}{2\sigma_G^2+\lambda^2}}\right)^{N_{\mathrm{dim}}}\sum_{ij}^{n}\left(M^{-1}\right)_{ij} \cdot\exp\left[-\frac{1}{2}\left(\frac{\mathbf{x}_i^T\mathbf{x}_i+\mathbf{x}_j^T\mathbf{x}_j}{\lambda^2}+\frac{\mathbf{x}_G^T\mathbf{x}_G}{\sigma_G^2}-\frac{\left(\frac{\mathbf{x}_i+\mathbf{x}_j}{\lambda^2}+\frac{\mathbf{x}_G}{\sigma_G^2}\right)^T\left(\frac{\mathbf{x}_i+\mathbf{x}_j}{\lambda^2}+\frac{\mathbf{x}_G}{\sigma_G^2}\right)}{\frac{2}{\lambda^2}+\frac{1}{\sigma_G^2}}\right)\right]. \tag{8}$$
The two parameters $\mathbf{x}_G$ and $\sigma_G$ provide a toolset to guide the search for the next target data evaluations: a smaller standard deviation $\sigma_G$ shifts the attention to the center of the enveloping Gaussian, while $\mathbf{x}_G$ gives us the possibility to focus on certain regions in the RoI.
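A corresponding sketch of Equation (8), as reconstructed above, is shown here; x_G is the guiding center (array of length N_dim), sigma_G the width of the enveloping Gaussian, and all names are ours.

```python
import numpy as np

def u_gv_env(X, M_inv, lam, sigma_f, x_G, sigma_G):
    """Global variance weighted by an enveloping Gaussian N(x_G, sigma_G^2), Eq. (8)."""
    n, n_dim = X.shape
    a = 2.0 / lam**2 + 1.0 / sigma_G**2
    b = (X[:, None, :] + X[None, :, :]) / lam**2 + x_G / sigma_G**2        # (n, n, N_dim)
    sq = (X**2).sum(-1)
    c = (sq[:, None] + sq[None, :]) / lam**2 + (x_G @ x_G) / sigma_G**2    # (n, n)
    expo = np.exp(-0.5 * (c - (b**2).sum(-1) / a))
    prefac = (lam**2 / (2.0 * sigma_G**2 + lam**2)) ** (0.5 * n_dim)
    return sigma_f**2 - sigma_f**4 * prefac * np.sum(M_inv * expo)
```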
All three utilities employing the global variance, $U_{\mathrm{GV}}^{\mathrm{exact}}$, $U_{\mathrm{GV}}^{\mathrm{inf}}$ and $U_{\mathrm{GV}}^{\mathrm{env}}$, require the inversion of the full covariance matrix $M$ of Equation (3). Since the inversion has to be performed for every newly proposed test input $\mathbf{x}_*$, this is the main time-consuming part of the whole procedure. Let us remind the reader that the method we are proposing operates entirely in input space (together with the already acquired data) and that the bottleneck is the generation of new target data. Therefore, the starting assumption of very expensive (i.e., time-consuming) data acquisition still holds. However, we can beneficially use blockwise matrix inversion [17], since a new input vector $\mathbf{x}_{n+1}$ expands the covariance matrix by only one additional row and column. Consequently, the computational effort is reduced to $O(n^2)$ instead of $O(n^3)$ for a standard inversion.
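A sketch of this blockwise update, assuming the inverse M_inv of the current n x n covariance matrix is already available, follows; the new inverse is obtained in O(n^2) via the Schur complement, and the variable names are ours.

```python
import numpy as np

def update_inverse(M_inv, X, x_new, lam, sigma_f, sigma_n, sigma_d_new):
    """Inverse of the (n+1) x (n+1) matrix M after appending the input x_new."""
    b = sigma_f**2 * np.exp(-0.5 * ((X - x_new[None, :])**2).sum(-1) / lam**2)  # cov. to old inputs
    c = sigma_f**2 + sigma_n**2 * sigma_d_new**2        # new diagonal element of M
    v = M_inv @ b                                       # O(n^2)
    s = c - b @ v                                       # Schur complement (scalar)
    top = M_inv + np.outer(v, v) / s
    return np.block([[top,             -v[:, None] / s],
                     [-v[None, :] / s,  np.array([[1.0 / s]])]])
```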

3. Proof of Principle

We follow the global optimization scheme from Section 4 of [18]. Again we give a proof of principle with a "black box" model featuring a broad parabolic maximum, $2 - \sum_k^{N_{\mathrm{dim}}}\left[x_k + (-1)^k\,0.3\right]^2$, together with a smaller cosine structure, $0.1\cos\left[2\pi\left(x_k - 0.3\right)/\Delta_{\cos}\right]$, on top of it, where we focus on a decent ripple with $\Delta_{\cos} = 0.6$ in one and two dimensions ($N_{\mathrm{dim}} = 1, 2$).
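A hypothetical reimplementation of this test model follows; the formula is our reading of the description above, with the sign convention chosen such that the 2d maximum lies at (0.3, -0.3), as quoted later in the text.

```python
import numpy as np

def black_box(x, delta_cos=0.6):
    """Parabolic bump with a small cosine ripple; x is an array of length N_dim."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(1, x.size + 1)
    parabola = 2.0 - np.sum((x + (-1.0)**k * 0.3) ** 2)
    ripple = 0.1 * np.sum(np.cos(2.0 * np.pi * (x - 0.3) / delta_cos))
    return parabola + ripple

# black_box([0.3, -0.3]) returns 2.2, the true maximum quoted in the text for 2d.
```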
Figure 1, Figure 2 and Figure 3 show in their left and right panels the results for one and two dimensions, respectively. The abscissa in the utility comparisons of Figure 3 and of the bottom rows (c), (d) of Figure 1 and Figure 2 counts the number of newly acquired data.
For every newly added data point proposed by the various utilities, the distance between the true location of the global optimum and the maximal value of the surrogate residing on the data at hand is calculated in Figure 1. In a similar fashion, the search for the best surrogate description of the hidden model is shown in Figure 2.
Finally, Figure 3 demonstrates the use of an enveloping Gaussian function in the integral of the global variance by varying its center $\mathbf{x}_G$, e.g., if an educated guess about the location of the extremal structure is at hand, the guiding center $\mathbf{x}_G$ is preset to the positive axis (1d) or to the quadrant (2d) containing the true model maximum. Correspondingly, Table 1 displays for 1d the specific number of data and for 2d the saturation level at which the target surrogate enters the stage of resembling the true model, i.e., the point from which the summed (absolute) differences between all grid points of the target surrogate and the model diminish with the number of target data only.

4. Discussion

The results above show the use of various utilities as a toolbox for surrogate modeling. Depending on the task (either to find an extremum or to obtain a best surrogate description of an unknown "black box" model) and on the prior knowledge at hand (a presumption about the location of the sought extremum, or a concentration on the region of interest), it is advisable to choose the most suitable utility function. However, even more promising is the combination of utilities of different character, in order to profit from their benefits in toto and to compensate for the pitfalls and drawbacks of one or the other utility.
As can be seen in the very first example in Figure 1, for the one-dimensional case the global optimum is found very quickly with the help of the expected improvement utility $U_{\mathrm{EI}}$ (the search enters the bump with the correct extremum already below N = 10). However, a known drawback of this utility is that it gets stuck in local extrema and that it takes an unreasonably high number of additional data to escape from this pitfall.
This is taken into account in the two-dimensional case, where the best result, with the lowest difference to the exact result, is obtained by combining all three utilities $U_{\mathrm{EI}}$, $U_{\mathrm{MV}}$ and $U_{\mathrm{GV}}$. Focusing the maximum search on the expected improvement utility alone (black line in Figure 1d) would have got stuck in a local extremum with y = 2.03 in the "wrong" quadrant at (−0.26; −0.29), never recovering from this up to about N = 63 (internal stop of the algorithm for no improvement after reaching the level of computing accuracy) and totally missing the true optimum with y = 2.2 at (0.3; −0.3).
The situation changes for the task of getting a best overall description within the region of interest. To accomplish this, the newly introduced global variance utility $U_{\mathrm{GV}}$ is of tremendous help in both one and two dimensions, either alone or in combination with at least the maximum variance utility $U_{\mathrm{MV}}$. As shown in Figure 2d, the best surrogate can already be established with around ninety data points by employing $U_{\mathrm{GV}}$ only (full circles in the target surface of Figure 2b), with very few deviations from the true model left.
A guess about the approximate location of an extremal structure, without excluding other regions, can be emphasized by a further refinement of the global variance utility. By letting an enveloping Gaussian act within the global variance integral of Equation (5), the result is not only much easier to tackle from a computational point of view, but the focus of the numerical search for the global optimum can also be guided by predetermining the center of the Gaussian $\mathbf{x}_G$ and its integral weight (i.e., its width $\sigma_G$).
Figure 3 shows the results for three different integral weights (0.6; 0.8; 0.95) of the enveloping Gaussian function at two guiding centers: the first one, at the origin, corresponds to an ignorant scenario where one is not at all sure about the position of the global optimum. In the second approach we suppose that the extremal structure may be found in one dimension at positive values and thereby set $x_G = 0.5$, while in two dimensions it may be located within the quadrant with positive values for $x_1$ and negative ones for $x_2$, resulting in $\mathbf{x}_G = [0.5; -0.5]$. As can already be seen in Figure 3, and all the more from the numbers of Table 1, displacing the center of the enveloping Gaussian function towards the real location of the optimum of the hidden model facilitates the development of a best surrogate surface (regarding similarity to the true model).

Author Contributions

The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Global Variance: Derivation of the Exact Integration

The variance at some (test) point $\mathbf{x}^T = (x_1, x_2, \ldots, x_{N_{\mathrm{dim}}})$ in a region confined to $[-1, 1]$ of dimension $N_{\mathrm{dim}}$ is
$$\mathrm{var}(\mathbf{x}) = k(\mathbf{x}, \mathbf{x}) - \mathbf{k}^T\left(K + \sigma_n^2\Delta\right)^{-1}\mathbf{k}. \tag{A1}$$
The covariance $k(\mathbf{x}_p, \mathbf{x}_q)$ between pairs of input variables $(\mathbf{x}_p, \mathbf{x}_q)$ is defined by
$$k(\mathbf{x}_p, \mathbf{x}_q) = \sigma_f^2\exp\left[-\frac{\left(\mathbf{x}_p-\mathbf{x}_q\right)^T\left(\mathbf{x}_p-\mathbf{x}_q\right)}{2\lambda^2}\right]. \tag{A2}$$
While the first term in Equation (A1) is simply $k(\mathbf{x}, \mathbf{x}) = \sigma_f^2$, for the second term we need
$$\mathbf{k} = \sigma_f^2\begin{pmatrix}\exp\left[-\frac{1}{2\lambda^2}\left(\mathbf{x}-\mathbf{x}_1\right)^T\left(\mathbf{x}-\mathbf{x}_1\right)\right]\\ \exp\left[-\frac{1}{2\lambda^2}\left(\mathbf{x}-\mathbf{x}_2\right)^T\left(\mathbf{x}-\mathbf{x}_2\right)\right]\\ \vdots\\ \exp\left[-\frac{1}{2\lambda^2}\left(\mathbf{x}-\mathbf{x}_N\right)^T\left(\mathbf{x}-\mathbf{x}_N\right)\right]\end{pmatrix} \tag{A3}$$
and the inversion of the matrix $M = K + \sigma_n^2\Delta$, where the matrix elements are $\Delta_{ii} = \sigma_{d_i}^2$ ($\Delta_{ij} = 0$ for $i \neq j$) and $K_{ij} = \sigma_f^2\exp\left[-\frac{1}{2\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)\right]$. For a given set of hyper-parameters the matrix $M$ does not depend on the test vector $\mathbf{x}$ and may be treated as a constant in the integration over $\mathrm{d}\mathbf{x}$. Therefore, once the inversion has been performed, the elements of $M^{-1}$ can simply be regarded as numbers. So the second term in Equation (A1) is just a sum over all components with indices $\{i, j\} \in (1, \ldots, N)$,
$$\mathbf{k}^T\left(K + \sigma_n^2\Delta\right)^{-1}\mathbf{k} = \sum_{i,j=1}^{N} k_i\left(M^{-1}\right)_{ij} k_j \tag{A4}$$
$$= \sigma_f^4\sum_{i,j=1}^{N}\left(M^{-1}\right)_{ij}\, e^{-\frac{\left(\mathbf{x}-\mathbf{x}_i\right)^T\left(\mathbf{x}-\mathbf{x}_i\right)}{2\lambda^2}}\, e^{-\frac{\left(\mathbf{x}-\mathbf{x}_j\right)^T\left(\mathbf{x}-\mathbf{x}_j\right)}{2\lambda^2}}. \tag{A5}$$
Let us further concentrate on the terms in the numerator of the exponent:
$$\left(\mathbf{x}-\mathbf{x}_i\right)^T\left(\mathbf{x}-\mathbf{x}_i\right) + \left(\mathbf{x}-\mathbf{x}_j\right)^T\left(\mathbf{x}-\mathbf{x}_j\right) = 2\,\mathbf{x}^T\mathbf{x} - \left(\mathbf{x}_i^T+\mathbf{x}_j^T\right)\mathbf{x} - \mathbf{x}^T\left(\mathbf{x}_i+\mathbf{x}_j\right) + \mathbf{x}_i^T\mathbf{x}_i + \mathbf{x}_j^T\mathbf{x}_j. \tag{A6}$$
Completing the square gives
$$2\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)^T\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right) + \frac{1}{2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right). \tag{A7}$$
We insert Equation (A7) into Equation (A5) and finally get for the variance
$$\mathrm{var}(\mathbf{x}) = \sigma_f^2 - \sigma_f^4\sum_{i,j=1}^{N}\left(M^{-1}\right)_{ij}\, e^{-\frac{1}{4\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)}\, e^{-\frac{1}{\lambda^2}\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)^T\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)}. \tag{A8}$$
Only the second exponential in Equation (A8) depends on $\mathbf{x}$ and therefore needs to be considered in the integral of the global variance:
$$\int_{-1}^{1}\mathrm{d}\mathbf{x}\;\mathrm{var}(\mathbf{x}). \tag{A9}$$
We insert Equation (A8) into Equation (A9) and keep the integral only for the term with $\mathbf{x}$ dependence:
$$\int_{-1}^{1}\mathrm{d}\mathbf{x}\;\mathrm{var}(\mathbf{x}) = 2^{N_{\mathrm{dim}}}\sigma_f^2 - \sigma_f^4\sum_{i,j=1}^{N}\left(M^{-1}\right)_{ij}\, e^{-\frac{1}{4\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)} \cdot \int_{-1}^{1}\mathrm{d}\mathbf{x}\; e^{-\frac{1}{\lambda^2}\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)^T\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)}. \tag{A10}$$
Since the term in the exponent is quadratic, it separates into a sum, which in turn allows the integral to be separated into the individual dimensions. Reduced to $N_{\mathrm{dim}}$ one-dimensional integrals, they can easily be solved by employing the error function. To show this, let us have a closer look at the integral only:
$$\int_{-1}^{1}\mathrm{d}\mathbf{x}\; e^{-\frac{1}{\lambda^2}\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)^T\left(\mathbf{x}-\frac{\mathbf{x}_i+\mathbf{x}_j}{2}\right)} = \prod_{k}^{N_{\mathrm{dim}}}\int_{-1}^{1}\mathrm{d}x_k\; e^{-\frac{1}{\lambda^2}\left(x_k-\frac{x_{ik}+x_{jk}}{2}\right)^2}. \tag{A11}$$
Focusing on the $k$th integral and substituting $\tau_k = \left(x_k - \frac{x_{ik}+x_{jk}}{2}\right)/\lambda$, error functions emerge and we finally end up with:
$$\int_{-1}^{1}\mathrm{d}x_k\; e^{-\frac{1}{\lambda^2}\left(x_k-\frac{x_{ik}+x_{jk}}{2}\right)^2} = \lambda\int_{-\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}}^{\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}}\mathrm{d}\tau_k\; e^{-\tau_k^2} \tag{A12}$$
$$= \lambda\left[\int_{-\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}}^{0}\mathrm{d}\tau_k\; e^{-\tau_k^2} + \int_{0}^{\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}}\mathrm{d}\tau_k\; e^{-\tau_k^2}\right] \tag{A13}$$
$$= \frac{\sqrt{\pi}}{2}\,\lambda\left[\mathrm{erf}\left(\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right) - \mathrm{erf}\left(-\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right)\right]. \tag{A14}$$
This completes the derivation. Inserting Equation (A14) into Equation (A10) yields the result reported in the paper:
$$\int_{-1}^{1}\mathrm{d}\mathbf{x}\;\mathrm{var}(\mathbf{x}) = 2^{N_{\mathrm{dim}}}\sigma_f^2 - \sigma_f^4\left(\frac{\sqrt{\pi}}{2}\,\lambda\right)^{N_{\mathrm{dim}}}\sum_{i,j=1}^{N}\left(M^{-1}\right)_{ij}\, e^{-\frac{1}{4\lambda^2}\left(\mathbf{x}_i-\mathbf{x}_j\right)^T\left(\mathbf{x}_i-\mathbf{x}_j\right)} \cdot \prod_{k}^{N_{\mathrm{dim}}}\left[\mathrm{erf}\left(\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right) - \mathrm{erf}\left(-\frac{1}{\lambda}-\frac{x_{ik}+x_{jk}}{2\lambda}\right)\right]. \tag{A15}$$
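As a sanity check of Equation (A15) (ours, not part of the original paper), one may compare the closed form against brute-force quadrature of the variance over the RoI in one dimension; for simplicity, uniform noise (Δ equal to the identity) and arbitrary hyper-parameter values are assumed.

```python
import numpy as np
from scipy.special import erf

lam, sigma_f, sigma_n = 0.4, 1.0, 0.1
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(6, 1))                        # six 1d input points
K = sigma_f**2 * np.exp(-0.5 * ((X - X.T) / lam) ** 2)
M_inv = np.linalg.inv(K + sigma_n**2 * np.eye(6))          # Delta = identity here

# closed form (A15) for N_dim = 1
s = (X + X.T) / (2 * lam)
box = erf(1 / lam - s) - erf(-1 / lam - s)
gauss = np.exp(-((X - X.T) ** 2) / (4 * lam**2))
closed = 2 * sigma_f**2 - sigma_f**4 * (0.5 * np.sqrt(np.pi) * lam) * np.sum(M_inv * gauss * box)

# brute-force quadrature of var(x) over [-1, 1]
xs = np.linspace(-1, 1, 2001)
k_star = sigma_f**2 * np.exp(-0.5 * ((xs[None, :] - X) / lam) ** 2)     # shape (6, 2001)
var = sigma_f**2 - np.einsum('ig,ij,jg->g', k_star, M_inv, k_star)
print(closed, np.trapz(var, xs))                           # should agree to quadrature accuracy
```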

References

  1. Mockus, J. Bayesian Approach to Global Optimization; Springer: Berlin/Heidelberg, Germany, 1989. [Google Scholar]
  2. Frazier, P.I. A Tutorial on Bayesian Optimization. 2018. Available online: http://xxx.lanl.gov/abs/1807.02811 (accessed on 11 September 2021).
  3. Ru, B.; McLeod, M.; Granziol, D.; Osborne, M.A. Fast Information-theoretic Bayesian Optimisation. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80. [Google Scholar]
  4. Hernández-Lobato, D.; Hernández-Lobato, J.; Shah, A.; Adams, R.P. Predictive Entropy Search for Multi-objective Bayesian Optimization. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; Volume 48. [Google Scholar]
  5. Rasmussen, C.; Williams, C. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  6. Krige, D.G. A Statistical Approach to Some Basic Mine Valuation Problems on the Witwatersrand. J. Chem. Metal. Mining Soc. S. Afr. 1951, 52, 119–139. [Google Scholar]
  7. Matheron, G. Principles of geostatistics. Econ. Geol. 1963, 58, 1246–1266. [Google Scholar] [CrossRef]
  8. Barber, D. Bayesian Reasoning and Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  9. Bishop, C. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1996. [Google Scholar]
  10. Cohn, D. Neural Network Exploration Using Optimal Experiment Design. Neural Netw. 1996, 9, 1071–1083. [Google Scholar] [CrossRef] [Green Version]
  11. MacKay, D.J.C. Bayesian Approach to Global Optimization: Theory and Applications; Kluwer Academic: Dordrecht, The Netherlands, 2013. [Google Scholar]
  12. Neal, R.M. Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification; Technical Report 9702; Department of Statistics, University of Toronto: Toronto, ON, Canada, 1997. [Google Scholar]
  13. Seo, S.; Wallat, M.; Graepel, T.; Obermayer, K. Gaussian process regression: Active data selection and test point rejection. In Proceedings of the International Joint Conference on Neural Networks, Como, Italy, 24–27 July 2000; pp. 241–246. [Google Scholar]
  14. Gramacy, R.B.; Lee, H.K.H. Adaptive Design and Analysis of Supercomputer Experiments. Technometrics 2009, 51, 130–145. [Google Scholar] [CrossRef] [Green Version]
  15. Preuss, R.; von Toussaint, U. Prediction of Plasma Simulation Data with the Gaussian Process Method. Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Niven, R., Ed.; AIP Publishing: Melville, NY, USA, 2014; Volume 1636, p. 118. [Google Scholar]
  16. Preuss, R.; von Toussaint, U. Global Optimization Employing Gaussian Process-Based Bayesian Surrogates. Entropy 2018, 20, 201. [Google Scholar] [CrossRef] [PubMed]
  17. Invertible Matrix, Section 3.7: Blockwise Inversion. Available online: https://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion (accessed on 25 May 2021).
  18. Preuss, R.; von Toussaint, U. Optimization Employing Gaussian Process-Based Surrogates. Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; Volume 239. [Google Scholar]
Figure 1. (a,b): One- and two-dimensional model with target data (full circles). The square in the bottom line/surface represents the true maximum. On the left the gray shaded area represents the uncertainty region of the surrogate (full line) from using the expected improvement utility only. On the right the points in the bottom surface are input data. Full circles represent additional data proposed by combination of all three utilities. (c,d): Distance surrogate/true maximum for different utilities employed in the global optimization procedure.
Figure 2. (a,b): One- and two-dimensional surrogate with newly acquired data. Surrogate solution (full line) on the left from using combination of U MV and U GV . Surrogate surface on the right from employing U GV only. (c,d): Comparison of the differences between surrogate and true model integrated over RoI for various utilities.
Figure 3. Summation over difference between grid points of target surrogate and true model as function of additionally acquired data. (a,b): One and two-dimensional case for enveloping Gaussian utility with various weights and guiding center at origin and at (0.5) or (0.5; −0.5).
Table 1. Comparison of enveloping Gaussian utilities with different integral weights and guiding centers in finding the best surrogate. 1d: Changing step to solution. 2d: Saturation level of the solution.
Integral weight of env. Gaussian within RoI          0.6     0.8     0.95
1d: Corresponding width of env. Gaussian σ_G         1.38    0.95    0.71
    U_GV with x_G = 0                                15      14      22
    U_GV with x_G = 0.5                              13      12      13
2d: Corresponding width of env. Gaussian σ_G         0.99    0.79    0.67
    U_GV with x_G = (0; 0)                           0.23    0.61    0.21
    U_GV with x_G = (0.5; −0.5)                      0.06    0.14    0.05
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
