Gaussian Process Regression Based Multi-Objective Bayesian Optimization for Power System Design

Abstract: Within a disruptively changing environment, design of power systems becomes a complex task. Meeting multi-criteria requirements with increasing degrees of freedom in design and simultaneously decreasing technical expertise strengthens the need for multi-objective optimization (MOO) making use of algorithms and virtual prototyping. In this context, we present Gaussian Process Regression based Multi-Objective Bayesian Optimization (GPR-MOBO) with special emphasis on its profound theoretical background. A detailed mathematical framework is provided to derive a computer-implementable GPR-MOBO algorithm. We quantify GPR-MOBO effectiveness and efficiency by hypervolume and by the number of computationally expensive simulations required to identify Pareto-optimal design solutions, respectively. For validation purposes, we benchmark our GPR-MOBO implementation on a mathematical test function with analytically known Pareto front and compare results to those of the well-known algorithms NSGA-II and pure Latin Hypercube Sampling. To rule out effects of randomness, we include statistical evaluations. GPR-MOBO turns out to be an effective and efficient approach with superior character versus state-of-the-art approaches and increasing value-add when simulations are computationally expensive and the number of design degrees of freedom is high. Finally, we provide an example of GPR-MOBO based power system design and optimization that demonstrates both the methodology itself and its performance benefits.


Design of Complex Power Systems
Design of power systems is facing a triple disruptive upheaval: (i) primary energy sources are being converted from fossil to renewable, (ii) grid topologies are moving from centralized to decentralized dominance, and (iii) previously separate sectors (heat, electricity, and mobility) are merging. Driven by these changes, emerging technologies will replace established ones, while at the same time the number of degrees of freedom in facility design will increase. The resulting VUCA (volatile, uncertain, complex and ambiguous) [1] environment poses significant risks for all involved stakeholders throughout the value chain, from product owners through system architects to operators and consumers. Simultaneous ecological and economic success of power plants requires the ability to identify system design alternatives with multi-objective optima in terms of cost-benefit trade-offs for individual use cases.
The lack of technical experience around emerging or new technologies significantly increases the challenge of optimization. Virtual prototyping (VP) guided by engineering experience, therefore, may be considered the standard approach for today's energy facility design. However, with an increasing number of technology alternatives and decreasing system-level technical experience, the dependence between objectives and design parameters is unknown in many cases. Power system design, thereby, becomes a black box multi-objective optimization (MOO) problem.
VP-based MOO guided by engineering intuition again fails when engineering experience is lacking and full or fractional factorial coverage of the design space is not a viable option due to (i) the dimensionality (i.e., number of degrees of freedom) of the design space and (ii) the time-consuming and costly effort required to simulate power systems. Therefore, gaining engineering knowledge about system design objectives as a function of design parameters based on as few simulations as possible is of paramount importance. VUCA-driven power system design, accordingly, requires MOO algorithms that are (i) effective, i.e., capable of identifying multi-objective (MO) optimal cost-benefit (later on called Pareto-optimal) trade-offs, and (ii) efficient, i.e., requiring only a limited number of (computationally expensive) simulations to quantify these trade-offs.
Gaussian Process Regression (GPR) [10,11] provides a regression model that (in contrast to alternative methodical approaches) (a) does not suffer from the "curse of dimensionality" [12] (see Section 3.1), and (b) inherently provides a quantification of the regression uncertainty of the model. This uncertainty is exploited by the Bayesian Optimization (BO) approach [13], which in turn is used to optimize the surrogate. For a more detailed and fundamental comparison of MOO approaches, we recommend [14].
In this paper, we present the mathematical background of GPR-based Multi-Objective Bayesian Optimization (GPR-MOBO) in detail. This includes statements and selected proofs of key results. With that theoretical foundation at hand, we derive a computer implementation of the introduced GPR-MOBO algorithm, quantify its effectiveness and efficiency, and demonstrate the superiority of GPR-MOBO over state-of-the-art MOO algorithms, including a GPR-MOBO application to a power system design example. The paper is structured as follows: Section 2 restates the introductory problem in mathematical terms, including definitions such as "Pareto-optimality" or "hypervolume". Section 3 explains Gaussian Process Regression (GPR) and a Bayesian Optimization (BO) based on it before a bipartite validation follows in Section 4: first, the superiority of the presented approach is validated via a mathematical test function before the proposed GPR-MOBO approach is applied to the design and optimization of a real-life power system. Section 5 discusses our results, and Section 6 gives a brief summary and highlights ideas for future work.

Problem Statement in Mathematical Terms
Within this chapter, we phrase the task of effectively and efficiently identifying objective trade-offs in mathematical terms. For this purpose, we consider an (unknown) black box function $f\colon X \to \mathbb{R}^t$. We refer to $X \subset \mathbb{R}^d$ as the design space with target space $\mathbb{R}^t$ for $d, t \in \mathbb{N}$. Assume further that we can sample $f$ at a finite number of points, i.e., choosing $x_1, \dots, x_N$ we obtain $f(x_1), \dots, f(x_N)$. We translate the MOO issue of identifying an optimal design $d_{opt} \in X$ to finding solutions of $f$ which represent non-dominated (Pareto-optimal) trade-offs of $f$, i.e., we are looking for Pareto points of $f$:

Definition 1 (Pareto point and front). For $t, s \in \mathbb{R}^t$, we write $t \preceq s$ whenever $t_i \le s_i$ for all components $i$. Then, given a set $S \subset \mathbb{R}^t$, a point $s \in S$ is called a Pareto point if there exists no other point $t \in S$ satisfying $t \preceq s$ and $t_i < s_i$ for some component $i$. The set of all Pareto(-optimal) points is called the Pareto front of $S$.
More generally, a point $x \in X$ is called a Pareto point of $f$ if $f(x)$ is a Pareto point of the image $f(X) \subset \mathbb{R}^t$. The set of all such points is called the Pareto front of $f$.
Based on this definition, we call an MOO algorithm effective whenever it is capable of identifying the set of non-dominated (Pareto-optimal) trade-offs of the (unknown) black box function. As a measure of effectiveness, we now define the so-called hypervolume.

Definition 2 (Hypervolume). Denote by
$$I\colon \mathbb{R}^t \times \mathbb{R}^t \to \mathcal{P}(\mathbb{R}^t), \quad (a, b) \mapsto \{y \in \mathbb{R}^t \mid a_i \le y_i < b_i \text{ for all } i = 1, \dots, t\}$$
the function sending two t-dimensional real vectors to the cube bounded by them, where $\mathcal{P}$ denotes the power set. The hypervolume of some (finite) set $Y \subset \mathbb{R}^t$ with respect to a reference point $r \in \mathbb{R}^t$ is given by the Lebesgue measure
$$HV_r(Y) = \lambda\Big(\bigcup_{y \in Y} I(y, r)\Big).$$

The hypervolume is closely related to Pareto points in the following sense.

Proposition 1. Let $S \subset \mathbb{R}^t$ be quasi-compact (i.e., bounded and closed) and $U \subset S$ be a finite subset. Let $r \in \mathbb{R}^t$ be such that $HV_r(U \cup \{s\}) > HV_r(U)$ for some $s \in S$. Then,
$$p = \operatorname*{argmax}_{x \in S} HV_r(U \cup \{x\})$$
is a Pareto point of $S$.

Proof. See Appendix A.
In simple words, Proposition 1 states that adding the image point (of the black box function) which maximizes the hypervolume yields a Pareto point. Accordingly, the hypervolume is a suitable indicator of MOO effectiveness, while the number of (simulation) samples required to find such Pareto points is a suitable measure of efficiency.
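To make Definition 2 concrete, the following minimal Python sketch computes the hypervolume of a finite point set for the two-objective minimization case considered later in our experiments; the function names and the decomposition into disjoint rectangles are our own illustration, not part of the referenced implementation.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated subset of `points` (componentwise minimization)."""
    pts = np.asarray(points, dtype=float)
    keep = [p for i, p in enumerate(pts)
            if not any(np.all(q <= p) and np.any(q < p)
                       for j, q in enumerate(pts) if j != i)]
    return np.unique(np.array(keep), axis=0)

def hypervolume_2d(points, ref):
    """Lebesgue measure of the union of boxes I(y, ref) for t = 2 objectives:
    after sorting the Pareto points by the first objective, the union
    decomposes into disjoint rectangles."""
    front = pareto_front(points)
    front = front[front[:, 0].argsort()]
    hv, level = 0.0, ref[1]
    for y1, y2 in front:
        if y1 < ref[0] and y2 < level:
            hv += (ref[0] - y1) * (level - y2)
            level = y2
    return hv

# Example: HV of {(1, 3), (2, 1)} w.r.t. reference point (4, 4) is 3 + 4 = 7.
print(hypervolume_2d([[1.0, 3.0], [2.0, 1.0]], ref=(4.0, 4.0)))
```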

Bayesian Optimization Based on Gaussian Process Regression
Gaussian Process Regression (GPR) based Bayesian Optimization (BO) using Expected Hypervolume Improvement (EHVI, see Section 3.3) as acquisition function is a promising algorithmic approach to meet the simultaneous goals of effectiveness and efficiency. The following subsections introduce the general mathematical GPR background (Section 3.1), the choice of GPR-related hyperparameters (Section 3.2), and GPR-based multi-objective BO (GPR-MOBO, Section 3.3) before summarizing these subsections as a mathematical basis for the subsequent algorithmic implementation (Section 3.4).

Gaussian Process Regression
We summarize and recall the definitions, statements, and formulas needed in order to properly apply Gaussian Process Regression (GPR).

Multivariate Normal Distribution
Let $n \in \mathbb{N}$ be a positive integer and $C \in M_n(\mathbb{R})$ be a real, positive definite matrix of dimension $n$, with $M_n(\mathbb{R})$ being the space of $n \times n$ matrices with values in $\mathbb{R}$. Let $m \in \mathbb{R}^n$ be an $n$-dimensional real vector. Recall the multivariate normal distribution $\mathcal{N}(m, C)$ to be the probability measure on $\mathbb{R}^n$ induced by the density function
$$f_{C,m}(x) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp\Big({-\frac{1}{2}} (x - m)^T C^{-1} (x - m)\Big).$$
The vector $m$ is called mean(-vector) and the matrix $C$ is called covariance matrix of $\mathcal{N}(m, C)$.
Multivariate normal distributions are stable under conditioning in the following sense.
Theorem 2. Let $X_1, X_2$ be two random variables such that $(X_1, X_2)$ is multivariate normal $\mathcal{N}(m, C)$-distributed with mean $m = (m_1, m_2) \in \mathbb{R}^p \times \mathbb{R}^q$ and covariance matrix
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
Then, given some $x \in \mathbb{R}^q$ in the co-domain of $X_2$, the conditional density function of $X_1$ given $X_2 = x$ is given by $f_{C',m'}$ with
$$m' = m_1 + C_{12} C_{22}^{-1} (x - m_2) \quad \text{and} \quad C' = C_{11} - C_{12} C_{22}^{-1} C_{21}.$$

Proof. Theorem 2.5.1 in [15].
In particular, conditioning a multivariate normal distribution yields a multivariate normal distribution again.
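As a quick numerical illustration of Theorem 2 (all numbers are hypothetical), the conditional mean and covariance can be computed directly from the block formulas:

```python
import numpy as np

# Joint Gaussian (X1, X2) with p = q = 1, so the blocks are scalars.
m1, m2 = 0.0, 1.0
C11, C12, C22 = 2.0, 0.8, 1.0

x = 1.5  # observed value of X2
m_cond = m1 + C12 / C22 * (x - m2)   # conditional mean m' = 0.4
C_cond = C11 - C12 / C22 * C12       # conditional covariance C' = 1.36
print(m_cond, C_cond)
```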

Stochastic Processes
Let (Ω, F , P) be a probability space, I a set and (S, F ) be a measurable space.

Definition 3.
A stochastic process with state space S and index set I is a collection of random variables X i : Ω → S, i ∈ I.

Remark 1.
Recall that arbitrary products of measurable spaces exist and their underlying sets are given by the Cartesian products of sets. By the universal property of the product, a stochastic process X i , i ∈ I, therefore, consists of the same data as a measurable function X : Ω → S I .
Given a stochastic process $X\colon \Omega \to S^I$, in practice we are mostly interested in the induced measure $P_X = P(X^{-1}(-))$ on $S^I$. On the other hand, given such a probability measure $P_I$ on $S^I$, we obtain a stochastic process $pr_i\colon S^I \to S$, $i \in I$, given by the canonical projections. In that sense, a stochastic process may be seen as a proper construction of a probability measure on the product space $S^I$.

Gaussian Process
With the definition of stochastic processes at hand, we can generalize the multivariate normal distribution (defined on finite products of real numbers) to possibly infinite products of real numbers in the following sense.

Definition 4 (Gaussian Process)
Let $X$ be a set. A Gaussian Process with index set $X$ is a family of real valued random variables $(r_x)_{x \in X}$ such that for every finite subset $U \subset X$, the random variable $(r_u)_{u \in U}$ is multivariate normally distributed.
Recall that, by the above, this induces a probability measure on $\mathbb{R}^X$. We can "construct" Gaussian Processes in the following way:

Theorem 3. Let $X$ be a set, $C\colon X \times X \to \mathbb{R}$ be a positive quadratic form in the sense that for every finite subset $U \subset X$ the induced matrix $C_U = (C(u_1, u_2))_{u_1, u_2 \in U}$ is positive definite, and let $m\colon X \to \mathbb{R}$ be a function. Given a subset $U \subset X$, denote by $m_U = (m(u))_{u \in U}$ the induced vector.
Then, there exists a unique probability measure $P$ on $\mathbb{R}^X$ satisfying
$$P \circ pr_U^{-1} = \mathcal{N}(m_U, C_U)$$
for all finite $U \subset X$, where $pr_U\colon \mathbb{R}^X \to \mathbb{R}^U$ denotes the canonical projection. The function $C$ is called covariance function and $m$ is called mean function of $P$.

Proof. See Appendix B.
In other words, we construct Gaussian Processes by choosing a positive quadratic form $C$, further referred to as covariance function, and a mean function $m$.
Example 4 (Squared exponential kernel). The squared exponential kernel
$$C(x, x') = \sigma_c^2 \exp\Big({-\frac{\|x - x'\|^2}{2 l^2}}\Big) \qquad (1)$$
is a covariance function (i.e., a positive quadratic form; see [16]) for every $l, \sigma_c > 0$. The parameter $l$ is called lengthscale and the parameter $\sigma_c^2$ is called output variance. Other covariance functions may also be found in [16].
Example 5 (Covariance with white Gaussian noise). Let $m$ be a function and $C$ be a covariance function. Given $U = \{x_1, \dots, x_n\} \subset X$, the reader may convince himself that the function
$$C'(x_i, x_j) = C(x_i, x_j) + \sigma^2 \delta_{ij},$$
with $\delta_{ij}$ the Kronecker delta, is a positive quadratic form for each $\sigma^2 > 0$. Note that $\sigma^2$ may be considered as a hyperparameter.
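A minimal Python sketch of the covariance functions from Examples 4 and 5 for one-dimensional inputs (the hyperparameter values are placeholders):

```python
import numpy as np

def sq_exp_kernel(x1, x2, lengthscale=1.0, output_variance=1.0):
    """Squared exponential kernel of Example 4 (1D inputs)."""
    d = np.subtract.outer(x1, x2)  # pairwise differences
    return output_variance * np.exp(-d**2 / (2.0 * lengthscale**2))

def noisy_kernel_matrix(x, lengthscale=1.0, output_variance=1.0, noise=1e-4):
    """Covariance matrix of Example 5: white Gaussian noise on the diagonal."""
    return sq_exp_kernel(x, x, lengthscale, output_variance) + noise * np.eye(len(x))
```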
Combining Theorems 2 and 3, we derive an appropriate "conditioned" Gaussian Process.
Corollary 6. Let $(r_x)_{x \in X}$ be a Gaussian Process with index set $X$, $C\colon X \times X \to \mathbb{R}$ its covariance function and $m\colon X \to \mathbb{R}$ its mean function. Let $U \subset X$ be a finite subset consisting of $n \in \mathbb{N}$ elements and $y \in \mathbb{R}^n$. Then, there exists a unique probability measure $P'$ on $\mathbb{R}^X$ such that for every finite subset $J \subset X - U$ the density function of $P' \circ pr_J^{-1}$ is given by the conditional density function of $((r_j)_{j \in J}, (r_u)_{u \in U})$ given $(r_u)_{u \in U} = y$.
Its mean function $m'$ and covariance function $C'$ are constructed as follows: For every $x, x' \in X$, denote by $C_{x,U} = (C(x, u))_{u \in U}$ the induced row vector, by $C_{U,x'} = (C(u, x'))_{u \in U}$ the induced column vector, and by $C_U = (C(u_1, u_2))_{u_1, u_2 \in U}$ the induced matrix. Then,
$$m'(x) = m(x) + C_{x,U} C_U^{-1} (y - m_U) \qquad (2)$$
and
$$C'(x, x') = C(x, x') - C_{x,U} C_U^{-1} C_{U,x'}. \qquad (3)$$

Proof. See Appendix B.

Gaussian Process Regression
Consider a supervised learning problem, i.e., an unknown ("black box") function $f\colon X \to \mathbb{R}$ together with training points $T = \{(x_1, y_1), \dots, (x_N, y_N)\} \subset X \times \mathbb{R}$ with $y_i = f(x_i)$. The task is to find an appropriate approximation of $f$. To solve this task, we may use Gaussian Process Regression, the idea of which is to (i) define a Gaussian Process on $X$ by defining a mean and covariance function on $X$ (Theorem 3), (ii) condition that Gaussian Process in the sense of Corollary 6 with $U = \{x_1, \dots, x_N\}$ and $y = (y_1, \dots, y_N)$, and (iii) use $m'$ from Equation (2) as approximation of $f$.
A GPR for $f$ and $T$ is then the data of a Gaussian Process on $X$ conditioned to $U = \{x_1, \dots, x_N\}$ and $y = (y_1, \dots, y_N)$.

Remark 2. By its very nature, a GPR is equipped with a natural measure of prediction uncertainty. Instead of a single point prediction $y$ for $f(x)$ with $x \in X$, we obtain a probability distribution $\mathcal{N}(m'(x), C'(x, x))$, with $C'(x, x)$ quantifying the uncertainty in the prediction $m'(x)$ at $x$. Figure 2 illustrates the conditioning of a GPR to some new evaluations.
Figure 2. Top: real function f = sin and a GPR with mean $m'$ and covariance function $C'$, where the band symbolizes the mean plus/minus one standard deviation (i.e., $m'(x) \pm \sqrt{C'(x, x)}$) at each point. Bottom: same as top with $m'$ and $C'$ conditioned to the evaluations marked by vertical cyan dashes.
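The following self-contained Python sketch reproduces the setting of Figure 2: a zero-mean Gaussian Process with squared exponential kernel is conditioned on a few evaluations of f = sin via Equations (2) and (3). All parameter values are illustrative.

```python
import numpy as np

def k(a, b, l=0.7, s2=1.0):
    return s2 * np.exp(-np.subtract.outer(a, b) ** 2 / (2 * l**2))

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 2.0 * np.pi, 6)     # the "vertical cyan dashes"
y_train = np.sin(x_train)                      # black box evaluations
x_test = np.linspace(0.0, 2.0 * np.pi, 200)

K = k(x_train, x_train) + 1e-6 * np.eye(6)     # small jitter term (Example 5)
K_star = k(x_test, x_train)

mean = K_star @ np.linalg.solve(K, y_train)                      # Equation (2)
cov = k(x_test, x_test) - K_star @ np.linalg.solve(K, K_star.T)  # Equation (3)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))  # band m'(x) +/- std of Figure 2
```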

GPR Hyperparameter Adaption
Using a GPR for supervised learning problems requires the choice of some (initial) mean function and covariance function (Theorem 3). Most examples of covariance functions involve the choice of hyperparameters. Example 4 involves the choice of a lengthscale and output variance.
Consider a supervised learning problem $f\colon X \to \mathbb{R}$ with training points $T = \{(x_1, y_1), \dots, (x_N, y_N)\} \subset X \times \mathbb{R}$. Given a mean function $m$ and a family of covariance functions $C_\theta$ with $\theta$ an element of some index set $\vartheta$, we choose a hyperparameter $\theta$ by following the maximum likelihood principle.
Denote by $f_{C,m,\theta}$ the density function of the multivariate normal distribution $\mathcal{N}(m_U, C_{\theta,U})$ with $U = \{x_1, \dots, x_N\}$. We choose $\theta$ by solving
$$\theta = \operatorname*{argmax}_{\theta \in \vartheta} f_{C,m,\theta}(y). \qquad (5)$$

Remark 3. In practice, one often replaces $f_{C,m,\theta}$ with $\ln f_{C,m,\theta}$ and solves
$$\theta = \operatorname*{argmax}_{\theta \in \vartheta} \ln f_{C,m,\theta}(y),$$
resulting in identical parameters. However, the logarithm is numerically more convenient to work with.
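A minimal sketch of the hyperparameter adaption of Remark 3 for the squared exponential kernel, minimizing the negative log marginal likelihood with scipy; the optimizer choice, the starting values, and the fixed noise term are our assumptions, not the Matlab setup described in Section 4.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, x, y, noise=1e-4):
    """-ln f_{C,m,theta}(y) for a zero-mean GP with squared exponential kernel."""
    l, s2 = np.exp(theta)  # optimize in log space to keep l and sigma_c^2 positive
    K = s2 * np.exp(-np.subtract.outer(x, x) ** 2 / (2 * l**2)) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 * y^T K^{-1} y + 0.5 * ln det K + (N/2) * ln 2*pi
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

x = np.linspace(0.0, 1.0, 8)
y = np.sin(6.0 * x)
res = minimize(neg_log_marginal_likelihood, x0=np.log([0.3, 1.0]), args=(x, y))
lengthscale, output_variance = np.exp(res.x)
```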

Bayesian Optimization
We define the hypervolume improvement as the gain of hypervolume when adding new points. In some places, the underlying function used for calculating new sample (or infill) points is called the acquisition function.
Definition 5 (Hypervolume Improvement). Given a reference point $r \in \mathbb{R}^t$ and a finite set of vectors $F \subset \mathbb{R}^t$, the hypervolume improvement of some $y \in \mathbb{R}^t$ is defined as
$$HVI_r(F, y) = HV_r(F \cup \{y\}) - HV_r(F).$$
We denote by $HVI_{r,F}\colon \mathbb{R}^t \to \mathbb{R},\ y \mapsto HVI_r(F, y)$ the resulting function. We often write $HVI_r$ instead of $HVI_{r,F}$ whenever $F$ is clear from context. Observe that $HVI_{r,F}$ is continuous (see Appendix C), hence integrable, on a bounded subset.

Remark 4.
Maximizing the Hypervolume improvement results in Pareto points (see Proposition 1).

Consider a black box function $f\colon X \to \mathbb{R}^t$ with known evaluations $F$. Given an approximation $\tilde{f}$ of $f$ (such as the mean of a GPR) and a suitable reference point $r \in \mathbb{R}^t$, we strive to calculate
$$\operatorname*{argmax}_{x \in X} HVI_{r,F}(\tilde{f}(x))$$
in order to find a preimage of a Pareto point of $f$.
Recall that GPRs include a prediction uncertainty measure (Remark 2). We can take this additional information into account when maximizing the hypervolume improvement in the following way.
Definition 6 (Expected Hypervolume Improvement). Let mean functions $m_i\colon X \to \mathbb{R}$ and covariance functions $C_i\colon X \times X \to \mathbb{R}$ on $X$ for $i = 1, \dots, t$ be given. Denote by $m(x) = (m_i(x))_{i=1,\dots,t}$ the induced mean vector and by $C(x)$ the diagonal matrix with $C(x)_{ii} = C_i(x, x)$. Then, the expected hypervolume improvement at $x \in X$ is given by the expected value of HVI with respect to the probability measure $\mathcal{N}(m(x), C(x))$, i.e.,
$$EHVI_{r,F}(x) = \int_{\mathbb{R}^t} HVI_{r,F}(y)\, f_{C(x),m(x)}(y)\, dy. \qquad (8)$$

In many situations, the training data (more precisely, the evaluations) are manipulated (i.e., pre-processed) before training (i.e., before calculating the hyperparameters of the mean and covariance functions). By the very definition, we obtain the following corollary:

Corollary 7. Let $S \subset \mathbb{R}^t$ and $g\colon S \to \mathbb{R}^t$ be a function. Assume $g$ satisfies $a_i < b_i$ if and only if $g(a)_i < g(b)_i$ for all $i = 1, \dots, t$ and $a, b \in S$. Then, $s \in S$ is a Pareto point if and only if $g(s) \in g(S)$ is a Pareto point.
Therefore, any function satisfying the above assumptions may be used for pre-processing of data points in the context of multi-criteria optimization.

Remark 5.
The expected hypervolume improvement involves the choice of some reference point. By construction, this choice affects the (expected) hypervolume contribution of any point in the target space. Notice that the reference point must be strictly greater in every component than every Pareto-optimal solution in order to ensure a hypervolume contribution greater than zero for every such point. For example, if the black box function $f$ factorizes through $[0, 1]^t \subset \mathbb{R}^t$, then such a reference point may be chosen by
$$r = (1 + \varepsilon, \dots, 1 + \varepsilon) \qquad (9)$$
for some $\varepsilon > 0$.
Further discussion of reference point selection may be found in the literature, e.g., in [17].
Roughly speaking, the expected hypervolume improvement is an extension of the hypervolume improvement incorporating the uncertainty information encapsulated in the GPRs. One hopes that maximizing the expected hypervolume improvement (of the GPRs) maximizes the hypervolume improvement of the black box function more efficiently than simply maximizing the hypervolume improvement of the underlying mean functions of the GPRs. Note that incorporating the (model) uncertainty allows reflecting a trade-off between exploration and exploitation.
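The exact calculation of Equation (8) is delegated to the algorithms of [18,19] in our implementation; as a rough illustration only, the integral can also be estimated by Monte Carlo sampling from the independent per-objective GPR predictions, reusing the hypervolume_2d helper sketched after Definition 2:

```python
import numpy as np

def ehvi_mc(mean, var, front, ref, n_samples=2000, rng=None):
    """Monte Carlo estimate of the expected hypervolume improvement at one
    design point. `mean` and `var` are the per-objective GPR predictions,
    `front` the current evaluations, `ref` the reference point (t = 2)."""
    rng = rng or np.random.default_rng()
    hv_old = hypervolume_2d(front, ref)
    samples = rng.normal(mean, np.sqrt(var), size=(n_samples, len(mean)))
    improvements = [hypervolume_2d(np.vstack([front, s]), ref) - hv_old
                    for s in samples]
    return float(np.mean(improvements))
```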

Summary: Base for an Algorithmic Implementation
We briefly summarize the necessary steps in order to apply GPR and EHVI based multi-criteria optimization to a black box function $f\colon X \to \mathbb{R}^t$. Let $y_1 = f(x_1), \dots, y_N = f(x_N)$ be evaluations of $f$. We define $X_0 = \{x_1, \dots, x_N\}$ and $Y_0 = \{y_1, \dots, y_N\}$.

Setting Up the GPRs
For each $i = 1, \dots, t$ we choose a mean function $m_i$ and a covariance function (i.e., a positive quadratic form) $C_i$ on $X$. Examples of covariance functions can be found in [16]. By Theorem 3, we obtain a Gaussian Process for each $i$. In case the covariance function involves the choice of some hyperparameter, we determine that parameter by solving Equation (5). Next, we condition each mean $m_i$ and covariance function $C_i$ to $T$ using Equations (2) and (3), respectively. We obtain GPRs, defined by their conditioned means $m'_i$ and covariance functions $C'_i$ for each $i = 1, \dots, t$ (i.e., for each output component).

Maximizing the EHVI
We maximize the expected hypervolume improvement, i.e., we solve
$$p = \operatorname*{argmax}_{x \in X} EHVI_{r,F}(x) \qquad (10)$$
according to Equation (8) with respect to the mean functions $m'_i$ and covariance functions $C'_i$. Algorithms for the calculation of the expected hypervolume improvement may be found in [18] and [19]. Lastly, we evaluate the black box function $f$ at the found $p$. We close this section with a couple of practical remarks:
• GPRs form a rich class of regression models. However, evaluating a GPR involves the inversion of an $N \times N$ matrix with $N$ being the number of training points (see Equation (3)). Accordingly, evaluating a GPR tends to get slow with an increasing number of training points.
• In addition, GPRs (as any regression model) require a careful pre-processing of the data in order to produce reasonable results. At the very least, the inputs and outputs of the training data should be normalized (e.g., to (0, 1)).
To enable the reader to understand the GPR-MOBO algorithm described below in a comprehensible way, we present here a pre-processing example of the training data: Denote by $x_{min} = (\min_{x \in X}(x_i))_{i=1,\dots,d}$ and $x_{max} = (\max_{x \in X}(x_i))_{i=1,\dots,d}$ the componentwise minimum resp. maximum within the design space $X$. Then, define
$$n_X(x) = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (11)$$
where the division is performed componentwise, and $X_n = n_X(X_0)$. Assuming $X$ to be bounded and $x_{max,i} \neq x_{min,i}$ for all components $i$, this is well defined. Furthermore, we define $y_{min} = (\min_{y \in Y_0}(y_i))_{i=1,\dots,t}$ and $y_{max} = (\max_{y \in Y_0}(y_i))_{i=1,\dots,t}$ by the componentwise minimum resp. maximum of the evaluations. Without loss of generality, we may assume $y_{max,i} \neq y_{min,i}$ for all $i$. Define
$$n_Y(y) = \frac{y - y_{min}}{y_{max} - y_{min}} \qquad (12)$$
and $Y_n = n_Y(Y_0)$. It is straightforward to check that $n_Y$ satisfies the assumptions of Corollary 7.
We obtain normalized training data $T_n = \{(n_X(x_1), n_Y(y_1)), \dots, (n_X(x_N), n_Y(y_N))\}$.
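A sketch of this pre-processing in Python (the sample values are invented for illustration):

```python
import numpy as np

def normalize(v, v_min, v_max):
    """Componentwise map of Equations (11)/(12): (v - v_min) / (v_max - v_min)."""
    return (np.asarray(v, float) - v_min) / (np.asarray(v_max) - np.asarray(v_min))

# Design space bounds are known a priori; target bounds come from the evaluations.
X_0 = np.array([[0.0, 10.0], [5.0, 30.0], [2.5, 20.0]])
Y_0 = np.array([[3.0, 100.0], [1.0, 400.0], [2.0, 250.0]])
x_min, x_max = np.array([0.0, 10.0]), np.array([5.0, 30.0])

X_n = normalize(X_0, x_min, x_max)                        # Equation (11)
Y_n = normalize(Y_0, Y_0.min(axis=0), Y_0.max(axis=0))    # Equation (12)
```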

Algorithmic Implementation and Validation
In this section, we first derive an algorithmic implementation of an adaptive Bayesian Optimization (BO) algorithm based on GPR infill, making use of the expected hypervolume improvement as acquisition function. In Section 4.2, we apply this algorithm to a mathematical test function with known Pareto front before validating it in Section 4.3 within the context of a real-world power system application.

GPR-MOBO Algorithmic Implementation
The proposed GPR-MOBO workflow presented in Algorithm 1 strictly follows the sequence of steps for applying GPR with EHVI-based MOO to a black box function as described in Section 3.4.
Wherever applicable, we refer to equations as referenced in Section 3. As an initial design of experiments (DoE) X_0, we propose Latin Hypercube Sampling (LHCS) according to [20] as the basis of the initial computationally expensive black box function evaluation f(X_0); an illustrative way to draw such a DoE is sketched below. We assume the initial sampling by N_0 expensive evaluations to cover the full range of parameter values within the target space, so as to guarantee that its image lies in the unit cube after normalization.
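For illustration, such an initial DoE can be drawn, e.g., with scipy's quasi-Monte Carlo module (dimension, sample count, and bounds are placeholders; our reference implementation uses Matlab instead):

```python
from scipy.stats import qmc

d, N0 = 5, 30                                  # design space dimension, initial samples
sampler = qmc.LatinHypercube(d=d, seed=1)
X0_unit = sampler.random(n=N0)                 # N0 points in the unit cube [0, 1)^d
X0 = qmc.scale(X0_unit, l_bounds=[0.0] * d, u_bounds=[1.0] * d)  # map to design bounds
```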
N_tot (the maximum number of samples available for computationally expensive black box function evaluations) and N_0 (the fraction of N_tot used as the number of initial samples) may be considered the GPR-MOBO hyperparameters.

Algorithm 1 Structural MO Bayesian Optimization workflow.
  X_0 ← LHCS DoE                       ▷ Initial design of experiments
  Y_0 ← f(X_0)                         ▷ Simulate at X_0 (expensive)
  m ← 0-map                            ▷ Choose zero map as mean function
  C ← (1)                              ▷ Choose covariance function
  r ← (9)                              ▷ Choose reference point
  while |X_0| < N_tot do               ▷ unless abortion criterion is met
      X_n, Y_n ← (11), (12)            ▷ Pre-process data
      for i = 1 . . . t do             ▷ for each dimension of target space
          θ_i ← (5)                    ▷ Calculate hyperparameters
          m'_i ← (2)                   ▷ Condition mean
          C'_i ← (3)                   ▷ Condition covariance
      end for
      p ← (10)                         ▷ Calculate optimal infill sample point
      X_0 ← X_0 ∪ {p}, Y_0 ← Y_0 ∪ {f(p)}  ▷ Acquire according design space point (expensive)
  end while
  return X_P, Y_P                      ▷ Pareto-optimal subset of acquired samples and evaluations

Algorithm 1 has been implemented in Matlab 2021b, making use of the minimizer fmincon based on the sqp option (called multiple times with different initial values by GlobalSearch) to find the minimum of the negative log marginal likelihood function for hyperparameter adaption (see Section 3.2) and for BO (see Section 3.3) with application to the negative EHVI function, which is provided by [21]. At this point, it seems worthwhile noting that GlobalSearch does not yield deterministic results, i.e., multiple runs with identical input values may vary in their output values.
To make the computation more efficient, the Cholesky decomposition is used for the numerical inversion of $C_F$.
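In Python terms, the same idea amounts to factorizing $C_F$ once and reusing the factorization for all solves instead of forming $C_F^{-1}$ explicitly (a standard numerical practice, sketched here with scipy; not the authors' Matlab code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_spd(C_F, b):
    """Solve C_F x = b for a symmetric positive definite C_F via Cholesky."""
    factor = cho_factor(C_F)     # O(N^3) factorization, done once
    return cho_solve(factor, b)  # cheap triangular solves for each right-hand side
```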

Test Function Based Validation
In this subsection, we aim for an effectiveness and efficiency comparison of Algorithm 1 versus the state-of-the-art alternatives LHCS [20] and NSGA-II [22]. We present results of their application to a well-defined black box function with analytically known Pareto front. In our case, we picked the to-be-minimized test function ZDT1 according to [22],
$$f_1(x) = x_1, \qquad f_2(x) = g(x)\Big(1 - \sqrt{f_1(x)/g(x)}\Big), \qquad g(x) = 1 + \frac{9}{d - 1}\sum_{i=2}^{d} x_i,$$
with $x \in [0, 1]^d$ and $2 \le d \le 30$ indicating the design space dimension. Note that ZDT1 exhibits a convex Pareto front independent of $d$.
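For reference, a direct implementation of the ZDT1 test function as defined in [22]:

```python
import numpy as np

def zdt1(x):
    """ZDT1 [22]: returns (f1, f2) for x in [0, 1]^d with d >= 2 (minimization)."""
    x = np.asarray(x, dtype=float)
    f1 = x[0]
    g = 1.0 + 9.0 * np.sum(x[1:]) / (len(x) - 1)
    f2 = g * (1.0 - np.sqrt(f1 / g))
    return np.array([f1, f2])

# The analytic Pareto front is attained for x_2 = ... = x_d = 0: f2 = 1 - sqrt(f1).
```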
We compare results for the alternative Pareto front search algorithms granting N_tot = 50 black box evaluations, plus an additional NSGA-II analysis with N_tot = 200 black box evaluations. For LHCS, all samples were spent within the initial run, while Algorithm 1 started with N_0 = 30 initial LHCS samples. In the case of N_tot = 50 total samples, the NSGA-II algorithm was applied in n_gen = 5 generations with a population of n_pop = 10 samples, while in the case of N_tot = 200 total samples, n_gen = 8 generations with a population of n_pop = 25 samples were run. For statistical purposes, the algorithm evaluations were repeated fifty times with different random starting values. The design space dimension was varied over d ∈ {2, 3, 5, 10, 15, 20, 25, 30}. The reference point during optimization was chosen according to Equation (9) with ε = 2%. Making use of knowledge about the ZDT1 target domain, we fixed y_min = (0, 0) for pre-processing. To quantify the search algorithms' performance, the hypervolume with respect to the reference point (2, 10) in relation to the (known) maximal hypervolume is evaluated. The results are plotted in Figure 3.
Box plots in Figure 3 indicate statistical evaluations of the repeated search runs. GPR-MOBO results are drawn in red, LHCS results in blue, and NSGA-II results in brown for N_tot = 50 and in green for N_tot = 200.

Power System Design Based Validation
Within the last subsection of this chapter, we apply the GPR-MOBO Algorithm 1 to the design and optimization of a power system example. Figure 4 illustrates both the toolchain and the generic power system model as used for the GPR-MOBO power system design and optimization validation.
For the energy domain specific "Power System Modeling and Simulation Environment" of the toolchain in our example, we use the commercial tool PSS®DE [23]. Connected to the "Script Control" (implemented in Python code) through an "xml interface", adjustable design space parameters of the power system model receive dedicated parameter values as computed within the "Algorithmic Workflow" (execution of the GPR-MOBO algorithm in Matlab 2021b code). Data stored within the "Data Base" are accessible for "Statistical Evaluation" and "Visualisation" (all implemented in Python code). The generic power system model for our validation example is defined by a star topology connecting a standard H0-profile [24] electric load with 20.2 GWh annual demand to three aggregate components ("Wind turbine", "PV", "Battery") and a "Public Grid". The "Power System Modeling and Simulation Environment" simulates the power system and returns results in terms of well-defined key performance indicators (KPIs): CAPEX (capital expenditures for installation of the according power system) and CO2 (amount of carbon dioxide emitted for a given configuration to provide the total amount of energy). The KPI behavior of the "Power System Modeling and Simulation Environment" can therefore be viewed as a black box function. Parameter value ranges defining the design space for the system of interest are listed in Table 1. As mentioned in Section 3.4, for the GPR-MOBO, the design and target space samples (design parameters and KPIs, respectively) are normalized according to Equations (11) and (12).
The results of the experiment are shown in Figure 5. N_0 = 30 initial samples and N_tot = 50 samples in total were chosen as GPR-MOBO hyperparameters, while the total emission of carbon dioxide (CO2 in kilotons) and the capital expenditures for acquisition and installation of aggregate components (CAPEX in million euros) were selected as the trade-off KPIs to be evaluated. Making use of knowledge about the according target domain, we fixed y_min = (0, 0) for pre-processing. The reference point during optimization was chosen according to Equation (9) with ε = 2%. Figure 5 shows the subspace of the target space acquired by the experiment. All sample points are marked by crosses, the Pareto-optimal solutions forming the Pareto front are highlighted by red crosses, and the remaining portion of the initial (30 LHCS) results is marked by cyan crosses. In addition, an approximate Pareto front is plotted, resulting from a full-factorial design space latticing based on 1200 sample points. Non-dominated Pareto points are indicated by blue dots, with the Pareto front completed by linear interpolation (blue dotted line).

Discussion
According to Section 2, we may interpret the identified hypervolume of a black box function as an indicator of effectiveness for a given Pareto front search algorithm. Putting that identified hypervolume in ratio to the number of computationally expensive evaluations required to identify it, in turn, may be considered a suitable indicator of the algorithm's efficiency. Applying these definitions to the results obtained in the previous Section 4, Figure 3 clearly indicates that GPR-MOBO identifies a larger fraction of the maximal hypervolume than LHCS and NSGA-II at an identical budget of black box evaluations, with the advantage growing with the design space dimension d. On the other hand, it is worthwhile mentioning that the GPR-MOBO algorithm requires more computation time than the standard LHCS or NSGA-II algorithms. Depending on the test function dimension d and the hardware environment, GPR-MOBO runs (i.e., 20 iterations) took between 3 min and 11 min, wherein the fraction for evaluating the black box function (ZDT1) can be treated as negligible, while NSGA-II or LHCS runs took less than a tenth of this time. This indicates that GPR-MOBO is advisable if the black box function is expensive to evaluate in terms of capital, time, or other resources.
The experimental validation using an unknown black box function, whose result is shown in Figure 5, again validates the effectiveness and efficiency of GPR-MOBO. The Pareto front as identified by N_tot = 50 black box evaluations is already very close to the front as approximated by N_tot = 1200 full-factorial lattice based black box evaluations. Some GPR-MOBO Pareto points even dominate those identified by the full-factorial lattice. The example thereby demonstrates the effectiveness and efficiency of GPR-MOBO based power system design and optimization.
The results shown indicate a general superiority of GPR-MOBO over state-of-the-art algorithms. However, this has been demonstrated only on an exemplary basis. We therefore point out the inadmissibility of a generalization of superiority: a general superiority of GPR-MOBO cannot and should not be derived from single, individual test functions or application examples. Such a fundamental superiority would have to be mathematically proven and would presumably require fundamental knowledge of the black box function itself or of the Pareto front spanned by this black box function.

Conclusion and Outlook
In this paper, we tackled the challenge of power system design and optimization in a VUCA environment. We proposed a Multi-Objective Bayesian Optimization based on Gaussian Process Regression (GPR-MOBO) in the context of power system virtual prototyping. After a mathematical reformulation of the challenge, we presented the background of Gaussian Process Regression, including hyperparameter adaption and use in the context of a Bayesian Optimization approach, focusing on the expected improvement in hypervolume. For validation purposes, we benchmarked our GPR-MOBO implementation statistically based on a mathematical test function with analytically known Pareto front and compared results to those of the well-known algorithms NSGA-II and pure Latin Hypercube Sampling. We demonstrated superiority of the GPR-MOBO approach over the compared algorithms, especially for high dimensional design spaces. Finally, we applied the GPR-MOBO algorithm for planning and optimization of a power system (energy park) in terms of selected performance indicators of exemplary character.
Concluding, GPR-MOBO turned out as an effective and efficient approach for power system design and optimization in VUCA environment with superior character when simulations are computationally expensive and the number of design degrees of freedom is high.
Some topics remain open for future investigation. Besides a performance comparison with algorithms other than those already selected, some detailed questions within the GPR-MOBO family are worthwhile to be considered. This includes topics such as the choice of acquisition function, pre-processing, selection of the reference point when the (expected) hypervolume (improvement) is put in focus, and the application of various global optimizers, to name just a few. One level above, questions not yet satisfactorily answered address the extension to mixed-integer design spaces and issues related to constraint handling.

Appendix A. Hypervolume Improvement and Pareto Points

Lemma A1. Let $x, x', r \in \mathbb{R}^t$ with $x \prec r$. For $i = 1, \dots, t$, define $c_i \in \mathbb{R}^t$ by $(c_i)_j = r_j$ for $j \neq i$ and $(c_i)_i = \min(x'_i, r_i)$. Then,
$$I(x, r) = \big(I(x, r) \cap I(x', r)\big) \cup \bigcup_{i=1}^{t} I(x, c_i),$$
and the two sets on the right-hand side are disjoint.

Proof. First, we prove "⊇": Since $c_i \preceq r$ for all $i$, we obtain $I(x, c_i) \subset I(x, r)$, hence "⊇". Secondly, we prove "⊆": Given some $z \in I(x, r) - I(x', r)$, there exists an $i$ with $x_i \le z_i < x'_i$. Since $z \prec r$, we obtain $z \in I(x, c_i)$. This proves "⊆".
Moreover, if $\max(x_i, x'_i) = x_i$, then $I(x, c_i) = \emptyset$; hence, $z \in I(x, c_i)$ implies $x_i \le z_i < x'_i$ and thus $z \notin I(x', r)$. This proves that $I(x, r) \cap I(x', r)$ and $\bigcup_{i=1}^{t} I(x, c_i)$ are disjoint.
Corollary A2. For $x, x', r \in \mathbb{R}^t$ with $x \prec r$ and $x_i < x'_i$ for some $i$, there exists $d$ with $x \prec d \preceq r$ and
$$I(x, d) \subseteq I(x, r) - I(x', r).$$

Proof. We may assume $x' \prec r$, since otherwise $I(x', r) = \emptyset$ and the claim follows for $d = r$.
Using Lemma A1, we obtain $I(x, c_i) \subseteq I(x, r) - I(x', r)$. Furthermore, $x \prec c_i \preceq r$ since $x \prec r$ and $x_i < x'_i$. Thus, $d = c_i$ satisfies the claim.
Corollary A3. Let $S \subset \mathbb{R}^t$ be a finite subset and $x, r \in \mathbb{R}^t$ with $x \prec r$. Assume further that for every $s \in S$ there exists $i_s$ such that $x_{i_s} < s_{i_s}$. Then, there exists $c \in \mathbb{R}^t$ with $x \prec c \preceq r$ and
$$I(x, c) \subseteq I(x, r) - \bigcup_{s \in S} I(s, r).$$
In particular, the Lebesgue measure $\lambda(I(x, c))$ of $I(x, c)$ is greater than zero.
Proof. We prove the claim by induction over the size $n$ of $S$. For $n = 1$, this is Corollary A2. Assume the claim holds for all such $S$ of size equal to $n$. Given $S$ with cardinality $n + 1$, choose some $s_1 \in S$; by the induction hypothesis, there exists such a $c$ with $I(x, c) \subset I(x, r) - \bigcup_{s \in S,\ s \neq s_1} I(s, r)$.
Furthermore, we obtain $I(x, c') \subseteq I(x, c) - I(s_1, c)$ for some $x \prec c' \preceq c$, where the last subset relation is obtained by applying Corollary A2 to $r = c$, $x' = s_1$.
Lastly, the Lebesgue measure $\lambda(I(x, c'))$ is given by the product $\prod_{i=1}^{t}(c'_i - x_i) > 0$, which is greater than zero since $x \prec c'$.
Corollary A4. Let $S \subset \mathbb{R}^t$ be a finite subset and $x, r \in \mathbb{R}^t$ with $x, s \prec r$ for all $s \in S$. Then, $HVI_r(S, x) = 0$ if and only if there exists $s \in S$ with $s \preceq x$.

Proof. Given $s \preceq x$, $I(x, r) \subseteq \bigcup_{s \in S} I(s, r)$ by the very construction; thus, "⇐" holds. Given $HVI_r(S, x) = 0$, assume that no $s \in S$ satisfies $s \preceq x$, that is, for every $s \in S$ there exists $i_s$ such that $x_{i_s} < s_{i_s}$. Since $HVI_r(S, x) = \lambda(I(x, r) - \bigcup_{s \in S} I(s, r))$, Corollary A3 yields $HVI_r(S, x) \ge \lambda(I(x, c)) > 0$, a contradiction.
Corollary A5. Let $S \subset \mathbb{R}^t$ be a finite subset and $x, x', r \in \mathbb{R}^t$ with $x', s \prec r$ for all $s \in S$, $x \preceq x'$ and $x_i < x'_i$ for some $i$. Assume further that for every $s \in S$ there exists $i_s$ such that $x_{i_s} < s_{i_s}$. Then, $HVI_r(S, x) > HVI_r(S, x')$.

Proof. Applying Corollary A3 to $x$ and $S \cup \{x'\}$, we obtain some $x \prec c' \preceq r$ with $I(x, c') \subseteq I(x, r) - (\bigcup_{s \in S} I(s, r) \cup I(x', r))$. Since $I(x', r) \subseteq I(x, r)$, we deduce $HVI_r(S, x) \ge HVI_r(S, x') + \lambda(I(x, c'))$.
Since $x \prec c'$, we argue as in the proof of the previous corollary that $\lambda(I(x, c')) > 0$. This proves the claim.
Proof of Proposition 1. We may assume without loss of generality that $s \prec r$ for all $s \in S$. Indeed, let $T \subset S$ be the set of points in $S$ satisfying $s \prec r$ for all $s \in T$. Given some Pareto point $t \in T$, assume there exists some $s \in S - T$ with $s \preceq t$. Then, $s \preceq t \prec r$, which contradicts $s \in S - T$.
Since $S$ is bounded and closed and $HVI_r(F, -)$ is continuous, we deduce the existence of some $p \in S$ which maximizes $HVI_r(F, -)$. Since $HVI_r(F, p) = \max_{x \in S} HVI_r(F, x) \ge HVI_r(F, s) > 0$, we obtain by Corollary A4 that there exists no $f \in F$ with $f \preceq p$, i.e., for all $f \in F$ there exists $i_f$ with $p_{i_f} < f_{i_f}$. Assume there exists $x \in S$ with $x \preceq p$ and $x_i < p_i$ for some $i$. Then, there exists no $f \in F$ with $f \preceq x$ since $x \preceq p$. By applying Corollary A5, we deduce $HVI_r(F, x) > HVI_r(F, p)$, contradicting the maximality of $p$. Hence, $p$ is a Pareto point of $S$.

Appendix B. Probability Measure for Multivariate Normal Distribution
Theorem A6. Let $I$ be a set and for every finite $J \subset I$ let an inner regular probability measure $P_J$ on $\mathbb{R}^J$ be given. Given two finite $J \subset J' \subset I$, denote by $pr_{J \subset J'}\colon \mathbb{R}^{J'} \to \mathbb{R}^J$ the canonical projection. Assume that for all finite $J \subset J' \subset I$
$$P_{J'} \circ pr_{J \subset J'}^{-1} = P_J$$
holds. Then, there exists a unique measure $P_I$ on $\mathbb{R}^I$ satisfying $P_I \circ pr_J^{-1} = P_J$ for all finite $J \subset I$.

Before giving the proof, recall:

Theorem A7 (Hahn-Kolmogorov). Let $S$ be a set and $R \subset \mathcal{P}(S)$ be a ring, i.e., $\emptyset \in R$ and $R$ is stable under finite unions and binary complements. Let $P_0\colon R \to [0, \infty]$ be a pre-measure, i.e., $P_0(\emptyset) = 0$ and $P_0$ is $\sigma$-additive on $R$. Then, $P_0$ extends to a measure $P$ on the sigma algebra $\sigma(R)$ generated by $R$. Furthermore, if $P_0$ is $\sigma$-finite, then the extension is unique.
Proof of Theorem A6. The sigma algebra of the product $\mathbb{R}^I$ is generated by
$$R = \{A \times \mathbb{R}^{I - J} \mid J \subset I \text{ finite},\ A \subset \mathbb{R}^J \text{ measurable}\}.$$
The reader convinces himself that $R$ is a ring. Define a function $P_0$ by $P_0(A \times \mathbb{R}^{I-J}) = P_J(A)$. Observe that given $A \times \mathbb{R}^{I-J} = A' \times \mathbb{R}^{I-J'}$, then, without loss of generality, $J \subset J'$ and $A' = A \times \mathbb{R}^{J'-J}$. Thus, $P_{J'}(A') = P_{J'}(A \times \mathbb{R}^{J'-J}) = P_J(A)$ and $P_0$ is well defined. Then, the reader convinces himself that $P_0(\emptyset) = P_J(\emptyset) = 0$ and that $P_0$ is finitely additive. We prove that $P_0$ is $\sigma$-additive, i.e., given $A_1, A_2, \dots \in R$ pairwise disjoint with $\bigcup_{n \in \mathbb{N}} A_n \in R$, then
$$P_0\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \sum_{n \in \mathbb{N}} P_0(A_n).$$
Then, the Hahn-Kolmogorov theorem above proves the claim. Notice that every probability measure is $\sigma$-finite and, thus, so is $P_0$.
It is well known that, since $P_0$ is finitely additive, it suffices to prove $\lim_n P_0(A_n) = 0$ for every descending sequence $A_1 \supset A_2 \supset \dots \in R$ with $\bigcap_{n \in \mathbb{N}} A_n = \emptyset$. We prove that if there exists an $\varepsilon > 0$ with $P_0(A_n) > \varepsilon$ for all $n$, then $\bigcap_{n \in \mathbb{N}} A_n \neq \emptyset$. Write $A_i = B_i \times \mathbb{R}^{I - J_i}$ with $J_i \subset I$ finite and $B_i \subset \mathbb{R}^{J_i}$ measurable. Since each $P_{J_i}$ is inner regular, for all $i$ choose some compact $K_i \subset B_i$ such that
$$P_{J_i}(B_i - K_i) < \frac{\varepsilon}{2^{i+1}}, \qquad \text{(A2)}$$
and set $C_i = K_i \times \mathbb{R}^{I - J_i} \in R$. Furthermore, $\bigcap_{i=1}^{n} A_i = A_n$ since $A_{i+1} \subset A_i$ for all $i$. Thus, $P_0(\bigcap_{i=1}^{n} A_i) = P_0(A_n) > \varepsilon$ and it suffices to prove $\bigcap_{i=1}^{n} C_i \neq \emptyset$ for all $n$. It holds
$$P_0\Big(A_n - \bigcap_{i=1}^{n} C_i\Big) \le \sum_{i=1}^{n} P_0(A_i - C_i) < \frac{\varepsilon}{2} < P_0(A_n)$$
by (A2). In particular, $\bigcap_{i=1}^{n} C_i \neq \emptyset$ and, hence, $D_n := \bigcap_{i=1}^{n} K_i \neq \emptyset$ for all $n$. We consider the descending sequence $D_1 \supset D_2 \supset \dots$.
Since all $D_i$ are compact, $\mathbb{R}^t$ is Hausdorff (hence, $D_1$ is) and all $D_i$ are non-empty, the lemma below ensures that $\emptyset \neq \bigcap_{i \in \mathbb{N}} K_i \subset \bigcap_{i \in \mathbb{N}} A_i$ and, hence, the claim.
Lemma A8. Let E ⊃ K 0 ⊃ K 1 ⊃ · · · be a descending sequence of compact topological spaces with E being a Hausdorff space. Then K = ∩ i∈N K i = ∅ implies that there exists some n ∈ N with K n = ∅.
Proof. If $K = \emptyset$, then its complement is $E$. Recall that every compact subset of a Hausdorff space is closed. Hence, $E - K_i$ is open for every $i$. Since $E$ is compact, there exist $i_1, \dots, i_k$ such that $E = \bigcup_{j=1,\dots,k}(E - K_{i_j})$. Hence, $\bigcap_{j=1,\dots,k} K_{i_j} = \emptyset$. Thus, $K_n \subset \bigcap_{j=1,\dots,k} K_{i_j} = \emptyset$ for some $n$ greater than all $i_j$.
Proof of Theorem 3. In view of Theorem A6, it suffices to prove $\mathcal{N}(m_F, C_F) = \mathcal{N}(m_{F'}, C_{F'}) \circ pr_{F \subset F'}^{-1}$ for any finite subsets $F \subset F' \subset X$. Observe that $pr_{F \subset F'}\colon \mathbb{R}^{F'} \to \mathbb{R}^F$ is given by left multiplication with the matrix $A = (\delta_{ij})_{i \in F,\, j \in F'}$ and that $A$ has full rank (since the projection is an epimorphism). Furthermore, by construction we obtain the equalities $C_F = A C_{F'} A^T$ and $m_F = A m_{F'}$. Since the pushforward of $\mathcal{N}(m_{F'}, C_{F'})$ under the linear map $A$ is $\mathcal{N}(A m_{F'}, A C_{F'} A^T)$, the claim follows.
Proof of Corollary 6. Given some finite $J \subset X$, a straightforward computation proves
$$m'_J := (m'(j))_{j \in J} = m_J + C_{J,F} C_F^{-1} (y - m_F),$$
$$C'_J := (C'(j_1, j_2))_{j_1, j_2 \in J} = C_J - C_{J,F} C_F^{-1} C_{F,J}.$$
Then, $m'$ and $C'$ are functions such that for every finite $J \subset X$ the induced matrix $C'_J = (C'(j_1, j_2))_{j_1, j_2 \in J}$ is positive definite, and the density function of the induced normal distribution $\mathcal{N}(m'_J, C'_J)$ is the conditional density function induced by $((r_j)_{j \in J}, (r_f)_{f \in F})$ given $(r_f)_{f \in F} = y$ by Theorem 2. Applying Theorem 3 with mean function $m'$ and covariance function $C'$ finishes the proof.

Appendix C. Integrability of Hypervolume Improvement
Lemma A9. The hypervolume improvement function
$$HVI_{r,F}\colon S \to \mathbb{R}, \quad y \mapsto HVI_r(F, y) \qquad \text{(A3)}$$
is continuous, hence integrable, for $F$ finite and $S \subset \mathbb{R}^t$ such that $s, f \prec r$ for all $s \in S$, $f \in F$.

Proof (sketch). By the inclusion-exclusion principle, $HV_r(F \cup \{y\})$ is a finite alternating sum of volumes of boxes whose corner coordinates depend continuously on $y$; each such volume is a product of coordinate differences and hence continuous in $y$. Therefore, $HVI_{r,F}(y) = HV_r(F \cup \{y\}) - HV_r(F)$ is continuous in $y$.