Abstract
Within a disruptively changing environment, design of power systems becomes a complex task. Meeting multi-criteria requirements with increasing degrees of freedom in design and simultaneously decreasing technical expertise strengthens the need for multi-objective optimization (MOO) making use of algorithms and virtual prototyping. In this context, we present Gaussian Process Regression based Multi-Objective Bayesian Optimization (GPR-MOBO) with special emphasis on its profound theoretical background. A detailed mathematical framework is provided to derive a computer-implementable GPR-MOBO algorithm. We quantify GPR-MOBO effectiveness and efficiency by the hypervolume and by the number of computationally expensive simulations required to identify Pareto-optimal design solutions, respectively. For validation purposes, we benchmark our GPR-MOBO implementation on a mathematical test function with analytically known Pareto front and compare results to those of the well-known algorithms NSGA-II and pure Latin Hypercube Sampling. To rule out effects of randomness, we include statistical evaluations. GPR-MOBO turns out to be an effective and efficient approach with superior character versus state-of-the-art approaches and increasing value-add when simulations are computationally expensive and the number of design degrees of freedom is high. Finally, we provide an example of GPR-MOBO based power system design and optimization that demonstrates both the methodology itself and its performance benefits.
1. Design of Complex Power Systems
Design of power systems is facing a triple disruptive upheaval: (i) primary energy sources are being converted from fossil to renewable, (ii) grid topologies are moving from centralized to decentralized dominance, and (iii) previously separate sectors (heat, electricity, and mobility) are merging. Driven by these changes, emerging technologies will replace established ones, while at the same time the number of degrees of freedom in facility design will increase. The resulting VUCA (volatile, uncertain, complex and ambiguous) [1] environment poses significant risks for all involved stakeholders throughout the value chain, from product owners through system architects to operators and consumers. Simultaneous ecological and economic success of power plants requires the ability to identify system design alternatives with multi-objective optima in terms of cost-benefit trade-off for individual use cases.
The lack of technical experience around emerging or new technologies significantly increases the challenge of optimization. Virtual prototyping (VP) guided by engineering experience, therefore, may be considered the standard approach for today’s energy facility design. However, with an increasing number of technology alternatives and decreasing system-level technical experience, the dependence between objectives and design parameters is unknown in many cases. Power system design, thereby, becomes a black box multi-objective optimization (MOO) problem.
VP-based MOO guided by engineering intuition again fails when engineering experience is lacking and full or fractional factorial coverage of the design space is not a viable option due to (i) the dimensionality (i.e., number of degrees of freedom) of the design space and (ii) the time-consuming and costly effort required to simulate power systems. Therefore, gaining engineering knowledge about the relationship between system design objectives as a function of design parameters based on as few simulations as possible is of paramount importance.
VUCA driven power system design, accordingly, requires MOO algorithms that are
- (i) effective, i.e., knowledgeable of multi-objective (MO) related optimal cost-benefit (later on called Pareto-optimal) trade-offs, and
- (ii) efficient, i.e., requiring only a limited number of (computationally expensive) simulations for quantifying these trade-offs.
While a broad variety of MOO approaches in black box environments deals with effective algorithms, only few of them meet the efficiency criterion [2]. Specifically, Bayesian Optimization (BO) [3,4,5,6] algorithms based on Gaussian Process Regression (GPR) [3,7,8,9] appear as effective and efficient MOO candidates.
GPR [10,11] provides a regression model that (in contrast to alternative methodical approaches) (a) does not suffer from the “curse of dimensionality” [12] (Section 3.1), and (b) inherently provides a quantification of the regression uncertainty of the model. This uncertainty is exploited by the BO approach [13], which in turn is used to optimize the surrogate. For a more detailed and fundamental comparison of MOO approaches, we recommend [14].
In this paper, we present the mathematical background of GPR-based Multi-Objective Bayesian Optimization (GPR-MOBO) in detail. This includes statements and selected proofs of key results. With that theoretical foundation on hand, we derive a computer implementation of the introduced GPR-MOBO algorithm, quantify its effectiveness and efficiency, and demonstrate the superiority of GPR-MOBO over state-of-the-art MOO algorithms, including a GPR-MOBO application to a power system design example. The paper is structured as follows: Section 2 restates the introductory problem in mathematical terms, including definitions such as “Pareto-optimality” and “hypervolume”. Section 3 explains Gaussian Process Regression (GPR) and a Bayesian Optimization (BO) based on it, before a bipartite validation follows in Section 4: first, the presented approach is validated via a mathematical test function, then the proposed GPR-MOBO approach is applied to the design and optimization of a real life power system. Section 5 discusses our results, before Section 6 gives a brief summary and highlights ideas for future work.
2. Problem Statement in Mathematical Terms
Within this section, we phrase the task of effectively and efficiently identifying objective trade-offs in mathematical terms. For this purpose, we consider an (unknown) t-dimensional black box function
$$f: X \to \mathbb{R}^t.$$
We refer to $X \subseteq \mathbb{R}^d$ as the design space with target space $\mathbb{R}^t$ for $t \geq 2$. Assume further that we can sample f at a finite number of points, i.e., choosing $x_1, \dots, x_N \in X$ we obtain $f(x_1), \dots, f(x_N)$. We translate the MOO issue of identifying an optimal design to finding solutions of f which represent non-dominated (Pareto-optimal) trade-offs, i.e., we are looking for Pareto points of f:
Definition 1 (Pareto point and front).
We write
$$y \preceq y' \quad :\Longleftrightarrow \quad y_i \leq y'_i \ \text{for all components } i = 1, \dots, t.$$
Then, given a set $S \subseteq \mathbb{R}^t$, a point $y \in S$ is called a Pareto point if there exists no other point $y' \in S$ satisfying $y' \preceq y$ and $y'_i < y_i$ for some component i. The set of all Pareto (optimal) points is called the Pareto front of S.
More generally, an $x \in X$ is a Pareto point of f if $f(x)$ is a Pareto point of $f(X)$. The set of all such points is called the Pareto front of f.
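For illustration, Definition 1 translates directly into code. The following minimal Python sketch (the function name and the brute-force scan are our own choices, not part of the referenced algorithms) extracts the Pareto front of a finite set of target vectors under minimization:

```python
import numpy as np

def pareto_front(Y: np.ndarray) -> np.ndarray:
    """Indices of the Pareto-optimal (non-dominated) rows of Y, shape (n, t).

    Following Definition 1 (minimization): a row is discarded if another
    row is <= in every component and < in at least one component.
    """
    n = Y.shape[0]
    is_pareto = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(Y[j] <= Y[i]) and np.any(Y[j] < Y[i]):
                is_pareto[i] = False  # Y[i] is dominated by Y[j]
                break
    return np.flatnonzero(is_pareto)
```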
Based on this definition, we call a MOO algorithm effective whenever it is capable of identifying the set of non-dominated (Pareto-optimal) trade-offs for the (unknown) black box function. To introduce a measure of effectiveness, we now define the so-called hypervolume.
Definition 2 (Hypervolume).
Denote by
$$[\,\cdot\,,\cdot\,]: \mathbb{R}^t \times \mathbb{R}^t \to \mathcal{P}(\mathbb{R}^t), \quad (y, r) \mapsto [y, r] = \{z \in \mathbb{R}^t \mid y \preceq z \preceq r\}$$
the function sending two t-dimensional real vectors to the cube bounded by them, where $\mathcal{P}(\mathbb{R}^t)$ denotes the power set. The hypervolume of some (finite) set $Y \subseteq \mathbb{R}^t$ with respect to a reference point $r \in \mathbb{R}^t$ is given by the Lebesgue measure
$$\mathrm{HV}_r(Y) = \lambda\Big(\bigcup_{y \in Y} [y, r]\Big)$$
of the union over all cubes bounded by the reference point and by some point in Y.
Figure 1 illustrates the hypervolume by example in two dimensions.
Figure 1.
Hypervolume (blue area) in $\mathbb{R}^2$. The dotted rectangles illustrate the volumes $[y, r]$ spanned by the respective points (×) and the reference point r (×).
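In two dimensions, the Lebesgue measure in Definition 2 reduces to a sum of rectangle areas obtained by a sweep over the sorted Pareto points, as the following sketch shows (our own helper, reusing `pareto_front` from above; minimization and $y \preceq r$ for all points are assumed):

```python
def hypervolume_2d(Y: np.ndarray, r: np.ndarray) -> float:
    """Hypervolume of the union of boxes [y, r] for t = 2 (Definition 2).

    Dominated points do not change the union, so only the Pareto front
    is kept; its points have ascending y1 and hence descending y2.
    """
    P = Y[pareto_front(Y)]
    P = P[np.argsort(P[:, 0])]
    hv, x_right = 0.0, r[0]
    for y1, y2 in P[::-1]:      # sweep from right to left
        hv += (x_right - y1) * (r[1] - y2)
        x_right = y1
    return hv
```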
The Hypervolume is closely related to Pareto points in the following sense.
Proposition 1.
Let $S \subseteq \mathbb{R}^t$ be quasi-compact (i.e., bounded and closed) and $Y \subseteq S$ be a finite subset. Let $y^* \in S$ be such that
$$\mathrm{HV}_r(Y \cup \{y^*\}) = \max_{y \in S} \mathrm{HV}_r(Y \cup \{y\}) > \mathrm{HV}_r(Y)$$
for some reference point $r \in \mathbb{R}^t$. Then, $y^*$
is a Pareto point of S.
Proof.
See Appendix A. □
In simple words, Proposition 1 states that a point which maximizes the hypervolume when added to the set of image (black box function) points is a Pareto point. Accordingly, the hypervolume is a suitable indicator of MOO effectiveness, while the number of (simulation) samples required to find such Pareto points is a suitable measure of efficiency.
3. Bayesian Optimization Based on Gaussian Process Regression
Gaussian Process Regression (GPR) based Bayesian Optimization (BO) using Expected Hypervolume Improvement (EHVI, see Section 3.3) as acquisition function is a promising algorithmic approach to meet the simultaneous goals of effectiveness and efficiency. The following sub-sections introduce the general mathematical GPR background (Section 3.1), the choice of GPR related hyperparameters (Section 3.2), and GPR based multi-objective BO (GPR-MOBO, Section 3.3), before Section 3.4 summarizes them as a mathematical base for the subsequent algorithmic implementation.
3.1. Gaussian Process Regression
We summarize and recall the definitions, statements, and formulas needed in order to properly apply Gaussian Process Regression (GPR).
3.1.1. Multivariate Normal Distribution
Let n be a positive integer and $C \in \mathrm{Mat}_n(\mathbb{R})$ be a real, positive definite matrix of dimension n, with $\mathrm{Mat}_n(\mathbb{R})$ being the space of $n \times n$ matrices with values in ℝ. Let $m \in \mathbb{R}^n$ be an n-dimensional real vector. Recall the multivariate normal distribution $\mathcal{N}(m, C)$ to be the probability measure on $\mathbb{R}^n$ induced by the density function
$$\varphi(x) = \frac{1}{\sqrt{(2\pi)^n \det C}} \exp\left(-\frac{1}{2}(x - m)^T C^{-1}(x - m)\right).$$
The vector m is called mean(-vector) and the matrix C is called covariance matrix of $\mathcal{N}(m, C)$.
Multivariate normal distributions are stable under conditioning in the following sense.
Theorem 2.
Let $X_1, X_2$ be two random variables such that $(X_1, X_2)$ is multivariate normal $\mathcal{N}(m, C)$-distributed with mean $m = (m_1, m_2)$ and covariance matrix
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}.$$
Then, given some $x_2$ in the co-domain of $X_2$, the conditional density function of $X_1$ given $X_2 = x_2$ is given by $\mathcal{N}(\tilde m, \tilde C)$ with
$$\tilde m = m_1 + C_{12} C_{22}^{-1} (x_2 - m_2)$$
and
$$\tilde C = C_{11} - C_{12} C_{22}^{-1} C_{21}.$$
Proof.
Theorem 2.5.1 in [15]. □
In particular, the conditional distribution of a multivariate normal distribution turns out to be multivariate normal as well.
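Theorem 2 can be checked numerically in a few lines; the following sketch (block values chosen arbitrarily for illustration) computes the conditional mean and covariance:

```python
import numpy as np

# Joint covariance of (X1, X2) in the block form of Theorem 2
C11, C12, C22 = np.array([[2.0]]), np.array([[0.8]]), np.array([[1.0]])
m1, m2 = np.array([0.0]), np.array([1.0])

x2 = np.array([1.5])                  # observed value of X2
K = C12 @ np.linalg.inv(C22)          # "gain" C12 C22^{-1}
m_cond = m1 + K @ (x2 - m2)           # conditional mean
C_cond = C11 - K @ C12.T              # conditional covariance (C21 = C12^T)
print(m_cond, C_cond)                 # law of X1 given X2 = x2
```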
3.1.2. Stochastic Processes
Let $(\Omega, \mathcal{A}, P)$ be a probability space, I a set, and $(S, \mathcal{S})$ a measurable space.
Definition 3.
A stochastic process with state space S and index set I is a collection of random variables $(X_i: \Omega \to S)_{i \in I}$.
Remark 1.
Recall that arbitrary products of measurable spaces exist and that their underlying sets are given by the Cartesian products of the underlying sets. By the universal property of the product, a stochastic process $(X_i)_{i \in I}$, therefore, consists of the same data as a measurable function $\Omega \to S^I$.
Given a stochastic process $(X_i)_{i \in I}$, in practice, we are mostly interested in the induced (pushforward) measure on $S^I$. On the other hand, given such a probability measure on $S^I$, we obtain a stochastic process given by the canonical projections. In that sense, a stochastic process may be seen as a proper construction of a probability measure on the product space $S^I$.
3.1.3. Gaussian Process
With the definition of stochastic processes on hand, we can generalize the multivariate normal distribution (defined on finite products of real numbers) to possibly infinite products of real numbers in the following sense.
Definition 4 (Gaussian Process).
Let X be a set. A Gaussian Process with index set X is a family of real valued random variables $(Y_x)_{x \in X}$ such that for every finite subset $\{x_1, \dots, x_n\} \subseteq X$, the random variable $(Y_{x_1}, \dots, Y_{x_n})$ is multivariate normal distributed.
Recall that by the above, this induces a probability measure on $\mathbb{R}^X$. We can “construct” Gaussian Processes in the following way:
Theorem 3.
Let X be a set, $C: X \times X \to \mathbb{R}$ be a positive quadratic form in the sense that for every finite subset $x = \{x_1, \dots, x_n\} \subseteq X$ the induced matrix
$$C_x = (C(x_i, x_j))_{i,j=1,\dots,n}$$
is positive definite, and let $m: X \to \mathbb{R}$ be a function. Given a subset $x = \{x_1, \dots, x_n\} \subseteq X$, denote by
$$m_x = (m(x_1), \dots, m(x_n))$$
the induced vector.
Then, there exists a unique probability measure P on $\mathbb{R}^X$ satisfying
$$(\pi_x)_* P = \mathcal{N}(m_x, C_x)$$
for all finite $x \subseteq X$, where $\pi_x: \mathbb{R}^X \to \mathbb{R}^x$ denotes the canonical projection.
The function C is called covariance function and m is called mean function of P.
Proof.
See Appendix B. □
In other words, we construct Gaussian Processes by choosing a positive quadratic form C, further referred to as covariance function, and a mean function m.
Example 4 (Squared exponential kernel).
The squared exponential kernel
$$C(x, x') = \sigma^2 \exp\left(-\frac{\lVert x - x' \rVert^2}{2 l^2}\right) \quad (1)$$
is a covariance function (i.e., a positive quadratic form; see [16]) for every $l, \sigma > 0$. The parameter l is called lengthscale and the parameter $\sigma^2$ is called output variance. Other covariance functions may also be found in [16].
Example 5 (Covariance with white Gaussian noise).
Let m be a function and C be a covariance function. Given $\sigma_n > 0$, the reader may convince himself that the function
$$\tilde C(x, x') = C(x, x') + \sigma_n^2 \delta_{x,x'}$$
(with $\delta$ the Kronecker delta) is a positive quadratic form for each $\sigma_n > 0$. Note that $\sigma_n$ may be considered as a hyperparameter.
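In code, Examples 4 and 5 amount to a few lines; the sketch below (parameter names are our own) builds the squared exponential covariance matrix and optionally adds white noise on the diagonal:

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, output_var=1.0):
    """Squared exponential kernel (Example 4) between the rows of A and B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return output_var * np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def noisy_kernel(A, B, lengthscale=1.0, output_var=1.0, noise_var=1e-6):
    """Covariance with white Gaussian noise (Example 5): the noise term
    contributes only where both inputs coincide, i.e., on the diagonal
    when A and B are the same sample set."""
    K = sq_exp_kernel(A, B, lengthscale, output_var)
    if A is B:
        K = K + noise_var * np.eye(A.shape[0])
    return K
```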
Combining Theorems 2 and 3, we derive an appropriate “conditioned” Gaussian Process.
Corollary 6.
Let $(Y_x)_{x \in X}$ be a Gaussian Process with index set X,
$$C: X \times X \to \mathbb{R}$$
its covariance function and
$$m: X \to \mathbb{R}$$
its mean function. Let $D = \{x_1, \dots, x_N\} \subseteq X$ be a finite subset consisting of N elements and $y \in \mathbb{R}^N$.
Then, there exists a unique probability measure P on $\mathbb{R}^X$ such that for every finite subset $x \subseteq X$ the density function of $(\pi_x)_* P$ is given by the conditional density function of $(Y_{x'})_{x' \in x}$ given $(Y_{x_1}, \dots, Y_{x_N}) = y$.
Its mean function and covariance function are constructed as follows: For every $x, x' \in X$ define
$$C_{x,D} = (C(x, x_1), \dots, C(x, x_N)) \quad \text{and} \quad C_D = (C(x_i, x_j))_{i,j=1,\dots,N}.$$
Then,
$$\tilde m(x) = m(x) + C_{x,D} \, C_D^{-1} (y - m_D) \quad (2)$$
and
$$\tilde C(x, x') = C(x, x') - C_{x,D} \, C_D^{-1} C_{x',D}^T. \quad (3)$$
Proof.
See Appendix B. □
3.1.4. Gaussian Process Regression
Consider a supervised learning problem for an unknown (“black box”) function
$$f: X \to \mathbb{R}$$
with training points
$$T = \{(x_1, f(x_1)), \dots, (x_N, f(x_N))\} \subseteq X \times \mathbb{R}.$$
The task is to find an appropriate approximation of f. To solve this task, we may use Gaussian Process Regression, the idea of which is to
- (i) define a Gaussian Process on X by defining a mean and covariance function on X (Theorem 3),
- (ii) condition that Gaussian Process in the sense of Corollary 6 with $D = \{x_1, \dots, x_N\}$ and $y = (f(x_1), \dots, f(x_N))$, and
- (iii) use the conditioned mean $\tilde m$ from Equation (2) as approximation of f.
A GPR for f and T is then the data of a Gaussian Process on X conditioned to D and y.
Remark 2.
By its very nature, a GPR is equipped with a natural measure of prediction uncertainty. Instead of a single point prediction y for $x \in X$, we obtain a probability distribution
$$\mathcal{N}(\tilde m(x), \tilde C(x, x)).$$
We interpret
$$\sigma(x) = \sqrt{\tilde C(x, x)}$$
as the uncertainty in the prediction at x.
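Putting Corollary 6 and Remark 2 together, a minimal GPR prediction routine might look as follows (a sketch under our own conventions: `kernel(A, B)` returns the covariance matrix between the rows of A and B, `mean(X)` the prior mean vector):

```python
import numpy as np

def gpr_posterior(X_train, y_train, X_test, kernel,
                  mean=lambda X: np.zeros(len(X))):
    """Conditioned mean and covariance of Corollary 6, Equations (2) and (3)."""
    K_DD = kernel(X_train, X_train)                       # C_D
    K_xD = kernel(X_test, X_train)                        # C_{x,D}
    K_xx = kernel(X_test, X_test)
    alpha = np.linalg.solve(K_DD, y_train - mean(X_train))
    m_post = mean(X_test) + K_xD @ alpha                  # Equation (2)
    C_post = K_xx - K_xD @ np.linalg.solve(K_DD, K_xD.T)  # Equation (3)
    sigma = np.sqrt(np.clip(np.diag(C_post), 0.0, None))  # Remark 2
    return m_post, C_post, sigma
```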
Figure 2 illustrates the conditioning of a GPR to some new evaluations.
Figure 2.
Top: real function and GPR with mean m and covariance function C, where the shaded band symbolizes mean plus/minus standard deviation (i.e., $m(x) \pm \sigma(x)$) at each point. Bottom: same as top with m and C conditioned to the evaluations marked by the vertical cyan dashes.
3.2. GPR Hyperparameter Adaption
Using a GPR for supervised learning problems requires the choice of some (initial) mean function and covariance function (Theorem 3). Most examples of covariance functions involve the choice of hyperparameters; Example 4 involves the choice of a lengthscale l and an output variance $\sigma^2$.
Consider a supervised learning problem with training points $T = \{(x_1, f(x_1)), \dots, (x_N, f(x_N))\}$. Given a mean function m and a family of covariance functions $(C_\theta)_{\theta \in \Theta}$ with $\theta$ an element of some index set $\Theta$, we choose a hyperparameter $\theta$ by following the maximum likelihood principle.
Denote by $\varphi_\theta$ the density function of the multivariate normal distribution $\mathcal{N}(m_D, (C_\theta)_D)$ induced on the training inputs $D = \{x_1, \dots, x_N\}$.
We choose $\theta$ by solving
$$\max_{\theta \in \Theta} \varphi_\theta(y), \quad y = (f(x_1), \dots, f(x_N)). \quad (5)$$
Remark 3.
In practice, one often replaces $\varphi_\theta$ with $\log \varphi_\theta$ and solves
$$\max_{\theta \in \Theta} \log \varphi_\theta(y),$$
resulting in identical parameters. However, $\log \varphi_\theta$
is numerically more convenient to work with.
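Concretely, for a zero prior mean the negative log marginal likelihood can be implemented and minimized as follows (a sketch reusing `sq_exp_kernel` from the Example 4 sketch; the log-parametrization of the hyperparameters is our own choice to keep the optimization unconstrained):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, X_train, y_train):
    """-log phi_theta(y) for zero prior mean; theta = (log l, log s_f, log s_n)."""
    l, s_f, s_n = np.exp(theta)
    K = sq_exp_kernel(X_train, X_train, l, s_f ** 2) + s_n ** 2 * np.eye(len(X_train))
    L = np.linalg.cholesky(K)              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return (0.5 * y_train @ alpha
            + np.sum(np.log(np.diag(L)))   # = 0.5 * log det K
            + 0.5 * len(y_train) * np.log(2.0 * np.pi))

# Maximum likelihood hyperparameters (Equation (5)) via a local optimizer:
# res = minimize(neg_log_marginal_likelihood, np.zeros(3), args=(X_train, y_train))
```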
3.3. Bayesian Optimization
We define the hypervolume improvement as the gain of hypervolume when adding new points. In the literature, the underlying function used for calculating new sample (or infill) points is called the acquisition function.
Definition 5 (Hypervolume Improvement).
Given a reference point $r \in \mathbb{R}^t$ and a finite set of vectors $F \subseteq \mathbb{R}^t$, the hypervolume improvement of some $y \in \mathbb{R}^t$ is defined as
$$\mathrm{HVI}_F(y) = \mathrm{HV}_r(F \cup \{y\}) - \mathrm{HV}_r(F).$$
We denote by
$$\mathrm{HVI}_F: \mathbb{R}^t \to \mathbb{R}$$
the resulting function. We often write HVI instead of $\mathrm{HVI}_F$ whenever F is clear from context. Observe that $\mathrm{HVI}_F$ is continuous (see Appendix C), hence integrable, on a bounded subset.
Remark 4.
Maximizing the Hypervolume improvement results in Pareto points (see Proposition 1).
Consider a black box function $f: X \to \mathbb{R}^t$ with evaluations $F = \{f(x_1), \dots, f(x_N)\}$. Given an approximation $\tilde f$ of f (such as the mean of a GPR) and a suitable reference point r, we strive to calculate
$$\max_{x \in X} \mathrm{HVI}_F(\tilde f(x))$$
in order to find a preimage of a Pareto point of f.
Recall that GPRs include a prediction uncertainty measure (Remark 2). We can take this additional information into account when maximizing the hypervolume improvement in the following way.
Definition 6 (Expected Hypervolume Improvement).
Let mean functions $\tilde m_i$ and covariance functions $\tilde C_i$ on X for $i = 1, \dots, t$ be given. Denote by
$$\tilde m(x) = (\tilde m_1(x), \dots, \tilde m_t(x))$$
the induced mean vector and by $\tilde C(x)$ the diagonal matrix with
$$\tilde C(x)_{ii} = \tilde C_i(x, x).$$
Then, the expected hypervolume improvement at $x \in X$ is given by the expected value
$$\mathrm{EHVI}(x) = \int_{\mathbb{R}^t} \mathrm{HVI}_F(y) \, d\mathcal{N}(\tilde m(x), \tilde C(x))(y) \quad (8)$$
of HVI with respect to the probability measure $\mathcal{N}(\tilde m(x), \tilde C(x))$
on $\mathbb{R}^t$.
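The integral in Equation (8) can be approximated by simple Monte Carlo sampling, as the sketch below illustrates for t = 2, reusing `hypervolume_2d` from Section 2 (the box-decomposition algorithms of [18,19] used later in the paper are exact and far more efficient; this estimator serves intuition only):

```python
import numpy as np

def ehvi_mc(mean_vec, var_vec, F, r, n_samples=10_000, rng=None):
    """Monte Carlo estimate of EHVI at a point with GPR prediction
    N(mean_vec, diag(var_vec)), cf. Definition 6 (t = 2 here)."""
    rng = np.random.default_rng() if rng is None else rng
    hv_F = hypervolume_2d(F, r)
    samples = rng.normal(mean_vec, np.sqrt(var_vec),
                         size=(n_samples, len(mean_vec)))
    # Samples beyond r span empty boxes; clipping them to r leaves the
    # hypervolume unchanged and keeps the 2D sweep valid.
    improvements = [hypervolume_2d(np.vstack([F, np.minimum(y, r)]), r) - hv_F
                    for y in samples]
    return float(np.mean(improvements))
```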
In many situations, the training data (more precisely, the evaluations) are manipulated (i.e., pre-processed) before training (i.e., before calculating the hyperparameters of the mean and covariance functions). By the very definition, we obtain the following corollary:
Corollary 7.
Let $S \subseteq \mathbb{R}^t$ and $g: \mathbb{R}^t \to \mathbb{R}^t$ be a function. Assume g satisfies $g(y)_i \leq g(y')_i$ if and only if $y_i \leq y'_i$ for all $y, y' \in S$ and all components i. Then, $y \in S$ is a Pareto point of S if and only if $g(y)$ is a Pareto point of $g(S)$.
Therefore, any function satisfying the above assumptions may be used for pre-processing of data points in the context of multicriterial optimization.
Remark 5.
The expected hypervolume improvement involves the choice of some reference point. By construction, this choice affects the (expected) hypervolume contribution of any point in the target space. Notice that the reference point must be strictly greater in every component than every Pareto-optimal solution in order to ensure a hypervolume greater than zero for every such point. For example, if the black box function f factorizes through $[0, 1]^t$, then such a reference point may be chosen by
$$r = (1 + \epsilon, \dots, 1 + \epsilon) \quad (9)$$
for some $\epsilon > 0$.
Further discussion of reference point selection may be found in the literature, e.g., in [17].
Roughly speaking, the expected hypervolume improvement is an extension of the hypervolume improvement incorporating the uncertainty information encapsulated in the GPRs. One hopes that maximizing the expected hypervolume improvement (of the GPRs) maximizes the hypervolume improvement of the black box function more efficiently than simply maximizing the hypervolume improvement of the underlying mean functions of the GPRs. Note that incorporating the uncertainty (of the model) makes it possible to reflect a trade-off between exploration and exploitation.
3.4. Summary—Base for an Algorithmic Implementation
We briefly summarize the necessary steps in order to apply GPR and EHVI based multicriterial optimization to a black box function
$$f: X \to \mathbb{R}^t.$$
Let
$$T = \{(x_1, f(x_1)), \dots, (x_N, f(x_N))\}$$
be evaluations of f. We define $D = \{x_1, \dots, x_N\}$ and $F = \{f(x_1), \dots, f(x_N)\}$.
3.4.1. Setting up the GPRs
For each $i = 1, \dots, t$, we choose a mean function $m_i$ and a covariance function $C_i$ (i.e., a positive quadratic form) on X. Examples of covariance functions can be found in [16]. By Theorem 3, we obtain a Gaussian Process for each i. In case the covariance function involves the choice of some hyperparameter, we determine that parameter by solving Equation (5). Next, we condition each mean and covariance function to T using Equations (2) and (3), respectively. We obtain t GPRs, defined by their conditioned mean $\tilde m_i$ and covariance function $\tilde C_i$ for each i (i.e., for each output component).
3.4.2. Maximizing the EHVI
We maximize the expected hypervolume improvement, i.e., we solve
$$p = \operatorname{argmax}_{x \in X} \mathrm{EHVI}(x) \quad (10)$$
according to Equation (8) with respect to the mean functions $\tilde m_i$ and covariance functions $\tilde C_i$. Algorithms for the calculation of the expected hypervolume improvement may be found in [18,19]. Lastly, we evaluate the black box function f at the found p.
We close this section with a couple of practical remarks:
- GPRs form a rich class of regression models. However, evaluating a GPR involves the inversion of an $N \times N$ matrix with N being the number of training points (see Equation (3)). Accordingly, evaluating a GPR tends to become slow with an increasing number of training points.
- In addition, GPRs (as any regression model) require careful pre-processing of the data in order to produce reasonable results. At the very least, the input and output of the training data should be normalized (e.g., to $[0, 1]$).
To enable the reader to follow the GPR-MOBO algorithm described below, we present here a pre-processing example for the training data: Denote by $x^{\min}, x^{\max} \in \mathbb{R}^d$ the componentwise minimum resp. maximum within the design space X. Then, define
$$h(x) = \frac{x - x^{\min}}{x^{\max} - x^{\min}} \quad (11)$$
where the division is performed componentwise. Assuming X to be bounded and $x_i^{\min} < x_i^{\max}$ for all components i, this is well defined. Furthermore, we define $y^{\min}, y^{\max} \in \mathbb{R}^t$ by the componentwise minimum resp. maximum of the evaluations. Without loss of generality, we may assume $y_i^{\min} < y_i^{\max}$ for all i. Define
$$g(y) = \frac{y - y^{\min}}{y^{\max} - y^{\min}} \quad (12)$$
with componentwise division. It is straightforward to check that g satisfies the assumptions of Corollary 7.
We obtain normalized training data
$$\tilde T = \{(h(x_1), g(f(x_1))), \dots, (h(x_N), g(f(x_N)))\}.$$
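The pre-processing of Equations (11) and (12) reads in code as follows (a sketch; passing the design space bounds explicitly is our own convention):

```python
import numpy as np

def normalize(X, F, x_min, x_max):
    """Equations (11) and (12): x_min/x_max are the design space bounds,
    y_min/y_max the componentwise extrema of the evaluations F."""
    y_min, y_max = F.min(axis=0), F.max(axis=0)
    H = (X - x_min) / (x_max - x_min)   # h(x), Equation (11)
    G = (F - y_min) / (y_max - y_min)   # g(y), Equation (12)
    return H, G
```

Since g is strictly increasing in every component, Corollary 7 guarantees that Pareto points are preserved under this transformation.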
4. Algorithmic Implementation and Validation
In this section, we first derive an algorithmic implementation of an adaptive Bayesian Optimization (BO) algorithm based on GPR infill making use of the expected hypervolume improvement as acquisition function. In Section 4.2, we apply this algorithm to a mathematical test function with known Pareto front, before validating it in Section 4.3 within the context of a real-world power system application.
4.1. GPR-MOBO Algorithmic Implementation
The proposed GPR-MOBO workflow presented in Algorithm 1 strictly follows the sequence of steps for applying GPR with EHVI-based MOO to a black box function as described in Section 3.4.
| Algorithm 1 Structural MO Bayesian Optimization workflow. | | |
| $D \leftarrow \mathrm{LHCS}(X, N_0)$ | | ▷ Set initial DoE |
| $F \leftarrow f(D)$ | | ▷ Simulate at $D$ (expensive) |
| $m_i \leftarrow 0$ | | ▷ Choose zero map as mean function |
| $C_i \leftarrow$ squared exponential kernel | (1) | ▷ Choose covariance function |
| $r \leftarrow (1 + \epsilon, \dots, 1 + \epsilon)$ | (9) | ▷ Choose reference point |
| while $\lvert D \rvert < N_{\max}$ do | | ▷ unless abortion criterion is met |
| $(\tilde D, \tilde F) \leftarrow (h(D), g(F))$ | (11), (12) | ▷ Pre-process data |
| for $i = 1, \dots, t$ do | | ▷ for each dimension of target space |
| $\theta_i \leftarrow \operatorname{argmax}_\theta \log \varphi_\theta$ | (5) | ▷ Calculate hyperparameters |
| $\tilde m_i \leftarrow$ conditioned mean | (2) | ▷ Condition mean |
| $\tilde C_i \leftarrow$ conditioned covariance | (3) | ▷ Condition covariance |
| end for | | |
| $p \leftarrow \operatorname{argmax}_{x \in X} \mathrm{EHVI}(x)$ | (10) | ▷ Calculate optimal infill sample point |
| $D \leftarrow D \cup \{p\}$ | | ▷ Add infill to $D$ |
| $F \leftarrow F \cup \{f(p)\}$ | | ▷ Add simulation at $p$ to $F$ (expensive) |
| end while | | |
| $P_F \leftarrow$ Pareto front of $F$ | | ▷ Find non-dominated points |
| $P_D \leftarrow$ preimages of $P_F$ in $D$ | | ▷ Acquire according design space points |
| return $(P_D, P_F)$ | | |
Wherever applicable, we refer to equations as referenced in Section 3. As the initial design of experiments (DoE), we propose Latin Hypercube Sampling (LHCS) according to [20] as the basis of the initial computationally expensive black box function evaluations. We assume the initial samples obtained by these expensive evaluations to cover the full range of parameter values within the target space, which guarantees that its image lies in the unit cube after normalization.
$N_{\max}$ (the maximum number of samples available for computationally expensive black box function evaluations) and $N_0$ (the fraction of $N_{\max}$ spent as initial samples) may be considered the GPR-MOBO hyperparameters.
Algorithm 1 has been implemented in Matlab 2021b making use of the minimizer fmincon with the sqp option (called multiple times with different initial values by GlobalSearch) to find the minimum of the negative log marginal likelihood function for hyperparameter adaption (see Section 3.2) and, for BO (see Section 3.3), applied to the negative EHVI function as provided by [21]. It seems worthwhile noting that GlobalSearch does not yield deterministic results, i.e., multiple runs with identical input values may vary in their output values.
To make the computation more efficient, the Cholesky decomposition is used for the numerical inversion of $C_D$.
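In code, the Cholesky-based solve replaces the explicit inverse along the following lines (a sketch using SciPy; `sq_exp_kernel` refers to the Example 4 sketch, and the jitter term is a common numerical safeguard, not prescribed by the text):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
K = sq_exp_kernel(A, A) + 1e-6 * np.eye(50)  # jitter keeps K positive definite
b = rng.normal(size=50)

c_low = cho_factor(K)           # factor K = L L^T once
alpha = cho_solve(c_low, b)     # solve K alpha = b without forming K^{-1}
assert np.allclose(K @ alpha, b)
```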
4.2. Test Function Based Validation
In this subsection, we aim for an effectiveness and efficiency comparison of Algorithm 1 versus the state-of-the-art alternatives LHCS [20] and NSGA-II [22]. We present results of their application to a well defined black box function with analytically known Pareto front. In our case, we picked the test function ZDT1 (to be minimized) according to [22]:
$$f_1(x) = x_1, \qquad f_2(x) = g(x)\left(1 - \sqrt{\frac{x_1}{g(x)}}\right), \qquad g(x) = 1 + \frac{9}{d-1}\sum_{i=2}^{d} x_i,$$
with $x \in [0, 1]^d$ and d indicating the design space dimension. Note that ZDT1 exhibits a convex Pareto front independent of d.
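For reference, ZDT1 is only a few lines of code (a sketch; written for a single design point x in $[0, 1]^d$ with $d \geq 2$):

```python
import numpy as np

def zdt1(x: np.ndarray) -> np.ndarray:
    """ZDT1 test function [22] on [0, 1]^d, to be minimized."""
    g = 1.0 + 9.0 * np.sum(x[1:]) / (len(x) - 1)
    f1 = x[0]
    f2 = g * (1.0 - np.sqrt(f1 / g))
    return np.array([f1, f2])

# Analytic Pareto front: f2 = 1 - sqrt(f1) for f1 in [0, 1] (attained at g = 1).
```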
We compare results for the alternative Pareto front search algorithms, granting each the same budget of black box evaluations, and add an NSGA-II analysis with an enlarged budget of black box evaluations. For LHCS, all samples were spent within the initial run, while Algorithm 1 started with an initial LHCS subset of the budget. NSGA-II was run with numbers of generations and population sizes matched to the respective total budgets. For statistical purposes, the algorithm evaluations were repeated fifty times with different random starting values. The design space dimension d was varied over a selected set of values. The reference point during optimization was chosen according to Equation (9). Making use of knowledge about the ZDT1 target domain, we fixed the normalization bounds for pre-processing. To quantify the search algorithms’ performance, the hypervolume with respect to the reference point in relation to the (known) maximal hypervolume is evaluated. The results are plotted in Figure 3.
Figure 3.
Statistical evaluation of the relative hypervolume from 50 repeated runs comparing the performance of three MOO algorithms for a selected set of ZDT1 design space dimensions d. The top three evaluations, GPR-MOBO (red), LHCS (blue), and NSGA-II (brown), share the same evaluation budget. The bottom evaluation, NSGA-II (green), is based on the enlarged budget.
Box plots in Figure 3 indicate statistical evaluations of the repeated search runs. GPR-MOBO results are drawn in red, LHCS results in blue, and NSGA-II results in brown for the standard and in green for the enlarged evaluation budget.
4.3. Power System Design Based Validation
In this last subsection, we apply the GPR-MOBO Algorithm 1 to the design and optimization of a power system example. Figure 4 illustrates both the toolchain and the generic power system model as used for GPR-MOBO power system design and optimization validation.
Figure 4.
Toolchain and generic power system model as used for GPR-MOBO power system design based validation.
For the energy domain specific “Power System Modeling and Simulation Environment” of the tool chain in our example, we use the commercial tool PSS®DE [23]. Connected to the “Script Control” (implemented in Python) through an “xml interface”, the adjustable design space parameters of the power system model receive dedicated parameter values as computed within the “Algorithmic Workflow” (execution of the GPR-MOBO algorithm in Matlab 2021b code). Data stored within the “Data Base” are accessible for “Statistical Evaluation” and “Visualisation” (all implemented in Python). The generic power system model for our validation example is defined by a star topology connecting a standard-profile [24] electric load with fixed annual demand (in GWh) to three aggregate components (“Wind turbine”, “PV”, “Battery”) and a “Public Grid”. The “Power System Modeling and Simulation Environment” simulates power system results in terms of well defined key performance indicators (KPI): CAPEX (capital expenditures for installation of the according power system) and CO2 (amount of carbon dioxide emitted by a given configuration to provide the total amount of energy). The KPI behavior of the “Power System Modeling and Simulation Environment” can therefore be viewed as a black box function.
Parameter value ranges defining the design space for the system of interest are listed in Table 1.
Table 1.
Design space limits for the experiment.
As described in Section 3.4, for the GPR-MOBO, the design and target space samples (design parameters and KPIs, respectively) are normalized according to Equations (11) and (12).
The results for the experiment are shown in Figure 5. As GPR-MOBO hyperparameters, 30 initial samples and a fixed total sample budget were chosen, while the total emission of carbon dioxide (CO2 in kilotons) and the capital expenditures for acquisition and installation of the aggregate components (CAPEX in million euros) were selected as the trade-off KPIs to be evaluated. Making use of knowledge about the according target domain, we fixed the normalization bounds for pre-processing. The reference point during optimization was chosen according to Equation (9).
Figure 5.
CAPEX vs. CO2 evaluations (marked by crosses) as obtained by power system simulation. Initial samples are marked by cyan crosses (×), Pareto-optimal samples by red crosses (×). An approximate Pareto front based on 1200 full-factorial design space latticing sample points is indicated by blue dots, with interpolation marked by a blue dotted line.
Figure 5 shows the subspace of the target space acquired by the experiment. All sample points are marked by crosses; the Pareto-optimal solutions forming the Pareto front are highlighted by red crosses, and the remaining portion of the initial (30 LHCS) results is marked by cyan crosses. In addition, an approximate Pareto front is plotted, resulting from a full-factorial design space latticing based on 1200 sample points. Its non-dominated Pareto points are indicated by blue dots, with the Pareto front completed by linear interpolation (marked by a blue dotted line).
5. Discussion
According to Section 2, we may interpret the identified hypervolume of a black box function as an indicator of effectiveness for a given Pareto front search algorithm. Putting that identified hypervolume in ratio to the number of computationally expensive evaluations required to identify it, in turn, may be considered a suitable indicator of the algorithm’s efficiency. Applying these definitions to the results obtained in the previous Section 4, Figure 3 clearly indicates:
- (i) Given the limited low number of samples, all selected algorithms show a decreasing effectiveness over an increasing number of design space dimensions (i.e., degrees of design freedom, DoF).
- (ii) GPR-MOBO outperforms the other algorithms in effectiveness for all DoF d, reaching a median effectiveness well above that of the other algorithms even for the highest evaluated d.
- (iii) Within the standard evaluation budget, the effectiveness of NSGA-II and LHCS appears comparable within statistical significance for all evaluated DoF d.
- (iv) GPR-MOBO outperforms NSGA-II even when the latter is granted the enlarged sample budget.
- (v) The efficiency of GPR-MOBO, compared to the other algorithms, significantly increases with increasing DoF d.
On the other hand, it is worthwhile mentioning that the GPR-MOBO algorithm requires more computation time than the standard LHCS or NSGA-II algorithms. Depending on the test function dimension d and the hardware environment, GPR-MOBO runs (i.e., 20 iterations) took substantially longer, wherein the fraction spent evaluating the black box function (ZDT1) can be treated as negligible, while NSGA-II or LHCS runs took less than a tenth of this time. This indicates that GPR-MOBO is advisable if the black box function is expensive to evaluate in terms of capital, time, or other resources.
The experimental validation using an unknown black box function, whose result is shown in Figure 5, again confirms the effectiveness and efficiency of GPR-MOBO. The Pareto front identified by the limited number of black box evaluations is already very close to the front approximated by the full-factorial lattice based black box evaluations. Some GPR-MOBO Pareto points even dominate those identified by the full-factorial lattice. The example thereby demonstrates the effectiveness and efficiency of GPR-MOBO based power system design and optimization.
The results shown indicate a general superiority of GPR-MOBO over state-of-the-art algorithms. However, this has been demonstrated only by example. We therefore point out the inadmissibility of generalizing this superiority: a general superiority of GPR-MOBO cannot and should not be derived from single, individual test functions or application examples. Such a fundamental superiority would have to be proven mathematically and would presumably require fundamental knowledge of the black box function itself or of the Pareto front spanned by this black box function.
6. Conclusions and Outlook
In this paper, we tackled the challenge of power system design and optimization in a VUCA environment. We proposed a Multi-Objective Bayesian Optimization based on Gaussian Process Regression (GPR-MOBO) in the context of power system virtual prototyping. After a mathematical reformulation of the challenge, we presented the background of Gaussian Process Regression, including hyperparameter adaption and its use in the context of a Bayesian Optimization approach, focusing on the expected hypervolume improvement. For validation purposes, we benchmarked our GPR-MOBO implementation statistically based on a mathematical test function with analytically known Pareto front and compared results to those of the well-known algorithm NSGA-II and pure Latin Hypercube Sampling. We demonstrated superiority of the GPR-MOBO approach over the compared algorithms, especially for high dimensional design spaces. Finally, we applied the GPR-MOBO algorithm to the planning and optimization of a power system (energy park) in terms of selected performance indicators of exemplary character.
In conclusion, GPR-MOBO turned out to be an effective and efficient approach for power system design and optimization in a VUCA environment, with superior character when simulations are computationally expensive and the number of design degrees of freedom is high.
Some topics remain open for future investigation. Besides a performance comparison with algorithms other than those already selected, some detailed questions within the GPR-MOBO family are worth considering. These include the choice of the acquisition function, pre-processing, the selection of reference points when the (expected) hypervolume (improvement) is in focus, and the application of various global optimizers, to name just a few. One level above, questions not yet satisfactorily answered address the extension to mixed-integer design spaces and issues related to constraint handling.
Author Contributions
Conceptualization, H.P.; methodology, H.P.; software, N.P. and M.L.; validation, N.P. and M.L.; writing—original draft preparation, H.P.; writing—review and editing, N.P., M.L. and H.P.; visualization, M.L. and H.P.; supervision, H.P.; project administration, H.P.; mathematical theory, N.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was partially funded by Siemens AG under the “Pareto optimal design of Decentralized Energy Systems (ProDES)” project framework.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors would like to thank Siemens AG for providing a free PSS®DE license for the power system design based validation part.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Maximization of Hypervolume and Pareto Points
Lemma A1.
Let with . Define
as vector r with i-th component replaced with the maximum of and . Then,
Proof.
First, we prove “⊇”:
Since for all i, we obtain , hence, “⊇”.
Secondly, we prove “⊆”:
Given some there exists an i with . Since , we obtain . This proves “⊆”.
Moreover, if , then, , hence, implies and thus . This proves that and are disjoint. □
Corollary A2.
For with and , there exists with
Proof.
We may assume , since otherwise and the claim follows for . Using Lemma A1, we obtain
Furthermore, since and . Thus, satisfies the claim. □
Corollary A3.
Let be a finite subset and with . Assume further for every there exists such that . Then, there exists with and
In particular, the Lebesgue measure of is greater than zero.
Proof.
We prove the claim by induction over the size n of S. For , this is the Corollary A2. Assume the claim holds for all such S with size equal to n. Given S with cardinality , there exists such c with
Furthermore, we obtain
for some where the last subset is obtained by applying Corollary A2 to .
Lastly, the Lebesgue measure is given by the product which is greater than zero since . □
Corollary A4.
Let be a finite subset and with for all . Then,
if and only if there exists with .
Proof.
Given , by the very construction. Thus, “⇐” holds. Given assume that no satisfies . This is for every there exists such that . Since
and the additivity of the Lebesgue measure, it suffices to prove
This follows by the Corollary A3. □
Corollary A5.
Let be a finite subset and with for all , and for some i. Assume further for every there exists such that . Then,
Proof.
Clearly, the last inequality holds. Due to
(resp. for ) and the additivity of the Lebesgue measure, it suffices to prove
By Corollary A3, there exists such that
Observe
since . Applying Corollary A2, we obtain some such that
Observe
since . Together we obtain,
By the additivity of the Lebesgue measure, we obtain
Since , we argue as in the proof of the previous Corollary that . This proves the claim. □
Proof of Proposition 1.
We may without loss of generality assume $y \preceq r$ for all $y \in S$. Indeed, let $S'$ be the set of points in S satisfying $y \preceq r$. Given some Pareto point $y \in S'$, assume there exists some $y' \in S \setminus S'$ dominating y. Then, $y' \preceq y \preceq r$, which contradicts $y' \notin S'$.
Since S is bounded and closed and is continuous, we deduce the existence of some which maximizes . Since
we obtain there exists no with by Corollary A4. i.e., for all there exists with . Assume there exists with and . Then, there exists no with since . By applying Corollary A5, we deduce
which contradicts . □
Appendix B. Probability Measure for Multivariate Normal Distribution
Theorem A6.
Let I be a set and for every finite an inner regular probability measure on be given. Given two finite , denote by the canonical projection. Assume that for all finite
holds. Then, there exists a unique measure on satisfying
for all finite.
Before giving the proof, recall:
Theorem A7 (Hahn-Kolmogorov).
Let S be a set and $R \subseteq \mathcal{P}(S)$ be a ring, i.e., $\emptyset \in R$ and R is stable under finite unions and binary complements. Let
$$\mu: R \to [0, \infty]$$
be a pre-measure, i.e.,
$$\mu(\emptyset) = 0$$
and
$$\mu\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} \mu(A_i)$$
for pairwise disjoint $A_1, A_2, \dots \in R$
with
$$\bigcup_{i=1}^{\infty} A_i \in R.$$
Then, μ extends to a measure P on the sigma algebra generated by R. Furthermore, if μ is σ-finite, then the extension is unique.
Proof of Theorem A6.
The sigma algebra of the product is generated by
The reader convinces himself that R is a ring. Define a function
by Observe that given , then, without loss of generality and . Thus,
and is well defined. Then, the reader convinces himself that
and is finite additive. We prove that is -additive, i.e., given pairwise disjoint with , then,
Then, the Hahn-Kolmogorov theorem above proves the claim. Notice that every probability measure is σ-finite and, thus, so is the pre-measure constructed above.
It is well known that it suffices to prove
for all
with since is finite additive. We prove that if there exists an such that
for all , then, . Write such that for every i. Then,
since is inner regular. For all i choose some compact such that
Write for every i. We first prove
for all n. Notice
since is finite additive and
Furthermore, since for all i. Thus,
by (A2) and it suffices to prove
for all n. It holds
In particular, and, hence, for all n. We consider the descending sequence
Since all are compact, is Hausdorff (hence, is) and all are non-empty, the below claim ensures that
and, hence, the claim. □
Lemma A8.
Let
be a descending sequence of compact topological spaces with E being a Hausdorff space.
Then, $\bigcap_{i} E_i = \emptyset$ implies that there exists some n with $E_n = \emptyset$.
Proof.
If $\bigcap_i E_i = \emptyset$, then, the union of the complements $\bigcup_i (E \setminus E_i)$
is E. Recall every compact subset of a Hausdorff space to be closed. Hence, $E \setminus E_i$ is open in E for every i. Since E is compact, there exist finitely many indices $i_1, \dots, i_k$ such that $E = (E \setminus E_{i_1}) \cup \dots \cup (E \setminus E_{i_k})$. Hence, $E_{i_1} \cap \dots \cap E_{i_k}$ is empty. Thus, $E_n = \emptyset$
for some n greater than all $i_1, \dots, i_k$. □
Proof of Theorem 3.
In view of Theorem A6, it suffices to prove
for any finite subsets . Observe that is given by left multiplication with the matrix and that A has full rank (since the projection is an epimorphism). Furthermore, by construction we obtain equalities
and
Then, the claim is precisely “(9.5) Satz” in [25] for and . □
Proof of Corollary 6.
Denote by
Given some finite, a straightforward computation proves
Then, and are functions such that for every finite the induced matrix is positive definite and the density function of the induced normal distribution
is the conditional density function induced by given by Theorem 2. Applying Theorem 3 with mean function and covariance function finishes the proof. □
Appendix C. Integrability of Hypervolume Improvement
Lemma A9.
The Hypervolume improvement function
is continuous, hence integrable, for F finite and $r \in \mathbb{R}^t$ such that $y \preceq r$ for all $y \in F$.
Proof.
It suffices to prove that
is continuous. First, recall that given sets
, the Lebesgue measure of their union is given by
since is additive. Given , we define
Clearly, for all . Using , we obtain
for by induction. Writing
as sum as above and using (A4), it suffices to prove
to be continuous for all . Therefore, it suffices to prove
- (i)
- and
- (ii)
- to be continuous.
To (i):
We observe
(where everything is taken componentwise) and, thus, is continuous for all s. Using
we obtain the claim.
To (ii):
We calculate
which is continuous in y. □
References
- Elkington, R. Leadership Decision-Making Leveraging Big Data in Vuca Contexts. J. Leadersh. Stud. 2018, 12, 66–70. [Google Scholar] [CrossRef]
- Afshari, H.; Hare, W.; Tesfamariam, S. Constrained multi-objective optimization algorithms: Review and comparison with application in reinforced concrete structures. Appl. Soft Comput. 2019, 83, 105631. [Google Scholar] [CrossRef]
- Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proc. IEEE 2016, 104, 148–175. [Google Scholar] [CrossRef]
- Lyu, W.; Xue, P.; Yang, F.; Yan, C.; Hong, Z.; Zeng, X.; Zhou, D. An Efficient Bayesian Optimization Approach for Automated Optimization of Analog Circuits. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 1954–1967. [Google Scholar] [CrossRef]
- Zhang, S.; Yang, F.; Yan, C.; Zhou, D.; Zeng, X. An Efficient Batch-Constrained Bayesian Optimization Approach for Analog Circuit Synthesis via Multiobjective Acquisition Ensemble. IEEE Trans.-Comput.-Aided Des. Integr. Circuits Syst. 2022, 41, 1–14. [Google Scholar] [CrossRef]
- Guo, J.; Crupi, G.; Cai, J. A Novel Design Methodology for a Multioctave GaN-HEMT Power Amplifier Using Clustering Guided Bayesian Optimization. IEEE Access 2022, 10, 52771–52781. [Google Scholar] [CrossRef]
- Sawant, M.M.; Bhurchandi, K. Hierarchical Facial Age Estimation Using Gaussian Process Regression. IEEE Access 2019, 7, 9142–9152. [Google Scholar] [CrossRef]
- Huang, H.; Song, Y.; Peng, X.; Ding, S.X.; Zhong, W.; Du, W. A Sparse Nonstationary Trigonometric Gaussian Process Regression and Its Application on Nitrogen Oxide Prediction of the Diesel Engine. IEEE Trans. Ind. Inform. 2021, 17, 8367–8377. [Google Scholar] [CrossRef]
- Koriyama, T.; Nose, T.; Kobayashi, T. Statistical Parametric Speech Synthesis Based on Gaussian Process Regression. IEEE J. Sel. Top. Signal Process. 2014, 8, 173–183. [Google Scholar] [CrossRef]
- Schulz, E.; Speekenbrink, M.; Krause, A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychol. 2018, 85, 1–16. [Google Scholar] [CrossRef]
- Lewis-Beck, C.; Lewis-Beck, M. Applied Regression: An Introduction; Sage Publications: Thousand Oaks, CA, USA, 2015; Volume 22. [Google Scholar]
- Verleysen, M.; François, D. The curse of dimensionality in data mining and time series prediction. In Proceedings of the International Work-Conference on Artificial Neural Networks, Barcelona, Spain, 8–10 June 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 758–770. [Google Scholar]
- Frazier, P.I. Bayesian optimization. In Recent Advances in Optimization and Modeling of Contemporary Problems; Informs: Phoenix, AZ, USA, 2018; pp. 255–278. [Google Scholar]
- Emmerich, M.; Deutz, A.H. A tutorial on multiobjective optimization: Fundamentals and evolutionary methods. Nat. Comput. 2018, 17, 585–609. [Google Scholar] [CrossRef] [PubMed]
- Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Wiley: New York, NY, USA, 2003. [Google Scholar]
- Duvenaud, D. Automatic Model Construction with Gaussian Processes. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2014. [Google Scholar]
- Ishibuchi, H.; Imada, R.; Setoguchi, Y.; Nojima, Y. Reference point specification in hypervolume calculation for fair comparison and efficient search. In Proceedings of the Genetic and Evolutionary Computation Conference, Berlin, Germany, 15–19 July 2017; pp. 585–592. [Google Scholar]
- Emmerich, M.; Deutz, A.; Klinkenberg, J. Hypervolume-based expected improvement: Monotonicity properties and exact computation. In Proceedings of the 2011 IEEE Congress of Evolutionary Computation (CEC), Ritz Carlton, New Orleans, LA, USA, 5–8 June 2011; pp. 2147–2154. [Google Scholar]
- Yang, K.; Emmerich, M.; Deutz, A.; Bäck, T. Efficient computation of expected hypervolume improvement using box decomposition algorithms. J. Glob. Optim. 2019, 75, 3–34. [Google Scholar] [CrossRef]
- McKay, M.D.; Beckman, R.J.; Conover, W.J. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 1979, 21, 239. [Google Scholar]
- Emmerich, M. KMAC V1.0 - The efficient O(n log n) implementation of 2D and 3D Expected Hypervolume Improvement (EHVI). Available online: https://liacs.leidenuniv.nl/~csmoda/index.php?page=code (accessed on 22 June 2022).
- Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197. [Google Scholar] [CrossRef]
- Siemens, Data Sheet PSS®DE. Available online: https://new.siemens.com/global/en/products/energy/energy-automation-and-smart-grid/grid-edge-software/pssde.html (accessed on 24 August 2022).
- Proedrou, E. A comprehensive review of residential electricity load profile models. IEEE Access 2021, 9, 12114–12133. [Google Scholar] [CrossRef]
- Georgii, H.O. Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik, 5. Auflage; Walter de Gruyter: Berlin, Germany, 2007. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).