
Derivative-Free Multiobjective Trust Region Descent Method Using Radial Basis Function Surrogate Models

by Manuel Berkemeier 1,* and Sebastian Peitz 2

1 Chair of Applied Mathematics, Faculty for Computer Science, Electrical Engineering and Mathematics, Paderborn University, Warburger Str. 100, 33098 Paderborn, Germany
2 Department of Computer Science, Faculty for Computer Science, Electrical Engineering and Mathematics, Paderborn University, Warburger Str. 100, 33098 Paderborn, Germany
* Author to whom correspondence should be addressed.
Math. Comput. Appl. 2021, 26(2), 31; https://doi.org/10.3390/mca26020031
Submission received: 26 February 2021 / Revised: 2 April 2021 / Accepted: 7 April 2021 / Published: 15 April 2021
(This article belongs to the Special Issue Numerical and Evolutionary Optimization 2020)

Abstract:
We present a local trust region descent algorithm for unconstrained and convexly constrained multiobjective optimization problems. It is targeted at heterogeneous and expensive problems, i.e., problems that have at least one objective function that is computationally expensive. Convergence to a Pareto critical point is proven. The method is derivative-free in the sense that derivative information need not be available for the expensive objectives. Instead, a multiobjective trust region approach is used that works similarly to its well-known scalar counterparts and complements multiobjective line-search algorithms. Local surrogate models constructed from evaluation data of the true objective functions are employed to compute possible descent directions. In contrast to existing multiobjective trust region algorithms, these surrogates are not polynomial but carefully constructed radial basis function networks. This has the important advantage that the number of data points needed per iteration scales linearly with the decision space dimension. The local models qualify as fully linear and the corresponding general scalar framework is adapted for problems with multiple objectives.

1. Introduction

Optimization problems arise in a multitude of applications in mathematics, computer science, engineering and the natural sciences. In many real-life scenarios, there are multiple, equally important objectives that need to be optimized. Such problems are called Multiobjective Optimization Problems (MOP). In contrast to the single objective case, an MOP often does not have a single solution but an entire set of optimal trade-offs between the different objectives, which we call Pareto optimal. These solutions constitute the Pareto Set and their image is the Pareto Frontier. The goal in the numerical treatment of an MOP is to either approximate these sets or to find single points within them. In applications, the problem becomes more difficult when some of the objectives require computationally expensive or time-consuming evaluations. For instance, the objectives could depend on a computer simulation or some other black-box. It is then of primary interest to reduce the overall number of function evaluations. Consequently, it can become infeasible to approximate derivative information of the true objectives using, e.g., finite differences. This holds true especially if higher order derivatives are required. In this work, optimization methods that do not use the true objective gradients (which are nonetheless assumed to exist) are referred to as derivative-free.
There is a variety of methods to deal with MOPs, some of which are also derivative-free or try to limit the number of expensive function evaluations. A broad overview of different problems and techniques concerning multiobjective optimization can be found, e.g., in [1,2,3,4]. One popular approach for calculating Pareto optimal solutions is scalarization, i.e., the transformation of an MOP into a single objective problem, cf. [5] for an overview. Alternatively, classical (single objective) descent algorithms can be adapted to the multiobjective case [6,7,8,9,10,11]. Moreover, the structure of the Pareto Set can be exploited to find multiple solutions [12,13]. There are also methods for non-smooth problems [14,15] and multiobjective direct-search variants [16,17]. Both scalarization and descent techniques may be included in Evolutionary Algorithms (EA) [18,19,20,21,22]. To address computationally expensive objectives or missing derivative information, there are algorithms that use surrogate models (see the surveys [23,24,25]) or borrow ideas from scalar trust region methods, e.g., [26].
In single objective optimization, trust region methods are well suited for derivative-free optimization [27,28]. Our work is based on the recent development of multiobjective trust region methods:
  • In [29], a trust region method using Newton steps for functions with positive definite Hessians on an open domain is proposed.
  • In [30], quadratic Taylor polynomials are used to compute the steepest descent direction which is used in a backtracking manner to find solutions for unconstrained problems.
  • In [31], polynomial regression models are used to solve an augmented MOP based on the scalarization in [17]. The algorithm is designed for unconstrained bi-objective problems, but the general idea has been formulated for more objectives in [32].
  • In [33], quadratic Lagrange polynomials are used and the Pascoletti–Serafini scalarization is employed for the descent step calculation.
Our contribution is the extension of the above-mentioned methods to general fully linear models (and in particular Radial Basis Function (RBF) surrogates as in [34]), which is related to the scalar framework in [35]. Most importantly, this reduces surrogate construction complexity, in terms of objective evaluations per iteration, to linear with respect to the number of decision variables, in contrast to the quadratically increasing number of function evaluations for methods using second degree polynomials. We further prove convergence to critical points when the problem is constrained to a convex and compact set by using an analogous argumentation as in [36]. To this end, we extend the theory in [6] to provide new results concerning the continuity of the solutions of the projected steepest descent direction problem, which is based on the alternative formulation by Fliege and Svaiter [7]. We also show how to keep the convergence properties for constrained problems when the Pascoletti–Serafini scalarization is employed (like in [33]).
The remainder of the paper is structured as follows: Section 2 provides a brief introduction to multiobjective optimality and criticality concepts. In Section 3 the fundamentals of the algorithm are explained. In Section 4 we introduce fully linear surrogate models and describe the construction of suitable polynomial models and RBF models for unconstrained and box-constrained problems. We also formalize the main algorithm in this section. Section 5 deals with the descent step calculation, ensuring that a sufficient decrease is achieved in each iteration. Convergence is proven in Section 6, and a few numerical examples for unconstrained and box-constrained problems are shown in Section 7. There we also compare the RBF models against linear polynomial models that have the same linear construction complexity. We conclude with a brief discussion in Section 8.

2. Optimality and Criticality in Multiobjective Optimization

We consider the following (real-valued) multiobjective optimization problem:
$$\min_{x \in X} f(x) := \min_{x \in X} \begin{bmatrix} f_1(x) \\ \vdots \\ f_k(x) \end{bmatrix} \in \mathbb{R}^k, \tag{MOP}$$
with a feasible set $X \subseteq \mathbb{R}^n$ and $k$ objective functions $f_\ell\colon \mathbb{R}^n \to \mathbb{R}$, $\ell = 1, \dots, k$. We further assume (MOP) to be heterogeneous. That is, there is a non-empty subset $I_{\mathrm{ex}} \subseteq \{1, \dots, k\}$ of indices so that the gradients of $f_\ell$, $\ell \in I_{\mathrm{ex}}$, are unknown and cannot be approximated, e.g., via finite differences. The (possibly empty) index set $I_{\mathrm{cheap}} = \{1, \dots, k\} \setminus I_{\mathrm{ex}}$ indicates functions whose gradients are available.
Solutions for (MOP) consist of optimal trade-offs $x^* \in X$ between the different objectives and are called non-dominated or Pareto optimal. That is, there is no $x \in X$ with $f(x) \preceq f(x^*)$ (i.e., $f_\ell(x) \le f_\ell(x^*)$ for all $\ell$ and $f_\ell(x) < f_\ell(x^*)$ for some index $\ell \in \{1, \dots, k\}$). The subset $\mathcal{P}_S \subseteq X$ of non-dominated points is then called the Pareto Set and its image $\mathcal{P}_F := f(\mathcal{P}_S) \subseteq \mathbb{R}^k$ is called the Pareto Frontier. All concepts can be defined in a local fashion in an analogous way.
Similar to scalar optimization, there is a necessary condition for local optima using the gradients of the objective function. We therefore implicitly assume all objective functions $f_\ell$, $\ell = 1, \dots, k$, to be continuously differentiable on $X$. Moreover, the following assumption allows for an easier treatment of tangent cones in the constrained case:
Assumption 1.
Either the problem is unconstrained, i.e., $X = \mathbb{R}^n$, or the feasible set $X \subset \mathbb{R}^n$ is compact and convex. All functions are defined on $X$.
The second case is a standard assumption in the MO literature for constrained problems [6,7]. Now let $\nabla f_\ell(x)$ denote the gradient of $f_\ell$ and $Df(x) \in \mathbb{R}^{k \times n}$ the Jacobian of $f$ at $x \in X$.
Definition 1.
We call a vector $d \in X - x$ a multi-descent direction for $f$ in $x$ if $\langle \nabla f_\ell(x), d \rangle < 0$ for all $\ell \in \{1, \dots, k\}$, or equivalently if
$$\max_{\ell = 1, \dots, k} \big\langle \nabla f_\ell(x), d \big\rangle < 0, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ is the standard inner product on $\mathbb{R}^n$ and we consider $X - x = X$ in the unconstrained case $X = \mathbb{R}^n$.
A point $x^* \in X$ is called critical for (MOP) iff there is no descent direction $d \in X - x^*$ satisfying (1). As all Pareto optimal points are also critical (cf. [6,37] or [2] (Ch. 17)), it is viable to search for optimal points by calculating points from the superset $\mathcal{P}_{\mathrm{crit}} \supseteq \mathcal{P}_S$ of critical points for (MOP). Similar to single objective optimization, using such a first order condition makes sense especially in combination with some global method or when exploring the structure of the critical set. We discuss promising approaches in Section 8. Note that, due to the above restrictions, our method is not a general replacement for other methods, e.g., scalarization approaches, but rather an additional tool for situations where those are not applicable.
One intuitive way to approach the critical set is by iteratively performing descent steps. Fliege and Svaiter [7] propose several ways to compute suitable descent directions. The minimizer $d^*$ of the following problem is known as the multiobjective steepest descent direction.
$$\min_{d \in X - x}\, \max_{\ell = 1, \dots, k} \big\langle \nabla f_\ell(x), d \big\rangle \quad \text{s.t.} \quad \|d\| \le 1. \tag{P1}$$
Problem (P1) has an equivalent reformulation as
$$\min_{\beta \in \mathbb{R},\ d \in X - x} \beta \quad \text{s.t.} \quad \|d\| \le 1 \ \text{ and } \ \big\langle \nabla f_\ell(x), d \big\rangle \le \beta \quad \forall \ell = 1, \dots, k, \tag{P2}$$
which is a linear program if $X$ is defined by linear constraints and the maximum norm $\|\cdot\| = \|\cdot\|_\infty$ is used [7]. We thus stick with this choice because it facilitates implementation, but note that other choices are possible (see for example [33]).
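To make this concrete, (P2) can be handed directly to a linear programming solver. The following minimal sketch (not the implementation used for this paper) solves it with `scipy.optimize.linprog` for the unconstrained and box-constrained cases; the function name and the handling of the box bounds are our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def steepest_descent_direction(jac, x, lb=None, ub=None):
    """Solve (P2): min beta  s.t.  ||d||_inf <= 1,
    <grad f_l(x), d> <= beta for all l, and (for box constraints) x + d in X.

    jac : (k, n) array whose rows are the objective gradients at x.
    Returns the direction d* and the criticality value omega(x) = -beta*.
    """
    k, n = jac.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                   # minimize beta
    A_ub = np.hstack([jac, -np.ones((k, 1))])     # <grad f_l, d> - beta <= 0
    b_ub = np.zeros(k)
    lo = -np.ones(n) if lb is None else np.maximum(-1.0, np.asarray(lb) - x)
    hi = np.ones(n) if ub is None else np.minimum(1.0, np.asarray(ub) - x)
    bounds = [(lo[i], hi[i]) for i in range(n)] + [(None, None)]  # beta is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:-1], -res.x[-1]

# example: f(x) = (x_1^2, x_2^2) at x = (1, 1) gives d* = (-1, -1), omega = 2
# d, omega = steepest_descent_direction(np.array([[2., 0.], [0., 2.]]), np.ones(2))
```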
Motivated by the next theorem, we can use the optimal value of either problem as a measure of criticality, i.e., as a multiobjective counterpart of the gradient norm. As is standard in most multiobjective trust region works (cf. [29,30,33]), we flip the sign so that the values are non-negative.
Theorem 1.
For $x \in X$ let $d^*(x)$ be the minimizer of (P1) and $\omega(x)$ be the negative optimal value, that is,
$$\omega(x) := -\max_{\ell = 1, \dots, k} \big\langle \nabla f_\ell(x), d^*(x) \big\rangle.$$
Then the following statements hold:
  • $\omega(x) \ge 0$ for all $x \in X$.
  • The function $\omega\colon \mathbb{R}^n \to \mathbb{R}$ is continuous.
  • The following statements are equivalent:
    (a) The point $x \in X$ is not critical.
    (b) $\omega(x) > 0$.
    (c) $d^*(x) \ne 0$.
Consequently, the point $x$ is critical iff $\omega(x) = 0$.
Proof. 
For the unconstrained case all statements are proven in [7] (Lemma 3).
The first and the third statement hold true for $X$ convex and compact by definition. The continuity of $\omega$ can be shown similarly as in [6], see Appendix A.1.    □
With further conditions on $\nabla f$ and $X$, the criticality measure $\omega(x)$ is even Lipschitz continuous and subsequently uniformly and Cauchy continuous:
Theorem 2.
If $\nabla f_\ell$, $\ell = 1, \dots, k$, are Lipschitz continuous and Assumption 1 holds, then the map $\omega$ as defined in Theorem 1 is uniformly continuous.
Proof. 
The proof for $X = \mathbb{R}^n$ is given by Thomann [38]. A proof for the constrained case can be found in Appendix A.1, so as not to clutter this introductory section.    □
Together with Theorem 1 this hints at $\omega$ being a criticality measure as defined for scalar trust region methods in [36] (Ch. 8):
Definition 2.
We call $\pi\colon \mathbb{N}_0 \times \mathbb{R}^n \to \mathbb{R}$ a criticality measure for (MOP) if $\pi$ is Cauchy continuous with respect to its second argument and if
$$\lim_{t \to \infty} \pi\big(t, x^{(t)}\big) = 0$$
implies that the sequence $\{x^{(t)}\}$ asymptotically approaches a Pareto critical point.

3. Trust Region Ideas

Multiobjective trust region algorithms closely follow the design of scalar approaches (see [36] for an extensive treatment) and provide an alternative to (approximate) line-search algorithms (e.g., [7]). Consequently, the requirements and convergence proofs in [29,30,33] for the unconstrained multiobjective case are fairly similar to those in [36]. We will reexamine the core concepts to provide a clear understanding and point out the similarities to the single objective case.
The main idea is to iteratively compute multi-descent steps $s^{(t)}$ in every iteration $t \in \mathbb{N}_0$. We could, for example, use the steepest descent direction given by (P1). This would require knowledge of the true objective gradients, which need not be available for objective functions with indices in $I_{\mathrm{ex}}$. Hence, benevolent surrogate model functions
$$m^{(t)}\colon \mathbb{R}^n \to \mathbb{R}^k, \qquad x \mapsto m^{(t)}(x) = \big[m_1^{(t)}(x), \dots, m_k^{(t)}(x)\big]^T,$$
are employed (at least for the expensive objectives).
The surrogate models are constructed to be sufficiently accurate within a trust region
$$B^{(t)} := B\big(x^{(t)}; \Delta^{(t)}\big) = \big\{x \in X : \|x - x^{(t)}\| \le \Delta^{(t)}\big\}, \quad \text{with } \|\cdot\| = \|\cdot\|_\infty, \tag{2}$$
around the current iterate x ( t ) . To be precise, the models are made fully linear as described in Section 4.1. This ensures that the model error and the model gradient error are uniformly bounded within the trust region.
The model steepest descent direction $d_m^{(t)}$ can then be computed as the minimizer of the surrogate problem
$$\omega_m^{(t)}\big(x^{(t)}\big) := -\min_{\beta \in \mathbb{R},\ d \in X - x^{(t)}} \beta \quad \text{s.t.} \quad \|d\| \le 1 \ \text{ and } \ \big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d \big\rangle \le \beta \quad \forall \ell = 1, \dots, k. \tag{Pm}$$
Now let $\sigma^{(t)} > 0$ be a step size. The direction $d_m^{(t)}$ need not be a descent direction for the true objectives $f$, and the trial point $x_+^{(t)} = x^{(t)} + \sigma^{(t)} d_m^{(t)}$ is only accepted if a measure $\rho^{(t)}$ of improvement and model quality surpasses a positive threshold $\nu_+$. As in [30,33], we scalarize the multiobjective problems by defining
$$\Phi(x) := \max_{\ell = 1, \dots, k} f_\ell(x), \qquad \Phi_m^{(t)}(x) := \max_{\ell = 1, \dots, k} m_\ell^{(t)}(x).$$
Whenever $\Phi(x^{(t)}) - \Phi(x_+^{(t)}) > 0$, there is a reduction in at least one objective function of $f$ because of
$$0 < \Phi\big(x^{(t)}\big) - \Phi\big(x_+^{(t)}\big) = f_\ell\big(x^{(t)}\big) - f_q\big(x_+^{(t)}\big) \overset{\text{df.}}{\le} f_\ell\big(x^{(t)}\big) - f_\ell\big(x_+^{(t)}\big),$$
where we denoted by $\ell$ the (not necessarily unique) maximizing index in $\Phi(x^{(t)})$ and by $q$ the (neither necessarily unique) maximizing index in $\Phi(x_+^{(t)})$. (The abbreviation "df." above the inequality symbol stands for "(by) definition" and is used throughout this document when appropriate.) Of course, the same property holds for $\Phi_m^{(t)}(\cdot)$ and $m^{(t)}$.
Thus, the step size $\sigma^{(t)} > 0$ is chosen so that the step $s^{(t)} = \sigma^{(t)} d_m^{(t)}$ satisfies both $x^{(t)} + s^{(t)} \in B^{(t)}$ and a "sufficient decrease condition" of the form
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)} + s^{(t)}\big) \ge \kappa_{\mathrm{sd}}\, \omega\big(x^{(t)}\big) \min\Big\{ C \cdot \omega\big(x^{(t)}\big),\ 1,\ \Delta^{(t)} \Big\} \ge 0,$$
with constants $\kappa_{\mathrm{sd}} \in (0, 1)$ and $C > 0$, see Section 5. Such a condition is also required in the scalar case [35,36] and is essential for the convergence proof in Section 6, where we show $\lim_{t \to \infty} \omega(x^{(t)}) = 0$.
Due to the decrease condition, the denominator in the ratio of actual versus predicted reduction
$$\rho^{(t)} := \begin{cases} \dfrac{\Phi\big(x^{(t)}\big) - \Phi\big(x_+^{(t)}\big)}{\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big)} & \text{if } x^{(t)} \ne x_+^{(t)},\\[1ex] 0 & \text{if } x^{(t)} = x_+^{(t)}, \text{ i.e., } s^{(t)} = 0, \end{cases} \tag{3}$$
is non-negative. A positive $\rho^{(t)}$ implies a decrease in at least one objective $f_\ell$, so we accept $x_+^{(t)}$ as the next iterate if $\rho^{(t)} > \nu_+ > 0$. If $\rho^{(t)}$ is sufficiently large, say $\rho^{(t)} \ge \nu_{++} > \nu_+ > 0$, the next trust region might have a larger radius $\Delta^{(t+1)} \ge \Delta^{(t)}$. If in contrast $\rho^{(t)} < \nu_{++}$, the next trust region radius should be smaller and the surrogates improved.
This encompasses the case $s^{(t)} = 0$, which occurs when the iterate $x^{(t)}$ is critical for the surrogate problem
$$\min_{x \in B^{(t)}} m^{(t)}(x) \in \mathbb{R}^k. \tag{MOPm}$$
Roughly speaking, we suppose that $x^{(t)}$ is near a critical point for the original problem (MOP) if $m^{(t)}$ is sufficiently accurate. If we truly are near a critical point, then the trust region radius will approach 0. For further details concerning the acceptance ratio $\rho^{(t)}$, see [33] (Section 2.2).
Remark 1.
We can modify $\rho^{(t)}$ in (3) to obtain a descent in all objectives, i.e., if $x^{(t)} \ne x_+^{(t)}$ we test
$$\rho_\ell^{(t)} = \frac{f_\ell\big(x^{(t)}\big) - f_\ell\big(x_+^{(t)}\big)}{m_\ell^{(t)}\big(x^{(t)}\big) - m_\ell^{(t)}\big(x_+^{(t)}\big)} > \nu_+ \quad \text{for all } \ell = 1, \dots, k.$$
This is the strict acceptance test.
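For illustration, both acceptance tests can be implemented in a few lines. This is a hedged sketch with assumed argument names, not the authors' code; the caller compares the returned value against $\nu_+$ and $\nu_{++}$.

```python
import numpy as np

def acceptance_ratio(x, x_plus, f_x, f_plus, m_x, m_plus, strict=False):
    """Actual vs. predicted reduction, cf. (3) and Remark 1.

    f_x, f_plus : true objective vectors f(x^(t)), f(x_+^(t)), shape (k,)
    m_x, m_plus : surrogate value vectors m^(t)(x^(t)), m^(t)(x_+^(t))
    """
    if np.array_equal(x, x_plus):        # zero step s^(t) = 0
        return 0.0
    if strict:                           # Remark 1: require descent in all objectives
        return np.min((f_x - f_plus) / (m_x - m_plus))
    # default: scalarized ratio with Phi = max_l f_l
    return (np.max(f_x) - np.max(f_plus)) / (np.max(m_x) - np.max(m_plus))
```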

4. Surrogate Models and the Final Algorithm

Until now, we have not discussed the actual choice of surrogate models used for $m^{(t)}$. As is shown in Section 5, the models should be twice continuously differentiable with uniformly bounded Hessians. To prove convergence of our algorithm, we have to impose further requirements on the (uniform) approximation quality of the surrogates $m^{(t)}$. We can meet these requirements using so-called fully linear models. Moreover, fully linear models intrinsically allow for modifications of the basic trust region method that are aimed at reducing the total number of expensive objective evaluations. Finally, we briefly recapitulate how radial basis functions and multivariate Lagrange polynomials can be made fully linear.
Remark 2.
Although the trust region framework is suitable for general convexly constrained compact sets, we will discuss the construction of fully linear polynomial and RBF models for unconstrained and box-constrained problems only.
In the constrained case, we treat the constraints as unrelaxable, that is, we do not allow for evaluations of the true objectives outside $X$; see the definition of $B^{(t)} \subseteq X$ in (2). We also ensure that only training data in $X$ is selected during the construction of surrogate models.
To the best of our knowledge, there are no construction procedures for the above model types under general (unrelaxable) constraints. A discussion of how some model-based algorithms deal with constraints can be found in [28] (Section 7). The issue is also addressed in [27] (Ch. 13). If the constraints are treated as relaxable, then techniques from [39] (Ch. 15), such as merit functions or filter methods, might be applicable, but this is left for future research.

4.1. Fully Linear Models

We start by restating the abstract definition of full linearity as given in [27,35]:
Definition 3.
Let $\Delta_{\mathrm{ub}} > 0$ be given and let $f\colon \mathbb{R}^n \to \mathbb{R}$ be a function that is continuously differentiable in an open domain containing $X$ and has a Lipschitz continuous gradient on $X$. A set of model functions $\mathcal{M} \subseteq C^1(\mathbb{R}^n, \mathbb{R})$ is called a fully linear class of models w.r.t. $f$ if the following hold:
  • There are positive constants $\epsilon$, $\dot\epsilon$ and $L_m$ such that for any given $\Delta \in (0, \Delta_{\mathrm{ub}})$ and for any $x \in X$ there is a model function $m \in \mathcal{M}$ with Lipschitz continuous gradient and corresponding Lipschitz constant bounded by $L_m$, and such that
    • the error between the gradient of the model and the gradient of the function satisfies
      $$\big\| \nabla f(\xi) - \nabla m(\xi) \big\| \le \dot\epsilon\, \Delta \quad \forall \xi \in B(x; \Delta),$$
    • the error between the model and the function satisfies
      $$\big| f(\xi) - m(\xi) \big| \le \epsilon\, \Delta^2 \quad \forall \xi \in B(x; \Delta).$$
  • For this class $\mathcal{M}$ there exists a "model-improvement" algorithm that, in a finite, uniformly bounded (w.r.t. $x$ and $\Delta$) number of steps, can:
    • either establish that a given model $m \in \mathcal{M}$ is fully linear on $B(x; \Delta)$, i.e., that it satisfies the error bounds above,
    • or find a model $\tilde m \in \mathcal{M}$ that is fully linear on $B(x; \Delta)$.
Remark 3.
In the unconstrained case, the requirements in Definition 3 can be relaxed a bit, at least when using the strict acceptance test with $f(x^{(T)}) \le f(x^{(t)})$ for all $T \ge t \ge 0$. We can then restrict ourselves to the set
$$\mathcal{X} := \bigcup_{x \in \mathcal{L}(x^{(0)})} B\big(x; \Delta_{\mathrm{ub}}\big), \quad \text{where } \mathcal{L}\big(x^{(0)}\big) := \big\{x \in \mathbb{R}^n : f(x) \le f\big(x^{(0)}\big)\big\}.$$
For the convergence analysis in Section 6, we further cite [27] (Lemma 10.25). The lemma states that a fully linear model is also fully linear in enlarged regions if the error constants are chosen appropriately:
Lemma 1.
For $x \in X$ and $\Delta \le \Delta_{\mathrm{ub}}$ consider a function $f$ and a fully linear model $m$ as in Definition 3 with constants $\epsilon, \dot\epsilon, L_m > 0$. Let $L_f > 0$ be a Lipschitz constant of $\nabla f$.
Assume w.l.o.g. that
$$L_m + L_f \le \dot\epsilon \quad \text{and} \quad \dot\epsilon \le 2 \epsilon.$$
Then $m$ is fully linear on $B(x; \tilde\Delta)$ for any $\tilde\Delta \in [\Delta, \Delta_{\mathrm{ub}}]$ with respect to the same constants $\epsilon, \dot\epsilon, L_m$.
Finally, we generalize the definition to a vector of real-valued functions.
Definition 4.
Let $\Delta_{\mathrm{ub}} > 0$ be given and let $f = [f_1, \dots, f_k]^T$ be a vector of functions satisfying the requirements of Definition 3. Then $m = [m_1, \dots, m_k]^T$, with $m_\ell\colon \mathbb{R}^n \to \mathbb{R}$, $\ell \in \{1, \dots, k\}$, belongs to a collection of fully linear classes w.r.t. $f$ if for each $\ell$ the function $m_\ell$ belongs to a fully linear class w.r.t. $f_\ell$, with error constants $\epsilon_\ell$ and $\dot\epsilon_\ell$.
The model-improvement algorithm of $m$ consists in applying the individual improvement algorithms for all indices $\ell \in \{1, \dots, k\}$, and $m$ is deemed fully linear iff all $m_\ell$ are fully linear with constants $\epsilon_\ell$ and $\dot\epsilon_\ell$.
Definition 4 is stated in a way that allows for different model types for the different objectives. Most importantly, we can use $m_\ell = f_\ell$ and $\nabla m_\ell = \nabla f_\ell$ if the $\ell$-th objective is cheap, i.e., $\ell \in I_{\mathrm{cheap}}$, and if $f_\ell$ not only has Lipschitz gradients but also has a Hessian that is uniformly bounded in terms of its norm. The latter requirement is formalized in Assumption 3 and needed for the convergence analysis.

Algorithm Modifications

With Definitions 3 and 4 we have formalized our assumption that the surrogates become more accurate when we decrease the trust region radius. This motivates the following modifications to the basic procedure:
  • "Relaxing" the (finite) surrogate construction process to try for a possible descent even if the surrogates are not fully linear.
  • A criticality test depending on $\varpi_m^{(t)}(x^{(t)})$. If this value is very small at the current iterate, then $x^{(t)}$ could lie near a Pareto critical point. With the criticality test and Algorithm 1 we ensure that the next model is fully linear and the trust region is not too large. This allows for a more accurate criticality measure and descent step calculation.
  • A trust region update that also takes $\varpi_m^{(t)}(x^{(t)})$ into consideration. The radius should be enlarged if we have a large acceptance ratio $\rho^{(t)}$ and $\Delta^{(t)}$ is small as measured against $\beta\, \omega_m^{(t)}(x^{(t)})$ for a constant $\beta > 0$.
These changes are implemented in Algorithm 2. For more detailed explanations we refer to [27] (Ch. 10).
Algorithm 1: Criticality Routine.
[The listing of Algorithm 1 is provided as an image in the original publication.]
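Since the published listing is only available as an image, the following Python sketch reconstructs the loop of Algorithm 1 from its description in this section and in the proof of Lemma 8; the helper names are our own assumptions, not the paper's API.

```python
def criticality_routine(make_fully_linear, varpi_m, x, delta, mu, alpha):
    """Reconstruction of Algorithm 1: shrink the radius by alpha in (0, 1)
    and rebuild fully linear models until delta <= mu * varpi_m(model, x).
    By Lemma 8 this loop is finite unless x is Pareto critical."""
    model = make_fully_linear(x, delta)
    while delta > mu * varpi_m(model, x):
        delta *= alpha
        model = make_fully_linear(x, delta)
    return model, delta
```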
From Algorithm 2 we see that we can classify the iterations based on $\rho^{(t)}$ as in Definition 5.
Definition 5.
For given constants $0 \le \nu_+ \le \nu_{++} < 1$, $\nu_{++} \ne 0$, we call the iteration with index $t \in \mathbb{N}_0$ of Algorithm 2:
  • successful if $\rho^{(t)} \ge \nu_{++}$. The set of successful indices is $\mathcal{S} = \{t \in \mathbb{N}_0 : \rho^{(t)} \ge \nu_{++}\} \subseteq \mathbb{N}_0$. The trial point is accepted and the trust region radius can be increased.
  • model-improving if $\rho^{(t)} < \nu_{++}$ and the models $m^{(t)} = [m_1^{(t)}, \dots, m_k^{(t)}]^T$ are not fully linear. In these iterations the trial point is rejected and the trust region radius is not changed.
  • acceptable if $\nu_{++} > \rho^{(t)} \ge \nu_+$ and the models $m^{(t)}$ are fully linear. (If $\nu_{++} = \nu_+ \in (0, 1)$, then there are no acceptable indices.) The trial point is accepted but the trust region radius is decreased.
  • inacceptable otherwise, i.e., if $\rho^{(t)} < \nu_+$ and the models $m^{(t)}$ are fully linear. The trial point is rejected and the radius decreased.
Algorithm 2: General Trust Region Method (TRM) for (MOP).
[The listing of Algorithm 2 is provided as an image in the original publication.]
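The listing of Algorithm 2 is likewise an image; the following heavily condensed sketch reconstructs the main loop from Section 3 and Definition 5. All helper names and the parameter container `p` are our own assumptions, and the stopping tests are omitted.

```python
def trust_region_method(f, x, delta, p):
    """Condensed reconstruction of Algorithm 2 (TRM) for (MOP)."""
    for t in range(p.max_iter):
        model = build_surrogates(f, x, delta)                 # Section 4
        if varpi_m(model, x) < p.eps_crit:                    # criticality test
            model, delta = criticality_routine(
                model.make_fully_linear, varpi_m, x, delta, p.mu, p.alpha)
        s = descent_step(model, x, delta)                     # Section 5, satisfies (19)
        rho = acceptance_ratio(x, x + s, f(x), f(x + s), model(x), model(x + s))
        if rho >= p.nu_pp:                                    # successful
            x, delta = x + s, min(p.gamma_grow * delta, p.delta_ub)
        elif not model.fully_linear:                          # model-improving
            model = model.improve()                           # radius unchanged
        elif rho >= p.nu_p:                                   # acceptable
            x, delta = x + s, p.gamma * delta
        else:                                                 # inacceptable
            delta = p.gamma * delta                           # reject trial point
    return x
```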

4.2. Fully Linear Lagrange Polynomials

Quadratic Taylor polynomial models are used very frequently. As explained in [27] we can alternatively use multivariate interpolating Lagrange polynomial models when derivative information is not available. We will consider first and second degree Lagrange models. Even though the latter require O ( n 2 ) function evaluations they are still cheaper than second degree finite difference models. For this reason, these models are also used in [33,38].
To construct an interpolating polynomial model we have to provide $p$ data sites, where $p$ is the dimension of the space $\Pi_n^d$ of real-valued $n$-variate polynomials of degree at most $d$. For $d = 1$ we have $p = n + 1$ and for $d = 2$ it is $p = \frac{(n+1)(n+2)}{2}$. If $n \ge 2$, the Mairhuber–Curtis theorem [40] applies and the data sites must form a so-called poised set in $X$. The set $\Xi = \{\xi_1, \dots, \xi_p\} \subseteq \mathbb{R}^n$ is poised if for any basis $\{\psi_i\}_i$ of $\Pi_n^d$ the matrix $M_\psi := \big[\psi_i(\xi_j)\big]_{1 \le i, j \le p}$ is non-singular. Then for any function $f\colon \mathbb{R}^n \to \mathbb{R}$ there is a unique interpolating polynomial $m(x) = \sum_{i=1}^p \lambda_i \psi_i(x)$ with $m(\xi_j) = f(\xi_j)$ for all $j = 1, \dots, p$. Given a poised set $\Xi$, the associated Lagrange basis $\{l_i\}_i$ of $\Pi_n^d$ is defined by $l_i(\xi_j) = \delta_{i,j}$. The model coefficients then simply are the data values, i.e., $\lambda_i = f(\xi_i)$.
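For $d = 1$ this boils down to solving a single $(n+1) \times (n+1)$ linear system. A minimal sketch, assuming a poised set is already given (the poisedness-ensuring Algorithms 6.2 and 6.3 from [27], discussed next, are omitted here):

```python
import numpy as np

def linear_interpolant(points, values):
    """Unique interpolant m(x) = lambda_1 + <lambda_2..n+1, x> in Pi_n^1
    through p = n + 1 poised data sites.

    points : (p, n) array of sites xi_1, ..., xi_p
    values : (p,) array of f(xi_j)
    """
    p, n = points.shape
    assert p == n + 1, "linear interpolation needs exactly n + 1 points"
    M = np.hstack([np.ones((p, 1)), points])   # M_ij = psi_j(xi_i), monomial basis
    coeff = np.linalg.solve(M, values)          # singular iff the set is not poised
    return lambda x: coeff[0] + coeff[1:] @ np.asarray(x)
```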
As in [38], we implement Algorithm 6.2 from [27] to ensure poisedness. It selects training sites $\Xi$ from the current (slightly enlarged) trust region of radius $\theta_1 \Delta^{(t)}$, $\theta_1 \ge 1$, and calculates the associated Lagrange basis. We can then separately evaluate the true objectives $f_\ell$ on $\Xi$ to easily build the surrogates $m_\ell^{(t)}$, $\ell \in \{1, \dots, k\}$. Our implementation always includes $\xi_1 = x^{(t)}$ and tries to select points from a database of prior evaluations first.
We employ an additional algorithm (Algorithm 6.3 in [27]) to ensure that the set $\Xi$ is even $\Lambda$-poised, see [27] (Definition 3.6). The procedure is still finite and ensures that the models are actually fully linear. The quality of the surrogate models can be improved by choosing a small algorithm parameter $\Lambda > 1$. Our implementation again tries to recycle points from a database. In contrast to before, interpolation at $x^{(t)}$ can no longer be guaranteed. This second step can also be omitted at first and then used as a model-improvement step in a subsequent iteration.

4.3. Fully Linear Radial Basis Function Models

The main drawback of quadratic Lagrange models is that we still need $\mathcal{O}(n^2)$ function evaluations in each iteration of Algorithm 2. A possible fix is to use under-determined regression polynomials instead [27,31,41]. Motivated by the findings in [34], we chose so-called Radial Basis Function (RBF) models as an alternative. RBF are well known for their approximation capabilities on irregular data [40]. In our implementation they have the form
$$m(x) = \sum_{i=1}^N c_i\, \varphi\big(\|x - \xi_i\|_2\big) + \pi(x), \quad \text{with } \pi = \sum_{j=1}^{n+1} \lambda_j \psi_j \in \Pi_n^1 \text{ and } N \ge n + 1, \tag{4}$$
which conforms to the construction by Wild et al. [34]. Here, $\varphi$ is a function from a domain containing $\mathbb{R}_{\ge 0}$ to $\mathbb{R}$. For a fixed $\varphi$ the mapping $\varphi(\|\cdot\|_2)$ from $\mathbb{R}^n$ to $\mathbb{R}$ is radially symmetric with respect to its argument, and the mapping $(x, \xi) \mapsto \varphi(\|x - \xi\|_2)$ is called a kernel.
We will describe the procedure only briefly and refer to [34,42] and the dissertation [41] for more details. To conform to the algorithmic framework the models must have Hessians of uniformly bounded norm. Additionally, we want them to be twice differentiable due to the following, very general result:
Theorem 3 (Thm. 4.1 in [41]).
Suppose that $f$ and $m$ are continuously differentiable in an open domain containing $B^{(t)}$ and that $\nabla f$ and $\nabla m$ are Lipschitz continuous in $B^{(t)}$. Further suppose that $m$ interpolates $f$ on a $\Lambda$-poised set $\Xi = \{\xi_1, \dots, \xi_{n+1}\}$ (for a fixed $\Lambda < \infty$). Then $m$ is fully linear for $f$ as in Definition 3.
The $\Lambda$-poised set is determined using pivotal algorithms from [34,41] in an enlarged trust region of radius $\theta_1 \Delta^{(t)}$, $\theta_1 \ge 1$. If we restrict ourselves to functions $\varphi$ that are conditionally positive definite (c.p.d.—see [34] for the definition) of order $D \le 2$, then for any $f\colon \mathbb{R}^n \to \mathbb{R}$ an interpolating model $m$ of form (4) is uniquely determined by solving a linear system of equations. If, further, $\varphi$ is twice continuously differentiable on an open domain containing $[0, \infty)$ with $\varphi'(0) = 0$, then $m$ from (4) is twice continuously differentiable and has Lipschitz gradients exactly if its Hessian stays bounded. This is the case for all $\varphi$ we consider (see Table 1). The Hessian norm is determined by the magnitudes of the coefficients $c_i$ and by $|\varphi'(r)/r|$ and $|\varphi''(r)|$.
If there are exactly $N = n + 1$ points from a poised set $\Xi$, then the coefficients $c_i$ vanish and the model (4) is a linear polynomial. The values $|\varphi'(r)/r|$ and $|\varphi''(r)|$ are bounded because of $r \in [0, \Delta_{\mathrm{ub}}]$ and $\varphi'(0) = 0$. To exploit the nonlinear modeling capabilities of RBF and perform exploration, there is a procedure in [34] to select additional (database) points from within a region of maximum radius $\theta_2 \Delta_{\mathrm{ub}}$, $\theta_2 \ge \theta_1 \ge 1$, so that the values $|c_i|$ stay bounded. Modifications for box constraints can be found in [41] (Sec. 6.3.1) and [43].
Table 1 shows the RBF we are using and of which order they are. Both the Gaussian and the Multiquadric allow for fine-tuning with a shape parameter $\alpha > 0$. This can potentially improve the conditioning of the interpolation system.
Figure 1b illustrates the effect of the shape parameter. As can be seen, the radial functions become narrower for larger shape parameters. Hence, we do not only use a constant shape parameter $\alpha = 1$ as in [34], but also an $\alpha$ that is (within lower and upper bounds) inversely proportional to $\Delta^{(t)}$.
Figure 1a shows interpolation of a nonlinear function by a surrogate based on the Multiquadric with a linear tail.
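A minimal sketch of fitting the model (4) is given below. It assumes the training sites have already been selected, and uses the Multiquadric with the sign convention $\varphi(r) = -\sqrt{1 + (\alpha r)^2}$ (c.p.d. of order 1); the pivotal point-selection algorithms from [34,41] and the coupling of $\alpha$ to $\Delta^{(t)}$ are omitted.

```python
import numpy as np

def fit_rbf(points, values, phi=lambda r, a=1.0: -np.sqrt(1.0 + (a * r) ** 2)):
    """Fit m(x) = sum_i c_i phi(||x - xi_i||_2) + pi(x) with a linear tail pi,
    by solving the saddle-point system  [Phi  P; P^T  0] [c; lam] = [y; 0].
    Uniquely solvable for c.p.d. phi of order <= 2 if the tail is poised.
    """
    N, n = points.shape
    R = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    Phi = phi(R)                                  # N x N kernel matrix
    P = np.hstack([np.ones((N, 1)), points])      # monomial basis of Pi_n^1
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    rhs = np.concatenate([values, np.zeros(n + 1)])
    sol = np.linalg.solve(A, rhs)
    c, lam = sol[:N], sol[N:]

    def m(x):
        r = np.linalg.norm(points - np.asarray(x), axis=1)
        return c @ phi(r) + lam[0] + lam[1:] @ np.asarray(x)
    return m
```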

5. Descent Steps

In this section we introduce some possible steps s ( t ) to use in Algorithm 2. We begin by defining the best step along the steepest descent direction as given by (Pm). Subsequently, backtracking variants are defined that use a multiobjective variant of Armijo’s rule.

5.1. Pareto–Cauchy Step

Both the Pareto–Cauchy point and a backtracking variant, the modified Pareto–Cauchy point, are points along the descent direction $d_m^{(t)}$ within $B^{(t)}$ such that a sufficient decrease, measured by $\Phi_m^{(t)}(\cdot)$ and $\omega_m^{(t)}$, is achieved. Under mild assumptions we can then derive a decrease in terms of $\omega$.
Definition 6.
For $t \in \mathbb{N}_0$ let $d_m^{(t)}$ be a minimizer of (Pm). The best attainable trial point $x_{\mathrm{PC}}^{(t)}$ along $d_m^{(t)}$ is called the Pareto–Cauchy point and is given by
$$x_{\mathrm{PC}}^{(t)} := x^{(t)} + \sigma^{(t)} d_m^{(t)}, \qquad \sigma^{(t)} = \operatorname*{arg\,min}_{\sigma \ge 0}\, \Phi_m^{(t)}\big(x^{(t)} + \sigma\, d_m^{(t)}\big) \quad \text{s.t.} \quad x^{(t)} + \sigma\, d_m^{(t)} \in B^{(t)}. \tag{5}$$
Let $\sigma^{(t)}$ be the minimizer in (5). We call $s_{\mathrm{PC}}^{(t)} := \sigma^{(t)} d_m^{(t)}$ the Pareto–Cauchy step.
If we make the following standard assumption, then the Pareto–Cauchy point allows for a lower bound on the improvement in terms of $\Phi_m^{(t)}$.
Assumption 2.
For all $t \in \mathbb{N}_0$ the surrogates $m^{(t)}(x) = [m_1^{(t)}(x), \dots, m_k^{(t)}(x)]^T$ are twice continuously differentiable on an open set containing $X$. Denote by $H m_\ell^{(t)}(x)$ the Hessian of $m_\ell^{(t)}$ at $x$, for $\ell = 1, \dots, k$.
Theorem 4.
If Assumptions 1 and 2 are satisfied, then for any iterate $x^{(t)}$ the Pareto–Cauchy point $x_{\mathrm{PC}}^{(t)}$ satisfies
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big) \ge \frac{1}{2}\, \omega_m^{(t)}\big(x^{(t)}\big) \cdot \min\left\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m^{(t)}},\ \Delta^{(t)},\ 1 \right\}, \tag{6}$$
where
$$\mathrm{H}_m^{(t)} = \max_{\ell = 1, \dots, k}\ \max_{x \in B^{(t)}} \big\| H m_\ell^{(t)}(x) \big\|_F \tag{7}$$
and the constant $c > 0$ relates the trust region norm to the Euclidean norm $\|\cdot\|_2$ via
$$\|x\|_2 \le c\, \|x\| \quad \forall x \in \mathbb{R}^n. \tag{8}$$
If $\|\cdot\| = \|\cdot\|_\infty$ is used, then $c$ can be chosen as $c = \sqrt{n}$. The proof of Theorem 4 is provided after the next auxiliary lemma.
Lemma 2.
Under Assumptions 1 and 2, let $d$ be a non-increasing direction at $x^{(t)} \in \mathbb{R}^n$ for $m^{(t)}$, i.e.,
$$\big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d \big\rangle \le 0 \quad \forall \ell = 1, \dots, k.$$
Let $q \in \{1, \dots, k\}$ be any objective index and let $\bar\sigma \ge \min\{\Delta^{(t)}, \|d\|\}$. Then it holds that
$$m_q^{(t)}\big(x^{(t)}\big) - \min_{0 \le \sigma \le \bar\sigma} m_q^{(t)}\left(x^{(t)} + \sigma \frac{d}{\|d\|}\right) \ge \frac{w}{2} \min\left\{ \frac{w}{\|d\|^2\, c\, \mathrm{H}_m^{(t)}},\ \frac{\Delta^{(t)}}{\|d\|},\ 1 \right\},$$
where we have used the shorthand notation
$$w = -\max_{\ell = 1, \dots, k} \big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d \big\rangle \ge 0.$$
Lemma 2 states that a minimizer along any non-increasing direction $d$ achieves a minimum reduction w.r.t. $\Phi_m^{(t)}$. Similar results can be found in [30] or [33]. But since we do not use polynomial surrogates $m^{(t)}$, we have to employ the multivariate version of Taylor's theorem to make the proof work. We can do this because, according to Assumption 2, the functions $m_q^{(t)}$, $q \in \{1, \dots, k\}$, are twice continuously differentiable in an open domain containing $X$. Moreover, Assumption 1 ensures that the function is defined on the line from $\chi$ to $x$. As shown in [44] (Ch. 3), a first degree expansion at $x \in B(\chi; \Delta)$ around $\chi \in X$ then leads to
$$m_q^{(t)}(x) = m_q^{(t)}(\chi) + \nabla m_q^{(t)}(\chi)^T h + \frac{1}{2}\, h^T H m_q^{(t)}(\xi_q)\, h, \qquad h = x - \chi, \tag{9}$$
for some $\xi_q \in \{x + \theta(\chi - x) : \theta \in [0, 1]\}$ and for all $q = 1, \dots, k$.
Proof of Lemma 2. 
Let the requirements of Lemma 2 hold and let $d$ be a non-increasing direction for $m^{(t)}$. Then:
$$\begin{aligned} m_q^{(t)}\big(x^{(t)}\big) - \min_{0 \le \sigma \le \bar\sigma} m_q^{(t)}\Big(x^{(t)} + \sigma \tfrac{d}{\|d\|}\Big) &= \max_{0 \le \sigma \le \bar\sigma}\Big[ m_q^{(t)}\big(x^{(t)}\big) - m_q^{(t)}\Big(x^{(t)} + \sigma \tfrac{d}{\|d\|}\Big) \Big] \\ &\overset{(9)}{=} \max_{0 \le \sigma \le \bar\sigma}\Big[ -\tfrac{\sigma}{\|d\|} \big\langle \nabla m_q^{(t)}\big(x^{(t)}\big), d \big\rangle - \tfrac{\sigma^2}{2 \|d\|^2} \big\langle d, H m_q^{(t)}(\xi_q)\, d \big\rangle \Big] \\ &\ge \max_{0 \le \sigma \le \bar\sigma}\Big[ \tfrac{\sigma}{\|d\|}\, w - \tfrac{\sigma^2}{2 \|d\|^2} \big\langle d, H m_q^{(t)}(\xi_q)\, d \big\rangle \Big]. \end{aligned}$$
We used the shorthand $w = -\max_{j} \langle \nabla m_j^{(t)}(x^{(t)}), d \rangle$ and the Cauchy–Schwarz inequality, together with (8) and (7), to get
$$\max_{0 \le \sigma \le \bar\sigma}\Big[ \tfrac{\sigma}{\|d\|}\, w - \tfrac{\sigma^2}{2 \|d\|^2}\, \|d\|_2^2\, \big\|H m_q^{(t)}(\xi_q)\big\|_F \Big] \overset{(8), (7)}{\ge} \max_{0 \le \sigma \le \bar\sigma}\Big[ \tfrac{\sigma}{\|d\|}\, w - \tfrac{\sigma^2}{2}\, c\, \mathrm{H}_m^{(t)} \Big].$$
The RHS is concave in $\sigma$, so we can easily determine the global maximizer $\sigma^*$.
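Concretely (this short computation is implicit in the text): with $g(\sigma) := \frac{\sigma}{\|d\|}\, w - \frac{\sigma^2}{2}\, c\, \mathrm{H}_m^{(t)}$, setting $g'(\sigma^*) = 0$ yields
$$\sigma^* = \frac{w}{\|d\|\, c\, \mathrm{H}_m^{(t)}}, \qquad g(\sigma^*) = \frac{w^2}{2\, \|d\|^2\, c\, \mathrm{H}_m^{(t)}},$$
and if $\bar\sigma < \sigma^*$, concavity of $g$ gives $g(\bar\sigma) \ge \frac{\bar\sigma\, w}{2 \|d\|}$; combining both cases produces the minimum in the bound below.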
Similar to [30] (Lemma 4.1) we find
$$m_q^{(t)}\big(x^{(t)}\big) - \min_{0 \le \sigma \le \bar\sigma} m_q^{(t)}\left(x^{(t)} + \sigma \frac{d}{\|d\|}\right) \ge \frac{w}{2} \min\left\{ \frac{w}{\|d\|^2\, c\, \mathrm{H}_m^{(t)}},\ \frac{\Delta^{(t)}}{\|d\|},\ 1 \right\},$$
where we have additionally used $\bar\sigma \ge \min\{\Delta^{(t)}, \|d\|\}$.    □
Proof of Theorem 4. 
If $x^{(t)}$ is Pareto critical for (MOPm), then $d_m^{(t)} = 0$ and $\omega_m^{(t)}(x^{(t)}) = 0$, and the inequality holds trivially.
Else, let the indices $\ell, q \in \{1, \dots, k\}$ be such that
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big) = m_\ell^{(t)}\big(x^{(t)}\big) - m_q^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big) \ge m_q^{(t)}\big(x^{(t)}\big) - m_q^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big)$$
and define
$$\bar\sigma := \begin{cases} \min\big\{\Delta^{(t)},\, \|d_m^{(t)}\|\big\} & \text{if } \|d_m^{(t)}\| < 1 \text{ or } \Delta^{(t)} \le 1,\\ \Delta^{(t)} & \text{else.} \end{cases} \tag{10}$$
Then clearly $\bar\sigma \ge \min\{\Delta^{(t)}, \|d_m^{(t)}\|\}$, and for the Pareto–Cauchy point we have
$$m_q^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big) = \min_{0 \le \sigma \le \bar\sigma} m_q^{(t)}\left(x^{(t)} + \sigma \frac{d_m^{(t)}}{\|d_m^{(t)}\|}\right).$$
From Lemma 2 and $\|d_m^{(t)}\| \le 1$ the bound (6) immediately follows.    □
Remark 4.
Some authors define the Pareto–Cauchy point as the actual minimizer $x_{\min}^{(t)}$ of $\Phi_m^{(t)}$ within the current trust region (instead of the minimizer along the steepest descent direction). For this true minimizer the same bound (6) holds. This is due to
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_{\min}^{(t)}\big) = m_\ell^{(t)}\big(x^{(t)}\big) - \min_{x \in B^{(t)}} m_q^{(t)}(x) \ge m_q^{(t)}\big(x^{(t)}\big) - m_q^{(t)}\big(x_{\mathrm{PC}}^{(t)}\big).$$

5.2. Modified Pareto–Cauchy Point via Backtracking

A common approach in trust region methods is to find an approximate solution to (5) within the current trust region. Usually a backtracking procedure similar to Armijo’s inexact line-search is used for the Pareto–Cauchy subproblem, see [36] (Section 6.3) and [30]. Doing so, we can still guarantee a sufficient decrease.
Before we actually define the backtracking step along $d_m^{(t)}$, we derive a more general lemma. It illustrates that backtracking along any suitable direction is well defined.
Lemma 3.
Suppose Assumptions 1 and 2 hold. For $x^{(t)} \in \mathbb{R}^n$, let $d$ be a descent direction for $m^{(t)}$, let $q \in \{1, \dots, k\}$ be any objective index and let $\bar\sigma > 0$. Then, for any fixed constants $a, b \in (0, 1)$ there is an integer $j \in \mathbb{N}_0$ such that
$$\Psi\left(x^{(t)} + b^j \bar\sigma \frac{d}{\|d\|}\right) \le \Psi\big(x^{(t)}\big) - a\, b^j \frac{\bar\sigma}{\|d\|}\, w, \tag{11}$$
where, again, we have used the shorthand notation $w = -\max_{\ell = 1, \dots, k} \langle \nabla m_\ell^{(t)}(x^{(t)}), d \rangle > 0$, and $\Psi$ is either some specific model, $\Psi = m_\ell^{(t)}$, or the maximum value, $\Psi = \Phi_m^{(t)}$.
Moreover, if we define the step $s^{(t)} = b^j \bar\sigma \frac{d}{\|d\|}$ for the smallest $j \in \mathbb{N}_0$ satisfying (11), then there is a constant $\kappa_m^{\mathrm{sd}} \in (0, 1)$ such that
$$\Psi\big(x^{(t)}\big) - \Psi\big(x^{(t)} + s^{(t)}\big) \ge \kappa_m^{\mathrm{sd}}\, w \min\left\{ \frac{w}{\|d\|^2\, c\, \mathrm{H}_m^{(t)}},\ \frac{\bar\sigma}{\|d\|} \right\}. \tag{12}$$
Proof. 
The first part can be derived from the fact that $d$ is a descent direction, see e.g., [6]. However, we will use the approach from [30] to also derive the bound (12). With Taylor's theorem we obtain, for some $\ell \in \{1, \dots, k\}$,
$$\begin{aligned} \Psi\left(x^{(t)} + b^j \bar\sigma \tfrac{d}{\|d\|}\right) &= m_\ell^{(t)}\left(x^{(t)} + b^j \bar\sigma \tfrac{d}{\|d\|}\right) \\ &= m_\ell^{(t)}\big(x^{(t)}\big) + \tfrac{b^j \bar\sigma}{\|d\|} \big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d \big\rangle + \tfrac{(b^j \bar\sigma)^2}{2 \|d\|^2} \big\langle d, H m_\ell^{(t)}(\xi_\ell)\, d \big\rangle \\ &\le \Psi\big(x^{(t)}\big) + \max_{q = 1, \dots, k} \tfrac{b^j \bar\sigma}{\|d\|} \big\langle \nabla m_q^{(t)}\big(x^{(t)}\big), d \big\rangle + \max_{q = 1, \dots, k} \tfrac{(b^j \bar\sigma)^2}{2 \|d\|^2} \big\langle d, H m_q^{(t)}(\xi_q)\, d \big\rangle \\ &\overset{\text{(Pm)}, (7), (8)}{\le} \Psi\big(x^{(t)}\big) - b^j \tfrac{\bar\sigma}{\|d\|}\, w + \tfrac{(b^j \bar\sigma)^2}{2}\, c\, \mathrm{H}_m^{(t)}. \end{aligned} \tag{13}$$
In the last line, we have additionally used the Cauchy–Schwarz inequality. For a constructive proof, suppose now that (11) is violated for some $j \in \mathbb{N}_0$, i.e.,
$$\Psi\left(x^{(t)} + b^j \bar\sigma \frac{d}{\|d\|}\right) > \Psi\big(x^{(t)}\big) - a\, b^j \frac{\bar\sigma}{\|d\|}\, w.$$
Plugging (13) into the LHS and subtracting $\Psi(x^{(t)})$ then leads to
$$b^j > \frac{2 (1 - a)\, w}{\|d\|\, \bar\sigma\, c\, \mathrm{H}_m^{(t)}},$$
where the right hand side is positive and completely independent of $j$. Since $b \in (0, 1)$, there must be a $j^* \in \mathbb{N}_0$, $j^* > j$, for which $b^{j^*} \le \frac{2 (1 - a)\, w}{\|d\|\, \bar\sigma\, c\, \mathrm{H}_m^{(t)}}$, so that (11) must be fulfilled for this $j^*$.
Analogous to the proof of [30] (Lemma 4.2) we can now derive the constant $\kappa_m^{\mathrm{sd}}$ from (12) as $\kappa_m^{\mathrm{sd}} = \min\{2 b (1 - a),\, a\}$.
   □
Lemma 3 applies naturally to the step along $d_m^{(t)}$:
Definition 7.
For $x^{(t)} \in B^{(t)}$ let $d_m^{(t)}$ be a solution to (Pm) and define the modified Pareto–Cauchy step as
$$\tilde s_{\mathrm{PC}}^{(t)} := b^j \bar\sigma \frac{d_m^{(t)}}{\|d_m^{(t)}\|},$$
where again $\bar\sigma$ is as in (10) and $j \in \mathbb{N}_0$ is the smallest integer that satisfies
$$\Phi_m^{(t)}\big(x^{(t)} + \tilde s_{\mathrm{PC}}^{(t)}\big) \le \Phi_m^{(t)}\big(x^{(t)}\big) - a\, b^j \frac{\bar\sigma}{\|d_m^{(t)}\|}\, \omega_m^{(t)}\big(x^{(t)}\big) \tag{14}$$
for predefined constants $a, b \in (0, 1)$.
The definition of $\bar\sigma$ ensures that $x^{(t)} + \tilde s_{\mathrm{PC}}^{(t)}$ is contained in the current trust region $B^{(t)}$. Furthermore, these steps provide a sufficient decrease very similar to (6):
Corollary 1.
Suppose Assumptions 1 and 2 hold. For the step $\tilde s_{\mathrm{PC}}^{(t)}$ the following statements are true:
  • A $j \in \mathbb{N}_0$ as in (14) exists.
  • There is a constant $\kappa_m^{\mathrm{sd}} \in (0, 1)$ such that the modified Pareto–Cauchy step $\tilde s_{\mathrm{PC}}^{(t)}$ satisfies
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)} + \tilde s_{\mathrm{PC}}^{(t)}\big) \ge \kappa_m^{\mathrm{sd}}\, \omega_m^{(t)}\big(x^{(t)}\big) \min\left\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m^{(t)}},\ \Delta^{(t)},\ 1 \right\}. \tag{15}$$
Proof. 
If $x^{(t)}$ is critical, then the bound is trivial. Otherwise, the existence of a $j$ satisfying (14) follows from Lemma 3 with $\Psi = \Phi_m^{(t)}$. The lower bound on the decrease follows immediately from $\bar\sigma \ge \min\{\|d_m^{(t)}\|, \Delta^{(t)}\}$.    □
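The backtracking from Definition 7 is straightforward to implement. A hedged sketch (the constants $a$, $b$ and the helper names are our assumptions; Lemma 3 guarantees that the loop terminates for a descent direction):

```python
import numpy as np

def modified_pareto_cauchy_step(Phi_m, x, d, omega_m, sigma_bar,
                                a=1e-4, b=0.5, max_iter=100):
    """Find the smallest j with (14):
    Phi_m(x + b^j sigma_bar d/||d||) <= Phi_m(x) - a b^j (sigma_bar/||d||) omega_m.
    Phi_m maps a point to max_l m_l^(t); the trust region norm is ||.||_inf.
    """
    nd = np.linalg.norm(d, np.inf)
    base = Phi_m(x)
    for j in range(max_iter):
        step = (b ** j) * sigma_bar * d / nd
        if Phi_m(x + step) <= base - a * (b ** j) * (sigma_bar / nd) * omega_m:
            return step
    raise RuntimeError("no Armijo step found; d is likely not a descent direction")
```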
From Lemma 3 it follows that the backtracking condition (14) can be modified to explicitly require a decrease in every objective:
Definition 8.
Let $j \in \mathbb{N}_0$ be the smallest integer satisfying
$$\min_{\ell = 1, \dots, k} \left[ m_\ell^{(t)}\big(x^{(t)}\big) - m_\ell^{(t)}\left(x^{(t)} + b^j \bar\sigma \frac{d_m^{(t)}}{\|d_m^{(t)}\|}\right) \right] \ge a\, b^j \frac{\bar\sigma}{\|d_m^{(t)}\|}\, \omega_m^{(t)}\big(x^{(t)}\big).$$
We define the strict modified Pareto–Cauchy point as $\hat x_{\mathrm{PC}}^{(t)} = x^{(t)} + \hat s_{\mathrm{PC}}^{(t)}$ with the corresponding step $\hat s_{\mathrm{PC}}^{(t)} = b^j \bar\sigma \frac{d_m^{(t)}}{\|d_m^{(t)}\|}$.
Corollary 2.
Suppose Assumptions 1 and 2 hold.
  1. The strict modified Pareto–Cauchy point exists, i.e., the backtracking is finite.
  2. There is a constant $\kappa_m^{\mathrm{sd}} \in (0, 1)$ such that
$$\min_{\ell = 1, \dots, k} \left[ m_\ell^{(t)}\big(x^{(t)}\big) - m_\ell^{(t)}\big(\hat x_{\mathrm{PC}}^{(t)}\big) \right] \ge \kappa_m^{\mathrm{sd}}\, \omega_m^{(t)}\big(x^{(t)}\big) \min\left\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m^{(t)}},\ \Delta^{(t)},\ 1 \right\}.$$
Remark 5.
In the preceding subsections, we have constructed descent steps along the model steepest descent direction. As in the single objective case, we do not necessarily have to use the steepest descent direction; different step calculation methods are viable. For instance, Thomann and Eichfelder [33] use the well-known Pascoletti–Serafini scalarization to treat the subproblem (MOPm). We refer to their work and Appendix B to see how this method can be related to the steepest descent direction.

5.3. Sufficient Decrease for the Original Problem

In the previous subsections, we have shown how to compute steps $s^{(t)}$ that achieve a sufficient decrease in terms of $\Phi_m^{(t)}$ and $\omega_m^{(t)}$. For a descent step $s^{(t)}$ the bound is of the form
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)} + s^{(t)}\big) \ge \kappa_m^{\mathrm{sd}}\, \omega_m^{(t)}\big(x^{(t)}\big) \min\left\{ \frac{\omega_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m^{(t)}},\ \Delta^{(t)},\ 1 \right\}, \quad \kappa_m^{\mathrm{sd}} \in (0, 1), \tag{16}$$
and is thereby very similar to the bounds for the scalar projected gradient trust region method [36]. By introducing a slightly modified version of $\omega_m^{(t)}$, we can transform (16) into the bound used in [30,33].
Lemma 4.
If $\pi(t, x^{(t)})$ is a criticality measure for some multiobjective problem, then $\tilde\pi(t, x^{(t)}) = \min\{1, \pi(t, x^{(t)})\}$ is also a criticality measure for the same problem.
Proof. 
We have $0 \le \tilde\pi(t, x^{(t)}) \le \pi(t, x^{(t)})$. Thus, $\tilde\pi \to 0$ whenever $\pi \to 0$. The minimum of uniformly continuous functions is again uniformly continuous.    □
We next make another standard assumption on the class of surrogate models.
Assumption 3.
The norms of all model Hessians are uniformly bounded above on $X$, i.e., there is a positive constant $\mathrm{H}_m$ such that
$$\big\| H m_\ell^{(t)}(x) \big\|_F \le \mathrm{H}_m \quad \forall \ell = 1, \dots, k,\ \forall x \in B^{(t)},\ \forall t \in \mathbb{N}_0.$$
W.l.o.g., we assume
$$\mathrm{H}_m \cdot c > 1, \quad \text{with } c \text{ as in (8)}. \tag{17}$$
Remark 6.
From this assumption it follows that the model gradients are Lipschitz continuous as well. Together with Theorem 2, we then know that $\omega_m^{(t)}$ is a criticality measure for (MOPm).
Motivated by the previous remark, we will from now on work with the functions
$$\varpi(x) := \min\big\{\omega(x),\, 1\big\} \quad \text{and} \quad \varpi_m^{(t)}(x) := \min\big\{\omega_m^{(t)}(x),\, 1\big\}, \quad t = 0, 1, \dots \tag{18}$$
We can thereby derive the sufficient decrease condition in “standard form”:
Corollary 3.
Under Assumption 3, suppose that for $x^{(t)}$ and some descent step $s^{(t)}$ the bound (16) holds. For the criticality measure $\varpi_m^{(t)}$ it then follows that
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)} + s^{(t)}\big) \ge \kappa_m^{\mathrm{sd}}\, \varpi_m^{(t)}\big(x^{(t)}\big) \min\left\{ \frac{\varpi_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m},\ \Delta^{(t)} \right\}. \tag{19}$$
Proof. 
$\varpi_m^{(t)}$ is a criticality measure due to Assumption 3 and Lemma 4. Further, from (18) and (17) it follows that
$$\frac{\varpi_m^{(t)}\big(x^{(t)}\big)}{c\, \mathrm{H}_m} \le \frac{1}{c\, \mathrm{H}_m} \le 1,$$
and if we plug this into (16) we obtain (19).    □
To relate the RHS of (19) to the criticality ω of the original problem, we require another assumption.
Assumption 4.
There is a constant $\kappa_\omega > 0$ such that
$$\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \kappa_\omega\, \omega_m^{(t)}\big(x^{(t)}\big).$$
This assumption is also made by Thomann and Eichfelder [33] and can easily be justified by using fully linear surrogate models and a bounded trust region radius in combination with a criticality test, see Lemma 7.
Assumption 4 can be used to formulate the next two lemmata relating the model criticality and the true criticality. They are proven in Appendix A.2. From these lemmata and Corollary 3 the final result, Corollary 4, easily follows.
Lemma 5.
If Assumption 4 holds, then it holds for $\varpi_m^{(t)}$ and $\varpi$ from (18) that
$$\big| \varpi_m^{(t)}\big(x^{(t)}\big) - \varpi\big(x^{(t)}\big) \big| \le \kappa_\omega\, \varpi_m^{(t)}\big(x^{(t)}\big).$$
Lemma 6.
From Assumption 4 it follows that
$$\varpi_m^{(t)}\big(x^{(t)}\big) \ge \frac{1}{\kappa_\omega + 1}\, \varpi\big(x^{(t)}\big), \quad \text{with } (\kappa_\omega + 1)^{-1} \in (0, 1).$$
Corollary 4.
Suppose that Assumptions 3 and 4 hold and that $x^{(t)}$ and $s^{(t)}$ satisfy (19). Then
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)} + s^{(t)}\big) \ge \kappa_{\mathrm{sd}}\, \varpi\big(x^{(t)}\big) \min\left\{ \frac{\varpi\big(x^{(t)}\big)}{c\, \mathrm{H}_m},\ \Delta^{(t)} \right\}, \tag{20}$$
where $\kappa_{\mathrm{sd}} = \frac{\kappa_m^{\mathrm{sd}}}{1 + \kappa_\omega} \in (0, 1)$.

6. Convergence

6.1. Preliminary Assumptions and Definitions

To prove convergence of Algorithm 2 we first have to make sure that at least one of the objectives is bounded from below. This is a weaker requirement than the standard assumption that all objectives are bounded from below:
Assumption 5.
The maximum $\max_{\ell = 1, \dots, k} f_\ell(x)$ of the objective functions is bounded from below on $X$.
To be able to use ϖ as a criticality measure and to refer to fully linear models, we further require:
Assumption 6.
The objective $f\colon \mathbb{R}^n \to \mathbb{R}^k$ is continuously differentiable in an open domain containing $X$ and has Lipschitz continuous gradients on $X$.
We summarize the assumptions on the surrogates as follows:
Assumption 7.
The vector of surrogate model functions $m^{(t)} = [m_1^{(t)}, \dots, m_k^{(t)}]^T$ belongs to a collection of fully linear classes as in Definition 4: for each objective index $\ell = 1, \dots, k$ there are error constants $\epsilon_\ell$ and $\dot\epsilon_\ell$ so that $m_\ell^{(t)}$ can be made to satisfy the bounds in Definition 3.
For the subsequent analysis we define component-wise maximum constants as
$$\epsilon := \max_{\ell = 1, \dots, k} \epsilon_\ell, \qquad \dot\epsilon := \max_{\ell = 1, \dots, k} \dot\epsilon_\ell. \tag{21}$$
We also wish for the descent steps to fulfill a sufficient decrease condition for the surrogate criticality measure as discussed in Section 5.
Assumption 8.
For all $t \in \mathbb{N}_0$ the descent steps $s^{(t)}$ are assumed to fulfill both $x^{(t)} + s^{(t)} \in B^{(t)}$ and (19).
Finally, to avoid cluttered notation when dealing with subsequences, we define the following shorthand notations:
$$\varpi_m^t := \varpi_m^{(t)}\big(x^{(t)}\big), \qquad \varpi^t := \varpi\big(x^{(t)}\big), \qquad t \in \mathbb{N}_0.$$

6.2. Convergence Proof

In the following we prove convergence of Algorithm 2 to Pareto critical points. We account for the case that no criticality test is used, i.e., $\varepsilon_{\mathrm{crit}} = 0$. We then require all surrogates to be fully linear in each iteration and need Assumption 4. The proof is an adapted version of the scalar case in [35].
It is also similar to the proofs for the multiobjective algorithms in [30,33]. However, in both cases, no criticality test is employed, there is no distinction between successful and acceptable iterations ($\nu_+ = \nu_{++}$), and interpolation at $x^{(t)}$ by the surrogates is required. We indicate notable differences when appropriate.
We start with two results concerning the criticality test in Algorithm 2.
Lemma 7.
For each iteration $t \in \mathbb{N}_0$, Assumption 4 is fulfilled if the model $m^{(t)}$ is fully linear, the criticality test has been performed and—if applicable—Algorithm 1 has finished.
Proof. 
Let $\ell, q \in \{1, \dots, k\}$ and $d_\ell, d_q \in X - x^{(t)}$ be solutions of (Pm) and (P1), respectively, such that
$$\omega_m^{(t)}\big(x^{(t)}\big) = -\big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d_\ell \big\rangle, \qquad \omega\big(x^{(t)}\big) = -\big\langle \nabla f_q\big(x^{(t)}\big), d_q \big\rangle.$$
If $\omega_m^{(t)}(x^{(t)}) \ge \omega(x^{(t)})$, then, using Cauchy–Schwarz and $\|d\| \le 1$,
$$\omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) = \big\langle \nabla f_q\big(x^{(t)}\big), d_q \big\rangle - \big\langle \nabla m_\ell^{(t)}\big(x^{(t)}\big), d_\ell \big\rangle \overset{\text{df.}}{\le} \big\langle \nabla f_q\big(x^{(t)}\big), d_\ell \big\rangle - \big\langle \nabla m_q^{(t)}\big(x^{(t)}\big), d_\ell \big\rangle \le \big\| \nabla f_q\big(x^{(t)}\big) - \nabla m_q^{(t)}\big(x^{(t)}\big) \big\|_2\, \|d_\ell\|_2,$$
and if $\omega_m^{(t)}(x^{(t)}) < \omega(x^{(t)})$, we analogously obtain
$$\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \big\| \nabla m_\ell^{(t)}\big(x^{(t)}\big) - \nabla f_\ell\big(x^{(t)}\big) \big\|_2\, \|d_q\|_2.$$
Because $m^{(t)}$ is fully linear, it follows that
$$\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le c\, \dot\epsilon\, \Delta^{(t)}, \quad \text{with } \dot\epsilon \text{ from (21)}.$$
If we just left Algorithm 1, then the model is fully linear for $\Delta^{(t)}$ due to Lemma 1 and we have $\Delta^{(t)} \le \mu\, \varpi_m^{(t)}(x^{(t)}) \le \mu\, \omega_m^{(t)}(x^{(t)})$. If we otherwise did not enter Algorithm 1 in the first place, it must hold that $\omega_m^{(t)}(x^{(t)}) \ge \varepsilon_{\mathrm{crit}}$ and
$$\Delta^{(t)} \le \Delta_{\mathrm{ub}} = \frac{\Delta_{\mathrm{ub}}}{\varepsilon_{\mathrm{crit}}}\, \varepsilon_{\mathrm{crit}} \le \frac{\Delta_{\mathrm{ub}}}{\varepsilon_{\mathrm{crit}}}\, \omega_m^{(t)}\big(x^{(t)}\big),$$
and thus
$$\big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le \kappa_\omega\, \omega_m^{(t)}\big(x^{(t)}\big), \qquad \kappa_\omega = c\, \dot\epsilon\, \max\big\{ \mu,\, \varepsilon_{\mathrm{crit}}^{-1}\, \Delta_{\mathrm{ub}} \big\} > 0.$$
   □
In the subsequent analysis, we require mainly steps with fully linear models to achieve sufficient decrease for the true problem. Due to Lemma 7, we can dispose of Assumption 4 by using the criticality routine:
Assumption 9.
Either $\varepsilon_{\mathrm{crit}} > 0$ or Assumption 4 holds.
We have also implicitly shown the following property of the criticality measures.
Corollary 5.
If $m^{(t)}$ is fully linear for $f$ with $\dot\epsilon > 0$ as in (21), then
$$\big| \varpi_m^{(t)}\big(x^{(t)}\big) - \varpi\big(x^{(t)}\big) \big| \le \big| \omega_m^{(t)}\big(x^{(t)}\big) - \omega\big(x^{(t)}\big) \big| \le c\, \dot\epsilon\, \Delta^{(t)}.$$
Lemma 8.
If $x^{(t)}$ is not critical for the true problem (MOP), i.e., $\varpi(x^{(t)}) \ne 0$, then Algorithm 1 terminates after a finite number of iterations.
Proof. 
At the start of Algorithm 1, we know that $m^{(t)}$ is not fully linear or that $\Delta^{(t)} > \mu\, \varpi_m^{(t)}(x^{(t)})$. For clarity, we denote the first model by $m_0^{(t)}$ and define $\Delta_0 = \Delta^{(t)}$. We then ensure that the model is made fully linear on $B(x^{(t)}; \Delta_1^{(t)})$ with $\Delta_1^{(t)} = \Delta_0$ and denote this fully linear model by $m_1^{(t)}$. If afterwards $\Delta_1^{(t)} \le \mu\, \varpi_{m_1}^{(t)}(x^{(t)})$, then Algorithm 1 terminates.
Otherwise, the process is repeated: the radius is multiplied by $\alpha \in (0, 1)$, so that in the $j$-th iteration we have $\Delta_j^{(t)} = \alpha^{j-1} \Delta_0$, and $m_j^{(t)}$ is made fully linear on $B(x^{(t)}; \Delta_j^{(t)})$, until
$$\Delta_j^{(t)} = \alpha^{j-1} \Delta_0 \le \mu\, \varpi_{m_j}^{(t)}\big(x^{(t)}\big).$$
The only way for Algorithm 1 to loop infinitely is
$$\varpi_{m_j}^{(t)}\big(x^{(t)}\big) < \frac{\alpha^{j-1} \Delta_0}{\mu} \quad \forall j \in \mathbb{N}. \tag{22}$$
Because $m_j^{(t)}$ is fully linear on $B(x^{(t)}; \alpha^{j-1} \Delta_0)$, we know from Corollary 5 that
$$\big| \varpi_{m_j}^{(t)}\big(x^{(t)}\big) - \varpi\big(x^{(t)}\big) \big| \le c\, \dot\epsilon\, \alpha^{j-1} \Delta_0 \quad \forall j \in \mathbb{N}.$$
Using the triangle inequality together with (22) gives us
$$\varpi\big(x^{(t)}\big) \le \big| \varpi_{m_j}^{(t)}\big(x^{(t)}\big) - \varpi\big(x^{(t)}\big) \big| + \varpi_{m_j}^{(t)}\big(x^{(t)}\big) \le \left( \mu^{-1} + c\, \dot\epsilon \right) \alpha^{j-1} \Delta_0 \quad \forall j \in \mathbb{N}.$$
As $\alpha \in (0, 1)$, this implies $\varpi(x^{(t)}) = 0$, and $x^{(t)}$ is hence critical.    □
We next state another auxiliary lemma that we need for the convergence proof.
Lemma 9.
Suppose Assumptions 6 and 7 hold. For the iterate $x^{(t)}$, let $s^{(t)} \in \mathbb{R}^n$ be any step with $x_+^{(t)} = x^{(t)} + s^{(t)} \in B^{(t)}$. If $m^{(t)}$ is fully linear on $B^{(t)}$, then it holds that
$$\big| \Phi\big(x_+^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big) \big| \le \epsilon\, \big(\Delta^{(t)}\big)^2.$$
Proof. 
The proof follows from the definition of $\Phi$ and $\Phi_m^{(t)}$ and the full linearity of $m^{(t)}$. It can be found in [33] (Lemma 4.16).    □
Convergence of Algorithm 2 is proven by showing that in certain situations, the iteration must be acceptable or successful as defined in Definition 5. This is done indirectly and relies on the next two lemmata. They use the preceding result to show that in a (hypothetical) situation where no Pareto critical point is approached, the trust region radius must be bounded from below.
Lemma 10.
Suppose Assumptions 1, 3 and 6 to 8 hold. If $x^{(t)}$ is not Pareto critical for (MOPm), $m^{(t)}$ is fully linear on $B^{(t)}$ and
$$\Delta^{(t)} \le \frac{\kappa_m^{\mathrm{sd}} (1 - \nu_{++})\, \varpi_m^{(t)}\big(x^{(t)}\big)}{2 \lambda}, \quad \text{where } \lambda = \max\big\{\epsilon,\, c\, \mathrm{H}_m\big\} \text{ and } \kappa_m^{\mathrm{sd}} \text{ is as in (19)}, \tag{23}$$
then the iteration is successful, that is, $t \in \mathcal{S}$ and $\Delta^{(t+1)} \ge \Delta^{(t)}$.
Proof. 
The proof is very similar to [35] (Lemma 5.3) and [33] (Lemma 4.17). In contrast to the latter, we use the surrogate problem and do not require interpolation at $x^{(t)}$.
By definition we have $\kappa_m^{\mathrm{sd}} (1 - \nu_{++}) < 1$, and hence it follows from Assumptions 4 and 8 and Corollary 3 that
$$\Delta^{(t)} \le \frac{\kappa_m^{\mathrm{sd}} (1 - \nu_{++})\, \varpi_m^t}{2 \lambda} \le \frac{\varpi_m^t}{2 \lambda} \le \frac{\varpi_m^t}{2\, c\, \mathrm{H}_m} \le \frac{\varpi_m^t}{c\, \mathrm{H}_m}.$$
With Assumption 8 we can plug this into (19) and obtain
$$\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big) \ge \kappa_m^{\mathrm{sd}}\, \varpi_m^t \min\left\{ \frac{\varpi_m^t}{c\, \mathrm{H}_m},\ \Delta^{(t)} \right\} = \kappa_m^{\mathrm{sd}}\, \varpi_m^t\, \Delta^{(t)}. \tag{24}$$
Due to Assumption 7 we can use the definition (3) of $\rho^{(t)}$ and estimate
$$\big| \rho^{(t)} - 1 \big| = \left| \frac{\Phi\big(x^{(t)}\big) - \Phi\big(x_+^{(t)}\big) - \Big( \Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big) \Big)}{\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big)} \right| \le \frac{\big| \Phi\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x^{(t)}\big) \big| + \big| \Phi_m^{(t)}\big(x_+^{(t)}\big) - \Phi\big(x_+^{(t)}\big) \big|}{\Phi_m^{(t)}\big(x^{(t)}\big) - \Phi_m^{(t)}\big(x_+^{(t)}\big)} \overset{\text{Lemma 9, (24)}}{\le} \frac{2 \epsilon\, \big(\Delta^{(t)}\big)^2}{\kappa_m^{\mathrm{sd}}\, \varpi_m^t\, \Delta^{(t)}} \le \frac{2 \lambda\, \Delta^{(t)}}{\kappa_m^{\mathrm{sd}}\, \varpi_m^t} \overset{(23)}{\le} 1 - \nu_{++}.$$
Therefore $\rho^{(t)} \ge \nu_{++}$ and the iteration $t$ using step $s^{(t)}$ is successful.    □
The same statement can be made for the true problem and ϖ :
Corollary 6.
Suppose Assumptions 1, 3 and 6 to 9 hold. If $x^{(t)}$ is not Pareto critical for (MOP), $m^{(t)}$ is fully linear on $B^{(t)}$ and
$$\Delta^{(t)} \le \frac{\kappa_{\mathrm{sd}} (1 - \nu_{++})\, \varpi\big(x^{(t)}\big)}{2 \lambda}, \quad \text{where } \lambda = \max\big\{\epsilon,\, c\, \mathrm{H}_m\big\} \text{ and } \kappa_{\mathrm{sd}} \text{ is as in (20)},$$
then the iteration is successful, that is, $t \in \mathcal{S}$ and $\Delta^{(t+1)} \ge \Delta^{(t)}$.
Proof. 
The proof works exactly as for Lemma 10, but due to Assumption 9 we can use Lemma 7 and employ the sufficient decrease condition (20) for $\varpi$ instead.    □
As in [35] (Lemma 5.4) and [33] (Lemma 4.18), it is now easy to show that when no Pareto critical point of (MOPm) is approached the trust region radius must be bounded:
Lemma 11.
Suppose Assumptions 1, 3 and 6 to 8 hold and that there exists a constant $\varpi_m^{\mathrm{lb}} > 0$ such that $\varpi_m^{(t)}(x^{(t)}) \ge \varpi_m^{\mathrm{lb}}$ for all $t$. Then there is a constant $\Delta^{\mathrm{lb}} > 0$ with
$$\Delta^{(t)} \ge \Delta^{\mathrm{lb}} \quad \text{for all } t \in \mathbb{N}_0.$$
Proof. 
We first investigate the criticality step and assume $\varepsilon_{\mathrm{crit}} > \varpi_m^t \ge \varpi_m^{\mathrm{lb}}$. After the criticality loop has finished, the radius $\Delta^{(t)}$ satisfies $\Delta^{(t)} \ge \min\{\Delta_*^{(t)},\, \beta\, \varpi_m^t\}$, where $\Delta_*^{(t)}$ denotes the radius before the criticality loop, and therefore $\Delta^{(t)} \ge \min\{\beta\, \varpi_m^{\mathrm{lb}},\, \Delta_*^{(t)}\}$ for all $t$.
Outside the criticality step, we know from Lemma 10 that whenever $\Delta^{(t)}$ falls below
$$\tilde\Delta := \frac{\kappa_m^{\mathrm{sd}} (1 - \nu_{++})\, \varpi_m^{\mathrm{lb}}}{2 \lambda},$$
iteration $t$ must be either model-improving or successful, and hence $\Delta^{(t+1)} \ge \Delta^{(t)}$; the radius cannot decrease until $\Delta^{(k)} > \tilde\Delta$ for some $k > t$. Because $\underline\gamma \in (0, 1)$ is the severest possible shrinking factor in Algorithm 2, we therefore know that $\Delta^{(t)}$ can never be actively shrunken to a value below $\underline\gamma\, \tilde\Delta$.
Combining both bounds on $\Delta^{(t)}$ results in
$$\Delta^{(t)} \ge \Delta^{\mathrm{lb}} := \min\big\{ \beta\, \varpi_m^{\mathrm{lb}},\ \underline\gamma\, \tilde\Delta,\ \Delta_*^{(0)} \big\} \quad \forall t \in \mathbb{N}_0,$$
where we have again used the fact that $\Delta_*^{(t)}$ cannot be reduced further once it is less than or equal to $\tilde\Delta$, due to the update mechanism in Algorithm 2.    □
We can now state the first convergence result:
Theorem 5.
Suppose that Assumptions 1, 3 and 6 to 8 hold. If Algorithm 2 has only a finite number $0 \le |\mathcal{S}| < \infty$ of successful iterations $\mathcal{S} = \{t \in \mathbb{N}_0 : \rho^{(t)} \ge \nu_{++}\}$, then
$$\lim_{t \to \infty} \varpi\big(x^{(t)}\big) = 0.$$
Proof. 
If the criticality loop runs infinitely, then the result follows from Lemma 8.
Otherwise, let $t_0$ be any index larger than the last successful index (or $t_0 \ge 0$ if $\mathcal{S} = \emptyset$). All $t \ge t_0$ must then be model-improving, acceptable or inacceptable. In all cases, the trust region radius $\Delta^{(t)}$ is never increased. Due to Assumption 7, the number of successive model-improvement steps is bounded above by $M \in \mathbb{N}$. Hence, $\Delta^{(t)}$ is decreased by a factor $\gamma \in [\underline\gamma, \overline\gamma] \subseteq (0, 1)$ at least once every $M$ iterations. Thus,
$$\sum_{t > t_0} \Delta^{(t)} \le M \sum_{i = 1}^{\infty} \overline\gamma^{\,i}\, \Delta^{(t_0)} = M\, \frac{\overline\gamma}{1 - \overline\gamma}\, \Delta^{(t_0)},$$
and $\Delta^{(t)}$ must go to zero for $t \to \infty$.
Clearly, for any $\tau \ge t_0$, the iterates (and trust region centers) $x^{(\tau)}$ and $x^{(t_0)}$ cannot be further apart than the sum of all subsequent trust region radii, i.e.,
$$\big\| x^{(\tau)} - x^{(t_0)} \big\| \le \sum_{t \ge t_0} \Delta^{(t)} \le \frac{M}{1 - \overline\gamma}\, \Delta^{(t_0)}.$$
The RHS goes to zero as we let $t_0$ go to infinity, and so must the norm on the LHS, i.e.,
$$\lim_{t_0 \to \infty} \big\| x^{(\tau)} - x^{(t_0)} \big\| = 0. \tag{25}$$
Now let $\tau = \tau(t_0) \ge t_0$ be the first iteration index such that $m^{(\tau)}$ is fully linear. Then
$$\varpi^{t_0} \le \big| \varpi^{t_0} - \varpi^{\tau} \big| + \big| \varpi^{\tau} - \varpi_m^{\tau} \big| + \varpi_m^{\tau},$$
and for the terms on the right we find, for $t_0 \to \infty$:
  • Because of Assumptions 1 and 6 and Theorem 2, $\varpi$ is Cauchy continuous, and with (25) the first term goes to zero.
  • Due to Corollary 5, the second term is in $\mathcal{O}(\Delta^{(\tau)})$ and goes to zero.
  • Suppose the third term does not go to zero as well, i.e., $\{\varpi_m^{(\tau)}(x^{(\tau)})\}$ is bounded below by a positive constant. Due to Assumptions 1 and 7, the iterates $x^{(\tau)}$ are not Pareto critical for (MOPm), and because of $\Delta^{(\tau)} \to 0$ and Lemma 10 there would be a successful iteration, a contradiction. Thus the third term must go to zero as well.
We conclude that the left side, $\varpi(x^{(t_0)})$, also goes to zero for $t_0 \to \infty$.    □
We now address the case of infinitely many successful iterations, first for the surrogate measure $\varpi_m^{(t)}$ and then for $\varpi$. We show that the criticality measures are not bounded away from zero.
We start with the observation that in any case the trust region radius converges to zero:
Lemma 12.
If Assumptions 1, 3 and 6 to 8 hold, then the sequence of trust region radii generated by Algorithm 2 goes to zero, i.e., $\lim_{t \to \infty} \Delta^{(t)} = 0$.
Proof. 
We have shown in the proof of Theorem 5 that this is the case for finitely many successful iterations.
Suppose there are infinitely many successful iterations. Take any successful index t S . Then ρ ( t ) ν + + and from Assumption 8 it follows for x ( t + 1 ) = x + ( t ) = x ( t ) + s ( t ) that
Φ ( x ( t ) ) Φ ( x + ( t ) ) ν + + Φ m ( t ) ( x ( t ) ) Φ m ( t ) ( x + ( t ) ) ( 19 ) ν + + κ m sd ϖ m t min ϖ m t c H m , Δ ( t ) .
The criticality step ensures that ϖ m t min ε crit , Δ ( t ) μ so that
Φ ( x ( t ) ) Φ ( x + ( t ) ) ν + + κ m sd min ε crit , Δ ( t ) μ min Δ ( t ) μ c H m , Δ ( t ) 0 .
Now the right hand side has to go to zero: Suppose it was bounded below by a positive constant ε > 0 . We could then compute a lower bound on the improvement from the first iteration with index 0 up to t + 1 by summation
Φ ( x ( 0 ) ) Φ ( x ( t + 1 ) ) τ S t Φ ( x ( τ ) ) Φ ( x ( τ + 1 ) ) S t ε
where S t = S { 0 , , t } are all successful indices with a maximum index of t. Because S is unbounded, the right side diverges for t and so must the left side in contradiction to Φ being bounded below by Assumption 5. From (26) we see that this implies Δ ( t ) 0 for t S , t .
Now consider any sequence $T \subseteq \mathbb{N}_0$ of indices that are not necessarily successful, i.e., $T \setminus \mathcal{S} \neq \emptyset$. The radius is only ever increased in successful iterations, and at most by a factor of $\gamma_{\uparrow}$. Since $\mathcal{S}$ is unbounded, there is for any $\tau \in T$ a largest $t_\tau \in \mathcal{S}$ with $t_\tau \le \tau$. Then $\Delta^{(\tau)} \le \gamma_{\uparrow}\, \Delta^{(t_\tau)}$, and because of $\Delta^{(t_\tau)} \to 0$ it follows that
$\lim_{\tau \in T,\ \tau \to \infty} \Delta^{(\tau)} = 0,$
which concludes the proof.    □
Lemma 13.
Suppose Assumptions 1, 3 and 5 to 8 hold. For the iterates produced by Algorithm 2 it holds that
$\liminf_{t \to \infty} \varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) = 0.$
Proof. 
For a contradiction, suppose that $\liminf_{t \to \infty} \varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) \neq 0$. Then there is a constant $\varpi_m^{\mathrm{lb}} > 0$ with $\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) \ge \varpi_m^{\mathrm{lb}}$ for all $t \in \mathbb{N}_0$. According to Lemma 11, there then exists a constant $\Delta^{\mathrm{lb}} > 0$ with $\Delta^{(t)} \ge \Delta^{\mathrm{lb}}$ for all $t$. This contradicts Lemma 12.    □
The next result allows us to transfer this statement to $\varpi$.
Lemma 14.
Suppose Assumptions 1, 6 and 7 hold. For any subsequence $\{t_i\}_{i \in \mathbb{N}} \subseteq \mathbb{N}_0$ of iteration indices of Algorithm 2 with
$\lim_{i \to \infty} \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr) = 0,$ (27)
it also holds that
$\lim_{i \to \infty} \varpi\bigl(x^{(t_i)}\bigr) = 0.$ (28)
Proof. 
By (27), $\varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr) < \varepsilon_{\mathrm{crit}}$ for sufficiently large $i$. If $x^{(t_i)}$ is critical for (MOP), then the result follows from Lemma 8. Otherwise, $m^{(t_i)}$ is fully linear on $B\bigl(x^{(t_i)}; \Delta^{(t_i)}\bigr)$ for some $\Delta^{(t_i)} \le \mu\, \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr)$. From Corollary 5 it follows that
$\bigl| \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr) - \varpi\bigl(x^{(t_i)}\bigr) \bigr| \le c\,\dot{\epsilon}\, \Delta^{(t_i)} \le c\,\dot{\epsilon}\, \mu\, \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr).$
The triangle inequality yields
$\varpi\bigl(x^{(t_i)}\bigr) \le \bigl| \varpi\bigl(x^{(t_i)}\bigr) - \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr) \bigr| + \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr) \le \bigl( c\,\dot{\epsilon}\, \mu + 1 \bigr)\, \varpi_{m^{(t_i)}}\bigl(x^{(t_i)}\bigr)$
for sufficiently large $i$, and (27) then implies (28).    □
The next global convergence result immediately follows from Theorem 5 and Lemmas 13 and 14:
Theorem 6.
Suppose Assumptions 1, 3 and 5 to 8 hold. Then $\liminf_{t \to \infty} \varpi\bigl(x^{(t)}\bigr) = 0$.
This shows that if the iterates are bounded, then there is a subsequence of iterates in $\mathbb{R}^n$ approximating a Pareto critical point. We next show that all limit points of a sequence generated by Algorithm 2 are Pareto critical.
Theorem 7.
Suppose Assumptions 1 and 3 to 8 hold. Then $\lim_{t \to \infty} \varpi\bigl(x^{(t)}\bigr) = 0$.
Proof. 
We have already proven the result for finitely many successful iterations, see Theorem 5. We thus suppose that $\mathcal{S}$ is unbounded.
For the purpose of establishing a contradiction, suppose that there exists a sequence $\{t_j\}_{j \in \mathbb{N}}$ of indices that are successful or acceptable with
$\varpi\bigl(x^{(t_j)}\bigr) \ge 2\varepsilon > 0 \quad \text{for some } \varepsilon > 0 \text{ and all } j.$ (29)
We can ignore model-improving and unacceptable iterations: during those, the iterate does not change, and we find a larger acceptable or successful index with the same criticality value.
From Theorem 6 we obtain that for every such $t_j$ there exists a first index $\tau_j > t_j$ such that $\varpi\bigl(x^{(\tau_j)}\bigr) < \varepsilon$. We thus find another subsequence indexed by $\{\tau_j\}$ such that
$\varpi\bigl(x^{(t)}\bigr) \ge \varepsilon \ \text{ for } t_j \le t < \tau_j, \quad \text{and} \quad \varpi\bigl(x^{(\tau_j)}\bigr) < \varepsilon.$ (30)
Using (29) and (30), it also follows from the triangle inequality that
$\bigl| \varpi\bigl(x^{(t_j)}\bigr) - \varpi\bigl(x^{(\tau_j)}\bigr) \bigr| \ge \varpi\bigl(x^{(t_j)}\bigr) - \varpi\bigl(x^{(\tau_j)}\bigr) > 2\varepsilon - \varepsilon = \varepsilon \quad \forall j \in \mathbb{N}.$ (31)
With $\{t_j\}$ and $\{\tau_j\}$ as in (30), define the following subset of indices:
$T = \bigl\{ t \in \mathbb{N}_0 : \exists\, j \in \mathbb{N} \text{ such that } t_j \le t < \tau_j \bigr\}.$
By (30) we have $\varpi\bigl(x^{(t)}\bigr) \ge \varepsilon$ for $t \in T$, and due to Lemma 14 we also know that $\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr)$ cannot go to zero either, i.e., there is some $\varepsilon_m > 0$ such that
$\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) \ge \varepsilon_m > 0 \quad \forall t \in T.$
From Lemma 12 we know that $\Delta^{(t)} \to 0$ for $t \to \infty$, so that by Corollary 6 any sufficiently large $t \in T$ must be either successful or model-improving (if $m^{(t)}$ is not fully linear). For $t \in T \cap \mathcal{S}$, it follows from Assumption 8 that
$\Phi\bigl(x^{(t)}\bigr) - \Phi\bigl(x^{(t+1)}\bigr) \ge \nu_{++} \Bigl[ \Phi_{m^{(t)}}\bigl(x^{(t)}\bigr) - \Phi_{m^{(t)}}\bigl(x^{(t+1)}\bigr) \Bigr] \ge \nu_{++}\, \kappa_m^{\mathrm{sd}}\, \varepsilon_m \min\Bigl\{ \frac{\varepsilon_m}{c\, H_m},\ \Delta^{(t)} \Bigr\} \ge 0.$
If $t \in T \cap \mathcal{S}$ is sufficiently large, we have $\Delta^{(t)} \le \varepsilon_m / (c\, H_m)$ and hence
$\Delta^{(t)} \le \frac{1}{\nu_{++}\, \kappa_m^{\mathrm{sd}}\, \varepsilon_m} \Bigl[ \Phi\bigl(x^{(t)}\bigr) - \Phi\bigl(x^{(t+1)}\bigr) \Bigr].$
Since the iteration is either successful or model-improving for sufficiently large $t \in T$, and since $x^{(t)} = x^{(t+1)}$ for a model-improving iteration, we deduce from the previous inequality that
$\bigl\| x^{(t_j)} - x^{(\tau_j)} \bigr\| \le \sum_{t = t_j,\ t \in T \cap \mathcal{S}}^{\tau_j - 1} \bigl\| x^{(t)} - x^{(t+1)} \bigr\| \le \sum_{t = t_j,\ t \in T \cap \mathcal{S}}^{\tau_j - 1} \Delta^{(t)} \le \frac{1}{\nu_{++}\, \kappa_m^{\mathrm{sd}}\, \varepsilon_m} \Bigl[ \Phi\bigl(x^{(t_j)}\bigr) - \Phi\bigl(x^{(\tau_j)}\bigr) \Bigr]$
for $j \in \mathbb{N}$ sufficiently large. The sequence $\bigl\{ \Phi\bigl(x^{(t)}\bigr) \bigr\}_{t \in \mathbb{N}_0}$ is bounded below (Assumption 5) and monotonically decreasing by construction. Hence, the RHS above must converge to zero for $j \to \infty$. This implies $\lim_{j \to \infty} \bigl\| x^{(t_j)} - x^{(\tau_j)} \bigr\| = 0$.
Because of Assumptions 1 and 6, $\varpi$ is uniformly continuous, so that
$\lim_{j \to \infty} \bigl| \varpi\bigl(x^{(t_j)}\bigr) - \varpi\bigl(x^{(\tau_j)}\bigr) \bigr| = 0,$
which contradicts (31). Thus, no subsequence of acceptable or successful indices as in (29) can exist.    □

7. Numerical Examples

In this section we provide some more details on the actual implementation of Algorithm 2 and present the results of various experiments. We compare different surrogate model types with regard to their efficacy (in terms of expensive objective evaluations) and their ability to find Pareto critical points.

7.1. Implementation Details

We implemented the framework in the Julia language (the code is available at https://github.com/manuelbb-upb/Morbit.jl, accessed on 15 April 2021) and used the surrogate construction algorithms from Section 4.2 and Section 4.3. Concerning the RBF models, the algorithms are thus the same as in [41]. The OSQP solver [45] is used to solve (Pm). For non-linear problems we use the NLopt.jl package [46]; more specifically, we use the MMA algorithm [47] in conjunction with DynamicPolynomials.jl [48] to construct the Lagrange polynomials. The Pascoletti–Serafini subproblems are solved using the population-based ISRES method [49], with MMA for polishing. The derivatives of cheap objective functions are obtained by means of automatic differentiation [50], and the Taylor models use finite differences via FiniteDiff.jl.
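For illustration, the following is a minimal sketch of how a smooth subproblem can be handed to MMA through NLopt.jl; the quadratic objective here is a stand-in chosen for this sketch and is not one of our actual subproblems:

    using NLopt

    opt = Opt(:LD_MMA, 2)                  # MMA in two variables
    lower_bounds!(opt, [0.0, 0.0])         # box constraints of the (scaled) domain
    upper_bounds!(opt, [1.0, 1.0])
    xtol_rel!(opt, 1e-4)
    min_objective!(opt, (x, grad) -> begin
        if length(grad) > 0                # MMA is gradient-based
            grad .= 2 .* x                 # gradient of the stand-in objective
        end
        sum(abs2, x)                       # stand-in objective x1^2 + x2^2
    end)
    minf, minx, ret = optimize(opt, [0.5, 0.5])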
In accordance with Algorithm 2, we perform the shrinking trust region update via
$\Delta^{(t+1)} = \gamma_{\Downarrow}\, \Delta^{(t)}$ if $\rho^{(t)} < \nu_{+}$, and $\Delta^{(t+1)} = \gamma_{\downarrow}\, \Delta^{(t)}$ if $\nu_{+} \le \rho^{(t)} < \nu_{++}$.
Note that for box-constrained problems we internally scale the feasible set to the unit hypercube $[0,1]^n$, and all radii are measured with regard to this scaled domain.
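A sketch of this update in Julia, using the parameter values from Table 2; the growth branch for successful iterations and the cap at $\Delta_{\mathrm{ub}}$ are assumptions made here based on the parameters listed there, and all names are chosen for illustration:

    # p holds ν_plus, ν_plusplus, γ_shrink_much, γ_shrink, γ_grow, Δ_ub
    function update_radius(Δ, ρ, p)
        if ρ < p.ν_plus                    # unacceptable step: shrink strongly
            return p.γ_shrink_much * Δ
        elseif ρ < p.ν_plusplus            # acceptable step: shrink mildly
            return p.γ_shrink * Δ
        else                               # successful step: grow, capped at Δ_ub (assumed)
            return min(p.γ_grow * Δ, p.Δ_ub)
        end
    end

    p = (ν_plus = 0.1, ν_plusplus = 0.4, γ_shrink_much = 0.51,
         γ_shrink = 0.75, γ_grow = 2.0, Δ_ub = 0.5)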
For stopping, we use a disjunction of different criteria (a minimal sketch follows the list):
  • We have an upper bound $N_{\mathrm{it}} \in \mathbb{N}$ on the maximum number of iterations and an upper bound $N_{\mathrm{exp}} \in \mathbb{N}$ on the number of expensive objective evaluations.
  • The surrogate criticality naturally allows for a stopping test, and due to Lemma 11 the trust region radius can also be used (see also [33], Sec. 5). We combine this with a relative tolerance test and stop if
    $\Delta^{(t)} \le \Delta_{\min}$ OR $\bigl( \Delta^{(t)} \le \Delta_{\mathrm{crit}}$ AND $\omega\bigl(x^{(t)}\bigr) \le \omega_{\min} \bigr).$
  • At a truly critical point, the criticality loop of Algorithm 1 runs infinitely. We stop after a maximum number $N_{\mathrm{loops}} \in \mathbb{N}_0$ of its iterations.
  • We also employ the common relative stopping criteria
    $\bigl\| x^{(t)} - x^{(t+1)} \bigr\| \le \delta_x \bigl\| x^{(t)} \bigr\| \quad \text{and} \quad \bigl\| f\bigl(x^{(t)}\bigr) - f\bigl(x^{(t+1)}\bigr) \bigr\| \le \delta_f \bigl\| f\bigl(x^{(t)}\bigr) \bigr\|$
    to provoke early stopping.
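The following is a minimal sketch of this disjunctive test; all names are chosen here for illustration and do not mirror the Morbit.jl API:

    using LinearAlgebra

    function should_stop(Δ, ω, x, x_next, fx, fx_next, iters, evals, crit_loops, p)
        iters >= p.N_it          && return true
        evals >= p.N_exp         && return true
        crit_loops >= p.N_loops  && return true
        (Δ <= p.Δ_min || (Δ <= p.Δ_crit && ω <= p.ω_min)) && return true
        norm(x - x_next)   <= p.δ_x * norm(x)   && return true  # relative step criterion
        norm(fx - fx_next) <= p.δ_f * norm(fx)  && return true  # relative decrease criterion
        return false
    end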

7.2. A First Example

We ran our method on a multitude of academic test problems with a varying number of decision variables $n$ and objective functions $k$. We were able to approximate Pareto critical points in both cases, i.e., when we treat the problems as heterogeneous and when we declare all objectives expensive. We benchmarked RBF against polynomial models because in [33] it was shown that a trust region method using second degree Lagrange polynomials outperforms commercial solvers on scalarized problems. Most often, RBF surrogates outperform the other model types with regard to the number of expensive function evaluations.
This is illustrated in Figure 2. It shows two runs of Algorithm 2 on the non-convex problem (T6), taken from [38]:
$\min_{x \in X} \begin{bmatrix} x_1 + \ln(x_1) + x_2^2 \\ x_1^2 + x_2^4 \end{bmatrix}, \quad X = [\varepsilon, 30] \times [0, 30] \subseteq \mathbb{R}^2, \quad \varepsilon = 10^{-12}.$
The first objective function is treated as expensive while the second is cheap. In contrast to most other MOPs, there is only one solution, and this Pareto optimal point is $[\varepsilon, 0]^\top$. When we set a very restrictive limit of $N_{\mathrm{exp}} = 20$, we run out of budget with second degree Lagrange surrogates before reaching the optimum, see Figure 2b. As evident in Figure 2a, surrogates based on (cubic) RBF require significantly fewer training points. For the RBF models, the algorithm stopped after two criticality loops, and the model refinement during these loops is evident from the samples on the problem boundary converging to zero. The complete set of relevant parameters for the test runs is given in Table 2. We used a strict acceptance test and the strict Pareto–Cauchy step.
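For reference, the two objectives of (T6) can be stated in a few lines of Julia; how they are declared as expensive or cheap is specific to the solver configuration and is omitted in this sketch:

    f1(x) = x[1] + log(x[1]) + x[2]^2      # treated as an expensive black-box
    f2(x) = x[1]^2 + x[2]^4                # cheap; gradients via automatic differentiation
    ε  = 1e-12
    lb = [ε, 0.0]                          # X = [ε, 30] × [0, 30]
    ub = [30.0, 30.0]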

7.3. Benchmarks on Scalable Test-Problems

To assess the performance with a growing number of decision variables $n$, we performed tests on scalable problems from the ZDT and DTLZ families [51,52]. Figure 3 shows results for the bi-objective problems ZDT1–ZDT3 and for the $k$-objective problems DTLZ1 and DTLZ6 (we used $k = \max\{2, n - 4\}$ objectives). All problems are box-constrained. Twelve feasible starting points (from the Halton sequence) were generated for each problem setting, i.e., for each combination of $n$, a test problem and a descent method. The acceptance test and the backtracking were strict.
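Halton points are cheap to generate; a small self-contained sketch (the bases are the first primes, and more are needed for larger $n$):

    # radical inverse of i in base b, the building block of the Halton sequence
    function radical_inverse(i, b)
        r, f = 0.0, 1.0
        while i > 0
            f /= b
            r += f * (i % b)
            i ÷= b
        end
        return r
    end

    halton_point(i, n, bases = (2, 3, 5, 7, 11, 13)) =
        [radical_inverse(i, bases[j]) for j in 1:n]

    starts = [halton_point(i, 3) for i in 1:12]   # twelve starting points in [0, 1]^3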
In all cases, the first objective was considered cheap and all other objectives expensive. First and second degree Lagrange models were compared against linear Taylor models and (cubic) RBF surrogates. The Lagrange models were built using a $\Lambda$-poised set with $\Lambda = 1.5$. In the case of quadratic models, we used a precomputed set of points for $n \ge 6$. The Taylor models used finite differences, and points outside of the box constraints were simply projected back onto the boundary. The RBF models were allowed to include up to $(n+1)(n+2)/2$ training points from the database if $n \le 10$; otherwise, the maximum number of points was $2n + 1$. Points were first selected from a box of radius $\theta_1 \Delta^{(t)}$ with $\theta_1 = 2$ and then from a box of radius $\theta_2 \Delta_{\mathrm{ub}}$ with $\theta_2 = 2$. All other parameters differing from those in Table 2 are listed in Table 3. The stopping parameters were chosen so as to exit early and save evaluations.
As expected, the second degree Lagrange polynomials require the most objective evaluations, and the quadratic dependence on $n$ is clearly visible in Figure 3; the quadratic growth of the dark-blue line continues for $n \ge 8$. On average, the linear Lagrange models perform better than the linear Taylor polynomials when using the steepest descent steps, also in accordance with our expectations, because only $n+1$ points are needed for each model (versus $2n$ points). Most models, even the linear ones, profit from using the Pascoletti–Serafini subproblems (see Appendix B) over the steepest descent steps. By far the fewest evaluations (on average) are needed for the RBF models: the black line consistently stays below all other data points. Note that the RBF models likely appear to perform slightly better with the steepest descent steps because of the early stopping. In other experiments, we noticed that RBF models with Pascoletti–Serafini steps can save evaluations when more precise solutions are required.
For comparison, we also applied the weighted sum approach, i.e., minimizing a single scalarized objective, to each problem instance. We tested both the derivative-free COBYLA solver (described in [53] and implemented by NLopt.jl) and the trust region method using steepest descent and cubic RBF models, i.e., our own implementation of ORBIT [34]. Both solvers were restricted to the same maximum number of function evaluations. In fact, ORBIT was configured with the exact same parameters as in Table 3, and the relative stopping tolerances for COBYLA were $\delta_x = \delta_f = 10^{-2}$. Although COBYLA also uses linear models, it requires significantly more evaluations than most other algorithms. The results of the ORBIT scalarization are more comparable to those of the multiobjective runs.
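The scalarization used for these comparison runs can be sketched as follows; the equal weights are an assumption made here for illustration:

    objectives = (f1, f2)                 # e.g., the objectives of a bi-objective instance
    weights = fill(1 / length(objectives), length(objectives))   # assumed equal weights
    f_ws(x) = sum(w * f(x) for (w, f) in zip(weights, objectives))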

7.3.1. Solution Quality

Figure 4 illustrates that RBF models not only perform better on average, but also overall. With regard to the final solution criticality, there are a few outliers, mostly due to DTLZ1 (see also Figure 5). However, in most cases the solution criticality is acceptable, except for the linear Lagrange models. Moreover, Figure 5 shows that a good percentage of problem instances is solved with RBF, especially when compared to the other linear models. Note that in cases where the true objectives are not differentiable at the final iterate, $\omega$ was set to 0, because the selected problems are non-differentiable only at Pareto optimal points. In Figure 5 it also becomes apparent that the bi-objective DTLZ1 instances were the most challenging for all algorithms. DTLZ1 has many local minima, and the algorithm is likely to exit early near such a local minimum due to repeated unsuccessful iterations. Likewise, ZDT3 is “flat” towards the true Pareto Front, so that it becomes hard to make progress there.
Besides criticality, another metric of interest is the spread of solutions for different starting points. Figure 6 shows the final iterates when the algorithm is applied to the bi-objective problems ZDT1 and ZDT2 for 10 different starting points. Additionally, the problems are solved using the weighted sum approach with the derivative-free COBYLA solver. For each starting point the optimizers were allowed 30 objective evaluations and no data were re-used between runs.
As can be seen, for these problems the trust region method readily reaches the critical set using only 30 evaluations. Here, the steepest descent direction tends to generate solutions on the problem boundary when applied in such a global manner, with relatively large trust region radii ($\Delta^{(0)} = 0.1$ and $\Delta_{\mathrm{ub}} = 0.5$). Nonetheless, the method remains applicable for local refinement of approximate solutions, e.g., after a coarse search for good starting points using global methods, or as a corrector in continuation frameworks. The Pascoletti–Serafini step can be employed with different reference points/directions to provide a better covering than both the steepest descent steps and the weighted sum approach. For Figure 6, the reference points $\{[0, 10i]^\top : i = 1, \dots, 10\}$ were used. The weighted sum approach (with fixed weights) tends to produce clustered solutions. Especially for the non-convex problem ZDT2, only the boundary points of the true Pareto Front are reached, as expected [1].

7.3.2. RBF Comparison

Furthermore, we compared the RBF kernels from Table 1. In [34], the cubic kernel performs best on single-objective problems while the Gaussian does worst. As can be seen in Figure 7, this holds for multiple objective functions, too: the Gaussian and the Multiquadric require more function evaluations than the Cubic, especially in higher dimensions. If, however, we use a very simple adaptive strategy to fine-tune the shape parameter, then both kernels can finish significantly faster. In both cases, the shape parameter was set to $\alpha = 20 / \Delta^{(t)}$ in each iteration. Nevertheless, the cubic kernel appears to be a good choice in general.
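The adaptive strategy itself is a one-liner; a sketch of its use inside a Gaussian RBF evaluation follows, with coefficient and center handling simplified for illustration:

    using LinearAlgebra

    α_adaptive(Δ) = 20 / Δ              # shape parameter tied to the trust region radius

    # Gaussian RBF part of a surrogate with centers ξ and weights w at radius Δ
    rbf_part(x, ξ, w, Δ) =
        sum(w[i] * exp(-(α_adaptive(Δ) * norm(x - ξ[i]))^2) for i in eachindex(w))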

8. Conclusions

We have developed a trust region framework for heterogeneous and expensive multiobjective optimization problems. It is based on similar work [29,30,31,33], and our main contributions are the integration of constraints and of radial basis function surrogates. Our method is provably convergent to first order critical points for unconstrained problems and when the feasible set is convex and compact, while requiring significantly fewer expensive function evaluations due to a linear scaling of the model construction complexity with respect to the number of decision variables.
For future work, several modifications and extensions can likely be transferred from the single-objective to the multiobjective case. For example, the trust region update can be made step-size dependent (rather than depending on $\rho^{(t)}$ alone) to allow for a more precise model refinement, see [36] (Ch. 10). We have also experimented with the nonlinear CG method [9] for a multiobjective Steihaug–Toint step [36] (Ch. 7), and early results look promising.
Going forward, we would like to apply our algorithm to a real world application, similar to what has been done in [54]. Moreover, it would be desirable to obtain not just one but multiple Pareto critical solutions. Because the Pascoletti–Serafini scalarization is compatible with constraints, the iterations can be guided in image space by providing different global reference points. Furthermore, it is straightforward to use RBF with the heuristic methods from [55] for heterogeneous problems. We believe that it should also be possible to propagate multiple solutions and to combine the trust region method with non-dominance testing, as has been done in [31,56]. One can think of other globalization strategies as well: RBF models have been used in multiobjective Stochastic Search algorithms [57], and trust region ideas have been included in population based strategies [26]. It will thus be interesting to see whether the theoretical convergence properties can be maintained within these contexts by employing a careful trust region management. Finally, re-using the data sampled near the final iterate within a continuation framework like in [58] is a promising next step.

Supplementary Materials

Our Julia implementation of the solver is available online at https://github.com/manuelbb-upb/Morbit.jl accessed on 15 April 2021.

Author Contributions

Conceptualization, M.B. and S.P.; methodology, M.B.; software, M.B.; validation, M.B. and S.P.; formal analysis, M.B. and S.P.; investigation, M.B.; writing—original draft preparation, M.B.; writing—review and editing, S.P.; visualization, M.B.; supervision, S.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the European Union and the German Federal State of North Rhine-Westphalia within the EFRE.NRW project “SET CPS”.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Miscellaneous Proofs

Appendix A.1. Continuity of the Constrained Optimal Value

In this subsection, we show the continuity of $\omega(x)$ in the constrained case, where $\omega(x)$ is the negative optimal value of (P1), i.e.,
$\omega(x) := - \min_{d \in X - x,\ \|d\| \le 1} \ \max_{\ell = 1, \dots, k} \bigl\langle \nabla f_\ell(x), d \bigr\rangle.$
The proof of the continuity of $\omega(x)$, as stated in Theorem 1, follows the reasoning from [6], where continuity is shown for a related constrained descent direction program.
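For intuition, consider the unconstrained, single-objective special case $X = \mathbb{R}^n$, $k = 1$ (a worked example added here; it is not needed for the proof below). The constraint $d \in X - x$ is then inactive and
$\omega(x) = - \min_{\|d\| \le 1} \bigl\langle \nabla f_1(x), d \bigr\rangle = \bigl\| \nabla f_1(x) \bigr\|,$
with the minimum attained at $d = -\nabla f_1(x) / \| \nabla f_1(x) \|$ whenever $\nabla f_1(x) \neq 0$; the measure thus reduces to the usual first-order criticality measure and vanishes exactly at stationary points.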
Proof of Item 2 in Theorem 1. 
Let the requirements of Item 1 be fulfilled, i.e., let $f$ be continuously differentiable and let $X \subseteq \mathbb{R}^n$ be convex and compact. Further, let $x$ be a point in $X$, and denote the minimizing direction in (P1) by $d(x)$ and the optimal value by $\theta(x)$. We show that $\theta(x)$ is continuous, by which $\omega(x) = -\theta(x)$ is continuous as well.
First, note the following properties of the maximum function:
  • $u \mapsto \max_\ell u_\ell$ is sublinear (positively homogeneous and subadditive), and hence
    $\max_\ell \bigl\langle \nabla f_\ell(x), d_1 + d_2 \bigr\rangle \le \max_\ell \bigl\langle \nabla f_\ell(x), d_1 \bigr\rangle + \max_\ell \bigl\langle \nabla f_\ell(x), d_2 \bigr\rangle.$
  • $u \mapsto \max_\ell u_\ell$ is Lipschitz with constant 1, so that
    $\bigl| \max_\ell \bigl\langle \nabla f_\ell(x_1), d_1 \bigr\rangle - \max_\ell \bigl\langle \nabla f_\ell(x_2), d_2 \bigr\rangle \bigr| \le \bigl\| Df(x_1)\, d_1 - Df(x_2)\, d_2 \bigr\|,$
    for both the maximum and the Euclidean norm.
Now let $\{x^{(t)}\} \subseteq X$ be a sequence with $x^{(t)} \to x$. Due to the constraints, we have $d(x) \in X - x$ and thereby $d(x) + x - x^{(t)} \in X - x^{(t)}$. Let
$\sigma^{(t)} := \min\bigl\{ 1,\ 1 / \| d(x) + x - x^{(t)} \| \bigr\}$ if $d(x) \neq x^{(t)} - x$, and $\sigma^{(t)} := 1$ else; in both cases $\sigma^{(t)} \in (0, 1]$.
Then $\sigma^{(t)} \bigl( d(x) + x - x^{(t)} \bigr)$ is feasible for (P1) at $x^{(t)}$:
  • $\sigma^{(t)} \bigl( d(x) + x - x^{(t)} \bigr) \in X - x^{(t)}$, because $X - x^{(t)}$ is convex, $0, d(x) + x - x^{(t)} \in X - x^{(t)}$, and $\sigma^{(t)} \in (0, 1]$.
  • $\bigl\| \sigma^{(t)} \bigl( d(x) + x - x^{(t)} \bigr) \bigr\| \le 1$ by the definition of $\sigma^{(t)}$.
By the definition of (P1), it follows that
$\max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle \le \sigma^{(t)} \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d(x) + x - x^{(t)} \bigr\rangle,$
and by maximum property 1,
$\max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle \le \sigma^{(t)} \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d(x) \bigr\rangle + \sigma^{(t)} \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), x - x^{(t)} \bigr\rangle.$ (A1)
We make the following observations:
  • Because of $\bigl\| d(x) + x - x^{(t)} \bigr\| \to \| d(x) \| \le 1$ for $t \to \infty$, it follows that $\sigma^{(t)} \to 1$.
  • Because all objective gradients are continuous, it holds for all $\ell \in \{1, \dots, k\}$ that $\nabla f_\ell\bigl(x^{(t)}\bigr) \to \nabla f_\ell(x)$, and because $u \mapsto \max_\ell u_\ell$ is continuous as well, it then follows that
    $\max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d(x) \bigr\rangle \to \max_\ell \bigl\langle \nabla f_\ell(x), d(x) \bigr\rangle \quad \text{for } t \to \infty.$
  • The last term on the RHS of (A1) vanishes for $t \to \infty$.
By taking the limit superior in (A1), we then find that
$\limsup_{t \to \infty} \theta\bigl(x^{(t)}\bigr) = \limsup_{t \to \infty} \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle \le \max_\ell \bigl\langle \nabla f_\ell(x), d(x) \bigr\rangle = \theta(x).$ (A2)
Vice versa, we know that because of $d\bigl(x^{(t)}\bigr) \in X - x^{(t)}$, it holds that $d\bigl(x^{(t)}\bigr) + x^{(t)} - x \in X - x$, and as above we find that
$\max_\ell \bigl\langle \nabla f_\ell(x), d(x) \bigr\rangle \le \lambda^{(t)} \max_\ell \bigl\langle \nabla f_\ell(x), d\bigl(x^{(t)}\bigr) \bigr\rangle + \lambda^{(t)} \max_\ell \bigl\langle \nabla f_\ell(x), x^{(t)} - x \bigr\rangle$ (A3)
with
$\lambda^{(t)} := \min\bigl\{ 1,\ 1 / \bigl\| d\bigl(x^{(t)}\bigr) + x^{(t)} - x \bigr\| \bigr\}$ if $d\bigl(x^{(t)}\bigr) \neq x - x^{(t)}$, and $\lambda^{(t)} := 1$ else, so that $\lambda^{(t)} \in (0, 1]$.
Again, the last term of (A3) vanishes in the limit, so that by using the properties of the maximum function and the continuity of $\nabla f_\ell$, as well as $\lambda^{(t)} \to 1$, taking the limit inferior in (A3) yields
$\theta(x) = \max_\ell \bigl\langle \nabla f_\ell(x), d(x) \bigr\rangle \le \liminf_{t \to \infty} \max_\ell \bigl\langle \nabla f_\ell(x), d\bigl(x^{(t)}\bigr) \bigr\rangle$
$\le \liminf_{t \to \infty} \Bigl[ \max_\ell \bigl\langle \nabla f_\ell(x), d\bigl(x^{(t)}\bigr) \bigr\rangle - \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle + \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle \Bigr]$
$\le \liminf_{t \to \infty} \Bigl[ \bigl\| Df(x) - Df\bigl(x^{(t)}\bigr) \bigr\| \, \bigl\| d\bigl(x^{(t)}\bigr) \bigr\| + \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle \Bigr]$
$= \liminf_{t \to \infty} \max_\ell \bigl\langle \nabla f_\ell\bigl(x^{(t)}\bigr), d\bigl(x^{(t)}\bigr) \bigr\rangle = \liminf_{t \to \infty} \theta\bigl(x^{(t)}\bigr).$ (A4)
Combining (A2) and (A4) shows that $\theta\bigl(x^{(t)}\bigr) \to \theta(x)$ for $t \to \infty$. □
Theorem 2 claims that $\omega(x)$ is uniformly continuous, provided the objective gradients are Lipschitz. The implied Cauchy continuity is an important property in the convergence proof of the algorithm.
Proof of Theorem 2. 
We consider the constrained case only, where $X$ is convex and compact, and show uniform continuity a fortiori by proving that $\omega$ is Lipschitz. Let the objective gradients be Lipschitz continuous. Then $Df$ is Lipschitz as well, with constant $L > 0$. Let $x, y \in X$ with $x \neq y$ (the other case is trivial) and let again $d(x), d(y)$ be the respective optimizers.
Suppose w.l.o.g. that
$\bigl| \max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \bigr| = \max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle.$
If we define
$\sigma := \min\bigl\{ 1,\ 1 / \| d(y) + y - x \| \bigr\}$ if $d(y) \neq x - y$, and $\sigma := 1$ else, with $\sigma \in (0, 1]$,
then again $\sigma \bigl( d(y) + y - x \bigr)$ is feasible for (P1) at $x$. Thus,
$\max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \le \max_\ell \bigl\langle \nabla f_\ell(x), \sigma (d(y) + y - x) \bigr\rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle$
$\le \bigl\| \sigma\, Df(x) \bigl( d(y) + y - x \bigr) - Df(y)\, d(y) \bigr\| \le \bigl\| \sigma\, Df(x) - Df(y) \bigr\| \, \| d(y) \| + \sigma\, \| Df(x) \| \, \| x - y \|,$ (A5)
where we have again used maximum property 2 for the second inequality. We now investigate the first term on the RHS. Using $\| d(y) \| \le 1$ and adding a zero, we find
$\bigl\| \sigma\, Df(x) - Df(y) \bigr\| \, \| d(y) \| \le \bigl\| Df(x) - Df(y) \bigr\| + (1 - \sigma) \| Df(x) \| \le L \| x - y \| + (1 - \sigma) \| Df(x) \|.$ (A6)
Furthermore, $\| d(y) + y - x \| \le 1 + \| y - x \|$ implies $1 / (1 + \| y - x \|) \le \sigma$ and
$1 - \sigma \le 1 - \frac{1}{1 + \| y - x \|} = \frac{\| y - x \|}{1 + \| y - x \|} \le \| y - x \|.$
We use this inequality and plug (A6) into (A5) to obtain
$\max_\ell \langle \nabla f_\ell(x), d(x) \rangle - \max_\ell \langle \nabla f_\ell(y), d(y) \rangle \le L \| x - y \| + 2 \| Df(x) \| \, \| x - y \| \le (L + 2D) \| x - y \|,$
with $D = \max_{x \in X} \| Df(x) \|$, which is well-defined because $X$ is compact and $Df(\cdot)$ is continuous. □

Appendix A.2. Modified Criticality Measures

Proof of Lemma 5. 
There are two cases to consider:
  • If $\omega_{m^{(t)}}\bigl(x^{(t)}\bigr) \ge \omega\bigl(x^{(t)}\bigr)$, then
    $\bigl| \omega_{m^{(t)}}\bigl(x^{(t)}\bigr) - \omega\bigl(x^{(t)}\bigr) \bigr| = \omega_{m^{(t)}}\bigl(x^{(t)}\bigr) - \omega\bigl(x^{(t)}\bigr) \le \kappa_\omega\, \omega_{m^{(t)}}\bigl(x^{(t)}\bigr).$
    Now, depending on which terms attain the minima in the definitions of the modified measures, the difference $\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) - \varpi\bigl(x^{(t)}\bigr)$ equals either $\omega_{m^{(t)}}\bigl(x^{(t)}\bigr) - \omega\bigl(x^{(t)}\bigr)$, or $1 - \omega\bigl(x^{(t)}\bigr) \le \omega_{m^{(t)}}\bigl(x^{(t)}\bigr) - \omega\bigl(x^{(t)}\bigr)$, or $1 - 1 = 0$; in every case it is bounded by $\kappa_\omega\, \omega_{m^{(t)}}\bigl(x^{(t)}\bigr)$.
  • The case $\omega_{m^{(t)}}\bigl(x^{(t)}\bigr) < \omega\bigl(x^{(t)}\bigr)$ can be shown similarly. □
Proof of Lemma 6. 
Use Lemma 5 and then investigate the two possible cases:
  • If $\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) \ge \varpi\bigl(x^{(t)}\bigr)$, then the first inequality follows because $1 \ge 1 / (1 + \kappa_\omega)$.
  • If $\varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) < \varpi\bigl(x^{(t)}\bigr)$, then $\varpi\bigl(x^{(t)}\bigr) - \varpi_{m^{(t)}}\bigl(x^{(t)}\bigr) \le \kappa_\omega\, \varpi_{m^{(t)}}\bigl(x^{(t)}\bigr)$, and again the first inequality follows. □

Appendix B. Pascoletti–Serafini Step

One example of an alternative descent step $s^{(t)} \in \mathbb{R}^n$ is given in [33]. Thomann and Eichfelder [33] leverage the Pascoletti–Serafini scalarization to define local subproblems that guide the iterates towards the (local) model ideal point. To be precise, it is shown that the trial point $x_{+}^{(t)}$ can be computed as the solution to
$\min_{\tau \in \mathbb{R},\ x \in B^{(t)}} \ \tau \quad \text{s.t.} \quad m^{(t)}\bigl(x^{(t)}\bigr) + \tau\, r^{(t)} - m^{(t)}(x) \ge 0,$ (A7)
where $r^{(t)} = m^{(t)}\bigl(x^{(t)}\bigr) - i_m^{(t)} \in \mathbb{R}^k_{\ge 0}$ is the direction vector pointing from the local model ideal point
$i_m^{(t)} = \bigl[ i_1^{(t)}, \dots, i_k^{(t)} \bigr]^\top, \quad \text{with } i_\ell^{(t)} = \min_{x \in X} m_\ell^{(t)}(x) \ \text{for } \ell = 1, \dots, k,$ (A8)
to the current iterate value.
to the current iterate value. If the surrogates are linear or quadratic polynomials and the trust region use a p-norm with p { 1 , 2 , } these sub-problems are linear or quadratic programs.
A convergence proof for the unconstrained case is given in [33]. It relies on a sufficient decrease bound similar to (20). However, it is not shown that $\kappa^{\mathrm{sd}} \in (0, 1)$ exists independently of the iteration index $t$; this is instead stated as an assumption.
Furthermore, constraints (in particular box constraints) are integrated into the definitions of $\omega$ and $\omega_{m^{(t)}}$ using an active set strategy (see [38]). Consequently, both values are no longer Cauchy continuous. We can remedy both drawbacks by relating the (possibly constrained) Pascoletti–Serafini trial point to the strict modified Pareto–Cauchy point in our projection framework. To this end, we allow in (A7) and (A8) any feasible set fulfilling Assumption 1. Moreover, we restate the following assumption:
Assumption A1
(Assumption 4.10 in [33]). There is a constant $r \in (0, 1]$ so that, if $x^{(t)}$ is not Pareto critical, the components $r_1^{(t)}, \dots, r_k^{(t)}$ of $r^{(t)}$ satisfy $\min_\ell r_\ell^{(t)} / \max_\ell r_\ell^{(t)} \ge r$.
The assumption can be justified because $r^{(t)} > 0$ if $x^{(t)}$ is not critical, and $r^{(t)}$ can be bounded above and below by expressions involving $\omega_{m^{(t)}}$; see Remark 4 and [33] (Lemma 4.9). We can then derive the following lemma:
Lemma A1.
Suppose Assumptions 1, 2 and A1 hold. Let $(\tau_{+}, x_{+}^{(t)})$ be the solution to (A7). Then there exists a constant $\tilde{\kappa}_m^{\mathrm{sd}} \in (0, 1)$ such that
$\Phi_{m^{(t)}}\bigl(x^{(t)}\bigr) - \Phi_{m^{(t)}}\bigl(x_{+}^{(t)}\bigr) \ge \tilde{\kappa}_m^{\mathrm{sd}}\, \omega_{m^{(t)}}\bigl(x^{(t)}\bigr) \min\Bigl\{ \frac{\omega_{m^{(t)}}\bigl(x^{(t)}\bigr)}{c\, H_{m^{(t)}}},\ \Delta^{(t)},\ 1 \Bigr\}.$
Proof. 
If $x^{(t)}$ is critical for (MOPm), then $\tau_{+} = 0$ and $x_{+}^{(t)} = x^{(t)}$, and the bound is trivial [5]. Otherwise, we can use the same argumentation as in [33] (Lemma 4.13) to show that for the strict modified Pareto–Cauchy point $\hat{x}_{\mathrm{PC}}^{(t)}$ it holds that
$\Phi_{m^{(t)}}\bigl(x^{(t)}\bigr) - \Phi_{m^{(t)}}\bigl(x_{+}^{(t)}\bigr) \ge r \min_\ell \Bigl[ m_\ell^{(t)}\bigl(x^{(t)}\bigr) - m_\ell^{(t)}\bigl(\hat{x}_{\mathrm{PC}}^{(t)}\bigr) \Bigr],$
and the final bound follows from Corollary 2 with the new constant $\tilde{\kappa}_m^{\mathrm{sd}} = r\, \kappa_m^{\mathrm{sd}}$. □

References

  1. Ehrgott, M. Multicriteria Optimization, 2nd ed.; Springer: Berlin, Germany, 2005.
  2. Jahn, J. Vector Optimization: Theory, Applications, and Extensions, 2nd ed.; Springer: Berlin, Germany, 2011.
  3. Miettinen, K. Nonlinear Multiobjective Optimization; Springer: Berlin, Germany, 2013.
  4. Eichfelder, G. Twenty Years of Continuous Multiobjective Optimization. Available online: http://www.optimization-online.org/DB_FILE/2020/12/8161.pdf (accessed on 8 April 2021).
  5. Eichfelder, G. Adaptive Scalarization Methods in Multiobjective Optimization; Springer: Berlin, Germany, 2008.
  6. Fukuda, E.H.; Drummond, L.M.G. A Survey on Multiobjective Descent Methods. Pesqui. Oper. 2014, 34, 585–620.
  7. Fliege, J.; Svaiter, B.F. Steepest descent methods for multicriteria optimization. Math. Methods Oper. Res. (ZOR) 2000, 51, 479–494.
  8. Graña Drummond, L.; Svaiter, B. A steepest descent method for vector optimization. J. Comput. Appl. Math. 2005, 175, 395–414.
  9. Lucambio Pérez, L.R.; Prudente, L.F. Nonlinear Conjugate Gradient Methods for Vector Optimization. SIAM J. Optim. 2018, 28, 2690–2720.
  10. Lucambio Pérez, L.R.; Prudente, L.F. A Wolfe Line Search Algorithm for Vector Optimization. ACM Trans. Math. Softw. 2019, 45, 1–23.
  11. Gebken, B.; Peitz, S.; Dellnitz, M. A Descent Method for Equality and Inequality Constrained Multiobjective Optimization Problems. In Numerical and Evolutionary Optimization—NEO 2017; Trujillo, L., Schütze, O., Maldonado, Y., Valle, P., Eds.; Springer: Cham, Switzerland, 2019; pp. 29–61.
  12. Hillermeier, C. Nonlinear Multiobjective Optimization: A Generalized Homotopy Approach; Springer Basel AG: Basel, Switzerland, 2001.
  13. Gebken, B.; Peitz, S.; Dellnitz, M. On the hierarchical structure of Pareto critical sets. J. Glob. Optim. 2019, 73, 891–913.
  14. Wilppu, O.; Karmitsa, N.; Mäkelä, M. New Multiple Subgradient Descent Bundle Method for Nonsmooth Multiobjective Optimization; Report No. 1126; Turku Centre for Computer Science: Turku, Finland, 2014.
  15. Gebken, B.; Peitz, S. An Efficient Descent Method for Locally Lipschitz Multiobjective Optimization Problems. J. Optim. Theory Appl. 2021.
  16. Custódio, A.L.; Madeira, J.F.A.; Vaz, A.I.F.; Vicente, L.N. Direct Multisearch for Multiobjective Optimization. SIAM J. Optim. 2011, 21, 1109–1140.
  17. Audet, C.; Savard, G.; Zghal, W. Multiobjective Optimization Through a Series of Single-Objective Formulations. SIAM J. Optim. 2008, 19, 188–210.
  18. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
  19. Deb, K. Multi-Objective Optimization Using Evolutionary Algorithms; Wiley: Hoboken, NJ, USA, 2001.
  20. Coello, C.A.C.; Lamont, G.B.; Veldhuizen, D.A.V. Evolutionary Algorithms for Solving Multi-Objective Problems, 2nd ed.; Springer: New York, NY, USA, 2007.
  21. Abraham, A.; Jain, L.C.; Goldberg, R. (Eds.) Evolutionary Multiobjective Optimization: Theoretical Advances and Applications; Advanced Information and Knowledge Processing; Springer: New York, NY, USA, 2005.
  22. Zitzler, E. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Ph.D. Thesis, ETH Zurich, Zurich, Switzerland, 1999.
  23. Peitz, S.; Dellnitz, M. A Survey of Recent Trends in Multiobjective Optimal Control—Surrogate Models, Feedback Control and Objective Reduction. Math. Comput. Appl. 2018, 23, 30.
  24. Chugh, T.; Sindhya, K.; Hakanen, J.; Miettinen, K. A survey on handling computationally expensive multiobjective optimization problems with evolutionary algorithms. Soft Comput. 2019, 23, 3137–3166.
  25. Deb, K.; Roy, P.C.; Hussein, R. Surrogate Modeling Approaches for Multiobjective Optimization: Methods, Taxonomy, and Results. Math. Comput. Appl. 2020, 26, 5.
  26. Roy, P.C.; Hussein, R.; Blank, J.; Deb, K. Trust-Region Based Multi-objective Optimization for Low Budget Scenarios. In Evolutionary Multi-Criterion Optimization; Lecture Notes in Computer Science; Deb, K., Goodman, E., Coello Coello, C.A., Klamroth, K., Miettinen, K., Mostaghim, S., Reed, P., Eds.; Springer International Publishing: Cham, Switzerland, 2019; Volume 11411, pp. 373–385.
  27. Conn, A.R.; Scheinberg, K.; Vicente, L.N. Introduction to Derivative-Free Optimization; Number 8 in MPS-SIAM Series on Optimization; Society for Industrial and Applied Mathematics/Mathematical Programming Society: Philadelphia, PA, USA, 2009.
  28. Larson, J.; Menickelly, M.; Wild, S.M. Derivative-free optimization methods. arXiv 2019, arXiv:1904.11585.
  29. Qu, S.; Goh, M.; Liang, B. Trust region methods for solving multiobjective optimisation. Optim. Methods Softw. 2013, 28, 796–811.
  30. Villacorta, K.D.V.; Oliveira, P.R.; Soubeyran, A. A Trust-Region Method for Unconstrained Multiobjective Problems with Applications in Satisficing Processes. J. Optim. Theory Appl. 2014, 160, 865–889.
  31. Ryu, J.H.; Kim, S. A Derivative-Free Trust-Region Method for Biobjective Optimization. SIAM J. Optim. 2014, 24, 334–362.
  32. Audet, C.; Savard, G.; Zghal, W. A mesh adaptive direct search algorithm for multiobjective optimization. Eur. J. Oper. Res. 2010, 204, 545–556.
  33. Thomann, J.; Eichfelder, G. A Trust-Region Algorithm for Heterogeneous Multiobjective Optimization. SIAM J. Optim. 2019, 29, 1017–1047.
  34. Wild, S.M.; Regis, R.G.; Shoemaker, C.A. ORBIT: Optimization by Radial Basis Function Interpolation in Trust-Regions. SIAM J. Sci. Comput. 2008, 30, 3197–3219.
  35. Conn, A.R.; Scheinberg, K.; Vicente, L.N. Global Convergence of General Derivative-Free Trust-Region Algorithms to First- and Second-Order Critical Points. SIAM J. Optim. 2009, 20, 387–415.
  36. Conn, A.R.; Gould, N.I.M.; Toint, P.L. Trust-Region Methods; MPS-SIAM Series on Optimization; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2000.
  37. Luc, D.T. Theory of Vector Optimization; Lecture Notes in Economics and Mathematical Systems; Springer: Berlin/Heidelberg, Germany, 1989; Volume 319.
  38. Thomann, J. A Trust Region Approach for Multi-Objective Heterogeneous Optimization. Ph.D. Thesis, TU Ilmenau, Ilmenau, Germany, 2018.
  39. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research; Springer: Berlin, Germany, 2006.
  40. Wendland, H. Scattered Data Approximation, 1st ed.; Cambridge University Press: Cambridge, UK, 2004.
  41. Wild, S.M. Derivative-Free Optimization Algorithms for Computationally Expensive Functions; Cornell University: Ithaca, NY, USA, 2009.
  42. Wild, S.M.; Shoemaker, C. Global Convergence of Radial Basis Function Trust Region Derivative-Free Algorithms. SIAM J. Optim. 2011, 21, 761–781.
  43. Regis, R.G.; Wild, S.M. CONORBIT: Constrained optimization by radial basis function interpolation in trust regions. Optim. Methods Softw. 2017, 32, 552–580.
  44. Fleming, W. Functions of Several Variables; Undergraduate Texts in Mathematics; Springer: New York, NY, USA, 1977.
  45. Stellato, B.; Banjac, G.; Goulart, P.; Bemporad, A.; Boyd, S. OSQP: An operator splitting solver for quadratic programs. Math. Program. Comput. 2020, 12, 637–672.
  46. Johnson, S.G. The NLopt Nonlinear-Optimization Package. Available online: https://nlopt.readthedocs.io/en/latest/ (accessed on 8 April 2021).
  47. Svanberg, K. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM J. Optim. 2002, 12, 555–573.
  48. Legat, B.; Timme, S.; Weisser, T.; Kapelevich, L.; Rackauckas, C.; TagBot, J. JuliaAlgebra/DynamicPolynomials.jl: v0.3.15. 2020. Available online: https://zenodo.org/record/4153432#.YG5wjj8RVPY (accessed on 8 April 2021).
  49. Runarsson, T.P.; Yao, X. Search biases in constrained evolutionary optimization. IEEE Trans. Syst. Man Cybern. C Appl. Rev. 2005, 35, 233–243.
  50. Revels, J.; Lubin, M.; Papamarkou, T. Forward-Mode Automatic Differentiation in Julia. arXiv 2016, arXiv:1607.07892.
  51. Zitzler, E.; Deb, K.; Thiele, L. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evol. Comput. 2000, 8, 173–195.
  52. Deb, K.; Thiele, L.; Laumanns, M.; Zitzler, E. Scalable Test Problems for Evolutionary Multiobjective Optimization. In Evolutionary Multiobjective Optimization; Advanced Information and Knowledge Processing; Abraham, A., Jain, L., Goldberg, R., Eds.; Springer: London, UK, 2005; pp. 105–145.
  53. Powell, M.J. A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis; Gomez, S., Hennart, J.P., Eds.; Springer: Dordrecht, The Netherlands, 1994; pp. 51–67.
  54. Prinz, S.; Thomann, J.; Eichfelder, G.; Boeck, T.; Schumacher, J. Expensive multi-objective optimization of electromagnetic mixing in a liquid metal. Optim. Eng. 2020.
  55. Thomann, J.; Eichfelder, G. Representation of the Pareto front for heterogeneous multi-objective optimization. J. Appl. Numer. Optim. 2019, 1, 293–323.
  56. Deshpande, S.; Watson, L.T.; Canfield, R.A. Multiobjective optimization using an adaptive weighting scheme. Optim. Methods Softw. 2016, 31, 110–133.
  57. Regis, R.G. Multi-objective constrained black-box optimization using radial basis function surrogates. J. Comput. Sci. 2016, 16, 140–155.
  58. Schütze, O.; Cuate, O.; Martín, A.; Peitz, S.; Dellnitz, M. Pareto Explorer: A global/local exploration tool for many-objective optimization problems. Eng. Optim. 2020, 52, 832–855.
Figure 1. (a) Interpolation of a nonlinear function (black) by a Multiquadric surrogate (blue) based on 5 discrete training points (orange). Dashed lines show the kernels and the polynomial tail. (b) Different kernels in 1D with varying shape parameter (1 or 10), see also Table 1.
Figure 2. Two runs with maximum number of expensive evaluations set to 20 (soft limit). Test points are light-gray, the iterates are black, final iterate is red, white markers show other points where the objectives are evaluated. The successive trust regions are also shown. (a) Using Radial Basis Function (RBF) surrogate models we converge to the optimum using only 12 expensive evaluations. (b) Quadratic Lagrange models do not reach the optimum using 19 evaluations. (c) Iterations and test points in the objective space.
Figure 3. Average number of expensive objective evaluations by number of decision variables n, surrogate type and descent method. “SD” refers to steepest descent and “PS” to Pascoletti–Serafini. “LP1” (orange) are linear Lagrange models, “LP2” (yellow) quadratic Lagrange models, “TP1” (blue) are linear Taylor polynomials based on finite differences and “cubic” (black) refers to cubic RBF models. Additionally the results for weighted sum runs are shown in green, using the COBYLA solver and a single objective variant of the trust region framework, ORBIT.
Figure 4. Box-plots of the number of evaluations and the solution criticality for n = 5 and n = 15 for the runs from Figure 3. Outliers are not shown. “WS_C” and “WS_O” refer to the weighted sum approach using COBYLA and ORBIT, respectively.
Figure 5. Each group of bars shows the percentage of solved problem instances, i.e., test runs where the final solution criticality has a value below 0.1. From left to right, the bars correspond to the Trust Region Method (TRM) using linear Lagrange polynomials, the TRM with quadratic Lagrange polynomials, TRM with linear Taylor polynomials, weighted sum with COBYLA, weighted sum with ORBIT and TRM with cubic RBF. Per model and n-value there were 60 runs.
Figure 6. Final iterates in objective space for the bi-objective problems ZDT1 and ZDT2 in 10 variables. The weighted sum method (WS) is compared against the trust region method using steepest descent (DS) and the Pascoletti–Serafini (PS) method.
Figure 7. Each group of bars shows the influence of an adaptive shape parameter on the performance of different RBF models (tested on ZDT3) for different decision space dimensions. From left to right, the bars correspond to the cubic RBF, the Gaussian (with constant shape factor 1 and with adaptive shape factor $20 / \Delta^{(t)}$) and the Multiquadric (with shape factors 1 and $20 / \Delta^{(t)}$).
Table 1. Some radial functions $\varphi : \mathbb{R}_{\ge 0} \to \mathbb{R}$ that are c.p.d. of order $D \le 2$, cf. [34].
Name | $\varphi(r)$ | c.p.d. order $D$
Cubic | $r^3$ | 2
Multiquadric | $-\sqrt{1 + (\alpha r)^2}$, $\alpha > 0$ | 1
Gaussian | $\exp(-(\alpha r)^2)$, $\alpha > 0$ | 0
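For concreteness, an RBF surrogate with one of these kernels and a linear polynomial tail can be evaluated as follows (a minimal sketch; the coefficient fitting is omitted and all names are chosen here for illustration):

    using LinearAlgebra

    φ_cubic(r) = r^3                       # c.p.d. order 2, no shape parameter

    # s(x) = Σᵢ wᵢ φ(‖x − ξᵢ‖) + ⟨β, x⟩ + β₀ for centers ξ, RBF weights w, tail (β, β₀)
    function rbf_eval(x, ξ, w, β, β0; φ = φ_cubic)
        sum(w[i] * φ(norm(x - ξ[i])) for i in eachindex(w)) + dot(β, x) + β0
    end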
Table 2. Parameters for Figure 2; radii relative to $[0,1]^n$.
Param. | $\varepsilon_{\mathrm{crit}}$ | $N_{\mathrm{exp}}$ | $N_{\mathrm{loops}}$ | $\mu$ | $\beta$ | $\Delta_{\mathrm{ub}}$ | $\Delta_{\min}$ | $\Delta^{(0)}$ | $\nu_{+}$ | $\nu_{++}$ | $\gamma_{\Downarrow}$ | $\gamma_{\downarrow}$ | $\gamma_{\uparrow}$
Value | $10^{-3}$ | 20 | 2 | $2 \times 10^3$ | $10^3$ | 0.5 | $10^{-3}$ | 0.1 | 0.1 | 0.4 | 0.51 | 0.75 | 2
Table 3. Parameters for Figure 3; radii relative to $[0,1]^n$.
Parameter | $\varepsilon_{\mathrm{crit}}$ | $N_{\mathrm{it}}$ | $N_{\mathrm{exp}}$ | $N_{\mathrm{loops}}$ | $\Delta_{\mathrm{crit}}$ | $\omega_{\min}$ | $\Delta_{\min}$ | $\delta_x$ | $\delta_f$ | $\nu_{+}$ | $\nu_{++}$
Value | $10^{-2}$ | 100 | $n \times 10^3$ | 3 | $10^{-2}$ | $10^{-3}$ | $10^{-6}$ | $10^{-3}$ | $10^{-3}$ | 0 | 0.1