Derivative-Free Multiobjective Trust Region Descent Method Using Radial Basis Function Surrogate Models

Abstract: We present a flexible trust region descent algorithm for unconstrained and convexly constrained multiobjective optimization problems. It is targeted at heterogeneous and expensive problems, i.e., problems that have at least one objective function that is computationally expensive. The method is derivative-free in the sense that neither does derivative information need to be available for the expensive objectives, nor are gradients approximated using repeated function evaluations, as is the case in finite-difference methods. Instead, a multiobjective trust region approach is used that works similarly to its well-known scalar counterparts. Local surrogate models constructed from evaluation data of the true objective functions are employed to compute possible descent directions. In contrast to existing multiobjective trust region algorithms, these surrogates are not polynomial but carefully constructed radial basis function networks. This has the important advantage that the number of data points scales linearly with the parameter space dimension. The local models qualify as fully linear and the corresponding general scalar framework is adapted for problems with multiple objectives. Convergence to Pareto critical points is proven and numerical examples illustrate our results.


Introduction
Optimization problems arise in a multitude of applications in mathematics, computer science, engineering and the natural sciences. In many real-life scenarios, there are multiple, equally important objectives that need to be optimized. Such problems are then called Multiobjective Optimization Problems (MOP). In contrast to the single objective case, an MOP often does not have a single solution but an entire set of optimal trade-offs between the different objectives, which we call Pareto optimal. They constitute the Pareto Set and their image is the Pareto Frontier. The goal in the numerical treatment of an MOP is to either approximate these sets or to find single points within them. In applications, the problem can become more difficult when some of the objectives require computationally expensive or time-consuming evaluations. For instance, the objectives could depend on a computer simulation or some other black-box. It is then of primary interest to reduce the overall number of function evaluations. Consequently, it becomes infeasible to approximate derivative information of the true objectives using, e.g., finite differences. In this work, optimization methods that do not use the objective gradients (which nonetheless are assumed to exist) are referred to as derivative-free.
There is a variety of methods to deal with multiobjective optimization problems, some of which are also derivative-free or try to constrain the number of expensive function evaluations. A broad overview of different problems and techniques concerning multiobjective optimization can be found, e.g., in [1][2][3][4]. One popular approach for calculating Pareto optimal solutions is scalarization, i.e., the transformation of an MOP into a single objective problem, cf. [5] for an overview. Alternatively, classical (single objective) descent algorithms can be adapted to the multiobjective case [6][7][8][9][10][11]. What is more, the structure of the Pareto Set can be exploited to find multiple solutions [12,13]. There are also methods for non-smooth problems [14,15] and multiobjective direct-search variants [16,17]. Both scalarization and descent techniques may be included in Evolutionary Algorithms (EA) [18][19][20][21], the most prominent of which probably is NSGA-II [22]. To address computationally expensive objectives or missing derivative information, there are algorithms that use surrogate models (see the surveys [23][24][25]) or borrow ideas from scalar trust region methods, e.g., [26].
In single objective optimization, trust region methods are well suited for derivative-free optimization [27,28]. Our work is based on the recent development of multiobjective trust region methods:

• In [29], a trust region method using Newton steps for functions with positive definite Hessians on an open domain is proposed.

• In [30], quadratic Taylor polynomials are used to compute the steepest descent direction, which is used in a backtracking manner to find solutions for unconstrained problems.

• In [31], polynomial regression models are used to solve an augmented MOP based on the scalarization in [17]. The algorithm is designed for unconstrained bi-objective problems.

• In [32], quadratic Lagrange polynomials are used and the Pascoletti-Serafini scalarization is employed for the descent step calculation.
Our contribution is the extension of the above-mentioned methods to general fully linear models (and in particular radial basis function surrogates as in [33]), which is related to the scalar framework in [34]. Most importantly, this reduces the complexity with respect to the parameter space dimension to linear, in contrast to the quadratically increasing number of function evaluations in other methods. We further prove convergence to critical points when the problem is constrained to a convex and compact set by using an argumentation analogous to [35]. This requires new results concerning the continuity of the projected steepest descent direction. We also show how to keep the convergence properties for constrained problems when the Pascoletti-Serafini scalarization is employed (as in [32]). The remainder of the paper is structured as follows: Section 2 provides a brief introduction to multiobjective optimality and criticality concepts. In Section 3 the fundamentals of our algorithm are explained. In Section 4 we introduce fully linear surrogate models and describe their construction. We also formalize the main algorithm in this section. Section 5 deals with the descent step calculation so that a sufficient decrease is achieved in each iteration. Convergence is proven in Section 6 and a few numerical examples are shown in Section 7. We conclude with a brief discussion in Section 8.

Optimality and Criticality in Multiobjective Optimization
We consider the following (real-valued) multiobjective optimization problem:

min_{x ∈ X} f(x) = [f_1(x), …, f_k(x)]^T,  (MOP)
with a feasible set X ⊆ R^n and k objective functions f_ℓ : R^n → R, ℓ = 1, …, k. We further assume (MOP) to be heterogeneous. That is, there is a non-empty subset I_ex ⊆ {1, …, k} of indices so that the gradients of f_ℓ, ℓ ∈ I_ex, are unknown and cannot be approximated, e.g., via finite differences. The (possibly empty) index set I_cheap = {1, …, k} \ I_ex indicates functions whose gradients are available. Solutions of (MOP) consist of optimal trade-offs x* ∈ X between the different objectives and are called non-dominated or Pareto optimal. That is, there is no x ∈ X with f(x) ≺ f(x*) (i.e., f(x) ≤ f(x*) componentwise and f_ℓ(x) < f_ℓ(x*) for some index ℓ ∈ {1, …, k}). The subset P_S ⊆ X of non-dominated points is then called the Pareto Set and its image P_F := f(P_S) ⊆ R^k is called the Pareto Frontier. All concepts can be defined in a local fashion in an analogous way.
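The dominance relation can be stated compactly in code. A small sketch (the function names are ours, not from the paper) that filters the non-dominated points out of a finite set of objective vectors:

```python
import numpy as np

def dominates(fx, fy):
    """True if objective vector fx dominates fy: fx <= fy componentwise
    and fx is strictly smaller in at least one component."""
    fx, fy = np.asarray(fx), np.asarray(fy)
    return bool(np.all(fx <= fy) and np.any(fx < fy))

def nondominated(F):
    """Indices of the non-dominated rows of the objective matrix F."""
    return [i for i, fi in enumerate(F)
            if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]

F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
print(nondominated(F))  # [0, 1, 3] -- the point [3, 3] is dominated by [2, 2]
```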
Similar to scalar optimization, local optima can be characterized using the gradients of the objective functions. We therefore implicitly assume all objective functions f_ℓ, ℓ = 1, …, k, to be continuously differentiable on X. Moreover, the following assumption allows for an easier treatment of tangent cones in the constrained case:

Assumption 1. Either X = R^n, or the feasible set X ⊆ R^n is closed, bounded and convex. All functions are defined on X.
Because R^n is finite-dimensional, Assumption 1 is equivalent to requiring X to be compact and convex, which is a standard assumption in the multiobjective literature [6,7]. Now let ∇f_ℓ(x) denote the gradient of f_ℓ and Df(x) ∈ R^{k×n} the Jacobian of f at x ∈ X.
A vector d ∈ X − x is a descent direction at x if

⟨∇f_ℓ(x), d⟩ < 0 for all ℓ = 1, …, k,  (1)

where ⟨•, •⟩ is the standard inner product on R^n and we consider X − x = X in the unconstrained case X = R^n.
A point x* ∈ X is called critical for (MOP) iff there is no d ∈ X − x* satisfying (1). As all Pareto optimal points are also critical (cf. [6,36] or [2, Ch. 17]), it is viable to search for optimal points by calculating points from the superset P_crit ⊇ P_S of critical points for (MOP). One way to do so is by iteratively performing descent steps. Fliege and Svaiter [7] propose several ways to compute suitable descent directions. The minimizer d* of the following problem is known as the multiobjective steepest-descent direction:

min_{d ∈ X − x, ‖d‖ ≤ 1} max_{ℓ=1,…,k} ⟨∇f_ℓ(x), d⟩.  (P1)
Problem (P1) has an equivalent reformulation as

min_{β ∈ R, d ∈ X − x} β  s.t.  ⟨∇f_ℓ(x), d⟩ ≤ β for ℓ = 1, …, k, and ‖d‖ ≤ 1,

which is a linear program if X is defined by linear constraints and the maximum norm ‖•‖ = ‖•‖_∞ is used [7]. We thus stick with this choice because it facilitates implementation, but note that other choices are possible (see, for example, [32]).
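With the maximum norm, the reformulation is a linear program in the variables (β, d) and can be handed to any LP solver. A sketch for the unconstrained case X = R^n using `scipy.optimize.linprog` (the helper name is ours):

```python
import numpy as np
from scipy.optimize import linprog

def steepest_descent_direction(jac):
    """Multiobjective steepest descent direction (unconstrained case):
    minimize beta s.t. <grad f_l, d> <= beta for all l, ||d||_inf <= 1."""
    k, n = jac.shape
    c = np.zeros(n + 1)
    c[0] = 1.0                                  # minimize beta
    A = np.hstack([-np.ones((k, 1)), jac])      # grad_l . d - beta <= 0
    b = np.zeros(k)
    bounds = [(None, None)] + [(-1.0, 1.0)] * n  # beta free, |d_i| <= 1
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
    return res.x[1:], -res.fun                   # direction d*, criticality omega

# two objectives with opposing gradients: no common descent direction exists
jac = np.array([[1.0, 0.0], [-1.0, 0.0]])
d, omega = steepest_descent_direction(jac)
print(abs(omega) < 1e-8)  # True: the point is Pareto-critical
```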
Motivated by the next theorem, we can use the optimal value of either problem as a measure of criticality, i.e., as a multiobjective counterpart of the gradient norm. As is standard in most multiobjective trust region works (cf. [29,30,32]), we flip the sign so that the values are non-negative.

Theorem 1. For x ∈ X let d*(x) be the minimizer of (P1) and ω(x) be the negative optimal value, that is,

ω(x) := − min_{d ∈ X − x, ‖d‖ ≤ 1} max_{ℓ=1,…,k} ⟨∇f_ℓ(x), d⟩.

Then the following statements hold:
1. ω(x) ≥ 0 for all x ∈ X.
2. The function ω : R^n → R is continuous.
3. The following statements are equivalent: (a) the point x is Pareto-critical; (b) ω(x) = 0.

Proof. For the unconstrained case, all statements are proven in [7, Lemma 3]. The first and the third statement hold true for X convex and compact by definition. The continuity of ω can be shown similarly as in [6], see Appendix A.1.
With further conditions on f and X, the criticality measure ω(x) is even Lipschitz continuous and subsequently uniformly and Cauchy continuous:

Theorem 2. If the gradients ∇f_ℓ, ℓ = 1, …, k, are Lipschitz continuous and Assumption 1 holds, then the map ω(•) as defined in Theorem 1 is uniformly continuous.

Proof. The proof for X = R^n is given by Thomann [37]. A proof for the constrained case can be found in Appendix A.1, so as not to clutter this introductory section.
Together with Theorem 1, this hints at ω(•) being a criticality measure as defined for scalar trust region methods in [35, Ch. 8]:

Definition 2. We call π : N_0 × R^n → R a criticality measure for (MOP) if π is Cauchy continuous with respect to its second argument and if lim_{t→∞} π(t, x^(t)) = 0 implies that the sequence x^(t) asymptotically approaches a Pareto-critical point.

Trust Region Ideas
Multiobjective trust region algorithms closely follow the design of scalar approaches (see [35] for an extensive treatment). Consequently, the requirements and convergence proofs in [29,30,32] for the unconstrained multiobjective case are fairly similar to those in [35]. We will reexamine the core concepts to provide a clear understanding and point out the similarities to the scalar case.
The main idea is to iteratively compute descent steps s^(t) in every iteration t ∈ N_0. We could, for example, use the steepest descent direction given by (P1). This would require knowledge of the objective gradients, which need not be available for objective functions with indices in I_ex. Hence, benevolent surrogate model functions m_ℓ^(t) are employed. Note that for cheap objectives f_ℓ, ℓ ∈ I_cheap, we could simply use m_ℓ^(t) = f_ℓ as long as these f_ℓ are twice continuously differentiable and have Hessians of bounded norm.
The surrogate models are constructed to be sufficiently accurate within a trust region

B^(t) := {x ∈ X : ‖x − x^(t)‖ ≤ ∆^(t)}  (2)

around the current iterate x^(t). The model steepest descent direction d_m^(t) can then be computed as the minimizer of the surrogate problem

min_{d ∈ X − x^(t), ‖d‖ ≤ 1} max_{ℓ=1,…,k} ⟨∇m_ℓ^(t)(x^(t)), d⟩.  (Pm)

Now let σ^(t) > 0 be a step size. The direction d_m^(t) need not be a descent direction for the true objectives f, and the trial point is only accepted if a measure ρ^(t) of improvement and model quality surpasses a positive threshold ν_+. As in [30,32], we scalarize the multiobjective problems by defining

Φ(x) := max_{ℓ=1,…,k} f_ℓ(x)  and  Φ_m^(t)(x) := max_{ℓ=1,…,k} m_ℓ^(t)(x).

If Φ(x^(t)) − Φ(x^(t) + s^(t)) > 0, there is a reduction in at least one objective function of f because of

f_ℓ(x^(t)) − f_ℓ(x^(t) + s^(t)) ≥ Φ(x^(t)) − Φ(x^(t) + s^(t)) > 0,

where we denoted by ℓ the maximizing index in Φ(x^(t)) and by q the maximizing index in Φ(x^(t) + s^(t)), so that f_ℓ(x^(t) + s^(t)) ≤ f_q(x^(t) + s^(t)). Of course, the same property holds for Φ_m^(t)(•) and m^(t). Thus, the step size σ^(t) > 0 is chosen so that the step s^(t) = σ^(t) d_m^(t) satisfies both x^(t) + s^(t) ∈ B^(t) and a "sufficient decrease condition" of the form

Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s^(t)) ≥ C ω_m^(t)(x^(t)) min{ω_m^(t)(x^(t)), ∆^(t), 1}

with a constant C > 0, see Section 5. Such a condition is also required in the scalar case [34,35] and is essential for the convergence proof in Section 6, where we show lim_{t→∞} ω(x^(t)) = 0.
Due to the decrease condition, the denominator in the ratio of actual versus predicted reduction,

ρ^(t) := (Φ(x^(t)) − Φ(x^(t) + s^(t))) / (Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s^(t))),  (3)

is nonnegative. A positive ρ^(t) implies a decrease in at least one objective f_ℓ, so we accept the trial point if ρ^(t) > ν_+. If ρ^(t) is sufficiently large, say ρ^(t) ≥ ν_++ > ν_+ > 0, the next trust region might have a larger radius ∆^(t+1) ≥ ∆^(t). If in contrast ρ^(t) < ν_++, the next trust region radius should be smaller and the surrogates improved. This encompasses the case s^(t) = 0, when the iterate x^(t) is critical for the surrogate problem. Roughly speaking, we suppose that x^(t) is near a critical point for the original problem (MOP) if m^(t) is sufficiently accurate. If we truly are near a critical point, then the trust region radius will approach 0. For further details concerning the acceptance ratio ρ^(t), see [32, Sec. 2.2].
Remark 1. We can modify ρ^(t) in (3) to obtain a descent in all objectives, i.e., we accept x^(t) + s^(t) only if the componentwise ratios satisfy ρ_ℓ^(t) > ν_+ for all ℓ = 1, …, k. This is the strict acceptance test.
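The acceptance test and radius update just described can be sketched in a few lines. A minimal sketch, with placeholder values for the thresholds ν_+, ν_++ and the scaling factors (all names and defaults here are hypothetical, not the paper's exact parameters):

```python
import numpy as np

def acceptance_ratio(f_old, f_new, m_old, m_new):
    """rho = (max_l f_l(x) - max_l f_l(x+s)) / (max_l m_l(x) - max_l m_l(x+s))."""
    return (np.max(f_old) - np.max(f_new)) / (np.max(m_old) - np.max(m_new))

def update(rho, radius, nu_plus=0.1, nu_pp=0.4, gamma_shrink=0.5, gamma_grow=2.0):
    """Accept the trial point if rho > nu_plus; grow or shrink the radius."""
    accept = rho > nu_plus
    if rho >= nu_pp:
        radius = gamma_grow * radius      # very successful: enlarge the region
    elif not accept:
        radius = gamma_shrink * radius    # failed: shrink and improve the models
    return accept, radius

rho = acceptance_ratio([1.0, 2.0], [0.9, 1.5], [1.0, 2.0], [0.9, 1.4])
accept, radius = update(rho, 0.5)
print(accept, radius)  # True 1.0
```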

Surrogate Models and the Final Algorithm
Until now, we have not discussed the actual choice of surrogate models used for m^(t). As shown in Section 5, the models should be twice continuously differentiable with uniformly bounded Hessians. To prove convergence of our algorithm, we have to impose further requirements on the (uniform) approximation qualities of the surrogates m^(t). We can meet these requirements using so-called fully linear models. Moreover, fully linear models intrinsically allow for modifications of the basic trust region method that are aimed at reducing the total number of expensive objective evaluations. Finally, we briefly recapitulate how radial basis functions and multivariate Lagrange polynomials can be made fully linear.

Fully Linear Models
Let us begin with the abstract definition of full linearity as given in [27,34]:

Definition 3. Let ∆_ub > 0 be given and let f : R^n → R be a function that is continuously differentiable in an open domain containing X and has a Lipschitz continuous gradient on X. A set of model functions M = {m : R^n → R} ⊆ C^1(R^n, R) is called a fully linear class of models if the following hold:

1. There are positive constants ε, ε̇ and L_m such that for any given ∆ ∈ (0, ∆_ub) and for any x ∈ X there is a model function m ∈ M with Lipschitz continuous gradient and corresponding Lipschitz constant bounded by L_m, and such that
• the error between the gradient of the model and the gradient of the function satisfies ‖∇f(ξ) − ∇m(ξ)‖ ≤ ε̇ ∆ for all ξ ∈ B(x; ∆), and
• the error between the model and the function satisfies |f(ξ) − m(ξ)| ≤ ε ∆² for all ξ ∈ B(x; ∆).

2. For this class M there exists a "model-improvement" algorithm that, in a finite, uniformly bounded (w.r.t. x and ∆) number of steps, can
• either establish that a given model m ∈ M is fully linear on B(x; ∆),
• or find a model m̃ ∈ M that is fully linear on B(x; ∆).

Remark 2.
In the constrained case, we treat the constraints as hard, that is, we do not allow for evaluations of the true objectives outside X; see the definition of B^(t) ⊆ X in (2). We also ensure to only select training data in X during the construction of surrogate models. In the unconstrained case, the requirements in Definition 3 can be relaxed a bit, at least when using the strict acceptance test with f(x^(T)) ≤ f(x^(t)) for all T ≥ t ≥ 0. We can then restrict ourselves to the sub-level set of the initial iterate.

For the convergence analysis in Section 6, we cite [27, Lemma 10.25] concerning the approximation quality of fully linear models on enlarged trust regions:

Lemma 1. For x ∈ X and ∆ ≤ ∆_ub, consider a function f and a fully linear model m as in Definition 3 with constants ε, ε̇, L_m > 0. Let L_f > 0 be a Lipschitz constant of ∇f. Assume w.l.o.g. that the constants ε, ε̇ are large enough relative to L_f and L_m (cf. [27, Lemma 10.25]). Then m is fully linear on B(x; ∆̃) for any ∆̃ ∈ [∆, ∆_ub] with respect to the same constants ε, ε̇, L_m.
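The two error bounds in Definition 3 can be observed numerically even for the simplest fully linear model, a forward-difference linear interpolant: halving ∆ roughly quarters the function-value error. A small sketch (the test function and sampling choices are illustrative only):

```python
import numpy as np

def linear_model(f, x, delta):
    """Interpolate f at the poised set {x, x + delta*e_i}:
    m(x + h) = c + g . h with forward-difference gradient g."""
    g = np.array([(f(x + delta * e) - f(x)) / delta for e in np.eye(len(x))])
    return f(x), g

def max_value_error(f, x, delta, samples=200):
    """Maximum |f - m| over random points of the inf-norm ball B(x; delta)."""
    c, g = linear_model(f, x, delta)
    rng = np.random.default_rng(0)
    return max(abs(f(x + h) - (c + g @ h))
               for h in rng.uniform(-delta, delta, size=(samples, len(x))))

f = lambda z: z[0] ** 2 + np.sin(z[1])
x = np.array([0.3, 0.7])
e1, e2 = max_value_error(f, x, 0.1), max_value_error(f, x, 0.05)
print(e1 > e2)  # True: the error bound shrinks roughly like delta**2
```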

Algorithm Modifications
With Definition 3 we have formalized our assumption that the surrogates become more accurate when we decrease the trust region radius. This motivates the following modifications:

• "Relaxing" the (finite) surrogate construction process to try for a possible descent even if the surrogates are not fully linear.
• A criticality test depending on ω_m^(t)(x^(t)). If this value is very small at the current iterate, then x^(t) could lie near a Pareto-critical point. With the criticality test and the criticalityRoutine we ensure that the next model is fully linear and the trust region is not too large. This allows for a more accurate criticality measure and descent step calculation.

• A trust region update that also takes ω_m^(t)(x^(t)) into consideration. The radius should be enlarged if we have a large acceptance ratio ρ^(t) and ∆^(t) is small as measured against β ω_m^(t)(x^(t)) for a constant β > 0.

These changes are implemented in Algorithm 1. For more detailed explanations we refer to [27, Ch. 10].
Iterations in which the surrogates m^(t) = [m_1^(t), …, m_k^(t)]^T are not fully linear are treated as model-improving; in these iterations the trust region radius is not changed.
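The criticality test described above shrinks the trust region until its radius is proportionate to the model criticality. A minimal sketch of such a loop (the callback `make_fully_linear` and all parameter defaults are hypothetical placeholders, not the paper's exact routine):

```python
def criticality_routine(x, radius, omega_m, make_fully_linear,
                        alpha=0.5, beta=1.0, max_iter=30):
    """Shrink the trust region until radius <= beta * omega_m(x).
    make_fully_linear(x, radius) rebuilds fully linear surrogates on the
    shrunken region and returns the new model criticality at x."""
    omega = omega_m
    for _ in range(max_iter):
        if radius <= beta * omega:
            break                                # radius small enough: stop
        radius *= alpha                          # shrink the region ...
        omega = make_fully_linear(x, radius)     # ... and rebuild the models
    return radius, omega

# toy illustration: the rebuilt model criticality stabilizes at 0.2
radius, omega = criticality_routine(0.0, 1.0, 0.05, lambda x, r: 0.2)
print(radius, omega)  # 0.125 0.2
```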

Fully Linear Lagrange Polynomials
Quadratic Taylor polynomial models are used very frequently. As explained in [27], we can alternatively use multivariate interpolating Lagrange polynomial models when derivative information is not available. We will consider first and second degree Lagrange models. Even though the latter require O(n²) function evaluations, they are still cheaper than second degree finite-difference models. For this reason, these models are also used in [32,37].
To construct an interpolating polynomial model we have to provide p data sites, where p is the dimension of the space Π_n^d of real-valued n-variate polynomials with degree d. For d = 1 we have p = n + 1, and for d = 2 we have p = (n + 1)(n + 2)/2. If n ≥ 2, the Mairhuber-Curtis theorem [38] applies and the data sites must form a so-called poised set in X. A set Ξ = {ξ_1, …, ξ_p} is poised if the interpolation conditions m(ξ_j) = F(ξ_j) uniquely determine m for all j = 1, …, p and any function F : R^n → R. Given a poised set Ξ, the associated Lagrange basis {l_i} of Π_n^d is defined by l_i(ξ_j) = δ_{i,j}. The model coefficients then simply are the data values, i.e., λ_i = F(ξ_i).
Same as in [37], we implement Algorithm 6.2 from [27] to ensure poisedness. It selects training sites Ξ from the current (slightly enlarged) trust region of radius θ_1 ∆^(t) and calculates the associated Lagrange basis. We can then separately evaluate the true objectives f_ℓ on Ξ to easily build the surrogates m_ℓ^(t), ℓ ∈ {1, …, k}. Our implementation always includes ξ_1 = x^(t) and tries to select points from a database of prior evaluations first.
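For degree-1 interpolation, poisedness and the Lagrange basis reduce to the invertibility of a small monomial matrix. A minimal sketch of that underlying linear algebra (not the pivoting Algorithm 6.2 of [27]):

```python
import numpy as np

def lagrange_basis_linear(Xi):
    """Coefficients of the linear Lagrange basis for sites Xi of shape (n+1, n):
    column i of the result holds (c_i, g_i) with l_i(y) = c_i + g_i . y and
    l_i(xi_j) = delta_ij."""
    p, n = Xi.shape
    assert p == n + 1, "degree-1 interpolation needs n+1 sites"
    M = np.hstack([np.ones((p, 1)), Xi])   # monomial matrix [1, xi_j]
    return np.linalg.inv(M)                # poisedness <=> M is invertible

Xi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C = lagrange_basis_linear(Xi)
evaluate = lambda y: np.hstack([1.0, y]) @ C   # values of all l_i at point y
print(np.round(evaluate(Xi[1]), 8))  # [0. 1. 0.] -- Kronecker delta property
```

The interpolant of data values F is then simply `F @ evaluate(y)`, matching λ_i = F(ξ_i).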
We employ an additional algorithm (Algorithm 6.3 in [27]) to ensure that the set Ξ is even Λ-poised, see [27, Definition 3.6]. The procedure is still finite and ensures that the models are actually fully linear. The quality of the surrogate models can be improved by choosing a small algorithm parameter Λ > 1. Our implementation again tries to recycle points from a database. Different to before, interpolation at x^(t) can no longer be guaranteed. This second step can also be omitted first and then used as a model-improvement step in a subsequent iteration.

Fully Linear Radial Basis Function Models
The main drawback of quadratic Lagrange models is that we still need O(n²) function evaluations in each iteration of Algorithm 1. A possible fix is to use under-determined regression polynomials instead [27,31,39]. Motivated by the findings in [33], we chose so-called Radial Basis Function (RBF) models as an alternative. RBFs are well known for their approximation capabilities on irregular data [38]. In our implementation they have the form

m(x) = Σ_{i=1}^{N} w_i ϕ(‖x − ξ_i‖_2) + π(x), π ∈ Π_n^d,  (4)

where ϕ is a function from R_{≥0} to R. For a fixed ϕ, the mapping ϕ(‖•‖) from R^n to R is radially symmetric with respect to its argument, and the mapping (x, ξ) ↦ ϕ(‖x − ξ‖_2) is called a kernel. Wild et al. [33] describe a construction of RBF surrogate models as in (4) (see also [40] and the dissertation [39] for more details). If we restrict ourselves to functions ϕ that are conditionally positive definite (c.p.d., see [33,38] for the definition) of order at most two, then the surrogates can be made certifiably fully linear with N = n + 1. As before, the algorithm tries to select an initial training set Ξ = {ξ_1, …, ξ_N} ⊂ B(x^(t); θ_1 ∆^(t)) with N = n + 1 and a scaling factor θ_1 ≥ 1. The set must be poised for interpolation with affine linear polynomials. Due to ϕ being c.p.d. of order D ≤ 2, the interpolation system is uniquely solvable for any F : R^n → R if we choose the polynomial tail π ∈ Π_n^d with d ≥ max{0, D − 1}. We can even include more points, N ≥ n + 1, from within a region of maximum radius θ_2 ∆_ub, θ_2 ≥ θ_1 ≥ 1, to capture nonlinear behavior of F. More detailed explanations can be found in [33]. Modifications for box constraints are shown in [39] and [41].
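A surrogate of the form (4) can be fitted by solving the standard saddle-point interpolation system for the kernel weights and the tail coefficients. A sketch using the cubic kernel ϕ(r) = r³, which is c.p.d. of order 2 and hence receives a linear tail (all names are illustrative, not the paper's implementation):

```python
import numpy as np

def rbf_fit(Xi, F, phi=lambda r: r ** 3):
    """Interpolate data (Xi, F) by a cubic-kernel RBF plus a linear tail:
    m(x) = sum_i w_i * phi(||x - xi_i||_2) + lam . [1, x].
    Solves the block system [[Phi, P], [P.T, 0]] [w; lam] = [F; 0]."""
    N, n = Xi.shape
    Phi = phi(np.linalg.norm(Xi[:, None, :] - Xi[None, :, :], axis=-1))
    P = np.hstack([np.ones((N, 1)), Xi])
    A = np.block([[Phi, P], [P.T, np.zeros((n + 1, n + 1))]])
    sol = np.linalg.solve(A, np.concatenate([F, np.zeros(n + 1)]))
    w, lam = sol[:N], sol[N:]
    def m(x):
        r = np.linalg.norm(Xi - x, axis=-1)
        return w @ phi(r) + lam @ np.hstack([1.0, x])
    return m

Xi = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
F = np.array([x ** 2 + y ** 2 for x, y in Xi])
m = rbf_fit(Xi, F)
print(round(float(m(Xi[4])), 6))  # 0.5 -- the surrogate interpolates the data
```

Unique solvability of the block system requires the sites to be poised for linear polynomials, matching the condition d ≥ max{0, D − 1} in the text.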

Table 1 shows the RBFs we are using and the possible polynomial degrees for π. Both the Gaussian and the Multiquadric allow for fine-tuning with a shape parameter α > 0, which can potentially improve the conditioning of the interpolation system. Fig. 1 (b) illustrates the effect of the shape parameter. As can be seen, the radial functions become narrower for larger shape parameters. Hence, we do not only use a constant shape parameter α = 1 like Wild et al. [33] do, but we also use an α that is (within lower and upper bounds) inversely proportional to ∆^(t). Fig. 1 (a) shows the interpolation of a nonlinear function by a surrogate based on the Multiquadric with a linear tail.

Descent Steps
In this section we introduce some possible steps s^(t) to use in Algorithm 1. We begin by defining the best step along the steepest descent direction as given by (Pm). Subsequently, backtracking variants are defined that use a multiobjective variant of Armijo's rule. Along the model steepest descent direction d_m^(t) from (Pm), consider

min_{σ ≥ 0 : x^(t) + σ d_m^(t) ∈ B^(t)} Φ_m^(t)(x^(t) + σ d_m^(t)).  (5)

Let σ^(t) be the minimizer in (5). We call s_m^(t) = σ^(t) d_m^(t) the Pareto-Cauchy step and x^(t) + s_m^(t) the Pareto-Cauchy point.
If we make the following standard assumption, then the Pareto-Cauchy point allows for a lower bound on the improvement in terms of ω_m^(t):

Assumption 2. For all t ∈ N_0, the surrogates m_ℓ^(t) are twice continuously differentiable in an open domain containing X, with Hessians H_ℓ^(t).

Theorem 3. Suppose Assumptions 1 and 2 hold. Then the Pareto-Cauchy step achieves a sufficient decrease

Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s_m^(t)) ≥ (1/2) ω_m^(t)(x^(t)) min{ω_m^(t)(x^(t)) / (c H^(t)), ∆^(t), 1},  (6)

where H^(t) bounds the norms of the model Hessians and the constant c > 0 relates the trust region norm ‖•‖ to the Euclidean norm ‖•‖_2. If ‖•‖ = ‖•‖_∞ is used, then c can be chosen as c = k. The proof for Theorem 3 is provided after the next auxiliary lemma.

Lemma 2. Under Assumptions 1 and 2, let d be a non-increasing direction at x^(t) ∈ R^n for m^(t), i.e., ⟨∇m_ℓ^(t)(x^(t)), d⟩ ≤ 0 for all ℓ = 1, …, k. Let q ∈ {1, …, k} be any objective index and σ̄ ≥ min{∆^(t), 1}. Then the minimizer of m_q^(t) along d achieves a minimum reduction of the type (6), where we use the shorthand notation w = −max_j ⟨∇m_j^(t)(x^(t)), d⟩.

Lemma 2 states that a minimizer along any non-increasing direction d achieves a minimum reduction w.r.t. Φ_m^(t). Similar results can be found in [30] or [32]. But since we do not use polynomial surrogates m^(t), we have to employ the multivariate version of Taylor's theorem to make the proof work. We can do this because, according to Assumption 2, the functions m_q^(t), q ∈ {1, …, k}, are twice continuously differentiable in an open domain containing X. Moreover, Assumption 1 ensures that the function is defined on the line from χ to x. As shown in [42, Ch. 3], a first degree expansion at x ∈ B(χ, ∆) around χ ∈ X then leads to

m_q^(t)(x) = m_q^(t)(χ) + ⟨∇m_q^(t)(χ), x − χ⟩ + (1/2) ⟨x − χ, H_q^(t)(ξ)(x − χ)⟩

for some ξ on the line segment between χ and x.

Proof of Lemma 2. Let the requirements of Lemma 2 hold and let d be a non-increasing direction for m^(t). We use the shorthand w = −max_j ⟨∇m_j^(t)(x^(t)), d⟩ and the Cauchy-Schwarz inequality to bound the reduction from below by a function of σ. The resulting right-hand side is concave, and we can thus easily determine the global maximizer σ*. Similar to [30, Lemma 4.1] we find the claimed bound, where we have additionally used σ̄ ≥ min{∆^(t), 1}.
Proof of Theorem 3. If x^(t) is Pareto-critical for (MOPm), then d_m^(t) = 0 and the inequality holds trivially. Else, let the indices ℓ, q ∈ {1, …, k} be the respective maximizing indices and define σ̄ as in Lemma 2. Then clearly σ̄ ≥ min{∆^(t), ‖d_m^(t)‖}, and for the Pareto-Cauchy point the bound (6) immediately follows from Lemma 2 and ‖d_m^(t)‖ ≤ 1.
Remark 3. Some authors define the Pareto-Cauchy point as the actual minimizer of Φ_m^(t) within the current trust region (instead of the minimizer along the steepest descent direction). For this true minimizer the same bound (6) holds, because it achieves at least the reduction of the Pareto-Cauchy point.

Modified Pareto-Cauchy Point via Backtracking
A common approach in trust region methods is to find an approximate solution of (5) within the current trust region. Usually a backtracking approach similar to Armijo's inexact line search is used for the Pareto-Cauchy subproblem. Doing so, we can still guarantee a sufficient decrease.
Before we actually define the backtracking step along d_m^(t), we derive a more general lemma. It illustrates that backtracking along any suitable direction is well-defined.

Lemma 3. Suppose Assumptions 1 and 2 hold. For x^(t) ∈ R^n, let d be a descent direction for m^(t), let q ∈ {1, …, k} be any objective index and let σ̄ > 0. Then there is an integer j ∈ N_0 such that the backtracking condition (11) holds, where, again, we have used the shorthand notation w = −max_{ℓ=1,…,k} ⟨∇m_ℓ^(t)(x^(t)), d⟩ > 0 and Ψ is either some specific model, Ψ = m_q^(t), or the maximum value, Ψ = Φ_m^(t). Moreover, if we define the step s^(t) = b^j σ̄ d for the smallest j ∈ N_0 satisfying (11), then there is a constant κ_sd^m ∈ (0, 1) such that the sufficient decrease bound (12) holds.

Proof. The first part can be derived from the fact that d is a descent direction, see, e.g., [6]. However, we will use the approach from [30] to also derive the bound (12). With Taylor's theorem we obtain an expansion (13) of Ψ around x^(t); in the last step we additionally use the Cauchy-Schwarz inequality. For a constructive proof, suppose now that (11) is violated for some j ∈ N_0. Plugging (13) into the left-hand side and subtracting Ψ(x^(t)) then leads to a lower bound on b^j, from which (12) follows for the smallest j satisfying (11).

The modified Pareto-Cauchy step is then defined as s_PC^(t) = b^j σ̄ d_m^(t), where again σ̄ is as in (10) and j ∈ N_0 is the smallest integer that satisfies (14) for predefined constants a, b ∈ (0, 1).
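The backtracking rule of this subsection can be sketched as follows, with Ψ chosen as the maximum value Φ_m; the constants a, b and all function names are illustrative placeholders:

```python
import numpy as np

def armijo_backtrack(models, grads, x, d, sigma_bar, a=1e-4, b=0.5, max_iter=50):
    """Multiobjective Armijo rule along a model descent direction d:
    accept the largest step b**j * sigma_bar whose decrease in the max-value
    Phi_m(y) = max_l m_l(y) is at least a * step * w, where
    w = -max_l <grad m_l(x), d> > 0."""
    Phi = lambda y: max(m(y) for m in models)
    w = -max(g @ d for g in grads)
    assert w > 0, "d must be a descent direction for all surrogate models"
    step = sigma_bar
    for _ in range(max_iter):
        if Phi(x) - Phi(x + step * d) >= a * step * w:
            break                 # sufficient decrease reached
        step *= b                 # otherwise shrink the step
    return step * d

models = [lambda y: y[0] ** 2 + y[1] ** 2,
          lambda y: (y[0] - 1) ** 2 + y[1] ** 2]
x = np.array([2.0, 1.0])
grads = [np.array([4.0, 2.0]), np.array([2.0, 2.0])]   # model gradients at x
s = armijo_backtrack(models, grads, x, -np.array([4.0, 2.0]) / 5, 1.0)
print(np.round(x + s, 4))  # [1.2 0.6]
```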
The definition of σ̄ ensures that x^(t) + s_PC^(t) is contained in the current trust region B^(t). Furthermore, these steps provide a sufficient decrease very similar to (6):

Corollary 1. Suppose Assumptions 1 and 2 hold. For the step s_PC^(t) the following statements are true:

1. A j ∈ N_0 as in (14) exists.

2. There is a constant κ_sd^m ∈ (0, 1) such that the modified Pareto-Cauchy step s_PC^(t) satisfies a bound analogous to (6).

Proof. If x^(t) is critical, then the bound is trivial. Otherwise, the existence of a j satisfying (14) follows from Lemma 3 for Ψ = Φ_m^(t).

From Lemma 3 it also follows that the backtracking condition (14) can be modified to explicitly require a decrease in every objective relative to ω_m^(t)(x^(t)).
We define the strict modified Pareto-Cauchy point as x̃_PC^(t) and the corresponding step as s̃_PC^(t) = x̃_PC^(t) − x^(t).

Corollary 2. Suppose Assumptions 1 and 2 hold.

1.
The strict modified Pareto-Cauchy point exists; the backtracking is finite.

2. There is a constant κ_sd^m ∈ (0, 1) such that the minimum decrease min_{ℓ=1,…,k} (m_ℓ^(t)(x^(t)) − m_ℓ^(t)(x^(t) + s̃_PC^(t))) satisfies a bound analogous to (6).

Remark 4. In the preceding subsections, we have shown descent steps along the model steepest descent direction. Similar to the single objective case, we do not necessarily have to use the steepest descent direction, and different step calculation methods are viable. For instance, Thomann and Eichfelder [32] use the well-known Pascoletti-Serafini scalarization to solve the subproblem (MOPm). We refer to their work and Appendix B to see how this method can be related to the steepest descent direction.

Sufficient Decrease for the Original Problem
In the previous subsections, we have shown how to compute steps s^(t) that achieve a sufficient decrease in terms of Φ_m^(t) and ω_m^(t)(•). For a descent step s^(t) the bound is of the form

Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s^(t)) ≥ κ_sd^m ω_m^(t)(x^(t)) min{ω_m^(t)(x^(t)) / (c H_m), ∆^(t), 1},  (16)

and thereby very similar to the bounds for the scalar projected gradient trust region method [35]. By introducing a slightly modified version of ω_m^(t)(•), we can transform (16) into the bound used in [32] and [30].

Lemma 4. If π(t, x^(t)) is a criticality measure for some multiobjective problem, then π̃(t, x^(t)) := min{1, π(t, x^(t))} is also a criticality measure for the same problem.
We next make another standard assumption on the class of surrogate models.

Assumption 3. The norm of all model Hessians is uniformly bounded above on X, i.e., there is a positive constant H_m such that ‖H_ℓ^(t)(x)‖ ≤ H_m for all ℓ, t and x ∈ X. W.l.o.g., we assume H_m ≥ 1.

Remark 5. From this assumption it follows that the model gradients are Lipschitz continuous as well.
Together with Theorem 2, we then know that ω_m^(t)(•) is uniformly continuous. Motivated by the previous remark, we will from now on refer to the following clamped functions:

ω̃(x) := min{ω(x), 1}  and  ω̃_m^(t)(x) := min{ω_m^(t)(x), 1}.  (18)

We can thereby derive the sufficient decrease condition in "standard form":

Corollary 3. Under Assumption 3, suppose that for x^(t) and some descent step s^(t) the bound (16) holds. Then for the criticality measure ω̃_m^(t)(•) we obtain

Φ_m^(t)(x^(t)) − Φ_m^(t)(x^(t) + s^(t)) ≥ κ̃_sd^m ω̃_m^(t)(x^(t)) min{ω̃_m^(t)(x^(t)), ∆^(t)},  (19)

with κ̃_sd^m := κ_sd^m / (c H_m).

Proof. ω̃_m^(t)(•) is a criticality measure due to Assumption 3 and Lemma 4. Further, from (18) and (17) it follows that ω_m^(t)(x^(t)) ≥ ω̃_m^(t)(x^(t)) and min{ω_m^(t)(x^(t))/(c H_m), ∆^(t), 1} ≥ min{ω̃_m^(t)(x^(t)), ∆^(t)} / (c H_m), and if we plug this into (16) we obtain (19).
To relate the RHS of (19) to the criticality ω(•) of the original problem, we require another assumption.

Assumption 4. There is a constant κ_ω > 0 such that ω(x^(t)) ≤ κ_ω ω_m^(t)(x^(t)) for all t ∈ N_0.

This assumption is also made by Thomann and Eichfelder [32] and can easily be justified by using fully linear surrogate models and a bounded trust region radius in combination with a criticality test, see Lemma 7. Assumption 4 can be used to formulate the next two lemmata relating the model criticality and the true criticality. They are proven in Appendix A.2. From these lemmata and Corollary 3, the final result, Corollary 4, easily follows.

Convergence

Preliminary Assumptions and Definitions
To prove convergence of Algorithm 1, we first have to make sure that at least one of the objectives is bounded from below:

Assumption 5. The maximum max_{ℓ=1,…,k} f_ℓ(x) of all objective functions is bounded from below on X.
To be able to use ω̃(•) as a criticality measure and to refer to fully linear models, we further require:

Assumption 6. The objective f : R^n → R^k is continuously differentiable in an open domain containing X and has a Lipschitz continuous gradient on X.
We summarize the assumptions on the surrogates as follows:

Assumption 7. The surrogate model functions m_ℓ^(t), ℓ = 1, …, k, belong to a fully linear class M as defined in Definition 3. For each objective index ℓ ∈ {1, …, k}, the error constants are then denoted by ε_ℓ and ε̇_ℓ.

For the subsequent analysis we define the component-wise maximum constants ε := max_ℓ ε_ℓ and ε̇ := max_ℓ ε̇_ℓ. We also wish for the descent steps to fulfill a sufficient decrease condition for the surrogate criticality measure, as discussed in Section 5.
Assumption 8. For all t ∈ N_0, the descent steps s^(t) are assumed to fulfill both x^(t) + s^(t) ∈ B^(t) and (19).
Finally, to avoid cluttered notation when dealing with subsequences, we define shorthand notations for the iteration-dependent quantities, valid for all t ∈ N_0.

Convergence of Algorithm 1
In the following, we prove convergence of Algorithm 1 to Pareto-critical points. We account for the case that no criticality test is used, i.e., ε_crit = 0. We then require all surrogates to be fully linear in each iteration and need Assumption 4. The proof is an adapted version of the scalar case in [34]. It is also similar to the proofs for the multiobjective algorithms in [30,32]. However, in both cases, no criticality test is employed, there is no distinction between successful and acceptable iterations (ν_+ = ν_++), and interpolation at x^(t) by the surrogates is required. We indicate notable differences when appropriate.
We start with two results concerning the criticality test in Algorithm 1.

Lemma 7. Outside the criticalityRoutine, Assumption 4 is fulfilled if the model m^(t) is fully linear (and if ε_crit > 0).

Proof. Let ℓ, q ∈ {1, …, k} and d_ℓ, d_q ∈ X − x^(t) be solutions of (P1) and (Pm), respectively. If ω(x^(t)) ≥ ω_m^(t)(x^(t)), then, using Cauchy-Schwarz and ‖d_ℓ‖ ≤ 1, the difference ω(x^(t)) − ω_m^(t)(x^(t)) is bounded by ‖∇f_q(x^(t)) − ∇m_q^(t)(x^(t))‖; if instead ω_m^(t)(x^(t)) < ω(x^(t)) is violated, an analogous estimate holds with the roles of the two problems exchanged. Because m^(t) is fully linear, it follows that |ω(x^(t)) − ω_m^(t)(x^(t))| ≤ ε̇ ∆^(t), with ε̇ from (21).

If we just left the criticalityRoutine, then the model is fully linear for ∆^(t) due to Lemma 1 and we have ∆^(t) ≤ β ω_m^(t)(x^(t)). If we otherwise did not enter the criticalityRoutine in the first place, it must hold that ω_m^(t)(x^(t)) ≥ ε_crit, and thus Assumption 4 follows in both cases with a suitable constant κ_ω.

In the subsequent analysis, we mainly require steps with fully linear models to achieve sufficient decrease for the true problem. Due to Lemma 7, we can dispose of Assumption 4 by using the criticality routine:

Assumption 9. Either ε_crit > 0 or Assumption 4 holds.
We have also implicitly shown the following property of the criticality measures.
Corollary 5. If m^(t) is fully linear for f with ċ > 0 as in (21), then |ω(x^(t)) − ω_m^(t)(x^(t))| ≤ ċ Δ^(t).

Lemma 8. If x^(t) is not critical for the true problem (MOP), i.e., ω(x^(t)) ≠ 0, then the criticalityRoutine will terminate after a finite number of iterations.
Proof. At the start of the criticalityRoutine, we know that m^(t) is not fully linear or that Δ^(t) is too large relative to ω_m^(t)(x^(t)). For clarity, we denote the first model by m_0^(t) and define Δ_0 = Δ^(t). We then ensure that the model is made fully linear on Δ_1^(t) = Δ_0 and denote this fully linear model by m_1^(t). If the termination test is passed, then the criticalityRoutine terminates. Otherwise, the process is repeated: the radius is multiplied by α ∈ (0, 1), so that in the j-th iteration we have Δ_j^(t) = α^{j−1} Δ_0, and m_j^(t) is made fully linear on Δ_j^(t). The only way for the criticalityRoutine to loop infinitely is that the termination test fails for all j ∈ N. Because m_j^(t) is fully linear on α^{j−1} Δ_0, we know from Corollary 5 that |ω(x^(t)) − ω_{m_j}^(t)(x^(t))| ≤ ċ α^{j−1} Δ_0. Using the triangle inequality together with (22) gives us a bound on ω(x^(t)) proportional to α^{j−1} Δ_0 for all j. As α ∈ (0, 1), this implies ω(x^(t)) = 0, and x^(t) is hence critical.
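The structure of this loop can be sketched in a few lines. The following Python sketch is illustrative only (the paper's implementation is in Julia); `omega_m`, `make_fully_linear` and all default parameter values are hypothetical stand-ins for the surrogate criticality measure, the model-improvement step and the algorithm's constants α and μ:

```python
def criticality_routine(x, delta0, omega_m, make_fully_linear,
                        alpha=0.5, mu=100.0, max_loops=30):
    """Sketch of a trust-region criticality loop (names/values illustrative).

    omega_m(x, delta): surrogate criticality measure of the model that
    make_fully_linear(x, delta) has certified as fully linear on B(x; delta).
    """
    delta = delta0
    for _ in range(max_loops):
        make_fully_linear(x, delta)          # (re)build the model on B(x; delta)
        if delta <= mu * omega_m(x, delta):  # radius comparable to criticality
            return delta                     # -> leave the routine
        delta *= alpha                       # shrink and repeat
    return delta                             # x is numerically Pareto-critical
```

If `omega_m` stays at zero, the radius contracts geometrically, mirroring the limit Δ_j^(t) = α^{j−1} Δ_0 → 0 used in the proof.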
We next state another auxiliary lemma that we need for the convergence proof.
Lemma 9. Suppose Assumptions 6 and 7 hold. For the iterate x^(t), let s^(t) ∈ R^n be any step with x_+^(t) = x^(t) + s^(t) ∈ B^(t). If m^(t) is fully linear on B^(t), then the corresponding error bound between Φ and Φ_m^(t) holds at x_+^(t).

Proof. The proof follows from the definition of Φ and Φ_m^(t) and the full linearity of m^(t). It can be found in [32, Lemma 4.16].
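For orientation, the fully linear error bounds that drive these estimates have the following standard form; the constants κ_f, κ_g are generic stand-ins for those fixed in the paper's definition of full linearity:

```latex
% For each objective index q and all x in B(x^{(t)}; \Delta^{(t)}):
\bigl| f_q(x) - m_q^{(t)}(x) \bigr| \;\le\; \kappa_f \bigl(\Delta^{(t)}\bigr)^2,
\qquad
\bigl\| \nabla f_q(x) - \nabla m_q^{(t)}(x) \bigr\| \;\le\; \kappa_g \,\Delta^{(t)}.
```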
Convergence of Algorithm 1 is proven by showing that in certain situations the iteration must be acceptable or successful, as defined in Definition 4. This is done indirectly and relies on the next two lemmata. They use the preceding result to show that in a (hypothetical) situation where no Pareto-critical point is approached, the trust region radius must be bounded from below.
Lemma 10. Suppose Assumptions 1, 3 and 6 to 8 hold. If x^(t) is not Pareto-critical for (MOPm), m^(t) is fully linear on B^(t) and Δ^(t) is sufficiently small relative to ω_m^(t)(x^(t)), where λ = max{…, c_H^m} and κ_sd^m is as in (19), then the iteration is successful.

Proof. The proof is very similar to [34, Lemma 5.3] and [32, Lemma 4.17]. In contrast to the latter, we use the surrogate problem and do not require interpolation at x^(t): by definition we have κ_sd^m (1 − ν_++) < 1, and hence it follows from Assumptions 4 and 8 and Corollary 3 that the surrogate decrease dominates the model error. With Assumption 8 we can plug this into (19). Due to Assumption 7 we can then take the definition (3) of ρ^(t) and estimate it from below. Therefore ρ^(t) ≥ ν_++, and the iteration t using step s^(t) is successful.
The same statement can be made for the true problem and ω(·):

Corollary 6. Suppose Assumptions 1, 3 and 6 to 9 hold. If x^(t) is not Pareto-critical for (MOP), m^(t) is fully linear on B^(t) and Δ^(t) is sufficiently small relative to ω(x^(t)), where λ = max{…, c_H^m} and κ_sd is as in (20), then the iteration is successful, that is, t ∈ S and Δ^(t+1) ≥ Δ^(t).
Proof. The proof works exactly the same as for Lemma 10, but due to Assumption 9 we can use Lemma 7 and employ the sufficient decrease condition (20) for ω(·) instead.
As in [34, Lemma 5.4] and [32, Lemma 4.18], it is now easy to show that when no Pareto-critical point of (MOPm) is approached, the trust region radius must be bounded:

Lemma 11. Suppose Assumptions 1, 3 and 6 to 8 hold and that there exists a constant ω_lb^m > 0 such that ω_m^(t)(x^(t)) ≥ ω_lb^m for all t. Then there is a constant Δ_lb > 0 with Δ^(t) ≥ Δ_lb for all t ∈ N_0.

Proof. We first investigate the criticality step and assume ε_crit > ω_lb^m. After we finish the criticality loop, we obtain a radius Δ^(t) with Δ^(t) ≥ min{Δ*, …} for all t. Outside the criticality step, we know from Lemma 10 that whenever Δ^(t) falls below a certain threshold Δ̄, iteration t must be either model-improving or successful; hence Δ^(t+1) ≥ Δ^(t), and the radius cannot decrease until Δ^(k) > Δ̄ for some k > t. Because γ ∈ (0, 1) is the severest possible shrinking factor in Algorithm 1, we therefore know that Δ^(t) can never be actively shrunken to a value below γΔ̄.
Combining both bounds on Δ^(t) yields the claim, where we have again used the fact that Δ*^(t) cannot be reduced further once it is less than or equal to Δ̄, due to the update mechanism in Algorithm 1.
We can now state the first convergence result:

Theorem 4. Suppose that Assumptions 1, 3 and 6 to 8 hold. If Algorithm 1 performs only a finite number 0 ≤ |S| < ∞ of successful iterations S = {t ∈ N_0 : iteration t is successful}, then lim_{t→∞} ω(x^(t)) = 0.

Proof. If the criticality loop runs infinitely, then the result follows from Lemma 8.
Otherwise, let t_0 be any index larger than the last successful index (or t_0 ≥ 0 if S = ∅). All t ≥ t_0 must then be model-improving, acceptable or inacceptable. In all cases, the trust region radius Δ^(t) is never increased. Due to Assumption 7, the number of successive model-improvement steps is bounded above by M ∈ N. Hence, Δ^(t) is decreased by a factor of γ ∈ [γ, γ↓] ⊆ (0, 1) at least once every M iterations, and Δ^(t) must go to zero for t → ∞.
Clearly, for any τ ≥ t_0, the iterates (and trust region centers) x^(τ) and x^(t_0) cannot be further apart than the sum of all subsequent trust region radii. The right-hand side of this bound goes to zero as we let t_0 go to infinity, and so must the norm on the left-hand side, i.e., lim_{t_0→∞} ‖x^(τ) − x^(t_0)‖ = 0. (25)

Now let τ = τ(t_0) ≥ t_0 be the first iteration index so that m^(τ) is fully linear. Splitting ω(x^(t_0)) into three terms via the triangle inequality, we find for t_0 → ∞:

• Because of Assumptions 1 and 6 and Theorem 2, ω(·) is Cauchy-continuous, and with (25) the first term goes to zero.
• Due to Corollary 5, the second term is in O(Δ^(τ)) and goes to zero.
• Suppose the third term does not go to zero as well, i.e., {ω_m^(τ)(x^(τ))} is bounded below by a positive constant. Due to Assumptions 1 and 7, the iterates x^(τ) are then not Pareto-critical for (MOPm), and because of Δ^(τ) → 0 and Lemma 10 there would be a successful iteration, a contradiction. Thus the third term must go to zero as well.

We conclude that the left side, ω(x^(t_0)), goes to zero for t_0 → ∞.
We now address the case of infinitely many successful iterations, first for the surrogate measure ω_m^(t)(·) and then for ω(·). We show that the criticality measures are not bounded away from zero. We start with the observation that in any case the trust region radius converges to zero:

Lemma 12. If Assumptions 1, 3 and 6 to 8 hold, then the sequence of trust region radii generated by Algorithm 1 goes to zero, i.e., lim_{t→∞} Δ^(t) = 0.
Proof. We have shown in the proof of Theorem 4 that this is the case for finitely many successful iterations. Suppose there are infinitely many successful iterations and take any successful index t ∈ S. Then ρ^(t) ≥ ν_++, and from Assumption 8 it follows for x^(t+1) = x^(t) + s^(t) that the decrease Φ(x^(t)) − Φ(x^(t+1)) is bounded below in terms of ω_m^(t)(x^(t)) and Δ^(t). The criticality step ensures that this lower bound is positive. Now the right-hand side has to go to zero: suppose it was bounded below by a positive constant ε > 0. We could then compute a lower bound on the improvement from the first iteration up to iteration t + 1 by summation over S_t = S ∩ {0, …, t}, the set of all successful indices with a maximum index of t. Because S is unbounded, the right side diverges for t → ∞, and so must the left side, in contradiction to Φ being bounded below by Assumption 5. From (26) the claim then follows.

The next result allows us to transfer the result to ω(·).
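The contradiction rests on the usual telescoping-sum argument; schematically, if every successful iteration yielded a decrease of at least some ε > 0, then

```latex
\Phi\bigl(x^{(0)}\bigr) - \Phi\bigl(x^{(t+1)}\bigr)
  \;\ge\; \sum_{\tau \in \mathcal{S}_t}
      \Bigl[ \Phi\bigl(x^{(\tau)}\bigr) - \Phi\bigl(x^{(\tau+1)}\bigr) \Bigr]
  \;\ge\; |\mathcal{S}_t| \, \varepsilon
  \;\xrightarrow{\; t \to \infty \;}\; \infty ,
```

which is impossible for a function that is bounded below (Assumption 5).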
it also holds that lim_{i→∞} ω(x^(t_i)) = 0. (28)

Proof. By (27), ω_m^(t_i)(x^(t_i)) < ε_crit for sufficiently large i. If x^(t_i) is critical for (MOP), then the result follows from Lemma 8. Otherwise, m^(t_i) is fully linear on B(x^(t_i); Δ̃^(t_i)) for some Δ̃^(t_i) ≤ μ ω_m^(t_i)(x^(t_i)). From Corollary 5 it follows that |ω(x^(t_i)) − ω_m^(t_i)(x^(t_i))| ≤ ċ Δ̃^(t_i). The triangle inequality yields a bound on ω(x^(t_i)) for sufficiently large i, and (27) then implies (28).
The next global convergence result immediately follows from Theorem 4 and Lemmas 13 and 14:

Theorem 5. Suppose Assumptions 1, 3 and 5 to 8 hold. Then lim inf_{t→∞} ω(x^(t)) = 0.
This shows that if the iterates are bounded, then there is a subsequence of iterates in R^n approximating a Pareto-critical point. We next show that all limit points of a sequence generated by Algorithm 1 are Pareto-critical.

Theorem 6. Suppose Assumptions 1 and 3 to 8 hold. Then lim_{t→∞} ω(x^(t)) = 0.
Proof. We have already proven the result for finitely many successful iterations, see Theorem 4. We thus suppose that S is unbounded.
For the purpose of establishing a contradiction, suppose that there exists a sequence {t_j}_{j∈N} of indices that are successful or acceptable with

ω(x^(t_j)) ≥ 2ε > 0 for some ε > 0 and all j. (29)

We can ignore model-improving and inacceptable iterations: during those, the iterate does not change, and we find a larger acceptable or successful index with the same criticality value. From Theorem 5 we obtain that for every such t_j there exists a first index τ_j > t_j such that ω(x^(τ_j)) < ε. We thus find another subsequence indexed by {τ_j} such that

ω(x^(t)) ≥ ε for t_j ≤ t < τ_j and ω(x^(τ_j)) < ε. (30)

Using (29) and (30), it also follows from the triangle inequality that

|ω(x^(t_j)) − ω(x^(τ_j))| ≥ ε for all j. (31)

With {t_j} and {τ_j} as in (30), define the following subset of indices: T = {t ∈ N_0 : t_j ≤ t < τ_j for some j}. By (30) we have ω(x^(t)) ≥ ε for t ∈ T, and due to Lemma 14 we also know that ω_m^(t)(x^(t)) cannot go to zero either, i.e., there is some ε_m > 0 such that ω_m^(t)(x^(t)) ≥ ε_m for all sufficiently large t ∈ T. From Lemma 12 we know that Δ^(t) → 0 for t → ∞, so that by Corollary 6 any sufficiently large t ∈ T must be either successful or model-improving (if m^(t) is not fully linear). For t ∈ T ∩ S, it follows from Assumption 8 that the decrease Φ(x^(t)) − Φ(x^(t+1)) is bounded below in terms of ε_m and Δ^(t). If t ∈ T ∩ S is sufficiently large, the bound simplifies to a multiple of Δ^(t). Since the iteration is either successful or model-improving for sufficiently large t ∈ T, and since x^(t) = x^(t+1) for a model-improving iteration, we deduce from the previous inequality that ‖x^(t_j) − x^(τ_j)‖ is bounded by a multiple of Φ(x^(t_j)) − Φ(x^(τ_j)) for j ∈ N sufficiently large. The sequence (Φ(x^(t)))_{t∈N_0} is bounded below (Assumption 5) and monotonically decreasing by construction. Hence, the right-hand side above must converge to zero for j → ∞. This implies lim_{j→∞} ‖x^(t_j) − x^(τ_j)‖ = 0. Because of Assumptions 1 and 6, ω(·) is uniformly continuous, so that then lim_{j→∞} |ω(x^(t_j)) − ω(x^(τ_j))| = 0, which is a contradiction to (31). Thus, no subsequence of acceptable or successful indices as in (29) can exist.

Numerical Examples
In this section we provide some more details on the actual implementation of Algorithm 1 and present the results of various experiments.We compare different surrogate model types with regard to their efficacy (in terms of expensive objective evaluations) and their ability to find Pareto-critical points.

Implementation Details
We implemented the algorithm in the Julia language. The OSQP solver [43] was used to solve (Pm). For non-linear subproblems we used the NLopt.jl package [44]; more specifically, we used the BOBYQA algorithm [45] in conjunction with DynamicPolynomials.jl [46] for the Lagrange polynomials, and the population-based ISRES method [47] for the Pascoletti–Serafini subproblems. The derivatives of cheap objective functions were obtained by means of automatic differentiation [48], and the Taylor polynomial models used FiniteDiff.jl for finite-difference approximations.
In accordance with Algorithm 1, we perform the shrinking trust region update via the rule stated there. Note that for box-constrained problems we internally scale the feasible set to the unit hypercube [0, 1]^n, and all radii are measured with regard to this scaled domain.
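The internal scaling is a plain affine transformation. The following Python sketch is illustrative (the actual implementation is in Julia, and the function names are hypothetical); it shows the forward and inverse maps:

```python
import numpy as np

def to_unit_cube(x, lb, ub):
    """Affinely map a point from the box [lb, ub] onto [0, 1]^n."""
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    return (np.asarray(x, float) - lb) / (ub - lb)

def from_unit_cube(z, lb, ub):
    """Inverse map from [0, 1]^n back to the original box."""
    lb, ub = np.asarray(lb, float), np.asarray(ub, float)
    return lb + np.asarray(z, float) * (ub - lb)
```

Measuring all radii in the scaled domain makes a single Δ meaningful even when the variables differ by orders of magnitude.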
For stopping we use a combination of different criteria:

• We have an upper bound N_it ∈ N on the maximum number of iterations and an upper bound N_exp ∈ N on the number of expensive objective evaluations.
• The surrogate criticality naturally allows for a stopping test, and due to Lemma 11 the trust region radius can also be used (see also [32, Sec. 5]). We combine this with a relative tolerance test.
• At a truly critical point, the criticality loop criticalityRoutine runs infinitely. We stop after a maximum number N_loops ∈ N_0 of iterations. If N_loops equals 0, the algorithm effectively stops for small ω_m^(t) values.
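Combined, these tests can be expressed as a single predicate. This Python sketch is illustrative only; the bound names mirror the text, while all default thresholds are hypothetical:

```python
def should_stop(it, n_evals, omega_m, delta, x_norm,
                N_it=500, N_exp=2000, abs_tol=1e-6, rel_tol=1e-3):
    """Return True as soon as any of the stopping criteria fires."""
    if it >= N_it or n_evals >= N_exp:       # iteration / evaluation budgets
        return True
    if omega_m <= abs_tol:                   # surrogate criticality test
        return True
    if delta <= rel_tol * max(1.0, x_norm):  # (relative) trust-region radius test
        return True
    return False
```

In practice the radius test acts as a safeguard: by Lemma 11, Δ^(t) staying large implies that no critical point is being approached.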

A First Example
We tested our method on a multitude of academic test problems with a varying number of decision variables n and objective functions k. We were able to approximate Pareto-critical points in both cases, i.e., if we treat the problems as heterogeneous and if we declare all objectives expensive. We benchmarked RBF against polynomial models because in [32] it was shown that a trust region method using second-degree Lagrange polynomials outperforms commercial solvers on scalarized problems. Most often, RBF surrogates outperform the other model types with regard to the number of expensive function evaluations. This is illustrated in Fig. 2, which shows two runs of Algorithm 1 on the non-convex problem (T6), taken from [37]. Fig. 4 illustrates that RBF models perform better not only on average but also overall. With regard to the final solution criticality, there are a few outliers where the method did not converge using RBF models. However, in most cases the solution criticality is acceptable, see Fig. 4 (b). Moreover, Fig. 5 shows that a good percentage of problem instances is solved with RBF models, especially when compared to linear Lagrange polynomial models. Note that in cases where the true objectives are not differentiable at the final iterate, ω was set to 0, because the selected problems are non-differentiable only in Pareto-optimal points. Furthermore, we compared the RBF kernels from Table 1. In [33], the cubic kernel performs best on single-objective problems while the Gaussian does worst. As can be seen in Fig. 6, this holds for multiple objective functions, too. The dark-blue and the light-blue bars show that both the Gaussian and the Multiquadric kernels require more function evaluations, especially in higher dimensions. If, however, we use a very simple adaptive strategy to fine-tune the shape parameter, then both kernels can finish significantly faster; the pink and the gray bars illustrate this. In both cases, the shape parameter was set to α = 20/Δ^(t) in each iteration. Nevertheless, the cubic kernel (orange) appears to be a good choice in general.
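For reference, the kernels compared above can be written down directly. Only the shape-parameter rule α = 20/Δ^(t) is taken from the text; the exact sign and scaling conventions are fixed in Table 1 of the paper and may differ from this illustrative Python sketch:

```python
import numpy as np

def cubic(r, alpha=None):
    """Cubic kernel r^3; it has no shape parameter."""
    return np.asarray(r, float) ** 3

def gaussian(r, alpha):
    """Gaussian kernel exp(-(alpha*r)^2); flattens out for small alpha."""
    return np.exp(-(alpha * np.asarray(r, float)) ** 2)

def multiquadric(r, alpha):
    """(Negated) multiquadric kernel -sqrt(1 + (alpha*r)^2)."""
    return -np.sqrt(1.0 + (alpha * np.asarray(r, float)) ** 2)

def adaptive_alpha(delta):
    """Shape parameter tied to the current trust-region radius (from the text)."""
    return 20.0 / delta
```

Tying α to Δ^(t) keeps the kernels well-scaled as the trust region shrinks, which is one plausible explanation for the speed-up observed in Fig. 6.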

Conclusion
We have developed a trust region framework for heterogeneous and expensive multiobjective optimization problems. It is based on similar work [29–32], and our main contributions are the integration of constraints and of radial basis function surrogates. Consequently, our method is provably convergent both for unconstrained problems and when the feasible set is convex and compact, while requiring significantly fewer expensive function evaluations due to a linear scaling of model complexity with respect to the number of decision variables.
For future work, several modifications and extensions can likely be transferred from the single-objective to the multiobjective case. For example, the trust region update can be made step-size-dependent (rather than depending on ρ^(t) alone) to allow for a more precise model refinement, see [35, Ch. 10]. We have also experimented with the nonlinear CG method [9] for a multiobjective Steihaug–Toint step [35, Ch. 7], and early results look promising.
Going forward, we would like to apply our algorithm to a real-world application, similar to what has been done in [51]. Moreover, it would be desirable to obtain not just one but multiple Pareto-critical solutions. Because the Pascoletti–Serafini scalarization is still compatible with constraints, the iterations can be guided in image space by providing different global utopia vectors. Furthermore, it is straightforward to use RBF surrogates with the heuristic methods from [52] for heterogeneous problems. We believe that it should also be possible to propagate multiple solutions and combine the TRM method with non-dominance testing, as has been done in the bi-objective case [31]. One can think of other globalization strategies as well: RBF models have been used in multiobjective stochastic search algorithms [53], and trust region ideas have been included in population-based strategies [26]. It will thus be interesting to see whether the theoretical convergence properties can be maintained within these contexts by employing a careful trust region management. Finally, re-using the data sampled near the final iterate within a continuation framework like in [54] is a promising next step.
Theorem 2 claims that ω(x) is uniformly continuous, provided the objective gradients are Lipschitz. The implied Cauchy continuity is an important property in the convergence proof of the algorithm.
Proof of Theorem 2. We consider the constrained case only, when X is convex and compact, and show uniform continuity a fortiori by proving that ω(·) is Lipschitz. Let the objective gradients be Lipschitz continuous. Then Df is Lipschitz as well, with constant L > 0. […] ω_m(·) using an active set strategy (see [37]). Consequently, both values are no longer Cauchy-continuous. We can remedy both drawbacks by relating the (possibly constrained) Pascoletti–Serafini trial point to the strict modified Pareto–Cauchy point in our projection framework. To this end, we allow in (A7) and (A8) any feasible set fulfilling Assumption 1. Moreover, we recite the following assumption:

Assumption 10 (Assumption 4.10 in [32]). There is a constant r ∈ (0, 1] so that, if x^(t) is not Pareto-critical, the components r^(t) are bounded accordingly. The assumption can be justified because r^(t) > 0 if x^(t) is not critical and r^(t) can be bounded above and below by expressions involving ω_m^(t)(·); see Remark 3 and [32, Lemma 4.9]. We can then derive the following lemma:

Lemma A1. Suppose Assumptions 1, 2 and 10 hold. Let (τ_+, x_+^(t)) be the solution to (A7). Then there exists a constant κ̃_sd^m ∈ (0, 1) such that the corresponding sufficient decrease bound holds.

Proof. If x^(t) is critical for (MOPm), then τ_+ = 0 and x_+^(t) = x^(t), and the bound is trivial [5]. Otherwise, we can use the same argumentation as in [32, Lemma 4.13].

Figure 1. (a) Interpolation of a nonlinear function (black) by a Multiquadric surrogate based on 5 discrete training points (orange). Dashed lines show the kernels and the polynomial tail. (b) Different kernels in 1D with varying shape parameter (1 or 10), see also Table 1.

5.1. Pareto-Cauchy Step

Both the Pareto-Cauchy point as well as a backtracking variant, the modified Pareto-Cauchy point, are points along the descent direction d_m^(t) within B^(t) such that a sufficient decrease, measured by Φ_m^(t)(·) and ω_m^(t)(·), is achieved. Under mild assumptions we can then derive a decrease in terms of ω(·).

Definition 5. For t ∈ N_0, let d_m^(t) be a minimizer of (Pm). The best attainable trial point x_PC^(t) along d_m^(t) is called the Pareto-Cauchy point.

Definition 6.

…hand side is positive and completely independent of j. Since b ∈ (0, 1), there must be a j* ∈ N_0, j* > j, for which b^{j*} ≤ 2(1 − a) w_d / (σ c_H^m), so that (11) must also be fulfilled for this j*. Analogously to the proof of [30, Lemma 4.2], we can now derive the constant κ_sd^m from (12) as κ_sd^m = min{2b(1 − a), a}.

Lemma 3 applies naturally to the step along d_m^(t): for x^(t) ∈ B^(t), let d_m^(t) be a solution to (Pm) and define the modified Pareto-Cauchy step as s̃_PC^(t) := b^j σ d_m^(t). The lower bound on the decrease follows immediately from σ ≥ min{‖d_m^(t)‖, Δ^(t)}.

Lemma 14. Suppose Assumptions 1, 6 and 7 hold. For any subsequence {t_i}_{i∈N} ⊆ N_0 of iteration indices of Algorithm 1 with lim_{i→∞} ω_m^(t_i)(x^(t_i)) = 0, (27)

Figure 2. Two runs with the maximum number of expensive evaluations set to 20 (soft limit). Test points are light-gray, the iterates are black, the final iterate is red, and white markers show other points where the objectives are evaluated. The successive trust regions are also shown. (a) Using RBF surrogate models, we converge to the optimum using only 12 expensive evaluations. (b) Quadratic Lagrange models do not reach the optimum using 19 evaluations. (c) Iterations and test points in the objective space.

Figure 3. Average number of expensive objective evaluations by number of decision variables n, surrogate type and descent method. LP1 are linear Lagrange models, LP2 quadratic Lagrange models, TP1 are linear Taylor polynomials based on finite differences, and "cubic" refers to cubic RBF models. Steepest descent and Pascoletti–Serafini were tested on scalable problems, and 12 runs were performed per setting.

Figure 4. Box plots of the number of evaluations and the solution criticality for n = 5 and n = 15 for the steepest-descent runs from Fig. 3.

Figure 5. Percentage of solved problem instances, i.e., test runs where the final solution criticality has a value below 0.1. Per model and n-value there were 40 runs.

Figure 6. Influence of an adaptive shape parameter on the performance of RBF models (tested on ZDT3).