Article

Randomized Simplicial Hessian Update

Faculty of Electrical Engineering, University of Ljubljana, Tržaška Cesta 25, SI-1000 Ljubljana, Slovenia
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(15), 1775; https://doi.org/10.3390/math9151775
Submission received: 16 June 2021 / Revised: 14 July 2021 / Accepted: 20 July 2021 / Published: 27 July 2021
(This article belongs to the Special Issue Optimization Theory and Applications)

Abstract
Recently, a derivative-free optimization algorithm was proposed that utilizes a minimum Frobenius norm (MFN) Hessian update for estimating the second derivative information, which in turn is used for accelerating the search. The proposed update formula relies only on computed function values and is a closed-form expression for a special case of a more general approach first published by Powell. This paper analyzes the convergence of the proposed update formula under the assumption that the points from R^n where the function value is known are random. The analysis assumes that the N + 2 points used by the update formula are obtained by adding N + 1 vectors to a central point. The vectors are obtained by transforming a prototype set of N + 1 vectors with a random orthogonal matrix from the Haar measure. The prototype set must positively span an N-dimensional subspace, where N ≤ n. Because the update is random by nature, we can estimate a lower bound on the expected improvement of the approximate Hessian. This lower bound was derived for a special case of the proposed update by Leventhal and Lewis. We generalize their result and show that the amount of improvement greatly depends on N as well as the choice of the vectors in the prototype set. The obtained result is then used for analyzing the performance of the update based on various commonly used prototype sets. One of the results obtained by this analysis states that a regular n-simplex is a bad choice for a prototype set because it does not guarantee any improvement of the approximate Hessian.

1. Introduction

Derivative-free optimization algorithms have attracted much attention due to the fact that in many optimization problems, the evaluation of the gradients of the function subject to optimization and of the constraints is expensive. Such optimization problems can often be formulated as constrained black-box optimization (BBO) [1] problems of the form
$$\min f(\mathbf{x}) \quad \textrm{subject to} \quad c_i(\mathbf{x}) \le 0, \quad i = 1, 2, \ldots, n_C$$
Functions f and c_i are maps from R^n to R. The objective is to minimize f subject to n_C nonlinear constraints defined by the functions c_i. The method for computing f and c_i is treated as a black box, and the gradients are usually not available. Such problems often arise in engineering optimization when simulation is used for obtaining the function values. BBO often relies on models of the function and of the constraints. Various approaches to building black-box models were developed in the past, such as linear [2] and quadratic models [3], radial-basis functions [4], support vector machines [5], neural networks [6], etc.
In this paper, we focus on the quadratic models of f and c i . The most challenging task in building these models is the computation of the Hessian matrix. Instead of using the exact Hessian, the model can utilize an approximate Hessian. The approximation can be improved gradually by applying an update formula based on the function and the gradient values at points visited in the algorithm’s past. As the algorithm converges towards a solution, the approximate Hessian converges to the true Hessian.
For derivative-based optimization, several approaches for updating the approximate Hessian are well studied and tested in practice (e.g., BFGS update, SR1 update [7]). Unfortunately, these approaches rely on the gradient of the function (constraints), which, by assumption, is not available in derivative-free optimization.
Let n denote the dimension of the search space. For derivative-free optimization, a Hessian update formula based on the function values computed at m ≥ n + 2 points visited in the algorithm’s past was proposed by Powell in [8]. The update formula was obtained by minimizing the Frobenius norm of the update applied to the approximate Hessian subject to linear constraints imposed by the function values at the m points in the search space. The paper proposed an efficient way for computing the update and explored some of its properties. The convergence rate of the update formula was not studied.
In a later paper, a simple update formula that uses three collinear points for computing the updated approximate Hessian [9] was examined. The normalized direction along which the three points lie was assumed to be uniformly distributed on the unit sphere. With this assumption, the convergence rate of the update was analyzed and shown to be linear. This update formula was successfully used in a derivative-free algorithm from the family of mesh adaptive direct search algorithms (MADS) [10]. A similar Hessian updating approach was used for speeding up global optimization in [11].
The assumption that the points taking part in an update must be collinear is a significant limitation for the underlying derivative-free algorithm. With this in mind, a new simplicial update formula was proposed in [12]. The formula relies on m ≤ n + 2 points. The reason for choosing the term simplicial Hessian update is the fact that the m − 1 points form a simplex centered around the first point. For m = n + 2, the formula is a special case of the update formula proposed in [8]. By imposing some restrictions on the positions of the m points, the update formula can be used for any m that satisfies 3 ≤ m ≤ n + 2. The case m = 3 corresponds to the update formula proposed in [9].
To illustrate the approach for obtaining the update formula, let us assume that the current quadratic model of function f is given by
$$m(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{B} \mathbf{x} + \hat{\mathbf{g}}^T \mathbf{x} + \hat{c}$$
B is the current approximate Hessian. Let the points where the function value is known be denoted by x_i. For the sake of simplicity, let f_i denote f(x_i). Based on these points, we are looking for an updated model:
$$m_+(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{B}_+ \mathbf{x} + \hat{\mathbf{g}}_+^T \mathbf{x} + \hat{c}_+ .$$
The model must satisfy m constraints
$$m_+(\mathbf{x}_i) = f_i$$
that are linear in ĉ_+ and in the components of B_+ and ĝ_+. Based on these constraints, we are looking for an updated approximate Hessian B_+. Because we have fewer constraints than there are unknowns, we also require that ‖B_+ − B‖_F is minimal (‖·‖_F denotes the Frobenius norm). The update formula we obtain in this way is a minimum Frobenius norm update formula.
For computing the expected improvement of the approximate Hessian, we first assume f itself is quadratic. We also assume the aforementioned m points are obtained by applying a random orthogonal transformation to the m − 1 vectors that form a prototype set and adding the resulting vectors to a central point. As in [9], the convergence rate of the update is linear. The speed with which the approximate Hessian converges to the true Hessian depends on the choice of the prototype set. Our result is a generalization of the result published in [9].
This paper is divided as follows. In Section 2, some basic properties of minimum Frobenius norm updates are explored. The Frobenius product is revisited with the purpose of simplifying the notation, and the update formula is derived. In the next section, uniformly distributed orthogonal matrices are introduced. Some auxiliary results are derived that are later used for computing the expected improvement of the approximate Hessian. Section 4 analyzes the convergence of the proposed update and derives the expected value of the improvement in the sense of the Frobenius norm of the difference between the approximate Hessian and the true Hessian. The expected improvement is computed for several prototype sets. The section is followed by an example demonstrating the convergence of the proposed update and concluding remarks.
Notation. Components of vectors (a) and matrices (A) are denoted by subscripts (i.e., a_i and a_{ij}, respectively). The i-th column of matrix A is denoted by a_i. The unit vectors forming an orthogonal basis for R^n are denoted by e_i. Vectors are assumed to be column vectors, and the inner product of two vectors is written in matrix notation as a^T b. The Frobenius norm and the trace of a matrix are denoted by ‖·‖_F and tr(·), respectively. The expected value of a random variable is denoted by E[·].

2. Obtaining the Update Formula

Let H denote the Hessian of a function. Minimum Frobenius norm (MFN) update formulas replace the current Hessian approximation B with a new (better) approximation B_+ in such a manner that the Frobenius norm of the change (i.e., ‖B_+ − B‖_F) is minimal, subject to constraints imposed on B_+.
The Frobenius norm is a norm induced by the Frobenius (inner) product on the space of n-by-n matrices. The Frobenius product of two matrices is given by
$$\mathbf{A} : \mathbf{B} = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} b_{ij} = \mathrm{tr}(\mathbf{A}^T \mathbf{B}) = \mathrm{tr}(\mathbf{B}^T \mathbf{A}).$$
Using the Frobenius product, one can express the Frobenius norm of matrix A as
$$\|\mathbf{A}\|_F^2 = \mathbf{A} : \mathbf{A}.$$
Quadratic terms can be expressed with the Frobenius product as
$$\mathbf{x}^T \mathbf{A} \mathbf{x} = \mathbf{A} : (\mathbf{x}\mathbf{x}^T).$$
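These identities are easy to verify numerically. The following short check is ours and is not part of the original paper; it only illustrates the identities above with NumPy for random matrices.

```python
# Illustrative numerical check of the Frobenius product identities (not from the paper).
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

frob = lambda X, Y: np.sum(X * Y)                      # A : B, elementwise product summed

assert np.isclose(frob(A, B), np.trace(A.T @ B))               # A : B = tr(A^T B)
assert np.isclose(frob(A, A), np.linalg.norm(A, "fro") ** 2)   # ||A||_F^2 = A : A
assert np.isclose(x @ A @ x, frob(A, np.outer(x, x)))          # x^T A x = A : (x x^T)
```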
The Frobenius product introduces the notion of perpendicularity into the set of matrices (not to be confused with the orthogonality of matrices, which is equivalent to Q^T Q = I).
Definition 1.
Two nonzero matrices A and B are perpendicular (denoted by A ⊥ B) if A : B = 0.
The Frobenius product can also be used for expressing linear constraints. A linear equality constraint on matrix X can be formulated as
$$\mathbf{A} : \mathbf{X} = a.$$
The following Lemma provides motivation for the use of minimum Frobenius norm updating.
Lemma 1.
Let H , B , and B + denote the exact, the current approximate, and the updated approximate Hessian, respectively. Suppose we have m linear equality constraints of the form
$$\mathbf{A}_i : \mathbf{B}_+ = a_i, \quad i = 1, \ldots, m.$$
imposed on B + . Let P denote the subspace spanned by matrices A i . Then, the corresponding MFN update satisfies
  • (B_+ − B) ∈ P, and
  • ‖B_+ − H‖_F ≤ ‖B − H‖_F.
Proof. 
Finding the MFN update is equivalent to minimizing the Frobenius norm of B_+ − B subject to the linear equality constraints (10). These constraints define an affine subspace in the n(n + 1)/2 dimensional space of Hessian matrices, and B_+ is a member of this affine subspace. Because the true Hessian also satisfies constraints (10), it is also a member of the aforementioned affine subspace.
To simplify the problem, we can translate it in such a manner that H becomes 0. When we do this, the linear constraints become homogeneous and, instead of an affine subspace, they now define an ordinary subspace P^⊥. Its orthogonal complement P is spanned by the matrices A_i. Due to the translation, B and B_+ are replaced by B − H and B_+ − H, of which the latter is a member of P^⊥. Points with constant ‖B_+ − B‖_F = ‖(B_+ − H) − (B − H)‖_F lie on a sphere centered at B − H. The matrix B_+ − H that corresponds to the smallest ‖B_+ − B‖_F lies on a sphere centered at B − H that is tangential to the subspace P^⊥. Therefore, B_+ − B must be perpendicular to P^⊥, i.e., B_+ − B ∈ P. This proves the first claim.
Due to B_+ − H ∈ P^⊥, we can see that B_+ − H and B_+ − B are perpendicular. From B − H = (B_+ − H) − (B_+ − B), we have
$$\|\mathbf{B}-\mathbf{H}\|_F^2 = \|\mathbf{B}_+-\mathbf{H}\|_F^2 + \|\mathbf{B}_+-\mathbf{B}\|_F^2$$
The second claim immediately follows from this result. □
Consider a quadratic function
$$q(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{H} \mathbf{x} + \mathbf{g}^T \mathbf{x} + c$$
where H is its Hessian and g its gradient at x = 0 . Let the current and the updated approximation to q ( x ) be given by
$$m(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{B} \mathbf{x} + \hat{\mathbf{g}}^T \mathbf{x} + \hat{c}$$
and
$$m_+(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T \mathbf{B}_+ \mathbf{x} + \hat{\mathbf{g}}_+^T \mathbf{x} + \hat{c}_+ ,$$
respectively. In MFN updating, B_+ is obtained by minimizing ‖B_+ − B‖_F. The following lemma introduces one such update based on the case when the value of q is known at N + 2 points.
Lemma 2.
Let q_0, …, q_{N+1}, where N ≤ n, denote the values of q(x) corresponding to the distinct points x_0, …, x_{N+1}, respectively. Let v_i = x_i − x_0 and assume $\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i = \mathbf{0}$ with at least one α_i ≠ 0. Then the simplicial MFN update satisfying the interpolation conditions m_+(x_i) = q_i for i = 0, 1, …, N + 1 can be computed as
$$\mathbf{B}_+ = \mathbf{B} + \beta \mathbf{A}$$
where
$$\mathbf{A} = \frac{1}{2}\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i \mathbf{v}_i^T ,$$
$$\beta = \frac{\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T (\mathbf{H}-\mathbf{B}) \mathbf{v}_i}{2\|\mathbf{A}\|_F^2} = \frac{\sum_{i=1}^{N+1} \alpha_i \left( 2(q_i - q_0) - \mathbf{v}_i^T \mathbf{B} \mathbf{v}_i \right)}{2\|\mathbf{A}\|_F^2} .$$
Proof. 
By assumption we have
$$q_i = q(\mathbf{x}_i) = \frac{1}{2}\mathbf{x}_i^T \mathbf{H} \mathbf{x}_i + \mathbf{g}^T \mathbf{x}_i + c = \frac{1}{2}(\mathbf{x}_0 + \mathbf{v}_i)^T \mathbf{H} (\mathbf{x}_0 + \mathbf{v}_i) + \mathbf{g}^T (\mathbf{x}_0 + \mathbf{v}_i) + c$$
Due to the interpolation conditions, we have N + 2 constraints
$$q_i = m_+(\mathbf{x}_i) = \frac{1}{2}(\mathbf{x}_0 + \mathbf{v}_i)^T \mathbf{B}_+ (\mathbf{x}_0 + \mathbf{v}_i) + \hat{\mathbf{g}}_+^T (\mathbf{x}_0 + \mathbf{v}_i) + \hat{c}_+$$
By subtraction, we eliminate c ^ + and obtain N + 1 constraints
$$q_i - q_0 = \frac{1}{2}\mathbf{v}_i^T \mathbf{B}_+ \mathbf{v}_i + (\hat{\mathbf{g}}_+ + \mathbf{B}_+ \mathbf{x}_0)^T \mathbf{v}_i , \quad i = 1, \ldots, N+1 .$$
Multiplying (20) with α i and adding the resulting equations yields
$$\frac{1}{2}\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T \mathbf{B}_+ \mathbf{v}_i + (\hat{\mathbf{g}}_+ + \mathbf{B}_+ \mathbf{x}_0)^T \sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i = \sum_{i=1}^{N+1} \alpha_i (q_i - q_0) .$$
By assumption, the second term on the left-hand side of (21) vanishes (thus, ĝ_+ is eliminated). We are left with a single linear constraint on B_+:
$$\frac{1}{2}\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T \mathbf{B}_+ \mathbf{v}_i = \sum_{i=1}^{N+1} \alpha_i (q_i - q_0)$$
which can be rewritten by recalling (8) as
$$\mathbf{A} : \mathbf{B}_+ = \sum_{i=1}^{N+1} \alpha_i (q_i - q_0) ,$$
where
$$\mathbf{A} = \frac{1}{2}\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i \mathbf{v}_i^T .$$
Equation (23) is a linear constraint on the updated Hessian approximation B_+. This is the only constraint on B_+. From Lemma 1, we can see that P is spanned by A. Therefore, we can write
$$\mathbf{B}_+ - \mathbf{B} = \beta \mathbf{A} .$$
By computing the Frobenius product of (25) with A and taking into account (23), we arrive at
$$\sum_{i=1}^{N+1} \alpha_i (q_i - q_0) - \mathbf{A} : \mathbf{B} = \beta\, \mathbf{A} : \mathbf{A} .$$
Now we can compute β :
$$\beta = \frac{\sum_{i=1}^{N+1} \alpha_i (q_i - q_0) - \mathbf{A} : \mathbf{B}}{\mathbf{A} : \mathbf{A}} = \frac{\sum_{i=1}^{N+1} \alpha_i \left( 2(q_i - q_0) - \mathbf{v}_i^T \mathbf{B} \mathbf{v}_i \right)}{2\|\mathbf{A}\|_F^2} .$$
 □
The simplicial update formula introduced by Lemma 2 is the closed-form solution of the equations arising from the MFN update in [8] for N = n. One can see this by comparing the interpolation conditions to those in [8]. Due to the assumption $\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i = \mathbf{0}$, we can also apply it when N < n. The assumption implies that the points x_1, …, x_{N+1} are positioned in a specific manner with respect to x_0 (i.e., there exists a nontrivial linear combination $\sum_{i=1}^{N+1} \alpha_i (\mathbf{x}_i - \mathbf{x}_0) = \mathbf{0}$).
By choosing N = 1, we obtain a special case of the simplicial MFN update, where all three distinct points must be collinear to satisfy $\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i = \mathbf{0}$. Suppose v_1 = −v_2 = v and α_1 = α_2 = 1. Then,
$$\mathbf{A} = \mathbf{v}\mathbf{v}^T ,$$
$$\beta = \frac{q_1 + q_2 - 2q_0 - \mathbf{v}^T \mathbf{B} \mathbf{v}}{\|\mathbf{v}\|^4} = \frac{q_v^{(2)}(\mathbf{x}_0) - \mathbf{v}^T \mathbf{B} \mathbf{v}}{\|\mathbf{v}\|^4}$$
where $q_v^{(2)}(\mathbf{x}_0) = \mathbf{v}^T \mathbf{H} \mathbf{v}$ is the second directional derivative of q along direction v. The convergence properties of this MFN update formula were analyzed in [9]. The formula was used in the derivative-free optimization algorithm proposed in [10].
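To make the update concrete, the sketch below applies Lemma 2 to a quadratic test function. It is an illustrative transcription written for this text; the helper name simplicial_mfn_update, the random quadratic, and the chosen sizes are ours and are not part of the original paper.

```python
# Illustrative sketch of the simplicial MFN update from Lemma 2 (not from the paper).
import numpy as np

def simplicial_mfn_update(B, x0, V, alpha, f):
    """One update of the approximate Hessian B.
    V     : n-by-(N+1) matrix whose columns are v_i = x_i - x_0,
            assumed to satisfy sum_i alpha_i v_i = 0
    alpha : coefficients alpha_i
    f     : callable returning the function value at a point
    """
    q0 = f(x0)
    q = np.array([f(x0 + V[:, i]) for i in range(V.shape[1])])
    A = 0.5 * sum(a * np.outer(v, v) for a, v in zip(alpha, V.T))
    beta = sum(a * (2.0 * (qi - q0) - v @ B @ v)
               for a, qi, v in zip(alpha, q, V.T)) / (2.0 * np.sum(A * A))
    return B + beta * A

# Example: three collinear points (N = 1, v_1 = -v_2 = v) on a random quadratic.
rng = np.random.default_rng(1)
n = 4
H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)
f = lambda x: 0.5 * x @ H @ x
v = rng.standard_normal(n)
B1 = simplicial_mfn_update(np.zeros((n, n)), np.zeros(n),
                           np.column_stack([v, -v]), np.array([1.0, 1.0]), f)
```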

3. Uniformly Distributed Orthogonal Matrices

The notion of a uniform distribution over the group of orthogonal matrices (O_n) can be introduced via the Haar measure [13]. Let A denote a matrix with independent normally distributed elements with zero mean and variance 1. A random orthogonal matrix from the Haar measure (O) can then be obtained with Algorithm 1.
Algorithm 1 Constructing a random orthogonal matrix from the Haar measure.
  •  Perform QR decomposition A = Q R .
  •  Construct a diagonal matrix D with d_ii = 1 if r_ii ≥ 0 and d_ii = −1 otherwise.
  • O = Q D .
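A direct NumPy transcription of Algorithm 1 might look as follows; this is a sketch for illustration, and the helper name haar_orthogonal is ours.

```python
# Sketch of Algorithm 1 (illustrative): Haar-distributed random orthogonal matrix.
import numpy as np

def haar_orthogonal(n, rng=np.random.default_rng()):
    A = rng.standard_normal((n, n))                       # i.i.d. N(0, 1) entries
    Q, R = np.linalg.qr(A)                                # QR decomposition A = QR
    D = np.diag(np.where(np.diag(R) >= 0.0, 1.0, -1.0))   # d_ii = +1 if r_ii >= 0, else -1
    return Q @ D                                          # O = QD
```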
Multiplying O with any unit vector results in a random unit vector that is uniformly distributed on the unit sphere (S_n) [14]. It can be shown that O^T is also a uniformly distributed orthogonal matrix. Consequently, every column and every row of O is a random unit vector with a uniform distribution on S_n.
The results of this section are obtained with the help of the following lemma.
Lemma 3.
Let x ∈ R^n and let dσ denote the surface element of S_n. Then,
$$V_n(r_1, r_2, \ldots, r_n) = \oint_{\|\mathbf{x}\|=1} x_1^{2r_1} x_2^{2r_2} \cdots x_n^{2r_n} \, d\sigma = \frac{2\prod_{i=1}^{n}\Gamma(r_i + 1/2)}{\Gamma\!\left(n/2 + \sum_{i=1}^{n} r_i\right)}$$
Proof. 
See [15], Appendix B. □
From Lemma 3, we can obtain the surface area of S_n by choosing r_1 = ⋯ = r_n = 0:
$$|S_n| = \frac{2\,\Gamma(1/2)^n}{\Gamma(n/2)}$$
Let o_i and o_j denote two random vectors that correspond to the i-th and the j-th column of O. If i ≠ j, then o_i^T o_j = 0. We denote the k-th component of o_i as o_{ki}.
Lemma 4.
Let O be a uniformly distributed random orthogonal matrix. Then,
$$\mathrm{E}\!\left[(\mathbf{e}_k^T \mathbf{o}_i)^2\right] = \mathrm{E}\!\left[o_{ki}^2\right] = \frac{1}{n},$$
$$\mathrm{E}\!\left[(\mathbf{e}_k^T \mathbf{o}_i)^2 (\mathbf{e}_l^T \mathbf{o}_i)^2\right] = \mathrm{E}\!\left[o_{ki}^2 o_{li}^2\right] = \begin{cases} \dfrac{3}{n(n+2)} & k = l \\[4pt] \dfrac{1}{n(n+2)} & k \ne l , \end{cases}$$
$$\mathrm{E}\!\left[(\mathbf{e}_k^T \mathbf{o}_i)^2 (\mathbf{e}_l^T \mathbf{o}_j)^2\right] = \mathrm{E}\!\left[o_{ki}^2 o_{lj}^2\right] = \begin{cases} \dfrac{1}{n(n+2)} & k = l,\ i \ne j \\[4pt] \dfrac{n+1}{(n-1)n(n+2)} & k \ne l,\ i \ne j . \end{cases}$$
Proof. 
For proving (31), we can assume without loss of generality k = i = 1 . Because o 1 = u is uniformly distributed on S n , the expected value of f ( u ) can be obtained by computing the mean value of f ( u ) over S n . We use Lemma 3 for expressing the integral over the surface of S n .
$$\mathrm{E}\!\left[o_{ki}^2\right] = \mathrm{E}\!\left[o_{11}^2\right] = \mathrm{E}\!\left[u_1^2\right] = |S_n|^{-1} \oint_{\|\mathbf{u}\|=1} u_1^2 \, d\sigma = |S_n|^{-1}\, \frac{2\,\Gamma(3/2)\,\Gamma(1/2)^{n-1}}{\Gamma(1 + n/2)} = \frac{1}{n}.$$
Regarding (32) for k = l , we have
$$\mathrm{E}\!\left[o_{ki}^2 o_{li}^2\right] = \mathrm{E}\!\left[o_{ki}^4\right] = |S_n|^{-1} \oint_{\|\mathbf{u}\|=1} u_1^4 \, d\sigma = |S_n|^{-1}\, \frac{2\,\Gamma(5/2)\,\Gamma(1/2)^{n-1}}{\Gamma(2 + n/2)} = \frac{3}{n(n+2)}.$$
For k ≠ l, we assume without loss of generality k = 1, l = 2. From Lemma 3, we have
$$\mathrm{E}\!\left[o_{ki}^2 o_{li}^2\right] = |S_n|^{-1} \oint_{\|\mathbf{u}\|=1} u_1^2 u_2^2 \, d\sigma = |S_n|^{-1}\, \frac{2\,\Gamma(3/2)^2\,\Gamma(1/2)^{n-2}}{\Gamma(2 + n/2)} = \frac{1}{n(n+2)}$$
For (33) with k = l, we can show that it is identical to (32) with k ≠ l. We have
$$\mathbf{e}_k^T \mathbf{o}_i = \mathbf{e}_k^T \mathbf{O} \mathbf{e}_i = \mathbf{e}_i^T \mathbf{O}^T \mathbf{e}_k = \mathbf{e}_i^T \left(\mathbf{O}^T\right)_k$$
where (O^T)_k is the k-th column of O^T. This implies
$$(\mathbf{e}_k^T \mathbf{o}_i)^2 (\mathbf{e}_k^T \mathbf{o}_j)^2 = \left(\mathbf{e}_i^T \left(\mathbf{O}^T\right)_k\right)^2 \left(\mathbf{e}_j^T \left(\mathbf{O}^T\right)_k\right)^2$$
To confirm (33) for k = l, we take into account that O^T is also a random orthogonal matrix from the Haar measure, replace O with O^T in (32), and rename i, k, and l to k, i, and j, respectively.
Finally, to prove (33) for k ≠ l, we can assume without loss of generality i = k = 1 and j = l = 2. The cosine of the angle between o_1 and e_2 can be expressed as o_21 = e_2^T o_1 = cos φ. The random vector o_2 is orthogonal to o_1. Its realizations cover a unit sphere in an (n − 1)-dimensional subspace B orthogonal to o_1. Unit vectors b_1, …, b_{n−1} form an orthogonal basis for this subspace. Note that b_i^T o_1 = 0. The conditional probability density of o_2 is uniform on the aforementioned unit sphere in B. Vector o_2 can be expressed as
$$\mathbf{o}_2 = \sum_{i=1}^{n-1} \eta_i \mathbf{b}_i , \qquad \sum_{i=1}^{n-1} \eta_i^2 = 1 ,$$
where the vector (η_1, …, η_{n−1}) is uniformly distributed on S_{n−1}. Without loss of generality, we can choose the vectors b_i in such a manner that e_2 = o_1 cos φ + b_1 sin φ, where φ is the angle between e_2 and o_1. Now we have
$$o_{22} = \mathbf{e}_2^T \mathbf{o}_2 = \sum_{i=1}^{n-1} \left(\mathbf{o}_1 \cos\varphi + \mathbf{b}_1 \sin\varphi\right)^T \eta_i \mathbf{b}_i = \eta_1 \sin\varphi$$
and
$$o_{11}^2 o_{22}^2 = (\mathbf{e}_1^T \mathbf{o}_1)^2 (\mathbf{e}_2^T \mathbf{o}_2)^2 = o_{11}^2 \eta_1^2 \sin^2\varphi = \eta_1^2 o_{11}^2 \left(1 - o_{21}^2\right).$$
Next, we can express
$$\mathrm{E}\!\left[o_{ki}^2 o_{lj}^2\right] = \mathrm{E}\!\left[\eta_1^2\right] \mathrm{E}\!\left[o_{11}^2 - o_{11}^2 o_{21}^2\right],$$
where the first expected value refers to (η_1, …, η_{n−1}) ∈ S_{n−1} and the second one to o_1 ∈ S_n. Using Lemma 3 and the previously proven (32), we arrive at
$$\mathrm{E}\!\left[\eta_1^2\right] \mathrm{E}\!\left[o_{11}^2 - o_{11}^2 o_{21}^2\right] = \frac{1}{n-1}\left(\frac{1}{n} - \frac{1}{n(n+2)}\right) = \frac{n+1}{(n-1)n(n+2)}$$
 □

4. Convergence of the Proposed Update

Multiplying the vectors in a prototype set D = {d_1, …, d_{N+1}} with a uniformly distributed random orthogonal matrix O results in a set of random vectors V = {v_1, …, v_{N+1}} such that every v_i/‖v_i‖ is uniformly distributed on S_n. The angles between the vectors in a realization of such a set are identical to the angles between the corresponding vectors from the prototype set.
Suppose one is interested in the expected amount of improvement resulting from one application of the update formula from Lemma 2. We assume that the N + 2 points where the function value is computed comprise x_0 and N + 1 additional points generated using a random orthogonal matrix O and a prototype set of vectors {d_1, …, d_{N+1}} in the following manner:
$$\mathbf{x}_i = \mathbf{x}_0 + \mathbf{O}\mathbf{d}_i = \mathbf{x}_0 + \mathbf{v}_i , \quad i = 1, \ldots, N+1 .$$
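In code, one realization of such a point set could be generated as follows. This sketch is ours; it inlines Algorithm 1 rather than relying on a library routine, and the helper name rotated_points is an assumption made for illustration.

```python
# Sketch (illustrative): points x_i = x_0 + O d_i from a prototype set D.
import numpy as np

def rotated_points(x0, D, rng=np.random.default_rng()):
    """Columns of D are the prototype vectors d_i; returns the points and V = O D."""
    n = x0.size
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    O = Q @ np.diag(np.where(np.diag(R) >= 0.0, 1.0, -1.0))   # Algorithm 1
    V = O @ D                                                  # v_i = O d_i
    return x0[:, None] + V, V
```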
First, we prove an auxiliary lemma.
Lemma 5.
Let a, b, and O denote two unit vectors with cos φ = a^T b and a uniformly distributed orthogonal matrix, respectively. Let u = Oa and v = Ob. Then,
$$\mathrm{E}\!\left[u_k^2 v_l^2\right] = \begin{cases} \dfrac{1 + 2\cos^2\varphi}{n(n+2)} & k = l \\[4pt] \dfrac{n + 1 - 2\cos^2\varphi}{(n-1)n(n+2)} & k \ne l \end{cases}$$
Proof. 
Without loss of generality, the coordinate system can be rotated in such a manner that a = e_1 and b = e_1 cos φ + e_2 sin φ. Then, we have
$$\mathbf{u} = \mathbf{o}_1 ,$$
$$\mathbf{v} = \mathbf{o}_1 \cos\varphi + \mathbf{o}_2 \sin\varphi .$$
For k = l , we have
$$\mathrm{E}\!\left[u_k^2 v_k^2\right] = \mathrm{E}\!\left[o_{k1}^2 \left(o_{k1}\cos\varphi + o_{k2}\sin\varphi\right)^2\right] = \mathrm{E}\!\left[o_{k1}^4\right]\cos^2\varphi + \mathrm{E}\!\left[o_{k1}^2 o_{k2}^2\right]\sin^2\varphi + 2\,\mathrm{E}\!\left[o_{k1}^3 o_{k2}\right]\cos\varphi\sin\varphi$$
The last term vanishes because the integral of odd powers of o_{ij} over S_n is zero. By invoking Lemma 4, we arrive at
$$\mathrm{E}\!\left[u_k^2 v_k^2\right] = \frac{3\cos^2\varphi}{n(n+2)} + \frac{\sin^2\varphi}{n(n+2)} = \frac{1 + 2\cos^2\varphi}{n(n+2)}$$
For k ≠ l,
$$\mathrm{E}\!\left[u_k^2 v_l^2\right] = \mathrm{E}\!\left[o_{k1}^2 \left(o_{l1}\cos\varphi + o_{l2}\sin\varphi\right)^2\right] = \mathrm{E}\!\left[o_{k1}^2 o_{l1}^2\right]\cos^2\varphi + \mathrm{E}\!\left[o_{k1}^2 o_{l2}^2\right]\sin^2\varphi + 2\,\mathrm{E}\!\left[o_{k1}^2 o_{l1} o_{l2}\right]\cos\varphi\sin\varphi$$
The last term vanishes due to the odd powers of o_{ij}. Together with Lemma 4, we have
$$\mathrm{E}\!\left[u_k^2 v_l^2\right] = \frac{\cos^2\varphi}{n(n+2)} + \frac{(n+1)\sin^2\varphi}{(n-1)n(n+2)} = \frac{n + 1 - 2\cos^2\varphi}{(n-1)n(n+2)} .$$
 □
Lemma 6.
Let {d_1, …, d_{N+1}} be a prototype set of vectors satisfying $\sum_{i=1}^{N+1} \alpha_i \mathbf{d}_i = \mathbf{0}$, where all α_i ≥ 0 and at least one α_i ≠ 0. Let O be a uniformly distributed random orthogonal matrix, and let v_i = O d_i with ‖d_i‖‖d_j‖ cos φ_ij = d_i^T d_j = v_i^T v_j. Then, the MFN update formula from Lemma 2 involving N + 2 points (x_0 and the additional N + 1 points constructed according to (47)) satisfies
$$\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{B}\|_F^2\right] = \mathrm{E}\!\left[\beta^2 \|\mathbf{A}\|_F^2\right] = (\gamma_1 - \gamma_2)\|\mathbf{B} - \mathbf{H}\|_F^2 + \gamma_2 \left(\mathrm{tr}(\mathbf{B} - \mathbf{H})\right)^2$$
where
$$\gamma_1 = \frac{\mu + 2}{n(n+2)}$$
$$\gamma_2 = \frac{(n+1)\mu - 2}{(n-1)n(n+2)}$$
$$\mu = \frac{\left(\sum_{i=1}^{N+1} \alpha_i \|\mathbf{d}_i\|^2\right)^2}{4\|\mathbf{A}\|_F^2} = \frac{\left(\sum_{i=1}^{N+1} \alpha_i \|\mathbf{d}_i\|^2\right)^2}{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 \cos^2\varphi_{ij}}$$
Proof. 
By repeating the reasoning in the proof of Lemma 2 on (18), we obtain
$$\mathbf{A} : \mathbf{H} = \sum_{i=1}^{N+1} \alpha_i (q_i - q_0)$$
which, together with the expression for β from Lemma 2, yields
$$\beta = \frac{(\mathbf{H} - \mathbf{B}) : \mathbf{A}}{\mathbf{A} : \mathbf{A}} = \frac{\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T (\mathbf{H} - \mathbf{B}) \mathbf{v}_i}{2\|\mathbf{A}\|_F^2} .$$
Now we can express
$$\beta^2 \|\mathbf{A}\|_F^2 = \frac{\left(\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T (\mathbf{B} - \mathbf{H}) \mathbf{v}_i\right)^2}{4\|\mathbf{A}\|_F^2} .$$
Because the directions of the vectors v_i are uniformly distributed on the unit sphere, we can rotate the coordinate system without affecting E[β²‖A‖_F²] so that B − H is diagonalized.
$$\mathrm{E}\!\left[\beta^2 \|\mathbf{A}\|_F^2\right] = \mathrm{E}\!\left[\frac{\left(\sum_{i=1}^{N+1} \alpha_i \mathbf{v}_i^T \mathbf{D} \mathbf{v}_i\right)^2}{4\|\mathbf{A}\|_F^2}\right] .$$
Let v_{ik} denote the k-th component of vector v_i and λ_k the k-th eigenvalue of B − H (the k-th diagonal element of D).
$$\mathrm{E}\!\left[\beta^2 \|\mathbf{A}\|_F^2\right] = \mathrm{E}\!\left[\frac{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j \sum_{k=1}^{n}\sum_{l=1}^{n} v_{ik}^2 v_{jl}^2 \lambda_k \lambda_l}{4\|\mathbf{A}\|_F^2}\right] .$$
We can rewrite (65) as
$$\mathrm{E}\!\left[\beta^2 \|\mathbf{A}\|_F^2\right] = \frac{\sum_{k=1}^{n}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j E_{ikjk} \lambda_k^2 + \sum_{k \ne l}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j E_{ikjl} \lambda_k \lambda_l}{4\|\mathbf{A}\|_F^2}$$
The expected value of v_{ik}² v_{jl}² depends on the angle φ_ij between v_i and v_j. From Lemma 5, we have
$$E_{ikjl} = \mathrm{E}\!\left[v_{ik}^2 v_{jl}^2\right] = \begin{cases} \dfrac{1 + 2\cos^2\varphi_{ij}}{n(n+2)}\, \|\mathbf{v}_i\|^2 \|\mathbf{v}_j\|^2 & k = l \\[4pt] \dfrac{n + 1 - 2\cos^2\varphi_{ij}}{(n-1)n(n+2)}\, \|\mathbf{v}_i\|^2 \|\mathbf{v}_j\|^2 & k \ne l \end{cases}$$
Because the eigenvalues of D are the same as the eigenvalues of B − H, we have
$$\|\mathbf{B} - \mathbf{H}\|_F^2 = \|\mathbf{D}\|_F^2 = \sum_{k=1}^{n} \lambda_k^2 ,$$
$$\left(\mathrm{tr}(\mathbf{B} - \mathbf{H})\right)^2 = \left(\mathrm{tr}\,\mathbf{D}\right)^2 = \left(\sum_{k=1}^{n} \lambda_k\right)^2 = \sum_{k=1}^{n}\sum_{l=1}^{n} \lambda_k \lambda_l$$
and
$$(\gamma_1 - \gamma_2)\|\mathbf{B} - \mathbf{H}\|_F^2 + \gamma_2 \left(\mathrm{tr}(\mathbf{B} - \mathbf{H})\right)^2 = (\gamma_1 - \gamma_2)\sum_{k=1}^{n} \lambda_k^2 + \gamma_2 \sum_{k=1}^{n}\sum_{l=1}^{n} \lambda_k \lambda_l = \gamma_1 \sum_{k=1}^{n} \lambda_k^2 + \gamma_2 \sum_{k \ne l} \lambda_k \lambda_l$$
Note that ‖v_i‖ = ‖d_i‖. Taking into account (66), (67), and (70) yields
$$\gamma_1 = \frac{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 \left(1 + 2\cos^2\varphi_{ij}\right)}{4n(n+2)\|\mathbf{A}\|_F^2} ,$$
$$\gamma_2 = \frac{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1} \alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 \left(n + 1 - 2\cos^2\varphi_{ij}\right)}{4(n-1)n(n+2)\|\mathbf{A}\|_F^2} .$$
The Frobenius norm of A can be expressed as
$$\|\mathbf{A}\|_F^2 = \mathrm{tr}(\mathbf{A}^T \mathbf{A}) = \mathrm{tr}\!\left[\left(\frac{1}{2}\sum_{i=1}^{N+1}\alpha_i \mathbf{v}_i \mathbf{v}_i^T\right)^{\!T}\left(\frac{1}{2}\sum_{j=1}^{N+1}\alpha_j \mathbf{v}_j \mathbf{v}_j^T\right)\right] = \frac{1}{4}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j\, \mathrm{tr}\!\left(\mathbf{v}_i \mathbf{v}_i^T \mathbf{v}_j \mathbf{v}_j^T\right) = \frac{1}{4}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \left(\mathbf{v}_i^T \mathbf{v}_j\right)^2 = \frac{1}{4}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 \cos^2\varphi_{ij}$$
By substituting (73) in (71) and (72), we arrive at
$$\gamma_1 = \frac{8\|\mathbf{A}\|_F^2 + \sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2}{4n(n+2)\|\mathbf{A}\|_F^2} ,$$
$$\gamma_2 = \frac{(n+1)\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 - 8\|\mathbf{A}\|_F^2}{4(n-1)n(n+2)\|\mathbf{A}\|_F^2} .$$
We also have
$$\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \|\mathbf{d}_i\|^2 \|\mathbf{d}_j\|^2 = \left(\sum_{i=1}^{N+1}\alpha_i \|\mathbf{d}_i\|^2\right)^2$$
Substituting (76) into (74) and (75) concludes the proof. □
Theorem 1.
Let γ 1 , γ 2 , and μ be defined as in Lemma 6. Then,
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le \begin{cases} 1 - \dfrac{2(n - \mu)}{(n-1)n(n+2)} & \mu \ge \dfrac{2}{n+1} \\[6pt] 1 - \dfrac{\mu}{n} & \mu < \dfrac{2}{n+1} . \end{cases}$$
Proof. 
We start with the following identity.
$$\mathbf{B}_+ - \mathbf{H} + \mathbf{B} - \mathbf{B}_+ = \mathbf{B} - \mathbf{H} .$$
Computing the Frobenius norm on both sides and considering (B_+ − B) ⊥ (B_+ − H) results in
$$\|\mathbf{B}_+ - \mathbf{H}\|_F^2 + \|\mathbf{B}_+ - \mathbf{B}\|_F^2 = \|\mathbf{B} - \mathbf{H}\|_F^2 .$$
Taking into account (15) results in
$$\|\mathbf{B}_+ - \mathbf{H}\|_F^2 = \|\mathbf{B} - \mathbf{H}\|_F^2 - \beta^2 \|\mathbf{A}\|_F^2 .$$
After Lemma 6 is applied, we have
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} = 1 - \left(\gamma_1 - \gamma_2 + \gamma_2\,\frac{\left(\mathrm{tr}(\mathbf{B} - \mathbf{H})\right)^2}{\|\mathbf{B} - \mathbf{H}\|_F^2}\right)$$
By definition, μ ≥ 0 and γ_1 ≥ 0. For γ_2 ≥ 0, we must have μ ≥ 2/(n + 1). By considering (tr(B − H))²/‖B − H‖_F² ≥ 0, we arrive at
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - (\gamma_1 - \gamma_2) = 1 - \frac{2(n - \mu)}{(n-1)n(n+2)} .$$
For γ_2 < 0, we must have μ < 2/(n + 1). Invoking Lemma A1 yields
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - (\gamma_1 - \gamma_2 + n\gamma_2) = 1 - \left(\gamma_1 + (n-1)\gamma_2\right) = 1 - \frac{\mu}{n} .$$
 □
From Theorem 1, several results can be derived. First, we will assume the prototype set is a regular N-simplex (i.e., comprises N + 1 vectors positively spanning an N-dimensional subspace). This case is interesting because the update formula in [9] is obtained for N = 1 . We are going to show that our estimate of the expected Hessian improvement is identical to the one published in [9]. This update formula (with N = 1 ) was used in an optimization algorithm published in [10].
Next, we are going to show that using a regular n-simplex as the prototype set is a bad choice. According to Theorem 1, no improvement of the Hessian is guaranteed. Even worse, we show that improvement occurs only at the first application of the update formula.
Finally, we will analyze the case where the prototype set is what we refer to as an augmented set of N orthonormal vectors. Such a prototype set with N = n was used in the optimization algorithm published in [14].
Corollary 1.
Let D be a regular N-simplex (N ≤ n). Then,
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - \frac{2(n - N)}{(n-1)n(n+2)}$$
Proof. 
For all i ≠ j, we have cos φ_ij = −N⁻¹ and ‖d_i‖ = 1. Because the sum of all vectors in a regular N-simplex is 0, we conclude α_i = 1 and
$$\mu = \frac{\left(\sum_{i=1}^{N+1}\alpha_i\right)^2}{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \cos^2\varphi_{ij}} = \frac{(N+1)^2}{(N+1)\cdot 1 + N(N+1)\cdot N^{-2}} = N$$
Because 2/(n + 1) ≤ 1 ≤ N for all n ≥ 1, we have μ ≥ 2/(n + 1), and the result follows from Theorem 1. □
Corollary 1 implies that the most efficient approach to MFN updating with a regular simplex in the role of the prototype set of unit vectors is to use a regular 1-simplex (three collinear points).
Corollary 2.
For N = 1 and d_1 = −d_2, the set D is a regular 1-simplex and
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - \frac{2}{n(n+2)} .$$
This result was proven in [9] with a less general approach. Here, we obtain it as a special case of Corollary 1 for N = 1 .
According to Corollary 1, there is no guaranteed improvement of ‖B − H‖_F if a regular n-simplex (N = n) is used in the update process. In fact, the situation is even worse, as we show in the following lemma.
Lemma 7.
If D is a regular n-simplex ( N = n ), then the MFN update from Lemma 2 improves the Hessian approximation only in its first application.
Proof. 
From Lemma 2, we can see that
$$\mathbf{B}_+ = \mathbf{B} + \beta \mathbf{A}$$
where (see (62))
$$\beta = \frac{\sum_{i=1}^{N+1}\alpha_i \mathbf{v}_i^T (\mathbf{H} - \mathbf{B}) \mathbf{v}_i}{2\|\mathbf{A}\|_F^2} .$$
For a regular simplex, α_i = 1 and ‖d_i‖ = 1. The Frobenius norm of A is
$$\|\mathbf{A}\|_F^2 = \frac{1}{4}\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\cos^2\varphi_{ij} = \frac{1}{4}\left(1\cdot(n+1) + \frac{1}{n^2}\cdot n(n+1)\right) = \frac{(n+1)^2}{4n}$$
Due to Lemma A3 (see Appendix A for proof), we have
$$\sum_{i=1}^{N+1}\alpha_i \mathbf{v}_i^T (\mathbf{H} - \mathbf{B}) \mathbf{v}_i = \frac{n+1}{n}\,\mathrm{tr}(\mathbf{H} - \mathbf{B})$$
and
$$\beta = \frac{2}{n+1}\,\mathrm{tr}(\mathbf{H} - \mathbf{B})$$
From the definition of A, we obtain
$$\mathrm{tr}(\mathbf{A}) = \frac{1}{2}\sum_{i=1}^{n+1}\alpha_i\,\mathrm{tr}\!\left(\mathbf{v}_i \mathbf{v}_i^T\right) = \frac{1}{2}\sum_{i=1}^{n+1}\alpha_i \|\mathbf{v}_i\|^2 = \frac{n+1}{2}$$
From B_+ − H = B − H + βA, we can express
$$\mathrm{tr}(\mathbf{B}_+ - \mathbf{H}) = \mathrm{tr}(\mathbf{B} - \mathbf{H}) + \beta\,\mathrm{tr}(\mathbf{A}) = 0 .$$
Let B ++ denote the approximate Hessian after the second application of the update formula.
$$\mathbf{B}_{++} = \mathbf{B}_+ + \beta_+ \mathbf{A}_+$$
Because β_+ = 2 tr(H − B_+)/(n + 1) = 0, we have B_{++} = B_+, and the proof is complete. □
Intuition can mislead one into considering the regular n-simplex as the best choice for positioning n + 1 points around an origin x_0 when computing an MFN update based on Lemma 2. Lemma 7 shows the exact opposite: a regular n-simplex is the worst choice because the update formula does not improve the Hessian approximation in its second and all subsequent applications.
Definition 2.
An augmented set of 1 ≤ N ≤ n orthonormal vectors is a set comprising N mutually orthogonal unit vectors e_1, …, e_N and their normalized negative sum −N^{−1/2}(e_1 + ⋯ + e_N).
Note that an augmented set of N = 1 orthonormal vectors is equivalent to a regular 1-simplex. Now, we have ‖d_i‖ = 1 and cos φ_ii = 1. For i ≠ j, we have cos φ_ij = 0, except for i = N + 1 or j = N + 1, when cos φ_ij = −N^{−1/2}. Because d_{N+1} is the normalized negative sum of the first N vectors, α_1 = ⋯ = α_N = 1 and α_{N+1} = N^{1/2}.
Corollary 3.
If the prototype set of unit vectors is an augmented set of N orthonormal vectors, then
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - \frac{2n - N - N^{1/2}}{(n-1)n(n+2)} .$$
Proof. 
$$\mu = \frac{\left(\sum_{i=1}^{N+1}\alpha_i\right)^2}{\sum_{i=1}^{N+1}\sum_{j=1}^{N+1}\alpha_i \alpha_j \cos^2\varphi_{ij}} = \frac{\left(N\cdot 1 + N^{1/2}\right)^2}{N\cdot 1\cdot 1\cdot 1^2 + 1\cdot N^{1/2}\cdot N^{1/2}\cdot 1^2 + 2N\cdot 1\cdot N^{1/2}\cdot N^{-1}} = \frac{N\left(N^{1/2}+1\right)^2}{2N^{1/2}\left(N^{1/2}+1\right)} = \frac{N + N^{1/2}}{2}$$
Because 2/(n + 1) ≤ 1 for all n ≥ 1, we conclude μ ≥ 2/(n + 1), and the result follows from Theorem 1. □
A special case of Corollary 3 is the following result.
Corollary 4.
If the prototype set of unit vectors is an augmented set of N = n orthonormal vectors, then
$$\frac{\mathrm{E}\!\left[\|\mathbf{B}_+ - \mathbf{H}\|_F^2\right]}{\|\mathbf{B} - \mathbf{H}\|_F^2} \le 1 - \frac{1}{\left(n + n^{1/2}\right)(n+2)} .$$
Corollaries 2 and 4 indicate that for an augmented set of N = n orthonormal vectors (used in [12]), the expected improvement of the approximate Hessian approaches half of the improvement obtained using a regular 1-simplex (introduced in [9]) when n approaches infinity.
Corollaries 1–4 indicate that the update formula yields a greater improvement of the approximate Hessian when the prototype set of vectors exhibits more directionality, in the sense that the vectors are confined to an N-dimensional subspace of the search space with N < n. Lower values of N result in faster convergence.
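The quantities appearing in Theorem 1 can also be evaluated directly from a prototype set, which gives a quick numerical cross-check of Corollaries 1–4. The sketch below is ours and only illustrates the formulas; the helper name theorem1_bound and the example values are assumptions made for illustration.

```python
# Sketch (illustrative): mu and the Theorem 1 bound for a given prototype set.
import numpy as np

def theorem1_bound(D, alpha, n):
    """Columns of D are prototype vectors d_i; alpha satisfies sum_i alpha_i d_i = 0."""
    G = D.T @ D                                              # Gram matrix, g_ij = d_i^T d_j
    mu = np.sum(alpha * np.sum(D * D, axis=0)) ** 2 \
         / np.einsum("i,j,ij->", alpha, alpha, G ** 2)       # mu from Lemma 6
    if mu >= 2.0 / (n + 1):
        return 1.0 - 2.0 * (n - mu) / ((n - 1) * n * (n + 2))
    return 1.0 - mu / n

n = 10
e1 = np.eye(n)[:, 0]
# Regular 1-simplex {v, -v}: bound 1 - 2/(n(n+2)), as in Corollary 2.
print(theorem1_bound(np.column_stack([e1, -e1]), np.array([1.0, 1.0]), n))
```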

5. Example

We illustrate the proposed update with a simple example. The sequence of uniformly distributed orthogonal matrices is generated as in [14]. Three prototype sets are examined: a regular 1-dimensional simplex, a regular (n − 1)-dimensional simplex, and the augmented set of n orthonormal vectors. The true Hessian H is chosen randomly, and the initial Hessian approximation is set to B = 0. The progress of the update is measured by the normalized Frobenius distance between H and B.
Figure 1 depicts the progress of the proposed update with various prototype sets for n = 5 and n = 10 . It is clearly visible that the convergence of the update is linear and depends on the choice of the prototype set. The convergence rate of the update using an augmented set of n orthonormal vectors is approximately half of the convergence rate exhibited by the update using a regular 1-simplex. It can also be seen that the bound on the amount of progress obtained from one update (Theorem 1) is fairly conservative. The actual progress of the update is much better in practice.
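A compact version of this experiment for the regular 1-simplex prototype set (the only case shown here, for brevity) could look as follows. This is our illustrative sketch; the seed, dimension, and iteration count are arbitrary choices and not the exact setup used for Figure 1.

```python
# Sketch (illustrative) of the Section 5 experiment with a regular 1-simplex.
import numpy as np

rng = np.random.default_rng(42)
n, n_updates = 10, 300

H = rng.standard_normal((n, n)); H = 0.5 * (H + H.T)   # random symmetric "true" Hessian
q = lambda x: 0.5 * x @ H @ x                          # quadratic test function
B = np.zeros((n, n))                                   # initial approximation
x0 = np.zeros(n)

for _ in range(n_updates):
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)                             # direction uniform on the unit sphere
    q0, q1, q2 = q(x0), q(x0 + v), q(x0 - v)
    beta = q1 + q2 - 2.0 * q0 - v @ B @ v              # beta for the 1-simplex (||v|| = 1)
    B = B + beta * np.outer(v, v)                      # B_+ = B + beta * v v^T

print(np.linalg.norm(B - H, "fro") / np.linalg.norm(H, "fro"))   # normalized distance to H
```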

6. Discussion

The convergence of a Hessian update formula that requires only function values for computing the update was analyzed. The update formula is based on the formula published in [8], which generally requires the function values at m ≥ n + 2 points. The proposed update is based on the case where m = n + 2. An additional requirement is introduced, namely that the m − 1 vectors from the central point to the remaining m − 1 points must positively span an (m − 2)-dimensional subspace of R^n. This requirement extends the usability of the proposed update to sets of points with 3 ≤ m ≤ n + 2 members. The set of m points used by the update is generated by adding m − 1 vectors to a central point in the set. The vectors are obtained by applying a random orthogonal transformation to a prototype set of vectors that spans an (m − 2)-dimensional subspace of R^n.
A lower bound on the expected improvement of the Hessian approximation was derived (Theorem 1). Up to now, no such result has been published for the update from [8] with m = n + 2. The obtained result was applied to several different prototype sets. The general result obtained for the case when the prototype set is a regular (m − 2)-dimensional simplex (Corollary 1) shows that the expected improvement of the Hessian approximation is greatest for m = 3 (i.e., a 1-dimensional regular simplex) and decreases as the dimensionality of the simplex increases. The special case m = 3 (1-dimensional regular simplex) corresponds to the update from [9]. The lower bound on the expected improvement obtained with our general result (Corollary 2) matches the one published in [9]. For the n-dimensional regular simplex, our result indicates that the lower bound on the expected improvement of the Hessian approximation is 0. Furthermore, it was shown that the Hessian approximation is possibly improved only by the first application of the proposed update formula (Lemma 7). Therefore, the use of the n-dimensional regular simplex in the role of the prototype set is a bad choice.
Next, the expected improvement of the approximate Hessian was derived for a prototype set comprising N ≤ n orthogonal vectors and their normalized negative sum. Such a prototype set with N = n was used in the optimization algorithm published in [12]. It was shown that using this kind of prototype set does guarantee a positive lower bound on the expected improvement of the Hessian approximation (Corollary 4). The general result (Corollary 3), however, again indicates that using a prototype set of lower dimensionality results in faster convergence. The result for N = 1 (two collinear vectors in the role of the prototype set) is the same as the one obtained for the update from [9].
Finally, the results were illustrated by running the proposed update on a quadratic function with a randomly chosen Hessian for several choices of the prototype set. The observed progress was compared to the lower bound predicted by Theorem 1. The results indicate that the lower bound is quite pessimistic, and that the actual progress is faster. The observed performance was closest to the predicted lower bound for the update formula from [9].

Author Contributions

Conceptualization, Á.B. and T.T.; methodology, Á.B. and J.O.; software, J.O.; validation, J.O.; formal analysis, Á.B.; investigation, Á.B.; resources, T.T.; data curation, Á.B.; writing—original draft preparation, Á.B.; writing—review and editing, Á.B. and J.O.; visualization, J.O.; supervision, T.T.; project administration, T.T.; funding acquisition, T.T. All authors have read and agreed to the published version of the manuscript.

Funding

The research was co-funded by the Ministry of Education, Science, and Sport (Ministrstvo za Šolstvo, Znanost in Šport) of the Republic of Slovenia through the program P2-0246 ICT4QoL—Information and Communications Technologies for Quality of Life.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the anonymous referees for their useful comments that helped to improve the paper. Most notably, the authors would like to thank the second referee, whose suggestion led to the simplification of the proof of Lemma 4.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MFN: Minimum Frobenius norm
BFGS: Broyden-Fletcher-Goldfarb-Shanno
SR1: Symmetric rank-one

Appendix A

The following lemma is used in the proof of the main result.
Lemma A1.
Let B be a symmetric n × n matrix. Then,
$$\left(\mathrm{tr}\,\mathbf{B}\right)^2 \le n\|\mathbf{B}\|_F^2$$
Proof. 
Let λ_i ∈ R denote the n eigenvalues of B. We have
$$\left(\mathrm{tr}(\mathbf{B})\right)^2 = \left(\sum_{i=1}^{n}\lambda_i\right)^2 ,$$
$$\|\mathbf{B}\|_F^2 = \sum_{i=1}^{n}\lambda_i^2 = a .$$
The maximum of (tr(B))² can be obtained by finding the maximum of $\left(\sum_{i=1}^{n}\lambda_i\right)^2$ subject to $\sum_{i=1}^{n}\lambda_i^2 = a$. The solution of this problem is
$$|\lambda_i| = (a/n)^{1/2} , \quad i = 1, 2, \ldots, n .$$
Considering $\left|\sum_{i=1}^{n}\lambda_i\right| \le n(a/n)^{1/2}$ along with (A2) and (A3) concludes the proof. □
Let S be an n × (n + 1) matrix whose columns are the vectors comprising a regular simplex in n dimensions. By definition, the following must hold:
$$\mathbf{S}^T\mathbf{S} = \begin{bmatrix} 1 & -\frac{1}{n} & \cdots & -\frac{1}{n} \\ -\frac{1}{n} & 1 & \cdots & -\frac{1}{n} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n} & -\frac{1}{n} & \cdots & 1 \end{bmatrix}_{(n+1)\times(n+1)} = \mathbf{C}$$
Clearly, there are infinitely many possible solutions to (A5). We will assume that S is upper triangular. A solution to (A5) with this property is unique and can be obtained via Cholesky decomposition of the submatrix of C comprising the first n rows and columns which yields the first n columns of S . The last column is then obtained as the negative sum of the first n columns. Matrix S is in row echelon form and represents what we will refer to as the standard regular simplex. Its components can be expressed as
$$s_{ii}^2 = \frac{(n+1)(n-i+1)}{n(n-i+2)}$$
$$s_{ij} = \begin{cases} -\dfrac{s_{ii}}{n-i+1} & j > i \\[4pt] 0 & \text{otherwise} \end{cases}$$
Lemma A2.
Let columns of V represent a regular simplex. Then,
$$\mathbf{V}\mathbf{V}^T = \mathbf{S}\mathbf{S}^T = \frac{n+1}{n}\,\mathbf{I}_{n\times n}$$
Proof. 
Let the columns of S comprise a standard regular simplex. The diagonal elements of S S^T can be obtained as
$$\sum_{i=1}^{n+1} s_{ki}^2 = \sum_{i=k}^{n+1} s_{ki}^2 = s_{kk}^2 + (n-k+1)\,s_{k(k+1)}^2 = s_{kk}^2\cdot\frac{n-k+2}{n-k+1} = \frac{n+1}{n} .$$
Because S S^T is symmetric, we assume k > l for computing the extradiagonal elements:
$$\sum_{i=1}^{n+1} s_{ki} s_{li} = \sum_{i=k}^{n+1} s_{ki} s_{li} = s_{l(l+1)}\sum_{i=k}^{n+1} s_{ki} = s_{l(l+1)}\left(s_{kk} + (n-k+1)\,s_{k(k+1)}\right) = s_{l(l+1)}\left(s_{kk} - (n-k+1)\,\frac{s_{kk}}{n-k+1}\right) = 0$$
This proves S S^T = ((n + 1)/n) I_{n×n}. Any regular simplex V can be expressed with the standard regular simplex as V = Q S, where Q is an orthogonal matrix. Therefore, we have
$$\mathbf{V}\mathbf{V}^T = \mathbf{Q}\mathbf{S}\mathbf{S}^T\mathbf{Q}^T = \frac{n+1}{n}\,\mathbf{Q}\mathbf{I}_{n\times n}\mathbf{Q}^T = \frac{n+1}{n}\,\mathbf{I}_{n\times n}$$
 □
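The construction of the standard regular simplex can be cross-checked numerically. The sketch below is ours and only illustrates the component formulas and the identities S^T S = C and S S^T = ((n + 1)/n) I from Lemma A2; the helper name standard_regular_simplex is an assumption made for illustration.

```python
# Sketch (illustrative): the standard regular simplex S and the checks of Lemma A2.
import numpy as np

def standard_regular_simplex(n):
    S = np.zeros((n, n + 1))
    for i in range(1, n + 1):                         # 1-based row index, as in the text
        sii = np.sqrt((n + 1) * (n - i + 1) / (n * (n - i + 2)))
        S[i - 1, i - 1] = sii
        S[i - 1, i:] = -sii / (n - i + 1)             # s_ij = -s_ii/(n - i + 1) for j > i
    return S

n = 6
S = standard_regular_simplex(n)
C = (1.0 + 1.0 / n) * np.eye(n + 1) - 1.0 / n         # 1 on the diagonal, -1/n elsewhere
assert np.allclose(S.T @ S, C)
assert np.allclose(S @ S.T, (n + 1) / n * np.eye(n))
```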
Lemma A3.
Let columns of V represent a regular simplex, and let H be a symmetric matrix. Then,
$$\sum_{i=1}^{n+1}\mathbf{v}_i^T\mathbf{H}\mathbf{v}_i = \frac{n+1}{n}\,\mathrm{tr}(\mathbf{H})$$
Proof. 
$$\sum_{i=1}^{n+1}\mathbf{v}_i^T\mathbf{H}\mathbf{v}_i = \mathrm{tr}(\mathbf{V}^T\mathbf{H}\mathbf{V}) = \mathbf{H} : \left(\mathbf{V}\mathbf{V}^T\right) = \frac{n+1}{n}\,\mathbf{H} : \mathbf{I} = \frac{n+1}{n}\sum_{i=1}^{n} h_{ii} = \frac{n+1}{n}\,\mathrm{tr}(\mathbf{H})$$
 □

References

  1. Audet, C.; Kokkolaras, M. Blackbox and derivative-free optimization: Theory, algorithms and applications. Optim. Eng. 2016, 17, 1–2. [Google Scholar] [CrossRef] [Green Version]
  2. Powell, M.J.D. A Direct Search Optimization Method That Models the Objective and Constraint Functions by Linear Interpolation. In Advances in Optimization and Numerical Analysis. Mathematics and Its Applications; Gomez, S., Hennart, J.P., Eds.; Springer: Dordrecht, The Netherlands, 1994; Volume 275, pp. 51–67. [Google Scholar]
  3. Powell, M.J.D. UOBYQA: Unconstrained optimization by quadratic approximation. Math. Program. 2002, 92, 555–582. [Google Scholar] [CrossRef]
  4. Buhmann, M.D. Radial Basis Functions: Theory and Implementations, Volume 12 of Cambridge Monographs on Applied and Computational Mathematics; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  5. Suykens, J.A.K. Nonlinear modelling and support vector machines. In Proceedings of the IMTC 2001 18th IEEE Instrumentation and Measurement Technology Conference. Rediscovering Measurement in the Age of Informatics, Budapest, Hungary, 21–23 May 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 287–294. [Google Scholar]
  6. Fang, Z.Y.; Roy, K.; Chen, B.; Sham, C.-W.; Hajirasouliha, I.; Lim, J.B.P. Deep learning-based procedure for structural design of cold-formed steel channel sections with edge-stiffened and un-stiffened holes under axial compression. Thin-Walled Struct. 2021, 166, 108076. [Google Scholar] [CrossRef]
  7. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer: New York, NY, USA, 2006. [Google Scholar]
  8. Powell, M.J.D. Least Frobenius norm updating of quadratic models that satisfy interpolation conditions. Math. Program. 2004, 100, 183–215. [Google Scholar] [CrossRef]
  9. Leventhal, D.; Lewis, A.S. Randomized Hessian estimation and directional search. Optimization 2011, 60, 329–345. [Google Scholar] [CrossRef]
  10. Bűrmen, Á.; Olenšek, J.; Tuma, T. Mesh adaptive direct search with second directional derivative-based Hessian update. Comput. Optim. Appl. 2015, 62, 693–715. [Google Scholar] [CrossRef]
  11. Stich, S.U.; Müller, C.L. On Spectral Invariance of Randomized Hessian and Covariance Matrix Adaptation Schemes. In Proceedings of the Parallel Problem Solving from Nature—PPSN XII: 12th International Conference, Taormina, Italy, 1–5 September 2012; Coello Coello, C.A., Cutello, V., Deb, K., Forrest, S., Nicosia, G., Pavone, M., Eds.; Springer: New York, NY, USA, 2012; pp. 448–457. [Google Scholar]
  12. Bűrmen, Á.; Fajfar, I. Mesh adaptive direct search with simplicial Hessian update. Comput. Optim. Appl. 2019, 74, 645–667. [Google Scholar] [CrossRef]
  13. Stewart, G.W. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM J. Numer. Anal. 1980, 17, 403–409. [Google Scholar] [CrossRef]
  14. Bűrmen, Á.; Tuma, T. Generating Poll Directions for Mesh Adaptive Direct Search with Realizations of a Uniformly Distributed Random Orthogonal Matrix. Pac. J. Optim. 2016, 12, 813–832. [Google Scholar]
  15. Sykora, S. Quantum Theory and the Bayesian Inference Problems. J. Stat. Phys. 1974, 11, 17–27. [Google Scholar] [CrossRef]
Figure 1. Progress of three simplicial updates for n = 5 (left) and n = 10 (right). Dashed lines represent the progress of the update assuming every update application improves the approximate Hessian by the amount predicted in Theorem 1.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
