Randomized Simplicial Hessian Update

Abstract: Recently, a derivative-free optimization algorithm was proposed that utilizes a minimum Frobenius norm (MFN) Hessian update for estimating the second derivative information, which in turn is used for accelerating the search. The proposed update formula relies only on computed function values and is a closed-form expression for a special case of a more general approach first published by Powell. This paper analyzes the convergence of the proposed update formula under the assumption that the points from R^n where the function value is known are random. The analysis assumes that the N + 2 points used by the update formula are obtained by adding N + 1 vectors to a central point. The vectors are obtained by transforming a prototype set of N + 1 vectors with a random orthogonal matrix from the Haar measure. The prototype set must positively span an N ≤ n dimensional subspace. Because the update is random by nature, we can estimate a lower bound on the expected improvement of the approximate Hessian. This lower bound was derived by Leventhal and Lewis for a special case of the proposed update. We generalize their result and show that the amount of improvement greatly depends on N as well as on the choice of the vectors in the prototype set. The obtained result is then used for analyzing the performance of the update based on various commonly used prototype sets. One of the results obtained by this analysis states that a regular n-simplex is a bad choice for a prototype set because it does not guarantee any improvement of the approximate Hessian.


Introduction
Derivative-free optimization algorithms have attracted much attention due to the fact that in many optimization problems, the evaluation of the gradients of the function subject to optimization and of the constraints is expensive. Such optimization problems can often be formulated as constrained black-box optimization (BBO) [1] problems of the form

min f(x) subject to c_i(x) ≤ 0,  i = 1, 2, ..., n_C.   (1)

Functions f and c_i are maps from R^n to R. The objective is to minimize f subject to n_C nonlinear constraints defined by the functions c_i. The method for computing f and c_i is treated as a black box, and the gradients are usually not available. Such problems often arise in engineering optimization when simulation is used for obtaining the function values. BBO often relies on models of the function and of the constraints. Various approaches to building black-box models were developed in the past, such as linear [2] and quadratic models [3], radial basis functions [4], support vector machines [5], neural networks [6], etc.
In this paper, we focus on the quadratic models of f and c i . The most challenging task in building these models is the computation of the Hessian matrix. Instead of using the exact Hessian, the model can utilize an approximate Hessian. The approximation can be improved gradually by applying an update formula based on the function and the gradient values at points visited in the algorithm's past. As the algorithm converges towards a solution, the approximate Hessian converges to the true Hessian.
For derivative-based optimization, several approaches for updating the approximate Hessian are well studied and tested in practice (e.g., BFGS update, SR1 update [7]). Unfortunately, these approaches rely on the gradient of the function (constraints), which, by assumption, is not available in derivative-free optimization.
Let n denote the dimension of the search space. For derivative-free optimization, a Hessian update formula based on the function values computed at m ≥ n + 2 points visited in the algorithm's past was proposed by Powell in [8]. The update formula was obtained by minimizing the Frobenius norm of the update applied to the approximate Hessian subject to linear constraints imposed by the function values at m points in the search space. The paper proposed an efficient way for computing the update and explored some of its properties. The convergence rate of the update formula was not studied.
In a later paper, a simple update formula that uses three collinear points for computing the updated approximate Hessian [9] was examined. The normalized direction along which the three points lie was assumed to be uniformly distributed on the unit sphere. With this assumption, the convergence rate of the update was analyzed and shown to be linear. This update formula was successfully used in a derivative-free algorithm from the family of mesh adaptive direct search algorithms (MADS) [10]. A similar Hessian updating approach was used for speeding up global optimization in [11].
The assumption that the points taking part in an update must be collinear is a significant limitation for the underlying derivative-free algorithm. With this in mind, a new simplicial update formula was proposed in [12]. The formula relies on m ≤ n + 2 points. The reason for choosing the term simplicial Hessian update is the fact that the m − 1 points form a simplex centered around the first point. For m = n + 2, the formula is a special case of the update formula proposed in [8]. By imposing some restrictions on the positions of the m points, the update formula can be used for any m that satisfies 3 ≤ m ≤ n + 2. The case m = 3 corresponds to the update formula proposed in [9].
To illustrate the approach for obtaining the update formula, let us assume that the current quadratic model of function f is given by

m(x) = ĉ + ĝ^T (x − x_0) + (1/2)(x − x_0)^T B (x − x_0),

where B is the current approximate Hessian. Let the points where the function is known be denoted by x_i. For the sake of simplicity, let f_i denote f(x_i). Based on these points, we are looking for an updated model

m^+(x) = ĉ^+ + (ĝ^+)^T (x − x_0) + (1/2)(x − x_0)^T B^+ (x − x_0).

The model must satisfy m constraints m^+(x_i) = f_i that are linear in ĉ^+ and the components of B^+ and ĝ^+. Based on these constraints, we are looking for an updated approximate Hessian B^+. Because we have fewer constraints than there are unknowns, we also require that ‖B^+ − B‖_F is minimal (‖·‖_F denotes the Frobenius norm). The update formula we obtain in this way is a minimum Frobenius norm update formula. For computing the expected improvement of the approximate Hessian, we first assume that f itself is quadratic. We also assume that the aforementioned m points are obtained by applying a random orthogonal transformation to m − 1 vectors that form a prototype set and adding the resulting vectors to a central point. As in [9], the convergence rate of the update is linear.
The speed with which the approximate Hessian converges to the true Hessian depends on the choice of the prototype set. Our result is a generalization of the result published in [9]. This paper is organized as follows. In Section 2, some basic properties of minimum Frobenius norm updates are explored. The Frobenius product is revisited with the purpose of simplifying the notation, and the update formula is derived. In the next section, uniformly distributed orthogonal matrices are introduced, and some auxiliary results are derived that are later used for computing the expected improvement of the approximate Hessian. Section 4 analyzes the convergence of the proposed update and derives the expected value of the improvement in the sense of the Frobenius norm of the difference between the approximate Hessian and the true Hessian. The expected improvement is computed for several prototype sets. The section is followed by an example demonstrating the convergence of the proposed update and by concluding remarks.
Notation. Components of vectors (a) and matrices (A) are denoted by subscripts (i.e., a_i and a_ij, respectively). The i-th column of matrix A is denoted by a_i. The unit vectors forming an orthogonal basis of R^n are denoted by e_i. Vectors are assumed to be column vectors, and the inner product of two vectors is written in matrix notation as a^T b. The Frobenius norm and the trace of a matrix are denoted by ‖·‖_F and tr(·), respectively. The expected value of a random variable is denoted by E[·].

Obtaining the Update Formula
Let H denote the Hessian of a function. Minimum Frobenius norm (MFN) update formulas replace the current Hessian approximation B with a new (better) approximation B^+ in such a manner that the Frobenius norm of the change (i.e., B^+ − B) is minimal, subject to constraints imposed on B^+.
The Frobenius norm is a norm induced by the Frobenius (inner) product on the space of n-by-n matrices. The Frobenius product of two matrices is given by

⟨A, B⟩_F = tr(A^T B) = Σ_{i,j} a_ij b_ij.

Using the Frobenius product, one can express the Frobenius norm of matrix A as

‖A‖_F = ⟨A, A⟩_F^{1/2}.

Quadratic terms can be expressed with the Frobenius product as

x^T A x = ⟨A, x x^T⟩_F.   (8)

The Frobenius product introduces the notion of perpendicularity into the set of matrices (not to be confused with the orthogonality of matrices, which is equivalent to Q^T Q = I).
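The identities above are easy to check numerically. The following minimal numpy sketch (ours, not part of the original text) verifies the elementwise form of the Frobenius product, the induced norm, and the quadratic-term identity (8):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)

def frob(P, Q):
    """Frobenius inner product <P, Q>_F = tr(P^T Q)."""
    return np.trace(P.T @ Q)

assert np.isclose(frob(A, B), np.sum(A * B))            # elementwise form of the product
assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(frob(A, A)))
assert np.isclose(x @ A @ x, frob(A, np.outer(x, x)))   # quadratic term as a Frobenius product
```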

Definition 1. Two nonzero matrices A and B are perpendicular (denoted by A ⊥ B) if and only if ⟨A, B⟩_F = 0.
The Frobenius product can also be used for expressing linear constraints. A linear equality constraint on matrix X can be formulated as

⟨A, X⟩_F = c.

The following lemma provides motivation for the use of minimum Frobenius norm updating.

Lemma 1.
Let H, B, and B^+ denote the exact, the current approximate, and the updated approximate Hessian, respectively. Suppose we have m linear equality constraints of the form

⟨A_i, B^+⟩_F = c_i,  i = 1, 2, ..., m,   (10)

imposed on B^+. Let P^⊥ denote the subspace spanned by the matrices A_i. Then, the corresponding MFN update satisfies

1. B^+ − B ∈ P^⊥, and
2. ‖B^+ − H‖_F ≤ ‖B − H‖_F.

Proof.
Finding the MFN update is equivalent to minimizing the Frobenius norm of B^+ − B subject to the linear equality constraints (10). These constraints define an affine subspace in the n(n + 1)/2-dimensional space of Hessian matrices, and B^+ is a member of this affine subspace. Because the true Hessian also satisfies the constraints (10), it is also a member of the aforementioned affine subspace.
To simplify the problem, we can translate it in such a manner that H becomes 0. When we do this, the linear constraints become homogeneous, and instead of an affine subspace, they now define an ordinary subspace P. Its orthogonal complement P^⊥ is spanned by the matrices A_i. Due to the translation, B and B^+ are replaced by B − H and B^+ − H, of which the latter is a member of P. Points with constant ‖B^+ − B‖_F = ‖B^+ − H − (B − H)‖_F lie on a sphere centered at B − H. The matrix B^+ − H that corresponds to the smallest ‖B^+ − B‖_F lies on the sphere centered at B − H that is tangential to the subspace P. Therefore, B^+ − B must be perpendicular to P, i.e., B^+ − B ∈ P^⊥. This proves the first claim.
Due to B^+ − H ∈ P and B^+ − B ∈ P^⊥, we can see that ⟨B^+ − B, B^+ − H⟩_F = 0, and therefore

‖B − H‖²_F = ‖B^+ − H‖²_F + ‖B^+ − B‖²_F.   (15)

The second claim immediately follows from this result.
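Lemma 1 can be illustrated numerically. The sketch below (our construction, using a single hypothetical rank-one constraint matrix A = vv^T, so that the MFN update is a step along A) confirms the Pythagorean identity (15):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # "true" Hessian
B = np.zeros((n, n))                                 # current approximation
v = rng.standard_normal(n)
A = np.outer(v, v)                                   # single constraint <A, B+>_F = <A, H>_F

beta = np.sum(A * (H - B)) / np.sum(A * A)           # MFN step along A
B_plus = B + beta * A

lhs = np.linalg.norm(B - H, 'fro') ** 2
rhs = np.linalg.norm(B_plus - H, 'fro') ** 2 + np.linalg.norm(B_plus - B, 'fro') ** 2
assert np.isclose(lhs, rhs)                          # identity (15)
```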
Consider a quadratic function

q(x) = c + g^T x + (1/2) x^T H x,

where H is its Hessian and g its gradient at x = 0. Let the current and the updated approximation to q(x) be given by

m(x) = ĉ + ĝ^T (x − x_0) + (1/2)(x − x_0)^T B (x − x_0)

and

m^+(x) = ĉ^+ + (ĝ^+)^T (x − x_0) + (1/2)(x − x_0)^T B^+ (x − x_0),

respectively. In MFN updating, B^+ is obtained by minimizing ‖B^+ − B‖_F. The following lemma introduces one such update based on the case when the value of q is known at N + 2 points.
Lemma 2. Let x_i = x_0 + v_i for i = 1, 2, ..., N + 1, where the vectors v_i satisfy

Σ_{i=1}^{N+1} α_i v_i = 0

with all α_i ≥ 0 and at least one α_i ≠ 0, and let q_i denote q(x_i) for i = 0, 1, ..., N + 1. Then the simplicial MFN update satisfying the interpolation conditions m^+(x_i) = q_i for i = 0, 1, ..., N + 1 can be computed as

B^+ = B + βA,

where

A = (1/2) Σ_{i=1}^{N+1} α_i v_i v_i^T  and  β = (Σ_{i=1}^{N+1} α_i (q_i − q_0) − ⟨A, B⟩_F) / ‖A‖²_F.

Proof. By assumption we have

q_i = q(x_i) = q(x_0 + v_i),  i = 0, 1, ..., N + 1.   (18)

Due to the interpolation conditions, we have N + 2 constraints

m^+(x_i) = q_i,  i = 0, 1, ..., N + 1.   (19)

By subtraction, we eliminate ĉ^+ and obtain N + 1 constraints

(ĝ^+)^T v_i + (1/2) v_i^T B^+ v_i = q_i − q_0,  i = 1, 2, ..., N + 1.   (20)

Multiplying (20) with α_i and adding the resulting equations yields

(1/2) Σ_{i=1}^{N+1} α_i v_i^T B^+ v_i + (ĝ^+)^T Σ_{i=1}^{N+1} α_i v_i = Σ_{i=1}^{N+1} α_i (q_i − q_0).   (21)

By assumption, the second term on the left-hand side of (21) vanishes (thus, ĝ^+ is eliminated). We are left with a single linear constraint on B^+:

(1/2) Σ_{i=1}^{N+1} α_i v_i^T B^+ v_i = Σ_{i=1}^{N+1} α_i (q_i − q_0),   (22)

which can be rewritten by recalling (8) as

⟨A, B^+⟩_F = Σ_{i=1}^{N+1} α_i (q_i − q_0),   (23)

where

A = (1/2) Σ_{i=1}^{N+1} α_i v_i v_i^T.   (24)

Equation (23) is a linear constraint on the updated Hessian approximation B^+. This is the only constraint on B^+. From Lemma 1, we can see that P^⊥ is spanned by A. Therefore, we can write

B^+ = B + βA.   (25)

By computing the Frobenius product of (25) with A and taking into account (23), we arrive at

⟨A, B⟩_F + β ‖A‖²_F = Σ_{i=1}^{N+1} α_i (q_i − q_0).   (26)

Now we can compute β:

β = (Σ_{i=1}^{N+1} α_i (q_i − q_0) − ⟨A, B⟩_F) / ‖A‖²_F.   (27)

The simplicial update formula introduced by Lemma 2 is the closed-form solution of the equations arising from the MFN update in [8] for N = n. One can see this by comparing the interpolation conditions to those in [8]. Due to the assumption Σ_{i=1}^{N+1} α_i v_i = 0, we can also apply it when N < n. The assumption implies that the points x_1, ..., x_{N+1} are positioned in a specific manner with respect to x_0 (i.e., there exists a nontrivial linear combination of the vectors v_i = x_i − x_0 with nonnegative coefficients that equals zero). By choosing N = 1, we obtain a special case of the simplicial MFN update, where all three distinct points must be collinear to satisfy the assumption. In this case, the single remaining constraint reduces to

v^T B^+ v = v^T H v = q_v^(2)(x_0),

where q_v^(2)(x_0) = v^T H v is the second directional derivative of q along direction v. The convergence properties of this MFN update formula were analyzed in [9]. The formula was used in the derivative-free optimization algorithm proposed in [10].
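A direct implementation of Lemma 2 is short. The sketch below (the name mfn_update and the argument layout are ours) follows (24), (23), and (27) and checks on a random quadratic that the update never increases ‖B − H‖_F:

```python
import numpy as np

def mfn_update(B, x0, q0, X, q, alpha):
    """One simplicial MFN update (Lemma 2).

    B     -- current Hessian approximation (n x n, symmetric)
    x0,q0 -- central point and its function value
    X     -- (N+1) x n array whose rows are the points x_i
    q     -- function values q_i at the rows of X
    alpha -- weights with sum_i alpha_i (x_i - x0) = 0
    """
    V = X - x0                                        # vectors v_i
    assert np.allclose(alpha @ V, 0.0)                # assumption of Lemma 2
    A = 0.5 * sum(a * np.outer(v, v) for a, v in zip(alpha, V))   # (24)
    delta = alpha @ (q - q0)                          # right-hand side of (23)
    beta = (delta - np.sum(A * B)) / np.sum(A * A)    # (27)
    return B + beta * A                               # (25)

# Check on a random quadratic q(x) = 0.5 x^T H x.
rng = np.random.default_rng(1)
n = 5
H = rng.standard_normal((n, n)); H = (H + H.T) / 2
f = lambda x: 0.5 * x @ H @ x
x0 = rng.standard_normal(n)
V = rng.standard_normal((n + 1, n)); V[-1] = -V[:-1].sum(axis=0)  # sum of v_i is 0
X = x0 + V
B = np.zeros((n, n))
B_plus = mfn_update(B, x0, f(x0), X, np.array([f(x) for x in X]), np.ones(n + 1))
assert np.linalg.norm(B_plus - H) <= np.linalg.norm(B - H) + 1e-12
```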

Uniformly Distributed Orthogonal Matrices
The notion of a uniform distribution over the group of orthogonal matrices (O_n) can be introduced via the Haar measure [13]. Let A denote a matrix with independent normally distributed elements with zero mean and variance 1. A random orthogonal matrix from the Haar measure (O) can then be obtained with Algorithm 1.
Algorithm 1 Constructing a random orthogonal matrix from the Haar measure.

1. Perform the QR decomposition A = QR.
2. Set O = Q · diag(sign(r_11), sign(r_22), ..., sign(r_nn)).

Multiplying O with any unit vector results in a random unit vector that is uniformly distributed on the unit sphere (S_n) [14]. It can be shown that O^T is also a uniformly distributed orthogonal matrix. Consequently, every column and every row of O is a random unit vector with uniform distribution on S_n.
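A numpy sketch of Algorithm 1 follows. The sign correction of the QR factor makes the factorization unique, which is the standard way to obtain a Haar-distributed orthogonal matrix; we believe this matches the intent of the algorithm:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Random orthogonal matrix from the Haar measure on O_n."""
    A = rng.standard_normal((n, n))          # iid N(0, 1) entries
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))           # scale column j by sign(r_jj)

rng = np.random.default_rng(0)
O = haar_orthogonal(5, rng)
assert np.allclose(O.T @ O, np.eye(5), atol=1e-12)
```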
The results of this section are obtained with the help of the following lemma.
Lemma 3. Let x ∈ R^n and let dσ denote the surface element of S_n. Then,

∫_{S_n} x_1^{r_1} x_2^{r_2} ··· x_n^{r_n} dσ = 2 Γ(b_1) Γ(b_2) ··· Γ(b_n) / Γ(b_1 + b_2 + ... + b_n)

if all exponents r_i are even, where b_i = (r_i + 1)/2; the integral is 0 if any r_i is odd.

Proof. See [15], Appendix B.
From Lemma 3, we can obtain the surface area of S_n by choosing r_1 = ... = r_n = 0, which yields 2π^(n/2)/Γ(n/2).
Let o_i and o_j denote two random vectors that correspond to the i-th and the j-th columns of O, respectively.

Lemma 4.
Let O be a uniformly distributed random orthogonal matrix. Then,

E[o_ik⁴] = 3/(n(n + 2)),   (31)

E[o_ik² o_il²] = 1/(n(n + 2)),  k ≠ l,   (32)

E[o_ik² o_jl²] = (n + 1 − 2δ_kl)/((n − 1)n(n + 2)),  i ≠ j.   (33)

Proof. For proving (31), we can assume without loss of generality k = i = 1. Because o_1 = u is uniformly distributed on S_n, the expected value of f(u) can be obtained by computing the mean value of f(u) over S_n. We use Lemma 3 for expressing the integral over the surface of S_n:

E[u_1⁴] = ∫_{S_n} x_1⁴ dσ / ∫_{S_n} dσ = 3/(n(n + 2)).

For (32), we assume without loss of generality i = 1, k = 1, and l = 2. From Lemma 3, we have

E[u_1² u_2²] = ∫_{S_n} x_1² x_2² dσ / ∫_{S_n} dσ = 1/(n(n + 2)).

To confirm (33) for k = l, we take into account that O^T is also a random orthogonal matrix from the Haar measure, replace O with O^T in (32), and rename i, k, and l to k, i, and j, respectively; this shows that (33) with k = l is identical to (32) with k ≠ l. For (33) with k ≠ l, we assume without loss of generality i = k = 1 and j = l = 2. Conditioned on o_1, the column o_2 is uniformly distributed on the unit sphere of the orthogonal complement of o_1, i.e.,

o_2 = Σ_{i=1}^{n−1} η_i b_i,

where the vectors b_i form an orthonormal basis of the complement and the vector [η_1, ..., η_{n−1}] is uniformly distributed on S_{n−1}. Without loss of generality we can choose the vectors b_i in such a manner that e_2 = o_1 cos φ + b_1 sin φ, where φ is the angle between e_2 and o_1. Now we have

o_22 = e_2^T o_2 = η_1 sin φ

and

E[o_11² o_22²] = E[η_1²] E[o_11² sin² φ],

where the first expected value refers to [η_1, ..., η_{n−1}] ∈ S_{n−1} and the second one to o_1 ∈ S_n. Using Lemma 3 and the previously proven (32), we arrive at

E[o_11² o_22²] = (1/(n − 1)) (E[o_11²] − E[o_11² o_12²]) = (1/(n − 1)) (1/n − 1/(n(n + 2))) = (n + 1)/((n − 1)n(n + 2)),

where we used E[η_1²] = 1/(n − 1), sin² φ = 1 − o_12², and E[o_11²] = 1/n.
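The moment formulas (31)-(33) are easy to verify by Monte Carlo simulation. The sketch below (ours) reuses the haar_orthogonal sampler from the previous code block:

```python
import numpy as np

n, trials = 4, 100_000
rng = np.random.default_rng(2)
m4 = m22_col = m22_offdiag = 0.0
for _ in range(trials):
    O = haar_orthogonal(n, rng)
    m4 += O[0, 0] ** 4                          # (31): E = 3/(n(n+2))
    m22_col += O[0, 0] ** 2 * O[1, 0] ** 2      # (32), same column:    E = 1/(n(n+2))
    m22_offdiag += O[0, 0] ** 2 * O[1, 1] ** 2  # (33), i != j, k != l: E = (n+1)/((n-1)n(n+2))
print(m4 / trials, 3 / (n * (n + 2)))
print(m22_col / trials, 1 / (n * (n + 2)))
print(m22_offdiag / trials, (n + 1) / ((n - 1) * n * (n + 2)))
```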

Convergence of the Proposed Update
Multiplying the vectors in a prototype set D = {d_1, ..., d_{N+1}} with a uniformly distributed random orthogonal matrix O results in a set of random vectors V = {v_1, ..., v_{N+1}} such that every v_i/‖v_i‖ is uniformly distributed on S_n. The angles between the vectors in a realization of such a set are identical to the angles between the corresponding vectors from the prototype set.
Suppose one is interested in the expected amount of improvement resulting from one application of the update formula from Lemma 2. We assume that the N + 2 points where the function value is computed comprise x_0 and additional N + 1 points generated using a random orthogonal matrix O and a prototype set of vectors {d_1, ..., d_{N+1}} in the following manner:

x_i = x_0 + O d_i,  i = 1, 2, ..., N + 1.   (47)
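In code, the construction (47) is a single matrix product. A sketch (assuming the haar_orthogonal sampler from Section 3):

```python
import numpy as np

def sample_points(x0, D, rng):
    """Random points x_i = x0 + O d_i; the rows of D are the prototype vectors d_i."""
    O = haar_orthogonal(len(x0), rng)
    return x0 + D @ O.T                      # row i is x0 + (O d_i)^T
```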
First, we prove an auxiliary lemma.
Lemma 5. Let a, b, and O denote two unit vectors with cos φ = a^T b and a uniformly distributed orthogonal matrix, respectively. Let u = Oa and v = Ob. Then,

E[u_k² v_k²] = (1 + 2 cos² φ)/(n(n + 2))

and, for k ≠ l,

E[u_k² v_l²] = (n + 1 − 2 cos² φ)/((n − 1)n(n + 2)).

Proof. Without loss of generality, the coordinate system can be rotated in such a manner that a = e_1 and b = e_1 cos φ + e_2 sin φ. Then, we have u = o_1 and v = o_1 cos φ + o_2 sin φ. For k = l, we have

u_k² v_k² = o_1k⁴ cos² φ + o_1k² o_2k² sin² φ + 2 o_1k³ o_2k sin φ cos φ.

The last term vanishes in expectation because the integral of odd powers of o_ij over S_n is zero. By invoking Lemma 4, we arrive at

E[u_k² v_k²] = 3 cos² φ/(n(n + 2)) + sin² φ/(n(n + 2)) = (1 + 2 cos² φ)/(n(n + 2)).

For k ≠ l,

u_k² v_l² = o_1k² o_1l² cos² φ + o_1k² o_2l² sin² φ + 2 o_1k² o_1l o_2l sin φ cos φ.

The last term again vanishes in expectation due to odd powers of o_ij. Together with Lemma 4, we have

E[u_k² v_l²] = cos² φ/(n(n + 2)) + (n + 1) sin² φ/((n − 1)n(n + 2)) = (n + 1 − 2 cos² φ)/((n − 1)n(n + 2)).   (56)

Lemma 6. Let D = {d_1, ..., d_{N+1}} be a prototype set of vectors satisfying

Σ_{i=1}^{N+1} α_i d_i = 0,

where all α_i ≥ 0 and at least one α_i ≠ 0. Let O be a uniformly distributed random orthogonal matrix, and let the points x_1, ..., x_{N+1} be constructed according to (47). Then, the MFN update formula from Lemma 2 involving N + 2 points (x_0 and the additional N + 1 points constructed according to (47)) satisfies

E[β² ‖A‖²_F] = γ_1 ‖B − H‖²_F + γ_2 (tr(B − H)² − ‖B − H‖²_F),   (57)

where

γ_1 = (µ + 2)/(n(n + 2)),   (58)

γ_2 = ((n + 1)µ − 2)/((n − 1)n(n + 2)),

µ = tr(A)²/‖A‖²_F = (Σ_i α_i ‖d_i‖²)² / (Σ_{i,j} α_i α_j (d_i^T d_j)²).

Proof. By repeating the reasoning in the proof of Lemma 2 on (18), we obtain

⟨A, H⟩_F = Σ_{i=1}^{N+1} α_i (q_i − q_0),

which yields together with the expression for β from Lemma 2

β = ⟨A, H − B⟩_F / ‖A‖²_F.   (62)

Now we can express

E[β² ‖A‖²_F] = E[⟨A, H − B⟩²_F] / ‖A‖²_F.   (65)

Because the vectors v_i are uniformly distributed on the unit sphere, we can rotate the coordinate system without affecting E[β²‖A‖²_F] so that B − H is diagonalized, i.e., replaced by a diagonal matrix Λ = diag(λ_1, ..., λ_n).
Let v_ik denote the k-th component of vector v_i and λ_k the k-th eigenvalue of B − H (the k-th diagonal element of Λ).
We can rewrite (65) as

E[β² ‖A‖²_F] = (1/(4‖A‖²_F)) Σ_{i,j} α_i α_j Σ_{k,l} λ_k λ_l E[v_ik² v_jl²].

Because the eigenvalues of Λ are the same as the eigenvalues of B − H, we have

Σ_k λ_k² = ‖B − H‖²_F   (66)

and

Σ_{k≠l} λ_k λ_l = tr(B − H)² − ‖B − H‖²_F.   (67)

Note that ‖v_i‖ = ‖d_i‖ and that the angle between v_i and v_j equals the angle φ_ij between d_i and d_j, so the expectations E[v_ik² v_jl²] are given by Lemma 5. Taking into account (66) and (67) yields E[β²‖A‖²_F] expressed through ‖B − H‖²_F and tr(B − H)². The Frobenius norm of A can be expressed as

‖A‖²_F = (1/4) Σ_{i,j} α_i α_j ‖d_i‖² ‖d_j‖² cos² φ_ij.   (73)

By substituting (73) into this expression, we arrive at (57) with γ_1, γ_2, and µ as stated.

Theorem 1. Under the assumptions of Lemma 6, the MFN update formula from Lemma 2 satisfies

E[‖B^+ − H‖²_F] = (1 − γ_1 + γ_2) ‖B − H‖²_F − γ_2 tr(B − H)².

Proof. We start with the following identity:

B^+ − H = (B − H) + βA.
Computing the Frobenius norm on both sides and considering (B^+ − B) ⊥ (B^+ − H), i.e., taking into account (15), results in

‖B^+ − H‖²_F = ‖B − H‖²_F − β² ‖A‖²_F.

After the expectation is taken and Lemma 6 is applied, we have

E[‖B^+ − H‖²_F] = (1 − γ_1 + γ_2) ‖B − H‖²_F − γ_2 tr(B − H)².

By definition, µ ≥ 0 and γ_1 ≥ 0. For γ_2 ≥ 0, we must have µ ≥ 2/(n + 1). By considering tr(B − H)² ≥ 0, we obtain for γ_2 ≥ 0

E[‖B^+ − H‖²_F] ≤ (1 − γ_1 + γ_2) ‖B − H‖²_F.

For γ_2 < 0, we must have µ < 2/(n + 1). Invoking Lemma A1 yields

E[‖B^+ − H‖²_F] ≤ (1 − γ_1 − (n − 1) γ_2) ‖B − H‖²_F.

From Theorem 1, several results can be derived. First, we will assume the prototype set is a regular N-simplex (i.e., comprises N + 1 vectors positively spanning an N-dimensional subspace). This case is interesting because the update formula in [9] is obtained for N = 1. We are going to show that our estimate of the expected Hessian improvement is identical to the one published in [9]. This update formula (with N = 1) was used in an optimization algorithm published in [10].
Next, we are going to show that using a regular n-simplex as the prototype set is a bad choice. According to Theorem 1, no improvement of the Hessian is guaranteed. Even worse, we show that improvement occurs only at the first application of the update formula.
Finally, we will analyze the case where the prototype set is what we refer to as an augmented set of N orthonormal vectors (i.e., N orthonormal vectors together with their normalized negative sum). Such a prototype set with N = n was used in the optimization algorithm published in [12].

Corollary 1. Let D be a regular N-simplex (N ≤ n) comprising unit vectors. Then,

E[‖B^+ − H‖²_F] ≤ (1 − 2(n − N)/((n − 1)n(n + 2))) ‖B − H‖²_F.

Proof. For all i ≠ j, we have cos φ_ij = −N^{−1} and ‖d_i‖ = 1. Because the sum of all vectors in a regular N-simplex is 0, we conclude α_i = 1 and

µ = (N + 1)² / ((N + 1) + N(N + 1)/N²) = N.

Because 2/(n + 1) ≤ 1 ≤ µ for all n ≥ 1, we have µ ≥ 2/(n + 1), and the result follows from Theorem 1.
Corollary 1 implies that the most efficient approach to MFN updating with a regular simplex in the role of the prototype set of unit vectors is to use a regular 1-simplex (three collinear points).

Corollary 2.
For N = 1 and d_1 = −d_2, set D is a regular 1-simplex and

E[‖B^+ − H‖²_F] ≤ (1 − 2/(n(n + 2))) ‖B − H‖²_F.

This result was proven in [9] with a less general approach. Here, we obtain it as a special case of Corollary 1 for N = 1.
According to Corollary 1, there is no guaranteed improvement of B − H F if a regular n-simplex (N = n) is used in the update process. In fact, the situation is even worse as we show in the following Lemma.

Lemma 7.
If D is a regular n-simplex (N = n), then the MFN update from Lemma 2 improves the Hessian approximation only in its first application.
Proof. From Lemma 2, we can see that

B^+ = B + βA,

where (see (62))

β = ⟨A, H − B⟩_F / ‖A‖²_F.

For a regular simplex, α_i = 1 and ‖d_i‖ = 1. The Frobenius norm of A is

‖A‖²_F = (1/4) Σ_{i,j} cos² φ_ij = (1/4)(1 · (n + 1) + (1/n²) · n(n + 1)) = (n + 1)²/(4n).   (89)

Due to Lemma A3 (see Appendix A for the proof), we have

⟨A, H⟩_F = ((n + 1)/(2n)) tr(H)

and

⟨A, B⟩_F = ((n + 1)/(2n)) tr(B).

From the definition of A, we obtain

A = (1/2) Σ_i v_i v_i^T = (1/2) O (Σ_i d_i d_i^T) O^T = ((n + 1)/(2n)) I,

regardless of the random rotation O. Therefore B^+ = B + ((tr(H) − tr(B))/n) I, and in particular tr(B^+) = tr(H). Let B^++ denote the approximate Hessian after the second application of the update formula. Because tr(B^+) = tr(H), the numerator of β vanishes in the second application, so B^++ = B^+. The same holds for every subsequent application, i.e., only the first update can improve the Hessian approximation.
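The degeneracy is easy to observe numerically: with α_i = 1 and unit vectors, A is a fixed multiple of the identity regardless of the random rotation. A sketch (the helper regular_simplex is our construction of a regular n-simplex, not the paper's):

```python
import numpy as np

def regular_simplex(n):
    """Columns: n + 1 unit vectors forming a regular n-simplex in R^n."""
    U = np.eye(n + 1) - np.full((n + 1, n + 1), 1.0 / (n + 1))   # centered corners of e_i
    _, _, Vt = np.linalg.svd(U)                 # rows of Vt[:n] span the complement of (1,...,1)
    return np.sqrt((n + 1) / n) * (Vt[:n] @ U)  # express the columns in R^n, normalize

n = 4
S = regular_simplex(n)
A = 0.5 * S @ S.T                               # A from (24) with alpha_i = 1
assert np.allclose(A, (n + 1) / (2 * n) * np.eye(n))   # a multiple of I, for any rotation O
```

Since A is proportional to the identity, the constraint (23) only pins down tr(B^+), which is exactly why the second and all subsequent updates leave the approximation unchanged.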
Intuition can mislead one into considering the regular n-simplex as the best choice for positioning n + 1 points around an origin x_0 when computing an MFN update based on Lemma 2. Lemma 7 shows the exact opposite: a regular n-simplex is the worst choice, because the update formula does not improve the Hessian approximation in its second and all subsequent applications.

Corollary 3. If the prototype set of unit vectors is an augmented set of N orthonormal vectors, then

E[‖B^+ − H‖²_F] ≤ (1 − (2n − N − √N)/((n − 1)n(n + 2))) ‖B − H‖²_F.
Proof. The prototype set comprises orthonormal vectors d_1, ..., d_N and d_{N+1} = −(1/√N) Σ_{i=1}^{N} d_i. The weights α_1 = ... = α_N = 1 and α_{N+1} = √N satisfy Σ_i α_i d_i = 0. A direct computation gives

µ = (N + √N)²/(2N + 2√N) = (N + √N)/2.

Because µ ≥ 1 ≥ 2/(n + 1), the result follows from Theorem 1.
A special case of Corollary 3 is the following result.

Corollary 4.
If the prototype set of unit vectors is an augmented set of N = n orthonormal vectors, then

E[‖B^+ − H‖²_F] ≤ (1 − √n/((√n + 1)n(n + 2))) ‖B − H‖²_F.

Corollaries 2 and 4 indicate that for an augmented set of N = n orthonormal vectors (used in [12]), the expected improvement of the approximate Hessian approaches half of the improvement obtained using a regular 1-simplex (introduced in [9]) when n approaches infinity.
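For completeness, a sketch of the augmented prototype set (the helper name is ours); the weights α = (1, ..., 1, √N) satisfy the assumption Σ α_i d_i = 0:

```python
import numpy as np

def augmented_orthonormal_set(n, N):
    """Rows: N orthonormal vectors in R^n plus their normalized negative sum (N <= n)."""
    D = np.zeros((N + 1, n))
    D[:N, :N] = np.eye(N)                    # d_1, ..., d_N: orthonormal
    D[N, :N] = -1.0 / np.sqrt(N)             # d_{N+1}: normalized negative sum, unit length
    return D
```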
Corollaries 1-4 indicate that the update formula yields a greater improvement of the approximate Hessian when the prototype set of vectors exhibits more directionality, in the sense that the vectors are confined to an N < n dimensional subspace of the search space. Lower values of N result in faster convergence.

Example
We illustrate the proposed update with a simple example. The sequence of uniformly distributed orthogonal matrices is generated as in [14]. Three prototype sets are examined: a regular 1-dimensional simplex, a regular (n − 1)-dimensional simplex, and the augmented set of n orthonormal vectors. The true Hessian H is chosen randomly, and the initial Hessian approximation is set to B = 0. The progress of the update is measured by the normalized Frobenius distance between H and B. Figure 1 depicts the progress of the proposed update with various prototype sets for n = 5 and n = 10. It is clearly visible that the convergence of the update is linear and depends on the choice of the prototype set. The convergence rate of the update using an augmented set of n orthonormal vectors is approximately half of the convergence rate exhibited by the update using a regular 1-simplex. It can also be seen that the bound on the amount of progress obtained from one update (Theorem 1) is fairly conservative. The actual progress of the update is much better in practice.
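The experiment can be reproduced along the following lines. This sketch reuses mfn_update, haar_orthogonal, sample_points, and augmented_orthonormal_set from the earlier code blocks; the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, steps = 5, 200
H = rng.standard_normal((n, n)); H = (H + H.T) / 2        # random true Hessian
f = lambda x: 0.5 * x @ H @ x
x0 = np.zeros(n)

D = augmented_orthonormal_set(n, n)                       # prototype set
alpha = np.append(np.ones(n), np.sqrt(n))                 # weights with sum_i alpha_i d_i = 0

B = np.zeros((n, n))                                      # initial approximation
for step in range(steps):
    X = sample_points(x0, D, rng)
    B = mfn_update(B, x0, f(x0), X, np.array([f(x) for x in X]), alpha)
    if step % 50 == 0:                                    # normalized Frobenius distance
        print(step, np.linalg.norm(B - H) / np.linalg.norm(H))
```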

Discussion
The convergence of a Hessian update formula that requires only function values for computing the update was analyzed. The update formula is based on the formula published in [8], which generally requires the function values at m ≥ n + 2 points. The proposed update is based on the case where m = n + 2. An additional requirement is introduced, namely that the m − 1 vectors from the central point to the remaining m − 1 points must positively span an (m − 2)-dimensional subspace of R^n. This requirement extends the usability of the proposed update to sets of points with 3 ≤ m ≤ n + 2 members. The set of m points used by the update is generated by adding m − 1 vectors to a central point in the set. The vectors are obtained by applying a random orthogonal transformation to a prototype set of vectors that spans an (m − 2)-dimensional subspace of R^n.
A lower bound on the expected improvement of the Hessian approximation was derived (Theorem 1). Up to now, no such result had been published for the update from [8] and m = n + 2. The obtained result was applied to several different prototype sets. The general result obtained for the case when the prototype set is a regular (m − 2)-dimensional simplex (Corollary 1) shows that the expected improvement of the Hessian approximation is greatest for m = 3 (i.e., a 1-dimensional regular simplex) and decreases as the dimensionality of the simplex increases. The special case m = 3 (1-dimensional regular simplex) corresponds to the update from [9]. The lower bound on the expected improvement obtained with our general result (Corollary 2) matches the one that was published in [9]. For the n-dimensional regular simplex, our result indicates that the lower bound on the expected improvement of the Hessian approximation is 0. Furthermore, it was shown that the Hessian approximation is possibly improved only by the first application of the proposed update formula (Lemma 7). Therefore, the use of the n-dimensional regular simplex in the role of the prototype set is a bad choice.
Next, the expected improvement of the approximate Hessian for a prototype set comprising N ≤ n orthogonal vectors and their normalized negative sum was derived. Such a prototype set with N = n was used in the optimization algorithm published in [12]. It was shown that using this kind of prototype set does guarantee a positive lower bound on the expected improvement of the Hessian approximation (Corollary 4). The general result (Corollary 3), however, again indicates that using a prototype set of lower dimensionality results in faster convergence. The result for N = 1 (two collinear vectors in the role of the prototype set) is the same as the one obtained for the update from [9].
Finally, the results were illustrated by running the proposed update on a quadratic function with a randomly chosen Hessian for several choices of the prototype set. The observed progress was compared to the lower bound predicted by Theorem 1. The results indicate that the lower bound is quite pessimistic, and that the actual progress is faster. The observed performance was closest to the predicted lower bound for the update formula from [9].

Acknowledgments:
The authors would like to thank the anonymous referees for their useful comments that helped to improve the paper. Most notably, the authors would like to thank the second referee, whose suggestion led to the simplification of the proof of Lemma 4.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

MFN   Minimum Frobenius norm
BFGS  Broyden-Fletcher-Goldfarb-Shanno
SR1   Symmetric rank-one

Appendix A
The following lemma is used in the proof of the main result.
Let S be an n × (n + 1) matrix whose columns are the unit vectors comprising a regular simplex in n dimensions. By definition, the following must hold:

S^T S = C,  where c_ii = 1 and c_ij = −1/n for i ≠ j.   (A5)

Clearly, there are infinitely many possible solutions to (A5). We will assume that S is upper triangular. A solution to (A5) with this property is unique and can be obtained via the Cholesky decomposition of the submatrix of C comprising its first n rows and columns, which yields the first n columns of S. The last column is then obtained as the negative sum of the first n columns. Matrix S is in row echelon form and represents what we will refer to as the standard regular simplex. Row k of S contains zeros before the diagonal entry s_kk, which is followed by n − k + 1 identical entries

s_k(k+1) = ... = s_k(n+1) = −s_kk/(n − k + 1),

because every row of S must sum to zero. For k < l, the inner product of rows k and l of S is therefore

Σ_{i=1}^{n+1} s_ki s_li = s_k(k+1) (s_ll + (n − l + 1) s_l(l+1))   (A10)

= s_k(k+1) (s_ll − (n − l + 1) s_ll/(n − l + 1)) = 0,   (A11)

so SS^T is diagonal. Because the nonzero eigenvalues of SS^T coincide with the nonzero eigenvalues of S^T S = C, which are all equal to (n + 1)/n, this proves SS^T = ((n + 1)/n) I_{n×n}. Any regular simplex V can be expressed with the standard regular simplex as V = QS, where Q is an orthogonal matrix. Therefore, we have

VV^T = QSS^T Q^T = ((n + 1)/n) Q I_{n×n} Q^T = ((n + 1)/n) I_{n×n}.   (A12)

Lemma A3. Let the columns of V represent a regular simplex in R^n, and let H be a symmetric matrix. Then,

Σ_{i=1}^{n+1} v_i^T H v_i = ((n + 1)/n) tr(H).

Proof. The left-hand side equals tr(V^T H V) = tr(H VV^T), which by (A12) equals ((n + 1)/n) tr(H).
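The construction of the standard regular simplex described above can be sketched as follows (ours):

```python
import numpy as np

def standard_regular_simplex(n):
    """n x (n+1) upper-triangular S with unit columns and pairwise inner products -1/n."""
    C = -np.full((n, n), 1.0 / n) + (1.0 + 1.0 / n) * np.eye(n)  # leading n x n block of C
    L = np.linalg.cholesky(C)                # C = L L^T, L lower triangular
    S = np.zeros((n, n + 1))
    S[:, :n] = L.T                           # first n columns: upper triangular
    S[:, n] = -S[:, :n].sum(axis=1)          # last column: negative sum of the others
    return S

n = 4
S = standard_regular_simplex(n)
C_full = -np.full((n + 1, n + 1), 1.0 / n) + (1.0 + 1.0 / n) * np.eye(n + 1)
assert np.allclose(S.T @ S, C_full)                      # (A5)
assert np.allclose(S @ S.T, (n + 1) / n * np.eye(n))     # (A12) with Q = I
```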