An Efficient Algorithm for the Separable Nonlinear Least Squares Problem

The nonlinear least squares problem miny,z‖A(y)z + b(y)‖, where A(y) is a full-rank (N + `)× N matrix, y ∈ Rn, z ∈ RN and b(y) ∈ RN+` with ` ≥ n, can be solved by first solving a reduced problem miny‖ f (y)‖ to find the optimal value y∗ of y, and then solving the resulting linear least squares problem minz‖A(y∗)z + b(y∗)‖ to find the optimal value z∗ of z. We have previously justified the use of the reduced function f (y) = CT(y)b(y), where C(y) is a matrix whose columns form an orthonormal basis for the nullspace of AT(y), and presented a quadratically convergent Gauss–Newton type method for solving miny‖C(y)b(y)‖ based on the use of QR factorization. In this note, we show how LU factorization can replace the QR factorization in those computations, halving the associated computational cost while also providing opportunities to exploit sparsity and thus further enhance computational efficiency.


Introduction
Many applications [1][2][3] lead to the need to find a solution (y * , z * ) of the separable nonlinear overdetermined least squares problem min y,z F(y, z) ≡ min y,z A(y)z + b(y) , (1) where the full-rank matrix A(y) ∈ R (N+ )×N and the vector b(y) ∈ R N+ are twice Lipschitz continuously differentiable functions of y ∈ R n , z ∈ R N and ≥ n.Here, we are using the Euclidean norm.The classic Golub-Pereyra variable projection method [4] replaces the problem (1) by the simpler least squares problem min y f (y) ≡ min y [I − A(y)A + (y)]b(y) . ( Once y * has been obtained by solving (2), z * can be found by solving the resulting linear least squares problem min z A(y * )z + b(y * ) ; formally, where A + (y * ) is the pseudo-inverse of A(y * ).An alternative method that reduces (1) to a smaller least squares problem min y f (y) was proposed in [5,6] for the case = n, and the associated iterative technique was then applied to the case > n in [7] but without a complete analytical justification.
The reduced system is min y f (y) ≡ min y C T (y)b(y) , (5) where C(y) is an (N + ) × N matrix whose columns form an orthonormal basis for Null(A T (y)).
In [8], we presented a quadratically convergent Gauss-Newton type iterative method, incorporating second derivative terms, to solve this problem, and provided a complete theoretical justification for our technique.In particular, we showed in [8] that our Gauss-Newton type method, and also the standard Gauss-Newton method [9][10][11], which omits the second derivative terms and is not quadratically convergent, are both invariant with respect to the specific basis matrix C(y) that is used at any particular point in the iteration.This makes it possible to substitute, at each point of the iteration, any easily computed local orthonormal basis for the nullspace of A T (y).This observation also makes it possible to compute both the first and the second derivatives involved in the computation very cheaply.Many instances of (1), such as those arising from the discretization of linear differential equations with only a few nonlinear parameters, involve N → ∞, while and n are fixed and small.In this case, the main computational cost in each step of our quadratically convergent Gauss-Newton type method of [8] is the QR factorization of A(y), costing approximately 4N 3 /3 flops [12] per iteration.In this note, we show how the relevant computations can instead be performed by using one LU factorization of A(y) per iteration, so that the computational cost of the computation is essentially halved to 2N 3 /3 flops [12].We note that, in the case of discretizations of differential equations, the matrix A(y) is typically very sparse and strongly patterned.Since LU factorization more easily exploits and preserves such sparsity and patterns than does QR factorization, this technique also opens up opportunities for further reductions in the computational cost in such applications.
In a previous paper [3], we presented the use of LU factorizations in the context of solving separable systems of nonlinear equations and zero-residual separable nonlinear least squares problems.We mentioned there that extending that approach to nonzero residual problems remained an open question.The current paper resolves that question, while also improving on the technique of [3] in other ways.In particular, our new method does not involve the singular value decomposition, and in contrast to the use of the standard Gauss-Newton method in [3], it retains quadratic convergence even in the case of nonzero residual problems.The latter is achieved by retaining the second derivative terms, and we show here how those terms can be computed very efficiently.This is in contrast to the usual use of the Gauss-Newton method, which sacrifices the quadratic convergence for the convenience of not computing the second derivatives.
In the next section, we present the relevant results from [8] relating to our technique for solving (5) by our Gauss-Newton type method using QR factorization, and show how those results and methods can be modified to use LU factorization.The resulting algorithm and some numerical examples to illustrate the method are given in the following two sections.The numerical results exhibit the quadratic convergence expected from our Gauss-Newton type method.

Analysis
We begin by presenting the relevant background results from [8].Assume that A(y) and b(y) are twice Lipschitz continuously differentiable in a neighborhood of the least squares solution y * of (1).As shown in [5][6][7][8], there exists a smoothly differentiable (N + ) × matrix C(y) whose columns form an orthonormal basis of Null(A T (y)) in a neighborhood of y * .Then, finding the least squares solution of (1) can be reduced to finding the least squares solution of (5).Our Gauss-Newton type method for solving (5) takes the form where f (y) ≡ C T (y)b(y) and f (y), f i (y) and H i (y) denote the derivative of f (y), the i-th component of f (y), and the Hessian matrix of f i (y), respectively.
In [8], we proved that, at any point y (m) in the iteration (6), the particular matrix C(y (m) ) whose columns form an orthonormal basis for the nullspace of A T (y (m) ) can be replaced by any other orthonormal basis C(y (m) ) for this space, without changing y (m+1) .Thus, instead of using the function f (y) ≡ C T (y)b(y) in ( 6), we may use a different function g(y) ≡ C T (y)b(y) at each iteration point y (m) .This freedom to use any orthonormal basis for the nullspace of A T (y) at the point y (m) is key to our numerical technique.It also permits us to write simply C(y (m) ) and f (y (m) ) instead of C(y (m) ) and g(y (m) ) in the rest of this note, regardless of the particular matrix C(y) being used, and for simplicity we write y instead of y (m) .The analysis below centers on the efficient computation of the terms required to use (6) at the particular point y (m) = y.
The following results are derived and used in [5][6][7][8] and are required for our analysis below.With the terms as defined above, and writing C i (y) for the i-th column of the matrix C(y), and the (j, k)-th entry of the n × n Hessian matrix Here, we use [C(y)] j and [C(y)] j,k to denote the termwise first derivative w.r.t.y j and second derivative w.r.t.y j , y k , respectively, of C(y), for j, k = 1, . . ., n.
Define the (N + ) × (N + ) nonsingular matrix Then, we showed in [8] that and where the (p, q)-entry K(y)] j,k (p,q) of the upper triangular matrix K(y)] j,k is Throughout the rest of this paper, we assume not only that A(y) has full rank N, but also that it is sufficiently well-conditioned that LU factorization with partial pivoting can be performed safely.For separable nonlinear systems with a rank-deficient A(y), the technique of bordered matrices [13][14][15][16] may be applied to produce a full rank matrix.
The following theorem lays the foundation for our approach to using the LU factorization.
Theorem 1.Let the LU factorization of the rectangular matrix A(y) with some permutation matrix P(y) be where A(y) is (N + ) × N, P(y) is an (N + ) × (N + ) permutation matrix, L(y) is an (N + ) × N unit lower triangular matrix and U(y) is an N × N nonsingular upper triangular matrix.
(a) The matrix M(y) ≡ [A(y)|S(y)], where S(y) ≡ P T (y) Then, the thin positive QR factorization produces a thin orthonormal (N + ) × matrix C(y) and a small × nonsingular upper triangular matrix R(y), and Proof.(a) By direct computation, M(y) is a product of nonsingular matrices: (b) By ( 15) and ( 16), From (17), we see that the matrix C(y) constructed in the theorem satisfies (7) at y, and thus, as shown in [8], Equations ( 11) and ( 12) hold for this matrix at y.
To determine the computational cost, recall that we assume that N → ∞, while and n are fixed and small.Then, the thin QR factorization in ( 16) is cheap since the matrix Ψ(y) has only columns.More specifically, the LU factorization of the (N + ) × N dimensional A(y) in ( 13) costs (N + )N 2 − N 3 /3 ≈ 2N 3 /3 flops, with an additional O(N 2 ) comparisons to produce P(y) [12], while computing S(y) only costs O(N) flops and the thin QR factorization of the (N + ) × dimensional matrix S(y) takes only (N + ) 2 − 3 = O(N) flops [12].
Notice that solving (11) and (12) involves solving several equations of the form u T M(y) = v T .However, the matrix M(y) defined in (10) and which is used in (11) and (12), differs from the matrix M(y) defined in (14) whose LU factors are given in (18).To reduce the cost of solving (11) and (12) from O(N 3 ) to O(N 2 ) operations, we can exploit the factorization (18) of the matrix M(y), as below.This result is essentially an application of the Sherman-Morrison-Woodbury formula.
Theorem 2. Let P(y)A(y) = L(y)U(y) be the LU factorization of A(y) with a permutation matrix P(y) as in (13), and let the matrices M(y), S(y) and C(y) be defined as in ( 14)-( 16), while M(y) is defined by (10).Denote W(y Then, Proof.(a) By (18), Then, Part (b) follows directly from (a).
Finally, we show how the value z * defined by (3) can be computed efficiently using the LU factorization of A(y * ).
Proof.Let the QR factorization of A(y * ) be Since A(y * ) is full rank, we see that Since C(y * ) is an orthonormal matrix, there exists an × orthogonal matrix Θ such that Hence, the conclusion follows.
Computing z * via (21) involves the matrix M(y * ) of ( 10), but the technique of Theorem 2 may again be used to replace this by the known LU factorization (18) of M(y * ) obtained from the LU factorization of A(y * ).Thus, the cost of this step is also only 2N 3 /3 + O(N 2 ) flops.

Algorithm
We now give a detailed Algorithm 1 based on the analysis from the last section.
Otherwise, if m < m 0 , replace m by m + 1 and go to (a); if m = m 0 , output that the method fails to obtain y * within given m 0 and .
Note that the cost of solving (6) for y (m+1) − y (m) is only O(1) since the matrix involved is only × and we assume = O(1).Thus, the overall cost per iteration remains 2N 3 /3 + O(N 2 ), the primary cost being the LU factorization of A(y (m) ).The standard Gauss-Newton method, which omits the second derivative terms, omits steps (g) and (h), but the resulting method is no longer quadratically convergent.

Examples
We present three examples to illustrate our algorithm for computing the least squares solution of overdetermined separable equations.Our examples have large N and small n, , approximating the theoretical characteristic that n, = O(1) as N → ∞.Similar results hold if we use much larger N.

Example 1. Consider the least squares solution of the overdetermined system
where and y ∈ R (thus n = 1), and where e k+1 is the (k + 1)-th standard unit vector in R 2k+1 .The selected value of y * ∈ R is the value of the exact solution (y * , z * ).Here, = 2.

Conclusions
We have shown how LU factorization can be used to solve separable nonlinear least squares problems with nonzero residuals.Our Gauss-Newton type method retains the second derivative terms and is quadratically convergent.The technique is most efficient for situations in which there are few nonlinear terms and variables.In that context we show how the second derivative terms can be computed relatively cheaply.Our numerical examples demonstrate the effectiveness of the technique in some areas of application.

Example 3 .
Consider the least squares solution of the overdetermined system A(y)z + b(y) = 0, (24)