Parallel Variants of Broyden's Method

In this paper we investigate some parallel variants of Broyden's method and, for the basic variant, we present its convergence properties. The main result is that the behavior of the considered parallel Broyden variants is comparable with that of the classical parallel Newton method, and significantly better than that of the parallel Cimmino method, both in the linear and the nonlinear case. The considered variants are also compared with two more recently proposed parallel Broyden's methods. Some numerical experiments are presented to illustrate the advantages and limits of the proposed algorithms.


Introduction
Let F : R^m → R^m be a (nonlinear) mapping and let F(x) = 0 be the corresponding system of equations. In [1], Xu proposed the following partially asynchronous block quasi-Newton method for such a nonlinear system. Let P be a partition of F and x into p blocks, F = (F_1, ..., F_p), x = (x_1, ..., x_p), where F_i and x_i are block functions and block variables, respectively, of the same dimension m_i. The variable x is the common memory of the multiprocessor system: every processor i can read the content of x and can write its new update of x_i. Let B_i, i = 1, ..., p, be square matrices of dimension m_i. The main steps of process i are:

Algorithm 1.
Input: x, B_i. Output: x.
1. Compute x_i^+ := x_i − B_i^{-1} F_i(x);
2. Get x^+ from x by replacing x_i with x_i^+;
3. Compute s := x_i^+ − x_i and y := F_i(x^+) − F_i(x);
4. Update B_i := B_i + (y − B_i s)s^T/(s^T s);
5. Set x_i := x_i^+, x := x^+.

Applying the Sherman-Morrison lemma to invert the matrix B_i (Step 4, Algorithm 1), the polynomial complexity of order three is reduced to polynomial complexity of order two.
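The Sherman-Morrison reduction mentioned above can be sketched as follows: instead of storing B_i and re-inverting it after each rank-one update, a processor can store H_i = B_i^{-1} and update it directly. A minimal sketch (the function name is ours, not from [1]):

```python
import numpy as np

def broyden_inverse_update(H, s, y):
    """Sherman-Morrison form of Broyden's rank-one update applied to
    H = B^{-1}: one pass of matrix-vector products, O(m_i^2) instead of
    the O(m_i^3) cost of re-inverting B after
    B := B + (y - B s) s^T / (s^T s)."""
    Hy = H @ y            # H y
    sH = s @ H            # s^T H (row vector)
    return H + np.outer(s - Hy, sH) / (sH @ y)
```

The update is valid as long as s^T H y is nonzero, which is exactly the condition for the updated B_i to remain invertible.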
More recently, in [2,3], Jiang and Yang proposed several preconditioners for the block Broyden's method and discussed the implementation process. The main steps of the parallel variant of Broyden's method considered in [2,3] (Algorithm 2 below) are as follows (B_0 is a block diagonal matrix, B_k^i is the i-th diagonal block of B_k): Algorithm 2.

End
In the case of a linear system, the sequential Broyden's method has global convergence provided that the system matrix is invertible [4], i.e., the sequence generated by this method converges to the solution of the system for any starting point x_0 and for any starting matrix B_0 (or H_0). Moreover, in this case the convergence is finite [5], and the solution is obtained in at most 2n steps; conditions under which this method requires the full 2n steps to converge are also given. In the same paper [5], it is shown that Broyden's method has 2n-step Q-quadratic local convergence in the nonlinear case (the initial starting point x_0 and the initial starting matrix B_0 must be close to the solution of the system and to the Jacobian matrix at the solution, respectively). Note that 2n-step Q-quadratic convergence means that the subsequence {x_2n} of {x_n} has Q-quadratic convergence.
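The finite termination in the linear case is easy to observe numerically. A small sketch of the classical sequential Broyden's method with B_0 = I (the test system is our own example, not one from [4,5]):

```python
import numpy as np

def broyden_linear(A, b, x, max_steps):
    """Sequential Broyden's 'good' method on F(x) = Ax - b, starting
    from B = I.  In exact arithmetic it terminates in at most 2n steps."""
    B = np.eye(len(b))
    for _ in range(max_steps):
        F = A @ x - b
        if np.linalg.norm(F) < 1e-12:     # already converged
            break
        s = -np.linalg.solve(B, F)
        x = x + s
        y = (A @ x - b) - F               # y = F(x+) - F(x) = A s
        B = B + np.outer(y - B @ s, s) / (s @ s)
    return x

A = np.array([[2.0, 1.0], [0.0, 3.0]])
b = np.array([3.0, 3.0])                     # exact solution (1, 1)
x = broyden_linear(A, b, np.zeros(2), 2 * len(b))   # at most 2n = 4 steps
```

In floating point the residual reaches roundoff level within the 2n-step bound rather than becoming exactly zero.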
In this paper we propose some new variants of the parallel Broyden's method that are suitable both for linear and nonlinear systems. We prove the convergence of our basic algorithm in the linear case. Numerical experiments are presented for linear and nonlinear cases, and some remarks are made concerning the behavior of the generated sequences.

The Parallel Variants and Preliminaries
In the case of a linear mapping, F(x) = Ax − b, Broyden's "good formula" [6,7], B^+ = B + θ(y − Bs)s^T/(s^T s) (B and B^+ are the current and the next iterate, respectively, and θ is a positive number chosen such that B^+ is invertible [4]), can be written, using y = As, in the form B^+ = B + θ(A − B)ss^T/(s^T s). The main idea of the proposed variants is to use, instead of A, the block diagonal part of A, denoted by D throughout the paper, partitioned according to P. The computer routine dg(A) will produce such a block diagonal matrix, i.e., D := dg(A).
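The routine dg(A) can be realized as follows, with the partition P given as a list of block sizes (this helper is our sketch, not the authors' code):

```python
import numpy as np

def dg(A, sizes):
    """Return the block diagonal part of A for the partition P given by
    `sizes` = (m_1, ..., m_p): the diagonal blocks are kept and all
    off-diagonal blocks are zeroed out."""
    D = np.zeros_like(A)
    start = 0
    for mi in sizes:
        blk = slice(start, start + mi)
        D[blk, blk] = A[blk, blk]
        start += mi
    return D
```

Note that dg is idempotent, dg(dg(A)) = dg(A), and that dg(A) + E is block diagonal whenever E is.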
Based on this idea we can consider several parallel variants of Broyden's algorithm. The first variant is just the above mentioned algorithm, to which we add a supplementary step that block-diagonalizes the updated matrix E at every iteration (Step 5, Algorithm 3 below). We also consider the following slight modification of the update formula. By adding D to both sides of this formula, the algorithm becomes more homogeneous (the same matrix E_n + D is used in both Steps 2 and 4, Algorithm 3 below). Moreover (and not least), for this variant a simple and elegant proof can be given for the convergence of the generated sequence (Theorem 1). Therefore our first variant, which will be considered as the basic algorithm, is (E_0 is a block diagonal matrix): 5: E_{n+1} := dg(Ẽ_{n+1}).

End
A variant of Algorithm 3 can be obtained by applying the Sherman-Morrison lemma to avoid the inversion of a matrix in Step 2. This results in the following Algorithm 4 (B_0 is a block diagonal matrix):

End
The simplest way to design a similar parallel algorithm for the nonlinear case is to replace D in the basic algorithm with the block diagonal part of the Jacobian of F, D = D(x) = dg(J(x)). This results in the following Algorithm 5 for the nonlinear case. However, the convergence properties of Algorithm 5, including the specific conditions, the rate of convergence, etc., must be further analyzed. The numerical experiments (Section 5) performed so far on certain nonlinear systems show a comparable behavior of the two algorithms (3 and 5) in the linear and nonlinear cases.
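With the E_n term omitted for brevity, the core step of such a nonlinear variant amounts to x^+ = x − dg(J(x))^{-1} F(x): only the block diagonal part of the Jacobian is formed and inverted. A toy illustration on a weakly coupled 2×2 system with 1×1 blocks (the example system is ours):

```python
import numpy as np

def block_jacobi_newton_step(F, dgJ, x):
    """One step of x+ = x - dg(J(x))^{-1} F(x), where dgJ(x) returns the
    block diagonal part of the Jacobian (here a diagonal matrix)."""
    return x - np.linalg.solve(dgJ(x), F(x))

# Example: F(x) = (3 x1 - cos(x2), 3 x2 - sin(x1)).  With 1x1 blocks,
# dg(J(x)) = diag(3, 3): the off-diagonal coupling is simply ignored.
F = lambda x: np.array([3 * x[0] - np.cos(x[1]), 3 * x[1] - np.sin(x[0])])
dgJ = lambda x: np.diag([3.0, 3.0])

x = np.zeros(2)
for _ in range(50):
    x = block_jacobi_newton_step(F, dgJ, x)
```

Because the coupling terms are small relative to the diagonal, the iteration is a contraction and converges linearly, which matches the behavior reported in Section 5.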
The main processing of Algorithms 3 and 5 is the computation of (E_n + D)^{-1}, or the solution of a corresponding linear system. Taking into account that E_n + D is a block diagonal matrix, whose blocks (E_n + D)_i, i = 1, ..., p, are matrices of dimension m_i, we have to perform p subtasks, the main processing of each being (E_n + D)_i^{-1}. It is obvious that these subtasks can be done in parallel. The successive updates of the matrix E, Steps 4 and 5, can also be executed in parallel. The concrete parallel implementation of this algorithm is an interesting problem and is left for future research.
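Since E_n + D is block diagonal, the solve decouples exactly into p independent subsolves, one per block; a sketch (a serial loop stands in for the p parallel subtasks):

```python
import numpy as np

def blockwise_solve(M, r, sizes):
    """Solve M x = r for block diagonal M by p independent subsolves;
    each subtask factors only its own m_i x m_i block, so the p subtasks
    could run on p different processors."""
    x = np.empty_like(r)
    start = 0
    for mi in sizes:
        blk = slice(start, start + mi)
        x[blk] = np.linalg.solve(M[blk, blk], r[blk])
        start += mi
    return x
```

For a truly block diagonal M the result coincides with the full solve, while each subtask touches only O(m_i^2) data and costs O(m_i^3) work.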
As usual, R^{m×m} denotes the set of square m × m real matrices. We use ‖·‖ and ‖·‖_F to denote the spectral and the Frobenius norm, respectively. The properties of Algorithm 3 are based on the following two lemmas.
Lemma 1. Let D ∈ R^{m×m} be a block diagonal matrix (conforming with P). Then, for any matrix E ∈ R^{m×m}, there holds ‖dg(E)‖ ≤ ‖E‖.

Proof. Using the inequality ‖dg(A)‖ ≤ ‖A‖, A ∈ R^{m×m} (which is true for both the spectral and the Frobenius norm), we obtain the first part. To prove the second part of Lemma 1, observe that if θ ∈ [0, 2] then ‖I − θss^T/‖s‖²‖ = 1 for any nonzero s ∈ R^m.

Lemma 2. For any matrix M ∈ R^{m×m}, any nonzero s ∈ R^m, any θ ∈ R, and M̃ := M − θMss^T/‖s‖², we have

‖M̃‖_F² = ‖M‖_F² − θ(2 − θ)‖Ms‖²/‖s‖².    (1)

(Formula (1) appears also in [4]; a short proof is given here for completeness.)
Proof. By a direct computation we expand ‖M̃‖_F², and the desired equality results.
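Formula (1), in the form ‖M̃‖_F² = ‖M‖_F² − θ(2 − θ)‖Ms‖²/‖s‖² with M̃ = M − θMss^T/‖s‖², can also be checked numerically on random instances:

```python
import numpy as np

def lemma2_residual(M, s, theta):
    """Residual of Formula (1): ||M~||_F^2 minus
    ( ||M||_F^2 - theta*(2 - theta)*||M s||^2 / ||s||^2 )
    for M~ = M - theta * M s s^T / ||s||^2.  Should be ~0."""
    Mt = M - theta * np.outer(M @ s, s) / (s @ s)
    lhs = np.linalg.norm(Mt, 'fro') ** 2
    Ms = M @ s
    rhs = np.linalg.norm(M, 'fro') ** 2 - theta * (2 - theta) * (Ms @ Ms) / (s @ s)
    return lhs - rhs
```

For θ = 2 the right-hand correction vanishes, reflecting that I − 2ss^T/‖s‖² is an orthogonal (Householder) reflection, which preserves the Frobenius norm.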

The Convergence of Algorithm 3 in the Linear Case
In the following, we suppose that the sequence {x_n} given by Algorithm 3, with starting point x_0, starting matrix E_0 and a certain θ, satisfies the condition that the sequence of steps s_n = x_{n+1} − x_n is bounded. (2)

Algorithm 3 defines the next iterate as a function of the current one, x_{n+1} = G(x_n), G being the iteration (or generation) function. It is clear that condition (2) is weaker than the asymptotic regularity of G, or the asymptotic regularity of x_0 under G (i.e., ‖x_{n+1} − x_n‖ → 0). The fulfillment of condition (2) depends not only on the characteristics of A (such as its condition number), but also on x_0, E_0 and θ. Note also that, usually, the sequence {E_n} is bounded, and if F is also bounded on R^m then condition (2) is satisfied; in Section 5 an example of a mapping is given which is bounded on R^m and for which ‖D(x_n)^{-1}E_n‖ < 1.
Theorem 1. Let A ∈ R^{m×m} be a nonsingular matrix, D = dg(A), and let θ be a real number, 0 < θ < 2. Suppose that D is invertible and that a‖D^{-1}‖ < 1, where a is defined in Lemma 1. Suppose also that condition (2) is satisfied and that E_0 + D is invertible. Then the parallel variant of Broyden's method (Algorithm 3) converges to the solution of the equation Ax − b = 0.
Proof. Since ‖E_n D^{-1}‖ ≤ a‖D^{-1}‖ < 1, it results from the perturbation lemma that (E_n + D)^{-1} exists for all n and the sequence {x_n} is well defined by Algorithm 3. By taking M = E_k + D, s = s_k and applying Lemmas 1 and 2, we obtain

‖E_{k+1} + D‖_F² ≤ ‖E_k + D‖_F² − θ(2 − θ)‖(E_k + D)s_k‖²/‖s_k‖².

By summing on k, k = 0, 1, ..., n, we have

θ(2 − θ) Σ_{k=0}^{n} ‖(E_k + D)s_k‖²/‖s_k‖² ≤ ‖E_0 + D‖_F².

Since (E_k + D)s_k = −F(x_k) and, by condition (2), the norms ‖s_k‖ are bounded, this implies that F(x_n) → 0, i.e., {x_n} converges to the solution of Ax − b = 0.

Remarks on the Parallel Implementation
To parallelize the product (E_n + D)s_n s_n^T appearing in the update of E_n, observe first that the element p_ij of the product ss^T is p_ij = s_i s_j, so s_n s_n^T can be computed by a very simple algorithm. Then, because E_n + D is block diagonal, its i-th diagonal block gives the m_i rows of the product (E_n + D)s_n s_n^T. These products are independent from each other and therefore can be computed in parallel. The same observations can be made for Algorithms 4 and 5.
In order to implement this method, let us suppose that the multiprocessor system has p processors. A simple idea is to split the system into p partitions and assign a partition to every processor. In this case the following issue arises: how large should the partitions m_i, i = 1, ..., p, be in order to obtain a low cumulative computation cost per iteration? In the case of Algorithms 3 and 5, the main processing of processor i is to invert a matrix of dimension m_i. Therefore every processor has to perform a computation of polynomial complexity of order m_i^3. The cumulative computation cost for all p processors is m_1^3 + m_2^3 + ... + m_p^3 and, if there is no overlapping, m_1 + m_2 + ... + m_p = m. Further, we use an elementary inequality in the positive numbers x_i, with μ = min{x_1, ..., x_p}, which, applied with q = 3 and x_i = m_i, leads to a condition (3) on the partition sizes. We obtain the following

Proposition 1. Suppose that the smallest partition size, μ = min{m_1, ..., m_p}, satisfies condition (3). Then the cumulative computation cost in Algorithms 3 and 5 is less than m^3/p.
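The effect of balanced partition sizes on the cumulative cost is easy to check numerically: for equal blocks m_i = m/p one gets exactly Σ m_i³ = m³/p², well below the m³/p bound of Proposition 1 (condition (3) itself is not reproduced above, so this sketch only contrasts the balanced case with an unbalanced one):

```python
def cumulative_cost(sizes, q=3):
    """Cumulative computation cost sum_i m_i^q per iteration
    (q = 3 for Algorithms 3 and 5, q = 2 for Algorithm 4)."""
    return sum(mi ** q for mi in sizes)

m, p = 120, 6
balanced = [m // p] * p            # m_i = m/p  ->  sum m_i^3 = m^3 / p^2
unbalanced = [95, 5, 5, 5, 5, 5]   # one large block dominates the cost
```

With one block close to m, the cumulative cost approaches m³ and the advantage of the p processors is lost, which is the point of condition (3).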
A similar result can be established for Algorithm 4. The condition (3) is modified correspondingly, and the cumulative computation cost satisfies m_1^2 + m_2^2 + ... + m_p^2 ≤ 2m^2/p.

Numerical Experiments and Conclusions
The main purpose of this section is to highlight the convergence properties of the proposed parallel Broyden's algorithms in the case of a synchronous model. The communication costs between processors, scheduling and load balancing, optimal processor assignment, and so on, both for synchronous and asynchronous models, are important issues in themselves. However, they are not the subject of the present study.
The behavior of the sequence {x_n} obtained by the proposed algorithms, for various linear or nonlinear cases, is shown in our experiments by the graph of ln ε_n, where ε_n is the error at step n, ε_n = ‖F(x_n)‖ or ε_n = ‖x_n − x*‖ (n on the horizontal axis and ln ε_n on the vertical axis). The convergence of the process entails a decreasing graph down to negative values (for example, ln ε_n = −30 means ε_n ≈ 10^{−13}). If the convergence is linear (order r = 1) with factor c ∈ (0, 1), then ε_n ≈ c^n ε_0, hence ln ε_n ≈ n ln c + ln ε_0; ln ε_n depends linearly on n and the graph of ln ε_n is close to a straight line. For comparison, the proposed parallel Broyden's method is tested against the parallel Newton method and the parallel Cimmino method. We consider the following parallel Newton method [8]: x_{n+1} = x_n − dg(J(x_n))^{-1} F(x_n), where J(x_n) is the Jacobian of F. Note that in the case of a linear system, by applying this form of the Newton method, the direct computation of x* (in one iteration) is lost and the convergence is linear. Consequently, the graph of ln ε_n is a straight line.
The block Cimmino method is considered only for the linear case, F(x) = Ax − b, and can be described as follows (see, for example, [9]). The system is partitioned into p strips of rows, A_i and b_i, 1 ≤ i ≤ p. Supposing that A_i has full row rank, the Moore-Penrose pseudoinverse of A_i is defined by A_i^+ = A_i^T (A_i A_i^T)^{-1}, and the iteration is x_{n+1} = x_n + ω Σ_{i=1}^{p} A_i^+ (b_i − A_i x_n), where ω is a relaxation parameter.
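The standard block Cimmino iteration, x_{n+1} = x_n + ω Σ_i A_i^+(b_i − A_i x_n), can be sketched as follows (the row-strip corrections are independent and could be computed in parallel; the test system and parameter values are ours):

```python
import numpy as np

def block_cimmino(A, b, row_strips, omega, steps):
    """Block Cimmino for A x = b: each strip i contributes the correction
    A_i^+ (b_i - A_i x); the p corrections are independent and are summed
    with relaxation parameter omega."""
    x = np.zeros(A.shape[1])
    # A_i^+ = A_i^T (A_i A_i^T)^{-1} for full-row-rank strips
    pinvs = [np.linalg.pinv(A[rows]) for rows in row_strips]
    for _ in range(steps):
        x = x + omega * sum(P @ (b[rows] - A[rows] @ x)
                            for P, rows in zip(pinvs, row_strips))
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([4.0, 3.0])                  # exact solution (1, 1)
x = block_cimmino(A, b, [slice(0, 1), slice(1, 2)], omega=0.9, steps=200)
```

Since Σ_i A_i^+ A_i is a sum of orthogonal projectors, the iteration converges linearly for suitable ω, which is why Cimmino appears as a straight (and typically flatter) line in the plots below.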

Experiment 1.
We applied Algorithm 3 to linear systems of medium dimensions, m = 20, 50, 100, 500, 600, with different values of the condition number. A program "genA" generates a square, sparse matrix A of dimension m with random elements a_ij ∈ (−1, 1), i ≠ j, and a_ii ∈ (d1, d2), where d1, d2 are given (positive) numbers. The percentage of nonzero elements is an input of "genA", and in this experiment we took it to be about 50%. The right-hand side b and the starting point x_0 were also randomly generated with elements between given limits. Thus the function F is randomly defined by F(x) = Ax − b. For every considered system we applied Algorithm 3, the parallel Newton method and the parallel Cimmino method, and drew the resulting graphs in the same plot; for the graph of Algorithm 3 we used a continuous line, for the parallel Newton method a dashed line, and for the parallel Cimmino method a dotted line. The number of partitions and the dimensions of every partition are inputs of the routine "dg(A)". The test systems were automatically generated by the routine "genA"; the three graphs of Figure 1 are examples from a large number of cases, corresponding to systems of dimensions 50, 100, 200 respectively; other parameters, like c (the condition number), θ (the factor in Step 4), and p(n_1, ..., n_p) (the number of partitions and their dimensions), were chosen as follows:

The following conclusions can be drawn. (1) The proposed parallel Broyden's algorithm works well if the system is relatively well conditioned (the condition number should be around ten), and in this case the behavior of Algorithm 3 is very similar to the parallel Newton method and significantly better than the parallel Cimmino method (Figure 1a). (2) For medium conditioned systems the behavior of Algorithm 3 is unpredictable; sometimes it works better than the Newton method (Figure 1b). (3) If the system is poorly conditioned (the condition number is greater than 300), the proposed parallel Broyden's algorithm fails to converge (the Newton method has the same behavior) (Figure 1c).
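A generator in the spirit of "genA" can be sketched as follows (this is our reconstruction; the authors' routine is not listed in the paper, and the parameter names are ours):

```python
import numpy as np

def gen_A(m, d1, d2, density=0.5, seed=0):
    """Random sparse test matrix in the spirit of "genA": off-diagonal
    entries a_ij in (-1, 1), each kept with probability `density`, and
    diagonal entries a_ii in (d1, d2), so the diagonal dominates and the
    diagonal blocks are likely invertible."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, (m, m))
    A *= rng.uniform(0.0, 1.0, (m, m)) < density   # sparsify off-diagonal
    np.fill_diagonal(A, rng.uniform(d1, d2, m))    # diagonal in (d1, d2)
    return A

A = gen_A(20, d1=5.0, d2=10.0)
```

Larger d1, d2 relative to the off-diagonal entries give better conditioned systems, which is how the well, medium and poorly conditioned cases of Figure 1 can be produced.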
Figure 2a presents the behavior of the sequence generated by Algorithm 3 for a linear system of dimension m = 500, c = 4.366, p(...) = 5(101, 51, 151, 101, 96), θ = 0.2, E_0 = I. The condition (2) of Theorem 1 is not satisfied and the sequence generated by the algorithm (in this case) does not converge. However, the behavior is interesting: the generated sequence comes very close to the solution, the error decreases to a very small value (in our example down to 10^{−12}), and then the good behavior breaks down. The graph corresponding to the sequence generated by Algorithm 4 for a system of dimension m = 600 with c = 3.942, p(...) = 6(101, 51, 151, 101, 101, 95), θ = 0.03, E_0 = (I + D)^{-1} is presented in Figure 2b. We can observe the similarities between this sequence and the sequence obtained by the parallel Newton method. A similar behavior can be seen in the case of the first system (Figure 3a). The third system is an example for which F is bounded on R^3, as required in Section 3. Note that the condition ‖D(x_n)^{-1}E_n‖ < 1 is also satisfied.

Experiment 3.
This experiment compares Algorithm 3 with the parallel variants proposed in [1] and [2,3]. Three linear systems of low dimension (m = 5), with condition numbers 8.598, 4.624 and 3.701, were considered as test systems. Every system was split into two partitions, of dimensions 3 and 2 respectively. The results are presented in Figure 4. We can conclude that the behavior of Algorithm 3 is satisfactory: in all cases the generated sequence converges to the solution of the system and the convergence is linear. It is worthwhile to note the very good behavior of Algorithm 3 in comparison with the other two variants in cases (b) and (c): in case (b), Algorithms 1 and 2 appear to have very slow convergence, and in case (c) these algorithms fail to converge.

Figure 1.
Figure 1. The graphs of ln ε_n generated by Algorithm 3, the parallel Newton method and the parallel Cimmino method in the linear case: (a) well conditioned system; (b) medium conditioned system; (c) poorly conditioned system.

Figure 2.
Figure 2. The behavior of the sequences generated by Algorithm 3, the parallel Newton method and the parallel Cimmino method (graphs (a)), and by Algorithm 4, the parallel Newton method and the parallel Cimmino method (graphs (b)): (a) a system which does not satisfy condition (2); (b) a well conditioned system.