Kaczmarz Anomaly in Tomography Problems

The Kaczmarz method is an important tool for solving large sparse linear systems that arise in computerized tomography. The Kaczmarz anomaly phenomenon has been observed recently when solving certain types of random systems. This raises the question of whether a similar anomaly occurs in tomography problems. The aim of the paper is to answer this question, to examine the extent of the phenomenon and to explain its reasons. Another tested issue is the ability of random row shuffles to sharpen the anomaly and to accelerate the rate of convergence. The results add important insight into the nature of the Kaczmarz method.


Introduction
The Kaczmarz algorithm is an iterative method for solving large sparse linear systems of the form where A ∈ R m×n , b = (b 1 , . . . , b m ) T ∈ R m and x ∈ R n denote the vectors of unknowns. Let the rows of A be denoted by the row vectors a T i , i = 1, . . . , m. Then, an equivalent way to write (1) is a T i x = b i for i = 1, . . . , m.
The idea of the Kaczmarz method is to handle one equation at a time. Let a i 2 = (a T i a i ) 1/2 denote the Euclidean vector norm and let w be a preassigned relaxation parameter that satisfies 0 < w < 2. Then, the kth iteration of the Kaczmarz algorithm, k = 1, 2, . . . , is composed of m steps. The ith step of the kth iteration, i = 1, . . . , m, starts with the vector x k,i−1 and ends with the vector That is, the ith step uses only the ith equation. Observe that for w = 1, the point x k,i is the projection of x k,i−1 on the hyperplane {x|a T i x = b i }. Note also that the kth iteration, k = 1, 2, . . . , starts with the vector and ends with x k = x k,m = x k+1,0 .
The starting point is denoted as x 0 = x 1,0 . The fact that the algorithm uses one row at a time makes it a popular tool for solving large sparse linear systems that arise in important applications, such as computerized tomography or digital signal processing. The literature on the Kaczmarz method is vast and covers various issues. See  and the references therein. Results on the theory behind the method and its rate of convergence can be found in [1,6,14,16,19,21,23,25], while efficient implementations and applications are considered in [3,4,10,[12][13][14][15]19]. In addition, there are several variants of the basic iteration, such as block versions and parallel computing techniques [4,14,19]. In particular, recently, there has been growing interest in Randomized Kaczmarz methods [5,20,22,24]. Some of the variants are easily described by restating the algorithm with the restriction that each iteration regards only one equation, which is chosen according to some rule. In the basic iteration (3), the rows are chosen in a sequential "cyclic" manner. In "Greedy" Kaczmarz, we select an equation that has a maximal residual, while in "Randomized" Kaczmarz, the equation's index i is selected at random, with probability proportional to a i 2 2 , e.g., [5,[20][21][22]24]. However, in the coming discussions, the terms "Kaczmarz method" and "Kaczmarz iteration" refer to (3). The original algorithm of Kaczmarz [16] is obtained from this framework when m = n and w = 1.
The use of the relaxation parameter, w, is motivated by the close relation with the SOR method for solving the linear system where here, y ∈ R m denotes the vector of unknowns. Let y k denote the current solution at the end of the kth iteration of this method, k = 1, 2, . . . . Then, the following observation is well known, e.g., [1,6]. If the starting points satisfy then the equality holds for all k. This relation implies that several convergence properties of Kaczmarz method are inherited from those of the SOR method. Let the linear systemâ T i x =b i , i = 1, . . . , m, be obtained from (2) by "normalizing" the rows of A to have unit length. That is, Then, it is easy to verify that applying the Kaczmarz method to solve (9) yields the same sequence as (3). Hence, when studying the convergence properties of Kaczmarz method, there is no loss of generality in assuming that the rows of A have unit length. (A similar remark applies to the related SOR method).
In this paper, we consider an interesting feature of Kaczmarz method. To illustrate this property, it is assumed that m is considerably larger than n. Let A i and b i be composed from the first i rows of A and b, respectively. That is Then, the new feature is revealed when using Kaczmarz method for solving the linear systems and watching how the number of rows, i, affects the rate of convergence. If A is an arbitrary matrix, we are not expecting a certain behavior. However, as we shall see, in some cases, the number of rows has a dramatic effect on the rate of convergence. Assume first that i is considerably smaller than n. In this case, the Kaczmarz method has a fast rate of convergence. Yet as i increases toward n, the rate of convergence slows down. That is, the more equations we have, the more iterations are needed to solve the system. In particular, as i approaches n, there is a dramatic increase in the number of iterations. The closer i and n are, the slower the convergence. However, as i passes n, the situation is reversed. From now on, the more equations we have, the fewer iterations are needed. Finally, when i is considerably larger than n, the method returns to enjoy rapid convergence. We call this behavior the Kaczmarz anomaly. One aim of this paper is to examine the presence of this phenomenon when solving tomography problems. The first report on the Kaczmarz anomaly appeared in [7], but it remained almost unnoticed. Recently, we have shown in [8] that it is likely to occur whenever the rows' directionsâ i = a i / a i 2 scatter randomly over some portion of the unit sphere. This suggests that a random shuffle of the rows may improve their randomality and strengthen the anomaly. A second aim of the paper is to examine this idea.
The plan of the paper is as follows. The first two sections provide theoretical background that reveals the reasons behind the anomaly phenomenon. Section 2 overviews the condition number anomaly phenomenon, while Section 3 explains how the condition number of A affects the rate of convergence. Combining these features yields the Kaczmarz anomaly. The background is based on recent results by this author; see [8,9]. The use of random shuffles and related techniques is considered in Section 4. The paper ends with numerical experiments that illustrate the anomaly phenomena.

The Smallest Singular Value Anomaly and the Condition Number Anomaly
Let α i denote the largest singular value of the matrix A i , i = 1, . . . , m. Then, α 2 i is the largest eigenvalue of the cross-product matrix A T i A i , and there exists a unit eigenvector, u i , that satisfies and The first assertion characterizes the ascending behavior of the sequence α 1 , . . . , α m . (For proofs of the coming theorems, see [8]).

Theorem 1.
For i = 1, . . . , m − 1, we have the inequalities and Next, we explore the behavior of the smallest singular value, which is rather surprising. Let β i denote the smallest singular value of the matrix A i , i = 1, . . . , m. Then, as the coming theorem shows, the first part of this sequence, β 1 , . . . , β n , is descending. Theorem 2. Letâ i+1 = a i+1 / a i+1 2 be a unit vector in the direction of a i+1 , and let the unit vector v i ∈ R n be a right singular vector of A i that corresponds to β i . Then, for i = 1, 2, . . . , n − 1, we have the inequalities The assumption β i > 0 is not essential for the proof of Theorem 2, but it enables us to replace (18) with the inequality This exposes the actual reasons that force β 2 i+1 to be smaller than β 2 i . One reason is the size of a i+1 . We see that the smaller a i+1 2 2 is, the smaller is β 2 i+1 . Another important factor is the size of the scalar product v T iâ i+1 . Since both v i andâ i+1 are unit vectors, the Cauchy-Schwartz inequality implies (v T iâ i+1 ) 2 ≤ 1, and equality occurs if and only if a i+1 = ±v i . Now, from (19), we see that the larger (v T iâ i+1 ) 2 is, the smaller is β 2 i+1 . In this paper, we concentrate on the behavior of Kaczmarz method, and for this purpose, it is possible to assume that all the rows of A have unit length. This assumption implies that β 1 = 1 and turns (19) into the form A further simplification is allowed when β 2 i becomes considerably smaller than one. In this case, the factor 1/[1 + β 2 i (v T iâ i+1 ) 2 ] approaches one, and the bound is nearly as good as (20).
It is left to see how the second part of the sequence, β n , β n+1 , . . . , β m , behaves. Below, we will show that this part is ascending. The proof is based on the observation that now, β 2 i is the smallest eigenvalue of the cross-product matrix A T i A i . Hence, for i = n, n + 1, . . . , m, there exists a unit eigenvector v i such that and Theorem 3. For i = n, n + 1, . . . , m − 1, we have the inequalities Assume for a moment that β i > 0, which enables us to rewrite (24) in the form Assume further that the rows of the matrix have unit length and random directions. Then, a small β 2 i implies a large increase ratio, while a large β 2 i means a slow increase. Consequently, when i is close to n, we expect a fast increase, but as i moves away from n, the rate of increase is likely to slow down.
Combining the results of Theorems 2 and 3 shows that the sequence β 1 , . . . , β n is decreasing, the sequence β n , . . . , β m is increasing, and β n is the smallest number in the whole sequence. Moreover, in some cases, β n can be considerably smaller than its neighbors. This behavior is called the smallest singular value anomaly.
Let k i denote the condition number of A i , i = 1, . . . , m. In the rest of this section, we assume for simplicity that β i > 0 for i = 1, . . . , m, and that (The discussion of the case when β i = 0 is deferred to the next section. In this case, β i is redefined as the smallest nonzero singular value of A i ). We have seen that the sequence α 1 , . . . , α n is ascending, while the sequence β 1 , . . . , β n is descending. This proves the following conclusion.
Theorem 4. The sequence k 1 , . . . , k n is ascending. That is, The behavior of the sequence k n , k n+1 , . . . , k m is not that straightforward. We know that the sequences α n , α n+1 , . . . , α m and β n , β n+1 , . . . , β m are ascending, but this does not provide decisive information. Indeed, for i ≥ n, one can find examples in which k i+1 < k i as well as examples with k i+1 > k i , e.g., [8]. The condition number anomaly occurs when the sequence k n , . . . , k m is descending. That is, when The reasons behind this behavior lie in the following observations. Theorem 5. Let u i+1 be as in Theorem 1, let v i+1 be as in Theorem 3, and consider the terms and Proof. From (17), we see that while Theorem 3 gives Hence, combining these inequalities yields (31).

Corollary 1. The inequality
The last corollary is a key observation that indicates at which situations the condition number anomaly is likely to occur. Assume for example that the direction of a i+1 is chosen in some random way. Then, the scalar product terms (u T i+1 a i+1 ) 2 and (v T i+1 a i+1 ) 2 are likely to be about the same size. However, since β 2 i is (considerably) smaller than α 2 i , the term In other words, the condition number anomaly is likely to occur whenever the rows' directions scatter in some random way. This conclusion means that the phenomenon is shared by a wide range of matrices. See Tables 1-6 and the examples in [8].

The Rate of Convergence of the Kaczmarz Method
We have seen that the Kaczmarz method for solving (1) is closely related to the SOR method for solving the linear system where Below, we will use this relation to obtain the iteration matrix of Kaczmarz method (3). As before, it is allowed to assume that the rows of A have unit length. That is This assumption implies that G has the form where I denotes the identity matrix, and L is a strictly lower triangular matrix. The SOR iteration splits G in the form where and As before, w is a preassigned relaxation parameter that satisfies 0 < w < 2. The kth SOR iteration, k = 1, 2, . . . , starts with y k−1 and ends with y k , which is computed by solving the linear system In other words, y k is obtained from y k−1 by the rule where is the related iteration matrix, and Observe that (40) enables us to express H w in the form Multiplying (44) by A T gives while substituting I − B −1 w AA T instead of H w shows that This means that the iteration matrix of the Kaczmarz method has the form Note that H w is an m × m matrix, while F w is an n × n matrix. However, as shown in [9], these matrices share the nonzero eigenvalues. The theory of the Kaczmarz method tells us that the sequence x k , k = 1, 2, . . . , converges for any choice of b and x 0 , e.g., [3,6,9,10,14,19,22,25]. Moreover, letx denote the limit point of this sequence; then, the error vectors x k −x satisfy This shows that the rate of convergence depends on ρ(F w ), the spectral radius of F w . The smaller ρ(F w ) is, the faster the convergence. It is interesting, therefore, to see which properties of A make ρ(F w ) small. One answer is given by the following bound, which has recently been derived in [9]. (A second answer is given in the next section, in which we consider the effect of rows shuffling).
Let α denote the largest singular value of A, let β > 0 denote the smallest nonzero singular value of A, and let k = α/β denote the related condition number of A. Let ω opt denote the optimal relaxation parameter of the Kaczmarz method. That is, Then, it is proved in [9] that where c is a constant, The bound is not tight, and the actual rate of convergence (even for w = 1) is often faster than the implied rate. Recall that in many practical problems, w opt is not known in advance, and its value is computed by repeated experiments; see Section 5. Yet the main consequence from this bound is that a small condition number forces fast convergence. Conversely, for large k, the bound tends to 1, which allows slow convergence. Indeed, as explained in [9], the existence of small nonzero singular values invites a slow rate of convergence.
The relation between the condition number and the rate of convergence suggests that the Kaczmarz anomaly phenomenon is caused by the condition number anomaly. We have seen that the last phenomenon is expected to occur whenever the rows' directions scatter randomly. This raises the question of whether tomography problems possess these properties. The next sections attempt to answer this question.

From Random Shuffles to Optimal Ordering
The Kaczmarz anomaly phenomenon is observed by watching how the number of rows affects the rate of convergence. Another property that affects the rate of convergence is row ordering. The initial ordering in tomography problems is often rather poor, in the sense that it yields a slow rate of convergence. A typical tomography matrix is composed from several blocks of rows, where each block is generated by one "view". The views (and the blocks) are ordered according to the size of the view angle, which is the natural geometric order, e.g., [15], p. 602. Yet this natural order minimizes the angle between adjacent views, which is a property that retards convergence. A possible remedy for this difficulty is to apply a random shuffle of rows before starting the Kaczmarz process. The shuffle is aimed at achieving a faster rate of convergence (see below). Yet, at the same time, it improves the randomality of the rows' directions, which sharpens the anomaly phenomena.
The term "random shuffle" means that the rows of the linear system (1) are reordered by applying a random permutation. This converts (1) into the form where P is a random permutation matrix. To simplify the coming discussions and experiments, we assume that the random permutation is chosen by MATLAB's command "randperm (m)", and the matrix PA is generated by the command "shuffle (A)", which uses the "randperm (m)" command. The reordering of the rows is expected to change the iteration matrix of the Kaczmarz method as well as its rate of convergence. It is easy to verify this assertion by using the relation with the SOR method for solving the system Now, it is easy to see that the SOR iteration for solving (56) differs from that of (6). Indeed, the observation that the reordering of rows changes the rate of convergence of the SOR method is not new. See, for example, [27] and the references therein. As mentioned above, it has been observed by several authors that when solving tomography problems, a random row shuffle may improve the rate of convergence of the Kaczmarz method. See, for example, [14,15,19,21,22,24] and the references therein. A possible explanation of this phenomenon comes from geometric interpretation of the basic step (3) when i > 1 and w = 1. Let 0 ≤ θ i ≤ π/2 denote the angle between a i−1 and a i . Then, in two-dimensional space, the distance to the solution point is reduced by the factor cos θ i . That is, a small angle yields a small reduction while a large angle implies a large reduction. When moving to larger dimensions, the situation is not that simple, but it is still true that a small θ i forces a small step toward the solution, while a large θ i allows larger steps. (Recall that a large θ i means that a i−1 and a i are nearly orthogonal). These considerations suggest that a random shuffle may improve the rate of convergence if it improves orthogonality between adjacent rows.
Similar arguments have motivated Herman and Meyer [15] to propose an optimal ordering of rows that takes advantage of the special structure of tomography problems to maximize orthogonality between adjacent rows. Further optimal ordering schemes that follow this approach are described in [11,17,18].
The observation that a random ordering of rows may improve the rate of convergence has motivated the Randomized Kaczmarz algorithm of Strohmer and Vershynin [24]. In this algorithm, the basic step treats one equation whose index is selected at random with probability proportional to a i The use of a random shuffle has recently been considered by Oswald and Zhou [21,22], who proposed an improved randomized method, the Shuffled Kaczmarz algorithm. In this method, each iteration is preceded by a random shuffle of the rows. This formulation has two advantages. First, as in the Kaczmarz method, each iteration treats all the equations. Second, since all the shuffled matrices have the same singular values as A, the bound on the rate of convergence is the same as in the Kaczmarz method.
It is interesting to compare the above randomized methods with the Initial Shuffle method, which uses one random shuffle before starting the Kaczmarz algorithm. Both approaches share the same motivation. If the given system has bad ordering, then a random shuffle is likely to provide a better ordering. Furthermore, basically, we are not expecting large differences in the quality of the generated random shuffles. Hence, in practice, the initial shuffle method is likely to run at the same speed as the randomized Kaczmarz methods. Yet, since it uses only one shuffle, there is a tiny probability to obtain bad ordering, while randomized algorithms avoid this possibility.

Numerical Experiments
The experiments examine the behavior of the Kaczmarz method (3) when solving tomography test problems. The test problems are generated by using MATLAB's functions from "AIR tools", which is a MATLAB package of algebraic reconstruction iterative methods prepared by P.C. Hansen and others [12,13]. The test problems imitate the scanning of an N × N array of square cells. This generates a linear system with n = N 2 unknowns.
to remove zero rows, which yields a linear system with n = 400 unknowns and m = 4340 rows. In Fan beam tomography, each angle (each view) is related to a "fan" of p rays, and the problem is generated by using the function fanbeamtomo (N) (59) with N = 20 and the default values theta = 0:1:359 and p = round ( √ 2N) = 28. This builds a linear system with n = N 2 = 400 unknowns and m = 28 × 360 = 10,080 equations. Then, after removing zero rows, we remain with m = 9520 equations.
In Seismic tomography problems, the linear system is generated by applying the function seismictomo (N, s, p) with N = 20, s = 2N sources, and p = 4N receivers. This setting builds a linear system with n = N 2 = 400 unknowns and m = s × p = 3200 equations. (In this case, there are no zero rows.) The experiments were carried out as follows. At first, we have generated an m × n linear system, Ax = b, as described above. Together with A and b, we are given a prescribed solution x * ∈ R n , which is the one that has been used to build A and b. Then, in the second stage, the rows of A are normalized to have a unit norm. Thus, for i = 1, . . . , m, the ith row of A is redefined as a i = a i / a i 2 , while b is updated as b = Ax * . Finally, after the normalization, the Kaczmarz method was applied to solve partial systems of the form (13). The starting point in our runs is always x 0 = 0, and the iterative process was terminated after 666 iterations.
The shuffled test problems are obtained by reordering the rows of A, using a random permutation. The actual reordering is carried out by applying MATLAB's function shuffle (A).
After the shuffling, the vector b is redefined as b = Ax * where x * denotes the known solution. (The shuffling takes place after the normalization but before starting the solution of the partial linear systems.) The rows in Tables 1-6 describe the use of the Kaczmarz method to solve partial linear systems of the form Recall that A i is a i × n submatrix of A which is composed from the first i rows of A, and b i = (b 1 , . . . , b i ) T ∈ R i is composed from the first i entries of b. The results for the linear system (62) start with the number of rows, i, and the number of zero singular values of A i . Then, we provide the values of α i , β i , and k i , as well as the related residual values. As noted in the tables' headlines, α i is the largest singular value of A i , β i is the smallest nonzero singular value of A i , and k i = α i /b i is the condition number of A i . The residual values are defined as and taking a value of w that yields the smallest residual.   The reading of the tables is simple. Consider for example Table 2 when the number of rows equals 400. In this case, the related 400 × 400 matrix has 22 zero singular values, α i = 4.200, β i = 4.41 × 10 −4 , k i = 9.53× 10³, and the residual values are 1.72 × 10 −3 for w = 1, and 1.59 × 10 −3 for w opt = 0.8.
The experiments reveal interesting features of the anomaly phenomena. First, note the slow increase of the sequence α 1 , . . . , α m . We see that α i is considerably smaller than i. Moreover, the larger i is, the smaller the ratio α i /i. This behavior is due to the fact that the rows have unit length and random directions; see (17).
The second remark is about the smallest singular value anomaly and the related condition number anomaly. The derivation of these properties relies on the assumption that the submatrices A i , i = 1, . . . , m, do not have zero singular values. Yet, as our tables show, several submatrices have zero singular values. Consequently, in some cases, we can see a slight violation of the anomaly behavior.
The third point is about the use of an initial random shuffle. Note that the shuffle reduces the number of zero singular values in the submatrices. In addition, as expected, the shuffled systems enjoy sharper anomaly. In particular, we see that for highly overdetermined (underdetermined) linear systems, the use of a shuffle improves the rate of convergence! Tables 7 and 8 display experiments with randomized Kaczmarz methods. In Shuffled Kaczmarz, each iteration starts with a random shuffle of the linear system that is solved. In Randomized Kaczmarz, each iteration is composed from m steps, where each step treats one randomly chosen equation. Thus, in both methods, the computational effort per iteration is slightly larger than that of Kaczmarz iteration. Consider for example Table 7, which describes experiments with the Shuffled Kaczmarz method. Now, let us inspect the solution of the Fan beam tomography problem of the form (62) with i = 1000 rows. In this case, the related residual values are 1.64 × 10 −3 and 8.54 × 10 −5 , where the smaller value is due to initial shuffle.
The results of Tables 7 and 8 are quite interesting. First, note that the two randomized methods behave in a similar way. In particular, both methods possess the anomaly phenomenon, and the use of an initial shuffle sharpens the anomaly. However, when solving highly overdetermined systems, the use of initial shuffle has a smaller effect, since now, each iteration includes an internal shuffle. Moreover, comparing Tables 7 and 8 with  Tables 1-6 indicates that the randomized methods are not faster than the Kaczmarz method with initial shuffle. That is, one shuffle is enough! Summarizing our experiments, we see that the asymptotic rate of convergence of the Kaczmarz method can be rather slow. Yet, the rate of convergence is considerably affected by a number of factors, such as the number of rows (the Kaczmarz anomaly phenomenon), the value of w, and rows ordering.

Concluding Remarks
Although Kaczmarz method has been well known for many years, the Kaczmarz anomaly phenomenon was observed only recently. This is, perhaps, because it requires a certain randomness of the rows' directions. A major application of Kaczmarz method is to solve large sparse linear systems that arise in computerized tomography. Hence, it is important to expose the extent of the phenomenon when solving such problems. The theory presented in the paper explains the reasons behind the anomaly, while the experiments display its nature.
The Kaczmarz anomaly phenomenon is observed by watching how the number of rows changes the asymptotic rate of convergence. Another property that affects the rate of convergence is row ordering. The initial ordering of tomography problems is often rather poor, which yields a slow rate of convergence. A common remedy that helps to overcome this difficulty is an initial random shuffle. The shuffle is likely to improve the randomality of the rows' directions and, therefore, to sharpen the anomaly phenomenon. The experiments that we have done illustrate this feature.
Repeating the use of a random shuffle at each iteration gives rise to a new randomized algorithm, the Shuffled Kaczmarz method of Oswald and Zhou [21,22], which is not inferior to the celebrated Randomized Kaczmarz method of Strohmer and Vershynin [24]. However, one consequence of our experiments is that randomized methods are not faster than Kaczmarz method with one initial random shuffle.
In our experiments, the random shuffle is based on a random permutations generator. Yet, following Herman and Meyer [15], it is possible to construct an improved initial shuffle that takes advantage of the special structure of tomography problems. The idea is to seek a permutation that improves the orthogonality between adjacent rows. In general, there is no easy way to achieve this task, but the special structure of tomography problems enables effective solutions of this problem, e.g., [11,15,17,18]. As with random shuffles, the use of optimal ordering is expected to sharpen the anomaly phenomenon. However, the testing of this issue is left to future research.