Block Kaczmarz–Motzkin Method via Mean Shift Clustering

Abstract: Solving systems of linear equations is a fundamental problem in mathematics. Combining mean shift clustering (MS) with greedy techniques, this paper proposes a novel block version of the Kaczmarz–Motzkin method (BKMS), in which the blocks are predetermined by MS clustering. Using a greedy strategy that collects, per iteration, the row indices of the linear subsystem with almost-maximum distance, the method can be regarded as an efficient extension of the sampling Kaczmarz–Motzkin algorithm (SKM). The new method converges linearly to the least-norm solution when the system is consistent. Several examples show that the BKMS algorithm is more efficient than other methods (for example, RK, Motzkin, GRK, SKM, RBK, and GRBK).


Introduction
This paper is concerned with the approximate solution of a large-scale linear system of equations of the form

Ax = b,  A ∈ R^{m×n},  b ∈ R^m,   (1)

where x ∈ R^n needs to be determined. If m > n, the linear system (1) with such a thin (or tall) coefficient matrix A is overdetermined and often inconsistent. In this case, we are interested in computing the least-squares solution of system (1),

x_LS = arg min_{x ∈ R^n} ‖b − Ax‖_2^2,   (2)

where ‖·‖_2 denotes the Euclidean norm. If m < n, the linear system (1) with such a fat (or flat) matrix A is underdetermined. If the linear system is consistent and has many solutions, the least Euclidean-norm solution

x_* = arg min_{x ∈ R^n} { ‖x‖_2 : Ax = b }   (3)

is often considered. In this paper, we focus on approximating the least-norm solution when (1) is consistent. The Kaczmarz method in [1], also called the algebraic reconstruction technique (ART), is a popular algorithm for solving the linear system (1) and has a wide range of applications, such as image reconstruction in computerized tomography [2,3], signal processing [4], and distributed computing [5,6]. In the classical Kaczmarz method [1], the rows of the matrix A are cycled through in a fixed order, and the current iterate is orthogonally projected onto the hyperplane determined by the selected row. For a given initial solution x_0, the iteration scheme of the Kaczmarz method can be written as

x_{k+1} = x_k + ((b_{i_k} − A_{i_k} x_k) / ‖A_{i_k}‖_2^2) A_{i_k}^T,   (4)

where i_k = mod(k, m) + 1, k = 0, 1, ..., A_i denotes the ith row of A, b_i denotes the ith element of b, and A^T denotes the transpose of A.
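As a concrete illustration of the cyclic sweep (4), the following is a minimal NumPy sketch; the function name and interface are illustrative and not part of the original presentation.

```python
import numpy as np

def kaczmarz(A, b, x0, num_iters):
    """Classical (cyclic) Kaczmarz sweep: one row projection per iteration."""
    x = np.array(x0, dtype=float)
    m = A.shape[0]
    for k in range(num_iters):
        i = k % m                      # cyclic choice, i_k = mod(k, m) + 1 in 1-based indexing
        Ai = A[i]
        # Orthogonal projection of x onto the hyperplane {z : A_i z = b_i}, cf. (4)
        x += (b[i] - Ai @ x) / (Ai @ Ai) * Ai
    return x
```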
For the consistent linear system (1), Strohmer and Vershynin [7] in 2009 presented a randomized Kaczmarz (RK) algorithm with expected exponential convergence. The RK algorithm selects the working row i_k at each iteration with probability ‖A_{i_k}‖_2^2 / ‖A‖_F^2. A theoretical analysis showed that, compared with passing through the rows in their natural order, selecting the working row at random can greatly improve the convergence speed. In [8], Bai and Wu designed a greedy randomized Kaczmarz (GRK) algorithm. GRK uses a greedy probability strategy, derived from the maximum distance rule, to grasp larger entries of the residual vector, so that a working row with a larger residual entry is selected. The theoretical analysis shows that GRK converges to the least-norm solution at an expected exponential rate, and the numerical experiments illustrate that GRK converges much faster than RK. In [9], De Loera et al. proposed a sampling Kaczmarz–Motzkin (SKM) method for linear feasibility problems by combining the RK method with the Motzkin method [10]. This method mainly consists of two stages. In the first stage, the row index set {1, 2, ..., m} of the matrix A is divided into multiple subsets T = {τ_1, τ_2, ..., τ_p}; in the second stage, one subset τ_k, k ∈ [p], is extracted uniformly at random per iteration, the working row is determined by i_k = arg max_{i ∈ τ_k} |b_i − A_i x_k|^2, and x_k is updated by (4). In particular, the SKM method overcomes weaknesses of both the RK and Motzkin methods: the RK method may converge slowly because it does not utilize a greedy strategy, while computing the full residual vector at every iteration is expensive for the Motzkin method. However, the convergence of SKM may still be slow since it enforces only one constraint per iteration.
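A minimal sketch of one SKM iteration, following the two-stage description above (uniform block sampling, then the Motzkin rule within the block); the helper name and the random generator argument are illustrative assumptions.

```python
import numpy as np

def skm_step(A, b, x, partition, rng):
    """One SKM iteration: sample a block uniformly, then project onto the row
    with the largest residual entry inside that block."""
    tau = partition[rng.integers(len(partition))]   # uniformly sampled block of row indices
    r = b[tau] - A[tau] @ x                         # residual restricted to the block
    i = tau[np.argmax(np.abs(r))]                   # working row i_k = arg max |b_i - A_i x|
    Ai = A[i]
    return x + (b[i] - Ai @ x) / (Ai @ Ai) * Ai     # Kaczmarz update (4) on the chosen row
```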
The block Kaczmarz methods, which enforce multiple row constraints at once, have received extensive attention for their fast convergence properties. Block iterative methods were first presented by Elfving [11] and Eggermont et al. [12] to solve (1). Needell et al. [13] then proposed a randomized block Kaczmarz (RBK) algorithm for the least-squares problem, which selects a subsystem at random from a predetermined partition and converges linearly in expectation to the least-norm solution. The randomized block Kaczmarz algorithm iterates as

x_{k+1} = x_k + A^†_{τ_{i_k}} (b_{τ_{i_k}} − A_{τ_{i_k}} x_k),   (5)

where A_{τ_{i_k}} and b_{τ_{i_k}} are the submatrix of A and the subvector of b indexed by τ_{i_k} ⊆ {1, 2, ..., m}, respectively, and A^†_{τ_{i_k}} represents the Moore–Penrose pseudoinverse of the subsystem matrix A_{τ_{i_k}}. Although the existence of such a "good" paving of all rows is theoretically guaranteed, it is not always effortless to find one. Liu and Gu [14] designed a greedy randomized block Kaczmarz (GRBK) method that chooses a "good" target block by embedding a greedy strategy over the predetermined subsystems, derived from the probability criterion proposed by Bai and Wu in [8]. The convergence theory of GRBK shows that the iterates converge linearly in expectation to the least-norm solution. However, the underlying partitioning may not be a good partition, since it only performs an average partitioning of the row index set. Many other variants of the block Kaczmarz method have recently received extensive attention; see, for example, [15-21] and the references therein.
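The block update (5) can be applied without forming the pseudoinverse explicitly; the sketch below uses a least-squares solve, which yields the same minimum-norm increment (the function name and interface are illustrative assumptions).

```python
import numpy as np

def rbk_step(A, b, x, partition, rng):
    """One RBK iteration (5): project onto the solution set of a uniformly
    sampled subsystem A_tau x = b_tau."""
    tau = partition[rng.integers(len(partition))]
    A_tau, b_tau = A[tau], b[tau]
    # Minimum-norm solution of A_tau d = b_tau - A_tau x equals A_tau^dagger (b_tau - A_tau x)
    d, *_ = np.linalg.lstsq(A_tau, b_tau - A_tau @ x, rcond=None)
    return x + d
```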
By exploring the structural properties of the coefficient matrix A, various criteria can be used to determine the row partition (see [13,14,16,22]). In the context of data science, mean shift clustering (MS) is an iterative algorithm based on kernel density estimation [23] that is widely used for data clustering. Combining a clustering algorithm with a greedy technique is therefore very natural. Inspired by this, to improve the efficiency of the block Kaczmarz method, we construct a novel block Kaczmarz–Motzkin method (BKMS(δ, η)) that first determines the block with the largest residual and then applies an almost-maximum distance criterion to that block to collect the row indices whose residual entries are close to the largest entry (in absolute value) as the iteration index set. The row partition is obtained by MS clustering, with the row correlation coefficients of the system matrix A serving as the clustering criterion. Moreover, unlike k-means clustering [16,24], only the bandwidth δ ∈ (0, 1] needs to be initialized for automatic partitioning, instead of pre-specifying the number of partitions κ, which depends on the initialized cluster centers. This paper is organized as follows. The RBK method in [13] and the GRBK method in [14] are summarized in Section 2. Section 3 briefly introduces MS clustering, then proposes the BKMS method and establishes its convergence theory when (1) is consistent. Section 4 reports several examples and applications. We conclude and discuss future work in Section 5.

Block Kaczmarz Algorithm and Its Variants
Throughout this paper, for a matrix Q = (q_{ij}), we denote by Q_i, Q_{τ_k}, ‖Q‖_2, ‖Q‖_F = (Σ_{i,j} |q_{ij}|^2)^{1/2}, and Q^T the ith row, a row submatrix (τ_k is a row index block), the spectral norm, the Frobenius norm, and the transpose of Q, respectively. If any two rows of Q are independently and identically distributed (IID), the correlation coefficient between a pair of rows of Q measures their linear dependence; the row correlation coefficients of A serve as the clustering criterion in Section 3. In addition, σ_max(Q) and σ_min(Q) are the largest and smallest positive singular values of Q, respectively. Moreover, p_i denotes the ith component of a vector p = (p_i) ∈ R^m. For a constant c ∈ R, ⌊c⌋ denotes the largest integer not exceeding c.

The RBK Algorithm
The RBK method [13], described by Needell and Tropp to solve the overdetermined least-squares problem (2), is summarized in Algorithm 1.

Algorithm 1 The randomized block Kaczmarz method (RBK)
Input: A ∈ R^{m×n} standardized (i.e., ‖A_i‖_2 = 1, i ∈ [m]), b ∈ R^m, x_0 ∈ R^n, l ∈ N_+, and a partition T = {τ_1, τ_2, ..., τ_p}, p ≤ m
Output: x_l
for k = 0, 1, ..., l − 1 do
    Select a block τ_{i_k} uniformly at random from T
    Update x_{k+1} by (5)
end for

The restatement of Lemma 1 in [13] guarantees the linear convergence of the RBK method.

Lemma 1. Assume that the overdetermined linear system (1) has a full-rank, row-normalized coefficient matrix A (i.e., ‖A_i‖_2 = 1, i ∈ [m]). Let α ≤ σ²_min(A_τ) and β ≥ σ²_max(A_τ) for each τ ∈ T. Given an initial vector x_0 ∈ R(A^T), the sequence {x_k}_{k≥0} generated by the RBK method converges exponentially in expectation to the least-squares solution x_* = A^†b of (1); at each iteration, the expected squared error contracts by a factor governed by σ²_min(A), α, and β, up to an additive term proportional to the residual of the least-squares solution.

According to Lemma 1, the convergence rate of RBK is limited by σ²_min(A): if A is ill-conditioned, the upper bound on the convergence error of RBK is large. In addition, since we assume throughout that the linear system is consistent, the additive residual term on the right-hand side of the bound is zero. We therefore denote the resulting convergence factor of RBK by ρ_RBK.

The GRBK Algorithm
Motivated by the GRK method, Liu and Gu [14] applied greedy sampling to the RBK method in place of uniform sampling and proposed the GRBK method. Algorithm 2 summarizes the GRBK method for solving (1), as follows.

Algorithm 2 The greedy randomized block Kaczmarz method (GRBK)

Input: A ∈ R^{m×n}, b ∈ R^m, x_0 ∈ R^n, l ∈ N_+, and a partition T = {τ_1, τ_2, ..., τ_p}
Output: x_l
for k = 0, 1, ..., l − 1 do
    Determine the working block τ_{i_k} ∈ T by the greedy probability criterion of [14]
    Update x_{k+1} by (5)
end for

According to Algorithm 2, if each block contains only one row, then GRK in [8] is a special case of the GRBK method. Liu and Gu proved that the GRBK algorithm has a smaller convergence upper bound than RBK and demonstrated the effectiveness of GRBK with several numerical experiments. The result in [14] is restated as follows.

Lemma 2. For a fixed partition T = {τ_1, τ_2, ..., τ_p} and a given initial vector x_0 ∈ R(A^T), the sequence {x_k}_{k≥0} generated by the GRBK method converges exponentially in expectation to the least-norm solution x_* = A^†b of (1).

By Lemma 2, the convergence factor of the GRBK method is bounded by an estimate that is smaller than ρ_RBK when A is row normalized.
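The greedy probability criterion of [14] is not reproduced here; purely as an illustration of residual-adaptive block selection, the following sketch replaces it with a simpler deterministic rule (pick the block with the largest residual sub-vector) followed by the block update (5). This is an assumption-laden simplification, not the GRBK selection rule itself.

```python
import numpy as np

def greedy_block_step(A, b, x, partition):
    """Simplified greedy block step: choose the block whose residual sub-vector
    is largest in norm, then apply the block projection (5)."""
    norms = [np.linalg.norm(b[tau] - A[tau] @ x) for tau in partition]
    tau = partition[int(np.argmax(norms))]          # residual-adaptive (greedy) block choice
    d, *_ = np.linalg.lstsq(A[tau], b[tau] - A[tau] @ x, rcond=None)
    return x + d
```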

Block Kaczmarz-Motzkin Methods with Mean Shift Clustering
The greedy strategy in the SKM method ensures that the entry with the largest residual within the sampled block is prioritized. We hope to exploit the correlations between rows of the subsystem matrices to determine the row partition of the coefficient matrix A in an efficient way. In the framework of big data, the mean shift method (MS) in [25] is a clustering algorithm that divides data into different categories. Therefore, the combination of MS clustering and the Kaczmarz method with greedy techniques comes naturally. Inspired by this, a novel block Kaczmarz–Motzkin method based on MS clustering is proposed. In this section, we first restate the sketch of MS clustering, then design a new, more practical block Kaczmarz–Motzkin method and establish its convergence theory.

The Mean Shift Clustering
MS clustering is an iterative algorithm based on kernel density estimation. It mainly consists of three steps, as follows. The first step is to select a point x̄_i, i ∈ {1, 2, ..., m}, randomly from the unclassified data items X = {x̄_1, x̄_2, ..., x̄_m} as the center point y_j. The second step is to find all the data points in the ball centered at y_j with radius δ/2, that is,

S_δ(y_j) = { x̄_i : ‖x̄_i − y_j‖_2 ≤ δ/2 }.

Considering that these points belong to the cluster C_j, compute the vector from the center point y_j to each element x̄_i in the set S_δ(y_j) and combine these vectors into the mean shift vector m(y_j),

m(y_j) = ( Σ_{x̄_i ∈ S_δ(y_j)} g(‖(y_j − x̄_i)/δ‖^2) x̄_i ) / ( Σ_{x̄_i ∈ S_δ(y_j)} g(‖(y_j − x̄_i)/δ‖^2) ) − y_j,

where x̄_i ∈ S_δ(y_j) and the Gaussian kernel function g(·) assigns a different weight to each data point. The center point is then moved in the direction of the mean shift vector m(y_j) by ‖m(y_j)‖; that is, the current center point is updated by y_{j+1} = m(y_j) + y_j (j = 1, 2, ...). This process is repeated until the iterative sequence {y_j}_{j≥1} satisfies ‖y_{j+1} − y_j‖_2^2 < ε, where ε is a user-preset threshold. The last step is to repeat the above two steps until all points are classified. Algorithm 3 summarizes the MS algorithm in [26] for data clustering.
According to Algorithm 3, one knows that, unlike the k-means algorithm in [16] and the random partition method in [13], the MS clustering algorithm does not require us to predetermine the number of clusters. For details, we refer to [25,26].

Algorithm 3 The mean shift method for clustering (MS)
Input: A dataset X = {x̄_1, x̄_2, ..., x̄_m} containing m data items, the threshold ε, and the bandwidth δ > 0.
Output: Estimated cluster centers y*_1, y*_2, ..., y*_k, where k is the number of clusters.
Set j ← 1
while unclassified data points remain do
    Select an initial data point x̄_j randomly from the unclassified points of X and let y_j = x̄_j
    repeat
        Estimate the mean shift vector m(y_j) as above
        Update the cluster center y_{j+1} = m(y_j) + y_j
    until ‖y_{j+1} − y_j‖_2^2 ≤ ε
    y*_j ← y_j, j ← j + 1
end while
Merge the cluster centers with distances smaller than δ
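As a rough, one-dimensional illustration of Algorithm 3, the following NumPy sketch shifts one scalar feature per row of A with a Gaussian kernel and merges nearby centers; shifting every feature (rather than only unclassified points) and the final nearest-center assignment are simplifying assumptions, and the preprocessing that actually produces the dataset X is described in the next subsection.

```python
import numpy as np

def mean_shift_1d(features, delta, eps=1e-6, max_iter=200):
    """Minimal mean shift on scalar features; returns a list of row-index blocks."""
    features = np.asarray(features, dtype=float)
    centers = np.empty_like(features)
    for idx, f in enumerate(features):
        y = f
        for _ in range(max_iter):
            in_window = np.abs(features - y) <= delta / 2              # points in S_delta(y)
            w = np.exp(-(((features - y) / delta) ** 2)) * in_window   # Gaussian kernel weights
            y_new = np.sum(w * features) / np.sum(w)                   # shifted center
            done = (y_new - y) ** 2 <= eps
            y = y_new
            if done:
                break
        centers[idx] = y
    # Merge converged centers that lie within delta of each other
    merged = []
    for c in centers:
        if not any(abs(c - mc) < delta for mc in merged):
            merged.append(c)
    merged = np.asarray(merged)
    # Assign every row to its nearest merged center; each cluster is one row block
    labels = np.argmin(np.abs(features[:, None] - merged[None, :]), axis=1)
    return [np.flatnonzero(labels == j) for j in range(len(merged))]
```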

Block Kaczmarz-Motzkin Method with Mean Shift
The following describes in detail how to combine MS clustering and the Kaczmarz–Motzkin method. Considering the row partition of A, we first construct a dataset X from the rows of A, where the reference index I ∈ {1, 2, ..., m} is selected at random from the set of indices attaining the maximum row norm of A (if there are multiple rows with the maximum norm), and F(row) denotes a linear mapping acting on the rows of A. By applying the MS algorithm, we obtain a partition T̃ = {τ̃_1, τ̃_2, ..., τ̃_p} of {1, 2, ..., m}, i.e., A = [A_{τ̃_1}; A_{τ̃_2}; ...; A_{τ̃_p}] and b = [b_{τ̃_1}; b_{τ̃_2}; ...; b_{τ̃_p}].
Similar to SKM, at the kth iteration, if ‖b_{τ̃_s} − A_{τ̃_s} x_k‖_2^2 > ‖b_{τ̃_i} − A_{τ̃_i} x_k‖_2^2 for all i ≠ s ∈ [p], then block τ̃_s is sampled rather than block τ̃_i. Considering that an extracted block may be very large, we introduce an almost-maximum distance criterion within each iteration to control the size of the iteration index set. Following this idea, the block Kaczmarz–Motzkin method with MS clustering is summarized in Algorithm 4; an illustrative sketch of one such iteration is given below.
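Before stating Algorithm 4 formally, the following is a minimal sketch of one such iteration; the factor (1 − η) used here to realize the almost-maximum distance rule, as well as the function interface, are illustrative assumptions rather than a verbatim transcription of the algorithm.

```python
import numpy as np

def bkms_step(A, b, x, partition, eta):
    """One BKMS-style iteration (sketch): largest-residual block, then an
    'almost-maximum distance' subset of its rows, then the block update (5)."""
    norms = [np.linalg.norm(b[tau] - A[tau] @ x) for tau in partition]
    block = np.asarray(partition[int(np.argmax(norms))])     # block with the largest residual
    r = b[block] - A[block] @ x
    keep = np.abs(r) >= (1.0 - eta) * np.max(np.abs(r))      # rows close to the largest entry
    tau_k = block[keep]
    d, *_ = np.linalg.lstsq(A[tau_k], b[tau_k] - A[tau_k] @ x, rcond=None)
    return x + d
```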

Algorithm 4 The block Kaczmarz–Motzkin method with mean shift clustering (BKMS(δ, η))

Input: A ∈ R^{m×n}, b ∈ R^m, x_0 ∈ R^n, l ∈ N_+, the bandwidth δ ∈ (0, 1], and the parameter η
Output: x_l
1: Obtain the dataset X by data preprocessing on A
2: Apply the MS(δ) method to X to obtain the partition T̃ = {τ̃_1, τ̃_2, ..., τ̃_p}, p ≥ 1, of the row index set {1, 2, ..., m}
3: for k = 0, 1, ..., l − 1 do
4:     Determine the block τ̃_{i_k} with the largest residual norm ‖b_{τ̃} − A_{τ̃} x_k‖_2^2 over τ̃ ∈ T̃
5:     Compute the almost-maximum distance threshold from the largest entry (in absolute value) of the residual b_{τ̃_{i_k}} − A_{τ̃_{i_k}} x_k and the parameter η
6:     Collect the iteration index set τ_k ⊆ τ̃_{i_k} of rows whose residual entries reach this threshold
7:     Update x_{k+1} by (5) with the index set τ_k
8: end for

Note that F is a linear map; in fact, for all x, y ∈ {1, 2, ..., m} and a, b ∈ R, F(ax + by) = aF(x) + bF(y). Now, the convergence of the BKMS algorithm is analyzed. The following fact, guaranteed in [2], is required.

Lemma 3. If A ∈ R^{m×n} is a nonzero real matrix, then ‖Au‖_2^2 ≥ σ²_min(A) ‖u‖_2^2 for any u ∈ range(A^T).
The following theorem gives the convergence rate of the BKMS method (Algorithm 4) when (1) is consistent.

Theorem 1. Let {τ_p}_{p≥1} be the target blocks generated by Algorithm 3, and assume that ζ ≥ max_τ ‖A_τ‖_F^2 and α ≥ max_τ |τ| for each τ ∈ {τ_p}_{p≥1}, where |τ| is the cardinality of τ. Given an initial guess x_0 ∈ R(A^T), the sequence {x_k}_{k≥0} generated by the BKMS method (Algorithm 4) converges linearly to the least-norm solution x_* = A^†b of (1); furthermore, the error estimate (6) holds, where σ²_max(A_{τ_k}) is the largest nonzero singular value of A_{τ_k} and N_{k−1} is the cardinality of the set τ_{k−1}.
Proof. According to the update (5) used in Algorithm 4, x_{k+1} − x_k belongs to the column space of the matrix A^†_{τ_k} A_{τ_k}. From the iteration scheme of the BKMS method, one obtains an orthogonal decomposition of the error, from which (8) follows. On the other hand, from the algorithm flow (the choice of the largest-residual block and the almost-maximum distance criterion), one obtains (9). With the assumption that ζ ≥ max_τ ‖A_τ‖_F^2 for each τ ∈ {τ_p}_{p≥1}, substituting (9) into (8) gives (10). The second equality there is guaranteed by (7), which yields (11), and substituting (11) into (8) gives (12). The recurrence relation together with the inequalities (10) and (12) results in (6).

Remark 2. The convergence rate of BKMS, denoted by ρ_BKMS, satisfies 0 < ρ_BKMS < 1 under the conditions of Theorem 1; consequently, the BKMS method converges linearly.

Numerical Examples
In this section, the numerical performance of the RK, Motzkin, GRK, RBK, SKM, GRBK, and BKMS algorithms is compared for different settings of the linear system (1). All experiments are run in MATLAB 2020b on a personal computer with an AMD Ryzen 5 4600H processor and 16 GB of memory. The relative solution error,

RSE = ‖x_k − x_*‖_2^2 / ‖x_*‖_2^2,

is used to measure the efficiency of all methods. The initial vector is set to x_0 = (0, 0, ..., 0)^T for BKMS and all other methods, and the iteration process of each method is terminated once RSE ≤ 10^−6. The random partition sampling in [13,14] samples τ_k randomly from the row partition T = {τ_1, τ_2, ..., τ_p} at the kth iteration, where the row index set is split into p subsets τ_i of (nearly) equal size consisting of consecutive indices. The iterative least-squares solver CGLS in [14] is used to solve the underdetermined linear subsystem at each iteration, since it requires fewer floating-point operations (using only matrix-vector products) than the MATLAB function "pinv".
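For reference, the sketch below assembles the nearly equal consecutive row partition and a generic driver loop that stops once RSE ≤ 10^−6; the function names and the common step interface are illustrative assumptions.

```python
import numpy as np

def uniform_partition(m, p):
    """Split the row indices {0, ..., m-1} into p nearly equal consecutive blocks."""
    return np.array_split(np.arange(m), p)

def run_until_converged(step, A, b, x_star, tol=1e-6, max_iter=100000, **kwargs):
    """Run a single-step method until the relative solution error reaches tol."""
    x = np.zeros(A.shape[1])
    for it in range(1, max_iter + 1):
        x = step(A, b, x, **kwargs)
        rse = np.linalg.norm(x - x_star) ** 2 / np.linalg.norm(x_star) ** 2
        if rse <= tol:
            break
    return x, it
```

For instance, a BKMS run could then be triggered as run_until_converged(bkms_step, A, b, x_star, partition=blocks, eta=0.01), with blocks produced by the clustering sketch above.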
This section consists of three parts. In the first part (Examples 1-3), we use MS clustering to visualize the row partition of the coefficient matrix A and verify that BKMS is more competitive than the other methods for solving consistent linear systems (1) with different A. In the second part (Example 4), we consider the effects of the parameters δ and η on the performance of BKMS. In the last part (Examples 5 and 6), we provide two practical applications of the BKMS method. Example 1. This example obtains the partitions of different system matrices by the MS method (Algorithm 3). It is known that an ill-conditioned matrix exhibits high linear correlation between its rows [16]. In Figure 1a, we tested (left) a dense matrix A ∈ R^{1000×500} with condition number cond(A) = 2.60, (middle) a sparse matrix named "Cities" with cond(A) = 207.15, and (right) an ill-conditioned sparse matrix named "relat6" with cond(A) = 4.03 × 10^16. Figure 1b plots the number of rows contained in each block of these matrices. As shown in Figure 1a, rows with similar correlation coefficients μ_k are likely to be allocated to the same block. This may cause the iteration index set τ_k in the BKMS method to become very large, which is verified in Figure 1b. To alleviate this problem, we continue to partition the current block, so it is reasonable to introduce an almost-maximum distance rule in BKMS.

Example 2.
This example considers a linear system (1) with a tall (or fat) coefficient matrix A ∈ R^{m×n}, i.e., m > n (or m < n), whose entries are pseudorandom numbers from a standard normal distribution generated by the MATLAB function randn. Table 1 reports the iteration steps (IT) and CPU time in seconds (CPU(s)) of RK, Motzkin, GRK, SKM, RBK(p), and GRBK(p) with the same number of blocks p = 8, and of BKMS(δ_opt, η). Here, δ_opt refers to the numerically optimal parameter rather than a theoretical one; we further discuss δ_opt in Example 4. Figure 2 shows the numerical performance of all methods; in particular, we set RBK(p), GRBK(p), and BKMS(0.5, 0.1) to have the same number of blocks, p = 5. Figure 2a shows the number of rows contained in each subblock of BKMS, and Figure 2b,c depict the curves of RSE versus IT and RSE versus CPU(s) of Algorithm 4, applied to solve (1) with the coefficient matrix A ∈ R^{1000×500}. In addition, η = 0.01 is used by default for BKMS(δ) in all examples unless otherwise specified.
It is observed from Table 1 that RBK(p), GRBK(p), and BKMS(δ) are almost always superior to RK, Motzkin, GRK, and SKM in terms of iteration steps and CPU time, owing to the fact that a block iteration enforces multiple constraints at once. In addition, among the block Kaczmarz methods, BKMS(δ) is more competitive than RBK(p) and GRBK(p). For example, the CPU-time speed-up of BKMS(δ) over GRBK(p) ranges from 1.206 to 1.610; specifically, the value 1.610 is the ratio of 2.790 s (the CPU time consumed by GRBK(5)) to the CPU time consumed by BKMS.

Example 3. Table 2 summarizes several sparse matrices with different properties, taken from the Florida sparse matrix collection [27]. The density of a matrix A is the percentage of nonzero elements of A. The IT and CPU times for all methods are reported in Table 2. Similar to Example 2, Figure 3a shows the number of rows contained in each subblock of BKMS(0.5), and Figure 3b,c display the plots of RSE versus IT and RSE versus CPU time for all methods applied to solve the sparse linear system (1) named "WorldCities". From Table 2, RBK(p), GRBK(p), and BKMS(δ) require fewer iteration steps and less CPU time. As with the block Kaczmarz methods above, we believe that the BKMS method performs best; for example, the CPU-time speed-up of BKMS(δ) over GRBK(p) ranges from 1.692 to 3.493 for solving (1) with the different sparse matrices A listed in Table 2. From Figure 3, we can see that BKMS converges faster than the other methods. Comparing the curves in Figure 3c carefully, it is not difficult to find that SKM and RBK have similar time efficiencies; the reason is that computing the Moore–Penrose pseudoinverse of large blocks reduces the efficiency of RBK. Hence, the number of coefficient-matrix partitions affects the convergence speed of RBK, GRBK, and BKMS. Based on this, Example 4 analyzes the sensitivity of the parameters (δ, η) in BKMS.

Example 4.
This example investigates the effects of the parameters δ and η on the BKMS method. We use BKMS to solve the system (1) with A ∈ R^{5000×1000} and the sparse system (1) named "ab-taha1", respectively. Tables 3 and 4 list the iteration steps and CPU times of BKMS(δ, η) once RSE < 10^−6. The entry "--" means that the almost-maximum distance rule is abandoned (i.e., steps 5 and 6 in Algorithm 4 are removed). In practice, we usually care more about the CPU time, so we use enumeration to select the optimal parameters δ and η; this tuning method is also called grid search in machine learning. That is, we lock one parameter at its current best value and scan the other to locate its optimum. For example, we select the initialization seeds δ = 0.7 and η = 0.1 for BKMS(δ, η) in Table 3. As depicted in Table 3, with δ = 0.3 fixed and η varying, the number of iteration steps increases while the CPU time first falls and then rises; roughly the same pattern holds for each column. Finding a suitable δ is therefore crucial. When δ becomes larger, the number of blocks becomes smaller, and computing the Moore–Penrose pseudoinverse of a large block is time-consuming. Conversely, when δ becomes smaller, the number of blocks increases; in the extreme case, when A is row-normalized and each block contains only one row index, BKMS is mathematically equivalent to the Motzkin method. A similar result also occurs in Table 4, for which the initialization seeds δ = 0.3 and η = 0.1 can be selected.
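A minimal sketch of the grid search just described, assuming a hypothetical callable solve_bkms(delta, eta) that runs BKMS until RSE ≤ 10^−6 and returns the iteration count; the pair with the smallest elapsed time is retained.

```python
import itertools
import time

def grid_search(solve_bkms, deltas, etas):
    """Enumerate (delta, eta) pairs and keep the one with the smallest CPU time."""
    best = None
    for delta, eta in itertools.product(deltas, etas):
        t0 = time.perf_counter()
        iters = solve_bkms(delta, eta)            # hypothetical BKMS runner
        elapsed = time.perf_counter() - t0
        if best is None or elapsed < best[0]:
            best = (elapsed, delta, eta, iters)
    return best                                   # (time, delta, eta, iterations)
```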
Example 5 (Phillips test problem). This example considers the "phillips" test problem, a classical ill-posed problem whose kernel and exact solution vanish for |x| ≥ 3 and whose discretized coefficient matrix is often severely ill-conditioned. In this case, the ordinary least-squares problem (2) may yield poor approximate solutions, and one effective way to overcome this issue is ridge regression, also called Tikhonov regularization, namely

min_{x ∈ R^n} ‖b − Ax‖_2^2 + λ‖x‖_2^2,

which can be restated as

min_{x ∈ R^n} ‖b̃ − Ãx‖_2^2,  with Ã = [A; √λ I_n] ∈ R^{(n+n)×n} and b̃ = [b; 0],

where λ > 0 is a regularization parameter. The measurement matrix A ∈ R^{n×n}, the exact solution x_* ∈ R^n, and b ∈ R^n are all generated by the MATLAB function phillips(n) in [28]. We set n = 1024 and λ = 0.01, so the square matrix A has size 1024 × 1024, the condition number of A is 2.9043 × 10^10, while that of Ã is 58.04. Figure 4a reports the approximate solutions computed by RK, Motzkin, GRK, SKM, RBK(4), GRBK(4), and BKMS(0.4, 0.8), together with the exact solution of the phillips test problem, when the maximum number of iteration steps is 100; Figure 4b shows the partitioning performance of BKMS(0.4, 0.8). Table 5 lists the RSE, CPU time, and iteration counts of BKMS(0.4, 0.8) compared with RK, Motzkin, GRK, SKM, RBK(4), and GRBK(4) for the "phillips" problem once RSE ≤ 10^−10. Figure 4 shows that the solutions recovered by RBK(4), GRBK(4), and BKMS(0.4, 0.8) are much closer to the exact solution than those recovered by RK, Motzkin, GRK, and SKM. It can also be seen from Table 5 that BKMS attains a smaller RSE than the other methods and needs less CPU time and fewer iterations to reach the stopping rule.
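For completeness, the stacked (augmented) form of the Tikhonov system used in this example can be assembled as in the sketch below; the function name is illustrative, while the construction itself is the standard reformulation stated above.

```python
import numpy as np

def tikhonov_augmented(A, b, lam):
    """Form A_tilde = [A; sqrt(lam) I] and b_tilde = [b; 0] so that
    min ||b - A x||^2 + lam ||x||^2 becomes min ||b_tilde - A_tilde x||^2."""
    n = A.shape[1]
    A_tilde = np.vstack([A, np.sqrt(lam) * np.eye(n)])
    b_tilde = np.concatenate([b, np.zeros(n)])
    return A_tilde, b_tilde
```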
Example 6 (CT reconstruction problem). This example implements the reconstruction of a computerized tomography (CT) image. The test problem is generated by the MATLAB function paralleltomo(N, θ, p) from the AIR Tools package in [29]. We set N = 50, θ = 0°:1°:178°, and p = 60, so the resulting matrix A has size 10,740 × 2500; the exact solution x_* is shown in Figure 5a, and b is obtained by b = Ax_*. BKMS(0.25, 0.3) is applied to reconstruct x_* from b and compared with the RK, SKM, RBK(7), and GRBK(7) methods. Table 6 reports the RSE, CPU time, and iteration steps of BKMS(0.25, 0.3) compared with RK, SKM, RBK(7), and GRBK(7) once RSE ≤ 10^−6, with the maximum iteration number (IT_max) set to 8000.
Figure 5 shows that all methods produce well-restored images. It can be seen from Table 6 that BKMS(0.25, 0.3) needs less CPU time and fewer iteration steps than the other methods for restoring the image.

Conclusions
This work proposes a block Kaczmarz–Motzkin method with mean shift clustering as an efficient extension of the Kaczmarz algorithm for solving large consistent linear systems. The row correlation coefficients of the system matrix are used as the clustering criterion, and the method does not need the number of partitions to be specified in advance. The convergence theory of the BKMS method is established when the linear system is consistent. Several numerical examples demonstrate the effectiveness of the proposed method.
Although the effects of the clustering bandwidth δ and the parameter η are investigated in the numerical examples, the theoretical analysis of how δ affects the performance of BKMS needs further discussion. Moreover, according to the properties of the linear system, choosing a more appropriate clustering criterion may improve the algorithm's performance. We leave these problems as future work.