Conjugate Gradient Algorithm for Least-Squares Solutions of a Generalized Sylvester-Transpose Matrix Equation

Abstract: We derive a conjugate-gradient-type algorithm to produce approximate least-squares (LS) solutions of an inconsistent generalized Sylvester-transpose matrix equation. The algorithm is applicable for any given initial matrix and arrives at an LS solution within finitely many steps. When the matrix equation has many LS solutions, the algorithm can search for the one with minimal Frobenius norm. Moreover, given a matrix Y, the algorithm can find the unique LS solution closest to Y. Numerical experiments show the relevance of the algorithm for square/non-square dense/sparse matrices of medium/large sizes. The algorithm performs well in terms of both the number of iterations and the computation time, compared to the direct Kronecker linearization and well-known iterative methods.


Introduction
Sylvester-type matrix equations are closely related to ordinary differential equations (ODEs) and arise in several problems in control engineering and information sciences; see, e.g., the monographs [1][2][3]. The Sylvester matrix equation AX + XB = E and its famous special case, the Lyapunov equation AX + XA^T = E, have several applications in numerical methods for ODEs, control and system theory, signal processing, and image restoration; see, e.g., [4][5][6][7][8][9]. The Sylvester-transpose equation AX + X^T B = C is utilized in eigenstructure assignment in descriptor systems [10], pole assignment [3], and fault identification in dynamic systems [11]. In addition, if we require the solution X to be symmetric, then the Sylvester-transpose equation coincides with the Sylvester one. A generalized Sylvester equation AXB + CXD = E can be applied to implicit ODEs and to general dynamical linear models for vibration and structural analysis, robot and spaceship control; see, e.g., [12,13].
The mentioned matrix equations are special cases of a generalized Sylvester-transpose matrix equation

AXB + CX^T D = E, (1)

or, more generally,

A_1 X B_1 + ··· + A_k X B_k + C_1 X^T D_1 + ··· + C_l X^T D_l = E, (2)

where all coefficient matrices are given and X is unknown. A direct algebraic method to find a solution of Equation (2) is to use the Kronecker linearization, transforming the matrix equation into an equivalent linear system; see, e.g., [14] (Ch. 4). The same technique, together with the notion of the weighted Moore-Penrose inverse, was adapted to solve coupled inconsistent Sylvester-type matrix equations [15] for least-squares (LS) solutions. Another algebraic method is to apply a generalized Sylvester mapping [13], so that the solution is expressed in terms of polynomial matrices. However, when the sizes of the coefficient matrices are moderate or large, it is inconvenient to use matrix factorizations or other traditional methods, since they require a large amount of memory to compute an exact solution. Thus, the Kronecker linearization and other algebraic methods are only suitable for small matrices. That is why it is important to find solutions that are easy to compute, leading many researchers to develop algorithms that reduce the time and memory usage of solving large matrix equations.
In the literature, there are two notable techniques for deriving iterative procedures to solve linear matrix equations; see more information in the monograph [16]. The conjugate gradient (CG) technique constructs a sequence of approximate solutions whose residual matrices form an orthogonal basis; the desired solution is then obtained at the final iteration. In the last decade, many authors developed CG-type algorithms for Equation (2) and its special cases, e.g., BiCG [17], BiCR [17], CGS [18], GCD [19], GPBiCG [20], and CMRES [21]. The second technique, known as the gradient-based iterative (GI) technique, constructs a sequence of approximate solutions from the gradient of the associated norm-error function. If the parameters of a GI algorithm are chosen carefully, the generated sequence converges to the desired solution. In the last five years, many GI algorithms have been introduced; see, e.g., GI [22,23], relaxed GI [24], accelerated GI [25], accelerated Jacobi GI [26], modified Jacobi GI [27], the gradient-descent algorithm [28], and the global generalized Hessenberg algorithm [29]. For LS solutions of Sylvester-type matrix equations, there are iterative solvers, e.g., [30,31].
Recently, the work [32] developed an effective gradient-descent iterative algorithm to produce approximate LS solutions of Equation (2). When Equation (2) is consistent, a CG-type algorithm was derived to obtain a solution within finitely many steps; see [33]. This work is a continuation of [33]; that is, we consider Equation (1) with rectangular coefficient matrices and a rectangular unknown matrix X. Suppose that Equation (1) is inconsistent. We propose a CG-type algorithm to approximate LS solutions, which will solve the following problems.

Problem 1. Find a matrix X* ∈ R^{n×p} such that ‖AX*B + C(X*)^T D − E‖_F = min_{X ∈ R^{n×p}} ‖AXB + CX^T D − E‖_F.

Let L be the set of least-squares solutions of Equation (1). The second problem is to find an LS solution with the minimal norm:

Problem 2. Find a matrix X* ∈ L such that ‖X*‖_F = min_{X ∈ L} ‖X‖_F.

The last one is to find an LS solution closest to a given matrix:

Problem 3. Given a matrix Y ∈ R^{n×p}, find a matrix X̂ ∈ L such that ‖X̂ − Y‖_F = min_{X ∈ L} ‖X − Y‖_F.

Moreover, we extend our studies to the matrix Equation (2). We verify the results from theoretical and numerical points of view.
The organization of this article is as follows. In Section 2, we recall preliminary results from matrix theory that will be used in later discussions. In Section 3, we explain how the Kronecker linearization transforms Equation (1) into an equivalent linear system to obtain LS solutions. In Section 4, we propose a CG-type algorithm to solve Problem 1 and verify the theoretical capability of the algorithm. After that, Problems 2 and 3 are investigated in Sections 5 and 6, respectively. To verify the theory, we provide numerical experiments in Section 7 to show the applicability and efficiency of the algorithm, compared to the Kronecker linearization and recent iterative algorithms. We summarize the whole work in the last section.

Auxiliary Results from Matrix Theory
Throughout, let us denote by R^{m×n} the set of all m-by-n real matrices. Recall that the standard (Frobenius) inner product of A, B ∈ R^{m×n} is defined by

⟨A, B⟩ := tr(B^T A) = tr(AB^T). (5)

If ⟨A, B⟩ = 0, we say that A is orthogonal to B. A well-known property (6) of the inner product holds for any matrices A, B, C, D with appropriate dimensions. The Frobenius norm of a matrix A ∈ R^{m×n} is defined by ‖A‖_F := ⟨A, A⟩^{1/2}. The Kronecker product A ⊗ B of A = [a_ij] ∈ R^{m×n} and B ∈ R^{p×q} is defined to be the mp-by-nq matrix whose (i, j)-th block is a_ij B.
Lemma 1 ([14]). For any real matrices A and B of conformable sizes, the Kronecker product A ⊗ B satisfies the usual transpose and mixed-product rules.

The vector operator Vec(·) transforms a matrix A = [a_ij] ∈ R^{m×n} into the mn-dimensional vector obtained by stacking its columns, namely Vec A = [a_11 ··· a_m1 a_12 ··· a_m2 ··· a_1n ··· a_mn]^T. The vector operator is bijective, linear, and related to the usual matrix multiplication as follows.
Lemma 2 ([14]). For any A ∈ R^{m×n}, B ∈ R^{n×p} and C ∈ R^{p×q}, we have Vec(ABC) = (C^T ⊗ A) Vec B.

For each m, n ∈ N, we define a commutation matrix P(m, n) := Σ_{i=1}^{m} Σ_{j=1}^{n} E_ij ⊗ E_ij^T, where the (i, j)-th entry of E_ij ∈ R^{m×n} is 1 and all other entries are 0. Indeed, P(m, n) acts on a vector by permuting its entries as follows.
Lemma 3 ([14]). For any matrix A ∈ R^{m×n}, we have Vec(A^T) = P(m, n) Vec A.

Moreover, commutation matrices permute the entries of A ⊗ B as follows.
Lemma 4 ([14]). For any A ∈ R^{m×n} and B ∈ R^{p×q}, we have P(p, m)(A ⊗ B) = (B ⊗ A) P(q, n).

The next result will be used in the later discussions.
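To make the vectorization machinery concrete, the following small NumPy sketch numerically checks the identities behind Lemmas 2-4. It is purely illustrative: the helper names (vec, commutation_matrix) are ours, not from the paper, and P(m, n) is built directly from the unit matrices E_ij as defined above.

```python
import numpy as np

def vec(A):
    # Column-stacking vector operator Vec(.)
    return A.reshape(-1, order="F")

def commutation_matrix(m, n):
    # P(m, n) = sum_{i,j} E_ij (x) E_ij^T, with E_ij in R^{m x n}
    P = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            E = np.zeros((m, n))
            E[i, j] = 1.0
            P += np.kron(E, E.T)
    return P

rng = np.random.default_rng(0)
m, n, p, q = 3, 4, 5, 2
A, B, C = rng.random((m, n)), rng.random((n, p)), rng.random((p, q))

# Lemma 2: Vec(ABC) = (C^T (x) A) Vec(B)
print(np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B)))        # True

# Lemma 3: P(m, n) Vec(A) = Vec(A^T)
print(np.allclose(commutation_matrix(m, n) @ vec(A), vec(A.T)))     # True

# Lemma 4: P(p, m)(A (x) G) = (G (x) A) P(q, n) for G in R^{p x q}
G = rng.random((p, q))
lhs = commutation_matrix(p, m) @ np.kron(A, G)
rhs = np.kron(G, A) @ commutation_matrix(q, n)
print(np.allclose(lhs, rhs))                                        # True
```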

Least-Squares Solutions via the Kronecker Linearization
From now on, we investigate the generalized Sylvester-transpose matrix Equation (1), with coefficient matrices A ∈ R^{m×n}, B ∈ R^{p×q}, C ∈ R^{m×p}, D ∈ R^{n×q}, E ∈ R^{m×q}, and a rectangular unknown matrix X ∈ R^{n×p}. We focus our attention on the case when Equation (1) is inconsistent. In this case, we seek an LS solution, that is, a matrix X* that solves the following minimization problem:

min_{X ∈ R^{n×p}} ‖AXB + CX^T D − E‖_F.

A traditional algebraic way to solve a linear matrix equation is the Kronecker linearization: transform the matrix equation into an equivalent linear system using the notions of the vector operator and Kronecker products. Indeed, applying the vector operator to Equation (1) and using Lemmas 2 and 3 yields

(B^T ⊗ A) Vec X + (D^T ⊗ C) P(n, p) Vec X = Vec E.

Let us denote x = Vec X, e = Vec E, and M = B^T ⊗ A + (D^T ⊗ C) P(n, p). Thus, a matrix X is an LS solution of Equation (1) if and only if x is an LS solution of the linear system Mx = e, or equivalently, a solution of the associated normal equation

M^T M x = M^T e. (14)

The linear system (14) is always consistent, i.e., Equation (1) always has an LS solution.
From the normal Equation (14) and Lemmas 2 and 3, we can deduce:

Lemma 6 ([34]). Problem 1 is equivalent to the following consistent matrix equation:

A^T(AXB + CX^T D)B^T + D(AXB + CX^T D)^T C = A^T E B^T + D E^T C. (15)

Moreover, the normal Equation (14) has a unique solution if and only if the matrix M is of full-column rank, i.e., M^T M is invertible. In this case, the unique solution is given by x* = (M^T M)^{-1} M^T e, and the LS error can be computed as follows:

‖Mx* − e‖^2 = e^T e − e^T M (M^T M)^{-1} M^T e. (16)

If M is not of full-column rank (i.e., the kernel of M is nontrivial), then the normal Equation (14) has many solutions. In this case, the LS solutions appear in the form x* = M^† e + u, where M^† is the Moore-Penrose inverse of M and u is an arbitrary vector in the kernel of M. Among all these solutions,

x* = M^† e (17)

is the unique one having minimal norm.
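The linearization above is easy to prototype. The following NumPy sketch (illustrative only; the function names kron_linearize and commutation_matrix are ours, not from the paper) builds M = B^T ⊗ A + (D^T ⊗ C)P(n, p), computes the minimal-norm LS solution via the pseudoinverse as in (17), and checks that the vectorized and matrix forms of the residual agree.

```python
import numpy as np

def commutation_matrix(m, n):
    # P(m, n) Vec(A) = Vec(A^T) for A in R^{m x n} (index-based construction)
    P = np.zeros((m * n, m * n))
    idx = np.arange(m * n).reshape(m, n, order="F")
    P[np.arange(m * n), idx.T.reshape(-1, order="F")] = 1.0
    return P

def kron_linearize(A, B, C, D):
    # M = B^T (x) A + (D^T (x) C) P(n, p), so that M Vec(X) = Vec(AXB + C X^T D)
    n, p = A.shape[1], B.shape[0]
    return np.kron(B.T, A) + np.kron(D.T, C) @ commutation_matrix(n, p)

rng = np.random.default_rng(1)
m, n, p, q = 6, 4, 3, 5
A, B = rng.random((m, n)), rng.random((p, q))
C, D = rng.random((m, p)), rng.random((n, q))
E = rng.random((m, q))                       # a generic (inconsistent) right-hand side

M, e = kron_linearize(A, B, C, D), E.reshape(-1, order="F")
x_star = np.linalg.pinv(M) @ e               # minimal-norm LS solution, Equation (17)
X_star = x_star.reshape((n, p), order="F")

print(np.linalg.norm(M @ x_star - e))                                # sqrt of the LS error (16)
print(np.linalg.norm(A @ X_star @ B + C @ X_star.T @ D - E, "fro"))  # same value in matrix form
```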

Least-Squares Solution via a Conjugate Gradient Algorithm
In this section, we propose a CG-type algorithm to solve Problem 1. We do not impose any assumption on the matrix M, so LS solutions of Equation (1) may not be unique.
We shall adapt the conjugate-gradient technique to solve the equivalent matrix Equation (15). Recall that the set of LS solutions of Equation (1) is denoted by L. From Lemma 6, observe that the residual of a matrix X ∈ R^{n×p} according to Equation (1) is given by

R_X = A^T(E − AXB − CX^T D)B^T + D(E − AXB − CX^T D)^T C. (18)

Lemma 6 states that X ∈ L if and only if R_X = 0. From this, we propose the following algorithm. Indeed, the next approximate solution X_{r+1} is obtained from the current approximation X_r by moving along a search direction U_{r+1} with a suitable step size.
Algorithm 1: A conjugate gradient iterative algorithm for Equation (1).

Remark 1. To terminate the algorithm, one can alternatively set the stopping rule to be ‖R_r‖_F − δ < ε, where δ := ‖Mx* − e‖ is the positive square root of the LS error described in Equation (16) and ε > 0 is a small tolerance.
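Since the iteration is, in essence, a conjugate gradient method applied to the normal Equation (15), the following NumPy sketch illustrates one way such a CG-type least-squares iteration can be organized. It is a CGLS-style adaptation to the operator L(X) = AXB + CX^T D and its adjoint L*(S) = A^T S B^T + D S^T C under the Frobenius inner product; it is offered as an illustration under these assumptions, not as a line-by-line transcription of Algorithm 1, and the function and variable names are ours.

```python
import numpy as np

def cg_sylvester_transpose_ls(A, B, C, D, E, X0=None, tol=1e-10, max_iter=None):
    """CGLS-style sketch for min_X || A X B + C X^T D - E ||_F."""
    n, p = A.shape[1], B.shape[0]
    L    = lambda X: A @ X @ B + C @ X.T @ D        # Sylvester-transpose operator
    Ladj = lambda S: A.T @ S @ B.T + D @ S.T @ C    # its adjoint w.r.t. <., .>_F
    X = np.zeros((n, p)) if X0 is None else X0.copy()
    S = E - L(X)                                    # residual of Equation (1)
    R = Ladj(S)                                     # residual of the normal Equation (15), cf. (18)
    U = R.copy()                                    # initial search direction
    gamma = np.sum(R * R)                           # ||R||_F^2
    max_iter = n * p if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.sqrt(gamma) <= tol:                   # ||R_r||_F small enough: X is (nearly) an LS solution
            break
        LU = L(U)
        alpha = gamma / np.sum(LU * LU)             # exact line-search step along U
        X = X + alpha * U
        S = S - alpha * LU
        R = Ladj(S)
        gamma_new = np.sum(R * R)
        U = R + (gamma_new / gamma) * U             # conjugate update of the search direction
        gamma = gamma_new
    return X
```

In exact arithmetic, a CGLS-type iteration of this kind reaches an LS solution in at most np steps, which is the behaviour that Theorem 1 below establishes for Algorithm 1.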
For any given initial matrix X_0, we will show that Algorithm 1 generates a sequence of approximate solutions X_r of Equation (1) such that the set of residual matrices R_r is orthogonal. It follows that an LS solution will be obtained within finitely many steps.

Lemma 7. Assume that the sequences {R_r} and {H_r} are generated by Algorithm 1. We get

Proof. From Algorithm 1, we have that for any r,

Lemma 8. The sequences {U_r} and {H_r} generated by Algorithm 1 satisfy

Proof. Using the properties of the Kronecker product and the vector operator in Lemmas 1-5, we have

Lemma 9. The sequences {R_r}, {U_r} and {H_r} generated by Algorithm 1 satisfy tr(R_r^T R_{r-1}) = 0 and tr(U_{r+1}^T H_r) = 0 for any r. (21)
Proof. We use the induction principle to prove (21). In order to calculate the related terms, we utilize Lemmas 7 and 8. For r = 1, direct computations show that both traces vanish, and hence (21) holds for r = 1. Now, we proceed to the inductive step by assuming that tr(R_r^T R_{r-1}) = 0 and tr(U_{r+1}^T H_r) = 0. Then the same computations show that the corresponding identities hold for r + 1. Hence, Equation (21) holds for any r.
Proof. The initial step r = 1 holds due to Lemma 9. Now, assume that Equation (22) is valid for all r = 1, . . . , k. From Lemmas 7 and 8, the claim then follows for r = k + 1. Hence, Equation (22) holds for any r.
Lemma 11. Suppose the sequences {R_r}, {U_r} and {H_r} are constructed from Algorithm 1. Then, for any integers m and n such that m ≠ n, we have tr(R_{m-1}^T R_{n-1}) = 0 and tr(U_m^T H_n) = 0. (23)

Proof. According to Lemma 8 and the fact that tr(R_{m-1}^T R_{n-1}) = tr(R_{n-1}^T R_{m-1}) for any integers m and n, it remains to prove the two equalities in (23) only for m, n such that m > n. For m = n + 1, the two equalities hold by Lemma 9. For m = n + 2, we can write tr(R_{n+1}^T R_{n-1}) and tr(U_{n+2}^T H_n) in terms of tr(R_n^T R_{n-2}) and tr(U_{n+1}^T H_{n-1}), respectively. Continuing this process, we eventually reach the terms tr(R_2^T R_0) and tr(U_3^T H_1). By Lemma 10, we get tr(R_{n+1}^T R_{n-1}) = 0 and tr(U_{n+2}^T H_n) = 0. Similarly, we have tr(R_m^T R_{n-1}) = 0 and tr(U_m^T H_n) = 0 for m = n + 3, . . . , k.

Suppose that the two equalities in (23) hold for all smaller indices. Then, by the same computations, we conclude that tr(R_{m-1}^T R_{n-1}) = 0 and tr(U_m^T H_n) = 0 for any m, n such that m ≠ n.
Theorem 1. Algorithm 1 solves Problem 1 within finitely many steps. More precisely, for any given initial matrix X_0 ∈ R^{n×p}, the sequence {X_r} constructed from Algorithm 1 converges to an LS solution of Equation (1) in at most np iterations.
Proof. Assume that R_r ≠ 0 for r = 0, 1, . . . , np − 1. Suppose, to the contrary, that R_np ≠ 0. By Lemma 11, the set {R_0, R_1, . . . , R_np} of residual matrices is orthogonal in R^{n×p} with respect to the Frobenius inner product (5). Therefore, the set {R_0, R_1, . . . , R_np} of np + 1 nonzero elements is linearly independent. This contradicts the fact that the dimension of R^{n×p} is np. Thus, R_np = 0, and X_np satisfies Equation (15) in Lemma 6. Hence, X_np is an LS solution of Equation (1).
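As a quick numerical illustration of Theorem 1 (reusing the hypothetical helpers cg_sylvester_transpose_ls and kron_linearize sketched above, so this snippet is not self-contained), one can check that after at most np iterations the CG-type sketch attains the same LS error as the pseudoinverse solution of the Kronecker-linearized system.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, p, q = 7, 4, 3, 6
A, B = rng.random((m, n)), rng.random((p, q))
C, D = rng.random((m, p)), rng.random((n, q))
E = rng.random((m, q))

X_cg = cg_sylvester_transpose_ls(A, B, C, D, E, max_iter=n * p)

M, e = kron_linearize(A, B, C, D), E.reshape(-1, order="F")
delta = np.linalg.norm(M @ (np.linalg.pinv(M) @ e) - e)              # sqrt of LS error, cf. (16)
err_cg = np.linalg.norm(A @ X_cg @ B + C @ X_cg.T @ D - E, "fro")
print(abs(err_cg - delta))                                           # close to 0 up to rounding
```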
We adapt the same idea as that for Algorithm 1 to derive an algorithm for Equation (2) as follows.

Algorithm 2: A conjugate gradient iterative algorithm for Equation (2).

The stopping rule of Algorithm 2 may be described as ‖R_r‖_F − δ < ε, where δ is the positive square root of the associated LS error and ε > 0 is a small tolerance.

Theorem 2. Consider Equation (2), where the coefficient matrices are given constant matrices and X ∈ R^{n×p} is an unknown matrix. Assume that the coefficient matrix associated with the Kronecker linearization of Equation (2) is of full-column rank. Then, for any given initial matrix X_0 ∈ R^{n×p}, the sequence {X_r} constructed from Algorithm 2 converges to the unique LS solution.
Proof. The proof is similar to that of Theorem 1.

Minimal-Norm Least-Squares Solution via Algorithm 1
In this section, we investigate Problem 2. That is, we consider the case when the matrix M may not have full-column rank, so that Equation (1) may have many LS solutions. We shall seek an element of L with the minimal Frobenius norm.

Lemma 12. Assume X ∈ L. Then any element X̃ ∈ L can be expressed as X̃ = X + Z for some matrix Z ∈ R^{n×p} satisfying

A^T(AZB + CZ^T D)B^T + D(AZB + CZ^T D)^T C = 0. (25)

Proof. Let us denote the residuals of the LS solutions X and X̃, according to Equation (18), by R_X and R_X̃, respectively. We consider the difference Z := X̃ − X. A direct computation shows that R_X − R_X̃ equals the left-hand side of (25). Since X, X̃ ∈ L, by Lemma 6 we have R_X = R_X̃ = 0. It follows that Equation (25) holds, as desired.
Theorem 3. Algorithm 1 solves Problem 2 in at most np iterations by starting with the initial matrix given in (26), where V_0 ∈ R^{n×p} is an arbitrary matrix, or especially X_0 = 0.
Proof. If we run Algorithm 1 starting with (26), then the solution X* of Problem 2 can be written in the same form as (26) for some matrix V* ∈ R^{n×p}. Now, assume that X̃ is an arbitrary element in L. By Lemma 12, there is a matrix Z ∈ R^{n×p} such that X̃ = X* + Z and Equation (25) holds. Using the property (6), we get ⟨X*, Z⟩ = 0. Since X* is orthogonal to Z, it follows from the Pythagorean theorem that ‖X̃‖_F^2 = ‖X*‖_F^2 + ‖Z‖_F^2 ≥ ‖X*‖_F^2. This implies that X* is the minimal-norm solution.
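Theorem 3 is easy to exercise numerically with the hypothetical helpers sketched earlier (so this snippet is not self-contained, and the rank-deficient test problem below is our own construction, not one from the paper): starting the CG-type iteration from X_0 = 0 on a problem whose coefficient matrix M is rank deficient should reproduce the minimal-norm LS solution x* = M^† e of Equation (17).

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 5, 4, 3, 4
A = rng.random((m, 2)) @ rng.random((2, n))   # rank-2 A forces M to be rank deficient
B = rng.random((p, q))
C, D = np.zeros((m, p)), np.zeros((n, q))     # keep only the Sylvester part for simplicity
E = rng.random((m, q))

X_min = cg_sylvester_transpose_ls(A, B, C, D, E, X0=np.zeros((n, p)), max_iter=n * p)

M, e = kron_linearize(A, B, C, D), E.reshape(-1, order="F")
x_pinv = np.linalg.pinv(M) @ e                # minimal-norm LS solution, Equation (17)
print(np.linalg.norm(X_min.reshape(-1, order="F") - x_pinv))   # close to 0: same minimal-norm solution
```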
Theorem 4. Consider the sequence {X_r} generated by Algorithm 2 starting with an initial matrix of the form analogous to (26), where V_0 ∈ R^{n×p} is an arbitrary matrix, or especially X_0 = 0 ∈ R^{n×p}. Then the sequence {X_r} converges to the minimal-norm LS solution of Equation (2) in at most np iterations.
Proof. The proof is similar to that of Theorem 3.

Least-Squares Solution Closest to a Given Matrix
In this section, we investigate Problem 3. In this case, Equation (1) may have many LS solutions, and we shall seek the one closest to a given matrix with respect to the Frobenius norm.

Theorem 5. Algorithm 1 solves Problem 3 by substituting E with E_1 = E − (AYB + CY^T D) and choosing the initial matrix W_0 as in (27), where V ∈ R^{n×p} is arbitrary, or especially W_0 = 0 ∈ R^{n×p}.
Proof. Let Y ∈ R^{n×p} be given. We can translate Problem 3 into Problem 2 as follows. Setting E_1 = E − (AYB + CY^T D) and W = X − Y, we see that the solution X̂ of Problem 3 is equal to W* + Y, where W* is the minimal-norm LS solution of the equation AWB + CW^T D = E_1 in the unknown W. By Theorem 3, the matrix W* can be computed by Algorithm 1 with the initial matrix (27), where V ∈ R^{n×p} is an arbitrary matrix, or especially W_0 = 0.

Theorem 6. Suppose that the matrix Equation (2) is inconsistent. Let Y ∈ R^{n×p} be given. Consider Algorithm 2 when we replace the matrix E by the correspondingly shifted right-hand side and choose the initial matrix accordingly, where F ∈ R^{n×p} is arbitrary, or W_0 = 0 ∈ R^{n×p}. Then, the sequence {X_r} obtained by Algorithm 2 converges to the LS solution of (2) closest to Y within np iterations.
Proof. The proof of the theorem is similar to that of Theorem 5.
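The reduction used in Theorem 5 is straightforward to try out with the hypothetical helpers from the earlier sketches (again not self-contained, with our own illustrative data): shift the right-hand side to E_1 = E − (AYB + CY^T D), compute the minimal-norm LS solution W* of the shifted equation starting from W_0 = 0, and add Y back.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p, q = 5, 4, 3, 4
A = rng.random((m, 2)) @ rng.random((2, n))   # rank-deficient case: many LS solutions
B = rng.random((p, q))
C, D = np.zeros((m, p)), np.zeros((n, q))
E = rng.random((m, q))
Y = 0.1 * np.ones((n, p))                     # the given target matrix

E1 = E - (A @ Y @ B + C @ Y.T @ D)            # shifted right-hand side E_1
W_star = cg_sylvester_transpose_ls(A, B, C, D, E1, X0=np.zeros((n, p)), max_iter=n * p)
X_hat = W_star + Y                            # LS solution of the original equation closest to Y

res = np.linalg.norm(A @ X_hat @ B + C @ X_hat.T @ D - E, "fro")
print(res)                                    # attains the LS error of the original equation
```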

Numerical Experiments
In this section, we provide numerical results to show the efficiency and effectiveness of Algorithm 2 (denoted by CG), which is an extension of Algorithm 1. We perform experiments in which the coefficients of a given matrix equation are dense/sparse rectangular matrices of moderate/large sizes. We denote by ones(m, n) the m-by-n matrix whose entries are all 1. Each random matrix rand(m, n) has entries drawn from the interval (0, 1). Each experiment contains comparisons of Algorithm 2 with the direct Kronecker linearization as well as well-known iterative algorithms. All iterations were performed in MATLAB R2021a on a Mac operating system (M1 chip, 8C CPU/8C GPU/8 GB/512 GB). The performance of the algorithms is investigated through the number of iterations, the norm of the residual matrices, and the CPU time; the latter is measured in seconds using the MATLAB functions tic and toc. The next is an example of Problem 1.
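For readers who want to reproduce the flavour of these comparisons outside MATLAB, the following NumPy sketch sets up a generic random test problem and times the CG-type iteration against the direct Kronecker-linearization solve. The matrix sizes and data here are arbitrary choices of ours, not the coefficient matrices used in Examples 1-3, and the helper functions are the hypothetical ones sketched earlier.

```python
import time
import numpy as np

rng = np.random.default_rng(5)
m, n, p, q = 80, 30, 25, 60
A, B = rng.random((m, n)), rng.random((p, q))
C, D = rng.random((m, p)), rng.random((n, q))
E = rng.random((m, q))

t0 = time.perf_counter()
X_cg = cg_sylvester_transpose_ls(A, B, C, D, E, tol=1e-5, max_iter=n * p)
t_cg = time.perf_counter() - t0

t0 = time.perf_counter()
M, e = kron_linearize(A, B, C, D), E.reshape(-1, order="F")
x_direct = np.linalg.pinv(M) @ e
t_direct = time.perf_counter() - t0

print(t_cg, t_direct)     # the iterative solver typically needs much less time and memory
```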
Example 1. Consider a generalized Sylvester-transpose matrix equation whose coefficient matrices are specified as follows. In fact, we have rank M = 2000 ≠ 2001 = rank[M e], i.e., the matrix equation does not have an exact solution. However, M is of full-column rank, so this equation has a unique LS solution. We run Algorithm 2 using the initial matrix X_0 = 0 ∈ R^{50×40} and the stopping quantity δ := ‖Mx* − e‖ = 6.4812 (cf. Remark 1), where x* = (M^T M)^{-1} M^T e. It turns out that Algorithm 2 takes 20 iterations to get a least-squares solution, consuming around 0.2 s, while the direct method consumes around 7 s. Thus, Algorithm 2 takes about 35 times less computational time than the direct method. We compare the performance of Algorithm 2 with other well-known iterative algorithms: the GI method [31], the LSI method [31], and the TAUOpt method [32]. The numerical results are shown in Table 1 and Figure 1. We see that, after running 20 iterations, Algorithm 2 consumes slightly more CPU time than the other methods, but its relative error ‖R_r‖_F is smaller than those of the others. Hence, Algorithm 2 is applicable and has a good performance.

The next is an example of Problem 2.

Example 2. Consider a generalized Sylvester-transpose matrix equation whose coefficient matrices are specified as follows. In this case, Equation (29) is inconsistent and the associated matrix M is not of full-column rank. Thus, Equation (29) has many LS solutions. The direct method based on the Moore-Penrose inverse (17) takes 0.627019 s to get the minimal-norm LS solution. Alternatively, the MNLS method [35] can also be applied to this kind of problem. However, some coefficient matrices are triangular with many zero entries, which causes the MNLS algorithm to diverge and fail to provide an answer. Therefore, let us apply Algorithm 2 using a tolerance ε = 10^{-5}. According to Theorem 4, we choose three different matrices V_0 to generate the initial matrix X_0. The numerical results are shown in Table 2 and Figure 2. Figure 2 shows that the logarithms of the relative errors ‖R_r‖_F for the CG algorithm, using the three initial matrices, decrease rapidly to zero. All of them consume around 0.037 s to arrive at the desired solution, which is about 16 times less than the direct method.
The following is an example of Problem 3.
Example 3. Consider a generalized Sylvester-transpose matrix equation whose coefficient matrices are specified as follows. We apply Algorithm 2 with a tolerance ε = 10^{-5}. The numerical results in Figures 3 and 4 and Table 3 illustrate that, in each case, the relative error converges rapidly to zero within 20 iterations, consuming around 0.1 s. Thus, Algorithm 2 performs well in both the number of iterations and the computational time. Moreover, changing the initial matrix and the desired matrix Y does not significantly affect the performance of the algorithm.

Figure 1. The logarithm of the relative error ‖R_r‖_F for Example 1.

Figure 2. The logarithm of the relative error for Example 2.

Figure 3. The logarithm of the relative error for Example 3 with Y = 0.1 × ones(n, p).

Figure 4. The logarithm of the relative error for Example 3 with Y = I.

Table 1. Relative error and computational time for Example 1.

Table 2. Relative error and computational time for Example 2.

Table 3. Relative error and computational time for Example 3.