1. Introduction
Consider the fractional system [1,2]

      (1)

where α (0 < α < 1) represents the order of the fractional derivative and the coefficient matrices and states are of compatible dimensions. If the fractional derivative is approximated by the Grünwald–Letnikov rule [3] at the grid points, the system (1) is equivalent to a discrete-time linear system

      (2)

whose coefficient matrices are induced by the discretization. The corresponding optimal control and the feedback gain can be expressed in terms of the unique positive semidefinite stabilizing solution X of the discrete-time algebraic Riccati equation (DARE)

      X = A^T X (I + G X)^{-1} A + H.    (3)
There have been numerous methods, including classical and state-of-the-art techniques, developed over the past few decades to solve this equation in a numerically stable manner; see [4,5,6,7,8,9,10,11,12,13,14,15] and the references therein for more details.
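For orientation, recall that the SDA applied to (3) doubles the iteration index via A_{k+1} = A_k (I + G_k H_k)^{-1} A_k, G_{k+1} = G_k + A_k G_k (I + H_k G_k)^{-1} A_k^T, and H_{k+1} = H_k + A_k^T H_k (I + G_k H_k)^{-1} A_k, with H_k converging to the stabilizing solution in the non-critical case. The following MATLAB sketch is a minimal dense-matrix illustration, assuming the form of (3) given above; the iteration cap and the convergence test are our own illustrative choices.

    % Minimal dense SDA sketch for the DARE (3); A, G, H are given n-by-n matrices.
    n  = size(A, 1);  I = eye(n);
    Ak = A;  Gk = G;  Hk = H;
    for k = 1:50
        A1   = (I + Gk*Hk) \ Ak;                 % (I + G_k H_k)^{-1} A_k
        Hnew = Hk + Ak'*Hk*A1;                   % H_{k+1}
        Gk   = Gk + Ak*(Gk/(I + Hk*Gk))*Ak';     % G_{k+1}
        Ak   = Ak*A1;                            % A_{k+1}
        if norm(Hnew - Hk, 'fro') <= 1e-14*norm(Hnew, 'fro'), Hk = Hnew; break; end
        Hk = Hnew;
    end
    X = Hk;                                      % approximate stabilizing solution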
In many large-scale control problems, the matrix G in the non-linear term and the matrix H in the constant term are of low rank, say G = BB^T and H = C^T C with B ∈ R^{n×m}, C ∈ R^{p×n}, and m, p ≪ n. Then the unique positive semidefinite stabilizing solution of the DARE (3) or of its dual equation can be approximated numerically by a low-rank matrix [16,17]. However, when the constant term H in the DARE has a high-rank structure, the stabilizing solution is no longer numerically low-rank, which makes its storage and output difficult. To resolve this issue, an adapted version of the doubling algorithm, named SDA_h, was proposed in [18]. The main idea behind SDA_h is to take advantage of the numerical low rank of the stabilizing solution of the dual equation to estimate the residual of the original DARE. In this way, SDA_h can efficiently evaluate the residual and output the feedback gain. An interesting question up to now is whether the doubling algorithm can be developed further to solve the large-scale DARE (3) when both the non-linear term G and the constant term H are of high rank.
The main difficulty in this case lies in the fact that the stabilizing solutions of both the DARE (3) and its dual equation fail to be of low rank, which makes the direct application of the SDA to large-scale problems difficult, especially for the estimation of residuals and the detection of termination. This paper attempts to overcome this obstacle. Rather than answering the above question completely, we consider the DARE (3) with the banded-plus-low-rank structure

      A = A_b + L_A K_A R_A^T,    (4)

where A_b is a banded matrix, L_A, R_A ∈ R^{n×m_A} are low-rank factors, and K_A ∈ R^{m_A×m_A} is the kernel matrix with m_A ≪ n. The assumption (4) is not necessary when G and H are of low rank; in that case, A is allowed to be any (sparse) matrix. We also assume that the high-rank non-linear term and the constant term are of the form

      G = G_b + L_G K_G L_G^T,  H = H_b + L_H K_H L_H^T,    (5)

where G_b and H_b are positive semidefinite banded matrices, L_G ∈ R^{n×m_G} and L_H ∈ R^{n×m_H} are low-rank factors, the kernels K_G and K_H are symmetric, and m_G, m_H ≪ n (here K_G and K_H might be zero). In addition, we assume that A_b, G_b, and H_b are all banded matrices with banded inverse (BMBI), a class that has applications in power systems [19,20,21]. See also [22,23,24,25,26,27,28,29] and the references therein for other applications.
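To fix ideas, the following MATLAB sketch assembles data of the assumed forms (4) and (5); the sizes, bandwidths, and random entries are illustrative assumptions and do not correspond to the experiments in Section 6.

    % Illustrative banded-plus-low-rank data of the forms (4) and (5).
    n  = 1000;  mA = 6;  mG = 4;  mH = 4;
    Ab = spdiags(rand(n, 3), -1:1, n, n);                            % banded part of A
    LA = rand(n, mA);  RA = rand(n, mA);  KA = rand(mA);             % low-rank factors and kernel
    A  = Ab + LA*KA*RA';                                             % A = A_b + L_A K_A R_A'
    Gb = spdiags(ones(n, 1), 0, n, n);                               % banded, positive semidefinite
    LG = rand(n, mG);  KG = eye(mG);                                 % symmetric kernel
    G  = Gb + LG*KG*LG';                                             % G = G_b + L_G K_G L_G'
    Hb = spdiags([-ones(n,1) 2*ones(n,1) -ones(n,1)], -1:1, n, n);   % SPD tridiagonal
    LH = rand(n, mH);  KH = eye(mH);
    H  = Hb + LH*KH*LH';                                             % H = H_b + L_H K_H L_H'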
The main contributions in this paper are:
- Although hierarchical (e.g., HODLR) structures [30,31] can be employed to run the SDA for large-scale DAREs with both high-rank H and G, this is the first work to develop the SDA in factorized form (FSDA) to deal with such DAREs.
- The structure of the FSDA iterative sequence is explicitly revealed to consist of two parts: a banded part and a low-rank part. The banded part iterates independently, while the low-rank part relies heavily on products of the banded part with the low-rank part.
- A deflation process for the low-rank factors is proposed to reduce the number of columns of the low-rank part. The conventional truncation and compression of [17,18], applied to the whole low-rank factor, does not work here, as it destroys the implicit structure and makes the subsequent deflation infeasible. Instead, a partial truncation and compression (PTC) technique is devised that acts only on the exponentially growing part (after deflation), effectively slimming the dimensions of the low-rank factors.
- The termination criterion of the FSDA consists of two parts. The residual of the banded part is considered in a pre-termination test, and only when it is small enough is the actual termination criterion involving the low-rank factors computed. In this way, the complexity of the time-consuming detection of the termination condition is reduced.
The research in this field is also motivated by other applications, such as the finite element method (FEM), where the matrices resulting from the discretization exhibit a sparse and structured pattern [32,33]. By capitalizing on these properties, iterative methods designed for such matrices can significantly enhance computational efficiency, minimize memory usage, and lead to quicker solutions of large-scale problems.
The whole paper is organized as follows. Section 2 describes the FSDA for DAREs (3) with high-rank non-linear and constant terms. The deflation process for the low-rank factors and kernels is given in Section 3. Section 4 dwells on the PTC technique for slimming the dimensions of the low-rank factors and kernels. The computation of the residual, as well as the concrete implementation of the FSDA, is described in Section 5. Numerical experiments are presented in Section 6 to show the effectiveness of the FSDA.
Notation 1. I_n (or simply I) is the n × n identity matrix. For a matrix A, ρ(A) denotes the spectral radius of A. For symmetric matrices A and B, we write A > B (A ≥ B) if A − B is positive definite (semi-definite). Unless stated otherwise, the norm ‖·‖ is the Frobenius norm of a matrix. For a banded matrix B, b_B represents its bandwidth. Additionally, the Sherman–Morrison–Woodbury (SMW) formula (see [34], for example),

      (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},

is required in the analysis of the iterative scheme.
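As a small illustration of how the SMW formula interacts with the banded-plus-low-rank structure (4), the following sketch solves a linear system with A_b + L K R^T using only banded solves with A_b; the sizes and entries are assumptions made for the example.

    % Solve (Ab + L*K*R')*x = v with banded solves only, via the SMW formula.
    n  = 1000;  m = 5;
    Ab = spdiags([-ones(n,1) 4*ones(n,1) -ones(n,1)], -1:1, n, n);
    L  = rand(n, m);  K = rand(m);  R = rand(n, m);  v = rand(n, 1);
    U  = L*K;                                  % fold the kernel into U (K need not be invertible)
    x0 = Ab \ v;                               % one banded solve
    Y  = Ab \ U;                               % m further banded solves
    x  = x0 - Y*((eye(m) + R'*Y) \ (R'*x0));   % SMW correction
    norm((Ab + L*K*R')*x - v)                  % small up to conditioning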
3. Deflation of Low-Rank Factors and Kernels
It has been shown that the dimensions of the low-rank factors and kernels increase exponentially. Nevertheless, the first three items in the newly assembled factors (see (15) and (17)) are the same as the second to the fourth items in the previous ones (see (14) and (16)), respectively. Hence, a deflation of the low-rank factors and kernels is needed to keep these matrices low-rank. To see the process clearly, we start with the first deflation step.
Case for the first deflation.
Consider the deflation of the low-rank factors first. It follows from (14)–(17) that the low-rank factors of the new iterate are concatenations of the current factors with products of the banded part and the current factors.
Expanding these low-rank factors with the initial factors, one can see from Appendix A that several blocks occur twice in each expanded factor. To reduce the dimensions, we remove each duplicated block and retain a single copy, so that the original factors are deflated to factors of smaller dimension, where the superscript "d" indicates a matrix after deflation. Applying the same deflation process to the remaining factors yields the deflated factors displayed in Appendix A, where the blank left in each factor corresponds to a deleted block and the black bold matrices are inherited from the undeflated ones. Note that, to simplify notation, the deflated matrices are again denoted by the same symbols in the next iteration.
For the kernels at this step, one has the block structures with the non-zero components defined in (18)–(20). Here, the deflation of the first kernel is explained explicitly; that of the second is similar. In fact, the first kernel has 10 block rows and block columns of equal initial size. Due to the deflation of the L-factors described above, we add the first and the ninth block rows to the third and the seventh block rows, respectively, and then remove the first and the ninth block rows. We also add the first and the ninth block columns to the third and the seventh block columns, respectively, and then remove the first and the ninth block columns, completing the deflation of the first kernel.
Analogously, the second kernel has eight block rows and block columns of equal initial size. The deflation process simultaneously adds the seventh block column and block row to the third block column and block row, respectively. Then the first column sub-block of the upper-right part and the first row sub-block of the lower-left part overlap with their retained counterparts, completing the deflation of the second kernel.
The whole process is depicted in Figure 1 and Figure 2, where each small square is of the same size and each block with a gray background represents a non-zero component of the kernels. The little white squares inherit from the originally undeflated submatrices, and the little black squares represent the submatrices after summation.
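The key invariant of the deflation, namely that adding the kernel rows and columns of a duplicated block to those of its retained copy leaves the product L K L^T unchanged, can be checked on a toy example; the block layout below is illustrative and not the exact pattern of Figures 1 and 2.

    % Toy check: merging a duplicated block column of L into its earlier copy.
    n  = 50;  p = 3;
    L1 = rand(n, p);  L2 = rand(n, p);
    L  = [L1, L2, L1];                          % the third block duplicates the first
    K  = rand(3*p);  K = K + K';                % symmetric kernel
    i1 = 1:p;  i3 = 2*p+1:3*p;                  % indices of the two copies
    Kd = K;
    Kd(i1, :) = Kd(i1, :) + Kd(i3, :);          % add the rows of the duplicate
    Kd(:, i1) = Kd(:, i1) + Kd(:, i3);          % add the columns of the duplicate
    Kd(i3, :) = [];  Kd(:, i3) = [];            % then remove them
    Ld = [L1, L2];                              % deflated factor
    norm(L*K*L' - Ld*Kd*Ld', 'fro')             % ~1e-13: the product is unchanged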
Case for subsequent deflations.
After the previous deflation, the deflated matrices are, for simplicity, denoted by the same symbols without the superscript "d". At the current iteration, a group of columns of the newly assembled low-rank factors coincides with columns that are already present. One can therefore remove the duplicated columns and keep a single copy of each in (A1) (or (A3)), respectively. After this removal, a linearly growing family of blocks of fixed order is left in (A1) (or (A3)) in Appendix B, together with only one matrix of larger order, namely the last item in (A1) (or (A3)) of Appendix B. We take one factor as an example in Appendix C to describe the above deflation more clearly.
To deflate the remaining pair of factors, the duplicated columns are likewise removed and the other columns retained, so that only one matrix of larger order is left, namely the last item in (A2) (or (A4)) of Appendix B. Note that the low-rank factors in the current iteration are the ones after deflation, truncation, and compression, with the superscript "d" dropped for simplicity. We take one factor as an example in Appendix D to describe the above deflation more clearly.
Correspondingly, the kernel matrices are deflated according to their low-rank factors. Here, we describe the deflation of the first kernel; that of the second is essentially the same. Recalling the positions of the non-zero sub-matrices (the blocks with gray background in Figure 3) of the kernel in (21), the deflation process essentially adds the rows and columns associated with each removed duplicate block to those of its retained copy. See Figure 3 for an illustration.
Similarly, recalling the positions of the non-zero sub-matrices (the blocks with gray background in Figure 4) of the kernel in (23), the deflation process adds the columns and rows of the removed duplicate blocks to the corresponding retained ones. See Figure 4 for an illustration.
  4. Partial Truncation and Compression
Although the deflation of the low-rank factors and kernels in the last section reduces the dimensional growth, the exponential increase in the undeflated part is still rapid, making large-scale computation and storage infeasible. Conventionally, an efficient way to shrink the column number of low-rank factors is truncation and compression (TC) [17,18], which, unfortunately, is hard to apply in our case due to the following two main obstacles.
- Direct application of TC to the four low-rank factors and their corresponding kernels at the k-th step would require four QR decompositions, resulting in a relatively high computational complexity and CPU consumption.
- The TC process applied to the whole low-rank factors at the current step breaks up the implicit structure, making the deflation unrealizable in the next iteration.
In this section, we instead present a technique of partial truncation and compression (PTC) to overcome these difficulties. Our PTC requires only two QR decompositions, applied to the exponentially increasing (rather than the entire) parts of the low-rank factors, and it preserves the successive deflation for subsequent iterations.
PTC for low-rank factors. Recall the deflated forms (A1) and (A3) in Appendix B. Each of the two factors can be divided into three parts. The number of columns of the first two parts increases only linearly with k, and the last parts are always of fixed size, so we only truncate and compress the dominantly growing parts by orthogonalization. Consider QR decompositions with column pivoting of the two growing parts, where the permutation matrices are chosen such that the diagonal elements of the triangular factors decrease in absolute value, small tolerances control the PTC of the respective factors, and the column numbers of the compressed parts are bounded above by a given upper bound. The numerical ranks of the growing parts are then bounded accordingly; the orthogonal factors are orthonormal and the triangular factors are of full rank. The growing parts can thus be truncated and reorganized, yielding the compressed low-rank factors.
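In generic notation (the names Ldelta, Kdelta, tol, and rmax are ours, not the paper's), one truncation-and-compression step on a growing part can be sketched as follows.

    % One PTC step on a growing part Ldelta with kernel Kdelta.
    [Q, R, p] = qr(Ldelta, 0);                      % Ldelta(:, p) = Q*R, |diag(R)| decreasing
    ip(p)  = 1:numel(p);                            % inverse permutation
    r      = sum(abs(diag(R)) > tol*abs(R(1, 1)));  % numerical rank w.r.t. tol
    r      = min(r, rmax);                          % respect the column upper bound
    Rt     = R(1:r, ip);                            % so that Ldelta is approximately Q(:, 1:r)*Rt
    Ldelta = Q(:, 1:r);                             % compressed orthonormal factor
    Kdelta = Rt*Kdelta*Rt';                         % kernel absorbs the triangular factor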
Similarly, recalling the deflated forms (A2) and (A4) in Appendix B, the other two factors are divided into two parts. Since their growing parts have already been compressed above, one directly obtains the truncated and compressed factors, finishing the PTC process for the low-rank factors in the k-th iteration.
It is worth noting that the above PTC process can proceed to the next iteration. In fact, after the k-th PTC, duplicated blocks appear again in the newly assembled factors, so they can be deflated as before. Applying PTC once more to the deflated growing parts, with the unitary factors obtained from the QR decompositions, completes the PTC in the (k+1)-th iteration.
PTC for kernels. Define the small transformation matrices assembled from the triangular factors in (32). The truncated and compressed kernels are then obtained by congruence transformations with these matrices.
To eliminate items of negligible norm in the low-rank factors and kernels, an additional monitoring step is imposed after the PTC process. Specifically, the last item of each low-rank factor is discarded if its norm is less than the corresponding tolerance, and the associated kernel blocks are removed with it. In this way, the growth of the column dimensions of the low-rank factors, as well as of the kernels, is controlled efficiently while sacrificing a hopefully negligible bit of accuracy. Additionally, their sizes after PTC are further restricted by a reasonable upper bound on the number of columns.
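A sketch of this monitoring step, with illustrative names (Lk and Kk for a factor and its kernel, q for the width of the trailing block, tol_drop for the threshold), is the following.

    % Discard a trailing block whose contribution to Lk*Kk*Lk' is negligible.
    j  = size(Lk, 2)-q+1 : size(Lk, 2);          % columns of the trailing block
    M  = Lk(:, j)'*Lk(:, j);  Kj = Kk(j, j);     % small q-by-q pieces
    if sqrt(abs(trace(Kj*M*Kj'*M))) < tol_drop   % equals ||Lk(:,j)*Kj*Lk(:,j)'||_F
        Lk(:, j) = [];  Kk(j, :) = [];  Kk(:, j) = [];
    end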
  6. Numerical Examples
In this section, we demonstrate the effectiveness of the FSDA in computing the approximate solution of the DARE (3). The FSDA was implemented in MATLAB 2014a [38] on a 64-bit PC running Windows 10, with a 3.0 GHz Intel Core i5 processor (6 cores and 6 threads), 32 GB RAM, and machine unit round-off eps ≈ 2.22 × 10^{-16}. The residual of the DARE was estimated by an upper bound combining B_RRes in (39) and LR_RRes in (40), the relative residuals of the banded part and the low-rank part, respectively. Small tolerances were prescribed for truncation and compression and for termination. We also tried eps as the tolerance in our experiments but found that this had no impact on the residual accuracy. The maximum permitted column number of the low-rank factors was fixed in advance. For comparison, we also ran the ordinary SDA with hierarchical structure (i.e., HODLR) using the hm-toolbox (http://github.com/numpi/hm-toolbox, accessed on 1 June 2023) [39,40]; this variant is referred to as SDA_HODLR, and its relative residual is denoted analogously. In our numerical experiments, the initial bandwidths of all banded matrices in Examples 1 and 3 were relatively small, while those in Example 2 were non-trivial.
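For moderate n, a computed solution X can also be validated directly against the assumed form (3) of the DARE; the normalization below is one illustrative choice and is not the upper bound formula used by the FSDA.

    % Direct residual check of a computed X against (3); A, G, H, X are
    % assumed available, and the dense inverse restricts this to moderate n.
    n    = size(A, 1);
    Res  = X - A'*X/(eye(n) + G*X)*A - H;                      % DARE residual
    RRes = norm(Res, 'fro')/(norm(X, 'fro') + norm(H, 'fro'))  % relative residual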
Example 1. The first example is of medium scale and measures the error between the true solution and the computed one. The data are generated from positive constants ζ and η, chosen such that the derived quantity θ is real, and from a normalized random vector e; the coefficient matrices then take closed forms, and the stabilizing solution of the DARE is known explicitly. It is not difficult to see that this solution is stabilizing, since the spectral radius of the associated closed-loop matrix is less than unity.
We first took a parameter pair away from the critical case and calculated B_RRes, followed by LR_RRes as well as the upper bound of the residual of the DARE. In our implementation, the relative error between the approximate solution (denoted by X_j when terminating at the j-th iteration) and the true stabilizing solution was evaluated, and the numerical results are presented in Table 1. It is seen that, for the different scales N, the FSDA attained the prescribed banded accuracy in five iterations. The residuals LR_RRes and the upper bound were then evaluated and found to be small. The relative error, whose computation is not included in the reported CPU time, also reflects that X_j approximates the true solution very well. On the other hand, SDA_HODLR also attains the prescribed residual accuracy in five iterations, but costs more CPU time (in seconds).
We then chose the parameters to make the spectral radius of the closed-loop matrix close to 1 and recorded the numerical performance of the FSDA. It is seen from Table 1 that the FSDA requires seven iterations before termination, obtaining almost the same banded residual histories (B_RRes) for different N. As before, LR_RRes and the upper bound were small, showing that the computed solution is a good approximation to the true solution of the DARE (3). The last relative error also validates this fact. Analogously, SDA_HODLR requires seven iterations to arrive at the same residual level. It is also seen that the FSDA costs less CPU time than SDA_HODLR for all N.
Example 2. Consider a generalized model of a power system labelled by PI Sections 20–80 (https://sites.google.com/site/rommes/software, "S10PI_n1.mat", accessed on 1 June 2023). All transmission lines in the network are modelled by RLC ladder networks consisting of cascaded RLC PI-circuits [41]. The original banded-plus-low-rank matrix A has a small scale of 528 (Figure 5) and is then extended to larger scales. Specifically, we extract the banded part of bandwidth 217 from the original matrix and tile it along the diagonal direction 20 times. We then compute an SVD of the remaining matrix to produce the singular value matrix and the unitary factors, and construct the low-rank parts by tiling the retained singular vectors 20 times and multiplying the singular value matrix from the right, discarding the negligible singular values. The banded parts of G and H are block diagonal matrices whose diagonal blocks are 3 × 3 random matrices (generated by 'rand(3)'); the low-rank factors of G and H are also block diagonal, with the top-left entry a random number, the last diagonal block a smaller random matrix, and the other blocks 3 × 3 random matrices. The matrices G and H are then defined in the banded-plus-low-rank form (5).
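An outline of this construction, under our reading of the description (the drop threshold and the factor scaling are assumptions), is given below.

    % Outline of the Example 2 extension; A0 is the original 528-by-528
    % matrix from "S10PI_n1.mat" (loading not shown).
    Ab0 = triu(tril(A0, 217), -217);                  % banded part, bandwidth 217
    Ab  = kron(speye(20), sparse(Ab0));               % tile 20 times along the diagonal
    [U, S, V] = svd(full(A0 - Ab0));                  % SVD of the remainder
    r   = sum(diag(S) > 1e-10);                       % keep non-negligible singular values (threshold assumed)
    LA  = kron(ones(20, 1), U(:, 1:r));               % tiled left factor
    RA  = kron(ones(20, 1), V(:, 1:r))*S(1:r, 1:r);   % right factor, scaled by S from the right
    % A is then used in the factored form Ab + LA*RA' and never formed explicitly.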
We ran the FSDA with three different parameter settings, each with five random experiments. In all experiments, B_RRes and LR_RRes were observed to attain the pre-termination condition (39) and the termination condition (40), respectively.
Figure 6 plots the obtained numerical results for the five experiments, where Rk is the upper bound of the residual of the DARE, and BRes and LRes are the absolute residuals of the banded part and the low-rank part (i.e., the numerators of B_RRes and LR_RRes), respectively. It is seen that the relative residual levels of LR_RRes and B_RRes are lower than those of LRes and BRes in all experiments. In particular, the gap between them increases as the parameter becomes larger. On the other hand, the residual curve of Rk lies above those of B_RRes and LR_RRes. This demonstrates that the FSDA can attain a relatively high residual accuracy.
To clearly see the evolution of the bandwidths of the banded matrices and the dimensional growth of the low-rank factors over the iterations, we list the histories of the bandwidths of the banded iterates of A, G, and H (denoted by b_A, b_G, and b_H, respectively) and the column numbers of the low-rank factors in Table 2, where the CPU row records the consumed CPU time in seconds. It is clearly seen that, for the three parameter settings, the FSDA requires 5, 4, and 3 iterations, respectively, to reach the prescribed accuracy. Further experiments show that the required number of iterations at termination decreases as the parameter grows. Additionally, we see that the bandwidths b_G and b_H rise considerably in the second iteration but remain almost unchanged in the remaining iterations. Nevertheless, b_A decreases gradually after reaching its maximum in the second iteration, which is consistent with the convergence result in Corollary 1. On the other hand, the column numbers of the low-rank factors in the second iteration are about fourfold those in the first iteration, since the FSDA does not deflate the low-rank factors at the first iteration. However, the column numbers in the fifth iteration (when it exists) are less than twofold those in the fourth iteration. This reflects that deflation and PTC are efficient in reducing the dimensions of the low-rank factors. In our experiments, we also found that nearly half of the CPU time of the FSDA was consumed in forming the banded residual for the pre-termination test. Such an expense might decrease, however, if the initial bandwidths of A_b, G_b, and H_b are narrow.
To further compare the numerical performance of the FSDA and SDA_HODLR on larger problems, we extended the original scale to N = 15,840, 21,120, 26,400, and 31,680 and ran both algorithms until convergence. The results are listed in Table 3, where one can see that both the FSDA and SDA_HODLR (SDA_HD in the table) attain the prescribed residual accuracy within three iterations, and SDA_HODLR requires less CPU time than the FSDA. However, there seems to be a strong tendency for the FSDA to outperform SDA_HODLR in CPU time on larger problems, as the CPU time of SDA_HODLR surges at N = 26,400, and SDA_HODLR used up the memory at N = 31,680 without producing any numerical results (denoted by "—"). The symbols "*" in the SDA_HODLR columns indicate that there are no corresponding records of the bandwidth and the column number of the low-rank factors.
We further modified this example to have a simpler banded part to test both algorithms. Specifically, the relatively data-concentrated banded part of bandwidth 3 is extracted and tiled along the diagonal direction 20 times. As before, an SVD is applied to the remaining matrix to construct the low-rank parts after tiling the derived unitary factors 20 times and multiplying the singular value matrix from the right. We then ran both the FSDA and SDA_HODLR at the scales N = 15,840, 21,120, 26,400, and 31,680 again. The obtained results are recorded in Table 4, where it is readily seen that the FSDA outperforms SDA_HODLR in CPU time. Once again, SDA_HODLR ran out of memory for the case N = 31,680.
Example 3. This example extends a small-scale electric power system network to a large-scale one used for signal stability analysis [19,20,21]. The corresponding matrix comes from the power system of New England (https://sites.google.com/site/rommes/software, "ww_36_pemc_36.mat", accessed on 1 June 2023). Figure 7 presents the original structure of the matrix A of order 66, a few elements of which were properly modified. Then the banded part is extracted from the blocks (1:6, 1:6), (7:13, 7:13), (14:20, 14:20), (21:27, 21:27), (28:34, 28:34), (35:41, 35:41), (42:48, 42:48), (49:55, 49:55), (56:62, 56:62), and (63:66, 63:66), admitting a bandwidth of 4. After tiling it 200, 400, and 600 times along the diagonal direction, we obtain banded matrices of scales N = 13,200, 26,400, and 39,600. For the low-rank factors, an SVD of the remaining matrix is first computed to produce the diagonal singular value matrix and the unitary factors; the low-rank parts are then constructed by tiling the retained singular vectors 200, 400, and 600 times and dividing by their F-norms, respectively, discarding the negligible singular values. The matrices G and H are again of the banded-plus-low-rank form (5).
We took different parameter values and ran the FSDA to compute the stabilizing solution for the dimensions N = 13,200, 26,400, and 39,600. In our experiments, the FSDA always satisfied the pre-termination condition (39) first and then terminated once LR_RRes met the prescribed tolerance. We picked one representative setting and list the derived results in Table 5, where BRes (or LRes) and B_RRes (or LR_RRes) record the absolute and the relative residuals of the banded part (or the low-rank part), respectively; the remaining rows record the history of the upper bound of the residual of the DARE, the bandwidths of the banded iterates, and the column numbers of the low-rank factors. In particular, the CPU column describes the accumulated time to compute the residuals (excluding the data marked with "*").
Obviously, for different N, the FSDA is capable of achieving the prescribed accuracy after five iterations. The residuals BRes, B_RRes, LRes, and LR_RRes indicate that the FSDA tended to converge quadratically. In particular, BRes (or B_RRes) at different N are of nearly the same order at termination, and likewise for LRes (or LR_RRes). More iterations seemed useless for improving the accuracy of LRes and LR_RRes. Note that the data labelled with the superscript "*" in the columns LRes and LR_RRes come from re-running the FSDA to complement the residual at each iteration, and the corresponding CPU time is not included in the CPU column. Lastly, the recorded bandwidths of the banded iterates are invariant, and the column numbers of the low-rank factors grow by less than a factor of two per iteration, demonstrating the effectiveness of the deflation and PTC.
We also ran the FSDA to compute the solution of the DARE in a nearly critical setting, with the results recorded in Table 6. In this case, the FSDA requires seven iterations to reach the prescribed accuracy. As before, the last few residuals in the column BRes (or B_RRes) at different N are almost the same. In particular, BRes and B_RRes show that the FSDA attained the prescribed banded accuracy at the fifth iteration, while the corresponding residual of the low-rank part was still relatively large, so two additional iterations were required to meet the termination condition (40), even though the residual level of B_RRes stagnated over the last three iterations. From a structural point of view, it appears that the low-rank part approaches the critical case while the banded part still lies in the non-critical case. Similarly, the recorded bandwidths indicate that the banded iterates are all block diagonal with fixed block sizes and that the deflation and PTC of the low-rank factors are effective. Moreover, the CPU column shows that the time spent in the current iteration was less than twice that of the previous one.
We further compared the numerical performance of the FSDA and SDA_HODLR on large-scale problems. Different parameter values were tried, and the compared numerical behaviours of the two algorithms are analogous. We list representative results in Table 7, where one can see that the FSDA requires fewer iterations and less CPU time to satisfy the stopping criterion than SDA_HODLR. In particular, SDA_HODLR depleted all memory at N = 39,600 and did not yield any numerical results (denoted by "—"). The symbols "*" in the SDA_HODLR columns indicate that there are no corresponding records of the bandwidths and the column numbers of the low-rank factors.