Parallel Direct and Iterative Methods for Solving the Time-Fractional Diffusion Equation on Multicore Processors

Abstract: This work is devoted to the development of parallel algorithms for solving the initial boundary problem for the time-fractional diffusion equation. After the finite-difference scheme is applied to approximate the basis equation, the problem is reduced to solving a system of linear algebraic equations at each subsequent time level. The developed parallel algorithms are based on the Thomas algorithm, the parallel sweep algorithm, and the accelerated over-relaxation method for solving this system. Stability of the approximation scheme is established. The parallel implementations are developed for multicore CPUs using the OpenMP technology. Numerical experiments are performed to compare these methods and to study the performance of the parallel implementations. The parallel sweep method shows the lowest computing time.


Introduction
The last decades have seen a significant rise of interest in time-fractional differential equations [1][2][3][4][5]. This is due to the possibility of using these equations for modeling multiple phenomena of anomalous diffusion and other processes with memory effects. Multiple experimental studies [6][7][8] have shown that the assumption of Brownian motion in diffusion processes may not be sufficient for an accurate description of some physical processes. Fractional differential equations are a powerful mathematical tool for the adequate description of many real physical processes, and their field of application continues to grow. The qualitative and quantitative analysis of these nonlocal models is quite complex.
Thus, the development of efficient numerical algorithms for solving the direct and inverse problems for fractional differential equations is of considerable theoretical and practical interest today.
Various numerical techniques can be used for the approximate solution of initial-boundary problems for the time-fractional diffusion equation. A sufficiently detailed review is presented in works [9][10][11]. The most popular methods are the finite difference method, the finite element method, spectral methods, and meshless techniques. Numerical methods for solving fractional differential equations are quite expensive due to the nonlocal properties of the fractional derivatives. Thus, the development of efficient numerical algorithms is a crucially important problem. A promising way to solve various compute-intensive problems is parallel computing [12][13][14][15][16][17][18]. Several parallel algorithms have been developed specifically for fractional differential equations and anomalous diffusion problems [19,20].
In work [21], a parallel direct algorithm was implemented for solving systems of linear algebraic equations (SLAEs) with tridiagonal matrices. It is based on partitioning the matrix into blocks, processing the blocks separately on different processors, and obtaining the final solution by solving a reduced system.
In work [22], parallel algorithms were constructed for solving a two-dimensional partial differential equation with Riemann-Liouville time derivatives. For solving the SLAEs arising in this problem, iterative Krylov subspace methods were used, namely, the generalized minimal residual method, the quasi-minimal residual method, and the induced dimension reduction method. Parallelization was performed by distributing the computation between threads.
In our work, we construct parallel algorithms for solving the one-dimensional time-fractional diffusion equation (TFDE). The implicit finite difference approximation reduces the equation to a large system of linear algebraic equations. Three algorithms are applied to solving this system, namely, the direct Thomas algorithm, the direct parallel sweep method, and the iterative accelerated over-relaxation method. The parallel algorithms are implemented on multicore processors using the OpenMP technology. Numerical experiments are carried out to assess the efficiency of the developed parallel algorithms.
The classic Thomas algorithm is widely used for solving the tridiagonal systems arising in various fields. Its main drawback is its limited potential for parallelization. The iterative over-relaxation method was implemented for solving the time-fractional diffusion equation as a serial program in work [23].
In our work, the parallel sweep method is used for the first time to solve the initial boundary problem for the time-fractional diffusion equation. Its main advantage is in reducing the computing time in comparison with the classic Thomas algorithm. It also provides a more accurate solution than both the Thomas algorithm and the iterative over-relaxation method.
The article is organized as follows. In Section 2, we state the considered problem, give the definitions of the fractional-order operators used in the work, construct a discretization of the problem, and show that it can be reduced to solving systems of linear algebraic equations. In Section 3, we describe the numerical algorithms and their convergence properties. In Section 4, we construct and implement the parallel algorithms and present the results of the numerical experiments. In Section 5, we discuss these results and highlight future research directions. Section 6 concludes our work.

Statement of the Problem
Consider the basis time-fractional parabolic partial differential equation (1), where U(x, t) is the sought function, a(x), b(x), c(x), d(x, t) are known functions or constants, and 0 < α < 1 is the parameter defining the fractional order of the time derivative. The problem is posed on the space interval 0 ≤ x ≤ ℓ and the time interval 0 ≤ t ≤ T. The boundary conditions are given at x = 0 and x = ℓ, and the initial condition is given at t = 0, where g_0(x), g_1(t), g_2(t) are the given functions.
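For reference, a form of Equation (1) consistent with the notation above (a sketch only, since the displayed equation is not reproduced here) is

D_t^α U(x, t) = a(x) ∂²U(x, t)/∂x² + b(x) ∂U(x, t)/∂x + c(x) U(x, t) + d(x, t),   0 < x < ℓ,   0 < t ≤ T,

with the boundary conditions U(0, t) = g_1(t), U(ℓ, t) = g_2(t) and the initial condition U(x, 0) = g_0(x).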
In this work, we use the Caputo fractional partial derivative D^α of order α in the form given in [24]. To solve problem (1), assume that the solution exists and satisfies the Dirichlet boundary conditions. Then, we can consider the following definition of the Caputo fractional partial derivative.
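For 0 < α < 1, the standard Caputo definition, which we assume coincides with the one referenced in [24], reads

D_t^α U(x, t) = (1 / Γ(1 − α)) ∫_0^t (∂U(x, s)/∂s) (t − s)^(−α) ds.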

Discretization of Equation and Difference Scheme
To construct the discretization of Equation (1), let us introduce a partitioning of the solution domain. The space interval [0, ℓ] is uniformly split into m subintervals with step h = Δx = ℓ/m. The time interval [0, T] is split into N subintervals with step τ = Δt = T/N. Then, we can denote the grid points for space and time as x_i = ih, i ∈ {0, 1, ..., m}, and t_j = jτ, j ∈ {0, 1, ..., N}, respectively, and we can denote the values of the sought function U(x, t) at the grid points as U_{i,j} = U(x_i, t_j).
In this work, the first-order approximation [25] is used for computing the Caputo fractional partial derivative on the left-hand side of Equation (1). For the right-hand side, we use the fully implicit finite difference approximation scheme of the second order. Then, for the grid point (x_i, t_n), the approximated Equation (1) is written as the difference scheme (4).
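For reference, a standard first-order (L1) discretization of the Caputo derivative on the uniform time grid, which we assume corresponds to the approximation from [25], reads

D_t^α U(x_i, t_n) ≈ (τ^(−α) / Γ(2 − α)) Σ_{j=1}^{n} w_{n−j} (U_{i,j} − U_{i,j−1}),   w_k = (k + 1)^(1−α) − k^(1−α).

The weights w_k decrease monotonically as k grows; this "fading memory" property is used later for the logarithmic-memory summation.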

Obtaining the SLAE
Transforming this equation and introducing notation for the coefficients (6) and the right-hand sides (7), we obtain Equation (8) for the point (x_i, t_n). Note that the values U_{0,n} and U_{m,n} at the boundary points are given. Then, for all inner points x_i, i ∈ {1, 2, ..., m − 1}, we can combine Equations (8) into the system of linear algebraic equations (9), where the matrix A is a square tridiagonal matrix of dimension (m − 1) × (m − 1). Thus, to numerically solve problem (1), we need to solve system (9) sequentially at each time level.
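Schematically, system (9) at time level n can be written as A U_n = F_n with a tridiagonal matrix (the names a_i, b_i, c_i below are illustrative and need not coincide with the notation of Formula (6)):

A =
⎡ b_1   c_1                                ⎤
⎢ a_2   b_2   c_2                          ⎥
⎢        ⋱     ⋱      ⋱                    ⎥
⎢             a_{m−2}  b_{m−2}  c_{m−2}    ⎥
⎣                      a_{m−1}  b_{m−1}    ⎦

where U_n = (U_{1,n}, ..., U_{m−1,n})ᵀ and F_n collects the right-hand sides (7) together with the known boundary values U_{0,n} and U_{m,n}.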

Stability Analysis
Let us prove the stability of the finite-difference approximation in Equation (4). To simplify the proof, we consider the case d(x, t) = 0.
Theorem 1. The fully implicit finite difference approximation in Equation (4) for 0 < α < 1 on the finite domain 0 ≤ x ≤ 1, with boundary conditions U(0, t) = 0 and U(1, t) = 0 for t ≥ 0, is unconditionally stable.
Proof. Suppose the solution of Equation (1) has the form U_j^n = ξ_n e^{iωjh}, where i is the imaginary unit and ω ∈ R.
After transformations, we obtain an expression for the amplification factor ξ_n. Consider its denominator: it is not smaller than one for all j ∈ {0, 1, ..., m}, so the amplification factors do not grow from one time level to the next. The approximate solutions remain bounded and converge to the exact solution, i.e., the numerical approximation scheme (4) is stable. □

Numerical Methods for Solving the Problem
There is a wide variety of numerical methods for solving SLAEs. In this work, for solving the tridiagonal SLAE (9), we will use the direct Thomas algorithm, the direct parallel sweep method, and the iterative accelerated over-relaxation method.

Thomas Algorithm
The Thomas algorithm (in the Russian literature also known as the sweep method) [26] for solving tridiagonal systems was elaborated and investigated independently by several researchers (I. M. Gelfand and O. V. Lokutsievskii in the USSR, and L. H. Thomas in the USA).
It is a direct method that is a simplified form of Gaussian elimination for systems with special matrices. For system (9), the algorithm may be written as follows:
• the forward elimination phase (10);
• the backward substitution phase (11).
The algorithm is applicable to diagonally dominant systems; this diagonal dominance property holds in the numerical experiments presented below. While the Thomas algorithm is extremely simple to implement and is very cache-friendly (since data is read and stored sequentially), it is essentially a serial algorithm. The flow dependency (meaning that the next coefficient must be calculated using the previous one) does not permit the use of most forms of parallelization or vectorization. Thus, parallel tridiagonal solvers are usually based on other methods [27].
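A minimal C++ sketch of the Thomas algorithm for a tridiagonal system with sub-diagonal a, main diagonal b, super-diagonal c and right-hand side f (illustrative names; the actual coefficient formulas correspond to (10) and (11)) is given below.

#include <vector>

// Solves a diagonally dominant tridiagonal system: sub-diagonal a, main
// diagonal b, super-diagonal c, right-hand side f (a[0] and c[n-1] are unused).
std::vector<double> thomas(const std::vector<double>& a, const std::vector<double>& b,
                           const std::vector<double>& c, std::vector<double> f)
{
    const std::size_t n = b.size();
    std::vector<double> alpha(n), x(n);

    // Forward elimination phase.
    alpha[0] = c[0] / b[0];
    f[0] /= b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double denom = b[i] - a[i] * alpha[i - 1];
        alpha[i] = c[i] / denom;
        f[i] = (f[i] - a[i] * f[i - 1]) / denom;
    }

    // Backward substitution phase.
    x[n - 1] = f[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = f[i] - alpha[i] * x[i + 1];

    return x;
}

Each coefficient depends on the previous one, which is exactly the flow dependency that prevents parallelization of the two loops.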

Parallel Sweep Method
In our work, we implement the direct parallel sweep method. It was proposed and investigated in works [28,29].
The idea of parallelization consists in decomposing the index interval {1, 2, ..., m − 1} into L equal subintervals split by the points m_k, k ∈ {1, 2, ..., L − 1}; let us denote m_0 = 0 and m_L = m for convenience. On each subinterval, auxiliary subtasks (13) are introduced, and they may be solved independently. To solve them, the sweep method may be used in the form (14), consisting of the forward elimination phase and the backward substitution phase. The values U_{i,n} at the inner points of each subinterval may then be found by the superposition Formula (15). If we substitute Formula (15) into system (9) for the points i ∈ {m_0, m_1, ..., m_L}, we obtain the reduced system (16) for the values at the subinterval boundaries. The parallel sweep algorithm for solving system (9) is summarized in Listing 1.
Listing 1. Parallel sweep algorithm for solving SLAE.

1. Solve the auxiliary subtasks (13) on individual subintervals using method (14). This step may be executed in parallel.
2. Construct the reduced system (16). This step may be executed in parallel, but requires synchronization or communication because the coefficients of the reduced system require values from two adjacent subintervals.
3. Solve the reduced system. Note that its dimension L is much lower than the dimension (m − 1) of the basis system. Thus, we can solve it by the classic serial Thomas algorithm. After this step, another synchronization or communication is needed to store or transfer the computed values at the boundary points of the subintervals.
4. Compute the solution of the basis system using Formula (15). This step may also be executed in parallel.
Essentially, steps 1 and 4 of the algorithm may be performed independently in parallel, while steps 2 and 3 require communication and synchronization.

Correctness and Stability of the Parallel Sweep Method
The sufficient conditions for correctness and stability of the parallel sweep method are conditions A and B given in (17). Note that condition A implies condition B (A → B).
The following theorem is valid.
Theorem 2. If either condition A or condition B in (17) holds for the basis system (9), then both conditions A and B are satisfied for the reduced system (16). In particular, if condition A holds for (9) with some parameter θ, then condition A holds for (16) with a larger parameter θ′ > θ.
Proof. The proof is given in work [29]. □

Accelerated Over-Relaxation Iterative Method
The accelerated over-relaxation (AOR) iterative method was developed for systems with general dense matrices [30,31].
To formulate this method for system (9), let us represent the matrix A as a sum of three matrices, where D is the diagonal matrix, L is the strictly lower triangular matrix, and V is the strictly upper triangular matrix. Then, the iterative process is defined by Formula (18), where β is the acceleration parameter, γ is the over-relaxation parameter, and U_l^n is the sought vector at the l-th iteration. Specific values of these parameters reduce the AOR method to other well-known methods:
• β = 0, γ = 1 gives the Jacobi method;
• β = 1, γ = 1 gives the Gauss-Seidel method;
• β = γ gives the successive over-relaxation method.
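Since Formula (18) is not reproduced here, we note that with the splitting A = D + L + V the standard AOR iteration [30] can be written as

(D + βL) U_{l+1}^n = [(1 − γ)D + (β − γ)L − γV] U_l^n + γ F_n,

which indeed reduces to the Jacobi, Gauss-Seidel, and SOR methods for the parameter choices listed above.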
The algorithm for solving system (9) by the AOR method is summarized in Listing 2.

Listing 2. AOR iterative algorithm for solving SLAE (9).

Do:
(a) Compute the approximate solution U_{l+1}^n for the next iteration using Formula (18).
(b) Check the stopping criterion. If it is satisfied, then finish the process; otherwise, set l ← l + 1 and repeat (a)–(b).

A componentwise sketch of one AOR iteration for the tridiagonal system is given below.
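The sketch uses illustrative array names for the sub-, main- and super-diagonals and the right-hand side, consistent with the matrix form of the iteration given above.

#include <vector>

// One AOR iteration for a tridiagonal system. beta is the acceleration
// parameter, gamma the over-relaxation parameter. x_new is filled in order of
// increasing i because the update uses the already recomputed left neighbour.
void aor_iteration(const std::vector<double>& a, const std::vector<double>& b,
                   const std::vector<double>& c, const std::vector<double>& f,
                   const std::vector<double>& x_old, std::vector<double>& x_new,
                   double beta, double gamma)
{
    const std::size_t n = b.size();
    for (std::size_t i = 0; i < n; ++i) {
        const double left_old  = (i > 0)     ? a[i] * x_old[i - 1] : 0.0;
        const double left_new  = (i > 0)     ? a[i] * x_new[i - 1] : 0.0;
        const double right_old = (i + 1 < n) ? c[i] * x_old[i + 1] : 0.0;
        x_new[i] = (1.0 - gamma) * x_old[i]
                 + (gamma * (f[i] - left_old - right_old)
                    - beta * (left_new - left_old)) / b[i];
    }
}

For beta = 1 and gamma = 1, the update collapses to the familiar Gauss-Seidel formula.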

Convergence of the AOR Method
The necessary conditions for convergence of the AOR method are formulated in work [32]. Let us rewrite method (18) in fixed-point form, with the corresponding iteration matrix of the method.
Proof. The proof is constructed similarly to the proofs in works [23,32]. □
Thus, in the numerical experiments, we select the values of the parameters β, γ so that these necessary conditions are satisfied.

Parallel Implementation of Algorithms for Solving the TFDE
Numerical simulations of processes described by TFDEs require a lot of computing time. This is due to the non-local nature of the fractional derivative: for time-fractional equations, the computational complexity increases quadratically with the number of time steps.
Consider the numerical algorithm for solving the initial boundary problem for TFDE (1) that is presented in Listing 3.

4. Compute the coefficients of matrix A using Formula (6).
5. Do while (n ≤ N):
(a) Compute the right-hand side of the SLAE using Formula (7).
(b) Find the approximate solution U_n. This can be achieved by either the Thomas algorithm (Formulas (10) and (11)), the parallel sweep algorithm (Listing 1), or the AOR method (Listing 2).
(c) Set n ← n + 1 and repeat (a)–(c) for the next time step.
Let us identify the costliest subroutines of this algorithm.

1. Calculation of the right-hand sides using Formula (7). This procedure requires storing and processing the whole history of the process. The number of terms in Formula (7) increases for each subsequent time level, which increases the amount of calculations.
One way to speed up the computations is to apply parallel computing. In this work, we develop a parallel implementation of the algorithm for solving the time-fractional diffusion equation for superscalar multicore processors using the OpenMP technology [33] and automatic vectorization by the Intel C++ Compiler. The parallelization is performed as follows.

1. The elements f_{i,n} for the individual points i can be calculated independently. Thus, to parallelize this process, we can distribute the spatial fragments between OpenMP threads. We decompose the index range i ∈ {0, 1, ..., m} into chunks of length s, and each thread calculates its own set of chunks. This means that if we denote the number of threads omp_num_threads = L and the thread identifier omp_thread_num = l ∈ {0, 1, ..., L − 1}, then thread l computes f_{i,n} for i ∈ {ls, ls + 1, ..., ls + s − 1} ∪ {Ls + ls, Ls + ls + 1, ..., Ls + ls + s − 1} ∪ {2Ls + ls, 2Ls + ls + 1, ..., 2Ls + ls + s − 1} ∪ .... The corresponding C++ code fragment is sketched after this paragraph. In this loop's kernel, all memory pointers are accessed either uniformly (the same memory from iteration to iteration) or with unit stride, changing by one element on the next iteration. This ensures high efficiency on superscalar architectures with SIMD instruction sets (AVX2+FMA3 or AVX-512 for modern Intel CPUs). The experiments show that the optimal value is s = 512 for double precision (8 bytes). This is probably due to the fact that 512 × 8 bytes = 4 KB, which is exactly one memory page and also fits into the L1d cache (32 KB per core for the i7-10700K CPU used in the experiments).
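Since the original code listing is not reproduced here, the following is a minimal sketch of the chunked loop described above. The storage layout of U, the weight array w, and the way the history sum is combined with the source term are assumptions; the actual kernel implements Formula (7).

#include <omp.h>
#include <algorithm>

// Sketch of the chunked computation of the right-hand sides f_{i,n}.
// Assumed layout: U[j*(m+1) + i] = U_{i,j} (time-major), w holds the history
// weights of the first-order scheme, d holds the source-term contribution.
void compute_rhs(int m, int n, int s, const double* w,
                 const double* U, const double* d, double* f)
{
    #pragma omp parallel
    {
        const int L = omp_get_num_threads();   // "omp_num_threads" in the text
        const int l = omp_get_thread_num();    // "omp_thread_num" in the text

        // Each thread processes its own chunks {ls, ..., ls+s-1}, {Ls+ls, ...}, ...
        for (int start = l * s; start <= m; start += L * s) {
            const int end = std::min(start + s, m + 1);

            for (int i = start; i < end; ++i)
                f[i] = d[i];                   // source-term part (assumed)

            for (int j = 1; j < n; ++j) {      // walk over the stored history
                const double wj = w[n - j];    // uniform access inside the inner loop
                const double* Uj  = U + j * (m + 1);
                const double* Ujm = U + (j - 1) * (m + 1);
                for (int i = start; i < end; ++i)      // unit stride, vectorizable
                    f[i] -= wj * (Uj[i] - Ujm[i]);
            }
        }
    }
}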

2. The Thomas algorithm is inherently serial. Thus, its implementation is executed inside an OpenMP single section.

3. The parallel sweep algorithm is parallel by design. In Listing 1, steps 1 and 4 are performed by each OpenMP thread on its own subinterval. To perform steps 2 and 3, we need synchronization by the '#pragma omp barrier' directive, which introduces additional overhead. We will investigate this in the numerical experiments. A structural sketch of this organization is given below.
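The helper functions in the sketch are hypothetical placeholders whose bodies would implement Formulas (13)-(16); only the thread and synchronization structure is illustrated.

#include <omp.h>

void solve_local_subtask(int l, int L);           // Formulas (13)-(14), per subinterval
void build_reduced_coefficients(int l, int L);    // contributions to system (16)
void solve_reduced_system(int L);                 // serial Thomas algorithm for (16)
void recover_subinterval_solution(int l, int L);  // superposition, Formula (15)

void parallel_sweep_step()
{
    #pragma omp parallel
    {
        const int L = omp_get_num_threads();
        const int l = omp_get_thread_num();

        solve_local_subtask(l, L);            // Step 1: independent local sweeps
        build_reduced_coefficients(l, L);     // Step 2: needs values from neighbours
        #pragma omp barrier                   // wait until all contributions are ready

        #pragma omp single
        solve_reduced_system(L);              // Step 3: small serial system of size ~L
        // (implicit barrier at the end of the single construct)

        recover_subinterval_solution(l, L);   // Step 4: independent superposition
    }
}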

4. Computing the next iteration in the AOR method. This process requires matrix-vector multiplications, which are parallelized in the same way as described above for the computation of the right-hand sides f_{i,n}. The elements of the new approximation vector U_{l+1}^n must be calculated sequentially. This fact reduces the efficiency of parallelization.

Reducing the Cost of Computing the Right-Hand Sides
Several techniques can be applied for reducing the computational cost of the right-hand sides. Formula (7) requires O(n) operations at each time level n and O(N²) operations for the entire process. The cost of the SLAE solver at time level n does not depend on n; thus, its total cost is O(N).
One technique for reducing the cost of computing the right-hand sides is described in work [34]. Instead of using the uniform grid for computing the approximation of the fractional derivative, a nested grid is used: the finer mesh is applied to the most recent history, with successively coarser meshes for the more distant history. This approach utilizes the fading-memory property of the fractional derivative, i.e., the fact that the weights corresponding to the more distant time levels decrease. To implement this approach, we modify Formula (7) so that the history sum is split over the intervals (j, k) ∈ {(2, 2 + θ⁰), (2 + θ⁰, 2 + θ¹), (2 + θ¹, 2 + θ²), ..., (2 + θ^⌊log_θ n⌋, n)}, where θ ∈ N is the stretching coefficient and ⌊log_θ n⌋ is the integer part of log_θ n.
This approach has complexity O(N log N), in contrast to the O(N²) complexity of the full-memory approach. In work [34], it was shown that the nested mesh approach preserves the order of approximation of the fractional derivative.
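As an illustration of the nested history grid, the sketch below generates the interval boundaries 2, 2 + θ⁰, 2 + θ¹, ..., n under the reading of the modified formula given above (the exact grouping of the terms in (21) is an assumption).

#include <cstdio>
#include <vector>

// Boundaries of the nested history intervals: 2, 2 + theta^0, 2 + theta^1, ..., n.
// theta >= 2 is the stretching coefficient; the number of intervals is O(log_theta n).
std::vector<long long> history_breakpoints(long long n, long long theta)
{
    std::vector<long long> b{2};
    for (long long p = 1; 2 + p < n; p *= theta)   // p = theta^k
        b.push_back(2 + p);
    b.push_back(n);                                // the oldest interval ends at level n
    return b;
}

int main()
{
    // For n = 1000 and theta = 2 this yields about a dozen intervals
    // instead of roughly a thousand individual history terms.
    for (long long bk : history_breakpoints(1000, 2))
        std::printf("%lld ", bk);
    std::printf("\n");
}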

Results of the Numerical Experiments
In this section, we apply our parallel implementations of the algorithm in Listing 3 to the numerical solution of time-fractional diffusion equations and present the results of the experiments. The experiments were performed on an 8-core Intel Core i7-10700K CPU.

Problem 1
Consider the initial boundary value problem of form (1) with the boundary condition U(1, t) = 1 + 2t^α/Γ(α + 1) at the right end of the interval and the corresponding initial condition. The exact solution of this problem is given in paper [35]. The numerical experiments were performed for the order α = 0.5 on the grid m = 256, N = 16,384. The following combinations of the parameters β, γ were used in Formula (18), giving various iterative methods:
• Gauss-Seidel method (GS): β = 1, γ = 1;
• Successive over-relaxation method (SOR): β = 1.9, γ = 1.9;
• Accelerated over-relaxation method (AOR): β = 1.99, γ = 1.8.
The problem was also solved by the direct algorithms, namely, the Thomas algorithm (Formulas (10) and (11)) and the parallel sweep algorithm (Formulas (13)-(16)). Figure 1 shows the approximate solution of Problem 1 obtained by the Thomas algorithm. Table 1 presents the results of the numerical experiments: the computing times of solving Problem 1 by the five methods described above. T_1 is the computing time of a single OpenMP thread, i.e., of the serial program implementing the given method; T_8 is the time of the parallel program for the same method run with 8 threads. To assess the efficiency of the parallel implementation, the speedup coefficient S_L = T_1/T_L is presented, where L is the number of threads. The table also contains the total number K of iterations required to solve the problem by the iterative methods, as well as the average number K_n of iterations at a single time level. The last column presents the error δ of the obtained solution, measured in the maximum norm of the difference between the numerical and exact solutions.
Problem 2
Consider the initial boundary value problem of form (1) on the interval 0 ≤ x ≤ 2 with the boundary conditions U(0, t) = 0, U(2, t) = 0 and the corresponding initial condition. The exact solution of this problem is given in work [37]. The numerical experiments were performed for the order α = 0.5 on the grid m = 8192, N = 8192. The combination of parameters β, γ for the AOR method is β = 1.99, γ = 1.9. The results for this grid are presented in Table 2.
For the next experiment, we use the large spatial grid m = 32 × 2^20 for the same problem; the time grid in this experiment contains only N = 64 levels. The program then requires about 20 GB of memory out of the 32 GB available on the PC used in the tests. Tables 3 and 4 present the results of the numerical experiments for Problem 2 on this grid using the full-memory (7) and logarithmic-memory (21) approaches for computing the right-hand sides, respectively. They also contain the execution times of the separate subroutines of the parallel program obtained by profiling with the Intel VTune Profiler.

Discussion
The experiments show that, for both problems, the iterative AOR method is clearly inferior to the much simpler direct methods, such as the Thomas algorithm and the parallel sweep method, in terms of both accuracy and computing time. For Problem 1, the AOR method with tuned parameters requires fewer iterations and a shorter computing time than the Gauss-Seidel and SOR methods, but its computing time is still 15 to 25 times longer than that of the direct methods.
For coarser spatial grids, such as m = 256 for Problem 1 and m = 8192 for Problem 2, the parallel sweep method is slightly slower than the serial Thomas algorithm. This is caused by the parallel overhead (the time required for synchronizing the threads), which is too large for the small sizes of the SLAE. The results are more indicative for the large spatial grid m = 32 × 2^20 (roughly 32 million points) in the last experiment. Here, the parallel sweep algorithm for solving the SLAE shows a shorter time than the classic serial Thomas algorithm.
We also note that, with the full-memory approach, the time share of computing the right-hand sides at each time step is up to two times larger than that of solving the SLAE (see Table 3). The computational complexity of computing the right-hand sides is quadratic in the number of time levels, while the direct SLAE solvers are linear. Using the logarithmic-memory approach (see Table 4) reduces the time of computing the right-hand sides by 3 to 4 times, and the total computing time is reduced by up to a factor of two. The error of the solution remains of the same order as for the full memory. Now, the SLAE solver takes more time than the right-hand side computation, which makes the effect of using the parallel solver more prominent. The parallel implementation based on the parallel sweep method reduces the total computing time from 19 to 9.4 s, whereas the implementation that uses the classic serial Thomas algorithm reduces the computing time from 18 to 15.5 s. Thus, the parallel implementations run on 8 threads give computing times of 15.5 s and 9.4 s for the classic Thomas algorithm and the parallel sweep method, respectively, i.e., the parallel sweep method is about 1.65 times faster.
Let us investigate the performance of the procedure of computing the right-hand sides in more depth. Table 5 presents the memory bandwidth utilization of the parallel implementation of this procedure (for the full-memory approach) for various numbers of OpenMP threads, measured by the Intel VTune Profiler for the grid size m = 8192, N = 8192. The table shows that even two threads saturate the memory bandwidth, and a further increase in the number of threads produces only a minuscule improvement. This is confirmed by the results in Tables 2-4: the largest speedup is about 2 times, regardless of whether the full or logarithmic memory is used. This is caused by the fact that calculating the right-hand sides by Formulas (7) and (21) has low arithmetic intensity: essentially, we perform one multiplication and one addition per three numbers read from memory. The roofline analysis by the Intel Advisor [38] confirms that this implementation is memory bound (see Figure 3).
There are several ways to work around this problem. One way is to use hardware with higher memory bandwidth. The memory bandwidth of modern GPUs is dozens of times higher than that of CPUs. For example, the NVIDIA RTX 3080 has a memory bandwidth of 760 GB/s, in comparison with about 57 GB/s for the Intel Core i7-10700K with DDR4-4133 used in our experiments (with comparable double-precision performance of about 450-500 GFLOPS).
Computing cluster systems with distributed memory allow the effective summation of the memory bandwidths of the individual nodes. They also allow one to solve much larger problems when the total data would not fit into the memory of a single node. This makes the parallel sweep algorithm more viable for such systems.
The other way is to increase the arithmetic intensity and the efficiency of memory access. Reusing the data in caches or shared memory can significantly improve the performance. Several methods for optimizing the computational procedures for fractional derivatives are proposed in works [39,40]. An algorithm for the automatic optimization of a similar procedure, the matrix-vector multiplication, for a multicore processor is presented in work [41].
In the future, the authors plan to develop efficient numerical algorithms for the forward and inverse 2D and 3D problems for TFDEs. This would require a larger amount of computations, as well as larger memory requirements. To obtain more efficient parallelization, various techniques may be implemented, such as red-black partitioning, conjugate gradient type methods, preconditioning, higher-order schemes, etc. One of the promising ways to solve large time-fractional problems is the Parareal method [42]. Currently, it is widely used for the numerical solution of initial boundary problems for classical differential equations with integer orders. Its main idea is the time domain decomposition using two grids (a coarse one and a fine one). The coarse grid is used to construct the initial approximations for the subtasks solved on the fine grid and to correct the solutions of the subtasks. The subtasks on the fine grid may be solved independently for each time subinterval. This allows one to implement efficient parallel algorithms for various high-performance architectures.

Conclusions
In this work, parallel algorithms for solving the initial boundary problem for the time-fractional diffusion equation are constructed. The algorithms are based on the finite-difference scheme for approximating the differential equation and on the Thomas algorithm, the parallel sweep algorithm, and the accelerated over-relaxation method for solving the arising systems of linear algebraic equations. The algorithms are implemented for multicore processors using the OpenMP technology. Numerical experiments were performed to investigate the efficiency of the developed parallel algorithms. The Thomas algorithm and the parallel sweep algorithm reduce the total computing time for the example problems by up to 25 times in comparison with the over-relaxation method and provide better accuracy. For large spatial grids, using the parallel sweep method speeds up the computations by up to 2 times on the 8-core CPU in comparison with the classic serial Thomas algorithm. In the future, the authors plan to implement more elaborate algorithms to circumvent the limits of memory bandwidth. This would allow one to solve larger problems.

Data Availability Statement:
The data presented in this study are the model data. Data sharing is not applicable to this article.