Investigation of High-Efficiency Iterative ILU Preconditioner Algorithm for Partial-Differential Equation Systems

Abstract: In this paper, we investigate an iterative incomplete lower and upper (ILU) factorization preconditioner for partial-differential equation systems. We discretize the partial-differential equations into linear equation systems and solve them with an iterative scheme. The ILU preconditioners of the linear systems are computed on different computation nodes with multiple central processing unit (CPU) cores. Firstly, the preconditioner is tested on general tridiagonal matrix equations on supercomputers. Then, the effects of partial-differential equation systems on the speedup of parallel multiprocessors are examined. The numerical results indicate that the parallel efficiency is higher than that of other algorithms.


Introduction
In applied sciences such as computational electromagnetics, partial-differential equation systems must frequently be solved. Engineering problems often involve many unknown variables, and finding them is often transformed into the solution of partial differential equations. To solve these equations numerically, they must be discretized, and the discretization usually yields symmetric systems of equations. Hence, it is necessary to exploit symmetry when solving partial differential equations.

Several studies on multi-computers have appeared. For instance, Eric Polizzi and Ahmed H. Sameh [1] contributed a spike algorithm as a parallel solution to hybrid banded equations. The algorithm first decomposes banded equations into block-tridiagonal form and then makes full use of the divide-and-conquer technique. However, as the bandwidth increases, the parallel computation becomes much more complex, leading to a decrease in parallel efficiency. A highly efficient parallel solution of banded systems is therefore of great importance. Methods for block-tridiagonal linear equations include iterative algorithms such as the multi-splitting algorithm [2,3]. The multi-splitting algorithm (MPA) [2] can be used to solve large banded linear systems of equations; however, it sometimes has lower parallel efficiency. In [4], the authors provide a method for solving block-tridiagonal equations; any incomplete-type preconditioner is appropriate for that algorithm. Based on the Galerkin principle, a parallel solution for large-scale banded equations is investigated in [5]. In [6], a parallel direct algorithm is used on multi-computers. In [7], a parallel direct method for large banded equations is presented. Preconditioners for large-scale banded equations are presented in [8-14]. The block successive

Decomposition Strategy
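Consider the large-scale banded system of equations

$$Ax = b, \tag{1}$$

that is,

$$\begin{pmatrix} A_1 & B_1 & & \\ C_2 & A_2 & B_2 & \\ & \ddots & \ddots & \ddots \\ & & C_n & A_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix},$$

where $A_i$, $B_i$, and $C_i$ are the blocks of the coefficient matrix $A$, and $x_i$ and $b_i$ are the corresponding blocks of the unknowns and the right-hand side. With the approximate factorization $A \approx GH$, the new iterative scheme for the large-scale banded system of equations is

$$GH\,x^{(k+1)} = (GH - A)\,x^{(k)} + b, \qquad k = 0, 1, 2, \ldots, \tag{4}$$

where the iterative matrix is $(GH)^{-1}(GH - A)$. Obviously, $GH$ must be nonsingular, which is a necessary condition for the algorithm to hold. Owing to the structure of $G$ and $H$, the iterative algorithm is well suited to parallel computation.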
The strategy is an ILUP algorithm. Compared with published algorithms [2,10,40], the ILUP algorithm requires fewer multiplications and additions in every iteration, which gives it a higher speedup and a higher parallel efficiency. It is appropriate for solving large-scale systems of equations and partial-differential equations on multi-core processors.
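The splitting iteration behind this strategy can be sketched as follows. This is a minimal illustration under loud assumptions, not the authors' implementation: the actual factors G and H of the paper are not recoverable from the text, so the hypothetical helper `block_factors` simply takes G and H to be the LU factors of the p diagonal blocks of A (so that GH is the block-diagonal part of A), and the test matrix is an arbitrary tridiagonal M-matrix.

```python
import numpy as np

def lu_no_pivot(B):
    """Doolittle LU of a small dense block without pivoting (assumes nonzero pivots)."""
    U = B.astype(float).copy()
    m = U.shape[0]
    L = np.eye(m)
    for k in range(m - 1):
        for r in range(k + 1, m):
            L[r, k] = U[r, k] / U[k, k]
            U[r, k:] -= L[r, k] * U[k, k:]
    return L, np.triu(U)

def block_factors(A, p):
    """ASSUMPTION: stand-in for the paper's G and H. G @ H equals the
    block-diagonal part of A, split into p equal blocks (n = p * m)."""
    n = A.shape[0]
    m = n // p
    G = np.zeros((n, n))
    H = np.zeros((n, n))
    for i in range(p):
        s = slice(i * m, (i + 1) * m)
        G[s, s], H[s, s] = lu_no_pivot(A[s, s])
    return G, H

def splitting_iteration(A, b, G, H, tol=1e-10, max_iter=10_000):
    """Iterate G H x_new = (G H - A) x + b until successive iterates agree."""
    N = G @ H - A
    x = np.zeros_like(b, dtype=float)
    for k in range(1, max_iter + 1):
        y = np.linalg.solve(G, N @ x + b)   # forward solve with G
        x_new = np.linalg.solve(H, y)       # backward solve with H
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

# Illustrative data only: a 12 x 12 tridiagonal M-matrix split over p = 3 blocks.
n, p = 12, 3
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
G, H = block_factors(A, p)
x, iters = splitting_iteration(A, b, G, H)
```

Because each diagonal block of G and H is independent of the others in this stand-in, the two solves inside the loop decouple into p independent block solves, which is the kind of structure that makes such a scheme attractive for parallel execution.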

Preliminary
Here, some notations are introduced. Two definitions and one lemma are mentioned.

Definition 1 ([39]). A real $n \times n$ matrix $A = (a_{i,j})$ with $a_{i,j} \le 0$ for all $i \ne j$ is an M-matrix if $A$ is nonsingular and $A^{-1} \ge O$.
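The M-matrix definition can be checked numerically for small dense matrices. The sketch below is illustrative only; the function name `is_m_matrix` and the tolerance are assumptions introduced here, and forming the inverse explicitly is practical only for small test cases.

```python
import numpy as np

def is_m_matrix(A, tol=1e-12):
    """Check: off-diagonal entries <= 0, A nonsingular, and inv(A) >= 0 entrywise."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    off_diagonal = A[~np.eye(n, dtype=bool)]
    if np.any(off_diagonal > tol):
        return False
    if np.linalg.matrix_rank(A) < n:
        return False
    return bool(np.all(np.linalg.inv(A) >= -tol))

# Illustrative matrices (not taken from the paper).
laplacian_1d = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
indefinite = np.array([[1.0, -2.0], [-2.0, 1.0]])
```

Here `laplacian_1d` satisfies all three conditions, whereas `indefinite` has the right sign pattern and is nonsingular but has a negative inverse, so it is not an M-matrix.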

Proposition and Theorem
Note that the inverse of the following matrix is obtained by Gaussian elimination. Firstly, from the definitions and the lemma, a proposition is obtained as follows.

Proposition 1.
If $A$ is an M-matrix, then the matrices ...

Proof. From Expression (3), in terms of the structure of $A$, $G$, $H$ and $M = GH$, $N = M - A$, we have ... As $A$ is an M-matrix, then ... Since the block on the $m$-th row and $m$-th ..., Similarly, the block on the $m$-th row and $m$-th ...

Secondly, taking advantage of the above lemma and proposition, a theorem is given.

Theorem 1.
If $A$ is an M-matrix, then the approximate factorization of matrix $A$ can be represented by Expression (2), and the iterative scheme Algorithm (4) converges to $X^* = A^{-1}b$.

Proof. From the above proposition, the approximate factorization of matrix $A$ can be represented by Expression (2). Firstly, prove $N \ge O$. As $A$ is an M-matrix, then ... Secondly, prove $M^{-1} \ge O$. ..., where ..., where ... According to the proposition, ...

This section shows that the condition in the theorem is a sufficient condition for the convergence of the algorithm. If $A$ is not an M-matrix, Algorithm (4) is sometimes still convergent, as is shown in the following section (Example 1).
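The two conditions proved above, $N \ge O$ and $M^{-1} \ge O$, can be checked numerically on a small case. The sketch below assumes that the lemma used in the proof is the classical regular-splitting result (if $A^{-1} \ge O$, $M^{-1} \ge O$, and $N = M - A \ge O$, then the spectral radius of $M^{-1}N$ is below 1). It also uses a block-diagonal $M$ as a hypothetical stand-in, since the paper's $G$ and $H$ are not recoverable from the text.

```python
import numpy as np

def check_regular_splitting(A, M, tol=1e-12):
    """Return (N >= 0?, inv(M) >= 0?, spectral radius of inv(M) @ N) for N = M - A."""
    N = M - A
    M_inv = np.linalg.inv(M)
    rho = float(np.max(np.abs(np.linalg.eigvals(M_inv @ N))))
    return bool(np.all(N >= -tol)), bool(np.all(M_inv >= -tol)), rho

# Illustrative data only: tridiagonal M-matrix, M = its block-diagonal part (blocks of size m).
n, m = 12, 4
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = A * np.kron(np.eye(n // m), np.ones((m, m)))
n_nonneg, m_inv_nonneg, rho = check_regular_splitting(A, M)
```

For this test matrix both sign conditions hold and `rho` is below 1, so the iteration with this particular $M$ converges for any starting vector.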

Storage Method
For the $i$-th processor, ... $(i-1)m + j$, and the convergence tolerance $\varepsilon$.
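The index bookkeeping implied by the subscript $(i-1)m + j$ can be sketched as follows, assuming that with $n = pm$ the $i$-th processor holds the $m$ consecutive block rows $(i-1)m + j$, $j = 1, \ldots, m$. Which quantities are stored for each block row is not recoverable from the text, so only the index map is shown; the function name `owned_block_rows` is hypothetical.

```python
def owned_block_rows(i, m):
    """1-based global block-row indices (i - 1) * m + j, j = 1..m, held by processor i."""
    return [(i - 1) * m + j for j in range(1, m + 1)]

# Illustrative sizes only: n = p * m block rows spread over p processors.
p, m = 4, 3
partition = {i: owned_block_rows(i, m) for i in range(1, p + 1)}
```

Every block row from 1 to n appears in exactly one processor's list, so no block data needs to be duplicated across processors under this assumed layout.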

Results Analysis of Numerical Examples
To test the new algorithm, results were obtained on the Inspur TS10000 cluster with the new algorithm and with the order-2 multi-splitting algorithm [2], which is a well-known parallel iterative algorithm. The PEk method [40] is used for the inner iteration of the order-2 multi-splitting algorithm.
In the tables, P is the number of processors, l is the number of inner iterations, k is the parameter of the PEk method, T is the run time (in seconds), I is the number of iterations, S is the speedup, and E is the parallel efficiency (E = S/P). In the following figures, ILUP, BSOR, PEk, and MPA denote, respectively, the iterative incomplete lower and upper factorization preconditioner, the block successive over-relaxation method, the PEk method, and the multi-splitting algorithm.
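The two performance measures can be written as a short sketch. The relation E = S/P is taken from the text; the definition of the speedup as the one-processor run time divided by the P-processor run time is the usual convention and is an assumption here, since the text does not state it. The function names are illustrative.

```python
def speedup(t_one_processor, t_p_processors):
    """ASSUMPTION: S = T(1) / T(P), the conventional definition of speedup."""
    return t_one_processor / t_p_processors

def parallel_efficiency(s, p):
    """E = S / P, as defined in the text."""
    return s / p

# Speedup values reported in the text for the ILUP algorithm on 64 CPU cores.
e_example_1 = parallel_efficiency(47.2324, 64)
e_example_2 = parallel_efficiency(45.8483, 64)
```

With the reported speedups, these expressions reproduce the parallel efficiencies quoted for the ILUP algorithm on 64 cores (73.80% for Example 1 and 71.64% for Example 2).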

Results Analysis of the Large-Scale System of Equations
Example 1. $A$ in Expression (1) represents ..., and where $B_n = C_1 = O$, $n = 300$, and $t = 300$. The numerical results are shown in Tables 1-5 and in Figures 1 and 2.
The first example is not a numerical simulation of any partial differential equation (PDE); we use it to test the correctness of the iterative incomplete lower and upper factorization preconditioner (ILUP) algorithm, and it builds a foundation for the second example, which concerns a PDE. The results of the ILUP algorithm for Example 1 are shown in Table 1. When the problem is solved with more than eight processors, the number of iterations is 238. As the number of processors increases, both the run time and the parallel efficiency decrease: from 4 to 64 processors, the parallel efficiency changes from 91.14% to 73.80%. All of the parallel efficiency values are above 73% and are higher than those of the published methods of Cui et al. [10], Zhang et al. [40], and Yun et al. [2]. No matter how many processors are used, the error tolerance of this example is the same: $6.897 \times 10^{-11}$.

The results of Example 1 obtained with the BSOR method [10] are listed in Table 2. When more than four processors are used, the number of iterations is 216. As the number of processors increases, the run time and the parallel efficiency decrease. The cost of each iteration and of communication is higher than for the ILUP algorithm, so the speedup is lower and the parallel efficiency is lower as well. With four processors, the parallel efficiency is 59.56%, compared with 91.14% for the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 44.81%, which is again lower than that of the ILUP algorithm.

The results of Example 1 obtained with the PEk method published by Zhang et al. [40] are given in Table 3. When more than four processors are used, the number of iterations is 227. As the number of processors increases, the run time and the parallel efficiency decrease. The cost of each iteration and of communication is higher than for the ILUP algorithm, so the speedup and the parallel efficiency are both lower. With four processors, the parallel efficiency is 64.08%, compared with 91.14% for the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 44.79%, which is close to that of the BSOR method and lower than the 73.80% of the ILUP algorithm.

For the multi-splitting algorithm (MPA) [2], the cost of each iteration and of communication is likewise higher than for the ILUP algorithm, so the speedup and the parallel efficiency are both lower. With four processors, the parallel efficiency is 55.64%, compared with 91.14% for the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 40.82%, which is about 4 percentage points less than that of the BSOR method and about 33 percentage points lower than that of the ILUP algorithm.

Compared with the published methods [2,10,40], as seen in Table 5, the speedup obtained with our method for Example 1 on 64 CPU cores is 47.2324, and the parallel efficiency is 73.80%. This parallel efficiency is about 29 percentage points higher than that of the BSOR method and 29.01 percentage points higher than that of the PEk method, whose parallel efficiencies are nearly equal, and about 33 percentage points higher than that of the MPA algorithm.

Figure 1 illustrates the speedup of the ILUP algorithm and the other three methods for Example 1 on different numbers of CPU cores. As the number of processors increases, the speedup of all the methods increases. For every number of processors, the speedup of the ILUP algorithm is significantly higher than that of the other three methods, especially when the number of processors is large. The speedup values of the BSOR method, the PEk method, and the MPA algorithm are close to one another, particularly those of the BSOR and PEk methods.
Figure 2 shows the parallel efficiency of the ILUP algorithm and the other three methods for Example 1 on different numbers of CPU cores. As the number of processors increases, the parallel efficiency of all the methods decreases. For every number of processors, the parallel efficiency of the ILUP algorithm is much higher than that of the other three methods and remains above 70%. The parallel efficiency values of the PEk method, the BSOR method, and the MPA algorithm are lower and close to one another, especially those of the BSOR and PEk methods. In particular, with 64 processors, the parallel efficiency of the ILUP algorithm is above 73%, whereas the values for the BSOR method, the PEk method, and the MPA algorithm are only between about 41% and 45%. The ILUP algorithm thus has a clear advantage in parallel efficiency.
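The efficiency gaps for Example 1 on 64 CPU cores follow directly from the parallel efficiency values quoted in this section. The sketch below recomputes them as differences in percentage points; the dictionary layout and variable names are illustrative, and only values stated in the text are used.

```python
# Parallel efficiency (%) on 64 CPU cores for Example 1, as quoted in the text.
efficiency_64 = {"ILUP": 73.80, "BSOR": 44.81, "PEk": 44.79, "MPA": 40.82}

# Gap of each published method to ILUP, in percentage points.
gap_to_ilup = {
    name: round(efficiency_64["ILUP"] - value, 2)
    for name, value in efficiency_64.items()
    if name != "ILUP"
}

# Gap between BSOR and MPA, in percentage points.
bsor_minus_mpa = round(efficiency_64["BSOR"] - efficiency_64["MPA"], 2)
```

Reading the gaps as percentage-point differences rather than relative changes keeps them directly comparable with the values plotted in Figure 2.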

Results Analysis of the Partial-Differential Equations
Example 2. Given the equations ... The results are given in Tables 6-10 and in Figures 3 and 4.
The finite difference method is used to discretize Example 2 in the tests. We adopt second-order central difference schemes to discretize Example 2 and then convert the result into the format required for the numerical simulation; lastly, we test the ILUP algorithm on different numbers of processors. The results of the ILUP algorithm for Example 2 are listed in Table 6. When the problem is solved with more than four CPU cores, the number of iterations is 560. As the number of processors increases, both the run time and the parallel efficiency decrease: from 4 to 64 processors, the parallel efficiency changes from 89.48% to 71.64%. All of the parallel efficiency values are above 71% and are higher than those in the published works [2,10,40]. Regardless of how many processors are used, the error tolerance of this problem is the same: $3.158 \times 10^{-11}$.

The results for Example 2 obtained with the BSOR method [10] are listed in Table 7. When more than four processors are used, the number of iterations is 793. As the number of processors increases, the run time and the parallel efficiency decrease. The cost of each iteration and of communication is higher than for the ILUP algorithm, so the speedup and the parallel efficiency are both lower. With four processors, the parallel efficiency is 86.24%, which is 3.24 percentage points lower than that of the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 52.42%, compared with 71.64% for the ILUP algorithm.

The results for Example 2 obtained with the PEk method [40] are given in Table 8. When more than four processors are used, the number of iterations is 798. As the number of processors increases, the run time and the parallel efficiency decrease. The cost of each iteration and of communication is higher than for the ILUP algorithm, so the speedup and the parallel efficiency are both lower. With four processors, the parallel efficiency is 80.59%, which is 8.89 percentage points lower than that of the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 48.40%, which is 23.24 percentage points lower than that of the ILUP algorithm.

The results for Example 2 obtained with the multi-splitting algorithm [2] are given in Table 9. When more than four processors are used, the number of iterations is 838. As the number of processors increases, the run time and the parallel efficiency decrease. The cost of each iteration and of communication is higher than for the ILUP algorithm, so the speedup and the parallel efficiency are both lower. With four processors, the parallel efficiency is 78.25%, which is 11.23 percentage points less than that of the ILUP algorithm. As the number of processors increases, the parallel efficiency decreases to 46.34%, which is about 6 percentage points less than that of the BSOR method, close to that of the PEk method, and 25.3 percentage points lower than that of the ILUP algorithm.

Table 10 provides a summary and comparison of the speedup and parallel efficiency obtained using the different methods for Example 2 on 64 CPU cores; the ILUP results are better than those of the other published works. As seen in Table 10, the speedup of our method for Example 2 on 64 CPU cores is 45.8483 and the parallel efficiency is 71.64%. This parallel efficiency is 19.22 percentage points higher than that of the BSOR method, 23.24 percentage points higher than that of the PEk method, and 25.3 percentage points higher than that of the MPA algorithm.

Figure 3 compares the speedup of the ILUP algorithm and the other three methods for Example 2 on different numbers of CPU cores. As the number of processors increases, the speedup values of all the methods increase. For every number of processors, the speedup of the ILUP algorithm is much higher than that of the other three methods, in particular when the number of processors is large. The speedup values of the BSOR method, the PEk method, and the MPA algorithm are close to one another, especially those of the PEk method and the MPA algorithm. For example, with 64 processors, the speedup of the ILUP algorithm is above 45, whereas the speedup values of the BSOR method, the PEk method, and the MPA algorithm are only about 30. The ILUP algorithm therefore has the advantage of producing higher speedup values.

Figure 4 shows the parallel efficiency of the ILUP algorithm and the other three methods for Example 2 on different numbers of CPU cores. As the number of processors increases, the parallel efficiency of all the methods decreases. For every number of processors, the parallel efficiency of the ILUP algorithm is much higher than that of the other three methods and remains above 70%. The parallel efficiency values of the BSOR method, the PEk method, and the MPA algorithm are lower and decline steadily, especially those of the MPA algorithm. In particular, with 64 processors, the parallel efficiency of the ILUP algorithm is above 71%, whereas the values for the BSOR method, the PEk method, and the MPA algorithm are only between about 46% and 52%. The ILUP algorithm thus has a clear advantage in parallel efficiency.
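The way a second-order central-difference discretization leads to a block-tridiagonal system of the form of Expression (1) can be sketched as follows. The equations of Example 2 are not recoverable from the text, so this sketch uses the two-dimensional Poisson problem on a square grid purely as a stand-in; the function name `poisson_2d`, the grid size, and the omission of the mesh-width scaling are all assumptions.

```python
import numpy as np

def poisson_2d(k):
    """5-point central-difference matrix for -Laplace(u) on a k x k interior grid.

    STAND-IN problem (not the paper's Example 2); the 1/h^2 scaling is omitted.
    The result is block tridiagonal with k x k blocks.
    """
    T = 4.0 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)   # diagonal blocks
    S = np.eye(k, k=1) + np.eye(k, k=-1)                     # block coupling pattern
    return np.kron(np.eye(k), T) - np.kron(S, np.eye(k))     # off-diagonal blocks are -I

k = 4
A = poisson_2d(k)
diag_block = A[:k, :k]          # plays the role of A_1 in Expression (1)
upper_block = A[:k, k:2 * k]    # plays the role of B_1
lower_block = A[k:2 * k, :k]    # plays the role of C_2
```

In this stand-in, the resulting matrix is symmetric with nonpositive off-diagonal entries, which matches the setting in which the splitting iteration is analysed.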

Conclusions
In this work, an iterative incomplete LU factorization preconditioner for partial-differential equation systems has been presented. The partial-differential equations were discretized into linear equations of the form Ax = b, and an iterative scheme for the linear systems was used. The iterative ILU preconditioners of the linear systems and partial-differential equation systems were computed on different computation nodes with multiple CPU cores. From the numerical results in the tables and figures, we can draw the following conclusions:
1. The ILUP algorithm for the large-scale system of equations and the partial-differential equation systems was run on different numbers of CPU cores. The numerical results show that the solutions are consistent with the theory.
2. From Example 1, when A is neither positive nor an M-matrix, the ILUP algorithm still converges.
3. For any number of CPU cores, the speedup of the ILUP algorithm for the system of equations is far higher than that of the BSOR method [10], the PEk method [40], and the MPA algorithm [2]. The ILUP algorithm therefore has the advantage of producing higher speedup values.
4. Regardless of the number of processors, the parallel efficiency of the ILUP algorithm is higher than that of the other three algorithms. For example, the parallel efficiency of the ILUP algorithm achieves a value of above 73.8% (as seen in Table 5), which is higher than that of any other algorithm, including the BSOR method [10], the PEk method [40], and the MPA algorithm [2]. The ILUP algorithm thus has a clear advantage in parallel efficiency.

and $d_i \times d_{i-1}$, and $x_i$, $b_i$ are the $d_i$-vectors of the unknowns and the right-hand side. The coefficient matrix $A$ can be approximately decomposed as

$$A \approx GH. \tag{2}$$

Generally, supposing $n = pm$ ($m \ge 2$, $m \in \mathbb{Z}$), where $p$ represents the number of processors, let

Figure 2. The parallel efficiency values for Example 1.

Figure 4. The parallel efficiency values for Example 2.

Table 3. Results of the pseudo-elimination method with parameter k (PEk) for Example 1 (k = 1.6).

The results of Example 1 when using the multi-splitting algorithm (MPA) published by Yun et al. [2] are introduced in Table 4. As seen in Table 4, when more than four processors are used to solve the problem of Example 1, the number of iterations is 174. When increasing the number of processors, the time and parallel efficiency decrease. The cost of the time of every iteration and communication is

Table 4. Results of the multi-splitting algorithm (MPA) for Example 1.

This section compares the speedup and parallel efficiency of the ILUP algorithm with those of methods in other recently published works, including the methods of Cui et al. [10], Zhang et al. [40], and Yun et al. [2]. Table 5 gives a summary and comparison of the speedup and parallel efficiency of the different methods used for Example 1 on 64 CPU cores, which is better than other works

Table 5. Comparison of the speedup and parallel efficiency of the different methods used for Example 1 on 64 central processing unit (CPU) cores.

Table 6. Results of the iterative incomplete lower and upper factorization preconditioner for Example 2.

Table 9. Results of the multi-splitting algorithm for Example 2.

This section compares the speedup and parallel efficiency of the ILUP algorithm with those of methods in other recently published works, including the methods of Cui et al. [10], Zhang et al. [40], and Yun et al. [2]. Table

Table 10. Comparison of the speedup and parallel efficiency values obtained using the different methods for Example 2 on 64 CPU cores.