Computationally E ﬃ cient Solution of a 2D Di ﬀ usive Wave Equation Used for Flood Inundation Problems

: This paper presents a study dealing with increasing the computational e ﬃ ciency in modeling ﬂoodplain inundation using a two-dimensional di ﬀ usive wave equation. To this end, the domain decomposition technique was used. The resulting one-dimensional di ﬀ usion equations were approximated in space with the modiﬁed ﬁnite element scheme, whereas time integration was carried out using the implicit two-level scheme. The proposed algorithm of the solution minimizes the numerical errors and is unconditionally stable. Consequently, it is possible to perform computations with a signiﬁcantly greater time step than in the case of the explicit scheme. An additional e ﬃ ciency improvement was achieved using the symmetry of the tridiagonal matrix of the arising system of nonlinear equations, due to the application of the parallelization strategy. The computational experiments showed that the proposed parallel implementation of the implicit scheme is very e ﬀ ective, at about two orders of magnitude with regard to computational time, in comparison with the explicit one.


Introduction
In river valleys protected by embankments, floods can occur in adjacent areas due to a dike break or during controlled inflows of flood water into polders. Very often the flood covers a large area and the transition time of the flood wave through the river valley can reach even up to a few weeks. Consequently, the mathematical modeling of the flood wave propagation becomes a challenging problem because the governing equations have to be numerically integrated over a vast area for a long simulation time. For this reason, it is very important to elaborate on a relatively simple model, which is able to simultaneously ensure adequate accuracy of the solution within a reasonable time frame.
An unsteady flow over a floodplain can be simulated using the two-dimensional shallow water equation (SWE) [1][2][3][4][5]. The methodology of the solution of the SWE is well recognized for open channels, shallow reservoirs, and wetlands initially covered with a water layer [6]. However, usually a floodplain is initially dry and the SWE must be solved in a domain changing over time as a wave propagates. Consequently, the solution domain is bounded by a moving wave front that separates dry and wet areas. In order to avoid the mentioned problem, the numerical solution of the SWE requires special algorithms that are often based on the finite volume method (FVM) [3,4,7,8] or the finite element method (FEM) [9,10]. The numerical algorithms of FVM as well as FEM lead to satisfactory results when additional modifications are used for flow over an initially dry area. However, the applied modifications can decrease the accuracy of the obtained solution, and simultaneously increase the complexity of the numerical algorithm. Additionally, they become more time-consuming. For these reasons, a simplified model in the form of a diffusive wave have been developed.
conditions. However, in the case of wave propagation problems, the time step can be maintained with a reasonable length only for part of the simulation period, whereas, for the rest of this period, it must be reduced to very small values.
For time integration, the implicit scheme is preferred over explicit ones. The implicit scheme is unconditionally stable and allows a much larger time step to be maintained, which reduces the total number of steps. Consequently, this can make the calculation more efficient. The implicit method, in the form of a two-level difference scheme, was applied for the solution of the diffusive wave equation in the context of a 2D flow [22] as well as a 1D flow [27]. It is interesting that, for the solution of the diffusive wave equation, the implicit methods are seldom applied despite their clear advantage over the explicit schemes. It seems that such a situation can be explained by the fact that the application of the implicit scheme leads to a system of nonlinear algebraic equations that must be solved at each time step. Consequently, the solution of the resulting nonlinear system requires an iterative method, which can complicate the numerical algorithm.
A parallel implementation of solvers for a two-dimensional flow described by the diffusive wave approach has been presented only for explicit schemes [25,26,32,34]. However, using explicit schemes introduces very strict constraints on the computational time step value, and, thus, makes the algorithms very ineffective. As far as implicit schemes are considered, the parallelization for the solution of the 2D diffusive wave equation has not been applied until now. Although many different numerical techniques are used for the solution of this equation, it seems that it is possible to propose a new one, which will be much more efficient. The proposed approach is based on the construction of an algorithm tailored for parallel execution. In such a case, the whole solution approach has to be designed from its basis with the goal of parallelization in mind. To this order, possible decompositions should be applied, and the proper method of the discretization of differential equations has to be chosen. Such a proceeding allows an algorithm to be obtained that can be processed in parallel. Although this approach is much more complicated than the sequential one, it usually results in algorithms that are very efficient. The approach considered in this study deals with the latter category.
This paper focuses on two main aspects of the solution of a 2D diffusive wave equation for floodplain inundation problems. The first concerns improvements for efficiency of the numerical solution, by concentrating on the use of the implicit scheme in parallel computations. For this purpose, in the proposed algorithm, a high computational efficiency was achieved by the application of the dimensional splitting technique and the implicit time integration scheme, which is unconditionally stable, as well as the modified Picard method for the solution of the nonlinear systems of equations, which ensures convergent iterations. Further improvements were achieved using the symmetry of the tridiagonal matrix in the system of linear equations. The second aspect relates to comparing the computational efficiency as well as accuracy of the proposed method with the commonly used algorithm based on the explicit scheme.

Methods
In this work, the 2D diffusive wave model is described by a single nonlinear equation, as follows [16,38].
where t is the time variable (T), x and y are the space co-ordinates (L), H is the water surface elevation above the assumed datum (L), Φ = P − I − E is the source/sink term, which usually involves rainfall (P), infiltration (I), and evaporation (E) (L·T −1 ). The parameters K x and K y introduced in Equation (1) are called the hydraulic diffusion coefficients (L 2 ·T −1 ). They are given by the formulas below. where h = H − z is the flow depth (L), z is the bottom elevation above the assumed datum (L), and n x and n y are the Manning roughness coefficients in x and y direction, respectively (L −1/3 ·T). The term ∂H/∂s is the derivative of the function H with respect to the flow direction s, which is defined as follows.
Equation (1) is written in a conservative (divergent) form ensuring the consistency with the mass conservation principle represented in the SWE system by the continuity equation. It is worth adding that a nonlinear diffusive wave equation with the water surface elevation H, as a dependent variable, is conservative. However, as has been presented by Gąsiorowski and Szymkiewicz [39] for 1D flood routing problems, the same equation with discharge Q satisfying mass principle cannot be derived.
In the proposed numerical algorithm, the computational efficiency is improved by applying: (1) the dimensional decomposition of the 2D governing equation in order to avoid allocating significant memory resources in computers, (2) unconditionally stable implicit scheme instead of explicit schemes that are conditionally stable, (3) effective methods for solving the system of nonlinear and linear algebraic equations, where a property of the symmetrical banded matrix is used, and (4) parallelization methods utilizing multi-core computer architectures.

Decomposition of the 2D Equation with the Splitting Method
The dimensional decomposition technique is very usable and effective in the case of overland flow modelling. It is particularly suitable for parallel implementation. According to this technique, the 2D diffusive wave equation (Equation (1)) at a given time level is split into a set of 1D equations for each of the directions [27]. This leads to two equations describing the wave propagation process in directions x and y, respectively.
Indexes in the superscript of the H variable denote the solution stage resulting from the decomposition. In the first stage, Equation (4) is solved with the initial condition H (1) (x,y,t) = H(x,y i , t = t n ), where i = 1, . . . , N, with N denoting the number of spatial steps in the y direction and n denoting the time step index. As a result of this step, the water stage H (1) (x,y,t n+1 ) is obtained. Then, at the second stage, Equation (5) is solved with the initial condition H (2) (x,y,t) = H (1) (x j ,y,t n+1 ), where j = 1, . . . , M, with M denoting the number of spatial steps in the x direction. The example of a computational grid is displayed in Figure 1.
It is very important that, in the considered case of wave propagation over a floodplain, the directional decomposition allows the use of a rectangular grid of nodes. The only requirement is that the rectangular grid has to cover the entire area where the propagating flood wave is expected.
While solving the diffusive wave equation, the decomposition procedure can be applied with respect to the physical process as well. The physical decomposition leads to the decoupling of the flow over the floodplain from physical processes such as infiltration or evaporation, which is represented by the source/sink term Φ in Equation (1). This additional decomposition at the third stage leads to a Water 2019, 11, 2195 5 of 24 family of differential equations describing the variability of the water stages over time in terms of the intensity of the infiltration, precipitation, and evaporation.
Equation (6) is solved with the initial condition H (3) (x,y,t) = H (2) (x,y,t n+1 ) for all nodes. After the last stage, the water stage values at the next (n + 1) th time step are obtained.
Equation (6) is solved with the initial condition H (3) (x,y,t) = H (2) (x,y,tn+1) for all nodes. After the last stage, the water stage values at the next (n + 1) th time step are obtained.

Solution of the 1D Equation Using the FDM Explicit Scheme
While solving the diffusion wave equation, an explicit finite difference scheme is usually applied. In such a case, the considered 1D nonlinear equation can be approximated using the forward difference for the time derivative and a combination of backward and forward differences for the diffusion term. As a result, for Equation (4), the following scheme for nodes j = 2,3, ..., M − 1 is obtained.
where j is an index of node in the x direction, n is an index of time level, Δt is the time step, and Δx is the spatial step. For simplicity, in Equation (7) and in other algebraic equations, a second subscript for direction y was omitted. The nodal value of the average diffusion coefficient j K can be estimated as follows [40].
in which the term ∂H/∂s is approximated by the following formula.
where subscript i refers to the actually considered column of the computational grid ( Figure 1). The value of the diffusion coefficient K must be limited near the front of a wave propagating over a dry bottom. For this reason, it is assumed that K is equal to zero [41]: • when the water depth is negative or zero, h ≤ 0, • or when the water stage derivative takes a very small value, ∂H/∂s < ε, where ε represents the assumed tolerance, usually ranging from 10 −6 to 10 −9 .

Solution of the 1D Equation Using the FDM Explicit Scheme
While solving the diffusion wave equation, an explicit finite difference scheme is usually applied. In such a case, the considered 1D nonlinear equation can be approximated using the forward difference for the time derivative and a combination of backward and forward differences for the diffusion term. As a result, for Equation (4), the following scheme for nodes j = 2, 3, . . . , M − 1 is obtained.
where j is an index of node in the x direction, n is an index of time level, ∆t is the time step, and ∆x is the spatial step. For simplicity, in Equation (7) and in other algebraic equations, a second subscript for direction y was omitted. The nodal value of the average diffusion coefficient K j can be estimated as follows [40].
in which the term ∂H/∂s is approximated by the following formula.
where subscript i refers to the actually considered column of the computational grid ( Figure 1). The value of the diffusion coefficient K must be limited near the front of a wave propagating over a dry bottom. For this reason, it is assumed that K is equal to zero [41]: • when the water depth is negative or zero, h ≤ 0, • or when the water stage derivative takes a very small value, ∂H/∂s < ε, where ε represents the assumed tolerance, usually ranging from 10 −6 to 10 −9 .
The rearrangement of Equation (7) yields the following.
Introducing the diffusive number, defined as: Equation (11) can be rewritten in the form below.
A stability and accuracy analysis performed for the linear variant of the explicit scheme (Equation (13) with D = const.) indicates that this scheme is second-order accurate, but it is only conditionally stable [42]. Using the von Neumann method, it can be proven that, in this case, the stability condition takes the following form.
which is also equivalent to the condition for the maximal value of the time step.
The application of the explicit scheme does not require the gathering of algebraic linear equations written for all nodes into a common system of equations that is solved at a given time level. In this case, after the imposition of the boundary conditions, it is sufficient that the calculations are carried out systematically through nodes from j = 2 to j = M − 1. The solution process regarding the y direction is analogical to the one described above for the x direction. This method provides a simple implementation in the source code, which does not require the allocation of a significant amount of memory, which is needed for solving large equation systems. However, due to constraints on the length of the time step presented above, such an explicit scheme is not effective. Some improvement of the computational efficiency can be reached by using the adaptive time step method (ATS) [35]. In the ATS method, the value of the time step is based on the stability conditions in the form of Equation (14). Thus, at the current time level, in each node of the computational grid, the diffusion coefficient is examined and the maximal allowable time step values for the following simulation step are computed. However, experiments show that the reduction in computation time using the ATS method is relatively low in comparison to the simulation with the constant time step. Therefore, it seems that remarkable progress can be achieved only when time integration is carried out using implicit schemes.

Solution of the 1D Equation Using Modified FEM with the Implicit Scheme
A more efficient algorithm can be obtained if the 1D nonlinear diffusive wave equation is solved using the modified Galerkin FEM [43]. A modification of the standard FEM deals with spatial integration. A detailed description of this method in the case of Equation (4) is given by Gąsiorowski [27]. The approximation of the 1D diffusive wave equation (Equation (4) or Equation (5)) using the modified FEM provides the following system of ordinary differential equations (ODE) over time.

•
for node j = 1 ω ∆x 2 where ω is the weighting parameter ranging from 0 to 1, and K j is the nodal value of the averaged diffusion coefficient calculated by Equation (8). Equation (17) written for each node gives a system of ODE, which can be expressed using the matrix notation.
where In vector G, the fluxes have a zero value for all internal nodes of elements. These fluxes can exist at the upstream (node j = 1) and downstream (node j = M) ends if the Neumann condition is imposed. For the Dirichlet condition (when the function H(t) is imposed) or for the impervious boundaries (when the flux through the boundary is null ∂H/∂x = 0), all other components of vector G have a zero value. In the considered problem the Neumann condition at the impervious boundary or the Dirichlet condition will be imposed, which means, in further analysis, vector G is omitted.
Due to the spatial discretization, the partial differential equation (Equation (4) or Equation (5)) was reduced to the system of ODE with respect to time (Equation (19)). For the solution of this system, considered to be the initial value problem, the following two-level method is used [44].
where H t+∆t and H t are the water stage values at the time step t + ∆t and t, respectively, ∆t is a time step, θ is the weighting parameter belonging to the interval [1,0]. The implementation of Equation (20) for the system of ODE (19) yields a system of algebraic non-linear equations. ( Introducing the diffusion number (Equation (12)), approximation equations in the system (21) for the interval node j = 2, 3, . . . , M − 1 can be written as follows. where: In Equation (22), two weighting parameters ω and θ are involved. These parameters determine the numerical properties of the presented scheme and allow the stability and the accuracy of the solution to be controlled. It is worth noting that the modified FEM with a two-level scheme (Equation (22)) can be considered as a general approximation formula, which, depending on the values of the weighting parameters ω and θ, leads to the appropriate scheme of FEM or FDM. For example, Equation (22) with the weighting parameter ω = 1 corresponds to the family of FDM formulas. Additionally, when the parameter θ is equal to zero, then Equation (22) is consistent with the explicit scheme of the FDM described by Equation (13).
The stability analysis performed by the Neumann method and the accuracy analysis carried out using the modified equation approach [42] showed that ω > 1/2 reduces non-physical oscillations related with numerical dispersity [40,44]. The parameter θ determines the accuracy and the stability of time integration, and particular values of this parameter lead to some well-known methods of solving ODE. For θ = 1/2, Equation (22) becomes the implicit trapezoidal scheme, which ensures an approximation of the 2nd order, while, for the value θ > 1/2, the approximation of the time derivative is of the first order. In such a case, the accuracy of the numerical solution depends on the time step ∆t, and the applied scheme generates numerical diffusion, which causes an excessive smoothing of the solution. Moreover, the values of parameter θ ≥ 1/2 cause the proposed scheme to be unconditionally stable with respect to time integration. This feature is an improvement over conditionally stable schemes and allows a bigger time step to be used than in the case of the explicit Euler scheme when θ = 0.

Solution of the Nonlinear System
Since the elements of matrix B depend on the solution, the considered problem is nonlinear. For this reason, the system of algebraic Equations (21) has to be solved using iterative methods. In hydraulic problems, the Picard or the Newton method is usually used. The Picard method is one of the simplest iteration techniques, which does not converge as fast as the Newton method. However, in contrast to the Newton method, it is usually convergent for any starting initial values. The modified version of the Picard method [44], which converges faster than the basic one, can be applied to solve the system of nonlinear Equations (21). Then it gives the following sequence of iterations.
where k is an iteration index, ∆H t+∆t is the correction vector, and The iteration process is continued until the following criterion of convergence is satisfied.
where δ is a specified tolerance, usually δ = 0.001 m is assumed.
As can be seen, Equation (23) differs from the Newton method in terms of the matrix of coefficients including the original matrices A and B of the system. Thus, such a form of the Picard method does not require the determination of the Jacobian matrix, which, in this case, can be computationally expensive.

Solution of the Linear System
During the numerical solution of the system of nonlinear algebraic equations by the modified Picard method (Equation (23)), in each iteration, the system of linear algebraic equations has to be solved. This system of linear equations can be written in the following matrix form.
t+∆t ), and f is the right-hand side vector (f = −F (k) ) The resulting coefficient matrix S is tridiagonal with non-zero elements located along the main diagonal. The structure of this matrix is as follows.
Including the properties and structure of matrix S appearing in the system (25) allows the efficiency of the solution algorithm to be improved. Thus, it reduces the required computer memory and the computational time. Such a system with a tridiagonal matrix can be successfully solved using the Thomas algorithm [42]. In this case, only non-zero elements within the bands are used during computation. Consequently, the total number of arithmetic operations is reduced to 3(M − 1) additions and multiplications and 2M − 1 divisions, where M is the number of unknowns [45].
When the modified Picard method is applied for the solution of the nonlinear system, the non-zero elements in matrix S take the following form.
• for row j = 1: • for rows j = 2, 3, . . . , M − 1: • for row j = M: It can be noticed that, except for the first and last rows in matrix S (Equation (27a,b) and Equation (29a,b)), the element α j in the current row j (Equation (28a)) is the same as the element γ j−1 in the preceding row (Equation (28c)) for j − 1. This allows the numerical algorithm to be improved with respect to memory capacity and performance. Taking this into consideration, the resulting matrix S will have the following structure.
The utilization of the resulting matrix structure allows memory to be saved since it is required to store 2(M − 3) + 7 elements instead of 3M. In addition, the Thomas algorithm used for solving tridiagonal systems can be improved in a very simple manner. This will be mainly due to the decreased number of RAM hits. For this reason, the whole matrix S is stored in a single one-dimensional array.
The repeating values γ are included only once in the vector. This storage format has the following form.

Parallelization and Solver Implementation
Due to the applied domain splitting, computations for different rows or columns of the computational grid can be executed independently. Thus, such a numerical algorithm is easily scalable and suitable for parallel computations. For code parallelization, the Open Multi-Processing (OpenMP) framework was used. The OpenMP is a multi-platform extension that provides a set of pre-processor directives and functions, which allow parallel computations to be easily implemented in a fork-join model on a multi-core personal computer with a shared memory. The utilization of the OpenMP requires the suitable specification for the parts of code for parallel execution. This is achieved with the appropriate OpenMP directives. The specified part of the code is executed in parallel. A master thread is forked to a number of slave threads, which are run concurrently, and when the execution of all threads is finished, then the results are merged (Figure 2b). The utilization of the resulting matrix structure allows memory to be saved since it is required to store 2(M − 3) + 7 elements instead of 3M. In addition, the Thomas algorithm used for solving tridiagonal systems can be improved in a very simple manner. This will be mainly due to the decreased number of RAM hits. For this reason, the whole matrix S is stored in a single onedimensional array. The repeating values γ are included only once in the vector. This storage format has the following form.

Parallelization and Solver Implementation
Due to the applied domain splitting, computations for different rows or columns of the computational grid can be executed independently. Thus, such a numerical algorithm is easily scalable and suitable for parallel computations. For code parallelization, the Open Multi-Processing (OpenMP) framework was used. The OpenMP is a multi-platform extension that provides a set of pre-processor directives and functions, which allow parallel computations to be easily implemented in a fork-join model on a multi-core personal computer with a shared memory. The utilization of the OpenMP requires the suitable specification for the parts of code for parallel execution. This is achieved with the appropriate OpenMP directives. The specified part of the code is executed in parallel. A master thread is forked to a number of slave threads, which are run concurrently, and when the execution of all threads is finished, then the results are merged (Figure 2b). The implemented solver consists of the main time loop, which includes two inner loops with one iterating through columns (integration in the y direction) and one iterating through rows (integration in the x direction) of the grid (Figure 2a). The work in these loops is forked to the slave threads and executed in parallel. The slave threads are assigned to the different CPU cores by the The implemented solver consists of the main time loop, which includes two inner loops with one iterating through columns (integration in the y direction) and one iterating through rows (integration in the x direction) of the grid (Figure 2a). The work in these loops is forked to the slave threads and executed in parallel. The slave threads are assigned to the different CPU cores by the OpenMP runtime environment. This situation is schematically depicted in Figure 3. Due to the directional decomposition, the computations in slave threads are performed independently and there is no need to communicate between them. In order to eliminate the unnecessary usage of shared variables, private variables are introduced, which are assigned to the slave threads. Each OpenMP thread has its own memory pool in which its private variables are allocated ( Figure 4). Communication between the master and the slave threads is done with the use of shared variables. The shared variables are the bottom elevation Z and water stage values H for the whole solution domain. However, extensive use of the shared variables by the slave threads would introduce a significant overhead due to the handling of concurrent access to the memory resources. To improve the efficiency of the algorithm, the communication between master and slave threads during parallel execution should be minimal. Shared variables are accessed by the slave threads only when the necessity of reading or writing the data occurs. Such a situation takes place when the master thread is forked and when the slave threads are joined. There is no communication between slave threads during the computations of the water stage values. In the first operation after the fork, the water stage values and bottom elevations corresponding to a considered row or column are copied into the private variables of the slave threads. The water stage values are known from the initial condition or the previous step of computation. When the slave threads finish their tasks, then the result of the computations is written in the shared variable H.
The data from the shared variables H and Z is copied to the private variables denoted as H0 and z, respectively. The private variable z stores the bottom elevation of the considered row or column, whereas H0 stores the water stage values. Apart from the water stage values and bottom elevation in a row or column, the variables required to form and solve the system of arising nonlinear algebraic equations are used. These private variables are used for storing the system matrix S (Equation (25)) and the right-hand side vector f. To improve the solver efficiency, an auxiliary variable p for storing the vector on the right-hand side of Equation (21) is used. This expression is also a part of vector f (Equation (23)) and is constant at a given time level t. Therefore, it has to be computed only once in a considered time step. The computed water stage at time level t + Δt is saved to the variable H0.
Apart from the utilized solution algorithm, in order to achieve the best possible performance of Due to the directional decomposition, the computations in slave threads are performed independently and there is no need to communicate between them. In order to eliminate the unnecessary usage of shared variables, private variables are introduced, which are assigned to the slave threads. Each OpenMP thread has its own memory pool in which its private variables are allocated ( Figure 4). Due to the directional decomposition, the computations in slave threads are performed independently and there is no need to communicate between them. In order to eliminate the unnecessary usage of shared variables, private variables are introduced, which are assigned to the slave threads. Each OpenMP thread has its own memory pool in which its private variables are allocated (Figure 4). Communication between the master and the slave threads is done with the use of shared variables. The shared variables are the bottom elevation Z and water stage values H for the whole solution domain. However, extensive use of the shared variables by the slave threads would introduce a significant overhead due to the handling of concurrent access to the memory resources. To improve the efficiency of the algorithm, the communication between master and slave threads during parallel execution should be minimal. Shared variables are accessed by the slave threads only when the necessity of reading or writing the data occurs. Such a situation takes place when the master thread is forked and when the slave threads are joined. There is no communication between slave threads during the computations of the water stage values. In the first operation after the fork, the water stage values and bottom elevations corresponding to a considered row or column are copied into the private variables of the slave threads. The water stage values are known from the initial condition or the previous step of computation. When the slave threads finish their tasks, then the result of the computations is written in the shared variable H.
The data from the shared variables H and Z is copied to the private variables denoted as H0 and z, respectively. The private variable z stores the bottom elevation of the considered row or column, whereas H0 stores the water stage values. Apart from the water stage values and bottom elevation in a row or column, the variables required to form and solve the system of arising nonlinear algebraic equations are used. These private variables are used for storing the system matrix S (Equation (25)) and the right-hand side vector f. To improve the solver efficiency, an auxiliary variable p for storing the vector on the right-hand side of Equation (21) is used. This expression is also a part of vector f (Equation (23)) and is constant at a given time level t. Therefore, it has to be computed only once in a considered time step. The computed water stage at time level t + Δt is saved to the variable H0.
Apart from the utilized solution algorithm, in order to achieve the best possible performance of Communication between the master and the slave threads is done with the use of shared variables. The shared variables are the bottom elevation Z and water stage values H for the whole solution domain. However, extensive use of the shared variables by the slave threads would introduce a significant overhead due to the handling of concurrent access to the memory resources. To improve the efficiency of the algorithm, the communication between master and slave threads during parallel execution should be minimal. Shared variables are accessed by the slave threads only when the necessity of reading or writing the data occurs. Such a situation takes place when the master thread is forked and when the slave threads are joined. There is no communication between slave threads during the computations of the water stage values. In the first operation after the fork, the water stage values and bottom elevations corresponding to a considered row or column are copied into the private variables of the slave threads. The water stage values are known from the initial condition or the previous step of computation. When the slave threads finish their tasks, then the result of the computations is written in the shared variable H.
The data from the shared variables H and Z is copied to the private variables denoted as H0 and z, respectively. The private variable z stores the bottom elevation of the considered row or column, whereas H0 stores the water stage values. Apart from the water stage values and bottom elevation in a row or column, the variables required to form and solve the system of arising nonlinear algebraic equations are used. These private variables are used for storing the system matrix S (Equation (25)) and the right-hand side vector f. To improve the solver efficiency, an auxiliary variable p for storing the vector on the right-hand side of Equation (21) is used. This expression is also a part of vector f (Equation (23)) and is constant at a given time level t. Therefore, it has to be computed only once in a considered time step. The computed water stage at time level t + ∆t is saved to the variable H0.
Apart from the utilized solution algorithm, in order to achieve the best possible performance of the solver, the proper format of variables has to be used. As linear data structures allow better efficiency to be obtained than two-dimensional data structures, all variables are stored in the form of linear arrays. Bottom elevation data Z and water stage values H for the whole solution domain are stored in the column-major format in linear arrays of size N × M. The private variables storing the rows/columns of water stage values or bottom elevations are arrays of size max (M,N), respectively. A similar situation occurs in the case of the right-hand side vector f. The tridiagonal system matrix S is stored in the row-major format specified in Equation (31).
In modern computer architectures, one of the significant bottlenecks are RAM calls. Since the operating system can store variables in different locations of the RAM, then the access to multiple resources can be time-consuming. Therefore, in the implementation of the solver, a separate memory pool is assigned to each of the threads (Figure 4). To increase the memory search performance, private variables are allocated as adjacent memory blocks in the memory pools assigned to the threads.
Computations in the horizontal direction are performed first and then computations in the vertical direction are performed. Therefore, there is no need to allocate memory for each direction separately, only for the one that has the greater number of nodes. The same memory block is used for the computations in the other direction.
The computation process performed in each thread and the utilization of the variables is listed below.
(1) Bottom height values for the considered row or column are copied from the shared bottom variable Z and stored in the private variable z, (2) Water stage values from the previous computation step for the considered row or column are copied from the shared water stage variable H and written in the private variable H0, (3) Private variable S is used to create and then temporarily store the matrix A − ∆t·(1 − θ)B t , (4) Product (A − ∆t·(1 − θ)B t )·H t is stored in the private variable p, (5) Private variable S is used to create and store matrix A + ∆t·θ·B t+∆t is updated and stored in H0, (9) If the required solution accuracy is obtained, go to step 10, if not, return to step 5, Obtained water stage values for the considered row or column are copied into the global variable H storing the water stage values for the whole domain.
In the above list, steps 5 to 9 refer to the solution of the arising system of nonlinear equations using the modified Picard method.

Measure of Efficiency
The computational efficiency is considered as the number of operations carried out at the execution time in order to obtain a numerical solution with assumed accuracy. Thus, the efficiency can be increased by reducing the computational time, i.e., using the longer time step ∆t for assumed accuracy or by choosing a spatial grid of a suitable size for a given computational time. At the same time, it should be remembered that increasing the time step or spatial step decreases the accuracy of the obtained solution. For this reason, the computational efficiency, including the accuracy of the solution, is a more preferable and useful measure for the comparison of different numerical algorithms than the solution accuracy or execution time considered separately. In the study, the computational efficiency was estimated by the following formula [42].
where E is the efficiency, c is a constant (it is assumed c = 1), T S is the computational time, and RMSE is an accuracy of the solution estimated using the root mean square error.
where H ref is the reference solution, H test denotes the tested solution, and x j is the coordinate of the computational grid node in the tested solution, where j = 1, . . . , M.
The results of the computations were also compared with regard to the relative efficiency.
In which ∆E is the relative efficiency, and E E , E EA , and E I denote the efficiency (calculated using Equation (32)) of the considered scheme: pure explicit, explicit with a variable adaptive time step (ATS), and implicit, respectively.
In order to gain information about the behavior of the solver when it is executed in parallel, an additional measure, in the form of the speed up and the efficiency of parallel computations, was also used. The speed up parameter is defined as the ratio of the sequential computation time to the parallel computation time in order to execute the same algorithm [46].
whereas parallel efficiency is defined as the ratio of the speed up to the number of processors.
where S is the speed up, T S is the computation time with one processor, T P is the computation time with N P processors, and N P is the number of processors.

Results
In order to check the accuracy and computational efficiency of the considered implicit as well as the explicit scheme, 1D and 2D simulations of flood wave propagation were performed. The solver was implemented in C++ language (GCC 6.4.0 compiler suite) and compiled to a 64-bit executable.

One-Dimensional Flow, Horizontal Plane Wetting Test
In this test, a flat channel of the length L = 1000 m with a constant value of the Manning roughness coefficient n = 0.04 m −1/3 s is considered. This coefficient value corresponds to a natural channel with gravel in the bottom [47]. The initial water elevation at the time t = 0 s corresponds to the bottom elevation H 0 (x, t = 0) = Z(x) = 0 m. At the upstream end (x = 0), the inflow boundary condition is imposed in the form of water elevations varying in time.    (33)) as a criterion. It can be seen that the solution obtained with the implicit scheme for θ = 0.5 ( Figure 5a) and with different time step values is only slightly different from the reference solution as well as from the solution obtained with the explicit scheme. However, as the values of the weighting parameter and the time step increase (θ = 0.6 ÷ 1.0 and ∆t = 2 ÷ 20 s), the obtained solution is systematically overestimated with regard to the reference solution (Figure 5b-d). These results confirm that the applied implicit algorithm with θ = 0.5 becomes an approximation of the second order with respect to the time, whereas, for other values, the scheme generates the numerical diffusion with a maximal intensity for θ = 1 (Figure 5d). Table 1, this table presents the influence of the time step and spatial step variability on the accuracy of solutions. In this test, the spatial step ∆x was equal to 2, 5, and 10 m, which correspond to 500, 200, and 100 finite elements, respectively. A comparison of solutions (at the time t = 450 s) was provided for the explicit scheme with the ATS adjustment and for the implicit scheme with different values of the θ parameter. The accuracy of the solution for each scheme was estimated using RMSE (Equation (33)) as a criterion. Table 1. Impact of the space interval ∆x, time interval ∆t, and spatial approximation weighting parameter θ on the accuracy of the numerical solution obtained using the implicit and explicit schemes with ATS.

Spatial
Step ∆x Explicit Implicit

Time
Step min/max ∆t

RMSE Time
Step ∆t  This experiment showed that the accuracy of the solution obtained with the explicit scheme is very similar to the corresponding one generated by the implicit scheme, but with an incomparably greater time step. It takes place when the numerical diffusion is minimized, i.e., for a low value of the weighting parameter θ. For example, the RMSE obtained for the spatial step ∆x = 5 m with the explicit scheme has the same order of magnitude (RMSE = 2.64 × 10 −3 m) as the solution obtained with the implicit method for identical input data with the parameter θ ranging from 0.5 to 0.6 (RMSE = 1.58 × 10 −3 m for θ = 0.5 and RMSE = 10.61 × 10 −3 m for θ = 0.6). In the case of other values of this parameter, θ = 0.8 and θ = 1.0, higher errors were received of RMSE = 24.89 × 10 −3 m and RMSE = 36.26 × 10 −3 m, respectively. For this reason, in all further examples, the weighting parameter θ equal to a value ranging from 0.5 to 0.65 will be used. Such values ensure a stable solution without oscillations, and, simultaneously, the generated numerical diffusion is minimized.
In the next tests, the computational efficiency is considered. At first, a comparison of the computational efficiency between an explicit scheme with a constant time step and a scheme with the adaptive time step (ATS) was performed. Computations were driven for different values of the spatial step ∆x varying between 2 and 10 m. In order to obtain the allowable time step for computations with a constant time step, the minimal value was taken from the adaptive method computations. The results for a simulation period of 450 s are presented in Table 2. Additionally, in Figure 6, the total computation time (Figure 6a) and the efficiency (Figure 6b) with respect to the spatial step ∆x are presented. Table 2. Accuracy and efficiency statistics (the total computation time, the computational efficiency (Equation (32)) and the relative efficiency (Equation (34a)) for the explicit scheme with the constant time step and the adaptive time step (ATS).

Spatial
Step ∆x

Time
Step It is apparent (Table 2 and Figure 6a) that the profit from the time step adaptation with the ATS method is moderate with respect to the execution time. For example, in the case of the computational grid with Δx = 5 m, the required computational time TS was equal to 4.35 s for the constant time step algorithm, whereas the ATS method required 3.82 s. This corresponds to a reduction in the computation time by about 20% in comparison to the simulation with the constant time step. However, if in the considered example we take into account the accuracy of the solution, then it can be seen that the relative efficiency of the ATS method (EEA = 106.9 m −1 s −1 ) is less than the efficiency of the pure explicit method (EEA = 119.9 m − 1 s −1 ). Consequently, the relative computational efficiency ΔE increases to a value 10.8% in favor of the explicit method with the constant time step. This tendency is even more visible in Figure 6b, where it is clearly displayed that the ATS method becomes less effective with regard to the spatial step Δx ≥ 5 m.
In Figure 7, the variability of the time step length with the simulation time for the solution of the ATS method is displayed. It can be seen that the time step length rapidly becomes very small after the initial steps of the simulation and converges to a value of the time step corresponding to a pure explicit scheme. Consequently, the time step can be maintained at a reasonable length only during part of the wave propagation period, whereas, for the rest of the simulation, it must be reduced to very small values.   It is apparent (Table 2 and Figure 6a) that the profit from the time step adaptation with the ATS method is moderate with respect to the execution time. For example, in the case of the computational grid with ∆x = 5 m, the required computational time T S was equal to 4.35 s for the constant time step algorithm, whereas the ATS method required 3.82 s. This corresponds to a reduction in the computation time by about 20% in comparison to the simulation with the constant time step. However, if in the considered example we take into account the accuracy of the solution, then it can be seen that the relative efficiency of the ATS method (E EA = 106.9 m −1 s −1 ) is less than the efficiency of the pure explicit method (E EA = 119.9 m − 1 s −1 ). Consequently, the relative computational efficiency ∆E increases to a value 10.8% in favor of the explicit method with the constant time step. This tendency is even more visible in Figure 6b, where it is clearly displayed that the ATS method becomes less effective with regard to the spatial step ∆x ≥ 5 m.
In Figure 7, the variability of the time step length with the simulation time for the solution of the ATS method is displayed. It can be seen that the time step length rapidly becomes very small after the initial steps of the simulation and converges to a value of the time step corresponding to a pure explicit scheme. Consequently, the time step can be maintained at a reasonable length only during part of the wave propagation period, whereas, for the rest of the simulation, it must be reduced to very small values. explicit scheme. Consequently, the time step can be maintained at a reasonable length only during part of the wave propagation period, whereas, for the rest of the simulation, it must be reduced to very small values.  Table 3, this table presents a detailed comparison between the explicit ATS method and the implicit scheme with regard to the computational efficiency. As shown, the computational time required to finish the simulation using the implicit method is incomparably smaller, by about two orders of magnitude, than the time required by the explicit method. For example, in the case of computations with Δx = 5 m and Δt = 5 s imposed as the initial time step for the explicit method, the computation time was equal to 3.53 s. The computations for an analogous case using the implicit method require a significantly shorter time 0.03 s to finish the simulation.  Table 3, this table presents a detailed comparison between the explicit ATS method and the implicit scheme with regard to the computational efficiency. As shown, the computational time required to finish the simulation using the implicit method is incomparably smaller, by about two orders of magnitude, than the time required by the explicit method. For example, in the case of computations with ∆x = 5 m and ∆t = 5 s imposed as the initial time step for the explicit method, the computation time was equal to 3.53 s. The computations for an analogous case using the implicit method require a significantly shorter time 0.03 s to finish the simulation. Table 3. Accuracy and efficiency statistics (the total computation time, the computational efficiency, and (Equation (32)) the relative efficiency (Equation (34b)) for the explicit scheme with the constant time step, the adaptive time step (ATS), and the implicit method with θ = 0.5 and ω = 0.7.

Spatial
Step x

Time
Step min/max t

Time
Step ∆t The obtained results also indicate that, for a similar order of solution accuracy (RMSE = 1.58 × 10 3 m for the implicit and RMSE = 2.64 × 10 −3 m for the explicit scheme), the implicit scheme is definitely more effective than the explicit scheme. In the considered example (for ∆x = 5 m), the implicit scheme (with efficiency E I = 21,059 m −1 s −1 ) outperforms the explicit scheme (with efficiency E I = 106.9 m −1 s −1 ), and the relative efficiency ∆E is equal to 19,590%. Considerable differences between the schemes become more obvious in Figure 6, which displays the variability of total computation time and the computational efficiency with regard to the spatial step. It can be shown that the amount of time T S required to perform the simulation differs significantly for implicit and explicit methods (Figure 6a), even considering the explicit scheme with the variable time step (ATS). If we consider the computational efficiency (Figure 6b), it is clearly visible that, in this case, the effectiveness of the implicit scheme also increases significantly with the spatial step increasing to the value ∆x = 5 m. After reaching this value, the efficiency is at a high level in comparison to the explicit scheme, but the increase is not so intense.

Two-Dimensional Flow, Parallel Computation for the Wetting-Drying Test
In the final test, simulations of a 2D flow over a floodplain using parallel computations were examined. For this purpose, an area of the dimensions L x × L y = (1000 × 1000) m 2 was considered. The bottom elevation in the considered area is given with the following formula.
Z(x, y) = sin 10 x − L x L x · cos 10 y − L y L y The bottom takes a shape with local hillocks. The bottom elevations, in the form of a contour-density plot, are given in Figure 8a. The Manning roughness coefficient is assumed to be constant and equal to n = 0.05 m −1/3 s over the whole floodplain. At the beginning of the simulation, the area is dry, so the initial water elevation at the time t = 0 s corresponds to the bottom elevation H 0 (x,y,t = 0) = Z(x,y). The zero-flux boundary condition with regard to the water stage is imposed at the boundary of the considered area. The exception is the section placed at y = 0 m and at x between 450 m and 550 m, where the Dirichlet boundary condition, in the form of the variable water level H b (t), is imposed ( Figure 8b). As in the previous test, the computation results were compared to the reference solution obtained by the implicit scheme with the mesh element size Δx = 0.1 m (number of elements 10,000 × 10,000) and the time step size Δt = 0.01 s. Computations conducted by the explicit scheme with the spatial step Δx = Δy = 1 m (number of nodes 1000 × 1000) and the time step Δt = 0.0001 s were also compared to the reference solution. Since the resolution grid for the reference solution is very fine and the total computation time is large for the considered 2D flow problem, the comparison between solutions with regard to the accuracy is performed at the selected time of 1200 s, whereas the computation times TS are compared for the entire simulation period t = 30,000 s. Example solutions of water profiles at given cross-sections corresponding to x = 500, computed for different grid resolutions, are given in Figure 10. In Figure 9 the computation results (for 200 × 200 elements), in the form of a 3D view of the water surface at selected times, are presented. It is apparent that, due to the assumed inflow boundary condition at the beginning of the inundation, the front of the water wave moves over a dry bottom (Figure 9a,b).   It can be seen that the results obtained with the explicit and implicit schemes are close to the reference solution. In the simulation driven using the explicit scheme, the wave propagates with the highest speed and, after the time 1200 s, it reaches a distance of about 694 m. After the same time in the simulation conducted by the implicit scheme, the wave front propagates at a lower speed while achieving a distance of about 674 m. Thus, the solution obtained with the explicit scheme is overestimated, whereas the solution with the implicit scheme is underestimated in comparison with the reference solution where the wave front is located at 680 m.
The quantitative comparison of the results for the implicit scheme is given in Table 4. The errors, expressed in terms of RMSE, vary from 0.026 m (for 100 × 100 elements) to 0.036 m (for 200 × 200 elements), which can be considered as insignificantly small from a practical viewpoint. Similar results were also obtained for the explicit scheme, where the total RMSE achieved the value of 0.02 m.
The calculations of the computation time, the speed up (Equation (35)) and the parallel efficiency (Equation (36)) with regard to the grid resolution and the number of used CPU processors are summarized in Table 4. The dependencies of the speed up as well as parallel efficiency on the number of processors are plotted in Figure 11. As in the previous test, the computation results were compared to the reference solution obtained by the implicit scheme with the mesh element size ∆x = 0.1 m (number of elements 10,000 × 10,000) and the time step size ∆t = 0.01 s. Computations conducted by the explicit scheme with the spatial step ∆x = ∆y = 1 m (number of nodes 1000 × 1000) and the time step ∆t = 0.0001 s were also compared to the reference solution. Since the resolution grid for the reference solution is very fine and the total computation time is large for the considered 2D flow problem, the comparison between solutions with regard to the accuracy is performed at the selected time of 1200 s, whereas the computation times T S are compared for the entire simulation period t = 30,000 s. Example solutions of water profiles at given cross-sections corresponding to x = 500, computed for different grid resolutions, are given in Figure 10.  It can be seen that the results obtained with the explicit and implicit schemes are close to the reference solution. In the simulation driven using the explicit scheme, the wave propagates with the highest speed and, after the time 1200 s, it reaches a distance of about 694 m. After the same time in the simulation conducted by the implicit scheme, the wave front propagates at a lower speed while achieving a distance of about 674 m. Thus, the solution obtained with the explicit scheme is overestimated, whereas the solution with the implicit scheme is underestimated in comparison with the reference solution where the wave front is located at 680 m.
The quantitative comparison of the results for the implicit scheme is given in Table 4. The errors, expressed in terms of RMSE, vary from 0.026 m (for 100 × 100 elements) to 0.036 m (for 200 × 200 elements), which can be considered as insignificantly small from a practical viewpoint. Similar results were also obtained for the explicit scheme, where the total RMSE achieved the value of 0.02 m. It can be seen that the results obtained with the explicit and implicit schemes are close to the reference solution. In the simulation driven using the explicit scheme, the wave propagates with the highest speed and, after the time 1200 s, it reaches a distance of about 694 m. After the same time in the simulation conducted by the implicit scheme, the wave front propagates at a lower speed while achieving a distance of about 674 m. Thus, the solution obtained with the explicit scheme is overestimated, whereas the solution with the implicit scheme is underestimated in comparison with the reference solution where the wave front is located at 680 m.
The quantitative comparison of the results for the implicit scheme is given in Table 4. The errors, expressed in terms of RMSE, vary from 0.026 m (for 100 × 100 elements) to 0.036 m (for 200 × 200 elements), which can be considered as insignificantly small from a practical viewpoint. Similar results were also obtained for the explicit scheme, where the total RMSE achieved the value of 0.02 m. Table 4. Accuracy, efficiency, and statistics (computational time, speed up (Equation (35)), and parallel efficiency (Equation (36)) for parallel computations using the implicit scheme (θ = 0.65 and ω = 0.7) for the simulation time equal to t = 30,000 s.

Time
Step The calculations of the computation time, the speed up (Equation (35)) and the parallel efficiency (Equation (36)) with regard to the grid resolution and the number of used CPU processors are summarized in Table 4. The dependencies of the speed up as well as parallel efficiency on the number of processors are plotted in Figure 11.  Table 4. Accuracy, efficiency, and statistics (computational time, speed up (Equation (35)), and parallel efficiency (Equation (36)) for parallel computations using the implicit scheme (θ = 0.65 and ω = 0.7) for the simulation time equal to t = 30,000 s.

Spatial
Step  11. Dependency of (a) speed up and (b) efficiency on the number of processors applied in the CPU for parallel computations.
As shown, the parallelization strategy is very effective in comparison with the sequential computation. For all assumed grid resolutions, an enormous reduction in computational time is observed. For example, the calculation performed using eight cores on a grid of 1000 × 1000 nodes took TS = 1,5064.3 s (4.1 h), whereas the calculation on a single processor was carried out within TS = 7,3267.8 s (20.3 h). The computational grid with 1000 × 1000 elements corresponds to over one million nodes. Such a refined grid, applied to the simulation of a flood wave at a real time equal to t = 30,000 (8.33 h), constitutes a demanding task with regard to computing power. The same simulation Figure 11. Dependency of (a) speed up and (b) efficiency on the number of processors applied in the CPU for parallel computations.
As shown, the parallelization strategy is very effective in comparison with the sequential computation. For all assumed grid resolutions, an enormous reduction in computational time is observed. For example, the calculation performed using eight cores on a grid of 1000 × 1000 nodes took T S = 15,064.3 s (4.1 h), whereas the calculation on a single processor was carried out within T S = 73,267.8 s (20.3 h). The computational grid with 1000 × 1000 elements corresponds to over one million nodes. Such a refined grid, applied to the simulation of a flood wave at a real time equal to t = 30,000 (8.33 h), constitutes a demanding task with regard to computing power. The same simulation performed by the explicit scheme takes a very long time. In this case, it took 1,638,140 s (455 h = 18 days) with the minimum time step ∆t = 0.001 s using eight threads. Thus, the application of the explicit scheme is very ineffective on a high-resolution grid with a large number of nodes.
As previously stated, the analysis of parallelization by means of the speed up factor and parallel efficiency allows the determination of the optimal number of processors to use for the considered simulation problem. With regard to the speed up (Figure 11a), we can see that this parameter increases as the number of processors increases. Thus, the proposed algorithm is scalable, i.e., it ensures the adequate acceleration growth of calculations as the number of processors increases. In the study, the obtained tendency in terms of the speed up ( Figure 11a) has a nonlinear character, where the growth rate of the speed up gradually decreases with the number of processors. For example, a simulation with two processors using a grid of 1000 × 1000 gives a speed-up factor of 1.8, whereas the use of eight processors leads to a speed-up of 4.86 (-) for the same grid resolution. This decreasing trend is even more apparent with regard to the efficiency (Figure 11b), which yields the values 0.9 and 0.61 (-), respectively.

Discussion
The explicit scheme, due to the stability condition, requires the use of a small-time step, which allows obtaining relatively high accuracy with regard to time integration. In the case of an implicit scheme, the accuracy of the calculations depends on the value of the weighting parameter θ and the time step. As shown, for θ = 0.5, the accuracy of the solution is practically the same as for the explicit scheme and does not depend on the length of the time step. However, for such a value of parameter (θ = 0.5), a numerical solution can generate the non-physical oscillations, which usually lead to the breakdown of calculations. In practical hydrological problems, in order to eliminate these oscillations, the numerical diffusion is introduced by using the weighting parameter θ larger than 0.5. In such a case larger time step leads to results that will be slightly less accurate than when using an explicit scheme. It is worth noting that the proposed scheme also provides a high flexibility, because, by selecting the appropriate values of numerical parameters θ and ω various schemes can be obtained. Thus, due to this feature, the accuracy and stability of the numerical solution can be controlled. The explicit scheme does not provide such possibilities.
For a similar order of accuracy, the implicit algorithm is significantly more effective, about two orders of magnitude with regard to the computational time, than the explicit scheme. Moreover, the implicit scheme outperforms the explicit scheme with the relative efficiency ranging from E = 5300% to E = 49,100% for a 1D problem. Even the use of a variable adaptive time step (ATS) in the explicit scheme leads to a reduction of computation time of only about 20% in comparison with the simulation with the constant time step. It is interesting that the time step can be maintained with a reasonable length only during part of the wave propagation period, whereas, for the rest of the simulation, it must be reduced to very small values to fulfil the stability condition. Moreover, if the accuracy of the obtained solution is considered in the efficiency estimation, then the calculated efficiency of the ATS method can be lower than for the pure explicit scheme. Such a surprising situation is due to the lower accuracy of the solution obtained by means of the ATS method when the larger time step is used. The increasing time step involves a higher value of RMSE, which, in turn, leads to the decreased relative efficiency ∆E (according to Equation (34a)).
An approximation of the diffusive wave equation using the explicit schemes allows calculations to be performed that do not require iterative methods to solve a system of nonlinear algebraic equations, which is the case with an implicit scheme. For this reason, the explicit schemes are easy to implement for parallel computations. However, the application of the explicit scheme is very ineffective for a simulation of a two-dimensional flow on a grid with a large number of nodes. Even the use of parallelization techniques with a maximal number of processors (eight processors in the test) does not bring acceptable results, because, when a long real-time simulation is required, then the execution time can reach unacceptable values from the practical point of view. For 1000 × 1000 (1 million) nodes, the computational time was 455 h. On the other hand, the parallelization of computation using the implicit scheme is very effective and leads to a significant reduction in computational time. In the considered test, this time was only 4.1 h.
The proposed parallel strategy is scalable and ensures the appropriate speed up growth of calculations as the number of processors increases. In theory, the parallel execution of the solver should lead to a linear growth in speed-up for the maintenance of a constant efficiency equal to 1 (see Figure 11b). However, in reality, such a relationship cannot be achieved. The management of the parallel execution of the program and the copying of data results in a limit on the speed-up. In the study, the speed-up has a nonlinear character and is similar to those obtained by Leandro et al. [34]. However, in their study, an explicit scheme was implemented, which significantly extended the computational time.
Taking into account the presented results, the proposed parallel algorithm based on the implicit scheme seems to be a very attractive strategy for the two-dimensional modeling of flood inundation, where calculations with a high resolution or a long-term simulation for a vast domain are required.