1. Introduction
Dam-break and tsunami flows not only endanger human life but also cause great losses of property. These phenomena can be triggered by natural hazards such as earthquakes or heavy rainfall. When a dam breaks, a large amount of water is released almost instantaneously and propagates rapidly to the downstream area. Similarly, tsunami waves flowing rapidly from the ocean bring a large volume of water to coastal areas, endangering human life and damaging infrastructure. Since natural hazards have very complex characteristics in terms of spatial and temporal scales, they are quite difficult to predict precisely. It is therefore highly important to study the evolution of such flows as part of disaster management, providing the related stakeholders with useful information for decision-making. Such a study can be performed by developing a mathematical model based on the 2D shallow water equations (SWEs).
Recent numerical models of the 2D SWEs rely almost entirely on the computation of (approximate) Riemann solvers, particularly in applications of high-resolution Godunov-type methods. The simplicity, robustness, and built-in conservation properties of Riemann solvers such as the Roe and HLLC schemes have led to many successful applications in shallow flow simulations, see [
1,
2,
3,
4,
5], among others. Highly discontinuous flows, including transcritical flows, shock waves, and moving wet–dry fronts, have been accurately simulated.
Generally speaking, a scheme can be regarded as a Riemann solver if it is formulated on the basis of a Riemann problem. The Roe scheme was originally devised by [6] to estimate the interface convective fluxes between two adjacent cells on a spatially and temporally discretized computational domain by linearizing the Jacobian matrix of the partial differential equations (PDEs) with respect to its left and right eigenvectors. This linearized part contributes to the computation of the convective fluxes of the PDEs as a flux difference about the average value of the considered edge taken from its two corresponding cells. Since the eigenstructure of the PDEs, which leads to an approximation of the interface value in connection with the local Riemann problem, must be known when calculating the flux difference, the Roe scheme is regarded as an approximate Riemann solver.
More than 20 years later, Toro [
7] developed a new approximate Riemann solver, the HLLC scheme, to simulate shallow water flows; it is an extended version of the Harten-Lax-van Leer (HLL) scheme proposed in [8]. In the HLL scheme, the interface fluxes are approximated directly by dividing the solution region into three parts: left, middle, and right. The left and right regions correspond to the values of the two adjacent cells, whereas the middle region consists of a single constant state separated by the intermediate waves. One major flaw of the HLL scheme is that contact discontinuities and shear waves are smeared because the contact (middle) wave is missing. Therefore, Toro [7] remedied this in the HLLC scheme by including the computation of the middle wave speed, so that the solution is now divided into four regions. There are several ways to calculate the middle wave speed, see [
9,
10,
11]. All these calculations involve the eigenstructure of the PDEs, which is related to the local Riemann problem; this clearly places the HLLC scheme in the class of Riemann solvers.
In contrast to the Riemann solvers, Kurganov et al. [12] proposed the central-upwind (CU) method as a Riemann-solver-free scheme, in which the eigenstructure of the PDEs is not required to calculate the convective fluxes. Instead, the local one-sided speeds of propagation at every edge, which can be computed in a straightforward manner, are used. This scheme has been proven to be sufficiently robust while satisfying both the well-balanced and positivity-preserving properties, see [
13,
14,
15].
To solve the time-dependent SWEs, all the aforementioned schemes must be discretized in time using either an implicit or an explicit time stepping method. Despite its simplicity, the latter may suffer from a computational stability issue, particularly when simulating very shallow water on a very rough bed [
16,
17]. The former is unconditionally stable and flexibly allows a large time step; its computation, however, is admittedly complex. One way to mitigate the stability issue of the explicit method while avoiding the complexity of the implicit method is to employ a high-order explicit scheme, such as a high-order Runge–Kutta method. Such a method is more stable than the first-order explicit method, while the computation remains simple and acceptably cheap.
When a high-order time stepping method is used, the selection of the solver included in a model must be considered carefully, since the solver, being the most expensive part of SWE simulations, needs to be computed several times within a single time step. For example, the Runge–Kutta fourth-order (RKFO) method requires updating a solver four times to determine the value at the subsequent time step. The more complex the algorithm of a solver is, the more CPU time is required.
Nowadays, SWE simulations are increasingly performed on modern hardware/CPUs towards high-performance computing (HPC), using advanced features such as AVX, AVX2, and AVX-512. These support algorithm vectorization, executing several operations in a single instruction, known as single instruction multiple data (SIMD), so that a significant computational speedup can be achieved. Vectorization on such modern hardware employs vector instructions, which can dramatically outperform scalar instructions, and is thus quite important for efficient computations. Among the other compiler optimizations, vectorization can be regarded as one of the most common ways of utilizing vector-level parallelism, see [
18,
19]. Such a speedup, however, can only exist if the algorithm formulation is suitable for vectorization instructions either automatically (by compilers) or manually (by users) [
20].
Typically, there are three classes of vectorization: auto vectorization, guided vectorization, and low-level vectorization. The first type is the easiest, utilizing the ability of the compiler to automatically detect loops that have the potential to be vectorized. This is done at compile time, e.g., using the optimization flag $\mathtt{O}\mathtt{2}$ or higher. However, some typical problems, e.g., non-contiguous memory access and data dependency, make auto vectorization difficult. Here, the second type may be a solution, utilizing compiler hints/pragmas and array notations. This type may successfully vectorize loops that cannot be auto-vectorized by the compiler; however, if not used carefully, it may yield no significant performance gain or even wrong results. The last type is probably the hardest, since it requires deep knowledge of intrinsics/assembly programming and vector classes, and is thus not so popular.
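The difference between the first two routes can be made concrete on a small loop; the sketch below uses our own function names, the first loop typically auto-vectorizes at -O2/-O3, the second is hindered by indirect indexing, and the third uses the OpenMP SIMD pragma for guided vectorization:

```c
/* Three vectorization situations on a saxpy-like loop; compile e.g. with
   `gcc -O2 -fopenmp-simd -fopt-info-vec` to see what gets vectorized. */

/* contiguous, unit-stride, no dependencies: auto-vectorizable */
void axpy_auto(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* indirect indexing x[idx[i]]: the gather access typically blocks or
   degrades auto vectorization */
void axpy_gather(int n, double a, const double *x, const int *idx, double *y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[idx[i]];
}

/* guided vectorization: the pragma asserts iteration independence, so the
   compiler vectorizes even when its own analysis is inconclusive */
void axpy_guided(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Note that the pragma is a promise, not a check: applying it to a loop that does carry a dependency is exactly the "results can be wrong" failure mode of guided vectorization.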
Especially when simulating complex phenomena such as dam-break or tsunami flows as part of disaster planning, accurate results are obviously of particular interest for modelers. However, focusing only on numerical accuracy while ignoring performance efficiency is no longer an option. For instance, in addition to relatively large domains, most real dam-break and tsunami simulations require long real-time computations, e.g., days or even weeks. Wasting performance, whether due to the complexity of the selected solver or the code's inability to utilize vectorization, is thus undesirable. This is the focus of this paper. We compare three common shallow water solvers (the HLLC, Roe, and CU schemes), and two main findings are pointed out. Firstly, to enable highly efficient vectorization for all solvers on all the aforementioned hardware, we employ a reordering strategy that we have recently applied in [21]. This strategy supports guided vectorization and memory access alignment for the array loops employed in the SWE computations, thus boosting the performance. Secondly, we observe that the CU scheme outperforms the HLLC and Roe schemes while exhibiting similar accuracy. These findings should be useful to modelers as a reference for selecting the numerical solvers to be included in their models, as well as for optimizing their codes for vectorization.
Some previous studies on the vectorization of shallow water solvers are noted here. In [20], the augmented Riemann solver implemented in the Geo Conservation Laws (GeoCLAW) code was vectorized using low-level vectorization with SSE4 and AVX intrinsics. Average speedup factors of 2.7× and 4.1× (both with single-precision arithmetic) were achieved on SSE4 and AVX machines, respectively. Also using GeoCLAW, the augmented Riemann solver was vectorized in [22] by changing the data layout from arrays of structs (AoS) to structs of arrays (SoA), which required a considerable code-rewriting effort, and then applying a guided vectorization with $!\$\mathtt{omp}\ \mathtt{simd}$. Average speedup factors of 1.7× and 4.4× (both with double-precision arithmetic) were achieved on AVX2 and AVX-512 machines, respectively. In [23], the split HLL Riemann solver was vectorized and parallelized for the flux computation and state computation modules of the SWEs, employing low-level vectorization with SSE4 and AVX intrinsics. To the best of our knowledge, ours is the first attempt to report efficiency comparisons of common solvers (both Riemann and non-Riemann solvers) regarding vectorization on the three modern hardware platforms, without having to use complex intrinsic functions or to substantially rewrite the code. We use here an in-house code of the first author, numerical simulation of free surface shallow water 2D (NUFSAW2D). NUFSAW2D has been applied successfully to various shallow-water-type simulations, e.g., dam-break cases, overland flows, and turbulent flows, see [
17,
21,
24,
25].
This paper is organized as follows. The governing equations and the numerical model are briefly explained in
Section 2. An overview of data structures in our code is presented in
Section 3. The model verifications against the benchmark cases and its performance evaluations are given in
Section 4. Finally, conclusions are given in
Section 5.
2. Governing Equations and Numerical Models
The 2D SWEs are written in conservative form according to [26] as
$$\frac{\partial \mathbf{W}}{\partial t}+\frac{\partial \mathbf{F}}{\partial x}+\frac{\partial \mathbf{G}}{\partial y}={\mathbf{S}}_{b}+{\mathbf{S}}_{f},\qquad(1)$$
where the vectors $\mathbf{W}$, $\mathbf{F}$, $\mathbf{G}$, ${\mathbf{S}}_{b}$, and ${\mathbf{S}}_{f}$ are expressed as
$$\mathbf{W}=\begin{bmatrix}h\\ hu\\ hv\end{bmatrix},\quad\mathbf{F}=\begin{bmatrix}hu\\ h{u}^{2}+\frac{1}{2}g{h}^{2}\\ huv\end{bmatrix},\quad\mathbf{G}=\begin{bmatrix}hv\\ huv\\ h{v}^{2}+\frac{1}{2}g{h}^{2}\end{bmatrix},\quad{\mathbf{S}}_{b}=\begin{bmatrix}0\\ -gh\,\dfrac{\partial {z}_{b}}{\partial x}\\ -gh\,\dfrac{\partial {z}_{b}}{\partial y}\end{bmatrix},\quad{\mathbf{S}}_{f}=\begin{bmatrix}0\\ -\dfrac{g{n}_{m}^{2}\,u\sqrt{{u}^{2}+{v}^{2}}}{{h}^{1/3}}\\ -\dfrac{g{n}_{m}^{2}\,v\sqrt{{u}^{2}+{v}^{2}}}{{h}^{1/3}}\end{bmatrix}.\qquad(2)$$
The water depth, velocities in
x and
y directions, gravity acceleration, bottom elevation, and Manning coefficient are denoted by
h,
u,
v,
g,
${z}_{b}$, and
${n}_{m}$, respectively. Using a cell-centered finite volume (CCFV) method, Equation (1) is spatially discretized over a domain $\mathrm{\Omega}$ as
$$\frac{\partial}{\partial t}\int_{\mathrm{\Omega}}\mathbf{W}\,d\mathrm{\Omega}+\int_{\mathrm{\Omega}}\left(\frac{\partial \mathbf{F}}{\partial x}+\frac{\partial \mathbf{G}}{\partial y}\right)d\mathrm{\Omega}=\int_{\mathrm{\Omega}}\left({\mathbf{S}}_{b}+{\mathbf{S}}_{f}\right)d\mathrm{\Omega}.\qquad(3)$$
Applying the Gauss divergence theorem, the convective fluxes of Equation (3) can be transformed into a line-boundary integral along $\mathrm{\Gamma}$ as
$$\int_{\mathrm{\Omega}}\left(\frac{\partial \mathbf{F}}{\partial x}+\frac{\partial \mathbf{G}}{\partial y}\right)d\mathrm{\Omega}=\oint_{\mathrm{\Gamma}}\left(\mathbf{F}\,{n}_{x}+\mathbf{G}\,{n}_{y}\right)d\mathrm{\Gamma},\qquad(4)$$
leading to a flux summation for the convective fluxes by
$$\oint_{\mathrm{\Gamma}}\left(\mathbf{F}\,{n}_{x}+\mathbf{G}\,{n}_{y}\right)d\mathrm{\Gamma}\approx\sum_{i=1}^{N}{\left(\mathbf{F}\,{n}_{x}+\mathbf{G}\,{n}_{y}\right)}_{i}\,\Delta {L}_{i},\qquad(5)$$
where ${n}_{x}$ and ${n}_{y}$ are the components of the outward normal vector on $\mathrm{\Gamma}$, $N$ is the total number of edges of a cell, and $\Delta L$ is the edge length. We will investigate the accuracy and efficiency of the three solvers for solving Equation (5). The in-house code NUFSAW2D used here implements a modern shock-capturing Godunov-type model, which supports structured as well as unstructured meshes by storing the average values at each cell-center. Here we use structured rectangular meshes, hence $N=4$. Second-order spatial accuracy is achieved with the MUSCL method, utilizing the MinMod limiter function to enforce monotonicity in multiple dimensions. The bed-slope terms are computed using a Riemann-solution-free technique, with which the bed-slope fluxes can be computed separately from the convective fluxes, thus giving a fair comparison for the three aforementioned solvers. The friction terms are treated semi-implicitly to ensure stability for wet–dry simulations. The RKFO method is now applied to Equation (4) as
$$\mathbf{W}^{(p)}=\mathbf{W}^{n}-{\alpha}_{p}\,\frac{\Delta t}{A}\,{\left[\sum_{i=1}^{N}{\left(\mathbf{F}\,{n}_{x}+\mathbf{G}\,{n}_{y}\right)}_{i}\,\Delta {L}_{i}-A\left({\mathbf{S}}_{b}+{\mathbf{S}}_{f}\right)\right]}^{(p-1)},\qquad\mathbf{W}^{(0)}=\mathbf{W}^{n},\quad\mathbf{W}^{n+1}=\mathbf{W}^{(4)},\qquad(6)$$
where
A is the cell area,
$\Delta t$ is the time step,
${\alpha}_{p}$ is the coefficient taking the values 1/4, 1/3, 1/2, and 1 for
p = 1–4, respectively. The numerical procedures for Equations (
4) and (
6) are given in detail in [
17,
25,
26], thus are not presented here.
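The stage structure of the RKFO update can be sketched as follows; the residual callback stands in for the expensive flux-and-source evaluation, the coefficients are the values 1/4, 1/3, 1/2, and 1 given above, and the layout (one unknown per cell) is an illustration, not NUFSAW2D's actual data structure:

```c
#include <stdlib.h>

static const double ALPHA[4] = {0.25, 1.0 / 3.0, 0.5, 1.0};

/* One RKFO time step: each stage p restarts from W^n and scales the
   residual by alpha_p, so `residual` (the solver) runs four times per step.
   `residual` must fill res[i] with the flux sum plus source terms of cell i
   evaluated at the state w_in. */
void rkfo_step(int ncells, double *w, double dt, const double *area,
               void (*residual)(const double *w_in, double *res, int n))
{
    double *w_n = malloc(ncells * sizeof *w_n); /* W at time level n */
    double *res = malloc(ncells * sizeof *res); /* stage residual    */
    for (int i = 0; i < ncells; ++i) w_n[i] = w[i];
    for (int p = 0; p < 4; ++p) {
        residual(w, res, ncells);           /* solver evaluated 4x per step */
        for (int i = 0; i < ncells; ++i)    /* stage p starts from W^n      */
            w[i] = w_n[i] - ALPHA[p] * (dt / area[i]) * res[i];
    }
    free(w_n);
    free(res);
}

/* Toy residual R(w) = w, i.e., dw/dt = -w, for the usage example below. */
void decay_residual(const double *w_in, double *res, int n)
{
    for (int i = 0; i < n; ++i) res[i] = w_in[i];
}
```

For a linear residual this stage sequence reproduces the classical fourth-order amplification factor 1 − z + z²/2 − z³/6 + z⁴/24, which also makes explicit why the choice of solver dominates the cost: it is invoked at every one of the four stages.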
3. Overview of Data Structures
3.1. General
Here we explain in detail how the data structures of our code are designed to advance the solutions of Equation (
6). Note that this is a typical data structure used in many shallow water codes (with implementations of modern finite volume schemes). As shown in
Figure 1, a domain is discretized into several subdomains (rectangular cells); we call this step the preprocessing stage. Each cell holds the values of
${z}_{b}$ and
${n}_{m}$ located at its center. Initially, the values of
h,
u, and
v are given by users at each cell-center.
As our model employs a reconstruction process to achieve second-order spatial accuracy with the MUSCL method, it requires the gradient values at the cell-centers. Therefore, these gradient values must be computed first; this step is called the gradient level. Thereafter, one needs to calculate the values at each edge using the values of its two corresponding cell-centers; this stage is called the edge-driven level. In this level, a solver, e.g., the HLLC, Roe, or CU scheme, is required to compute the nonlinear values of $\mathbf{F}$ and $\mathbf{G}$ at the edges. Prior to performing such a solver, the aforementioned reconstruction process with the MUSCL method is employed. Note that the values of ${\mathbf{S}}_{b}$ are also computed at the edge-driven level. After the values at all edges are known, the solution can be advanced to the subsequent time level by also computing the values of ${\mathbf{S}}_{f}$. For example, the solutions of $\mathbf{W}$ at the subsequent time level for a cell-center are updated using the $\mathbf{F}$, $\mathbf{G}$, and ${\mathbf{S}}_{b}$ values from its four corresponding edges, and using the ${\mathbf{S}}_{f}$ values located at the cell-center itself. We call this stage the cell-driven level.
Note that the edge-driven level is the most expensive of these stages; one should thus pay extra attention to its computation. We also point out that we perform the computation of the edge-driven level in an edge-based manner rather than a cell-based one, namely we compute the edge values only once per single calculation level. Therefore, one does not need to save the values of $\left[{\sum}_{i=1}^{N}{\left(\mathbf{F}\phantom{\rule{2.84544pt}{0ex}}{n}_{x}+\mathbf{G}\phantom{\rule{2.84544pt}{0ex}}{n}_{y}\right)}_{i}\phantom{\rule{2.84544pt}{0ex}}\Delta {L}_{i}\right]$ in arrays for each cell-center; only the values of $\left[{\left(\mathbf{F}\phantom{\rule{2.84544pt}{0ex}}{n}_{x}+\mathbf{G}\phantom{\rule{2.84544pt}{0ex}}{n}_{y}\right)}_{i}\phantom{\rule{2.84544pt}{0ex}}\Delta {L}_{i}\right]$ are saved, corresponding to the total number of edges. The values of an edge are directly valid for one adjacent cell, and are simply multiplied by ($-1$) for the other cell. It is then a challenging task to design an array structure that eases vectorization and exploits memory access alignment in both the edge-driven and cell-driven levels.
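The edge-based accumulation described above can be sketched as follows; the left/right adjacency arrays are an illustrative layout, not the actual NUFSAW2D data structure:

```c
/* Each interior edge flux (F nx + G ny) dL is computed once and then
   gathered by its two adjacent cells with opposite signs: the edge normal
   is outward for the "left" cell, so the "right" cell takes the flux
   multiplied by (-1). */
void accumulate_fluxes(int nedges, const int *left, const int *right,
                       const double *edge_flux, const double *edge_len,
                       double *cell_sum)
{
    for (int e = 0; e < nedges; ++e) {
        double f = edge_flux[e] * edge_len[e]; /* computed once per edge */
        cell_sum[left[e]]  += f;               /* outward contribution   */
        cell_sum[right[e]] -= f;               /* same flux times (-1)   */
    }
}
```

Compared with a cell-based sweep, each edge flux is evaluated once instead of twice, which is exactly why the edge-driven level deserves the most optimization attention.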
3.2. CellEdge Reordering Strategy for Supporting Vectorization and Memory Access Alignment
We focus our reordering strategy here on tackling the two common obstacles to vectorization: non-contiguous memory access and data dependency. Regarding the former, a contiguous array structure is required to provide the contiguous memory access that makes vectorization efficient. Typically, one finds this problem when dealing with indirect array indexing; e.g., using $\mathtt{x}\left(\mathtt{y}\right(\mathtt{i}\left)\right)$ forces the compiler to decode $\mathtt{y}\left(\mathtt{i}\right)$ to find the memory reference into $\mathtt{x}$. This is also a typical problem for non-unit strided access to an array, e.g., incrementing a loop by a scalar factor, where non-consecutive memory locations must be accessed in the loop. Vectorization is sometimes still possible in this case, but the performance gain is often insignificant. The second obstacle relates to the usage of array values computed in a previous iteration of the loop, which often destroys any possibility of vectorization unless a special directive is used.
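These two obstacles can be shown in minimal form; both loops below are correct scalar code, but the first forces gathers through a non-unit stride and the second carries a dependency between iterations (function names are ours):

```c
/* (1) non-unit stride: consecutive iterations touch memory `stride`
   doubles apart, so the compiler must gather instead of loading one
   contiguous vector; vectorization, if done at all, gains little */
double sum_strided(int n, int stride, const double *x)
{
    double s = 0.0;
    for (int i = 0; i < n; i += stride)
        s += x[i];
    return s;
}

/* (2) loop-carried dependency: y[i] needs y[i-1] from the previous
   iteration, so the iterations cannot run in lockstep as vector lanes */
void prefix_sum(int n, const double *x, double *y)
{
    y[0] = x[0];
    for (int i = 1; i < n; ++i)
        y[i] = y[i - 1] + x[i];
}
```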
As shown in Figure 2, for advancing the solution of $\mathbf{W}$ in Equation (1) for cell $\mathtt{k}$, one requires $\mathbf{F}$, $\mathbf{G}$, and ${\mathbf{S}}_{b}$ from the edges $\mathtt{i}$, where $\mathtt{i}=\mathtt{index\_function}\left(\mathtt{j}\right)$ and $[\mathtt{j}\leftarrow \mathtt{1}\dots\mathtt{4}]$, and ${\mathbf{S}}_{f}$ from $\mathtt{k}$ itself. Opting for $\mathtt{index\_function}$ as an operator for defining $\mathtt{i}$ leads to the use of an indirect reference in a loop. This is not desired, since it may prevent vectorization. This may be anticipated by directly declaring $\mathtt{i}$ in the same array as $\mathtt{k}$, e.g., $\mathtt{W}\left(\mathtt{k}\right)\leftarrow \left[\mathtt{W}\left(\mathtt{k}+\mathtt{m}\right),\mathtt{W}\left(\mathtt{k}-\mathtt{m}\right),\mathtt{W}\left(\mathtt{k}+\mathtt{n}\right),\mathtt{W}\left(\mathtt{k}-\mathtt{n}\right)\right]$, where $\mathtt{m}$ and $\mathtt{n}$ are scalars. This, however, leads to a data-dependency problem that makes vectorization difficult.
To avoid these problems, we have designed a cell-edge reordering strategy, see
Figure 3, where the loops with similar computational procedures are collected to be vectorized. Note that this strategy is only applied once at the preprocessing stage in
Figure 1. The core idea of this strategy is to build contiguous array patterns between edges and cells for the edge-driven level, as well as between cells and edges for the cell-driven level. We point out that we employ only a 1D array configuration in NUFSAW2D, so that the memory access patterns are straightforward, easing unit stride and conserving cache entries. The first step is to devise the cell numbering following the Z-pattern, which is intended for the cell-driven level. Secondly, we design the edge numbering for the edge-driven level by classifying the edges into two types, internal and boundary edges, in the most contiguous way; the former are edges that have two neighboring cells (e.g., edges 1–31), whereas the latter are edges with only one corresponding cell (e.g., edges 32–49). The reason for this classification is that the computational complexity of internal and boundary edges differs, e.g., (1) no reconstruction process is required for the latter, which thus takes less CPU time than the former, and (2) because they correspond to two neighboring cells, the former access more memory than do the latter; declaring all edges in one single loop group therefore deteriorates the memory access patterns, thus decreasing the performance.
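Assuming a column-by-column flat numbering consistent with the example of Figure 3 (with seg_y = 4), the cell mapping and its fixed neighbor offsets can be sketched as follows; the helper name is ours:

```c
/* Z-pattern numbering: cells are stored column by column in one 1D array,
   so cell (col,row) (both 1-based) maps to one flat cell number, and the
   four neighbors of an interior cell k sit at fixed offsets:
     west  = k - seg_y,  east  = k + seg_y,
     south = k - 1,      north = k + 1,
   i.e., unit- or fixed-stride accesses that keep the loops contiguous. */
static inline int cell_index(int col, int row, int seg_y)
{
    return (col - 1) * seg_y + row; /* 1-based flat cell number */
}
```

For example, with seg_y = 4, cell (2,2) is cell 6, whose x-neighbors are cells 2 and 10 and whose y-neighbors are cells 5 and 7, matching the gradient example given below.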
For the sake of clarity, we write in Algorithm 1 the pseudocode of the model’s
$\mathtt{SUBROUTINE}$ employed in NUFSAW2D. Note that Algorithm 1 is a typical form applied in many common and popular shallow water codes. First, we mention that
$\mathtt{seg\_x}=\mathtt{5}$,
$\mathtt{seg\_y}=\mathtt{4}$, and
$\mathtt{Ncells}=\mathtt{20}$ according to
Figure 3, where
$\mathtt{seg\_x}$,
$\mathtt{seg\_y}$, and
$\mathtt{Ncells}$ are the total number of domain segments in
x and
y directions, and the total number of cells, respectively. We now explain the
$\mathtt{SUBROUTINE}$ $\mathtt{gradient}$. The cells are now classified into two groups: internal and boundary cells. Internal cells, e.g., cells 6, 7, 10, 11, 14, and 15, are cells whose gradient computations require accessing two cell values in each direction. For example, computing the
xgradient of
$\mathbf{W}$ of cell 6 needs the values of
$\mathbf{W}$ of cells 2 and 10; this is denoted by
$[\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{6}\right)\leftarrow \mathtt{W}\left(\mathtt{2}\right),\mathtt{W}\left(\mathtt{10}\right)]$ and similarly
$[\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{6}\right)\leftarrow \mathtt{W}\left(\mathtt{5}\right),\mathtt{W}\left(\mathtt{7}\right)]$. Boundary cells, e.g., cells 1–4, 5, 8, 9, 12, 13, 16, and 17–20, are cells affiliated with boundary edges. These cells may not always require accessing two cell values in each direction for the gradient computation, e.g.,
$[\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{8}\right)\leftarrow \mathtt{W}\left(\mathtt{4}\right),\mathtt{W}\left(\mathtt{12}\right)]$ but
$[\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{8}\right)\leftarrow \mathtt{W}\left(\mathtt{7}\right),\mathtt{W}\left(\mathtt{8}\right)]$, showing that a symmetric boundary condition is applied to cell 8 in the y direction. Since the total number of internal cells is significantly larger than that of boundary cells, we group the internal cells into a single loop and distinguish them from the boundary cells, see Algorithm 2.
Algorithm 1 Typical algorithm for a shallow water code (within the Runge–Kutta fourth-order (RKFO) method's framework)

1: for $t=\mathtt{1}\leftarrow [\mathtt{total}\ \mathtt{number}\ \mathtt{of}\ \mathtt{time}\ \mathtt{steps}]$ do
2:   ! within the RKFO method from [p = 1] to [p = 4]
3:   for $p=1\leftarrow 4$ do
4:     $\mathtt{CALL}\ \mathtt{gradient}$
5:       → $\mathtt{compute}\ \mathtt{gradient}$
6:     $\mathtt{CALL}\ \mathtt{edgedriven\_level}$
7:       → $\mathtt{compute}\ \mathtt{MUSCL\_method}$
8:       → $\mathtt{compute}\ \mathtt{bed\_slope}$
9:       → $\mathtt{compute}\ \mathtt{shallow\_water\_solver}$
10:    $\mathtt{CALL}\ \mathtt{celldriven\_level}$
11:      → $\mathtt{compute}\ \mathtt{friction\_term}$
12:      → $\mathtt{compute}\ \mathtt{update\_variables}$
13:  end for
14: end for

Algorithm 2 Pseudocode for $\mathtt{SUBROUTINE}$ $\mathtt{gradient}$

1: for $\mathtt{k}=\mathtt{1}\leftarrow [\mathtt{seg\_x}-\mathtt{2}]$ do
2:   $\mathtt{l}=(\mathtt{seg\_y}+\mathtt{2})+(\mathtt{k}-\mathtt{1})\ast \mathtt{seg\_y}$
3:   $!\$\mathtt{omp}\ \mathtt{simd}\ \mathtt{simdlen}\left(\mathtt{VL}\right)\ \mathtt{aligned}(\nabla {\mathtt{W}}_{\mathtt{x}},\nabla {\mathtt{W}}_{\mathtt{y}}:\mathtt{Nbyte})$
4:   for $\mathtt{i}=\mathtt{l}\leftarrow [\mathtt{l}+\mathtt{seg\_y}-\mathtt{3}]$ do
5:     .........
6:     $\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{i}\right)\leftarrow \mathtt{W}(\mathtt{i}-\mathtt{seg\_y}),\mathtt{W}(\mathtt{i}+\mathtt{seg\_y})$ ; $\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{i}\right)\leftarrow \mathtt{W}(\mathtt{i}-\mathtt{1}),\mathtt{W}(\mathtt{i}+\mathtt{1})$
7:   end for
8: end for
9: $!\$\mathtt{omp}\ \mathtt{simd}\ \mathtt{simdlen}\left(\mathtt{VL}\right)\ \mathtt{aligned}(\nabla {\mathtt{W}}_{\mathtt{x}},\nabla {\mathtt{W}}_{\mathtt{y}}:\mathtt{Nbyte})$
10: for $\mathtt{i}=\mathtt{1}\leftarrow \left[\mathtt{seg\_y}\right]$ do
11:   $\mathtt{j}=\mathtt{Ncells}-\mathtt{seg\_y}+\mathtt{i}$
12:   $\mathtt{i1}=\mathtt{i}-\mathtt{1}$ OR $\mathtt{i1}=\mathtt{i}$ ; $\mathtt{i2}=\mathtt{i}+\mathtt{1}$ OR $\mathtt{i2}=\mathtt{i}$
13:   $\mathtt{i3}=\mathtt{j}-\mathtt{1}$ OR $\mathtt{i3}=\mathtt{j}$ ; $\mathtt{i4}=\mathtt{j}+\mathtt{1}$ OR $\mathtt{i4}=\mathtt{j}$
14:   .........
15:   $\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{i}\right)\leftarrow \mathtt{W}\left(\mathtt{i}\right),\mathtt{W}(\mathtt{i}+\mathtt{seg\_y})$ ; $\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{i}\right)\leftarrow \mathtt{W}\left(\mathtt{i1}\right),\mathtt{W}\left(\mathtt{i2}\right)$
16:   $\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{j}\right)\leftarrow \mathtt{W}(\mathtt{j}-\mathtt{seg\_y}),\mathtt{W}\left(\mathtt{j}\right)$ ; $\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{j}\right)\leftarrow \mathtt{W}\left(\mathtt{i3}\right),\mathtt{W}\left(\mathtt{i4}\right)$
17: end for
18: !=== This loop is not vectorized due to non-unit strided access ===!
19: for $\mathtt{i}=\mathtt{1}\leftarrow [\mathtt{seg\_x}-\mathtt{2}]$ do
20:   $\mathtt{j}=\mathtt{i}\ast \mathtt{seg\_y}+\mathtt{1}$ ; $\mathtt{k}=(\mathtt{i}+\mathtt{1})\ast \mathtt{seg\_y}$
21:   $\mathtt{i1}=\mathtt{j}-\mathtt{1}$ OR $\mathtt{i1}=\mathtt{j}$ ; $\mathtt{i2}=\mathtt{j}+\mathtt{1}$ OR $\mathtt{i2}=\mathtt{j}$
22:   $\mathtt{i3}=\mathtt{k}-\mathtt{1}$ OR $\mathtt{i3}=\mathtt{k}$ ; $\mathtt{i4}=\mathtt{k}+\mathtt{1}$ OR $\mathtt{i4}=\mathtt{k}$
23:   .........
24:   $\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{j}\right)\leftarrow \mathtt{W}(\mathtt{j}-\mathtt{seg\_y}),\mathtt{W}(\mathtt{j}+\mathtt{seg\_y})$ ; $\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{j}\right)\leftarrow \mathtt{W}\left(\mathtt{i1}\right),\mathtt{W}\left(\mathtt{i2}\right)$
25:   $\nabla {\mathtt{W}}_{\mathtt{x}}\left(\mathtt{k}\right)\leftarrow \mathtt{W}(\mathtt{k}-\mathtt{seg\_y}),\mathtt{W}(\mathtt{k}+\mathtt{seg\_y})$ ; $\nabla {\mathtt{W}}_{\mathtt{y}}\left(\mathtt{k}\right)\leftarrow \mathtt{W}\left(\mathtt{i3}\right),\mathtt{W}\left(\mathtt{i4}\right)$
26: end for

Algorithm 2 shows three typical loops in the $\mathtt{SUBROUTINE}$ $\mathtt{gradient}$. The first loop (lines 1–8) is designed sequentially with a factor of $\mathtt{seg\_x}-\mathtt{2}$ for its outer part to exclude all boundary cells. Its inner part is constructed based on the outer loop in a contiguous way, thus making vectorization efficient. Each element of array $\nabla {\mathtt{W}}_{\mathtt{x}}$ accesses two elements of array $\mathtt{W}$ with the farthest alignment of $\mathtt{seg\_y}$, while each element of array $\nabla {\mathtt{W}}_{\mathtt{y}}$ also accesses two elements of array $\mathtt{W}$ but only with the farthest alignment of $\mathtt{1}$. The second loop (lines 10–17) is designed similarly to the first one, but since this loop includes boundary cells, each element of arrays $\nabla {\mathtt{W}}_{\mathtt{x}}$ and $\nabla {\mathtt{W}}_{\mathtt{y}}$ accesses only one array element with the farthest alignment of $\mathtt{seg\_y}$ and $\mathtt{1}$, respectively, whereas the other required elements of array $\mathtt{W}$ are contiguously accessed by each element of both $\nabla {\mathtt{W}}_{\mathtt{x}}$ and $\nabla {\mathtt{W}}_{\mathtt{y}}$. Note that in our implementation, neither of these two loops can be auto-vectorized by the compiler. Therefore, we apply a guided vectorization with the OpenMP directive instead of the Intel one, namely $!\$\mathtt{omp}\ \mathtt{simd}\ \mathtt{simdlen}\left(\mathtt{VL}\right)\ \mathtt{aligned}(\mathtt{var}\mathtt{1},\mathtt{var}\mathtt{2},...\ :\mathtt{Nbyte})$; this will be explained later in Section 4.5. The third loop (lines 19–26) is designed for the remaining cells, which are not included in the previous two loops. This loop is not devised in a contiguous manner, which disables auto vectorization; although a guided vectorization is possible there, it does not give any significant performance improvement due to the non-unit strided access. Despite not being vectorizable, the third loop does not significantly decrease the performance of our model for the entire simulation, as it only has an array dimension of $\mathtt{2}\ast [\mathtt{seg\_x}-\mathtt{2}]$ (quite small compared to the other two loops).
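A C analogue of the first (internal-cell) loop of Algorithm 2 may look as follows; the index arithmetic mirrors the 1-based pseudocode, while the gradient formula (a plain central difference) and the bare `omp simd` pragma are our simplifications of the actual MUSCL gradient and of the full directive with its `simdlen`/`aligned` clauses:

```c
/* Internal-cell gradients over a seg_x-by-seg_y Z-pattern domain; arrays
   are 1-based here (index 0 unused) to keep the pseudocode's indices. */
void gradient_internal(int seg_x, int seg_y,
                       const double *restrict w,
                       double *restrict gx, double *restrict gy)
{
    for (int k = 1; k <= seg_x - 2; ++k) {         /* skip boundary columns */
        int l = (seg_y + 2) + (k - 1) * seg_y;     /* first interior cell   */
        #pragma omp simd
        for (int i = l; i <= l + seg_y - 3; ++i) { /* unit-stride inner loop */
            /* x-neighbors sit at i +/- seg_y, y-neighbors at i +/- 1 */
            gx[i] = 0.5 * (w[i + seg_y] - w[i - seg_y]);
            gy[i] = 0.5 * (w[i + 1] - w[i - 1]);
        }
    }
}
```

With seg_x = 5 and seg_y = 4 as in Figure 3, the loop visits exactly the internal cells 6, 7, 10, 11, 14, and 15, and every access is unit-stride or at the fixed offset seg_y, which is what the guided directive can exploit.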
We now discuss the $\mathtt{SUBROUTINE}$ $\mathtt{edgedriven\_level}$ and sketch it in Algorithm 3. Note that, for the sake of brevity, only the pseudocode for internal edges is presented in Algorithm 3; for boundary edges, the pseudocode is similar but computed without $\mathtt{MUSCL\_method}$. The first loop corresponds to edges 1–16 and the second one to edges 17–31. In the first loop (lines 1–7), each flux computation accesses the arrays with the farthest alignment of $\mathtt{seg\_y}$, whereas in the second loop (lines 8–17) the arrays are designed to have contiguous patterns. Every edge has a certain pattern for its two corresponding cells with no data dependency, thus enabling an efficient vectorization. Note that with this pattern both loops can be auto-vectorized; however, we still implement a guided vectorization, as it gives better performance.
Finally, we sketch the $\mathtt{SUBROUTINE}$ $\mathtt{celldriven\_level}$ in Algorithm 4. Again, for the sake of brevity, only the pseudocode for internal cells is given. Similar to the internal cells in the $\mathtt{SUBROUTINE}$ $\mathtt{gradient}$, the loop is designed sequentially with a factor of $\mathtt{seg\_x}-\mathtt{2}$ for the outer part. In the inner part, the array access patterns are, however, different from those of the gradient computation: $\mathtt{W}$ accesses $\mathtt{F}$, $\mathtt{G}$, and ${\mathtt{S}}_{\mathtt{b}}$ from the corresponding edges, and ${\mathtt{S}}_{\mathtt{f}}$ from the corresponding cell; in other words, more array accesses are required in this loop. Nevertheless, the vectorization gives a significant performance improvement, since the array access patterns are contiguous. However, similar to Algorithm 2, there is a part of this cell-driven level that cannot be vectorized due to non-unit strided access. Again, since the dimension of this non-vectorizable loop is considerably smaller than the others, there is no significant performance degradation for the entire simulation.
Algorithm 3 Pseudocode for SUBROUTINE edgedriven_level (only for internal edges)

 1: !$omp simd simdlen(VL) aligned(∇W_x, W, z_b, F, G, S_b : Nbyte)
 2: for i = 1 to seg_y * (seg_x − 1) do
 3:   j = i ;  k = i + seg_y
 4:   .........
 5:   compute MUSCL_method + bed_slope + shallow_water_solver
        [∇W_x(j), ∇W_x(k), W(j), W(k), z_b(j), z_b(k), ..., F_x^L, F_x^R, G_x^L, G_x^R, S_bx^L, S_bx^R]
 6:   F+G(i) ← F_x^L, F_x^R, G_x^L, G_x^R ;  S_b(i) ← S_bx^L, S_bx^R
 7: end for
 8: for l = 1 to seg_x do
 9:   m = seg_y * (seg_x − 1) + 1 + (l − 1) * (seg_y − 1) ;  n = m + seg_y − 2 ;  o = (l − 1) * seg_y
10:   !$omp simd simdlen(VL) aligned(∇W_y, W, z_b, F, G, S_b : Nbyte)
11:   for i = m to n do
12:     j = (i − m + 1) + o ;  k = j + 1
13:     .........
14:     compute MUSCL_method + bed_slope + shallow_water_solver
          [∇W_y(j), ∇W_y(k), W(j), W(k), z_b(j), z_b(k), ..., F_y^L, F_y^R, G_y^L, G_y^R, S_by^L, S_by^R]
15:     F+G(i) ← F_y^L, F_y^R, G_y^L, G_y^R ;  S_b(i) ← S_by^L, S_by^R
16:   end for
17: end for
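As an illustration only (NUFSAW2D itself is Fortran), the x-direction sweep of Algorithm 3 can be sketched in C. The identifiers are hypothetical and a trivial centered flux with a bed-slope correction stands in for MUSCL_method + bed_slope + shallow_water_solver; the point is purely the index pattern: the two cells of edge i are simply i and i + seg_y, so every access advances with unit stride and the loop carries no data dependency.

```c
#include <stddef.h>

/* Hedged C sketch of the x-direction edge sweep of Algorithm 3.
 * Edge i lies between cells j = i and k = i + seg_y, so all reads and the
 * write to F are unit-stride across iterations and free of dependencies;
 * the placeholder flux below is NOT the actual solver of the paper. */
void edge_sweep_x(size_t seg_x, size_t seg_y,
                  const double *W,   /* cell-centered variable (one field) */
                  const double *zb,  /* bed elevation per cell             */
                  double *F)         /* flux stored per edge               */
{
    size_t n_edges = seg_y * (seg_x - 1);
    #pragma omp simd
    for (size_t i = 0; i < n_edges; ++i) {
        size_t j = i;             /* left cell of edge i  */
        size_t k = i + seg_y;     /* right cell of edge i */
        /* placeholder "solver": centered flux + bed-slope correction */
        F[i] = 0.5 * (W[j] + W[k]) - 0.5 * (zb[k] - zb[j]);
    }
}
```

Because no element is written twice, a compiler can vectorize this loop even without the `omp simd` hint; the directive merely guarantees it, mirroring the guided vectorization of the paper.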

Algorithm 4 Pseudocode for SUBROUTINE celldriven_level (only for internal cells)

 1: for k = 1 to seg_x − 2 do
 2:   j = (seg_y + 2) + (k − 1) * seg_y ;  l = (seg_y * (seg_x − 1) + seg_y) + (k − 1) * (seg_y − 1)
 3:   !$omp simd simdlen(VL) aligned(W, F, G, n_m, S_b, S_f : Nbyte)
 4:   for i = j to j + seg_y − 3 do
 5:     i1 = l + (i − j) ;  i2 = i ;  i3 = i1 + 1 ;  i4 = i − seg_y
 6:     .........
 7:     compute friction_term [W(i), n_m(i), ..., S_f(i)]
 8:     compute update_variables
 9:     W(i) ← F+G(i1), F+G(i2), F+G(i3), F+G(i4), S_b(i1), S_b(i2), S_b(i3), S_b(i4), S_f(i)
10:   end for
11: end for
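Again as a sketch only, the inner loop of Algorithm 4 can be mimicked in C. The names, the meaning of the four edge indices i1–i4, and the update formula below are all assumptions for illustration (a bare flux balance stands in for update_variables); what matters is that i, i2, and i4 advance with unit stride and i1, i3 advance with unit stride from a second base, so the gather of four edge fluxes per cell remains vectorizable despite the extra array accesses noted above.

```c
#include <stddef.h>

/* Hedged C sketch of the inner loop of Algorithm 4: cell i gathers the
 * fluxes of its four surrounding edges i1..i4 (index arithmetic copied
 * from the pseudocode) and applies a placeholder flux-balance update.
 * FG plays the role of the per-edge F+G array, Sf the per-cell friction. */
void cell_update_row(size_t j, size_t l, size_t seg_y,
                     const double *FG, const double *Sf,
                     double *W, double dt_over_A)
{
    #pragma omp simd
    for (size_t i = j; i <= j + seg_y - 3; ++i) {  /* assumes seg_y >= 3 */
        size_t i1 = l + (i - j);   /* first y-edge of cell i             */
        size_t i2 = i;             /* x-edge sharing the cell index      */
        size_t i3 = i1 + 1;        /* second y-edge                      */
        size_t i4 = i - seg_y;     /* opposite x-edge                    */
        /* placeholder update: net flux difference plus friction source  */
        W[i] += dt_over_A * ((FG[i1] - FG[i3]) + (FG[i4] - FG[i2]) + Sf[i]);
    }
}
```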

3.3. Avoiding Skipping Iterations for the Vectorization of Wet–Dry Problems
In reality, almost all shallow flow simulations deal with wet–dry problems. To this end, the computations of both the solver and the bed-slope terms in the $\mathtt{SUBROUTINE}$ $\mathtt{edgedriven\_level}$ must also satisfy the well-balanced and positivity-preserving properties, see [27,28], among others. Similarly, the calculations of the friction terms in the $\mathtt{SUBROUTINE}$ $\mathtt{celldriven\_level}$ must consider the wet–dry phenomena, otherwise errors arise. For example, in the edge-driven level, an edge may form a wet–dry or dry–dry interface when one or both of its cell-centers contain no water; in both cases, the MUSCL method for achieving second-order accuracy is sometimes not required, or even if it is still computed, it must fall back to first-order accuracy to ensure computational stability by simply defining the edge values according to the corresponding centers. Another example is in the cell-driven level, where the unit discharges ($hu$ and $hv$) must be transformed back to the velocities ($u$ and $v$) for computing the friction terms by dividing by the water depth ($h$); a very low water depth may thus cause significant errors. To anticipate these problems, one often employs some skipping iterations in the loops, see Algorithm 5.
Algorithm 5 Pseudocode of some possible skipping iterations

 1: !== This is a typical skipping iteration in the SUBROUTINE edgedriven_level ==!
 2: if [wet–dry or dry–dry interfaces at edges] then
 3:   NO MUSCL_method: calculate first-order scheme
 4: else
 5:   compute MUSCL_method: calculate second-order scheme
 6:   if [velocities are not monotone] then
 7:     back to first-order scheme
 8:   end if
 9:   .........
10: end if
11: !== This is a typical skipping iteration in the SUBROUTINE celldriven_level ==!
12: if [depths at cell-centers > depth limiter] then
13:   compute friction_term
14: else
15:   unit discharges and velocities are set to very small values
16:   .........
17: end if

Typically, the two skipping iterations in Algorithm 5 are important to ensure the correctness of shallow water models. Unfortunately, such layouts may prevent auto-vectorization; even where a guided vectorization is possible, it gives no significant improvement and may even decrease the performance noticeably. This is because the SIMD instructions work simultaneously only on sets of array elements that occupy contiguous positions. In our experience, a guided vectorization was indeed possible for both iterations; the speedup factors, however, were not significant. Borrowing the idea of [22], we therefore change the layouts in Algorithm 5 to those in Algorithm 6, where the early-exit condition is moved to the end of the algorithm. Using the new layouts in Algorithm 6, we observed up to 48% further improvement of the vectorization over that given by Algorithm 5. Note that the results given by Algorithms 5 and 6 should be identical, because no computational procedure is changed, only the layout.
Algorithm 6 Pseudocode of the solutions of the skipping iterations in Algorithm 5

 1: !== A solution for the skipping iteration in the SUBROUTINE edgedriven_level ==!
 2: compute MUSCL_method: calculate second-order scheme
 3: .........
 4: if [velocities are not monotone] then
 5:   back to first-order scheme
 6: end if
 7: .........
 8: if [wet–dry or dry–dry interfaces at edges] then
 9:   NO MUSCL_method: calculate first-order scheme
10: end if
11: !== A solution for the skipping iteration in the SUBROUTINE celldriven_level ==!
12: compute friction_term
13: .........
14: if [depths at cell-centers ≤ depth limiter] then
15:   unit discharges and velocities are set to very small values
16:   .........
17: end if
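The cell-driven layout change of Algorithm 6 can be mimicked in C as follows. This is a hedged sketch: a simplified drag law stands in for the actual friction formula of the paper, and `h_min` plays the role of the depth limiter. The friction term is computed unconditionally for every cell, and the wet–dry test is applied at the end of the iteration as an overwrite, which compilers can map to masked SIMD blends instead of a skipped (and vectorization-breaking) iteration.

```c
#include <stddef.h>

/* Hedged C sketch of the Algorithm 6 layout for the friction term.
 * The body runs for every cell; the early-exit condition of Algorithm 5
 * is moved to the end of the iteration as a maskable correction.        */
void friction_masked(size_t n, double h_min,
                     const double *h,   /* water depth per cell          */
                     const double *nm,  /* roughness coefficient         */
                     double *hu, double *hv, double *Sf)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i) {
        double u  = hu[i] / h[i];  /* may be meaningless when h is tiny  */
        double v  = hv[i] / h[i];
        /* simplified drag law, NOT the paper's friction formula         */
        double cf = nm[i] * nm[i] * (u * u + v * v);
        Sf[i] = cf * u;
        /* early-exit condition moved to the end of the iteration:       */
        if (h[i] <= h_min) {
            hu[i] = 1e-12;  hv[i] = 1e-12;  Sf[i] = 0.0;
        }
    }
}
```

As in the paper, the results match the branch-first layout exactly, since the nearly-dry cells' intermediate values are discarded before they are ever used.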

3.4. Parallel Computation
We briefly explain here the parallel computing implementation of NUFSAW2D according to [21]. Our idea is to decompose and parallelize the domain based on its complexity level. NUFSAW2D employs hybrid MPI-OpenMP parallelization and is thus applicable to parallel simulations on multiple nodes. However, since we focus here on the vectorization, which no longer influences the scalability beyond one node [20], we limit our study to single-node implementations and thus employ only OpenMP for parallelization. Further, we examine the memory bandwidth effect when using only one core versus 16 cores (AVX), 28 cores (AVX2), and 64 cores (AVX512).
In Figure 4 we show an example of the decomposition of the domain in Figure 3 using four threads; for the sake of brevity, the illustration is given only for the edge-driven level. The parallel directive, e.g., $!\$\mathtt{omp}$ $\mathtt{do}$, can easily be added to each loop; thus, according to Algorithm 2, in the gradient level the domain is decomposed as: thread 0 (cells 6, 7, 1, 17, 5, 8), thread 1 (cells 10, 11, 2, 18, 9, 12), thread 2 (cells 14, 15, 3, 19, 13, 16), and thread 3 (cells 4, 20). Similarly, regarding Algorithm 3, the edge-driven level gives: thread 0 (edges 1–4, 17–22, 32–33, 37–38, 42, 46), thread 1 (edges 5–8, 23–25, 34, 39, 43, 47), thread 2 (edges 9–12, 26–28, 35, 40, 44, 48), and thread 3 (edges 13–16, 29–31, 36, 41, 45, 49). Meanwhile, the cell-driven level applies a similar decomposition to that of the gradient level. One can see that the largest loop components, e.g., internal edges 1–4, 5–8, etc., are decomposed in a contiguous pattern, which eases the vectorization implementation and is thus efficient. Note that the decomposition in Figure 4 is based on static load balancing, which causes load imbalance due to the non-uniform amount of load assigned to each thread; this imbalance becomes less and less significant as the domain size increases, e.g., to millions of cells. However, another load imbalance issue appears that can only be recognized during runtime, namely the one caused by wet–dry problems, where wet cells are computationally more expensive than dry cells. For this, we have developed in [21] a novel weighted-dynamic load balancing (WDLB) technique that was proven effective in tackling load imbalance due to wet–dry problems. All the parallel and load balancing implementations are described in detail in [21] and are thus not explained here. We also note that we have successfully applied this cell-edge reordering strategy in [24,25] for parallelizing 2D shallow flow simulations using the CU scheme with good scalability. We will show in the next section that the proposed cell-edge reordering strategy helps in easing all the vectorization implementations.
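How the static decomposition described above composes with the vectorized inner body can be sketched minimally in C with OpenMP (identifiers are invented for illustration; the paper's code is Fortran with `!$omp do`): a worksharing loop over the contiguous internal-edge range hands each thread one contiguous chunk, so every thread's sub-range keeps the unit-stride, vectorizable access pattern.

```c
#include <stddef.h>

/* Hedged sketch of the static edge decomposition of Section 3.4.
 * schedule(static) splits the contiguous edge range into contiguous
 * per-thread chunks; the body stays unit-stride inside each chunk.
 * Compiled without OpenMP, the pragma is ignored and the loop runs
 * serially with identical results.                                  */
void edge_sweep_parallel(size_t n_internal_edges,
                         const double *W_left, const double *W_right,
                         double *F)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n_internal_edges; ++i) {
        /* the real inner body would itself carry the omp simd hint */
        F[i] = 0.5 * (W_left[i] + W_right[i]);
    }
}
```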
5. Conclusions
A numerical investigation of the accuracy and efficiency of three common shallow water solvers (the HLLC, Roe, and CU schemes) has been presented. Four cases dealing with shock waves and wet–dry phenomena were selected. All schemes were implemented in the in-house code NUFSAW2D, whose model is second-order accurate in space wherever the regimes are smooth, robust when dealing with strong shock waves, and fourth-order accurate in time. To give a fair comparison, all source terms of the 2D SWEs were treated identically for all schemes: the bed-slope terms were computed separately from the convective fluxes using a Riemann-solver-free scheme, and the friction terms were computed semi-implicitly within the framework of the RKFO method.
Two important findings have been shown by our simulations. Firstly, highly efficient vectorization could be applied to the three solvers on all hardware used. This was achieved by guided vectorization, where a cell-edge reordering strategy was employed to ease the vectorization implementations and to support aligned memory access patterns. Regarding single-core analysis, the vectorization was shown to speed up the edge-driven level by up to 4.5–6.5× on the AVX/AVX2 machines for eight data per vector and 16.7× on the AVX512 machine for 16 data per vector, and to accelerate the entire simulation by up to 4–5.5× on the AVX/AVX2 machines and 13.91× on the AVX512 machine. The superlinear speedup in the edge-driven level, especially on the AVX512 machine, was probably achieved due to improved cache usage and thus fewer expensive main memory accesses. Regarding single-node analysis, our code reached improvements of 75.7–121.8× in the edge-driven level on the AVX/AVX2 machines, while on the AVX512 machine it achieved up to 928.9× speedup. For updating the entire simulation, our code reached speedups of 68.8–109.6× and 774.6× on the AVX/AVX2 and AVX512 machines, respectively. We observed an interesting phenomenon: without vectorization, the parallelized results of the AVX2 machine outperformed those of the AVX512 machine in both the edge-driven level and the entire simulation by a factor of up to 2×; the parallelized-vectorized results of the AVX512 machine, however, became faster by an average factor of 1.6×. This clearly shows that our reordering strategy could efficiently exploit the vectorization support of such a vector-computing machine. By supporting aligned memory access patterns, the reordering strategy helped to improve the performance of the "only vectorized" code by, on average, 1.45× for the edge-driven level and 1.4× for updating the entire simulation.
Secondly, we have shown that for the four cases simulated, strong agreement between the numerical results and the observed data was obtained by all schemes, with no significant differences in accuracy. In terms of efficiency, however, the CU scheme outperformed the HLLC and Roe schemes by average factors of 1.4× and 1.25×, respectively. Although the vectorization significantly improved the performance of all solvers, the CU scheme remained the most efficient among them. From this, we conclude that the CU solver, as a Riemann-solver-free scheme, would in general be able to outperform the Riemann solvers (the HLLC and Roe schemes) even for simulations on the next generation of modern hardware. This is because the computational procedures of the CU scheme are comparatively simple, in particular containing none of the complex branch statements ($\mathtt{if}$-$\mathtt{then}$-$\mathtt{else}$) required by the HLLC and Roe schemes.
Since simulating shallow water flows on modern hardware, especially complex phenomena that require long real-time computations as part of disaster planning such as dam-break or tsunami cases, is becoming more and more common nowadays and will remain so in the future, focusing simulations only on numerical accuracy while ignoring performance efficiency is no longer an option. Wasting performance is obviously undesirable, as it costs too much time in such long real-time simulations. Modern hardware offers many features for gaining efficiency, one of which is vectorization; it can be regarded as the "easiest" way to benefit from vector-level parallelism, yet it is non-trivial. It is not obtained for free: one should at least understand and support the vectorization concept, owing to the sophisticated memory access patterns involved. The cell-edge reordering strategy employed here is one of the simplest strategies for utilizing the vectorization feature of modern hardware and could easily be applied to any CCFV scheme for shallow flow simulations, together with guided vectorization instead of explicit low-level vectorization, which can be error-prone and time-consuming. It is worth pointing out that this strategy is also applicable to any compiler with vectorization support, e.g., Gfortran. We observed that the performance obtained with the Intel compiler was typically 2–3× higher than that obtained with Gfortran, which we believe is due to the close correspondence between the Intel compiler and Intel hardware.
We have also shown that the edge-driven level, especially the reconstruction technique and the solver computations, was the most time-consuming part, requiring 65–75% of the entire simulation time. This indicates that more "aggressive" optimization techniques remain a promising topic for future studies to make shallow water simulations more efficient, particularly in the edge-driven level. Finally, we conclude that this study will be a useful reference for modelers who are interested in developing shallow water codes.