IberWQ: A GPU Accelerated Tool for 2D Water Quality Modeling in Rivers and Estuaries

: Numerical models are useful tools to analyze water quality by computing the concentration of physical, chemical and biological parameters. The present work introduces a two ‐ dimensional depth ‐ averaged model that computes the most relevant and frequent parameters used to evaluate water quality. High performance computing (HPC) techniques based on graphic processing unit (GPU) parallelization have been applied to improve the efficiency of the package, providing speed ‐ ups of two orders of magnitude in a standard PC. Several test cases were analyzed to show the capabilities and efficiency of the model to evaluate the environmental status of rivers and non ‐ stratified estuaries. IberWQ will be freely available through the package Iber.


Introduction
The application of Environmental Quality Standards (EQS) has become a widely used technique to assess the status of surface water bodies. These standards are based on the spatial and temporal evolution of certain variables and pollutants that might pose a significant risk to the environment (temperature, salinity, fecal contamination, nutrients, dissolved oxygen, pH, etc.). The concentration of these variables, which depends on the receiving waters, must be within a range of values to protect the environment and the human health. More generally, not only the exceedance of a given threshold should be controlled, but also the frequency and duration of those exceedance events. Concentrationduration-frequency curves can be used to define the thresholds that should not be exceeded to guarantee a correct environmental status [1][2][3].
In this context, numerical models have become valuable tools to help decision makers to evaluate alternative measures and solutions for the control of sewage spills. Some of the most known and used models for this purpose are the Water Quality Analysis Simulation Program (WASP) originally developed by [4,5], the QUAL2K model [6] and the CE-QUAL-W2 model [7,8]. Since these models assume different hydrodynamic approximations, their suitability is strongly dependent on the case under study. QUAL2K is a 1D steady flow model for rivers that assumes that the flow is well-mixed in the whole cross section. CE-QUAL-W2 is a 2D laterally averaged model that solves the hydrodynamics in the longitudinal and vertical direction, being therefore appropriate for rivers, reservoirs and estuaries well-mixed in the lateral direction. WASP can solve 1D, 2D and 3D problems with a large variety of pollutant types. However, this model needs to be linked with a hydrodynamic

Model Structure
IberWQ consists of a set of routines that solve a series of unsteady 2D depth-averaged transport equations for different water quality parameters. This includes advection by the mean flow, diffusion due to concentration gradients and reaction with other variables. Those routines are fully integrated within the shallow water model Iber [19], which solves the 2D Saint-Venant equations including turbulence. Thus, the resulting model for water quality and hydrodynamics is fully coupled. Figure 1 shows the different species and biochemical reactions considered in the extension of the IberWQ module presented in this paper. The new components considered in relation to [14] are organic and inorganic phosphorus, phytoplankton, alkalinity and inorganic carbon. In addition, the pH is computed as a function of inorganic carbon and alkalinity. Phytoplankton is represented as the concentration of Chlorophyll-A (Chl-A), as commonly done in other water quality models [22]. Thus, in the case of computing phytoplankton the user must introduce the ratios between nitrogen, phosphorus, carbon and Chl-A in order to evaluate the corresponding biochemical reactions.
Inorganic carbon includes carbonates (CO ), bicarbonates (HCO ) and carbon dioxide (CO ). The three species were modelled together as total inorganic carbon dissolved in water. An a posteriori estimation of the concentration of each species was done as a function of pH (specific equations are presented in Appendix B).
The whole model included therefore 14 water quality parameters that are commonly used to evaluate the environmental status of rivers and estuaries (the 13 variables shown in Figure 1 plus pH). The different model components can be activated or deactivated depending on the problem to be solved. The kinetic constants of most of the biochemical reactions depend on water temperature and salinity. In case that the temperature field is computed by the model, additional atmospheric input data (time series of net solar and atmospheric radiation, air temperature, atmospheric relative humidity and wind velocity) must be introduced by the user.

Model Equations
A generic 2D depth-averaged advection-diffusion equation with reaction terms is solved for each of the species considered in the model. In order to compute the advection terms, the water quality module is linked to a hydrodynamic module that solves the 2D shallow water equations, given by: where ℎ is the water depth, is the bed elevation (topography), ( , ) are the two components of the unit discharge, | | is the modulus of the unit discharge, is the depth-averaged concentration of the species (the available species are shown in Figure 1), is a generic reaction term for the species , is the density of water, is the gravity acceleration, is the Manning coefficient and is the effective viscosity computed using a depth-averaged turbulence model. The available turbulence models include the model, a mixing-length model and a parabolic model, and are described in detail in [23]. If the baroclinic pressure is activated in the simulation, the water density is computed from the salinity and temperature fields. For this, the International One Atmosphere Equation of State of Seawater [24] is employed. The total number of partial differential equations solved ( ) depends on the number of species considered on each specific test case ( ), and it is equal to 3 . The biochemical reactions considered for each species were modeled with the source terms . Each solid arrow in Figure 1 represents a biochemical reaction and was modeled as a first order reaction [22,25]. The formulations used to compute the different source terms are basically the same as those used in well-known and extendedly used models such as WASP, QUAL-2K and CE-QUAL-W2, which follow [22]. The details of the equations implemented in IberWQ are given in Appendix A and B, while the model constants to be introduced by the user are detailed in Appendix C.

Numerical Solver
The 2D depth-averaged transport equations for each species were solved with an unstructured finite volume solver, using a computational grid formed by triangular and quadrilateral elements. The same finite volume mesh was used to solve the transport equations of the water quality species and the hydrodynamic equations. Thus, the water depth (ℎ) and the unit discharges ( , ) needed to compute the advective fluxes in the scalar transport equations were obtained from the hydrodynamic module of Iber. This module solved the 2D Saint Venant equations using an upwind Godunov scheme with the approximate Riemann solver of Roe. In order to transfer the water velocity and depth from the hydraulic module to the water quality advection-diffusion equations, the mass conservative scheme detailed in [26] was used.
The advective terms can be discretized with either a first order upwind scheme [27] or a second order upwind scheme. In particular, the Gamma scheme proposed in [28] was implemented in the solver to obtain second order accuracy in space. Using this approach only implies an increase on the CPU (Central Processing Unit) time of approximately 5%-10%.

Parallelization
In order to address the limitations in terms of efficiency of the Iber model, a new implementation was developed. The new implementation, Iber+ onwards, was first presented in [21]. It was developed in C++ using the object-oriented paradigm to improve the modularity and the maintenance of the source code. Iber+ was parallelized for shared memory systems using OpenMP and for GPUs using the Nvidia CUDA (Compute Unified Device Architecture) [29] platform. The GPUs are a cost-effective solution to accelerate numerical models without the necessity of expensive HPC facilities. GPUs are a solution available for servers, workstations and laptops. GPU computing is attractive not only to reduce the computational time of the simulations but also to improve the time spent on design and test cases. There are numerous cases of success in the literature for both meshless [30] and mesh-based [31][32][33][34] codes. Iber+ features both CPU and GPU parallelization, however the focus of the implementation is on GPU computing, so in this study only the GPU implementation is analyzed. Iber+ is included in the official package of Iber since version 2.5 and can be downloaded without any cost at http://www.iberaula.com. The new developments presented in this paper will be included in the next release.
Originally developed exclusively for computer graphics, the GPUs feature a highly parallel architecture that provides several TFLOPS (10 12 floating point operations per second) of computing power in a single card [35]. This high-throughput hardware can be employed for general applications due to the GPGPU (general processing graphics processing unit) APIs (application programming interface), being Nvidia CUDA one of the most popular in scientific applications. CUDA exposes the GPUs capabilities to programmers due to an extension to common programming languages like C, C++ or Fortran. These extensions provide three abstractions to programmers: the thread hierarchy, shared memory and barrier synchronization [29].
The Nvidia GPUs feature a SIMT (single instruction multiple thread) architecture that issues threads in groups of 32 named warps. Each warp is executed in parallel in a SM (stream multiprocessor). All the threads in a warp execute the same instruction. If a branch instruction is processed, a divergence starts in this case, the threads that follow one of the paths are stalled while the others are executed and vice versa. Once the branch is finished, all the threads converge again. It is essential to consider this peculiarity of the GPUs in order to achieve a high throughput. The algorithms should be revised to avoid unnecessary branching and reorganize data to avoid divergence.
Discrete GPUs have their own memory, independent of the system memory. Data transfers should be made from the system memory to the GPU's memory, usually through the PCI-Express bus. These data transfers can be a potential bottleneck due to the limited bandwidth and higher latency compared with the system memory accesses. Figure 2 shows the flowchart of Iber+, the memory transfers have been reduced as much as possible. The problem data is transferred before the simulation starts and when it is necessary to write it to the hard disk. Only the current timestep of the simulation is transferred on every loop iteration of the simulation. The write of the simulation state to the hard disk can suppose an important part of the total run time, depending on the case and system configuration. To reduce these times, data is written to disk in background by a thread separated from the main execution thread. On the other hand, the data locality is essential in modern architectures to reduce the cache miss rate and thus improve performance. For this, the elements of the mesh are reordered using a space-filling curve, more specifically the Hilbert curve [36] was used for this purpose.
Global synchronization in GPUs is expensive, making some algorithms like reductions more complicated than their CPU counterpart. In order to implement reductions efficiently, Iber+ employs the library Nvidia CUB (CUDA unbound) [37].
Modern GPUs have significant performance differences when working with single or double precision. The performance ratio single/double precision may vary from 1/2 for the Nvidia Tesla V100 For the implementation of the water quality module, an object-oriented approach was used. In this way, new species can be easily added by implementing an abstract class without significantly affecting the existing code.

Test Cases
In this section, four test cases were analyzed to illustrate the capabilities and performance of IberWQ and its GPU implementation. For a detailed validation of the test cases the reader is addressed to previous publications from the authors [14]. More information about the data sources used for the cases analyzed can be found in Appendix F. All the cases were run in a workstation equipped with an AMD Ryzen 7 2700X processor, 32 GB of RAM (random access memory) and an Nvidia RTX 2080ti graphics card. The run time measurements were carried out using the execution time corresponding to the simulation loop.

Description
The first test case simulated the concentration of fecal contamination in a coastal estuary. This is a typical application of 2D depth-averaged water quality models [9,13,[39][40][41][42][43] in which only one species, E. coli, needs to be computed.
The model was applied to compute the concentration of E. coli in the coastal estuary of Ferrol, located in the NW of Spain ( Figure 3). The estuary extends over 32 km 2 with a relatively narrow and elongated shape. The total length of the estuary is 17 km, while its width varies from 400 m to a couple of kilometers in its widest part. Tidal ranges vary from 0.9 (neap tides) to 4.6 m (spring tides), being freshwater inflows from rivers negligible. As we are using a 2D depth-averaged model, we assumed well-mixed conditions over the water depth, which are characteristic of the autumn, winter and spring seasons in this estuary [39].
Approximately six spring semi-diurnal tidal cycles were simulated (3 days), with an approximate tidal range of 4.2 m. Two continuous sewage spills at different locations were considered. The first one (V2 in Figure 3) is located in the narrowest region of the estuary, where the high water velocities activate a rapid mixing with the receiving waters. The second one (V1 in Figure  3) is located in the inner part of the estuary, where mixing and dispersion were slower due to the lower water velocity. Both discharges were characterized by an E. coli concentration of 10 7 cfu/100 mL and a sewage discharge of 0.1 m 3 /s, both values were constant in time. Even though the model included the possibility of using the formulation of Mancini to evaluate the degradation of E. coli, in this case we had fixed the value of the degradation constant to 5.5 days −1 , following previous studies performed in this estuary [44].
To perform the computations, the estuary was discretized with an unstructured computational grid with 143,993 elements (with an average element size of 222 m 2 ). The levels of E. coli were sampled every 600 s with Iber and Iber+ at control points P1, P2 and P3.

Results
The E. coli concentration was sampled at three control points (P1, P2 and P3 in Figure 3). Figure  4 compares the spatial distribution of E. coli in the estuary at different time steps for the CPU and GPU implementations. The results obtained with Iber and Iber+ were virtually identical. The time series of concentration at the three control points are shown in Figure 5 where no significant differences could be observed. Furthermore, the time series of concentration obtained with the GPU and CPU implementations were statistically compared in terms of the normalized root mean square deviation (in %): where xi and yi are the values computed with Iber+ and Iber respectively for any variable of interest. The average error, 〈NRMSD〉 = 0.07%, was calculated taking into account all variables at all control points, supporting the good resemblance between series suggested by the visual comparison.   Table 1 shows the performance measurements for the first test case. This was the best case for GPU implementation because it had the largest mesh (143,993 elements) among the cases under study. GPU computing implied a certain overhead, mainly due to memory transfers and kernel launch latency. The higher the GPU workload the less significant the overhead will be. In this case Iber took more than 16 h to complete the simulation whilst Iber+ could run the same problem in less than 6 min. This implied a speedup of 181.

Description
In the second test case the concentration of dissolved oxygen (DO) and carbonaceous biochemical oxygen demand (CBOD) were computed in the estuary of A Coruña, located in the NW of Spain (Figure 6). The estuary has an area of 26 km 2 . The outer part of the estuary is relatively deep (with depths of the order of 20 m and maximum values of circa 30 m in the mouth), while the inner part, where the sewage spills are located, is relatively shallow (with maximum water depths of the order of 5 m at high tide and many dry regions at low tide).
The physical time simulated for this case extends over 8 tidal cycles (4 days) in which the tidal range moves from neap tides (with a tidal range of 1.2 m) to spring tides (with a tidal range of 3.5 m). The river discharge at the upstream boundary of the model is equal to 20 m 3 /s, which corresponds to its annual average value. Four discontinuous sewage spills of CBOD were considered, all of them located in the inner estuary ( Figure 6). The mean depth at the spills varied within 3 m and 4 m, which were relatively shallow depths. All the spills were characterized by the same uniform concentration of CBOD and DO (511 mg/L of CBOD and 3.5 mg/L of DO). The water discharges were discontinuous and different from one spill to another, with maximum values ranging from 0.06 to 0.5 m 3 /s. The CBOD degradation rate at 20 °C was set to 0.23 day 1 .
The domain was discretized using a computational grid of 50,329 elements. The mesh size was variable over the estuary. A finer resolution was used in the inner part of the estuary (element sizes of 10 m) since in this region the spills were located and unsteady wet-dry tidal fronts appeared. The size of the elements in the mouth of the estuary was approximately 180 m. The levels of DO and CBOD were sampled every 300 s using both Iber and Iber+ at control points P1, P2 and P3.

Results
In the second test case, the levels of CBOD and DO were sampled at four control points shown in Figure 6. In particular, the time series of CBOD computed at these control points are shown in Figure 7 with identical results for both implementations. Regarding DO concentration (Figure 8), very small differences could be appreciated at control point P2, where the peak values computed with Iber+ were slightly higher than those computed with Iber. Statistically, the differences were not significant with a negligible average error (〈NRMSD〉 = 0.32%, covering all points and variables), although higher than in previous case.  This case employed a mesh with 50329 elements, about a third of the mesh elements used in the first test case. As shown in Table 2, Iber+ was able to perform the simulation 61 times faster than Iber. In spite of having a lower number of elements, Table 2 shows that the real time employed by Iber+ to compute a simulation time step was longer than in the previous test. This can be explained by two reasons. First, this case had to compute two water quality species instead of one, causing a larger kernel launch overhead. Second, the size of the problem was not big enough to saturate the GPU capacity. Thus, this case was not able to take full advantage of the GPU computing capacity, as it happened in the previous case.

Description
This test case was extracted from the study presented in [1]. In that work, a 2D water quality model is used as a fundamental part of an integrated modeling approach for the design of the sewer network of the city of Lugo (Spain). Here, the impact of combined sewer overflows (CSO) in the river Miño was evaluated by means of the Environmental Quality Standards (EQS) presented in the Urban Pollution Manual [44]. The application of EQS requires an efficient water quality model since usually a large number of simulations must be run over long periods of time.
The objective variables computed in [44] to apply the EQS are the concentrations of dissolved oxygen and ammonia. The evaluation of these variables requires the computation of the five following species: org-N, NH3-N, NO3-N, DO and CBOD.
In the example presented here, two days of discontinuous sewage discharges at three different locations along the river were modeled. All the spills were characterized by the following concentrations: 102 mg/L of CBOD, 5 mg/L of org-N, 1.5 mg/L of NH3-N and 4 mg/L of DO. The concentration of nitrates (NO3-N) in the spills is negligible. The spill discharges were discontinuous in time and different from each other, with maximum values of discharge ranging from 0.18 to 1.5 m 3 /s. The ambient river concentration of the water quality species was negligible with the exception of the DO concentration, which was set to 9 mg/L. Those ambient values were imposed as initial and upstream boundary conditions in the model. The river discharge, which was obtained from a Water Quality Automatic Information System (SAICA) located upstream the river reach under study, varied from 40 to 60 m 3 /s during the two days of computation. The same values of the reaction kinetic constants proposed in [45] were used in the simulations, namely (all values at 20 °C): CBOD degradation rate equal to 0.35 day , org-N hydrolysis rate equal to 0.20 day , nitrification rate equal to 0.50 day and denitrification rate equal to 0.05 . Figure 9 shows the bathymetry of the river reach under study and the location of the sewage spills. The computational domain extends over 8 km of river, with an average width of 70 m and a total extension of 0.57 km 2 . The domain was meshed with 8763 elements with an average size of 65 m 2 . The levels of the simulated species were sampled every 600 s with both models at the control points P1, P2, P3 and P4.

Results
In the third case, five species (org-N, NH3-N, NO3-N, DO and CBOD) were sampled at the four control points indicated in Figure 9. For the sake of clarity, only the results obtained at P3 are shown in Figure 10, while the rest of time series are shown in Appendix D. In this simulation, the average error (〈NRMSD〉 = 0.04%) was negligible and smaller than in the previous two cases. In this case, Iber+ completed the simulation 29 times faster than Iber (Table 3). This speedup was considerably lower than the one obtained in the previous two cases due to the relatively small size of the mesh, with just 8763 elements. Compared to the previous case, the time needed to process a single time step was lower whilst the number of cells processed per second was much lower. This indicates the difficulty of taking full advantage of the GPU resources in cases with a small number of elements, such as the present one. Nevertheless, even with a small mesh, Iber+ ran significantly faster (29 times) than the non-parallelized CPU version.

Description
In this case an effluent discharge from a wastewater treatment plant into a river located in the south of Spain was modeled. The water quality species computed in this case are DO, CBOD, org-N, NH3-N, NO3-N, org-P, PO4-P, inorg-C, alkalinity, salinity and pH. Thus, 10 transport equations were solved in addition to the three Saint Venant equations (as mentioned in Section 2, the pH did not have an associated transport equation, instead its value was computed at each mesh element from the values of alkalinity and inorg-C).
The river reach modeled ( Figure 11 The simulated physical time was 3 days, using an unstructured mesh of 90,406 elements. At the end of the simulation a steady state was achieved for the whole reach. All the simulated species were sampled every 600 s with both Iber and Iber+ at control points P1, P2, P3 and P4.

Results
In the last case, ten different species were simulated (DO, CBOD, org-N, NH3-N, NO3-N, org-P, PO4-P, inorg-C, alkalinity, salinity and pH). Time series of concentrations were taken at Points P1 to P4 as indicated in Figure 11. The zone of the sewage spill is shown in detail in Figure 12. The time series of concentration at control point P3 are shown in Figure 13. For the sake of clarity, the results at the other control points were included in Appendix E. The results provided by Iber and Iber+ were almost identical in all cases, confirmed by the negligible average error (〈NRMSD〉 = 0.16%) as obtained for the rest of the cases.   Table 4 shows the performance measurements for this case, where a mesh of 90,406 elements (halfway between Test 1 and Test 2) was considered. Iber+ reached a speedup of 92 with respect to Iber. Comparing with Test 2, the new mesh was almost double the size of the former one and ten species were simulated instead of two. As a result of this, the time required to process a single time step was almost 3.5 times longer. However, the number of mesh cells processed per second was only twice lower. In this case, Iber+ could take more advantage of the GPU processing capacity due to the bigger mesh size.

Conclusions
Numerical models are useful tools for the evaluation of the environmental status of water bodies. However, it should be noted that these kinds of tools rely on a set of simplifications and are much simpler than the real systems, whose complexity cannot be fully reproduced by the models. Hence, they can provide wrong results, especially when the managers or the decision makers are not aware of those limitations. The parametrization of the models is then a crucial task in order to obtain precise results, which requires an exhaustive sensitivity analysis. In the particular case of Iber+, its 2D nature makes it only applicable to rivers and non-stratified estuaries.
On the other hand, the main advantage of 2D models like Iber+ is their affordable execution time, which makes them especially valuable for fast response purposes when compared with 3D models. However, its application is hindered by the high spatial resolution required for water quality studies in long river reaches or large estuaries.
This paper presented an improved version of a two-dimensional depth-averaged water quality model, whose efficiency was improved by implementing HPC techniques based on GPU parallelization. The model considered the water quality parameters most commonly used in the environmental assessment of receiving waters, including dissolved oxygen, CBOD, organic nitrogen, ammonia, nitrates, organic phosphorus, phosphates, pH, salinity and temperature. The implementation of the code ran in NVIDIA GPUs that are commonly installed in standard laptop and desktop PCs. In the test cases presented here, speedups in computational time between 29 and 181 were obtained when compared with the non-parallelized implementation, keeping the accuracy of the original model. The code will be integrated in the software package Iber, making it freely available.
In summary, the model presented three key features that make it very attractive and useful for the scientific and engineering community: simulation of the most common water quality parameters, high-performance computing in standard PCs and free availability.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Reaction Terms
The following source terms were included in the advection-diffusion equation to model the biochemical reactions between the water quality species considered in the model, as represented in Figure 1. Salinity and alkalinity were considered as conservative variables with no source terms.
Chlorophyll-A (A1) , , 1.047 The values of and are obtained from the solution of their respective depthaveraged transport equations. The previous system of five equations is solved as proposed in [44], by solving the following algebraic non-linear equation for at each mesh element: with: The coefficients , and represent respectively the fraction of inorganic carbon in form of carbon dioxide, bicarbonates and carbonates. Once the previous equation is solved at each mesh element, the pH can be computed as:

Bathymetry
Bathymetric survey carried out for previous studies. Spatial resolution of 30 m.

Streamflow
Annual average flow from river Grande de Xubia. Available from the regional Meteorological Agency MeteoGalicia (www.meteogalicia.gal).

Tide
Tidal harmonics obtained from the tidal gauge of Ferrol. Available from Puertos del Estado (www.puertos.es).

Bathymetry
Bathymetric survey carried out for previous studies. Spatial resolution of 30 m.

Effluent Discharge and Concentration
Sewer network model carried out in previous studies.

Streamflow
Annual average flow from river Mero. Available from the regional Meteorological Agency MeteoGalicia (www.meteogalicia.gal).

Tide
Tidal harmonics obtained from the tidal gauge of A Coruña. Available from Puertos del Estado (www.puertos.es).

Effluent Discharge and Concentration
Sewer network model carried out in [46] and [46].

Streamflow
River discharge obtained from a Water Quality Automatic Information System (SAICA) located upstream the river reach under study. Available from the regional water administration Confederación Hidrográfica del Miño-Sil (www.chminosil.es).

Bathymetry
Digital terrain model at 2 m resolution, obtained from LiDAR data from the Spanish National Plan of Aerophotogrammetry (PNOA), available from the Spanish National Geographic Institute (www.ign.es).

Effluent Discharge and Concentration
Virtual.

Streamflow
Annual average flow obtained from the regional water administration Confederación Hidrográfica del Guadalquivir (www.chguadalquivir.es).