1. Introduction
Hydrological model parameter optimization is a key issue for model application [1,2,3,4,5]. With the development of mathematics and computer technology, a large number of parameter optimization algorithms have been proposed. Among the most popular is the Monte Carlo sampling (MCS)-based parameter optimization method, which, as a basic algorithm, is frequently adopted in real-world applications due to its simplicity and practicality. MCS is widely applied in hydrology, where it uses random sampling to obtain representative samples and assess uncertainties [6,7,8]. MCS has found applications in flood risk analysis, water resources management, and hydrological modeling [9]. In flood risk analysis, Monte Carlo simulations based on statistical distributions enable the estimation of flood magnitudes and the assessment of associated risks; this helps in designing flood protection measures, establishing flood warning systems, and evaluating infrastructure vulnerability [10]. Water resources management benefits from MCS by incorporating uncertainties related to factors such as precipitation patterns and water demand; such simulations aid decision-makers in assessing the reliability and sustainability of water supply systems and support optimal management strategies and long-term planning [11]. Hydrological modeling relies on MCS to quantify uncertainties associated with model inputs and parameters, generating ensembles of model outputs that improve the credibility and understanding of model projections; applications include rainfall-runoff modeling, streamflow prediction, and groundwater modeling [12].
However, the computational cost of the MCS-based method is high due to the extensive parameter search space of the Monte Carlo experiments and the huge number of objective function evaluations. The single-threaded, or serial, MCS-based parameter optimization method consumes too much time, which prevents researchers and engineers from applying it to highly complex real-world problems. Binley and Beven [13] attempted to assess the uncertainty associated with the predictions of a distributed rainfall-runoff model and carried out parameter optimization using an MCS-based method, GLUE. However, they recognized that the computational burden of the MCS is very high and therefore constrained their MCS to 500 simulations, even though they adopted a relatively simple distributed model, the Institute of Hydrology Distributed Model version 4 (IHDM4). At that time, performing this level of computation required significant code enhancement in order to fully use the computational horsepower of an 80-node transputer parallel computer [14]. In that era, these pioneering studies were trying to employ hardware and software that was new to hydrological sciences and related disciplines. With further development of computer technology, the constraints computers impose on the application of MCS-based methods have been relaxed to some extent. However, it remains an issue, either because a model is particularly slow to run, so that it is still not possible to sample sufficient simulations, or because of the high number of parameter dimensions. The largest number of model runs used in a GLUE application that we know of is the two-billion-run application [15,16]. This was for a model written in just a few lines of code but including 17 parameters for calibration. We can infer that for more complex models, two billion samples are still insufficient. Therefore, a large number of samples is necessary when carrying out a Monte Carlo experiment, which consequently requires much more powerful parallel acceleration techniques.
The development of modern microelectronic technology provides more powerful parallel computers. Multi-core and many-core hybrid heterogeneous parallel computing platforms have become prominent in the recent high-performance computing field due to their stupendous floating-point computing capabilities compared with traditional CPU-only and older-generation computers. Recently, several heterogeneous supercomputers, such as Summit and Sierra, have shown excellent performance on the TOP500 test. The success of the CPU-GPU heterogeneous platform owes to its better cost performance and energy consumption. The modern CPU-GPU hybrid computing platform has become the best choice for researchers and engineers who need high-performance computing [17,18,19,20,21,22,23]. Moreover, the GPU is widely installed in modern PCs; therefore, CPU-GPU hybrid heterogeneous computer systems are readily available for scientific computing. The popularization of GPU cards enables CPU-GPU hybrid parallel programs to execute on almost all PCs. Although the GPUs available in PCs mainly target gaming and entertainment tasks rather than double-precision floating-point computation, these devices still perform very well in applications that do not require double-precision capability. Further, the software development toolkits of the CPU-GPU hybrid platform are easy to start with and can be acquired for free. Therefore, a modern CPU-GPU heterogeneous parallel computing platform can be established easily at relatively low cost, and it shows good prospects in engineering applications.
More recently, MCS-based parameter optimization has been sped up for some applications using parallel computing techniques such as multi-core CPUs. Even though a number of researchers have studied the acceleration of MCS-based parameter optimization, little research has fully utilized the huge computational horsepower of the new generation of CPU-GPU hybrid high-performance computer clusters. With the development of modern heterogeneous parallel computing technology, new-generation hardware integrated with versatile software development tools can provide tremendous computing horsepower and much better energy efficiency than ever before. The acceleration of the MCS-based parameter optimization method should catch up with the state of the art of modern high-performance computing technology.
With the arrival of the big data era, hydrological model parameter optimization requires an unprecedentedly large amount of computing horsepower. This research focused on the MCS-based parameter optimization method coupled with the newly emerged modern CPU-GPU hybrid high-performance computer cluster acceleration technology. In order to further improve the computational efficiency of MCS-based hydrological model parameter optimization, a CPU-GPU hybrid computer cluster-based parallel parameter optimization method was proposed. The parallel method was implemented on a CPU cluster and a GPU cluster, respectively. We utilized a total of five CPUs and five GPUs to achieve satisfactory acceleration results. Further, the scalability issue was investigated to demonstrate the robustness and scalability of the parallel optimization method. Additionally, the correctness of the proposed method was tested using sensitivity and uncertainty analysis of the model parameter sample points generated by the proposed method. Study results indicate good acceleration efficiency and reliable correctness of the proposed parallel optimization methods, which demonstrates excellent prospects in practical applications.
2. Methodology
2.1. Xinanjiang Model Parameter Optimization Based on Monte Carlo Sampling
2.1.1. The Monte Carlo Sampling
The brief procedure of the traditional MCS for model parameter optimization is listed below, and detailed descriptions of the MCS method can be found in relevant literature.
- (1)
A formal definition of a likelihood measure or set of likelihood measures is required. For hydrological model applications, the Nash–Sutcliffe coefficient of efficiency (NSCE) is usually adopted as the likelihood measure or, in other words, the objective function value. It can be calculated as follows:

NSCE = 1 − [∑_{i=1}^{n} (Q_{obs,i} − Q_{sim,i})²] / [∑_{i=1}^{n} (Q_{obs,i} − Q̄_{obs})²]  (1)

where Q_{sim,i} denotes simulated discharge at time step i; Q_{obs,i} denotes observed discharge at time step i; Q̄_{obs} denotes the mean of the observed discharge values; n denotes the number of discharge data.
- (2)
An appropriate definition of the range and distribution of the parameter values is necessary for a particular model structure. Generally speaking, the ranges of parameters are predefined according to the physical meanings of the specific hydrological model, and the uniform distribution is adopted in most cases since the actual distribution of parameters is usually unknown.
- (3)
Sampling of the parameter sets in the feasible space is achieved by utilizing the Monte Carlo approach, and the likelihood values are evaluated with the objective function after obtaining the simulation results of the hydrological model.
- (4)
The optimality of different parameter sets is evaluated based on their likelihood value.
2.1.2. The Xinanjiang Model
The Xinanjiang model was developed in 1973 and published in 1980 [24,25,26]. Its main feature is the concept of runoff formation on repletion of storage, which means that runoff is not generated until the soil moisture content of the vadose zone reaches field capacity, and thereafter, runoff equals excess rainfall without further loss. This hypothesis was first proposed in the 1960s, and much subsequent experience supports its validity for humid and semi-humid regions. According to the original formulation, the generated runoff was separated into two components using Horton's concept of a final, constant infiltration rate: infiltrated water was assumed to move into groundwater storage, and the remainder became surface or storm runoff. However, evidence of variability in the final infiltration rate and in the unit hydrograph assumed to connect the storm runoff to the discharge from each sub-basin suggested the necessity of a third component. Guided by the work of Kirkby, an additional component, interflow, was added to the model in 1980. The modified model is now successfully and widely applied in China. The model structure is demonstrated in Figure 1. Detailed descriptions of the principles of the Xinanjiang model can be found in the relevant literature.
2.1.3. Model Parameter Optimization
The traditional MCS-based Xinanjiang model parameter optimization involves two aspects: (a) parameters upper and lower boundaries and their constraints and (b) the objective function or likelihood measurement. Additionally, for hydrological simulation with the Xinanjiang model in this study, both the computational time step and hydro-meteorological data time interval were set to one day.
For model parameters, the specification of lower and upper boundaries is listed in Table 1. For the Xinanjiang model, the number of parameters (n) that need to be sampled is 15. Parameters KG and KI of the linear reservoir-based flow concentration module have a structural constraint KG + KI = 0.7. Therefore, we sample KG in this research, and KI is calculated as 0.7 − KG.
The Xinanjiang daily model focuses on water balance and hydrograph simulation. Therefore, the objective function (OBJ) adopted herein is calculated from the RDRE and the NSCE, where RDRE and NSCE represent the Runoff Depth Relative Error and the Nash–Sutcliffe Coefficient of Efficiency, respectively. The computation of the NSCE has been listed in Equation (1), and the RDRE is calculated as follows:

RDRE = [(∑_{i=1}^{n} Q_{sim,i} − ∑_{i=1}^{n} Q_{obs,i}) / ∑_{i=1}^{n} Q_{obs,i}] × 100%

where Q_{sim,i} denotes simulated discharge at time step i; Q_{obs,i} denotes observed discharge at time step i; n denotes the number of discharge data.
The parameters and state variables of the Xinanjiang model require two additional constraints to ensure the correctness of the model's physical meaning. The constraints are applied to parameters CG, CI, and CS (CG ≥ CI ≥ CS) and to soil moisture W (W must be non-negative). In order to consider the first constraint in the MCS procedure, before calculating the OBJ, we test the CG, CI, and CS values to verify whether the constraint is satisfied. If it is satisfied, we continue the calculation of the OBJ; otherwise, we set the OBJ to a penalty term computed with a penalty coefficient λ, which was set to 1000 in this research. If the first constraint is satisfied, we run the Xinanjiang daily model using the hydro-meteorological data to generate the simulated discharge time series. After the model simulation finishes, a "flag" is returned to indicate whether the simulation succeeded. If the simulation stops early and returns a "flag" indicating that the state variable W has negative values, we set the OBJ to a penalty term that depends on WMUB, the upper boundary of parameter WM, and the WM value generated from the MCS. This penalty term forces the algorithm to search toward larger WM values to avoid negative W values.
If both of the above-mentioned constraints are satisfied, we calculate the OBJ from the RDRE and NSCE as described above. After the OBJ of each parameter set is calculated, the parameter set with the minimum OBJ value is selected as the optimal parameter set.
2.2. Xinanjiang Model Parameter Optimization Based on Parallel Monte Carlo Sampling
2.2.1. Parallel Monte Carlo Sampling
The parallelization of the MCS-based parameter optimization involves two steps, which include the parallelization of the Monte Carlo sampling and the optimal parameter set reduction. We implemented the parallel MCS method on a multi-core CPU computer cluster and a many-core GPU computer cluster, respectively. A detailed description of the implementation can be found in the following paragraphs.
2.2.2. CPU Computer Cluster Implementation
The parallel optimization method was implemented on a multi-core CPU computer cluster, which contains four HP Z-series workstations hosting a total of five INTEL Xeon E5-2630v3 multi-core CPUs. The flow chart of the CPU computer cluster implementation of the parallel optimization method is demonstrated in Figure 2.
The parallel optimization method starts from the initial settings on the master node. The hydro-meteorological data were loaded from CSV (comma-separated values) files, which include daily rainfall, runoff discharge, evapotranspiration, and catchment geographical information. Further, the program sets the total sample number of the Monte Carlo experiment (NS), the likelihood threshold (TH), the model parameter boundaries, and the KG plus KI constraint. After initial data loading and settings, the algorithm queries the number of slave nodes (NN) and the number of CPU cores in each slave node by using the MPI and OpenMP APIs, respectively. The workload quantity assigned to each slave node is calculated as follows:

NSS_i = NS × NC_i / NT

where NSS_i denotes the number of samples generated in slave node i; NC_i denotes the number of CPU cores in slave node i; NT denotes the total number of CPU cores of the computer cluster; i = 1, 2, …, NN.
After initial data loading and model settings have been finished, the above-mentioned data and settings are broadcast to all slave nodes by using the MPI_Bcast API. Each slave node (take slave node i as an example) generates NCi threads to sample NSSi parameter sets that fall in the parameter feasible space. Each thread runs the Xinanjiang hydrological model and computes the likelihood function value using the generated parameter set. A parameter set with a likelihood value higher than TH is preserved. Model simulations and likelihood function evaluations in the NCi threads are executed in parallel by using OpenMP. After the NCi threads of calculations have finished, an OpenMP parallel reduction is started to find the optimal parameter set of each slave node. Finally, the feasible parameter sets and the optimal parameter set are sent to the master node by using the MPI_Send API.
During the MCS, the master node waits until the likelihood calculations and parallel reduction in all the slave nodes are complete. Once all the above computations are complete, the master node receives the feasible and best parameter sets from each slave node by using the MPI_Recv API. Finally, the master node chooses the parameter set with the largest likelihood function value as the optimal parameter set and finishes the execution by using the MPI_Finalize API.
2.2.3. GPU Computer Cluster Implementation
The parallel optimization method was implemented on a many-core GPU computer cluster, which is constructed from four HP Z-series workstations hosting a total of five NVIDIA Tesla K40c GPUs. The flow chart of the GPU computer cluster implementation of the parallel optimization method is demonstrated in Figure 3.
The parallel optimization method starts from the initial settings on the master node. The hydro-meteorological data are loaded from CSV (comma-separated values) files, which include daily rainfall, runoff discharge, evapotranspiration, and the geographical information of the study catchment. The algorithm also sets the total sample number of the Monte Carlo experiment (NS), the likelihood threshold (TH), the model parameter boundaries, and the KG plus KI constraint. After initial data loading and settings, the algorithm begins the MPI execution and queries the number of slave nodes (NN), the number of GPUs in each slave node, and the number of GPU cores in each slave node by using the MPI and CUDA APIs, respectively. The workload quantity assigned to each slave node is calculated as follows:

NSS_i = NS × (∑_{j=1}^{NG_i} NC_{i,j}) / NT

where NSS_i denotes the number of samples generated in slave node i; NC_{i,j} denotes the number of GPU cores of GPU j in slave node i; NG_i denotes the number of GPUs in slave node i; NT denotes the total number of GPU cores of the computer cluster; i = 1, 2, …, NN; j = 1, 2, …, NG_i.
After initial data loading and settings have been finished, the above-mentioned data and settings are broadcast to all slave nodes by using the MPI_Bcast API. Each slave node (take slave node i as an example) creates NGi CPU threads to control its NGi GPUs by using OpenMP and offloads the data and settings onto the GPUs by using CUDA APIs such as cudaMemcpy. Slave node i samples NSSi parameter sets within the parameter feasible space on its GPUs. Each GPU thread runs the Xinanjiang hydrological model and computes the likelihood function value using the generated parameter set, and parameter sets with a likelihood value higher than TH are preserved. The model and likelihood function calculations are executed in parallel on the GPUs by using OpenMP and CUDA; GPU j is responsible for parameter set generation, model runs, and likelihood function evaluations for its share of the samples. After the likelihood calculations, a CUDA parallel reduction is started to find the feasible parameter sets and the optimal parameter set of each slave node. Finally, the optimal parameter set is sent to the master node by using the MPI_Send API.
During the MCS, likelihood calculations, and parallel reduction in all the slave nodes, the master node waits for all these computations to complete. Once they are completed, the master node receives the feasible parameter sets and the optimal parameter set of each slave node by using the MPI_Recv API. Finally, the master node chooses the parameter set with the largest likelihood function value as the best parameter set and finishes the execution by using the MPI_Finalize API.
2.3. Sensitivity and Uncertainty Analysis Based on GLUE
For the purpose of verifying the correctness of the proposed parallel parameter optimization method, we carry out sensitivity and uncertainty analysis on the Monte Carlo-generated parameter set samples by using the Generalized Likelihood Uncertainty Estimation methodology, GLUE.
The principle of the GLUE method can be summarized as follows. GLUE was proposed by Beven, and it is a framework for model calibration and uncertainty analysis in hydrological and environmental sciences. The method assumes that the true system behavior cannot be fully represented by a single set of model parameters, so it considers multiple alternative model parameter sets to capture the uncertainties. GLUE uses a likelihood measure to compare the observed data with model simulations, quantifying how well each model reproduces the observed data. Each model simulation is characterized by a set of parameter values, and GLUE accounts for parameter uncertainty by sampling from prior distributions. GLUE generates a large number of model realizations by sampling parameters randomly from their defined distributions using a Monte-Carlo-based sampling method. Model performances are ranked based on how well they reproduce the observed data using the likelihood measure. A threshold value is defined to determine an "acceptable" range of model performance, and model parameter sets falling within this range are considered plausible. The ensemble of plausible model parameter sets is used to generate predictions and assess uncertainty using various statistical metrics and visualization techniques. Through the above-mentioned procedures, GLUE provides a comprehensive framework for uncertainty analysis. It should be noted that this is only a brief overview of the GLUE method's principles; further technical details and variations, depending on the specific application, can be found in the relevant references.
2.4. Hardware Adopted in This Study
The hardware utilized in this research is a computer cluster composed of one HP Z820 and three HP Z840 workstations. The screen and four workstations are demonstrated in Figure 4. The USB KVM switch, which is used for controlling and switching among the four workstations with one set of screen, keyboard, and mouse, is shown in Figure 5. This computer cluster has five INTEL Xeon E5-2630v3 CPUs and five NVIDIA Tesla K40c GPUs.
The Xeon E5-2630v3 CPU is a high-end server-level microprocessor. It is a Haswell-EP architecture CPU built on a 22 nm manufacturing process. It has 8 CPU cores and supports hyper-threading technology with up to 16 parallel threads. The base frequency of the CPU cores is 2.4 GHz, and the turbo frequency is 3.2 GHz. The level 1 cache consists of 8 × 32 KB 8-way set associative instruction and data caches. The level 2 cache consists of 8 × 256 KB 8-way set associative caches. The level 3 cache is a 20 MB shared cache. It supports many features such as MMX instructions, SSE (streaming SIMD extensions), AVX (advanced vector extensions), TBT 2.0 (turbo boost technology 2.0), etc. The core voltage is 0.65–1.3 V. The maximum operating temperature is 72 °C. The minimum power dissipation is 32 W in the C1E state and 12 W in the C6 state.
The Tesla K40c GPU is a high-end professional graphics card. Built on the 28 nm process and based on the GK110B graphics processor, the card supports DirectX 12.0. The GK110B graphics processor is a large chip with a die area of 561 mm² and 7.08 billion transistors. It features 2880 shading units, 240 texture mapping units, and 48 ROPs. NVIDIA has placed 12,288 MB of GDDR5 memory on the card, connected via a 384-bit memory interface. The GPU operates at a frequency of 745 MHz, and the memory runs at 1502 MHz. Being a dual-slot card, the NVIDIA Tesla K40c draws power from 1 × 6-pin + 1 × 8-pin power connectors, with a power draw rated at 245 W maximum. The Tesla K40c is connected to the rest of the system using a PCIe 3.0 × 16 interface. The card measures 267 mm in length and features a dual-slot cooling solution.
2.5. Software Adopted in This Study
The software is developed based on the Microsoft Windows 7 64-bit operating system. The software ecosystem applied in this study is composed of MPICH2, Microsoft VC++2010 with OpenMP, and NVIDIA CUDA 6.5.
MPICH is a high-performance and widely portable implementation of the Message Passing Interface (MPI) standard. The goals of MPICH are: (1) to provide an MPI implementation that efficiently supports different computation and communication platforms, including commodity clusters, high-speed networks, and proprietary high-end computing systems, and (2) to enable cutting-edge research in MPI through an easy-to-extend modular framework for derived implementations. MPICH is distributed as source code, and it has been tested on several platforms, including Linux (on IA32 and x86-64), Mac OS/X (PowerPC and Intel), Solaris (32- and 64-bit), and Windows.
The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. The OpenMP API defines a portable, scalable model with a simple and flexible interface for developing parallel applications on platforms from the desktop to the supercomputer. In this research, we adopted the OpenMP in Visual C++2010 implementation to develop parallel codes. The OpenMP C and C++ application program interface lets us write applications that effectively use multiple processors.
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the GPU. With millions of CUDA-enabled GPUs sold to date, software developers, scientists, and researchers are using GPU-accelerated computing for broad-ranging applications. The NVIDIA CUDA Toolkit adopted in this study provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of applications.
2.6. Hydro-Meteorological Data Utilized in This Study
The study area of this research is the Ba River basin. It originates from the north slope of the Qinling Mountains, China. The full length of the Ba River is 92.6 km. The elevation difference from the headwater to the outlet of the river is 1142 m. The total slope is 12.8%. The catchment area is 2577 km². The Ba River basin is an asymmetric watershed: the left bank tributaries are sparse and long, while the right bank tributaries are dense and short. The Ba River is a mountainous river, and its discharge hydrograph rises and falls steeply. The peak flow usually occurs in the summer season, and the dry season is winter. The average annual precipitation of the study area is 630.9 mm. The average annual evaporation is 949.7 mm. The average annual runoff is 493.1 million m³. There are ten rainfall gauges located in this area, and the outlet station is the Maduwang station. Observed daily rainfall, evaporation, and daily average discharges from 2000 to 2010 were utilized as the calibration data set. The map of the Maduwang catchment is demonstrated in Figure 6.
In this study, we could only obtain daily data for 11 years to carry out the model calibration. A wider time window should be applied in future studies to ensure that the model and methods perform equally well across several climatic fluctuations, correlations, and combinations of input parameters. This is mainly due to the large variability and correlation of hydrological-cycle processes, whose complexity can only be represented by a temporal window of at least 30 years (corresponding to the climatic scale). This is a known issue in hydrological-cycle processes and corresponds to the so-called Hurst behavior, long-term persistence, or long-range dependence [27].