Article

Groundwater Depth Prediction Using Data-Driven Models with the Assistance of Gamma Test

1
State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, China Institute of Water Resources and Hydropower Research, Beijing 100038, China
2
State Key Laboratory of Hydrology-Water Resources and Hydraulic Engineering, Hohai University, Nanjing 210098, China
3
Bureau of Water Resources Survey of Hebei, Shijiazhuang 050031, China
4
Institute of Wetland Research, Chinese Academy of Forestry, Beijing 100091, China
5
Department of Civil Engineering, Faculty of Engineering, University of Malaya, Kuala Lumpur 50603, Malaysia
*
Author to whom correspondence should be addressed.
Sustainability 2016, 8(11), 1076; https://doi.org/10.3390/su8111076
Submission received: 26 August 2016 / Revised: 10 October 2016 / Accepted: 18 October 2016 / Published: 25 October 2016
(This article belongs to the Section Environmental Sustainability and Applications)

Abstract

Prediction of the groundwater dynamics via models can help better manage the groundwater resources and guarantee their sustainable use. Three types of data-driven models are built for groundwater depth prediction in the plain of Shijiazhuang, the capital of Hebei Province in North China. The data-driven models include the Power Function Model (PFM), Back-Propagation Artificial Neural Network (BPANN) and Support Vector Machines (SVM) with two kernel functions of linear kernel function (LKF) and radial basis function (RBF). Five classes of factors (including 12 indices) are considered as potential model input variables. The Gamma Test (GT) is adopted in this study to help identify the relative importance of the input indices and tackle the tricky issue of the optimal input combinations for the data-driven models. The established models are evaluated in both fitting and testing procedures based on the root mean squared error (RMSE) and Nash-Sutcliffe efficiency (E) for different input combination schemes. The results show that SVM (RBF) performs the best. It is interesting to find that the natural factors (i.e., precipitation and evaporation) are less relevant to the groundwater depth variations. The methods used in this study have much significance for groundwater depth prediction in areas lacking hydrogeological data.

1. Introduction

Groundwater is an important part of the water resources available for domestic, agricultural, industrial, and environmental uses because of its generally good quality, widespread availability, stable supply and low susceptibility to pollution [1]. Groundwater resources are especially important in arid and semi-arid regions, where they can offset water shortages caused by the uneven distribution of surface water in time and space. However, the expansion of arable land, urban growth and population growth have caused a dramatic increase in water consumption, which has led to a decline of the groundwater level in many countries all over the world [2,3].
With the rapid economic development in China in the past several decades, the demand for water has increased at a fast speed. This has led to an excessive use of water resources, especially groundwater. At present, there are many groundwater overdraft areas in China, among which Hebei Province in North China plain is one of the most serious overdraft areas where the groundwater level has decreased dramatically [4,5]. Much attention has been paid by both researchers and policy makers in this area. There have been studies focusing on both the reasons for and the negative effects of the continuous decline of the groundwater level in Hebei province and the North China plain [6,7].
Groundwater dynamic forecasting provides important information for groundwater management. Physical descriptive models and data-driven models are two classes of dynamic prediction models [8]. Detailed hydrogeological data in space and time, which are usually obtained by a large number of field experiments, are required to construct the physical models in practical application. So, a data-driven model is a good choice when data are limited and the groundwater system is complex. Power Function Model (PFM), Back-Propagation Artificial Neural Network (BPANN) and Support Vector Machines (SVM) are popular data-driven models used for forecasting the groundwater level [9,10,11]. Among these methods, PFM is a simple non-linear regression (NLR) model, while the other two models are more complex. Some studies indicate that the BPANN model and SVM model perform better, especially the SVM model [12], but the prediction abilities of these two models are not always good for all cases due to the complexities of hydrogeological conditions and the groundwater flow in different regions [13].
There are many causes for the decrease of the groundwater level, including natural factors, anthropic factors, biological factors and economic factors, which can all be treated as the inputs of the data-driven models. To construct a reliable data-driven model for groundwater prediction, some challenging questions need to be answered beforehand. For example, which inputs are relevant to the groundwater level and which are irrelevant? To what extent do the inputs determine the output from a smooth model? How many data points are required to make a prediction with the best possible accuracy? So far, these questions have not been addressed adequately in groundwater level prediction using data-driven models. However, it is possible that significant progress can be made to tackle these questions, because the Gamma Test (GT) has been presented as an efficient algorithm. By estimating the variance of the noise in the raw data, GT can directly estimate the best mean squared error for a given selection of input by a smooth model on unseen data, before the model construction. In this way, it can help find the best input data combination and identify the sufficient number of data points used for model training. A formal proof of the GT can be found in Evans and Jones [14]. Then, a series of studies have indicated that GT can greatly reduce the model construction work and can potentially be used in the area of hydrological prediction and water management. Moghaddamnia and colleagues used GT to select the input data for Artificial Neural Networks (ANN) and the adaptive neuro-fuzzy inference system for evaporation estimation [15]. Noori and colleagues assessed the SVM model performance for monthly stream flow prediction with the assistance of GT [16]. Han and colleagues explored the model structure for index flood regionalization using GT [17]. However, with respect to groundwater prediction, little work has been done using this efficient tool.
In this study, GT is used to rank the relative importance of the model inputs and find the best input combinations before the data-driven models are calibrated. Three data-driven models, i.e., PFM, BPANN and SVM, are used to predict the variations of the groundwater depth and the performances of the three established models are further compared. Twelve indices, including natural, anthropic, biological, economic and social factors which may influence the groundwater depth, are considered as the input of the data-driven models. The study is carried out in the plain of Shijiazhuang, the capital city of Hebei province in North China. During the past three decades, the groundwater level in this region has declined by 30 m, which has not only restricted the development of the economy but has also caused serious environmental problems [18]. It is hoped that the study can provide a novel and complementary methodology for the prediction of groundwater level dynamics in the study area.

2. Data and Methodology

2.1. Study Area

Shijiazhuang plain (from lat 37°33′N to lat 38°42′N and from long 113°18′E to long 114°41′E) is located in central and eastern Hebei province and belongs to the Bohai Sea Economic Zone. The total area of Shijiazhuang plain is 8157 km². The average annual precipitation is about 490 mm, and its distribution is quite uneven in space and time. Over 70% of the precipitation occurs in the summer months from June to August. The average air temperature ranges from −0.8 °C in winter to 25.9 °C in summer. The plain comprises Shijiazhuang city and 13 counties: Xingtang, Luquan, Yuanshi, Gaoyi, Xinle, Zhengding, Luancheng, Zhaoxian, Wuji, Gaocheng, Jinzhou, Xinji and Shenze. Because the region is semi-arid, water resources are critical to economic development, and groundwater is the main source of water supply. The location of the study area, together with its monitoring wells and meteorological stations, is shown in Figure 1. The two aquifers in Shijiazhuang plain, the Holocene (Q4) and the Upper Pleistocene (Q3), are composed of Quaternary unconsolidated sediments, including sandy loam, sand and gravel-cobble. The west of Xingtang, Luquan and Yuanshi belongs to Q4, and the other parts of the study area belong to Q3. The burial depth of the aquifer floor is 7–20 m for Q4 and 40–100 m for Q3. The specific capacity is 20–30 m³/(h·m) for Q4 and about 35 m³/(h·m) for Q3. In both aquifers the groundwater mineralization is lower than 0.5 g/L. All 14 wells are used to monitor the shallow groundwater.

2.2. Input Indices and Data Source

Natural, anthropic, biological, economic and social factors that are closely related to the decline of the groundwater level are chosen as the key inputs of the data-driven models. They comprise 12 indices: precipitation, evaporation, groundwater exploitation, industrial groundwater use, agricultural irrigation groundwater use, crop yield, population, urbanization rate, primary industry output, secondary industry output, tertiary industry output and gross domestic product (GDP). Precipitation and evaporation are the main natural factors influencing groundwater recharge and loss. Groundwater exploitation is the quantity of groundwater abstracted and used; it is the main anthropic factor and the major reason for the groundwater decline in the study area in recent decades [19], while industrial use and agricultural irrigation are the two major drivers of groundwater exploitation. Crop yield is the main biological factor, reflecting crop water consumption and especially groundwater consumption in the study area [20]. GDP is a comprehensive index indicating the regional economic strength and the pressure on groundwater use, while primary, secondary and tertiary industry outputs reflect different aspects of economic development. Primary industry includes agriculture, forestry, animal husbandry and fishery. Secondary industry includes mining, manufacturing and construction. Tertiary industry refers to all other economic activities not included in the primary or secondary industry. Population and urbanization rate are two important factors of social development. Population growth and a rising urbanization rate also increase water use, especially groundwater use, in this region. Table 1 lists the serial numbers of the 12 indices, and their mutual relations are illustrated in Figure 2.
Groundwater depth is the only index used to represent the change of groundwater resources in this study. The greater the groundwater depth, the scarcer the groundwater resources. As shown in Figure 1, 14 monitoring wells covering the whole study area are chosen to provide observational data on the annual groundwater depth from 1984 to 2013. The continuous decline of the groundwater level, i.e., the continuous increase of groundwater depth, can be easily observed over the 30-year period (Figure 3), and the seasonal variation of groundwater depth is obvious. The lowest groundwater level appears in summer, while the highest groundwater level is found in winter. Within a year, the groundwater depth is closely related to groundwater exploitation in the different seasons. In order to evaluate the groundwater of the whole study area, the average groundwater depth of the 14 monitoring wells in each year is used by the models. Observed precipitation and evaporation at 13 meteorological stations are also obtained for the same period from the National Meteorological Information Center of the China Meteorological Administration (available at http://www.nmic.gov.cn/). The average precipitation and evaporation from the 13 meteorological stations in each year are used as the model inputs. Thirty-year data of the other 10 input indices, including annual groundwater exploitation, industrial groundwater use, agricultural irrigation groundwater use, crop yield, population, urbanization rate, primary industry output, secondary industry output, tertiary industry output and GDP for the whole study area, are also obtained for the period from 1984 to 2013. For each of the 10 input indices, the annual value is the sum of the corresponding values of Shijiazhuang city and the 13 counties. All the data of the 10 input indices come from the Shijiazhuang Statistical Yearbook [21].

2.3. Methodology

2.3.1. Gamma Test

The GT estimates the mean square error (MSE) that can be achieved when modelling the unseen data using any continuous nonlinear model. A formal proof of GT is given by Evans and Jones [14]. The basic idea is quite distinct from the earlier attempts with nonlinear analysis. Suppose we have a set of data observations in the form:
$$\{(x_i, y_i),\ 1 \le i \le M\} \tag{1}$$
where $x_i \in \mathbb{R}^m$ are input vectors confined to some closed bounded set $C \subset \mathbb{R}^m$, and $y_i \in \mathbb{R}$ are the corresponding outputs. The system underlying the GT can be expressed in the following form:
$$y = f(x_1, \ldots, x_m) + r \tag{2}$$
where f is a smooth function and r is a random variable representing noise. Generally, the mean of the distribution of r is assumed to be 0 and the variance of the noise, Var(r), is bounded. The Gamma statistic Γ is the main parameter; it estimates Var(r), the part of the output variance that cannot be accounted for by any smooth model.
For each vector $x_i$ ($1 \le i \le M$), let $x_{N[i,k]}$ denote its $k$th nearest neighbor ($1 \le k \le p$). The distance statistic of the input data is then calculated as:
$$\delta_M(k) = \frac{1}{M}\sum_{i=1}^{M}\left|x_{N[i,k]} - x_i\right|^2 \quad (1 \le k \le p) \tag{3}$$
where |…| denotes Euclidean distance, and the corresponding Gamma function of the output values:
$$\gamma_M(k) = \frac{1}{2M}\sum_{i=1}^{M}\left|y_{N[i,k]} - y_i\right|^2 \quad (1 \le k \le p) \tag{4}$$
where $y_{N[i,k]}$ is the output value corresponding to the $k$th nearest neighbor of $x_i$ in Equation (3). In order to compute Γ, a least-squares regression line is fitted to the $p$ points $(\delta_M(k), \gamma_M(k))$:
$$\gamma = A\delta + \Gamma \tag{5}$$
The value of Γ is the intercept of Equation (5), i.e., the value of γ at δ = 0, and A is the gradient. Usually, the data tested by GT are normalized to a range of 0–1. The intercept is a meaningful noise estimate because:
$$\gamma_M(k) \rightarrow \mathrm{Var}(r) \ \text{in probability as} \ \delta_M(k) \rightarrow 0 \tag{6}$$
A more detailed description of calculation principles can be found in Evans and Jones [14].
The $V_{ratio}$ is another term, which returns a scale-invariant noise estimate:
$$V_{ratio} = \frac{\Gamma}{\sigma^2(y)} \tag{7}$$
where $\sigma^2(y)$ is the variance of the output y. By this definition, a $V_{ratio}$ close to 0 indicates a high degree of predictability of the given output y. In addition, the estimate of the noise variance is more credible if its standard error (SE) is close to 0, and the complexity of the smooth function can be measured by the gradient A of the regression line in Equation (5).
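The computation described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `gamma_test`, the default of p = 10 neighbors and the brute-force neighbor search are all choices made here for clarity.

```python
import math
import random

def gamma_test(xs, ys, p=10):
    """Return (Gamma, gradient A, V_ratio) for input tuples xs and outputs ys."""
    m = len(xs)
    p = min(p, m - 1)
    delta = [0.0] * p   # delta_M(k), mean squared distance to the k-th neighbor
    gamma = [0.0] * p   # gamma_M(k), half the mean squared output difference
    for i in range(m):
        # Squared Euclidean distances from x_i to every other point, ascending.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])), j)
            for j in range(m) if j != i
        )
        for k in range(p):
            d2, j = dists[k]
            delta[k] += d2 / m
            gamma[k] += (ys[j] - ys[i]) ** 2 / (2 * m)
    # Least-squares line gamma = A * delta + Gamma through the p points.
    mean_d = sum(delta) / p
    mean_g = sum(gamma) / p
    sxx = sum((d - mean_d) ** 2 for d in delta)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(delta, gamma))
    slope = sxy / sxx
    intercept = mean_g - slope * mean_d
    mean_y = sum(ys) / m
    var_y = sum((y - mean_y) ** 2 for y in ys) / m
    return intercept, slope, intercept / var_y

# Synthetic check: a smooth function plus Gaussian noise with variance 0.01.
rng = random.Random(0)
xs_demo = [(rng.random(),) for _ in range(300)]
ys_demo = [math.sin(2 * math.pi * x[0]) + rng.gauss(0.0, 0.1) for x in xs_demo]
print(gamma_test(xs_demo, ys_demo))
```

On such synthetic data the returned Γ should lie near the true noise variance, which is the property that makes the intercept usable as a lower bound on the achievable MSE before any model is trained.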

2.3.2. Power Function Model (PFM)

The PFM is a kind of non-linear regression model, which is often used for prediction by establishing the relationship between the forecast factors and the forecast objects. The general equation is as follows:
$$y = p_0\, x_1^{p_1} x_2^{p_2} \cdots x_M^{p_M} \tag{8}$$
where y is the forecast object, $p_k$ (k = 0,…,M) are the parameters, generally estimated by least squares, and $x_k$ (k = 1,…,M) are the explanatory variables (forecast factors), each with its own exponent.
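As an illustration of the model form, the sketch below evaluates a power function model and its log-linear equivalent, ln y = ln p0 + Σ pk ln xk, which is one common way to make ordinary least squares applicable. The function names are hypothetical, and the paper does not state whether SPSS fitted the model in log space or by direct nonlinear regression.

```python
import math

def pfm_predict(p0, exponents, x):
    """Evaluate y = p0 * x_1**p_1 * ... * x_M**p_M for one sample x."""
    y = p0
    for xk, pk in zip(x, exponents):
        y *= xk ** pk
    return y

def pfm_log_linear(p0, exponents, x):
    """Same model in log space; requires p0 > 0 and strictly positive inputs."""
    return math.log(p0) + sum(pk * math.log(xk) for xk, pk in zip(x, exponents))

# Toy parameters, not values from the study.
print(pfm_predict(2.0, [1.0, 0.5], [3.0, 4.0]))
print(math.exp(pfm_log_linear(2.0, [1.0, 0.5], [3.0, 4.0])))
```

Both calls describe the same model, so they should agree up to floating-point rounding.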

2.3.3. Back-Propagation Artificial Neural Network (BPANN)

The ANN is a nonlinear computing system with densely interconnected processing elements, or neurons. Its mathematical structure contains three kinds of layers: input, hidden and output layers. Each layer has its own nodes and activation functions. The back-propagation algorithm can train the network effectively and shorten the learning time, and it needs little information about the complex mechanisms and processes that would otherwise have to be described explicitly in mathematical form. Consider the jth neuron in the present layer, which receives the inputs x1, x2, …, xn from the n neurons in the previous layer through the connection weights w1j, w2j, …, wnj, respectively. Its output is given by:
$$y_j = f\left(\sum_{i=1}^{n} w_{ij} x_i + b_j\right) \tag{9}$$
where yj is the output of the neuron, bj is the threshold of the neuron, and f is the transfer function.
In this study, an ANN with one hidden layer is built and trained by the back-propagation algorithm. The hidden layer uses a log-sigmoid transfer function and the output layer a linear one. This configuration is the most commonly used for ANNs and can improve extrapolation ability. The input layer, hidden layer and output layer have five nodes, 30 nodes and one node, respectively (Figure 4). A momentum term is also added to the weight-updating process to keep the training from being trapped in a local minimum. The number of hidden layers and nodes is discussed in Section 3.4.
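The forward pass of such a 5-30-1 network can be sketched as follows. This is only an illustration of the neuron equation applied layer by layer: the weights are random placeholders rather than trained values, the function names are hypothetical, and the back-propagation training step with momentum is not shown.

```python
import math
import random

def logsig(z):
    """Log-sigmoid transfer function used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with log-sigmoid units, followed by a linear output node."""
    hidden = [
        logsig(sum(w * xi for w, xi in zip(w_row, x)) + b)
        for w_row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

# Placeholder weights for a 5-30-1 architecture (untrained).
rng = random.Random(0)
n_in, n_hidden = 5, 30
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = rng.uniform(-1, 1)

sample = [0.2, 0.4, 0.6, 0.8, 1.0]  # five inputs normalized to 0-1
print(forward(sample, w_hidden, b_hidden, w_out, b_out))
```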

2.3.4. Support Vector Machines (SVMs)

The SVM model is based on the Vapnik-Chervonenkis (VC) dimension and the structural risk minimization principle [22]. The SVM provides an approach to solving nonlinear, high-dimensional problems with a small sample set. The basic idea of SVM is to search for the nonlinear relationship between input and output vectors through a nonlinear transformation of the input vector into a high-dimensional feature space. Given a set of N samples $\{x_k, y_k\}_{k=1}^{N}$ (where $x_k$ is the input vector and $y_k$ is the corresponding output value), the regression function of SVM can be expressed as:
$$y = f(x) = w^{T}\varphi(x) + b \tag{10}$$
where w is a weight vector, φ is a nonlinear mapping that transforms the nonlinear relationship between input and output vectors into a linear one in the feature space, and b is a bias. Vapnik formulated a convex quadratic optimization problem to ensure that the extreme solution is optimal [22], with an ε-insensitive loss function added to Equation (10):
$$\min\ \phi(w) = \frac{1}{2}\|w\|^2 + C\sum_{k=1}^{N}(\xi_k + \xi_k^*) \quad \text{subject to} \quad \begin{cases} y_k - w^{T}\varphi(x_k) - b \le \varepsilon + \xi_k \\ w^{T}\varphi(x_k) + b - y_k \le \varepsilon + \xi_k^* \\ \xi_k,\ \xi_k^* \ge 0 \end{cases} \quad k = 1, 2, \ldots, N \tag{11}$$
where $\xi_k$ and $\xi_k^*$ are slack variables that penalize training errors exceeding the error tolerance ε, and C denotes the degree of penalization for samples outside the error tolerance ε. The training input vectors that lie on or outside the ε-tolerance band are called support vectors. The dual Lagrangian form is:
$$W(\alpha, \alpha^*) = -\frac{1}{2}\sum_{i,j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) + \sum_{i=1}^{N} y_i(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) \tag{12}$$
which is maximized with the constraints,
$$\begin{cases}\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0 \\ 0 \le \alpha_i,\ \alpha_i^* \le C\end{cases} \quad (i = 1, 2, \ldots, N) \tag{13}$$
where α and α* are Lagrange multipliers. The optimal weight vector of the regression hyperplane is a combination of the mapped training vectors:
$$w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\,\varphi(x_i) \tag{14}$$
so that, with the kernel function $K(x, x_i) = \varphi(x)^{T}\varphi(x_i)$, Equation (10) can be expressed as:
$$y = f(x) = w^{T}\varphi(x) + b = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)K(x, x_i) + b \tag{15}$$
Linear kernel function (LKF), polynomial kernel function (PKF) and radial basis function (RBF) are commonly used as kernel functions. Among these three, PKF has more hyper-parameters than LKF and RBF, so LKF and RBF are chosen in this study [23,24]. The parameters of RBF are discussed in Section 3.4. LKF, PKF and RBF are described by Equations (16)–(18), respectively:
$$K(x, x_i) = x \cdot x_i \tag{16}$$
$$K(x, x_i) = (1 + x \cdot x_i)^d \quad (d = 1, 2, \ldots, n) \tag{17}$$
$$K(x, x_i) = \exp\left(-\gamma\,\|x - x_i\|^2\right) \tag{18}$$
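The three kernels and the kernel form of the regression function can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the RBF kernel is assumed to use the squared Euclidean distance (its usual definition), and the support vectors and coefficients in the example are made-up numbers standing in for the values (αi − αi*) that a trained model would supply.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(x, xi):
    return dot(x, xi)

def poly_kernel(x, xi, d=2):
    return (1.0 + dot(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.3):
    # Squared Euclidean distance; gamma = 0.3 mirrors the value reported for Scheme 19.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, coeffs, b, kernel):
    """f(x) = sum_i coeff_i * K(x, x_i) + b, with coeff_i = alpha_i - alpha_i*."""
    return sum(c * kernel(x, sv) for sv, c in zip(support_vectors, coeffs)) + b

# Made-up support vectors and coefficients, for illustration only.
svs = [(3.0, 4.0), (0.0, 1.0)]
coeffs = [0.5, -1.0]
print(svr_predict((1.0, 2.0), svs, coeffs, 0.25, linear_kernel))
print(svr_predict((1.0, 2.0), svs, coeffs, 0.25, rbf_kernel))
```

Passing the kernel as an argument makes it easy to compare LKF and RBF on the same coefficients without changing the prediction code.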

2.3.5. Implementation and Assessment of Three Models

In this study, we define y as the groundwater depth which is an index of the groundwater dynamic variations and the forecast object, and xk as the input factors which may influence the groundwater in the models of PFM, BPANN and SVM. PFM is constructed by the Statistical Product and Service Solutions (SPSS) software in version 19.0. BPANN and SVM are both implemented by MATLAB toolbox.
Twenty groups of data from 1984 to 2003 are used to build and train the models, while 10 groups of data from 2004 to 2013 are used for testing. All the data are normalized to a range of 0–1 to avoid disturbance of dimensions. The root mean squared error (RMSE) and Nash-Sutcliffe efficiency represented by E are used to assess the accuracy of the three built models based on the observed values and the simulated values:
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2} \tag{19}$$
$$E = 1 - \sum_{i=1}^{N}(y_i - \hat{y}_i)^2 \Big/ \sum_{i=1}^{N}(y_i - \bar{y})^2 \tag{20}$$
where $y_i$ and $\hat{y}_i$ are the observed and simulated values of groundwater depth, respectively, $\bar{y}$ is the mean observed value, i = 1,2,…,N, and N is the number of data groups. The perfect scores of RMSE and E are 0 and 1, respectively.
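Both scores are straightforward to compute. The following sketch uses hypothetical function names and toy numbers rather than the study's data; it also shows why E can become negative, as it does for some models in the testing period: a simulation that is worse than simply predicting the observed mean gives E < 0.

```python
import math

def rmse(obs, sim):
    """Root mean squared error between observed and simulated values."""
    n = len(obs)
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / n)

def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe efficiency E; 1 is perfect, 0 matches the observed mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

obs = [1.0, 2.0, 3.0]
print(rmse(obs, obs), nash_sutcliffe(obs, obs))        # perfect simulation
print(nash_sutcliffe(obs, [2.0, 2.0, 2.0]))            # predicting the mean
print(nash_sutcliffe(obs, [3.0, 2.0, 1.0]))            # worse than the mean
```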

3. Results

3.1. Correlation Test for 12 Inputs

Pearson correlation coefficients (ρ) are used to analyze the correlations among the 12 inputs. A value of 1 means a perfect positive correlation, while a value of −1 means a perfect negative correlation. The correlation coefficients of the 12 inputs are shown in Table 2. Most correlation coefficients are lower than 0.5, so most inputs are only weakly correlated. However, the correlation coefficient between GDP and secondary industry output reaches 0.799, which indicates that secondary industry output is the main driver of the economy in Shijiazhuang plain. Tertiary industry output also has a high correlation with population, which suggests that tertiary industry is greatly influenced by population in the study area. Although these two pairs of inputs have relatively high correlation coefficients, overall the correlations among the 12 inputs have little effect on the feature selection results.
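For reference, the Pearson coefficient between two input series can be computed as in the sketch below. The function name and the toy series are illustrative assumptions; the values in Table 2 come from the study's own 30-year data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy series: one perfectly increasing pair and one perfectly decreasing pair.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```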

3.2. Relative Importance of the Model Inputs

Given only the input and output data, the GT can estimate the lowest model error that any smooth model can achieve. In this study, the Gamma value Γ is used as the main criterion for distinguishing the importance of model inputs, and the other three quantities produced by GT (i.e., gradient, standard error of Γ and Vratio) are used as a reference. This is because some technical aspects of GT, such as how best to use these three quantities, have not yet been explained in detail. Basically, the smallest |Γ| represents the best model input combination. The gradient indicates the model complexity: the lower the gradient, the simpler the model that should be fitted. The SE of Γ shows the reliability of the Gamma value: the higher it is, the less reliable the Gamma value. Vratio illustrates the predictability of the output using the effective inputs [15]. A mathematical model (e.g., PFM, BPANN and SVM in this study) can be built with high quality if the three quantities (gradient, SE of Γ and Vratio) show low values for the best input data [17]. Table 3 shows the GT results for schemes with different combinations of the model input indices listed in Table 1. Scheme 1 involves all 12 input indices, and each of the other schemes has 11 indices with the remaining one masked. It is interesting to find that only Scheme 5 (without index 4, i.e., industrial groundwater use) and Scheme 13 (without index 12, i.e., urbanization rate) have larger |Γ| values than Scheme 1. The other schemes all have smaller |Γ| values, which indicates that they will perform better than Scheme 1 in constructing a reliable model. It can be seen from Table 3 that the best combination of indices is Scheme 3, which has the smallest |Γ| (0.00170), and the worst combination is Scheme 13, which has the largest |Γ| (0.01372).
Ranking the schemes from best to worst gives 3 > 2 > 11 > 12 > 7 > 9 > 8 > 6 > 10 > 4 > 1 > 5 > 13, and the index missing from each scheme yields a ranking of the importance of the model input indices, from least to most important: 2 < 1 < 10 < 11 < 6 < 8 < 7 < 5 < 9 < 3 < 4 < 12. This indicates that natural factors (precipitation and evaporation) have less effect on the groundwater level, while anthropic, biological, economic and social factors have greater influence. Urbanization rate and industrial groundwater use are the two most important model input indices, which implies that the process of modernization and industrial development in Shijiazhuang plain is the main cause of groundwater recession.
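The step from the scheme ranking to the index ranking can be reproduced mechanically. The sketch below assumes that Scheme s (for s ≥ 2) masks index s − 1, which is consistent with the two examples given in the text (Scheme 5 without index 4, Scheme 13 without index 12) but is not stated there for every scheme.

```python
# Schemes ordered from best (smallest |Gamma|) to worst, as reported in the text.
scheme_rank = [3, 2, 11, 12, 7, 9, 8, 6, 10, 4, 1, 5, 13]

# Scheme 1 uses all 12 indices and masks nothing; Scheme s >= 2 is assumed to mask index s - 1.
# Removing an unimportant index helps most, so the best scheme masks the least important index.
index_rank = [s - 1 for s in scheme_rank if s != 1]

print(index_rank)  # least important index first, most important last
```

Schemes ranked after Scheme 1 are those whose masked index made the result worse than using all inputs, which is why indices 4 and 12 end up as the two most important.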

3.3. Model Input Selection Based on Gamma Test

In order to find the best combination of model inputs, backward and forward selection search methods are coupled with the GT [25]. The backward procedure starts with all indices and removes one index at a time, while the forward procedure starts with the minimum number of indices and adds one index at a time. In forward selection, the index that works best with the indices from the last round (i.e., results in the smallest |Γ| value) is added; in backward selection, the index whose absence leads to the best result is removed. The procedure is iterated until one index is left for backward selection or until all indices are included for forward selection. Table 4 and Table 5 show the model input selection results of the backward and forward selections coupled with the GT, respectively. It can be seen from the two tables that Scheme 19 has relatively few indices and the smallest value of |Γ| (0.00006). Theoretically, for these two reasons, Scheme 19 is the best model input combination, and it is used to construct the three models in this study. However, GT is still a method that needs to be verified by more experiments. In order to make our conclusions more reliable, all the schemes in Table 4 and Table 5 are calculated as part of the sensitivity analysis of the input choice in Section 3.7.
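The two greedy searches can be sketched generically. In the code below, `score` is a placeholder for whatever criterion is minimized (|Γ| from the Gamma Test in this study); the function names and the toy scoring function are assumptions made for illustration, and ties are broken by taking the first candidate.

```python
def forward_selection(indices, score):
    """Add, one at a time, the index that gives the lowest score with those already chosen."""
    selected, remaining, path = [], list(indices), []
    while remaining:
        best = min(remaining, key=lambda c: score(selected + [c]))
        selected = selected + [best]
        remaining.remove(best)
        path.append((list(selected), score(selected)))
    return path

def backward_selection(indices, score):
    """Remove, one at a time, the index whose absence gives the lowest score."""
    selected = list(indices)
    path = [(list(selected), score(selected))]
    while len(selected) > 1:
        drop = min(selected, key=lambda c: score([i for i in selected if i != c]))
        selected = [i for i in selected if i != drop]
        path.append((list(selected), score(selected)))
    return path

# Toy criterion standing in for |Gamma|: distance of the subset sum from 3.
def toy_score(subset):
    return abs(sum(subset) - 3)

print(forward_selection([1, 2, 3], toy_score))
print(backward_selection([1, 2, 3], toy_score))
```

Each search returns the whole path of subsets with their scores, so the subset with the smallest score and few indices can be picked afterwards, in the same way Scheme 19 is picked from Table 4 and Table 5.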

3.4. Influence of the Parameters in BPANN and SVM (RBF)

Possible data overfitting is a disadvantage of BPANN and SVM (RBF). In order to avoid this problem and make the models more accurate, we carefully choose the number of hidden layers and nodes for BPANN and the two parameters (C and γ) for SVM (RBF) by trial and error. The parameter C is a positive trade-off parameter that determines the degree of empirical error, and γ is the main parameter of the RBF kernel function. Numerous trials were conducted.
Taking Scheme 19 as an example, the optimum numbers of hidden layers and nodes per layer are 1 and 30 for BPANN, respectively, while the optimal values of C and γ in this study are chosen to be 1.0 and 0.3 for SVM (RBF), respectively. All these optimal parameters are chosen based on the values of RMSE and E. With the change in number of hidden layers and nodes, different values of RMSE and E can be seen in Figure 5 and Figure 6. The influences of the C and γ on the testing results are shown in Figure 7 and Figure 8.

3.5. Fitting Results and Model Errors

For Scheme 19, the PFM calibrated on the 20 groups of data from 1984 to 2003 can be expressed as follows:
$$\hat{y} = 0.0015\, x_3^{0.2961}\, x_4^{0.2428}\, x_6^{0.1144}\, x_9^{0.0009}\, x_{12}^{0.4832} \tag{21}$$
where $x_3$ is groundwater exploitation, $x_4$ is industrial groundwater use, $x_6$ is crop yield, $x_9$ is secondary industry output, $x_{12}$ is urbanization rate, and $\hat{y}$ is the fitted groundwater depth.
The other two models of BPANN and SVM are also trained based on the dataset from 1984 to 2003. The procedure is carried out several times for both models and the optimal results are adopted. The RMSE and E of the three models for fitting results are shown in Table 6.
The RMSEs of PFM and SVM (LKF) are both higher than 1.0, and that of SVM (LKF) is close to 1.5. The RMSE is lower than 0.5 for the SVM (RBF) model, while that of the BPANN model is higher than 0.5 but lower than those of PFM and SVM (LKF). So, the fitted values of SVM (RBF) are the closest to the observed values. The highest value of E, near 1.0, is obtained for SVM (RBF), and the lowest, below 0.85, for SVM (LKF). This indicates that SVM (RBF) has the highest reliability among the four models. Based on the values of RMSE and E, the SVM (RBF) model fits the training samples best, while the SVM (LKF) model performs the worst. However, good fitting results do not guarantee good prediction ability for data-driven models, so the performance of the models needs to be tested with further datasets. The observed and fitted values (Scheme 19) of groundwater depth are shown in Figure 9.

3.6. Testing Results and Model Errors

The three types of data-driven models are tested on the 10 groups of data from 2004 to 2013 for three schemes. The RMSE and E of the three models for the testing results are shown in Table 7. For Scheme 19, only the RMSE of SVM (RBF) is lower than 2.0, and the RMSE of SVM (LKF) is higher than 3.0. The BPANN model has a lower RMSE than the PFM model. The value of E is higher than 0 for SVM (RBF), and the lowest values of E, below −2.0, occur for PFM and SVM (LKF). Overall, the BPANN and SVM (RBF) models perform better than PFM and SVM (LKF), with relatively lower RMSE and higher E, but the stability of the BPANN model is poor. The SVM (RBF) model performs best in the testing process and shows the best generalization ability. The observed and tested values (Scheme 19) of groundwater depth are shown in Figure 10.

3.7. Sensitivity Analysis of the Input Choice

In order to make the conclusion on the input choice more reliable, the same methods are used to calculate the values of RMSE and E for the other schemes in Table 4 and Table 5, and the results are compared with those of Scheme 19. The optimal BPANN and SVM (RBF) models are again chosen by testing the influence of the model parameters. The fitting and testing results and errors of the PFM, BPANN and SVM models are shown in Table 8 and Table 9. Figure 11 and Figure 12 show the changes of RMSE and E with |Γ|. They indicate that Scheme 19 is the best model input combination, with lower RMSE and higher E for the three models, and it also has relatively few indices. Overall, as |Γ| increases, the performance of the three models becomes worse, so GT can be used to confirm the best combination of model inputs. It should be mentioned that Schemes 21, 22, 23, 24 and 26, which have far fewer indices, have much higher RMSE and lower E. So, schemes with far fewer indices but small values of |Γ| should be tested carefully. Additionally, the SVM (RBF) performs best in most schemes.

4. Discussion and Conclusions

Considering limited data and complicated geological conditions in the plain of Shijiazhuang, physical models are not a good choice, and three types of data-driven models are used in this study to predict the groundwater depth in Shijiazhuang plain. Five kinds of factors (natural, anthropic, biological, economic and social) including 12 indices which may influence the groundwater depth are considered, i.e., precipitation, evaporation, groundwater exploitation, industrial groundwater use, agricultural irrigation groundwater use, crop yield, population, urbanization rate, primary industry output, secondary industry output, tertiary industry output and GDP. An efficient tool named GT is adopted in this study to identify the relative importance of the indices and further determine the best model input combination. The results of GT indicate that natural factors (precipitation and evaporation) have less effect on the groundwater level, while anthropic factors, biological factors, economic factors and social factors are the main reasons for the groundwater level decline, especially the anthropic factors which reflect the groundwater exploitation. Evaporation is found to be the least relevant index among the 12 indices, while urbanization rate and industrial groundwater use are shown to have the most significant effect on groundwater variations. This indicates that the relationship between the evaporation and the groundwater becomes weaker with the drop in groundwater level. It also reveals that modernization and industrial development in Shijiazhuang plain are the main reasons for groundwater recession.
RMSE and E are used to assess the accuracy of the data-driven models. The results show that the BPANN and SVM (RBF) models perform better than PFM and SVM (LKF) as a whole in all three schemes, and that SVM (RBF) performs best in both model fitting and model testing, which indicates the best generalization ability. Additionally, all the schemes in Table 4 and Table 5 are used to analyze the sensitivity of the input choice. The results show that GT can be used to confirm the best combination of model inputs, but schemes with far fewer indices should be tested carefully even when their values of |Γ| are small.
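The caution about small |Γ| can be made concrete with a short Python sketch. The numbers are copied from Table 4, Table 5 and Table 9 for a handful of schemes; the helper names and the choice of Scheme 1 (all 12 indices) as the reference are assumptions made for illustration only.

```python
# Each tuple: (scheme ID, number of indices, Gamma, testing E of SVM (RBF)),
# copied from Table 4, Table 5 and Table 9.
schemes = [
    (19, 5, -0.00006, 0.4248),
    (16, 8, -0.00014, 0.4446),
    (3, 11, -0.00170, 0.1353),
    (24, 2, 0.00285, -41.9327),
    (23, 1, 0.00546, -94.0942),
    (1, 12, -0.00922, 0.1212),
]


def rank_by_abs_gamma(rows):
    """Order schemes from smallest to largest |Gamma|."""
    return sorted(rows, key=lambda r: abs(r[2]))


def misleading_schemes(rows, reference_id=1):
    """IDs of schemes whose |Gamma| beats the reference scheme but whose testing E is worse."""
    ref = next(r for r in rows if r[0] == reference_id)
    return [r[0] for r in rows if abs(r[2]) < abs(ref[2]) and r[3] < ref[3]]


ranked_ids = [r[0] for r in rank_by_abs_gamma(schemes)]
print(ranked_ids)
print(misleading_schemes(schemes))
```

In this subset, the one-index and two-index schemes (23 and 24) rank ahead of the full 12-index scheme on |Γ| yet test far worse, which is why |Γ| alone is not treated as sufficient evidence for very small input sets.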
The groundwater level has decreased dramatically and is continuing to do so in Shijiazhuang plain. Countermeasures need to be taken to reduce groundwater usage and guarantee the sustainable utilization of groundwater, by improving the water use efficiency, promoting comprehensive water-saving measures, adjusting the industrial structure and finding alternative water sources, such as water recycling and the South-to-North Water Diversion Project. The results of this study may also provide a methodology and a reference for the prediction of groundwater resources in data-limited areas, especially for the study region and North China in general. However, monthly prediction of the groundwater depth has not been considered in this study due to limited data observations. The advantage of GT can be better exploited when dealing with larger datasets using data-driven models in groundwater prediction.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Grant No. 51409270), partition prediction of groundwater system of Hebei Province, the foundation of China Institute of Water Resources and Hydropower Research (1232), the International Science and Technology Cooperation Program of China (Grant No. 2013DFG70990), and the Open Research Fund Program of State Key Laboratory of Hydrology-Water Resources and Hydraulic Engineering (2014490611).

Author Contributions

All the authors have contributed to the conception and development of this manuscript. Jiyang Tian carried out the analysis and wrote the paper. Chuanzhe Li, Jia Liu, Fuliang Yu and Wan Zurina Wan Jaafar conceived and designed the framework. Shuanghu Cheng and Nana Zhao provided assistance in calculations and figure productions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cao, G.; Zheng, C.; Scanlon, B.R.; Liu, J.; Li, W. Use of flow modeling to assess sustainability of groundwater resources in the North China Plain. Water Resour. Res. 2013, 49, 159–175.
  2. Natkhin, M.; Steidl, J.; Dietrich, O.; Dannowski, R.; Lischeid, G. Differentiating between climate effects and forest growth dynamics effects on decreasing groundwater recharge in a lowland region in Northeast Germany. J. Hydrol. 2012, 448, 245–254.
  3. Goderniaux, P.; Brouyère, S.; Wildemeersch, S.; Therrien, R.; Dassargues, A. Uncertainty of climate change impact on groundwater reserves—Application to a chalk aquifer. J. Hydrol. 2015, 528, 108–121.
  4. Yuan, Z.; Shen, Y. Estimation of agricultural water consumption from meteorological and yield data: A case study of Hebei, North China. PLoS ONE 2013, 8, e58685.
  5. Davidsen, C.; Liu, S.; Mo, X.; Rosbjerg, D.; Bauer-Gottwein, P. The cost of ending groundwater overdraft on the North China Plain. Hydrol. Earth Syst. Sci. 2015, 12, 5931–5966.
  6. Kendy, E.; Gérard-Marchant, P.; Walter, M.T.; Zhang, Y.; Liu, C.; Tammo, S.S. A soil-water-balance approach to quantify groundwater recharge from irrigated cropland in the North China Plain. Hydrol. Process. 2003, 17, 2011–2031.
  7. Hu, Y.; Moiwo, J.P.; Yang, Y.; Han, S.; Yang, Y. Agricultural water-saving and sustainable groundwater management in Shijiazhuang Irrigation District, North China Plain. J. Hydrol. 2010, 393, 219–232.
  8. Knotters, M.; Bierkens, M.F.P. Physical basis of time series models for water table depths. Water Resour. Res. 2000, 36, 181–188.
  9. Asefa, T.; Kemblowski, M.; Urroz, G.; Mckee, M. Support vector machines (SVMs) for monitoring network design. Groundwater 2005, 43, 413–422.
  10. Xu, T.F.; Valocchi, A.J.; Choi, J.; Amir, E. Use of machine learning methods to reduce predictive error of groundwater models. Groundwater 2013, 52, 448–460.
  11. Yang, Z.P.; Lu, W.X.; Long, Y.Q.; Li, P. Application and comparison of two prediction models for groundwater levels: A case study in Western Jilin Province, China. J. Arid Environ. 2009, 73, 487–492.
  12. Shirmohammadi, B.; Vafakhah, M.; Moosavi, V.; Moghaddamnia, A. Application of several data-driven techniques for predicting groundwater level. Water Resour. Manag. 2013, 27, 419–432.
  13. Ping, J.; Qiang, Y.; Xi, M. A combination model of chaos, wavelet and support vector machine predicting groundwater levels and its evaluation using three comprehensive quantifying techniques. Inf. Technol. J. 2013, 12, 3158–3163.
  14. Evans, D.; Jones, A.J. A proof of the Gamma test. Proc. R. Soc. Lond. A 2002, 458, 2759–2799.
  15. Moghaddamnia, A.; Gousheh, M.G.; Piri, J.; Amin, S.; Han, D. Evaporation estimation using artificial neural networks and adaptive neuro-fuzzy inference system techniques. Adv. Water Resour. 2009, 32, 88–97.
  16. Noori, R.; Karbassi, A.R.; Moghaddamnia, A.; Han, D.; Zokaei-Ashtiani, M.H.; Farokhnia, A.; Gousheh, M.G. Assessment of input variables determination on the SVM model performance using PCA, Gamma test, and forward selection techniques for monthly stream flow prediction. J. Hydrol. 2011, 401, 177–189.
  17. Han, D.; Wan Jaafar, W.Z. Model structure exploration for index flood regionalization. Hydrol. Process. 2013, 27, 2903–2917.
  18. Lu, X.; Jin, M.; Martinus, T.V.G.; Wang, B. Groundwater recharge at five representative sites in the Hebei Plain, China. Ground Water 2011, 49, 286–294.
  19. Liu, C.M.; Yu, J.J.; Kendy, E. Groundwater exploitation and its impact on the environment in the North China Plain. Water Int. 2001, 26, 265–272.
  20. Cao, G.; Han, D.; Song, X. Evaluating actual evapotranspiration and impacts of groundwater storage change in the North China Plain. Hydrol. Process. 2014, 28, 1797–1808.
  21. Shijiazhuang Bureau of Statistics. Shijiazhuang Statistical Yearbook; China Statistics Press: Beijing, China, 1984–2013.
  22. Cortes, C.; Vapnik, V. Support-Vector networks. Mach. Learn. 1995, 20, 273–297.
  23. Elangovan, M.; Sugumaran, V.; Ramachandran, K.I.; Ravikumar, S. Effect of SVM kernel functions on classification of vibration signals of a single point cutting tool. Expert Syst. Appl. 2011, 38, 15202–15207.
  24. Safavi, H.R.; Esmikhani, M. Conjunctive use of surface water and groundwater: Application of support vector machines (SVMs) and genetic algorithms. Water Resour. Manag. 2013, 27, 2623–2644.
  25. Wan Jaafar, W.Z.; Han, D. Variable Selection Using the Gamma Test Forward and Backward Selections. J. Hydrol. Eng. 2012, 17, 182–190.
Figure 1. The location of the study area and the distributions of monitoring wells and meteorological stations.
Figure 2. Mutual relations of the 12 input indices. The blue box shows the five key factors, which are closely related to groundwater level changes. The green, red, cyan, grey and yellow boxes respectively represent natural, anthropic, economic, biological and social indices. The connecting lines with two-way arrows mean interaction effects between the indices and the groundwater level. The connecting lines with single arrows mean one-way supplement or consumption of the groundwater, and the lines without arrow mean inclusiveness.
Figure 3. Precipitation and groundwater depth from 1984 to 2013 in Shijiazhuang plain.
Figure 4. The architecture of the Back-Propagation Artificial Neural Network (BPANN) model with 12 input layers, 30 hidden layers and one output layer.
Figure 5. Influence of number of hidden layers on the BPANN model with 30 nodes. (a) RMSE; (b) E.
Figure 6. Influence of number of nodes in hidden layer on the BPANN model with one hidden layer. (a) RMSE; (b) E.
Figure 7. Influence of parameter γ on the SVM (RBF) with C = 1.0. (a) RMSE; (b) E.
Figure 8. Influence of parameter C on the SVM (RBF) with γ = 0.3. (a) RMSE; (b) E.
Figure 9. The observed and fitted values (Scheme 19) of groundwater depth from 1984 to 2003.
Figure 10. The observed and tested values (Scheme 19) of groundwater depth from 2004 to 2013.
Figure 11. The different values of RMSE and E with the change of |Γ| in fitting results. (a) RMSE; (b) E.
Figure 12. The different values of RMSE and E with the change of |Γ| in testing results. (a) RMSE; (b) E.
Table 1. Serial number of the 12 input indices.
| Number | Indices | Number | Indices |
|---|---|---|---|
| 1 | Precipitation (mm) | 7 | GDP (million USD) |
| 2 | Evaporation (mm) | 8 | Primary industry output (million USD) |
| 3 | Groundwater exploitation (million m3) | 9 | Secondary industry output (million USD) |
| 4 | Industrial groundwater use (million m3) | 10 | Tertiary industry output (million USD) |
| 5 | Irrigation groundwater use (million m3) | 11 | Population (million) |
| 6 | Crop yield (million ton) | 12 | Urbanization rate (%) |
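Table 2. Correlation test for 12 inputs. Entries are correlation coefficients (ρ) between the input indices numbered as in Table 1.
| Input Number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | | | | | | | | | | | |
| 2 | −0.199 | 1 | | | | | | | | | | |
| 3 | −0.360 | 0.376 | 1 | | | | | | | | | |
| 4 | −0.225 | −0.176 | −0.115 | 1 | | | | | | | | |
| 5 | −0.355 | 0.346 | 0.352 | 0.190 | 1 | | | | | | | |
| 6 | 0.122 | 0.357 | 0.406 | −0.234 | 0.175 | 1 | | | | | | |
| 7 | 0.131 | 0.211 | 0.168 | −0.410 | −0.175 | 0.309 | 1 | | | | | |
| 8 | 0.139 | 0.315 | 0.250 | −0.392 | −0.105 | 0.324 | 0.432 | 1 | | | | |
| 9 | 0.135 | 0.217 | 0.167 | −0.387 | −0.178 | 0.309 | 0.799 * | 0.282 | 1 | | | |
| 10 | 0.131 | 0.189 | 0.150 | −0.288 | −0.189 | 0.388 | 0.399 | 0.376 | 0.398 | 1 | | |
| 11 | 0.150 | 0.313 | 0.323 | −0.381 | −0.049 | 0.385 | 0.430 | 0.371 | 0.328 | 0.720 * | 1 | |
| 12 | 0.154 | 0.146 | 0.238 | −0.324 | −0.160 | 0.217 | 0.455 | 0.347 | 0.354 | 0.351 | 0.440 | 1 |
Note: * means significant at 0.05 level.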
Table 3. GT results for schemes with different combinations of model input indices.
| Scheme ID | Combination of Indices | Masked Index | Gamma (Γ) | Gradient | Standard Error | Vratio |
|---|---|---|---|---|---|---|
| 1 | 1,2,3,4,5,6,7,8,9,10,11,12 | None | −0.00922 | 0.02841 | 0.01161 | −0.03687 |
| 2 | 2,3,4,5,6,7,8,9,10,11,12 | 1 | −0.00212 | 0.02529 | 0.00420 | −0.00848 |
| 3 | 1,3,4,5,6,7,8,9,10,11,12 | 2 | −0.00170 | 0.02776 | 0.00612 | −0.00679 |
| 4 | 1,2,4,5,6,7,8,9,10,11,12 | 3 | −0.00836 | 0.03004 | 0.00767 | −0.03343 |
| 5 | 1,2,3,5,6,7,8,9,10,11,12 | 4 | −0.01020 | 0.03265 | 0.00566 | −0.04082 |
| 6 | 1,2,3,4,6,7,8,9,10,11,12 | 5 | −0.00732 | 0.03007 | 0.00711 | −0.02927 |
| 7 | 1,2,3,4,5,7,8,9,10,11,12 | 6 | −0.00559 | 0.02845 | 0.00606 | −0.02235 |
| 8 | 1,2,3,4,5,6,8,9,10,11,12 | 7 | −0.00714 | 0.02955 | 0.00510 | −0.02857 |
| 9 | 1,2,3,4,5,6,7,9,10,11,12 | 8 | −0.00580 | 0.02850 | 0.00613 | −0.02320 |
| 10 | 1,2,3,4,5,6,7,8,10,11,12 | 9 | −0.00745 | 0.02944 | 0.00993 | −0.02981 |
| 11 | 1,2,3,4,5,6,7,8,9,11,12 | 10 | −0.00316 | 0.02759 | 0.00739 | −0.01264 |
| 12 | 1,2,3,4,5,6,7,8,9,10,12 | 11 | −0.00351 | 0.02842 | 0.00810 | −0.01405 |
| 13 | 1,2,3,4,5,6,7,8,9,10,11 | 12 | −0.01372 | 0.03514 | 0.00919 | −0.05487 |
Table 4. Backward selection results with the assistance of GT.
| Scheme ID | Combination of Indices | Index Removed | Gamma (Γ) | Gradient | Standard Error | Vratio |
|---|---|---|---|---|---|---|
| 1 | 1,2,3,4,5,6,7,8,9,10,11,12 | None | −0.00922 | 0.02841 | 0.01161 | −0.03687 |
| 3 | 1,3,4,5,6,7,8,9,10,11,12 | 2 | −0.00170 | 0.02776 | 0.00612 | −0.00679 |
| 14 | 1,3,4,5,6,7,8,9,10,12 | 11 | 0.00103 | 0.02878 | 0.00555 | 0.00413 |
| 15 | 3,4,5,6,7,8,9,10,12 | 1 | 0.00031 | 0.03579 | 0.00625 | 0.00124 |
| 16 | 3,4,5,6,7,9,10,12 | 8 | −0.00014 | 0.03999 | 0.00916 | −0.00054 |
| 17 | 3,4,5,6,9,10,12 | 7 | 0.00021 | 0.04415 | 0.00776 | 0.00085 |
| 18 | 3,4,5,6,9,12 | 10 | 0.00091 | 0.05043 | 0.00736 | 0.00366 |
| 19 | 3,4,6,9,12 | 5 | −0.00006 | 0.06856 | 0.00886 | −0.00023 |
| 20 | 3,4,6,12 | 9 | 0.00160 | 0.08593 | 0.00660 | 0.00640 |
| 21 | 3,4,12 | 6 | 0.01719 | 0.08515 | 0.01186 | 0.06877 |
| 22 | 4,12 | 3 | 0.02153 | 0.15409 | 0.01040 | 0.08612 |
| 23 | 12 | 4 | 0.00546 | 0.71495 | 0.00361 | 0.02185 |
Table 5. Forward selection results with the assistance of GT.
| Scheme ID | Combination of Indices | Index Added | Gamma (Γ) | Gradient | Standard Error | Vratio |
|---|---|---|---|---|---|---|
| 23 | 12 | None | 0.00546 | 0.71495 | 0.00361 | 0.02185 |
| 24 | 10,12 | 10 | 0.00285 | 0.19576 | 0.00491 | 0.01142 |
| 25 | 7,10,12 | 7 | 0.00149 | 0.13163 | 0.00483 | 0.00597 |
| 26 | 6,7,10,12 | 6 | 0.00645 | 0.08496 | 0.00294 | 0.02579 |
| 27 | 3,6,7,10,12 | 3 | −0.00069 | 0.07638 | 0.00797 | −0.00277 |
| 28 | 3,4,6,7,10,12 | 4 | 0.00199 | 0.05354 | 0.00541 | 0.00796 |
| 29 | 3,4,5,6,7,10,12 | 5 | 0.00134 | 0.04241 | 0.00717 | 0.00539 |
| 16 | 3,4,5,6,7,9,10,12 | 9 | −0.00014 | 0.03999 | 0.00916 | −0.00054 |
| 15 | 3,4,5,6,7,8,9,10,12 | 8 | 0.00031 | 0.03579 | 0.00625 | 0.00124 |
| 14 | 1,3,4,5,6,7,8,9,10,12 | 1 | 0.00103 | 0.02878 | 0.00555 | 0.00414 |
| 3 | 1,3,4,5,6,7,8,9,10,11,12 | 11 | −0.00170 | 0.02776 | 0.00612 | −0.00679 |
| 1 | 1,2,3,4,5,6,7,8,9,10,11,12 | 2 | −0.00922 | 0.02841 | 0.01161 | −0.03687 |
Table 6. Fitting results and errors of PFM, BPANN and SVM models for Scheme 19.
| Scheme ID | PFM RMSE | PFM E | BPANN RMSE | BPANN E | SVM (LKF) RMSE | SVM (LKF) E | SVM (RBF) RMSE | SVM (RBF) E |
|---|---|---|---|---|---|---|---|---|
| 19 | 1.0638 | 0.9157 | 0.8411 | 0.9473 | 1.4345 | 0.8467 | 0.4265 | 0.9864 |
Table 7. Testing results and errors of PFM, BPANN and SVM models for Scheme 19.
| Scheme ID | PFM RMSE | PFM E | BPANN RMSE | BPANN E | SVM (LKF) RMSE | SVM (LKF) E | SVM (RBF) RMSE | SVM (RBF) E |
|---|---|---|---|---|---|---|---|---|
| 19 | 2.3916 | −0.5409 | 2.2597 | −0.3755 | 3.4417 | −2.1910 | 1.4612 | 0.4248 |
Table 8. Fitting results and errors of PFM, BPANN and SVM models.
| Scheme ID | Gamma (Γ) | PFM RMSE | PFM E | BPANN RMSE | BPANN E | SVM (LKF) RMSE | SVM (LKF) E | SVM (RBF) RMSE | SVM (RBF) E |
|---|---|---|---|---|---|---|---|---|---|
| 19 | −0.00006 | 1.0638 | 0.9157 | 0.8411 | 0.9473 | 1.4345 | 0.8467 | 0.4265 | 0.9864 |
| 16 | −0.00014 | 1.1770 | 0.8968 | 1.6441 | 0.7986 | 1.4995 | 0.8351 | 0.4868 | 0.9823 |
| 17 | 0.00021 | 1.1626 | 0.8993 | 0.9923 | 0.9266 | 1.4874 | 0.8325 | 0.4279 | 0.9864 |
| 15 | 0.00031 | 1.2342 | 0.8801 | 1.7692 | 0.9359 | 1.5033 | 0.8302 | 0.4427 | 0.9821 |
| 27 | −0.00069 | 1.3094 | 0.7908 | 1.7903 | 0.9137 | 1.5232 | 0.7365 | 0.4915 | 0.9122 |
| 18 | 0.00091 | 1.2942 | 0.7882 | 1.8233 | 0.8244 | 1.6044 | 0.7086 | 0.5111 | 0.9035 |
| 14 | 0.00103 | 1.5336 | 0.6627 | 1.9917 | 0.6823 | 1.7908 | 0.6109 | 0.7293 | 0.8146 |
| 29 | 0.00134 | 1.6092 | 0.5634 | 1.9737 | 0.5605 | 1.7834 | 0.5701 | 0.7184 | 0.8003 |
| 25 | 0.00149 | 1.8864 | 0.4101 | 2.1431 | 0.4023 | 2.0937 | 0.3952 | 1.6436 | 0.7822 |
| 20 | 0.00160 | 1.6903 | 0.4629 | 2.2098 | 0.3857 | 2.8806 | 0.3328 | 1.7305 | 0.7909 |
| 3 | −0.00170 | 1.6805 | 0.4307 | 2.3105 | 0.3632 | 2.2308 | 0.3586 | 0.9236 | 0.8013 |
| 28 | 0.00199 | 2.0937 | 0.2956 | 2.4208 | 0.2924 | 2.6011 | 0.2393 | 0.9937 | 0.8028 |
| 24 | 0.00285 | 14.9721 | −13.2698 | 6.0912 | −4.3223 | 12.2426 | −13.9142 | 4.9251 | −2.9634 |
| 23 | 0.00546 | 14.1693 | −13.9601 | 12.9411 | −9.3762 | 13.9031 | −12.4881 | 10.4623 | −8.3092 |
| 26 | 0.00645 | 13.8233 | −12.3302 | 6.0265 | −4.9808 | 11.7868 | −10.6984 | 4.8521 | −2.8341 |
| 1 | −0.00922 | 4.2475 | −2.9804 | 2.2928 | −2.4636 | 6.2099 | −4.0928 | 2.2947 | −0.3525 |
| 21 | 0.01719 | 11.9236 | −9.5725 | 8.0231 | −5.9467 | 12.9325 | −11.3207 | 6.6943 | −3.9408 |
| 22 | 0.02153 | 14.9261 | −10.7363 | 12.1099 | −9.3307 | 15.9222 | −13.2903 | 9.2362 | −7.8460 |
Table 9. Testing results and errors of PFM, BPANN and SVM models.
| Scheme ID | Gamma (Γ) | PFM RMSE | PFM E | BPANN RMSE | BPANN E | SVM (LKF) RMSE | SVM (LKF) E | SVM (RBF) RMSE | SVM (RBF) E |
|---|---|---|---|---|---|---|---|---|---|
| 19 | −0.00006 | 2.3916 | −0.5409 | 2.2597 | −0.3755 | 3.4417 | −2.191 | 1.4612 | 0.4248 |
| 16 | −0.00014 | 3.6501 | −2.5891 | 2.5698 | −0.7791 | 3.4791 | −2.2608 | 1.4359 | 0.4446 |
| 17 | 0.00021 | 3.1715 | −1.7097 | 2.2831 | −0.4042 | 3.4125 | −2.1371 | 1.5822 | 0.3256 |
| 15 | 0.00031 | 3.7253 | −2.5912 | 2.4672 | −0.7923 | 3.7982 | −2.6952 | 1.5901 | 0.3083 |
| 27 | −0.00069 | 3.7526 | −2.6108 | 2.5094 | −0.8155 | 3.8023 | −2.7325 | 1.6433 | 0.2596 |
| 18 | 0.00091 | 3.7802 | −2.7941 | 2.6455 | −0.9406 | 3.9246 | −2.8244 | 1.7024 | 0.2319 |
| 14 | 0.00103 | 3.8204 | −2.7992 | 2.5123 | −0.8299 | 3.4819 | −2.4953 | 1.7293 | 0.2297 |
| 29 | 0.00134 | 3.8246 | −2.8456 | 2.5297 | −0.8482 | 4.1284 | −3.2349 | 1.9039 | 0.2006 |
| 25 | 0.00149 | 8.0644 | −10.936 | 7.6436 | −7.0354 | 9.423 | −10.843 | 4.6255 | −3.4251 |
| 20 | 0.00160 | 9.5036 | −9.4213 | 7.9928 | −8.3409 | 8.1567 | −9.6242 | 3.9288 | −2.8603 |
| 3 | −0.00170 | 4.8355 | −3.2194 | 2.4093 | −0.7021 | 3.6938 | −3.0365 | 2.0934 | 0.1353 |
| 28 | 0.00199 | 5.1032 | −4.9412 | 4.2356 | −3.9236 | 5.9435 | −6.0256 | 3.5647 | −2.0425 |
| 24 | 0.00285 | 30.8291 | −70.3527 | 24.821 | −50.3423 | 35.9801 | −102.8313 | 22.9564 | −41.9327 |
| 23 | 0.00546 | 34.9115 | −98.5744 | 30.5092 | −88.3257 | 33.4883 | −94.2952 | 33.4219 | −94.0942 |
| 26 | 0.00645 | 7.9212 | −7.2205 | 6.9023 | −6.8211 | 8.9346 | −9.9438 | 5.3196 | −5.9443 |
| 1 | −0.00922 | 4.2245 | −3.9231 | 2.5646 | −0.7924 | 3.7992 | −2.5871 | 2.0293 | 0.1212 |
| 21 | 0.01719 | 20.8431 | −40.5496 | 15.4925 | −24.9825 | 23.5794 | −46.9324 | 10.9577 | −13.9548 |
| 22 | 0.02153 | 31.2469 | −79.0362 | 20.9421 | −33.4924 | 29.91207 | −68.3629 | 17.9822 | −26.9293 |

Share and Cite

MDPI and ACS Style

Tian, J.; Li, C.; Liu, J.; Yu, F.; Cheng, S.; Zhao, N.; Wan Jaafar, W.Z. Groundwater Depth Prediction Using Data-Driven Models with the Assistance of Gamma Test. Sustainability 2016, 8, 1076. https://doi.org/10.3390/su8111076
