# Research on the Data-Driven Quality Control Method of Hydrological Time Series Data

## Abstract

## 1. Introduction

- For continuous data, two stable predictive control models are constructed, one from the horizontal perspective (the horizontal optimized integrated predictive control model) and one from the vertical perspective (the longitudinal predictive control model). Each model provides a predicted value and a confidence interval for suspicious data, and the staff decide whether to fill the value manually, accept the recommended replacement, or retain the original value. The latest data are used as the sample set for periodic training and model adjustment, so that the model parameters are updated dynamically over time. The predicted value of each model serves as the center of a control interval used to detect and control the quality of continuous hydrological data. A statistical data quality control interval is also established from the perspective of statistics and combined with the horizontal and vertical predictive control models to form the continuous hydrological data control model. Hourly real-time hydrological data are checked for time consistency, and the number of control intervals that an observation violates is taken as its degree of suspicion.
- For discrete hydrological data with large spatial differences and poor temporal continuity (such as rainfall), a discrete hydrological data control scheme is proposed. Centered on the station under test, a topological map of similarity weights between neighboring stations is established and adjusted with seasonal variation. A spatial interpolation model of daily precipitation is constructed from the monitoring data of highly correlated surrounding stations and is used to fill short-term gaps in the precipitation data.
- An online real-time adjustment strategy is set: following the seasonal variation characteristics of hydrology, the basic QC parameters, the thresholds, and the parameters of the predictive control models are adjusted dynamically when the control interval is established.

## 2. Basic Theories Used in Hydrological Data Quality Control Model

#### 2.1. Hydrological Data Quality Control Model

#### 2.2. Related Theories

#### 2.2.1. Definitions

#### 2.2.2. Prediction Methods

#### Recurrent Neural Network

#### Support Vector Machines

#### Long-Short Term Memory

#### 2.2.3. Optimization Methods

#### Particle Swarm Optimization

#### Mind Evolutionary Algorithm

#### 2.2.4. Statistical Methods

#### Adaptive Boosting

#### Statistical Control

#### 2.2.5. Spatial Interpolation Methods

#### Inverse Distance Weighting

#### Trend Surface Method

#### Kriging Method

## 3. Continuous Hydrological Data Control Model

#### 3.1. Horizontal Predictive Control Model

- Select the data and normalize them to the range (−1, 1).
- Optimize the initial parameters of the SVM and the RNN with PSO, and establish the optimal SVM weak prediction model and RNN weak prediction model.
- Select m RNNs and the best SVM, use these m + 1 models as the weak predictors of AdaBoost, and establish the AdaBoost strong predictor.
- Using the predicted value and the mean square error of the prediction model, establish the quality-control confidence interval with the predicted value as its center. Then connect the intervals and smooth the upper and lower bounds with a wavelet transform (more algorithm details can be found in the Supplementary Materials).
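The first and last steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the interval half-width is assumed to be a multiple `z` of the root mean square error of past predictions, and the wavelet smoothing of the bounds is omitted. Note that plain min-max scaling maps the extremes to exactly −1 and 1.

```python
import math


def normalize(values):
    """Min-max scale a sequence linearly so that min -> -1 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no scale information; map it to 0.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def control_interval(predicted, observed_history, predicted_history, z=1.96):
    """Interval centered on the prediction.

    The half-width is z times the RMSE of earlier predictions.
    z = 1.96 is an illustrative assumption, not a value from the paper.
    """
    squared = [(o - p) ** 2 for o, p in zip(observed_history, predicted_history)]
    rmse = math.sqrt(sum(squared) / len(squared))
    return predicted - z * rmse, predicted + z * rmse


scaled = normalize([0.0, 5.0, 10.0])
low, high = control_interval(10.0, [1.0, 2.0], [2.0, 1.0], z=2.0)
```

An observation falling outside `(low, high)` would count as one violated control interval.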

#### 3.2. Longitudinal Predictive Control Model

1. **Training set / test set generation:** Select the data and normalize them to the range (−1, 1). Organize the data so that the values of the previous N hours are used to predict the future hydrological situation. Then randomly shuffle the samples and select 80% of them as the training set and the rest as the test set.
2. **Population generation:** The initial population is randomly generated and ranked by mean square error from small to large. The first m individuals are selected as the centers of the superior subpopulations and individuals m + 1 to m + k as the centers of the temporary subpopulations. New subpopulations are then generated around these centers within limits.
3. **Subpopulation convergence operation:** The convergence operation iteratively moves the center of a subpopulation and randomly generates multiple points around the new center to form a new subpopulation, until the center position no longer changes and the average error of the prediction model is minimized (that is, the score is highest). The subpopulation has then reached maturity, and the score of its center is taken as the score of the subpopulation.
4. **Subpopulation dissimilation (alienation) operation:** The scores of the mature subpopulations are ranked from large to small, and superior subpopulations are replaced by temporary subpopulations with higher scores. The temporary subpopulations are then replenished so that their number is unchanged.
5. **Output the current best individual and score:** Find the highest-scoring individual in the winning subpopulation as the best individual and score obtained so far, and save it in the temporary array temp_best.
6. **Select excellent individuals:** Repeat the convergence and dissimilation operations of the subpopulations until the stop condition is satisfied, that is, the optimal individual position no longer changes or the maximum number of iterations is reached. Sort the array temp_best from large to small and select the top m as the excellent individuals.
7. **Establish LSTM models:** Decode the m excellent individuals generated in step 6, assign the optimized initial weights and thresholds to the corresponding network models, and train on the sample set again to construct m LSTM models.
8. **Establish the longitudinal predictive control model:** Run the m models on the test set, then weight and combine them according to their test mean square errors to establish the longitudinal predictive control model.
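The data preparation and the final weighting step can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and weighting each model by the inverse of its test mean square error is an assumption, since the text says only that the models are weighted according to that error. The MEA search and the LSTM training are not shown.

```python
import random


def make_windows(series, n_lags):
    """Pair each value with the n_lags values that precede it."""
    return [(series[i - n_lags:i], series[i]) for i in range(n_lags, len(series))]


def shuffle_split(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def inverse_mse_weights(test_mses):
    """Assumed rule: weight each model by 1 / MSE, normalized to sum to 1."""
    inverse = [1.0 / mse for mse in test_mses]
    total = sum(inverse)
    return [w / total for w in inverse]


def combine(predictions, weights):
    """Weighted combination of the individual model predictions."""
    return sum(p * w for p, w in zip(predictions, weights))


windows = make_windows([1, 2, 3, 4, 5], n_lags=2)
train, test = shuffle_split(make_windows(list(range(12)), n_lags=2))
weights = inverse_mse_weights([1.0, 3.0])
blended = combine([10.0, 20.0], weights)
```

With this rule, a model whose test error is three times larger receives one third of the weight of the better model.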

#### 3.3. Statistical Data Quality Control Model

1. **Parameter setting:** Set the sizes of the horizontal and vertical sliding windows to n and m, respectively. The wavelet basis of the wavelet analysis is bior and the decomposition scale is k. The smoothing time span is d, and the dynamic weights are ${\delta}_{d}$ and ${\delta}_{t}$.
2. **Time series smoothing:** Wavelet analysis is used to decompose and reconstruct the hydrological data, and the resulting smoothed sequence reduces the short-term fluctuation and noise of the hydrological data.
3. **Confidence interval of the hydrological (horizontal) time series:** Taking hourly real-time data as an example, the weighted mean value ${\overline{X}}_{d}$ and the mean square error ${S}_{d}$ over a sliding window of length n are obtained for each time of day. Set $({\overline{X}}_{d}-{\delta}_{d}\times {S}_{d},{\overline{X}}_{d}+{\delta}_{d}\times {S}_{d})$ as the confidence interval for the same time on the next day. The upper and lower bounds of the confidence intervals are connected to form an upper-bound sequence and a lower-bound sequence, and the two sequences are smoothed by wavelet analysis to eliminate possible mutation points. If the rate of change of the interval is large, it is limited to a fixed range to reduce the variation of the interval.
4. **Confidence interval of the longitudinal time series:** The monitoring data of the previous m time steps are smoothed by wavelet analysis, and the weighted mean value ${\overline{X}}_{t}$ and the mean square error ${S}_{t}$ are obtained. Set $({\overline{X}}_{t}-{\delta}_{t}\times {S}_{t},{\overline{X}}_{t}+{\delta}_{t}\times {S}_{t})$ as the confidence interval of the real-time data at the next time step, where ${\delta}_{t}$ needs to be adjusted constantly.
5. **Comprehensive confidence interval:** The weighted control interval $\left[\left({\overline{X}}_{d}-{\delta}_{d}\times {S}_{d}\right)+\left({\overline{X}}_{t}-{\delta}_{t}\times {S}_{t}\right),\left({\overline{X}}_{d}+{\delta}_{d}\times {S}_{d}\right)+\left({\overline{X}}_{t}+{\delta}_{t}\times {S}_{t}\right)\right]/2$ is obtained by combining the confidence intervals of the vertical and horizontal time series at the same time.
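The comprehensive interval can be computed as in the sketch below. It is a simplified illustration under stated assumptions: the function names are hypothetical, the quantity S is interpreted as the weighted standard deviation around the weighted mean, uniform weights are used by default, and the wavelet smoothing steps are left out.

```python
import math


def weighted_mean_std(values, weights=None):
    """Weighted mean and weighted standard deviation of a window."""
    if weights is None:
        weights = [1.0] * len(values)  # assumption: uniform weights
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, math.sqrt(variance)


def interval(mean, std, delta):
    """(mean - delta * std, mean + delta * std)."""
    return mean - delta * std, mean + delta * std


def comprehensive_interval(day_window, time_window, delta_d, delta_t):
    """Average the bounds of the horizontal and longitudinal intervals."""
    x_d, s_d = weighted_mean_std(day_window)    # same hour on previous days
    x_t, s_t = weighted_mean_std(time_window)   # previous time steps
    low_d, high_d = interval(x_d, s_d, delta_d)
    low_t, high_t = interval(x_t, s_t, delta_t)
    return (low_d + low_t) / 2.0, (high_d + high_t) / 2.0


bounds = comprehensive_interval([1.0, 3.0], [4.0, 8.0], delta_d=1.0, delta_t=0.5)
```

In this toy call the horizontal interval is (1, 3) and the longitudinal interval is (5, 7), so the combined interval is (3, 5).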

#### 3.4. Continuous Hydrological Data Quality Control Method

- Level 1: The data value is outside one confidence interval.
- Level 2: The data value is outside two confidence intervals.
- Level 3: The data value is outside all confidence intervals.
- Level 4: The change in the data exceeds the time-varying rate.
- Level 5: The data value is outside the range between the minimum and maximum values.
- Level 6: The data is missing or malformed.
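One possible way to assign these levels is sketched below. The precedence (the most severe matching level wins), the function name, and the arguments are assumptions made for illustration. The sketch expects three control intervals (horizontal, longitudinal, and statistical) and returns 0 when nothing is flagged.

```python
import math


def suspicion_level(value, previous, intervals, max_rate, hard_min, hard_max):
    """Return 0 (not suspicious) or the highest matching level from 1 to 6."""
    # Level 6: missing or malformed (None, non-numeric, or NaN).
    if value is None or isinstance(value, bool) or not isinstance(value, (int, float)):
        return 6
    if math.isnan(value):
        return 6
    # Level 5: outside the allowed minimum/maximum range.
    if value < hard_min or value > hard_max:
        return 5
    # Level 4: change since the previous observation exceeds the rate limit.
    if previous is not None and abs(value - previous) > max_rate:
        return 4
    # Levels 1 to 3: count the violated control intervals.
    violated = sum(1 for low, high in intervals if value < low or value > high)
    if violated == 0:
        return 0
    if violated == len(intervals):
        return 3
    return 1 if violated == 1 else 2


three_intervals = [(0.0, 10.0), (1.0, 9.0), (2.0, 8.0)]
level = suspicion_level(9.5, 9.0, three_intervals, max_rate=3.0,
                        hard_min=-100.0, hard_max=100.0)
```

Here 9.5 lies outside two of the three intervals, so the sketch reports level 2.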

## 4. Discrete Hydrological Data Control Model

#### 4.1. Hydrological Spatial Topological Structure

1. **Screening stations:** Taking the station under test as the center, the n peripheral stations within radius S are selected and marked as the candidate set, and the m stations with sufficient hydrological data are then selected from the candidates.
2. **Constructing time series:** Extract the station data for the specified period from the compiled historical hydrological database, compute the monthly factor values, and construct the m monthly time series.
3. **Correlation analysis:** Analyze the correlation between the station under test and the m surrounding stations and obtain the correlation coefficient sets ${R}_{i}$ between the m + 1 stations, where $i=1,2,\dots ,m$. The correlation coefficient set ${R}_{1}$ of the station under test is sorted from large to small and the stations with large correlations are selected (generally those with a coefficient above 0.6).
4. **Preserving the correlation coefficient:** The station codes and coefficients of the stations that correlate strongly with the station under test are saved in the station relation table st-relation. In addition, for each of the other m stations, the stations with correlation coefficients greater than 0.6 are selected and stored in the table st-relation.
5. **Constructing regional networks:** Repeat steps 1 to 4 to construct the relational network. The difference is that only unmarked peripheral stations are selected in step 1, which greatly reduces the construction time of the regional station network.
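The correlation analysis and selection can be sketched with a plain Pearson coefficient. The choice of the Pearson coefficient, the function names, and the toy series are assumptions for illustration, while the 0.6 threshold comes from the text.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def related_stations(target_series, neighbor_series, threshold=0.6):
    """Return (station, r) pairs with r above the threshold, largest first."""
    scored = [(name, pearson(target_series, series))
              for name, series in neighbor_series.items()]
    kept = [(name, r) for name, r in scored if r > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical monthly series for a station under test and three neighbors.
target = [1.0, 2.0, 3.0, 4.0]
neighbors = {"A": [2.0, 4.0, 6.0, 8.0], "B": [4.0, 3.0, 2.0, 1.0],
             "C": [1.0, 3.0, 2.0, 4.0]}
selected = related_stations(target, neighbors)
```

The resulting pairs are what would be written to the relation table in step 4.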

#### 4.2. Dynamic Adjustment of the Topological Structure

#### 4.3. Spatial Interpolation Model

- Using the adjacent-station information in the station relation table, extract the hourly real-time rainfall data of the selected adjacent stations from the hydrological real-time database.
- Inspect the data and aggregate the daily precipitation of the stations that have no missing data on that day.
- Use spatial geographic information (such as longitude, latitude, or elevation) and the daily precipitation data to construct three spatial interpolation models. These models do not depend on historical data; their parameters change as the regional data change.
- Establish a three-dimensional space model and obtain the predicted results from the geographic information of the stations.
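As an illustration of one of the interpolation methods, the sketch below implements basic inverse distance weighting. It is the textbook form, not the improved variant evaluated in the paper, and it assumes planar station coordinates (for example in km) and a distance power of 2. The function name and the sample values are hypothetical.

```python
import math


def idw(target_xy, stations, power=2.0):
    """Inverse distance weighted estimate at target_xy.

    stations is a list of ((x, y), value) pairs.
    """
    numerator = 0.0
    denominator = 0.0
    for (x, y), value in stations:
        distance = math.hypot(x - target_xy[0], y - target_xy[1])
        if distance == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Hypothetical daily precipitation (mm) at two neighbors, 1 km and 2 km away.
estimate = idw((0.0, 0.0), [((1.0, 0.0), 10.0), ((2.0, 0.0), 20.0)])
```

The nearer station dominates: with weights 1 and 0.25 the estimate is 12 mm rather than the unweighted mean of 15 mm.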

#### 4.4. Discrete Hydrological Data Quality Control Method

- Calculate the correlation between the station under test and the surrounding stations from the spatial geography and the data association of the hydrological elements, and construct a spatial topology.
- The correlation changes with the seasons, so the topology must be adjusted dynamically to provide hydrological workers with reference data for filling gaps and reviewing suspicious data.
- Construct a spatial interpolation model, use several spatial interpolation methods to check spatial consistency, compare the linear spatial interpolation methods with the nonlinear multivariate SVM method, and analyze their advantages and disadvantages.
- Based on the predicted value of the best-performing method, the improved inverse distance spatial interpolation method, a confidence interval is given to check and control the daily precipitation data.
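The seasonal adjustment of the topology can be illustrated with the coefficients reported in Table 1 for the Shangli station. The sketch below keeps, for each season, only the neighbors whose coefficient exceeds 0.6. The function name is hypothetical, and reusing the 0.6 threshold for the seasonal selection is an assumption.

```python
# Seasonal correlation coefficients with Shangli, copied from Table 1.
TABLE_1 = {
    "Chishan":   {"spring": 0.90, "summer": 0.63, "autumn": 0.85, "winter": 0.94},
    "Pingxiang": {"spring": 0.70, "summer": 0.64, "autumn": 0.87, "winter": 0.79},
    "Zhongli":   {"spring": 0.66, "summer": 0.57, "autumn": 0.46, "winter": 0.22},
    "Laoguan":   {"spring": 0.49, "summer": 0.68, "autumn": 0.88, "winter": 0.80},
}


def neighbors_for_season(table, season, threshold=0.6):
    """Stations whose coefficient in this season exceeds the threshold."""
    kept = [(name, coeffs[season]) for name, coeffs in table.items()
            if coeffs[season] > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in kept]


spring = neighbors_for_season(TABLE_1, "spring")
winter = neighbors_for_season(TABLE_1, "winter")
```

Under this rule Zhongli is a reference station in spring but not in winter, while Laoguan is used in every season except spring.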

## 5. Experimental Analysis

#### 5.1. Experimental Analysis of Continuous Data

#### 5.1.1. Experimental Analysis of the Horizontal Predictive Control Model

#### 5.1.2. Experimental Analysis of the Longitudinal Predictive Control Model

#### 5.1.3. Experimental Analysis of the Statistical Data Quality Control Model

#### 5.1.4. Experimental Analysis of Continuous Hydrological Data Control

#### 5.2. Experimental Analysis of Discrete Data

#### 5.2.1. Build Topological Structure

#### 5.2.2. Experimental Analysis of the Spatial Interpolation Model

## 6. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

**Figure 14.** (**a**) The flood season of the Guxiandu station under a basic statistical control method. (**b**) The flood season of the Guxiandu station under the statistical control method in this paper.

**Figure 17.** (**a**) The comparison of the interpolation effects of different methods. (**b**) The comparison of the interpolation effects of different improved methods.

**Figure 18.** (**a**) Comparison of daily precipitation data between prediction and statistics of Tangyin. (**b**) Comparison of daily precipitation data of stations in the surrounding stations of Tangyin.

**Table 1.** Correlation coefficient between Shangli and adjacent stations in Xiangjiang River, Jiangxi Province.

Stations | Spring (1–3) | Summer (4–6) | Autumn (7–9) | Winter (10–12) | Whole |
---|---|---|---|---|---|
Chishan | 0.90 | 0.63 | 0.85 | 0.94 | 0.85 |
Pingxiang | 0.70 | 0.64 | 0.87 | 0.79 | 0.77 |
Zhongli | 0.66 | 0.57 | 0.46 | 0.22 | 0.56 |
Laoguan | 0.49 | 0.68 | 0.88 | 0.80 | 0.75 |

Models | CHDC | HPCM | LPCM | SDQC | SVM | RNN | LSTM |
---|---|---|---|---|---|---|---|
Detected anomalies | 56 | 41 | 42 | 37 | 37 | 36 | 39 |

Station name | Da Ming | Dong Pi | Yun Feng | Wu Jia | Hang Shan | Gang Xia | Teng Qiao | Shuang Keng | Kang Du |
---|---|---|---|---|---|---|---|---|---|
Correlation coefficient | 0.86 | 0.82 | 0.81 | 0.78 | 0.78 | 0.77 | 0.77 | 0.76 | 0.74 |
Distance/km | 24.6 | 19.6 | 6.9 | 6.7 | 30.9 | 8.9 | 19.9 | 7.3 | 33.7 |
Altitude distance/m | 119 | 46 | 56 | 48 | 15 | 14 | 93 | 110 | 4 |

Station name | Yang ping | Gao ping | Song shi | Dun tou | Zimu ling | Chen fang | Chen fangqiao | Gu gang | Shi men |
---|---|---|---|---|---|---|---|---|---|
Correlation coefficient | 0.73 | 0.73 | 0.72 | 0.71 | 0.71 | 0.71 | 0.70 | 0.70 | 0.70 |
Distance/km | 10.1 | 29.6 | 31.1 | 18.8 | 17.7 | 10.6 | 8.3 | 13.1 | 24.3 |
Altitude distance/m | 514 | 117 | 5 | 41 | 78 | 21 | 121 | 39 | 92 |

Station name | Liu li | Shang tang | Guang cang | Jin xi | Xiao gongmiao | Da keng | Gou shu | Baishe | Taihe |
---|---|---|---|---|---|---|---|---|---|
Correlation coefficient | 0.68 | 0.66 | 0.65 | 0.65 | 0.64 | 0.63 | 0.63 | 0.63 | 0.62 |
Distance/km | 34.6 | 14.6 | 8.1 | 33.4 | 35.1 | 19.3 | 41.4 | 31.4 | 31.4 |
Altitude distance/m | 101 | 82 | 132 | 104 | 104 | 122 | 146 | 67 | 476 |

Rainfall Prediction Model | Maximum Error | RMSE | Average Error | Predictive Confidence |
---|---|---|---|---|
Inverse Distance Weighting | 49.02 | 18.54 | 1.7945 | 79.56% |
Kriging Method | 51.35 | 17.79 | 1.6912 | 80.06% |
Trend Surface Method | 47 | 20.35 | 1.9327 | 76.57% |
Multivariate SVM | 49 | 15.68 | 1.5654 | 81.11% |

Rainfall Prediction Model | Maximum Error | RMSE | Average Error | Predictive Confidence |
---|---|---|---|---|
Inverse Distance Weighting (general) | 49.02 | 18.54 | 1.7945 | 79.56% |
Inverse Distance Weighting (improved) | 48.34 | 14.92 | 1.4945 | 82.56% |
Kriging (general) | 51.35 | 17.79 | 1.6912 | 80.06% |
Kriging (improved) | 49.35 | 15.79 | 1.5803 | 80.83% |
Trend Surface Method (general) | 47.30 | 20.35 | 1.9327 | 76.57% |
Trend Surface Method (improved) | 47.36 | 19.55 | 1.9015 | 77.63% |
Multivariate SVM | 49 | 15.68 | 1.5654 | 81.11% |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhao, Q.; Zhu, Y.; Wan, D.; Yu, Y.; Cheng, X.
Research on the Data-Driven Quality Control Method of Hydrological Time Series Data. *Water* **2018**, *10*, 1712.
https://doi.org/10.3390/w10121712
