# A Practical Yet Accurate Real-Time Statistical Analysis Library for Hydrologic Time-Series Big Data

## Abstract

## 1. Introduction

- We propose and implement HydroStreamingLib, a library for hydrologic time-series analysis built on Apache Flink. The library provides 19 hydrologic time-series test methods in four categories, extending the Flink ecosystem to the hydrologic domain.
- On top of HydroStreamingLib, we build a hydrologic stream-data verification system that applies statistical tests to large-scale, high-velocity hydrologic stream data and visualizes the test results in real time. It provides a complete pipeline covering data collection, transmission, analysis, persistence, and visualization.
- We apply HydroStreamingLib to a real-world problem and evaluate the library's algorithms from several perspectives. Compared with other general-purpose methods and tools, HydroStreamingLib achieves better results on real datasets.

## 2. Related Works

## 3. The Proposed Library

#### 3.1. Statistical Test Methods

#### 3.2. Data Importation and Preprocessing

#### 3.2.1. Data Duplication

#### 3.2.2. Data Anomaly

#### 3.2.3. Missing Data

#### 3.3. Distributed Implementation

#### 3.4. Characteristic Discrimination

## 4. Hydrologic Real-Time Analysis System

#### 4.1. Data Aggregation Based on Time Window

**Step 1.** A tumbling (non-overlapping) time window ${W}_{0}$ is established for the hydrologic data stream ${S}_{0}=\left({T}_{0},{V}_{0}\right)$ flowing into Flink, where ${T}_{0}$ is the sampling timestamp of the sensor and ${V}_{0}$ is the sampled hydrologic value. The window size is set to 1 h, which is the default in this paper and can be adjusted to the actual situation.

**Step 2.** The mean of the hydrologic data in each one-hour window ${W}_{0}$ is calculated as a measure of the central tendency of the window data. Because the data have already been preprocessed, the mean is sufficiently robust at this point.

**Step 3.** Applying ${W}_{0}$ to the data stream ${S}_{0}$ yields the output stream ${S}_{1}=\left({T}_{1},{V}_{1}\right)$, where ${T}_{1}$ is a timestamp at one-hour intervals and ${V}_{1}$ is the sequence of hourly means. A new tumbling window ${W}_{1}$ with a size of 24 h is then created.

**Step 4.** The mean of each day’s hydrologic data is calculated in ${W}_{1}$. Applying ${W}_{1}$ to ${S}_{1}$ yields the output stream ${S}_{2}=\left({T}_{2},{V}_{2}\right)$, where ${T}_{2}$ is a timestamp at one-day intervals and ${V}_{2}$ is the sequence of daily means.

**Step 5.** The data stream ${S}_{2}$ contains the daily means of the hydrologic data, which supports statistical analysis over the weekly, ten-day, and monthly time ranges commonly used in hydrology. If necessary, a further time window can be applied to ${S}_{2}$ to aggregate the data at a coarser time scale.
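The two-stage aggregation in Steps 1 to 4 can be illustrated with a minimal batch sketch in plain Python. It is not the library's Flink implementation: the function name `tumbling_mean` and the sample readings are assumptions made for illustration, and a real Flink job would presumably express the same idea with tumbling event-time windows on a keyed stream.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def tumbling_mean(stream, size):
    """Average (timestamp, value) pairs over non-overlapping windows.

    `stream` is an iterable of (datetime, float) pairs and `size` is the
    window length as a timedelta. Returns a list of (window_start, mean)
    pairs sorted by window start. (Hypothetical helper, not part of
    HydroStreamingLib.)
    """
    buckets = defaultdict(list)
    for ts, value in stream:
        # Integer index of the window that contains this timestamp.
        index = (ts - EPOCH) // size
        buckets[index].append(value)
    return [
        (EPOCH + index * size, sum(values) / len(values))
        for index, values in sorted(buckets.items())
    ]


# S0: raw sensor readings (made-up values for illustration only).
s0 = [
    (datetime(2015, 1, 2, 10, 0), 58.0),
    (datetime(2015, 1, 2, 10, 30), 59.0),
    (datetime(2015, 1, 2, 11, 15), 60.0),
    (datetime(2015, 1, 3, 0, 5), 62.0),
]

# W0: one-hour windows over S0 produce the hourly means S1.
s1 = tumbling_mean(s0, timedelta(hours=1))

# W1: 24-hour windows over S1 produce the daily means S2.
s2 = tumbling_mean(s1, timedelta(days=1))

for window_start, mean in s2:
    print(window_start.date(), mean)
```

Because ${W}_{1}$ operates on ${S}_{1}$ rather than on ${S}_{0}$, each daily value is a mean of hourly means. It equals the mean of the raw samples only when every hour contributes the same number of readings.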

#### 4.2. Message Queue

#### 4.3. Processing Module

#### 4.4. Data Storage and Real-Time Display

#### 4.5. Platform Hardware Requirements and Availability

## 5. Experiment

#### 5.1. Dataset and Experimental Environment

#### 5.2. Statistical Analysis of Water Level in Chuhe River

#### 5.3. Contrast Experiment

#### 5.4. Parallel Performance Experiment

#### 5.4.1. Speedup Ratio

#### 5.4.2. Sizeup Ratio

#### 5.4.3. Scaleup Ratio

#### 5.5. Evaluating the System on a Real Dataset

## 6. Conclusions

Characteristic | Methods |
---|---|
Stationarity | Student’s t Test, Simple t Test, Mann–Whitney Test |
Normality | Kolmogorov–Smirnov Test, Jarque–Bera Test, Geary’s Test, Coefficient of Variation Test |
Trend | Kendall Rank Order Correlation Test, Adjacency Test, Difference Sign Test, Mann–Kendall Test, Spearman’s Rank Order Correlation Test, Turning Point Test, Inversions Test |
Homogeneity | Bartlett’s Test, Bayesian Test, Dunnett’s Test, Hartley’s Test, Von-Neumann’s Test |

Timestamp | Station ID | Water Level (m) |
---|---|---|
2015-01-02 10:00:00 | 12910540 | 58.490 |
2015-04-23 13:55:00 | 62916400 | 5.480 |
2015-06-28 16:45:00 | 60403100 | 7.490 |

Month | K–S Test | J–B Test | Geary’s Test | Result |
---|---|---|---|---|
2016.7 | 0.1236 | 2.4358 | 0.963 | √ |
2016.8 | 0.2054 | 2.4079 | 1.1314 | √ |
2016.9 | 0.1899 | 3.3242 | 1.135 | √ |
2016.10 | 0.1615 | 0.0902 | 0.8766 | √ |
2016.11 | 0.3099 | 714.6828 | 0.5949 | × |
2016.12 | 0.2172 | 3.5783 | 1.1199 | √ |
Threshold | 0.24 | 5.991 | 1 | — |

Month | Student’s t Test (Subsequence 1) | Student’s t Test (Subsequence 2) | Student’s t Test (Subsequence 3) | Simple t Test | Mann–Whitney Test | Result |
---|---|---|---|---|---|---|
2016.7 | 1.6793 | 0.778 | −2.4602 | 0.1524 | 0.0437 | × |
2016.8 | 2.9517 | 0.7566 | −3.7088 | 0.0 | 0.0 | × |
2016.9 | 1.6185 | −3.7532 | 2.1351 | 0.8519 | 0.7934 | √ |
2016.10 | −3.0603 | 0.0754 | 2.9856 | 0.0 | 0.0 | × |
2016.11 | 0.2273 | −0.0052 | 0.239 | 0.0121 | 0.9279 | √ |
2016.12 | 1.637 | −2.366 | 0.729 | 0.4429 | 0.6934 | √ |
Threshold | 1.833 | 1.833 | 1.833 | 0.05 | 0.05 | — |

Month | Dunnett’s Test (Subsequence 1) | Dunnett’s Test (Subsequence 2) | Dunnett’s Test (Subsequence 3) | Bayesian Test | Von-Neumann’s Test | Result |
---|---|---|---|---|---|---|
2016.7 | 0.3537 | 0.2006 | 1.6949 | 2.6918 | 2.0144 | √ |
2016.8 | 2.1877 | 2.7654 | 3.3344 | 1.1825 | 2.0237 | √ |
2016.9 | 1.5643 | 1.9273 | 1.4739 | 4.0711 | 2.0716 | √ |
2016.10 | 11.1853 | 13.6305 | 22.6544 | 13.3524 | 0.0456 | × |
2016.11 | 2.1327 | 0.2324 | 1.5253 | 2.0354 | 0.7863 | √ |
2016.12 | 2.7363 | 2.5679 | 2.6607 | 1.8434 | 1.9006 | √ |
Threshold | 2.15 | 2.15 | 2.15 | 2.42 | 2 | — |

Month | SROC Test | KRC Test | Mann–Kendall Test | Result |
---|---|---|---|---|
2016.7 | −2.6142 | −2.9437 | 0.0034 | √ |
2016.8 | −18.6319 | −7.0472 | 0.0 | √ |
2016.9 | −0.4335 | −0.4103 | 0.6947 | × |
2016.10 | NaN | 7.7608 | 0.0 | √ |
2016.11 | −1.3136 | −1.6592 | 0.1007 | × |
2016.12 | −0.8942 | −2.0517 | 0.0655 | × |
Threshold | 2.048 | 1.96 | 0.05 | — |

Data Amount (MB) | HydroStreamingLib Run Time (s), 1 Node | HydroStreamingLib Run Time (s), 2 Nodes | HydroStreamingLib Run Time (s), 3 Nodes | HydroStreamingLib Run Time (s), 4 Nodes | SparkR Run Time (s), 1 Node | SparkR Run Time (s), 2 Nodes | SparkR Run Time (s), 3 Nodes | SparkR Run Time (s), 4 Nodes | RStudio Run Time (s) |
---|---|---|---|---|---|---|---|---|---|
32 | 7 | 5 | 3 | 2 | 19 | 16 | 13 | 7 | 179 |
128 | 29 | 18 | 15 | 12 | 62 | 35 | 27 | 21 | 711 |
512 | 109 | 56 | 37 | 31 | 249 | 131 | 108 | 79 | >3600 |
1024 | 209 | 105 | 77 | 65 | 432 | 246 | 163 | 121 | >3600 |
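
As a worked illustration of how the run times in the table above relate to a speedup ratio, the following sketch computes the speedup for the 1024 MB row. It assumes the common definition $S_p = T_1 / T_p$ (single-node run time divided by the run time on $p$ nodes), which may differ in detail from the definition used in Section 5.4.1. The function name `speedup` is introduced here for illustration only.

```python
# Run times in seconds for the 1024 MB dataset on 1 to 4 nodes,
# copied from the run-time table above.
run_times = {
    "HydroStreamingLib": [209, 105, 77, 65],
    "SparkR": [432, 246, 163, 121],
}


def speedup(times):
    """Speedup relative to the single-node run, assuming S_p = T_1 / T_p."""
    return [times[0] / t for t in times]


for name, times in run_times.items():
    ratios = [round(s, 2) for s in speedup(times)]
    print(f"{name}: {ratios}")
```

Under this definition, an ideal linear speedup on four nodes would be 4, so the ratios printed here show how far each tool falls short of linear scaling on this dataset.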

