# A Practical Yet Accurate Real-Time Statistical Analysis Library for Hydrologic Time-Series Big Data


## Abstract


## 1. Introduction

- We propose and implement HydroStreamingLib, a library for hydrologic time-series analysis built on Apache Flink. The library contains 19 hydrologic time-series test methods in four categories, extending the Flink ecosystem into the hydrologic field.
- Based on HydroStreamingLib, we construct a hydrologic stream data verification system that applies statistical tests to large-scale, high-velocity hydrologic stream data and visualizes the test results in real time. It provides a complete pipeline from data collection, transmission, and analysis to persistence and visualization.
- We apply HydroStreamingLib to a real-world problem and evaluate the algorithms in the library from several aspects. Compared with other general-purpose methods and tools, HydroStreamingLib achieved better results on real datasets.

## 2. Related Works

## 3. The Proposed Library

#### 3.1. Statistical Test Methods

#### 3.2. Data Importation and Preprocessing

#### 3.2.1. Data Duplication

#### 3.2.2. Data Anomaly

#### 3.2.3. Missing Data

#### 3.3. Distributed Implementation

#### 3.4. Characteristic Discrimination

## 4. Hydrologic Real-Time Analysis System

#### 4.1. Data Aggregation Based on Time Window

**Step 1.** A rolling (tumbling) time window ${W}_{0}$ is established over the hydrologic data stream ${S}_{0}=\left({T}_{0},{V}_{0}\right)$ flowing into Flink, where ${T}_{0}$ is the sensor's sampling timestamp and ${V}_{0}$ is the sampled hydrologic value. The window size defaults to 1 h in this paper and can be adjusted to the actual situation.

**Step 2.** The mean of the hydrologic data in each hour is calculated in ${W}_{0}$ as a measure of the central tendency of the window data. Because the data have already been preprocessed, the mean is reasonably robust at this point.

**Step 3.** Applying ${W}_{0}$ to the data stream ${S}_{0}$ yields the output stream ${S}_{1}=\left({T}_{1},{V}_{1}\right)$, where ${T}_{1}$ is the timestamp at one-hour intervals and ${V}_{1}$ is the sequence of hourly means of the hydrologic data. A new rolling (tumbling) window ${W}_{1}$ with a window size of 24 h is then created.

**Step 4.** The mean of each day's hydrologic data is calculated in ${W}_{1}$. Applying ${W}_{1}$ to the data stream ${S}_{1}$ yields the output stream ${S}_{2}=\left({T}_{2},{V}_{2}\right)$, where ${T}_{2}$ is the timestamp at one-day intervals and ${V}_{2}$ is the sequence of daily means of the hydrologic data.

**Step 5.** The data stream ${S}_{2}$ contains the daily means of the hydrologic data, which supports statistical analysis over the time ranges commonly used in hydrology: weeks, ten-day periods, and months. If necessary, a further time window can be applied to ${S}_{2}$ to aggregate at a coarser time scale.
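
The two-stage aggregation in Steps 1 to 5 can be sketched in plain Python as follows. This is a minimal batch illustration, not the library's Flink implementation: the helper name `tumbling_mean`, the use of integer epoch seconds as timestamps, and the alignment of windows to multiples of the window size are all assumptions made for the example.

```python
from collections import defaultdict

HOUR = 3600        # size of W0 in seconds (default 1 h)
DAY = 24 * HOUR    # size of W1 in seconds (24 h)


def tumbling_mean(stream, window_seconds):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows
    and return (window_start, mean) pairs in time order.

    Hypothetical helper: timestamps are assumed to be integer epoch seconds.
    """
    acc = defaultdict(lambda: [0.0, 0])  # window_start -> [sum, count]
    for ts, value in stream:
        start = ts - ts % window_seconds  # start of the window containing ts
        acc[start][0] += value
        acc[start][1] += 1
    return [(start, total / count)
            for start, (total, count) in sorted(acc.items())]


# Synthetic S0: one sample every 5 minutes for two days (illustrative values).
s0 = [(t, float(t // HOUR)) for t in range(0, 2 * DAY, 300)]

s1 = tumbling_mean(s0, HOUR)  # hourly means (output of W0)
s2 = tumbling_mean(s1, DAY)   # daily means (output of W1)
print(len(s1), s2)
```

Because ${V}_{2}$ is a mean of hourly means, each hour carries equal weight in the daily value regardless of how many raw samples it contained. A coarser window (for example, ten days) could be chained in the same way by calling the helper on `s2`.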

#### 4.2. Message Queue

#### 4.3. Processing Module

#### 4.4. Data Storage and Real-Time Display

#### 4.5. Platform Hardware Requirements and Availability

## 5. Experiment

#### 5.1. Dataset and Experimental Environment

#### 5.2. Statistical Analysis of Water Level in Chuhe River

#### 5.3. Contrast Experiment

#### 5.4. Parallel Performance Experiment

#### 5.4.1. Speedup Ratio

#### 5.4.2. Sizeup Ratio

#### 5.4.3. Scaleup Ratio

#### 5.5. Evaluating the System on a Real Dataset

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest


Characteristic | Methods |
---|---|
Stationarity | Student’s t Test; Simple t Test; Mann–Whitney Test |
Normality | Kolmogorov–Smirnov Test; Jarque–Bera Test; Geary’s Test; Coefficient of Variation Test |
Trend | Kendall Rank Order Correlation Test; Adjacency Test; Difference Sign Test; Mann–Kendall Test; Spearman’s Rank Order Correlation Test; Turning Point Test; Inversions Test |
Homogeneity | Bartlett’s Test; Bayesian Test; Dunnett’s Test; Hartley’s Test; Von-Neumann’s Test |

Timestamp | Station ID | Water Level (m) |
---|---|---|
2015-01-02 10:00:00 | 12910540 | 58.490 |
2015-04-23 13:55:00 | 62916400 | 5.480 |
2015-06-28 16:45:00 | 60403100 | 7.490 |

Month | K–S Test | J–B Test | Geary’s Test | Result |
---|---|---|---|---|
2016.7 | 0.1236 | 2.4358 | 0.963 | √ |
2016.8 | 0.2054 | 2.4079 | 1.1314 | √ |
2016.9 | 0.1899 | 3.3242 | 1.135 | √ |
2016.10 | 0.1615 | 0.0902 | 0.8766 | √ |
2016.11 | 0.3099 | 714.6828 | 0.5949 | × |
2016.12 | 0.2172 | 3.5783 | 1.1199 | √ |
Threshold | 0.24 | 5.991 | 1 | — |
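
As an illustration of how a value in the J–B column can be obtained, the sketch below computes the textbook Jarque–Bera statistic, $JB = \frac{n}{6}\left(S^{2} + \frac{(K-3)^{2}}{4}\right)$, from the sample skewness $S$ and kurtosis $K$. The threshold 5.991 in the table matches the 0.95 quantile of the chi-squared distribution with two degrees of freedom. The function name `jarque_bera` and the sample values are hypothetical, and whether the library uses exactly this form of the statistic is an assumption.

```python
def jarque_bera(values):
    """Return the Jarque-Bera statistic of a sample (textbook form).

    Uses population (biased) central moments, as in the standard definition.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


JB_THRESHOLD = 5.991  # chi-squared(2) critical value at the 0.05 level

# Illustrative samples, not data from the paper.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
skewed = [0.0] * 20 + [100.0]

for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
    jb = jarque_bera(sample)
    verdict = "not rejected" if jb < JB_THRESHOLD else "rejected"
    print(f"{name}: JB = {jb:.4f}, normality {verdict}")
```

A single extreme value is enough to push the statistic far beyond the threshold, which is consistent with the very large J–B value reported for 2016.11.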

Month | Student’s t Test, Subsequence 1 | Student’s t Test, Subsequence 2 | Student’s t Test, Subsequence 3 | Simple t Test | Mann–Whitney Test | Result |
---|---|---|---|---|---|---|
2016.7 | 1.6793 | 0.778 | −2.4602 | 0.1524 | 0.0437 | × |
2016.8 | 2.9517 | 0.7566 | −3.7088 | 0.0 | 0.0 | × |
2016.9 | 1.6185 | −3.7532 | 2.1351 | 0.8519 | 0.7934 | √ |
2016.10 | −3.0603 | 0.0754 | 2.9856 | 0.0 | 0.0 | × |
2016.11 | 0.2273 | −0.0052 | 0.239 | 0.0121 | 0.9279 | √ |
2016.12 | 1.637 | −2.366 | 0.729 | 0.4429 | 0.6934 | √ |
Threshold | 1.833 | 1.833 | 1.833 | 0.05 | 0.05 | — |

Month | Dunnett’s Test, Subsequence 1 | Dunnett’s Test, Subsequence 2 | Dunnett’s Test, Subsequence 3 | Bayesian Test | Von-Neumann’s Test | Result |
---|---|---|---|---|---|---|
2016.7 | 0.3537 | 0.2006 | 1.6949 | 2.6918 | 2.0144 | √ |
2016.8 | 2.1877 | 2.7654 | 3.3344 | 1.1825 | 2.0237 | √ |
2016.9 | 1.5643 | 1.9273 | 1.4739 | 4.0711 | 2.0716 | √ |
2016.10 | 11.1853 | 13.6305 | 22.6544 | 13.3524 | 0.0456 | × |
2016.11 | 2.1327 | 0.2324 | 1.5253 | 2.0354 | 0.7863 | √ |
2016.12 | 2.7363 | 2.5679 | 2.6607 | 1.8434 | 1.9006 | √ |
Threshold | 2.15 | 2.15 | 2.15 | 2.42 | 2 | — |

Month | SROC Test | KRC Test | Mann–Kendall Test | Result |
---|---|---|---|---|
2016.7 | −2.6142 | −2.9437 | 0.0034 | √ |
2016.8 | −18.6319 | −7.0472 | 0.0 | √ |
2016.9 | −0.4335 | −0.4103 | 0.6947 | × |
2016.10 | NaN | 7.7608 | 0.0 | √ |
2016.11 | −1.3136 | −1.6592 | 0.1007 | × |
2016.12 | −0.8942 | −2.0517 | 0.0655 | × |
Threshold | 2.048 | 1.96 | 0.05 | — |
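
The Mann–Kendall column appears to report a p-value, given its threshold of 0.05. A minimal sketch of the standard Mann–Kendall trend test is shown below: it computes the statistic $S$, its variance under the null hypothesis of no trend, the continuity-corrected $Z$ score, and a two-sided p-value from the normal approximation. The function name `mann_kendall` is hypothetical, the variance formula omits the correction for tied values, and this is not necessarily how the library implements the test.

```python
import math


def mann_kendall(values):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test.

    Simplifying assumption: no tie correction is applied to the variance.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference

    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0

    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return s, z, p_value


# Illustrative series, not data from the paper.
rising = [float(v) for v in range(10)]
s, z, p = mann_kendall(rising)
print(f"S = {s}, Z = {z:.4f}, p = {p:.6f}, trend detected: {p < 0.05}")
```

The pairwise loop is quadratic in the series length, which is acceptable for daily or monthly aggregates but would need a different formulation for raw high-frequency streams.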

Data Amount (MB) | HydroStreamingLib Run Time (s), 1 Node | HydroStreamingLib Run Time (s), 2 Nodes | HydroStreamingLib Run Time (s), 3 Nodes | HydroStreamingLib Run Time (s), 4 Nodes | SparkR Run Time (s), 1 Node | SparkR Run Time (s), 2 Nodes | SparkR Run Time (s), 3 Nodes | SparkR Run Time (s), 4 Nodes | RStudio Run Time (s) |
---|---|---|---|---|---|---|---|---|---|
32 | 7 | 5 | 3 | 2 | 19 | 16 | 13 | 7 | 179 |
128 | 29 | 18 | 15 | 12 | 62 | 35 | 27 | 21 | 711 |
512 | 109 | 56 | 37 | 31 | 249 | 131 | 108 | 79 | >3600 |
1024 | 209 | 105 | 77 | 65 | 432 | 246 | 163 | 121 | >3600 |
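
The run times in the table above can be turned into speedup figures with a few lines of Python. The sketch assumes the usual definitions, speedup $= T_{1}/T_{n}$ and parallel efficiency $= \text{speedup}/n$, where $T_{n}$ is the run time on $n$ nodes. The paper's own definitions are not reproduced in this section, so these formulas and the helper name `speedup_and_efficiency` are assumptions.

```python
def speedup_and_efficiency(run_times):
    """Map node count -> (speedup, efficiency) relative to the 1-node run.

    run_times: dict of {node_count: run_time_in_seconds}, must contain key 1.
    """
    base = run_times[1]
    result = {}
    for nodes, seconds in sorted(run_times.items()):
        speedup = base / seconds
        result[nodes] = (speedup, speedup / nodes)
    return result


# Run times (s) taken from the 1024 MB row of the table above.
hydro_1024 = {1: 209, 2: 105, 3: 77, 4: 65}
sparkr_1024 = {1: 432, 2: 246, 3: 163, 4: 121}

for name, times in (("HydroStreamingLib", hydro_1024),
                    ("SparkR", sparkr_1024)):
    for nodes, (speedup, efficiency) in speedup_and_efficiency(times).items():
        print(f"{name}: {nodes} node(s), "
              f"speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

For the 1024 MB workload, HydroStreamingLib on four nodes runs in 65 s against 209 s on one node, a speedup of roughly 3.2.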

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sun, J.; Ye, F.; Nedjah, N.; Zhang, M.; Xu, D.
A Practical Yet Accurate Real-Time Statistical Analysis Library for Hydrologic Time-Series Big Data. *Water* **2023**, *15*, 708.
https://doi.org/10.3390/w15040708
