A Novel Early Warning System (EWS) for Water Quality, Integrating a High-Frequency Monitoring Database with Efficient Data Quality Control Technology at a Large and Deep Lake (Lake Qiandao), China

Abstract: To assess water quality (WQ) online and assure the safety of drinking water, a novel early warning system (EWS) integrating a high-frequency monitoring system (HFMS) and data quality control (QC) was developed at Lake Qiandao. The HFMS was designed to monitor water quality, nutrient inputs from the main tributaries, water currents and meteorology at different sites on Lake Qiandao. The EWS focused on data availability, a QC method, a statistical analysis method and data applications rather than on the technological aspects of sondes, wireless data transfer and interface software development. QC was implemented before use to delete abnormal values (outliers), detect change points, analyse trends, interpolate discrete missing measurements, and find continuous missing or wrong observations caused by technical problems with the sonde. To demonstrate the advantages and data availability, surface and profiling measurements at two sites were plotted. The plots show obvious seasonal and diel variations, demonstrating the successful integration of the system with advanced automated technology and good QC. The system is now not only giving early warning signals, but also providing critical WQ information for the security of drinking water diverted to Hangzhou city through a tunnel of 110 km length. The automatic monitoring data with QC are also being used to produce initial conditions for WQ prediction based on a three-dimensional hydrodynamic-ecosystem model.


Introduction
Lake eutrophication is a long-term global problem caused by excess nutrient inputs [1] and exacerbated by long water residence times that delay WQ responses to management actions. It is a common problem in the lakes located in the middle and lower catchment of the Yangtze River, even in the "good WQ" lakes classified by the Ministry of Ecology and Environment of China.

A quality assurance (QA) system is critical to the success of any environmental project, and such systems have been successfully applied in climatology, oceanography and other geosciences. However, there has been limited application to developing an EWS with real-time high-frequency monitoring. Therefore, the aims of this study were to introduce a comprehensive EWS developed for Lake Qiandao and the corresponding QC method for real-time high-frequency monitoring data.

Study Area
Built in 1959, Lake Qiandao is located in Chun'an County, in the west of Zhejiang Province, China (29°22′-29°50′N, 118°34′-119°15′E, Figure 1) [22]. It is one of the largest reservoirs in China, with a surface water area of 573 km² and a water capacity of 178.4 × 10⁸ m³ when the water level is 108 m [23]. The mean water depth is 34 m and the maximum depth is 100 m. The lake supplies drinking water to the 450,000 people in Chun'an County. It now also provides drinking water for five million people in Hangzhou City through a tunnel with a length of ~110 km. There are 34 inflow tributaries around the lake. The largest one (Xinan River, Figure 1) enters from the northwest, carrying 51.4% of the total inflow to the lake from all sources, not including rainfall and groundwater. It carries 34.3% of the total phosphorus (TP) loading and 63.7% of the total nitrogen (TN) loading [19]. The multi-year average annual inflow and outflow are 103.44 × 10⁸ m³ and 97.45 × 10⁸ m³, respectively [23], with a residence time of ~668 days.
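The quoted residence time can be checked with a short calculation. The text does not state how it was derived, so this assumes that the residence time $\tau$ is the water capacity $V$ divided by the multi-year average annual outflow $Q_{out}$:

$$\tau \approx \frac{V}{Q_{out}} = \frac{178.4 \times 10^{8}\ \mathrm{m^3}}{97.45 \times 10^{8}\ \mathrm{m^3\,yr^{-1}}} \approx 1.83\ \mathrm{yr} \approx 668\ \mathrm{days}$$

Under this assumption the figures given above reproduce the stated ~668 days. Using the average inflow (103.44 × 10⁸ m³ per year) instead would give roughly 630 days.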

Monitoring Stations
The HFMS at Lake Qiandao includes fourteen buoy stations, four hydrological stations, four flux stations and thirteen river stations. The buoy stations comprise ten buoys measuring surface WQ (Buoy_surface, model EMM700, YSI Incorporated, Yellow Springs, USA) and four 'profiler' buoys (Buoy_profiler, model EMM2500, YSI Incorporated, Yellow Springs, USA). The four hydrological stations (Hydro_station) were deployed at the three main tributaries and the only river outflow, and measure water current speed. The four flux stations (Flux_station, model Tenghai HZF3, Tenghai Science & Technology Ltd., Hangzhou, China) are located alongside the hydrological stations and measure WQ parameters. The thirteen river stations (River_station, model EMM700, YSI Incorporated, Yellow Springs, USA) measure WQ and are deployed at the main inflow tributaries. Meteorological sensors (Met_station) are deployed at the top of each buoy station, except for one site (Figure 1). The Buoy_profilers were deployed at sites 1-4. The profiling information at site 3 is representative of the lake because the site is at the centroid, while site 4 is the deepest monitoring point and is located at the biggest outflow channel. Sites 5 and 6, equipped with Buoy_surface, are near the middle of the lake and capture surface WQ variation. Table 1 shows all the sensor metadata.
The alert range for a specific sensor, given in the seventh column of Table 1, was decided by analysing historical data manually or by using the sonde measurements. The alert thresholds are equal to the corresponding reasonable "minimum-maximum-range" (MMR) of each sonde. When a measured value falls outside the alert range, an alert report is recorded, and the EWS finds the report by searching the alert reports once an hour. Once alert information for a specific sonde is found, a message is sent to the EWS manager's cellphone.
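The alert logic described above can be illustrated with a minimal Python sketch. It is not the authors' software: the `AlertRange` class, the `check_alerts` function and the numeric ranges are hypothetical and only show how a reading might be compared with its alert range to produce an alert report. The real ranges are those listed in Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRange:
    low: float
    high: float


# Hypothetical alert ranges for illustration only; the real ones are in Table 1.
ALERT_RANGES = {
    "pH": AlertRange(6.0, 9.0),
    "DO_con": AlertRange(7.5, 20.0),
}


def check_alerts(site, readings, ranges=ALERT_RANGES):
    """Return one alert report per reading that falls outside its alert range."""
    reports = []
    for name, value in readings.items():
        rng = ranges.get(name)
        if rng is None:
            continue  # no alert range defined for this parameter
        if not (rng.low <= value <= rng.high):
            reports.append(
                {"site": site, "parameter": name, "value": value,
                 "range": (rng.low, rng.high)}
            )
    return reports


if __name__ == "__main__":
    # An hourly job could call this and forward any reports to the manager.
    print(check_alerts("site5", {"pH": 9.9, "DO_con": 8.2, "WT": 20.0}))
```

In a deployment, a scheduled hourly task would collect such reports and forward them as messages. That delivery step is omitted here.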
The integrated sondes, housed in a metal protective cage, move up and down through the water column at a constant speed. The average return times of the sondes are 55 min at site 1 (water column 65 m), 45 min at site 2 (water column 40 m), 50 min at site 3 (water column 46 m), and 30 min at site 4 (water column 16 m). Measurements are recorded every minute at all four sites. All the buoy systems are solar-powered, and the data are transferred by 4G wireless telemetry to a computer server at the Chun'an Branch of the Hangzhou Ecology and Environment Bureau. The whole monitoring system is maintained by Hangzhou Tenghai Science and Technology Limited; the sondes are cleaned once a month to remove bio-fouling and calibrated once every three months for data assurance. The solar power supply system and the wireless data transfer are also regularly checked and maintained by this company.
Abbreviations: water temperature (WT), oxidation reduction potential (ORP), electrical conductivity (COND), dissolved oxygen concentration (DO_con), dissolved oxygen saturation (DO_sat), turbidity (TURB), chlorophyll a (CHLA), phycocyanin (a pigment specific to cyanobacteria, PC), fluorescent dissolved organic matter (FDOM), total nitrogen (TN), total phosphorus (TP), chemical oxygen demand (COD), nitrate (NO3-N), total organic carbon (TOC), relative humidity (RH), air pressure (BP), wind speed (Wind_spd), wind direction (Wind_dir) and air temperature (TEMP).

Data Quality Control
The monitoring stations produce large volumes of data, requiring specialised tools to facilitate quality control and to ensure that data are fit for purpose. We developed bespoke software in Fortran, employing two principal methods of quality control. Firstly, an MMR was adopted, whereby the minimum and maximum acceptable values of the raw data measured by each sensor were specified. This was done by assessing the range of previous observations and defining a 'reasonable range' (larger than or equal to the alert range in Table 1) for each variable, based on a large volume of historical measurements from the lake area and the inflows/outflows. The lowest value in the historical observations was adopted as the minimum value of the MMR, and the maximum value was defined in the same way. Table 2 shows the maximum (Max), minimum (Min) and average (Avg) values for WQ measurements including WT, pH, DO, permanganate index (PI), chemical oxygen demand (COD), five-day biochemical oxygen demand (BOD5), ammonia (NH4-N), TP, TN, CHLA and Secchi depth (SD). Unfortunately, only WT, pH and DO were observed by the monitoring buoys. Subsequently, data outside the specified range for each variable were quarantined with a unique flag number (e.g., '8888') for further investigation.
Table 2. Statistical values of measured water quality parameters at the four sites of Lake Qiandao from April 2001 to May 2021.

The second approach is an "abnormal" value detection method, as follows:
(1) Suspected abnormal value judgement. For a target value, not including the first and last ones (e.g., $x_i$ in Equations (1)-(3)), if it is either larger or smaller than both of its adjacent values, the target value is regarded as a suspected abnormal value and flagged:
$$f_1 = x_i - x_{i-1} \quad (1)$$
$$f_2 = x_{i+1} - x_i \quad (2)$$
$$ff = f_1 \times f_2 \quad (3)$$
where $x_i$ ($i = 2, 3, 4, \ldots, n-1$) represents the time series of buoy measurements, excluding the first and last values. If $ff < 0$, the measurement at the $i$th time is regarded as a suspected abnormal value.
(2) Abnormal value confirmation. We calculate the average value $\bar{x}$ of the raw data after MMR control, the anomaly $|x_{max} - \bar{x}|$ between the maximum value $x_{max}$ and $\bar{x}$, and the anomaly $|x_{min} - \bar{x}|$ between the minimum value $x_{min}$ and $\bar{x}$. The larger of $|x_{max} - \bar{x}|$ and $|x_{min} - \bar{x}|$, denoted $|x - \bar{x}|$, is compared with the absolute value $|ff|$ of $ff$. If $|ff|$ is larger than or equal to $|x - \bar{x}|^2$, the measurement at the $i$th time is confirmed as an abnormal value.
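The two QC steps can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' Fortran code. It assumes that ff is the product of the two adjacent differences around the target value, and that the confirmation threshold is the square of the larger anomaly. The function names and the example range are hypothetical.

```python
FLAG = 8888.0  # quarantine flag used in the text for out-of-range data


def apply_mmr(values, vmin, vmax, flag=FLAG):
    """Replace values outside the minimum-maximum range (MMR) with the flag."""
    return [v if vmin <= v <= vmax else flag for v in values]


def detect_abnormal(values, flag=FLAG):
    """Return indices of values confirmed as abnormal after MMR control."""
    valid = [v for v in values if v != flag]
    if len(valid) < 3:
        return []
    mean = sum(valid) / len(valid)
    # Larger of the two anomalies |x_max - mean| and |x_min - mean|.
    largest_anomaly = max(abs(max(valid) - mean), abs(min(valid) - mean))
    threshold = largest_anomaly ** 2  # assumed reading of the criterion
    abnormal = []
    for i in range(1, len(values) - 1):  # first and last values are excluded
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if flag in (prev, cur, nxt):
            continue  # skip neighbourhoods touching quarantined data
        ff = (cur - prev) * (nxt - cur)
        # ff < 0 means cur is a local extremum (suspected abnormal value).
        if ff < 0 and abs(ff) >= threshold:
            abnormal.append(i)
    return abnormal


if __name__ == "__main__":
    raw = [10.0, 10.2, 10.1, 30.0, 10.3, 10.2, 10.1, 99.0]
    checked = apply_mmr(raw, vmin=0.0, vmax=40.0)  # hypothetical MMR
    print(checked, detect_abnormal(checked))
```

In this example the value 99.0 is quarantined by the MMR step, and the spike of 30.0 is then confirmed as abnormal, while the small local fluctuations around 10 are suspected but not confirmed.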

Change Point and Trend Detections
The Pettitt test was used to detect change points in the data series automatically once a week. Pettitt's test is a nonparametric test that detects a single change point in a continuous time series; the calculation procedure is described in detail in [24]. The values at the identified change points were then compared with the minimum and maximum values for each sensor. If they all fell within the MMR, the change point values were accepted as valid; otherwise, the values were removed from the time series or marked for further checking. An exploratory analysis was also carried out once a week, using the Mann-Kendall method, to detect trends in the hourly and daily data for all parameters. If a data series kept increasing or decreasing for more than one week, its validity was investigated manually and carefully.
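For readers unfamiliar with the two tests, a minimal Python sketch is given below. It follows the standard textbook formulations of the Pettitt and Mann-Kendall statistics rather than the authors' implementation. The Mann-Kendall variance omits the correction for tied values, and the Pettitt p-value uses the usual approximation. Function names are illustrative.

```python
import math


def _sign(v):
    return (v > 0) - (v < 0)


def pettitt_test(x):
    """Return (t, K, p): most likely change point, test statistic, approximate p.

    t is the number of observations before the change.
    """
    n = len(x)
    u = 0
    best_k, best_t = 0, 0
    for t in range(1, n):
        # Recurrence U_t = U_{t-1} + sum_j sgn(x_t - x_j), with x_t = x[t - 1].
        u += sum(_sign(x[t - 1] - xj) for xj in x)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    p = min(1.0, 2.0 * math.exp(-6.0 * best_k ** 2 / (n ** 3 + n ** 2)))
    return best_t, best_k, p


def mann_kendall(x):
    """Return (S, Z, p) for the Mann-Kendall trend test (no tie correction)."""
    n = len(x)
    s = sum(_sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return s, z, p


if __name__ == "__main__":
    step = [0.0] * 10 + [5.0] * 10      # abrupt shift after 10 values
    print(pettitt_test(step))
    print(mann_kendall(list(range(20))))  # steadily increasing series
```

A positive Z with a small p-value indicates an increasing trend, and a negative Z a decreasing one. In the weekly check described above, a significant result would trigger manual investigation rather than automatic removal.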

Data Availability, Daily and Hourly Data Calculation
Most of the buoy monitoring datasets in the lake area (Buoy_surface, Buoy_profiler, Met_station) commenced in September 2015. The River_station data collection began in August 2016, and the Flux_station and Hydro_station data collection began in April 2017. Software developed by the authors is used to analyse and summarise the high-frequency data, including the calculation of daily and hourly values based on quality-controlled raw data. Small data gaps without measurements (≤days) are interpolated by the software, whereas large data gaps are assigned a unique flag number (e.g., '8888') and are excluded from the calculation of daily and hourly values.
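The calculation of hourly and daily values from flagged raw data can be sketched as follows. This is a simplified illustration, not the authors' software: the `aggregate` function and its record layout (timestamp, value) are assumptions, and the flag value follows the '8888' example in the text.

```python
from collections import defaultdict
from datetime import datetime

FLAG = 8888.0  # flag marking quarantined or missing data


def aggregate(records, period="hour", flag=FLAG):
    """Average (timestamp, value) records per hour or per day, skipping flags."""
    buckets = defaultdict(list)
    for ts, value in records:
        if value == flag:
            continue  # flagged data do not contribute to the mean
        if period == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        elif period == "day":
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        else:
            raise ValueError("period must be 'hour' or 'day'")
        buckets[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


if __name__ == "__main__":
    data = [
        (datetime(2019, 7, 3, 0, 0), 20.0),
        (datetime(2019, 7, 3, 0, 30), 22.0),
        (datetime(2019, 7, 3, 1, 0), FLAG),
        (datetime(2019, 7, 3, 1, 15), 24.0),
    ]
    print(aggregate(data, "hour"))
    print(aggregate(data, "day"))
```

Periods that contain only flagged values simply do not appear in the result, so later analyses can treat them as gaps.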

Buoy Photographs
Figure 2 shows photographs of the Buoy_surface system at site 5 (Figure 2A), located at the mouth of the largest tributary (Xinan River, Figure 1), and the Buoy_profiler system at site 4 (Figure 2B), located at the deepest area in front of the dam for the power station (Figure 1), which is the only outflow. The web interface, which dynamically updates all station data from the database, allows the user to make requests for time periods of interest, review data from specific sites, visualise data as a function of time, and perform simple statistical analyses of the real-time data. All historical data from the monitoring system can be downloaded through the web interface by authorised users.


Surface Measurements
The measurements from site 5 are presented here as an example, and show the daily and hourly variations in surface WQ measured by the buoy probes. Figure 3A shows the time series of daily surface WT from 30 September 2015 to 1 August 2020, and daily surface DO, CHLA and PC from 27 January 2016 to 1 August 2020. The maximum (Max), minimum (Min), average and standard deviation (Stdev) values for WT, DO, CHLA and PC at site 5 are given in Table 3. The maximum values of WT, CHLA and PC all occurred in summer, whereas the minimum values occurred in winter (PC) or spring (WT, DO and CHLA). CHLA and PC showed higher variability over time than WT and DO during the study period, based on their Stdevs relative to their average values. To show diel variation, hourly data of WT, DO, CHLA and PC at site 5 for the period from 00:00 on 3 July 2019 to 23:00 on 16 July 2019, without data gaps, are presented as an example in Figure 3B. CHLA and PC show obvious diel variation, with higher values in the daytime than at night, while WT and DO remained more constant than CHLA and PC, showing no diel variation. The Pearson correlation coefficient between WT and DO is 0.8 (n = 336), suggesting that surface DO was mainly controlled by WT for the lake.
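The sample size n = 336 corresponds to 14 days of hourly values (14 × 24). A minimal sketch of the Pearson correlation coefficient used for this comparison is given below. The function name and the toy inputs are illustrative and are not the site 5 data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    wt = [1.0, 2.0, 3.0, 4.0]          # toy values, not measurements
    do = [2.0, 4.0, 6.0, 8.0]
    print(pearson_r(wt, do))           # perfectly linear toy series
```

The function would be applied to the paired hourly WT and DO series after QC, with flagged values removed from both series at the same time steps.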

Profiling Measurements
Profiles of WT, DO, CHLA and PC at site 4 (deepest area) for 1 January 2016-10 July 2020 are shown in Figure 4. The water column was stratified in summer and mixed in winter, although some periods and layers lacked measurements. CHLA (Figure 4C) followed a similar pattern to WT, with higher values in summer than in winter, suggesting that phytoplankton biomass is mainly regulated by water temperature rather than nutrients. Phytoplankton was mostly distributed in the upper 15 m, except in late 2018 and early 2019, when it could still be found at a depth of 30 m. PC values (Figure 4D) were much smaller than CHLA at the same depth. PC did not show distinct seasonal variation, but was clearly stratified in the summer of 2018, with a higher concentration in the lower layer than in the upper layer.

Real-Time Early Warning Information
The Ministry of Ecology and Environment of China (MEEC) issued state standards for surface water quality in 2002 [25] in order to better manage surface water in China. Lake Qiandao was required to meet Grade I (Table 4; the requirements for heavy metals are not shown in the table), since it provides drinking water for approximately half a million people in Chun'an County (located to the northeast of the lake) and a total of 10 million people in Hangzhou City and Jiaxing City, with water diverted through a tunnel of more than 110 km in length. DO and pH are the only two of these parameters measured by the wireless sondes deployed on the monitoring buoys at the lake. The statistical analysis results for DO and pH are shown in Table 5. In total, 275 pH samples and 130 DO samples fell outside their MMRs at site 5, accounting for 11.0% and 8.0% of all the valid samples, respectively. The percentages at site 6 (21.5% for pH and 11.3% for DO) were higher than those at site 5. For the out-of-range pH samples, the observed maximum/minimum/average values at site 5 and site 6 were 9.9/9.002/9.22 and 12.2/9.0002/9.62, respectively. Thus, the maximum pH at site 6 was much greater than that at site 5, showing that pH was more variable at site 6 than at site 5.
However, DO followed the reverse pattern, with more variable values at site 5 than at site 6. Its lowest value was as low as 1.74 mg L⁻¹, observed on 3 April 2016.
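Exceedance statistics of the kind reported in Table 5 can be computed with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and the pH limits of 6-9 used in the example are an assumption made for demonstration, not values taken from Table 4.

```python
def out_of_range_summary(values, low, high):
    """Summarise samples that fall outside the range [low, high]."""
    outside = [v for v in values if v < low or v > high]
    summary = {
        "count": len(outside),
        "percent": 100.0 * len(outside) / len(values) if values else 0.0,
    }
    if outside:
        summary["max"] = max(outside)
        summary["min"] = min(outside)
        summary["mean"] = sum(outside) / len(outside)
    return summary


if __name__ == "__main__":
    ph = [8.0, 9.5, 7.0, 9.1]                      # toy values, not site data
    print(out_of_range_summary(ph, low=6.0, high=9.0))
```

Reporting both the percentage and the statistics of the out-of-range samples, as done in the text for sites 5 and 6, shows how often the limit is exceeded and by how much.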

Discussion
The EWS is now being used to give real-time early warning signals by checking whether the measurements of each sensor fall outside the corresponding MMR and whether they meet the required WQ grade. Unfortunately, DO and pH are the only two parameters that were directly measured by the buoy sensors and that can be used for the early warning of WQ at this lake. Therefore, it should be seriously considered whether sensors measuring nutrients (e.g., TN, TP, NH4-N, COD and PI) should be added to the buoys, or whether a model (e.g., AEM3D) simulating these nutrients based on high-frequency monitoring data needs to be developed, in order to provide more satisfactory early warning signals.
The system is also providing data for horizontal interpolation, to produce an initial condition for AEM3D to predict harmful algal blooms (HABs) and WQ at a time scale of seven days. For a WQ and HAB prediction system, initial and boundary conditions are very important for prediction accuracy. The initial conditions include the horizontal and vertical WQ distribution within the waterbody (e.g., TN, TP and CHLA concentrations at each grid location). A major challenge with initial conditions derives from the limited number of monitoring buoys collecting high-frequency WQ data, due to economic considerations, which leads to an inaccurate spatial distribution of WQ (e.g., the high spatial patchiness of cyanobacterial blooms). Inaccurate spatial WQ distribution in model initialisation can lead to inaccurate WQ prediction at each grid location, and thus to unconvincing simulation of algal aggregation caused by winds. Many advanced interpolation methods are now available to address this issue. However, temporal and spatial limitations prevent conventional water sampling and laboratory analysis from meeting the requirements for producing model initial conditions. A comprehensive approach is required, integrating high-frequency buoy monitoring data, laboratory data, satellite images and other available resources, to provide a satisfactory spatial WQ distribution for the initialisation of predictive simulation systems.
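As a simple illustration of horizontal interpolation from a few buoys to model grid locations, the sketch below uses inverse distance weighting (IDW). The text does not name the interpolation method used in the system, so IDW is swapped in here purely as a basic example. The coordinates, values and function name are hypothetical.

```python
import math


def idw(stations, target, power=2.0):
    """Inverse-distance-weighted estimate at target from (x, y, value) stations."""
    numerator = 0.0
    denominator = 0.0
    for x, y, value in stations:
        dist = math.hypot(x - target[0], y - target[1])
        if dist == 0.0:
            return value  # target coincides with a station
        weight = 1.0 / dist ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical buoy positions (km) and CHLA values, not measured data.
    buoys = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
    print(idw(buoys, (1.0, 0.0)))  # midpoint between the two buoys
```

With so few stations, any such scheme smooths over the patchiness noted above, which is why combining buoy data with laboratory data and satellite images is suggested in the text.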
WQ sonde (multi-sensor probe) measurements can efficiently provide data wirelessly at a high temporal resolution, but potential problems include data distortion due to sensor faults, and data gaps caused by failed data storage and/or transfer. Therefore, data QC procedures are necessary before the sensor measurements are used. The first step of QC is typically to detect missing series and estimate missing values from neighbouring observations, and then to detect unreasonable values outside the range between the upper and lower limits for each parameter, ideally guided by experience with the specific water body and measurement type. Unreasonable values can be removed and substituted with interpolated values. If many successive measurements from the same sensor have exactly the same value, they should usually also be removed and interpolated from neighbouring values, or excluded from data analysis. The final step is to detect outliers (incorrect or out-of-range measurements), which can be removed or treated as missing [26]. Outliers are typically observations that represent abrupt increases or decreases compared with the neighbouring values. There are many methods [26][27][28] and pieces of software [29] available for data QC. In this paper, we adopted the outlier detection method for finding anomalous values, which were removed and generally replaced with interpolated values. The detection of change points and trends further helps to find abnormal values or sonde problems. In our system, interpolation was not used to reconstruct measurements missing over longer periods, because interpolating across long gaps has a high potential to generate erroneous values. Interpolation methods or software suited to such gaps therefore still need to be integrated into the system to produce data where measurements are missing.
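Two of the generic QC steps mentioned above, detecting runs of identical values and filling only short gaps, can be sketched in Python as follows. The function names and thresholds (`min_run`, `max_gap`) are hypothetical choices for illustration, and missing measurements are represented by `None`.

```python
def flag_stuck_runs(values, min_run=5):
    """Return indices belonging to runs of at least min_run identical values."""
    stuck = []
    start = 0
    for i in range(1, len(values) + 1):
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_run:
                stuck.extend(range(start, i))
            start = i
    return stuck


def fill_short_gaps(values, max_gap=3):
    """Linearly interpolate runs of None no longer than max_gap.

    Gaps at the ends of the series, or longer than max_gap, are left as None.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1
            gap = j - i
            if i > 0 and j < n and gap <= max_gap:
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return out


if __name__ == "__main__":
    print(flag_stuck_runs([1, 2, 2, 2, 2, 2, 3]))
    print(fill_short_gaps([1.0, None, None, 4.0, None, None, None, None, 9.0]))
```

Leaving long gaps unfilled mirrors the caution expressed above about interpolating over longer periods. The stuck-value check should be run on measured values only, before gaps are represented as `None`.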
The whole buoy monitoring system was originally designed to provide essential WQ information, in order to check that Grade I is met at the required sites. Therefore, Buoy_profilers were deployed at site 1, site 3 and site 4, and a Buoy_surface was deployed at site 5. However, the buoys can only monitor WT, DO, pH, CHLA, PC and TURB. It is very difficult and expensive to measure TN, TP, NH4-N, PI, COD and BOD5 directly and accurately at a high frequency and in near-real-time [30]. One option in this early warning system is to calculate these parameters from their regression relations with sonde-measured values. The values calculated from the regression equations can then provide vital WQ information for different zones of the lake, feeding the AEM3D model (http://www.hydronumerics.com.au, accessed on 15 January 2022) for WQ prediction.
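The regression option can be illustrated with an ordinary least-squares line relating a laboratory-measured parameter to a sonde-measured one. The sketch below is not a fitted relation for Lake Qiandao: the pairing of variables, the numbers and the function names are hypothetical, and a real application would need paired laboratory and sonde data and a check of the fit quality.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Estimate the laboratory parameter from a sonde reading."""
    return slope * x + intercept


if __name__ == "__main__":
    sonde = [1.0, 2.0, 3.0]   # hypothetical sonde readings (e.g., TURB)
    lab = [3.0, 5.0, 7.0]     # hypothetical paired laboratory values
    slope, intercept = fit_line(sonde, lab)
    print(slope, intercept, predict(slope, intercept, 4.0))
```

Once fitted, such a relation could be applied to each new sonde reading to estimate the unmeasured parameter in near-real-time, subject to periodic re-calibration against laboratory samples.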
The collected data may assist water environmental managers in identifying and predicting the impacts of climatic extreme events [31]. For example, at Lake Qiandao, rainstorms with high rainfall typically result in a large inflow of water and nutrient loading, including N, P, and organic/inorganic matter, leading to an abrupt increase in water level and a significant increase in regional N and P concentrations [22]. This markedly alters the spectral absorption properties of chromophoric dissolved organic matter (CDOM) and particles in the northwestern, southwestern and northeastern areas [32]. HFMS can also provide a useful basis for theory and model development, improving our understanding of lake (reservoir) responses to perturbations caused by human activities and climate change at different time scales (e.g., sub-hourly, hourly, daily, monthly, seasonal, annual and decadal).
Although HFMS has a wide range of applications, it is still hampered by several factors. For example, there is a limited choice of water quality sensors that are robust, economical and low-maintenance. The accuracy of the sondes measuring chemical parameters (e.g., phosphorus, ammonium, ammonia and nitrite) and biological parameters (e.g., bacterial enumeration, cyanobacteria, biota and cyanotoxins) still needs to improve, although the fast spread of HFMS is encouraging sensor developers to improve the technology as quickly as they can. To give better early warning signals and real-time WQ assessment, it will be necessary in the future to add sondes measuring chemical and biological parameters to the current HFMS.