Water-Quality Data Imputation with a High Percentage of Missing Values: A Machine Learning Approach

: The monitoring of surface-water quality followed by water-quality modeling and analysis are essential for generating effective strategies in surface-water-resource management. However, worldwide, particularly in developing countries, water-quality studies are limited due to the lack of a complete and reliable dataset of surface-water-quality variables. In this context, several statistical and machine-learning models were assessed for imputing water-quality data at six monitoring stations located in the Santa Luc í a Chico river (Uruguay), a mixed lotic and lentic river system. The challenge of this study is represented by the high percentage of missing data (between 50% and 70%) and the high temporal and spatial variability that characterizes the water-quality variables. The competing algorithms implement univariate and multivariate imputation methods (inverse distance weighting (IDW), Random Forest Regressor (RFR), Ridge (R), Bayesian Ridge (BR), AdaBoost (AB), Hubber Regressor (HR), Support Vector Regressor (SVR) and K-nearest neighbors Regressor (KNNR)). According to the results, more than 76% of the imputation outcomes are considered “satisfactory” (NSE > 0.45). The imputation performance shows better results at the monitoring stations located inside the reservoir than those positioned along the mainstream. IDW was the model with the best imputation results, followed by RFR, HR and SVR. The approach proposed in this study is expected to aid water-resource researchers and managers in augmenting water-quality datasets and overcoming the missing data issue to increase the number of future studies related to the water-quality matter.


Introduction
Monitoring, modeling and management represent the three foundations for building an effective pollution-control strategy [1]. They strictly depend on each other: there is no management without modeling and no modeling without exhaustive monitoring. Therefore, any problem related to data collection is then reflected in the performance of the modeling and management phases. Consequently, it is crucial first to acknowledge what improvement would result if all the available data could be well exploited [2].
The issue of missing data frequently occurs in environmental fields due to sensor failures, weak or inexistent strategy for coordinating monitoring campaigns, a change in the measurement site, in data collectors or to the equipment over time, budget issues [3,4]. Such water-quality data problem is particularly significant in developing countries where monitoring stations and monitoring frequency is scarce, and the percentage of missing data is exceptionally high [5].
It is possible to deal with missing data in two different ways: deletion or imputation [6]. Deletion consists of removing the observations or the features characterized by missing values, while imputation involves reconstructing missing data. Deletion is typically the default method adopted since it is rapid and straightforward [7]. However, in several fields, there are many examples in which such a technique presented some restrictions. It reduces the dataset size and may lead to biased results and a loss of critical information, mainly when a high percentage of missing values characterizes the dataset. Among the most straightforward imputation techniques, there are mean imputation and linear interpolation (which rely only on the available time-series data to perform the imputation), arithmetic, and weighted averaging. However, these techniques have shown poor performance when the dataset is characterized by a significant length of the missing sequence [5].
Another common approach used to fill in missing data, which is part of the univariate imputation methods, is to use information from the neighboring monitoring stations. The inverse distance weight (IDW) is a technique that has been successfully adopted for environmental datasets, particularly for meteorological variables [8][9][10][11].
In the last decade, progressively more advanced techniques have been adopted to reconstruct environmental time series [12,13]. Among them, machine-learning techniques that can handle multivariate inputs are the most widely used. Aguilera et al. [5] adopted three different methods (spatio-temporal kriging, multiple imputations by chained equations through predictive mean matching and random forest) to reconstruct daily precipitation time series characterized by extreme missingness (>90%). They found that spatiotemporal kriging simulates rainfall distribution under missing chronological patterns more reliably than the other two techniques. Sattari et al. [14] provided an in-depth comparison of ten different statistical and machine-learning models to impute monthly precipitation data. Computational results showed that arithmetic averaging, multiple linear regressors and non-linear iterative partial least squares perform best among the classical statistical methods. The multiple imputation technique performed best when rainfall data from more than one dependent station were considered. In addition, Barrios et al. [10] compared the performance of five models for filling monthly precipitation records, finding that artificial neural network, multiple linear regression and IDW showed the best performance.
Most of the imputation works presented in the scientific literature refer to meteorological variables and, sometimes, to hydrologic variables like streamflow [15]. To our knowledge, there are few works related to the imputation of water-quality data. Tabari and Talaee [16] employed artificial neural networks to successfully recover missing values of 13 water-quality parameters at five monitoring stations in the South of Iran. Srebotnjak et al. [17] adopted hot-deck imputation to improve a country-level water quality index, calculated by considering dissolved oxygen, electrical conductivity, pH, total phosphorus and total nitrogen. Ratolojanahary et al. [7] assessed for the first time the problem of high omission rate (even higher than 80%) in a water-quality dataset by adopting four machine-learning models (random forest, boosted regression trees, k-nearest neighbors and support vector regression). However, there is no comprehensive evaluation of different types of imputation models in the context of water-quality data characterized by a high percentage of incompleteness.
Since the beginning of systematic water-quality monitoring in 2004, Uruguay has been suffering the problem of data scarcity, which causes significant limitations in developing and implementing reliable and accurate water-quality models. The shortage of these models unavoidably produces the lack of management tools to design effective policies to mitigate pollution impacts on receiving water bodies.
Based on these considerations, this study aims at augmenting the current waterquality dataset of one of the most important Uruguayan watersheds, Santa Lucía Chico. In particular, we assess the performance of several univariate and multivariate imputation models (statistical and machine learning) to impute missing bi-monthly water-quality data and duplicating the size of the data refining the time series to a monthly frequency. Water-quality variables, in this study, include water temperature, electrical conductivity, Sustainability 2021, 13, 6318 3 of 17 pH, dissolved oxygen, total nitrogen, nitrite, nitrate and turbidity. This work presents two significant challenges: the high missingness percentage (between 50% and 70%) and the high temporal and spatial distribution of the variables under study.
At the national level, this study is expected to pave the path to future studies related to the water quality in Santa Lucía Chico (e.g., implementing reliable water-quality modeling tools, simulation and prediction of water-quality variables, scenario analysis). Globally, this methodology is expected to help water-resource researchers and managers in augmenting their water-quality datasets and overcoming the problem of missing data.

Methodology Description
The flowchart reported in Figure 1 describes the methodology adopted to accomplish the main goals of this study.
augmenting their water-quality datasets and overcoming the problem of missing data.

Methodology Description
The flowchart reported in Figure 1 describes the methodology adopted to accomplish the main goals of this study.
Five steps can be identified: 1. Pre-processing data: The dataset was pre-processed before any analysis to deal with the different units, orders of magnitude, not unified variable names and different sampling frequencies. 2. Data profiling: The dataset was analyzed with the aim of studying the distribution of the variables, their missing data and data quality (the dataset is described in Section 2.2, and the results of this step are reported in Section 3.1). 3. Variable correlations: Correlations among variables were considered to help the multivariate imputation techniques (Section 2.5). 4. Imputation: The selected imputation models were assessed, and their loss functions were computed (the imputation techniques and the imputation performance evaluation are described in Sections 2.3 and 2.4, respectively). 5. Best model selection: For each variable at each monitoring site, the model with the highest performance was selected as "the best model" (Section 3.2).  Figure 1. Methodology adopted for the imputation process.

Dataset Description
Uruguay has a humid subtropical climate (Cfa, according to the Köppen climate classification) with a mean temperature in the warmest month equal to 22 °C or higher [18]. The study area is characterized by total annual precipitation that varies between 1000 mm and 1500 mm and a temperature that can vary between 3 °C and 30 °C [19]. The region has a landscape of smooth hills with an average slope equal to 2.68%.
The water-quality dataset used in this study includes the following physical and chemical variables: water temperature (Tw) [ [20]. Data were collected from 2014 to 2020, with a bi-monthly frequency, at six monitoring stations located along the Santa Lucía Chico river, Uruguay. This is a mixed lotic and lentic system with wide national importance since its waters flow into the Paso Severino reservoir, the primary national drinking water source [19,[21][22][23]. The first three upstream monitoring stations (SLC01, SLC02 and PS01) are located before Five steps can be identified:

1.
Pre-processing data: The dataset was pre-processed before any analysis to deal with the different units, orders of magnitude, not unified variable names and different sampling frequencies.

2.
Data profiling: The dataset was analyzed with the aim of studying the distribution of the variables, their missing data and data quality (the dataset is described in Section 2.2, and the results of this step are reported in Section 3.1).

3.
Variable correlations: Correlations among variables were considered to help the multivariate imputation techniques (Section 2.5).

4.
Imputation: The selected imputation models were assessed, and their loss functions were computed (the imputation techniques and the imputation performance evaluation are described in Sections 2.3 and 2.4, respectively). 5.
Best model selection: For each variable at each monitoring site, the model with the highest performance was selected as "the best model" (Section 3.2).

Dataset Description
Uruguay has a humid subtropical climate (Cfa, according to the Köppen climate classification) with a mean temperature in the warmest month equal to 22 • C or higher [18]. The study area is characterized by total annual precipitation that varies between 1000 mm and 1500 mm and a temperature that can vary between 3 • C and 30 • C [19]. The region has a landscape of smooth hills with an average slope equal to 2.68%.
The water-quality dataset used in this study includes the following physical and chemical variables:  [20]. Data were collected from 2014 to 2020, with a bi-monthly frequency, at six monitoring stations located along the Santa Lucía Chico river, Uruguay. This is a mixed lotic and lentic system with wide national importance since its waters flow into the Paso Severino reservoir, the primary national drinking water source [19,[21][22][23]. The first three upstream monitoring stations (SLC01, SLC02 and PS01) are located before the reservoir; the other three stations (PS03, PS04 and PS02) are located in the lake ( Figure 2).    Some hydro-meteorological variables that may influence the water-quality variables under study were also considered to support the multivariate techniques. In particular, air temperature (Ta) (minimum, average, maximum) [°C], solar radiation (SR) [cal/cm 2 /d] and heliophany (Hel) (sunshine hours) [h] were used for Tw imputation. These data were collected daily by the National Institution of Agricultural Research (INIA) and have no missing values. Ta along with daily evapotranspiration (ET) data, also calculated from INIA (time series characterized by 0.1% of missing data), were considered for the imputation of Turb. Streamflow (Q) [m 3 /s] was considered for the imputation of TN, NO2 − , NO3 − and Turb. This time series was measured three times a day by the Uruguay National Water Board (DINAGUA) and is characterized by a neglectable percentage of missing data (5.6%).
Furthermore, precipitation records (P) from the Uruguayan Institute of Meteorology (INUMET) were considered for Turb imputation. The time series observed at the ten selected monitoring stations have a percentage of missing data that varies between 0.0% and The percentage of missing values detected for each variable at all monitoring stations is reported in Table 1. Some hydro-meteorological variables that may influence the water-quality variables under study were also considered to support the multivariate techniques. In particular, air temperature (T a ) (minimum, average, maximum) [ and Turb. This time series was measured three times a day by the Uruguay National Water Board (DINAGUA) and is characterized by a neglectable percentage of missing data (5.6%). Furthermore, precipitation records (P) from the Uruguayan Institute of Meteorology (INUMET) were considered for Turb imputation. The time series observed at the ten selected monitoring stations have a percentage of missing data that varies between 0.0% and 8.6% in the considered time window (2014-2020). The location of the INIA, DINAGUA and INUMET monitoring stations is represented in Figure 3.

Imputation Techniques
Since the best model for imputing any kind of variable does not exist [24], we evaluated several statistical and machine-learning algorithms (single and multiple imputation) to accomplish the objective of this study. The selected models are Inverse distance weighting (IDW), Random Forest regressor (RFR), Ridge regressor (RR), Bayesian ridge (BR), AdaBoost (AB), Huber regressor (HR), Support vector regressor (SVR), TheilSen regressor (TSR) and k-nearest neighbors regressor (KNNR). All of them have proved to be suitable for non-linear environmental variables, and some of them for cases characterized by a high percentage of missing data. Furthermore, they are already programmed and freely available in Phyton. Unless a software library is explicitly mentioned, scikit-learn was adopted to implement the algorithms [25]. We now briefly describe each of the imputation methods.
Inverse Distance Weighting (IDW): It is a deterministic univariate interpolation method. Missing samples at the target station (s) are computed from the observed values at neighboring stations. The weighting is assigned to the data using a weighting power that controls how the weighting factors decrease as the distance from station s increases [26]. This model was run in R, by using gstat library (function: gstat.idw).
Random Forest Regressor (RFR): It is a supervised learning algorithm that uses an ensemble learning method for regression. Such a method is a technique that combines predictions from multiple Decision Tree algorithms to improve the overall prediction and control overfitting. The decision trees run in parallel with no interaction among them and the mean of all the predictions is returned [27] (function: sklearn.ensemble.Random-ForestRegressor).
Ridge Regressor (RR): It is a technique for analyzing multiple regressions of highly correlated data. It trains a regression model that aims to minimize the least-squares function with an additional regularization term given by the sum of the values' squares (L2 norm) [28] (function: sklearn.linear_model.Ridge).
Bayesian Ridge (BR): It is an estimator that assumes and predicts the target by calculating its probability distribution during training. This method can cope with data sparsity more effectively than other methods. [29] (function: sklearn.linear_model.BayesianRidge).
AdaBoost (AB): It is an estimator that starts fitting a decision tree regressor on the original dataset and then fits additional copies of the regressor on the same slightly modified dataset. Depending on the correctness of the last prediction, samples that are difficult

Imputation Techniques
Since the best model for imputing any kind of variable does not exist [24], we evaluated several statistical and machine-learning algorithms (single and multiple imputation) to accomplish the objective of this study. The selected models are Inverse distance weighting (IDW), Random Forest regressor (RFR), Ridge regressor (RR), Bayesian ridge (BR), AdaBoost (AB), Huber regressor (HR), Support vector regressor (SVR), TheilSen regressor (TSR) and k-nearest neighbors regressor (KNNR). All of them have proved to be suitable for non-linear environmental variables, and some of them for cases characterized by a high percentage of missing data. Furthermore, they are already programmed and freely available in Phyton. Unless a software library is explicitly mentioned, scikit-learn was adopted to implement the algorithms [25]. We now briefly describe each of the imputation methods.
Inverse Distance Weighting (IDW): It is a deterministic univariate interpolation method. Missing samples at the target station (s) are computed from the observed values at neighboring stations. The weighting is assigned to the data using a weighting power that controls how the weighting factors decrease as the distance from station s increases [26]. This model was run in R, by using gstat library (function: gstat.idw).
Random Forest Regressor (RFR): It is a supervised learning algorithm that uses an ensemble learning method for regression. Such a method is a technique that combines predictions from multiple Decision Tree algorithms to improve the overall prediction and control overfitting. The decision trees run in parallel with no interaction among them and the mean of all the predictions is returned [27] (function: sklearn.ensemble.RandomForestRegressor).
Ridge Regressor (RR): It is a technique for analyzing multiple regressions of highly correlated data. It trains a regression model that aims to minimize the least-squares function with an additional regularization term given by the sum of the values' squares (L2 norm) [28] (function: sklearn.linear_model.Ridge).
Bayesian Ridge (BR): It is an estimator that assumes and predicts the target by calculating its probability distribution during training. This method can cope with data sparsity more effectively than other methods. [29] (function: sklearn.linear_model.BayesianRidge).
AdaBoost (AB): It is an estimator that starts fitting a decision tree regressor on the original dataset and then fits additional copies of the regressor on the same slightly modified dataset. Depending on the correctness of the last prediction, samples that are difficult to predict become more relevant as the training continues. The mean of all the models' predictions is returned [30] (function: sklearn.ensemble.AdaBoostRegressor).
Huber Regressor (HR): It is an algorithm that trains a linear model which optimizes the mean squared error (L2 error) for samples whose error is lower than a given threshold (d) and the mean absolute error (L1 error) for samples whose error is greater than d. In this way, the optimized function is not heavily influenced by outliers while not completely ignoring their effect [31] (function: sklearn.linear_model.HuberRegressor).
Support Vector Regressor (SVR): It is an estimator that focuses on minimizing the coefficients. More specifically, it considers the l2-norm of the coefficient vector, not the squared error. The error term is handled instead in the constraints, where the absolute error is set to less than or equal to a specified margin (maximum error). The latter can be adjusted to obtain the desired accuracy of the model. [32] (function: sklearn.svm.SVR).
TheilSen Regressor (TSR): It is a regressor that makes its estimation by calculating the slopes and intercepts of a subpopulation of all possible combinations of some subsample points. The final slope and intercept are then defined as the spatial median of these slopes and intercepts. It is robust against outliers compared to other linear regressors [33] (function: sklearn.linear_model.TheilSenRegressor).
K-Nearest Neighbors Regressor (KNNR): It is a regressor that calculates the distance (using all variables) from the target point to the others and makes a prediction by interpolating the nearest neighbors in the dataset [34] (function: sklearn.neighbors.KNeighborsRegressor).

Imputation Performance Evaluation
To compare the accuracy of the implemented techniques in reconstructing missing water-quality data, Kling-Gupta efficiency (KGE), percent bias (PBIAS) and the Nash-Sutcliffe efficiency (NSE) were used. The latter was employed as the objective function since it is the most restrictive [35], while KGE and PBIAS were both used for validation. Equations NSE varies between −∞ and 1. If NSE is 1, the imputed values match the records perfectly. If NSE is 0, the imputed values are as good as the observation mean. If NSE is negative, the observation mean is a better predictor than imputed values. Therefore, higher NSE values are desirable since they indicate a more accurate imputation model [36,37].
Unlike NSE, there are not well-defined KGE thresholds that define a "good" model. For this reason, the current literature tends to interpret KGE values similarly to NSE: negative values indicate "bad" model performance, while positive values indicate "good" model performance [38][39][40]. However, a recent study by Knoben et al. [41] found that all model results with −0.41 < KGE < 1 could be considered good performance.
The optimal value of PBIAS is 0, with lower values indicating accurate model imputation. Positive values denote an underestimation bias of the model, and negative values indicate an overestimation bias of the model [42]. Table 2 summarizes the performance evaluation criteria for NSE, KGE and PBIAS, used in this work, defined according to the standard review [36,41,42].

Helper Variables for the Imputation Process
Considering the correlations among water-quality variables, multivariate techniques exploited them for completing the missing values with the other existing water-quality observations. Spearman correlation was employed to evaluate possible correlations among water-quality variables, as it is a non-linear technique able to avoid overshadowing critical variable relationships. The aid variables considered in this study are framed in black in Figure 4. In particular, T w and Turb influence EC in surface waters. An increase in T w causes an increase in the mobility of the ions present in the water. An increase in T w may also produce an increment in the number of ions due to molecule dissociation. As the EC depends on these factors, an increase in T w leads to an increase in EC [43,44]. Furthermore, EC represents the ability of a liquid to conduct an electric charge; this ability depends on dissolved ion concentration, which is usually measured as total dissolved solids (TDS) [45]. Considering that TDS are highly correlated with Turb, we can assume that EC is also affected by Turb.
Furthermore, DO was considered dependent on T w : the higher T w , the lower DO. This is justified by the fact that cold water can hold more DO than warm water. In the cold season, when T w is low, the DO concentration is high. In the warm season, when T w is high, the DO concentration is often lower [19].
The variables Turb and T w are also highly correlated. In general, Turb is known as a proxy of the amount of suspended solids in water. Such suspended particles in water bodies absorb heat from solar radiation more efficiently than water. The heat is then transferred from the particles to water molecules, increasing the surrounding water temperature [46].
Other correlations considered were the ones between TN-Turb, TN-NO 2 − and TN-NO 3 − (even though the last two were not highlighted in the correlation matrix). This is justified by the fact that TN represents the sum of dissolved and particle-bound nitrogen.
Moreover, as we have already mentioned in Section 2.1, we also considered hydrometeorological variables aid for the imputation process since they may influence the water-quality variables under study. Particularly, Tw is deemed to be mainly affected by Ta, Hel and SR. Turb, TN, NO 2 − and NO 3 − are influenced by Q. Considering that NO 2 − and NO 3 − are part of the dissolved inorganic nitrogen (DIN), their correlation with streamflow is clear: the higher the Q, the lower the ions concentration, due to the dilution process [19]. Being TN the sum of dissolved and particle-bound nitrogen, we aided the imputation process of the latter by including Turb data. It is often assumed that these constituents positively affect river discharge, considering the importance of overland runoff in transporting sediments [47]. For this reason, we are considering Q as a supporting variable for Turb imputation.
In their study carried out in Santa Lucía Chico watershed, Gorgoglione et al. [19] found a seasonality of Turb values with higher values during the cold season. This was justified by the fact that in this season, frequent extreme precipitation events occur and, along with higher soil humidity due to low temperature, this causes a higher runoff and, Sustainability 2021, 13, 6318 8 of 17 therefore, a more significant amount of detached and washed-off sediments. For this reason, Turb imputation is also aided by ET, P and T a data (apart from Q as previously explained). Moreover, as we have already mentioned in Section 2.1., we also considered hydrometeorological variables aid for the imputation process since they may influence the water-quality variables under study. Particularly, Tw is deemed to be mainly affected by Ta, Hel and SR. Turb, TN, NO2 − and NO3 − are influenced by Q. Considering that NO2 − and NO3 − are part of the dissolved inorganic nitrogen (DIN), their correlation with streamflow is clear: the higher the Q, the lower the ions concentration, due to the dilution process [19]. Being TN the sum of dissolved and particle-bound nitrogen, we aided the imputation process of the latter by including Turb data. It is often assumed that these constituents positively affect river discharge, considering the importance of overland runoff in transporting sediments [47]. For this reason, we are considering Q as a supporting variable for Turb imputation.
In their study carried out in Santa Lucía Chico watershed, Gorgoglione et al. [19] found a seasonality of Turb values with higher values during the cold season. This was justified by the fact that in this season, frequent extreme precipitation events occur and, along with higher soil humidity due to low temperature, this causes a higher runoff and, therefore, a more significant amount of detached and washed-off sediments. For this reason, Turb imputation is also aided by ET, P and Ta data (apart from Q as previously explained).
A summary of the supporting variables taken into account for the imputation process is represented in Table 3. A summary of the supporting variables taken into account for the imputation process is represented in Table 3.

Dataset Profiling
The dataset considered for this study is formed by 48 time series (8 water-quality variables × 6 monitoring stations). Therefore, from now on, we will call "variable," "feature," or "attribute," a particular time series that refers to a water-quality variable recorded at one monitoring station (e.g., T w observed at SLC01 monitoring station will be T w [SLC01]). The data profiling process was programmed and run in Python 3.8, using the pandas_profiling library [48].
With the aim of showing the high temporal and spatial variability of the water-quality variables under study, we reported the box-plot representation at the six monitoring stations through the analyzed period (2014-2020) ( Figure 5). From all the pollutant plots presented, it is interesting to identify two different groups of behavior: the three monitoring sites situated in the reservoir show different patterns compared to those that characterize the stations located upstream of the reservoir. Furthermore, T w and DO are the only pollutants showing a strong intra-and inter-annual seasonality, while we cannot identify a clear pattern for the other contaminants under study. It is essential to highlight the high nutrient contribution of PS01 (TN, NO 2 − , NO 3 − ), where the biggest city of the watershed is located (Florida, that with a population of over 33,000, is home to almost half of the inhabitants of the region). It is known that urbanized areas are sources of nitrogen due to atmospheric deposition, lawn fertilizer application, wastewater effluent and leaky sewage infrastructure [49]. Turb shows a minor temporal pattern through the years, with the highest values registered in the monitoring stations located upstream of the reservoir (SLC01, SLC02 and PS01).
To better understand and justify the high spatio-temporal variability of the attributes under study, we analyzed the seasonality of the hydro-meteorological parameters used as helper variables in the imputation process (ET, Hel, T a , SR, P and Q) ( Figure 6). As mentioned in Section 2.1., precipitation-time series observed at ten monitoring stations were adopted for this study. For the sake of clarity, in Figure 6, we are only presenting the P boxplots related to Florida station since it is the barycentric one of the watershed.
In these plots, the "winter" period includes the fall and winter seasons (April-May-June-July-August-September), and the "summer" time window considers the spring and summer seasons (October-November-December-January-February-March). As expected, the meteorological variables ET, Hel, T a and SR show a strong seasonal pattern, with lower values in winter and higher values in summer. This behavior is not explicit for P and Q. They do not present an evident seasonality, but it is possible to state that in winter, extreme rainfall events and, therefore, major runoff events occur more frequently than in summer. This is justified by the fact that in the cold season, the higher soil humidity due to low temperature causes a higher runoff.
The strong seasonality of ET, Hel, T a and SR justifies the strong intra-and inter-annual seasonality of T w and DO since the former parameters are used as helper variables of the latter ones. The minor temporal pattern showed by Turb is explained by the fact that it depends on ET and T a , characterized by a strong seasonality, and Q and P, which only present extreme events in winter, without showing a solid temporal pattern. ity due to low temperature causes a higher runoff.
The strong seasonality of ET, Hel, Ta and SR justifies the strong intra-and inter-annual seasonality of Tw and DO since the former parameters are used as helper variables of the latter ones. The minor temporal pattern showed by Turb is explained by the fact that it depends on ET and Ta, characterized by a strong seasonality, and Q and P, which only present extreme events in winter, without showing a solid temporal pattern.

Imputation Results
To evaluate the performance of the different imputation models adopted and to choose the best one for each feature, k-fold cross-validation with k = 10 was used in this study. If a time series was characterized by less than 100 records, we adopted a repeated k-fold crossvalidation, always with k = 10. This method repeats the k-fold cross-validation process multiple times and reports the mean performance across all folds and all repeats [50]. The dataset was min-max normalized before any analysis to deal with the different units and orders of magnitude. The winning (best) models were the ones with the optimal hyper-parameters, i.e., those with the highest NSE (objective function). As a result, the best model with the highest accuracy was selected for each feature and validated by calculating KGE and PBIAS. The outcomes of this methodology are represented by augmented time series for all the water-quality variables, characterized by one-month frequency (i.e., the frequency was doubled up). The methodology adopted in this study was implemented using Python programming language on a desktop computer (Ubuntu Operating System, 16 GB of RAM and Intel i3 Processor).
In Table 4, we report, for each variable, the winning model with the corresponding values of the goodness-of-fit indicators calculated and the corresponding rating based on Table 3. Considering the NSE rating, the imputation performance is overall adequate. Tw at the six monitoring stations was the best-imputed variable, showing "very good" performance. The strong daily and annual seasonality that characterizes this variable makes its simulation and, therefore, its imputation less difficult. The correlation that exists between Tw and EC (an increase in Tw leads to an increase in EC) [43,44] is reflected in the "good" performance of this variable at the six monitoring sites ("satisfactory" at SLC01 and PS01; "good" at SLC02; "very good" at PS03, PS04 and PS02). The imputation process for the other waterquality variables shows different results. It is noteworthy that the performance is always notable at the three monitoring stations located in the reservoir of Paso Severino (PS03, PS04 and PS02); while the imputation can sometimes be "unsatisfactory" at the stations located upstream, along Santa Lucía Chico river (SLC01, SLC02 and PS01). This outcome can be attributed to the different hydrologic-response times considering the location of the measurement sites. The time base of the hydrographs observed at Florida hydrometric station ( Figure 3) is overall equal to 6 days and it generally does not vary with the change of the flow magnitude. Ríos [51] found that, on average, the renewal time of the Paso Severino reservoir ranges between 2 to 8 weeks. He also observed that during storm events, the renewal time could be a few days long, while it can last several months during dry periods. SLC01 and SLC02 are located several kilometers upstream of the reservoir, where the water body has a fluvial behavior associated with a lotic ecosystem. While PS02, PS03 and PS04 are located within the reservoir, where the water body is lacustrine, associated with a lentic ecosystem. The validation of the imputation process was outstanding, showing overall "very good" results in terms of the PBIAS and KGE ratings.
A box-plot representation of the model performance (NSE, PBIAS and KGE) is represented in Figure 7. More than 76% of the imputed data is characterized by NSE > 0.45 (it is at least "satisfactory"), and more than 92% of the imputed data has a positive NSE, meaning that for almost all the imputations, our methodology is better than the mean function used as an imputer. The validation results were notable. Considering PBIAS ratings, more than 96% of the imputed data can be considered at least "satisfactory" and more than 88% "very good." In terms of KGE ratings, all the imputations are considered "good."

Further Discussion
Effective water-resource management requires the analysis of a large number of water-quality information over space and time. However, in many parts of the world, particularly in developing countries, the monitoring of water-quality variables is usually characterized by few monitoring stations over the territory, where observations are recorded with a low frequency and are characterized by an important percentage of missing data. Therefore, in this study, we evaluated the performance of several statistical and machinelearning techniques (univariate and multivariate) in imputing a water-quality dataset characterized by eight water quality variables measured at six monitoring stations. Particularly, we aimed to augment the water-quality dataset, from bi-monthly to monthly frequency. The percentage of missing values ranges between 50% and 70% (high missingness percentage), and the water-quality variables are characterized by a high temporal and spatial distribution. The study area considered was one of the most critical Uruguayan watersheds, Santa Lucía Chico, since it provides water to more than 60% of the national population. This was an interesting study area to analyze since it is a mixed lotic and lentic system and the six monitoring stations are located along the mainstream (SLC01, SLC02 and PS01) and in the reservoir (PS03, PS04 and PS02). In this way, it was appealing to assess the performance of several models in these two different surface-water bodies.
There are few related works on the imputation of water-quality data, and they are relatively recent. In 2012, Srebotnjak et al. [17] showed that hot-deck imputation can improve geographical coverage of a country-level water quality index, calculated consider- IDW outperformed the other models in most cases (17 times), followed by RFR (8 times), HR (6 times), SVR (5 times) and RR (4 times). A possible explanation is that IDW is the only model that, in addition to considering temporal information, includes spatial information by looking at neighboring stations to support the imputation process. The other implemented machine-learning models won almost the same number of times (same order of magnitude).

Further Discussion
Effective water-resource management requires the analysis of a large number of waterquality information over space and time. However, in many parts of the world, particularly in developing countries, the monitoring of water-quality variables is usually characterized by few monitoring stations over the territory, where observations are recorded with a low frequency and are characterized by an important percentage of missing data. Therefore, in this study, we evaluated the performance of several statistical and machine-learning techniques (univariate and multivariate) in imputing a water-quality dataset characterized by eight water quality variables measured at six monitoring stations. Particularly, we aimed to augment the water-quality dataset, from bi-monthly to monthly frequency. The percentage of missing values ranges between 50% and 70% (high missingness percentage), and the water-quality variables are characterized by a high temporal and spatial distribution. The study area considered was one of the most critical Uruguayan watersheds, Santa Lucía Chico, since it provides water to more than 60% of the national population. This was an interesting study area to analyze since it is a mixed lotic and lentic system and the six monitoring stations are located along the mainstream (SLC01, SLC02 and PS01) and in the reservoir (PS03, PS04 and PS02). In this way, it was appealing to assess the performance of several models in these two different surface-water bodies.
There are few related works on the imputation of water-quality data, and they are relatively recent. In 2012, Srebotnjak et al. [17] showed that hot-deck imputation can improve geographical coverage of a country-level water quality index, calculated considering dissolved oxygen, electrical conductivity, pH, total phosphorus and total nitrogen. This water-quality index is a composite indicator to track water quality over time and space, easily interpretable since it varies from 0 to 100. Still, it does not allow a detailed analysis of each water-quality variable used to calculate it. Therefore, this type of index does not allow us to answer scientific questions such as which compounds are significant indicators for specific land use categories or the spatio-temporal behavior of a particular problematic compound in a particular area of study. To overcome these limitations, we decided to directly impute each water-quality variable and not a global index, which allows us to use the imputed data for more advanced analyses.
In 2015, Tabari and Talaee [16] obtained acceptable results (RMSE ranges between 0.016 and 4475) in imputing a large dataset of water-quality information (13 variables) measured, with a monthly frequency, at five monitoring sites along the Maroon River (Southwest of Iran). It should be noted that this study has already adopted the concept of helper variables to improve the imputation process based on the correlations among water-quality variables. The correlation between EC and Turb that we used in our analysis is confirmed in this study. In Tabari and Talaee [16], the results were insufficient for EC, Turb and total dissolved solids (TDS) at all monitoring stations, showing RMSE values between 100 and higher than 4000. They employed only two artificial neural networks as imputation models: multilayer perceptron and radial bias function. In our study, we improved such results using more imputation techniques and founding that SVR model shows better performance for EC and Turb.
In 2019, Ratolojanahary et al. [7] tackled for the very first time the problem of high rate missingness (higher than 80%) in a water-quality dataset of a drinking water well employing four machine learning models (RF, KNNR, SVR and boosted regression trees, similar to our AB). Their outcomes showed that SVR provides the best performance (notably in terms of average prediction error). However, this study does not introduce the temporal dimension into the imputation process, and, therefore, temporal variability of water-quality parameters is not considered a challenge. Spatial variability is also not addressed, as the authors analyzed water well. These aspects are included in our study. Furthermore, we confirm that the performance of SVR is better than AB and KNNR in the imputation of water quality data.
It is also important to note that our work pioneered the use of IDW for water-quality data imputation, and this method performed the best among all the methods analyzed. Some recent works proposed using IDW to interpolate water quality in scenarios where spatial variability may be negligible, as in the case of lakes [52] or where temporal variability is low, as in the case of groundwater [53].
Some of the correlations found in our work were also reported in previous studies in the same study area [19,21,23,54]: a robust correlation among nitrogen compounds, in its dissolved and particle-bound form; a strong inverse correlation between by T w and DO.

Conclusions
In this study, we tackled the challenge of data imputation in a multivariate waterquality dataset characterized by a high percentage of missing data (between 50% and 70%). In particular, the variables T w , EC, pH, DO, TN, NO 2 − , NO 3 − and Turb of six monitoring stations located along the Santa Lucía Chico river (Uruguay) were considered for this study. Adopting a multi-model approach was crucial since the best model for imputing any water-quality variable does not exist. The statistical and machine-learning models implemented were IDW, RFR, RR, BR, AB, HR, SVR and KNNR.
The imputation outcomes were overall adequate. More than 76% of the imputed data can be considered "satisfactory" (NSE > 0.45). This was validated by calculating PBIAS (>96% of the imputed data is "satisfactory") and KGE (all the imputations are considered "good"). It is interesting to notice that the performance is always remarkable at the three monitoring stations located in the Paso Severino reservoir, while they may be "unsatisfactory" at some monitoring stations located along the Santa Lucía Chico river (upstream the reservoir). Among the implemented models, IDW was chosen as the best model 17 times since it is the only model that considers the temporal and spatial variability that characterizes the variables under study.
This study paves the path to future water-quality research in the watershed under study (e.g., implementation of reliable modeling tools, water-quality prediction and scenario analysis). Hopefully, the results obtained in this work will help water managers and researchers worldwide make the most of existing water-quality data to improve modeling and generate effective pollution-control strategies.
Our current results are promising, but we believe that it is possible to improve the present methodology by integrating physical knowledge that considers the spatial information of the available water-quality data. Our future work intends to transform the current approach, based on machine learning, into a hybrid method where the data-driven techniques incorporate physical aspects during their training.
Funding: This research was funded by ANII, grant number FSDA_1_2018_1_153967.

Conflicts of Interest:
The authors declare no conflict of interest.