A Virtual Sensing Concept for Nitrogen and Phosphorus Monitoring Using Machine Learning Techniques

Harmful cyanobacterial bloom (HCB) is problematic for drinking water treatment, and some of its strains can produce toxins that significantly affect human health. To better control eutrophication and HCB, catchment managers need to continuously keep track of nitrogen (N) and phosphorus (P) in the water bodies. However, the high-frequency monitoring of these water quality indicators is not economical. In these cases, machine learning techniques may serve as viable alternatives since they can learn directly from the available surrogate data. In the present work, a random forest, extremely randomized trees (ET), extreme gradient boosting, k-nearest neighbors, a light gradient boosting machine, and bagging regressor-based virtual sensors were used to predict N and P in two catchments with contrasting land uses. The effect of data scaling and missing value imputation were also assessed, while the Shapley additive explanations were used to rank feature importance. A specification book, sensitivity analysis, and best practices for developing virtual sensors are discussed. Results show that ET, MinMax scaler, and a multivariate imputer were the best predictive model, scaler, and imputer, respectively. The highest predictive performance, reported in terms of R2, was 97% in the rural catchment and 82% in an urban catchment.


Background and Motivation
Eutrophication (Eutro) and harmful cyanobacterial bloom (HCB) are causes of concern around the world, with vital water bodies frequently experiencing increasingly toxic blooms in the past few decades [1]. Eutro refers to the nutrient enrichment of water mainly from agricultural sources, wastewater discharges, industrial wastes, and urban runoff [2], while HCB is one of the most severe symptoms of Eutro [3]. HCB is problematic for the drinking water treatment process [4], and some strains can produce toxins (cyanotoxins) that have significant health effects on humans, animals, and other organisms [5].
In addition, the severity of Eutro and HCB has significant ecological and socioeconomic costs due to the (i) increase in costs associated with water treatment, (ii) adverse effects on waterfront real estate, (iii) tourism and recreational losses due to the negative perceptions of water quality (WQ), (iv) animal fatalities, and (v) fish mortality [6]. Africa is particularly vulnerable to these economic losses due to its lag in socio-economic development. This assertion is substantiated by the fact that the combined costs of eutrophication in England and Wales and the USA were about US$160 million per year in 2003 [7] and US$2.2 billion per year in 2009 [8], respectively. Therefore, water bodies need to maintain an acceptable chemical, physical, and biological status in order to preserve raw water supply, protect human and animal health, and Some key factors that affect the operation of these virtual sensors include the complexity of the model, acceptable predictive accuracy, the type and number of surrogate sensors, and the availability of high-quality training data [9]. Thus, the core issue is to develop robust and inexpensive virtual sensors with excellent performance.

Literature Review
We are aware of only three studies that have assessed the feasibility of ML techniques for predicting nutrient concentrations, particularly for WQ monitoring purposes. Castrillo and García [9] used multiple linear regression (MLR) and random forest (RF) models to estimate nutrient concentrations in the rural and urban catchments of the River Thames in the UK. They considered only the variables most commonly measured in real-time (in- Some key factors that affect the operation of these virtual sensors include the complexity of the model, acceptable predictive accuracy, the type and number of surrogate sensors, and the availability of high-quality training data [9]. Thus, the core issue is to develop robust and inexpensive virtual sensors with excellent performance.

Literature Review
We are aware of only three studies that have assessed the feasibility of ML techniques for predicting nutrient concentrations, particularly for WQ monitoring purposes. Castrillo and García [9] used multiple linear regression (MLR) and random forest (RF) models to estimate nutrient concentrations in the rural and urban catchments of the River Thames in the UK. They considered only the variables most commonly measured in real-time (in-situ) as predictors (or surrogates). However, the modeling accuracy, reported in terms of the root mean squared error (RMSE) and the normalized RMSE (nRMSE), is not state-of-the-art (except for nitrate). For instance, the average nRMSE for both catchments is 21% and increases to 24% when excluding nitrate prediction in the rural catchment.
Elsewhere, Ha et al. [3] also used RF and MLR models to estimate nutrient concentrations in the Tri An reservoir in Vietnam. Amongst the predictors, they included chemical and biochemical oxygen demand. However, high-frequency monitoring of both these variables is not feasible due to their sensors' reliability and prohibitive cost [15,16]. Although the predictive performance (reported in terms of the RMSE and the coefficient of determination (R 2 )) in the validation phase was tolerable (average R 2 was 0.858), the results may be biased since the dataset is relatively small, in which case, the results depend on which data was used for model training vs. validation. The performance assessment based on the data split randomly over the entire dataset (as opposed to the year-based split) may have enabled a more robust conclusion.
Recently, Harrison et al. [12] employed RF to estimate nutrient concentrations within a forested area (Lake George) in New York, USA. They mainly used the variables with highfrequency sensor data (except for fluorescent dissolved organic matter and soil moisture) as surrogates. However, the predictive performance, reported in terms of the Nash-Sutcliffe Efficiency (NSE), was not state-of-the-art, especially for phosphorus fractions. For instance, the NSE was 32% for soluble reactive P, 47% for total soluble P, 74% for total P, and 76% for particulate P.
Therefore, despite the recognition of the urgent need for accurate, reliable, and costeffective monitoring programs [17], scientific progress has been rather sluggish. Furthermore, two of the three studies [3,12] build their predictive models using the data from one watershed of interest. Thus, even though the authors reported the accuracies better than [9] (who used two watersheds with contrasting land uses), the conclusions from their studies are not adequately substantiated since the robustness of their models may be limited [18]. Additionally, apart from applying the same ML algorithm (RF) across different study areas, all studies failed to achieve the target accuracy (defined in Section 3.1.3) for most variables, particularly P fractions. This drawback may impede the uptake of virtual sensing for operational purposes. However, the success of other ensemble techniques such as the extremely randomized trees algorithm [19], which has been tremendous in various application domains in recent years, is yet to be explored for N and P prediction.
In addition, there is also a lack of (i) effective baseline models and (ii) established datasets and accuracy benchmarks. For instance, MLR is not an appropriate baseline since WQ data often exhibit multicollinearity, nonlinear relationships, and skewed distributions [12]. Additionally, each work used different datasets since there is no standard benchmark, and the accuracy is reported using different performance metrics. In the absence of open-source implementations, these inconsistencies make it difficult to compare and critically assess the models' performance, and this impedes the research progress. Further, it is unclear what predictive accuracy would be desired to trigger management actions when nutrient concentrations exceed the regulatory thresholds. Accordingly, there is a need for a specification book that would form a basis for a robust and rigorous virtual sensor development process. Such a specification book, including the provision of adequate modeling details, would also make the experiments repeatable and results reproducible; the two issues essential for advancing the research field [20].
The other common issue with WQ data is the high number of outliers [21]. For example, except for temperature, all the variables in the present work have outliers (values outside the whiskers of a boxplot). A common procedure is to use either the three-sigma rule [22] or the interquartile range method [23] to identify and delete the outliers. However, because of the multivariate nature and seasonal variability of WQ data [21,24], removing these outliers may not be the most effective procedure since they may be legitimate, and we do not want to lose their impact. Consequently, this directly impacts the method used for data scaling since some scalers are very sensitive to the presence of outliers. Although the performance of metric and gradient-based ML algorithms depends on appropriately scaled  [25], the effect of various scalers and transformers on the predictive performance of N and P concentrations is yet to be assessed.
Another prevalent issue is the presence of many missing values (MVs) in WQ data. For instance, the raw data in the present work (described in Section 2.1) have a missing rate as high as 41% (total reactive P) in one catchment. How these missing data are handled remains crucial since many ML algorithms do not support data with MVs. A commonly applied basic strategy is to discard the entire columns and/or rows containing MVs. For example, Castrillo and García [9] managed the MVs by deleting the rows that contained such values. However, this has the drawbacks of (i) losing data which may be valuable even though incomplete and (ii) reducing the size of the datasets, which on many occasions are already small. For instance, the four-year dataset used by Ha et al. [3] had only 1047 observations. Therefore, a better strategy may be to impute these MVs, i.e., to infer them from the non-missing part of the data [26]. Consequently, there is a need to explore the effect of imputation algorithms on the predictive performance of N and P.

Work Objectives
In addressing the limitations mentioned above, the present work aims to formalize the virtual sensing concept for nutrient monitoring. In particular, the study seeks to achieve the following objectives:

1.
To present a specification book for virtual sensor-based N and P monitoring. The specification book (i) proposes more effective baseline models, (ii) recommends performance metrics to enable adequate model comparison, (iii) recommends the different levels of predictive accuracy benchmarks based on the requirements of data quality objectives, and (iv) proposes the benchmark datasets; 2.
To assess the effect of different scalers (Standard, MinMax, MaxAbs, and Robust Scaler) and transformers (Quantile and Power Transformer) on ML algorithms and virtual sensor performance on WQ data with outliers; 3.
To assess the effect of univariate, multivariate, and nearest neighbor imputation methods in the prediction of N and P concentrations.
The remainder of this article is structured as follows. Section 2 (i) introduces the study areas and water quality data, (ii) discusses exploratory data analysis and preprocessing, and (iii) highlights some of the best practices when developing the virtual sensors while the results and discussion are presented in Section 3.

Description of Study Areas and Water Quality Data
The open dataset employed in this work was obtained from the UK Centre for Ecology and Hydrology's Environmental Information Data Centre, available at https://catalogue. ceh.ac.uk/documents/db695881-eabe-416c-b128-76691b2104d8 (accessed on 17 September 2022). The hourly chemical and physical monitoring data for The Cut (a highly urbanized catchment) and the River Enborne (a relatively rural catchment) were taken between 2010 and 2012 and 2009 and 2012, respectively. Regarding land use statistics, 39% of the River Enborne catchment is designated as 'Arable and Horticulture' and 21% as 'Broadleaf Woodland' [27]. On the other hand, 30% of The Cut catchment is designated as improved grassland, 26% as arable land, and 15% as woodland [28].
These two tributaries with contrasting land-use receive sewage effluent from a small waste treatment plant (River Enborne) and major towns (The Cut) [29]. Detailed documentation of the catchments, such as maps, satellite view photos, sampling locations, and information concerning monitoring methodology and instrumentation characteristics, comes with the datasets as supporting documentation and is also provided in Wade et al. [29]. For ground-truthing the data, the two rivers were monitored using identical (i) infrastructure, (ii) chemical analysis methodologies, and (iii) monitoring sondes, and the data were validated with weekly grab sampling and laboratory analysis [29]. The parameters measured in each catchment are listed in Table 1.

• •
Total reactive phosphorus (orthophosphate) is regarded as the best indicator of the nutrient status of water bodies. It has similar effects to nitrates and ammonium in excess amounts.
As seen in Table 1, most of the parameters are common in both catchments except for nitrogen, measured as nitrate (NO 3 ) in the River Enborne and ammonium (NH 4 ) in The Cut. Additionally, total phosphorus was only measured in The Cut.

Data Analysis Frameworks
Data analysis and statistical operations were performed using the following opensource libraries (where version numbers are given in brackets) [

Exploratory Data Analysis and Preprocessing
The available data, 15,636 records for The Cut and 20,412 records for the River Enborne, are provided in Comma-Separated Values (CSV) format. Data preprocessing transforms input data into a form appropriate for modeling. In this work, exploratory analysis detected the outliers while preprocessing included data transformation, scaling, and missing values handling.

Box and Whisker Plot Analysis: Outlier Detection
A box and whisker plot (or boxplot) was chosen for outlier detection because it provides an insightful visualization of how spread out the data are. As seen in Figure 2 (as an example), a considerable number of data points are outside the whiskers of the box plot, demeaning these outliers as normal.

Exploratory Data Analysis and Preprocessing
The available data, 15,636 records for The Cut and 20,412 records for the River Enborne, are provided in Comma-Separated Values (CSV) format. Data preprocessing transforms input data into a form appropriate for modeling. In this work, exploratory analysis detected the outliers while preprocessing included data transformation, scaling, and missing values handling.

Box and Whisker Plot Analysis: Outlier Detection
A box and whisker plot (or boxplot) was chosen for outlier detection because it provides an insightful visualization of how spread out the data are. As seen in Figure 2 (as an example), a considerable number of data points are outside the whiskers of the box plot, demeaning these outliers as normal.

Data Transformation
To improve the predictive performance, some ML models, such as random forest and linear regression, require features to be as close as possible to normal (or Gaussian) distribution [30]. To minimize skewness and stabilize variance in variables not normally distributed, the transformation is performed by mapping the variable of interest into the target variable = , with the common mapping function as one of the following: the cubic root = √ , the square root = √ , the reciprocal = 1⁄ , the logarithm = ln , the Yeo-Johnson [31], and Box-Cox transforms [11]. In the present work, normality was checked by computing the Fisher-Pearson coefficient of skewness using the SciPy library. Contrary to applying the same transformation based on whether the features are positively or negatively skewed, as was the case in [9], each variable was transformed using all the mentioned mapping functions, and the one that resulted in the lowest skewness was then chosen for that particular variable, as shown in Table 2.

Data Transformation
To improve the predictive performance, some ML models, such as random forest and linear regression, require features to be as close as possible to normal (or Gaussian) distribution [30]. To minimize skewness and stabilize variance in variables not normally distributed, the transformation is performed by mapping the variable of interest x into the target variable y = f (x), with the common mapping function f as one of the following: [31], and Box-Cox transforms [11]. In the present work, normality was checked by computing the Fisher-Pearson coefficient of skewness using the SciPy library. Contrary to applying the same transformation based on whether the features are positively or negatively skewed, as was the case in [9], each variable was transformed using all the mentioned mapping functions, and the one that resulted in the lowest skewness was then chosen for that particular variable, as shown in Table 2. Importantly, all these are simple transformations that do not use any learnable parameters and therefore do not risk data leakage (discussed in Section 2.4.4) when performed before data splitting. As seen in Table 2, the appropriate mapping function for flow rate in The Cut and turbidity in both catchments is the reciprocal function, although Castrillo and García [9] applied the logarithmic function. This contradiction demonstrates the importance of data analysis (or visualization) before and after each transformation.

Data Scaling
Features such as conductivity and total reactive phosphorus in both catchments have very different scales and contain many outliers. These two characteristics can degrade the predictive performance of many ML algorithms [26]. Additionally, unscaled data can slow (or even prevent) the convergence of several gradient-based algorithms like neural networks and support vector machines. We compared the effect of MinMaxScaler, MaxAbsScaler, StandardScaler, RobustScaler, QuantileTransformer, and PowerTransformer on ML algorithms and virtual sensor performance on WQ data with outliers.

Virtual Sensor Development: Best Practices
The limited number of studies concerning the application of virtual sensing for nutrient monitoring indicates that the research field is still in its infancy. Therefore, this section discusses some of the best practices that will aid in interpretability and improved trust in ML models for regulatory purposes.

Selection of Input Variables
Good practice in any ML model development is to select a subset of relevant input variables. This is to avoid overfitting and therefore improve model generalization. For virtual sensing, the requirement is also to have the minimum best subset of these predictors since it determines the surrogate sensors that will be involved in the continuous monitoring program. Industrially, most of the current real-time or near real-time monitoring applications rely on one or more of the following WQ parameters: turbidity (Turb), dissolved oxygen (DO), temperature (Temp), pH, conductivity (EC), and chlorophyll (Chl) [29,32]. Due to advances in material science and sensor technologies, it is relatively easy to measure these parameters [32]; therefore, their high-frequency monitoring provides vital WQ data required to best manage the water bodies. However, none of these parameters can identify a specific pollutant but only serve as a screening method [32]. That is, their sudden and uncharacteristic change serves as a warning against water contamination.
In the virtual sensing context, using these variables as N and P surrogates makes financial and operational sense since they are easily and economically monitored in realtime. However, this real-time monitoring is only prevalent in developed countries due to the adoption of the Water Framework Directive [33]. In less developed countries, where monitoring cost is the main barrier [13], real-time monitoring of the six predictor variables may still be expensive considering the required number of sampling points in water bodies. In such cases, the objective would be to monitor fewer variables since that would reduce the cost of purchasing and operating the surrogate sensors. A drastic deviation of the predicted N and P concentrations from baseline conditions can then be used as an early warning to trigger additional testing or initiate preventive measures.
To select the relevant predictors, the standard procedure in various water sciences modeling studies is to apply diverse variable selection methods such as best subset selection, stepwise selection with bootstrapping, partial mutual information, hierarchical partitioning, and Pearson correlation [9,12]. From the six predictors commonly measured in real-time, we sought the minimum best subset based on feature importance (or ranking) in the present work. This is crucial for interpretability since it reveals how much each predictor contributes to the observed predictions. We selected suitable surrogates based on Shapley Additive exPlanations (SHAP) feature importance. SHAP is a unified approach that explains any ML model's output (or prediction) by connecting local explanations with coalitional game theory [34]. We selected ET as an explainer model since it is the best-performing model (discussed in Section 3). As a sanity check, we compared the chosen features with those obtained by the RF-based explainer model and observed minor differences in feature ranking. For instance, the ET-based explainer ranked nitrate predictive features (in the River Enborne) as EC, Temp, pH, DO, Turb, and Chl, while the RF-based explainer ranked them as EC, pH, Temp, Turb, Chl, and DO.
Notably, flow rate is not among the most commonly measured variables according to [29,32]. Considering that it is one of the important predictors in four of the five target variables (discussed in Section 3.1.4), it may be advisable to include it in monitoring programs, particularly in rural settings. For instance, it is the second most influential variable for TRP prediction in the River Enborne and ranks third for nitrate prediction in the same catchment based on the stepwise forward subset selection method [9]. However, it is excluded from the predictors in the present work since it is not commonly measured in real-time [29,32].

Cross Validation
Training and testing a ML model on the same data would produce a perfect predictive score but fail dismally on yet-unseen data. This is a classic example of a situation called overfitting. To avoid it, a standard practice in any supervised ML experiment is to randomly split the data into training and testing (evaluating) sets. However, when evaluating different estimator settings (also called hyperparameters), the risk of the test set overfitting is still prevalent since the settings can be tweaked until the model performs optimally. This way, the test set information can 'leak' into the model resulting in the evaluation metrics no longer reporting on the generalization performance.
Therefore, yet another portion of the dataset (so-called validation set) must be held out. This is to enable the training to happen on the training set and then the evaluation on the validation set, after which the final evaluation can be performed on the test set when the experiment is deemed successful. However, partitioning data into three sets drastically reduces the number of samples used to train the mode, and the results can depend on which random pair of train and validation sets were used [26]. This problem is solved by a procedure known as cross-validation (CV). While the test set is still necessary for final evaluation, the validation set is not required when performing CV. One type of CV, known as k-fold, splits the training set into k smaller folds or sets where each one is utilized iteratively for testing the model while the remaining k − 1 folds are used as training data. The performance reported by k-fold CV is then the average of values from the loop. In the present work, the datasets were split into train and test sets using a shuffled k-fold CV with k = 10 and a 20% subset reserved for the final test.

Hyperparameter Tuning
Hyperparameter tuning (or optimization) is a process of choosing a subset of optimal hyperparameters for ML algorithms. A Grid Search CV is currently the most widely used parameter optimization method [26]. It searches for optimal hyperparameters by exhaustively considering all parameter combinations. In the present work, a five-fold Grid Search CV was used to find the ideal hyperparameters. Increasing the hyperparameter search space is computationally very expensive; therefore, we only searched over the main parameters. For tree-based models, for example, those are the number of trees (or n_estimators) and the size of the random subsets (or max_features) in the Scikit-Learn library. The resulting optimal values did not significantly improve the models. That is, the accuracy rating levels (as proposed in Table 3) did not change. Therefore, we kept the default hyperparameters in order to speed up the training process.

Avoiding Data Leakage
Data leakage is a common methodological pitfall in ML-based science [26]. It occurs when the model has access to the test set information before its predictive performance is evaluated, resulting in overly optimistic results with no value in practice. A common course of data leakage is not separating the train and test data subsets during all the preprocessing [26]. Although both train and test sets should receive the same transformation, the general rule (or best practice) is that these transformations should only be learned from the training subset, and then the test data be transformed, as done on the training data. A pipeline class in Scikit-Learn (implemented in this work) makes it easier to chain all the preprocessing transformations with estimators without risking data leakage.

Imputation of Missing Values
Imputation refers to the process of replacing missing data with plausible estimates learned from non-missing data. Univariate imputation replaces the MVs in a particular feature dimension with statistical estimates such as the median, mean, or mode of non-missing values of the same feature dimension (or variable). In contrast, multivariate imputation is a more sophisticated approach that models each feature with MVs as a function of other features and utilizes that approximation for imputation [26]. On the other hand, k-nearest neighbor (kNN) imputation imputes each missing feature using available values from the nearest neighbors for the feature and averages these closest points to fill in the missing value. We assessed the effect of univariate (mean, median, and mode), multivariate (bayesian ridge, RF, extreme gradient boosting, light gradient boosting machine, and bagging regressor), and kNN imputation techniques on the predictive performance of N and P concentrations.

A Specification Book
To guarantee that hardware sensors can satisfy the requirements of high-quality data to meet the monitoring requirements, they undergo a rigorous certification process that tests for acceptable accuracy, precision, and resolution, among others [35]. Virtual sensing should also be held to the same standards if the model outputs are to be practically meaningful. In this context, the following requirements are crucial for a robust and rigorous virtual sensor development process:

Standard or Benchmark Datasets
The availability of good quality data is crucial for developing ML models [36,37]. In most established machine or deep learning application areas, the standard practice is to assess the models over several freely available benchmark datasets [38]. The significance of this practice is that it (i) enables reproducible research, (ii) facilitates a more rigorous inter-model evaluation, (iii) has been the basis for some of the most significant algorithmic developments in ML research, and (iv) guarantees the model's robustness (or precision), among other advantages [37]. On the contrary, water quality prediction studies frequently apply only one private dataset from the studied water bodies [3,12]. To our knowledge, only one study [9] used two freely available and high-quality data from two tributaries with contrasting land uses to develop their predictive model (the same data utilized in the present work). These data are relatively large, and their quality was validated by a weekly laboratory analysis [29]. Therefore, we propose these data as a benchmark, over and above the data from the case studies being considered.

Model Performance Evaluation
Model evaluation is the core part of developing an effective ML model. Besides the need for a reliable evaluation framework for the model to be relied upon, some of the main reasons for model evaluation include [39]:

1.
Estimating the predictive and generalization performance of the model on unseen (future) data; 2.
Providing a basis for evaluating improvements in the predictive performance when adjusting (or tweaking) model parameter values; 3.
Identifying the best ML algorithm for the problem at hand by comparing different algorithms or current modeling efforts with peer-reviewed literature results.
Several studies have explored suitable evaluation metrics for traditional watershed models [40,41]. Moriasi et al. [41] synthesized the performance measures and provided performance evaluation guidelines for widely used watershed-scale models (e.g., Soil and Water Assessment Tool). Based on their synthesis, personal modeling experiences, and meta-analysis, they recommended several statistical and graphical performance measures to evaluate model performance. From the recommended performance measures, the authors [41] highlight that the most widely used are RMSE, R 2 , and NSE. The three metrics are calculated as shown in Equation (1) [9], Equation (2) [26], and Equation (3) [12]: where n is the number of observations, y i is the true value,ŷ i is the predicted value, and µ is the mean of the observed values. As seen in Equations (2) and (3), the computation of R 2 and NSE is very similar. Therefore, since R 2 is more common in the general machine learning community [26], we recommend the consistent use of RMSE and R 2 for developing virtual sensors.

Accuracy Requirements
The rapid measurement of nutrient concentrations at drinking water intakes is essential for water management, especially when concentrations exceed regulatory limits. Generally, defining data quality needs and sensor specifications from various potential users (e.g., domestic, agriculture, recreation, etc.) is critical to the technology's evolution [14]. This is due to the varying requirements of the potential users that usually lead to design trade-offs with different data specifications. For instance, commercial sensors for measuring nitrate analytes in pure water have higher accuracy, precision, and resolution than the same nitrate sensors developed for other applications. Similarly, the accuracy requirements from virtual sensors may vary based on the intended use of the predictive data. Since the accuracy requirements (or benchmarks) for virtual sensor-based nutrient monitoring are currently unavailable in the literature, we propose the state-of-the-art accuracy ratings reported in terms of the R 2 as shown in Table 3.
The choice of these ratings is motivated by the fact that commercial sensors, as a reference, attain these accuracies based on the measured WQ variable [13], and nitrate prediction using several machine learning algorithms in the River Enborne is within these accuracies (Section 3.1.4). Although these accuracy ratings are given in terms of R 2 alone, the performance comparison of different prediction methods should also consider the achievable RMSE. Table 3, coupled with the predictors able to predict nutrients within these accuracy levels, would form a basis for a cost-benefit analysis for monitoring program managers (further discussed in Section 3.5).

A Baseline Model
One of the first requirements of any measurement process is that there must be a well-known and commonly accepted standard of comparison [35]. Similarly, a baseline refers to a simple and well-understood model whose performance is used as a benchmark for comparing and evaluating other models. Currently, multiple linear regression is the preferred baseline model [3,9,14]. However, WQ data are characterized by nonlinear relationships, skewed distributions, and multicollinearity [12], all of which violate the core assumptions of LR models. Thus, the field currently lacks effective baselines. That is, basic models with competitive performance for N and P prediction.
To raise the bar of baselines, we performed 'spot checking' on a diverse set of commonly applied linear, nonlinear, and ensemble algorithms with default hyperparameters in the Scikit-Learn library. The aim was to identify a set of algorithms that perform well in this problem domain. We evaluated decision tree (DT), random forest (RF), extreme gradient boosting (XGB), light gradient boosting machine (LGBM), multi-layer perceptron (MLP), k-nearest neighbor (kNN), support vector machine (SVM), extremely randomized trees (ET), gradient boosting regressor (GB), stochastic gradient descent (SGD), linear regression (LR), ridge regression (Ridge), and bagging regressor (BR). In case the reader is unfamiliar with any of these algorithms, we refer them to the official machine learning library [26] for details concerning the theoretical backgrounds, mathematical formulations, implementation strategies, and tips on practical use.
To evaluate the thirteen algorithms, we used Castrillo and García [9]'s experimental setting as a benchmark. That is, we (i) used the same predictors, (ii) handled the MVs by listwise deletion, and (iii) scaled the features to zero mean and unit variance. The results of nitrate spot checking, reported in terms of R 2 , are presented in Figure 3.
Although Figure 3 presents only the results of nitrate spot checking, the rating of the best performing algorithms (ET, RF, and kNN) is the same for TRP in both catchments and TP in The Cut, with RF outperforming kNN in most cases. Notably, (i) linear regression models (SGD, LR, and ridge) are the least performing algorithms in both catchments, and (ii) ET performs better than the commonly used RF for all target variables. To further test the effectiveness of ET compared to RF, we assessed the performance of ET as a function of changing input variables. We then used the RF results obtained by [9] as a basis for comparison, as shown in Table 4.
unfamiliar with any of these algorithms, we refer them to the official machine learning library [26] for details concerning the theoretical backgrounds, mathematical formulations, implementation strategies, and tips on practical use.
To evaluate the thirteen algorithms, we used Castrillo and García [9]'s experimental setting as a benchmark. That is, we (i) used the same predictors, (ii) handled the MVs by listwise deletion, and (iii) scaled the features to zero mean and unit variance. The results of nitrate spot checking, reported in terms of R 2 , are presented in Figure 3.  As seen in Table 4, the average improvement provided by ET is 88% for NO 3 prediction in the Enborne, 41% for NH 4 prediction in The Cut, and 40% for TP prediction in The Cut. The average percentage improvement for TRP prediction in both catchments is only marginal (1% in the Enborne and 3% in The Cut). Consequently, we propose ET as a baseline for future studies. Notably, ET (i) does not make linearity assumptions, (ii) is relatively straightforward conceptually, and (iii) is easily implementable in programming languages like Python, MATLAB, or R.

Selected Models from Spot Checking
Apart from establishing the baseline model, spot checking (Section 3.1.4) effectively identifies the best-performing models to be optimized for the problem at hand. Using the 80% predictive performance as a threshold, we selected all the models that achieved ≥80% in both catchments for further analysis. The 80% was chosen because it is the minimum proposed for virtual sensor-based nutrient monitoring, as shown in Table 3. RF, XGB, LGBM, kNN, ET, and BR are all the algorithms that fulfilled this requirement and were subsequently selected. Table 5 presents nitrate (as NO 3 ) prediction accuracy for the six modes with different scaling techniques in the River Enborne. RF and ET showed the highest accuracy with no scaling, while kNN showed the lowest predictive performance with 71% accuracy. However, the scaling impact on kNN is significant as the performance improves from 71% to 95% (equal to RF) with MinMax scaling. Considering that variables like conductivity and TRP have very different scales (in both catchments), this improvement is explainable since kNN is a distance-based algorithm that predicts values based on the k closest neighbors and therefore is sensitive to the scale or magnitudes of the variables. With the rest of the models, data scaling marginally improves some but also degrades the performance of others. For instance, the performance of XGB drops from 94% to 90% after Quantile transformation. This effect may be due to the fact that this transformer changes the original feature distribution and, therefore, may distort the existence of any linear correlations between variables [26]. However, this behavior is unusual since scaling is known to have no effect on tree-based models [9,12]. For instance, scaling has no effect on RF, LGBM, ET, and the decision tree-based bagging regressor, as seen in Table 5. The performance of RF and ET with or without scaling and kNN after scaling is in the target rating, which is crucial for N monitoring in rural catchments where river water is sometimes used untreated, and infants are vulnerable to compounds like nitrates in drinking water [42].

The Effect of Different Scalers
The fact that the MinMax scaler is the best scaling technique suggests that the high number of outliers in the current datasets should indeed be considered normal since this scaler uses the minimum and maximum feature values for scaling and, therefore prone to very large marginal outliers.
Similar behavior is observed for TRP prediction in the same catchment. As seen in Table 6, the only major difference between NO 3 and TRP is that TRP predictive performance is no longer in the target but in the acceptable rating (as defined in Table 3), although ET's performance is relatively close to the target rating. The drop in TRP predictive performance, which was also observed for P fractions compared to NO 3 and total nitrogen in [12], may be due to the high nonlinearity and stochastic pattern fluctuations of TRP in water bodies [43]. Additionally, the MinMax scaler is still the best scaling technique, as evidenced by the kNN performance after scaling. In contrast to the River Enborne, where the NO 3 predictive performance was very high, nitrate (as NH 4 ) prediction in The Cut, as shown in Table 7, is below 90%, with the highest accuracy of 89% achieved by ET with or without data scaling. The second best performing model is kNN, whose accuracy improves from 21% to 88% after MinMax scaling. To explain this substantial drop in predictive performance between the two catchments, it is essential to note that the River Enborne is a rural catchment that receives sewage effluent from smaller wastewater treatment plants, while The Cut is an urban catchment dominated by treated sewage effluent from major towns and urban runoff. This implies that The Cut's water quality dynamics are different and more complex due to various industrial activities and urban runoff [29]. The various dissolved contaminants make it difficult for ML models to detect variable relationships. The negative impact of the Quantile transformer is more pronounced in this case. For instance, the predictive performance of XGB drastically drops from 82% to 68% after scaling. The power transformer also degrades the performance of all the models except for kNN. As mentioned already, the negative effect of these transformers may be due to the fact that they both change the original feature distribution and, hence, may distort the existence of any linear correlations between variables [26], which are already complex in urban catchments.
As with TRP prediction in the River Enborne, TRP prediction accuracy in The Cut is even lower. As shown in Table 8, only two models (RF and ET) achieve more than 80% accuracy with or without MinMax scaling. Although consistent with previous studies [12,44], this performance is in the 'poor' rating and thus unsatisfactory. While the reason for this poor performance is unclear, Harrison et al. [12] suggest that it may be due to the contrasting importance of particulate vs. dissolved nutrient fractions, including their respective sources. The other reason could be that TRP has three 'extra' electrons and can easily bond with other positively charged compounds and elements. Consequently, the formation of these 'new' compounds makes it difficult for ML algorithms to establish strong surrogate-regression relationships. The effect of scaling on total phosphorus prediction in The Cut is the same as that of TRP in the same catchment. Even the predictive performances of the six models are similar. This is because TP and TRP concentrations in The Cut are strongly correlated [29]. In fact, the correlation relationship is very close to 1:1.

Missing Values Handling
Having established that ET is the best-performing model with or without feature scaling, we assessed the impact of various imputation methods compared to listwise deletion on its predictive performance. The results of this assessment are presented in Table 9. For NO 3 prediction in River Enborne, multivariate imputation with RF, BR, XGB, or LGBM as estimators produces the best results, although equal to listwise deletion. In contrast, the impact of all imputation techniques on NH 4 prediction in The Cut is more positive. For instance, the predictive accuracy improves from the tolerable rating with listwise deletion to the acceptable rating after missing values imputation. Although this improvement is only marginal, it is crucial since the predictive performance in The Cut is, on average, unsatisfactory. For TRP prediction, multivariate imputation with RF, BR, XGB, or LGBM as estimators marginally improves the predictive performance in the River Enborne, while none of the imputation techniques is effective in The Cut. This comparison suggests that more advanced imputation techniques, such as generative adversarial imputation nets, should be explored for TRP prediction, particularly in The Cut, where the predictive performance is in the poor rating.

Sensitivity Analysis
Sensitivity analysis determines how the model output changes as input variables change. It enables model developers and end-users to understand the impact of each feature on the model output. Since some features can be more important (or more relevant) than others, feature importance enables their ranking based on their importance and contribution to the final model. Table 10 presents the predictive performance of the ET model on the five target variables as a function of each predictor's contribution based on SHAP values. A table of this nature would enable a cost-benefit analysis for water managers to choose the appropriate number of real-time sensors to install based on the achievable predictive performance. As seen in Table 10, the NO 3 predictive performance from only three surrogates in the River Enborne is in the acceptable rating. In contrast, TRP requires at least four predictors to achieve the same rating. Notably, a set of four surrogate sensors (EC, DO, Temp, and Turb) would be able to simultaneously predict NO 3 and TRP within the acceptable rating. Interestingly, the inclusion of Chl results in a marginal drop in performance, implying that the model considers it noise or irrelevant. In The Cut, a subset of six surrogate sensors would be able to predict NH 4 within the acceptable rating, while TRP and TP would be within the poor rating. There is a drop in predictive performance at the introduction of the second predictor in The Cut. This suggests that at least three surrogate sensors would be necessary for this catchment. In monitoring settings where the flow rate is already measured in high-frequency, as in the River Enborne and The Cut, its inclusion as a predictor is worthwhile since it increases TRP and TP predictive performance in The Cut to more than 85%, which is the tolerable rating.

Performance Results of Our Approach: A Comparative Analysis
To highlight this work's contribution, we compared the performance of our approach with the three most relevant studies [3,9,12] in the same problem domain. Firstly, we present a comparative analysis in terms of the crucial steps involved in virtual sensor development, as shown in Table 11. Table 11. A methodological comparison of our work with the three most related studies, where N/S = not specified, ML = machine learning, kNN = k-nearest neighbor, RF = random forest, MLR = multiple linear regression, DT = decision tree, XGB = extreme gradient boosting, LGBM = light gradient boosting machine, BR = bagging regressor, MLP = multi-layer perceptron, SVM = support vector machine, ET = extremely randomized trees, GB = gradient boosting, SGD = stochastic gradient descent, and SHAP = Shapley additive explanations.
Step [ Secondly, we compared the predictive performance of the three studies with ET (our work) in terms of the accuracy values reached, as shown in Table 12. As seen in Table 12, our work outperformed the three studies in terms of the RMSE and NSE or R 2 values reached, except for one case when the RMSE was slightly lower, although R 2 was still superior. Since parts of the watershed studied by [12] are relatively urbanized, the TP predictive performance of 0.74 shows that the complex nature of water quality data in urban catchments does indeed make it difficult for traditional machine learning models to establish strong inferences between target variables and surrogates.

Conclusions
In the present work, an innovative ML-based virtual sensing concept for continuous monitoring of N and P is discussed. The effectiveness of the virtual sensors is verified using two catchments of the River Thames (in the UK) with contrasting land uses. Using only the predictors most commonly measured in real-time as inputs, the predictive performances of the ET-based virtual sensors were higher than XGB, LGBM, kNN, BR, and the commonly applied RF model. A summary of this work's contributions is as follows: 1.
This study presented a specification book and best practices for developing virtual sensors for nutrient monitoring. These concepts will aid in developing robust and interpretable virtual sensors whose performance can be independently verified; 2.
The effect of various scalers and transformers on WQ data with outliers is assessed. Although this is usually overlooked in WQ monitoring studies, the present work proved its importance. For instance, the most effective scaler was the MinMax scaler and not the commonly applied standard scaler; 3.
Even though missing data is a common problem in WQ data, it is often overlooked in WQ monitoring studies. Therefore, this study assessed the impact of missing value imputation in predicting N and P concentrations where RF and LGBM-based multivariate imputers were the most effective.
Although ET performed satisfactorily in the River Enborne (achieving R 2 values as high as 96% with only four predictors), its limitation is that the predictive performance, particularly for TRP and TP in The Cut, is still unsatisfactory. The required number of surrogate sensors (six) to achieve the acceptable accuracy rating for NH 4 prediction is too high. Additionally, the same surrogates can only provide 82% accuracy for TRP and TP.
Notably, the 82% was achieved when missing values were handled by listwise deletion, which reduced the size of the dataset from 15,636 to 8934. Since none of the applied imputation methods resulted in better performance in this case, a realistic strategy includes the exploration of more advanced imputation and predictive techniques for developing virtual sensors in more urbanized watersheds. Therefore, future works will explore (i) the effect of data augmentation using generative adversarial networks and (ii) deep learning models such as deep echo state networks as predictive models.

Conflicts of Interest:
The authors declare no conflict of interest.