# Determination of Optimal Predictors and Sampling Frequency to Develop Nutrient Soft Sensors Using Random Forest


## Abstract


R^{2} values above 0.95 for nitrate, orthophosphate, and ammonium for both stations. The study then trained the models on 40 sampling frequencies, ranging from monthly to 15-min intervals. The results showed that as the sampling frequency increased, the model’s performance, measured by RMSE, improved. The optimal balance between sampling frequency and model performance was identified using a knee-point determination algorithm. The optimal sampling frequency for nitrate was 3.6 and 2.8 h for the 2 stations, respectively. For orthophosphate, it was 2.4 and 1.8 h. For ammonium, it was 2.2 h for 1 station. The study highlights the utility of surrogate models for monitoring nutrient levels and demonstrates that nutrient soft sensors can function with fewer predictors at lower frequencies without significantly decreasing performance.

## 1. Introduction

R^{2} higher than 0.8 for orthophosphate, nitrite and nitrate in Vietnam [14]. Shen et al. (2020) used the RF to predict total nitrogen and total phosphorus covering 62,495 stations across the United States and obtained R^{2} between 0.56 and 0.88 [15]. Harrison et al. (2021) used RF models to predict different nitrogen and phosphorus species; their models of total nitrogen and total phosphorus were able to explain 85% and 74% of the variation, respectively [16]. Among all these studies, the study of Tran et al. (2022) sets the premise of this study: they used 15-min high-frequency measurements of nutrients and other water quality parameters, developed the soft sensor using the random forest algorithm, and found R^{2} higher than 0.95 for NO_{3}-N, OPO_{4}-P and NH_{4}-N [17]. These studies demonstrate that soft sensors based on ML models can provide a reliable alternative for real-time monitoring. However, in order to build an effective and efficient soft sensor, the ML models require an appropriate amount of high-quality input data, while balancing cost and feasibility [18].

## 2. Materials and Methods

(NO_{3}-N), orthophosphate (OPO_{4}-P), and ammonium (NH_{4}-N). The workflow of this study is presented in Figure S1 of Supporting Information.

#### 2.1. Data and Study Site Description

m^{3} s^{−1}. Using automated measuring instruments, Station Erlabrunn and Station Kahl am Main monitor DO, Temp, EC, pH, NO_{3}-N and OPO_{4}-P with a 15-min temporal resolution; Station Kahl am Main additionally monitors NH_{4}-N. The discharge rates are not monitored at the same automated water quality stations. Instead, discharge data from nearby stations on the main stem of the river are used, specifically Station Würzburg for Erlabrunn (upstream) and Station Obernau Aschaffenburg for Kahl am Main (downstream).

NH_{4}-N, so the models were only developed for NO_{3}-N and OPO_{4}-P. For Kahl am Main, the NO_{3}-N model used a dataset of ~4 years between 16 September 2017 and 31 August 2021. The OPO_{4}-P and NH_{4}-N models used a ~2-year dataset between 2 September 2019 and 31 August 2021.

#### 2.2. Tools Used for Data Analysis

#### 2.3. Data Pre-Processing and Splitting

NO_{3}-N, 10% of the database is assumed to be contaminated, and for NH_{4}-N and OPO_{4}-P, 20% is assumed to be contaminated.
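A contamination share of this kind is the setting that an Isolation Forest [34] takes as input. The following is a minimal sketch, assuming scikit-learn's `IsolationForest` as the implementation (the toolchain is an assumption, and `drop_outliers` and its arguments are hypothetical names); it drops the rows that the forest labels as anomalous for a given contamination share.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def drop_outliers(df: pd.DataFrame, columns: list[str], contamination: float) -> pd.DataFrame:
    """Remove rows that an Isolation Forest labels as outliers.

    `contamination` is the assumed share of contaminated rows,
    e.g. 0.10 for NO3-N or 0.20 for NH4-N and OPO4-P.
    """
    forest = IsolationForest(contamination=contamination, random_state=0)
    labels = forest.fit_predict(df[columns])  # 1 = inlier, -1 = outlier
    return df.loc[labels == 1]
```

The columns passed to the forest must be free of NaNs. Whether the detection should run on the target alone or on all variables is a design choice left open here.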

NO_{3}-N at Erlabrunn station, NO_{3}-N has 3937 NaN values, so all these 3937 rows were removed. Next, the NaN values of the other predictors were interpolated if the number of consecutive NaNs was below 30; otherwise, the rows were deleted. This produced a data frame with 65,808 rows; summary statistics of the cleaned dataset are given in Table 1.
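The gap-handling rule described above can be sketched with pandas. This is an illustration under assumptions: the function name `clean_frame`, the use of linear interpolation, and the column layout are hypothetical. Note that `Series.interpolate(limit=...)` alone would partially fill long gaps, so the run length of each gap is computed explicitly.

```python
import numpy as np
import pandas as pd


def clean_frame(df: pd.DataFrame, target: str, max_gap: int = 30) -> pd.DataFrame:
    """Drop rows with a missing target, interpolate short predictor gaps,
    and delete rows that belong to gaps of `max_gap` or more consecutive NaNs."""
    out = df.dropna(subset=[target]).copy()
    for col in [c for c in out.columns if c != target]:
        is_nan = out[col].isna()
        # Give every run of consecutive equal mask values its own id
        run_id = (is_nan != is_nan.shift()).cumsum()
        # Length of the NaN run each row belongs to (0 for non-NaN rows)
        run_len = is_nan.groupby(run_id).transform("sum")
        long_gap = is_nan & (run_len >= max_gap)
        filled = out[col].interpolate(method="linear", limit_area="inside")
        # Keep interpolated values only where the gap was short
        out[col] = filled.where(~long_gap)
    return out.dropna()
```

Leading and trailing NaNs are not extrapolated (`limit_area="inside"`), so those rows are dropped as well.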

R^{2} and RMSE for the cross-validation performance were calculated by averaging the results of all 5 cross-validation runs. The testing dataset was used to evaluate the final model’s performance on the unseen data.
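As a rough sketch of this evaluation scheme, assuming scikit-learn and a 20% hold-out for the testing dataset, the averaged cross-validation metrics could be computed as below. The function name `cv_scores` and the forest settings are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split


def cv_scores(X, y, n_splits=5, seed=0):
    """Return mean R^2 and RMSE over `n_splits` cross-validation runs
    on the working data; the held-out 20% is left untouched here."""
    X_work, X_test, y_work, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    res = cross_validate(
        model, X_work, y_work, cv=n_splits,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    return {
        "r2": float(np.mean(res["test_r2"])),
        # scikit-learn reports the negated RMSE, so flip the sign back
        "rmse": float(-np.mean(res["test_neg_root_mean_squared_error"])),
    }
```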

#### 2.4. Random Forest Model

R^{2} and RMSE on the unseen testing dataset and compared with the Multi-Linear Regression (MLR) models with the same predictors.
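The hyperparameter values listed in the hyperparameter table of this article can be written as a search grid. The sketch below assumes scikit-learn's `GridSearchCV` and `RandomForestRegressor`; the mapping of the table rows to scikit-learn parameter names, the number of trees, and the function name `tune_rf` are assumptions. Recent scikit-learn releases no longer accept `'auto'` for `max_features`; for regressors it meant "use all features", which is written as `1.0` here.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Assumed mapping of the tabulated hyperparameters to scikit-learn names
PARAM_GRID = {
    "bootstrap": [True, False],
    "max_features": [1.0, "sqrt"],       # 1.0 = all features (former 'auto')
    "max_depth": [10, 20, 30],
    "min_samples_split": [6, 12, 20],
    "min_samples_leaf": [6, 12, 20],
}


def tune_rf(X, y, param_grid=PARAM_GRID, cv=5, seed=0):
    """Exhaustive grid search scored by cross-validated R^2."""
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=100, random_state=seed),
        param_grid, scoring="r2", cv=cv,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```

The full grid has 2 × 2 × 3 × 3 × 3 = 108 combinations, so with 5-fold cross-validation one search fits 540 forests plus the final refit.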

#### 2.5. Selection of Variable Predictors

R^{2} from the cross-validation results, in this case ma…

#### 2.6. … for Determination of the Optimal Sampling Frequency

K_{f}(x), that expresses the curvature of any continuous function f as a function of its first and second derivative at any point, as follows:
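In its standard closed form, which is also the form given in the Kneedle paper [36], the curvature of a twice-differentiable function f reads:

$$
K_{f}(x) = \frac{f''(x)}{\left(1 + f'(x)^{2}\right)^{3/2}}
$$

The knee of a curve is then the point at which this curvature reaches its maximum.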

## 3. Results

#### 3.1. Model Performance with Varying Subset of Predictors

R^{2} was chosen as the determining criterion for the order in which the predictors were fed into the model. The input variables, together with the sequence in which they were chosen, are listed in Table 3. Overall, at both stations the models predicted NO_{3}-N with a higher R^{2} than NH_{4}-N and OPO_{4}-P. For NO_{3}-N, the upstream and downstream models fitted with one predictor gave an R^{2} of 0.865 and 0.927, respectively, and when trained with all the predictors, both models reached an R^{2} of 0.999. The two OPO_{4}-P models differed widely when fitted with just one predictor: the upstream model had an R^{2} of 0.342 and the downstream one 0.930. Nevertheless, the models performed comparably when fitted with all the predictors, with R^{2} values of 0.959 (upstream) and 0.989 (downstream). The single NH_{4}-N model reached an R^{2} of 0.702 with one predictor and 0.970 with all the predictors.

R^{2} after the addition of 5 predictors (Table 3). In some cases there was no substantial improvement in R^{2} with even fewer predictors, for example for NO_{3}-N at both stations after 3 predictors. Some predictors recur in most cases. For example, week_sin is the first predictor in all the cases. EC is the second predictor in all the cases except 1. Similarly, Temp is the third predictor in all the cases except 1. DO is at the fourth or fifth position in 4 of the cases. Q (listed as Flow in Table 3) is at the fourth or fifth position in 3 cases. Out of the 25 positions (first 5 predictors × 5 cases), 23 are taken by the same set of predictors, i.e., week_sin, EC, Temp, DO and Q.
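The predictor ordering in Table 3 reads like the output of a greedy forward selection in which the next predictor is the one that maximizes the cross-validated R^{2}. The sketch below illustrates that idea under this assumption; it is not taken from the study's code, and `forward_order`, the number of trees, and the NumPy array input are hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def forward_order(X, y, names, cv=5, seed=0):
    """Greedy forward selection on a NumPy feature matrix `X`.

    At every step, each remaining column is tried together with the
    already chosen ones, and the column with the highest mean
    cross-validated R^2 is added. Returns [(name, r2), ...] in order.
    """
    remaining = list(range(X.shape[1]))
    chosen, history = [], []
    while remaining:
        scores = {}
        for j in remaining:
            model = RandomForestRegressor(n_estimators=30, random_state=seed)
            scores[j] = cross_val_score(
                model, X[:, chosen + [j]], y, cv=cv, scoring="r2"
            ).mean()
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
        history.append((names[best], float(scores[best])))
    return history
```

With 9 predictors this trains 9 + 8 + … + 1 = 45 candidate subsets per target, each cross-validated, which is why the approach is more resource-intensive than ranking by a built-in feature importance.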

#### 3.2. Comparison of Random Forest Model Performance with Linear Regression Model

R^{2} in Figure 1. The RF model performances on unseen data were typically comparable to those on the working datasets (compare Table 3 and Figure 1). The RF models consistently outperformed the MLR models. The R^{2} values of the RF models were 0.999 and 0.999 for NO_{3}-N, 0.987 and 0.963 for OPO_{4}-P and 0.983 for NH_{4}-N, whereas for the MLR models they were 0.886 and 0.909 for NO_{3}-N, 0.752 and 0.59 for OPO_{4}-P and 0.195 for NH_{4}-N. For the Kahl am Main Station, the RF model reduced the RMSE in comparison to the MLR model for NO_{3}-N from 0.266 to 0.024 mgN L^{−1}, for OPO_{4}-P from 0.021 to 0.005 mgP L^{−1}, and for NH_{4}-N from 0.017 to 0.003 mgN L^{−1}. Similarly, for the Erlabrunn Station, the RMSE fell for NO_{3}-N from 0.296 to 0.025 mgN L^{−1} and for OPO_{4}-P from 0.019 to 0.006 mgP L^{−1}. These results suggest that RF models are capable of reproducing data trends and seasonality as well as providing reasonably accurate estimates of NO_{3}-N, NH_{4}-N, and OPO_{4}-P concentrations.
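A comparison of this kind can be reproduced in outline as follows. This is a minimal sketch assuming scikit-learn, a 20% hold-out, and default model settings; `compare_rf_mlr` is a hypothetical helper, not the study's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def compare_rf_mlr(X, y, seed=0):
    """Fit RF and MLR on the same predictors and score both on unseen data."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "RF": RandomForestRegressor(n_estimators=100, random_state=seed),
        "MLR": LinearRegression(),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "r2": float(r2_score(y_te, pred)),
            "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        }
    return results
```

On a strongly non-linear synthetic target the gap between the two models is large, which mirrors the pattern reported above for NH_{4}-N.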

#### 3.3. Model’s Performance with Varying Sampling Frequencies

NO_{3}-N at Kahl am Main and Erlabrunn stations, respectively. For OPO_{4}-P, the OSFs were found to be every 2.4 and 1.8 h for the Kahl am Main and Erlabrunn stations, respectively. For NH_{4}-N, the OSF was every 2.2 h for the Kahl am Main station.
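To illustrate how a knee can be located on an RMSE-versus-frequency curve, the sketch below uses a simplified chord-distance heuristic: after normalizing both axes, it picks the point farthest from the straight line joining the first and last points. This is a stand-in for illustration, not the Kneedle algorithm [36] itself, and `knee_index` is a hypothetical name.

```python
import numpy as np


def knee_index(x, y):
    """Index of the point with the largest distance to the chord
    between the first and last points of the normalized curve."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    x0, y0, x1, y1 = xn[0], yn[0], xn[-1], yn[-1]
    # Perpendicular distance of every point to the chord
    dist = np.abs((y1 - y0) * xn - (x1 - x0) * yn + x1 * y0 - y1 * x0)
    dist /= np.hypot(x1 - x0, y1 - y0)
    return int(np.argmax(dist))


# Toy curve: error falls quickly at first, then flattens
freq = np.arange(1, 11)
rmse = 1.0 / freq
print(freq[knee_index(freq, rmse)])  # prints 3
```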

NO_{3}-N almost completely overlap the observed curves for the optimal and 15-min sampling frequencies (Figure 3c,d).

OPO_{4}-P and NH_{4}-N, where the observed values are smaller than 0.2 mgP L^{−1} and 0.15 mgN L^{−1}, respectively, the models need to be comparatively more sensitive than the NO_{3}-N model to detect decimal-level changes. Their performances are shown in Figure 4 and Figure 5, respectively. A visual comparison of the time series at the OSF and at the 15-min interval shows that the plots have a very similar shape. Nevertheless, some peaks and troughs are better predicted by the 15-min model. For example, in the NH_{4}-N case (Figure 5), the observed values in the OSF plot tend to protrude beyond the predictions at the peaks and troughs. This is much more prominent between January 2021 and July 2021, whereas in the 15-min plot the observed and predicted values overlap much more.

## 4. Discussion

#### 4.1. Nutrient Soft Sensor Performance Using RF

R^{2} values exceeding 0.95 in the prediction of NO_{3}-N, OPO_{4}-P and NH_{4}-N. Our results are comparable to the similar study using 15-min high-frequency measurements by Tran et al. (2022) [17] and have higher R^{2} values than the three studies mentioned in Section 1 [13,14,15]. Remarkably, this predictive capacity was maintained even with a smaller subset of predictors comprising five variables. These findings are consistent with prior research in this field [17,37], which likewise emphasized the importance of optimizing soft-sensor performance through the identification of optimal predictor sets. In this study, the order of inclusion does not follow the ranking of the predictors by their correlation with the target, as was previously expected (Supporting Information, Table S1). The RF model uses predictor variables that, a priori, appear irrelevant, both intuitively and when looking at correlation coefficients or stepwise subsets with linear models. This is due to the intricate linkages that underlie the model and demonstrates the ability of the RF algorithm to mimic the non-linear relations embedded in the environmental phenomenon. The findings indicate that, to complete the variable sub-setting process successfully, it is essential to follow a procedure that is in line with the model that will be implemented. Simpler and less resource-intensive methods exist in the literature, such as feature weight coefficients or feature importance [38,39]. However, when the goal is to choose the fewest predictors without compromising the model’s accuracy, user-defined performance metrics perform better.

#### 4.2. Optimal Predictor Subset and Sampling Frequency

R^{2} greater than 0.99 for two cases and greater than 0.95 for the other three. EC, Temp, DO and Q are commonly measured in rivers and streams by physical sensors within the monitoring schemes of the authorities, with high robustness, accuracy, and temporal resolution. Using them as surrogates would make nutrient soft sensors applicable in a broader range of real-world settings.

NO_{3}-N at the Kahl am Main station gave an RMSE of 0.35 mgN L^{−1}. In comparison, the RMSE was 0.02 mgN L^{−1} when the model was trained on the 15-min-interval data. Nevertheless, the model trained with the monthly data and its results can still be useful in certain cases. Overall, this study shows the RMSEs that can be expected when the surrogate models for nutrient soft sensors are trained on different sampling frequencies, offering a first idea of the accuracy level such models might achieve.
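The effect of the training frequency can be explored with a small experiment of the following shape: thin the training data to one sample per interval, fit the model, and score it on a full-resolution hold-out. The sketch assumes pandas data frames with a `DatetimeIndex` and scikit-learn; `rmse_at_interval` and its arguments are hypothetical names, and taking the first reading in each interval is one of several possible thinning rules.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def rmse_at_interval(train: pd.DataFrame, test: pd.DataFrame,
                     target: str, interval: str, seed: int = 0) -> float:
    """Train on one reading per `interval` (e.g. '120min', '1D') and
    return the RMSE on the full-resolution test set."""
    # Keep the first available reading in every interval
    thinned = train.resample(interval).first().dropna()
    predictors = [c for c in train.columns if c != target]
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(thinned[predictors], thinned[target])
    pred = model.predict(test[predictors])
    return float(np.sqrt(mean_squared_error(test[target], pred)))
```

Calling this function over a list of intervals yields an RMSE-versus-frequency curve of the kind shown in Figure 2.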

OPO_{4}-P and NH_{4}-N, the OSFs were found to be approximately every 2 h. At these OSFs, the R^{2} values were lower than those of the models trained on 15-min intervals by 2.75% and 1.69% for OPO_{4}-P at the upstream and downstream stations, respectively, and by 3.72% for NH_{4}-N at the downstream station. However, at those OSFs, the amount of data collected and stored is only 1/8 of what was previously being monitored at 15-min intervals. For the NO_{3}-N soft sensor at the Kahl am Main station, the OSF was found to be approximately every 4 h, at which the R^{2} was 0.25% lower than at 15-min intervals. Similarly, for the Erlabrunn station, the OSF for NO_{3}-N was determined to be approximately every 3 h, at a cost of 0.15% in R^{2}. These results would allow the physical sensors installed on the rivers to operate at lower frequencies and would save significant storage space and costs associated with measuring, handling and processing data for building soft sensors.
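The data-volume saving follows directly from the ratio of the sampling intervals. As a small worked example (the helper name `reduction_factor` is hypothetical), the rounded OSFs of 2, 3 and 4 h compare with the 15-min baseline as follows:

```python
def reduction_factor(osf_hours: float, base_minutes: float = 15.0) -> float:
    """Number of base-interval readings replaced by one reading at the OSF."""
    return osf_hours * 60.0 / base_minutes


for label, hours in [("2 h", 2), ("3 h", 3), ("4 h", 4)]:
    print(f"{label}: 1/{reduction_factor(hours):.0f} of the 15-min data volume")
# 2 h: 1/8 of the 15-min data volume
# 3 h: 1/12 of the 15-min data volume
# 4 h: 1/16 of the 15-min data volume
```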

#### 4.3. Measures against Overfitting

R^{2} values higher than 0.95 in all cases, suggest the possibility of this issue. To make sure the models were not overfitted, all models, regardless of sampling frequency and predictor subset, were tested with 20% of the datapoints from the whole dataset. Furthermore, in some cases, the models using lower frequencies were tested on a much larger dataset than they were trained on. For example, the models shown in Figure 3b, Figure 4b and Figure 5b were tested on a dataset that was 19 times larger than the training dataset, and they achieved R^{2} values of 0.98, 0.72, and 0.73, respectively, which point towards a low likelihood of overfitting. Additionally, visual inspection of these figures also argues against overfitting, as the models perform well given the substantial difference between the sizes of the training and testing datasets. The models in the figures simulate the periodicity of the time series rather well, although they miss the highs and lows.

#### 4.4. Limitations and Future Research Perspective

NH_{4}-N and OPO_{4}-P produced R^{2} values of 0.95 and above, there should be a word of caution. Both sensors have a least count of 0.01 mg L^{−1}, which means there is an uncertainty of 0.01 mg L^{−1} with each measurement. The mean concentrations of NH_{4}-N and OPO_{4}-P at the Kahl am Main station were 0.03 mgN L^{−1} and 0.10 mgP L^{−1}, respectively, which inherently introduces a relative uncertainty of 33% and 10%. This high level of uncertainty makes the measurements from these sensors imprecise. For comparison, the mean of NO_{3}-N is 3.66 mgN L^{−1} at the Kahl am Main station and is also measured with a sensor with a least count of 0.01 mgN L^{−1}, which amounts to only about 0.27% uncertainty. Since the measurements for NH_{4}-N and OPO_{4}-P are not that precise, the models might have also incorporated these patterns of errors. This problem is evident from the time series in Figure 4 and Figure 5, which are patchy with sudden peaks when compared to the relatively continuous time series of NO_{3}-N (Figure 3). Therefore, reliable soft sensors for NH_{4}-N and OPO_{4}-P require physical sensors with higher precision for these nutrients.

## 5. Conclusions

R^{2} values higher than 0.95 for all the 5 cases, namely NO_{3}-N, NH_{4}-N, OPO_{4}-P concentrations at 15-min intervals at 2 stations on the Main River, Germany. Feature engineering results demonstrated that the model could perform at good accuracy even when trained on fewer predictors. Specifically, with 5 out of 9 predictors the model reached an R^{2} of at least 0.99 for 3 cases and higher than 0.95 for the other 2 cases. This supports the hypothesis that a smaller subset of predictors can harness the complex relationships behind these models in a more efficient way. Because high measurement frequency raises issues with data handling and storage, the best trade-off between performance and measurement frequency for the RF models was identified using knee point analysis. The OSFs were found to be approximately 3 h and 4 h for NO_{3}-N for the upstream and downstream stations, respectively, 2 h for OPO_{4}-P for both the stations and 2 h for NH_{4}-N for the downstream station. Overall, the findings of this study demonstrate that the model can perform well even when trained on a lower sampling frequency and with fewer predictors. This would ensure an overall leaner soft sensor for monitoring water quality in river systems with fewer storage problems and reduced operational costs.

## References

1. Viviano, G.; Salerno, F.; Manfredi, E.C.; Polesello, S.; Valsecchi, S.; Tartari, G. Surrogate measures for providing high frequency estimates of total phosphorus concentrations in urban watersheds. Water Res. **2014**, 64, 265–277.
2. Wong, Y.J.; Nakayama, R.; Shimizu, Y.; Kamiya, A.; Shen, S.; Muhammad Rashid, I.Z.; Nik Sulaiman, N.M. Toward industrial revolution 4.0: Development, validation, and application of 3D-printed IoT-based water quality monitoring system. J. Clean. Prod. **2021**, 324, 129230.
3. Rode, M.; Wade, A.J.; Cohen, M.J.; Hensley, R.T.; Bowes, M.J.; Kirchner, J.W.; Arhonditsis, G.B.; Jordan, P.; Kronvang, B.; Halliday, S.J.; et al. Sensors in the Stream: The High-Frequency Wave of the Present. Env. Sci. Technol. **2016**, 50, 10297–10307.
4. Pellerin, B.A.; Stauffer, B.A.; Young, D.A.; Sullivan, D.J.; Bricker, S.B.; Walbridge, M.R.; Clyde, G.A., Jr.; Shaw, D.M. Emerging Tools for Continuous Nutrient Monitoring Networks: Sensors Advancing Science and Water Resources Protection. JAWRA J. Am. Water Resour. Assoc. **2016**, 52, 993–1008.
5. Brack, W.; Dulio, V.; Ågerstrand, M.; Allan, I.; Altenburger, R.; Brinkmann, M.; Bunke, D.; Burgess, R.M.; Cousins, I.; Escher, B.I.; et al. Towards the review of the European Union Water Framework Directive: Recommendations for more efficient assessment and management of chemical contamination in European surface water resources. Sci. Total Environ. **2017**, 576, 720–737.
6. Shang, C.; Gao, X.; Yang, F.; Huang, D. Novel Bayesian framework for dynamic soft sensor based on support vector machine with finite impulse response. IEEE Trans. Control Syst. Technol. **2013**, 22, 1550–1557.
7. Curreri, F.; Fiumara, G.; Xibilia, M.G. Input selection methods for soft sensor design: A survey. Future Internet **2020**, 12, 97.
8. Joseph, F.J.J.; Nayak, D.; Chevakidagarn, S. Local maxima niching genetic algorithm based automated water quality management system for Betta splendens. J. Eng. Digit. Technol. JEDT **2020**, 8, 48–63.
9. Aghelpour, P.; Mohammadi, B.; Biazar, S.M. Long-term monthly average temperature forecasting in some climate types of Iran, using the models SARIMA, SVR, and SVR-FA. Theor. Appl. Climatol. **2019**, 138, 1471–1480.
10. Shamshirband, S.; Esmaeilbeiki, F.; Zarehaghi, D.; Neyshabouri, M.; Samadianfard, S.; Ghorbani, M.A.; Mosavi, A.; Nabipour, N.; Chau, K.-W. Comparative analysis of hybrid models of firefly optimization algorithm with support vector machines and multilayer perceptron for predicting soil temperature at different depths. Eng. Appl. Comput. Fluid Mech. **2020**, 14, 939–953.
11. Qasem, S.N.; Samadianfard, S.; Kheshtgar, S.; Jarhan, S.; Kisi, O.; Shamshirband, S.; Chau, K.-W. Modeling monthly pan evaporation using wavelet support vector regression and wavelet artificial neural networks in arid and humid climates. Eng. Appl. Comput. Fluid Mech. **2019**, 13, 177–187.
12. Zhou, J.; Qiu, Y.; Zhu, S.; Armaghani, D.J.; Li, C.; Nguyen, H.; Yagiz, S. Optimization of support vector machine through the use of metaheuristic algorithms in forecasting TBM advance rate. Eng. Appl. Artif. Intell. **2021**, 97, 104015.
13. Francke, T.; López-Tarazón, J.A.; Schröder, B. Estimation of suspended sediment concentration and yield using linear models, random forests and quantile regression forests. Hydrol. Process. **2008**, 22, 4892–4904.
14. Ha, N.-T.; Nguyen, H.Q.; Truong, N.C.Q.; Le, T.L.; Thai, V.N.; Pham, T.L. Estimation of nitrogen and phosphorus concentrations from water quality surrogates using machine learning in the Tri An Reservoir, Vietnam. Environ. Monit. Assess. **2020**, 192, 789.
15. Shen, L.Q.; Amatulli, G.; Sethi, T.; Raymond, P.; Domisch, S. Estimating nitrogen and phosphorus concentrations in streams and rivers, within a machine learning framework. Sci. Data **2020**, 7, 161.
16. Harrison, J.W.; Lucius, M.A.; Farrell, J.L.; Eichler, L.W.; Relyea, R.A. Prediction of stream nitrogen and phosphorus concentrations from high-frequency sensors using Random Forests Regression. Sci. Total Environ. **2021**, 763, 143005.
17. Tran, Y.B.; Arias-Rodriguez, L.F.; Huang, J. Predicting high-frequency nutrient dynamics in the Danube River with surrogate models using sensors and Random Forest. Front. Water **2022**, 4, 894548.
18. Nabipour, N.; Mosavi, A.; Baghban, A.; Shamshirband, S.; Felde, I. Extreme learning machine-based model for Solubility estimation of hydrocarbon gases in electrolyte solutions. Processes **2020**, 8, 92.
19. Atkinson, A.C. Plots, Transformations and Regression; an Introduction to Graphical Methods of Diagnostic Regression Analysis; Oxford University Press: Oxford, UK, 1985.
20. Breiman, L. Random Forests. Mach. Learn. **2001**, 45, 5–32.
21. Bartram, J.; Ballance, R.; World Health Organization; United Nations Environment Programme. Water Quality Monitoring: A Practical Guide to the Design and Implementation of Freshwater Quality Studies and Monitoring Programs; Bartram, J., Balance, R., Eds.; E & FN Spon: London, UK, 1996.
22. Strobl, R.O.; Robillard, P.D. Network design for water quality monitoring of surface freshwaters: A review. J. Environ. Manag. **2008**, 87, 639–648.
23. Huang, J.; Borchardt, D.; Rode, M. How do inorganic nitrogen processing pathways change quantitatively at daily, seasonal, and multiannual scales in a large agricultural stream? Hydrol. Earth Syst. Sci. **2022**, 26, 5817–5833.
24. Huang, J.; Merchan-Rivera, P.; Chiogna, G.; Disse, M.; Rode, M. Can high-frequency data enable better parameterization of water quality models and disentangling of DO processes? In Proceedings of the EGU General Assembly Conference Abstracts, Online, 19–30 April 2021; p. EGU21-8936.
25. Lannergård, E.E.; Ledesma, J.L.J.; Fölster, J.; Futter, M.N. An evaluation of high frequency turbidity as a proxy for riverine total phosphorus concentrations. Sci. Total Environ. **2019**, 651, 103–113.
26. Skeffington, R.A.; Halliday, S.J.; Wade, A.J.; Bowes, M.J.; Loewenthal, M. Using high-frequency water quality data to assess sampling strategies for the EU Water Framework Directive. Hydrol. Earth Syst. Sci. **2015**, 19, 2491–2504.
27. Liu, Y.; Zheng, B.; Wang, M.; Xu, Y.; Qin, Y. Optimization of sampling frequency for routine river water quality monitoring. Sci. China Chem. **2014**, 57, 772–778.
28. Zhou, Y. Sampling frequency for monitoring the actual state of groundwater systems. J. Hydrol. **1996**, 180, 301–318.
29. Naddeo, V.; Zarra, T.; Belgiorno, V. Optimization of Sampling Frequency for River Water Quality Assessment According to Italian implementation of the EU Water Framework Directive. Environ. Sci. Policy **2007**, 10, 243–249.
30. Anvari, A.; Reyes, J.; Esmaeilzadeh, E.; Jarvandi, A.; Langley, N.; Navia, K. Designing an Automated Water Quality Monitoring System for West and Rhode Rivers. In Proceedings of the 2009 Systems and Information Engineering Design Symposium, Charlottesville, VA, USA, 24 April 2009; pp. 131–136.
31. Khalil, B.; Ou, C.; Proulx-McInnis, S.; St-Hilaire, A.; Zanacic, E. Statistical Assessment of the Surface Water Quality Monitoring Network in Saskatchewan. Water Air Soil Pollut. **2014**, 225, 2128.
32. Chen, Y.; Han, D. Water quality monitoring in smart city: A pilot project. Autom. Constr. **2018**, 89, 307–316.
33. Silva, R.; Lopes da Silveira, A.L.; Silveira, G. Spectral analysis in determining water quality sampling intervals. RBRH **2019**, 24, e46.
34. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
35. Hempel, S.; Adolphs, J.; Landwehr, N.; Willink, D.; Janke, D.; Amon, T. Supervised Machine Learning to Assess Methane Emissions of a Dairy Building with Natural Ventilation. Appl. Sci. **2020**, 10, 6938.
36. Satopaa, V.; Albrecht, J.; Irwin, D.; Raghavan, B. Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems Workshops, Minneapolis, MN, USA, 20–24 June 2011; pp. 166–171.
37. Castrillo, M.; García, Á.L. Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods. Water Res. **2020**, 172, 115490.
38. Paul, A.; Mukherjee, D.P.; Das, P.; Gangopadhyay, A.; Chintha, A.R.; Kundu, S. Improved Random Forest for Classification. IEEE Trans. Image Process. **2018**, 27, 4012–4024.
39. Li, H.B.; Wang, W.; Ding, H.W.; Dong, J. Trees Weighting Random Forest Method for Classifying High-Dimensional Noisy Data. In Proceedings of the 2010 IEEE 7th International Conference on E-Business Engineering, Shanghai, China, 10–12 November 2010; pp. 160–163.
40. Coraggio, E.; Han, D.; Gronow, C.; Tryfonas, T. Water Quality Sampling Frequency Analysis of Surface Freshwater: A Case Study on Bristol Floating Harbour. Front. Sustain. Cities **2022**, 3, 791595.

**Figure 1.** Performance of linear regression and random forest model on unseen data with (**a**) RMSE and (**b**) R^{2} as the evaluation criteria.

**Figure 2.** RMSE of the RF models as a function of sampling frequency in Hz: (**a**) NO_{3}-N in Kahl am Main station; (**b**) NO_{3}-N in Erlabrunn station; (**c**) OPO_{4}-P in Kahl am Main station; (**d**) OPO_{4}-P in Erlabrunn station; (**e**) NH_{4}-N in Kahl am Main station.

**Figure 3.** Observed and predicted values of NO_{3}-N at Erlabrunn station when the model was trained on the following frequencies: (**a**) monthly; (**b**) daily; (**c**) optimal, i.e., 2.8 h but, for the sake of simplicity, rounded off to 3 h; (**d**) 15-min intervals.

**Figure 4.** Observed and predicted values of OPO_{4}-P at Kahl am Main station when the model was trained on the following frequencies: (**a**) monthly; (**b**) daily; (**c**) optimal, i.e., 2.4 h but, for the sake of simplicity, rounded off to 2 h; (**d**) 15-min intervals.

**Figure 5.** Observed and predicted values of NH_{4}-N at Kahl am Main station when the model was trained on the following frequencies: (**a**) monthly; (**b**) daily; (**c**) optimal, i.e., 2.2 h but, for the sake of simplicity, rounded off to 2 h; (**d**) 15-min intervals.

**Table 1.** Mean, standard deviation (std), minimum value (Min) and maximum value (Max) of the monitored parameters at both stations.

Parameter | Unit | Kahl am Main: Mean ± Std | Kahl am Main: Min | Kahl am Main: Max | Erlabrunn: Mean ± Std | Erlabrunn: Min | Erlabrunn: Max |
---|---|---|---|---|---|---|---|
DO | mg/L | 9.63 ± 2.56 | 4.20 | 17.40 | 10.6 ± 2.13 | 5.30 | 16.30 |
Temp | °C | 14.56 ± 7.14 | 1.10 | 27.60 | 13.49 ± 7.49 | 0.10 | 28.60 |
pH |  | 8.02 ± 0.19 | 7.50 | 8.50 | 7.94 ± 0.20 | 7.40 | 8.50 |
Flow | m^{3}/s | 136.52 ± 73.95 | 100.00 | 518.00 | 145.98 ± 42.80 | 39.80 | 437.0 |
EC | µS/cm | 616.95 ± 69.87 | 405.00 | 751.00 | 639.42 ± 80.68 | 387.00 | 842.00 |
NO_{3}-N | mgN L^{−1} | 3.66 ± 0.79 | 2.10 | 5.50 | 3.87 ± 0.98 | 2.09 | 6.70 |
OPO_{4}-P | mgP L^{−1} | 0.10 ± 0.04 | 0.01 | 0.22 | 0.09 ± 0.03 | 0.02 | 0.16 |
NH_{4}-N * | mgN L^{−1} | 0.03 ± 0.02 | 0.01 | 0.11 |  |  |  |

\* NH_{4}-N is not measured at Erlabrunn station.

Hyperparameter | Values |
---|---|
Bootstrap | True, False |
Size of the random subsets | ‘auto’, ‘sqrt’ |
Depth of the trees | 10, 20, 30 |
Minimum number of samples to split a node | 6, 12, 20 |
Minimum number of samples to be at a leaf node | 6, 12, 20 |

Order of Variables | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th | 8th | 9th |
---|---|---|---|---|---|---|---|---|---|
**Station: Kahl am Main (downstream)** |  |  |  |  |  |  |  |  |  |
NO_{3}-N | +week_sin | +EC | +Temp | +DO | +month_sin | +pH | +week_cos | +month_cos | +Flow |
R^{2} | 0.927 | 0.979 | 0.995 | 0.998 | 0.998 | 0.999 | 0.999 | 0.999 | 0.999 |
RMSE | 0.676 | 0.587 | 0.397 | 0.205 | 0.097 | 0.043 | 0.034 | 0.028 | 0.026 |
OPO_{4}-P | +week_sin | +EC | +Temp | +Flow | +DO | +month_cos | +week_cos | +month_sin | +pH |
R^{2} | 0.930 | 0.973 | 0.985 | 0.988 | 0.990 | 0.990 | 0.989 | 0.989 | 0.989 |
RMSE | 0.054 | 0.047 | 0.025 | 0.016 | 0.016 | 0.008 | 0.007 | 0.006 | 0.006 |
NH_{4}-N | +week_sin | +EC | +Temp | +DO | +Flow | +pH | +month_sin | +month_cos | +week_cos |
R^{2} | 0.702 | 0.907 | 0.957 | 0.968 | 0.970 | 0.970 | 0.971 | 0.971 | 0.970 |
RMSE | 0.023 | 0.021 | 0.012 | 0.011 | 0.007 | 0.006 | 0.006 | 0.005 | 0.004 |
**Station: Erlabrunn (upstream)** |  |  |  |  |  |  |  |  |  |
NO_{3}-N | +week_sin | +EC | +Temp | +Flow | +pH | +month_cos | +week_cos | +month_sin | +DO |
R^{2} | 0.865 | 0.968 | 0.994 | 0.998 | 0.999 | 0.999 | 0.999 | 0.999 | 0.999 |
RMSE | 1.001 | 0.837 | 0.526 | 0.162 | 0.097 | 0.063 | 0.063 | 0.043 | 0.037 |
OPO_{4}-P | +week_sin | +Flow | +EC | +Temp | +DO | +week_cos | +pH | +month_cos | +month_sin |
R^{2} | 0.342 | 0.848 | 0.929 | 0.954 | 0.956 | 0.958 | 0.959 | 0.959 | 0.959 |
RMSE | 0.035 | 0.031 | 0.022 | 0.019 | 0.013 | 0.013 | 0.010 | 0.008 | 0.007 |
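The predictor names week_sin, week_cos, month_sin and month_cos in the table above suggest cyclic encodings of calendar time. How they were computed is not stated here, so the sketch below is an assumption: it maps the ISO week number and the month onto a circle so that the end of a year sits next to the start of the following one. The function name `add_cyclic_time` and the period of 52 weeks (ISO years occasionally have 53) are illustrative choices.

```python
import numpy as np
import pandas as pd


def add_cyclic_time(df: pd.DataFrame) -> pd.DataFrame:
    """Add sine/cosine encodings of week and month to a frame
    that is indexed by a DatetimeIndex."""
    out = df.copy()
    week = out.index.isocalendar().week.to_numpy(dtype=float)
    month = out.index.month.to_numpy(dtype=float)
    out["week_sin"] = np.sin(2 * np.pi * week / 52.0)
    out["week_cos"] = np.cos(2 * np.pi * week / 52.0)
    out["month_sin"] = np.sin(2 * np.pi * month / 12.0)
    out["month_cos"] = np.cos(2 * np.pi * month / 12.0)
    return out
```

Using a sine and cosine pair rather than the raw week number avoids an artificial jump between December and January.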

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Arhab, M.; Huang, J.
Determination of Optimal Predictors and Sampling Frequency to Develop Nutrient Soft Sensors Using Random Forest. *Sensors* **2023**, *23*, 6057.
https://doi.org/10.3390/s23136057
