2.1. Experimental Equipment
In this study, a set of automatic near-surface water quality monitoring systems (Figure 1a,b) was built to realize non-contact, high-accuracy measurement of water quality parameters. Each system is mainly composed of a visible-light dome camera (Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou, China), a radar flowmeter (Chongqing Hayden Technology Co., Ltd., Chongqing, China), a small weather station (Shandong Renke Measurement and Control Co., Ltd., Jinan, China), a self-developed hyperspectral imager based on a linear variable filter (LVF), and a rotary calibration device. The system already collects data automatically, its hardware has passed stability testing, and a set of supporting software written in C# has been developed. The visible-light dome camera displays real-time pictures of the environment near the device and also reveals floating objects on the water surface through color images. The radar flowmeter monitors the water level and flow of the current water body. The small weather station obtains meteorological data such as temperature, humidity, rainfall, light, wind speed, and wind direction.
The hyperspectral imager (Figure 1c) is the core equipment developed by our laboratory team. The rotary calibration device is used in combination with the hyperspectral imager: while the imager is measuring, a calibration plate made of Teflon (30% spectral reflectance) is rotated into the horizontal orientation by a rotary motor, and at all other times the plate retracts into its holding box. This prevents the calibration plate from becoming contaminated, lengthens the maintenance interval, and enables unattended automatic calibration of the data. The hyperspectral imager consists of a front telescope unit, a coupled splitter unit, and a data processing unit. Its operating principle (Figure 2a) is as follows: light reflected from the target is incident on the front telescope unit, is compressed by the telescope group, enters the field aperture, passes through the imaging lens group, and finally forms an image on the detector surface. The detector surface is windowed and coated, and a pixel-level modulated linear gradient filter module is integrated on it. After splitting by the LVF, monochromatic light of different wavelengths emerges at different positions, so the detector records target spectral images at different wavelengths. The front telescope system and the coupled splitter unit are distributed along the optical axis. The system has passed a comparison test (Figure 2c) against an ASD spectrometer (FieldSpec4 Hi-Res spectroradiometer, Analytical Spectral Devices, Inc., Boulder, CO, USA) and is feasible for the intended application scenarios. The technical indicators (Table 1) of the hyperspectral imager meet the requirements of these scenarios, and the technical scheme satisfies the requirements of a light, compact system, synchronous acquisition of multi-dimensional spectral information, efficient information reconstruction, and so on.
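As an illustrative sketch of the reference-panel calibration this enables (the array names and the dark-signal handling are assumptions, not the imager's actual processing chain), the water reflectance can be estimated by ratioing the water signal against the 30% plate:

```python
import numpy as np

PLATE_REFLECTANCE = 0.30  # known spectral reflectance of the Teflon plate

def calibrate_reflectance(dn_water, dn_plate, dn_dark=0.0):
    """Estimate per-band water reflectance from raw detector counts (DN).

    dn_water and dn_plate are per-band digital numbers averaged over the
    water and calibration-plate regions of one image; dn_dark is an optional
    dark-signal offset. Standard single-reference-panel method; the names
    and offset handling here are illustrative assumptions.
    """
    dn_water = np.asarray(dn_water, dtype=float) - dn_dark
    dn_plate = np.asarray(dn_plate, dtype=float) - dn_dark
    return PLATE_REFLECTANCE * dn_water / dn_plate

# Example with three bands: each water/plate ratio times 0.30 gives the reflectance.
print(calibrate_reflectance([120, 240, 180], [400, 800, 600]))  # [0.09 0.09 0.09]
```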
The system is installed by placing the visible-light dome camera, radar flowmeter, and hyperspectral imager in a custom storage box fixed on the horizontal pole, with the small weather station on top of the vertical pole. The rotary calibration device is placed in the middle of the vertical pole, 70 cm below the horizontal pole, so that the hyperspectral imager does not lose focus and the calibration plate occupies an appropriate share of the field of view. The pole height can be set to 2–5 m. The pole is fixed on the shore at an open position where the bottom of the water is not visible to the naked eye, with the horizontal pole facing the water. It is best to orient the pole in the North–South direction to avoid the influence of the pole's own shadow and to ensure the equipment works uninterrupted during the day. It is important to note that, in the early stage, the system needs to be used together with in situ water quality monitoring equipment (Figure 1d). In this respect, the laboratory team has accumulated substantial experience, and such equipment has been applied in water plants, urban inland rivers, and other scenarios. The self-developed in situ water quality monitoring equipment has been updated over several generations; the latest model, MWIS-3000, has been deployed in Poyang Lake, the Xiangxi River, and the Dasha River in Shenzhen. It yields a large amount of water quality data, including Chl-a, DO, COD, turbidity, NH3-N, NO3-N, pH, temperature, and so on. Using the software developed by our team in C#, the various water quality parameters can be read out within seconds, with cloud storage and remote transmission. The in situ equipment is immersed in the water below the pole and works synchronously with the pole-mounted equipment to obtain measured values of the water quality parameters, which are used as labels to assist in the evaluation and selection of models. All controllable equipment is connected to the industrial control computer (Figure 2b) in the control box through USB or COM ports, which enables data management and evaluation decisions on the data cloud management platform. The power supply can be either mains or solar. The equipment in this study is expected to be used in lakes, rivers, reservoirs, ponds, and other scenarios. Because of its high spatial resolution, it can be used in both large and small water bodies; in addition, its high temporal resolution enables high-frequency continuous data acquisition.
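As a minimal sketch of how such a COM-port instrument might be polled from the industrial control computer (the port name, baud rate, and poll command below are placeholders, since each instrument defines its own protocol), using the pyserial library:

```python
import serial  # pyserial

# Port name, baud rate, and message framing are illustrative placeholders;
# each instrument (camera, flowmeter, weather station) has its own protocol.
with serial.Serial(port="COM3", baudrate=9600, timeout=1.0) as ser:
    ser.write(b"READ\r\n")  # hypothetical poll command
    line = ser.readline().decode("ascii", errors="replace").strip()
    print("raw reading:", line)
```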
2.3. Field Data Collection
At present, the method used in most studies to obtain measured values of water quality parameters is to bottle water samples and take them back to the laboratory, or hand them to professional institutions, for manual chemical measurement [9,10,11]. The feasibility of this method is questionable. First, the water quality parameters in the bottle undergo subtle changes due to microbial activity [21]. Second, there is the preservation of the water samples: strictly speaking, they should be stored at low temperature in an environment with a pH < 2 [22], but owing to the complex environmental conditions at the sampling point, such strict preservation conditions are often not achieved. Third, manual chemical methods, such as the dichromate method for COD (HJ 828-2017), the iodometric method for dissolved oxygen (GB 7489-87), and Nessler's reagent spectrophotometry for ammonia nitrogen (HJ 535-2009), are time-consuming and laborious, readily cause secondary pollution, and easily introduce human error [3]. The label values attached to the spectral information obtained by remote sensing must match it in time; manual sampling and chemical measurement may introduce large deviations into the measured values, which is not conducive to the establishment of the subsequent models.
Since the concentration of Chl-a in the fishpond has reached more than 400 μg/L, it exceeds the range of most existing sensors. For both TN and TP [23], the available sensors have poor accuracy and poor time continuity (with a time interval greater than 40 min). Turbidity is a representative parameter because it correlates strongly with parameters such as transparency, total suspended matter, and chroma [
24]. In this study, taking into account the water quality of the fishpond and the available sensors, four water quality parameters (COD, DO, turbidity, and NH3-N) were therefore selected for study. Using the in situ water quality monitoring equipment (Table 2) developed by our laboratory, data (Table 3) were collected continuously from 25 April 2022 to 6 June 2022 at a 10 s interval for the four water quality parameters. In addition, to ensure the authenticity of the measured data, the in situ underwater probe was manually maintained once a week, and the probe also has its own brush for a certain degree of self-cleaning. At the same time, the hyperspectral imager operated 2–3 times a day, capturing 88 valid images (Figure 4) during the study period, each containing spectral information for both the calibration plate and the measured water body.
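Each valid image must then be paired with a time-matched label from the 10 s in situ record. A minimal sketch of this pairing with pandas (the file names, column names, and 60 s tolerance are illustrative assumptions, not the actual export format):

```python
import pandas as pd

# Illustrative file and column names; the real logs come from the supporting software.
labels = pd.read_csv("insitu_10s.csv", parse_dates=["time"])   # COD, DO, turbidity, NH3-N every 10 s
images = pd.read_csv("image_times.csv", parse_dates=["time"])  # one row per valid hyperspectral image

labels = labels.sort_values("time")
images = images.sort_values("time")

# Attach to each image the nearest in situ record within 60 s.
paired = pd.merge_asof(images, labels, on="time",
                       direction="nearest", tolerance=pd.Timedelta("60s"))
print(paired.head())
```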
2.5. Machine Learning Models
This study applied fourteen machine learning algorithms, including linear regression (LR), ridge regression (RR), least absolute shrinkage and selection operator regression (LASSO), elastic net regression (ENR), k-nearest neighbor regression (KNN), Gaussian process regression (GPR), decision tree regression (DTR), support vector regression (SVR), multilayer perceptron regression (MLP), adaptive boosting regression (ABR), gradient boosting regression (GBR), bootstrap aggregating regression (Bagging), random forest regression (RFR), and extreme tree regression (ETR), to build the retrieval model, and finally selected the most suitable one to estimate the concentrations of the four water quality parameters (COD, DO, turbidity, and NH3-N).
All the machine learning models used in this study are supervised models, which are divided into single and ensemble learning models. The single models include the linear model, KNN, GPR, DTR, SVR, and MLP. Linear models include LR, LASSO, RR, and ENR. The ensemble learning models include the Boosting and Bagging models, where the Boosting models contain GBR and ABR, and Bagging models contain RFR and ETR.
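Since the algorithms were implemented with scikit-learn (as noted at the end of this section), all fourteen regressors can be instantiated behind a common fit/predict interface. A minimal sketch with default hyperparameters (not the tuned settings used in the study):

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              BaggingRegressor, RandomForestRegressor,
                              ExtraTreesRegressor)

MODELS = {
    "LR": LinearRegression(), "RR": Ridge(), "LASSO": Lasso(), "ENR": ElasticNet(),
    "KNN": KNeighborsRegressor(), "GPR": GaussianProcessRegressor(),
    "DTR": DecisionTreeRegressor(), "SVR": SVR(), "MLP": MLPRegressor(),
    "ABR": AdaBoostRegressor(), "GBR": GradientBoostingRegressor(),
    "Bagging": BaggingRegressor(), "RFR": RandomForestRegressor(),
    "ETR": ExtraTreesRegressor(),
}

# All models share the same interface, e.g.:
# MODELS["RFR"].fit(X_train, y_train); y_pred = MODELS["RFR"].predict(X_test)
```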
LR is a regression analysis method that models the relationship between one or more independent variables and a dependent variable; the principle of the algorithm is to use ordinary least squares to minimize the squared error [37]. RR is a linear least squares method with L2 regularization. It is a biased estimation regression method: by giving up the unbiasedness of the least squares method, the regression coefficients are obtained at the cost of losing part of the information and reducing the accuracy. Ridge regression is very useful when there is collinearity in the data set, where its regression coefficients are more practical and reliable than those of LR; in addition, it makes the fluctuation range of the estimated parameters smaller and more stable [38]. Similar to RR, LASSO adds a penalty on the regression coefficients [39]; the difference from RR is that it penalizes the absolute value rather than the square of each coefficient. ENR is a mixture of LASSO and RR: RR performs a biased analysis of the cost function using the L2-norm (squared term), LASSO uses the L1-norm (absolute value term), and ENR combines the two, using both the squared term and the absolute value term [40].
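For reference, the four linear models differ only in their penalty terms; in standard textbook notation (not reproduced from the cited references), with λ controlling the regularization strength:

```latex
\begin{aligned}
\text{LR:}    &\quad \min_{\beta}\; \lVert y - X\beta \rVert_2^2 \\
\text{RR:}    &\quad \min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\text{LASSO:} &\quad \min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\text{ENR:}   &\quad \min_{\beta}\; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
```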
When calculating the predicted value for a data point using KNN, the model selects the k nearest data points in the training data set and averages their label values, using the mean as the predicted value for the new data point [41]. GPR is a non-parametric model that places a Gaussian process prior on the regression function, with a solution based on Bayesian inference. The GPR model assumes two parts: noise (the regression residuals) and the Gaussian process prior. Without restricting the form of the kernel function, GPR is theoretically a universal approximator of any continuous function on a compact space. In addition, GPR provides a posterior over predictions, which can be solved analytically when the likelihood is normally distributed. GPR is therefore a probabilistic model with both generality and analytical tractability [42]. An advantage of DTR is that it does not require any transformation of the features when dealing with nonlinear data; a decision tree is grown by iteratively splitting its nodes until the leaves are pure or a stopping criterion is satisfied [43]. SVR creates an interval zone on both sides of the linear function with a spacing of ε (also known as the tolerance deviation, an empirical value set manually); no loss is computed for samples falling inside this zone, that is, only the support vectors affect the function model, and the optimized model is obtained by minimizing the total loss while maximizing the interval [44]. MLP is a network of fully connected layers with at least one hidden layer, whose outputs are transformed by an activation function; the number of layers and the number of hidden units in each hidden layer are hyperparameters [45].
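To illustrate the predictive posterior that GPR provides, a minimal sketch on toy data (the RBF-plus-noise kernel is an illustrative choice, not the kernel used in this study):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D data; the kernel (RBF + white noise) is an illustrative choice.
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).standard_normal(30)

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X, y)

# The Gaussian posterior supplies both a mean prediction and its uncertainty.
mean, std = gpr.predict(X, return_std=True)
print(mean[:3], std[:3])
```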
ABR re-weights the learners according to their current performance and then, after sorting the learners by weight, takes the weighted median learner as the result [46]. The basic idea of GBR is that multiple weak learners are generated serially, and each weak learner fits the negative gradient of the loss function of the preceding accumulated model, so that adding the weak learner moves the cumulative model loss in the direction of the negative gradient [
47]. Bagging first draws a number of bootstrap sample sets, each containing as many samples as the training set, then trains a learner on each sample set, and finally combines the learners, using the mean method for regression tasks [48].
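The bagging procedure just described fits in a few lines; a minimal sketch using decision trees as the base learners (the base learner, the number of learners, and the function name are illustrative assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, n_learners=10, seed=0):
    """Toy bagging: bootstrap sample sets the size of the training set,
    one tree per set, predictions combined by the mean (regression case).
    X_train, y_train, X_test are NumPy arrays."""
    rng = np.random.default_rng(seed)
    m = len(X_train)
    preds = []
    for _ in range(n_learners):
        idx = rng.integers(0, m, size=m)  # bootstrap sample, drawn with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)  # combine the learners by averaging
```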
RFR [35] is an ensemble technique that combines multiple decision trees. A random forest usually generalizes better than an individual tree because its randomness helps to decrease the model's variance. Random forests are also less sensitive to outliers in the dataset and do not require much parameter tuning; typically, the only parameter that needs experimenting with is the number of trees in the ensemble. For regression, the only difference from the classification case is that the MSE criterion is used to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all the decision trees. ETR and RFR are very similar, but there are two main differences: (1) each tree in ETR uses all the training samples, that is, the sample set of each tree is the same; and (2) RFR selects the optimal split feature within a random feature subset, while ETR selects the split feature directly and at random [4,49].
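Both differences are visible directly in the scikit-learn constructors; a minimal sketch with arbitrary tree counts:

```python
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

# RFR: bootstrap resampling plus the best split within a random feature subset.
rfr = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)

# ETR: by default every tree sees the full training set (bootstrap=False),
# and split thresholds are drawn at random rather than optimized.
etr = ExtraTreesRegressor(n_estimators=100, bootstrap=False, random_state=0)
```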
The above fourteen machine learning regression algorithms are not described in further detail here; they were implemented using Python 3.6 and the scikit-learn library. The training hardware platform was as follows: the CPU was a 10th Gen Intel Core i7, the graphics card was an NVIDIA GeForce RTX 3060 with 12 GB of memory, and the system memory was 16 GB. The operating system was Ubuntu Desktop 20.04 LTS.