An Alternative to Laboratory Testing: Random Forest-Based Water Quality Prediction Framework for Inland and Nearshore Water Bodies

Water quality monitoring plays a vital role in water environment management, and efficient monitoring both guides water management and verifies its effectiveness. Traditional monitoring of multiple water parameters requires the placement of multiple sensors, and some water quality data (e.g., total nitrogen (TN)) can only be obtained with testing instruments or laboratory analysis, which takes longer than sensor measurement. In this paper, we designed a water quality prediction framework that uses readily available water quality variables (e.g., temperature, pH, and conductivity) to predict total nitrogen concentrations in inland water bodies. The framework was also used to predict nearshore seawater salinity and temperature from remote sensing bands. We conducted experiments on real water quality datasets, and random forest was chosen as the core algorithm of the framework after comparing the performance of different machine learning algorithms. The results show that random forest performs best among all tested machine learning models. The prediction error rate of the random forest model for the total nitrogen concentration in inland rivers was 4.9%. Moreover, to explore the predictive performance of the random forest algorithm when the independent variables are not water quality data, we took the reflectance of remote sensing bands as the independent variables and successfully inverted the salinity distribution of Shenzhen Bay on the Google Earth Engine (GEE) platform. According to the experimental results, the random forest-based water quality prediction framework achieves 92.94% accuracy in predicting the salinity of nearshore waters.


Introduction
Presently, the world is facing a crisis of freshwater shortage. Rivers are the most common source of freshwater, but along with industrial development, human activities have damaged the river water environment [1]. Domestic sewage, industrial wastewater, and agricultural drainage all contain inorganic salts such as nitrogen and phosphorus which may adversely affect the water quality of the river [2]. The shortage of freshwater resources is often not a lack of water, but a lack of clean water. Through sensors or laboratory testing, water quality monitoring can help us develop water quality management measures to protect water resources and improve aquatic habitats [3]. Water managers use different tools to monitor water quality, such as using sensors to monitor the physical characteristics of water [4], analyzing the biochemical characteristics of water in the laboratory [5], and even monitoring water bodies through satellite technology [6].
However, traditional monitoring arranges sensors redundantly, with one sensor needed for each water quality parameter. This placement strategy may waste resources. Moreover, the cost of monitoring differs between water quality indicators. For example, basic indicators such as temperature, pH, and conductivity can be monitored quickly and easily by sensors at a very low cost. Indicators such as total nitrogen (TN), on the other hand, require laboratory testing to produce accurate results, which is both time-consuming and costly [7]. Because rivers are highly mobile and their water quality changes rapidly, traditional methods may not respond to changes in water quality in a timely manner when the tests take a long time. In addition, these monitoring activities are often carried out separately, and the intrinsic link between their results is ignored. For example, in inland rivers, total nitrogen has to be measured in the laboratory, but because it is linked to other features that are easily obtained through sensors, we can develop a framework that exploits that link. Simply put, total nitrogen concentration data can then be obtained without a laboratory or specialized detection instruments. The same holds in nearshore waters, where the large area of the nearshore sea makes the placement of a large number of sensors impractical, while satellite remote sensing provides a means of monitoring over a large spatial and temporal range. If we use the optical characteristics of seawater to establish the link between water quality and remote sensing reflectance [8], and use this link to invert the water quality of a large area, we can realize dynamic monitoring of nearshore waters. Studying the relationships between different water quality indicators therefore makes it possible to predict specific indicators indirectly.
Predictive tools of this kind can contribute to water environment management [9].
Due to the highly dynamic nature of river water status, conventional collection and analysis of water quality data may not keep pace with changes in the water. With the extensive use of sensors, many water quality studies have adopted the "surrogate-regression" approach [10], which estimates water quality from surrogate variables. Examples include using discharge, turbidity, and conductivity to estimate the nutrient status of the water body [11], or estimating it directly from environmental proxies [12]. Most of the studies that used the "surrogate-regression" method employed simple linear regression or multiple linear regression [13][14][15]. However, the relationships among water quality variables are in reality nonlinear rather than linear. To address this problem, many scholars have used machine learning methods to explore nonlinear relationships among variables and have successfully applied them to water-related fields such as flood prediction [16], river flow [17], groundwater pollution [18], and photophysiology [19].
In recent years, with the development of big data technology in the field of water quality prediction, machine learning has also been applied to river monitoring [20]. The methods used include Random Forest (RF) [21], Recurrent Neural Network (RNN) [22,23], and Support Vector Machine (SVM) [24]. Dissolved oxygen (DO) [25], chlorophyll (Chla) [26], total phosphorus (TP) [27], and ammonia nitrogen (NH3-N) [28] in water have been used as the variables to be predicted. In addition to the relationships between physical and biochemical indicators in water, satellite remote sensing data are also widely used to predict water quality [29][30][31]. Google Earth Engine (GEE) is a platform for large-scale remote sensing computation that is well suited to studying land and ocean [32][33][34]. With GEE, pre-trained models can be used to predict land parameters [35] and water parameters [36] from remote sensing data.
However, we found that existing studies have not paid enough attention to the prediction of total nitrogen (TN) concentration in inland water bodies, even though total nitrogen concentrations in rivers reflect the degree of eutrophication. Excess nitrogen in rivers provides nutrients for algae, and most river ecosystems are unable to sustain the resulting abnormal and excessive algal growth. Eutrophication occurs when algae bloom in large quantities in a short period of time. Algae that exceed the carrying capacity of a river can consume almost all of the dissolved oxygen in the river, causing other aquatic organisms to die from lack of oxygen [37]. Nitrogen in rivers is monitored mainly by sensors and laboratory testing [9].
In this paper, a random forest-based prediction framework is designed to enrich the methods available for monitoring inland and nearshore water quality. Two types of water bodies were studied: inland rivers and nearshore waters. These water bodies have different characteristics and are analyzed differently. The framework takes different data sources as input and produces the target water quality parameters as output, and the experimental results validate its robustness. The main contents of the study are as follows:
• We explored the relationship between different water quality indicators and designed a water quality prediction framework, which uses machine learning methods to predict the target water quality indicators from related water quality parameters.
• We compared the performance of different kinds of machine learning methods for total nitrogen (TN) prediction in an inland river, and the best-performing model was selected as the core algorithm of the prediction framework.
• We discussed the feasibility of the proposed framework for water quality inversion with water surface reflectance data acquired by remote sensing. The GEE platform was used to handle the large-scale computation and to map the salinity distribution of nearshore water bodies.

Framework for Inland and Nearshore Water Quality Prediction
Based on the above situation, we designed the water quality prediction framework as shown in Figure 1. As can be seen, the framework is divided into three parts, from left to right are: raw data acquisition and processing, model training and prediction, storage and display.

Data Acquisition and Pre-Processing
The raw data are acquired in different ways depending on the water body and its characteristics. For example, in predicting the water quality of inland water bodies, the raw data mainly come from sensors previously installed at water quality monitoring stations. A small part of the data that is difficult to obtain directly from sensors is obtained through laboratory analysis of water samples taken at a certain frequency. In nearshore seawater quality prediction, the original data include not only the sampling and testing data of the water quality management department, but also remote sensing data transmitted from satellites to the ground. Because the detection and sampling frequencies differ and water quality prediction requires the relevant data to correspond to one another, the original data need to be pre-processed. The independent variables also differ between inland and nearshore water quality prediction. For total nitrogen prediction in inland rivers, the independent variables are other, correlated water quality indicators. For nearshore seawater quality prediction, the independent variables are remote sensing reflectance data.

Decision Tree and Random Forest Model
The idea of the framework is to employ machine learning methods to predict water quality. Models learn the linear or nonlinear relationships that exist in the data from the training set. The trained models then make predictions on the test set to assess their effectiveness and generate evaluation metrics. Total nitrogen is nonlinearly related to other water parameters (such as temperature and pH): these factors directly affect the growth of algae in water bodies and thereby indirectly affect the concentration of TN. It is the existence of this nonlinear relationship that makes it possible to predict TN. We treat the task of predicting TN as a regression problem. Here, we use random forest as the prediction model for the whole framework.

Decision Tree
Decision tree is the basic structure of many ensemble learning methods, called classification tree when used for classification and regression tree when used for regression [38]. Unlike other classification methods that combine a set of features in a single decision step to perform classification, decision trees are based on a multi-stage or hierarchical decision scheme or tree structure and consist of nodes and directed edges. A decision tree consists of two types of nodes: internal nodes and leaf nodes. A feature or attribute of the data is represented by an internal node, while a leaf node represents a class. Specifically, a leaf node represents the result of a decision from the root node to this leaf node, and an internal node represents the data classification test performed at that point, i.e., the feature or attribute to which the test data belongs. Each node of the decision tree structure makes a binary decision, and the samples contained in the node based on the result of the attribute test are divided into sub-nodes (the root node contains the full set of samples), and the path from the root node to each leaf node corresponds to a sequence of decision tests. This processing is usually performed by moving down the tree until a leaf node is reached. In the decision tree approach, the characteristics of the data (i.e., other water parameters) are predictor variables, while the class to be mapped (TN) is referred to as the target variable. The tree-like structure of a decision tree is shown in Figure 2.
The key to decision trees is how to divide the data set so that disordered data become more orderly. The measure of order and disorder is the information entropy:

$$H(X) = -\sum_{k=1}^{K} p_k \ln p_k \tag{1}$$

where $X$ denotes the sample set containing $K$ categories and $p_k$ denotes the proportion (frequency) of the $k$-th category in the sample set $X$. The first-order Taylor expansion of $f(x) = -\ln x$ at $x = 1$ (ignoring higher-order infinitesimals) is:

$$f(x) = -\ln x \approx 1 - x \tag{2}$$

Thus, the entropy can be approximated as:

$$H(X) \approx \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 = \mathrm{Gini}(X) \tag{3}$$

Here, $\mathrm{Gini}(X)$ refers to the Gini index. Although the decision tree is not an ensemble learning method, it is the basis of random forest and gradient boosted decision trees.
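As a small numerical illustration of the two impurity measures, the following sketch computes the information entropy and the Gini index for a given set of class proportions. The function names and the example proportions are hypothetical and chosen only for illustration; they do not come from the study data.

```python
import math

def entropy(proportions):
    """Information entropy: sum of -p_k * ln(p_k), skipping empty classes."""
    return sum(-p * math.log(p) for p in proportions if p > 0)

def gini(proportions):
    """Gini index: 1 - sum(p_k ** 2)."""
    return 1.0 - sum(p * p for p in proportions)

# Hypothetical class proportions, for illustration only.
pure = [1.0, 0.0]     # all samples in one class: perfectly ordered
mixed = [0.5, 0.5]    # evenly mixed classes: maximally disordered for K = 2

for name, p in (("pure", pure), ("mixed", mixed)):
    print(name, "entropy:", entropy(p), "gini:", gini(p))
```

Both measures are zero for a pure node and largest for an evenly mixed node, which is why the Gini index can stand in for the entropy when choosing splits.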

Random Forest
The idea of random forest is to build a forest of several decision trees and merge them to obtain more accurate and stable results. Random forest involves two kinds of randomness. First, the selection of samples is random: a certain number of samples are drawn from the training set to form the root-node samples of each decision tree. Second, the selection of attributes is random: during the construction of each decision tree, a certain number of candidate attributes are randomly selected, from which the most suitable attribute is chosen as the split node. The random forest model randomly resamples the input data set, generates multiple training sets to construct decision trees, and then determines the final prediction from the results of all decision trees (majority vote or average). Figure 3 shows the training process of random forest.

The basic steps of the random forest algorithm are as follows:
• Sampling: From the training set T, K data sets are generated by bootstrap sampling with replacement. Each data set divides the training data into sampled data and un-sampled data (out-of-bag data), and each data set is used to train one decision tree.
• Growth: Each decision tree is trained on its training data. At each node, m features are randomly selected from the M attributes, and the optimal feature is selected based on the Gini metric. The tree is grown fully, without pruning, until no further growth is possible.
• Testing: The out-of-bag data are used to test the accuracy of the model. Because out-of-bag data are not involved in modeling, they allow the model's performance and generalization capability to be tested to some extent. The out-of-bag prediction error is used to determine the best decision trees in the algorithm, and the model is rebuilt accordingly.
• Prediction: The determined model is applied to new data, and the average of the prediction results of all decision trees is the final output.
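The sampling, growth, testing, and prediction steps can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments. Each tree is reduced to a single split (a stump), so the random feature subset is drawn once per tree. Splits are scored by squared error, a common choice for regression that is swapped in here for the Gini metric. All function names and the toy data are hypothetical.

```python
import random

def bootstrap_indices(n, rng):
    """Sampling step: draw n indices with replacement, return (in-bag, out-of-bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = set(range(n)) - set(in_bag)
    return in_bag, out_of_bag

def fit_stump(X, y, features):
    """Growth step, reduced to one split: best (feature, threshold) by squared error."""
    best = None
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split: fall back to the mean of the sample
        mean = sum(y) / len(y)
        return (features[0], float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_forest(X, y, n_trees=50, m_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_indices(n, rng)
        features = rng.sample(range(M), m_features)  # m of the M attributes
        stump = fit_stump([X[i] for i in in_bag], [y[i] for i in in_bag], features)
        forest.append((stump, oob))
    return forest

def predict_forest(forest, row):
    """Prediction step: average the outputs of all trees."""
    preds = [predict_stump(stump, row) for stump, _ in forest]
    return sum(preds) / len(preds)

def oob_mae(forest, X, y):
    """Testing step: score each sample only with trees that did not train on it."""
    errors = []
    for i, (row, target) in enumerate(zip(X, y)):
        preds = [predict_stump(stump, row) for stump, oob in forest if i in oob]
        if preds:
            errors.append(abs(sum(preds) / len(preds) - target))
    return sum(errors) / len(errors)

# Toy data (hypothetical): the target depends on feature 0 only.
X = [[float(i), float((i * 7) % 10)] for i in range(20)]
y = [0.0 if i < 10 else 10.0 for i in range(20)]
forest = fit_forest(X, y, n_trees=50, m_features=1, seed=0)
low, high = [2.0, 5.0], [17.0, 5.0]
print(predict_forest(forest, low), predict_forest(forest, high))
print("out-of-bag MAE:", oob_mae(forest, X, y))
```

In practice a library implementation with fully grown trees would be used; the sketch only makes the roles of the in-bag data, the out-of-bag data, and the final averaging explicit.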

Case Studies of Inland and Nearshore Water Quality
As mentioned previously, the research in this paper was conducted in inland waters and nearshore waters. In inland waters, we use the designed framework to predict river total nitrogen concentrations through relationships between water quality parameters; in nearshore waters, we invert seawater temperature and salinity using the reflectance of seawater to light observed by satellite.

Experiment Description and Settings
The experiments are conducted according to the proposed framework. All experiments in this article were run on a Windows system (CPU: Intel(R) Core(TM) i7-9700K CPU @ 3.60 GHz; GPU: Intel(R) UHD Graphics 630). All code is written in Python 3.8. The pre-processed data are split into a training set (90%) and a testing set (10%). After the real water quality data are processed, random forest is used as the core algorithm to learn the nonlinear relationships between the water quality variables in the training set. The test set is used to evaluate the performance of the model.

Baseline Methods
In our experiment, to verify that random forest outperforms other common machine learning algorithms in inland and nearshore water quality prediction tasks, several methods such as SVR, KNN, Ridge Regression, MLP, Gradient Boosted Regression Tree (GBRT), and Bagging were also used in the study. These methods are described as follows.

SVR
Support vector regression (SVR) is an application of support vector machine (SVM) [39,40] to the regression of continuous variables [41]. The idea of SVR is as follows. Consider a dataset A of D elements $\{(X_i, y_i),\ i = 1, 2, \ldots, D\}$, where D is the number of samples in the training set, $X_i = \{x_1, x_2, \ldots, x_d\} \in \mathbb{R}^d$ is the $i$-th $d$-dimensional input vector, and $y_i \in \mathbb{R}$ is the actual value corresponding to $X_i$. The target output function of the SVR can be expressed by the following equation:

$$f(x) = \sum_{i=1}^{D} \omega_i \varphi(x_i) + b \tag{4}$$

where $\omega$ is the weight vector, $\omega_i$ and $b$ are coefficients determined by minimizing the error between the network output and the target variable, and $\varphi(x_i)$ is the nonlinear mapping function. In practical applications, $\varphi(x_i)$ is replaced by the kernel function $K(x, z)$.

KNN
The K-nearest neighbor (KNN) model is a nonparametric method proposed by Thomas Cover [42] that can be used for classification and regression. The output of the KNN model depends on the task: for regression, the model predicts a real value, called the attribute value of the new data point. The Euclidean distance between the new data point and the training points is calculated, and the K nearest neighboring data points serve as the input from which this value is predicted, typically as the average of their target values.
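A minimal sketch of KNN regression follows, assuming the common convention that the prediction is the unweighted average of the K nearest targets. The function name and the toy data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict by averaging the targets of the k nearest training points."""
    # Pair each Euclidean distance with its target and sort by distance.
    distances = sorted(
        (math.dist(row, query), target) for row, target in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Toy one-dimensional data (hypothetical values).
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [1.0, 2.0, 3.0, 20.0]
print(knn_predict(train_X, train_y, [1.0], k=3))
```

The distant point at 10.0 does not influence the prediction for the query at 1.0, which illustrates the local nature of the method.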

Ridge Regression
Ridge regression is a popular parameter estimation method used to address the collinearity problem that frequently arises in multiple linear regression [43,44]. When the explanatory variables are collinear, a change in one variable is accompanied by changes in the others, which makes the least-squares coefficients unstable. Ridge regression adds a constant matrix to the original equation, which introduces bias but ensures the stability of the regression coefficients. Although this addition results in a loss of information, it yields a more reasonable estimate of the regression model.
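For reference, the standard closed-form ridge estimator makes the role of the added constant matrix explicit. The symbols below are assumptions introduced for illustration: $X$ is the design matrix, $y$ the target vector, $I$ the identity matrix, and $\lambda \ge 0$ the regularization strength.

$$\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta} \left( \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right) = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y$$

With $\lambda = 0$ this reduces to ordinary least squares. For $\lambda > 0$ the matrix $X^{\top} X + \lambda I$ is invertible even when the columns of $X$ are collinear.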

MLP
Multilayer Perceptron (MLP) [45] is a feed-forward neural network that mimics the connectivity between human neurons, in which the neurons of adjacent layers are fully connected. The hidden layer receives the signals from the input layer nodes and transforms them into signals sent to all output nodes, which produce the final output. The error between simulation and observation is minimized using the back-propagation algorithm. Activation functions are used to enhance the network's ability to express nonlinear regression relationships.

Gradient Boosted Regression Tree
Gradient Boosted Regression Tree (GBRT), a model based on decision tree regression, can handle nonlinear and complex relationships in data. GBRT is an iterative decision tree algorithm that consists of multiple decision trees, and the final result is the cumulative sum of the outputs of all trees [46]. GBRT uses a forward stagewise algorithm: at each step, it selects the decision tree that, added to the current model, minimizes the loss function.

Bagging
The basic idea of Bagging Tree [47] is that part of the output error of a single regression tree is due to the specific selection of the training dataset. Bagging therefore uses bootstrap sampling to obtain different training subsets. Each sampled training set is used to train one base learner, and the mean of the results of all base learners is the output of Bagging regression.

Evaluation Methodology
Different models perform differently on data prediction. To compare the performance of various machine learning models on total nitrogen concentration prediction, we used MAE (Mean Absolute Error), MSE (Mean Square Error), RMSE (Root Mean Square Error), NSE (Nash-Sutcliffe efficiency coefficient), and MAPE (Mean Absolute Percentage Error) as metrics to evaluate the models.
• MAE:
$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y_i - f(x_i) \right| \tag{5}$$
• MSE:
$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - f(x_i) \right)^2 \tag{6}$$
• RMSE:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - f(x_i) \right)^2} \tag{7}$$
• MAPE:
$$\mathrm{MAPE} = \frac{100\%}{m} \sum_{i=1}^{m} \left| \frac{y_i - f(x_i)}{y_i} \right| \tag{8}$$
• NSE:
$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{m} \left( y_i - f(x_i) \right)^2}{\sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2} \tag{9}$$
where $m$ is the number of data points, $y_i$ are the observed values, $f(x_i)$ are the predicted values, and $\bar{y}$ is the average of the $y_i$. MAE, MSE, RMSE, and MAPE measure the gap between the observed and model-predicted values. The larger the value of the indicator, the larger the difference between the true and predicted values, and the worse the performance of the model. The Nash-Sutcliffe efficiency coefficient (NSE) ranges from negative infinity to 1. The closer the NSE is to 1, the better the model is; the closer it is to 0, the worse the model is. If the NSE is much less than 0, the model is not credible.
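The five metrics can be written directly from their definitions. The following is a minimal sketch in plain Python; the function names and the example values are hypothetical, and the observed values are assumed to be non-zero so that MAPE is defined.

```python
import math

def mae(y, f):
    return sum(abs(a - b) for a, b in zip(y, f)) / len(y)

def mse(y, f):
    return sum((a - b) ** 2 for a, b in zip(y, f)) / len(y)

def rmse(y, f):
    return math.sqrt(mse(y, f))

def mape(y, f):
    """Mean absolute percentage error in percent; assumes no observed value is 0."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, f)) / len(y)

def nse(y, f):
    """Nash-Sutcliffe efficiency: 1 minus residual variance over observed variance."""
    mean_y = sum(y) / len(y)
    residual = sum((a - b) ** 2 for a, b in zip(y, f))
    total = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - residual / total

# Hypothetical observed and predicted concentrations (mg/L).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
for metric in (mae, mse, rmse, mape, nse):
    print(metric.__name__, metric(observed, predicted))
```

A perfect prediction gives zero for the four error metrics and an NSE of exactly 1, while a model that always predicts the observed mean gives an NSE of 0.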

Study Area and Materials
The inland water case study area is the Lianjiang River basin, which is located in the eastern part of China's Guangdong Province. The Lianjiang River has 17 tributaries of various sizes that join the mainstream from north to south. The main river is 71 km long and has a basin area of 1346.6 km². The Lianjiang River is one of the mother rivers of the Chaoshan region, and the population in the basin exceeds 4 million people, a density six times the provincial average. The high population density and the concentration of industrial enterprises have put enormous pressure on the environment. The local government has invested around $4 billion in the management of the Lianjiang River, which shows the importance of monitoring and managing the river.
Data used in this part are from the water quality monitoring station set up by the Shantou Ecology and Environment Bureau (https://www.shantou.gov.cn/epd/) (accessed on 17 November 2021) in Haimen Bay (23°12′45.7″ N, 116°37′15.7″ E, WGS-84). The geographical location of the Lianjiang River and Haimen Bay can be seen in Figure 4. As shown on the map, the station is located at the mouth of the Lianjiang River, where water quality predictions help to avoid pollution of the sea by inland sewage.
The majority of the water quality data at the monitoring station are collected at two-hour intervals, while all total nitrogen (TN) data are collected at four-hour intervals. Therefore, the data were collated and only data items containing total nitrogen concentrations were retained, for a total of 1917 sets of water quality data. Water quality indicators include temperature (Temp), pH, dissolved oxygen (DO), turbidity (Tud), chemical oxygen demand (COD), total dissolved solids (T.D.S), ammonia nitrogen (NH3-N), and total nitrogen (TN). The data are statistically described in Table 1.
The indicators differ in units and orders of magnitude [48]; for example, the value of conductivity may be hundreds of times higher than other observed values. Such differences may have an impact on the prediction results.
To eliminate the influence of scale between indicators, the filtered data are normalized [49]. Although the method is referred to as Z-score normalization, the data are scaled to the range from 0 to 1, which corresponds to min-max scaling:

$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{10}$$

where $X_{\mathrm{norm}}$ denotes the normalized value, $X$ denotes the real monitored value, and $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values in the set of data.
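The scaling to the range from 0 to 1 described above can be sketched as follows. The function name and the sample values are hypothetical, and the handling of a constant column (returning zeros) is an assumption added to avoid division by zero.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant indicator carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical readings of one indicator with a large numeric range.
readings = [200.0, 500.0, 800.0]
print(min_max_normalize(readings))
```

After scaling, indicators with very different magnitudes contribute on a comparable scale. The minimum and maximum would normally be taken from the training set and reused for the test set.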

• Correlation Analysis
To find water parameters that are correlated with TN and can be used as independent variables in predicting TN, a correlation analysis was performed to extract possible relationships between the parameters. Pearson's correlation coefficient was used to measure the correlation between water quality indicators. The formula is as follows:

$$\mathrm{Cor}_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{D(X)}\,\sqrt{D(Y)}} = \frac{E\left[ (X - E(X)) (Y - E(Y)) \right]}{\sqrt{D(X)}\,\sqrt{D(Y)}} \tag{11}$$

where $\mathrm{Cor}_{X,Y}$ is the correlation coefficient of the random variables $X$ and $Y$, $\mathrm{Cov}(X, Y)$ is the covariance of $X$ and $Y$, $E$ is the mathematical expectation or mean, $D$ is the variance, and $\sqrt{D}$ is the standard deviation. The correlation coefficients between the other variables and total nitrogen obtained by Equation (11) are shown in Table 2. As can be seen in the table, the correlation coefficients of DO, pH, and Tud with TN are small, indicating that a linear correlation with TN is unlikely. We also tested the variables with small correlations in the experiment and found that removing them would negatively affect the predicted results. One possible explanation is that there may be a nonlinear relationship between these variables and TN. For example, pH affects algal growth and thus changes TN concentrations, while algal growth affects DO concentrations [50]. Although the linear correlation between DO and TN is the lowest in the table, DO is directly related to the life and death of aquatic organisms. An adequate DO level helps these organisms to survive, whereas when algal overgrowth occurs, the DO of the water column decreases, organisms die, and microbial decomposition leads to higher TN in the water. Based on this consideration, we retained and used these variables. TP was not used as an independent variable because, like TN, its detection requires laboratory analysis.
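Pearson's correlation coefficient can be computed directly from its definition as covariance divided by the product of the standard deviations. The sketch below uses plain Python; the function name and the example series are hypothetical, and both series are assumed to have non-zero variance.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)

# Hypothetical series: the first pair is perfectly linear, the second is not.
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
print(pearson([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0]))
```

The second pair follows an exact quadratic relationship, yet its linear correlation is zero. This is the situation discussed above, in which a variable with a small correlation coefficient can still be informative for a nonlinear model.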

• Data Validation
In line with common machine learning practice, we randomly divide the pre-processed 1917 sets of data into training and testing sets: 90% of the data are used as the training set to train the model, and 10% are used as the testing set to validate the model and calculate the metrics. For comparison purposes, we divide the methods into two categories: ensemble learning methods and non-ensemble learning methods. The first category includes Random Forest, GBRT, and Bagging, together with the Decision Tree, which is not itself an ensemble but is the base learner of these methods. The non-ensemble learning methods include SVR, KNN, Ridge Regression, and MLP.
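A random 90/10 split of the 1917 samples can be sketched with the standard library alone. The function name, the seed, and the choice to round the test set size up are assumptions. Rounding up yields 192 test samples, which is consistent with the test set size reported in the results.

```python
import math
import random

def train_test_split_indices(n_samples, test_fraction=0.1, seed=42):
    """Shuffle sample indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)  # assumption: round up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(1917)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split reproducible, so that all compared models are trained and evaluated on the same samples.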
We calculated the correlation (Cor) coefficients between the predicted and observed values of TN for the Lianjiang River. Scatter plots were drawn with the observed and predicted values as horizontal and vertical coordinates, and we added the diagonal line Y = x to each plot to indicate the result of a perfect prediction. The deviation of a point from this line reflects the deviation of the predicted value from the observed value. We compared the ensemble learning methods and the non-ensemble learning methods separately, and then analyzed the model with the best prediction ability in detail. Figure 5 shows the scatter distribution of the predicted and observed values of the four non-ensemble learning methods. Among the four methods, MLP has the smallest deviation from the straight line Y = x, with the largest correlation coefficient of 0.947, and KNN has the largest deviation from the straight line, with the smallest correlation coefficient of 0.899. This indicates that the neural network performs better in the TN concentration prediction task in the study area than the other non-ensemble learning methods. In addition, we found that the points determined by the predicted and observed values are closer to the straight line Y = x when the TN concentration is small.
For the ensemble learning methods, the deviation of the points in the scatter plot from the straight line Y = x is smaller than that of the non-ensemble learning methods mentioned above, which indicates that the predicted values of the ensemble learning methods are closer to the observed values in the task of predicting TN concentrations in inland rivers. In the comparison with the baseline methods, random forest has the best performance. Table 3 shows the results of these eight methods on the six evaluation metrics, in which bold font highlights the best results.
By category, the ensemble learning methods perform better than the non-ensemble learning methods in both data fitting ability and prediction accuracy. The experimental results indicate that random forest performs the best among the eight methods. Its MAE = 0.335, MSE = 0.259, RMSE = 0.509, and MAPE = 4.9% are the lowest among the methods, and its NSE = 0.94 and Cor = 0.967 are the highest, which indicates that random forest fits the data better and has a smaller prediction error than the other methods. From Table 3, it can be seen that random forest improves the prediction accuracy by about a factor of two compared with the non-ensemble learning methods. The prediction accuracy of random forest is also higher than that of the other tree-based methods: decision trees, gradient boosting trees, and Bagging. This shows that our proposed framework is better than these baseline methods in terms of prediction accuracy.

Prediction Results of The Framework for TN
After comparing the performance of the models, we use random forest as the core algorithm of the framework to predict the TN at the monitoring site of Haimen Bay. The comparison of the predicted results with the true values is shown in Figure 7a. Most of the predicted values are very close to the observed values, which also indicates the good performance of the model. However, there are still some cases where the difference from the true value is large. These cases mostly correspond to large observed values in the test set, because at the same relative error a larger concentration yields a larger absolute error. For example, at an error rate of 10%, the error for a concentration of 14 mg/L is 1.4 mg/L, while the error for a concentration of 6 mg/L is 0.6 mg/L, which is less than half of 1.4 mg/L.
Figure 7b shows the absolute error between the predicted and observed values of TN. After predicting the test set data, we counted the range of these absolute errors. A total of 120 out of 192 data sets had absolute errors less than or equal to 0.3 mg/L, accounting for 62.5%, and 153 out of 192 data sets had absolute errors less than or equal to 0.5 mg/L, accounting for 79.7%. The mean absolute error (MAE) of the test set was 0.335 mg/L.
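The share of test samples whose absolute error falls within a tolerance can be computed as sketched below. The function name and the short list of errors are hypothetical. The last two lines only re-derive the percentages quoted above from the reported counts.

```python
def share_within(abs_errors, tolerance):
    """Fraction of absolute errors that are less than or equal to the tolerance."""
    return sum(1 for e in abs_errors if e <= tolerance) / len(abs_errors)

# Hypothetical absolute errors in mg/L, for illustration only.
errors = [0.1, 0.3, 0.4, 0.9]
print(share_within(errors, 0.3))

# Percentages implied by the counts reported for the 192 test samples.
print(round(100 * 120 / 192, 1))
print(round(100 * 153 / 192, 1))
```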

Case Studies of Nearshore Water Quality
Remote sensing technology is generally used to observe optically sensitive features, such as land, vegetation, and forest fire sites. In the field of water color remote sensing, the research objects are usually chlorophyll-a, soluble solids, yellow substances, etc., and there are fewer studies on optically non-sensitive parameters of the water. Seawater salinity is very important for marine organisms and is a key indicator for maintaining the ecological balance of the ocean [51]. To a certain extent, the dynamic balance of seawater salinity reflects the stability of marine ecosystems. Since the effects of human activities on seawater salinity are mainly concentrated in nearshore waters, it is of great interest to monitor seawater salinity using the proposed water quality prediction framework.
Like the inland water quality prediction experiments, we use the random forest as the core algorithm of the prediction framework and the other seven machine learning methods as comparisons. The difference is that the source of input data used as the independent variable of the model in nearshore water quality prediction is not the same as inland water quality prediction.

Data Description and Processing
In the nearshore water quality prediction, the study area is Shenzhen Bay (22°29′21.7″ N, 113°58′46.0″ E, WGS-84). Remote sensing data were used as independent variables and water quality data were used as dependent variables. Shenzhen Bay is a river inlet located between Shenzhen City, mainland China, and the Hong Kong Special Administrative Region, China. The area is densely populated, and the frequent human activities have a large impact on the marine ecology. This part of the experiment was done on the GEE platform. Sentinel-2 remote sensing reflectance data of Shenzhen Bay waters from 2018 to 2019 were downloaded from the GEE platform, and seawater quality data were obtained from the Hong Kong Environmental Protection Department (EPD, https://cd.epic.epd.gov.hk/EPICRIVER/marine/, accessed on 17 November 2021).
The raw dataset contains 10 bands, temperature, and seawater salinity data from 11 monitoring sites in Shenzhen Bay. To eliminate the influence of clouds on the reflectance data, a Cloudmask operation was performed using the QA60 band provided by Sentinel-2 SR data. The QA60 band is used in binary form to indicate the presence or absence of clouds at the point, e.g., the tenth binary bit indicates transparent clouds the tenth indicates cirrus clouds if the binary bit is 0 indicates the presence of clouds, and 1 indicates the absence of clouds. Therefore, we set the filtering condition as the tenth and eleventh bits are 0, we can filter out the points without clouds. Due to the inconsistency between satellite transit time and water sampling time, the data were screened, and the minimum time difference was selected to match the data, and 147 sets of data were finally obtained.
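The two screening steps described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the GEE script used in the study: it assumes the QA60 convention in which bits 10 and 11 flag opaque and cirrus clouds and a value of 0 means clear, and the helper names (`is_cloud_free`, `nearest_overpass`) and all sample values are hypothetical.

```python
from datetime import datetime

# Bit masks for the Sentinel-2 QA60 band (assumed convention: 0 = clear).
OPAQUE_CLOUD_BIT = 1 << 10
CIRRUS_BIT = 1 << 11


def is_cloud_free(qa60: int) -> bool:
    """Return True when both the opaque-cloud and cirrus bits are 0."""
    return (qa60 & OPAQUE_CLOUD_BIT) == 0 and (qa60 & CIRRUS_BIT) == 0


def nearest_overpass(sample_time, overpass_times):
    """Pick the satellite overpass closest in time to a water sample."""
    return min(
        overpass_times,
        key=lambda t: abs((t - sample_time).total_seconds()),
    )


# Illustrative values only; they are not taken from the study's dataset.
qa_values = [0, 1024, 2048, 3072]
clear_flags = [is_cloud_free(q) for q in qa_values]

sample = datetime(2018, 5, 10, 10, 30)
passes = [
    datetime(2018, 5, 8, 3, 0),
    datetime(2018, 5, 11, 3, 0),
    datetime(2018, 5, 13, 3, 0),
]
matched = nearest_overpass(sample, passes)
```

In practice one would probably also cap the allowed time difference so that a sample is discarded when no overpass is close enough; the paper does not state such a threshold, so none is assumed here.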
As in the inland water quality prediction, we analyzed the correlation of the reflectance data (B followed by a number denotes a band) with the temperature and salinity data, and the results are shown in Table 4. We found that the reflectance data are positively correlated with temperature, with correlation coefficients between 0.45 and 0.57. The reflectance data are negatively correlated with salinity, with correlation coefficients ranging from −0.76 to −0.62. The regression relationship of remote sensing reflectance with temperature and seawater salinity was established on the GEE platform using the proposed water quality prediction framework. To verify the accuracy and generalization ability of the model, we used the data from six sampling points in the dataset as the training set and the data from the other five sampling points as the test set.
As in the comparison experiments for inland water quality, we compared random forest with the other seven baseline methods on three metrics: MAE, NSE, and MAPE. The results of the comparison are shown in Table 5. From the table, we can see that the random forest-based prediction framework still performs better than the baseline methods in nearshore water quality prediction. For the prediction of seawater temperature and salinity in the region, the comparison between observed and predicted values is shown in Figure 8, which shows that most of the predicted values differ little from the observed values and only a few values show larger errors. From Table 5, it can be seen that in predicting seawater temperature in the study area, random forest achieves MAE = 1.107 and MAPE = 4.65%, both lower than the other methods. Similarly, in predicting seawater salinity, random forest achieves MAE = 1.755 and MAPE = 7.06%, again the lowest among the tested methods. This shows that the proposed framework is applicable to predicting the temperature and salinity of nearshore seawater and has good transferability. Using the visualization tool of the GEE platform, we plotted the salinity distribution of seawater in Shenzhen Bay. As can be seen from Figure 9, the seawater salinity is radially distributed as the water flows from the river to the ocean. This distribution arises because the outflow from inland is freshwater with very low salinity that carries some momentum into the bay; through diffusion, the inorganic salts in the high-salinity seawater are transferred into the low-salinity freshwater, so the salinity gradually increases from the river inlet toward the open sea.
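The correlation analysis and the site-wise split described above can be illustrated with a small standard-library sketch. The record layout, the site labels, and the helper names (`pearson`, `split_by_site`) are assumptions made for illustration, and the numbers are toy values rather than the Shenzhen Bay data. Splitting by sampling site, rather than by random row, keeps every observation from a given site on one side of the split, which is what makes the test a check of spatial generalization.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def split_by_site(records, train_sites):
    """Put all records from `train_sites` in train, the rest in test."""
    train = [r for r in records if r["site"] in train_sites]
    test = [r for r in records if r["site"] not in train_sites]
    return train, test


# Toy records (hypothetical values): one band reflectance and salinity.
records = [
    {"site": "S1", "B4": 0.08, "salinity": 22.0},
    {"site": "S1", "B4": 0.06, "salinity": 26.0},
    {"site": "S2", "B4": 0.05, "salinity": 28.0},
    {"site": "S3", "B4": 0.03, "salinity": 31.0},
]

r_salinity = pearson(
    [rec["B4"] for rec in records],
    [rec["salinity"] for rec in records],
)
train, test = split_by_site(records, {"S1", "S2"})
```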
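For reference, the three comparison metrics can be written out directly. The sketch below uses the standard textbook definitions of MAE, MAPE, and the Nash-Sutcliffe efficiency (NSE), which are assumed to match the definitions used in this paper. The helper `accuracy_from_mape` reflects an inference from the reported numbers (for example, 100% − 7.06% = 92.94% for salinity), not a formula stated in this section, and the observed and predicted values are toy data.

```python
def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def mape(obs, pred):
    """Mean absolute percentage error, in percent (obs must be non-zero)."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean."""
    mean_obs = sum(obs) / len(obs)
    residual = sum((o - p) ** 2 for o, p in zip(obs, pred))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def accuracy_from_mape(mape_percent):
    """Assumed relation between the quoted accuracy and MAPE."""
    return 100.0 - mape_percent


# Toy observed and predicted values (not from Table 5).
obs = [20.0, 25.0, 30.0, 25.0]
pred = [21.0, 24.0, 30.0, 26.0]

scores = {
    "MAE": mae(obs, pred),
    "MAPE": mape(obs, pred),
    "NSE": nse(obs, pred),
}
```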

Discussion
In this study, we first designed a water quality prediction framework and discussed the application of machine learning methods in predicting inland water quality. Using temperature (Temp), pH, dissolved oxygen (DO), turbidity (Tud), chemical oxygen demand (COD), total dissolved solids (T.D.S), and ammonia nitrogen (NH3-N) as independent variables, eight machine learning methods were applied to predict the target variable (TN), and MAE, MSE, RMSE, MAPE, NSE, and Cor were used as the evaluation metrics of model performance. We can visually compare the performance of these methods using Figure 10. After comparison, we found that the ensemble learning methods and MLP perform better than the non-ensemble learning methods (SVM, KNN, ridge regression). The analysis suggests that, because of the complex nonlinear relationships between water parameters, the ensemble learning methods and MLP are better than the other methods at capturing the nonlinear relationships between attributes and can optimize the learned features to achieve a better fit.
The goal of supervised learning is to learn a stable model that performs well in all aspects. In practice, however, this is often not achievable, and we can only obtain multiple models that each perform well in some aspects. The underlying idea of ensemble learning [52] is to combine multiple weak learners so that even if one weak learner produces an incorrect prediction, the other weak learners can compensate for that error. Ensemble learning can therefore effectively improve the generalization ability of the system. Additionally, because of the randomness it introduces, the random forest method reduces the risk of overfitting. Random forest is also relatively insensitive to outliers and thus has good noise immunity.
Given the good performance of random forest on the river TN prediction task, we then used remote sensing bands for water quality prediction. Here, seawater temperature and salinity are the prediction targets, and the independent variables are reflectance data from the Sentinel-2 satellite. Using the computational resources provided by the GEE platform, we successfully inverted the temperature and salinity of the Shenzhen Bay region and plotted the distribution of seawater salinity in the bay. From the distribution map, it can be seen that the seawater salinity increases in a diffuse manner from the river inlet to the outer sea.
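The ensemble idea discussed above, in which many weak learners are trained on randomly resampled data and their predictions are averaged, can be illustrated with a toy bagging sketch. This is deliberately simpler than a random forest: it uses one-split regression stumps on a single feature and omits the per-split feature subsampling that random forest adds. All function names and data are hypothetical, and the code is not the implementation used in the experiments.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump that minimizes squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no valid threshold between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the mean
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_estimators=50, seed=0):
    """Average stumps trained on bootstrap resamples of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


# Toy nonlinear relationship (hypothetical data).
xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]

single = fit_stump(xs, ys)
ensemble = fit_bagged(xs, ys, n_estimators=50, seed=0)
prediction = ensemble(1.0)
```

A single stump can output only two distinct values, whereas the average of many stumps with different split points follows the curve more smoothly; this is the sense in which individual learners' errors can offset one another.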

Conclusions
The purpose of this study is to design a water quality prediction framework and then evaluate the performance of different machine learning methods in water quality prediction. Through the experiments conducted in two different types of water bodies, inland water and nearshore water, we can draw the following conclusions:
• Machine learning methods and neural network methods can effectively predict the TN in rivers (with an accuracy of 95.1%). Thus, the water quality prediction framework we designed can be used as a soft alternative to sensors in cases where monitoring requirements are less stringent. Applied to rivers, it can provide real-time and rapid prediction of water quality, which offers a reference basis for river water quality monitoring and a decision aid for river management.
• The random forest-based water quality prediction framework can be applied to the inversion of seawater temperature (accuracy reaches 95.35%) and salinity (accuracy reaches 92.94%). Using the reflectance of the water body in the remote sensing bands, the trained model can retrieve these parameters without the help of water quality data.
• The GEE platform is well suited to remote sensing computation. GEE provides satellite data resources, including the Landsat and Sentinel series, as well as products developed from these data. Coupled with its powerful computing capacity, it makes large-scale remote sensing calculations straightforward.

Future Work
Now that the random forest-based water quality prediction framework has been designed, it can be applied in water quality monitoring work. In the future, the framework can be transformed into an online water quality monitoring tool and become part of a monitoring system. Specifically, when sensor data enter the system as streams, the framework can be used to obtain the water quality parameters that need to be predicted. In inland water quality monitoring, the work of this paper can not only predict TN at monitoring points that have no dedicated TN sensor, but also fill the data gaps caused by differing sampling and testing frequencies, which supports time series studies and similar analyses. How to feed stream data into the model and produce prediction results from this input is the direction of our next work, in which the dynamic updating of the model is also a focus.