Prediction of Total Phosphorus Concentration in Macrophytic Lakes Using Chlorophyll-Sensitive Bands: A Case Study of Lake Baiyangdian

Total phosphorus (TP) is a significant indicator of water eutrophication. As a typical macrophytic lake, Lake Baiyangdian is of considerable importance to the North China Plain's ecosystem. However, the lake's eutrophication is severe, threatening the local ecological environment. The correlation between chlorophyll and TP provides a mechanism for TP prediction. In view of the absorption and reflection characteristics of chlorophyll in inland water, we propose a method to predict TP concentration in a macrophytic lake with spectral characteristics dominated by chlorophyll. In this study, water spectra noise is removed by discrete wavelet transform (DWT), and chlorophyll-sensitive bands are selected by gray relation analysis (GRA). To verify the effectiveness of the chlorophyll-sensitive bands for TP concentration prediction, three different machine learning (ML) algorithms were used to build prediction models: partial least squares (PLS), random forest (RF), and adaptive boosting (AdaBoost). The results indicate that the PLS model performs well in terms of TP concentration prediction, with the least time consumption: the coefficient of determination (R²) and root mean square error (RMSE) are 0.821 and 0.028 mg/L in the training dataset, and 0.741 and 0.029 mg/L in the testing dataset, respectively. Compared with the empirical model, the method proposed herein considers the correlation between chlorophyll and TP concentration and achieves higher accuracy. The results indicate that chlorophyll-sensitive bands are effective for predicting TP concentration.


Introduction
Lakes are vital freshwater resources on land, fulfilling several key ecological functions, such as providing water, purifying pollution, and maintaining biodiversity [1][2][3].
Owing to the influence of human activities and urbanization, many lakes face serious ecological problems, including reduced water volume, declining aquatic biodiversity, and water eutrophication. Accurate and rapid monitoring of lake water quality is the premise and foundation of lake ecological environmental management [4,5].
Phosphorus is an essential element for algae growth, and the monitoring of total phosphorus (TP) is crucial for the monitoring and treating of water environments. However, high-precision TP concentration prediction remains challenging [6][7][8][9][10]. The existing in situ TP monitoring techniques are primarily based on the chemical method, which has the disadvantages of a long analysis period, chemical reagent consumption, and the generation of secondary pollution. In recent decades, remote sensing technology has been widely used to monitor the water quality of various inland lakes, due to its wide range of capabilities and timeliness [11][12][13][14]; it could serve as a new tool for the monitoring of TP concentration.
Because TP is optically inactive, its concentration is difficult to invert using physical models [6,9]. The existing TP inversion models are broadly divided into direct and indirect models. The direct models establish the relationship between remote sensing reflectance (Rrs) and the measured TP concentration, and have been widely applied for water quality monitoring owing to their simplicity and feasibility. The direct models typically use statistical methods to estimate the water quality parameters with little consideration of the mechanism; therefore, their applicability is limited and depends on the study area and data [7][8][9]. Chlorophyll-a, total suspended matter (TSM), and colored dissolved organic matter (CDOM) are optically active and have distinct spectral responses, and TP concentration is correlated with the content of these optically active substances [6,15,16]. The indirect methods typically establish the relationship between the TP and optically active substances, and the TP concentration is subsequently retrieved from the inversion results of the optically active substances [17]. Unlike the direct models, the indirect models consider the mechanism by which the TP is inverted. However, their accuracy depends on both the inversion results of the optically active substances and the strength of the correlation between the TP and those substances. Earlier studies demonstrated a correlation between chlorophyll-a and TP, as well as total nitrogen (TN) [18][19][20][21][22][23], which provides a mechanism for TP concentration prediction through the chlorophyll-sensitive bands. Based on the spectral responses of emergent plants to different TN concentrations, Wang et al. and Liu et al. retrieved the concentration of TN through the sensitive bands of vegetation, using unmanned aerial vehicle (UAV) hyperspectral data for Ebinur Lake, and obtained high TN concentration prediction accuracy [24,25].
Machine learning (ML) is a branch of computer science that has been widely used for ecological and environmental remote sensing. Owing to their good computing performance and nonlinear mapping capabilities, ML models such as support vector machines (SVM), random forest (RF), genetic algorithms (GA), extreme learning machines (ELM), and artificial neural networks (ANN) have increasingly been adopted for the inversion of water quality parameters in recent years [24,26-30]. Numerous researchers have used ML algorithms to analyze the relationship between measured water quality parameters and spectral data, indicating that ML algorithms may be capable of handling the nonlinear relationships between reflectance and water quality parameters [24,28,31].
Lake Baiyangdian is the largest freshwater wetland in the North China Plain, and the largest lake in Hebei Province [32][33][34]. It is a typical plant-dominated shallow freshwater lake and a significant freshwater breeding base in northern China, with an extremely high ecological and economic status [35][36][37][38][39]. Lake Baiyangdian is also instrumental in maintaining the wetland ecosystem's balance, regulating the climate, improving the temperature and humidity, replenishing the groundwater, and protecting biodiversity and rare species [40]. Industrial wastewater, domestic sewage, and domestic waste have negatively impacted the ecological environment in some areas of Lake Baiyangdian [37,40,41], posing a considerable threat to the water environment's security. Lake Baiyangdian is facing serious eutrophication and its water quality urgently requires monitoring; several studies have used remote sensing technology to assess the lake's water quality [32,33,42]. However, TP, an important indicator of water eutrophication, remains insufficiently investigated in Lake Baiyangdian.
Consequently, the present study's objectives are as follows: to explore the characteristic bands for TP inversion of inland lakes with chlorophyll-dominated spectral characteristics; to verify the effectiveness of the TP predictions based on chlorophyll-sensitive bands through different ML models; and to characterize the spatial distribution of water quality at sampling points across Lake Baiyangdian.

Study Area
Lake Baiyangdian is a freshwater lake in the Haihe River Basin, located at 38°43′-39°02′ N and 115°45′-116°07′ E in the Xiong'an New Area of Hebei Province, China (Figure 1). Lake Baiyangdian is composed of 143 lakes of various sizes with a total area of 366 square kilometers and an average annual water storage of 1.32 billion cubic meters. The river network in the basin shows a fan-shaped distribution [34,43]. The climate in the Lake Baiyangdian area is a typical temperate monsoon climate, with uneven distribution of precipitation throughout the year [36]. More of the precipitation occurs during the spring, and less in the autumn, while the precipitation between June and September accounts for 70-80% of the entire year's rainfall [44]. The lake has a semi-arid climate: the drought index is 2.98, the annual average temperature is 7 °C, the average precipitation is 550 mm, and the annual average evaporation is 1637 mm [44,45].
Lake Baiyangdian is experiencing extreme water shortages. Its annual average water resource is 3.118 billion m³, and the per capita water resource is 297 m³, a mere 1/10 of the national per capita water resource. Owing to the occurrence of drying-up incidents over the last 30 years, the amount of water entering Lake Baiyangdian has decreased [46]. At the same time, severe organic pollution and eutrophication affect most areas of the lake, and the main pollutants are chemical oxygen demand (COD) and TP derived from the rivers entering the lake from scenic spots and households [43,47-49]. Due to the domestic sewage and agricultural non-point source pollution in the lake area, Lake Baiyangdian's water quality is poor (largely at the level of class IV-V) [46,48,50], and is deteriorating in some areas, posing a serious threat to the ecological environment. The situation of Lake Baiyangdian's ecological environment is critical and must be addressed.
Figure 2 presents the study flowchart.
The overall TP prediction framework comprises four steps, each of which is described in detail below.

Data Acquisition
Owing to the rich spectral information, hyperspectral remote sensing can capture the water's weak spectral characteristics and has been widely used in water quality monitoring [51][52][53][54][55]. In this study, a PSR-3500 portable spectrometer was used to measure the water spectra. The PSR-3500 is widely used for spectral measurement of ground objects. It has 1024 channels over a spectral range of 350-2500 nm, with spectral resolutions of 3.5, 10, and 7 nm at 700, 1500, and 2100 nm, respectively. A fiber optic probe with a field of view of 25° was used for the measurements.
The 62 sampling points were evenly distributed throughout the intersection and middle of the river in Lake Baiyangdian. The experiment was conducted on sunny days between 10:00 and 14:00, from 22 to 29 September 2018. For each sampling site, the radiances of water, sky, and the reference panel were measured, and water samples were collected simultaneously. Special observation geometry was adopted to avoid any influence of ships' shadows and direct solar radiation [56].
All of the water samples were placed in an incubator and then brought to the chemical laboratory, where the TP concentration was tested within 24 h. The TP concentration was measured by the ammonium molybdate spectrophotometric method (GB 11893-1989, issued by China). To reduce the influence of random error, the spectra were measured five times at the same spot, and the final spectrum for each sample site was taken as the average of the five spectra. Rrs was derived using Equation (1):

$$R_{rs} = \frac{\rho_p \left(L_{SW} - r\,L_{sky}\right)}{\pi L_p} \tag{1}$$

where L_SW is the total radiance of the water; L_sky is the measured radiance of the sky; L_p is the measured radiance of the reference panel; ρ_p is the reflectance of the reference panel (30%); and r is the skylight reflectance at the air-water surface.
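The reflectance calculation can be sketched in a few lines. The following is a minimal illustration, assuming the commonly used above-water form Rrs = ρ_p (L_SW − r L_sky) / (π L_p) applied band by band; the function names and the example value of r are hypothetical and are not taken from the study.

```python
import math


def remote_sensing_reflectance(l_sw, l_sky, l_p, r, rho_p=0.30):
    """Per-band Rrs from water, sky, and reference-panel radiances.

    Assumed form: Rrs = rho_p * (L_sw - r * L_sky) / (pi * L_p).
    rho_p defaults to the 30% panel reflectance stated in the text;
    r (skylight reflectance at the air-water surface) must be supplied.
    """
    if not (len(l_sw) == len(l_sky) == len(l_p)):
        raise ValueError("all radiance spectra must have the same number of bands")
    return [
        rho_p * (sw - r * sky) / (math.pi * p)
        for sw, sky, p in zip(l_sw, l_sky, l_p)
    ]


def mean_spectrum(spectra):
    """Average repeated measurements of one site, band by band."""
    n = len(spectra)
    return [sum(band) / n for band in zip(*spectra)]
```

In use, the five repeated spectra of a site would first be passed through `mean_spectrum`, and the averaged radiances would then be passed to `remote_sensing_reflectance` with a value of r appropriate to the viewing geometry and sky conditions.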

Spectral Preprocessing
Spectral dimension noise distorts the spectrum of ground objects, shifts the central wavelength, and thus affects the inversion results of water quality parameters. Therefore, the removal of spectral dimension noise is critical to improving the accuracy of water quality parameter retrieval. Owing to its good time-frequency resolution characteristics, wavelet transform (WT) is widely used to remove noise from spectral data [57,58]. WT analyzes a function jointly in the time and frequency domains, and includes the continuous WT and the discrete wavelet transform (DWT). For the discrete case, the wavelet sequence is defined as follows:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right) \tag{2}$$

where a and b are the zoom and translation factors, respectively; a, b ∈ R; and a ≠ 0. For any function f(t), the DWT is defined using Equation (3):

$$W_f(a,b) = \langle f, \psi_{a,b} \rangle = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} f(t)\,\overline{\psi\!\left(\frac{t-b}{a}\right)}\,dt \tag{3}$$

DWT decomposes the signal into detail and approximation coefficients. The signal S is decomposed into three layers, and the decomposition relation is S = A3 + D3 + D2 + D1, as shown in Figure 3. A3 is the approximation coefficient of the original signal, which is the low-frequency component; D1-D3 are the detail coefficients, which are the high-frequency components.
To evaluate the de-noising results, the normalized correlation coefficient (NCC), signal to noise ratio (SNR), and peak signal to noise ratio (PSNR) were calculated according to Equations (4)-(6), respectively, in which N is the total number of samples, x_i is the spectral reflectance before de-noising, and x̂_i is the spectral reflectance after de-noising. The DWT de-noising method was performed using MATLAB R2017a.
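The three de-noising indicators can be illustrated with a short sketch. The definitions below are commonly used forms and are an assumption here, not a reproduction of the study's Equations (4)-(6): NCC is the normalized inner product of the two spectra, SNR compares signal energy with residual energy, and PSNR compares the squared peak of the original spectrum with the mean squared residual. All function names are hypothetical.

```python
import math


def ncc(x, x_hat):
    """Normalized correlation coefficient between original and de-noised spectra."""
    num = sum(a * b for a, b in zip(x, x_hat))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in x_hat))
    return num / den


def snr_db(x, x_hat):
    """Signal-to-noise ratio in dB (assumed form: signal energy / residual energy).

    Raises ZeroDivisionError if the two spectra are identical.
    """
    signal = sum(a * a for a in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 10.0 * math.log10(signal / noise)


def psnr_db(x, x_hat):
    """Peak signal-to-noise ratio in dB (assumed peak: max |x_i| of the original)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    peak = max(abs(a) for a in x)
    return 10.0 * math.log10(peak ** 2 / mse)
```

Under these definitions, higher values of all three indicators correspond to a de-noised spectrum that stays closer to the original waveform.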

Gray Relation Analysis
The gray relational degree measures how closely two sequences are related. Gray relation analysis (GRA) is based on gray system theory, and reveals the characteristics and degree of the relationship between factors [59,60]. Because it requires few samples and little computation, and does not require the data to follow a typical distribution, GRA is widely used for nonlinear feature selection. Because the GRA degree is dimensionless, it can express the correlation between the TP concentration and the hyperspectral reflectance of the water samples. A more detailed description of GRA is provided by Kuo et al. [61]. The GRA was implemented in Python 3.7.
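As a rough illustration of how a GRA degree per band might be computed, the sketch below follows the widely used Deng formulation: each sequence is min-max normalized, absolute differences from the reference sequence (TP) are taken, and the relational coefficients are averaged over samples. The normalization choice, the distinguishing coefficient of 0.5, and the function names are assumptions; the study's own implementation may differ.

```python
def min_max(seq):
    """Scale a sequence to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(v - lo) / (hi - lo) for v in seq]


def grey_relational_grades(reference, candidates, rho=0.5):
    """Return one gray relational degree per candidate sequence.

    reference:  measured TP concentration for each sample.
    candidates: one sequence per band, holding that band's reflectance
                for the same samples in the same order.
    rho:        distinguishing coefficient (0.5 is a common default).
    """
    ref = min_max(reference)
    deltas = []
    for cand in candidates:
        scaled = min_max(cand)
        deltas.append([abs(a - b) for a, b in zip(ref, scaled)])

    # Global minimum and maximum differences over all bands and samples.
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0.0:
        return [1.0] * len(candidates)

    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))
    return grades
```

A band whose normalized reflectance tracks the normalized TP concentration closely receives a degree near 1, while a band with an unrelated pattern receives a lower degree.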

Prediction Model Construction and Verification
To verify the effectiveness of predicting the TP concentration through the chlorophyll-sensitive bands, and to better understand the nonlinear relationship between TP concentration and reflectance, three typical ML algorithms were used in this paper: partial least squares (PLS), random forest (RF), and adaptive boosting (AdaBoost).

(a) Partial least squares
PLS is a typical parametric regression method, which has been widely used owing to its good performance [62,63]. It is applicable to cases where the predictor variables are highly collinear and their number significantly exceeds the number of available samples. The PLS method selects successive orthogonal factors that maximize the covariance between the predictor and response variables. It takes advantage of the correlation between the TP concentration and reflectance spectra, and derives quantitative information from the spectral data.

(b) Random forest
RF is an ensemble learning algorithm based on decision trees. It often achieves high accuracy on nonlinear regression and classification problems, and has therefore been widely used in remote sensing studies [64,65]. The RF algorithm trains multiple tree models on the input samples and then integrates all of the models' results into a single prediction. The performance of the RF models is usually evaluated based on the out-of-bag (OOB) error. A detailed illustration of the RF method is available in the paper of Genuer et al. [66].

(c) Adaptive boosting
The AdaBoost algorithm is an ensemble learning algorithm based on the boosting framework. As an effective statistical learning algorithm, AdaBoost is generally regarded as relatively resistant to overfitting and is widely used in classification and regression problems [27,28,67,68]. It constructs a strong learner serially, with each new weak learner trained to make up for the shortcomings of the previous one. The training samples are weighted in each iteration, and the weights are adjusted according to the error [26]. In the final combination, the weight of a learner with a larger error is reduced and the weight of a learner with a smaller error is increased, so that the weighted set becomes a strong learner. AdaBoost can also be viewed as an iterative search strategy: by searching the learner or function space, it constructs a learner that makes the objective function sufficiently small.
To evaluate the predictions of the TP concentration, four parameters were calculated, namely the coefficient of determination (R²), root mean square error (RMSE), ratio of performance to deviation (RPD), and explained variance score (EVS). The RPD is the ratio between the standard deviation (SD) and the RMSE. These parameters were determined using Equations (7)-(10), respectively:

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{7}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{8}$$

$$RPD = \frac{SD}{RMSE} \tag{9}$$

$$EVS = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)} \tag{10}$$

where N is the total number of samples; ŷ_i is the predicted value; y_i is the measured value; and ȳ is the mean of the measured values. Generally, a robust model has a high R², RPD, and EVS, and a low RMSE. The PLS, RF, and AdaBoost methods and the four evaluation indicators were implemented in Python 3.7.
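The four indicators can be computed directly from the measured and predicted values. The sketch below uses the standard definitions: R² as one minus the ratio of residual to total sum of squares, RMSE as the root of the mean squared residual, RPD as SD divided by RMSE, and EVS as one minus the ratio of residual variance to the variance of the measured values. Using the sample SD (N − 1 denominator) for RPD is an assumption, as are the function names.

```python
import math


def _variance(values):
    """Population variance (denominator N)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def r2_score(y, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_pred):
    """Root mean square error, in the units of y (mg/L for TP)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y))


def rpd(y, y_pred):
    """Ratio of performance to deviation: SD of measured values over RMSE.

    The sample SD (N - 1 denominator) is assumed here.
    """
    mean_y = sum(y) / len(y)
    sd = math.sqrt(sum((a - mean_y) ** 2 for a in y) / (len(y) - 1))
    return sd / rmse(y, y_pred)


def evs(y, y_pred):
    """Explained variance score: 1 - Var(y - y_pred) / Var(y)."""
    residuals = [a - b for a, b in zip(y, y_pred)]
    return 1.0 - _variance(residuals) / _variance(y)
```

One design consequence worth noting is that EVS ignores a constant bias in the predictions, whereas R² and RMSE penalize it, so reporting both helps separate systematic offset from scatter.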

Statistical Analysis
Based on the in situ data, the water samples were sorted according to the TP concentration. Every third sample was assigned to the testing dataset, and the remaining samples were assigned to the training dataset. The 62 water samples collected from Lake Baiyangdian were thus divided into 42 training samples and 20 testing samples. The training and testing datasets were representative of the entire water sample dataset in terms of the minimum, maximum, mean, and SD values. The coefficient of variation (CV) was used to complement the SD. Table 1 presents the statistics for the water samples' TP concentrations. The minimum TP concentration was 0.05 mg/L, and the maximum concentration was 0.31 mg/L.
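The sorted, every-third-sample split can be sketched as follows. This is an illustration only: it assumes that positions 3, 6, 9, and so on in the sorted list go to the testing dataset (the text does not state which offset was used), and the function names are hypothetical. With 62 samples this rule yields 42 training and 20 testing samples, matching the counts reported above.

```python
import statistics


def split_every_third(samples, key):
    """Sort samples by key and send every third one to the testing set."""
    ordered = sorted(samples, key=key)
    train, test = [], []
    for position, sample in enumerate(ordered, start=1):
        if position % 3 == 0:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


def summarize(values):
    """Minimum, maximum, mean, sample SD, and coefficient of variation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {"min": min(values), "max": max(values),
            "mean": mean, "sd": sd, "cv": sd / mean}
```

Because the samples are sorted before splitting, both subsets span nearly the full TP range, which is why their summary statistics resemble those of the whole dataset.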

DWT Denoising
In the DWT analysis, we decomposed the spectral data into three layers after several tests. WT-based spectral de-noising filters use either hard or soft thresholds. In this paper, the detail coefficients of each layer are filtered by threshold selection, and the filtered spectral data are reconstructed by inverse WT. Table 2 compares the de-noising effects among different mother wavelet functions (db, sym, and coif). NCC was used to evaluate the similarity of the spectra before and after de-noising; SNR and PSNR were used to evaluate the information reconstruction quality of the spectra. Generally, the better the information quality of the spectra, the higher the NCC, SNR, and PSNR. As Table 2 illustrates, the NCC values show little difference between the different functions, demonstrating that good waveform similarity can be maintained after de-noising with different wavelet functions. The SNR and PSNR of the spectra de-noised by db4 were 45.6378 and 51.5475 dB, respectively, higher than those of the other wavelet functions. The water spectra de-noised by db4 are shown in Figure 4a.
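The decompose, threshold, and reconstruct procedure can be illustrated with a self-contained sketch. To keep it dependency-free, it substitutes the Haar wavelet for the db4 wavelet used in the study, and it takes the threshold as a user-supplied value because the text does not state how the threshold was chosen; both are assumptions, as are all function names.

```python
import math

SQRT2 = math.sqrt(2.0)


def hard_threshold(coeffs, t):
    """Zero coefficients with magnitude <= t; keep the others unchanged."""
    return [c if abs(c) > t else 0.0 for c in coeffs]


def soft_threshold(coeffs, t):
    """Zero small coefficients and shrink the others toward zero by t."""
    out = []
    for c in coeffs:
        mag = abs(c) - t
        if mag <= 0:
            out.append(0.0)
        else:
            out.append(mag if c > 0 else -mag)
    return out


def haar_step(signal):
    """One Haar analysis level: returns (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def denoise(signal, threshold, levels=3, mode="soft"):
    """Three-level decomposition (A3, D3, D2, D1), thresholding, reconstruction."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    shrink = soft_threshold if mode == "soft" else hard_threshold
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(shrink(detail, threshold))  # stored as D1, D2, D3
    for detail in reversed(details):               # rebuild from D3 down to D1
        approx = haar_inverse_step(approx, detail)
    return approx
```

With a threshold of zero the signal is reconstructed unchanged, and as the threshold grows the output approaches the smooth approximation A3 alone; a practical threshold lies between these extremes.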

Feature Band Selection
The water spectral data were acquired in the range of 400-1000 nm, which is generally used in water color remote sensing. Figure 4a shows the 62 water reflectance spectra collected in Lake Baiyangdian. The reflectance spectra exhibit obvious chlorophyll-dominated characteristics. The absorption features close to 440 and 675 nm are caused by chlorophyll absorption, and the absorption feature close to 620 nm is caused by phycocyanin. The absorption feature close to 440 nm is significantly affected by the suspended matter and CDOM, while the absorption feature close to 675 nm is less affected by the other water constituents. Lake Baiyangdian's water spectra exhibit a clear reflection peak close to 700 nm, which is one of the most important spectral features for chlorophyll concentration in inland water. Figure 4b shows the results of GRA between the spectral reflectance and the TP concentration. The GRA degree of all of the bands is >0.8, and the bands with the highest GRA degrees are close to 700 nm, consistent with the chlorophyll-sensitive bands. We selected the 37 characteristic bands from 674.4 to 736.3 nm to predict the TP concentration. These bands cover the most important spectral features of chlorophyll, including the absorption valley at 675 nm and the reflection peak at 700 nm. The GRA degrees of these bands are all >0.86, and their reflectance values were used as the model input to predict TP concentration.
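The band selection step amounts to filtering the spectrum by a wavelength window and a GRA degree threshold. The sketch below uses the 674.4-736.3 nm window and the 0.86 threshold reported above as defaults; the wavelength grid, the function names, and the data layout (one reflectance list per sample) are assumptions for illustration.

```python
def select_bands(wavelengths, grades, lo_nm=674.4, hi_nm=736.3, min_grade=0.86):
    """Indices of bands inside the window whose GRA degree exceeds min_grade."""
    return [
        i for i, (w, g) in enumerate(zip(wavelengths, grades))
        if lo_nm <= w <= hi_nm and g > min_grade
    ]


def band_matrix(spectra, band_idx):
    """Build the model input: one row per sample, one column per selected band."""
    return [[spectrum[i] for i in band_idx] for spectrum in spectra]
```

The resulting matrix is what would be passed to the regression models, in place of the full VNIR spectrum.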

Prediction of TP Concentration
A total of 37 chlorophyll-associated spectral bands were used to predict Lake Baiyangdian's TP concentration, with all the visible-near infrared (VNIR) bands used for comparison. To verify the chlorophyll-sensitive bands' applicability to the TP concentration inversion, three different ML algorithms (PLS, RF, and AdaBoost) were used to construct the prediction model. Figure 5 reveals the TP concentration prediction performance of the different ML models, using chlorophyll-sensitive bands. The R 2 is >0.8 in the training dataset and >0.5 in the testing dataset for all ML models. The PLS model performs well for both of the training and testing datasets, and the R 2 values for the training and testing datasets are 0.821 and 0.741, respectively. The R 2 value of the RF model in the training dataset is 0.882, but only 0.523 in the testing dataset. The RF model's scatter plots in the testing dataset are discrete, indicating that the testing dataset's TP concentration could not be accurately predicted. The AdaBoost model shows the best performance for the training dataset (R 2 = 0.923). However, its R 2 value is only 0.608 for the testing dataset, possibly due to overfitting. Although the PLS model accuracy is not the highest for the training dataset, it exhibits the highest accuracy for the testing dataset. Compared with the other two ML models, the R 2 of the PLS model is >0.7 for both the training and testing datasets, demonstrating that the PLS model performs well in terms of the TP concentration prediction.   Figure 6 further illustrates the TP concentration prediction performance of the different models using all VNIR bands. The R 2 values for the training dataset were 0.817, 0.877, and 0.962 for the PLS, RF, and AdaBoost models, respectively. The R 2 values for the testing dataset were 0.585, 0.508, and 0.596, respectively. 
Compared with the models established using the chlorophyll-sensitive bands, the prediction accuracy of the PLS and RF models established using all VNIR bands was lower, while the AdaBoost model's prediction accuracy was higher for the training dataset. For the testing dataset, the accuracy of the three ML models established using all VNIR bands was lower than that of those established using the chlorophyll-sensitive bands. Although the accuracy of the AdaBoost model established using all VNIR bands was higher for the training dataset than that using the chlorophyll-sensitive bands, it performed poorly with the testing dataset, possibly as a result of overfitting in the training dataset. The verification results of the three different ML models demonstrate the feasibility of inverting the TP concentration, using chlorophyllsensitive bands.  Figure 7 illustrates the running times of different ML models predicting TP using all VNIR spectra and chlorophyll-sensitive bands. Compared with all VNIR bands, the running time for predicting TP using the chlorophyll-sensitive bands is lower, indicating that  Figure 7 illustrates the running times of different ML models predicting TP using all VNIR spectra and chlorophyll-sensitive bands. Compared with all VNIR bands, the running time for predicting TP using the chlorophyll-sensitive bands is lower, indicating that selection of the chlorophyll-sensitive bands could reduce the running time while maintaining prediction accuracy. The PLS method shows the shortest running time among the three ML models, at <0.2 s in both the chlorophyll-sensitive and VNIR bands. The RF model has the longest running time, of >0.5 s for both the chlorophyll-sensitive and VNIR bands. The RF algorithm creates a decision tree for each sample, and then obtains the prediction results for each decision tree. The final prediction result was then selected according to the vote results, which consumes a lot of time [26]. 
The AdaBoost model is weighted and iterated in the training process, and the weight is adjusted according to the error; thus, it also consumes more time [26,28,68]. The time difference of the AdaBoost model between the entire VNIR and chlorophyll-sensitive bands is the greatest between the three models. selection of the chlorophyll-sensitive bands could reduce the running time while maintaining prediction accuracy. The PLS method shows the shortest running time among the three ML models, at <0.2 s in both the chlorophyll-sensitive and VNIR bands. The RF model has the longest running time, of >0.5 s for both the chlorophyll-sensitive and VNIR bands. The RF algorithm creates a decision tree for each sample, and then obtains the prediction results for each decision tree. The final prediction result was then selected according to the vote results, which consumes a lot of time [26]. The AdaBoost model is weighted and iterated in the training process, and the weight is adjusted according to the error; thus, it also consumes more time [26,28,68]. The time difference of the AdaBoost model between the entire VNIR and chlorophyll-sensitive bands is the greatest between the three models.

Analysis of Time Efficiency
Although the AdaBoost method shows the highest accuracy in the training datasets, its time consumption is also high. By contrast, the PLS method shows high accuracy and minimal time consumption. In practical application, when substantial amounts of data must be obtained in real time, the PLS models may be used to predict the TP concentration more accurately and quickly.
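The effect of band selection on running time can be illustrated with a minimal timing sketch. It is not the implementation used in this study: the one-component PLS1 fit, the synthetic spectra, the band counts, and the names `fit_pls1_one_component`, `median_runtime`, and `make_spectra` are all assumptions introduced for illustration. Absolute times depend on the machine and should not be compared with Figure 7.

```python
import random
import statistics
import time


def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 regression in pure Python (illustrative only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between bands and target.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5 or 1.0
    w = [v / norm for v in w]
    # Scores of each sample, then regression of the target on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t) or 1.0
    q = sum(t[i] * yc[i] for i in range(n)) / tt
    return x_mean, y_mean, w, q


def median_runtime(fit, X, y, repeats=5):
    """Return the median wall-clock time (s) of `fit(X, y)` over several runs."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fit(X, y)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


def make_spectra(n_samples, n_bands, seed=0):
    """Generate synthetic reflectance spectra and a noisy target (not real data)."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_bands)] for _ in range(n_samples)]
    y = [sum(row[:5]) / 5 + rng.gauss(0, 0.01) for row in X]
    return X, y


X_all, y = make_spectra(n_samples=60, n_bands=400)
X_sub = [row[100:140] for row in X_all]  # hypothetical sensitive-band slice
t_all = median_runtime(fit_pls1_one_component, X_all, y)
t_sub = median_runtime(fit_pls1_one_component, X_sub, y)
print(f"all bands: {t_all:.4f} s, band subset: {t_sub:.4f} s")
```

The median over several repeats is used because single timings are noisy; the same harness could wrap any other model's fitting function.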

Effectiveness Analysis of Chlorophyll-Sensitive Bands
To verify the accuracy of the TP predictions using the chlorophyll-sensitive bands, several empirical and semi-empirical models were compared, using single-band, logarithmic, ratio, difference, first- and second-order differential, and three- and four-band methods. The dataset division rules are the same as those detailed in Section 3.1. The characteristic bands selected by each empirical and semi-empirical method, together with the prediction accuracy, are shown in Table 3. Only the results for R² and RMSE are shown, owing to space constraints.
As Table 3 demonstrates, although the positions of the selected characteristic bands differ depending on the empirical or semi-empirical method, they all lie between 600 and 750 nm. These bands are also chlorophyll-sensitive, indicating that the chlorophyll-sensitive band reflectance correlates strongly with TP concentration. With the exception of the ratio and difference models, which have a higher prediction accuracy (R² > 0.6), the models established using the other empirical methods failed to predict the TP concentration in Lake Baiyangdian. Although the ratio model's R² was high in both the training and testing datasets, the testing dataset's RMSE was also high. This indicates that the ratio model could only predict the relative TP concentration accurately; the prediction error is large for the absolute value. The characteristic bands of the empirical models were selected using statistical methods, and only one or two bands were used to predict the TP concentration. The empirical models ignore the mechanisms of TP inversion and do not make full use of the rich spectral information provided by hyperspectral data, so their applicability is low [6,8]. The three- and four-band methods, which are commonly used semi-empirical models, also performed poorly in predicting the TP concentration of Lake Baiyangdian, which may be caused by insufficient use of the spectral information.
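The observation that a model can show a high R² together with a high RMSE can be illustrated numerically. The sketch below assumes that R² is computed as the squared Pearson correlation between observed and predicted values; this section does not state which definition was used, so this is an assumption. The TP values are hypothetical and are not measurements from this study.

```python
import math


def pearson_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical TP concentrations (mg/L), for illustration only.
observed = [0.03, 0.05, 0.08, 0.12, 0.20]
# Predictions that keep the ordering and spacing but are offset by +0.05 mg/L.
biased = [o + 0.05 for o in observed]

print(f"r2 = {pearson_r2(observed, biased):.3f}")
print(f"RMSE = {rmse(observed, biased):.3f} mg/L")
```

Under this assumption, a constant offset leaves the squared correlation at 1 while the RMSE equals the offset (0.05 mg/L), which is one way a model can rank samples correctly yet miss their absolute concentrations.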
In this study, we selected chlorophyll-sensitive bands ranging from 674.4 to 736.3 nm, which include the strong chlorophyll reflection and absorption bands in inland water [69][70][71]. The methodology proposed herein considers the mechanism for TP inversion and makes full use of the spectral information, thus avoiding the low applicability associated with band combinations that ignore TP inversion mechanisms. Compared with the model established using the entire VNIR range, the model established using the chlorophyll-sensitive bands as input not only shows a higher accuracy for both the training and testing datasets, but also reduces the running time, which is an advantage when substantial amounts of data need to be obtained in real time.
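The band selection step relies on gray correlation analysis (GRA). The following is a minimal sketch of the textbook gray relational grade, not the exact procedure of this study. It assumes min-max normalization, a distinguishing coefficient of 0.5, and chlorophyll concentration as the reference sequence. The function names, band labels, and values are hypothetical.

```python
def min_max(seq):
    """Scale a sequence to [0, 1]; a constant sequence maps to zeros."""
    lo, hi = min(seq), max(seq)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in seq]


def gray_relational_grades(reference, candidates, rho=0.5):
    """Return the gray relational grade of each candidate against the reference.

    reference  -- reference sequence (here: chlorophyll per sample, assumed)
    candidates -- dict mapping a band label to its reflectance per sample
    rho        -- distinguishing coefficient (0.5 is a common default)
    """
    ref = min_max(reference)
    deltas = {
        name: [abs(r - c) for r, c in zip(ref, min_max(seq))]
        for name, seq in candidates.items()
    }
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        # Every candidate matches the reference exactly after normalization.
        return {name: 1.0 for name in deltas}
    return {
        name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
        for name, ds in deltas.items()
    }


# Hypothetical values for five samples; not data from this study.
chlorophyll = [10, 20, 30, 40, 50]
bands = {
    "700 nm": [0.02, 0.03, 0.04, 0.05, 0.06],
    "450 nm": [0.05, 0.01, 0.04, 0.02, 0.03],
}
grades = gray_relational_grades(chlorophyll, bands)
ranked = sorted(grades, key=grades.get, reverse=True)
print(ranked, grades)
```

Bands would then be ranked by grade and the highest-ranked ones retained; the cut-off for retaining a band is a separate choice that this sketch does not make.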

Spatial Distribution Characteristics of Water Samples
The Environmental Quality Standards for Surface Water of China (EQSSWC; standard number: GB3838-2002) categorize water quality into five classes, which may be used to objectively evaluate water pollution. Class I water, which is the best quality, is used for source water and national nature reserves; class V has the worst quality and is applicable to areas with agricultural and landscape requirements. In light of the ecological function of Lake Baiyangdian, its TP concentration should be in the range of 0.02–0.2 mg/L; however, the concentrations at some of the sampling points exceeded the standard. Based on the EQSSWC, the water quality class of each sample was determined from its TP concentration; the water samples from Lake Baiyangdian fall into classes II-V (Figure 1).
As Figure 1 illustrates, most of the sampling points contained class III water; class V water was the least prevalent. The sampling points containing class II water are distributed in the north and east of Lake Baiyangdian; class III water is mainly distributed in the middle of the lake; and class IV and V water are mainly distributed in the west. The sampling points with serious water pollution (containing class IV and V water) are located in the west of Lake Baiyangdian, close to residential areas. By contrast, the sampling points containing class II water are mainly located close to large areas of open water. The main sources of pollution in Lake Baiyangdian include tourism, agriculture, aquaculture, and domestic wastewater [37,46,49]. Copious amounts of domestic sewage are discharged into the river, leading to more severe pollution in residential areas than elsewhere [39,46]. The lake's water quality affects local residents' health and plays a key role in the local ecosystem. It is thus a matter of some urgency to mitigate local domestic sewage discharge, improve water quality, and conserve the ecological environment of Lake Baiyangdian.
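The assignment of a water quality class from a TP concentration, as described at the start of this section, can be sketched as a threshold lookup. The upper limits below are assumed values for lakes and reservoirs as commonly quoted from GB3838-2002; they are not given in this section and should be checked against the standard before use. The sample concentrations and the name `tp_class` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Assumed upper TP limits (mg/L) for classes I-V; verify against GB3838-2002.
TP_UPPER_LIMITS = [0.01, 0.025, 0.05, 0.1, 0.2]
CLASS_NAMES = ["I", "II", "III", "IV", "V"]


def tp_class(tp_mg_per_l):
    """Return the class whose upper limit is the first one not exceeded."""
    idx = bisect_left(TP_UPPER_LIMITS, tp_mg_per_l)
    return CLASS_NAMES[idx] if idx < len(CLASS_NAMES) else "worse than V"


# Hypothetical TP concentrations (mg/L) at five sampling points.
samples = [0.02, 0.04, 0.045, 0.08, 0.15]
counts = Counter(tp_class(tp) for tp in samples)
print(counts)
```

A value equal to a limit is assigned to that class, since `bisect_left` treats each limit as inclusive; concentrations above the class V limit are reported separately rather than forced into class V.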

Conclusions
TP monitoring is of great significance for the management and treatment of water environments. However, because TP is an optically inactive substance, its concentration is difficult to invert using physical models. In this paper, Lake Baiyangdian was taken as the study area, and the wavelet transform (WT) de-noising method was used to remove background noise and extract the water's weak spectral information. Considering the correlation between TP and chlorophyll, the chlorophyll-sensitive bands were selected by GRA, and the TP concentration prediction model was constructed based on three ML algorithms (PLS, RF, and AdaBoost). The results demonstrate that the PLS model shows the best performance among the three ML algorithms in the testing dataset, with the least time consumption: the R² and RMSE are 0.741 and 0.029 mg/L, respectively. Compared with the empirical model, the method proposed herein has a higher prediction accuracy. Future studies will investigate the correlation between chlorophyll and TP in other lakes, and verify the method proposed in this paper.

Acknowledgments: The authors thank Changping Huang, Yao Chen, Jiao Wang, Na Qiao and Senlin Tang for their efforts in the collection of water samples and spectra in Lake Baiyangdian. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest: The authors declare no conflict of interest.