Machine Learning Algorithms for Chromophoric Dissolved Organic Matter (CDOM) Estimation Based on Landsat 8 Images

Chromophoric dissolved organic matter (CDOM) is crucial in the biogeochemical cycle and carbon cycle of aquatic environments. However, in inland waters, remotely sensed estimates of CDOM remain challenging due to the low optical signal of CDOM and complex optical conditions. Therefore, developing efficient, practical and robust models to estimate the CDOM absorption coefficient in inland waters is essential for successful water environment monitoring and management. We examined and improved different machine learning algorithms using extensive CDOM measurements and Landsat 8 images covering different trophic states to develop a robust CDOM estimation model. The algorithms were evaluated using 111 Landsat 8 images and 1708 field measurements covering CDOM light absorption coefficients a(254) from 2.64 to 34.04 m⁻¹. Overall, the four machine learning algorithms achieved more than 70% accuracy (R² > 0.70) for CDOM absorption coefficient estimation. Based on model training, validation and application to Landsat 8 OLI images, we found that Gaussian process regression (GPR) had higher stability and estimation accuracy (R² = 0.74, mean relative error (MRE) = 22.2%) than the other models. The estimation accuracies were R² = 0.75 and MRE = 22.5% for the backpropagation (BP) neural network, R² = 0.71 and MRE = 24.4% for random forest regression (RFR) and R² = 0.71 and MRE = 24.4% for support vector regression (SVR). In contrast, the best three empirical models had R² values of at most 0.56. The model accuracies for Landsat images of Lake Qiandaohu (oligo-mesotrophic state) were better than those for Lake Taihu (eutrophic state) because of the more complex optical conditions in eutrophic lakes. Therefore, machine learning algorithms trained on large datasets have great potential for CDOM monitoring in inland waters. Our study demonstrates that machine learning algorithms can be used to map the spatial-temporal patterns of CDOM in inland waters.


Introduction
Chromophoric dissolved organic matter (CDOM), which is also referred to as gelbstoff, gilvin or yellow matter, is widely found in natural water bodies and is a soluble and complicated mixture of organic substances consisting mainly of humic acids, fulvic acids and aromatic polymers [1]. CDOM can affect the underwater light field, and its generation, transport and transformation processes also influence the biogeochemical recycling of carbon, nitrogen and phosphorus in the water column [2][3][4][5][6].
Current research suggests that CDOM in water comes from multiple sources, including (a) allochthonous sources, which mainly include degraded organic matter from the surrounding terrestrial environment delivered by terrestrial runoff, precipitation and groundwater recharge, and resuspension of sediments [7], and (b) autochthonous sources, which include the chemical degradation products of organisms such as phytoplankton, macrophytes and bacteria [8]. The degradation process of CDOM mainly includes photochemical [...]

Most of the abovementioned CDOM estimation methods focus on a single lake or a few lakes in a small area and are aimed at improving the accuracy of the algorithm. In this study, we collected a large amount of CDOM light absorption coefficient data from lakes with different trophic states and applied different machine learning algorithms to Landsat 8 Operational Land Imager (OLI) imagery of these lakes, with the main objective of finding a robust algorithm to estimate the CDOM absorption coefficient. First, using CDOM measurements from lakes with different trophic states in the middle and lower reaches of the Yangtze River (YR), the upper reaches of the Huai River (RHR) and the Yunnan-Guizhou Plateau (YGP), together with Landsat 8 OLI images, we assessed the abilities of four machine learning algorithms, namely, the backpropagation (BP) neural network, Gaussian process regression (GPR), support vector regression (SVR) and random forest regression (RFR), to estimate CDOM absorption. Second, we compared empirical algorithms (including ratio models, normalized models, and binary primary polynomial models) with the machine learning algorithms using the same validation dataset. Finally, the CDOM remote sensing estimation results obtained for Lake Qiandaohu and Lake Taihu were used as examples to compare the estimation accuracy of the four machine learning algorithms for lakes with different trophic states.

Research Area
The scope of sampling in this study is spatiotemporally broad, mainly covering lakes in the upper reaches of the Huai River, the middle and lower reaches of the Yangtze River and the Yunnan-Guizhou Plateau (Figure 1). The sampling period of Lake Taihu ranges from 2013 to 2021, and the sampling period of most other lakes is concentrated in 2017-2021 (Table 1). The data for Lake Qiandaohu and Lake Taihu mostly come from routine monitoring (monthly sampling), and the data for the other lakes mostly come from winter, spring and summer bulk sampling in the lower and middle reaches of the Yangtze River and the upper reaches of the Huai River (Table 1). We chose to study the lakes in these regions because of their different trophic states [55].

Sample Collection and Processing
To accurately measure CDOM absorption, water samples collected in situ were stored under dark, refrigerated conditions in acid-washed polyvinyl chloride bottles and delivered to the laboratory in a timely manner [56]. The samples were then filtered at low pressure: large particles and plankton cells were first removed using precombusted Whatman GF/F filters, and the filtrate was then passed through nitrocellulose Millipore filters with a 0.22 µm pore size. Absorption spectra of the filtered samples were obtained at 1 nm intervals over the wavelength range 200-800 nm using a Shimadzu UV-Vis 2550 spectrophotometer, with Milli-Q water used as the blank reference [42]. Finally, the absorption coefficient at each wavelength was calculated by Equation (1) and corrected for scattering effects by removing the absorbance at 700 nm (Equation (2)) [57].
In Equations (1) and (2), λ denotes wavelength, D denotes the measured absorbance, α denotes the absorption coefficient in m⁻¹, α' is the uncorrected absorption coefficient in m⁻¹, and r denotes the cuvette path length in m.
In previous studies, freshwater scientists usually used the absorption coefficients at 350, 420 or 440 nm to represent CDOM in inland aquatic environments [13]. A previous study [1] showed that the coefficient of determination R² between a(350) and a(254) could reach 0.96 (p < 0.001), and we also found a stronger correlation between a(254) and the reflectance of OLI imagery. Therefore, in this study, we chose the absorption coefficient at 254 nm to characterize CDOM and used it to develop and validate the CDOM estimation models.
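The absorption calculation described in this section can be sketched as follows. The equations themselves are not reproduced in this text, so the forms used here, a'(λ) = 2.303 D(λ)/r and a(λ) = a'(λ) − a'(700) λ/700, are assumptions based on a commonly used convention that matches the symbol definitions above. The function names and the absorbance readings are hypothetical.

```python
def uncorrected_absorption(absorbance, path_length_m):
    """Assumed form of Equation (1): a'(lambda) = 2.303 * D(lambda) / r, in m^-1."""
    return 2.303 * absorbance / path_length_m


def corrected_absorption(spectrum, wavelength_nm, path_length_m, ref_nm=700):
    """Scattering-corrected absorption coefficient at one wavelength.

    spectrum maps wavelength (nm) to measured absorbance D.
    Assumed form of Equation (2): a = a' - a'(700) * lambda / 700.
    """
    a_raw = uncorrected_absorption(spectrum[wavelength_nm], path_length_m)
    a_ref = uncorrected_absorption(spectrum[ref_nm], path_length_m)
    return a_raw - a_ref * wavelength_nm / ref_nm


# Hypothetical absorbance readings for a 0.01 m (1 cm) cuvette.
spectrum = {254: 0.100, 700: 0.002}
a254 = corrected_absorption(spectrum, 254, path_length_m=0.01)
print(f"a(254) = {a254:.2f} m^-1")
```

With these illustrative readings, the correction term is small compared with the raw coefficient at 254 nm, which is typical when the absorbance at 700 nm is close to zero.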

Remote Sensing Data Processing
Landsat 8 OLI data have a temporal resolution of 16 days and a spatial resolution of 30 m and contain four visible bands, one near infrared band (NIR) and two shortwave infrared bands. Landsat 8 imagery, which offers a new shorter wavelength blue band (ultrablue band), a narrower NIR band and a higher signal-to-noise ratio relative to Landsat 4, 5 and 7 imageries, allows for an improved ability to monitor water quality parameters in inland waters [29,31].
Nearly cloud-free imagery was downloaded from Google Earth Engine (https://code.earthengine.google.com/ (5 August 2021)), an online cloud-based geospatial processing platform dedicated to processing satellite imagery and other Earth observation data [58]. It provides a Landsat 8 OLI surface reflectance dataset that has been atmospherically corrected using the Land Surface Reflectance Code (LaSRC) software [59]. The remote sensing reflectance (Rrs) was obtained by dividing the surface reflectance by π (3.14) [59]. The Landsat 8 imagery data were matched with the measured CDOM, and the remote sensing reflectance was extracted.
The spectral information relating to marked algal bloom and aquatic vegetation was removed from the samples by setting a floating algae index (FAI) threshold (>0.01) [60], as the spectral information from algal bloom and aquatic vegetation severely masked the water light signal, which had an important effect on the training and validation of the algorithms. Therefore, the reflectance data containing marked algal bloom information were automatically removed from the imagery in later CDOM estimation, which reduced the algal spectral noise and improved the algorithm accuracy. In addition, to ensure uniformity of the water surrounding the sampling site, we tested the coefficient of variation (CV) for each band (reflectance in a 3 × 3 window centered on the sampling site, CV < 10%) [47]. Table 1 shows a summary of the specific sampling times and sample information for the measured CDOM absorption coefficient data that were matched to Landsat 8 images. In total, 1708 data samples were matched. The time difference between the matched Landsat 8 and the measured CDOM absorption coefficient was controlled within 16 days, potentially decreasing the accuracy of our models. The matching time was suitably extended because of the long temporal resolution of Landsat 8 OLI and the small amount of matched data and because the lakes of interest were inland lakes with relatively stable CDOM [13].
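The screening steps above (conversion to Rrs, the FAI threshold and the 3 × 3 window uniformity test) might look like the sketch below. The FAI is written in its usual baseline form, NIR reflectance minus a linear red-to-SWIR baseline evaluated at the NIR wavelength. The band-centre wavelengths, the use of the population standard deviation for the CV, and all function names are assumptions made for illustration, not details taken from the study.

```python
import math
import statistics

# Approximate OLI band-centre wavelengths in nm (assumed values for this sketch).
RED_NM, NIR_NM, SWIR_NM = 655.0, 865.0, 1609.0


def to_rrs(surface_reflectance):
    """Remote sensing reflectance: surface reflectance divided by pi."""
    return surface_reflectance / math.pi


def floating_algae_index(red, nir, swir):
    """FAI: NIR reflectance minus the red-to-SWIR linear baseline at the NIR band."""
    baseline = red + (swir - red) * (NIR_NM - RED_NM) / (SWIR_NM - RED_NM)
    return nir - baseline


def coefficient_of_variation(window):
    """CV (%) of the reflectance values in one 3 x 3 window."""
    return 100.0 * statistics.pstdev(window) / statistics.mean(window)


def keep_matchup(red, nir, swir, band_windows, fai_max=0.01, cv_max=10.0):
    """Keep a sample only if it is bloom-free and every band window is uniform."""
    if floating_algae_index(red, nir, swir) > fai_max:
        return False
    return all(coefficient_of_variation(w) < cv_max for w in band_windows)


# Hypothetical open-water pixel with a perfectly uniform 3 x 3 window.
print(keep_matchup(0.03, 0.02, 0.01, [[0.02] * 9]))
```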

Data Preprocessing
To improve the speed at which the gradient descent method obtains the optimal solution, the data used for model training need to be standardized [12]. In this paper, because the maximum and minimum reflectance values are difficult to determine when the models are applied to remote sensing images, the z-score standardization method was used [61,62]. With z-score standardization, the data are standardized to a distribution with a mean of 0 and a variance of 1, and the result is not easily affected by outliers. Each input element $x$ was standardized separately as $x' = (x - m)/\sqrt{v}$, where $m$ and $v$ denote the mean and variance, respectively, of the input element.
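A minimal sketch of z-score standardization follows, assuming that the mean and standard deviation are estimated per band from the training data and then reused for any new pixel. The function names and reflectance values are hypothetical.

```python
import statistics


def fit_zscore(columns):
    """Return (mean, standard deviation) for each input band of the training data."""
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]


def apply_zscore(row, params):
    """Standardize one sample (one value per band) with the fitted parameters."""
    return [(x - m) / s for x, (m, s) in zip(row, params)]


# Hypothetical training reflectances for two bands (one list per band).
train_columns = [[0.010, 0.020, 0.030], [0.10, 0.20, 0.30]]
params = fit_zscore(train_columns)
print(apply_zscore([0.020, 0.30], params))
```

Reusing the training parameters for image pixels avoids having to know the minimum and maximum reflectance of each scene, which is the motivation given above for preferring z-scores to min-max scaling.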

BP Neural Network
Currently, the most commonly used neural network learning approach is the BP neural network algorithm. The algorithm is a multilayer feed-forward network, and the main learning process consists of a forward computation process and an error BP process. It mainly includes three structures: the input layer, the hidden layer and the output layer. Neurons at different levels are interconnected through corresponding weights. The calculation formula can be found in Equation (4) [50].
For the ith neuron, $X_i$ in layer $l$ is the input (the inputs are often the independent variables that are crucial to the system model), $W^{l}_{ij}$ denotes the connection weight from neuron $i$ at layer $l$ to neuron $j$ at layer $l + 1$, which regulates the proportion of the weight of each input quantity, $W^{l}_{bj}$ denotes the bias of neuron $j$ at layer $l$, and $f(\cdot)$ is a nonlinear activation function.
Equation (5) is the error function, which compares the actual output of the model for each sample with the expected output. Error backpropagation updates the network weights by passing the output error (E) backward, layer by layer, through the hidden layers to the input layer and assigning a share of E to the individual neurons of each layer.
In addition, the selection of other parameters, such as the training algorithm, hidden layers and iterations, also impacts the training accuracy of the model. In our case, the tansig activation function and Levenberg-Marquardt algorithm were used, and the hidden layer was set at 29 layers.
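The forward computation and error function described above can be illustrated with a small sketch. It assumes a single hidden layer with a tanh activation (tansig is mathematically the hyperbolic tangent), a linear output neuron, and a half sum-of-squares error. The weights are hypothetical, and the Levenberg-Marquardt weight update used in the study is not reproduced here.

```python
import math


def forward_layer(inputs, weights, biases):
    """Hidden-layer outputs; weights[i][j] links input neuron i to hidden neuron j."""
    outputs = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * weights[i][j] for i, x in enumerate(inputs))
        outputs.append(math.tanh(total))
    return outputs


def predict(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by one linear output neuron."""
    hidden = forward_layer(inputs, w_hidden, b_hidden)
    return b_out + sum(h * w for h, w in zip(hidden, w_out))


def squared_error(expected, actual):
    """Half sum-of-squares error between expected and actual outputs (assumed form)."""
    return 0.5 * sum((e - a) ** 2 for e, a in zip(expected, actual))


# Hypothetical network with two inputs and two hidden neurons.
w_hidden = [[0.5, -0.3], [0.1, 0.8]]
b_hidden = [0.0, 0.1]
w_out = [1.2, -0.7]
b_out = 0.05
estimate = predict([0.2, -0.4], w_hidden, b_hidden, w_out, b_out)
print(estimate, squared_error([1.0], [estimate]))
```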

SVR
The basic idea of SVR is to find a regression hyperplane in a high-dimensional space such that all the data in the set lie as close as possible to that plane [63]. To better solve regression fitting problems, Vapnik et al. [64] introduced an ε-insensitive cost function into the support vector classification framework, which forms the SVR model [48,50]. Given a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, SVR constructs a prediction model $f(x)$ with weights $w$ and bias $b$. Traditional regression algorithms typically calculate the loss from the error between the model output $f(x)$ and $y$, so the loss is zero if and only if $f(x)$ is exactly equal to $y$. In contrast, SVR tolerates an error of at most ε (ε > 0) between $f(x)$ and $y$ and calculates a loss only if the prediction error is larger than ε. Briefly, the weights $w$ and bias $b$ are obtained by minimizing an objective function (Equation (7)), where $l_\varepsilon$ (Equation (8)) denotes the loss function and $C$ denotes the penalization parameter for errors larger than ε. By introducing slack variables $\xi_i$ and $\xi_i^*$, Equation (7) can be transformed into the constrained form of Equation (9). The Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ are then used to create the Lagrange function and address the dual problem, which yields the final regression function in terms of a kernel function $K$. Some typical kernel functions are available, such as the sigmoid kernel and the radial basis function kernel; the latter was used in this paper.
In this SVR model, the software package libSVR_3.24 designed by Prof. Lin Chih-Jen et al. (2021) [65] is used, and the SVR model parameters are selected using cross-validation, where gamma is 1 and cost is 2.
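To make the SVR ingredients above concrete, the sketch below implements the ε-insensitive loss, a radial basis function kernel of the form exp(−γ‖x − z‖²), and prediction from the dual form. It assumes the textbook formulation rather than the exact equations of the paper, and the support vectors, dual coefficients and bias are hypothetical values, not a trained model.

```python
import math


def eps_insensitive_loss(prediction, target, eps):
    """Zero inside the eps-tube around the target, linear outside it."""
    return max(0.0, abs(prediction - target) - eps)


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Dual-form prediction: bias plus a kernel-weighted sum over support vectors.

    Each dual coefficient is the difference of one pair of Lagrange multipliers.
    """
    return bias + sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    )


# Hypothetical model with two support vectors and gamma = 1.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [2.0, -1.0]
print(svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias=10.0, gamma=1.0))
print(eps_insensitive_loss(10.0, 10.3, eps=0.5), eps_insensitive_loss(10.0, 12.0, eps=0.5))
```

Errors inside the ε-tube contribute no loss, which is why only a subset of the training samples ends up as support vectors.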

GPR
As a kernel-based, nonparametric probabilistic algorithm, GPR establishes a functional relationship between input and output elements by using a multivariate joint Gaussian distribution defined by the mean and covariance matrix of the available data. In our GPR implementation, we use the squared exponential kernel function, in which $v$ denotes a scale factor, $B$ denotes the number of input elements, and $\sigma_b$ denotes a dedicated parameter controlling the spread of the relationship for each input element $b$ [50].
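A squared exponential kernel with one length parameter per input element can be sketched as below. The kernel equation is not reproduced in this text, so the form k(x, z) = v exp(−Σ_b (x_b − z_b)² / (2σ_b²)) is an assumption that matches the symbol definitions above. The function name and parameter values are hypothetical.

```python
import math


def squared_exponential_kernel(x, z, scale, lengthscales):
    """Assumed form: v * exp(-sum_b (x_b - z_b)^2 / (2 * sigma_b^2))."""
    exponent = sum(
        (a - b) ** 2 / (2.0 * s ** 2) for a, b, s in zip(x, z, lengthscales)
    )
    return scale * math.exp(-exponent)


# Hypothetical two-band inputs; the second band has a longer length parameter,
# so differences in that band reduce the covariance less.
k_near = squared_exponential_kernel([0.0, 0.0], [1.0, 0.0], scale=2.0, lengthscales=[1.0, 5.0])
k_far = squared_exponential_kernel([0.0, 0.0], [0.0, 1.0], scale=2.0, lengthscales=[1.0, 5.0])
print(k_near, k_far)
```

A large σ_b flattens the kernel along band b, so the fitted length parameters give a rough indication of how much each band contributes to the prediction.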

RFR
RFR is an ensemble learning approach that constructs a large number of mutually independent decision trees during training and outputs the average of all decision tree predictions as the model prediction result [53]. First, resampling (that is, sampling with replacement) is performed using bootstrapping to generate T random training sets $S_1, S_2, \ldots, S_T$. Then, when each decision tree is constructed, a number of attributes are randomly chosen, from which the most suitable attribute is selected for each splitting node. After the random forest is constructed, a test sample X is entered into each decision tree, and the average predicted value of all the decision trees is used as the final prediction result [66]. Because the random selection of samples and attributes reduces the correlation between the decision trees, RFR is less sensitive to outliers and noise than other methods and therefore has better generalizability and accuracy. In our case, the random forest is set up with 100 trees and five leaf nodes.
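The two ensemble steps described above, bootstrap resampling and averaging of the individual predictions, can be sketched without a full tree implementation. In the sketch below each decision tree is replaced by a trivial stand-in learner that predicts the mean target of its bootstrap set, so it illustrates bagging only and not the random attribute selection at the splitting nodes. All names and data values are hypothetical.

```python
import random


def bootstrap_sets(samples, n_sets, seed=0):
    """Draw n_sets training sets by sampling with replacement (bootstrapping)."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in samples] for _ in range(n_sets)]


def mean_learner(training_set):
    """Stand-in for a decision tree: always predicts the mean target of its set."""
    mean_target = sum(y for _, y in training_set) / len(training_set)
    return lambda x: mean_target


def ensemble_predict(learners, x):
    """Average the predictions of all learners, as done for regression forests."""
    predictions = [learner(x) for learner in learners]
    return sum(predictions) / len(predictions)


# Hypothetical (reflectance, a(254)) pairs and an ensemble of 100 learners.
samples = [(0.01, 5.0), (0.02, 9.0), (0.03, 14.0), (0.04, 20.0)]
learners = [mean_learner(s) for s in bootstrap_sets(samples, n_sets=100)]
estimate = ensemble_predict(learners, 0.025)
print(estimate)
```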

Accuracy Assessment
In this paper, the coefficient of determination (R²), mean relative error (MRE), root mean square error (RMSE), and relative RMSE (RRMSE) between the measured and estimated CDOM absorption coefficients were used to evaluate the performances of all four machine learning approaches.
In these metrics, N denotes the number of data points, i denotes the ith data point, and Meas and Esti denote the CDOM absorption coefficient measurements and estimates, respectively.
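The four metrics can be computed as in the sketch below. The metric equations are not reproduced in this text, so the definitions used here are assumptions: R² as the squared Pearson correlation between measurements and estimates, MRE as the mean of |Esti − Meas| / Meas in percent, and RRMSE as the RMSE divided by the mean measurement in percent. The sample values are hypothetical.

```python
import math


def rmse(meas, esti):
    """Root mean square error, in the units of the measurements."""
    return math.sqrt(sum((e - m) ** 2 for m, e in zip(meas, esti)) / len(meas))


def mre(meas, esti):
    """Mean relative error in percent (assumed definition)."""
    return 100.0 * sum(abs(e - m) / m for m, e in zip(meas, esti)) / len(meas)


def rrmse(meas, esti):
    """RMSE relative to the mean measurement, in percent (assumed definition)."""
    return 100.0 * rmse(meas, esti) / (sum(meas) / len(meas))


def r_squared(meas, esti):
    """Squared Pearson correlation between measurements and estimates (assumed)."""
    n = len(meas)
    mean_m, mean_e = sum(meas) / n, sum(esti) / n
    cov = sum((m - mean_m) * (e - mean_e) for m, e in zip(meas, esti))
    var_m = sum((m - mean_m) ** 2 for m in meas)
    var_e = sum((e - mean_e) ** 2 for e in esti)
    return cov * cov / (var_m * var_e)


# Hypothetical measured and estimated a(254) values in m^-1.
meas = [2.0, 4.0, 6.0]
esti = [3.0, 4.0, 5.0]
print(r_squared(meas, esti), mre(meas, esti), rmse(meas, esti), rrmse(meas, esti))
```

In this toy example the estimates are an exact linear function of the measurements, so the squared correlation is 1 even though MRE and RMSE are nonzero, which is one reason to report the error metrics alongside R².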

Experimental Settings
For model training, we first used Pearson correlation analysis to determine the correlation between each OLI band and the CDOM absorption coefficient. Then, the bands with higher correlations were gradually added as inputs to the machine learning models. Validation was most accurate when all seven bands were used as inputs, so in this paper we mainly use bands 1 to 7 of Landsat 8 as the input elements of the algorithms and a(254) as the output element. Although the reflectance at longer wavelengths is generally insensitive to CDOM, practical validation has shown that the performance of the algorithm can be improved by using additional longer wavelengths [36]. Of the 1708 samples, 1300 were randomly selected for model training, and 408 were used for model validation. To ensure comparability between the BP neural network, GPR, RFR, and SVR algorithms, the same training and validation datasets were used for each algorithm. Model training, statistical analysis of the parameters and error analysis were implemented in MATLAB 2019b.
Table 2 and Figure 2 show the training and validation results of the four machine learning algorithms. The BP neural network was the most stable between training (R² = 0.74, MRE = 20% and RMSE = 3.68 m⁻¹) and validation (R² = 0.75, MRE = 22.5% and RMSE = 3.66 m⁻¹). The RFR model had the best fitting accuracy for the training data (R² = 0.87, MRE = 14.7% and RMSE = 2.83 m⁻¹), but it had the lowest fitting accuracy for the validation data (R² = 0.71, MRE = 24.4% and RMSE = 4.00 m⁻¹). Therefore, the stability of the RFR model was lower than that of the other three models. The GPR (R² = 0.74, MRE = 22.2%) and SVR (R² = 0.72, MRE = 22.3%) models were more accurate than the RFR model on the validation data. Overall, the four machine learning algorithms achieved more than 70% accuracy (R² > 0.70) for CDOM absorption coefficient estimation on the available data. Because running the four algorithms on the available CDOM data takes very little time, the runtimes of the different algorithms are not analyzed or discussed further.
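The band ranking and the random 1300/408 split described in this section can be sketched as follows. The study used MATLAB 2019b; this Python version is only an illustration, and the function names, the random seed and the toy band values are assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_bands(bands, target):
    """Order band names by the absolute correlation of each band with the target."""
    return sorted(bands, key=lambda name: abs(pearson_r(bands[name], target)), reverse=True)


def split_indices(n_samples, n_train, seed=0):
    """Randomly split sample indices into training and validation subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# Toy band values and target; the real inputs would be OLI bands 1 to 7 and a(254).
bands = {"B1": [1.0, 2.0, 3.0, 4.0], "B2": [1.0, 3.0, 2.0, 4.0]}
target = [2.0, 4.0, 6.0, 8.0]
print(rank_bands(bands, target))

train_idx, valid_idx = split_indices(1708, 1300)
print(len(train_idx), len(valid_idx))
```

Fixing the seed keeps the split identical for every algorithm, which mirrors the requirement above that all four models share the same training and validation datasets.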


Model Application for Lakes with Different Trophic States
To objectively validate the accuracy of the machine learning algorithms, the trained models were applied to Landsat 8 OLI images of lakes with different trophic states, taking Lake Taihu (eutrophic state) and Lake Qiandaohu (oligo-mesotrophic state) as examples. Images of Lake Qiandaohu (9 August 2018) and Lake Taihu (30 July 2017) were selected for CDOM estimation and compared with the CDOM absorption coefficients measured in the same period (within 3 days) to visually reflect the CDOM estimation capability of the different algorithms (Figures 3-6).


All four machine learning algorithms performed relatively well in the Lake Qiandaohu image application, with generally similar estimation results (Figures 3 and 4). For the whole lake, CDOM tended to decrease from northwest to southeast. Comparison with the measured CDOM of Lake Qiandaohu showed that the R² of all four models was higher than 0.72, with the highest accuracy obtained from the GPR model (R² = 0.83) (Figure 4). However, CDOM was overestimated at the upstream boundary sampling points (e.g., 10-12#). The RFR model clearly overestimated CDOM when the measured value was less than 5.7 m⁻¹ and clearly underestimated it when the measured value was greater than 5.7 m⁻¹.
Lake Taihu had a higher level of eutrophication and a higher CDOM absorption coefficient than Lake Qiandaohu. The imagery was preprocessed with the FAI to automatically remove the algal bloom and aquatic vegetation information. The estimation results of the four machine learning models were generally consistent in spatial distribution, showing high values in the northwest and low values in the southeast (Figure 5). A comparison with the measured CDOM in Lake Taihu shows that the BP neural network had the lowest accuracy (R² = 0.48), while the GPR and SVR models had better accuracy (R² > 0.73) (Figure 6). The RFR model behaved as it did at Lake Qiandaohu, overestimating CDOM when the measured value was less than 20 m⁻¹ and underestimating it when the measured value was greater than 20 m⁻¹. From the imagery, the BP neural network overestimated CDOM in the northwestern part of Lake Taihu relative to the other three models.
Therefore, GPR had higher stability and estimation accuracy than the other three models after comprehensively considering model training, validation and the application to Landsat 8 OLI images (Table 2 and Figures 4 and 6).
A comparison of the measured CDOM of the two lakes with the estimates of the four machine learning algorithms shows that the estimation accuracy for eutrophic Lake Taihu was lower than that for oligo-mesotrophic Lake Qiandaohu. Some of the estimates were higher than the measured CDOM, which may be due to the complexity of the Lake Qiandaohu boundary, where mixed pixels containing shore vegetation result in high estimated CDOM absorption coefficients. In contrast, the estimates of the GPR and RFR models were slightly lower than the measured values for high CDOM (>20 m⁻¹), which could be related to sampling sites contaminated by algal blooms or to the small number of samples with high CDOM values in the training data. These results are also consistent with the behavior of the RFR model, in which low values were overestimated and high values were underestimated.


Advantages of Machine Learning Algorithms
To better compare the merits of the models, we trained the commonly used empirical models (including ratio models, normalized models, binary primary polynomial models, quadratic models, etc.) based on different combinations of bands and selected the binary primary polynomial, normalized and ratio models with the better accuracies for comparison (see Figure 7). The most accurate empirical algorithm of each type was also validated, and the results are shown in Figure 8. The three empirical models performed similarly, with validation accuracies of R² = 0.54, R² = 0.51 and R² = 0.56, all of which were much lower than the estimation accuracies of the machine learning models. The empirical models underestimated all values of CDOM above 20 m⁻¹. Therefore, the machine learning models (R² > 0.71) outperformed the empirical models (Figure 8). Although it proved difficult to retrieve CDOM in different waters, the machine learning models offered the potential for reasonable estimates of CDOM without considering the geographical location and optical complexity of the lakes.
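For reference, the empirical model families named above are built from simple band features. The sketch below shows a band ratio, a normalized difference, and a first-degree polynomial in two bands (our reading of "binary primary polynomial"), followed by a closed-form least-squares line fit on a single feature. The band combinations and coefficients used in the study are not given in this text, so every name and number here is hypothetical.

```python
def band_ratio(b_i, b_j):
    """Ratio feature of two band reflectances."""
    return b_i / b_j


def normalized_difference(b_i, b_j):
    """Normalized difference feature of two band reflectances."""
    return (b_i - b_j) / (b_i + b_j)


def two_band_linear(b_i, b_j, coef_i, coef_j, intercept):
    """First-degree polynomial in two bands."""
    return coef_i * b_i + coef_j * b_j + intercept


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Hypothetical ratio features and measured a(254) values in m^-1.
ratios = [1.0, 2.0, 3.0]
a254 = [3.0, 5.0, 7.0]
slope, intercept = fit_line(ratios, a254)
print(slope, intercept)
```

Models of this kind have only a few free parameters, which is consistent with the observation above that they cannot follow the high CDOM values as well as the machine learning models.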
Remote Sens. 2021, 13, x FOR PEER REVIEW 13 of 17


Estimation Accuracy of the CDOM for Lakes with Different Trophic States
The estimation accuracies of CDOM in oligo-mesotrophic lakes were better than those in eutrophic lakes, perhaps because relatively low concentrations of optically active parameters and less interference from autochthonous and allochthonous sources of CDOM lead to better CDOM estimation performance in lakes with simple aquatic environments [16]. The concentration of optically active parameters (e.g., Chl-a and TSM) can also affect the optical properties of natural waters [13,16]. For example, variations in Chl-a and TSM concentrations may influence the scattering and absorption characteristics of waters and the underwater light field [67]. In eutrophic lakes, the optical signal is more complex because it is affected by CDOM, TSM and Chl-a together. CDOM-rich lakes absorb strongly in the blue-light band, so their water-leaving optical signals are small, and CDOM has no distinctive reflectance signal. Water color parameters with specific reflectance signals, such as TSM and Chl-a, dominate the reflectance spectra of complex water bodies, and the interaction between these parameters also directly affects the accuracy of CDOM estimation by remote sensing [13,15,68]. In addition, the reflectance spectrum of CDOM is very similar to that of Chl-a in the blue region, which makes their spectral characteristics hard to separate [1,16]. Methods for separating the spectral characteristics of Chl-a, CDOM and TSM are not yet well established, so CDOM estimation in complex optical environments remains challenging [3].

Application of Machine Learning Modelling of Landsat Data
Based on the model comparison results, the machine learning models performed better than the empirical models for Landsat 8 OLI data. Machine learning models can therefore be extended to Landsat series data for long-term CDOM monitoring. Although these sensors were not developed for inland water monitoring, the long data series they provide is attractive [69]. However, when applied to inland waters, the current Landsat series data are limited in their temporal, spatial and spectral resolutions for CDOM estimation, especially in eutrophic lakes, where various optical signals interact with each other and lower the CDOM estimation accuracy [70]. In addition, although we collected a large amount of CDOM data from lakes with different trophic states, covering the complex optical properties of waters, to improve the extended application and portability of the models, the machine learning models still have limitations. The performance of a machine learning algorithm is constrained by the characteristics of its training data, so accuracy may be reduced when the models are extended to other regions [47]. Nevertheless, this extensive study has identified the potential of machine learning algorithms for the remote sensing estimation of CDOM. Subsequent research could both experiment with remote sensing imagery of better temporal, spatial and spectral resolution to improve CDOM estimation accuracy in eutrophic lakes, and continue to explore the potential of other machine learning algorithms for monitoring inland waters.
However, compared with TSM and Chl-a, which have very strong optical signals and high estimation precision [14–16], CDOM remains challenging to estimate accurately by satellite, even though we collected a large and extensive CDOM dataset [3]. There are two main reasons. Firstly, the optical signal of CDOM is very low, generally lower than that of phytoplankton and nonphytoplankton particles in inland waters [71]; CDOM only absorbs and does not scatter, so its variation contributes little to reflectance [13]. Secondly, current satellite sensors have no spectral channels dedicated to CDOM, whereas the MODIS, Sentinel and Landsat satellites have red and green spectral channels for chlorophyll, suspended matter, etc. In addition, CDOM mainly absorbs ultraviolet wavelengths below 400 nm, and none of the current satellite sensors has an ultraviolet channel [1,71]. Therefore, given the characteristics of the CDOM absorption spectra, precise CDOM estimation requires suitable mathematical models to be developed further in combination with the available satellite data.
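The weakness of the CDOM signal in the visible bands can be illustrated with the exponential spectral model commonly used to describe CDOM absorption, a(λ) = a(λ0) exp(−S(λ − λ0)). The sketch below is illustrative only and is not taken from this study: the spectral slope S = 0.015 nm−1 is an assumed value, applying a single slope from 254 nm across the visible range is a simplification, and the Landsat 8 OLI band centers are approximate.

```python
import numpy as np

def cdom_absorption(wavelength_nm, a_ref, ref_nm=254.0, slope_per_nm=0.015):
    """Exponential CDOM absorption model (slope value is an assumption).

    a(lambda) = a(ref) * exp(-S * (lambda - ref)), returned in m^-1.
    """
    wavelength_nm = np.asarray(wavelength_nm, dtype=float)
    return a_ref * np.exp(-slope_per_nm * (wavelength_nm - ref_nm))

# Approximate Landsat 8 OLI band centers in nm (rounded values).
oli_band_centers_nm = {"coastal": 443.0, "blue": 482.0, "green": 561.0, "red": 655.0}

# Range of a(254) covered by the dataset described in this paper.
for a254 in (2.64, 34.04):
    for band, center in oli_band_centers_nm.items():
        a_band = float(cdom_absorption(center, a254))
        print(f"a(254) = {a254:5.2f} m^-1 -> a({center:.0f} nm, {band}) = "
              f"{a_band:.3f} m^-1 ({100.0 * a_band / a254:.2f}% of a(254))")
```

Under these assumptions, only a small fraction of the absorption at 254 nm remains at the OLI visible bands, which is consistent with the argument that sensors without ultraviolet channels capture CDOM variation only weakly.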

Conclusions
In this study, four different machine learning algorithms were trained and validated using a large CDOM absorption coefficient dataset, covering a(254) from 2.64 to 34.04 m−1, collected from lakes with different trophic states between 2013 and 2021. The results show that the machine learning algorithms achieved more than 70% accuracy for CDOM estimation on the available data. When the trained models were applied to lakes with different trophic states, the CDOM estimation accuracy for oligo-mesotrophic lakes was higher than that for eutrophic lakes, owing to the greater optical complexity of eutrophic lakes. However, machine learning models still have great potential in eutrophic lakes; for example, the SVR model achieved an R2 of 0.76 for Lake Taihu. The spatial distributions of CDOM estimated for Lake Taihu and Lake Qiandaohu showed an overall decreasing trend from the upstream to the downstream lakes, in line with the spatial variation in the measured CDOM. Therefore, machine learning algorithms can contribute to CDOM estimation in inland waters and have wide applications in water resource management.