Estimating Coastal Chlorophyll-A Concentration from Time-Series OLCI Data Based on Machine Learning

Abstract: Chlorophyll-a (chl-a) is an important parameter of water quality and its concentration can be directly retrieved from satellite observations. The Ocean and Land Color Instrument (OLCI), a new-generation water-color sensor onboard Sentinel-3A and Sentinel-3B, is an excellent tool for marine environmental monitoring. In this study, we introduce a new machine learning model, Light Gradient Boosting Machine (LightGBM), for estimating time-series chl-a concentration in Fujian's coastal waters using multitemporal OLCI data and in situ data. We applied the Case 2 Regional CoastColour (C2RCC) processor to obtain OLCI band reflectance and constructed four spectral indices based on OLCI feature bands as supplementary input features. We also used root-mean-square error (RMSE), mean absolute error (MAE), median absolute percentage error (MAPE), and R² as performance indicators. The results indicate that the addition of spectral indices improves the prediction accuracy of the model, and the normalized fluorescence height index (NFHI) performs best, with an RMSE of 0.38 µg/L, MAE of 0.22 µg/L, MAPE of 28.33%, and R² of 0.785. Moreover, we used the well-known band ratio and three-band methods for chl-a estimation validation, and two OLCI chl-a products were adopted for comparison (OC4Me chl-a and Inverse Modelling Technique (IMT) Neural Net chl-a). The results confirmed that the LightGBM model outperforms the traditional methods and OLCI chl-a products. This study provides an effective remote sensing technique for coastal chl-a concentration estimation and promotes the advantage of OLCI data in ocean color remote sensing.


Introduction
In the coastal regions, due to the impacts of climate change and intensive human activities, such as rainfall, sewage discharge, and overfishing, eutrophic and polluted water bodies are imported into coastal waters through surface runoff, thus threatening the already-deteriorating coastal water quality [1]. Chlorophyll-a (chl-a) is the main pigment in phytoplankton for photosynthesis and is regarded as a proxy for biomass in water [2,3]. Appropriate biomass is important for maintaining the balance of a healthy aquatic ecosystem. Therefore, chl-a has been used as a key indicator for evaluating water quality including eutrophication [4,5]. Monitoring chl-a concentration is a significant issue in coastal water management.
Ocean color remote sensing technology has been used as a highly efficient means of estimating chl-a concentration owing to its advantages, such as large-scale and real-time observations. It can also be used to track and reveal the spatiotemporal dynamic process of the water quality [6]. The coastal zone color scanner, the first ocean color sensor carried on the Nimbus-7 satellite launched by the National Aeronautics and Space Administration in 1978, started studying the ocean chl-a concentration [7] and demonstrated the feasibility of

Remote Sens. 2021, 13, 576
based model in order to estimate the chl-a concentration of the coastal waters of Fujian (China). The model was applied to time-series OLCI images to map the spatial distribution of chl-a concentration and then analyze the spatial and temporal distribution characteristics of chl-a concentration in Fujian's coastal waters.

Study Area and In Situ Data
Fujian Province is located on the southeast coast of China, between 23°33′-28°20′N and 117°30′-120°40′E. Fujian has a long, winding coastline of approximately 3752 km that hosts numerous bays and harbors. These favorable geographical conditions have promoted the development of aquaculture and maritime transport. Fujian is close to the Tropic of Cancer and has a typical subtropical monsoon climate that is warm and humid. Abundant rainfall feeds a dense water system with many rivers of different sizes, and freshwater and terrigenous materials are carried into the ocean through surface runoff. Aquaculture and residential areas are concentrated along the coast. Some eutrophic and polluted water is discharged into the coastal estuaries, worsening the quality of the coastal water through gradual eutrophication and frequent red tide events. Thus, the aquatic ecosystem is under great environmental pressure.
To achieve effective assessment and management of the coastal water environment, Fujian has set up a batch of ecological buoy observation stations along the coast to monitor chl-a concentration. The buoy data are updated every 30 min. Figure 1 presents the spatial distribution of the stations, and further detailed information is available in the Fujian Marine Forecasts (http://www.fjhyyb.cn/Ocean863Web_MAIN/) (accessed on January 10, 2017). In this study, we selected the chl-a data that coincide with the satellite imaging time (approximately 10 a.m. every day); the time series runs from May 2017 to May 2020.

Satellite Data and Preprocessing
OLCI, a new-generation ocean water-color sensor onboard the Sentinel-3A and Sentinel-3B satellites, was designed for imaging water systems [39]. It has 21 spectral bands spanning visible to NIR wavelengths (400-1020 nm), including 16 water-color bands, at a spatial resolution of 300 m. Table 1 presents the band setting of OLCI. Its high signal-to-noise ratio, spectral resolution, and temporal resolution provide accurate, comprehensive, and rich spectral information on chl-a in the optically complex coastal waters of Fujian. In this study, we obtained Sentinel-3 OLCI full-resolution L1 data from the European Space Agency data hub (https://scihub.copernicus.eu/) (accessed on 16 January 2017). During atmospheric radiative transfer, the ground target signal received by the sensor is affected by atmospheric interference. Atmospheric correction of the OLCI images is therefore an essential prerequisite for quantitative remote sensing inversion: it weakens or eliminates the atmospheric influence on the images and yields the true remote sensing reflectance (Rrs) of water pixels. The Case 2 Regional CoastColour (C2RCC) processor is applicable to OLCI images and to other ocean water-color sensors (such as S2-MSI, Landsat-8, MERIS, and MODIS). C2RCC is built on a large database of radiative transfer simulations; neural networks trained on this database invert the signal measured at the top of the atmosphere by the satellite sensor into water optical properties, and its performance has been validated in various studies [40-43]. Therefore, we applied the C2RCC processor for atmospheric correction to obtain Rrs from the OLCI images.
Then, using the longitude and latitude coordinates of each observation station, we located the station on the corresponding pixel of the OLCI image and extracted the Rrs values of the OLCI bands at that pixel. Pixels covered by clouds were discarded. Finally, 602 Rrs-chl-a pairs formed the data set for chl-a modeling and verification.
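The match-up step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' processing chain: the function names, the tiny 2 x 2 grid, the station tuples, and the nearest-pixel rule are all hypothetical, and a real workflow would read the geolocation and cloud flags from the C2RCC output.

```python
import math

def nearest_pixel(lat, lon, grid_lats, grid_lons):
    """Return (row, col) of the grid cell whose centre is closest to (lat, lon)."""
    best, best_d = None, math.inf
    for r, (lat_row, lon_row) in enumerate(zip(grid_lats, grid_lons)):
        for c, (plat, plon) in enumerate(zip(lat_row, lon_row)):
            # Scale the longitude offset by cos(lat) so both offsets are comparable.
            d = (plat - lat) ** 2 + ((plon - lon) * math.cos(math.radians(lat))) ** 2
            if d < best_d:
                best, best_d = (r, c), d
    return best

def build_matchups(stations, grid_lats, grid_lons, rrs, cloud_mask):
    """Pair each station's chl-a with the Rrs spectrum of its pixel; skip cloudy pixels."""
    pairs = []
    for name, lat, lon, chla in stations:
        r, c = nearest_pixel(lat, lon, grid_lats, grid_lons)
        if cloud_mask[r][c]:
            continue  # cloud-covered pixel: the match-up is discarded
        pairs.append((name, rrs[r][c], chla))
    return pairs

# Hypothetical 2 x 2 scene: pixel-centre coordinates, one Rrs spectrum per pixel, cloud flags.
grid_lats = [[26.0, 26.0], [25.9, 25.9]]
grid_lons = [[119.0, 119.1], [119.0, 119.1]]
rrs = [[[0.004, 0.005], [0.003, 0.004]], [[0.006, 0.007], [0.002, 0.003]]]
cloud_mask = [[False, True], [False, False]]
stations = [("A", 25.99, 119.01, 1.2), ("B", 26.0, 119.09, 0.8), ("C", 25.91, 119.1, 2.0)]

pairs = build_matchups(stations, grid_lats, grid_lons, rrs, cloud_mask)
```

Station B falls on the cloud-flagged pixel and is dropped, so only A and C contribute match-ups in this toy scene.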

LightGBM
Gradient boosting is a powerful machine learning algorithm. It achieves state-of-the-art results in various research tasks, such as weather forecasting, search ranking, and numerical dynamic simulation [35]. In essence, gradient boosting constructs a strong ensemble predictor from several weak learners by performing gradient descent in function space. In theory, many different learning algorithms can serve as the base learner, but the decision tree is the usual choice because it deals flexibly with all kinds of data (including continuous, discrete, and missing values), does not require feature normalization, and has good interpretability. The combination of gradient boosting and multiple decision trees forms the well-known gradient boosting decision tree (GBDT) algorithm [44].
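To make the idea of gradient descent in function space concrete, the following sketch boosts one-split regression trees (stumps) on a single feature under squared-error loss. It is a teaching example only: the data are invented, and the function names `fit_stump` and `gradient_boost` are hypothetical rather than part of any library.

```python
def fit_stump(x, residuals):
    """Least-squares regression stump (one threshold) on a 1-D feature."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds between distinct values
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Additive model: mean of y plus a shrunken sum of stumps fitted to residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For the loss 0.5 * (y - F)^2 the negative gradient is the residual y - F.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Invented data with a step between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(x, y)

mean_y = sum(y) / len(y)
mse_base = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_model = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

Each round fits a weak learner to the current residuals and adds a small fraction of it to the ensemble, so the training error of `model` falls well below that of the constant baseline.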
LightGBM was proposed by Microsoft Research Asia in January 2017 as an advanced version of the GBDT algorithm. It grows decision trees with a leaf-wise strategy under a depth constraint, which is much more efficient than the level-wise strategy commonly employed in GBDT algorithms. Leaf-wise growth selects, from all current leaves, the leaf with the largest split gain, splits it, and then repeats the same operation. Therefore, compared with level-wise GBDT, which splits all leaves in the current layer simultaneously, LightGBM has a lower computational cost and better accuracy for the same number of splits. Meanwhile, the maximum depth constraint helps prevent overfitting. In addition, LightGBM adopts a histogram algorithm that traverses histogram bins instead of individual samples when searching for splits, which significantly reduces the time complexity. Compared with existing gradient boosting algorithms, such as extreme gradient boosting [36], LightGBM has the advantages of high speed and strong robustness. As an ensemble learning algorithm, it combines multiple sub-learners according to a certain strategy to create a new strong learner, and can therefore achieve higher efficiency and better generalization performance than other machine learning algorithms. Compared with neural network algorithms, it is also better suited to small-sample modeling. Therefore, this study uses this method to estimate chl-a concentration in Fujian's coastal waters.
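The histogram idea can be illustrated with a simplified split search that accumulates per-bin statistics once and then scans the bins instead of the samples. This is a sketch under assumptions, not LightGBM's implementation: it uses equal-width bins, a plain least-squares gain without Hessians or regularization, and the hypothetical function name `histogram_best_split`.

```python
def histogram_best_split(values, gradients, n_bins=8):
    """Find the bin edge with the largest least-squares gain by scanning a histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):  # one pass over the samples
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    best_gain, best_edge = 0.0, None
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):  # scan over bins, not samples
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_edge = gain, lo + (b + 1) * width
    return best_edge, best_gain

# Invented feature values and gradients with a clear change between 3 and 4.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
edge, gain = histogram_best_split(values, gradients, n_bins=4)
```

After the single binning pass, the cost of evaluating candidate splits depends on the number of bins rather than the number of samples, which is the source of the speed-up described above.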

Input Variables
Among empirical statistical methods, many studies have proposed spectral indices or spectral feature methods for estimating chl-a concentration from various satellite images. OLCI provides rich spectral signatures, but not all of them respond to chl-a. Input variable selection is therefore critical to model building: it eliminates redundant and interfering variables, reduces the dimensionality of the data, and selects the most important features, which helps improve the accuracy and stability of the models.
Previous studies have demonstrated that spectral indices, which are built as band combinations, enhance the chlorophyll optical signal more than a single spectral band does. Thus, in this study, we first chose B3-B12 of OLCI as spectral band variables; these cover the main wavelength range (442.5-753.75 nm) carrying chl-a spectral information. The OLCI 412.5 nm band was excluded because of the influence of CDOM absorption. Moreover, the 778.75-1020 nm range was not considered because its bands are sensitive to suspended sediment in high-turbidity waters. Then, we constructed four kinds of spectral indices from these spectral bands: band ratio index (BRI), NDCI, NFHI, and three-band index (TBI). Spectral bands and indices together constitute the input variable feature set. The details are presented in Table 2.
BRI, based on NIR/Red, has frequently been used to estimate chl-a concentration in coastal waters because chl-a produces a reflection peak in the NIR and absorption in the red. Published examples include NIR (716 nm)/Red (667 nm) [10] and NIR (708 nm)/Red (665 nm) [11]; the expression is shown in Equation (1). In this study, B8 and B9 are denoted by λ1, and B11 is denoted by λ2.
Mishra first proposed NDCI to estimate chl-a concentration in optically complex, turbid, productive coastal waters by normalizing the difference between Rrs at 708 and 665 nm by their sum on MERIS images [13]. NDCI was also developed on the basis of the absorption and reflection characteristics of chl-a; the expression is shown in Equation (2). In this study, the choices of λ1 and λ2 are the same as for BRI.
NFHI, a fluorescence remote sensing algorithm, normalizes the fluorescence peak by the reflection peak at 560 nm or the absorption peak at 675 nm. OLCI B10, whose central wavelength is 681.25 nm, is the closest to the true fluorescence peak (683 nm) among all water-color sensors; the expression is shown in Equation (3). Therefore, B10 is denoted by λ1, and B6 and B9 are denoted by λ2.
TBI, a semiempirical model based on the bio-optical model, can effectively avoid the influence of CDOM and suspended solids in turbid waters, and it has a certain physical meaning and good portability. This index involves three bands, as presented in Equation (4). In this study, B8 and B9 are denoted by λ1, B11 by λ2, and B12 by λ3.
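Equations (1)-(4) are not reproduced in this text, so the sketch below uses the forms in which these indices commonly appear in the literature: the NIR/red ratio, the normalized difference of the same two bands, and the three-band form built from reciprocal reflectances. These forms are assumptions and may differ in detail from the paper's own equations; NFHI is left out because its exact expression cannot be recovered here. The Rrs values are invented for illustration.

```python
def bri(rrs_red, rrs_nir):
    """Band ratio index: NIR reflectance divided by red reflectance."""
    return rrs_nir / rrs_red

def ndci(rrs_red, rrs_nir):
    """Normalized difference chlorophyll index: (NIR - red) / (NIR + red)."""
    return (rrs_nir - rrs_red) / (rrs_nir + rrs_red)

def tbi(rrs_l1, rrs_l2, rrs_l3):
    """Three-band index in its common form: (1/Rrs(l1) - 1/Rrs(l2)) * Rrs(l3)."""
    return (1.0 / rrs_l1 - 1.0 / rrs_l2) * rrs_l3

# Hypothetical Rrs values (sr^-1) for OLCI B8 (665 nm), B11 (708.75 nm), B12 (753.75 nm).
rrs_b8, rrs_b11, rrs_b12 = 0.004, 0.005, 0.001

bri_value = bri(rrs_b8, rrs_b11)             # red = B8, NIR = B11
ndci_value = ndci(rrs_b8, rrs_b11)
tbi_value = tbi(rrs_b8, rrs_b11, rrs_b12)    # l1 = B8, l2 = B11, l3 = B12
```

Replacing B8 with B9 (673.75 nm) gives the second variant of each index described in the text.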

Experimental Design
The seven cases shown in Table 3 were designed for a comparative study; the input features of every case include the spectral bands as a common basis for comparison. The comparison has two purposes: to evaluate the influence of the different spectral index variables on model prediction, and to find the spectral index that most improves the estimation accuracy of chl-a concentration. The data set was randomly divided into a training set (70%) and a testing set (30%) to train the model and verify its performance, respectively. Several parameters of the LightGBM model are important in the modeling process: n_estimators is the number of base learners, learning_rate controls the convergence rate of the model, and num_leaves adjusts the model complexity. In this study, a grid-search method was employed to determine the optimal parameters of the LightGBM model. First, a high learning_rate (approximately 0.5) was selected to speed up convergence during tuning, and n_estimators was then increased. Usually, as n_estimators increases, the regression error gradually decreases and then remains stable. This study tested the range of 50-1000 (interval of 50). Figure 2 shows that the error of the model tends to be stable when n_estimators = 400. Thus, we set the number of regression trees to 400. Next, we tuned num_leaves in the same way; its value should not be set too large, otherwise overfitting may occur. This study tested the range of 5-60 (interval of 5). The result revealed that num_leaves = 40 is appropriate. Finally, the learning rate was reduced to improve the model's performance; the best learning_rate in this study was 0.05. The other parameters were set as lambda_L1 = 1, lambda_L2 = 3, and boosting_type = gbdt. Following a series of experiments, we found that the model was very robust to changes in various hyperparameters.
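The tuning order described above (n_estimators first, then num_leaves, then learning_rate) can be sketched as a one-parameter-at-a-time search over the stated grids. Several details are assumptions made only for illustration: the `evaluate` callable stands in for training a model and returning its validation error, the starting values of n_estimators and num_leaves are placeholders, the learning-rate candidates are invented, and the sketch simply takes the minimum error, whereas the study chose the point at which the error curve stabilized.

```python
def staged_search(evaluate):
    """Tune one hyperparameter at a time; evaluate(params) returns an error to minimize."""
    params = {
        "learning_rate": 0.5,   # high initial rate, as in the text
        "n_estimators": 100,    # placeholder starting value (assumption)
        "num_leaves": 31,       # placeholder starting value (assumption)
        "lambda_l1": 1,
        "lambda_l2": 3,
        "boosting_type": "gbdt",
    }
    grids = [
        ("n_estimators", range(50, 1001, 50)),             # 50-1000, interval of 50
        ("num_leaves", range(5, 61, 5)),                   # 5-60, interval of 5
        ("learning_rate", [0.5, 0.2, 0.1, 0.05, 0.01]),    # candidate list is an assumption
    ]
    for name, candidates in grids:
        params[name] = min(candidates, key=lambda v: evaluate({**params, name: v}))
    return params

def toy_error(p):
    """Stand-in for a validation error; smallest at the values reported in the text."""
    return (abs(p["n_estimators"] - 400) / 1000
            + abs(p["num_leaves"] - 40) / 100
            + abs(p["learning_rate"] - 0.05))

best = staged_search(toy_error)
```

In practice `evaluate` would fit a LightGBM regressor with the given parameters on the training set and return, for example, a validation RMSE; the toy function here only demonstrates the control flow.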
Table 4 shows the description of the important parameters of the LightGBM model and the optimal values after tuning.
[Table 4: only the boosting-type row survives in this text: candidate values gbdt, rf, dart, goss; selected value gbdt.]
In the verification process, we used the independent testing set (30%) to validate the performance of the model by comparing the model-estimated results with the in situ values. Here, we used root-mean-square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and R² as performance measures. The calculations of the performance indicators are shown below.
where $y_i$ is the in situ chl-a concentration, $\hat{y}_i$ is the chl-a concentration estimated by the LightGBM model, and $n$ is the number of samples in the testing set.
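The equations for the four indicators did not survive in this text, so the sketch below uses their standard definitions, which is an assumption about the paper's exact formulas. Because MAPE is described as a median in the abstract and as a mean in this section, both variants are computed. R² is taken as one minus the ratio of the residual to the total sum of squares; a squared correlation coefficient would be another possible reading. The sample values are invented.

```python
import math
from statistics import median

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (mean and median variants, in %), and R^2."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    # Absolute percentage errors; the in situ values must be non-zero.
    ape = [abs(e) / abs(t) * 100.0 for e, t in zip(errors, y_true)]
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE_mean": sum(ape) / n,
        "MAPE_median": median(ape),
        "R2": 1.0 - ss_res / ss_tot,
    }

# Invented in situ and estimated chl-a values (µg/L).
metrics = regression_metrics([1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
```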

Optimal Input Feature Variables
Improving the prediction accuracy is critical for chlorophyll inversion, and it is closely related to the choice of input feature variables. Therefore, on the basis of the LightGBM algorithm, the prediction accuracy of the seven feature combinations was evaluated using the testing set (181 pairs of OLCI remote sensing reflectance and in situ chl-a values). The correlations between the in situ observed values and the values predicted by the LightGBM model for each case are presented in Figure 3, and the performance indicators calculated for the seven cases are presented in Table 5.
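The size of the testing set follows from the 70/30 random split of the 602 match-up pairs. The sketch below is a minimal illustration: the function name and the seed are hypothetical, and rounding the test share up is an assumption that happens to reproduce the 181 testing pairs reported here.

```python
import math
import random

def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle sample indices and split them into training and testing subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)             # reproducible shuffle; seed is arbitrary
    n_test = math.ceil(n_samples * test_fraction)    # 602 * 0.3 = 180.6, rounded up to 181
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(602)
```

With 602 pairs this yields 421 training and 181 testing samples, and no index appears in both subsets.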
The scatterplots, as shown in Figure 3, whereas Case 3 added NDCI. Both used B11, B9, and B8 in the form of band ratios to eliminate uncertainties in the estimation of water remote sensing reflectance arising from atmospheric correction, seasonal solar azimuth differences, and other factors. Among the cases with an added index, Case 5 performs relatively poorly, with an RMSE of 0.452 µg/L, MAE of 0.262 µg/L, MAPE of 31.52%, and R² of 0.715, but it still outperforms Case 1, indicating that the TBI is also a positive factor for chl-a concentration estimation.
The above results demonstrate that the addition of spectral indices, including BRI, NDCI, NFHI, and TBI, improves the prediction accuracy of chl-a. These spectral indices are constructed as band combinations of the chl-a sensitive bands to enhance the chl-a remote sensing signal. Among them, NFHI is the most beneficial spectral index for improving the chl-a prediction accuracy in Fujian's coastal waters. However, when all spectral indices were added, giving 18 input variables (Case 6), the result did not show the best prediction accuracy. These indices serve solely as factors for establishing the regression relationship for chl-a prediction. Therefore, although each of them has a positive effect on chl-a prediction, adding them all to the model is not advisable. Furthermore, too many variable dimensions may lead to model instability and complexity. The feature variables in Case 4 were therefore chosen for the subsequent analysis.

Mapping Chl-a Concentration from the OLCI Images
The coastal areas of Fujian are prone to severe red tide events. Notably, a high incidence of red tide is observed from April to June each year, according to information released by the Fujian Ocean and Fisheries Bureau (http://hyyyj.fujian.gov.cn/) (accessed on 1 April 2020). Hence, to better understand the temporal and spatial variations of chl-a concentration in Fujian's coastal waters and to evaluate the applicability of the LightGBM-based model, we applied the model to OLCI images to map the spatial distribution of chl-a concentration from April to June 2020. We selected 12 nearly cloudless OLCI images of Fujian's coastal waters and used the LightGBM-Case 4 method to obtain a time series of estimated chl-a concentration in order to track its variation. Figure 4 presents the estimated results. Figure 4 clearly shows that the spatial distribution of chl-a concentration in coastal waters changed daily. Overall, on 8-10 April 2020, the spatial distributions of chl-a concentration were similar, with no evident variation. The chl-a concentration was mainly between 0.5 and 2 µg/L in coastal waters and less than 0.5 µg/L in the open ocean. On 13 April, the chl-a concentration increased as a whole, while the overall spatial pattern remained mostly the same as before.
On 16 and 17 April, a significant difference in the spatial variation of chl-a concentration was observed. The chl-a concentration decreased and was mainly between 0.5 and 1.0 µg/L in most coastal water areas. It should be noted that in the sea near Pingtan Island (Fuzhou), chl-a concentration changed significantly: the values are mainly greater than 1.0 µg/L, whereas in the previous four phases they were less than 0.5 µg/L. The chl-a concentration on 17 April was higher than on 16 April, and the area of variation was also larger. The results for May and June also show significant temporal and spatial variations of chlorophyll concentration.
Human activity is the main factor causing changes in chl-a concentration, but natural factors, such as the temperature and salinity of the sea water and changes in wind direction and ocean currents, also contribute. According to synoptic data, in mid-April the coastal weather in Fujian was cloudy to overcast, the water temperature rose, and the wind and waves were relatively small. These conditions are conducive to phytoplankton proliferation and aggregation, and thus the coastal waters of Fujian entered the peak of the red tide period. At the same time, the government focused on this serious issue and conducted intensified monitoring. The field survey results revealed that Thalassiosira subtilis and Skeletonema costatum appeared in the waters in April and May. Our results also provide comprehensive spatial and temporal information, as satellite monitoring has a large advantage in both time and space for large-scale and real-time monitoring.

Spatiotemporal Distribution Analysis
To further understand the characteristics of spatial and temporal distribution in Fujian's coastal waters, we applied the LightGBM-based model to the OLCI images from 2017 to 2019 to generate time-series chl-a concentration. The annual and monthly averages were calculated on the basis of these time-series chl-a concentration values. Figures 5 and 6 present the spatiotemporal distribution.
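The averaging step can be sketched for a single pixel as follows. This is an illustration under assumptions rather than the authors' procedure: the record format, the function name `mean_by`, and the treatment of cloudy dates as missing values are hypothetical, and the monthly average here pools the same calendar month across years.

```python
from collections import defaultdict
from datetime import date

def mean_by(records, key):
    """Average chl-a per group, skipping dates with no valid estimate (None)."""
    sums = defaultdict(lambda: [0.0, 0])
    for day, chla in records:
        if chla is None:  # e.g. cloud-covered pixel on that date
            continue
        acc = sums[key(day)]
        acc[0] += chla
        acc[1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

# Invented chl-a estimates (µg/L) for one pixel.
records = [
    (date(2017, 5, 3), 1.0),
    (date(2017, 5, 20), 2.0),
    (date(2017, 6, 1), None),
    (date(2018, 5, 7), 3.0),
]
annual = mean_by(records, lambda d: d.year)
monthly = mean_by(records, lambda d: d.month)
```

Applying the same grouping to every pixel of the image stack would give annual and monthly average maps.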

Figure 5 shows that, on the whole, the spatial distribution trends of the annual average in 2017-2018 are broadly similar. The chl-a concentration is generally higher near the shore than far from it, and it decreases gradually as the distance from the shoreline increases. This is probably because the internal environment of the coastal waters is complex and changeable and is affected by human activities (e.g., wastewater discharge, waterway transportation, and aquaculture). In contrast, in the ocean off the coast, which is almost unaffected by human factors and natural influences, the chl-a concentration is usually low and varies within a small range.
Fujian's coastal water has clear regional characteristics, as it is typical Case 2 water. Moreover, the complex and changeable internal environment of the Fujian nearshore region causes differences in distribution. The chl-a concentration in the coastal waters is mostly around 1 µg/L but is significantly higher in some bays and river estuaries than in the surrounding open seas. The coastal waters are shallower and easily disturbed, which mixes nutrients and salts throughout the water column and thus promotes phytoplankton growth. In addition, there are many harbors in Fujian, and this favorable geographical condition promotes aquaculture development. Sansha Bay is one of the most typical aquaculture bays in the coastal areas of China, as shown in Figure 7. The aquaculture activities mainly include cage culture and raft culture. Metabolites and wastewater containing residual feed do not disperse easily in the bay, resulting in water eutrophication and relatively high chl-a concentration. Another reason is that the coastal area is dense with industrial establishments, so a large amount of industrial wastewater carrying terrestrial materials flows through rivers into the ocean. Along the north-south direction, the northern part of Fujian's coastal waters has a higher chl-a concentration than the other regions.
Figure 6 demonstrates that the chl-a concentration in Fujian's coastal water also differs clearly with time. In spring and summer, the chl-a concentration is higher than in autumn and winter. As the water temperature gradually increases in spring and summer, the water becomes more suitable for phytoplankton growth. Therefore, the chl-a concentration increases to the normal range, and the areas with obvious increases in chl-a concentration are the bay areas.

Comparison with Other Previous Algorithms and OLCI L2 Products
To further evaluate the performance of the LightGBM method, we examined the BR algorithm based on NIR/Red and the TB algorithm for comparison. These represent the empirical and semiempirical methods in turbid productive coastal waters, respectively, and both have been widely used. Previous studies have demonstrated that the BR algorithm can eliminate the errors caused by different solar altitude angles and observation angles, partially eliminate the interference caused by water surface smoothness and small waves changing with time and space, and counteract some atmospheric effects. In contrast, the TB algorithm has a definite physical basis and can eliminate some of the effects of suspended sediments. Table 6 presents the two methods based on the OLCI images tested in this study. The methods proceed as follows: first, the x-variable is calculated from OLCI bands; second, a linear regression model is fitted to the corresponding in situ chl-a data by the least-squares method; last, the regression models are applied to retrieve chl-a concentration from OLCI images. The average RMSE and R² for the evaluation of the three models are presented in Figure 8. Figure 8 shows that the LightGBM model has the best accuracy, with the lowest RMSE and the highest R². We then mapped the spatial distribution of chl-a concentration in Fujian's coastal waters on 12 May 2018, 5 April 2019, and 17 April 2020 based on the LightGBM-Case 4, BR-Case 2, and TB-Case 2 models to further compare the spatial distribution. The results are presented in Figure 9. They also indicate that the LightGBM-based model has better spatial applicability than the traditional methods for the estimation of chl-a concentration in Fujian's coastal waters.
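The three-step procedure for the BR and TB methods (compute the x-variable, fit a linear model by least squares, apply it) can be illustrated with a minimal sketch. The three-band form (1/R(λ1) − 1/R(λ2)) × R(λ3) is the commonly used one and is assumed here; the specific bands, the function names, and the matchup values are hypothetical and are not the expressions or data of Table 6.

```python
from statistics import mean


def band_ratio(r_nir, r_red):
    """x-variable of the BR model: NIR/Red reflectance ratio."""
    return r_nir / r_red


def three_band(r1, r2, r3):
    """x-variable of the TB model in its common form: (1/R1 - 1/R2) * R3."""
    return (1.0 / r1 - 1.0 / r2) * r3


def fit_line(x, y):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def predict(a, b, x):
    """Apply the fitted line to new x-variable values."""
    return [a * xi + b for xi in x]


# Hypothetical matchups: red and NIR reflectance with in situ chl-a (µg/L).
red = [0.010, 0.012, 0.015, 0.020]
nir = [0.008, 0.011, 0.016, 0.024]
chl = [0.6, 0.9, 1.3, 1.8]

x_br = [band_ratio(n, r) for n, r in zip(nir, red)]  # step 1: x-variable
a, b = fit_line(x_br, chl)                           # step 2: regression
chl_hat = predict(a, b, x_br)                        # step 3: retrieval
```

The same `fit_line` and `predict` calls would apply to the TB model by replacing `x_br` with values from `three_band`.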
Table 6. List of the representatives of the empirical and semiempirical methods tested in this study (columns: Model, Expression).

Figure 9 demonstrates that the spatial distribution patterns of chl-a concentration derived from the BR and TB models are very similar. Although these two models capture the general spatial distribution pattern of chl-a in Fujian's coastal waters, such as the areas with higher chl-a concentration, they are less sensitive to changes in chl-a in some waters with lower concentration. In contrast, the LightGBM-Case 4 model can capture more information about chl-a. This may be because the LightGBM-Case 4 model used the continuous spectral bands B3-B12 (442.5-753.75 nm), which carry the main remote sensing signal of chl-a, such as the chl-a fluorescence band.
In contrast, the BR and TB methods use only two or three bands. Moreover, they consider only a linear relationship between the independent and dependent variables, whereas the optical properties of case II waters are complex, and the relationship between chl-a concentration and the spectrum cannot be fully expressed linearly. As a machine learning method, the LightGBM-Case 4 model can capture nonlinear relationships, and it takes more spectral features as input than the BR and TB methods. In addition to the performance differences among the models themselves, the limited number of samples and the accuracy of the atmospheric correction algorithm affect the prediction accuracy to some extent. Moreover, natural processes and human activities, including meteorological, hydrological, and aquacultural factors, influence chl-a concentration. More influencing factors should be considered comprehensively to make the remote sensing monitoring model more reliable.
To show the differences between models more intuitively, we mapped the bias in estimated chl-a between LightGBM-Case 4 and LightGBM-Case 1, and between LightGBM-Case 4 and BR-Case 2. Figure 10 shows that the bias between the two LightGBM models (Case 4 and Case 1) is very small, whereas the bias between the two different methods (LightGBM and BR) is much greater. This indicates that the performance difference between the LightGBM and BR models is relatively large.
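A bias map of this kind can be sketched as a pixelwise difference between two estimated chl-a grids. This is an illustrative assumption about how the bias is formed; the function names and the small grids below are hypothetical, and `None` stands for pixels without an estimate (for example, land or cloud).

```python
def bias_map(map_a, map_b):
    """Pixelwise difference map_a - map_b; None where either map has no value."""
    return [
        [None if a is None or b is None else a - b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]


def mean_abs_bias(bias):
    """Mean absolute bias over all pixels that have a value, or None if there are none."""
    values = [abs(v) for row in bias for v in row if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical 2 x 2 chl-a grids (µg/L) from two models.
model_case4 = [[1.0, 2.0], [None, 4.0]]
model_other = [[0.5, 2.5], [1.0, 3.0]]

bias = bias_map(model_case4, model_other)
summary = mean_abs_bias(bias)
```

A single summary value such as `summary` complements the map when two model pairs are compared side by side.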
Ocean color missions are of great significance for marine environment monitoring, so it is important to compare the LightGBM-estimated chl-a concentration with existing ocean color products. Here, we employed two OLCI L2 chl-a concentration products for comparison and validation: the OC4Me chl-a product (based on the OC4Me algorithm) [6] and the NN chl-a product (based on the Inverse Modelling Technique (IMT) Neural Net algorithm). Figure 11 presents the spatial distribution of chl-a concentration based on the LightGBM-Case 4 model, the NN algorithm, and the OC4Me algorithm. In general, the chl-a concentrations from the NN and OC4Me algorithms are significantly higher than the LightGBM-Case 4 estimates; the OC4Me values in particular are overestimated, reaching 20 µg/L. According to the in situ measurements, the chl-a concentration in the coastal waters is in fact mainly distributed between 0 and 2 µg/L. The spatial distributions of chl-a concentration from the LightGBM model and the NN algorithm are generally consistent. However, the OLCI L2 products are not spatially complete, especially the OC4Me product. Figure 11 shows missing chl-a values in the NN and OC4Me products (marked by the red polygon), whereas the LightGBM-estimated result is spatially complete. Overall, the LightGBM-estimated product for Fujian's coastal waters has higher quality than the existing OLCI L2 products.
Figure 11. Spatial distribution of chl-a concentration based on the LightGBM-Case 4 model using the OLCI images (left column) and the OLCI L2 chl-a products based on the NN algorithm (middle column) and the OC4Me algorithm (right column) on 13 April 2020, 17 April 2020, and 18 June 2020 in Fujian's coastal waters.
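The notion of spatial completeness used in this comparison can be quantified as the share of water pixels that carry a valid chl-a value. The sketch below is one possible way to compute it, assuming missing retrievals are stored as NaN and a boolean water mask is available; the function name and the example grids are hypothetical.

```python
import math


def valid_fraction(chl_map, water_mask):
    """Share of water pixels holding a finite chl-a value (None if no water pixels)."""
    water = 0
    valid = 0
    for row, mask_row in zip(chl_map, water_mask):
        for value, is_water in zip(row, mask_row):
            if not is_water:
                continue  # skip land pixels
            water += 1
            if value is not None and math.isfinite(value):
                valid += 1
    return valid / water if water else None


nan = float("nan")
# Hypothetical 2 x 2 product with two missing retrievals, one of them on land.
product = [[1.0, nan], [0.5, nan]]
mask = [[True, True], [True, False]]

coverage = valid_fraction(product, mask)
```

Computing this fraction for each product on the same date and mask gives a simple number to accompany the visual comparison of gaps.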

Discussion
At present, OLCI is a good option for quickly mapping and retrieving water quality parameters over large coastal areas owing to its high spectral and temporal resolution. With both Sentinel-3A and Sentinel-3B in operation, OLCI data can be obtained almost daily. However, image quality is subject to the weather, especially in cloudy and rainy subtropical areas like Fujian. This reduces the usability of OLCI images for ocean color studies, but such meteorological limitations are unavoidable in optical remote sensing. In the future, we should consider fusing multi-source satellite data to obtain more ocean color information. Moreover, atmospheric correction is crucial for ocean color retrieval, because its quality strongly influences the inversion results. A suitable atmospheric correction method helps ensure high-quality reflectance data for ocean color modeling.
Although ocean color missions have provided abundant ocean color products, their applicability to local areas has not been verified. In this study, we compared the LightGBM-based chl-a product with two OLCI L2 products (based on the NN and OC4Me algorithms). Although the spatial distributions of chl-a concentration from these three methods are generally similar, the ranges of chl-a values differ considerably. The main reason is that the in situ data used for NN and OC4Me modeling were measured in various ocean regions over the past decades, whereas the in situ data used for LightGBM modeling were collected in the coastal waters of Fujian. Thus, the OLCI L2 products are not well suited to ocean color studies in Fujian's coastal waters. There are still many challenges in developing a universal inversion model for chl-a concentration in global waters because the optical properties of water bodies differ significantly between regions. Although the LightGBM method has yielded good results in the coastal waters of Fujian, its spatial applicability remains a limitation. Nevertheless, the method itself offers a useful reference for ocean color studies in other coastal areas.

Conclusions
This study aimed to establish a robust and efficient model based on an advanced machine learning method to estimate chl-a concentration from the new-generation OLCI instrument in coastal waters with complex optical properties. We introduced a gradient boosting method, LightGBM, to retrieve chl-a concentration by combining multitemporal OLCI data with in situ data in Fujian's coastal waters. The performance of the model was quantitatively evaluated using the statistical indicators RMSE, MAE, MAPE, and R². The study demonstrated that the accuracy of the LightGBM model was higher than that of the BR and TB algorithms, indicating that LightGBM is better suited for chl-a estimation in Fujian's coastal waters. Moreover, we validated the applicability of the LightGBM-based model for chl-a concentration estimation and analyzed the spatiotemporal distribution of chl-a in Fujian's coastal waters.
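The four indicators can be computed as in the following minimal sketch. MAPE is taken as the median absolute percentage error, as defined in the abstract; R² is assumed here to be the coefficient of determination (1 minus the ratio of residual to total sum of squares), which may differ from the exact definition used in the study. The function names are illustrative.

```python
import math
from statistics import mean, median


def rmse(obs, pred):
    """Root-mean-square error."""
    return math.sqrt(mean([(p - o) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    """Mean absolute error."""
    return mean([abs(p - o) for o, p in zip(obs, pred)])


def mape(obs, pred):
    """Median absolute percentage error, in percent (observations must be nonzero)."""
    return 100.0 * median([abs(p - o) / abs(o) for o, p in zip(obs, pred)])


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    obs_bar = mean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - obs_bar) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical in situ and estimated chl-a values (µg/L).
observed = [1.0, 2.0, 3.0]
estimated = [2.0, 2.0, 2.0]

scores = {
    "RMSE": rmse(observed, estimated),
    "MAE": mae(observed, estimated),
    "MAPE": mape(observed, estimated),
    "R2": r2(observed, estimated),
}
```

Using the median rather than the mean for the percentage error limits the influence of a few very low in situ values, which would otherwise dominate the ratio.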
The choice of input features is important for LightGBM-based chl-a concentration estimation. In this study, we tested four spectral indices (BRI, NDCI, NFHI, and TBI) to evaluate their impact on model performance compared with the model based on spectral bands alone. The results demonstrated that adding the NFHI index significantly improved the prediction accuracy, with an RMSE of 0.38 µg/L, MAE of 0.22 µg/L, MAPE of 28.33%, and R² of 0.77. One advantage of machine learning methods (e.g., LightGBM) is that they can handle multidimensional variables, and the feature variables greatly affect their performance. Thus, it is important to construct more meaningful additional features from the initial spectral features to improve the prediction performance. In future studies, we would like to identify more useful spectral indices as input features that enhance the chl-a signal in satellite remote sensing reflectance and thereby further improve the accuracy of the LightGBM model. This would require a deeper understanding of the complex optical properties of Fujian's coastal waters.
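Adding a spectral index as a supplementary feature can be sketched as follows. Only NDCI is shown, in its commonly used form (R709 − R665) / (R709 + R665), with B8 (665 nm) and B11 (708.75 nm) as the assumed OLCI bands; the expressions for BRI, NFHI, and TBI used in the study are not restated in this section and are therefore left out. The function names and reflectance values are hypothetical.

```python
def ndci(bands):
    """Normalized difference chlorophyll index from B8 (red) and B11 (red edge)."""
    red, red_edge = bands["B8"], bands["B11"]
    return (red_edge - red) / (red_edge + red)


def with_index_features(bands, indices):
    """Return a copy of the band dictionary extended with the named indices."""
    features = dict(bands)  # leave the input sample unchanged
    for name, func in indices.items():
        features[name] = func(bands)
    return features


# Hypothetical reflectance for one matchup sample.
sample = {"B8": 0.010, "B11": 0.030}
features = with_index_features(sample, {"NDCI": ndci})
```

Passing the indices as a name-to-function mapping makes it straightforward to test one index at a time, in the spirit of the separate cases compared in the study.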
The quality of the data sets largely limits the performance of models based on machine learning. One factor is the quality of the satellite remote sensing data (spectral resolution, spatial resolution, temporal resolution, and signal-to-noise ratio). Another is the in situ data collected from float stations, which would benefit from a more uniform and adequate spatial distribution of the floats. In addition to the optical parameters from satellite data that are directly related to chlorophyll concentration, other environmental factors that may influence chl-a concentration (including temperature, light, nutrients, microelements, and rainfall) will be considered in future studies.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.