WaterNet: A Convolutional Neural Network for Chlorophyll-a Concentration Retrieval

Abstract: The retrieval of chlorophyll-a (Chl-a) concentrations relies on empirical or analytical analyses, which generally experience difficulties from the diversity of inland waters in statistical analyses and the complexity of radiative transfer equations in analytical analyses, respectively. Previous studies proposed the utilization of artificial neural networks (ANNs) to alleviate these problems. However, ANNs do not consider the problem of insufficient in situ samples during model training, and they do not fully utilize the spatial and spectral information of remote sensing images. In this study, a two-stage training is introduced to address the problem of sample insufficiency. The neural network is pretrained using samples derived from an existing Chl-a concentration model in the first stage, and the pretrained model is refined with in situ samples in the second stage. A novel convolutional neural network for Chl-a concentration retrieval called WaterNet is proposed, which utilizes both the spectral and spatial information of remote sensing images. In addition, an end-to-end structure that integrates feature extraction, band expansion, and Chl-a estimation into the neural network leads to efficient and effective Chl-a concentration retrieval. In experiments, Sentinel-3 images acquired on the same days as in situ measurements over Laguna Lake in the Philippines were used to train and evaluate WaterNet. The quantitative analyses show that the two-stage training is more likely than one-stage training to reach the global optimum in the optimization, and that WaterNet with two-stage training outperforms related ANN-based and band-combination-based Chl-a concentration models in terms of estimation accuracy.


Introduction
Eutrophication occurs when a water body becomes overly enriched with nutrients, e.g., phosphorus and nitrogen. Aquaculture managers add nutrient fertilizers to increase the density and productivity of commercial fish. However, the oversupply of nutrients might cause eutrophication and, consequently, algal blooms, i.e., excessive algal growth due to the increased availability of nutrients, which degrade water quality [1]. Several works reported that Europe and the USA suffer economic losses of approximately 1 billion and 100 million USD per year, respectively, due to eutrophication. Analytical approaches invert the remote sensing reflectance into the inherent optical properties (IOPs) of water using the radiative transfer equation (RTE) and further link the estimated IOPs to Chl-a concentrations. However, the accuracy of Chl-a concentration retrieval might be unstable because of the difficulty of approximating the spectral shapes of IOPs [37].
Artificial neural networks (ANNs) have been utilized in geospatial fields for various applications [8,[38][39][40]]. ANNs are sometimes described as glorified regression because of their nonlinear modeling capability. Ioannou et al. [37,41] used an ANN instead of the RTE to model IOP coefficients from simulated Rrs. The obtained IOP coefficients were then used to retrieve Chl-a concentrations. Buckton et al. [42] proposed an ANN-based empirical model that directly links Rrs to Chl-a concentrations. A three-layer structure, consisting of input, hidden, and output layers, was adopted in these ANN-based studies, and the results reveal the capability of ANNs to retrieve Chl-a concentrations. Furthermore, Hafeez et al. [43] searched for the optimal neural structure and parameters, including the number of hidden layers and neurons, by using exhaustive search. The study further compared the determined optimal ANN model with models based on other machine learning methods, including random forest, cubist regression, and support vector regression, for Chl-a concentration estimation. The experimental results revealed that the optimal ANN model performs best. However, these ANN-based models are pixel-based, and thus, Chl-a concentrations are estimated from the variations of the remote sensing reflectance in the spectral bands of a single pixel. In other words, the models do not consider local spatial information. In addition, ANN-based models suffer from the problem of insufficient in situ samples. Training an ANN model requires a large set of labeled samples and good initial values for optimization. Training with few samples leads to overfitting, and unsuitable initial values make it difficult for the loss function to converge to the global minimum. Kwon et al. [6] utilized around 90 in situ Chl-a concentration samples in the middle of the South Sea of Korea to train a three-layered ANN. A similar quantity was used by El-habashi et al. [44].
Some researchers utilized a spectra simulation technique, such as Hydrolight and WASI3D, to simulate the R rs and Chl-a concentration samples for data enrichment, and the simulated samples were further used to train the ANN models [37,41,42]. Ioannou et al. [37] reported that a slight underestimation appeared, possibly because of the simulation.
In the present study, a Chl-a concentration estimation model based on a convolutional neural network (CNN), called WaterNet, is proposed. WaterNet is an end-to-end model that integrates feature extraction, band expansion, and Chl-a estimation into the neural network. 3D convolutional kernels are utilized so that both the spectral and spatial information of images is exploited. Therefore, the proposed WaterNet can handle objects in water bodies, such as aquaculture cages and aquatic plants, and alleviate the influence of satellite instrumental errors. In addition, a two-stage training consisting of pretraining and refinement is proposed to address sample insufficiency. WaterNet is pretrained by utilizing the Chl-a concentrations derived from an existing Chl-a concentration model in the first stage. Then, the in situ samples are used to refine the pretrained model in the second stage. The proposed method provides two contributions: (1) the introduction of WaterNet, an end-to-end CNN-based model; and (2) the introduction of a two-stage training that alleviates the problem of insufficient in situ samples.

Study Area and Datasets
The study area is Laguna Lake (Figure 1), the largest lake in the Philippines with an area of 900 km². The average depth is 2.8 m, and the shoreline length is 220 km. The water resources of Laguna Lake are used to provide water supply, enable the transportation of people and goods between communities, and support the aquaculture industry [45]. The aquaculture structures occupy nearly 150 km² (around 17%) of the total lake area, and most are fish farms, including 14 indigenous species and 19 exotic species [46]. The rapid growth of the population and urbanization in Manila is producing large amounts of wastewater and inorganic materials, which threaten the water quality of Laguna Lake. Field data were collected on five different days in January-April 2019, covering the dry season. Completing the collection of all samples in one campaign and one day is difficult because of the huge area of the lake. Therefore, the study area was divided into three regions: West Bay, Central and South Bays, and East Bay, and the samples in one region were collected in each campaign. Sampling started at a local port located between the West and Central Bays at around 8 a.m. to match the satellite acquisition time. Completing a field survey in West or Central Bay requires around 5 h, excluding the installation and uninstallation of the tools on the boat. Reaching East Bay from the port requires 2-3 more hours, and the survey might not be able to match the sampling and acquisition time. Therefore, no data were collected in East Bay during the period.
An along-track Chl-a data logger (Infinity-CLW ACLW2-USB, produced by JFE Advantech Co., Ltd.) was installed on a boat at a depth of 0.5 m to measure the Chl-a concentration every second while the boat traveled at around 10 kph. The data logger recorded around 15,000 samples in one campaign. However, several successive samples were mapped to the same pixel in the satellite images because of the difference between the sampling resolution and the image spatial resolution. Therefore, outlier removal and data aggregation were performed during data preprocessing to partially remove noise from the measured Chl-a concentrations. Specifically, a sample was removed if two successive records of Chl-a concentrations showed considerable differences or if the Chl-a concentration fell outside the interquartile range of the dataset collected in one campaign. Then, the Chl-a concentration of a water pixel was obtained by averaging the Chl-a concentrations of the samples that were geographically mapped to that pixel. A total of 257 in situ samples were collected from the field campaigns. The statistical information of the in situ samples is summarized in Table 1.

Sentinel-3 was launched by the European Space Agency as a part of the Copernicus Programme and was expected to continue the legacy of MERIS in extracting a wide range of information about optically significant constituents in water bodies. Therefore, Sentinel-3 OLCI images were selected as the remote sensing images in the current work. The satellite carries seven sensors, including the Ocean and Land Color Instrument (OLCI), which contains 21 spectral bands with wavelengths ranging from visible to near-infrared.
To avoid the atmospheric effect, this study adopts level-2 water full-resolution (WFR) images, which contain 16 atmospherically corrected spectral bands at 300 m spatial resolution (Bands 1-12, 16-17, and 20-21) and two Chl-a concentration channels built using the inverse radiative transfer model-neural network (IRTM-NN) [47] and OC4Me [48]. The remaining bands (Bands 13-15 and 18-19) are dedicated to atmospheric correction and are not available in level-2 WFR. Five level-2 WFR images over the study area were utilized.
Non-water pixels, including pixels belonging to land and clouds, do not carry information about Chl-a concentrations. In addition, the remote sensing reflectance of some water pixels falls outside the normal range (0-1 sr⁻¹), possibly because of cloud shadows or low acquisition quality. In addition to WFR, the OLCI global vegetation index in land full-resolution (LFR) level-2 images was utilized in the classification. Using the WFR and LFR Sentinel-3 level-2 images, pixels were classified as water, land, cloud, cloud shadow, or low-quality water. A decision-tree-based classification was adopted. The classification results for the tested Sentinel-3 images are displayed in Figure 2.
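The in situ preprocessing described above (jump- and IQR-based outlier removal followed by per-pixel aggregation) can be sketched as below. The jump threshold and the flat data layout are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

def preprocess_logger(chla, pixel_ids, jump_thresh=5.0):
    """Sketch of the in situ preprocessing: drop records whose successive
    difference is large or that fall outside the campaign's interquartile
    range, then average the remaining records per mapped image pixel.
    jump_thresh is an assumed value; the paper does not state one."""
    chla = np.asarray(chla, dtype=float)
    pixel_ids = np.asarray(pixel_ids)
    keep = np.ones(len(chla), dtype=bool)
    # Remove records with large jumps between successive measurements.
    keep[1:] &= np.abs(np.diff(chla)) <= jump_thresh
    # Remove records outside the interquartile range of the campaign.
    q1, q3 = np.percentile(chla, [25, 75])
    keep &= (chla >= q1) & (chla <= q3)
    # Aggregate: mean Chl-a per satellite pixel.
    return {pid: chla[keep & (pixel_ids == pid)].mean()
            for pid in np.unique(pixel_ids[keep])}
```

In this sketch, a pixel disappears from the output entirely when all of its records are flagged as outliers, mirroring how logger samples only contribute to pixels with valid measurements.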


CNN-Based Chl-a Concentration Model
In this section, the data preprocessing and normalization (Section 2.2.1), WaterNet network structure (Section 2.2.2), and proposed two-stage training (Section 2.2.3) are introduced and discussed.

Data Preprocessing and Normalization
WaterNet adopts 3D convolution kernels, which utilize both the spatial and spectral information of images in modeling, instead of the pixel-based structure that uses spectral information only. The input to WaterNet is a volume of size 7 × 7 × 16 covering a small patch centered at a water pixel. The 7 × 7 spatial coverage was selected to compensate for the spatial shrinking caused by the unpadded convolutional processes. Each pixel in the patch contains the normalized remote sensing reflectance of the 16 spectral bands of Sentinel-3 WFR level-2 images. The output of WaterNet is the estimated normalized Chl-a concentration at the center pixel of the input patch.
During data preprocessing, the cloud- and shadow-free water pixels are extracted from the images, and patching is performed to create a local 7 × 7 patch for each pixel. To provide a roughly estimated Chl-a concentration for each patch, we used the Chl-a concentrations at the center pixel of a patch estimated by IRTM-NN and OC4Me. Table 2 summarizes the number of image patches (those containing non-water pixels are excluded) mapped to the cloud- and shadow-free water samples in the campaigns, as well as the Chl-a concentrations at the patch center pixels estimated by IRTM-NN and OC4Me. After the preprocessing, a set of patches with roughly estimated Chl-a concentrations is obtained, that is, $\{(\mathrm{Patch}_1, c_1), \cdots, (\mathrm{Patch}_{n_p}, c_{n_p})\}$, where $n_p$ denotes the number of patches, $\mathrm{Patch}_i = \{r_i^1, \cdots, r_i^{49}\}$ represents the remote sensing reflectance set of the 49 pixels in the $i$th patch, and $c_i$ denotes the Chl-a concentration of the center pixel in the $i$th patch calculated with IRTM-NN or OC4Me. In addition, a small set of patches with in situ Chl-a measurements from the field campaigns is utilized: $\{(\mathrm{Patch}_{m(1)}, is_{m(1)}), \cdots, (\mathrm{Patch}_{m(n_{is})}, is_{m(n_{is})})\}$, where $n_{is}$ denotes the number of in situ samples, $m(\cdot)$ is the mapping function between the indexes of the in situ samples and the samples with estimated Chl-a concentrations, and $is_{m(j)}$ is the measured Chl-a concentration of the $j$th in situ sample. The Chl-a concentrations of the patches are normalized to the range [0, 1] for model training stabilization:

$$n\_c_i = \frac{c_i - \min(chla)}{\max(chla) - \min(chla)}, \qquad n\_is_{m(j)} = \frac{is_{m(j)} - \min(chla)}{\max(chla) - \min(chla)} \qquad (1)$$

where $n\_c_i$ and $n\_is_{m(j)}$ denote the normalized $c_i$ and $is_{m(j)}$, respectively, and $\max(chla)$ and $\min(chla)$ represent the maximal and minimal values of the roughly estimated or in situ Chl-a concentrations, respectively.
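As a minimal sketch, the min-max normalization of Equation (1) can be written as:

```python
import numpy as np

def normalize_chla(chla, chla_min, chla_max):
    """Min-max normalize Chl-a concentrations to [0, 1] (Equation (1))."""
    chla = np.asarray(chla, dtype=float)
    return (chla - chla_min) / (chla_max - chla_min)

print(normalize_chla([2.0, 7.0, 12.0], chla_min=2.0, chla_max=12.0))  # [0.  0.5 1. ]
```

The same function applies to both the roughly estimated and the in situ Chl-a labels, with the min/max taken over the respective sample set.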
Similarly, the pixel remote sensing reflectance is rescaled to the range [0, 1] as

$$n\_r_i = \frac{r_i - \min(R)}{\max(R) - \min(R)} \qquad (2)$$

where $n\_r_i$ is the normalized remote sensing reflectance of the spectral bands $r_i$, $\max(R)$ and $\min(R)$ represent the maximal and minimal values of the remote sensing reflectance at each wavelength, respectively, and $R \in \mathbb{R}^{16 \times n_p}$ is the set of the remote sensing reflectance of all pixels. After the preprocessing, two normalized sample sets, namely, $\{(n\_\mathrm{Patch}_i, n\_c_i)\}_{i=1}^{n_p}$ and $\{(n\_\mathrm{Patch}_{m(j)}, n\_is_{m(j)})\}_{j=1}^{n_{is}}$, can be obtained and used in the model training, where $n\_\mathrm{Patch}_i = \{n\_r_i^1, \cdots, n\_r_i^{49}\}$ represents the set of normalized pixels in patch $i$.

Network Structure of WaterNet
Figure 3 illustrates the network structure of WaterNet, which consists of three phases: band expansion, feature extraction, and Chl-a estimation. WaterNet contains one 3D convolution layer in the first phase, two 3D convolution layers in the second phase, and two fully connected layers in the third phase. The three phases are described below.
Band expansion phase. In line with the works of [5,49,50], which utilized band combinations as feature candidates in Chl-a concentration modeling, this phase aims to enrich the spectral information using band combination. The kernel size of the filters used in this phase is 1 × 1 × 3, indicating that the convolution is performed in the spectral domain and that spectral enrichment is achieved using linear band combination. A rectified linear unit function is used as the activation function, followed by batch normalization. This phase involves three filters and 24 unknown parameters: half of these parameters are the weights and biases in the filter masks, while the others are the means, standard deviations, shifts, and scalings in batch normalization. Moreover, no padding is adopted during filtering, and the output of the layer is a 7 × 7 × 42 feature volume, where 7 × 7 is the spatial size and 42 is the spectral size.
Feature extraction phase. This phase involves two 3D convolutional layers that extract the features that are sensitive to Chl-a concentration. In the first layer, 10 filter kernels with a size of 3 × 3 × 42 are utilized to produce 10 feature maps. The spatial size of the feature maps in the first convolutional layer is 5 × 5 because of the absence of padding during filtering. The second convolutional layer utilizes five filter kernels with a size of 3 × 3 × 10. The output of this layer is a 3 × 3 × 5 feature volume. In total, this phase involves 4305 unknown parameters, including 3830 in the first convolutional layer and 475 unknowns in the second one.
Chl-a estimation phase. The Chl-a estimation phase involves reshaping a 3D volume into a 1D vector by flattening and two fully connected layers. The length of the reshaped vector is 45 because the size of the output from the second phase is 3 × 3 × 5. The vector is fully connected to a hidden layer with nine neurons and is further fully connected to an output layer with one neuron representing the normalized Chl-a concentration at the patch center pixel. A sigmoid function is used as an activation function in the fully connected layers. A total of 424 unknown parameters are involved in this phase.
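The layer sizes described in the three phases above can be sanity-checked with a minimal PyTorch sketch. The use of PyTorch, the reshaping between phases, and the omission of dropout are our assumptions for illustration, not details from the paper:

```python
import torch
import torch.nn as nn

class WaterNet(nn.Module):
    """Sketch of the three-phase WaterNet structure; shapes follow the text
    (7x7x16 -> 7x7x42 -> 5x5x10 -> 3x3x5 -> 45 -> 9 -> 1)."""
    def __init__(self):
        super().__init__()
        # Band expansion: 3 filters of size 1x1x3 over the 16 bands, no padding
        # -> 3 x 14 = 42 spectral features per pixel.
        self.expand = nn.Sequential(
            nn.Conv3d(1, 3, kernel_size=(3, 1, 1)), nn.ReLU(), nn.BatchNorm3d(3))
        # Feature extraction: 10 kernels 3x3x42 -> 5x5x10, then 5 kernels 3x3x10 -> 3x3x5.
        self.conv1 = nn.Sequential(
            nn.Conv3d(1, 10, kernel_size=(42, 3, 3)), nn.ReLU(), nn.BatchNorm3d(10))
        self.conv2 = nn.Sequential(
            nn.Conv3d(1, 5, kernel_size=(10, 3, 3)), nn.ReLU(), nn.BatchNorm3d(5))
        # Chl-a estimation: flatten 3x3x5 = 45 -> 9 hidden -> 1 output, sigmoid.
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(45, 9), nn.Sigmoid(), nn.Linear(9, 1), nn.Sigmoid())

    def forward(self, x):                         # x: (batch, 1, 16, 7, 7)
        x = self.expand(x)                        # (batch, 3, 14, 7, 7)
        x = x.reshape(x.size(0), 1, 42, 7, 7)     # stack into 42 spectral features
        x = self.conv1(x)                         # (batch, 10, 1, 5, 5)
        x = x.reshape(x.size(0), 1, 10, 5, 5)
        x = self.conv2(x)                         # (batch, 5, 1, 3, 3) -> 45 values
        return self.head(x)                       # (batch, 1), normalized Chl-a

net = WaterNet()
out = net(torch.rand(4, 1, 16, 7, 7))
print(out.shape)  # torch.Size([4, 1])
```

The sigmoid on the output keeps predictions in [0, 1], matching the normalized Chl-a labels of Equation (1).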
Several studies on remote sensing using ANNs conducted experiments to search for the optimal neural network structure and parameters, including filter sizes, the numbers of filters, layers, and neurons, and even activation functions. This process is generally trial-and-error and time-consuming. In addition, searching for the optimal neural structure with optimal parameters is sometimes not suitable because of the limited in situ samples and the overfitting problem. Therefore, this study addresses the design of the phases in the neural structure instead of the optimal parameters.

Two-Stage Training
WaterNet contains 4753 unknown parameters and thus requires a large number of in situ samples for training to reach the global optimum. However, in situ sample collection is costly, and only 257 in situ samples were obtained from the field campaigns in this work. The number of in situ samples available for training is much smaller than the number of unknowns in WaterNet. Training with insufficient samples increases the probability of generalization errors and overfitting. In addition, setting the initial values of the unknown parameters in the optimizer is crucial, especially when the training samples are insufficient. To solve this problem, this work introduces a two-stage training consisting of pretraining and refinement. In the former, WaterNet is pretrained using the samples with Chl-a concentrations derived from an existing retrieval model (i.e., $\{(n\_\mathrm{Patch}_i, n\_c_i)\}_{i=1}^{n_p}$). The pretrained model is refined with the in situ samples in the latter stage (i.e., $\{(n\_\mathrm{Patch}_{m(j)}, n\_is_{m(j)})\}_{j=1}^{n_{is}}$). The concept of the two-stage training is illustrated in Figure 4. The main idea is to obtain suitable initial values through a pretraining process with the labeled samples obtained from an existing Chl-a model. In other words, a set of patch samples with roughly estimated Chl-a concentrations is generated by using an estimation model. The calculated loss values, however, may exhibit large deviations because the sample labels are not as accurate as the in situ measurements. Nevertheless, the pretraining result is closer to the real optimum than the initial values, and it is thus used as the new initial values in the refinement stage. Using the in situ samples as training samples with these new initial values has a higher probability of reaching the global optimum than one-stage training, i.e., training using only samples from an existing model or only in situ measurements.
In this way, the requirement of large in situ sample sets can be reduced because of the two-stage training strategy.

Two candidates of Chl-a concentrations are provided in Sentinel-3 WFR products, namely, the Chl-a concentrations from IRTM-NN and OC4Me. IRTM-NN is an analytical method that replaces the RTE with an ANN with 10 spectral bands in the input layer, and OC4Me is a two-band model [51]. These two candidates were compared with the in situ Chl-a concentrations, and the one with better accuracy in terms of root mean square error (RMSE) was utilized to generate labeled samples. On the basis of the comparisons shown in Table 3, the estimated Chl-a concentrations from IRTM-NN are slightly better than those from OC4Me. Therefore, IRTM-NN was selected to generate the labeled samples $\{(n\_\mathrm{Patch}_i, n\_c_i)\}_{i=1}^{n_p}$ for the pretraining.

In optimization, the Adam optimizer, which utilizes adaptive learning rates and moment estimation, is employed [52], and the mean square error (MSE) is used as the loss function. Given the labeled samples $n\_c_1, \cdots, n\_c_{n_p}$ and the corresponding predictions $predi_1, \cdots, predi_{n_p}$ from the model, the loss function is defined as

$$Loss = \frac{1}{n_p}\sum_{i=1}^{n_p}\left(n\_c_i - predi_i\right)^2 \qquad (3)$$

Overfitting is alleviated by adopting the commonly used dropout and L2 regularization. The dropout temporarily removes several neurons when computing the loss function for model convergence monitoring, whereas the L2 regularization penalizes large weights by adding the Frobenius norm of the parameters to the loss function (Equation (3)) during error backpropagation. In training, the number of epochs is set to 30, and the training process stores the parameters of the epoch with the minimal loss value. These values, along with the network structure, are used to estimate Chl-a concentrations.
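Under these choices, the two-stage procedure can be sketched as below. The stand-in model, learning rate, weight decay (which realizes the L2 penalty), and random data are assumptions for illustration; the paper's dropout is omitted for brevity:

```python
import torch
import torch.nn as nn

def train(model, patches, labels, epochs=30, lr=1e-3, weight_decay=1e-4):
    """One training stage: Adam with MSE loss and L2 regularization (via
    weight_decay); keeps the parameters of the epoch with minimal loss.
    Hyperparameter values here are assumptions, not from the paper."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    loss_fn = nn.MSELoss()
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(patches), labels)
        loss.backward()
        opt.step()
        if loss.item() < best_loss:
            best_loss = loss.item()
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)  # restore the best epoch
    return model

# Stand-in regression network and data (random tensors replace real patches).
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 49, 9), nn.Sigmoid(),
                      nn.Linear(9, 1), nn.Sigmoid())
pretrain_x, pretrain_y = torch.rand(64, 16, 7, 7), torch.rand(64, 1)
insitu_x, insitu_y = torch.rand(16, 16, 7, 7), torch.rand(16, 1)
model = train(model, pretrain_x, pretrain_y)  # stage 1: pretraining on model-derived labels
model = train(model, insitu_x, insitu_y)      # stage 2: refinement on in situ samples
```

The refinement stage simply reuses the same training routine starting from the pretrained weights, which is the essence of the two-stage strategy.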

Postprocessing of WaterNet
The output of WaterNet is the estimated normalized Chl-a concentration. Therefore, postprocessing is performed to transform the normalized Chl-a concentration back to the original range using the maximal and minimal values of the Chl-a concentrations in Equation (1). The recalling is defined aŝ where n_ĉhl_a i represents the estimated normalized Chl-a concentration from WaterNet, andĉhl_a i is the estimated Chl-a concentration after rescaling.

Experimental Results
To evaluate the performance of WaterNet, we adopted the k-fold cross-validation, in which k is set to 10 and all samples from the campaigns were uniformly partitioned into 10 folds. Table 4 shows the statistical analysis of the Chl-a concentrations in each fold. WaterNet was evaluated and compared with other related methods using cross-validation. The evaluations are elaborated in this section. Moreover, the comparison of WaterNet with other neural structures is described in Section 4.1, whereas that with related Chl-a retrieval models is discussed in Section 4.2.

WaterNet Performance Evaluation
WaterNet was pretrained and refined through the proposed two-stage training, which employs the Chl-a concentrations from IRTM-NN and in situ measurements. To evaluate the feasibility and performance of the proposed method, we compared the two-stage training with one-stage trainings, including the first and second stages. The first stage trains the neural network using the patch samples (n_Patch i , n_c i ) n p i=1 , whereas the second stage refines the neural network by utilizing the in situ samples n_Patch m( j) , n_is m( j) n is j=1 . The comparisons are presented in Table 5 and Figure 5. The results show that the two-stage training is better than the one-stage training that implements the second stage Remote Sens. 2020, 12,1966 10 of 16 only (i.e., second-stage training). The range of RMSE decreases from 0.716-2.181 to 0.509-0.975 µg/L, and the average value is improved from 1.298 to 0.752 µg/L. This finding implies that the former has a higher possibility to converge to a better loss value compared with the latter using insufficient in situ samples. The results also show that the two-stage training is better than the first-stage training. The range of RMSE decreases from 2.189-2.492 to 0.509-0.975 µg/L, and the average value is improved from 2.365 to 0.752 µg/L. This result means that the first-stage training cannot reach the global optimum because the labels of the samples are not from in situ measurements. Nevertheless, the first-stage training can provide good initial values of the unknown parameters for the second-stage training. The comparisons of the optimization convergences between WaterNet with and without two-stage training are presented in Figure 5. Using the initial values from the first-stage training, the two-stage training can converge more efficiently (less than 10 epochs) than the second-stage training (around 25 epochs).
In conclusion, the pretraining stage can provide an initial value that increases the possibility of reaching the global optimum in the second stage with few in situ samples. Table 5. Training comparison. Comparison between two-stage training and one-stage trainings, including the first and second stage only, in WaterNet. "Ave." and "Std.", respectively, represent the average and standard deviation of root mean square errors (RMSEs) in each fold. The results show that the two-stage training is better than the one-stage training that implements the second stage only (i.e., second-stage training). The range of RMSE decreases from 0.716-2.181 to 0.509-0.975 μg/L, and the average value is improved from 1.298 to 0.752 μg/L. This finding implies that the former has a higher possibility to converge to a better loss value compared with the latter using insufficient in situ samples. The results also show that the two-stage training is better than the first-stage training. The range of RMSE decreases from 2.189-2.492 to 0.509-0.975 μg/L, and the average value is improved from 2.365 to 0.752 μg/L. This result means that the first-stage training cannot reach the global optimum because the labels of the samples are not from in situ measurements. Nevertheless, the first-stage training can provide good initial values of the unknown parameters for the second-stage training. The comparisons of the optimization convergences between WaterNet with and without two-stage training are presented in Figure 5. Using the initial values from the firststage training, the two-stage training can converge more efficiently (less than 10 epochs) than the second-stage training (around 25 epochs). In conclusion, the pretraining stage can provide an initial value that increases the possibility of reaching the global optimum in the second stage with few in situ samples.  The trained WaterNet was applied to Sentinel-3 images, and the generated Chl-a concentration maps are shown in Figure 6. 
The Chl-a concentrations are visualized by colors ranging from 6 (yellow) to 12 μg/L (red). The trained WaterNet was applied to Sentinel-3 images, and the generated Chl-a concentration maps are shown in Figure 6. The Chl-a concentrations are visualized by colors ranging from 6 (yellow) to 12 µg/L (red).

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixelbased neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16,264,1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 μg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 μg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixel-based neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16, 264, 1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 µg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 µg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixelbased neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16,264,1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 μg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 μg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixelbased neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16,264,1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 μg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 μg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixelbased neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16,264,1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 μg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 μg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison Between WaterNet and Feedforward Neural Networks
WaterNet is also compared with feedforward neural network, which is a commonly used pixelbased neural structure for Chl-a concentration estimation [41,42]. Three feedforward neural networks containing one, two, and three hidden layers with sigmoid activation functions are used for comparison. To obtain fair comparisons, we set the numbers of unknown parameters in WaterNet and the feedforward neural networks to be almost the same. The number of neurons in the input, hidden, and output layers of the three feedforward networks are (16,264,1), (16,70,50,1), and (16,44,44,44,1), respectively; whereas the numbers of unknowns are 4753, 4791, and 4753, respectively, as shown in Table 6. The proposed two-stage training is applied to WaterNet and the feedforward neural networks for a fair comparison. The results in Table 6 show that WaterNet (Avg. RMSE: 0.752 μg/L) outperforms the feedforward neural networks (Avg. RMSE: 1.369, 1.429, and 1.374 μg/L) in terms of accuracy of Chl-a concentration estimation. This phenomenon can be attributed to the efficient network connection of WaterNet due to the weight sharing in the convolution layers. In addition, the information of the neighboring pixels in WaterNet allows the elimination of instrumental errors and the handling of man-made objects in the water bodies during Chl-a concentration estimation.

Comparison of WaterNet and Related Chl-a Concentration Models
Five commonly used Chl-a concentration models based on band combination, namely, two-band model [30], three-band model [29], normalized different chlorophyll index (NDCI) [53], NASA fluorescence line height (FLH) model [54], and NASA OC3E model, were compared with WaterNet. The compared models were defined on the basis of that the difference of two reciprocal spectral reflectance is small such that the absorption by suspended solids and CDOM can be omitted, the total absorption of Chl-a, CDOM, and total suspended solids is nearly zero, and the back-scattering coefficient of Chl-a is spectrally invariant. The five retrieval models were then defined using the available spectral bands in Sentinel-3. The model features for Sentinel-3 are listed in Table 7. The features were further calibrated by a linear regression to convert to the Chl-a concentration, except the OC3E model which was performed by using the fourth-polynomial of the feature as the exponent of power of 10. Table 7. Related Chl-a concentration retrieval models.

Model Name Chl-a Retrieval Model Feature
Three-band model [29] R −1 rs (665) − R −1 rs (709) × R rs (754) Two-band model [30] [R rs (709) ÷ R rs (665) The comparison results in Figure 7 indicate that the five models exhibit similar performances. The RMSEs of the estimated Chl-a concentrations range from 1.28 to 1.62 µg/L, which might be due to similar definitions of the models. In addition, WaterNet demonstrates a more satisfactory performance than the five models. The RMSEs of the estimated Chl-a concentrations decrease to 0.509-0.975 µg/L. In addition to the quantitative analyses, a qualitative comparison using the Chl-a concentration maps for an image acquired on 6 April 2019 was conducted. The results are shown in Figure 8. The Chl-a map generated by WaterNet has a larger Chl-a concentration range than those generated by the five models. In addition, the west region of the West Bay of Laguna Lake, which is close to Manila with a high population density, has a higher Chl-a concentration than the other regions. Moreover, higher Chl-a concentration was also found in the East Bay. This was in line with the study of Herrera et al. [45] which showed similar pattern with the estimation using WaterNet. This visually demonstrates the reasonability of the Chl-a concentration map generated by WaterNet.

Comparison of WaterNet and Related Chl-a Concentration Models
Five commonly used Chl-a concentration models based on band combination, namely, two-band model [30], three-band model [29], normalized different chlorophyll index (NDCI) [53], NASA fluorescence line height (FLH) model [54], and NASA OC3E model, were compared with WaterNet. The compared models were defined on the basis of that the difference of two reciprocal spectral reflectance is small such that the absorption by suspended solids and CDOM can be omitted, the total absorption of Chl-a, CDOM, and total suspended solids is nearly zero, and the back-scattering coefficient of Chl-a is spectrally invariant. The five retrieval models were then defined using the available spectral bands in Sentinel-3. The model features for Sentinel-3 are listed in Table 7. The features were further calibrated by a linear regression to convert to the Chl-a concentration, except the OC3E model which was performed by using the fourth-polynomial of the feature as the exponent of power of 10. Table 7. Related Chl-a concentration retrieval models.

OC3E
(443) (490) (560) The comparison results in Figure 7 indicate that the five models exhibit similar performances. The RMSEs of the estimated Chl-a concentrations range from 1.28 to 1.62 μg/L, which might be due to similar definitions of the models. In addition, WaterNet demonstrates a more satisfactory performance than the five models. The RMSEs of the estimated Chl-a concentrations decrease to 0.509-0.975 μg/L. In addition to the quantitative analyses, a qualitative comparison using the Chl-a concentration maps for an image acquired on 6 April 2019 was conducted. The results are shown in Figure 8. The Chl-a map generated by WaterNet has a larger Chl-a concentration range than those generated by the five models. In addition, the west region of the West Bay of Laguna Lake, which is close to Manila with a high population density, has a higher Chl-a concentration than the other regions. Moreover, higher Chl-a concentration was also found in the East Bay. This was in line with the study of Herrera et al. [45] which showed similar pattern with the estimation using WaterNet. This visually demonstrates the reasonability of the Chl-a concentration map generated by WaterNet.

Conclusions and Future Works
In this study, a novel patch-and CNN-based model called WaterNet was proposed for Chl-a concentration estimation. Instead of a pixel-based neural structure, a 3D convolutional neural structure was used to consider the spectral and spatial information of images in the neural network. In addition, a two-stage training was proposed to overcome the challenge of insufficient in situ Chla samples. In the two-stage training, the use of Chl-a concentration from IRTM-NN proved effective in providing good initial values in the refinement training stage. The qualitative and quantitative comparisons revealed that WaterNet outperformed the related Chl-a concentration models and the feedforward neural network. We conclude that WaterNet can properly model the nonlinear relationships between the remote sensing reflectance of spectral bands in optical satellite images and the Chl-a concentrations in inland water bodies. Due to the limited in situ Chl-a samples, the testing of WaterNet was difficult in the current study. In the future, more in situ samples will be collected from different water bodies. A further testing will be conducted to evaluate the sensitivity of WaterNet to different water bodies, or a site-independence WaterNet will be developed. In addition, other water quality parameters, such as turbidity, will be integrated into WaterNet, and WaterNet will be further applied to other optical satellite images, such as Landsat 8 and Sentinel-2 imagery.

Conclusions and Future Works
In this study, a novel patch-and CNN-based model called WaterNet was proposed for Chl-a concentration estimation. Instead of a pixel-based neural structure, a 3D convolutional neural structure was used to consider the spectral and spatial information of images in the neural network. In addition, a two-stage training was proposed to overcome the challenge of insufficient in situ Chl-a samples. In the two-stage training, the use of Chl-a concentration from IRTM-NN proved effective in providing good initial values in the refinement training stage. The qualitative and quantitative comparisons revealed that WaterNet outperformed the related Chl-a concentration models and the feedforward neural network. We conclude that WaterNet can properly model the nonlinear relationships between the remote sensing reflectance of spectral bands in optical satellite images and the Chl-a concentrations in inland water bodies. Due to the limited in situ Chl-a samples, the testing of WaterNet was difficult in the current study. In the future, more in situ samples will be collected from different water bodies. A further testing will be conducted to evaluate the sensitivity of WaterNet to different water bodies, or a site-independence WaterNet will be developed. In addition, other water quality parameters, such as turbidity, will be integrated into WaterNet, and WaterNet will be further applied to other optical satellite images, such as Landsat 8 and Sentinel-2 imagery.