Spectral Feature Selection Optimization for Water Quality Estimation

The spatial heterogeneity and nonlinearity exhibited by bio-optical relationships in turbid inland waters complicate the retrieval of chlorophyll-a (Chl-a) concentration from multispectral satellite images. Most studies achieved satisfactory Chl-a estimation and focused solely on the spectral regions from near-infrared (NIR) to red spectral bands. However, the optical complexity of turbid waters may vary with locations and seasons, which renders the selection of spectral bands challenging. Accordingly, this study proposes an optimization process utilizing available spectral models to achieve optimal Chl-a retrieval. The method begins with the generation of a set of feature candidates, followed by candidate selection and optimization. Each candidate links to a Chl-a estimation model, including two-band, three-band, and normalized different chlorophyll index models. Moreover, a set of selected candidates using available spectral bands implies an optimal composition of estimation models, which results in an optimal Chl-a estimation. Remote sensing images and in situ Chl-a measurements in Lake Kasumigaura, Japan, are analyzed quantitatively and qualitatively to evaluate the proposed method. Results indicate that the model outperforms related Chl-a estimation models. The root-mean-squared errors of the Chl-a concentration obtained by the resulting model (OptiM-3) improve from 11.95 mg·m−3 to 6.37 mg·m−3, and the Pearson’s correlation coefficients between the predicted and in situ Chl-a improve from 0.56 to 0.89.


Introduction
Detecting drastic changes in water quality is necessary to prevent unexpected environmental incidents. Conventional water sampling methods are reliable but are ineffective in identifying detailed spatial variations of water quality, which renders comprehensive management infeasible [1][2][3]. Remote sensing techniques have been proven effective in the selection of aquaculture sites and the qualitative measurement of regional water parameters, including suspended sediment, chlorophyll-a (Chl-a), and pollutant loads [4][5][6]. Kuhn et al. [7] used Landsat-8 and Sentinel-2 aquatic remote sensing reflectance products to estimate turbidity over the Amazon, Columbia, and Mississippi rivers. The ease of remote sensing techniques relies on the determination of the optical properties of water bodies. Phytoplankton and related materials, such as debris, heterotrophic organisms, and excreted organic matters, dominate the optical properties of waters in deep ocean waters; they are referred to as Case I waters whose optical properties vary with phytoplankton concentration [8]. The ratio of blue and green

Materials and Study Area
Lake Kasumigaura (36 • 09 N, 140 • 14 E) is the second largest lake in Japan a 220 km 2 surface area and 4.0 m average depth. The in situ samples collected by the University of Tsukuba in 2008 and 2010 were utilized, respectively ( Figure 1). The acquisition dates of the in situ samples coincided with those of the medium-resolution imaging spectrometer (MERIS) images. Water sample collections were performed between 10:00 and 14:00 h local time over optically deep waters. They were kept in ice boxes and taken to the laboratory. Chlorophyll-a was extracted using methanol (100%) at 4 • C under dark conditions for 24 h. The optical density of the extracted chlorophyll-a was measured at four wavelengths (750, 663, 645, and 630 nm), and the concentration was calculated according to SCOR-UNESCO equations [29]. Following the sample filtering strategy in [30], several in situ samples were regarded as outliers using the standard deviation of the difference between the actual Chl-a concentration and predicted Chl-a concentration. A total of 26 in situ samples remain after outlier filtering. The descriptive statistics shows extensive variation in the Chl-a concentration ranges, 4.40 (min), 76.90 (max), 62.90 (median) mg·m −3 and 36.60 (min), 83.40 (max), 44.80 (median) mg·m −3 in 2008 and 2010, respectively ( Figure 2). The Chl-a concentration is high in upstream areas where Lake Kasumigaura receives high-turbidity waters from two narrow rivers, including Sakura River and Koise River; Lake Kasumigaura is under the influence of agricultural activities. During the monsoon season, which is generally in May, the fresh water inflow lowers the Chl-a concentrations [31].  The in situ samples were divided into two sets, namely, training and testing. The first set with 10 samples in 2010 was used for feature candidate optimization and training, and the second set with the remaining 10 samples in 2010 and 6 samples in 2008 was used for testing. In addition, The MERIS images were atmospherically corrected using the method in [32]. To ensure that the water pixels were neither mixed with land pixels nor contaminated by clouds, data collected less than one MERIS pixel from the bank and/or covered by clouds were excluded. Moreover, MERIS has 15 narrow spectral bands in the visible and NIR spectral ranges [33]. The reflectances of 14 narrow spectral bands were used for feature generation and selection without considering B 15 (900).

Methods
The workflow of the proposed method consisted of three main steps, namely, feature candidate generation, candidate optimization, and Chl-a retrieval model determination ( Figure 3). The inputs to the method were the remote sensing reflectance R rs (λ) of MERIS images and their corresponding in situ Chl-a measurements. In the first step, a set of feature candidates formed from the two-band, three-band, and NDCI models, was generated. Next, candidate optimization based on neighborhood component analysis [34] was performed in the second step to determine the significances of feature candidates. In the third step, a multivariate linear regression was conducted with the optimal determined features to determine the Chl-a estimation model. Sections 3.1 and 3.2 introduce feature candidate generation and feature optimization, respectively. Section 3.3 presents Chl-a retrieval model determination, mapping, and validation. Figure 3. Procedures of the study, including feature candidate generation from three-band, two band, and the normalized difference chlorophyll index (NDCI), as well as feature optimization and Chl-a retrieval model determination.

Feature Candidate Generation
The candidates were generated from two-band, three-band, and NDCI models. These three models are briefly introduced. The three-band model based on NIR and red spectral bands was proposed by Dall'Olmo and Gitelson [35]. The model is based on the fact that the difference in reciprocal surface reflectance R rs (λ 1 ) −1 and R rs (λ 2 ) −1 on two spectral wavelengths λ 1 and λ 2 must be small to omit the absorption of suspended solids and CDOM. In addition, this model assumes that the total absorption of Chl-a, CDOM, and total suspended solids beyond the spectral wavelength of 730 nm is nearly zero, and the back-scattering coefficient of Chl-a is spectrally invariant. Given these facts and assumptions, the structure of three-band model is defined as Dall'Olmo et al. [12,35] suggested setting the wavelength λ 1 to the red spectral region between 660 and 690 nm to maximize the sensitivity to the changes in Chl-a concentrations, setting the wavelength λ 2 to the range between 690 and 730 nm; the aim is to remove the influence of other absorption factors, such as tripton and CDOM, and locating the wavelength λ 3 in the range between 730 and 760 nm to eliminate misestimation caused by particulate backscattering. The structure of the two-band model is defined as Moses et al. [18] presented another two-band algorithm to retrieve the Chl-a of Case II waters. The model is formulated as R −1 rs (665) × R rs (709) to match the designed bands of the MERIS sensor. The λ 3 is set to 709 nm instead of 753 nm because of the following: (1) The wavelength at 709 nm can better represent the chlorophyll-induced reflectance than that at 753 nm. (2) The magnitude of the water-leaving radiance at 753 nm is lower than that at 709 nm given the increased absorption by water at long wavelengths. Thus, the uncertainties of the atmospheric correction procedure attributed to a low signal-to-noise ratio are less amplified at 709 nm than at 753 nm. (3) λ 3 = 708 nm is close to λ 1 = 665 nm. Thus, the atmospheric effect at 709 nm is closer to that at 665 nm. This characteristic reduces the sensitivity of the two-band model with λ 3 at 709 nm to spectral non-uniform atmospheric effects. Mishra and Mishra [25] developed NDCI to estimate Chl-a concentration in turbid waters. This method utilizes the spectral main absorption peak in the red spectral region at 665 and 709 nm. The NDCI is formulated as the normalized spectral difference between R rs (709) and R rs (665); that is,[R rs (665) − R rs (709)]/[R rs (665) + R rs (709)]. Thus, the measurement form is represented by ( The current study selected the four bands B 7 (665), · · · , B 10 (754) as the common bands as suggested in previous studies [13,18,25,36]. The following were the priority choices: four common spectral regions, namely, the 7th-10th spectral bands of MERIS images with wavelength centers of 665, 681, 709, and 754 nm (denoted as B 7 (665), B 8 (681), B 9 (709), and B 10 (754), respectively), and one band from the remaining bands. In each candidate set, five bands were selected and the feature candidates based on the three models were generated. Table 1 shows the examples of feature candidate generation. A total of 30 possible feature candidates were generated by using the three-band model in Equation (1). A total of 10 possible feature candidates were generated by using the two-band model in Equation (2). A total of 10 possible feature candidates were generated by using the NDCI in Equation (3). In total, 50 possible candidates were generated.

Feature Optimization
Feature optimization is performed to select substantial candidates from the candidate pool C 1 , · · · , C n c (where n c represents the number of candidates) such that the selected candidates are sensitive to the changes in Chl-a concentration and are effective in Chl-a concentration estimation. The candidate sample vector x i : x i,1 , · · · , x i,n c is defined. x i,j belongs to the candidate model C j for the i-th in situ sample (denoted as S i ). The in situ sample S i is represented as a pair (x i , y i ), where y i ∈ R denotes the Chl-a value of sample S i . Candidate selection is based on neighborhood component analysis [34], which is a nonparametric classification and feature selection method. The optimization aims to identify substantial values for each candidate. Given a set of training data T = S 1 : (x 1 , y 1 ), · · · , S n s : (x n s , y n s ) containing n s samples, the optimization aims to find a substantial value for each candidate x. The procedure begins with the selection of a sample from T as the reference sample, which is denoted as S r : (x r , y r ), and the weighted distance is calculated between the reference sample and other samples using where w_dist(x i , x r ) represents the weighted distance between x i and x r , and w j denotes the weight and significance of the feature candidate C j that the optimization wishes to obtain. A leave-one-out strategy is adopted to predict the response for reference x r by using the dataset T − S r : (x r , y r ) ; that is, the training set T excluding the reference sample (x r , y r ), to obtain the weights and to define the objective function of the optimization. Next, the probability of using x i in the prediction of reference x r is defined as measuring a normalized distance between these two samples with a Gaussian kernel function; that is, where g(·) represents the Gaussian kernel function. Given these probabilities, the cost function f r (S 1 , · · · , S n s ) for the reference sample is defined as the summation of the loss caused by the response of the reference sample and that of other samples multiplied by their probability; that is, f r (S 1 , · · · , S n s ) = n s i = 1, i r p ir l(y i , y r ), where l(y i , y r ) represents a loss function that measures the similarity between the response y i in the sample S i and the response y r in the reference sample S r . The loss function is formulated as l(y i , y r ) = y i − y r in the implementation. The overall objective function is obtained by summing the cost function from each reference sample. In addition, a regularization term is introduced to the optimization to avoid overfitting.
The objective function can be formulated combining these two terms. Considering the objective function, the optimal weights are defined as w = arg min w = {w 1 ,··· ,w nc } n s r = 1 f r (S 1 , · · · , S n s ) + α × w 2 1 + · · · + w 2 n c , where α is the parameter for balancing the fitness of the cost functions and the smoothness of the obtained weights. The optimization in Equation (7) is solved to search for the optimal weights w by using the gradient descent method [37], which is a commonly used optimization solver that iteratively moves toward the optimal solution from an initial solution in search space with the aid of the gradient direction of the objective function.

Chl-A Estimation, Mapping, and Validation
After determining the weights in the candidates, the optimal feature can be found. The relation between Chl-a concentrations and the optimal features is identified using the regression model, and the spatial pattern of Chl-a concentration is estimated on the basis of the best regression model with optimal features.
To evaluate the generated and related Chl-a estimation models, the commonly used measurements, including the slope of the regressed line denoted by m, the root-mean-square error (RMSE), and Pearson's correlation coefficient denoted by r between the estimated and measured Chl-a, were adopted as where chla p i and chla m i represent the predicted and measured Chl-a concentration of sample S i , respectively;chla p and chla m denote the average predicted and average measured Chl-a concentration, respectively; and n v represents the number of testing samples. The RMSE indicates the absolute fit of the model to the data; that is, how close the observed data points are to the model's predicted values. The correlation and the slope of the regression line were defined as the statistical association between observation and prediction. The better model exists in the lower RMSE, with a higher correlation between observation and prediction and the 1:1 slope of the regression line between observation and prediction.

Results of Feature Optimization
The Chl-a estimation models were generated by the proposed method from the candidate sets (OptiM-1-OptiM-5) and the related methods from the two-band model [18] (denoted as TwoB-M), three-band model [13] (denoted as ThreeB-G), and NDCI model [25] in Table 2. Table 3 shows the regression models between the Chl-a concentrations and spectral features. The Chl-a in situ samples from Lake Kasumigaura and MERIS images with the same acquisition data were used as test data, and the coefficient of determination R 2 was adopted as the measurement of regression fitness. The R 2 of estimation results from the resulting models are between 0.57 and 0.62, which are superior to those from the two-band model, three-band model, and NDCI model [25] (R 2 = 0.44 − 0.55). Based on these measurements, the optimal model is OptiM-3, which contains two candidates in the form of a three-band model; that is, R −1 rs (665) − R −1 rs (709) × R rs (510) and R −1 rs (665) − R −1 rs (510) × R rs (709). Table 4 shows the model performance comparisons for validation. This result agrees with the conclusions of previous studies [38,39] and indicates that the performance of three-band model is slightly better than that of two-band models and NDCI. In addition, the comparisons show that the RMSE of the best model is 6.37 mg·m −3 (Figure 4). By contrast, the RMSEs of the related previous models (ThreeB-G, TwoB-M, NDCI) are close to 12 mg·m −3 (Figure 4). The combination of these two three-band candidates outperforms the three-band model with optimal bands [13]. Therefore, the obtained model can preserve the characteristics of the three-band model while optimally estimating Chl-a concentrations. Table 2. Proposed and related Chl-a estimation models.

Model Name
Model Feature Table 3. Chl-a estimation models using regression fitness. The intercept and two slopes of the regression lines are denoted as a 0 , a 1 , and a 2 , respectively.

Mapping with Various Spectral Features
Chl-a concentration maps are generated by the resulting model and the related empirical models in Figure 5. Spatial patterns of Chl-a in the four maps are similar. The Chl-a concentration is relatively low in the southern area in the map generated by our model compared with that generated by the compared models, especially the regions near the lake boundaries. In addition, the map generated by our model is spatially smoother than the compared model, and the spatial distribution of Chl-a concentration in our map is more fitted with the result in [40]. Moreover, the Chl-a concentration map can be used to identify the Chl-a hotspot in the lake. For instance, a high Chl-a concentration can be found at the northern part of the lake. The selection of appropriate features is complex and challenging because the changes in the chemical and physical properties of water can lead to different model/feature determination. This study provides an accurate satellite Chl-a model of turbid water by using optimal feature generation and selection based on feature generation from the two-band, three-band, and NDCI models. The regional and spatial information of Chl-a concentration can be generated considering a model with such satellite information.

Discussion
The model for optimal feature selection is based on feature generation from the two-band, three-band, and NDCI models. This study can eventually provide an accurate satellite Chl-a model of turbid productive (Case II) water by conducting empirical and optimal feature generation and selection.
The optical properties in clear waters are controlled by phytoplankton. Chl-a retrieval in clear waters is commonly used at the blue and green spectral regions, whereas Chl-a retrieval in turbid waters shifts from the blue and green to the red and NIR spectral regions to avoid high absorption of CDOM and non-algal particles [41]. However, changes in Chl-a concentration are sensitive at the red region between 660 and 690 nm [13]. The wavelength at 708 nm fully represents the Chl-a-induced reflectance peak in the NIR, whereas the reflectance at 753 nm does not because it mostly depends only on the scattering of suspended particles. The commonly used models [42] consider the following ratios: first, reflectances at the blue region (440-510 nm) within the first peak of strong absorption to reflectances at the green region (550-555 nm) with the minimum absorption [43]; and second, reflectances at the NIR region (685-710 nm) with the minimum absorption to reflectances at the red region (670-675 nm) with the second peak of absorption [44]. In this paper, the features from the models typically include blue, red, and NIR spectral regions and are highly related to reflectances within the first peak of strong absorption at the blue region to reflectances at the second peak absorption and minimum absorption at the red and NIR regions (665 and 709 nm). The features from the existing models correlate with the reflectances at the red and NIR regions at 620, 681, and 709 nm. However, the existing models [13,18,25,45] are between the red and NIR spectral regions. This result matches previous results [13,27], showing that the NIR spectral regions are negligibly affected by the presence of particles and CDOM in the estimation of Chl-a concentrations [28]. The model obtains the three-band features based on our schemes, and its accuracy is higher than those of the existing widely applied empirical algorithms from previous studies. The selection of appropriate features is complex and challenging due to the changes in chemical and physical properties of water.
This study primarily applies feature selection optimization to satellite-based water quality mapping. Selecting the important features in the feature selection algorithm aims to derive accurate predictive models for the estimation of Chl-a concentration. The optimal feature selection is useful for determining site-specific and generally used parameters for Chl-a estimation. From the selected features, the band at 709 nm is commonly selected in the models. The radiance peak at 709 nm in water-leaving radiance, that is, the MERIS maximum chlorophyll index, is extensively used to measure the presence of high Chl-a concentration against a scattering background [13]. Moreover, the Chl-a concentration map can be used to identify the Chl-a hotspot in the lake. The high Chl-a in the water environments becomes warmer in the summer, leading to increased algal growth rates. For example, high Chl-a concentration can be found at the northern part of the lake. The regional and spatial information of Chl-a concentration cannot be generated without such satellite information and modeling. In addition, the selected model will affect the spatial pattern of Chl-a estimation. The Chl-a concentration in the proposed model is lower in the southern area than those in the previous models, especially in the boundary of the lake.

Conclusions
This study provides a systematic approach for water quality estimation based on optimal feature generation and selection and proposes an optimization of feature generation and selection for the determination of a Chl-a concentration model. A set of candidates was generated on the basis of the two-band, three-band, and NDCI models. The optimal model, which consists of one or several candidates with substantial weights, was determined through neighborhood component analysis with an objective function. In situ samples from Lake Kasumigaura, Japan, and MERIS images were used to test the feasibility of the proposed process. The Chl-a concentration estimation performance of the obtained model was compared with that of related models.
The model can successfully estimate Chl-a concentrations from optimal spectral features. However, the geographical and seasonal variations in the environments of turbid inland waters complicate the selection of spectral bands used in the empirical models. The combination of spectral bands is identified as the optimal features using the proposed optimization. Quantitative measurements, including RMSE, r, and m, demonstrate the superiority of the obtained optimal model over the previous related models. In future work, images from Sentinel 3, a successor of MERIS, and additional in situ Chl-a samples will be utilized. Moreover, a nonlinear estimation model will be developed by using an artificial neural network.