Multi-Reservoir Water Quality Mapping from Remote Sensing Using Spatial Regression

Regional water quality mapping is the key practical issue in environmental monitoring. Global regression models transform measured spectral image data to water quality information without the consideration of spatially varying functions. However, it is extremely difficult to find a unified mapping algorithm in multiple reservoirs and lakes. The local model of water quality mapping can estimate water quality parameters effectively in multiple reservoirs using spatial regression. Experiments indicate that both models provide fine water quality mapping in low chlorophyll-a (Chla) concentration water (study area 1; root mean square error, RMSE: 0.435 and 0.413 mg m−3 in the best global and local models), whereas the local model provides better goodnessof-fit between the observed and derived Chla concentrations, especially in high-variance Chla concentration water (study area 2; RMSE: 20.75 and 6.49 mg m−3 in the best global and local models). In-situ water quality samples are collected and correlated with water surface reflectance derived from Sentinel-2 images. The blue-green band ratio and Maximum Chlorophyll Index (MCI)/Fluorescence Line Height (FLH) are feasible for estimating the Chla concentration in these waterbodies. Considering spatially-varying functions, the local model offers a robust approach for estimating the spatial patterns of Chla concentration in multiple reservoirs. The local model of water quality mapping can greatly improve the estimation accuracy in high-variance Chla concentration waters in multiple reservoirs.


Introduction
Remotely sensed data can help decision makers to monitor the water quality of regional waterbodies more effectively. Satellite remote sensing is a highly cost-effective approach for determining the spatio-temporal variation in the water quality parameters of seas and large inland waters [1,2] when compared to traditional in situ samplings. Remote sensing techniques have been widely applied to measure the qualitative water parameters, i.e., suspended sediments, chlorophyll-a (Chla), colored dissolved organic matter (CDOM), and pollutants in ocean and inland water [3,4]. Numerous studies applied Sentinel-2 images for water quality mapping in regional waterbodies [5][6][7]. Sentinel-2 with a multi-spectral imager (MSI) is an optical imager with 13 spectral bands spanning from the blue to the shortwave infrared (SWIR), with 10, 20, or 60 m ground resolution. Sentinel-2 imagery has a high potential for monitoring Chla in coastal and inland waters using its red edge (705 nm) and the red bands (665 nm) [7]. However, the selection of a Chla estimation algorithm depends on the optical properties of ocean, coastal or inland waters. The band algorithms for Chla estimation require each type of water to be evaluated. Figure 1 shows the in situ sampling sites in the two study areas. Reservoirs (Tsengwen in Chiayi, Nanhua, and Wushantou in Tainan) and (Agondian, Chengcing lake, and Fongshan in Kaohsiung) are used as a water supply for study area 1 and 2. The Taiwan Environmental Protection Administration (TWEPA) regularly monitors the water quality throughout Taiwan. The water quality time series in the reservoirs were collected from these agencies, respectively (https://wq.epa.gov.tw/EWQP/zh/ConService/DownLoad/ HistoryData.aspx, accessed on 4 June 2021). The recorded Chla data were collected from these stations with a monthly data sampling frequency. In these ground-truth data, 14 observations are in study area 1 on 3 December 2019, whereas 11 observations are in study area 2 on 10-12 February 2020. Moreover, the turbidity in all reservoirs is low in the study area (Tsengwen: 3.8, Nanhua: 5.6, Wushantou: 5.3, Agondian: 3.2, Chengcing lake: 7.0, and Fongshan: 5.5 NTU (nephelometric turbidity unit)). lake: 7.0, and Fongshan: 5.5 NTU (nephelometric turbidity unit)).
For this study, the Sentinel-2 images were used on December 2, 2019 (same acquisition time with in situ data) and February 10, 2020, in study areas 1 and 2. The images were accessed from the Google Earth engine with low cloud-cover images (12% and 8%) in study areas 1 and 2. The 10-m resolution images (bands at 443, 490, 560, 665, 705, and 740 nm) were used. The Sentinel-2 level 2A products from European Space Agency, ESA's Scihub were used. This product is a Bottom Of Atmosphere (BOA) or Surface Reflectance (SR) images derived from the associated Level-1C products through the sen2cor processor to produce atmospherically corrected images [16]. It is represented in UINT16 data format and scaled by 10,000 for a real SR value.

Method
The main steps involved calibrating the model and estimating the accurate map of Chla ( Figure 2). In the preprocessing of data, the collated data pairs from Sentinel-2 images and sampling points are generated (step 1). Subsequently, the specific band ratio from satellite images is used as inputs for regression models to estimate Chla concentrations (step 2). The global and local models (linear and spatial regressions) are applied for water quality regression and mapping (step 3). To evaluate the model performance, the commonly-used measure, RMSE, is used based on the estimated values at the sampling points and the observations (step 4). Using the model with the best band ratio, the spatial patterns of Chla concentrations can be identified from the Sentinel-2 image data. Eventually, the water quality maps in the multiple reservoirs can be determined by the masking (step 5). For this study, the Sentinel-2 images were used on 2 December 2019 (same acquisition time with in situ data) and 10 February 2020, in study areas 1 and 2. The images were accessed from the Google Earth engine with low cloud-cover images (12% and 8%) in study areas 1 and 2. The 10-m resolution images (bands at 443, 490, 560, 665, 705, and 740 nm) were used. The Sentinel-2 level 2A products from European Space Agency, ESA's Scihub were used. This product is a Bottom Of Atmosphere (BOA) or Surface Reflectance (SR) images derived from the associated Level-1C products through the sen2cor processor to produce atmospherically corrected images [16]. It is represented in UINT16 data format and scaled by 10,000 for a real SR value.

Method
The main steps involved calibrating the model and estimating the accurate map of Chla ( Figure 2). In the preprocessing of data, the collated data pairs from Sentinel-2 images and sampling points are generated (step 1). Subsequently, the specific band ratio from satellite images is used as inputs for regression models to estimate Chla concentrations (step 2). The global and local models (linear and spatial regressions) are applied for water quality regression and mapping (step 3). To evaluate the model performance, the commonly-used measure, RMSE, is used based on the estimated values at the sampling points and the observations (step 4). Using the model with the best band ratio, the spatial patterns of Chla concentrations can be identified from the Sentinel-2 image data. Eventually, the water quality maps in the multiple reservoirs can be determined by the masking (step 5).

Global and Local Models
The satellite-based water quality regression model identifies the relation between Chla concentration (Chla i ) and the above-specified band ratio (R i ) in observation i. The global model in multiple reservoirs (linear regression) is used to estimate the Chla at observation i, which can be expressed as: where β 0 is the intercept; β 1 is the slope of the linear regression parameters; ε i is the residual of the regression model. The local model of water quality mapping using spatial regression can estimate water quality effectively because it is extremely difficult to find a nonspatial mapping algorithm in multiple reservoirs and lakes. The local model (spatial regression) is further extended to allow for a spatially varying function for estimating Chla as: where the parameter derived from the closed-form solution of ridge regression and spatially weighted regression: is a spatial weight matrix, which is formulated from the Gaussian and Euclidean distance functions. In spatial regression, the Gaussian decay-based function commonly used as a kernel is defined as e −(D ij /h) 2 , where h is the non-negative bandwidth [25]. The parameter D ij is the distance between the observed points i and j in the space domain, In Equation (11), I is the unity matrix.
The best lambda (λ) for this study, is defined as the lambda that minimize the estimation error and variance. Lambda was used as 10 −5 .

Validation, Mapping, and Masking
Based on the process (Figure 2), the approaches be can then be applied to estimate the Chla map in the multiple reservoirs. The model performance of the water quality estimation is analyzed to compare observation versus estimation from the models using RMSEs. After the water quality mapping, this study considered the normalized difference water index (NDWI) from green and NIR bands as the masking. The NDWI is developed to delineate water features. Moreover, the NDWI is used about 0.2 as the criterion. The non-water area can be removed.

Model Performance in Study Area 1
In Table 1, the RMSEs in the global models in regression are 0.434-0.612 mg m −3 . The results show that the model performance is slightly various when compared with the various band ratio algorithm. Moreover, the RMSEs show slight differences when considering the outlier on/off in study area 1. The outlier here is more than three scaled median absolute deviations (MAD) away from the median. From the global model, the minimum RMSE of Chla is using the MCI (Equation (7)) with outlier on. In addition, Table 1 also shows the RMSE of the local model in study area 1. The model performances are similar when considering global and local models in low Chla concentration water. The local model RMSEs of Chla are better slightly when compared to the global model. The minimum RMSE is using the FLH in Equation (6). In this case, the local model considers the same regression parameters at all locations due to the consistent relation between Chla and band ratios in the study area. Therefore, the local model in a homogenous system can be simplified to the global model.  Figure 3 shows the Chla concentration in the best local model in study area 1. From the map, Chla concentration varies with space but is lower than 5.5 mg m −3 . Detailed spatial Chla pattern was shown in the various reservoirs. For example, the high Chla concentration is upstream in the Tengwen reservoir. The water quality map can be identified based on sampling data and remote sensing. The resulting map can provide the regional information of Chla concentration for water quality management.

Model Performance in Study Area 1
In Table 1, the RMSEs in the global models in regression are 0.434-0.612 mg m −3 . The results show that the model performance is slightly various when compared with the various band ratio algorithm. Moreover, the RMSEs show slight differences when considering the outlier on/off in study area 1. The outlier here is more than three scaled median absolute deviations (MAD) away from the median. From the global model, the minimum RMSE of Chla is using the MCI (Equation (7)) with outlier on. In addition, Table 1 also shows the RMSE of the local model in study area 1. The model performances are similar when considering global and local models in low Chla concentration water. The local model RMSEs of Chla are better slightly when compared to the global model. The minimum RMSE is using the FLH in Equation (6). In this case, the local model considers the same regression parameters at all locations due to the consistent relation between Chla and band ratios in the study area. Therefore, the local model in a homogenous system can be simplified to the global model. Figure 3 shows the Chla concentration in the best local model in study area 1. From the map, Chla concentration varies with space but is lower than 5.5 mg m −3 . Detailed spatial Chla pattern was shown in the various reservoirs. For example, the high Chla concentration is upstream in the Tengwen reservoir. The water quality map can be identified based on sampling data and remote sensing. The resulting map can provide the regional information of Chla concentration for water quality management.    Table 2 shows the model performance in global and local models in regression in study area 2. The RMSEs of the local model are generally lower than that of the global  (Figure 4), the performance of the local model is better than that of the global model in the high-variance data. The spatial regression model that considers spatial heterogeneity can sufficiently explain the spatial relation between Chla concentrations and spectral band ratios [1,26]. In the local model, Equation (2) is the best predictor, i.e., blue and green band ratio in the study area 2. Using this local model, the RMSE reduced a lot from 42.29 to 6.49 mg m −3 when compared to the global model. Figure 5 shows the Chla maps in the reservoirs at study area 2 using the global and local models with the Equation (2). From the results of the global model, the Chla concentration map shows the inaccurate spatial patterns in the multiple reservoirs. From the results of the local model, the Chla concentration map clearly shows the low concentration in the Agondian reservoir and high concentration in the Chengcing lake and Fongshan reservoirs. The results from the local model correspond to the actual conditions. For example, in the Chengcing lake, the higher concentration Chla occurs in the west and east-northern area, but lower concentration towards the west-southern. Figure 6 shows the spatial pattern of residuals in both models. The model residuals in the global model are mostly larger than those in the local model (mean: 3.6 and 0.7 mg m −3 and standard deviation: 44.2 and 6.8 mg m −3 in the global and local models). The estimated Chla is overestimated in the Agondian and Chengcing lake reservoirs but is underestimated in the Fongshan reservoir in the global model. The local model can adopt the model coefficients in space, and it is more suitable in multiple-reservoir water quality estimation.  Table 2 shows the model performance in global and local models in regression in study area 2. The RMSEs of the local model are generally lower than that of the global model. The RMSE results from the local models are between 6.49 and 12.56 mg m −3 , which are superior to those in the global model. From the plots of estimations and observations in Equation (2) (Figure 4), the performance of the local model is better than that of the global model in the high-variance data. The spatial regression model that considers spatial heterogeneity can sufficiently explain the spatial relation between Chla concentrations and spectral band ratios [1,26]. In the local model, Equation (2) is the best predictor, i.e., blue and green band ratio in the study area 2. Using this local model, the RMSE reduced a lot from 42.29 to 6.49 mg m −3 when compared to the global model. Figure 5 shows the Chla maps in the reservoirs at study area 2 using the global and local models with the Equation (2). From the results of the global model, the Chla concentration map shows the inaccurate spatial patterns in the multiple reservoirs. From the results of the local model, the Chla concentration map clearly shows the low concentration in the Agondian reservoir and high concentration in the Chengcing lake and Fongshan reservoirs. The results from the local model correspond to the actual conditions. For example, in the Chengcing lake, the higher concentration Chla occurs in the west and east-northern area, but lower concentration towards the west-southern. Figure 6 shows the spatial pattern of residuals in both models. The model residuals in the global model are mostly larger than those in the local model (mean: 3.6 and 0.7 mg m −3 and standard deviation: 44.2 and 6.8 mg m −3 in the global and local models). The estimated Chla is overestimated in the Agondian and Chengcing lake reservoirs but is underestimated in the Fongshan reservoir in the global model. The local model can adopt the model coefficients in space, and it is more suitable in multiple-reservoir water quality estimation.

Sensitivity Analysis of Lambda in the Local Model
The RMSE varies slightly with various lambda in Table 3. The RMSEs of the models show 6.49-7.82 mg m −3 with the lambda from 10 to 0.1. In addition, the result also shows that the estimation variance (standard deviation of regional estimated concentrations as an index) varies slightly. The model is a robust way to monitor the spatial variability in water quality when the model RMSE and standard deviation of regional estimated concentrations are lower. Therefore, the suitable lambda was used as 10 in this study.

Sensitivity Analysis of Lambda in the Local Model
The RMSE varies slightly with various lambda in Table 3. The RMSEs of the models show 6.49-7.82 mg m −3 with the lambda from 10 to 0.1. In addition, the result also shows that the estimation variance (standard deviation of regional estimated concentrations as an index) varies slightly. The model is a robust way to monitor the spatial variability in water quality when the model RMSE and standard deviation of regional estimated concentrations are lower. Therefore, the suitable lambda was used as 10 in this study.

Sensitivity Analysis of Lambda in the Local Model
The RMSE varies slightly with various lambda in Table 3. The RMSEs of the models show 6.49-7.82 mg m −3 with the lambda from 10 −5 to 0.1. In addition, the result also shows that the estimation variance (standard deviation of regional estimated concentrations as an index) varies slightly. The model is a robust way to monitor the spatial variability in water quality when the model RMSE and standard deviation of regional estimated concentrations are lower. Therefore, the suitable lambda was used as 10 −5 in this study.  Table 4 shows the correlation coefficients between observations and estimations in the global model. The correlation coefficients between observations and estimations for study area 1 are from 0.27 to 0.75, whereas the correlation coefficients are from 0.34 to 0.90 for study area 2. The correlation coefficients are 0.75 and 0.90 for study areas 1 and 2 in the best global model. However, the correlation coefficients are higher (0.79 and 0.99) for study areas 1 and 2 in the best local model. In the global models, the best band ratios are Equation (7) (MCI) and Equation (3) (green-red band ratio) for study areas 1 and 2. Figure 7 shows the plots between the Chla concentrations and band ratios in Equation (7) and Equation (3) for study areas 1 and 2. Without the high performance in the local model, the global model with the best band ratio also provided reliable information for water quality estimation.   Table 4 shows the correlation coefficients between observations and estimations in the global model. The correlation coefficients between observations and estimations for study area 1 are from 0.27 to 0.75, whereas the correlation coefficients are from 0.34 to 0.90 for study area 2. The correlation coefficients are 0.75 and 0.90 for study areas 1 and 2 in the best global model. However, the correlation coefficients are higher (0.79 and 0.99) for study areas 1 and 2 in the best local model. In the global models, the best band ratios are Equation (7) (MCI) and Equation (3) (green-red band ratio) for study areas 1 and 2. Figure  7 shows the plots between the Chla concentrations and band ratios in Equation (7) and Equation (3) for study areas 1 and 2. Without the high performance in the local model, the global model with the best band ratio also provided reliable information for water quality estimation.  7. The best global models using band ratios Equation (7) and Equation (3) for study areas 1 (a) and 2 (b). Figure 7. The best global models using band ratios Equation (7) and Equation (3) for study areas 1 (a) and 2 (b).

Effect of Band Ratio
In the Chla estimation from remote sensing, the commonly used algorithms are based on the ratios of (1) reflectances at the blue region between 440 and 510 nm to ones at the green region between 550 and 555 nm; (2) reflectances at the red region between 685 and 710 nm to ones between 670 and 675 nm; and (3) reflectances at the green region between 550 and 555 nm to ones at the red region between 670 and 675 nm [5]. The blue and green spectral regions of reflectances are commonly used to assess Chla in Case 1 water, where the optical properties were determined by phytoplankton and related matters. In addition, the Case 2 water algorithms, such as red-NIR band ratio, FLH, and MCI, are commonly used to estimate Chla concentrations in turbid water [27,28].
In this study, the best band ratio is the blue-green band and the MCI for the red-NIR band in high and low Chla areas in the local model. In addition, the best global model is the green-red band in a high Chla concentration of water. The result matches the previous studies [5,28,29]. Most algorithms for Chla estimation in waters have been based on the principles of water absorption that a high content of Chla leads to an increase in water absorption of 443 nm and near 675 nm [5,30]. Chla mapping in clear waters is commonly used at the blue and green spectral bands because the optical properties in clear waters are controlled by phytoplankton, whereas Chla mapping in turbid waters shifts from the blue and green to the red and NIR spectral bands or the green-to-red band to avoid high absorption of non-algal particles [5,29,31]. With the low Chla and turbid concentration water environment (area 1), the turbid becomes relatively high impacts in the study area. Therefore, the MCI or FLH was suitable as the independent variable for study area 1.

Global and Local Models
Most global models reported in the literature are developed using a least-squares approach in which a model form is selected, and a least-squares fit is used to determine model coefficients [11]. The local model for spatial regression can overcome the spatial data uncertainty and heterogeneous conditions in water quality mapping [1], especially in high-variance concentrations in multiple reservoirs. The bio-optical relationships exhibit nonlinear and spatially heterogeneous patterns, e.g., in turbid water [32]. In mapping the water quality of the multiple reservoirs, the sampling data from the various reservoirs can be used simultaneously. However, considering sufficient sampling data can be helpful for model calibration. The observed water quality data are generally few for the mapping because water quality observation surveying is usually time-consuming and costly. Even as neural network and machine learning approaches are becoming popular [6,22,33], they still usually require far more data than the local model used herein. The local model without large datasets is a reliable and superior tool for estimating the nearly real-time spatial distribution of Chla in the multiple reservoirs. Since these data-driven models integrate in situ sampling and remote sensing data when establishing a model, they typically generate results with higher accuracy than the conventional spectral or bio-optical models [22].
Spatial interpolation for water quality mapping, such as inverse distance weighting, explicitly implements the assumption that the observations close together are more alike than those that are farther away [34,35]. Without considering the information of remote sensing, the spatial pattern of the Chla map (the result of spatial interpolation) is smoother [36]. Therefore, the water quality details can be observed using the spectral information of remote sensing, especially inland water.

Limitation
This data-driven model of water quality mapping in multiple reservoirs should be implemented each time but will be effective based on cloud computing, the smartphone camera, and UAV data [22,37,38]. In addition, the quantity of observation data is important but is a limitation if few measurements are available in a reservoir. The effectiveness of the method could be compromised with a combined small number of data from multiple reservoirs. The local model, e.g., spatial regression approach, does not require a large number of data pairs, but it needs real-time calibration when considering new data pairs. Further investigation is required to determine the various band ratio model for each reservoir to assure model effectiveness. Then, the critical factors will be considered in the future. Future studies will focus on anomaly detection in water quality mapping. This will be achieved by providing robust early warning models in the system. We will consider the long-term water quality mapping using the models.

Conclusions
This study offered an effective approach for exploring the spatial patterns of Chla using in situ datasets in multiple reservoirs. In situ sampling data are generally limited in water quality monitoring. This study used a satellite-based water quality mapping scheme based on open data of Sentinel-2 images and ground-based observations in multiple reservoirs.
The spatial regression provided better goodness-of-fit and spatially-varying coefficients in multi-reservoirs when compared with the global model, especially in highvariance Chla concentration waters. The local model is a local weighting estimator to fit easily between remote sensing and in situ observations. The local model provided reliable information on the spatial patterns of water quality and identified the nonlinear and spatially varying relationship between the in situ and remote sensing observations. When fitting the local model with observed data, the regional variations in water quality can be sufficiently exploited. Moreover, the local model offered a robust approach for estimating the spatial patterns of Chla concentrations in low and high Chla concentration waters after testing from various band ratios, model parameters (lambda), and outlier on/off. Moreover, the local model performed reliably using the blue-green band ratio and the MCI/ FLH in the study areas. The best model input was the blue-green band in Case I water (low-turbid water). However, the model structure can still be improved. For instance, a blue-green band ratio and MCI/ FLH will be considered simultaneously. The spatial regression approach did not require a large number of data pairs, but it needed real-time calibration when considering new datasets. Furthermore, the stochastic model for the analysis of variance vs. scale will be considered [39]. In addition, the modified NDWI (mNDWI) will be applied because the SWIR band is less sensitive to concentrations of sediments [40].