Optimal Band Selection for Airborne Hyperspectral Imagery to Retrieve a Wide Range of Cyanobacterial Pigment Concentration Using a Data-Driven Approach

Abstract: Understanding the concentration and distribution of cyanobacteria blooms is an important aspect of managing water quality problems and protecting aquatic ecosystems. Airborne hyperspectral imagery (HSI), which has high temporal, spatial, and spectral resolutions, is widely used to remotely sense cyanobacteria blooms, and it provides the distribution of a bloom over a wide area. In this study, we determined the input spectral bands that were relevant for effectively estimating the two main pigments of cyanobacteria (PC, phycocyanin; Chl-a, chlorophyll-a) by applying data-driven algorithms to HSI, and then evaluated the change in the spatio-temporal distribution of cyanobacteria. The input variables for the algorithms consisted of reflectance band ratios associated with the optical properties of PC and Chl-a, which were calculated from hyperspectral bands selected using a feature selection method. The selected input variables were composed of six reflectance bands (465.7–589.6, 603.6–631.8, 641.2–655.35, 664.8–679.0, 698.0–712.3, and 731.4–784.1 nm). The artificial neural network showed the best results for the estimation of the two pigments, with average coefficients of determination of 0.80 and 0.74. This study proposes relevant input spectral information and an algorithm that can effectively detect the occurrence of cyanobacteria in the weir pool along the Geum River, South Korea. The algorithm is expected to help establish a preemptive response to the formation of cyanobacterial blooms, and to contribute to the preparation of suitable water quality management plans for freshwater environments.


Introduction
Cyanobacterial blooms, also named algal blooms, are massive growths of cyanobacteria in eutrophic lakes or rivers with low flow rates, which change the color of the water surface to green. Recently, the duration, intensity, and frequency of algal blooms have gradually been increasing due to global warming, urbanization, and changes in precipitation patterns [1]. Excessive occurrence of algae poses environmental, societal, and economic threats due to the resulting damage to aquatic ecosystems and to the health of the aquatic animals and plants therein, as well as increased water purification costs due to toxins associated with harmful algal blooms (HABs) [2,3]. Understanding the patterns of distribution of
fluorescence of PC and Chl-a [25,26]. The spectral wavelengths used to estimate PC or Chl-a vary according to the sites and concentration levels. In a high Chl-a concentration condition, the peak at 550 to 580 nm is slightly shifted to higher wavelengths, and the magnitude of the peak near 700 nm is significantly higher than it is for the low condition. Moreover, the magnitude of peaks is shifted depending on the spatial distribution of cyanobacterial density [25,27].
Therefore, the objectives of this study are to (a) determine relevant spectral bands within the range of 400-800 nm for use in empirical algorithms to effectively retrieve two pigments, (b) apply the selected input spectral reflectance data to empirical algorithms such as data-driven models, (c) evaluate the performance of the algorithms, and (d) suggest the best input spectral band combination and an algorithm that shows good performance for understanding the characteristics of the spatial distribution in a weir pool.

Study Area
The Baekje Weir (BJW) in South Korea was constructed in October 2011 to secure water storage in the Geum River and to reduce flood and drought damage (Figure 1). The weir is located 60 km upstream from the river estuary, and its basin area is 7976 km²; its width and height are 311 m (movable section 120 m, fixed section 191 m) and 5.5 m, respectively. The manageable water level and water storage capacity are 4.2 m and 24.2 × 10⁶ m³, respectively. Since the construction of the weir, cyanobacteria blooms have become a major concern due to continued drought, increases in water temperature, accumulation of nutrient loads, and increased residence time. Algae alerts, which are based on increases in the density of cyanobacteria, were issued steadily from 2012 to 2018.

Data Acquisition
A total of nine monitoring events were conducted to collect water samples, field hyperspectral reflectance, and airborne hyperspectral imagery during the 2-year period from 2016 to 2017 in the BJW (Figure 1 and Table 1). Water sampling and field hyperspectral measurements were performed concurrently with image acquisition by an aircraft equipped with a hyperspectral sensor. Water samples were analyzed for PC, Chl-a, and suspended solids.

Hyperspectral Imagery
Hyperspectral sensors measure the reflectance of inland water over hundreds of spectral bands. High-spectral-resolution water surface reflectance is ideal for the development and validation of PC algorithms [28,29]. The hyperspectral image sensor (AISA Eagle, SPECIM Inc., Oulu, Finland) has a spectral range of 400 to 970 nm and a spectral resolution of 4-5 nm. The spatial resolution of the hyperspectral imagery taken from the aircraft is 2 m, and the average river width in the 23 km upstream reach where the observations were conducted is 310 m (150 cells). Field hyperspectral reflectance data were collected using a hand-held spectroradiometer (FieldSpec HandHeld2, ASD, Inc., Longmont, CO, USA). The field hyperspectral measurements include water surface reflectance, sky radiance, and irradiance [30]. Geometric correction and atmospheric correction using the field hyperspectral data and MODTRAN 6 software were performed to improve the surface reflectance of the captured hyperspectral images [31].

Chl-a and PC Sampling
Concentrations of chlorophyll-a and phycocyanin were measured for a total of 134 samples collected through in situ monitoring (Table 1). The concentration of Chl-a was measured using the following equation [32]: where $V_1$ is the volume of supernatant, $V_2$ is the total volume of filtered sample, and $a$ is the absorbance of the supernatant at each wavelength. The supernatant was extracted using the solvent-extraction method and centrifugation. PC was measured in a laboratory experiment using the following equation: where $a$ is the absorbance of the supernatant at each wavelength. The supernatant was extracted by physical force using the freezing and thawing method and centrifugation [33,34].
The absorbance of the sample was measured using a Cary 5000 UV-vis-NIR Spectrophotometer (Agilent Inc., Santa Clara, CA, USA) that provided a wavelength range from 200 to 3300 nm.

Selection of Input Bands
Remote Sens. 2022, 14, 1754
Input data for the PC and Chl-a optical models were extracted from the hyperspectral imagery (HSI) at the same locations at which water sampling had been conducted (Figure 2). Band selection was performed using random forest feature selection (RFFS) for more efficient and improved estimation, from HSI, of Chl-a and PC concentrations that vary in time and space (the theoretical background for random forest is described in Section 2.4.2). After calculating the importance of each variable in a regression problem, RFFS removes insignificant variables. It is particularly effective for dimensionality reduction of data with hundreds of contiguous bands, such as hyperspectral images [35,36]. Further, reflectance at 400 to 450 nm and above 900 nm was removed due to the influence of atmospheric scattering, absorption by colored dissolved organic matter, and noise [37-39]. In addition, RFFS was performed for three concentration sections for each of PC and Chl-a. The input variables consisted of a total of nine band ratios formed from the key feature reflectances of the two cyanobacterial pigments (PC and Chl-a): $R_{pp}/R_{pa}$, $R_{cp}/R_{pa}$, $R_{gp}/R_{pa}$, $R_{pp}/R_{ca}$, $R_{cp}/R_{ca}$, $R_{gp}/R_{ca}$, $R_{pp}/R_{wa}$, $R_{cp}/R_{wa}$, and $R_{gp}/R_{wa}$. Band ratio data were used to reduce the effects of atmosphere and irradiance [20,40-42].
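The two steps described above, ranking bands by random forest importance and forming the nine peak/absorption ratios, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the helper names `band_ratios` and `rank_bands`, the number of trees, and the reflectance values are all assumptions, and the threshold used to discard "insignificant" bands is left open because the text does not state it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def band_ratios(peaks, absorptions):
    """Return every peak/absorption ratio, keyed as 'peak/absorption'."""
    return {f"{p}/{a}": peaks[p] / absorptions[a]
            for a in absorptions for p in peaks}


def rank_bands(reflectance, pigment, n_trees=200, seed=0):
    """Rank spectral bands by random forest importance, most important first."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    rf.fit(reflectance, pigment)
    importance = rf.feature_importances_
    return np.argsort(importance)[::-1], importance


# Synthetic stand-in: 134 samples x 20 bands, with the pigment driven by band 7.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.01, 0.10, size=(134, 20))
pigment = 500.0 * reflectance[:, 7] + rng.normal(0.0, 0.5, size=134)
order, importance = rank_bands(reflectance, pigment)

# Hypothetical feature reflectances for one pixel (illustrative values only).
peaks = {"R_pp": np.array([0.04]), "R_cp": np.array([0.05]), "R_gp": np.array([0.06])}
absorptions = {"R_pa": np.array([0.03]), "R_ca": np.array([0.02]), "R_wa": np.array([0.01])}
ratios = band_ratios(peaks, absorptions)
```

Here `order[0]` is the index of the most important band, and `ratios` holds the nine inputs in the same order as listed in the text.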

Development of Optical Models to Retrieve Pigments
Six regression models were used to predict the concentrations of the two pigments (i.e., PC and Chl-a) for cyanobacteria detection. The models include Partial Least Squares (PLS), tree-based ensemble regressions (Random Forest (RF) and Gradient Boosting (GB)), Support Vector Machine (SVM), K-Nearest Neighbor regression (KNN), and Artificial Neural Network (ANN). All models were implemented with the open-source Python libraries scikit-learn and Keras.

Partial Least Squares
PLS regression generalizes and combines the features of principal component analysis and multiple regression; it predicts dependent variables by extracting from the predictors a set of orthogonal factors, called latent variables, that have the best predictive power [43,44]. The PLS regression predictor is:

$$Y = X\beta + \varepsilon$$

where $Y$ is the vector of dependent (predicted) variables, $X$ is the matrix of independent (predictor) variables, $\beta$ is the matrix of regression coefficients, and $\varepsilon$ is the matrix of errors.

Tree-Based Ensemble Regression
Random Forest (RF) regression is an ensemble learning method that combines a set of regression trees. Ensemble models aim to reduce the bias and/or variance of such weak learners by combining several of them to create a strong learner that achieves better performance. RF can handle many data sets in a short time, and it has both high prediction accuracy and the capability of determining variable importance [45,46]. The RF regression predictor is:

$$Y_{pr} = \frac{1}{N} \sum_{i=1}^{N} T_i(x)$$

where $Y_{pr}$ is the predicted variable, $N$ is the number of regression trees, $T_i(x)$ is the result of the $i$-th regression tree, and $x$ is the vector of input variables. RF increases tree diversity by growing trees from different training data sets that are generated through a procedure called bagging; this avoids correlation between trees. Bagging generates training data by randomly resampling the original data set.

Gradient Boosting regression tree (GB) is a tree ensemble that applies boosting instead of bagging: it sequentially creates a new regression tree that minimizes the residual of the existing trees. The sequential tree creation process is a form of gradient descent; i.e., a new tree is added at each step to minimize the loss function [47].
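The averaging form of the RF predictor can be checked directly with scikit-learn, whose `RandomForestRegressor` exposes its fitted trees through the `estimators_` attribute. The sketch below uses synthetic data and an arbitrary tree count; both are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for nine band ratios and one pigment concentration.
rng = np.random.default_rng(2)
X = rng.uniform(size=(134, 9))
y = 10.0 * X[:, 0] + rng.normal(0.0, 0.1, size=134)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Evaluate every regression tree separately, then average over the N trees.
tree_outputs = np.stack([tree.predict(X) for tree in rf.estimators_])
y_pr = tree_outputs.mean(axis=0)
```

The manual average `y_pr` agrees with `rf.predict(X)`, which is the mean of the individual tree outputs.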

K-Nearest Neighbors Regression
K-nearest neighbors (KNN) regression is a nonparametric regression method and a kind of instance-based lazy learning algorithm. Nonparametric regression can learn complex target functions quickly without loss of information, because no assumptions are made about the distribution of the data during the training phase. KNN predicts from the weighted average of the k nearest neighbors, with weights given by the inverse of their distances [48]. The method calculates the Euclidean distance from the input data to the existing data and then sorts the existing data by distance. Next, the inverse-distance-weighted average of the K nearest neighbors is calculated, and the number K of nearest neighbors is optimized to minimize the loss function.
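The prediction rule just described (Euclidean distance, sort, inverse-distance-weighted average of the K nearest neighbors) can be written out by hand and compared with scikit-learn's `KNeighborsRegressor` using `weights="distance"`. The function name `knn_predict`, the value K = 5, and the synthetic data are assumptions; the sketch also assumes that the query point does not coincide with a training point, since that would give a zero distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def knn_predict(X_train, y_train, x_query, k=5):
    """Inverse-distance-weighted average of the k nearest training targets."""
    dist = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(dist)[:k]                     # indices of the k closest samples
    weights = 1.0 / dist[nearest]                      # assumes no zero distance
    return np.sum(weights * y_train[nearest]) / np.sum(weights)


rng = np.random.default_rng(3)
X_train = rng.uniform(size=(94, 9))
y_train = rng.uniform(0.0, 50.0, size=94)
x_query = rng.uniform(size=9)

manual = knn_predict(X_train, y_train, x_query, k=5)
model = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_train, y_train)
reference = model.predict(x_query[None, :])[0]
```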

Support Vector Machine
Support Vector Machine (SVM) is a machine learning method that is specialized in pattern recognition for classification, but it can also be employed in regression analysis of nonlinear and high-dimensional data using kernel functions [49]. The principle of SVM is to (1) create a linear non-probabilistic decision boundary that can classify data sets, (2) calculate the perpendicular distance (margin) between a support vector (a vector that determines a decision boundary) and the decision boundary, and (3) maximize the margin by adjusting the boundary. Support Vector Regression (SVR) applies an ε-insensitive loss function to the SVM [50,51]. As a result, data located closer than ε to the regression function incur no penalty, while data located at a distance greater than ε are penalized. In the same way as SVM, the optimization process determines a regression function that maximizes the margin while minimizing the penalty.
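The ε-insensitive loss can be illustrated in a few lines of NumPy. The function name and the value ε = 0.1 are assumptions for the example; residuals inside the ε tube cost nothing, and residuals outside it are penalized linearly by the excess.

```python
import numpy as np


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the excess distance outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.05, 1.3, 0.5])   # residuals of 0.05, 0.3, and 0.5
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
```

The first prediction lies within ε of the target and incurs no loss, while the other two are penalized by 0.2 and 0.4, respectively.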

Artificial Neural Network
Artificial Neural Network (ANN) is a machine learning model based on the information processing system of the biological brain [52]. It is mainly composed of an input layer, a hidden layer, and an output layer; here, a layer is a collection of nodes called artificial neurons. ANN is widely applied to nonlinear function approximation problems, and it can create large-scale parallel networks to explore the characteristics of data and learn relationships directly from the data [53]. Further, it is not very sensitive to noise in the data. In this study, a single-layer perceptron was constructed to predict PC and Chl-a concentrations.
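To make the layer structure concrete, the following sketch computes the forward pass of a network with one hidden layer in plain NumPy. It is not the Keras model used in the study: the tanh activation, the 16 hidden nodes, the random weights, and the choice of two outputs (one per pigment) are all assumptions made for illustration.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """Input layer -> hidden layer (tanh) -> linear output layer."""
    hidden = np.tanh(x @ W1 + b1)
    return hidden @ W2 + b2


rng = np.random.default_rng(4)
n_inputs, n_hidden, n_outputs = 9, 16, 2   # nine band ratios in, two pigments out (assumed)
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b2 = np.array([0.5, 0.3])

x = rng.normal(size=(5, n_inputs))         # five standardized input samples
out = forward(x, W1, b1, W2, b2)
```

Training would adjust `W1`, `b1`, `W2`, and `b2` to minimize the loss; only the forward computation is shown here.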

Regression Model Optimization
The following parameters were optimized for each regression technique: tree ensembles (minimum samples per leaf, tree depth, number of estimators, and minimum ratio of samples for a split), SVM (kernel function, C, and gamma), KNN (number of neighbors, weight, and leaf size), and the ANN single-layer perceptron structure (activation functions of the input and output layers, and number of nodes). The input data of the models were divided into training and test sets in a 7:3 ratio; 94 samples were used for training and 40 for testing. The input data (water surface reflectance) were normalized with StandardScaler using the mean and standard deviation, while the output data (PC and Chl-a concentrations) were normalized with MinMaxScaler to a range of 0 to 1. The training loss function of the models was set to the Mean Square Error (MSE), and parameter optimization was performed for each of the six models using 100 randomly selected data sets. For more detailed descriptions of the parameters and libraries used in this study, refer to Pedregosa et al. [54].
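The split and scaling steps can be sketched with scikit-learn as follows, on synthetic data. Two details are assumptions rather than statements from the text: the scalers are fitted on the training set only, and `test_size=40` is passed as an integer because a fractional `test_size=0.3` on 134 samples would round to 41 test samples rather than the 40 reported.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in: 134 samples, nine band ratios, one pigment concentration.
rng = np.random.default_rng(5)
X = rng.uniform(0.5, 2.0, size=(134, 9))
y = rng.uniform(0.0, 100.0, size=(134, 1))

# 94 training samples and 40 test samples, as reported in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=40, random_state=0)

# Inputs: zero mean and unit variance. Outputs: rescaled to the range 0 to 1.
x_scaler = StandardScaler().fit(X_train)
y_scaler = MinMaxScaler().fit(y_train)
X_train_s = x_scaler.transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.transform(y_train)
```

Predictions made on the scaled outputs can be mapped back to concentrations with `y_scaler.inverse_transform`.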

Performance Evaluation Parameters
The model performances were evaluated using three statistics: the coefficient of determination ($R^2$), Nash-Sutcliffe efficiency (NSE), and root mean square error (RMSE). The statistics are computed as shown in the following equations:

$$R^2 = \frac{\left[\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})\right]^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2 \sum_{i=1}^{n} (P_i - \bar{P})^2}$$

$$NSE = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^2}$$

where $O_i$ is the observed value for algae concentration (Chl-a, PC); $P_i$ is the estimated value for algae concentration; $\bar{O}$ and $\bar{P}$ are the averages of the observed and estimated values, respectively; and $n$ is the total number of data. $R^2$ expresses how well the estimated values of the model describe the observed values, and it ranges from 0 to 1; a value close to 1 indicates the best-fit model, and values greater than 0.5 are typically considered acceptable [55,56]. NSE is used to evaluate the efficiency of the model. NSE ranges from −∞ to 1, and an NSE of 1 means that the observed and estimated values match perfectly. Further, NSE values greater than 0 indicate that the values calculated from the model are acceptable, whereas values less than 0 are judged to be unacceptable [57]. RMSE measures the error between the observed and simulated values. The closer RMSE is to 0, the less error there is in the model result. If the RMSE is less than half of the standard deviation of the observed values, the performance of the model is considered suitable [58].
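The three statistics can be computed with a few NumPy functions. This is a sketch under the assumption that $R^2$ is the squared correlation between observed and estimated values, which is consistent with the stated range of 0 to 1 and with the use of both averages; the function names and the four-point example are illustrative only.

```python
import numpy as np


def r_squared(obs, est):
    """Squared correlation between observed and estimated values."""
    d_obs, d_est = obs - obs.mean(), est - est.mean()
    return np.sum(d_obs * d_est) ** 2 / (np.sum(d_obs ** 2) * np.sum(d_est ** 2))


def nse(obs, est):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    return 1.0 - np.sum((obs - est) ** 2) / np.sum((obs - obs.mean()) ** 2)


def rmse(obs, est):
    """Root mean square error between observed and estimated values."""
    return np.sqrt(np.mean((obs - est) ** 2))


obs = np.array([1.0, 2.0, 3.0, 4.0])
est = obs + 0.5                      # estimates with a constant bias of 0.5

scores = {"R2": r_squared(obs, est), "NSE": nse(obs, est), "RMSE": rmse(obs, est)}
rmse_is_suitable = scores["RMSE"] < 0.5 * obs.std()
```

In this example the constant bias leaves $R^2$ at 1 but lowers NSE to 0.8 and gives an RMSE of 0.5, which shows why the three statistics are reported together.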

Band Selection
The ranges of concentrations obtained from 134 samples were (0. 19 Based on the RFFS result, six band ranges were selected whose reflectance changed sensitively with the concentration conditions (PC, Chl-a) (Tables 2 and 3). For each band, the maximum or minimum value within the range was used to account for the shift in the position of the peak or absorption feature with cyanobacterial biomass and suspended solids (SS) concentration (Figure 2).

Model Development
To develop optical models to retrieve pigments, six regression models (PLS, RF, GB, SVM, KNN, and ANN) and three input variable cases were compared. Band ratio data were used to reduce the effects of atmosphere and irradiance. Table 4 summarizes the training and validation statistics of each case. In Cases 1 and 2, the Chl-a and PC concentrations were estimated using the spectral reflectance bands that were sensitive to absorption by each pigment, and in Case 3, the two pigments were estimated using all spectral reflectance bands that were sensitive to absorption by both Chl-a and PC. When the bands of Chl-a and PC were considered together (Case 3), the resulting performance was higher than in the other cases and, in particular, the Chl-a estimation performance was significantly improved in all models. In the comparison between PC estimation models in
According to the comparison between observed and estimated pigment concentrations in Case 3, RF and GB tended to underestimate PC in the high-concentration range, despite their high statistical scores in PC estimation (Figure 3b,c). On the other hand, ANN showed good overall performance for all concentration ranges (Figure 3f). In the Chl-a estimation of PLS, RF, and GB, the distributions of the points showed weak linear relationships between observed and estimated values (Figure 3a-c). In the cases of RF, GB, and KNN, the estimated values did not exceed a certain value (see the red-dotted circle in Figure 3b,c,e). ANN also had good regression performance for Chl-a estimation, but it was lower than the corresponding performance for PC estimation. From Table 4 and Figure 3, ANN was the most suitable regression model for cyanobacteria estimation using the nine band ratios considered in this work.
Additional analysis was performed to confirm how the two bands that are rarely used for inland cyanobacteria estimation affect the estimation performance of the optical model. To determine the effects of $R_{gp}$ and $R_{wa}$, the ANN algorithm with all bands applied (ANN_O; Table 4, Case 3) was compared with the ANN algorithm for three cases (Table 5): (A) $R_{wa}$ removed, (B) $R_{gp}$ removed, and (C) both $R_{wa}$ and $R_{gp}$ removed from ANN_O. In PC estimation, the $R^2$ of each case in the training step was almost the same as that of ANN_O, but in the validation step it decreased by 0.11 on average compared to the original value of 0.79. The average performance of the three cases in estimating Chl-a decreased to 0.56 in both training and validation.
Comparing the results of the three cases, the $R_{gp}$ and $R_{wa}$ band reflectances have nearly similar effects on the ANN algorithm, and the two bands have a great influence on the algorithm, particularly in terms of validation.
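The three reduced input sets can be enumerated with a short sketch. It assumes that removing a band removes every ratio that contains it, which the text does not state explicitly; the names `ALL_RATIOS` and `drop_bands` are hypothetical.

```python
PEAKS = ("R_pp", "R_cp", "R_gp")
ABSORPTIONS = ("R_pa", "R_ca", "R_wa")

# The nine peak/absorption ratios used as the full ANN input.
ALL_RATIOS = [(p, a) for a in ABSORPTIONS for p in PEAKS]


def drop_bands(ratios, removed):
    """Keep only the ratios that use none of the removed bands."""
    return [(p, a) for p, a in ratios if p not in removed and a not in removed]


case_a = drop_bands(ALL_RATIOS, {"R_wa"})           # water absorption band removed
case_b = drop_bands(ALL_RATIOS, {"R_gp"})           # green peak band removed
case_c = drop_bands(ALL_RATIOS, {"R_wa", "R_gp"})   # both bands removed
```

Under this assumption, cases A and B each retain six ratios and case C retains four.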

Table 5 notes. A: reflectance of the water absorption band (731.42-784.11 nm) removed from the original ANN input data. B: reflectance of the green peak band (465.74-589.58 nm) removed from the original ANN input data. C: reflectance of both the water absorption and green peak bands removed from the original ANN input data.

Baekje Weir Algae Spatial Distribution Generation
The concentrations of the two pigments were estimated by dividing the study area into seven parts for a more granular study. Z1 is the area adjacent to the weir dam, Z2 and Z3 are tributary-affected areas, Z4 and Z6 are river bends, and Z5 and Z7 are locations for checking the effects of changes in flow velocity with cross-sectional area (Figure 4). The spatial distribution characteristics of algae (cyanobacteria) in the BJW were analyzed with the six selected models for the hyperspectral image of 12 August 2016 (Z1), at which time cyanobacteria were predominant (Figure 5). In terms of PC spatial distribution, ANN and PLS estimated the distribution of high PC concentration near the weir more precisely, while the other four models underestimated the PC concentration in the same region. Similarly, ANN, PLS, and KNN performed well for Chl-a spatial distribution; however, in the case of PLS, the Chl-a distribution near the bridge 1 km away from the weir (Z2, Z3) was underestimated.
The spatio-temporal distribution changes of PC and Chl-a over nine sampling days were estimated using the ANN model, which showed good performance in both the statistical values and the spatial distribution estimation. Figures 6 and 7 show the results, and Appendix A shows the detailed concentration values. The maximum PC and Chl-a concentrations were observed on 12 August 2016, and high concentrations of the two pigments remained until 24 August 2016. Chl-a was observed at concentrations above 18.05 mg/m³ over the entire monitoring period, but PC was only observed on specific dates (August 2016; September-October 2017) or in specific sections. The concentrations of the two pigments tended to decrease in the upstream direction, and the change in concentration was greater for PC than for Chl-a.
The overall PC concentration adjacent to the weir dam was high (12 August 2016), and the PC concentration of Z1 (43.80 mg/m³) was higher than those of Z2 and Z3, which were 34.11 and 30.88 mg/m³, respectively. On the other hand, under the low PC concentration condition (15 September 2017), the concentrations of Z2 and Z3 (8.49 and 9.28 mg/m³, respectively) were both higher than that of Z1 (1.22 mg/m³). Chl-a was generally distributed throughout the entire upstream reach with an average concentration of 29.42 mg/m³. Unlike Chl-a, high PC concentrations were locally observed in Z2, Z3, Z5, and Z6 in August 2016.
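Zone-averaged concentrations such as those quoted above can be obtained from an estimated concentration map and a zone label raster. The sketch below is a minimal illustration with a hypothetical 2 × 3 map and three zones; the function name `zone_means`, the label encoding (integers 1 to n), and the values are assumptions and do not reproduce the study's data.

```python
import numpy as np


def zone_means(concentration, zones, n_zones=7):
    """Mean estimated concentration of the pixels that belong to each zone."""
    return {f"Z{k}": float(concentration[zones == k].mean())
            for k in range(1, n_zones + 1)}


# Hypothetical estimated PC map (mg/m³) and zone labels for the same pixels.
concentration = np.array([[40.0, 44.0, 30.0],
                          [35.0, 33.0, 31.0]])
zones = np.array([[1, 1, 3],
                  [2, 2, 3]])

means = zone_means(concentration, zones, n_zones=3)
```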

Band Selection for Inland Cyanobacteria Pigments
Band selection is known to critically affect the performance of optical algorithms for estimation [59]. The six selected spectral bands included the feature reflectance of cyanobacteria and the characteristic reflectance of water constituents (algae, inorganic particles, and suspended solids) (Table 3). The first band, called the green peak ($R_{gp}$), appears at around 460-590 nm due to light reflection by algal cells and the low absorption of Chl-a [25,60]. The bands at 600-630 and 660-680 nm showed minimum reflectance, which was attributed to strong absorption by PC (absorption reflectance of PC: $R_{pa}$) and Chl-a (absorption reflectance of Chl-a: $R_{ca}$), respectively. The reflectance peak at 640-660 nm appears to be due to the fluorescence characteristics of PC (peak reflectance of PC: $R_{pp}$), while the peak at 700-712 nm is a relative peak caused by Chl-a absorption (peak reflectance of Chl-a: $R_{cp}$). The last band at 730-784 nm ($R_{wa}$) was affected by scattering from inorganic particles in the water [61-63]. The four bands in the range of 600-712 nm contain most of the reflectance used in previous studies that estimated the concentrations of PC and Chl-a in inland waters [20-26]. The two additional reflectance bands ($R_{gp}$ and $R_{wa}$) improved the performance of the optical algorithm by minimizing the influence of inorganic particles in inland water and removing interference from suspended particles in the green peak reflectance (Table 5) [61,64]. The empirical method using machine learning is generally a powerful method for estimating a wide range of variables, as it does not require prior understanding of the complex interactions between water and light reflection. However, the performance of this method varies substantially depending on the water quality conditions, locations, and variable ranges from which the data were obtained [65].
Therefore, the optical algorithms in this study obtained high regression performance with the six reflectance band ranges, which can correct for the reflection interference effect.

Cyanobacteria Optical Algorithm Specialized for BJW
The ANN model showed the highest performance in estimating the two pigments. Previous studies that estimated the two pigments can be compared in terms of the input spectral bands used and the performance of the model. Song et al. [66] estimated the PC concentrations of Central Indiana, USA, and South Australia using reflectance bands at 620 to 630, 685 to 700, and around 555 to 625 nm obtained from Sentinel-3/OLCI and Hyperion satellite data. Partial least squares-ANN and a three-band model were applied for PC estimation, yielding $R^2$ values of 0.84-0.98. Chang and Vannah [67] estimated the microcystin concentration in Lake Erie using Landsat and MODIS bands (Landsat: 1-5, 7; MODIS: 1-4, 6, 7). Six reflectance bands obtained from the two satellites were used to construct ANN and genetic programming models; the resulting $R^2$ values for the two machine learning models were 0.53 and 0.60, respectively. He et al. [68] estimated the Chl-a concentration in the Gulf of St. Lawrence using 10 bands of MODIS reflectance: 412, 443, 469, 488, 531, 547, 555, 645, 667, and 678 nm. The performances of five models (SVM, ANN, GB, RF, and MLR) were compared using $R^2$ values. Among the models, SVM showed the highest $R^2$ values of 0.71 and 0.91 in the training and validation steps, respectively. Zhou et al. [69] estimated the Chl-a concentration in Dianshan Lake using principal component analysis-ANN with in situ hyperspectral surface reflectance. The $R^2$ values were 0.85 and 0.64 in the training and validation steps, respectively. The ANN model developed in this study to estimate PC and Chl-a showed significantly improved performance over the other regression models and offered similar or better performance than the algorithms described in previous studies.

Spatio-Temporal Distribution Characteristics of Cyanobacteria in BJW
In South Korea, cyanobacterial blooms have intermittently occurred in rivers and reservoirs. However, after the Four Major Rivers Project (2010-2011), many weirs were installed in the major rivers, such as the Nakdong, Geum, and Yeongsan Rivers, and cyanobacterial blooms subsequently formed frequently, causing various water quality problems in the weir pools [70]. In the study area, i.e., the BJW of the Geum River, algal blooms caused by cyanobacteria have occurred frequently every summer. The maximum air temperatures on the sampling dates (12 August 2016, 24 August 2016, 22 September 2017, and 28 October 2017) when the two cyanobacteria pigments were dominantly distributed in the BJW were 36.2, 34.4, 27.0, and 23.8 °C, respectively (Figure A1). The algal growth rate is sensitive to temperature conditions: cyanobacteria in fresh water show optimal growth at about 30 °C, while other chlorophytes or diatoms show optimal growth at around 20-30 °C [71-74]. In the spatio-temporal change of the two pigments, PC was more sensitive to temperature than Chl-a. In August 2016 (the air and water temperatures were 34.4-36.2 and 30.6-30.9 °C, respectively), high PC concentrations were widely observed in the BJW under continuous water discharge due to relatively high cyanobacterial growth rates. However, considerably lower PC concentrations were observed when the water temperature decreased (23.0 °C), owing to reduced growth rates and continuous wash-out by the discharge. Berg and Sutula [75] also reported that if sufficient N and P are provided to allow cyanobacteria to grow, the temperature has the greatest effect on the growth rate.
Compared to PC, the Chl-a concentration remained relatively high regardless of sampling date (e.g., 14 October 2016 vs. 24 August 2016). Based on changes in cyanobacterial and diatom cell density in the BJW [70], the relatively high Chl-a and low PC concentrations in September may result from a transition in the phytoplankton community [76]. Therefore, Chl-a is assumed to show no noteworthy change in concentration compared to PC. The spatial distributions of PC and Chl-a showed similar patterns when massive cyanobacteria dominated upstream of the BJW. In addition, the two pigments were distributed in various patterns in different sections due to the influence of hydrodynamic and tributary factors. Due to velocity differences, high concentrations were found at the insides of the river bends (Z4, Z6), in large river cross-sections, and at the river boundary, rather than at the center of the river (Z5, Z7). Cyanobacteria grow actively where the flow velocity is low and the residence time is long [77-79]. Comparing Z2 and Z3, Z2 showed higher concentrations due to the tributary river, advective transport, and accumulation of biomass at the weir.

Conclusions
Based on nine hyperspectral and water quality monitoring campaigns, this study selected major spectral bands related to the effective retrieval of cyanobacteria pigments (i.e., PC, Chl-a) from hyperspectral reflectance data and developed data-driven algorithms for the remote sensing of cyanobacteria. Six reflectance band ranges were selected by random forest feature selection (RFFS) while considering the peak and absorption reflectance as affected by PC, Chl-a, SS, and water characteristics. Band ratio models that used the reflectance sensitive to both PC and Chl-a performed better in estimating each pigment than individual models that used only the reflectance sensitive to that pigment. This result shows that the reflectance specific to the two pigments may be applied simultaneously to construct retrieval models. That is, the selection of relevant reflectance may be critical to the retrieval models, and reflectance bands should be investigated in terms of pigment sensitivity and improvement of model performance. Overall, this study identified reflectance band ranges that account for the interference effects of various water characteristics in hyperspectral reflectance imagery, and it therefore provides a useful method for constructing a retrieval model with which to estimate the spatio-temporal concentrations of the two main cyanobacteria pigments (PC and Chl-a). It is expected that future reliable models derived from this study can support the development of efficient management practices for mitigating algal blooms.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.