1. Introduction
Since the eutrophication of inland lakes appeared in the 1930s, about 40–50% of lakes and reservoirs in the world have been affected by eutrophication to different degrees. Lake eutrophication has become one of the most intractable water environment problems [
1]. Eutrophication of water bodies will cause the rapid growth of phytoplankton, especially those algae with floating or moving ability, as well as over-propagation of algae which forms algal blooms on the water surface. Cyanobacteria blooms will excessively consume oxygen in water, which may lead to the death of aquatic plants and fish [
2]. Harmful algal blooms (HABs) endanger aquatic organisms and water ecosystems, and in humans they can cause effects ranging from mild skin irritation to serious diseases that pose public health risks [
3]. Therefore, timely detection of cyanobacteria bloom outbreaks in water bodies is important for prevention and management.
Currently, the remote sensing monitoring of cyanobacteria mainly involves the spatial characteristics of harmful cyanobacteria and remote sensing inversion of the phytoplankton pigment concentration, including chlorophyll-a (chl-a) and phycocyanin (PC). The main basis is that the outbreak of cyanobacteria will cause changes in physical properties such as water color and transparency, and then lead to changes in the spectral reflection characteristics of water bodies [
4]. The data sources mainly include GF-1, Sentinel-3A, Landsat 8, MODIS, AVHRR, and other multisource satellite remote sensing data. The study areas are mainly inland lakes such as Taihu Lake in China and Lake Erie in North America [
5].
In research on the spatial characteristics of harmful cyanobacteria, there are two main aspects: false color composite maps composed of appropriate spectral bands and water color indices. On the false color composite map composed of appropriate spectral bands, the water bloom is different from clean water bodies, turbid water bodies, and clouds. Single-band threshold [
6], D-value algorithms [
7], and ratioing algorithms [
8], which are relatively simple, can roughly identify water blooms. Water color indices also provide a new approach to this problem [
9]. Among the many water color indices, the normalized difference vegetation index (NDVI) method has high accuracy in monitoring low concentrations of cyanobacteria [
10,
11]. The enhanced vegetation index (EVI) method can effectively restrain the interference of background water and sediment [
12]. The baseline subtraction method used by the floating algae index (FAI) can effectively remove cloud and geometry contamination and is more stable for extracting a long-term series of cyanobacteria blooms [
13,
14]. In addition, the maximum characteristic peak height (MPH) and maximum chlorophyll index (MCI) are also applied in water bloom monitoring [
15,
16]. These findings show that research on the spatial characteristics of harmful cyanobacteria can indicate the extent of cyanobacteria blooms, but the results are affected by clouds, aquatic plants, and turbid water [
17,
18].
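As a minimal sketch of how an index of this kind is computed, the snippet below evaluates the standard NDVI formula, (NIR − R) / (NIR + R), per pixel and flags pixels above a cutoff. The function names, the nested-list image layout, and the threshold of 0.0 are assumptions made for illustration; they are not taken from the cited studies.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for pixels with no signal
    return (nir - red) / total


def bloom_mask(nir_band, red_band, threshold=0.0):
    """Flag pixels whose NDVI exceeds a (hypothetical) threshold.

    nir_band and red_band are equally sized 2-D lists of reflectance values.
    """
    return [
        [ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Illustrative reflectance values only, not measured data.
nir = [[0.05, 0.40], [0.30, 0.02]]
red = [[0.10, 0.10], [0.10, 0.08]]
mask = bloom_mask(nir, red)
```

In practice the threshold would have to be calibrated for the sensor and the water body, which is one reason the index-based methods remain sensitive to clouds, aquatic plants, and turbid water.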
In research on remote sensing inversion of the phytoplankton pigment concentration, the single-band threshold [
19], band interpolation algorithms, and band ratio algorithms [
20] of chl-a are simple and suitable for areas with high chlorophyll concentrations. The three-band method [
21,
22] and four-band method [
23] weaken the influence of turbid water on chl-a, and improve the accuracy of the model. Neural network algorithms have high accuracy, but they need large data support [
24,
25]. For phycocyanin, remote sensing inversion uses mostly empirical models, semi-analytical models, and so on [
26,
27,
28,
29]. Currently, the remote sensing inversion of the cyanobacteria pigment concentration is mainly based on empirical algorithms and semi-empirical and semi-analytical algorithms of multispectral and hyperspectral data [
30,
31].
With the development of artificial intelligence, machine learning has become an important technology in water quality inversion, with algorithms such as Multiple Linear Regression (MLR), Support Vector Machine (SVM), Extreme Learning Machine (ELM), Long Short-Term Memory (LSTM), Artificial Neural Network (ANN), Backpropagation Neural Network (BP-NN), CatBoost (CB), and Random Forest (RF) [
32]. In chlorophyll remote sensing inversion, SVM and ELM algorithms outperform CB [
33], BP-NN [
34], RF [
35], and other algorithms [
36,
37,
38,
39]. SVM is robust and captures nonlinear relationships well, and it has also produced satisfactory results in salinity estimation [
40,
41]. ELM has a simple working principle and high computational efficiency. Especially in the estimation of organic carbon, ELM has good reliability and accuracy [
42]. In the estimation of suspended solids, total nitrogen, and total phosphorus, the SVM and MLR algorithms perform better than ANN and other algorithms [
43,
44,
45,
46]. In water temperature estimation, MLR produced promising outcomes [
47,
48]. MLR is a flexible method of data analysis that may be appropriate whenever a quantitative variable is to be examined in relationship to any other factors and it has been widely used in various fields. LSTM has good performance in long-term water quality monitoring [
49,
50]. However, the performance of these algorithms in estimating cyanobacteria concentrations is still unknown. Therefore, we chose the MLR, SVM, LSTM, and ELM algorithms to establish models for estimating the concentration of cyanobacteria from multispectral data in this study, and compared their applicability to cyanobacteria concentration prediction.
The spatial characteristics of harmful cyanobacteria can reflect the spatial distribution of cyanobacteria and other phytoplankton, but it is difficult to quantitatively evaluate the concentration of cyanobacteria in water [
51,
52]. The remote sensing inversion of phytoplankton pigment concentrations can reflect the degree of eutrophication of water bodies. However, it cannot represent the specific concentration of cyanobacteria, because the pigment signal is also affected by other phytoplankton.
Cyanobacteria treatment stations have been built at Erhai Lake. They purify the water by physically adsorbing cyanobacteria with an adsorbent and then filtering mechanically. The required dosage and proportion of adsorbent differ with the cyanobacterial concentration being treated. Therefore, it is necessary to identify and analyze the concentration of cyanobacteria in the water to improve treatment efficiency and avoid waste and pollution.
Given the above, this study monitored cyanobacterial concentrations and multispectral data in Erhai Lake. The regression prediction model between multispectral data and cyanobacterial concentrations was established by MLR, SVR, LSTM, and ELM. This study provides a theoretical basis for the rapid and efficient treatment of cyanobacteria.
2. Materials and Methods
2.1. Study Area
The study area is located in the northwest of Erhai Lake. Erhai Lake, the seventh largest freshwater lake in China, is located in Dali, Yunnan Province (25°36′–25°58′ N, 100°06′–100°18′ E, 1972 m above sea level). Its water surface covers approximately 251 km², and it is shallow, with a mean water depth of 10.5 m. It is a national nature reserve and the only centralized water source in Dali, supplying drinking water to more than 600,000 residents and a large number of tourists, so its water quality is closely tied to the social and economic development of Dali. From 1999 to 2014, cyanobacteria blooms occurred many times in summer and autumn in Erhai Lake, mostly as small-scale blooms (bloom area within 10 km²). Large-scale blooms mainly occurred in 2003, 2006, and 2013; the 2006 bloom was the largest, reaching 42 km². The nearshore lake bay area is prone to cyanobacterial accumulation, and the large-scale cyanobacteria blooms mainly occur in the northern area of the lake. The accumulation of cyanobacteria in the nearshore area starts in spring, and blooms in the central water area occur in late summer and autumn (August–November), with large-scale blooms mainly occurring in October.
Given the distribution of cyanobacteria blooms in previous years, the blooms on the northern shore of Erhai Lake are relatively serious. Therefore, 10 sampling and monitoring points were selected 100 m from the shore at intervals of 120 m, forming a sampling belt along the coast. The specific area is shown in
Figure 1.
2.2. Field Data Collection
From September to November 2021, field data collection and sampling analysis were conducted. The main objective was to obtain multispectral data and lake water samples from the 10 monitoring points and analyze them in the laboratory. Monitoring covered a total of 60 days. The experiment was conducted 18 times; on 2 occasions rain, strong wind, and other factors made the experiment unsafe, so on-site sampling was impossible. In addition, some images were inaccurate because the UAV was affected by wind during flight or because ships were on the water surface. In the end, 135 sets of complete, valid data were obtained.
The multispectral camera carried by DJI·P4M was used for image acquisition in this experiment. The camera has six lenses: one RGB visible-light lens and five single-band lenses for the R, G, B, RE (red edge), and NIR (near-infrared) channels. The parameters of the multispectral camera are shown in
Table 1, and the camera itself is shown in
Figure 2.
At each selected sampling point, a multispectral image of the water surface was taken vertically downward from 10 m above the point (photo resolution: 1600 × 1300), giving a ground resolution of approximately 0.53 cm/pixel.
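The per-pixel figure quoted above can be checked with a short calculation. The sketch below assumes the relation GSD = H / 18.9 cm/pixel, which is the ground sampling distance formula recalled from DJI's published specification for this camera; the constant and the function names are assumptions for illustration and do not come from this paper.

```python
GSD_DIVISOR = 18.9  # assumed constant from the manufacturer's spec: GSD = H / 18.9 cm/pixel


def ground_sampling_distance_cm(height_m):
    """Ground size of one pixel (cm) for a nadir image taken at height_m metres."""
    return height_m / GSD_DIVISOR


def footprint_m(height_m, width_px=1600, height_px=1300):
    """Approximate ground footprint (metres) of one image at the given height."""
    gsd_m = ground_sampling_distance_cm(height_m) / 100.0
    return width_px * gsd_m, height_px * gsd_m


gsd = ground_sampling_distance_cm(10.0)  # about 0.53 cm/pixel at 10 m
width_m, length_m = footprint_m(10.0)    # roughly 8.5 m by 6.9 m of water surface
```

Under this assumption, each 1600 × 1300 image taken at 10 m covers only a few metres of water around the sampling point, which is consistent with pairing one image with one water sample.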
The experiment was designed so that sampling and testing took place every three days. Each experiment started at 1 p.m., when the temperature is highest and the light is strongest. Multispectral images were taken at each sampling point at 1:00 p.m., and water samples were collected at 2:00 p.m. at a depth of 20 cm with a water sampler. Water was taken 10 times near each sampling point, for a total of 10 L; after mixing, 500 mL was taken for detection. Immediately after collection, the water samples were sent to the laboratory to test the cyanobacterial concentration, chlorophyll concentration, turbidity, temperature, and other parameters, so that detection was completed before 7 p.m. on the same day.
2.3. Data Preprocessing
For the water samples, the cyanobacterial concentration was analyzed with the cyanobacteria concentration module of the HX-200 multi-parameter controller produced by Beijing Hongxinhengce Technology Co., Ltd. (Beijing, China). The module exploits the absorption and reflection peaks in the spectrum of cyanobacteria. Monochromatic light at the absorption peak band of the cyanobacteria spectrum is emitted into the water; the cyanobacteria in the water absorb the energy of this light and give off monochromatic light of another wavelength. Because the intensity of this light is directly proportional to the cyanobacteria content of the water, it is used to determine the cyanobacterial concentration.
MATLAB was used to extract the gray value of an area without strong reflection in the middle of the multispectral image of each band. The multispectral images are a group of images in different bands at the same position, so the coordinates of the extracted pixels are the same across all images in a group. The gray values extracted from the images in the different bands are taken as the independent variables in the regression model.
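The extraction step can be sketched as follows. The study used MATLAB; this Python version is only an illustration, and the window size, the use of a mean over the window, and all names are assumptions rather than the authors' procedure. The point it shows is that the same pixel coordinates are read from every band of one image group.

```python
def central_mean(band, half_size=1):
    """Mean gray value of a square window centred on the image.

    band is a 2-D list of gray values; the window spans
    (2 * half_size + 1) pixels in each direction.
    """
    rows, cols = len(band), len(band[0])
    r0, c0 = rows // 2, cols // 2
    window = [
        band[r][c]
        for r in range(r0 - half_size, r0 + half_size + 1)
        for c in range(c0 - half_size, c0 + half_size + 1)
    ]
    return sum(window) / len(window)


def extract_features(bands, half_size=1):
    """Read the same central window from every band of one image group."""
    return {name: central_mean(image, half_size) for name, image in bands.items()}


# Tiny synthetic 5 x 5 "images" standing in for two of the five bands.
demo = [[r * 5 + c for c in range(5)] for r in range(5)]
features = extract_features({"RE": demo, "NIR": [[100] * 5 for _ in range(5)]})
```

A real workflow would also need to check that the chosen window is free of glare before averaging, which this sketch does not attempt.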
2.4. Modeling Techniques
2.4.1. Multivariable Linear Regression
MLR is a statistical analysis technique to find the functional relationship between multiple independent variables and a dependent variable. It can estimate the model parameters by the least square method, find the function by minimizing the sum of squares of errors, and solve the coefficient matrix by matrix operation. MLR is the most widely used regression model. Its prediction model is as follows:
$$y = \mathbf{x}^{\top}\boldsymbol{\beta} + \varepsilon$$
where $y$ is the predicted value, $\mathbf{x}$ is the input argument vector, $\boldsymbol{\beta}$ is the vector of coefficients, and $\varepsilon$ is the random error term, $\varepsilon \sim N(0, \sigma^{2})$. The error term should be independent and have a normal distribution [
53].
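The least-squares estimation described above can be sketched in a few lines. The version below solves the normal equations with a small Gaussian elimination routine so that it needs no external libraries; the function names and the toy data are assumptions for illustration, and the intercept is handled by prepending a constant 1 to each input vector.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(X, y):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    design = [[1.0] + list(row) for row in X]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict_mlr(beta, X):
    """Apply fitted coefficients to new input vectors."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], row)) for row in X]


# Toy data generated from y = 1 + 2*x1 + 3*x2 (not measured values).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
beta = fit_mlr(X, y)
```

For real data a library routine based on QR or SVD is usually preferable, since the normal equations become ill-conditioned when band values are strongly correlated.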
2.4.2. Support Vector Regression
SVR is the implementation of support vector machine (SVM) in regression. SVM is a kind of generalized linear classifier for binary classification of data according to supervised learning, and its decision boundary is the maximum margin hyperplane for solving learning samples [
54]. SVM uses the hinge loss function to calculate the empirical risk and adds the regularization term to the solution system to optimize the structural risk. It is a sparse and robust classifier. SVM can be used for nonlinear classification by the kernel method, which is one of the common kernel learning methods.
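The description above is of SVM in its classification form. In the regression variant, the hinge loss is replaced by the ε-insensitive loss, which ignores errors smaller than a tolerance ε. The sketch below only contrasts the two loss functions; the function names and the value ε = 0.1 are illustrative assumptions, and it is not a full SVR solver.

```python
def hinge_loss(label, score):
    """Classification loss: label is +1 or -1, score is the raw decision value."""
    return max(0.0, 1.0 - label * score)


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Regression loss used by SVR: errors within epsilon cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


inside_tube = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05 is within the tolerance
outside_tube = epsilon_insensitive_loss(1.0, 1.5)   # only the part of the error beyond 0.1 counts
```

Because points inside the ε-tube contribute no loss, only the samples on or outside the tube become support vectors, which is what keeps the regression model sparse.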
2.4.3. Long Short-Term Memory
LSTM is a commonly used recurrent neural network (RNN). Compared with a standard RNN, its essential feature is the introduction of the cell state, which determines which information is kept and which is forgotten. This addresses the vanishing gradient problem of RNNs [
55]. The LSTM network has three gates in the hidden layer (input gates, output gates, and forget gates). Input gates control the input flow of the memory cell, and output gates control the output flow into other cells. The role of forget gates is to selectively forget the information in the state of the cell.
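How the three gates act on the cell state can be illustrated with a single-unit LSTM cell. This is a minimal sketch of the standard LSTM update equations with scalar weights; the dictionary layout of the weights and the function names are assumptions for illustration, not the network configuration used in this study.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, w):
    """One forward step of a single-unit LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre_activation("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre_activation("input"))        # how much new information to write
    o = sigmoid(pre_activation("output"))       # how much of the cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# With all weights zero every gate equals 0.5 and the candidate is 0.
zero_w = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(1.0, 0.0, 2.0, zero_w)
```

Setting a strongly negative forget-gate bias drives `f` towards 0, so the previous cell state is discarded, which is the selective forgetting described above.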
2.4.4. Extreme Learning Machine
ELM is a learning algorithm for single-hidden-layer feedforward networks (SLFNs) that randomly selects the input weights and analytically determines the output weights of the SLFN [
56]. One key principle of the ELM is that one may randomly choose and fix the hidden node parameters. After the hidden node parameters are chosen randomly, the SLFN becomes a linear system in which the output weights of the network can be analytically determined using a simple generalized inverse operation on the hidden layer output matrix [
57]. The applications of ELM include computer vision and bioinformatics. It is also applied to regression problems in the earth and environmental sciences [
58].
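The two-stage idea described above, random fixed hidden parameters followed by an analytic solve for the output weights, can be sketched as follows. To stay dependency-free, this version solves ridge-regularized normal equations instead of computing a Moore-Penrose generalized inverse; that substitution, the sigmoid activation, the number of hidden nodes, and all names are assumptions made for illustration.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hidden_layer(row, weights, biases):
    """Sigmoid outputs of the hidden nodes for one input vector."""
    return [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(ws, row)) + b)))
        for ws, b in zip(weights, biases)
    ]


def train_elm(X, y, n_hidden=5, ridge=1e-6, seed=0):
    """Random hidden parameters, then a linear solve for the output weights."""
    rng = random.Random(seed)
    n_in = len(X[0])
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    H = [hidden_layer(row, weights, biases) for row in X]
    # Stand-in for the generalized inverse: (H^T H + ridge * I) beta = H^T y
    hth = [[sum(h[i] * h[j] for h in H) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    hty = [sum(h[i] * t for h, t in zip(H, y)) for i in range(n_hidden)]
    beta = solve_linear_system(hth, hty)
    return weights, biases, beta


def predict_elm(model, X):
    weights, biases, beta = model
    return [sum(b * h for b, h in zip(beta, hidden_layer(row, weights, biases)))
            for row in X]


# Toy one-dimensional data (not measured values).
X = [[i / 19.0] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
model = train_elm(X, y)
pred = predict_elm(model, X)
```

No iterative training is involved: once the random hidden layer is fixed, fitting reduces to one linear solve, which is the source of the computational efficiency usually attributed to ELM.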
2.5. Modeling Strategy and Validation Metrics
The model dataset was randomly divided into a training set and a test set at a ratio of 3:1. The independent variables of the model were the gray values of the images in the different bands. As the images were in TIF format, they were 16-bit grayscale images with a gray value range of 0–65,535. The cyanobacterial concentration, in cells/mL, was also large in magnitude, so the data were normalized before being put into the model, and the model outputs were then inverse normalized.
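The split and scaling steps described above can be sketched as follows. The text does not name the normalization scheme, so min-max scaling is assumed here; the function names and the fixed random seed are also assumptions for illustration.

```python
import random


def train_test_split(samples, train_parts=3, test_parts=1, seed=0):
    """Randomly split samples into training and test sets at train_parts:test_parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_parts / (train_parts + test_parts))
    return shuffled[:n_train], shuffled[n_train:]


def normalize(values, lo, hi):
    """Min-max scaling to [0, 1] (assumed scheme)."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(values, lo, hi):
    """Inverse of normalize: map scaled values back to the original units."""
    return [v * (hi - lo) + lo for v in values]


# 135 sample indices, matching the number of valid data sets in the study.
train, test = train_test_split(list(range(135)))

# 16-bit gray values span 0 to 65,535.
scaled = normalize([0, 32768, 65535], 0, 65535)
restored = denormalize(scaled, 0, 65535)
```

With 135 samples, a 3:1 split yields 101 training and 34 test samples under this rounding. In a real pipeline the scaling bounds for the concentration would be taken from the training set only, so that no test information leaks into the model.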
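In this study, the forecasting models were evaluated using the following three evaluation metrics: the coefficient of determination ($R^{2}$), root mean square error (RMSE), and mean relative error (MRE). The larger the $R^{2}$ and the smaller the RMSE and MRE are, the better the prediction accuracy of the model:
$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}$$
$$\mathrm{MRE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_{i} - \hat{y}_{i}\right|}{y_{i}}$$
where $y_{i}$ is the measured value, $\hat{y}_{i}$ is the predicted value, $\bar{y}$ is the average of the measured values, and $n$ denotes the number of samples.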
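The three metrics can be computed directly from paired measured and predicted values. The sketch below uses the commonly used definitions of R², RMSE, and MRE, which are assumed here to match the ones intended in this section; the function names and the sample numbers are illustrative only.

```python
import math


def r2_score(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error, in the units of the measured variable."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)


def mre(measured, predicted):
    """Mean relative error; assumes every measured value is nonzero."""
    n = len(measured)
    return sum(abs(y - p) / abs(y) for y, p in zip(measured, predicted)) / n


# Illustrative numbers only, not results from the study.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
scores = (r2_score(measured, predicted), rmse(measured, predicted), mre(measured, predicted))
```

RMSE keeps the units of the concentration (cells/mL after inverse normalization), while MRE is dimensionless, so the two give complementary views of the same errors.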