Chlorophyll-A Prediction of Lakes with Different Water Quality Patterns in China Based on Hybrid Neural Networks

One of the most important water quality problems affecting lakes and reservoirs is eutrophication, which is caused by multiple physical and chemical factors. As a representative index of eutrophication, the concentration of chlorophyll-a has always been a key indicator monitored by environmental managers. The factors that most influence chlorophyll-a may depend on the water quality patterns of the lakes. In this study, data collected from 27 lakes in different provinces of China during 2009–2011 were analyzed. The self-organizing map (SOM) was first applied to the data sets, and the lakes were classified into four clusters according to 24 water quality parameters. Comparison amongst the clusters revealed that Cluster I was the least polluted and at the lowest trophic level, while Cluster IV was the most polluted and at the highest trophic level. The genetic algorithm optimized back-propagation neural network (GA-BPNN) was applied to each lake cluster to select the most influential input variables for chlorophyll-a. The results for the four clusters showed that the performance of GA-BPNN was satisfactory with nearly half of the input variables selected from the predictor pool. The selected factors varied for the lakes in different clusters, which indicates that the control of eutrophication should be tailored separately to lakes in different provinces of one country.


Introduction
The deleterious proliferation of planktonic algae is a main cause of death of aquatic life and of damage to aquatic ecosystems and water functions in lakes [1]. Algal blooms have occurred in many lakes around the world in recent years [2,3]. Many factors influence the growth of phytoplankton, which is generally represented by the concentration of chlorophyll-a, including physical variables, nutrients, organic substances, and metal ions [4,5]. Light conditions and nutrients are known to be key elements necessary for the growth of plants and animals in lake ecosystems.
The light conditions in lakes influence the growth of the plankton community, and the eutrophication caused by the high concentration of chlorophyll-a would impact light availability in lakes [6,7]. Excessive nitrogen and phosphorus inputs are important factors in shifting lakes from oligotrophic to hypertrophic conditions [8], and lead to dramatic increases in harmful cyanobacteria blooms, which would create a serious threat to lake ecosystems [9]. The excessive amounts of organic substances and metal ions in freshwaters generally originate from domestic sewage, urban run-off, industrial effluents, and farm wastes, which are main causes of water pollution. Dissolved organic matter in lakes would absorb light and alter the light environment at depth, which would subsequently affect phytoplankton [10], and could also be consumed directly or indirectly by aquatic life and have

Data Set
The data set used in this study covered 27 representative lakes distributed across 18 provinces in China during the period 2009–2011 (Figure 1). The local environmental monitoring centers collected water samples from each of the lakes and reservoirs at least three times every year, during the wet, level, and dry water periods. Twenty-four water quality parameters were analyzed in local laboratories according to the national standards for surface waters in China (GB3838-2002). The number and location of sampling sites in each lake or reservoir differed according to lake volume and area. Without consideration of temporal effects, the annual averages of the parameters were calculated for each sampling site. Monitoring sites with missing variables were not taken into account, and the final data set included two-year monitoring data of 149 sites.
Amongst the lakes, Chao Lake (CL, 12 sites), Daming Lake (DML, three sites), Dongpu Reservoir (DPR, two sites), Hongze Lake (HZL, six sites), Laoshan Reservoir (LSR, three sites), Menlou Reservoir (MLR, two sites), Nansi Lake (NSL, five sites), Qiandao Lake (QDL, three sites), Tai Lake (TL, 21 sites), Xi Lake (XL, three sites), and Xuanwu Lake (XWL, two sites) are located in eastern China. Danjiangkou Reservoir (DJKR, four sites), Dong Lake (DL, five sites), Dongting Lake (DTL, 12 sites), and Boyang Lake (BYL, four sites) are located in central China. The uneven geographical distribution of precipitation has resulted in an uneven distribution of water resources in China. The north of the country, with a land area and population similar to those of the south, holds only 18% of the total water resources [20]. Many more lakes and reservoirs are distributed in the south of the country, with a larger water surface area than in the north. Baiyangdian (BYD, eight sites), Dalai Lake (DLL, two sites), Kunming Lake (KML, one site), Miyun Reservoir (MYR, one site), and Yuqiao Reservoir (YQR, three sites) are located in northern China, which has the lowest water surface area in the country. Bosten Lake (BSTL, 14 sites) and Shimen Reservoir (SMR, one site) are located in northwestern China, while Dianchi (DC, 10 sites) and Erhai (EH, nine sites) are located in southwestern China. These two areas are urbanized to a lower degree than the others [21]. Dahuofang Reservoir (DHFR, five sites), Jingpo Lake (JPL, three sites), and Songhua Lake (SHL, five sites) are located in the coldest area of the country, northeastern China. The geographical variation of the lakes and reservoirs may lead to differences in the factors influencing chlorophyll-a, which were explored in this study.
Water 2017, 9, 524
Twenty-four parameters were analyzed at each sampling site, including chlorophyll-a (Chla), water temperature (Temp), pH, clarity (SD), dissolved oxygen (DO), potassium permanganate index (CODMn), biochemical oxygen demand (BOD), ammonia nitrogen (NH3-N), total nitrogen (TN), total phosphorus (TP), petroleum, volatile phenol, mercury (Hg), lead (Pb), copper (Cu), zinc (Zn), fluoride, selenium (Se), arsenic (As), cadmium (Cd), hexavalent chrome (Cr), cyanide, anionic surfactant, and sulfide. First, all the parameters were used to classify the lakes into several groups. Then Chla was used as the output variable, and the other 23 physicochemical parameters were considered as potential input variables. The most influential variables for Chla were selected for the lakes with similar characteristics in each group. The raw data were standardized to the range between zero and one before analysis, to eliminate the effects of differing units and scales and to give the parameters the same or similar importance.
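The standardization step can be illustrated with a minimal sketch. The text only states that the raw data were standardized between zero and one; the min-max formula used here is an assumption about how that was done, and the function name `min_max_scale` and the sample readings are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1] (assumed min-max scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no contrast between sites; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical TN readings (mg/L) at four sites, for illustration only.
tn = [0.5, 1.0, 2.0, 2.5]
scaled = min_max_scale(tn)
```

Applying the same rescaling to each of the 24 parameters separately would put parameters with very different units, such as pH and heavy-metal concentrations, on a comparable footing before clustering.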

Trophic Level Index
In order to understand the trophic states of the lakes, a comprehensive trophic level index (TLI) based on Chla concentrations [22] was used to estimate trophic state as follows:

$$\mathrm{TLI}(\Sigma) = \sum_{j} W_j \cdot \mathrm{TLI}(j)$$

where TLI(∑) denotes the integrated trophic level index, TLI(j) is the trophic level index of parameter j, and W_j is the correlative weight of parameter j.
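As a minimal sketch of the weighted sum that defines the integrated index, the following computes TLI(∑) from per-parameter indices and weights. The parameter names, the TLI(j) values, and the weights W_j are hypothetical placeholders chosen for easy arithmetic; they are not taken from the study.

```python
def integrated_tli(tli_by_param, weights):
    """Return the sum over parameters j of W_j * TLI(j)."""
    if set(tli_by_param) != set(weights):
        raise ValueError("every parameter needs both a TLI(j) and a weight W_j")
    return sum(weights[p] * tli_by_param[p] for p in tli_by_param)


# Hypothetical per-parameter indices and weights (weights sum to 1).
tli_j = {"Chla": 40.0, "TP": 50.0, "TN": 60.0}
w_j = {"Chla": 0.5, "TP": 0.25, "TN": 0.25}

tli_total = integrated_tli(tli_j, w_j)  # 0.5*40 + 0.25*50 + 0.25*60
```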

The Self-Organizing Map
The self-organizing map (SOM), as an unsupervised learning method, has been used extensively in various fields to extract information from complex data sets and map them into fewer dimensions [24]. One objective of the SOM is to construct a topological map that visualizes the clustered input variables and reveals similarities among the data. In general, the SOM comprises an input layer and a clustering layer, which consist of nodes distributed on a two-dimensional map [25]. A weight vector of the same dimension as the input vector is associated with each node and is obtained through iterative updates of the SOM. The number of map nodes has important effects on the accuracy and generalization capability of the SOM. In this study, we chose a heuristic rule often used in previous studies to calculate the number of nodes. The rule is m = 5√n, in which m is the number of SOM nodes and n is the number of input sites [26]. After the training process, a preliminary grouping of samples was achieved, and further clustering could be applied to the reference vectors. The k-means algorithm is one of the most frequently used methods and was chosen for use in this study [27]. The Davies-Bouldin index (DBI) was calculated for different numbers of clusters, and the number with the lowest DBI was considered optimal for the trained SOM [26,28]. The samples with similar characteristics were classified into the same group and supplemented by additional analysis.
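The two numerical ingredients of this section, the map-size heuristic and the Davies-Bouldin index, can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: the DBI variant below uses Euclidean distance and the mean distance to the centroid as the within-cluster scatter, and the function names and toy points are hypothetical.

```python
import math


def som_node_count(n_samples):
    """Heuristic map size m = 5 * sqrt(n) for n input samples."""
    return 5 * math.sqrt(n_samples)


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values indicate compact, well-separated clusters.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two clusters")
    # Centroid of each cluster, and mean distance of its members to that centroid.
    centroids = {k: tuple(sum(axis) / len(pts) for axis in zip(*pts))
                 for k, pts in clusters.items()}
    scatter = {k: sum(math.dist(p, centroids[k]) for p in pts) / len(pts)
               for k, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster j.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in clusters if j != i)
    return total / len(clusters)


# Toy data: two tight groups far apart, labelled sensibly and badly.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
dbi_good = davies_bouldin(pts, [0, 0, 1, 1])
dbi_bad = davies_bouldin(pts, [0, 1, 0, 1])
```

In the toy data the sensible labelling gives a much smaller index than the mixed one, which is the property used to pick the number of clusters: the candidate partition with the lowest DBI is retained.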

The Optimized Back-Propagation Neural Networks
The back-propagation neural network (BPNN) is a popular algorithm that has been applied in many subject areas and shows great nonlinear regression capability [29]. The model is constructed from examples of data and known outputs based on supervised learning, under the hypothesis that all the information contained in the data sets can be used to establish the relationships between inputs and outputs [30]. However, when the sample size is small, a large number of input variables often leads to an over-fitting problem. In this study, we used genetic algorithms (GA) to select the optimal subset from a pool of predictor variables for the BPNN. The GA-optimized BPNN (GA-BPNN) was applied to each cluster obtained from the SOM. The Chla concentration was used as the output variable, while all the other parameters acted as input variables at first. The selection of input variables was then performed: it started with an initial random set of weights, and a global search was executed over the network weight range to find a better result. The initial population size of the GA was 20, the length of the chromosome was 23, and the maximum number of generations was 100. When the maximum generation was reached, the chromosome with the minimum error value was chosen as the best solution for the model and used to optimize the initialized weights and the threshold of the BPNN. The data set of each cluster was randomly divided into training and testing subsets, with proportions of 80% and 20%, respectively. Leave-one-out cross-validation is the most extreme form of k-fold cross-validation, in which k is the number of training patterns [31]. It was applied to limit the over-fitting problem and provide an almost unbiased estimate of the true generalization ability of the model [32]. The training subsets were used to obtain the best structures and parameters of the GA-BPNNs through the leave-one-out cross-validation method, and the randomly extracted testing subsets were used to validate the models. The performance of the GA-BPNNs was assessed by two standard statistical performance evaluation criteria, the coefficient of determination (R²) and the root mean squared error (RMSE), on both the training and testing data sets.
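The GA settings given above (population of 20, chromosome length 23, at most 100 generations) can be placed in a small sketch of GA-based input selection. Several parts are assumptions made only for illustration: the chromosome is treated as a binary mask over the 23 candidate inputs, the operators are tournament selection, one-point crossover, bit-flip mutation, and elitism, and `toy_error` is a hypothetical stand-in for the cross-validated BPNN error, which the text does not specify in code form.

```python
import random

N_GENES = 23    # one bit per candidate input variable (chromosome length in the text)
POP_SIZE = 20   # initial population size (from the text)
MAX_GEN = 100   # maximum number of generations (from the text)


def toy_error(chromosome):
    """Hypothetical error: number of bits that differ from an arbitrary target mask.

    A real run would train a BPNN on the selected inputs and return its error.
    """
    target = [1 if i % 2 == 0 else 0 for i in range(N_GENES)]
    return sum(a != b for a, b in zip(chromosome, target))


def run_ga(error_fn, seed=0, mutation_rate=0.05):
    """Minimize error_fn over binary masks; returns (best mask, best error per generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_GENES)] for _ in range(POP_SIZE)]
    best = min(pop, key=error_fn)
    history = [error_fn(best)]
    for _ in range(MAX_GEN):
        new_pop = [best[:]]  # elitism: the best mask so far always survives
        while len(new_pop) < POP_SIZE:
            # Tournament selection of two parents (assumed operator).
            p1 = min(rng.sample(pop, 3), key=error_fn)
            p2 = min(rng.sample(pop, 3), key=error_fn)
            cut = rng.randint(1, N_GENES - 1)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            new_pop.append(child)
        pop = new_pop
        best = min(pop, key=error_fn)
        history.append(error_fn(best))
    return best, history


best, history = run_ga(toy_error)
selected = [i for i, bit in enumerate(best) if bit == 1]  # indices of retained inputs
```

Because the best chromosome is copied unchanged into each new generation, the recorded error never increases, which mirrors the idea of keeping the minimum-error chromosome as the final solution.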

The Clustering Results of Sampling Sites
According to the methodology described above, a SOM with 80 nodes (eight in the vertical and 10 in the horizontal direction) was applied for the preliminary classification of sampling sites based on standardized environmental monitoring data including 24 parameters. The component planes of each parameter in neurons on the trained map are shown in Figure 2. The nodes in varied colors represent different weighted values. The relationships amongst the variables could be explained by comparing the component planes. For example, the component planes of NH3-N, TN, TP, and anionic surfactant show similar distribution patterns on the map, indicating that these four parameters are positively correlated. In order to quantitatively evaluate the correlations, Pearson correlation analysis was performed between these four parameters. There were significantly positive correlations between NH3-N and TN (0.90, p < 0.01), NH3-N and TP (0.90, p < 0.01), and TN and TP (0.83, p < 0.01). The positive correlation coefficients between anionic surfactant and NH3-N (0.58, p < 0.01), TN (0.49, p < 0.01), and TP (0.55, p < 0.01) were a little lower. The component planes of Hg and Pb showed a visually positive correlation, with a Pearson correlation coefficient of 0.48 (p < 0.01). No significant correlations between DO and the other parameters were apparent from the component planes, which was consistent with the results of the correlation analysis.
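For reference, the Pearson coefficient reported for these parameter pairs can be computed as in the following sketch. It covers only the coefficient itself, not the significance test behind the p-values, and the sample vectors are hypothetical values chosen so that the result is easy to check.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical standardized values at four sites.
nh3n = [0.1, 0.2, 0.3, 0.4]
tn_rising = [0.2, 0.4, 0.6, 0.8]    # rises with NH3-N
do_falling = [0.9, 0.7, 0.5, 0.3]   # falls as NH3-N rises

r_pos = pearson_r(nh3n, tn_rising)
r_neg = pearson_r(nh3n, do_falling)
```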
The k-means algorithm was further applied to cluster the neurons on the trained SOM map. The optimal number of clusters was selected based on the DBI values, which were calculated for two through 10 clusters. The results (Table 1) showed that the most appropriate number of clusters was four, which corresponded to the minimum DBI value. Therefore, the sampling records could be classified into four groups, denoted by I, II, III, and IV on the trained SOM map (Figure 3), and the number of records of different lakes included in each neuron is also marked in the figure.
The sites in Cluster I were mainly distributed in BSTL, DJKR, and QDL. The average TLI values of BSTL, DJKR, and QDL were 36.67, 35.19, and 32.80, respectively, exceeding the minimum mesotrophic criterion. These oligotrophic to mesotrophic lakes often had very clear waters, with high drinking-water quality. DJKR and QDL, located in Hubei Province and Zhejiang Province respectively, had rich water resources and abundant rainfall each year. BSTL is in Xinjiang Uygur Autonomous Region, a vast territory with a sparse population and a low level of urbanization. The mean values and standard deviations (STD) of the 24 parameters for the four clusters are summarized in Table 2. The mean concentrations of TN and TP in Cluster I were 1.08 mg/L and 0.02 mg/L, respectively, which were lower than the values in the other clusters. The values of water temperature and SD were the highest in Cluster I, while Se and As were the lowest in this cluster. The water quality in these three lakes was fairly good, basically meeting the functional requirements of a drinking water source or rare aquatic habitat.
Cluster II included the sites in EH, DHFR, MYR, and MLR, together with some of the sites located in the upper basin of BSTL. The TLI values of EH, DHFR, MYR, and MLR were 40.74, 40.10, 34.12, and 39.77, respectively, which indicated that they were at the mesotrophic level. The BSTL sites in Cluster II had a TLI value of 39.73, which was higher than that of the sites in Cluster I. Mesotrophic lakes are often clear-water lakes and ponds with an intermediate level of productivity and medium levels of nutrients. The reservoirs DHFR, MYR, and MLR are located in provinces with serious water shortage problems. Freshwater is valuable in these regions, and the reservoirs have always acted as important drinking water sources. BSTL is located in Xinjiang, which is situated deep in the hinterland of Eurasia and is one of the driest zones in the world [33]. Due to climate change and human activity, many lakes in Xinjiang have disappeared in the past decades, and lake wetlands have been destroyed by municipal wastewater and industrial sewage [34]. The ecological environment is very fragile in this region, and more attention should be paid to water resources in this arid ecological system. The TLI values in Cluster II were generally higher than those in Cluster I, indicating a higher risk of eutrophication.
The sites in Cluster III were mainly distributed in DML, DL, XL, and YQR, with TLI values of 53.73, 61.43, 50.11, and 46.62, respectively. The lakes DML, DL, and XL were lightly eutrophic, while YQR was considered mesotrophic. The mean values of the parameters in this cluster were a little lower than in Cluster IV, but higher than in Cluster I and Cluster II, indicating worse water quality than in the latter two clusters. DML, DL, and XL are culturally significant lakes, and intense eutrophication has occurred there due to human activities, as the regions have experienced rapid economic development and environmental change [35]. YQR is the main water source for industrial, agricultural, and daily use in Tianjin, and has played an important role in the economic development of Tianjin City [36]. The TLI value of YQR was approaching the minimum eutrophic range, and the eutrophication risk could have adverse effects on human health.
The sites in BYD, DC, CL, DTL, HZL, DLL, TL, and NSL were mainly grouped in Cluster IV. The TLI values of DTL and NSL were 49.17 and 49.95, respectively, which were close to the minimum eutrophic criterion. The lakes BYD, CL, and HZL were at the light eutrophic level, with TLI values of 54.54, 57.82, and 58.11, respectively. The lake DLL was at the middle eutrophic level (66.49), while DC was the most eutrophic lake, with a TLI value of 70.69. These lakes are mostly located in economically developed provinces with large populations, such as Shandong, Jiangsu, Anhui, Hunan, and Hebei. Under the tremendous pressure of human influence, large amounts of nutrients have been discharged into the lakes, and the mean concentrations of TN and TP were 2.33 mg/L and 0.13 mg/L, respectively. These eutrophic lakes commonly have an excess of nutrients, which induces the growth of plants and algae, leading to higher Chla concentrations and oxygen depletion in the water body [37]. The mean values of CODMn, BOD, NH3-N, and fluoride were the highest in Cluster IV, while SD was much lower than in the other clusters. These lakes were at a higher risk of eutrophication, and some of them have had several blue-green algae blooms [38,39]. The result of the classification based on the 24 parameters was not completely consistent with the TLI values, as more comprehensive water quality characteristics were considered in this classification.

Different Predictors of Chlorophyll-A for Sites with Various Water Quality Characteristics
The SOM grouped the sampling sites from different lakes into four clusters according to the values of the physicochemical parameters. As an indicator of the amount of phytoplankton or algae present in a water body, Chla concentration was used to estimate biomass productivity and ecosystem health.
In this study, we tried to determine the factors that most influence Chla for the lakes with different types and amounts of pollution in each SOM cluster. The results of the GA-BPNNs with the best-fit structure during the training and testing periods for Clusters I-IV are summarized in Table 3. The selected input variables of each cluster are shown in Table 4. The results indicate that the prediction accuracy of the GA-BPNNs was satisfactory with nearly half of the predictors for each of the four clusters. In Cluster I and Cluster IV, SD was an important physical factor reflecting the Chla concentrations [40]. Previous studies had demonstrated that spatial and temporal variations in SD were highly associated with variations in Chla [41,42]. The growth of various species of algae produces chlorophyll and gives water its green tint in productive areas, meaning that excessive algae growth or algal blooms often lead to reduced water clarity and light penetration [43]. Cluster II and Cluster III both chose Temp and pH as the most influential physical factors. In both clusters, Temp and pH showed a significant positive linear correlation with Chla, which was consistent with previous studies [44]. An aquatic environment with increasing water temperatures would favor blooms of toxin-producing harmful cyanobacteria, so that cyanobacteria grew rapidly during spring and summer, while they were not likely to reproduce during winter [45]. Furthermore, cyanobacteria with efficient carbon concentration mechanisms could outcompete other phytoplankton species under high pH [46].
In addition to these physical parameters, nutrients, organic substances, and metal ions also had significant effects on the growth of phytoplankton [40,47]. Nitrogen and phosphorus are important nutrients for algae growth and are often identified as limiting factors for algal biomass [48]. The excess input of nutrients results in the deleterious proliferation of planktonic algae and causes disruption of the aquatic environment [9]. For the lakes at a lower trophic level (Cluster I and Cluster II), both TN and TP were limiting factors of chlorophyll-a, while in the eutrophic lakes (Cluster III and Cluster IV), TP was more important. The ratio of TN:TP was higher in Cluster I and Cluster II than in Cluster III and Cluster IV, which indicates that the natural and undisturbed lakes received much less phosphorus than nitrogen, and that the eutrophic lakes received a large amount of wastewater and sewage with a lower average N:P [49]. The pollution caused by organic substances was a main anthropogenic driver of ecological change in ecosystems and may affect the ecological functions of lakes [50]. One or more of the parameters indicating the amount of organic matter in the water, namely CODMn, BOD, and petroleum, were selected in each of the four clusters, demonstrating the important influence that organic matter has on the concentration of Chla. Organic pollutants, even without toxicity, were one of the causes of water pollution because they consume dissolved oxygen in the water. The organic matter was made up of a complex mixture of lipids, carbohydrates, proteins, and other biological chemicals, with significantly different physical, chemical, and toxicological properties [51], so that the specific contents were different in each cluster. Metal ions often have detrimental effects on ecosystems because of their toxicity and persistence in the water environment [52]. Some metals (Cu, Zn, etc.) play an important role in the physiological functions of living tissue and regulate many biochemical processes when present in trace concentrations [53]. However, the same metals at elevated concentrations, discharged from sewage or industrial effluents, would have severe toxicological effects on the aquatic ecosystem, which has been demonstrated in marine phytoplankton [54]. Some other heavy metals, such as Pb and Hg, are non-essential, and their role in cells is not known [55]. They may affect organisms by accumulating in the body directly or by transferring to the next trophic level of the food chain, which would be a danger to human health [56]. Notably, the concentrations of Hg and Pb in Cluster IV were not lower than in the other clusters, but they were not selected as input variables. This indicates that the control of nutrients and organic substances would be more crucial for the lakes in Cluster IV.
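Returning to the TN:TP comparison made earlier in this section, the contrast between the least and most polluted clusters can be checked directly from the cluster means reported in the text (1.08 and 0.02 mg/L for Cluster I, 2.33 and 0.13 mg/L for Cluster IV). The sketch below computes a mass ratio of the mean concentrations; this is an illustrative simplification, since it is neither a molar ratio nor the mean of site-level ratios, and the means of Clusters II and III are not given in the text, so they are omitted.

```python
# Mean TN and TP concentrations (mg/L) quoted in the text for two clusters.
cluster_means = {
    "Cluster I": {"TN": 1.08, "TP": 0.02},
    "Cluster IV": {"TN": 2.33, "TP": 0.13},
}


def tn_tp_mass_ratio(tn_mg_per_l, tp_mg_per_l):
    """Ratio of mean TN to mean TP by mass (dimensionless)."""
    return tn_mg_per_l / tp_mg_per_l


ratios = {name: tn_tp_mass_ratio(m["TN"], m["TP"])
          for name, m in cluster_means.items()}

for name, ratio in ratios.items():
    print(f"{name}: TN:TP = {ratio:.1f}")
```

On these means the ratio is roughly three times higher in Cluster I than in Cluster IV, in line with the statement that the less disturbed lakes received much less phosphorus relative to nitrogen.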
The composition of the site clusters in different parts of China showed significant geographic variation, as shown in Figure 4. The percentage of sites belonging to Cluster IV was 82.11% and 79.55% in eastern and central China, respectively. There are many lakes and abundant water in these two parts of China, but the water quality was very poor. Emphasis should be placed on controlling the inputs of TP and organic substances into the water in these regions. The proportions of sites belonging to Cluster III and Cluster IV in northern China were 37.84% and 40.54%, respectively. This indicates that the trend of deteriorating water quality should be curbed and that the inputs of influential factors need to be controlled in this region. The sites in northwestern China were mostly grouped into Cluster I and Cluster II. A general nationwide pattern emerged from Figure 4: the lakes in economically advanced regions in the east generally had poorer water quality than those in the less developed western territories [57]. However, the water shortage problems were particularly severe in northwestern China. The protection of lakes in this region should focus on both the quantity and quality of water. The sites in southwestern China were mainly grouped into Cluster II and Cluster IV. The water quality in this region was generally good except for lake DC. With a rapid increase in local population and inputs of massive amounts of municipal and industrial sewage, this lake was in a state of heavy eutrophication, frequently accompanied by cyanobacteria blooms [58]. The pollution control of DC should be continued in the future. The number of lakes in northeastern China was relatively smaller than in other regions, and they were mainly grouped into Cluster II and Cluster III. Northeastern China plays a vital role in national economic development with its developed industry, high degree of urbanization, and fertile cropland [59]. Limiting the inputs of metal ions and nutrients should be emphasized as a part of water pollution control. As this is the coldest region in China, the eutrophication risk was relatively lower here than in other regions.

Conclusions
The potential of artificial neural networks (ANNs) for the classification of lakes and the estimation of chlorophyll-a was examined in this study. The SOM was applied first to group the sampling records into four clusters based on 24 parameters. Most of the physical factors and nutrients steadily deteriorated from Cluster I to Cluster IV, which indicated better water quality in Cluster I than in the other clusters. The classification result was a little different from the trophic levels, as it took more comprehensive parameters into consideration. Based on the result of the classification, GA-BPNN was applied to each cluster to select the specific factors that most influenced the concentration of chlorophyll-a. The R² and RMSE results showed that the performance of GA-BPNN was satisfactory with nearly half of the input variables selected from the predictor pool for each cluster. The composition of lake clusters showed that the lakes in the economically advanced eastern regions had poorer water quality than those in the less-developed western territories. The organic substances discharged from anthropogenic activities were important factors in all four clusters. Based on the results, the combination of SOM and GA-BPNN was found to be effective for clustering and prediction, and the results could give some suggestions for the management of lakes with diverse water quality characteristics in China. Our approach could be of great interest to lake managers who are concerned with controlling the undesirable effects of eutrophication, as the results suggested that the limiting nutrient factor for eutrophication was TP in eastern, central, and northern China, while both TN and TP were emphasized in northwestern, southwestern, and northeastern China.

Figure 1. The locations of 27 representative lakes in China.

Figure 2. Component planes of the 24 parameters.

Table 1. Davies-Bouldin index (DBI) values of different numbers of clusters on the trained SOM.

Figure 3. The clustering results of the trained SOM neurons. The k-means method was applied to define boundaries (dark lines) of the four clusters (I-IV) on the map.

Figure 4. The composition of clusters in different regions of China.

Table 2. Summary statistics of 24 parameters for the four clusters.

Table 3. The performance statistics of GA-BPNNs for each cluster during training and testing periods.