Predicting Bio-indicators of Aquatic Ecosystems Using the Support Vector Machine Model in the Taizi River, China

Numerous studies have sought to clarify the link between biological communities and environmental factors in freshwater, but an appropriate model is still needed to predict the effect of water quality and hydromorphology improvement on biological communities and to provide useful information for ecological restoration planning. In this study, a support vector machine (SVM) was used to predict the bio-indicators of an aquatic ecosystem (i.e., macroinvertebrate, fish, and algae communities) in the Taizi River, northeast China. Environmental factors, including physico-chemical parameters (i.e., dissolved oxygen (DO), electrical conductivity (EC), ammonia nitrogen (NH3-N), chemical oxygen demand (COD), five-day biochemical oxygen demand (BOD5), total phosphorus (TP), total nitrogen (TN)) and hydromorphology parameters (i.e., water quantity, channel change, morphology diversity), were used as the input variables to train and validate the SVM model. The sensitivity of the prediction to each input variable was examined by removing one variable at a time from the SVM model. Results revealed that the SVM model reproduced the variation in bio-indicators of fish and algae communities well, based on the input variables. The sensitivity analysis showed that in the Taizi River the most sensitive variables for predicting macroinvertebrate and algae communities were channel change, DO, TN, and TP, while the most sensitive variables for predicting fish communities were DO and BOD5. This study proposes an effective method for predicting biological communities, which can support improved freshwater quality and hydromorphology management schemes. The outputs can guide the decision-making process in river basin management, support the prioritization of actions and resource allocation, and help to monitor and evaluate the effectiveness of interventions.


Introduction
Biological communities in freshwater ecosystems provide goods and services of critical importance to human societies [1,2]. Their measurement provides the predominant indicators reflecting the ecological state of a waterbody and can promote effective improvements in river conservation. River pollution and hydromorphology destruction, resulting from human activities such as dam construction, are increasing problems that affect the biological diversity and community structure of aquatic ecosystems [3]. In recent decades, there has been great interest in directly studying the effects of pollution and hydromorphology destruction on biological community structure indicators. Clarifying the relationship between biological communities and environmental factors can help decision-makers to develop appropriate water pollution control and ecological restoration measures, with protecting the integrity of freshwater biology as the final restoration goal. The bio-indicators of aquatic ecosystems have been proven to be effective in reflecting long-term disturbance in rivers. The response of biological communities to different types of anthropogenic stress varies significantly. For example, the bio-indicators of an algae community are widely used in the monitoring of eutrophication, because low concentrations of nitrogen and phosphorus will increase algae growth and, to some extent, its biodiversity, but will have little effect on fish and macroinvertebrate communities [4], whereas the bio-indicators of a fish community are more widely used in monitoring the impacts of dam construction [5]. The bio-indicators of a macroinvertebrate community, in turn, are frequently used in monitoring organic pollution [6] or heavy metal pollution [7]. Therefore, by clarifying the response of the bio-indicators of aquatic communities to an environmental stress, the main factors causing the ecological destruction of aquatic ecosystems can be identified, making river ecological restoration measures more specific.
To establish a river management strategy that aims to improve the ecological status of rivers rather than simply reducing pollutant emissions, scientists in China are seeking to understand the changes in aquatic organisms caused by pollutant emission reduction and ecological rehabilitation. However, studies and national environmental protection action in China in the last few years have remained focused on the individual evaluation of physico-chemical parameters when considering water quality [8][9][10], such as COD and NH3-N. Such parameters may not fully reflect the ecological status of rivers and may therefore fail to lead to effective improvements in river management measures.
The relationship between diverse environmental factors and bio-indicators is complex, which increases the difficulty of predicting the community structure of freshwater biology [11]. Models that have been used can be categorized as deterministic or stochastic approaches [12]. Process-based mathematical models have been widely used to predict the general ecological response of biological community structure to environmental factors. However, the physical dynamics of community structure are not well understood because of uncertainties such as inadequate observations and the complex interactions of the biological communities [12][13][14]. This limits the development of an appropriate formulation for simulating the community structure of freshwater biology and demands an alternative modeling approach, such as a data-driven methodology [11].
Support vector machines (SVMs) have provided a rigorous method for uncertainty analysis and presented key information for management decision-making [15,16]. They have the ability to extract temporal or spatial patterns and to describe highly nonlinear and complex data. In the past few years, there has been considerable interest in SVMs because they have yielded excellent generalization performance on a wide range of problems [17,18]. SVMs produce very competitive results when compared with the best accessible classification methods, and they need little model tuning because only a few parameter settings have to be adjusted. An SVM maintains steady performance regardless of input dimensionality and correctly determines the global optimum during the regression process [19,20]. However, there is still little experience with, or application of, SVMs in ecological study. Therefore, we used an SVM for regression to develop a predictive model of freshwater biology community structure.
A complete SVM analysis entails three steps: model selection, fitting, and validation. Beginning with a previously selected set of input variables, data normalization was carried out to reduce the complexity of the model and decrease its computational requirements. A radial basis function (RBF) kernel, which is widely used in nonlinear fitting, was implemented to build the SVM models. The performance of the SVM-based model was finally evaluated by 10-fold cross-validation. The Taizi River, which flows through mountains in northeast China, is under pressure because of environmental pollution and ecological damage, as is the case with rivers elsewhere in China. The local government is working to restore its water quality, but without significant success. Knowledge of the community structure would support more effective restoration and management of the river basin ecosystem.

Study Area
The Taizi River is located in northeast China (40°30′-41°40′ N, 122°20′-124°55′ E) and is one of the main tributaries of the Liaohe River Basin. The Taizi River, with a length of about 400 km, has nine tributaries and a catchment area of about 1.39 × 10⁴ km² (Figure 1) [21]. The area is characterized by a warm, temperate continental climate [22]. The Taizi River Basin has experienced industrial development within Liaoning province since the 1950s. The basin is now an important area for industry (including metallurgical, petrochemical, and equipment manufacturing) and agriculture (dryland and paddy farming). Water from the Taizi River is mainly used for the domestic, industrial, and agricultural needs of the three biggest cities (Benxi, Liaoyang, and Anshan) and the surrounding areas. Currently, land use is dominated by agriculture and forestry [22]. The major threats to ecosystem quality in the Taizi River Basin have been identified as urban and industrial point source pollution, as well as diffuse pollution related to agriculture and other activities (road construction, waste disposal, etc.) [21]. There are nine reservoirs and several river weir gates on the Taizi River, and these have significantly altered its natural flow regime and interfered with solid transport and fish migration. The ecological quality of the Taizi River has also been extensively influenced by the clearing of riparian vegetation and the channeling of rivers and streams related to land use changes, as well as by the extraction of riverbed materials [21,22].
Sustainability 2017, 9, 892


The Available Dataset
The dataset for the application of the SVM model was obtained from the results of the National Key Science and Technology Special Program of China on Water Pollution Control and Treatment in the Taizi River Basin. This program included 163 sampling sites monitored in 2009, and 60 sites monitored in 2010, along the main channel and tributaries of the Taizi River Basin (Figure 1).
The available dataset included data on biological communities (i.e., fish, algae, and macroinvertebrates), physico-chemical parameters (i.e., DO, EC, NH3-N, COD, BOD5, pH, TP, TN), and hydromorphological parameters (i.e., water quantity, channel change, morphology diversity). These indicators were selected for ecological status classification of the Taizi River Basin [23]. The results of previous studies showed that there was a negative trend in the ecological status from the highlands to the lowlands of the Taizi River Basin, and that the biological communities were significantly impaired, with varying degrees of damage to each community caused by environmental pressure. The macroinvertebrate fauna was most badly damaged, while the fish community was less impaired. The algae community received the best evaluation compared to the other communities. Organic pollution (i.e., COD, BOD5) from agriculture and domestic sources; an unstable hydrological regime (i.e., water quantity shortage); and chemical pollutants (i.e., PAHs and metals) from industry were found to be the main stressors impacting the ecological status of the Taizi River Basin.

Theoretical Background of Applied Models
The SVM is a kernel-based learning algorithm that is widely used for pattern classification and regression [28,29]. When used for regression, the SVM finds a function whose estimated network output (s_i) deviates as little as possible from the real values for all training data. Initially, the input data X_i are mapped into a higher-dimensional feature space via a nonlinear mapping function ϕ(X_i); linear regression is then implemented in this space. The SVM subsequently approximates the function (Equation (1)), where w_i and b are the coefficients determined by minimizing the regularized risk function based on the network outputs and real values. In this process, a kernel function approach is applied to carry out the nonlinear mapping. The kernel function κ(X_i, X) is computed using the inner product between the nonlinearly mapped data (ϕ(X_i), ϕ(X)) [16,30]. In this study, a radial basis function (RBF) is used as the kernel function in the SVM model (Equation (2)).
In this study, data normalization was used to adjust values measured on different scales to a notionally common scale. Because the units and scales of the parameters were different, this ensured that all parameters had the same scale for a fair comparison. Unity-based normalization was used to bring all parameter values into the range [0, 1], using Equation (3):
X'_i = (X_i − X_min) / (X_max − X_min)   (3)
where X'_i is the normalized value; X_i is the original value; X_min is the minimum value; and X_max is the maximum value.
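The normalization and kernel steps described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the sample DO readings, and the handling of a constant column are assumptions, and the kernel is written in the common form exp(−‖X_i − X‖² / (2σ²)), which is one standard parameterization and may differ from the form used in Equation (2).

```python
import math


def min_max_normalize(values):
    """Rescale a list of numbers to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column is mapped to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rbf_kernel(x, z, sigma):
    """RBF kernel exp(-||x - z||^2 / (2 * sigma^2)); parameterization assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Hypothetical dissolved-oxygen readings (mg/L), for illustration only.
do_values = [2.0, 5.0, 8.0]
do_scaled = min_max_normalize(do_values)

# Kernel similarity between two already-normalized sample vectors.
similarity = rbf_kernel([0.0, 0.5], [1.0, 0.5], sigma=1.0)
```

Identical inputs give a kernel value of 1, and the value decays toward 0 as the samples move apart; a larger σ makes that decay slower.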

Performance
The performance of the SVM for regression in this study depended on three parameters: C, sigma (σ), and epsilon (ε). The hyper-parameter C is a regularization constant that determines the trade-off between the complexity of the decision rule and the frequency of error [31]. σ is a parameter of the kernel that controls the width of the RBF, and therefore the generalization ability of the SVM. For the SVM with the RBF kernel, C and σ were the two basic parameters involved in optimization. In the SVM for regression, ε is a prescribed parameter that sets the tolerated training error and thereby determines model complexity by adjusting the number of support vectors. In the 10-fold cross-validation, the data were divided into 10 subsets; in each, 90% of the samples were used for training and the remaining 10% for validation. The value of each statistical descriptor was calculated as the arithmetic mean over the 10 validation subsets. It should be noted that overfitting is one of the main issues in the development of SVM-based models. Overfitting occurs when a model achieves an outstanding performance on the training data but is unable to generalize. Cross-validation has been found to be an effective technique for avoiding overfitting, and thus for achieving good generalization capability [32][33][34]. A genetic algorithm was applied to determine the optimal parameters for the SVM model based on the lowest values of the mean squared error (MSE) in the validation subsets. The MSE was determined by Equation (4), where y_i is the observed value; ŷ_i is the predicted value; and N is the number of units in the summation.
Currently, most approaches to determining model parameters are based on prior knowledge, users' expertise, or experimental trial, so there is no guarantee that the selected parameters are optimal [19], and no general guideline is available for selecting them. In this study, the three parameters (C, σ, and ε) were optimized by a genetic algorithm (GA). GAs are stochastic search techniques that can search large and complicated spaces using ideas from natural genetics and the evolutionary principle. Here, the values of the SVM parameters C, ε, and σ are directly coded in the chromosome as real values; the values are dynamically optimized through the GA evolutionary process, and the acquired parameters are used to construct an optimized SVM model for prediction. Details of the GA procedure are given by Liu et al. [15]. A search range of [0.1, 100] was used for both C and σ, while [0, 1] was taken as the range for ε.
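A real-coded GA search over (C, σ, ε) within the ranges given above might look like the following sketch. Everything beyond the three search ranges is an assumption made for illustration: the operators (tournament selection, arithmetic crossover, Gaussian mutation, elitism), the population size, the number of generations, and in particular `toy_objective`, which is a hypothetical stand-in for the cross-validated MSE of a trained SVM. The procedure actually used in the study is the one described by Liu et al. [15].

```python
import random

# Search ranges from the text: C and sigma in [0.1, 100], epsilon in [0, 1].
BOUNDS = [(0.1, 100.0), (0.1, 100.0), (0.0, 1.0)]


def clip(value, low, high):
    return max(low, min(high, value))


def random_chromosome(rng):
    """One real-valued chromosome [C, sigma, epsilon] drawn inside the bounds."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]


def tournament(population, scores, rng, k=3):
    """Return the best of k randomly drawn individuals (lower score is better)."""
    picks = rng.sample(range(len(population)), k)
    return population[min(picks, key=lambda i: scores[i])]


def crossover(a, b, rng):
    """Arithmetic crossover: each gene is a random blend of the two parents."""
    child = []
    for x, y in zip(a, b):
        w = rng.random()
        child.append(w * x + (1.0 - w) * y)
    return child


def mutate(chrom, rng, rate=0.2, scale=0.1):
    """Gaussian mutation scaled to each gene's range, clipped to the bounds."""
    out = []
    for gene, (lo, hi) in zip(chrom, BOUNDS):
        if rng.random() < rate:
            gene += rng.gauss(0.0, scale * (hi - lo))
        out.append(clip(gene, lo, hi))
    return out


def genetic_search(objective, pop_size=20, generations=30, seed=0):
    """Minimize objective(chromosome) and return (best_chromosome, best_score)."""
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    for _ in range(generations):
        scores = [objective(c) for c in population]
        best = population[min(range(pop_size), key=lambda i: scores[i])]
        children = [best]  # elitism: the best individual survives unchanged
        while len(children) < pop_size:
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
    scores = [objective(c) for c in population]
    i_best = min(range(pop_size), key=lambda i: scores[i])
    return population[i_best], scores[i_best]


def toy_objective(chrom):
    """Hypothetical stand-in for the cross-validated MSE; NOT a real SVM fit."""
    c, sigma, eps = chrom
    return (c - 10.0) ** 2 / 1e4 + (sigma - 1.0) ** 2 / 1e4 + (eps - 0.1) ** 2


best_params, best_score = genetic_search(toy_objective)
```

In practice the objective would train an SVM with the candidate parameters and return its mean validation MSE. Because the best individual is carried over unchanged, the best score cannot get worse from one generation to the next.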
The squared correlation coefficient (R²) was chosen to describe the overall model performance. This indicator represents the proportion of the observed variance explained by the model. MSE was selected to characterize the overall model error.
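The two descriptors and the fold structure of a 10-fold cross-validation can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the folds are contiguous (a real analysis would usually shuffle the samples first), and R² is computed as the square of the Pearson correlation between observed and predicted values, which is one reading of "squared correlation coefficient".

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous validation folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mse(observed, predicted):
    """Mean squared error between observed and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def squared_correlation(observed, predicted):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_y = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(observed, predicted))
    var_y = sum((y - mean_y) ** 2 for y in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_y * var_p)


# Hypothetical sample count, for illustration only.
folds = k_fold_indices(50, k=10)
for validation_idx in folds:
    held_out = set(validation_idx)
    training_idx = [i for i in range(50) if i not in held_out]
    # A model would be fitted on training_idx and scored on validation_idx here;
    # the reported MSE and R² are then the means over the 10 folds.
```

Each fold holds out roughly 10% of the samples for validation and trains on the remaining 90%.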

Sensitivity Analysis
In this study, a sensitivity analysis was applied to identify the input variables that most influence the prediction of bio-indicators. The one-factor-at-a-time (OAT) method was used as the assessment tool for checking the sensitivity of model variables. The SVM models were rerun with one variable removed at a time and all other parameters held constant, yielding a new output for each removal. The variation in overall model performance (squared correlation coefficient, R²) for a given variable was subsequently calculated to obtain the effect of that variable on model performance; this process was repeated for every variable.
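The OAT procedure above amounts to a simple loop, sketched here in Python. The names `oat_sensitivity`, `toy_score`, and `TOY_WEIGHTS` are hypothetical, and the scorer is a made-up additive function standing in for "retrain the SVM on the remaining variables and return its R²"; the numbers carry no empirical meaning.

```python
def oat_sensitivity(variables, score_fn):
    """Drop one variable at a time; record the reduced model's R² and the drop."""
    baseline = score_fn(variables)
    results = {}
    for var in variables:
        reduced = [v for v in variables if v != var]
        r2 = score_fn(reduced)
        results[var] = {"r2": r2, "drop": baseline - r2}
    return baseline, results


# Hypothetical contributions, used only to make the example runnable.
TOY_WEIGHTS = {"DO": 0.30, "TN": 0.10, "TP": 0.25, "CC": 0.20}


def toy_score(variables):
    """Stand-in for refitting the model on `variables` and returning its R²."""
    return sum(TOY_WEIGHTS[v] for v in variables)


baseline_r2, sensitivity = oat_sensitivity(list(TOY_WEIGHTS), toy_score)

# The most sensitive variable is the one whose removal lowers R² the most.
most_sensitive = max(sensitivity, key=lambda v: sensitivity[v]["drop"])
```

Ranking by the drop in R² is equivalent to ranking by the smallest R² of the reduced models, since the baseline is the same for every variable.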

Determination of Optimal Model
In parameter optimization, MSE was calculated as the arithmetic mean of the 10 validation subsets for each regression model. Results for the three optimized parameters are shown in Table 2; the values of R² for each regression model are shown in Figure 2. The values of C varied from 0.48 (M_S) to 87.72 (F_S); the values of σ varied from 0.08 (M_BMWP) to 99.88 (A_S). The optimal values of ε obtained from the genetic algorithm ranged from 0.001 (F_BP) to 0.33 (M_BMWP).

Sensitivity Analysis
Table 3 shows the R² obtained when each input variable was removed from the SVM model. R² was used to indicate the model performance; a greater R² value indicates a better model fit. The OAT analysis checked the change in model fit when a variable was removed: the smaller the resulting R², the greater the impact of that variable on the model fit and the more sensitive the variable. For the algae community, the smallest values of R² for A_BP and A_S were 0.94 (TP) and 0.91 (CC), respectively. For the fish community, the smallest values of R² for F_BP, F_IBI, and F_S were 0.93 (BOD5), 0.62 (CC), and 0.93 (BOD5), respectively. For the macroinvertebrate community, the smallest values of R² for M_BMWP, M_EPT, and M_S were 0.35 (BOD5), 0.65 (CC), and 0.54 (TP), respectively.


Discussion
The results of the SVM model showed that the bio-indicators of the fish community (i.e., F_BP, F_S) and algae community (i.e., A_BP, A_S) were better fitted by the environmental variables than the indicators of the macroinvertebrate fauna (i.e., M_BMWP, M_S). This indicates that, in the Taizi River, the SVM model can be a reliable prediction tool for fish and algae communities using the selected environmental factors, while the ability of the model to predict the macroinvertebrate community was poor. The ecological status classification of the Taizi River revealed that the macroinvertebrate fauna was significantly impaired, while the fish community and algae community were less damaged [23]. This indicates that species with considerable or moderate tolerance occurred among the macroinvertebrate fauna, so their sensitivity to environmental stress was not very great.
Agricultural activities, which are a major type of human disturbance in the Taizi River, are known to contribute significant pollution to waterways in the form of nutrients, which are likely to affect the algae community. Previous studies showed that the quality of the physical habitat (i.e., water quantity, substrate), as well as chemical pollutants (i.e., COD, EC, TN), structured the fish communities at the local scale and played a crucial role in the reproduction and predation of fish communities [35,36]. This study considered both the physical habitat and chemical pollutants as environmental pressures in the SVM model, as both can apparently affect the structure of the fish community. Nevertheless, some uncertainties are not considered in the model. For example, the very complicated connections between the different aquatic communities (i.e., the food webs among fish, macroinvertebrates, and algae) can also influence the model results in this study and should not be ignored.
The sensitivity analysis of the input variables applied in the SVM showed that the most sensitive variables for predicting macroinvertebrate and algae communities were CC, DO, TN, and TP, while DO and BOD5 were the most sensitive variables for predicting fish communities. Studies have shown that nutrients play an important role in the photosynthetic production of a lake, as a limiting factor for the algae community [8]. With respect to the macroinvertebrate community, the hydromorphological dynamics of the river also played a key role in the small-scale distribution of the benthic community. For example, a higher velocity of river flow is usually associated with a richer and more abundant macroinvertebrate assemblage. This could be attributable to the river flow velocity, which plays a key role in water oxygenation and in the functional feeding of some macroinvertebrate groups, such as filter feeders. A study of the diversity and abundance of macroinvertebrates in a stream in Brazil reported that the sampling station with the highest DO level also had the highest Shannon diversity index [37]. DO could also be a key factor affecting the structure of a fish community, as low levels of DO will influence the tolerance limit of fish [38]. Previous studies have shown that many marine fish became stressed at a DO level of 4.5 mg·L⁻¹ [39]. In the Taizi River, DO and other physico-chemical indicators (such as TN and pH) had a significant effect on fish spatial distribution at the reach scale [40].
The results of the sensitivity analysis can provide a reference for ecological restoration with the aim of aquatic organism protection in the Taizi River. The restoration of river continuity, especially reach sinuosity, and nutrient control at the reach scale should take priority when improving the quality of algae and macroinvertebrate communities. However, control of organic pollution should be given priority when fish community restoration is taken into account. When developing an ecological restoration plan for the Taizi River, the importance of DO improvement to benefit all biological communities should not be overlooked.

Conclusions
The main purpose of this study was to provide a rational model for the prediction of freshwater biology community structure. Here, an SVM model was applied to predict the biological community structure from physico-chemical and hydromorphological parameters. The resulting models were then compared in terms of prediction accuracy and sensitivity to changes in the model input variables. The SVM-based model was successfully set up, with optimal model parameters determined using a GA, and showed reasonable prediction accuracy during both the training and validation processes. The results of this study suggest that SVMs can reveal the key variables for predicting biological community structure and may be a promising tool for water ecosystem management.

Figure 1. Map of the Taizi River Basin and location of sampling sites.


Table 1. Indicators of freshwater biology community structure (a) and environmental indicators (b) applied to the Taizi River Basin.

Table 2. Values of each optimized parameter calculated by genetic algorithm in SVM.
