Response of Fish Communities to Various Environmental Variables across Multiple Spatial Scales

A better understanding of the relative importance of different spatial scale determinants on fish communities will eventually increase the accuracy and precision of their bioassessments. Many studies have described the influence of environmental variables on fish communities on multiple spatial scales. However, there is very limited information available on this topic for the East Asian monsoon region, including Korea. In this study, we evaluated the relationship between fish communities and environmental variables at multiple spatial scales using self-organizing map (SOM), random forest, and theoretical path models. The SOM explored differences among fish communities, reflecting environmental gradients, such as a longitudinal gradient from upstream to downstream, and differences in land cover types and water quality. The random forest model for predicting fish community patterns that used all 14 environmental variables was more powerful than a model using any single variable or other combination of environmental variables, and the random forest model was effective at predicting the occurrence of species and evaluating the contribution of environmental variables to that prediction. The theoretical path model described the responses of different species to their environment at multiple spatial scales, showing the importance of altitude, forest, and water quality factors to fish assemblages.


Introduction
The distribution and abundance of aquatic communities are governed by various environmental factors at different spatial scales [1][2][3][4]. Among aquatic organisms, fish are relatively easy to identify, and are an important component of aquatic ecosystems through their regulatory effects on a variety of ecosystem-level properties and functions via their consumption of lower trophic levels [5][6][7]. They are commonly recognized as sensitive keystone communities that can indicate habitat change, environmental degradation, and overall ecosystem health [8][9][10].
Diverse studies have explored the relationships between biotic and abiotic factors, including geological factors [11], land cover and land use types [12,13], hydrological factors [14], stream habitat characteristics [15], stream order [16][17][18][19], and water quality [20]. These environmental factors are considered in a hierarchical structure ranging from large scale to small scale. Large-scale factors (i.e., landscape features) affect small-scale factors (i.e., microhabitat conditions and water quality, which have important influences on the distribution and abundance of organisms). Therefore, environmental conditions can be viewed as constituting filters through which species in the regional species pool must pass to potentially be present at a given locale [21,22]. The multi-scale habitat filter primarily specifies a set of four habitat levels (watershed, reach, channel unit, and microhabitat). However, slightly different numbers of habitat levels and diversity of elements within levels have been reported [23,24]. Therefore, various studies have been carried out to predict fish distribution or to identify the important environmental factors affecting the distribution patterns of fish [25][26][27]. Predicting fish assemblages is relevant to the evaluation of environmental quality and is an important framework for ecological studies on species interactions [28]. Species composition models may support environmental management by simulating different environmental scenarios and pointing out the most critical factors that need to be changed or regulated [28].
Understanding the effects of environmental variables on the distribution of biodiversity is fundamental for developing biological monitoring tools. A better understanding of the relative importance of determinants of fish communities at different spatial scales will eventually increase the accuracy and precision of bioassessments [1]. Many studies have examined the influence of environmental variables on fish communities from Europe [25,29], North America [30,31], and Oceania [32,33]. However, very little information is available on the East Asian monsoon region, particularly Korea [34,35], despite this region's long history and environmental features that have contributed to a rich biodiversity [1]. The Asian monsoon region has more than half of the World's population and comprises a major portion of the largest ocean and the largest continent, including the highest mountains in the World [36].
In this study, we evaluated the relationship between fish communities and environmental variables at 691 sampling sites throughout South Korea. Our goals were as follows: (1) to characterize the distributional patterns of fish communities on the national scale, (2) to identify the most important environmental factors influencing the distribution and abundance of fish species for different environmental categories across multiple spatial scales, and (3) to clarify the relative influence of regional and local variables on fish community composition in Korean rivers.

Ecological Data
Fish community data were obtained from the National Aquatic Ecological Monitoring Program operated by the Ministry of Environment and the National Institute of Environmental Research, Korea. Fish were sampled at 720 sites from 388 rivers at the national scale ( Figure 1) in April and May 2009, according to the standardized sampling protocol of the programme [37]. Among them, fish were collected at 691 sites. Five major rivers (Han, Nakdong, Geum, Yeongsan, and Seomjin Rivers) and their tributaries and small streams form the entire stream system in the country.   Fish were collected from all types of habitats in each site, including riffle, run, and pool areas, based on the catch per unit effort, with two types of sampling equipment: casting net (5-mm mesh size) and kick net (4-mm mesh size). Stream segments of approximately 200 m were sampled at each site for 50 min. Most of the captured fish were identified to the species level in the field. Among the collected specimens, some species requiring detailed identification and observation were fixed with 10% formalin solution and transported to the laboratory. Details of the sampling protocol for fish are described in MOE/NEIR [37].
Fourteen environmental variables were measured at each site. Environmental variables were categorised into three groups, geo-hydrological factors, land cover types, and physicochemical factors to indicate different fish assemblage characteristics (Table 1). Slope and stream order were obtained from the Water Management Information System (WAMIS, http://www.wamis.go.kr) of the Ministry of Land, Transport and Maritime Affairs, Korea. Altitude was measured from a Digital Elevation Map (DEM), and distance from the source (DFS) was calculated as the distance from the source of the stream to each site. The above variables were extracted from a digital map using ArcGIS 9.3 (www.esri.com). Land cover types (urban area, forest area, paddy field, and dry field) were obtained from the Ministry of Environment, Korea. The proportion of each type of land cover was extracted from a 1,000 × 100 m (length × width) area that included each sampling site on a digital map using ArcGIS 9.3 (www.esri.com). Dissolved oxygen (DO), conductivity, and pH were measured in the field using YSI 85 meters (YSI Inc., Yellow Springs, OH, USA) for DO and conductivity, and an Orion 3-Star-Plus pH meter (Thermo Fisher Scientific Inc., Waltham, MA, USA), for pH. Other variables, such as biological oxygen demand (BOD), total nitrogen (TN), and total phosphorus (TP), were analyzed in the laboratory by using the techniques by Eaton et al. [38]. Details of the sample collection protocol for fish and that for laboratory analysis are described in MOE/NIER [37].
Fish species were grouped into four different trophic guilds (omnivores, piscivores, insectivores, and herbivores) and three different tolerance guilds (i.e., tolerant species, intermediate species, and sensitive species) to evaluate how these traits were related to environmental differences. Tolerance and trophic guilds were defined according to the guidelines of the "National Surveys for Stream Ecosystem Health" in Korea [37]. A dataset consisting of 691 sites with fish species observed at more than two sampling sites were used for patterning and predicting fish communities. To reduce variation, fish density was natural-logarithm transformed after adding one to all data to avoid the problem of zero being undefined for logarithms.

Overall Procedure
Data analyses were conducted with four different analytical methods: self-organizing map (SOM), indicator species analysis, random forest model, and theoretical path model: (1) SOM classified fish communities to several groups, (2) indicator species were defined in each group using indicator species analysis, (3) SOM groups were predicted with random forest model, (4) occurrences of selected species were predicted with random forest model, and (5) theoretical path model was used to evaluate the influence of environmental factors on the occurrence of indicator species. The overall data analyses procedures are given in Figure 2.   SOM was applied to characterize the distributional patterns of fish communities. SOMs average the dataset in weight vectors through the learning process while removing noise [39,40]. The weight vectors in SOM tend to approximate the probability density function of the input vector and provide the distributional pattern of each input variable [41]. SOM consists of an input layer and an output layer. Each layer is connected by connection intensities (weights). The input layer is formed by computation units (neurons) that receive input data, which are used to calculate the Euclidean distance between the weight vector and the input vector. The output layer consists of N output neurons on a two-dimensional hexagonal lattice. A map size of 126 (= 14 × 9) output neurons was chosen, which was determined by the heuristic equation [42]. Two criteria, the quantization error for resolution and the topographic error for topology preservation, were used to evaluate the map quality [41]. These error values were used as an indicator of the accuracy of the mapping at preserving topology [43].
After the learning process, the SOM units were classified based on a hierarchical cluster analysis using Ward's linkage method with the Euclidean distance measure [44]. Based on the SOM weight after the SOM training, 25 relatively abundant species were examined to find indicative species for each cluster in SOM. We used the functions in the SOM toolbox [45] for training the SOM in Matlab version 6.1. A multi-response permutation procedure (MRPP), a nonparametric procedure for testing the differences between groups, was conducted using PC-ORD version 4.25 to evaluate the significance of the clusters [46].

Random Forest Model
A random forest model [47] was applied to predict the occurrence of fish species and the patterns of fish communities as defined by the SOM, using different combinations of environmental variables, and to identify the determinant environmental factors contributing to the models. The random forest model is a non-parametric method for predicting and assessing the relationship between a large number of potential predictor variables and response variables [47]. Random forest models have several advantages compared to other statistical methods, such as high classification accuracy, a novel method of determining variable importance, and the ability to model complex interactions among predictor variables [48]. Therefore, the random forest model offers powerful alternatives to traditional parametric and semiparametric statistical methods for the analysis of ecological data.
The importance of environmental variables used in the random forest model was evaluated using the Minimum Description Length (MDL), which measures the quality of attributes as their ability to compress the data [49] To compare the relative importance of each environmental factor, values of MDL were rescaled to range from 0 to 100. Dichotomies (0 and 1) for SOM clusters were used to predict fish community patterns. We used a correct prediction rate (i.e., the number of sites correctly predicted out of the total number of sites in the cluster) and Cohen's kappa [50] as a measure of agreement. Cohen's kappa ranges from 0 (completely random predictions) to 1 (perfect predictions). The random forest model was run with the CORElearn package [51] in the R statistical program (http://cran.r-project.org). Prior to running the random forest model, environmental variables, except land cover type, were transformed by the natural logarithm (x + 1) to reduce variation.

Indicator Species Analysis
To evaluate the indicator species in each SOM cluster, we applied indicator species analysis [52]. Indicator species were selected by an indicator value calculated as the product of its relative abundance and its relative frequency with ranges from 0 (no indication) to 100 (perfect indication) [53]. Monte Carlo tests were used to determine the significance of species indicator values. Indicator species analyses and Monte Carlo tests were carried out using PC-ORD version 4.25.

Theoretical Path Model (TPM)
TPM was used to describe the directed dependencies among a set of variables at multiple spatial scales [54]. Path analysis [55][56][57] was used to decompose correlations into their direct and indirect components and to allow simple correlations among a set of variables to be partitioned according to a path model describing their causal relationships [58]. Path analysis generated diagrams representing the relationships between variables at different scales, their direct link, and its significance [59]. This included correlation coefficients from the Pearson's correlation analysis and explanatory values (R 2 ) for environmental variables from multiple regression models (MRMs). Environmental variables were placed into three groups (geo-hydrological factors, land cover types, physicochemical factors) with one response layer (indicator species). Only a single important indicator was employed for each SOM cluster in the TPM. The indicator species were selected using indicator species analysis. TPMs were conducted using Statistica software (StatSoft, Inc., version 7).

Statistical Analysis
Spearman rank correlation coefficients were calculated among environmental variables. The Kruskal-Wallis test (K-W test) was used to compare the differences in environmental variables, community indices, and biological guilds, such as trophic and tolerance guilds, among clusters defined in the SOM. The nonparametric Dunn's multiple comparisons test was used for post hoc comparisons. The K-W test and the Dunn's test were conducted using the Statistica software.

Fish Communities
A total of 128 fish species in 32 families were collected from 691 sites ( Table 2), including 49 endemic species (representing 34.3% of the total fish abundance) and five exotic species (representing 3.3% of the total fish abundance). Zacco platypus (32.5%), Z. koreanus (11.6%), and Pungtungia herzi (4.4%) were the most abundant species and made up 48.6% of all the individuals collected. The Han River Watershed showed the highest species richness (96 species), followed by the Geum (69 species), Nakdong (67 species), Youngsan (59 species), and Seomjin (50 species) River Watersheds. The most common species in each of the five major watersheds was the same species that was the most common at the national scale, Z. platypus, but the second most common species was different in the Geum River Watershed (Hemiculter eigenmanni) and Youngsan River Watershed (Z. temminckii) from the other watersheds (Z. koreanus).

Fish Assemblage Patterns
Through the SOM learning process, the 691 sites were grouped into five clusters (I-V) according to the similarities of fish communities (Figure 3). The final quantization and topographic errors were 0.89 and 0.03, respectively, indicating a good training of the SOM. The clusters were significantly different in community composition (MRPP, A = 0.11, P < 0.001). The largest number of sites (181) was grouped into cluster IV, followed by cluster I (142), cluster III (138), cluster II (121), and cluster V (109). The number of sites in each cluster was visualized as the size of the hexagonal lattice in each SOM unit.
The distribution of sites in each cluster was highly related to the sites' geographic locations (Figure 3(c-g)). For instance, most sites in cluster I were located in the northeastern part of the Han River Watershed, and the sites in cluster III were situated in a mountainous area of the Korean peninsula. Sites in cluster IV were widely distributed in the Youngsan and Nakdong Rivers. Most sites in cluster II were near the coast and the sites in cluster V were mainly in the tributaries of the Han River.

Differences in Environmental Variables
All environmental variables were significantly different among clusters (K-W test, P < 0.05) ( Table 4). Altitude and slope were highest in cluster I, and lowest in cluster V (Dunn's test, P < 0.05). Stream order was significantly lower in cluster II, and DFS in cluster III and IV was significantly greater than in the other clusters (Dunn's test, P < 0.05). Cluster V showed high values for the proportion of urban area, but the lowest values for dry field area. Clusters I and II had a lower proportion of paddy fields than clusters III, IV, and V (Dunn's test, P < 0.05).
The proportion of forest area decreased according to gradients of clusters from cluster I to cluster V (Dunn's test, P < 0.05). The values of BOD, conductivity, and TP were significantly higher in cluster V, whereas cluster I showed the lowest values (Dunn's test, P < 0.05). The concentration of TN was not significantly different among clusters except cluster V. Cluster C had a relatively low pH, and cluster I showed a significantly lower concentration of Chl-a than clusters II-IV.  Table 3. Spearman rank correlation coefficients between 14 environmental factors (* P < 0.05, ** P < 0.01).  Stream Order 5 (4-6) bc 5 (3-5) c 6 (5-6) a 6 (5-7) a 5 (4-6) ab

Characteristics of Fish Assemblages
Community indices, tolerance guilds, and trophic guilds were significantly different among the five SOM clusters (K-W test, P < 0.05) (Figure 4). Species richness, the number of individuals, the Shannon index, and evenness showed similar patterns. Community indices, except for the number of individuals, were significantly higher in cluster III, and the lowest in cluster II (Dunn's test, P < 0.05). The ratio of insectivores to herbivores among the feeding guilds was relatively high in cluster I and was the lowest in cluster V ( Figure 5).  The proportion of herbivores showed a similar pattern to that of insectivores, and was significantly higher in cluster I than in all other clusters except cluster II (Dunn's test, P < 0.05). The proportion of omnivores was higher in cluster V, whereas that of piscivores was higher in cluster IV (Dunn's test, P < 0.05).

Prediction of Fish Assemblages
All five clusters were well predicted by random forest models (prediction accuracy > 0. 75) according to several different combinations of environmental factors across multiple spatial scales ( Table 5). When all 14 environmental factors were considered as independent variables, the prediction accuracy was the highest (>0.85). The prediction accuracies were low when only the four land cover types were used. Cluster I had higher abundances of Z. koreanus, Coreoleuciscus splendidus, and Tridentiger brevispinis, and cluster II had a high abundance of Rhynchocypris oxycephalus ( Figure 6). Cluster IV was characterized by a relatively high abundance of T. brevispinis and Z. platypus, whereas cluster V was represented by Carassius auratus and H. eigenmanni. The occurrences of 15 indicator species (Table 6) were well predicted with 14 environmental factors using the random forest model (accuracy rate >0.85). H. eigenmanni and Sarcocheilichthys variegatus displayed the highest prediction accuracy (0.96 and 0.95, respectively). Altitude and DFS were the most important variables for the prediction of fish occurrence. Forest area, stream order, TP, and conductivity were also relatively important for the prediction of fish occurrence.

Theoretical Path Model
Different species showed different responses to their environment across multiple spatial scales. Among geo-hydrological factors, altitude was positively correlated with Iksookimia koreensis, which was an indicator species of cluster I, whereas Carassius auratus, which was an indicator species of cluster V, displayed a negative correlation (r = 0.37 and r = −0.36, respectively, P < 0.05) (Figure 8). Rhynchocypris oxycephalus and S. variegatus showed a highly significant correlation with DFS (r = −0.36 and r = 0.29, respectively, P < 0.05). At the local scale, the presence of S. variegatus and Opsariichthys uncirostris was positively correlated with the concentration of Chl-a (r = 0.17 and r = 0.22, respectively, P < 0.05), whereas other species exhibited a significant relationship with TP (r = −0.27 in I. koreensis, r = −0.11 in R. oxycephalus, and r = 0.41 in C. auratus, P < 0.05).

Discussion
The distribution and abundance of fish communities were characterized with environmental variables across multiple spatial scales using SOM, random forest, and theoretical path models. In this study, we characterized how Korean fish assemblages on a national scale react to changes in the modified longitudinal gradient with various environmental variables at multiple spatial scales, and presented the importance of altitude, DFS, and urban areas for predicting fish community patterns and the occurrence of fish species. These results could provide necessary information for managing fish assemblages and the relationships between changes in fish assemblage and environmental variables. SOM revealed differences among fish communities, reflecting environmental gradients such as the longitudinal gradient from upstream to downstream, and differences in land cover, water quality, etc. For example, sites in cluster I were from small streams (25.7% of the streams less than third order with high altitude and short DFS), whereas most sites in cluster IV were located further downstream (39.9% of streams were greater than the seventh order with low altitude and long DFS). Species richness and abundance were significantly lower at downstream sites, and high values were found at mid-stream sites. However, previous studies have reported trends where the lowest species richness and abundance is found in headwater streams and the highest levels are downstream at low altitudes [13,[60][61][62][63]. These studies highlighted the importance of habitat size, because the larger area supported higher species richness and abundance. However, this concept was primarily applied to areas without disturbances. In the current study, the sampling sites showed a wide range of disturbances but general longitudinal gradients of fish species richness were observed after excluding severely polluted sites from the analysis. Oberdorff et al. [64] support our findings that species richness reached the maximum in midsize rivers, and then decreased in large rivers. The proportion of forest area decreased downstream, whereas agricultural and urban areas increased, creating an increase in nutrient and pollutant inputs to streams [65]. The moderate increase of nutrients in the middle stream led to increased species richness, while high nutrients may have reduced the species richness. This supports the intermediate disturbance hypothesis [66][67][68].
Trophic guilds as well as species richness changed along the upstream-downstream gradient. This result supported the River Continuum Concept [69]. The proportions of herbivores and insectivores were significantly higher further upstream than downstream, whereas the proportion of omnivores was relatively high downstream ( Figure 5). The trophic composition of the fish communities was induced by the available food resources [64]. Lowe-McConnell [70] and Rahel and Hubert [71] reported similar results; headwater streams had higher proportions of insectivorous species, while omnivores were more common in large rivers. These gradients in longitudinal distribution were found at the species level ( Figure 6). Insectivores such as Iksookimia koreensis, C. splendidus, and Z. koreanus were mainly distributed in cluster I, and their relative abundance gradually decrease towards clusters III and IV. Piscivores, such as O. uncirostris, showed relatively high abundance in cluster IV, and gradually decreased toward cluster I.
Urbanization was correlated with low fish abundance and richness and urban sites were dominated by disturbance-tolerant species [72]. Urbanization can lead to high concentrations of TP and TN [73], however, fish diversity and abundances in urban catchments have been found to be dramatically lower than in forested catchments [74][75][76]. This relationship indicates that urbanization can exert a major influence on water quality, habitat, and biological assemblages [65]. Similarly, agricultural exploitation can also influence aquatic organisms and their environments. Many studies have reported that agricultural activities degrade water quality, affect both riparian and stream habitat quality, and alter water flow [65]. Fish and macroinvertebrate biodiversity has been documented to decrease with a greater percentage of agricultural land [77][78][79].
Fish assemblages can be influenced by changes in environmental variables such as physical habitat and land use [80][81][82][83]. Stream gradient, stream order, hydrologic regime, and channel morphology were highly correlated with species richness [84][85][86]. Joy and Death [87] showed that altitude and distance from the coast were important in a model predicting regional freshwater fish occurrence in the Manawatu-Wanganui region of New Zealand. He et al. [27] also stated that altitude and stream length played important roles in driving the observed endemic fish assemblage structure. Altitude and DFS were also important variables for the prediction of fish community patterns in this study (Figure 7). Especially, altitude was the most important variable for the prediction of fish community pattern in Clusters I, III and IV, an indication of longitudinal gradients.
Altitude and DFS were the most important factors in 11 of the 15 indicator species. These 11 were indicator species for clusters I and III, which had relatively high altitude. Coreoperca herzi was an indicator species for cluster I, and DFS and TP were relatively important for predicting the occurrence of C. herzi. Samples in cluster I were located in upstream locations with a short DFS and a low concentration of TP (Table 4). Urban land cover was the most important variable for predicting the distribution of Liobagrus andersoni and Koreocobitis rotundicaudata, which were indicator species of cluster I. Changes in land use can affect assemblage composition, and lead to changes in the contaminant level of streams [88]. TN and TP were the most important variables for predicting the distribution of Cyprinus carpio and C. auratus, which were indicator species for cluster V.
Similarly, the theoretical path model described different responses of species to their environment at multiple spatial scales (Figure 8). Geographical attributes persist over a relatively long time and influence the development and selection of species' life history and behavioral traits [89]; and the surrounding conditions (e.g., slope and stream order) of a stream can also directly and indirectly affect stream habitats [1,90]. The theoretical path model showed significant correlations between geohydrological factors, land cover types, and physicochemical factors. Among land cover types, forest area displayed the highest correlation with five dominant species, and Chl-a and TP were the most important physicochemical factors for explaining species occurrences, indicating the importance of water quality in micro-habitat condition. Li et al. [1] reported similar results on benthic macroinvertebrates in the same study area. There are many reports of a strong correlation between geographical location and stream communities [91,92] and of the importance of altitude [3,93]. The random forest model is a non-parametric method for predicting and assessing the relationship between a large number of potential predictor variables and response variables [47]. Cutler et al. [48] reported that the random forest model demonstrated its learning and predicative power as well as its explanatory capacities by presenting a high capability for modeling ecological problems involving non-linear relationships between data. Random forest models have several advantages compared to other statistical methods, such as high classification accuracy, a novel method of determining variable importance, and the ability to model complex interactions among predictor variables [48]. Therefore, the random forest model offers powerful alternatives to traditional parametric and semiparametric statistical methods for the analysis of ecological data. In addition, He et al. [27] showed that mixed models that included both land cover and river characteristic variables were more powerful at explaining the endemic fish distribution patterns in the upper Yangtze River, similar to our results. In our study, the random forest model was more powerful for predicting fish community patterns using all 14 environmental factors than models using either a single variable or another combination of environmental variables (Table 5).
Many studies have been conducted on the relationships between changes in fish community structure and environmental variables [71,82,83], and most studies have considered some environmental variables such as physical habitat and land use at the local or watershed scale [94]. Although the distribution and abundance of species are closely linked to small-scale habitat availability [95], they are also influenced by variables at larger spatial scales [91]. Regional variables may operate as "filters" constraining species at lower scales through selective habitat forces [22]. Consequently, preservation and conservation strategies for maintaining stream integrity will be more effective if they are treated as a part of landscape development rather than an isolated entity [1]. Future studies to benefit conservation and management may consider the influences of global processes on biodiversity, the interactions between these three spatial scales, and the effects of global warming on fish communities.

Conclusions
The relationships between the distribution and abundance of fish communities and environmental variables at multiple spatial scales were evaluated using SOM, random forest, and theoretical path models. The SOM explored differences among fish communities, reflecting environmental gradients, such as a longitudinal gradient from upstream to downstream, and differences in land cover types and water quality. The random forest model for predicting fish community patterns that used all 14 environmental variables was more powerful than a model using any single variable or other combination of environmental variables, and the random forest model was effective at predicting the occurrence of species and evaluating the contribution of environmental variables to that prediction. The theoretical path model described the responses of different species to their environment at multiple spatial scales, showing the importance of altitude in geo-hydrological factors, forest cover types, and water quality factors to fish assemblages.