Assessment of Water Quality in a Subtropical Alpine Lake Using Multivariate Statistical Techniques and Geostatistical Mapping: A Case Study

Concerns about the water quality in Yuan-Yang Lake (YYL), a shallow, subtropical alpine lake located in north-central Taiwan, has been rapidly increasing recently due to the natural and anthropogenic pollution. In order to understand the underlying physical and chemical processes as well as their associated spatial distribution in YYL, this study analyzes fourteen physico-chemical water quality parameters recorded at the eight sampling stations during 2008–2010 by using multivariate statistical techniques and a geostatistical method. Hierarchical clustering analysis (CA) is first applied to distinguish the three general water quality patterns among the stations, followed by the use of principle component analysis (PCA) and factor analysis (FA) to extract and recognize the major underlying factors contributing to the variations among the water quality measures. The spatial distribution of the identified major contributing factors is obtained by using a kriging method. Results show that four principal components i.e., nitrogen nutrients, meteorological factor, turbidity and nitrate factors, account for 65.52% of the total variance among the water quality parameters. The spatial distribution of principal components further confirms that nitrogen sources constitute an important pollutant contribution in the YYL.

Keywords: multivariate statistical technique; geostatistical mapping; water quality; principal component analysis; cluster analysis; Yuan-Yang Lake

Introduction
Water quality is the main factor controlling healthly and diseased states in both humans and animals. Surface water quality is an essential component of the natural environment and a matter of serious concern today. The variations of water quality are essentially the combination of both anthropogenic and natural contributions. In general, the anthropogenic discharges constitute a constant source of pollution, whereas surface runoff is a seasonal phenomenon which is affected by climate within the water catchment basin [1]. Among them, because of the intensive human activities, the anthropogenic inputs from a variety of sources are commonly the primary factors affecting the water quality of most rivers, lakes, estuaries, and seas, especially for those close to highly urbanized regions.
Many investigations have been conducted on anthropogenic contaminants of ecosystems [2][3][4]. Because of the spatial and temporal variations in water quality conditions, a monitoring program which provides a representative and reliable estimation of the quality of surface waters is necessary. The monitoring results produce a large and complicated data matrix that is difficult to interpret to draw meaningful conclusions. Multivariate statistical techniques are powerful tools for analyzing large numbers of samples collected in surveys, classifying assemblages and assessing human impacts on water quality and ecosystem conditions.
The application of different multivariate statistical techniques, such as principal component analysis (PCA), factor analysis (FA), cluster analysis (CA), and discriminate analysis (DA), assists in the interpretation of complex data matrices for a better understanding of water quality and ecological characteristics of a study area. These techniques provide the identification of possible sources that affect water environmental systems and offer a valuable tool for reliable management of water resources as well as rapid solution for pollution issues [5][6][7]. Multivariate statistical techniques have been widely adopted to analyze and evaluate surface and freshwater water quality, and are useful to verify temporal and spatial variations caused by natural and anthropogenic factors linked to seasonality [8,9].
Geostatistical mapping is based on field observations. Because field surveys are limited by the cost of sampling, only sparse observation data are generally available. Geostatistical mapping or further analysis requires the assessment of exhaustive attribution values for an entire study area. Geostatistical mapping techniques have been widely applied to different fields including water quality in bays [10] watersheds [11], soil properties [12], precipitation [13], river discharges [14], air pollution [15], and so on. To the best of our knowledge, geostatistical mapping has not been adopted for studying water quality data in lakes.
The objective of the present study was to analyze 14 physico-chemical water quality parameters in water samples collected on monthly basis from 2008 to 2010 in a subtropical alpine lake (Yuan-Yang Lake) in Taiwan. The data matrix obtained from field measurement was subjected to the CA, PCA, and FA techniques, as well as geostatistical mapping to evaluate information about the similarities between sampling stations and to ascertain the important contributions of nutrient sources among water quality parameters in the alpine lake.

Study Site and Sample Collection
The Long-Term Ecological Research (LTER) program is one of the core projects of the Global Change and Terrestrial Ecosystem program (GCTE), which is under the umbrella of the International Geosphere-Biosphere Program (IGBP). An understanding of ecological processes and of mechanisms leading to ecologically tragic events is particularly important for the sustainability of Taiwan Island. To meet such a requirement, the LTER project was initiated in 1992 on the island. Yuan-Yang Lake (YYL) is one of the six LTER sites and the only site associated with a mountain lake ecosystem in Taiwan. YYL, a small (3.6 ha) and shallow (4.5 m maximum depth) lake in a mountainous catchment 1,730 m above sea level, is located in the northeastern region of Taiwan (24°35′ N, 121°24′ E) ( Figure 1). The lake and surrounding catchment (374 ha) were designated as a long-term ecological study site by the Taiwan National Science Council in 1992 and joined the Global Lake Ecological Observatory Network (GLEON) in 2004. The lake is an important site for studying physical characteristics, water quality, and ecosystems. Recently, the lake has been subject to pollution sources from recreational activities, therefore the investigation of water quality is urgent and necessary.
The steep watersheds are dominated by pristine Taiwan false cypress [Chamaecyparis obtusa Sieb. & Zucc. var. formosana (Hayata) Rehder] forest. The average annual temperature is approximately 13 °C (monthly average ranges from −5 to 15 °C ) and the annual precipitation is more than 4,000 mm. YYL is subject to three to seven typhoons in summer and autumn each year, during which more than 1,700 mm of precipitation may fall on the lake.
The sampling network including eight measured stations was designed to cover a wide range of key locations accounting for inflow and outflow ( Figure 1). Stations 1 and 2 are located at shallow area which is a swamp (shallow) zone. Stations 3 to 8 are located at the middle and deep zones. Station 4 is near by water inflow site, while station 5 is close to the site of lake water outflow.
Water temperatures were measured through the water column at 0.5 m increments using a thermistor chain (Templine, Apprise Technologies, Inc. Duluth, MN, USA). Wind speed was measured 1 m above the lake by an anemometer (model 03001, R.M. Young, Traverse, MI, USA). Precipitation, air temperature and downwelling photosynthetically active radiation (PAR) were measured at a land-based meteorological station approximately 1 km away from the lake. Variation in water levels was measured using a submersible pressure transmitter [PS 9800(1), Instrumentation Northwest, Kirkland, WA, USA] deployed at the lake shore ( Figure 1). The attenuation of irradiance by the water column, in the 400-700 nm bands, was measured using a Licor underwater quantum flat head sensor. The outputs from the senor were stored using Licor data logger in the field, and converted to light measurements in the laboratory. The pH, turbidity, and Secchi depth were measured in situ. Dissolved oxygen concentration was measured with a dissolved oxygen meter (Yellow Springs Instruments Company USA, Model 550A). The water samples, collected using an open water grab sampler equipped with a sample pull-ring that allowed for sampling at different water depths, were analyzed and measured in laboratory to obtain total suspended solids (TSS), nutrients (nitrate nitrogen, ammonium nitrogen, total nitrogen, and total phosphorus), and chlorophyll a concentrations. Chlorophyll a was measured by filtering with 600 cm 3 samples through a glass fiber filter. The filter paper itself was used for the analysis. The filtering was group up 90% acetone solution and fluorometer is used to read the light transmission, which in turn was used to calculate the concentration of chlorophyll a. TSS and nutrients, concentration was analyzed using the US EPA standard method 160.1 [16].

Cluster Analysis
CA is an unsupervised pattern recognition method that divides a large amount of cases into smaller groups or clusters based on the characteristics they process. The resulting clusters of objects should exhibit high internal (within cluster) homogeneity and high external (between clusters) heterogeneity. Hierarchical CA is the most common approach, which starts with each case in a separate cluster and joints the clusters together step by step until only one cluster remains and is typically illustrated by a dendrogram (tree diagram). The dendrogram provides a visual summary of the clustering process, presenting a picture of the groups and their proximity, with a dramatic reduction in dimensionality of original data. The Euclidean distance usually provides the similarity between two samples and a distance can be represented by the difference between analytical values from samples. In the present study, hierarchical CA was adopted to the standardized data using Ward's method, with Euclidean distance as a measure of similarity. The Ward method applies an analysis of variance approach to assess the distances between clusters to minimize the sum of squares of any two clusters that can be formed at each step. The spatial variability of water quality in the lake was determined from hierarchical CA using the linkage distance [17][18][19]. ...

Principal Component Analysis/Factor Analysis
The linear combination is chosen so that the sum of squares of 1 c is as large as possible subject to the condition that 22 11 1 The second principal component is another linear combination of y j : where the variance 2 c is the maximal, subject to the conditions that corr( 1 c , 2 c )=0 and that 22 12 2 ......

1.
p ww    The criterion of summarizing the information in p variables by a few components is valuable as a means of reducing the number of variables needed in an analysis [20].
FA follows PCA. FA focuses on reducing the contribution of less significant variables to simplify even more of the data structure coming from PCA. This purpose can be implemented by rotating the axis defined by PCA based on well established rules, and constructing new variables, also called varifacrors (VFs). PCA of the normalized variables was performed to extract significant PCs and to further reduce the contribution of variables with minor significance; these PCs were subjected to varimax rotation (raw) generating VFs [21,22].
The FA can be written as: where y is the measured variable, f is the factor loading, z is the factor score, e is the residual term accounting for errors, i is the sample number and m is the total number of factors. The multivariate statistical technique calculations were implemented using STATISTICA 8 [23] and Microsoft Office Excel 2007.

Geostatistical Mapping
Geostatistical mapping can be defined as the analytical production of maps by using field observations, auxiliary information and a computer program that generates predictions. The isotropic semivariogram are estimated to characterize the relationship between general spatial dependence and distance among the observations. Different semivariogram models, e.g., exponential and Gaussian models, nested with nugget effects are selected separately with respect to different principle components or factor scores. The optimal parameters for semivariogram models are calculated by the weighted least squares method [24]. Despite the concerns about the spatial non-orthogonality, the cross-correlations between different principle components or factor scores are calculated [25,26]. It shows that the cross-correlations increase as the spatial lags increases; however, the maximum cross-correlations are still small and less than 0.4. This study then assumes the spatial orthogonality of the principle components as well as the factor scores. The use of simple kriging usually requires the knowledge of the underlying space/time trend of the attributes of concern. However, it is not available for the modeling of "transformed" variables in this study. In these cases, many studies use nonparametric method for the trend modeling. Therefore, in this study, ordinary kriging is used for the spatial mapping which considers a non-parametric trend as well the spatial association among the attributes concurrently. All the geostatistical analysis computations of this study were performed on SEKSGUI, which is freely and publicly available [27].

Results and Discussion
The measured results of 14 physico-chemical water quality parameters at eight sampling stations from August 2008 to June 2010 in the YYL are presented in Table 1.

Spatial Similarity with CA
Cluster analysis was applied to find out the similarity groups between the sampling stations. It produced a dendogram (Figure 2), grouping all eight sampling stations into three statistically meaningful clusters.  The two measurement stations (1 and 2) are regarded as the cluster 2 which comprises the shallow area. Stations 3, 4, 5, and 8 are cluster 1 which corresponds to the middle water depth. Stations 6 and 7 belonging to the deep zone which constitutes cluster 3. The results show that the CA technique is useful for classification of lake waters, hence, the number of sampling sites and respective cost can be diminished in future monitoring plans. There are other reports [28][29][30], with similar water quality program results.

Principal Component Analysis and Pollution Identification
Pattern recognition of correlations among 14 parameters was best summarized by PCA/FA. The Bartlett test was used on the data set to examine the suitability of these data for PCA/FA. In this study, the covariance matrix coincided with the correlation matrix which was presented in Table 2, because FA/PCA was applied to normalized data. Overall, the correlations between variables were relatively weak. There are some positive correlations between some variables such as TP, NH 4 -N, TN, TSS, Chl-a, and so on. The negative correlations were revealed between some variables such as DO, Temp, NH 4 -N, TN, Chl-a, Turb, and so on. Correlation coefficients of two elements were very useful, because they numerically represented the similarity between two elements of the two water quality variables. This also indicated that PCA could successfully reduce the dimensionality of the original data set. Therefore factor analysis of the present data set further reduced the contribution of less significant variables obtained from PCA.
The Scree plot (shown in Figure 3) was applied to identify the number of PCs to be retained to understand the underlying data structure. Based on the Scree plot and the eigenvalues >1 criterion, four factors were chosen as principal factors, explaining 65.52% of the total variance in the data set. The corresponding VFs, variables loadings, eigenvalues, and explained variance are presented in Table 3.   Liu et al. [31] classified the factor loadings as "strong", "moderate", and "weak", corresponding to absolute loading values of >0.75, 0.75-0.50, and 0.50-0.30, respectively. The first factor (VF1), explaining 26.89% of total variance, had moderate positive loadings on TP, NH 4 -N, TSS, Chl-a, and Turb (turbidity). Because the NH 4 -N concentration is a nutrient source for chlorophyll a growth, VF1 represented nitrogen source. VF2, which explained 18.08% of total variance, had a moderate positive loading on R (rainfall), WS (wind speed), TN, and pH and represents meteorological factors. VF3, explaining 11.02% of total variance, has a moderate positive loading on Ke, SD, and Turb (turbidity). This factor represents the contribution of turbidity effects in the water column. VF4, explaining 9.54% of total variance, had a moderate positive loading on NO 3 -N and water temperature and represented the nitrate factor. The analyzed results revealed that FA/PCA can serve as an important means to identify the main factors affecting water quality in the alpine lake.

Geostatistical Mapping
Geostatisitcal techniques were used for the mapping of principle components and factor scores over the study area. Due to the long period between each observation campaign, the temporal correlation among the observations is assumed to be ignorable in this analysis. Table 4 shows that the spatial dependence structure varies across the identified contributing factors by the common multivariate analysis. It implies the variation of spatial patterns of impacts to water quality from the contributing factors. Among them, the impact of nitrogen nutrients changes more significantly over space than other contributing factors. The experimental and modeled variograms of PC1 and FA1 are shown in Figure 4. The variogram figure in time for PC1 and FA1 is also presented in Figure 5. It is clear that the variogram value approximates to sill in cases of the temporal lags in month among observations larger than 0. It implies the low correlation between the observations collected in different months. The contaminants from nitrogen nutrient are more localized as shown in Figure 6. On the other hand, the effects from the sunlight, organic matter, and nitrate nutrition present much smoother variations across the study area. This implies the sources of these contributors are more homogeneously distributed over the lake. It is noticeable that the range of the semivariogram model of second principle component is excessively larger than those of the models of other factors. It implies that the meteorological effects derived from PCA contribute a relatively large scale variation of water quality in space with respect to the scale of the study area.
The spatial distribution of the PC and FA can vary over time. Our analysis shows that the spatial distributions of PC (or FA) of the observations collected in the same month are generally similar. As for the PC obtained at different months, their spatial distribution can be distinct. This variability can result from meteorological condition and physico-chemical characteristics.    The general characteristics can be seen in Figure 7 in which a clear increasing trend from south to north of principle component is shown.

Conclusions
Water quality data collected from eight monitoring stations located around the subtropical alpine Yuan-Yang Lake in Taiwan have been examined by unsupervised pattern recognition (CA) and display methods (PCA/FA) to yield correlations between variables and water quality similarity in the lake. Cluster analysis confirmed the existence of three types of water quality (i.e., shallow, middle and deep zones of the lake). The PCA and FA assisted to extract and recognize the factors or origins responsible for water quality variations. PCA/FA identified four latent factors that explained 65.52% of total variance, namely nitrogen source, meteorological factor, turbidity effect, and nitrate factor, respectively. Geostatisitcal techniques were used for the mapping of principle components and factor scores in the lake. The results revealed that the impact of nitrogen nutrients changes more significantly over space than other contributing factors. It means that nitrogen sources consist of important contribution to affect the water quality of the lake. Thus, this study illustrated the usefulness of multivariate statistical and geostatistical techniques for the analysis and interpretation of complex data set, water quality assessment, and identification of important contribution in nutrient source in the YYL.