Descriptive Characteristics of Surface Water Quality in Hong Kong by a Self-Organising Map

In this study, principal component analysis (PCA) and a self-organising map (SOM) were used to analyze a complex dataset obtained from the river water monitoring stations in the Tolo Harbor and Channel Water Control Zone (Hong Kong), covering 2009–2011. PCA was first applied to identify the principal components (PCs) among the nonlinear and complex surface water quality parameters. SOM was then applied to analyze the complex relationships and behaviors of the parameters. The results reveal that PCA reduced the multidimensional parameters to four significant PCs, which are combinations of the original parameters. The positive and inverse relationships between the parameters were shown explicitly by pattern analysis in the component planes. It was found that PCA and SOM are efficient tools to capture and analyze the behavior of multivariable, complex, and nonlinearly related surface water quality data.


Introduction
Water quality is becoming increasingly important as pressure from industry, agriculture, and population growth rises [1]. Surface water is a substantial source for domestic use, industrial heating, and agricultural irrigation. Because surface water is easily accessible to human beings, it is the water body most vulnerable to contaminants. The major pollutant sources of surface water include discharges from domestic, industrial, and agricultural activities. Runoff transports pollutants from urban areas and agricultural lands to surface water resources, such as rivers and lakes [2]. Because surface water interacts with groundwater through complex and uncertain mechanisms, pollutants in surface water can also significantly affect groundwater quality. Therefore, the assessment of surface water quality is of great concern in water environmental management.
Water quality monitoring is time-consuming, and obtaining a large volume of water quality data is expensive. Analyzing these data is difficult because they are multidimensional, complex, and nonlinear. A variety of methods are available to assess water quality, such as the TOPSIS method [3–5], statistical analysis [6–8], support vector machines [9], set pair analysis [10], the water quality index [11], and matter-element analysis [12]. However, it is difficult to decide which of them is best. Owing to the complex characteristics of surface water, it is necessary to use a sophisticated knowledge extraction and diagnosis tool that can provide analysis and visualization of the multidimensional data set [13]. Model-based and statistics-based diagnosis techniques are widely used for this purpose. However, model-based diagnosis techniques have some weaknesses: (1) they cannot provide an entirely satisfactory description of the cause-effect relationships; (2) the models require the specification of a large number of parameters; and (3) the values of many of these parameters must be respecified for different operational conditions [14]. Statistics-based diagnosis techniques are therefore preferable for knowledge extraction from water quality data. Multivariate analysis methods, such as principal component analysis (PCA), are one kind of statistics-based diagnosis technique and have been widely used in hydrological system analysis [15–18]. However, the limitations of classical multivariate analysis methods are well known [2,19]. The artificial neural network (ANN) is another type of statistics-based diagnosis technique and is powerful for multivariate, nonlinear analysis. ANNs offer an alternative to traditional statistical methods for optimal monitoring and determination of dynamic systems [20], and have attracted considerable attention [21,22].
The self-organising map (SOM) is a neural network-based pattern analysis technique with unsupervised learning [23,24] that has been widely applied in water quality data analysis [2,11,25–27]. Çinar and Merdun employed SOM to diagnose the relationships among surface water quality parameters, and identified seven clusters corresponding to the water quality parameters [2]. Hong and Rosen used the SOM technique to capture the influences of stormwater infiltration on groundwater quality parameters, and obtained the relationships between the parameters [11]. Wu et al. applied the SOM method to identify the effects of climate change and human activities on coastal water quality [26]. The SOM technique is a powerful tool for grouping similar input patterns from a multidimensional input space into a much lower dimensional space, usually two dimensions. SOM can be used for clustering, classification, estimation, prediction, and data mining [28]. SOM can potentially outperform current methods of analysis because it can: (1) deal with the nonlinearities of the system; (2) be developed from data without requiring mechanistic knowledge of the system; (3) handle noisy or irregular data; (4) be easily and quickly updated; and (5) interpret information from multiple variables or parameters [13,14]. The SOM method also has excellent visualization capabilities, which can be helpful in the initial steps of water quality assessment frameworks.
In this study, PCA was performed to extract four significant principal components (PCs) from the twelve water quality parameters, and the SOM method was used to analyze the complex relationships among the water quality parameters in multivariable surface water quality data.

Study Area and Data
Hong Kong is divided into ten water control zones, each with its own set of water quality objectives. The rates of annual compliance with the key water quality objectives are assessed each year. The Tolo Harbor and Channel Water Control Zone is one of the ten water control zones in Hong Kong. Tolo Harbor is largely landlocked, with a narrow channel to the open sea, making it difficult for pollutants entering the harbor to be flushed out by tidal action. The harbor suffered severely from red tides in the 1980s. The establishment of the zone aimed to help improve the harbor water quality as well. The rivers in the zone are all short, with relatively small flows. The water quality data were collected for the twelve parameters during 2009–2011, with a total of 4752 measurements [29]. The water quality parameters involved were 5-day biological oxygen demand (BOD5), ammonia-nitrogen (NH3-N), chemical oxygen demand (COD), electrical conductivity (EC), dissolved oxygen (DO), total phosphorus (TP), nitrate nitrogen (NO3-N), nitrite nitrogen (NO2-N), saturated oxygen (Satur O2), non-dissolved matter (Susp), dissolved matter (Diss sol), and temperature (T).

Principal Component Analysis
PCA is an efficient tool to explain the variance of a large data set of correlated parameters with a much smaller set of uncorrelated PCs [30,31]. The PCs, obtained by multiplying the original correlated parameters by the eigenvectors (loadings), provide information on the most meaningful parameters that describe the whole data set, allowing data reduction with minimum loss of the original information [32,33].
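The description above can be written compactly as follows. This is a sketch of the standard formulation, assuming PCA is carried out on the correlation matrix of standardized data; the symbols $Z$, $R$, $v_j$, $\lambda_j$, and $t_j$ are notation introduced here for illustration and do not appear in the original text.

```latex
% Z: n x m matrix of standardized parameters (n samples, m parameters)
\begin{align}
R &= \frac{1}{n-1} Z^{\top} Z
  && \text{(correlation matrix of the parameters)} \\
R\, v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m
  && \text{(eigenvalues and loadings)} \\
t_j &= Z\, v_j
  && \text{(scores of the $j$-th PC)} \\
\frac{\lambda_j}{\sum_{l=1}^{m} \lambda_l} &= \frac{\lambda_j}{m}
  && \text{(fraction of total variance explained by PC $j$)}
\end{align}
```

Under this assumption the eigenvalues sum to the number of parameters $m$, which is why an eigenvalue greater than one marks a PC that explains more variance than a single standardized parameter.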

Self-Organising Map
SOM has been extensively used for data analysis owing to its excellent ability to project a high-dimensional dataset onto a lower dimensional space. A SOM consists of an input layer and an output layer (competitive layer), connected with each other by computational weights. The input layer is connected to each vector of the data set, and the output layer is made of an array of nodes (Figure 2).
The neurons in a SOM learn in an unsupervised way because the network is not given a target output. SOMs are competitive networks: the neurons compete to respond to each input, with only one neuron (or one group of neurons) becoming activated when a data pattern is presented [34]. Training involves the processes of competition, cooperation, and update. The best-matching unit (BMU) is determined in the competition process, the neighbor neurons are determined in the cooperation process, and the weight vectors are updated in the update process. The steps of the SOM algorithm are as follows:
(1) Initialize the SOM network. The weight vector $w_{ij}$ ($i = 1, 2, \ldots, S$; $j = 1, 2, \ldots, R$) is set randomly in the interval [0,1], where R is the sample dimension and S is the number of output neurons. The initial value of the learning ratio $\eta(0)$ ($0 < \eta(t) < 1$), the map size, the initial neighborhood ratio $N_g(0)$, and the maximum number of iterations T are defined.
(2) Present an input vector $P_k = (p_1^k, p_2^k, \ldots, p_R^k)$ ($k = 1, 2, \ldots, M$, where M is the number of samples) to the SOM network and calculate its distance to each neuron. The Euclidean distance is frequently used and can be calculated as:
$$d_i = \lVert P_k - w_i \rVert = \sqrt{\sum_{j=1}^{R} \left(p_j^k - w_{ij}\right)^2}$$
(3) Choose the smallest distance and identify the BMU.
(4) Update the weight vectors $w_{ij}$ of the neurons within the neighborhood $N_g(t)$ of the BMU:
$$w_{ij}(t+1) = w_{ij}(t) + \eta(t)\left[p_j^k - w_{ij}(t)\right]$$
where $w_{ij}(t+1)$ is the weight vector at learning step t + 1. The neighbor ratio $N_g(t)$ and the learning ratio $\eta(t)$ decrease with the number of iterations.
(5) Return to step (2) and repeat until the maximum number of iterations T is reached.
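The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB SOM toolbox implementation used in the study: the function names, the rectangular grid, the hard neighborhood cut-off, and the linear decay of the learning ratio and neighborhood radius are all assumptions made for the example.

```python
import math
import random


def train_som(samples, rows, cols, max_iter=1000, eta0=0.5, seed=0):
    """Train a small SOM on a rectangular grid (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Step (1): one weight vector per output neuron, random in [0, 1].
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    radius0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    for t in range(max_iter):
        # Learning ratio and neighborhood shrink with t (linear decay assumed).
        frac = 1.0 - t / max_iter
        eta = eta0 * frac
        radius = max(radius0 * frac, 0.5)
        # Step (2): present an input vector and compute Euclidean distances.
        p = samples[rng.randrange(len(samples))]
        dists = [math.dist(p, w) for w in weights]
        # Step (3): the BMU is the neuron with the smallest distance.
        bmu = dists.index(min(dists))
        # Step (4): move the BMU and its grid neighbors toward the input.
        for i, w in enumerate(weights):
            if math.dist(coords[i], coords[bmu]) <= radius:
                for j in range(dim):
                    w[j] += eta * (p[j] - w[j])
        # Step (5): continue with the next input until max_iter is reached.
    return weights


def best_matching_unit(p, weights):
    """Index of the neuron whose weight vector is closest to p."""
    dists = [math.dist(p, w) for w in weights]
    return dists.index(min(dists))
```

Calling `train_som(data, 14, 7)` on standardized data would give a 98-neuron codebook of the size used later in this study, although the toolbox uses a hexagonal rather than a rectangular grid.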
In step (1), the map size is vital to detect the deviation of the data set. If the map size is too small, it might not explain some important differences that should be detected. In contrast, if the map size is too big, the differences are too small [26,35]. There are two classical methods to determine the map size [26,36]. The first method is to calculate the quantization error (QE) and topographic error (TE), and the other is to choose a number of neurons close to $5\sqrt{n}$ (n is the number of samples of the training data). A detailed discussion of QE and TE can be found in [36,37].
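Both map-size criteria are easy to compute. The sketch below evaluates the $5\sqrt{n}$ rule for the 396 samples in this data set and defines the quantization error in its usual form, the mean distance between each sample and its best-matching unit. The function names are hypothetical, and the topographic error is omitted because it depends on the grid topology.

```python
import math


def heuristic_map_units(n_samples):
    """Rule-of-thumb number of neurons: 5 * sqrt(n)."""
    return 5 * math.sqrt(n_samples)


def quantization_error(samples, weights):
    """Mean Euclidean distance between each sample and its closest neuron."""
    total = sum(min(math.dist(p, w) for w in weights) for p in samples)
    return total / len(samples)


units = heuristic_map_units(396)  # 396 samples per parameter in this study
print(round(units, 1))            # 99.5
```

A candidate grid such as 14 × 7 = 98 neurons can then be compared with this target, and the QE of several trained candidate maps can be compared to pick the final size.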
The SOM technique has a distinct capability to represent the complex relationships of the water quality parameters using component planes and U-matrix. All simulations were implemented in MATLAB R2012b using a SOM toolbox [38].

Data Pre-Processing
The water quality parameters usually have different units and need to be normalized before the SOM method is applied, to avoid misclassification. The data can be standardized so that each parameter has zero mean and unit variance.
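A minimal sketch of this transformation is given below. The function name is hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not say which estimator was used.

```python
import math


def zscore_columns(rows):
    """Standardize each column (parameter) to zero mean and unit variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Sample standard deviation (n - 1 in the denominator) is assumed here.
    # A constant column would give a zero SD and must be handled separately.
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical values for two parameters with different units
# (e.g. DO in mg/L and T in degrees Celsius).
raw = [[8.1, 23.5], [7.5, 28.0], [9.0, 15.2], [8.4, 19.9]]
scaled = zscore_columns(raw)
```

After scaling, every parameter contributes on the same footing to the Euclidean distances used by the SOM and by K-means.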

K-Means Clustering
K-means clustering is a method of vector quantization that aims to partition n observations into k ($k \le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$
where $(x_1, x_2, \ldots, x_n)$ is the set of observations, k is the number of clusters, and $\mu_i$ is the mean of the points in $S_i$. The Davies-Bouldin clustering index was used to determine the optimal number of clusters for a dataset, and K-means clustering was conducted in MATLAB R2012b.
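The two quantities named above can be evaluated for any candidate partition as in the following sketch. It is not the MATLAB routine used in the study: the function names are hypothetical, and the Davies-Bouldin index is written in its common form, the average over clusters of the worst ratio of summed within-cluster scatter to centroid separation.

```python
import math


def centroid(points):
    """Mean of a list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def within_cluster_ss(clusters):
    """K-means objective: sum of squared distances to each cluster mean."""
    total = 0.0
    for pts in clusters:
        mu = centroid(pts)
        total += sum(math.dist(x, mu) ** 2 for x in pts)
    return total


def davies_bouldin(clusters):
    """Davies-Bouldin index of a partition; lower values are better."""
    mus = [centroid(pts) for pts in clusters]
    scatter = [
        sum(math.dist(x, mu) for x in pts) / len(pts)
        for pts, mu in zip(clusters, mus)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(mus[i], mus[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical 2-D points: a compact partition versus a poor one.
good = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
poor = [[[0, 0], [10, 0]], [[0, 2], [10, 2]]]
```

Running K-means for several values of k and keeping the k with the smallest index is the selection rule described in the text.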

Statistical Analysis
The statistical characteristics and the Pearson correlation matrix of the twelve water quality parameters are listed in Tables 1 and 2, respectively. The descriptive statistics in Table 1 include the minimum, maximum, median, and mean values, the standard deviation (SD), and the coefficient of variation (CV) of the water quality data. Table 2 quantifies the pairwise relationships between the parameters. As can be seen in Table 1, Susp has the biggest CV, followed by NH3-N, while Satur O2 has the smallest, followed by DO. This demonstrates that Susp and NH3-N vary greatly, while Satur O2 and DO are temporally stable. The other parameters have medium CVs, which indicates that their concentrations vary less than those of Susp and NH3-N but more than those of Satur O2 and DO. T is one of the most important water quality parameters because it limits the saturation values of the gases and solids dissolved in water [39]. T varies between 11.2 °C and 33.4 °C, with a median value of 23.5 °C. As presented in Table 2, T shows a negative correlation with DO (r = −0.466) and positive correlations with BOD5, NH3-N, COD, EC, TP, NO3-N, NO2-N, Satur O2, Susp, and Diss sol.
Notes: ** indicates correlation is significant at the 0.01 level (2-tailed); * indicates correlation is significant at the 0.05 level (2-tailed).
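For reference, the two statistics discussed above can be computed as in the sketch below. The function names and the input values are hypothetical and are not taken from Tables 1 and 2; the sample standard deviation is an assumption.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def sample_sd(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def coefficient_of_variation(xs):
    """CV = SD / mean; larger values indicate more variable parameters."""
    return sample_sd(xs) / mean(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical readings: DO falls as temperature rises, so r is negative.
temperature = [12.0, 18.0, 24.0, 30.0]
dissolved_oxygen = [10.0, 9.0, 8.0, 7.0]
r = pearson_r(temperature, dissolved_oxygen)
```

A parameter with a large CV, such as Susp in this study, varies strongly relative to its mean, while a negative r, as between T and DO, indicates an inverse relationship.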
The concentration of oxygen in surface water is a measure of the self-cleaning capacity of the water body. Fourteen of the 396 DO samples are below 6 mg·L⁻¹, and the median and mean values of the DO concentration are 8.1 mg·L⁻¹ and 8.1083 mg·L⁻¹, respectively. Similarly, the median and mean values of Satur O2 are 98% and 94.9672%, respectively. DO and Satur O2 are negatively correlated with BOD5, NH3

PCA Results
The Kaiser-Meyer-Olkin (KMO) test and Bartlett's test were first implemented to examine the validity of PCA (Table 3). The KMO value and the Bartlett's test statistic are 0.626 and 4517.867, respectively, which means that PCA can be used to perform data reduction. The objective of PCA is to reduce the multidimensional parameters to a much smaller set of PCs. According to the eigenvalue-one criterion, four PCs were extracted, accounting for 75.894% of the total variance. The PCs, eigenvalues, percentage of total variance, and cumulative percentage of explained variance are shown in Table 4. The eigenvalues of the four PCs are 4.582, 1.784, 1.557, and 1.184, respectively. The first two PCs (PC1 and PC2) account for 38.187% and 14.866% of the variance, respectively, together explaining more than half of the total variance in the original dataset. PC3 and PC4 explain 12.978% and 9.864% of the total variance, respectively.
The first PC (PC1), with the largest eigenvalue of 4.582, has strong positive loadings on TP, COD, BOD5, NH3-N, NO2-N, Diss sol, and NO3-N, which suggests that PC1 represents the contaminants in the study area. The coefficients of TP, COD, BOD5, NH3-N, and NO2-N are higher than those of Diss sol and NO3-N, which means that these five parameters have a bigger effect on PC1 than the other two. PC2 has significant loadings on Satur O2 and DO, representing the dissolved oxygen in the water body. PC3 has positive loadings on EC and Diss sol. EC and Diss sol are correlated, as mentioned above: a high concentration of Diss sol leads to the high loading of EC. The last PC (PC4) indicates the temperature because it only has a strong loading on T. Differences in water temperature affect dissolved oxygen, the rate of photosynthesis, and the metabolic rates of aquatic life [39]. PC2 and PC4 indicate the decay rate of the contaminants.
PCA results represent the contaminants and the decay rate of the contaminants regardless of monitoring stations in the study area.
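The percentages reported above can be checked from the eigenvalues. The sketch below assumes that PCA was run on the correlation matrix, so that the total variance equals the number of parameters (12); this is consistent with the eigenvalue-one criterion but is not stated explicitly in the text. The variable names are hypothetical.

```python
# Eigenvalues of the four retained PCs, as quoted in the text (Table 4).
eigenvalues = [4.582, 1.784, 1.557, 1.184]
n_parameters = 12  # total variance under the correlation-matrix assumption

# Eigenvalue-one criterion: keep PCs whose eigenvalue exceeds 1.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Share of total variance explained by each retained PC, in percent.
explained = [100.0 * ev / n_parameters for ev in retained]
cumulative = sum(explained)

for i, pct in enumerate(explained, start=1):
    print(f"PC{i}: {pct:.2f}%")
print(f"cumulative: {cumulative:.2f}%")
```

The values agree with the reported 38.187%, 14.866%, 12.978%, 9.864%, and 75.894% to within the rounding of the quoted eigenvalues.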

SOM Results
The map size is crucial for the SOM technique to cluster the data set. The QEs and TEs of a range of map sizes were calculated to determine the optimal number of map units (Table 5). The 14 × 7 map has the minimum values of QE and TE, 1.2388 and 0.0152, respectively. Therefore, a SOM with 98 output neurons arranged in 14 rows and 7 columns was chosen in this study. The total of 98 neurons, arranged in a hexagonal grid, is close to 99.5 ($5\sqrt{n}$).
The visualization of the component planes is a good tool to figure out the interrelationships of the different water quality parameters. A comparison of the component planes in Figure 3 shows that some parameters have positively related patterns. The grouping of the parameter planes shows three well-defined groups of correlated parameters, and the component planes within the same group have positive relationships with each other. The first group includes TP, COD, BOD5, NH3-N, and NO2-N. All the water quality parameters in this group have high values (red color) in the lower parts, especially the lower left parts, of the planes. As shown in the PCA results above, these five parameters have a bigger effect on PC1 than the other two parameters (NO3-N and Diss sol). The second group includes EC and Diss sol, which corresponds to PC3 in the PCA results above. EC is a reflection of Diss sol in water. The third group comprises DO and Satur O2. DO is correlated with Satur O2 with a correlation coefficient of 0.784 in the Pearson correlation matrix, and PC2 shows strong positive loadings on DO and Satur O2. The non-conventional positions of Susp, T, and NO3-N could be explained by their ability to describe various complex pollutants and their transformations [15].
The cluster analysis of the SOM was implemented with the K-means clustering algorithm. The Davies-Bouldin clustering index [40], which is commonly used to determine the optimal number of clusters for a dataset [2,14], was computed for this purpose. As shown in Figure 4, the Davies-Bouldin clustering index is minimized at four clusters. The optimal number of clusters is therefore four, and the four-cluster structure of the map is shown in Figure 5. According to Figures 3 and 5, the following information on the water quality parameters can be concluded:
(1) High DO, low T, low BOD5, low NH3-N, low COD, low EC, low TP, low NO2-N, low Susp, and low Diss sol (Group 1).
(3) High BOD5, high NH3-N, high COD, high TP, and low Susp (Group 3).
In winter, when the temperature is low, a high concentration of DO and low concentrations of BOD5, NH3-N, COD, EC, TP, NO2-N, Susp, and Diss sol are observed in Group 1. Çinar and Merdun clustered seven groups based on 1046 surface water samples collected over six years and found that in winter, when the temperature is lowest and rainfall highest, a high concentration of DO and low concentrations of Na, K, Cl, NH4-N, NO2-N, o-PO4 (ortho-phosphate), and pV (organic matter) can be observed in one group, which is similar to the results in this study [2].
The correlation matrix of the weights of the SOM is shown in Table 6. The minimum, maximum, mean, and SE values of the four groups are given in Table 7. As shown in Figures 3 and 5, Groups 2 and 4 represent the normal condition of the study area. The second group contains a total of 199 samples, the highest frequency among the four groups, followed by the fourth group with a total of 85 samples.

Conclusions
Water pollution control and management require the interpretation of a large amount of water quality data, and the data are complex, multidimensional, and nonlinearly related. The analysis and diagnosis of surface water quality is therefore a difficult task. In this study, PCA and SOM were implemented to discover the complex relationships among the twelve surface water quality parameters measured during 2009–2011 at river water monitoring stations in the Tolo Harbor and Channel Water Control Zone in Hong Kong. The following results were obtained.
(1) In the study area, the overall quality of surface water is good. This is attributable to a number of improvement measures, which had a significant effect on river water quality. However, some monitoring stations were still receiving discharges of untreated sewage effluents in this area. The major contaminants in the study area are TP, COD, BOD5, NH3-N, and NO2-N, as indicated by the Pearson correlation matrix, the PCA results, and the component planes of the SOM.
(2) Similarly, relationships were observed between Satur O2 and DO, and between EC and Diss sol. This indicates that the Pearson correlation matrix is an efficient tool for analyzing the relationships between the parameters and can verify the relationships found in the SOM results.
(3) The raw data can be clustered into four groups, and Groups 2 and 4 describe the normal condition of the study area.
This study has demonstrated that PCA performs well in dealing with multivariable data sets. Meanwhile, SOM, a powerful artificial intelligence technique, is capable of capturing the complex relationships between the parameters. PCA and SOM are good tools for dealing with large data sets, and they have the potential to be applied to other types of water resources problems, such as those involving groundwater.