Information Entropy for Evaluation of Wastewater Composition

The composition of wastewaters collected during one year was evaluated based on the Shannon information entropy. Eleven physico-chemical parameters, namely biochemical oxygen demand (BOD), chemical oxygen demand (COD), total phosphorus (TP), total nitrogen (TN), total suspended solids (TSS), total dissolved salts (TDS), pH, ammonium, phosphate, cyanide and phenol, were determined for their characterization. Entropy of the parameters, calculated by means of their histograms, decreased in the order: phosphate > ammonium > TDS > TN > pH > BOD > COD > TSS > TP > phenol > cyanide. Entropy weights of the parameters were calculated for the evaluation of wastewater composition by means of the entropy weighted index (EWI) defined according to the simple additive weighting (SAW) model. The EWI values were analyzed by univariate statistics to observe temporal changes in wastewater composition and were verified by means of the principal component weighted index (PCWI). The outlying samples were also confirmed by multivariate analysis. The entropy-based approach allowed a simple evaluation of wastewater composition by means of one index instead of several parameters. The main advantage of EWI is the simple histogram-based calculation of entropy, which does not require the parameters to be normally distributed.


Introduction
Real waters represent very complex systems containing organic and inorganic compounds, suspended solids, dissolved gases and different microorganisms. The physico-chemical properties can be characterized by several physical and chemical parameters and, therefore, the evaluation of water composition is a multidimensional problem. The parameters are of different magnitudes and scales, often mutually correlated and non-normally distributed. Some of them, such as chemical oxygen demand (COD), biochemical oxygen demand (BOD), electrical conductivity, total suspended solids (TSS) and total dissolved salts (TDS), characterize groups of similar compounds, while the others provide information about the concentrations (magnitudes) of individual compounds, such as anions and cations, heavy metals and many types of organic compounds.
The concept of entropy was introduced by R. J. E. Clausius (1822-1888) as a measure of dissipated and useless heat. With the development of thermodynamics in the 19th century, L. Boltzmann (1844-1906) defined entropy as a simple function of all possible ordered states W as $S = k \ln W$, where k is the Boltzmann constant, which means that entropy increases with higher disorder of a system. J. W. Gibbs substituted the number of possible states with n states with probabilities $p_i$ and derived the relationship $S = -kn \sum_{i=1}^{n} p_i \ln p_i$, which inspired C. E. Shannon's (1916-2001) concept of information entropy [1,2].
The aim of this paper was to statistically evaluate raw wastewater composition based on the concept of entropy in information theory. The variance of wastewater parameters was expressed by entropy, and the changes of wastewater composition were evaluated by a single index composed of the parameters and their entropy weights. The entropy weighted index (EWI) values were verified by comparison with the values of the principal component weighted index (PCWI), computed based on robust principal component analysis (RPCA), which was introduced recently [32].

Sample Collection and Analysis
The 343 wastewater samples were taken at an inlet of a biological wastewater treatment plant (BWWTP). The BWWTP was designed for the capacity of 640,000 population equivalents for the treatment of municipal and industrial wastewaters. Water analyses were performed according to ISO

Entropy Calculation
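In general, the information entropy $H_j$ of each variable $x_j$ (the number of variables is m) describing n observations can be defined by Shannon's relationship [1] as

$$H_j = -\sum_{i=1}^{n} p_{i,j} \ln p_{i,j} \qquad (1)$$

where $p_{i,j}$ is the probability of $x_j$ occurrence; it holds that $\sum_{i=1}^{n} p_{i,j} = 1$. The maximal entropy is defined as $H_{j,\max} = \ln n$. The probabilities $p_{i,j}$ can be approximated with relative frequencies $f_{i,j}$ calculated using histograms for N intervals as follows

$$H_j = -\sum_{i=1}^{N} f_{i,j} \ln f_{i,j} \qquad (2)$$

In analogy with the simple additive weighting (SAW) model [33], the EWI describing the composition of a water sample i was calculated as

$$\mathrm{EWI}_i = \sum_{j=1}^{m} w_j \frac{x_{i,j}}{\mu_j} \qquad (3)$$

where $\mu_j$ is the mean of parameter $x_j$ calculated from n samples and $w_j$ is the entropy weight. It holds that $\sum_{j=1}^{m} w_j = 1$. The ratio $x_{i,j}/\mu_j$ compensates for the different scales and units of the parameters and can be considered as a relative concentration. The entropy weights were calculated as

$$w_j = \frac{1 - h_j}{\sum_{j=1}^{m} (1 - h_j)} \qquad (4)$$

where $h_j$ is the normalized entropy of parameter $x_j$,

$$h_j = \frac{H_j}{H_{j,\max}} \qquad (5)$$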
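The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function names, the equal-width binning, the choice of `n_bins` and the two small data columns are assumptions made here, and the entropy is normalized by ln(N), the largest value an N-interval histogram can reach.

```python
import math

def histogram_entropy(values, n_bins):
    """Shannon entropy (natural log) of relative frequencies in equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls into the last bin; a constant column uses bin 0.
        k = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
        counts[k] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def entropy_weights(columns, n_bins):
    """w_j = (1 - h_j) / sum_j (1 - h_j); h_j = H_j / ln(n_bins) is an assumption."""
    h = [histogram_entropy(col, n_bins) / math.log(n_bins) for col in columns]
    total = sum(1.0 - hj for hj in h)
    return [(1.0 - hj) / total for hj in h]

def ewi(columns, weights):
    """EWI_i = sum_j w_j * x_ij / mean_j, evaluated for every sample i."""
    means = [sum(col) / len(col) for col in columns]
    n_samples = len(columns[0])
    return [
        sum(w * col[i] / mu for w, col, mu in zip(weights, columns, means))
        for i in range(n_samples)
    ]

# Hypothetical parameter columns (one list per parameter, one entry per sample).
spread = [1, 2, 2, 3, 4, 5, 6, 8]   # values spread over the whole range
skewed = [1, 1, 1, 1, 1, 1, 1, 9]   # narrow distribution with a single tail value
weights = entropy_weights([spread, skewed], n_bins=4)
index = ewi([spread, skewed], weights)
```

In this toy example the narrow, tailed column has the lower entropy and therefore receives the larger weight. Because each parameter is divided by its own mean and the weights sum to one, the EWI values average to one over the samples used to compute the means.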

Principal Component Analysis
Principal component analysis looks for new latent variables of n samples, which are statistically independent [34]. Each latent variable, or principal component (PC), is a linear combination of the m variables $x_j$ and describes a different source of total variation:

$$X = T W^{T} + E \qquad (6)$$

where X (n × m) is the data matrix, T (n × p) and W (m × p) are the matrices of principal component scores and loadings, respectively, and E (n × m) is the residual matrix representing noise. Classical PCA can be performed by eigenvalue decomposition of a correlation matrix or by singular value decomposition (SVD) of the original data matrix [35,36]. RPCA was performed by the eigenvalue decomposition of an estimated correlation matrix with the lowest possible determinant, computed using a minimum covariance determinant (MCD) algorithm [37][38][39]. It was computed using the subroutine mcdcov in MATLAB (see below).

Mahalanobis Distance
The Mahalanobis distance of an observation $x_i$ can be computed as

$$MD_i = \sqrt{(x_i - \mu)\, C^{-1} (x_i - \mu)^{T}} \qquad (7)$$

where $x_i$ is the row vector of the i-th observation, $\mu$ is the mean vector of the variables and C is the covariance matrix. The robust Mahalanobis distance of the observation $x_i$ can be computed as

$$RD_i = \sqrt{(x_i - \mu_M)\, \Sigma^{-1} (x_i - \mu_M)^{T}} \qquad (8)$$

where $\mu_M$ is the MCD estimate of location and $\Sigma$ is the MCD-estimated covariance matrix. The MCD estimator is considered to be a highly robust estimator of multivariate location and scatter.
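As an illustration of the distance computation, the following sketch evaluates the classical Mahalanobis distance in plain Python. It deliberately uses the ordinary sample mean and covariance matrix; the robust variant would replace them with MCD estimates, which are not implemented here. All function names and the four data points are assumptions made for this example.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of observations (rows)."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def covariance_matrix(rows, mu):
    """Sample covariance matrix (divisor n - 1)."""
    n, m = len(rows), len(mu)
    return [
        [sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

def solve(a, b):
    """Solve a @ z = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def mahalanobis(row, mu, cov):
    """sqrt((x - mu) C^-1 (x - mu)^T), computed without forming C^-1 explicitly."""
    d = [x - u for x, u in zip(row, mu)]
    z = solve(cov, d)
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))

# Hypothetical two-variable data set, symmetric around the origin.
rows = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
mu = mean_vector(rows)
cov = covariance_matrix(rows, mu)
distances = [mahalanobis(r, mu, cov) for r in rows]
```

An observation would be flagged as an outlier when its distance exceeds a chosen cut-off, typically the square root of a chi-square quantile with as many degrees of freedom as there are variables.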

Statistic Calculations
An original data matrix of wastewater samples was processed in MS Excel. The MCD estimators were calculated by means of the LIBRA MATLAB Library [40] using MATLAB R2015b (MathWorks, USA). Statistical calculations were performed using the software packages QC.Expert (TriloByte, Czech Republic) and XLSTAT 2019 (Addinsoft, Boston, MA, USA). The data smoothing was performed by a fast Fourier transform (FFT) algorithm in the program OriginPro 9.0.0. (Origin Corporation, Northampton, MA, USA).
The data were standardized in order to avoid misclassifications arising from different orders of magnitude of variables. For this purpose, the data were mean (µ) centred and scaled by standard deviations (σ) as $y = (x - \mu)/\sigma$.
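The standardization step can be illustrated with Python's standard library. This is only a sketch: the COD values are invented, and the use of the sample standard deviation (divisor n − 1) is an assumption, since the text does not say which estimator was applied.

```python
import statistics

def standardize(values):
    """Mean-centre the values and scale them by the sample standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # divisor n - 1 (assumed)
    return [(v - mu) / sigma for v in values]

# Hypothetical COD concentrations in mg/L.
cod = [310.0, 420.0, 380.0, 510.0, 290.0]
cod_standardized = standardize(cod)
```

After the transformation every parameter has zero mean and unit standard deviation, so parameters measured in different units become directly comparable.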

Entropy and Entropy Weights of Wastewater Parameters
The raw wastewaters, a mixture of municipal and industrial ones, were characterized by the 11 parameters listed in Table 1. The wastewater data were standardized as mentioned above: the original parameters $x_j$ were scaled and centred to obtain the transformed parameters $y_j$, whose density functions $p_{i,j}$ were then approximated by the relative frequencies $f(y_{i,j})$ to be used in Equation (2). Two examples of histograms with the highest ($\mathrm{PO_4^{3-}}$) and lowest ($\mathrm{CN^-}$) entropy are shown in Figure 1.
Water 2020, 12, x FOR PEER REVIEW 4 of 11

The entropy values summarized in Table 2 decreased in the sequence $\mathrm{PO_4^{3-}}$ > $\mathrm{NH_4^+}$ > TDS > TN > pH > BOD > COD > TSS > TP > phenol > $\mathrm{CN^-}$. Based on exploratory analysis, for example the P-P plot shown in Figure S1 (Supplementary Materials), the parameters were separated into two groups: the first group contained the parameters with higher entropy, namely $\mathrm{PO_4^{3-}}$, $\mathrm{NH_4^+}$, TDS, TN, pH, BOD and COD, and the second one consisted of TSS, TP, phenol and $\mathrm{CN^-}$ with lower entropy. It is obvious that entropy decreased with increasing kurtosis and skewness. High values of kurtosis and skewness are typical of variables that vary within narrow intervals and occur at low magnitudes, so that their distributions are tailed. This is the case for the parameters in the second group.
The parameters of the first group were of higher entropy, that is, higher uncertainty, documented by the higher median absolute deviation (MAD) values. From a practical point of view, they should be monitored more frequently than the others by, for instance, the continual determination of pH, phosphate, ammonium, TN and COD. BOD and COD characterize mostly organic compounds, similar to TN and ammonium, which is the prevailing nitrogen form mostly resulting from hydrolysis of urea. Dissolved phosphate also enters wastewaters in the form of urea and detergents [41]. The parameters of the second group were of lower entropy, that is, lower uncertainty. Cyanide and phenol came from coke-making factories; their concentrations were 0.15-0.16 mg/L. The high kurtosis of TSS was caused by tailing of its distribution curve due to the heterogeneity of wastewaters, including sedimentation of solid particles during physico-chemical analyses.

Entropy Weighted Index
The calculated entropy weights were used to construct the entropy weighted index and to characterize the wastewater composition. A similar approach has already been used for groundwater quality assessment [26,27,31]. This is a simple way to describe complex water systems by one parameter. On a similar principle, for example, the soil quality index (SQI), composed of several soil composition indicators (pH, TN, TP, cation exchange capacity, soil organic matter, etc.), has been successfully used for soil composition assessment [42,43]. In analogy with the SAW model, EWI was calculated for every sample i according to Equation (3). The difference $1 - h_j$ is called the relative redundancy and can be interpreted as a degree of diversification of the information provided [15,16,18,25,30,44]. In information theory, the entropy weights represent useful information on variables (parameters). In other words, the higher the entropy weight, the more useful information the parameter provides, and vice versa.
The EWI plot was constructed in order to demonstrate the temporal changes of wastewater composition during a year, as shown in Figure 2. The samples were labeled according to their sequence of sampling; therefore, the plot demonstrates their temporal composition changes. The EWI values were smoothed by the FFT algorithm to make trends in the data clearly visible. In the first half of the year, the EWI values slightly increased during January and February and then oscillated around the mean (see the next paragraph) until a period between June and August. The minimal EWI value of 0.610 was reached at the end of July. In this period, many people spend their time outside cities and production in some companies is reduced. In addition, higher temperatures accelerate chemical and biochemical processes in wastewaters. Conversely, during autumn and winter EWI increased due to the reduced migration of inhabitants and lower temperatures, which decelerated the chemical and biochemical processes in wastewaters. The EWI plot was compared with the plot of COD (Figure 3), which is a simple and typical wastewater parameter. As expected, both plots were found to be similar.

Statistical Analysis of EWI Data
The EWI values were statistically processed by exploratory analysis, and outlying samples corresponding to EWI ≥ 1.65 were identified: samples 38, 164, 272, 290, 302, 323, 324 and 326. The composition of the outliers is given in Table S1 (Supplementary Materials); the outlying parameters detected by box-and-whisker plots are highlighted in bold. All the outliers were confirmed by means of the robust Mahalanobis distances calculated according to Equation (8). The cut-off limit was set at $\sqrt{\chi^2_{11,0.975}} = 4.682$ for the 97.5% quantile.
The outlying samples were excluded from the dataset and the remaining 335 samples were tested for normality, which was confirmed by the D'Agostino, Kolmogorov-Smirnov and moment tests (kurtosis = 3.403, skewness = −0.110). The EWI mean and standard deviation were calculated at 0.965 and 0.227, respectively; the EWI median was 0.972. The lower warning limit (LWL) and upper warning limit (UWL) were calculated at 0.511 and 1.419, respectively, and the lower control limit (LCL) and upper control limit (UCL) were calculated at 0.284 and 1.646, respectively (Figure 2). All these limits are commonly used for the statistical regulation of various processes and can be used for the regulation of the EWI values.
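The reported warning and control limits are numerically consistent with the usual Shewhart convention of the mean ± 2 and ± 3 standard deviations, respectively. The short sketch below reproduces them from the reported mean and standard deviation; the function name is hypothetical and the convention itself is inferred from the numbers rather than stated in the text.

```python
def shewhart_limits(mean, sd):
    """Warning limits at mean +/- 2 sd and control limits at mean +/- 3 sd."""
    return {
        "LCL": mean - 3 * sd,
        "LWL": mean - 2 * sd,
        "UWL": mean + 2 * sd,
        "UCL": mean + 3 * sd,
    }

# EWI mean and standard deviation reported for the 335 retained samples.
limits = shewhart_limits(0.965, 0.227)
```

For example, 0.965 + 3 × 0.227 = 1.646 matches the reported UCL, which is also close to the EWI ≥ 1.65 threshold used to identify the outlying samples.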

Verification of EWI
The principal component weighted index [32] was employed in order to verify the EWI data. RPCA was applied for this purpose. PCWI was defined as follows

$$\mathrm{PCWI}_i = \sum_{k=1}^{q} u_k \, \mathrm{PC}_{i,k} \qquad (9)$$

where $\mathrm{PC}_{i,k}$ is the value of the k-th principal component for sample i and $u_k$ stands for the weight of the k-th principal component calculated as

$$u_k = \frac{\lambda_k}{\sum_{k=1}^{q} \lambda_k} \qquad (10)$$

where $\lambda_k$ is the eigenvalue of the k-th PC and q is the number of selected principal components. The objectivity of PCWI is based on the following facts: (i) principal components are orthogonal and thus independent, which is consistent with the SAW theory, and (ii) the weights of principal components correspond to their eigenvalues, expressing their importance. When all 11 principal components were used (q = 11), their weights were equal to their variabilities. The scree and cumulative plots are shown in Figure 4.
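The eigenvalue-based weighting can be sketched as follows. The eigenvalues and component scores are invented for illustration (the eigenvalues sum to 11, as they would for a correlation matrix of 11 standardized parameters), the function names are hypothetical, and the index is taken here as a plain weighted sum of component scores, which is an assumption about the exact form used in [32].

```python
def pc_weights(eigenvalues, q):
    """u_k = lambda_k / sum of the first q eigenvalues (sorted in descending order)."""
    selected = eigenvalues[:q]
    total = sum(selected)
    return [lam / total for lam in selected]

def pcwi(scores_row, weights):
    """Weighted sum of one sample's scores on the selected principal components."""
    return sum(u * t for u, t in zip(weights, scores_row))

# Hypothetical eigenvalues of an 11 x 11 correlation matrix (they sum to 11).
eigenvalues = [4.0, 2.0, 1.5, 1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.1]
weights_all = pc_weights(eigenvalues, q=11)   # equal to the explained variance fractions
weights_four = pc_weights(eigenvalues, q=4)   # renormalized over four components

# Hypothetical scores of one sample on the first four components.
sample_scores = [1.2, -0.5, 0.3, 0.1]
sample_index = pcwi(sample_scores, weights_four)
```

With q = 11 each weight equals the fraction of total variance explained by its component; with a smaller q the weights are renormalized so that they still sum to one.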
The significant linear correlation between EWI and PCWI (r = 0.910) is shown in Figure 5. It demonstrates strong agreement between the two indexes and confirms the validity of EWI. The outlying samples (38, 164, 272, 290, 302, 323, 324 and 326), which were not included in the regression, are also clearly visible. In addition, the scree plot indicated four main principal components, and the PCWI composed from them also correlated well with EWI (r = 0.900).

Conclusions
The wastewater composition was evaluated using the Shannon entropy. Entropy of the wastewater parameters calculated based on their histograms decreased in the order: $\mathrm{PO_4^{3-}}$ > $\mathrm{NH_4^+}$ > TDS > TN > pH > BOD > COD > TSS > TP > phenol > $\mathrm{CN^-}$. According to the entropy values, the parameters were separated into two groups: (i) phosphate, ammonium, TDS, TN, pH, BOD and COD and (ii) TSS, TP, phenol and cyanide. The parameters from the first group should be monitored frequently because of their higher uncertainty in terms of larger temporal changes.
The entropy weights were calculated to define the entropy weighted index in analogy with the SAW model. The EWI plot showed the temporal changes of wastewater composition during one year. The EWI values were statistically analyzed by univariate statistics and the limits for statistical regulation, such as UCL, LCL, UWL and LWL, were calculated. In addition, the outlying samples were detected by univariate and multivariate analyses. EWI was verified by comparison with PCWI composed from the robust principal components. EWI agreed well with PCWI, as documented by the correlation coefficients r = 0.910 for all principal components and r = 0.900 for the four main ones.
The validation confirmed the capability of EWI to reliably characterize wastewater composition as a single indicator; the index could be of interest to BWWTP operators as well as other experts and decision makers in this field. The main advantage of EWI is the simple histogram-based calculation of entropy, which does not require the parameters to be normally distributed. Based on the results mentioned above, one can conclude that information entropy is suitable for the evaluation of wastewater composition.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Figure S1: P-P plot of parameter entropies, Table S1: Composition of identified outlaying samples.

Conflicts of Interest:
The author declares no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.