Total and Specific THMs’ Prediction Models in Drinking Water Pipe Networks

Although disinfection is crucial for the safety of drinking water, it is also responsible for the formation of disinfection by-products (DBPs), which have been linked to severe health problems. This study presents the development of models predicting trihalomethanes (THMs) in a drinking water supply system in Greece. Although some of the developed models can be used for the prediction of THMs, they are site-specific and cannot be applied widely.


Introduction
Disinfection is a particularly important process for drinking water safety, as it inactivates pathogenic micro-organisms and thus protects human health. Several outbreaks have been recorded in the past, caused by chemical and biological hazards in the drinking water supply chain. Since disinfection was introduced in drinking water treatment, waterborne diseases have declined dramatically [1,2]. However, several disinfection by-products (DBPs) are formed during the disinfection process. Two main groups of DBPs are known: trihalomethanes (THMs) and haloacetic acids (HAAs). THMs include chloroform (CM), bromodichloromethane (DCBM), dibromochloromethane (DBCM), and bromoform (BM) [3]. The first studies on DBPs appeared in the 1970s, when Rook and others identified CM and other THMs in drinking water [3][4][5]. More than 600 DBPs have been reported; some are regulated, while others are considered emerging ones due to their lower occurrence levels and effects [3,6]. THMs are formed mainly by the reaction of chlorine with natural organic or inorganic matter. Although the DBPs formed are suspected to be toxic and carcinogenic, residual chlorine remains crucial for drinking water supply systems: disinfection normally takes place in reservoirs at the head of the supply network, whereas pathogens can contaminate the water as it travels through the network to the consumers' taps. The present study presents the development of models predicting THMs (total, CM, BM, DCBM, DBCM) in a drinking water supply system in Greece. The models are data-driven, based on data gathered during the routine sampling procedures carried out by the water utility. THM predictive models help water utility managers in decision making and can be used as a tool for the optimal selection of boosters, so as to minimize DBP formation while maintaining the right disinfectant residual.

Literature Review on DBPs Predictive Models
The development of DBP predictive models is increasingly recognized as a useful methodology, since water utility managers can use such models as a decision-making tool, for example for setting the disinfection dose, the contact time, or the pH adjustment. The models are important because the aim is to reduce DBP concentrations while maintaining the required disinfectant residual [1]. In regions and water distribution networks (WDNs) with the same characteristics, such models could provide adequate estimates of DBP concentrations and so reduce the need for complicated and expensive analyses of these compounds. They can also be used to identify optimal booster locations and optimal sampling locations.
The factors affecting DBP formation are pH, chlorine dose, residual chlorine, temperature, contact time, the presence of organic matter, and seasonal variability [7,8]. While THM concentrations appear to increase with pH and temperature, the effect of pH varies among DBPs [9]. THM concentrations also vary throughout the water supply system, with higher concentrations found in the network than in the tanks [10]. This is probably due to biofilm formed on the pipe walls, which reacts with the residual chlorine to form THMs.
Both data-driven statistical models and process-based models are found in the literature for predicting DBP formation. Data-driven models are mainly based on statistical relationships between dependent and independent variables [11]. Process-based models are based on the actual processes taking place in the WDN [11]. Process-based models are more difficult to develop because their parameters are imprecise or difficult to estimate [12], the data they require are often not available [11], and the chemical and mathematical laws governing DBP formation must be known in advance [11,13]. Data availability is also a crucial issue for data-driven models, but as more data become available, statistical models are used more and more.
In previous studies, many DBP prediction models have been developed, many of them derived from linear and non-linear regression analysis. However, these models are site-specific and cannot be applied widely. Some of them used laboratory data and others real field data. In Greece, only a few models have been developed: for the Athens water supply system and water treatment plants [14][15][16], for river water on Lesvos island [17], and for two water supply systems [1].

Materials and Methods
The present study uses real field data to develop models predicting total THMs and each of the four THMs, namely CM, BM, DCBM, and DBCM. The data come from a water utility serving a city of about 55,000 people and are gathered during the sampling procedures that the utility follows under the Greek institutional framework. The samples are taken from different points of the water distribution network. Chlorination takes place in the reservoirs using sodium hypochlorite, and the dose is determined by the water utility. According to national legislation, the sampling frequency depends on the quantity of water abstracted and the number of consumers served. The data gathered are the variables measured by the water utility during check and audit monitoring. As audit monitoring is less frequent than check monitoring, the amount of available data is limited.
Statistical analysis is carried out to develop the models. Initially, the variables are tested for normality using the Kolmogorov-Smirnov (K-S) test, which checks the goodness-of-fit to the normal distribution [18], at significance level 0.05. Variables that do not follow the normal distribution are transformed. Then, the Pearson correlation matrix is used to examine the relationships between the variables. To perform a linear regression, the concentration of total trihalomethanes (TTHMs), or of each individual THM, denoted Y, is assumed to be a linear function of the inputs X. The unknown parameters to be determined are the coefficients $a_i$, as given in (1):
$$Y = a_0 + \sum_{i=1}^{n} a_i X_i \qquad (1)$$
where n is the number of inputs used. The coefficients are chosen to minimize the sum of the squared differences between the predicted and actual values of Y. Multiple regression analysis is used to identify the statistically significant variables at a level of significance α. ANOVA tests are used to check whether the residuals of the models follow the normal distribution [19] and have a mean value of zero. The residuals should be evenly distributed above and below zero; otherwise, a calculation error should be suspected, or an additional variable may need to be added to the regression model [18]. To check for autocorrelation, the Durbin-Watson statistic is calculated. $R^2$ values are also reported to show how well each model fits the data. Finally, the developed models are validated by comparing observed and predicted data.
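The least-squares step described above can be illustrated with a minimal sketch. It is not the authors' implementation (the study reports using Minitab); the function names `fit_ols`, `predict`, and `r_squared` are hypothetical, and the input values at the bottom are made up purely to show the call pattern. The sketch assumes a model with an intercept and solves the normal equations directly, which is adequate for a handful of inputs.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(X, y):
    """Return [a0, a1, ..., an] minimising the sum of squared residuals."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def predict(coeffs, X):
    """Evaluate a0 + sum(a_i * X_i) for every row of X."""
    return [coeffs[0] + sum(a * x for a, x in zip(coeffs[1:], r)) for r in X]


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical inputs (columns: pH, TOC**-2) and responses, for illustration only.
demo_X = [[7.2, 0.90], [7.5, 0.40], [7.8, 0.25], [7.4, 0.60], [7.9, 0.10]]
demo_y = [10.0, 18.0, 25.0, 15.0, 30.0]
demo_coeffs = fit_ols(demo_X, demo_y)
print(demo_coeffs, r_squared(demo_y, predict(demo_coeffs, demo_X)))
```

Solving the normal equations is the simplest route for a sketch; statistical packages typically use a more numerically stable decomposition, which matters when inputs are strongly correlated.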

Results and Discussion
The data gathered include pH, total organic carbon (TOC) concentration (mg/L), conductivity (µS/cm), residual chlorine (mg/L), turbidity (NTU), total THMs (µg/L), and the concentrations of the four THMs (chloroform, CM; bromoform, BM; bromodichloromethane, DCBM; and dibromochloromethane, DBCM) in µg/L. Water is abstracted from 27 different boreholes. The data cover a period of 5 years (2014-2018). TOC values range from 0.31 to 39.5 mg/L and TTHMs range from 0.48 to 68.35 µg/L, below the threshold. BM accounts for 49.8% of the TTHMs concentration, followed by DBCM (27.5%), while CM accounts for only 6.27% (Figure 1). Organic substances are not present in high concentrations because the water abstracted is groundwater. However, the BM concentrations may indicate the presence of inorganic compounds such as Br⁻, mainly due to anthropogenic factors and seawater intrusion. Table 1 presents the total number of values (N), average (AV), standard deviation (SD), minimum (MIN), and maximum (MAX) values of the parameters studied.
The K-S tests for goodness-of-fit to the normal distribution showed that, after log-transformation [17], all dependent variables except DCBM followed the normal distribution at significance level 0.05 (Table 2). DCBM values were transformed using the Box-Cox transformation in Minitab, and $\mathrm{DCBM}^{-0.6}$ was found to follow the normal distribution at significance level 0.05. The independent variables were also tested for normality using the K-S test. Only pH, conductivity, and residual chlorine follow the normal distribution at the 0.05 significance level, so turbidity was log-transformed (Table 2). TOC was transformed using the Box-Cox transformation, and $\mathrm{TOC}^{-2}$ was found to follow the normal distribution at significance level 0.05.
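The transformation and normality check can be sketched as follows. This is an illustrative sketch, not the Minitab procedure used in the study: the function names are hypothetical, the power form $y^{\lambda}$ is assumed because the text reports the transformed variables as plain powers (the textbook form $(y^{\lambda}-1)/\lambda$ differs only by a linear rescaling, which does not affect normality), and the critical value is the large-sample approximation $1.36/\sqrt{n}$ for a fully specified distribution. Because the mean and standard deviation are estimated from the sample here, that critical value is conservative.

```python
import math


def power_transform(values, lam):
    """Power transform y**lam, with the natural log used when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    if lam == 0:
        return [math.log(v) for v in values]
    return [v ** lam for v in values]


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma**2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic_normal(values):
    """K-S distance between the empirical CDF and a fitted normal CDF."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    if sigma == 0:
        raise ValueError("sample has zero variance")
    d = 0.0
    for i, x in enumerate(sorted(values)):
        f = normal_cdf(x, mu, sigma)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_critical_value_005(n):
    """Large-sample approximation of the K-S critical value at alpha = 0.05."""
    return 1.36 / math.sqrt(n)


# Hypothetical TOC values in mg/L, for illustration only.
toc = [0.4, 0.6, 0.9, 1.1, 1.5, 2.0, 3.2, 5.0]
transformed = power_transform(toc, -2.0)
d = ks_statistic_normal(transformed)
print(d, ks_critical_value_005(len(transformed)))
```

Normality is not rejected at the 0.05 level when the statistic is below the critical value; for samples as small as those in this study, exact small-sample tables would be preferable to the approximation.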
The relationships between the variables were examined using the Pearson correlation matrix (Table 3). The Pearson coefficients show a strong positive correlation between TTHMs and the BM, DCBM, DBCM, and CM concentrations, which is expected as the four THMs make up the TTHMs. TTHMs have a moderate positive correlation with TOC (r = 0.587) and water pH (r = 0.497) and a moderate negative one with conductivity (r = −0.446) (Table 3). Like BM, DCBM shows a moderate positive correlation with pH (r = 0.546), while DCBM and DBCM show a very low negative correlation.
Based on the data, multiple regression analysis was applied at significance level α = 0.05 for TTHMs, BM, DCBM, DBCM, and CM. Throughout the development of the models, several linear and non-linear regression analyses were performed. The inclusion of each variable in the proposed model was based on the t-criterion [20]. Methodological details about the model development are extensively discussed in past studies [1,15,21].
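For reference, the Pearson coefficient reported in the correlation matrix can be computed as in the short sketch below. The function name `pearson_r` and the sample values are hypothetical and serve only as an illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pH and TTHMs (µg/L) pairs, for illustration only.
ph = [7.1, 7.3, 7.4, 7.6, 7.8]
tthm = [8.0, 15.0, 12.0, 22.0, 30.0]
print(pearson_r(ph, tthm))
```

The coefficient ranges from −1 to 1; its sign gives the direction of the linear association and its magnitude the strength.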
The first model predicts the TTHMs concentrations. All variables are initially used as independent variables (inputs). Independent variables that are not statistically significant (p > 0.05) are excluded one by one from the model development process (Table 4). Finally, only pH and $\mathrm{TOC}^{-2}$ are found to be statistically significant (p < 0.05) (Table 5). The $R^2$ value is 61.61% for this model, which shows that it can be used satisfactorily for the prediction of TTHMs. All statistical analysis data are given in Table 5. The Durbin-Watson statistic, provided in Table 5, is used to check autocorrelation. Its value was found to be 1.70419, showing that there is no autocorrelation. The model is: [equation missing from the source].
The second model predicts BM concentrations. The independent variables found to be statistically significant (p < 0.05), after excluding the non-significant ones, are pH and $\mathrm{TOC}^{-2}$ (Tables 4 and 5). The value of $R^2$ is 53.64%, showing that the model provides a relatively good estimation for the prediction of BM concentrations.
The prediction model for DCBM concentrations is then developed using the same methodology. Only $\mathrm{TOC}^{-2}$ was found to be statistically significant (p < 0.05) (Table 4). The model provides a good estimation for the prediction of DCBM concentrations, as the $R^2$ value is 64.27% (Table 5). No autocorrelation exists, as the Durbin-Watson statistic is found to be 2.3469.
The model predicting DBCM concentrations is developed following the same methodology. The statistically significant independent variables are pH and $\mathrm{TOC}^{-2}$ (Tables 4 and 5). The model can be used for the prediction of DBCM concentrations, as it provides a very good estimation ($R^2$ = 65.95%). No autocorrelation exists, as the Durbin-Watson statistic is found to be 2.06195.
The last model developed, following the same methodology, is the one predicting CM concentrations. Only $\mathrm{TOC}^{-2}$ is found to be statistically significant (p < 0.05).
The $R^2$ value of the CM model is only 37.62%. ANOVA tests are carried out for all models. Statistical examination showed that the residuals of the models follow the normal distribution [19] and that the mean value of the residuals is zero. In all cases, the analysis showed that the residuals approach the normal distribution, and the models are deemed valid to describe the experimental data (Figure 2).
All models can be used for the prediction of total THMs, BM, DCBM, and DBCM concentrations, except for the one predicting CM concentration. This model is weaker than the others because the dataset used contains fewer observations (19, compared to 33 for BM and DBCM and 35 for TTHMs). Although the number of observations used for the prediction of DCBM is also small (20), that model appears to be adequately reliable. An important factor affecting the reliability of the developed models is that THM concentrations are low, as the sampling takes place in the reservoirs before the water enters the distribution network, where possible reactions with the biofilm on the pipe walls take place.
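The residual checks mentioned above (zero mean, balance above and below zero, and the Durbin-Watson statistic for autocorrelation) can be sketched as follows. The function names and the sample values are hypothetical, and the sketch does not reproduce the ANOVA output of the statistical package used in the study.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def residual_summary(observed, predicted):
    """Basic residual diagnostics for a fitted regression model."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return {
        "mean": sum(residuals) / len(residuals),
        "n_positive": sum(1 for e in residuals if e > 0),
        "n_negative": sum(1 for e in residuals if e < 0),
        "durbin_watson": durbin_watson(residuals),
    }


# Hypothetical observed and predicted TTHMs (µg/L), for illustration only.
observed = [12.0, 18.5, 25.1, 9.8, 30.2, 21.4]
predicted = [13.1, 17.9, 24.0, 11.0, 29.5, 22.3]
print(residual_summary(observed, predicted))
```

The Durbin-Watson statistic lies between 0 and 4, and values close to 2 indicate no first-order autocorrelation, which is how the values reported for the models are read.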
Comparing the models developed in this study with those developed in other studies, it can be concluded that as pH increases, TTHMs, BM, and DBCM concentrations increase. The data studied did not show a relationship between pH and the concentrations of DCBM and CM, probably because the dataset contains few observations. However, there are studies in the literature showing that the effect of pH on TTHMs formation varies [9]. The models developed show that as TOC concentrations increase, THM values increase for all models. Such findings are in accordance with the literature [9].
Other explanatory variables, such as disinfectant dose, reaction time, and temperature, are not available in this study. The availability of reliable data is an issue mentioned in many studies, as it affects the choice of explanatory variables. All data are from samples taken in autumn, whereas studies have shown that higher THM levels occur in the summer months. It must also be noted that the models of the present study are site-specific and cannot be applied widely. A further limitation of the study is that the models are validated with the same dataset used to develop them.
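One common way to address the last limitation with a small dataset is leave-one-out validation, in which each observation is predicted by a model fitted without it. The sketch below is an assumption about how this could be done, not part of the study: it uses a single-predictor linear model (as in the models where only the transformed TOC is significant), hypothetical function names, and made-up values.

```python
import math


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def loo_rmse(x, y):
    """Root-mean-square error of leave-one-out predictions."""
    squared_errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]  # drop observation i before fitting
        y_train = y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_train, y_train)
        squared_errors.append((y[i] - (intercept + slope * x[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical transformed TOC values and CM concentrations (µg/L), for illustration only.
toc_inv_sq = [0.10, 0.25, 0.40, 0.60, 0.90, 1.20]
cm = [3.1, 2.7, 2.4, 1.9, 1.2, 0.8]
print(loo_rmse(toc_inv_sq, cm))
```

An out-of-sample error much larger than the in-sample error would suggest that a model fits the available observations better than it would predict new ones.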

Conclusions
The present study uses real field data to develop models predicting total THMs and each of the four THMs, namely CM, BM, DCBM, and DBCM. The data come from a water utility serving a city of about 55,000 people and are gathered during the sampling procedures that the utility follows under the Greek institutional framework. The samples are taken from different points of the water supply network. The paper: (a) investigates the formation of THMs during chlorination of groundwaters, taking into consideration the available variables such as pH and TOC; (b) develops predictive models for the concentrations of total THMs, BM, DCBM, DBCM, and CM formed during chlorination of these groundwaters; and (c) statistically evaluates the developed models in comparison to the models developed in previous studies for THMs [15,21] using the same modeling technique (multiple regression).
The study's results showed that the models developed to predict the formation of total THMs, BM, DCBM, and DBCM are reliable and can be used. However, the model developed for CM prediction is not reliable. The reasons include the small amount of data and the lack of data for explanatory variables affecting the formation of THMs. Finally, as all samples were taken in autumn, it is suggested that seasonal variation be taken into consideration by sampling throughout the year, in order to study the effect of the season on the formation of TTHMs, which other studies have found to be related [15].
As the formation reactions of DBPs are complex, universally applicable models are difficult to develop. The models developed here can be used in regions and WDNs with the same characteristics. They are useful to water utility managers when deciding on the disinfection dose, the pH adjustment, etc. Additionally, such models can be used to identify the optimal locations for chlorination boosters, so as to achieve the desired chlorination and the right residual chlorine levels without forming high THM concentrations. Finally, these models, in combination with residual disinfectant models, can support the optimal selection of sampling points for water quality control, to be used for epidemiological studies and health risk assessment [9,22,23].