Leak Localization in Water Distribution Networks Using Pressure and Data-Driven Classifier Approach

Abstract: Leaks in water distribution networks (WDNs) are one of the main causes of water loss during fluid transportation. Considering the worldwide problem of water scarcity, added to the challenges that a growing population brings, minimizing water losses through timely and efficient leak detection and localization using advanced techniques is an urgent humanitarian need. Numerous methods are used to localize water leaks in WDNs, either by constructing hydraulic models or by analyzing flow/pressure deviations between the observed data and the estimated values. However, from the application perspective, it is more practical to implement an approach with reasonable computational demand that does not rely heavily on measurements and complex models. In this context, this paper presents a novel data-driven method for leak localization based on limited pressure measurements in WDNs, comprising two stages: (1) two different machine learning classifiers based on linear discriminant analysis (LDA) and neural networks (NNET) are developed to determine the probability of each node in a WDN having a leak; (2) Bayesian temporal reasoning is applied afterwards to rescale the probabilities of each possible leak location at each time step after a leak is detected, with the aim of improving the localization accuracy. As an initial illustration, the hypothetical benchmark Hanoi district metered area (DMA) is used as the case study to test the performance of the proposed approach. Using the fitting accuracy and average topological distance (ATD) as performance indicators, the preliminary results reach more than 80% accuracy in the best cases.


Introduction
Water scarcity, leak detection, and network efficiency are the main factors driving the implementation of smart water solutions across the globe. In particular, water leaks inside water distribution networks (WDNs) can cause water losses in fluid transportation as well as risks of bacterial and pollutant contamination [1]. Water leaks may also lead to increases in consumers' water bills, although in some countries (e.g., European countries and Canada), higher water prices are connected with higher investment in the WDN in order to prevent leaks [2-4].
According to the standard water balance methodology presented by IWA/AWWA (International Water Association/American Water Works Association), water leakage is an important cause of water loss [5]; in many WDNs, the losses due to leaks are estimated to account for up to 27% of the total amount of extracted water [6]. In China, around 8 billion cubic meters of water were lost in 2017 [7], and the total amount in Asia is around 29 billion cubic meters (valued at more than 9 billion dollars) per year [8]. The mean values for water losses in EurEau (European Federation of National Associations of Water Services) member countries in 2017 are 23% and 2171 m³/km/year [9]. Considering the worldwide problem of water scarcity, added to the challenges that a growing population brings, it is critical to minimize water losses through the detection and localization of water leaks in the WDN in a timely and efficient manner using advanced techniques.
In order to accurately localize water leaks, correct and targeted monitoring of detailed information concerning system behavior is required. Among the monitoring devices, acoustic equipment (e.g., noise correlators and listening sticks) is effective for localizing leaks manually by reading abnormal behaviors at potential locations in the WDN [10,11]. However, its high cost and its time-consuming, labour-intensive operation prevent acoustic equipment from being widely used in practice. For that reason, flow and pressure meters are alternative devices for reading useful system information for leak detection and localization. Compared with flow meters, pressure meters are easier to install and less expensive. Moreover, as discussed in [12], focusing on pressure data can facilitate leak localization and reduce the required investment as well.
The key principle of using real-time pressure measurements for leak detection and localization in WDNs is the deviation of real-time data from the normal range of system behaviors [11].
The literature on leak localization in WDNs contains a wide range of approaches. Among them, the earliest popular approaches rely on estimating hydraulic dynamics using mathematical models [13,14]. For example, [15] estimates the location of a leak by building a pressure drop surface with a triangle-based cubic interpolation approach. Meanwhile, [16] infers a leak location in a WDN by creating a sensitivity matrix of different pressure measurements when a leak happens. However, the performance of model-based approaches is strongly limited by the accuracy of the mathematical models, and it is not easy to choose an appropriate model [17]. Further, the investment required for a large number of sensors also slows down the development of this method. Moreover, the high computational demand and the difficulty of parameter estimation hinder the practical usage of model-based approaches, especially for large WDNs. During the past decades, with the advances of online monitoring devices, data-driven approaches which focus on mining knowledge from available data have prevailed in the field of leak localization [11,12,18-20]. Accordingly, [18] proposes a mixed hydraulic and data-based model that relies on pressure residual and leak sensitivity analysis, which is based on analyzing the difference between measurements and their estimation using a hydraulic network model. More recently, [19] presented a completely data-driven approach that analyzes the pressure residual between a healthy WDN and a network with leakages, using [20] to interpolate the pressure in nodes without sensor information. However, due to the graph structure of the WDN, the accuracy of this approach is affected by the distance between the leaking node and the inlet node.
Due to their powerful capacities for pattern recognition and feature identification, machine learning algorithms such as support vector machines and clustering algorithms have proven efficient for solving leak localization problems [11,12]. However, the difficulty of using a machine learning method lies in selecting the proper algorithm and designing suitable feature extractors to learn complex features [11,12]. Among numerous machine learning approaches, the neural network (NNET) is capable of leak localization given its ability to process and model multiple inputs without explicit knowledge of the involved parameters. Further, linear discriminant analysis (LDA) is another method used in statistics and machine learning which explicitly attempts to find a linear combination of features to separate classes of objects. The high resolution of LDA makes this method a good tool to predict locations (e.g., leaks) based on limited information [21-23].
This work proposes a data-driven approach based on limited pressure measurements to localize a leak inside a WDN. Two different machine learning classifiers based on LDA and NNET are used to determine the probability of each node in a WDN having a leak. In order to improve the localization accuracy, Bayesian temporal reasoning is applied afterwards to rescale the probabilities of each possible leak location at each time step after a leak is detected. With the aim of achieving an accurate estimation of consumed water, WDNs are commonly divided into smaller sub-networks, named district metered areas (DMAs), for management. Almost all of the previous implementations are applied at the DMA level [10]. In practice, the performance of leak localization methods is highly sensitive to the number of installed sensors, as well as to the placement of these sensors. In order to ensure the optimal performance of the proposed leak localization approach, in this paper, the sensor placement strategies for a WDN from [24] are used directly without further investigation of this topic [24-27]. To estimate the head at nodes where sensors are not placed, Kriging spatial interpolation [20] with the hydraulic topology of the network is used, which also generates a perfect no-leak scenario as a reference. Further, historical data for each leak scenario of each node are provided through simulation as training data for the classification. As an initial illustration, the benchmark Hanoi DMA is used as the case study to test the performance of the proposed approach. Discussions about the costs and benefits of the proposed approach are also presented, as well as the future research plan.

Methodology
The scheme of the proposed pressure-based data-driven leak localization approach [28] is depicted in Figure 1. A key initial assumption is that a leak has already been detected in a DMA. Further, a number of pressure sensors are assumed to be installed to read pressure measurements at some nodes. A flow sensor is also required to read the inlet flow to the DMA.
The first step of the proposed approach is selecting the nodes with a pressure sensor installed, based on the optimal sensor placement strategy generated by [24]. Afterwards, datasets are prepared which include historical data of the DMA comprising pressure measurements with corresponding leak location labels and flows at the inlet node. Topology information of the DMA is also needed to correctly interpolate for the nodes without sensors. More detailed definitions of the required datasets are given in the next section. After that, Kriging spatial interpolation is used in Step 3 to estimate the pressure at the nodes which are not equipped with sensors, based on hydraulic proximity [29]. Next, the machine learning classifiers based on LDA and NNET are developed and trained using the datasets created at Step 2. The fitting accuracy, Kappa coefficient [30], and the average topological distance (ATD) are used as performance indicators for training the classifiers, and the best classifiers are selected to be used later on in Step 5. When a leak has been detected in Step 6, equal prior probabilities are initially assigned to all the nodes. Based on the limited pressure measurements from the sensors embedded in this DMA, as well as the pressures interpolated by Kriging, the trained LDA/NNET classifiers are used to compute, in Step 5, the probability of each node being the leak location from raw pressure, without estimating either a hydraulic model or a reference model. In order to better infer the leak node, the Bayes temporal reasoning rule is used at Step 8 to recalibrate the probabilities given by the classifiers. The final estimated location of the leak is obtained at Step 9.

Data Structure
The data structure required for the leak localization approach is defined as a matrix (Figure 2) which contains: (1) The leak vector Y ∈ ℝ, which is the label of the node where the true leak was located; this information is assumed to be provided. (2) The time vector T ∈ ℕ, in hours. The data structure is ordered by time, which is the time elapsed since the leak was first created. As a leak has been detected, this information is also known. (3) The pressure vector X ∈ ℝ (m), which can represent either the head, the pressure, or the residual with respect to the reference model. The pressure information should be the value given directly by the sensors. (4) The flow vector F ∈ ℝ, in m³/s. This is the flow entering the DMA at the inlet, as provided by the flow sensor measurement. The matrix should be read as a series of snapshots of the DMA at given moments. So, for the leak scenario Y1, there is a pressure Xk for the node k at a given time Tj. The data set can be melted into a single matrix with each Y label repeated, which is easier to manipulate in RStudio, an open source software for R [31]. Since not every node has a sensor, the pressures at the nodes without a sensor are interpolated using Kriging, as explained in [19]. Table 1 includes an example of how the dataset should look:
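The melting step described above can be illustrated with a minimal Python sketch (the study itself worked in R). The function name `melt`, the column names, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper's dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot: (hour T, inlet flow F in m3/s, pressures X_1..X_n in m).
Snapshot = Tuple[int, float, List[float]]


def melt(scenarios: Dict[int, List[Snapshot]]) -> List[dict]:
    """Flatten per-scenario snapshots into one row per (Y, T) pair.

    The leak label Y is repeated on every row of its scenario, and rows are
    ordered by scenario and then by elapsed time.
    """
    rows = []
    for leak_node, snapshots in sorted(scenarios.items()):
        for hour, flow, pressures in sorted(snapshots, key=lambda s: s[0]):
            row = {"Y": leak_node, "T": hour, "F": flow}
            # One pressure column per node: X1, X2, ...
            for k, p in enumerate(pressures, start=1):
                row[f"X{k}"] = p
            rows.append(row)
    return rows


# Toy input with made-up values: two leak scenarios, three nodes.
toy = {
    1: [(1, 5.54, [97.1, 61.6, 56.8]), (2, 5.60, [97.0, 61.2, 56.1])],
    2: [(1, 5.55, [97.2, 60.9, 57.0])],
}
table = melt(toy)
```

Each row of `table` then corresponds to one snapshot of the DMA, which is the shape a classifier expects: one label column and a fixed set of feature columns.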

Classifier
The classifiers based on LDA and NNET are defined and tested individually, with the objective of finding a pattern by which different leaks can be separated, so that future events can be predicted from past historical data.
As explained in the introduction section, LDA [21] is a method to find a linear combination of features which separates two or more classes of data. The resulting combination may be used as a linear classifier. In this study, since there are more than two classes, multiclass LDA is used, in which a subspace is found that contains all the class variability. LDA models the distribution of the predictors (X, which in this study represents the node pressures) separately in each of the response classes (Y, which in this study represents the different leak location labels) and uses the Bayes theorem [32] to flip them into estimates of the class probabilities. When these distributions are assumed to be normal, it turns out that the model is very similar to logistic regression. Given the high computational complexity, logistic regression is omitted in favor of LDA, which is more efficient and delivers better results [33].
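For reference, the way LDA "flips" class-conditional distributions into class probabilities can be written out in the standard textbook form. The symbols below (class priors \pi_k, class means \mu_k, shared covariance \Sigma) are not defined in the text and are introduced here only for illustration; n_l denotes the number of leak labels.

```latex
% Posterior probability of leak label k given the pressure vector x (Bayes theorem)
P(Y = k \mid X = x) = \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{n_l} \pi_l \, f_l(x)},
\qquad f_k(x) = \mathcal{N}(x;\, \mu_k, \Sigma)

% With a covariance matrix shared by all classes, the decision rule is linear in x:
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k - \tfrac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k + \log \pi_k,
\qquad \hat{y}(x) = \arg\max_k \, \delta_k(x)
```

Under these assumptions, the posterior probabilities are the per-node leak probabilities that a later stage can rescale over time.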
The NNET [34] fits a single-hidden-layer neural network trained for classification. Using cross-validation, the model has been tuned both to avoid over-fitting and to set the number of units in the hidden layer [33]. To avoid over-fitting in the NNET, weight decay is used to penalize the sum of squares of the weights.
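As a sketch of what weight decay means here, a penalized training objective of the following general form can be assumed. The cross-entropy fit term, the softmax outputs \hat{p}_i, and the symbol \lambda for the decay parameter are assumptions made for illustration and are not stated in the text.

```latex
% Assumed objective: fit term (cross-entropy over n_l leak labels) plus weight decay
J(w) = -\sum_{n} \sum_{i=1}^{n_l} y_{ni} \, \log \hat{p}_i(x_n;\, w) \;+\; \lambda \sum_{j} w_j^{2}
```

Larger values of \lambda shrink the weights towards zero, which is why the decay parameter and the hidden layer size are natural candidates for tuning by cross-validation.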

Cross Validation
In order to reduce over-fitting on the training set, 10-fold cross-validation is applied, which consequently slows down the parameter search process. However, considering that over-fitting is hard to remove entirely, a validation set is held out for the final estimation of the expected prediction error. The cross-validation methods are defined as:


Hold Out Method: The data are divided into a training sample (70%) and a testing sample (30%). If the error rate is similar on both, the model is not over-fitted. This method requires low computing time; however, it is prone to sample bias.
K-Fold Cross Validation: The sample is split into K sub-samples of equal size. All of the models used have been calculated through 10-fold cross-validation. Since the response feature is categorical, the parameters are tuned according to the accuracy results. The K results can then be averaged to produce a single estimation. The upside of this method is that the way the data are divided matters less, which reduces selection bias.
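The two splitting schemes above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `hold_out` and `k_fold_indices` and the random shuffling are assumptions, not the procedure used in the study.

```python
import random
from typing import List, Tuple


def hold_out(n: int, train_frac: float = 0.7, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the indices 0..n-1 and split them into training and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_frac * n))
    return idx[:cut], idx[cut:]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; every index is used for testing exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of nearly equal size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((train, test))
    return splits


train_idx, test_idx = hold_out(100)      # 70% / 30% split
splits = k_fold_indices(100, k=10)       # 10 folds of 10 samples each
```

In a tuning loop, a model would be fitted on each `train` part and scored on the matching `test` part, and the K scores averaged into a single estimate.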

Evaluation Metrics
The fitting accuracy, Kappa and ATD are the metrics being used to evaluate the classification performance in the dataset:


Accuracy is the percentage of correctly classified instances out of all the instances. It is a more meaningful metric in binary classification than in multi-class classification problems, since in multi-class problems it is harder to determine how the accuracy breaks down across the different classes.
Kappa (or Cohen's Kappa) is similar to classification accuracy, except that it is normalized against the baseline of random chance on the data [30]. It compares the performance of the classifier against that of a classifier which simply guesses at random according to the frequency of each class. It is a more useful measure on problems with imbalanced classes. However, with the use of simulations, this problem does not arise, as all leak scenarios appear the same number of times. Values between 0.6 and 0.8 are considered good [35], although no evidence is supplied in [35] to support this range.
ATD, the average topological distance, represents the distance in nodes between the node predicted as having the leak and the true node that has the leak. ATD is useful for node relaxation, which assesses the overall performance.
In addition, in the training phase of the model, fitting accuracy is also used as the metric for parameter tuning and for selecting the best model.
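The three metrics can be computed as in the following minimal Python sketch. Measuring the topological distance as the number of links on the shortest path (found by breadth-first search) is an assumption about how "distance in nodes" is counted, and the toy graph and labels are hypothetical.

```python
from collections import Counter, deque
from typing import Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of correctly classified instances."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: List[int], y_pred: List[int]) -> float:
    """Accuracy corrected for chance agreement (undefined if chance agreement is 1)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Expected agreement of a guesser that follows the class frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def hops(graph: Dict[int, List[int]], src: int, dst: int) -> int:
    """Shortest path length between two nodes, in links, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("nodes are not connected")


def atd(graph: Dict[int, List[int]], y_true: List[int], y_pred: List[int]) -> float:
    """Average topological distance between predicted and true leak nodes."""
    return sum(hops(graph, t, p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy network: four nodes in a line, 1 - 2 - 3 - 4 (hypothetical).
graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
y_true = [1, 2, 3, 4]
y_pred = [1, 2, 4, 4]

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
avg_dist = atd(graph, y_true, y_pred)
```

In this toy case one of four predictions misses by a single link, so a low ATD shows that the error is topologically small even though accuracy counts it as a full miss.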

Bayes Temporal Reasoning
Bayes temporal reasoning has previously been used to improve the diagnosis based on the residuals generated in model-based leak localization methodologies [19]. In this study, Bayesian temporal reasoning is used to improve the diagnosis in the proposed data-driven leak localization approach, working with probabilities given directly by the classifier, which in turn uses the head/pressure directly.
Because the simulated dataset includes cases where a leak is present for a long period of time, and the healthy state in between leaks is ignored, real situations cannot be fully represented. Besides that, in the healthy state, the leak localization procedure is irrelevant, since the precondition is that a leak has been detected. All the leaks are also static, meaning that once a leak is created, it remains present in the network until it is fixed.
Following this reasoning, a Bayes rule is added to rescale the probabilities, using as the prior the probabilities of each node being the candidate leak location at the previous time step. At every time step t, the probability of a leak occurrence is estimated by applying the Bayes rule:

$$P(y_i \mid c(x(t))) = \frac{P(c(x(t)) \mid y_i)\, P(y_i \mid c(x(t-1)))}{P(c(x(t)))}$$

where nl is the number of different leak labels and c(x(t)) denotes the probabilities returned by the classifier, LDA or NNET, given the head or pressure at time step t.

P(yi|c(x(t))) is the posterior probability that the instance c(x(t)) belongs to the class yi at time step t given the previous information. P(c(x(t))|yi) is the likelihood of the instance c(x(t)) assuming that the leak has been created in node yi. P(yi|c(x(t − 1))) is the prior probability for the class yi taking into account previous time steps. P(c(x(t))) is a normalizing factor given by the total probability law:

$$P(c(x(t))) = \sum_{j=1}^{n_l} P(c(x(t)) \mid y_j)\, P(y_j \mid c(x(t-1)))$$

At the first iteration, the prior probabilities are considered equal for all the different labels. The time step t = 1, …, k runs from when a leak is first created until it is fixed, with P(yi|c(x(0))) = 1 for i = 1, …, nl.
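The recursion above can be sketched in Python as follows. This is a minimal illustration under two stated assumptions: the probability vector returned by the classifier at time t is used directly as the likelihood term, and the initial prior is uniform (1/nl per node, which gives the same posterior as any other equal constant after normalization). The function names and the toy probabilities are hypothetical.

```python
from typing import List


def bayes_update(prior: List[float], likelihood: List[float]) -> List[float]:
    """One Bayes step: posterior_i is proportional to likelihood_i * prior_i."""
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    total = sum(unnormalized)  # normalizing factor (total probability law)
    if total == 0.0:
        raise ValueError("all candidate nodes have zero probability")
    return [u / total for u in unnormalized]


def temporal_reasoning(classifier_probs: List[List[float]]) -> List[List[float]]:
    """Apply the Bayes step recursively over the time steps t = 1, ..., k."""
    n_l = len(classifier_probs[0])
    belief = [1.0 / n_l] * n_l  # equal priors at t = 0 (assumed normalized)
    history = []
    for probs in classifier_probs:
        belief = bayes_update(belief, probs)  # posterior becomes the next prior
        history.append(belief)
    return history


# Toy classifier outputs for three candidate nodes over three time steps.
probs_over_time = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.3, 0.1],
]
posteriors = temporal_reasoning(probs_over_time)
best_node = max(range(len(posteriors[-1])), key=posteriors[-1].__getitem__)
```

One consequence of this form is worth noting: a node that ever receives a probability of exactly zero can never recover in later steps, so in practice a small floor on the classifier output may be needed.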

Hanoi DMA
The Hanoi DMA (Figure 3) is used as the case study to initially illustrate the performance of the proposed approach. In this network, the reservoir acts as the inlet node of the DMA. Since data are to be generated for all the nodes, all the nodes are assumed to have a sensor, and the real distribution of the sensor nodes inside the Hanoi network is ignored during the simulation phase. The simulation is carried out using the simulator EPANET 2 [36], where, for each of the 31 nodes, a leak scenario is simulated with one-hour time steps over 96 h. All the data generated by the simulation are used for calibrating the classifier models. A perfect case without leaks has been used as a reference.

Sensitivity Analysis with Residual
The state-of-the-art assumption has been that the node with the highest difference (residual) between the reference model and the real-time model should be the candidate that contains the leak. However, this is not always the case, especially when sensors are not placed at all the nodes inside the DMA. The boxplots in Figures 4 and 5 show the distribution of residuals at each simulation time for all 31 leak scenarios. The median for the node which contains the leak is, in most cases, higher than for the other groups. A significant increase is also observed with respect to the same node without a leak. However, other nodes which are topologically close to the true leak location have a higher median as well. Even when the leak is far away, a node appears to show an increased residual compared to when there is no leak in the network, an effect of the topology of the DMA. Since these networks are graph structures by definition, once there is a leak in the DMA, the whole network is affected, and the network is limited in the choices to mitigate the leak. A node "previous" to the leak, in height or structure, will also be affected since the difference in head between its links will change, and the flow of water will modify its path accordingly. Nodes that are "posterior" to the leak will also change from their default state once the leak has been created, since the water flows and pressures entering the posterior nodes will vary. This hypothesis is further tested by checking the fringe nodes. On average over the simulations, the maximum residual approach achieved a success rate of 28.5%. Of this 28.5%, 20% came from leaf nodes (in this case node 12 with an accuracy of 96.9% and node 21 with an accuracy of 86.5%). These nodes ranked the highest in success rate because they are at the end of the network: since they do not have "children" nodes, the error does not propagate along those links. However, accuracies of 0% also occur, starting with nodes 10 and 11.
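The maximum residual heuristic discussed above amounts to a single argmax, as the following minimal Python sketch shows. Defining the residual as the reference pressure minus the observed pressure (so that a pressure drop gives a positive residual) is an assumption, and the node numbers and pressure values are made up for illustration.

```python
from typing import Dict


def max_residual_node(reference: Dict[int, float], observed: Dict[int, float]) -> int:
    """Return the node whose pressure dropped the most relative to the no-leak reference."""
    residuals = {node: reference[node] - observed[node] for node in reference}
    return max(residuals, key=residuals.get)


# Hypothetical pressures (m) at four nodes, without and with a leak.
reference = {10: 60.0, 11: 58.5, 12: 55.0, 13: 52.0}
observed = {10: 59.2, 11: 57.4, 12: 53.1, 13: 51.5}

candidate = max_residual_node(reference, observed)
```

Because every node downstream of a leak also loses pressure, the largest residual can sit at a neighbor or at a leaf node rather than at the leak itself, which is consistent with the low success rate reported for this heuristic.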
From these boxplots, it is clear that leak localization is very dependent on the number of nodes with sensors. A good result can be expected with sensors at all the nodes, given how much higher the residual is in the node which has the leak.
Finally, this approach is still reliant on Kriging to interpolate the pressure in nodes which do not have sensors. However, by the very definition of the problem, the assumption that the variance of the field is stationary is at risk [37]. The creation of a leak inside the model is an irregular event for which it is difficult to justify a stationary variance: the evolution of the leak is unknown, as is how irregular the variance becomes once the leak is added to the model. With respect to the mean absolute percentage error, average errors are higher than 50%, except in the optimal four-sensor placement, where the error is almost zero. Besides, after Kriging, around 35% of the observations surpassed three standard deviations, which indicates that Kriging might not be working as well as intended on these data.
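For reference, the mean absolute percentage error used to judge the interpolation can be computed as in this short Python sketch. The function name and the example values are hypothetical; the formula is the usual definition and requires nonzero true values.

```python
from typing import List


def mape(actual: List[float], estimated: List[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    terms = [abs((a - e) / a) for a, e in zip(actual, estimated)]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical true pressures versus interpolated pressures at two nodes.
error_pct = mape([100.0, 50.0], [90.0, 55.0])
```

Here both estimates are off by 10% of the true value, so the resulting error is about 10%.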

Sensitivity Analysis with Pressures
After the previous sensitivity analysis with residuals, it is doubtful that the residuals are very useful for predicting the true leak location. A simpler approach is therefore taken, using the pressure directly instead of the residual. As explained in the methodology, Kriging interpolation is applied only once, to obtain the head/pressure for all the nodes of the online model, and the reference model and residual calculations are omitted.
Principal Component Analysis (PCA) [38] has been performed on the case with sensors at every node (Figure 6) and on the non-perfect-information case (Figure 7), in which only four sensors are placed inside the network (at nodes 12, 16, 21, and 27) [24]. When the time and flow features are removed to reveal the structure of the data, a sort of rainbow effect can be detected: the linear combination of pressures makes it possible to differentiate the different leak locations. This, in turn, should translate into better and easier classification results when the flow is added.

Results
Following the method explained above, two different classifiers based on LDA and NNET are applied using R. The first 72 h of the simulation of each leak scenario are used to train the model, and 10-fold cross-validation is used to tune the parameters and to select the model with the best accuracy. Once trained, the resulting classifier is validated using the remaining 24 h to obtain accurate values of the evaluation metrics. Bayes temporal reasoning is also recursively applied on this validation data set. Tables 2-4 show that the NNET, when combined with posterior Bayes temporal reasoning, can reach around 70% accuracy in the average case. However, when five sensors are placed in the network, very poor results are produced by both LDA and NNET. This might be due to the Kriging, whose average error in this case was the worst of all the sensor placements. This is not a new result, as it is known from other works that the placement of the sensors produces very different results [39]. For the validation data set, the evolution over time of the Bayes temporal reasoning is plotted in Figure 8, which shows a significant accuracy boost in less than 5 h. An ATD below 1 was obtained in most cases in less than 3 h (Figure 9).
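The time-based split described above (72 h for training, the remaining 24 h for validation) can be sketched as follows. This minimal Python illustration assumes that each row carries its elapsed time T in hours, counted from 1 to 96; the row layout and function name are hypothetical.

```python
from typing import List, Tuple


def split_by_time(rows: List[dict], train_hours: int = 72) -> Tuple[List[dict], List[dict]]:
    """Use the first train_hours of every scenario for training, the rest for validation."""
    train = [r for r in rows if r["T"] <= train_hours]
    valid = [r for r in rows if r["T"] > train_hours]
    return train, valid


# Two hypothetical leak scenarios, each simulated hourly for 96 h.
rows = [{"Y": y, "T": t} for y in (1, 2) for t in range(1, 97)]
train_rows, valid_rows = split_by_time(rows)
```

Splitting by time rather than at random keeps the validation hours strictly after the training hours, which matches the setting in which temporal reasoning is applied after a leak is detected.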

Conclusions
A new data-driven solution to the leak localization problem in WDNs based on limited pressure measurements has been presented in this study. The components of the proposed approach have been explained or referenced, and an example has been presented using the Hanoi DMA as the case study.
After reviewing the results, the use of a reference model to calculate the residual is put into question. The main reason for this is that it distorts the data structure of the residuals by adding bias from an inadequate Kriging estimation. As stated before, in the average case, the mean absolute percentage error of the interpolation is higher than 50%. This fault is attributed to the Kriging assumptions, which do not hold in the network structure inherent to WDNs, as well as to the intrinsic variability which a leak brings to the network.
As is common in the field, the number of sensors, and moreover the placement of these sensors affect the performance of both the classifier and the data interpolation. By directly tackling the data interpolation problem it is possible to have a better knowledge of both the state of the network and the optimal sensor placement. Making further improvements in the interpolation will in turn make the classification problem easier for future algorithms. It is necessary to look further into a better interpolation technique suitable for networks.
The case study presented herein demonstrates that using the raw pressures instead of the residuals with LDA or NNET for classification purposes can achieve better results. The same can be said in the average case when using Bayes temporal reasoning. The classifiers made vast improvements over a simple heuristic such as selecting the biggest residual. However, the approach is still far from perfect. Even when simulated with perfect data, assuming sensors are installed at all nodes, significantly better results could not be obtained. This suggests either over-fitting when using the classifier or the need for careful selection of which nodes are used by the classifier.
The main drawback here is in supervised learning itself, which needs previous historical data in which leaks for all leak scenarios have been found. The classifier model will deteriorate quickly when there is no information concerning a leak scenario. However, this problem can be mitigated with node relaxation. More data and case studies are needed before adequately referring to this model as a proper solution to any DMA. In the simulations of this study, many types of uncertainty present in the real world have not been accounted for.
Author Contributions: C.C.S. contributed to the subject of research, the idea of applying the classifiers and the drafting of the paper. B.P. contributed to selecting the classifiers and the best features, performing the tests and preparing the technical report. V.P. contributed to defining the analysis methods and performance indicators and to the manuscript review. G.C. collaborated in defining real problems and digital solution development. All authors have read and agreed to the published version of the manuscript.