Use of Machine Learning for Leak Detection and Localization in Water Distribution Systems

Abstract: This paper presents an investigation of the capacity of machine learning (ML) methods to localize leakage in water distribution systems (WDS). This issue is critical because water leakage causes economic losses, damage to the surrounding infrastructure, and soil contamination. Progress in real-time monitoring of WDS and in ML has created new opportunities to develop data-based methods for water leak localization. However, the managers of WDS need recommendations for the selection of appropriate ML methods as well as their practical use for leakage localization. This paper contributes to this issue through an investigation of the capacity of ML methods to localize leakage in WDS. The campus of Lille University was used as support for this research. The paper is organized as follows: first, flow and pressure data were generated using EPANET software; then, the generated data were used to investigate the capacity of six ML methods to localize water leakage. Finally, the results of the investigations were used for leakage localization from offline water flow data. The results showed excellent performance for leakage localization by the artificial neural network, logistic regression, and random forest, but low performance for the unsupervised methods because of overlapping clusters.


Introduction
Water leakage constitutes an important issue in managing water distribution systems because it causes economic losses, damage to the surrounding soil and infrastructure, and soil contamination. According to the World Bank [1], the non-revenue water (NRW) level in developing countries ranges from 40% to 50% of the water pumped into the distribution systems. The American Water Works Association Research Foundation (AWWARF) estimated that water utilities in the United States suffered from 250,000 to 300,000 main breaks per year, causing approximately USD 3 billion in annual damages [2]. In [3], it was reported that the water losses from leakage in some countries in the Middle East represented 50% of the water supply. Reference [4] identified leakage as one of the most common operational problems in the water distribution system of Athens.
Relevant research has been conducted for the development of methods for water leakage detection. These methods can be classified into hardware- and software-based methods. The first category uses various technologies such as acoustic monitoring [5], gas injection [6], thermography [7], ground-penetrating radar [8], and free-swimming systems [9]. Acoustic monitoring includes listening sticks, leak noise correlation, and leak noise loggers. These methods perform well but are costly. The gas injection method injects a non-toxic, water-insoluble, and lighter-than-air gas into water pipes. The leak can be detected by scanning the ground surface using gas detectors. This method allows rapid tracing, but its high cost reduces its practical use. The ground-penetrating radar method is based on tracking the reflection of electromagnetic waves generated at the ground surface. It provides information about the presence of anomalies in the subsoil. Water leaks can be detected by identifying soil voids created by water leaks or by detecting sections of pipes that appear deeper than they actually are due to the increase in the dielectric properties of the surrounding saturated soils. This method can be used for metallic or plastic pipes, but it is expensive and time-consuming. The free-swimming systems are based on introducing into the water pipes capsules with an embedded power source, electronic components, and instrumentation (acoustic sensor, accelerometer, magnetometer, GPS-synchronized ultrasonic transmitter, and temperature sensor). These capsules record the internal environment of the pipes and send the recorded data to a server. The analysis of the recorded data permits the detection and localization of anomalies related to water leakage. This method is well adapted for pipes with large diameters.
The second category of water leakage detection methods is based on analyzing data related to the operation of the water system. It includes statistical methods [10,11], the water balance method [12], the minimum night flow (MNF) method [6], real-time transient modeling [13], and the negative pressure wave method [14]. Leak detection using statistical methods is based on the determination of the statistical characteristics of the water flow and pressure in the water network and the identification of outliers, which could be related to water leakage. The efficiency of these methods is related to the quality of the recorded data and the regularity of the consumption patterns. The water balance method relies on the principle of mass conservation. A leak is identified if the difference between the amount of water put into the water network and the sum of water consumption and usage exceeds an established tolerance. The efficiency of this method depends on the quality of the monitoring system and the knowledge of the water usage in the water network. The MNF method is based on water flow analysis when the water demand is low and the water pressure is high. A leak alarm is generated when the MNF exceeds a threshold, which depends on the water network's characteristics and usage. This method is widely used; its efficiency depends on the quality of the water network monitoring and the regularity of the water usage. The real-time transient modeling method is based on comparing the recorded hydraulic data with the results of hydraulic models. The efficiency of this method depends on the quality of the recorded data and the quality of the hydraulic models and their calibration. The negative pressure wave method is based on tracking the waves created by the water pressure drop resulting from the water leak. Pressure sensors are installed at the beginning and the end stations of the pipeline. The record of the generated waves allows for the detection and localization of water leakage. This method is efficient but suffers from high operating costs.
The large variety of developed methods highlights the great difficulty of detecting and localizing water leakage in urban water distribution systems because of the complexity of these systems.
The recent progress in the real-time monitoring of water distribution systems offers new opportunities to develop data-based methods for water leakage detection and localization. Machine learning-based methods have been widely used to detect and localize water leakage in water distribution systems.
Caputo and Pelagagge [15] used artificial neural networks (ANNs) to detect and localize water leaks in water distribution systems. Data were generated using a hydraulic model of the water network for various operating conditions and cases with different locations and amounts of the water leak. The method detected leaks correctly in small water distribution systems. Salam et al. [16] used the radial basis function neural network method for leak detection. The hydraulic software EPANET was used for data generation. The pressure variations in the water network were used as input data for the ANN model, while the leak intensity and locations constituted the output parameters. The authors showed that the method could detect the magnitude and the location of leakage with 98% accuracy. Mounce et al. [17] used the ANN method to identify anomalies in water distribution time series data in a pattern matching-based approach. This method was based on searching for similarity between new events and profiles established from past events. It allowed the classification of new events and, consequently, the identification of abnormal events, which could be related to leaks. Recently, Rojek and Studzinski [18] used the ANN method to detect and localize water leakage in water distribution systems. Tests on real off-line data showed that the ANN method correctly identified the location of simulated leaks.
Zhang et al. [19] used the multiclass support vector machine (SVM) method for leakage detection in a large-scale water distribution network. First, K-means clustering was used to subdivide the water network into leakage zones. Then, data with leakage events were generated using the Monte Carlo method together with the hydraulic model. The authors showed that the multiclass SVM could identify the leakage zone using flow and pressure data. However, Chan et al. [20] reported that this method faced a significant challenge concerning the determination of the number of clusters and the high impact of the random choice of the first cluster on the clustering process.
Soldevila et al. [21] used the K-nearest neighbors to classify data generated by the hydraulic model EPANET from the simulation of leakage events at the totality of the nodes of the water distribution network. Data were then used to train the K-Nearest Neighbors model to localize the leakage area. The good performance of this method in the localization of one water leak was assessed on three examples.
Ciupke [22] used the regression tree method to detect water leakage. Alerts were established when the water flow exceeded the normal water flow range. The method was tested on real examples and gave very good results, even for detecting small leaks.
Van der Walt et al. [23] analyzed the capacity of Bayesian probabilistic analysis, the support vector machine, and an artificial neural network to detect and localize water leakage from pressure and flow data. These methods were compared to data generated from numerical modeling and laboratory tests. Since analysis showed that the performances of these methods depend on the complexity of the water network and the amount of available data, the authors did not propose general recommendations for the use of the machine learning methods for leak detection.
This literature review shows that intensive research has been conducted to use machine learning methods for leakage localization. However, the literature is still missing a comparison of the different categories of machine learning methods to localize water leakage in the same water distribution system. This paper proposes to fill this gap by comparing the capacities of various categories of machine learning methods to localize leakage in a complex water distribution system based on the water network of the scientific campus of Lille University in France.

Research Methodology
This research aimed at investigating the capacity of machine learning methods to localize leakages in water distribution systems using flow and water pressure data. Following the methodology proposed by different scholars [15,16,19,23], the hydraulic software EPANET was used to create data related to different scenarios of leakage in the water distribution system of the scientific campus. For each leakage scenario, EPANET provided the water flow from the supply sections and the pressure in five hydraulic areas of the campus. The generated data were then used for training and testing six machine learning methods. The tests were first conducted with water flow and pressure data.
The performances of the machine learning methods were assessed using the accuracy, precision, recall, and F1-score, which are determined from the confusion matrix (Table 1).
The following sections present the generated data and the machine learning methods used in this research.
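The four scores named above can be computed directly from a confusion matrix. The sketch below assumes the usual multi-class definitions with macro-averaging over the classes; the averaging scheme and the example matrix are illustrative assumptions, not values taken from this study.

```python
def classification_metrics(cm):
    """Return accuracy and macro-averaged precision, recall and F1-score.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Macro-averaging is an assumption made for this sketch.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total

    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Hypothetical 3-class confusion matrix, for illustration only.
cm = [[8, 2, 0],
      [1, 9, 0],
      [0, 0, 10]]
metrics = classification_metrics(cm)
```

Per-class precision and recall, such as the zone-level values discussed with the confusion matrices, correspond to the individual entries averaged inside the loop.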

Data Generation
Data were generated using the software EPANET, developed by the Water Supply and Water Resources Division (Formerly the Drinking Water Research Division) of the US Environmental Protection Agency.
The water distribution network of the scientific campus of Lille University was used as support for this research. This campus represents a small town with approximately 150 buildings and 25,000 users, including students, faculty members, and technical and administrative staff [24]. Figure 1 illustrates the water distribution network of the campus [25,26]. The water network is composed of 15 km of strongly meshed pipes. The water company supplies the campus with water at three sections, located in the North, West, and South of the campus (Figure 1).
Table 2. Leakage scenarios used for the generation of data (leak nodes are given in Figure 2).
Each leak scenario was modeled under two conditions. The first condition concerned a constant pressure at the water supply sections, which were considered as tanks with a constant water height (H = 40 m). The second condition concerned the water leak, which was modeled by the following relation between the pressure (P) and the leak flow rate (Q): $Q = C P^{a}$. The parameters C and a characterize the water leakage and designate the emitter coefficient and emitter exponent, respectively. Simulations were conducted with a = 0.5 and C = 1.
Table 4 provides a statistical analysis of the generated leak data. It shows that tank 1 provided the highest water supply rate (supply flow rate ratio = 0.41), followed by tank 2 (flow rate ratio = 0.35). This means that the water supply of the campus was mainly provided from the north and west of the campus, where the construction density was higher than that in the south of the campus. The highest average pressure was observed in zone 3, located in the South of the campus (average pressure approximately 35 m), followed by zones 5 and 4 (average pressure approximately 30 m). The average pressure in zones 1 and 2 was approximately 28 m.
Figure 3 illustrates the impact of the leakage position on the flow rate ratios FL1, FL2, and FL3. It shows that leakage in zones 1 and 2 caused a high flow rate from tank 1 (FL1), a medium flow rate from tank 2 (FL2), and a low flow rate from tank 3 (FL3). Leakage in zone 3 caused a high flow rate from tank 3 (FL3), medium flow from tank 2 (FL2), and low flow from tank 1 (FL1). Leakage in zone 4 caused a high flow rate from tank 2 (FL2) and low to medium flow from tank 3 (FL3). Finally, leakage in zone 5 caused a high flow rate from tank 1 and tank 2 (FL1 and FL2) and flow from tank 3 (FL3). Table 5 summarizes the impact of the leakage position on the water flow rate from the three supply sections. It can be observed that a high flow rate from tank 2 (FL2) could be attributed to leakage in zone 4, and a high flow rate from tank 3 (FL3) could be attributed to leakage in zone 3. A high flow rate from tank 1 (FL1) could be attributed to leakage in zone 1. A medium flow rate from tanks 1 and 2 could be related to leakage in zones 2 and 5.
Figure 4 illustrates the impact of the leakage position on the pressures PZ1 to PZ5. It shows that leakage in each zone caused a significant drop in the pressure in the leakage zone. It also shows a significant impact of some leakages in a zone on the pressure in other zones, such as the impact of (i) leakage in zone 1 on the pressure in zone 2, (ii) leakage in zone 2 on the pressure in zone 1, (iii) leakage in zone 3 on the pressure in zone 4, and (iv) leakage in zone 5 on the pressure in zone 2.
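The leak model and the flow rate ratios used in this section can be sketched in a few lines. The sketch assumes the usual EPANET emitter form, in which the leak outflow equals the emitter coefficient C times the pressure raised to the emitter exponent a, and it assumes that FL1, FL2, and FL3 are each tank's share of the total supply. The function names and the numerical inputs are hypothetical.

```python
def emitter_leak_flow(pressure, c=1.0, a=0.5):
    """Leak outflow from an emitter: Q = C * P**a.

    Defaults follow the values quoted in the text (C = 1, a = 0.5).
    Units depend on the flow and pressure units of the hydraulic model.
    """
    return c * max(pressure, 0.0) ** a


def supply_ratios(q1, q2, q3):
    """Share of the total supply delivered by each of the three tanks."""
    total = q1 + q2 + q3
    return (q1 / total, q2 / total, q3 / total)


# Illustrative numbers only (not taken from the simulations).
leak = emitter_leak_flow(36.0)
fl1, fl2, fl3 = supply_ratios(41.0, 35.0, 24.0)
```

Working with ratios rather than raw flows makes the supply pattern comparable between scenarios with different total demand, which is what allows the zone-by-zone comparison summarized in Table 5.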

Use of Machine Learning Methods
Analyses were conducted with three supervised machine learning methods (logistic regression, decision tree, and random forest), two unsupervised methods (hierarchical classification and a combination of the principal component analysis (PCA) and K-means methods), and an artificial neural network (ANN). Simulations were conducted using the Kaggle platform (https://www.kaggle.com, accessed on 20 September 2021). The following sections briefly present the methods used in this research.
Logistic regression is used for binary classification [27]. The method used in this work was based on the functions:
$$h_\theta(x) = g(\theta^{T} x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
where x is the input data, and θ is the parameter determined by the minimization of the cost function.
The decision tree method is based on applying a series of questions to determine the model response [28,29]. This method classifies a population into branch-like segments that construct a tree with a root node, internal nodes, and leaf nodes. The model generates a flowchart (tree), where each internal node (represented by a question) tests a feature and directs the data down the branches (the result of the split). Splits are evaluated with the "gini" coefficient, which is defined as follows:
$$\mathrm{Gini} = 1 - \sum_{i=1}^{c} p(i)^{2}$$
The parameter c designates the total number of classes; p(i) is the probability of picking a data point with class i.
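As a minimal sketch of the two quantities described above, the following code evaluates a logistic (sigmoid) response for a parameter vector θ and an input x, and the Gini coefficient of a set of class labels. It assumes the standard textbook forms of both functions; the helper names and the zone labels are illustrative.

```python
import math
from collections import Counter


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_response(theta, x):
    """Model response for parameters theta and input vector x."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def gini(labels):
    """Gini coefficient 1 - sum_i p(i)**2 over the classes in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = gini(["Z1", "Z1", "Z1"])         # a single class in the node
mixed_node = gini(["Z1", "Z2", "Z1", "Z2"])  # two equally likely classes
```

A decision tree prefers the split whose child nodes have the lowest weighted Gini coefficient, so a node containing a single zone is the ideal leaf.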
Random forest methods are used for both classification and regression by combining randomized decision trees [30,31]. Each decision tree gives a vote for a target variable. The random forest algorithm chooses the combination that obtains the highest vote. This method has high predictive accuracy; it is efficient on large data sets and works well with missing data. However, it suffers from interpretation difficulties and overfitting in the case of noisy data.
The hierarchical classification method is used to build a hierarchy of clusters. The results of clustering are usually presented in a dendrogram. Hierarchical classification can be conducted using (i) a "bottom-up" approach, where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy, or (ii) a "top-down" approach, where all observations start in one cluster, and splits are performed when moving down the hierarchy. The Ward method is used in this analysis [32]. The PCA method is used to reduce the input data dimension by focusing on the principal components [33].
K-means clustering is a type of unsupervised learning. It aims at partitioning n observations into k clusters [34]. Initially, K initial means are randomly generated. Then, K clusters are created by associating each observation with the nearest centroid. Next, the objective function, the sum of the distance, is optimized until the best cluster centers candidates are found. Finally, data points are clustered based on feature similarity.
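The steps listed above (random initial means, assignment to the nearest centroid, centroid update until the assignment stabilizes) can be sketched with the standard library alone. This is a plain Lloyd-style iteration written for illustration; it is not the implementation used in the study, and the sample points are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct observations as the initial means.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: associate each observation with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop when the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Two well-separated hypothetical groups of observations.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
```

When the groups overlap in feature space, the nearest-centroid assignment cannot separate them, which is the limitation reported later for the pressure and flow data.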
The ANN is inspired by the functioning of the human brain [35]. It transforms the input data (input layer) through a series of neural layers (hidden layers) into output data (output layer). The transformation is based on weights, which are adjusted by optimizing the prediction on a training data set. The sigmoid function is used in the data transformation.

Input and Output Parameters
Machine learning methods were used with the following input and output parameters:
• Input parameters: water flow at the three supply sections (FL1, FL2, FL3) and water pressure values at five observation points (PZ1, PZ2, PZ3, PZ4, PZ5);
• Output parameter: number of the campus zone (Z1, Z2, Z3, Z4, Z5).

Supervised Methods
The training phase of the supervised methods was conducted with 80% of the data, while 20% was used for the testing phase. Table 6 summarizes the results obtained with the water supply flow data. It shows that both the logistic regression and random forest methods gave excellent results, with accuracy = 1.0, precision = 1.0, recall = 1.0, and F1-score = 1.0. The decision tree method gave very good results, with accuracy = 0.95, precision = 0.96, recall = 0.95, and F1-score = 0.95.
Table 6. Classification report for the supervised methods: flow data.
Table 7 summarizes the results obtained with the pressure data. It can be observed that both the logistic regression and the random forest methods gave excellent results, with accuracy = 1.0, precision = 1.0, and recall = 1.0. The decision tree method gave good results, with accuracy = 0.88, precision = 0.91, recall = 0.94, and F1-score = 0.91. Figure 6 shows the confusion matrix for the decision tree method. It indicates excellent performance for all the zones, except for zone 1 (recall = 0.70) and zone 2 (precision = 0.54). The poor results for zones 1 and 2 could be related to the spatial proximity of these zones and their hydraulic interaction (Figures 2 and 4).
Table 7. Classification report for the supervised methods: pressure data.
Since the logistic regression and the random forest methods gave excellent results with either the flow or the pressure data, the combined pressure and flow data were used with only the decision tree method. Table 8 summarizes the classification report for the decision tree. It shows that this method gave excellent results, with an accuracy of 0.98, precision of 0.97, recall of 0.97, and F1-score of 0.96. The performance obtained with the combined flow and pressure data was better than that obtained with the flow data alone (Table 6) or the pressure data alone (Table 7). Figure 7 shows the confusion matrix for the decision tree method. It indicates excellent performance for all the zones, except for zone 2 (recall = 0.83) and zone 5 (precision = 0.83).
Figure 8 shows the results obtained with the hierarchical classification method with the pressure data. It shows the existence of three groups: the first group, G1, was composed of the pressures in zones 3 and 4; the second group, G2, concerned the pressure in zone 1; the third group, G3, included the pressures in zones 2 and 5.
Figure 11 illustrates the results obtained with the pressure data for k = 5 clusters. Both PC1 and PC2 showed significantly overlapping clusters. In the (PC1, PC2) plane, the five clusters were not well distinguished.
Figure 12 illustrates the results obtained with the flow rate and pressure data for k = 5 clusters. Both PC1 and PC2 showed overlapping clusters. In the (PC1, PC2) plane, only three clusters could be well distinguished.

Artificial Neural Network
Analyses were conducted with a multilayer backpropagation neural network model. Figure 13 shows the results of the application of the ANN method with the water supply data. It indicates a rapid convergence of the ANN model. Indeed, a good convergence was observed with approximately 20 epochs. The ANN model gave excellent results with an accuracy = 1.0, precision = 1.0, recall = 1.0, and F1-Score = 1.0 (Table 9).

Analysis of the Water Leak in the Scientific Campus of Lille University
This section presents an analysis of the leak in the scientific campus of Lille University. The analysis was based on daily flow data collected in 2015 at the three supply sections: FL1 in the North, FL2 in the West, and FL3 in the South. The year 2015 was selected because of the availability of data for this year and the observation of several abnormal events in the water consumption, related to water leakage. The water usage in the campus concerns mainly domestic activities in the students' residences, academic activity, and the cleaning of buildings. Water is not used for irrigation. Since the water usage is related to regular activities, the water consumption at the daily scale is expected to be regular.
The following sections successively present an analysis of the daily water consumption and the leakage detection and localization. Figure 16 illustrates the variation of the daily water consumption of the campus (Qd). It indicates missing data in the period from May 3 to May 27. This period is not considered in the analysis. This figure shows a significant variation in Qd. The minimum daily consumption was equal to 414 m³, while the maximum was equal to 1680 m³ and the average consumption was equal to 890 m³. Low daily consumption values could be attributed to the vacation periods, while the high daily consumption values could be associated with water leakage. Figures 17 and 18 illustrate the distribution of the daily water supply among the three supply sections. They show that the daily water supply from the North (F1D) was higher than those from the West and South of the campus. It also had the most significant variation (Table 10): the minimum daily supply was equal to 100 m³, while the maximum was equal to 772 m³. Thus, the water supply F1D accounted for about 50% of the total water supply, while F2 accounted for 28% and F3 for 22% of the campus water supply.

Leakage Analysis
The identification of leakage events was based on the observation of abnormal water consumption. Figures 19 and 20 show the events with daily water consumption exceeding 1200 m³ (average water consumption + 1.5 standard deviations). We observed the presence of five groups of events, which are summarized in Table 11. The first group (G1) corresponds to day 76, with a consumption exceeding the average water consumption (Qav) by approximately 464 m³, followed by day 86 (G2), which exceeded Qav by 326 m³. Figure 21 and Table 12 show the distribution of the water supply ratios related to leakage events. They show that the ratio associated with FL1 was higher than those associated with FL2 and FL3. FL1 accounted for 56% of the total water supply for groups G1, G2, and G5, while FL2 and FL3 accounted for approximately 22% each. For groups G3 and G4, FL1 accounted for approximately 46% of the water supply, while FL2 and FL3 accounted for approximately 27% each.
Figure 21. Distribution of the water supply ratios related to leak events.
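The detection rule described above (daily consumption above the average plus 1.5 standard deviations) can be sketched as follows. The sketch assumes the population standard deviation, since the text does not state which estimator was used, and the ten-day series is invented for illustration.

```python
import statistics


def flag_abnormal_days(daily_consumption, n_std=1.5):
    """Return (threshold, days) where days lists the 1-based indices of
    values above mean + n_std * standard deviation.

    The population standard deviation is an assumption of this sketch.
    """
    mean = statistics.mean(daily_consumption)
    std = statistics.pstdev(daily_consumption)
    threshold = mean + n_std * std
    days = [day for day, q in enumerate(daily_consumption, start=1)
            if q > threshold]
    return threshold, days


# Hypothetical daily consumption in m³ (not the campus records).
series = [900.0] * 9 + [1500.0]
threshold, abnormal_days = flag_abnormal_days(series)
```

A rule of this kind only flags candidate events; days with legitimately high demand would also exceed the threshold, so each flagged day still needs to be checked against the known water usage.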

Leakage Localization
For the localization of leakage events G1 to G5, the water supply ratios corresponding to the leakage events are reported in Figure 22. The water flow ratios for the events G1, G2, and G5 are reported together with the water flow ratios (FL1, FL2, and FL3) for leakages in zone 1, while those related to the events G3 and G4 are reported together with the water flow ratios for leakages in zone 2. Leakages G1, G2, and G5 matched well with the water flow distribution for leakages in zone 1, while leakages G3 and G4 matched well with the water flow distribution for leakages in zone 2. This observation indicates that leakages G1, G2, and G5 could be attributed to zone 1, while leakages G3 and G4 could be attributed to zone 2.
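The matching step described above amounts to comparing the supply ratios of an observed event with the ratios simulated for each zone and keeping the closest zone. The sketch below uses the Euclidean distance for that comparison, which is an assumption, and the zone signatures are hypothetical placeholders rather than the simulated values of this study. Only the two event ratio triples (56/22/22 and 46/27/27) are taken from the text.

```python
import math

# Hypothetical (FL1, FL2, FL3) signatures per zone, for illustration only.
ZONE_SIGNATURES = {
    "Z1": (0.55, 0.25, 0.20),
    "Z2": (0.45, 0.30, 0.25),
    "Z3": (0.20, 0.30, 0.50),
    "Z4": (0.25, 0.50, 0.25),
    "Z5": (0.40, 0.40, 0.20),
}


def closest_zone(event_ratios, signatures=ZONE_SIGNATURES):
    """Return the zone whose signature is nearest to the event ratios."""
    return min(signatures,
               key=lambda zone: math.dist(event_ratios, signatures[zone]))


# Supply ratios reported for the leakage events.
zone_g1_g2_g5 = closest_zone((0.56, 0.22, 0.22))
zone_g3_g4 = closest_zone((0.46, 0.27, 0.27))
```

Because neighboring zones can have similar supply signatures, a match based on flow ratios alone gives an indication rather than a definitive location.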

Discussion
This research concerned the detection and localization of leaks in urban water distribution networks. This issue is of significant concern in the management of the water distribution systems, because leaks in the water distribution system cause substantial economic, social, and environmental impacts and severe damages to the surrounding soils and infrastructures.
Despite the important research on the development and use of hardware- and software-based methods for the detection and localization of water leaks, professionals still need efficient and cost-effective methods to detect water leaks in complex water distribution systems.
The recent progress in smart monitoring and artificial intelligence provides significant opportunities to develop data-based methods for leak detection and localization. The literature review showed considerable interest in the use of these methods. However, on the one hand, the majority of the applications using artificial intelligence methods remain at the research stage. On the other hand, the literature review revealed a lack of comprehensive use of these methods. This research aimed to fill the gap in this area by thoroughly investigating the capacity of machine learning methods to detect and localize leaks in the water distribution system.
The water network of the scientific campus of Lille University was used as support for this research. This use was motivated by the campus' representativity of a small town, the complexity of the water network and the availability of data about the water network asset and water consumption. The water network is monitored by approximately 93 automated meter readings (AMRs) that record the water supply and consumption in the main buildings at an hourly time interval.
The physical water network was complemented by the construction of a lab pilot of this network to investigate, under well-controlled conditions, the impact of the position of a leak on the water flow rates. The results of the experiments showed an evident influence of the leak position on the water supply flow rates when the leak was in the proximity of the water supply. However, the impact was unclear for other locations, which means that the leak position could not be systematically determined from the supply flow rates alone. In the future, it would be of interest to equip the pilot with pressure cells to investigate the possibility of improving the leak localization using both the water supply flow rates and the pressure variation in the water network.
A large data set was built regarding the impact of leaks on the water network of the scientific campus on the variations in the water supply flow rates and the pressure in five campus zones. This data set was constructed using the hydraulic software EPANET. The data set included the responses of the water network to 215 individual and double leaks.
The data set was used for training and testing the following six machine learning methods:
• Three supervised methods: logistic regression, decision tree, and random forest;
• Two unsupervised methods: the hierarchical classification method and a combination of the PCA and K-means classification methods;
• The ANN.
The results of the tests conducted on these methods showed:
• Excellent performance of the supervised methods in the localization of leaks in the water network. Both the logistic regression and the random forest predicted the position of the leak with an accuracy = 1.0. In contrast, the decision tree predicted leaks with an accuracy = 0.98 with pressure and flow data;
• Excellent performance by the ANN for the localization of water leaks in the water network (accuracy = 1.0);
• Some difficulties in exploiting the clustering capacity of the unsupervised methods in the leak localization because of overlapping clusters.
The results of this research were used to investigate the position of water leaks in the campus using water flow rate data recorded in 2015. Unfortunately, difficulties were encountered in the determination of the position of leaks because of a lack of pressure data. Therefore, in the future, we recommend extending the monitoring of the campus water network by adding pressure cells on the campus and flow meters in critical sections of the water network.

Conclusions
This paper presented an investigation of the use of machine learning methods to localize leakage in the water distribution network. Leakage localization was based on the creation of hydraulic zones in the water distribution network. For each zone, sensors are used to measure the water supply variations and the water pressure. Collected data were then used for the construction of the machine learning models.
This methodology was used to investigate the capacity of six machine learning methods to localize leaks in the water distribution network of the scientific campus of Lille University. Data were generated using EPANET software. The investigation showed (i) excellent performance from the supervised methods, in particular, the logistic regression and random forest; (ii) excellent performance by the artificial neural network; (iii) difficulties in the exploitation of the clustering capacity of the unsupervised methods in leak localization because of overlapping clusters. Offline water supply flow data were then used for the localization of water leakage in the scientific campus. The results gave some indications about the location of the water leakage.
This paper shows that the ANN and the supervised logistic regression and random forest methods performed well in the localization of water leakage in water distribution systems, particularly when using both water flow and pressure data. These results are based on data generated using the software EPANET. Therefore, they should be confirmed on data collected from complex water networks, including water supply flow and pressure data in the subzones of the water network, and the location of leakage events.
Data Availability Statement: Data sharing not applicable.

Conflicts of Interest:
All other authors have no conflict of interest.