A Leak Zone Location Approach in Water Distribution Networks Combining Data-Driven and Model-Based Methods

: Model-based and data-driven methods are commonly used in leak location strategies in water distribution networks. This paper formulates a hybrid methodology in two stages that complements the advantages and disadvantages of data-driven and model-based strategies. In the ﬁrst stage, a support vector machine multiclass classiﬁer is used to reduce the search space for the leak location task. In the second stage, leak location task is formulated as an inverse problem, and solved using a variation of the differential evolution algorithm called topological differential evolution. The robustness of the method is tested considering measurement and varying demand uncertainty conditions ranging from 5 to 15% of node nominal demands. The performance of the hybrid method is compared to the support vector machine classiﬁer and topological differential evolution approaches as standalone methods of leak location. The hybrid proposal shows higher performance in terms of location accuracy, zone size, and computational load.


Introduction
Water, a vital resource for humanity, plays multiple roles in everyday life: ranging from drinking water consumption to the fulfilment of daily tasks. Aiming to guarantee a ready supply of water for the continuously expanding urban areas, bigger and more convoluted water distribution networks (WDNs) have been implemented. As these distribution systems grow in size and complexity, the task of monitoring and diagnosing faults in their behaviour becomes more complex [1]. Leaks and pipe bursts are a common anomalous state in WDNs that generate significant waste worldwide in terms of water losses, wasted energy, and maintenance costs. As an average, 30% of the water pumped into urban areas is lost due to unattended leaks [2], in some cases reaching nearly 50% of the water pumped. Leaks can be classified into background leaks or pipe bursts [3]. Background leaks are commonly small in size and their effect is nearly imperceptible; therefore they represent a small percentage of water losses. Pipe bursts, on the other hand, are leaks of a bigger size and impact which can cause pressure drops in the network. These are the main cause of water losses in WDNs and its location is a current scientific problem that occupies the attention of several research groups in the world.
In order to detect the presence of a leak in a pipeline or set of pipes, hardware-oriented methods are employed. These are characterized by the use of specialized equipment based on infrared sensors, CCTV cameras, moisture sensors and, the most common, acoustic sensors, among others [2,4,5]. The analysis of the transient dynamics logged by most of these technologies allows an exact estimation of the leak location, and sometimes size [6,7]. The development in data logging technologies through wireless sensor networking and the internet of things have rendered hardware-based leak location technologies more accesible [8]; however, the pipeline-level nature of these hardware-oriented methods reduces them to a local use. Employing these technologies to locate a leak by exploring a large area would be time consuming and result in high operational costs. Therefore, the exploration of every potential leak pipe in a WDN would be prohibitive [9].
Higher scale software-based leak location approaches have therefore been developed by several authors using the operational variables (i.e., pressures and flows) in the network [10][11][12][13][14][15][16][17][18]. These approaches have received significant attention due to the increasing development and accessibility of SCADA systems, as well as other data acquisition technologies. Software-based leak diagnosis methodologies can be classified into two distinct groups: model-based strategies [15,16], and data-driven approaches [10,11,14]. Hybrid solutions have also been proposed by combining data-driven and model-based approaches [17].
Model-based strategies are centered around a mathematical model of the WDN that describes the relationship between its operational variables, taking into account the network's structure (topological layout, pipe dimensions, etc.). Leak diagnosis is then effected by comparing the model outputs with the measured variables in the network. Quiñones-Grueiro et al. [15] and Steffelbauer et al. [19] formulate the leak location task as an inverse problem, and identify the location of the leak by finding the optimum network parameters that maximize the similarities between the model output and the measured leak sample. The resulting optimization problem is then solved using the differential evolution (DE) algorithm in [19]; and a modified version of DE which considers the topological characteristics of the network is used in [15]. Li et al. [18] propose a more direct approach, using the network model to generate a set of sensitivity matrixes that characterize several leak scenarios in the network. The similarity between the leak samples and the simulated samples in the sensitivity matrixes is then evaluated in order to identify the simulation conditions (leak location and size) of the most similar simulated scenarios.
Model-based leak diagnosis strategies do not require historical data of all the network modes, i.e., leaks of different sizes and locations. However, a model of the WDN is not always available, and its development may be too expensive or complex. Furthermore, the quality of these approaches depends directly on proper model calibration [16]; which should include the modeling of uncertainties like variations in consumer demands, variations in pipe roughness and diameters due to old age, and sensor accuracy and noise.
Data-driven methodologies take advantage from the historical data from the network and develop data analysis strategies. These may include statistical analysis such as process control charts [10]; however, an increase in the use of machine-learning-oriented methodologies has taken place in the past years [11,14,20,21]. Romero et al. [14] and Zhou et al. [12] define the leak location as a classification problem, and solve it using deep learning techniques. Chen et al. [13] and Shekofteh et al. [22] implement random forests (RF) and artificial neural networks (ANNs), respectively, to solve the leak location problem as a hierarchical classification problem. Sun et al. [21] estimate the pressure values in every node of the network by applying Kriging spatial interpolation [23]. These estimated pressure values are then used to locate the leak applying linear discriminant analysis (LDA) and ANNs as classifiers. Other machine learning techniques have been used such as support vector machines (SVMs) [11,20] and Gaussian process regression [24].
On the contrary of the model-based approaches, data-driven methodologies only require knowledge of the structural properties of the network and historical data from a representative set of network operational modes. However, this advantage doubles as an obstacle, since historical data of all the network modes is rarely available. Therefore, synthetic data generated by a network model is often used for training these data-driven methodologies [11][12][13].
Moser et al. [17] present a hybrid strategy named error-domain model falsification, which diagnoses the leak sample by comparing it to a set of model predictions. This comparison is effected against a threshold that characterizes the variations (uncertainties) in the network operational variables. This decision thresholds are generated from a statistical analysis of historical data. Chen et al. [13] and Zhang et al. [20] use clustering algorithms based on sensitivity matrixes generated by network models to group the network nodes into zones before locating the leaks.
A leak can appear in any location in the network. However, it is an accepted assumption in software-based methodologies to estimate the leak location strictly as a node [11,12,25]. This simplification of the leak location problem has nevertheless been insufficient due to the inherent uncertainties in the WDN operational variables; and the fact that, in most networks, the number of sensors is significantly smaller than the number of nodes. Therefore, finding the exact leak node with high accuracy is a difficult task. Several works have estimated a zone (a group of network nodes) instead of a single node as a leak location [11][12][13]15,20]. This increase in potential location size further simplifies the leak diagnosis, allowing higher performance. Once the location of the leak has been estimated to a relatively small area, hardware-oriented technologies can be used in order to identify the exact location of the leak at pipeline level.
Clustering is often used for generating the zones in the network. Zhang et al. [20] and Chen et al. [13], for example, use k-means clustering to group the network nodes with similar leak patterns by analyzing a sensitivity matrix. Quinones-Grueiro et al. [11], however, use k-medoids clustering to group the nodes according to their topological characteristics (shortest pipe distances between network nodes). Another zone generation approach consists in selecting multiple ranking candidates from a final solution. Such is the case of Zhou et al. [12] who, through a neural-network-based methodology, produces a location probability for every node in the WDN. The top 5 nodes with the highest leak location probabilities are then selected as the potential leak zone location. The topological characteristics of the network are also considered for leak zone generation by including the area near the estimated leak node. Quiñones-Grueiro et al. [15] construct a zone by also considering the nodes neighboring the identified leak node as potential leak locations. A neighboring node is defined as connected to the initial estimated location node through a single pipe (regardless of the pipe length). Li et al. [18] analyse the relationship between zone size and location accuracy by generating a zone with the nodes that fall within a threshold pipe distance (defined as the total pipe length in the shortest path between two nodes) from the initial estimated location. As expected, a higher pipe distance threshold reduces the estimated location size while increasing the accuracy of the location strategy.
The main goal of this work, and its main contribution, is the proposition of a hybrid methodology for the location of leaks in WDNs. This methodology consists of two stages: an initial data-driven stage in which a subzone of the WDN is identified as a potential leak location by means of a multiclass SVM classifier; and a second stage that improves the leak location estimated in the first stage through a model-based approach. This second stage is formulated as an inverse problem, and it is solved using a variation of the Differential Evolution (DE) algorithm called Topological Differential Evolution (TDE). The first stage is meant to work as a search space reduction for the inverse problem. This search space reduction is expected to improve leak location performance and reduce computational cost.
The proposed methodology is tested under different demand uncertainty conditions using the hydraulic model of the WDN in the city of Modena, Italy. The performance of the hybrid method is compared to the SVM classifier and TDE approaches independently as standalone methodologies.
This paper is structured as follows: Section 2 presents the theoretical bases for this study, as well as the methodology proposed for the location of leaks. Section 3 presents the Modena WDN as case study to evaluate the proposed strategy; furthermore, the characteristics of the developed simulations are detailed, and experimental design is explained. In Section 4, test results are presented and discussed, and the proposed method is compared to other leak location approaches. Finally, conclusions and recommendations for future works are presented.

Network Modeling
The hydraulic model of a WDN requires previous knowledge of its topological characteristics, as well as a detailed account of the physical principles that rule its behaviour. An approximated mathematical model can be developed based on the mass and energy balance equations of the network, assuming that flows are stationary, incompressible and permanent; constant fluid density is also assumed. This model is based around two main elements: pipes or links, which represent actual pipelines; and network nodes, or junctions, which represent important consumer takes, output and entry points in the network, and connections between pipes.
The steady-state mathematical model of a WDN is described by a set of equations based on the associated physical laws. The flow dynamics in the jth node of the network are modeled as follows [26]: where N T is the number of nodes in the WDN, n j is the number of pipes connected to node j, q in ij and q out ij represent input and output flow, respectively, at node j through pipe i, and d j represents the consumer's demand at the node.
The energy balance equations of the model are formalized as follows: which states that the head difference ∆H between any two nodes in the network is always the same for every path between the two nodes [26]. With h e being the head drops associated with E elements in the path and h f representing head lifts in F elements along the path. The head-flow relationship model in a pipe has been defined in different ways. Darcy-Weisbach's and Hazen-Williams's, presented below, are the most common: where q is the flow in the pipe and h d is the head drop. R is a coefficient that depends on the model used, which summarizes several network characteristics like pipe dimensions and roughness coefficients; and usually has a value close to 2 [26]. Leaks are modeled as pressure-driven aggregated demands in the nearest node; the node flow dynamic presented in (1) is then redefined for the leak node l as: where ξ = 0.5 [27], h l is the pressure head at node l, and E c is an emitter coefficient, which summarizes the leak dimensions.

Leak Location Methodology
The methodology proposed for leak location is presented in Figure 1. A previous leak detection step is assumed following any of the procedures found in the literature [9,15,17,28]; therefore, only leaks detectable by the sensors installed in the network will be located. A set of measured hydraulic variables m ∈ n s represents each of the leak samples, with n s being the number of sensors (S) installed in the network. The estimated leak location is generated in two consecutive stages: On the first stage, the WDN is partitioned into a set of candidate leak zones using agglomerative clustering [29], and taking into account the topological relationships between network nodes, presented in Λ ∈ N T ×N T . A previously trained SVM multiclass classifier, combined with Bayes temporal reasoning [30], is then used to estimate a potential leak zone Z C . In this stage, a potential leak size range E Crange = [E cmin , E cmax ] is also estimated for each sample; and an analysis of dominant sensors (S d ) is effected based on the estimated leak zone Z C . On the second stage, an estimated leak node is generated as the solution of an inverse problem using Topological Differential Evolution (TDE) [15], with Z C and E Crange as search space restrictions. Finally, temporal reasoning is applied to the leak location estimated as the solution of the inverse problem and extended to the nearby neighbor nodes following the procedure presented in Section 2.4.1, generating the estimated leak zone location Z ngh for sample m. The requirements for the implementation of the proposed methodology in a practical WDN application are presented below: • A set of n s sensors installed in the network to capture the hydraulic information. • A SCADA system for data acquisition and processing. • A calibrated network hydraulic model. • A processing unit (e.g., a personal computer (PC) or industrial PC) to implement and run the methodology.
The previous requirements can be fulfilled for many urban networks today, being the hydraulic model the main constraint due to the complexity of its calibration. The proposed methodology does not require a specific number, or type (pressure or flow), of sensors to be installed in the network, however, a large number of sensors will most likely result in a better performance. In spite of being a hybrid leak location methodology, historical data of the network behaviour is not necessary, since synthetic data generated by the network model can be used to train the multiclass classifier. Furthermore, the proposed methodology is completely unintrusive, not requiring any water supply reductions or interruptions for leak diagnosis. On the first stage, a clustering algorithm is used to divide the network into a set of zones. In this work, agglomerative clustering [29] was selected due to its simplicity and effectiveness. Agglomerative clustering is a hierarchical clustering algorithm that generates a cluster dendrogram which goes from N C clusters at the bottom level to a single cluster at the top. This dendrogram is built by iteratively grouping a set C = {c 1 , c 2 , . . . , c n c } of n c elements [29]. For WDN node grouping, every node in the network is considered as an element, and n c = N T , with C = {1, 2, . . . , N T }. A distance (or similarity) metric d(·, ·) must be defined between network nodes in order to build the cluster dendrogram. In this case, the total pipe distance in the shortest path between two nodes is considered as the similarity metric: d(c i , c j ) = Λ[i, j]. Once the cluster dendrogram has been built, n z clustered zones can be produced, with 1 ≤ n z ≤ N C , by selecting the corresponding level in the dendrogram [31]. By matching every leak sample with the clustered zone that contains the actual leak node, n z class labels are defined for the multiclass classifier.

Support Vector Machine Classifier
A Support Vector Machine (SVM) is selected as the classifier for the first step in the leak zone location methodology presented in this work. SVM classifiers are based on finding a hyperplane that optimally separates the two classes in training data set D by maximizing the margin between the hyperplane and the support vectors [32]. The support vectors are defined as the datapoints from any of the classes that are closest to the separating hyperplane. Aiming to maximize the separating power of the hyperplane, data samples are projected into a higher dimensional space by means of the Kernel trick. Therefore, a kernel function K(·, ·) is defined between any two datapoints. The separating hyperplane between two classes is formalized as follows: g(x) = w T x + b; where w ∈ H is a margin vector derived from the support vectors, with H being the dot product space of feature samples; and b is an offset value. The position of the hyperplane is described by both w and b, and x is a feature vector or data sample. The training of an SVM classifier is then reduced to finding the optimal values for w and b solving the optimization problem presented in (5) [33]: where a ∈ n t are Lagrange multipliers; D ∈ {x i , y i } n t is the training dataset; and C ∈ , C > 0, represents an error penalty that acts as an upper bound to limit the influence of the individual samples. For this work, the Radial Basis Function (RBF) Kernel, defined in (6), will be selected due to its non-linear nature, its small number of parameters, and its success rate in previous works [34].
In order to ensure the maximum performance of the SVM classifier, a proper hyperparameter selection must be effected for {C, γ} [20,34].

Bayes Temporal Reasoning
Analyzing leak samples over a time span T H ∈ Z + , rather than at a single time instance, has shown to improve the leak localization accuracy [15]. Therefore, Bayes temporal reasoning, as defined in [21,30], is applied to the classification probabilities from a group of T H leak samples before estimating a leak location Z C .

Leak Size Estimation
For this work, the leak size range is assumed to have been previously estimated for every sample following the procedure presented in [15]. The potential leak range for sample i is therefore estimated as where E ci is the actual leak size.

Dominant Sensor Selection
In medium to large WDNs, a leak at a given node has often little to no effect in the hydraulic variables (flows and pressures) of the nodes located far away. Therefore, variations due to uncertainties in measurements at locations far away from the leak node can result detrimental to the leak diagnosis accuracy. In order to avoid this effect, Li et al. [18] define a set S d ∈ [1, N T ] n sd ; S d ⊂ S of dominant sensors as the n sd sensors that best reflect the variations caused by a specific leak in the network, with S ∈ [1, N T ] n s . For this work, a set of dominant sensors was selected for every leak sample by considering the n sd sensors closest to the estimated leak zone Z C , following the procedure presented in Appendix A.

Stage 2: Leak Location as an Inverse Problem
An inverse problem consists in mapping the effect of the variation in model parameters on the input-output relationship of a given model [35]. Therefore, the solution of an inverse problem comes down to finding a parameter vector θ, such that the model outputŷ matches the real output y for a given input x. The leak location task can be posed as an inverse problem by defining the model parameters θ = {E c , ω} for a hydraulic model of the WDN, where the input vectord N ∈ N T represents the nominal demands, andm d is a vector of the model outputs measured by the dominant sensor. The leak location problem is then solved by finding the leak node ω in which a simulated leak of size E c maximizes the similarity between the output vectorm d and the real leak sample m d measured by the dominant sensors. Leak node estimation is then formalized as the optimization problem in (7): min where M(θ) =m d represents the network numerical model. The parameter search space consists of: [E cmin ; E cmax ], which defines the limits for the possible leak sizes; and Z C , which defines the topological search space (i.e., the potential leak location nodes). If the topological search space hasn't been restricted, every node in the network is a potential leak location, and Z C = {1, 2, . . . , N T }. The cost function f n is defined as the Euclidean distance between the real leak sample m d and the simulated leak samplem d as presented in (8): The optimization problem presented in (7) is solved using a variation of the differential evolution algorithm called topological differential evolution (TDE). For its implementation, a mutation (F) and a crossover (C r ) coefficient must be properly selected. Furthermore, the following stop criteria must also be selected for the TDE algorithm: maximum iterations g max , stagnation iteration limit S L , and cost tolerance ϕ. A detailed description of the TDE algorithm is presented in Appendix B.

Temporal Reasoning and Neighbor Expansion
Temporal reasoning is implemented for TDE by generating a leak location zone combining the estimated nodes over a time span T H . Defining ω = {ω 1 , ω 2 , . . . ω T H }, where ω i ∈ Z + is the estimated leak node for sample i, an estimated leak zone by means of temporal reasoning is then defined as Z T = ∪ T H i=1 ω i . As stated earlier, leak zone generation can be implemented extending the estimated leak location to the nearby neighbor nodes [15,18]. Neighbor zone extension from the temporal-reasoning-derived estimated zone Z T was implemented in this work. Assuming Z T consists of N n nodes with 1 ≤ N n ≤ T H , matrix Z m ∈ {0, 1} N n ×N T is defined: Submatrix Λ Z contains the pipe distances from every node in the network to each node in Z T , and is calculated as follows: Finally, the estimated leak zone Z ngh is generated by grouping the network nodes with a total pipe distance smaller than a threshold λ from at least one of the nodes in Z T : where

Leak Location Accuracy
Leak location accuracy is defined as the percentage of leak samples that were correctly located in the dataset: with where Ω i is the actual leak node for sample i and Z ngh i is the estimated leak zone for sample i.

Leak Zone Size
Two leak zone size indexes are defined. The first one is the number of network nodes in the estimated leak location: The second zone size index defined is the total pipe length in the zone and is formalized as follows: where β jk = 1 if j, k ∈ Z ngh i , and j and k are neighbor nodes 0 otherwise . (17) Note that ZPS i is calculated by dividing by 2 to account for the fact that Λ is a symmetric matrix. A small leak size index, both in number of nodes and total pipe distance, is a desirable result. Smaller estimated leak zones reduce the time it takes to find the exact leak location in a subsequent step. Mean and standard deviation values are calculated for every dataset: {ZNS, σ ZNS }, {ZPS, σ ZPS }.

Computational Cost
Since online test time for the classifier is negligible in comparison to the cost of solving the optimization problem presented in (7), computational cost CC is measured as the number of generations g last before reaching any of the stop criteria: Mean and standard deviation values are calculated for every dataset: {CC, σ CC }.

Modena Network
The proposed leak location methodology can be implemented in any WDN with a properly calibrated hydraulic model. However, its full potential is best taken advantage of in medium-to-large scale WDNs, since the reduction of the search space considered in stage 1 of the methodology is practically unnecessary in small networks. Therefore, the proposed methodology was tested on the hydraulic model of the WDN in the city of Modena, Italy (Figure 2), which presents characteristics common to most urban WDNs. It is a medium sized WDN with 317 pipes and 268 nodes [36]. It is gravity-fed by 4 reservoirs, therefore, no pumps are installed. This is a meshed WDN, with a relatively high number of connections per node, which results in higher correlation among the hydraulic variables at different nodes. Full supervision of pressures and flows in every node and pipe of a WDN is rarely found in practice since it would result in excessive instrumentation costs. Therefore, a great number of leak diagnosis methodologies have been implemented by supervising only a limited number of variables [13,15]. For the purpose of this study, ten pressure sensors are installed in the nodes marked in Figure 2. Pressure sensors are often preferred over flow sensors due to their lower installation and maintenance costs. The optimal location for the pressure sensors was determined by maximizing the leak detection for a 2000 samples dataset with different size leaks (E c ∈ [0.1, 2.0]) randomly generated in the network nodes. Leak detection is implemented by comparing the sample residuals to a threshold, and the sensor location optimum is determined through a genetic algorithm (GA) [37].

Realistic Sample Simulation-Uncertainty Modeling
Taking advantage of the WDN model used for leak location, a group of train and test synthetic datasets were simulated to evaluate the proposed methodology. The hydraulic simulation software EPANET 2.0 [27] is used. In order to make realistic enough simulations of the network modes, the following modeling considerations were made: • Minimum night flow regime is assumed, ranging from 2 a.m. to 6 a.m. At this time at night, variation in the demand patterns from consumers are very small, therefore, fixed nominal demands can be assumed. The characteristic reduction in demand variation at night time simplifies the location task, increasing its performance; however, this also means that the leak location can only be identified at night. • Sensor sampling time is considered to be 15 min per sample, with 4 samples in one hour. This amounts to a total of 16 samples in a day under minimum night flow regime. An average between hourly samples is then calculated to filter uncertainties in the data, resulting in 4 filtered samples in a day. A leak scenario is defined as a set of 4 filtered samples from a single day, all of which are generated under the same conditions (leak location and size). • The leak size for every scenario is sampled randomly from a uniform distribution within the interval E C ∈ [0.5, 1.0]. This results in leaks that range from 2.6 to 6.3 lps. A timely location and maintenance of leaks of this size can save from 1.1 to 2.6% of the network's total nominal demands during minimum night flow regime. The leak size range selected could be described as matching medium sized leaks. Leaks higher than that interval are relatively easy to diagnose and sometimes result in pipe bursts visible at street level, which are most likely to be diagnosed by consumer reports. Leaks smaller than the selected interval are, however, mostly background leaks, which are often undetectable due to the uncertainties in network demands and measurements. Figure 3 shows the effect of different sized leaks on pressure head values for every node of the network, which are compared to nominal (no leak) patterns; no uncertainties are considered. • In real WDNs, actual node demands differ from modeled nominal demands due to variations in consumer patterns. These demand uncertainties must be taken into account in order to achieve realistic simulations. Therefore, node demands were sampled from the following Gaussian distribution: ℵ ∼ {d n , ψd n }, where d n is the nominal demand and ψ ∈ + is a demand uncertainty coefficient. Several uncertainty coefficients are explored in this work. Demand uncertainty is sampled as a Gaussian distribution centered around the nominal demand because the latter has been estimated as the most probable value. Furthermore, by defining demand uncertainty standard deviation as a factor of the nominal demand, nodes with higher nominal demand values are modeled with higher uncertainty. • In order to guarantee a realistic simulation of the pressure sensors, measurement noise is sampled from the following uniform distribution [−0.025, 0.025] mH 2 O, and stacked additively with the pressure values simulated.
Each test dataset contains 2 scenarios per network node, for a total of 536 scenarios (2144 samples) per dataset. A training dataset was also generated to train the SVM multiclass classifier with an uncertainty coefficient of 0.10. It contains 50 leak samples per network node for a total of 13,400 samples.

Hybrid Methodology Implementation
The procedure to follow in order to implement the proposed methodology is presented below; this procedure can be applied for any WDN that meets the requirements presented in Section 2.2: 1.
The WDN is partitioned into subzones to form classes.

2.
A classifier is trained using the training dataset presented in Section 3.3.1 and the labels generated in step 1.
The number of classes (n z ), and zones, in the network is selected to guarantee a 95% classification accuracy for the multiclass classifier.
The number of dominant sensors n sd is selected from the following values: n sd ∈ {2α}, α ∈ {1, 2, . . . , n s 2 }, aiming to maximize leak location accuracy; if location accuracy values are similar for different dominant sensor values, the number of dominant sensors that achieves the lowest computational cost will be selected. 6.
The stop criteria g max , S L , and ϕ, as well the population size K p , for the TDE algorithm are selected aiming to minimize the computational cost without resulting in loss of accuracy. 7.
Temporal reasoning time horizon T H is selected according to the sampling frequency and the number of samples in a minimum night flow regime scenario. 8.
Neighbor expansion distance λ is selected depending on the desired zone size.
Once online, the TDE algorithm is run 5 times for every leak sample, and the candidate solution with the smallest cost function value is selected as the estimated leak node. These multiple runs of the algorithm aim to avoid engaging potential local optima caused by the random nature of the initialization step.
Parameter search for the Modena WDN case study resulted in the following values:

Comparing Leak Location Strategies
The results achieved by the proposed hybrid method for the 5 test datasets are compared to the results achieved by the TDE and SVM multiclass classifier approaches applied independently as standalone methodologies. The standalone TDE approach is applied without topological searchspace reduction, considering Z C = [1, N T ] as the complete network for every sample. In this case, all the sensors installed are considered as dominant sensors. Hyperparameters and stop criteria for the standalone TDE approach are the same as the ones selected for the hybrid approach. For the standalone multiclass classifier approach, Z C is considered as the final estimated leak zone. In this case, n z = 35 is selected such that the average zone size is similar to the one obtained with the hybrid approach. Hyperparameters are identical to the ones selected for the hybrid approach.

Results and Discussion
A comparison between the hybrid strategy presented and the two independent methodologies is developed according the performance measures defined in Section 2.5. Table 1 and Figure 5 present a location accuracy (ACC) comparison between the three methodologies tested under different demand uncertainty percentages (100ψ) with the five test datasets.  The hybrid approach outperforms the other two methodologies for higher uncertainty values, with a nearly 5% higher location accuracy when the demand uncertainty coefficient is ψ = 0.15. However, for the lower uncertainty levels, its performance is similar to the standalone TDE approach, reaching both 93-94% leak location accuracy when the demand uncertainty coefficient is ψ = 0.05. The standalone classifier presents the overall lowest performance; however, it deteriorates the least as uncertainty increases, being the most robust to demand uncertainty among the three. It is this robustness in the first stage, combined with the non-dominant sensor measurements deprecation, what makes the hybrid approach more robust, and overall more accurate, than the standalone TDE approach.
All three methodologies are compared regarding the size of the estimated leak location zones, considering size in number of nodes ( Figure 6) and total pipe distance (Figure 7). A smaller estimated leak zone according to both indexes is desirable, and the best zone size results are highlighted in Tables 2 and 3. Both zone size indexes present the standalone classifier generating estimated leak zones nearly two times bigger than the other two methods. Wilcoxon's signed ranked test [38], with a 95% confidence interval, was applied to compare zone size, in number of nodes, between the hybrid and the standalone TDE approaches. This test showed that there were significant statistical differences between both methods for every uncertainty value, with the hybrid approach generating smaller zones. The same statistical test was carried out between the same two methodologies comparing zone size in total pipe distance instead. In this case, the hybrid approach showed significantly smaller zones for all uncertainty values, except for the 0.05 case, for which no statistically significant difference was found. Overall, these results assert the higher performance of the hybrid approach over the standalone TDE and standalone multiclass classifier, achieving higher accuracy values while estimating smaller zones.   Finally, computational cost (CC) is compared between the hybrid and the standalone TDE approach in the form of number of generations (g last ). The computational cost of the classifier approach (both in its standalone version and within the hybrid methodology) is not analyzed since it is comparatively unimpactful. Results are presented in Table 4 and    As expected, by reducing the search space of the inverse problem, the optimization algorithm reaches a solution much faster. The mean hybrid approach TDE run is at least over 30% faster than that of the standalone TDE approach, being as much as 63% faster when the uncertainty coefficient is ψ = 0.05. The difference is higher for low uncertainty values and it reaches a plateau as uncertainty increases. The main reason behind this behaviour is the fact that the cost tolerance stop criterion ϕ is selected aiming to guarantee maximum performance for the lowest uncertainty (ψ = 0.05) case. As uncertainty increases, so do variations from the nominal pressure values, increasing cost function values in general. This causes the location of higher uncertainty samples to rarely stop due to the cost tolerance criterion, tilting towards a generation-oriented stop criterion such as stagnation (S L ) and max generations (g max ).
A topological representation of the results obtained for two correctly located leak scenarios is presented in Figure 9. In Figure 9a, the actual leak node is located within the estimated leak zone Z ngh ; furthermore, it is located within the leak nodes estimated after applying temporal reasoning to all the samples in the scenario. In Figure 9b however, temporal reasoning would not have sufficed to identify the actual leak node; in this scenario, neighbor expansion guarantees that the actual leak node is inside the estimated zone.

Conclusions
This study presents a hybrid methodology for leak zone location in WDNs by combining a multiclass SVM classifier and an inverse problem solved using an optimization algorithm called Topological Differential Evolution (TDE). The classifier is used to reduce the search space of the inverse problem, aiming to increase its performance and reduce its computational cost. From the reduced search space, a set of dominant sensors is selected for leak localization, reducing unwanted uncertainty in measurements. Temporal reasoning is applied to both the classifier and the TDE, and the estimated leak location is extended to the nearby nodes. In order to evaluate the performance of the proposed strategy, 5 datasets with different demand uncertainty values are tested. The hybrid method is compared to two other methodologies: a standalone SVM classifier and a TDE approach without previous topological search space reduction.
The hybrid approach presented similar location accuracy to the standalone TDE approach for low uncertainty values; however, for higher uncertainty levels, the hybrid approach outperformed the standalone TDE, proving to be more robust with respect to demand uncertainty. The standalone multiclass classifier significantly underperformed in all cases. Estimated zone sizes, measured in number of nodes and total pipe distance, were smaller for the hybrid approach overall in comparison to the other two methods. Regarding computational cost, the search space restriction resulted in a reduction of 30% for high uncertainties and over 50% for low uncertainties. Overall, the hybrid approach demonstrated to be more robust, while reducing the time it takes to locate the leak.
An unintrusive methodology has been proposed with relatively accessible implementation requirements. As long as these requirements have been met, the proposed methodology can be applied to any WDN with a calibrated hydraulic model and it is expected to yield similar results for other medium-to-large WDNs that are meshed in a similar way to the Modena network. Historical data of the network behaviour is not necessary for the implementation of the strategy, however, it can be used if available. Moreover, no specific number of installed sensors is required, and both flow and pressure sensor measurements can be considered simultaneously.
The main limitation to the proposed methodology is the required availability, and proper calibration, of a WDN hydraulic model. Furthermore, the restriction of the leak location task to night hours could result in moderate losses.
Future works will explore different data-driven methodologies in order to improve the performance of the first stage of the hybrid strategy and reduce the topological search space even further; and a study of the performance of the proposed leak location methodology under daytime demand patterns will be performed.

Initialization
In the initialization step, an initial population P 0 with K p candidate solutions is produced by randomly sampling from the search space. The mutation F ∈ (0, 2] and crossover coefficients C r ∈ (0, 1) are also defined at this stage [39]. Initialization is carried out only once for generation 0, while the other three operators are applied to every generation.

Mutation
The mutation operator generates a mutated candidate solutionx The three operators are applied to every subsequent generation until any of the following stopping criteria are met: • Maximum number of generations: g = g max . • Stagnation: the best candidate (the candidate with the smallest value of the cost function f n ) x g best in the population has remained unchanged for S L generations. • Tolerance: the smallest cost function value in the population f n best is smaller than a threshold ϕ.
The solution for the optimization problem presented in (7) is then selected as the best member of the last generation x g last best . The leak location is then estimated as node ω g last best .