Burst Detection in Water Distribution Systems: The Issue of Dataset Collection

Abstract: Developing data-driven models for burst detection is currently a demanding challenge for the efficient and sustainable management of water supply systems. The main limit to the progress of these models lies in the large amount of accurate data they require. This work presents a methodology for generating reliable water consumption data, which are fundamental to train anomaly detection models and set alarms. The procedure combines stochastic modelling of the water request with hydraulic simulation of pipe bursts to yield suitable synthetic time series of flow rates, for instance, the inlet flows of district metered areas and small water supply systems. The water request is obtained by superimposing different components, namely the daily, weekly, and yearly trends, together with a normally distributed random component based on the consumption mean and variance and on the number of aggregated users. The resulting request is implemented in the hydraulic model of the distribution system, which also embeds background leaks and bursts using a pressure-driven approach with both concentrated and distributed demand schemes. This work seeks to close the gap in the synthetic generation of drinking water consumption data by establishing a dedicated methodology that aims to support future smart water grids.


Introduction
Water distribution infrastructures, along with energy grids, have undergone an important renewal process in the last few years that aims to transform the current distribution networks into smart grids [1,2]. The reasons for this change lie in the need for water distribution systems (WDS, hereafter) to tackle water scarcity caused by climate change and to cope with the increasing water requests due to population growth in urban areas [3]. To this end, smart water networks comprise smart sensors (e.g., flow, pressure and quality meters, and noise loggers) and communication and data storage devices, together with data management and analysis routines [4]. Despite the potential benefits of smart systems [2], above all the enhancement of WDS management, properly handling the data stream through the different components remains a real challenge [5].
Smart WDS enable better control of water leakages, a demanding task that is well known in the literature. Water leaks comprise background losses and pipe bursts [6,7]. The former provide suitable data for the development of anomaly detection methods because of scarce variation of water demand representation and poor accuracy of bursts modeling. Thus, this work aims to lay the foundations for the synthetic generation of water consumption data in order to support the development of burst detection in WDS. To this end, a methodology for the generation of water consumption time series is presented, which accounts for all their components, i.e., the water request, the background losses, and the bursts, in terms of mean and variation.
For the sake of clarity, the water request is defined in this paper as the water demand required by the aqueduct users. The term demand is used instead to indicate the water demand actually supplied by the water distribution system to the users. In addition, water consumption means the total amount of water delivered by the water supply system including both water demand and water losses along the distribution network.
In summary, the contribution of this study is a reliable methodology that provides suitable water consumption data. The final aim is to support the development of data-driven techniques for burst detection, which need a large amount of accurate data. Specifically, the proposed methodology, hereafter referred to as the stochastic-hydraulic time series generator (SHtsG), relies on two principal phases: the water request modeling and the hydraulic WDS simulation. The first stage is based on a superimposition approach that takes into account the daily, weekly, and seasonal deterministic patterns together with the random component of the variation. Its output is the water request time series of the different groups of WDS districts. The second stage produces hydraulic data on the basis of a distributed fully pressure driven model (DFPDM, hereafter) able to properly simulate the water demand, the background leakages, and the bursts. This model simulates each component of the water request with a proper pressure-demand relationship and a suitable demand scheme: distributed along the pipes for the water demand and the background leakages, and concentrated at the nodes for the bursts. In addition, the DFPDM can randomly generate different types of pipe bursts by varying their position along the network, the amount of water lost, and their frequency and duration according to a defined occurrence probability. Therefore, the DFPDM, which allows for extended period simulations, accounts for both the hydraulic and the mechanical reliability of the WDS through a proper simulation of the satisfied water request and of the pipe breaks [27,28]. The results highlight the generation of a suitable hydraulic dataset of a lacking labels time series, which is close to real data in both mean and variance. Ultimately, the SHtsG is able to provide the massive amount of data that is crucial for the development of machine learning algorithms for anomaly detection.
The rest of the paper is organized as follows: Section 2 presents the framework of the proposed methodology; Section 3 describes two test cases aimed at validating the methodology and highlighting its advantages and strengths; and Section 4 draws conclusions and final remarks.

Methodology
The proposed stochastic-hydraulic methodology, called SHtsG, consists of two phases: the first refers to the stochastic generation of a water request time series and the second regards the WDS simulation. Specifically, the stochastic part superimposes daily, weekly, and yearly water request patterns jointly with a random component in order to produce suitable water request time series. These are subsequently employed in the second part, where the hydraulic solver computes a WDN water consumption time series through an extended period simulation.
For the sake of clarity, Figure 1 illustrates the SHtsG. Specifically, the first part provides the water consumption time series for each district/group of users of the WDS (district is defined as a group of neighboring users with uniform characteristics) by means of a superimposition of deterministic and random trends. This SHtsG stage contemplates multiple time series according to different district typologies as well as various user aggregation. Given the water request distribution along the network, the WDS hydraulic model is built in the second part of the SHtsG. The hydraulic model exploits a fully pressure driven approach, called DFPDM, where water demands, background leakages, and water bursts are properly embedded in order to ensure extended period simulations accuracy and fidelity.
The final output is the consumption time series of the WDS inlet flow rate (or district metered areas flow rates) considering both hydraulic and mechanical reliability.

Water Request Stochastic Modeling
The first phase of the SHtsG models the water request by superimposing different trends, as follows:

$$D_{ut} = \alpha_{d,ut}\,\alpha_{w,ut}\,\alpha_{y,ut}\,D_{uy} + \varepsilon_{ut} \qquad (1)$$

where the subscripts $u$ and $t$ represent the users' group/district and the time coordinate of the water request, respectively, $D_{ut}$ denotes the water request of the group of users $u$ at the time $t$, and $D_{uy}$ is the annual average water request of group $u$. The daily water request pattern is $\alpha_{d,ut} = D_{uh}/D_{ud}$, which represents the dimensionless daily behavior of the water request at an hourly time step, expressed as the hourly average request ($D_{uh}$) over the daily average request ($D_{ud}$). Likewise, the weekly and yearly dimensionless patterns are the ratio between the daily average request and the monthly average request ($\alpha_{w,ut} = D_{ud}/D_{um}$), and between the monthly average request and the yearly average request ($\alpha_{y,ut} = D_{um}/D_{uy}$), respectively. Finally, $\varepsilon_{ut}$ represents the random component of the water request of the group of users $u$ at the time $t$. Although $\varepsilon_{ut}$ can be negative, the water request $D_{ut}$ in Equation (1), which is the sum of a real non-negative term and another real term, has to be positive, since a water request cannot be negative by definition. The water request in Equation (1) thus consists of two parts: the first addend collects the deterministic terms and the second one is the random term. This formulation allows for properly modeling the water request of different user aggregations, thanks to the different deterministic temporal trends and to the random component.
On the one hand, the deterministic component uniquely defines the water request by merging the different dimensionless trends, i.e., $\alpha_{d,ut}$, $\alpha_{w,ut}$, and $\alpha_{y,ut}$, with the yearly average request of each analyzed group of users. On the other hand, the random component adds variability to the deterministic request, making the resulting time series more realistic. This component is calculated on the basis of the number of aggregated users and of the first and second order moments of the water request itself. Given that the number of aggregated users and the water request mean are known (deterministic part of Equation (1)), the second order moment is defined through the approach of Gargano et al. [29,30] (Equation (2)), in which CV is the coefficient of variation, defined as the ratio of the standard deviation to the mean ($\sigma/\mu$), and $N_u$ is the number of aggregated users. Equation (2) is valid for $N_u$ between 200 and 1250 users. Once the water request variation is defined, the random component of each group of users at each time step ($\varepsilon_{ut}$) is determined by random sampling from the normal distribution $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ represent the time step average request $D_{ut}$ and its variance, respectively. The variance is derived from Equation (2) as $\sigma^2 = (CV_{ut} D_{ut})^2$. The water requests of the different districts/groups of users $u$ are implemented in the hydraulic model by distributing them uniformly along the pipes [31] belonging to the corresponding district area. The uniformly distributed requests are defined as the water request per unit length (Equation (3)), where $L_{ij}$ denotes the length of the $ij$-th pipe belonging to group $u$. The water request is therefore uniform along each pipe of the $u$-th group.
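The request generation described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pattern values, and the CV value are hypothetical, the coefficient of variation is passed in as a plain number because the formula of Equation (2) is not reproduced here, the random addend is assumed to be zero-mean, and negative samples are simply clipped to zero as one possible way to keep the request non-negative.

```python
import random


def generate_request(d_uy, alpha_d, alpha_w, alpha_y, cv, rng):
    """Return the request series D_ut of one user group (same unit as d_uy)."""
    series = []
    for a_d, a_w, a_y in zip(alpha_d, alpha_w, alpha_y):
        mean = a_d * a_w * a_y * d_uy        # deterministic addend of Equation (1)
        sigma = cv * mean                    # from sigma^2 = (CV * D_ut)^2
        eps = rng.gauss(0.0, sigma)          # random addend (assumed zero-mean)
        series.append(max(mean + eps, 0.0))  # assumption: clip negatives to zero
    return series


def request_per_unit_length(d_ut, pipe_lengths):
    """Spread the group request uniformly over the pipes of the group."""
    return d_ut / sum(pipe_lengths)


rng = random.Random(42)           # fixed seed for repeatability
alpha_d = [0.6, 0.8, 1.4, 1.2]    # hypothetical daily multipliers
ones = [1.0] * len(alpha_d)       # flat weekly and yearly patterns
requests = generate_request(14.0, alpha_d, ones, ones, cv=0.1, rng=rng)
```

Setting `cv=0.0` returns the purely deterministic trend, which is a convenient way to check the pattern inputs before adding the random component.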

Water Consumption Hydraulic Simulation
The second phase of the SHtsG aims to provide a synthetic consumption time series close to real ones, thanks to both a suitable calculation of the satisfied water request, called demand, and a proper modeling of the water losses with the DFPDM. The water consumption of each group of users reads as:

$$Wt_{ut} = Wd_{ut} + Wl_{ut} + Wb_{ut} \qquad (4)$$

where $Wt_{ut}$ is the water consumption of the group $u$, while $Wd_{ut}$, $Wl_{ut}$, and $Wb_{ut}$ are the water demand, the background losses, and the water bursts, respectively. The water consumption is represented through two schemes: consumption distributed along the pipes and demand concentrated at the nodes (for details, see [32]). The first is used for implementing the water demand and the background leakages, as follows:

$$p_{ij} = pd_{ij} + pl_{ij} \qquad (5)$$

where $ij$ is the index of the pipe along which the withdrawals are spread, $p_{ij}$ denotes the part of the consumption assumed to be distributed along the $ij$-th pipe, and $pd_{ij}$ and $pl_{ij}$ are the distributed water demand and background leakages of the $ij$-th pipe. These variables are expressed as flow rate per unit length and are properties of the pipes. In contrast, the second scheme deals with the withdrawals lumped at the $i$-th node, and reads as:

$$q_i = qb_i \qquad (6)$$

where $q_i$ is the component of the water consumption concentrated at the nodes, which consists only of the water bursts $qb_i$. These are measured as flow rates drawn at the nodes of the network. The DFPDM is based on the mass balance at the node $\hat{i}$ (Equation (7)), where $i$ and $j$ are the end nodes of the pipe, $x$ is the pipe longitudinal coordinate, $\hat{i}$ is the node considered for the mass balance, $h$ is the hydraulic head, and $Q_{ij}$ is the flow rate at the node $i$; $p_{ij}$ and $q_i$, defined above, denote the water consumption assumed as distributed and concentrated, respectively.
The head loss equation for the $ij$-th pipe, expressed according to the Darcy-Weisbach formula, is given in Equation (8), where $h_i$ and $h_j$ are the hydraulic heads at nodes $i$ and $j$, and $r_{ij}$ denotes the hydraulic resistance per unit length of the $ij$-th pipe. Equation (8) is the generic formulation of the head loss along a pipe and covers all the possible cases. For example, Equation (8) takes a first simplified form when the distributed water consumption along the pipe is null (Equation (9)), and a second one when the pipe water consumption is different from zero under demand driven hydraulic conditions (Equation (10)). In the latter case, the final water demand is equal to the water request ($p_{ij} = d_{ij}$), given that the pressure required for regular operation is available. The last possible case regards pipes working in pressure driven conditions with distributed water consumption. The distributed consumption $p_{ij}$ is then a function of the hydraulic head along the $ij$-th pipe and can be approximated by a second order polynomial function (Equation (11)), where $w_{ij,1}$, $w_{ij,2}$, and $w_{ij,3}$ are the three coefficients defined by means of the pointwise pressure-consumption relationship evaluated at the start, middle, and end points of the pipe. In this work, both the three coefficients of $p_{ij}(h(x))$ and the nodal consumption $q_{ij}(h(x))$ are evaluated according to Siew and Tanyimboh [33]. The solver used for the hydraulic simulations is based on the numerical scheme of Menapace and Avesani [34]. This hydraulic solver can simulate water withdrawals both distributed along the pipes and lumped at the nodes with a pressure-driven approach using the GGA method [35].
Having presented the equations underlying the DFPDM, the modeling of the different components of the water consumption is now specified. Firstly, the water demand is modeled with the pressure-demand relationship [33] and the distributed along pipe demand scheme [34]. The accuracy of this approach, which has been proved in [36,37], comes from spreading the withdrawals uniformly along each pipe of the network as a function of the pressure along it (Equation (12)). Under normal operating pressure conditions, the water demand $pd_{ij}$ of the $ij$-th pipe is equal to the water request $d_{ij}$, which is therefore fully satisfied. On the other hand, Equation (12) shows that the water request is only partially satisfied under low pressure conditions, to an extent that depends on the pressure drop along the $ij$-th pipe.
Secondly, the background leakages also adopt the distributed scheme, which is well suited to small and diffuse water losses. This type of loss is represented as a function of the pipe average pressure (Equation (13)), where $\beta_1$ and $\beta_2$ are the formula parameters representing the magnitude and the power, respectively, while $\hat{h}_{ij}$ is the average pressure of the $ij$-th pipe, calculated as the difference between the hydraulic head $h_{ij}$ and the pipe elevation $z_{ij}$. The background losses are usually set by means of a calibration procedure in which an optimization algorithm adjusts some parameters of the hydraulic model, e.g., pipe roughness, water demand, and losses. Thirdly, the water bursts are implemented in the model as withdrawals lumped at the nodes, with different formulations depending on the type of pipe break. Three types of pipe breaks are considered: longitudinal, circumferential, and spiral cracks. According to [38-41], a single general formula describes the three types of burst at the $i$-th node (Equation (14)), where $C_d$ denotes the coefficient of discharge, $g$ is the gravitational acceleration, and $\hat{h}$ is the pressure at the $i$-th node, while $A_{0,i}$ and $m_i$ define the initial leak area and the head-area slope of the pipe where the break is located (node $i$), respectively. The head-area slopes $m_i$ of the three types of cracks are defined according to [41] through Equations (15)-(17) for the longitudinal, circumferential, and spiral cracks, respectively, where $d$ is the pipe diameter, $L_c$ is the crack length, $\sigma_l$ is the longitudinal stress, $\rho$ is the water density, $E$ is the modulus of elasticity, and $t$ is the pipe thickness. The water bursts are therefore functions of the type of crack, the pipe characteristics (e.g., material), and the hydraulic pressure. This detailed burst implementation allows for generating realistic local water losses with a wide variety of types and magnitudes.
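As a hedged illustration of the two leakage terms just described, the sketch below assumes that the background losses follow a power law in the average pipe pressure (magnitude `beta1`, power `beta2`) and that the burst outflow follows the variable-area orifice form commonly associated with the head-area slope concept, i.e., $C_d \sqrt{2g}\,(A_0 h^{0.5} + m h^{1.5})$. Both forms are assumptions inferred from the symbol definitions in the text, the function names and numerical values are hypothetical, and the head-area slope is taken as an input rather than computed from Equations (15)-(17).

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def background_leak(beta1, beta2, mean_pressure):
    """Assumed power law: leakage per unit length from the pipe average pressure."""
    return beta1 * max(mean_pressure, 0.0) ** beta2


def burst_outflow(cd, a0, m, pressure):
    """Assumed variable-area orifice law for the outflow lumped at a node.

    cd: discharge coefficient, a0: initial leak area (m^2),
    m: head-area slope (m), pressure: pressure head at the node (m).
    """
    h = max(pressure, 0.0)  # no outflow for non-positive pressure
    return cd * math.sqrt(2.0 * G) * (a0 * h ** 0.5 + m * h ** 1.5)


# Hypothetical crack: 1 cm^2 initial area that widens slightly with pressure.
q_low = burst_outflow(cd=0.6, a0=1e-4, m=1e-6, pressure=15.0)
q_high = burst_outflow(cd=0.6, a0=1e-4, m=1e-6, pressure=30.0)
```

With `m = 0` the burst term reduces to a fixed-area orifice, so the slope term is what makes the outflow grow faster than the square root of the pressure.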
A burst generator is coupled with the hydraulic DFPDM to support the random generation of pipe breaks, making it possible to generate three types of bursts, each with its own characteristics: probability of failure, location, outflow, and duration. First of all, the probability of failure of an individual pipe at each time step is defined using the Poisson probability distribution [42] (Equation (18)), where $L_{ij}$ and $br_{ij}$ represent the length (as mentioned above) and the break rate of the $ij$-th pipe. The latter is defined by adopting the Walski and Pelliccia formulation [43] (Equation (19)), where $c_1$ and $c_2$ are two correction factors depending, respectively, on the pipe material and its state (presence of previous breaks), and on the pipe diameter; $y$ is the current year and $k$ is the installation year of the pipe, while $a$ and $b$ represent the regression coefficients [43] depending on the pipe material. When a burst is generated in a pipe according to Equations (18) and (19), the local water losses are lumped at one of the end nodes of the selected pipe. Thereafter, the amount of water lost is chosen by randomly selecting a pair of geometric parameters $L_c$ and $A_0$ that generates an outflow between a minimum and a maximum flow rate threshold. Finally, the burst duration is fixed according to the magnitude of the flow rate loss. The realistic burst modeling and the randomness introduced by the burst generator allow for producing time series suitable for the training of anomaly detection algorithms.
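The occurrence step of the burst generator can be sketched as follows. The code assumes an exponential ageing law of the form $c_1 c_2\, a\, e^{b(y-k)}$ for the break rate and a Poisson probability of at least one break, $1 - e^{-br \cdot L \cdot \Delta t}$, for each pipe and time step. These forms, the units (breaks per unit length per year, time step in years), the value `c2 = 1.0`, and all names are assumptions made for illustration; they are not taken from Equations (18) and (19) themselves.

```python
import math
import random


def break_rate(c1, c2, a, b, current_year, install_year):
    """Assumed exponential ageing law: c1 * c2 * a * exp(b * age)."""
    return c1 * c2 * a * math.exp(b * (current_year - install_year))


def failure_probability(rate, length, dt):
    """Poisson probability of at least one break on a pipe during dt."""
    return 1.0 - math.exp(-rate * length * dt)


def draw_bursts(pipes, current_year, dt, rng):
    """Return the ids of the pipes that break in the current time step."""
    broken = []
    for pipe in pipes:
        rate = break_rate(pipe["c1"], pipe["c2"], pipe["a"], pipe["b"],
                          current_year, pipe["install_year"])
        if rng.random() < failure_probability(rate, pipe["length"], dt):
            broken.append(pipe["id"])
    return broken


# Hypothetical 60-year-old pipe; c1, a and b are the values quoted in Section 3.
pipes = [{"id": "p1", "length": 0.5, "install_year": 1960,
          "c1": 7.364, "c2": 1.0, "a": 0.02577, "b": 0.0207}]
one_hour = 1.0 / 8760.0  # time step expressed in years
print(draw_bursts(pipes, 2020, one_hour, random.Random(1)))
```

Calling `draw_bursts` once per simulated hour yields the list of newly broken pipes, to which a magnitude and a duration are then assigned.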

Application
The aim of Section 3 is to test the performance of the SHtsG. To guide the reader through the proposed methodology, a summary of all the steps involved is given here. The first step regards the water request generation: a suitable request time series is generated according to the formulation proposed in Section 2.1, which involves modelling both the deterministic and the random components through Equation (1). The second step concerns the water consumption generation, including the water leakages. This means that the previously generated water requests are used as input to the hydraulic simulations based on the DFPDM. Specifically, the background losses in Equation (13) are defined through a calibration procedure, given pressure and flow rate measurements at some points of the network. The bursts are instead generated randomly according to the Poisson distribution in Equation (18), with a random crack shape selected among the three formulations proposed in Section 2.2. In the following two applications, the entire methodology is illustrated in detail.

Apulian
The first application used to test the performance of the SHtsG is the Apulian WDN, a well-known test case in the literature. This WDS has been selected because of its low operating pressure, which ensures pressure driven conditions, and because of its simple layout, which helps to expose the weaknesses and strengths of the SHtsG. Figure 2 shows the network layout, which consists of 34 pipes and 23 nodes. More details about the Apulian network can be found in [31]. In addition, the Apulian network is divided into five districts to highlight the capability of the SHtsG to generate variable water requests for different parts of the network. In order to properly model the WDS, the total water request has been reduced as in [44], so that the yearly average total user request amounts to 70 L/s. This water request has been distributed along the pipes of the network, and the Apulian WDS has then been simulated with a fully pressure driven approach according to Section 2.2. Modeling the stochastic behavior of the water requests in time is nonetheless crucial for performing a realistic extended period simulation. Hence, the water request has been generated according to Section 2.1 by modelling the deterministic and the random components for each pipe of the network over a period of 4 years. These components are reported for a single month in Figure 3.
Starting from the top, the first plot in Figure 3 represents the dimensionless daily behaviour. To model the specific water behaviour of each district, five different daily pattern time series have been generated and assigned to the corresponding districts of the network (see the districts in Figure 2). Each district daily time series has been generated by randomly selecting different pattern combinations according to the main characteristics of the district water request, e.g., number of peaks (two or three), main peak (morning, midday, or evening), and occurrence time. The use of different district daily patterns helps to increase the variance of the final WDS water request, making it more realistic [45]. The second plot in Figure 3 represents the weekly pattern, and the third plot the monthly one, which is constant in the plot since only one month is displayed. The yearly pattern adopted corresponds to that of a typical seaside tourist region, characterized by a high water request during the summer (for more details, see [44]). The fourth plot instead represents the dimensional random component of the water request (L/s), which is specific to each user aggregation (i.e., the water request is aggregated for each pipe of the WDS). Since the Apulian network has 34 pipes, the random components have been calculated for the distributed water request of each of them. In this way, the random nature of the request is guaranteed together with the temporal and spatial variability of the request time series. Finally, the fifth plot shows the total water request of the entire network in L/s. With the water request calculated in the first phase of the SHtsG, the second phase is now described. The water request is the input of the DFPDM which, together with the bursts and the background losses, allows for producing a realistic time series of the water consumption. The total WDS consumption, split into its components, is reported in Figure 4.
The first step of this second phase is the calibration of the background losses. In this work, the average background leakage has been assumed to correspond to 40% of the yearly average request. These losses have been uniformly spread over the network by assigning the same amount of water loss per unit length to each pipe. Since background leakages are present along the entire network, it is interesting to note that they follow the average pressure of the network. Once the background losses had been modeled, the burst generator randomly introduced local pipe breaks along the network, and the WDS was then simulated with the DFPDM, producing the consumption time series displayed in Figure 4. It is worth noting that the types of bursts generated, their characteristics, and their frequency have been set according to the WDS characteristics, following Section 2.2.
To emphasize the robustness of the SHtsG through the generation of a high number of pipe failures, the average age of the pipes has been assumed to be 60 years, and the state of all the pipes of the WDS has been classified as "pipes with one or more previous breaks" [43]. Hence, the $c_1$ coefficient of Equation (19) has been fixed to 7.364. The other coefficients have been fixed by assuming the values for pit cast iron pipes, which are 0.02577 and 0.0207 for $a$ and $b$, respectively. Regarding the magnitude of the bursts, the geometric parameters of the crack in Equations (15)-(17) have been randomly generated so as to keep the lost flow rate between 10 L/s and 70 L/s. In addition, the repair time has been randomly drawn between 12 and 48 h. The detail of a randomly generated burst is given in Figure 5.
Figure 5. Detail of simulated bursts in the Apulian network. Figure 5a shows a burst event with its consumption behaviour, while Figure 5b depicts the water losses (burst and background leaks) and the pressure at each network node, together with the average network pressure.

Figure 5a shows the behavior of the network consumption when a burst is generated. The outflow of a burst depends on two main factors: firstly, the characteristics of the burst itself (for details, see Section 2.2) and, secondly, the pressure where the burst occurs. This local pressure derives from the network capacity in terms of both pressure and flow rate. This dependency is clearly visible in Figure 5b, where the burst is displayed together with the pressure available at each network node over time. In particular, a slight pressure drop is present at the nodes close to the pipe burst during the burst occurrence. Because the outflow fully depends on the local pressure, its behavior follows that of the local pressure. These considerations are fundamental to properly model the consumption time series for anomaly detection purposes. In fact, the pressure dependency of a burst must be modeled to capture the physics of the phenomenon, even more so when the network suffers from a lack of pressure. These remarks are also highlighted in Figure 6.
Figure 6 shows the distribution of the generated consumption for each hour of the day in the Apulian network. In particular, the black boxes represent the quantiles of the hourly water consumption of normal data, i.e., data not affected by bursts, and the black points represent their outliers. The abnormal data are instead marked in red. It is noteworthy that many abnormal data points lie inside the hourly boxes of the normal data during the daytime and, vice versa, several outliers belong to the normal data. This behavior is attributable to the low operating pressure of the WDS: the pressure available in the network is not sufficient to supply both the burst and the consumption. In conclusion, both the satisfied water demand and the outflow of a burst depend on the pressure; specifically, both of them decrease when the pressure drops. Thereby, a burst occurrence causes a decrease in the network pressure and consequently a reduction in the satisfied water request, resulting in only a moderate change in the final WDS flow rate distribution.
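The hourly comparison of normal and abnormal data described above can be reproduced on any labeled series with a short sketch such as the following. It assumes records of the form `(hour, flow, is_burst)` and uses the conventional 1.5 × IQR whisker rule to decide what counts as an outlier. The whisker rule, the record layout, and the function names are assumptions for illustration and are not stated in the text.

```python
from collections import defaultdict
from statistics import quantiles


def hourly_boxes(records):
    """Quartiles (q1, median, q3) of the normal data for each hour of the day."""
    by_hour = defaultdict(list)
    for hour, flow, is_burst in records:
        if not is_burst:
            by_hour[hour].append(flow)
    return {h: tuple(quantiles(v, n=4)) for h, v in by_hour.items() if len(v) >= 2}


def hidden_bursts(records, boxes, k=1.5):
    """Burst samples that fall inside the whiskers of the normal data."""
    hidden = []
    for hour, flow, is_burst in records:
        if is_burst and hour in boxes:
            q1, _, q3 = boxes[hour]
            iqr = q3 - q1
            if q1 - k * iqr <= flow <= q3 + k * iqr:
                hidden.append((hour, flow))
    return hidden


# Hypothetical samples at 3 a.m.: five normal values and two burst values.
records = [(3, f, False) for f in (10.0, 11.0, 12.0, 13.0, 14.0)]
records += [(3, 12.5, True), (3, 25.0, True)]
boxes = hourly_boxes(records)
print(hidden_bursts(records, boxes))
```

A large share of hidden bursts, as in the Apulian case, indicates that a simple hourly threshold on the flow rate would miss many events.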

Egna
The WDS of Egna has been considered as the second test case to assess the reliability of the proposed data generation methodology. Figure 7 reports the Egna aqueduct, which consists of around 28 km of pipes with diameters between 50 and 150 mm and two tanks that supply water to the WDS. In particular, the network model consists of 169 pipes and 149 junctions. More information about the network can be found in [46]. Unlike the previous test case, this WDN does not exhibit a pressure deficit, and consequently provides further scenarios for evaluating the proposed methodology. In addition, this test case is more interesting because of the availability of a small dataset of measured hourly water consumption for March and April 2019. It is worth pointing out that the water demand always matches the water request because of the high operating pressure in the Egna WDS. As a preliminary step, to generate water consumption time series close to those of the real Egna WDS, the SHtsG has been set up with the March data and then tested with the April data. Firstly, data imputation has been performed by adopting the Kalman filter jointly with an ARIMA model according to [47] to fill 13 missing (NA) values in the dataset. Once the dataset was complete, the March consumption time series has been decomposed into its components as follows:

$$Wt = Wd + Wl + Wb \qquad (20)$$

where $Wt$ represents the monthly average water consumption, $Wd$ is the monthly average water demand, $Wl$ represents the monthly average background losses, and $Wb$ the monthly average bursts. The analysis of the March consumption shows that the minimum night flow (between 2:00 a.m. and 4:00 a.m.) is constant during March. This leads to the assumption that no bursts are present in this period, so that the $Wb$ term in Equation (20) is zero. In addition, the water demand $Wd$ of this month has been calculated from a water request of 123 L/(inhabitant·day) and a number of inhabitants equal to 5290.
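As a worked check of the figures quoted above, the monthly average demand follows from the per-capita request and the population, and the background losses are then the residual of the consumption balance when no bursts are present. The function names and the measured consumption value used below are hypothetical; only the per-capita request and the population come from the text.

```python
SECONDS_PER_DAY = 86_400


def mean_demand_lps(per_capita_l_day, inhabitants):
    """Average demand in L/s from a per-capita request in L/(inhabitant*day)."""
    return per_capita_l_day * inhabitants / SECONDS_PER_DAY


def background_losses(wt, wd, wb=0.0):
    """Residual of the consumption balance: Wl = Wt - Wd - Wb."""
    return wt - wd - wb


wd_march = mean_demand_lps(123, 5290)       # about 7.53 L/s
wt_measured = 12.0                          # hypothetical monthly mean inflow, L/s
wl_march = background_losses(wt_measured, wd_march)
```

The same two functions apply to any other month once its per-capita request and measured mean inflow are known.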
Thus, the average background losses $Wl$ have also been directly calculated through Equation (20). To decompose the signal hour by hour, the behavior of the background losses during the day has been evaluated. The relationship for undetected leaks [48] has been adopted (Equation (21)), where $Wl_t$ is the background losses term at the $t$-th hour, $Wl$ is the average background losses of March, $P_t$ is the average network pressure at the $t$-th hour, and $P$ is the average network pressure in the month of March. The pressure of the network is known from a measurement campaign carried out in the same month. Hence, the behavior of the background losses has been estimated through Equation (21) and the time series has been decomposed. Unlike the March period, the April consumption time series is not stationary during the night. This means that the term $Wb$ is different from zero on some days in April. To decompose the April signal, the background losses have been assumed constant in time, on the assumption that the degradation of the network does not change over a period as short as one month. Thus, the same values of $Q_{l,i}$ of March were also used for April. Moreover, the water demand has been estimated considering a water request per inhabitant for April of 131 L/(inhabitant·day). Hence, the time series has been finally decomposed, and the result is reported in Figure 8. Starting from the decomposed time series, the DFPDM has been set up to properly reproduce the March consumption time series. The first phase consists of calibrating the water request terms of Equation (1). As previously stated, the model has been calibrated on the March data. To this end, the March daily pattern has been analyzed. Firstly, the structure of the daily pattern has been studied.
In particular, the working days, which range from Monday to Friday, always show a three-peak behavior, while Saturdays have three peaks in 60% of the cases and two peaks in the others. By contrast, Sundays always show two peaks. Secondly, an analysis of the peak levels and their occurrence has been performed. On working days, the main peak occurs in the morning in 80% of the cases and in the evening in the remaining 20%. When the main peak is in the morning, it occurs at 7.00 a.m. in 75% of the cases and at 8.00 a.m. in 25% of the cases. The midday peak occurs at 12.00 p.m. in 65% of the cases and at 1.00 p.m. in the remaining 35%. Lastly, the evening peak occurs at 5.00 p.m. in 20% of the cases, at 6.00 p.m. in 30% of the cases, and at 7.00 p.m. in the remaining 50%. Regarding the Saturdays with a three-peak structure, the morning, midday, and evening peaks occur at 9.00 a.m., 12.00 p.m., and 6.00 p.m., respectively. Instead, for Sundays and for Saturdays with a two-peak structure, the first peak always occurs at 9.00 a.m., and the second peak occurs at either 5.00 p.m. or 6.00 p.m., each in 50% of the cases. Therefore, these probabilities have been used to generate daily patterns coherent with the real data analyzed. This generation has been performed for each district of the network shown in Figure 7. Moreover, the water requests have been generated as described in Section 2.1, omitting the weekly and yearly components. The random component has been calculated for each district of Egna according to its user aggregation. Figure 9 depicts the resulting water request.
Figure 9. Water request components for the Egna test case. Starting from the top, the first plot represents the daily pattern, the second the weekly pattern, and the third the yearly one. The fourth plot represents the random component and the fifth the resulting water request.
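The peak probabilities reported above can be turned into a small sampler. The following Python sketch is only an illustration, not the authors' implementation: the function names, the 24 h clock encoding, and the choice to draw the morning peak hour with the 75%/25% split regardless of which peak is the main one are assumptions introduced here.

```python
import random

# Probabilities reported in the text for the Egna March data (hours on a 24 h clock).
P_SATURDAY_THREE_PEAKS = 0.60
MAIN_PEAK = {"morning": 0.80, "evening": 0.20}   # working days only
MORNING_HOUR = {7: 0.75, 8: 0.25}                # assumption: used for every working day
MIDDAY_HOUR = {12: 0.65, 13: 0.35}
EVENING_HOUR = {17: 0.20, 18: 0.30, 19: 0.50}
TWO_PEAK_SECOND_HOUR = {17: 0.50, 18: 0.50}


def draw(table, rng):
    """Draw one key of `table` with probability equal to its value."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys], k=1)[0]


def sample_peak_structure(weekday, rng):
    """Return the peak hours of one day; weekday 0 = Monday, ..., 6 = Sunday."""
    if weekday <= 4:
        # Working days always show three peaks.
        peaks = [draw(MORNING_HOUR, rng), draw(MIDDAY_HOUR, rng), draw(EVENING_HOUR, rng)]
        return {"peaks": peaks, "main": draw(MAIN_PEAK, rng)}
    if weekday == 5 and rng.random() < P_SATURDAY_THREE_PEAKS:
        # Saturdays with a three-peak structure have fixed peak hours.
        return {"peaks": [9, 12, 18], "main": None}
    # Sundays and two-peak Saturdays: first peak at 9, second at 17 or 18.
    return {"peaks": [9, draw(TWO_PEAK_SECOND_HOUR, rng)], "main": None}


rng = random.Random(42)  # fixed seed only to make the example reproducible
week = [sample_peak_structure(day, rng) for day in range(7)]
```

Each entry of `week` gives the peak hours of one day. Turning these hours into a full 24-value daily pattern still requires the peak levels from the calibration, which are not reproduced in this sketch.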
Then, the previously generated request has been used as input to the DFPDM, together with both the bursts and the background losses. Figure 10 shows the consumption generated for March with no bursts and with background losses coherent with the real data.
The model has then been used to generate the consumption for April as well. In particular, 10 different generations of the April consumption have been made to show the reliability of the methodology. Regarding the burst parameters, the geometric parameters $A_0$ and $L_c$ of the cracks in Equations (15)-(17) have been bounded to generate outflow values between 1.5 L/s and 10 L/s. This choice has been made because of the characteristics of the Egna WDS. Regarding the duration, different intervals have been selected depending on the burst magnitude. Specifically, the repair time ranges between 3 and 21 days for burst outflows from 1.5 L/s to 4 L/s, while leaks larger than 4 L/s last from 12 to 72 h. These long repair times are due to the poor operational management of the small municipality analyzed. Moreover, the age of the pipes is known and the $c_1$ coefficient has been fixed to 7.364. The $a$ and $b$ coefficients are set to 0.02577 and 0.0207, respectively.
The resulting data show few outliers among the normal data, while the abnormal data, indicated in red, are more spread out in the outlier zone. This behavior is different from the Apulian test case (see Figure 6). In fact, the Egna network does not suffer from pressure deficit conditions. This means that the available pressure is always sufficient to supply both the users' water request and the bursts. These observations further underline the crucial role played by the hydraulic solver adopted in the simulations. In fact, a consumption time series for burst detection cannot be generated without considering the pressure behavior of the WDS.
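The outflow bounds and repair-time intervals stated above can be sketched as a simple random draw that also yields hourly burst labels. This is a simplified illustration under explicit assumptions: in the methodology the outflow results from the crack geometry through Equations (15)-(17) and the hydraulic simulation, whereas here it is drawn directly; the uniform distributions, the function names, and the 30-day horizon are hypothetical choices made for the example.

```python
import random


def sample_burst(rng):
    """Draw one burst as (outflow in L/s, repair time in hours)."""
    # Assumption: uniform draw within the bounds adopted for the Egna WDS.
    outflow = rng.uniform(1.5, 10.0)
    if outflow <= 4.0:
        # Small bursts (1.5 to 4 L/s): repaired after 3 to 21 days.
        duration_h = rng.uniform(3 * 24, 21 * 24)
    else:
        # Bursts larger than 4 L/s: repaired after 12 to 72 h.
        duration_h = rng.uniform(12, 72)
    return outflow, duration_h


def burst_labels(start_hour, duration_h, n_hours=30 * 24):
    """Hourly 0/1 labels marking when the burst is active within the horizon."""
    end_hour = min(n_hours, start_hour + round(duration_h))
    return [1 if start_hour <= hour < end_hour else 0 for hour in range(n_hours)]


rng = random.Random(7)  # fixed seed only to make the example reproducible
outflow, duration_h = sample_burst(rng)
labels = burst_labels(start_hour=rng.randrange(30 * 24), duration_h=duration_h)
```

The label vector is what makes the synthetic series usable for supervised anomaly detection, since every hour is marked as normal or affected by a burst.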
To show the robustness of the proposed methodology, the original time series and the data generated by SHtsG have been compared by means of Student's t-test. Figure 12 and Table 1 display the corresponding results. The reliability of the consumption time series generated by SHtsG has been tested for both March and April, which are the months used for model training and validation, respectively. The t-test makes it possible to determine whether there is a significant difference between the original and the generated data at each hour of the day. In particular, the null hypothesis states that the mean of the real data is equal to the mean of the data generated by SHtsG. The results of the test are displayed through the bounds in Figure 12, which are defined according to a significance level of 1% and make it easy to identify the rejected hours, i.e., real values outside the limit lines.
Firstly, Figure 12a,b report the results for the March data adopted in the training of the SHtsG. In this month, the null hypothesis is never rejected for the water request, while it is rejected only 5 times for the consumption. This means that the generated time series closely resemble the real data at the 1% significance level. Secondly, Figure 12c,d show the results of the t-test for the April data. It is worth noting that the statistical test has been made between 1 month of real data and the 10 samples of generated April data. Ten samples representing April have been used in order to obtain a more robust evaluation of the SHtsG. With a single sample, the random part of the burst modelling excessively affects the variability of the results, even if the procedure has been properly set up. The test gives positive results even in the severe case of the April validation dataset. Only in a few hours are the two datasets significantly different from each other. This mainly happens during the night, when the difference between March and April is more significant. Table 1 lists the p-values of the t-test applied to the 4 cases described above.
(c) Student's t-test applied to the generated and real request data of Egna in April.
(d) Student's t-test applied to the generated and real consumption data of Egna in April.
Figure 12. Student's t-test between generated and real data of March and April for both water request and consumption.

Table 1. P-values obtained by Student's t-test of the null hypothesis that the mean of the real data is equal to the mean of the data generated by SHtsG for each hour of the day. Specifically, real data and synthetic data are compared for both the water request and the consumption in March (training) and April (validation), separately.

Conclusions
This study proposes a methodology to generate water consumption time series affected by bursts, in order to support the development of anomaly detection models in WDS. The SHtsG deals separately with water request and water consumption generation. The first phase consists of modeling the water requests through a superimposition approach, which considers both the different deterministic trends (e.g., seasonal, weekly, and daily patterns) and the random component that accounts for the variance. The second phase consists of generating the consumption with accurate WDS modeling in which background losses and bursts are properly introduced. The two-phase methodology provides a synthetic hydraulic dataset close to the real data in terms of both mean and variance. Moreover, the generated time series are not affected by the uncertainty and missing data caused by measurement and transmission systems. These synthetic data comprise both normal and abnormal data with precise leak labels, including complete information about the bursts.
Two different applications are presented to validate and highlight the advantages of the SHtsG. The first is a WDS widely used in the literature, the Apulian network. This test case aims to provide a detailed step-by-step explanation of the methodology applied to a WDS operating in scarce pressure conditions. The reliability of the SHtsG is underlined by the resulting time series of the WDS inlet flow rate, which represents four years of total consumption. Moreover, the variance of the consumption is shown by means of the random introduction of different types of properly labelled bursts. It is worth noting how complex it is to distinguish normal data from abnormal data, i.e., data affected by bursts, because of the pressure deficit condition of the analyzed WDS. Therefore, the use of a proper pressure-driven hydraulic solver, which is able to simulate the different water request components (e.g., demand, background leakages, and bursts), is crucial for providing a suitable final dataset.
The second is the Egna network, a small mountain WDS with elevated operating pressures. This test case helps to understand how the SHtsG can be used to enlarge a real dataset while maintaining the characteristics of the original time series. The accuracy of the methodology in reproducing real data is supported by Student's t-test. Moreover, the elevated network pressures accentuate the different behavior of the bursts, which can be distinguished more easily. The SHtsG shows that it is possible to properly reproduce the variability of the real data and to generate a time series coherent with the real WDS behavior. The importance of adopting an accurate hydraulic solver is also highlighted.
To conclude, the presented methodology has shown promising results in generating suitable time series obtained from extended period simulations that include both hydraulic and mechanical WDS reliability. It is noteworthy that SHtsG can produce large amounts of hydraulic data simply by selecting the time series of crucial WDS points, e.g., tank outflows, district inlets, or representative district pressures. Future efforts will involve the direct application of this methodology to the development of anomaly detection algorithms.