Comparison of Statistical and Machine Learning Models for Pipe Failure Modeling in Water Distribution Networks

The application of statistical and Machine Learning models plays a critical role in planning and decision support processes for efficient and reliable Water Distribution Network (WDN) management. Failure models can provide valuable information for prioritizing system rehabilitation even in data scarcity scenarios, such as developing countries. Few studies have analyzed the performance of more than two models, and examples of case studies in developing countries are insufficient. This study compares various statistical and Machine Learning models to provide useful information to practitioners for the selection of a suitable pipe failure model according to information availability and network characteristics. Three statistical models (i.e., Linear, Poisson, and Evolutionary Polynomial Regressions) were used for failure prediction in groups of pipes. Machine Learning approaches, particularly Gradient-Boosted Tree (GBT), Bayes, Support Vector Machines and Artificial Neuronal Networks (ANNs), were compared in predicting individual pipe failure rates. The proposed approach was applied to a WDN in Bogotá (Colombia). The statistical models showed an acceptable performance (R2 between 0.695 and 0.927), but the Poisson Regression was the most suitable for predicting failures in pipes with lower failure rates. Regarding Machine Learning models, Bayes and ANNs exhibited low performance in the prediction of pipe failure condition. The GBT approach had the best performing classifier.


Introduction
The main purpose of Water Distribution Networks (WDNs) is to supply water to the population in the required quantity and quality [1]. Factors such as climate change, deterioration of system components, uncertainty regarding the physical condition of the pipes, growing water demand, and economic restrictions increase the complexity of their management [2]. A proper strategy for operation, maintenance, and rehabilitation of WDNs needs to be developed to ensure efficient and reliable management. Therefore, improving the efficiency in water supply and leakage reduction in WDNs should be strategic priorities for the sustainable development of water resources [3,4].
Pipe failures in WDNs may cause economic, environmental, and social costs resulting from water supply and traffic interruption, contaminant intrusion through the network, and loss of resources such as water and energy [5,6]. According to the United Nations, water utility assets in developing countries are more likely to be poorly managed due to inappropriate political administration and the increased pressure on the available water resources. Thus, the general lack of preventive maintenance plans has led to low-performing WDNs [7]. In Bogotá, Colombia's capital city, the water loss rate ranges between clustering approach was considered to create pipe groups. ML approaches (i.e., Gradient-Boosted Tree (GBT), Bayes, Support Vector Machines (SVMs), and Artificial Neuronal Networks (ANNs)) were compared in terms of predicting individual pipe failure rates. The pipe attributes, and environmental and operational variables were included as input variables. The proposed approach was applied to a WDN in Bogotá (Colombia).

Statistical Models
Three statistical models, including Linear Regression, Poisson Regression, and EPR are used to estimate the number of expected failures in pipe groups. These models are selected because they produce explicit polynomial expressions, which provide a high level of correlation between input variables and the dependent variable [11,14]. Linear Regression is an extension of regression analysis that includes independent variables as explanatory in a predictive equation. In the linear regression model, the value of the dependent variable ranges at a constant rate as the value of the independent variable increases or decreases. Thus, the equation of a straight line exhibits the relationship between the true value of Y and X i , as shown in Equation (1) [24].
where Y i is the dependent variable, X j are the independent variables, β j are the coefficients to be estimated, and i is the error term that represents the deviation of the conditional mean to the observation. Poisson Regression is a count data model which describes the number of failures for a given time and can consider the non-negativity integer nature of the dependent variable [25]. It is assumed that the failures at a year (t) for a pipe i follow a Poisson distribution with mean intensity λ i,t . The probability of having k i,t failures can be estimated as follows [26].
where λ i,t = exp (β o +β 1 X 1 + . . . + β j X j , X j are the independent variables, β j are the coefficients to be estimated, and k i,t is the number of failure events. EPR is a hybrid regression method that combines conventional regression techniques and genetic programming, producing a range of equations in trade-off between the number of polynomial terms and accuracy [14]. EPR consists of two main stages: (1) the exploration of the best model structure using a multiobjective genetic algorithm and (2) the estimation for parameters for an assumed model structure using the least-squares method [5]. The general form of the ERP model is expressed in Equation (3) [27].
where y is the dependent variable, X k are the independent variables, a j are the parameters to be estimated, a o is an optional bias, F is the function constructed by the EPR model, f is the function selected by the user, and m is the maximum number of polynomial terms. The pipes' data is processed by removing attributes considered irrelevant to the prediction task (e.g., pipe ID) and those with missing values (e.g., pipe depth). The K-means clustering approach is applied to create pipe groups with similar characteristics. Following this approach, K clusters are created assigning n data samples to them. The required inputs are the data samples, the number of k clusters, and the stopping condition. The formulation of the clusters is based on the principle of maximizing intracluster similarity and minimizing intercluster similarity [14]. Therefore, the algorithm starts from an initial distribution of clusters' centroids, such that they are as distant as possible from each other, and determines which centroid is closest to each data point. All data points are assigned to its nearest cluster. Then, the clusters' centroids are recalculated as the arithmetic mean of all its assigned data points. This process is replicated either until no data points can be assigned to a closer cluster or until the stopping condition is achieved [28].
The Euclidean distance is selected as the objective function of the dissimilarity measure. This function, based on the Euclidean distance between a vector x k in a group G i , and the corresponding cluster center c i , is defined in Equation (4) [29]. Further, the optimal center that minimizes Equation (4) is expressed in Equation (5).
k,x k ∈G i ||x k − c i || 2 is the objective function within group G i and |G i | is the size of G i . Data are grouped using pipe diameter, age, and length based on the premise that pipes with similar characteristics are expected to have the same breakage pattern [13]. Consequently, each pipe takes a number of failures and a length equal to the total lengths and the total number of failures for the individual pipes of the same group. The Davies-Bouldin criterion is used to select the optimal number of clusters (K). This criterion is based on the ratio between the distances within-clusters and between-clusters. The Davies-Bouldin Index (DBI) is a measure that evaluates the separation between the ith and the jth cluster. The DBI has proven a suitable measure to evaluate the clustering performance [30,31]. It can be calculated by means of Equation (6) [32]. According to the Davies-Bouldin criterion, the best clustering partition is the one that minimizes the DBI.
where K is the number of clusters, d i is the average distance between each data point of the ith cluster and the centroid of the respective cluster (similarly to d j ), and d i,j is the Euclidean distance between the centroids of the clusters i and j. Training and test datasets are built randomly. The models are trained on 70% of the available data and tested on the remaining 30%. The K-fold cross-validation technique is used to minimize the risk of overfitting [14]. This technique divides the dataset randomly into k-partitions and, at each step, it uses one partition for testing and the rest for training [33]. The explanatory variables of the models are pipe diameter (in mm), total length (in m), and pipe age (in years), while the dependent variable is the total number of failures (FR). The performance of each model is compared using the coefficient of determination (R 2 ) and the root mean square error (RMSE), defined as follows.
where y p,i = prediction value for the sample i, y o = mean value of measurements, y o,i = measurement value for the sample i, y p = mean value of predictions, and n = number of data samples.

Machine Learning Models
ML approaches, namely GBT, Bayes, SVMs, and ANNs, are compared in predicting individual pipe condition. ML techniques are classified into two main categories: supervised learning and nonsupervised learning. The selected models are categorized as supervised learning. Within such learning, the ML algorithm receives the inputs and intends to converge to the best classifier. These methods can learn the patterns of the underlying process from past data and generalize the relationships between input and output data to predict or estimate an output given a new set of input variables [34]. Inputs capture the application of concepts, instances, and variables. An instance then is an independent example of the concept, and comprises a set of variables.
GBT is a forward-learning ensemble method that obtains predictive results through gradually improved estimations, which combines the performance of many weak classifiers from previous iterations to produce a powerful one [35]. After each boosting iteration, misclassified data have their weights increased and, for correctly classified data, their weights decreased. The output of the GBT model can be written as shown below [36].
where j is the size of the tree, R j is the region associated with the jth leaf after space partition imposed by the tree,ŷ j is the label associated with R j , χ is the characteristic function, and Θ, consisting of (ŷ j , R j ), are the estimated parameters during training. Bayes is a graphic approach that represents a probabilistic relationship between a set of variables utilized to forecast the behavior of a system based on an observed process [37,38]. This model is composed of (a) a set of variables and links between the variables, (b) a set of states for each variable, and (c) an assigned conditional probability for each variable. This model has the advantage of demanding less estimated parameters. Given a scenario comprising A i (i = 1, 2, . . . , n) independent variables and an observed data Y, the Bayes formula can be written as shown in Equation (10) [38].
where P(A i |Y ) is the posterior occurrence of the probability of A given the condition that Y occurs, p(A i is the prior occurrence of the probability A i , and p Y|A j the conditional occurrence probability of Y given that A occurs. In addition, SVM is a supervised learning technique based on the principle of optimal separation classes. The SVM method builds a linear model called maximum margin hyperplane, which provides the greatest separation between instances with different values of the dependent variable [39]. Datasets containing instances that cannot be separated with a straight line are projected into a higher-dimensional space through a kernel function. Common kernels are the polynomial, hyperbolic tangent, and radial basis functions. The SVM model is presented in Equations (11)-(13) [22].
Subject to the constrains : y i w T , learning, the ML algorithm receives the inputs and intends to converge to the best classifier. These methods can learn the patterns of the underlying process from past data and generalize the relationships between input and output data to predict or estimate an output given a new set of input variables [34]. Inputs capture the application of concepts, instances, and variables. An instance then is an independent example of the concept, and comprises a set of variables.
GBT is a forward-learning ensemble method that obtains predictive results through gradually improved estimations, which combines the performance of many weak classifiers from previous iterations to produce a powerful one [35]. After each boosting iteration, misclassified data have their weights increased and, for correctly classified data, their weights decreased. The output of the GBT model can be written as shown below [36].
where j is the size of the tree, R j is the region associated with the jth leaf after space partition imposed by the tree, y j is the label associated with R j , χ is the characteristic function, and Θ, consisting of (y j , R j ), are the estimated parameters during training.
Bayes is a graphic approach that represents a probabilistic relationship between a set of variables utilized to forecast the behavior of a system based on an observed process [37,38]. This model is composed of (a) a set of variables and links between the variables, (b) a set of states for each variable, and (c) an assigned conditional probability for each variable. This model has the advantage of demanding less estimated parameters. Given a scenario comprising Ai (i = 1, 2, …, n) independent variables and an observed data Y, the Bayes formula can be written as shown in Equation (10) [38].
where P A i |Y is the posterior occurrence of the probability of A given the condition that Y occurs, p(A i ) is the prior occurrence of the probability A i , and p Y A j the conditional occurrence probability of Y given that A occurs. In addition, SVM is a supervised learning technique based on the principle of optimal separation classes. The SVM method builds a linear model called maximum margin hyperplane, which provides the greatest separation between instances with different values of the dependent variable [39]. Datasets containing instances that cannot be separated with a straight line are projected into a higherdimensional space through a kernel function. Common kernels are the polynomial, hyperbolic tangent, and radial basis functions. The SVM model is presented in Equations (11)-(13) [22].
Subject to the constrains: and ∈ i ≥ 0, i = 1,…, N where ⦰ x i is the kernel used to transform the data from the input to the high-dimensional space, C svc is a regularization parameter, w is the weight vector to the hyperplane, and b is the hyperplane offset parameter. ∈ i represents the slack variables measuring the degree of misclassification, and x and y are the data points of the dataset with N points. ANNs are parametric regression estimators that use an iterative process to adjust weights and biases within their layers to recognize patterns between inputs and outputs [1]. Particularly, Multilayer Perceptron Networks (MLP) are fully connected networks comprised of several nodes or neurons organized in input and output layers, as well as hidden layers. The principle of MLP is based and ∈ i ≥ 0, i = 1, . . . , N where 5 of 22 ceives the inputs and intends to converge to the best classifier. These rns of the underlying process from past data and generalize the d output data to predict or estimate an output given a new set of input the application of concepts, instances, and variables. An instance then the concept, and comprises a set of variables. g ensemble method that obtains predictive results through gradually combines the performance of many weak classifiers from previous ful one [35]. After each boosting iteration, misclassified data have their rrectly classified data, their weights decreased. The output of the GBT n below [36].
j is the region associated with the jth leaf after space partition imposed ociated with R j , χ is the characteristic function, and Θ, consisting of eters during training. ch that represents a probabilistic relationship between a set of variables ior of a system based on an observed process [37,38]. This model is les and links between the variables, (b) a set of states for each variable, nal probability for each variable. This model has the advantage of rameters. Given a scenario comprising Ai (i = 1, 2, …, n) independent a Y, the Bayes formula can be written as shown in Equation (10) [38].
r occurrence of the probability of A given the condition that Y occurs, e of the probability A i , and p Y A j the conditional occurrence ccurs. ervised learning technique based on the principle of optimal separation ds a linear model called maximum margin hyperplane, which provides een instances with different values of the dependent variable [39]. that cannot be separated with a straight line are projected into a higherkernel function. Common kernels are the polynomial, hyperbolic ions. The SVM model is presented in Equations (11)-(13) [22].  (11) the constrains: and ∈ i ≥ 0, i = 1,…, N d to transform the data from the input to the high-dimensional space, eter, w is the weight vector to the hyperplane, and b is the hyperplane s the slack variables measuring the degree of misclassification, and x e dataset with N points. ression estimators that use an iterative process to adjust weights and (x i ) is the kernel used to transform the data from the input to the high-dimensional space, C svc is a regularization parameter, w is the weight vector to the hyperplane, and b is the hyperplane offset parameter. ∈ i represents the slack variables measuring the degree of misclassification, and x and y are the data points of the dataset with N points.
ANNs are parametric regression estimators that use an iterative process to adjust weights and biases within their layers to recognize patterns between inputs and outputs [1]. Particularly, Multilayer Perceptron Networks (MLP) are fully connected networks comprised of several nodes or neurons organized in input and output layers, as well as hidden layers. The principle of MLP is based on summarizing input signals with a suitable weight, considering that each neuron is activated by a function. Common activation functions are the sigmoid logistic, tangent sigmoid, and linear functions. The signals are transferring from node i to node j and the output signal (e.g., pipe condition or failure rate) is described by the relation shown below [40].
where x i are the input signals or the explanatory variables, w ij are the synaptic weights, and f is the neuron activation function that simulates the information transmission. The pipes' data is processed as described above. Table 1 provides an overview of the explanatory variables adopted for the models' training. The selected variables are separated into nominal and numerical, and the nominal are changed to a numeric type. In the process, the nominal variables become into real-valued attributes. For example, land use can be classified as (a) residential, (b) commercial, (c) industrial, and (d) institutional. The four categories of the variable are converted into one variable coded as (a) 0 for residential use, (b) 1 for commercial use, (c) 2 for industrial use, and (d) 3 for institutional use. Then, land use is converted to a variable with numeric integer values from 0 to 3. The dataset is divided randomly into training and test datasets, as is described previously. K-fold cross-validation technique is also applied to decrease the risk of overfitting. The models are used to establish the predictions of pipe condition (i.e., failure or nonfailure). For the GBT model, the parameters that must be established are the number of boosting iterations (M), the size of each tree (J), and the learning rate (λ). Experiences suggest that values of J ranging between 4 and 8 work properly in the context of boosting, with results being insensitive to particular choices in this range [41]. In addition, the learning rate varies between 0 and 1. Empirically, it has been found that smaller values of λ (lead to larger values of M) can decrease the test error. An automated trial and error approach is adopted to select the best tree configuration [35]. To select the parameters, a 10-fold cross-validation is carried out varying J between 4 and 8, and the learning rate between 0.1 and 0.5.
Concerning the SVMs, the parameters that must be defined are the capacity (C), gamma (γ), and epsilon (ε). The capacity (C) is a coefficient that regulates the trade-off between the training errors and the prediction risk minimization. Higher values of C lead to higher weights assigned to misclassifications [34]. The γ parameter is a coefficient that controls the complexity of the solution and ε is the loss function that describes the regression vector without all the input data [42]. Thus, an automated trial and error approach is adopted for testing polynomial, radial, and neural kernel functions. A 10-fold cross-validation is carried out varying the capacity from 0.1 to 50, and γ between 0.1 and 20. The radial kernel is implemented as the kernel function, and the combination of parameters that produce the best classification is selected.
Regarding the ANNs, the parameters that must be defined are the number of input layers, the number of hidden layers, the neurons in the hidden layers, the training cycles, the learning rate, and the activation function. Thus, the number of input layer is defined as the number of explanatory variables. The number of hidden layers is selected on the basis that additional layers produce additional errors. The literature suggests using one or two hidden layers [43,44]. The typical sizes of the learning rate range between 0.01 and 0.6 and the most common activation functions are logistic and sigmoid [44]. Furthermore, the number of neurons can be calculated using the following equation.
where HL is the number of hidden layers, N i is the number of input variables, and N c is the number of classes. An automated approach is adopted to choose the best ANN configuration considering one or two hidden layers with eight neurons. Several activation functions are tested (i.e., exponential, logistic, sigmoid, and hyperbolic tangent). For each model, the selected values of the parameters are presented in Appendix A.
The performance of the ML methods is evaluated using accuracy, confusion matrices, and receiver operating characteristic (ROC) curves. Accuracy is estimated as the fraction of correct predictions to the total predictions [9], as shown in Equation (16). The confusion matrix, presented in Table 2, provides more information on model performance because it categorizes the results according to predictions and observations. Pipes that are correctly classified as fail are represented by true positive (TP) and pipes correctly classified as not fail, by true negative (TN). Incorrect classifications are described by false negative (FN), which occurs when the model predicts that the pipe does not fail, but is broken, and false positive (FP), when a pipe does not fail, but is predicted to fail. A set of alternative metrics, particularly true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), and the F-measure, can be used for assessing the predictive capability of the models. The TPR, or recall, measures the percentage of right predictions made from the class of interest (i.e., the failed pipes). The TNR gives the percentage of correct classification of the other class (i.e., the pipes that do not fail). Similarly, the FPR represents the proportion of all negatives incorrectly classified as positive, and the FNR evaluates the positives incorrectly classified as negatives. The rates presented are related to each other thought the equations FNR + TPR = 1 and FPR + TNR = 1 [45]. Finally, the F-measure compares the models' performance in terms of recall and precision (i.e., a measure of exactness) using a factor that controls their relative importance. The F-measure, precision, and recall tend to 1 as the models' performance increases [46]. The metrics are defined below.
The ROC curve is a helpful technique for visualizing and selecting the most suitable model based on performance [47]. This curve is obtained by plotting the TPR as a function of the FPR (Figure 1), considering different probability thresholds to make class predictions [39]. The ROC curve is considered reliable when the curve is over the 45 • line. Perfect classification is graphically defined by the union of two lines, corresponding to FPR equal to 1 and TPR equal to 1 [9].
The ROC curve is a helpful technique for visualizing and selecting the most suitable model based on performance [47]. This curve is obtained by plotting the TPR as a function of the FPR (Figure 1), considering different probability thresholds to make class predictions [39]. The ROC curve is considered reliable when the curve is over the 45° line. Perfect classification is graphically defined by the union of two lines, corresponding to FPR equal to 1 and TPR equal to 1 [9]. Generally, a baseline probability threshold, where any pipe with a predicted probability of failure greater than 50% will be assigned as failed, is used to train the models. A new threshold can be determined using Youden's J index.
This index allows a new threshold that is closest to the optimal model. Youden's J index does not modify the trained model as the same parameters are being used, and it is only employed to increase the sensitivity of the model to the minority class of interest [48].

Case Study
The proposed models were applied to a WDN in Bogotá (Colombia), presented in Figure 2. The WDN has 61,251 pipes with an overall network length of 1819 km and 28,671 house connections. The network has different pipe materials, distributed as follows: polyvinyl chloride (70.6%), asbestoscement (24.2%), high-density polyethylene (2.7%), cast iron (0.9%), and others (1.6%). The average pipe is 29 years old, including the 11,442 pipes that have been in operation for over 40 years. The oldest pipes on the network are asbestos-cement (see Figure 2), and the majority of the pipes installed within the past 10 years are made of polyvinyl chloride (PVC). Pipe diameters range from 50.8 to 609.6 mm and approximately 51% of the pipes have a diameter ranging between 50.8 and 76.2 mm. Generally, a baseline probability threshold, where any pipe with a predicted probability of failure greater than 50% will be assigned as failed, is used to train the models. A new threshold can be determined using Youden's J index.
This index allows a new threshold that is closest to the optimal model. Youden's J index does not modify the trained model as the same parameters are being used, and it is only employed to increase the sensitivity of the model to the minority class of interest [48].

Case Study
The proposed models were applied to a WDN in Bogotá (Colombia), presented in Figure 2. The WDN has 61,251 pipes with an overall network length of 1819 km and 28,671 house connections. The network has different pipe materials, distributed as follows: polyvinyl chloride (70.6%), asbestos-cement (24.2%), high-density polyethylene (2.7%), cast iron (0.9%), and others (1.6%). The average pipe is 29 years old, including the 11,442 pipes that have been in operation for over 40 years. The oldest pipes on the network are asbestos-cement (see Figure 2), and the majority of the pipes installed within the past 10 years are made of polyvinyl chloride (PVC). Pipe diameters range from 50.8 to 609.6 mm and approximately 51% of the pipes have a diameter ranging between 50.8 and 76.2 mm. Pipe failure records, available from 2012 to 2018, were provided by the city's water utility service (EAB). Figure 3 shows the relationship between failures and the pipes' characteristics. A preliminary analysis showed that pipes with diameters of between 50.8 and 76.2 mm exhibited the highest failure rate (see Figure 3a). Commonly, the number of failures in small diameter pipes exceeds that of larger diameter pipes [5,49]. The main reason for the higher frequency of failure in small diameter pipes has been attributed to decreased pipe strength to ground movement and corrosion because of reduced wall thickness [50]. As presented in Figure 3b, the older pipes showed a higher number of failures. Failure rates can be expected to be higher in the months following the installation. Then, the rates are lower for several years, before increasing with the age of the pipe [50]. Pipe deterioration with age is known to occur. Nevertheless, the relationship between the pipe age and the failure rate is not clear due to manufacturing process, installation and operational practices, and environmental conditions. In the network, the average age of pipe failure is 34 years. The analysis also exhibited that the failure rate increased with the pipe length (see Figure 3c). Boulos et al. [51] concluded that short pipes broke less than longer pipes because a short pipe length limits the effects of transient surge pressure because junctions cause pressure wave reflections.
In addition, Figure 3d shows the number of failures per material normalized by the total number of pipes of each material. Failure records revealed that 67.8% of the failed pipes are made of asbestoscement and 28.3% of PVC. The failure rate of pipe materials is determined by the material's flexibility, method of manufacture, and manufacturing processes [50]. It was expected that PVC pipes have a lower failure rate than asbestos-cement pipes because plastic materials were the last to be introduced, and their quality has improved significantly over the years. The higher number of failures exhibited by asbestos-cement and PVC pipes also depends on the material distribution, which is described above. Based on these findings, only asbestos-cement and PVC pipes are considered for the analysis. As each type of material has a specific deterioration pattern [9,52], independent (per material) analyses were carried out for each. Pipe failure records, available from 2012 to 2018, were provided by the city's water utility service (EAB). Figure 3 shows the relationship between failures and the pipes' characteristics. A preliminary analysis showed that pipes with diameters of between 50.8 and 76.2 mm exhibited the highest failure rate (see Figure 3a). Commonly, the number of failures in small diameter pipes exceeds that of larger diameter pipes [5,49]. The main reason for the higher frequency of failure in small diameter pipes has been attributed to decreased pipe strength to ground movement and corrosion because of reduced wall thickness [50]. As presented in Figure 3b, the older pipes showed a higher number of failures. Failure rates can be expected to be higher in the months following the installation. Then, the rates are lower for several years, before increasing with the age of the pipe [50]. Pipe deterioration with age is known to occur. Nevertheless, the relationship between the pipe age and the failure rate is not clear due to manufacturing process, installation and operational practices, and environmental conditions. In the network, the average age of pipe failure is 34 years. The analysis also exhibited that the failure rate increased with the pipe length (see Figure 3c). Boulos et al. [51] concluded that short pipes broke less than longer pipes because a short pipe length limits the effects of transient surge pressure because junctions cause pressure wave reflections.
In addition, Figure 3d shows the number of failures per material normalized by the total number of pipes of each material. Failure records revealed that 67.8% of the failed pipes are made of asbestos-cement and 28.3% of PVC. The failure rate of pipe materials is determined by the material's flexibility, method of manufacture, and manufacturing processes [50]. It was expected that PVC pipes have a lower failure rate than asbestos-cement pipes because plastic materials were the last to be introduced, and their quality has improved significantly over the years. The higher number of failures exhibited by asbestos-cement and PVC pipes also depends on the material distribution, which is described above. Based on these findings, only asbestos-cement and PVC pipes are considered for the analysis. As each type of material has a specific deterioration pattern [9,52], independent (per material) analyses were carried out for each.

Statistical Models
Regarding the statistical models, the K-means clustering approach was applied to separate the data into a number of specified clusters based on pipe diameter, length, and age. Other variables were excluded as using the pipes' attributes helps to obtain models with greater statistical significance [5,53]. To determinate the optimal number of clusters, the Davies-Bouldin criterion was applied. Thus, the data were divided into six groups for asbestos-cement pipes and four groups for PVC pipes. The DBI calculated for the clustering was 0.6478 and 0.6322 for asbestos-cement and PVC pipes, respectively. The K-means algorithm was run using different initializations of centroids, and the partition that produced the lower sum of squared distance was chosen. Optimal partitions results and the distribution of the data set for the number of clusters calculated are shown in Figure 4. The diameter was the variable that most influenced the partition, followed by the length. The data points of each cluster have a high similarity to each other in terms of diameter and length. As shown in Figure 4, the clusters are composed of pipes with different ages, which could mean that age is not being a good indicator of the pipe condition.

Statistical Models
Regarding the statistical models, the K-means clustering approach was applied to separate the data into a number of specified clusters based on pipe diameter, length, and age. Other variables were excluded as using the pipes' attributes helps to obtain models with greater statistical significance [5,53]. To determinate the optimal number of clusters, the Davies-Bouldin criterion was applied. Thus, the data were divided into six groups for asbestos-cement pipes and four groups for PVC pipes. The DBI calculated for the clustering was 0.6478 and 0.6322 for asbestos-cement and PVC pipes, respectively. The K-means algorithm was run using different initializations of centroids, and the partition that produced the lower sum of squared distance was chosen. Optimal partitions results and the distribution of the data set for the number of clusters calculated are shown in Figure 4. The diameter was the variable that most influenced the partition, followed by the length. The data points of each cluster have a high similarity to each other in terms of diameter and length. As shown in Figure 4, the clusters are composed of pipes with different ages, which could mean that age is not being a good indicator of the pipe condition.  Tables 3 and 4 summarize the obtained results for the statistical models. By comparison, the regression coefficients associated with the explanatory variables are relatively different among materials. The latter is because the pipe deterioration process is different among materials. Factors such as construction methods, corrosion processes, and environmental conditions can affect the relationship between pipe age, diameter, length, and number of failures for each material. Other authors have also observed differences in the fitted values of the variables from one to another material [49,54]. From the reported values, pipe length proved to be highly relevant in the observed failure events. The applied methods showed an inverse relationship between the diameter and the number of failures. Furthermore, pipe length has a positive relationship with the number of failures. These relationships are consistent with previous research [5,49,53]. In contrast, three of the equations exhibited a positive relationship between pipe age and the failures, while the remaining presented an inverse relationship. This is a counterintuitive result, considering that older pipes are most likely to fail. However, this can be explained by the fact that many pipes are older than the point in time when pipe failures began to be recorded [14]. Other authors have attributed this result to the fact that only measurable variables are included in the models. Variables such as quality and strength of the material are not measured, but their change can produce variations in pipe performance from one age to another [14,55].  Tables 3 and 4 summarize the obtained results for the statistical models. By comparison, the regression coefficients associated with the explanatory variables are relatively different among materials. The latter is because the pipe deterioration process is different among materials. Factors such as construction methods, corrosion processes, and environmental conditions can affect the relationship between pipe age, diameter, length, and number of failures for each material. Other authors have also observed differences in the fitted values of the variables from one to another material [49,54]. From the reported values, pipe length proved to be highly relevant in the observed failure events. The applied methods showed an inverse relationship between the diameter and the number of failures. Furthermore, pipe length has a positive relationship with the number of failures. These relationships are consistent with previous research [5,49,53]. In contrast, three of the equations exhibited a positive relationship between pipe age and the failures, while the remaining presented an inverse relationship. This is a counterintuitive result, considering that older pipes are most likely to fail. However, this can be explained by the fact that many pipes are older than the point in time when pipe failures began to be recorded [14]. Other authors have attributed this result to the fact that only measurable variables are included in the models. Variables such as quality and strength of the material are not measured, but their change can produce variations in pipe performance from one age to another [14,55]. Table 5 presents a summary of the statistical models' performance. All the models showed acceptable performance on both training and testing datasets. Poisson Regression performs best according to R 2 and RMSE. These results confirmed that the Poisson Regression's ability for generalization (i.e., the model's ability to adapt properly to a new range of inputs) is better than that of the other techniques. The advantage of the Poisson Regression is to recognize the non-negative nature of the predicted variable. The application of this model is suitable for predicting failures in pipes with lower failure rates, such as pipes with large diameters and small lengths.

Material Equation
Asbestos  Because the number of failures should be assessed for each pipe over a certain period, the models trained cannot be used directly for calculating the failure rate for an individual pipe because they are aggregated. The individual failure rate (λ i ) for a given observation period (T) is calculated by considering the pipe length (L i ), the total class length (L group ), and the failures predicted for the group (FR group ) [5]. (23) It is important to clarify that the failure rate does not consider the burst history of the individual pipe but the group of pipes. The accuracy of the individual failure rate predictions based on different pipe characteristics is compared in Figure 5. For the asbestos-cement pipes, Linear Regression underestimated the failure rate in most cases. The limitation of this model is more evident in younger pipes and pipes with larger diameters (see Figure 5a,b). Additionally, Poisson Regression and EPR showed adequate precision when calculating the failure rate in different age ranges and pipe diameters. Regarding the PVC pipes, the predicted capability of EPR is limited for the small pipe diameters (i.e., 76.2 mm), as shown in Figure 5c. This prediction has substantially improved for Poisson and Linear Regression. All the models show reasonable accuracy when calculating the failure rate in the different age ranges (see Figure 5d). Based on the results, the Poisson Regression is better than the two other models, while the quality of the Linear Regression and EPR are similar.
Based on the results, the best performance models were used to represent pipe failure rates in the WDN and classify them in different ranges to identify more vulnerable regions. Figure 6 shows the spatial distribution of the failure rates. The results showed that 4% and 5% of the pipes have a high failure rate in the current and predicted condition, respectively. The water utility service must pay attention to those pipes and take an appropriate preventive approach (e.g., maintenance or replacement after inspection). Further, for both current and predicted conditions, most of the pipes exhibit low failure rates. The approach presented above is useful for pipe failure prediction when information about other variables different from pipe attributes (e.g., environmental and operational variables such as water pressure and temperature) is not available. It is recommended to determine an appropriate group size according to the data characteristics to ensure that the aggregation process has statistical significance. Based on the results, the best performance models were used to represent pipe failure rates in the WDN and classify them in different ranges to identify more vulnerable regions. Figure 6 shows the spatial distribution of the failure rates. The results showed that 4% and 5% of the pipes have a high failure rate in the current and predicted condition, respectively. The water utility service must pay attention to those pipes and take an appropriate preventive approach (e.g., maintenance or replacement after inspection). Further, for both current and predicted conditions, most of the pipes exhibit low failure rates. The approach presented above is useful for pipe failure prediction when information about other variables different from pipe attributes (e.g., environmental and operational variables such as water pressure and temperature) is not available. It is recommended to determine an appropriate group size according to the data characteristics to ensure that the aggregation process has statistical significance.

Machine Learning Models
With respect to ML models, Table 6 summarizes the accuracy and the F-measure for the trained models evaluated on the test data. Tables 7 and 8 summarize the confusion matrices for the trained models. All the models used a baseline probability threshold where any pipe with a predicted failure

Machine Learning Models
With respect to ML models, Table 6 summarizes the accuracy and the F-measure for the trained models evaluated on the test data. Tables 7 and 8 summarize the confusion matrices for the trained models. All the models used a baseline probability threshold where any pipe with a predicted failure probability of greater than 50% would be assigned as failed. Although the accuracy was higher than 93%, the confusion matrices revealed that ANNs focused on correctly classifying the majority class, namely the pipes that do not fail. Thus, ANNs gave only 39% of correct classifications for asbestos-cement failing pipes and 7.14% for PVC failing pipes. Overall accuracy may not afford a reliable performance indicator for models trained using an imbalanced dataset (i.e., when most of the pipes do not fail) because it can provide an incorrect impression of the capabilities to predict the minority class condition, in this case, the failing pipes. In contrast, the F-measure has demonstrated its usefulness to test the models' performance when an imbalanced dataset is used for training. According to this metric, GBT and SVM showed acceptable performance.  In contrast, Bayes and GBT exhibited the best performance considering the TPR (e.g., 0.894 and 0.546 for asbestos-cement, respectively). The models with the lowest FPR were SVMs (0.179 for asbestos-cement and 0.333 for PVC) and GBT (0.265 for asbestos-cement and 0.535 for PVC). For failure prediction, conservative models are preferred because they reduce the pipes replacement cost before the end of their service life [9]. Although SVMs and GBT have a lower TPR compared to Bayes, using these models does not affect the rehabilitation strategies because not all pipes predicted to fail would be replaced immediately. Figure 7 shows the ROC curves for the trained models. The caption provides information about the area under the curve (AUC), which is a quantity that falls in the range between zero and one that integrates over the respective ROC function [9]. Thus, if a comparison is made between models, the most effective model is the one with the largest AUC. For asbestos-cement pipes, the ROC curves for the four selected models are relatively close. GBT achieves the highest AUC (0.998), which indicates that this method is well suited for pipe failure prediction, and ANNs exhibit the lowest AUC (0.984). Concerning PVC pipes, ROC curves for GBT and Bayes are notably close, with the most reliable prediction model being GBT (see Figure 7). The results showed that these models discriminate better between failing pipes and nonfailing ones because its curve is always above the 45 • line. Additionally, GBT exhibited the highest AUC and ANNs, the lowest.  By comparison, GBT exhibited better performance than the other models. This approach has the advantage of providing greater importance to the misclassified pipes in each iteration, so it not only focuses on correctly classifying the pipes that do not fail. As compared to SVM, GBT provides a higher level of readability and transparency of the result, and identifies the relevance of the explanatory variables. Additionally, Bayes was demonstrated to be an effective model for classifying the failing pipes. Despite this, the model showed the highest FNR (0.848 and 0.967 for asbestos-cement and PVC pipe datasets, respectively) and the lowest F-measure (25.66% and 6.26% for asbestos-cement and PVC pipe datasets, respectively). As mentioned earlier, the application of models with low FNR is preferable.
Results also showed that the imbalance dataset significantly compromised the ability of ANNs to correctly classify the failing pipes. The low predictive capability is most evident in PVC pipes (Fmeasure of 7.40%), as these pipes are less likely to fail, and they have been installed more recently. St. Clair et al. [56] and Wu et al. [6] mention that the data requirement is the main limitation of the ANNs. Figure 8 shows the importance of the variables for the GBT model, where high values indicate high relevance for the prediction process. The most important variables were the number of previous failures, length, and precipitation. Some authors found that the number of previous failures was a significant variable for predicting future failure rates and demonstrated their impact on the occurrence of new pipe failures [37,47,57]. Oliveira et al. [58] concluded that inadequate repairs of breaks might produce new failures close to the previous ones. Besides, Debón et al. [47] and Winkler et al. [9] also observed that pipe attributes, such as age, length, and diameter, are significant variables for failure prediction. The other environmental and operational selected variables were not particularly significant in the modeling process (see Figure 8). It is necessary to consider that because of the procedure's data dependency, the importance of the variables is representative of this case study and not for the pipe failure process. As previously mentioned, all the trained models use a baseline probability of 50%. A new threshold can be determined using Youden's J index. The value of the index for the GBT method was 0.57 and 0.54 for asbestos-cement and PVC pipes, respectively. The result suggested that, when applying GBT, acceptable predictions can be obtained for the failing pipes without sacrificing a reasonable level of accuracy for the pipes that do not fail.
By comparison, GBT exhibited better performance than the other models. This approach has the advantage of providing greater importance to the misclassified pipes in each iteration, so it not only focuses on correctly classifying the pipes that do not fail. As compared to SVM, GBT provides a higher level of readability and transparency of the result, and identifies the relevance of the explanatory variables. Additionally, Bayes was demonstrated to be an effective model for classifying the failing pipes. Despite this, the model showed the highest FNR (0.848 and 0.967 for asbestos-cement and PVC pipe datasets, respectively) and the lowest F-measure (25.66% and 6.26% for asbestos-cement and PVC pipe datasets, respectively). As mentioned earlier, the application of models with low FNR is preferable.
Results also showed that the imbalance dataset significantly compromised the ability of ANNs to correctly classify the failing pipes. The low predictive capability is most evident in PVC pipes (F-measure of 7.40%), as these pipes are less likely to fail, and they have been installed more recently. St. Clair et al. [56] and Wu et al. [6] mention that the data requirement is the main limitation of the ANNs. Figure 8 shows the importance of the variables for the GBT model, where high values indicate high relevance for the prediction process. The most important variables were the number of previous failures, length, and precipitation. Some authors found that the number of previous failures was a significant variable for predicting future failure rates and demonstrated their impact on the occurrence of new pipe failures [37,47,57]. Oliveira et al. [58] concluded that inadequate repairs of breaks might produce new failures close to the previous ones. Besides, Debón et al. [47] and Winkler et al. [9] also observed that pipe attributes, such as age, length, and diameter, are significant variables for failure prediction. The other environmental and operational selected variables were not particularly significant in the modeling process (see Figure 8). It is necessary to consider that because of the procedure's data dependency, the importance of the variables is representative of this case study and not for the pipe failure process. Because the other selected approaches cannot identify the significant variables in the modeling process, the backward elimination technique was used to recognize the relevance of the variables. This technique begins with a full model that contains all the selected variables for the modeling process. Variables are then removed one by one from the full model until all remaining variables are considered to have some significant contribution to the outcome [59]. The soil moisture content, soil contraction and expansion potential, and number of valves and hydrants connected to the pipe were the variables not considered for the asbestos-cement trained models. For the PVC pipes, the soil moisture content and soil contraction and expansion potential were eliminated from the SVM model. Further, the land use, soil moisture content, and soil contraction and expansion potential were excluded from the ANN model. In general, the results showed that environmental variables, such as the soil moisture content, land use, and soil contraction and expansion potential, have no pleasing effect on pipe failure modeling.
The GBT models trained were selected as the final classifiers due to their performance. A sensitivity analysis of GBT to the input variables was performed to provide information on its generalizability. The analysis was carried out considering the effects of variation in values of only one input, while the others remained constant. The results showed that the GBT model trained for asbestos-cement pipes is more sensitive to changes in the diameter, age, and the number of previous failures. An increase in the diameter, precipitation, and number of valves generates an increment in the number of failing pipes. The GBT model trained for PVC pipes is more sensitive to the number of previous failures, precipitation, and the number of hydrants. Modification of the other variables does not affect the outcome of pipes predicted to fail.
Based on the results, the final GBT models trained are used to predict the failure probability of individual pipes in the WDN. This prediction focuses on a specific pipe and its characteristics, which enables an understanding of a single pipe failure pattern. Figure 9 shows the pipes' deterioration pattern in the WDN. The results revealed that around 0.17% of the pipes have a high probability of failure in the present condition. For those pipes, it is necessary to use the appropriate maintenance or replacement strategies to avoid failure. Likewise, for both current and predicted conditions, most of the pipes exhibit low failure probability. The analysis of the probability values made it possible to establish that, when comparing the current condition with the predicted condition, there was a 28% increase in the number of pipes with failure probabilities of between 0.6 and 0.8, and an 18% increase in the pipes with failure probabilities of between 0.8 and 1.0.
According to Figure 9, it is important to highlight that some pipes do not deteriorate as expected, Because the other selected approaches cannot identify the significant variables in the modeling process, the backward elimination technique was used to recognize the relevance of the variables. This technique begins with a full model that contains all the selected variables for the modeling process. Variables are then removed one by one from the full model until all remaining variables are considered to have some significant contribution to the outcome [59]. The soil moisture content, soil contraction and expansion potential, and number of valves and hydrants connected to the pipe were the variables not considered for the asbestos-cement trained models. For the PVC pipes, the soil moisture content and soil contraction and expansion potential were eliminated from the SVM model. Further, the land use, soil moisture content, and soil contraction and expansion potential were excluded from the ANN model. In general, the results showed that environmental variables, such as the soil moisture content, land use, and soil contraction and expansion potential, have no pleasing effect on pipe failure modeling.
The GBT models trained were selected as the final classifiers due to their performance. A sensitivity analysis of GBT to the input variables was performed to provide information on its generalizability. The analysis was carried out considering the effects of variation in values of only one input, while the others remained constant. The results showed that the GBT model trained for asbestos-cement pipes is more sensitive to changes in the diameter, age, and the number of previous failures. An increase in the diameter, precipitation, and number of valves generates an increment in the number of failing pipes. The GBT model trained for PVC pipes is more sensitive to the number of previous failures, precipitation, and the number of hydrants. Modification of the other variables does not affect the outcome of pipes predicted to fail.
Based on the results, the final GBT models trained are used to predict the failure probability of individual pipes in the WDN. This prediction focuses on a specific pipe and its characteristics, which enables an understanding of a single pipe failure pattern. Figure 9 shows the pipes' deterioration pattern in the WDN. The results revealed that around 0.17% of the pipes have a high probability of failure in the present condition. For those pipes, it is necessary to use the appropriate maintenance or replacement strategies to avoid failure. Likewise, for both current and predicted conditions, most of the pipes exhibit low failure probability. The analysis of the probability values made it possible to establish that, when comparing the current condition with the predicted condition, there was a 28% increase in the number of pipes with failure probabilities of between 0.6 and 0.8, and an 18% increase in the pipes with failure probabilities of between 0.8 and 1.0. extrapolation of the predictions [9]. Although it is not intuitive, a decrease of the failure probability can be observed in reality. Some authors associate a higher failure rate with a pipe's initial service life. [9,60]. Martinez-Codina et al. [61] performed a study to determine the relationship between pipe failure causes and processes. Based on the experimental analyses, they observed that the failure probability amounted to a higher rate in the first years of service life than in the later years. The best performance models (i.e., Poisson regression for group pipe analysis and the GBT approach for individual pipe analysis) were applied to another WDN in Bogotá (Colombia). The WDN has 20,793 pipes with an overall network length of 652 km. The network has different pipe materials, distributed as follows: polyvinyl chloride (74.5%), asbestos-cement (25%), and concrete cylinder pipes (0.19%). Pipe diameters range from 50.8 to 406.4 mm, and 40% of the pipes have a diameter of 76.2 mm. In addition, approximately 75% of the pipes have an age ranging between 41 and 50 years. Pipe failure records are available from 2015 and 2017. A preliminary analysis showed that pipes with a diameter of 76.2 mm and an age of over 40 years exhibited the highest failure rate. Failure records revealed that 65% of the failed pipes are made of asbestos-cement and 34% of PVC.
Regarding the group pipe analysis, data were divided into four groups for both asbestos-cement and PVC pipes. Results show an R 2 of 0.452 for asbestos-cement pipes and 0.724 for PVC pipes. For the individual pipe analysis, results gave only 4% correct classifications for asbestos-cement pipe failure and 15% for PVC pipe failure. These results and other findings in previous studies underline the need for each WDN to develop its own failure model [1,62]. All the networks have substantive differences, and the effect of specific variables in the models is dependent on the WDN characteristics.

Conclusions
In this paper, the performance of several statistical and ML models in predicting pipe failure in WDNs was evaluated. Three statistical models, including Linear Regression, Poisson Regression, and Evolutionary Polynomial Regressions, were used for failure prediction based on pipe diameter, pipe According to Figure 9, it is important to highlight that some pipes do not deteriorate as expected, as their condition improves with age. This can be explained by the fact that, when the age of the pipe is increased, observations outside of the training data are generated. Thus, the model requires an extrapolation of the predictions [9]. Although it is not intuitive, a decrease of the failure probability can be observed in reality. Some authors associate a higher failure rate with a pipe's initial service life. [9,60]. Martinez-Codina et al. [61] performed a study to determine the relationship between pipe failure causes and processes. Based on the experimental analyses, they observed that the failure probability amounted to a higher rate in the first years of service life than in the later years.
The best performance models (i.e., Poisson regression for group pipe analysis and the GBT approach for individual pipe analysis) were applied to another WDN in Bogotá (Colombia). The WDN has 20,793 pipes with an overall network length of 652 km. The network has different pipe materials, distributed as follows: polyvinyl chloride (74.5%), asbestos-cement (25%), and concrete cylinder pipes (0.19%). Pipe diameters range from 50.8 to 406.4 mm, and 40% of the pipes have a diameter of 76.2 mm. In addition, approximately 75% of the pipes have an age ranging between 41 and 50 years. Pipe failure records are available from 2015 and 2017. A preliminary analysis showed that pipes with a diameter of 76.2 mm and an age of over 40 years exhibited the highest failure rate. Failure records revealed that 65% of the failed pipes are made of asbestos-cement and 34% of PVC.
Regarding the group pipe analysis, data were divided into four groups for both asbestos-cement and PVC pipes. Results show an R 2 of 0.452 for asbestos-cement pipes and 0.724 for PVC pipes. For the individual pipe analysis, results gave only 4% correct classifications for asbestos-cement pipe failure and 15% for PVC pipe failure. These results and other findings in previous studies underline the need for each WDN to develop its own failure model [1,62]. All the networks have substantive differences, and the effect of specific variables in the models is dependent on the WDN characteristics.

Conclusions
In this paper, the performance of several statistical and ML models in predicting pipe failure in WDNs was evaluated. Three statistical models, including Linear Regression, Poisson Regression, and Evolutionary Polynomial Regressions, were used for failure prediction based on pipe diameter, pipe age, and length as explanatory variables. ML approaches, including Gradient-Boosted Tree (GBT), Bayes, Support Vector Machine, and Artificial Neuronal Networks (ANNs), were compared in terms of their capability to predict individual pipe condition. Pipe attributes, and environmental and operational variables were included as input variables. The selected case study was a highly populated area in Bogotá with a large WDN.
The results of the statistical models showed that the cluster-based prediction approach reduces the prediction error for pipe failure when information about variables other than pipe attributes is not available. All the models demonstrated acceptable results in terms of their performance (R 2 between 0.695 and 0.927 and RMSE between 45 and 22 for the test sample), but the application of Poisson Regression was the most suitable for predicting failures in pipes with lower failure rates. Regarding ML models, all methods but the ANNs presented acceptable performance. The GBT approach presented the best performing classifier (ACU of 0.998 and 0.990 for the test sample of asbestos-cement and PVC pipes, respectively). The GBT model has a greater ability to accurately predict pipe failure when an imbalanced database is used. Furthermore, the assumptions and trade-offs of the GBT model are more transparent than in other artificial intelligence techniques.
Using the abovementioned predictive models earlier could significantly reduce the time and money allocated to the identification of deteriorated pipes. The group-pipe failure analysis is useful when the available data is limited. In contrast, the individual-pipe approach makes it possible to consider the single pipe failure history, which is of great importance for establishing an individual pipe's propensity to fail. The knowledge provided by this study is especially important for water utility as it provides information that helps to prioritize a proactive rehabilitation strategy, making it more efficient and profitable. Future work will include the application of the modeling approach to a more detailed dataset that incorporates other variables such as water pressure and temperature, which affect the pipe failure process [63,64]. It is also recommended to evaluate the effect of the failures' spatial correlation [65].
Author Contributions: M.M.G.-G. performed the proposed approach, analyzed the obtained data, and wrote the paper. Supervision, review, and editing were conducted by J.P.R. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The authors acknowledge the water utility service (EAB) for providing the data used in this study.

Conflicts of Interest:
The authors declare no conflict of interest.