A Decision Tree Approach to the Risk Evaluation of Urban Water Distribution Network Pipes

: To evaluate the risk of a pipe in the water supply network of Beijing, we used the accident records of the gridding urban management (GUM) system. In addition, road and building information derived from a three-dimensional (3D) electronic map was also employed. A machine learning algorithm, the decision tree, was employed to train and evaluate the dataset. The results show that the contributions of the surrounding buildings and roads are neglectable, except for super-high-rise buildings, which have limited contributions. This ﬁnding is consistent with the results of other studies. The decision tree identiﬁes dominant features and isolates the risk contribution of such features. The output tree structure indicated that the time since the last accident is a dominant factor, to which super-high-rise buildings contribute slightly. A cut-o ﬀ value of 0.019 was chosen to predict high-risk regions. Approximately 0.4% of the data were predicted to be high risk, and the corresponding gain in risk rate was approximately 19.2. This model may be used in cities where detailed proﬁles of water supply pipes and maintenance records are not available or are expensive to achieve.


Introduction
Water supply is an important part of the urban lifeline in a city. In China, the Water Pollution Prevention and Control Action Plan was initiated by the State Council in April 2015. This plan is also known as the Water Ten Plan (WTP). The WTP proposed several measures and set out long-term objectives at a national level [1], and the renovation and upgradation of aging water pipes in distribution networks was initiated. These pipes have been in service for more than 50 years and are mostly made of outdated materials. In addition, a long-term goal for the national average leakage rate in water distribution networks was proposed. After achieving this goal, the average leakage rate would be lower than 12% by 2017 and 10% by 2020. It is worth noting that the national average water loss rate was 14.32% in 2015 [2].
However, it is usually difficult to predict when, where, and how severe a pipe failure may occur in a water supply network. The frequency of the breakage and failure of pipes in water distribution networks tends to increase over time. This may be caused by a variety of reasons, such as environmental temperature variations and different operation pressure levels in pipes.
Some researchers have tried to locate leakage sites in order to prevent water loss. Different methods have been proposed to detect leakages in water distribution systems. These methods include artificial neural networks, state estimation, stochastic process control, and time series modeling [3]. The accuracy of these methods requires extensive system monitoring due to the measurement of system parameters such as pipe flow rate and pressure head [3]. Some researchers have focused on prediction pipeline leakage using stochastic models, namely certainty and probability models [4]. Popular certainty models include the time index model, the time power model, and the time linear model. Either the number of leakages per unit of time or pipe length or the total number of pipe leakages is predicted by these models. As a probability model, a hazards model predicts the probability of a pipe leakage in an instant for one specified pipe. In recent years, various artificial intelligent algorithms, such as artificial neural networks, the ant colony algorithm, and the genetic algorithm, have been used for predicting leakages in pipe networks. Furthermore, models combining several of the aforementioned algorithms have flourished.
Some groups have studied pipeline state prediction methods, including pipe leakage, pipe burst, and comprehensive assessment [5]. Pipe leakages are usually caused by very slight damage to pipes. These events are frequently modeled as a time series when the number of current leaks in the same pipeline is more than four. The gray forecasting theory was adopted by Zhang et al. [6] to find an inherent pattern by suppressing the influence of random factors through data washing and to determine the occurrence trend of future leakages. Wang et al. [7], Jin et al. [8], and Wang et al. [9] modeled the pipeline leakage time interval, the pipeline monthly maintenance times, and the annual leakage rate as the time series data, and they then carried out forecast analysis. These methods are a second exponential smoothing model, a third exponential smoothing model, and a wavelet neural network, respectively. When the frequency of pipeline leakage did not hold for the time series assumption, linear models were proposed by Zhang et al. [6,10] and Wang et al. [7] to predict the next time a pipe leakage would occur. These models used the depth of installation, the hydraulic pressure, and the pipe diameter as features. Grey relational analysis (GRA) was employed to analyze and evaluate all the factors and to draw the order of the factors influencing pipeline leakage [4]. Additionally, a linear prediction model was trained to determine the first leakage time after the networks were deployed.
Pipe burst refers to the situation where a pipeline leakage rises to the ground due to structural damage to the pipeline, and it must be repaired immediately [11]. Most research on pipe burst is based on the hypothesis proposed by Shamir et al. [12], which states that the number of bursts is exponential to the age of the pipe. A model was proposed by Ma et al. [13] to predict pipe replacement time by employing the predicted pipe burst rate and the limit pipe burst rate equation proposed by Shamir et al. [12]. This model was designed to minimize the total cost of repairing and replacing pipes during their service life. A method combining survival analysis and burst hazard was proposed by Zhou et al. [14] and Ke et al. [15]. Their studies used the age, diameter, and material of the pipe as features to predict the pipe burst rate per unit of length. The assumption was that the baseline hazard is a quadratic function of the age of the pipe.
In recent years, comprehensive assessment methods involving more variables have been proposed. Zeng et al. [16] evaluated dynamic hydraulic factors and their impact on pipe bursts in a stage-by-stage fashion. The time series of the evaluation index vectors, composed of velocity, pressure, and the pressure difference of the pipe sections, were used as the input for a back propagation neural network (BPNN) to obtain the safety level of each time point of a pipe section. They used a revised evaluation metric of pipe burst rate to carry out a comprehensive evaluation of pipeline safety [15]. Chang et al. [17] also employed BPNN to comprehensively predict pipeline damage risk using the material, age, diameter, length, and coupling of the pipe.
A more comprehensive study was performed by Kumar et al. [18]. They modeled the failure of water pipes as a binary classification task, which predicted whether a failure may or may not occur for a city block within a specified period. Several data sources were combined, leading to a comprehensive dataset composed of pipe diameter, pipe age, pipe material, installation year, soil type, rock type, pressure zone, road rating, the number of previous breaks in that city block, etc.
However, the capability and application of the above models remain constrained. First of all, these models require information about the pipeline itself, namely the diameter, age, materials, depth, pressure, etc. In addition to this detailed information about the pipelines, some of these models also require accident records to be continuously acquired, with little missing or corrupted data. Moreover, the prediction method of the leakage model has only achieved limited results so far and can only be applied to a simple and specific pipeline network system. For complex pipe network systems, the calculation speed and accuracy of real-time simulation software need to be further improved [19].
The municipality faces challenges not only in managing the expected large amount of replacements but also in quickly identifying and fixing problems as soon as possible. The situation is even worse in Beijing, where a large proportion of the water pipelines have been serving for longer than their projected lifetimes. Approximately 2000 km of water pipelines are scheduled for renewal during both the 12th FP (the 5-year plan) and the 13th FP. In addition, gridding urban management (GUM) is employed by the Beijing municipality to collect accurate and timely information on urban problems. GUM divides an area (e.g., a district) into a number of spatial grids and collects 173 types of urban problems, including pipe blasts and leaks, within five categories [20]. Urban management inspectors patrol the streets regularly and report problems to management centers.
Complete and detailed profiles of pipelines are not publicly available, which prohibits the application of the previously mentioned methods. In addition, the irregularity of the pipe leakage records also limits the potential application of these methods.
Therefore, we propose a decision tree machine learning method to assess the risk of water pipe accidents on a regional level. The water supply accident records collected by GUM were used as features. Moreover, the impact of the surrounding roads and buildings were included in this model; this extra information was extracted from a 3D electronic map.

Data
This study used 859 accident records of water supply pipelines collected by the Beijing GUM System, spanning from 1 January 2013 to 30 December 2016.
Roads and building information were derived from a 3D electronic map around the accident site. This information was parameterized by the width and height of the road, the area, and the coordinate values of buildings. The widths of the roads reflect the traffic load borne by the pipeline, and the surrounding buildings contribute to the static load of the pipes underneath.
A deformation analysis model of pipelines under static load was derived using Mindlin's solution [21]. The maximum settlement displacement was specified as 30 mm, a common criterion for most cities [22]. The impact of buildings and roads within 240 m of the site of an accident was evaluated. This value was chosen in order to cut off buildings whose contribution to the settlement displacement is less than 3 mm [23].
Moreover, the construction time was assumed to be the construction time of nearby buildings. Then, it was transformed into a categorical variable (Table 1). Table 1. Clustering of accident-related factors [5].

No.
Factors Description Classification Boundary

Preprocessing of Variables
In total, 12 relevant factors were derived by processing the acquired accident data. There were seven building factors, three time factors, one accident type factor, and one road factor.
All continuous variables were categorized, and the results of these factors are listed in Table 1. To facilitate categorization, lowess curves were first fitted using lowess in R. The curves were then clustered using K-means and subdivided into different segments according to the clustering results.
Below is an example of how the displacement was categorized (Figure 1). A lowess curve was first fitted based on the displacement and survival time since 1 January 2013. The curve was then clustered into five segments, and the final category was further manually reduced into three.

Preprocessing of Variables
In total, 12 relevant factors were derived by processing the acquired accident data. There were seven building factors, three time factors, one accident type factor, and one road factor.
All continuous variables were categorized, and the results of these factors are listed in Table 1. To facilitate categorization, lowess curves were first fitted using lowess in R. The curves were then clustered using K-means and subdivided into different segments according to the clustering results.
Below is an example of how the displacement was categorized (Figure 1). A lowess curve was first fitted based on the displacement and survival time since 1 January 2013. The curve was then clustered into five segments, and the final category was further manually reduced into three.

Data Preprocessing
Biganzoli et al. [24,25] showed that artificial neural networks (ANNs) can be applied to the modeling of survival data. Time intervals were used as additional inputs with explanatory variables, logic functions were used in the hidden layer, and the single output was the estimated failure

Data Preprocessing
Biganzoli et al. [24,25] showed that artificial neural networks (ANNs) can be applied to the modeling of survival data. Time intervals were used as additional inputs with explanatory variables, logic functions were used in the hidden layer, and the single output was the estimated failure probability. The survival data were transformed before fitting an ANN model. Table 2 is an example of the converted data. An original sample with a lifetime of four days was converted into four samples. Two new variables were added, namely time interval and outcome (i.e., survival or death), while the other variables, X1 and X2, remained unchanged. Our dataset contained 859 accident records. These events were converted using this method, resulting in approximately 530,000 records on a daily basis, with 1 for an accident and 0 for none.
The converted dataset was randomly divided into approximately 35,000 samples (approximately 66%) for the training set and the rest (approximately 34%) for testing.

Model
To build the decision tree model, we used the function ctree in an R package named Party [26]. The inputs included the 12 accident-related factors, in addition to the time interval. The only output was the risk coefficient, ranging between 0 and 1, which revealed the probability of pipeline accidents. The closer the output is to 1, the more likely the pipeline is to be damaged.
Based on all 530,000 day-based events, the average accident rate was approximately 0.16%. This is a typical configuration of an extremely unbalanced dataset. The decision tree outputted the accident risk ( Figure 2). The probabilities of leaves 12 and 15 were considerably higher than the average accident rate, indicating discriminative features. To quantify their effects, we introduced the amplification ratio of risk (ARR) to evaluate the model. The model parameters were optimized by two-fold cross-validation. Minsplit (the minimum number of samples in a node) was set to 100-800, span 100; minbusket (the minimum number of samples in a leaf node) was set from 25 to half of minsplit, span 25; maxdepth (the depth of the decision tree) was set from 3 to 8. Based on different combinations of the input parameters, a two-fold cross-validation model was established and the ARR was calculated. The results showed that the minimum sample size of the node set had no effect. When the minimum sample size of the leaf was 25 and the maxdepth was 3, the model results were optimal.
accidents. The closer the output is to 1, the more likely the pipeline is to be damaged.
Based on all 530,000 day-based events, the average accident rate was approximately 0.16%. This is a typical configuration of an extremely unbalanced dataset. The decision tree outputted the accident risk (Figure 2). The probabilities of leaves 12 and 15 were considerably higher than the average accident rate, indicating discriminative features. To quantify their effects, we introduced the amplification ratio of risk (ARR) to evaluate the model. The model parameters were optimized by two-fold cross-validation. Minsplit (the minimum number of samples in a node) was set to 100-800, span 100; minbusket (the minimum number of samples in a leaf node) was set from 25 to half of minsplit, span 25; maxdepth (the depth of the decision tree) was set from 3 to 8. Based on different combinations of the input parameters, a two-fold cross-validation model was established and the ARR was calculated. The results showed that the minimum sample size of the node set had no effect. When the minimum sample size of the leaf was 25 and the maxdepth was 3, the model results were optimal.
To determine the cut-off value, a trade-off between the ARR in the span of high risk and the percentage of positives (PoP) of the data needs to be achieved. A higher PoP indicates more regions to patrol, which may lead to extra costs. A higher ARR, on the other hand, may miss high-risk regions. To determine a reasonable trade-off, the training data were modeled 50 times by cross-validation, and lowess curves were fitted for different cut-offs with the ARR and PoP, respectively, as shown in Figures 3 and 4, respectively. To determine the cut-off value, a trade-off between the ARR in the span of high risk and the percentage of positives (PoP) of the data needs to be achieved. A higher PoP indicates more regions to patrol, which may lead to extra costs. A higher ARR, on the other hand, may miss high-risk regions.
To determine a reasonable trade-off, the training data were modeled 50 times by cross-validation, and lowess curves were fitted for different cut-offs with the ARR and PoP, respectively, as shown in Figures 3 and 4, respectively.

Output Model
The structure of the decision tree shows that time is the dominant factor in determining the risk of pipes in a region. For samples with a survival time longer than 1291 days (for example, a pipe older than 1291 days with no event records), the risk of an event occurring during June to December

Output Model
The structure of the decision tree shows that time is the dominant factor in determining the risk of pipes in a region. For samples with a survival time longer than 1291 days (for example, a pipe older than 1291 days with no event records), the risk of an event occurring during June to December is higher than during the other months of the year. For samples with a survival time longer than 950

Output Model
The structure of the decision tree shows that time is the dominant factor in determining the risk of pipes in a region. For samples with a survival time longer than 1291 days (for example, a pipe older than 1291 days with no event records), the risk of an event occurring during June to December is higher than during the other months of the year. For samples with a survival time longer than 950 days, a super-high-rise building may slightly increase the risk.

Fifty-Fold Cross-Validation
The 50-fold cross-validation results are listed in Figures 3 and 4. Figure 3 shows that the ARR increases along with the increment of the cut-off value, whereas the PoP decreases rapidly with the increment of the cut-off value, as shown in Figure 4.
A reasonable cut-off value may be picked from (0.018, 0.038), as a rapid drop of the PoP exists in this range. The lower limit was chosen to ensure that the ARR was greater than 20 (it reads 20.0 at 0.018 in Figure 3), and the upper limit was chosen to ensure that the PoP was higher than 0.1% (in Figure 4, it reads 0.105% at 0.038).

Model Performance
We chose the cut-off value to be 0.019, and we evaluated our model on the test set to determine its performance. The PoP and ARR results are listed in Table 3. The model predicts 0.41% of the test samples as being of high risk; the risk probability is 19.2 times higher than the average risk rate, i.e., an ARR of 19.2. A discrepancy of approximately 20% was found in the ARR of the test data compared to that of the 50-fold cross-validation. This indicates that the generalization capability of this model needs further improvement. This could be attributed to our limited dataset; moreover, the dataset was extremely unbalanced.

Discussion
Most of the aforementioned models in the literature survey require information about pipes, which is usually limited, especially for old urban regions for periods before the establishment of the information system for city management. We used event records collected by the GUM system. Compared to other methods, the advantage is its quick and cheap data acquisition, while the disadvantage is its limited prediction accuracy.
Despite the limited accuracy, management efficiency may be improved by identifying high-risk regions using model prediction. For example, for a PoP of 0.4% and an ARR of 19, approximately 8% (0.4%×19) of events may be found by patrolling 0.4% of the regions which are predicted to be of high risk by our model.
However, the generalization capability of our model needs to be further improved, as its accuracy is approximately 8%. This limited accuracy may be attributed to the quality of the raw data. Moreover, the absence of more discriminative features, such as the pipe material and diameter, the working pressure of the pipes, and the depth and age of the pipes, may also contribute to this limited accuracy. The features that we used were pressure from nearby buildings and roads, the age of the pipe as estimated from surrounding construction time, seasons when events occurred, and event type. To the best of our knowledge, none of the first three features have ever been used by other researchers.
It can be concluded from the output of the decision tree that the most discriminative factor in determining a high-risk region is survival time (Figure 2). The next discriminative feature is the month of the year. Figure 2 shows that the risk of an event occurring during June to December is considerably higher than in other months of the same year. The event types in the GUS event records are mostly given by non-professionals or non-experts, which is usually a subjective decision based on the severity of the leakage. This inaccuracy may be attributed to the insignificance of this feature. The width of roads, the pressure of buildings, and the regional construction year contribute little to the identification of high-risk regions. However, super-high-rise buildings may slightly increase the risk of pipe events. The feature of age shows limited contribution to this risk, which is consistent with [4].
Additionally, the transformed data are extremely imbalanced. This could also have contributed to the limited accuracy. If the data were transformed in a month-based way, such an imbalance could be alleviated.
Accumulating more data always improves the prediction accuracy. Meanwhile, a hierarchical data model would help, since the maintenance and management of different urban areas affects the occurrence of pipeline events.