1. Introduction
Water distribution networks (WDNs) are essential infrastructure systems that transport treated potable water from treatment facilities to end users, including households, businesses, and industries. These systems typically consist of a network of interconnected pipes, pumps, valves, storage reservoirs, and service lines, all working together to ensure a reliable water supply. Ensuring the effective operation and maintenance of WDNs is crucial for public health, economic stability, and environmental protection. As WDNs age, maintaining them has become increasingly challenging, especially given the global trend of ageing infrastructure [1,2,3,4]. The importance of maintaining a WDN lies in its role in preventing issues such as water contamination, pressure imbalances, and supply shortages. Failures in the system, such as pipe bursts, leaks, and valve malfunctions, can lead to substantial water losses, infrastructural damage, and disruptions to essential services. Such failures often result from material degradation, external interference, or operational stress, highlighting the need for robust maintenance strategies [5,6,7]. Historically, reactive maintenance has been the predominant approach to managing water network failures, where issues are addressed only after they occur. However, this approach often leads to higher costs, increased water loss, and service interruptions [8,9,10]. In contrast, predictive maintenance, which leverages data-driven techniques such as sensors, pressure monitoring, and analytics, is emerging as a more efficient and proactive solution [11,12,13]. By predicting potential failures and addressing them before they escalate, predictive maintenance reduces operational costs and enhances system reliability [14,15,16,17].
The prediction of water pipe failures has evolved significantly, advancing from traditional statistical models to more sophisticated data-driven and machine learning (ML) approaches. This shift is driven by the increasing complexity of water distribution systems and the pressing need for more accurate predictive maintenance strategies [18]. In the early stages, statistical models, such as those based on Weibull and lognormal distributions, were widely used to predict pipe failures. These models primarily relied on historical failure data to forecast future events, providing a foundational method for understanding the structural deterioration of water mains. However, the reliability of such models was highly dependent on the availability and quality of historical data, posing challenges when data were scarce or incomplete. Studies have highlighted these limitations, noting the reduced accuracy of predictions under such conditions [19,20,21].
With advancements in technology, machine learning techniques have emerged as more effective tools for failure prediction in WDNs. Models such as Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) have been successfully applied to analyse complex patterns in operational data, enhancing the predictive accuracy of these systems [22,23,24,25]. Later, more advanced ML models, including tree-based models such as Random Forest (RF) and eXtreme Gradient Boosting (XGB), demonstrated improved performance in water pipe failure predictions [26]. Comparisons between traditional statistical models and ML approaches have shown that the latter, particularly tree-based models, often outperform earlier methods in predicting individual pipe failures [27,28].
Ensemble models, which combine multiple ML algorithms, have become essential for enhancing failure prediction accuracy in WDNs. These models often integrate supervised and unsupervised techniques, leveraging the strengths of each. For example, clustering (unsupervised) groups pipes based on failure patterns, while regression or classification (supervised) predicts the likelihood of failure [29]. Stacking, a common ensemble method, combines predictions from multiple models using a meta-learner, increasing overall accuracy and robustness, particularly in scenarios with high variability or noisy data [30].
In recent years, the integration of ML models with traditional statistical and physical models has further enhanced predictive capabilities. Combining these methods has been shown to improve the accuracy of condition predictions for water pipes, particularly in cases involving corrosion depth or structural degradation [31,32,33,34]. Additionally, optimised inspection schedules can now be developed using ML models, aiding maintenance planning by accurately predicting failure risks [17,35]. The use of diverse data sources, including environmental factors and historical failure records, has also refined predictive models, with techniques such as AutoML playing a crucial role in these advancements [13,36].
Although models have been developed for the condition assessment and failure prediction of mains, little attention has been given to failures in service lines. Service lines are small-diameter pipes branching from mains to deliver water to individual customers. The primary causes of service line failures include the use of low-quality materials, improper installation, and age-related deterioration. In most countries, the section of the service line between the main and the property boundary (public service line) is owned by the utility, while the remaining section within the property (private service line) is the responsibility of the customer. Although it is the customer’s duty to address failures or leaks on the private line, such issues can still lead to supply disruptions on the customer’s side. Therefore, investigating the failure modes of service lines and implementing predictive maintenance for these assets is essential.
The likelihood of failures in service lines is influenced by factors such as pipe material, age, diameter, service pressure, flow rate, and other hydraulic characteristics [37]. Additionally, water quality and microbiological factors are known contributors to many failures [38]. Service lines are also vulnerable to environmental conditions, including bedding quality, soil corrosivity, soil stability, stray electrical currents, and temperature fluctuations, particularly freezing soil. Other common causes of failure include external damage from third parties, such as contractors accidentally striking water pipes during excavation, as well as issues stemming from improper installation and inadequate installation supervision [39].
A recent survey of case study WDNs in various countries revealed significant differences in the proportion of total water network failures attributed to service lines. In the UK and Italy, service lines account for less than 30% of total failures, whereas in Tokyo, Taiwan, the Philippines, Jordan, and New Zealand, this figure exceeds 80% [40]. In the US, Lee and Meehan [41] developed a model for predicting service line failures using nine years of failure data. They tested several statistical distributions, including exponential, Weibull, and lognormal, finding the Weibull distribution to be the most effective for generating survival functions across different pipe materials and ages. Their findings revealed distinct survival rates depending on material, with certain types, such as lead and copper, generally lasting longer than materials like polyethylene. Van Hecke [42] compared the benefits of using 316-stainless steel corrugated pipes for service lines with those of polyethylene pipes. For a life cycle cost analysis, they estimated the failure likelihood for service lines made from these materials by calculating the ratio of failures to the total number of service lines. Their findings reported a failure rate of 0.003 failures per year for the Tokyo WDN and 0.001 failures per year for an Australian utility. The analysis, however, did not account for additional factors such as pipe diameter, age, and other variables beyond material type.
Although many studies have focused on predicting failures in water mains, failures in service lines remain insufficiently understood and poorly predicted, despite their major contribution to overall network failures and customer disruptions. Existing models often rely on incomplete data and are not designed for the unique characteristics of service lines, such as material variability, installation quality, and environmental influences. This gap highlights the need for a reliable, data-driven predictive framework that can identify high-risk service lines and support proactive maintenance planning.
In this research, a machine-learning-based predictive maintenance model is developed to forecast failures in service lines, representing the main contribution of the study. The objectives of the research are as follows: (1) developing a predictive maintenance model using machine learning algorithms to predict service line failures; (2) enabling proactive decision-making by estimating the likelihood of service line failures; (3) supporting asset managers and utility operators in formulating short-, medium-, and long-term preventive investment plans; and (4) reducing supply interruptions and enhancing the reliability of water distribution systems.
The article is organised as follows: the first section presents the introduction and literature review. The case study section focuses on part of the Tehran water distribution network. The methodology section details the machine learning modelling employed in the study. The results section highlights the findings and analyses, while the discussion section examines why the models behave as observed, data limitations, comparisons of model performance, and implications for utilities. Finally, the last section provides the conclusions and highlights the key findings, practical applications, and limitations of the study for water utilities.
2. Case Study
To evaluate the performance of the proposed approaches, they were applied to the public service lines in a case study WDN in Tehran, Iran. This case study was chosen because service line failures accounted for 98% of all recorded bursts. Specifically, in 2023, only 100 of the more than 6000 recorded failures involved the mains; the rest occurred on service lines. This network spans 1400 km of pipes across an 80 km² area, serving a population of approximately 1.5 million, equivalent to nearly 205,000 customers. The high number of people per customer is due to the use of a shared service line for all properties within a single building. The WDN has a total demand of around 4 m³/s, supplied by 10 separate service reservoirs.
Customer data primarily includes information on the length, diameter, material type, installation date, and elevation of service lines. The utility had a calibrated and up-to-date hydraulic model, allowing for the determination of maximum and minimum daily pressures at each service line. Additionally, historical billing data was used to establish the minimum and peak demand for each service line.
Figure 1a shows the distribution of service line lengths, with more than half of the pipes measuring between 6 and 12 metres.
Figure 1b illustrates the installation years of the service lines, with a significant number installed between 1986 and 1994, indicating a period of rapid development in the area. Another phase of WDN expansion occurred between 2000 and 2004, as reflected in Figure 1b.
Table 1 presents the distribution of the service lines based on their diameter and material. Around 80% of the service lines have a diameter of 1/2″ (12.7 mm), 15% a diameter of 3/4″, and only 5% are 1″ or larger. Half of the service lines are made of ductile iron (older generation), more than 30% of polyethylene (newer generation), and 20% of other materials.
Figure 2a illustrates the distribution of minimum and maximum pressures across service lines. Over 50% of customers experience a minimum daily pressure between 15 and 25 m, while the maximum daily pressure shows a more even distribution. Furthermore, more than 88,000 failures were recorded in work orders between 2013 and 2023.
Figure 2b displays the reported number of service line failures over this period. Excluding 2013, which contains incomplete data, the trend shows a general decline. The utility attributes this improvement primarily to the implementation of pressure management programmes, regular replacement of ageing service lines, fewer supply interruptions, and advancements in materials quality.
3. Methodology
In this study, machine learning models were developed to predict the likelihood of failure (LoF) in service lines. The models were trained using service line failure data, capturing the complex relationships between service line attributes—such as length, age, diameter, material, demand, and pressure—and the LoF.
Figure 3 presents the flowchart of the methodology used in this research. Failure data from 2013 to 2021 were used to train the classification models, which predict whether a pipe is likely to fail in the future.
To assess the impact of training data length on prediction accuracy, the training process was repeated eight times using datasets spanning 2 to 9 years. For example, failure data from 2013 to 2014 were used for the 2-year training, while data from 2013 to 2021 were used for the 9-year training. In all cases, failures occurring in 2022–2023 were used to validate the predictions.
3.1. Data Preparation
The utilised dataset comprised two distinct sub-datasets, as follows:
Asset Data: This dataset contained comprehensive details about the service line pipes, including the customer’s unique ID number, as well as the diameter, length, age, installation depth, and material type of the service line. As outlined in the Case Study section, minimum and maximum pressures for each customer, obtained from the hydraulic model, were incorporated into the dataset. Historical customer data was also utilised to include the monthly consumption (demand) for each customer. Additionally, the dataset provided information on the reservoir supplying each service line and the diameter and material of the main pipe connected to it.
Failure Data: This contained records of failures occurring on the service lines, including the customer’s ID number, pipe diameter, and details of each failure, such as the date, type, and cause. Each failure was uniquely identified by the customer ID number, linking it to the corresponding asset data.
To streamline the modelling process, these two datasets were merged into a single comprehensive dataset. This integration enabled the identification of the number of incidents associated with each service line in the asset data for each year. Ultimately, the event entries for each year were encoded as a binary failure/no-failure label, since multiple records for the same pipe within a single year likely indicate recurring incidents or repairs of the same failure.
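A minimal sketch of this merge-and-binarise step, assuming hypothetical column names (`customer_id`, `failure_date`) since the actual schema is not given in the text:

```python
# Illustrative merge of asset and failure records into yearly binary labels.
# Column names and values are assumptions, not the study's actual schema.
import pandas as pd

assets = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "diameter_mm": [12.7, 19.05, 12.7],
    "material": ["DI", "PE", "DI"],
})
failures = pd.DataFrame({
    "customer_id": [101, 101, 103],
    "failure_date": pd.to_datetime(["2015-03-01", "2015-08-12", "2017-06-30"]),
})

failures["year"] = failures["failure_date"].dt.year
# Multiple records for the same pipe in one year collapse to a single label
labels = (failures.groupby(["customer_id", "year"]).size() > 0).astype(int)
labels = labels.rename("failed").reset_index()

dataset = assets.merge(labels, on="customer_id", how="left")
dataset["failed"] = dataset["failed"].fillna(0).astype(int)
```

Customer 101 here has two recorded failures in 2015 but receives a single binary label for that year, mirroring the treatment described above.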
3.2. Classification
Classifiers are machine learning algorithms designed to learn from a subset of data and categorise unseen data into specific classes. These algorithms are widely used across various domains. In this research, Random Forest, Extreme Gradient Boosting, and Long Short-Term Memory models were employed to predict failures in service line pipes. Although the model prediction performance is evaluated in a two-year period, the purpose of this study is not to predict the exact timing of failures within a specified future time horizon. Instead, the models estimate the likelihood of failure based on historical asset attributes and failure records. The classification output indicates relative failure risk rather than a time-to-failure prediction.
3.2.1. Random Forest
Random Forest (RF) is an ensemble machine learning method commonly used for classification and regression tasks, built on the principle of aggregating multiple decision trees [43]. Ensemble algorithms are generally categorised into three types: bagging, boosting, and stacking. Bagging involves generating several individual estimators and combining their predictions through averaging (for regression) or majority voting (for classification) to produce the final ensemble prediction. Random Forest is a well-known example of a bagging technique [44].
Conventional regression trees often suffer from overfitting, which leads to poor generalisation performance on unseen data [45]. Random Forest mitigates this issue by introducing randomness into the tree-building process. This is achieved by selecting a random subset of features at each split within the tree structure. The individual decision trees, trained on different subsets of data, are then combined to form a robust predictor. Predictions are made by averaging the outputs of the individual trees [43]. RF was selected for its ability to model complex nonlinear relationships in structured asset data while remaining robust to noise and overfitting [27,28].
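A minimal sketch of such a Random Forest classifier on synthetic service-line features; the feature set and the failure rule below are invented for illustration and do not reproduce the study's actual pipeline:

```python
# Sketch of a Random Forest failure classifier on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.uniform(2, 30, n),               # length (m) - illustrative
    rng.integers(1, 40, n),              # age (years) - illustrative
    rng.choice([12.7, 19.05, 25.4], n),  # diameter (mm) - illustrative
    rng.uniform(15, 60, n),              # max daily pressure (m) - illustrative
])
# Invented rule: older, higher-pressure lines fail more often
y = ((X[:, 1] > 25) & (X[:, 3] > 40)).astype(int)

model = RandomForestClassifier(
    n_estimators=100,
    class_weight="balanced",  # up-weights the rare failure class
    random_state=0,
)
model.fit(X, y)
lof = model.predict_proba(X)[:, 1]  # likelihood-of-failure scores in [0, 1]
```

The `predict_proba` output serves directly as the Likelihood of Failure (LoF) score used later for ranking pipes.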
3.2.2. Extreme Gradient Boosting
Extreme Gradient Boosting (XGB) is a highly efficient machine learning algorithm that combines the predictions of multiple weak tree models into a robust additive ensemble [46]. The final prediction output of the model is expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

where $\hat{y}_i$ is the final prediction for sample $i$; $f_k(x_i)$ is the contribution of the $k$-th weak tree; and $K$ is the number of iterations.
While derived from Gradient Boosted Decision Trees (GBDT), XGB offers significant enhancements in two key aspects: it introduces regularisation terms that control model complexity and reduce overfitting, and it employs algorithmic optimisations, such as parallelised tree construction and sparsity-aware split finding, that improve computational efficiency. As with many machine learning models, XGB requires hyperparameter tuning to optimise its performance. XGB was chosen for its high predictive accuracy on tabular datasets and its effectiveness in handling imbalanced failure data through boosted learning and regularisation [46,47]. Given the large scale of the dataset in this research, XGB was selected as the preferred model for its efficiency and scalability.
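The additive prediction above can be illustrated with scikit-learn's `GradientBoostingClassifier` as a stand-in for XGB (the xgboost library exposes a similar fit/predict interface); the data here are synthetic:

```python
# Sketch of boosted-tree classification; GradientBoostingClassifier is used
# as a stand-in for the XGB algorithm described in the text.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(400, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # invented labelling rule

# Each boosting iteration adds one weak tree f_k to the ensemble;
# n_estimators corresponds to K in the additive model above.
gb = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=1)
gb.fit(X, y)
proba = gb.predict_proba(X)[:, 1]
```

In practice, hyperparameters such as the number of trees, tree depth, and learning rate would be tuned rather than fixed as here.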
3.2.3. Long Short-Term Memory
In this study, a Long Short-Term Memory (LSTM) model was used to classify service line failures based on available features of the pipes. The input data $X \in \mathbb{R}^{N \times F}$, where $N$ is the number of samples and $F$ is the number of features, is reshaped into a three-dimensional format to match the LSTM's input structure.
The LSTM processes the data sequentially by updating its internal state $c_t$ at each time step $t$ based on the current input $x_t$ and the previous hidden state $h_{t-1}$. The key operations in an LSTM cell, such as the forget gate, input gate, and cell state updates, allow it to capture long-term dependencies in the data. Applying the three gating functions, the internal state at each time step can be calculated as:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $f_t$ is the forget gate, which decides how much of the previous cell state $c_{t-1}$ to retain; $i_t$ is the input gate, which determines how much new information $\tilde{c}_t$, derived from the current input $x_t$, should be added to the cell state; $o_t$, known as the output gate, controls how much of the cell state $c_t$ is passed to the hidden state $h_t$; and $\odot$ denotes the element-wise (Hadamard) product.
After processing the input, a fully connected layer with a sigmoid activation function produces a probability:

$$\hat{y} = \sigma(W h_T + b)$$

where $\hat{y}$ represents the predicted probability of failure and $\sigma$ represents the sigmoid activation function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function maps the input $z$ to a range between 0 and 1, which is interpreted as a likelihood. $W$ is a weight matrix that connects the hidden state $h_T$ to the output layer; it defines how the hidden state contributes to the prediction. $b$ is a bias term added to the linear transformation to provide flexibility in fitting the model.
The model is trained by minimising the binary cross-entropy loss:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where $y_i$ is the true label for sample $i$; $\hat{y}_i$ is the predicted probability for sample $i$, between 0 and 1; and $N$ is the total number of samples in the training data [48,49].
LSTM is a robust algorithm designed for training on time-series data, making it particularly suitable for failure prediction in water pipes. Its ability to capture temporal dependencies, learn patterns from historical failure data, and detect subtle anomalies enables it to identify potential future failures effectively [50]. LSTM is theoretically well suited to this problem because service line failures exhibit temporal dependence, where past failure events and time-varying attributes influence future failure likelihood.
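The gate equations above can be made concrete with a single LSTM cell step in NumPy; the weights below are random placeholders, not trained values:

```python
# One LSTM cell step implementing the forget/input/output gate equations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold the four gates' parameters stacked."""
    z = W @ x_t + U @ h_prev + b          # pre-activations for all gates
    H = h_prev.size
    f_t = sigmoid(z[0:H])                 # forget gate
    i_t = sigmoid(z[H:2*H])               # input gate
    o_t = sigmoid(z[2*H:3*H])             # output gate
    c_tilde = np.tanh(z[3*H:4*H])         # candidate new information
    c_t = f_t * c_prev + i_t * c_tilde    # element-wise (Hadamard) products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
F, H = 6, 4                               # illustrative feature/hidden sizes
W = rng.normal(size=(4 * H, F))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(3):                        # process a short input sequence
    h, c = lstm_cell(rng.normal(size=F), h, c, W, U, b)
```

Because $h_t = o_t \odot \tanh(c_t)$ with $o_t \in (0, 1)$, every component of the hidden state stays strictly inside $(-1, 1)$.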
3.3. Data Partitioning and Treatment
To reflect the temporal nature of asset failures and avoid information leakage, the dataset was split chronologically into training and validation sets. Historical failure records from 2013 to 2021 were used for model training, while failures occurring between 2022 and 2023 were reserved as an unseen validation dataset. In this study, the temporal dimension is incorporated through the progressive expansion of the training period rather than through explicit multi-time-step sequence construction at the individual pipe level. To investigate the effect of training data length on model performance, multiple training windows were considered, ranging from 2 to 9 years. For example, the 2-year split included failures from 2013 to 2014, whereas the 9-year split covered the full 2013 to 2021 period. The validation dataset remained identical across all experiments to ensure consistent and comparable performance evaluation.
Cross-validation was not applied in this study due to the temporally ordered nature of the failure data. Random or repeated data splitting can lead to information leakage by allowing future observations to influence model training. Instead, a strict chronological split was adopted to reflect real-world deployment, where models are trained on historical data and applied to predict future failures.
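The expanding-window scheme described above can be sketched as follows; the function name is illustrative, with year boundaries taken from the text:

```python
# Expanding training windows (2 to 9 years) with a fixed validation period,
# mirroring the chronological split described in Section 3.3.
def expanding_windows(first_year=2013, last_train_year=2021,
                      val_years=(2022, 2023), min_len=2):
    splits = []
    for end in range(first_year + min_len - 1, last_train_year + 1):
        train = list(range(first_year, end + 1))
        splits.append((train, list(val_years)))
    return splits

splits = expanding_windows()
# 8 splits: 2013-2014, 2013-2015, ..., 2013-2021,
# each validated on the identical 2022-2023 window.
```

Keeping the validation years fixed across all eight splits is what makes the performance curves in the Results section directly comparable.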
The failure prediction problem is inherently imbalanced, with non-failure instances significantly outnumbering failures. To address this issue, a class weighting approach was adopted. Higher weights were assigned to the minority failure class during model training, increasing its contribution to the loss function. This encouraged the models to pay greater attention to failure events and improved their ability to identify high-risk service lines without altering the original data distribution. Class weighting was applied only to the training dataset.
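One common way to realise such class weighting is inverse-frequency weights; the study does not specify its exact scheme, so the following is an illustrative sketch:

```python
# Inverse-frequency ("balanced") class weights for an imbalanced training set.
# This is one common heuristic, not necessarily the study's exact scheme.
from collections import Counter

def balanced_weights(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

y_train = [0] * 950 + [1] * 50      # illustrative 5% failure prevalence
weights = balanced_weights(y_train)
# The rare failure class receives a much larger weight than the majority
# class, so each failure contributes more to the training loss.
```

These per-class weights can be passed to most classifiers (e.g. via `class_weight` in scikit-learn or per-sample loss weights in deep learning frameworks).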
3.4. Evaluation of Binary Classifiers
To assess the accuracy of predictions in binary classification, four key elements are considered: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).
- True Positives (TP): Instances correctly predicted as positive.
- False Negatives (FN): Positive instances incorrectly predicted as negative.
- False Positives (FP): Negative instances incorrectly predicted as positive.
- True Negatives (TN): Instances correctly predicted as negative.
These elements are commonly visualised in a confusion matrix (Figure 4), which provides a clear summary of the model's classification performance.
The F1-score is a metric used to evaluate the performance of a binary classification model by considering both precision and recall. Precision is the proportion of correctly predicted positive instances among all predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the proportion of correctly predicted positive instances among all actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1-score is given as the harmonic mean of precision and recall, expressed mathematically as:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
This metric is particularly useful for imbalanced datasets, as it provides a balanced measure that accounts for both false positives and false negatives. A higher F1-score indicates better model performance, with a value of 1 representing perfect precision and recall.
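As a quick illustration, the three metrics can be computed directly from confusion-matrix counts; the counts below are invented for the example:

```python
# Precision, recall, and F1 from confusion-matrix counts.
def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: tp=30, fp=20, fn=10 gives precision 0.6 and recall 0.75,
# so the harmonic mean lies between the two, closer to the smaller value.
score = f1_from_counts(30, 20, 10)
```

Note that true negatives do not appear in the F1 formula, which is precisely why the metric remains informative when the negative class dominates.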
The Receiver Operating Characteristic (ROC) curve is a graphical tool used to evaluate the performance of a binary classification model. It plots the True Positive Rate (TPR), also known as Sensitivity, against the False Positive Rate (FPR), which is equal to 1 − Specificity, across various threshold levels. The True Positive Rate (TPR = TP/(TP + FN)) is the proportion of actual positives correctly identified by the model, while the False Positive Rate (FPR = FP/(FP + TN)) is the proportion of actual negatives incorrectly classified as positives. The ROC curve demonstrates the model's ability to distinguish between the two classes. An ideal classifier would have a point at the top-left corner of the plot (TPR = 1, FPR = 0), indicating perfect performance, while a random classifier would produce a diagonal line. This makes the ROC curve a valuable tool for comparing and assessing classification models [51].
The Area Under the Curve (AUC) provides a single summary value of the ROC curve, where a value of 1 indicates a perfect model and 0.5 represents random chance. AUC-ROC is widely used to assess the performance of classification models. However, it has been shown to be biased when dealing with imbalanced datasets, where one class significantly outnumbers the other [52]. For example, in datasets related to pipe breaks, non-break instances often outnumber those with breaks. To address this, the AUC of the Precision–Recall Curve (AUC-PRC) is used, as it is less influenced by class imbalance [53]. However, in highly imbalanced failure datasets, the baseline AUC-PRC of a random classifier is equal to the prevalence of the failure class; therefore, absolute AUC-PRC values should be interpreted relative to this baseline rather than in isolation [52]. Both the ROC and PR curves are derived from the confusion matrix at different threshold levels [54].
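For intuition, AUC-ROC can also be computed threshold-free as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (the Mann–Whitney interpretation); the labels and scores below are invented:

```python
# AUC-ROC via the Mann-Whitney formulation: the fraction of positive/negative
# pairs in which the positive instance is scored higher (ties count half).
def auc_roc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0, 0]
s = [0.9, 0.4, 0.5, 0.3, 0.2]
# One of the six positive/negative pairs is mis-ordered (0.4 < 0.5),
# so five of six pairs are ranked correctly.
```

This pairwise view makes clear why AUC-ROC reflects ranking quality rather than any single classification threshold.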
To evaluate the practical utility of the models developed in this study, a different approach was adopted. The classifiers provide a Likelihood of Failure (LoF) score between 0 and 1 for each service line, indicating the probability of failure. The higher the LoF score, the greater the likelihood of failure. All pipes were ranked based on their LoF values, and it was assumed that a certain percentage (p) of the pipes would be replaced. The top p percent of pipes, with the highest LoF scores, were selected as candidates for replacement. The number of actual failures in this subset during the validation period was then calculated. This value was compared with the expected number of failures if p percent of the pipes were randomly replaced within the WDN. In a random replacement scenario, replacing p percent of the pipes would typically capture p percent of the total failures. This metric helps quantify how many failures could be avoided in the validation period by replacing p percent of the pipes according to the model’s predictions.
In this study, p ranged from 0.1% to 20%. Lower percentiles (e.g., 0.1%, 0.2%, 0.5%, 1%, and 2%) correspond to short-term, tactical pipe replacement plans, while higher percentiles (e.g., 5%, 10%, and 20%) align with long-term, strategic asset rehabilitation programmes. This evaluation method helps asset managers assess which model offers the most value for either tactical or strategic replacement decisions [13]. For this reason, model performance was additionally evaluated using a replacement-priority framework, which directly reflects operational decision-making and provides a more meaningful assessment than threshold-dependent metrics alone.
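The replacement-priority evaluation can be sketched as follows; the LoF scores and failure flags below are invented for illustration:

```python
# Rank pipes by LoF, "replace" the top p percent, and measure the share of
# validation-period failures captured versus the p% expected at random.
def failures_captured(lof, failed, p):
    n_top = max(1, int(len(lof) * p / 100))
    ranked = sorted(range(len(lof)), key=lambda i: lof[i], reverse=True)
    top = ranked[:n_top]
    captured = sum(failed[i] for i in top)
    return captured / sum(failed)  # share of all failures avoided

lof    = [0.9, 0.8, 0.7, 0.2, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]
failed = [1,   1,   0,   1,   0,   0,    0,    0,    0,    0]
share = failures_captured(lof, failed, p=20)
# Replacing the top 20% (2 of 10 pipes) captures 2 of the 3 failures,
# versus the 20% of failures expected under random replacement.
```

Comparing `share` against `p / 100` quantifies the uplift of model-guided replacement over the random baseline used in the text.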
4. Results
After training each classification model on training sets ranging from 2 to 9 years in length, the predictive performance for service line failures was evaluated using the validation period (2022–2023).
Figure 5a depicts the relationship between the F1-score and the number of years included in the training set. The F1-score was chosen as the evaluation metric due to its ability to balance precision and recall, which can often be at odds. As shown, the RF model achieves the highest performance, closely followed by the XGB model. Although the LSTM model demonstrates comparatively lower performance, it exhibits a notable increase in F1-score as the training set lengthens. This improvement is likely due to LSTM’s capacity to process sequential data and retain information through its hidden state over time.
Figure 5b presents the ROC curve for the RF, XGB, and LSTM models trained over a 9-year period. The corresponding Areas Under the ROC Curve (AUC-ROC) for RF, XGB, and LSTM are 0.61, 0.58, and 0.60, respectively. These values suggest that all three models outperform a random baseline, demonstrating their ability to distinguish between the two classes.
Figure 6a illustrates the relationship between the number of years in the training set and the AUC-ROC scores for all models. As the size of the training set increases, the AUC-ROC scores improve across all models. Notably, while the LSTM model initially exhibits weaker performance compared to RF and XGB, its performance shows significant improvement with a larger training set. This enhancement reflects the LSTM architecture’s capacity to leverage sequential data and capture temporal dependencies effectively.
Figure 6b depicts the AUC-PRC scores as a function of the number of years in the training set. Although the absolute AUC-PRC values are relatively low, this outcome is expected due to the extreme class imbalance in service line failure data. Failure events constitute only a small fraction of the total observations, resulting in a very low random baseline AUC-PRC. The reported values therefore represent a meaningful improvement over random performance. The results indicate that the performance of all models improves with the inclusion of more years in the training set. Among the models, RF exhibits the highest performance, with XGB and LSTM following closely behind. This trend underscores the benefit of larger training datasets in enhancing model accuracy and reliability.
In this case study, the utility implements a preventive maintenance programme, replacing approximately 0.5% of service lines annually. This programme primarily targets service lines based on their age and material type, alongside replacing older service lines in newly developed areas. A 1% replacement of service lines corresponds to a two-year replacement cycle, while a 2% replacement aligns with a four-year schedule.
Figure 7a illustrates the short-term replacement schedule, showing that prioritising 1% of service lines for replacement based on the models' recommendations results in an estimated 2% reduction in failures. Similarly, Figure 7b demonstrates that replacing 2% of service lines as suggested by the models leads to a projected 4% decrease in failures. Therefore, in tactical asset management, using a predictive machine learning model trained on less than 10 years of failure data can significantly enhance the effectiveness of investments, achieving twice the improvement compared to random asset replacement strategies.
Considering a replacement rate of 0.5% per year, a 5% replacement rate corresponds to a 10-year plan, while a 20% replacement rate aligns with a 40-year schedule.
Figure 8 demonstrates the efficiency of the proposed models for strategic asset management.
Figure 8a shows that replacing 5% of service lines, as recommended by all models, results in an approximate failure reduction ranging from 7.5% to 8.5%. Similarly, Figure 8b indicates that a 20% replacement, as suggested by the models, leads to a failure reduction of approximately 26% to 29%. These results reveal that the proposed models are more effective for tactical asset management programmes than for strategic ones.
5. Discussion
Service lines comprise multiple elements, and failures may be attributed to different components. The varying proportion of failures across these elements helps identify the components most susceptible to failure in this case study.
Figure 9 illustrates the various failure modes on a typical service line in the case study area. Over one-third of all service line failures occur in shut-off valves, primarily due to frequent use by customers. More than 21% of failures are associated with the components connecting the meter to the service line, which are also affected by their proximity to the frequently used shut-off valve. Additionally, 11% of failures were observed in the upstream joint of the meter. These joints experience higher wear due to the connection of two elements (meter and pipe) made from different materials. Non-homogeneous connections, such as curb box joints, exhibit lower failure rates as they are typically embedded in concrete or soil, offering additional protection. Overall, more than 75% of all service line failures in this case study occur within the meter box, highlighting the importance of careful installation and maintenance of its components. While out-of-box elements have lower failure rates, their repair is more time-consuming and costly due to their deep installation.
Although the models show minor performance differences, they produce broadly similar results, indicating that all are capable of capturing meaningful failure-related patterns. Random Forest (RF) achieved the highest performance, followed closely by Extreme Gradient Boosting (XGB), while the Long Short-Term Memory (LSTM) network performed worse but improved with longer training records. This similarity is largely due to the predominantly static nature of the input features, such as material, diameter, length, and pressure, which contain sufficient information to distinguish higher-risk service lines. Tree-based models effectively capture nonlinear relationships and handle class imbalance, whereas the LSTM benefits mainly from longer historical records. Overall, the consistent performance improvement with increasing training data length highlights the importance of long-term failure data in capturing degradation trends and accumulated risk.
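The tree-based setup discussed above can be sketched with a standard scikit-learn pipeline. The feature columns mirror the static attributes named in the paper, but the data, value ranges, and hyperparameters here are hypothetical placeholders rather than the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5000
X = np.column_stack([
    rng.integers(15, 50, n),   # diameter (mm)
    rng.uniform(2, 30, n),     # length (m)
    rng.integers(1, 60, n),    # age (years)
    rng.integers(0, 3, n),     # material (label-encoded)
    rng.uniform(2, 8, n),      # pressure (bar)
])
# Synthetic rare failures whose probability rises with age.
y = (rng.random(n) < 0.02 + 0.04 * X[:, 2] / 60).astype(int)

# class_weight="balanced" counteracts the strong class imbalance
# typical of annual failure labels.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0).fit(X[:4000], y[:4000])
print(roc_auc_score(y[4000:], clf.predict_proba(X[4000:])[:, 1]))
```

Because the inputs are static, a plain tabular model of this kind has access to essentially the same information as a sequence model, which is consistent with the small performance gap reported above.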
Most prior studies on pipe failure prediction have focused on water mains rather than service lines [18,19,22,27]. Reported AUC-ROC values for main failure prediction using machine learning models typically range between 0.60 and 0.75, depending on the richness of the feature set and network characteristics [26,27,28]. The AUC-ROC values achieved in this study (approximately 0.58–0.61) are therefore comparable, particularly given the limited feature set and the higher uncertainty associated with small-diameter service lines. In such rare-event prediction problems, precision inevitably declines as recall increases, which is a well-known characteristic of imbalanced classification rather than an indication of model inadequacy. International studies have shown that service lines often dominate failure statistics, accounting for more than 80% of failures in several utilities worldwide [40]. Despite this, predictive modelling of service line failures remains scarce. Existing studies have largely relied on survival analysis or simplified statistical approaches [41,42], which typically focus on a limited number of explanatory variables.
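The precision-recall behaviour noted above can be demonstrated on synthetic data: with rare positives, AUC-PRC sits far below AUC-ROC even for a usefully discriminative scorer. The class prevalence and score distributions below are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
y = (rng.random(20_000) < 0.02).astype(int)       # ~2% positive class
scores = rng.normal(0.0, 1.0, 20_000) + 0.8 * y   # positives score higher

print("AUC-ROC:", roc_auc_score(y, scores))            # roughly 0.7
print("AUC-PRC:", average_precision_score(y, scores))  # far lower: positives are rare
```

The gap between the two metrics is driven by the 2% base rate, not by the scorer itself, which illustrates why a modest AUC-PRC in rare-event settings does not by itself indicate model inadequacy.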
As previously discussed, the available features for service lines did not fully capture all factors influencing failure. While attributes such as age, material, and pressure are primary determinants of failure rates, incorporating additional features—such as joint type, soil type, and weather data—could significantly enhance the predictive capabilities of the models. Material type, although included, is less informative since approximately half of the service lines share the same material. However, service lines with the same material but produced by different manufacturers often exhibit varying qualities and durability. Introducing a new feature, such as ‘service line manufacturer,’ could enable the models to account for the impact of production differences on failure rates.
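A manufacturer attribute of the kind suggested above could be folded into the existing feature set with standard categorical encoding. The column names and values in this sketch are hypothetical placeholders.

```python
import pandas as pd

# Toy table with the proposed 'manufacturer' attribute added alongside
# existing features; manufacturer labels here are invented.
lines = pd.DataFrame({
    "material":     ["PE", "PE", "PVC", "PE"],
    "manufacturer": ["A",  "B",  "A",   "C"],
    "age":          [12, 30, 8, 21],
})

# One-hot encoding lets a tree-based model separate lines that share a
# material but come from different producers.
features = pd.get_dummies(lines, columns=["material", "manufacturer"])
print(sorted(features.columns))
```

This keeps material and manufacturer as independent signals, so lines of the same material can still receive different risk estimates.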
The pressure data used in this study were derived from a hydraulic model; replacing this with actual pressure measurements, particularly capturing sudden pressure fluctuations, could further improve predictive accuracy. Additionally, in utilities where failure reasons are systematically recorded by technicians, failures could be categorised more precisely. This categorisation would provide valuable insights, enabling machine learning models to deliver more accurate and actionable predictions.
Most water utilities adopt a reactive approach to service line maintenance, addressing issues only after a failure occurs. This practice incurs substantial economic, environmental, and social costs. The methodology proposed in this study enables utilities to proactively identify and replace service lines with a high likelihood of failure, minimising unexpected disruptions and ensuring uninterrupted service for customers. Machine learning models have proven to be highly effective tools for planning predictive asset maintenance.
6. Conclusions
This study is among the first to evaluate machine learning-based failure prediction for clean water service lines while explicitly analysing the effect of training data length on model performance. A case study was conducted using a WDN serving over 200,000 customers. Data on service line attributes—including diameter, length, age, material, pressure, and demand—along with 11 years of historical failure records, were utilised. The first 9 years of data were employed for training the machine learning models, while the final 2 years were reserved for validation. RF, XGB, and LSTM algorithms were applied to estimate the likelihood of failure for each service line. Performance was evaluated using multiple metrics, including F1-score, ROC curve, AUC-ROC, and AUC-PRC. Additionally, a novel metric—failure reduction percentage relative to the replacement percentage—was introduced to quantify the effectiveness of the predictions in reducing failures by prioritising service lines with the highest likelihood of failure for replacement. The findings demonstrate that despite the limited feature set, the machine learning models successfully captured the complex relationships between service line attributes and failures. Furthermore, the models consistently improved with the inclusion of additional training data.
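The temporal split described above (first 9 years for training, final 2 for validation) can be expressed as a simple year-threshold filter, assuming one binary failure label per service line per year. The calendar years used here are assumed placeholders, as the study does not tie the split to specific dates.

```python
import pandas as pd

years = range(2013, 2024)   # 11 years of records (placeholder range)
records = pd.DataFrame(
    [{"line_id": lid, "year": yr, "failed": 0}
     for lid in (1, 2) for yr in years]
)

train = records[records["year"] <= 2021]   # years 1-9: model training
valid = records[records["year"] >= 2022]   # years 10-11: validation
print(len(train), len(valid))
```

A chronological cut-off like this, rather than a random split, prevents information from the validation period leaking into training.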
6.1. Key Findings
The results show that tree-based models (RF and XGB) outperformed the LSTM model, primarily due to the predominantly static nature of the service line attributes. Model performance improved consistently as longer training histories were used, underscoring the importance of historical failure data in capturing degradation patterns. The machine learning-based prioritisation strategy identified up to twice as many failures compared with random replacement for the same replacement rate. Furthermore, the predictive models demonstrated greater effectiveness for tactical, low-percentage replacement programmes than for long-term strategic asset planning.
6.2. Practical Implications
Utilities can use machine learning-derived likelihood-of-failure rankings to prioritise limited replacement budgets more effectively. Even when data are incomplete, predictive models provide measurable benefits compared with traditional reactive or age-based maintenance approaches. Moreover, the proposed evaluation metric directly links model outputs to operational outcomes, facilitating practical adoption by asset managers.
6.3. Limitations
Pressure data were model-based rather than measured, limiting the representation of transient effects. In addition, important explanatory variables such as soil properties, joint types, and manufacturer details were unavailable. Finally, representing failures as an annual binary outcome does not capture failure severity or recurrence frequency.
6.4. Future Research Directions
Future work should explore the integration of high-frequency pressure data, environmental variables, and detailed component-level information. Differentiating failure modes and incorporating cost–risk optimisation frameworks could further enhance strategic asset management. Additionally, hybrid models combining ML with physical or survival-based approaches may improve long-term failure prediction for service lines.