4.1. Factors Influencing Failures
To better understand the nature of the failures, all incidents were classified using the concept of failure frequency, defined as the number of failures per 100 km per year [37]. This metric allows for meaningful comparison of failures across pipes of different diameters, ages, and materials.
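As a concrete illustration, the metric reduces to a single normalization. The sketch below uses hypothetical pipe lengths and failure counts, not values from the study area:

```python
def failure_frequency(n_failures, length_km, years=1.0):
    """Failures per 100 km of pipe per year."""
    return 100.0 * n_failures / (length_km * years)

# e.g. 24 failures in one year on 60 km of pipe (illustrative numbers):
print(failure_frequency(24, 60.0))  # 40.0
```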
As a first step, the failures were analyzed in relation to pipe diameter, and the results are presented in Table 1.
An analysis of failures by pipe diameter reveals that smaller-diameter pipes in the network experience a higher frequency of failures compared to larger-diameter pipes. In particular, pipes with diameters of 63 mm, 75 mm, and 90 mm exhibit approximately 4 to 5 times more failures than those with larger diameters. This finding is consistent with previous studies in the literature, which indicate that pipes with diameters smaller than 150 mm tend to have higher failure frequencies [38,39]. However, the relationship between pipe diameter and failure frequency is influenced by various factors, including pipe material, pipe age, installation depth, soil characteristics, and environmental conditions. Therefore, it is essential to consider these additional parameters alongside pipe diameter in pipe failure prediction models.
Pipe material is another critical factor that affects the frequency of failures in drinking water systems. When planning long-term infrastructure strategies, water utilities must consider not only operational feasibility, cost, and practicality but also the susceptibility of different pipe types to failure. The failure frequencies calculated for the four different pipe materials found in the study area are presented in Table 2.
An analysis of failures by pipe material reveals that the highest number of failures occurred in cast iron (CI) pipes. These pipes, primarily due to their age and the lack of cathodic protection, have been the most failure-prone material as a result of corrosion. They are followed by asbestos cement (AC) pipes, which are particularly susceptible to failures in older systems due to aging and brittleness. The data also indicate that HDPE and PVC pipes tend to experience fewer failures. Both materials are resistant to ground movements, easy to install, and relatively cost-effective. As a result, these two types of pipes are predominantly used in Türkiye. Similar findings are reported in previous studies in the literature, supporting the results obtained in this study [27,40,41,42].
Several studies have indicated that failure frequency generally increases with pipe age, independent of pipe material and environmental conditions. The relationship between pipe age and failure frequency in the study area is presented in Table 3.
An examination of the study area shows that pipes up to 10 years old have a failure frequency below 13 failures per 100 km per year, which is generally considered an acceptable threshold [37,43]. However, due to material aging, failure frequency can reach as high as 192 failures per 100 km per year. This finding provides important insights for asset management. It highlights the necessity for water utilities to develop pipe replacement programs that take these values into account during the operational phase [44,45].
Temperature variations are also among the key factors contributing to failures in water distribution networks. Studies have shown that sudden temperature fluctuations significantly increase the likelihood of failures. Not only sudden changes but also prolonged exposure to extreme high or low temperatures can elevate failure frequencies [46,47]. For instance, cold weather conditions tend to increase failure frequency in cast iron pipes, whereas high temperatures are found to substantially raise failure frequencies in PVC and HDPE pipes.
Table 4 presents data from the study area based on temperature ranges and corresponding failure frequencies.
In the study area, a significant increase in failure frequencies was observed with rising temperatures. This trend can primarily be attributed to the fact that the majority of the network—approximately 79%—is composed of HDPE and PVC pipes, as supported by findings in the literature. Consequently, elevated temperatures are seen to contribute to higher failure frequencies within the system.
Analyzing the factors that lead to failures is of critical importance for water utilities. Especially in the context of long-term planning, insights gained from the characteristics of the existing network can be used to design effective rehabilitation strategies. Additionally, these data can inform the selection of appropriate pipe materials for new installations. The results obtained in this study not only confirm widely accepted views in the literature but also provide quantifiable insights that can be applied in practice.
4.2. Failure Prediction Model
Three distinct models—RFR, XGB, and MLP—were selected to simulate failures in DWSs. The selected techniques were implemented in MATLAB (Version R2022a), and the results were documented. To assess model accuracy, the dataset was randomly divided into three segments before analysis: 60% for training, 20% for validation, and 20% for testing. The same train/test split was applied to each approach, ensuring a fair and consistent assessment of predictive accuracy. The models were used to predict the failure rate, specifically the number of failures per 100 km, which was the model output. Performance metrics for the models (RFR, XGB, and MLP) were calculated on the test set, and the results are shown in Table 5.
The models were assessed using standard performance metrics: RMSE, NMSE, NMBE, MAE, MAPE, IOA, and R2.
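The split-and-score procedure can be sketched as follows. This is an illustrative Python re-sketch, not the MATLAB R2022a code used in the study, and the metric formulas are common textbook forms that may differ in detail from those actually applied:

```python
import math
import random

def split_60_20_20(n_samples, seed=0):
    """Random 60/20/20 train/validation/test index split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_tr, n_va = int(0.6 * n_samples), int(0.2 * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def evaluate(obs, pred):
    """Common forms of the reported error metrics (exact definitions vary)."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    mse = sum((p - o) ** 2 for o, p in zip(obs, pred)) / n
    return {
        "RMSE": math.sqrt(mse),
        "NMSE": mse / (mo * mp),          # normalized by the product of means
        "NMBE": (mp - mo) / mo,           # mean bias relative to observed mean
        "MAE": sum(abs(p - o) for o, p in zip(obs, pred)) / n,
        "MAPE": 100.0 / n * sum(abs((p - o) / o) for o, p in zip(obs, pred)),
        # Willmott's index of agreement
        "IOA": 1 - n * mse / sum((abs(p - mo) + abs(o - mo)) ** 2
                                 for o, p in zip(obs, pred)),
        "R2": 1 - n * mse / sum((o - mo) ** 2 for o in obs),
    }
```

Note that R2 here is the coefficient of determination (1 − SSres/SStot), which can differ slightly from a squared correlation coefficient when predictions are biased.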
Among the three modeling techniques, the MLP performed best across nearly all metrics. Its RMSE of 1.48283 indicates small discrepancies between predicted and actual values, demonstrating the model’s accuracy in producing reliable predictions. The very low NMSE (0.00553) and MAE (1.22132) values confirm this accuracy, and the MAPE of 2.927% shows that the model produced reliable predictions regardless of failure magnitude.
The MLP model exhibited a minor positive bias in its predictions, with an NMBE of 0.00542, while the RFR model showed a slight tendency to underestimate, as indicated by its negative NMBE of −0.00159. Although the XGB model did not match the accuracy of the MLP, its NMBE was more balanced than that of the RFR model.
The MLP model achieved an IOA of 0.99787 and an R2 of 0.99309, indicating that it accounted for over 99% of the variability in the observed data. The XGB model also performed strongly (IOA = 0.99676, R2 = 0.98917) but trailed the MLP on the error metrics, while the RFR model exhibited the weakest overall performance in terms of RMSE and MAE.
These findings underscore the MLP model’s efficacy in predicting failures within drinking water systems, notably due to its ability to capture complex and nonlinear relationships. The capacity of artificial neural networks to learn intricate relationships among input variables significantly influenced this outcome. Nonetheless, MLP models typically exhibit limited interpretability and higher computational costs. Based on the results, detailed outcomes for the best-performing model, the MLP, are presented in Figure 2, Figure 3 and Figure 4.
For the MLP model, the regression diagrams presented in Figure 2 were prepared to evaluate the performance of the developed prediction model across the training, validation, testing, and overall datasets. In each subplot, the horizontal axis represents the actual (target) values, while the vertical axis shows the predicted (output) values generated by the model. The circular markers indicate individual observations, the solid colored lines represent the model’s fitted regression lines, and the dashed lines denote the ideal prediction line, i.e., the Y = T (output = target) relationship.
An examination of the regression line in the training phase reveals that the model learned from the training data with nearly zero error. The proximity of the prediction values to the ideal line suggests that overfitting is not immediately apparent. During validation, the regression equation was obtained as Output = 0.99 × Target + 0.32, indicating that the model is capable of generalizing with high accuracy on unseen validation data. The slope value shows that predictions are very close to the target values, with no meaningful systematic bias.
Similarly, in the testing phase, the regression line was found to be Output = 0.99 × Target + 0.30, confirming that the model successfully transferred its learned patterns to previously unseen data. The predicted and observed values were closely aligned, with minimal deviations.
For the overall dataset, the regression line was calculated as Output = 0.99 × Target + 0.21, which suggests that the model consistently generated accurate and reliable predictions across the entire dataset without introducing significant systematic error. The slope being very close to 1 and the low intercept value indicate a balanced and stable performance throughout the prediction range.
Overall, the strong linear correlation observed at every stage—training, validation, and testing—demonstrates the model’s success in both learning and generalization. Additionally, the clustering of data points around the ideal linear line in the scatter plots further supports the conclusion that the model operates with low variance and minimal bias.
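The slope/intercept pairs reported above are ordinary least-squares fits of model outputs on targets. A minimal sketch of such a fit (illustrative, not the MATLAB plotting routine used in the study):

```python
def fit_line(targets, outputs):
    """Ordinary least-squares fit of outputs = a * targets + b."""
    n = len(targets)
    mt, mo = sum(targets) / n, sum(outputs) / n
    a = (sum((t - mt) * (o - mo) for t, o in zip(targets, outputs))
         / sum((t - mt) ** 2 for t in targets))
    return a, mo - a * mt

# A perfect Output = 2*Target + 0.5 relationship recovers a = 2.0, b = 0.5:
print(fit_line([1.0, 2.0, 3.0], [2.5, 4.5, 6.5]))  # (2.0, 0.5)
```

A slope near 1 with a small intercept, as observed for the model, means predictions track targets with no meaningful systematic bias.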
An error histogram was also generated for the model (Figure 3). This histogram, divided into 20 bins, was created to analyze the model’s prediction performance. The horizontal axis represents the error, defined as Error = Target − Output, while the vertical axis shows the number of samples (frequency) falling within each error range. The datasets are color-coded: training data in blue, validation data in green, and test data in red. Additionally, the zero-error line is shown in orange.
The shape of the histogram closely resembles a bell curve (normal distribution), indicating that the model generally performs with low errors. The highest concentration of data points lies within a narrow range near zero error (approximately between −0.43 and +0.43), with a noticeable density of training samples in this region. This suggests that the model was able to produce predictions close to the target values during training.
The comparable centering of the error distributions for the test (red) and validation (green) datasets indicates the model’s exceptional generalization ability. Low-variance and symmetric error distributions for non-training data indicate that the model does not exhibit overfitting.
Minimal occurrences near the extreme ends of the histogram—specifically, between −6 and −8 or between +5 and +7—suggest that the model rarely produces large prediction errors. These infrequent outliers are likely caused by noisy data, exceptional conditions, or random deviations.
The model’s error distribution has a balanced structure throughout the training and testing/validation datasets, with the majority of prediction errors concentrated at zero. This indicates that, devoid of systematic bias, the model can produce reliable, consistent, and broadly applicable predictions.
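The binning behind such a histogram can be sketched as follows, under the assumption of equal-width bins spanning the observed error range (the figure specifies only the bin count of 20):

```python
def error_histogram(targets, outputs, n_bins=20):
    """Count errors (target - output) into n_bins equal-width bins."""
    errors = [t - o for t, o in zip(targets, outputs)]
    lo, hi = min(errors), max(errors)
    width = (hi - lo) / n_bins or 1.0   # guard against identical errors
    counts = [0] * n_bins
    for e in errors:
        counts[min(int((e - lo) / width), n_bins - 1)] += 1
    return counts

# Three bins over errors [0, 1, 2, 3]; the maximum lands in the last bin:
print(error_histogram([1, 2, 3, 4], [1, 1, 1, 1], n_bins=3))  # [1, 1, 2]
```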
An examination of the error graph from the training process (Figure 4) illustrates how the Mean Squared Error (MSE) values for the training, validation, and test datasets change over the number of epochs during model training. The blue line corresponds to the training data, the green line to the validation data, and the red line to the test data. The epoch at which the best validation performance was achieved (epoch 119) is highlighted with a green circular marker. The graph is presented on a logarithmic scale, allowing clearer observation of the rapid decline and stabilization in error values.
Looking at the overall trend of the curves, the model is seen to rapidly reduce the MSE starting from the initial epochs. This indicates that the model began learning the data effectively early on, and parameter updates contributed significantly to performance improvement. After approximately the 60th epoch, the rate of error reduction slowed across all curves, and around the 100th epoch, the error values began to stabilize. This suggests that the model had reached an optimal level of learning, where further training yielded no substantial improvements.
The best validation performance was achieved at the 119th epoch, where the validation error reached approximately 1.4175. This value reflects a high level of predictive accuracy on the validation set and indicates strong generalization capability. Furthermore, the alignment of the training and test error curves with the validation error supports the conclusion that the model did not suffer from overfitting and was able to apply learned patterns successfully across different datasets.
The parallel and closely aligned progression of the training, validation, and test curves demonstrates that the model exhibited balanced performance across all data partitions and that the learning process advanced in a stable manner. This also suggests that the hyperparameters used during training (e.g., learning rate, network architecture, number of epochs) were appropriately selected.
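The best-epoch selection described above amounts to simple bookkeeping over the per-epoch validation MSE. A hedged sketch (illustrative; the study relied on MATLAB’s built-in training with validation-based stopping, and the history values below are hypothetical):

```python
def best_epoch(val_mse_history):
    """Return (epoch, mse) of the minimum validation MSE; epochs count from 1."""
    i = min(range(len(val_mse_history)), key=val_mse_history.__getitem__)
    return i + 1, val_mse_history[i]

history = [5.2, 3.1, 2.0, 1.6, 1.45, 1.5]  # hypothetical validation-MSE curve
print(best_epoch(history))  # (5, 1.45)
```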
Overall, Figure 4 confirms that the model was trained effectively, experienced neither overfitting nor underfitting, and successfully identified the point of minimum validation error.
For the established model, predictions on a randomly selected subset of 60 test samples were compared to their actual values and are presented in Figure 5.
An examination of Figure 5 visually demonstrates the prediction accuracy of the developed model on the test dataset. The horizontal axis represents test samples, numbered from 1 to 60, while the vertical axis indicates the annual failure frequency per 100 km of pipeline. A comparison between the observed values and the values predicted by the model offers valuable insight into the overall predictive performance of the model.
A general review of the graph structure shows that the predicted values closely follow the observed ones, indicating that the model successfully learned the underlying patterns and accurately captured the dynamics within the data. The model’s ability to track sharp increases and decreases (e.g., between test samples 12–18 and 35–38) further demonstrates its sensitivity to variable trends.
Some local deviations are also observed—particularly at peak points such as test samples 14 and 36, where the predicted values slightly differ from the observed ones. However, these discrepancies appear to be random rather than systematic, suggesting that while the model may introduce minor errors when handling high-variance data, it still manages to capture the overall trend with considerable success.
In segments where the data is relatively flat (e.g., test samples 37–50), the model’s predictions almost perfectly match the observed values, indicating a strong ability to learn and replicate stable patterns. Additionally, the downward trend observed in the final portion of the test dataset (samples 50–60) is accurately followed by the model.
In conclusion, while all three models demonstrated acceptable levels of performance, the MLP model provided the highest accuracy and reliability in predicting failures in drinking water systems. This reflects the model’s robustness and capacity to generate trustworthy predictions not only during training but also when applied to real-world data.
When the predictive performance of this study is interpreted within the broader context of the literature, it is important to consider methodological and scale-related differences among studies. In [31], the reported RMSE values (>2.3) arise from an economic leakage assessment framework rather than a predictive machine-learning model; thus, similarities exist only in thematic scope, not in modeling objectives. In [32], LS-SVM outperformed FFNN and GRNN models after fuzzy clustering was applied to create homogeneous sub-regions. The very low RMSE value reported in that study (0.0086) reflects the fact that the target variable was normalized and modeled separately within narrowly defined clusters, yielding a much smaller numerical range. By contrast, the target variable in this study—the annual number of failures per 100 km—is naturally larger in magnitude (typically ranging between 0 and 20 or more). Therefore, an RMSE of 1.48 corresponds to an average prediction error of only about 1.5 failures per 100 km per year, which is realistic and meaningful for practical applications. This scale dependency underscores why a direct numerical comparison of RMSE values across studies may be misleading. For this reason, we also emphasize scale-independent metrics: the proposed MLP model achieved an R2 of 0.98583 on independent DMA test regions (the maximum R2 value reported in [32], including subsets, was 0.736), demonstrating strong generalization capability across diverse network conditions without the need for prior segmentation. These distinctions clarify the observed performance differences and highlight the robustness and practical applicability of the proposed approach.
In future studies, even greater success may be achieved by developing hybrid or ensemble models that combine the strengths of neural networks and tree-based algorithms.
4.3. Real-World Data Testing
Although the developed model demonstrated successful performance on internal test datasets, its applicability to different water networks is equally important. To evaluate the model’s performance under varying conditions, data from the provinces of Sakarya and Kayseri in Türkiye—distinct from the original study area of Malatya—were selected for further testing (Figure 1).
In this context, data were collected from a total of 24 measurable sub-regions (District Metered Areas—DMAs), including 11 from Sakarya and 13 from Kayseri. These are summarized in Table 6.
The creation of DMAs plays a critical role in drinking water management [48,49,50]. These areas are physically separated from other parts of the network and are continuously monitored using flow meters installed at their inlets. Moreover, DMAs can be integrated with subscriber management systems, fault management systems, and Geographic Information Systems (GISs).
In the 24 selected DMAs, systematic failure records have been maintained. These records include additional data such as the age, diameter, material type, and location of the pipes where failures occurred. To ensure a robust testing process, the selected regions were chosen to represent a variety of pipe materials, diameters, and ages (see Table 6).
The actual failure data from these areas were used as input to the MLP model, and the resulting predictions are presented in Table 7.
As a result of the analyses, the comparison between actual failure data and model predictions is presented in Table 7 and Figure 6. The table includes the total pipe lengths for various DMA regions, the annual number of failures, the observed annual failure frequency per 100 km (Observed Failure/100 km/year), and the corresponding values predicted by the model (Predicted Failure/100 km/year). Such a comparison is highly valuable for assessing both the model’s ability to generalize across different regions and its sensitivity to local variations.
A general evaluation of the dataset reveals a high level of agreement between the observed failure frequencies and the values predicted by the model. In nearly all DMAs, the predicted values either match the observed rates exactly or differ by only a small margin. This indicates that the model has not only captured the overall distribution accurately but has also successfully learned local variations at the DMA level.
In a few DMAs, there are noticeable discrepancies between the observed and predicted values. For example:
In SasDMA8, the observed failure frequency was 41.49, while the model predicted 33.90. This deviation (approximately 7.6 points) may stem from the model’s tendency to suppress extreme values (i.e., peak-prone regions) or from unexplained external factors specific to this area.
In SasDMA3, the model predicted 12.92, compared to the observed value of 9.49, slightly overestimating the failure frequency. Such minor deviations may result from the fact that, in DMAs with low failure frequencies, even small numerical changes can appear disproportionately large in percentage terms.
In high-failure-rate areas such as KasDMA8 (Observed: 89.63, Predicted: 86.72) and KasDMA11 (Observed: 73.29, Predicted: 73.50), the model produced highly accurate predictions. This demonstrates the model’s capability to correctly identify high-risk zones.
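The contrast between absolute and percentage deviation noted above can be made concrete with the figures from these two DMAs:

```python
def deviations(observed, predicted):
    """Absolute deviation and its percentage of the observed value."""
    abs_dev = abs(predicted - observed)
    return round(abs_dev, 2), round(100 * abs_dev / observed, 1)

# SasDMA8: large absolute gap, moderate in percentage terms:
print(deviations(41.49, 33.90))  # (7.59, 18.3)
# SasDMA3: small absolute gap, but large in percentage terms:
print(deviations(9.49, 12.92))   # (3.43, 36.1)
```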
The statistical performance of the developed model is presented in Table 8. Upon examining the results, it is evident that the model achieved successful prediction performance.
The model demonstrated significant predictive accuracy across DMAs with low and high failure frequencies. This indicates that the developed prediction method can substantially assist in proactively identifying existing infrastructure threats and can be reliably utilized in field-based decision-support systems.
To rigorously evaluate the position of the proposed MLP model within the current literature, a comparison with prominent studies published in the last five years is presented in Table 9. This comparison includes various machine learning approaches, ranging from logistic regression to recent deep learning and ensemble applications. The metrics demonstrate that the proposed MLP model achieves state-of-the-art performance, outperforming or matching the best results reported in similar recent studies.
DMA-level analyses are essential for understanding the localized behavior of the model and for enabling region-specific prioritization. At this level, the model’s performance proved consistent and stable.