Classification of Water Pipe Damage Types Using Random Forest

Kutyłowska, Małgorzata; Cieżak, Wojciech

doi:10.3390/su18105101

by Małgorzata Kutyłowska * and Wojciech Cieżak

Faculty of Environmental Engineering, Wrocław University of Science and Technology, Wyb. Wyspiańskiego 27, 50-370 Wrocław, Poland

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(10), 5101; https://doi.org/10.3390/su18105101

Submission received: 20 March 2026 / Revised: 23 April 2026 / Accepted: 18 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Sustainable Water Supply and Drainage Systems: Design, Modeling, and Reliability)

Abstract

This study presents the results of a classification of the types of water supply pipe failures using a random forest model consisting of 72 trees. The modeling was done in Statistica software. The classification accuracy was compared with earlier results obtained from single-classification-tree models. The qualitative-dependent variable was the type of failure (corrosion, crack, sealing). The predictors included quantitative variables (diameter, year of construction) as well as qualitative variables (pipe type and material). The choice of 72 trees was made based on an analysis of the misclassification rate (31%) during the training stage. Increasing the number of trees forming the forest did not produce more accurate classification results: for the test set, the accuracy was 82%, 72%, and 37% for corrosion, crack, and sealing failures, respectively. The trees forming the random forest differed in their structure both in terms of the number of split and terminal nodes, as well as in the depth and number of levels of individual trees. The overall classification accuracy for the test set was nearly 66%, which is a better result than in the earlier analyses based on single trees. The proposed approach also aligns with the currently promoted concept of the sustainable operation of critical infrastructure.

Keywords: kind of failure; random forest; reliability; sustainable operation; water supply

1. Introduction

The modern world is characterized by an enormous dynamism of events, information chaos, and rapidly changing (though not always for the better) technical capabilities, especially in terms of the everyday use of tools broadly associated with information technology. Unreflective adoption of technological innovations may consequently lead humanity into a dead end from which there may be no escape or return to the world we have known for decades. Therefore, sustainable development and balance are now needed in many aspects of life.

This is particularly important in such sensitive areas as the reliable supply of drinking water, which is associated with proper management of municipal infrastructure based on the principles of sustainable development, for example, through the sectorization of extensive water supply systems. In cases where the end point of scientific or operational activities is the human being, even greater attention should be paid to the potential consequences of the research undertaken.

After all, the scientists working in Los Alamos, up to a certain point in their scientific activity, were not fully aware of what the results of the Manhattan Project would ultimately lead to and what the consequences for the world would be over the following decades. However, it should be remembered that the world-renowned mathematicians and physicists involved in that project, apart from strictly military aspects, laid the foundations for the development of modern cybernetics and what is now commonly referred to as machine learning, which includes the methodology of regression and classification trees, as well as the random forest method, which is a modification of the tree algorithm.

Today, mathematical modeling is an indispensable tool for solving many complicated engineering problems. Knowledge, experience, and intuition are used to build mathematical models of random experiments. However, prior to modeling, one should analyze the experimental (operational) data to be used to create a model. Since this is a highly subjective part of cognitive work (combining, to some extent, science and art), analysis of the data can lead to quite different conclusions. Bearing this in mind, the operational data obtained from water companies for the purpose of building models should be properly handled and subjected to a qualitative and quantitative analysis. This paper presents the application of a selected modeling method, random forest, to classify the type of damage that occurred in water conduits.

This work focuses on damage classification using a random forest and compares the results with previous studies by the authors that used single classification trees. Random forest is one of the methodologies of artificial intelligence, in the broad sense, that is worth applying to the analysis of the technical condition of water pipes in Poland. This type of approach has not been used in Poland so far, so it seems justified to address this topic and analyze it using a specific example. The issues addressed in this work are important from the perspective of managing water distribution systems, as modeling and classifying damage types will allow for better selection of repair or renovation methodologies in the future. Furthermore, machine learning is currently finding increasing application, and it is worthwhile to utilize this approach in addressing water pipe failures. This work continues an earlier series of publications by the authors and extends the topic. Water utilities are signaling the need to implement the best available technologies in their management, and modeling using random forest is consistent with this trend.

1.1. Random Forest

Regression and classification trees are used to predict quantitative and qualitative variables, respectively. The beginnings of this method of data analysis and forecasting date back to the 1960s; however, it was not until 1984 that the field was popularized by Breiman et al. [1]. A regression or classification tree is a directed graph containing a root and nodes (leaves) in which conditions regarding variables are checked, as well as branches that contain decision rules. The size of a tree can be described by its depth, i.e., the number of edges between the root and the most distant leaf [2].

The regression tree method is generally easier to implement and analyze than the classification tree method [1]. An analysis using the tree-building algorithm involves finding a set of logical splitting conditions and identifying relationships between predictors (independent variables) and the dependent variable, which ultimately leads to obtaining prediction results. Tree construction is a multi-stage process, and, at each stage, a different independent variable may provide the optimal split with homogeneous subsets. This is important because sometimes the quality of a tree is not assessed on the basis of prediction accuracy; instead, the usefulness of the splitting rules is taken into account [2].

The size of a classification or regression tree is an important issue. Extensive trees are usually difficult to interpret. Instead of growing one very large tree, many trees can be combined into an ensemble, forming what is known as a random forest. An ensemble of trees generally yields better prediction results than a single, even very complex, tree [3]. The generalization error of a random forest is related to the number of trees that compose it: according to the law of large numbers, this error converges as trees are added, so enlarging the forest does not in itself cause overfitting. In the forest algorithm, the influence of the so-called data “noise” effect on prediction results is small [4].

Random forest also performs well when there is a large number of independent variables, even much larger than the number of observations (training cases), which distinguishes this method from other typical predictive approaches. In such situations, during model development, it is possible to use the bootstrap approach, i.e., sampling with replacement. As a result, more information participates in the modeling process, which improves its quality. Moreover, a larger number of predictors reduces the model’s bias related to misclassification error. Therefore, it is not necessary to eliminate variables that play an important role (are significant operational data and, from an engineering point of view, should be included in the analysis), whereas with other predictive methods, they might have to be removed [5].
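The bootstrap step mentioned above can be illustrated with a minimal sketch. It assumes ten stand-in case identifiers and a fixed random seed, neither of which comes from the study; the function name `bootstrap_sample` is hypothetical.

```python
import random


def bootstrap_sample(cases, rng):
    """Draw len(cases) items with replacement (one bootstrap replicate)."""
    return [rng.choice(cases) for _ in cases]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
cases = list(range(10))  # stand-in identifiers for 10 training cases
sample = bootstrap_sample(cases, rng)

# Cases never drawn into this replicate are "out of bag" for the tree built on it.
out_of_bag = sorted(set(cases) - set(sample))
print(sample, out_of_bag)
```

Because sampling is done with replacement, some cases appear more than once and others not at all, so each tree in an ensemble can be grown on a slightly different view of the same data.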

However, it should be remembered that not all variables available should be included in the model; rather, those that are truly relevant to the problem currently being solved should be selected.

1.2. Applications of Random Forest in Issues Concerning the Technical Condition of a Pipeline

Several selected examples of the use of random forest in solving engineering problems, including those strictly related to environmental engineering, were described a few years ago [6]. Since then, modeling using machine learning has gained considerable popularity. An example is a comprehensive study discussing previous attempts to predict failure rates, the time and probability of failure occurrence, or risk estimation [7] for water supply networks. It was emphasized that there is a lack of detailed analyses regarding the usefulness of modeling for a slightly different problem, namely, the accurate prediction of the causes of observed failures and the uncertainty of their occurrence, which translates into the way water supply systems are managed [7]. Therefore, in [7], different methods were compared while taking into account the above-mentioned gaps in previous research. One of the methodologies used was random forest. Similarly to the approach adopted in the present study, the class receiving the highest number of votes from the ensemble of trees was considered the predicted class. The classification results obtained with the random forest for failure causes (corrosion, material defects, poor workmanship, external load) in [7] reached about 75% accuracy. As will be shown later in this paper, this does not differ significantly from the results obtained when predicting types of failures. Thus, it is possible at the outset to formulate a thesis, also confirmed by other researchers, about the usefulness of applying random forest to classification problems related to water pipe failures. This has also been demonstrated in the context of planning the rehabilitation of selected network sections [8], which is extremely important, especially from a long-term perspective [9].

Similarly, the issue of material abrasion in pipelines—this time, gas and oil pipelines—has been successfully modeled using random forest. Among the other machine learning methods applied (neural networks, support vector machines, k-nearest neighbors), random forest proved to be the optimal algorithm, achieving prediction accuracy of up to 90% [10]. Random forest has also been used in quantitative issues describing failure performance, such as, more precisely, the number of failures and the failure intensity index, with particular emphasis on the pressure inside the water pipe [11,12]. Interestingly, in [12], an almost perfect agreement between the operational data and predicted values of the pipe failure intensity index was obtained not only with the use of random forest, but also with the MARSplines method. Notably, in our own earlier study [13], this methodology showed lower adaptive capabilities than those presented in [12]. The dynamics of changes in the environment surrounding pipelines are also influenced by the behavior of the native soil or backfill, which was demonstrated in [14] and modeled with good accuracy using a random forest algorithm composed of several dozen trees. As an application example, ref. [14] presents a comparison of the actual and predicted changes in deviatoric stresses.

The topic of water supply failures is invariably associated with the problem of leak detection and localization. It is worth mentioning one of the latest studies here, which provides a review of the methods used to locate leaks in water supply networks and comments on modeling in the context of leak and damage detection [15]. By creating a specific void in the ground, leaks pose a threat both to people and to above- and below-ground infrastructure [16]. Due to the importance of this issue, it is also analyzed using decision tree models [17] and the Monte Carlo method [18]. However, in [17], the simulation was performed without the use of operational data; instead, it was based on a virtual pipe structure, and leaks were simulated on such a model network. According to the authors of [17], the obtained results of the decision tree model are promising, yet a legitimate question arises as to whether this would also translate positively when real operational data are implemented. In another study, experimental results obtained from a laboratory-scale test stand were used for the analysis and prediction of leaks [19]. However, the researchers decided to combine a random forest model with artificial neural networks, which is an approach that requires the proper selection of independent variables due to the specificity and differences between these two methodologies.

As shown by the brief summary above, in recent years, the application of machine learning methods, including the random forest model, in the analysis of the broadly understood technical condition of water pipelines has developed significantly. However, under Polish conditions, this type of approach is still not widely described or applied. Therefore, it seems justified to undertake the topic of classifying failure types using the random forest method. This constitutes a continuation of the authors’ analyses carried out several years earlier [6], and fits into the modern approach of sustainable development of strategic areas, among which water supply systems undoubtedly belong as critical infrastructure. In the present study, the “random forest” module available in the commercial Statistica software was used. The main objective of this work was to compare the results of the classification of water pipe failure types obtained using classification trees, presented in [6], with the results achieved using the random forest model. In the aforementioned monograph [6], a further research goal was formulated, namely, to verify the classification capabilities of random forest in the analysis of water pipeline failures, which is undertaken here.

2. Materials and Methods

In the classification tasks discussed in this study, the results were obtained through voting, which consisted of individual tree models selecting from randomly generated subsets of a given predictor. This approach was performed in a loop until all independent variables provided at the input were analyzed.
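As a rough illustration of classification by voting, the sketch below counts the class labels returned by the individual trees and reports the winning class together with its share of the votes. The five votes are hypothetical and the helper `majority_vote` is not part of Statistica; it only mirrors the general idea.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by most trees and its share of all votes."""
    counts = Counter(tree_predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(tree_predictions)


# Hypothetical votes from a 5-tree ensemble for a single pipe failure.
votes = ["corrosion", "crack", "corrosion", "sealing", "corrosion"]
label, share = majority_vote(votes)
print(label, share)  # corrosion 0.6
```

The vote share is one simple way to express how strongly the ensemble favours the winning class over the alternatives.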

In the Statistica 13.1 software, performing a classification task first requires organizing qualitative variables by transforming class membership into a set of indicator variables. If a qualitative variable belongs, for example, to five classes, then, during modeling, it is treated as a quantitative variable (set size of four) and takes the value 0 for four classes and 1 only for the class analyzed at a given moment. During classification, it is determined how much more often individual trees forming the entire ensemble (i.e., the forest) indicate a given class compared with the others—in other words, how much more frequently, for instance, pipeline corrosion is indicated compared with a transverse crack. This approach, apart from selecting the appropriate class membership, also allows for the determination of a prediction confidence index. During the construction of the random forest model, equal misclassification costs were assumed, which means that the estimated a priori probability was not corrected by the cost of an incorrect classification. This follows directly from the theoretical foundations of the random forest methodology.
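The transformation of a qualitative variable into indicator variables can be sketched as follows, using the six pipe materials recorded in the dataset. This is a plain one-indicator-per-category encoding written for illustration; whether the software keeps all indicators or drops a reference category is not assumed here, and `indicator_vector` is a hypothetical helper.

```python
MATERIALS = ["asbestos cement", "PE", "PVC", "steel", "galvanized steel", "cast iron"]


def indicator_vector(value, categories):
    """Encode one qualitative value as 0/1 indicators, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


encoded = indicator_vector("PVC", MATERIALS)
print(encoded)  # [0, 0, 1, 0, 0, 0]
```

Exactly one indicator equals 1 for each pipe, so the encoded vector carries the same information as the original material label.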

It was assumed that the forest would consist of a maximum of 100 classification trees with no more than 10 levels and 100 nodes while considering the minimum size of the so-called child node (i.e., the minimum node size equal to 5). This parameter should be distinguished from the minimum number of objects in a node, which was set at 30. In addition, the growth of individual trees was stopped when the classification error dropped below 5% over 10 training cycles.

The classification of water pipe failure types was carried out on the basis of operational data from one of the larger water distribution systems in Poland, described in more detail in [6]. However, in the present study, some changes were introduced to the structure of the dataset (compared with those presented in the monograph) [6], although this did not affect the realistic representation of operating conditions. For the classification using the random forest model, the nomenclature of failure types was simplified by combining several categories with similar properties into a single class. Failures referred to as “crack”, “longitudinal crack”, and “transverse crack” were classified into the category “crack”. Failures referred to as “sealing” and “joint leakage” were classified as “sealing”. This procedure is justified and does not diminish the comparative analysis, because the essence of the issue is preserved: from an engineering point of view, it is still a joint failure or a structural break in the pipeline. Moreover, as shown in [20], the smaller the representation of a given failure type (i.e., the less numerous the subset in the entire dataset), the lower the classification capability. Additionally, the aim of this study is not to indicate a specific repair method, but only to demonstrate the quality of classification. Therefore, distinguishing between a transverse and a longitudinal crack is of limited cognitive importance. Furthermore, the “leak” category accounted for only 1.2% and 0.8% of the total training and test datasets, respectively. In such a situation, this type of failure was not included among the classes analyzed in this paper, which is also a change compared with the classification tree analysis contained in [6].

The entire dataset covering eight years of operation was divided into a training subset (1200 cases: 410 corrosion, 478 crack, 312 sealing) and a test subset (516 cases: 181 corrosion, 186 crack, 149 sealing). During the stage of building the random forest, an additional 31% of the training subset was separated for validation of the training phase. The implementation of the model and the analysis of its significance were then performed on the test subset (516 cases). The dependent (classified) variable is the type of failure belonging to three classes (corrosion, crack, sealing). The quantitative predictors are diameter (ranging from 20 to 800 mm) and year of construction (the oldest pipeline dates from 1926, the newest from 2011). The qualitative variables include material (asbestos cement, PE, PVC, steel, galvanized steel, cast iron) and pipe type (trunk main, distribution pipe, service connection).
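The subset sizes can be cross-checked with a few lines of arithmetic. The training and test counts are those given above; the per-class validation counts (122, 153, 93) are the ones reported with Table 2 in Section 3. The variable names are illustrative only.

```python
# Case counts per failure type, as reported in the text.
train = {"corrosion": 410, "crack": 478, "sealing": 312}
test = {"corrosion": 181, "crack": 186, "sealing": 149}
validation = {"corrosion": 122, "crack": 153, "sealing": 93}  # held out of the training subset

train_total = sum(train.values())
test_total = sum(test.values())
val_total = sum(validation.values())
fit_total = train_total - val_total   # cases left for growing the trees
val_share = val_total / train_total   # fraction of the training subset held out

print(train_total, test_total, val_total, fit_total, round(100 * val_share, 1))
# 1200 516 368 832 30.7
```

The held-out share of about 30.7% matches the roughly 31% quoted for validation, and the remaining 832 cases agree with the number of observations used in training.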

3. Results and Discussion

Figure 1 shows the process of building the random forest together with changes in the misclassification fraction as a function of forest size. Further increasing the forest size, i.e., increasing the number of trees beyond 72, did not reduce the classification error in the validation subsample during the training stage, and even led to an increase in this error.

At the stage of building the random forest model, the number of misclassifications in the training and validation subsets was compared. The lowest misclassification fraction (0.318) for the validation subset was obtained for a forest consisting of 72 trees, while for the training subset, this share for 72 trees was 0.27. It can be assumed that the model with the lowest error in the validation sample will generate the most accurate classification during model testing using new data. Therefore, the remainder of this study presents the classification results for the model consisting of 72 trees forming the random forest.
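The selection rule described here (keep the forest size with the lowest validation error) can be written as a short sketch. Only the 72-tree value of 0.318 comes from the text; the other points on the curve are invented placeholders used to exercise the function, and `best_forest_size` is a hypothetical name.

```python
def best_forest_size(validation_error):
    """Return the tree count with the lowest validation error.

    Ties are broken in favour of the smaller forest.
    """
    return min(validation_error, key=lambda n: (validation_error[n], n))


# Placeholder curve: only the 72-tree entry (0.318) is taken from the text;
# the remaining entries are made up for illustration.
curve = {10: 0.36, 40: 0.33, 72: 0.318, 100: 0.325}
chosen = best_forest_size(curve)
print(chosen)  # 72
```

Breaking ties toward the smaller forest is a design choice of this sketch, made because a smaller ensemble is cheaper to store and evaluate at equal error.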

Table 1 summarizes the importance of the predictors involved in the classification of failure types. In this respect, the approach was also modified and simplified compared with the single-tree classification presented in [6], where many variants were analyzed, including different configurations of independent variables, and separate tree models were built for different pipe types (trunk mains, distribution pipes, service connections).

In earlier work [6], some variants used as many as seven predictors (four quantitative and three qualitative variables), and this did not translate into classification quality at the training stage, as will be discussed in more detail later in this paper. In some variants in previous studies, failures of the “leak” and “transverse crack” type were not correctly classified at all, resulting in 0% accuracy. An earlier work [6] aimed to provide a broader perspective on the use of machine learning; therefore, the analyses were very detailed and included many variables that could potentially be important not only for assessing modeling quality, but also operationally, such as the pressure inside the pipeline. However, considering the unsatisfactory classification results previously presented [6], the approach was simplified by selecting only the independent variables listed in Table 1, especially since the random forest model is more advanced, and reducing the number of predictors should not negatively affect classification performance. As will be shown later, this change even had a positive effect on classification accuracy in the test sample.

In the 72-tree forest model, material and year of construction turned out to be the most important variables from the perspective of classification (importance equal to 1.000 and 0.989, respectively). From the standpoint of operational experience, this ranking is not surprising, because, in most cases, a given failure type is closely related to the pipe material, which also results from the specific properties of the material used in water supply systems. Typical corrosion does not occur in plastic pipes (apart from transition points from traditional materials to plastics). This alone—though not the only reason—indicates the very high importance of material type in classification tasks. Similar observations were made when single classification trees were used [6], where, in six out of ten analyzed variants, material also had an importance level of 1.000. This is an extremely important guideline for practitioners aiming to maintain a sustainable operational approach: regardless of other factors, one of the most important elements to analyze in operational datasets—e.g., for further planning of network expansion or modernization—should be pipe material. Although PE and PVC pipes accounted for only 135 cases (11.25%) in the training dataset, this does not mean that these materials are less failure-prone; rather, they are subject to different types of failures and irregularities due to their structural specificity, and also because their service life is shorter compared with cast-iron or steel pipelines. Of those 135 failure cases on polymer pipes, 67% occurred on pipes installed in the 20th century, when, in Poland, the large-scale adoption of polymers in water distribution systems was only beginning.

As shown in other studies (leak prediction [21]), pipe age (i.e., year of construction) plays an important role in modeling using various predictive methods. In this context, the result obtained in the present study is not surprising, because many failure cases (especially corrosion) were not sudden events but developed over years during which slow degradation of the material structure took place. In the training set, there were 410 corrosion cases, 73% of which were recorded on pipelines installed before 1989. The random forest model simply revealed what is intuitively apparent and what has also been demonstrated previously, such as, among others, in [22,23]: older water pipelines, due to their long service life, usually exhibit higher failure rates. Moreover, their long-term operation is one of the main factors contributing to the process of secondary water contamination within the water distribution network [24]. Compared with the earlier results [6], the variable “year of construction” in the random forest model proved more important than in the single-tree models, where only 4 out of 10 analyzed variants had this predictor at an importance level of 1.000, 0.990, and 0.947; in the remaining six configurations, it was below 0.800. It can be hypothesized that the single-tree models were saturated with other independent variables (e.g., pressure, burial depth, season), and the resulting importance ranking reflected an “optimal” choice under the condition of having an overly large predictor base; however, this did not translate into classification accuracy. In such a situation, the 72-tree random forest model yielded higher classification performance (discussed in the next part of the paper), possibly due to the absence of noisy variables that contribute little to the classification task. Moreover, pipe age (year of construction) is a variable that is usually known and available in operational data, unlike, for example, the burial depth of the pipeline. 
The latter variable had importance levels of 1.0, 0.7, and 0.5 in the single-tree classifications [6]. Therefore, the availability of accurate operational information should also be considered during modeling. If it is possible to use a smaller number of variables (but with reliable, precise values) that are routinely recorded, updated, and entered into increasingly common GIS systems in municipal utilities, then models should be simplified by reducing the predictor vector size, as demonstrated by the analyzed random forest. In addition, the proposed set of four predictors (two qualitative and two quantitative) seems reasonable also in comparison with the single-tree models [6], where, out of seven independent variables, the classification quality was effectively influenced by only three or four variables, with varying importance hierarchies depending on the configuration.

Figure 2 presents 12 selected trees (Nos. 1, 2, 21, 22, 31, 32, 41, 42, 61, 62, 71, 72) from the entire random forest (due to space limitations, it is not possible to present the structures of all 72 trees). Each tree structure consists of split and terminal nodes, but their number does not increase with the tree number. For example, Tree No. 1 has 22 split nodes, whereas Tree No. 71 has 18, which is directly related to the algorithm for generating individual architectures and the splits performed along subsequent branches and levels.

In the analyzed 72-tree random forest, the number of split nodes took the following values depending on the tree: 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22. Meanwhile, the number of terminal nodes was one greater than the number of split nodes (Figure 2), as is always the case for binary splits. However, this does not mean that every tree with, for example, 16 split nodes looked the same (see Figure 2, Trees No. 31 and 72), because the split based on the selected independent variables at each level of the tree occurred in different parts of the structure, and the depth and size of branching varied depending on the split conditions and the size of each subsequent leaf. For comparison, the single-classification-tree structures in the earlier study [6] had from 1 to 22 terminal nodes. Thus, in both cases (single trees and the random forest), the architecture size was similar. However, only combining individual tree models, with structures comparable to those previously presented [6], into a random forest yielded better classification results for failure types at the training stage (Figure 3), in the validation set (Table 2), and during testing on new data (Table 3).
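The relation between split and terminal nodes, and the fact that equal node counts do not imply equal shapes, can be checked on two small hypothetical binary trees. The nested-tuple representation and the helper names are assumptions made for this sketch only.

```python
def count_nodes(tree):
    """Return (split_nodes, terminal_nodes) for a binary tree.

    A split is a (left, right) tuple; anything else is a terminal node.
    """
    if not isinstance(tree, tuple):
        return 0, 1
    left, right = tree
    left_splits, left_leaves = count_nodes(left)
    right_splits, right_leaves = count_nodes(right)
    return 1 + left_splits + right_splits, left_leaves + right_leaves


def depth(tree):
    """Number of edges from the root to the most distant terminal node."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[0]), depth(tree[1]))


# Two hypothetical trees with the same number of splits but different shapes.
balanced = (("corrosion", "crack"), ("sealing", "crack"))
chain = ((("corrosion", "crack"), "sealing"), "crack")

print(count_nodes(balanced), depth(balanced))  # (3, 4) 2
print(count_nodes(chain), depth(chain))        # (3, 4) 3
```

Both trees have three split nodes and four terminal nodes, yet one is two levels deep and the other three, which is the kind of structural difference visible between trees of equal size in the forest.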

It can be hypothesized that, despite a similar number of split and terminal nodes in the structures of the trees forming the forest, the overall model is more complex, because at successive splits at lower levels of the individual trees, a smaller number of independent variables was involved than in the case of single classification trees, as already discussed in the context of the predictor importance ranking. Thus, on the one hand, we deal with model simplification (mainly in terms of the availability of operational data) through reducing the independent-variable vector, but, on the other hand, this does not translate directly into obtaining a relatively simple model architecture. When analyzing engineering issues and phenomena with fairly high dynamics and a significant impact on people and the surrounding environment, the task of modeling should be to account for important operational problems, while less emphasis may be placed on more “virtual” aspects, such as deliberations about the number of leaves and branches in the trees forming the random forest. Therefore, it was more important to verify whether a smaller number of independent variables (compared with single classification trees and with what is typically archived operational data) would allow for an acceptable level of classification accuracy for water pipe failure types.

Figure 3 presents the classification results (training subset) at the stage of building the 72-tree random forest model. A total of 832 cases participated in the training process (288 observations of “corrosion”, 325 observations of “crack”, and 219 observations of “sealing”). Correct classification was achieved for 84% of corrosion cases, 83% of crack cases, and 43% of “sealing” cases. These results are several percentage points more accurate than those obtained with single classification trees. The model is assessed based on the validation results (Table 2) during training and the test results (Table 3).

Table 2 shows example classification results for the training dataset (validation subset, 368 cases) for the 72-tree random forest model. Of the 368 cases, a total of 251 were correctly classified (68%). Broken down by failure type, the classification was as follows: out of 122 cases observed as corrosion, correct predictions applied to 101 cases (83%); out of 153 cases observed as crack, correct predictions applied to 121 cases (79%); out of 93 cases observed as sealing, correct predictions applied to 29 cases (31%). As expected, classification performance in the validation subset was slightly worse than for the training subset. Nevertheless, these values remain satisfactory and are more accurate than those achieved with single trees. Particularly important are the relatively good results for classifying “sealing”, despite its relatively small representation in the dataset.
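The percentages quoted for the validation subset follow directly from the reported counts, as the short calculation below shows. Only the counts stated in the text are used; the variable names are illustrative.

```python
# Validation subset: observed cases and correctly classified cases per failure type.
observed = {"corrosion": 122, "crack": 153, "sealing": 93}
correct = {"corrosion": 101, "crack": 121, "sealing": 29}

per_class = {name: correct[name] / observed[name] for name in observed}
overall = sum(correct.values()) / sum(observed.values())

for name, accuracy in per_class.items():
    print(f"{name}: {accuracy:.0%}")
print(f"overall: {overall:.0%}")
# corrosion: 83%
# crack: 79%
# sealing: 31%
# overall: 68%
```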

Failure-type classification was performed by selecting the type that received the highest probability of occurrence. If the probability was 1.00, the situation was clear. However, if the probability distribution across the types was, for example, as in the second row of Table 2 (0.17; 0.43; 0.40), the model classified corrosion as crack, because the latter had the highest predicted probability. This was, of course, a misclassification, but the probability shares in such situations are important because they show what is not visible in the final results, namely, the splitting and assignment—within nodes and branches of the trees forming the forest—of predictors to the outcome values of the dependent variable (failure type). For example, if the “material” predictor dominated in the case of “corrosion”, then, when the same material is analyzed, the trained model will, with higher probability, classify the failure as corrosion, even if the observed failure was, for instance, crack.
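The decision rule for the example row can be sketched as follows. The class order (corrosion, crack, sealing) is assumed from the order used throughout the text, and `predict_class` is a hypothetical helper rather than a Statistica function.

```python
CLASSES = ("corrosion", "crack", "sealing")  # assumed order of the probabilities


def predict_class(probabilities):
    """Pick the class with the highest predicted probability."""
    best = max(range(len(CLASSES)), key=lambda i: probabilities[i])
    return CLASSES[best]


row = (0.17, 0.43, 0.40)  # example probabilities quoted in the text
predicted = predict_class(row)

# Gap between the two most probable classes: a small gap signals a close call.
top_two = sorted(row, reverse=True)[:2]
margin = round(top_two[0] - top_two[1], 2)
print(predicted, margin)  # crack 0.03
```

A margin of only 0.03 between crack and sealing shows how close the decision was in this case, which is the information that the final class label alone does not convey.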

Table 3, in turn, presents the classification results for the test set, i.e., for data that were not used to build the random forest model. The same trend in classification quality is observed as in the validation subset of the training stage. Corrosion was classified most accurately at 82%, followed by crack with 72% accuracy, and finally sealing with 37%. The last result is particularly interesting because it is about 6 percentage points higher than the validation result from the training stage, which indicates the adaptive capabilities of the random forest model, especially for the failure type with the smallest representation in the overall dataset. The misclassification results primarily from the size of the dataset, as the most abundantly represented failure types are characterized by the highest prediction accuracy. Furthermore, in the next stage of research, it would be worthwhile to consider changing the architecture of the individual trees comprising the forest to increase the accuracy of the classification of less abundantly represented failures. It is also notable that, for corrosion, the classification accuracy is about 10 percentage points higher than for crack, even though the two classes were of comparable size (181 corrosion cases and 186 crack cases, respectively).
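The overall test accuracy can be cross-checked against the row percentages of Table 3. The Python sketch below does this under one stated assumption: the class sizes for corrosion (181) and crack (186) are given in the text, whereas the sealing class size (149) is not reported here and is inferred from the Table 3 percentages. The variable names are illustrative.

```python
# Row percentages from Table 3 (observed class -> predicted class shares, %).
table3 = {
    "corrosion": {"corrosion": 82.32, "crack": 13.81, "sealing": 3.87},
    "crack": {"corrosion": 3.76, "crack": 72.58, "sealing": 23.66},
    "sealing": {"corrosion": 10.07, "crack": 52.35, "sealing": 37.58},
}

# Test-set class sizes. 181 and 186 are stated in the text;
# 149 for sealing is an ASSUMPTION inferred from the Table 3 percentages.
class_sizes = {"corrosion": 181, "crack": 186, "sealing": 149}

# Convert the diagonal percentages back to counts of correctly classified cases.
correct = {
    label: round(table3[label][label] / 100.0 * class_sizes[label])
    for label in table3
}
total_cases = sum(class_sizes.values())
overall_accuracy = 100.0 * sum(correct.values()) / total_cases

print(correct)
print(f"overall accuracy: {overall_accuracy:.2f}%")

# Differences between class accuracies are expressed in percentage points.
gap_pp = table3["corrosion"]["corrosion"] - table3["crack"]["crack"]
print(f"corrosion vs crack: {gap_pp:.2f} percentage points")
```

Under the assumed sealing class size, the reconstructed counts give the overall accuracy reported in Table 3, which suggests the inferred value is at least consistent with the published figures.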

The classification results obtained for the test set are comparable to those described earlier [6] using single trees, where corrosion, crack, and sealing were also classified at approximately 80%, 70%, and 30%, respectively. However, overall accuracy did not then exceed 55% for most considered variants due to a larger number of failure-type classes; only in two configurations did it reach about 70%. With the random forest model, an overall accuracy of nearly 66% was obtained. This is a satisfactory result and stems from the simplification of modeling involving the prior grouping of many irregularity types into one class (e.g., crack). The thesis stated in earlier work [6]—that the use of random forest could increase classification accuracy—has been partially confirmed in the present study. The improvement is not spectacular, but it has been shown that a smaller number of independent variables does not reduce accuracy and may even have a positive effect on the random forest model’s ability to adapt to other water distribution systems, taking into account their operational, soil-and-water, and structural specificity. This is because predictors such as material, diameter, pipe type, and age are typically archived in municipal utility datasets and updated whenever a given section is replaced or rehabilitated.

4. Conclusions and Summary

The results presented in this study are promising, as they demonstrate the potential of applying a random forest model to the analysis of water pipe failure susceptibility, even when using a smaller number of independent variables compared with single-tree models. This topic remains worth further investigation, particularly because the analysis needs to be extended to the influence of pipe material on the time to failure. In other words, failure analysis should be combined with the operational suitability of individual sections of water pipes made of different materials, which is one element of a sustainable approach to managing municipal infrastructure. Similarly to the analysis of water quality at different points in a water distribution network [25], failures occurring in a given system, which are directly related to material type and age, should also be considered separately for each system. However, the model framework and its theoretical foundations can be adapted to other water supply networks. Based on the obtained results, it can be concluded that, at this stage of research, the main limitation of the random forest methodology is the unsatisfactory accuracy of classifying failure types that are sparsely represented in the dataset, e.g., sealing. On the other hand, sealing damage does not cause catastrophic consequences for water supply compared to, for example, significant cracks in the material structure or corrosion-related changes that reduce pipeline flow capacity.

Classifying water pipe failure types using machine learning methods fits into the broader issue of ensuring an appropriate level of water delivery quality for consumers and is part of the policy of sustainable management of municipal infrastructure. Corrosion of pipeline material is one of the causes of deteriorating water quality in distribution systems [26]. Therefore, not only the correct classification of already observed failures but also their appropriate spatial localization could help increase the operational safety and reliability of water supply networks and support appropriate actions by municipal system operators. Every water supply system is unique, so simply applying a single model to multiple water distribution systems is impossible. The vector of independent variables must be adjusted for each specific operating condition. Consequently, the proposed modeling approach can be helpful in determining the methodology for repairing or replacing individual sections of the water supply network.

So far, attention has been focused on failures that are typically mechanical, related to the impact of the external and internal environment on pipeline materials. This is a relatively straightforward approach because the technical condition of the pipe can be directly examined and assessed in terms of material loss or structural changes. It is worth mentioning here that water hammer and the resulting stresses in the pipe material also play a significant role in the occurrence of water supply network failures [27,28]. Pipeline damage is therefore the result of many variable factors. At present, however, an increasingly common problem is biofilm forming on the inner surface of pipelines, both those made of traditional materials and those made of polymers. At the moment, this is not classified as a failure or irregularity, but rather defined as an operational condition that can occur even after only a few years of use. As recent studies have shown [29], for a selected water supply system, national regulations regarding water quality were met and the water was stable; nevertheless, due to changing atmospheric conditions, the phenomenon of biofilm may in the longer term lead to local changes in the microbiological composition of water within the network.

If in further research it were possible to combine the aspect of typical failure analysis with qualitative water-quality studies, it would be possible to develop a model that includes not only mechanical failures but also—indirectly through analysis of quality parameters—irregularities in the internal structure of pipeline materials, manifested, for example, by an increased presence of microorganisms. Such an approach also aligns with the currently promoted concept of the sustainable operation of critical infrastructure.

Author Contributions

Conceptualization, M.K.; methodology, M.K.; software, M.K. and W.C.; validation, M.K. and W.C.; formal analysis, M.K.; investigation, M.K.; resources, M.K.; data curation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, W.C.; visualization, W.C.; supervision, W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The operational data used for modeling were provided by a selected water company in Poland and cannot be made widely public.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Chapman & Hall/CRC: Boca Raton, FL, USA, 1984.
2. Data Mining–Prediction Methods; Training Materials; Statsoft: Kraków, Poland, 2017.
3. Statistica Electronic Manual; Statsoft: Kraków, Poland, 2019.
4. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
5. Berk, R.A. Statistical Learning from a Regression Perspective; Springer Science and Business Media, LLC: New York, NY, USA, 2008.
6. Kutyłowska, M. Regression and Classification Methods in the Analysis and Assessment of the Failure Level of Water Conduits; Publishing House of Wrocław University of Science and Technology: Wrocław, Poland, 2019; Available online: https://dbc.wroc.pl/Content/67027/Kutylowska_Metody_regresyjne.pdf (accessed on 6 March 2026). (In Polish)
7. Taiwo, R.; Zayed, T.; Adey, B.T. A novel ensemble learning framework for predicting the causes of water pipe failures. Reliab. Eng. Syst. Saf. 2025, 264, 111320.
8. Winkler, D.; Haltmeier, M.; Kleidorfer, M.; Rauch, W.; Tscheikner-Gratl, F. Pipe failure modelling for water distribution networks using boosted decision trees. Struct. Infrastruct. Eng. 2018, 14, 1402–1411.
9. Robles-Velasco, A.; Cortés, P.; Muñuzuri, J.; De Baets, B. Prediction of pipe failures in water supply networks for longer time periods through multi-label classification. Expert Syst. Appl. 2023, 213, 119050.
10. Yang, S.; Zhang, L.; Fan, J.; Sun, B. Experimental study on erosion behavior of fracturing pipeline involving tensile stress and erosion prediction using random forest regression. J. Nat. Gas Sci. Eng. 2021, 87, 103760.
11. Konstantinou, C.; Jara-Arriagada, C.; Stoianov, I. Investigating the impact of cumulative pressure-induced stress on machine learning models for pipe breaks. Water Resour. Manag. 2024, 38, 603–619.
12. Shirzad, A.; Safari, M. Pipe failure rate prediction in water distribution networks using multivariate adaptive regression splines and random forest techniques. Urban Water J. 2019, 16, 653–661.
13. Kutyłowska, M. Application of MARSplines method for failure rate prediction. Period. Polytech. Civ. Eng. 2019, 63, 87–92.
14. Sakponou, H.; Cui, K. Hybrid random forest algorithm to predict internal erosion under increasing hydraulic gradient. Int. J. Geotech. Eng. 2025, 19, 41–49.
15. La Cognata, R.; Piazza, S.; Freni, G. Bridging the gap between model assumptions and realities in leak localization for water networks. Water 2025, 17, 3502.
16. Iwanek, M.; Suchorab, P. Determination of the water outflow zone on the ground surface after a pipe failure using fractal geometry. Sustainability 2025, 17, 11093.
17. Pandian, C.; Alphonse, P.J.A. Evaluating water pipe leak detection and localization with various machine learning and deep learning models. Int. J. Syst. Assur. Eng. Manag. 2025, 2.
18. Iwanek, M.; Suchorab, P. Feasibility of using hypothetical fractal structures to determine water outflow zones after a pipe failure. Sustainability 2024, 16, 10640.
19. Huang, L.; Hu, B.; Wan, S.; Lu, B. Research on pipeline flange leakage detection method based on random forest and Pearson correlation coefficient. Appl. Acoust. 2025, 24, 110918.
20. Kutyłowska, M.; Cieżak, W. Two AI methods for classification of water pipes damage. Instal 2024, 2, 44–48. (In Polish)
21. Taiwo, R.; Zayed, T.; Bakhtawar, B.; Adey, B.T. Explainable deep learning models for predicting water pipe failures. J. Environ. Manag. 2025, 379, 124738.
22. Pietrucha-Urbanik, K. Failure analysis and assessment on the exemplary water supply network. Eng. Fail. Anal. 2015, 57, 137–142.
23. Zimoch, I.; Łobos, E. Application of the Theil statistics to the calibration of a dynamic water supply model. Environ. Prot. Eng. 2010, 36, 105–115.
24. Piegdoń, I.; Tchórzewska-Cieślak, B. Risk estimation method of secondary water pollution in water supply system. Desalination Water Treat. 2023, 301, 1–13.
25. Domoń, A.; Kowalska, B.; Papciak, D.; Wojtas, E. Assessment of the stability of tap water in the distribution system. Desalination Water Treat. 2025, 322, 101130.
26. Tchórzewska-Cieślak, B.; Pietrucha-Urbanik, K.; Rak, J. Assessing levels of safety integrity in tap water quality–A case study approach. Desalination Water Treat. 2025, 322, 101093.
27. Urbanowicz, K.; Bergant, A.; Stosiak, M.; Deptuła, A.; Karpenko, M.; Kubrak, M.; Kodura, A. Water hammer simulation using simplified convolution-based unsteady friction model. Water 2022, 14, 3151.
28. Urbanowicz, K.; Bergant, A.; Stosiak, M.; Karpenko, M.; Bogdevičius, M. Developments in analytical wall shear stress modelling for water hammer phenomena. J. Sound Vib. 2023, 562, 117848.
29. Piegdoń, I. Variability of drinking water quality on the basis of analysis of qualitative monitoring from a selected water supply network located in South-Eastern Poland. Water 2024, 16, 3355.

Figure 1. Fraction of misclassifications vs. number of trees.

Figure 2. Selected 12 structures of 72-tree random forest: (a) Tree number 1. (b) Tree number 2. (c) Tree number 21. (d) Tree number 22. (e) Tree number 31. (f) Tree number 32. (g) Tree number 41. (h) Tree number 42. (i) Tree number 61. (j) Tree number 62. (k) Tree number 71. (l) Tree number 72.

Figure 3. Classification of the types of failure: learning of the 72-tree random forest model.

Table 1. Importance ranking of the predictors.

Variable	Importance
material	1.000
year of construction	0.989
diameter	0.855
type of pipe	0.515

Table 2. Classification of the types of failure (learning set, validation subset).

Observed	Predicted	Probability Corrosion	Probability Crack	Probability Sealing
corrosion	corrosion	1.00	0.00	0.00
corrosion	crack	0.17	0.43	0.40
sealing	crack	0.00	0.68	0.32
corrosion	crack	0.21	0.51	0.28
crack	crack	0.00	0.60	0.40
crack	crack	0.00	0.72	0.28
crack	crack	0.00	1.00	0.00
crack	crack	0.00	0.86	0.14
crack	crack	0.00	0.90	0.10
corrosion	corrosion	0.93	0.07	0.00
crack	sealing	0.00	0.35	0.65
crack	crack	0.00	0.74	0.26
sealing	corrosion	1.00	0.00	0.00
corrosion	corrosion	0.99	0.01	0.00
sealing	sealing	0.00	0.40	0.60
corrosion	corrosion	1.00	0.00	0.00
crack	crack	0.00	0.64	0.36
corrosion	corrosion	1.00	0.00	0.00
sealing	crack	0.00	0.99	0.01
sealing	corrosion	0.98	0.01	0.01
sealing	sealing	0.00	0.14	0.86
corrosion	crack	0.00	0.83	0.17
sealing	crack	0.00	0.64	0.36
sealing	crack	0.00	0.86	0.14
crack	crack	0.00	0.72	0.28
sealing	crack	0.00	0.74	0.26
corrosion	sealing	0.10	0.35	0.55
corrosion	corrosion	1.00	0.00	0.00
corrosion	corrosion	0.96	0.03	0.01
crack	crack	0.01	0.93	0.06
corrosion	corrosion	1.00	0.00	0.00
crack	crack	0.00	0.76	0.24
crack	crack	0.00	0.57	0.43
sealing	sealing	0.00	0.25	0.75
sealing	sealing	0.07	0.46	0.47
crack	crack	0.00	0.93	0.07
corrosion	corrosion	0.96	0.03	0.01
crack	crack	0.01	0.80	0.19
corrosion	sealing	0.13	0.33	0.54
crack	crack	0.01	0.81	0.18

Table 3. Estimated number of fault type classifications (test set).

Observed	Share of Predicted Class in Total Observed Class Cardinality, %
	corrosion	crack	sealing
corrosion	82.32	13.81	3.87
crack	3.76	72.58	23.66
sealing	10.07	52.35	37.58
Overall accuracy = 65.89%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
