1. Introduction
By the end of 2020, there were around 580,000 concentrated rural water supply projects in China, providing safe tap water to 909 million rural residents [1]. However, according to the rural water supply plan in the 14th Five-Year Plan of China, around 10% of these projects were constructed before 2005, and most of them face issues such as aging pipelines and high leakage rates. Reports show that in some regions the leakage rate of rural water supply pipelines has surpassed 30%, threatening rural tap water safety. Pipeline leakage can have various causes, such as pipe cracking, poor construction quality, and the difficulty of maintaining deeply buried pipes. In the northeast of China, where pipes are buried relatively deep and winter temperatures are low, the leakage rate can reach 20% in some cities [2].
As stated above, leakage is a serious issue in rural water supply projects and restricts their sustainable development. Existing leakage detection approaches are mostly based on passive testing, the minimum night flow (MNF) method, or warning mechanisms based on upper and lower pressure limits. Negharchi and Shafaghat (2020) [3] studied a rural network in the north of Iran using two leakage calculation methods, background and bursts estimates (BABE) and MNF; the average leakage was found to be 1.45 L/s and 1.105 L/s, respectively. The effects of the legitimate night-time consumption (LNC) and the leakage exponent (N) were also evaluated. Norouzi et al. (2019) [4] argued that suitable techniques for specifying the right domestic night-time consumption values are essential when applying the MNF method. Chawira et al. (2022) [5] used flow measurements from loggers to determine real losses through MNF analysis in the Juru Rural Service Centre. Dandansaz et al. (2020) [6] minimized network leakage by applying minimal pressure at network nodes and analyzed the water distribution network using WaterGEMS and ArcGIS, aiming to determine the optimal pressure in the network.
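The MNF reasoning summarized above can be illustrated with a short calculation. The sketch below is only an illustration under stated assumptions: it takes leakage at the minimum-night-flow hour as the measured inflow minus the LNC, and scales it to other hours with the commonly used power law L1/L0 = (P1/P0)^N. The function names and all numeric values are hypothetical and are not taken from the cited studies.

```python
def night_leakage(mnf_lps, lnc_lps):
    """Leakage at the MNF hour (L/s): measured inflow minus legitimate night use."""
    return max(mnf_lps - lnc_lps, 0.0)


def leakage_at_pressure(leak_ref_lps, p_ref_m, p_m, n_exp):
    """Scale a reference leakage rate to another pressure: L1 = L0 * (P1 / P0) ** N."""
    return leak_ref_lps * (p_m / p_ref_m) ** n_exp


def daily_leakage_m3(leak_night_lps, p_night_m, hourly_pressure_m, n_exp):
    """Sum hourly leakage volumes (m3) over a list of hourly average pressures."""
    total_litres = 0.0
    for p in hourly_pressure_m:
        rate = leakage_at_pressure(leak_night_lps, p_night_m, p, n_exp)
        total_litres += rate * 3600.0  # L/s over one hour
    return total_litres / 1000.0


# Hypothetical inputs: 1.5 L/s minimum night flow, 0.4 L/s legitimate night use,
# 40 m pressure at the MNF hour, lower pressure during the day, N assumed to be 1.0.
leak = night_leakage(1.5, 0.4)
pressures = [40.0] * 6 + [32.0] * 14 + [38.0] * 4
print(round(daily_leakage_m3(leak, 40.0, pressures, 1.0), 1), "m3/day")
```

Because both the LNC and N enter the estimate directly, a small error in either one shifts the computed leakage, which is why the studies above stress choosing them carefully.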
The methods listed above appear much less efficient than those applied to urban water distribution networks, so a monitoring model suited to rural water distribution networks is needed. Traditional monitoring models already applied to leakage detection in urban networks usually rely on sufficient historical data, a hydraulic model of the pipeline, and real-time online monitoring data. When developing monitoring-warning systems for rural networks, however, researchers face a lack of online monitoring stations, a limited variety of monitoring indicators, and insufficient historical data. These constraints lower the accuracy of the hydraulic model and thus degrade the leakage detection model. New methods, different from those adopted in monitoring models for urban water distribution networks, need to be introduced and assessed in rural water supply projects.
With the development of network technology, the Internet of Things, and cloud platforms, high-quality real-time monitoring data can be obtained and transmitted by online sensors, enabling the application of artificial intelligence and machine learning to the analysis of online monitoring data. During the last decade, much progress has been made in research on algorithms such as the artificial neural network (ANN) and the fuzzy inference system (FIS). Mounce et al. (2002) [7] first proposed the application of artificial intelligence to water distribution networks. Mounce et al. (2006) [8] used multi-layered perceptrons (MLP) and a time delay neural network (TDNN) to study a case in which a fire hydrant was used to simulate a burst in a pipeline. In the subsequent research of Mounce et al. (2003, 2008, 2010, 2007) [9, 10, 11, 12], an artificial intelligence system was established for burst detection and flow meter data analysis, enabled by continuously updated historical data.
Many other researchers have also developed a variety of ANN models. Caputo and Pelagagge (2002, 2003) [13, 14] proposed a leakage monitoring approach based on a multi-layered neural network. Feng and Zhang (2006) [15] proposed a leakage monitoring approach based on a fuzzy neural network that could also identify anomalies in a pipeline. Aksela et al. (2009) [16] established a pipeline leakage detection model based on a self-organizing map, where the leakage function consisted of a distance function and a confidence function. Tao et al. (2014) [17] proposed a burst detection method based on an artificial immune network, in which a burst could be located by a nearest-neighbor algorithm after the monitoring data were fed into the network. Liang et al. (2001) [18] established an ANN-based model able to describe the relationship between three pressure monitoring stations and leakage-related parameters such as leakage location, intensity, and influence. Huang et al. (2007) [19] developed a method based on supervisory control and data acquisition (SCADA) that could locate a leakage through a fuzzy similar priority comparison. Industrial applications of ANNs to leakage detection have also been reported with satisfactory outcomes. However, such models do not converge easily, and their convergence requires large training sample sizes, which makes them unfit for rural water supply projects.
Bayesian analysis (BA), in which probability distributions can describe all forms of uncertainty, was introduced to solve the convergence problem of the ANN. Poulakis et al. (2003) [20] established a Bayesian probabilistic framework for leakage detection. Costanzo et al. (2014) [21] also determined the leakage area through BA. Romano et al. (2010) [22] integrated an ANN, statistical process control (SPC), and BA into a burst-leakage detection framework and tested its applicability in a district metered area (DMA) in the UK. The results showed that this framework could successfully locate bursts and leakage. Despite the satisfactory results obtained with BA, Bayesian models have to make assumptions about the probability distribution of the training samples, which often causes low identification accuracy. Large training sample sizes are required to improve accuracy, which is not practical in rural water supply projects.
Efforts have been made to reduce the sample size requirement. Vapnik et al. (1982) [23] developed a statistical learning theory (SLT) applicable to small sample sizes. Mounce et al. (2010) [24] used support vector machines (SVMs) to analyze the time-series data obtained from monitoring stations and achieved online monitoring of abnormal events such as burst-leakage, pipe cleaning, and sensor malfunction. Mamo et al. (2014) [25] developed a leakage detection and classification technique based on multi-class SVMs (M-SVMs). The operation state of the pipeline was classified into six categories according to the degree of leakage, and M-SVMs were then applied to identify the operation state of the DMA from the flow and pressure data obtained from monitoring stations. Zhang et al. (2016) [26] proposed a leakage zone identification model suitable for large-scale pipe networks. Compared to ANN and FIS, this method could send warnings much faster. Its weakness is that it does not consider the optimization of parameters, which could greatly influence the classification accuracy.
Another well-known algorithm suited to small sample sizes is gradient boosting, developed by Friedman as an approximation of gradient descent. The gradient boosting decision tree (GBDT) is one of the best-performing models derived from this algorithm. It is an additive model built iteratively from multiple decision trees. GBDT can deal with all types of data, achieve high accuracy, and stay robust against anomalies. Extreme gradient boosting (XGBoost) is an enhanced version of GBDT in which regularizers have been introduced to avoid overfitting. XGBoost has a strong generalization ability and is widely used for purposes such as parameter anomaly detection in satellite engineering, personal credit risk assessment, water depth inversion based on remote sensing, determination of an urban water sustainability index, and prediction of the quantity of the urban water supply (Chen et al., 2016; Devan et al., 2020; Clercq et al., 2018; He et al., 2020) [27, 28, 29, 30]. XGBoost has been applied to pipeline maintenance in several studies. Snider and McBean (2020) [31] used XGBoost to predict pipeline rupture. Wu et al. (2022) [32] adopted XGBoost to study the influence of the number of leakage occurrences and their location on model performance. Artificial leakage data were generated using a hydraulic model simulation, and the prediction accuracy reached 90.4%. The research of Mohsen (2021) [33] showed that SVM and XGBoost ranked first in predicting leak and nonleak samples in a laboratory-scale water distribution system. Moreover, Wang et al. (2021) [34] found that XGBoost had a better generalization ability than SVM, as XGBoost could improve its prediction accuracy via the decontamination effect. Nagaraj and Lakshmi (2021) [35] reported that the XGBoost classifier outperformed the other machine learning algorithms assessed in their study in terms of water body extraction.
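To make the idea of gradient boosting as gradient descent more concrete, the following is a minimal sketch for squared-error regression with one-split trees (stumps). It is not the XGBoost library and not the model used in this study. The function names, the toy data, and the parameter values are hypothetical. Each round fits a stump to the residuals, which are the negative gradient of the squared-error loss, and adds a shrunken copy of it to the ensemble. The leaf value sum(r) / (n + lam) mirrors the L2-regularized leaf weight used by XGBoost for this loss.

```python
def fit_stump(x, residuals, lam=1.0):
    """Fit a one-split tree to residuals. Assumes x has at least two distinct values."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        wl = sum(left) / (len(left) + lam)    # L2-regularized leaf value
        wr = sum(right) / (len(right) + lam)
        err = sum((r - (wl if xi <= threshold else wr)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or err < best[0]:
            best = (err, threshold, wl, wr)
    return best[1], best[2], best[3]


def predict_stump(stump, xi):
    threshold, wl, wr = stump
    return wl if xi <= threshold else wr


def gradient_boost(x, y, n_rounds=50, learning_rate=0.3, lam=1.0):
    """Return (base, stumps, learning_rate) after n_rounds of boosting."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - pred)^2 with respect to pred.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals, lam)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, stumps, learning_rate


def predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


# Toy data (hypothetical): a step change in the target at x = 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(x, y)
print([round(predict(model, xi), 3) for xi in x])
```

Raising `lam` or lowering `learning_rate` makes each tree contribute less, which is the kind of regularization that distinguishes XGBoost from plain GBDT.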
As a model applicable to small sample sizes, XGBoost has high potential for leakage detection in rural water distribution networks. Validating its feasibility would be of great value to the development of rural water supply projects, but to the best of our knowledge this has not yet been carried out. In this study, XGBoost was applied in a rural water supply project in Ningxia, China. A novel intelligent monitoring-warning system for leakage detection was established, consisting of a leakage locating model and a leakage quantity model, with the aim of providing valuable insight into the construction and maintenance of future rural water supply projects.
4. Conclusions
XGBoost, a model that has a strong generalization ability and is suited to small sample sizes, was applied to leakage detection in a rural water supply project in Ningxia, China. A novel intelligent monitoring-warning system was established, consisting of a leakage locating model and a leakage quantity model. The accuracy and F1-score of the leakage locating model were 95% and 93%, respectively, while those of the leakage quantity model were 96% and 97%, respectively. The areas under the ROC curves (AUCs) were all close to 1, while both micro- and macro-F1 were over 0.99. The model performance was satisfactory. In addition, the feature importance analysis enabled by XGBoost showed that the most important feature for leakage detection was the pressure at the monitoring points, and that second tests were more important than first tests. This indicates that the stable and timely transmission of online monitoring data could be crucial for establishing an intelligent monitoring system for rural water distribution networks.
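For readers less familiar with the metrics reported above, the following minimal sketch shows how accuracy, macro-F1, and micro-F1 can be computed for a multi-class classifier. The class labels and predictions are hypothetical and are not the data of this study. The helper names are illustrative only.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all classes."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical leak-location labels for six samples.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "B", "C"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred), micro_f1(y_true, y_pred))
```

Macro-F1 weights every class equally, so it is more sensitive to rarely occurring classes, whereas micro-F1 pools the counts and, for single-label predictions such as these, coincides with accuracy.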
The local water management authorities were also satisfied with the results. The intelligent system established in this study could help not only with major leakage incidents but also with minor leakage issues that are difficult to notice. The main drawback identified by managers is that a reliable leakage warning service depends on the stable operation of the warning system, which requires the timely upload of pressure data and stable internet access. Pressure monitoring devices must therefore be properly maintained.
The successful application of XGBoost in this study shows that a highly intelligent monitoring system for leakage detection in rural water supply projects is feasible. To further improve the models developed in this study, the system could be enabled to learn continuously from new samples while retaining existing knowledge, which may make the leakage locating model even more precise. Future studies on the application of this model in rural water supply projects could eventually achieve efficient and accurate identification of leakage, early prediction, and timely treatment, and hence a significant improvement of rural water supply services.