Article

Developing Machine Learning Models for Optimal Design of Water Distribution Networks Using Graph Theory-Based Features

by Iman Bahrami Chegeni 1, Mohammad Mehdi Riyahi 1, Amin E. Bakhshipour 2,*, Mohamad Azizipour 1 and Ali Haghighi 1,2

1 Department of Civil Engineering, Faculty of Civil Engineering and Architecture, Shahid Chamran University of Ahvaz, Ahvaz 61357-83151, Iran
2 Department of Urban Water Management, RPTU in Kaiserslautern, Paul-Ehrlich-Straße 14, D-67663 Kaiserslautern, Germany
* Author to whom correspondence should be addressed.
Water 2025, 17(11), 1654; https://doi.org/10.3390/w17111654
Submission received: 7 April 2025 / Revised: 26 May 2025 / Accepted: 26 May 2025 / Published: 29 May 2025
(This article belongs to the Special Issue Advances in Management and Optimization of Urban Water Networks)

Abstract

This study presents an innovative data-driven approach to the optimal design of water distribution networks (WDNs). The methodology comprises five key stages: (1) generation of 600 synthetic WDNs with diverse properties, optimized to determine optimal component diameters; (2) extraction of 80 topological and hydraulic features from the optimized WDNs using graph theory; (3) preprocessing and preparation of the extracted features using established data science methods; (4) application of six feature selection methods (Variance Threshold, k-best, chi-squared, Light Gradient-Boosting Machine, Permutation, and Extreme Gradient Boosting) to identify the features most relevant to the optimal diameters; and (5) integration of the selected features with four machine learning models (Random Forest, Support Vector Machine, Bootstrap Aggregating, and Light Gradient-Boosting Machine), resulting in 24 ensemble models. The Extreme Gradient Boosting-Light Gradient-Boosting Machine (Xg-LGB) model emerged as the optimal choice, achieving R2, MAE, and RMSE values of 0.98, 0.017, and 0.02, respectively. When applied to a benchmark WDN, this model accurately predicted optimal diameters, with R2, MAE, and RMSE values of 0.94, 0.054, and 0.06, respectively. These results highlight the developed model's potential for the accurate and efficient optimal design of WDNs.

1. Introduction

Water distribution networks (WDNs) are complex systems of connected components, including water sources, pipes, and other equipment such as pumps and valves. WDNs are fundamental to supplying water from storage to consumers at the required pressure and quality. Given the high costs involved, the design of WDNs has attracted considerable interest from researchers and designers [1].
The evolution of WDN design approaches has been marked by several key developments categorized as follows:
  • 1960s–1980s: Early efforts focused on linear and nonlinear programming methods to minimize design costs under hydraulic constraints [2,3,4,5,6,7].
  • 1990s–2000s: Metaheuristic algorithms (e.g., genetic algorithms, particle swarm optimization) gained prominence for cost-effective solutions [8,9,10].
  • 2000s–Present: Multi-objective optimization algorithms have emerged to address additional factors such as reliability, water quality, and resilience in more complex WDNs [11,12,13,14,15,16].
While these traditional and metaheuristic optimization methods have advanced WDN design, they fundamentally rely on iterative hydraulic simulations and can be computationally expensive, particularly for large-scale or complex WDNs.
In parallel with these optimization method developments, in recent years, machine learning models have proven highly effective tools in WDNs, offering innovative solutions to complex problems traditionally addressed through conventional hydraulic modeling approaches. A critical observation on prior studies shows that few have explored the application of machine learning in WDN design [17,18], while most have focused on applying machine learning methods to the operational aspects of WDNs, such as demand forecasting [19,20], WDN monitoring [21,22], leakage detection [23,24], and pump operation [25]. The limited application of machine learning methods in designing WDNs serves as a key motivation for the current study.
Concurrently, graph theory has proven to be an indispensable tool for representing and analyzing the structural properties of WDNs. This approach represents WDNs as nodes (consumers or hydraulic control components) connected by links (pipes) [26]. The application of graph theory to WDNs began in the 1970s, initially focusing on understanding fundamental concepts and analyzing water flow and pressure [27,28]. Over time, researchers have adopted a more topological perspective, integrating graph theory with analytical tools to develop innovative solutions for WDN analysis and design [29,30].
Graph theory has various applications in WDN research, including reliability analysis [31], network dimension reduction [32,33], robustness enhancement [34,35,36,37,38], leak detection [39,40,41], network segmentation [42,43,44,45], and pump operation planning [46,47,48].
The integration of graph theory and machine learning has given rise to powerful new approaches in WDN analysis and management [49,50]. This combination allows for the extraction of topological features and the identification of complex patterns within WDNs. Applications of this synergy include leak detection and localization [51,52,53,54,55], water quality monitoring and prediction [56,57], pressure and demand forecasting [58], sensor and valve placement optimization [59], district metered area design [60,61], and asset management and failure prediction [62,63,64,65]. Nevertheless, similar to the standalone ML applications, graph theory and machine learning integration in WDN research has primarily focused on analysis/operational support for WDNs, without directly addressing the challenge of discovering the relationship between network properties and optimal design parameters like pipe diameters. Table 1 presents a comparative summary categorizing previous machine learning applications in WDNs by application domain, model type, feature set, dataset source, and reference.
Based on the literature review presented above, and to the best of the authors' knowledge, no study has developed machine learning models for the optimal design of WDNs. In this research, a large dataset comprising 600 synthetic WDNs is created to feed the machine learning methods, which has not been done in previous studies. Graph theory is used to extract 80 features from the 600 synthetic WDNs (with a total of 85,745 samples), thereby creating a connection between the machine learning methods and graph theory. Unlike traditional methods that solve hydraulic equations to design WDNs and determine the optimal pipe diameters, this study aims to reshape the optimal design of WDNs by integrating graph theory with machine learning models. The approach involves generating synthetic WDNs, optimizing their design, extracting topological and hydraulic features, and applying machine learning techniques to identify patterns for optimal diameter design. The process encompasses data preparation, feature selection through six methods, and the development of 24 ensemble machine learning models. The best-performing model is subsequently applied to the Hanoi WDN, showcasing the effectiveness of this innovative approach in optimizing WDN design.

2. Methodology

This section introduces an innovative approach to WDN design utilizing supervised machine learning regression models. The method leverages topological and hydraulic features to achieve optimal WDN design without relying on traditional hydraulic equation-solving techniques. Figure 1 illustrates the flowchart of the innovative approach comprising five essential steps. These steps are summarized as follows:
1. Generation and optimization of 600 synthetic WDNs to determine optimal pipe diameters.
2. Extraction of topological and hydraulic features for WDN components (pipes, nodes, and the overall network graph).
3. Preparation of a database using the features obtained in step two.
4. Application of six feature selection methods to identify the most relevant features.
5. Combination of the feature selection methods with four machine learning models, creating 24 ensemble machine learning models to detect optimal WDN design patterns.
Finally, the methodology concludes by employing the most effective ensemble machine learning model to determine the optimal diameters for the Hanoi WDN, a real-world case study.
This approach signifies a notable shift from conventional WDN design methods, potentially enhancing efficiency in identifying optimal pipe diameters. The subsequent sections of the paper will provide detailed explanations of each stage in the proposed methodology.

2.1. Synthetic Water Distribution Network Generation

Diverse datasets of WDNs are essential for simulating and evaluating machine learning and graph theory-based methods. To achieve this, a specialized algorithm was developed to randomly generate synthetic WDNs replicating real-world network characteristics. The algorithm for generating synthetic WDNs involves the following procedures:
1. Creating a 2D initial raw graph representing a synthetic WDN, where nodes are linked to nearby neighbors in a network formation.
2. Introducing randomness into the initial graph by randomly eliminating nodes and their associated pipes.
3. Verifying the connectivity of the graph and repeating step 2 if it is not fully connected.
4. Assigning random allowable values, such as nodal demands, pipe lengths, pipe roughness coefficients, reservoir elevation levels, and others, to the components of the synthetic WDNs generated.
5. Continuing the process until the required number of synthetic WDNs is reached.
The synthetic WDN generation algorithm ensures the generated networks are valid and functional by adhering to the structural rules defining actual WDNs. For example, the algorithm produces planar graphs in which all connections (i.e., pipes) are joined exclusively at nodal points. This feature prevents unrealistic pipe connections and closely resembles the structure of real WDNs.
Additionally, the number of pipes connected to a single node is limited to a maximum of four, enhancing the resemblance to real networks by avoiding nodes with excessive connections. These constraints, along with other critical characteristics like pipe lengths and friction coefficients, help create synthetic WDNs that resemble real-world networks. In this study, 600 synthetic WDNs were generated, with each network component having the following characteristics:
  • Number of nodes: Between 16 and 141 nodes
  • Number of pipes: Between 24 and 252 pipes
  • Pipe lengths: Between 20 and 100 m
  • Hazen-Williams friction coefficient: In the 80 to 130 range
  • Reservoir head height: Between 20 and 90 m
  • Number of loops: Between 9 and 112 loops
In generating synthetic WDNs, the following points have been considered to align the topological rules with the hydraulics of WDNs: (1) each generated synthetic WDN forms a planar graph; (2) each synthetic WDN is a connected graph, with a path between every pair of nodes; (3) each node is connected to a maximum of four pipes, preventing unrealistic and overly complex connections; (4) the random removal of nodes or pipes introduces variability in the configurations of the generated WDNs, resulting in diverse topologies; (5) assigning realistic random hydraulic values to nodes and pipes (such as pipe length, Hazen-Williams coefficient, and reservoir head) ensures the synthetic WDNs behave like real-world WDNs; and (6) the generated WDNs are evaluated for performance, including topological and hydraulic checks, using hydraulic simulation software. It is also important to note that the elevation of all nodes in these synthetic WDNs is set to zero. By incorporating these features, the generated datasets provide a robust foundation for evaluating machine learning models and graph theory-based approaches in WDN analysis.
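As a concrete illustration of the generation procedure, the sketch below follows the five steps above using NetworkX. It is a minimal sketch, not the authors' implementation: the grid-based initial layout (which also respects the four-pipes-per-node cap), the rejection-based connectivity handling, and the nodal demand range are all assumptions for illustration.

```python
import random
import networkx as nx

def generate_synthetic_wdn(rows=8, cols=8, removal_fraction=0.2, seed=None):
    rng = random.Random(seed)
    # Step 1: planar initial graph where nodes link to nearby neighbors.
    # A grid layout also satisfies the four-pipes-per-node degree cap.
    G = nx.grid_2d_graph(rows, cols)
    # Step 2: introduce randomness by removing nodes and their pipes.
    nodes = list(G.nodes)
    for node in rng.sample(nodes, int(removal_fraction * len(nodes))):
        G.remove_node(node)
    # Step 3: reject disconnected layouts (the paper repeats step 2 instead).
    if not nx.is_connected(G):
        return None
    # Step 4: assign random allowable values to the network components.
    for u, v in G.edges:
        G.edges[u, v]["length"] = rng.uniform(20, 100)         # m
        G.edges[u, v]["hazen_williams"] = rng.uniform(80, 130)
    for n in G.nodes:
        G.nodes[n]["demand"] = rng.uniform(0.5, 5.0)   # assumed demand range
        G.nodes[n]["elevation"] = 0.0                  # all elevations are zero
    return G

# Step 5: continue until the required number of synthetic WDNs is reached.
networks, attempt = [], 0
while len(networks) < 600:
    wdn = generate_synthetic_wdn(seed=attempt)
    attempt += 1
    if wdn is not None:
        networks.append(wdn)
```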

2.2. Water Distribution Network Optimization

A primary goal in WDN design is to meet consumers' water demands at a desirable pressure while minimizing the design cost. A typical approach to this goal is single-objective optimization, in which the objective function minimizes WDN construction costs. The intent of this optimization procedure is to determine the values of the decision variables (pipe diameters) that minimize the objective function (WDN cost) while satisfying the system's technical and hydraulic constraints. The objective function and pressure constraint employed in the optimization problem are presented as follows.
f = \sum_{i=1}^{N_p} C(d_i)\, L_i, \qquad i = 1, \ldots, N_p \quad (1)

H_j \geq H_{min}, \qquad j = 1, \ldots, N_n \quad (2)
Equation (1) represents the objective function of the single-objective optimization algorithm. Here, C ( d i ) is the cost of diameter d i per unit of pipe length, and L i is the pipe length. Equation (2) is the optimization problem constraint, where H j is the pressure at node (j), which must be greater than or equal to the minimum pressure ( H m i n ). Table 2 presents the commercial diameters used in the optimization problem along with their corresponding costs. Hydraulic simulation of the synthetic WDNs is performed in EPANET software [66]. To this end, the synthetic graphs created in Python 3.8 are first introduced to EPANET 2.2 using the EPANET Toolkit in Python [67]. Next, specifications for components, such as pipe diameter, pipe length, and nodal demands, are assigned to these components, and the hydraulic simulation is performed [67]. A single-objective genetic algorithm is used in this study to optimize the synthetic WDNs. A self-adaptive method is utilized to satisfy the pressure constraint. Please refer to the article by Makaremi [68] for further study of the self-adaptive method.
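The optimization loop can be summarized by a penalized objective combining Equations (1) and (2). The following is a minimal sketch, not the paper's implementation: the cost table stands in for Table 2, simulate_pressures is a hypothetical placeholder for the EPANET (EPyT) hydraulic run, and a fixed penalty coefficient replaces the self-adaptive scheme of [68].

```python
COST_PER_METER = {150: 23.0, 200: 32.0, 250: 50.0}  # illustrative d_i -> cost per m
H_MIN = 30.0                                        # minimum pressure head (m)

def simulate_pressures(diameters):
    # Placeholder for the EPANET/EPyT hydraulic solver returning nodal heads.
    return [35.0] * len(diameters)

def design_cost(diameters, lengths):
    # Equation (1): f = sum of C(d_i) * L_i over all pipes.
    return sum(COST_PER_METER[d] * L for d, L in zip(diameters, lengths))

def penalized_objective(diameters, lengths, penalty_coeff=1e5):
    # Equation (2): H_j >= H_min at every node, enforced via a penalty term.
    # In the self-adaptive scheme [68], the penalty coefficient itself adapts
    # during the GA run rather than staying fixed as it does here.
    cost = design_cost(diameters, lengths)
    violation = sum(max(0.0, H_MIN - h) for h in simulate_pressures(diameters))
    return cost + penalty_coeff * violation
```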

2.3. Topological and Hydraulic Features

The integration of topological and hydraulic features significantly enhances the application of machine learning models in WDN analysis. Topological features, such as node degree, clustering coefficient, and shortest path, define the network structure and connections, providing crucial insights into its fundamental organization and stability. Complementing these are hydraulic features that reflect WDN performance, including average flow velocity in pipes, piezometric head at nodes, and flow rates under various conditions.
This study incorporates 80 diverse features as critical input variables for various machine learning algorithms. These features enable supervised machine learning models to learn complex relationships within WDNs, facilitating optimal design without directly solving hydraulic equations. The features used in this research encompass nodes, pipes, and the overall network graph.
Feature assignment follows a specific methodology:
  • Node and overall network graph features are assigned to pipes.
  • For each pipe, the average of features from connecting nodes is calculated and assigned as a descriptive feature.
  • Features derived from the overall network graph are uniformly applied to all pipes within that network, aiding in network differentiation during the learning process.
The study employs both directed and undirected graph representations:
  • Undirected graphs are used for features such as square clustering coefficient, node eccentricity, and pipe length index.
  • Directed graphs are necessary for features like degree of centrality and shortest path from the reservoir to nodes.
  • Some features, including node closeness centrality index and betweenness centrality indices, require examination of both directed and undirected graphs.
Water flow direction determines the directionality of the graph in WDNs. For features requiring graph weight attributes, either the pipe length or the pipe resistance coefficient from the Hazen-Williams equation is used:
R = \frac{10.67\, L_i}{C^{1.85}\, d_i^{4.87}} \quad (3)

where R is the pipe resistance, L_i is the pipe length, C is the Hazen-Williams coefficient, and d_i is the pipe diameter.
The target feature in this study is the optimal commercial diameters in WDNs, obtained through the optimization process. Table 3 presents a schematic of the features derived from 600 synthetic WDNs. The dataset represented in Table 3 comprises 80 columns showing the features and 85,745 rows describing the attributes of nodes, pipes, and the overall network graph assigned to the pipes. For a comprehensive list of features used in this study, please refer to Table A1 in the Appendix A.
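To make the feature-assignment rules concrete, the sketch below derives a few representative features with NetworkX. It assumes each edge of the graph already carries length, hazen_williams, and optimized diameter attributes, and the selection of features and column names only loosely mirror Table A1.

```python
import networkx as nx

def pipe_resistance(length, c, diameter):
    # Equation (3): Hazen-Williams pipe resistance.
    return 10.67 * length / (c ** 1.85 * diameter ** 4.87)

def extract_pipe_features(G):
    # Attach the resistance R to every edge so it can serve as a graph weight.
    for u, v, data in G.edges(data=True):
        data["R"] = pipe_resistance(data["length"], data["hazen_williams"],
                                    data["diameter"])
    # Node-level features on the undirected graph.
    betweenness = nx.betweenness_centrality(G, weight="R")  # resistance-weighted
    square_clust = nx.square_clustering(G)
    eccentricity = nx.eccentricity(G)
    # Network-level feature, repeated for every pipe of this WDN.
    efficiency = nx.global_efficiency(G)
    rows = []
    for u, v, data in G.edges(data=True):
        rows.append({
            # Node features are averaged over the pipe's two end nodes.
            "bc_mean": 0.5 * (betweenness[u] + betweenness[v]),
            "sq_clust_mean": 0.5 * (square_clust[u] + square_clust[v]),
            "ecc_mean": 0.5 * (eccentricity[u] + eccentricity[v]),
            # Pipe-level features.
            "length": data["length"],
            "R": data["R"],
            # Graph-level feature, identical for all pipes in the network.
            "efficiency": efficiency,
        })
    return rows
```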

2.4. Database Preparation

The crucial stage of database preparation comes after randomly generating synthetic WDNs and extracting their topological and hydraulic characteristics. This procedure is vital for producing an appropriate dataset for machine learning models and consists of two primary steps:
  • Outlier Data Detection: Identifying and handling outliers is crucial, as these anomalous values can significantly impact model training, reducing accuracy and generalizability. In this study, outliers are defined as data points whose distance from the same dataset’s mean exceeds four times the standard deviation. Once identified, outliers are removed from the final database.
  • Data Normalization: Normalization is performed using the Min-max normalization method. This step is vital for (1) aligning features with different scales and (2) preventing the disproportionate impact of varying value ranges (e.g., pipe lengths vs. node pressures) on machine learning model performance.
By carefully addressing outliers and normalizing the data, we ensure that the machine learning models have a clean, well-prepared dataset to work with. This will ultimately lead to more reliable and accurate results in WDN analysis and design. To prevent data leakage between train/test sets, each WDN is considered as an independent entity during training, and all pipes in a WDN belong exclusively to either the training dataset or the testing dataset. This data-splitting procedure prevents data leakage and allows the machine learning model to be evaluated on unseen WDNs, increasing the model’s generalizability.
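A minimal sketch of this preparation pipeline is shown below, with illustrative column names; net_id is a hypothetical column identifying the WDN each pipe belongs to, used to enforce the network-wise split.

```python
import numpy as np
import pandas as pd

def prepare_database(df: pd.DataFrame, feature_cols, test_frac=0.3, seed=0):
    # Outlier removal: drop any row whose distance from a feature's mean
    # exceeds four standard deviations (Z-score criterion).
    z = (df[feature_cols] - df[feature_cols].mean()) / df[feature_cols].std()
    df = df[(z.abs() <= 4).all(axis=1)].copy()
    # Min-max normalization of every feature to the [0, 1] range.
    mins, maxs = df[feature_cols].min(), df[feature_cols].max()
    df[feature_cols] = (df[feature_cols] - mins) / (maxs - mins)
    # Split by network rather than by pipe, so every pipe of a given WDN
    # falls entirely in the training set or entirely in the test set.
    rng = np.random.default_rng(seed)
    test_nets = rng.choice(df["net_id"].unique(),
                           size=int(test_frac * df["net_id"].nunique()),
                           replace=False)
    test = df[df["net_id"].isin(test_nets)]
    train = df[~df["net_id"].isin(test_nets)]
    return train, test
```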

2.5. Feature Selection Methods

Feature selection is a critical step in using a machine learning model, particularly when dealing with datasets containing a large number of features. This process involves identifying and selecting the most relevant and informative features from the original dataset for training a machine learning model. The primary objectives of feature selection are to improve model performance, reduce overfitting, speed up training, and often provide better model interpretability [69,70].
Feature selection methods can be broadly categorized into four groups: Hybrid, Embedded, Wrapper, and Filter, with each method suitable for feature extraction in different domains [71]. The present study employs Filter and Embedded categories. Methods such as Var, Kb, and Chi2 fall under the Filter category, while methods like LGB, Per, and Xg belong to the Embedded group.
In Embedded methods, tuning the hyperparameters of the machine learning model is essential, since the selected hyperparameter configuration directly impacts model performance [72,73]. In this study, the Grid search method is used for hyperparameter tuning; it exhaustively searches a fixed range of hyperparameter values [74]. The Grid search algorithm defines a predefined range for each key hyperparameter, establishing a grid of candidate combinations that are systematically assessed. Evaluation is carried out with cross-validation: the dataset is partitioned into training and validation folds, the comparison metric is computed for each configuration, and the configuration with the best average performance across the folds is selected as optimal. This process is performed for each Embedded model in the study (i.e., the LGB, Xg, and Per models) to guarantee that each model operates under its best configuration, ensuring the consistency and generalizability of the results.
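The sketch below illustrates this tuning procedure for the LGB model using scikit-learn's GridSearchCV; the grid values and scoring metric are illustrative, not the study's exact settings.

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [200, 500, 1000],
    "num_leaves": [31, 63, 127],
    "max_depth": [-1, 8, 12],
    "learning_rate": [0.01, 0.05, 0.1],
}
search = GridSearchCV(
    LGBMRegressor(),
    param_grid,
    cv=5,                                   # cross-validation folds
    scoring="neg_root_mean_squared_error",  # comparison metric per configuration
)
# search.fit(X_train, y_train)   # X_train/y_train: the prepared training data
# best_model = search.best_estimator_
```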
Detailed explanations of all six feature selection methods used in this study are given below.
1. Chi2
This statistical test is specifically used to examine the dependency between two variables [75]. In feature selection, the chi2 method selects non-negative features demonstrating the highest statistical dependency on the target variable [76]. The chi2 value is calculated for each feature in relation to the target variable, with a higher chi2 value indicating a stronger dependency between the feature and the target. Accordingly, features with larger chi2 values are selected as important. This method is well-suited to categorical data, and the concept of variable dependency is relatively straightforward to grasp. However, its main limitation is that it applies only to non-negative features and target variables.
2. Var
In this simple and fast method, features that show minimal variation, meaning those with low variance, are removed from the dataset [77]. The central concept is that features that are almost constant tend not to provide valuable information for the machine-learning model [78]. A variance threshold is determined for applying this method, removing features with variances lower than this value from the feature set. Due to its simplicity and high speed, this method is often used as a preprocessing stage for data dimension reduction. However, it is important to note that this method examines features individually while ignoring potential interactions between them. Additionally, selecting an appropriate value for the variance threshold can be somewhat arbitrary and dependent on the data.
3. Kb
This method aims to select the K most relevant features to the target variable [79]. For this purpose, univariate statistical tests evaluate the relationship between each feature and the target variable. The functions in this method assign a score to each feature based on statistical criteria. Subsequently, the K features that have obtained the highest scores are selected as the optimal features. Furthermore, the chi2 test can also serve as a scoring function for non-negative features. This method is relatively fast and efficient, and it can effectively identify features related to the target. However, similar to the variance threshold method, the Kb method does not consider the interactions between features. It assumes that the relationship of each feature with the target can be evaluated independently.
4. LGB
LGB is a robust gradient-boosting framework that inherently calculates and delivers feature significance scores [80,81]. This score indicates the impact of each feature on the building of the decision trees in the boosting model. During model training, LGB calculates metrics such as the number of times a feature is used to split nodes in trees, known as “split,” and the reduction in node impurity due to using that feature, referred to as “gain”. These values are indicators of feature significance. This approach has the advantage of directly integrating feature selection into the model training process while accounting for the interactions between features. The key hyperparameters of this method, tuned before the training process using Grid search, include n_estimators, num_leaves, max_depth, and learning_rate.
5. Per
This method evaluates the relative significance of features after a machine learning model has been trained [82]. The trained model's baseline performance is first recorded. Next, the values of a single feature in the validation set are randomly shuffled, and the model's performance on the altered data is evaluated. A significant drop in performance after randomizing a feature's values indicates that the feature is highly significant to the model, as disturbing its values considerably impacts the model's predictive ability. The advantage of this approach is that it can be used with any trained model, and its conceptual basis is relatively easy to grasp. However, it can be computationally costly for large datasets with many features, as it requires re-evaluating model performance for each feature. As the model employed in this process is LightGBM, the hyperparameters from the previous section are used.
6. Xg
Like LGB, Xg is a popular gradient-boosting algorithm that offers feature significance calculation [83,84]. While training the model, Xg assigns feature importance scores based on how frequently a feature is used to split nodes, the gain contributed by splits on that feature, and the number of samples covered by those splits ("coverage"). Similar to LGB feature importance, this approach is inherent to model training and considers interactions among features. It must be noted that the calculated feature significance is specific to the Xg model and may produce different results for other models. The tuned hyperparameters in the current research study are n_estimators, eta, gamma, and max_depth.
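The contrast between the Filter and Embedded selectors can be illustrated as follows. This is a minimal sketch under the assumption that X is the [0, 1]-scaled feature matrix (non-negative, as chi2 requires) and y holds the optimal commercial diameters, which are discrete and therefore usable as classes by the chi2 test; thresholds and model settings are illustrative.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from xgboost import XGBRegressor

def select_features(X, y, feature_names, k=20):
    # Var (Filter): drop near-constant features below a variance threshold.
    var_mask = VarianceThreshold(threshold=1e-3).fit(X).get_support()
    # Kb (Filter): keep the k features most dependent on the target (chi2 score).
    kb_mask = SelectKBest(chi2, k=k).fit(X, y).get_support()
    # Xg (Embedded): rank features by a trained model's importance scores.
    model = XGBRegressor(n_estimators=300, max_depth=6).fit(X, y)
    xg_top = np.argsort(model.feature_importances_)[::-1][:k]
    return {
        "Var": [f for f, keep in zip(feature_names, var_mask) if keep],
        "Kb": [f for f, keep in zip(feature_names, kb_mask) if keep],
        "Xg": [feature_names[i] for i in xg_top],
    }
```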

2.6. Machine Learning Models

Machine learning is a branch of artificial intelligence that enables computers to learn from data without explicit programming. In other words, rather than giving computers step-by-step instructions to complete a task, large volumes of data are provided, with specific algorithms used to help the computer discern patterns, relationships, and rules within this data and to make predictions or decisions based on this knowledge [85]. The primary goal of machine learning is to develop systems that can improve their performance over time as they receive additional data. This improvement could involve various areas, such as prediction accuracy, processing speed, or the ability to recognize more complex patterns.

2.6.1. Regression in Machine Learning

Regression is one of the most critical and widely used aspects of supervised machine learning. Regression problems aim to establish a relationship between input and output variables. In this context, a relationship is developed to predict a continuous objective variable based on one or more predictor variables [86]. In the present study, four machine learning models, namely RF, SVM, BAG, and LGB, are used for regression problems, which will be outlined in detail in the following sections.
1. Random Forest (RF)
The idea of the random forest was first introduced in 1972 by Messenger and Mandell [87] and was later developed by Cutler [88]. The random forest model is among the most widely used machine learning algorithms for classification and regression [89]. This model combines decision trees to increase prediction accuracy and stability: each tree is trained independently and makes its own decisions about the data, and the final result is obtained through voting (for classification problems) or averaging (for regression problems) among the trees. The advantages of random forest include stability and high accuracy; by minimizing overfitting and enhancing diversity in predictions, the model attains high accuracy on complex problems. Another advantage is its resistance to noise: the model is less sensitive to noisy and incorrect data thanks to its use of multiple trees. The adjusted hyperparameters of this method in the present study include n_estimators, min_samples_split, and max_depth.
2. SVM
SVM is one of the most powerful and widely used algorithms in supervised machine learning [90], applied in both classification and regression problems. The main concept behind SVM is locating an optimal hyperplane that separates the data in the best possible way, i.e., a hyperplane that maximizes the margin. The margin is the distance between the hyperplane and the nearest data points; these closest points are called "support vectors" and play a key role in determining the hyperplane's position [91]. In other words, only these points affect the hyperplane's location, while other data points have no impact. SVM offers several advantages: it performs particularly well in high-dimensional spaces and is memory-efficient due to its focus on support vectors. Furthermore, SVMs remain valuable tools in the machine learning toolkit due to their high accuracy and decent generalization capabilities, and they are widely used in fields including machine vision, natural language processing, and bioinformatics [92,93]. The adjusted hyperparameters of this method in the present study include the regularization parameter (C), the kernel, and gamma.
3. BAG
Bootstrap aggregating (bagging) was first used by Breiman [94] for classification and regression models, evaluating different models by varying the number of bags; the results ultimately demonstrated enhanced accuracy with the combined BAG model [94]. The BAG approach is an ensemble learning process that generates multiple instances of a learner, resulting in several predictions [95]. The final model output is obtained by applying a combination rule (e.g., majority voting for classification or averaging for regression) to the outputs from each created subspace. Multiple samples are generated by drawing bootstrap replicates of the learning set, where samples are randomly drawn from the entire training data with replacement, with the same number of samples in each subset [96]. The adjusted hyperparameters of this method in the present study include n_estimators, max_samples, and max_features.
4. LGB
The initial concept of boosting learning methods was presented by Schapire [97]. In this learning paradigm, a weak learner is first fitted to the data and then successively enhanced into a strong learner in later stages [98]. Boosting comes in various forms, including the Gradient Boosting model, which converts a weak learner into a strong learner. The LGB model is an optimized version of the Gradient Boosting framework, designed for efficient regression and classification, particularly on high-dimensional data.
In earlier models, the risk of overfitting increased with tree depth because of the level-wise growth logic, which incrementally deepens trees during training to enhance accuracy. Under the leaf-wise logic of the LGB algorithm, by contrast, the leaf whose split yields the best improvement is selected at each stage of building the decision tree. This strategy is quite effective in increasing computational speed and reducing the risk of overfitting. The fundamental hyperparameters of this method were described in earlier sections.
In all the above machine learning models, the k-fold cross-validation technique is employed to mitigate overfitting. It enhances the model’s generalization ability and minimizes the possibility of overfitting when applied to new datasets.
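As an illustration, the following sketch evaluates a model with k-fold cross-validation, matching the K = 5 used in this study; the scoring choice is illustrative.

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, cross_val_score

def cv_rmse(X, y, k=5):
    model = LGBMRegressor()  # hyperparameters tuned beforehand via grid search
    folds = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=folds,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()  # average RMSE over the k validation folds
```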

2.6.2. Model Evaluation

Three criteria, R2, MAE, and RMSE, are used to evaluate the machine learning models. These criteria demonstrate the effectiveness and suitability of the models for the dataset. The R2 criterion indicates how much of the variance in the objective variable is explained by the model [99], while the MAE and RMSE criteria show how far the predicted values are from the actual values. These criteria are defined as follows:
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (4)

RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \quad (5)

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \quad (6)
where y_i is the real value of the objective variable, \hat{y}_i is the predicted value, \bar{y} is the mean value of the objective variable, and n is the total number of samples.
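For reference, the three criteria in Equations (4)-(6) map directly onto scikit-learn's metrics, as in the sketch below.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    return {
        "R2": r2_score(y_true, y_pred),                        # Equation (4)
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),   # Equation (5)
        "MAE": mean_absolute_error(y_true, y_pred),            # Equation (6)
    }
```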

2.6.3. Hanoi WDN

The most effective machine learning model, as determined by the comparative analysis, is selected for application to the Hanoi WDN. The Hanoi WDN, initially introduced by Fujiwara and Khang [100], is a benchmark case study in optimizing WDNs. This network consists of 32 nodes, 34 pipes, 3 loops, and 1 reservoir with a water elevation of 100 m.
Key hydraulic parameters of the Hanoi WDN include:
  • Minimum pressure head at demand nodes: 30 m
  • Hazen-Williams coefficient for all pipes: 130
Figure 2 provides a visual representation of the Hanoi WDN layout, illustrating the network’s topological structure and component relationships.
Figure 2 also presents the demand at each node. Additionally, the pipe lengths, in order of their respective numbers, are: {100, 1350, 900, 1150, 1450, 450, 850, 850, 800, 950, 1200, 3500, 800, 500, 550, 2730, 1750, 800, 400, 2200, 1500, 500, 2650, 1230, 1300, 850, 300, 750, 1500, 2000, 1600, 150, 860, 950} m. Six commercial pipe diameters are utilized in the design of the Hanoi WDN: 12, 16, 20, 24, 30, and 40 inches.

3. Results and Discussion

The first stage of the developed model involves the generation of synthetic WDNs. This process utilizes a developed algorithm that draws on the graph mining library in Python (NetworkX), the EPANET Toolkit in Python (EPyT), and the genetic algorithm optimizer (geneticalgorithm2). Figure 3 shows three of the generated synthetic WDNs. As is evident from these WDNs, all components of the generated networks are interconnected, without any pipes crossing one another. The minimum and maximum numbers of pipes connected to each node are 1 and 4, respectively. Finally, appropriate random values are assigned to each component of the generated networks to enable the hydraulic design process; for instance, as previously mentioned, pipe lengths are assigned values between 20 and 100 m, and the Hazen-Williams coefficients of pipes are assigned values between 80 and 130. Following the generation of the 600 synthetic networks, the design process for the optimal WDN diameters begins. The number of genetic algorithm iterations averages 400. Additionally, the population size, mutation probability, and crossover probability are set to 12 times the number of WDN pipes, 0.06, and 0.85, respectively. It is worth noting that the GA parameters are tuned through trial-and-error runs. The ranges and optimal values of the hyperparameters used in the GA are presented in Table 4.
In this study, Equation (1) is utilized as a performance indicator to evaluate the effectiveness of the network design cost optimization process. Furthermore, solution convergence is monitored to ensure optimal solutions: the number of optimization algorithm iterations for each WDN is determined from the network dimensions through trial and error, averaging 400 iterations, and the convergence criterion for the objective function is defined as less than 1% improvement in cost over a predetermined number of iterations. To verify the hydraulic constraints, each optimal solution is simulated in EPANET to ensure that the minimum pressure at all nodes is met, as presented in Equation (2). Notably, nodal mass continuity and loop energy conservation are implicitly satisfied by solving the governing equations of the WDNs.
The topological and hydraulic features are extracted from the WDNs during the next stage. These features pertain to the topological and hydraulic properties of nodes, pipes, and the overall network graph, totaling 80 characteristics, with the target feature being the optimal pipe diameters. The features attained at this stage are displayed in Table 5. This dataset contains 80 columns, each corresponding to a specific feature, and 85,745 rows describing the attributes of nodes, pipes, and the overall network graph assigned to the pipes. The features identified for nodes are allocated to the connected pipes by calculating the arithmetic mean of the features of the two nodes on either side of each pipe and assigning this mean as the pipe's feature. Regarding the features obtained for the overall network graph, each extracted feature is applied to all pipes in that WDN; for instance, the network efficiency feature (G7) illustrated in Table 5 is repeated with the same value for all pipes in the WDN. The dataset constructed at this stage has been uploaded to GitHub (https://github.com/bahrami-i/WDNs-Dataset; accessed on 24 May 2025), where readers can conveniently download it for use in their research. Table A1 in Appendix A provides a detailed overview of the features analyzed in this study.
In the next step, after the topological and hydraulic features are obtained, outliers are identified and removed from the dataset. To this end, all features are examined, and values deviating from the feature mean by more than four times the standard deviation are flagged according to the Z-score criterion. Each row containing such an outlier value is removed from the dataset in its entirety. Following outlier removal, the number of rows in the dataset decreases from 85,745 to 67,226. Figure 4 compares the frequency histogram of the N4 feature before and after outlier removal; as illustrated, removing outliers has visibly tightened the range of the N4 feature. The other step taken at this stage is forming the Spearman correlation matrix to evaluate multicollinearity among the 80 features and gain an initial understanding of the relationships between them. Features with high correlation coefficients (more than 90 percent) are examined and removed or merged; the high correlation between features G33 and G20 serves as an example, and because of it, these two features are omitted from the output of the six feature selection methods.
The dataset is then normalized using the min-max normalization method to scale the features within the range of 0 to 1. This data normalization improves machine learning model performance, increases convergence speed in gradient-based algorithms, and prevents characteristics with larger scales from dominating. Table 6 displays the results of the features from Table 5 after outlier removal and data normalization.
A visual comparison of Table 5 and Table 6 reveals that the row corresponding to index 85,742 in Table 5 has been removed from Table 6, which, as previously explained, occurred due to outlier removal. Additionally, Table 6 clearly shows that the data have been normalized between 0 and 1 following the data normalization process. In the following step, the dataset is shuffled and split into training and test sets. A total of 70 percent of the data are assigned to the training set, while 30 percent is reserved for the test set. The reason behind this splitting is that allocating 70% of the data for training allows the model to learn underlying patterns with sufficient variety and volume, which is especially important in complex problems such as those involving WDNs. Meanwhile, reserving 30% for testing ensures an adequate and representative portion of unseen data to evaluate the generalization ability of the trained model.
The next stage involves extracting essential features from the 80 existing features in the dataset. To this end, as previously mentioned, six models, namely Var, Kb, Chi2, LGB, Per, and Xg, are utilized. The Var, Chi2, and Kb models belong to the Filter model category, while the LGB, Xg, and Per models belong to the Embedded category. The hyperparameter tuning in Embedded models is carried out using the Grid search algorithm. Table 7 depicts the hyperparameter values of the Embedded models tuned using the Grid search algorithm. The adjustment of these hyperparameters presented in Table 7 is conducted in a manner that prevents overfitting while maintaining a balance between exploitation and exploration in embedded methods. Additionally, Table 8 illustrates the top 20 features provided by the feature selection methods.
The analysis of the results of feature selection methods in Table 8 reveals several key insights:
Feature importance by category:
  • Node-related features dominate in four methods (Xg, Per, LGB, and Chi2)
  • Overall network graph properties are most prominent in two methods (Kb and Var)
Significance of specific features:
  • N5 and N7 (weighted node centrality degrees) consistently rank in the top quartile for most methods, except Var
  • Features that use pipe resistance (R) as a weighted criterion show greater importance than those based on pipe length.
Distribution of top features:
  • Node features: 50% on average
  • Pipe features: 26% on average
  • Graph features: 24% on average
Method-specific observations:
  • Filter methods (e.g., Kb and Var) tend to select more overall network graph features.
  • Embedded methods (e.g., Xg, Per, LGB) primarily select node and pipe features.
Application to machine learning:
  • Top features from each selection method are paired with corresponding optimal diameters.
  • These feature sets are used as inputs for four machine learning models.
Hyperparameters for each model are optimized using the Grid search algorithm (detailed in Table 9).
Figure 5 illustrates the Jaccard similarity matrix. According to this figure, the Embedded methods (Xg, LGB) present nearly identical feature selections (similarity = 0.90), confirming a core set of hydraulic-topological features, such as E8, N8, N5, and E9, for diameter prediction. Additionally, the Filter methods (Var, Kb) are highly divergent (similarity ≤ 0.18), as they prioritize overall network graph criteria that are overlooked by the Embedded methods. The Jaccard indices of the Chi2 method with the Embedded methods (Xg, LGB) and the Filter method (Kb) are 0.54 and 0.21, respectively. This means the Chi2 method aligns more substantially with the Embedded methods while preserving some Filter method behavior, effectively acting as a bridge between the Embedded and Filter methods.
The 24 combined machine learning models are trained and evaluated using the test data. The evaluation metrics used are R2, MAE, and RMSE. Figure 6 shows the output of the 24 coupled machine learning models. As observed in this figure, the strongest model across the three evaluation metrics is the Xg-LGB model, which applies the best features selected by the Xg method to the LGB regressor, with R2, MAE, and RMSE values of 0.98, 0.017, and 0.02, respectively. The weakest model is Var-SVM, with R2, MAE, and RMSE values of 0.41, 0.146, and 0.21, respectively. The critical point inferred from the Xg-LGB model output is that the developed model is highly accurate, even without directly using the governing hydraulic equations. Although the current study employs a data-driven approach, it indirectly and inherently incorporates the hydraulic equations during the learning stages. This is made possible by designing all WDNs optimally through coupling the hydraulic simulation to the optimization algorithm: hydraulic equations and constraints, such as pressure head loss, flow constraints, and pressure requirements, are applied during the simulation-optimization step. Thus, the objective feature (optimal diameter) is a function of the hydraulic behavior in those simulations.
After selecting the best machine learning method, the Xg-LGB model is used to find the optimal diameters of the Hanoi WDN. The results show that R2 is 0.94, indicating that the model identifies a high percentage of the variance in the objective values (the optimal diameters presented by Kadu [101]). Furthermore, the MAE and RMSE values are 0.054 and 0.06, respectively, indicating that a small percentage of variance in the objective values remains unidentified by the model due to factors such as a shortage of features, measurement errors, incidental noise, model limitations, and data limitations; the model's efficiency could be enhanced by improving each of these factors. In this study, overfitting in the machine learning models is controlled using both K-fold cross-validation (K = 5) and ensemble models, which are inherently resistant to overfitting. The similarity in performance criteria (R2, MAE, and RMSE) between the test and validation data provides evidence of the generalizability of the developed model.
Table 10 presents the diameters calculated by the Xg-LGB model for the Hanoi WDN, along with the diameters of this network presented by Kadu. As per Table 10, the diameters predicted by the Xg-LGB model fall in a continuous range and must be mapped to standard commercial sizes. To this end, three mapping cases are performed: (1) mapping to the nearest commercial diameter, whether larger or smaller; (2) mapping to the nearest upper commercial diameter; and (3) mapping to the nearest lower commercial diameter. The three mapped designs are then simulated in EPANET, and the least expensive design that meets the 30 m minimum pressure constraint at the nodes while avoiding over-design is selected.
Comparing the commercial diameters of the developed method with those of the Kadu method shows that in approximately 65% of cases, the diameters forecast by the Xg-LGB model equal Kadu's outputs, while in the remaining 35%, they are one size larger. Considering this strong similarity, along with the performance criteria of the Xg-LGB model, it can be concluded that the model is suitable and applicable to other real-world WDNs for the following reasons: first, the original synthetic database presented in Table 5 is suitable for generating the optimal diameters for a given number of random WDNs; second, the Xg feature selection method in Table 8 has effectively identified the features most relevant to the objective; and third, the Xg-LGB model has been well trained on the synthetic database and then applied to the Hanoi benchmark network. Together, these factors support the model's generalizability.
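The three mapping cases can be expressed compactly, as in the sketch below; the commercial sizes are those of the Hanoi WDN (in inches), and the predictions are assumed to be de-normalized back to physical units beforehand.

```python
import numpy as np

COMMERCIAL = np.array([12, 16, 20, 24, 30, 40])  # Hanoi commercial sizes (in)

def map_diameters(pred, rule="nearest"):
    pred = np.asarray(pred, dtype=float)
    if rule == "nearest":  # case 1: closest commercial size, larger or smaller
        idx = np.abs(COMMERCIAL[None, :] - pred[:, None]).argmin(axis=1)
    elif rule == "upper":  # case 2: round up to the next commercial size
        idx = np.clip(np.searchsorted(COMMERCIAL, pred, side="left"),
                      0, len(COMMERCIAL) - 1)
    elif rule == "lower":  # case 3: round down to the previous commercial size
        idx = np.clip(np.searchsorted(COMMERCIAL, pred, side="right") - 1,
                      0, len(COMMERCIAL) - 1)
    else:
        raise ValueError(rule)
    return COMMERCIAL[idx]

# Each mapped design is then re-simulated in EPANET, and the cheapest design
# satisfying the 30 m minimum-pressure constraint is retained.
```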
Additionally, applying the diameters predicted by the Xg-LGB model to the Hanoi WDN and conducting the hydraulic simulation with EPANET yields the nodal pressure heads displayed in the last column of Table 10. Except for node 2, whose pressure is 97.14 m (due to its proximity to the reservoir), the pressures at the other nodes range from 35 to 65 m. This improves on the minimum pressure obtained by the Kadu model while meeting the maximum pressure requirements, which is essential for ensuring adequate pressure at consumer nodes.
Figure 7 illustrates the diameters obtained from the Xg-LGB machine learning method and the Kadu method, along with the nodal pressure distributions. As is evident from this figure, the two methods differ in 14 pipe diameters. On average, the suggested method selects larger diameters, raising nodal pressures compared with the Kadu method; this higher pressure enhances system reliability and resiliency during failure states. A comparison of execution times for determining optimal pipe diameters reveals that the developed method obtains the optimal pipe diameters in just 0.004 s, whereas the Kadu method and the methods presented in [101] take significantly longer, requiring 7.8 min and 158 min, respectively.
In this research, only one real WDN (the Hanoi WDN) is used for validation, which also enables other researchers to compare their results with those obtained in this study. However, it is important to note that this study generates 600 synthetic WDNs that closely simulate the characteristics and hydraulic behavior of real-world WDNs. Using this wide range of hydraulically realistic synthetic WDNs gives the proposed method strong generalization, allowing it to be applied reliably to other, new real-world WDNs.
This study assumed that the elevation of all nodes is zero; however, because the developed model uses the piezometric head, which equals the sum of elevation and pressure head, this assumption does not affect the performance of the developed model.
A critical and valuable advantage of the developed method is the significant reduction in computational time relative to traditional optimization-based design methods. Conventional WDN design methods, i.e., simulation-optimization methods, perform several hydraulic simulations in every iteration, so their runtime can range from a few minutes to several hours depending on the complexity and size of the WDN; designing a WDN with traditional methods can therefore take considerable time. In contrast, once trained and preprocessed, the developed data-driven method obtains the optimal network diameters for new WDNs within seconds, depending on network size. A core issue with traditional WDN design is the interaction between the hydraulic simulation and the optimization algorithm: every candidate design evaluated at every iteration requires a hydraulic simulation, which is time-consuming and increases the computational costs of the design. The developed method eliminates the need for hydraulic simulations at design time, effectively addressing this issue and significantly reducing computational costs.
According to the information presented in Table 10 and Figure 7, the diameters extracted by the Xg-LGB model tend to be larger than those obtained by the Kadu method. Although a complete match between the two methods is observed for all large diameters (40-inch pipes), the diameters from the developed method are equal to or larger than the Kadu values for the other pipes, which leads to increased design costs. It is important to note that the method proposed by Kadu [101] employs a single-objective genetic algorithm to minimize design cost, whereas the developed model determines diameters based on patterns extracted from data. In practice, the developed model is trained on data that implicitly encompass multiple objectives, such as pressure, network reliability, and design cost. The proposed model therefore demonstrates considerably higher generalizability than traditional methods, in addition to its favorable computational time. Because the model accounts for hydraulic reliability and prediction accuracy, it can lead to more conservative pipe diameters that ensure minimum pressure. Strategies for reducing design costs in the developed model include directly integrating design cost into the learning process through a loss function, or adopting a hybrid approach that combines data-driven models with optimization algorithms. Additionally, in the feature engineering stage, a new feature related to pipe costs could be added to the feature set, enabling the machine learning model to learn the cost component.

4. Conclusions

The optimal design of WDNs is one of the most critical engineering challenges, directly impacting service quality, economic costs, and water resource sustainability. Given urban population growth and increasing water demand, there is an increasingly urgent need for efficient methods to design these networks. Traditional WDN design methods are often time-consuming and have significant limitations when facing complex networks. Thus, developing novel and intelligent methods that can perform optimal design with high speed and accuracy seems essential.
A novel machine learning-based approach for optimal water distribution network design was presented in this study. To this end, 600 synthetic WDNs were generated and optimized. Then, 80 topological and hydraulic features concerning nodes, pipes, and the overall network graph were extracted from the optimized WDNs. Following data preprocessing, including outlier detection and normalization, six feature selection methods, namely Chi2, Var, Kb, LGB, Per, and Xg, were employed. Subsequently, four machine learning algorithms, including RF, SVM, LGB, and BAG, were combined with the feature selection methods. Results showed that the Xg-LGB method had the best performance based on R2, MAE, and RMSE for optimal WDN design. This combined method not only demonstrated higher accuracy than the other methods but also showed significant capability in generalizing and predicting optimal pipe diameters. Finally, the Xg-LGB method was applied to a real-world WDN, the Hanoi WDN, demonstrating that the model can predict the optimal pipe diameters with R2, MAE, and RMSE values of 0.94, 0.054, and 0.06, respectively. Comparing the traditional and developed methods shows that the developed method selects larger diameters; this creates a safety margin in nodal heads, making the WDN more reliable than the traditional design. Moreover, because the proposed method is based on trained data and requires no hydraulic simulation runs, it demands far less computational time than the traditional approach, which couples an optimization algorithm to a hydraulic simulation model and runs many simulations in each iteration. Another benefit inferred from the performance criteria (R2, MAE, and RMSE) is the developed method's generalizability, owing to the variety of samples in the dataset, the suitable features extracted by the feature selection methods, and optimal training. The results also indicated that, among node, pipe, and overall network graph features, topological and hydraulic node features are of very high importance. This study demonstrates that machine learning approaches can significantly improve WDN design processes, and its results can serve as a guideline for engineers and designers in selecting appropriate pipe diameters in WDNs.
It is recommended that future studies investigate and explore the feasibility of using graph neural networks as an alternative to coupled machine learning models. Additionally, other factors such as reliability and uncertainty could be considered when developing machine learning methods for designing WDNs that achieve optimal diameters and enhance reliability. Also, to further reduce design costs in data-driven models, future studies can incorporate a loss function sensitive to design costs during model training or utilize engineering knowledge to attempt to include cost-related features inside machine learning model training. Finally, since machine learning models resemble black boxes without providing insight into decision-making processes, future studies are recommended to utilize tools that improve model transparency, such as SHAP or LIME.

Author Contributions

Conceptualization, M.M.R.; Methodology, I.B.C., M.M.R., A.E.B. and A.H.; Validation, M.M.R.; Investigation, I.B.C.; Writing – original draft, I.B.C.; Writing – review & editing, M.M.R., A.E.B., M.A. and A.H.; Visualization, I.B.C.; Supervision, A.E.B. and A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This project is partially funded by the Deutsche Forschungsgemeinschaft (DFG) under Project number 544048327.

Data Availability Statement

The data presented in this study are openly available at https://github.com/bahrami-i/WDNs-Dataset.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Complete List of Features Used in this Study.
Index | Feature Name | Description

Node Features (Topological)
N1 | NodeDC | Node Centrality Degree
N2 | NodeDOut | Output Degree of Directed Graph Nodes
N3 | NodeDIn | Input Degree of Directed Graph Nodes
N4 | NodeDCInL | Internal Weighted Centrality Degree: Input Degree Weighted by Input Edge Lengths in the Directed Graph
N5 | NodeDCInRHW | Internal Weighted Centrality Degree: Input Degree Weighted by the Resistance Index (R) of Input Edges in the Directed Graph
N6 | NodeDCOutL | External Weighted Centrality Degree: Output Degree Weighted by Output Edge Lengths in the Directed Graph
N7 | NodeDCOutR | External Weighted Centrality Degree: Output Degree Weighted by the Resistance Index (R) of Output Edges in the Directed Graph
N8 | NodeAve.Dim | Average Diameter of Pipes Connected to the Node
N9 | NodeMin.Path.Length | Minimum Weighted Length Distance from the Reservoir to the Node in the Directed Graph
N10 | NodeMin.Path.RHW | Minimum Weighted Resistance Distance from the Reservoir to the Node in the Directed Graph
N11 | NodeSCC | Node Clustering Coefficient in the Undirected Graph
N12 | NodeCCUndirL | Weighted Length Closeness Centrality Index in the Undirected Graph
N13 | NodeCCUndirR | Weighted Resistance Closeness Centrality Index in the Undirected Graph
N14 | NodeCCdirL | Weighted Length Closeness Centrality Index in the Directed Graph
N15 | NodeCCdirR | Weighted Resistance Closeness Centrality Index in the Directed Graph
N16 | NodeBCUndirL | Weighted Length Betweenness Centrality Index in the Undirected Graph
N17 | NodeBCUndirR | Weighted Resistance Betweenness Centrality Index in the Undirected Graph
N18 | NodeBCdirL | Weighted Length Betweenness Centrality Index in the Directed Graph
N19 | NodeBCdirR | Weighted Resistance Betweenness Centrality Index in the Directed Graph
N20 | NodeECUndirL | Weighted Length Eigenvector Centrality Index in the Undirected Graph
N21 | NodeECdirL | Weighted Length Eigenvector Centrality Index in the Directed Graph
N22 | NodeECdirR | Weighted Resistance Eigenvector Centrality Index in the Directed Graph
N23 | NodeSCUndir | Subgraph Centrality Index in the Undirected Graph
N24 | NodeEcCUndirL | Weighted Length Node Eccentricity Index in the Undirected Graph
N25 | NodeEcCUndirR | Weighted Resistance Node Eccentricity Index in the Undirected Graph

Node Features (Hydraulic)
N26 | NodeDemand | Consumption Flow Rate at the Node
N27 | NodePressure | Pressure at the Node

Edge Features (Topological)
E1 | EdgeFriction | Pipe Roughness Index
E2 | EdgeDiameter | Pipe Diameter Index
E3 | EdgeLength | Pipe Length Index
E4 | EdgeBCUndirL | Weighted Length Edge Betweenness Centrality Index in the Undirected Graph
E5 | EdgeBCUndirR | Weighted Resistance Edge Betweenness Centrality Index in the Undirected Graph
E6 | EdgeBCdirL | Weighted Length Edge Betweenness Centrality Index in the Directed Graph
E7 | EdgeBCdirR | Weighted Resistance Edge Betweenness Centrality Index in the Directed Graph

Edge Features (Hydraulic)
E8 | EdgeVelocity | Flow Velocity Index in the Pipe
E9 | EdgeHeadLoss | Energy Loss Index per Pipe Length

Graph Features (Topological)
G1 | GraphAve.Min.Path.L | Average Minimum Weighted Length Distance from the Reservoir to the Nodes in the Overall Directed Graph
G2 | GraphAve.Min.Path.R | Average Minimum Weighted Resistance Distance from the Reservoir to the Nodes in the Overall Directed Graph
G3 | GraphDiaUndirL | Weighted Length Graph Diameter in the Overall Undirected Graph
G4 | GraphDiaUndirR | Weighted Resistance Graph Diameter in the Overall Undirected Graph
G5 | GraphRadUndirL | Weighted Length Graph Radius in the Overall Undirected Graph
G6 | GraphRadUndirR | Weighted Resistance Graph Radius in the Overall Undirected Graph
G7 | GraphEffiUndir | Graph Efficiency in the Overall Undirected Graph
G8 | GraphAve.CCUndirL | Average Weighted Length Closeness Centrality Index in the Overall Undirected Graph
G9 | GraphAve.CCUndirR | Average Weighted Resistance Closeness Centrality Index in the Overall Undirected Graph
G10 | GraphAve.CCdirL | Average Weighted Length Closeness Centrality Index in the Overall Directed Graph
G11 | GraphAve.CCdirR | Average Weighted Resistance Closeness Centrality Index in the Overall Directed Graph
G12 | GraphAve.BCUndirL | Average Weighted Length Betweenness Centrality Index in the Overall Undirected Graph
G13 | GraphAve.BCUndirR | Average Weighted Resistance Betweenness Centrality Index in the Overall Undirected Graph
G14 | GraphAve.BCdirL | Average Weighted Length Betweenness Centrality Index in the Overall Directed Graph
G15 | GraphAve.BCdirR | Average Weighted Resistance Betweenness Centrality Index in the Overall Directed Graph
G16 | GraphCDUndirL | Weighted Length Central Point Dominance in the Overall Undirected Graph
G17 | GraphCDUndirR | Weighted Resistance Central Point Dominance in the Overall Undirected Graph
G18 | GraphCDdirL | Weighted Length Central Point Dominance in the Overall Directed Graph
G19 | GraphCDdirR | Weighted Resistance Central Point Dominance in the Overall Directed Graph
G20 | GraphAve.DegUndir | Average Degree in the Overall Undirected Graph
G21 | GraphAve.OutDegdir | Average Output Degree in the Overall Directed Graph
G22 | GraphMax.DegUndir | Maximum Degree in the Overall Undirected Graph
G23 | GraphMax.InDegdir | Maximum Input Degree in the Overall Directed Graph
G24 | GraphMax.OutDegdir | Maximum Output Degree in the Overall Directed Graph
G25 | GraphAve.LCC | Average Node Square Clustering Coefficient in the Overall Undirected Graph
G26 | GraphFVUndir | Algebraic Connectivity Index in the Overall Undirected Graph
G27 | GraphFVUndirL | Weighted Length Algebraic Connectivity Index in the Overall Undirected Graph
G28 | GraphFVUndirR | Weighted Resistance Algebraic Connectivity Index in the Overall Undirected Graph
G29 | GraphSDUndir | Spectral Difference of the Overall Undirected Graph
G30 | GraphSDdir | Spectral Difference of the Overall Directed Graph
G31 | GraphDensityUndir | Density of the Overall Undirected Graph
G32 | GraphDensitydir | Density of the Overall Directed Graph
G33 | GraphMeshUndir | Mesh Coefficient of the Overall Undirected Graph
G34 | GraphDeadEnddir | Sum of Input Degrees of Dead-End Nodes in the Overall Directed Graph
G35 | GraphNCUndirRes | Normalized Minimum Cut Between the Reservoir and First Node and the Other Nodes in the Overall Undirected Graph
G36 | GraphNCUndirResL | Weighted Length Normalized Minimum Cut Between the Reservoir and First Node and the Other Nodes in the Overall Undirected Graph
G37 | GraphNCUndirResR | Weighted Resistance Normalized Minimum Cut Between the Reservoir and First Node and the Other Nodes in the Overall Undirected Graph
G38 | GraphNCUndirEnd | Normalized Minimum Cut Between the Terminal Node and the Other Nodes in the Overall Undirected Graph
G39 | GraphNCUndirEndL | Weighted Length Normalized Minimum Cut Between the Terminal Node and the Other Nodes in the Overall Undirected Graph
G40 | GraphNCUndirEndR | Weighted Resistance Normalized Minimum Cut Between the Terminal Node and the Other Nodes in the Overall Undirected Graph
G41 | GraphTotalLength | Total Network Length in the Overall Graph
G42 | GraphRHW | Overall Network Resistance Index in the Overall Undirected Graph

Graph Features (Hydraulic)
G43 | GraphResElv | Water Level in the Reservoir
G44 | GraphTotalDemand | Total Input Flow to the Network
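To illustrate how features of the kind listed in Table A1 can be computed, the following minimal sketch uses the networkx library on a small toy graph. The graph, its pipe lengths, and the reservoir label are hypothetical illustrations only; the study extracts its features from the graphs of the synthetic WDN models.

```python
import networkx as nx

# Toy network: a reservoir "Res" feeding three junctions (lengths in m, assumed).
G = nx.Graph()
G.add_edge("Res", "N1", length=100.0)
G.add_edge("N1", "N2", length=250.0)
G.add_edge("N1", "N3", length=180.0)
G.add_edge("N2", "N3", length=300.0)

dc = nx.degree_centrality(G)                            # N1-type feature
cc_len = nx.closeness_centrality(G, distance="length")  # N12-type feature
bc_len = nx.betweenness_centrality(G, weight="length")  # N16-type feature
dist = nx.single_source_dijkstra_path_length(G, "Res", weight="length")  # N9-type

for n in G.nodes:
    print(f"{n}: DC={dc[n]:.2f}, CC_L={cc_len[n]:.4f}, "
          f"BC_L={bc_len[n]:.2f}, MinPathL={dist[n]:.0f}")
```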

Appendix B

The synthetic water distribution networks (WDNs) were generated in the first step of the current study. Topological and hydraulic features were then extracted from the synthetic WDNs using graph theory, forming the initial database. Next, machine learning models were used to identify the most influential features in the database, which were then employed to design pipe diameters in WDNs. Finally, the Xg-LGB machine learning model, applied to the Hanoi network, produced the best results for predicting pipe diameters (R2, MAE, and RMSE of 0.94, 0.054, and 0.06, respectively). In this section, a hybrid WOA-ANN model is implemented on the best features obtained from the XgBoost method so that other researchers can reproduce the results. For this purpose, a brief explanation of the model components is provided, followed by its implementation on the Hanoi network. The implementation code and documentation for the WOA-ANN model have been uploaded to GitHub (https://github.com/bahrami-i/WDNs-Dataset; accessed on 24 May 2025), allowing readers to conveniently download and reuse them for their research.

Appendix B.1. Whale Optimization Algorithm (WOA)

The WOA algorithm was introduced by Mirjalili and Lewis in 2016 [102]. Imitating the hunting behavior of whales, it optimizes complex, nonlinear problems. The WOA is a population-based algorithm classified as a swarm intelligence method: the whales represent solution vectors, and the target prey represents the best possible solution (the global optimum). Its general steps are as follows:
(1) Prey Encircling (Exploitation Phase): The whales identify the target prey and encircle it. The algorithm correspondingly moves toward the best candidate solution among the available options, narrowing the search area.
(2) Bubble-Net Attacking (Local Search Phase): The whales hunt by releasing spiral bubbles around the prey. The solutions therefore move along a spiral path toward the best solution, enabling an accurate search in its neighborhood.
(3) Searching for Prey (Exploration Phase): When no suitable prey is nearby, the whales search randomly for better prey. The algorithm thus explores new solutions through random whale movements and avoids stagnating in non-optimal regions.
The primary parameters of the WOA algorithm in the current study were selected by trial and error. The ranges and optimal values of the WOA hyperparameters are shown in Table A2. They include Number of Whales = 30, which is the population size; Number of Iterations = 100, which is the number of iterations run until convergence to the best solution; and the coefficient a, which decreases linearly from 2 to 0 and controls the balance between the exploration and exploitation phases. A minimal implementation sketch is given after Table A2.
Table A2. The ranges and optimal values of the hyperparameters used in WOA.

Optimization Algorithm | Hyperparameter | Tuning Range of Hyperparameter Values | Optimal Hyperparameter Values
WOA | Number of Whales | 20–50 | 30
WOA | Number of Iterations | 50–200 | 100
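The following minimal WOA sketch in Python follows the three phases described above with the Table A2 settings (30 whales, 100 iterations, coefficient a decreasing from 2 to 0). The sphere objective, bounds, and seed are placeholders for illustration; in the study, the objective is the ANN prediction error (see Appendix B.2).

```python
import numpy as np

def woa(objective, dim, bounds, n_whales=30, n_iter=100, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_whales, dim))   # whale positions (solutions)
    fitness = np.array([objective(x) for x in X])
    best_idx = fitness.argmin()
    best, best_fit = X[best_idx].copy(), fitness[best_idx]

    for t in range(n_iter):
        a = 2.0 - 2.0 * t / n_iter                  # coefficient a: 2 -> 0
        for i in range(n_whales):
            A = 2 * a * rng.random() - a            # exploration/exploitation control
            C = 2 * rng.random()
            if rng.random() < 0.5:
                if abs(A) < 1:                      # (1) encircle the prey (exploit)
                    X[i] = best - A * np.abs(C * best - X[i])
                else:                               # (3) search for prey (explore)
                    rand = X[rng.integers(n_whales)]
                    X[i] = rand - A * np.abs(C * rand - X[i])
            else:                                   # (2) bubble-net spiral attack
                l = rng.uniform(-1, 1)
                X[i] = np.abs(best - X[i]) * np.exp(b * l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
        fitness = np.array([objective(x) for x in X])
        if fitness.min() < best_fit:
            best_idx = fitness.argmin()
            best, best_fit = X[best_idx].copy(), fitness[best_idx]
    return best, best_fit

# Example: minimize the 5-dimensional sphere function.
best, best_fit = woa(lambda x: float(np.sum(x ** 2)), dim=5, bounds=(-10.0, 10.0))
```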

Appendix B.2. Artificial Neural Networks (ANNs)

The initial idea of Artificial Neural Networks (ANNs) was introduced by McCulloch and Pitts in 1943 [103]. An ANN consists of several layers: an input layer, hidden layers (for feature transformation), and an output layer (for predicting results). Training adjusts the network weights to minimize prediction errors. The current study uses a Multilayer Perceptron (MLP) model (a subclass of ANNs) to predict the designed pipe diameters.
The WOA algorithm is used to find the best hyperparameters for the MLP model: the hybrid WOA-ANN model is created, and over multiple iterations the WOA algorithm extracts the hyperparameters that are most suitable for training the ANN to predict the designed pipe diameters in WDNs. In the current study, the parameters of the two-layer ANN model are hidden_n = 24, which is the number of neurons in each hidden layer; alpha = 2.14 × 10−7, which is the regularization parameter that prevents overfitting by adding a penalty to the loss function; Lr = 1.37 × 10−2, which is the initial learning rate that controls the step size of the weight updates; and max_iter = 2500, which is the maximum number of iterations during MLP training. A minimal configuration sketch follows.
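The sketch below assembles this configuration with scikit-learn's MLPRegressor. The placeholder arrays stand in for the selected features and normalized optimal diameters; only the four hyperparameter values above are taken from the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))    # placeholder: 20 selected features per pipe
y_train = rng.random(500)          # placeholder: normalized optimal diameters

mlp = MLPRegressor(
    hidden_layer_sizes=(24, 24),   # two hidden layers, hidden_n = 24 neurons each
    alpha=2.14e-7,                 # L2 regularization strength
    learning_rate_init=1.37e-2,    # initial learning rate (Lr)
    max_iter=2500,                 # maximum number of training iterations
    random_state=0,
)
mlp.fit(X_train, y_train)
d_pred = mlp.predict(X_train[:5])  # continuous diameter predictions
```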
Next, the hybrid WOA-ANN model is implemented and trained on the best features obtained from the XgBoost method, with the model outputs (optimal pipe diameters) treated as continuous values. The Hanoi network's data are then used to assess the WOA-ANN model, yielding R2, MAE, and RMSE values of 0.93, 0.056, and 0.069, respectively (a minimal sketch of this evaluation step is given below). Moreover, Table A3 lists the pipe diameters predicted for the Hanoi network by the hybrid WOA-ANN model.
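The evaluation uses standard regression metrics; a minimal sketch with scikit-learn is given below, with placeholder arrays in place of the Hanoi data (observed optimal versus predicted diameters).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([40.0, 30.0, 24.0, 16.0, 12.0])   # placeholder values (in)
d_pred = np.array([38.9, 33.7, 23.4, 18.2, 13.0])   # placeholder values (in)

r2 = r2_score(y_true, d_pred)
mae = mean_absolute_error(y_true, d_pred)
rmse = float(np.sqrt(mean_squared_error(y_true, d_pred)))
print(f"R2 = {r2:.2f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```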
Because the pipe diameters predicted by the WOA-ANN model lie in a continuous range, a mapping process is applied to snap them to commercial pipe diameters while checking the hydraulic pressure constraints at the nodes of the Hanoi network. Table A3 presents the resulting diameters. Comparing these with the results of the Xg-LGB model shows agreement for approximately 68% of the pipes. Since this agreement was obtained after post-processing and mapping to commercial diameters, it is a suitable indicator of the reproducibility of the Xg-LGB results. Furthermore, the close R2, MAE, and RMSE values of the Xg-LGB and WOA-ANN models indicate reproducibility across all phases of the study, from initial database creation to model training and pipe diameter prediction in WDNs. A minimal sketch of the mapping step is given below.
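The sketch snaps each continuous prediction to the nearest commercial size of the Hanoi benchmark, which reproduces the continuous-to-commercial mapping shown in Table 10; the subsequent check of nodal pressure constraints via hydraulic simulation is omitted here.

```python
import numpy as np

commercial = np.array([12, 16, 20, 24, 30, 40])     # Hanoi commercial sizes (in)

def map_to_commercial(d_cont, sizes=commercial):
    """Snap each continuous diameter to the nearest commercial size."""
    d_cont = np.atleast_1d(np.asarray(d_cont, dtype=float))
    idx = np.abs(sizes[None, :] - d_cont[:, None]).argmin(axis=1)
    return sizes[idx]

# Continuous Xg-LGB predictions for pipes 9-12 of Table 10 map to 30, 40, 30, 30.
print(map_to_commercial([33.7, 35.4, 34.9, 27.4]))  # -> [30 40 30 30]
```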
Table A3. Results of diameter predictions for the Kadu [101], Xg-LGB, and WOA-ANN methods in the Hanoi WDN.

Pipe Number | Pipe Diameters from [101] (in) | Commercial Pipe Diameters from Xg-LGB Model (in) | Commercial Pipe Diameters from WOA-ANN Model (in)
1 | 40 | 40 | 40
2 | 40 | 40 | 40
3 | 40 | 40 | 40
4 | 40 | 40 | 40
5 | 40 | 40 | 40
6 | 40 | 40 | 40
7 | 40 | 40 | 40
8 | 40 | 40 | 40
9 | 30 | 30 | 40
10 | 30 | 40 | 30
11 | 30 | 30 | 30
12 | 24 | 30 | 20
13 | 16 | 20 | 20
14 | 12 | 16 | 16
15 | 12 | 12 | 16
16 | 16 | 20 | 20
17 | 20 | 24 | 24
18 | 24 | 24 | 30
19 | 24 | 30 | 30
20 | 40 | 40 | 40
21 | 20 | 24 | 24
22 | 12 | 12 | 16
23 | 40 | 40 | 40
24 | 30 | 30 | 40
25 | 30 | 30 | 30
26 | 20 | 20 | 24
27 | 12 | 16 | 16
28 | 12 | 16 | 12
29 | 16 | 16 | 20
30 | 12 | 16 | 16
31 | 12 | 12 | 12
32 | 16 | 16 | 16
33 | 20 | 24 | 20
34 | 24 | 24 | 24

References

1. Swamee, P.K.; Sharma, A.K. Design of Water Supply Pipe Networks; John Wiley & Sons: Hoboken, NJ, USA, 2008.
2. Gupta, I. Linear programming analysis of a water supply system. AIIE Trans. 1969, 1, 56–61.
3. Karmeli, D.; Gadish, Y.; Meyers, S. Design of optimal water distribution networks. J. Pipeline Div. 1968, 94, 1–10.
4. Schaake, J.C., Jr.; Lai, D. Linear Programming and Dynamic Programming Application to Water Distribution Network Design; M.I.T. Hydrodynamics Laboratory: Cambridge, MA, USA, 1969.
5. Su, Y.-C.; Mays, L.W.; Duan, N.; Lansey, K.E. Reliability-based optimization model for water distribution systems. J. Hydraul. Eng. 1987, 113, 1539–1556.
6. Duan, N.; Mays, L.W.; Lansey, K.E. Optimal reliability-based design of pumping and distribution systems. J. Hydraul. Eng. 1990, 116, 249–268.
7. Samani, H.M.; Taghi Naeeni, S. Optimization of water distribution networks. J. Hydraul. Res. 1996, 34, 623–632.
8. Murphy, L.; Simpson, A. Pipe optimization using genetic algorithms. Res. Rep. 1992, 93, 95.
9. Savic, D.A.; Walters, G.A. Genetic algorithms for least-cost design of water distribution networks. J. Water Resour. Plan. Manag. 1997, 123, 67–77.
10. Simpson, A.R.; Dandy, G.C.; Murphy, L.J. Genetic algorithms compared to other techniques for pipe optimization. J. Water Resour. Plan. Manag. 1994, 120, 423–443.
11. Creaco, E.; Franchini, M. Low level hybrid procedure for the multi-objective design of water distribution networks. Procedia Eng. 2014, 70, 369–378.
12. Farmani, R.; Walters, G.A.; Savic, D.A. Trade-off between total cost and reliability for Anytown water distribution network. J. Water Resour. Plan. Manag. 2005, 131, 161–171.
13. Farmani, R.; Walters, G.; Savic, D. Evolutionary multi-objective optimization of the design and operation of water distribution network: Total cost vs. reliability vs. water quality. J. Hydroinform. 2006, 8, 165–179.
14. Prasad, T.D.; Park, N.-S. Multiobjective genetic algorithms for design of water distribution networks. J. Water Resour. Plan. Manag. 2004, 130, 73–82.
15. Riyahi, M.M.; Bakhshipour, A.E.; Haghighi, A. Probabilistic warm solutions-based multi-objective optimization algorithm, application in optimal design of water distribution networks. Sustain. Cities Soc. 2023, 91, 104424.
16. Todini, E. Looped water distribution networks design using a resilience index based heuristic approach. Urban Water 2000, 2, 115–122.
17. Elshaboury, N.; Marzouk, M. Prioritizing water distribution pipelines rehabilitation using machine learning algorithms. Soft Comput. 2022, 26, 5179–5193.
18. Jafari, S.M.; Nikoo, M.R.; Bozorg-Haddad, O.; Alamdari, N.; Farmani, R.; Gandomi, A.H. A robust clustering-based multi-objective model for optimal instruction of pipes replacement in urban WDN based on machine learning approaches. Urban Water J. 2023, 20, 689–706.
19. Maußner, C.; Oberascher, M.; Autengruber, A.; Kahl, A.; Sitzenfrei, R. Explainable artificial intelligence for reliable water demand forecasting to increase trust in predictions. Water Res. 2025, 268, 122779.
20. Namdari, H.; Haghighi, A.; Ashrafi, S.M. Short-term urban water demand forecasting; application of 1D convolutional neural network (1D CNN) in comparison with different deep learning schemes. Stoch. Environ. Res. Risk Assess. 2023, 1–16.
21. Baziar, M.; Behnami, A.; Jafari, N.; Mohammadi, A.; Abdolahnejad, A. Machine learning-based Monte Carlo hyperparameter optimization for THMs prediction in urban water distribution networks. J. Water Process Eng. 2025, 73, 107683.
22. Magini, R.; Moretti, M.; Boniforti, M.A.; Guercio, R. A machine-learning approach for monitoring water distribution networks (WDNS). Sustainability 2023, 15, 2981.
23. Pandian, C.; Alphonse, P. Evaluating water pipe leak detection and localization with various machine learning and deep learning models. Int. J. Syst. Assur. Eng. Manag. 2025, 1–13.
24. Ayati, A.H.; Haghighi, A. Multiobjective wrapper sampling design for leak detection of pipe networks based on machine learning and transient methods. J. Water Resour. Plan. Manag. 2023, 149, 04022076.
25. Pei, S.; Hoang, L.; Fu, G.; Butler, D. Real-time multi-objective optimization of pump scheduling in water distribution networks using neuro-evolution. J. Water Process Eng. 2024, 68, 106315.
26. Bondy, J.A.; Murty, U.S.R. Graph Theory with Applications; Macmillan: London, UK, 1976; Volume 290.
27. Hamam, Y.; Brameller, A. Hybrid method for the solution of piping networks. Proc. Inst. Electr. Eng. 1971, 118, 1607–1612.
28. Kesavan, H.K.; Chandrashekar, M. Graph-theoretic models for pipe network analysis. J. Hydraul. Div. 1972, 98, 345–364.
29. Riyahi, M.M.; Bakhshipour, A.E.; Giudicianni, C.; Dittmer, U.; Haghighi, A.; Creaco, E. An Analytical Solution for the Hydraulics of Looped Pipe Networks. Eng. Proc. 2024, 69, 4.
30. Sitzenfrei, R. A graph-based optimization framework for large water distribution networks. Water 2023, 15, 2896.
31. Jung, D.; Yoo, D.G.; Kang, D.; Kim, J.H. Linear model for estimating water distribution system reliability. J. Water Resour. Plan. Manag. 2016, 142, 04016022.
32. Alzamora, F.M.; Ulanicki, B.; Zehnpfund, A. Simplification of Water Distribution Network Models. In Proceedings of the 2nd International Conference on Hydroinformatics, Zurich, Switzerland, 9–13 September 1996.
33. Giudicianni, C.; di Nardo, A.; Oliva, G.; Scala, A.; Herrera, M. A dimensionality-reduction strategy to compute shortest paths in urban water networks. arXiv 2019, arXiv:1903.11710.
34. Satish, R.; Hajibabaei, M.; Dastgir, A.; Oberascher, M.; Sitzenfrei, R. A graph-based method for identifying critical pipe failure combinations in water distribution networks. Water Supply 2024, 24, 2353–2366.
35. Ostfeld, A. Water distribution systems connectivity analysis. J. Water Resour. Plan. Manag. 2005, 131, 58–66.
36. Yazdani, A.; Jeffrey, P. Robustness and vulnerability analysis of water distribution networks using graph theoretic and complex network principles. In Water Distribution Systems Analysis 2010; American Society of Civil Engineers: Reston, VA, USA, 2010; pp. 933–945.
37. Yazdani, A.; Otoo, R.A.; Jeffrey, P. Resilience enhancing expansion strategies for water distribution systems: A network theory approach. Environ. Model. Softw. 2011, 26, 1574–1582.
38. Ulusoy, A.-J.; Stoianov, I.; Chazerain, A. Hydraulically informed graph theoretic measure of link criticality for the resilience analysis of water distribution networks. Appl. Netw. Sci. 2018, 3, 1–22.
39. Oberascher, M.; Minaei, A.; Sitzenfrei, R. Graph-Based Genetic Algorithm for Localization of Multiple Existing Leakages in Water Distribution Networks. J. Water Resour. Plan. Manag. 2025, 151, 04024059.
40. Rajeswaran, A.; Narasimhan, S.; Narasimhan, S. A graph partitioning algorithm for leak detection in water distribution networks. Comput. Chem. Eng. 2018, 108, 11–23.
41. Di Nardo, A.; Giudicianni, C.; Greco, R.; Herrera, M.; Santonastaso, G.F. Applications of graph spectral techniques to water distribution network management. Water 2018, 10, 45.
42. Sitzenfrei, R.; Satish, R.; Rajabi, M.; Hajibabaei, M.; Oberascher, M. Graph-Based Methodology for Segment Criticality Assessment and Optimal Valve Placements in Water Networks. In Proceedings of the EGU General Assembly 2025, Vienna, Austria, 27 April–2 May 2025.
43. Riyahi, M.M.; Giudicianni, C.; Haghighi, A.; Creaco, E. Coupled multi-objective optimization of water distribution network design and partitioning: A spectral graph-theory approach. Urban Water J. 2024, 21, 745–756.
44. Tzatchkov, V.G.; Alcocer-Yamanaka, V.H.; Bourguett Ortíz, V. Graph theory based algorithms for water distribution network sectorization projects. In Proceedings of the Water Distribution Systems Analysis Symposium 2006, Cincinnati, OH, USA, 27–30 August 2006; pp. 1–15.
45. Deuerlein, J.W. Decomposition model of a general water supply network graph. J. Hydraul. Eng. 2008, 134, 822–832.
46. Price, E.; Ostfeld, A. Graph theory modeling approach for optimal operation of water distribution systems. J. Hydraul. Eng. 2016, 142, 04015061.
47. Price, E.; Ostfeld, A. Optimal pump scheduling in water distribution systems using graph theory under hydraulic and chlorine constraints. J. Water Resour. Plan. Manag. 2016, 142, 04016037.
48. Marini, G.; Fontana, N.; Maio, M.; Di Menna, F.; Giugni, M. A novel approach to avoiding technically unfeasible solutions in the pump scheduling problem. Water 2023, 15, 286.
49. Ahmed, A.A.; Sayed, S.; Abdoulhalik, A.; Moutari, S.; Oyedele, L. Applications of machine learning to water resources management: A review of present status and future opportunities. J. Clean. Prod. 2024, 441, 140715.
50. Zhou, X.; Guo, S.; Xin, K.; Tang, Z.; Chu, X.; Fu, G. Network embedding: The bridge between water distribution network hydraulics and machine learning. Water Res. 2025, 273, 123011.
51. Coelho, M.; Austin, M.A.; Mishra, S.; Blackburn, M. Teaching Machines to Understand Urban Networks: A Graph Autoencoder Approach. Int. J. Adv. Netw. Serv. 2020, 13, 70–81.
52. Arsene, C.; Al-Dabass, D.; Hartley, J. Decision support system for water distribution systems based on neural networks and graphs. In Proceedings of the 2012 UKSim 14th International Conference on Computer Modelling and Simulation, Cambridge, UK, 28–30 March 2012; pp. 315–323.
53. Kang, J.; Park, Y.-J.; Lee, J.; Wang, S.-H.; Eom, D.-S. Novel leakage detection by ensemble CNN-SVM and graph-based localization in water distribution systems. IEEE Trans. Ind. Electron. 2017, 65, 4279–4289.
54. Barros, D.; Zanfei, A.; Menapace, A.; Meirelles, G.; Herrera, M.; Brentan, B. Leak detection and localization in water distribution systems via multilayer networks. Water Res. X 2025, 26, 100280.
55. Komba, G. A Novel Leak Detection Algorithm Based on SVM-CNN-GT for Water Distribution Networks. Indones. J. Comput. Sci. 2025, 14.
56. Amali, S.; Faddouli, N.-e.E.; Boutoulout, A. Machine learning and graph theory to optimize drinking water. Procedia Comput. Sci. 2018, 127, 310–319.
57. Li, Z.; Liu, H.; Zhang, C.; Fu, G. Real-time water quality prediction in water distribution networks using graph neural networks with sparse monitoring data. Water Res. 2024, 250, 121018.
58. Liy-González, P.-A.; Santos-Ruiz, I.; Delgado-Aguiñaga, J.-A.; Navarro-Díaz, A.; López-Estrada, F.-R.; Gómez-Peñate, S. Pressure Interpolation in Water Distribution Networks by Using Gaussian Processes: Application to Leak Diagnosis. Processes 2024, 12, 1147.
59. Cheng, M.; Li, J. Optimal sensor placement for leak location in water distribution networks: A feature selection method combined with graph signal processing. Water Res. 2023, 242, 120313.
60. Di Nardo, A.; Di Natale, M.; Giudicianni, C.; Musmarra, D.; Santonastaso, G.F.; Simone, A. Water distribution system clustering and partitioning based on social network algorithms. Procedia Eng. 2015, 119, 196–205.
61. Han, R.; Liu, J. Spectral clustering and genetic algorithm for design of district metered areas in water distribution systems. Procedia Eng. 2017, 186, 152–159.
62. Chen, T.Y.-J.; Guikema, S.D. Prediction of water main failures with the spatial clustering of breaks. Reliab. Eng. Syst. Saf. 2020, 203, 107108.
63. Grammatopoulou, M.; Kanellopoulos, A.; Vamvoudakis, K.G.; Lau, N. A Multi-step and Resilient Predictive Q-learning Algorithm for IoT with Human Operators in the Loop: A Case Study in Water Supply Networks. arXiv 2020, arXiv:2006.03899.
64. Xia, W.; Wang, S.; Shi, M.; Xia, Q.; Jin, W. Research on partition strategy of an urban water supply network based on optimized hierarchical clustering algorithm. Water Supply 2022, 22, 4387–4399.
65. Chen, R.; Wang, Q.; Javanmardi, A. A Review of the Application of Machine Learning for Pipeline Integrity Predictive Analysis in Water Distribution Networks. Arch. Comput. Methods Eng. 2025, 1–29.
66. Rossman, L.A.; Woo, H.; Tryby, M.; Shang, F.; Janke, R.; Haxton, T. EPANET 2.2 User Manual; Water Infrastructure Division, Center for Environmental Solutions and Emergency Response, U.S. Environmental Protection Agency: Cincinnati, OH, USA, 2020.
67. Kyriakou, M.S.; Demetriades, M.; Vrachimis, S.G.; Eliades, D.G.; Polycarpou, M.M. EPyT: An EPANET-Python toolkit for smart water network simulations. J. Open Source Softw. 2023, 8, 5947.
68. Makaremi, Y.; Haghighi, A.; Ghafouri, H.R. Optimization of pump scheduling program in water supply systems using a self-adaptive NSGA-II; a review of theory to real application. Water Resour. Manag. 2017, 31, 1283–1304.
69. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
70. Venkatesh, B.; Anuradha, J. A review of feature selection and its methods. Cybern. Inf. Technol. 2019, 19, 3–26.
71. Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28.
72. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316.
73. Probst, P.; Boulesteix, A.-L.; Bischl, B. Tunability: Importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 2019, 20, 1–32.
74. Injadat, M.; Moubayed, A.; Nassif, A.B.; Shami, A. Systematic ensemble model selection approach for educational data mining. Knowl.-Based Syst. 2020, 200, 105992.
75. Liu, H.; Setiono, R. Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, Herndon, VA, USA, 5–8 November 1995; pp. 388–391.
76. Khomytska, I.; Bazylevych, I.; Teslyuk, V.; Karamysheva, I. The chi-square test and data clustering combined for author identification. In Proceedings of the 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine, 19–21 October 2023; pp. 1–5.
77. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
78. Fida, M.A.F.A.; Ahmad, T.; Ntahobari, M. Variance threshold as early screening to Boruta feature selection for intrusion detection system. In Proceedings of the 2021 13th International Conference on Information & Communication Technology and System (ICTS), Surabaya, Indonesia, 20–21 October 2021; pp. 46–50.
79. Desyani, T.; Saifudin, A.; Yulianti, Y. Feature selection based on naive Bayes for caesarean section prediction. IOP Conf. Ser. Mater. Sci. Eng. 2020, 879, 012091.
80. Ye, Y.; Liu, C.; Zemiti, N.; Yang, C. Optimal feature selection for EMG-based finger force estimation using LightGBM model. In Proceedings of the 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), New Delhi, India, 14–18 October 2019; pp. 1–7.
81. Hua, Y. An efficient traffic classification scheme using embedded feature selection and LightGBM. In Proceedings of the 2020 Information Communication Technologies Conference (ICTC), Nanjing, China, 29–31 May 2020; pp. 125–130.
82. Altmann, A.; Toloşi, L.; Sander, O.; Lengauer, T. Permutation importance: A corrected feature importance measure. Bioinformatics 2010, 26, 1340–1347.
83. Hsieh, C.-P.; Chen, Y.-T.; Beh, W.-K.; Wu, A.-Y.A. Feature selection framework for XGBoost based on electrodermal activity in stress detection. In Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS), Nanjing, China, 20–23 October 2019; pp. 330–335.
84. Alsahaf, A.; Petkov, N.; Shenoy, V.; Azzopardi, G. A framework for feature selection through boosting. Expert Syst. Appl. 2022, 187, 115895.
85. Riyahi, M.M.; Rahmanshahi, M.; Ranginkaman, M.H. Frequency domain analysis of transient flow in pipelines; application of the genetic programming to reduce the linearization errors. J. Hydraul. Struct. 2018, 4, 75–90.
86. Han, J.; Pei, J.; Tong, H. Data Mining: Concepts and Techniques; Morgan Kaufmann: Burlington, MA, USA, 2022.
87. Messenger, R.; Mandell, L. A modal search technique for predictive nominal scale multivariate analysis. J. Am. Stat. Assoc. 1972, 67, 768–772.
88. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random forests. In Ensemble Machine Learning: Methods and Applications; Springer: New York, NY, USA, 2012; pp. 157–175.
89. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
90. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
91. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: New York, NY, USA, 2013.
92. Scholkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2018.
93. Ben-Hur, A.; Weston, J. A user's guide to support vector machines. In Data Mining Techniques for the Life Sciences; Springer: New York, NY, USA, 2009; pp. 223–239.
94. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
95. Gaikwad, D.P.; Thool, R.C. Intrusion detection system using bagging ensemble method of machine learning. In Proceedings of the 2015 International Conference on Computing Communication Control and Automation, Pune, India, 26–27 February 2015; pp. 291–295.
96. Tüysüzoğlu, G.; Birant, D. Enhanced bagging (eBagging): A novel approach for ensemble learning. Int. Arab J. Inf. Technol. 2020, 17, 515–528.
97. Schapire, R.E. The strength of weak learnability. Mach. Learn. 1990, 5, 197–227.
98. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
99. Kutner, M.H.; Nachtsheim, C.J.; Neter, J.; Li, W. Applied Linear Statistical Models; McGraw-Hill: New York, NY, USA, 2005.
100. Fujiwara, O.; Khang, D.B. A two-phase decomposition method for optimal design of looped water distribution networks. Water Resour. Res. 1990, 26, 539–549.
101. Kadu, M.S.; Gupta, R.; Bhave, P.R. Optimal design of water networks using a modified genetic algorithm with reduction in search space. J. Water Resour. Plan. Manag. 2008, 134, 147–160.
102. Mirjalili, S.; Lewis, A. The whale optimization algorithm. Adv. Eng. Softw. 2016, 95, 51–67.
103. McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 1943, 5, 115–133.
Figure 1. Flowchart of the developed machine learning model based on graph theory for WDN design.
Figure 2. Hanoi WDN.
Figure 3. Generated synthetic WDNs.
Figure 4. Histogram of the N4 feature before (a) and after (b) outlier removal.
Figure 5. The Jaccard similarity between feature selection methods.
Figure 6. The output of 24 coupled machine learning models.
Figure 7. WDN design solution and pressure distribution for the Hanoi network: (a) Kadu method; (b) developed method.
Table 1. Comparative summary of machine learning model applications in WDNs.

Application | Machine Learning Model | Features Used | Dataset | Reference
Rehabilitation/design of WDNs | Feed-Forward Neural Network (FFNN) | Length, material, age, diameter, depth, and wall thickness | Real-world WDN (Shaker Al-Bahery WDN) | [17]
Rehabilitation/design of WDNs | Hybrid model (GLR-LR-RBFNN-SVR-ANFIS-FFNN) | Age, pipe depth, number of failures, diameter, and length | Real-world WDN (Gorgan WDN) | [18]
Water demand forecasting | Six machine learning models (LS, DT, KNN, SVR, RF, and RNN) | Air temperature, water consumption, and precipitation | The Battle of Water Demand Forecasting (WDSA-CCWI-2024) dataset | [19]
Water demand forecasting | One-dimensional convolutional neural network (1D CNN) | Hourly water demand data | Real-world WDN (Shiraz WDN) | [20]
WDN monitoring | XGBoost | Free residual chlorine concentration (FRC), total organic carbon (TOC), pH, and distance from water treatment plants (WTPs) | Real-world WDN (Maragheh WDN) | [21]
WDN monitoring | Artificial neural network (ANN) | Pressure values, demand values, and number of users | Benchmark network (Fossolo WDN) | [22]
Pump operation | Artificial neural network (ANN) | Water levels in tanks | Benchmark network (Anytown network) | [25]
WDN analysis and management | KNN, SVM, and RF | Demand variations and structural relationships | Benchmark network (M town) | [50]
Leak detection | RF | Structure and network attributes | Two case studies | [51]
Leak detection | SVM-CNN | Flow rate, pressure, and temperature | Real-world WDN | [55]
Failure prediction | Gradient Boosted Trees (GBTs) and RF | 19 features such as pipe diameter, pipe material, pipe length, pipe age, etc. | Real-world WDN | [62]
Failure prediction | Reinforcement learning algorithm based on Q-learning | Location ID, time to repair, and cost | Arlington County's water network | [63]
Failure prediction | Random Forest-Hierarchical Clustering (RF-HC) | Time-domain features of flow data (peak value, mean, variance, form factor, etc.) | Real-world WDN | [64]
Table 2. Commercial diameters and their corresponding costs.

Diameter Number | Diameter (mm) | Cost of Pipes (€/m)
1 | 16 | 10.34
2 | 20 | 11.18
3 | 25 | 12.22
4 | 32 | 13.69
5 | 40 | 15.36
6 | 50 | 17.45
7 | 63 | 20.17
8 | 75 | 22.67
9 | 90 | 25.81
10 | 110 | 30.89
11 | 125 | 35.38
12 | 160 | 48.84
13 | 200 | 66.80
14 | 250 | 95.25
15 | 315 | 141.83
16 | 400 | 216.60
17 | 500 | 327.50
18 | 600 | 438.40
19 | 800 | 660.20
20 | 1000 | 882.00
Table 3. The features obtained from synthetic WDNs.

Network ID | Index | Graph Features (G1, G2, ..., G44) | Node Features (N1, N2, ..., N27) | Edge Features (E1, E2, ..., E9) | Diameters
Net 1 | Pipe1 | Net 1(G1), Net 1(G2), ..., Net 1(G44) | P1N1, P1N2, ..., P1N27 | P1E1, P1E2, ..., P1E9 | Pipe1(D1)
Net 1 | Pipe2 | Net 1(G1), Net 1(G2), ..., Net 1(G44) | P2N1, P2N2, ..., P2N27 | P2E1, P2E2, ..., P2E9 | Pipe2(D2)
Net 1 | Pipen | Net 1(G1), Net 1(G2), ..., Net 1(G44) | PnN1, PnN2, ..., PnN27 | PnE1, PnE2, ..., PnE9 | Pipen(Dn)
Net 2 | Pipe1 | Net 2(G1), Net 2(G2), ..., Net 2(G44) | P1N1, P1N2, ..., P1N27 | P1E1, P1E2, ..., P1E9 | Pipe1(D1)
Net 2 | Pipe2 | Net 2(G1), Net 2(G2), ..., Net 2(G44) | P2N1, P2N2, ..., P2N27 | P2E1, P2E2, ..., P2E9 | Pipe2(D2)
Net 2 | Pipen | Net 2(G1), Net 2(G2), ..., Net 2(G44) | PnN1, PnN2, ..., PnN27 | PnE1, PnE2, ..., PnE9 | Pipen(Dn)
Net 600 | Pipe1 | Net 600(G1), Net 600(G2), ..., Net 600(G44) | P1N1, P1N2, ..., P1N27 | P1E1, P1E2, ..., P1E9 | Pipe1(D1)
Net 600 | Pipe2 | Net 600(G1), Net 600(G2), ..., Net 600(G44) | P2N1, P2N2, ..., P2N27 | P2E1, P2E2, ..., P2E9 | Pipe2(D2)
Net 600 | Pipen | Net 600(G1), Net 600(G2), ..., Net 600(G44) | PnN1, PnN2, ..., PnN27 | PnE1, PnE2, ..., PnE9 | Pipen(Dn)
Table 4. The ranges and optimal values of the hyperparameters used in GA.

Optimization Algorithm | Hyperparameter | Tuning Range of Hyperparameter Values | Optimal Hyperparameter Values
GA | Population Size | [5–20] × Number of Pipes | 12 × Number of Pipes
GA | Mutation Probability | 0.01–0.015 | 0.06
GA | Crossover Probability | 0.6–0.95 | 0.85
GA | Number of Iterations | 100–600 | 400
Table 5. First and last rows of extracted features from WDNs.

Index | G1 | G2 | G44 | N1 | N2 | N27 | E1 | E3 | E9 | Diameters (mm)
0 | 379 | 407 | 3749 | 2.50 | 0.50 | 11.80 | 87 | 31 | 0.01 | 200
1 | 379 | 407 | 3749 | 2.50 | 1.00 | 11.90 | 108 | 71 | 0.00 | 600
2 | 379 | 407 | 3749 | 3.50 | 1.50 | 13.70 | 112 | 94 | 0.04 | 125
3 | 379 | 407 | 3749 | 3.00 | 1.00 | 11.90 | 112 | 22 | 0.00 | 500
4 | 379 | 407 | 3749 | 3.50 | 1.50 | 14.20 | 111 | 23 | 0.20 | 200
5 | 379 | 407 | 3749 | 2.50 | 1.00 | 15.20 | 114 | 29 | 0.22 | 50
6 | 379 | 407 | 3749 | 3.00 | 1.50 | 18.70 | 111 | 21 | 0.03 | 160
7 | 379 | 407 | 3749 | 3.00 | 1.50 | 18.90 | 111 | 22 | 0.00 | 800
8 | 379 | 407 | 3749 | 2.50 | 1.50 | 18.70 | 96 | 92 | 0.00 | 400
9 | 379 | 407 | 3749 | 3.50 | 2.00 | 18.50 | 96 | 78 | 0.00 | 315
85,735 | 484 | 507 | 3672 | 3.00 | 0.50 | 28.80 | 120 | 46 | 0.00 | 250
85,736 | 484 | 507 | 3672 | 3.00 | 1.50 | 36.50 | 118 | 54 | 0.30 | 40
85,737 | 484 | 507 | 3672 | 3.00 | 1.50 | 45.00 | 106 | 63 | 0.03 | 250
85,738 | 484 | 507 | 3672 | 3.00 | 1.00 | 46.30 | 87 | 38 | 0.02 | 315
85,739 | 484 | 507 | 3672 | 3.00 | 1.50 | 52.20 | 88 | 49 | 0.22 | 90
85,740 | 484 | 507 | 3672 | 3.00 | 1.50 | 58.80 | 91 | 40 | 0.06 | 315
85,741 | 484 | 507 | 3672 | 3.00 | 1.50 | 63.80 | 92 | 89 | 0.08 | 315
85,742 | 484 | 507 | 3672 | 3.00 | 2.00 | 68.20 | 83 | 45 | 0.03 | 1000
85,743 | 484 | 507 | 3672 | 3.00 | 2.00 | 70.20 | 90 | 35 | 0.01 | 800
85,744 | 484 | 507 | 3672 | 2.00 | 1.50 | 35.80 | 89 | 63 | 0.09 | 800
Table 6. First and last rows of extracted features following outlier removal and normalization.

Index | G1 | G2 | G44 | N1 | N2 | N27 | E1 | E3 | E9 | Diameters
0 | 0.23 | 0.04 | 0.41 | 0.25 | 0.00 | 0.13 | 0.17 | 0.14 | 0.01 | 0.19
1 | 0.23 | 0.04 | 0.41 | 0.25 | 0.20 | 0.14 | 0.70 | 0.64 | 0.00 | 0.59
2 | 0.23 | 0.04 | 0.41 | 0.75 | 0.40 | 0.16 | 0.80 | 0.92 | 0.07 | 0.11
3 | 0.23 | 0.04 | 0.41 | 0.50 | 0.20 | 0.14 | 0.80 | 0.02 | 0.00 | 0.49
4 | 0.23 | 0.04 | 0.41 | 0.75 | 0.40 | 0.16 | 0.77 | 0.04 | 0.38 | 0.19
5 | 0.23 | 0.04 | 0.41 | 0.25 | 0.20 | 0.17 | 0.85 | 0.11 | 0.44 | 0.03
6 | 0.23 | 0.04 | 0.41 | 0.50 | 0.40 | 0.22 | 0.77 | 0.01 | 0.06 | 0.15
7 | 0.23 | 0.04 | 0.41 | 0.50 | 0.40 | 0.22 | 0.77 | 0.02 | 0.00 | 0.80
8 | 0.23 | 0.04 | 0.41 | 0.25 | 0.40 | 0.22 | 0.40 | 0.90 | 0.01 | 0.39
9 | 0.23 | 0.04 | 0.41 | 0.75 | 0.60 | 0.21 | 0.40 | 0.73 | 0.00 | 0.30
85,736 | 0.55 | 0.05 | 0.39 | 0.50 | 0.40 | 0.42 | 0.95 | 0.42 | 0.56 | 0.02
85,737 | 0.55 | 0.05 | 0.39 | 0.50 | 0.40 | 0.52 | 0.65 | 0.54 | 0.06 | 0.24
85,738 | 0.55 | 0.05 | 0.39 | 0.50 | 0.20 | 0.54 | 0.17 | 0.22 | 0.04 | 0.30
85,739 | 0.55 | 0.05 | 0.39 | 0.50 | 0.40 | 0.61 | 0.20 | 0.36 | 0.44 | 0.07
85,740 | 0.55 | 0.05 | 0.39 | 0.50 | 0.40 | 0.68 | 0.27 | 0.25 | 0.12 | 0.30
85,741 | 0.55 | 0.05 | 0.39 | 0.50 | 0.40 | 0.74 | 0.30 | 0.86 | 0.17 | 0.30
85,743 | 0.55 | 0.05 | 0.39 | 0.50 | 0.60 | 0.82 | 0.25 | 0.19 | 0.15 | 0.80
85,744 | 0.55 | 0.05 | 0.39 | 0.00 | 0.40 | 0.41 | 0.22 | 0.54 | 0.17 | 0.80
Table 7. Hyperparameter values of embedded methods (tuned with grid search).

Hyperparameter | Xg | LGB | Per
n-estimators | 1350 | 1500 | 1500
eta | 0.250 | - | -
gamma | 0.002 | - | -
Max-depth | 15 | 12 | 12
Num-leaves | - | 45 | 45
Learning-rate | - | 0.011 | 0.011
Table 8. Selection of top 20 features based on feature selection methods.

Rank | Kb | Chi2 | Var | LGB | Per | Xg
1 | N5 | N5 | E1 | E9 | N5 | E9
2 | N7 | N7 | E3 | E8 | N7 | N5
3 | E5 | E5 | G43 | N5 | E3 | E8
4 | N8 | N8 | G44 | N7 | E9 | N7
5 | E9 | N17 | G7 | E3 | E1 | E1
6 | E3 | E9 | G20 | E1 | E5 | E3
7 | E1 | N15 | G21 | N8 | E8 | N8
8 | G42 | N19 | G23 | E5 | N8 | N10
9 | N23 | E8 | G30 | N10 | N10 | E5
10 | G12 | N18 | G31 | N6 | N6 | N6
11 | G8 | N2 | G32 | N4 | N3 | N26
12 | G4 | N10 | G35 | N17 | N17 | N4
13 | G17 | N3 | G36 | N26 | N1 | E6
14 | G13 | N13 | G39 | E4 | N23 | E7
15 | G10 | N6 | G41 | N19 | N25 | N17
16 | G41 | E3 | N1 | N9 | N4 | N20
17 | G6 | N4 | N2 | E6 | G25 | N15
18 | G18 | N14 | N5 | N18 | G3 | E4
19 | G27 | N22 | N7 | E7 | G2 | N9
20 | G2 | E1 | N23 | N20 | G10 | N19
Node features percentage | 20 | 75 | 25 | 60 | 55 | 60
Pipe features percentage | 20 | 25 | 10 | 40 | 25 | 40
Overall graph features percentage | 60 | 0 | 65 | 0 | 20 | 0
Table 9. Hyperparameter values of machine learning models (tuned with grid search).

Hyperparameter | RF | SVM | BAG | LGB
n-estimators | 250 | - | 140 | 1500
Max-depth | 30 | - | - | -
Min-samples-split | 10 | - | - | 12
Max-samples | - | - | 0.7000 | -
Max-features | - | - | 0.7500 | -
C | - | 1.0000 | - | -
kernel | - | 'rbf' | - | -
gamma | - | 0.0001 | - | -
Num-leaves | - | - | - | 45
Learning-rate | - | - | - | 0.0110
Table 10. The pipe diameters from the Kadu [101] and Xg-LGB models, and the pressure head at each node of the Hanoi WDN.

Pipe Number | Pipe Diameters from [101] (in) | Continuous Pipe Diameters from Xg-LGB Model (in) | Commercial Pipe Diameters from Xg-LGB Model (in) | Node Number | Pressure Head from [101] (m) | Pressure Head from Xg-LGB Model (m)
1 | 40 | 38.9 | 40 | 1 | 100.00 | 100.00
2 | 40 | 36.9 | 40 | 2 | 97.08 | 97.14
3 | 40 | 40.7 | 40 | 3 | 60.82 | 61.67
4 | 40 | 40.3 | 40 | 4 | 56.38 | 57.39
5 | 40 | 39.9 | 40 | 5 | 50.88 | 52.09
6 | 40 | 39.5 | 40 | 6 | 45.13 | 46.56
7 | 40 | 39.1 | 40 | 7 | 43.81 | 45.29
8 | 40 | 40.0 | 40 | 8 | 42.28 | 43.83
9 | 30 | 33.7 | 30 | 9 | 41.09 | 42.69
10 | 30 | 35.4 | 40 | 10 | 37.61 | 39.40
11 | 30 | 34.9 | 30 | 11 | 36.01 | 39.01
12 | 24 | 27.4 | 30 | 12 | 34.83 | 37.85
13 | 16 | 18.2 | 20 | 13 | 30.53 | 36.44
14 | 12 | 14.0 | 16 | 14 | 32.06 | 37.81
15 | 12 | 13.0 | 12 | 15 | 30.96 | 37.66
16 | 16 | 18.4 | 20 | 16 | 31.13 | 38.17
17 | 20 | 22.1 | 24 | 17 | 39.28 | 45.01
18 | 24 | 24.4 | 24 | 18 | 50.04 | 51.52
19 | 24 | 28.2 | 30 | 19 | 57.13 | 60.16
20 | 40 | 40.0 | 40 | 20 | 49.59 | 51.41
21 | 20 | 23.4 | 24 | 21 | 40.04 | 47.56
22 | 12 | 13.1 | 12 | 22 | 34.76 | 42.40
23 | 40 | 39.2 | 40 | 23 | 43.42 | 45.98
24 | 30 | 34.6 | 30 | 24 | 37.73 | 41.44
25 | 30 | 34.4 | 30 | 25 | 34.07 | 38.72
26 | 20 | 21.4 | 20 | 26 | 30.51 | 36.67
27 | 12 | 14.1 | 16 | 27 | 30.32 | 36.69
28 | 12 | 14.6 | 16 | 28 | 38.05 | 39.30
29 | 16 | 17.6 | 16 | 29 | 30.08 | 36.26
30 | 12 | 14.5 | 16 | 30 | 30.58 | 36.26
31 | 12 | 13.0 | 12 | 31 | 30.90 | 36.47
32 | 16 | 17.6 | 16 | 32 | 31.81 | 36.74
33 | 20 | 23.9 | 24 | - | - | -
34 | 24 | 26.8 | 24 | - | - | -
