Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Title: Could ChatGPT Automate Water Network Clustering? A Performance Assessment Across Algorithms

The abstract does not explain how ChatGPT adds value beyond conventional algorithms. Its conclusion is weak and vague, and it does not convey the paper's contribution or practical implications.
The validation seems to be weak or narrow, as only 2 Italian networks are used.
The engineering capability of the DMAs appears to be untested, as the evaluation neglects water quality, pressure or level constraints, head loss, etc. The manuscript uses only balance indices for demand, pipe length, and number of nodes, which makes the performance criteria incomplete.
LLM issues are not addressed in this manuscript.
The device counts and CAPEX depend on the number of boundary pipes, which appears to be reported only once per method in Table 1, without a reproducible definition. For instance, how are parallel pipes counted? What about valves? How are shared nodes at the district borders treated? As it stands, the boundary-pipe accounting is ambiguous.
For the EPANET files, we can’t find any repository.
The conclusion states that ChatGPT can competently execute the clustering phase, but the main body reports dependence on user guidance, corrections, and limited cases. The manuscript eventually admits that ChatGPT is not reliable for unsupervised applications, which shows that the main body of the manuscript, as well as the abstract, presents the evidence in an overly positive light.
Authors, please clarify some aspects of the methodology. How are spatial coordinates weighted or scaled relative to network topology? How are multi-source supplies handled in the performance metrics?

Author Response

Reviewer 1

Comment 1

The abstract does not explain how ChatGPT adds value beyond conventional algorithms. Its conclusion is weak and vague, and it does not convey the paper's contribution or practical implications.

We thank the reviewer for this constructive comment. In the revised abstract, we have clarified how ChatGPT provides added value beyond conventional clustering algorithms. Specifically, we now emphasize that ChatGPT automates the entire workflow of water distribution network (WDN) clustering—from reading input files and applying algorithms to calculating performance indices and generating reports. This capability makes advanced partitioning methodologies accessible to users without programming or hydraulic modeling expertise, thereby lowering the entry barrier for utilities and practitioners.

Additionally, we strengthened the conclusion of the abstract to better highlight the study’s contribution and practical implications. The revised conclusion underscores ChatGPT’s role as a complementary tool that accelerates repetitive tasks, supports decision-making through interpretable outputs, and facilitates the democratization of advanced WDN management strategies.

The following part was added in the abstract:

“The results show that ChatGPT uniquely adds value by automating the entire workflow of WDN clustering—from reading input files and running algorithms to calculating performance indices and generating reports. This makes advanced water network partitioning accessible to users without programming or hydraulic modeling expertise. The study highlights ChatGPT’s role as a complementary tool: it accelerates repetitive tasks, supports decision-making with interpretable outputs, and lowers the entry barrier for utilities and practitioners. These findings demonstrate the practical potential of integrating large language models into water management, where they can democratize specialized methodologies and facilitate wider adoption of WDN managing strategies.”

Comment 2

The validation seems to be weak or narrow, as only 2 Italian networks are used.

We appreciate this comment and agree that broader validation would further strengthen the study. Our intention, however, was not to provide an exhaustive benchmarking across multiple networks but rather to present a proof-of-concept of how ChatGPT can support clustering tasks in WDNs. For this purpose, we selected two real case studies of different scales and complexities:

Parete, a small network with 182 demand nodes;
Giugliano in Campania, a large network with 994 demand nodes, a complex topology, and multiple intake points.

These two networks were chosen intentionally to cover both smaller, more manageable systems and larger, more complex topologies, thus showing ChatGPT’s adaptability across different conditions. While we acknowledge the limitation of geographic scope, the methodological framework presented is generalizable and can be readily applied to other WDNs worldwide. We have clarified this in the revised manuscript by explicitly framing the current work as a proof-of-concept validation and by highlighting future work directions that include testing on additional case studies.

Although only two Italian cases were tested, they were intentionally selected as representative of small and large networks. Future work will involve collaboration with utilities from other countries to strengthen external validity.

The following was added to the manuscript (Lines 448-456)

“It is important to note that the validation presented in this work is intentionally limited to two real Italian networks. The choice of Parete (182 demand nodes) and Giugliano in Campania (994 demand nodes) was made to represent both a smaller, more manageable system and a larger, more complex topology. This combination provides a first proof-of-concept of ChatGPT’s adaptability across networks of different scales and complexity. While the geographic scope remains limited, the methodological framework developed here is general and can be readily applied to other water distribution systems. Future research will aim to extend this validation to a wider range of networks, including those from different contexts and operational conditions.”

Comment 3

The engineering capability of the DMAs appears to be untested, as the evaluation neglects water quality, pressure or level constraints, head loss, etc. The manuscript uses only balance indices for demand, pipe length, and number of nodes, which makes the performance criteria incomplete.

We thank the reviewer for this valuable comment. We agree that a comprehensive DMA design requires further evaluation of hydraulic aspects such as pressure, head loss, and water quality. The scope of the present study, however, was limited to testing whether ChatGPT could replicate the clustering phase of network partitioning in an automated and accessible way. For this phase of WNP, water quality, pressure or level constraints, head loss, etc. are generally not required because, as suggested in the literature, only topological information is needed. We therefore focused on balance indices (nodes, demand, pipe length), which are widely used in the literature as preliminary indicators of clustering quality.

We have clarified in the manuscript that the clustered networks generated by ChatGPT can be directly exported and then tested in hydraulic software by users with hydraulic expertise, who can perform detailed performance checks on pressure, head losses, or water quality. In this way, ChatGPT acts as a front-end automation tool to support experts, while hydraulic simulation remains the necessary step for final and comprehensive engineering validation.

We added this sentence (lines 97-99) in the Introduction: “In the dividing phase, water quality, pressure or level constraints, head loss, etc. have to be checked with specific energy and hydraulic performance indices to preserve the minimum level of pressure for users”.

Comment 4

LLM issues are not addressed in this manuscript.

We thank the reviewer for this insightful comment. We acknowledge that Large Language Models (LLMs) raise important broader issues—such as reproducibility of outputs, potential biases in training data, and the general reliability of generated results. These aspects are highly relevant in the wider AI research agenda. However, the primary objective of our manuscript is not to provide a comprehensive review of LLMs and their limitations, but rather to conduct a domain-specific proof-of-concept: assessing ChatGPT’s ability to execute the clustering phase of WDN partitioning. For this reason, our evaluation focused on measuring ChatGPT’s outputs against established clustering indices, which directly reflect performance in the water distribution network context. To avoid ambiguity, we have clarified in the revised manuscript that while generic LLM issues are outside the scope of this study, we are aware of them and recognize that they constitute an important backdrop. In practice, this proof-of-concept highlights both the opportunities (automation of workflows, accessibility for non-experts) and the constraints (need for user guidance, lack of hydraulic validation, risk of occasional misassignments) of using ChatGPT in this domain. We also emphasize in the Conclusions that ChatGPT should be regarded as a complementary and supportive tool rather than a fully autonomous solution, precisely because of the broader limitations associated with LLMs.

In the conclusion the following statement was added (Lines 491-493): “Although broader issues of LLMs such as reproducibility and bias are well recognized, they are beyond the scope of this study. Here we specifically assess ChatGPT’s performance in WDN clustering as a domain-focused proof-of-concept.”

Comment 5

The device counts and CAPEX depend on the number of boundary pipes, which appears to be reported only once per method in Table 1, without a reproducible definition. For instance, how are parallel pipes counted? What about valves? How are shared nodes at the district borders treated? As it stands, the boundary-pipe accounting is ambiguous.

We thank the reviewer for highlighting this important point. We agree that the definition of boundary pipes must be made explicit to ensure reproducibility. In the revised manuscript, we have clarified that a boundary pipe is defined as any pipe connecting two nodes that belong to different clusters. Parallel pipes are counted individually, as each represents a potential physical separation between districts. Valves are not counted separately in this study; however, each boundary pipe can be considered to imply the placement of a flow meter or valve in a practical DMA design. Shared nodes are treated by considering their incident pipes: if a pipe connects the shared node to a node in a different cluster, it is counted as a boundary pipe.

Lines 223-234: “Number of boundary pipes: A boundary pipe is defined as any pipe connecting two nodes assigned to different clusters. Parallel pipes are counted individually, since each represents a distinct potential separation between DMAs. Valves are not explicitly counted, but each boundary pipe can be interpreted as implying the installation of a valve or flow meter in practice, in the next dividing phase, as reported in the literature. Shared nodes at cluster borders are considered through their incident pipes: if one of their pipes connects to a node in a different cluster, that pipe is classified as a boundary pipe.”
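
The counting rule stated above can be illustrated with a minimal Python sketch. It is not taken from the manuscript: the function name `count_boundary_pipes`, the tuple layout of `pipes`, and the toy network are hypothetical, and it assumes every node has already been assigned to exactly one cluster.

```python
def count_boundary_pipes(pipes, cluster_of):
    """Return the IDs of pipes whose end nodes lie in different clusters.

    pipes      -- iterable of (pipe_id, start_node, end_node) tuples (assumed layout)
    cluster_of -- dict mapping each node ID to its cluster label
    """
    boundary = []
    for pipe_id, start, end in pipes:
        # A pipe is a boundary pipe when its two end nodes belong to different clusters.
        if cluster_of[start] != cluster_of[end]:
            boundary.append(pipe_id)
    return boundary


# Toy network (hypothetical): P2 and P3 are parallel pipes between nodes B and C.
pipes = [
    ("P1", "A", "B"),
    ("P2", "B", "C"),
    ("P3", "B", "C"),
    ("P4", "C", "D"),
]
cluster_of = {"A": 0, "B": 0, "C": 1, "D": 1}

boundary = count_boundary_pipes(pipes, cluster_of)
print(boundary)  # ['P2', 'P3']: each parallel pipe is counted individually
```

Because the rule is applied pipe by pipe, parallel pipes and shared border nodes need no special handling: a border node contributes through whichever of its incident pipes cross into another cluster.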

Comment 6

For the EPANET files, we can’t find any repository.

We do not maintain a repository, but we are willing to share the .inp files with the reviewers. The water networks used are real and, in Italy, are considered critical infrastructure; consequently, the information can be shared subject to the explicit authorization of the local Administration.

Comment 7

The conclusion states that ChatGPT can competently execute the clustering phase, but the main body reports dependence on user guidance, corrections, and limited cases. The manuscript eventually admits that ChatGPT is not reliable for unsupervised applications, which shows that the main body of the manuscript, as well as the abstract, presents the evidence in an overly positive light.

We thank the reviewer for this observation. As correctly noted, our main body already emphasizes that user guidance and corrections were necessary, and that ChatGPT cannot be considered reliable for unsupervised applications. In the original conclusion, we also stated that ChatGPT’s role should be seen as complementary rather than substitutive to specialized software, and that expert oversight remains essential.

To avoid any ambiguity, however, we have rebalanced the tone at the start of the conclusion to explicitly frame the study as a proof-of-concept. This adjustment ensures that the cautious interpretation presented in the main body is fully consistent with the conclusion and abstract, reinforcing that ChatGPT is positioned as a supportive, complementary tool rather than a fully autonomous solution.

Comment 8

Authors, please clarify some aspects of the methodology. How are spatial coordinates weighted or scaled relative to network topology? How are multi-source supplies handled in the performance metrics?

We thank the reviewer for raising these methodological clarifications. In the revised manuscript, we have clarified the following points:

Use of spatial coordinates vs. topology: In the clustering phase, the raw spatial coordinates (x, y) provided in the EPANET input file were used directly, without scaling or weighting against network topology.
Handling of multi-source supplies in performance metrics: The performance indices (balance in nodes, demand, and pipe length) were calculated at the cluster level by summing the corresponding attributes of nodes or pipes within each cluster. In this proof-of-concept, the presence of multiple sources (two intake points in Parete, five in Giugliano) was not explicitly accounted for in the indices, since they were intended as topological proxies for balance rather than hydraulic performance measures. As suggested in the literature, the latter are computed in the subsequent dividing phase, which is not considered in this paper because it focuses on the clustering phase of WDN partitioning.

In the methodology section the following discussion was added (Lines 249-255)

“For clustering, the raw spatial coordinates (x, y) of the nodes extracted from the EPANET file were used without additional scaling or weighting. Performance indices (nodes, demand, pipe length) were computed at the cluster level, independently of the number of sources. The presence of multiple sources in the Parete and Giugliano networks was not explicitly represented in these indices, which were used here as topological proxies for cluster balance rather than hydraulic performance measures (computed in the next dividing phase, as suggested in the literature, and not considered in this paper that is focused on clustering phase of WDN)”
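
The cluster-level aggregation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the manuscript's implementation: the function names, the data layout, and the rule that a pipe contributes its length only when both end nodes share a cluster are hypothetical. The `imbalance` helper uses the coefficient of variation purely as a stand-in, since the balance-index equations are not reproduced in this response.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cluster_totals(node_cluster, node_demand, pipes):
    """Sum node count, demand, and pipe length per cluster.

    node_cluster -- dict: node ID -> cluster label
    node_demand  -- dict: node ID -> demand
    pipes        -- list of (start_node, end_node, length) tuples (assumed layout)
    """
    totals = defaultdict(lambda: {"nodes": 0, "demand": 0.0, "length": 0.0})
    for node, cluster in node_cluster.items():
        totals[cluster]["nodes"] += 1
        totals[cluster]["demand"] += node_demand[node]
    for start, end, length in pipes:
        # Assumption: only pipes internal to a cluster add to its length.
        if node_cluster[start] == node_cluster[end]:
            totals[node_cluster[start]]["length"] += length
    return dict(totals)


def imbalance(values):
    """Coefficient of variation across clusters (illustrative stand-in only)."""
    values = list(values)
    return pstdev(values) / mean(values)


# Toy data (hypothetical); sources are deliberately not represented.
node_cluster = {"A": 0, "B": 0, "C": 1, "D": 1}
node_demand = {"A": 1.0, "B": 2.0, "C": 1.5, "D": 1.5}
pipes = [("A", "B", 100.0), ("B", "C", 50.0), ("C", "D", 100.0)]

totals = cluster_totals(node_cluster, node_demand, pipes)
demand_imbalance = imbalance(t["demand"] for t in totals.values())
print(totals[0], totals[1], demand_imbalance)
```

Nothing in this aggregation depends on where the sources are located, which is why such indices act as topological proxies rather than hydraulic measures.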

Reviewer 2 Report

Comments and Suggestions for Authors

Thanks for inviting me to review the manuscript “Could ChatGPT Automate Water Network Clustering? A Performance Assessment Across Algorithms”. AI and ChatGPT have so far been little tested in WDN applications; this paper presents their application in the context of network partitioning, with a special focus on the clustering phase (lines 103-104). The study aims to:

Explore whether ChatGPT can fulfill this task in a manner comparable to specialized hydraulic software, but with the added benefit of being accessible to non-experts (lines 111-112).
Investigate the ability of ChatGPT to analyze the results obtained, write a report, illustrate results with tables and graphs and support researchers to write a scientific paper (lines 113-115), and
Assess the ability of ChatGPT to provide water utilities with an intuitive tool for network management, democratizing access to advanced water management techniques (lines 116-118).

However, the flow of the paper makes it difficult to understand how these objectives were achieved. Perhaps the authors can reorganise the methodology section and the results and discussion section so that they align with these objectives. The methods section should have three subheadings, one per objective, each explaining the data collection and analysis: which framework or theory was followed, and which statistical tool was used for which analysis. The results and discussion section should then follow the same three subheadings to state the findings of the analysis.

Although there is a validation section (line 357) under the results section, it should be under the methodology section, covering the analysis for the three objectives. Even as it stands, the validation is not adequate to justify the findings. What is the baseline or standard used for comparison?

What are the references for the equations used in this study?

What do colour codes mean in Figure 1?

Tables 1 and 2: please provide footnotes. What analysis has been done?

Modify the conclusion according to the findings of the three objectives, and include the limitations for this study.

In keywords, include the name of the study area or Italy.

Comments for author File: Comments.pdf

Author Response

Reviewer 2

Comment 1

Explore whether ChatGPT can fulfill this task in a manner comparable to specialized hydraulic software, but with the added benefit of being accessible to non-experts (lines 111-112).

Investigate the ability of ChatGPT to analyze the results obtained, write a report, illustrate results with tables and graphs and support researchers to write a scientific paper (lines 113-115), and

Assess the ability of ChatGPT to provide water utilities with an intuitive tool for network management, democratizing access to advanced water management techniques (lines 116-118).

We thank the reviewer for this valuable suggestion. To improve clarity, we have reorganized the Methodology section so that it explicitly follows the study objectives. The section now begins with a framing paragraph and is divided into two subsections:

Assessing ChatGPT’s Clustering Performance, which describes the case study, data preparation, clustering algorithms, and balance indices;
Evaluating ChatGPT’s Analysis, Reporting, and Applicability, which explains how ChatGPT was asked to generate reports, tables, and flowcharts, and how these outputs can support utility applications.

We note that the third objective stated in the Introduction (assessing ChatGPT’s potential to support utilities) is intrinsically linked to the second one, since the practical applicability naturally follows from ChatGPT’s ability to analyze and report results. For this reason, we merged the second and third objectives into a single subsection to avoid redundancy and to present the methodology more coherently.

Regarding the Results section, we have kept the existing scheme (clustering outcomes, flowchart generation, validation on a second network) to preserve readability. However, at the start of the section we have added a clarifying sentence to indicate that the presentation of results is structured in line with the two methodological objectives.

We believe these modifications make the flow from objectives to methods and results clearer and more consistent, while maintaining a concise and coherent structure.

Comment 2

We thank the reviewer for this observation. The Validation section was intentionally placed under the Results because it reports the outcomes of applying the workflow to a second, larger network (Giugliano). In our view, this constitutes an additional set of results rather than part of the methodological setup, as it shows how ChatGPT performs when applied to a system of greater size and complexity.

Regarding the baseline/standard, we acknowledge that the present study does not include a formal benchmark against an external reference method or software. Our intention, as now clarified in the Introduction, was to provide a proof-of-concept demonstration of ChatGPT’s ability to execute the clustering workflow, rather than to establish comparative performance. We agree that introducing such a baseline would be valuable, and we consider this an important direction for future research.

Comment 3

What are the references for the equations used in this study?

The equations for the performance indices have been widely proposed and used by the authors in previous studies; a reference has been added at line 236.

Comment 4

What do colour codes mean in Figure 1?

We thank the reviewer for pointing out this missing detail. We have clarified in the caption of Figure 1 that the colors represent the different clusters generated by the algorithm, with each cluster shown in a distinct color.

Comment 5

Tables 1 and 2: please provide footnotes. What analysis has been done?

We thank the reviewer for this observation. Footnotes have now been added to Tables 1 and 2 to clarify their content. As for the analysis, we would like to clarify that no additional analysis was performed beyond the clustering itself. The tables simply report the results generated by ChatGPT after applying the clustering algorithms, namely the distribution of nodes, demand, and pipe length among clusters, together with the corresponding balance indices. These results complete the “clustering phase” of WDN partitioning, as explained in the Introduction.

Comment 6

Modify the conclusion according to the findings of the three objectives, and include the limitations for this study.

We thank the reviewer for this comment. The conclusions were modified, and the following statement was added:

“In line with the first objective, the results showed that ChatGPT can execute the clustering phase of WDNs within seconds, producing solutions comparable to conventional algorithms proposed in the literature, although user corrections were sometimes required. In line with the second objective, ChatGPT was also able to analyze the results, generate reports, illustrate outcomes with tables and graphs, and design a reproducible flowchart. These outputs can also be directly applied by practitioners, showing ChatGPT’s potential to support utilities and democratize access to advanced methodologies.”

Comment 7

In keywords, include the name of the study area or Italy.

We thank the reviewer for this observation. The keyword “Italy” was added.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper’s title suggests an original approach to automating water network clustering with ChatGPT. However, the research is limited to comparing three EPANET solutions, with heavy user guidance of ChatGPT and prompt engineering.

The study is conducted on a small water network, which reduces its significance and impact.

Substantial improvement is still needed, particularly extensive testing on larger networks and a streamlined methodology that reduces user intervention.

Furthermore, the paper should address the integration of actual operational data such as flow and pressure in WDN clustering and provide a more comprehensive discussion of this possibility.

Author Response

Reviewer 3

Comment 1

We thank the reviewer for raising this important point. We would like to clarify that the title of the manuscript is framed as a question (“Could ChatGPT Automate Water Network Clustering?”), rather than a statement. This was intentional, as the purpose of the paper is to explore and test the potential of ChatGPT in this task as a proof-of-concept, not to claim that full automation has been achieved.

We have revised the abstract and conclusion to further stress that ChatGPT’s role is complementary, requiring user guidance, and that the study demonstrates feasibility rather than complete autonomy.

Comment 2

The study is conducted on a small water network, which reduces its significance and impact.

While one of the networks (Parete, 182 nodes) is relatively small, we also tested ChatGPT on a larger network (Giugliano in Campania, 994 demand nodes and 5 intake points), which has a complex topology and serves a city of ≈125 000 inhabitants, in line with large Italian water networks. This second case showed ChatGPT’s ability to handle networks of higher complexity.

We acknowledge, however, that broader testing on additional and more diverse networks would strengthen the generalizability of the results. We have now clarified this limitation in the manuscript and positioned it as an important direction for future research.

“It is important to note that the validation presented in this work is intentionally limited to two real Italian networks. The choice of Parete (182 demand nodes) and Giugliano in Campania (994 demand nodes) was made to represent both a smaller, more manageable system and a larger, more complex topology. This combination provides a first proof-of-concept of ChatGPT’s adaptability across networks of different scales and complexity. While the geographic scope remains narrow, the methodological framework developed here is general and can be readily applied to other water distribution systems. Future research will aim to extend this validation to a wider range of networks, including those from different contexts and operational conditions.”

Comment 3

Substantial improvement is still needed, particularly extensive testing on larger networks and a streamlined methodology that reduces user intervention.

We thank the reviewer for this observation. Regarding testing on a larger network, please refer to our response to Comment 2.

Regarding methodology, the paper emphasizes that user guidance and corrections were necessary during the clustering phase (e.g., for unassigned nodes or index recalculations). We clarified this explicitly in the Discussion and in the Conclusions. In addition, the flowchart we included is intended as a first step toward a standardized and more streamlined workflow. This makes clear that while current results require expert oversight, the methodology is moving toward greater reproducibility.

Comment 4

Furthermore, the paper should address the integration of actual operational data such as flow and pressure in WDN clustering and provide a more comprehensive discussion of this possibility.

We thank the reviewer for this valuable suggestion. We agree that operational data such as flow distribution and pressure levels are fundamental for a complete DMA design. In this proof-of-concept, however, our objective was to assess ChatGPT’s ability to perform the clustering phase, and the evaluation was therefore intentionally limited to topological balance indices (nodes, demand, and pipe length), calculated independently of hydraulic conditions, as clarified in the Metrics subsection (Lines 225-256).

The manuscript already emphasizes in the Conclusions that ChatGPT’s role should be seen as complementary to hydraulic software, which is needed for the subsequent “dividing phase”. That phase was not investigated in this study because it lies outside the aim of the paper.
