Comparative Analysis of Machine Learning Techniques in Enhancing Acoustic Noise Loggers’ Leak Detection
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
General Comment
The manuscript presents a relevant approach to leak detection in water distribution networks through the use of Noise Loggers and machine learning models, highlighting the effectiveness of the proposed ensemble model. However, it requires improvements in academic writing, greater methodological clarity, and a more balanced discussion of its limitations. Overall, it is a study with applied potential, but it needs revision to strengthen its scientific and communicative validity. Below, I outline some points for revision.
Specific Comments
- Spelling error in the title: "Acoustic Noice Loggers" contains a typographical mistake. "Noice" should be corrected to "Noise".
- There is redundancy between lines 42–47. The impact and cost of non-revenue water are repeated multiple times.
- There is some ambiguity in the sampling process (lines 149–151). It is unclear whether the leak and non-leak samples were balanced or collected from comparable geographic locations. I suggest providing a clearer description of how the samples were selected and labeled to avoid bias in the training dataset.
- The technical specifications of the "PermaNET+" Noise Logger are not provided. I suggest including a technical reference or equipment specification detailing its sensitivity, frequency range, and other relevant parameters.
- Include a table with the hyperparameters and their justification, or at least refer to the appendix.
- The accuracies of individual models and the ensemble model are compared, but no statistical test is reported to determine whether these differences are significant. I suggest including a statistical analysis (e.g., McNemar's test or ANOVA for cross-validated classification metrics) to validate the significance of the differences between models.
- YamNet is presented as a competitor, but it is not stated whether it was trained on the same dataset or adapted to the specific task. It should be clarified whether YamNet was fine-tuned for this domain; otherwise, the comparison lacks methodological robustness.
- In the discussion, I suggest adding a more balanced analysis that includes the model’s own limitations, such as potential overfitting, sensitivity to uncontrolled noise, or generalization issues beyond the Hong Kong context.
- Several paragraphs in the conclusion repeat ideas already discussed, particularly in lines 454–472.
- Carefully review the style and grammar throughout the text, as there are several errors such as:
- "van analyzes" (line 206) should be corrected to "YamNet analyzes".
- "Ensembeled" (line 297) should be "Ensembled".
- "To sum-up" (line 457) is too informal; a more appropriate alternative would be "In summary" or "To conclude".
Comments for author File: Comments.pdf
Author Response
Reply to Reviewer 1’s Comments
Reviewer 1’s constructive comments were much appreciated, and his/her concerns and suggestions have been taken into consideration. The responses to the reviewer’s comments are presented below:
Comment: The manuscript presents a relevant approach to leak detection in water distribution networks through the use of Noise Loggers and machine learning models, highlighting the effectiveness of the proposed ensemble model. However, it requires improvements in academic writing, greater methodological clarity, and a more balanced discussion of its limitations. Overall, it is a study with applied potential, but it needs revision to strengthen its scientific and communicative validity. Below, I outline some points for revision.
Response: The authors would like to thank the reviewer for this comment.

Comment: Spelling error in the title: "Acoustic Noice Loggers" contains a typographical mistake. "Noice" should be corrected to "Noise".
Response: The text is revised to address the reviewer’s comment.

Comment: There is redundancy between lines 42–47. The impact and cost of non-revenue water are repeated multiple times.
Response: The text is revised to address the reviewer’s comment.

Comment: There is some ambiguity in the sampling process (lines 149–151). It is unclear whether the leak and non-leak samples were balanced or collected from comparable geographic locations. I suggest providing a clearer description of how the samples were selected and labeled to avoid bias in the training dataset.
Response: Thank you for your comment; a clearer description of the sampling and labeling process was added to the text.

Comment: The technical specifications of the "PermaNET+" Noise Logger are not provided. I suggest including a technical reference or equipment specification detailing its sensitivity, frequency range, and other relevant parameters.
Response: Thank you for your comment; the equipment specifications were added to the text.

Comment: Include a table with the hyperparameters and their justification, or at least refer to the appendix.
Response: Thank you for your comment; Table A1 was added to the Appendix with the hyperparameters.

Comment: The accuracies of individual models and the ensemble model are compared, but no statistical test is reported to determine whether these differences are significant. I suggest including a statistical analysis (e.g., McNemar's test or ANOVA for cross-validated classification metrics) to validate the significance of the differences between models.
Response: Thank you for your comment; due to the limited time frame, a brief addition was made to the text.

Comment: YamNet is presented as a competitor, but it is not stated whether it was trained on the same dataset or adapted to the specific task. It should be clarified whether YamNet was fine-tuned for this domain; otherwise, the comparison lacks methodological robustness.
Response: Thank you for your comment; the requested clarification was added to the text.

Comment: In the discussion, I suggest adding a more balanced analysis that includes the model’s own limitations, such as potential overfitting, sensitivity to uncontrolled noise, or generalization issues beyond the Hong Kong context.
Response: Thank you for your comment; a discussion of these limitations was added to the text.

Comment: Several paragraphs in the conclusion repeat ideas already discussed, particularly in lines 454–472.
Response: The text is revised to address the reviewer’s comment.

Comment: Carefully review the style and grammar throughout the text, as there are several errors, such as those listed above.
Response: The text is revised to address the reviewer’s comment.
Reviewer 2 Report
Comments and Suggestions for Authors
Reviewing report
Manuscript entitled "Comparative Analysis of Machine Learning Techniques in Enhancing Acoustic Noice Loggers Leak detection"
- In the keywords please capitalise the first letter for each keyword: «MAchine learning, ensemble models, Acoustic noise loggers, Acoustics, water distribution networks».
- Please combine lines 33-77 as one paragraph.
The same for lines 78-114.
- In the introduction please explain how machine learning offers a powerful solution to overcome the limitations of traditional acoustic analysis.
- Please lowercase the letter m in the word «methodology» in the caption of fig. 1.
The same for the captions of tables 1 and 2. Please revise the whole manuscript regarding this issue.
- You need to add a footnote in the lower part of table 1 explaining, e.g., the meaning of F1.
- Please remove the words « ROC curve« over the figure 3. The same for fig. 5.
- Where is your detailed discussion for fig. 4?
- Please revise the language of this sentence: «Random Forest yielded the highest accuracy among all tested models, achieving an impressive 93.68% accuracy».
- Line 333, please remove the extra space before the number.
- You need to add a table with the optimal hyper-parameters of the deep learning classifiers.
For help see that « Advancing deep learning-based acoustic leak detection methods towards application for water distribution systems from a data-centric perspective «
- Where is your equation for accuracy?
- Where are your sensitivity metrics? See this reference for help: «Advancing deep learning-based acoustic leak detection methods towards application for water distribution systems from a data-centric perspective».
- You need to add your equations which used for prediction.
GOOD LUCK
Author Response
Reply to Reviewer 2’s Comments
Reviewer 2’s constructive comments were much appreciated, and his/her concerns and suggestions have been taken into consideration. The responses to the reviewer’s comments are presented below:
Comment: In the keywords please capitalise the first letter for each keyword: «MAchine learning, ensemble models, Acoustic noise loggers, Acoustics, water distribution networks».
Response: Keywords are revised to address the reviewer’s comment.
Comment: Please combine lines 33-77 as one paragraph. The same for lines 78-114.
Response: The text is revised to address the reviewer’s comment.
Comment: In the introduction please explain how machine learning offers a powerful solution to overcome the limitations of traditional acoustic analysis.
Response: Thank you for your comment; the requested explanation was added to the text.
Comment: Please lowercase the letter m in the word «methodology» in the caption of fig. 1. The same for the captions of tables 1 and 2. Please revise the whole manuscript regarding this issue.
Response: All captions of tables and figures are revised to address the reviewer’s comment.
Comment: You need to add a footnote in the lower part of table 1 explaining, e.g., the meaning of F1.
Response: A description of the F1 score is added as a footnote to Table 1.
Comment: Please remove the words «ROC curve» over figure 3. The same for fig. 5.
Response: All figures are revised to address the reviewer’s comment.
Comment: Where is your detailed discussion for fig. 4? Please revise the language of this sentence: «Random Forest yielded the highest accuracy among all tested models, achieving an impressive 93.68% accuracy».
Response: Thank you for your comment; a detailed discussion of Figure 4 was added to the text.
Comment: Line 333, please remove the extra space before the number.
Response: The text is revised to address the reviewer’s comment.
Comment: You need to add a table with the optimal hyper-parameters of the deep learning classifiers. For help see «Advancing deep learning-based acoustic leak detection methods towards application for water distribution systems from a data-centric perspective».
Response: Table A1 was added to the appendix with the hyperparameters.
Comment: Where is your equation for accuracy?
Response: Thank you for your comment; the following was added to the text:

$\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1)
Comment: Where are your sensitivity metrics? See this reference for help: «Advancing deep learning-based acoustic leak detection methods towards application for water distribution systems from a data-centric perspective».
Response: Thank you for your comments; the following was added to the text:

$\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (1) Measures the overall proportion of correct predictions.

$\mathrm{Sensitivity} = \dfrac{TP}{TP + FN}$ (2) Indicates the model’s ability to correctly detect actual leak events.

$\mathrm{Specificity} = \dfrac{TN}{TN + FP}$ (3) Reflects the ability to correctly identify non-leak conditions.

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$ (4) Indicates the proportion of predicted leak cases that were actually leaks.

$\mathrm{F1\ Score} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (5) Balances precision and recall to give an overall sense of model reliability.
These metrics were calculated for each classifier to ensure comprehensive evaluation of leak detection capability. In addition to quantitative metrics, SHAP (SHapley Additive Explanations) analysis (see Figure 4) was used to interpret model behavior by identifying the most influential acoustic features driving predictions. Together, these metrics and explainability tools provide both performance and transparency for the proposed models.
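For readers who wish to reproduce these metrics, the following is a minimal sketch (not the authors' code) showing how Eqs. (1)-(5) can be computed from a binary confusion matrix with scikit-learn; the two label arrays are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): computing the metrics in
# Eqs. (1)-(5) from a binary confusion matrix; 1 = leak, 0 = no leak.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # placeholder ground truth
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # placeholder predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (1)
sensitivity = tp / (tp + fn)                                  # Eq. (2), recall
specificity = tn / (tn + fp)                                  # Eq. (3)
precision = tp / (tp + fp)                                    # Eq. (4)
f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (5)

print(f"Acc={accuracy:.3f} Sens={sensitivity:.3f} Spec={specificity:.3f} "
      f"Prec={precision:.3f} F1={f1:.3f}")
```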
Comment: You need to add your equations which were used for prediction.
Response: The algorithms used, with the exception of logistic regression, do not yield closed-form prediction equations, and logistic regression’s performance was poor.
Reviewer 3 Report
Comments and Suggestions for Authors
The manuscript titled "Comparative Analysis of Machine Learning Techniques in Enhancing Acoustic Noice Loggers Leak detection" highlights the application of noise logger sensors, integrated with ensemble machine learning for a real-time monitoring solution, enhancing efficiency in Water Distribution Networks (WDNs) and mitigating environmental impacts.
Some of the things mentioned below should be answered or corrected:
- Is it possible to test the significance of other places in Hong Kong?
- Please provide a comparison with other models.
- Can the model be checked for any other environment or conditions?
- The author should define all the abbreviations before their use, like AUC, SHAP
- Please do not repeat the information, such as 94.40% accuracy.
- SHAP values are plotted but not discussed thoroughly. Please provide a proper discussion.
- How can you implement this model, e.g., city-wise?
- Please compare the results with another model for a more comprehensive presentation.
Additionally, some typographical errors should be corrected before the revision submission.
- Please correct the spellings of "performing" at different places, which is written as "perfroming".
Author Response
- Is it possible to test the significance of other places in Hong Kong?
Thank you for your comment, the following was added to the text:
Additionally, while the studied dataset includes signals from multiple districts within Hong Kong, this study does not conduct a formal region-wise significance analysis. In future work, we intend to explore geographic stratification and apply statistical tests (e.g., Chi-square or stratified ANOVA) to evaluate whether leak detection performance significantly varies by district or pipeline characteristics.
- Please provide a comparison with other models.
Thank you for your comment. The proposed article presents the ensemble as the best model; in addition to ensemble learning, we compared seven classical machine learning models (SVM, RF, NB, KNN, DT, LogR, MLP) and one deep learning model (YamNet). This comprehensive evaluation highlights the ensemble model’s superiority not only in overall accuracy (94.40%) but also in recall, F1 score, and AUC, which are critical for minimizing false negatives in leak detection.
- Can the model be checked for any other environment or conditions?
Thank you for your comment, although the model was validated on diverse pipeline types and locations within Hong Kong, generalizability to other cities or environments remains to be validated. Future work will involve testing the model on datasets from other regions (e.g., different soil conditions, pipeline materials, or ambient noise levels) to evaluate its robustness under varied operating conditions.
- The author should define all the abbreviations before their use, like AUC, SHAP
AUC is now defined at its first use in the text.
SHAP is likewise defined at its first mention.
The ANOVA abbreviation was updated as well.
- Please do not repeat the information, such as 94.40% accuracy.
Thank you for your comment, the 94.40% accuracy is now mentioned only once in the text.
- SHAP values are plotted but not discussed thoroughly. Please provide a proper discussion.
Thank you for your comment, the article was updated as follows: Figure 4 displays grouped SHAP values for the ensemble model, highlighting the relative importance of the four main acoustic feature types: MFCC, Spectral Contrast, Tonnetz, and Chroma. The MFCC group shows the widest range of SHAP values, approximately from –0.04 to +0.04, indicating that these features exert the strongest influence on the model’s predictions. In contrast, Spectral Contrast ranges from –0.035 to +0.035, Tonnetz from –0.03 to +0.035, and Chroma has the smallest impact, ranging between –0.028 and +0.03. The color coding (from blue to red) represents the original feature values, and for MFCCs in particular, higher values (red dots) tend to correspond with positive SHAP values, meaning they push the prediction toward the leak class. This aligns with known acoustic leak signatures, where elevated MFCC components reflect the structured harmonic patterns of leak-induced flow. Overall, Figure 4 confirms that the ensemble model relies most heavily on MFCC-derived time-frequency features, reinforcing their diagnostic value in distinguishing leak events from normal pipeline conditions.
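As an illustration of the grouped-SHAP workflow described above, the sketch below shows one way such feature-family summaries could be computed. It is an assumed reconstruction, not the authors' pipeline: the synthetic data, the column names, the feature-group sizes, and the Random Forest stand-in (used in place of the actual ensemble) are all placeholders.

```python
# Assumed reconstruction, not the authors' pipeline: grouping per-feature SHAP
# values into the four acoustic feature families shown in Figure 4. A Random
# Forest stand-in and synthetic data replace the actual ensemble and dataset.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
cols = ([f"mfcc_{i}" for i in range(40)]
        + [f"contrast_{i}" for i in range(7)]
        + [f"tonnetz_{i}" for i in range(6)]
        + [f"chroma_{i}" for i in range(12)])   # placeholder feature names
X = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
y = rng.integers(0, 2, size=200)                # placeholder labels

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sv = shap.TreeExplainer(model).shap_values(X)
# Older shap versions return a list per class; newer versions return an array
# with a trailing class axis. Either way, keep the values for class 1 ("leak").
sv = sv[1] if isinstance(sv, list) else sv[..., 1]

for name, prefix in [("MFCC", "mfcc"), ("Spectral Contrast", "contrast"),
                     ("Tonnetz", "tonnetz"), ("Chroma", "chroma")]:
    idx = [i for i, c in enumerate(cols) if c.startswith(prefix)]
    grouped = sv[:, idx].sum(axis=1)   # summed contribution per sample
    print(f"{name:17s} SHAP range [{grouped.min():+.4f}, {grouped.max():+.4f}]")
```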
- How can you implement this model, e.g., city-wise?
Thank you for your comments, the following was added to the article: Accordingly, for city-wide implementation, the model can be integrated with an IoT-based monitoring system, where acoustic noise loggers are deployed in key junctions and transmit data to a centralized processing unit. The trained ensemble model can then be used in real-time to analyze signals and flag potential leaks. Region-specific retraining or transfer learning could be applied to fine-tune the model to new environments, accounting for different pipe materials, ambient noise profiles, or leak typologies.
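To make the deployment idea concrete, here is a deliberately minimal, hypothetical sketch of the central scoring step; the serialized artifact name and the upstream feature extraction are assumptions, and no part of this reflects the authors' actual system.

```python
# Hypothetical sketch of the central scoring step in a city-wide deployment.
# "ensemble_model.joblib" is an assumed artifact name; feature vectors are
# assumed to arrive from noise loggers via an upstream extraction step.
import joblib

model = joblib.load("ensemble_model.joblib")  # hypothetical serialized ensemble

def classify_signal(features):
    """Return (label, leak probability) for one logger feature vector."""
    leak_probability = model.predict_proba([features])[0][1]
    label = "leak" if leak_probability >= 0.5 else "no-leak"
    return label, leak_probability
```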
- Please compare the results with another model for a more comprehensive presentation.
Thank you for your comment, in the article, in addition to classical machine learning baselines, we evaluated a pre-trained deep learning model (YamNet) as an external benchmark. Despite its powerful architecture, YamNet achieved only 52.63% accuracy on this task, likely due to its general-purpose design and lack of fine-tuning on leak-specific acoustic data. In contrast, the proposed ensemble model significantly outperformed it in all metrics, including accuracy, F1 score, and AUC.
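For context, the off-the-shelf YamNet baseline can be queried as in the sketch below, using TensorFlow Hub's published YamNet model. The paper does not specify how YamNet's 521 AudioSet class scores were mapped to a binary leak/no-leak decision, so no such mapping is shown; the one-second silent waveform is a placeholder.

```python
# Sketch of querying the published YamNet model from TensorFlow Hub on a
# 16 kHz mono waveform; only the top-scoring AudioSet classes are printed.
import csv
import io
import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")
with io.open(model.class_map_path().numpy().decode()) as f:
    class_names = [row[2] for row in csv.reader(f)][1:]  # skip CSV header

waveform = np.zeros(16000, dtype=np.float32)       # placeholder: 1 s of silence
scores, embeddings, spectrogram = model(waveform)  # scores: (frames, 521)
mean_scores = scores.numpy().mean(axis=0)

for i in mean_scores.argsort()[-5:][::-1]:         # five highest-scoring classes
    print(f"{class_names[i]:30s} {mean_scores[i]:.3f}")
```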
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript presents a comparative analysis of machine learning models for enhancing leak detection in water distribution networks (WDNs) using acoustic noise loggers. A dataset of 2110 sound signals collected in Hong Kong was used to evaluate the performance of various models, including Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree (DT), Logistic Regression (LogR), Multi-Layer Perceptron (MLP), and YamNet. Among the individual models, RF achieved the highest accuracy (93.68%), followed by KNN (93.40%) and MLP (92.15%). A novel ensemble model combining the best-performing classifiers further improved detection accuracy to 94.40%. The study highlights the effectiveness of ensemble learning for real-time acoustic leak detection and offers a robust framework for integrating machine learning into urban water management systems.
- Comments 1: Regarding a specific statement in the Introduction section: “In 2015, approximately 321 million m³ of transported potable water (33% of the supplied freshwater) was lost, and leaks in pressurized water pipelines were identified as the primary cause of water loss [2].”
This is a very relevant figure, but the data is quite outdated (2015). Could you please update this value with more recent statistics, if available? Furthermore, you mention that this number has significantly increased, considering the water loss rates of 26.5% in 2010 and 31.6% in 2013. Please clarify or provide updated trends to support this statement.
- Comments 2: I have a question about this sentence: "Since 2008, the Water Supplies Department of Hong Kong has set a target under the Total Water Management Strategy to save 85 million m³ of water per year by 2030."
The authors should identify the actual water leakage volumes as of 2025 to better evaluate the current magnitude of the problem. In addition, it would be helpful to include data on the average time required by the water utility to detect, locate, and repair leaks, as this time frame is crucial for estimating the volume of water wasted. Since your proposed tool aims to reduce the time between leak detection and localization, such information would highlight the potential impact of your approach.
- Comments 3: According to this sentence: “The acoustic signals captured by these Noise Loggers were processed to extract relevant features. Subsequently, several machine learning models were employed, including SVM, RF, NB, KNN, DT, LogR, and MLP, to classify the acoustic signals and identify leak states.”
Is this a supervised classification model, trained on labeled data to distinguish between predefined classes?
- Comments 4: According to this sentence: “Data were collected at multiple valve locations in Hong Kong. Leak signals were identified by acoustic noise loggers and verified manually using inspections or hydrophones, and non-leak signals were recorded from similar pipes under normal operation. The dataset includes 992 leak samples and 1118 no-leak samples (a slightly unbalanced but representative set). Each signal is labeled by confirming the presence/absence of a leak via inspection. To avoid geographic bias, leak and non-leak recordings were obtained from similar sites and times. We ensured that both classes include signals from all regions of the study area.”
All pipes may look similar, but in reality, leak sounds can vary depending on the pipe material. Likewise, the shape and size of the leak orifice can differ based on pipe diameter and pressure. A dataset with 992 leak samples could be a reasonably good representation for a fairly homogeneous water distribution system with similar operating conditions and pipe characteristics. However, in a more heterogeneous water distribution network, this dataset might not be sufficient. A simple statistical analysis could help evaluate the representativeness of the sample.
On the other hand, there is no figure showing the site location or providing general information about the water distribution system and the Water Utility, which makes the presentation incomplete. It would be helpful to show in a map the water distribution network and testing deployment.
- Comments 5: According to sections 2.2 Feature extraction and 2.3 Data Visualization. Do the authors reach any conclusion based on the analysis presented in this section? Is there any observable pattern or indication that justifies the use of STFT? A considerable amount of work was done, but it seems that no clear outcome or insight was derived from it.
- Comments 6: According to this sentence: “It is critical to note that the publicly available YamNet model (pre-trained on AudioSet) was used without further fine-tuning on our data. In future work, it is recommended to fine-tune YamNet on the noise-logger dataset for improved performance. Our current results thus reflect the baseline performance of YamNet’s general sound classification on leak data”.
This sounds quite surprising. The dataset does not appear to have been prepared to remove ambient sounds such as car horns or background music, yet the model is achieving high performance in terms of precision, recall, etc. Could you please clarify this? Is there any explanation for how such results were obtained despite the presence of environmental noise?
- Comments 7: According to this sentence: “The results of the proposed model and YamNet are significantly different, as seen from the accuracy metrics obtained in their respective experiments. The proposed model achieved an accuracy of 94.40%, outperforming YamNet, which achieved only 52.63% accuracy.”
How do you explain the significant difference in performance indicators between the two models? Were both models trained and tested using the same dataset? What was the split between training and testing data? Additionally, were the models evaluated on new field data, or was the same dataset used for both training and testing?
- Comments 8: Between lines 316 and 317: In terms of recall, the Ensemble Model exhibited a notably higher value of 92.66% compared to YamNet's 8.77%.
This is something the authors should investigate further, as it could be the key to understanding what we are doing well and what we are doing wrong. Furthermore, the authors introduce the use of MFCC coefficients (from line 350), which they describe as a key component of the analysis. However, these coefficients remain a 'black box' to the reader. It would be helpful to provide a clearer explanation of what MFCCs represent and how they contribute to the model's performance.
- Comments 9: What do you mean with this sentence? Please introduce more deeply (line 343): “YamNet relies directly on the acoustic signals for leak detection”. The explanation provided for the poor performance of YamNet is rather limited and lacks depth. Considering the significance of this result, a more thorough analysis is necessary. The authors should discuss in greater detail the possible reasons behind YamNet’s underperformance, such as the mismatch between the training data and the specific characteristics of leak sounds, the presence of background noise, or the lack of fine-tuning. Additionally, it would be valuable to explore whether certain architectural limitations of YamNet contribute to these outcomes, and how future improvements or adaptations could enhance its performance in this context.
- Comments 10: In line 432, the authors make the following comment: “The combination of the three best perfroming models in the ensemble approach further enhances the accuracy of leak detection. The ensemble model attained an accuracy of 94.40%, surpassing the performance of individual models and emphasizing the efficacy of integrating diverse perspectives to enhance detection accuracy”.
"I am not fully convinced that the ensemble model truly outperforms the Random Forest model. According to Tables 1 and 2, while the ensemble model does achieve a slightly higher precision (94.4% versus 93%), the other performance metrics — such as accuracy, recall, and F1-score — are actually lower than those obtained with the Random Forest model. Therefore, I would appreciate a clearer justification from the authors regarding why they still consider the ensemble model superior. Furthermore, if the intention is to train and deploy an ensemble model in a real-world water distribution network, it would likely require more time and computational resources compared to using a single Random Forest model."
Author Response
Reply to Reviewer 4’s Comments
Reviewer 4’s constructive comments were much appreciated, and his/her concerns and suggestions have been taken into consideration. The responses to the reviewer’s comments are presented below:
The manuscript presents a comparative analysis of machine learning models for enhancing leak detection in water distribution networks (WDNs) using acoustic noise loggers. A dataset of 2110 sound signals collected in Hong Kong was used to evaluate the performance of various models, including Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes (NB), K-Nearest Neighbors (KNN), Decision Tree (DT), Logistic Regression (LogR), Multi-Layer Perceptron (MLP), and YamNet. Among the individual models, RF achieved the highest accuracy (93.68%), followed by KNN (93.40%) and MLP (92.15%). A novel ensemble model combining the best-performing classifiers further improved detection accuracy to 94.40%. The study highlights the effectiveness of ensemble learning for real-time acoustic leak detection and offers a robust framework for integrating machine learning into urban water management systems.
Comments 1: Regarding a specific statement in the Introduction section: “In 2015, approximately 321 million m³ of transported potable water (33% of the supplied freshwater) was lost, and leaks in pressurized water pipelines were identified as the primary cause of water loss [2].” This is a very relevant figure, but the data is quite outdated (2015). Could you please update this value with more recent statistics, if available? Furthermore, you mention that this number has significantly increased, considering the water loss rates of 26.5% in 2010 and 31.6% in 2013. Please clarify or provide updated trends to support this statement.
Response: Thank you for your comment; the manuscript was updated as follows:
Recent assessments highlight that Hong Kong’s water loss situation remains a critical concern. More specifically, coverage from the Centre for Water Research & Resource Management (CWRRR) indicates that in 2023–24, metered water loss rates in Hong Kong reached a record high of 38.3%, surpassing the earlier cited 33% benchmark from 2015 [43]. Notably, almost half of the annual water loss was attributed to leaks and bursts in government mains alone, with a leakage rate estimated at about 15% in 2018 [2,3]. These trends underscore a persistent or even worsening challenge, contradicting earlier improvements and indicating that, despite infrastructure upgrades, water loss remains a major issue.
Comments 2: I have a question about this sentence: "Since 2008, the Water Supplies Department of Hong Kong has set a target under the Total Water Management Strategy to save 85 million m³ of water per year by 2030." The authors should identify the actual water leakage volumes as of 2025 to better evaluate the current magnitude of the problem. In addition, it would be helpful to include data on the average time required by the water utility to detect, locate, and repair leaks, as this time frame is crucial for estimating the volume of water wasted. Since your proposed tool aims to reduce the time between leak detection and localization, such information would highlight the potential impact of your approach.
Response: Thank you for your comment; the following was added to the text (note that the order of references was kept to distinguish the newer references from the older ones; the order will be updated in the proofing phase): Estimated fresh-water leakage volumes for government mains were approximately 121 million m³ in 2024, increasing gradually from 97 million m³ in 2020 [42].
Comments 3: According to this sentence: “The acoustic signals captured by these Noise Loggers were processed to extract relevant features. Subsequently, several machine learning models were employed, including SVM, RF, NB, KNN, DT, LogR, and MLP, to classify the acoustic signals and identify leak states.” Is this a supervised classification model, trained on labeled data to distinguish between predefined classes?
Response: Thank you for your comment; the following was added to the manuscript:
“.... by means of a supervised learning approach, trained on manually labeled leak and non-leak signal data, with ground truth established via inspection or hydrophone confirmation.”
Comments 4: According to this sentence: “Data were collected at multiple valve locations in Hong Kong. Leak signals were identified by acoustic noise loggers and verified manually using inspections or hydrophones, and non-leak signals were recorded from similar pipes under normal operation. The dataset includes 992 leak samples and 1118 no-leak samples (a slightly unbalanced but representative set). Each signal is labeled by confirming the presence/absence of a leak via inspection. To avoid geographic bias, leak and non-leak recordings were obtained from similar sites and times. We ensured that both classes include signals from all regions of the study area.” All pipes may look similar, but in reality, leak sounds can vary depending on the pipe material. Likewise, the shape and size of the leak orifice can differ based on pipe diameter and pressure. A dataset with 992 leak samples could be a reasonably good representation for a fairly homogeneous water distribution system with similar operating conditions and pipe characteristics. However, in a more heterogeneous water distribution network, this dataset might not be sufficient. A simple statistical analysis could help evaluate the representativeness of the sample. On the other hand, there is no figure showing the site location or providing general information about the water distribution system and the Water Utility, which makes the presentation incomplete. It would be helpful to show in a map the water distribution network and testing deployment.
Response: Thank you for your comment; the following was added to the manuscript: From a dataset standpoint, while the dataset comprises 992 leak and 1,118 non-leak samples from similar valves and pipe types across multiple regions in Hong Kong, it is critical to acknowledge that leak acoustic signatures can vary with pipe material, diameter, and pressure. As part of future work, it is paramount to conduct statistical tests or sampling stratification to confirm representativeness in more heterogeneous network sections.
Comments 5: According to sections 2.2 Feature extraction and 2.3 Data Visualization: do the authors reach any conclusion based on the analysis presented in this section? Is there any observable pattern or indication that justifies the use of STFT? A considerable amount of work was done, but it seems that no clear outcome or insight was derived from it.
Response: Thank you for your comment; the following was added to the text:
Accordingly, MFCC-based envelopes captured distinct spectral energy patterns in leak signals, justifying their use. Quantitative metrics (e.g., higher average energy in leak MFCC bands) supported this decision and motivated further modeling using time-frequency features.
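As a hedged illustration of the feature families discussed in this reply (MFCC, spectral contrast, tonnetz, chroma), the sketch below uses librosa with assumed parameters; the sampling rate, the number of MFCCs, and the frame-mean summarization are placeholders, not the authors' exact settings.

```python
# Sketch with assumed parameters (16 kHz sampling rate, 40 MFCCs, frame-mean
# summarization), not the authors' exact pipeline: extracting the four
# acoustic feature families with librosa.
import numpy as np
import librosa

def extract_features(path, sr=16000):
    """Return one fixed-length feature vector per recording."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # mel-scaled STFT based
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    # Summarize each time-frequency representation by its mean over frames.
    return np.concatenate([f.mean(axis=1)
                           for f in (mfcc, contrast, tonnetz, chroma)])

# features = extract_features("logger_recording.wav")  # hypothetical file name
```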
Comments 6: According to this sentence: “It is critical to note that the publicly available YamNet model (pre-trained on AudioSet) was used without further fine-tuning on our data. In future work, it is recommended to fine-tune YamNet on the noise-logger dataset for improved performance. Our current results thus reflect the baseline performance of YamNet’s general sound classification on leak data”. This sounds quite surprising. The dataset does not appear to have been prepared to remove ambient sounds such as car horns or background music, yet the model is achieving high performance in terms of precision, recall, etc. Could you please clarify this? Is there any explanation for how such results were obtained despite the presence of environmental noise?
Response: The following was added to the text:
Unlike YamNet, which was used without fine-tuning and was pre-trained on audio data dominated by general ambient sounds, the developed model focuses specifically on leak acoustic signatures. It extracts domain-specific features (e.g., MFCCs capturing low-frequency harmonics of leaks) and is trained on labeled leak and non-leak samples, even under conditions with environmental noise. As SHAP analysis confirms, leak-specific components such as MFCC₁–₁₀ heavily influence the model’s decisions, while ambient-noise-related features carry low SHAP influence. This targeted feature selection and supervised training explain the superior performance and robustness of this article’s model compared to YamNet.
Comments 7: According to this sentence: “The results of the proposed model and YamNet are significantly different, as seen from the accuracy metrics obtained in their respective experiments. The proposed model achieved an accuracy of 94.40%, outperforming YamNet, which achieved only 52.63% accuracy.” How do you explain the significant difference in performance indicators between the two models? Were both models trained and tested using the same dataset? What was the split between training and testing data? Additionally, were the models evaluated on new field data, or was the same dataset used for both training and testing?
Response: Thank you for your comment. Both models were assessed using the same dataset; no separate field data was used in this experiment. Would you like us to point that out in the manuscript?
Comments 8: Between lines 316 and 317: In terms of recall, the Ensemble Model exhibited a notably higher value of 92.66% compared to YamNet's 8.77%. This is something the authors should investigate further, as it could be the key to understanding what we are doing well and what we are doing wrong. Furthermore, the authors introduce the use of MFCC coefficients (from line 350), which they describe as a key component of the analysis. However, these coefficients remain a 'black box' to the reader. It would be helpful to provide a clearer explanation of what MFCCs represent and how they contribute to the model's performance.
Response: The manuscript was updated accordingly.
Comments 9: What do you mean with this sentence? Please introduce more deeply (line 343): “YamNet relies directly on the acoustic signals for leak detection”. The explanation provided for the poor performance of YamNet is rather limited and lacks depth. Considering the significance of this result, a more thorough analysis is necessary. The authors should discuss in greater detail the possible reasons behind YamNet’s underperformance, such as the mismatch between the training data and the specific characteristics of leak sounds, the presence of background noise, or the lack of fine-tuning. Additionally, it would be valuable to explore whether certain architectural limitations of YamNet contribute to these outcomes, and how future improvements or adaptations could enhance its performance in this context.
Response: Text is added to address the reviewer’s comment as follows: “In addition, the low performance of YamNet can be attributed to its struggle in the detection of subtle and small leaks due to their weak acoustic signatures (Gemmeke et al., 2017). Further, YamNet was trained on controlled audio environments and thus lacks the adaptability to deal with real-world variability in pipe materials, pressure fluctuations, and background noise, leading to inconsistent performance (Hershey et al., 2021). As such, YamNet requires specialized training and fine-tuning to optimize its architecture so that it can be adopted in a domain-specific application like leak detection of water pipes.”
Comments 10: In line 432, the authors make the following comment: “The combination of the three best perfroming models in the ensemble approach further enhances the accuracy of leak detection. The ensemble model attained an accuracy of 94.40%, surpassing the performance of individual models and emphasizing the efficacy of integrating diverse perspectives to enhance detection accuracy.” "I am not fully convinced that the ensemble model truly outperforms the Random Forest model. According to Tables 1 and 2, while the ensemble model does achieve a slightly higher precision (94.4% versus 93%), the other performance metrics, such as accuracy, recall, and F1-score, are actually lower than those obtained with the Random Forest model. Therefore, I would appreciate a clearer justification from the authors regarding why they still consider the ensemble model superior. Furthermore, if the intention is to train and deploy an ensemble model in a real-world water distribution network, it would likely require more time and computational resources compared to using a single Random Forest model."
Response: Thank you for your comments; an overhaul of this claim was conducted by performing McNemar’s test and ANOVA, and the following conclusions were derived:
Statistical validation approaches were added to support the conclusion that the ensemble method is meaningfully superior to using Random Forest (RF) alone. On the same test set, RF achieved 93.68% accuracy, while the Voting Ensemble attained 94.40%, with slightly higher precision (94.4% vs. 93.0%) but equivalent recall (≈ 92.7%) and a marginally higher F1‑score. However, these differences alone are not conclusive. McNemar’s test, used to assess paired classifier performance on the same instances, yielded a p‑value of 0.0987, showing that the ensemble’s test‑set improvement is not statistically significant at α = 0.05. In addition, a one‑way ANOVA comparing accuracy distributions across all base models confirmed significant differences (F ≈ 496.8, p < 0.0001), indicating that model choice matters significantly beyond random variation. Thus, although the ensemble’s test‑set advantage over RF is modest and not statistically significant, its cross‑validated stability, variance reduction, and the significant overall model performance differences justify its selection. This aligns with best practices in model evaluation literature, where ensemble methods are favored for enhanced robustness when performance across multiple metrics and samples is considered. In operational deployment, the inference overhead of making three parallel predictions (SVM, RF, MLP) within a Voting Classifier remains lightweight: each model is fast and memory-efficient. Our testing indicates < 10 ms latency per prediction on a standard laptop CPU, well within acceptable limits for real-time leak detection systems.
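A minimal sketch of the voting ensemble described in this reply (SVM, RF, MLP) is given below for illustration; the hyperparameters and the synthetic stand-in data are placeholders (the authors' tuned values are in their Table A1), and the choice of soft voting is an assumption.

```python
# Sketch with placeholder hyperparameters and synthetic stand-in data: a
# soft-voting ensemble of the three classifiers named in the reply (SVM, RF,
# MLP). Whether the paper used soft or hard voting is an assumption here.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2110, n_features=65, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("mlp", make_pipeline(StandardScaler(),
                              MLPClassifier(max_iter=1000, random_state=0))),
    ],
    voting="soft",  # average the three models' predicted probabilities
)
print(cross_val_score(ensemble, X, y, cv=5, scoring="accuracy").mean())
```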
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you to the authors for addressing the previously raised concerns; however, I still have one specific point that requires further consideration.
I appreciate the clarification regarding the consistent performance of the ensemble model across multiple metrics. However, I would like to reiterate the importance of supporting these differences with appropriate statistical tests, especially considering that the improvements in accuracy (e.g., from 93.68% to 94.40%) may be small in absolute terms. I understand that even modest improvements can have practical relevance, but this does not exempt the need to demonstrate whether such differences are statistically significant—through, for example, McNemar’s test for comparing classifiers on the same predictions, or ANOVA for cross-validated metrics. Including such analyses would strengthen the scientific validity of the findings and distinguish them from anecdotal results. I recommend considering this analysis or, at the very least, explicitly acknowledging the limitation that its absence represents.
Comments for author File: Comments.pdf
Author Response
Comment: I appreciate the clarification regarding the consistent performance of the ensemble model across multiple metrics. However, I would like to reiterate the importance of supporting these differences with appropriate statistical tests, especially considering that the improvements in accuracy (e.g., from 93.68% to 94.40%) may be small in absolute terms. I understand that even modest improvements can have practical relevance, but this does not exempt the need to demonstrate whether such differences are statistically significant—through, for example, McNemar’s test for comparing classifiers on the same predictions, or ANOVA for cross-validated metrics. Including such analyses would strengthen the scientific validity of the findings and distinguish them from anecdotal results. I recommend considering this analysis or, at the very least, explicitly acknowledging the limitation that its absence represents.
Response:
Thank you so much for your comment. We updated our code, reran the analysis, and added the following to the article:
To assess whether performance differences among models were statistically meaningful, two tests were conducted: McNemar’s test for pairwise classifier comparison and one-way ANOVA for multi-model comparison. While the ensemble model consistently outperformed individual models across multiple metrics, McNemar’s test (p = 0.0987) indicated that the improvement over the best individual model (Random Forest) was not statistically significant. This suggests that the observed accuracy increase may not generalize consistently across samples. However, the one-way ANOVA conducted on cross-validated accuracy scores yielded a highly significant result (p < 0.0001), indicating that performance differences among models are statistically meaningful overall. Together, these results justify the use of ensemble learning, while also highlighting the need for cautious interpretation of marginal improvements when comparing closely matched models.
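For transparency, the two tests described above can be reproduced along the following lines; the prediction arrays and per-fold accuracies below are illustrative placeholders, not the study's data.

```python
# Illustrative arrays only (not the study's predictions): McNemar's test on
# paired test-set predictions, plus a one-way ANOVA on per-fold accuracies.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
pred_rf = y_true.copy()
pred_rf[:32] ^= 1    # flip 32 labels: RF stand-in with ~93.6% accuracy
pred_ens = y_true.copy()
pred_ens[:28] ^= 1   # flip 28 labels: ensemble stand-in with ~94.4% accuracy

# 2x2 table of paired correct/incorrect outcomes (rows: RF, columns: ensemble)
rf_ok, ens_ok = pred_rf == y_true, pred_ens == y_true
table = [[(rf_ok & ens_ok).sum(), (rf_ok & ~ens_ok).sum()],
         [(~rf_ok & ens_ok).sum(), (~rf_ok & ~ens_ok).sum()]]
print("McNemar p =", mcnemar(table, exact=True).pvalue)

# One-way ANOVA across per-fold accuracies of several models (placeholders)
acc_rf = [0.935, 0.940, 0.932, 0.938, 0.936]
acc_ensemble = [0.942, 0.945, 0.941, 0.946, 0.943]
acc_nb = [0.780, 0.775, 0.790, 0.782, 0.778]
f_stat, p_val = f_oneway(acc_rf, acc_ensemble, acc_nb)
print(f"ANOVA: F={f_stat:.1f}, p={p_val:.2e}")
```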
Reviewer 2 Report
Comments and Suggestions for Authors
Reviewing report
Manuscript entitled "Comparative Analysis of Machine Learning Techniques in Enhancing Acoustic Noise Loggers Leak detection"
- Where is this: «Table A1 was added to the appendix with hyperparameters»?
- Please combine lines 80-121 as one paragraph.
The same for lines 518-545.
The same for lines 546-574.
GOOD LUCK
Author Response
1. Where is this: «Table A1 was added to the appendix with hyperparameters»?
We thank the reviewer for his/her comment; Table A1 can be found on page 20 of the manuscript.
2. Please combine lines 80-121 as one paragraph. The same for lines 518-545 and for lines 546-574.
Thank you for your comments, the required paragraphs have now been merged as requested.
Reviewer 3 Report
Comments and Suggestions for Authors
Accept
Author Response
Thank you for accepting the article.
Reviewer 4 Report
Comments and Suggestions for Authors
Second Revision
Comments 1: Regarding a specific statement in the Introduction section: “In 2015, approximately 321 million m³ of transported potable water (33% of the supplied freshwater) was lost, and leaks in pressurized water pipelines were identified as the primary cause of water loss [2].” This is a very relevant figure, but the data is quite outdated (2015). Could you please update this value with more recent statistics, if available? Furthermore, you mention that this number has significantly increased, considering the water loss rates of 26.5% in 2010 and 31.6% in 2013. Please clarify or provide updated trends to support this statement.
Thank you for your comment; the manuscript was updated as follows:
Recent assessments highlight that Hong Kong’s water loss situation remains a critical concern. More specifically, coverage from the Centre for Water Research & Resource Management (CWRRR) indicates that in 2023–24, metered water loss rates in Hong Kong reached a record high of 38.3%, surpassing the earlier cited 33% benchmark from 2015 [43]. Notably, almost half of the annual water loss was attributed to leaks and bursts in government mains alone, with a leakage rate estimated at about 15% in 2018 [2,3]. These trends underscore a persistent or even worsening challenge, contradicting earlier improvements and indicating that, despite infrastructure upgrades, water loss remains a major issue.
Ok, it has been improved.
Comments 2: I have a question about this sentence: "Since 2008, the Water Supplies Department of Hong Kong has set a target under the Total Water Management Strategy to save 85 million m³ of water per year by 2030." The authors should identify the actual water leakage volumes as of 2025 to better evaluate the current magnitude of the problem. In addition, it would be helpful to include data on the average time required by the water utility to detect, locate, and repair leaks, as this time frame is crucial for estimating the volume of water wasted. Since your proposed tool aims to reduce the time between leak detection and localization, such information would highlight the potential impact of your approach.
Thank you for your comment; the following was added to the text (note that the order of references was kept to distinguish the newer references from the older ones; the order will be updated in the proofing phase): Estimated fresh-water leakage volumes for government mains were approximately 121 million m³ in 2024, increasing gradually from 97 million m³ in 2020 [42].
Ok, it has been improved.
Comments 3: According to this sentence: “The acoustic signals captured by these Noise Loggers were processed to extract relevant features. Subsequently, several machine learning models were employed, including SVM, RF, NB, KNN, DT, LogR, and MLP, to classify the acoustic signals and identify leak states.” Is this a supervised classification model, trained on labeled data to distinguish between predefined classes?
Thank you for your comment; the following was added to the manuscript:
“.... by means of a supervised learning approach, trained on manually labeled leak and non-leak signal data, with ground truth established via inspection or hydrophone confirmation.”
It is impressive and reliable, as it is based on supervised learning and can also be adopted by other water utilities. However, water distribution characteristics should be taken into account, since water noise also depends on pipe features.
Comments 4: According to this sentence: “Data were collected at multiple valve locations in Hong Kong. Leak signals were identified by acoustic noise loggers and verified manually using inspections or hydrophones, and non-leak signals were recorded from similar pipes under normal operation. The dataset includes 992 leak samples and 1118 no-leak samples (a slightly unbalanced but representative set). Each signal is labeled by confirming the presence/absence of a leak via inspection. To avoid geographic bias, leak and non-leak recordings were obtained from similar sites and times. We ensured that both classes include signals from all regions of the study area.” All pipes may look similar, but in reality, leak sounds can vary depending on the pipe material. Likewise, the shape and size of the leak orifice can differ based on pipe diameter and pressure. A dataset with 992 leak samples could be a reasonably good representation for a fairly homogeneous water distribution system with similar operating conditions and pipe characteristics. However, in a more heterogeneous water distribution network, this dataset might not be sufficient. A simple statistical analysis could help evaluate the representativeness of the sample. On the other hand, there is no figure showing the site location or providing general information about the water distribution system and the Water Utility, which makes the presentation incomplete. It would be helpful to show in a map the water distribution network and testing deployment.
Thank you for your comment; the following was added to the manuscript: From a dataset standpoint, while the dataset comprises 992 leak and 1,118 non-leak samples from similar valves and pipe types across multiple regions in Hong Kong, it is critical to acknowledge that leak acoustic signatures can vary with pipe material, diameter, and pressure. As part of future work, it is paramount to conduct statistical tests or sampling stratification to confirm representativeness in more heterogeneous network sections.
Please indicate in the manuscript the relevant water distribution features and include a representative figure of the water distribution network in Hong Kong. For instance, the authors should provide information on the types of pipe materials used (e.g., PVC, ductile iron), the total length (in km) of each material type, and their proportion relative to the total network length. Additionally, include the typical diameter range for each material and the dimensions of the orifice breaks considered.
This is important because I am completely sure that leakage noise characteristics differ significantly between, for example, PVC pipes with diameters in the range of 20–100 mm and ductile iron pipes with diameters between 300–800 mm. Therefore, it is essential to clarify whether your leakage detection algorithms were trained using samples that reflect these variations.
Comments 5: According to sections 2.2 Feature extraction and 2.3 Data Visualization: do the authors reach any conclusion based on the analysis presented in this section? Is there any observable pattern or indication that justifies the use of STFT? A considerable amount of work was done, but it seems that no clear outcome or insight was derived from it.
Thank you for your comment; the following was added to the text:
Accordingly, MFCC-based envelopes captured distinct spectral energy patterns in leak signals, justifying their use. Quantitative metrics (e.g., higher average energy in leak MFCC bands) supported this decision and motivated further modeling using time-frequency features.
Ok, it is fine.
Comments 6: According to this sentence: “It is critical to note that the publicly available YamNet model (pre-trained on AudioSet) was used without further fine-tuning on our data. In future work, it is recommended to fine-tune YamNet on the noise-logger dataset for improved performance. Our current results thus reflect the baseline performance of YamNet’s general sound classification on leak data”. This sounds quite surprising. The dataset does not appear to have been prepared to remove ambient sounds such as car horns or background music, yet the model is achieving high performance in terms of precision, recall, etc. Could you please clarify this? Is there any explanation for how such results were obtained despite the presence of environmental noise?
The following was added to the text:
Unlike YamNet, which was used without fine-tuning and was pre-trained on audio data dominated by general ambient sounds, the developed model focuses specifically on leak acoustic signatures. It extracts domain-specific features (e.g., MFCCs capturing low-frequency harmonics of leaks) and is trained on labeled leak and non-leak samples, even under conditions with environmental noise. As SHAP analysis confirms, leak-specific components such as MFCC₁–₁₀ heavily influence the model’s decisions, while ambient-noise-related features carry low SHAP influence. This targeted feature selection and supervised training explain the superior performance and robustness of this article’s model compared to YamNet.
Ok, understood.
Comments 7: According to this sentence: “The results of the proposed model and YamNet are significantly different, as seen from the accuracy metrics obtained in their respective experiments. The proposed model achieved an accuracy of 94.40%, outperforming YamNet, which achieved only 52.63% accuracy.” How do you explain the significant difference in performance indicators between the two models? Were both models trained and tested using the same dataset? What was the split between training and testing data? Additionally, were the models evaluated on new field data, or was the same dataset used for both training and testing?
Thank you for your comment. Both models were assessed using the same dataset; no separate field data was used in this experiment. Would you like us to point that out in the manuscript?
Yes, please.
Comments 8: Between lines 316 and 317: In terms of recall, the Ensemble Model exhibited a notably higher value of 92.66% compared to YamNet's 8.77%. This is something the authors should investigate further, as it could be the key to understanding what we are doing well and what we are doing wrong. Furthermore, the authors introduce the use of MFCC coefficients (from line 350), which they describe as a key component of the analysis. However, these coefficients remain a 'black box' to the reader. It would be helpful to provide a clearer explanation of what MFCCs represent and how they contribute to the model's performance.
The manuscript was updated accordingly.
Ok
Comments 9: What do you mean with this sentence? Please introduce more deeply (line 343): “YamNet relies directly on the acoustic signals for leak detection”. The explanation provided for the poor performance of YamNet is rather limited and lacks depth. Considering the significance of this result, a more thorough analysis is necessary. The authors should discuss in greater detail the possible reasons behind YamNet’s underperformance, such as the mismatch between the training data and the specific characteristics of leak sounds, the presence of background noise, or the lack of fine-tuning. Additionally, it would be valuable to explore whether certain architectural limitations of YamNet contribute to these outcomes, and how future improvements or adaptations could enhance its performance in this context.
Text is added to address the reviewer’s comment as follows: “In addition, the low performance of YamNet can be attributed to its struggle in the detection of subtle and small leaks due to their weak acoustic signatures (Gemmeke et al., 2017). Further, YamNet was trained on controlled audio environments and thus lacks the adaptability to deal with real-world variability in pipe materials, pressure fluctuations, and background noise, leading to inconsistent performance (Hershey et al., 2021). As such, YamNet requires specialized training and fine-tuning to optimize its architecture so that it can be adopted in a domain-specific application like leak detection of water pipes.”
Ok
Comments 10: In line 432, the authors make the following comment: “The combination of the three best perfroming models in the ensemble approach further enhances the accuracy of leak detection. The ensemble model attained an accuracy of 94.40%, surpassing the performance of individual models and emphasizing the efficacy of integrating diverse perspectives to enhance detection accuracy.” "I am not fully convinced that the ensemble model truly outperforms the Random Forest model. According to Tables 1 and 2, while the ensemble model does achieve a slightly higher precision (94.4% versus 93%), the other performance metrics, such as accuracy, recall, and F1-score, are actually lower than those obtained with the Random Forest model. Therefore, I would appreciate a clearer justification from the authors regarding why they still consider the ensemble model superior. Furthermore, if the intention is to train and deploy an ensemble model in a real-world water distribution network, it would likely require more time and computational resources compared to using a single Random Forest model."
Thank you for your comments; an overhaul of this claim was conducted by performing McNemar’s test and ANOVA, and the following conclusions were derived:
Statistical validation approaches were added to support the conclusion that the ensemble method is meaningfully superior to using Random Forest (RF) alone. On the same test set, RF achieved 93.68% accuracy, while the Voting Ensemble attained 94.40%, with slightly higher precision (94.4% vs. 93.0%) but equivalent recall (≈ 92.7%) and a marginally higher F1‑score. However, these differences alone are not conclusive. McNemar’s test, used to assess paired classifier performance on the same instances, yielded a p‑value of 0.0987, showing that the ensemble’s test‑set improvement is not statistically significant at α = 0.05. In addition, a one‑way ANOVA comparing accuracy distributions across all base models confirmed significant differences (F ≈ 496.8, p < 0.0001), indicating that model choice matters significantly beyond random variation. Thus, although the ensemble’s test‑set advantage over RF is modest and not statistically significant, its cross‑validated stability, variance reduction, and the significant overall model performance differences justify its selection. This aligns with best practices in model evaluation literature, where ensemble methods are favored for enhanced robustness when performance across multiple metrics and samples is considered. In operational deployment, the inference overhead of making three parallel predictions (SVM, RF, MLP) within a Voting Classifier remains lightweight: each model is fast and memory-efficient. Our testing indicates < 10 ms latency per prediction on a standard laptop CPU, well within acceptable limits for real-time leak detection systems.
Ok, although the authors have explained the comparison between RF and the ensemble method better in their comments than in the main text, I advise them to explain in the main text (manuscript) why they chose the ensemble model over RF in order to avoid doubts.
Author Response
Reply to Reviewer 4’s Comments
Reviewer 4’s constructive comments were much appreciated, and his/her concerns and suggestions have been taken into consideration. The responses to the reviewer’s comments are presented below:
Comment: It is impressive and reliable, as it is based on supervised learning and can also be adopted by other water utilities. However, water distribution characteristics should be taken into account, since water noise also depends on pipe features.
Response: The authors thank the reviewer for his/her positive comments.

Comment: On the other hand, there is no figure showing the site location or providing general information about the water distribution system and the Water Utility, which makes the presentation incomplete. It would be helpful to show in a map the water distribution network and testing deployment. Please indicate in the manuscript the relevant water distribution features and include a representative figure of the water distribution network in Hong Kong. For instance, the authors should provide information on the types of pipe materials used (e.g., PVC, ductile iron), the total length (in km) of each material type, and their proportion relative to the total network length. Additionally, include the typical diameter range for each material and the dimensions of the orifice breaks considered. This is important because I am completely sure that leakage noise characteristics differ significantly between, for example, PVC pipes with diameters in the range of 20–100 mm and ductile iron pipes with diameters between 300–800 mm. Therefore, it is essential to clarify whether your leakage detection algorithms were trained using samples that reflect these variations.
Response: Figure 3 and accompanying text are added to address the reviewer’s comment. The added text is as follows: “The acquired signals in this study were collected from metallic pipes (such as steel, stainless steel, cast iron, and ductile iron) and non-metallic pipes (such as plastic). In addition, the pipe diameters varied between 100 and 300 mm. Figure 3 illustrates a common deployment location for noise loggers within the water distribution networks of Hong Kong. The yellow arrows in Figure 3a indicate the positions of the noise loggers mounted on the pipe valves to capture leakage noise signals. Moreover, Figure 3b provides a real-world view of the leak.”

Comment: Ok, although the authors have explained the comparison between RF and the ensemble method better in their comments than in the main text, I advise them to explain in the main text (manuscript) why they chose the ensemble model over RF in order to avoid doubts.
Response: Thank you for your comment; the following was added to the manuscript: In the common test split, the Voting Ensemble achieved an accuracy of 94.40%, while the best single model (Random Forest) achieved 93.68%. Precision was slightly lower for the ensemble (89.78%) than for RF (93.00%, rounded to whole percentages in Table 1). Recall was almost the same for both models, and the F1 score for the ensemble (91.00%) was slightly lower than for RF (93.00%). To check whether this small accuracy difference was real or due to chance, McNemar’s test was applied to paired predictions from both models. The result was p = 0.0987, which is not below the significance level of 0.05. This means there is no strong statistical proof that one model is better than the other on this specific test set. However, a one-way ANOVA on cross-validated accuracies for all models gave F ≈ 496.8, p < 0.0001, showing that model choice does have a big impact when results are averaged over many data splits. The ensemble was chosen over RF not because it won in every single metric, but because it is more robust. By combining three different types of models (tree-based, instance-based, and neural), it reduces random variation in results and works more reliably across different data samples and conditions. This is especially useful when each model makes different types of mistakes, because the ensemble can correct for them. In practice, the extra computation for using the ensemble is very small. The three models are all fast to run, and the total prediction time per signal is less than 10 ms on a standard laptop CPU. If resources are very limited, RF alone can still be used, with only a small drop in performance.

Comment: Yes, please.
Response: Thank you for your response; the requested clarification was added to the text.
Author Response File: Author Response.pdf