Open Access Article

Peer-Review Record

Stream-DBSCAN: A Streaming Distributed Clustering Model for Water Quality Monitoring

Appl. Sci. 2023, 13(9), 5408; https://doi.org/10.3390/app13095408

by Chunxiao Mu¹, Yanchen Hou²,*, Jindong Zhao², Shouke Wei²,³ and Yuxuan Wu²

Reviewer 1: Anonymous

Reviewer 2: Pierre Mandel

Submission received: 15 March 2023 / Revised: 17 April 2023 / Accepted: 18 April 2023 / Published: 26 April 2023

Round 1

Reviewer 1 Report

To analyze large volumes of water quality data more quickly and accurately, the authors propose a streaming DBSCAN model, a distributed stream-processing clustering model. Experiments on a real data set demonstrate the usability of the algorithm.

Some comments:

1. What are the advantages of the proposed model in this paper compared to models based on the Spark Streaming framework?

2. For the paper to be self-contained, the authors should provide a brief introduction to Flink in Section 2.

3. I suggest that the authors add at least one baseline, such as DenStream [1], D-Stream [2], DBSTREAM [3], or CEDAS [4].

[1] Density-Based Clustering over an Evolving Data Stream with Noise

[2] Density-Based Clustering for Real-Time Stream Data

[3] Clustering Data Streams Based on Shared Density Between Micro-Clusters

[4] Fully Online Clustering of Evolving Data Streams into Arbitrarily Shaped Clusters

In addition, the authors do not review or introduce some well-known density-based clustering algorithms, such as M3W [5] and LDP-MST [6].

[5] M3W: Multistep Three-Way Clustering. IEEE Transactions on Neural Networks and Learning Systems (2022).

[6] Clustering with Local Density Peaks-Based Minimum Spanning Tree. IEEE Transactions on Knowledge and Data Engineering (2021).

4. Section 3.3 states that the data are preprocessed using the k-means algorithm. It is worth noting that the clustering results of the k-means algorithm are affected by the random initialization. How does this randomness affect the final clustering results? Please give the corresponding experiments in Section 3.

5. I suggest that the authors use at least two evaluation criteria, such as the Fowlkes–Mallows index (FMI), Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Adjusted Mutual Information (AMI).

6. I strongly recommend that the authors release the dataset along with the submission, since a source dataset release will help others reproduce and build upon the authors' results.
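
Comments 4 and 5 can be made concrete with pair-counting agreement measures. The following is a minimal sketch, not the authors' evaluation code: it computes ARI and FMI in plain Python from two labelings of the same points. The function names (pair_counts, adjusted_rand_index, fowlkes_mallows_index) and the example labels are assumptions introduced here for illustration.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(labels_a, labels_b):
    """Count point pairs grouped together in both labelings, in a, in b, and overall."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    joint = Counter(zip(labels_a, labels_b))
    both = sum(comb(c, 2) for c in joint.values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return both, in_a, in_b, comb(len(labels_a), 2)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index: 1.0 for identical partitions, about 0 for chance agreement."""
    both, in_a, in_b, total = pair_counts(labels_a, labels_b)
    if total == 0:
        return 1.0  # fewer than two points: nothing to disagree on
    expected = in_a * in_b / total
    max_index = (in_a + in_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (both - expected) / (max_index - expected)


def fowlkes_mallows_index(labels_a, labels_b):
    """Fowlkes-Mallows index: geometric mean of pairwise precision and recall."""
    both, in_a, in_b, _ = pair_counts(labels_a, labels_b)
    if in_a == 0 or in_b == 0:
        return 0.0
    return both / sqrt(in_a * in_b)


# Illustrative labels only (not taken from the manuscript).
run_seed_1 = [0, 0, 1, 1]
run_seed_2 = [0, 0, 1, 2]
ari = adjusted_rand_index(run_seed_1, run_seed_2)
fmi = fowlkes_mallows_index(run_seed_1, run_seed_2)
```

Applied to the labels from two k-means runs that differ only in their random seed, as raised in comment 4, these scores give a direct measure of how much the initialization changes the partition; values near 1 would indicate a stable result.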

Author Response

Editorial feedback

Thank you very much for the valuable feedback provided by the experts. During these 10 days, I have carefully read the reviewers' comments and made improvements to the article. The parts of the article shown in purple are the modifications made in response to your comments. The following are the changes to the article:

Author Response File: Author Response.docx

Reviewer 2 Report

This manuscript presents a study showing how operational measurements can be automatically processed in order to detect changes in water quality. A case study on a dataset gathering measurements of three quality parameters (pH, NH₄-N and turbidity) over time is presented. This work, although it addresses an issue very relevant to the research community on water quality, suffers from many shortcomings that, in my opinion, disqualify it from publication:

1. The main problem is that this work does not give the reader enough detail to assess the quality of the results. The most interesting part of the article, the discussion of the results, completely misses its point because: i) it relies on an arbitrary definition of the quality of the water (marks 1 to 4 in Table 4), which is neither presented nor discussed; ii) only 2.1% of the results are discussed; in particular, the physical interpretation of the largest cluster, which gathers 81.8% of the experimental points, is not discussed.

2. Important methodological choices are not presented, justified, or explained: the number of clusters in the first clustering step using k-means should be explained, as should the choices for ε and MinPts, the distance used with DBSCAN (is it an L2 or an L1 distance?), and the normalization scheme (using min and max values makes it very sensitive to outliers).

3. The manuscript is very clumsy, which makes it unpleasant to read. Many methodological details are repeated (e.g., the first two paragraphs of the conclusion, lines 251-254), and figures are not clear (in Figures 4 and 5, the authors should give the actual timestamps and the units used; Figure 7 is neither presented nor discussed in the text; Figure 6 should be replaced by a confusion matrix including cluster V). Some notations are introduced but not used, such as φ_ii (line 283). Equation 1 does not give the expression of ε but instead gives a definition for the neighborhood with parameter δ!

4. The bibliographical context is incompletely presented as many already published studies have proposed interesting insights on the topic. See for example:

· Di, Z., Chang, M., Guo, P., Li, Y. and Chang, Y. (2019) Using Real-Time Data and Unsupervised Machine Learning Techniques to Study Large-Scale Spatio–Temporal Characteristics of Wastewater Discharges and their Influence on Surface Water Quality in the Yangtze River Basin. Water 11(6).

· Kühnert, C., Bernard, T., Arango, I.M. and Nitsche, R. (2014) Water Quality Supervision of Distribution Networks Based on Machine Learning Algorithms and Operator Feedback. Procedia Engineering 89, 189-196.

· Mandel, P., Wang, Y., Parre, A., Féliers, C. and Heim, V. (2021) Quality zones automatically identified in water distribution networks by applying data clustering methods to conductivity measurements. Water Research 207, 117716.

· Ranjithkumar, M. and Robert, L. (2021) Inventive Communication and Computational Technologies, pp. 387-396.

· Tinelli, S. and Juran, I. (2019) Artificial intelligence-based monitoring system of water quality parameters for early detection of non-specific bio-contamination in water distribution systems. Water Supply 19(6), 1785-1792.

· Vries, D., van den Akker, B., Vonk, E., de Jong, W. and van Summeren, J. (2016) Application of machine learning techniques to predict anomalies in water supply networks. Water Science and Technology: Water Supply 16(6), 1528.

Therefore, I recommend that this manuscript be rejected for publication in Applied Sciences.
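
The remark in comment 2 about min-max normalization can be illustrated with a short sketch. It is only an illustration under stated assumptions: the readings are invented values, the function names are hypothetical, and the median/IQR scaling shown for contrast is one common alternative, not something proposed in the manuscript.

```python
from statistics import median, quantiles


def min_max_scale(values):
    """Scale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    med = median(values)
    return [(v - med) / iqr for v in values]


def spread(values):
    """Range covered by a list of scaled values."""
    return max(values) - min(values)


# Hypothetical turbidity readings (illustrative only): 11 regular values plus one spike.
regular = [2.0 + 0.2 * i for i in range(11)]
with_spike = regular + [100.0]

# Range occupied by the 11 regular readings after each scaling.
spread_minmax_clean = spread(min_max_scale(regular))
spread_minmax_spike = spread(min_max_scale(with_spike)[:11])
spread_robust_spike = spread(robust_scale(with_spike)[:11])
```

With a single spike, min-max scaling squeezes all regular readings into a small fraction of the [0, 1] interval, so a fixed DBSCAN radius ε would treat them as nearly identical points. The median/IQR variant keeps their spread at roughly the same order of magnitude.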

Author Response

Editorial feedback

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

All my comments have been adequately addressed. I recommend acceptance.

Author Response

Thank you very much for your endorsement. In the past 5 days, we have revised the English style of the article and polished the sentences.

Reviewer 2 Report

I appreciate the authors' effort to address the issues I mentioned in my first report. However, I still think that the paper has to be improved in the following ways:

1. Results are still not presented in sufficient detail. This point is crucial to assess the relevance of the work from an operational perspective. The paper would greatly benefit from an analysis of the distribution of the clusters over the course of the day. To do so, please replace Table 5 with a confusion matrix presenting the proportion of timestamps classified in clusters I to V versus the quality timeslots (8PM to 10AM, etc.). By doing so, you can also remove Figure 6, which conveys only a part of the information gathered in a confusion matrix. Also, explain why your quality timeslots overlap (8PM-10AM overlaps with 6PM-8AM, for instance).

2. The article still needs a lot of polishing! Please address the remarks I already expressed: many methodological details are repeated (e.g., the first two paragraphs of the conclusion are a mere repetition of previous methodological aspects), and figures are not clear (in Figure 5, the authors should give the actual timestamps and the units used). Some notations are introduced but not used, such as φ_ii (line 393). Equation 1 does not give the expression of ε but instead gives a definition for the neighborhood with parameter δ (δ is not Eps)! The measured parameters are not written in a consistent manner; see e.g. “Ph” on line 460 and “PH” in Tables 2 and 4. Index is singular, indices is plural, so correct lines 21, 100 and 102. Please ask someone who is proficient in English to help you polish your paper.
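
The table requested in comment 1 could be built along the following lines. This is a sketch under assumptions, not the authors' code: the slot boundaries are hypothetical and deliberately non-overlapping (unlike the manuscript's timeslots), and the names SLOTS, slot_of and cluster_by_slot_matrix are introduced here for illustration.

```python
from collections import Counter

CLUSTERS = ["I", "II", "III", "IV", "V"]
# Hypothetical, non-overlapping time slots: (name, start_hour, end_hour), end exclusive.
SLOTS = [("00-08", 0, 8), ("08-16", 8, 16), ("16-24", 16, 24)]


def slot_of(hour):
    """Return the name of the single slot containing the given hour (0-23)."""
    for name, start, end in SLOTS:
        if start <= hour < end:
            return name
    raise ValueError(f"hour out of range: {hour}")


def cluster_by_slot_matrix(hours, labels):
    """Rows are clusters, columns are slots; each cell is a proportion of all timestamps."""
    if len(hours) != len(labels):
        raise ValueError("hours and labels must have the same length")
    if not hours:
        raise ValueError("at least one timestamp is required")
    counts = Counter((label, slot_of(hour)) for hour, label in zip(hours, labels))
    total = len(hours)
    return {
        cluster: {name: counts[(cluster, name)] / total for name, _, _ in SLOTS}
        for cluster in CLUSTERS
    }


# Illustrative input only: hour of day of each timestamp and its assigned cluster.
hours = [1, 2, 9, 10, 17, 23]
labels = ["I", "I", "II", "I", "V", "V"]
matrix = cluster_by_slot_matrix(hours, labels)

for cluster in CLUSTERS:
    row = "  ".join(f"{matrix[cluster][name]:.2f}" for name, _, _ in SLOTS)
    print(f"{cluster:>3}  {row}")
```

Because every hour falls into exactly one slot, the cells sum to 1 and each timestamp is counted once, which is not guaranteed when the timeslots overlap.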

Author Response

Thank you very much for the valuable feedback provided by the experts. During these 5 days, I have carefully read the reviewer's comments and made improvements to the article. The modifications made in response to your comments are shown in red font. The following are the changes to the article:

Author Response File: Author Response.docx

Round 3

Reviewer 2 Report

Thank you for correcting your paper.

Please check again lines 77 to 84 in order to:

- replace "Reference [14]" (and others) by the real researchers' names

- replace "xx" on line 78 with actual text.

Author Response

I salute your rigorous academic attitude! Thank you very much for your guidance on the article. The corrections for the issues you raised are marked in green font. I deeply apologize for the basic errors in the article!
