Review Reports
- Alaa Aldein M. S. Ibrahim1,*,
- Mfanasibili Nkonyane2 and
- Mlondi Ngcobo2
- et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript, titled "Data-Driven Machine Learning Models for E. coli Concentrations Prediction," has several areas for improvement in terms of structure, format, language, and academic rigor.
1. The last sentence of the abstract, "These findings highlight…," is slightly repetitive and should be merged with the preceding text or shortened.
2. The keyword "E.coli" should be changed to "E. coli" (with a space) to conform to naming conventions.
3. The citation format in line 40 is inconsistent; for example, "Gupta et al [6]" should be changed to "Gupta et al. [6]".
4. In line 39, the citation numbers in "[8, 9, 11-21]" are not consecutive and are missing "[10]," requiring verification of the bibliography.
5. In line 82, "ROC values exceeding 85%" does not specify whether it refers to AUC-ROC or other indicators; this needs clarification.
6. Line 83: The description of “Nafsin et al. [31]” is too brief and does not explain its research conclusions. It is recommended to add a summary sentence.
7. Inconsistent variable symbols in formula (1): The formula uses uppercase Xij, but it is written as lowercase xij in the text. This should be consistent.
8. Citation symbols are inconsistent in many places, such as “[36]” and “[39]”; they should all follow the same format.
9. Missing formula numbers. It is recommended to number each important formula and cite it in the text.
10. “X” in the formula on page 10 should be bold, and its dimensions are not explained in the surrounding text. It is recommended to add them.
11. The unit of “Fluoride” in Table 2 is “mg/L”, but the mean is 100.644, which is much higher than the common drinking water standard. Is the unit incorrect? It needs to be checked.
12. The “P-Value” column in Table 3 should use scientific notation or a uniform number of decimal places. The current format is inconsistent.
13. The range and values of the "Relative importance" axis in Figure 4 do not match; adjustment is recommended.
14. The MAE of kNN in Table 4 is 0.0585, but the E. coli concentration unit is MPN/100 mL. Has the target been standardized? This needs to be explained in the text. In addition, the explanation of the superior performance of kNN is superficial; a more in-depth analysis that combines its "local modeling" characteristics with the features of the data is needed.
15. The "Future Work" section in the conclusion is too general; more specific experimental designs or model improvement directions are recommended.
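As an illustration of comment 14, the sketch below shows how an MAE computed on a scaled target relates to the MAE in original units. It assumes min-max scaling, which the manuscript is not known to have used; the counts, predictions, and scaling bounds are invented for the example.

```python
# Hypothetical illustration: relating an MAE on a min-max scaled target
# to the MAE in original units (MPN/100 mL). All numbers are made up.

def mae(y_true, y_pred):
    """Mean absolute error of two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def min_max_scale(values, lo, hi):
    """Map values linearly so that lo -> 0 and hi -> 1."""
    return [(v - lo) / (hi - lo) for v in values]

# Invented E. coli counts and predictions, in MPN/100 mL
y_true = [10.0, 50.0, 200.0, 1000.0]
y_pred = [20.0, 40.0, 260.0, 900.0]
lo, hi = 0.0, 1000.0  # assumed minimum and maximum of the training target

mae_original = mae(y_true, y_pred)
mae_scaled = mae(min_max_scale(y_true, lo, hi), min_max_scale(y_pred, lo, hi))

# Under a linear scaling, every absolute error shrinks by the factor (hi - lo),
# so multiplying the scaled MAE by that range recovers the original-unit MAE.
mae_recovered = mae_scaled * (hi - lo)

print(mae_original, mae_scaled, mae_recovered)
```

This shortcut holds only for linear scalings. If the target was log-transformed, predictions would have to be back-transformed before the error is computed.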
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The topic addressed in the manuscript is relevant and the authors have access to a unique long-term dataset. However, several aspects require clarification and revision before the paper can be considered for publication.
1. Keywords should be listed in alphabetical order. Avoid using keywords already present in the title.
2. Formatting of E. coli should be consistent throughout the text (scientific names in italics).
3. References inside the text: when two or more citations appear together, they should be placed within the same set of brackets (e.g. [1, 2]) and not separately ([1], [2]).
4. Introduction: the final paragraph that describes the organization of the manuscript is not necessary and may be removed.
5. Study area description: please provide the exact coordinates of the sampling sites. A map would be helpful.
6. Table 2: Several values appear unrealistic (e.g., Fluoride concentration). Please check units and confirm all laboratory results. Given the strong skewness of E. coli, reporting median, IQR and percentiles would be more informative than min/max.
7. Statistical analysis: Pearson correlation with large sample sizes leads to statistically significant but practically irrelevant correlations (e.g. |r| ≈ 0.03). The discussion should be more cautious and focus on effect size, not only p-values. Consider also Spearman or Kendall to capture non-linear relationships.
8. Machine Learning approach: Avoid random train/test split in long time series. A temporally aware validation (e.g., chronological split or rolling window) is needed to prevent information leakage. Please clarify how missing data were imputed (fitting only on the training fold) and how hyperparameters were tuned.
9. Model interpretability: RF feature importance can be biased. Permutation importance and SHAP analysis would provide more reliable interpretation.
10. Results presentation: Some reported metrics (e.g., MAE and RMSE) appear to be in a normalized or transformed scale. Please indicate clearly the scale of evaluation and include metrics in the original E. coli units to facilitate interpretation for practitioners.
11. Discussion: Anthropogenic and natural factors potentially influencing E. coli concentrations are mentioned but not integrated into the models (e.g. rainfall, hydrodynamics, land use, WWTP discharges). Please expand this point or justify their absence. Limitations should be strengthened.
Overall, the manuscript has potential but substantial improvements are needed in statistical rigor, transparency of the modelling process, and presentation of the results.
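The validation scheme requested in comment 8 could look roughly like the following. This is a minimal sketch on invented records, not the authors' pipeline: the column layout, the helper names, and the choice of mean imputation are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records: (sampling date, turbidity, E. coli count).
# None marks a missing turbidity measurement.
records = [
    (date(2020, 3, 1), 4.0, 120.0),
    (date(2020, 1, 1), 2.0, 30.0),
    (date(2020, 5, 1), None, 80.0),
    (date(2020, 2, 1), None, 45.0),
    (date(2020, 4, 1), 6.0, 200.0),
]

def chronological_split(rows, n_train):
    """Sort by date and cut once, so every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r[0])
    return ordered[:n_train], ordered[n_train:]

def fit_mean(rows, col):
    """Mean of the observed (non-missing) values in one column."""
    observed = [r[col] for r in rows if r[col] is not None]
    return sum(observed) / len(observed)

def impute(rows, col, fill):
    """Replace missing values in one column with a precomputed fill value."""
    return [
        tuple(fill if (i == col and v is None) else v for i, v in enumerate(r))
        for r in rows
    ]

train, test = chronological_split(records, n_train=3)

# The imputation statistic is fitted on the training period only and then
# reused for the test period, so no test-period information leaks backwards.
fill = fit_mean(train, col=1)
train = impute(train, 1, fill)
test = impute(test, 1, fill)

print(fill, train, test)
```

A rolling-window variant would repeat the same two steps for each successive cut point, refitting the imputation statistic on each training window.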
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The article is titled "Data-Driven Machine Learning Models for E. coli Concentrations Prediction." This study examines the application of data-driven machine learning models to predict E. coli concentrations in the Midmar Dam, using readily available physicochemical parameters. The authors conducted a comparative analysis using five classic, stand-alone machine learning algorithms: Random Forest (RF), Support Vector Machine (SVM), k-nearest neighbors (kNN), Artificial Neural Network (ANN), and Extreme Gradient Boosting (XGBoost).
My comments are as follows:
- The introduction is well-organized; the authors reviewed the literature and identified gaps in existing research regarding the challenges of real-time E. coli monitoring and rapid response applications.
- The study area requires an extended description. First and foremost, the study area should be presented on a detailed map. The map should include geographic coordinates, and the overview map should be labeled (country). A separate scale bar should be provided for each map presented. The study area map should include other elements, such as the river network, land use, and major towns. This section should briefly describe the hydroclimatic conditions that may influence E. coli levels (sources of contamination). It should be noted that the map should be readable by readers outside of South Africa.
- In section 2.3, Laboratory Methods, the authors provide information on how water samples were collected. Therefore, the title should be "Data." The authors must indicate the period over which water samples were collected, state how often and how many samples were taken, and present a map showing the locations of the sampling points. This section should include information from lines 110-116.
- In subsection 3.1, explain the origin of the extreme E. coli values and whether they were related to the hydrological situation (high or low water levels in the reservoir, or high or low flows in rivers flowing into the reservoir).
- The discussion subsection should be supplemented with references. Reference should be made to other studies, indicating the advantages and disadvantages of the method used. Conclusions should be modified and should present the general findings of the study.
Technical notes:
- The article should be adapted to the requirements of the journal Sustainability. The manuscript was prepared using the Journal Not Specified form.
- Figure 1 requires correction.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
This is an interesting and informative research study examining the capacity of five classical standalone ML algorithms to predict E. coli densities in a freshwater dam. Previously, I have done some modeling research in microbial physiological ecology, but not with the suite of machine learning algorithms used here. Nonetheless, I have made a critical reading of the manuscript, and believe that it provides a scientifically informed analysis of the relative error rates among the five algorithms used for this predictive study; and it can be a useful guide for others who want to do similar environmental studies. I have only some minor comments to make.
With respect to the NN model, it may be helpful if the authors comment in the Discussion about the choice of a 20-neuron NN model, given that the number of neurons in a neural network can impact its performance, e.g., too few leading to underfitting (inability to capture complex patterns) and too many leading to overfitting (poor generalization). If the authors had a rationale for the choice of the number of neurons used, it may be of interest to some readers. Nonetheless, the kNN algorithm as reported provides good predictive ability. Some minor text corrections: E. coli needs to be consistently italicized throughout the manuscript. Some of the text (e.g. Introduction) could be more easily read if the continuous text were separated into some smaller paragraphs.
More detailed comments:
The Introduction provides a coherent and adequately documented (referenced) background for the study. Cited references are adequately up-to-date. The Materials and Methods section is clearly written and adequately detailed. The authors used Pearson’s parametric correlation statistical analyses. Did they verify that the data in each set of analyses were adequately normally distributed to use this parametric correlation method? The Results section is concise and overall is clearly written and adequately interpretive. I do have a few small suggestions below, however, for the authors to consider. The Discussion is well written, and the References are appropriate.
The authors’ final reflective, critical discussion is appropriate and informative.
Further comment
Lines 356–357: This statement is, perhaps, too generously asserted: “Dissolved Oxygen and pH showed weak negative correlations with E. coli (R = -0.038 and R = -0.037, respectively), suggesting that higher oxygen levels and increased alkalinity may be associated with slightly reduced E. coli counts.” These are at best negligible R values. The R² values are on the order of 1 × 10⁻³ (about 0.0014), which means a negligible percentage of the variance is accounted for. Even if the P values are small, indicating statistical significance, the R² values carry very little predictive significance. I believe that the authors should be more cautious in wording their conclusion.
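The effect-size point can be checked with a few lines of arithmetic. The sketch below squares the two correlation coefficients quoted from the manuscript; the dictionary and variable names are illustrative only.

```python
# Correlation coefficients as quoted in the comment on lines 356-357.
r_values = {"Dissolved Oxygen": -0.038, "pH": -0.037}

# The coefficient of determination is the square of Pearson's r and gives
# the fraction of variance in E. coli accounted for by each variable.
r_squared = {name: r ** 2 for name, r in r_values.items()}

for name, r2 in r_squared.items():
    print(f"{name}: r^2 = {r2:.5f} ({100 * r2:.3f}% of variance)")
```

Both values come out at roughly 0.0014, so each variable accounts for well under 1% of the variance regardless of how small the associated P value is.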
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Good job!