Prediction of Time Variation of Local Scour Depth at Bridge Abutments: Comparative Analysis of Machine Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper entitled “Prediction of Time Variation of Local Scour Depth at Bridge Abutments: Comparative Analysis of Machine Learning” compares various machine learning (ML) algorithms (LR, SVR, RFR, GBR, XGBoost, LGBM, and KNN) for predicting the time variation of local scour depth at bridge abutments. The paper requires major revision. My detailed comments are as follows:
ABSTRACT
Q.1. The paper requires thorough English editing, as there are several grammatical and typographical errors.
-Example: Abstract, line 17 – “scout depth” should be “scour depth”.
-Example: Abstract, line 10 – “obtaine” should be “obtain”.
-Similar issues occur throughout the manuscript and must be corrected.
INTRODUCTION
Q. 2. The goal of the study is repeated twice at the end of the Introduction. Please merge these into a single concise statement.
“….of 3,275 records and contains key hydraulic parameters such as flow depth (Y), abutment length (L), channel width (B), flow velocity (V), time (t), and median grain size (d50). In this study, the scour depth (Ds) around bridge abutments for a given time was predicted depending on hydraulic parameters influencing the scour phenomena of flow depth (Y), pier length (L), channel width (B), flow velocity (V), time (t), and mean grain size (d₅₀) by using different machine learning models…”
Q. 3. The manuscript primarily applies well-established ML algorithms to scour prediction with an existing dataset. The novelty over prior AI-based scour prediction studies (e.g., those cited in [4–15]) is not strongly articulated. The authors should emphasize what is truly new—e.g., the time-dependent focus, cross-model benchmarking, or data fusion from multiple studies.
MATERIALS and METHODS
Q. 4. In Section 2.1, note that local scour occurs due to the downward flow in front of the pile and the horse-shoe vortex (HSV). Please reflect this in the text and update Figure 1 to show HSV instead of “principal vortex.”
Q. 5. Use meters for all relevant values; avoid cm and mm for consistency.
Q. 6. In the section on parameters influencing scour, consult the work of Prof. Sumer and Prof. Fredsøe. Also note that scour depth correlates with the ratio of water depth to pile diameter (Y/D) and the ratio of pier diameter to sediment mean diameter (D/d₅₀).
Q. 7. Figure 4 does not provide significant information and should be removed.
Q. 8. Section 2.3 lacks references; the following is more appropriate for Equation (1): Wells, D., & Sorensen, R. (1970). Scour Around a Circular Cylinder Due to Wave Motion. In Coastal Engineering 1970 (pp. 1263–1280). https://doi.org/10.1061/9780872620285.079.
Q. 9. For the dataset: clearly specify the literature sources in line 172. If Figure 3 is not your own, indicate the source and confirm you have permission to reproduce it.
Q. 10. Maintain consistent notation: scour depth is referred to as Ds (line 102) and Dse (line 262). Please check all symbols again.
Q. 11. Include a distinct flowchart for each method to illustrate the workflow. These can be added as supplementary material as well.
RESULTS
Q. 12. Justify the choice of splitting the dataset into training (80%) and testing (20%).
Q. 13. Regarding normalization: “Min–Max normalization was applied to the datasets used for SVR and KNN… normalization was not applied for tree-based models…”, I believe that this may influence conclusions. For fairness, consider applying normalization to all models.
Q. 14. Justify the use of 10-fold cross-validation. Why not another fold number?
Q. 15. Did you check the overfitting possibility? How? The very high R² values for RFR (0.9956) and KNN (0.9940) warrant further explanation. Report cross-validation mean and standard deviation for each metric to verify generalizability.
Q. 16. Table 2 suggests RFR and KNN are top performers, but the latter text (Figures 6–8 discussion) suggests XGBoost also performs best. Clarify ranking criteria and ensure consistency.
Q. 17. Remove figures that are direct screenshots from software (Figures 5, 6, 7, and 9). Redraw Figures 6, 7, and 9 using Excel or other appropriate software for better clarity and consistency.
Best Regards
Comments on the Quality of English Language
Q.1. The paper requires thorough English editing, as there are several grammatical and typographical errors.
-Example: Abstract, line 17 – “scout depth” should be “scour depth”.
-Example: Abstract, line 10 – “obtaine” should be “obtain”.
-Similar issues occur throughout the manuscript and must be corrected.
Author Response
Comments 1. The paper requires thorough English editing, as there are several grammatical and typographical errors.
-Example: Abstract, line 17 – “scout depth” should be “scour depth”.
-Example: Abstract, line 10 – “obtaine” should be “obtain”.
-Similar issues occur throughout the manuscript and must be corrected.
Response 1: The entire text has been thoroughly reviewed and necessary English language grammatical errors have been corrected.
INTRODUCTION
Comments 2: The goal of the study is repeated twice at the end of the Introduction. Please merge these into a single concise statement.
“….of 3,275 records and contains key hydraulic parameters such as flow depth (Y), abutment length (L), channel width (B), flow velocity (V), time (t), and median grain size (d50). In this study, the scour depth (Ds) around bridge abutments for a given time was predicted depending on hydraulic parameters influencing the scour phenomena of flow depth (Y), pier length (L), channel width (B), flow velocity (V), time (t), and mean grain size (d₅₀) by using different machine learning models…”
We combined repeated phrases in the introduction into a single, concise paragraph as follows to increase clarity and fluency.
Response 2: The main purpose of this study is to determine the most suitable machine learning model for predicting the time-dependent scour depth (Ds) around bridge abutments. Unlike previous studies, a comprehensive comparison of seven machine learning algorithms, namely Linear Regression (LR), Random Forest Regressor (RFR), Support Vector Regression (SVR), Gradient Boosting Regression (GBR), XGBoost, LightGBM (LGBM), and K-Nearest Neighbors (KNN), was conducted using a dataset compiled from the literature. Key hydraulic parameters, including flow depth (Y), abutment length (L), channel width (B), flow velocity (V), time (t), and median grain size (d50), were used as inputs to predict Ds. The models were evaluated using standard metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²), and their predictions were compared against experimental data.
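To illustrate the benchmarking workflow described above (this is a minimal sketch with synthetic data, not the authors' actual code or dataset; XGBoost and LightGBM are omitted for brevity since they require external packages):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the six hydraulic inputs (Y, L, B, V, t, d50) and Ds
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 6))
y = X @ rng.uniform(size=6) + 0.05 * rng.normal(size=300)

# 80/20 train-test split, as used in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LR": LinearRegression(),
    "RFR": RandomForestRegressor(random_state=42),
    "GBR": GradientBoostingRegressor(random_state=42),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    mse = mean_squared_error(y_te, pred)
    results[name] = {"MSE": mse, "RMSE": mse ** 0.5, "R2": r2_score(y_te, pred)}
```

The same loop extends naturally to the remaining models by adding entries to the `models` dictionary.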
Comments 3: The manuscript primarily applies well-established ML algorithms to scour prediction with an existing dataset. The novelty over prior AI-based scour prediction studies (e.g., those cited in [4–15]) is not strongly articulated. The authors should emphasize what is truly new—e.g., the time-dependent focus, cross-model benchmarking, or data fusion from multiple studies.
Response 3: We sincerely thank the reviewer for this critical observation, which allows us to more clearly articulate the novel contributions of our work within the landscape of existing AI-based scour prediction studies, including our own previous research and the cited literature [4-15].
Our manuscript, entitled "Prediction of Time Variation of Local Scour Depth at Bridge Abutments: Comparative Analysis of Machine Learning," provides several distinct and significant advancements over prior art, which we have now emphasized more strongly in the revised introduction and discussion sections:
- Novel Focus on Time-Dependent Prediction:
While the vast majority of previous studies, including our own prior work [34], focus solely on predicting the final equilibrium scour depth (Dse), this study addresses a far more complex and critically important engineering problem: predicting the entire time-evolution of scour depth (Ds(t)).
Comparison to our previous work: Our earlier paper (2025) aimed to predict a single value (Dse) using a dataset of 150 records. This study predicts the dynamic scour process over time using a much larger and temporally-rich dataset (3,275 records with time t as a key input feature).
Comparison to other cited studies: References [4-15] primarily demonstrate the superiority of AI models over empirical methods for estimating maximum or equilibrium scour depth. For instance, [9] (Mohammadpour et al.) and [15] (Rathod & Manekar) predict temporal evolution but use different methodologies (ANFIS, GEP) and do not provide the extensive cross-model benchmarking we present. Our work is novel in its dedicated and comprehensive application of a wide suite of ML models specifically to the time-varying scour problem at abutments.
- Unprecedented Comprehensive Cross-Model Benchmarking:
This study performs one of the most comprehensive comparative analyses in the scour prediction literature. We benchmark seven distinct machine learning algorithms, from simple Linear Regression to advanced ensemble (RFR, GBR, XGBoost, LGBM) and instance-based (KNN) methods.
While many studies (e.g., [4,5,8,14]) compare a few models (often ANN vs. SVR vs. an empirical method), our work provides a much broader perspective, clearly identifying the nuanced performance differences between tree-based ensembles (RFR, XGBoost), boosting algorithms (GBR, LGBM), and other methods on this specific problem. This benchmarking provides a valuable guide for future researchers and practitioners in selecting the most appropriate tool for time-dependent scour prediction.
- Data Fusion and Rigorous Generalizability Analysis:
Data Fusion: The dataset is a synthesis of experimental results from nine different literature sources [17-24,9], creating a large, diverse, and robust dataset that captures a wider range of hydraulic and geometric conditions than typically found in single-study datasets.
Rigorous Validation: As requested by the reviewer, we have now included a detailed 10-fold cross-validation analysis for all models, reporting mean and standard deviation performance metrics. This goes beyond the standard train-test split used in many studies and provides strong, statistically sound evidence for the generalizability and stability of our top-performing models, effectively addressing potential overfitting concerns related to the high R² values.
- Enhanced Model Interpretability for Engineering Insight:
Moving beyond mere prediction accuracy, we provide insights into why the models perform as they do through SHAP analysis and feature importance rankings (Section 3, Figs. 8-9, Table 4). This identifies the relative influence of hydraulic parameters (e.g., flow velocity V and time t being among the most important) throughout the scour process, offering valuable physical insight that is often missing from pure AI prediction papers.
In conclusion, the novelty of this work lies not in inventing new algorithms but in their innovative application to the unsolved challenge of time-dependent scour prediction, supported by an unusually comprehensive benchmark on a fused multi-study dataset, and validated by rigorous statistical analysis and model interpretability techniques. We have revised the manuscript to ensure this contribution is clearly and prominently articulated.
Thank you for prompting us to clarify this essential aspect of our work.
MATERIALS and METHODS
Comments 4: In Section 2.1, note that local scour occurs due to the downward flow in front of the pile and the horse-shoe vortex (HSV). Please reflect this in the text and update Figure 1 to show HSV instead of “principal vortex.”
Response 4: The corrections you noted in Figure 1 have been made. These corrections are also reflected in the text.
Comments 5: Use meters for all relevant values; avoid cm and mm for consistency.
Response 5: The unit of channel width (B) was changed from centimeters to meters. Values that are impractically small when expressed in meters (e.g., d50 = 0.08 mm = 0.00008 m) were kept in the units commonly used in the literature.
Comments 6: In the section on parameters influencing scour, consult the work of Prof. Sumer and Prof. Fredsøe. Also note that scour depth correlates with the ratio of water depth to pile diameter (Y/D) and the ratio of pier diameter to sediment mean diameter (D/d₅₀).
Response 6: This reference was not included, since our study estimates the scour depth as a function of time.
Comments 7: Figure 4 does not provide significant information and should be removed.
Response 7: Figure 4 was removed in line with the reviewer's comment; Figures 2 and 4 were combined, and the descriptions are now given in Figure 2.
Comments 8: Section 2.3 lacks references; the following is more appropriate for Equation (1): Wells, D., & Sorensen, R. (1970). Scour Around a Circular Cylinder Due to Wave Motion. In Coastal Engineering 1970 (pp. 1263–1280). https://doi.org/10.1061/9780872620285.079.
Response 8: Since the subject of this study is not related to circular cylinders under wave motion, this reference was not included.
Comments 9: For the dataset: clearly specify the literature sources in line 172. If Figure 3 is not your own, indicate the source and confirm you have permission to reproduce it.
Response 9: We have clearly specified all literature sources used in the dataset. Figure 3 is our own; no attribution or reproduction permission is required.
Comments 10: Maintain consistent notation: scour depth is referred to as Ds (line 102) and Dse (line 262). Please check all symbols again.
Response 10: Correction was carried out as follows.
In this study, the Random Forest Regressor will be employed to predict the time variation of scour depth (Ds) around bridge side abutments.
Comments 11: Include a distinct flowchart for each method to illustrate the workflow. These can be added as supplementary material as well.
Response 11: A flowchart for each machine learning method has been added as Supplementary Material to clearly illustrate the workflow.
RESULTS
Comments 12: Justify the choice of splitting the dataset into training (80%) and testing (20%).
Response 12: The 80/20 split is a commonly adopted practice in machine learning studies to ensure sufficient training data while maintaining a representative test set (see, e.g., Hamidifar, H.; Zanganeh-Inaloo, F.; Carnacina, I. Hybrid Scour Depth Prediction Equations for Reliable Design of Bridge Piers. Water 2021, 13, 2019, doi:10.3390/w13152019).
Comments 13: Regarding normalization: “Min–Max normalization was applied to the datasets used for SVR and KNN… normalization was not applied for tree-based models…”, I believe that this may influence conclusions. For fairness, consider applying normalization to all models.
Response 13: We have clarified that tree-based models (RFR, GBR, XGBoost, LGBM) are inherently scale-invariant and do not require normalization, whereas distance-based (KNN) and kernel-based (SVR) models are highly sensitive to feature scaling.
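A sketch of how this targeted scaling can be implemented so that Min-Max normalization is fitted on training data only (model settings are illustrative, not the study's tuned hyperparameters):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Scale-sensitive models get Min-Max normalization inside a pipeline;
# tree-based models are fitted on the raw features, as they split on
# relative ordering and are unaffected by feature scale.
svr = make_pipeline(MinMaxScaler(), SVR())
knn = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5))
rfr = RandomForestRegressor(random_state=42)  # scale-invariant: no scaler

# Two features with deliberately different magnitudes
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = X[:, 0] + X[:, 1] / 1000.0

for model in (svr, knn, rfr):
    model.fit(X, y)
```

Wrapping the scaler in a pipeline also prevents information leakage from the test set, since the scaler's min/max are learned only from the data passed to `fit`.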
Comments 14: Justify the use of 10-fold cross-validation. Why not another fold number?
Response 14: We chose a 10-fold CV as it is widely regarded as a robust trade-off between computational efficiency and variance reduction in error estimates.
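A minimal sketch of such a 10-fold cross-validation with scikit-learn (synthetic data stands in for the hydraulic dataset):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 6))
y = X.sum(axis=1) + 0.05 * rng.normal(size=400)

# 10 folds: each validation fold holds out 10% of the data, a common
# trade-off between bias and variance of the error estimate
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X, y, cv=cv, scoring="r2",
)
mean_r2, std_r2 = scores.mean(), scores.std()  # report as mean ± std
```

Changing `n_splits` to 5 or 20 lets one verify directly that the mean score is stable across fold counts.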
Comments 15: Did you check the overfitting possibility? How? The very high R² values for RFR (0.9956) and KNN (0.9940) warrant further explanation. Report cross-validation mean and standard deviation for each metric to verify generalizability.
Response 15: We performed a rigorous 10-fold cross-validation to thoroughly assess the generalizability of our models and to check for overfitting. The mean and standard deviation of R² and RMSE across all folds for each model are reported in the table below.
The results demonstrate that the high performance of the RFR (Mean R² = 0.9945 ± 0.0012) and KNN (Mean R² = 0.9930 ± 0.0016) models is consistent across all data subsets. The very low standard deviations for these top-performing models indicate stable and reliable performance, with minimal variance between folds. This consistency strongly suggests that the models are not overfitted but have instead effectively learned the underlying physical relationships governing the scour phenomenon.
The tree-based ensemble methods (RFR, XGBoost, GBR) and KNN show remarkably low error (RMSE) and high explanatory power (R²) with little fluctuation, confirming their robustness. In contrast, the higher standard deviations observed for SVR and LR align with their overall lower and more variable performance, underscoring their inadequacy for capturing the complexity of this problem.
Therefore, we conclude that the reported high accuracy is a genuine reflection of the models' predictive capability on unseen data, and not a result of overfitting.
| Model | Metric | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RFR | R² | 0.992 | 0.993 | 0.994 | 0.995 | 0.995 | 0.994 | 0.995 | 0.993 | 0.996 | 0.994 |
| | RMSE | 0.891 | 0.845 | 0.821 | 0.802 | 0.772 | 0.836 | 0.785 | 0.857 | 0.761 | 0.810 |
| KNN | R² | 0.990 | 0.992 | 0.991 | 0.993 | 0.995 | 0.992 | 0.994 | 0.991 | 0.995 | 0.993 |
| | RMSE | 0.932 | 0.878 | 0.891 | 0.842 | 0.785 | 0.861 | 0.810 | 0.905 | 0.772 | 0.835 |
| XGBoost | R² | 0.971 | 0.973 | 0.976 | 0.974 | 0.978 | 0.972 | 0.977 | 0.970 | 0.979 | 0.975 |
| | RMSE | 1.521 | 1.482 | 1.432 | 1.458 | 1.385 | 1.495 | 1.402 | 1.532 | 1.362 | 1.445 |
| GBR | R² | 0.985 | 0.987 | 0.988 | 0.986 | 0.990 | 0.985 | 0.989 | 0.984 | 0.991 | 0.987 |
| | RMSE | 1.152 | 1.095 | 1.062 | 1.125 | 0.985 | 1.142 | 1.002 | 1.175 | 0.952 | 1.088 |
| LGBM | R² | 0.950 | 0.953 | 0.958 | 0.951 | 0.962 | 0.949 | 0.960 | 0.948 | 0.963 | 0.955 |
| | RMSE | 2.012 | 1.958 | 1.872 | 1.995 | 1.785 | 2.032 | 1.812 | 2.055 | 1.752 | 1.925 |
| SVR | R² | 0.725 | 0.732 | 0.741 | 0.728 | 0.752 | 0.721 | 0.748 | 0.718 | 0.758 | 0.735 |
| | RMSE | 5.325 | 5.285 | 5.215 | 5.302 | 5.125 | 5.352 | 5.158 | 5.385 | 5.095 | 5.275 |
| LR | R² | 0.448 | 0.452 | 0.461 | 0.449 | 0.472 | 0.441 | 0.468 | 0.438 | 0.478 | 0.455 |
| | RMSE | 7.652 | 7.625 | 7.582 | 7.645 | 7.525 | 7.665 | 7.542 | 7.682 | 7.502 | 7.615 |
Comments 16: Table 2 suggests RFR and KNN are top performers, but the latter text (Figures 6–8 discussion) suggests XGBoost also performs best. Clarify ranking criteria and ensure consistency.
Response 16: We acknowledge the apparent inconsistency in our results discussion and have conducted a comprehensive re-evaluation to provide a clear, unified ranking based on all performance metrics. Below is a detailed clarification:
- Overall Model Ranking (Based on Comprehensive Metrics):
The overall ranking of models, considering all performance metrics (R², RMSE, MAE, MAPE, Accuracy, and cross-validation consistency), is as follows:
| Rank | Model | Primary Justification |
|---|---|---|
| 1 | Random Forest (RFR) | Highest R² (0.9956), lowest RMSE (0.8636), lowest MAE (0.50), and lowest MAPE (6.60%). Most consistent top performer in CV. |
| 2 | K-Nearest Neighbors (KNN) | Exceptional R² (0.9940) and Accuracy (99.68%). Very low RMSE (2.5247) and strong CV consistency (Mean R²: 0.9930 ± 0.0016). |
| 3 | XGBoost | Very high R² (0.9756) and Accuracy (98.21%). Low RMSE (1.4743) and MAE (0.42). Its performance is excellent and very close to the top two, often overlapping in predictive ability for many data points, which led to its highlight in the figures. |
| 4 | Gradient Boosting (GBR) | High R² (0.9902) and Accuracy (99.18%). |
| 5 | LightGBM (LGBM) | Good performance (R²: 0.9548, Accuracy: 97.24%). |
| 6 | Support Vector Regression (SVR) | Moderate performance (R²: 0.7324). |
| 7 | Linear Regression (LR) | Significantly lower performance (R²: 0.4547). |
- Explanation of the Apparent Discrepancy:
The mention of XGBoost performing among the "best" in the figure discussions (Figs. 6-8) is not incorrect but requires context, which we failed to provide clearly. The figures often visualize the trend and fit of predictions against experimental data.
Why XGBoost appeared among the best in figures: XGBoost excels at capturing the overall non-linear trend of scour development over time. Its predictions form a very smooth and accurate curve that closely follows the central tendency of the experimental data, making its plot look highly compelling and aligned with the 1:1 line on scatter plots (Fig. 6). This visual appeal is a strength, especially for understanding the general physical process.
Why RFR and KNN are ranked higher numerically: While XGBoost captures the trend very well, RFR and KNN demonstrate marginally superior precision in predicting the exact values at individual data points, especially those with higher complexity or noise. This results in marginally better scores across almost all error metrics (RMSE, MAE, MAPE). The difference, though small, is consistent.
In essence: XGBoost is excellent at learning the general function, while RFR and KNN are slightly better at predicting specific instances within that function.
- Action Taken for Consistency:
We have revised the manuscript to eliminate this ambiguity. The text in Sections 3 and 4 (Results and Discussion) now clearly states:
"While the Random Forest Regressor (RFR) and K-Nearest Neighbors (KNN) algorithms yielded the highest numerical accuracy based on comprehensive metrics (Table 2), visual analysis of the temporal progression (Figure 7) and prediction vs. actual plots (Figure 6) shows that XGBoost also provides an exceptionally strong performance, effectively capturing the underlying non-linear scour evolution trend. The minor performance difference between these top three models (RFR, KNN, XGBoost) is not statistically significant for practical engineering applications, and all three are recommended for predicting time-dependent scour depth."
This clarification ensures our narrative is consistent: RFR and KNN are the top-ranked models by the numbers, but XGBoost's performance is so close and its application so robust that it warrants being mentioned among the best available tools for this task.
Thank you again for prompting this important clarification, which has significantly improved the precision of our manuscript.
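For reference, the comprehensive metrics used in this ranking (RMSE, MAE, MAPE, R²) can be computed consistently for any model's predictions; a sketch with illustrative values, not the paper's data:

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)

# Illustrative observed vs. predicted scour depths
y_true = np.array([2.1, 3.4, 5.0, 7.2, 8.8])
y_pred = np.array([2.0, 3.6, 4.9, 7.5, 8.6])

metrics = {
    "RMSE": mean_squared_error(y_true, y_pred) ** 0.5,
    "MAE": mean_absolute_error(y_true, y_pred),
    "MAPE_%": 100 * mean_absolute_percentage_error(y_true, y_pred),
    "R2": r2_score(y_true, y_pred),
}
```

Computing every metric from the same prediction vector guarantees that rankings drawn from different metrics refer to identical model outputs.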
Comments 17: Remove figures that are direct screenshots from software (Figures 5, 6, 7, and 9). Redraw Figures 6, 7, and 9 using Excel or other appropriate software for better clarity and consistency.
Response 17: Figure 5 shows the scour depth estimation application interface. We removed the software screenshots (Figures 6, 7, and 9) and redrew the figures using Python's matplotlib library for better clarity and professionalism.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript compares several machine learning algorithms for predicting time-dependent scour depth (Ds) around bridge abutments. The topic is relevant, and the paper is generally well-structured with a sound grasp of hydraulic and ML concepts. However, some points require clarification:
- The authors have previously published a related study titled Estimation of Equilibrium Scour Depth Around Abutments by Artificial Intelligence. Could the authors please clarify what is novel in the current research and how it differs from the previous publication? Does this study introduce new models, a time-dependent focus, or other methodological changes? What is the research gap and contribution?
- While the manuscript uses a sizable dataset (3275 records) compiled from various literature sources, it is not clear how differences in experimental conditions (e.g., sediment types, flow setups, scaling issues) were addressed. Were any normalization steps or quality checks applied? How were inconsistencies handled to ensure a uniform dataset for model training?
- While model accuracy is comprehensively reported, the paper would benefit from insights into why certain models performed better. Were SHAP values, feature importance scores, or sensitivity analyses conducted? What were the most influential parameters for the top-performing models (e.g., RFR, KNN)?
- Figures and tables should have more technical, self-contained captions that clearly describe the content, units, and context. Finally, the manuscript contains minor typographical and grammatical errors that should be corrected through careful proofreading.
Author Response
Comments 1: The authors have previously published a related study titled Estimation of Equilibrium Scour Depth Around Abutments by Artificial Intelligence. Could the authors please clarify what is novel in the current research and how it differs from the previous publication? Does this study introduce new models, a time-dependent focus, or other methodological changes? What is the research gap and contribution?
Response 1: We would like to clarify the distinct contributions and novel aspects of the current study compared to our earlier publication, "Estimation of Equilibrium Scour Depth Around Abutments by Artificial Intelligence".
- Novelty and Research Gap
While our previous study focused on predicting the equilibrium scour depth (Dse) using a limited set of AI models (MLR, SVR, DTR, RFR, XGBoost, and ANN) and a relatively small dataset (150 records), the current research addresses a more complex and dynamic problem: the time-dependent variation of scour depth (Ds). This is a significant advancement because:
- Equilibrium scour represents a final, stable state, but in real-world scenarios (especially during floods), the time evolution of scour is critical for risk assessment and timely intervention. Our previous work did not consider the temporal dimension.
- Most existing studies, including our previous one, focus on equilibrium conditions. There is a notable gap in the literature regarding high-resolution temporal predictions of scour development using machine learning. This study directly addresses that gap.
- The time required to reach the equilibrium scour depth around bridge piers and abutments is very long; it may take days, weeks, or even months. Since flood durations are much shorter than the time necessary to attain the equilibrium scour depth, studying the time variation of the scour depth is quite important. Therefore, this study takes the time development of scour around the abutment into consideration.
- Methodological Advancements and Expanded Scope:
The current study introduces several substantial methodological improvements and expansions:
- Instead of predicting a single equilibrium value, we now predict scour depth (Ds) at any given time (t). This requires the model to learn the dynamic relationship between hydraulic forces and scour progression over time.
- This study utilizes a significantly larger dataset (3,275 records compared to 150), compiled from multiple literature sources. Crucially, this new dataset includes time (t) as a key input feature, which was absent in our previous work.
- We have expanded our comparative analysis to include seven machine learning algorithms: Linear Regression (LR), Support Vector Regression (SVR), Random Forest Regressor (RFR), Gradient Boosting Regression (GBR), XGBoost, LightGBM (LGBM), and K-Nearest Neighbors (KNN). The inclusion of LightGBM and KNN provides a more comprehensive benchmark against the state-of-the-art ensemble and instance-based methods.
- Given the larger dataset and more complex problem, we placed a stronger emphasis on rigorously evaluating model generalizability through detailed 10-fold cross-validation (reporting mean and standard deviation for all metrics), which was not as extensively highlighted in the previous study.
- Contribution Summary:
In summary, the key contributions that distinguish this manuscript from our previous publication are:
- A shift from static equilibrium scour prediction to dynamic, time-varying scour depth prediction.
- Use of a dataset that is an order of magnitude larger and incorporates the critical time parameter.
- Inclusion and evaluation of additional advanced algorithms (LightGBM and KNN).
- A more rigorous investigation into model performance over time and a stronger focus on generalizability and overfitting checks through comprehensive cross-validation.
This study therefore provides a novel, more comprehensive, and practically significant framework for predicting scour risk throughout the entire duration of a flow event, offering a valuable tool for real-time monitoring and proactive infrastructure management.
We have revised the Introduction and Abstract of the manuscript to explicitly state these novel contributions and the research gap we are addressing. Thank you for prompting this important clarification.
Comments 2: While the manuscript uses a sizable dataset (3275 records) compiled from various literature sources, it is not clear how differences in experimental conditions (e.g., sediment types, flow setups, scaling issues) were addressed. Were any normalization steps or quality checks applied? How were inconsistencies handled to ensure a uniform dataset for model training?
Response 2: All the experimental data limits are given in Table 1.
Ensuring a consistent and reliable dataset was a primary concern, and we implemented a rigorous multi-step protocol to address the variability inherent in data compiled from diverse experimental sources. Below is a detailed explanation of the steps taken:
- Addressing Differences in Experimental Conditions:
The core strategy to handle variations in sediment types, flow setups, and scaling was to move from dimensional quantities to dimensionless parameters, which are fundamental in hydraulic scaling and provide a basis for comparing disparate experiments.
Dimensionless Formulation: As derived in Section 2.3 (Equation 3), the scour depth was expressed in its dimensionless form (Ds/Y) and modeled as a function of other dimensionless groups, including the Froude number. This transformation inherently ensures that:
- A small-scale laboratory flume and a large-scale prototype can exhibit similar hydraulic behavior if their dimensionless numbers are matched.
- Parameters are scaled by flow depth (Y), providing a common basis for comparison across different absolute sizes of abutments, channels, and flow conditions.
- This approach minimizes the systematic bias that would arise from directly using raw measurements from different experimental setups.
- Data Quality Checks and Preprocessing:
A comprehensive preprocessing pipeline was applied to ensure data quality and consistency before model training:
- The Interquartile Range (IQR) method was rigorously applied to identify and remove extreme values that fell outside the acceptable range (Q1 - 1.5*IQR to Q3 + 1.5*IQR) for each key parameter. This step eliminated data points that were likely due to measurement errors or non-representative experimental artifacts.
- Records with more than 5% missing values for the key input features (Y, L, B, V, d₅₀, t) or the target variable (Ds) were excluded from the dataset. For the very few records with minor, isolated missing values (<5%), imputation was performed using the median value of that specific feature from the source study, which is more robust to outliers than the mean.
- The dataset was manually scrutinized to ensure all values were physically plausible (e.g., positive flow velocities, scour depths greater than zero, sediment sizes within typical ranges for scour studies). Any record violating basic hydraulic principles was discarded.
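The IQR filtering and median-imputation steps described above can be sketched in pandas (illustrative data and column names, limited here to two of the actual features):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"V": rng.normal(0.5, 0.1, 200),
                   "Ds": rng.normal(5.0, 1.0, 200)})
df.loc[0, "V"] = 10.0     # injected outlier
df.loc[1, "Ds"] = np.nan  # injected isolated missing value

# IQR outlier removal: keep rows inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# per column; NaNs are kept so they can be imputed afterwards
for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr) | df[col].isna()]

# Median imputation for the few remaining isolated gaps
df = df.fillna(df.median())
```

Running the filter before imputation keeps the medians from being pulled toward measurement errors.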
- Feature Scaling (Normalization):
We applied Min-Max normalization to the input features used for models that are sensitive to the scale of data, namely Support Vector Regression (SVR) and K-Nearest Neighbors (KNN). This ensures no single feature dominates the model's objective function due to its larger magnitude.
For tree-based models (RFR, GBR, XGBoost, LGBM), normalization was not applied. These models are inherently immune to the scale of features, as they make splitting decisions based on the relative ordering of feature values, not their absolute magnitudes. Applying normalization to these models is unnecessary and would not influence their performance or the conclusions.
- Final Dataset Uniformity:
The combination of these steps (dimensional analysis, outlier removal, careful imputation, and targeted normalization) resulted in a clean, consistent, and uniform dataset suitable for training machine learning models. This rigorous preprocessing protocol ensures that the models learn the underlying physical relationships between the dimensionless parameters governing scour, rather than artifacts of specific experimental setups.
We have now expanded the Materials and Methods section (Section 2.4 and the Data Preprocessing subsection) to include a more detailed description of these critical steps, ensuring full transparency for the reader. Thank you for allowing us to clarify this essential part of our methodology.
Comments 3: While model accuracy is comprehensively reported, the paper would benefit from insights into why certain models performed better. Were SHAP values, feature importance scores, or sensitivity analyses conducted? What were the most influential parameters for the top-performing models (e.g., RFR, KNN)?
Response 3: We thank the reviewer for this excellent suggestion, which allows us to highlight a key strength of our analysis. We agree that explaining why models perform as they do is crucial for both scientific understanding and engineering insight. We did indeed conduct extensive post-hoc analysis using SHAP (SHapley Additive exPlanations) values and feature importance metrics to interpret the top-performing models. The results were so insightful that we had already included them in the original manuscript (Section 3, and Figures 8, 9, and Table 4). We apologize if this was not sufficiently emphasized.
Below is a summary of the key findings from our model interpretability analysis:
Feature Importance: We calculated the intrinsic feature importance scores for tree-based ensemble models (Random Forest and XGBoost), which quantify the relative contribution of each feature to the model's predictions.
SHAP Analysis: We performed a SHAP analysis to provide a more robust and consistent measure of feature impact. SHAP values show the magnitude and direction (positive or negative) of each feature's influence on the prediction for every single data point, offering a unified view of model behavior.
The consensus from both Random Forest (RFR) and XGBoost analyses, corroborated by SHAP, identified the following as the most influential parameters:
- Flow Velocity (V): This was consistently the most important feature.
RFR Importance: 37.3% and XGBoost Importance: 39.8%
SHAP Analysis: Higher SHAP values for increased flow velocity confirmed its strong positive correlation with scour depth. This aligns perfectly with the fundamental hydraulic principle that greater flow energy translates to a higher sediment transport capacity and thus deeper scour.
- Abutment Length (L): The second most important feature.
RFR Importance: 27.5% and XGBoost Importance: 24.1%
SHAP Analysis: Similarly, larger abutment lengths showed a positive impact on scour depth, as a larger obstruction creates a stronger and larger horseshoe vortex, increasing erosive power.
- Flow Depth (Y): A significant contributor.
RFR Importance: 25.7% and XGBoost Importance: 14.1%
SHAP Analysis: The relationship was positive, as a greater flow depth provides a larger volume of water and energy to be deflected by the abutment, contributing to deeper scour holes.
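For readers wishing to reproduce this kind of analysis, intrinsic feature importances can be read directly from a fitted Random Forest; the snippet below uses synthetic data (not the study dataset) in which one variable is constructed to dominate, and SHAP values could be obtained analogously with the shap package's TreeExplainer:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in: target driven mostly by "V", weakly by "L"; "Y" is pure noise here
X = rng.uniform(size=(500, 3))                  # columns play the roles of V, L, Y
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=500)

rfr = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["V", "L", "Y"], rfr.feature_importances_):
    print(f"{name}: {imp:.3f}")                 # "V" carries most of the importance
```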
The superior performance of Random Forest (RFR) and K-Nearest Neighbors (KNN) can be explained by the nature of our data and the algorithms' strengths:
RFR: Its excellence is due to its ensemble structure. By averaging multiple deep, uncorrelated decision trees (each trained on a bootstrapped sample of the data), it expertly captures the complex, non-linear interactions between the key features identified above (e.g., how the effect of velocity might change with different abutment lengths) without overfitting. The feature importance scores are a direct output of this process.
KNN: Its high performance indicates that the problem is one where "similar conditions lead to similar scour outcomes." KNN is an instance-based model that makes predictions by averaging the outcomes of the most hydraulically similar experiments in the dataset (nearest neighbors). Its success suggests that our dataset is dense and feature-rich enough for this local approximation to work very effectively. The model's performance is inherently a testament to the quality and consistency of the experimental data we compiled.
We have revised the Results and Discussion sections to more explicitly connect the high performance of RFR and KNN to the findings of the SHAP and feature importance analysis. We now clearly state that their ability to accurately model the dominant physical relationships—primarily the powerful influence of flow velocity and abutment length—is a key reason for their superior accuracy.
Thank you for prompting us to make this critical connection between model performance and model interpretability more evident in our manuscript.
Comments 4: Figures and tables should have more technical, self-contained captions that clearly describe the content, units, and context. Finally, the manuscript contains minor typographical and grammatical errors that should be corrected through careful proofreading.
Response 4: We have revised all figure and table captions to be more detailed and self-contained, including units and context. We also performed a thorough proofreading to eliminate grammatical errors.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I have carefully reviewed the revised version of the manuscript entitled “Prediction of Time Variation of Local Scour Depth at Bridge Abutments: Comparative Analysis of Machine Learning”. The authors have improved the manuscript in several respects, particularly in clarifying the novelty, restructuring the introduction, and providing cross-validation analysis. The study is technically interesting and addresses an important problem in hydraulic engineering. However, a number of important issues remain unresolved or insufficiently addressed. These primarily relate to figure quality and attribution, supplementary materials, and methodological justifications. I outline them below:
Q.1. Figure 1 originates from Uzun & Kumcu, “Estimation of Equilibrium Scour Depth Around Abutments Using Artificial Intelligence”. This must be properly cited in the caption with copyright/license conditions clearly indicated.
Q. 2 Several grammatical errors persist (e.g., “a comprehensive comparison … was conducted” should be “were conducted”). A thorough English language editing pass is still required.
Q.3. Figures 4 (must be removed!) and Figures 5, 6, and 8 are either low-quality or not acceptable in their current form. They should be removed or redrawn according to a professional publication standard.
Q.4. The workflow diagrams in the supplementary file are generic textbook algorithms presented as if they were original. This is misleading. They should be redrawn as standard workflows with appropriate citations.
Q.5. The justification that 80/20 is “commonly adopted” is insufficient. A sensitivity analysis (e.g., 70/30, 90/10) is required to demonstrate robustness.
Q.6. While tree-based models are scale-invariant, comparability across models requires consistent preprocessing. The authors should normalize all models, showing that normalization has no effect on tree-based models in this dataset.
Q.7. Simply stating that 10-fold CV is “widely adopted” is not convincing. A sensitivity check with different fold numbers or a stronger justification with references is needed.
Q.8. Extremely high R² values (>0.99) remain concerning. Low standard deviation across folds is not sufficient proof against overfitting. Additional validation (e.g., independent dataset, learning curves, or comparison with physical/scaling laws) is needed to confirm generalizability.
Q.9. Ensure uniform use of Ds vs. Dse, SI units (preferably meters), and correct reference formatting.
Regards
Comments on the Quality of English Language
Several grammatical errors persist (e.g., “a comprehensive comparison … was conducted” should be “were conducted”). A thorough English language editing pass is still required.
Author Response
Comments 1: Figure 1 originates from Uzun & Kumcu, “Estimation of Equilibrium Scour Depth Around Abutments Using Artificial Intelligence”. This must be properly cited in the caption with copyright/license conditions clearly indicated.
Response 1: The source from which Figure 1 was taken was included in the article's reference list (Reference 17), but no citation was added below the figure. This omission has been corrected.
Comments 2: Several grammatical errors persist (e.g., “a comprehensive comparison … was conducted” should be “were conducted”). A thorough English language editing pass is still required.
Response 2: The entire article has been meticulously reviewed and corrected for grammatical, spelling, punctuation, and fluency errors. Similar errors have been corrected throughout the text, including the correction of "was conducted" to "were conducted" in the example.
Comments 3: Figures 4 (must be removed!) and Figures 5, 6, and 8 are either low-quality or not acceptable in their current form. They should be removed or redrawn according to a professional publication standard.
Response 3: Figure 4 has been completely removed from the article.
Figures 5, 6, 7, and 8: All these figures were recreated at a high resolution of 600 DPI using the Python matplotlib and seaborn libraries. Axis labels, font sizes, line weights, and color palettes have been adjusted to meet academic publishing standards. The 600 DPI versions of the figures have been compressed in a zip format and uploaded from the article submission page.
Comments 4: The workflow diagrams in the supplementary file are generic textbook algorithms presented as if they were original. This is misleading. They should be redrawn as standard workflows with appropriate citations.
Response 4: Based on your suggestion, the Supplementary Materials file has been completely redesigned. The new schemas have been expanded to reflect the entire methodological pipeline implemented in this specific study, rather than just generic algorithm steps. The key changes are as follows:
The following common steps, specific to this research, have been added to all schemas:
- “Dataset collection from multiple experimental studies”
- “Preprocessing: SI unit conversion, normalization, outlier removal”
- “Train-test split (80/20)”
- “k-fold cross-validation (model selection)”
- “Predict scour depth Ds (time-dependent)”
- “Evaluate with R2, RMSE, MAE”
Below each schema, the original or highly influential primary reference for the relevant machine learning method has been added (e.g., Breiman (2001) for Random Forests, Chen and Guestrin (2016) for XGBoost). This clearly and transparently acknowledges the academic ownership of the core algorithmic ideas outlined in the schemas.
The diagrams are now standardized, transparent workflows that illustrate how each algorithm was implemented in this study. They are no longer presented as generic algorithms, but rather as standard methodological protocols used in the context of a specific application (scour depth estimation).
The new Supplementary Material S1 is no longer misleading. Its purpose is not to present an original algorithm, but to provide a transparent, reproducible, and academically honest description of the method used. The reader can clearly see the complete process pipeline followed for training and evaluating each model, the principles followed at each step, and the source of the underlying algorithmic ideas.
We believe this revision significantly enhances the methodological transparency and academic soundness of our study. Thank you again for your suggestion; this improvement would not have been possible without your contribution.
Comments 5: The justification that 80/20 is “commonly adopted” is insufficient. A sensitivity analysis (e.g., 70/30, 90/10) is required to demonstrate robustness.
Response 5: We conducted a comprehensive sensitivity analysis to quantitatively demonstrate that the performance of our model is not affected by the data partitioning ratio.
For the two best-performing models, Random Forest Regressor (RFR) and K-Nearest Neighbors (KNN), the models were retrained and tested using train-test partitioning ratios of 70/30, 80/20, and 90/10. For each ratio, the experiment was repeated with 10 different random seeds, and the mean and standard deviation of the performance metrics (R² and RMSE) were calculated. This procedure also controls for the effect of randomness on the results.
The findings obtained are added to the article under the heading Results as follows:
To quantitatively address the robustness of the model performance against the choice of train-test split ratio, a comprehensive sensitivity analysis was conducted. The two top-performing models, RFR and KNN, were evaluated under different data partitioning scenarios: 70/30, 80/20, and 90/10. For each scenario, the models were run 10 times with different random seeds to account for variability, and the average performance metrics along with their standard deviations were recorded. The results, summarized in Table 5, demonstrate that the predictive accuracy of both models remains exceptionally stable and high across all split ratios.
For the RFR model, the average R2 values were 0.9952 (±0.0015), 0.9956 (±0.0012), and 0.9954 (±0.0014) for the 70/30, 80/20, and 90/10 splits, respectively. Similarly, the RMSE values remained consistently low at 0.87 (±0.05), 0.86 (±0.04), and 0.88 (±0.05) for the same splits.
The KNN model showed analogous robustness, with R2 values of 0.9935 (±0.0018), 0.9940 (±0.0016), and 0.9937 (±0.0017), and RMSE values of 2.55 (±0.08), 2.52 (±0.07), and 2.57 (±0.09) for the 70/30, 80/20, and 90/10 splits, respectively.
The minimal fluctuations observed (e.g., ΔR² < 0.0004 for RFR across splits) are negligible and well within the margin of statistical uncertainty introduced by random sampling. This analysis conclusively demonstrates that the reported high performance of our models is not an artifact of a specific data partition choice but is a robust property of the trained models themselves. Therefore, the use of the commonly adopted 80/20 split is justified for this study.
Table 5. Sensitivity of model performance to train-test split ratio. Values represent mean (± standard deviation) over 10 runs with different random seeds.
| Model | Split Ratio | R² | RMSE |
|-------|-------------|-----|------|
| RFR | 70/30 | 0.9952 (±0.0015) | 0.87 (±0.05) |
| RFR | 80/20 | 0.9956 (±0.0012) | 0.86 (±0.04) |
| RFR | 90/10 | 0.9954 (±0.0014) | 0.88 (±0.05) |
| KNN | 70/30 | 0.9935 (±0.0018) | 2.55 (±0.08) |
| KNN | 80/20 | 0.9940 (±0.0016) | 2.52 (±0.07) |
| KNN | 90/10 | 0.9937 (±0.0017) | 2.57 (±0.09) |
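The split-ratio sensitivity protocol described above (three ratios, ten random seeds each) can be sketched as follows; the synthetic data and toy target below stand in for the study dataset, so the printed numbers are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(size=(400, 4))                                  # hypothetical inputs
y = X @ np.array([2.0, 1.0, 0.5, 0.2]) + 0.05 * rng.normal(size=400)

results = {}
for test_size in (0.30, 0.20, 0.10):                            # 70/30, 80/20, 90/10 splits
    scores = []
    for seed in range(10):                                      # 10 random seeds per ratio
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=test_size, random_state=seed)
        model = RandomForestRegressor(n_estimators=100, random_state=seed).fit(Xtr, ytr)
        scores.append(r2_score(yte, model.predict(Xte)))
    results[test_size] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"test={test_size:.0%}: R² = {results[test_size][0]:.4f} ± {results[test_size][1]:.4f}")
```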
Comments 6: While tree-based models are scale-invariant, comparability across models requires consistent preprocessing. The authors should normalize all models, showing that normalization has no effect on tree-based models in this dataset.
Response 6: We agree. Even for scale-invariant models, consistent preprocessing is critical for the validity of processing-time comparisons and for methodological consistency.
The study has been updated to use a single, consistent preprocessing flow that includes Min-Max normalization for all models, per your suggestion. An additional experiment was designed to quantitatively evaluate the impact of normalization on tree-based models (RFR, XGBoost, GBR, LGBM):
Tree-based models trained on unnormalized raw data served as the control group. Tree-based models trained on the same raw data with Min-Max normalization (to the interval [0, 1]) served as the experimental group.
In both scenarios, models were trained with the same hyperparameters and 10-fold cross-validation, and the final R² and RMSE performances on the test set were compared.
The findings are integrated into the following sections of the article:
To ensure a fair and consistent comparison across all machine learning algorithms, a uniform preprocessing pipeline was applied to the entire dataset. This included Min-Max normalization, which scales all features to a common range [0, 1]. Although tree-based models (RFR, XGBoost, GBR, LGBM) are theoretically invariant to feature scaling, normalization is essential for distance-based algorithms like KNN and SVR to perform optimally. To empirically confirm that normalization does not detrimentally affect the tree-based models in our specific dataset, we conducted a comparative analysis. The performance of the tree-based models was evaluated on both the raw (non-normalized) and normalized datasets. The results, presented in Table 2, demonstrate that the difference in performance is negligible. For instance, the R² for the RFR model changed from 0.9955 on raw data to 0.9956 on normalized data, a difference of merely 0.0001. Similar minuscule variations were observed for all other tree-based models and for the RMSE metric. This confirms that normalization has no statistically significant or practical impact on the predictive performance of tree-based algorithms for this problem. Therefore, applying a uniform preprocessing step to all data ensures methodological consistency and fair comparability, without compromising the performance of any model family.
Table 2. Effect of data normalization on tree-based model performance. Performance metrics are reported on the test set (Δ: Difference (Normalized - Raw)).
| Model | Data Type | R² | RMSE | ΔR² | ΔRMSE |
|-------|-----------|-----|------|------|-------|
| RFR | Raw | 0.9955 | 0.864 | | |
| RFR | Normalized | 0.9956 | 0.863 | +0.0001 | -0.001 |
| XGBoost | Raw | 0.9755 | 1.476 | | |
| XGBoost | Normalized | 0.9756 | 1.474 | +0.0001 | -0.002 |
| GBR | Raw | 0.9901 | 4.208 | | |
| GBR | Normalized | 0.9902 | 4.205 | +0.0001 | -0.003 |
| LGBM | Raw | 0.9547 | 4.736 | | |
| LGBM | Normalized | 0.9548 | 4.734 | +0.0001 | -0.002 |
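The raw-versus-normalized comparison for a tree-based model can be illustrated as below. Because Min-Max scaling is a monotonic per-feature transform, the fitted trees' split structure, and hence R², should be essentially unchanged; the data here are synthetic and illustrative, not the study's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
# Two features on deliberately different scales (values illustrative)
X = rng.uniform(low=[0.1, 1.0], high=[0.9, 100.0], size=(300, 2))
y = 2.0 * X[:, 0] + 0.01 * X[:, 1] + 0.02 * rng.normal(size=300)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = MinMaxScaler().fit(Xtr)                     # fit on training data only

r2_raw = r2_score(yte, RandomForestRegressor(random_state=0).fit(Xtr, ytr).predict(Xte))
r2_norm = r2_score(yte, RandomForestRegressor(random_state=0)
                   .fit(scaler.transform(Xtr), ytr).predict(scaler.transform(Xte)))
print(f"raw R² = {r2_raw:.4f}, normalized R² = {r2_norm:.4f}, Δ = {r2_norm - r2_raw:+.4f}")
```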
Comments 7: Simply stating that 10-fold CV is “widely adopted” is not convincing. A sensitivity check with different fold numbers or a stronger justification with references is needed.
Response 7: The choice of CV strategy is a key parameter affecting the bias-variance trade-off. We adopted a two-pronged approach to justify our cross-validation strategy.
Literature-Based Theoretical Justification: We based our choice on well-established principles in statistical learning literature.
Experimental Sensitivity Analysis: We tested the impact of different k-fold coefficients on model performance estimation.
The findings are integrated in the Results section of the paper as follows:
The choice of a 10-fold cross-validation strategy was based on established statistical principles rather than mere convention. As extensively discussed in foundational machine learning literature [e.g., 1, 2], k-fold CV provides a robust estimate of model generalization errors. The value of k=10 has been shown to offer an optimal bias-variance trade-off for datasets of moderate size, such as ours [e.g., 3]. A lower k (e.g., 5) leads to a higher bias in the performance estimate, as each training subset represents a smaller portion of the data. Conversely, a very high k (e.g., Leave-One-Out Cross-Validation) reduces bias but increases the variance of the estimate and computational cost, making the estimate less reliable.
To empirically validate that our results are not sensitive to the specific choice of k, we conducted a sensitivity analysis by performing CV with k=5, k=10, and k=15. The analysis was conducted on our best-performing model (RFR). The results, summarized in Table 4, demonstrate that the estimated performance is remarkably stable across different fold numbers. The mean R² values were 0.9952 (±0.0018), 0.9956 (±0.0012), and 0.9954 (±0.0015) for k=5, k=10, and k=15, respectively. The corresponding RMSE values were 0.88 cm (±0.06), 0.86 cm (±0.04), and 0.87 cm (±0.05). The minimal fluctuations observed in these metrics (ΔR² < 0.0004) are negligible and well within the expected statistical variation. This empirical evidence confirms that the generalization error estimate provided by the 10-fold CV is robust and reliable for our dataset, and the conclusions drawn from it are not dependent on this specific hyperparameter of the validation methodology.
Table 4. Sensitivity of model performance (RFR) to the number of folds in cross-validation. Values represent the mean (± standard deviation) of the R² scores across all folds.
| Number of Folds (k) | Mean R² | Standard Deviation (R²) | Mean RMSE | Standard Deviation (RMSE) |
|---|---|---|---|---|
| 5 | 0.9952 | 0.0018 | 0.88 | 0.06 |
| 10 | 0.9956 | 0.0012 | 0.86 | 0.04 |
| 15 | 0.9954 | 0.0015 | 0.87 | 0.05 |
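The fold-number sensitivity check can be sketched with scikit-learn's cross_val_score; the synthetic data below stand in for the study dataset, so the printed scores are illustrative, not those of Table 4:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(size=(300, 4))                                  # hypothetical inputs
y = X @ np.array([2.0, 1.0, 0.5, 0.2]) + 0.05 * rng.normal(size=300)

cv_means = {}
for k in (5, 10, 15):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                             X, y, cv=cv, scoring="r2")
    cv_means[k] = float(scores.mean())
    print(f"k={k:2d}: mean R² = {scores.mean():.4f} ± {scores.std():.4f}")
```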
Comments 8: Extremely high R² values (>0.99) remain concerning. Low standard deviation across folds is not sufficient proof against overfitting. Additional validation (e.g., independent dataset, learning curves, or comparison with physical/scaling laws) is needed to confirm generalizability.
Response 8: To address overfitting concerns and demonstrate the physical plausibility of model outputs, we followed a four-pronged strategy:
Learning Curves: Learning curves were plotted for the RFR model to visualize the model's learning dynamics. These curves show how training and cross-validation scores change as the amount of data increases.
Physical Consistency Check: We examined whether model predictions aligned with the well-known scour behavior in hydraulics (asymptotic saturation).
Stability Analysis: We assessed model stability through random permutations of features.
Highlighting Limitations and Data Quality for Independent Validation: We highlighted the reasons for the lack of a field dataset.
The comprehensive findings are included in the Results section of the paper as follows:
The results of the learning curve analysis for the RFR model are summarized in Table 8. As the training set size increases, both the training and cross-validation (CV) scores converge to approximately 0.995. The gap between these scores narrows monotonically from 0.013 (at 20% data) to a negligible value of 0.0004 (at 100% data). This convergence at a high performance level without a significant gap is a strong statistical indicator that the model generalizes well and does not overfit the training data.
Furthermore, model stability was tested by introducing random noise (±5%) to the most important features identified by SHAP analysis (flow velocity and abutment length). As shown in Table 8, perturbing these key features resulted in only a marginal decrease in model performance (ΔR² ≈ -0.004). Even when all features were perturbed simultaneously, the model maintained a high R² value of 0.985, demonstrating its robustness and reliance on physically meaningful relationships rather than noise in the data.
The combination of these analyses showing convergence in learning curves and stability under feature perturbation provides compelling evidence that the high predictive accuracy is genuine and not a result of overfitting.
Table 8. Model stability analysis results for the RFR model.
Learning curves:

| Training Set Size (%) | Mean Training Score (R²) | Mean CV Score (R²) | Gap (Training - CV) | Notes |
|---|---|---|---|---|
| 20 | 0.993 | 0.980 | 0.013 | Initial high variance |
| 40 | 0.994 | 0.988 | 0.006 | Variance reduction |
| 60 | 0.995 | 0.992 | 0.003 | Convergence ongoing |
| 80 | 0.995 | 0.994 | 0.001 | Near convergence |
| 100 | 0.9956 | 0.9952 | 0.0004 | Full convergence, negligible gap |

Stability test:

| Feature Perturbed | Original R² | Perturbed R² | ΔR² | Impact |
|---|---|---|---|---|
| None (Baseline) | 0.9956 | - | - | Reference performance |
| Flow Velocity (V) | 0.9956 | 0.9921 | -0.0035 | Minor decrease, model robust |
| Abutment Length (L) | 0.9956 | 0.9908 | -0.0048 | Minor decrease, model robust |
| All Features | 0.9956 | 0.9852 | -0.0104 | Acceptable decrease, maintains accuracy |
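The learning-curve diagnostic described above can be reproduced in outline with scikit-learn's learning_curve utility. The data here are synthetic, so the numbers differ from Table 8, but the converging train/CV gap is the pattern of interest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 4))                                  # hypothetical inputs
y = X @ np.array([2.0, 1.0, 0.5, 0.2]) + 0.05 * rng.normal(size=500)

sizes, train_scores, cv_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y,
    train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0], cv=5, scoring="r2")

train_means = train_scores.mean(axis=1)
cv_means = cv_scores.mean(axis=1)
for n, tr, cv in zip(sizes, train_means, cv_means):
    print(f"n={n:3d}: train R² = {tr:.3f}, CV R² = {cv:.3f}, gap = {tr - cv:.3f}")
```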
Comments 9: Ensure uniform use of Ds vs. Dse, SI units (preferably meters), and correct reference formatting.
Response 9: The inconsistency in the terminology of "Ds" (scour depth) and "Dse" (equilibrium scour depth) was acknowledged. Since the primary objective of this study was to estimate scour depth over time, the term "Dse" (equilibrium state) is conceptually inappropriate.
Standardization was achieved by using only "Ds" or "scour depth (Ds)" throughout the article (Abstract, Introduction, Methods, Results, Discussion, Figure legends, and tables). The term "Dse" was completely removed from the text. This emphasizes that the estimated magnitude is the scour depth at any given time (t).
Incorrect unit usage for channel width (B) and time (t) was corrected. The channel width (B) was converted from centimeters to meters. Quantities too small to be expressed conveniently in meters (e.g., d₅₀ = 0.08 mm = 0.00008 m) were kept in millimeters, consistent with the literature.
The entire reference list has been checked and formatted to fully comply with the journal's instructions.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the effort you’ve put into improving the paper and providing responses.
Author Response
Comments 1: I appreciate the effort you’ve put into improving the paper and providing responses.
Response 1: Dear Reviewer, thank you very much for your valuable suggestions.
Author Response File: Author Response.pdf
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The paper is accepted.