3. Results
The sensitivity analysis was first conducted using the SHAP method to determine the influence of each input parameter on the prediction of Q through the RTHGs. Subsequently, the performance of four metaheuristic algorithms (i.e., CMA-ES, SSA, PSO, and GA) was evaluated to identify the most suitable algorithm for hyperparameter tuning of the GB models. The optimal hyperparameter values of different GB models, obtained through the best-performing algorithm, were then presented. The performance of these models was further assessed in both training and testing phases using standard statistical metrics. In addition, a Taylor diagram was employed to provide a comprehensive comparison of model performance. Finally, uncertainty in the predictions was evaluated using confidence interval analysis and the R-Factor index, with the results presented in visual form.
Figure 6 is a graphical representation of the SHAP values and feature importance in predicting the dimensionless flow rate (Q/√gh5). The scatter plots show the SHAP values for each input variable, demonstrating the extent to which each feature is responsible for the model's output. A bar chart also presents the mean importance of each feature, allowing a comparison of how much each input contributes to the model's decisions. This figure highlights the relationship between the input features and the model's predictions, providing insight into their individual and collective impacts.
Based on Figure 6, the analysis shows that the h/B feature has the most significant impact on predicting the dimensionless flow rate Q/√gh5. This feature exhibits very high SHAP values in most samples, with a positive direction, clearly indicating its key role in the model's predictions. In other words, changes in the h/B value have a considerable effect on the model's output, emphasizing the importance of this feature in the sensitivity analysis. The S feature also stands out as an influential factor in the predictions. Although its impact is less than that of h/B, the high dispersion of its SHAP values suggests that it has a considerable influence on the prediction in certain samples. This dispersion indicates that S can play a varying role under different data conditions. The features L/b, θ, and t/B, while having less impact, still contribute to the model's predictions. In particular, the SHAP values of these features are scattered in many samples, showing that their effects become noticeable under specific data conditions. This dispersion indicates that each of these features may affect the model's outcome differently in different scenarios.
In general, Figure 6 indicates that h/B is clearly the most significant factor in predicting the flow rate, with S the second most significant. The remaining features, L/b, θ, and t/B, though less significant, are still vital to the prediction process. This analysis demonstrates the model's capability to detect intricate relations between features while predicting results precisely, and underscores the value of interpretable methods such as SHAP in tackling complex problems.
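As a lightweight illustration of this kind of sensitivity analysis, the sketch below ranks the features of a synthetic stand-in dataset with scikit-learn's permutation importance, a simpler alternative to the SHAP method itself; the real SHAP analysis, dataset, and trained RTHG model are not reproduced here, and the feature weights are assumptions made purely for the demo.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
# Synthetic stand-in for the RTHG dataset: column 0 (h/B) dominates the
# response and column 1 (S) contributes moderately -- weights chosen only
# for this illustration, not taken from the measured data.
n = 400
X = rng.uniform(0.05, 1.0, size=(n, 5))   # columns: h/B, S, L/b, theta, t/B
y = 3.0 * X[:, 0] + 0.8 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0.0, 0.05, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]   # most influential first
```

On this construction the dominant synthetic "h/B" column is recovered as the top-ranked feature, mirroring the qualitative outcome of the SHAP analysis above.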
Four metaheuristic algorithms (i.e., PSO, SSA, GA, and CMA-ES) were compared for hyperparameter tuning in this work. Since minimizing computational cost was a concern, the initial test was performed on a single base model, LightGBoost. The comparison was performed against the following measures: (i) minimum RMSE as a measure of final accuracy; (ii) mean and standard deviation of RMSE as measures of stability; (iii) the number of iterations and the time needed to achieve 95% of the best performance as measures of convergence speed; and (iv) computational cost (CPU h) as a measure of resource utilization.
Table 9 shows the relative performance of the four metaheuristic algorithms, and Figure 7 shows their convergence behavior.
Based on the results presented in
Table 9, the CMA-ES achieved the lowest best RMSE (0.154). SSA provided more stable mean performance, reached 95% of the best result faster, and required the least computational cost. PSO and GA showed weaker performance in terms of both accuracy and resource consumption. Considering the time constraint of less than one hour and computational efficiency, SSA was selected as the most suitable algorithm for hyperparameter tuning of the GB models, as it offered a balanced trade-off between accuracy, speed, and computational cost.
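For readers unfamiliar with how a metaheuristic drives hyperparameter tuning, the following is a minimal PSO loop (one of the four algorithms compared here; the SSA and the study's actual objective function are not reproduced) that tunes two hyperparameters of a scikit-learn gradient boosting model on synthetic data. The bounds, swarm size, and inertia/acceleration coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, 200)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

def rmse(params):
    """Objective: hold-out RMSE of a GB model for a (learning_rate, n_estimators) pair."""
    lr, n_est = params
    m = GradientBoostingRegressor(learning_rate=lr, n_estimators=int(n_est),
                                  random_state=0).fit(Xtr, ytr)
    return mean_squared_error(yte, m.predict(Xte)) ** 0.5

bounds = np.array([[0.01, 0.3], [50, 200]])   # learning_rate, n_estimators
n_particles, n_iter = 6, 5
pos = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest, pbest_f = pos.copy(), np.array([rmse(p) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random((2, n_particles, 2))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, bounds[:, 0], bounds[:, 1])
    f = np.array([rmse(p) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

best_rmse = pbest_f.min()
```

The same loop structure applies to any of the four optimizers; only the position-update rule changes between PSO, SSA, GA, and CMA-ES.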
The parameter settings of the four metaheuristic optimizers (CMA-ES, SSA, PSO, and GA) used to explore these ranges are reported in
Table 10.
The main hyperparameters tuned for each GB model and their corresponding search ranges are summarized in
Table 11.
The lower and upper bounds for each hyperparameter were selected by combining (i) commonly recommended ranges in the original implementations of the algorithms, (ii) values frequently adopted in previous hydrologic and hydraulic applications of tree-based ensemble methods, and (iii) preliminary trial runs aimed at avoiding severe underfitting (too few trees or overly strong regularization) and excessive overfitting or computational cost (very deep trees or extremely large ensembles). In
Table 12, the optimized hyperparameters of the GB models tuned by SSA are presented.
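As a purely hypothetical sketch of how such bounds can be encoded for an optimizer, the snippet below collects lower and upper limits in a dictionary; the values shown are illustrative assumptions, not the ranges reported in Table 11.

```python
# Hypothetical search-space encoding for the optimizer; the values below are
# illustrative assumptions, NOT the bounds reported in Table 11.
search_space = {
    "n_estimators":  (100, 1000),  # too few trees -> severe underfitting
    "learning_rate": (0.01, 0.3),
    "max_depth":     (3, 10),      # very deep trees -> overfitting, high cost
    "subsample":     (0.5, 1.0),
    "reg_lambda":    (0.0, 10.0),  # overly strong regularization -> underfitting
}
lower = [v[0] for v in search_space.values()]
upper = [v[1] for v in search_space.values()]
```

The `lower`/`upper` vectors are what a swarm- or population-based optimizer clips candidate solutions against at each iteration.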
Next, the hybrid GB models were evaluated for estimating
Q through the RTHG.
Figure 8 shows scatter plots, residuals, and density distributions of the CatBoost-SSA model during training and testing stages.
Table 13 presents the performance evaluation metrics of the CatBoost-SSA model for these stages.
Based on the results of the CatBoost-SSA model during training and testing, the model performed well in estimating Q through the RTHG. During training, the model achieved near-perfect predictions. The R2 value of 0.999 shows an excellent fit with the actual values, confirming the model's ability to replicate the training set. The very low RMSE (0.039) and MAE (0.029) values likewise attest to the precision of its predictions. This is also evident in the scatter plot, where points remain close to the regression line, indicating minimal difference between actual and predicted values. The model was therefore effective in learning the structure of the data during training. During testing, performance declined slightly. R2 decreased to 0.984, indicating that, although the model still performs remarkably well, it loses some accuracy on new data. This small deviation from the training stage is expected, because the model is now evaluated on unseen data. The RMSE (0.147) and MAE (0.092) values both increased, reflecting the errors that naturally arise when a model is applied to new data. The remaining evaluation metrics also show that the CatBoost-SSA model performs very well in the testing stage. MAPE rose from 1.324% in training to 3.641% in testing, implying marginally higher prediction errors, and PBIAS rose to 0.129%, indicating a minor positive bias in the test predictions. Residual analysis further shows that the model predictions are unbiased: during training, the residuals were consistently spread about zero, with no systematic under- or overprediction.
This behavior was also observed during the testing period, although the residuals were more widely spread, reflecting the larger variance that naturally arises when handling new data. Lastly, the KDE plots show that the computed values mirror the measured values closely during training, a sign that the model is well calibrated. During testing, although the distributions begin to deviate slightly, there is still strong similarity between the estimated and actual distributions, indicating that the model forecasts novel data with good accuracy. In brief, the CatBoost-SSA model was efficient in both the training and testing phases. Its high accuracy in training and its still-strong performance on testing data underline its ability to simulate the complex flow rate dynamics through the RTHG. These results highlight the effectiveness of SSA for hyperparameter optimization in complex modeling tasks and demonstrate that combining these techniques can significantly improve prediction accuracy.
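The statistical metrics reported in Tables 13–17 can be computed from observed/predicted pairs as follows; the PBIAS sign convention used here (positive when observations exceed predictions) is an assumption, since conventions differ between references.

```python
import numpy as np

def evaluation_metrics(obs, pred):
    """R2, RMSE, MAE, MAPE (%) and PBIAS (%) from observed/predicted pairs.
    PBIAS sign convention (positive when obs exceed pred) is an assumption."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    resid = obs - pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return {
        "R2":    1.0 - ss_res / ss_tot,
        "RMSE":  float(np.sqrt(np.mean(resid ** 2))),
        "MAE":   float(np.mean(np.abs(resid))),
        "MAPE":  float(100.0 * np.mean(np.abs(resid / obs))),
        "PBIAS": float(100.0 * resid.sum() / obs.sum()),
    }

# Tiny worked example with hand-checkable numbers.
m = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.0, 4.2])
```

Note that MAPE divides by the observations, so it is only meaningful when the observed values are bounded away from zero, as is the case for the dimensionless discharge here.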
Figure 9 presents scatter plots, residuals, and KDE distributions for the NGBoost-SSA model during training and testing stages.
Table 14 summarizes the performance metrics of the NGBoost-SSA model for these stages.
The NGBoost-SSA model demonstrated excellent performance in both the training and testing stages in predicting Q through the RTHG. During training, the model obtained an R2 of 0.995, confirming a very close fit between actual and predicted values, along with low RMSE (0.10) and MAE (0.076) values that attest to the validity of its predictions. The scatter plot showed data points trending closely along the regression line, confirming the model's high predictive ability. Moreover, residual analysis indicated that the residuals lay randomly about zero, with no sign of systematic bias. In the testing phase, although performance declined somewhat, the model retained excellent predictive ability. R2 fell to 0.976 but still indicated close conformity with the test data. RMSE rose to 0.18 and MAE to 0.118, a common result when a model is fed new, unseen data. MAPE also rose to 4.88%, a modest increase in percentage error that remains within tolerable thresholds. PBIAS was minimal in the training stage (0.049%), indicating almost no bias in the predictions, while in the testing stage it increased to 0.311%, showing a slight positive bias that is still at a reasonable level. Such small changes in bias are typical when models are tested on new data. Finally, the residual and KDE analyses further support the model's effectiveness: in training, the computed values closely aligned with the measured values, and in testing, although the distributions began to diverge slightly, they still showed reasonable alignment, confirming that the model can predict new data effectively. Overall, the NGBoost-SSA model performed well in both the training and testing stages.
It demonstrated strong predictive capabilities for estimating Q through the RTHG and maintained good performance when exposed to new data.
Figure 10 shows scatter plots, residuals, and KDE distributions for the HistGBoost-SSA model during training and testing stages.
Table 15 summarizes the performance metrics of the HistGBoost-SSA model for these stages.
The evaluation results of the HistGBoost-SSA model during both the training and testing stages demonstrate its excellent performance in predicting Q through the RTHG. In the training stage, the model achieved an R2 value of 0.993, indicating a nearly perfect fit with the actual values. The same accuracy is observed in the scatter plot, where points are tightly clustered near the regression line, clearly indicating accurate predictions. Moreover, the RMSE of 0.116 and MAE of 0.044 further confirm the model's ability to minimize errors and predict values close to the measurements. Residual analysis confirmed that the residuals were randomly scattered around zero, indicating no systematic error in the model, and the KDE plot showed that the estimated values were extremely close to the observed values, confirming the model's strong performance in training. In the testing stage, despite some deterioration in performance, the model still delivered acceptable outputs. The R2 value dropped to 0.974 but remained high, showing that the model still fits the test data well. The RMSE increased to 0.186 and MAE to 0.112, a natural increase in errors as the model is tested on new, unseen data. MAPE also rose to 4.498%, indicating a larger percentage error in the predictions compared with the training stage. PBIAS rose to 0.406% in the testing phase, up from approximately 0 in the training phase, a small positive bias in the test predictions that is still within a good range. Overall, the HistGBoost-SSA model performed well in the training phase and, even with a slight loss of accuracy in the testing phase, accurately predicted the flow rate through the RTHG. These findings confirm the effective use of SSA for hyperparameter optimization and for improving the predictive performance of the models.
Figure 11 shows scatter plots, residuals, and KDE distributions for the XGBoost-SSA model at training and testing levels.
Table 16 reports the related performance of the XGBoost-SSA model.
The XGBoost-SSA model showed exceptional performance in predicting Q through the RTHG during both the training and testing stages. In the training phase, the model's R2 value was 0.993, demonstrating a nearly perfect fit to the actual data. The very small RMSE (0.114) and MAE (0.052) values show that the predictions were highly accurate with minimal error. This accuracy is also evident in the scatter plot, with data points lying very close to the regression line, confirming the precision of the model's predictions. The residual plot for the training period showed residuals varying randomly around zero with no systematic bias, and the KDE plot confirmed that the predicted values were extremely close to the actual values, validating the model's excellent performance. In the testing period there was a dip in performance, but it was very slight. R2 dropped to 0.981, still indicating a very high correlation with the test data. RMSE rose to 0.158 and MAE to 0.100, both reflecting a slight rise in errors when the model processed new data. Likewise, MAPE rose to 4.155%, higher than in the training period but still low. PBIAS rose to 0.374% in the test phase from around 0 in the training phase, a marginal positive prediction bias that remains within acceptable limits. Overall, the XGBoost-SSA model performed well during the training and testing phases. Despite slightly higher errors when tested, it was still able to make precise predictions of the flow rate through the RTHG, indicating its capability to generalize to new data and the efficacy of applying SSA for hyperparameter tuning in complicated modeling tasks.
Scatter plots, residuals, and KDE distributions of the LightGBoost-SSA model in the training and testing phases are depicted in
Figure 12.
Table 17 summarizes the performance metrics of the LightGBoost-SSA model for these stages.
The LightGBoost-SSA model demonstrated excellent performance in both the training and testing stages for predicting Q through the RTHG. In the training stage, the model had an R2 of 0.995, indicating a very good fit to the data. The RMSE of 0.099 and MAE of 0.037 also attest to the model's accuracy in predicting the flow rate. This can also be seen in the scatter plot, where points lie very close to the regression line, reflecting the precision of the model's predictions. Residual analysis indicated that the residuals were randomly distributed around zero, i.e., the model was free of systematic bias, and the KDE plot showed the estimated values were very close to the actual values, verifying the robust behavior of the model during training. In the testing phase, the model still reported good results despite a drop in performance. R2 fell to 0.982, indicating good agreement with the test data. The RMSE increased to 0.155 and MAE to 0.103, which is expected as the model is tested on new data. Similarly, MAPE rose to 4.389%, higher than in the training stage but still within an acceptable range. PBIAS increased to 0.618% in the testing stage, compared with approximately 0 in the training stage, suggesting a slight positive bias in the predictions on new data, possibly due to differences between the training and testing datasets; this value nevertheless remains within an acceptable range. Overall, the LightGBoost-SSA model performed excellently in the training stage and, despite a slight increase in errors during testing, continued to provide accurate predictions of Q. These results highlight the effectiveness of SSA for hyperparameter optimization and demonstrate the model's capability to generalize well to unseen data.
For a more precise comparison and ranking of the hybrid GB models’ performance in predicting
Q through the RTHG, Taylor diagrams were used at training and testing stages, as shown in
Figure 13, providing a comprehensive and reliable evaluation of the models.
The analysis of the Taylor diagrams and
c′ values based on
Figure 13 demonstrates the overall quality of the hybrid models based on GB algorithms in predicting
Q through the RTHG. At the training stage, all models exhibit performance very close to the observed data, as evidenced by the low
c′ values and the proximity of the model points to the reference point on the Taylor diagram. Specifically, the model with the lowest
c′ value of 0.0394, i.e., CatBoost-SSA, best fits the data and stands first. The LightGBoost-SSA model having a
c′ value of 0.0988 ranks second, and the NGBoost-SSA model with a
c′ value of 0.1002 ranks third. Their small c′ values signify negligible error in estimating the flow, with predictions that closely follow the actual data. During the testing phase, the same ranking is largely preserved, although the c′ values increase, reflecting the greater complexity of the test data and the challenge of generalization. The best performance is still achieved by the CatBoost-SSA model with a c′ value of 0.1476, followed by the LightGBoost-SSA model with 0.1550 and the XGBoost-SSA model with 0.1579. The HistGBoost-SSA and NGBoost-SSA models rank lower because of their higher c′ values. The Taylor diagrams indicate that the correlation between the predictions and observed data is very high for all models, clearly reflecting their reliability and overall accuracy. Additionally, the standard deviations of the models are acceptably close to those of the observed data, indicating the models' ability to simulate the variability in the data effectively.
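The statistic underlying a Taylor diagram comparison can be sketched as the centered pattern RMS difference E′, which combines the standard deviations of the two series with their correlation; whether this exactly matches the paper's c′ index is an assumption.

```python
import numpy as np

def centered_rms_error(obs, pred):
    """Centered pattern RMS difference used in Taylor diagrams:
    E'^2 = sigma_p^2 + sigma_o^2 - 2*sigma_p*sigma_o*R.
    Whether this equals the paper's c' statistic exactly is an assumption."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    so, sp = obs.std(), pred.std()
    r = np.corrcoef(obs, pred)[0, 1]
    # max(..., 0.0) guards against tiny negative values from round-off
    return float(np.sqrt(max(so ** 2 + sp ** 2 - 2.0 * so * sp * r, 0.0)))

obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 2.0, 2.9, 4.2])
e_prime = centered_rms_error(obs, pred)
```

Algebraically, E′ equals the standard deviation of the prediction errors, so a perfect model plots at the reference point with E′ = 0.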
To examine the absence of overfitting, the learning curves of the best-performing model (CatBoost-SSA) are presented in
Figure 14.
The training and validation RMSE curves decrease smoothly and converge, indicating stable learning behavior without overfitting.
Table 18 summarizes the five-fold cross-validation R2 values for all SSA-based GB models, providing a comprehensive evaluation of their generalization consistency across folds. The 5-fold CV results reveal highly stable R2 performance for all optimized GB models, confirming robust generalization and the absence of overfitting.
In the following analysis, the focus is on evaluating whether the developed hybrid models exhibit physically plausible hydraulic behavior beyond the range of the laboratory data. Since Gradient Boosting models, and machine-learning approaches in general, lack inherent physical structure, they may generate non-physical responses under extrapolation. Therefore, using the best-performing model of this study, namely the CatBoost-SSA model, a set of physics-based sanity tests was conducted.
In these tests, the channel and gate geometry were fixed using representative mid-range values from the experimental dataset: a channel width B = 0.25 m, a length ratio L/b = 2.0, and a thickness ratio t/B = 0.02. The depth ratio h/B was then varied over a uniform grid ranging from 0.05 to 1.05, representing an extension of approximately 10–15% beyond the experimental domain. For each depth value, two longitudinal slopes, S = 0 and S = 0.005, were considered, and the model output was predicted.
Figure 15 shows the results of the physics-based sanity checks performed for the CatBoost-SSA model under controlled extrapolation conditions. Panel (a) illustrates the predicted discharge Q as a function of h3/2 for two longitudinal slopes (S = 0 and S = 0.005), demonstrating smooth, monotonic, and physically plausible behavior, with higher discharges observed for the larger slope. Panel (b) presents the variation in the dimensionless discharge Q/√gh5 versus the depth ratio h/B, where the predictions remain bounded, stable, and consistent with expected hydraulic trends across the extended depth range.
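A sanity test of this type can be sketched as follows; a scikit-learn model trained on synthetic monotone data stands in for CatBoost-SSA, and the grid limits mirror those described above. The functional form of the synthetic response is an assumption made only for the demo.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# A stand-in model trained on synthetic monotone data; the actual
# CatBoost-SSA model and laboratory dataset are not reproduced here.
rng = np.random.default_rng(3)
h_B = rng.uniform(0.05, 0.95, 500)
S = rng.choice([0.0, 0.005], 500)
y = h_B ** 1.5 * (1.0 + 40.0 * S) + rng.normal(0.0, 0.01, 500)
model = GradientBoostingRegressor(random_state=0).fit(
    np.column_stack([h_B, S]), y)

# Sanity grid extending ~10% beyond the training range of h/B.
grid = np.linspace(0.05, 1.05, 50)
frac_monotone, mean_q = {}, {}
for s in (0.0, 0.005):
    q = model.predict(np.column_stack([grid, np.full_like(grid, s)]))
    frac_monotone[s] = float(np.mean(np.diff(q) >= -1e-6))  # near-monotone?
    mean_q[s] = float(q.mean())
```

If strict monotonicity is required by construction rather than merely checked, scikit-learn's HistGradientBoostingRegressor also accepts a `monotonic_cst` argument that enforces it during training.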
Next, the uncertainty of the hybrid GB models in estimating
Q through the RTHG is examined. In this regard,
Figure 16 presents the CI values and R-Factor indices for the hybrid GB models, providing insight into the reliability and robustness of the predictions.
According to Figure 16, the CI values and R-Factor indices indicate that all hybrid GB models exhibit comparable levels of uncertainty and reliability in estimating Q through the RTHG. The NGBoost-SSA model ranks best, with the lowest CI (0.616) and R-Factor (3.596) values, indicating the most confident and accurate predictions among the models considered, followed by CatBoost-SSA. The CI values of all models are closely similar, ranging from 0.616 to 0.650; this proximity signals uniform levels of uncertainty across models, with minimal variation in prediction accuracy. The same holds for the R-Factor scores (3.596 to 3.791), signaling minimal variation in the agreement between predictions and actual data. While NGBoost-SSA is marginally better than the others regarding CI and R-Factor, CatBoost-SSA and HistGBoost-SSA also show high accuracy and reliability. In contrast, XGBoost-SSA and LightGBoost-SSA have relatively higher CI and R-Factor scores, reflecting somewhat greater uncertainty and deviation in their predictions. All in all, the analysis indicates that all the hybrid GB models offer good predictions, but NGBoost-SSA is slightly better at reducing uncertainty and enhancing prediction reliability. These findings support the use of these models, particularly NGBoost-SSA and CatBoost-SSA, for accurate and dependable flow rate estimation through the RTHG in engineering applications.
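Uncertainty indices of this kind can be illustrated with residual-quantile bands; the exact CI and R-Factor definitions used for Figure 16 are not restated here, so the SUFI-2-style definitions below (R-Factor as the mean 95% band width divided by the standard deviation of the observations) are assumptions.

```python
import numpy as np

def uncertainty_indices(obs, pred):
    """95% band from residual quantiles; R-Factor = mean band width divided by
    the standard deviation of the observations (SUFI-2 style -- an assumption
    about the exact definitions behind Figure 16)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    resid = obs - pred
    lo, hi = np.percentile(resid, [2.5, 97.5])
    inside = (obs >= pred + lo) & (obs <= pred + hi)
    width = hi - lo                       # constant-width band in this sketch
    return width, float(inside.mean()), width / obs.std()

# Synthetic demo: predictions with homoscedastic noise.
rng = np.random.default_rng(4)
pred = np.linspace(1.0, 5.0, 200)
obs = pred + rng.normal(0.0, 0.1, 200)
width, coverage, r_factor = uncertainty_indices(obs, pred)
```

By construction, roughly 95% of the observations fall inside the band (`coverage` near 0.95), while `r_factor` shrinks as the band narrows relative to the natural variability of the data.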
Figure 17 illustrates the observed data and predicted
Q values with 95% confidence bands for the evaluated hybrid GB models, demonstrating the uncertainty bounds of the model predictions across the samples.
The results depicted in
Figure 17 reveal that the observed data and predicted
Q values, along with the 95% confidence bands, effectively illustrate the uncertainty bounds of the hybrid GB models' predictions across the sample indices. The confidence bands provide a graphical representation of the region in which the actual flow values should fall with 95% probability, giving a direct indication of each model's predictive reliability. For all models, the predicted values track the observed data fairly well, suggesting that all models can represent the overall flow dynamics. In particular, the confidence bands of the CatBoost-SSA and NGBoost-SSA models are relatively narrower and closer to the observed values, indicating smaller prediction uncertainty and greater reliability. The other models, though still of reasonable fit, show relatively wider confidence bands at some sample points, indicating greater uncertainty there. In general, Figure 17 confirms that the hybrid GB models make precise predictions with quantifiable uncertainty, and emphasizes the higher stability of models such as CatBoost-SSA and NGBoost-SSA for predicting flow rates through the RTHG.
To complement the bootstrap-derived confidence intervals, the native probabilistic output of NGBoost-SSA was evaluated. As illustrated in
Figure 18, the mean width of NGBoost's 95% prediction intervals on the test set is only 0.26 in terms of Q/√gh5, with a median width of 0.24. When normalized by the maximum observed value, this corresponds to a CI-like index of 0.041, which is markedly smaller than the bootstrap-based CI value of 0.616. This sharp reduction demonstrates that NGBoost-SSA produces considerably narrower yet well-covering predictive intervals, capturing most of the aleatoric variability with only limited epistemic spread. Overall, the results presented in
Figure 18 confirm that NGBoost-SSA exhibits the lowest total predictive uncertainty among all SSA-optimized GBMs evaluated in this study.
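Given the Gaussian predictive distributions that NGBoost emits by default (a mean μ and standard deviation σ per sample), the interval-width statistics quoted above can be computed as follows; normalizing by the maximum observed value is an assumption about the CI-like index.

```python
import numpy as np

Z95 = 1.959964  # two-sided 95% standard-normal quantile

def interval_width_index(mu, sigma, obs):
    """Mean and median 95% prediction-interval widths for Gaussian predictive
    distributions (NGBoost's default), plus a CI-like index obtained by
    normalizing the mean width by the maximum observed value (the
    normalization choice is an assumption)."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    lo, hi = mu - Z95 * sigma, mu + Z95 * sigma
    width = hi - lo
    return (float(width.mean()), float(np.median(width)),
            float(width.mean() / np.max(obs)))

# Toy numbers, not the study's test set.
mu = np.zeros(5)
sigma = np.full(5, 0.1)
obs = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
mean_w, median_w, ci_index = interval_width_index(mu, sigma, obs)
```

In practice the per-sample (μ, σ) pairs would come from the fitted NGBoost model's predictive distribution rather than the constants used in this toy example.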
Empirical models based on regression analysis are widely utilized tools in the investigation of hydraulic phenomena and are commonly applied in the design of spillways and hydraulic gates. Additionally, empirical models are employed for RTHGs. Among these, one of the most accurate relationships is Equation (5), which was derived from the experimental data used in this article [3].
In this study, to demonstrate the superior performance of the machine learning models, their results are compared with those of the empirical model. Equation (5) defines the dimensionless parameter Q/√gB5, while the intelligent models presented herein determine the value of the parameter Q/√gh5. For a more accurate comparison, the value of Q is first calculated using Equation (5), and subsequently, the dimensionless parameter Q/√gh5 is computed. These calculated values are then compared with the observed values, and statistical functions along with model evaluation indices are used to assess the performance of the models.
Scatter plots, residuals, and KDE distributions computed by Equation (5) in the testing dataset are presented in
Figure 19.
Table 19 also summarizes the performance metrics of the empirical model.
As illustrated in
Figure 19 and summarized in
Table 19, the empirical model also delivers acceptable accuracy, showing a reasonable alignment between the computed and observed values in the testing dataset. However, despite its satisfactory performance, the level of precision required for design-oriented hydraulic applications typically demands models with higher predictive capability. In this regard, the hybrid gradient boosting models developed in this study provide markedly superior accuracy and reduced error levels, making them more suitable than the empirical formulation for engineering design and reliable flow estimation through RTHGs.
4. Discussion
The experimental data utilized in the current work were rigorously screened prior to model training and evaluation. The dataset was collected from a set of well-regulated laboratory experiments under steady flow conditions, using precision instruments to quantify the flow rate, head, and gate angle. The repeatability and reproducibility of the measurements were verified by comparing repeated runs and by applying the Local Outlier Factor (LOF) algorithm to detect and eliminate outlying points. The experiments were conducted in a comparatively large flume with Reynolds numbers high enough to ensure turbulent flow; hence, scale effects can be confidently neglected. In addition, normalization and dimensional analysis were performed to reduce any remaining effect of scale variability.
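The LOF screening step can be sketched with scikit-learn on synthetic (flow rate, head, gate angle) records; the injected outliers, cluster statistics, and the choice of `n_neighbors` are illustrative assumptions, not the study's actual settings.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
# Synthetic (flow rate, head, gate angle) records with two injected outliers;
# the real laboratory measurements are not reproduced here.
data = rng.normal([0.5, 0.3, 30.0], [0.05, 0.03, 2.0], size=(100, 3))
data = np.vstack([data, [[5.0, 3.0, 90.0], [-2.0, 1.5, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(data)     # -1 marks a point as an outlier
clean = data[labels == 1]          # screened dataset passed on to modeling
```

LOF flags points whose local density is much lower than that of their neighbors, which makes it well suited to the isolated, physically implausible records that occasionally appear in flume measurements.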
Regarding the capacity of the models to identify scale-dependent data, the implemented ensemble learning algorithms (CatBoost, NGBoost, XGBoost, LightGBM, and HistGBoost) have an innate ability to capture complex nonlinear relationships between input and output variables. Provided they are trained on clean and uniform data, they can identify anomalous or uncorrelated patterns, which would appear as low SHAP importance or high residuals. However, since the available data had already been gathered under laboratory conditions with negligible scale effects, this issue had little impact on the present study.
Even though the main aim of this work was to create robust predictive models for estimating the flow rate, the derived models can also be expressed in empirical-type equation form. Simple correlations between the dimensionless discharge and the most influential parameters (e.g., h/b, b/B, θ, and S) can be obtained by symbolic regression or surrogate modeling methods guided by the feature importance ranking and SHAP analysis. This may be a fruitful line of future research for deriving general discharge formulas that combine the realism of data-based models with the parsimony of empirical models.
Lastly, the physical consistency of the developed models was checked through sensitivity and interpretability tests. The sign and magnitude of the SHAP values ensured that the computed behavior of the models complies with basic hydraulic principles: a higher upstream head ratio or gate opening results in more discharge, whereas a higher gate angle or bed slope reduces it. This agreement between physical understanding and model predictions indicates that the machine learning models are not only accurately trained but also physically consistent in the explored domain.
To clarify the applicability of the proposed SSA-GBM models beyond the laboratory conditions, the hydraulic regime of the experimental dataset was analyzed, and the scaling limitations were explicitly considered. Although all measurements were obtained from a single laboratory flume, the predictor variables were defined in dimensionless form (h/B, L/b, t/B, S, and θ), which is inherently scale-independent and commonly used in hydraulic similitude. Reynolds numbers in the experiments ranged from 4.12 × 10³ to 2.97 × 10⁵, all above the classical laminar–turbulent transition and thus representative of predominantly turbulent flow conditions. The Froude numbers were in the range of 0.014 to 1.34; hence, most runs were subcritical, with a very small fraction approaching critical flow. The SSA-GBM models are expected to retain their validity for field applications operating under similar hydraulic regimes. For practical applications, Re ≥ 1 × 10⁴ and 0.02 < Fr < 1.0 are recommended to limit viscous and scale effects and to avoid near-critical instabilities. It is also important to recognize that real canals may involve additional complexities, such as much larger aspect ratios, greater three-dimensionality, sediment load, and variable roughness, all absent from the laboratory configuration, which might require additional calibration in the field.
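The regime checks above reduce to two standard dimensionless numbers; a small helper, using flow depth as the characteristic length (an assumption about the study's exact definition), is:

```python
def reynolds_froude(v, h, nu=1.0e-6, g=9.81):
    """Reynolds and Froude numbers using flow depth h as the length scale
    (an assumption; the study's exact characteristic length is not restated
    here). v in m/s, h in m, nu in m^2/s (water at ~20 C)."""
    return v * h / nu, v / (g * h) ** 0.5

# A representative subcritical run: 0.4 m/s mean velocity, 0.1 m depth.
re, fr = reynolds_froude(v=0.4, h=0.1)
```

Runs satisfying the recommended envelope (Re ≥ 1 × 10⁴ and 0.02 < Fr < 1.0) can be filtered with exactly this pair of numbers before applying the models in the field.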
In addition to aleatoric uncertainty, which arises from the intrinsic variability of the measured discharge data and is quantified in this study through bootstrap confidence intervals and the R-Factor values, the models are also subject to epistemic uncertainty. This second component originates from the structure of the gradient-boosting models themselves, the limited range of the experimental parameter space, and the fact that all observations were collected within a single laboratory flume. These structural and model-based limitations imply that the learned response surfaces may not fully represent all possible hydraulic conditions beyond the tested combinations of h/B, L/b, t/B, and S. Therefore, epistemic uncertainty is inherently higher for models with greater structural complexity or weaker robustness. This distinction is especially important when evaluating the comparative reliability of the SSA-optimized GBMs.
The SHAP analysis shows that the depth ratio h/B is the most influential parameter in predicting the dimensionless flow rate (Q/√(gh⁵)). This result is fully consistent with hydraulic theory and with previous analytical studies on top-hinged gates. Physically, h/B directly controls the upstream specific energy and the magnitude of the hydrostatic force acting on the gate. As a result, even small increases in upstream depth significantly increase the opening beneath the gate, leading to a noticeable rise in discharge. This strong dependence matches the large exponents associated with h/B in classical power-law discharge relations. The channel slope S appears as the second most influential parameter. Although the tested slopes are small, variations in S modify the energy grade line, the approach velocity distribution, and the upstream pressure field. Previous laboratory results have also shown that using discharge equations from horizontal channels in sloping channels can lead to considerable errors, which confirms that slope effects must be accounted for. The importance assigned to S by the machine-learning models is therefore consistent with fundamental hydraulic behavior. In comparison, geometric parameters such as L/b, t/B, and θ show moderate but still meaningful impacts. Their lower influence agrees with earlier experimental findings, where these parameters were associated with small exponents or could sometimes be neglected without loss of accuracy. Nonetheless, the ML models correctly capture their secondary role in adjusting the moment balance and the detailed geometry of the opening beneath the gate.
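Computing SHAP values requires the `shap` package on top of a fitted boosting model. As a lighter, self-contained illustration of the same idea, i.e., ranking features by their contribution to the prediction, the sketch below uses permutation importance (a related but distinct technique) on synthetic data constructed so that h/B dominates and S is secondary, mirroring the ranking reported above. All data here are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-ins for the dimensionless predictors: h/B carries the
# strongest signal, S a secondary one, L/b a weak one (illustrative only).
hB = rng.uniform(0.1, 1.0, n)
S = rng.uniform(0.0, 0.05, n)
Lb = rng.uniform(1.0, 3.0, n)
X = np.column_stack([hB, S, Lb])
y = 5.0 * hB**1.5 + 20.0 * S + 0.1 * Lb + rng.normal(0, 0.02, n)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)
names = ["h/B", "S", "L/b"]
order = np.argsort(imp.importances_mean)[::-1]
print([names[i] for i in order])
```

On this synthetic surrogate, the importance ranking recovers h/B as dominant and S as secondary, consistent with the SHAP-based findings; on the real dataset, SHAP additionally provides per-sample attributions and signs, which permutation importance does not.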
At high discharges and velocities, several hydraulic mechanisms can explain the increased scatter and larger prediction errors. As the flow accelerates beneath the top-hinged gate, the Reynolds number and turbulence intensity rise, and the flow in the contraction region becomes more strongly three-dimensional. This can promote separation, local vortices, and secondary currents, especially for larger width ratios and mild channel slopes. Under these conditions, the actual pressure distribution and effective contraction beneath the gate may deviate from the simplified quasi-two-dimensional behavior implicitly assumed in the dimensional analysis. In addition, free-surface fluctuations and rapid variations in the under-gate jet make it more difficult to obtain accurate measurements of upstream depth, gate angle, and discharge at high velocities, which increases measurement uncertainty in exactly the range where the flow is most energetic. These highly nonlinear and partially three-dimensional effects pose a challenge for the ML models, because they typically occur in the upper tail of the dataset, where the number of observations is relatively limited and the relationship between the input parameters and discharge becomes more complex and less smooth. Under such conditions, even a flexible ensemble model may fail to capture the physical behavior fully and becomes more susceptible to noisy or biased training data. Potential remedies include targeted data augmentation with additional high-velocity runs in the existing experimental setup and embedding available physical insight directly into the learning process.
Such physics-informed inputs could include derived quantities, such as the Froude number and the energy head, together with simple physical constraints, for example requiring the predicted discharge to be monotonically non-decreasing with increasing upstream depth.
In order to position the present work within the broader context of flow-measurement research, it is useful to compare our results with previous studies on weirs, sluice gates, and hinged-gate devices. Classical rating equations for sharp-crested weirs and sluice gates typically express discharge as a power-law function of upstream depth and a limited set of geometric parameters, with calibration coefficients that must be re-fitted whenever the structure geometry or flow regime changes. Similar dimensionless discharge relations have been proposed for circular and rectangular hinged gates, including the base experimental study used in this work, where the influence of depth, width ratio, gate length, and slope is condensed into a small number of Π-groups and power-law exponents. Our hybrid ML models reproduce the same dominant hydraulic trends reported in these studies: discharge increases monotonically with upstream depth and is sensitive to channel slope and width ratio, while the effects of thickness and gate length remain secondary. At the same time, the SSA-tuned gradient boosting and NGBoost models achieve substantially lower prediction errors than the original empirical relations over the same experimental dataset and are able to accommodate the combined effects of multiple geometric and hydraulic parameters without requiring separate calibration for each configuration. Moreover, unlike traditional deterministic formulas, the NGBoost-SSA model provides full predictive distributions and allows confidence intervals and R-Factor based uncertainty metrics to be computed directly from the model outputs, which is rarely available for conventional weir or gate equations.
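For contrast, the calibration step that such classical rating equations require can be sketched as a log-space least-squares fit of a power law Q* = C (h/B)ⁿ. The coefficient, exponent, and data below are illustrative, not the paper's calibrated values; the point is only that C and n must be re-fitted for each geometry.

```python
import numpy as np

rng = np.random.default_rng(0)
hB = rng.uniform(0.1, 1.0, 100)
# Synthetic "measurements" following a noisy power law (C=2.5, n=1.6
# are illustrative values, not from the experimental study).
Q = 2.5 * hB**1.6 * np.exp(rng.normal(0, 0.02, 100))

# Log-linear regression: ln Q = ln C + n * ln(h/B)
n_hat, lnC_hat = np.polyfit(np.log(hB), np.log(Q), 1)
C_hat = np.exp(lnC_hat)
print(C_hat, n_hat)
```

A single power law of this kind captures only the dominant depth dependence; accommodating slope, width ratio, and thickness jointly requires either additional fitted exponents per configuration or, as in the present work, a multivariate learned model.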
From a methodological perspective, the main scientific contribution of this study is to demonstrate a coherent hybrid framework that combines: (i) dimensional analysis and ANOVA to define physically meaningful input features, (ii) local outlier detection and careful data preprocessing, (iii) SSA-based global hyperparameter optimization of several state-of-the-art gradient boosting algorithms, and (iv) model-agnostic explainability (SHAP) and uncertainty analysis for a practical flow-measurement device. This integrated approach goes beyond previous work on weirs and gates that relied solely on empirical regression or single black-box models, and shows that machine learning can be used not only to improve accuracy, but also to enhance physical interpretability and quantify predictive uncertainty for hydraulic structures.
5. Conclusions
The accurate estimation of flow rates (Q) through structures such as the Rectangular Top-Hinged Gate (RTHG) is a critical aspect of hydraulic engineering, as it directly impacts the efficiency and safety of water management systems. This study aimed to improve the prediction of flow rates through the RTHG by utilizing hybrid models based on gradient boosting algorithms, which offer advanced capabilities for handling complex, nonlinear relationships in hydraulic data. The primary objective was to develop an accurate model for estimating Q through the RTHG using hybrid models, including Categorical Boosting (CatBoost), Natural Gradient Boosting (NGBoost), Histogram-based Gradient Boosting (HistGBoost), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting (LightGBoost). One of the essential factors in developing artificial intelligence models is the accurate and proper tuning of their hyperparameters. Therefore, four powerful metaheuristic algorithms—Covariance Matrix Adaptation Evolution Strategy (CMA-ES), Sparrow Search Algorithm (SSA), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA)—were evaluated and compared for hyperparameter tuning, using LightGBoost as the baseline model. An assessment of error metrics, convergence speed, stability, and computational cost revealed that SSA achieved the best performance for the hyperparameter optimization of GB models. Consequently, hybrid models combining GB algorithms with SSA were developed to predict Q through RTHGs.
Outlier removal was carried out using the Local Outlier Factor (LOF) algorithm, and data normalization was achieved through Z-score normalization. In order to identify the dimensionless independent parameters that significantly influence Q predictions, dimensional analysis and analysis of variance (ANOVA) were performed. The dataset was divided into a training set (70%) and a testing set (30%) to ensure the robustness of the models. The performance of the models was evaluated using various metrics, including Coefficient of Determination (R2), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Percent Bias (PBIAS). Additionally, model comparison and ranking were conducted using the Taylor Diagram, and sensitivity analysis was performed through the SHAP method. The uncertainty in model predictions was quantified through Confidence Interval (CI) and R-Factor indices.
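A minimal sketch of this preprocessing chain (LOF outlier removal, 70/30 split, and Z-score normalization) using scikit-learn is given below. The data are synthetic stand-ins for the dimensionless predictors, the LOF neighborhood size is the library default, and the scaler is fitted on the training split only to avoid leakage; the paper's exact settings are not restated here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))          # stand-ins for h/B, L/b, t/B, S
y = X @ np.array([2.0, 0.5, 0.2, 1.0]) + rng.normal(0, 0.1, 300)

# 1) LOF outlier removal: fit_predict returns +1 for inliers, -1 for
#    outliers (n_neighbors=20 is the sklearn default).
mask = LocalOutlierFactor(n_neighbors=20).fit_predict(X) == 1
X, y = X[mask], y[mask]

# 2) 70/30 train/test split, then Z-score normalization fitted on the
#    training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30,
                                          random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```

After this step the training features have zero mean and unit variance per column, and the test set is scaled with the training statistics.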
Based on the obtained results, all hybrid gradient boosting models demonstrated high accuracy in predicting Q through the RTHG during both training and testing phases. Among them, CatBoost-SSA achieved the highest performance, with an R2 of 0.999 during training and 0.984 during testing, indicating its superior ability to model complex Q behavior with minimal error. The low values of RMSE (0.147) and MAE (0.092) in the testing phase confirm the robustness and generalizability of this model. NGBoost-SSA also showed strong performance with reliable predictions and the lowest uncertainty, as indicated by its confidence interval (CI = 0.616) and R-Factor (3.596), suggesting its predictions were the most consistent and trustworthy across different scenarios. Additionally, LightGBoost-SSA, XGBoost-SSA, and HistGBoost-SSA provided competitive results, with R2 values above 0.97 in the testing phase, though they exhibited slightly higher levels of uncertainty compared to CatBoost and NGBoost.
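For reference, the statistical metrics used in this comparison can be computed as below. This is a generic implementation; the PBIAS sign convention (here, positive indicates overprediction) is an assumption, since conventions differ across studies.

```python
import numpy as np

def metrics(obs, sim):
    """R2, RMSE, MAE, MAPE (%), and PBIAS (%) for observed vs. simulated
    values. MAPE assumes no zero observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    e = sim - obs
    r2 = 1.0 - np.sum(e**2) / np.sum((obs - obs.mean())**2)
    rmse = np.sqrt(np.mean(e**2))
    mae = np.mean(np.abs(e))
    mape = 100.0 * np.mean(np.abs(e / obs))
    # Sign convention varies in the literature; here positive PBIAS
    # means the model overpredicts on average.
    pbias = 100.0 * np.sum(e) / np.sum(obs)
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "MAPE": mape, "PBIAS": pbias}
```

A perfect prediction yields R2 = 1 with all error metrics equal to zero, which is why values such as R2 = 0.984 with RMSE = 0.147 in the testing phase indicate a close but not exact fit.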
SHAP analysis confirmed that the most influential parameter in predicting Q was h/B (upstream water depth to channel width ratio), followed by S (channel slope). These findings emphasize the importance of these physical parameters in Q modeling and highlight the interpretability of the machine learning approach used. Furthermore, the Taylor diagram analysis ranked CatBoost-SSA as the top-performing model, followed closely by LightGBoost-SSA and NGBoost-SSA, reinforcing the conclusion that these hybrid models are effective for complex hydraulic modeling tasks.
In conclusion, the integration of SSA with gradient boosting models not only enhanced prediction accuracy but also reduced uncertainty in modeling outcomes. This hybrid approach presents a promising tool for hydraulic engineers aiming to optimize the operation and design of gated structures such as RTHGs.
Limitations of the present study. Although the hybrid boosting models exhibited excellent predictive power and physical plausibility, several limitations should be acknowledged. First, the dataset used in this study was measured under controlled laboratory conditions and may not fully represent field-scale hydraulic behavior. Second, the models were trained and validated on a finite range of geometric and hydraulic parameters; thus, extrapolation beyond this range should be approached with caution. Third, hyperparameter tuning was carried out with respect to a single objective function (RMSE), and multi-objective optimization could be employed to improve model generalization. Finally, future studies are encouraged to incorporate larger and more diverse datasets, such as prototype-scale measurements, to enhance the robustness and applicability of the developed models.