Water Level Predictions at Both Entrances of a Sea Strait by Using Machine Learning

Altaş, Furkan; Öztürk, Mehmet

doi:10.3390/w16162335


by Furkan Altaş * and Mehmet Öztürk

Coastal and Harbor Engineering Laboratory, Department of Civil Engineering, Yildiz Technical University, Istanbul 34220, Türkiye

* Author to whom correspondence should be addressed.

Water 2024, 16(16), 2335; https://doi.org/10.3390/w16162335

Submission received: 25 July 2024 / Revised: 12 August 2024 / Accepted: 15 August 2024 / Published: 20 August 2024

(This article belongs to the Section Oceans and Coastal Zones)


Abstract:

In this study, we employed a novel machine learning (ML) methodology to predict water levels (WLs) from their constituent components at both entrances of a sea strait, namely the Bosphorus. The principal components of WLs in the strait are mean sea level pressure (MSLP), wind speeds (W, U, V), discharges from the Danube River (Q), and tidal conditions (T). Following the application of the t-test, SFS, PCA, and VIF analyses, and the consideration of a range of ML techniques (including Linear Regression (LR), Regression Trees (RT), Support Vector Machine Regression (SVMR), Gaussian Process Regression (GPR), and Artificial Neural Networks (ANNs)), the number of predictors was reduced in order to obtain the most flexible and accurate regression model. As a consequence of this process, MSLP, W, and Q were retained, while the remaining variables (tide) were excluded. Furthermore, the order of importance for the optimal regression model was identified as Q_lagged, MSLP, V_lagged, and U at the north entrance model, while at the south entrance model, the order was MSLP, Q_lagged, U, and V. The models were trained using 80%, 50%, and 33% of the data, respectively. The model trained on 80% of the data yielded the most accurate predictions, with a correlation coefficient of R ≅ 0.95 and a root mean square error (RMSE) of 0.02 m. The model demonstrated a markedly superior predictive capacity compared to previous studies in the region, which is attributed to two factors that are regarded as the novelty of the study. The first factor was the random selection of training data from each month of the year, which allowed for the representation of the general pattern of water level (WL) behaviours. The second factor was the selection of the physically most meaningful inputs, which were selected according to the results of the significance and multicollinearity check. 
Furthermore, the predicted and measured WLs were employed as boundary conditions in a hydrodynamic model to evaluate how well the predicted WLs reproduce the currents in the strait compared with the observed WLs. The 80% data-trained model produced current velocities similar to those obtained with the observed WLs, whereas the 50% and 33% data-trained models yielded slightly different results.

Keywords:

machine learning; water level prediction; Bosphorus; feature selection; regression; training; test

1. Introduction

The accurate prediction of sea level is of importance in the planning of coastal areas in a variety of contexts, including the management of harbours, which requires minimum water levels to allow ships to enter, the safety of navigation, the early warning of destructive flooding due to storm surges, the maintenance of agricultural fertility, etc. [1,2,3]. The inability to predict could have significant economic, social, and environmental consequences. Therefore, the development of a timely and accurate prediction model, particularly one that can anticipate extreme events, is crucial for the planning of coastal resilience. Tides, wave set-ups, and storm surges (atmospheric pressure differences and strong winds) are the most prominent processes in coastal sea level variability [4,5].

Straits represent a significant component of the oceanic ecosystem. The difference in water levels between the entrances of straits connecting two different water bodies is the primary driving mechanism of flow structure. There are approximately 30 major straits in the marginal seas of Europe (e.g., Gibraltar, the Messina Strait, the Dardanelles) and approximately 200 straits worldwide (e.g., the Bab-el Mandab and the Bering Strait). These straits play a significant role in regional and global water circulation and marine ecology [6]. The transfer of mass and energy occurs via these waterways between neighbouring seas. It is, therefore, evident that these straits play a pivotal role in marine transportation, water quality, and the ecology of the adjacent basins. It is also possible that recreational activities in the straits could be of significance. As they are transitional waterways, the flow is typically two-layered in the straits, and their flow structures are highly complex due to the combined influence of their morphology and the forcing conditions acting on the neighbouring seas [7,8,9,10,11]. The driving mechanisms of the strait’s flow are the forcing conditions acting on the open boundaries. These are the water level and density (salinity and temperature) differences between both sides of the straits. It can be observed that the density (especially salinity) of the adjacent seas is less variable than the water level fluctuations on an annual scale. Therefore, baroclinic forces arising from density difference show a much more stable behaviour than the barotropic forcing mechanisms of water level difference.

In general, the water level (WL) component of the boundary forcing of a strait is constituted by regular and irregular constituents. Tidal harmonics give rise to the regular WL oscillations, whereas meteorological processes (wind and atmospheric pressure) are responsible for the latter. The tidal characteristics and meteorological setup of neighbouring basins determine barotropic forcing conditions, which are related to WL difference, at both entrances of the straits. As the most dynamic component, the WL shows high seasonal variability, particularly due to the variable meteorological setup. Therefore, determining accurate WL conditions at both ends of a strait is crucial for the hydrodynamics in both the strait and regional circulation patterns.

An alternative to measuring WLs directly is to predict them from the components that contribute to them. This approach is particularly advantageous in situations where field campaigns are costly and challenging to undertake due to operational difficulties such as identifying suitable locations and maintaining safety. Numerical and regression models are two of the most commonly used tools for predicting WLs.

Deterministic numerical models are a robust approach, particularly in the case of storm forecasting [12,13,14]. However, they are not without limitations, particularly in terms of computational costs, which can be a significant challenge when ensemble forecasts are required for risk analysis. Regression models, on the other hand, have also been successfully applied to offshore gauge stations for the estimation of sea-level processes and extreme storm surges [15,16]. In certain instances, these approaches yield comparable results in the prediction of extreme water levels [17].

In contrast, machine learning (ML) methods have only recently been introduced for predicting WLs and have been demonstrated to offer a superior approach to numerical and regression models, particularly in storm surge events [18,19]. The application of machine learning methods in the prediction of water levels enables the modelling of complex relationships and nonlinear patterns with high accuracy and reliability. These methods are capable of analysing a multitude of variables, including wind speed, atmospheric pressure and tide, and are effective when utilising extensive and heterogeneous data sets. Their adaptive learning capabilities allow models to update themselves with new data and enhance their predictive abilities. Furthermore, the capacity for real-time forecasting and instant data processing enables a swift and efficacious response to fluctuations in water levels. These advantages offer substantial benefits in critical domains such as flood risk prevention, maritime traffic management and the protection of coastal regions.

Machine learning (ML) encompasses a range of algorithms and techniques that facilitate the acquisition of knowledge and the generation of predictions based on data. In general, algorithms can be classified into three main categories: supervised, unsupervised, and reinforcement learning [20,21,22,23]. The objective of supervised learning is to construct a model that accurately represents the relationship between inputs and outputs, utilising a labelled dataset. These are further subdivided into categories such as classification and regression and are employed extensively, particularly for the resolution of prediction and classification issues. In unsupervised learning, unlabelled data is examined with the objective of identifying intrinsic patterns through self-learning. Clustering and dimensionality reduction are examples of this field. Reinforcement learning is a type of machine learning that aims to train the agent (robot, vehicle, etc.) with the reactions it receives from the environment without training data. It enables learning through reward and punishment mechanisms and is generally used in gaming or robotic applications. Each of these methods is specifically designed and implemented for different data structures and problems [24,25].

Supervised learning is a fundamental paradigm of machine learning and is widely used in solving various data analysis and prediction problems. In this approach, the learning process progresses towards a specific goal, which is typically an output variable that is predicted based on independent variables within the data set. The fundamental classification and regression algorithms include linear regression (LR), support vector machines (SVMs), decision trees (DTs), random forests (RFs), K-nearest neighbours (KNNs), and artificial neural networks (ANNs). These methods have been successfully applied to a range of data sets, facilitating the analysis and prediction of data.

This paper investigates the capacity of machine learning (ML) methods to predict water levels at both ends of a sea strait, specifically the Bosphorus. The ML models were constructed using meteorological data from ERA5, the reanalysis product of the European Centre for Medium-Range Weather Forecasts (ECMWF), along with data on the Danube River discharges from the Global Runoff Data Centre (GRDC) and major tidal harmonics specific to the region. The performance of the ML methods was evaluated with respect to the length of the training data sets, the type of selection (i.e., regular or random) of the training data sets, the data preprocessing, and the choice of independent variables. The contribution of each component of water level (WL) to the prediction ability was also evaluated. This study aims to demonstrate the high predictive capability of ML methods for calculating water levels (WLs) at each end of a sea strait from their components, which can be readily obtained from open sources. The structure of the paper is as follows: (1) introduction, (2) site description, (3) data description, ML methodology and application, (4) results and discussion, and (5) summary and conclusions.

2. Site Description

The Bosphorus is the most dynamic component of the Turkish Straits System (TSS), which also comprises the Sea of Marmara and the Dardanelles, and connects the Black Sea to the Sea of Marmara (Figure 1). It presents a two-layer flow structure (Figure 1, bottom panel), comprising (1) the upper layer flowing from the Black Sea to the Sea of Marmara and (2) the lower layer flowing in the opposite direction. The exchange occurs throughout the TSS between the denser Mediterranean water and the brackish Black Sea water [9,26].

The Bosphorus is a long, narrow, and relatively shallow strait with a length of ~31 km. Its sinuous geometry causes significant shifts in the orientation of the strait along its course. The width at the surface varies between ~0.7 and 3.5 km, with an average of ~1.3 km. It has an uneven topography with two sills (the south and north sills) close to the entrances, which play a significant role in its flow structure. The average salinities are ~38 ppt at the south entrance (the Sea of Marmara) and ~18 ppt at the north entrance (the Black Sea). This density difference drives the northward lower-layer flow [27,28,29].

The rate of evaporation from the Sea of Marmara is greater than that of the Black Sea. Furthermore, the Black Sea receives more freshwater (in the form of precipitation and river runoff) than the Sea of Marmara. The net water excess is compensated by the southward flow of the upper layer in the Bosphorus. This excess of water is the primary mechanism responsible for the observed difference in water levels between the northern and southern entrances. However, severe meteorological conditions, which occur predominantly during the autumn and winter, frequently disrupt the typical two-layer flow structure, resulting in a one-layer flow in either direction, contingent on the severity of the storm. It is during these harsh conditions that the extreme water level (WL) differences and flow conditions are observed [30,31]. The strait is one of the busiest waterways in the world and is closed to ship traffic for the majority of these events. Consequently, the WL difference is a more dynamic component of the strait’s flow than the density difference. The flow in the strait responds to changes in the barotropic forcing conditions with a phase lag. The layer thicknesses vary along the Bosphorus as a result of the WL difference between the two ends and the mixing processes induced by the morphology of the strait.

Numerous numerical and field studies have been conducted with the aim of elucidating the intricate flow structure of the Bosphorus. As the difference in water level is the most dynamic driving force, the majority of these studies employed this property as the reference point for identifying the hydrodynamic structure of the strait [8,32,33,34,35,36,37]. Measurements of water level (WL) at both ends have therefore consistently been of paramount importance in field studies. Furthermore, WL data have constituted the most significant input at the open boundaries of numerical models of the Bosphorus [33,35,37,38].

The variability of the WL at both entrances of the strait is determined by three major components: wind setup, atmospheric pressure in the region, and river discharges into the Black Sea. The Danube River is one of three major rivers flowing into the Black Sea, contributing 60% of the whole river runoff into this sea by itself, the others being the Dnieper and the Dniester. Collectively, these three rivers account for 80% of the total river runoff into the Black Sea. The combined volume of water from the Dnieper and Dniester Rivers therefore represents only approximately 20% of the total runoff, i.e., about one-third of the Danube River’s contribution. In contrast to some other straits, such as the Strait of Gibraltar, the tidal ranges in the Bosphorus are relatively weak, with amplitudes of only 10 cm or less [39,40]. This is one of the reasons why the Bosphorus could also be classified as a non-tidal strait [31].

In the present study, we sought to calculate the WLs at both ends of a sea strait from their components, taking the Bosphorus as the case study. To this end, we employed the ML approach, which is detailed in the following section, as it has proven highly successful in solving non-linear problems. The method is practical in that the majority of the components (i.e., wind and atmospheric pressure) can be obtained from online sources such as ECMWF.

3. Data Description, Methodology and ML Application

3.1. Data Description

In this study, we used both the observed and modelled data for the period between October 2004 and October 2005 to predict WLs at both entrances of the Bosphorus. Some properties of the parameters are summarised in Table 1. The water levels were measured at 1 h intervals at both entrances of the strait by Taisei Corporation, Japan, on behalf of the General Directorate of Ports, Airports and Railways Construction of Turkey (DLH).

The wind speeds and directions at 10 m above sea level and the mean sea level pressures were extracted from the high-resolution ECMWF atmospheric reanalysis ERA5 [41] at locations close to the north and south exits of the strait (Figure 1, upper left panel). The ERA5 dataset provides a global atmospheric reanalysis with a 0.25° × 0.25° spatial resolution and a 1 h temporal resolution. The authors of [42] have shown that ERA5 data are highly consistent with the wind speeds measured in the Bosphorus.

We obtained the daily average Danube River discharges from the Global Runoff Data Centre (GRDC). The discharges were measured 0.6 m above the bottom of the river at a station (Figure 1), which is the last station before the river flows into the Black Sea in Romania [43]. To match the 1 h interval of the rest of the data set, each daily discharge was assumed to be measured at the 12th hour of its day, and the hourly values between consecutive days were filled by linear interpolation.
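The gap-filling step can be illustrated with a minimal sketch. The function name `daily_to_hourly` and the discharge values are hypothetical, and the sketch only assumes what the text states: one value per day, placed at hour 12, with straight-line interpolation in between.

```python
def daily_to_hourly(daily_q):
    """Linearly interpolate daily mean discharges to 1 h steps.

    Each daily value is assumed to be valid at hour 12 of its day, so
    hourly values are produced only between the first and the last noon.
    """
    hourly = []
    for day in range(len(daily_q) - 1):
        q0, q1 = daily_q[day], daily_q[day + 1]
        for h in range(24):
            # h hours after noon of `day`, on the line joining q0 and q1
            hourly.append(q0 + (q1 - q0) * h / 24.0)
    hourly.append(daily_q[-1])  # value at the final noon
    return hourly


# Hypothetical discharges in m^3/s (illustrative only, not GRDC data).
q_hourly = daily_to_hourly([6000.0, 6240.0, 6120.0])
```

Three daily values give 49 hourly values, running from the first noon to the last; the half days before the first and after the last noon are left undefined rather than extrapolated.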

The periodic WL fluctuations due to tide were included considering the main harmonics calculated by the authors of [42]. They calculated tidal constituents from the observed WLs using a MATLAB-based code T-TIDE tool [44]. This approach is commonly used in oceanography and performs a classical harmonic analysis by applying nodal corrections. The main harmonics considered in this study are seen in Table 2.
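Once the constituents are known, the tidal part of the WL can be rebuilt as a sum of cosines. The sketch below is an illustration only: the amplitudes and phases are hypothetical placeholders (not the values of Table 2), the choice of M2 and S2 is an assumption, and nodal corrections are omitted for brevity. The angular speeds are the standard astronomical values for these constituents.

```python
import math

# Angular speeds in degrees per hour (standard values for these constituents).
SPEEDS_DEG_PER_H = {
    "M2": 28.9841042,
    "S2": 30.0,
    "K1": 15.0410686,
    "O1": 13.9430356,
}


def tide_level(t_hours, constituents):
    """Sum of harmonics: amplitude * cos(speed * t - phase), in metres."""
    total = 0.0
    for name, (amp_m, phase_deg) in constituents.items():
        arg = math.radians(SPEEDS_DEG_PER_H[name] * t_hours - phase_deg)
        total += amp_m * math.cos(arg)
    return total


# Hypothetical amplitudes (m) and phases (deg); NOT the values of Table 2.
demo = {"M2": (0.010, 0.0), "S2": (0.005, 0.0)}
level_at_0 = tide_level(0.0, demo)
```

With zero phases, both harmonics peak at t = 0, so the level there is simply the sum of the two amplitudes.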

Consequently, we used four variables as predictors to calculate WLs at each end of the Bosphorus. These predictors are wind speeds (W), mean sea level pressure (MSLP), tidal range (T), and the Danube River discharges (Q).

3.2. Machine Learning (ML) Methodology

ML is a branch of artificial intelligence that allows computers to learn by studying data without being directly programmed. This process typically occurs in three main steps: data collection and preparation, model training, and model evaluation. First, relevant data sets are collected and made available for analysis through various adjustments. Then, the selected machine learning algorithm is trained using the training data; in this phase, the model learns the relationships between inputs and outputs. Finally, the model’s performance is evaluated with test data and measured by metrics such as accuracy and error rates. After the well-performing model is optimised, it is ready to be used to make predictions with new data. The steps applied in creating a regression model for water level prediction in the current study are schematised in Figure 2. In the following sections, the methods that we used to create the regression model are explained under the main headings in Figure 2.

3.2.1. Data Collection

The sources and handling of the data used in the regression model are explained in detail in Section 3.1 (Data Description). To briefly summarise, simultaneous water level, wind speed, wind direction and atmospheric pressure data were obtained from hydrodynamic measurements and from global meteorological model results close to both entrances of the strait. In addition, daily average Danube River discharge data were obtained from the measurement station located at the Danube River delta outlet (Figure 1).

3.2.2. Data Pre-Processing

(a): Data Cleaning

Data cleaning reduces biases that may be present in the data set by minimising errors caused by missing or inaccurate data. It helps the model to produce unbiased and fair results by providing a more balanced data set. In this study, data cleaning was performed by removing missing and erroneous data from the data set.

(b): Data Reduction

Data reduction in machine learning is an important process that increases the performance and efficiency of models by reducing the size and complexity of the data set. Data reduction techniques enable the model to make more general and accurate predictions by selecting only the most important and meaningful features. Thus, it increases the flexibility of the model and helps it perform better on different data sets. In this study, various data reduction methods were applied to the data set while selecting independent variables to be used in the regression model. In this context, Principal Component Analysis (PCA), t-test, Variance Inflation Factor (VIF), and Step Forward Selection (SFS) methods, which are summarised below, were used.

Principal Component Analysis (PCA):

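PCA is a statistical method used to represent the information within a multivariate data set with fewer variables and minimal loss of information. PCA finds orthogonal linear combinations of the original variables (the principal components) that capture most of the variance and represents the data set in a lower-dimensional space through these components. The weights of the original variables in the leading components indicate which variables carry the most information.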
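A compact way to see what PCA does is to compute it with a singular value decomposition. The following is a minimal sketch on synthetic data, assuming NumPy is available; the function name `pca` and the test matrix are illustrative and do not reproduce the study's data or software.

```python
import numpy as np


def pca(X, n_components):
    """Return (scores, explained_variance_ratio) from the SVD of standardised X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score every column
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)               # share of variance per component
    scores = Z @ Vt[:n_components].T              # data in the reduced space
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
u = rng.normal(size=500)
# Two strongly correlated synthetic columns plus one independent column.
X = np.column_stack([u, u + 0.1 * rng.normal(size=500), rng.normal(size=500)])
scores, ratio = pca(X, 2)
```

Because two of the three columns are almost identical, the first component alone carries well over half of the total variance, which is the kind of redundancy PCA is meant to expose.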

t-test:

The t-test is used to determine whether the effect of each independent variable in the regression model on the dependent variable of the model is statistically significant. The t-test determines whether the variable is significant by testing the following hypotheses:

Null Hypothesis (H₀): A particular regression coefficient (β_{i}) is equal to zero (this independent variable has no significant effect on the dependent variable).

Alternative Hypothesis (H₁): A particular regression coefficient (β_{i}) is different from zero (this independent variable has a significant effect on the dependent variable).

The t statistic for each regression coefficient is calculated as follows:

t_{i} = \frac{β_{i}}{SE(β_{i})}

(1)

SE(β_{i}) = \sqrt{\frac{σ^{2}}{\sum (x_{i} - \bar{x})^{2}}}

(2)

Here, SE is the standard error, σ^{2} is the variance of the error terms, x_{i} is the independent variable, and \bar{x} is the mean of the independent variable. The calculated t-value is converted to the p-value according to the t-distribution. If the p-value is less than the specified significance level α (usually α = 0.05), the null hypothesis is rejected, and it is concluded that the relevant independent variable has a significant effect on the dependent variable.
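Equations (1) and (2) can be traced step by step for the single-predictor case. The sketch below uses made-up numbers and a hypothetical function name, and it estimates σ² from the residuals with n − 2 degrees of freedom, which is an assumption the text does not spell out. Converting the t-value to a p-value requires the t-distribution and is left out here.

```python
import math


def slope_t_statistic(x, y):
    """Slope, its standard error and t statistic for y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = sse / (n - 2)          # error variance, two fitted parameters
    se = math.sqrt(sigma2 / sxx)    # Equation (2)
    return beta, se, beta / se      # Equation (1)


# Made-up sample, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta, se, t_value = slope_t_statistic(x, y)
```

A large |t| relative to the critical value of the t-distribution leads to rejecting H₀ for that predictor.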

Variance Inflation Factor (VIF):

VIF is used to detect multicollinearity between independent variables in multiple linear regression analysis. The presence of multicollinearity makes the estimated regression coefficients unreliable, which negatively affects the overall accuracy and reliability of the model. It also leads to insignificant changes in model validation and test parameters. If VIF = 1, there is no multicollinearity between independent variables. If 1 < VIF < 5, there is moderate multicollinearity at an acceptable level. If VIF > 5, there is a high level of multicollinearity that needs attention.
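For reference, the VIF of predictor j is commonly defined as 1 / (1 − R_j²), where R_j² is obtained by regressing predictor j on all the other predictors. The following is a minimal sketch of that definition on synthetic data, assuming NumPy is available; it is not the code used in the study.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the rest."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)   # nearly a copy of column a
vifs = vif(np.column_stack([a, b, c]))
```

Columns `a` and `c` receive VIF values far above 5 because each is almost fully explained by the other, whereas the independent column `b` stays close to 1.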

Step Forward Selection (SFS):

The most successful variable in expressing the dependent variable is selected, and the model is established with only this variable. Model performance is evaluated, and the best one among the remaining variables is selected and added to the model. This time, a new model is established with two variables and how the model performance changes is examined. The cycle continues until the specified performance measure is reached.
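The selection loop can be sketched as a greedy search. In this illustration the performance measure is the training RMSE of an ordinary least-squares fit and the loop stops when the gain falls below a threshold; both choices, as well as the names `fit_rmse`, `forward_select` and `min_gain`, are assumptions made for the example, not details taken from the study.

```python
import numpy as np


def fit_rmse(X, y):
    """RMSE of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ coef) ** 2)))


def forward_select(X, y, names, min_gain=0.01):
    """Add one predictor at a time, always the one that lowers RMSE most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = float(np.std(y))            # RMSE of the mean-only model
    while remaining:
        trials = [(fit_rmse(X[:, selected + [j]], y), j) for j in remaining]
        rmse, j = min(trials)
        if best - rmse < min_gain:     # stop when the gain is negligible
            break
        selected.append(j)
        remaining.remove(j)
        best = rmse
    return [names[j] for j in selected]


rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
# Synthetic target: depends on columns 0 and 2 only; column 1 is irrelevant.
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.05 * rng.normal(size=400)
chosen = forward_select(X, y, ["x0", "x1", "x2"])
```

On this synthetic target the strongest predictor is picked first, the weaker one second, and the irrelevant column is never added.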

(c): Data Transformation

Data transformation in the regression model is the process of making the data more suitable for building a model by applying mathematical transformations to independent or dependent variables. These operations are performed to increase the accuracy and validity of the model, to make non-linear relationships linear, to reduce the heteroscedasticity problem, to ensure normal distribution, and to reduce the effect of outliers. In this study, the Z-score method given in Equation (3) was used for data transformation. It is used to make data comparable by normalising data at different scales and to obtain statistically significant results in data analysis.

Z_{i} = \frac{x_{i} - μ}{σ}

(3)

Here, x_{i} represents an observation value in the data set, μ represents the average of the data set, and σ represents the standard deviation of the data set.
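Equation (3) amounts to a few lines of code. The sketch below uses the population standard deviation, which is an assumption (the text does not say whether the population or the sample form was used), and the pressure values are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardise values to zero mean and unit standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)   # population standard deviation
    return [(v - mu) / sigma for v in values]


# Hypothetical MSLP values in hPa, for illustration only.
z = z_scores([1008.0, 1012.0, 1016.0, 1020.0])
```

After the transformation, predictors measured in very different units (hPa, m/s, m³/s) share a common, dimensionless scale.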

3.2.3. Model Training

Model training is the process in which the machine learning algorithm learns from the training data. This phase allows the model to be trained on examples in the data set to perform best at a specific task (e.g., prediction or classification). For model training, the training data set must first be prepared. To avoid misleading results in model evaluation, training and test data were handled separately: the test data set was not included in model training and was used only in the model testing phase. Although the literature offers various suggestions about the sizes of the training and test data sets, three alternatives were evaluated in this study: (i) 80% (~9.6 months) of the whole data set for training and 20% for testing; (ii) 50% (6 months) for training and 50% for testing; and (iii) ~33% (4 months) for training and ~67% for testing. The literature generally recommends that 80% of the total data set should be training data and 20% test data. However, since the water level measurements of the Bosphorus are available for a limited period of one year, the 20% test data length is relatively short to test the accuracy of the regression model. Therefore, a more reliable assessment of the accuracy and flexibility of the regression model was made with alternatives in which the training data were shortened and the test data lengthened. The training data were randomly selected to include samples from all months in the 1-year data set (Figure 3), and the remaining data were used as test data. After the training data set was determined, various regression models were trained in the second stage. The models are briefly summarised below.
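One possible way to draw a random training set that covers every month is sketched below. Whether the study sampled individual hourly records or contiguous blocks within each month is not stated in this section, so drawing individual records is an assumption here, as are the function name `split_by_month` and the toy month labels.

```python
import random
from collections import defaultdict


def split_by_month(months, train_fraction, seed=0):
    """Randomly pick train_fraction of the samples inside every month."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for idx, month in enumerate(months):
        by_month[month].append(idx)
    train = []
    for idx_list in by_month.values():
        k = round(train_fraction * len(idx_list))
        train.extend(rng.sample(idx_list, k))
    train_set = set(train)
    test = [i for i in range(len(months)) if i not in train_set]
    return sorted(train), test


# Toy record: 100 samples in each of the 12 months.
months = [m for m in range(1, 13) for _ in range(100)]
train_idx, test_idx = split_by_month(months, 0.8)
```

Sampling inside each month, rather than across the whole year at once, guarantees that no season is missing from the training set even for the smaller training fractions.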

Linear Regression (LR):

Linear regression is a statistical technique used to model the relationship between a dependent/predicted variable (target or output) and one or more independent/predictor variables (inputs or features). Its purpose is to determine the linear relationship between the independent variables and the dependent variable.

Regression Trees (RT):

Regression Trees are a type of decision tree and are used to predict a continuous target variable (dependent variable). Decision trees work by splitting the observations in the data set and building a model for each split. This process is accomplished by dividing the data set into homogeneous subgroups and making predictions on these groups.

Support Vector Machine Regression (SVMR):

Support Vector Machine Regression (SVMR) is an extension of the Support Vector Machines (SVM) algorithm and is used to predict continuous target variables. Instead of separating data using a hyperplane, SVMR aims to minimise the prediction error of the target variable. This algorithm creates a margin around the target variable and tries to predict the data as close to that margin as possible.

Gaussian Process Regression (GPR):

GPR is a powerful Bayesian method used to estimate continuous variables. Gaussian processes model relationships between data points, accounting for uncertainty when making predictions. GPR uses a covariance (kernel) function to model the structure and relatedness of data when making predictions.

Artificial Neural Networks (ANN):

ANNs are inspired by complex information processing systems such as the human brain. They are widely used in computer science and artificial intelligence to detect, classify or predict complex data patterns. Neural networks have been used successfully in a variety of applications, including image and sound recognition, natural language processing, and gaming strategies.

3.2.4. Model Evaluation

Model evaluation is the process of evaluating the performance of an ML model. It helps determine the accuracy, reliability and generalisation ability of the model. There are various methods for model evaluation. In this study, the model was evaluated using different methods on the training data set (validating model) and the test data set (testing model).

(a): Validating Model:

Model validation processes performed on training data provide insight for model evaluation. The final evaluation is made after the model is tested.

R-square score (R²):

R square score is one of the metrics used in regression analysis to measure the success of the model. R-squared is a statistical measure of how close the data is to the fitted regression line. It usually takes a value between 0 and 1. R² = 1 value means that the model makes perfect predictions.

R^{2} = 1 - \frac{SSE}{SST}

(4)

SSE = \sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}

(5)

SST = \sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}

(6)

Here, SSE is the sum of squared errors, SST is the total sum of squares, y_{i} represents the dependent variable, \bar{y} represents the average of the dependent variable, and \hat{y_{i}} represents the estimated dependent variable value.

Adjusted R-square score (Adjusted R²):

Adjusted R-Square addresses one of the disadvantages of R-Square: sensitivity to model complexity. R-Square may increase as the number of independent variables in the model increases, even if the predictive ability of the model does not increase, and therefore, it can be misleading. Adjusted R-Square addresses this drawback and more accurately evaluates the fit of the model.

Adjusted R^{2} = 1 - \frac{(1 - R^{2}) (n - 1)}{n - k - 1}

(7)

Here, n represents the total sample size, and k represents the number of independent variables. Adjusted R-Square follows the same interpretation principle as R-Square. A value close to 1 indicates that the model fits the data well, while a value close to 0 indicates that the model poorly explains the data.
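The two scores can be checked on a small made-up sample. The sketch follows Equations (4) to (6) for R² and uses the usual form of the adjusted score, 1 − (1 − R²)(n − 1)/(n − k − 1), which approaches 1 for a good fit; the numbers and the choice k = 2 are illustrative only.

```python
def r2_score(y, y_hat):
    """R^2 = 1 - SSE / SST."""
    y_bar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sst = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - sse / sst


def adjusted_r2(r2, n, k):
    """Penalise R^2 for the number of predictors k given n samples."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Made-up water levels in metres, for illustration only.
y = [0.10, 0.20, 0.30, 0.40, 0.50]
y_hat = [0.12, 0.18, 0.31, 0.41, 0.48]
r2 = r2_score(y, y_hat)                   # about 0.986
adj = adjusted_r2(r2, n=len(y), k=2)      # about 0.972
```

The adjusted score is lower than R² whenever k > 0 and R² < 1, and the gap widens as predictors are added without a matching gain in fit.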

K-Fold Cross Validation:

The training data set is divided into K equal parts. One part is used for validation while the other parts are used as the training data set. This process is repeated K times, each time a different piece is selected as validation data. As a result, K different validation values are obtained and the performance of the model is evaluated by averaging these values. K-Fold Cross Validation (K is taken as 5 in the analyses) is used for validation results of the following analyses.
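The fold bookkeeping can be sketched as follows. The generator name `k_fold_indices` is hypothetical, and the folds are taken as contiguous blocks for simplicity; in practice the indices are often shuffled first, and the study's exact procedure is not specified here.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(23, k=5))
```

Each sample is used for validation exactly once; fitting the model on every `train` part, scoring it on the matching `val` part, and averaging the K scores gives the cross-validated performance.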

(b): Testing Model:

After the models are trained, the model is tested using the test data set separated from the data set in the first stage. Models are evaluated in line with the calculated error metrics listed below.

Mean Squared Error (MSE):

It represents the average of the squares of the differences between the actual values and the predicted values. A lower MSE value indicates better model performance.

MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y_{i}})}^{2}

(8)

Mean Absolute Error (MAE):

It represents the average of the absolute differences between actual values and predicted values. Similar to MSE, a lower MAE value indicates better model performance.

MAE = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y_{i}}|

(9)

Root Mean Squared Error (RMSE):

It is calculated as the square root of the average of the squares of the differences between the actual values and the predicted values. A lower RMSE value indicates a better performance of the model. It takes a value between 0 and 1. A value of RMSE = 0 means that the model makes error-free predictions.

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i})}^{2}}

(10)
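The three error metrics can be written in a few lines. The sketch below uses made-up water levels in metres purely for illustration; they are not data from this study.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    return float(np.sqrt(mse(y_true, y_pred)))


# Hypothetical observed and predicted water levels (m).
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.40]
print(mse(observed, predicted), mae(observed, predicted), rmse(observed, predicted))
```

MAE and RMSE share the units of the water level, whereas MSE is in squared units, which is why MAE and RMSE are the more convenient figures to compare against the water level range.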

3.2.5. Model Tuning

Model tuning is the process of adjusting hyperparameters (configurable parameters of the model) to maximise the performance of machine learning models. This process is important to ensure that the model generalises better, to eliminate problems such as overfitting or underfitting, and to increase the accuracy of the model. Hyperparameter tuning methods reported in the literature include Grid Search, Random Search and Bayesian Optimisation [45,46,47]. In this study, the Bayesian optimisation method was used because it tends to produce better results with fewer trials than grid search or random search.

Bayesian Optimisation:

The Bayesian Optimisation technique is commonly used to tune hyperparameters in ML models. It rests on Bayes' theorem, which updates the probability of a hypothesis as new evidence is observed. It aims to optimise a specific objective function with the fewest possible trials: a probabilistic surrogate model of the objective is updated after each evaluation and is then used to select the next set of hyperparameters. This way, better results are achieved with fewer attempts.
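The idea can be illustrated with a deliberately small one-dimensional sketch. It assumes a Gaussian-process surrogate with an RBF kernel and a lower-confidence-bound acquisition rule; the study does not state which surrogate or acquisition function was used, and the objective below is a toy stand-in for a validation error as a function of one normalised hyperparameter.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.15):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a GP fitted to standardised y."""
    y_mean = y_obs.mean()
    y_scale = y_obs.std() or 1.0
    z = (y_obs - y_mean) / y_scale
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_new)
    mu = y_mean + y_scale * (K_s.T @ np.linalg.solve(K, z))
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, y_scale * np.sqrt(np.clip(var, 0.0, None))


def bayes_opt_minimise(objective, n_init=3, n_iter=12, kappa=2.0):
    """Minimise objective on [0, 1] using a lower-confidence-bound rule."""
    grid = np.linspace(0.0, 1.0, 201)
    x_obs = np.linspace(0.0, 1.0, n_init)          # simple initial design
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        mu, sigma = gp_posterior(x_obs, y_obs, grid)
        x_next = grid[np.argmin(mu - kappa * sigma)]  # explore + exploit
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = int(np.argmin(y_obs))
    return float(x_obs[best]), float(y_obs[best])


def toy_validation_error(h):
    """Hypothetical validation error with its minimum at h = 0.3."""
    return (h - 0.3) ** 2 + 0.05


best_x, best_y = bayes_opt_minimise(toy_validation_error)
print(best_x, best_y)
```

Only 15 evaluations of the objective are made here, whereas an exhaustive search over the same 201-point grid would need 201.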

4. Results and Discussion

The use of regression models is a common practice in the field of data analysis, employed to assess the relationships between predictor (independent) and predicted (dependent) variables and to make predictions based on these relationships. Nevertheless, the accuracy of these models in making predictions is contingent not only on the selection of appropriate solving algorithms but also on the choice of suitable predictors. The inclusion of inappropriate predictor variables, which are irrelevant or less relevant to the predicted variable, has the potential to impair the model’s performance, increase the risk of overfitting and elevate the computational costs. Such risks have the potential to diminish the overall performance and utility of the models in question. Accordingly, particular emphasis was placed on the selection of predictor variables in the course of this study. Furthermore, the PCA method was employed to ascertain the most appropriate predictor variables in conjunction with the t-test, SFS and VIF methods.

The water levels exhibit distinct characteristics at the two entrances of the Bosphorus, given the disparate meteorological, hydrological and geometrical properties of the adjacent basins. Consequently, this study has employed a separate regression model for water level prediction at each entrance of the Bosphorus, with its own selection of algorithm and predictors.

4.1. Selection of Predictors (Independent Variables) or Inputs

In total, as outlined in Section 2 (Site Description), the combination of four variables was employed as predictors (Table 3): the hourly mean sea level pressures (MSLP), the hourly directional wind speed components (U: east-west wind component, V: north-south wind component, W: resultant wind speed), the hourly Danube River discharges (Q-Danube) and tidal components (T). Given that the water level varies in a linear fashion with MSLP and T, we selected a single option for these two predictors, thereby neglecting any potential phase lag in these parameters. In contrast, with regard to wind, a number of different options were explored, as the relationship between wind and WL is non-linear. A three-hour phase lag was calculated between wind and water levels in the Black Sea, while a 14-h phase lag was calculated at the entrance to the Sea of Marmara. These values differ slightly from those reported in previous studies [48,49]. Consequently, in addition to the simultaneous wind data, we also analysed phase-lagged wind cases, which are indicated with a subscript “lagged” in Table 3. Given that the wind setup is also proportional to the square of wind speed, particularly in the north component within the Bosphorus, we consider the square of the wind component in addition to the phase lag. Previous studies have also demonstrated a phase lag of approximately one to two months between the Danube River runoff into the Black Sea and the WL of the Bosphorus [31,42,50]. Accordingly, both concurrent and lagged river discharges were considered alongside the WLs in the course of the analyses. A 30-day phase lag was calculated for the river discharges arriving at the north entrance of the strait, and this value was used as the lag time for Q_lagged. The eleven cases in Table 3 were constructed for analysis purposes. All regression models were trained with 80% of the data to identify the optimal model. The impact of varying training data lengths is investigated in the subsequent section, with analyses conducted using 50% and 33% of the data.
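A possible way to build the lagged and squared predictors is sketched below. It assumes hourly sampling, so that the 3 h, 14 h and 30-day lags quoted above become shifts of 3, 14 and 720 samples. The function name and the synthetic series are hypothetical, and the plain square of the wind component is used because the text does not say whether the sign of the wind is preserved.

```python
import numpy as np


def lag_series(x, lag_steps):
    """Return a copy of x delayed by lag_steps samples; the leading gap is NaN."""
    x = np.asarray(x, dtype=float)
    if lag_steps == 0:
        return x.copy()
    out = np.full_like(x, np.nan)
    out[lag_steps:] = x[:-lag_steps]
    return out


# Synthetic hourly series standing in for the real predictors (assumed data).
hours = np.arange(24 * 60)
v_wind = np.sin(2 * np.pi * hours / 24.0)               # north-south wind
q_danube = 6000.0 + 500.0 * np.sin(2 * np.pi * hours / (24 * 30))

v_lagged_north = lag_series(v_wind, 3)                  # 3 h lag, Black Sea side
v_lagged_south = lag_series(v_wind, 14)                 # 14 h lag, Marmara side
q_lagged = lag_series(q_danube, 30 * 24)                # 30-day lag
v_lagged_squared = v_lagged_north ** 2                  # squared wind component
```

Rows containing NaN at the start of the record would have to be dropped before training, since no earlier observation exists to fill them.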

The initial step involved the application of the t-test to the variables, with the objective of determining the statistical significance of each predictor variable for the WLs. The results of the t-test demonstrated that the three components of WLs (atmospheric pressure (MSLP), wind (W, U, V) and river discharges (Q)) were statistically significant with respect to the WLs at both entrances of the strait. However, the tide was found to be insignificant, in accordance with the findings of previous studies [39,40]. While the water level of the Bosphorus varies between −0.2 m and 0.6 m on an annual scale, the tidal effect is only around 0.03 m. Therefore, the tidal component, which was found to be insignificant by the t-test, was excluded from the regression analysis as a predictor in subsequent stages.

Secondly, the SFS data reduction technique was applied to the remaining predictors in Table 3 and the WLs with the objective of reducing the number of variables while retaining the most important and meaningful input combination. To this end, all machine learning algorithms (linear regression, regression trees, support vector machine regression, Gaussian process regression, and artificial neural networks) were employed separately, as detailed in Section 3.2.3. The SFS technique was then applied to identify the optimal predictor variable combination. The SFS increases the accuracy of predictions and enhances the flexibility of the model. Furthermore, the SFS enhances the generalisability of the regression model, facilitating superior performance across diverse data sets and reducing the risk of overfitting. The method starts with the single variable that is most successful in predicting the dependent variable (Table 4) and evaluates the model success of variable combinations by including other variables in the model one at a time. The optimal predictor combination (1, 2, 3, 4 in Table 5 and Table 6) with the optimal predictive capacity is identified through this process.
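The forward-selection loop can be sketched as follows. As an assumption for illustration, a linear least-squares model scored by validation RMSE stands in for the five ML algorithms, and the stopping tolerance `tol` is a hypothetical parameter; the study's actual selection criterion is not reproduced here.

```python
import numpy as np


def fit_predict_linear(X_train, y_train, X_val):
    """Fit least squares with an intercept and predict on X_val."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_val)), X_val]) @ coef


def sequential_forward_selection(X_train, y_train, X_val, y_val, tol=1e-4):
    """Add one predictor at a time while validation RMSE improves by >= tol."""
    remaining = list(range(X_train.shape[1]))
    selected, best_rmse = [], np.inf
    while remaining:
        trials = []
        for j in remaining:
            cols = selected + [j]
            pred = fit_predict_linear(X_train[:, cols], y_train, X_val[:, cols])
            trials.append((float(np.sqrt(np.mean((y_val - pred) ** 2))), j))
        rmse, j_best = min(trials)
        if best_rmse - rmse < tol:
            break                      # no candidate helps enough; stop
        selected.append(j_best)
        remaining.remove(j_best)
        best_rmse = rmse
    return selected, best_rmse


# Synthetic check: only columns 2 and 0 carry information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 3.0 * X[:, 2] + 1.0 * X[:, 0]
selected, final_rmse = sequential_forward_selection(X[:400], y[:400], X[400:], y[400:])
print(selected, final_rmse)
```

The order in which columns enter `selected` plays the role of the importance ranking 1 to 4 discussed in the text.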

The results demonstrate that the lagged Danube River discharges (Q_lagged) were more effective than the other predictors in predicting water level (WL) at the northern end of the strait. However, the MSLP proved to be the most influential parameter at the southern entrance. At the north entrance, the correlation between MSLP and WL and that between the lagged north wind component (V_lagged) and WL are of the same order (R² = 0.18) and visibly higher than the correlation between the east-west component (U) and WL (R² = 0.06). In contrast to the north entrance, the correlation between U and WL is greater than that between V and WL at the south entrance. This can be attributed to the east-west fetch of the wind being longer than the north-south fetch in the Sea of Marmara.

Comparing the meteorological effects with one another, MSLP is superior to wind at the Sea of Marmara entrance, as evidenced by the data in Table 4, which is related to the geomorphological properties of both seas. The Sea of Marmara is connected to the Aegean Sea via the Dardanelles Strait and to the Black Sea via the Bosphorus. The alteration in water level resulting from changes in atmospheric pressure gives rise to an inflow and outflow through these straits, the magnitude of which is less constrained due to the restricted nature of both straits. In contrast, the Black Sea is a relatively closed body of water, and the volume of water affected by wind stresses cannot be evacuated through the sole strait (the Bosphorus) in the same manner as in the Sea of Marmara [48]. It can, therefore, be concluded that, in the Black Sea, the relative effect of wind is of a similar order to that of air pressure and is much larger than that observed in the Sea of Marmara. Accordingly, four variables were selected as predictor variables for the regression model at each entrance. The order of importance was graded from 1 to 4 for each regression model (Table 4). In this evaluation, a rating of 1 indicates the greatest success in explaining water level change, while a rating of 4 indicates the least.

Once the optimal predictor combination (1, 2, 3, 4) had been identified, the variance inflation factor (VIF) of each predictor was calculated to ascertain whether multicollinearity was present among the selected parameters. The VIFs for the Black Sea entrance model were calculated as 1.15 for predictor 1 (Table 4, second last column), 1.14 for predictor 2, 1.40 for predictor 3 and 1.32 for predictor 4. The VIF values for the Sea of Marmara entrance model were calculated as 1.16, 1.15, 1.57, and 1.68 for Predictors 1, 2, 3, and 4, respectively. The VIF results (1 < VIF < 5) indicated low and acceptable multicollinearity between the selected predictors at each entrance of the strait.
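For reference, the VIF of a predictor is 1/(1 − R_j²), where R_j² is obtained by regressing that predictor on all the others. A minimal sketch of the calculation is given below; the function name and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


rng = np.random.default_rng(0)
independent = rng.normal(size=(2000, 3))                 # unrelated predictors
collinear = np.column_stack([
    independent[:, 0],
    independent[:, 1],
    independent[:, 0] + 0.1 * independent[:, 2],         # near-copy of column 0
])
print(variance_inflation_factors(independent))
print(variance_inflation_factors(collinear))
```

Unrelated predictors give values close to 1, as in the ranges reported above, while a near-duplicate predictor pushes the VIF far above the commonly used threshold of 5.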

Among all the methods considered, the regression trees (RT) method yielded the best results for the Black Sea entrance model, while the Gaussian process regression (GPR) method demonstrated superior performance for the Sea of Marmara entrance. All results in Table 5 belong to the RT algorithm, and all results in Table 6 belong to the GPR algorithm. The RT model makes predictions contingent on three hyperparameters: the number of learners, the minimum leaf size and the learning rate. The flexibility of the model increases with a reduction in the minimum leaf size and an increase in the number of learners. On the other hand, the performance of GPR depends on the standard deviation of observation noise (Sigma), the basis function and the kernel function hyperparameters. The optimisation of these hyperparameters is described as model tuning (Section 3.2.5), which is the final step in obtaining the most successful and flexible predicting model.

The predictive capacity of the optimal models was gradually enhanced for the validation and test stages, as evidenced in Table 5 and Table 6. Despite differing levels of contribution, each predictor contributed to a notable enhancement in the predictive capacity of the model. Combinations with the “optimised” extension are combinations where the model hyperparameters are optimised to improve the model’s predictive ability. R² increased by over 50% (from R² = 0.60 to R² = 0.94) when all selected predictors were used instead of only the best predictor variable (Q_lagged in Table 4) at the Black Sea entrance. The adjusted R² values, which are not given here, were the same as the R² values. This indicates that the addition of each predictor resulted in a genuine improvement in the predictive model rather than an inflation of the correlation metrics (R²s). The calculated errors for all considered error metrics and predictor combinations are relatively minor. The maximum errors are in the order of a few centimetres (MAE = 1.7 cm and RMSE = 2.3 cm) for the optimal predictor combination (presented in the last row of Table 5), corresponding to approximately 7% of the mean water level (approximately 32 cm) and approximately 4% of the water level range (approximately 63 cm).

The correlations are slightly weaker, and the error metrics are somewhat higher at the entrance to the Sea of Marmara than at the Black Sea entrance for the specified variable combinations (Table 4, last column). The model results exhibited a notable improvement, with MAE and RMSE values decreasing by a factor of nearly three from the initial model (1) to the best model (1, 2, 3, 4 optimised). The MAE and RMSE were calculated as 2.3 cm and 3.4 cm, respectively, for the optimal predictor-combination model. These values correspond to a few percent of the water level range (72 cm) over the one-year period under consideration. As with the Black Sea entrance regression model, the adjusted R² values were identical to the R² values across all predictor combinations presented in the sixth column of Table 6.

4.2. The Effect of Training Data Length on the Prediction of Model Success

The accuracy of an ML model is also contingent upon the length of the training data set. The results obtained from the test model are, in essence, the predicted outcomes of the trained model. Consequently, a model trained on a longer data set is expected to yield superior predictive outcomes. However, as the share of the data used for training increases, the share left for testing inevitably decreases. A model that reaches a comparable predictive capacity when trained on a smaller data set is therefore more efficient than one that requires a larger data set. In order to investigate this further, we examined the predictive success of three different training data lengths. In addition to the optimal model combinations (1, 2, 3, 4 and 1, 2, 3, 4 optimised, as detailed in Table 5 and Table 6), which were trained with 80% of the data, two further models were considered: one trained with 50% of the data and one trained with 33%. As previously stated, the training data was randomly selected from each month for each case to represent the general behaviour of the data.
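One way to draw the training data randomly from each month is sketched below. It assumes that the same fraction is sampled from every calendar month; the exact weighting used in the study is not specified here, and the function name is hypothetical.

```python
import numpy as np


def monthly_random_split(months, train_fraction=0.8, seed=0):
    """Randomly pick train_fraction of the samples within each month.

    months: array with the calendar month (1-12) of every sample.
    Returns sorted index arrays (train_idx, test_idx).
    """
    months = np.asarray(months)
    rng = np.random.default_rng(seed)
    train_parts = []
    for m in np.unique(months):
        idx = np.flatnonzero(months == m)
        n_train = int(round(train_fraction * len(idx)))
        train_parts.append(rng.choice(idx, size=n_train, replace=False))
    train_idx = np.sort(np.concatenate(train_parts))
    test_mask = np.ones(len(months), dtype=bool)
    test_mask[train_idx] = False
    return train_idx, np.flatnonzero(test_mask)


# Synthetic example: 100 samples in each of the 12 months.
months = np.repeat(np.arange(1, 13), 100)
train_idx, test_idx = monthly_random_split(months, train_fraction=0.8)
print(len(train_idx), len(test_idx))
```

Because every month contributes to both subsets, seasonal patterns are represented in training and testing alike, unlike a split that takes one contiguous block of the year.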

The discrepancy between the predicted and observed water levels (WLs) for the test stages is illustrated in Figure 4 and Figure 5. Figure 4 depicts the Black Sea entrance model, while Figure 5 depicts the Sea of Marmara entrance. The correlations between the time series were found to be highly significant at both entrances, with a correlation coefficient (R) exceeding 0.9. The enhancement in prediction outcomes was more pronounced at the lowest water levels and at the Black Sea entrance in comparison to the Sea of Marmara. As anticipated, the augmented length of trained data led to enhanced model prediction accuracy at both endpoints. The application of 80% of the data (RMSE = 0.028 m) for training resulted in an improvement of approximately 24% in accuracy compared to the use of 33% of the data (RMSE = 0.037 m) at the Black Sea entrance (Figure 4b,f). The incorporation of model tuning led to a further improvement in accuracy (Figure 4h), which was less than the contribution of data length.

The application of the case with 80% trained data led to an improvement in accuracy of approximately 10% (from RMSE = 0.041 m to 0.037 m) in the case of the Sea of Marmara model (see Figure 5b–f). The positive impact of model tuning is more pronounced in the Sea of Marmara than in the Black Sea (Figure 4f–h and Figure 5f–h). The largest reduction in RMSE (from 0.034 m to 0.017 m) was observed in the two most effective model combinations at the southern entrance of the strait (Figure 5f–h), which had the same training data length, but one of which was tuned.
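The percentage improvements quoted above follow from the relative reduction in RMSE between a reference case and an improved case. As a short worked check using only the RMSE values stated in the text:

```latex
\Delta = \frac{RMSE_{ref} - RMSE_{new}}{RMSE_{ref}} \times 100\%

\text{Black Sea, 33\% vs. 80\% trained:} \quad \frac{0.037 - 0.028}{0.037} \approx 24\%

\text{Sea of Marmara, 33\% vs. 80\% trained:} \quad \frac{0.041 - 0.037}{0.041} \approx 10\%

\text{Sea of Marmara, untuned vs. tuned:} \quad \frac{0.034 - 0.017}{0.034} = 50\%
```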

It can be concluded that the length of the trained data contributed to the prediction results to a slight extent. However, the data length tested did not significantly enhance the accuracy, with the exception of the notable improvement in the lowest WLs at both entrances (Figure 4b,d,f,h). The prediction accuracies were notably high in comparison to previous studies, even in the most unfavourable scenario (the predicted model trained with 33% of data, Figure 4a,b and Figure 5a,b). For example, ref. [51] developed an ML model to predict WL at both entrances of the Bosphorus for the same period as our study. They employed ANN and SVM for this purpose and found that SVM produced more accurate predictions with R = 0.76 and RMSE = 0.059 m.

It was hypothesised that the inferior predictive capabilities of earlier models in comparison to the current model could be attributed to a number of key factors. One reason for this is the selection of predictors. In addition to wind speed, atmospheric pressure, and the Danube River discharges, the aforementioned models consider water surface salinity and temperature data as predictors. This could result in multicollinearity between the predictors (salinity, temperature, and river discharges), which would lead to overfitting. The other, and arguably the most significant, reason may be attributed to the selection of the training data set. In contrast to our approach, which selected the training data randomly from each month of the year, the aforementioned researchers trained their regression model by considering 70% of the data as a single block. Given the pronounced seasonal variations in water level changes, the use of a specific period of the year as training data may introduce a degree of bias in the predictions.

The relationships between the predicted variable (WL) and the predictor variables (MSLP, wind, river discharges) employed in the present study can be characterised as either linear or non-linear. A linear relationship exists between WL and both MSLP and river discharges, indicating that WL responds in proportion to fluctuations in these variables [48]. The rate at which WL rises or falls will be linearly dependent on the magnitude of these variables. Conversely, the relationship between WL and wind is complex due to the inherent randomness and uncertainty of wind conditions. The behaviour of the wind is non-uniform (in terms of magnitude, duration, etc.) and depends on geographic location, which determines the fetch, i.e., the distance over which the wind blows. The WL responds to the wind setup with a phase lag.

In contrast to MSLP and river discharge, the behaviour of the WL slope is much more complex, depending on the strength and duration of the wind and the geographic features of the area of interest. To illustrate, the Black Sea is a more enclosed basin than the Sea of Marmara. Consequently, the meteorological contribution (MSLP + wind) is less influential in terms of WL change (Table 4, last two columns) in this sea. The hydrological contribution is of greater significance than the other components within the closed Black Sea basin. The linear components of WL exert a more pronounced influence on WL prediction, resulting in more precise predictions at this entrance of the Bosphorus (Figure 4) compared to the Sea of Marmara entrance (Figure 5). As a consequence of the restricted nature of the Bosphorus’ geomorphology, the Danube River’s contribution to water level (WL) is diminished at the Sea of Marmara entrance in comparison to the north entrance (Table 4). As the linear component of water level (WL) decreases, the relative importance of the non-linear component (wind) increases, resulting in less accurate WL predictions at the south entrance. Consequently, the longer the training data, the more accurate the predictions will be (Figure 5a–h). A longer data set is more likely to contain samples of different wind/storm conditions, allowing for a more accurate description of the non-linear relation. This is why ML approaches are better suited to describing these non-linear behaviours than other approaches.

4.3. The Effect of Using Predicted Water Levels in Predicting Capacity of a Hydrodynamic Model

As previously outlined in Section 2: Site Description, the WL difference between the two entrances represents the most dynamic and decisive parameter influencing the flow structure of the Bosphorus. Furthermore, it was demonstrated that hydrodynamic modelling has proven to be an invaluable technique for elucidating the three-dimensional flow structure of such waterways. The accuracy of the model results is contingent upon the water level boundary conditions. It is, therefore, imperative that the WL predictions align closely with the measurements, as this is a key determinant of the success of a hydrodynamic model. In order to test the accuracy of the hydrodynamic model, the predicted water levels of the previous section were used as the boundary conditions. The current results of each model were compared to the measured current velocities taken in close proximity to the water level measurement station in the southern Bosphorus (Figure 1) at a depth of approximately 25 m. Figure 6 illustrates the variation of current velocities over a two-month period at a depth of 5 m below the surface. The results presented in Figure 6 were extracted from a calibrated three-dimensional flow model developed using the Delft 3D numerical modelling approach developed by Deltares [52]. The numerical settings of the models were identical, with the exception of the WL boundary conditions. As this study did not aim to provide a comprehensive account of the numerical modelling process, the details of this procedure are not included here.

As anticipated, the highest level of accuracy was achieved when the observed water levels (WL observation) were prescribed at the open boundaries of the model (see Figure 6). The correlation coefficient and root mean square error (RMSE) were calculated as R ≅ 0.92 and RMSE ≅ 0.231 m/s, respectively, for this case. The same metrics were calculated as R ≅ 0.89 and RMSE ≅ 0.295 m/s for the WL case in Figure 4a and Figure 5a (33% trained in Figure 6), R ≅ 0.90 and RMSE ≅ 0.278 m/s for the WL case in Figure 4c and Figure 5c (50% trained in Figure 6) and R ≅ 0.92 and RMSE ≅ 0.256 m/s for the WL case in Figure 4e and Figure 5e (80% trained in Figure 6). The best current predictions were obtained when the optimised WL model, trained with 80% of the data (Figure 4g and Figure 5g), was applied to the model boundary (Figure 6a,b). The RMSE was calculated to be approximately 0.24 m/s with a correlation coefficient (R) of 0.92. The optimal WL regression model (80% trained) yielded results that were nearly identical to those of the observed WL model (Figure 6b). The discrepancies in current speeds are more pronounced in the other three model cases, with the largest discrepancy observed between the 33% trained model and the 50% trained and 80% trained models. The speed orders in the least accurate model are approximately half those of the most accurate. Overall, these results indicate that the discrepancies between the model alternatives are small and that the WLs predicted in this study can be used as boundary conditions with a high predictive capacity.

5. Summary and Conclusions

The objective of this study is to present a methodology for predicting water levels based on their constituent components, employing the machine learning approach. The method was shown to be beneficial both for long data sets (over 1 year) and especially for shorter data sets, which are more challenging than the former. The Bosphorus, one of the busiest waterways in the world, was selected as the study area. The flowchart of the presented method (Figure 7) can be employed by both practitioners and researchers for the purpose of sea level prediction at any location worldwide. Firstly, the possible combinations of predictor variables (mean sea level pressure (MSLP), wind (U, V, W), tide (T) and the Danube River discharges (Q)), which are known to be decisive for the water levels (WLs) at both entrances of the strait, were identified from the existing literature. Subsequently, the number of predictors was reduced to obtain the most flexible and accurate prediction model. To this end, a series of statistical analyses, including t-tests, SFS, PCA and VIF, were applied in conjunction with a range of ML techniques, namely Linear Regression (LR), Regression Trees (RT), Support Vector Machine Regression (SVMR), Gaussian Process Regression (GPR) and Artificial Neural Networks (ANN).

The most precise water level (WL) forecasts were achieved by utilising RT at the Black Sea entrance and GPR at the Sea of Marmara entrance. The optimal predictor combination was identified as the lagged Danube discharge (Q_lagged), MSLP, V_lagged and U at the north entrance model. However, in the south entrance model, the optimal combination was MSLP, Q_lagged, U, and V. Finally, to prevent overfitting or bias issues, certain hyperparameters were adjusted in the optimal models.

The predicted WLs of the optimal model combination exhibited high correlations (R > 0.90) with the measured values and produced highly accurate results with low error metrics (RMSE, MSE) in the order of a few percent of the mean WL. The regression model trained with the largest data set (80%) exhibited the highest accuracy, with an RMSE of 0.023 m and an R-squared value of 0.97 at the north entrance, and an RMSE of 0.017 m and an R-squared value of 0.94 at the south entrance. Nevertheless, the remaining two models, which were trained with smaller data sets (50% and 33% of the total data), also yielded predictions comparable to those of the largest data set model. It is notable that even the performance of the model trained with 33% of the data was significantly better than some of the previous ML models applied to the task of predicting the WL. This is due to the algorithm followed here, which is the novelty of the present study. The novelty can be attributed to two principal factors: (1) The selected predictors in the present study, which demonstrated acceptable significance and multicollinearity, were more decisive components of WL behaviour than those used in previous studies [49,51]. (2) To avoid bias, the training data were randomly selected in each month of the year according to their weights, whereas they were selected as a single block in previous studies [49,51].

The effect of the different training data lengths was also tested by using the predicted WLs as open boundary conditions in a hydrodynamic model. The model trained with 80% of the data produced current velocity results that were almost identical to those of the model driven by the observed WLs. However, the other two regression models also produced results comparable to those of the observed-WL model for a large part of the current record.

Consequently, the acquisition of WLs from their components, which can be obtained from various open sources, is a more straightforward and practical approach than field observations, which are often hindered by operational difficulties. As demonstrated in this study, the proposed ML algorithm shows considerable promise for the successful prediction of WLs. This could be particularly advantageous for climatological studies, which require the forecasting of WLs at a specific region in the future under different climate scenarios.

Author Contributions

Conceptualisation, F.A. and M.Ö.; methodology, F.A. and M.Ö.; software, F.A. and M.Ö.; validation, F.A. and M.Ö.; formal analysis, F.A. and M.Ö.; resources, F.A. and M.Ö.; data curation, F.A. and M.Ö.; writing, F.A. and M.Ö.; visualisation, F.A. and M.Ö.; supervision, M.Ö. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

This research is supported by the Scientific and Technological Research Council of Turkey (TUBITAK) with project number 222M240. This article will also be used to fulfil the Ph.D. degree requirements at Yıldız Technical University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Blake, E.; Rappaport, E.; Landsea, C. The Deadliest, Costliest, and Most Intense United States Tropical Cyclones from 1851 to 2006 (and Other Frequently Requested Hurricane Facts). Technical Report NHC Miami. 2007. Available online: https://www.nhc.noaa.gov/pdf/NWS-TPC-5.pdf (accessed on 17 March 2024).
Andre, C.; Monfort, D.; Bouzit, M.; Vinchon, C. Contribution of insurance data to cost assessment of coastal flood damage to residential buildings: Insights gained from Johanna (2008) and xynthia (2010) storm events. Nat. Hazards Earth Syst. Sci. 2013, 13, 2003–2012. [Google Scholar] [CrossRef]
Needham, H.F.; Keim, B.D.; Sathiaraj, D. A review of tropical cyclone-generated storm surges: Global data sources, observations, and impacts. Rev. Geophys. 2015, 53, 545–591. [Google Scholar] [CrossRef]
Woodworth, P.L.; Melet, A.; Marcos, M.; Ray, R.D.; Wöppelmann, G.; Sasaki, Y.N.; Merrifield, M.A. Forcing factors affecting sea level changes at the coast. Surv. Geophys. 2019, 40, 1351–1397. [Google Scholar] [CrossRef]
Idier, D.; Bertin, X.; Thompson, P.; Pickering, M.D. Interactions between mean sea level, tide, surge, waves and flooding: Mechanisms and contributions to sea level variations at the coast. Surv. Geophys 2019, 40, 1603–1630. [Google Scholar] [CrossRef]
Martin, A.G.L. International Straits: Concept, Classification and Rules of Passage; Springer: Berlin/Heidelberg, Germany, 2010. [Google Scholar]
Armi, L.; Farmer, D.M. The flow of Atlantic water through the Strait of Gibraltar. Prog. Oceanogr. 1988, 21, 1–103. [Google Scholar] [CrossRef]
Özsoy, E.; Latif, M.A.; Beşiktepe, Ş.; Çetin, N.; Gregg, M.; Belekopytov, V.; Goryachkin, Y.; Vassile, D. The Bosphorus Strait: Exchange fluxes, currents and sea-level changes. In Ecosystem Modeling as a Management Tool for the Blacksea; Ivanov, L.I., Oğuz, T., Eds.; NATO Science Series 2: Environmental Security; Kluwer Academic Pub.: Dordrecht, The Netherlands, 1998; Volumes 1 and 2. [Google Scholar]
Ünlülata, Ü.; Oğuz, T.; Latif, M.A.; Özsoy, E. On the physical oceanography of the Turkish Straits. In The Physical Oceanography of Sea Straits; Pratt, L.J., Ed.; NATO ASI Series (Mathematical and Physical Sciences); Springer: Dordrecht, The Netherlands, 1990; Volume 318. [Google Scholar] [CrossRef]
Kanarska, Y.; Maderich, V. Modeling of seasonal exchange flows through the dardanelles strait. Estuar. Coast. Shelf Sci. 2008, 79, 449–458. [Google Scholar] [CrossRef]
Li, Y.; Wolanski, E.; Zhang, H. What processes control the net currents through shallow straits? A review with application to the Bohai Strait, China. Estuar. Coast. Shelf Sci. 2015, 158, 1–11. [Google Scholar] [CrossRef]
Dietrich, J.C.; Bunya, S.; Westerink, J.J.; Ebersole, B.A.; Smith, J.M.; Atkinson, J.H.; Jensen, R.; Resio, D.T.; Luettich, R.A.; Dawson, C.; et al. A high-resolution coupled riverine flow, tide, wind, wind wave, and storm surge model for southern Louisiana and Mississippi. Part II: Synoptic description and analysis of Hurricanes Katrina and Rita. Mon. Weather Rev. 2010, 138, 378–404. [Google Scholar] [CrossRef]
Suh, S.-W.; Lee, H.-Y. Forerunner storm surge under macro-tidal environmental conditions in shallow coastal zones of the Yellow Sea Cont. Shelf Res. 2018, 169, 1–16. [Google Scholar] [CrossRef]
Fernandez-Montblanc, T.; Vousdoukas, M.I.; Ciavola, P.; Voukouvalas, E.; Mentaschi, L.; Breyiannis, G.; Feyen, L.; Salamon, P. Towards robust pan-European storm surge forecasting. Ocean Model. 2019, 133, 129–144. [Google Scholar] [CrossRef]
Salmun, H.; Molod, A.; Wisniewska, K.; Buonaiuto, F.S. Statistical prediction of the storm surge associated with cool-weather storms at the Battery, New York. J. Appl. Meteorol. Climatol. 2011, 50, 273–282. [Google Scholar] [CrossRef]
Lopeman, M.; Deodatis, G.; Franco, G. Extreme storm surge hazard estimation in lower Manhattan: Clustered separated peaks-over-threshold simulation (CSPS) method. Nat. Hazards 2015, 78, 355–391. [Google Scholar] [CrossRef]
Roberts, K.J.; Colle, B.A.; Georgas, N.; Munch, S.B. A regression-based approach for cool-season storm surge predictions along the New York–New Jersey coast. J. Appl. Meteorol. Climatol. 2015, 54, 1773–1791. [Google Scholar] [CrossRef]
Kim, S.; Matsumi, Y.; Pan, S.; Mase, H. A real-time forecast model using artificial neural network for after-runner storm surges on the Tottori coast, Japan. Ocean Eng. 2016, 122, 44–53. [Google Scholar] [CrossRef]
Bruneau, N.; Polton, J.; Williams, J.; Holt, J. Estimation of global coastal sea level extremes using neural networks. Environ. Res. Lett. 2020, 15, 074030. [Google Scholar] [CrossRef]
Mitchell, T.M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997; Volume 1. [Google Scholar]
Mahesh, B. Machine learning algorithms—A review. Int. J. Sci. Res. 2020, 9, 381–386. [Google Scholar] [CrossRef]
Zhou, Z.H. Machine Learning; Springer Nature: London, UK, 2021. [Google Scholar] [CrossRef]
Alpaydin, E. Machine Learning; MIT Press: Cambridge, MA, USA, 2021. [Google Scholar]
Wang, H.; Lei, Z.; Zhang, X.; Zhou, B.; Peng, J. Machine learning basics. Deep Learn. 2016, 98–164. Available online: http://whdeng.cn/Teaching/PPT_01_Machine%20learning%20Basics.pdf (accessed on 9 February 2024).
Shinde, P.P.; Shah, S. A review of machine learning and deep learning applications. In Proceedings of the Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), Pune, India, 16–18 August 2018; pp. 1–6. [Google Scholar]
Maderich, V.; Ilyin, Y.; Lemeshko, E. Seasonal and interannual variability of the water exchange in the Turkish Straits System estimated by modelling. Mediterr. Mar. Sci. 2015, 16, 444–459. [Google Scholar] [CrossRef]
Oguz, T.; Ozsoy, E.; Latif, M.A.; Sur, H.I.; Ünlüata, Ü. Modeling of hydraulically controlled exchange flow in the Bosphorus Strait. J. Phys. Oceanogr. 1990, 20, 945–965. [Google Scholar] [CrossRef]
Çolpan Polat, S.; Tugrul, S. Nutrient and organic carbon exchanges between the Black and Marmara seas through the Bosphorus Strait. Cont. Shelf Res. 1995, 15, 1115–1132. [Google Scholar] [CrossRef]
Hubareva, E.; Svetlichny, L.; Kideys, A.; Isinibilir, M. Fate of the Black Sea Acartia clausi and Acartia tonsa (Copepoda) penetrating into the Marmara Sea through the Bosphorus. Estuar. Coast. Shelf Sci. 2008, 76, 131–140. [Google Scholar] [CrossRef]
Yuksel, Y.; Ayat, B.; Ozturk, M.N.; Aydogan, B.; Guler, I.; Cevik, E.O.; Yalçıner, A.C. Responses of the stratified flows to their driving conditions-A field study. Ocean Eng. 2008, 35, 1304–1321. [Google Scholar] [CrossRef]
Ozturk, M.; Altas, F. Seasonal variability of stratified flow properties in a non-tidal strait-A field study. Estuar. Coast. Shelf Sci. 2022, 264, 107700. [Google Scholar] [CrossRef]
Arisoy, Y.; Akyarli, A. Long term current and sea level measurements conducted at Bosphorus. In Physical Oceanography of Sea Straits; North Atlantic Treaty Organization NATO ASI Series; Springer: Dordrecht, The Netherlands, 1989; pp. 225–236. [Google Scholar]
Aydoğan, B.; Ayat, B.; Öztürk, M.N.; Özkan Çevik, E.; Yüksel, Y. Current velocity forecasting in straits with artificial neural networks, a case study: Strait of İstanbul. Ocean Eng. 2010, 37, 443–453. [Google Scholar] [CrossRef]
Jarosz, E.; Teague, W.J.; Book, J.W.; Beşiktepe, Ş. On flow variability in the Bosphorus Strait. J. Geophys. Res. Ocean. 2011, 116. [Google Scholar] [CrossRef]
Öztürk, M.; Ayat, B.; Aydoğan, B.; Yüksel, Y. 3D Numerical modeling of stratified flows: Case study of the Bosphorus Strait. J. Waterw. Port Coast. Ocean Eng. 2012, 138, 406–419. [Google Scholar] [CrossRef]
Book, J.W.; Jarosz, E.; Chiggiato, J.; Beşiktepe, Ş. The oceanic response of the Turkish Straits System to an extreme drop in atmospheric pressure. J. Geophys. Res. Ocean. 2014, 119, 3629–3644. [Google Scholar] [CrossRef]
Saçu, Ş.; Şen, O.; Erdik, T. A stochastic assessment for oil contamination probability: A case study of the Bosphorus. Ocean Eng. 2021, 231, 109064. [Google Scholar] [CrossRef]
Ozturk, M. Numerical modeling of the effect of duration of barotropic forcing on sea strait flow: Case study of the Bosphorus Strait. J. Hydraul. Eng. 2013, 139, 1199–1211. [Google Scholar] [CrossRef]
Alpar, B.; Yuce, H. Sea-level variations and their interactions between the Black Sea and the Aegean Sea. Estuar. Coast. Shelf Sci. 1998, 46, 609–619. [Google Scholar] [CrossRef]
Ozturk, M.; Yuksel, Y. Tidal and non-tidal sea level analysis in enclosed and inland basins: The Black, Aegean, Marmara, and Eastern Mediterranean (Levantine) Seas. Reg. Stud. Mar. Sci. 2023, 61, 102848. [Google Scholar] [CrossRef]
Hersbach, H.; Bell, B.; Berrisford, P.; Biavati, G.; Horányi, A.; Muñoz Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Rozum, I.; et al. ERA5 hourly data on single levels from 1940 to present. Copernic. Clim. Change Serv. (C3S) Clim. Data Store (CDS) 2023. [Google Scholar] [CrossRef]
Ozturk, M.; Altas, F. The meteorological, hydrological and tidal components of Bosphorus flow. Reg. Stud. Mar. Sci. 2021, 48, 102060. [Google Scholar] [CrossRef]
Global Runoff Data Centre (GRDC); Federal Institute of Hydrology: Koblenz, Germany, 2020. Available online: https://portal.grdc.bafg.de/applications/public.html?publicuser=PublicUser#dataDownload/Stations (accessed on 28 November 2023).
Pawlowicz, R.; Beardsley, B.; Lentz, S. Classical tidal harmonic analysis including error estimates in MATLAB using T_TIDE. Comput. Geosci. 2002, 28, 929–937. [Google Scholar] [CrossRef]
Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. Adv. Neural Inf. Process. Syst. 2011, 24. [Google Scholar]
DeCastro-García, N.; Muñoz Castañeda, Á.L.; Escudero García, D.; Carriegos, M.V. Effect of the sampling of a dataset in the hyperparameter optimization phase over the efficiency of a machine learning algorithm. Complexity 2019, 2019, 6278908. [Google Scholar] [CrossRef]
Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316. [Google Scholar] [CrossRef]
Oda, Y.; Ito, K.; Solomon, C.Y. Current Forecast for Tunnel-Element Immersion in the Bosphorus Strait, Turkey. J. Waterw. Port Coast. Ocean Eng. 2009, 135, 108–119. [Google Scholar] [CrossRef]
Tur, R.; Tas, E.; Haghighi, A.T.; Mehr, A.D. Sea Level Prediction Using Machine Learning. Water 2021, 13, 3566. [Google Scholar] [CrossRef]
Sur, H.İ.; Özsoy, E.; Ünlüata, Ü. Boundary current instabilities, upwelling, shelf mixing and eutrophication processes in the Black Sea. Prog. Oceanogr. 1994, 33, 249–302. [Google Scholar] [CrossRef]
Karsavran, Y.; Erdik, T. Artificial intelligence based prediction of seawater level: A case study for Bosphorus Strait. Int. J. Math. Eng. Manag. Sci. 2021, 6, 1242. [Google Scholar]
Deltares. Delft3D-FLOW Users Manual. Simulation of Multi-Dimensional Hydrodynamic Flows and Transport Phenomena, including Sediments. 2020. Available online: https://content.oss.deltares.nl/delft3d4/Delft3D-FLOW_User_Manual.pdf (accessed on 11 October 2023).

Figure 1. Location of the study area (the upper panel) and monitoring stations (upper left panel) and schematic presentation of typical two-layer flow structure in the Bosphorus (bottom panel).

Figure 2. Steps applied to building a regression model in this study.

Figure 3. The percentage of training and test data for the regression model alternatives.

Figure 4. The variation of observed and modelled water level data and Q-Q plots at the Black Sea entrance for different training data lengths: 33% of data (a,b), 50% of data (c,d), 80% of data (combination of 1, 2, 3, 4 in Table 5) (e,f), and 80% of data (combination of 1, 2, 3, 4 optimised in Table 5) (g,h).

Figure 5. The variation of observed and modelled water level data and Q-Q plots at the Sea of Marmara entrance for different training data lengths: 33% of data (a,b), 50% of data (c,d), 80% of data (combination of 1, 2, 3, 4 in Table 5) (e,f), and 80% of data (combination of 1, 2, 3, 4 optimised in Table 5) (g,h).

Figure 6. A selected example of the effect of various WL open boundary conditions on the current velocities at −5 m in the Bosphorus: (a) variation over two months (November and December 2005) and (b) Q-Q plots for the cases. In (a), the label observation indicates measured current speeds, WL_observation indicates model current results obtained with the observed WLs prescribed as the boundary conditions, and the remaining labels (33% trained, 50% trained, 80% trained, and 80% trained + optimised) indicate model current results obtained with the predicted WLs (Figure 4a–d and Figure 5a–d) prescribed as the boundary conditions. Positive values in (a,b) indicate northward flow speeds, and negative values indicate southward flow.

Figure 7. The water level prediction algorithm applied in this study follows the ML approach.

Table 1. Statistical characteristics of model parameters.

Site	Parameter	Min.	Max.	Mean	Std. Dev. (σ)
The Black Sea	Wind speed (m/s)	0.026	11.902	4.08	2.086
	N-S wind speed (m/s)	−10.756	10.83	2.85	1.894
	Air pressure (hPa)	992.57	1035.5	1015.6	6.673
	Water level (m)	−0.06	0.57	0.281	0.0953
The Sea of Marmara	Wind speed (m/s)	0.051	12.592	4.10	2.128
	N-S wind speed (m/s)	−11.032	10.205	2.94	1.912
	Air pressure (hPa)	993.24	1035.3	1015.5	6.645
	Water level (m)	−0.30	0.42	0.044	0.1054
Danube River	Discharge (m³/s)	3290	14,400	8460.3	3080.5

Table 2. Main tidal harmonics at both entrances of the Bosphorus.

		H₁	S₁	K₁	O₁	M₂	S₂
Tidal Harmonics	Period (h)	12.44	24	23.93	25.82	12.42	12
The Black Sea	Amplitude (m)	0.0048	-	0.0089	0.0387	0.0805	0.0833
The Black Sea	Phase (°)	116.30	-	92.20	94.91	60.26	73.29
The Sea of Marmara	Amplitude (m)	0.031	0.0092	0.0098	0.0076	0.0065	0.0044
The Sea of Marmara	Phase (°)	358.17	105.29	129.82	95.47	267.82	291.01
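
The amplitudes and phases in Table 2 can be turned back into a tidal water level series by summing one cosine per constituent. The snippet below is a minimal sketch of that synthesis for the Sea of Marmara row. It assumes the form A·cos(2πt/T − φ) with an arbitrary time origin and no nodal corrections, which may differ from the convention of the harmonic analysis used in the study; the names `MARMARA_CONSTITUENTS` and `tidal_level` are illustrative only.

```python
import math

# Sea of Marmara constituents copied from Table 2:
# (label, period in hours, amplitude in m, phase in degrees)
MARMARA_CONSTITUENTS = [
    ("H1", 12.44, 0.0310, 358.17),
    ("S1", 24.00, 0.0092, 105.29),
    ("K1", 23.93, 0.0098, 129.82),
    ("O1", 25.82, 0.0076, 95.47),
    ("M2", 12.42, 0.0065, 267.82),
    ("S2", 12.00, 0.0044, 291.01),
]


def tidal_level(t_hours, constituents):
    """Sum of cosine constituents at t_hours after an assumed time origin."""
    total = 0.0
    for _label, period_h, amplitude_m, phase_deg in constituents:
        omega = 2.0 * math.pi / period_h  # angular frequency in rad/h
        total += amplitude_m * math.cos(omega * t_hours - math.radians(phase_deg))
    return total


# Hourly synthetic tide over 30 days
series = [tidal_level(t, MARMARA_CONSTITUENTS) for t in range(30 * 24)]

# The sum of the amplitudes is an upper bound on the tidal excursion
max_possible = sum(c[2] for c in MARMARA_CONSTITUENTS)
largest = max(abs(x) for x in series)
print(f"largest |tide| in 30 days: {largest:.4f} m (bound {max_possible:.4f} m)")
```

With these amplitudes the synthetic tide cannot exceed about 0.07 m, which is small compared with the water level range reported in Table 1.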

Table 3. Significance of predictor variables in predicting the water levels at both entrances.

	The Black Sea Entrance			The Sea of Marmara Entrance		
Predictors	t-Test	p-Value	Significance	t-Test	p-Value	Significance
MSLP	−3.16 × $10^{1}$	1.44 × $10^{-207}$	✓	−1.08 × $10^{2}$	0.00	✓
U	−1.77 × $10^{1}$	2.82 × $10^{-69}$	✓	2.67 × $10^{1}$	1.62 × $10^{-151}$	✓
V	−3.75 × $10^{1}$	2.80 × $10^{-286}$	✓	2.03 × $10^{1}$	2.48 × $10^{-89}$	✓
$U_{lagged}$	−2.05 × $10^{1}$	3.13 × $10^{-91}$	✓	1.49 × $10^{1}$	1.55 × $10^{-49}$	✓
$V_{lagged}$	−4.25 × $10^{1}$	0.00	✓	5.57	2.56 × $10^{-8}$	✓
$V_{lagged}^{2}$	−3.69 × $10^{1}$	8.89 × $10^{-278}$	✓	7.91	2.99 × $10^{-15}$	✓
W	4.12	3.88 × $10^{-5}$	✓	−9.89	6.19 × $10^{-23}$	✓
Q	9.28 × $10^{1}$	0.00	✓	3.01 × $10^{1}$	2.19 × $10^{-189}$	✓
$Q_{lagged}$	6.02 × $10^{1}$	0.00	✓	3.69 × $10^{1}$	3.64 × $10^{-278}$	✓
Tide (T)	−8.49 × $10^{-1}$	3.96 × $10^{-1}$	✗	9.31 × $10^{-1}$	3.52 × $10^{-1}$	✗
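
The significance column of Table 3 can be reproduced by comparing each p-value with a threshold. The sketch below does this with the tabulated p-values; the threshold `ALPHA = 0.05` is an assumption (the table itself does not state the level used), and `significant_predictors` is a hypothetical helper name.

```python
# p-values copied from Table 3: (Black Sea entrance, Sea of Marmara entrance)
P_VALUES = {
    "MSLP": (1.44e-207, 0.00),
    "U": (2.82e-69, 1.62e-151),
    "V": (2.80e-286, 2.48e-89),
    "U_lagged": (3.13e-91, 1.55e-49),
    "V_lagged": (0.00, 2.56e-8),
    "V_lagged^2": (8.89e-278, 2.99e-15),
    "W": (3.88e-5, 6.19e-23),
    "Q": (0.00, 2.19e-189),
    "Q_lagged": (0.00, 3.64e-278),
    "Tide": (3.96e-1, 3.52e-1),
}

ALPHA = 0.05  # assumed significance level, not stated in the table


def significant_predictors(p_values, alpha=ALPHA):
    """Return the predictors whose p-value is below alpha at both entrances."""
    return [name for name, ps in p_values.items() if all(p < alpha for p in ps)]


kept = significant_predictors(P_VALUES)
dropped = [name for name in P_VALUES if name not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Under this assumed threshold only the tide fails the test at both entrances, which agrees with the ✗ marks in the table.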

Table 4. The most meaningful predictors at both entrances of the Bosphorus.

Predictors	The Black Sea Entrance				The Sea of Marmara				Rank
	R²	MAE (m)	MSE (m²)	RMSE (m)	R²	MAE (m)	MSE (m²)	RMSE (m)	Black Sea	Sea of Marmara
MSLP	0.18	0.069	0.007	0.086	0.60	0.051	0.004	0.067	2	1
U	0.06	0.077	0.009	0.092	0.11	0.078	0.01	0.1	4	3
V	0.15	0.072	0.008	0.088	0.06	0.081	0.01	0.1	-	4
$U_{lagged}$	0.04	0.077	0.009	0.093	0.05	0.081	0.011	0.103	-	-
$V_{lagged}$	0.18	0.071	0.007	0.086	0.02	0.082	0.011	0.104	3	-
$V_{lagged}^{2}$	0.18	0.071	0.007	0.086	0.02	0.082	0.011	0.104	-	-
W	0.14	0.072	0.008	0.088	0.05	0.081	0.011	0.103	-	-
Q	0.52	0.049	0.004	0.066	0.29	0.065	0.008	0.089	-	-
$Q_{lagged}$	0.6	0.045	0.004	0.06	0.37	0.062	0.007	0.084	1	2

Table 5. The contribution of each predictor to the improvement of the model’s predicting capacity at the Black Sea (north) entrance of the Bosphorus.

	Validation				Test			
Predictors (rank in Table 4)	R²	MAE (m)	MSE (m²)	RMSE (m)	R²	MAE (m)	MSE (m²)	RMSE (m)
1	0.6	0.045	0.004	0.06	0.6	0.046	0.004	0.06
1, 2	0.81	0.027	0.002	0.041	0.84	0.026	0.001	0.039
1, 2, 3	0.88	0.023	0.001	0.033	0.87	0.023	0.001	0.035
1, 2, 3, 4	0.9	0.023	0.001	0.031	0.91	0.021	0.001	0.028
1, 2, 3, 4 optimised	0.93	0.018	0.001	0.025	0.94	0.017	0.0005	0.023

Table 6. The contribution of each predictor to the improvement of the model’s predicting capacity at the Sea of Marmara (south) entrance of the Bosphorus.

	Validation				Test			
Predictors (rank in Table 4)	R²	MAE (m)	MSE (m²)	RMSE (m)	R²	MAE (m)	MSE (m²)	RMSE (m)
1	0.6	0.051	0.004	0.067	0.57	0.052	0.005	0.068
1, 2	0.81	0.034	0.002	0.046	0.81	0.033	0.002	0.046
1, 2, 3	0.84	0.028	0.002	0.042	0.84	0.026	0.002	0.041
1, 2, 3, 4	0.84	0.027	0.002	0.042	0.86	0.024	0.001	0.039
1, 2, 3, 4 optimised	0.9	0.024	0.001	0.034	0.89	0.023	0.001	0.034
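
The error measures reported in Tables 5 and 6 can be computed from paired observed and predicted water levels as sketched below. This is an illustration only: R² is taken here as the coefficient of determination, which is an assumption about the definition used in the study, and the sample values are made up rather than taken from the data set.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, MAE, MSE and RMSE for two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equally long")
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(r) for r in residuals) / n   # mean absolute error
    mse = sum(r * r for r in residuals) / n    # mean squared error
    rmse = math.sqrt(mse)                      # root mean square error
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    ss_res = sum(r * r for r in residuals)
    r2 = 1.0 - ss_res / ss_tot                 # undefined if observed is constant
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse}


# Made-up water levels in metres, for illustration only
observed_wl = [0.10, 0.20, 0.30, 0.40]
predicted_wl = [0.10, 0.20, 0.30, 0.50]
metrics = regression_metrics(observed_wl, predicted_wl)
print(metrics)
```

Because RMSE is the square root of MSE, the rounded table entries can be cross-checked against each other: an RMSE of 0.031 m corresponds to an MSE of about 0.00096 m², which rounds to the tabulated 0.001 m².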

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

