Forecasting of Significant Wave Height Based on Gated Recurrent Unit Network in the Taiwan Strait and Its Adjacent Waters

Significant wave height (SWH) forecasting is a key process for offshore and coastal engineering. However, accurate prediction of the SWH is quite challenging due to the randomness and fluctuation of waves. This paper employs a novel deep learning method, the gated recurrent unit network (GRU), to forecast SWH with lead times of 3, 6, 12 and 24 h. The data sets used in this study include the wind speed of the past 3 h and the current SWH as inputs, which were obtained from six buoy stations in the Taiwan Strait and its adjacent waters. The GRU results are compared with those of the back propagation neural network (BP), extreme learning machine (ELM), support vector machine (SVM), and random forest (RF). Although the error indices of the six stations differ, the general performance of GRU is satisfactory, with a faster forecasting speed, smaller volatility and better adaptability. Using buoy station 46714D as an example, the root mean square error (RMSE) of GRU reaches 0.234, 0.299, 0.371, and 0.479 with lead times of 3, 6, 12, and 24 h, respectively.


Introduction
Marine disasters pose a severe threat to many countries in the world, leading to tremendous casualties and economic losses. To this end, investigating the characteristics of ocean waves, especially the significant wave height (SWH), is of pivotal importance to maritime activities and coastal engineering. An accurate and reliable prediction of SWH contributes to the smooth progress of activities such as fisheries, marine resources exploitation, safe navigation, and construction and maintenance of coastal structures [1,2]. However, the irregularity of ocean waves presents great challenges in predicting SWH.
In general, there are three mainstream approaches to predicting SWH: empirical, numerical and machine learning methods. Classical empirical models such as the auto-regressive moving average (ARMA) have long been established, but they have a limited ability to capture the non-stationarity and non-linearity in data series [3]. In the past decades, a number of researchers have sought to predict SWH using physics-based models, which rely primarily on a form of the spectral energy or action balance equation. Although numerical models have proven to be effective in wave height prediction over a large spatial and temporal range, their drawbacks have also been noted: the cost of computational resources and time is extremely high, especially for calculations on a higher-resolution grid in nearshore zones where the seabed topography is intricate [4,5].
The increasingly rapid advances in machine learning have triggered a huge amount of innovative inquiries in SWH prediction. Machine learning methods use statistics to gain a deeper insight into the spatial and temporal link hidden in the historical time series.
Zamani et al. [6] conducted a detailed study of several data-driven models based on artificial neural networks (ANNs) and instance-based learning (IBL). Experiments showed that the ANNs have a slight superiority over IBL, and ANNs also exhibit competitive advantages in predicting extreme wave conditions. The predictive capability of several machine learning approaches including support vector machine (SVM), Bayesian network (BN), ANN and adaptive neuro-fuzzy inference system was inspected by Malekmohamadi et al. [7]. The results showed that the behavior of these models is acceptable, except for that of BN. By incorporating a genetic algorithm with Kalman filtering, Altunkaynak and Wang [8] developed a new technique to predict SWH. The superiority of this method over ANN was shown by its lower mean relative error and mean square error. Nitsure et al. [9] applied genetic programming to predict wave heights using wind information as an input. The prediction results with lead times up to 12 h and 24 h were satisfactory, where the coefficients of correlation between the predicted and measured values were higher than 0.87. Prahlada and Deka [10] strived to present a hybrid model of wavelet and an artificial neural network for SWH prediction across multistep lead times by combining the beneficial qualities of both. The presented method has been proven to be effective and feasible. Cornejo-Bueno et al. [11] proposed a hybrid grouping genetic algorithm-extreme learning machine approach for marine energy applications in SWH and flux prediction and obtained desirable results. Nikoo et al. [12] conducted SWH prediction based on a fuzzy K-nearest neighbor (FKNN) model in which the variation of wind direction affects the fetch length. The prediction results of FKNN outperformed those obtained by BN, regression tree induction and support vector regression, especially in the prediction of wave heights larger than 2 m.
Wei and Hsieh [13] adopted ANN in two distinct situations to assess the practicability of predicting waves using the data gathered from the adjacent buoy. The study showed that the model involving information from the adjacent buoy outperforms the one without extra data. Considering the edges of back propagation neural networks (BP) and cuckoo search algorithms (CS), Yang et al. [14] creatively attempted to predict SWH based on a CS-BP model, and the proposed model offers promising potential for wave height prediction. A recent study carried out by Zhang and Dai [15] involved the conditional restricted Boltzmann machine in the classical deep belief network to predict SWH. The measurement criterion revealed that the newly proposed method has a strong ability for short-term and extreme events prediction.
More recently, the long short-term memory network (LSTM) [16], an improved form of the recurrent neural network (RNN), has been attracting considerable interest. Son et al. [17] found a novel perspective to predict real-valued SWH from a series of sequential ocean images using a bi-directional convolutional LSTM model, and low error indices were obtained. Fan et al. [18] employed LSTM to predict SWH for various forecasting time horizons with higher accuracy, and proposed a simulating waves nearshore-LSTM model to make single-point predictions. A great deal of previous research into SWH prediction has focused on shallow machine learning models such as BP and SVM, but these fail to fully exploit the inner correlations in historical information over the long term. LSTM has been successfully utilized to predict SWH. However, a conspicuous shortcoming of LSTM is that it entails a large number of parameters for training. Consequently, the training process is time consuming and the model easily becomes overfitted.
The gated recurrent unit network (GRU) is optimized and condensed on the basis of LSTM, with two gates, named the reset gate and the update gate, to control the flow of information. Benefiting from this structure, the forecasting speed of GRU is effectively improved while the strength of LSTM is maintained [19]. GRU has emerged as a powerful tool in various applications involving time series prediction, such as machine health monitoring [20,21], wind speed prediction [22][23][24], and traffic flow prediction [25,26].
Nevertheless, to the authors' knowledge, there has been very little research that seeks to predict SWH using the novel deep learning method GRU over a large range and long prediction interval. Therefore, the present study sets out to predict SWH based on GRU at six buoy stations in the Taiwan Strait and its adjacent waters, and to compare the prediction results of GRU with those obtained by BP, extreme learning machine (ELM), SVM, and random forest (RF).
The overall structure of this paper takes the form of five parts. The second part describes the materials used in this study. The third part explains the gated recurrent unit network and the evaluation indicators. The following part presents the obtained results, along with discussions. The final part gives a summary of the work.

Materials
The Taiwan Strait is China's largest strait, connecting the East China Sea and the South China Sea; it is not only an important maritime area on historical trade routes, but also a strategic point in modern geopolitics. Therefore, it enjoys a high reputation as the "sea corridor". The topography of the Taiwan Strait is strongly undulating; the strait is wide in the south, narrow in the north and shaped like a horn, which brings about a prominent narrow-tube effect. As the frequency of production activities and shipping in this area is constantly increasing, there is an urgent need for timely and accurate wave forecasting in this strait.
To test the performance of GRU, six buoy stations distributed at different sites in the Taiwan Strait and its adjacent waters were selected. The hourly data used for SWH prediction are owned and maintained by the National Marine Data Center (http://mds.nmdis.org.cn/) and the European Marine Observation and Data Network (http://www.emodnet-physics.eu/map/). Table 1 gives details of the selected stations, including the exact locations, water depth, the period of data, the maximum SWH and wind speed during the corresponding period of each buoy station, and the total number of available data. Figure 1 displays the distributions and water depths of the selected stations. The key part of wind-wave forecasting is to predict SWH with lead times of a few hours or days using historical information. According to [27], wind speed has been identified as a major contributing factor to the generation of waves. Furthermore, the previous SWH also exerts a dominant effect due to the continuity of waves. Therefore, the available previous observations of wind speed and SWH are fed into the model as inputs.
In this study, 80% of the available data were utilized for training the model, and the remaining 20% were used for testing.
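The input assembly and 80/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name, the exact feature ordering (three past wind speeds plus the current SWH), and the chronological split are assumptions consistent with the text.

```python
# Illustrative sketch: build (inputs, target) pairs where the wind speeds of
# the past 3 h and the current SWH predict the SWH "lead" hours ahead, then
# split chronologically into 80% training / 20% testing data.
import numpy as np

def make_dataset(wind, swh, lead):
    """wind, swh: 1-D hourly arrays; lead: forecast horizon in hours."""
    X, y = [], []
    for t in range(3, len(swh) - lead):
        # inputs: wind speed at t-3, t-2, t-1 plus the current SWH at t
        X.append([wind[t - 3], wind[t - 2], wind[t - 1], swh[t]])
        y.append(swh[t + lead])          # target: SWH "lead" hours later
    X, y = np.asarray(X), np.asarray(y)
    split = int(0.8 * len(X))            # 80% training, 20% testing
    return (X[:split], y[:split]), (X[split:], y[split:])
```

Because the split is chronological rather than random, the test set contains only waves observed after the training period, which matches an operational forecasting setting.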

Wave Forecast Model
Since traditional neural networks are transmitted through full connections and the nodes within the same layer are not connected, they may fail when dealing with temporal problems.
In this context, RNN has been proven to be more powerful in extracting temporal patterns than traditional neural networks by building self-loop connections from a node to itself and sharing parameters across different time steps.
A standard RNN takes its input from the current input x_t along with what it has picked up previously. Firstly, the hidden state h_t carrying the network memory can be computed by

h_t = f(W x_t + U h_{t-1} + b),    (1)

where h_{t-1} is the previous hidden state; x_t is the new input; W and U are the weight matrices; b is the bias vector; and f is a nonlinear activation function. Then the current state o_t is calculated as

o_t = W_o h_t + b_o,    (2)

where W_o is the weight matrix and b_o is the bias vector. Although RNN exhibits a robust capability of modeling nonlinear time series in an effective fashion, it cannot escape the vanishing gradient and exploding gradient problems, and its accuracy decreases when the time span becomes longer.
The long short-term memory network (LSTM) was proposed to mitigate the aforementioned problems, but its time-consuming training process may hinder the widespread adoption of LSTM in real-time and fast SWH forecasting. In this paper, we employ another notable RNN variant, the gated recurrent unit network (GRU). Figure 2 shows the inner structure of the GRU. Both RNN and GRU have chain-like modules, but the repeating modules of GRU are more complicated. Each repeating module of GRU contains two gates, named the update gate and the reset gate, which give GRU the ability to control the flow of information. The two gates are sigmoid units that map the variables into [0, 1], where the value between 0 and 1 is the ratio of memory retained. Thus, GRU can capture correlations in the time series over both long and short terms.
Firstly, the reset gate r_t controls how much information from the previous hidden state will be carried over to the current hidden state:

r_t = σ(W_r x_t + U_r h_{t-1} + b_r).    (3)

The new memory candidate h̃_t is produced with r_t through a tanh layer, which derives from the following:

h̃_t = tanh(W x_t + U(r_t ⊙ h_{t-1}) + b).    (4)

The update gate z_t determines whether the hidden state will be updated with the new candidate:

z_t = σ(W_z x_t + U_z h_{t-1} + b_z).    (5)

In the end, the hidden state h_t is renewed by

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t.    (6)

In Equations (3) to (6), W_r, W_z, U_r, U_z are the weight matrices, b_r, b_z are the corresponding bias vectors, σ is the sigmoid function, and ⊙ denotes element-wise multiplication.
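One GRU update following the gating equations above can be written out in a few lines. This is a minimal numpy sketch for illustration; the parameter dictionary layout and shapes are assumptions, and in practice the matrices are learned by a deep learning framework rather than set by hand.

```python
# Minimal numpy sketch of one GRU step following Equations (3)-(6).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update: x_t (current input), h_prev (previous hidden state),
    p (dict of weight matrices W*, U* and bias vectors b*)."""
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])        # reset gate, Eq. (3)
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b"])   # memory candidate, Eq. (4)
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])        # update gate, Eq. (5)
    return (1.0 - z) * h_prev + z * h_cand                            # renewed state, Eq. (6)
```

Note how the same interpolation (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t lets gradients flow through h_{t-1} almost unchanged when z_t is near 0, which is what mitigates the vanishing gradient problem.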

Data Preprocessing and Evaluation Criteria
In order to keep all the variables on the same scale and guarantee stable convergence of the model developed in the present study, the following standardization formula is used:

x* = (x − µ) / δ,

where µ represents the mean and δ represents the standard deviation of the data.
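The standardization above amounts to a one-line transformation; a minimal sketch:

```python
# Standardize a variable: subtract the mean and divide by the standard
# deviation, so all input variables share the same scale.
import numpy as np

def standardize(x):
    mu, delta = x.mean(), x.std()
    return (x - mu) / delta
```

After this transformation each variable has zero mean and unit standard deviation.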
For the quantitative evaluation of the model's performance, three statistical metrics, the root mean square error (RMSE), coefficient of correlation (R) and index of agreement (IA), are considered:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − x_i)^2 )

R = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^{n} (x_i − x̄)^2 · Σ_{i=1}^{n} (y_i − ȳ)^2 )

IA = 1 − Σ_{i=1}^{n} (y_i − x_i)^2 / Σ_{i=1}^{n} (|y_i − x̄| + |x_i − x̄|)^2

where x_i is the observed value at the ith time, y_i is the predicted value at the same moment, n is the number of time steps, x̄ is the mean of the observed data, and ȳ is the mean of the predicted values.
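The three metrics can be computed directly from the observed series x and predicted series y; a short numpy sketch (the IA formula follows Willmott's standard index of agreement, which matches the definition above):

```python
# RMSE, correlation coefficient R, and index of agreement IA for
# observed values x and predicted values y (1-D arrays of equal length).
import numpy as np

def rmse(x, y):
    return np.sqrt(np.mean((y - x) ** 2))

def corr_r(x, y):
    xm, ym = x - x.mean(), y - y.mean()
    return np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2))

def ia(x, y):
    xbar = x.mean()
    return 1.0 - np.sum((y - x) ** 2) / np.sum((np.abs(y - xbar) + np.abs(x - xbar)) ** 2)
```

RMSE is in the same units as the SWH, so smaller is better, while R and IA both approach 1 for a perfect forecast.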

Results and Discussions
According to the theory that a wave is generated by wind interacting with the ocean surface, the current SWH together with wind speed over the past three hours are imposed as input variables to predict SWH using GRU for various forecasting time horizons. The experiments of four widely used machine learning algorithms, BP, ELM, SVM and RF, were conducted for comparison.
The parameters of GRU were set by means of trial and error, while the parameters of the other four methods were selected according to previous studies. The experiments showed that excessive model parameters may bring about a time-consuming training process without significant improvements in prediction effectiveness. Therefore, the parameter setting should be balanced against the prediction performance and the time consumption. Table 2 lists the key model parameters of the five algorithms, where m is the number of hidden layers, S is the number of neurons in each hidden layer, g represents the learning rate, k stands for the number of training epochs, C is the penalty parameter, ε is the error tolerance, N is the number of trees in RF, and maxDeep is the maximum depth of each tree. In the following tables, the best results are highlighted in bold font.
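For illustration only, the shallow baselines can be instantiated with off-the-shelf implementations. The hyperparameter values below are placeholders, not the tuned settings of Table 2, and ELM is omitted because scikit-learn has no standard ELM implementation.

```python
# Hypothetical baseline comparison harness (hyperparameters are placeholders,
# not the tuned values of Table 2): fit each shallow model on the same
# inputs and report the test-set RMSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

def fit_baselines(X_train, y_train, X_test, y_test):
    models = {
        "BP":  MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
        "SVM": SVR(C=1.0, epsilon=0.1),
        "RF":  RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = float(np.sqrt(np.mean((pred - y_test) ** 2)))  # RMSE
    return scores
```

Running all models on identical training/testing splits, as done in this study, is what makes the error indices in the following tables directly comparable.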

SWH Prediction with Lead Times of 3 h
In this section, the current SWH and wind speed 3 h ago are fed as inputs to forecast SWH with lead times of 3 h.
The error indices of the five algorithms at six different stations selected from the Taiwan Strait and its adjacent waters are presented in Table 3. It is apparent that all the indicators of GRU are better than those of the other four algorithms. The ability of GRU to predict SWH far outpaces the others, which may be due to the fact that GRU can take full advantage of the useful information from the previous states without the vanishing gradient and exploding gradient problems. What stands out in this table is that the performances of SVM and ELM are strikingly poorer than the others, which indicates that SVM and ELM are not good choices for predicting SWH. At station C6W08, the RMSE of GRU is 46.3% lower than that of ELM, and the R and IA of GRU reach 0.950 and 0.972, whereas the same indices of ELM are only 0.844 and 0.914. With regard to BP, RF, and SVM, the performances at all six stations are inferior to those of GRU to a greater or lesser degree. As can be seen from Figure 3, the observed and predicted values produced by GRU are in considerable accordance for the 3 h prediction. Especially at station C6V27, the predicted SWH correlated very well with the observed values. Even so, what cannot be ignored is that some outliers are not concentrated near the bisector. The most likely cause is that the vast majority of SWH observed at these stations are less than 4 m, whereas a few extreme events still exist.
To reveal the difference in prediction performance between GRU and the other four methods clearly, we randomly chose a piece of the predictive results for each station to show in Figure 4, in which the number of data points is 100. Given a closer inspection of Figure 4, it can be found that the test results of SVM and ELM tend to have larger volatility than GRU. One of the known reasons for this volatility is that ELM and SVM are more susceptible to parameter selection. In practical application, both of them are more likely to plunge into a local minimum, so their stability is relatively poorer. By contrast, based on the gate mechanism, the reset gate throws away the unwanted information and the update gate propagates useful context from the previous hidden states, which endows GRU with a strong ability to exploit the future and previous information without sophisticated parameter tuning. Hence, the predictions made by GRU at all the selected stations with lead times of 3 h are of high accuracy and stability.


SWH Prediction with Lead Times of 6 h
The experiments with lead times up to 6 h are described in this section. A vertical comparison of Tables 3-5 reveals that the prediction accuracy drops as the forecasting time horizon increases, with the RMSE increasing while the R and IA decrease. Given a horizontal analysis of Table 4, the GRU model still outperformed the other four models with respect to all the assessment criteria, as its intrinsic structure enables GRU to preserve memories over the long term. The RMSE of GRU is 54.4% lower than that of ELM at station 46714D. The prediction performance at station NanJi is the best among the six stations (RMSE = 0.265, R = 0.902, IA = 0.946), which may be due to the smallest amount of data and the lowest average SWH at this station.
The scatter diagrams of observed and forecasted SWH for lead times of 6 h are illustrated in Figure 5. Although the increased forecasting horizons result in a higher level of dispersion, these points are still distributed relatively close to the diagonal line. The predicted results obtained by GRU at stations 46714D, C6V27, and NanJi are still satisfactory, which may be attributed to less missing data in these areas. Figure 6 provides the comparison of the five different algorithms for the 6 h prediction. Although the forecasting accuracy decreased for all models, the deep learning method GRU yielded better prediction results and captured the trend of the data relatively well. The superiority of GRU in comparison to the other four methods is because GRU is proficient at identifying previously essential information to estimate the current state. On the contrary, SVM, BP, RF and ELM belong to the shallow machine learning models, whose insufficiency has restricted their application in long-term time series prediction.
What cannot be ignored is that a great difference between SVM and the others exists in the first panel of Figures 4 and 6. There may be two possible reasons for this phenomenon. On the one hand, the parameter setting of SVM was chosen according to the previous study and balanced against the computational time; therefore, it may not be optimal for every buoy station. On the other hand, the average SWH of 46714D is less than 1 m in the selected period, which is relatively small in comparison with the others.


Long-Term Span SWH Prediction
The prediction results with lead times of 12 h and 24 h are listed in Tables 5 and 6. The longer the prediction horizon is, the weaker the link in the data series is. Therefore, there is no doubt that RMSE increases, whereas R and IA decrease at the same time. However, the performance of GRU is still the best among these five algorithms, and its prediction error was within an acceptable range with lead times up to 12 h and 24 h. It can be seen that the results obtained by ELM and SVM are invalid. At station 46714D, the RMSE obtained by GRU is only 0.371 for the 12 h forecast, whereas the same index for ELM and SVM is as high as 0.840 and 0.663, respectively, which indicates that GRU has stronger adaptability and reliability for long-horizon prediction. As shown in Figure 7, an obvious hysteresis exists in the 24 h forecasting, but the general trends of the predicted SWH are consistent with the observed values. This may be because the dependence of the SWH on the previous wave characteristics decreases over a large forecasting time horizon.
In addition, it is somewhat disappointing to find that GRU underestimated the SWH for 12 h and 24 h forecasting, especially in extreme events at all stations. A possible explanation for these results might be that slight and moderate seas accounted for an overwhelming portion of the training data for GRU.

The peak SWH of 7.4 m observed on 8 August 2015 at 13:00 was predicted as 5.744, 5.699, 5.373, and 3.961 m by GRU for lead times of 3 h, 6 h, 12 h, and 24 h at station 46714D. Station C6V27 suffered Super Typhoon Haima on 20 October 2016, where the observed SWH was 11.7 m at 18:00. The peak predictions made by GRU are 9.818, 9.383, 8.211, and 6.464 m for the 3 h, 6 h, 12 h, and 24 h forecasts, respectively. The most likely cause of underestimation for larger wave heights is that the training datasets usually do not contain sufficient similar data for the peak wave height.

Conclusions
GRU is a novel deep learning method that is adept at retaining long-term information with high efficiency, which can provide fresh insight into time series prediction. In this paper, the performance of GRU for SWH prediction with lead times of 3, 6, 12 and 24 h was investigated. To test the performance of GRU, the current SWH and the wind speed of the past 3 h collected from six buoy stations distributed at various sites in the Taiwan Strait and its adjacent waters were fed as inputs, and the error indicators RMSE, R, and IA were utilized to evaluate the accuracy.
Overall, it can be concluded that GRU has the ability to produce better forecasting values and to capture the general data trend. By comparison, the predictions made by SVM and ELM are rather inaccurate and tend to have larger fluctuations. For BP and RF, the forecasting skill is slightly inferior to that of GRU. Because GRU has a strong edge in long-term time series prediction, the performance with lead times of 3 and 6 h is satisfactory and trustworthy. As the forecasting time increased, the root mean square error increased and the coefficient of correlation decreased for all models. However, the error statistics of GRU are still within an acceptable range. Although GRU does not completely succeed in predicting peak wave heights for extreme events, much of the underestimation can be attributed to the lack of sufficiently similar large wave heights in the training database.
Benefitting from the recurrent structure and the special gate mechanism, it is believed that GRU can provide SWH predictions with multistep lead times in a reliable and prompt way, which is favorable for coastal disaster risk reduction and mitigation management. As long as the forecasts exceed predefined threshold levels, hazard warnings at a detailed scale can be issued immediately, which may assist the authorities and decision-makers in creating better preparedness for the sake of coastal residential communities and safe offshore operations. A further improvement in the SWH prediction accuracy is possible through the provision of more input features such as wind direction and wave direction.