Ocean Wave Height Series Prediction with Numerical Long Short-Term Memory

Abstract: This paper investigates the possibility of using machine learning to correct numerical predictions of wave height series. This is done by incorporating numerical predictions into a long short-term memory (LSTM) network. Specifically, a novel ocean wave height series prediction framework, referred to as numerical long short-term memory (N-LSTM), is introduced. The N-LSTM takes a combined wave height representation, which is formed of a current wave height measurement and a subsequent Simulating Waves Nearshore (SWAN) numerical prediction, as the input and generates the corrected numerical prediction as the output. The correction is achieved by two modules in cascade, i.e., the LSTM module and the Gaussian approximation module. The LSTM module characterizes the correlation between measurement and numerical prediction. The Gaussian approximation module models the conditional probabilistic distribution of the wave height given the learned LSTM. The corrected numerical prediction is obtained by sampling the conditional probabilistic distribution, and the corrected numerical prediction series is obtained by iterating the N-LSTM. Experimental results validate that our N-LSTM effectively improves the accuracy of wave height numerical prediction from SWAN for the Bohai Sea and Xiaomaidao. Furthermore, compared with state-of-the-art machine learning based prediction methods (e.g., residual learning), the N-LSTM improves prediction accuracy by 10% to 20% for prediction times varying from 3 to 72 h.


Introduction
Ocean waves are irregular combinations of multiple waves with multiple wave heights, periods and travel directions [1]. Ocean waves can cause huge losses of life and property when large waves reach the coast. Predictions of significant wave height play an important role in marine engineering applications such as fisheries, exploration, power generation and marine transportation [2].
In the research literature, many efforts have been made in predicting significant wave heights [3,4]. Numerical wave models are widely applied to global sea-state predictions [5,6]. The principle of a numerical wave model is to obtain the wave height, period and other information by solving the wave spectrum equation of ocean physical processes. Dentale et al. [7] compare wave height buoy observations and model predictions and find that numerical prediction is a reliable method of wave height prediction. The third-generation models such as the Wave Model (WAM) [8,9], WAVEWATCH-III (WWIII) [10,11] and Simulating Waves Nearshore (SWAN) [12,13] are among the most advanced numerical models. The WAM model and the WWIII model are similar in structure. WWIII uses more complicated dissipation source terms and wind input terms than the WAM model [14]. Liu et al. [15] use the data of the South Indian Ocean to compare the performance of these models. Residual learning methods [35,36] treat machine learning and a numerical model as two separate procedures; learning the difference between their predictions cannot inherently exploit their joint prediction capability. In order to explore the potential of machine learning for lifting wave height numerical prediction accuracy, we introduce a numerical long short-term memory (N-LSTM) framework for correcting the wave height predictions from the numerical model. The model is composed of an LSTM module and a Gaussian approximation module in cascade. The LSTM module characterizes the measurement-prediction correlation for wave height sequences, resulting in the condition on which the Gaussian approximation module models the wave height distribution. In contrast to error and residual learning [35,36], which just model the difference between measurement and prediction, our N-LSTM comprehensively characterizes the correlation between measurement sequences and numerical prediction sequences via long short-term memory. It explores temporal characteristics and thus renders accurate corrections for numerical prediction sequences. Furthermore, the measurements in our method are used to provide wave height feature information, and we can use very few measurements to predict future wave heights over a longer time. As long as there are not many missing measurements, our method can be used.
Our N-LSTM is motivated by the autoregressive recurrent networks [39] in terms of model structure but exhibits major differences in two operational aspects. First, our N-LSTM incorporates numerical predictions, which are not considered in the autoregressive recurrent network. Second, our N-LSTM makes predictions by correcting the wave height numerical prediction, in contrast to the straightforward prediction in autoregressive recurrent networks. Experimental evaluations validate the effectiveness of our proposed model in predicting significant wave heights by correcting significant wave height predictions from SWAN for the Bohai Sea and Xiaomaidao.
The paper is organized as follows: Section 2 presents the structure of the numerical long short-term memory (N-LSTM) network and describes the training method for the N-LSTM network. Section 3 describes the significant wave height correction based on an N-LSTM network. Section 4 evaluates the effectiveness of our method qualitatively and quantitatively. Section 5 discusses the effectiveness of our method under large wave conditions and the probability correction. Section 6 concludes the paper.


Overall N-LSTM Model
In this section, we present a numerical long short-term memory (N-LSTM) network for correcting the numerical wave height prediction series. The N-LSTM processes time-sequential data. It consists of an LSTM module and a Gaussian approximation module. Figure 1 shows the structure of the N-LSTM. At the time t, the input of the overall N-LSTM and also the LSTM module is a combined significant wave height representation x_t, which is formed of the current wave height measurement m_t at the time t and a numerical prediction n_{t+1} for the significant wave height at the subsequent time t + 1. The LSTM input x_t is denoted as follows:

x_t = [m_t, n_{t+1}]*, (1)

where * denotes the transpose operation. The LSTM updates the memory cell state c_t and the output of the LSTM module h_t. c_t and h_t are fed back for updating the LSTM at the subsequent time t + 1. In addition, h_t is used as the input of the Gaussian approximation module. The output of the overall N-LSTM and also the Gaussian approximation module is n̂_{t+1}, which is the corrected numerical prediction with respect to the numerical prediction n_{t+1}.
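The combined input above can be sketched in a few lines. This is a minimal illustration in PyTorch; the wave height values and the variable names (`measurements`, `numerical`) are made up for the sketch and are not from the paper's code.

```python
import torch

# Illustrative only: made-up hourly significant wave heights in meters.
measurements = torch.tensor([0.8, 0.9, 1.1, 1.0])     # m_1 .. m_4
numerical = torch.tensor([0.7, 0.85, 1.0, 1.2, 1.3])  # n_1 .. n_5

def combined_input(t):
    """Build x_t = [m_t, n_{t+1}]*: the current measurement paired with
    the numerical prediction for the next hour."""
    return torch.stack([measurements[t], numerical[t + 1]])

x0 = combined_input(0)  # pairs m_1 with n_2
```

Each time step thus feeds the network one measurement together with the numerical model's guess for the following hour, which is the quantity the N-LSTM learns to correct.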

LSTM Module
This subsection describes the internal structure of the LSTM module. The LSTM module consists of an input gate i_t, a forget gate f_t and an output gate o_t. Figure 2 shows the architecture of an LSTM module.

At the time t, the inputs of the LSTM module contain the combined significant wave height representation x_t, the memory cell state c_{t−1} and the output of the LSTM module h_{t−1}. c_t denotes memory information. The LSTM computing process is as follows:

i_t = sig(W_i x_t + U_i h_{t−1} + b_i), (2)
f_t = sig(W_f x_t + U_f h_{t−1} + b_f), (3)
o_t = sig(W_o x_t + U_o h_{t−1} + b_o), (4)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c), (5)
c_t = f_t ∘ c_{t−1} + i_t ∘ c̃_t, (6)
h_t = o_t ∘ tanh(c_t), (7)

where W and U denote weight matrices and b denotes bias vectors. Sig and tanh are the sigmoid (smooth step) function and the hyperbolic tangent function, respectively. ∘ denotes the Hadamard product. The output of the LSTM module c_t and h_t will be the input of the LSTM module at the subsequent time, and h_t is also the input of the Gaussian approximation module. Θ_LSTM summarizes the parameters of the LSTM module as follows:

Θ_LSTM = {W_i, U_i, b_i, W_f, U_f, b_f, W_o, U_o, b_o, W_c, U_c, b_c}.

The LSTM structure extracts long-term useful features in sequences and has been validated effectively in a variety of prediction problems.
Different from the original LSTM which tends to memorize historical measurements to predict subsequent measurements, the LSTM module within our N-LSTM explores a combined significant wave height representation in terms of both real measurements and the numerical predictions. Being capable of capturing time-varying characteristics, the LSTM module characterizes the temporal correlation between the real measurements and the numerical predictions. Therefore, the LSTM module within the N-LSTM takes advantage of both real measurements and the numerical predictions along with their temporal correlation characteristics and is bound to exhibit greater prediction capability than the original LSTM which just relies on memorizing historical measurements.
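The gate equations (2)-(7) can be written out directly. The following is a minimal sketch of one LSTM step in PyTorch; the toy dimensions and the randomly initialized parameter dictionary are our assumptions for illustration, not the paper's implementation.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update following Equations (2)-(7): input, forget and
    output gates, candidate memory, cell state and hidden output.
    `p` is a dict of weight matrices W_*, U_* and bias vectors b_*
    (the paper groups these as Theta_LSTM)."""
    i = torch.sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])
    f = torch.sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    o = torch.sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])
    c_tilde = torch.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    c = f * c_prev + i * c_tilde   # Hadamard products, Eq. (6)
    h = o * torch.tanh(c)          # Eq. (7)
    return h, c

# Toy dimensions: input size 2 (m_t and n_{t+1}), hidden size 4.
torch.manual_seed(0)
p = {k: torch.randn(4, 2) * 0.1 for k in ["W_i", "W_f", "W_o", "W_c"]}
p.update({k: torch.randn(4, 4) * 0.1 for k in ["U_i", "U_f", "U_o", "U_c"]})
p.update({k: torch.zeros(4) for k in ["b_i", "b_f", "b_o", "b_c"]})
h, c = lstm_step(torch.tensor([1.0, 1.2]), torch.zeros(4), torch.zeros(4), p)
```

In practice a library cell such as `torch.nn.LSTMCell` implements the same update; the explicit version just makes the correspondence to the equations visible.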

Gaussian Approximation Module
The LSTM module learns the correlation between historical measurements and numerical prediction. The learning uncertainty arises subject to noise from historical measurements and numerical prediction. To alleviate such deficiency, probabilistic models are used for addressing uncertainty. The widely used Gaussian function is exploited for modeling the conditional distribution of the significant wave height, and accordingly a Gaussian approximation module is introduced. Figure 3 shows the architecture of a Gaussian approximation module.

At the time t, the input of the Gaussian approximation module is the output of the LSTM module h_t. Specifically, the Gaussian approximation module commences by processing h_t via two layers in parallel as shown in Figure 3. The top layer is a linear mapping which operates as follows:

µ(h_t) = w_µ* h_t + b_µ, (9)

where w_µ and b_µ are the weight and bias of the linear mapping, respectively. The bottom layer is a linear mapping, followed by a softplus activation. The softplus activation is used to ensure the variance σ > 0. The linear mapping and softplus activation with respect to the bottom layer operate as follows:

σ(h_t) = log(1 + exp(w_σ* h_t + b_σ)), (10)

where w_σ and b_σ are the weight and bias of the linear mapping, respectively. The softplus activation ensures that σ(h_t) is positive. We denote the parameter set Θ_P of the Gaussian approximation module as follows:

Θ_P = {w_µ, b_µ, w_σ, b_σ}.

The Gaussian approximation module terminates by modeling a conditional probabilistic distribution for significant wave height in terms of a Gaussian distribution. The outputs µ(h_t) and σ(h_t) of the two layers (9) and (10) are modeled as the mean and variance of the Gaussian distribution, respectively.
The conditional probabilistic distribution for significant wave height is given as follows:

P(r_{t+1} | h_t; Θ) = N(r_{t+1}; µ(h_t), σ(h_t)), (12)

where r_{t+1} denotes the real wave height observation at the time t + 1, N(·; µ, σ) denotes a Gaussian density with mean µ and variance σ, and Θ represents the overall N-LSTM parameter set including Θ_LSTM and Θ_P as follows:

Θ = {Θ_LSTM, Θ_P}.

The Gaussian approximation module models the conditional probabilistic distribution of the subsequent significant wave height given the learned LSTM (i.e., the learned h_t), rather than straightforwardly predicting the significant wave height. It should be noted that the unconditional significant wave height distribution might not follow a Gaussian distribution but an arbitrary one. It is observed in the probability literature that the distribution of an arbitrary (non-Gaussian) process can be approximated by a weighted combination of Gaussian distribution functions. In the light of this observation, the unconditional distribution P(r_{t+1}) of the significant wave height process can be expressed as a weighted combination of the Gaussian distributions P(r_{t+1}|h_t) as follows:

P(r_{t+1}) = ∫ P(r_{t+1} | h_t) P(h_t) dh_t,

where P(h_t) behaves as the weight for the combination. This justifies the effectiveness of using the Gaussian function to model the conditional distribution of significant wave height.
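The two parallel layers described above can be sketched as a small PyTorch module. The class and layer names, and the hidden size, are assumptions for illustration; only the structure (two linear maps, one followed by a softplus) is taken from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianApproximation(nn.Module):
    """Sketch of the module: two parallel linear layers on h_t, the
    second followed by a softplus so the variance is strictly positive."""
    def __init__(self, hidden_size=50):
        super().__init__()
        self.mu_layer = nn.Linear(hidden_size, 1)     # mean, Eq. (9)
        self.sigma_layer = nn.Linear(hidden_size, 1)  # linear part of Eq. (10)

    def forward(self, h_t):
        mu = self.mu_layer(h_t)
        sigma = F.softplus(self.sigma_layer(h_t))     # log(1 + exp(.)) > 0
        return mu, sigma

torch.manual_seed(0)
module = GaussianApproximation(hidden_size=50)
mu, sigma = module(torch.randn(50))
```

The softplus guarantees a valid (positive) variance regardless of the linear layer's raw output, which is why it is preferred here over, say, a plain absolute value.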

Sampling for Correcting Numerical Predictions
The prediction for the significant wave height at the time t + 1 is obtained by sampling from the conditional probability distribution (12) and it is presented as follows:

n̂_{t+1} ∼ P(r_{t+1} | h_t; Θ). (14)

The benefits of the sampling procedure are two-fold. First, the sampling procedure produces a future result by not only learning historical measurements but also correcting numerical predictions with respect to historical measurements. This is enabled by the conditional probability distribution which encodes both measurements and numerical predictions within h_t. Second, the sampling strategy alleviates the accumulation of prediction errors which cause considerable prediction uncertainty. At each time step, an LSTM without sampling the conditional probabilistic distribution predicts a fixed value. The predicted value might be contaminated by noise. If the noisy predicted values are repeatedly recycled for updating the LSTM, error accumulation inevitably arises. The random sampling of the conditional distribution neutralizes such error accumulation and renders predictions more robust than those straightforwardly regressed from an LSTM.
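The sampling step itself is a single Gaussian draw. In this sketch the mean and variance are made-up stand-ins for the Gaussian approximation module's outputs at one time step; since the paper treats σ(h_t) as a variance, the standard deviation passed to the sampler is its square root.

```python
import torch

# Made-up outputs of the Gaussian approximation module at one time step.
mu = torch.tensor(1.3)      # mean mu(h_t), in meters
sigma = torch.tensor(0.04)  # variance sigma(h_t)

torch.manual_seed(0)
n_hat = torch.normal(mu, sigma.sqrt())  # one corrected prediction draw
```

Equivalently, `torch.distributions.Normal(mu, sigma.sqrt()).sample()` would produce the same kind of draw.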

N-LSTM Training
One sample for training the N-LSTM consists of a real significant wave height observation sequence r_{1:T} = {r_1, r_2, · · · , r_τ, r_{τ+1}, · · · , r_T} and a significant wave height numerical prediction sequence n_{1:T} = {n_1, n_2, · · · , n_τ, n_{τ+1}, · · · , n_T}, where T indicates the sequence length. The training procedure with respect to the sample is under the assumption that real observations are available for the measurement time t ∈ {1, 2, · · · , τ} but unavailable for the time t ∈ {τ + 1, · · · , T}. In this scenario, the significant wave height predictions obtained as described in Section 2.4 are used for representing the measurements at the time t ∈ {τ + 1, · · · , T}.
In addition, we construct the significant wave height measurement sequences m_{1:T} = {m_1, m_2, · · · , m_τ, m_{τ+1}, · · · , m_T}, with the significant wave height measurement m_t defined as follows:

m_t = r_t for t ∈ {1, 2, · · · , τ}, and m_t = n̂_t for t ∈ {τ + 1, · · · , T},

where r_t is the real observation of the significant wave height at the time t and n̂_t is computed according to (14). We train the N-LSTM by maximizing the log-likelihood as follows:

L(Θ) = Σ_{t=1}^{T−1} log P(r_{t+1} | h_t; Θ). (18)
We use a stochastic gradient descent algorithm to optimize Equation (18) by calculating the gradient of Θ. The Adam adaptive stochastic gradient descent optimizer is used to minimize the loss in our work [40]. The overall procedure of training the N-LSTM network is presented in Algorithm 1.
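As a self-contained sketch of the objective, maximizing the log-likelihood (18) is equivalent to minimizing the Gaussian negative log-likelihood below. Here `mu` and `log_sigma` are free tensors standing in for the network outputs so the example runs without the full N-LSTM; the data values are made up.

```python
import torch

obs = torch.tensor([1.1, 1.4, 1.2])             # real observations r_{t+1}
mu = torch.zeros(3, requires_grad=True)         # stand-in for mu(h_t)
log_sigma = torch.zeros(3, requires_grad=True)  # log-parametrized variance

optimizer = torch.optim.Adam([mu, log_sigma], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    var = log_sigma.exp()                       # variance sigma(h_t) > 0
    # Gaussian negative log-likelihood, summed over time steps
    nll = 0.5 * (torch.log(2 * torch.pi * var) + (obs - mu) ** 2 / var)
    nll.sum().backward()
    optimizer.step()
```

After optimization the fitted means approach the observations, which is exactly what maximizing (18) asks of the network's µ(h_t) outputs.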

Algorithm 1
The training procedure of the N-LSTM
1: Input: The significant wave height numerical predictions n_1, · · · , n_T and the real observations r_1, · · · , r_T.
2: Output: The parameters Θ of the N-LSTM network.
3: for i = 1, 2, · · · , I do
4: for t = 1, 2, · · · , T do
5: Compute h_t according to (2)-(7).
6: Compute µ(h_t) and σ(h_t) according to (9) and (10).
7: Construct the conditional probabilistic distribution P(r_t | h_t; Θ) according to (12).
8: end for
9: Maximize the log-likelihood according to (18).
10: Update the parameters by the stochastic gradient descent algorithm.
11: end for

The significant wave heights at the time t ∈ {1, 2, · · · , τ} do not require prediction because real observations are available in this time interval. The significant wave height real observations r_{1:τ} and numerical predictions n_{1:τ} at the time t ∈ {1, 2, · · · , τ} are used for computing h_τ according to (7). The real observation is no longer available starting from the time τ + 1. At the time τ + 1, the corrected prediction n̂_{τ+1} is produced by the Gaussian approximation module in terms of sampling the conditional distribution as follows:

n̂_{τ+1} ∼ P(r_{τ+1} | h_τ; Θ).

At the time t ∈ {τ + 1, · · · , T}, the corrected prediction n̂_t is used as the measurement m_t, i.e.,

m_t = n̂_t, for t ∈ {τ + 1, · · · , T}.

Specifically, the input of the N-LSTM is given as follows:

x_t = [n̂_t, n_{t+1}]*, for t ∈ {τ + 1, · · · , T}.

The corrected numerical prediction series {n̂_{τ+1}, n̂_{τ+2}, · · · , n̂_T} are obtained by iterating the N-LSTM procedure as illustrated in Figure 1.
In order to render more accurate predictions, the sampling at each time t ∈ {τ + 1, · · · , T} is performed multiple times and the average is given as the resulting prediction n̂_t.
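The iteration just described can be sketched as a single loop. `lstm_step` and `gauss` below are hypothetical callables standing in for the two modules, and the hidden size, data values and sample count are assumptions so the sketch runs end to end.

```python
import torch

def rollout(lstm_step, gauss, r_obs, n_num, tau, T, samples=30):
    """Real observations drive the LSTM up to time tau; afterwards each
    corrected prediction (the mean of several draws) is recycled as the
    next pseudo-measurement."""
    h = torch.zeros(4)
    c = torch.zeros(4)
    m = list(r_obs[:tau])                 # measurements m_1 .. m_tau
    corrected = []
    for t in range(T - 1):
        x_t = torch.stack([m[t], n_num[t + 1]])
        h, c = lstm_step(x_t, h, c)
        if t >= tau - 1:                  # no real observation from tau+1 on
            mu, var = gauss(h)
            draws = mu + var.sqrt() * torch.randn(samples)
            n_hat = draws.mean()          # average over multiple samples
            corrected.append(n_hat)
            m.append(n_hat)               # reuse as measurement m_{t+1}
    return corrected

# Dummy stand-ins for the two modules so the sketch is runnable.
torch.manual_seed(0)
dummy_lstm = lambda x, h, c: (torch.tanh(h + x.mean()), c)
dummy_gauss = lambda h: (h.mean(), torch.tensor(0.01))
r = torch.tensor([0.8, 0.9, 1.0])
n = torch.tensor([0.7, 0.85, 1.0, 1.1, 1.2])
preds = rollout(dummy_lstm, dummy_gauss, r, n, tau=3, T=5)
```

With τ = 3 and T = 5 the loop produces corrections for the two unobserved steps, mirroring the series {n̂_{τ+1}, · · · , n̂_T} in the text.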

Data Collection and Evaluation Criteria
The study area is the Bohai Sea and the northern part of the Yellow Sea (35°-41° N, 117°-123° E), as shown in Figure 5. The submarine topography in the study area is relatively flat, with an average water depth of about 18 m. The maximum tidal range is about 2.7 m. The study area is dominated by wind waves and is greatly affected by the monsoon.

In our work, the wave height numerical predictions are obtained from the third-generation numerical wave model SWAN. The grids in this study have a spatial resolution of 0.1° × 0.1°. The source term in the SWAN model uses the default formula package. The model considers the dissipation caused by whitecapping, bottom friction and depth-induced wave breaking [41]. Other configurations of the SWAN model are carried out according to [42]. The real wave height observations for validating our method are collected from buoys located in the Bohai Sea, China, and Xiaomaidao Nearshore Sea, Shandong. The buoy in the Bohai Sea is located in the open waters inside the Bohai Sea with a water depth of 20 m. The position of the buoy on Xiaomaidao is shown in Figure 5 and the water depth is about 24 m. All the obtained data are recorded hourly over a period of two years from January 2017 to December 2018. We use a total of 17,510 sample pairs to train and test our model. We randomly select 10% of the time-continuous data pairs as the test set and the remaining 90% of the data pairs are used as the training set. All training data and test data are divided into real significant wave height observation sequences and numerical model prediction sequences according to Section 2.5. The number of sequences is related to the predicted time T.
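The sequence construction and split can be sketched as follows. The sequence length T, the seed and the stand-in data are assumptions for the sketch; only the pair count (17,510) and the 90%/10% split are from the text.

```python
import random

T = 72  # assumed sequence length for illustration
pairs = list(range(17510))  # stand-in for (observation, prediction) pairs

# Cut the hourly record into non-overlapping length-T sequences.
sequences = [pairs[i:i + T] for i in range(0, len(pairs) - T + 1, T)]

# Hold out a random 10% of the sequences for testing.
random.seed(0)
random.shuffle(sequences)
n_test = len(sequences) // 10
test_set, train_set = sequences[:n_test], sequences[n_test:]
```

Shuffling whole sequences rather than individual hours keeps each test sample time-continuous, matching the description above.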
We adopt the root mean square error (RMSE), mean absolute percentage error (MAPE) and skill score (SS) to evaluate the effectiveness of the proposed method. The RMSE and MAPE metrics are defined as follows:

RMSE = √( (1/N) Σ_{t=1}^{N} (n̂_t − r_t)² ),

MAPE = (1/N) Σ_{t=1}^{N} |n̂_t − r_t| / r_t × 100%,

where N is the number of test samples. The skill score shows the relative improvement of the proposed models over the numerical model. It is defined as follows:

SS = 1 − RMSE_model / RMSE_numerical.

The performance of the prediction methods is better when the RMSE and MAPE are closer to 0 and the SS is closer to 1.
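The three criteria are straightforward to compute. The sketch below uses made-up prediction and observation values; the skill score is taken as one minus the RMSE ratio against the numerical baseline, an assumption consistent with "closer to 1 is better".

```python
import math

def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def mape(pred, obs):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - o) / o for p, o in zip(pred, obs)) / len(obs)

def skill_score(rmse_model, rmse_numerical):
    """Relative improvement over the numerical baseline: 1 is a perfect
    correction, 0 means no improvement over the numerical model."""
    return 1.0 - rmse_model / rmse_numerical

obs = [1.0, 2.0, 1.5]   # made-up observed wave heights (m)
pred = [1.1, 1.9, 1.6]  # made-up corrected predictions (m)
```

For instance, halving the numerical model's RMSE yields `skill_score(0.2, 0.4) = 0.5`, i.e., a 50% skill score as reported in the experiments.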
In our experiment, the epoch and learning rate are set to 10,000 and 0.001, respectively. The number of hidden units of the LSTM module in the N-LSTM is set to 50. The hyper-parameter setting of the LSTM is the same as that of the N-LSTM. The major hyper-parameter of the ELM methods is the number of hidden layers, which is set to 36 for both ELM and MFELM. We carry out hyper-parameter optimization through manual search. We conduct our experiments on a computing platform with three Intel Xeon E5-2690 CPUs at 2.60 GHz. The N-LSTM and LSTM are implemented in PyTorch under the Python environment. The ELM and MFELM residual learning methods are implemented using MATLAB R2019a.

Empirical Evaluations
We compare the performance of our N-LSTM method with the LSTM method, the ELM residual learning method [36] and the MFELM residual learning method [43] through the significant wave height prediction results of the numerical model in the future time. The numerical model is used as a baseline method for evaluating the improvement of the predictions from the numerical model. The LSTM behaves as one module in our N-LSTM. The other important module in the N-LSTM is the Gaussian approximation. The LSTM is used for validating the effectiveness of the overall N-LSTM (especially with the involvement of the Gaussian approximation) over the sole LSTM. The ELM and MFELM share similar ideas with our N-LSTM in terms of correcting numerical predictions. They straightforwardly model the difference between the numerical predictions and the real observations via residual learning. In particular, the MFELM residual learning method exploits multiple factors, such as wind speed and wind direction, and achieves state-of-the-art performance for wave height predictions. In contrast to the ELM and MFELM residual correction methods, the N-LSTM does not model the residuals in a straightforward way but comprehensively incorporates the correction procedure into an LSTM followed by a Gaussian approximation. The empirical comparison between the N-LSTM and the (MF)ELM validates the advantage of the comprehensive correction procedure over the straightforward correction procedure.
We correct the significant wave height numerical model predictions for 3 h, 6 h, 12 h, 24 h, 48 h and 72 h, and the time step is one hour. Each of these forecast leads comes from an independent N-LSTM model. Tables 1 and 2 show wave height prediction results at the buoy positions of the open water in the Bohai Sea and the nearshore water by Xiaomaidao, respectively. The bold entries denote the best results. From Table 1, we observe that our proposed N-LSTM is significantly better than the LSTM method, the ELM residual learning method and the MFELM residual learning method in each evaluation criterion. This reflects that correcting the significant wave height predictions from the numerical model in the future time steps according to the real observations and the numerical model in the previous time steps is useful, and our model learns the relationship between the numerical model predictions and real observations effectively. In addition, we observe that the accuracy of all prediction methods decreases with the increase of the prediction time. In particular, the accuracies of the ELM related methods are considerably reduced when the prediction time increases beyond 24 h. However, our method maintains good predictions at 24 h and 48 h, and the skill scores are both higher than 50%. It drops to 36% in the 72-h prediction and is still significantly better than the other methods. This validates that our N-LSTM maintains good results in long-term predictions. This is because the LSTM structure can save long-term information that is not available in the ELM methods. From Table 2, the N-LSTM achieves the best performance compared with the other methods except for the six-hour prediction. The comparison methods have different wave height prediction accuracies at different locations. The original LSTM outperforms the two ELM methods for the Bohai Sea. The MFELM outperforms the original LSTM for Xiaomaidao.
This confirms that our method effectively improves the accuracy of nearshore wave height prediction and that it is robust when predicting wave heights in different sea areas. Figure 6 shows the significant wave height prediction results at various time steps and allows a qualitative comparison of the methods. We see that the performance of our method is better than that of the other three machine learning methods. In particular, Figure 6d,e show that as the significant wave height increases, the error between the predictions of the numerical model and the real observations increases too; our N-LSTM method is more accurate than the other three machine learning methods in these cases. More detailed validations are given in Section 4.1. In addition, the principle of the ELM residual learning method and the MFELM residual learning method is to predict the residuals between the real observations and the predictions of the numerical model. It is difficult to accurately predict the residuals when they change suddenly (i.e., when the significant wave height has a peak). Our N-LSTM method characterizes the temporal correlation between real observations and numerical predictions rather than straightforwardly using their difference, and thus achieves better performance.

Analysis of Significant Wave Height Predictions between 1.25 m and 4 m
The World Meteorological Organization (WMO) sea state code describes the wave height and its corresponding characteristics [44]. When the significant wave height is below 1.25 m, the sea state is calm or slight, and the numerical model performs well in this case. The probability of a significant wave height above 4 m appearing on the coast is extremely low. Significant wave heights between 1.25 m and 4 m, however, occur from time to time; the sea state is then rough and affects marine production. This section therefore discusses the model predictions for significant wave heights between 1.25 m and 4 m. Figure 7 shows the scatter plot of the significant wave height prediction results for which the real observations lie between 1.25 m and 4 m. In the plot, the diagonal indicates 100% accuracy: scattered points close to the diagonal reflect accurate predictions and those far from it inaccurate ones. In Figure 7, the numerical model prediction results are concentrated on the upper left of the diagonal, meaning that the numerical model predictions are consistently smaller than the real observations. The prediction results obtained by our method are closer to the diagonal than those obtained by the ELM residual learning method and the MFELM residual learning method. In particular, the predictions of the numerical model show a large deviation and the two ELM methods perform poorly when the significant wave heights are higher than 2 m. In contrast, our method maintains good predictions in this situation.
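Restricting the evaluation to the rough-sea band can be sketched as follows; the band-filtered RMSE below is an illustrative metric (the function name and toy values are not from the paper).

```python
# Sketch (illustrative): evaluate predictions only where the observed wave
# height falls in the rough-sea band of 1.25-4 m, as in the Figure 7 analysis.
import math

def rmse_in_band(preds, obs, lo=1.25, hi=4.0):
    """RMSE over the pairs whose *observation* lies in [lo, hi] metres;
    returns None when no observation falls in the band."""
    pairs = [(p, o) for p, o in zip(preds, obs) if lo <= o <= hi]
    if not pairs:
        return None
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))

obs   = [0.6, 1.3, 2.1, 3.4, 4.5]   # made-up observations (m)
preds = [0.5, 1.1, 1.9, 3.0, 4.0]   # made-up predictions (m)
# Only the 1.3, 2.1 and 3.4 m observations fall in the band.
print(round(rmse_in_band(preds, obs), 3))
```

Filtering on the observed (rather than predicted) height matches the scatter plot, whose points are selected by the real observations.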

Analysis of Probabilistic Wave Height Predictions
In addition to correcting the point predictions of the wave height numerical model, our method obtains confidence intervals for the predicted wave heights via Monte Carlo sampling [45], and thus produces probabilistic forecasts of future wave heights. Figure 8 shows probabilistic predictions for forecasting times of 1-3 h (nowcast) and 48 h (long forecast) over 600 h of consecutive predictions. The black scatter points show the real observations. The green line shows the numerical model predictions. We plot the prediction results of our method as a red line along with eighty-percent confidence intervals (shaded area). The confidence interval represents the probability interval within which the real wave height observations fall. Figure 8 indicates that our N-LSTM predictions accurately approach most of the real observations for the one-hour probabilistic forecast, whether the significant wave heights peak or stay stable. In addition, the corrected confidence intervals are smooth and their amplitude is small, which reflects that the probabilistic forecast is useful for the one-hour prediction. The confidence interval widens as the forecast time step increases and its amplitude fluctuates greatly at the peaks. In general, the confidence interval reasonably presents the characteristics of the uncertainty, because it contains most of the real observations. Accurate probabilistic prediction of the significant wave height plays an important role in the early warning of marine disasters.
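The interval construction can be sketched as follows. This is an assumed sketch, not the paper's code: it supposes the Gaussian approximation module yields a mean and standard deviation per time step, draws Monte Carlo samples from that Gaussian, and reads off an 80% interval from the empirical quantiles; the mean, sigma, and sample count are hypothetical.

```python
# Sketch (assumed, not the paper's code): an 80% confidence interval from
# Monte Carlo samples of a per-step Gaussian predictive distribution
# (mu, sigma), as the Gaussian approximation module would provide.
import random

def mc_interval(mu, sigma, n=5000, level=0.80, rng=None):
    """Draw n samples from N(mu, sigma^2) and return the empirical
    (lower, upper) bounds of the central `level` probability interval."""
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    samples = sorted(rng.gauss(mu, sigma) for _ in range(n))
    tail = (1.0 - level) / 2.0             # mass cut from each tail
    lower = samples[int(tail * n)]
    upper = samples[int((1.0 - tail) * n) - 1]
    return lower, upper

lo, hi = mc_interval(mu=1.8, sigma=0.15)   # hypothetical 1-step forecast (m)
assert lo < 1.8 < hi                       # the 80% band straddles the mean
```

Widening intervals at longer leads then correspond simply to larger predicted sigmas at those steps.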

Conclusions
This paper presents a numerical long short-term memory (N-LSTM) network for correcting the significant wave height predictions of the numerical model. The N-LSTM takes as input a combined significant wave height representation, formed of real observations from a buoy in the Bohai Sea and the predictions of SWAN at the same location, and produces the corrected numerical prediction. The N-LSTM characterizes the temporal correlation between the real measurements and the numerical predictions and models the conditional probabilistic distribution of the significant wave height given the real observations and numerical predictions. Compared with traditional machine learning prediction methods, our method reduces the accumulation of error in the prediction process. The experimental results show that the N-LSTM network has better prediction accuracy. Moreover, our N-LSTM method performs better when the significant wave height is over 1.25 m, a case in which the numerical model performs poorly. Our N-LSTM method also makes probabilistic predictions of significant wave heights, and the confidence interval of the prediction covers most of the real observations. Therefore, our N-LSTM method is capable of coping with not only difficult wave situations but also prediction uncertainties, and renders a robust prediction strategy.
Furthermore, we find that the wave height prediction of the numerical model sometimes lags. Since our model only uses the significant wave height as input, the prediction may fail in such cases. In future work, we will add sea surface wind, atmospheric pressure and other factors as inputs, and further investigate the impact of wave type [46,47] and meteorological elements.