Enhanced Weight-Optimized Recurrent Neural Networks Based on Sine Cosine Algorithm for Wave Height Prediction

Abstract: Constructing offshore and coastal structures with the highest level of stability and lowest cost, as well as preventing faulty risk, is the plan that stakeholders seek to achieve. Successful construction plans for such projects rely mostly on well-analyzed and modeled metocean data that yield high prediction accuracy for ocean environmental conditions, including waves and wind. Over the past decades, coastal projects have been planned and designed using traditional static analysis, which requires tremendous effort and high-cost resources to validate the data and determine the transformation of metocean conditions. The wind plays an essential role in the oceanic atmosphere and contributes to the formation of waves. This paper proposes an enhanced weight-optimized neural network based on the Sine Cosine Algorithm (SCA) to accurately predict the wave height. Three neural network models, namely Long Short-Term Memory (LSTM), Vanilla Recurrent Neural Network (VRNN), and Gated Recurrent Unit (GRU), are enhanced: instead of random weight initialization, SCA generates weight values that are adaptable to the nature of the data and the model structure. Besides, a Grid Search (GS) is utilized to automatically find the best model configurations. To validate the performance of the proposed models, metocean datasets have been used, and the original LSTM, VRNN, and GRU are implemented as benchmarking models. The results show that the optimized models outperform the three benchmarking models in terms of mean squared error (MSE), root mean square error (RMSE), and mean absolute error (MAE).


Introduction
In coastal and marine structural engineering, wave height is the main factor that needs to be considered. Wave conditions affect several marine activities and have significant consequences for marine industries. Nevertheless, accurate prediction is challenging because ocean waves are stochastic by nature [1,2]. Thus, estimating wave heights and their trends was, and still is, a major challenge that has to be tackled [3,4].
At early stages, semi-analytic models, such as the Pierson-Neumann-James and Sverdrup-Munk-Bretschneider models, were used to predict the height of the waves; still, such models are inefficient at describing in detail the wave conditions of the sea surface [5,6]. Subsequently, numerical models were widely used for wave height prediction; nevertheless, such models require high computational resources when dealing with large amounts of data [7,8].
Metocean data are the combination of meteorology and oceanography parameters that need to be studied in order to predict the oceanic environmental change [9]. The meteorology parameters are wind speed, direction, humidity, and air temperature. The oceanography parameters are wave height, period, current, and tides [10,11]. The prediction of oceanic environmental change is essential when planning coastal and offshore constructions, maintenance, or operation. A high-accuracy prediction of these parameters in the offshore environment reduces cost, time consumed, and fault risk. It also helps in determining the required weather window to achieve the projects in time.
Deep learning models were recently applied to atmospheric observation [12,13]. First, the height of the wave was predicted based on Artificial Neural Network (ANN) models. For instance, a feedforward network model was utilized to forecast the real-time wave height [14]. Similarly, Mandal et al. [15] proposed a Recurrent Neural Network (RNN) for forecasting wave height and showed a better correlation coefficient compared to the feedforward network. Another study, conducted by Mahjoobi et al. [16], predicted wave height based on a regressive support vector machine (SVM); their outcomes demonstrated that the proposed model outperforms the ANN model in terms of accuracy and computational time. Likewise, backpropagation and feedforward ANNs were compared against model trees for predicting wind speed as well as wave height, and model trees were found to be superior [17].
The ultimate objective of sea wave prediction modeling is to find precise short- or long-term forecasts of the studied variables at a certain time and location [18,19]. The oceanic modeling literature is categorized into physical-based and model-based methods [20,21], and both have been utilized to forecast ocean waves. According to the work in [3], which was based on the work in [21], these approaches are further graded based on their attempts to specifically parameterize ocean wave interactions. Physical-based methods mimic sea waves by finding the appropriate equation solutions and have been proven to be beneficial over longer forecasting periods, whereas model-based methods can be considered time-series or statistical methods and work well for short-term prediction [22][23][24][25][26]. Moreover, machine learning (or statistical) models can be applied to postprocess physics-based models [27][28][29][30][31].
The main contribution of this study is to predict the wave height with high accuracy and less computational cost. We precisely investigated the following objectives:

1. To propose an enhanced weight-optimized RNN based on SCA optimization to process time-series data with high accuracy.
2. To update and tune the learning rate based on a grid search mechanism.
3. To compare the proposed models against three well-regarded prediction models, LSTM, GRU, and VRNN, and to investigate whether the proposed models outperform them in terms of mean squared error (MSE), root mean square error (RMSE), and mean absolute error (MAE).
The rest of this paper is organized as follows. Section 2 reviews recent related work on recurrent neural network models for prediction. Section 3 describes the metocean data properties and the information of four stations in different locations. Section 4 explains the methodology, Section 5 presents the proposed models, and Section 6 illustrates the experimental setup. Results and discussion are presented in Section 7. Last, the paper is concluded in Section 8.

Related Work
Artificial neural network models fall within the model-based category, and several neural networks have been utilized to forecast and reconstruct wave characteristics. Artificial Intelligence (AI) models have become an alternative to numerical models in recent years. These models are relatively simple to assemble and have outperformed computational and statistical models for site-specific wave parameters [32]. The metocean data can be modeled and featured by applying feature selection algorithms [20,[33][34][35][36][37].
Fully connected neural networks were used in [38] to forecast wave heights. RNNs were used in [32] to predict the waves and showed better correlation coefficients in forecasting. A fuzzy logic modeling machine was also used in [32] to predict ocean wave energy, and an extreme learning machine was applied in [39].
Several studies have been conducted using meta-heuristic algorithms to improve neural network prediction. For instance, Zhang et al. [40] used a fruit fly optimization algorithm and kernel extreme learning machine for bankruptcy prediction. Zhao et al. [41] improved ant colony optimization by utilizing a chaotic intensification strategy and a random spare strategy for multi-threshold image segmentation. Tu et al. [42] proposed an improved whale optimization algorithm to overcome the local optimum stagnation problem as well as slow convergence speed. Similarly, the whale optimizer was improved using chaotic multi-swarm to boost support vector machines for medical diagnosis [43]. Shan et al. [44] proposed an improved moth flame optimizer based on an adaptive weight mechanism. Besides, an enhanced moth flame optimizer based on Gaussian, Levy, and Cauchy mutations was proposed for global optimization [45]. Similarly, a chaotic moth flame optimizer was utilized in [46] to boost kernel extreme learning machine for medical diagnosis. Chen et al. [47] proposed a new variant of the Harris hawks optimizer by integrating several strategies such as topological multi-population, chaos, and differential evolution.
Comparative studies have been carried out in [48], taking into account geographical differences and using different neural networks. The findings show that networks with less allocation of resources have a better structure and adaptability to different situations and conditions [1]. Wang et al. [7] used an evolutionary hybrid algorithm to simulate ocean waves; this method demonstrated better results than other neural networks when fewer weather data were available. A symbiotic organisms search was proposed in [49] to forecast ocean wave heights in two time zones based on a large number of accurate weather data, including wave heights measured by buoys; the proposed model performed better than other state-of-the-art models. In [50], the efficiency of optimization algorithms in resolving real-world complex problems such as the wave height problem is discussed by proposing a hybrid approach based on an accent-based multi-objective particle swarm optimization algorithm, whereas in [20], a sequence-to-sequence neural network, feature selection, and Bayesian hyperparameter optimization were applied to forecast the height of the wave and reconstruct predictions for neighboring buoy stations.
For time series forecasting, researchers apply cross-validation as a technique to distribute the data as in [51,52]. However, cross-validation approaches can be applied to stationary synthetic time series. Besides, using cross-validation depends mainly on the type of the data. Therefore, in our study we used the Out-of-sample (OOS) method, which is traditionally used to estimate predictive performance in time-dependent data.
Despite the good performance of such models, they still face some limitations. For example, feedforward models do not perform well with big time-series data, and RNNs face the well-known vanishing gradient issue. Recently, Rashid et al. [53] attempted to use a set of meta-heuristic algorithms, such as Harmony Search, GWO, and SCA, to overcome the vanishing gradient in LSTM. Similarly, Somu et al. [54] proposed an improved SCA using a Haar wavelet-based mutation operator to optimize the hyperparameters of LSTM for energy consumption prediction [55]. In this paper, the main contribution is to utilize the SCA to enhance the weights of three different types of RNN (LSTM, VRNN, and GRU) to forecast wave heights with high accuracy, so that instead of random weight initialization, SCA generates weight values that are adaptable to the nature of the data and the model structure. Furthermore, GS is employed to automatically find the best model configurations. Besides, according to the No Free Lunch theorem [56], there is no single model that solves all forecasting problems; as such, improvements can be made to existing models to enhance their performance [57].

Study Area
In order to measure the efficiency of the proposed models, four stations were selected from different locations, with several climate conditions as well as varying water depths. The hourly data used in this study were recorded by floating buoys located in the North Atlantic Ocean. The selected data were obtained from the NOAA (https://www.ndbc.noaa.gov/, accessed on 24 December 2020). Four stations are used in this study: Station 41010 (https://www.ndbc.noaa.gov/station_page.php?station=41010, accessed on 24 December 2020); Station 41040 (https://www.ndbc.noaa.gov/station_page.php?station=41040, accessed on 11 January 2021); and Station 41043 (https://www.ndbc.noaa.gov/station_page.php?station=41043, accessed on 5 January 2021). The details of these stations are described in Table 1. The stations are owned and maintained by the National Data Buoy Center, the world's largest environmental data buoy network [58].
The fourth station (41060) (https://www.ndbc.noaa.gov/station_page.php?station=41060, accessed on 17 January 2021) is owned and maintained by the Woods Hole Northwest Tropical Atlantic Wave Station (http://uop.whoi.edu/currentprojects/currentprojects.html, accessed on 17 January 2021), which mainly focuses on studying the physical processes at the ocean surface using moored surface buoys that carry meteorological and oceanographic sensors. The wave height is influenced by several factors such as the direction and speed of the wind, past wave height, and the temperature of the ocean surface [59,60]. The dataset of each station contains the following features: wind speed, significant wave height, wind direction, gust speed, dominant wave period, the direction from which the waves at the dominant period are coming, average wave period, air temperature, dewpoint temperature, sea surface temperature, sea level pressure, the water level in feet above or below mean lower low water, station visibility, and pressure tendency.

Methods
In this section, the main methods used in this paper are explained. First, the original VRNN is outlined, followed by LSTM then GRU. After that, the SCA algorithm is mathematically presented and finally the GS mechanism is elaborated.

Vanilla Recurrent Neural Network (VRNN)
The simplest structure of the VRNN is initialized with one single input layer, one single hidden layer, and one single output layer [61]. This simple three-layer architecture can process a sequence of T inputs through time t, given as the input vector x_t. All three layers are hierarchically joined from the input layer to the hidden layer and from the hidden layer to the output layer. The link between the layers of the network is called a weight matrix. W_W defines the weight connection between the hidden layer and the input layer at each time step. Equation (1) computes the hidden layer recursively to measure the current state of the network. W_V is the weight matrix that links the hidden layer units to each other. The number of hidden units is H, with h_t = {h_1, h_2, ..., h_H}.
The W_W matrix is multiplied by the inputs x_t and summed with the product of W_V and the previous state h_{t−1}. Then, this result is added to the bias b_h of the hidden layer. Equation (1) defines this process, and Equation (2) defines the network's current state.
f_H(·) represents the nonlinear activation function that converts the result of s_t into values that depend on the selected activation function; several activation functions exist in the state of the art, such as sigmoid, tanh, ReLU, and leaky ReLU [62][63][64]. W_U is the weight matrix that connects the hidden and output layers. The number of output units is N, represented as y_t = {y_1, y_2, ..., y_N}, while ŷ_t is the network prediction that can be obtained by Equation (3).
The result of the output layer is the sum of the product of the weights W_U and the current hidden state h_t from Equation (2), added to the bias of the output layer b_o; it is then transformed by the activation function f_s(·). The output layer predicts the results at each time step based on the hidden layer calculations and the input values. Ŷ defines the total prediction length parameter, so that ŷ_t = {ŷ_t1, ŷ_t2, ..., ŷ_tŶ}.
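The recurrence of Equations (1)-(3) can be sketched in a few lines of NumPy. The choice of tanh for f_H(·) and a linear (identity) output for f_s(·) is an assumption made for illustration, since the text leaves the activation functions open:

```python
import numpy as np

def vrnn_forward(x_seq, W_W, W_V, W_U, b_h, b_o):
    """Forward pass of a vanilla RNN over a sequence of T input vectors.

    s_t = W_W x_t + W_V h_{t-1} + b_h      (Equation (1))
    h_t = tanh(s_t)                        (Equation (2), f_H = tanh assumed)
    y_hat_t = W_U h_t + b_o                (Equation (3), f_s = identity assumed)
    """
    H = W_V.shape[0]
    h = np.zeros(H)                      # initial hidden state h_0
    outputs = []
    for x_t in x_seq:
        s_t = W_W @ x_t + W_V @ h + b_h  # Equation (1)
        h = np.tanh(s_t)                 # Equation (2)
        outputs.append(W_U @ h + b_o)    # Equation (3)
    return np.array(outputs)
```

Feeding a sequence of T input vectors returns one prediction per time step, matching ŷ_t in Equation (3).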

Long Short-Term Memory (LSTM)
LSTM is an improved variant of the RNN that was proposed to solve the gradient problem of the classical VRNN [65]. Nevertheless, because of its memory cells, LSTM is more expensive in terms of computational resources and requires extra memory compared to the RNN [65]. Typical LSTMs have almost four times more parameters than a simple VRNN; thus, they suffer from high complexity in the hidden layers. The main goal of designing LSTM was to solve the difficulties of learning long-term dependencies, regardless of the uncertainty of their costs [61,66].
The LSTM cell is constructed of several main gates, which are the input gate, the forget gate, and the output gate. Irrelevant information that has less importance for the prediction is dropped by the forget gate. By this mechanism, LSTM determines which new data are going to be processed in the cell state [61]. The cell state is altered by the forget gate positioned below the cell state and by the input gate. The previous cell state forgets and adds new information through the output of the input gate. The forget gate decides which data should be dropped; it forgets the irrelevant details coming from the previous state with the calculation in Equation (8) [61,67]. The input gate decides which information can enter the cell state, or long-term memory. This layer has two sections: one is the sigmoid function, while the other is the tanh() function. Typically, the target vector is called the "output gate". The following equations explain the parameter notations in the LSTM cell.
f^(t) is the forget gate, i^(t) is the input gate, and o^(t) is the output gate. g^(t) processes the current input x^(t). The previous short-term memory is h^(t−1), and c^(t−1) is the previous long-term memory state. σ is the logistic sigmoid function and tanh() is the hyperbolic tangent function. ŷ_t is the expected prediction output of the LSTM cell. c^(t) is the long-term state passed to the next cell, and h^(t) is the short-term state passed to the next cell.
W_xi, W_xf, W_xg, and W_xo are the weight matrices connecting the input x_t to the four layers. The connections between the previous state h^(t−1) and the four layers are W_hi, W_hf, W_hg, and W_ho. Each layer has its own bias: b_i, b_f, b_g, and b_o, respectively.
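The gate equations above can be sketched as one LSTM cell step in NumPy. The dictionary keys 'i', 'f', 'g', 'o' are an illustrative grouping of the matrices W_xi..W_xo, W_hi..W_ho, and biases b_i..b_o named in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W['i']..W['o'] play the role of W_xi..W_xo,
    U['i']..U['o'] the role of W_hi..W_ho, and b['i']..b['o'] the biases."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate memory
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_t = f * c_prev + i * g        # drop irrelevant memory, admit new input
    h_t = o * np.tanh(c_t)          # short-term state exposed to the next cell
    return h_t, c_t
```

The four gate computations are what give the LSTM roughly four times the parameters of a VRNN of the same hidden size.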

Gated Recurrent Units (GRU)
Cho et al. [68,69] introduced the GRU to solve the decay of information that occurs in the traditional recurrent neural network and to reduce the computational complexity of LSTM. The GRU uses two controlling gates, the update and reset gates, which control the data forwarded to the output. The update gate decides which past and present data must be moved to the new state, while the reset gate determines which previous data must be dropped at every time step. The parameters of the GRU cell are defined by the following equations,
where Z_t is the update gate, W_xz is the weight matrix between the input layer and the update gate, and W_hz contains the weights connecting the update gate to the hidden state h_{t−1}. b_z is the bias of the update gate. R_t represents the reset gate, which is connected to the input layer by W_xr, while W_hr is the weight matrix that connects it to the hidden state h_{t−1}. The bias that belongs to this gate is b_r.
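A minimal sketch of one GRU step follows, using the gate names above; the candidate-state weights (W_xh, W_hh, b_h here) complete the standard GRU formulation and are assumed names, since the text does not spell them out:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W_xz, W_hz, b_z, W_xr, W_hr, b_r, W_xh, W_hh, b_h):
    z = sigmoid(W_xz @ x_t + W_hz @ h_prev + b_z)              # update gate Z_t
    r = sigmoid(W_xr @ x_t + W_hr @ h_prev + b_r)              # reset gate R_t
    h_cand = np.tanh(W_xh @ x_t + W_hh @ (r * h_prev) + b_h)   # candidate state
    return (1 - z) * h_prev + z * h_cand                       # new hidden state
```

With three weight groups instead of the LSTM's four, the GRU keeps the gating idea while lowering the parameter count, which is the complexity reduction noted above.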

Sine Cosine Algorithm (SCA)
SCA is a population-based optimization technique that randomly generates multiple candidate solutions for an optimization problem and uses the mathematical sine and cosine functions to make them oscillate towards, or away from, the best solution. It uses random variables to adaptively balance exploration and exploitation while searching for the global optimum in the search space [70,71]. SCA has been found to be more efficient than other population-based algorithms in reaching a globally optimal solution [54]. Different regions of the search area are explored when the sine and cosine terms return values greater than one or less than minus one. Equations (15) and (16) demonstrate the sine and cosine position updates,
where X_i^t is the position of the current candidate at the t-th iteration in the i-th dimension and P_i is the position of the best candidate at the t-th iteration in the i-th dimension. The random parameters are r_1, r_2, r_3, and r_4, and (*) denotes multiplication. Equations (15) and (16) are combined in Equation (17).
The first parameter (r_1) defines the next search region, located either between the current solution and the destination or outside it. The second parameter (r_2) defines how far the movement towards or away from the destination should be, where T indicates the maximum number of iterations and t is the current iteration. a is a constant. Algorithm 1 shows the main pseudocode of the SCA.
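The position-update rule of Equations (15)-(17) can be sketched as follows. Minimizing the sphere function is a stand-in objective, and replacing the destination P greedily with any better agent is one common SCA convention; both are illustrative assumptions:

```python
import numpy as np

def sca_minimize(f, lb, ub, pop=20, T=200, a=2.0, seed=0):
    """Sine Cosine Algorithm sketch: Equations (15)-(17), destination P = best."""
    rng = np.random.default_rng(seed)
    dim = len(lb)
    X = rng.uniform(lb, ub, size=(pop, dim))       # random initial agents
    best = X[np.argmin([f(x) for x in X])].copy()  # best candidate found so far
    for t in range(T):
        r1 = a - t * (a / T)                       # r1 shrinks linearly with t
        for i in range(pop):
            r2 = rng.uniform(0.0, 2.0 * np.pi, dim)
            r3 = rng.uniform(0.0, 2.0, dim)
            r4 = rng.uniform()                     # chooses sine or cosine move
            step = np.abs(r3 * best - X[i])
            if r4 < 0.5:
                X[i] = X[i] + r1 * np.sin(r2) * step   # Equation (15)
            else:
                X[i] = X[i] + r1 * np.cos(r2) * step   # Equation (16)
            X[i] = np.clip(X[i], lb, ub)
            if f(X[i]) < f(best):
                best = X[i].copy()
    return best

# Stand-in objective: the 5-dimensional sphere function
sol = sca_minimize(lambda x: np.sum(x ** 2),
                   lb=np.full(5, -10.0), ub=np.full(5, 10.0))
```

Because r_1 decays towards zero, early iterations take large exploratory steps while late iterations fine-tune around the destination, which is the exploration-to-exploitation shift described above.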

Grid Search (GS)
In order to produce accurate results, deep learning models require many hyperparameters that need to be predefined before training takes place [72,73]. These hyperparameters have to suit the network structure and the nature of the dataset. Setting them can be accomplished manually by trial and error until the best results are achieved; however, this is inefficient, as it consumes time and may not work as expected [68,74]. Therefore, many hyperparameter tuning techniques exist that automatically configure models to suit the network's structure [75,76].
Grid Search is a hyperparameter tuning technique that can be applied to find the best model configuration [77] and produces more accurate results [75,78]. Therefore, in this paper, GS has been utilized to obtain the best learning rate values.

Algorithm 1: The pseudocode of SCA
Input:
• Set the lower bound and upper bound of the solutions X
• Set the population size
• Initialize the agents of the search space randomly
• Specify the maximum number of iterations
1: while the maximum number of iterations is not reached do
2:    Evaluate every candidate solution
3:    Define the best-selected solution (X*)
4:    Update r1, r2, r3, and r4
5:    Update the agents' locations in the search space with Equation (18)
6: end while
7: Return (X*)

The Proposed Enhanced Weight Optimized Recurrent Neural Networks
The models developed in this paper integrate the sine cosine algorithm [70] to optimize and update the weights of simple recurrent neural networks, aiming to overcome the vanishing gradient problem in predicting wave heights. Besides, we integrated an effective grid search mechanism to find the optimal configuration of the models' structures.
The main contribution is to generate weight values that are adaptable to the nature of the dataset and the models' structure. To achieve that, instead of initializing the weights of the recurrent neural networks randomly, the weight initialization is adapted using the SCA algorithm. As explained in the literature, the traditional training of RNNs is usually based on back propagation, so that parameters such as the weights and learning rate are updated by either increasing or decreasing their values until the optimal values are found, minimizing the error. Random initialization of the weights was adopted to overcome the drawbacks of back propagation and to speed up convergence. However, random weight initialization might not be adaptive to every data type and size [79]. Therefore, we propose an enhanced weight-optimized model to effectively overcome the limitations of the state-of-the-art methods, as well as to be more adaptive to any type of dataset.
The general structure of the proposed model is illustrated in Figure 2. The following subsections demonstrate the phases of the proposed model to analyze the metocean data and predict the wave heights.

Data Preprocessing
As explained in Section 3, the four datasets used in this paper were selected from different locations based on Table 1. The datasets have been preprocessed as follows:
• Linear interpolation is used to handle the missing values.
• The time series dataset is partitioned into 70% training data and 30% testing data; 25% of the training portion is assigned for validation.
• After splitting the dataset, the input features and target labels are identified.
• Finally, the MinMaxScaler function from the Python library Scikit-Learn is utilized as a normalization technique to scale the data into a suitable form within the range (0, 1).
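The preprocessing steps above can be sketched as follows. The DataFrame df stands for one station's hourly records, and the target column name 'WVHT' (significant wave height) is an assumption for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess(df, target='WVHT'):
    """Sketch of the four preprocessing steps; 'WVHT' is an assumed
    column name for the wave height label."""
    df = df.interpolate(method='linear')                       # fill missing values
    n = len(df)
    train, test = df.iloc[:int(0.7 * n)], df.iloc[int(0.7 * n):]  # 70/30 split
    val_len = int(0.25 * len(train))
    train, val = train.iloc[:-val_len], train.iloc[-val_len:]  # 25% of training as validation
    X_cols = [c for c in df.columns if c != target]            # features vs. label
    scaler = MinMaxScaler(feature_range=(0, 1)).fit(train[X_cols])  # scale to (0, 1)
    return (scaler.transform(train[X_cols]), train[target].values,
            scaler.transform(val[X_cols]), val[target].values,
            scaler.transform(test[X_cols]), test[target].values)
```

Taking the splits in chronological order, rather than shuffling, keeps the temporal structure intact, consistent with the out-of-sample evaluation discussed in Section 2.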


Grid Search Mechanism
The optimal hyperparameters are essential in any neural network to obtain sufficient performance. Traditionally, trial and error was used to find the best hyperparameters; however, this technique is not effective, as it takes a long time and, in most cases, the optimal values are not guaranteed. One has to set the parameters, train the model, validate, test, and compute the error, and this cycle must be repeated many times before the results begin to improve.
Therefore, we integrate an effective technique known as grid search to efficiently find the best hyperparameters for the proposed models. Table 2 shows the parameter settings of the grid search, i.e., the search space from which the GS finds and selects the optimal values suitable for the dataset size and model structure. To avoid the vanishing and exploding gradient problems, the optimal learning rate should be neither too small nor too large; therefore, we set the search space to be between 0.001 and 0.3.
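In code, the tuning cycle reduces to evaluating every candidate and keeping the best. The grid values below are illustrative points within the stated 0.001-0.3 range (the exact values of Table 2 are not reproduced here), and the objective is a toy stand-in for a full train-and-validate run:

```python
import numpy as np

# Hypothetical grid of learning rates within the stated 0.001-0.3 bounds
learning_rates = [0.001, 0.005, 0.01, 0.05, 0.1, 0.3]

def grid_search(train_and_validate, grid):
    """Automates the set-train-validate-compute-error cycle: evaluate the
    validation error of every candidate and keep the one with the lowest."""
    errors = {value: train_and_validate(value) for value in grid}
    return min(errors, key=errors.get)

# Toy stand-in for a training run: error is smallest near lr = 0.01
best = grid_search(lambda lr: (np.log10(lr) + 2.0) ** 2, learning_rates)
```

In the actual pipeline, train_and_validate would train one of the proposed models with the given learning rate and return its validation error.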

Weight Optimization Process Using SCA
SCA is a recent and effective optimization algorithm that has proven more effective than several other optimization algorithms at attaining optimal or near-optimal solutions. It has been selected in this study due to its special characteristics, which are summarized as follows:
• The potential to escape local optima, besides the high exploration inherent in the sine and cosine functions.
• The basic sine and cosine functions enable the algorithm to adaptively shift from exploration (values outside [−1, 1]) to exploitation (values within [−1, 1]).
• The inclination towards the finest region of the search space, as each solution modifies its location around the best solutions gained so far.
The three basic recurrent neural network models (VRNN, LSTM, and GRU) are enhanced based on the sine cosine algorithm, and new models named VRNN-SCA, LSTM-SCA, and GRU-SCA are developed. A clear description of these models follows.

Optimizing VRNN-SCA
The inputs of the VRNN-SCA model are multiplied by weights generated by the SCA algorithm. The results of the multiplication are then fed as the input vector to the hidden state and multiplied by the hidden-state weight matrix, also generated by SCA. All products are summed with additional bias values that come from the SCA. The weights of the basic VRNN presented in Equation (1) are updated by integrating the SCA, as shown in Equation (19),
where W_scaW indicates the SCA-generated weights of the input layer and x_t represents the inputs. W_scaV indicates the SCA-generated weights of the hidden layer, and h_{t−1} is the hidden state. Besides, b_scah represents the SCA-generated bias.
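To connect the SCA with Equation (19), each SCA agent can be a flat vector holding W_scaW, W_scaV, the output weights, and the biases, and its fitness can be the prediction error of the VRNN built from those weights. The following is a minimal sketch of that fitness function; the MSE objective and the flattening layout are assumptions:

```python
import numpy as np

def unpack(theta, D, H, N):
    """Split one flat SCA agent into W_scaW, W_scaV, output weights, and biases."""
    shapes = [(H, D), (H, H), (N, H), (H,), (N,)]
    mats, i = [], 0
    for shp in shapes:
        size = int(np.prod(shp))
        mats.append(theta[i:i + size].reshape(shp))
        i += size
    return mats

def fitness(theta, x_seq, y_seq, D, H, N):
    """Fitness of one SCA agent: MSE of the VRNN (Equation (19)) whose
    weights come from the agent instead of random initialization."""
    W_W, W_V, W_U, b_h, b_o = unpack(theta, D, H, N)
    h = np.zeros(H)
    err = 0.0
    for x_t, y_t in zip(x_seq, y_seq):
        h = np.tanh(W_W @ x_t + W_V @ h + b_h)          # Equation (19) update
        err += ((W_U @ h + b_o - y_t) ** 2).item()      # squared prediction error
    return err / len(x_seq)
```

Passing this fitness to an SCA routine searches the weight space directly, so the returned agent supplies the initial weights in place of random values.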

Optimizing LSTM-SCA
The cell state c^(t) and the three gates of the LSTM (input gate i^(t), forget gate f^(t), and output gate o^(t)) have their own weights, as explained in Section 4.2. Instead of being generated randomly, these weights are generated based on SCA to be more adaptive to the input dataset. Equations (4), (5), and (7) are updated accordingly, where W_scaXi represents the SCA-generated weights that connect the input layer to the input gate i^(t), while the hidden state h_{t−1} and the input gate are connected by the SCA-updated weight matrix W_scaHi. The cell state c_{t−1} is connected to the input gate by the SCA-generated weights W_scaCi, and b_scai is the input gate bias updated by SCA.
The forget gate f_t is connected to the input layer by the SCA-generated weights W_scaXf and to the hidden state h_{t−1} by the weights W_scaHf, which are also generated by SCA. It is likewise connected to the cell state c_{t−1} by the SCA-generated weights W_scaCf. The forget gate's bias, generated by SCA, is b_scaf.
W_scaXo represents the SCA-generated weights that connect the output gate o_t to the input layer. The SCA-generated weight matrix W_scaHo lies between the output gate and the hidden state h_{t−1}, while the cell state is connected to the output gate by the SCA-generated weights W_scaCo. The SCA-generated bias of the output gate is b_scao.

Optimizing GRU-SCA
The GRU has only two gates: the update gate Z_t and the reset gate R_t. These gates have the weight matrices described in Section 4.3, and these weights are updated based on SCA. Equations (11) and (12) are updated as follows: W_scaXz is the weight matrix between the input layer and the update gate Z_t, whereas W_scaHz contains the SCA-generated weights connecting the update gate to the hidden state h_{t−1}. b_scaz is the update gate bias generated based on SCA.
Similarly, the reset gate R_t is connected to the input layer by the SCA-generated weights W_scaXr, while W_scaHr is the weight matrix obtained by SCA that connects the reset gate to the hidden state h_{t−1}. The bias that belongs to this gate, also generated based on SCA, is b_scar.

Experimental Setup
This section explains the experimental settings used in this paper, starting with the evaluation measures used to validate the proposed models, followed by the parameter settings. The implementation was performed on an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz with 16.0 GB of RAM. The experiments were implemented using the Python 3 programming language.

Evaluation Measures
In order to comprehensively evaluate the effectiveness and prediction accuracy of the proposed models, three common metrics are used in this paper: Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). All of these error indices have been extensively applied in forecasting model evaluation. The three metrics are defined as:
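The three metrics take their standard definitions, MSE = (1/n) Σ (y − ŷ)², RMSE = √MSE, and MAE = (1/n) Σ |y − ŷ|, which can be written directly as:

```python
import numpy as np

def mse(y, y_hat):
    """Mean Square Error: (1/n) * sum of squared errors."""
    return np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)

def rmse(y, y_hat):
    """Root Mean Square Error: square root of the MSE."""
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean Absolute Error: (1/n) * sum of absolute errors."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))
```

Of the three, RMSE is expressed in the same units as the target (meters of wave height), which makes it the most directly interpretable.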

Parameter Settings
The experimental environment was set up by first defining the parameters of the SCA, as illustrated in Table 3; the parameters are based on the default settings of the algorithm itself. Then, we defined the parameters of the models, as explained in Table 4. Note that the three recurrent neural network models (VRNN, LSTM, and GRU) are implemented as benchmarking models for comparison purposes. For all the proposed models, the model structure consists of an input layer with 12 input features, a hidden layer of 32 hidden units, and an output layer. The dataset has been partitioned into 70% training data and 30% testing data; 25% of the training portion was assigned for validation.

Results and Discussion
This section demonstrates and discusses the results of the enhanced models. To ensure the stability of the proposed models, each model was run ten times on each dataset. Besides, different initializations in terms of epochs were conducted; the number of epochs was set to 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100. Tables 5 and 6 present the results of the proposed models, while Table 8 shows the results of the original VRNN model. These results demonstrate how effective the integration of the SCA algorithm and the grid search mechanism is at producing better results and overcoming the limitations of existing work. Table 9 explains the results of the original LSTM, and Table 10 explains the results of the original GRU benchmarking over all the datasets.

Comparison of the Proposed Models with Existing Models
In this subsection, we benchmark our proposed models against three existing recurrent neural networks. Table 11 shows the comparison in terms of average MSE on all datasets for the three proposed models as well as the three original RNN models. As can be seen from Table 11, the proposed VRNN-SCA model outperforms the other two proposed models as well as the three original models in terms of MSE on datasets 41040 and 41060. The proposed GRU-SCA outperforms all other models on dataset 41010. However, the proposed VRNN-SCA clearly outperforms the three original models (VRNN, LSTM, and GRU) on the 41040 dataset and takes third place in general. Table 12 shows the comparison in terms of average RMSE for the optimized and original models on all datasets. The results clearly show that the proposed models outperformed the original RNNs: GRU-SCA shows the best results on datasets 41010 and 41060, while LSTM-SCA achieved the best results on dataset 41043. Finally, VRNN-SCA shows the best RMSE results on dataset 41040. Table 13 presents the average MAE comparison for all models on all datasets. It is obvious that GRU-SCA comes first, with the best two results on the datasets from stations 41010 and 41060. The LSTM-SCA model shows the best value on dataset 41043. On the dataset from station 41040, the best average MAE was achieved by VRNN-SCA. The best predictions of the VRNN-SCA model are illustrated in Figures 3a, 6a, 9a, and 12a, whereas Figures 3b, 6b, 9b, and 12b show the results produced by the non-optimized model. The LSTM-SCA model's best predicted results are illustrated in the prediction graphs shown in Figures 4a, 7a, 10a, and 13a and compared with the best selected results of the original LSTM, which are shown in Figures 4b, 7b, 10b, and 13b.
The best selected results produced by GRU-SCA can be seen in Figures 5a, 8a, 11a and 14a. On the other hand, Figures 5b, 8b, 11b and 14b show the results for the original GRU model.
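For reference, the three error measures used in Tables 11-13 can be computed as follows; this is a minimal sketch with made-up observed and predicted wave heights, not the paper's data.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(mse(y_true, y_pred)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Toy example: observed vs. predicted significant wave heights (metres).
obs = [1.2, 1.5, 1.1, 1.8]
pred = [1.1, 1.6, 1.0, 1.7]
print(mse(obs, pred), rmse(obs, pred), mae(obs, pred))
```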

Discussion
All experiments on the three models showed that our proposed technique can be effectively used to forecast wave heights with higher prediction accuracy. The simple architectures of all the recurrent neural network variants, namely VRNN, LSTM, and GRU, can be optimized in terms of weight generation by the sine cosine optimization algorithm. The proposed RNN-SCA models have shown outstanding performance and outperformed the state-of-the-art models in terms of MSE, RMSE, and MAE.
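The weight generation described above rests on the standard SCA position update. The sketch below is a minimal, generic SCA implementation, not the paper's exact code; the sphere function stands in for the network's training loss, and the best position found would serve as the flattened initial weight vector of an RNN instead of random initialization.

```python
import numpy as np

def sca_optimize(fitness, dim, n_agents=20, iters=100, a=2.0, seed=0):
    """Minimise `fitness` over R^dim with the Sine Cosine Algorithm."""
    rng = np.random.default_rng(seed)
    # Candidate weight vectors, initialised uniformly in [-1, 1].
    X = rng.uniform(-1.0, 1.0, size=(n_agents, dim))
    best = min(X, key=fitness).copy()
    best_fit = fitness(best)
    for t in range(iters):
        r1 = a - t * a / iters          # linearly decreasing amplitude
        for i in range(n_agents):
            r2 = rng.uniform(0, 2 * np.pi, dim)
            r3 = rng.uniform(0, 2, dim)
            r4 = rng.uniform(0, 1, dim)
            # Switch between the sine and cosine updates per dimension.
            step = np.where(r4 < 0.5,
                            r1 * np.sin(r2) * np.abs(r3 * best - X[i]),
                            r1 * np.cos(r2) * np.abs(r3 * best - X[i]))
            X[i] = X[i] + step
            f = fitness(X[i])
            if f < best_fit:
                best, best_fit = X[i].copy(), f
    return best, best_fit

# Toy fitness: sphere function standing in for the network's training loss.
w, loss = sca_optimize(lambda v: float(np.sum(v ** 2)), dim=5)
```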
The difference between the graphs in Figure 3 is slightly noticeable. The enhanced VRNN-SCA model's best prediction for dataset 41010 was obtained at epoch 40 and is shown in Figure 3a. It is more accurate than the original model's (VRNN) prediction, illustrated in Figure 3b. Similarly, Figure 4 shows the comparison over dataset 41010 in terms of prediction between the enhanced LSTM-SCA and the original LSTM. As can be seen in Figure 4a, LSTM-SCA at epoch 40 produces the best results for predicting the wave heights; the results are more precise than those of the original LSTM, shown in Figure 4b. The performance of GRU-SCA is compared with the original GRU in terms of prediction in Figure 5. As can be seen from Figure 5a, the best result among all models over dataset 41010 was for GRU-SCA, with a value of 8.54 × 10⁻⁵, which clearly outperforms the original GRU shown in Figure 5b. Figure 6 shows the comparison between the proposed VRNN-SCA and the original VRNN models in terms of prediction for the dataset from station 41040. Figure 6a shows that the prediction of VRNN-SCA outperforms the original VRNN shown in Figure 6b. This is because the SCA algorithm is effective at producing better prediction accuracy than the original model. Similarly, Figure 7 shows the comparison between the proposed LSTM-SCA model and the original LSTM; the proposed model (Figure 7a) clearly outperforms the original one (Figure 7b) in terms of accurately predicting the wave heights. The performance of GRU-SCA is compared with the original GRU in terms of prediction in Figure 8. As can be seen from Figure 8a, the best result among all models over dataset 41040 was for GRU-SCA, which clearly outperforms the original GRU shown in Figure 8b.
Figure 9 presents the comparison between the VRNN-SCA model and VRNN in terms of best prediction on dataset 41043. Figure 9a demonstrates the best prediction of the enhanced VRNN-SCA model, while Figure 9b shows the prediction of the original VRNN. The performance of LSTM-SCA on dataset 41043 is demonstrated in Figure 10a; similarly, the performance of the original LSTM is depicted in Figure 10b. Figure 11a demonstrates the prediction of the enhanced GRU-SCA model, whereas Figure 11b shows the original GRU model's prediction. Figure 12 shows the comparison between the proposed VRNN-SCA and the original VRNN models in terms of prediction for the dataset from station 41060. Figure 12a shows that the prediction of VRNN-SCA outperforms the original VRNN shown in Figure 12b. Similarly, Figure 13 shows the comparison between the proposed LSTM-SCA model and the original LSTM; the proposed model (Figure 13a) clearly outperforms the original one (Figure 13b) in terms of accurately predicting the wave heights. The performance of GRU-SCA is compared with the original GRU in terms of prediction in Figure 14. As can be seen from Figure 14a, the best result among all models over dataset 41060 was for GRU-SCA, which clearly outperforms the original GRU shown in Figure 14b. This is due to the effectiveness of the SCA algorithm in producing better prediction accuracy compared to the original model.

Significance Analysis
The one-way analysis of variance (ANOVA) test was used to assess the statistical significance of the differences between the MSE values obtained by the proposed models and those of the other models. The findings of this analysis indicate whether the results of the experiments are independent. The null hypothesis assumes no significant difference between the MSE of the proposed models and that of the other models. The null hypothesis is accepted when the p-value is greater than the significance level of 0.05 and rejected when it is less than 0.05. ANOVA is an effective analysis technique because it accommodates more than two groups when testing for significant differences; since we have six groups, the ANOVA test was selected. The procedure of this analysis is adopted from [80].
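The six-group test can be illustrated with SciPy's `f_oneway`; the MSE samples below are made up for the sketch and are not the paper's actual values.

```python
from scipy.stats import f_oneway

# Illustrative MSE samples for the six groups (three proposed models,
# three original models); values are fabricated for demonstration only.
vrnn_sca = [0.0008, 0.0009, 0.0007, 0.0008]
lstm_sca = [0.0005, 0.0006, 0.0005, 0.0004]
gru_sca  = [0.0003, 0.0004, 0.0003, 0.0003]
vrnn     = [0.0021, 0.0024, 0.0022, 0.0025]
lstm     = [0.0018, 0.0019, 0.0020, 0.0017]
gru      = [0.0015, 0.0016, 0.0014, 0.0016]

stat, p = f_oneway(vrnn_sca, lstm_sca, gru_sca, vrnn, lstm, gru)
# Reject the null hypothesis of equal group means at the 0.05 level.
reject_null = p < 0.05
```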
For dataset 41010, the obtained p-value, as can be seen in Table 14, is 0.000003, which is less than 0.05; we can thus reject the null hypothesis and conclude that there is a significant difference between the proposed models and the original models. Figure 15 shows the boxplot of the differences between the proposed models and the benchmarking models.
Similarly, for dataset 41040, the obtained p-value, as can be seen from Table 15, is 0.000001, which strongly indicates that there is a significant difference between the proposed models and the original models. Figure 16 illustrates the boxplot of the differences between the proposed models and the original models on this dataset. For dataset 41043, the difference is smaller: as can be seen from Table 16, the obtained p-value is 0.0183, which is still less than the significance level of 0.05. Figure 17 demonstrates these differences as a boxplot.
Finally, in dataset 41060, the obtained p-value, as shown in Table 17, is 0.0006, which demonstrates a strong indication of the superiority of the proposed models. Figure 18 illustrates the boxplot differences on this dataset.

Conclusions and Future Work
This paper proposed an enhanced weight-optimized recurrent neural network based on the sine cosine algorithm for predicting wave heights with high accuracy. The proposed models' structures were first configured with optimal hyperparameters using the grid search technique; the grid search is used to find the best value of the learning rate. Three models, namely VRNN-SCA, LSTM-SCA, and GRU-SCA, are proposed by integrating the sine cosine algorithm. The results proved that the proposed models are capable of improving wave prediction and producing better results than the original models. For example, the best average MSE values on dataset 41010 were 0.0003, 0.0005, and 0.0008 for GRU-SCA, LSTM-SCA, and VRNN-SCA, respectively, whereas the best average RMSE values were 0.0165, 0.0205, and 0.0269 for GRU-SCA, LSTM-SCA, and VRNN-SCA, respectively. Similarly, the best average MAE values for GRU-SCA, LSTM-SCA, and VRNN-SCA were 0.0113, 0.0130, and 0.0133, respectively.
The integration of SCA has helped the simple architectures of RNNs generate weights that are adaptable to the selected dataset and model structure. In the traditional training of RNNs, the weights are initialized randomly, ignoring the dataset size; this increases the possibility of the gradients vanishing or exploding, or of the model becoming trapped in local optima. Therefore, our technique exploits the advantages of SCA by generating adaptive weight values that suit the model's parameters and the dataset simultaneously. The proposed VRNN-SCA, LSTM-SCA, and GRU-SCA models are effective tools for forecasting wave height and can be recommended for other prediction problems. In future work, according to the "No Free Lunch" theorem, other optimization algorithms such as the grey wolf optimizer or the dragonfly algorithm could be investigated to optimize the weights of recurrent neural networks. Moreover, the proposed models could be investigated in other domains such as air pollution forecasting, flood prediction, and wind speed forecasting. Another future direction is to benchmark the models against classical forecasting techniques such as the moving average, weighted moving average, or exponential smoothing.