1. Introduction
Global climate change is profoundly destabilizing Earth's systems, with growing attention to the extreme weather events driven by warming trends. In the Western North Pacific, the translation speed of tropical cyclones has slowed since the 1980s, resulting in prolonged storm durations that exacerbate storm surge intensity and increase coastal vulnerability [1]. The anticipated intensification of powerful typhoons is likely to compound the already elevated risk level in the South China Sea region [2]. Tropical cyclones not only trigger secondary disasters such as storm surges and flooding but also have profound impacts on the ecological environment and socio-economic development of coastal areas [3]. For instance, in the fall of 2024, consecutive typhoons—including the destructive Typhoon No. 11 (Yagi)—severely affected Hainan Province, resulting in major losses of life and property [4]. Against this backdrop, there is heightened concern regarding the destructive potential of extreme sea conditions, particularly their impact on coastal regions during severe weather events.
Wave forecasting is a crucial reference for marine engineering, offshore activities, and coastal infrastructure design [5]. Under severe weather conditions, extreme wave heights can pose serious risks to the safety of marine facilities and lead to significant loss of life and property in coastal regions [6]. While most existing research has concentrated on the prediction of significant wave height (SWH), limited attention has been paid to predicting maximum wave height (MWH). This is because statistically defined extreme events—such as waves exceeding specified thresholds—occur infrequently in time-series data, reducing their influence on conventional accuracy metrics [7]. However, MWH refers to the highest single wave observed within a specific period, rather than a probabilistic extreme, and it often coincides with the most hazardous sea states during tropical cyclones. Accurate prediction of MWH is therefore vital for improving marine safety and disaster resilience. In regions like Hainan Province along the South China Sea, strengthening MWH monitoring and early warning systems is key to reducing tropical cyclone–related impacts and enhancing disaster preparedness.
Currently, wave height forecasting primarily relies on Numerical Weather Prediction (NWP) models, particularly the widely used WAVEWATCH III (WW3) and Simulating WAves Nearshore (SWAN) models [8,9,10]. Umesh and Behera [11] improved wave height forecasting in the nearshore waters of Eastern India by using a nested SWAN–Simulating WAves till SHore (SWASH) model and optimizing grid resolution through sensitivity analysis, thereby enhancing the accuracy of nearshore predictions. Vijayan et al. [12] improved the accuracy of hurricane wave modeling in the Gulf of Mexico by dynamically coupling the SWAN and ADvanced CIRCulation (ADCIRC) models, and validated their reliability during Category 5 Hurricane Michael. Cicon et al. [13] conducted probabilistic forecasting of rogue waves in the Northeast Pacific using the WW3 model and confirmed that the crest–trough correlation r—a spectral shape parameter related to wave bandwidth—exhibited the highest univariate correlation with rogue wave probability, outperforming conventional metrics such as the Benjamin–Feir Index (BFI).
Although NWP models can provide global wave forecasts, they are typically limited to specific grid points at low resolution, making them inadequate for localized, short-term, fine-scale predictions [14,15]. In recent years, machine learning models have gained attention for their outstanding performance in short-term predictions, offering a new approach to wave forecasting [16,17,18]. Savitha et al. [19] applied sequential learning algorithms, namely the Minimal Resource Allocation Network (MRAN) and the Growing and Pruning Radial Basis Function (GAP-RBF) network, to forecast daily wave heights, evaluating model performance at stations with different terrain. The results showed that these models outperformed traditional methods in both generalization ability and prediction accuracy. Gracia et al. [20] proposed an integrated model combining a Multilayer Perceptron (MLP) and a Gradient Boosting Decision Tree (GBDT) and validated its effectiveness at a buoy station in a Spanish estuary port. The results showed that combining machine learning with numerical models significantly improved prediction accuracy. Afzal et al. [21] used a Support Vector Machine (SVM) to predict SWH and analyzed its seasonal variation using Generalized Extreme Value (GEV) theory. The study showed that the SVM model achieved a prediction accuracy of 99.80%, outperforming Linear Regression (LR) and Artificial Neural Networks (ANNs) in predicting SWH. However, conventional machine learning models capture only simple local features, struggle with complex data, and generalize poorly. In contrast, deep learning can automatically extract features, handle complex data, and exhibit stronger learning and generalization capabilities on large datasets [22].
With the rapid development of deep learning technology, its potential in wave height forecasting has gradually emerged. Jörges et al. [23] proposed a model based on Long Short-Term Memory (LSTM) networks for both short-term and long-term SWH forecasting in coastal areas. The study showed that the LSTM model, after incorporating bathymetric data, significantly outperformed a Deep Feedforward Neural Network (FNN). Elbisy and Elbisy [24] explored the combination of ANNs and Multivariate Additive Regression Trees (MARTs), proposing several improved models; the MART model excelled in both accuracy and efficiency. Wei and Davison [25] proposed a model based on Convolutional Neural Networks (CNNs) to forecast nearshore waves and fluid dynamics, validating its high accuracy in complex wave propagation and circulation prediction. Gao et al. [26] developed a Convolutional Long Short-Term Memory (ConvLSTM) model to predict SWH, mean period, and mean wavelength in the Northwest Pacific wave field, with computational efficiency hundreds of times higher than that of traditional numerical models. Chen and Huang [27] proposed a model based on Convolutional Gated Recurrent Units (CGRU), which effectively extracts spatiotemporal features from X-band ocean radar backscatter image sequences to estimate SWH. Experimental results demonstrated that the CGRU-based model significantly outperformed methods based on Signal-to-Noise Ratio (SNR) and CNN on rainy-day image sequences: the Root Mean Square Deviation (RMSD) was reduced from nearly 0.90 m to 0.54 m, while the underestimation issue was substantially alleviated.
While deep learning algorithms continue to evolve, their effectiveness often depends on the availability of high-quality observational data. In this context, buoy data serve as a vital input source for improving model accuracy, particularly under complex and dynamic ocean conditions. Chen and Wang [28] predicted typhoon wave heights along the Taiwan coast using a Support Vector Regression (SVR) model and showed that incorporating nearby buoy data significantly improved prediction accuracy. Dogan et al. [29] developed a model based on Bidirectional Recurrent Neural Networks (Bi-RNN) and LSTM, achieving high-precision predictions by utilizing buoy-observed wave parameters. Wang and Ying [30] developed a hybrid LSTM–Gated Recurrent Unit (GRU)–Kernel Density Estimation (KDE) model that integrates multiple feature data to predict wave heights and outperforms traditional methods in multi-step prediction. Minuzzi and Farina [31] proposed an LSTM model combining ECMWF Reanalysis 5th Generation (ERA5) reanalysis data and buoy data for short-term real-time wave height forecasting, demonstrating strong applicability. Breunung and Balachandran [32] trained a neural network on buoy-recorded data to predict anomalous wave heights in real time, reducing the uncertainty associated with traditional theories' dependence on the causes of anomalous waves.
The accuracy of wave height prediction models is strongly influenced by the quality of input data, particularly buoy observations, which offer direct and high-frequency measurements [29,33,34,35]. As deep learning techniques continue to evolve, a growing body of research has explored their integration with buoy data, demonstrating significant potential in operational wave forecasting [36]. As shown in Table 1, existing studies differ considerably in terms of data sources and prediction targets, revealing heterogeneous modeling strategies. Nevertheless, several research gaps remain. First, although SWH has been extensively studied, MWH—which often better reflects the potential risk in marine environments—has received comparatively less attention [37]. Second, although reanalysis products such as ERA5 provide spatially comprehensive datasets, their inherent latency limits real-time applicability [38]. Meanwhile, buoy deployment remains sparse in complex and nearshore sea areas, which restricts the prediction accuracy of models trained on limited observational data [39]. Third, techniques such as the Time Distortion Index (TDI) and Dynamic Time Warping (DTW) have been introduced in some studies for time-series model evaluation [40], and DTW has been applied to improve multi-step SWH forecasting [41]. However, the use of DTW in MWH prediction remains uncommon, limiting the ability of existing models to quantify prediction delays and time-alignment performance. Addressing these challenges requires not only algorithmic improvements but also advances in data integration, sensor coverage, and evaluation frameworks.
To address the above problems, this study establishes a comprehensive research framework (Figure 1) encompassing data collection, feature selection, model development, and performance evaluation. Hourly observational data collected from a moored buoy deployed in the Qiongzhou Strait are used to compensate for nearshore data gaps. LightGBM is employed to identify and select key features, thereby improving the relevance of inputs for MWH prediction. Building on these inputs, two improved models combining Temporal Convolutional Networks (TCN) and Bidirectional Gated Recurrent Units (BiGRU) (STCN-BiGRU and BiTCN-BiGRU) are proposed for static multi-step MWH prediction. In addition, the TDI is introduced—for the first time in this context—as an evaluation metric to better quantify the time-alignment capability of MWH prediction models. This integrated approach establishes a novel framework for improving both the accuracy and temporal consistency of MWH prediction, thereby providing valuable technical support for disaster warning and coastal engineering design.
The remainder of this study is organized as follows: Section 2 introduces the data, focusing on the study area, data sources, and data preprocessing; Section 3 describes the methods and models in detail, including the introduction of the TDI, and explains the experimental design and related settings; Section 4 presents the experimental results, validating model performance under normal weather conditions—specifically, a test period during which no tropical cyclones were recorded in the South China Sea; Section 5 summarizes the study, discusses its limitations, and suggests future improvements and applications.
3. Method and Framework
This section outlines the proposed framework for MWH prediction, which integrates a feature selection stage with deep learning-based temporal modeling. The key components, including model architecture, activation functions, and optimization strategies, are explained in detail to illustrate the modeling pipeline.
3.1. LightGBM for Feature Selection
In this study, LightGBM was used to analyze the contribution of each feature in the moored buoy observational data to the prediction of MWH, with the aim of identifying the most critical features and optimizing model performance. LightGBM, an efficient GBDT algorithm, optimizes based on an additive model, adding a new learner in each iteration [43]. The iterative update formula for LightGBM is as follows:

$$\hat{y}_i^{(m)} = \hat{y}_i^{(m-1)} + \eta f_m(x_i)$$

In the LightGBM framework, the model prediction after the m-th iteration is denoted as $\hat{y}_i^{(m)}$, where $\eta$ represents the learning rate and $f_m$ is the decision tree constructed during that iteration. At each boosting step, LightGBM leverages both the first-order gradient and the second-order derivative (Hessian) of the loss function to determine the optimal tree-splitting strategy, thereby accelerating convergence and improving model robustness.
Owing to its high computational efficiency, low memory footprint, and scalability to large datasets, LightGBM has become a widely adopted algorithm in various machine learning tasks. Compared with conventional GBDT implementations, LightGBM achieves superior performance through optimizations such as histogram-based learning and leaf-wise tree growth. In this study, LightGBM was chosen as the primary tool for feature selection and dimensionality reduction due to its proven effectiveness in identifying key predictors relevant to MWH forecasting.
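To make this step concrete, the following is a minimal sketch of gain-based feature ranking with LightGBM. The DataFrame, column names, synthetic data, and hyperparameters are illustrative stand-ins, not the study's actual buoy dataset or configuration.

```python
# Minimal sketch: ranking buoy features by LightGBM gain importance.
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-in for hourly moored-buoy observations (names illustrative).
buoy_df = pd.DataFrame({
    "swh": rng.gamma(2.0, 0.5, n),           # significant wave height (m)
    "wind_speed": rng.gamma(3.0, 2.0, n),    # wind speed (m/s)
    "wave_period": rng.normal(6.0, 1.0, n),  # mean wave period (s)
})
# Toy target: MWH loosely proportional to SWH plus noise.
buoy_df["mwh"] = 1.6 * buoy_df["swh"] + 0.02 * buoy_df["wind_speed"] \
    + rng.normal(0.0, 0.1, n)

feature_cols = [c for c in buoy_df.columns if c != "mwh"]
model = lgb.LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,  # the eta in the boosting update above
    num_leaves=31,       # leaf-wise growth, LightGBM's default strategy
)
model.fit(buoy_df[feature_cols], buoy_df["mwh"])

# Gain-based importance: total loss reduction attributed to each feature.
importance = pd.Series(
    model.booster_.feature_importance(importance_type="gain"),
    index=feature_cols,
).sort_values(ascending=False)
print(importance)  # top-ranked features would be kept as model inputs
```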
3.2. BiGRU
This study chooses BiGRU as the core model for MWH prediction. Compared with traditional RNN and LSTM, BiGRU combines the simple structure of GRU with bidirectional propagation, enabling more efficient capture of temporal dependencies at lower computational cost. The RNN, one of the earliest recurrent neural network architectures, transmits information between time steps through its internal recurrent structure to capture dependencies in sequence data [44]. Its core formula is:

$$h_t = \sigma\left( W_h h_{t-1} + W_x x_t + b \right)$$

Here, $h_t$ represents the hidden state at the current time step, $h_{t-1}$ denotes the hidden state from the previous time step, and $x_t$ is the input at the current time step. $W_h$, $W_x$, and $b$ are the hidden-state weight matrix, the input-to-hidden weight matrix, and the bias term, respectively. $\sigma$ is the activation function (typically Sigmoid).
However, RNNs suffer from vanishing or exploding gradients when processing long time-series data, limiting their ability to model long-term dependencies. To address this, LSTM introduces a gating mechanism, comprising a forget gate, an input gate, and an output gate, which significantly alleviates the vanishing gradient problem and allows LSTM to excel at modeling long sequences [45]. However, the complex structure of LSTM results in higher computational costs. In contrast, GRU simplifies the model structure by merging the input and forget gates while retaining the reset and update gates, reducing computational cost [46]. Accordingly, GRU consists of three components: the update gate, the reset gate, and the candidate hidden state, with the core formulas given by:

$$z_t = \sigma\left( W_z h_{t-1} + U_z x_t + b_z \right)$$

$$r_t = \sigma\left( W_r h_{t-1} + U_r x_t + b_r \right)$$

$$\tilde{h}_t = \tanh\left( W_h \left( r_t \odot h_{t-1} \right) + U_h x_t + b_h \right)$$

$$h_t = \left( 1 - z_t \right) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Here, $z_t$ represents the output of the update gate, $r_t$ the output of the reset gate, $\tilde{h}_t$ the candidate hidden state, $h_t$ the current hidden state, $h_{t-1}$ the previous hidden state, and $x_t$ the input at the current time step; $W$, $U$, and $b$ (with gate-specific subscripts) denote the hidden-state weight matrices, input weight matrices, and bias terms of the different gates, and ⊙ denotes element-wise multiplication.
BiGRU extends the GRU by incorporating a bidirectional propagation mechanism, which allows it to simultaneously process both forward and backward sequential data, thereby enhancing the model’s ability to capture temporal dependencies.
Figure 4 illustrates the structure of the BiGRU model: the BiGRU architecture is shown on the left, and a single GRU unit on the right. In the BiGRU model, forward and backward GRU units work together to process sequential data, significantly enhancing the model's temporal modeling capability. Compared with RNN and LSTM, BiGRU not only simplifies the structure but also improves the model's ability to handle complex prediction tasks by leveraging bidirectional sequential information. Overall, RNN is better suited to short-term sequence tasks, while LSTM handles long sequences effectively but at higher computational cost. In contrast, BiGRU, by modeling temporal data in both directions, improves prediction accuracy while reducing computational cost, making it well suited to the MWH prediction task in this study.
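For illustration, a BiGRU of this kind can be written in a few lines of PyTorch. The layer sizes, window length, and single-output head below are assumptions for the sketch, not the configuration used in this study.

```python
# Minimal BiGRU sketch in PyTorch (illustrative sizes).
import torch
import torch.nn as nn

class BiGRU(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, horizon: int = 1):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True,
                          bidirectional=True)       # forward + backward passes
        self.head = nn.Linear(2 * hidden, horizon)  # concatenated directions

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.gru(x)              # out: (batch, time, 2 * hidden)
        return self.head(out[:, -1, :])   # predict from the last time step

x = torch.randn(8, 24, 6)                 # 8 samples, 24 h window, 6 features
print(BiGRU(n_features=6)(x).shape)       # -> torch.Size([8, 1])
```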
3.3. TCN
TCN is a time-series prediction model based on CNN, designed to address issues such as vanishing gradients and low computational efficiency encountered by traditional methods like RNN when handling long sequences. Compared with RNN, TCN replaces recurrent layers with causal and dilated convolutions, enabling parallel computation across time steps and thus improving training efficiency [47]. Dilated convolutions enlarge the receptive field, allowing the model to capture dependencies over long time spans, while causal convolutions ensure that the model relies only on past input information for prediction, thus preventing leakage of future data.
Figure 5 illustrates the TCN architecture. By combining these two convolution types, TCN captures temporal dependencies effectively while processing long sequences stably and without significant computational burden, mitigating the gradient vanishing and explosion problems common in RNN. Furthermore, TCN improves model performance through techniques such as Dropout, activation functions, and weight normalization: Dropout improves generalization by randomly dropping neurons; activation functions introduce nonlinear mappings to enhance representational power; and weight normalization standardizes layer weights, promoting stable weight updates during training and accelerating convergence.
Due to the limited sample size in this study, the model may suffer from reduced generalization ability, potentially affecting prediction stability and accuracy [48]. To mitigate this issue, the TCN architecture was enhanced by incorporating the Gaussian Error Linear Unit (GELU) activation function and Layer Normalization (Layer Norm), as shown in Figure 6. GELU offers a smooth activation profile that dynamically adjusts output based on input magnitude, combining the benefits of dropout and the Rectified Linear Unit (ReLU) [49]. Compared with ReLU, which outputs zero for negative inputs and passes positive values unchanged, GELU reduces sharp gradient fluctuations and is more sensitive to subtle input variations, thereby improving training stability [50]. To further enhance model robustness, Layer Norm was applied. Unlike Batch Normalization, which normalizes features across the batch dimension and relies on batch statistics, Layer Norm normalizes across the feature dimension for each data point independently [51,52]. This characteristic makes it more effective for small-batch training and sequential modeling tasks [53]. By stabilizing the feature distribution at each layer, Layer Norm facilitates faster convergence and alleviates internal covariate shift [54]. Although Weight Normalization can also improve convergence by reparameterizing the weight vectors, Layer Norm has demonstrated superior performance in small-sample scenarios with noisy or biased inputs [55].
In addition, mirroring the bidirectional design of BiGRU, this study arranges two identical TCNs in parallel, with one processing the sequence in reverse. By simultaneously handling forward and reverse temporal information, the model captures the dependencies in time-series data more comprehensively.
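The sketch below illustrates, under assumed layer sizes, a causal dilated convolution block with the GELU and Layer Norm modification described above, together with a bidirectional wrapper that runs an identical stack on the time-reversed sequence. The residual connection and channel-wise fusion are illustrative design choices, not necessarily those of Figure 6.

```python
# Sketch (assumed sizes): causal, dilated TCN block with GELU + Layer Norm,
# plus a bidirectional wrapper over forward and reversed sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBlock(nn.Module):
    def __init__(self, channels: int, kernel: int = 3, dilation: int = 1,
                 p_drop: float = 0.1):
        super().__init__()
        self.pad = (kernel - 1) * dilation        # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation)
        self.norm = nn.LayerNorm(channels)        # per-step feature norm
        self.act = nn.GELU()
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                          # x: (batch, channels, time)
        h = self.conv(F.pad(x, (self.pad, 0)))     # no future leakage
        h = self.norm(h.transpose(1, 2)).transpose(1, 2)
        return x + self.drop(self.act(h))          # residual connection

class BiTCN(nn.Module):
    """Two identical TCN stacks; one sees the sequence reversed."""
    def __init__(self, channels: int, n_blocks: int = 3):
        super().__init__()
        def stack():
            # Dilations 1, 2, 4, ... enlarge the receptive field per block.
            return nn.Sequential(*[CausalBlock(channels, dilation=2 ** i)
                                   for i in range(n_blocks)])
        self.fwd, self.bwd = stack(), stack()

    def forward(self, x):                          # x: (batch, channels, time)
        rev = torch.flip(x, dims=[-1])
        return torch.cat([self.fwd(x),
                          torch.flip(self.bwd(rev), dims=[-1])], dim=1)

x = torch.randn(4, 16, 24)                         # 16 channels, 24 time steps
print(BiTCN(16)(x).shape)                          # -> torch.Size([4, 32, 24])
```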
3.4. TCN-BiGRU
To effectively capture the temporal dependencies of MWH, this study proposes two models combining TCN and BiGRU: STCN-BiGRU and BiTCN-BiGRU. The key distinction between these two models lies in the structural design of the TCN layer.
In the STCN-BiGRU model, the TCN layer uses a unidirectional structure to capture long-range dependencies in the time series. In contrast, the BiTCN-BiGRU model employs a parallel structure, with one TCN processing the forward data and the other handling the reverse data. This parallel setup enables the model to more comprehensively process temporal data and improve adaptability to diverse data patterns.
Both STCN-BiGRU and BiTCN-BiGRU models integrate a BiGRU layer following the TCN layer. As the core component of the model, BiGRU further processes and enhances the temporal features extracted from the TCN layer. BiGRU uses bidirectional information transmission, allowing it to simultaneously learn temporal dependencies from both forward and reverse sequences. In the MWH prediction task, the BiGRU layer leverages both forward and backward temporal information to enhance the model’s prediction accuracy and stability.
The role of TCN is to efficiently capture long-range dependencies in time series, while BiGRU further captures temporal information through bidirectional propagation [56,57,58,59,60]. The combination of the two complements each other's strengths, offering an efficient and accurate solution for MWH prediction tasks. To further enhance the model's generalization ability, a Dropout layer is incorporated to randomly discard neurons, preventing overfitting and improving the model's adaptability to unseen data. Finally, the output layer maps the extracted features to the prediction space: a fully connected layer performs a weighted sum of the features and, combined with a linear activation function, outputs the predicted MWH for lead times of 1, 3, and 6 h.
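A compact sketch of this composition is given below, corresponding to the STCN-BiGRU variant with a unidirectional TCN front end; swapping in the BiTCN wrapper above would give the BiTCN-BiGRU variant. It reuses the `CausalBlock` from the Section 3.3 sketch, and all sizes are illustrative rather than the tuned values.

```python
# Sketch of the STCN-BiGRU composition (illustrative sizes); reuses the
# CausalBlock from the Section 3.3 sketch as the unidirectional TCN stage.
import torch
import torch.nn as nn

class TCNBiGRU(nn.Module):
    def __init__(self, n_features: int, channels: int = 32,
                 hidden: int = 64, horizons: int = 3, p_drop: float = 0.2):
        super().__init__()
        self.inp = nn.Conv1d(n_features, channels, kernel_size=1)
        self.tcn = nn.Sequential(*[CausalBlock(channels, dilation=2 ** i)
                                   for i in range(3)])
        self.bigru = nn.GRU(channels, hidden, batch_first=True,
                            bidirectional=True)
        self.drop = nn.Dropout(p_drop)               # guards against overfitting
        self.head = nn.Linear(2 * hidden, horizons)  # linear activation

    def forward(self, x):                           # x: (batch, time, features)
        h = self.tcn(self.inp(x.transpose(1, 2)))   # TCN: long-range features
        out, _ = self.bigru(h.transpose(1, 2))      # BiGRU: bidirectional pass
        return self.head(self.drop(out[:, -1, :]))  # 1 h, 3 h, 6 h ahead

x = torch.randn(8, 24, 6)                           # 24 h windows, 6 features
print(TCNBiGRU(n_features=6)(x).shape)              # -> torch.Size([8, 3])
```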
During model optimization, hyperparameters are tuned via grid search. The key parameters and their ranges include the number of hidden layers, the number of units in each hidden layer, the batch size, the dropout rate, the number of convolutional kernels, the kernel size, and the dilation rate. These hyperparameters are selected to enhance the model's predictive performance and generalization ability. During training, the Adam optimizer is used to improve training efficiency, the Mean Squared Error (MSE) loss function is adopted, and an EarlyStopping strategy (patience = 5) is employed to prevent overfitting, ensuring the model's stability and generalization capability. Hyperparameter tuning and model training are performed on the training set, while the validation set is used to monitor generalization performance and guide the selection of the optimal configuration. The final model is then evaluated on the test set.
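This tuning procedure can be sketched as a plain grid-search loop with Adam, MSE loss, and patience-5 early stopping. The toy data, stand-in model, and grid values below are placeholders for the actual feature windows, TCN-BiGRU models, and search space described above.

```python
# Hedged sketch of grid search + Adam + MSE + early stopping (patience = 5).
import itertools
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data and model so the loop runs end to end.
xs, ys = torch.randn(256, 24 * 6), torch.randn(256, 3)
train_loader = DataLoader(TensorDataset(xs[:200], ys[:200]), batch_size=32)
val_loader = DataLoader(TensorDataset(xs[200:], ys[200:]), batch_size=32)

def build_model(hidden: int, dropout: float) -> nn.Module:
    return nn.Sequential(nn.Linear(24 * 6, hidden), nn.GELU(),
                         nn.Dropout(dropout), nn.Linear(hidden, 3))

grid = {"hidden": [32, 64], "dropout": [0.1, 0.3], "lr": [1e-3, 1e-4]}
best_val, best_cfg = float("inf"), None

for hidden, dropout, lr in itertools.product(*grid.values()):
    model = build_model(hidden, dropout)
    opt = torch.optim.Adam(model.parameters(), lr=lr)   # Adam optimizer
    loss_fn = nn.MSELoss()                              # MSE loss
    patience, bad_epochs, run_best = 5, 0, float("inf")

    for epoch in range(100):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()

        model.eval()                                    # validation pass
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item()
                      for xb, yb in val_loader) / len(val_loader)

        if val < run_best:                              # improvement resets patience
            run_best, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                  # early stopping
                break

    if run_best < best_val:
        best_val, best_cfg = run_best, dict(hidden=hidden, dropout=dropout, lr=lr)

print("best config:", best_cfg, "val MSE:", round(best_val, 4))
```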
3.5. Model Evaluation Metrics
3.5.1. Traditional Evaluation Metrics
This study employs the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and coefficient of determination ($R^2$) as evaluation metrics to assess the model's performance in predicting MWH. The specific formulas are as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }$$

$$R^2 = 1 - \frac{ \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 }{ \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 }$$

Here, $y_i$ represents the actual value for the i-th data point, $\hat{y}_i$ is the corresponding predicted value, $\bar{y}$ is the mean of the actual values, and $n$ is the total number of data points.

Both MAE and RMSE measure the difference between the predicted and actual values of the MWH; smaller values indicate better prediction performance. MAE averages the absolute differences between predicted and actual values, directly reflecting the model's accuracy. RMSE is particularly sensitive to large errors and thus reflects the model's ability to handle extreme values and its overall stability. $R^2$ indirectly measures the model's ability to reduce the sum of squared residuals: a higher $R^2$ means the model better captures the variation in the target variable, with values approaching 1 indicating a good fit and values approaching 0 a poor fit.
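For reference, the three metrics follow directly from their definitions; the helper below is an illustrative implementation.

```python
# Illustrative helper computing the three metrics from their definitions.
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                        # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))                 # penalizes large errors
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}

print(evaluate(np.array([1.2, 2.5, 3.1]), np.array([1.0, 2.7, 3.0])))
```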
3.5.2. Time Distortion Index
In this study, in addition to traditional evaluation metrics such as MAE, RMSE, and $R^2$, the TDI is introduced for a more comprehensive assessment of the MWH model's performance. While these traditional error metrics are widely used in time-series forecasting, they do not effectively capture prediction delay, i.e., the temporal misalignment between predicted and actual values: low error metrics may still be accompanied by temporal lags. To address this issue, the TDI quantifies the extent of deformation along the time axis, offering a more thorough evaluation.
The calculation of TDI is based on the DTW algorithm, which aligns time series non-linearly and computes the minimal alignment distance [61]. However, DTW only reflects the degree of alignment between sequences and does not directly quantify time distortion [62,63]. TDI addresses this by analyzing the DTW alignment path, evaluating changes in time steps, and thus quantifying the extent of time distortion [64]. The calculation of TDI involves the following steps: (1) compute the DTW alignment path: obtain the optimal alignment path between the two time series and the corresponding changes in time steps; (2) quantify time distortion: analyze the time-step variations along the alignment path and accumulate the degree of time distortion; (3) standardize: normalize the TDI to ensure comparability across datasets.
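The sketch below implements these three steps with a textbook DTW recursion; the normalization in the final step is one plausible choice (mean absolute index offset divided by series length) and may differ from the exact formulation in [64].

```python
# Hedged sketch of a TDI-style computation from the DTW alignment path.
import numpy as np

def dtw_path(a: np.ndarray, b: np.ndarray):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):                # (1) cumulative DTW cost matrix
        for j in range(1, m + 1):
            D[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, (i, j) = [], (n, m)                # backtrack the optimal path
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = [(i - 1, j - 1), (i - 1, j), (i, j - 1)][step]
    return path[::-1]

def tdi(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    path = dtw_path(y_true, y_pred)
    # (2) quantify distortion as index offsets; (3) normalize by series length.
    return float(np.mean([abs(i - j) for i, j in path]) / len(y_true))

t = np.linspace(0, 4 * np.pi, 100)
print(tdi(np.sin(t), np.sin(t - 0.5)))       # lagged copy -> positive TDI
```

A perfectly time-aligned prediction yields a TDI of zero, while systematic lags push the alignment path away from the diagonal and increase the index.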
TDI thus provides an effective means of quantifying prediction delays, complementing traditional error metrics such as MAE and RMSE, which focus solely on magnitude discrepancies. It is particularly valuable in time-series forecasting tasks where temporal alignment between predictions and observations is critical. In the context of this study, TDI offers a rigorous basis for evaluating the timing accuracy of MWH forecasts, enabling a more comprehensive assessment of model performance.