Wave Height and Period Estimation from X-Band Marine Radar Images Using Convolutional Neural Network

Abstract: In this study, a deep learning network that extracts spatial-temporal features is proposed to estimate significant wave height (H s) and wave period (T s) from X-band marine radar images. Since the shore-based radar images in this study are contaminated by radial noise lines from other radars and by solid targets, the images must be pre-processed to remove this interference so that the proposed convolutional neural network (CNN) can extract image features accurately. Firstly, a pre-trained GoogLeNet is used to extract multi-scale deep spatial features from the radar images to estimate H s and T s. Since CNN-based models cannot analyze the temporal behavior of wave features in radar image sequences, self-attention is connected after the deep convolutional layer of the CNN to construct a convolutional self-attention (CNNSA)-based model that generates spatial-temporal features for H s and T s estimation. H s and T s measured by a nearby buoy are used for model training and as references. The experimental results show that, compared with the traditional SNR-based and CNN-based methods, the proposed CNNSA model reduces the RMSD of H s estimation by 0.24 m and 0.11 m, respectively, and the RMSD of T s estimation by 0.3 s and 0.08 s, respectively.


Introduction
X-band radar has become a valuable tool for oceanographic studies due to its high spatial and temporal resolution [1]. The Bragg resonance interaction between X-band electromagnetic waves and the centimeter-scale ripples induced by local winds generates radar backscatter from the sea surface, so changes in the sea surface are imaged through this backscatter [2]. Marine radar images have been used to effectively estimate sea surface features, such as wind [3,4], wave parameters [2,5-7], and currents [8-10]. Sea surface parameters such as significant wave height (H s), wave period (T s), and wave direction are essential for the safety of various marine activities and the development of coastal areas [11]. In situ sensors, such as wave buoys, have traditionally been used for wave measurements. In contrast, X-band marine radars can detect and measure wave parameters over a broader area at relatively low maintenance cost.
The echo signal produced by X-band radar, which appears as a pattern of light and dark stripes in the image, is called sea clutter; it is formed by the backscattering of radar-emitted electromagnetic waves from the sea surface. The base signal (short waves) is modulated by longer waves through several mechanisms, such as hydrodynamic modulation, tilt modulation, and shadowing, making the longer waves visible in radar images [12]. The conventional spectral analysis method obtains the image spectrum via a three-dimensional fast Fourier transform (3-D FFT) of the radar image sequence [2], using the dispersion relation as a filter to separate the wave signal from the background noise. The wave period, wave direction, and wavelength can then be deduced from the filtered spectrum [2,6]. Based on an estimation relation developed for synthetic aperture radar [13], the signal-to-noise ratio (SNR) is calculated from the wave spectrum, and H s is estimated through a linear relationship between H s and the square root of the SNR, with coefficients determined from in situ measurements. Later studies found that H s is not exactly linearly proportional to the square root of the SNR, due to variations in sea state, different methods of calculating the SNR, and differences between radar systems. Beyond SNR-based methods, alternative approaches to estimating wave parameters include empirical orthogonal function-based [14], iterative least squares-based [1], 2-D continuous wavelet transform-based [15], array-beamforming-based [16], shadowing mitigation-based [17], and synchrosqueezed wavelet transform-based [18] methods. Other methods proposed for estimating H s include shadowing-based [19,20], ensemble empirical mode decomposition-based [21], correlation analysis-based [22], and variational mode decomposition-based [23] methods. It is worth noting that machine learning algorithms have been applied to H s estimation; they can simplify the cumbersome steps of previous algorithms, improve computational efficiency, and produce more accurate estimates. Examples include a support vector regression (SVR)-based method [24], artificial neural network (ANN)-based methods [25,26], a convolutional neural network (CNN)-based method [27], a convolutional gated recurrent unit network (CGRU)-based method [14], and a temporal convolutional network (TCN)-based method [28]. In addition, random forest (RF)-based machine learning methods have been used to estimate wave direction and period [29].
CNNs can extract spatial features from image sequences but cannot analyze them temporally. In this study, a novel H s and T s estimation model combining a CNN with self-attention (abbreviated as CNNSA) is proposed. The method extracts the spatial features of radar image sequences via the CNN and introduces a self-attention layer on the CNN feature vectors to capture dependencies in the time series, achieving spatial-temporal feature extraction from radar images. Section 2 introduces the data pre-processing methods used, including median filtering based on a two-layer decision and an adaptive region-growing repair method. The structure and components of the proposed CNN- and convolutional self-attention (CNNSA)-based models are illustrated in Section 3. The training and testing results obtained from shore-based marine radar data using the SNR-, CNN-, and CNNSA-based models are presented and compared in Section 4. The results are discussed in Section 5. Finally, the conclusions and outlook of this work appear in Section 6.

Data Pre-Processing
Interference from other marine radars produces radial noise lines in radar images, visible as dense radial pixel lines in Figure 1. The radar images in this study were collected by a shore-based X-band radar, which is also affected by interference from targets such as ships, appearing as bright spots in the images. The pre-processing of the radar images is shown in Figure 2.

Median Filtering Based on the Two-Layer Decision
According to the characteristics of the wave texture in the radar image, an improved median filtering method is used to screen pixels for noise. The region used for H s and T s estimation is extracted from the radar polar-coordinate image. In this study, a sub-image with azimuth ±30° and range 600-2500 m is selected and converted into a grey-level image of 0-255, denoted I(x, y), of size n × n with n = 256. First, a sliding window L 1 of size m × 1 is set, with m = 3 in this study. I(x, y) is padded according to the sliding window size, so the padded image has size (n + m − 1), and scanning starts from the first position of the padded image. Then, at each window position, it is determined whether the grey value at the centre of L 1 equals the median of the window. If it is the median, the centre grey value is left unchanged; if not, the centre is tested as a possible noise point. Next, a threshold C 1 = A avg/5 is set, and the difference B between the maximum grey value B max(i, j) and the minimum grey value B min(i, j) in each sliding window is calculated. If B > C 1, the centre is judged to be a noise point. Finally, the noise point is replaced by the median value computed in L 1. Here, A avg denotes the average grey value of all pixels in I(x, y).
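A rough sketch of this two-layer decision follows, under the assumptions that the m × 1 window slides along the range (row) direction and that A avg is the image-mean grey value; the function and parameter names are illustrative, not the authors' code.

```python
import numpy as np

def two_layer_median_filter(img, m=3, c1_scale=0.2):
    """Sketch of the two-layer-decision median filter.

    img: 2-D grey-level array (0-255). An m x 1 window slides along the
    range (row) direction; edges are padded so the output keeps the
    input size. c1_scale = 1/5 gives C1 = A_avg / 5.
    """
    pad = m // 2
    padded = np.pad(img.astype(float), ((pad, pad), (0, 0)), mode="edge")
    out = img.astype(float).copy()
    c1 = c1_scale * img.mean()              # threshold C1 from the image mean
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            win = padded[i:i + m, j]        # m x 1 sliding window
            med = np.median(win)
            center = padded[i + pad, j]
            if center == med:               # layer 1: centre is the median
                continue                    #   -> leave it untouched
            if win.max() - win.min() > c1:  # layer 2: window spread exceeds C1
                out[i, j] = med             #   -> treat as noise, replace
    return out
```

For example, an isolated bright pixel surrounded by uniform sea clutter is replaced by the local median, while pixels that already equal their window median pass through unchanged.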

Adaptive Region Growing Repair Method
Targets on the sea surface can mask the wave texture, leading to errors when estimating H s and T s from radar images with deep learning-based methods. To avoid this problem, this study uses an adaptive region-growing repair method to recover the image regions disturbed by targets. The process is divided into four parts. The first part uses an adaptive threshold to determine whether a target is present. An initial parameter C 2 is set together with a threshold D 1, where A max is the maximum grey value over all pixels in I(x, y). It is then determined whether A max is greater than D 1; if not, the original grey values are output, and otherwise the following calculation is performed.
The second part searches downward through the grey values to find the initial growth point of the target and identify the target area in I(x, y). The grey values of all pixels in I(x, y) are sorted from smallest to largest, and the location of the maximum grey value is selected as the growth point: a sliding window L 2 = m × m is set with A max as its centre and initial growth point. Starting from this point, pixels inside the window with features similar to the window centre are searched for, and the traversal stops when no similar pixels remain in the window, thereby delimiting the area occupied by the target. The similarity criterion is |C (i,j) − C cen| < D 2, where C (i,j) denotes the grey value of a pixel inside L 2, C cen denotes the grey value of the centre of L 2, and D 2 is the filtering threshold. If the criterion holds, similar pixels are searched for again using C (i,j) as a new starting point, until the criterion no longer holds anywhere inside the window. These steps are repeated to find the remaining targets in the image. The third part determines whether a candidate is a real target. The number of pixel locations N i occupied by each candidate target is counted, and it is checked whether this number is less than the maximum area allowed for a single target, area = m × m with m = N i/6 rounded to an integer. If this holds, the candidate is considered a target object. The average grey value B avg of the pixels in the target area is then compared with a target threshold D 3; if B avg > D 3, the candidate is considered a real target.
The fourth part restores the wave texture at the target location using a mean-filling transition algorithm. The grey values of the pixels where the target is located are first set to 0. Then, according to the size of the sliding window L 2, I(x, y) is padded to (n + 2m) × (n + 2m), and each target pixel is replaced by the mean of the four pixels located m points away from it in the range and azimuth directions. Figure 3 shows the radar sub-image, the result of removing the radial noise lines, and the result of removing the target object.
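The growing and filling steps can be sketched as follows. The thresholds and the seed are taken as inputs here because their defining equations are not reproduced above, and the neighbourhood bookkeeping is our assumption, not the authors' implementation.

```python
import numpy as np
from collections import deque

def grow_target_region(img, seed, d2, m=3):
    """Region growing: starting from the brightest pixel (the seed),
    collect pixels inside an m x m window whose grey value differs from
    the current centre by less than the threshold d2 (passed in, since
    its defining equation is omitted in the text)."""
    h, w = img.shape
    r = m // 2
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        ci, cj = queue.popleft()
        cval = float(img[ci, cj])
        for di in range(-r, r + 1):              # scan the m x m window
            for dj in range(-r, r + 1):
                ni, nj = ci + di, cj + dj
                if 0 <= ni < h and 0 <= nj < w and not mask[ni, nj]:
                    if abs(float(img[ni, nj]) - cval) < d2:
                        mask[ni, nj] = True      # similar -> part of target
                        queue.append((ni, nj))   # grow from this pixel next
    return mask

def fill_target(img, mask, m=3):
    """Mean-filling transition: replace each target pixel by the mean of
    the four non-target pixels m points away along the range and azimuth
    directions (clipped at image borders)."""
    out = img.astype(float).copy()
    h, w = img.shape
    for i, j in zip(*np.nonzero(mask)):
        neigh = [(i - m, j), (i + m, j), (i, j - m), (i, j + m)]
        vals = [img[a, b] for a, b in neigh
                if 0 <= a < h and 0 <= b < w and not mask[a, b]]
        if vals:
            out[i, j] = np.mean(vals)
    return out
```

Applied to a bright ship-like blob on a uniform background, the mask captures the blob and the fill replaces it with the surrounding sea-clutter level.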

CNN
CNNs are deep learning models designed to process and analyze data with a grid structure, such as images and videos. They are characterized by their convolutional layers, which perceive the input data locally through sliding windows and thus effectively capture the spatial relationships among image or video features [30]. The basic building blocks of CNNs are convolutional operations, pooling operations, and fully connected layers. Convolutional operations detect various features in images, such as edges, textures, and shapes, while pooling operations reduce the size of the feature map, lower the computational complexity, and extract higher-level features. In this study, GoogLeNet [31] is selected as the pre-trained model from among the many convolutional network models. Compared with other classical CNNs, such as AlexNet [32] and VGGNet [33], GoogLeNet has a lighter and more efficient structure, adopting the "Inception" module to reduce the number of parameters through multi-scale convolutional kernels, while its deeper network can learn more complex feature representations.
GoogLeNet consists of several "Inception" modules, each containing a series of convolutional kernels of different scales that capture image features in parallel. Its overall structure consists of the following key components: (1) Convolutional and pooling layers, which extract basic image features such as edges and texture. (2) Inception modules, each containing multiple parallel convolutional kernels and pooling operations to capture features at different scales and levels; the results of these parallel operations are concatenated to form the module output, improving the feature representation of the model without adding too many parameters. (3) A global average pooling layer, which averages the values of each channel of the feature map to generate a fixed-size feature vector, reducing the dimensionality of the fully connected layer and helping to reduce overfitting. (4) A fully connected layer, which integrates information from the different features to capture complex relationships in the data. In this study, the fully connected layer is used as a regression head to estimate the H s and T s of the radar image.
The structure of the GoogLeNet-based estimation model is shown in Figure 4.

Self-Attention
The CNN-based model performs well in capturing the deep features of a radar image, but it has no mechanism for capturing the temporal correlation across an entire image sequence. Self-attention allows the model to dynamically capture long-distance dependencies throughout the input sequence and thus relate the context of the image sequence. The computation process of self-attention, comprising the computation of the Query, Key, and Value, the attention scores, and the final weighted summation, is shown in Figure 5.
Firstly, the feature map x obtained by deep convolution is multiplied by the weight matrices W Q, W K, and W V to obtain the feature spaces f(x), g(x), and h(x), respectively: f(x) = W Q x, g(x) = W K x, h(x) = W V x. Then, the transpose of the feature space f(x i) is multiplied by the feature space g(x j), and the product is normalized with a softmax to obtain the attention map β j,i = exp(s ij)/Σ i exp(s ij), where s ij = f(x i)^T g(x j). Finally, the feature space h(x) is multiplied by the attention map and passed through a 1 × 1 convolution to obtain the self-attention feature map o, with o j = W o Σ i β j,i h(x i).
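Numerically, the three projections and the attention map can be sketched as follows; shapes and weight names are illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    """Sketch of the self-attention computation described above.

    x : (N, C) feature map flattened to N positions of C channels.
    wq, wk, wv project x to f(x), g(x), h(x); wo plays the role of the
    final 1 x 1 convolution.
    """
    f = x @ wq                       # queries f(x) = W_Q x
    g = x @ wk                       # keys    g(x) = W_K x
    h = x @ wv                       # values  h(x) = W_V x
    scores = f @ g.T                 # s_ij = f(x_i)^T g(x_j)
    beta = softmax(scores, axis=0)   # attention map beta_{j,i}, sums over i
    o = (beta.T @ h) @ wo            # o_j = W_o * sum_i beta_{j,i} h(x_i)
    return o
```

The output has one row per position, each a context-weighted mixture of the value vectors.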

CNN-SA Model
In this study, a novel deep learning model with temporal and spatial feature extraction is proposed by embedding self-attention as a feature extraction module in the CNN model. The model captures the spatial information of the radar images through the convolutional layers, pooling layers, and Inception structure of GoogLeNet, and adding self-attention after the "Inception" structure allows it to dynamically capture long-range dependencies in the image sequence. Capturing both spatial and temporal relationships in the radar image sequence in this way makes the model better suited to estimating H s and T s. The flowchart of the CNNSA-based estimation model for H s and T s is shown in Figure 6, and detailed parameter information is given in Table 1.
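The overall architecture can be sketched in PyTorch as follows. This is a schematic stand-in, not the authors' implementation: a small convolutional stack replaces GoogLeNet's Inception backbone, the attention block follows the standard SAGAN-style formulation, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over a conv feature map (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (b, hw, c/8)
        k = self.k(x).flatten(2)                   # (b, c/8, hw)
        v = self.v(x).flatten(2)                   # (b, c, hw)
        beta = torch.softmax(q @ k, dim=-1)        # attention map
        o = (v @ beta.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * self.out(o) + x        # residual connection

class CNNSA(nn.Module):
    """Hypothetical CNN + self-attention regressor for [Hs, Ts]."""
    def __init__(self, channels=64):
        super().__init__()
        # stand-in for the GoogLeNet convolution/Inception stack
        self.backbone = nn.Sequential(
            nn.Conv2d(1, channels, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = SelfAttention2d(channels)      # after the conv stack
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.head = nn.Linear(channels, 2)         # regress [Hs, Ts]

    def forward(self, x):
        x = self.attn(self.backbone(x))
        return self.head(self.pool(x).flatten(1))
```

A batch of grey-level sub-images of shape (B, 1, H, W) maps to a (B, 2) tensor of wave-parameter estimates.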

Data Overview
The X-band marine radar data used in this study were collected in Pingtan, Fujian Province, from 16 December 2010 to 10 January 2011. Detailed radar parameters are given in Table 2. The rotation period of the radar antenna was about 2.5 s, and each radar image sequence contained 32 images, i.e., each sequence spans about 80 s. The wave buoy providing the reference data was deployed approximately 0.85 km from the radar. Since the reference H s and T s were measured every 20 min, temporal interpolation is required to provide a synchronized reference for each radar image sequence. Figure 1 shows the positional relationship between a radar image in polar coordinates and the wave buoy. During data collection, H s ranged from 1.3 to 3.5 m and T s from 6.5 to 10 s, while the synchronized local wind speed ranged from 5.8 to 18.5 m/s, as shown in Figure 7, so the proposed method was evaluated under typical wind and wave conditions. Synchronized buoy references and radar images were not available at every moment; the corresponding data are shown in Figure 7.
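The synchronization step can be sketched as a linear interpolation of the 20-min buoy series onto the radar timestamps; numpy's interp is our choice here, not necessarily the authors'.

```python
import numpy as np

def interpolate_references(buoy_times, buoy_vals, radar_times):
    """Linearly interpolate the 20-min buoy measurements (e.g. Hs or Ts)
    onto the radar image-sequence timestamps. Times are in seconds; a
    sketch of the synchronization step, not the authors' exact code."""
    return np.interp(radar_times, buoy_times, buoy_vals)
```

For example, a radar sequence halfway between two buoy samples receives the mean of the two bracketing measurements.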



Model Training
In this study, the first image of each sequence was passed through the pre-processing chain in Figure 2, which selects the radar image sub-area and filters out the radial noise lines and targets, so that high-quality sub-images are provided for deep learning-based H s and T s estimation and better learning of the feature information in the images. The significant wave heights in this study span 1.3-3.5 m, and the training samples contained images from the full range. Since the reference H s and T s were measured roughly every 20 min, before training all image sequences (excluding the rainy day) were first sorted according to the simultaneous H s and T s obtained from interpolated buoy measurements. Then, within each 0.5 m interval between 1.3 and 3.5 m, 70% and 30% of the samples were taken for training and testing of the model, respectively. The proposed CNNSA-based model was trained using the PyTorch framework on a Windows 10 PC with two 2.10 GHz Xeon(R) Gold 6230R CPUs. Training used the Adam optimizer with an initial learning rate of 0.0003 and the MSE loss function; the batch size was set to 36, and the number of epochs to 50.
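The training configuration stated above (Adam, initial learning rate 0.0003, MSE loss, batch size 36, 50 epochs) can be sketched in PyTorch; the model and data below are small placeholders, not the paper's network or radar images.

```python
import torch
import torch.nn as nn

# Placeholder regressor and synthetic data standing in for the CNNSA
# model and the pre-processed radar sub-images; only the training
# configuration follows the text.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()

images = torch.randn(108, 1, 16, 16)         # fake training images
targets = torch.randn(108, 2)                # fake [Hs, Ts] references
dataset = torch.utils.data.TensorDataset(images, targets)
loader = torch.utils.data.DataLoader(dataset, batch_size=36, shuffle=True)

for epoch in range(50):                      # 50 epochs, as in the text
    for batch_images, batch_targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_images), batch_targets)
        loss.backward()
        optimizer.step()
```

In practice the placeholder `model` would be the CNNSA network and `images`/`targets` the pre-processed sub-images with their interpolated buoy references.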

Result Analysis
To demonstrate the effectiveness of the CNNSA-based H s and T s estimation model, the SNR-based method [7] and the CNN-based method were applied to the same testing set for comparison. The H s and T s estimates obtained within two consecutive buoy measurement intervals were averaged using a time-moving average whose window equals the buoy measurement interval. The root-mean-square differences (RMSDs), correlation coefficients (CCs), and biases between the reference values and the values estimated by the SNR-based, CNN-based, and CNNSA-based methods are summarized in Tables 3 and 4, respectively. The estimation results of the three methods on the test samples before and after the moving average are shown in Figure 8, and Figure 9 shows the estimates as time series. From Table 3 it can be seen that, without averaging, the RMSD of the proposed method's H s estimation is reduced by 0.21 m and 0.1 m compared with the SNR-based and CNN-based methods, and the CC is improved to 0.85. When the H s estimates are smoothed with a 30-min moving average, Table 3 shows further improvement, with an RMSD of 0.3 m and a CC of 0.86 for the CNNSA-based method. Similarly, Table 4 shows the T s estimation: without averaging, the RMSD of the proposed method is reduced by 0.23 s and 0.09 s compared with the SNR-based and CNN-based methods, and the CC is improved to 0.89. After the 30-min moving average, the estimated T s also improves, with an RMSD of 0.27 s and a CC of 0.91 for the proposed method. To visualize the results, Figure 8 presents the H s and T s estimates of the three methods with and without the moving average.
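The evaluation statistics used here can be computed as follows; a generic sketch, with a simple k-point moving average standing in for the 30-min sliding window.

```python
import numpy as np

def rmsd(est, ref):
    """Root-mean-square difference between estimates and references."""
    est, ref = np.asarray(est, float), np.asarray(ref, float)
    return float(np.sqrt(np.mean((est - ref) ** 2)))

def bias(est, ref):
    """Mean difference (estimate minus reference)."""
    return float(np.mean(np.asarray(est, float) - np.asarray(ref, float)))

def cc(est, ref):
    """Pearson correlation coefficient between estimates and references."""
    return float(np.corrcoef(est, ref)[0, 1])

def moving_average(x, k):
    """Simple k-point moving average used to smooth an estimate series."""
    return np.convolve(np.asarray(x, float), np.ones(k) / k, mode="valid")
```

Perfect agreement gives RMSD and bias of 0 and a CC of 1; smoothing both series before scoring reproduces the "after moving average" rows of Tables 3 and 4.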


Discussion
As can be seen from Figure 8a, the H s estimates tend to be low when H s is 2-3 m and high when H s is greater than 3 m. In Figure 8b, the SNR-based T s estimates show large dispersion overall, owing to the sensitivity of SNR-based methods to radar image noise, which produces large deviations in the estimates. Figure 8c,e show that the CNNSA-based method achieves a better linear correlation in H s estimation than the CNN-based method, and Figure 8d,f similarly show a better linear correlation for T s estimation. From Tables 3 and 4, the correlation coefficients of the CNNSA-based H s and T s estimates after the 30-min moving average are 0.86 and 0.91, respectively, both higher than those of the CNN-based and SNR-based methods. Further, Figure 9a,b show the temporal variation of H s and T s estimated by the three methods; the CNNSA-based estimates follow the trend of the buoy references most closely.
In summary, the proposed CNNSA-based regression estimation model improves the estimation of both H s and T s to different degrees compared with the other two methods. Figure 8 shows the goodness of fit of the three methods; after applying the time-moving average, the CNNSA-based estimates of H s and T s lie closer to the black line. The closeness of the three methods to the buoy reference is further demonstrated in Figure 9, where the H s and T s estimates of the proposed method track the buoy references most closely. Together, these results show that the proposed method improves the accuracy of H s and T s estimation from X-band marine radar images and validates the reasonableness and effectiveness of the proposed CNNSA-based estimation model.

Conclusions
In this study, a deep neural network was used to estimate H s and T s from X-band marine radar backscatter image sequences. Since the shore-based radar images in this study were contaminated by radial noise lines from other radars and by solid targets, the images had to be pre-processed to eliminate the interference so that the convolutional neural network (CNN) could extract the image features accurately. Firstly, a pre-trained GoogLeNet was used to extract multi-scale deep spatial features from the radar images to estimate H s and T s. Since CNN-based models cannot analyze the temporal behavior of wave features in radar image sequences, self-attention was connected after the deep convolutional layer of the CNN to construct a convolutional self-attention (CNNSA)-based model that generates spatial-temporal features for H s and T s estimation. Both models were trained and tested using the collected shore-based radar data, with interpolated buoy measurements serving as reference values for performance evaluation. The experimental results showed that both the CNN-based and CNNSA-based models improve the estimation of H s and T s: compared with the traditional SNR-based method, the RMSD of H s estimation is reduced by 0.13 m and 0.24 m, respectively, and the RMSD of T s estimation by 0.22 s and 0.3 s, respectively. Overall, the CNNSA-based model gave the best results for estimating H s and T s simultaneously; exploiting temporal information alongside the multi-scale deep spatial features further improves the estimation accuracy.
Due to the data limitations of this experiment, future work requires collecting sea state and radar data with more images and a wider range of conditions for model training and validation; radar and reference data should also be collected at different locations to further validate the proposed model.

Figure 2 .
Figure 2. Pre-processing of radar images.


Figure 3 .
Figure 3. Pre-processing of radar image sub-area.(a) Contains radial noise lines and target.(b) Removed radial noise lines.(c) Removed target.

Figure 4 .
Figure 4. Flowchart of the GoogLeNet-based H s and T s estimation model.


Figure 5 .
Figure 5. The structure of self-attention.


Figure 6 .
Figure 6. Flowchart of the CNNSA-based estimation model for H s and T s.

Figure 7 .
Figure 7. Synchronized anemometer-measured wind speed and buoy-measured H s and T s data.


Figure 9 .
Figure 9. Radar-derived results in time series using three different methods: (a) H s estimation results; (b) T s estimation results.

Table 1 .
Detailed parameter information for CNNSA-based estimation models.


Table 3 .
Comparisons of results using different methods for H S estimation.

Table 4 .
Comparisons of results using different methods for T S estimation.