Parallel Computation of EM Backscattering from Large Three-Dimensional Sea Surface with CUDA

An efficient parallel computation using graphics processing units (GPUs) is developed for studying the electromagnetic (EM) backscattering characteristics of a large three-dimensional sea surface. A slope-deterministic composite scattering model (SDCSM), which combines the quasi-specular scattering of the Kirchhoff Approximation (KA) and the Bragg scattering of the two-scale model (TSM), is utilized to calculate the normalized radar cross section (NRCS, in dB) of the sea surface. However, as radar resolution improves, the large sea surface comprises millions of triangular facets, which makes the computation of the NRCS time-consuming and inefficient. In this paper, the feasibility of using an NVIDIA Tesla K80 GPU with four compute unified device architecture (CUDA) optimization strategies to improve the calculation efficiency of EM backscattering from a large sea surface is verified. The GPU-accelerated SDCSM calculation takes full advantage of coalesced memory access, constant memory, the fast math compiler option, and asynchronous data transfer. The impact of the block size and the number of registers per thread is analyzed to further improve the computation speed. A significant speedup of 748.26× is obtained on a single GPU for the GPU-based SDCSM implementation compared with its CPU-based counterpart running on an Intel(R) Core(TM) i5-3450.


Introduction
Studying the electromagnetic (EM) backscattering characteristics of an electrically large sea surface plays an important role in synthetic aperture radar (SAR) imaging, ocean parameter inversion, target detection, and monitoring [1][2][3]. Computing the normalized radar cross section (NRCS) is a prerequisite for investigating the characteristics of backscattering echoes from large sea surfaces. Based on the two-scale oceanic surface model simulated by the double superimposition model (DSM) [4,5], the Kirchhoff approximation (KA) [6,7], small perturbation method (SPM) [8], small-slope approximation method (SSA) [9,10], two-scale method (TSM) [11], four-modified two-scale method (FMTSM) [12], and slope-deterministic composite scattering model (SDCSM) [13] can be employed to calculate the NRCS of the sea surface. The KA method is only valid when the incident angle is close to the specular direction (±20°). The SPM, TSM, and FMTSM are suitable for calculating the Bragg scattering from the sea surface. However, when the incident

Electromagnetic Backscattering from an Electrically Large Sea Surface
The conventional oceanic surface is generally divided into two scales: large-scale gravity waves and the small-scale ripples superimposed on them, where the large-scale gravity waves can be discretized into millions of triangular meshes. In order to calculate the EM backscattering echoes from the individual triangular meshes on the electrically large sea surface, it is assumed that the EM backscattering echoes from the diverse facets on the oceanic surface are de-correlated [25]. In this section, a slope-deterministic composite scattering model (SDCSM) is developed to calculate the NRCS of the EM backscattering from the electrically large sea surface, taking both scattering mechanisms into account, as shown in Figure 1.


Slope-Deterministic Kirchhoff Approximation Model (SDKAM)
When the incident angle is small (θ_i ≤ 20°), the quasi-specular scattering mechanism corresponding to large-scale gravity waves plays the leading role in the EM backscattering echoes from the electrically large sea surface. Thus the Kirchhoff approximation method (KAM) can be utilized to calculate the NRCS, which can be given as [6,7]

σ⁰_pq,KAM = (πk²|q⃗|²/q_z⁴) |F^KAM_pq|² Prob(z_x, z_y)   (1)

where p and q represent the radar reception and incident polarizations, respectively; F^KAM_pq denotes the coefficient matrix depending on the incident angles (θ_i, φ_i), the scattering angles (θ_s, φ_s), and the Fresnel coefficients [7]; Prob(z_x, z_y) denotes the slope probability density function of the gravity waves; z_x = −q_x/q_z, z_y = −q_y/q_z. The NRCS of the EM backscattering can be well calculated by the KAM when the incident angle is close to the quasi-specular direction (±20°).

Slope-Deterministic Two-Scale Model (SDTSM)
As the incident angle increases, the EM backscattering echoes from the electrically large sea surface are dominated by the Bragg scattering mechanism caused by small-scale ripples. In this case, the NRCS of the electrically large sea surface can be calculated by the small perturbation method (SPM) as [26], where S(·) is the sea wave spectrum corresponding to the small-scale ripples; q_⊥ is the projection of q⃗ onto the triangular mesh; n̂ = (−z_x x̂ − z_y ŷ + ẑ)/√(1 + z_x² + z_y²) denotes the unit normal vector of the triangle element; z_x and z_y denote the slopes of the triangular mesh along the x and y directions, respectively.
For co-polarization, the coefficient matrix F_pq in Equation (2) depends on the relative permittivity ε of the sea surface and on R_h and R_v, the Fresnel coefficients in h (horizontal) and v (vertical) polarizations, respectively; θ_i and θ_s represent the global incident and scattering angles, respectively; φ_i and φ_s denote the global incident and scattering azimuth angles, respectively, as shown in Figure 1. The ripples which contribute to the EM scattering are modulated by the large-scale gravity waves, as shown in Figure 1. Therefore, the NRCS can be calculated based on the local small perturbation method with local bi-static angles (θ'_i, θ'_s, φ'_i, φ'_s). Thus, Equation (4) can be re-written in local coordinates, where v̂_i, ĥ_i and v̂'_i, ĥ'_i indicate the unit incident polarization vectors in global and local coordinates, respectively; v̂_s, ĥ_s and v̂'_s, ĥ'_s indicate the unit scattering polarization vectors in global and local coordinates, respectively.

Slope-Deterministic Composite Scattering Model (SDCSM)
For different radar frequencies at diverse incident angles, the EM backscattering echoes from the diverse triangular meshes on the electrically large sea surface correspond to different scattering mechanisms. When the incident angle is close to the specular reflection region, the SDKAM [6,7] is adequate for the specular reflection caused by large-scale gravity waves. However, as the incident angle increases, the number of triangular meshes governed by the Bragg scattering mechanism gradually increases, which renders the SDKAM insufficient for calculating the NRCS. In this case, the TSM [11,12] can be utilized to calculate the NRCS of the sea surface. Therefore, it is sensible to take both scattering mechanisms into account when calculating the NRCS of an electrically large sea surface, yielding the so-called slope-deterministic composite scattering model (SDCSM) [13]. The NRCS of the EM scattering echoes can then be written as

σ⁰_pq = [Σ_{m=1}^{M} Σ_{n=1}^{N} σ⁰_pq(m, n) Δx Δy] / A   (9)

where M and N represent the sample points of the sea surface model in the x and y directions, respectively; Δx and Δy correspond to the spatial steps of the sea surface model in the x and y directions, respectively; A is the area of the electrically large sea surface. The cutoff wavenumber k_cut is exploited to distinguish the quasi-specular scattering mechanism caused by large-scale gravity waves from the Bragg scattering caused by small-scale ripples, as shown in Figure 2. When q_⊥ is less than the cutoff wavenumber, the NRCS of the EM backscattering echoes from an individual facet is governed by the quasi-specular scattering mechanism, which can be well calculated by the SDKAM in Equation (1). When q_⊥ is greater than the cutoff wavenumber, the NRCS of the EM backscattering echoes from an individual facet is governed by the Bragg scattering mechanism, which can be well calculated by the SDTSM in Equation (7).
Then the NRCS of the EM scattering echoes can be calculated as the superposition of the contributions from the diverse scattering facets on the electrically large sea surface based on Equation (9). In this paper, k_cut = k/3 is adopted in our model according to the results in [27] when the sea spectrum is the Elfouhaily spectrum [28]. For backscattering NRCS computation, θ_i = θ_s, φ_i = 0° and φ_s = 180°.
Figure 3 displays the NRCS of the EM backscattering echoes from an electrically large sea surface based on the SDKAM, SDTSM, and SDCSM, respectively. The frequency is 13.9 GHz, the wind speed is u_10 = 10 m/s, and the wind direction is downwind. The size of the electrically large sea surface is 204.9 m × 204.9 m, the spatial step is 0.1 m × 0.1 m, the cutoff wavenumber is k_cut = k/3, and the permittivity of the sea surface is (25.4047, 36.0192). The results in this paper are averaged over 50 samples. As shown in Figure 3, when the incident angle is close to the specular scattering region, the NRCS calculated by the SDKAM is 10 dB greater than that calculated by the SDTSM, indicating that the SDTSM is not suitable for calculating the specular backscattering echoes. However, when the incident angle is greater than 20 degrees, the NRCS calculated by the SDKAM decreases dramatically because the EM backscattering echoes from the sea surface are dominated by the Bragg scattering caused by small-scale ripples.
The NRCS of an electrically large sea surface compared with the experimental data [29] is illustrated in Figure 4 for wind speeds of 5 and 10 m/s; the other parameters are the same as in Figure 3. As can be seen in Figure 4, the results based on the SDCSM in this paper are relatively consistent with the experimental data, which indicates that our algorithm is feasible for calculating the NRCS of the EM backscattering echoes from the sea surface.
However, with the development of synthetic aperture radar (SAR), the resolution of the electrically large sea surface is gradually improving, which makes the calculation of the NRCS from the electrically large sea surface inefficient. Therefore, it is necessary to develop a high-performance algorithm to calculate the NRCS of the electrically large sea surface.

NVIDIA Tesla K80 GPU Features and GPU-Based SDCSM Implementation
In this paper, an NVIDIA Tesla K80 GPU with CUDA is exploited to improve the calculation performance of the NRCS from an electrically large sea surface. The GPU accelerator features and the GPU-based SDCSM algorithm are detailed in this section.

NVIDIA Tesla K80 GPU Hardware Resources
The NVIDIA Tesla K80 is engineered to deliver superior performance in demanding servers. It is designed to boost throughput by 5 to 10× with 480 GB/s of aggregate memory bandwidth compared with a conventional CPU. Figure 5 displays a schematic of the Tesla K80 dual-GPU features. As shown in Figure 5, the NVIDIA Tesla K80, with its dual-GPU design, contains 24 GB of global memory, 65,536 registers, 128 KB of shared memory, and 64 KB of constant memory. There are also 13 streaming multiprocessors (SMXs) in each GPU, and each SMX contains 192 CUDA cores, for a total of 4992 NVIDIA CUDA cores on the Tesla K80. Our sequential C algorithm is performed on the Intel(R) Core(TM) i5-3450 CPU.


SDCSM Parallel Computing with CUDA
CUDA was introduced in 2006 by NVIDIA as an extension of the conventional C language to leverage the high-performance computing (HPC) capability of GPUs, which enables scientists and engineers to solve massive computational problems more efficiently. In this paper, an efficient GPU-based SDCSM model is developed with CUDA to improve the calculation performance of the NRCS from an electrically large sea surface. The entire parallel program consists of two components: the sequential parts performed on the CPU and the parallel parts performed on the GPU.
Threads are the smallest executable unit in a CUDA program; threads are grouped into a (up to three-dimensional) thread block, and thread blocks are organized into a grid. Figure 6 shows the hierarchy of thread groups. On the Tesla K80, each SMX schedules and executes threads concurrently in groups of 32, called warps, and the maximum number of threads per multiprocessor is 2048. In this paper, the EM backscattering echoes from different triangular meshes are de-correlated, thus the EM backscattering echoes from the diverse triangular meshes on the electrically large sea surface can be calculated independently in parallel on the GPU by all threads. Each thread is responsible for calculating the EM backscattering from an individual triangular mesh on the sea surface.

Initial Parallel Implementation
The computational performance of the GPU-based SDCSM algorithm is detailed in this section, and five CUDA optimization strategies are adopted to further accelerate the computation of the NRCS of the electrically large sea surface. The typical processing flow of the GPU-based SDCSM algorithm can be expressed as follows: 1.
Initialize the size of electrically large sea surface L x × L y , the spatial step of the sea surface ∆x × ∆y, the wind speed u 10 , the wind direction φ w , the incident and scattering angles θ i and θ s , the incident and scattering azimuth angles φ i and φ s , the frequency f , the grid and block sizes corresponding to the CUDA program.

2.
Transfer the electrically large sea surface data from the CPU to the GPU.

3.
Compute the NRCS of individual triangular meshing on the electrically large sea surface independently in parallel on the GPU by all threads within a block.

4.
Copy the results from the GPU back to the CPU.

The simulation parameters are as follows: the frequency f = 13.9 GHz, the wind speed u_10 = 10 m/s, the incident angle θ_i = 60° with θ_s = θ_i, and the incident and scattering azimuth angles 0° and 180°, respectively. Figure 7 shows a schematic of the electrically large sea surface discretized into triangular meshes. The size of the electrically large sea surface is 204.9 × 204.9 m², and the spatial step is 0.1 × 0.1 m². Thus, there are 8,396,802 triangular meshes when calculating the NRCS of the electrically large sea surface. The grid and block sizes are 128 × 256 and 16 × 16, respectively.
The GPU-based SDCSM parallel program executed on the Tesla K80 was compiled by nvcc 5.0 with the -O3 -arch=compute_35 -code=sm_35 -Xptxas -v compiling options, while the conventional serial C program executed on the Intel(R) Core(TM) i5-3450 CPU was compiled by g++ with the -O2 compiling option. Table 1 shows the runtime and speedup for the GPU-based SDCSM parallel program compared with the conventional serial C program. As shown in Table 1, the initial GPU-based SDCSM program only takes 86.31 ms, while the CPU-based program requires 47,593.6 ms. A significant speedup of 551.4× is achieved compared with the serial C program. As described in Table 1, the most time-consuming aspect of the GPU-based SDCSM is the kernel execution. Therefore, optimizing that step helps improve the performance of the CUDA program.

Further Optimization with Coalesced Global Memory Access
Threads inside each block may access data from diverse memory spaces each time the kernel is called. Memory optimizations are particularly important for massively parallel computation performance. When the CUDA program is launched, all threads executing on the SMXs access the same global memory, which has the greatest access latency. On the Tesla K80, each SMX provides an on-chip shared memory, accessible by all threads within one thread block, with much higher bandwidth and lower latency.
In order to achieve high memory bandwidth, shared memory is utilized to cache global memory with coalesced accesses, thus reducing the number of accesses to global memory in our parallel program. By staging the surface heights in shared memory (__shared__ float zsea[17][9]), the coalesced memory access pattern enables the GPU to combine groups of reads of one sub-surface's data items on the electrically large sea surface into one operation, as illustrated in Figure 8. The runtime and speedup for the GPU-based SDCSM program with and without the coalesced access optimization are displayed in Table 2. With the coalesced access optimization, the number of accesses to global memory is reduced from 41,943,040 to 5,013,504, and a significant speedup of 566.4× is achieved.


Further Optimization with Constant Memory
In our CUDA program, all threads within a grid utilize the same parameters ε, πk²|q⃗|²/q_z⁴ in Equation (1), and πk⁴(1 − ε)² in Equation (7) to conduct the same computation on the diverse triangular mesh data. On the Tesla K80 GPU, all threads share the same 64 KB of constant memory. Figure 9 shows the usage of constant memory for avoiding repeated calculations when the kernel is called. As shown in Figure 9, the same 24 bytes of coefficients reside in constant memory. When all threads inside an SMX access the same address, the constant cache access can be as fast as a register access. Table 3 shows the runtime and speedup for the GPU-based SDCSM program with and without the constant memory optimization. With the constant memory optimization, the total runtime is reduced from 84.03 ms to 82.56 ms, and a significant speedup of 576.5× is achieved.

Further Optimization with Fast Math Compiler Option
For single-precision floating-point operations performed on the device, the nvcc compiler provides an option, -use_fast_math, which causes eligible functions in the CUDA program to be compiled to CUDA intrinsic functions. Intrinsic functions are compiled to fewer instructions and offer greater throughput than their equivalent standard counterparts. As a result, a CUDA program built with the fast math compiler option can achieve more aggressive optimization at the cost of some accuracy.
The runtime and speedup for the GPU-based SDCSM program with the fast math compiler option are provided in Figure 10. After utilizing the -use_fast_math compiler option, the runtime of the CUDA program is reduced from 82.56 ms to 60.68 ms, raising the acceleration from 576.5× to 784.3×.
It is advisable to exploit the fast math option whenever speedup is more important than precision. In order to verify the correctness of the outcomes with the -use_fast_math compiler option, the mean absolute error (MAE) and MAE/mean are exploited to ensure that the results of the parallel program are consistent with those of the serial program. Table 4 illustrates the MAE and MAE/mean with and without the -use_fast_math compiler option. As shown in Table 4, the MAE and MAE/mean with the -use_fast_math compiler option are slightly larger than those without it, indicating that the fast math option has little effect on the correctness of the results.


Further Optimization with Asynchronous Data Transfer (ADT)
The percentages of GPU runtime consumed by the different GPU operations after using the fast math compiler option are shown in Figure 11. As can be seen, the proportion of runtime spent on data transfer rises from the 43.62% shown in Table 1 to 62.1% in Figure 11. Therefore, reducing the data transmission time between the host and the device will effectively improve the computing performance.
In order to hide the data transmission time, asynchronous data transfer (ADT) is utilized, which can perform data transfers between host and device memories and kernel execution simultaneously. The Tesla K80 GPU provides two copy engines, namely a host-to-device engine and a device-to-host engine, allowing one data transfer from the host to the device, one kernel execution, and one data transfer from the device to the host to overlap. Data transfer and kernel execution proceed simultaneously through streams, each of which can be seen as a sequence of commands that execute on the GPU in order. Figure 12 shows the asynchronous data transfer for our CUDA program executed on the NVIDIA Tesla K80. In this paper, four streams are utilized to calculate the NRCS from the electrically large sea surface. Figure 13 shows the runtime for the GPU-based SDCSM program with and without the ADT optimization. As shown in Figure 13, the running time of the parallel program decreases markedly from 60.68 ms to 23.34 ms, and a significant speedup of 2031.9× is achieved with the ADT optimization, as shown in Table 5.

Further Optimization with Asynchronous Data Transfer (ADT)
The percentages of GPU runtime consumed by different GPU operations after using the fast math compiler option are shown in Figure 11. As can be seen, the proportion of data transfer time rises from 43.62% (Table 1) to 62.1% (Figure 11). Therefore, reducing the data transmission time between the host and the device will effectively improve the computing performance.

In order to hide the data transmission time, asynchronous data transfer (ADT) was utilized, which can perform data transfers between host and device memories and kernel execution simultaneously. The Tesla K80 GPU provides two copy engines, namely, a host-to-device engine and a device-to-host engine, allowing one data transfer from the host to the device, one kernel execution, and one data transfer from the device to the host to overlap. Data transfer and kernel execution proceed simultaneously through streams, each of which can be seen as a sequence of commands that execute on the GPU in order. Figure 12 shows the ADT scheme for our CUDA program executed on the NVIDIA Tesla K80. In this paper, four streams are utilized to calculate the NRCS from the electrically large sea surface. Figure 13 shows the runtime for the GPU-based SDCSM program with and without ADT optimization. As shown in Figure 13, the running time of the parallel program decreased markedly from 60.68 ms to 23.34 ms, and a significant speedup of 2031.9× was achieved with ADT optimization, as shown in Table 5.

Figure 14 illustrates the speedup and runtime for different sizes of the electrically large sea surface (0.1 m). As shown in Figure 14, the speedup tends to be stable with the increase of sea surface size, which implies that the massively parallel computing capability of the GPU has been fully utilized in this paper.
Figure 15 shows the NRCS results of the parallel program compared with those of the serial program when the frequency is equal to 13.9 GHz at different wind speeds. As shown in Figure 15, the results of the parallel program are consistent with those of the serial program, which indicates that the GPU-based SDCSM algorithm developed in this paper is feasible for calculating the EM backscattering NRCS from the electrically large sea surface. A significant speedup of 2039.1× has been achieved compared with the conventional serial C program, implying that an EM backscattering NRCS calculation that previously took more than 1 h can now be completed within 2 s.

Conclusions
In this paper, a slope-deterministic composite scattering model (SDCSM) was exploited to calculate the NRCS of the EM backscattering echoes from the electrically large sea surface. Both the quasi-specular backscattering caused by large-scale gravity waves and the Bragg backscattering caused by small-scale ripples are taken into account in calculating the NRCS. Moreover, the NVIDIA Tesla K80 with CUDA has been utilized to accelerate the calculation of the NRCS from the electrically large sea surface. Compared with the conventional serial C program, our parallel program significantly improves the performance of the NRCS calculation, and a significant speedup of 2104.2× has been achieved. Four optimization strategies have been adopted in our GPU-based SDCSM parallel program. First, coalesced global memory access was used to reduce the number of global memory transactions. Second, constant memory was exploited to reduce repetitive computations in the CUDA program. Third, the -use_fast_math compiler option was adopted to further improve the computational performance, and the correctness of the outcomes was verified to ensure that the results of the parallel program are consistent with those of the serial program. Finally, in order to hide the data transmission time between the host and the device, asynchronous data transfer (ADT) was exploited to further improve the computational performance of the GPU-based SDCSM parallel program. The proposed GPU-based SDCSM parallel program only calculates the NRCS of the EM backscattering under low sea states (wind speed < 10 m/s). For high sea states (≥ 10 m/s), the electrically large sea surface will be covered with whitecaps. Future work will take the volume scattering into account to calculate the NRCS of the EM backscattering echoes from the electrically large sea surface.