An Improved Back-Projection Algorithm for GNSS-R BSAR Imaging Based on CPU and GPU Platform

: Global Navigation Satellite System Reﬂectometry Bistatic Synthetic Aperture Radar (GNSS-R BSAR) is becoming more and more important in remote sensing because of its low power, low mass, low cost, and real-time global coverage capability. The Back Projection Algorithm (BPA) was usually selected as the GNSS-R BSAR imaging algorithm because it can process echo signals of complex geometric conﬁgurations. However, the huge computational cost is a challenge for its application in GNSS-R BSAR. Graphics Processing Units (GPU) provides an efﬁcient computing platform for GNSS-R BSAR processing. In this paper, a solution accelerating the BPA of GNSS-R BSAR using GPU is proposed to improve imaging efﬁciency, and a matching pre-processing program was proposed to synchronize direct and echo signals to improve imaging quality. To process hundreds of gigabytes of data collected by a long-time synthetic aperture in ﬁxed station mode, a stream processing structure was used to process such a large amount of data to solve the problem of limited GPU memory. In the improvement of the imaging efﬁciency, the imaging task is divided into pre-processing and BPA, which are performed in the Central Processing Unit (CPU) and GPU, respectively, and a pixel-oriented parallel processing method in back projection is adopted to avoid memory access conﬂicts caused by excessive data volume. The improved BPA with the long synthetic aperture time is veriﬁed through the simulation of and experimenting on the GPS-L5 signal. The results show that the proposed accelerating solution is capable of taking approximately 128.04 s, which is 156 times lower than pure CPU framework for producing a size of 600 m × 600 m image with 1800 s synthetic aperture time; in addition, the same imaging quality with the existing processing solution can be retained.

GNSS-R BSAR is one of the emerging applications of GNSS-R technology [19], which uses the GNSS satellites as non-cooperative illuminators and employs bistatic SAR technology to obtain an image of the Earth's surface. This technology has been proven to be applicable to target detection and recognition [20], deformation monitoring [21,22], etc. It is also evident that more precise phase-based analyses are required for deformation monitoring, which is likely to be the topic of our future activities. The advantage of GNSS-R BSAR is significant compared with traditional SAR. Firstly, as no additional transmitter is needed, the GNSS-R BSAR equipment can be made low mass, low power, and low 1.
Using Fast Back Projection Algorithm (FBPA) for GNSS-R BSAR imaging has the advantage of reducing the computational cost, but it will reduce the quality of the imaging [29]. 2.
Using GPU to accelerate BPA in parallel; this method will increase the complexity of the system but will obtain the optimal imaging quality [30][31][32][33][34][35].
In this paper, our research focuses on the use of GPU to accelerate BPA in GNSS-R BSAR. An improved BPA was proposed for making it suitable for GPU parallel implementation. Firstly, a pre-processing program was designed to obtain parameter information matching BPA. In the pre-processing program, to ensure the synchronization of the direct signal and the echo signal, the echo signal is synchronized with the accurate pseudorange and code phase information in the direct signal. The precision ephemeris of the GPS is introduced to improve the orbit accuracy of the navigation satellites and the positioning accuracy of the receivers, which improves the imaging quality of the BPA algorithm. Secondly, the parallel acceleration strategy of BPA in CPU and GPU heterogeneous architecture is designed and implemented. In fixed station mode, the GNSS-R SAR system may work for thousands of seconds on a single mission resulting in the acquisition of data that can reach hundreds of gigabytes. A stream processing structure was used to process such a large amount of data. According to the memory size of the GPU, the original data are divided into several subdata in chronological order and transmitted to GPU. To optimize calculation efficiency, the preprocess program is allocated to the CPU for execution, after which the BPA is allocated to the CPU for execution.
To prove the feasibility of the proposed imaging algorithm, simulations and experiments were carried out using a GPS-L5 signal in the ground fixed station mode. In simulation and experiment, the time consumed by the proposed algorithm is 163.96 and 156 times less than that consumed by BPA running only on CPU, respectively. However, the image quality has not been reduced. The simulation and experimental results prove the feasibility and effectiveness of the proposed algorithm.
This paper is organized into six sections. In Section 2, the signal model of GNSS-R BSAR is reviewed briefly. In Section 3, the improved BPA is discussed in detail. A GPU parallel acceleration method compatible with the improved BPA is proposed and analyzed in Section 4. The simulation and experimental results are provided in Section 4. A further Remote Sens. 2021, 13,2107 3 of 17 discussion of the proposed algorithm is presented in Section 5, and conclusions are drawn in Section 6.

Materials
We consider a ground-based stationary receiving platform collecting the signals emitted from a GNSS satellite and reflected by a stationary scene. Figure 1 shows the geometric configuration of GNSS-R BSAR, where the origin O is set to center of the XYZ coordinate system, P(x p , y p , 0) is an arbitrary point target in the imaging area (the targets are assumed to be located on the XOY plane). T and R are the transmitter (navigation satellite) and fixed receiver, respectively. The transmitter is located at T(x T (t), y T (t), z T (t)), while the receiver at (x R , y R , z R ). R T (t) is the distance between the satellite T and target P, R R is the distance between the receiver R and target P, R B (t) is the distance between the satellite T and receiver R. R T (t), R R , and R B (t) can be expressed as BSAR is reviewed briefly. In Section 3, the improved BPA is discussed in detail. A GPU parallel acceleration method compatible with the improved BPA is proposed and analyzed in Section 4. The simulation and experimental results are provided in Section 4. A further discussion of the proposed algorithm is presented in Section 5, and conclusions are drawn in Section 6.

Materials
We consider a ground-based stationary receiving platform collecting the signals emitted from a GNSS satellite and reflected by a stationary scene. Figure 1 shows the geometric configuration of GNSS-R BSAR, where the origin O is set to center of the XYZ coordinate system, ( , ,0) p p P x y is an arbitrary point target in the imaging area (the targets are assumed to be located on the XOY plane). T and R are the transmitter (navigation satellite) and fixed receiver, respectively. The transmitter is located at R t is the distance between the satellite T and target P , R R is the distance between the receiver R and target P , is the distance between the satellite T and receiver R .
The range history is ( )  The range history is R(t) = R T (t) + R R . GNSS-R BSAR's receiver contains two channels, a direct channel, and an echo channel. The direct channel connects a Right-Handed Circularly Polarized (RHCP) omnidirectional antenna for receiving direct signals from GNSS satellites, which is used to realize system positioning and obtain accurate carrier phase and code phase to provide reference information for signal synchronization. Based on the Stop-Go model [36], the length of a ranging code (for GPS C/A code is 1 ms) is equivalent to the width of the range pulse. After quadrature demodulation and SAR data formatting [36], the direct signal can be expressed as a two-dimensional form η is the azimuth time, also called slow time, τ is the range time, also called fast time. W(η) is the envelope of the received signal in the azimuth dimension, which is a Remote Sens. 2021, 13, 2107 4 of 17 rectangular window. N(η) is the data code of the navigation signal, C(τ) is the ranging code sequence, f c is the carrier frequency of the transmitted signal, and c is the speed of light. The echo channel connects a high-gain Left-Handed Circular Polarized (LHCP) antenna to receive the echo signal from the target area. After removing the data code, the echo signal can be expressed as The direct channel and echo channel use the same oscillator to mix the sampling signal, and the analog-to-digital converter chips for sampling the direct signal and echo use the same clock. After demodulation, the signals are sampled and stored by a large storage data collector for postprocessing.

Review on BPA
The BPA is a typical time-domain imaging algorithm [37], which comes from computer tomography. Firstly, the imaged plane is divided into grids, and the center of each grid is determined as the pixel point. Secondly, range compression used matched filtering to achieve pulse compression in the range. Thirdly, in the range-azimuth time domain, the coherent accumulation method is used to realize the azimuth focusing of each pixel point by point. Each pixel in the imaged plane is traversed to calculate the position of the target in each echo, and the phase is compensated for coherent accumulation to obtain the pixel value of the current point.

Data Pre-Processing
To optimize the imaging quality of the BPA algorithm and extract the parameters needed to accelerate BPA in parallel using the GPU, a pre-processing program is designed for a GNSS-R BSAR system. Three crucial technical issues relating to the program are considered. Firstly, it is necessary to ensure the synchronization of the direct signal and the echo signal during the entire imaging process. Secondly, BPA requires accurate time, and the location of the transmitter, receiver, and imaging area [38]. Finally, the relevant parameters (the reference point, imaging area, synthetic aperture time, the pixel interval) of the BPA algorithm are obtained. The pre-processing program was designed for the above three issues; the specific process is shown in Figure 2. The range compression in the BPA of GNSS-R SAR is achieved through correlation operations between the direct channel and the echo channel. The performance of the autocorrelation depends on the quality of the signal synchronization between the direct and echo channel. Ideally, the direct channel and the echo channel would have good channel The range compression in the BPA of GNSS-R SAR is achieved through correlation operations between the direct channel and the echo channel. The performance of the autocorrelation depends on the quality of the signal synchronization between the direct and echo channel. Ideally, the direct channel and the echo channel would have good channel coherence, but unfortunately, the signal can be affected by various factors. To solve the problem of time synchronization, the signals of the direct channel and the echo channel are processed, as shown in Figure 2. The GNSS direct signal is captured and tracked to obtain a subframe start timestamp position T d of the direct signal and calculate the time delay difference T delay between the echo signal and the direct signal due to the different propagation paths. To ensure the time and phase synchronization of the two signals, the direct signal of 1 ms (a ranging code length) taken by timestamp T d , and the complete pseudorandom noise code of zero-initial phase is reconstructed as a reference signal. According to Phase-locked loop (PLL) theory [39], after compensating for the time difference T delay , we can obtain the direct signal and the echo signal, which was emitted by satellite at the same time. The direct signal and the echo signal have good time synchronization performance by this operation. The expression of the reference signal is The method for improving the satellite orbit accuracy and positioning accuracy is shown in Figure 2. International GNSS Service (IGS) Ultra-Rapid Precision Ephemeris was used to extract satellite obit data and calculate its position with less than 10 cm position error and satellite clock 5 ns error [40]. As IGS Ultra-Rapid data has a 15 min update rate, interpolation was necessary to meet the sampling rate of the GNSS-R BSAR system.
As shown in Figure 2, The method of using geometric modeling is used to determine the specific parameters of the BPA imaging process. Taking the ground fixed station mode as an example, the reference point is the position of the phase center of the echo antenna. The imaged area is the area covered by the main lobe of the echo antenna. The synthetic aperture time can be selected as a period of time between the navigation satellite antenna beam enters the imaged area to the time when the beam completely leaves the imaged area. To obtain a high-resolution image, it is necessary to determine the start and end times of the synthetic aperture according to the geometric structure of acquisition time [28]. Finally, the imaged plane is determined according to the coordinate information of the imaged area. In this paper, the height of the imaged plane is determined by introducing the Google Elevation Application Programming Interface (API), which can reduce the impact of elevation errors on the imaged quality. The pixel interval is estimated by the range and azimuth resolution at the time of acquisition. The minimum requirement is that the pixel interval sampling for the range and azimuth resolution satisfies the Nyquist sampling theorem.

Theoretical Analysis of the Improved BPA
An improved BP algorithm is proposed for GNSS-R BSAR, which can optimize the signal processing of heterogeneous architecture platform of CPU and GPU. The GNSS-R SAR system may work for thousands of seconds on a single mission resulting in the acquisition of data that can reach hundreds of gigabytes. A stream processing structure was used to process such a large amount of data.
The block diagram of the proposed algorithm is shown in Figure 3. The algorithm can be divided into three steps: range compression, back projection and phase compensation, and azimuth compression. signal processing of heterogeneous architecture platform of CPU and GPU. The GNSS-R SAR system may work for thousands of seconds on a single mission resulting in the acquisition of data that can reach hundreds of gigabytes. A stream processing structure was used to process such a large amount of data.
The block diagram of the proposed algorithm is shown in Figure 3. The algorithm can be divided into three steps: range compression, back projection and phase compensation, and azimuth compression.

Range Compression
As the ranging code of the navigation satellite signal has good autocorrelation characteristics, the range compression is accomplished by autocorrelating the echo signal with the reference signal. Taking the C/A code of GPS L1 signal as an example, its autocorrelation peak is 24 dB higher than the sidelobe and cross-correlation results. Under a Gaussian white noise background, the autocorrelation gain is about 30 dB. In this paper, the calculation speed is accelerated by converting to the frequency domain, due to which the range autocorrelation is relatively computationally intensive. As shown in Figure 3, firstly, due to the actual collected echo data of GNSS-R BSAR consisting of hundreds of gigabytes, its capacity greatly exceeds the size of the video memory (tens of gigabytes) [30]. The proposed solution is to equally divide the echo data into multiple data segments and process them sequentially according to time. Secondly, the direct signal and the echo signal are converted to frequency domain for range compression, the specific implementation process is described in [25]. At a given azimuthal moment, the signal after range compression is expressed as: ( ) P τ represents the autocorrelation function of the ranging code.

Range Compression
As the ranging code of the navigation satellite signal has good autocorrelation characteristics, the range compression is accomplished by autocorrelating the echo signal with the reference signal. Taking the C/A code of GPS L1 signal as an example, its autocorrelation peak is 24 dB higher than the sidelobe and cross-correlation results. Under a Gaussian white noise background, the autocorrelation gain is about 30 dB. In this paper, the calculation speed is accelerated by converting to the frequency domain, due to which the range autocorrelation is relatively computationally intensive. As shown in Figure 3, firstly, due to the actual collected echo data of GNSS-R BSAR consisting of hundreds of gigabytes, its capacity greatly exceeds the size of the video memory (tens of gigabytes) [30]. The proposed solution is to equally divide the echo data into multiple data segments and process them sequentially according to time. Secondly, the direct signal and the echo signal are converted to frequency domain for range compression, the specific implementation process is described in [25]. At a given azimuthal moment, the signal after range compression is expressed as: P(τ) represents the autocorrelation function of the ranging code.

Back Projection and Phase Compensation
The focus processing of BPA in the azimuth direction is carried out in the time domain and does not require Range Cell Migration Correction (RCMC). F(η, τ) is the correlation value at different time delays relative to the reference point, and the process of back projection is to map the F(η, τ mn ) to different pixels (x m , y n ), (see Figure 4) (m, n) is the image pixel index of the final image.  The propagation delay can be calculated according to the signal propagation distance when the three positions of the navigation satellite, the pixels, and the receiver are uniquely determined, which can be expressed as: The propagation delay can be calculated according to the signal propagation distance when the three positions of the navigation satellite, the pixels, and the receiver are uniquely determined, which can be expressed as: τ mn (η) is the time delay of the echo signal generated at the pixel (x m , y n ). R T (η, m, n) and R R (m, n) are the distance from the satellite and receiver to the pixel (x m , y n ). Before F(η, τ mn ) is mapped to the pixel, the Doppler phases need be compensated. The Doppler phase compensation factor can be expressed as: The Doppler phase compensation is realized by complex multiplication h(η, m, n) in the time domain according to the data selected by the slant range R T (η, m, n) + R R (m, n), and the result is used as the value of the pixel. An image can be generated at each azimuth time η, which can be expressed as

Azimuth Compression
The azimuth compression of a data segment is achieved by coherently accumulating S(η, x m , y n ), the result obtained is defined as a subimage, which can be expressed as (14) T is the time length of each data segment and the synthetic aperture time. The final SAR image is generated by accumulating each subimage, which can be expressed as N is the total number of data segments. As we all know, with the increase of scene area and synthetic aperture time, the computational cost of BPA will increase sharply. This makes BPA difficult to adopt in SAR technology. Fortunately, the two main steps (the range compression and the backprojection) of BPA can be highly parallelized. A software program taking advantage of a GPU can benefit from speed improvements thanks to parallel computing. In the next section, we will analyze these two steps.

Parallelized BPA on GPU
To optimize the processing process, a heterogeneous processing architecture was designed. Figure 5 shows the system block diagram of the proposed GNSS-R BSAR signal processing. In the heterogeneous framework, the GPU and the CPU process the data serially. The pre-processing part is mainly executed in CPU, the BPA is executed in GPU, finally, the image is sent back to CPU for output and storage.

Parallelized BPA on GPU
To optimize the processing process, a heterogeneous processing architecture was designed. Figure 5 shows the system block diagram of the proposed GNSS-R BSAR signal processing. In the heterogeneous framework, the GPU and the CPU process the data serially. The pre-processing part is mainly executed in CPU, the BPA is executed in GPU, finally, the image is sent back to CPU for output and storage.

Parallelized Range Compression on GPU
It can be seen from Section 3 that the range compression is realized by converting to the frequency domain. As shown in Figure 5, the range compression module is all processed on the GPU, which includes main operations such as FFT and IFFT operation, complex multiplication, and complex addition. FFT and IFFT operations can be completed through the cuFFT library provided by Compute Unified Device Architecture (CUDA). The parallel complex multiplication and complex addition in the frequency domain are completed by constructing a kernel function. It is worth noting that the operation of FFT and IFFT will take up a lot of video memory resources, and the data size needs to be reasonably planned according to the size of the video memory to prevent memory overflow.
In the CUDA architecture, a thread is the smallest execution unit of a GPU that corresponds to the streaming processor (SP) of hardware. The threads are allocated to the number of range gates sampled according to the range, and each thread is solely responsible for the complex multiplication of the echo signal and the reference signal by a range gate data. The main steps of range compression on GPU include: (1) N a and N r are determined according to the size of one segment data, where N a and N r are the number of range and azimuth samples, respectively. The memory of CPU and GPU are requested for the host for the corresponding data, respectively, and the data are transferred from the memory of the CPU to the memory of the GPU.

Parallelized Back Projection on GPU
In each imaged area, the corresponding time delay is calculated according to the slant range of each pixel, and a compensation factor is constructed to compensate for the phase of the pixel. Furthermore, both the sampling points processing at each portion and the pixel processing in the image are the same and independent. Therefore, all the pixels in the imaging area are implemented parallelly by an assigned thread, which was also used to calculate and process azimuth time images. In the viewpoint of thread mapping, there are two kinds of implementations for the parallel Back Projection step: pixel-oriented parallel and range code-oriented parallel. The two methods described above have good parallelism, the pixel-oriented parallelism is more suitable for the engineering realization of SAR image [30]. Thus, the pixel-oriented parallel BPA is adopted in this paper. The parallel implementation steps of pixel-oriented back-projection are as follows: (1) Calculate the delay τ mn (η) by constructing a kernel function to allocate threads to each pixel at a range code. (2) According to the delay of each pixel τ mn (η), the information in the corresponding range gate in the range compression is selected, and the compensation factor h(η, m, n) is constructed according to the slant range information R(η, m, n) and the phase difference value. (3) Phase compensation is performed by complex multiplication of the range compression data F(η, τ mn ) for that pixel point with the compensation factor h(η, m, n) in the time domain. The kernel function will be invoked tens of thousands of times to complete the complex multiplication and complex addition processing. (4) The data S(η, x m , y n ) from each azimuthal moment is coherently summed to generate the subimage S sub (η, x m , y n ). (5) Repeat steps (1)-(4) until all echo data are processed.

Results
To verify the acceleration of the proposed algorithm, running times and image quality were compared using a unified hardware development platform. In simulation and experiments, NVIDIA GeForce GTX 3090 is used, which is mounted in a computer equipped with AMD-3800X (Core frequency is 3.80~4.5 GHz), 64-bit CPU serving as the host, 64 GB RAM, and Windows 10 64-bit OS, whereas the C/C++ Visual Studio 2019 development environment is used to implement the algorithms. The specific operation process is implemented in the following environment: (1) C++ as a reference compiled procedural language using the Faster Fourier Transform in the West (FFTW) library version 3.3.5. (2) CUDA 10 with the cuFFT and NPP included libraries.

Simulation
Simulations were conducted to demonstrate the performance of the proposed algorithm, which were carried out using the CPU and GPU of the hardware platform described above, respectively. The geometric configuration of the satellite and receiver is shown in Figure 6a. Some parameters used in the simulation are listed in Table 1. Figure 6b shows the multitarget simulation results of the proposed algorithm; all point targets are focused on the correct position in the image. Table 2 shows the elapsed time involved in the whole processing using CPU and GPU. The total running time of using the GPU and CPU to generate images of the size 4 km × 4 km is 106.468 s, while the running time of the improved BPA on the CPU is about 17,456.88 s, and the total speedup is about 163.96 times. FFT and IFFT process are accelerated by 117 times, the complex multiplication is accelerated by 261 times, and the back projection is accelerated by 164.1 times. Each step of the entire process of the BPA can be accelerated by GPU in parallel, and since most of the time of the algorithm is consumed in the back projection process, the overall acceleration ratio mainly depends on the back projection acceleration ratio.

Simulation
Simulations were conducted to demonstrate the performance of the proposed algorithm, which were carried out using the CPU and GPU of the hardware platform described above, respectively. The geometric configuration of the satellite and receiver is shown in Figure 6a. Some parameters used in the simulation are listed in Table 1.    To verify the imaging quality of the proposed algorithm, cross-sectional measurements were taken along the range and azimuth in the point targets. The imaging results of chosen target 25 are presented in Figure 7. As can be seen, Target 25 is well focused. The range profile and the azimuth profile were well consistent with theoretical ones [23]. Table 3 shows the comparison of the proposed algorithm imaging and CPU imaging of target 25, the range and azimuth imaging resolution, Peak Side-Lobe Ratio (PSLR), and Integral Side-Lobe Ratio (ISLR) are the same. This effectively shows that GPU-accelerated BPA can effectively reduce the imaging time without reducing the image quality at all. In the simulation, we ignored the cross-correlation interference caused by other navigation satellite signals. Firstly, the GPS-L5 signal's autocorrelation peak is about 35 dB higher than the sidelobe and cross-correlation results. Secondly, the strength of the autocorrelation signal will increase by 10 log 10 N r times after azimuth focus, N r is the number of azimuth samples, while the strength of the cross-correlation signal will not increase due to the different slant range of different satellites.
BPA can effectively reduce the imaging time without reducing the image quality at all. In the simulation, we ignored the cross-correlation interference caused by other navigation satellite signals. Firstly, the GPS-L5 signal's autocorrelation peak is about 35 dB higher than the sidelobe and cross-correlation results. Secondly, the strength of the autocorrelation signal will increase by 10 10 log r N times after azimuth focus, r N is the number of azimuth samples, while the strength of the cross-correlation signal will not increase due to the different slant range of different satellites.    The simulation of scenes of different sizes is performed, and the results are shown in Table 4. In terms of computing efficiency, GPU and CPU-based methods are superior to CPU-based methods. The time consumed by the simulation increases as the size of the scene increases. When the scene size increases to a certain extent, the GPU stream processors are all in a working state, and the speedup ratio no longer changes. This illustrates that the acceleration capability of GPU-accelerated BPAs is largely dependent on the number of stream processors, with the higher the number of stream processors, the better the acceleration performance. To obtain better acceleration performance, developers can choose GPUs with more stream processors or use multiple GPU parallel methods to improve performance.

Experiments
To demonstrate the validity and feasibility of our proposed GPU accelerated improved BPA, real scene experiments were carried out. Figure 8 shows the real scene of the experiment. As shown in Figure 8a, the experimental hardware contains four navigation signal receiving channels. The RHCP omnidirectional antenna is used to receive direct signals from GNSS. The echo channel uses a high-gain LHCP antenna to receive the echo signal from the target area. The receiver location is the stadium stand of Beihang University, and the LHCP antenna was pointed toward the east, see Figure 8c. The experiment was conducted at 9:50 on 25 March 2021. According to the strategy of optimal resolution [28], GPS satellite PRN3 was selected as the optimal signal source, and a long-time synthetic aperture was carried out. The geometric configuration at the time of collection is shown in Figure 8b,c, and Table 5 lists the detailed parameters. signal receiving channels. The RHCP omnidirectional antenna is used to receive direct signals from GNSS. The echo channel uses a high-gain LHCP antenna to receive the echo signal from the target area. The receiver location is the stadium stand of Beihang University, and the LHCP antenna was pointed toward the east, see Figure 8c. The experiment was conducted at 9:50 on 25 March 2021. According to the strategy of optimal resolution [28], GPS satellite PRN3 was selected as the optimal signal source, and a long-time synthetic aperture was carried out. The geometric configuration at the time of collection is shown in Figure 8b,c, and Table 5 lists the detailed parameters.  The imaging results are shown in Figure 9. It can be seen that Figure 9a,b were very similar to each other-this shows that the image quality obtained by using GPU and CPU to run BPA is the same, which is consistent with the theoretical results. Based on resolution analysis [23], the theoretically predicted resolution is approximately 16.8 m in the range and around 0.95 m in the azimuth. The time taken was 128.04 s to run BPA imaging on the GPU, while that of the CPU was 19,974.24 s. The ratio of time consumed by GPU slightly higher than the simulation result; this is caused by I/O data transmission. As the scene becomes larger or the synthetic aperture time increases, the advantages of the proposed algorithm will become more obvious.  The imaging results are shown in Figure 9. It can be seen that Figure 9a,b were very similar to each other-this shows that the image quality obtained by using GPU and CPU to run BPA is the same, which is consistent with the theoretical results. Based on resolution analysis [23], the theoretically predicted resolution is approximately 16.8 m in the range and around 0.95 m in the azimuth. The time taken was 128.04 s to run BPA imaging on the GPU, while that of the CPU was 19,974.24 s. The ratio of time consumed by GPU slightly higher than the simulation result; this is caused by I/O data transmission. As the scene becomes larger or the synthetic aperture time increases, the advantages of the proposed algorithm will become more obvious. It can be seen from Figure 9 that there are multiple strong scattering targets in the image, but we cannot interpret them. The low range resolution of the system causes this result. The interpretation is made easier by matching with a very high-resolution optical image. It can be seen from Figure 10 that the buildings in the optical image and the strong scattering area in the radar image have a good match. It can be seen from Figure 10 that the strong scattering area of the radar image is mainly concentrated in the west of the building, which is caused by the geometry of the acquisition time. As shown in Figure 8b, this geometry leads to the majority of the echo signal coming from the west side of the building. To obtain the top view image of the target area, the receiver should be installed at a higher position or on an airplane for data collection. The resulting image will contain more information about the target scene.  It can be seen from Figure 9 that there are multiple strong scattering targets in the image, but we cannot interpret them. The low range resolution of the system causes this result. The interpretation is made easier by matching with a very high-resolution optical image. It can be seen from Figure 10 that the buildings in the optical image and the strong scattering area in the radar image have a good match. It can be seen from Figure 10 that the strong scattering area of the radar image is mainly concentrated in the west of the building, which is caused by the geometry of the acquisition time. As shown in Figure  8b, this geometry leads to the majority of the echo signal coming from the west side of the building. To obtain the top view image of the target area, the receiver should be installed at a higher position or on an airplane for data collection. The resulting image will contain more information about the target scene. It can be seen from Figure 9 that there are multiple strong scattering targets in the image, but we cannot interpret them. The low range resolution of the system causes this result. The interpretation is made easier by matching with a very high-resolution optical image. It can be seen from Figure 10 that the buildings in the optical image and the strong scattering area in the radar image have a good match. It can be seen from Figure 10 that the strong scattering area of the radar image is mainly concentrated in the west of the building, which is caused by the geometry of the acquisition time. As shown in Figure 8b, this geometry leads to the majority of the echo signal coming from the west side of the building. To obtain the top view image of the target area, the receiver should be installed at a higher position or on an airplane for data collection. The resulting image will contain more information about the target scene.  To verify the image quality obtained, cross-sectional measurements were taken along the range and azimuth in the area around target No. 6 (the edge of the gymnasium), where the echoes have good signal strength and continuity. As shown in Figure 11a, the actual range resolution of target No. 6 is approximately 19 m, which is close to the theoretical value. The theoretically predicted azimuth resolution is approximately 0.95 m, which is much smaller than the length of the building. However, we can compare the length of the target's measured values to its physical length. As shown in Figure 11b, the total length of the target's measured values is about 38 m, which is consistent with the optical measurement result. In addition, Figure 11a,b show the cross-sectional measurement results of the images generated using GPU and CPU to run BPA, respectively. The contours generated by these two methods perfectly match each other, which shows that BPA can be executed in GPU without loss, and the calculation time is greatly reduced compared with the operation in CPU. The proposed method effectively applies BPA to GNSS-R BSAR, and its feasibility has been verified. To verify the image quality obtained, cross-sectional measurements were taken along the range and azimuth in the area around target No. 6 (the edge of the gymnasium), where the echoes have good signal strength and continuity. As shown in Figure 11a, the actual range resolution of target No. 6 is approximately 19 m, which is close to the theoretical value. The theoretically predicted azimuth resolution is approximately 0.95 m, which is much smaller than the length of the building. However, we can compare the length of the target's measured values to its physical length. As shown in Figure 11b, the total length of the target's measured values is about 38 m, which is consistent with the optical measurement result. In addition, Figure 11a,b show the cross-sectional measurement results of the images generated using GPU and CPU to run BPA, respectively. The contours generated by these two methods perfectly match each other, which shows that BPA can be executed in GPU without loss, and the calculation time is greatly reduced compared with the operation in CPU. The proposed method effectively applies BPA to GNSS-R BSAR, and its feasibility has been verified.

Discussion
In this section, the necessity of GPU accelerating BPA of GNSS-R BSAR is analysed. The computational cost of the BPA range compression step can be expressed as  [27]. It can be seen that the calculation amount of BPA is much greater than RDA. However, BPA is more suitable for GNSS-R BSAR. The reason is as follows: (1) The geometry of GNSS-R BSAR is complex. The BPA can focus the echo data without the influence of geometry, while the frequency domain algorithms are greatly affected by the geometric structure [28]. Especially in the multisatellite fusion or multistation fusion mode [19,23], more complex geometry will lead to most of the frequency domain algorithm, which is no longer applicable. (2) The frequency domain algorithms rely on the accuracy of the equivalent squint model and cannot achieve a long-time synthetic aperture [41]. However, increasing the synthetic aperture time can not only improve the azimuth resolution and signalto-noise ratio but also improve the characteristic information of the target in the imaged area [24]. Especially in one station fixed mode, the synthetic aperture time can

Discussion
In this section, the necessity of GPU accelerating BPA of GNSS-R BSAR is analysed. The computational cost of the BPA range compression step can be expressed as N a N r log 2 N r + N a N r , where N a and N r are the number of range and azimuth samples, respectively. The computational cost of the back-projection step can be expressed as N m N n N a , where N m and N n are the numbers of samples of the imaging scene. Thus, the total computational cost for the improved BPA can be represented by N a N r log 2 N r + N a N r + N m N n N a . The computational cost of an improved RDA is 3N a N r log 2 N r + 2N a N r log 2 N a + 7N a N r [27]. It can be seen that the calculation amount of BPA is much greater than RDA. However, BPA is more suitable for GNSS-R BSAR. The reason is as follows: (1) The geometry of GNSS-R BSAR is complex. The BPA can focus the echo data without the influence of geometry, while the frequency domain algorithms are greatly affected by the geometric structure [28]. Especially in the multisatellite fusion or multistation fusion mode [19,23], more complex geometry will lead to most of the frequency domain algorithm, which is no longer applicable. (2) The frequency domain algorithms rely on the accuracy of the equivalent squint model and cannot achieve a long-time synthetic aperture [41]. However, increasing the synthetic aperture time can not only improve the azimuth resolution and signal-tonoise ratio but also improve the characteristic information of the target in the imaged area [24]. Especially in one station fixed mode, the synthetic aperture time can reach thousands of seconds. The BPA algorithm is not affected by the synthetic aperture time.
In practical applications, whether airborne large scene imaging or one-station fixed mode long synthetic aperture time imaging, the BPA algorithm will achieve a huge amount of calculations, which will greatly affect the efficiency of BPA applications. Especially in the application of deformation monitoring, there is a demand for response time. Therefore, it is necessary to use GPU to accelerate BPA for reducing the imaged time.
In this paper, we proposed GPU accelerated BPA, which has a good acceleration ability in GNSS-R BSAR 's one-stop fixed mode. The simulation and experimental verification results verify the feasibility and effectiveness of the proposed algorithm. The simulated imaging is performed on an area of 4 km × 4 km, and the synthetic aperture time is 300 s. The point targets in the image generated by the proposed algorithm are well focused. Comparing the proposed algorithm with the BPA run only on CPU, the image quality is the same, while the running speed is increased by 163.96 times. In the experiment, the synthetic aperture time was increased to the 1800 s for an image area of 600 m × 600 m. The collected raw data were processed by the proposed algorithm and the BPA run only on the CPU, respectively. The imaging results are similar for the two algorithms, the running time of the proposed algorithm is 128.04 s, while the running time on the CPU is 19,974.24 s; the time taken by the proposed algorithm is greatly reduced.

Conclusions
With the improvement of GPU computing power, the BPA has become a feasible imaging method for processing GNSS-R BSAR data. This paper proposes an improved BPA for processing GNSS-R BSAR data using a heterogeneous architecture platform of CPU and GPU. The improved BPA can accurately compensate for the nonlinear motion error of the satellite and improve the synchronization performance of the direct signal and echo signal. The improved BPA on the CPU and GPU platform greatly reduces the total running time and can meet the needs of many offline GNSS-R BSAR imaging applications. As shown by the simulation and the experimental results, the algorithm performs well in imaging quality and efficiency. Using GPS L5 signal for a long-time synthetic aperture, the range resolution of the generated image is about 19 m, and the azimuth resolution is about submeter level. Images of this quality can meet the needs of many applications; an important application of GNSS-R BSAR imaging in one-station fixed mode is deformation monitoring. The next step is to apply this algorithm to the research of deformation monitoring, which will greatly reduce time consumption and improve emergency response time.

Data Availability Statement:
The processed data cannot be shared at this time because they are also part of an ongoing research project.