Parallel Optimisation and Implementation of a Real-Time Back Projection (BP) Algorithm for SAR Based on FPGA

This study conducts an in-depth evaluation of imaging algorithms and software and hardware architectures to meet the capability requirements of real-time image acquisition systems, such as spaceborne and airborne synthetic aperture radar (SAR) systems. By analysing the principles and models of SAR imaging, this research creatively puts forward the fully parallel processing architecture for the back projection (BP) algorithm based on Field-Programmable Gate Array (FPGA). The processing time consumption has significant advantages compared with existing methods. This article describes the BP imaging algorithm, which stands out with its high processing accuracy and two-dimensional decoupling of distance and azimuth, and analyses the algorithmic flow, operation, and storage requirements. The algorithm is divided into five core operations: range pulse compression, upsampling, oblique distance calculation, data reading, and phase accumulation. The architecture and optimisation of the algorithm are presented, and the optimisation methods are described in detail from the perspective of algorithm flow, fixed-point operation, parallel processing, and distributed storage. Next, the maximum resource utilisation rate of the hardware platform in this study is found to be more than 80%, the system power consumption is 21.073 W, and the processing time efficiency is better than designs with other FPGA, DSP, GPU, and CPU. Finally, the correctness of the processing results is verified using actual data. The experimental results showed that 1.1 s were required to generate an image with a size of 900 × 900 pixels at a 200 MHz clock rate. This technology can solve the multi-mode, multi-resolution, and multi-geometry signal processing problems in an integrated manner, thus laying a foundation for the development of a new, high-performance, SAR system for real-time imaging processing.


Introduction
In the past decade, advances in digital and microwave technologies have led to the leap-forward development of synthetic aperture radar (SAR) with significant improvements in the device capability, system design, and processing algorithm. However, the SAR image generation and information processing link cannot match the SAR capability improvement to achieve the most efficient remote sensing information acquisition capability. The traditional remote sensing, SAR information acquisition process comprises the following: planning the platform flight routes and radar operation modes, collecting and recording radar echo data from the ground area using SAR radar, returning, copying, or uninstalling echo data, importing the data to ground high-performance computing devices, processing data with an algorithm, achieving the result, and forwarding the results to the users. It is observed that the traditional SAR remote sensing information acquisition process is time-consuming with many nodes and poor flexibility that cannot perform highefficiency remote sensing detection of time-sensitive targets. To solve these problems, the real-time data processing capabilities should be enhanced to synchronise platform flight, data processing, user reception, and realise the real-time changes of platform flight routes or working modes according to user feedback to obtain more targeted real-time remote sensing results. Therefore, in-depth research on SAR real-time processing technology has far-reaching significance and great application value. To improve the SAR's real-time processing capability and generated image quality, it is necessary to conduct in-depth research on imaging algorithms and hardware and software architectures [1][2][3][4].
In terms of algorithms, the rapid development of antenna technology has made its applications more and more versatile [5,6]; meanwhile, new mechanisms of SAR have been emerging. Consequently, the application platform from airborne and spaceborne was expanded to the foundation and on-board, the operational mode was upgraded from the stripe, bunching, and scanning to the tops, mosaic, and multimodal types, and the imaging geometry was extended to include side view, strabismus, large strabismus, and circular trail. Given the application requirements of multi-platform, multiple modes, and multiple geometries, imaging algorithms are required to have uniform and general characteristics. Currently, imaging algorithms can be divided into two categories: frequency-domain and time-domain algorithms. Common frequency-domain imaging algorithms include the range doppler (R-D), chirp scaling (CS), and wavenumber-domain (W-K) algorithms. These algorithms usually need to decouple the echo signal in range and azimuthal directions. There will be many approximations and assumptions in the process of decoupling, leading to errors in the imaging results. Moreover, the decoupling calculation at different flight paths, working modes, and geometric configurations are also very different, and the processing cannot be unified [7][8][9]. The time-domain imaging algorithm is the backprojection (BP) algorithm. Instead of pulse accumulation in the signal domain, the algorithm projects the echo data of each pulse to the image domain through backward projection in a successive manner and then accumulates energy coherently in the image domain. With the accumulation of energy, the image resolution gradually improves until the maximum resolution is obtained. Because there is no need for synthetic aperture signal accumulation, theoretically, imaging with the BP algorithm can yield outputs without delays. At the same time, because BP is based on the actual motion trajectory of the platform, it is not an approximation algorithm, and can be theoretically applied to any platform, track, geometric configuration, and imaging mode. This algorithm is the focus of this study [10][11][12][13][14][15][16][17].
In terms of hardware and software architectures, an increasing number of processing architectures and devices have been designed and applied to the field of real-time signal processing in recent years. The field-programmable gate array (FPGA) shows obvious advantages in many signal processing schemes by its full programmability, parallel processing, and low-power consumption [18], and the simultaneous development of high-level language synthesis technology in recent years has greatly improved the algorithmic aspect of FPGA programming [19]. The BP algorithm involves a considerable amount of computation calculations. Among these, the operation process involves a large amount of vector multiplication and read-write memory. However, the fully programmable and parallel multi-pipeline features of FPGA can adapt to the BP algorithm and can achieve an efficient structure for the algorithm [20][21][22]. Nowadays, research work targeting the BP algorithm and using FPGA to implement the algorithm has become noticeable. For example, three interpolation methods have been proposed in [23] to reduce the computational complexity of the interpolation part of the algorithm. In [24], the OpenCL framework has been used to implement fixed-point data processing in System on Chip (SoC) to reduce computing resources and improve clock performance. In [25], the BP designed by the logic part of SoC has been used to deal with Intelligent Property (IP). The IP can be flexibly deployed on a unit chip or multiple chips repeatedly. However, the above design still needs to be further improved in processing time efficiency. In summary, this study designed an efficient architecture for the BP algorithm based on the FPGA processor, completed the SAR real-time imaging processing on a single chip, and laid a technical foundation for subsequent, complex, high-performance airborne and spaceborne real-time processing systems.

Principle of the BP Algorithm
The airborne imaging model is shown in Figure 1. The linear airborne speed of the radar along the plane is υ 0 , and the northeast sky Cartesian coordinate system OXYZ is established. To better describe the target, the reference coordinate system O'X'Y'Z' is established, in which the target's coordinate is P(X t , Y t , Z t ). R(η) represents the oblique distance from the radar platform to the target point at each azimuth time; R min is the minimum oblique distance between the radar and the target point; θ(η) represents the viewing angle between the radar and the target at each azimuth moment; ψ(η) represents the pitch angle between the radar and the target at each azimuth time; and θ(s) is the azimuth angle between the radar and the target.
computing resources and improve clock performance. In [25], the BP designed by the logic part of SoC has been used to deal with Intelligent Property (IP). The IP can be flexibly deployed on a unit chip or multiple chips repeatedly. However, the above design still needs to be further improved in processing time efficiency.
In summary, this study designed an efficient architecture for the BP algorithm based on the FPGA processor, completed the SAR real-time imaging processing on a single chip, and laid a technical foundation for subsequent, complex, high-performance airborne and spaceborne real-time processing systems.

Principle of the BP Algorithm
The airborne imaging model is shown in Figure 1. The linear airborne speed of the radar along the plane is υ0, and the northeast sky Cartesian coordinate system OXYZ is established. To better describe the target, the reference coordinate system O'X'Y'Z' is established, in which the target's coordinate is P(Xt, Yt, Zt). R(η) represents the oblique distance from the radar platform to the target point at each azimuth time; Rmin is the minimum oblique distance between the radar and the target point; θ(η) represents the viewing angle between the radar and the target at each azimuth moment; ψ(η) represents the pitch angle between the radar and the target at each azimuth time; and θ(s) is the azimuth angle between the radar and the target. Figure 1. Airborne SAR signal model.
Based on the geometric relationship illustrated in Figure 1, the oblique distance from the radar platform to the target point R(η) at time t can be calculated as: where (x(η), y(η), z(η)) is the trajectory of the aircraft platform, and (Xt, Yt, Zt) represents the spatial position coordinates of the target P at time t. Suppose the radar transmits the linear frequency modulation signal, and the echo signal is obtained after modulation and delay. After demodulation, the echo signal s(τ, η) of the target point can be expressed as: where f0 is the carrier frequency of the transmitted pulse, τ represents the distance time, η represents the azimuth time, Kr is the distance modulation frequency, and η represents the Based on the geometric relationship illustrated in Figure 1, the oblique distance from the radar platform to the target point R(η) at time t can be calculated as: where (x(η), y(η), z(η)) is the trajectory of the aircraft platform, and (X t , Y t , Z t ) represents the spatial position coordinates of the target P at time t. Suppose the radar transmits the linear frequency modulation signal, and the echo signal is obtained after modulation and delay. After demodulation, the echo signal s(τ, η) of the target point can be expressed as: where f 0 is the carrier frequency of the transmitted pulse, τ represents the distance time, η represents the azimuth time, K r is the distance modulation frequency, and η represents the beam centre offset time. The echo data is transformed into the distance frequency domain, and expressed as S(f τ , η) as shown below: Next, s rc (τ, η) is obtained from range pulse compression as below: where H( f τ ) = exp(jπ f 2 τ k r ). As a result, the radar echo pulse compression signal s rc (τ, η) is obtained, and the imaging result of the BP algorithm in aircraft motion can be calculated by geometric mapping.
The actual working scenario of the simulated aircraft is as follows. The radar is radiated through the left side of the aircraft and transmits and receives echoes at the waypoints ABCDE ( Figure 2). There are two targets M and N in the irradiation area. The imaging grid in the (X, Y) rectangular coordinate system is established, and M' and N' are the projections of the target in the imaging grid, and the route points A to E also represent the position of the antenna phase centre in the imaging process. beam centre offset time. The echo data is transformed into the distance frequency domain, and expressed as S(fτ, η) as shown below: Next, src(τ, η) is obtained from range pulse compression as below: As a result, the radar echo pulse compression signal src(τ, η) is obtained, and the imaging result of the BP algorithm in aircraft motion can be calculated by geometric mapping.
The actual working scenario of the simulated aircraft is as follows. The radar is radiated through the left side of the aircraft and transmits and receives echoes at the waypoints ABCDE (Figure 2). There are two targets M and N in the irradiation area. The imaging grid in the (X, Y) rectangular coordinate system is established, and M' and N' are the projections of the target in the imaging grid, and the route points A to E also represent the position of the antenna phase centre in the imaging process.
where RM is the distance from the centre of the antenna phase to point M. The phases of each pulse are extracted and accumulated separately so that the pixels on the imaging grid where R M is the distance from the centre of the antenna phase to point M. The phases of each pulse are extracted and accumulated separately so that the pixels on the imaging grid match exactly the real position of the antenna's phase centre. Therefore, the calculation of the distance from the antenna phase centre to the target is a key part of the BP imaging algorithm. Theoretically, this method can be used to obtain high-precision SAR images without approximation, and the algorithm can be applied to high-precision imaging of complex geometric configurations, such as strabismus and circular trace. In practical engineering, the distance between the imaging grid and the phase centre of the antenna may not correspond to the data in the echo domain as integer multiples, and interpolation operation is required. The interpolation of multiple, interpolation functions and the calculation accuracy will directly affect the quality of the imaging. The simulation process is carried out with a point target as an example, and the results are depicted in Figure 3 as follows [26].
gineering, the distance between the imaging grid and the phase c not correspond to the data in the echo domain as integer multip eration is required. The interpolation of multiple, interpolation f tion accuracy will directly affect the quality of the imaging. The ried out with a point target as an example, and the results are follows [26]. The results of the analysis of the target azimuthal metrics at targets for different numbers of pulse accumulations are shown It can be observed from Table 1 that the greater the numbe in the BP algorithm the smaller the IRW. Meanwhile, the PSLR stant, which is the difference between the time domain algorithm algorithms in the processing results.  The results of the analysis of the target azimuthal metrics at the corresponding point targets for different numbers of pulse accumulations are shown in the table below.
It can be observed from Table 1 that the greater the number of pulse accumulations in the BP algorithm the smaller the IRW. Meanwhile, the PSLR and ISLR are almost constant, which is the difference between the time domain algorithms and frequency domain algorithms in the processing results.  Table 1 Results of the analysis of the target azimuth metrics of the point targets. Based on the above analysis, the flow chart of the BP algorithm is as Figure 4: Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 No Yes Figure 4. BP flow chart of the imaging algorithm.
The main steps are as follows: 1.
Step 1: Range compression is performed on the echo, and an imaging grid is constructed according to the resolution requirements (corresponding to the value of src and the coordinate system O'X'Y'Z'); 2.
Step 2: Upsampling is performed on the data that has completed the range direction compression according to the imaging grid (aim to improve the accuracy of (τ,η) in src); 3.
Step 3: Calculate the distance from the antenna's phase centre to pixel points in the imaging grid (corresponding to the calculation M () a R  in Equation (5) in Equation (5); 5.
Step 5: Multiply the echo data according to the phase information; 6.
Step 6: Coherent accumulation of echo value after projection (corresponding to the calculation of Equation (5); 7.
Step 7: Repeat steps 1 to 6 for the subsequent echo until all pixels of the imaging network and all echoes are processed.
The computation of the BP imaging algorithm mainly focuses on range compression in step 1 and distance calculation in step 3, while the memory read-write operation mainly focuses on data read-write in the echo domain in step 4 and coherent accumulation of echo value in step 6. In step 1, the operation of range pulse compression mainly includes multiplication, addition, and division of complex numbers, and distance calculation mainly involves multiplication and addition of complex vector numbers. Memory reads and writes are large-scale discontinuous address reads and writes. The system processing parameters are set as follows: azimuth direction of echo data is A, range direction is R, and the size of imaging grid is N × N. Therefore: The main steps are as follows: 1.
Step 1: Range compression is performed on the echo, and an imaging grid is constructed according to the resolution requirements (corresponding to the value of s rc and the coordinate system O'X'Y'Z'); 2.
Step 2: Upsampling is performed on the data that has completed the range direction compression according to the imaging grid (aim to improve the accuracy of (τ,η) in s rc ); 3.
Step 3: Calculate the distance from the antenna's phase centre to pixel points in the imaging grid (corresponding to the calculation R M (η a ) in Equation (5); 4.
Step 4: Read the corresponding echo data according to the distance and calculate the echo delay (corresponding to the calculation s rc (τ r , η a ; M) and exp j 4π f c c R M (η a ) in Equation (5); 5.
Step 5: Multiply the echo data according to the phase information; 6.
Step 6: Coherent accumulation of echo value after projection (corresponding to the calculation of Equation (5); 7.
Step 7: Repeat steps 1 to 6 for the subsequent echo until all pixels of the imaging network and all echoes are processed.
The computation of the BP imaging algorithm mainly focuses on range compression in step 1 and distance calculation in step 3, while the memory read-write operation mainly focuses on data read-write in the echo domain in step 4 and coherent accumulation of echo value in step 6. In step 1, the operation of range pulse compression mainly includes multiplication, addition, and division of complex numbers, and distance calculation mainly involves multiplication and addition of complex vector numbers. Memory reads and writes are large-scale discontinuous address reads and writes. The system processing parameters are set as follows: azimuth direction of echo data is A, range direction is R, and the size of imaging grid is N × N. Therefore: • Range compression requires A × R × 2log 2 R + A × R times of the complex multiplication and addition; • Distance calculation requires 5 × A × N × N times of complex multiplication and addition; • Read-write echo domain data requires A × N times of memory reads and A × N × N times of memory writes; • Coherent accumulation of echo value requires A × N × N times of memory data reads and A × N × N times of memory writes. In actual engineering, the azimuth parameter A of the echo data is larger than the parameter N of the imaging grid. It can be observed from the above that the computational complexity of the BP imaging algorithm is mainly concentrated in the distance calculation part, the computational complexity of the algorithm has reached O(N 3 ), and the number of memory reads and writes is also proportional to N 3 . The relationship between these three is shown in the Figure 5. In actual engineering, the azimuth parameter A of the echo data is larger than parameter N of the imaging grid. It can be observed from the above that the computati complexity of the BP imaging algorithm is mainly concentrated in the distance calcula part, the computational complexity of the algorithm has reached O( 3 ), and the num of memory reads and writes is also proportional to 3 . The relationship between t three is shown in the Figure 5. It can be observed from the above figure that when SAR images with the size of 1 × 1000 pixels are generated, approximately 5 G times of multiple accumulations, and times of memory reads and writes are required. When the image size becomes 1500 × 1 the times of multiple accumulations and memory reads and writes increase to 16.9 G ti and 6.75 G times, respectively. In the airborne platform, the pulse repetition rate (P can exceed 4 K; thus, the amount of computation and memory reads and writes per time is huge. The Xilinx XC7VX690T FPGA processing chip is selected in this study. B on this chip, real-time imaging with a pixel size of 1000 × 1000 can be realised. In summary, although the BP algorithm does not apply any approximation to signal processing imaging process, it can accurately eliminate the coupling between ra and azimuth, and the obtained image does not need to be processed for geometric cor tion. However, owing to the huge amount of computation, the application of the BP a rithm in real-time imaging of large-scale scenes is limited. Therefore, finding ways to timise the BP algorithm for efficient implementation by small-scale processing hardw has important research significance. It can be observed from the above figure that when SAR images with the size of 1000 × 1000 pixels are generated, approximately 5 G times of multiple accumulations, and 2 G times of memory reads and writes are required. When the image size becomes 1500 × 1500, the times of multiple accumulations and memory reads and writes increase to 16.9 G times and 6.75 G times, respectively. In the airborne platform, the pulse repetition rate (PRF) can exceed 4 K; thus, the amount of computation and memory reads and writes per unit time is huge. The Xilinx XC7VX690T FPGA processing chip is selected in this study. Based on this chip, real-time imaging with a pixel size of 1000 × 1000 can be realised.
In summary, although the BP algorithm does not apply any approximation to the signal processing imaging process, it can accurately eliminate the coupling between range and azimuth, and the obtained image does not need to be processed for geometric correction. However, owing to the huge amount of computation, the application of the BP algorithm in real-time imaging of large-scale scenes is limited. Therefore, finding ways to optimise the BP algorithm for efficient implementation by small-scale processing hardware has important research significance.

Algorithmic Implementation for Architectural Design and Optimisation
Advanced processing architecture design and optimisation are needed to efficiently implement the BP algorithm. The processing architecture should be fully integrated with the FPGA structure and the characteristics of the algorithmic process. Based on the above, this study conducts in-depth research on the implementation of the algorithm architecture from the perspective of the (a) optimisation process [27], (b) fixed-point operation optimisation [28], (c) parallel processing optimisation [29], and (d) distributed storage [30].

Algorithmic Flow Optimisation
The algorithmic flow optimisation mainly includes adjusting the processing steps to adapt the algorithm to the FPGA processing hardware and the pipelined design to improve the computational throughput. Figure 3 is the most original flow diagram of the BP algorithm based on PC. This flow first stores data in PC memory and then processes it serially. The direct deployment of this flow to FPGA will lead to extremely low-calculation efficiency mainly owing to the following three reasons and optimisation methods: 1.
There are iterations and branch jumps in the main process. Iteration and branch jumps will reduce the efficiency of FPGA calculations and cannot realise pipelining. The optimisation method involves the expansion of the iteration cycle and the data-flow processing on the top of the code architecture so that the main process used to achieve flow processing is highly suitable for accelerated implementation on FPGA hardware; 2.
Dynamic random-access-memory (DDR) access exists. Memory access will reduce the processing flow and memory interaction will increase the processing delay. The processing flow based on FPGA can directly send data to the processing module for calculation instead of accessing the memory frequently to transfer data for processing; 3.
Redundant computation exists. The original BP algorithm usually uses the frequency domain method to directly interpolate data by eight-fold in the upsampling part. It then selects the corresponding result after interpolation when reading echo domain data in step 4. In this step, there are usually one or two interpolation points. Thus, there are 3/4 or even 7/8 redundant calculations in the upsampling part. The optimisation method will then select the sinc (x) function (SINC) interpolation algorithm in the time domain to accurately calculate and generate the echo-domain data required in step 4, rather than interpolating the data first and then reading and storing the data.
In summary, the optimised result of the algorithm flow is shown in Figure 6.
Step 1 reads the data in DDR for pulse compression and coordinate system establishment; step 2 receives the processing result of step 1, as well as the distance calculation result of step 3 for parallel upsamplings; step 4 receives the processing result of step 3 and calculates the delay phase at the same time; step 5 multiplies the echo data and delay phase; and step 6 connects DDR and accumulates the processing result of step 5 with the data in DDR. The loop of these six steps continues to cycle until the data processing is finished and a clear SAR image is obtained. Compared with the original process, the optimised process has three advantages: first, the original processing flow interacts with DDR memory many times, while the optimised postprocessing flow is highly pipelined leaving only two channels to access the memory; second, the upsampling in step 2 is precisely controlled by the location of imaging grid from step 3, which improves the pertinence of computing power and removes redundant calculations; third, by removing step 7, the whole process has no branch jumps, resulting in more efficient processing. Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 Step 7 Echo data DDR chip DDR Memory Array Step 1 Step 2 Step 3 Step 4 Step 5 Step 6 (a) Original Processing flow (b) Optimised Processing flow Optimisation Figure 6. Algorithmic flow optimisation.

Fixed-Point Operation Optimisation
The processing accuracy is an important factor affecting the occupied resources and output results. During the operation, the accuracy of the SINC interpolation in the upsampling step 2 and the numerical range of the echo accumulation in step 6 should be fully considered, and the impact on the memory capacity and bandwidth should be considered

Fixed-Point Operation Optimisation
The processing accuracy is an important factor affecting the occupied resources and output results. During the operation, the accuracy of the SINC interpolation in the upsampling step 2 and the numerical range of the echo accumulation in step 6 should be fully considered, and the impact on the memory capacity and bandwidth should be considered when storing. The analysis of the impact of the two is as Figure 7.

Fixed-Point Operation Optimisation
The processing accuracy is an important factor affecting the occupied resources and output results. During the operation, the accuracy of the SINC interpolation in the upsampling step 2 and the numerical range of the echo accumulation in step 6 should be fully considered, and the impact on the memory capacity and bandwidth should be considered when storing. The analysis of the impact of the two is as Figure 7. It can be observed from the above figure that for SINC interpolation operation, the 8bit fixed-point decimal precision is approximately 10 −3 , the 10-bit fixed-point decimal precision is approximately 10 −4 , the 15-bit fixed-point decimal precision is approximately 10 −6 , and the 20-bit fixed-point decimal precision is approximately 10 −7 . Regarding the dynamics that can be represented by data, the maximum dynamics operations of the 8-bit fixedpoint is 256, 11-bit is 2048, 16-bit is 65,536, 20-bit is 1,048,576, and single-precision floating point is 6.8 × 10 38 . In the actual project, an accuracy of 10 −6 has met the calculation accuracy requirements, so the 15-bit fixed-point decimal is finally selected for calculation.
The relationship between the storage capacity and bit width calculated based on the optimised processing flow is as shown in the following Figure 8. It can be observed from the above figure that for SINC interpolation operation, the 8-bit fixed-point decimal precision is approximately 10 −3 , the 10-bit fixed-point decimal precision is approximately 10 −4 , the 15-bit fixed-point decimal precision is approximately 10 −6 , and the 20-bit fixed-point decimal precision is approximately 10 −7 . Regarding the dynamics that can be represented by data, the maximum dynamics operations of the 8-bit fixed-point is 256, 11-bit is 2048, 16-bit is 65,536, 20-bit is 1,048,576, and single-precision floating point is 6.8 × 10 38 . In the actual project, an accuracy of 10 −6 has met the calculation accuracy requirements, so the 15-bit fixed-point decimal is finally selected for calculation.
The relationship between the storage capacity and bit width calculated based on the optimised processing flow is as shown in the following Figure 8. The 10-bit memory requirement is 26.2144 Mb, the 16-bit memory requirement is 41.9430 Mb, and the 20-bit memory requirement is 52.4288 Mb. Different precisions of the same operation consume different processing resources. Using the range pulse compression in algorithm step 1 as an example, the processing resource consumptions are compared [31] in the following Table 2. Table 2. Impact of bit width on processing resources.  in algorithm step 1 as an example, the processing resource consumptions are compared [31] in the following Table 2. It can be observed from Table 2 that as the bit width increases, the accuracy and dynamics increase, but at the same time, the requirements for memory and processing resources increase. Therefore, it is necessary to achieve a compromise between the processing results and the occupied resources. The generated point target was simulated by MATLAB with various precision processing results and the analysis is summarised below.
The Figure 9 on the left shows the result of the 16-bit fixed-point processing, and the figure on the right shows the result of the single-precision floating-point processing. After numerical analysis, the focus metrics were observed to be the same for the PSLR, ISLR and IRW. However, the amplitude values were different. In practical applications, the image output is 8 bit-quantized, therefore, the amplitude variation has a negligible impact on the result. To obtain the optimal processing results, the processing accuracy needs to be combined with fixed-point and floating-point starting from the simulation results and combining the Xilinx XC7VX690T FPGA chip resources [32]. The data input is buffered into the DDR by a 16-bit fixed-point, and the range compression by floating-point processing. After processing, 16-bit, fixed-point quantisation is performed, and the results are stored in RAM. Distance is calculated by a double floating point. SINC interpolation by 16-bit fixed-point. The value coherent accumulation adopts 16-bit fixed-point, and the final result adopts 8-bit fixed-point output. The entire processing bit width is designed as shown in the Figure 10.

Echo data
Step 1 Single float Process ing Step 2 16bit fixed point Processing Step 3 Step 4  To obtain the optimal processing results, the processing accuracy needs to be combined with fixed-point and floating-point starting from the simulation results and combining the Xilinx XC7VX690T FPGA chip resources [32]. The data input is buffered into the DDR by a 16-bit fixed-point, and the range compression by floating-point processing. After processing, 16-bit, fixed-point quantisation is performed, and the results are stored in RAM. Distance is calculated by a double floating point. SINC interpolation by 16-bit fixed-point. The value coherent accumulation adopts 16-bit fixed-point, and the final result adopts 8-bit fixed-point output. The entire processing bit width is designed as shown in the Figure 10.
the DDR by a 16-bit fixed-point, and the range compression by floating-point processing. After processing, 16-bit, fixed-point quantisation is performed, and the results are stored in RAM. Distance is calculated by a double floating point. SINC interpolation by 16-bit fixed-point. The value coherent accumulation adopts 16-bit fixed-point, and the final result adopts 8-bit fixed-point output. The entire processing bit width is designed as shown in the Figure 10.

Echo data
Step 1

Single float Process ing
Step 2 16bit fixed point Processing Step 3 double float Processing Step 4 16bit fixed point Processing Step 6 16bit fixed point Processing Step  Figure 10. Bit-width design in the processing flow.

Parallel Processing
To improve the system processing capability and shorten the processing time, the BP algorithm needs to be parallelised. Parallel algorithmic processing is implemented with single-instruction, multiple-data (SIMD), and multiple-instruction, multiple-data (MIMD) processing models in three dimensions. The first dimension is distance calculation. Each distance projection of the BP algorithm is completely independent. Combined with the computing resources of each path, 16 paths are designed for parallel execution of this dimension. The second dimension is SINC interpolation [33]. The optimised flow chart shows that after the distance calculation, the echo needs to be upsampled. The upsampling uses an 8-point interpolation core. Thus, eight paths are designed for parallel execution. The third dimension is floating-point multiplication and addition [34]. More than eight blocking operation cycles are required for each high-precision data multiplication and addition, which makes the entire process unable to achieve efficient pipeline processing. Therefore, it is necessary to improve the throughput through parallel processing

Parallel Processing
To improve the system processing capability and shorten the processing time, the BP algorithm needs to be parallelised. Parallel algorithmic processing is implemented with single-instruction, multiple-data (SIMD), and multiple-instruction, multiple-data (MIMD) processing models in three dimensions. The first dimension is distance calculation. Each distance projection of the BP algorithm is completely independent. Combined with the computing resources of each path, 16 paths are designed for parallel execution of this dimension. The second dimension is SINC interpolation [33]. The optimised flow chart shows that after the distance calculation, the echo needs to be upsampled. The upsampling uses an 8-point interpolation core. Thus, eight paths are designed for parallel execution. The third dimension is floating-point multiplication and addition [34]. More than eight blocking operation cycles are required for each high-precision data multiplication and addition, which makes the entire process unable to achieve efficient pipeline processing. Therefore, it is necessary to improve the throughput through parallel processing by adopting the eight-path parallel execution. In summary, the three dimensions perform sixteen-path, eight-path, and eight-path parallel processing, respectively. There are 1024 processing units in each clock cycle. The first and the second dimensions are MIMD, and the third dimension is SIMD. The schematic diagram of parallel processing is as Figure 11. by adopting the eight-path parallel execution. In summary, the three dimensions perform sixteen-path, eight-path, and eight-path parallel processing, respectively. There are 1024 processing units in each clock cycle. The first and the second dimensions are MIMD, and the third dimension is SIMD. The schematic diagram of parallel processing is as Figure  11.

(I,D)
Data Input Command Data Output Figure 11. Schematic diagram explaining parallel data processing.
Each box in the figure above represents a processing unit, which has an independent clock and an input/output interface, and can flexibly adjust the operating rate, input data, and parameters. The three-dimensional processing model was applied to the BP algorithmic flow to form the following processing and control flow as Figure 12. Step 1 Step 2 Step 3 Step 4 2st echo pulse compression results Step 4 Figure 11. Schematic diagram explaining parallel data processing.
Each box in the figure above represents a processing unit, which has an independent clock and an input/output interface, and can flexibly adjust the operating rate, input data, and parameters. The three-dimensional processing model was applied to the BP algorithmic flow to form the following processing and control flow as Figure 12. Each box in the figure above represents a processing unit, which has an independent clock and an input/output interface, and can flexibly adjust the operating rate, input data, and parameters. The three-dimensional processing model was applied to the BP algorithmic flow to form the following processing and control flow as Figure 12. Step 2 Step 6 Step 5 Step 3 Step 4 2st echo pulse compression results

Nst echo pulse compression results
Step 4 Figure 12. Parallel implementation process.
The flow of calculation can be summarised as follows.
Step 3 controls step 4 to pass the processing result of step 1 to step 2; step 5 receives and processes the output of both step 2 and step 4, and then passes it to step 6. The data processing method is described in detail as follows: Step 1 receives echo data while performing range pulse compression on the echo data, and makes 16 copies of the compressed data to prepare for subsequent parallel calculation; step 4 receives the inertial navigation information and system parameters in 16 parallel threads to calculate the slant range by passing the calculated results to steps 3 and The flow of calculation can be summarised as follows.
Step 3 controls step 4 to pass the processing result of step 1 to step 2; step 5 receives and processes the output of both step 2 and step 4, and then passes it to step 6. The data processing method is described in detail as follows: Step 1 receives echo data while performing range pulse compression on the echo data, and makes 16 copies of the compressed data to prepare for subsequent parallel calculation; step 4 receives the inertial navigation information and system parameters in 16 parallel threads to calculate the slant range by passing the calculated results to steps 3 and 5; step 3 receives the distance information, indexing each range pulse pressure result and exporting the outputs to step 2; step 2 receives the data and starts processing in eight parallel threads; step 5 receives the data from steps 2 and 4, and processes with high precision multiplication, writing the result to the Block RAM (BRAM); step 6 reads the BRAM data and initiates the multiple accumulation operation of eight parallel threads. The three-dimensional processing model corresponds to steps 4, 2, and 6, respectively. After the parallel optimisation, the data throughput rate can reach 200 Gsps, which meets the requirements of the BP algorithm for real-time imaging data calculation of the throughput rate.

Distributed Storage
The BP algorithm not only requires huge computing ability but also imposes strict requirements on the throughput rate of data access. There are a lot of data fetches in steps 4 and 6. To meet this access bandwidth requirement, a distributed storage design based on FPGA is required. The FPGA processing platform has three flexibly configurable storage resources, namely register (reg), static RAM (SRAM), and dynamic RAM (DDR). The structures and characteristics of the three are different. The specific indicators are as follows Table 3. Combined with parallelised design, the advantages of three types of memory are fully utilised to realise high-throughput computing. The first parallel dimension 16-way calculation is implemented by SRAM, the single-way port width is 32-bit, the interface speed is 200 MHz, and the bus speed is 100 Gbps; the second parallel SINC interpolation eight-way calculation is implemented by register storage, and the single-way port width is 64. The bit-interface speed is 200 MHz, and the bus speed is 1.6 Tbps; the third parallel, high-precision, multiply-add, eight-way calculation is implemented by register storage. The single-channel port width is a 64-bit interface operating at the frequency of 200 MHz, and the bus speed is 12.8 Tbps. The dynamic RAM buffer is added in the echo input and image result output, thus realising the isolation of the processing clock and data flow rate and improving the running speed of the entire calculation process. Based on the above design, the throughput of ultra-high data rate in the operation process is realised as Figure 13.
static RAM (SRAM) <100 M hundreds 1 to 16 bit less than 1 Tbps dynamic RAM (DDR) >1 G 1-4 8 to 64 bit less than 100 Gbps Combined with parallelised design, the advantages of three types of memory are fully utilised to realise high-throughput computing. The first parallel dimension 16-way calculation is implemented by SRAM, the single-way port width is 32-bit, the interface speed is 200 MHz, and the bus speed is 100 Gbps; the second parallel SINC interpolation eight-way calculation is implemented by register storage, and the single-way port width is 64. The bit-interface speed is 200 MHz, and the bus speed is 1.6 Tbps; the third parallel, high-precision, multiply-add, eight-way calculation is implemented by register storage. The single-channel port width is a 64-bit interface operating at the frequency of 200 MHz, and the bus speed is 12.8 Tbps. The dynamic RAM buffer is added in the echo input and image result output, thus realising the isolation of the processing clock and data flow rate and improving the running speed of the entire calculation process. Based on the above design, the throughput of ultra-high data rate in the operation process is realised as Figure  13. Step 4 Step 2 Step 6  Figure 13. Distributed storage schematic. Figure 13. Distributed storage schematic.
Step 1 uses a storage matrix composed of DDR particles, which shares a set of data and address buses, so the access speed is slow, but the capacity is large; step 4 uses a storage matrix composed of SRAM, wherein each of which has an independent address and data bus, coupled with the device characteristics of SRAM; thus, the access rate of the matrix is faster; steps 2 and 6 use registers to form a storage matrix, wherein each element of the matrix has only data bus and no address bus. Thus, it can be accessed in full parallel mode fast. To meet the real-time processing of the full-process BP algorithm, a distributed storage technology combining multiple storage devices is designed [35].

Integrated Design of Software and Hardware of Imaging Processing System Based on FPGA
This study proposes a block diagram of FPGA-based, SAR, real-time BP software and hardware integration scheme, as shown in the Figure 14. Data are input to FPGA through an optical fiber interface, and FPGA completes single-chip SAR imaging through four core processing units: data interface, DDR data storage, processing control, and parallel data processing. Step 1 uses a storage matrix composed of DDR particles, which shares a set of data and address buses, so the access speed is slow, but the capacity is large; step 4 uses a storage matrix composed of SRAM, wherein each of which has an independent address and data bus, coupled with the device characteristics of SRAM; thus, the access rate of the matrix is faster; steps 2 and 6 use registers to form a storage matrix, wherein each element of the matrix has only data bus and no address bus. Thus, it can be accessed in full parallel mode fast. To meet the real-time processing of the full-process BP algorithm, a distributed storage technology combining multiple storage devices is designed [35].

Integrated Design of Software and Hardware of Imaging Processing System Based on FPGA
This study proposes a block diagram of FPGA-based, SAR, real-time BP software and hardware integration scheme, as shown in the Figure 14. Data are input to FPGA through an optical fiber interface, and FPGA completes single-chip SAR imaging through four core processing units: data interface, DDR data storage, processing control, and parallel data processing.  The data interface unit includes the interface driver module, signal pre-processing module, image splicing module, and output data format module, which mainly completes input and output interface and format arrangements. The data storage unit consists of the MIG controller, DDR initialisation, DDR access logic, image data cache, echo data cache, The data interface unit includes the interface driver module, signal pre-processing module, image splicing module, and output data format module, which mainly completes input and output interface and format arrangements. The data storage unit consists of the MIG controller, DDR initialisation, DDR access logic, image data cache, echo data cache, and data stream distribution modules, which mainly complete the large-capacity and low-speed data buffer storage related to DDR. The processing control unit includes the processing parameter setting and processing flow control module, which mainly completes the toplevel resource scheduling and control of the entire processing flow and the initialisation parameter configuration of each module. The parallel processing module is the core of the system. Data through data stream distribution module separate into echo data and IMU data, IMU data used for radar and motion parameter calculation. The calculation result is sent to parallel distance and delay phase computing modules, meanwhile, the echo data is copied 16 times for subsequent processing. In parallel, each copied data (among the 16 copies) of range compression to be processed is combined with the above distance and delay phase calculated results, and then the output is sent to the eight-thread parallel SINC interpolation upsampling module for calculation. The calculated results are sent to the imaging grid. Every data entering into the imaging grid is subjected to parallel, eight-thread multiplication accumulation calculation (MAC). After all the sampling data is processed, the data in the imaging grid is the image data, and the image data is output according to eight-bit quantisation. In this system, there are computing and storage modules related to the algorithmic software, system status monitoring, and reset and clock management modules related to hardware.

System Verification and Analysis
The time-domain BP algorithm for SAR imaging is implemented in a Xilinx XC7VX690T FPGA chip. The development board contains one unit of XC7VX690T FPGA and an external 16 GB DDR3, whose clock speed is 1333 MHz. The bus is 64-bit wide. The FPGA has a clock speed of 200 MHz. In all, 16 threads perform phase calculations. If the input data is 16 bits, the maximum data input rate is therefore 51.2 Gbps. The resource usage is analysed in the Vivado development environment of Xilinx Inc. and the screenshot of the Vivado analysis result is shown in Figure 15 below. according to eight-bit quantisation. In this system, there are computing and storage modules related to the algorithmic software, system status monitoring, and reset and clock management modules related to hardware.

System Verification and Analysis
The time-domain BP algorithm for SAR imaging is implemented in a Xilinx XC7VX690T FPGA chip. The development board contains one unit of XC7VX690T FPGA and an external 16 GB DDR3, whose clock speed is 1333 MHz. The bus is 64-bit wide. The FPGA has a clock speed of 200 MHz. In all, 16 threads perform phase calculations. If the input data is 16 bits, the maximum data input rate is therefore 51.2 Gbps. The resource usage is analysed in the Vivado development environment of Xilinx Inc. and the screenshot of the Vivado analysis result is shown in Figure 15 below. It can be observed from Figure 15 that when the BP algorithm is implemented with the architecture of this study, the storage resources occupy more than 80%, the logic resources occupy nearly 60%, and the layout and routing results of each part are obtained after comprehensive compilation as Figure 16. It can be observed from Figure 15 that when the BP algorithm is implemented with the architecture of this study, the storage resources occupy more than 80%, the logic resources occupy nearly 60%, and the layout and routing results of each part are obtained after comprehensive compilation as Figure 16. It can be observed from Figure 15 that when the BP algorithm is implemented with the architecture of this study, the storage resources occupy more than 80%, the logic resources occupy nearly 60%, and the layout and routing results of each part are obtained after comprehensive compilation as Figure 16. In conclusion, a single XC7VX690T FPGA chip resource can be fully utilised to complete a 900 × 900 imaging grid. Vivado energy consumption analysis shows that the project has a power consumption of 21.073 W, and the power consumption of each resource was analysed by Vivado as indicated in Figure 17 below. In conclusion, a single XC7VX690T FPGA chip resource can be fully utilised to complete a 900 × 900 imaging grid. Vivado energy consumption analysis shows that the project has a power consumption of 21.073 W, and the power consumption of each resource was analysed by Vivado as indicated in Figure 17 below. Based on a review of current literature, it was found that the realisation platform of the BP imaging of SAR radar mainly includes the CPU, DSP, GPU, and FPGA. However, at present, few articles demonstrate the entire process of the BP algorithm with high-temporal effectiveness using FPGA. Based on the analysis and comparison of existing articles, the time consumption of this study is summarised in the following Table 4.  Based on a review of current literature, it was found that the realisation platform of the BP imaging of SAR radar mainly includes the CPU, DSP, GPU, and FPGA. However, at present, few articles demonstrate the entire process of the BP algorithm with high-temporal effectiveness using FPGA. Based on the analysis and comparison of existing articles, the time consumption of this study is summarised in the following Table 4. It can be observed from the above table that the use of this architecture to implement the BP algorithm based on FPGA has obvious advantages in terms of time consumption.
In order to verify the imaging performance of the BP algorithm implemented by the FPGA hardware, a test of the target point data generated by the simulation is carried out in squint mode, and the test results are as Figure 18 and Table 5.  From hardware test results on the target point, we see that the BP hardware implementation presented in this paper meets the design expectations. This real-time imaging technology can be applied to airborne and spaceborne SAR platforms. This study uses airborne SAR radar as an example to process the actual flight data. The image range is 900 pixels and the azimuth is 900 pixels. At the clock frequency of 200 MHz, the processing time is approximately 1.1 s, and the processing result is as shown in the Figure 19 and Table 6.  From hardware test results on the target point, we see that the BP hardware implementation presented in this paper meets the design expectations. This real-time imaging technology can be applied to airborne and spaceborne SAR platforms. This study uses airborne SAR radar as an example to process the actual flight data. The image range is 900 pixels and the azimuth is 900 pixels. At the clock frequency of 200 MHz, the processing time is approximately 1.1 s, and the processing result is as shown in the Figure 19 and Table 6. mentation presented in this paper meets the design expectations. This real-time imaging technology can be applied to airborne and spaceborne SAR platforms. This study uses airborne SAR radar as an example to process the actual flight data. The image range is 900 pixels and the azimuth is 900 pixels. At the clock frequency of 200 MHz, the processing time is approximately 1.1 s, and the processing result is as shown in the Figure 19 and Table 6.    As observed from the figure above, this method is used to complete imaging processing in a single FPGA, the output image features are distinct, and the focusing effect is obvious.

Conclusions
This study proposed a real-time BP algorithm implementation method for SAR based on FPGA and deployed it on a single XC7VX690T FPGA chip from Xilinx Inc. Actual flight data were used for validation of the proposed method. The article first introduces the principle of backward projection imaging, analyses the algorithmic flow, operation, and storage requirements; the algorithmic implementation architecture design and optimisation are then conducted, and the optimisation methods of algorithmic flow, fixed-point operation, parallel processing, and distributed storage based on the characteristics of FPGA processor were described in detail. Subsequently, resource utilisation, system power consumption and processing time were analysed based on the hardware platform, and the performance comparison with the existing design reflects the advanced design of the proposed method. Finally, the correctness of the processing architecture is verified by actual flight data. The technology can be used in airborne, spaceborne, and other new platform SAR real-time imaging systems, which can solve the problem of multi-mode, multi-resolution, and multigeometry signal processing in an integrated way. This study lays the foundation for the subsequent development of high-performance SAR real-time image processing.

Conflicts of Interest:
The authors declare no conflict of interest.