A Spaceborne Synthetic Aperture Radar Partial Fixed-Point Imaging System Using a Field-Programmable Gate Array/Application-Specific Integrated Circuit Hybrid Heterogeneous Parallel Acceleration Technique

With the development of satellite payload technology and very large scale integrated (VLSI) circuit technology, onboard real-time synthetic aperture radar (SAR) imaging systems have become a solution for allowing rapid response to disasters. A key goal of onboard SAR imaging system design is to achieve high real-time processing performance under severe size, weight, and power consumption constraints. In this paper, we analyse the computational burden of the commonly used chirp scaling (CS) SAR imaging algorithm. To reduce the system hardware cost, we propose a partial fixed-point processing scheme: the fast Fourier transform (FFT), which is the most computation-intensive operation in the CS algorithm, is processed with fixed-point arithmetic, while other operations are processed with single precision floating-point. The fixed-point processing word length is determined with the proposed fixed-point processing error propagation model. The fidelity and accuracy relative to conventional ground-based software processors are verified by evaluating both the point target imaging quality and the actual scene imaging quality. As a proof of concept, a field-programmable gate array/application-specific integrated circuit (FPGA-ASIC) hybrid heterogeneous parallel accelerating architecture is designed and realized. The customized fixed-point FFT is implemented using 130 nm complementary metal oxide semiconductor (CMOS) technology as a co-processor of the Xilinx xc6vlx760t FPGA. A single processing board requires 12 s and consumes 21 W to focus 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384.


Introduction
A spaceborne synthetic aperture radar (SAR) is a remote sensor that works in the microwave band. Its abilities to penetrate cloud cover and to collect data in the dark over large areas at high resolution make it unique compared to other imaging instruments [1]. As one of the important means of space-to-Earth observation, SAR plays an important role in disaster emergency response, environmental monitoring, resource exploration and geographic information access [2][3][4]. Recent publications have reviewed the applications of satellite remote sensing techniques for hazards. However, conventional processors such as CPUs and DSPs cannot provide enough FLOPS per watt, which leads to a bottleneck in their potential applications. Interconnection between multiple CPUs/DSPs also adds complex overheads to system design. Benefiting from its customized design, the ASIC can provide sufficient processing power; however, large-scale, complicated logic design requires a longer development period. The FPGA has made great progress in terms of on-chip storage resources, arithmetic logic resources and hardware-software co-design, and it can meet large-throughput, strict real-time signal processing requirements.
In this paper we propose a hybrid heterogeneous parallel accelerating technique combining the advantages of both the FPGA and ASIC to realize an on-board real-time SAR processing system. As with many digital signal processing systems, such as OFDM and MIMO, we expect to adopt fixed-point processing to reduce the system scale and improve processing speed. Therefore, we conduct a computational burden analysis. A partial fixed-point processing scheme is proposed. The FFT, which is the main contributor to the processing burden, is implemented with fixed-point using the 130 nm CMOS technology. Complicated phase function generation calculations are implemented using the Xilinx FPGA. Based on the proposed fixed-point quantization error propagation model, the imaging quality is guaranteed. The system implementation results indicate that our design can potentially be used for spaceborne on-board real-time processing.
The rest of the paper is organized as follows: Section 2 reviews the CS algorithm and analyses its computational burden; a fixed-point quantization error propagation model is also proposed. In Section 3, we evaluate the imaging quality of the partial fixed-point system for both a point target and an actual scene. In Section 4, the design methodology of the FPGA-ASIC hybrid heterogeneous parallel accelerating architecture is described. In Section 5, the corresponding hardware realization details and results are discussed, and a comparison with related works demonstrates the validity of our system. Section 6 concludes the paper.

CS Algorithm Review
The CS algorithm is one of the most fundamental and popular algorithms for SAR data processing. As the kernel algorithm, the CS algorithm, with some pre- or post-processing steps, can process various modes, such as the stripmap, ScanSAR, spotlight, sliding spotlight, TOPS and mosaic modes [22,23]. Compared with other algorithms, the CS algorithm retains reasonable efficiency [24]. Interpolation is replaced by phase multiplication in the range cell migration correction (RCMC) processing, which eliminates approximations and reduces the computational burden. Moreover, because the data are processed in the two-dimensional frequency domain, this algorithm also solves the problem of the dependence of the secondary range compression (SRC) on the azimuth frequency.
The flowchart of the CS algorithm is illustrated in Figure 1. First, the SAR raw data are transferred to the range-Doppler domain via an FFT in the azimuth direction. Second, the data are multiplied by the 1st phase function to achieve the chirp scaling, which makes all the range migration curves the same; this function depends on the range time τ, the azimuth frequency fη, the reference distance rref, the range modulating frequency k(fη; rref) and the curvature factor c(fη). Third, the data are transferred to the two-dimensional frequency domain via an FFT in the range direction and multiplied by the 2nd phase function, which is a function of the range frequency fτ, to complete the range compression, SRC and remaining RCMC. Next, the data are transferred back to the range-Doppler domain via an inverse FFT in the range direction and multiplied by the 3rd phase function to complete the azimuth compression and phase correction. Finally, the inverse FFT in the azimuth direction completes the CS algorithm. The algorithm thus consists only of FFT operations and complex multiplications, and it is an effective and precise algorithm for stripmap SAR imaging.
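For reference, one common formulation of the three phase functions, following the standard chirp scaling derivation in the literature and using this paper's notation, is given below. Here v (effective radar velocity) and λ (wavelength) are symbols not defined in the text above, and c without an argument denotes the speed of light; exact signs and groupings vary between derivations, so these expressions should be read as a representative sketch rather than the paper's exact equations.

```latex
% Curvature factor
c(f_\eta) = \frac{1}{\sqrt{1 - \left(\lambda f_\eta / 2v\right)^2}} - 1

% 1st phase function: chirp scaling
\Phi_1(\tau, f_\eta) = \exp\left\{-j\pi\, k(f_\eta; r_{ref})\, c(f_\eta)
  \left[\tau - \frac{2 r_{ref}\left(1 + c(f_\eta)\right)}{c}\right]^2\right\}

% 2nd phase function: range compression, SRC and remaining RCMC
\Phi_2(f_\tau, f_\eta) =
  \exp\left\{-\frac{j\pi f_\tau^2}{k(f_\eta; r_{ref})\left(1 + c(f_\eta)\right)}\right\}
  \exp\left\{j\,\frac{4\pi}{c}\, r_{ref}\, c(f_\eta)\, f_\tau\right\}

% 3rd phase function: azimuth compression and residual phase correction
\Phi_3(\tau, f_\eta) =
  \exp\left\{j\,\frac{4\pi r}{\lambda}\sqrt{1 - \left(\lambda f_\eta / 2v\right)^2}\right\}
  \exp\left\{j\,\frac{4\pi}{c^2}\, k(f_\eta; r_{ref})\left(1 + c(f_\eta)\right) c(f_\eta)
  \left(r - r_{ref}\right)^2\right\}
```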
Figure 1. Flowchart of the CS algorithm.

Computational Burden Analysis
According to the flowchart shown in Figure 1, the CS algorithm is divided into three steps (Step 1-Step 3). The phase function generation mainly consists of scalar operations; although these operations are complex, their computational burden is not large. From the perspective of the vector operations involved in the CS algorithm, the processing flow mainly consists of vector FFT/IFFT operations and vector complex multiplications. These operations can be decomposed into combinations of real additions and real multiplications. According to the classical Cooley-Tukey FFT algorithm [25], an N-point FFT/IFFT is decomposed into 3Nlog2N real additions and 2Nlog2N real multiplications. A complex multiplication is decomposed into two real additions and four real multiplications according to (a + jb)(c + jd) = (ac - bd) + j(ad + bc). Assuming that the sample numbers of the range direction and azimuth direction are NR and NA, respectively, and taking Step 1 as an example, one NA-point FFT as well as one complex multiplication is performed NR times. The total real addition computational burdens of the FFT and complex multiplication are 3NRNAlog2NA and 2NRNA, respectively. The total real multiplication computational burdens of the FFT and complex multiplication are 2NRNAlog2NA and 4NRNA, respectively. The computational burdens of the FFT and complex multiplication in different steps are shown in Tables 1 and 2, respectively.

Table 1. Computational burden of FFT/IFFTs in different steps.

Step No. | Real Additions | Real Multiplications | Total
Step 1 | 3NRNAlog2NA | 2NRNAlog2NA | 5NRNAlog2NA
Step 2 | 6NRNAlog2NR | 4NRNAlog2NR | 10NRNAlog2NR
Step 3 | 3NRNAlog2NA | 2NRNAlog2NA | 5NRNAlog2NA

Table 2. Computational burden of complex multiplications in different steps.

Step No. | Real Additions | Real Multiplications | Total
Step 1 | 2NRNA | 4NRNA | 6NRNA
Step 2 | 2NRNA | 4NRNA | 6NRNA
Step 3 | 2NRNA | 4NRNA | 6NRNA

The percentage of FFT operations relative to the total computational burden is expressed as:

P = 10NRNA(log2NA + log2NR) / [10NRNA(log2NA + log2NR) + 18NRNA]    (6)

For the convenience of hardware processing, the zero-padding method is usually adopted to expand the SAR raw data to an integer power of 2. Considering SAR raw data with a granularity of 16,384 × 16,384, the percentage P is over 90% according to (6). Therefore, the analysis shows that the FFT accounts for the largest computational burden. We also test the time required for a 16,384-point FFT and the time required to generate a 16,384-point phase function, on both a standard CPU platform and the target FPGA platform. It takes the CPU (2× Intel Xeon E5-2697 (24 cores each) @ 2.7 GHz) 4.9 ms and 0.2 ms to perform a 16,384-point FFT and to generate a phase function, respectively. It takes the FPGA 16,500 clock cycles and about 400 clock cycles, respectively, for the same tasks. These experimental results also show that the FFT accounts for the largest computational burden. Fixed-point processing is faster and less resource-intensive than floating-point processing, so adopting a fixed-point FFT is very important with respect to system hardware cost reduction and real-time performance improvement. On the other hand, phase function generation is computationally complex, and sufficient accuracy must be ensured for the SAR image to focus. We therefore adopt the IEEE-754 standard single precision floating-point data format to generate the phase functions.
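The operation counts above can be tallied in a few lines. The sketch below (illustrative, not the paper's code) reproduces the conclusion that the FFT dominates the burden for a 16,384 × 16,384 scene:

```python
# Illustrative sketch: tally the real-operation counts of Tables 1 and 2 and
# evaluate the FFT share P of the total burden, following the decomposition
# described in the text (Cooley-Tukey FFT, complex multiply = 2 adds + 4 muls).
import math

def cs_op_counts(NR, NA):
    la, lr = math.log2(NA), math.log2(NR)
    # FFT/IFFTs: Step 1 (azimuth FFT), Step 2 (range FFT + range IFFT),
    # Step 3 (azimuth IFFT); an N-point FFT costs 3*N*log2(N) real additions
    # and 2*N*log2(N) real multiplications.
    fft_adds = 3 * NR * NA * la + 6 * NR * NA * lr + 3 * NR * NA * la
    fft_muls = 2 * NR * NA * la + 4 * NR * NA * lr + 2 * NR * NA * la
    # One vector complex multiplication per step: 2 real additions and
    # 4 real multiplications per sample.
    mul_adds = 3 * (2 * NR * NA)
    mul_muls = 3 * (4 * NR * NA)
    return fft_adds + fft_muls, mul_adds + mul_muls

fft_ops, mul_ops = cs_op_counts(16384, 16384)
P = fft_ops / (fft_ops + mul_ops)
print(f"FFT share of total burden: {P:.1%}")  # -> FFT share of total burden: 94.0%
```

For NA = NR = 16,384 the FFT share evaluates to roughly 94%, consistent with the "over 90%" figure quoted above.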

Fixed-Point Processing Error Propagation Model
Due to the finite word length effect, fixed-point FFT processing introduces an additional quantization error. For the SAR imaging processing system, a proper processing word length should be determined to meet the imaging accuracy requirement. Blind simulation validation is inefficient and time-consuming. Many works [26][27][28][29] focusing on the word length optimization of DSP systems establish analytical models to analyse the mean value or variance of the quantization noise. In our previous work [30], we proposed an error propagation model to address the finite word length effect in the range-Doppler (RD) algorithm. Here, we propose an error propagation model to evaluate the quantization error power of the fixed-point CS algorithm. This assists us in determining the word length of the FFT to a great extent.
The CS algorithm processing flow can be summarized as a cascade of multilevel FFT/IFFTs. Figure 2 shows the error propagation model. Taking the quantization propagation process of the first azimuth FFT as an example, define eFFT1 as the quantization error vector generated by the first azimuth direction FFT. βR and βA are the attenuation factors of the range direction FFT and the azimuth direction FFT, respectively. The quantization error is propagated through the subsequent range and azimuth FFT/IFFTs. The propagation process of the other FFT/IFFTs is similar.
The phase function generations and the complex multiplications are processed with single precision floating-point. The accuracy of these operations ensures that no additional quantization errors are introduced. In addition, the complex multiplications do not attenuate the noise power, which we prove as follows. Define eMUL1 as the quantization error vector after the first complex multiplication and f1 as phase function 1. Then, eMUL1 is the element-wise product of eFFT1 and f1. Since the modulus of every element of the phase function vector is 1, the power of eMUL1 satisfies |eMUL1|^2 = |eFFT1|^2 · |f1|^2 = |eFFT1|^2. Accordingly, the error power remains unchanged after a complex multiplication. Thus, we only take the FFT attenuation factors βR and βA into consideration.
According to [31], βR and βA can be assigned as 1/NR and 1/NA, where NR and NA are the sample numbers of the range direction and azimuth direction, respectively. Assuming that the quantization errors generated by different FFTs are mutually independent, the errors are finally accumulated at the output: the error injected by each FFT/IFFT stage is attenuated by the β factor of every subsequent stage, and the attenuated contributions are summed. In our previous work [32], we derived the quantization error assessment of the fixed-point FFT, which gives the quantization error power of an N-point, b-bit fixed-point FFT. Substituting this per-stage error power into the cascade yields the total quantization error power (Equation (11)).
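As a quick plausibility check, the cascade model can be sketched numerically. The per-FFT injected noise power below is an assumed placeholder roundoff model, not the exact expression from [32]; only the cascade structure (per-stage injection, 1/N attenuation through each subsequent FFT) follows the text:

```python
# Illustrative sketch of the cascaded quantization-error model (Figure 2).
def fft_noise_power(N, b):
    # ASSUMED placeholder model for a b-bit, N-point fixed-point FFT's
    # injected noise power; the paper takes the exact formula from ref. [32].
    return (2.0 ** (-2 * b) / 3.0) * 2 * N

def total_error_power(NR, NA, b):
    # FFT cascade in CS processing order:
    # azimuth FFT, range FFT, range IFFT, azimuth IFFT.
    stages = [NA, NR, NR, NA]
    total = 0.0
    for i, N in enumerate(stages):
        injected = fft_noise_power(N, b)
        # Each later FFT attenuates already-present noise power by 1/N (beta).
        for later_N in stages[i + 1:]:
            injected /= later_N
        total += injected
    return total

for b in (8, 12, 16):
    print(b, total_error_power(16384, 16384, b))
```

As expected, the total error power decreases monotonically as the word length b grows; the floor observed in Figure 3 beyond 12 bits comes from the floating-point parts of the chain, which this fixed-point-only sketch does not model.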
Figure 3 shows the quantization error power under different processing word lengths according to (11). As the word length increases from 8 to 11 bits, the quantization error power gradually decreases; as the word length increases from 12 to 16 bits, the quantization error power remains essentially unchanged. Therefore, we can narrow the scope of the word lengths to be verified. In the following subsections, we evaluate the imaging quality of both a point target and an actual scene for 12- to 16-bit word lengths.

Point Target Imaging Quality Evaluation
Figure 4 shows the point target imaging results for different FFT processing accuracies. Figure 4a-e are the imaging results of the partial fixed-point processing based on SystemC, which is a bit-accurate hardware design verification language [33]. Figure 4f is the imaging result of typical single precision floating-point processing based on MATLAB (or another conventional ground-based software processor). We choose the floating-point imaging result as the accurate reference for comparison.
For the point target, the peak sidelobe ratio (PSLR), integrated sidelobe ratio (ISLR) and spatial resolution (RES) are commonly adopted to evaluate the imaging quality [34]. Table 3 shows the imaging quality assessment comparison results. Theoretically, the squint angle of the side-looking SAR echo is zero; however, when generating the point target simulation data, there is a slight error in the squint angle (less than 1 degree). Thus, we set the squint angle to 1 degree when evaluating the imaging quality. Similar to the quantization error power curve, the imaging quality with a 15- or 16-bit word length is very close to that of single precision floating-point. A 15- or 16-bit partial fixed-point system thus provides sufficient accuracy for point target imaging.

Table 3. Point target imaging quality assessment for different FFT processing accuracies.

Figure 5 shows the actual scene imaging results for different FFT processing accuracies. The raw data were obtained from the Chinese Gaofen-3 satellite's 50-km-swath, 5-m-resolution stripmap mode. Usually, region of interest (ROI) extraction, target recognition and other SAR image applications are performed after the SAR imaging process. Thus, we also selected some local regions in the actual scene for evaluation. Figure 6 shows the imaging results.

Actual Scene Imaging Quality Evaluation
For actual scenes, the mean squared error (MSE) is commonly adopted to evaluate the imaging quality. It is computed by averaging the squared intensity difference between the fixed-point image pixels xi and the floating-point image pixels yi. The peak signal-to-noise ratio (PSNR) is derived from the MSE [35]. The MSE and PSNR between two images are calculated as follows:

MSE = (1/N) Σ (xi - yi)^2,  PSNR = 10 log10(L^2 / MSE)

where N is the total number of pixels in the image and L is the maximum dynamic range; for 8-bit greyscale images, L = 255. The MSE is easy to calculate and understand but cannot match the perceived visual quality very well. Therefore, based on the known characteristics of the human visual system (HVS), we also adopt the structural similarity (SSIM) index to evaluate the imaging quality [36]. The expression for SSIM is as follows:

SSIM = [(2 µx µy + C1)(2 σxy + C2)] / [(µx^2 + µy^2 + C1)(σx^2 + σy^2 + C2)]

where σx^2 and σy^2 represent the variances of the fixed-point image and floating-point image, respectively; σxy is their covariance; µx and µy represent the mean values of the fixed-point image and floating-point image, respectively; and C1 and C2 are constant values. SSIM = 1 indicates that the two images are nearly the same. Moreover, the radiometric resolution (γ) is also a very important indicator [37]. It reveals how finely a sensor can distinguish between objects with similar reflection properties. With σ and µ representing the standard deviation and the mean value of the image, respectively, it is calculated as follows:

γ = 10 log10(1 + σ/µ)

Tables 4 and 5 show the imaging quality assessment comparison results for both the overall region and local regions of the actual scene. The MSE and PSNR are only evaluated for the overall region. Both the imaging results and the quality assessment indicate that the partial fixed-point system with a 16-bit fixed-point FFT processing word length can provide sufficient accuracy for the CS algorithm. According to the evaluation results for both the point target and the actual scene, we adopt the 16-bit fixed-point FFT in the actual system implementation.
SAR data intrinsically consist of amplitude and phase information. The above assessments quantify the impact of fixed-point processing on the amplitude information, while the accuracy of the phase information also needs to be validated. SAR interferometry is based on the phase information in SAR complex products. In general, a pair of interferometric data acquired from different orbits is processed by the same CS algorithm to obtain two focused complex images of the actual scene. After registration of the two complex images, the mean phase and the phase standard deviation are estimated by the method proposed in [38].
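The metrics above can be implemented directly. The sketch below (illustrative, not the paper's code) uses the simplified global form of SSIM, i.e., a single window over the whole image rather than the usual local sliding window, and the common choice C1 = (0.01L)^2, C2 = (0.03L)^2 for the constants:

```python
# Illustrative implementations of MSE, PSNR, global SSIM and radiometric
# resolution for comparing a fixed-point image x against a floating-point
# reference y.
import numpy as np

def mse(x, y):
    return np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)

def psnr(x, y, L=255.0):
    # L is the maximum dynamic range (255 for 8-bit greyscale images).
    return 10.0 * np.log10(L ** 2 / mse(x, y))

def ssim_global(x, y, L=255.0):
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2  # common constants
    x, y = x.astype(np.float64), y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))

def radiometric_resolution(img):
    # gamma = 10*log10(1 + sigma/mu): how finely similar reflectivities
    # can be distinguished.
    img = img.astype(np.float64)
    return 10.0 * np.log10(1.0 + img.std() / img.mean())

# Tiny self-check on synthetic data (NOT the Gaofen-3 scene).
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(ref + rng.normal(0, 2, size=ref.shape), 0, 255)
print(psnr(ref, noisy), ssim_global(ref, noisy))
```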
Table 6 shows the phase information assessment comparison results of both the fixed-point and single precision floating-point processing of the actual scene. According to the phase information evaluation results, 16-bit fixed-point FFT is also appropriate for the actual system implementation.

Hardware Design Methodology
The key principles followed in the system hardware design are as follows.

Real-Time Operation
Requiring the processor to be able to operate without large power-hungry computers implies an efficient mapping methodology. The process of mapping the architecture and algorithm is to progress from a correct algorithm, described at a level of abstraction that is both implementation-independent and timing-independent, to a system description of time-dependent, specific allocation to processing resources and the sequencing of events within those resources. This mapping procedure is divided into static description and dynamic description. During static description, the procedure focuses on how to allocate each step or algorithm module to distinct processors with different characteristics in a hybrid heterogeneous system. A large amount of rapid but repetitive computations can be calculated by fixed-point ASIC co-processors. The FPGA contains a large number of user-defined floating-point intellectual property (IP) cores, which are more suitable for the less frequent but non-linear computations. Thus, the slower and often more irregular computations are assigned to the FPGA. In addition, a mass storage resource is required to store the large SAR raw data. Another task of the FPGA is to build a bridge between the storage and co-processors. During dynamic description, introducing time-area constraints helps to achieve a good balance between processor parallelism and processing latency. The procedure only examines the feasibility of algorithm parallel operational interconnection between processors during the runtime and does not yet look inside a processor for processor-specific features.

Accuracy and Flexibility
The partial fixed-point system framework is achieved based on our fixed-point error propagation model, as described above. Simultaneously, the bit-level accurate simulation method based on an advanced programming language (SystemC) provides a benchmark for gauging the performance of the processor. The software simulation result allows for verification of the segmentation of the algorithm, which provides a convenient basis for modular hardware design. Algorithm changes can be easily incorporated thanks to the flexibility of the reconfigurable FPGA.

Static Mapping Strategy
The processing elements selected for this system design are the FPGA and fixed-point FFT co-processors. A Dual Data Rate (DDR) SDRAM is also introduced into the system; it acts as the external storage medium for the SAR raw data. In this paper, we only discuss SAR imaging processing after A/D conversion. The static mapping of CS image processing can be described by the end-to-end processing chain shown in Figure 7.
SAR raw data are first accessed in the FPGA. Then, after the preprocessing operations, i.e., the raw data analysis and storage and the direct current (DC) removal operation, the standard CS algorithm is partitioned into two parallel branches. One branch, denoted by the grey, dashed rectangular boxes, is processed by the co-processors. The other branch, denoted by the solid-line rectangular boxes, is processed by the FPGA. The FPGA and co-processors are controlled by independent instructions, so they can work synchronously. For the FPGA part, since the three-fold phase function estimation (PFE) can be classified as a linear combination of transcendental operations, development can be accelerated effectively by using the IP cores provided by specific development tools such as Xilinx ISE. For the co-processors part, the order of the FFT (IFFT) and multiplication units is flexible and can be configured according to the algorithm requirements. A quantization operation is added after the last azimuth IFFT to obtain the greyscale images.
The FPGA, as the storage controller, establishes a data access link between the co-processors and the external DDR SDRAM. In Figure 7, the left branch shows that three overall data reading and writing operations occur before the three co-processor processing stages. In general, the SAR raw data are stored range-first. Thus, three matrix transpose operations are required during the total processing procedure. The DDR-based matrix transpose is achieved by using the submatrix three-dimensional mapping method [39]. In this way, the transpose operation can be performed simultaneously with data reading before the co-processors stage. In-situ address access is the prerequisite of this method. The main procedure of this technique is shown in Figure 8 and consists of the following two steps:

1. Divide the NA × NR raw data matrix into M × N submatrixes of Na × Nr order;
2. Map each submatrix into the three-dimensional storage array of the DDR, with each submatrix corresponding to a row of the DDR.

Considering that read and write accesses to the DDR are burst-oriented, m × N samples (where m is the burst length) can be stored to or loaded from the DDR during N clock cycles. Therefore, in our design, the range and azimuth access modes can be described as follows.
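The two-step mapping can be sketched as a small address-translation function. The function name and the sample ordering inside a DDR row are our assumptions for illustration; the paper only fixes the tiling itself:

```python
# Illustrative sketch of the submatrix three-dimensional mapping (Figure 8):
# an NA x NR matrix is tiled into M x N submatrixes of size Na x Nr, and
# submatrix A(i, j) occupies one DDR row.
def ddr_address(a, r, Na, Nr, N):
    """Map sample (a, r) (azimuth, range indices) to (DDR row, offset)."""
    i, j = a // Na, r // Nr            # which submatrix A(i, j)
    ddr_row = i * N + j                # one submatrix per DDR row
    offset = (a % Na) * Nr + (r % Nr)  # ASSUMED row-major order inside a row
    return ddr_row, offset

# Example: a 16 x 16 matrix tiled into 4 x 4 submatrixes of order 4 x 4.
print(ddr_address(0, 0, 4, 4, 4))  # -> (0, 0)
print(ddr_address(5, 9, 4, 4, 4))  # -> (6, 5): submatrix A(1, 2), i.e. DDR row 6
```

With this layout, walking a range line touches N consecutive DDR rows, while walking an azimuth line stays within one column of submatrixes, which is exactly what the two access modes below exploit.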


Range Access Mode
This mode corresponds to the overall data operation before the second co-processors stage. Take the first range line reading operation as an example. As shown in Figure 8, the data of this line are distributed into positions (0, 0) to (0, Nr-1) of the A0,0 to A0,N-1 submatrixes. These submatrixes are mapped into the A0,0 to A0,N-1 rows of the DDR. Since all m data of a burst belong to the same range line, the range line data can be accessed from each row of the DDR continuously. Therefore, the data are accessed in sequential order, with m data per clock cycle. N row-cross operations and N/8 bank-cross operations are performed when acquiring one range line of NR depth, which brings an additional data access time overhead. Thus, it takes more than NR/m clock cycles to obtain one range line of NR depth. The range direction reading operation described above is shown in the upper part of Figure 9.

Azimuth Access Mode
This mode corresponds to the overall data operation before the first and third co-processors stages. Take the reading operation of the first m azimuth lines as an example. As shown in Figure 8, the data of these lines are distributed into columns 0 to m-1 of the A0,0 to AM-1,0 submatrixes. These submatrixes are mapped into the A0,0 to AM-1,0 rows of the DDR. Different from the range access mode, each burst consists of m data from m different azimuth lines. Thus, m azimuth lines of data can be accessed at Nr intervals, with m data per clock cycle. M row-cross operations and no bank-cross operations are performed when acquiring m azimuth lines of NA depth, which also brings an additional data access time overhead. Thus, it takes more than NA clock cycles to simultaneously obtain m azimuth lines. The azimuth direction reading operation described above is shown in the lower part of Figure 9, which presents the reading operations of the two modes from left to right. The writing operation follows the opposite procedure. Apart from the influence of the row-cross and bank-cross operations of each access mode, the maximum bandwidth utilization of the DDR also depends on whether the internal FPGA cache memory can match these two completely different access forms. The relationship between the memory and the parallel accelerating technique is discussed in the following subsections.
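The per-mode access-time overheads described above can be sketched numerically. The following Python model is illustrative only: the per-event penalty cycles t_row and t_bank, and the 32 × 32 submatrix grid used in the example, are assumptions, not measured values from this work.

```python
# Illustrative cycle-count model for the two DDR access modes.
# t_row / t_bank are ASSUMED per-event penalty cycles, not measured values.

def range_mode_cycles(n_r, n_cols, m, t_row=10, t_bank=4):
    """Cycles to read one range line of n_r samples spread over n_cols DDR rows:
    n_r/m bursts, plus n_cols row-cross and n_cols/8 bank-cross penalties."""
    return n_r // m + n_cols * t_row + (n_cols // 8) * t_bank

def azimuth_mode_cycles(n_a, m_rows, t_row=10):
    """Cycles to read m azimuth lines of n_a depth spread over m_rows DDR rows:
    n_a bursts (one per range cell), plus m_rows row-cross penalties."""
    return n_a + m_rows * t_row

# Example with the 16,384 x 16,384 granularity, burst length m = 8,
# and an ASSUMED 32 x 32 submatrix grid (M = N = 32):
rng_cycles = range_mode_cycles(16384, 32, 8)   # exceeds the ideal NR/m cycles
azi_cycles = azimuth_mode_cycles(16384, 32)    # exceeds the ideal NA cycles
```

Both counts come out above the ideal NR/m and NA cycle figures, matching the "more than" wording in the text.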

Dynamic Mapping Strategy
The main function of the dynamic mapping strategy is to introduce timing analysis of the critical path from the perspective of the hardware structure design. This strategy focuses on the following issues:

•	Execution-time matching of parallel branch operations;
•	Processing bandwidth allocations to multiple processors;
•	Interaction between the different parts of the system (FPGA, co-processor and DDR).
This section mainly discusses the real-time performance of multiple co-processor parallel processing according to the above views. Theoretically, each range line or azimuth line FFT operation is independent of the others. This means that the more FFT processors there are, the better the system real-time performance achieved. However, the number of processors is determined by the computational throughput requirements, which are restricted by the bandwidths of the FFT operations, the PFE operations and the DDR-FPGA communication. The relationship can be described as:

n × B_fft + B_pfe ≤ min(B_ddr, k × B_m) (16)

B_ddr is the actual DDR effective bandwidth, which varies according to the access mode (range/azimuth). B_m is the equivalent bandwidth of the FPGA memory, which is used to cache the data between the DDR and the FFT co-processors. B_pfe is the bandwidth of the phase function estimation operations. k represents the amount of FPGA internal cache memory required for a specific DDR access mode. Ideally, as described above, k should be 1 in the range access mode and m in the azimuth access mode. B_fft represents the bandwidth of the FFT operations, and n is the parallelism of the FFT co-processors. Actually, as mentioned before, the PFE is performed by a series of floating-point IP cores in a pipelined stream. The simulation results show that the processing delay of the PFE operations is approximately dozens to hundreds of cycles. Therefore, the PFE is considered a real-time operation, and B_pfe can be ignored in (16). B_ddr, B_m and B_fft can be described as:

B_ddr = η × ρ × B_ddr_peak (17)
B_m = F_m × W (18)
B_fft = F_fft × W (19)

F_ddr_IO, F_m and F_fft represent the clock frequencies of the DDR I/O, the FPGA internal memory and the FFT co-processor, respectively. W is the word length of the data processed per cycle. Since float-to-fixed and fixed-to-float conversions are performed before and after the fixed-point FFT, respectively, a unified value of W can be adopted across the system.
B_ddr_peak is the peak bandwidth of the DDR, which can be represented as the product of double F_ddr_IO and W (data are transferred on both clock edges). η represents the actual DDR bandwidth loss ratio, which is determined by the data access mode (range or azimuth) of the DDR. ρ is the utilization of the burst length m. Consequently, the effectiveness of the parallel processing can be described as:

n ≤ min(2 × η × ρ × F_ddr_IO, k × F_m) / F_fft (20)

According to (20), the parallelism of the co-processors is determined based on specific system features such as F_ddr_IO, F_m, F_fft and W.
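The parallelism bound can be sketched as follows. In this Python sketch, the min() reading of the bandwidth-matching relation and the clock frequencies used in the example are our assumptions (the actual Table 9 values are not reproduced here), while η, ρ and k follow the figures quoted later in the text.

```python
# Sketch of the co-processor parallelism bound implied by the bandwidth
# matching discussion: n FFT units must not outrun the slower of the DDR
# path and the FPGA cache path. The min() form is one reading of (16)/(20).

def ddr_bandwidth(f_ddr_io, w, eta, rho):
    """Effective DDR bandwidth: double-data-rate peak 2*f*w, scaled by the
    bandwidth loss ratio eta and the burst-length utilization rho."""
    return 2 * f_ddr_io * w * eta * rho

def max_parallelism(f_ddr_io, f_m, f_fft, w, eta, rho, k):
    """Largest n satisfying n * B_fft <= min(B_ddr, k * B_m)."""
    b_ddr = ddr_bandwidth(f_ddr_io, w, eta, rho)
    b_m = k * f_m * w
    b_fft = f_fft * w
    return int(min(b_ddr, b_m) // b_fft)

# ASSUMED clock frequencies for illustration: DDR I/O 400 MHz, FPGA internal
# memory 200 MHz, FFT co-processor 100 MHz; W = 64-bit words.
n_range = max_parallelism(400e6, 200e6, 100e6, w=64, eta=0.9375, rho=1.0, k=1)
n_azimuth = max_parallelism(400e6, 200e6, 100e6, w=64, eta=0.74, rho=0.5, k=4)
# Both modes support a parallelism of 2 under these assumptions, consistent
# with the two FFT co-processors adopted in the prototype.
```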

Implementation of the Customized Fixed-Point FFT Co-Processor
As discussed above, the FFT accounts for the largest computational burden. We take the standard FFT IP cores provided for Xilinx FPGAs as an example to show the order of magnitude of the computational efficiency improvement obtained by exploiting fixed-point instead of floating-point processing in the FFT. Table 7 shows the resource comparison between a 16-bit, 16,384-point fixed-point pipeline FFT processor and an IEEE-standard single precision floating-point pipeline FFT processor. The logic resources (registers, LUTs and DSP48s) occupied by the floating-point FFT processor are more than twice those occupied by the fixed-point FFT processor. The memory resources (block RAMs/FIFOs) occupied by the floating-point FFT processor are three times those occupied by the fixed-point FFT processor. The maximum working frequency of the floating-point FFT processor is also lower than that of the fixed-point FFT processor.

In order to improve the flexibility of the system, reduce the system power consumption, and relax the system requirements for FPGA device selection, we adopt an ASIC implementation of the FFT processor. According to the fixed-point quantization error analysis, a 16,384-point, 16-bit FFT co-processor is implemented. It is a typical radix-2^k single-path delay feedback (SDF) pipeline FFT processor based on our previous research [40]. The processing latency is approximately 16,500 clock cycles, or 0.16 ms at a 100 MHz working frequency. The design is modelled in the VHDL language and synthesized using the Semiconductor Manufacturing International Corporation (SMIC, Shanghai, China) 0.13 µm standard cell library. Figure 10 shows the primary layout of the chip. Table 8 summarizes the main specifications of the chip. The total power consumption is only 84.9 mW @ 125 MHz. The chip turns out to be an energy-efficient tool for implementing the FFT, which is the most computationally intensive operation in the CS processing flow.
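The effect of 16-bit fixed-point quantization can be illustrated with a small behavioral model. This is a software sketch of per-stage quantization in a radix-2 FFT, not the hardware SDF architecture of [40]; the Q15 word format and the scale-by-1/2 per stage (which prevents overflow and makes the output 1/N-scaled) are modelling assumptions.

```python
import cmath
import random

def fft_float(x):
    """Reference radix-2 decimation-in-time FFT (len(x) must be a power of 2)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft_float(x[0::2]), fft_float(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def q15(z):
    """Round a complex value to the 16-bit (Q15) grid."""
    s = 1 << 15
    return complex(round(z.real * s), round(z.imag * s)) / s

def fft_fixed(x):
    """Same FFT with Q15 rounding after every butterfly. Each stage scales
    by 1/2, so the result equals FFT(x)/len(x) plus accumulated quantization
    noise, and never overflows the Q15 range for |x| <= 1 inputs."""
    n = len(x)
    if n == 1:
        return [q15(v) for v in x]
    even, odd = fft_fixed(x[0::2]), fft_fixed(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = q15(cmath.exp(-2j * cmath.pi * k / n) * odd[k])
        out[k] = q15((even[k] + t) / 2)
        out[k + n // 2] = q15((even[k] - t) / 2)
    return out

# Compare the fixed-point model against the float reference on a small case.
random.seed(0)
n = 256
x = [complex(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(n)]
ref = fft_float(x)
fxp = fft_fixed(x)
max_err = max(abs(ref[k] / n - fxp[k]) for k in range(n))  # small but nonzero
```

The residual error stays orders of magnitude below the signal level, which is the behavior the fixed-point error propagation analysis relies on.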

Parallel Processing Performance Analysis
To validate the effectiveness of the system design methodology, in this section we establish a prototype using a single off-the-shelf Xilinx XC6VSX760T FPGA with fixed-point FFT co-processors to realize 16,384 × 16,384 stripmap SAR image processing. The raw data are represented in single precision floating-point complex form, which requires 64 bits (32 bits each for the real and imaginary parts) per sample. A DDR3 SDRAM (K4B4G0846Q-HYK0 SODIMM) is introduced for the overall data storage. Table 9 shows the basic parameters of the system. F_fft is derived from the ASIC design report. F_m is limited by the maximum working frequency of the FPGA, which can be obtained from the synthesis report. We adopt ping/pong memory groups as caches between the DDR and the FFT co-processors. Considering the limitation of the FPGA resources, each group contains four DPRAMs with 0.125 MB of storage space each. Each DPRAM holds one 16 K × 64-bit data line.
The burst length m of the DDR3 SDRAM is 8. Thus, 8 × 64 bits of data can be accessed in one cycle. For the range operation, the 8 × 64-bit data belong to the same range line and can be stored into the same DPRAM in a pipelined manner over eight clock cycles. Thus, the burst length utilization ρ is 1, and the amount of FPGA internal cache memory k is 1 in this case. According to the actual measurement, η is 0.9375. According to (20), the bandwidth of the FFT operations should match the bandwidth of the FPGA internal cache memory. Therefore, a co-processor parallelism n of 2 meets the requirements of range direction parallel processing.
For the azimuth operation, k is 4, limited by the number of DPRAMs in the ping/pong group. Thus, only the first 256 bits (4 × 64 bits) can be stored in parallel in the four DPRAMs (ping/pong group) in one cycle, and ρ is 0.5 in this case. According to the actual measurement, η is 0.74. According to (20), the bandwidth of the FFT operations should match the bandwidth of the FPGA internal cache memory. Therefore, n can be 2 to meet the requirements of azimuth direction parallel processing. Table 10 shows a summary of the parameters related to parallel processing. Combining the analysis results of both the range and azimuth operations, we adopt two FFT co-processors to perform the parallel acceleration for the SAR imaging system. The unified flow chart of the parallel processing procedure via a round-robin memory assignment is shown in Figure 11; the six steps of this procedure are as follows:
Step 1: Load four range or azimuth lines of data from the DDR to the ping (pong) DPRAM group.
Step 2: Divide the four DPRAMs of the ping (pong) group into pairs, named RAM_A and RAM_B, respectively. Select the data from RAM_A to send to the parallel FFT co-processors.
Step 3: Store the computation result to RAM_A after performing parallel processing using the FFT co-processors.
Step 4: Repeat steps 2 to 3 for the other two DPRAMs of RAM_B.
Step 5: Collect the data from both RAM_A and RAM_B and send them back to the DDR.
Step 6: Repeat steps 2 to 5 for the data from the pong (ping) group.
Sensors 2017, 17, 1493 17 of 23

It is worth noting that there are both FFT and IFFT operations in the second co-processors stage, as shown in Figure 7. In this stage, Steps 2 to 3 should be executed once more before moving on to Step 4. The parallel processing timelines of one round-robin assignment for both the range direction operation and the azimuth direction operation are described in Figure 12. The red and green blocks represent the ping-memory-group-based and pong-memory-group-based operations, respectively.

Figure 12. Parallel processing timeline of one round-robin assignment.
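The six-step procedure and the double pass of the second stage can be captured in a small schedule model. In this Python sketch, the operation strings and the `passes` flag (2 when both FFT and IFFT run, as in the second co-processors stage) are illustrative simplifications.

```python
# Schedule model of the six-step round-robin procedure. 'passes=2' models the
# second co-processors stage, where Steps 2-3 run again (FFT then IFFT).

def round_robin_schedule(passes=1):
    ops = []
    for group in ("ping", "pong"):                      # Step 6: repeat for pong
        ops.append(f"Step 1: load 4 lines DDR -> {group} group")
        for _ in range(passes):
            for ram in ("RAM_A", "RAM_B"):              # Steps 2-4
                ops.append(f"{group}.{ram} -> 2 parallel FFT co-processors")
                ops.append(f"store result -> {group}.{ram}")
        ops.append(f"Step 5: collect {group}.RAM_A + {group}.RAM_B -> DDR")
    return ops

single_pass = round_robin_schedule()          # FFT-only stages
double_pass = round_robin_schedule(passes=2)  # FFT + IFFT stage
```

Because the ping and pong groups alternate, the load of one group overlaps the computation of the other, which is what the timeline in Figure 12 depicts.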

Modular Design-Based System Architecture
The system block diagram and the photo of the hardware system are shown in Figures 13 and 14, respectively. The system is mainly composed of one FPGA, two ASICs as co-processors and one DDR3 SDRAM. The FPGA design follows the modular design principle. A Universal Asynchronous Receiver/Transmitter (UART) is used to receive external control commands from the host computer for single-step debugging; the accuracy of each step of the CS algorithm can be verified in this way. The other main function modules of the FPGA include the following:
The data interconnection between each module is realized by the 64-bit customized bus, which is called the data-mux in this design. We adopt an ordinary handshake agreement since each module has an independent control part to analyse the main state information. In practical image processing, SAR raw data are sent to the DDR3 SDRAM through the LVDS. By managing and combining different modules, the CS imaging algorithm is completed. The 8-bit greyscale image result after quantification can finally be output by the LVDS. The FPGA resource occupation is shown in Table 11.
According to the above timing analysis, it takes 12.1 s for the system hardware to process SAR raw data with 16,384 × 16,384 data granularity. Twenty-one watts are required for the whole system to work. Figure 15 shows the final imaging result of the proposed partial fixed-point system hardware. The image quality is reliable, which is consistent with the above SystemC-based software simulation results.

The pulse repetition frequency (PRF) of the Gaofen-3 satellite's 50-km-swath, 5-m-resolution stripmap mode is about 2000. Thus, it takes about 8 s to acquire the SAR raw data. Admittedly, the proposed single-board FPGA-ASIC processing system (12.1 s processing latency) realizes only near-real-time processing performance. However, by constructing a multi-board parallel processing system, a true real-time processing system is feasible. For a simple example, with the parallel processing of two boards, the processing latency is easily decreased to about 6 s, which meets the requirement of the data acquisition time. Table 12 shows a comparison between the proposed hybrid heterogeneous parallel system and some previous works.
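The two-board example follows from a simple throughput argument: continuous operation requires the per-frame processing time, divided across boards, to fit within the frame acquisition time. A back-of-the-envelope sketch assuming ideal linear speed-up:

```python
import math

def boards_for_real_time(t_process_s, t_acquire_s):
    """Smallest number of identical boards whose aggregate throughput keeps up
    with the acquisition rate, assuming ideal linear speed-up across boards."""
    return math.ceil(t_process_s / t_acquire_s)

# 12.1 s single-board processing latency vs. ~8 s acquisition time for one
# 16,384 x 16,384 frame at a PRF of about 2000.
n_boards = boards_for_real_time(12.1, 8.0)  # two boards suffice
```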
Compared with [2,19,41], taking into account the data granularity processed, the proposed system undoubtedly shows advantages in both processing time and power consumption. Ref. [42] describes a SAR image focuser exploiting General-Purpose Computing on Graphics Processing Units (GPGPU). The OpenCL framework allows the application to exploit both GPUs and CPUs. It takes 8.5 s to focus ENVISAT ASAR stripmap SAR raw data. Taking into account the data granularity (30,000 × 6000), the processing time is comparable to ours. However, at the current level of technology, the GPU is more suitable for ground-based accelerators than onboard processors; the FPGA has an inherent advantage in terms of FLOPS per watt. Although [21] takes only 2.8 s to process SAR raw data with 32,768 × 32,768 data granularity, the large power consumption of the GPU is unacceptable with respect to the harsh spaceborne onboard real-time processing requirements. In [43], the authors present a spaceborne SAR on-board processing simulator using a low-power mobile GPU. The low power consumption seems appealing and competitive for a SAR onboard processing task. However, according to their experiments on 1024 × 1024, 2048 × 2048 and 4096 × 4096 SAR imaging processing, the processing time increases linearly with the raw data size. Thus, it is reasonable to assume that the processing time of a 16,384 × 16,384 SAR imaging task would be about 205 s, which is unacceptable. In [44], the authors present a novel architecture for real-time SAR signal processing. Real-time processing is achieved by using the Signum Coded algorithm, in which the raw data and the reference function are coded with a single bit. This method can truly reduce the scale of the hardware implementation, including storage and computing resources. However, the method also has some limitations. If the signal-to-noise ratio (SNR) of the signal to be processed is high, false targets and high sidelobes are generated after the imaging processing, and the false targets become more obvious as the SNR increases. The image quality depends on the signal SNR, which to some extent affects the versatility of this method.
In summary, the proposed design achieves a better tradeoff between real-time performance and power consumption.

Conclusions
In this paper, to complete the on-board real-time SAR imaging processing procedure under harsh space conditions, a partial fixed-point imaging system using an FPGA-ASIC hybrid heterogeneous parallel accelerating technique is proposed. With the proposed fixed-point processing error propagation model, the fixed-point processing word length is determined. The imaging fidelity is verified via both a point target and an actual scene imaging quality evaluation. Through a reasonable algorithm-to-system mapping, one Xilinx FPGA and two ASICs work together in parallel. The efficient architecture achieves high real-time performance with low power consumption. A single processing board requires 12 s and consumes 21 W to focus 50-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384. Admittedly, the proposed design is a prototype verification of the spaceborne on-board real-time processor. Both the system scale and the power consumption meet the harsh constraints of on-orbit processing. With the development of anti-radiation reinforcement technology and system fault-tolerant technology, the proposed framework is expandable and feasible for potential spaceborne real-time SAR imaging processing.