1. Introduction
Synthetic Aperture Radar (SAR) systems have been extensively used for Earth remote sensing. They provide high-resolution, light- and weather-independent reconstructions for various applications, including climate and environmental change research and Earth system monitoring [
1]. Similar to conventional radar, electromagnetic waves (in the form of a series of short pulses) are transmitted from a spaceborne or airborne platform, backscattered, and finally collected by the receiving antennas. The combination of the echo signals, received over a period of time, allows for the construction of a virtual aperture much longer than the physical antenna length [
2].
SAR systems produce large amounts of raw data, which need to be processed for image reconstruction. Due to the limited on-board processing capacities of SAR platforms (e.g., power, size, weight, cooling, and communication bandwidth), the raw data are commonly sent to a ground station for processing. Nevertheless, since the raw data volume and computational load of modern SAR systems have increased considerably, downlinking such a large data throughput has become a major bottleneck. For instance, to attain a 0.5 m × 0.5 m resolution, the F-SAR system from the German Aerospace Center (DLR) [
3] acquires SAR raw data with approximately 285,000 × 32,768 (azimuth/range) samples per channel. Assuming the complex64 (8-byte) format, each channel results in approximately 70 GB of raw data. With three channels for Maritime Moving Target Indication (MMTI), approximately 210 GB must be processed. Seeking to ease this problem, on-board image generation has been pursued; however, constraints on computational performance, data size, and transfer speed must be tackled.
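The raw-data figures above follow from simple arithmetic; a minimal sketch (the approximate values quoted in the text correspond to binary gigabytes):

```python
# Raw-data volume for the F-SAR MMTI example, using the figures from the text.
azimuth, range_samples = 285_000, 32_768   # samples per channel (azimuth x range)
bytes_per_sample = 8                       # complex64: 4-byte real + 4-byte imaginary part
channels = 3                               # Maritime Moving Target Indication (MMTI)

per_channel = azimuth * range_samples * bytes_per_sample
total = per_channel * channels

print(per_channel / 2**30)  # ~69.6 GiB per channel ("approximately 70 GB")
print(total / 2**30)        # ~208.7 GiB in total ("approximately 210 GB")
```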
Previous related studies have approached the issue in different ways. The work presented in [
4] addresses real-time on-board SAR imaging by means of a Field Programmable Gate Array (FPGA) together with a Digital Signal Processor (DSP). The application requirements include an optimal balance between processing delay, throughput, appropriate data format, and circuit scale. The DSP is used to implement auxiliary functions, while the FPGA is employed to implement the main processing flow of the Chirp Scaling (CS) algorithm [
5,
6], which makes use of the Fast Fourier Transform (FFT)/Inverse FFT (IFFT). Pipeline optimization is applied to the FFT to improve processing efficiency. Long development times, however, are emphasized as one of the main challenges.
The work addressed in [
7] presents a heterogeneous array architecture for SAR imaging. An Application-Specific Instruction Set Processor (ASIP) is utilized, attaining throughput on the order of giga-operations per second. The research seeks to comply with the desired power consumption and lists the main advantages and disadvantages of the technology against competing technologies:
Central Processing Units (CPUs) are flexible and portable; however, their power efficiency is quite low, which is a bottleneck for real-time SAR applications.
Graphics Processing Units (GPUs) provide powerful parallel computation and programmability. However, the average power consumption (up to 150 W) limits the application of GPUs for on-board processing.
A multi-DSP architecture allows many complex algorithms to be implemented in hardware. Nevertheless, it entails low power efficiency.
FPGAs possess rich on-chip memory; moreover, computational resources are configurable to meet SAR signal processing requirements, e.g., high throughput rate, desired operations per second, and power consumption. However, the development cycle of FPGAs is relatively long.
The work in [
8] addresses real-time SAR imaging systems and focuses on the data format. Specifically, after assessing the advantages offered by fixed-point data processing, the authors propose a solution based on a System-on-Programmable-Chip (SoPC), implemented on a Zynq+NetFPGA platform. The SoPC is chosen for its high-performance embedded computing capabilities. SystemC is employed to develop the Register Transfer Level (RTL) code.
Concerning space-grade devices, [
9] presents a comparison between different space-grade CPUs, DSPs, and FPGAs. The research demonstrates and quantifies how emerging space-grade processors are continually increasing the capabilities of space missions by supporting high levels of parallelism in terms of computational units. The considered processors include multicore and many-core CPUs (HXRHPPC, BAE Systems RAD750, Cobham GR712RC, Cobham GR740, BAE Systems RAD5545, and Boeing Maestro), DSPs (Ramon Chips RC64 and BAE Systems RADSPEED), and FPGA architectures (Xilinx Virtex-5QV FX130 and Microsemi RTG4). GPUs are excluded since there are no space-grade GPUs. In terms of integer Computational Density (CD) and CD/W, the best results are attained by the RC64, Virtex-5QV, and RTG4. In terms of Internal Memory Bandwidth, the best results are achieved by the RC64 and the Virtex-5QV. In terms of External Memory Bandwidth, the best results are attained by the RAD5545 and the Virtex-5QV. Lastly, in terms of Input/Output Bandwidth, the best results are achieved by the RAD5545, Virtex-5QV, and RTG4.
The research in [
10] explores the utilization of Qualcomm’s Snapdragon System on a Chip (SoC) technology in space missions. Specifically, it focuses on the successful deployment of the Snapdragon 801 SoC in the Ingenuity Helicopter on Mars and the use of Snapdragon 855 development boards in the International Space Station. The study compares different GPUs such as Nvidia Jetson Nano, Nvidia TX2, Nvidia GTX 560M, and Nvidia GTX 580; it also highlights that GPUs are not commonly used for space computing. The study refers to traditional FPGA implementations with the VIRTEX-5 SX50T FPGA as a benchmark. Interestingly, the research demonstrates that, in certain scenarios, the software implementation on the Snapdragon SoC outperforms the traditional FPGA implementations. This finding emphasizes the computational power and efficiency offered by Snapdragon technology in the context of space-related applications. However, it should be noted that the FPGA VIRTEX-5 SX50T uses 65 nm technology, while the Snapdragon 855 uses 7 nm technology.
Multiple implementations for on-board SAR processing with small and low-power GPU devices have been realized. In [
11], the paper demonstrates the successful implementation of SAR processing algorithms on the Jetson TX1 platform. The optimized implementation takes advantage of the GPU’s parallel processing capabilities, resulting in improved performance compared to CPU-based approaches. This highlights the potential of the Jetson TX1 platform for accelerating SAR processing tasks in a compact and energy-efficient manner. In [
12], the research presents a small UAV-based SAR system that utilizes low-cost radar, position, and attitude sensors while incorporating on-board imaging capability. The system demonstrates the feasibility of cost-effective SAR imaging using a Nvidia Jetson Nano as the host computer of the drone. The choice of a powerful and energy-efficient platform for data processing and control enhances the system’s capabilities. Leveraging the Jetson Nano’s GPU capabilities and parallel processing power, the system can perform real-time processing, sensor integration, and image reconstruction tasks.
In both cases [
11,
12], the amount of processed information differs from the previous example of the three-channel MMTI system. Although [
11] does not specify the image dimensions, it reports a data size of 172 MB; on the other hand, although [
12] does not specify the data size, it reports the image dimensions (120 m × 100 m with a resolution of 0.25 m). As a point of comparison, Ref. [
13] presents image dimensions similar to the MMTI example; the article tackles on-board SAR processing using SIMD instructions with an Intel
® Core™ i7-3610QE (by Intel Corporation in Santa Clara, CA, USA) processor. The image dimensions are 7.5 km by 2.5 km with a resolution of 0.5 m. Note the large difference between the data size in [
11] (172 MB) and in the MMTI example (210 GB), as well as the image size, 120 m × 100 m in [
12] in contrast to 7.5 km × 2.5 km.
In general, FPGA-based accelerators provide problem-specific processing solutions that are highly parallelized and reliable. Moreover, there is a wide variety of devices to select from, according to the application requirements. Nonetheless, as discussed previously, there are some concerns regarding the implementation, including a long development time and a low Data Transfer Rate (DTR). However, these issues can be solved effectively by making use of tools like High-Level Synthesis (HLS) [
14], employed to reduce development time between fivefold and tenfold [
15,
16], and Reusable Integration Framework for FPGA Accelerators (RIFFA) [
17], utilized to develop a Peripheral Component Interconnect express (PCIe) interface. On the one hand, HLS allows describing Hardware (HW) via high-level programming languages (e.g., C/C++, SystemC, or OpenCL). Additionally, HLS permits applying optimizations like pipeline, cyclic partition, and unroll. On the other hand, RIFFA provides a framework that works directly with the PCIe endpoint, achieving high data transfer speeds (up to 15.7 GB/s for PCIe 3.0 with 16 lanes). RIFFA runs on Windows and Linux with the programming languages C, C++, Python, MATLAB, and Java [
17].
Correspondingly, this article addresses a novel methodology for the utilization of FPGA accelerators in on-board SAR processing routines. The methodology consists of using HLS to create Intellectual Property (IP) blocks and using RIFFA to develop a PCIe interface between the CPU and the FPGA. The proposed schematic has the advantage of being highly flexible and scalable since the IPs can be exchanged to perform different processing routines and since RIFFA allows employing up to five FPGAs while multiple IPs can be implemented in each FPGA.
The suggested methodology, for example, is suitable for the project described in [
4], as each stage of the CS algorithm can be implemented as an IP. Making use of HLS would reduce development time significantly, whereas different optimizations (e.g., pipeline, cyclic partition, or unroll) could be applied to meet application requirements. In contrast to [
7], an FPGA implementation has greater flexibility than an ASIP, as it is not limited by an instruction set. Similar to [
8], the proposed methodology allows performing the same algorithm for different input/output data formats; the main difference is that our methodology provides further optimization capabilities and PCIe infrastructure. Finally, [
9] addresses space-grade FPGA systems (i.e., Virtex-5QV and RTG4) that are configurable with the proposed methodology and are able to process large data throughput efficiently.
As a proof of concept, we present an FPGA accelerator in charge of the reordering stage of VEC-FFT [
18], a software-optimized version of the FFT. Among the most frequently employed SAR focusing techniques, we can find the Range-Doppler algorithm [
19,
20], CS [
8,
9], and Omega-K (ω-k) [
21,
22,
23], along with their extended versions. Common to all these methods is the use of the FFT, which has a high computational cost as it involves many additions, subtractions, and multiplications. Consequently, optimizing the FFT enhances the performance of such SAR focusing techniques.
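The computational advantage of the FFT over a direct DFT can be made concrete with the standard textbook operation-count estimates (these are asymptotic approximations, not measurements of any particular implementation):

```python
import math

# A direct DFT needs ~N^2 complex multiply-adds, while the FFT needs
# ~N*log2(N). At SAR-scale transform sizes the gap is enormous.
def dft_ops(n: int) -> float:
    return n * n                 # ~N^2 operations

def fft_ops(n: int) -> float:
    return n * math.log2(n)      # ~N*log2(N) operations

n = 16_384  # transform size used in the VEC-FFT example below
print(dft_ops(n) / fft_ops(n))  # the FFT is roughly 1170x cheaper at this size
```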
VEC-FFT is based on Radix-4 [
24] and executed using Single Instruction Multiple Data (SIMD) [
25], achieving high efficiency through parallel data processing. However, after the FFT is computed, the results are retrieved in bit-reversed order, meaning that data reordering is required as an additional step. VEC-FFT outperforms the Fastest Fourier Transform in the West (FFTW) Scalar [
18], a popular FFT library. Furthermore, when the reordering stage is omitted, VEC-FFT is faster than FFTW SIMD [
18]. For example, without data reordering, VEC-FFT solves a 16,384-point FFT utilizing only 91,601 clock cycles, whereas FFTW Scalar and FFTW SIMD entail 452,253 and 106,601 clock cycles, respectively. Nevertheless, VEC-FFT requires 142,460 additional clock cycles to compute the reordering post-processing step, for a total of 234,061 clock cycles. Note that the data reordering function consumes more than half of the total clock cycles.
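To make the reordering step concrete, the sketch below implements the classic bit-reversal permutation on which such post-processing is based. This is an illustrative software model only; VEC-FFT's actual routine is SIMD-optimized and tied to its radix-4 data layout.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def reorder(data):
    """In-place bit-reversal permutation (length must be a power of two)."""
    n = len(data)
    bits = n.bit_length() - 1
    for i in range(n):
        j = bit_reverse(i, bits)
        if j > i:                          # swap each index pair exactly once
            data[i], data[j] = data[j], data[i]
    return data

print(reorder(list(range(8))))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Every element requires at least one memory read and one write, which is why this stage dominates the clock-cycle count in software.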
A HW implementation of the VEC-FFT reordering function using the proposed methodology significantly reduces the number of required clock cycles: for instance, from 142,460 to 44,569 in the aforementioned example. Moreover, the IP can be instantiated multiple times, performing more than one reordering function at a time.
Next, to demonstrate flexibility, the VEC-FFT reordering IP is replaced by an IP for matrix transposition, another computationally expensive process due to memory latency. The matrix transpose is implemented for square matrices with dimensions of 32 × 32, 64 × 64, and 128 × 128. The addressed CPU implementation utilizes 10,424, 38,505, and 210,928 clock cycles, respectively; conversely, the HW implementation, using the proposed methodology, reduces the number of clock cycles to 1041, 4113, and 16,401, correspondingly.
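Why memory latency dominates the transpose can be illustrated with a cache-blocked software sketch: the operation performs no arithmetic at all, only strided memory moves. The block size and flat-array layout below are illustrative choices, not those of the cited implementations.

```python
def transpose_blocked(a, n, block=16):
    """Cache-friendly transpose of an n-by-n matrix stored as a flat list.

    Working block-by-block keeps both the source rows and destination
    columns resident in cache, reducing the memory stalls that dominate
    a naive element-by-element transpose.
    """
    out = [0] * (n * n)
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out

n = 4
print(transpose_blocked(list(range(n * n)), n, block=2))
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```

An FPGA sidesteps the problem differently: multiple on-chip memories allow several of these reads and writes to happen in the same clock cycle.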
Finally, the main contributions of this work are summarized as follows:
The remainder of the article is organized as follows:
Section 2 reviews HLS and RIFFA;
Section 3 explains the integration of RIFFA with the HLS IPs;
Section 4 addresses the HLS IP for the reordering stage of VEC-FFT and the HLS IP for matrix transposition;
Section 5 assesses the performance of the FPGA against the CPU;
Section 6 presents the discussion; and finally,
Section 7 concludes this work.
3. Integration of RIFFA
The proposed HW architecture consists of two main elements: RIFFA, responsible for generating the signals utilized by PCIe, and the IPs, whose functionality depends on the application, e.g., the reordering stage of VEC-FFT and matrix transpose. The IPs connect with RIFFA through RIFFA channels, which obey the set of signals presented in
Table 1 [
17]. These are divided into two subsets: those for RX and those for TX. As shown in
Figure 3, the operation of the channels is based on a finite-state machine composed of eight states. The state machine manages the receiving, processing, and sending of data.
Data reception starts at state 0, which waits for the signal “CHNL_RX”. Subsequently, state 1 collects the data and counts the number of received words, taking the signal “CHNL_RX_DATA_VALID” into consideration. Once the count equals “CHNL_RX_LEN”, the machine continues to the processing stage. Data processing begins with a counter reset, “rCount = 0”, in state 2. State 3 then asserts “ap_start” to start the IP and waits for “ap_done”, which indicates that the IP has finished its work (e.g., data reordering or matrix transpose). Data sending starts at state 4 with the reset of the IP via the “rst_ip” assertion. State 5 sets “rst_ip” to zero and initializes “rCount”. Next, state 6 loads “rCount” with the number of data integers, making use of “C_PCI_DATA_WIDTH”. Recall that an integer here is a 32-bit data type. Finally, at state 7, RIFFA is instructed to send the data from the FPGA to the CPU through the signal “CHNL_TX_LEN”. After this, the finite-state machine returns to state 0 and waits for the next transfer.
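The eight states can be summarized with the following behavioral model. This is a plain-software sketch using the signal names from the text as comments; the actual controller is RTL driven by the RIFFA handshake, and the IP body is abstracted here as an ordinary callable.

```python
def run_channel(rx_words, ip):
    """Behavioral model of the eight-state RIFFA channel controller."""
    state, buf, rcount = 0, [], 0
    while True:
        if state == 0:            # wait for CHNL_RX assertion
            state = 1
        elif state == 1:          # collect words while CHNL_RX_DATA_VALID
            buf = list(rx_words)
            rcount = len(buf)
            if rcount == len(rx_words):   # count reached CHNL_RX_LEN
                state = 2
        elif state == 2:          # reset counter (rCount = 0) before processing
            rcount = 0
            state = 3
        elif state == 3:          # assert ap_start, wait for ap_done from the IP
            result = ip(buf)
            state = 4
        elif state == 4:          # assert rst_ip to reset the IP
            state = 5
        elif state == 5:          # deassert rst_ip, re-initialize rCount
            rcount = 0
            state = 6
        elif state == 6:          # load rCount with the word count to transmit
            rcount = len(result)
            state = 7
        elif state == 7:          # drive CHNL_TX_LEN, then return to state 0
            return result

print(run_channel([3, 1, 2], sorted))  # [1, 2, 3]
```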
To simplify the implementation process, both the RIFFA channels and IPs are instantiated within RIFFA. The IP integrates with RIFFA through the RX/TX control signals specified in [
17]. The IP is then connected to the RIFFA channel using the “ap_start” and “ap_done” signals. Once the integration is performed, the data transfer process begins with the
fpga_send function, which sends data from the PC memory to the FPGA. The channel waits until all data is received and the “ap_start” signal is asserted, indicating that the IP can start processing the data. Once the IP finishes processing, it asserts the “ap_done” signal, and the channel waits for the
fpga_recv function to retrieve the processed data from the FPGA.
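The host-side flow can be sketched as follows. This is a software mock that runs without hardware: the `LoopbackFpga` class is a hypothetical stand-in for the FPGA board, its method names mirror RIFFA's `fpga_send`/`fpga_recv` calls but their signatures are simplified, and the IP is modelled as a function applied to the received buffer.

```python
class LoopbackFpga:
    """Mock of the FPGA side of a RIFFA channel (illustration only)."""

    def __init__(self, ip):
        self.ip = ip      # stands in for the HLS IP inside the FPGA
        self.buf = None

    def fpga_send(self, chnl, data):
        # The channel receives the words, then ap_start/ap_done bracket
        # the IP's processing of the buffer.
        self.buf = self.ip(list(data))
        return len(data)  # number of words actually sent

    def fpga_recv(self, chnl):
        # The host retrieves the processed words from the channel.
        return self.buf

fpga = LoopbackFpga(ip=lambda words: words[::-1])  # toy "IP": reverse the words
sent = fpga.fpga_send(0, [10, 20, 30])
print(sent, fpga.fpga_recv(0))  # 3 [30, 20, 10]
```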
Figure 4 provides a visual representation of the integration of HLS IPs with RIFFA by means of channels. Observe how the various components work together to enable efficient data transfer and processing.
6. Discussion
Table 6 and
Table 10 present the processing clock cycles required for each implementation of VEC-FFT data reordering and matrix transpose, respectively. As can be observed, HW implementations perform better than SW implementations when a large amount of memory access is required. In
Table 10, for example, the FPGA presents a significant improvement in clock cycles over the CPU, since the matrix transpose consists of a series of value exchanges in memory and the FPGA can perform multiple memory accesses, storing and loading several data sets at the same time.
Concerning
Table 7, at first glance, it might seem that FFTW SIMD has better performance than VEC-FFT+HW IP; however, recall that the FPGA is capable of executing multiple IPs at the same time. This means that, using the same clock cycles, VEC-FFT+HW IP can perform as many FFT reorderings as there are IPs.
Table 8 and
Table 11 present the attained DTR for VEC-FFT data reordering (16,384 points) and 128 × 128 matrix transpose, correspondingly, using the CPU clock as reference. On the other hand,
Table 9 and
Table 12 consider the same cases, respectively, but with multiple (three) IPs. As can be seen, the data throughput increases when multiple IPs are used and higher parallel processing is performed. For the particular case of VEC-FFT data reordering, the achieved data throughput is above the DTR defined (limited) by the PCIe port since both the CPU and FPGA perform the VEC-FFT data reordering, as depicted in
Figure 7.
Figure 8 shows the power consumption of a single IP for the reordering stage of VEC-FFT (1024 points), using the FPGA xc7z045ffg900-2. As can be observed, the PCIe ports consume most of the power. An increase in power consumption is expected with more IPs, but it is still lower in comparison to the PCIe ports.
Power consumption depends on the architecture and the implemented system. The addressed implementation consumes 2.174 W with one VEC-FFT IP and two channels, 2.577 W with two VEC-FFT IPs and four channels, 3.337 W with four VEC-FFT IPs and eight channels, and 3.838 W with six VEC-FFT IPs and twelve channels. It is important to note that the implemented system can be further optimized for more efficient power consumption. For example, in the case of VEC-FFT data reordering, modifying the system to use a single channel results in a power saving of approximately 0.123 W. Additionally, when power consumption is a primary constraint, optimizations that prioritize lower power consumption can be chosen over lower latency.
On-board SAR systems have specific requirements that depend on the application and platform. These requirements include compact size and weight to accommodate the limited space and weight constraints of airborne and spaceborne platforms like drones or satellites. Power efficiency is crucial to ensuring longer mission endurance and minimal energy consumption, considering the limited power sources available. Real-time processing capability is essential for rapid data acquisition, processing, and image reconstruction to provide timely and actionable information. High-resolution imaging is a requirement to capture detailed images of the Earth’s surface, necessitating the use of sensors and algorithms that can achieve the desired resolution. On-board systems should also consider data storage and transmission capabilities, including sufficient storage capacity, data compression techniques, and efficient data transfer methods. Furthermore, on-board SAR systems must be designed to withstand harsh environmental conditions encountered during aerial or spaceborne missions, ensuring resilience to temperature variations, vibration, electromagnetic interference, and external elements.
Overall, on-board SAR systems must be compact, power-efficient, capable of real-time processing, and able to capture high-resolution images. They need to integrate sensors effectively, facilitate data storage and transmission, and exhibit resilience in harsh environments. Adapting the system design to meet the specific requirements of each application and platform is vital for successful deployment and data acquisition.
The combination of FPGA and CPU implementations using a PCIe interface offers several advantages for SAR systems. It leverages the parallel processing power of FPGAs alongside the flexibility and customizability of CPUs, resulting in the accelerated execution of computationally intensive SAR algorithms. The PCIe interface facilitates efficient data transfer between the FPGA and CPU, enabling seamless communication and real-time processing. This implementation optimizes resource utilization, reduces power consumption, and provides scalability options for handling varying processing requirements.
This study utilizes PCIe 2.0 with four lanes, reaching a maximum DTR of 2 GB/s, limited by the PCIe technology. In order to achieve the same or better performance than the CPU, a different PCIe technology might be employed. Correspondingly,
Figure 10 shows the DTR for different PCIe technologies and the number of lanes. For example, PCIe 1.0 with four lanes attains a maximum DTR of 1 GB/s, whereas more recent versions of PCIe with four lanes attain a maximum DTR of 2 GB/s, 3.9 GB/s, 7.8 GB/s, and 15.7 GB/s for PCIe 2.0, 3.0, 4.0, and 5.0, respectively. Note that increasing the number of lanes increases the DTR achieved.
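These figures follow from the effective per-lane rates after line-code overhead (8b/10b for PCIe 1.0/2.0, 128b/130b from 3.0 onward); the sketch below reproduces the values quoted in the text, along with the TX/RX time estimate discussed next.

```python
# Approximate effective per-lane throughput in GB/s, after line-code
# overhead. These constants reproduce the DTR figures quoted in the text.
PER_LANE_GBS = {"1.0": 0.25, "2.0": 0.5, "3.0": 0.985, "4.0": 1.97, "5.0": 3.94}

def pcie_dtr(gen: str, lanes: int) -> float:
    """Approximate maximum data transfer rate (GB/s) for a PCIe link."""
    return PER_LANE_GBS[gen] * lanes

print(pcie_dtr("2.0", 4))    # 2.0 GB/s: the setup used in this study
print(pcie_dtr("3.0", 16))   # ~15.7 GB/s: the RIFFA maximum cited in the text

# TX/RX time for 16,384 complex64 samples transferred in each direction
# over a link sustaining ~7.4 GB/s in practice (PCIe 2.0, 16 lanes):
nbytes = 2 * 16_384 * 8                 # send + receive
print(round(nbytes / 7.4e9 * 1e6, 1))  # ~35.4 microseconds
```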
A suitable PCIe technology and the addition of more FPGAs increase the parallel processing capacity, easily surpassing the CPU measurements. For instance, PCIe 2.0 with 16 lanes attains a theoretical DTR of 8 GB/s, or about 7.4 GB/s in practice. For VEC-FFT data reordering of 16,384 points, such technology reduces the TX/RX time from 61.43 μs (see
Table 8) to 35.43 μs, i.e., approximately 26 μs less than the CPU. This technology is available on multiple boards, e.g., the Xilinx Virtex-7 FPGA VC709, the Solar Express 125, and the XUP-PL4. Moreover, RIFFA [
17] supports using a maximum of five FPGAs per system with PCIe 3.0 or previous versions; also, since RIFFA works with a direct memory access engine, it could be upgraded to PCIe 4.0.
In the context of wider expectations and requirements in on-board SAR processing, the proposed methodology aligns with the need for real-time and efficient data processing. The high DTR achieved, up to 15.7 GB/s, through the PCIe interface developed using RIFFA addresses the challenge of data downlink by minimizing transfer bottlenecks and enabling faster communication between the CPU and FPGA. This capability is vital for on-board SAR systems, where timely and efficient data transfer is crucial for real-time decision-making and analysis. As mentioned, the data size with three channels for MMTI with 0.5 m × 0.5 m resolution is about 210 GB; correspondingly, with a DTR of 15.7 GB/s, the transfer bottlenecks are significantly reduced.
Electromagnetic interference avoidance is a relevant factor in SAR space missions. Accordingly, the proposed methodology allows employing space-grade FPGAs such as the Xilinx Virtex-5QV FX130 and Microsemi RTG4, which have an average power consumption of 9.97 W and 3.91 W, respectively [
9].
The choice of optimizations and the balance between power consumption and processing capacity will ultimately depend on the specific requirements and constraints of the project. As highlighted previously, there is a direct relationship between power consumption and processing capacity, necessitating careful consideration and trade-offs to achieve the desired system performance while optimizing power efficiency.
Related to a GPU implementation, leaving aside the limitations of GPUs for space missions, GPU accelerators offer significant advantages for SAR processing. Their parallel processing capabilities allow for simultaneous computation on multiple data elements, making them well suited for SAR algorithms. GPUs provide high computational power, enabling faster processing and analysis of SAR data compared to CPU-based approaches, and with careful optimization and algorithm design, they can deliver substantial speedups. Additionally, GPU libraries, such as cuFFT and cuBLAS, provide optimized functions for efficient SAR data processing. However, GPUs present limitations. Memory constraints and bandwidth limitations can pose challenges when dealing with large datasets. Some SAR algorithms may not be easily parallelizable or may have irregular data access patterns, affecting GPU performance. GPU programming complexity and power consumption are additional considerations. Despite these limitations, GPUs remain a valuable tool for SAR processing.
Although multiple implementations with low-power GPUs (Jetson TX1, Jetson TX2, or Jetson Nano) of on-board SAR processing have been made, such as [
11,
12], the amount of data that these low-power GPUs can process is limited. On the other hand, the use of GPUs with higher processing capacity, such as the one mentioned in [
7] (the Tesla K10 GPU Accelerator), entails a power consumption of more than 150 W, which is a factor to consider in on-board processing.
The advantages of the addressed methodology include the utilization of HLS to create IP blocks and RIFFA to develop a PCIe interface. Employing HLS significantly reduces development time, typically between fivefold and tenfold [
15,
16]. This accelerates the design process and enables faster iterations and optimizations, which is one of the primary constraints in multiple on-board SAR implementations regarding FPGA implementations, as mentioned in [
4,
7,
8,
9].
Additionally, HLS offers several optimizations, including pipeline, cyclic partition, and unroll techniques, which can further enhance the performance of the FPGA-based SAR processing routines. These optimizations contribute to improved efficiency and resource utilization, making better use of the FPGA’s capabilities. For example, the FFT could be implemented with the radix-8 algorithm; instead of the eight memories shown in
Figure 6, sixteen memories would be used, and eight data sets would be processed at a time instead of four.
The RIFFA framework facilitates the use of up to five FPGAs, with the potential for multiple IPs to be implemented in each FPGA. This scalability ensures that the methodology can handle more complex SAR processing tasks or accommodate larger data volumes, enhancing its applicability to different scenarios.
In comparison to existing approaches, the proposed methodology stands out by leveraging the combined advantages of HLS, IP blocks, RIFFA, and FPGA accelerators. While individual components and techniques have been employed in previous research, their integration and application specifically for on-board SAR processing is relatively novel. By capitalizing on these technologies, the methodology tackles the limitations and challenges faced by conventional approaches, such as time-consuming development cycles, limited scalability, and suboptimal processing efficiency.
Nevertheless, the use of FPGA accelerators presents some disadvantages that need to be considered. One significant disadvantage is the resource consumption of FPGA accelerators. FPGA designs require careful management of resources such as logic elements, memory, and interconnects. Implementing SAR processing algorithms on FPGAs can be resource-intensive, potentially requiring larger and more expensive FPGA devices to accommodate the computational requirements. This can increase the overall cost of the on-board SAR system. Additionally, FPGA accelerators may operate at lower frequencies compared to other processing technologies like CPUs or GPUs. This lower operating frequency can limit the processing speed and real-time capabilities of on-board SAR systems, affecting their ability to handle time-sensitive tasks efficiently.
A readme file in the GitHub repository explains the points above in detail.
7. Conclusions
Aimed at incorporating FPGA accelerators into on-board SAR processing algorithms, this article introduces a novel methodology to combine HW and SW via PCIe. Two main tools are employed: the HLS synthesizer (i.e., Vivado HLS) and RIFFA. Together, they reduce development time significantly (between fivefold and tenfold) and attain high transfer speeds, up to 15.7 GB/s for PCIe 3.0 with 16 lanes. HLS is a fast and efficient way of creating IPs by transforming high-level programming languages (e.g., C/C++, SystemC, or OpenCL) into HDL (i.e., Verilog or VHDL). Furthermore, it permits easily applying optimizations like pipeline, unroll, and array partition. RIFFA, on the other hand, provides a framework that works directly with the PCIe endpoint; therefore, it is no longer necessary to implement the PCIe communication manually.
The development of onboard SAR systems presents various challenges and considerations. These systems must meet specific requirements such as compact size, power efficiency, real-time processing capability, high-resolution imaging, and resilience to harsh environmental conditions. Integrating FPGA accelerators with a CPU using a PCIe interface offers several advantages for on-board SAR processing. The PCIe interface ensures efficient data transfer between the FPGA and CPU (up to 15.7 GB/s using the current methodology), minimizing bottlenecks and enabling high-speed communication. The FPGA’s high parallelism and customizable nature enhance the acceleration of computationally intensive SAR algorithms, keeping the high parallelism through multiple IPs and optimizations. The scalability of the implementation allows for handling large data sizes, ensuring efficient processing, and up to five FPGAs can be used per system. However, it is important to note that FPGA accelerators also have disadvantages, including logic resource consumption and cost implications due to the need for larger FPGA devices. Additionally, the lower operating frequency of FPGAs compared to CPUs or GPUs may impact real-time processing capabilities. Despite these limitations, the FPGA+CPU implementation with a PCIe interface remains a promising solution for on-board SAR processing, enabling efficient and high-performance data processing.
In order to exemplify the advantages of the suggested methodology, we refer to the VEC-FFT algorithm [
18]. VEC-FFT retrieves results in reversed bit order, and the reordering stage consumes more than half of the total clock cycles. Therefore, an FPGA accelerator for the data reordering function is developed that communicates via PCIe with a CPU, where the rest of the VEC-FFT algorithm is implemented.
Modular and scalable implementations are possible, meaning that the IPs are interchangeable and can be instantiated multiple times. For demonstration purposes, the original (reordering) IP is replaced by an IP that performs the computationally expensive matrix transpose. Experimental results show the capabilities of the introduced methodology, together with the main advantages of parallel processing. The proposed solution allows for the use of different PCIe technologies according to specific needs. Being scalable, up to five FPGAs can be employed, each with multiple IPs.
The comparisons between the CPU and FPGA for VEC-FFT reordering reveal similar performance levels. However, it is important to consider the limitations of the ZC706 board used in these comparisons. The ZC706 board operates at a working frequency of 100 MHz and utilizes PCIe 2.0 with 4 lanes. Despite these limitations, the FPGA still demonstrates competitive performance. To further enhance performance, one could explore the use of an FPGA with a higher working frequency or leverage more advanced PCIe technology.
Moreover, in the case of matrix transpose, the FPGA outperforms the CPU, highlighting the inherent advantages of FPGA-based processing. Even with the limitations imposed by the ZC706 board, the FPGA demonstrates favorable performance in this scenario. This underscores the potential of FPGAs for accelerating computational tasks such as matrix operations.
The current research and methodology provide a practical guide for developing FPGA accelerators quickly and efficiently using HLS. The creation of modular IPs offers the advantage of reusability, allowing the same IP to be utilized for different SAR algorithms. Furthermore, the integration of the RIFFA framework offers a robust infrastructure for efficient communication between the FPGA and the CPU, significantly minimizing bottlenecks and enabling parallel processing capabilities. The direct DMA provided by RIFFA facilitates seamless data transfer between the FPGA and the CPU, further enhancing the overall performance of the system.
This research serves as a valuable resource for developers seeking to harness the power of FPGAs for accelerating SAR processing and similar applications. By following the presented methodology, developers can achieve faster development cycles, improved performance, and enhanced efficiency in FPGA-based accelerator designs.