Constrained Optimization of FPGA Design for Spaceborne InSAR Processing

Remote Sens. 2022, 14, 4713

Abstract: With the development of spaceborne processing technologies, the demand for on-board processing has risen sharply. Against this background, spaceborne Interferometric Synthetic Aperture Radar (InSAR) processing has become an important research area. On-board InSAR processing often requires high processing capacity, yet the Field-Programmable Gate Array (FPGA) resources on satellites are limited. To improve the performance of spaceborne remote sensing processing, this paper designs a high-performance FPGA system for the coarse registration and interferogram generation steps of InSAR. Moreover, to address the dual-constraint problem of resources and processing capacity, the paper proposes an FPGA design method based on gradient descent theory, which can identify the optimum trade-off scheme between the two constraints. Finally, the proposed system design and method are implemented on FPGA. Experiments showed that the FPGA system outperformed an NVIDIA (Santa Clara, CA, USA) GTX Titan Black Graphics Processing Unit (GPU), and that the optimum trade-off scheme increases the total processing time by only 1.1% while reducing FPGA BRAM usage by 8.7%. The experimental results prove the effectiveness and validity of the proposed system and method.

On-board processing, among other research areas of InSAR, attracts much research attention because it can reduce the volume of mission data sent through satellite-ground communication and improve the proportion of valid data [21]. Owing to the advancement of radiation-hardened/tolerant FPGAs, SAR on-board processing has been realized [22], which provides a solid foundation for implementing InSAR. InSAR on-board processing is advantageous in fields with real-time requirements, since it can reduce output data rates and downlink volume. For example, it has been employed in many oceanic studies, including ocean topography and surface water topography, which require the processing of a large amount of ocean observation data [23,24]. For fields that require high accuracy rather than real-time performance, such as planetary missions, on-board InSAR processing is not yet the ideal choice compared with ground processing systems at the current stage.

• To realize on-orbit InSAR processing, we design an efficient FPGA system for the coarse registration and interferogram generation flow of InSAR.
• To perform matrix transposition under the AXI protocol, we analyze the data access pattern of matrix transposition based on AXI.
• To solve the dual-constraint problem of processing capacity and resources, we propose an FPGA design method based on gradient descent theory, and use the InSAR design in this paper to illustrate how to find the optimum trade-off between processing capacity and resources.
• We implement the efficient structure and the design method on FPGA. The results show that the efficient structure outperforms the NVIDIA GTX Titan Black GPU, and the design method can not only guide the engineering design but also evaluate the optimized structure academically.
The remainder of this article is organized as follows. Section 2 introduces the targeted InSAR algorithm, including coarse registration and interferogram generation, and briefly introduces gradient descent theory. Section 3 describes the architecture, processing, and storage structure of the FPGA system proposed in this paper. Notably, the matrix transpose storage structure can be adapted to the SoPC system under AXI conditions, ensuring processing efficiency. Section 4 illustrates how to seek the balance between the resources and processing capacity of the FPGA using gradient descent theory. Section 5 presents experimental results measuring the performance of the FPGA system and compares them with those of a GPU and a DSP. The results show that the balanced FPGA design proposed in this paper achieves high processing performance while saving resources. With the proposed design, the FPGA demonstrates processing capacity comparable to that of an NVIDIA GTX Titan Black GPU, revealing the great potential of FPGAs in the field of spaceborne real-time processing. The paper ends with conclusions in Section 6.

Coarse Registration and Interferogram Generation of InSAR
The InSAR processing flow includes six steps [51]: image registration, interferogram generation, interferogram flattening, interferogram phase filtering, interferogram phase unwrapping, and phase-to-height conversion and geocoding. Existing research shows that interferogram generation is the most time-consuming step, occupying about 32% of the total InSAR processing time (without fine registration) [34]. Therefore, this paper focused on the first two steps of InSAR processing, image registration and interferogram generation, to shorten the processing time of InSAR.
Before calculating the interference phase and coherence coefficient, the two images need to be registered to obtain accurate phase information. Image registration can be divided into coarse and fine registration, according to accuracy. Pixel-level registration is called coarse registration, while sub-pixel registration (i.e., less than one pixel) is fine registration.
The specific case used in this paper was the observation of ocean depth variation over a vast ocean scene through satellite altimetry. Since this paper did not focus on any particular area, the fine registration resampling applied on GPU [52,53] did not fit our needs. Thus, a coarse registration algorithm based on the correlation function was adopted, as shown in Formula (1), where M_1(i,j) and M_2(i,j) are the amplitudes of the two complex images, and R(u,v) is the correlation function of the two images. The correlation can be transformed from the time domain into the frequency domain and quickly calculated by the Fast Fourier Transform (FFT), as Formula (2) shows, which also suits pipeline operation in FPGA. M_1(i,j) and M_2(i,j) in the time domain correspond to S_1(m,n) and S_2(m,n) in the frequency domain, and R_FFT is the frequency-domain result of the correlation function R(u,v). The symbol * denotes the conjugate operation. The cross-correlation of the two images can be obtained by performing the Inverse Fast Fourier Transform (IFFT) on R_FFT, and the offset coordinate (x-axis, y-axis) of registration can be calculated from the position of the maximum value in the cross-correlation result.
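Formulas (1) and (2) are not reproduced in this excerpt; a plausible reconstruction from the definitions above (the exact normalization used in the paper may differ) is:

```latex
% Formula (1): correlation of the two amplitude images (reconstruction)
R(u,v) = \sum_{i}\sum_{j} M_1(i,j)\, M_2(i+u,\ j+v)

% Formula (2): the same correlation computed in the frequency domain
R_{\mathrm{FFT}}(m,n) = S_1(m,n)\, S_2^{*}(m,n), \qquad
R(u,v) = \mathrm{IFFT}\left\{ R_{\mathrm{FFT}} \right\}
```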
What followed was interferogram generation. After the two images were aligned according to the offset coordinates, the phase and coherence coefficient could be calculated using Formula (3). In this formula, M_1 and M_2 represent the two complex images; E{·} denotes the mathematical expectation; * is the conjugate; γ is the complex coherence, and |γ| is the amplitude of the complex coherence; θ is the phase.
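Formula (3) is likewise missing from the extract; the standard InSAR complex-coherence estimate matching the symbols defined above is:

```latex
% Formula (3): complex coherence and interferometric phase (reconstruction)
\gamma = \frac{E\{ M_1 M_2^{*} \}}{\sqrt{ E\{ |M_1|^2 \}\, E\{ |M_2|^2 \} }}, \qquad
\theta = \arg\left( \gamma \right)
```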
The FPGA calculation flow of coarse registration and interferogram generation is shown in Figure 1. Firstly, a sub-image (i.e., 512 × 512) was taken from the center of each original image (i.e., 16,384 × 4096). Secondly, the two sub-images were sent to the two-dimensional FFT, including range FFT, matrix transposition, and azimuth FFT. Thirdly, after the two-dimensional FFT and IFFT, the offset was obtained from the maximum value. Finally, the coherence coefficient and interference phase of the two images could be calculated according to the offset coordinate (x-axis, y-axis). The operation of matrix transposition is essential in the process of two-dimensional FFT. The step of matrix transposition is highlighted in grey to illustrate that a specialized FPGA structure was required.
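The flow above can be modeled in a few lines of NumPy (a software sketch for clarity, not the FPGA pipeline; the function name and array sizes are illustrative):

```python
import numpy as np

def coarse_registration(img1, img2):
    """Estimate the pixel-level offset between two complex images via
    FFT-based cross-correlation of their amplitudes (Formulas (1)-(2))."""
    m1 = np.abs(img1)                    # amplitude of complex image 1
    m2 = np.abs(img2)                    # amplitude of complex image 2
    s1 = np.fft.fft2(m1)                 # range FFT + azimuth FFT
    s2 = np.fft.fft2(m2)
    r = np.fft.ifft2(s1 * np.conj(s2))   # cross-correlation surface
    peak = np.unravel_index(np.argmax(np.abs(r)), r.shape)
    # Convert the peak position to a signed offset (wrap-around aware).
    offsets = [p if p <= n // 2 else p - n for p, n in zip(peak, r.shape)]
    return tuple(offsets)                # (y-axis, x-axis) offset in pixels
```

For a pair of 512 × 512 sub-images this reproduces the offset search of Figure 1; on the FPGA the same steps map to range FFT, matrix transposition, azimuth FFT, conjugate multiplication, and IFFT.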


Gradient Descent Method
In on-board processing, the design of an FPGA system is faced with various strict nonlinear constraints, such as resources, processing capacity, power consumption, volume, and weight. Therefore, it is essential to study the design of an FPGA system under multiple constraints.
Mathematical optimization theory can shed light on solving nonlinear multi-constrained problems, which can be expressed by an objective function and constraints, as shown in Formula (4). If the objective function F can be expressed analytically, the nonlinear optimization problem can be solved mathematically by obtaining the optimum value X_j. However, since there are many types of FPGA logic resources and the resource values are discrete, neither the objective function nor the constraints can be expressed as a function, making it very difficult to apply optimization theory to the problem directly.
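Formula (4) is not shown in this excerpt; the description suggests a standard constrained form along the lines of:

```latex
% Formula (4): generic constrained optimization model (reconstruction)
\min_{X} \ F(X) \qquad \text{s.t.} \quad g_i(X) \le 0, \quad i = 1, \ldots, m
```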
That said, we introduced the gradient descent method in optimization theory into the design of the FPGA system to solve the dual-constraint problem between resources and processing capacity. The basic principles of the gradient descent method are described below.
The gradient descent method was proposed by Cauchy in 1847 and has become the basis of many constrained and unconstrained optimization methods [54]. The gradient descent method sets the negative gradient direction of the objective function as the search direction of each iteration, so that the value of the objective function decreases fastest in the first few steps. For a nonlinear function f with first-order continuous partial derivatives, where X contains n variables x_1, …, x_n, the gradient is defined by Formula (5).

We used the gradient descent method to map each FPGA scheme's resources and processing capacity into two-dimensional coordinates. We then analyzed the relationship between resources and processing capacity through the gradient between the schemes and identified the balanced scheme. The detailed steps are described in Section 5.
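Formula (5), referenced above but not reproduced in this excerpt, is presumably the standard gradient definition for f(X) with X = (x_1, …, x_n):

```latex
% Formula (5): gradient of the objective function (reconstruction)
\nabla f(X) = \left( \frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \ldots,\ \frac{\partial f}{\partial x_n} \right)^{\mathrm{T}}
```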


Design of System and Structure
This paper built a SoPC system based on the AXI protocol to achieve high efficiency in communication, storage, and processing, as described in detail in the following sections.

System Architecture and Data Communication
As the on-board processing algorithm becomes increasingly complex, the debugging cycle of traditional HDL finite state machines becomes longer, which increases the cost of maintenance. As an alternative, the SoPC solution has been widely used. As shown in Figure 2, the SoPC system has four parts: controlling, computing, arbitrating, and storing. For the controlling part, the Microblaze embedded processor soft core replaces the conventional HDL finite state machine, software programming supersedes hardware programming, and software testing replaces hardware debugging. In the SoPC system, control signals are exchanged through AXI4-Lite, data is processed through AXI4-Stream, and storage space is accessed through AXI4 [55].
The computing part realizes InSAR signal processing, containing the major operation units FFT and Cordic. FFT is used to transform data between the time domain and frequency domain, while Cordic is used to calculate the coherence and phase. The Matrix Transposition module is designed to improve the efficiency of accessing range and azimuth data and increase processing bandwidth.
The arbitration part based on AXI Interconnect consists of CrossBar, which can arbitrate reading/writing data operations between the FIFOs on both sides. The arbitration structure solves the problem caused by the algorithm function and the dual-channel storage structure. In the process of interferogram generation, either chip of DDR3 SDRAM includes both reading and writing operations simultaneously, which cannot be achieved without the arbitration structure. Therefore, CrossBar implements arbitration of reading/writing operations through the Shared-Address Multiple-Data (SAMD) topology to realize equivalent pipeline processing [56].


Hardware Structure of Graduated Registration
Before the interferogram generation, the images must be aligned in the range and azimuth directions. To improve the processing capacity, this paper proposed a specialized structure based on FPGA called "graduated registration." It fits the FPGA pipeline, thus improving the processing capacity and reducing the algorithm's complexity. The analysis is as follows.
The offset coordinates (x-axis, y-axis) are obtained when the registration process is completed. Before entering the interferogram generation process, the two images must be registered by the offset coordinates. The coordinate offset reflects the relative position of the main and auxiliary images. The offset coordinates (x-axis, y-axis) can be positive or negative; the auxiliary image can be moved either positively or negatively along the x- or y-axis. x-axis > 0 means that Image A moves in the positive X direction relative to Image B; similarly, y-axis > 0 means that Image A moves in the positive Y direction relative to Image B. Therefore, there are four cases of offset in the four quadrants, as shown in Figure 3.

If the offsets of X and Y both work in the DDR3 SDRAM, like 2D_Address shown in Figure 4, the algorithm's complexity is O(n^2) because the offset coordinate works in two loops. More importantly, when filtering data in two-dimensional space, the corresponding DDR3 SDRAM addresses are discontinuous, which severely degrades storage efficiency. To improve efficiency and take full advantage of the pipeline processing capacity of the FPGA, we designed a structure called graduated registration, as shown in Figure 4. The x-axis offset was moved from the DDR3 SDRAM into the FIFO, and only the y-axis offset worked in the DDR3 SDRAM. In this way, the algorithm's complexity was reduced to O(n), like 1D_Address, without the efficiency loss caused by discontinuous addresses.
The data flow works as follows. First, the initial address of the azimuth data is obtained according to the y-axis offset. Then, a particular row of the original data is read from the DDR3 SDRAM. Finally, two lines of data from the two DDR3 SDRAM chips are buffered in S1_PRE and S2_PRE to filter the range data, thus realizing the registration of two-dimensional data. As the two offsets, the y-axis and x-axis work in the DDR3 SDRAM and the FIFO, respectively, to perform two-dimensional registration; hence, we called the proposed structure "graduated registration".
According to the x-axis offset, both the S1_PRE and S2_PRE modules have two cases, as shown in Figure 4. Data in the modules can be divided into three segments: the C segment was used for the coherence coefficient and phase, the A segment was discarded, and the B segment was zeroed. In this structure, the unavailable data was zeroed or discarded, so the depth of the FIFO stayed unchanged (i.e., 4096).
The graduated registration structure proposed in this paper has two advantages. On the one hand, it ensures high data bandwidth when reading from DDR3 SDRAM, with lower algorithmic complexity and little efficiency loss. On the other hand, the length of the data buffered in this structure is kept constant by zero filling, which makes it easy to control and debug.
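A software model of this data flow may help (a sketch under assumed conventions — the function name, sign convention, and zero-fill layout are illustrative, not the RTL): the y-axis offset only changes the DDR row address, while the x-axis offset is applied inside the row buffer with zero filling, so addressing stays one-dimensional and contiguous.

```python
import numpy as np

def graduated_registration(image, y_off, x_off):
    """Model of graduated registration: y-offset via row (base) address,
    x-offset applied inside the row buffer (the FIFO) with zero fill."""
    rows, cols = image.shape
    out = np.zeros_like(image)
    for r in range(rows):            # one contiguous DDR burst per row: O(n)
        src = r + y_off              # y-axis handled by the row address
        if not (0 <= src < rows):
            continue                 # rows outside the image stay zero
        line = image[src]
        buf = np.zeros(cols, dtype=image.dtype)
        if x_off >= 0:               # shifted-out samples discarded (segment A),
            buf[:cols - x_off] = line[x_off:]   # tail zero-filled (segment B)
        else:
            buf[-x_off:] = line[:cols + x_off]
        out[r] = buf
    return out
```

Because the in-row shift and zero fill happen in the buffer, the FIFO depth stays constant, exactly as described for the A/B/C segments above.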
Figure 4. The graduated registration. "Base_Row" represents the initial row address in DDR. "Base_Col" means the initial column address in DDR. Both "Base_Row" and "Base_Col" are base addresses. "y-axis" and "x-axis" denote the offset addresses.


Matrix Transposition Based on AXI
Matrix transposition is an essential step in the 2D FFT calculation of coarse registration, and its efficiency significantly affects the total time of coarse registration. The burst length of the native interface of DDR3 SDRAM is 8 [36], yet the burst length of the AXI can reach 1024 data. The schemes of existing research on matrix transposition [36–44] are built on the native interface and cannot be used in the SoPC system directly. However, the existing schemes shed light on AXI matrix transposition. Inspired by a previous study [37], we found that the key to efficient matrix transposition data access lies in the balance of two-dimensional data: the numbers of range and azimuth data are the same. Therefore, the azimuth and range data in the designed structure each occupied half of the continuous data. The matrix block structure, including the reading and writing process, is shown in Figure 5. We took the transfer of 256 data of 64 bits at a time as an example to illustrate the reading and writing process. For the 512 × 512-point data to be registered, 16 lines of data were buffered. After that, we rearranged the data to take 16 range data in each row, then put the 256 data from 16 rows into a line. Then, the new array containing 32 lines was placed into DDR3 SDRAM in the direction of the arrows in Figure 5, until the writing process was completed. When data was read from the storage space, it was read from the entire row of the DDR3 SDRAM according to different row addresses to form an array. The array corresponded to 16 columns that were the transposed data.
When the burst length of the AXI changes, the efficiency and the matrix transposition structure change accordingly. Among the achievable solutions, the burst length could be selected as 1024, 256, 64, or 16; correspondingly, 32, 16, 8, or 4 lines of data needed to be buffered and transposed. As the burst length increased, the efficiency gradually increased, and the corresponding storage resources increased as well. In this paper, the coarse registration dealt with 512 × 512 floating-point complex images. If the most efficient transfer was to be achieved, the burst length was 1024, with 32 lines of image data buffered in RAM. Considering both the reading and writing processes, the required storage space was at least 2 × 512 × 32 × 64 bit = 2 Mb.
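The burst-friendly access pattern can be illustrated with a simple tile-based software model (a sketch; the actual structure in Figure 5 rearranges buffered lines rather than square tiles, but the principle — buffer a block on chip so both the write and the read side touch DDR in long contiguous bursts — is the same):

```python
import numpy as np

def block_transpose(mat, block=16):
    """Model of burst-friendly transposition: process the matrix in
    block x block tiles so each DDR access is one contiguous burst of
    block*block elements instead of strided single-element reads."""
    n = mat.shape[0]
    assert mat.shape == (n, n) and n % block == 0
    out = np.empty_like(mat)
    for i in range(0, n, block):
        for j in range(0, n, block):
            # each tile is buffered on chip, transposed locally, and
            # written back as one contiguous burst
            out[j:j + block, i:i + block] = mat[i:i + block, j:j + block].T
    return out
```

With 512 × 512 data, a burst length of 1024 (32 buffered lines), and 64-bit samples, buffering both the read and write sides needs at least 2 × 512 × 32 × 64 bit = 2 Mb of BRAM, matching the figure quoted above: longer bursts buy efficiency at the cost of buffer size.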
Under the condition that the image size remained unchanged, higher transmission efficiency led to more BRAM resources being occupied. This reflects the contradiction between processing capacity and resources in FPGA design: the limited amount of resources inevitably limits the processing capacity. Faced with this conflict, it was necessary to strike a balance from the system design perspective, so that the resources used could maximize the processing performance. To analyze and solve this problem, the following section examines the two-dimensional constraints of processing capacity and resources from a systematic perspective.

Design Method Based on Gradient Descent Theory
In FPGA design, high processing capacity can be achieved by spending more resources on parallel structures, yet there is a conflict between resources and processing capacity. Therefore, it is necessary to solve the following optimization problem: when faced with different constraints, including requirement and environment restrictions, how can one build a balanced system that meets the requirement of fast data processing while reducing resource usage as much as possible? Balance in this paper primarily refers to the trade-off between resources and processing capacity, and its strength is reflected by the quantitative relationship between the two. The dual-constraint problem is described below by analyzing the relationship between processing time and resources.
The relationship between processing capacity and resources is nonlinear: resources are allocated to different processes, including computing, controlling, and communicating, which means that only part of the resources can be used to improve processing capacity. Following the optimization model described by Formula (4), the relationship between resources and time can be described similarly, as in Formula (6). For structure j, R_j represents the used resources and T_j the consumed time; in other words, different structures have different R_j and T_j. The resource constraint R_0 represents the total amount of resources, and the time constraint T_0 comes from the application requirements.
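Formula (6) is missing from the extract; consistent with Formula (4) and the constraints named here, it is presumably of the form:

```latex
% Formula (6): resource/time trade-off model (reconstruction)
\min_{j} \ F(R_j, T_j) \qquad \text{s.t.} \quad R_j \le R_0, \quad T_j \le T_0
```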
In engineering, it is difficult to express the objective function in a specific formula, so we could not write it mathematically as in Formula (5). However, the idea of gradient descent from optimization theory can be brought into system design. Suppose there are n schemes in an FPGA design, whose consumed times and critical resources are (T_1, R_1), (T_2, R_2), …, (T_n, R_n). R_0 is the total resource of the FPGA, and T_0 is the system processing time. The FPGA schemes can be mapped into a two-dimensional space of R and T, where discrete points represent the schemes. The gradient between the points is defined as ∇G, expressed in Formula (7). ∆R represents the absolute difference in resources between schemes; ∆T represents the absolute difference in time; ∇G represents the relative system-level effect of trading resources against processing capacity. We defined γ as γ = ∆R/R_0, the impact of changing partial resources on the total resources. Similarly, we defined β as β = ∆T/T_0, the effect of changing partial time on the entire system time. Since the FPGA contains various resources, the total resource amount R_0 can represent different types of resources. However, the total system time T_0 is not a constant and varies with the design.
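Formula (7) is not reproduced in this excerpt; a reading consistent with the definitions of γ and β and with the numbers in Section 5 (e.g., for ∆ba, a 2.41% resource saving against a 97.6% time increase gives ∇G ≈ 0.025 < 0.1) is:

```latex
% Formula (7): gradient between two design schemes (reconstruction)
\nabla G = \frac{\gamma}{\beta} = \frac{\Delta R / R_0}{\Delta T / T_0}
```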
To realize fast processing, the analysis process of the proposed method in the two-dimensional R-T space was as follows:
(1) Given that the physical and time constraints are met (i.e., R_j ≤ R_0, T_j ≤ T_0), n initial points (R_j, T_j) are obtained through synthesis and estimation, 1 ≤ j ≤ n and j ∈ Z.
(2) Setting the point with T_k = MIN(T_j) (1 ≤ k ≤ n and k ∈ Z) as the initial state and each T_j (1 ≤ j ≤ n, j ∈ Z and j ≠ k) as the next state, ∇G between (R_k, T_k) and (R_j, T_j) is calculated. If ∇G meets the threshold ∇G_gate, scheme j is considered the more balanced design and becomes the new initial state, and the comparison is repeated.
∇G_gate is a threshold set by developers according to the requirements of the system; it was set as 3 in this paper to achieve high system performance. The following section presents a detailed analysis of the InSAR system based on the method proposed above.
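The steps above can be sketched as a small search routine (an illustrative reimplementation — the scheme names, data structure, and tie-breaking are assumptions, and the paper performs this comparison by hand rather than in code):

```python
def select_balanced_scheme(schemes, r0, g_gate=3.0):
    """Gradient-based trade-off search: `schemes` maps a scheme name to
    (resources, time); `r0` is the total resource budget of the FPGA."""
    # Step (1)/(2): start from the fastest scheme that meets the constraints.
    current = min(schemes, key=lambda k: schemes[k][1])
    improved = True
    while improved:
        improved = False
        r_cur, t_cur = schemes[current]
        t0 = t_cur                        # system time of the current design
        for name, (r, t) in schemes.items():
            if name == current or r > r0 or t <= t_cur or r >= r_cur:
                continue                  # only consider slower, smaller schemes
            grad = (abs(r_cur - r) / r0) / (abs(t - t_cur) / t0)  # gradient = γ / β
            # a large gradient means big resource savings per unit of extra
            # time, so the slower scheme is the more balanced design
            if grad >= g_gate:
                current = name
                improved = True
                break
    return current
```

For example, with hypothetical schemes f = (60, 100 ms), e = (40, 101 ms), d = (20, 103 ms) and r0 = 100, the search starts at f, moves to e (∇G = 20), then to d (∇G ≈ 10), mirroring the f → e → d reasoning in Section 5.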

Analysis of Threads in Interferogram Generation
This part analyzes the relationship between resources and processing time in interferogram generation. According to the algorithm introduced in Section 2, 16,384 × 4096 image points were to be calculated in the interferogram generation process, while only 512 × 512 points were to be calculated in the coarse registration process; thus, the former required much more time than the latter. Therefore, the time taken by interferogram generation was critical to the total system runtime.
If one 64-bit floating-point complex datum is regarded as a thread, the number of threads represents the degree of parallelism. Faced with the constraint of the 512-bit data width of one DDR3 SDRAM chip, the thread count can be set to 1, 2, or 4. Here, we designed an IP with 4 threads to find the critical resource. The synthesis results of the IP are shown in Table 1. Due to the nonlinear operations in the calculation process, the LUT took up the most significant proportion among the resources, so the LUT was the critical resource.

To increase the proportion of coarse registration in the system processing time, the thread of coarse registration was set as 1. With the system clock at 100 MHz, 512 × 512 points were calculated in 262,144 clocks, i.e., 2.62 ms in theory. Therefore, the ideal coarse registration time T_c was 7.86 ms. Similarly, to calculate 16,384 × 4096 points, the ideal interferogram generation time T_g was 640 ms. The results when the thread of interferogram generation changed from 1 to 4 are shown in Table 2, where T_i is the entire calculation time, T_i = T_c + T_g. The variation ∆R_LUT was obtained by subtraction between the schemes R_i, as shown in Table 3. When the threads changed from 1 to 4, ∇G was less than 0.1, which indicated that relatively few resources were saved while the processing capacity was lowered sharply. As ∇G was too low to meet the requirement ∇G ≥ ∇G_gate = 3, there was no balanced scheme. As for ∆ba (i.e., from scheme b to scheme a), the resources were saved by 2.41%, but the time increased by 97.6%.

When the thread was 4, with a 256-bit data width, the two images were read from the two DDRs, respectively. The calculation results, the 128-bit phase and the 256-bit coherence coefficients, were written into the two DDR3 SDRAM chips. Since the data width of either DDR3 SDRAM channel was no more than 512 bits, there was no efficiency loss when data passed through the arbitrating structure.
However, when the thread was 8, either the reading or the writing on a channel occupied the full 512 bits, which led to efficiency loss. Therefore, due to the hardware constraints, it was hard to analyze the performance of the 8-thread scheme.
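The channel-width constraint above can be sketched as follows. This is a minimal illustration under two assumptions that are ours, not stated formulas from the paper: each thread consumes one 64-bit complex sample per clock from an image, and each DDR3 channel transfers at most 512 bits per clock.

```python
# Sketch of the DDR3 channel-width constraint discussed above. Assumptions:
# each thread consumes one 64-bit complex sample per clock from an image,
# and each DDR3 channel transfers at most 512 bits per clock.
CHANNEL_BITS = 512
SAMPLE_BITS = 64

def channel_saturated(threads):
    """True if the threads demand the full 512-bit channel width (or more),
    so the arbitrating structure can no longer hide access conflicts."""
    return threads * SAMPLE_BITS >= CHANNEL_BITS

for t in (1, 2, 4, 8):
    print(t, channel_saturated(t))  # only 8 threads saturate a channel
```

Under these assumptions, 1, 2, and 4 threads leave headroom on each channel, while 8 threads demand the full 512-bit width, which matches the efficiency loss observed for the 8-thread scheme.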
As the analysis of schemes with different threads showed, a low ∇G (i.e., lower than 0.5) indicated that the processing capacity could be greatly improved at the cost of relatively few resources. As the thread of the FPGA increased, the processing capacity increased accordingly.
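The trade-off index ∇G used in this analysis can be sketched as below. Formula (7) is not reproduced in this excerpt, so the exact normalization is our assumption (relative resource saving divided by relative time penalty, each normalized by the higher-resource scheme); the function name `grad_g` and the LUT utilizations are hypothetical, chosen so the resource saving equals the 2.41% quoted for ∆ba, while the times T i = 327.86 ms and 647.86 ms come from the text.

```python
def grad_g(r_hi, t_hi, r_lo, t_lo):
    """Trade-off gradient between a higher-resource scheme (r_hi, t_hi)
    and a lower-resource scheme (r_lo, t_lo): relative resource saving
    divided by relative time penalty. A large value means many resources
    are saved for little extra time."""
    d_r = (r_hi - r_lo) / r_hi   # relative LUT saving
    d_t = (t_lo - t_hi) / t_hi   # relative time increase
    return d_r / d_t

GATE = 3.0  # the gate value used in the text

# Delta_ba: scheme b (2 threads, T_i = 327.86 ms) -> scheme a (1 thread,
# T_i = 647.86 ms); the LUT utilizations are hypothetical values chosen so
# the resource saving is the 2.41% quoted in the text.
g = grad_g(r_hi=0.3000, t_hi=327.86, r_lo=0.2928, t_lo=647.86)
print(round(g, 3))   # well below 0.1: no balanced scheme
print(g >= GATE)     # False
```

With these numbers the gradient is about 0.025, far below the gate of 3, reproducing the conclusion that no balanced scheme exists among the interferogram-generation thread counts.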

Analysis of Threads in Coarse Registration
This part discusses the relationship between the thread and resources in coarse registration. Coarse registration requires two-dimensional FFT and IFFT operations. The FFT IP core provided by Xilinx consumed 48 DSPs, and a total of 6 FFTs were needed to finish parallel processing. As the thread increased, the number of FFT cores doubled. Due to the limitation of the dual-channel DDR3 storage structure and the 512-bit data width, the maximum thread of coarse registration was 4. Similarly, the time could be calculated according to the working clock. Resources and time are shown in Table 4. When the thread of interferogram generation was 4 (scheme c), the ideal time T g was 160 ms; when the thread was 2 (scheme b), T g was 320 ms. As shown in Figure 6, ∆c_xx and ∆b_xx represented the results when the threads of interferogram generation were 4 and 2, respectively. ∇G of ∆c_fe was over 12, which meant that many resources were used but the time improvement was very small, so scheme e was more balanced than scheme f. If scheme e was used as the initial state, ∇G of ∆c_ed was greater than 3, which showed that scheme d was more balanced than scheme e. In other words, when the thread of interferogram generation was 4, the scheme with 1 thread in coarse registration was the balanced design. Similarly, under scheme b, ∇G was nearly doubled since the overall system processing time was almost doubled. It could be seen from ∆b_fd that reducing the use of resources by 24% only deteriorated the system processing time by less than 2%. So, as the processing time of the system increased, the scheme with 1 thread in coarse registration (scheme d) was still the balanced design among the schemes.
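The balanced-design search described above can be sketched as a greedy descent: starting from the fastest (highest-resource) scheme, step to a lower-resource scheme only while the trade-off gradient stays at or above the gate. The LUT counts below are hypothetical; the times follow the ideal values in the text (T g = 160 ms at 4 interferogram threads; T c = 7.86, 3.93, and 1.97 ms for 1, 2, and 4 coarse-registration threads).

```python
# Greedy search for the balanced design: accept a lower-resource scheme
# only while the trade-off gradient is at or above the gate.
GATE = 3.0

def pick_balanced(schemes, gate=GATE):
    """schemes: (name, luts, time_ms) ordered from most to fewest LUTs."""
    current = schemes[0]
    for nxt in schemes[1:]:
        d_r = (current[1] - nxt[1]) / current[1]  # relative LUT saving
        d_t = (nxt[2] - current[2]) / current[2]  # relative time penalty
        if d_t <= 0 or d_r / d_t >= gate:
            current = nxt   # nxt is the more balanced scheme
        else:
            break
    return current[0]

schemes = [("f", 120_000, 161.97),   # 4 threads in coarse registration
           ("e", 80_000, 163.93),    # 2 threads
           ("d", 60_000, 167.86)]    # 1 thread (hypothetical LUT counts)
print(pick_balanced(schemes))        # "d", matching the analysis above
```

Under these illustrative numbers both descent steps clear the gate, so the search terminates at scheme d, the 1-thread coarse-registration design identified in the text.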


Analysis of the Matrix Transposition Schemes
This part focuses on the matrix transposition structure when the thread of interferogram generation was 4 and the thread of coarse registration was 1. Matrix transposition with different burst lengths had different AXI efficiencies η. The utilization of resources and the efficiency of each scheme are shown in Table 5. The resource utilization comes from the synthesis in Vivado, not HLS. Since the matrix transposition structure could be designed separately, the resources R i could be found after synthesis in Vivado, and the structure efficiency could be tested on the FPGA board. Then, the coarse registration processing time T c could be calculated.
Similarly, the ∇G calculation results are shown in Figure 7. ∇G of ∆gh was 8.16, which meant that, compared with scheme g, the time of scheme h deteriorated by 1.1% while the resources were saved by 8.7%. However, since the values of ∇G for ∆hi and ∆hj were much less than 3, scheme h was the most balanced among the four schemes above. There would be efficiency loss during the storage process, so T g was above 160 ms. Then ∆T/T i would be reduced and the value of ∇G increased, again confirming that scheme h was the balanced design. To sum up, when the thread of interferogram generation was 4 and the thread of coarse registration was 1, the scheme with a burst length of 256 was the balanced design. Similarly, when the thread of interferogram generation was 2, the system processing time would be prolonged, and scheme h, with a burst length of 256, would still be the balanced design; scheme g, with a burst length of 1024, would exert less impact on the whole system while consuming more resources.
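The way the AXI efficiency η of a transposition scheme translates into coarse-registration time can be sketched as below. The 3 × 512 × 512 ideal cycle count is our assumption of three full passes over the block (consistent with the ideal 7.86 ms at 100 MHz quoted earlier); the η values per burst length are hypothetical, not the measured Table 5 entries.

```python
# Coarse-registration time as a function of memory-interface efficiency.
# IDEAL_CYCLES assumes three full passes over a 512x512 block, which
# reproduces the ideal 7.86 ms at a 100 MHz clock.
IDEAL_CYCLES = 3 * 512 * 512
F_CLK = 100e6  # 100 MHz system clock

def t_coarse_ms(eta):
    """Coarse-registration time in ms at memory-interface efficiency eta."""
    return IDEAL_CYCLES / (F_CLK * eta) * 1e3

print(round(t_coarse_ms(1.00), 2))   # 7.86 ms: the ideal time
for burst, eta in [(1024, 0.85), (256, 0.70), (64, 0.40)]:
    print(burst, round(t_coarse_ms(eta), 2))   # hypothetical eta values
```

Lowering η stretches the processing time hyperbolically, which is why short bursts (low η) cost disproportionately more time while long bursts buy only marginal speedup for extra buffering resources.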

Experimental Results and Discussion
The structure and analysis designed in this paper were tested on the experimental platform, as shown in Figure 8. The working course of our experimental platform was designed as follows. First, the two SAR images were sent from the ground verification system through the TLK2711 interface and written into the two DDR3 SDRAM chips of the FPGA, respectively. Then, the algorithmic processing, i.e., coarse registration and interferogram generation, started until all calculation results were saved. Finally, the calculation results were sent back to the ground verification system and verified against the MATLAB calculation results. The IP core AXI-Timer provided by Xilinx recorded the time used during algorithmic processing. The verification in this section was divided into two parts: Sections 5.1-5.3 verified the validity of the analysis in Section 4, while Section 5.4 verified the efficiency and accuracy of the structure designed in Section 3.

Test of Different Threads in Interferogram Generation Process
For the interferogram generation, the two images were read from the two DDR3 SDRAM chips, with the phase stored in DDRA and the coherence coefficient stored in DDRB. When the thread of coarse registration was 1, we set the threads of interferogram generation as 1, 2, 4, and 8, respectively; the test results are shown in Table 6. T c was the measured time of the coarse registration process, and T g was the estimated time of the interferogram generation process. T i was the sum of T g and T c . In practice, the loss of efficiency in the storage interface could not be avoided, leading to the differences between Tables 2 and 6. It could also be seen from Table 6 that the processing capacity rose nearly linearly as the threads increased from 1 to 4. Moreover, the T g of 8 threads was close to that of 4 threads, indicating that the dual-channel storage structure hampered the system's processing capacity. The calculation results of ∇G are shown in Figure 9. In the figure, ∇G < 0.8 < 3 meant that there was no balanced scheme. Moreover, it could be seen from the gap between ∆R LUT /R LUT and ∆T/T i that resource savings would sharply deteriorate the processing time. In addition, there would be a slight difference between the measured ∇G T and the theoretical ∇G B due to the efficiency loss in the interface or logic, as shown in Table 7. From this table, the relative error ε did not exceed 10% when the thread was 1, 2, and 4, which verified our analysis.
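The relative error ε between the measured and theoretical trade-off gradients can be sketched as below; normalizing by the theoretical value is our assumption about how ε is defined.

```python
# Relative error between a measured and a theoretical trade-off gradient,
# as tabulated in Table 7. Normalizing by the theoretical value is an
# assumption, not a formula quoted from the paper.
def rel_error(g_measured, g_theoretical):
    return abs(g_measured - g_theoretical) / g_theoretical

# With the rounded values quoted later (grad_T ~ 3.00, grad_B ~ 3.37) this
# gives ~11%; the paper reports 10.14% from unrounded measurements.
print(f"{rel_error(3.00, 3.37):.1%}")
```

The sketch makes the point of the verification concrete: as long as ε stays small, the theoretical ∇G analysis can stand in for on-board measurements.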

Test of Different Threads in Coarse Registration
It has been pointed out in Section 4.3 that when the thread of interferogram generation was 4, the scheme with 1 thread in coarse registration was the balanced design. In practice, the actual time of the 8-thread interferogram generation scheme was 164.7 ms, which was longer than the theoretical value of the 4-thread scheme (i.e., 160 ms); thus, when the thread of interferogram generation was 8, 1 thread in coarse registration was still, in theory, the balanced design of the system. So, as shown in Table 8, we only needed to compare the schemes with 1 thread and 2 threads in coarse registration to find the balanced design. As shown in Table 9, ∇G between schemes f' and e' was around 3, indicating that the scheme with 1 thread in coarse registration was the balanced design. The experimental result confirmed our previous analysis. From Table 4, ∇G in theory between thread 1 and thread 2, that is, between schemes d and e, could be calculated according to Formula (7): ∇G B was about 3.37, while ∇G T was around 3.00. Therefore, the relative error ε of ∇G was 10.14%. The value of ε was relatively large for two reasons. First, the theoretical analysis did not take into account the efficiency loss during storage accesses with continuous addresses. In addition, after the 512 × 512 points in the center were extracted from the 16,384 × 4096 image, the addresses changed every 512 points taken, which also added to the efficiency loss.

Verification of Matrix Transposition Schemes
The matrix transposition schemes' theoretical resource requirements and efficiency were pointed out in Section 4.4. Now that the test results of interferogram generation and coarse registration were ready, this section focused on verifying the matrix transposition schemes on FPGA. To facilitate the comparison with the theoretical analysis, the threads of interferogram generation and coarse registration were set as 4 (scheme c') and 1 (scheme e'), respectively. Different matrix transposition schemes had varying efficiency, influencing the coarse registration processing time T c , as shown in Table 10. The comparison between the measured value ∇G T (calculated from Table 10) and the theoretical value ∇G B (calculated from Table 6) is displayed in Figure 10. It can be seen that ∇G T of ∆g'h' was above 8 while ∇G T of ∆h'i' was below 2, which confirmed the theoretical analysis that scheme h was more balanced than schemes g, i, and j. Due to the loss of efficiency, the measured times T g and T i were longer than the theoretical values. As a result, the actual value (∇G T ) was above the theoretical value (∇G B ), which, in turn, increased the relative error. Yet, the relative error was still lower than 15%. It can be proved that under the same resource condition, if the relative error between the measured values (T c_test , T g_test ) and the theoretical values is within 20%, that is, 0.8 T c ≤ T c_test ≤ 1.2 T c and 0.8 T g ≤ T g_test ≤ 1.2 T g , then the relative error of ∇G will also be within 20%.
From the perspective of engineering, once the gate ∇G gate is set (e.g., 3) and the relative error range is set (e.g., 20%), the two-dimensional constraint problem of resources and processing capacity can be solved using the theoretical analysis of FPGA proposed in this paper, thus shortening the system optimization time.
We analyzed the influence on the system when the thread of interferogram generation was 4. When the thread of interferogram generation changed from 4 to 1, the results are shown in Figure 11. ∇G T of either ∆a'_g'h' or ∆a'_h'i' was above 3, which meant that schemes h' and i' were more balanced than the others. We can see that when the requirement of system processing capacity changed, the balanced scheme changed accordingly. Such a changing relationship can be dynamically reflected by the index ∇G proposed in this paper.

Verification and Comparison of System Performances
This section verifies the system performance from the perspective of accuracy and efficiency.
By comparing the measured results with the theoretical values, the accuracy verification calculates the absolute and relative errors, as shown in Table 11. The absolute error of the phase and coherence coefficient was less than 2 × 10 −7 , and the relative error was less than 5 × 10 −3 , which showed that the balanced FPGA design proposed in this paper realized the expected algorithm with high precision. As for efficiency verification, we listed the processing capacities of different threads to find the range of system processing capacity. The processing capacity was then expressed as the throughput and compared with the existing research. The formula for throughput is B = D/T, where D is the number of data points and T is the entire calculation time. The throughput index B represents how fast data points are processed, so it can serve as a yardstick for processing capacity.
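The throughput yardstick B = D/T can be illustrated directly with figures from the text: taking D as the 16,384 × 4096 interferogram points and T as the measured 8-thread time of 164.7 ms reproduces the quoted 407.46 × 10 6 points/s.

```python
# Throughput yardstick B = D / T: processed data points per second.
def throughput(points, time_s):
    """Processed points per second."""
    return points / time_s

D = 16_384 * 4_096        # interferogram points from the text
B = throughput(D, 164.7e-3)   # measured 8-thread time of 164.7 ms
print(f"{B/1e6:.2f} M points/s")   # 407.46 M points/s
```

The same function applied to any processor's point count and runtime gives a directly comparable number, which is how the FPGA, GPU, and DSP figures below are put on one scale.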
In terms of efficiency verification, we compared the performance of the FPGA with processors such as the DSP and GPU in coarse registration and interferogram generation. The experimental results are shown in Table 11. Based on the dual-channel storage structure, when the thread of interferogram generation was 8 and the FPGA working clock was 100 MHz, the throughput of the FPGA reached 407.46 × 10 6 points/s, which exceeded the 386.32 × 10 6 points/s of the NVIDIA GTX Titan Black GPU. When the thread of interferogram generation was 1, the throughput reached 94.52 × 10 6 points/s, which was still far beyond the processing capacity of the DSP. Notably, fetching data discontinuously in the calculation process would hamper the data transfer rate of the DSP/GPU, which would, in turn, reduce the throughput. However, the proposed specialized structure reduced the data transfer loss of the FPGA. Thus, the proposed FPGA structure had advantages over the GPU and DSP in the computationally intensive processes of InSAR.
We have pointed out that when the thread of interferogram generation was 8, 1 thread in coarse registration was the balanced design. At this time, our FPGA took at least 9.24 ms to finish the coarse registration with scheme g', while the slowest scheme, j', took 39.75 ms. Scheme h' took 11.16 ms as the optimum trade-off, which only increased the entire time by 1.1% but reduced the FPGA BRAM usage by 8.7%. The throughput is shown in Table 11. The DSP processed an image with 512 × 512 points in 7.182 ms, demonstrating a slight advantage over the FPGA. To process a 2048 × 2048 image, the throughput of the DSP (i.e., 72.92 × 10 6 points/s) was over twice that of the FPGA (i.e., 28.37 × 10 6 points/s), which showed that the DSP had better processing capacity in coarse registration. However, coarse registration took up less than 10% of the entire time, which meant that it had little influence on the overall processing capacity of the system. Therefore, it was feasible to realize coarse registration and interferogram generation efficiently on the FPGA for remote sensing images. The processing capacity could be further improved by raising the working clock frequency of the FPGA.
The experimental results indicated that the proposed method could be applied to InSAR on-board processing in the field of ocean observation. Covering a broader area, the amount of global ocean observation data is much larger than that of land observation, so the traditional processing methods on the ground are unsuitable given the limited speed of the satellite-ground communication network. By reducing the volume of downlink data, on-board processing could fulfill the requirements of ocean observation. Thus, the proposed method could contribute to ocean observation research by improving the performance of spaceborne remote sensing processing. To the best of our knowledge, little research has applied on-board processing technology to other fields, such as landslide detection, planetary missions, and volcano monitoring. These fields prioritize obtaining high-precision tiny deformations to extract landslide, planetary, and volcano information; in other words, they put more emphasis on high resolution than on real-time performance at the current stage. However, with the development of on-board processing and multi-satellite collaboration technology [57,58], more complex InSAR algorithms could be realized. In the future, on-board processing technology could also be applied in other InSAR scenarios, which requires more research to balance the needs for accuracy and timeliness.
Focusing on the key issue of on-board processing, that is, the trade-off between resources and processing capacity, the proposed method could be applied in on-board processing technology. In addition, it can also be applied to other fields, such as edge computing [59,60] and UAV platforms [61].

Conclusions
In this paper, research on the hardware structure and design method was carried out based on two processes of InSAR processing, namely coarse registration and interferogram generation. First, the proposed efficient system ensures fast data processing from the perspectives of data communication, processing flow, and storage. Second, given the dual constraints of resources and processing capacity, this paper proposes an FPGA design method based on gradient descent theory. By mapping the two constraints to two-dimensional coordinates, this method helps identify the optimum trade-off through gradient values, providing guidance for engineering practice and, thus, shortening the system optimization process. In the empirical part, the experimental results verify the system's accuracy, the structure's efficiency, and the design method's effectiveness, and show that the proposed method can be applied to the field of ocean observation. Finally, this paper discusses the generalizability of the proposed method, arguing that it is most suitable for ocean observation research and can potentially be applied to other fields as on-board processing technology develops and algorithm complexity increases.
In the near future, we will verify the proposed FPGA design method in other algorithms and study the generalizability of the design method under multiple constraints.