RPC-Based Orthorectification for Satellite Images Using FPGA

Conventional rational polynomial coefficients (RPC)-based orthorectification methods are unable to satisfy the demands of timely responses to terrorist attacks and disaster rescue. To accelerate the orthorectification processing speed, we propose an on-board orthorectification method, i.e., a field-programmable gate array (FPGA)-based fixed-point (FP)-RPC orthorectification method. The proposed RPC algorithm is first modified using fixed-point arithmetic. Then, the FP-RPC algorithm is implemented using an FPGA chip. The proposed method is divided into three main modules: a reading parameters module, a coordinate transformation module, and an interpolation module. Two datasets are applied to validate the processing speed and accuracy that are achievable. Compared to the RPC method implemented using Matlab on a personal computer, the throughputs from the proposed method and the Matlab-based RPC method are 675.67 Mpixels/s and 61,070.24 pixels/s, respectively. This means that the proposed method is approximately 11,000 times faster than the Matlab-based RPC method to process the same satellite images. Moreover, the root-mean-square errors (RMSEs) of the row coordinate (ΔI), column coordinate (ΔJ), and the distance ΔS are 0.35 pixels, 0.30 pixels, and 0.46 pixels, respectively, for the first study area; and, for the second study area, they are 0.27 pixels, 0.36 pixels, and 0.44 pixels, respectively, which satisfies the correction accuracy requirements in practice.


Introduction
Orthorectification is a process that orthorectifies an image onto its upright planimetry map and removes the perspective angle [1][2][3]. Orthorectification is a prerequisite for remotely sensed (RS) image applications in areas such as land resource investigation, disaster monitoring, forestry inventory, and environmental changes analysis. The RS image that is orthorectified not only contains the geometric accuracy of the map but also has the features of the remote sensing image. In the past 20 years, many orthorectification methods were proposed. For example, Zhou et al. [2] presented a comprehensive study on theories, algorithms, and methods of large-scale urban orthoimage generation. Zhou [3] proposed a near real-time orthorectification method for mosaic of video flow acquired by an unmanned aerial vehicle (UAV). Aguilar et al. [4] used rigorous model and rational function model to orthorectify GeoEye-1 and WorldView-2 images and assessed the geometric accuracy of the orthophoto. The results showed that the best horizontal geo-positioning accuracies were acquired by using third order rational functions with vendor's RPC coefficients data. Marsetič et al. [5] presented an automatic processing chain for orthorectification of optical pushbroom sensors. Habib et al. [6] proposed an approach using generated orthophotos from frame camera to improve the orthorectification of hyperspectral pushbroom scanner imagery. However, these studies on orthorectification were based almost entirely on ground-image processing systems, which is unable to meet the demand with respect to time-critical disasters. Thus, it is important to determine how to improve the speed of the orthorectification process when used in the on-board processing of a spacecraft.
With the increasing demands in (near) real-time RS imagery applications for applications such as military deployments, quick response to terrorist attacks, and disaster rescue (e.g., flooding monitoring), the on-board implementation of orthorectification has attracted much research worldwide in recent years. To increase the speed of image processing, researchers have proposed multiple parallel-processing methods and employed hardware acceleration such as the approach by Warpenburg and Siegel [7], who performed resampling in a single instruction stream-multiple data stream environment. Wittenbrink et al. [8] presented optimal concurrent-read-exclusive-write and exclusive-read-exclusive-write parallel-random-access-machine algorithms for spatial image warping. Liu et al. [9] proposed a parallel algorithm that is focused on massive remotely sensed orthorectification. Dai and Yang [10] proposed a fast graphic processing unit (GPU)-central processing unit (CPU) cooperative processing algorithm that is based on computing unified device architecture for the orthorectification of RS images. Reguera-Salgado et al. [11] proposed a method for the real-time geocorrection of images from airborne pushbroom sensors using the hardware acceleration and parallel-computing characteristics of modern GPUs. Quan et al. [12] presented an optical aerial image orthorectification parallel algorithm that employs GPU acceleration. These ground-based parallel-processing systems have increased to an extent the processing speed for RS image orthorectification. However, the RS images still need to be sent back to the ground-based processing centers. However, this process is time consuming. In addition, most parallel-processing methods are based on the multiple task operating system of the GPU, which cannot essentially solve the problem of a serial instruction method.
To realize on-board orthorectification in (near) real-time, an efficient approach is to apply field-programmable gate array (FPGA) hardware architecture because FPGA chips offer a highly flexible design, scalable circuits, and a high efficiency in data processing for its pipeline structure and fine-grained parallelism. In recent decades, researchers have widely used FPGA for image processing applications. Examples are Halle can coworkers' [13] proposed on-board image data processing system based on the neural network processor NI100, digital signal processors, and FPGA. Eadie et al. [14] investigated the use of FPGA for the correction of geometric image distortion. Kumar et al. [15] realized the real-time correction of images using an FPGA under a dynamic environment. Escamilla-Hernández et al. [16] and Kate [17] used an FPGA to implement data compression. Tomasi et al. [18] proposed a stereo vision algorithm using an FPGA to perform the correction of video graphics array images (57 fps). Pal et al. [19], Wang et al. [20], and Zhang et al. [21] applied FPGAs to accelerate the image data and signal filtering processes. Ontiveros-Robles et al. [22,23] proposed FPGA-based hardware architectures for real-time edge detection using fuzzy logic algorithm. Li et al. [24,25] utilized FPGAs to realize the real-time processing of video images to remove snow and fog. Huang et al. [26] proposed an FPGA-based method for the on-board detection and matching of the feature points. Huang et al. [27] presented a new FPGA architecture of a fast and brief algorithm for on-board corner detection and matching.
To the best of our understanding, research into FPGA hardware systems has focused mainly on the real-time correction of video images, noise removal, edge detection, etc., and there are few studies related to on-board orthorectification. Zhou et al. [28] first presented the concept of "on-board geometric correction", but details pertaining to its on-board implementation were not given. Zhou [3] Sensors 2018, 18, 2511 3 of 24 proposed a method for a real-time mosaic of video flow acquired by a small low-cost unmanned aerial vehicle. However, the method was implemented based on software, a serial instruction system, which would affect the real-time processing efficiency. Thus, this paper proposes a FPGA-based method for the on-board implementation of orthorectification. The proposed method can be divided into three modules: reading parameters module, coordinates transformation module, and interpolation module.
The major contribution of this study is a FPGA-based method, in which a traditional orthorectification algorithm is modified for on-board image (near) real-time orthorectification.
The paper is organized as follows. Section 2 describes the proposed RPC algorithm, i.e., fixed-point-based RPC (FP-RPC) algorithm, and gives the FPGA implementation process of the FP-RPC algorithm. Section 3 provides an experimental comparison of the proposed method using IKONOS-2 data and SPOT-6 data. Section 4 discusses the rectification accuracy by FPGA and PC platforms, and processing speed and resource consumption of these two platforms. Finally, Section 5 gives some conclusions.

RPC-Based Orthorectification Using an FPGA Chip
High-resolution satellite sensors are different from conventional aerial frame perspective imaging, and generally apply linear-array CCD pushbroom imaging technology. To deal with various types of images, many geometric processing models and algorithms are presented. One of the most widely used models is the rational polynomial coefficient (RPC) model, which is a general imaging model that is independent of the satellite sensor and platform. Many modern satellite images are equipped with rational polynomial coefficients (RPCs). Unlike rigorous physical models that are based on the collinear equation, which uses the ephemeris, attitude information, etc., to establish the acquisition geometry of the sensors, the RPC model does not require knowledge of the interior orientation elements and exterior orientation elements, which are sometimes not provided by vendors. The RPC model can produce uniform accuracy with a rigorous physical model, and is a simple generalized model. The RPC model has been widely applied to orthorectify satellite images with the increasing utilization of high-resolution images, as in [29][30][31][32][33][34][35]. The details of RPC orthorectification are given in [29,30].
In this study, the RPC algorithm is implemented using FPGA. Usually, an FPGA chip can offer a highly flexible design, scalable circuits, and a high efficiency in data processing for its pipeline structure and fine-grained parallelism. Moreover, an FPGA chip has advantages in size, weight, and power (SWaP) compared to GPU and CPU, which is helpful to integrate the FPGA into the on-board system. However, the traditional RPC algorithm for orthorectification is computationally costly because of floating-point operations in the RPC algorithm. To implement on-board orthorectification using an FPGA chip in (near) real-time, a fixed-point-based RPC (FP-RPC) algorithm is proposed that can reduce the computation cost significantly. The details of the proposed method are given below.

Proposed RPC Algorithm
FP processing is a method that accelerates the calculation [36,37]. To make the transformation between a fixed-point variable and a floating-point variable, multiplication by a constant is necessary to maintain the precision. When the constant is set to a power of 2, the multiplication can be seen as a single bit shift, i.e., where F is a floating-point variable, F is a fixed-point variable, and τ is a scale factor, which affects the binary accuracy of the resulting integer representation. A larger scale factor will produce a higher degree of the binary accuracy.
In the proposed FP-RPC algorithm, all of the variables and constants are transformed to integers using Equation (1). Table 1 gives the integer variables and their scale factors. In Table 1 and Hei are geodetic coordinates, which represent the longitude, latitude, and height, respectively. Lat off , Lat scale , Lon off , Lon scale , H off , H scale , Line off , Line scale , Samp off , and Samp scale are the parameters for normalization. Samp and Line represent the image coordinates, sample and line. According to Fraser et al. [29] and Grodecki et al. [30] and Equation (1), the normalized coordinates are converted into integers by Moreover, the polynomials are converted into integers by where ND LS = Num L Den L Num S Den S In addition, the normalized image coordinates (X , Y ) are converted into integers by Finally, the image coordinates (Samp , Line ) are converted into integers by

Parallel Computation of Orthorectification Using an FPGA
Many factors affect the computation speed when an FPGA is adopted, such as the optimal design of algorithms and the logical resource of the utilized FPGA. By analyzing the structure of the FP-RPC algorithm and optimizing it, an FPGA-based hardware architecture for FP-RPC-based orthorectification is designed, as shown in Figure 1. As described in Equations (2)-(9), their structures are similar. It is convenient for FPGAs to be implemented in parallel. As shown in Figure 1, the FPGA-based FP-RPC module can be divided into three submodules, that is, Read_parameter_mod (RPM), which is used to send parameters to other modules; Coordinate_Transform_mod (CTM), which is applied to transform geodetic coordinates to image coordinates; and Interpolation_mod (IM), which is utilized to perform bilinear interpolation. The details of these modules are given as follows.

1.
For RPM, the coefficients of RPC can be calculated by least-squares adjustment [38]. According to Tao et al. [38], the computing processes of the RPC coefficients are as follows. Equation (7) can be rewritten as F X = Num S (P, L, H) − 2 τ 2 XDen S (P, L, H) = 0 (10) Thus, the matrix form of error equation can be expressed as where (12) is solved by least-squares algorithm, and the solutions of RPC coefficients can be represented as The solution for Equation (13) is acquired by an iterative process. The entire algorithm of Equation (13) has been implemented in our previous work [39], in which the detailed implementing process can be found. The normalization parameters can be calculated by the following equations. where n is the number of ground control points (GCPs). When the enable signal is being received, the geodetic coordinates (Lon, Lat, Hei) stored in the RAMs are read, and sent to Coordinate_Transform_mod (CTM) with the attained parameters and the start signal (Start_Sig) in the same clock cycle.

2.
When the Start_Sig, the constants, and the geodetic coordinates are being received in the CTM, the normalized coordinates (P, L, H) are first calculated in the regularization module (Regulation_mod, ReM). Then, the normalized coordinates and the done signal of ReM (ReM_Done_Sig) are sent to the polynomial module (Polynomial_mod, PM) with a i , b i , c i , and d i (i = 1 to 20) in the same clock cycle to compute the numerators and denominators (Num L , Num S , Den L , and Den S ) of Equation (7). Subsequently, when the done signal (PM_Done_Sig) of PM, Num L , Num S , Den L , and Den S are being received, the normalized coordinates (X, Y) of the image coordinates are calculated. Finally, when the normalized coordinates (X, Y) and the done signal (RaM_Done_Sig) of the ratio module (Ratio_mod, RaM) are being received, the image coordinates (Samp, Line) and the done signal (RCM_Done_Sig) of the image coordinate calculation module (Row_Clm_mod, RCM) are acquired and sent to the interpolation module (Interpolation_mod, IM) in the same clock cycle.

3.
When the image coordinates (Samp, Line) and RaM_Done_Sig are being received in IM, the gray of pixel (Samp, Line) is obtained by interpolating, and the done signal (IM_Done_Sig) of IM is produced.

4.
When the posedge clk of the signal, ALL_Done_Sig is being detected, the processing is finished.  (17) where n is the number of ground control points (GCPs). When the enable signal is being received, the geodetic coordinates (Lon, Lat, Hei) stored in the RAMs are read, and sent to Coordinate_Transform_mod (CTM) with the attained parameters and the start signal (Start_Sig) in the same clock cycle.

Read Parameter Module
To ensure that the constants, geodetic coordinates, and the start signal (Start_Sig) are sent in the same clock cycle, a parallel module (i.e., the RPM) is designed (see Figure 2). In the RPM, the constants are assigned corresponding values, while the geodetic coordinates are stored in RAM. In this design, all values are expressed using a fixed point of 32 bits to ensure computational accuracy.
In the RPM, the geodetic coordinates are sent to the next module according to the order of the column. First, the address of RAM is initialized as 0. When the enable signal is detected, the first group of geodetic coordinates (Lat 0 , Lon 0 , Hei 0 ) is read from the RAM and sent to the next module with the constants and the Start_Sig in the same clock cycle. Starting from the second group of geodetic coordinates, the rules for reading and sending geodetic coordinates are changed. In other words, after the second group of geodetic coordinates (Lat 1 , Lon 1 , Hei 1 ), the geodetic coordinates will be read and sent unless the enable signal and the feedback signal (Feedback_Sig), which are sent by the interpolation module, are detected at the same time. After the final group of geodetic coordinates are read and sent, if the Feedback_Sig is received, the done signal (ALL_Done_Sig) of orthorectification is produced. When the ALL_Done_Sig is detected, the process of orthorectification is stopped.
constants are assigned corresponding values, while the geodetic coordinates are stored in RAM. In this design, all values are expressed using a fixed point of 32 bits to ensure computational accuracy.
In the RPM, the geodetic coordinates are sent to the next module according to the order of the column. First, the address of RAM is initialized as 0. When the enable signal is detected, the first group of geodetic coordinates (Lat0, Lon0, Hei0) is read from the RAM and sent to the next module with the constants and the Start_Sig in the same clock cycle. Starting from the second group of geodetic coordinates, the rules for reading and sending geodetic coordinates are changed. In other words, after the second group of geodetic coordinates (Lat1, Lon1, Hei1), the geodetic coordinates will be read and sent unless the enable signal and the feedback signal (Feedback_Sig), which are sent by the interpolation module, are detected at the same time. After the final group of geodetic coordinates are read and sent, if the Feedback_Sig is received, the done signal (ALL_Done_Sig) of orthorectification is produced. When the ALL_Done_Sig is detected, the process of orthorectification is stopped.

Coordinate Transformation Module
As shown in Figure 1, for the CTM, the inputs contain the constants, the geodetic coordinates, and the Start_Sig, while the outputs include image coordinates and the done signal of this module. The CTM can be divided into four submodules, namely ReM, PM, RaM, and RCM. Details regarding these four submodules are as follows.

Coordinate Transformation Module
As shown in Figure 1, for the CTM, the inputs contain the constants, the geodetic coordinates, and the Start_Sig, while the outputs include image coordinates and the done signal of this module. The CTM can be divided into four submodules, namely ReM, PM, RaM, and RCM. Details regarding these four submodules are as follows.

• Regulation Module
According to Section 2.1, the geodetic coordinates (Lat, Lon, Hei) should be first transformed as the normalized coordinates (L, P, H) based on Equation (2) because this operation can minimize the introduction of errors during the computation of the numerical stability of equations [13]. As shown in Equation (2), the forms of these equations are uniform. In other words, they are suitable for implementation using FPGA. To obtain the normalized coordinates (L, P, H) of the geodetic coordinates (Lat, Lon, Hei) using an FPGA chip, a parallel computation architecture is presented in Figure 3. In Figure 3, the structures of "Normalize Lat", "Normalize Lon", and "Normalize Hei" are similar. Thus, only the schematic diagram of "Normalize Lat" is presented (see Figure 4).
As shown in Figure 4, during the computation process, 1 divider, 10 adders, 10 flipflops, and 16 multiplexer units are mainly used to normalize the Lat. In this design, the relationship among "Normalize Lat", "Normalize Lon", and "Normalize Hei" is parallel. The normalized coordinates (L, P, H) are obtained in the same clock cycle as the done signal.

 Regulation Module
According to Section 2.1, the geodetic coordinates (Lat, Lon, Hei) should be first transformed as the normalized coordinates (L, P, H) based on Equation (2) because this operation can minimize the introduction of errors during the computation of the numerical stability of equations [13]. As shown in Equation (2), the forms of these equations are uniform. In other words, they are suitable for implementation using FPGA. To obtain the normalized coordinates (L, P, H) of the geodetic coordinates (Lat, Lon, Hei) using an FPGA chip, a parallel computation architecture is presented in Figure 3. In Figure 3, the structures of "Normalize Lat", "Normalize Lon", and "Normalize Hei" are similar. Thus, only the schematic diagram of "Normalize Lat" is presented (see Figure 4).
As shown in Figure 4, during the computation process, 1 divider, 10 adders, 10 flipflops, and 16 multiplexer units are mainly used to normalize the Lat. In this design, the relationship among "Normalize Lat", "Normalize Lon", and "Normalize Hei" is parallel. The normalized coordinates (L, P, H) are obtained in the same clock cycle as the done signal.

 Regulation Module
According to Section 2.1, the geodetic coordinates (Lat, Lon, Hei) should be first transformed as the normalized coordinates (L, P, H) based on Equation (2) because this operation can minimize the introduction of errors during the computation of the numerical stability of equations [13]. As shown in Equation (2), the forms of these equations are uniform. In other words, they are suitable for implementation using FPGA. To obtain the normalized coordinates (L, P, H) of the geodetic coordinates (Lat, Lon, Hei) using an FPGA chip, a parallel computation architecture is presented in Figure 3. In Figure 3, the structures of "Normalize Lat", "Normalize Lon", and "Normalize Hei" are similar. Thus, only the schematic diagram of "Normalize Lat" is presented (see Figure 4).
As shown in Figure 4, during the computation process, 1 divider, 10 adders, 10 flipflops, and 16 multiplexer units are mainly used to normalize the Lat. In this design, the relationship among "Normalize Lat", "Normalize Lon", and "Normalize Hei" is parallel. The normalized coordinates (L, P, H) are obtained in the same clock cycle as the done signal.

• Polynomial Module
When the ReM_Done_Sig and (L, P, H) are being received by the PM module, the PM module starts to work. As shown in Equations (4) and (5), these polynomials have a uniform form, which are suitable for the implementation of an FPGA chip in parallel. In these equations, variables such as LH, LP, and PH are shared. To implement these polynomials in parallel using an FPGA chip, a parallel computation architecture is proposed in Figures 5 and 6. As shown in those figures, the PM module is divided into two parts: one is used to perform multiplication and the other is applied to manipulate addition. When performing addition, some special operations about the positive and negative sets of data should be considered. Thus, for the additions in Figure 6, each of them is extended to a similar form, as shown in Figure 7, taking the addition between a 3 P and a 4 H as an example. In the example, three situations are considered: (i) a 3 P and a 4 H are both positive; (ii) a 3 P and a 4 H are both negative; and (iii) a 3 P and a 4 H have opposite signs. The details for an extended addition are shown in Figure 7.
To implement each polynomial, 35 multipliers are utilized in the multiplication, and 19 extended additions are used. In each extended addition, three flipflops, four selectors, seven adders, and eleven multiplexers are applied. After processing the PM module, four sums, i.e., Num L , Num S , Den L , and Den S , are obtained with the done signal of the PM module, PM_Done_Sig, in the same clock cycle.

 Polynomial Module
When the ReM_Done_Sig and (L, P, H) are being received by the PM module, the PM module starts to work. As shown in Equations (4) and (5), these polynomials have a uniform form, which are suitable for the implementation of an FPGA chip in parallel. In these equations, variables such as LH, LP, and PH are shared. To implement these polynomials in parallel using an FPGA chip, a parallel computation architecture is proposed in Figures 5 and 6. As shown in those figures, the PM module is divided into two parts: one is used to perform multiplication and the other is applied to manipulate addition. When performing addition, some special operations about the positive and negative sets of data should be considered. Thus, for the additions in Figure 6, each of them is extended to a similar form, as shown in Figure 7, taking the addition between a3P and a4H as an example. In the example, three situations are considered: (i) a3P and a4H are both positive; (ii) a3P and a4H are both negative; and (iii) a3P and a4H have opposite signs. The details for an extended addition are shown in Figure 7.
To implement each polynomial, 35 multipliers are utilized in the multiplication, and 19 extended additions are used. In each extended addition, three flipflops, four selectors, seven adders, and eleven multiplexers are applied. After processing the PM module, four sums, i.e., NumL, NumS, DenL, and DenS, are obtained with the done signal of the PM module, PM_Done_Sig, in the same clock cycle.

 Polynomial Module
When the ReM_Done_Sig and (L, P, H) are being received by the PM module, the PM module starts to work. As shown in Equations (4) and (5), these polynomials have a uniform form, which are suitable for the implementation of an FPGA chip in parallel. In these equations, variables such as LH, LP, and PH are shared. To implement these polynomials in parallel using an FPGA chip, a parallel computation architecture is proposed in Figures 5 and 6. As shown in those figures, the PM module is divided into two parts: one is used to perform multiplication and the other is applied to manipulate addition. When performing addition, some special operations about the positive and negative sets of data should be considered. Thus, for the additions in Figure 6, each of them is extended to a similar form, as shown in Figure 7, taking the addition between a3P and a4H as an example. In the example, three situations are considered: (i) a3P and a4H are both positive; (ii) a3P and a4H are both negative; and (iii) a3P and a4H have opposite signs. The details for an extended addition are shown in Figure 7.
To implement each polynomial, 35 multipliers are utilized in the multiplication, and 19 extended additions are used. In each extended addition, three flipflops, four selectors, seven adders, and eleven multiplexers are applied. After processing the PM module, four sums, i.e., NumL, NumS, DenL, and DenS, are obtained with the done signal of the PM module, PM_Done_Sig, in the same clock cycle.

 Ratio Module
When the PM_Done_Sig, NumL, NumS, DenL and DenS are being received, the RaM module starts to calculate the normalized coordinates (X, Y) of image coordinates. As shown in Equation (6), the forms for the two equations are the same. It is convenient to calculate X and Y in parallel using an FPGA chip. In Figure 8, a parallel-computing architecture that is used to calculate X is presented. In the same way, the Y coordinate can be obtained.
To obtain the X (or Y) coordinate, one divider, three adders, six multiplexers, six flipflops (two flipflops are public), and 32 selectors are applied. After the processing of the RaM module, the X coordinate and Y coordinate are acquired with the done signal, RaM_Done_Sig, in the same clock cycle.

• Ratio Module
When the PM_Done_Sig, Num L , Num S , Den L and Den S are being received, the RaM module starts to calculate the normalized coordinates (X, Y) of image coordinates. As shown in Equation (6), the forms for the two equations are the same. It is convenient to calculate X and Y in parallel using an FPGA chip. In Figure 8, a parallel-computing architecture that is used to calculate X is presented. In the same way, the Y coordinate can be obtained.
To obtain the X (or Y) coordinate, one divider, three adders, six multiplexers, six flipflops (two flipflops are public), and 32 selectors are applied. After the processing of the RaM module, the X coordinate and Y coordinate are acquired with the done signal, RaM_Done_Sig, in the same clock cycle.

 Image Coordinate Calculation Module
When the RaM_Done_Sig, X, and Y coordinates are being detected, the RCM module starts to calculate the image coordinates (Samp, Line), i.e., column and row indexes. As shown in Equation (9), the equations give the relationship between the normalized coordinates (X, Y) and image coordinates (Samp, Line).
As shown in Equation (9), the equations have a uniform form, which is helpful for implementation using an FPGA. To calculate the image coordinates (Samp, Line) in parallel, a parallel-computing hardware architecture is designed. Because the forms of the equations in Equation (9) are similar, only the schematic diagram used for calculating the Samp coordinate is given. As shown in Figure 9, there are one multiplier, four flipflops (two of them are shared when calculating Line coordinate), five selectors (MUX) shared when calculating Line coordinate, seven adders, and 136 multiplexers (MUX21).
After the processing of the RCM module, the image coordinates, that is, the column and row indexes (Samp, Line), and the done signal (RCM_Done_Sig) are obtained in the same clock cycle. Up to this point, the whole processing of the coordinate transformation is done. The obtained image coordinates are sent to the interpolation module to interpolate the grayscale.

• Image Coordinate Calculation Module
When the RaM_Done_Sig, X, and Y coordinates are being detected, the RCM module starts to calculate the image coordinates (Samp, Line), i.e., column and row indexes. As shown in Equation (9), the equations give the relationship between the normalized coordinates (X, Y) and image coordinates (Samp, Line).
As shown in Equation (9), the equations have a uniform form, which is helpful for implementation using an FPGA. To calculate the image coordinates (Samp, Line) in parallel, a parallel-computing hardware architecture is designed. Because the forms of the equations in Equation (9) are similar, only the schematic diagram used for calculating the Samp coordinate is given. As shown in Figure 9, there are one multiplier, four flipflops (two of them are shared when calculating Line coordinate), five selectors (MUX) shared when calculating Line coordinate, seven adders, and 136 multiplexers (MUX21).
After the processing of the RCM module, the image coordinates, that is, the column and row indexes (Samp, Line), and the done signal (RCM_Done_Sig) are obtained in the same clock cycle. Up to this point, the whole processing of the coordinate transformation is done. The obtained image coordinates are sent to the interpolation module to interpolate the grayscale. Sensors 2018, 18, x FOR PEER REVIEW 12 of 24 Figure 9. Schematic diagram used for calculating Samp coordinate.

Interpolation Module
Because the obtained column and row indexes may not exist only at the center of a pixel, it is necessary to use the interpolation method to obtain the grayscale in the obtained column and row indexes. Considering the interpolation effect, the complexity of an interpolation algorithm, and the resources of an FPGA, the bilinear interpolation method is selected to implement the interpolation for grayscale. Mathematically, the bilinear interpolation algorithm can be expressed by the following equation: where i and j are nonnegative integers; the intermediates p = |i − int(i)| and q = |j − int(j)| are within the range of (0, 1); and g(i, j) represents gray values.
To implement the bilinear interpolation algorithm in parallel using an FPGA chip, a parallel computation architecture was designed (see Figure 10). The designed hardware architecture contains four submodules/parts: (i) the subtract_mod, which is used to obtain the integer part (iLine and iSamp) and fractional part (p and q) of Line and Samp indexes, and to calculate the subtraction (1_p and 1_q) in Equation (18); (ii) the get_gray_addr_mod, which is applied to obtain the address of gray in RAM; (iii) the multiplication part, which is utilized to calculate the multiplications in

Interpolation Module
Because the obtained column and row indexes may not exist only at the center of a pixel, it is necessary to use the interpolation method to obtain the grayscale in the obtained column and row indexes. Considering the interpolation effect, the complexity of an interpolation algorithm, and the resources of an FPGA, the bilinear interpolation method is selected to implement the interpolation for grayscale. Mathematically, the bilinear interpolation algorithm can be expressed by the following equation: where i and j are nonnegative integers; the intermediates p = |i − int(i)| and q = |j − int(j)| are within the range of (0, 1); and g(i, j) represents gray values.
To implement the bilinear interpolation algorithm in parallel using an FPGA chip, a parallel computation architecture was designed (see Figure 10). The designed hardware architecture contains four submodules/parts: (i) the subtract_mod, which is used to obtain the integer part (iLine and iSamp) and fractional part (p and q) of Line and Samp indexes, and to calculate the subtraction (1_p and 1_q) in Equation (18); (ii) the get_gray_addr_mod, which is applied to obtain the address of gray in RAM; (iii) the multiplication part, which is utilized to calculate the multiplications in Equation (18); and (iv) the calculate_sum_mod, which is used to compute the sum in Equation (18). After the processing of the calculate_sum_mod, the results of interpolation in (Samp, Line) are obtained. The details of subtract_mod, get_gray_addr_mod, and calculate_sum_mod are described as follows. Equation (18); and (iv) the calculate_sum_mod, which is used to compute the sum in Equation (18). After the processing of the calculate_sum_mod, the results of interpolation in (Samp, Line) are obtained. The details of subtract_mod, get_gray_addr_mod, and calculate_sum_mod are described as follows.

 subtract_mod
As shown in Equation (18), to perform the bilinear interpolation method, the gray values of four neighbors around the acquired column and row indexes are required. Thus, the acquired column and row indexes should be pre-processed to obtained the integer part and fractional part, which are used to calculate (1 − q) and (1 − p). To implement the function using an FPGA chip, a parallel-computing architecture is proposed, named subtract_mod. In subtract_mod, the methods used to acquire iSamp, q and 1_q are similar to those for obtaining iLine, p and 1_p, respectively. Thus, in this section, only the schematic diagram for obtaining iSamp, q and 1_q is given (see Figure 11).
As shown in Figure 11, to obtain iSamp, q and 1_q, three adders, seven multiplexers (MUX21), and nine flipflops (three of which are shared when iLine), p and 1_p are used. In addition, three MUXs are public. After the processing of the whole subtract_mod, iSamp, q, 1_q, iLine, p, and 1_p are acquired with the done signal in the same clock cycle. When obtaining these variables, iSamp and iLine are sent to the next submodule to retrieve the address of gray for four neighbors in RAM. Meanwhile, q, 1_q, iLine, p, and 1_p are sent to another part to perform multiplication.

• subtract_mod
As shown in Equation (18), to perform the bilinear interpolation method, the gray values of four neighbors around the acquired column and row indexes are required. Thus, the acquired column and row indexes should be pre-processed to obtained the integer part and fractional part, which are used to calculate (1 − q) and (1 − p). To implement the function using an FPGA chip, a parallel-computing architecture is proposed, named subtract_mod. In subtract_mod, the methods used to acquire iSamp, q and 1_q are similar to those for obtaining iLine, p and 1_p, respectively. Thus, in this section, only the schematic diagram for obtaining iSamp, q and 1_q is given (see Figure 11).
As shown in Figure 11, to obtain iSamp, q and 1_q, three adders, seven multiplexers (MUX21), and nine flipflops (three of which are shared when iLine), p and 1_p are used. In addition, three MUXs are public. After the processing of the whole subtract_mod, iSamp, q, 1_q, iLine, p, and 1_p are acquired with the done signal in the same clock cycle. When obtaining these variables, iSamp and iLine are sent to the next submodule to retrieve the address of gray for four neighbors in RAM. Meanwhile, q, 1_q, iLine, p, and 1_p are sent to another part to perform multiplication.

 get_gray_addr_mod
The grayscale of a pixel can be obtained according to the corresponding address. To obtain the gray values of four neighbors around the obtained column and row indexes in parallel, a parallel-computing hardware architecture is proposed (see Figure 12), called get_gray_addr_mod. In the get_gray_addr_mod, 3 LESS-THAN comparators, 9 adders, 10 MUX21, 12 flipflops, and 70 MUX are applied. After the processing of the get_gray_addr_mod, four addresses are obtained with the done signal in the same clock cycle. According to the obtained addresses, the gray values can be acquired from RAM. Then, they are sent to the multiplication part to perform the multiplication.

• get_gray_addr_mod
The grayscale of a pixel can be obtained according to the corresponding address. To obtain the gray values of four neighbors around the obtained column and row indexes in parallel, a parallel-computing hardware architecture is proposed (see Figure 12), called get_gray_addr_mod. In the get_gray_addr_mod, 3 LESS-THAN comparators, 9 adders, 10 MUX21, 12 flipflops, and 70 MUX are applied. After the processing of the get_gray_addr_mod, four addresses are obtained with the done signal in the same clock cycle. According to the obtained addresses, the gray values can be acquired from RAM. Then, they are sent to the multiplication part to perform the multiplication.

 get_gray_addr_mod
The grayscale of a pixel can be obtained according to the corresponding address. To obtain the gray values of four neighbors around the obtained column and row indexes in parallel, a parallel-computing hardware architecture is proposed (see Figure 12), called get_gray_addr_mod. In the get_gray_addr_mod, 3 LESS-THAN comparators, 9 adders, 10 MUX21, 12 flipflops, and 70 MUX are applied. After the processing of the get_gray_addr_mod, four addresses are obtained with the done signal in the same clock cycle. According to the obtained addresses, the gray values can be acquired from RAM. Then, they are sent to the multiplication part to perform the multiplication.

• calculate_sum_mod
As shown in Figure 10, after the multiplication process, four variables, x 1 , x 2 , x 3 , and x 4 , are obtained in the same clock cycle. To implement the addition for four variables, two levels of additions are needed. Each addition corresponds to an extended addition that has an architecture that is similar to Figure 7. The details can be found in Section 2.2.2 and Figure 7. After the processing of the calculate_sum_mod, the result of interpolation in (Samp, Line) can be obtained.

Integration On-Board System
An FPGA device can be integrated into the on-board system as a part of the system, because FPGA has advantages in size, weight, and power. After completing the proposed algorithm design using Verilog language, the designed algorithm can be programmed into the selected FPGA device.

Software and Hardware Environment
In this study, an Altera FPGA was used. The version of the FPGA is Kintex-7 XC7K325TFFG900-1 (see Figure 13), the design tool is Vivado 2016.4 (Xilinx, San Jose, CA, USA), and the simulation tool is ModelSim SE10.1d (Mentor, Santa Barbara, CA, USA). The PC uses a Windows 7 (64 bit) operating system, and has an Intel ® Core™ i7-4790 CPU @ 3.6 GHz processor with 8 GB RAM. To validate the proposed method, the orthorectification algorithm was also implemented using Matlab 2012a (MathWorks, 1 Apple Hill Drive, Natick, MA, USA).

 calculate_sum_mod
As shown in Figure 10, after the multiplication process, four variables, x1, x2, x3, and x4, are obtained in the same clock cycle. To implement the addition for four variables, two levels of additions are needed. Each addition corresponds to an extended addition that has an architecture that is similar to Figure 7. The details can be found in Section 2.2.2 and Figure 7. After the processing of the calculate_sum_mod, the result of interpolation in (Samp, Line) can be obtained.

Integration On-Board System
An FPGA device can be integrated into the on-board system as a part of the system, because FPGA has advantages in size, weight, and power. After completing the proposed algorithm design using Verilog language, the designed algorithm can be programmed into the selected FPGA device.

Software and Hardware Environment
In this study, an Altera FPGA was used. The version of the FPGA is Kintex-7 XC7K325TFFG900-1 (see Figure 13), the design tool is Vivado 2016.4 (Xilinx, San Jose, CA, USA), and the simulation tool is ModelSim SE10.1d (Mentor, Santa Barbara, CA, USA). The PC uses a Windows 7 (64 bit) operating system, and has an Intel ® Core™ i7-4790 CPU @ 3.6 GHz processor with 8 GB RAM. To validate the proposed method, the orthorectification algorithm was also implemented using Matlab 2012a (MathWorks, 1 Apple Hill Drive, Natick, MA, USA).

Dataset
To validate the correction accuracy and processing speed of the proposed FPGA-based orthorectification method, two test datasets (as shown in Figure 14

Dataset
To validate the correction accuracy and processing speed of the proposed FPGA-based orthorectification method, two test datasets (as shown in Figure 14        According to the proposed method, input parameters should be transformed into fixed-point data. As shown in Tables 2-4, the values of parameters of two study areas are in different range. To ensure computation accuracy, all parameters are transformed into fixed-point of 32 bits using different scale factor, τ. The details are given in Tables 5 and 6. In addition, the clock frequency is 100 MHz.

Results
As shown in Figures 15a and 16a, after the processing of the proposed method, the orthorectified results (orthophoto) were obtained. To validate the accuracy and speed of the proposed rectification method, orthorectification for the same datasets was also implemented by applying the PC-based platform. On the PC-based platform, the proposed FP-RPC orthorectification was implemented using Matlab codes.
The orthorectification results obtained using the PC-based software are shown in Figures 15b and 16b. The contrast-enhanced difference images for two study areas are shown in Figures 15c and 16c, respectively. As shown in Figures 15c and 16c, the contrast-enhanced difference images show the few discrepancies that are present. The orthorectification images obtained by FPGA and PC are not visually different by inspection. The numerical differences between FPGA and PC become apparent when observing the difference images shown in Figures 15c and 16c. These pixel position differences are mainly caused by the used bit wide and scale factor of fixed-point data [1]. According to [1], the pixel position difference can be decreased with the increasing of bit wide and scale factor. Error analysis between the proposed method and the FP-RPC algorithm implemented on PC are provided in the next section. The orthorectification results obtained using the PC-based software are shown in Figures 15b  and 16b. The contrast-enhanced difference images for two study areas are shown in Figures 15c and  16c, respectively. As shown in Figures 15c and 16c, the contrast-enhanced difference images show the few discrepancies that are present. The orthorectification images obtained by FPGA and PC are not visually different by inspection. The numerical differences between FPGA and PC become apparent when observing the difference images shown in Figures 15c and 16c. These pixel position differences are mainly caused by the used bit wide and scale factor of fixed-point data [1]. According to [1], the pixel position difference can be decreased with the increasing of bit wide and scale factor. Error analysis between the proposed method and the FP-RPC algorithm implemented on PC are provided in the next section.

Error Analysis
To quantitatively evaluate the accuracy of the proposed orthorectification method, the root-mean-square error (RMSE) [40,41] was utilized. Mathematically, the RMSEs of the image coordinates along the vertical axis (ΔI) and horizontal axis (ΔJ), and distance (ΔS) can be calculated using the following equations, respectively,   The orthorectification results obtained using the PC-based software are shown in Figures 15b  and 16b. The contrast-enhanced difference images for two study areas are shown in Figures 15c and  16c, respectively. As shown in Figures 15c and 16c, the contrast-enhanced difference images show the few discrepancies that are present. The orthorectification images obtained by FPGA and PC are not visually different by inspection. The numerical differences between FPGA and PC become apparent when observing the difference images shown in Figures 15c and 16c. These pixel position differences are mainly caused by the used bit wide and scale factor of fixed-point data [1]. According to [1], the pixel position difference can be decreased with the increasing of bit wide and scale factor. Error analysis between the proposed method and the FP-RPC algorithm implemented on PC are provided in the next section.

Error Analysis
To quantitatively evaluate the accuracy of the proposed orthorectification method, the root-mean-square error (RMSE) [40,41] was utilized. Mathematically, the RMSEs of the image coordinates along the vertical axis (ΔI) and horizontal axis (ΔJ), and distance (ΔS) can be calculated using the following equations, respectively,

Error Analysis
To quantitatively evaluate the accuracy of the proposed orthorectification method, the root-mean-square error (RMSE) [40,41] was utilized. Mathematically, the RMSEs of the image coordinates along the vertical axis (∆I) and horizontal axis (∆J), and distance (∆S) can be calculated using the following equations, respectively, where I h and J h are the image coordinates rectified by the proposed orthorectification method; I h and J h are the reference image coordinates; and n is the number of check points.
To compute the RMSEs, 40 check points for each study area were selected randomly (as shown in Figure 17). As shown in Figure 18, the differences in the values of image coordinates for the Matlab-based and FPGA-based methods are given. Based on Equations (19) and (20), the RMSEs (∆I, ∆J, and ∆S) are 0.35 pixels, 0.30 pixels, and 0.46 pixels, respectively, for the first study area; meanwhile, they are 0.27 pixels, 0.36 pixels, and 0.44 pixels, respectively, for the second study area. Moreover, other statistics were also calculated (as shown in Table 7).
According to the calculation results of Equations (19) and (20), the orthorectification results obtained using the proposed method are considered acceptable because the RMSEs are less than one pixel [42][43][44]. However, as shown in Figure 18, differences still exist in the image coordinates acquired by the FPGA-based and Matlab-based methods. These differences may be caused by the algorithms implemented by FPGA hardware, for example, the fixed-point computation, which propagate and accumulate.  (20) where I′h and J′h are the image coordinates rectified by the proposed orthorectification method; Ih and Jh are the reference image coordinates; and n is the number of check points.
To compute the RMSEs, 40 check points for each study area were selected randomly (as shown in Figure 17). As shown in Figure 18, the differences in the values of image coordinates for the Matlab-based and FPGA-based methods are given. Based on Equations (19) and (20), the RMSEs (ΔI, ΔJ, and ΔS) are 0.35 pixels, 0.30 pixels, and 0.46 pixels, respectively, for the first study area; meanwhile, they are 0.27 pixels, 0.36 pixels, and 0.44 pixels, respectively, for the second study area. Moreover, other statistics were also calculated (as shown in Table 7).
According to the calculation results of Equations (19) and (20), the orthorectification results obtained using the proposed method are considered acceptable because the RMSEs are less than one pixel [42][43][44]. However, as shown in Figure 18, differences still exist in the image coordinates acquired by the FPGA-based and Matlab-based methods. These differences may be caused by the algorithms implemented by FPGA hardware, for example, the fixed-point computation, which propagate and accumulate.    (20) where I′h and J′h are the image coordinates rectified by the proposed orthorectification method; Ih and Jh are the reference image coordinates; and n is the number of check points.
To compute the RMSEs, 40 check points for each study area were selected randomly (as shown in Figure 17). As shown in Figure 18, the differences in the values of image coordinates for the Matlab-based and FPGA-based methods are given. Based on Equations (19) and (20), the RMSEs (ΔI, ΔJ, and ΔS) are 0.35 pixels, 0.30 pixels, and 0.46 pixels, respectively, for the first study area; meanwhile, they are 0.27 pixels, 0.36 pixels, and 0.44 pixels, respectively, for the second study area. Moreover, other statistics were also calculated (as shown in Table 7).
According to the calculation results of Equations (19) and (20), the orthorectification results obtained using the proposed method are considered acceptable because the RMSEs are less than one pixel [42][43][44]. However, as shown in Figure 18, differences still exist in the image coordinates acquired by the FPGA-based and Matlab-based methods. These differences may be caused by the algorithms implemented by FPGA hardware, for example, the fixed-point computation, which propagate and accumulate.

Processing Speed Comparison
This section presents the processing time as the size of satellite image increases, and evaluates processing speed of the FPGA-based orthorectification method and the Matlab-based method.
The processing time have been recorded as an average of 10 runs of the RPC orthorectification algorithm for each image. The average processing time for difference size of image is presented in Table 8. A plot of the image size vs. processing time is shown in Figure 19. The speed-up of the method can be defined as the Matlab time taken divided by the time taken on the FPGA for the performance of the RPC algorithm [45]. In the image case considered, the maximum speed-up is acquired from a size of 1024 × 1024 pixels, where the speed-up is about 11,095.8709. From the results in Table 8, it can be demonstrated that the speed increases with the size of image.

Processing Speed Comparison
This section presents the processing time as the size of satellite image increases, and evaluates processing speed of the FPGA-based orthorectification method and the Matlab-based method.
The processing time have been recorded as an average of 10 runs of the RPC orthorectification algorithm for each image. The average processing time for difference size of image is presented in Table 8. A plot of the image size vs. processing time is shown in Figure 19. The speed-up of the method can be defined as the Matlab time taken divided by the time taken on the FPGA for the performance of the RPC algorithm [45]. In the image case considered, the maximum speed-up is acquired from a size of 1024 × 1024 pixels, where the speed-up is about 11,095.8709. From the results in Table 8, it can be demonstrated that the speed increases with the size of image.  The processing speed is one of the most importance indicators for evaluating on-board processing. To evaluate and compare the speed of the proposed FPGA-based orthorectification method and the Matlab-based method, the throughput, which is a normalized metric, is used, and represents the capacity in terms of the number of pixels processed per second. For the proposed method, the average throughput is approximately 675.67 Mpixels/s. However, for the Matlab-based method, the average throughput is approximately 61,677.49 pixels/s. This means that the proposed FPGA-based method has higher processing capacity than the Matlab-based method.

Resource Consumption
Besides the speed of processing, the utilization ratio of each type of resource is also a key indicator when assessing the quality of a method. As is well known, it can be determined whether a selected device meets the requirement of a design scheme by analyzing the utilization ratio of hardware resource. If the utilization ratio of a type of resource reaches 60-80%, the selected device satisfies the requirement of the design scheme. The processing speed is one of the most importance indicators for evaluating on-board processing. To evaluate and compare the speed of the proposed FPGA-based orthorectification method and the Matlab-based method, the throughput, which is a normalized metric, is used, and represents the capacity in terms of the number of pixels processed per second. For the proposed method, the average throughput is approximately 675.67 Mpixels/s. However, for the Matlab-based method, the average throughput is approximately 61,677.49 pixels/s. This means that the proposed FPGA-based method has higher processing capacity than the Matlab-based method.

Resource Consumption
Besides the speed of processing, the utilization ratio of each type of resource is also a key indicator when assessing the quality of a method. As is well known, it can be determined whether a selected device meets the requirement of a design scheme by analyzing the utilization ratio of hardware resource. If the utilization ratio of a type of resource reaches 60-80%, the selected device satisfies the requirement of the design scheme.
Thus, after implementing the proposed method, some important resources, such as look-up tables (LUTs), registers, and total pins are analyzed. As shown in Table 9, the slice logics contain slice LUTs and slice registers. The utilization ratios of LUTs and registers are 44.42% and 5.59%, respectively. The utilization of input and output (IO) is 368, which is 73.60% of the total IOs.
In short, according to the above comprehensive utilization ratios for resources, it can be demonstrated that the resources of the selected FPGA can meet the design requirement of the proposed FPGA-based orthorectification method.

Conclusions
This paper proposes an orthorectification method, namely, the field-programmable gate array (FPGA)-based fixed-point (FP) rational polynomial coefficient (RPC) method (FPGA-based FP-RPC method) to perform the process of orthorectification on board spacecraft/satellite to accelerate the orthorectification processing speed for remotely sensed (RS) images. The proposed FPGA-based FP-RPC method contains three main submodules, Read_parameter_mod, Coordinate_transform_mod, and Interpolation_mod, based on the bilinear interpolation algorithm.
To validate the orthorectification accuracy, an orthophoto that was orthorectified by a PC-based platform (Matlab 2012a) was used as a reference. Two datasets, IKONOS and SPOT-6 images, were used to validate the proposed FPGA-based FP-RPC method. The root-mean-square error (RMSE), which is associated with the maximum, minimum, standard deviation (STD), and mean of row and column coordinates' differences, was used. The experimental results show that the STD of the row and column coordinates' differences are 0.19 pixels and 0.18 pixels, respectively, for the first study area, while they were 0.16 pixels and 0.24 pixels, respectively, for the second study area. The RMSE of the row coordinate (∆I), column coordinate (∆J) and the distance ∆S are 0.35 pixels, 0.30 pixels, and 0.46 pixels, respectively, for the first study area, while they are 0.27 pixels, 0.36 pixels, and 0.44 pixels, respectively, for the second study area. It can be concluded from these quantitative analyses that the proposed method can meet the demand of orthorectification in practice.
Moreover, a comparison of the processing speed was also performed for the proposed FPGA-based FP-RPC method and PC-based RPC methods. The throughput of the FPGA-based FP-RPC method and PC-based RPC method are 675.67 Mpixels/s and 61,070.24 pixels/s, respectively. Therefore, it can be shown that the processing speed of the FPGA-based FP-RPC method is faster (by approximately 11,000 times) than the processing speed of the Matlab-based RPC method. In terms of the resource consumptions, it can be found that the utilization ratios of ALUTs, registers, and IO are 44.42%, 5.59%, and 73.60%, respectively.