On-Board Georeferencing Using FPGA-Based Optimized Second-Order Polynomial Equation

For real-time monitoring of natural disasters, such as fire, volcano, flood, landslide, and coastal inundation, highly-accurate georeferenced remotely sensed imagery is needed. Georeferenced imagery can be fused with geographic spatial data sets to provide geographic coordinates and positing for regions of interest. This paper proposes an on-board georeferencing method for remotely sensed imagery, which contains five modules: input data, coordinate transformation, bilinear interpolation, and output data. The experimental results demonstrate multiple benefits of the proposed method: (1) the computation speed using the proposed algorithm is 8 times faster than that using PC computer; (2) the resources of the field programmable gate array (FPGA) can meet the requirements of design. In the coordinate transformation scheme, 250,656 LUTs, 499,268 registers, and 388 DSP48s are used. Furthermore, 27,218 LUTs, 45,823 registers, 456 RAM/FIFO, and 267 DSP48s are used in the bilinear interpolation module; (3) the values of root mean square errors (RMSEs) are less than one pixel, and the other statistics, such as maximum error, minimum error, and mean error are less than one pixel; (4) the gray values of the georeferenced image when implemented using FPGA have the same accuracy as those implemented using MATLAB and Visual studio (C++), and have a very close accuracy implemented using ENVI software; and (5) the on-chip power consumption is 0.659 W. Therefore, it can be concluded that the proposed georeferencing method implemented using FPGA with second-order polynomial model and bilinear interpolation algorithm can achieve real-time geographic referencing for remotely sensed imagery.


Introduction
With the advancement of technological, remote sensing (RS) images are becoming more widely used in natural disasters monitoring and positioning [1][2][3].They are required to rapidly produce highly accurate georeferenced image.However, the processing speed of the traditional methods for indirect georeferencing of RS images cannot to meet the timeliness requirements [4][5][6][7][8][9].Therefore, it would be highly desirable to develop a rapid, inexpensive technique to implement an on-board georeferencing.three modules: a data memory, coordinate transformation (including the transformation from geodetic coordinates to the raw image coordinates and the raw image coordinates to the scanning coordinates), and bilinear interpolation.The contributions are summarized as follows.
(1) An optimized second-order polynomial equation and bilinear interpolation are proposed for on-board georeferencing for remotely sensing imagery.(2) An architecture for floating-point block Lower-Upper (LU) decomposition is designed to solve the inverse for large-sized matrices.
(3) To reduce resource consumption, some strategies are adopted in FPGA implementation, i.e., 32-bit integer and floating-point mixed operation and serial-parallel data communication.
The remainder of the paper is organized as follows.Section 2 reviews the traditional second-order polynomial georeferencing algorithm, bilinear interpolation method and optimizes the algorithms based on the FPGA; Section 3 gives the details of the implementation using FPGA for the georeferencing scheme; Section 4 describes and validates the experimental results.The conclusion is presented in Section 5.

Optimization for Georeferencing Scheme
The georeferencing scheme includes the selection of a suitable mathematical distortion model, coordinate transformation, and resampling (interpolation) [36][37][38].The georeferencing scheme can be classified into direct and indirect methods.The direct method applies the raw image coordinates to compute the georeferenced coordinates, and the indirect method applies the georeferenced image coordinates to compute the raw image coordinates.In this paper, the indirect method is used to establish the relationship of coordinate by the second-order polynomial equations.

Traditional Second-Order Polynomial Equations
The second-order polynomial equations are expressed as follows [39][40][41][42]: where (x, y) are the coordinates of the raw image, (X, Y) are the corresponding ground coordinates (longitude, latitude), and a i and b i (i = 0, 1, . . ., 5) are the unknown coefficients of the second-order polynomial equation.After choosing a suitable mathematical distortion model, the unknown coefficients a i and b i can be obtained from Equations ( 1) and (2) with the ground control points (GCPs).
GCPs are an important parameter in the geometric calibration, which affect the accuracy of subsequent correction [43].To ensure accuracy, ten GCPs are selected in the study area.Let the coordinate pairs of ten GCPs can be represented as: (x 1 , y 1 , X 1 , Y 1 ), (x 2 , y 2 , X 2 , Y 2 ), . . ., (x 10 , y 10 , X 10 , Y 10 ).The matrix form of Equation ( 1) can be expressed as: According to the least square method [5], Equation (3) can be simplified as Equation (4).
where A is the matrix of the ground coordinates of GCPs, ∆ a is the coefficients matrix of the second-order polynomial Equation (1), L x is the coordinates matrix of GCPs in the raw image, and P is the weight matrix.A , ∆ a , L x and P are expressed by Equation (5) through Equation (8).
∆ a = [ a 0 a 1 a 2 a 3 a 4 a 5 ] T , ( 6) Typically, the weight of each GCPs is the same.So, P = I.Equation ( 4) can be descripted as: In the same way, the coefficients of b can be solved by the formula ∆ b = (A T A) −1 (A T L y ).

Coordinate Transformation
After establishing the second-order polynomial equation and solving the coefficients of Equations ( 1) and ( 2), the raw image can be georeferenced pixel-to-pixel.The steps are as follows [44]: 1.The Size of the Georeferenced Image To properly obtain the georeferenced image, the storage area of georeferenced image must be computed in advance, as shown in Figure 1 [7].abcd is the raw image, with corners a , b, c, and d in the o − xy image coordinate system in the Figure 1a. Figure 1b shows the correct range of output image.a b c d is the georeferenced image with corners a , b , c and d in the O − XY ground coordinate system, and ABCD is the storage area for georeferenced image.Obviously, if the storage range is not correctly defined, the georeferenced image is improper with a large blank area, as shown in Figure 1c.Therefore, the right boundary should include the entire georeferenced image with minimum exterior blank rectangle as possible.
computed in advance, as shown in Figure 1 [7].abcd is the raw image, with corners a , b , c , and d in the o xy − image coordinate system in the Figure 1a. Figure 1b shows the correct range of output image.' ' ' ' a b c d is the georeferenced image with corners ' a , ' b , ' c and ' d in the O XY − ground coordinate system, and ABCD is the storage area for georeferenced image.Obviously, if the storage range is not correctly defined, the georeferenced image is improper with a large blank area, as shown in Figure 1c.Therefore, the right boundary should include the entire georeferenced image with minimum exterior blank rectangle as possible.
3. Row (M) and Column (N) of the Georeferenced Image.
M and M can be solved by Equation (11).
where X GSD and Y GSD are the ground-sampling distances (GSD) of the output image.Hence, the location of pixel can be described by the row and column in the B − x y coordinate system (x = 1, 2, 3, . . ., M; y = 1, 2, 3, . . ., N).

Coordinate Transformation.
The georeferenced model only expresses the relationship between the ground coordinates (X, Y) and the (x, y) coordinates of the raw image.To further express the relationship between the raw image coordinates and scanning coordinates of the georeferenced image, it is necessary to convert the ground coordinate into the row and column of georeferenced image, as shown in Equation (12).
where x and y are the row and column of the georeferenced image, respectively, and (X g , Y g ) are the corresponding ground coordinate.

Resample
As to be expected, the defined grid center from the georeferenced image will not usually project to the center location of pixel in the raw image after coordinate transformation.To solve this problem, nearest-neighbor interpolation, bilinear interpolation, and cubic interpolation [45] are commonly used.To balance the accuracy and complexity of the preprocessing function, bilinear interpolation is chosen in this paper.The algorithm obtains the pixel value by taking a weighted sum of the pixel values of the four nearest neighbors surrounding the calculated location [46], as shown in Figure 2 and Equation (13).(13) where, (x, y) are the coordinates of the georeferenced image, I(x, y) is the gray value of a pixel with (x, y) coordinates in the georeferenced image; i and j are the integer part of the corresponding coordinate in the raw image, respectively; u and v are the fractional parts of the corresponding coordinate in the raw image, respectively.I(i, j) is the gray value of (i, j) coordinates in the raw image.

Optimized Second-Order Polynomial Model
To parallel implement the Equations ( 1) and ( 2), the second-order polynomial equations are optimized to Equations ( 14) and (15). where, , and In Equation ( 1), eight multipliers and five adders are required to complete a transformation.However, in Equation ( 14), five multipliers and five adders are required to complete a transformation.Comparatively, six multipliers are reduced in all.The multipliers and adders required to complete the coordinate transformation are in Table 1.

Optimized Second-Order Polynomial Model
To parallel implement the Equations ( 1) and ( 2), the second-order polynomial equations are optimized to Equations ( 14) and (15). where, In Equation (1), eight multipliers and five adders are required to complete a transformation.However, in Equation ( 14), five multipliers and five adders are required to complete a transformation.Comparatively, six multipliers are reduced in all.The multipliers and adders required to complete the coordinate transformation are in Table 1.13) appears to require eight multiplications, but careful factorization to exploit separability can reduce the multiplication by three [47].
To compute the Equation ( 16), the gray values of four nearest-neighbors must be read from the memory.The principle of the mapping storage method is to optimize and balance the line and block data access rate [7].In order to improve memory access efficiency, a multi-array memory storage method is designed in Figure 3.The RS image size is m × n.As such, m × n memory units are designed in double data rate (DDR).The relationship between the address of DDR and (i, j) coordinates can be described by Equation (17).
where, i and j are the row and column of the raw image, respectively.addr ij is the corresponding address of memory.
To compute the Equation ( 16), the gray values of four nearest-neighbors must be read from the memory.The principle of the mapping storage method is to optimize and balance the line and block data access rate [7].In order to improve memory access efficiency, a multi-array memory storage method is designed in Figure 3.The RS image size is m n × .As such, m n × memory units are designed in double data rate (DDR).The relationship between the address of DDR and ( , ) i j coordinates can be described by Equation (17).
where, i and j are the row and column of the raw image, respectively.

FPGA-Based Implementation of the Georeferencing Scheme
An FPGA hardware architecture is designed for georeferenced of the remotely sensed imagery as shown in Figure 4, which contains five modules: (I) GCP data module; (II) input image module (IIM); (III) coefficient calculation module (CCM); (IV) coordinate transformation module (CTM); and (V) bilinear interpolation module (BIM).The details of five modules are described as follows.
 The GCPs data is stored in the RAM of the GCP data module.These parameters are sent to the CCM, when the enable signal is being received.


The gray values of the input image are saved to the ROM through the IIM, when the enable signal is being received.


The coefficients of 0 a , 1 a , 2 a , 3 a , According to Equation ( 13) and ( 16), the interpolation values are calculated.At last, the gray values, latitude, and longitude of the georeferenced image are outputted at the same clock

FPGA-Based Implementation of the Georeferencing Scheme
An FPGA hardware architecture is designed for georeferenced of the remotely sensed imagery as shown in Figure 4, which contains five modules: (I) GCP data module; (II) input image module (IIM); (III) coefficient calculation module (CCM); (IV) coordinate transformation module (CTM); and (V) bilinear interpolation module (BIM).The details of five modules are described as follows.

•
The GCPs data is stored in the RAM of the GCP data module.These parameters are sent to the CCM, when the enable signal is being received.

•
The gray values of the input image are saved to the ROM through the IIM, when the enable signal is being received.

•
The coefficients of a 0 , a 1 , a 2 , a 3 , a

•
The coordinate transformation is carried out in the CTM when the coefficients of second-order polynomial equation are arrived.The values of i, j, u, 1 − u, v and 1 − v are simultaneously sent to the BIM.

•
According to Equations ( 13) and ( 16), the interpolation values are calculated.At last, the gray values, latitude, and longitude of the georeferenced image are outputted at the same clock cycle.

FPGA-Based Solution of the Second-Order Polynomial Equation
To implement the second-order polynomial equation based on an FPGA, the processing of solving the coefficient is decomposed into three modules (as shown in

Calculation of Matrix T A A
According to Equation ( 9) and the principle of matrix multiplication, the coefficients of 1 j a ( j = 1, 2, …, 6) are calculated, as shown in Equation ( 18) and (19).

FPGA-Based Solution of the Second-Order Polynomial Equation
To implement the second-order polynomial equation based on an FPGA, the processing of solving the coefficient is decomposed into three modules (as shown in Figure 5): (I) Form the matrices A , L x , and L y ; (II) Calculate A T A , (A T A) −1 , A T L x , and A T L y , and (III) perform (A T A) −1 (A T L x ) and (A T A) −1 (A T L y ).

FPGA-Based Solution of the Second-Order Polynomial Equation
To implement the second-order polynomial equation based on an FPGA, the processing of solving the coefficient is decomposed into three modules (as shown in

Calculation of Matrix T A A
According to Equation ( 9) and the principle of matrix multiplication, the coefficients of 1 j a ( j = 1, 2, …, 6) are calculated, as shown in Equation ( 18) and ( 19).

Calculation of Matrix A T A
According to Equation ( 9) and the principle of matrix multiplication, the coefficients of a 1j (j = 1, 2, . . ., 6) are calculated, as shown in Equations ( 18) and (19).
Remote Sens. 2019, 11, 124 9 of 28 Generally, ten multipliers and nine adders are needed to compute a 11 .Therefore, it takes about 324 additions and 360 multiplications to calculate all coefficients.However, the resources are limited on an FPGA chip.Therefore, a combination of serial and parallel strategy is adopted [48,49].
To improve the processing speed, multiplier-adder (MD) modules are used.At the same way, the other coefficients of a 1j , a 2j , a 3j , a 4j , a 5j , and a 6j (j = 1, 2, 3, . . ., 6) are parallel computed at the same clock cycle.Figure 6 shows the architecture of A T A based on an FPGA.
Generally, ten multipliers and nine adders are needed to compute 11 a .Therefore, it takes about 324 additions and 360 multiplications to calculate all coefficients.However, the resources are limited on an FPGA chip.Therefore, a combination of serial and parallel strategy is adopted [48,49].To improve the processing speed, multiplier-adder (MD) modules are used.At the same way, the other coefficients of 1 j a , 2 j a , 3 j a , 4 j a , 5 j a , and 6 j a ( j = 1, 2, 3, …, 6) are parallel computed at the same clock cycle.Figure 6 shows the architecture of A T A based on an FPGA.

LU Decomposition of the Matrix T A A
The coefficient accuracy of the second-order polynomial equation is crucial to the entire system.This paper proposes a parallel structure that uses the floating-point block LU decomposition to solve the inverse of

LU Decomposition of the Matrix A T A
The coefficient accuracy of the second-order polynomial equation is crucial to the entire system.This paper proposes a parallel structure that uses the floating-point block LU decomposition to solve the inverse of A T A .The matrix A can be represented as four matrices A 11 , A 12 , A 21 , and A 22 [48][49][50][51], as shown in Equation (20).
where A 11 , A 12 , A 21 , A 22 , L 21 and U 12 are 3 × 3 matrices; L 11 and L 22 are 3 × 3 lower triangular matrices, U 11 and U 22 are 3 × 3 upper triangular matrices.From Equation ( 20), some additional equations can be achieved: The steps of LU decomposition are as follows.
Step 1: A 11 is performed by block LU decomposition; L 11 and U 11 are obtained.
Step 2: From Equation ( 22), U 12 can be solved by Equation ( 25): Step 3: From Equation ( 23), L 21 can be calculated by the product of A 21 and (U 11 ) −1 (see Equation ( 26)): Step Step  7. Five multipliers, three divisors, one adder, and four subtractors are used.Additionally, some control signals are used to ensure that the results are outputted at the same clock cycle.It is well known that the inverse of a lower triangular matrix is another lower triangular matrix, and the inverse of an upper triangular matrix is also another upper triangular matrix.The inversion of matrices 11 L and 11 U can be rewritten as Equation ( 28) and ( 29), respectively.
21 21 where The parallel computation structures for calculating  It is well known that the inverse of a lower triangular matrix is another lower triangular matrix, and the inverse of an upper triangular matrix is also another upper triangular matrix.The inversion of matrices L 11 and U 11 can be rewritten as Equations ( 28) and ( 29), respectively. where where The parallel computation structures for calculating (L 11 ) −1 and (U 11 ) −1 are shown in Figure 8. Six multipliers, four divisors, one subtractor ("-" in the circle), and three reciprocals ("/" in the circle) are used.The reverse operation ("-" in the circle) for floating-point is simply to reverse the symbol bit and require very few resources.Additionally, some control signals are used to ensure that the results are outputted at the same clock cycle.25) and ( 26) can be rewritten into Equations ( 30) and (31).In Equation ( 30), the elements of the matrix contain three formats, i.e., a 14 , IL 21 a 14 + a 24 , and IL 31 a 14 + IL 32 a 24 + a 34 .In Equation ( 31), the elements of the matrix contain three formats, i.e., a 41 IU 11 , a 41 IU 12 + a 42 IU 22 , and a 41 IU 13 + a 42 IU 23 + a 43 IU 33 .The proposed parallel architecture for calculation of L 21 and U 12 is depicted in Figure 9. Twelve multipliers, fifteen MD modules, and three adders are used.
where NewA 22 is a 3 × 3 matrix, with a similar format of A 11 .Therefore, the LU decomposition of NewA 22 is not deduced in details.The parallel implementation is shown in Figure 10.
where 22 NewA is a 3 3 × matrix, with a similar format of 11 A .Therefore, the LU decomposition of 22 NewA is not deduced in details.The parallel implementation is shown in Figure 10.(where d is the dimension of each block, n is the size of matrix).In the standard LU decomposition, the number of multiplication operations is about (2 1)( ( ) For the Block LU decomposition, the number of multiplications is about n 3 /3 + (n%d − 0.5)n 2 , and division is (d + 1)n/2 (where d is the dimension of each block, n is the size of matrix).In the standard LU decomposition, the number of multiplication operations is about n(2n − 1)(n − 1)/6 n, and division is n(n − 1)/2 [52].The multipliers and dividers based block LU decomposition are approximately reduced 1.02 and 1.25 times than that the traditional LU decomposition.2. FPGA-Based Implementation of (A T A) The scheme for (A T A) −1 is shown in Figure 12. Figure 12a is an FPGA implementation of  From the ten GCPs data, L x and L y matrix can be obtained, since a and b matrices have the same structure, such as A T L x , (A T A) −1 (A T L x ), A T L y , and (A T A) −1 (A T L y ), the implementation of a is given to illustrate the processes.From Equation ( 34), six MD_10 modules are adopted.Six MD_6 modules are needed for solving Equation (35).Considering the limited resources of an FPGA, a serial framework is used to implement a , the strategy is depicted in Figure 13 (such as (A T A) −1 (A T L x )).
The details of state transition graph (STG) are as follows: • S_idle, idle state.When the rst signal is low, the system is in the reset state, all registers (R0, R1, ..., R5) and other signals are reset.illustrate the processes.From Equation ( 34), six MD_10 modules are adopted.Six MD_6 modules are needed for solving Equation (35).Considering the limited resources of an FPGA, a serial framework is used to implement a , the strategy is depicted in Figure 13 (such as 1 ( ) ( ) The     In this order, when the S_6 SM (final State) is completed, the matrix a is derived.Under the same clock cycle, the values of the registers R5 to R0 (a 0 -a 5 ) are parallel outputted.The current state returns to the first state.Additionally, each state should contain a certain delay time.The delay time should include MD_6 module operation time and serial storage time.

FPGA-Based Implementation of Coordinate Transformation and Bilinear Interpolation Algorithm
Figure 14 shows the flowchart for the coordinate transformation and bilinear interpolation method, which consists of the transformation from the output image coordinates to the ground coordinates, conversion the ground coordinate to the raw image coordinates, and bilinear interpolation.

FPGA-Based Implementation of Coordinate Transformation and Bilinear Interpolation Algorithm
Figure 14 shows the flowchart for the coordinate transformation and bilinear interpolation method, which consists of the transformation from the output image coordinates to the ground coordinates, conversion the ground coordinate to the raw image coordinates, and bilinear interpolation.( ) ( )

FPGA-Based Implementation of X g and Y g
The purpose of this step is to obtain X g and Y g in Equation ( 12) of Section 2.1.2,as shown in Figure 15.In this part, 32-bit integer and shift register are used to reduce the resource utilization of an FPGA.To implement the shift operation, X GSD and Y GSD are approximated as 2 m or 2 m − 2 n (m and n are integers).For instance, X GSD = Y GSD = 30, which can be expressed as the form of 2 5 − 2. In Figure 15, the symbol "<<" in the circle represents the left shift operation.Y .

FPGA-Based Implementation of Bilinear Interpolation
When g X and g Y are calculated, the coordinates of _ x row and _ y column are obtained from Equation ( 14) and ( 15).The bilinear interpolation algorithm is performed based on the _ x row and _ y column as shown in Figure 16.As observed from Figure 16, the processing of bilinear interpolation is divided into three steps.

FPGA-Based Implementation of Bilinear Interpolation
When X g and Y g are calculated, the coordinates of x_row and y_column are obtained from Equations ( 14) and ( 15).The bilinear interpolation algorithm is performed based on the x_row and y_column as shown in Figure 16.As observed from Figure 16, the processing of bilinear interpolation is divided into three steps.In step (I), the integer part i , j , the fractional part u , v and the weight part 1 u − and 1 v − of the floating-point ( _ , _ ) x row y column coordinates are computed, respectively.Figure 17 shows the parallel implementation method for i , j , u , v , 1 u − and 1 v − .Four subtractors, two INT (integer) modules and two absolute modules ("||" in the circle) are used.In step (II), it is mainly to calculate the address of four nearest neighbor pixels and read the corresponding gray value.The address of memory can be calculated from Equation (17) of Section 2.2.2.The processing of reading gray value is shown in Figure 18.To reduce the resources of an FPGA, the multiplication is converted into left shift operation.In step (I), the integer part i, j, the fractional part u, v and the weight part 1 − u and 1 − v of the floating-point (x_row, y_column) coordinates are computed, respectively.Figure 17 shows the parallel implementation method for i, j, u, v, 1 − u and 1 − v. Four subtractors, two INT (integer) modules and two absolute modules ("||" in the circle) are used.In step (I), the integer part i , j , the fractional part u , v and the weight part 1 u − and 1 v − of the floating-point ( _ , _ ) x row y column coordinates are computed, respectively.Figure 17 shows the parallel implementation method for i , j , u , v , 1 u − and 1 v − .Four subtractors, two INT (integer) modules and two absolute modules ("||" in the circle) are used.In step (II), it is mainly to calculate the address of four nearest neighbor pixels and read the corresponding gray value.The address of memory can be calculated from Equation (17) of Section 2.2.2.The processing of reading gray value is shown in Figure 18.To reduce the resources of an FPGA, the multiplication is converted into left shift operation.In step (II), it is mainly to calculate the address of four nearest neighbor pixels and read the corresponding gray value.The address of memory can be calculated from Equation (17) of Section 2.2.2.The processing of reading gray value is shown in Figure 18.To reduce the resources of an FPGA, the multiplication is converted into left shift operation.In step (II), it is mainly to calculate the address of four nearest neighbor pixels and read the corresponding gray value.The address of memory can be calculated from Equation (17) of Section 2.2.2.The processing of reading gray value is shown in Figure 18.To reduce the resources of an FPGA, the multiplication is converted into left shift operation.
, ( , 1) I i j + , and ( 1, 1) the gray value of the georeferenced image can be calculated in step (III).According to Equation ( 13) and ( 16).The implementation of bilinear interpolation algorithm is shown in Figure 19.Eight multipliers and three adders are used to compute the value of ( , _ ) I x row y column − . After the values of i, j, u, v, 1 − u, 1 − v, I(i, j), I(i + 1, j), I(i, j + 1), and I(i + 1, j + 1) are computed, the gray value of the georeferenced image can be calculated in step (III).According to Equations ( 13) and ( 16).The implementation of bilinear interpolation algorithm is shown in Figure 19.Eight multipliers and three adders are used to compute the value of I(x − row, y_column).

The Software and Hardware Environment
The proposed scheme is implemented on a custom-designed board which contains a Xilinx Virtex-7-XC7VX980t-ffg1930-1 FPGA that has 612,000 logic cells, 1,500kB Block RAM, 1,224,000 Flip-Flops, and 3,600 DSP slices.Additionally, the design tool is Vivado 2014.2, the simulation tool is ModelSim SE-64 10.4, and the hardware design language is Verilog HDL.To validate the proposed method, the georeferenced algorithm is also implemented by MATLAB R2014a, Visual Studio 2015 (C++) and ENVI 5.3 on a PC equipped with an Intel (R) Core i7-4790 CPU @ 3.6GHz and 8GB RAM, running Windows 7 (64 bit).

Data
To validate the proposed FPGA-based algorithm, two data sets are used to perform the georeferencing.The first data set is acquired from the ENVI example dataset, i.e., bldt_tm.imgand bldt_tm.pts.The second data set is obtained from the ERDAS example dataset, i.e., tmAtlanta.imgand panAtanta.img.Figure 20(a) and (b) display the raw data sets of bldt_tm.img and tmAtlanta.img,respectively.The image size is 256×256 pixels 2 .The information of two datasets is shown in Table 2.

The Software and Hardware Environment
The proposed scheme is implemented on a custom-designed board which contains a Xilinx Virtex-7-XC7VX980t-ffg1930-1 FPGA that has 612,000 logic cells, 1,500 kB Block RAM, 1,224,000 Flip-Flops, and 3,600 DSP slices.Additionally, the design tool is Vivado 2014.2, the simulation tool is ModelSim SE-64 10.4, and the hardware design language is Verilog HDL.To validate the proposed method, the georeferenced algorithm is also implemented by MATLAB R2014a, Visual Studio 2015 (C++) and ENVI 5.3 on a PC equipped with an Intel (R) Core i7-4790 CPU @ 3.6 GHz and 8 GB RAM, running Windows 7 (64 bit).

Data
To validate the proposed FPGA-based algorithm, two data sets are used to perform the georeferencing.The first data set is acquired from the ENVI example dataset, i.e., bldt_tm.imgand bldt_tm.pts.The second data set is obtained from the ERDAS example dataset, i.e., tmAtlanta.imgand panAtanta.img.Figure 20a,b display the raw data sets of bldt_tm.img and tmAtlanta.img,respectively.The image size is 256 × 256 pixels 2 .The information of two datasets is shown in Table 2.

Data
To validate the proposed FPGA-based algorithm, two data sets are used to perform the georeferencing.The first data set is acquired from the ENVI example dataset, i.e., bldt_tm.imgand bldt_tm.pts.The second data set is obtained from the ERDAS example dataset, i.e., tmAtlanta.imgand panAtanta.img.Figure 20(a) and (b) display the raw data sets of bldt_tm.img and tmAtlanta.img,respectively.The image size is 256×256 pixels 2 .The information of two datasets is shown in Table 2.

Error Analysis
To quantitatively evaluate the accuracy of the georeferencing implemented using FPGA, the root mean squared error (RMSE) is used in Equation (36) through Equation ( 38) [45].
where x k and y k are coordinates of the georeferenced image which are computed by the proposed method; x k and y k are reference geodetic coordinates; and n is the number of check points (CPs).
To compute the root mean squared errors (RMSEs), one hundred CPs are selected as shown in Figure 21.
With the experimental results, a few conclusions can be drawn as follows.
(1) From Equation ( 36) through (38), the RMSEs are computed based on FPGA, MATLAB, Visual Studio (C++), and ENVI, as shown in Table 3.It can be concluded that the RMSE x , RMSE y , and RMSE of the georeferenced image of the first data set implemented using FPGA are 0.1441, 0.1672, and 0.2207 pixels, respectively.The accuracy of the georeferenced using FPGA has the same values when implemented by MATLAB and Visual studio (C++), and is close to the values implemented using ENVI software.Additionally, other statistics, such as maximum, minimum and mean error are computed and listed in Table 4.It can be found that the proposed FPGA-based algorithm has a mean error of x and y coordinates at 0.0775 pixels and 0.0945 pixels, respectively; the minimum and the maximum errors of the x and y coordinates are 0.0008 and 0.0026 pixels, and 0.5113 and 0.4038 pixels, respectively.The accuracy of the various errors are calculated by the proposed algorithm has the same values than those based on MATLAB and Visual Studio (C++), and has a close accuracy than that based on ENVI software.With the experimental results, a few conclusions can be drawn as follows.
(1) From Equation ( 36) through (38), the RMSEs are computed based on FPGA, MATLAB, Visual Studio (C++), and ENVI, as shown in Table 3.It can be concluded that the RMSEx, RMSEy, and RMSE of the georeferenced image of the first data set implemented using FPGA are 0.1441, 0.1672, and 0.2207 pixels, respectively.The accuracy of the georeferenced using FPGA has the same values when implemented by MATLAB and Visual studio (C++), and is close to the values implemented using ENVI software.Additionally, other statistics, such as maximum, minimum and mean error are computed and listed in Table 4.It can be found that the proposed FPGA-based algorithm has a mean error of x and y coordinates at 0.0775 pixels and 0.0945 pixels, respectively; the minimum and the maximum errors of the x and y coordinates are 0.0008 and 0.0026 pixels, and 0.5113 and 0.4038 pixels, respectively.The accuracy of the various errors are calculated by the proposed algorithm has the same values than those based on MATLAB and Visual Studio (C++), and has a close accuracy than that based on ENVI software.(2) Table 5 lists the RMSEs of the georeferenced image of the second data set.The values of the RMSE x , RMSE y , and RMSE implemented by FPGA are 0.0965, 0.1268, and 0.1593 pixels, respectively.The RMSEs implemented using FPGA have the same accuracy than those based on MATLAB and Visual Studio (C++), and have a close accuracy implemented using ENVI software.However, it can be considered acceptable for the absolute error is less than one pixel [53].In addition, maximum, minimum and mean errors are computed and listed in Table 6.It can be found that the mean errors implemented using FPGA of the x and y coordinates are 0.0789 and 0.1144 pixels, respectively, and the minimum and maximum errors for the x and y coordinates are 0.0008 and 0.0046 pixels, and 0.2613 and 0.2081 pixels, respectively.As observed form Table 3 through 6, the accuracy of the georeferenced image when implemented using FPGA can reach the requirements because its RMSEs are less than one pixel [54].
Figure 22a-d show the georeferenced images of the first data set implemented by FPGA, MATLAB, Visual Studio (C++) and ENVI, respectively.Figure 23a-d  As observed form Table 3 through 6, the accuracy of the georeferenced image when implemented using FPGA can reach the requirements because its RMSEs are less than one pixel [54].

Gray Value Comparison
To verify the accuracy of gray value, the gray levels of georeferenced image are compared to those implemented using FPGA, MATLAB, Visual Studio (C++), and ENVI software.The georeferenced image implemented using FPGA as a referencing image, called "Ref-Img".The georeferenced image implemented using MATLAB, Visual Studio (C++), and ENVI are called "Img-MATLAB", "Img-C++", and "Img-ENVI", respectively.The gray differences between the Ref-Img and those georeferenced images implemented by MATLAB, Visual Studio (C++), and ENVI are obtained and shown in Figure 24 and 25.
As observed from Figure 24, the gray values of Figure 24a and Figure 24b are 0, which means the proposed method implemented using FPGA has the same accuracy with the implemented by MATLAB, and Visual Studio (C++) [55,56].Figure 24c  As observed form Table 3 through 6, the accuracy of the georeferenced image when implemented using FPGA can reach the requirements because its RMSEs are less than one pixel [54].

Gray Value Comparison
To verify the accuracy of gray value, the gray levels of georeferenced image are compared to those implemented using FPGA, MATLAB, Visual Studio (C++), and ENVI software.The georeferenced image implemented using FPGA as a referencing image, called "Ref-Img".The georeferenced image implemented using MATLAB, Visual Studio (C++), and ENVI are called "Img-MATLAB", "Img-C++", and "Img-ENVI", respectively.The gray differences between the Ref-Img and those georeferenced images implemented by MATLAB, Visual Studio (C++), and ENVI are obtained and shown in Figure 24 and 25.
As observed from Figure 24, the gray values of Figure 24a and Figure 24b are 0, which means the proposed method implemented using FPGA has the same accuracy with the implemented by MATLAB, and Visual Studio (C++) [55,56].Figure 24c

Gray Value Comparison
To verify the accuracy of gray value, the gray levels of georeferenced image are compared to those implemented using FPGA, MATLAB, Visual Studio (C++), and ENVI software.The georeferenced image implemented using FPGA as a referencing image, called "Ref-Img".The georeferenced image implemented using MATLAB, Visual Studio (C++), and ENVI are called "Img-MATLAB", "Img-C++", and "Img-ENVI", respectively.The gray differences between the Ref-Img and those georeferenced images implemented by MATLAB, Visual Studio (C++), and ENVI are obtained and shown in Figures 24 and 25    As observed from Figure 24, the gray values of Figure 24a,b are 0, which means the proposed method implemented using FPGA has the same accuracy with the implemented by MATLAB, and Visual Studio (C++) [55,56].Figure 24c indicates that the difference image between Ref-Img and Img-ENVI has many pixels are not same.The mean values of Figure 24a-c are 0, 0, and 0.713 respectively, which are less than 1 pixel.

Resource Occupation Analysis
As observe from Figure 25, the proposed method has the same accuracy as those implemented using MATLAB, and Visual Studio (C++), and has a small portion of the gray values are different from the Img-ENVI.The mean values of the Figure 25a-c are 0, 0, and 0.1265, respectively, which are all less than 1 pixel.

Resource Occupation Analysis
The resource utilization ration, including the flip-flop (FF), look-up-table (LUT) and DSP48s of the FPGA for the coordinate transformation method and bilinear interpolation function are assessed, respectively.
In the bilinear interpolation function, floating-point and 32-bit fixed-point mixed operations are adopted, which can reduce the resource consumption of an FPGA.Table 8 lists the resources occupied for the bilinear interpolation scheme.27,218 LUTs, 45,823 registers, 456 RAM/FIFO, and 267 DSP48s are utilized at rates of 4.45%, 3.74%, 30.40%, and 7.42%, respectively.The processing speed is considered as one of the most important factors for implementation by an FPGA.Table 9 lists the processing speed implemented by FPGA, Visual Studio (C++) and MATLAB.The size of the first raw image is 256 × 256 pixels 2 .After georeferencing, the size of georeferenced image is 281 × 281 pixels 2 .The running time of the georeferencing method for the first image using FPGA, Visual Studio (C++) and MATLAB is 0.13s, 1.06s, and 1.12s, respectively.The size of the second raw image is 256 × 256 pixels 2 ; after georeferencing, the image size is 285 × 277 pixels 2 .The running time of the georeferencing method for the second raw image using FPGA, Visual Studio (C++) and MATLAB is 0.15s, 1.21s, 1.26s, respectively.To put it simply, the processing speed using FPGA is 8 times faster than that based on PC computer.With the development of technology and the improvement of system performance, low power consumption has become one of the measurement objectives of on-board system.Resources, speed, and power consumption are three key factors in FPGA design.To obtain the power consumption, Vivado software provides a comprehensive methodologies and strategies for power consumption As observed from Figure 26, the powers of the dynamic and the device are 0.280 W (43%) and 0.379 W (57%), respectively.The total on-chip power is 0.659 W, which is acceptable in on-board processing platform.(57%), respectively.The total on-chip power is 0.659W, which is acceptable in on-board processing platform.

Conclusion
This paper presents a novel scheme for on-board georeferencing using FPGA optimized secondorder polynomial equation and bilinear interpolation scheme, which consists of five modules: input data, coordinate transformation, bilinear interpolation and output data.The main contributions of this paper are as follows.
First, a comprehensive framework has been developed to optimize the georeferencing algorithm based on an FPGA.(1) A floating-point block LU decomposition is used to inverse the matrix.Compared with the traditional LU decomposition, the multiplication and division operations are reduced by 1.02 and 1.25 times, respectively.The block LU decomposition method can reduce the complexity for inverting matrix and speed up the operation.(2) To reduce resource consumption of an FPGA, some strategies are adopted in programming, i.e., 32-bit integer and floating-point mixed operation and serial-parallel data communication.
Second, the performances of the proposed algorithm are evaluated by error analysis, gray value comparison, resource occupation analysis, processing speed comparison, and power consumption.The RMSEs are less than one pixel, and other statistics, such as maximum, minimum, and mean error are less than one pixel.The gray values of the georeferenced image when implemented using the FPGA have the same accuracy as those implemented using MATLAB and Visual Studio (C++), and have a very close accuracy implemented using ENVI software.The processing speed using the proposed algorithm is 8 times faster than that based on PC computer.The on-chip power consumption is 0.659W.
Therefore, it can be concluded that the proposed georeferencing algorithm implemented using FPGA with second-order polynomial model and bilinear interpolation algorithm can achieve realtime geographic referencing for remote sensing images.
Author Contributions: G. Zhou conceives and designs the experiments; D. Liu performs the experiments and writes the paper; J. Huang offers the help of FPGA implementation; L. S. analyzes the data; R. Zhang and X.

Conclusions
This paper presents a novel scheme for on-board georeferencing using FPGA optimized second-order polynomial equation and bilinear interpolation scheme, which consists of five modules: input data, coordinate transformation, bilinear interpolation and output data.The main contributions of this paper are as follows.
First, a comprehensive framework has been developed to optimize the georeferencing algorithm based on an FPGA.(1) A floating-point block LU decomposition is used to inverse the matrix.Compared with the traditional LU decomposition, the multiplication and division operations are reduced by 1.02 and 1.25 times, respectively.The block LU decomposition method can reduce the complexity for inverting matrix and speed up the operation.(2) To reduce resource consumption of an FPGA, some strategies are adopted in programming, i.e., 32-bit integer and floating-point mixed operation and serial-parallel data communication.
Second, the performances of the proposed algorithm are evaluated by error analysis, gray value comparison, resource occupation analysis, processing speed comparison, and power consumption.The RMSEs are less than one pixel, and other statistics, such as maximum, minimum, and mean error are less than one pixel.The gray values of the georeferenced image when implemented using the FPGA have the same accuracy as those implemented using MATLAB and Visual Studio (C++), and have a very close accuracy implemented using ENVI software.The processing speed using the proposed algorithm is 8 times faster than that based on PC computer.The on-chip power consumption is 0.659 W.

Figure 1 .
Figure 1.The georeferenced image size.(a) Input image.(b) Output image.(c) Incorrect output image.ur:, upper right; ul:, upper left; ll:, lower left; lr:, lower right.abcd is the input image.ABCD is the storage area for the georeferenced image.a b c d is the georeferenced image.

Figure 2 .
Figure 2. The architecture for the bilinear interpolation.(a) Distributions of the four nearest neighbor pixels and (b) 3D spatial distribution of the bilinear interpolation.

Figure 2 .
Figure 2. The architecture for the bilinear interpolation.(a) Distributions of the four nearest neighbor pixels and (b) 3D spatial distribution of the bilinear interpolation.

Figure 3 .
Figure 3. Raw image mapping with double data rate (DDR) 4 a , 5 a , 0 b , 1 b , 2 b , 4 b , and 5 b are calculated in the CCM when the GCP data are arrived.The matrices T A A , are parallel computed. The coordinate transformation is carried out in the CTM when the coefficients of second-order polynomial equation are arrived.The values of i , j , u , 1 u − , v and 1 v − are simultaneously sent to the BIM.

Figure 3 .
Figure 3. Raw image mapping with double data rate (DDR) 4 , a 5 , b 0 , b 1 , b 2 , b 4 , and b 5 are calculated in the CCM when the GCP data are arrived.The matrices A T A , A T L x , A T L y , and (A T A) −1 are parallel computed.

Figure 4 .
Figure 4.A field programmable gate array (FPGA) architecture for the georeferenced remote sensing images with second-order polynomial equation and bilinear interpolation scheme.RAM: random access memory; clk: clock; rst: reset; en: enable.

Figure 5 )
: (I) Form the matrices A , x L , and y L ; (II) Calculate T A

Figure 5 .
Figure 5.The proposed parallel implementation of the solving coefficient.

Figure 4 .
Figure 4.A field programmable gate array (FPGA) architecture for the georeferenced remote sensing images with second-order polynomial equation and bilinear interpolation scheme.RAM: random access memory; clk: clock; rst: reset; en: enable.

Figure 4 .
Figure 4.A field programmable gate array (FPGA) architecture for the georeferenced remote sensing images with second-order polynomial equation and bilinear interpolation scheme.RAM: random access memory; clk: clock; rst: reset; en: enable.

Figure 5 )
: (I) Form the matrices A , x L , and y L ; (II) Calculate T A

Figure 5 .
Figure 5.The proposed parallel implementation of the solving coefficient.

Figure 5 .
Figure 5.The proposed parallel implementation of the solving coefficient.

Figure 6 .
Figure 6.The architectures of T A A .
T A A .The matrix A can be represented as four matrices 11 A , 12 A , 21 A , and A [48-51], as shown in Equation (20).

Figure 6 .
Figure 6.The architectures of A T A .
− are shown in Figure 8. Six multipliers, four divisors, one subtractor ("-" in the circle), and three reciprocals ("/" in the circle) are used.The reverse operation ("-"in the circle) for floating-point is simply to reverse the symbol bit and require very few resources.Additionally, some control signals are used to ensure that the results are outputted at the same clock cycle.(a) (b)

Figure 7 .
Figure 7. Parallel computation block lower-upper (LU) decomposition of A 11 .2. FPGA-Based Parallel Computation for Matrices (L 11 ) −1 and (U 11 ) −1It is well known that the inverse of a lower triangular matrix is another lower triangular matrix, and the inverse of an upper triangular matrix is also another upper triangular matrix.The inversion of matrices L 11 and U 11 can be rewritten as Equations (28) and (29), respectively.

Figure 9 .
Figure 9. Parallel computation method for 21 L and 12 U .(a) The architecture of 21 L .(b) The architecture of 12U .

Figure 9 .
Figure 9. Parallel computation method for L 21 and U 12 .(a) The architecture of L 21 .(b) The architecture of U 12 .

Figure 10 .
Figure 10.The architecture of L 22 and U 22 .

•
S_1 to S_6 are the six different state machines (SMs).Under the different SMs, different row values of the matrix (A T A) −1 with matrix A T L x are calculated in the MD_6 modules.The result of MD_6 is serial saved in the registers R5 to R0.When the six SMs are finished, the values of registers R5 to R0 are parallel outputted to the matrix a .• S_1, the first SM.When the rst signal goes high, the current state enters the first state when the enable signal is being received.The values of matrix B −1 (1, :) = (A T A) −1 (1, :) and A T L x are put into the MD_6 module for multiplication and addition.The result is saved in the register R5. • S_2, the second SM.When the S_1 state is complete, the current state enters the second state.The values of matrix B −1 (2, :) = (A T A) −1 (2, :) and A T L x begin to calculate multiplication and addition.Then, the result is saved in the register R4.Remote Sens. 2019, 11, x FOR PEER REVIEW 16 of 28

Figure 13 .
Figure 13.The processing of state transition graph (STG).(a) The state machine.(b) Serial framework.(c) Serial input and parallel output.

Figure 14 .
Figure 14.The flowchart of the coordinate transformation and bilinear interpolation.

1 . 2 −
FPGA-Based Implementation of g X and g Y The purpose of this step is to obtain g X and g Y in Equation (12) of Section 2.1.2,as shown in Figure 15.In this part, 32-bit integer and shift register are used to reduce the resource utilization of an FPGA.To implement the shift operation, GSD X and GSD Y are approximated as 2 m or 2 2 m n− ( m and n are integers).can be expressed as the form of5   2 .In Figure15, the symbol "<<" in the circle represents the left shift operation.

Figure 14 .
Figure 14.The flowchart of the coordinate transformation and bilinear interpolation.

1 . 2 −
FPGA-Based Implementation of g X and g Y The purpose of this step is to obtain g X and g Y in Equation (12) of Section 2.1.2,as shown in Figure 15.In this part, 32-bit integer and shift register are used to reduce the resource utilization of an FPGA.To implement the shift operation, GSD X and GSD Y are approximated as 2 m or 2 2 m n− ( m and n are integers).can be expressed as the form of5   2 .In Figure15, the symbol "<<" in the circle represents the left shift operation.

Figure 15 .
Figure 15.Parallel computation method for g X and g

Figure 15 .
Figure 15.Parallel computation method for X g and Y g .

Figure 17 .
Figure 17.The parallel implementation method for i , j , u , v , 1 u − and 1 v − .

Figure 17 .
Figure 17.The parallel implementation method for i , j , u , v , 1 u − and 1 v − .

Figure 17 .
Figure 17.The parallel implementation method for i, j, u, v, 1 − u and 1 − v.

Figure 17 .
Figure 17.The parallel implementation method for i , j , u , v , 1 u − and 1 v − .

Figure 18 .
Figure 18.The architecture for parallel reading gray values

Figure 18 .
Figure 18.The architecture for parallel reading gray values.

28 Figure 19 .
Figure 19.The parallel computation for bilinear interpolation

Figure 19 .
Figure 19.The parallel computation for bilinear interpolation.

Figure 20 .
Figure 20.The raw image.(a) The first raw image.(b) The second raw image.

Figure 20 .
Figure 20.The raw image.(a) The first raw image.(b) The second raw image.

Figure 21 .
Figure 21.Check points distribution in: (a) the first image and (b) the second image.
show georeferenced images of the second image implemented by FPGA, MATLAB, Visual Studio (C++), and ENVI, respectively.Remote Sens. 2019, 11, x FOR PEER REVIEW 22 of 28 Figure 22a-d show the georeferenced images of the first data set implemented by FPGA, MATLAB, Visual Studio (C++) and ENVI, respectively.Figure 23a-23d show georeferenced images of the second image implemented by FPGA, MATLAB, Visual Studio (C++), and ENVI, respectively.
show the georeferenced images of the first data set implemented by FPGA, MATLAB, Visual Studio (C++) and ENVI, respectively.Figure 23a-23d show georeferenced images of the second image implemented by FPGA, MATLAB, Visual Studio (C++), and ENVI, respectively.

Figure 24 .Figure 25 .
Figure 24.The differences between referencing image and other georeferenced images of the first image.(a) The differences between Ref-Img and Img-C++.(b) The differences between Ref-Img and Img-MATLAB.(c) The differences between Ref-Img and Img-ENVI.

Figure 24 .Figure 24 .Figure 25 .
Figure 24.The differences between referencing image and other georeferenced images of the first image.(a) The differences between Ref-Img and Img-C++.(b) The differences between Ref-Img and Img-MATLAB.(c) The differences between Ref-Img and Img-ENVI.

Figure 25 .
Figure 25.The differences between referenced image and other georeferenced images of the second image.(a) The differences between Ref-Img and Img-C++.(b) The differences between Ref-Img and Img-MATLAB.(c) The differences between Ref-Img and Img-ENVI.
(27)rding to Equation(27), a parallel computation architecture for matrices L 11 and U 11 is presented in Figure and IL 31 = −IL 32 L 21 − L 31 .

Table 2 .
The information of two data sets

Table 2 .
The information of two data sets.

Table 3 .
The values of root mean square errors (RMSEs) of the georeferenced image of the first image.

Table 3 .
The values of root mean square errors (RMSEs) of the georeferenced image of the first image.

Table 4 .
Statistics values of the georeferenced image of the first image.

Table 5 .
The values of RMSEs of the georeferenced image of the second image.

Table 6 .
Statistics values of the georeferenced image of the second image.

Table 7 .
The logic unit utilization ratio of the coordinate transformation method.

Table 8 .
The logic unit utilization ratio of the bilinear interpolation method.

Table 9 .
The consumption speed.