FPGA Design of Enhanced Scale-Invariant Feature Transform with Finite-Area Parallel Feature Matching for Stereo Vision

Abstract: In this paper, we propose an FPGA-based enhanced-SIFT with feature matching for stereo vision. Gaussian blur and difference of Gaussian pyramids are realized in parallel to accelerate the processing time required for multiple convolutions. As for the feature descriptor, a simple triangular identification approach with a look-up table is proposed to efficiently determine the direction and gradient of the feature points. Thus, the dimension of the feature descriptor in this paper is reduced by half compared to conventional approaches. As far as feature detection is concerned, the condition for high-contrast detection is simplified by moderately changing a threshold value, which also benefits the reduction of the resulting hardware in realization. The proposed enhanced-SIFT not only accelerates the operational speed but also reduces the hardware cost. The experimental results show that the proposed enhanced-SIFT reaches a frame rate of 205 fps for 640 × 480 images. Integrated with two enhanced-SIFTs, a finite-area parallel checking is also proposed without the aid of external memory to improve the efficiency of feature matching. The resulting frame rate of the proposed stereo vision matching can be as high as 181 fps with good matching accuracy, as demonstrated in the experimental results.


Introduction
Recently, vision-based simultaneous localization and mapping (V-SLAM) techniques have become more and more popular due to the need for autonomous navigation of mobile robots [1,2]. The front-end feature point detection and feature matching are especially important because their accuracy significantly influences the performance of back-end visual odometry, mapping, and pose estimation [3,4]. Among front-end schemes, although speeded-up robust features (SURF) exhibit a faster operational speed, their accuracy is worse than that of the scale-invariant feature transform (SIFT) [5,6]. Nevertheless, the high accuracy of SIFT is achieved at the cost of a time-consuming process. Although CUDA implementations on GPUs for parallel programming can be used for various feature detection and matching tasks [7,8], software-based approaches generally require larger power consumption, which is not desirable for mobile robot vision applications. Thus, this paper aims to improve the operational speed of SIFT by hardware implementation without sacrificing its accuracy. Over the past years, many researchers have devoted themselves to improving the operational efficiency of SIFT [9]. Among them, a pipelined FPGA-based architecture was proposed in [10] to process images at double speed at the price of a higher hardware cost. A hardware-implemented SIFT was also proposed in [11] to reduce the number of internal registers. However, the resulting image frame rate is not high enough because of the finite bandwidth of the external memory. As a result, the efficiency of the subsequent mapping process could be limited. In [12], multiple levels of Gaussian-blurred images were simultaneously generated by adequately modifying the Gaussian kernels to shorten the processing time, resulting in a frame rate of 150 fps for 640 × 480 images. Although the modified approach used only a few dividers for feature detection, the resulting hardware cost is difficult to reduce because of the square function in the design. Besides, the coordinate rotation digital computer (CORDIC) algorithm adopted for finding the phases and gradients of the feature descriptor often requires considerable hardware resources and a long latency period due to the need for multiple iterations, significantly preventing the approach from being applied in real-world applications.
As far as feature matching is concerned, in conventional exhaustive methods, each feature point of one image must be compared with all feature points of the other image to find the matching pairs [13]. A large number of feature points would incur extra memory access time. Moreover, the realization of feature matching often consumes a large amount of hardware resources. Although the random sample consensus (RANSAC) algorithm [14] can improve the accuracy of feature matching, the resulting frame rate is only 40 fps. Therefore, improving the operational speed of SIFT for use with feature matching remains a great challenge.
In this paper, we propose an enhanced-SIFT (E-SIFT) to reduce the required hardware resources and accelerate the operational speed without sacrificing accuracy. Gaussian-blurred images and difference of Gaussian (DoG) pyramids can be simultaneously obtained by the meticulous design of a realizing structure. We also propose a simple triangular identification (TI) with a look-up table (LUT) to easily and quickly decide the direction and gradient of a pixel point. The resulting dimension of the feature points can thus be effectively reduced by half. This is not only beneficial to the operational speed of the proposed E-SIFT and the subsequent feature matching but also helpful in reducing the hardware cost. In feature detection, the condition of high-contrast detection is also modified by slightly adjusting a threshold value without substantial change, thus considerably reducing the required logic elements and processing time.
For stereo vision, we also propose a finite-area parallel (FAP) feature matching approach integrated with two E-SIFTs. According to the position of a feature point in the right image, a fixed number of feature points in the corresponding area of the left image are chosen and stored first. Then, feature matching proceeds in parallel to efficiently and accurately find the matching pairs without using any external memory. Based on epipolar geometry, a minimum threshold is assigned during feature comparison to avoid matching errors caused by the visual angle difference between the two cameras [15]. Besides, a valid signal is designed in the system to prevent the finite bandwidth of the external SDRAM or discontinuous image data transmission from corrupting the matching accuracy.
In the experiments, we use an Altera FPGA hardware platform to evaluate the feasibility of the proposed TI with LUT scheme and to test the flexibility of the proposed E-SIFT and of the double E-SIFT with FAP feature matching.

Proposed E-SIFT and FAP Feature Matching for Stereo Vision
Figure 1 shows the proposed FPGA-based E-SIFT architecture, where an initial image is convoluted with a meticulously designed set of Gaussian filters to simultaneously generate a Gaussian-blurred image and the DoG pyramid. Through a definite mask in the Gaussian-blurred image, the direction and gradient of each detection point can be determined by the proposed TI with LUT method. The results in each local area are then summed and collected to form a feature descriptor. Since the summations among different partition areas present large differences, the values in the feature descriptors are normalized to acceptable ones by a simple scheme without considerably altering the original character of the distribution. At the same time, on the right side of Figure 1, the corresponding detection point in the DoG images passes through extrema, high-contrast, and corner detections to decide whether a feature point exists. If the result is positive, a confirmed signal and the corresponding feature descriptor are delivered to the feature matching module.

Image Pyramid
The purpose of building the image pyramid is to find feature points from the differences of consecutive scale-spaces. Multi-level Gaussian-blurred images are created one by one through a series of Gaussian functions, which reduces the speckles and interfering noise in the original image [9]. Then, the contour of objects in the original image is vaguely outlined by the differences between consecutive Gaussian-blurred images. Thus, the image features are preserved as the basis for subsequent feature detection. In general, a long latency is expected in constructing the multi-level Gaussian-blurred pyramid. Thus, as shown in Figure 2a, a more efficient structure was proposed in [12] by simultaneously convoluting all the Gaussian functions, where suitable Gaussian kernels can be found by training the system with different images. To further accelerate the construction of the DoG pyramid, a simplified structure is proposed in this paper, where new kernels for convolution are first derived by subtracting two adjacent ones, as shown in Figure 2b. Then, the nth-level DoG image, Dn(x, y), can be obtained by directly convoluting the original image with the new kernel:

Dn(x, y) = Ln+1(x, y) − Ln(x, y) = [Gn+1(x, y) − Gn(x, y)] ∗ I(x, y),

where Ln(x, y) and Gn(x, y) are the nth-level Gaussian-blurred image and Gaussian kernel, respectively, and I(x, y) is the initial image. In the proposed structure, the DoG pyramid and the required Gaussian-blurred image can be created at the same time. Here, only the 2nd-level Gaussian-blurred image is reserved as the basis of the feature descriptor. Since a 7 × 7 mask is adopted mainly for the convolution in the image pyramid, 49 general-purpose 8-bit registers are required to store the pixel data. For an image width of 640 pixels, a large number of 8-bit registers would need to be aligned row by row for the serial input of pixel data. To dramatically reduce the number of internal registers, we propose a buffer structure as shown in Figure 3. We utilize Altera optimized RAM-based shift registers to form a six-row buffer array, where each row includes 633 registers. Once the pixel data reaches the last register, i.e., reg. 49, the 7 × 7 convolution can be accomplished in one clock cycle. A similar buffer structure with different sizes can also be adopted for the following building blocks requiring a mask, to efficiently reduce the register count as well as shorten the buffer delay.
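The equivalence that motivates the pre-subtracted kernels — one convolution with (Gn+1 − Gn) yields the same DoG level as two convolutions followed by a subtraction, by linearity of convolution — can be checked with a small software model. This is an illustrative sketch, not the hardware datapath; the kernel size and sigma values are assumptions.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sampled, normalized 2-D Gaussian kernel (size x size, size odd)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def conv2d_same(image, kernel):
    """Plain 'same' 2-D convolution with zero padding (reference model only)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Convolving once with the pre-subtracted kernel (G_{n+1} - G_n) yields the
# DoG level directly, matching L_{n+1} - L_n computed the conventional way.
rng = np.random.default_rng(0)
img = rng.uniform(0, 255, (16, 16))
g1 = gaussian_kernel(7, 1.0)
g2 = gaussian_kernel(7, 1.6)
dog_direct = conv2d_same(img, g2 - g1)                      # one convolution
dog_classic = conv2d_same(img, g2) - conv2d_same(img, g1)   # two convolutions
assert np.allclose(dog_direct, dog_classic)
```

In hardware this halves the number of 7 × 7 convolution datapaths needed per DoG level, since the subtraction is folded into the kernel coefficients at design time.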

Feature Descriptor
To accelerate the generation of feature descriptors, the direction and gradient of the detection point need to be quickly determined. Thus, in this paper, the TI method is proposed to promptly determine the direction of the detection point. Combined with a look-up table, the gradient of the detection point can be easily estimated. In general, eight different directions need to be determined to reduce the complexity of the phase of the detection point [9,16]. In this paper, the eight directions with unsigned gradients have been simplified to four directions with signed ones to reduce the hardware cost. Although the resulting gradient size increases by one bit, the dimension of the feature descriptor can be effectively reduced by half. This is also helpful to the implementation of the FAP feature matching.
Figure 4 shows the block diagram of the feature descriptor. First, a 3 × 3 mask is applied to the 2nd-level Gaussian-blurred image to find the differences between adjacent pixels in the x and y axes with respect to the center. The direction and gradient of the center point are then determined by the proposed TI with LUT method. Finally, the feature descriptor of the detection point can be synthesized by finding and collecting the gradient sums of the four directions in each 4 × 4 partition area of a 16 × 16 mask.

Triangular Identification and Look-Up Table
In conventional approaches, the center point, i.e., L2(x, y), of every 3 × 3 mask shifting in the 2nd-level Gaussian-blurred image is the detection point. The variations of the detection point on the x and y axes, i.e., ∆p and ∆q, are defined by the differences between two adjacent pixel values, as shown in Figure 5. Since these two variations are perpendicular to each other, the resulting phase and magnitude, which define the direction and gradient of the detection point, can be easily determined by the following equations [9].
θ(x, y) = arctan(∆q/∆p),   (4)

m(x, y) = √(∆p² + ∆q²).   (5)

Conventional approaches generally utilize the CORDIC algorithm to derive the results of Equations (4) and (5) [17]. However, time-consuming and tedious procedures ensue due to the large number of iterations. In this paper, we propose a simple TI method to determine the phase of detection points. Since only eight directions exist in the original definition, the direction of the detection point can be easily determined by the values of ∆p and ∆q, which are the two legs of a right-angled triangle. For example, if ∆p is larger than ∆q and both are greater than zero, then the hypotenuse is located in direction 0, as shown in Figure 6a. Thus, the direction of the detection point is classified as phase 0 directly. Conversely, the direction of the detection point is classified as phase 1 if ∆p is smaller than ∆q. In a similar way, we can easily recognize the other directions by using Table 1. The required conditions are also devised to distinguish the different directions. Absolute values are used because the lengths of both legs cannot be negative.
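The TI classification can be modeled in software with sign and magnitude comparisons only, avoiding the arctangent entirely. The octant boundary conventions below are assumptions for illustration (the paper's Table 1 fixes the actual conditions); the sketch verifies them against a reference arctangent away from the 45° boundaries.

```python
import math

def ti_phase(dp, dq):
    """Classify (dp, dq) into one of eight 45-degree octants using only the
    signs of dp, dq and a comparison of their absolute values. The exact
    boundary conventions here are illustrative assumptions."""
    ap, aq = abs(dp), abs(dq)
    if dp >= 0 and dq >= 0:
        return 0 if ap >= aq else 1
    if dp < 0 and dq >= 0:
        return 3 if ap >= aq else 2
    if dp < 0 and dq < 0:
        return 4 if ap >= aq else 5
    return 7 if ap >= aq else 6

def octant_from_angle(dp, dq):
    """Reference classification via the arctangent of Equation (4)."""
    ang = math.degrees(math.atan2(dq, dp)) % 360.0
    return int(ang // 45.0)

# Interior test angles, one per octant (boundaries depend on the convention).
for deg in (20, 110, 250, 340):
    dp, dq = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    assert ti_phase(dp, dq) == octant_from_angle(dp, dq)
```

In hardware, each branch maps to a comparator on the sign bits and one magnitude comparator, so all eight conditions can be evaluated in a single cycle.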
For a right-angled triangle, if there is a big difference between the lengths of the two legs, the hypotenuse length can be approximated as that of the longer leg. If the two legs have equal lengths, i.e., ∆p = ∆q, the hypotenuse length will be √2 times the leg length, as shown in Figure 6b. Thus, the gradient of the detection point cannot be directly estimated only by comparing the two legs. Here, a look-up table is proposed for estimating the gradient of the detection point, as listed in Table 2. First, the ratio, h, of the variations in the two axes is evaluated to choose the correction factor, K, as shown in the table. The gradient of the detection point, m, is then approximated by the length of the longer leg multiplied by the chosen correction factor.
Note that the feature descriptor requires eight dimensions to express the gradients of the eight directions. To reduce hardware resources, a representation with fewer dimensions is adopted in this paper. From the observation of Figure 6, phase 4 has the opposite direction to phase 0. Similarly, phase 5 has the opposite direction to phase 1, and so on. Thus, the eight-direction representation can be simplified to a four-direction one to reduce the resulting dimensions. For example, if the detection point has an unsigned gradient in phase 4, it is changed to a negative one in phase 0. As shown in Figure 7, the proposed TI with LUT approach for detecting the direction and gradient of the detection point is processed in parallel to accelerate the operational speed. The presented structure is simpler than the CORDIC, so the hardware resources for implementation can be effectively reduced.
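A software sketch of the LUT-based magnitude approximation follows. The bin boundaries and correction factors K below are illustrative choices (K ≈ √(1 + h²) at each bin center), not the values of the paper's Table 2; with four bins the worst-case relative error stays under about 7%.

```python
import math

# Illustrative correction-factor table (the paper's Table 2 values are not
# reproduced here): quantize the leg ratio h = short/long into four bins and
# use K ~= sqrt(1 + h^2) evaluated at the bin centre.
K_TABLE = [
    (0.25, 1.008),   # h in [0.00, 0.25): K for h ~ 0.125
    (0.50, 1.068),   # h in [0.25, 0.50): K for h ~ 0.375
    (0.75, 1.179),   # h in [0.50, 0.75): K for h ~ 0.625
    (1.01, 1.329),   # h in [0.75, 1.00]: K for h ~ 0.875
]

def lut_magnitude(dp, dq):
    """Approximate sqrt(dp^2 + dq^2) as K * longer_leg, K from a small LUT."""
    long_leg = max(abs(dp), abs(dq))
    short_leg = min(abs(dp), abs(dq))
    if long_leg == 0:
        return 0.0
    h = short_leg / long_leg
    for upper, k in K_TABLE:
        if h < upper:
            return k * long_leg
    return math.sqrt(2.0) * long_leg  # unreachable for h <= 1; defensive
```

Since h ≤ 1 by construction, the LUT needs only a handful of entries, and the multiply by K can itself be reduced to shift-and-add terms in hardware.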

Feature Descriptor
The feature descriptor consists of the gradient sums in four directions for each 4 × 4 partition area in a 16 × 16 mask, as shown in Figure 8a. Since one 16 × 16 mask area contains sixteen 4 × 4 partitions, the feature descriptor with a four-direction representation possesses 64 dimensions, half of that of conventional methods, as shown in Figure 8b. This not only reduces the devices required in implementation but also accelerates the production of feature descriptors. These benefits are also helpful to the subsequent feature matching.
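The descriptor assembly can be sketched as follows. The per-pixel phase and signed gradient are assumed to come from the TI with LUT stage; the cell-then-direction ordering of the 64 dimensions is an assumption, since the paper does not spell out the layout.

```python
import numpy as np

def build_descriptor(phase, grad):
    """Collect signed gradient sums of the 4 directions over the sixteen
    4x4 cells of a 16x16 patch -> a 64-dimensional descriptor (half of the
    classic 128 dims of 8 unsigned directions).
    phase: int array in 0..3; grad: signed gradient, both 16x16."""
    assert phase.shape == grad.shape == (16, 16)
    desc = np.zeros((4, 4, 4))           # cell_row, cell_col, direction
    for i in range(16):
        for j in range(16):
            desc[i // 4, j // 4, phase[i, j]] += grad[i, j]
    return desc.reshape(64)              # assumed flattening order

rng = np.random.default_rng(3)
ph = rng.integers(0, 4, (16, 16))
gr = rng.normal(size=(16, 16))
d = build_descriptor(ph, gr)
assert d.shape == (64,)
```

Because each pixel contributes to exactly one cell and one direction, the sixteen cell accumulators can run in parallel in hardware as the mask slides.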

Normalization
For general images, large differences usually exist among the gradient sums, resulting in widely different scales across the dimensions of the feature descriptor. Consequently, the feature descriptor, OF = (o1, o2, . . ., o64), needs to be normalized to decrease the scale without significantly altering the distribution among the gradient sums. The normalization of the feature descriptor is achieved in this paper by:

fi = oi × R / W, i = 1, 2, . . ., 64,

where W is the sum of all gradient sums over all dimensions of the feature descriptor.
A scaling factor, R = 255, is chosen here for better discrimination. That is, the resulting gradient sums of the feature descriptor lie between ±255. Thus, the required bit number of a gradient sum is reduced from 13 bits to 9 bits by the normalization of the feature descriptor. In the implementation of the presented normalization, we introduce an additional multiplier with a power of 2 approximating the sum of all gradients to achieve a simple operation. Thus, the normalized feature descriptor F can be easily obtained by only shifting the bits of the feature descriptor. The realized block diagram of the normalization is shown in Figure 9.
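The shift-only normalization can be sketched as follows, assuming W is rounded to the nearest power of two 2^k so that multiplying by R/W ≈ 2^8/2^k collapses to a single shift by (8 − k); the rounding rule and the use of |oi| when summing W are assumptions for this model.

```python
import math

def normalize_pow2(descriptor, r_bits=8):
    """Approximate f_i = o_i * R / W (R = 255 ~ 2^8) with shifts: W is
    rounded to the nearest power of two 2^k, so the scaling becomes a
    single left/right shift by (r_bits - k). Integer inputs assumed."""
    w = sum(abs(o) for o in descriptor)
    k = max(0, round(math.log2(w))) if w > 0 else 0
    shift = r_bits - k
    if shift >= 0:
        return [o << shift for o in descriptor]
    return [o >> -shift for o in descriptor]   # arithmetic shift for ints
```

For example, a descriptor of sixty-four entries of 100 has W = 6400 ≈ 2^13, so each entry is shifted right by 5, close to the exact value 100 × 255/6400 ≈ 3.98. The power-of-two rounding bounds the scale error by a factor of √2 either way, which is acceptable since matching compares descriptors normalized by the same rule.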

Feature Detection
In the beginning, we use three levels of DoG images to establish a 3D space. A 3 × 3 mask is then applied to the three DoG images to define 27 pixels of data for feature detection, as shown in Figure 10. The center pixel of the 3 × 3 mask in the 2nd-level DoG image, D2, is regarded as the detection point. Three kinds of detection, including extrema detection, high-contrast detection, and corner detection, are adopted and processed in parallel, as shown in Figure 11.
Once the results of the three detections are all positive, the detection point becomes a feature point, and a detected signal is sent to the feature descriptor and feature matching blocks for further processing.
To evaluate the variation of the pixel values around the detection point, the 3D and 2D Hessian matrices, H3×3 and H2×2, respectively, are used in the high-contrast and corner detections:

H3×3 = [Dxx Dxy Dxz; Dxy Dyy Dyz; Dxz Dyz Dzz],  H2×2 = [Dxx Dxy; Dxy Dyy],

where the definitions of the elements are listed as follows. Since both matrices are symmetric, only six elements need to be determined. Figure 12 shows the realized structure, where the pixel inputs come from the corresponding positions shown in Figure 10.
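The six unique Hessian elements can be modeled with standard central differences over the three stacked DoG levels; the exact difference stencils realized in Figure 12 are not reproduced in the text, so the conventional second-difference scheme below is an assumption.

```python
import numpy as np

def hessian3x3(dog, x, y):
    """Symmetric 3x3 Hessian at (x, y) of the middle level of three stacked
    DoG images (dog has shape (3, H, W); the detection point is at level 1).
    Symmetry means only 6 of the 9 elements are actually computed."""
    d = dog
    c = d[1, y, x]
    dxx = d[1, y, x + 1] + d[1, y, x - 1] - 2 * c
    dyy = d[1, y + 1, x] + d[1, y - 1, x] - 2 * c
    dzz = d[2, y, x] + d[0, y, x] - 2 * c
    dxy = (d[1, y + 1, x + 1] - d[1, y + 1, x - 1]
           - d[1, y - 1, x + 1] + d[1, y - 1, x - 1]) / 4
    dxz = (d[2, y, x + 1] - d[2, y, x - 1]
           - d[0, y, x + 1] + d[0, y, x - 1]) / 4
    dyz = (d[2, y + 1, x] - d[2, y - 1, x]
           - d[0, y + 1, x] + d[0, y - 1, x]) / 4
    return np.array([[dxx, dxy, dxz],
                     [dxy, dyy, dyz],
                     [dxz, dyz, dzz]])

rng = np.random.default_rng(7)
H = hessian3x3(rng.normal(size=(3, 5, 5)), 2, 2)
assert np.allclose(H, H.T)   # symmetric by construction
```

All six differences read only from the 27-pixel cube of Figure 10, so they can be computed in parallel from the same registers that feed the extrema detection.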

Extreme Detection
In extreme detection, the detection point is compared with the other 26 pixels in the 3D mask range.If the pixel value of the detection point is the maximum or the minimum one, the detection point will be a possible feature point [18].
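The 26-neighbor comparison can be sketched directly; whether ties count as extrema is a convention, and strict comparison is assumed here.

```python
import numpy as np

def is_extremum(dog, x, y):
    """True when the centre pixel of the 3x3x3 neighbourhood across three
    stacked DoG levels (shape (3, H, W), detection point on level 1) is
    strictly the maximum or the minimum of all 27 samples."""
    cube = dog[:, y - 1:y + 2, x - 1:x + 2]
    centre = dog[1, y, x]
    others = np.delete(cube.reshape(-1), 13)   # drop the centre (index 13)
    return bool((centre > others).all() or (centre < others).all())
```

In hardware, the 26 comparisons reduce to two parallel comparator trees (one for maximum, one for minimum) whose outputs are ORed, so the check fits in a single pipeline stage.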


High Contrast Detection
In conventional approaches, when the variation of the pixel values in three-dimension space around the detection point is greater than a threshold of 0.03, the high contrast feature can be assured [11].Here, to simplify the derivations and subsequent hardware design, an approximated value of 1/32 is chosen to be the threshold value.That is, the resulting pixel variation criterion is modified as: where S = [x y z] T denotes the 3D coordinate vector.The pixel variation can be expressed as the Maclaurin series and approximated by: where D(0) represents the pixel value of the detection point.The transpose of the partia derivative of D(0) with respect to S will be: where and Figure 13 shows the realized hardware of the partial derivative.The second partia derivative of D(0) is the same as the 3D Hessian matrix.

High Contrast Detection
In conventional approaches, when the variation of the pixel values in three-dimension space around the detection point is greater than a threshold of 0.03, the high contrast feature can be assured [11].Here, to simplify the derivations and subsequent hardware design, an approximated value of 1/32 is chosen to be the threshold value.That is, the resulting pixel variation criterion is modified as: where S = [x y z] T denotes the 3D coordinate vector.The pixel variation can be expressed as the Maclaurin series and approximated by: where D(0) represents the pixel value of the detection point.The transpose of the partial derivative of D(0) with respect to S will be: where and Figure 13 shows the realized hardware of the partial derivative.The second partial derivative of D(0) is the same as the 3D Hessian matrix.
Electronics 2021, 10, x FOR PEER REVIEW Let the first derivative of Equation ( 18) with respect to S equals zero.
The most variation of the pixel values around the detection point can be fou Substituting Equation (25) into Equation ( 18), we obtain: Substituting Equation (26) into Equation ( 17), we have: Since the matrix inversion, (H3×3) −1 , requires lots of dividers in implementa resulting hardware cost would be high.Thus, the adjugate matrix, Adj(H3×3), an terminant, det(H3×3), are utilized here to accomplish the matrix inversion instead dividers.
Let the first derivative of Equation (18) with respect to S equal zero.
The largest variation of the pixel values around the detection point can then be found at:

Ŝ = −(H3×3)^−1 (∂D(0)/∂S). (25)

Substituting Equation (25) into Equation (18), we obtain:

D(Ŝ) = D(0) + (1/2)(∂D(0)/∂S)^T Ŝ. (26)

Substituting Equation (26) into Equation (17), we have:

|D(0) + (1/2)(∂D(0)/∂S)^T Ŝ| > 1/32. (27)

Since the matrix inversion, (H3×3)^−1, requires many dividers in implementation, the resulting hardware cost would be high. Thus, the adjugate matrix, Adj(H3×3), and the determinant, det(H3×3), are utilized here to accomplish the matrix inversion without using dividers. Figure 14 shows the realized architecture of the determinant and adjugate matrices. Since the adjugate matrix is symmetric, only six elements need to be processed in the implementation. The determinant of H3×3 can be obtained by reusing the first row of the adjugate matrix combined with the elements of the first column of H3×3, which further reduces the hardware cost. With (H3×3)^−1 = Adj(H3×3)/det(H3×3), Equation (27) can thus be rewritten in the divider-free form:

|det(H3×3) D(0) − (1/2)(∂D(0)/∂S)^T Adj(H3×3) (∂D(0)/∂S)| > |det(H3×3)|/32. (31)

That is, the detection point has a high-contrast feature and can be regarded as a possible feature point when Equation (31) is satisfied. Since this condition for high-contrast feature detection is less complicated than the conventional ones, the resulting hardware circuits are simpler, allowing a higher operating speed [12].
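The divider-free test of Equation (31) can be cross-checked in software. The sketch below forms Adj(H3×3) and det(H3×3) and evaluates the rewritten criterion; it assumes det(H3×3) ≠ 0 and uses floating point (and obtains the adjugate via the inverse for brevity), whereas the FPGA builds the cofactors directly in fixed point.

```python
import numpy as np

def high_contrast(D0, g, H):
    """Divider-free contrast test of Equation (31):
    |det(H)*D(0) - 0.5 * g^T Adj(H) g| > |det(H)| / 32,
    where g is the gradient of D at the detection point.
    Software sketch only; assumes det(H) != 0."""
    det = np.linalg.det(H)
    adj = det * np.linalg.inv(H)          # Adj(H) = det(H) * H^-1
    lhs = abs(det * D0 - 0.5 * (g @ adj @ g))
    # scale both sides by 32 so no divider is ever needed
    return 32 * lhs > abs(det)
```

Multiplying both sides of the test by 32 (a shift in hardware) removes the remaining constant division, matching the intent of the divider-free realization.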

Corner Detection
In this paper, the Harris corner detection is adopted [19], where the parameter r is the ratio of the two eigenvalues of H2×2 with r > 1, and tr(H2×2) and det(H2×2) respectively represent the trace and the determinant of H2×2, which are given by:

tr(H2×2) = Dxx + Dyy, det(H2×2) = Dxx Dyy − (Dxy)^2,

and the corner condition is checked in the divider-free form:

r [tr(H2×2)]^2 < (r + 1)^2 det(H2×2). (33)

That is, the detection point has a corner feature if inequality (33) is satisfied. An empirical value, r = 10, is chosen here to acquire a better result. Figure 15 shows the hardware architecture of the corner detection.
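A minimal software model of this test, assuming the divider-free rearrangement of inequality (33) given above (multiplications and comparisons only, as the hardware requires):

```python
def is_corner(dxx, dyy, dxy, r=10):
    """Eigenvalue-ratio corner test, inequality (33), rearranged as
    r * tr(H)^2 < (r+1)^2 * det(H) so that no division is needed.
    A sketch of the principle, not the fixed-point RTL."""
    tr = dxx + dyy
    det = dxx * dyy - dxy * dxy
    return r * tr * tr < (r + 1) ** 2 * det
```

A point with two comparable eigenvalues (a corner) passes, while an elongated edge response, where one eigenvalue dominates, is rejected.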
Once the results of the three detections are all positive, the detection point is logically regarded as a feature point. During the serial input of pixels, the feature point detection and the production of the feature descriptor proceed until no image pixel remains.


Finite-Area Parallel Matching for Stereo Vision
A double E-SIFT with FAP feature matching system is proposed for stereo vision, as shown in Figure 16. The stereo vision system captures images with two parallel cameras for the left and right E-SIFTs, which respectively find the feature points and the corresponding feature descriptors, FL and FR. At the same moment, the corresponding coordinates are output by a counter triggered by the detected signals, dtcL and dtcR, from the two E-SIFTs. Then, the stereo matching points, MXL, MYL, MXR, and MYR, are found and recorded by the proposed FAP feature matching. Since the image transmission could be interrupted by the finite bandwidth of the SDRAM or by noise coupling, a data-valid signal is added to alert all building blocks, preventing abnormal termination or any discontinuity from corrupting the output messages.




Finite-Area Parallel Feature Matching
Suppose two cameras, CL and CR, are set up in parallel at the same height without any difference in tilt angle, and three objects, P1, P2, and P3, are placed in line with the right camera, as shown in the top view of Figure 17a [20]. In theory, the projections of the same object onto the two image planes will lie on the same row. However, based on epipolar geometry [15], the projections of P2 and P3 in the right image could disappear due to the shading effect caused by P1. As shown in Figure 17b, three projections, P1L, P2L, and P3L, appear in the left image, but only one, P1R, appears in the right one. Therefore, to prevent the feature points of the left and right images from being mismatched, a minimum threshold value for comparison is chosen in the proposed FAP feature matching. In reality, the projections of the same object on the two cameras may not fall on exactly the same row because of a slight difference in tilt angle between the two cameras, which would make the feature point matching inaccurate; hence, feature points are matched over a finite area rather than a single row. However, buffering too many feature points would consume a large number of resources; on the other hand, the matching accuracy will decrease if there are not sufficient feature points.

Parallel Arrangement for Feature Data
In stereo vision, the scenery of the left camera is shifted slightly toward the left compared to that of the right camera. Some objects are not captured by the right camera and vice versa, as shown in the pink and blue areas of Figure 18. Thus, as pixels, which begin from the top left of the image, enter the system, mismatches could occur due to the visual-angle difference between the two cameras. Therefore, a small delay is added in the right-image input path to ensure that the feature points found in the left image can be matched with the ones in the right image, as shown in the green area of the figure.
Figure 19 shows the building blocks of the proposed FAP feature matching. A demultiplexer transfers the serial feature point data into a parallel form according to the left detected signal, dtcL. The feature descriptors and the corresponding coordinates are stored in registers. Then, sixteen feature point data of the left image are compared with the feature point of the right image simultaneously when the right detected signal, dtcR, is received.


Feature Matching Algorithm
For a street-view image of 640 × 480 pixels, about 2500 feature points can be detected, so every row of the image includes about 5.2 feature points on average. Thus, to improve the accuracy of feature matching, 16 feature points, which are equivalent to an area of approximately three rows of pixels, are chosen for feature matching in this paper.
In general, for the same object, the projection in the left image will be located a little more to the right than the corresponding one in the right image. An exception is that if the object is far away from the cameras, the resulting x coordinates of the projections on the two images will be almost equal. Thus, the comparison proceeds only if the following condition is satisfied:

xR ≤ xL,

where xR and xL represent the right-image and left-image x-coordinates, respectively. To quickly find the matching points, the 16 feature descriptors of the left image are simultaneously subtracted from the feature descriptor of the right one. The resulting differences in the 64 dimensions between each pair of feature descriptors are then summed together for comparison.

Feature Matching Algorithm
For a street view image of 640 × 480 pixels, about 2500 feature points can be detected.Every row of the image may include 5.2 feature points in average.Thus, to improve the accuracy of feature matching, 16 feature points, which are equivalent to an area of approximately three rows of pixels, will be chosen for feature matching in this paper.
In general, for the same object, the projection in the left image will be located in a little more to the right position than the corresponding one in the right.An exception is that if the object is far away from the cameras, the resulting x coordinates of the projections on two images will be almost equal.Thus, the comparison will proceed only the following condition is satisfied: where x R and x L represent the right-image and left-image x-coordinates, respectively.To quickly find out the matching points, the 16 feature descriptors of the left image will be simultaneously subtracted from the feature descriptor of the right one.The resulting differences in 64 dimensions between each two feature descriptors are then summed together for comparison.
The sum for the k-th left-image candidate is:

Sk = Σ(i = 1..64) |f_k_Li − f_Ri|, k = 1, 2, . . ., 16,

where f_k_Li, i = 1, 2, . . ., 64, is the i-th element of the k-th feature descriptor in the left image, and f_Ri is the corresponding element of the feature descriptor of the right one. If the minimum sum is smaller than half of the second minimum one, the feature points producing the minimum sum are taken as the matching pair, and the corresponding coordinates, MXR, MYR, MXL, and MYL, on the respective images are output.
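The whole matching rule can be modeled in software as follows. The sketch treats the per-dimension differences as absolute differences (the text only states that the 64 differences are summed), uses a list of at most 16 buffered left-image features in place of the parallel registers, and conservatively returns no match when fewer than two candidates survive the xR ≤ xL constraint; the hardware's behavior in that corner case is not specified here.

```python
import numpy as np

def fap_match(right_desc, x_right, left_window):
    """Software sketch of the FAP matching rule: the right-image feature
    is compared in parallel with up to 16 buffered left-image features
    (x_left, descriptor); a candidate must satisfy x_R <= x_L, distance
    is the sum of absolute differences over 64 dimensions, and a match
    is accepted only if the minimum sum is smaller than half of the
    second minimum. Returns the matching left x-coordinate or None."""
    sums = []
    for x_left, left_desc in left_window[:16]:
        if x_right <= x_left:                       # geometric constraint
            sad = int(np.abs(left_desc - right_desc).sum())
            sums.append((sad, x_left))
    if len(sums) < 2:
        return None                                  # ratio test undefined
    sums.sort(key=lambda t: t[0])
    best, second = sums[0], sums[1]
    return best[1] if 2 * best[0] < second[0] else None
```

Because all 16 subtractions and the adder trees run concurrently in hardware, the sort above collapses to a parallel minimum/second-minimum search, and the "smaller than half" test reduces to a shift and a compare.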


Experiment Results
In this paper, we use an Altera DE2i-150 development board to verify the feasibility of the proposed E-SIFT and FAP feature matching. The core of the DE2i-150 is a Cyclone IV GX (EP4CGX150DF31C7) FPGA, and the system clock is 50 MHz. Three types of experiments are conducted. In the first, the same functions are synthesized for two distinct SIFTs using the proposed TI with LUT approach and the CORDIC algorithm, and the two SIFTs are compared after measurements. In the second, the performance of the proposed E-SIFT is evaluated and compared with other hardware-implemented SIFTs. In the final one, the proposed double E-SIFT with FAP feature matching system for stereo vision is evaluated and compared with state-of-the-art approaches.

Proposed TI with LUT and CORDIC Algorithm
In the beginning, the proposed TI with LUT and the CORDIC algorithms are respectively realized in two distinct SIFTs. Next, we arbitrarily choose seven 640 × 480 images with different sceneries as test samples to obtain an objective evaluation from the comparison of the two SIFTs.
In each experiment, both the original and skewed images are applied to the same SIFT to detect the feature points. Then, the matching pairs are determined by a conventional exhaustive method implemented in C++ with Visual Studio: each feature point in the original image is compared with all feature points in the skewed one to find the matching pairs. Based on the matching results, the accuracy can be estimated.
Table 3 shows the performance and hardware resources required by the two approaches. Note that the accuracy of the proposed TI with LUT is very close to that of CORDIC. However, the consumed hardware resources are quite different: the quantities of LEs and registers in the proposed TI with LUT approach are reduced by 89.18% and 96.08%, respectively, compared with the CORDIC algorithm.
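The TI with LUT datapath itself is detailed earlier in the paper. As a rough behavioral model of the general idea (obtaining the direction from sign and magnitude comparisons of ∆p and ∆q instead of an arctangent, and correcting the gradient magnitude with a small LUT indexed by the leg ratio), one might write the following. The octant encoding, LUT depth, and LUT contents here are illustrative assumptions, not the paper's actual Table 1 and Table 2 values.

```python
import math

RATIO_STEPS = 8
# LUT[k] approximates sqrt(1 + r^2) for a ratio r at the center of bin k,
# so the hypotenuse can be recovered from the longer leg without a sqrt.
LUT = [math.sqrt(1.0 + ((k + 0.5) / RATIO_STEPS) ** 2) for k in range(RATIO_STEPS)]

def ti_direction_gradient(dp, dq):
    """Direction (octant code) and corrected gradient magnitude from the
    pixel differences dp, dq, using only comparisons and a LUT lookup."""
    long_leg = max(abs(dp), abs(dq))
    short_leg = min(abs(dp), abs(dq))
    # Octant from three cheap tests: sign of dp, sign of dq, |dq| > |dp|.
    octant = (dp < 0) << 2 | (dq < 0) << 1 | (abs(dq) > abs(dp))
    if long_leg == 0:
        return octant, 0.0
    ratio_bin = min(int(short_leg * RATIO_STEPS / long_leg), RATIO_STEPS - 1)
    return octant, long_leg * LUT[ratio_bin]
```

With (dp, dq) = (3, 4), the corrected magnitude comes out within a few percent of the true hypotenuse 5, which is the kind of accuracy-versus-hardware trade-off Table 3 quantifies.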

Performance of Proposed E-SIFT
To verify the performance of the proposed E-SIFT, a 640 × 480 image containing a static object, as shown in Figure 20a, is processed by a single E-SIFT to extract the feature points, indicated by red dots in Figure 20b, where 283 feature points are detected. Then, the image is shifted toward the bottom-right direction and matched with the original one by software-based feature matching, as shown in Figure 21. There are 290 feature points detected in the shifted image, resulting in 83 matching pairs. No matching error occurred in these matching pairs; the resulting accuracy of the feature matching based on the proposed E-SIFT is 100%.
When the original image is rotated and skewed, 272 and 235 feature points are derived, respectively, as shown in Figure 22a,b. Through the proposed E-SIFT and the software-based feature matching, 13 and 11 matching pairs are obtained between the original image and the rotated and skewed images, respectively. The resulting accuracy of the feature matching based on the proposed E-SIFT is 100%.
To test the flexibility of the proposed E-SIFT, outdoor and indoor images are also skewed for feature matching with the original ones, as shown in Figure 23a,b, respectively, where the matching accuracy reaches 92.39% for the outdoor image (85 of 92 feature points matched) and 88.89% for the indoor image (48 of 54 feature points matched). The experimental results show that the proposed E-SIFT exhibits good performance even for images with cluttered scenery.
Table 4 shows a performance comparison and the hardware resources required by the proposed E-SIFT and other published papers. The processing time of one image, including detection of the feature points and generation of the feature descriptor, is 4.865 ms, reaching a corresponding frame rate of 205 fps. As shown in Table 4, the proposed E-SIFT is the fastest among the hardware-implemented SIFTs published in the literature. The resources required for the proposed E-SIFT include 54,911 LEs, 36,153 registers, 297 DSP blocks, and 267.9 kbits of RAM. Although the SIFT in [11] utilized a similar quantity of devices, its resulting frame rate is only 58 fps. Compared with other published SIFTs, the proposed E-SIFT exhibits a faster response while using a relatively small quantity of hardware devices.

Proposed Double E-SIFTs with FAP Feature Matching for Stereo Vision
In this paper, a stereo vision architecture of double E-SIFT integrated with FAP feature matching is proposed, as shown in Figure 24, where stereo images from the KITTI dataset are first written into the SDRAM of the Altera DE2i-150 by Nios II. When the system is activated, the stereo images in the SDRAM are delivered to the proposed stereo vision system through the Avalon bus. An end signal and the coordinates of the matching pairs
will be sent back to a PC through buffers after the feature points are found. In the following experiments, we use 1226 × 370 images from the KITTI dataset for the proposed double E-SIFT and FAP feature matching system [24]. To enable comparison with other published papers, the KITTI stereo images are cropped symmetrically with respect to the center to form images of 640 × 370 pixels, as shown in Figure 25.
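As a quick consistency check on the reported figures, the 4.865 ms single-E-SIFT processing time corresponds to the quoted 205 fps (about 243 k cycles of the 50 MHz clock per frame), and the 181 fps stereo rate corresponds to roughly 5.5 ms per stereo pair:

```python
# Consistency check on the timing figures reported in the text.
e_sift_time_s = 4.865e-3                     # per 640 x 480 frame, single E-SIFT
clock_hz = 50e6                              # DE2i-150 system clock

fps = 1.0 / e_sift_time_s                    # ~205.5, quoted as 205 fps
cycles_per_frame = round(e_sift_time_s * clock_hz)  # 243,250 cycles

stereo_frame_ms = 1000.0 / 181               # ~5.52 ms per stereo pair
print(int(fps), cycles_per_frame, round(stereo_frame_ms, 2))
# prints: 205 243250 5.52
```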
The matching pairs can then be used for the purpose of mapping or object tracking. To evaluate the accuracy of the proposed double E-SIFT with FAP feature matching, the stereo images are aligned vertically, where green matching lines are depicted with the same procedure, as shown in Figure 27a,b. The accuracy of the proposed double E-SIFT with FAP feature matching is good because no apparent skewed matching lines exist in the images. Table 5 shows the hardware resources consumed by the proposed double E-SIFT with FAP feature matching for stereo vision. The usage ratios of LEs, registers, RAM, and DSP/multipliers in the DE2i-150 development board are 93.6%, 52.4%, 82.5%, and 8.9%, respectively. For the whole system, the double E-SIFT and the FAP feature matching occupy 76% and 24% of the LEs, respectively, as shown in Figure 28a. Most of the registers, about 90%, are used by the E-SIFTs; only 10% of the total registers are used by the FAP feature matching, as shown in Figure 28b. The usage ratio of the coordinate counter is very low and can be neglected. The total power consumption of the FPGA hosting the proposed architecture, including the double E-SIFT and the FAP feature matching, is less than 1168.33 mW. This value is estimated using the Power Analyzer from Intel [25]; it is difficult to measure the power consumption directly because the DE2i-150 development board used in the experiments does not include a circuit for power estimation similar to the one used in [26].
Table 6 lists the performance comparison and hardware resource usage of the proposed system against other published papers. The resulting frame rate of the proposed double E-SIFT with FAP feature matching is 181 fps, which is faster than most of the approaches. Although reference [27], implementing a SURF detector and BRIEF descriptor, has the fastest frame rate, its required hardware resources are much higher than those of the proposed method. The quantities of LEs in [14] and [28] are close to that of the proposed stereo vision system, but the resulting frame rates are only about 40 fps. The stereo vision system proposed in [29] uses many hardware resources, but its resulting frame rate is only half that of the proposed approach. In summary, the proposed double E-SIFT with FAP feature matching achieves a satisfactory frame rate with an acceptable hardware cost.
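The finite-area idea can also be seen in a few lines of software: since a matched feature in a rectified stereo pair lies in nearly the same row, shifted horizontally by a bounded disparity, a right-image feature at (xR, yR) only needs to be checked against left-image features inside a small window rather than the whole frame. The window extents and disparity direction below are illustrative assumptions, not the paper's actual parameters.

```python
def candidates_in_area(xR, yR, left_features, max_disparity=64, row_tol=2):
    """Indices of left-image features inside the finite search area for a
    right-image feature at (xR, yR): rows within +/- row_tol and x in
    [xR, xR + max_disparity] (the left view sees objects shifted right
    relative to the right view). Window sizes are assumptions."""
    return [i for i, (xL, yL) in enumerate(left_features)
            if abs(yL - yR) <= row_tol and 0 <= xL - xR <= max_disparity]
```

Only the descriptors selected this way need to be compared, which is why the matching can run in parallel on-chip without buffering whole frames in external memory.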

Conclusions
In this paper, we propose an FPGA-implemented double E-SIFT with FAP feature matching for stereo vision. With the improved architecture for the Gaussian pyramid, the Gaussian-blurred images and the DoG pyramid can be produced simultaneously in only one clock cycle. As for the feature descriptor, we propose a simple TI with LUT method to simplify the decision of the direction and gradient of detection points. The dimension of the feature descriptor is also halved so that the hardware cost is significantly reduced. In the feature decision part, in comparison with conventional approaches, a simple condition for high-contrast detection is derived by approximating the threshold value to a power of two. Thanks to the proposed double E-SIFT integrated with FAP feature matching, matching pairs between two images can be determined efficiently: based on the position of a feature point in the right image, the corresponding area in the left image can be chosen for feature matching without external memory.
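The power-of-two threshold mentioned above is what makes the high-contrast test cheap in hardware: comparing a magnitude against 2^k needs only a shift and a comparator, with no multiplier or divider. A minimal sketch (the exponent k is an illustrative value, not the paper's actual threshold):

```python
def high_contrast(d, k=3):
    """True if the DoG response magnitude |d| exceeds the power-of-two
    threshold 2**k; in hardware this reduces to a shift and a compare."""
    return abs(d) > (1 << k)
```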
Owing to the simplification of the structures and approaches adopted in the proposed E-SIFT, the quantity of hardware devices required in the implementation is significantly smaller than that reported in other published papers, which also effectively improves the operating speed of the proposed system. From the experimental results, a frame rate of 205 fps is reached by the proposed E-SIFT at a system clock of 50 MHz. For stereo vision with the KITTI dataset, a frame rate of 181 fps is also achieved by the proposed double E-SIFT with FAP feature matching system. Besides, a data-valid signal is added to the system to synchronize all the functional blocks and prevent the proposed circuitry from ceasing transmission due to the finite bandwidth of the SDRAM or noise interference.

Figure 1. Building blocks of the proposed E-SIFT.

Figure 2. (a) Conventional structure and (b) the proposed structure in realizing image pyramids.

Figure 4. Function blocks of the feature descriptor.

Figure 5. Variations of the detection point on x and y axes in a 3 × 3 mask.

Figure 6. (a) The proposed TI algorithm for direction recognition; (b) the ratio of hypotenuse to the leg has the maximum value when ∆p = ∆q.

Figure 7. Function blocks of the proposed TI and look-up table approach for determining the direction and gradient of the detection point.

Figure 8. (a) Sixteen 4 × 4 partitions in a 16 × 16 mask for the feature descriptor and (b) the corresponding gradient sum in each 4 × 4 partition.

Figure 9. Function blocks of the normalization of the feature descriptor.

Figure 10. The detection range includes 27 pixels in three DoG images under a 3 × 3 mask, where the orange pixel is regarded as the detection point.

Figure 11. Building blocks of the feature detection.

Figure 12. Hardware structure of the elements in 2D and 3D Hessian matrices.

Figure 14. Hardware architecture of the determinant and adjugate matrices.

$$\det(\mathbf{H}_{2\times 2}) = \begin{vmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{vmatrix} = D_{xx}D_{yy} - D_{xy}^{2} \quad (34)$$
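Equation (34) is the standard 2 × 2 determinant used in the corner (edge-response) test; a quick numeric check of the identity, with arbitrary example values for the second derivatives:

```python
# Check Eq. (34): det(H_2x2) = Dxx*Dyy - Dxy^2, against a direct
# cofactor expansion of the same symmetric 2 x 2 matrix.
Dxx, Dyy, Dxy = 8.0, 6.0, 2.0
H = [[Dxx, Dxy], [Dxy, Dyy]]

det_formula = Dxx * Dyy - Dxy ** 2                    # 44.0
det_cofactor = H[0][0] * H[1][1] - H[0][1] * H[1][0]  # same value
assert det_formula == det_cofactor == 44.0
```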

Figure 15. Hardware structure of the corner detection.

Figure 16. Building blocks of the proposed E-SIFT and FAP feature matching for stereo vision.

Figure 17. (a) Top view and (b) front view of stereo vision with the epipolar phenomenon.

Electronics 2021, 24

Figure 18. Different scenes are captured in stereo vision due to the visual angle difference between two cameras.

Figure 19. Building blocks of the proposed finite-area parallel (FAP) feature matching for stereo vision.

Figure 21. Feature matching results of the original image with the shifted one.

Figure 22. Feature matching results of the original image with (a) the rotated and (b) the skewed image.

Figure 23. (a) Matching accuracy for a skewed outdoor image is 92.39%, where 85 out of 92 feature points are successfully matched; (b) matching accuracy for a skewed indoor image is 88.89%, where 48 out of 54 feature points are successfully matched.

Figure 24. Architecture of the proposed stereo vision system.

Figure 27. The same stereo visions as (a) Figure 26a and (b) Figure 26b are aligned vertically to verify the accuracy. No apparent skewed matching lines exist in the images.

Figure 28. Usage ratio of (a) LEs and (b) registers used for the proposed E-SIFTs and FAP feature matching.

Author Contributions: Conceptualization, C.-H.K. and C.-C.H.; implementation, E.-H.H.; validation, C.-H.K. and C.-H.C.; writing-original draft preparation, E.-H.H. and C.-H.C.; writing-review and editing, C.-H.K. and C.-C.H.; visualization, E.-H.H.; supervision, C.-H.K. and C.-C.H. All authors have read and agreed to the published version of the manuscript.

Funding: This work was financially supported by the "Chinese Language and Technology Center" of National Taiwan Normal University (NTNU) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan, and the Ministry of Science and Technology, Taiwan, under Grants no. MOST 109-2634-F-003-006 and MOST 109-2634-F-003-007 through Pervasive Artificial Intelligence Research (PAIR) Labs.

Table 1. Angle range condition and direction distribution.

Table 2. Look-up table for gradient correction.

Table 3. Accuracy and hardware comparison for the proposed TI with LUT and the CORDIC algorithm.

Table 4. Performance comparison of the proposed E-SIFT with other published papers.

Table 5. Hardware resources utilized in the proposed system.

Table 6. Performance comparison of the proposed stereo E-SIFTs and FAP feature matching with other published papers. RANSAC: random sample consensus; SURF: speeded-up robust features; BRIEF: binary robust independent elementary features. 1 Frequency of SIFT and FM; 2 frequency of RANSAC.