Fundamental Matrix Computing Based on 3D Metrical Distance

: To reconstruct point geometry from multiple images, computation of the fundamental matrix is always necessary. With a new optimization criterion, i.e., the re-projective 3D metric geometric distance rather than projective space under RANSAC (Random Sample And Consensus) framework, our method can reveal the quality of the fundamental matrix visually through 3D reconstruction. The geometric distance is the projection error of 3D points to the corresponding image pixel coordinates in metric space. The reasonable visual ﬁgures of the reconstructed scenes are shown but only some numerical result were compared, as is standard practice. This criterion can lead to a better 3D reconstruction result especially in 3D metric space. Our experiments validate our new error criterion and the quality of fundamental matrix under the new criterion.


Introduction
Nowadays, the demand for 3D graphical models in computer graphics, virtual reality and communication, etc., and their application are increasing rapidly. Much progress in multi-view stereo (MVS) algorithms has been made in recent years, both in terms of precision and performance. Fundamental matrix F computation is a basic problem in multi-view geometry for the reconstruction of 3D scenes. Once F is obtained, the projective reconstruction of a scene or object can be inferred from the fundamental matrix [1].
Usually, the eight-point method (8-pt) [2,3], seven-point method (7-pt), five-point method (5-pt) [4,5], and gold distance method (Gold) [6] are all based on the RANSAC (RANdom Sample And Consensus) framework. If picture quality and feature matching are precise enough, accurate results can be achieved; however, the estimation from the seven-point method is sensitive to noise. The five-point method makes use of all kinds of constraints but is not very robust. These methods often include a non-linear optimization as the last step, which is time-consuming. The sampling method is based on the distance from feature points to epipolar lines on the image to select the best F from the candidate F. The F with least average distance wins and is output as the final result. So the quality of F is determined by this distance number.
According to the epipolar theory in the multi-view geometry of computer vision, as demonstrated in Figure 1, all the epipolar lines, l 1 , l 2 , intersect at a single point, epipole, which is e or e at each image plane. An epipolar line such as l 1 or l 2 is the intersection of an epipolar plane with the image plane. The epipolar plane, e.g., O 1 O 2 X i , is a plane containing the baseline, e.g., O 1 O 2 . The epipole is the point of intersection of the line joining the camera centers (the baseline) with the image plane. Equivalently, the epipole is the image in one view of the camera centre of the other view. E.g. e is the projection of O 2 on the left view in Figure 1. It is also the vanishing point of the baseline (translation) direction.
This geometry relation between two images for X i , [u, u ] is linked through F. Our method also utilizes the RANSAC framework to compute fundamental matrix F like most of the other ordinary algorithms. Although every result after RANSAC could have very small average distance from matching feature pixel to the epipolar lines, sometimes the reconstruction result is still unstable if the reconstructed scene is viewed by the other different standard. For example, epipoles of different final best samplings distribute anywhere but not concentrate in a small region. Even the epipoles could be located either within the image or out of the image. Sometimes, the epipole of each computation from F varies a long distance between them; however, both Fs have very small average distance to epipolar lines. When we utilized this F to reconstruct the two-view stereo, sometimes the reconstructed 3D scene was abnormal and occasionally kind of projective. This means the average distance to epipolar lines is not the only, nor very robust, criteria to decide the fundamental matrix. Actually, a fundamental matrix with very small average distance to epipolar lines sometimes led to very poor 3D reconstruction results, such as two-view reconstruction based on F. This problem is critical in affine and projective reconstruction in which there is no meaningful metric information about the object space.
A new error selection criterion is proposed in this paper to solve the problem that Fs with very small average distance to epipolar lines sometimes result in very bad 3D metrical reconstruction or even result in totally wrong results. The advantage of our criterion is that this method can visually present the 3D reconstructed scene to indicate the quality of F. The criterion is still the projective error to project X i , which is in 3D metric space rather than projective space. The description of metric space or metric reconstruction implies that the structure is defined in the Euclidean coordinates where the reconstructed scene is highly similar with the real scene. Metric is used to compare with descriptions for projective and affine. One speaks of projective reconstruction, affine reconstruction, similarity reconstruction, and so on, to indicate the type of transformation involved. In this paper, the term metric reconstruction is normally used in preference to similarity reconstruction, being identical in meaning.
Especially, this modification to the error selection criterion can improve 3D reconstruction results if we assume the principle center of camera in the image coordinates lies at the center of the image. The F under our criterion is suitable for the 3D metric construction, whereas the other Fs are always used for projective reconstruction. When the re-projection error is extracted from metric reconstruction based on F is feed back to judge the quality of the F and then recompute F, the quality of F is improved accordingly. This is the novelty of this paper.

Previous Research
Lots of works have been done on the fundamental matrix computation in the past 30 years [2,[7][8][9][10][11][12]. It's a well and deeply developed area in computer vision; however, there still are some papers on the fundamental matrix published every year. Some traditional methods yield good results, which means very low projective errors to the epipolar lines. This average distance to the epipolar lines is a universal or almost unique criterion to judge the quality of the fundamental matrix.
The five-point method [4,5] is the most popular method that finds the possible solutions for relative camera pose between two calibrated views based on five given corresponding points. The algorithm consists of computing the coefficients of a tenth degree polynomial in closed form and subsequently finding its roots. Kukelova et al. [13] proposed polynomial eigenvalue solutions to the five-point and six-point relative pose problems, which can be solved using the standard efficient numerical algorithms. It is somewhat more stable than solutions proposed by Nister [4,5] and Stewenius [14]. Another five-point method [15] was proposed to estimate the fundamental matrix between two non-calibrated cameras from five correspondences of rotation invariant features. Three of the points have to be co-planar and two of them in general position. The solver, combined with Graph-Cut RANSAC, was superior to the seven and eight-point algorithms both in terms of accuracy and the needed sample number on the evaluated 561 publicly available real image pairs.
Reference [11] proposed a novel technique for estimating the fundamental matrix based on Least Absolute Deviation (LAD), which is also known as the L 1 norm method. Banno et al. [16] proposed a parameterization for the nonlinear optimization which includes the rank 2 constraint. Double quaternion and a scalar are adopted as the parameter set. Menglong et al. A two-point fundamental matrix method [17] is even proposed. It makes use of the epipolar theory to estimate the fundamental matrix based on three corresponding epipolar lines instead of seven or eight corresponding points. Sengupta et al. [12] introduce a novel rank constraint, F = A + AT and rank(A) = 3, on collections of fundamental matrices in multi-view settings. Then iterative re-weighted least squares and alternate direction method of multiplier (ADMM) are used to minimize a L1-cost function. Reference [18] developed the algorithm MAPSAC to obtain the robust MAP estimate of an arbitrary manifold to compute F. But by its open source code, the SfM result is not stable which means that the reconstructed scene varies from one to another totally different one.
As [19] points out, the use of symmetric epipolar distance should be avoided. The error criterion, i.e., the Kanatani distance (REK) [20], is the most effective error criterion found in their experiment for use as a distance function during the outlier removal phase of the F estimation. Some methods [21,22] searched for corresponding epipolar lines using points on the convex hull of the silhouette of a single moving object. These methods fail when the scene includes multiple moving objects. Reference [23] extends previous work to scenes having multiple moving objects by using the "Motion Barcodes", a temporal signature of lines. Reference [24] developed a new method for calculating the fundamental matrix combined with the feature lines. In [24], the camera orientation elements and relative orientation elements are used as the parameters of the fundamental matrix, and the equivalent relationships are deduced based on the horizontal and vertical feature lines.
Recently, after overwhelming studies of deep learning by neural networks, fundamental matrix estimation without correspondences is proposed by novel neural network architectures in an end-to-end manner [25]. New modules and layers are designed to preserve mathematical properties of the fundamental matrix as a homogeneous rank-2 matrix with seven degrees of freedom.
Most of the methods above focus on more constraints or new optimization methods. Unlike these methods, our algorithm uses a new criterion visually for metric reconstruction under the RANSAC framework and achieves a result similar to [4,5] i.e., our F is better to connect with 3D Euclidean coordinate. In other words, our F is better to reconstruct the real world of Euclidean coordinate.

F Computation in Multi-View Geometry
The most basic F computation is based on the linear equations x n x n x n y n x n y n x n y n y n y n x n y n 1 where u = [x y 1] T and u = [x y 1] T are the coordinates of the matching feature on twoview images. It is usually called the eight-point method because at least eight matching point pairs are needed to determine this equation. Of course more than eight matching point pairs can compose an over-determined system of equations to minimize |A f | that could produce more robust result.
In RANSAC framework, increasing iteration times n I is to find a better F with less error. But as the n I increases, the errors of projective distance or metric distance become smaller and smaller at beginning of the iterations, then keep almost at same level after beginning iterations.
If constraint det(F) = 0 is enforced to Equation (1), only seven matching pairs are necessary to solve this equation [26]. So this is usually called seven-point method. With more constraints being added in Equation (1), less matching pairs become indispensable. In the extreme occasion as mentioned in Section 2, only two matching pairs are needed [17].
Normalization [6,27] of matching feature coordinates is an essential step and recommended to reduce the final projection error. The suggested normalization is a translation and scaling of each image so that the centroid of the matching feature points is at [0, 0], the origin point of the coordinates, and the root mean square (RMS) distance of the points to the origin point is √ 2. Usually, the last step is to optimize F for to minimize some sort of average error by nonlinear optimization method. The error is always the squared distance between a point and its epipolar line, which will be computed for both points of one match and averaged over all N matches.

Essential Matrix Computation and Decomposition
The relationship between F and essential matrix E is as listed by Equation (4).
If K, K are intrinsic parameter matrices of two cameras. E implies the relative position of a two-camera pair if R and t mean the disposition of two cameras. Thus E can be represented as Equation (5).
For a given essential matrix E which can be decomposed by, Equation (6) and the first camera matrix P E = [ I | 0 ], there are four possible choices for the second camera matrix P E , namely where u 3 is the third column of the U, but only one of the four choices is correct. As we know, reconstructed X by the correct P E should locate in front of two cameras. Choosing one or more X i to check if they are in front of two cameras is the way to select the correct P E out. This is the key point that our method can show the 3D metric reconstructed scene because our method can estimate E to obtain P E through the given F.

Optimization Based on Projective Geometric Distance
After an initial F is achieved by the methods as that in Section 3.1, an optimization always follows as the last step. One goal of optimization is to minimize the re-projection error-the (summed squared) distances in projective space. Usually the cost function is like Equation (8), which minimizes the distance of a point from its corresponding projected point, computed in each of the images.
With projection matrices of cameras denoted by P, P , the feature coordinates on image should beũ = PX,ũ = P X; however, there must be some error forũ,ũ . |u i − PX i | and |u i − P X i | represent this re-projection error, which means the projective distances between the projections of X to the measured image points [u, u ]. Obviously, P, P are the projective matrices under projective coordinate but not the metric reconstruction under Euclidean coordinate. e is the epipole on the second image with respect to F.
The initial X can be computed by P, P and [u, u ] by methods such as linear or nonlinear triangulation. Then, the Levenberg-Marquardt method is applied to minimize cost of the Equation (8), which has 3N + 12 variables: 3N for N 3D points X i , and 12 for the camera matrix P = [M|t], with F = [t] × M. After optimization, a better result is produced most of the time. It is worth noting that sometimes it has only better numerical index on the average projective distance rather than reasonable 3D metric reconstruction.

Computing F in Metric Space Based on Geometric Distance
As we know, F includes the information in the projective coordination but not the Euclidean coordination. If metric reconstruction is necessary, it can only be estimated based on F via camera's intrinsic matrix K. Auto-(or self-) calibration, which can result in K automatically, is the computation of metric properties of the cameras and/or the scene from a set of uncalibrated images. Our method does not need any intrinsic parameter measured previously to reconstruct under the metric frame of Euclidean coordinate. In other words, we use a simplified auto-calibration method to estimate K.
In this paper, some constraints [28] are enforced on the intrinsic parameters of camera to estimate K. These constraints are: (1) Typically the aspect ratio is around 1. (2) The skew ratio can be assumed to be 0. (3) The principal point of the camera is projected to the center of image, [w/2, h/2] where w, h is the width and height of image in unit pixel, respectively. After moving the origin of image coordinate to [w/2, h/2], the intrinsic parameter matrix is the K defined in Equation (10), where f is the focal length of camera. This is a basic assumption for our method because this condition can be satisfied easily now and easier in the future with improvements in the manufacturing. Our method tries to fully make use of this assumption. Theoretically, it can achieve a perfect result if the real images satisfy this constraint of K ideally.
4.1. 3D Metric Distance Criterion of F As mentioned above, almost all the computation of F uses the distance of feature pixel to epipolar line as criterion to discard outliers of features matching. However, our algorithm uses the 3D metric distance criterion and is detailed as follows.
Compared with the traditional methods of F, the 3D metric distance criterion is more robust and has a similar performance on precision. It is the average distance between the projection of X on two image plains, P E X and P E X, and feature pixels [u, u ] in Equation (11), denoted by metric i . If metric i of one point X i is less than threshold presented, it is an inlier. Then Equation (11) is designed to compute the average distance metric , our 3D metric distance criterion of F, with all N matching inliers.
This 3D metric distance criterion can be applied into any RANSAC framework of F computation to judge whether one feature matching is an outlier or not. This criterion is more robust and has better performance especially for the images satisfied the assumption that the optical center lies at the center of image, [w/2, h/2]. The validation of this criterion by experiment is discussed in Section 5.

Optimization Based on Metric Geometric Distance
Our optimization based on metric geometric distance is in the same framework as the gold standard method [6] but in a metrical space for X and P E , P E , rather than the projective space described as Equation (8) in Section 3.3. In other words, we use a different method to compute projective matrix labeled as P E , P E which represents the metric scene in Euclidean coordinates.
The input of our optimization is an initial F, image size [w, h], and matching feature points [u, u ]. The initial F of the one images pair is computed by the eight-point method as described in Section 3.1 in the RANSAC framework based on the traditional distance from [u, u ] to epipolar line. The output is the final result of F. Goal of the optimization is to minimize the Equation (11). The steps of this algorithm are listed as follows: 1.
Estimate E = K T FK with K in Equation (10) With these correspondence [û,û ] and P E , P E , estimate X by the triangulation method [6].

5.
Minimize energy function Equation (11) over X, P E and P E . The cost is minimized using the Levenberg-Marquardt algorithm over 3N + 24 variables: 3N for N 3D points X i , and 24 for the camera matrix P E , P E . 6.
Retrieve F from the K, K , P E , P E . At first, compute the epipole e of the second image by e = P E × null(P E ) where null(P E ) is zero space of P E . Then return the final F = K −1 [e ] × P E P −1 E K −1 . As mentioned above, the K is estimated through the constraints, but not a precisely measured one. Of course if the precisely measured K is applied to our algorithm, better result will be achieved. For most of the scenarios, the precise K could not be known in advance. Therefore, this estimated and feasible K is adopted in our optimization. In fact it works very well and always achieves a satisfying result.
Compared with normalization in Section 3.1, our method uses a different normalization method for coordinates of features to compute final F. In Section 3.1, its origin point is a centroid that lies at the barycenters of u or u , and its average distance from u or u to the origin point is √ 2 which is fixed. However, in our algorithm, origin point is at the center of the images, and the average distance from origin point is less than 1, which is not fixed. Our normalization is more efficient than the normalization in Section 3.1. Our normalization method is a linear operation that does not need to calculate the barycenter. Furthermore, our method can visualize metric 3D reconstruction of X as described in Section 5.3 but the other normalization methods cannot produce reasonable visualization of reconstruction directly.

The Framework of Computing F Visually
The input data of our algorithm are a pair of images of a scene that need to be reconstructed, which are taken with a hand-held camera with a fixed focal length. The framework of the experiment is as follows: Step 1: Extract SIFT features on both images. Compute the matchings of features [u, u ].
Step 2: Use a traditional method such as the eight-point method to compute F as the initial input of next step.
Step 3: Optimize F using algorithm in Section 4.2.
Step 4: If the metrical projection error is smaller than Step 2, output the result as the final F. If not, go to Step 2 to repeat.
In step (3), we can deduce P E , P E from E. Then, the figure of reconstructed 3D points X can be shown in a Cartesian coordinate system even in each iteration.

Experimental Settings
Three different types of datasets, Dino, ZhangLan, and Simu were used in our experiment. Dino is a public-accessed dataset with an image size of 480 × 640. ZhangLan has an image size of 338 × 600 and was captured by us on our own campus. Simu is a human-made simulation of camera projection with an image size of 702 × 496. Figure 2 shows the features extracted by SIFT on dataset Dino and ZhangLan. Many different stereo matching methods are fully and deeply researched [29]. Thus, our algorithm adopts SIFT [30] for the feature matching [u, u ] of two images. It is the first step to compute F. Here, Euclidean distance is the criterion to measure the features' difference. The ratio of the nearest distance over the second nearest distance was set to be less than threshold T matching = 0.55. Otherwise it will not be labeled as a feature.  Actually, the other feature matching algorithms such as Harris, BRISK, SURF, ORB, etc., can still work. We choose SIFT only because it is typical and good enough to compare with other F computation methods, not because it is the best or most accurate matching method. SIFT has some mismatching features most of the time. Because of the RANSAC framework used in this paper, the mismatching can be culled out by the sampling and iteration mechanism as shown in Figure 2. This is one of the differences to most of F computation methods that always carefully remove the wrong feature matching manually.

Features Matching
For the matching of simulated scene Simu displayed in Figure 3, all the points in the image are perfectly matched because the scene was manually made. So the simulation matching of scene Simu in Figure 3 is not listed in Figure 2. The 3D points X and camera positions are displayed in Figure 3a and the projection image of X at the first and second viewpoint are displayed in Figure 3b and Figure 3c, respectively.
For feature matching from SIFT in Figure 2, there must be some outliers of matching with long distances to epipolar lines or obvious re-projection error of X. If outliers are validated by the criteria mentioned in this paper, the incorrect matching should be removed from the feature matching sequence and not be counted into error calculation. This is an important step to improve precision and robustness of our method. Figure 4 shows part of the typical epipoles and their corresponding epipolar lines chosen randomly and computed with our 3D metrical criterion for the final F under the RANSAC framework like the other traditional methods. As we can see, the epipoles are close to the epipolar lines. The distances are less than one pixel distance on average as shown in Table 1.

Reconstructed 3D Points
The visual results of Figure 5 show the 3D reconstruction with respect to our 3D metric projection error criterion based on E from F. However, eight-point [6], seven-point [6], and Gold [6] methods can only compute the projective F and cannot reconstruct the metrical 3D result directly. In (c), we rotate the scene to a special view point to see that the 50 points lie at a same plane and thus nearly only a line can be seen at the lower part of the figure, which validate that our method is reasonable and robust. The precision analysis is described in the last paragraph of Section 5.4.
In our experiment, the eight-point method always has good reconstruction results for each criterion. Every point of X is recovered precisely. The projection errors are 1.50, 0.5, and 0.45 for the three different scenarios, respectively.
On the other hand, for the seven-point method, no matter how carefully the parameters are adjusted, the reconstruction result is not good enough because seven-point method is very sensitive to noise. The reconstructed 3D points X of Simu almost lie on one plane which in fact are two planes. The reason is that restriction of det(F) = 0 is too arbitrary and only has mathematical meaning. For the reconstruction here, det(F) = 0 has a kind of side effect when computing E. In our experiment, det(F) = 0 was a useful and necessary but weak constraint. It is useful because it adds one equation to reduce one point pair input and the epipole has to be calculated under this constraint. It is weak because that it affects the result significantly when this constraint is applied. Most of the time without this constraint, the better projective error or average epipolar lines distance are obtained, although this does not mean they will result in better 3D construction results. Actually, det(F) = 0 limits the space of the solution of F. When this constraint is applied to the algorithms above, there is an advantage to compute the epipoles on corresponding images.
The appended material are the result from the five-point method that computes E as output directly. So the epipolar criterion can not be applied to five-point method. Compared with the other two methods over their distance to the epipolar line, Gold and Ours almost have results similar to the seven-point method. The five-point method's projection error of X, which is listed on the fourth row of Table 1, is obviously bigger than the other four. The five-point method is not robust or reliable enough in our experiments.
For the scene Simu, if only 0.5 pixel random projection error or less is added on [u, u ], it could have better reconstruction than the other scenarios because the optical center of camera lies at image center [w/2, h/2] precisely. Table 1 compares metric criterion with traditional epipolar distance in 8-pt [6], 7-pt [6], 5-pt [13] and Gold [6] method. The second to sixth row of Table 1 are result based on the average distance from feature pixels to epipolar lines w.r.t the 5 methods to compute F. The columns are three different scenes of Dino, ZhangLan and Simu. The total initial matching features number are 100, 123 and 100. The numbers under the column 'Error' are the average distance from feature pixels to epipolar lines while the numbers under the column 'Inliers #' are numbers of inliers. Here unit of the distance is pixel in image. Our method and the method Gold have the best re-projection error than the other four.  Table 2 is the average re-projection error from feature pixels u, u to P E X, P E X and the 3D metric criterion u − P E X and u − P E X as described in Section 4.1, for the same datasets Dino, ZhangLan, and Simu. The last column is our method, which outperforms the other four methods. In order to compare with each other, the same F in Table 1 were used to decompose out E with K to reconstruct the scene, i.e., their inliers are recorded to compute the average re-projection error. Our method has comparatively good results compared with the other four. Table 2. Error comparison based on the average re-projection error from feature pixels u, u to P E X, P E X. It shows that our method outperform other methods. Projective validation is a common way to judge the quality of F. In this paper, we propose metric validation to perform the comparison. Metric validation is based on the assumption that the intrinsic matrix K is subject to the constraints in Section 4. Then, the reconstruction is achieved by the deduced essential matrix E. The criteria to judge the inlier becomes the average re-projection error of X on images to [ u, u ]. If images are captured with the intrinsic matrix K under these constraints, the achieved result F will have the better final reconstruction in the metric coordinate. Actually, this is the case for most of the cameras in making metric validation work. The biggest benefit of our metric validation is that it can produce reasonable and visual metric reconstruction results rather than a sort of projective reconstruction with small numerical projection errors that could be used to compare. If a more precise result is required, the intrinsic matrix K should be measured in advance. Here, we focus on an automatic approach so that the pre-measured K is not adopted. As we can see in Table 1, our method has better results in metric space than the five-point method.

Numerical Result Comparison
In our experiments, thresholds of inliers have two types. (1) For metric validation, it is the metric re-projection errors of X on images to u and u . (2) For RANSAC inliers, it is the projective errors from u or u to the corresponding epipolar lines. Usually, the smaller threshold means less inliers and better re-projection errors especially for real image pairs. For example, for the result of Zhanglan with the five-point method, the re-projection errors decreased from 5.5 to 4.7, on the other hand the number of inliers dropped from 94 to 86.
In Section 4.3, the bigger metrical projection error could be produced in Step 3, which optimize F using algorithm in Section 4.2, because the initialization of global optimization is crucial. Once the initialization of Step 3 is recalculated, the result with smaller result will be obtained in three iterations most of the time.
In order to measure the precision of reconstruction in metric space, we add some noise into matching features' position and then reconstruct the scene Simu to compare the distance between the points with the ground-truth distance simulated out. Because there are 100 points in Simu, (99 + 1) * 99/2 = 4950 distances can be used to compare the error with the ground-truth. We calculate the mean error of 4950 distances and repeat this process 100 times with different noise at the same level to obtain a average error distance. Considering the scene area is a unbounded factor to affect the error of distances, we apply the ratio of the above average error distance to maximum distance of these 100 points. The ratio result is within 1 ± 0.4% most of the time when the noise onto features is 0.5 pixel with Gaussian distributions. Our method provides a reasonable and high precision for 3D reconstruction.

Conclusions
We propose a novel RANSAC criterion for the fundamental matrix computation. The results of experiments have shown that our method under metric space re-projection distance has good results for different test images compared with other methods such as eight-point, seven-point, five-point, and Gold standard methods. Our method can present the result visually to judge the quality of F. In future work, we would like to explore theoretical proof to provide stronger evidence to support our 3D metrical criterion of quality of fundamental matrix.