Automatic Pose Estimation of Uncalibrated Multi-View Images Based on a Planar Object with a Predefined Contour Model

We present a framework that automatically obtains, from uncalibrated multi-view images, the camera pose (i.e., position and orientation in 3D space) with real scale information together with the intrinsic camera parameters. The framework consists of two key steps. First, initial values of the intrinsic camera and pose parameters are extracted from homography estimation based on the contour model of planar objects. Second, the intrinsic camera and pose parameters are refined by a bundle adjustment procedure. The framework provides a complete flow for the pose estimation of disorderly or orderly uncalibrated multi-view images, which can be used in vision tasks requiring scale information. Real multi-view images were utilized to demonstrate the robustness, flexibility and accuracy of the proposed framework. The proposed framework was also applied in 3D reconstruction.


Introduction
The pose estimation problem is one of the central problems in robotics, computer graphics, photogrammetry and computer vision. In robotics, pose estimation is commonly used in hand-eye coordination systems [1]. In computer graphics, pose estimation plays an important role in tasks that combine computer-generated objects with photographic scenes, such as landmark tracking for determining head pose in augmented reality [2] or interactive manipulation of objects. In photogrammetry and computer vision, pose estimation is central to 3D reconstruction [3] and object recognition [4].
The "Structure from Motion" (SfM) concept is the core method used in the automatic pose estimation of images [5]. In the late 1980s, effective SfM techniques were developed, which aimed to simultaneously reconstruct the unknown 3D scene structure and the camera positions and orientations from a set of feature correspondences. Longuet-Higgins introduced a still widely-used two-frame relative orientation technique in 1981 [6]; however, multi-frame structure from motion techniques, including factorization methods [7] and global optimization techniques [8][9][10], were developed only later. In 2004, Nister matched small subsets of images to one another and then merged them for a complete 3D reconstruction in the form of sparse point clouds [11]. Vergauwen and Van Gool [12] developed an SfM tool for cultural heritage applications (hosted now by a web-based 3D reconstruction service). Recently, SfM has improved tremendously, although the achievable 3D reconstructions were useful only for visualization, object-based navigation, annotation transfer or image browsing purposes. Nevertheless, the procedures developed are able to orient large numbers of images. Two well-known packages are Bundler (or its graphical version Photosynth) [13] and Samantha [14]. Bundler implements the current state of the art in sequential SfM and was extended toward a hierarchical SfM approach based on a set of key images [15]. Samantha appears to be faster because of the introduction of a local bundle adjustment procedure. Automatic methodologies to estimate image pose based on SfM are also shown in [16][17][18][19][20][21].
In the SfM process, the cameras and the scene structure can be reconstructed directly from image point correspondences [21][22][23]. However, the method has several limitations in measurement applications: (1) SfM only estimates the relative pose of each camera, without scale information; SfM can recover the intrinsic parameters for non-degenerate configurations, but this process can be unstable [21]; (2) SfM yields only the relative pose of the multi-view images. When the images have geo-reference marks, the absolute pose can be obtained by a similarity transformation. However, if the accuracy of the relative pose is low, then an absolute pose with high accuracy cannot be acquired [22].
In this study, a pose estimation framework for multi-view images (captured by the same camera) is presented. It calculates the homography transformation directly by iterative optimization, and it recovers simultaneously and robustly the intrinsic camera parameters (i.e., focal length, principal point, radial distortion and tangential distortion) and the pose parameters (i.e., six parameters for every image: the position $(X_S, Y_S, Z_S)$ of the camera projection center in the world coordinate system and the orientation $(\varphi, \omega, \kappa)$). This framework consists of the following two key steps: (1) the contour-based homography estimation is used for computing the homography relationship between models and images; the initial intrinsic camera and pose parameters can be recovered from the multiple homography matrices; moreover, the disorderly multi-view images can be rearranged by the camera pose derived from the initial pose parameters; (2) the bundle adjustment is used for the refinement of the initial intrinsic camera and pose parameters.
Therefore, the main contributions of this study include: (1) A robust contour model-based homography estimation (including recognition and tracking) of the planar object, which can transform disorderly multi-view images into orderly multi-view images in a general environment and provide good initial intrinsic camera and pose parameters. (2) A complete framework, which automatically provides both the intrinsic camera and pose parameters with a real scale for uncalibrated multi-view images. The framework can support a wide range of vision applications that require metric measurements.
The remainder of this paper is organized as follows: Section 2 gives an overview of the approach. Section 3 presents the approach to obtain the initial parameters from the contour-based homography. Section 4 describes the parameter refinement using bundle adjustment. Section 5 presents the experimental results. Section 6 discusses the major advantages of the proposed framework. Section 7 concludes with the potential and the limitations of the framework, as well as the objectives of future work.

Overview
In this section, we provide an overview of pose estimation, which includes camera localization and incremental bundle adjustment. Then, the approximate steps of our framework are shown.
Camera localization has received much interest in the last few years. Visual Simultaneous Localization and Mapping [24][25][26] and, in computer vision, Structure from Motion with bundle adjustment optimization [27,28] are common ways of estimating the camera pose. These approaches reconstruct the environment and estimate the camera position simultaneously, but need to close a loop to correct the drift. Visual odometry is another way to retrieve the relative pose of the camera [29], but its estimates drift irremediably. Royer showed that the use of 3D information on the environment ensures a better precision in pose estimation [30]. It makes the pose estimation of a camera embedded on a mobile platform precise, with no drift, if the robot moves near these referenced [31] or even georeferenced [32] landmarks. For several years, 3D models of cities or urban environments have been made available through various digitized town projects across the world. The French National Institute of Geography (IGN) digitized streets and buildings of the 12th arrondissement of Paris in France. Model-based pose estimation has been tackled for several years using various feature types: points [33,34], lines [35], both [36] or wireframe models [37][38][39]. These works dealt with geometrical features, but only a few other works take into account the photometric information explicitly in the pose estimation and tracking. Some of them mix geometric and photometric features [40,41]. Photometric features (image intensity) can directly be considered to estimate the homography and then the relative position between a current and a reference image [42]. A more recent approach proposes to estimate such a transformation using information theoretic approaches. Dame et al. proposed using the mutual information shared by a planar textured model and the images acquired by the camera to estimate an affine transformation or a homography [43,44].
Bundle adjustment optimization is a common way of estimating the image pose, and incremental bundle adjustment is a topic of recent research. Most existing work maintains real-time performance by optimizing a small subset of the most recent poses each time a new image is added. For example, Engels et al. [45] optimize a number of the most recent camera poses and landmarks using the standard bundle adjustment optimization, and Mouragnon et al. [46] perform a similar optimization, but include key frame selection and use a generic camera model. Zhang and Shan [47] perform an optimization over the most recent camera triplets, experimenting with both standard BA optimization and a reduced optimization where the landmarks are linearly eliminated. Other incremental bundle adjustment methods optimize more than just a fixed number of recent poses, instead adaptively identifying which camera poses to optimize [48][49][50]. However, in contrast with incremental smoothing [51], these methods result in approximate solutions to the overall least-squares problem and do not reuse all possible computations. The Incremental Light Bundle Adjustment (ILBA) proposed by Indelman et al. in 2012 [52] has attracted attention. It combines the two key ideas of structure-less SfM and incremental smoothing into a computationally-efficient bundle adjustment method, and additionally introduces the use of three-view constraints to remedy commonly encountered degenerate camera motions. In 2013, Indelman et al. [53] gave a probabilistic analysis of the incremental light bundle adjustment method and presented a computationally-efficient method for Vision-Aided Navigation (VAN) in autonomous robotic applications [54] with ILBA.
In this study, the framework calculated the homography transformation by using the bundle adjustment optimization.
As shown in Figure 1, the framework consists of the following four steps: First, the structured line contour model was generated. The model of the object, e.g., A4 paper or a book, was measured in advance. The model can be asymmetric, as with the Book 1 cover shown in Figure 2.
Second, the multi-view images of the scene (i.e., disorderly or orderly) were captured. In this study, a focus-fixed consumer-grade camera (or just a smartphone) was used.
Third, the intrinsic camera and pose parameters were calculated. Using the homography recognition procedure, the predefined structured line contour model was matched with the line segments extracted from the first frame of the image set to obtain a correct homography. When at least three homographies were estimated, the intrinsic camera and pose parameters were computed linearly.
Finally, the parameters were refined by bundle adjustment. The sparse 3D points of the multi-view images were triangulated from the previously-mentioned parameters by the sparse correspondence procedure. Then, the parameters (i.e., the intrinsic camera and pose parameters) and the sparse 3D points were used in bundle adjustment to optimize the results.
The following sections describe the last two steps in detail.

Figure 1. Steps of the proposed framework.
Step I: Generate Structured Line Contour Model. The planar objects with the predefined contour model, from left to right, are A4 paper and book covers with line features. In the top row are the predefined contour models.
Step II: Acquire Multi-view Images. The A4 paper was placed on the table. $O_0$ is one corner; the $X_0$ axis is the shorter edge; the $Y_0$ axis is the other edge; and the $Z_0$ axis is confirmed by the right-hand rule. Each camera view has a rigid transformation to the global coordinate system, which is unknown at this phase.
Step IV: Parameter Refinement by Bundle Adjustment. Sparse 3D points can be triangulated by the known initial camera parameters in the previous step and corresponding feature points. In this study, the corner points of the contour model were taken as Ground Control Points (GCPs). The bundle adjustment procedure was executed for parameter refinement.

Initial Parameters Obtained from Contour-Based Homography
In this section, the problem of homography estimation is first stated. Then, the recognition of homographies from contour models in multi-view images is described. Subsequently, the homographies are optimized by minimizing the errors between the sample points and their corresponding image points, which are found by a 1D search along the normal direction. Finally, the initial camera parameters, including the intrinsic camera and pose parameters, are retrieved from the multiple homographies based on multiple view geometry.

Problem Statement
The problem considered in this section is the estimation of the homography transformation between the model and image features. The relationship between a 2D planar model point $P$ and its corresponding image point $p$ is given in homogeneous coordinates as:

$$ s\,p = H P \qquad (1) $$

where $H$ is the 3 × 3 homography matrix and $s$ is a non-zero scale factor. Furthermore, the model line segment $L$ is defined by its two endpoints $(P_1, P_2)$. Therefore, the homography between a model line segment $L$ and its projection $I$ in the image plane is given by the projection of its two endpoints:

$$ I = (p_1, p_2), \qquad s_k\,p_k = H P_k, \quad k = 1, 2 \qquad (2) $$

When noise appears in the measurement data, $\tilde{P}$ denotes the noisy observation of 2D point $P$, and $\tilde{I}$ denotes the noisy observation of 2D line segment $I$. The contour models utilized in this study include the geometrical and textural edges of the planar objects (see Step I of Figure 1), which were modeled as lines and intersected corners. The lines and corners are illustrated in Figure 2, and the corner points were taken as the planar GCPs in the bundle adjustment procedure. The details (i.e., the coordinates of the corners and the index of the line segments) of the contour model of the cover of Book 1 are defined in Table 1.
Table 1. The contents of the planar models.
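The mapping of model points and model line segments into the image can be illustrated with a minimal sketch. The function names, the homography values and the A4 dimensions in millimetres below are illustrative assumptions, not values taken from this study.

```python
import numpy as np

def project_points(H, pts):
    """Map Nx2 model-plane points to Nx2 image points with a 3x3 homography."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # lift to homogeneous coordinates
    mapped = homog @ H.T                              # apply s*p = H*P row by row
    return mapped[:, :2] / mapped[:, 2:3]             # divide out the scale factor s

def project_segment(H, segment):
    """A model segment (P1, P2) maps to the segment joining its projected endpoints."""
    p1, p2 = project_points(H, segment)
    return p1, p2

# Hypothetical homography and an A4 sheet (210 mm x 297 mm) as the planar model.
H = np.array([[2.0, 0.1, 50.0],
              [0.0, 1.8, 30.0],
              [0.0, 0.001, 1.0]])
a4_corners = np.array([[0, 0], [210, 0], [210, 297], [0, 297]], dtype=float)
image_corners = project_points(H, a4_corners)
```

Because a homography maps lines to lines, projecting only the two endpoints is sufficient to obtain the projected segment.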

The Disorderly Images and the First Frame: Recognition of the Contour-Based Homography
For the disorderly images or the first frame of orderly images, the homographies were recognized based on the contour model of the planar object in two steps, namely hypothesizing and verifying. In the first step, a quadrangle-like structure formed by two image corners was selected and utilized to generate a number of approximate homography hypotheses. In the second step, the homography hypotheses were quickly ranked by matching the model line set with the image line set in the local region around the base line, according to a distance function between line segments that is derived, in a probabilistic approach, from a series of noisy edge points.
Although the image lines may be fragmented because of occlusion or faulty line detection, the homography can be determined with a closed-form solution from four correspondences between the model lines and the image lines. A small set of homographies with reasonably high certainty must be generated, including at least one homography close to the accurate transformation. Similar to [55], an approximate homography was obtained by exploiting the corner-like structures. However, the proposed method does not rely on the assumption that particular image lines, called base lines, are unfragmented [55]. In this study, corner-like structures were formed by pairs of line segments that terminate at a common point, within a certain distance and intersection angle of each other. In contrast to the method in [55], which utilized only one correspondence of the corner-like structure to generate affine hypotheses, the present method generated the homography hypotheses from two correspondences of the corner-like structure (denoted as the quadrangle-like structure). For two corner-like structures that share the same line segment (see Corner 1 and Corner 3 of Figure 3), the base lines are chosen if the lines match the corner-like structure of the other two line segments in the neighborhood, as shown in Figure 3.
The endpoints of the base line cannot be utilized directly because the base line may be fragmented. Instead, the base line is intersected with the other two lines, because these intersection points are invariant under the perspective transformation [55]. As shown in Figure 3, the base image line is $I_0$, and the points $p_0$, $p_1$ are its intersections with the other two lines $I_3$, $I_1$. Given the single correspondence of the quadrangle-like structure in the model plane and image plane, the two point pairs $\{P_0, p_0\}$ and $\{P_1, p_1\}$ can be obtained. The homography transformation has eight DOF; however, these point pairs provide only four constraints (i.e., two for each point pair). Therefore, a ratio constraint was imposed: the ratio of the lengths of the other two lines $I_1$, $I_3$ to that of the base line is assumed to remain unchanged under the homography transformation. Under this constraint, the second endpoint $p_2$ of the image line $I_1$ is placed along $I_1$ such that $\lVert p_2 - p_1 \rVert / \lVert p_1 - p_0 \rVert = \lVert P_2 - P_1 \rVert / \lVert P_1 - P_0 \rVert$, where $P_0$ and $P_1$ are the endpoints of the base model line $L_0$ and $P_2$ is the second endpoint of the model line $L_1$. The second endpoint $p_3$ of the image line $I_2$ can be obtained in the same manner, which yields four correspondences between model and image points.
If two corner-like structures form a quadrangle (see Corner 1 and Corner 2 in Figure 3), then the four point pairs $\{P_0, p_0\}$, $\{P_1, p_1\}$, $\{P_2, p_2\}$, $\{P_3, p_3\}$ can be obtained without any assumption. Each correspondence $\{P_i, p_i\}$ gives two linear equations in the eight unknowns of $H$. $H$ can therefore be solved linearly from four corresponding points, which a planar object such as A4 paper can provide.
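The linear solution of the eight unknowns of the homography from four point correspondences can be sketched as follows. This is a minimal illustration and not necessarily the solver used in this study; it assumes the normalization $h_{33} = 1$ (which excludes homographies whose last entry is zero), and the function name and numeric values are hypothetical.

```python
import numpy as np

def homography_from_points(model_pts, image_pts):
    """Solve the 8 unknowns of H (assuming h33 = 1) from >= 4 point pairs."""
    A, b = [], []
    for (X, Y), (x, y) in zip(model_pts, image_pts):
        # Each pair gives two linear equations, one for x and one for y.
        A.append([X, Y, 1, 0, 0, 0, -x * X, -x * Y]); b.append(x)
        A.append([0, 0, 0, X, Y, 1, -y * X, -y * Y]); b.append(y)
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

# Hypothetical check: corners of an A4 model mapped by a known homography.
H_true = np.array([[1.2, 0.1, 40.0], [-0.05, 1.1, 25.0], [1e-4, 2e-4, 1.0]])
model = np.array([[0, 0], [210, 0], [210, 297], [0, 297]], dtype=float)
q = np.hstack([model, np.ones((4, 1))]) @ H_true.T
image = q[:, :2] / q[:, 2:3]
H_est = homography_from_points(model, image)
```

With exactly four pairs in general position the system is square; with more pairs the same call returns the least-squares solution.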
The second step is to rank all of the homography hypotheses from the first step. A geometric measure between the model line set (projected by an approximate homography) and the image line set was computed. To compute the similarity measure quickly, only a subset of the lines around the base line (if the two corners form a quadrangle, then the diagonal line from one corner to the other is chosen as the base line) was compared with the image line set. Let $M = \{L_1, \ldots, L_j, \ldots, L_M\}$ be the model lines and $N(I)$ be the $N_{nbr}$ image lines surrounding the base image line $I$, where $N_{nbr} > M$. Then, the geometric similarity $S(M, N(I), H)$ between the model line set $M$ and the image line set $N(I)$ around the base image line $I$ is given by Equation (6), in which $d(\cdot)$ denotes the distance between two lines in the image, as defined in [56].
The smaller the value of $S(M, N(I), H)$, the more similar the model line set is to the image lines under the current homography $H$. The parameter $S_{max}$ was employed to ensure that the correct homographies are not penalized too severely when a model line is fully occluded in the image. Using Equation (6), the homography hypotheses generated from the quadrangle-like structures can be ranked. Then, the ranked hypotheses were refined by the proposed homography optimization method (which is presented in Section 3.3), and the hypothesis with the smallest alignment error was chosen as the optimal homography.
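The ranking step can be illustrated with a small sketch. It rests on two loudly stated assumptions: the similarity is taken to be the sum, over model lines, of the best line distance truncated at `s_max`, and the line distance of [56] is replaced by a simple stand-in (the mean distance of the endpoints of one segment to the line through the other). Neither is confirmed to be the exact form used in this study, and all names are hypothetical.

```python
import numpy as np

def project(H, pts):
    pts = np.asarray(pts, dtype=float)
    q = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return q[:, :2] / q[:, 2:3]

def segment_distance(seg_a, seg_b):
    """Assumed stand-in for d(.): mean distance of seg_b's endpoints to the line through seg_a."""
    a0, a1 = np.asarray(seg_a, dtype=float)
    direction = (a1 - a0) / np.linalg.norm(a1 - a0)
    normal = np.array([-direction[1], direction[0]])
    return float(np.mean([abs(normal @ (np.asarray(p, dtype=float) - a0)) for p in seg_b]))

def similarity(model_segs, image_segs, H, s_max=20.0):
    """Assumed form: sum over model lines of the best-match distance, truncated at s_max."""
    total = 0.0
    for seg in model_segs:
        projected = project(H, seg)
        best = min(segment_distance(projected, img) for img in image_segs)
        total += min(best, s_max)  # an occluded model line costs at most s_max
    return total

def rank_hypotheses(model_segs, image_segs, hypotheses, s_max=20.0):
    """Order the homography hypotheses from most to least similar."""
    return sorted(hypotheses, key=lambda H: similarity(model_segs, image_segs, H, s_max))
```

The truncation keeps a correct hypothesis competitive even when one of its model lines has no counterpart among the detected image lines.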

Orderly Images: Contour-Based Homography Optimization (Tracking)
Section 3.2 provides the homography of the first frame of the orderly images (i.e., video images). For the rest of the images, an optimization (i.e., tracking) strategy is proposed to obtain the subsequent optimal homographies. As shown in Figure 4b, the 2D model edge of the ID card was projected onto the image plane using the prior homography of the planar object. Instead of dealing with the line segment itself, the projected line segment (black solid line in Figure 4a) was sampled as a series of points (brown points in Figure 4a). Then, a visibility test was performed for each of the sample points, because some of these sample points may be out of the field of view of the camera. For each visible sample point, a 1D search along the normal direction of the projected model line was employed to find its corresponding edge point, chosen as the point with the strongest gradient or the closest location. Finally, the sum of the errors between the sample points and their corresponding image points was minimized to solve for the homography between frames.
As shown in Figure 4a, $P_i$ and $p_i$ are the projected sample points and their corresponding image points, respectively, the latter observed with noise along the normal direction. A function is defined to measure the normal distance between $P_i$ and $p_i$:

$$ d_i = n_i^{\top} (p_i - P_i) \qquad (7) $$

where $n_i$ is the unit normal vector of the projected model line at the sample point $P_i$. Assuming a Gaussian distribution for $d_i$, the problem of homography optimization can be presented as:

$$ \hat{H} = \arg\min_{H} \sum_{i=1}^{S} d_i^{2} \qquad (8) $$

where $S$ is the number of model points. This optimization problem can be solved by nonlinear techniques; the iterative least squares technique was used in this study.
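The minimization of the summed squared normal distances can be sketched with a plain Gauss-Newton iteration. This is only an illustration under stated assumptions: the matched image points are taken as given (in practice they would come from the 1D search along the normal), $H$ is parameterized by eight entries with $h_{33} = 1$, and the Jacobian is approximated by forward differences. The function names are hypothetical, and the study's own iterative least squares implementation may differ.

```python
import numpy as np

def project(H, pts):
    q = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return q[:, :2] / q[:, 2:3]

def normal_distances(h, samples, seg_start, seg_end, matches):
    """d_i = n_i . (p_i - P_i), with P_i the sample point projected by the current H."""
    H = np.append(h, 1.0).reshape(3, 3)
    tangent = project(H, seg_end) - project(H, seg_start)
    tangent /= np.linalg.norm(tangent, axis=1, keepdims=True)
    normals = np.stack([-tangent[:, 1], tangent[:, 0]], axis=1)  # unit normals of projected lines
    return np.sum(normals * (matches - project(H, samples)), axis=1)

def refine_homography(H0, samples, seg_start, seg_end, matches, iters=10, eps=1e-7):
    """Gauss-Newton on sum(d_i^2) using a forward-difference Jacobian (a sketch)."""
    h = (H0 / H0[2, 2]).ravel()[:8].copy()
    for _ in range(iters):
        r = normal_distances(h, samples, seg_start, seg_end, matches)
        J = np.empty((len(r), 8))
        for k in range(8):
            step = np.zeros(8)
            step[k] = eps
            J[:, k] = (normal_distances(h + step, samples, seg_start, seg_end, matches) - r) / eps
        delta, *_ = np.linalg.lstsq(J, -r, rcond=None)  # linearized least-squares update
        h = h + delta
    return np.append(h, 1.0).reshape(3, 3)
```

Here `samples` holds the model sample points, while `seg_start` and `seg_end` hold, row by row, the endpoints of the model line each sample belongs to. Each sample contributes only one scalar residual, so lines of several different orientations are needed to constrain all eight parameters.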

Initial Intrinsic Camera and Pose Parameter Retrieval
With a series of homography matrices, i.e., H-matrices (three or more orientations), the intrinsic camera and pose parameters can be estimated using a closed-form solution [57].
As with the regular calibration method using a chessboard, degenerate cases exist [57]. When the camera was fixed and the object was placed on a turntable, as shown in Figure 5, the multi-view images were taken under pure planar motion [58]. In that situation, the method failed to retrieve the camera parameters, although multiple homographies were estimated. Using a hand-held moving camera, which breaks pure planar motion, avoids this degenerate configuration.
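A closed-form recovery of the intrinsic matrix and of the pose from several plane-to-image homographies can be sketched as below. The sketch follows the widely used plane-based calibration constraints, in which the first two columns of each homography correspond to orthonormal rotation columns; whether this matches the solution of [57] in every detail is an assumption. Lens distortion is ignored, and the function names and the `scale` normalization parameter are hypothetical.

```python
import numpy as np

def _v(H, i, j):
    """Coefficients of h_i^T B h_j in the unknowns (B11, B12, B22, B13, B23, B33)."""
    a, b = H[:, i], H[:, j]
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[1] * b[1],
                     a[2] * b[0] + a[0] * b[2],
                     a[2] * b[1] + a[1] * b[2],
                     a[2] * b[2]])

def intrinsics_from_homographies(Hs, scale=1000.0):
    """Estimate K from >= 3 homographies of a plane seen under different orientations."""
    N = np.diag([1.0 / scale, 1.0 / scale, 1.0])    # assumed pixel normalization for conditioning
    rows = []
    for H in Hs:
        Hn = N @ H
        rows.append(_v(Hn, 0, 1))                   # h1^T B h2 = 0
        rows.append(_v(Hn, 0, 0) - _v(Hn, 1, 1))    # h1^T B h1 = h2^T B h2
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    b11, b12, b22, b13, b23, b33 = Vt[-1]           # null vector of the stacked system
    B = np.array([[b11, b12, b13], [b12, b22, b23], [b13, b23, b33]])
    if B[0, 0] < 0:
        B = -B                                      # fix the arbitrary sign of the null vector
    L = np.linalg.cholesky(B)                       # B ~ K^-T K^-1, so L ~ K^-T
    K = np.linalg.inv(N) @ np.linalg.inv(L).T
    return K / K[2, 2]

def pose_from_homography(K, H):
    """Recover R and t of the model plane (Z = 0) from one homography and K."""
    A = np.linalg.inv(K) @ H
    lam = 1.0 / np.linalg.norm(A[:, 0])
    if A[2, 2] < 0:
        lam = -lam                                  # keep the plane in front of the camera
    r1, r2, t = lam * A[:, 0], lam * A[:, 1], lam * A[:, 2]
    return np.column_stack([r1, r2, np.cross(r1, r2)]), t
```

With noisy homographies the recovered rotation is only approximately orthonormal, and under pure planar motion the stacked system loses rank, which is consistent with the degenerate turntable case described above.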

Parameter Refinement by Bundle Adjustment
The intrinsic camera and pose parameters of multiple view images are presented in Section 3. However, these parameters may not be reasonably accurate because of the image noise and the soft constraint of the contour-based model [59]. Thus, in this section, the parameters are refined and the scale information is taken into account to obtain a framework suitable for measurement. Specifically, the parameters were refined using the bundle adjustment model [60] with planar control points of the contour-based model, which were also used in the homography estimation stage in Section 3 [61]. The planar control points play the role of GCPs to obtain the real scale of the pose parameters. With regard to the optimization problem on the 3D structure and viewing parameters, the bundle adjustment obtained the optimal reconstruction based on the assumption that the noise in the observed image features is white Gaussian noise [62].
Generally, the following conditions are necessary for the bundle adjustment procedure: (1) the input images must have sufficient overlap and (2) the initial parameter values must be known. These conditions can all be guaranteed by the framework described in the previous sections. First, the images were arranged by the homography recognition specified in Section 3.2. Second, the overlap was easily guaranteed by the shooting mode (i.e., video images or small movement of the camera). Finally, the initial intrinsic camera and pose parameters were estimated in Section 3.4. Therefore, the second condition was also satisfied. Under these conditions, a routine sparse reconstruction similar to SfM, described in the following subsection, was first implemented so that the images obtain a relative orientation structure. Then, a bundle adjustment model was built to refine all of the parameters by using the sparse reconstruction and the intrinsic camera and pose parameters obtained in Section 3.4.

Sparse Reconstruction
In this section, part of the commonly used SfM pipeline was adopted to obtain the correspondences and tracks of the images. The first step was to find point features using the SIFT [63] or SURF [64] key point detector. The second step was to match those key points between each pair of multi-view images using the FLANN [65] method. Then, the corresponding fundamental matrix was robustly estimated using RANSAC [66]. During each RANSAC iteration, a candidate fundamental matrix was computed using the eight-point algorithm, followed by nonlinear refinement [58]. Subsequently, the matches that were outliers to the recovered fundamental matrix were removed. When the number of inliers was less than the preset threshold (20 in our framework), all of the matches of that image pair were removed from consideration [22]. Finally, after a set of consistent matches had been found between each image pair, all matches were organized into tracks using the union-find-based tracking method [67].
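The track-building step described above can be sketched as follows. This is a minimal illustration and not the implementation of [67]: the names `UnionFind` and `build_tracks`, the input layout (a dictionary from image pairs to inlier key point index pairs) and the rule that a track seeing the same image twice is discarded are assumptions made for the example. Only the inlier threshold of 20 comes from the text.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items lazily as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def build_tracks(pairwise_matches, min_inliers=20):
    """Group (image, keypoint) observations into tracks.

    pairwise_matches maps (img_a, img_b) to a list of (kp_a, kp_b)
    inlier matches that survived the fundamental-matrix check.
    """
    uf = UnionFind()
    for (img_a, img_b), matches in pairwise_matches.items():
        if len(matches) < min_inliers:
            continue  # too few inliers: drop the whole image pair
        for kp_a, kp_b in matches:
            uf.union((img_a, kp_a), (img_b, kp_b))

    groups = defaultdict(list)
    for obs in list(uf.parent):
        groups[uf.find(obs)].append(obs)

    tracks = []
    for members in groups.values():
        images = [img for img, _ in members]
        # Assumed consistency rule: a track may see each image only once.
        if len(members) >= 2 and len(images) == len(set(images)):
            tracks.append(sorted(members))
    return sorted(tracks)
```

Each returned track is a list of `(image_index, keypoint_index)` observations of one scene point, which is the form needed for the triangulation that follows.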
In contrast to the SfM procedure, which aims to recover a set of camera parameters and a 3D location for each track by an incremental strategy [22], the proposed method already had the initial camera parameters solved by the homography estimation in Section 3.4. Therefore, the sparse 3D points can be obtained directly by triangulation, using the large number of matches and baselines between image pairs. These feature points were defined as Type I features. The other points (defined as Type II features) were the corner points of the contour model (as shown in Figure 2). Type II feature points were taken as GCPs in the subsequent bundle adjustment procedure.
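Because the camera parameters are already known, each matched pair can be triangulated in closed form. The sketch below uses the midpoint method, which is one common choice and not necessarily the one used in this framework. The function name `triangulate_midpoint` and the inputs (camera centres and viewing-ray directions expressed in the world frame) are assumptions for illustration.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two viewing rays.

    c1, c2: camera centres; d1, d2: ray directions through the matched
    image points, all given as 3-element sequences in the world frame.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; baseline too short")

    # Ray parameters of the mutually closest points on the two rays.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]
```

The parallel-ray check reflects why a sufficient baseline matters: as the two rays approach parallel, the denominator approaches zero and the depth becomes ill-conditioned.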

Bundle Adjustment
A dataset that conforms to the conditions of the bundle adjustment procedure mentioned previously was thus obtained. A bundle adjustment model with planar control points can then be built and solved to refine the parameters, as described in this section. As shown in Figure 6, a ray in the bundle connects camera C and space point P, and this ray intersects the corresponding image plane at p.
In Figure 6, C was the projective center of the camera, and o was the corresponding principal point with coordinates (c_x, c_y), while the equivalent focal length of the camera was (f_x, f_y). The lens distortion parameters were (k_0, k_1, k_2, k_3, k_4). The pose parameters of the camera were a translation vector T(T_X, T_Y, T_Z) and the rotation matrix R, which has the following two equivalent expressions: (1) the rotation angles (A_X, A_Y, A_Z) and (2) the nine elements r_0 to r_8 of the rotation matrix.
P was the space feature point corresponding to image point p. The distortion of p was denoted as (δ_x, δ_y). Furthermore, O_1 to O_4 were Type II feature points with corresponding image points o_1 to o_4.
As shown in Figure 1, the relationship between the space feature point and the corresponding image point was described by the collinearity equations [68], referred to as Equation (9). Equation (9) constitutes the observation equations in the bundle adjustment model [60], and the observations are the coordinates (x, y) of the image feature points. The unknown parameters of the bundle adjustment procedure comprise the space feature points (including Types I and II) and the camera parameters. The constraint of the bundle adjustment procedure is the collinearity condition described in Equation (9), and this constraint is often used in computer vision.
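Equation (9) itself is not reproduced in this text. For orientation, a commonly used form of the collinearity equations, written with the symbols defined above, is given below. It assumes that the space point P = (X, Y, Z) maps to camera coordinates as R(P − T) and that r_0 to r_8 are the row-major elements of R. Sign conventions and the placement of the distortion terms differ between formulations, so this may not match the original equation term for term.

```latex
\begin{aligned}
x - c_x - \delta_x &= f_x\,
  \frac{r_0 (X - T_X) + r_1 (Y - T_Y) + r_2 (Z - T_Z)}
       {r_6 (X - T_X) + r_7 (Y - T_Y) + r_8 (Z - T_Z)},\\[4pt]
y - c_y - \delta_y &= f_y\,
  \frac{r_3 (X - T_X) + r_4 (Y - T_Y) + r_5 (Z - T_Z)}
       {r_6 (X - T_X) + r_7 (Y - T_Y) + r_8 (Z - T_Z)}.
\end{aligned}
```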
The bundle adjustment procedure was illustrated using video images, as shown in Figure 7. First, the initial camera parameters were obtained by the method described in Section 3. Then, feature matching and tracking were executed, as shown in Figure 7a,b, and the sparse 3D points were obtained by triangulation of the feature tracks. Subsequently, the initial camera parameters, sparse 3D points and GCPs (planar control points) were subjected to the bundle adjustment procedure. A converged result for the camera parameters and sparse 3D points was obtained, as shown in Figure 7c. The sparse 3D points exhibited an evident visual improvement. A further quantitative comparison is given in the experiments.
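The quantity that bundle adjustment drives down is the reprojection error of all observations. The following sketch evaluates it for a given set of cameras and points. It is an illustration under stated assumptions: lens distortion is ignored, camera coordinates are taken as R(P − T), the RMS is computed per observed point, and the names `project` and `reprojection_rms` are hypothetical. An actual adjustment would pass such residuals to a nonlinear least-squares solver while treating the Type II points as control points with known metric coordinates.

```python
import math


def project(point, R, T, fx, fy, cx, cy):
    """Pinhole projection without lens distortion.

    Assumed convention: camera coordinates = R * (point - T), with R given
    as three rows of three numbers and T as the camera position.
    """
    d = [p - t for p, t in zip(point, T)]
    xc, yc, zc = (sum(r * v for r, v in zip(row, d)) for row in R)
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return (cx + fx * xc / zc, cy + fy * yc / zc)


def reprojection_rms(observations, points, cameras):
    """RMS reprojection error over all observations, in pixels.

    observations: list of (camera_id, point_id, x, y) image measurements.
    points:       dict point_id -> (X, Y, Z), Type I and Type II alike.
    cameras:      dict camera_id -> (R, T, fx, fy, cx, cy).
    """
    squared = 0.0
    for cam_id, pt_id, x, y in observations:
        u, v = project(points[pt_id], *cameras[cam_id])
        squared += (u - x) ** 2 + (v - y) ** 2
    return math.sqrt(squared / len(observations))
```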

Homography Recognition and Tracking
The homography from the planar objects to the image plane was recognized and tracked in an environment under a wide variety of viewpoints to validate the proposed method. All images were captured by a mobile phone. The line segments utilized for homography recognition were detected in the images by the LSD method [69].
A. Performance of homography recognition for disorderly images: In this experiment, the asymmetric objects shown in Step I of Figure 1 were placed in the environment, and the disorderly multi-view images were acquired by a moving camera at different positions and orientations. As shown in Figure 8a, the lines detected in the images are drawn in red, whereas the thick green lines correspond to the projections of the model lines mapped by the recognized homography. Figure 8a shows that a large number of image line segments were distributed on the background, such as the other books and the desk, whereas only a small number of line segments were detected in the object region. Even in such a case, the proposed method can provide an accurate homography recognition. Moreover, the planar objects can still be recognized when the viewing angle with respect to the orientation of the model plane is comparatively large. Although some deviation exists in the homography recognition, the result is sufficient as an initial value for homography optimization and can lead to an accurate homography estimation. When all homographies have been recognized, the images can be rearranged by the pose of the camera to obtain a spatially ordered set, that is, a polygon shape without self-intersection, as shown in Figure 8b. This operation builds image pairs with as much overlapping area as possible to ensure successful feature matching and tracking.
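One simple way to obtain such a non-intersecting ordering is to sort the camera centres by azimuth around their centroid. The sketch below is an assumption about how the rearrangement could be done and is not taken from the paper. It presumes that the camera centres are expressed in the coordinate frame of the planar object, with the model plane at Z = 0, and that the cameras roughly surround the object. The function name `order_cameras_around_object` is hypothetical.

```python
import math


def order_cameras_around_object(centres):
    """Return image indices sorted by azimuth around the centroid.

    centres: list of (X, Y, Z) camera centres in the model-plane frame.
    Only X and Y are used, i.e. the centres are projected onto the plane.
    Visiting the cameras in this order traces a polygon that does not
    cross itself when the centroid lies inside the camera ring.
    """
    n = len(centres)
    mean_x = sum(c[0] for c in centres) / n
    mean_y = sum(c[1] for c in centres) / n
    return sorted(
        range(n),
        key=lambda i: math.atan2(centres[i][1] - mean_y,
                                 centres[i][0] - mean_x),
    )
```

Neighbouring entries in the returned order are spatial neighbours, so consecutive image pairs tend to share the most overlap.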
B. Performance of homography tracking for orderly images: To validate the performance of the proposed homography tracking method, the objects were placed in the environment and captured in the form of video images while undergoing large rotation and translation. Some sampled results from the video images are exhibited. As shown in Figure 9, the proposed method can provide a good match with the contour of the objects in the images. The pose parameters of the video images were computed by the proposed framework, and the dense matching procedure [21] was executed based on the camera parameters.

Accuracy Evaluation
In this section, the pose estimation technique and the corner-based method were first compared using four images sampled from a video captured with a smartphone, as shown in Figure 10. Then, a 3D reconstruction result was used to verify the parameter accuracy.
The image captured by the smartphone was 960 × 540. In the experiment, the chessboard plane contained 10 × 13 interior corners and 23 lines. The results are shown in Table 2. The results of the proposed method exhibited a slight difference from the results of the corner-based method. When only four edges of the plane pattern were utilized, the proposed method provided results consistent with the corner-based method, and the offset of the camera parameters with respect to the corner-based method was small (five pixels). The last column of Table 2 shows the reprojection RMS of the three methods. When all of the 23 lines were utilized, the proposed method provided almost the same reprojection error as the corner-based method. The four-line-based method returned a slightly larger reprojection error, because only the minimum number of model lines was utilized.
The number of lines was varied from four to 23 to investigate the stability of the proposed method further. The results are shown in Figure 11. The values of c_x and c_y recovered by the proposed method were approximately the same as the values estimated by the corner-based method, with only a small deviation. The reprojection errors of the proposed method decreased significantly as the number of lines increased from four to 17. When the number was greater than 17, the reprojection error was close to that of the corner-based method.
The robust homography recognition and tracking can provide good initial parameters for multi-view images; therefore, the bundle adjustment model can converge effectively. A comparison experiment was set up to validate the accuracy of the proposed method. A chessboard was placed in the scene. Multi-view images (total count = 24) were captured around the scene (as shown in Figure 12). Then, the camera parameters (i.e., intrinsic camera and pose parameters) were calculated by the classical chessboard calibration method [57] and the proposed method. Because the intrinsic camera and pose parameters are coupled, directly comparing the values of the parameters is of limited significance. Therefore, a comparison in 3D space was set up because the accuracy of the pose is the most important element in dense matching.
As shown in Figure 13, three methods were designed to calculate the camera parameters of the same multi-view image set: (1) homo_exp: the camera parameters were calculated from the homography estimation, which gives the initial parameters of this framework; (2) homo&ba_exp: the initial camera parameters of homo_exp were refined by the bundle adjustment procedure; and (3) chessbd_exp: the camera parameters were calculated by the chessboard method. The three sets of parameters were used in the same dense matching procedure [21]. The point cloud results were compared to evaluate the accuracy of the camera pose.
As shown in Figure 13, the outer four edge lines of the chessboard were taken as the contour model for our framework. The initial parameters were decomposed from multiple homographies by the homography tracking strategy described in Section 3.3.
As shown in Figure 14, the depth image of chessbd_exp was taken as the ground truth. Notably, the depth image of the homo_exp method was the worst because of its discontinuous depth values, which were in conflict with the real scene, and the depth image of homo&ba_exp was close to the ground truth.
The particularly interesting parts are marked by the green ellipses, as shown in Figure 16. In the first column, the contour of the text "2.72 kg" marked by the green ellipse was sharper and clearer in the homo&ba_exp and chessbd_exp methods than in the homo_exp method. The same phenomenon occurred in the second and third columns of the figure. In the last column, the side view of the table in the scene is shown. A clear misalignment was observed in the homo_exp and chessbd_exp methods, and a good alignment was observed in the homo&ba_exp method. Such misalignment is also visible at the left corner of the book cover in the second column. The faults in the point clouds can be attributed to the camera parameters provided by the three methods because the dense reconstruction and input images were the same in all three methods.
The point clouds of the homo&ba_exp and chessbd_exp methods were aligned to the same coordinate system. Only a small translation between the two methods was observed because of the different origins. The bias between the two point clouds was computed as a color error map, as shown in Figure 17, with the distance measured in mm. Approximately 90% of the points were in the −0.72 mm to 0.72 mm range (green color), and the average distance and standard deviation of the two aligned point clouds were 0.90 mm and 1.3 mm, respectively. The completeness value measures how well the result provided by the homo&ba_exp method covers the ground truth (the traditional chessboard method). Compared with the SfM method, the camera parameters and the follow-up 3D reconstruction all had scale information because the real size of the contour model was used. Therefore, the proposed framework can be used in vision tasks that need real-scale information.
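The cloud-to-cloud statistics reported above (mean distance, standard deviation and the share of points inside a tolerance band) can be computed along the following lines. This is a brute-force sketch for small clouds with hypothetical function names. It uses unsigned nearest-neighbour distances, whereas the error map in Figure 17 appears to be signed, and it is not the tool used to produce the published numbers.

```python
import math


def nearest_distances(source, reference):
    """Distance from each source point to its nearest reference point.

    Brute force, O(len(source) * len(reference)); a spatial index such as
    a k-d tree would be needed for dense clouds.
    """
    return [min(math.dist(p, q) for q in reference) for p in source]


def summarize(distances, tolerance):
    """Return (mean, population standard deviation, fraction <= tolerance)."""
    n = len(distances)
    mean = sum(distances) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in distances) / n)
    within = sum(1 for d in distances if d <= tolerance) / n
    return mean, std, within
```

With both clouds expressed in millimetres, `summarize(nearest_distances(a, b), 0.72)` yields values comparable in kind to those quoted in the text.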

3D Reconstruction Application
Camera pose is vital to 3D reconstruction, particularly in a measurable framework that can provide the real size of the 3D model. As shown in Figure 18, the aim was to recover the 3D shape of an outdoor statue. Similar to the A4 paper, the platform edge of the statue (size = 4 m × 4 m) was used as the planar object in this framework, and a video around the statue was captured using a smartphone (iPhone 5S; video size = 1920 × 1080). The intrinsic camera and pose parameters of the multi-view images (approximately 100 frames) extracted from the video were obtained with this framework. A mixed silhouette-based and photo-consistent 3D reconstruction approach [70] was executed with a set of silhouette images segmented by the level set method [71]. The point cloud was converted into a mesh using the Poisson method [72], and the mesh was then textured with the color images [73]. Additional indoor 3D reconstruction results are shown in Figures 19 and 20 (the images were also captured from 1080p video). Table 3 summarizes the experiments in the different scenarios.

It can be concluded from the experiments that the pose estimation method was applied to 3D reconstruction successfully. It was suitable not only for small objects indoors (as shown in Figures 19 and 20), but also for large objects outdoors (as shown in Figure 18). Additionally, the contour models with a standard size and planar structure (as shown in Table 3) made the experiments easy to operate. In addition, the photography method differed among the three experiments (see the third column of Table 3), and this did not affect the results. Finally, the computational time of the sparse reconstruction, measured on a 3.00-GHz Intel Xeon E3-1220 v5 machine, is shown in the fifth column of Table 3.


Discussion
In this study, a complete framework that aims to estimate the pose of multi-view images has been presented. Under the proposed framework, the intrinsic camera and pose parameters of disorderly or orderly multi-view images can be recovered. Compared with other methods, the proposed method has the following advantages: (1) The homography estimation method can be considered a revised version of model-based 3D tracking [74,75], which was developed to estimate the six-DOF pose of the camera. Rather than initially estimating the affine transformation parameters and then the remaining non-affine parameters [76], this method used an iterative optimization process to refine the recognized homography directly. (2) Compared with a similar work [77], in which the mapping was modeled as an affine transformation and line correspondences were utilized in the refining process, the proposed method recognized the eight DOF of the homography and optimized the initial transformation iteratively by treating the object contour as a series of sample points, such that curved edges can also be integrated. In this approach, the initial homography was recognized in a hypothesize-and-verify framework over the unmatched set of lines. Moreover, the optimized homography was obtained by minimizing the errors between the sample points and their corresponding image points, which were found by a 1D search along the normal direction. (3) The robust approximate homography estimation is a vital stage, which can provide good initial parameters for the bundle adjustment procedure and can transform disorderly multi-view images into orderly multi-view images. The proposed method focused on obtaining the intrinsic camera and pose parameters with scale information and improving the precision of those parameters by the bundle adjustment procedure.
In addition, for general users who perform vision tasks, a prepared planar calibration pattern may not always be available. However, many common items in daily life have a standard size and planar structure. By exploiting their edge information, an easier and more practical pose estimation method was proposed.

Conclusions
In this study, a framework was designed that can provide the intrinsic camera and pose parameters of uncalibrated multi-view images via homography estimation based on the contour model of planar objects, followed by the bundle adjustment procedure. The framework aims to help general users perform vision tasks without a prepared planar calibration pattern (e.g., a chessboard). Instead, common items in daily life that have a standard size and planar structure can be used (see Step I of Figure 1). By exploiting edge information, an easy, practical and automatic pose estimation method for uncalibrated multi-view images was proposed. In practice, the method can be used to measure the actual size of an object. For example, the 3D shape of human feet can be used to obtain several key measurements for designing shoes. In addition, the differences in size between two objects can be measured according to the same prepared planar model. The method can play an important role in the measurement of industrial parts and the digital preservation of antiquities.
The approximate homography was obtained in a hypothesize-and-verify framework. The quadrangle-like structure was used to ensure automatic and stable recognition of the homography in common environments. The robust approximate homography can provide good initial intrinsic camera and orientation parameters. Moreover, disorderly images can be rearranged into an orderly set, which is helpful in multi-view image processing. Subsequently, the intrinsic camera and pose parameters were refined by the bundle adjustment procedure. The experimental results revealed that the proposed method can conveniently generate a multi-view framework with scale information as compared with the traditional absolute orientation method and the SfM procedure. Planar objects carry many other features, e.g., circles and quadrics. In principle, all of these features can be added to the homography estimation stage. However, only the line features of planar objects were used in this framework. The configuration must satisfy two requirements: the number of lines must be at least four, and the lines must have intersection points. Fulfilling these requirements was the main challenge that we met. The objective of future study is to use all of the features, which can result in a more convenient and accurate pose estimation.
Step III: Calculate the Intrinsic Camera and Pose Parameters. (a) Homography recognition and tracking were respectively applied for multi-view images. (b) Matching of line segments in a single image. In the configuration of an ID card, four line segments were matched. (c) Multiple homographies can be computed when homography estimation has been done. Then, initial intrinsic and camera pose parameters can be decomposed by multiple homographies.

Step IV: Parameter Refinement by Bundle Adjustment. Sparse 3D points can be triangulated with the known initial camera parameters from the previous step and the corresponding feature points. In this study, the corner points of the contour model were taken as Ground Control Points (GCPs). The bundle adjustment procedure was executed for parameter refinement.

Figure 1. Overview of our framework.

ISPRS Int. J. Geo-Inf. 2016, 5, 244

Figure 2. The lines and corners of the planar objects (from left to right: A4 paper, Book 1 cover and Book 2 cover). The model lines are shown in green, and the corner points are shown in red.

Figure 3. Homography transformation between the model and image lines.

Figure 4. 1D search from the model line to the image line. (a) Sketch map of the image edge and the projected edge. The black solid lines are the sampled line segments, and the brown points are the sampled points. (b) Real image of the image edge and the projected edge. The yellow line is the image edge projected by a correct homography or the prior homography. The blue line is the edge projected by the current estimated homography.

Figure 5. Degenerate situation for camera parameter calculation. (a) Image sequence of a flat scene with a fixed camera and the scene rotating on a single-axis turntable; (b) the estimated orientation of the camera motion of (a); the estimated poses have a large offset from the real values, which means that the pose estimation failed.

Figure 6. Bundle adjustment model with planar control points.

Figure 7. Bundle adjustment procedure with planar control points. (a) Feature matching between a pair of images. The green crosses are the matched SIFT key points. The four red line segments are the edges of an ID card; (b) feature tracks between multiple view images; (c) a bundle adjustment model is generated and executed. The left is the sparse reconstruction result using the initial camera parameters. The right is the sparse result refined by the bundle adjustment procedure.

Figure 8. Homography recognition for disorderly multi-view images. (a) Homography recognition of the disorderly multi-view images captured by a smartphone; the correctly recognized homographies are drawn as green lines, while erroneous lines are drawn in red; (b) the left is the result of pose estimation for the disorderly multi-view images; polylines are connected according to the capture sequence. The right is the sequence rearranged according to the pose of the cameras.

Figure 9. Homography tracking for the Book 1 and Book 2 cover contour models in an environment. (a) The projections of the contour model of the Book 1 cover are drawn in blue by the recovered homography. The four images are the 100th, 200th, 300th and 400th frames in the video, respectively; (b) the left is the camera pose result of the video images, and the right is the 3D point cloud of the scene containing the Book 1 cover; (c) the projections of the contour model of the Book 2 cover are drawn in yellow by the recovered homography. The four images are the 100th, 200th, 300th and 400th frames in the video, respectively; (d) the left is the camera pose result of the video images, and the right is the 3D point cloud of the scene containing the Book 2 cover.

Figure 10. Four images of the model plane for camera calibration.

Figure 11. Results versus the number of model lines.

Figure 12. Multi-view images of the scene. The images were captured by a smartphone, and a 7 × 9 × 25 mm chessboard was placed in the scene.

Figure 13. Pose estimation of three methods. (a) The left column shows the features used in the estimation procedure, which, from top to bottom, are the four outer edge lines of the A4 paper, the four outer edge lines of the A4 paper with bundle adjustment and the corners of the chessboard, respectively; (b) the right column shows the pose results of the three methods.

Figure 14. Comparison of depth images. (a) The original image; (b-d) the corresponding depth images obtained by the methods homo_exp, homo&ba_exp and chessbd_exp.

Figure 15. Point clouds of dense reconstruction using the camera parameters of the three methods. (a,b) The total point clouds with color and normals.

Figure 16. The detailed parts of the three sets in zoomed-in view. Four parts are selected to show the diversity of the three methods. The detail regions are marked by the green ellipses.

Figure 17. Bias analysis when the two point clouds are aligned.

Figure 18. 3D shape of the statue from uncalibrated multi-view images. (a) Several frames sampled from a 1080p video captured by a smartphone; (b) recovered pose parameters of the multi-view images and the sparse 3D points; (c) the left is the colored point cloud, the middle is the point cloud with normals and the right is the model obtained by triangulation.

Figure 19. 3D reconstruction results of One Piece.

Figure 20. 3D reconstruction results of The Hulk.

Table 1. The contents of the planar models.

$$x = \delta_x + c_x + f_x \frac{r_0 X + r_1 Y + r_2 Z + T_X}{r_6 X + r_7 Y + r_8 Z + T_Z}, \qquad y = \delta_y + c_y + f_y \frac{r_3 X + r_4 Y + r_5 Z + T_Y}{r_6 X + r_7 Y + r_8 Z + T_Z}$$
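The projection equations for x and y, with rotation entries r0 to r8, translation (TX, TY, TZ), focal lengths (fx, fy) and principal point (cx, cy), can be sketched as a small function. This is an illustrative reading of the formula and not the authors' code: the names `project` and `reprojection_residual` are hypothetical, the x-equation is assumed to mirror the y-equation, and the terms δx and δy are simply passed in as given values because their model is not stated in this section.

```python
import numpy as np


def project(point, r, T, fx, fy, cx, cy, delta=(0.0, 0.0)):
    """Project a 3D point (X, Y, Z) to image coordinates (x, y).

    r holds the rotation entries r0..r8 in row-major order, T = (TX, TY, TZ).
    delta = (delta_x, delta_y) is an assumed, externally supplied correction.
    """
    X, Y, Z = point
    r = np.asarray(r, dtype=float).reshape(9)
    denom = r[6] * X + r[7] * Y + r[8] * Z + T[2]
    x = delta[0] + cx + fx * (r[0] * X + r[1] * Y + r[2] * Z + T[0]) / denom
    y = delta[1] + cy + fy * (r[3] * X + r[4] * Y + r[5] * Z + T[1]) / denom
    return np.array([x, y])


def reprojection_residual(observed, point, r, T, fx, fy, cx, cy):
    """Observed minus predicted image position for one point in one view."""
    predicted = project(point, r, T, fx, fy, cx, cy)
    return np.asarray(observed, dtype=float) - predicted
```

Bundle adjustment minimizes the sum of squared residuals of this kind over all views and points; for control points such as the model corners, the 3D coordinates are known rather than estimated.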

Table 2. Calibration results with the real data of four images.

Table 3. Experiments in different scenarios.
