Evaluating the Performance of Structure from Motion Pipelines

Structure from Motion (SfM) is a pipeline that allows three-dimensional reconstruction starting from a collection of images. A typical SfM pipeline comprises different processing steps, each of which tackles a different problem in the reconstruction pipeline. Each step can exploit different algorithms to solve the problem at hand, and thus many different SfM pipelines can be built. How to choose the SfM pipeline best suited for a given task is an important question. In this paper we report a comparison of different state-of-the-art SfM pipelines in terms of their ability to reconstruct different scenes. We also propose an evaluation procedure that stresses the SfM pipelines using real datasets acquired with high-end devices as well as realistic synthetic datasets. To this end, we created a plug-in module for the Blender software to support the creation of synthetic datasets and the evaluation of the SfM pipeline. The use of synthetic data allows us to easily obtain arbitrarily large and diverse datasets with, in theory, infinitely precise ground truth. Our evaluation procedure considers both the reconstruction errors and the estimation errors of the camera poses used in the reconstruction.


Introduction
Three-dimensional reconstruction is the process of capturing the geometry and appearance of an object or an entire scene. In recent years, interest has grown around the use of 3D reconstruction for reality capture, gaming, and virtual and augmented reality. These techniques have been used to create video game assets [1,2], virtual tours [3], and mobile 3D reconstruction apps [4][5][6]. Other areas in which 3D reconstruction can be used include CAD (Computer Aided Design) software [7], computer graphics and animation [8,9], medical imaging [10], virtual and augmented reality [11], and cultural heritage [12]. Over the years, a variety of techniques and algorithms for 3D reconstruction have been developed to meet different needs in various fields of application. They range from active methods, which require special equipment to capture geometry information (e.g., laser scanners, structured light, microwaves, ultrasound), to passive methods, which are based on optical imaging techniques only. The latter do not require special devices or equipment and are thus easily applicable in different contexts. Among the passive techniques for 3D reconstruction is the Structure from Motion (SfM) pipeline [13][14][15][16][17]. As shown in Figure 1, given a set of images acquired from different observation points, it recovers the pose of the camera for each input image and a three-dimensional reconstruction of the scene in the form of a sparse point cloud. After this first sparse reconstruction, it is possible to run a dense reconstruction phase using Multi-View Stereo (MVS) [18]. As can be seen from Figure 2, a typical SfM pipeline comprises different processing steps, each of which tackles a different problem in the reconstruction pipeline. Each step can exploit different algorithms to solve the problem at hand, and thus many different SfM pipelines can be built. There are many SfM pipelines available in the literature. How to choose the best among them?
In this paper, we compare different state-of-the-art SfM pipelines in terms of their ability to reconstruct different scenes. The comparison is carried out by evaluating the reconstruction error of each pipeline on an evaluation dataset. The dataset is composed of real objects whose ground truth has been acquired with high-end devices. Obtaining real scenes with reference models is not trivial; we have therefore developed a plug-in for a rendering software to create an evaluation dataset starting from synthetic 3D scenes. This allows us to rapidly and efficiently extend the existing datasets and stress the pipelines under various conditions.
The rest of the paper is organized as follows. In Section 2, we describe the building blocks of the incremental SfM pipeline and compare their implementations. In Section 3, we describe how we evaluated the different pipelines. In Section 4, we present a plug-in that generates synthetic datasets and evaluates the SfM and MVS reconstructions. In Section 5, we present and comment on the evaluation results. Section 6 concludes the paper. In Appendix A, we provide some guidelines on how best to capture images to be used in a reconstruction pipeline.

Review of Structure from Motion
The SfM pipeline allows the reconstruction of three-dimensional structures starting from a series of images acquired from different observation points. The complete flow of incremental SfM pipeline operations is shown in Figure 2. In particular, incremental SfM is a sequential pipeline that consists of a first phase of correspondence search between images and a second phase of iterative incremental reconstruction. The correspondence search phase is composed of three sequential steps: Feature Extraction, Feature Matching and Geometric Verification. This phase takes the image set as input and generates as output the so-called Scene Graph (or View Graph), which represents relations between geometrically verified images. The iterative reconstruction phase is composed of an initialization step followed by three reconstruction steps: Image Registration, Triangulation and Bundle Adjustment. Using the scene graph, this phase generates an estimate of the camera pose for each image and a 3D reconstruction in the form of a sparse point cloud.

SfM Building Blocks
In this section we describe the building blocks of a typical incremental SfM pipeline illustrating the problem that each of them addresses and the possible solutions exploited.
Feature Extraction: For each image given as input to the pipeline, a collection of local features is created to describe the points of interest of the image (key points). Different solutions can be used for feature extraction; the choice of algorithm influences the robustness of the features and the efficiency of the matching phase. Once the key points and their descriptions are obtained, the next step can search for correspondences of these points in different images.
Feature Matching: The key points and features obtained through Feature Extraction are used to determine which images portray common parts of the scene and are therefore at least partially overlapping. If two points in different images have the same description, then they can be considered the same scene point with respect to appearance; if two images have a set of points in common, then it is possible to state that they portray a common part of the scene. Different strategies can be used to efficiently compute matches between images; the solutions adopted by SfM implementations are reported in Table 1. The output of this phase is a set of images overlapping at least in pairs and the set of correspondences between features.
Geometric Verification: This phase of analysis is necessary because the previous matching phase only verifies that pairs of images apparently have points in common; it is not guaranteed that the matches found are real correspondences of 3D points in the scene, and outliers could be included. It is necessary to find a geometric transformation that correctly maps a sufficient number of points in common between two images. If this happens, the two images are considered geometrically verified, meaning that the points also correspond in the geometry of the scene. Depending on the spatial configuration with which the images were acquired, different methods can be used to describe their geometric relationship. A homography can be used to describe the transformation between two images of a camera that acquires a planar scene. Epipolar geometry, instead, describes the movement of a camera through the essential matrix E if the intrinsic calibration parameters of the camera are known; if the parameters are unknown, the uncalibrated fundamental matrix F can be used. Algorithms used for geometric verification are reported in Table 1. Since the correspondences obtained from the matching phase are often contaminated by outliers, it is necessary to use robust estimation techniques such as RANSAC (RANdom SAmple Consensus) [19] during the geometric verification process [20,21]. Instead of RANSAC, some of its optimized variants can be used to reduce execution times. Refer to Table 1 for a list of possible robust estimation methods. The output of this phase of the pipeline is the so-called Scene Graph, a graph whose nodes represent images and whose edges join the pairs of images that are considered geometrically verified.
Reconstruction Initialization: The initialization of the incremental reconstruction is an important phase because a bad initialization leads to a bad reconstruction of the three-dimensional model. To obtain a good reconstruction it is preferable to start from a dense region of the scene graph, so that the redundancy of the correspondences provides a solid base for the reconstruction. If the reconstruction starts from an area with few images, the Bundle Adjustment process does not have sufficient information to refine the reconstructed camera poses and points; this leads to an accumulation of errors and a bad final result. For the initialization of the reconstruction, a pair of geometrically verified images is chosen in a dense area of the scene graph. If more than one pair of images can be used as a starting point, the one with the most geometrically verified matching points is chosen. The points common to the two images are used as the first points of the reconstructed cloud; they are also used to establish the pose of the first two cameras. Subsequently, the Image Registration, Triangulation and Bundle Adjustment steps iteratively add new points to the reconstruction, considering one new image at a time.
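The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the logic of any specific pipeline: the function names are hypothetical, and the "density" of a pair is approximated here by the number of scene-graph neighbours of its two images, which is an assumption made for the example.

```python
from collections import defaultdict


def build_scene_graph(verified_pairs):
    """Build an undirected scene graph.

    verified_pairs maps (image_a, image_b) to the number of geometrically
    verified matches between the two images.
    """
    graph = defaultdict(dict)
    for (a, b), inliers in verified_pairs.items():
        graph[a][b] = inliers
        graph[b][a] = inliers
    return graph


def choose_initial_pair(graph):
    """Pick the pair in the densest area; break ties by verified matches."""
    best_pair, best_key = None, None
    for a in graph:
        for b, inliers in graph[a].items():
            if a >= b:  # visit each undirected edge only once
                continue
            # Assumed density proxy: total number of neighbours of both images.
            key = (len(graph[a]) + len(graph[b]), inliers)
            if best_key is None or key > best_key:
                best_pair, best_key = (a, b), key
    return best_pair
```

Real pipelines typically add further checks on the candidate pair (for example on the baseline between the two views); those are outside the scope of this sketch.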
Image Registration: Image registration is the first step of the incremental reconstruction. In this phase a new image is added to the reconstruction and is thus identified as a registered image. For the newly registered image, the pose (position and rotation) of the camera that acquired it must be calculated; this can be achieved using the correspondences with the known 3D points of the reconstruction. This step therefore takes advantage of the 2D-3D correspondences between the key points of the newly added image and the 3D reconstruction points that are associated with the key points of the previously registered images. To estimate the camera pose it is necessary to define the position, in terms of 3D coordinates in the world reference coordinate system, and the rotation (pitch, roll and yaw), for a total of six degrees of freedom. This is possible by solving the Perspective-n-Point (PnP) problem. Various algorithms can be used to solve the PnP problem (see Table 1). Since outliers are often present in the 2D-3D correspondences, these algorithms are used in conjunction with RANSAC (or its variants) to obtain a robust estimate of the camera pose. The newly registered image has not yet contributed any new points; this is done in the triangulation phase.
Triangulation: The previous step identifies a new image that certainly observes points in common with the 3D point cloud reconstructed so far. The newly registered image may observe further new points; such points can be added to the three-dimensional reconstruction if they are observed by at least one previously registered image. A triangulation process is used to define the 3D coordinates of the new points that can be added to the reconstruction, thus generating a denser point cloud. The triangulation problem takes a pair of registered images with points in common and the estimates of the respective camera poses; it then tries to estimate the 3D coordinates of each point in common between the two images. To solve the triangulation problem, an epipolar constraint is imposed. This requires that, in each image, the position from which the other image was acquired can be identified; these points are called epipoles. In the ideal case it is possible to use the epipolar lines to define the epipolar plane on which the point whose position is to be estimated lies. However, because of the inaccuracies in the previous phases of the pipeline, the point may not lie exactly at the intersection of the epipolar lines; this error is known as the reprojection error. To solve this problem, special algorithms that take the inaccuracy into account are necessary. Algorithms used by SfM pipelines are listed in Table 1.
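As an illustration of two-view triangulation, the following is a minimal sketch of the linear (DLT) method using NumPy. It is one common textbook approach and is not claimed to be the algorithm used by any of the pipelines in Table 1; the function names and the use of 3×4 projection matrices as input are assumptions made for the example.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear triangulation of one point seen in two registered images.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: (u, v) image coordinates of the same key point in each image.
    Returns the 3D point that best satisfies both projections in the
    algebraic least-squares sense.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point X with camera matrix P to image coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def reprojection_error(P, X, x):
    """Distance between the observed key point x and the projection of X."""
    return float(np.linalg.norm(project(P, X) - np.asarray(x)))
```

With noisy observations the reprojection error of the triangulated point is no longer zero, which is the inaccuracy that the subsequent refinement has to reduce.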
Bundle Adjustment: Since the estimation of camera poses and the triangulation can generate inaccuracies in the reconstruction, it is necessary to adopt a method to minimize the accumulation of such errors. The purpose of the Bundle Adjustment (BA) [22] phase is to prevent inaccuracies in the estimation of the camera pose from propagating to the triangulation of the cloud points, and vice versa. BA can therefore be formulated as the refinement of the reconstruction that produces optimal values for the 3D reconstructed points and the calibration parameters of the cameras. The algorithm used for BA is Levenberg-Marquardt (LM), also known as Damped Least-Squares; it solves least-squares problems in the non-linear case. Various implementations can be used, as shown in Table 1. This phase has a high computational cost and must be executed for each image that is added to the reconstruction. To reduce processing time, BA can be executed only locally (i.e., only for a small number of images/cameras, the most connected ones); BA is executed globally on all images only when the reconstructed point cloud has grown by at least a certain percentage since the last time global BA was performed.
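To make the role of Levenberg-Marquardt concrete, the sketch below implements a small, generic damped least-squares loop with NumPy and applies it to a toy problem: refining a single 3D point so that its reprojection error in fixed cameras is minimized. This is a heavily simplified illustration, not a full bundle adjustment (which jointly refines all cameras and points and exploits sparsity). The function names, the finite-difference Jacobian and the damping schedule are assumptions made for the example.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to image coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


def point_residuals(X, cameras, observations):
    """Stack the reprojection residuals of point X over all cameras."""
    return np.concatenate([project(P, X) - x
                           for P, x in zip(cameras, observations)])


def numerical_jacobian(residual_fn, params, eps=1e-6):
    """Forward-difference Jacobian of the residual vector."""
    r0 = residual_fn(params)
    J = np.zeros((r0.size, params.size))
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        J[:, i] = (residual_fn(params + step) - r0) / eps
    return J


def levenberg_marquardt(residual_fn, params, iterations=100, damping=1e-3):
    """Minimize the sum of squared residuals with a damped Gauss-Newton loop."""
    params = np.asarray(params, dtype=float)
    cost = float(np.sum(residual_fn(params) ** 2))
    for _ in range(iterations):
        r = residual_fn(params)
        J = numerical_jacobian(residual_fn, params)
        H = J.T @ J
        # Damped normal equations; the tiny identity term guards against
        # an exactly singular system.
        A = H + damping * np.diag(np.diag(H)) + 1e-12 * np.eye(params.size)
        delta = np.linalg.solve(A, -J.T @ r)
        candidate = params + delta
        new_cost = float(np.sum(residual_fn(candidate) ** 2))
        if new_cost < cost:
            params, cost = candidate, new_cost
            damping *= 0.5   # accepted step: move towards Gauss-Newton
        else:
            damping *= 10.0  # rejected step: move towards gradient descent
    return params
```

In a real BA problem the parameter vector would also contain the camera poses and, optionally, the intrinsic parameters, and the residual function would stack the reprojection errors of every observation.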

Incremental SfM Pipelines
Over the years, many different implementations of the SfM pipeline have been proposed. Here we focus our attention on the most popular ones with publicly available source code that allows customization of the pipeline itself. Among the available pipelines we can mention COLMAP, Theia, OpenMVG, VisualSFM, Bundler, and MVE. Here we briefly describe each pipeline, while Table 1 details their implementations with the algorithms used in each processing block.
COLMAP [14,23]: an open-source implementation of the incremental SfM and MVS pipeline. The main objective of its creators is to provide a general-purpose solution usable to reconstruct any scene, while also introducing enhancements in robustness, accuracy and scalability. The C++ implementation comes with an intuitive graphical interface that also allows configuration of the pipeline parameters. It is also possible to export the sparse reconstruction for different MVS pipelines.
Theia [24]: an open-source library for incremental and global SfM. Many algorithms commonly used for feature detection, matching, pose estimation and 3D reconstruction are included. Furthermore, it is possible to extend the library with new algorithms using its software interfaces. The implementation is in the form of a C++ library; executables can be compiled and then used to build reconstructions. The sparse reconstruction obtained can be exported to the Bundler or VisualSFM NVM file formats, which can be used by most MVS pipelines.
OpenMVG [25]: an open-source library to solve Multiple View Geometry problems. An implementation of the Structure from Motion pipeline is provided for both the incremental and the global case. Different options are provided for feature detection, matching, pose estimation and 3D reconstruction. It is also possible to use geographic data and GPS coordinates for the pose estimation phase. The library is written in C++ and can be included in a bigger project or compiled into multiple executables, each one for a specific set of algorithms. Sample code to run SfM is also included. The sparse reconstruction can be exported in different file formats for different MVS pipelines.
VisualSFM [16,26,27]: an implementation of the incremental SfM pipeline. Compared to other solutions, this one is less flexible because only one set of algorithms can be used to make reconstructions. The software comes with an intuitive graphical user interface that allows SfM configuration and execution. Reconstructions can be exported in VisualSFM's NVM format or in Bundler format. It is also possible to execute the dense reconstruction steps using CMVS/PMVS directly from the user interface (UI).
Bundler [17,28]: one of the first successful implementations of the incremental SfM pipeline. It also defines the Bundler 'out' format, which is commonly used as an exchange file between SfM and MVS pipelines.
MVE [29] (Multi-View Environment): an incremental SfM implementation. It is designed to allow the reconstruction of multi-scale scenes; it comes with a graphical user interface and also includes an MVS pipeline implementation.

Evaluation Method for SfM 3D Reconstruction
Once a 3D reconstruction has been performed using the SfM and MVS pipelines, it is possible to evaluate the quality of the results obtained by comparing them to a ground truth with the same data representation. An evaluation method applicable to the reconstructions obtained from both real and synthetic datasets is defined here. This method requires the ground truth geometry of the model to be reconstructed and the ground truth camera pose for each image. Our proposed evaluation method is composed of four phases:
1. Alignment and registration
2. Evaluation of sparse point cloud
3. Evaluation of camera pose
4. Evaluation of dense point cloud
Another approach to the evaluation of SfM reconstructions is the one presented in [54]. The authors designed a Web application that can visualize reconstruction statistics, such as minimum, maximum and average intersection angles, point redundancy and density. None of these statistics requires a ground truth.

Alignment and Registration
Since the reconstruction and the ground truth use different reference coordinate systems (RCSs), it is necessary to find the correct alignment between the two. The translation, rotation, and scale factor needed to align the two RCSs can be defined using a similarity transformation matrix $T$ (a rigid motion plus a scale change). The adopted procedure finds this matrix by aligning the reconstructed sparse point cloud to the ground truth geometry in a two-step process: a first phase of coarse alignment and a second phase of fine registration, which overlaps the reconstruction and the ground truth as closely as possible. The alignment and registration steps generate two transformation matrices $T_1$ and $T_2$ of size 4 × 4 in homogeneous coordinates. By multiplying the matrices in the order in which they were identified, it is possible to obtain the global alignment matrix $T = T_2 \cdot T_1$. This matrix is applied to the reconstructed clouds (sparse and dense) and also to the estimated camera poses to obtain the reconstruction aligned and registered with the ground truth. The ground truth can be available as a dense point cloud or as a mesh. Alignment algorithms work only with point clouds, so when the ground truth is a mesh, a cloud of points sampled from it is used to bring the problem back to the alignment of two point clouds.
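The composition of the two 4 × 4 matrices and their application to a point cloud can be sketched as follows with NumPy. The helper names are hypothetical; the only point of the example is that the coarse alignment is applied first and the fine registration second, so the combined matrix is the product of the second matrix with the first.

```python
import numpy as np


def similarity_matrix(scale, R, t):
    """Build a 4x4 homogeneous matrix from scale, 3x3 rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = scale * np.asarray(R, dtype=float)
    T[:3, 3] = t
    return T


def apply_transform(T, points):
    """Apply a 4x4 homogeneous transform to an (N, 3) array of points."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    return (homogeneous @ T.T)[:, :3]


def compose(T1, T2):
    """Global alignment: T1 (coarse) is applied first, then T2 (fine)."""
    return T2 @ T1
```

The same combined matrix can be applied to the estimated camera positions; camera orientations only need the rotation part.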
Alignment: In order to increase the probability of success of the fine registration step (Section 3.1) and to reduce the processing time, it is necessary to find a good alignment of the reconstructed point cloud with the ground truth. This operation can be performed manually by defining the parameters of rotation, translation and scale or, more conveniently, by specifying pairs of corresponding points that are aligned by the algorithm defined by Horn in [55]. This algorithm uses three or more point correspondences between the reconstructed cloud and the ground truth to estimate the transformation necessary to align the specified matching points. The method proposed by Horn estimates the translation vector by computing and aligning the barycenters of the two point sets. The scale factor is defined by looking for the scale transformation that minimizes the positioning error between the specified matching points. Finally, the rotation that gives the best alignment is estimated as a unit quaternion, from which the rotation matrix can be extracted. The algorithm then returns the transformation matrix $T_1$, which is the composition of translation, rotation and scaling.
For the alignment operation, CloudCompare [56] can be used; it implements the Horn algorithm and has a user interface that simplifies the process of selecting matching points.
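For readers who want to see the steps of the closed-form alignment spelled out, the following NumPy sketch follows the quaternion-based formulation: barycenters, a symmetric scale estimate, and the rotation taken as the eigenvector of a 4 × 4 matrix built from the cross-covariance of the matched points. It is a minimal illustration under the assumption of exact one-to-one correspondences; the function names are hypothetical and it is not the CloudCompare implementation.

```python
import numpy as np


def quaternion_to_matrix(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def horn_alignment(src, dst):
    """Estimate the 4x4 similarity transform mapping src points onto dst.

    src, dst: (N, 3) arrays of matching points, N >= 3, not all collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    a, b = src - src_c, dst - dst_c

    # Cross-covariance terms S[i, j] = sum(a_i * b_j).
    S = a.T @ b
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ])
    # The optimal unit quaternion is the eigenvector of the largest eigenvalue.
    _, eigenvectors = np.linalg.eigh(N)
    R = quaternion_to_matrix(eigenvectors[:, -1])

    # Symmetric scale estimate: ratio of the spreads around the barycenters.
    scale = np.sqrt((b ** 2).sum() / (a ** 2).sum())
    t = dst_c - scale * R @ src_c

    T = np.eye(4)
    T[:3, :3] = scale * R
    T[:3, 3] = t
    return T
```

In practice the matching points are picked by hand on the two clouds, so the resulting matrix is only a coarse alignment that the fine registration step then refines.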
Fine registration: Once the reconstruction has been aligned to the ground truth, it is possible to refine the alignment obtained from the previous step (Section 3.1) using a fine registration process.
The algorithm used for this phase is Iterative Closest Point (ICP) [57][58][59]; it takes as input the two point clouds and a criterion for stopping the iterations. The output generated is a rigid transformation matrix $T_2$ that provides a better alignment. The steps of the algorithm are:
1. For each point of the cloud to be aligned, look for the nearest point in the reference cloud.
2. Search for a transformation (rotation and translation) that globally minimizes the distance (measured by RMSE) between the pairs of points identified in the previous step; this can include the removal of statistical outliers and of pairs of points whose distance exceeds a given maximum allowed limit.
3. Align the point clouds using the results from the previous step.
4. If the stopping criterion is met, terminate and return the identified optimal transformation; otherwise repeat all steps.
The stopping criterion is usually a threshold on the decrease of the RMSE measure. For very large point clouds it is also useful to limit the number of iterations allowed to the algorithm. The modified version defined by CloudCompare [56] can be used for this phase: it estimates the transformation that registers the point clouds considering scale adjustment in addition to rotation and translation. ICP does not work well if the point cloud to be registered and the reference cloud are very different, for example when one cloud includes portions that are not present in the other. In this case it is first necessary to clean the clouds so that both represent the same portion of a scene or object.
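The four steps of the algorithm can be sketched as follows with NumPy. This is a minimal illustration under several assumptions: nearest neighbours are found by brute force (adequate only for small clouds), the per-iteration transform is computed with the SVD-based closed form rather than quaternions, only rotation and translation are estimated (no scale adjustment), and the function names are hypothetical. It is not the CloudCompare variant mentioned above.

```python
import numpy as np


def nearest_neighbours(source, reference):
    """Brute-force nearest reference point for every source point."""
    d2 = ((source[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    return idx, np.sqrt(d2[np.arange(len(source)), idx])


def best_rigid_transform(src, dst):
    """Least-squares rotation and translation mapping src onto dst (SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the recovered rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = dst_c - R @ src_c
    return T


def icp(source, reference, max_iterations=50, tolerance=1e-8, max_distance=None):
    """Return the accumulated 4x4 transform and the final RMSE."""
    current = np.asarray(source, dtype=float).copy()
    reference = np.asarray(reference, dtype=float)
    T_total = np.eye(4)
    previous_rmse = rmse = np.inf
    for _ in range(max_iterations):
        # Step 1: closest reference point for each point to be aligned.
        idx, dist = nearest_neighbours(current, reference)
        keep = np.ones(len(current), dtype=bool)
        if max_distance is not None:
            keep = dist <= max_distance  # discard pairs that are too far apart
        # Step 2: transform minimizing the distance between the pairs.
        T = best_rigid_transform(current[keep], reference[idx[keep]])
        # Step 3: align the cloud and accumulate the transform.
        current = current @ T[:3, :3].T + T[:3, 3]
        T_total = T @ T_total
        residual = current[keep] - reference[idx[keep]]
        rmse = float(np.sqrt((residual ** 2).sum(axis=1).mean()))
        # Step 4: stop when the RMSE no longer decreases significantly.
        if previous_rmse - rmse < tolerance:
            break
        previous_rmse = rmse
    return T_total, rmse
```

For large clouds the brute-force search would normally be replaced by a spatial index such as an octree or a k-d tree.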

Evaluation of Sparse Point Cloud
The sparse point cloud generated by SfM can be evaluated in comparison to the ground truth of the reconstructed object. Once the reconstruction is aligned to the ground truth, the evaluation is carried out by calculating the distance between the reconstructed points and the geometry of the ground truth.
If the ground truth is available as a dense point cloud, the distance can be evaluated as a Euclidean distance. For each 3D point of the cloud to be compared, the nearest point in the reference cloud is searched for and the Euclidean distance between the two is calculated. Octree [60] data structures can be used to partition the three-dimensional space and speed up the calculation. Once the distance values are obtained for all points in the cloud, the mean value and standard deviation are calculated.
If the ground truth is available as a mesh, the distance is calculated between a reconstructed point and the nearest point on the triangles of the mesh. This can be done using the algorithm defined by David Eberly in [61]. Given a point of the reconstructed point cloud, for each triangle of the mesh the algorithm searches for the point with the smallest squared distance. Among all the selected points (one for each triangle), the one with the smallest squared distance is chosen and the square root of this value is returned. This calculation is repeated for each point of the reconstructed cloud. In this case too, octree data structures can be used to partition the three-dimensional space and speed up the computation. Once distance values are obtained for all points in the cloud, the mean value and standard deviation are calculated.
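The per-triangle search can be illustrated with the short NumPy sketch below. Note that it does not reproduce the formulation of [61]: it uses a different, widely known region-based test on the vertices, edges and interior of the triangle, which yields the same closest point. The function names are hypothetical and all triangles are visited by brute force, without any octree acceleration.

```python
import numpy as np


def closest_point_on_triangle(p, a, b, c):
    """Closest point to p on triangle (a, b, c), via vertex/edge/face regions."""
    ab, ac, ap = b - a, c - a, p - a
    d1, d2 = ab @ ap, ac @ ap
    if d1 <= 0 and d2 <= 0:
        return a  # closest to vertex a
    bp = p - b
    d3, d4 = ab @ bp, ac @ bp
    if d3 >= 0 and d4 <= d3:
        return b  # closest to vertex b
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return a + (d1 / (d1 - d3)) * ab  # closest to edge ab
    cp = p - c
    d5, d6 = ab @ cp, ac @ cp
    if d6 >= 0 and d5 <= d6:
        return c  # closest to vertex c
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return a + (d2 / (d2 - d6)) * ac  # closest to edge ac
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return b + w * (c - b)  # closest to edge bc
    denom = 1.0 / (va + vb + vc)
    return a + ab * (vb * denom) + ac * (vc * denom)  # inside the face


def point_to_mesh_distance(p, vertices, triangles):
    """Smallest distance from p to any triangle of the mesh."""
    p = np.asarray(p, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    best_sq = np.inf
    for i, j, k in triangles:
        q = closest_point_on_triangle(p, vertices[i], vertices[j], vertices[k])
        best_sq = min(best_sq, float((p - q) @ (p - q)))
    return float(np.sqrt(best_sq))
```

As in the text, the squared distance is compared across triangles and the square root is taken only once at the end.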
In both cases it is necessary that the reconstructed cloud contains only points belonging to objects that are included in the ground truth model used for comparison. Usually the ground truth includes only the main object of the reconstruction, ignoring the other elements visible in the images of the dataset. If the reconstruction includes parts of the scene that do not belong to the ground truth, the distance calculation will be distorted. To overcome this problem, it is possible to crop the reconstructed point cloud, manually eliminating the parts in excess before evaluating the distance. If this is not possible (mainly because the separation between the objects of interest and the irrelevant ones is not easily identifiable), the same result can be achieved by specifying a maximum distance allowed for the evaluation of the reconstruction. If a reconstructed point lies at a greater distance from the ground truth than allowed, it is discarded so that it does not affect the overall assessment.
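The cloud-to-cloud evaluation, including the optional maximum distance, can be sketched as follows with NumPy. The nearest-neighbour search is brute force for brevity (an octree or k-d tree would be used for real clouds) and the function names are hypothetical.

```python
import numpy as np


def cloud_to_cloud_distances(reconstructed, ground_truth):
    """Euclidean distance from each reconstructed point to its nearest
    ground truth point (brute force, suitable for small clouds only)."""
    rec = np.asarray(reconstructed, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    d2 = ((rec[:, None, :] - gt[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.min(axis=1))


def distance_statistics(distances, max_distance=None):
    """Mean, standard deviation and number of points used.

    Points farther than max_distance from the ground truth are discarded
    so that unrelated scene parts do not distort the statistics.
    """
    distances = np.asarray(distances, dtype=float)
    if max_distance is not None:
        distances = distances[distances <= max_distance]
    return float(distances.mean()), float(distances.std()), int(distances.size)
```

The choice of the maximum distance depends on the scale of the scene; too small a value hides genuine reconstruction errors, so the number of discarded points is worth reporting alongside the statistics.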

Evaluation of Camera Pose
In addition to the sparse point cloud, the SfM pipeline also generates information about the camera poses. The pose of each camera can be compared to the corresponding ground truth. In particular, the method defined here provides information on the distance between the positions and the difference in orientation between each pair of ground truth and estimated camera poses. Ideally, if a camera is reconstructed in the same position as its ground truth, it can be assumed that it observes the same points and that consequently its orientation is the same as that of the ground truth; in real cases, however, slight differences between the orientations can be observed, and for this reason an evaluation is provided.
Position evaluation: The position of a reconstructed camera is evaluated by calculating the Euclidean distance between the reconstructed position and the position of the corresponding ground truth camera. These values can also be used to calculate the average distance and its standard deviation.
Orientation evaluation: The differences in orientation of the cameras are evaluated using the angle of the rotation that, applied to the reconstructed camera, brings it to the same orientation as the corresponding ground truth camera. The camera orientation can be defined using a unit quaternion. It is therefore possible to define $q_{GT}$ as the ground truth camera orientation and $q_E$ as the reconstructed camera orientation. The relative transformation that aligns the reconstructed camera to the orientation of the ground truth is defined by the quaternion $q_R$, which is calculated as follows:
$$q_R = q_{GT}\, q_E^{-1} \tag{1}$$
where $q_E^{-1}$ is the inverse quaternion of $q_E$, calculated as
$$q_E^{-1} = \frac{q_E^{*}}{\lVert q_E \rVert^{2}} \tag{2}$$
where $q_E^{*}$ is the conjugate of $q_E$ and $\lVert q_E \rVert$ is its norm.
By substituting the term $q_E^{-1}$ in Equation (1) with its definition, the equation becomes:
$$q_R = q_{GT}\, \frac{q_E^{*}}{\lVert q_E \rVert^{2}} \tag{3}$$
Since rotations are expressed with unit quaternions, the norm of $q_E$ is always 1, and the equation can be simplified to:
$$q_R = q_{GT}\, q_E^{*} \tag{4}$$
Quaternion $q_R$ represents the rotation necessary to change the orientation of the reconstructed camera so that it is the same as the ground truth. This can be expressed by defining a rotation axis and the angle by which the camera must be rotated around that axis.
This rotation angle can be used as a quality measure of the reconstructed camera rotation. If the orientation of the reconstructed camera is the same as that of the ground truth camera, the rotation angle of the defined transformation is 0; the more the orientation of the reconstructed camera differs from that of the ground truth, the larger the rotation angle necessary to align the two.
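The representation of $q_R$ in terms of the rotation axis $a$ (a unit vector of components $a_x$, $a_y$, $a_z$) and the rotation angle $\alpha$ is defined as follows:
$$q_R = \cos\frac{\alpha}{2} + \left(a_x\, i + a_y\, j + a_z\, k\right)\sin\frac{\alpha}{2} \tag{5}$$
Writing $q_R = q_w + q_x\, i + q_y\, j + q_z\, k$, the angle $\alpha$, expressed in radians, and the rotation axis can be extracted from the quaternion using Equations (6) and (7):
$$\alpha = 2\arccos(q_w) \tag{6}$$
$$a = \frac{(q_x,\, q_y,\, q_z)}{\sqrt{1 - q_w^{2}}} \tag{7}$$
The identified angle is always positive.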
When using this representation, particular attention should be paid to the case in which the rotation angle is 0°. When this happens the rotation axis is arbitrary and the result is the same whichever is chosen; the quaternion has the form q = 1 + i0 + j0 + k0 and consequently division by 0 must be avoided when applying Equation (7). To solve the problem, an arbitrary axis with unit norm can be chosen (e.g., the vector x = 1, y = 0, z = 0): in this way there is no need to compute a rotation axis and its length is still unitary.
Angle α from Equation (6) can be converted from radians to α_deg, expressed in degrees. This angle can vary from 0° to 360°; it must also be taken into account that a rotation of α_deg around the axis of direction a is equivalent to a rotation of −α_deg around the axis in the opposite direction. Moreover, a rotation greater than 180° around the axis a can also be expressed as a rotation of −(360 − α_deg) degrees around the same axis. To correctly compute the difference between orientations, the smallest angle must be considered, independently of its direction; therefore, when α_deg > 180 the difference between the orientations of the cameras is computed as 360 − α_deg.
The differences in orientation measured through the angle α_deg can also be used to calculate the average difference and its standard deviation.
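The pose comparison described in this section can be sketched in plain Python as follows. Quaternions are assumed to be stored as (w, x, y, z) tuples of unit norm, and the function names are hypothetical; the sketch covers the position distance, the relative rotation of the reconstructed camera with respect to the ground truth, the handling of the zero-angle case, and the reduction of the angle to the range from 0° to 180°.

```python
import math


def quat_multiply(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)


def quat_conjugate(q):
    """Conjugate; equal to the inverse for unit quaternions."""
    return (q[0], -q[1], -q[2], -q[3])


def rotation_axis(q_r, eps=1e-12):
    """Rotation axis of q_r; an arbitrary unit axis when the angle is 0."""
    w, x, y, z = q_r
    s = math.sqrt(max(0.0, 1.0 - w * w))
    if s < eps:
        return (1.0, 0.0, 0.0)  # avoid the division by zero
    return (x / s, y / s, z / s)


def orientation_difference_deg(q_gt, q_e):
    """Smallest rotation angle, in degrees, between the two orientations."""
    w = quat_multiply(q_gt, quat_conjugate(q_e))[0]
    w = max(-1.0, min(1.0, w))  # guard against rounding outside [-1, 1]
    alpha_deg = math.degrees(2.0 * math.acos(w))
    return 360.0 - alpha_deg if alpha_deg > 180.0 else alpha_deg


def position_difference(p_gt, p_e):
    """Euclidean distance between ground truth and estimated positions."""
    return math.dist(p_gt, p_e)
```

Applying these functions to every pair of ground truth and estimated cameras gives the per-camera errors from which the averages and standard deviations are computed.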

Evaluation of Dense Point Cloud
The MVS pipeline reconstructs the dense point cloud of the scene observed by the set of images. This point cloud can be evaluated in comparison to the ground truth of the object to be reconstructed. The evaluation is expressed in terms of the distance between the reconstructed points and the geometry of the ground truth. Once the dense reconstruction is registered in the best possible way with the ground truth, it is possible to proceed with the evaluation of the reconstructed cloud by calculating the distance between the reconstructed points and the ground truth. This evaluation can be done in the same way as for the sparse point cloud, as illustrated in Section 3.2.

Synthetic Datasets Creation and Pipeline Evaluation: Blender Plug-In
As stated in the previous sections, the evaluation of 3D reconstruction pipelines requires datasets of source images and associated ground truth. Over the years, various datasets of real objects have been created [62][63][64]; they usually contain the ground truth of the object to be reconstructed in the form of a dense point cloud, acquired through high-accuracy laser scanners. In some cases the ground truth is instead made available in the form of a three-dimensional mesh generated from a scanner acquisition or from a high-quality reconstruction obtained directly from the images that compose the dataset. In any case, the accuracy of the ground truth depends on the quality of the instrumentation used and on the process with which it was acquired. The assumption that must be made in order to use a ground truth generated in this way is that it is more precise than the reconstruction generated by the pipelines. Otherwise, with a low-quality ground truth, it would not be possible to evaluate the accuracy of the reconstructed model. Usually these datasets do not report the ground truth of the camera poses, which makes it impossible to evaluate the poses of the reconstructed cameras. The generation of these datasets encounters limitations due to the equipment or to the scene to be captured, making it difficult to generate a set of images that fully complies with the guidelines. Moreover, it is difficult to find available datasets that include the model ground truth, and even when it is present the quality is low and occluded surfaces are missing. In Appendix A, we report some guidelines to create high-quality datasets of images to be used in the reconstruction.
To overcome the problems in creating real datasets for evaluation, virtual 3D models can be used to generate synthetic datasets with good image quality, known intrinsic parameters for each image and an optimal 3D model ground truth. Compared with real datasets, which are usually acquired with physical imaging devices, synthetic datasets provide accurate and, in theory, infinitely precise ground truth. Synthetic datasets are generated by rendering images of virtual 3D models; for our purposes we employ Blender [65]. First, the subject of the dataset must be chosen; for optimal results the 3D model must have a highly detailed geometry and texture. It is also important that the model does not rely on rendering techniques such as bump mapping or normal mapping: these simulate, in the rendered images, complex geometry that is not defined in the model geometry and therefore cannot be included when exporting the ground truth. The model is then placed in a scene, to which lights and other objects can be added. A camera is added and all its intrinsic calibration parameters are set. The camera is then animated to observe the scene from different viewpoints; each frame of the animation is used as an image of the dataset. Once everything is set up, the images are rendered using Cycles, Blender's path-tracing render engine, which simulates light interactions and produces photo-realistic images. EXIF metadata such as focal length and sensor size can be added to the images so that reconstruction pipelines can read them automatically. Finally, along with the images, the ground truth of the model geometry and of the camera poses is needed. The ground-truth geometry is the model itself and can therefore be exported directly. Camera poses (position and orientation) can be obtained for each frame of the camera animation. Blender does not provide a direct way to export this information, but it is easy to do so using its internal Python scripting framework. The entire flow of dataset generation is shown in Figure 3.

Synthetic dataset creation, pipeline execution and evaluation of the results involve many steps and various algorithms. To help the user in this process, we created a plug-in for Blender that supports synthetic dataset generation and the evaluation of SfM reconstructions [66]. The tool adds a simple panel to Blender's user interface that makes it possible to:

• import the main object of the reconstruction and set up a scene with lights for illumination and uniform background walls; the parameters of the path-tracing rendering engine are also set.
• add a camera, set up its intrinsic calibration parameters, and animate it using circular rotations around the object to observe the scene from different viewpoints.
• render the set of images and add EXIF metadata with the intrinsic camera parameters used by SfM pipelines.
• optionally export the geometry ground truth. This is not necessary if the next steps are carried out with the plug-in, as the current scene is used as ground truth.
• run the SfM pipelines listed in Section 2.2.
• import the reconstructed point cloud from the SfM output and allow the user to manually remove parts that do not belong to the main object of the reconstruction.
• align the reconstructed point cloud to the ground truth using the Iterative Closest Point (ICP) algorithm.
• evaluate the reconstructed cloud by computing the distance between the cloud and the ground truth and generating statistics such as minimum, maximum and average distance, as well as the number of reconstructed points.
The dataset generation process can be used to create sets of images of scenes of various scales, possibly including many objects. For this reason the plug-in divides the process into separate steps, so that the user can adapt the result obtained after each step to specific needs. For example, it is possible to change the default camera intrinsic parameters or the scene illumination, to animate the camera along paths different from the default ones, and so on.
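The default camera animation mentioned above (circular rotations around the object) can be sketched outside Blender as follows. This is a minimal illustration and not the plug-in's code: the function name `circular_camera_path` and the radius, height and frame count are hypothetical, and the camera is assumed to orbit the target in a Z-up frame while always looking at it. Each generated pair of position and viewing direction plays the role of the camera pose ground truth for one frame.

```python
import math


def circular_camera_path(n_frames, radius, height, target=(0.0, 0.0, 0.0)):
    """Yield (frame, position, view_direction) for a circular orbit.

    The camera moves on a horizontal circle centred on `target` (Z up)
    and always looks at `target`; view_direction is a unit vector.
    """
    tx, ty, tz = target
    for frame in range(n_frames):
        angle = 2.0 * math.pi * frame / n_frames
        position = (
            tx + radius * math.cos(angle),
            ty + radius * math.sin(angle),
            tz + height,
        )
        # Unit vector from the camera position towards the target.
        delta = tuple(t - p for t, p in zip(target, position))
        norm = math.sqrt(sum(d * d for d in delta))
        view_direction = tuple(d / norm for d in delta)
        yield frame, position, view_direction


# Hypothetical values: 8 views, radius 5 and height 1 in scene units.
poses = list(circular_camera_path(n_frames=8, radius=5.0, height=1.0))
for frame, position, view_direction in poses:
    print(frame, [round(c, 3) for c in position],
          [round(c, 3) for c in view_direction])
```

A different camera path only requires replacing the position formula; the viewing direction is still derived from the position and the target.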

Experimental Results
Five synthetic datasets of different 3D models (Figure 4) have been generated using the method described in Section 4. All the images of each dataset have been acquired at a resolution of 1920 × 1080 px using a virtual camera with a 35 mm focal length and a 32 × 18 mm sensor. For every dataset, the ground truth of the object geometry and of the camera poses is also generated, so that the reconstruction pipelines can be run and the obtained results evaluated. In addition to our synthetic datasets, the real dataset Ignatius (Figure 4f) from the "Tanks and Temples" collection [62] is also used; its 263 images have been acquired at a resolution of 1920 × 1080 px, and the physical height of the statue is 2.51 m. The datasets can be downloaded from [72]. Among all the SfM pipelines listed in Section 2.2, we compare the reconstruction results of COLMAP, Theia, OpenMVG and VisualSFM because each one is a reference implementation. In particular, VisualSFM and COLMAP represent two remarkable developments of the incremental SfM pipeline, with improvements in accuracy and performance compared to previous state-of-the-art implementations. Theia and OpenMVG are instead two ready-to-use SfM and multi-view geometry libraries that implement reconstruction algorithms and make it possible to build SfM pipelines that meet specific needs.
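For reference, the virtual camera parameters above translate into pixel units as in the following sketch. It assumes an ideal pinhole camera with the principal point at the image center, which is not stated in the text; the function name `pinhole_intrinsics` is illustrative.

```python
import math


def pinhole_intrinsics(focal_mm, sensor_w_mm, sensor_h_mm, width_px, height_px):
    """Return (fx, fy, cx, cy) in pixels for an ideal pinhole camera."""
    fx = focal_mm * width_px / sensor_w_mm
    fy = focal_mm * height_px / sensor_h_mm
    # Assumption: principal point at the image centre.
    cx = width_px / 2.0
    cy = height_px / 2.0
    return fx, fy, cx, cy


# Values used for the synthetic datasets: 35 mm lens, 32 x 18 mm sensor,
# 1920 x 1080 px images.
fx, fy, cx, cy = pinhole_intrinsics(35.0, 32.0, 18.0, 1920, 1080)
horizontal_fov_deg = math.degrees(2.0 * math.atan(32.0 / (2.0 * 35.0)))
print(fx, fy, cx, cy)                # 2100.0 2100.0 960.0 540.0
print(round(horizontal_fov_deg, 1))  # horizontal field of view in degrees
```

The two focal lengths coincide because the sensor and the image share the same 16:9 aspect ratio, that is, the pixels are square.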
In order to evaluate dense 3D reconstructions, we paired each chosen SfM pipeline with CMVS/PMVS [73,74] as our reference MVS algorithm, because it is a widely used state-of-the-art implementation and is natively supported by all the SfM pipelines used. Using a single MVS pipeline with the same configuration parameters for all the reconstructions makes it possible to evaluate and compare the dense results based on the quality of the sparse SfM reconstruction alone; in this way, no other variables affect the reconstruction process. The evaluations were done using the method described in Section 3. Results are reported in Tables 2-5, and some examples of reconstructed dense point clouds are shown in Table 6. The SfM pipelines generated sufficient information to allow dense reconstruction on all datasets except Hydrant. That dataset has a low geometric complexity, a high level of symmetry and an almost uniform texture; for these reasons the SfM pipelines were not able to find enough correspondences between images and thus could not generate a good reconstruction. The worst result was obtained with the OpenMVG pipeline, for which it was not possible to run the MVS pipeline. COLMAP is the pipeline that achieves the best results on average; even when it does not generate the best reconstruction, it still achieves good results.
Results for the Ignatius dataset do not include camera pose evaluation because the dataset provides no camera pose ground truth. This real dataset includes many elements besides the main object of the reconstruction; for this reason the reconstructed clouds contain a high number of points that do not belong to the statue and must be removed. System resource usage was also evaluated, and COLMAP is efficient in this respect as well: by using as many resources as are available, it completes the reconstruction in less time than the other pipelines.

The SfM reconstruction generated by the Theia pipeline on the Statue dataset yields the best sparse point cloud for that dataset, although the camera poses are not accurate. The inaccuracies concern the camera positions, not the camera rotations, which are instead always accurately estimated. Further analysis shows that all the cameras are estimated to be further away than in the ground truth but along the correct viewing direction, which is why the camera orientation is correct. Because the position error is constant and applies to all the cameras, an accurate point cloud can still be reconstructed.
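The situation described above, a correct orientation combined with a displaced position, can be made concrete with the two kinds of camera pose measures reported in Table 3. The sketch below is an assumption about how such measures might be computed (Euclidean distance between camera centers, and the angle of the relative rotation between two 3 × 3 rotation matrices); the helper names are hypothetical and the exact formulas used in the evaluation may differ.

```python
import math


def position_error(c_est, c_gt):
    """Euclidean distance between estimated and true camera centres."""
    return math.dist(c_est, c_gt)


def rotation_error_deg(r_est, r_gt):
    """Angle (degrees) of the relative rotation R_gt^T * R_est.

    Rotations are 3x3 matrices given as nested lists. The trace of
    R_gt^T * R_est equals the sum of the element-wise products.
    """
    trace = sum(r_gt[i][j] * r_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the Z axis."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


# Hypothetical ground-truth camera at (4, 0, 0) with an arbitrary orientation;
# the estimate is displaced by 0.5 units but keeps the same orientation.
r_gt = rot_z(30.0)
c_gt = (4.0, 0.0, 0.0)
c_est = (4.5, 0.0, 0.0)
pos_err = position_error(c_est, c_gt)
same_rot_err = rotation_error_deg(rot_z(30.0), r_gt)
other_rot_err = rotation_error_deg(rot_z(40.0), r_gt)
print(pos_err, same_rot_err, other_rot_err)
```

With these inputs the position error is non-zero while the rotation error of the identically oriented estimate is numerically zero; the third value shows the measure responding to a 10 degree difference in orientation.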

Conclusions
In this paper we analyzed state-of-the-art incremental SfM pipelines, showing that different algorithms and approaches can be used for each step of the reconstruction process. We proposed a complete method that, starting from synthetic dataset generation, overcomes the limitations of real datasets and evaluates and compares the reconstructions of different SfM implementations, testing their limits under different conditions. The proposed method also allows results to be compared after the MVS dense point cloud reconstruction. Our experimental results show that it is possible to generate synthetic datasets on which SfM reconstruction runs successfully and obtains satisfactory results. Synthetic datasets also make it possible to push the pipelines to their limits, showing that critical conditions can negatively affect the reconstruction process. To this end we developed a plug-in for the Blender rendering software that generates synthetic datasets for pipeline evaluation. Moreover, it simplifies the execution of the different steps of the evaluation procedure itself. With respect to real datasets, usually acquired with physical imaging devices, synthetic datasets make it possible to have accurate and infinitely precise ground truth. According to our experiments, among the tested incremental SfM implementations, COLMAP showed the best average results. We also created a software tool that runs the whole process, from dataset generation to reconstruction evaluation, in a single solution. Further work can be done to evaluate other aspects of the pipeline, such as the coverage of the reconstructed object, in order to identify missing parts. The evaluation method can also be extended to include the subsequent mesh reconstruction and texture extraction phases.

• The set must be composed of a number of images sufficient to cover the entire surface of the object to be reconstructed. Parts of the object not included in the dataset cannot be reconstructed, resulting in a geometry with missing or inaccurately reconstructed parts.
• The images must portray, at least in pairs, common parts of the object to be reconstructed. If an area of the object appears in only a single image, it is not possible to correctly estimate 3D points for it. Depending on the implementation of the pipeline, the reconstruction can improve as the number of images portraying the same portion of the object from different viewpoints increases, because the 3D points can be estimated and confirmed through multiple images.
• The quality of the reconstruction also depends on the quality of the images. Sets of images with good resolution and level of detail should lead to a good reconstruction. The use of poor-quality or wide-angle optics requires that the reconstruction pipelines take radial distortion into account.
• The intrinsic parameters of the camera must be known for each image. In particular, the pipelines make use of focal length, sensor size and image size to estimate the distance of the observed points and to generate the sparse point cloud. If the sensor size is unknown, the focal length in 35 mm format can be used. The accuracy of the intrinsic calibration parameters is of particular importance when the images composing the dataset have been acquired with different cameras; imprecise parameters introduce errors in camera pose estimation and point triangulation. It should also be taken into consideration that, if the images have been cropped, the original intrinsic calibration parameters are no longer valid and must be recalculated.
• Along with the images, ground truth must also be available. This is not necessary for the reconstruction but is used to evaluate the quality of the obtained results. In order to globally evaluate the SfM+MVS pipeline, it is sufficient to have the ground truth of the model to be reconstructed in the form of a mesh or a dense point cloud, which allows the geometries to be compared. To better evaluate the SfM pipeline, it is also necessary to know the actual camera pose of each image of the dataset. In this way, by comparing the ground truth with the reconstruction, it is possible to measure the accuracy of the estimated camera poses.
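The remark on cropped images in the guidelines above can be illustrated with a small sketch. Under the assumption of an ideal pinhole model expressed in pixel units, cropping without resizing leaves the focal length in pixels unchanged and shifts the principal point by the crop offset, while a subsequent resize scales all four parameters (half-pixel conventions are ignored here). The function names and numeric values are hypothetical.

```python
def crop_intrinsics(fx, fy, cx, cy, left, top):
    """Intrinsics after removing `left` columns and `top` rows (no resize)."""
    return fx, fy, cx - left, cy - top


def resize_intrinsics(fx, fy, cx, cy, scale_x, scale_y):
    """Intrinsics after scaling the image by (scale_x, scale_y).

    Assumption: half-pixel offset conventions are ignored.
    """
    return fx * scale_x, fy * scale_y, cx * scale_x, cy * scale_y


# Hypothetical 1920 x 1080 image with fx = fy = 2100 px and a centred
# principal point.
original = (2100.0, 2100.0, 960.0, 540.0)
# Keep the 1280 x 720 window whose top-left corner is at pixel (400, 200).
cropped = crop_intrinsics(*original, left=400, top=200)
# Then downsample the crop by a factor of two in both directions.
halved = resize_intrinsics(*cropped, scale_x=0.5, scale_y=0.5)
print(cropped)  # (2100.0, 2100.0, 560.0, 340.0)
print(halved)   # (1050.0, 1050.0, 280.0, 170.0)
```

After the crop, the principal point (560, 340) no longer coincides with the center of the 1280 × 720 image (640, 360), which is one reason why a pipeline that assumes a centered principal point would work with invalid parameters.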

Figure 1. 3D reconstruction using Structure from Motion.

Figure 3. Example of synthetic dataset generation steps: (a) 3D model. (b) Scene setup. (c) Camera motion around the object. (d) Images rendering. (e) 3D model geometry and camera pose ground truth export.

Table 2. SfM cloud evaluation results. x is the average distance of the point cloud from the ground truth and s its standard deviation. N_p is the number of reconstructed points.

Table 3. SfM camera pose evaluation results. N_c is the percentage of used cameras. x is the mean distance from ground truth of reconstructed camera positions and s_x its standard deviation. r is the mean rotation difference from ground truth of reconstructed camera orientations and s_r its standard deviation. n.a. means measure not available.

Table 4. MVS cloud evaluation results. x is the average distance of the point cloud from the ground truth and s its standard deviation. N_p is the number of reconstructed points.

Table 5. Pipelines execution times in seconds and memory usage in MB.

Table 6. Example of dense point clouds using CMVS/PMVS on different SfM reconstructions.