Second Iteration of Photogrammetric Processing to Refine Image Orientation with Improved Tie-Points †

Photogrammetric processing is available in various software solutions and can easily deliver 3D pointclouds as accurate as 1 pixel. Certain applications, e.g., very accurate shape reconstruction in industrial metrology or change detection for deformation studies in geosciences, require results of enhanced accuracy. The tie-point extraction step is the opening in the photogrammetric processing chain and therefore plays a key role in the quality of the subsequent image orientation, camera calibration and 3D reconstruction. Improving its precision will have an impact on the obtained 3D. In this research work we describe a method which aims at enhancing the accuracy of image orientation by adding a second iteration photogrammetric processing. The result from the classical processing is used as a priori information to guide the extraction of refined tie-points of better photogrammetric quality. Evaluated on indoor and UAV acquisitions, the proposed methodology shows a significant improvement on the obtained 3D point accuracy.


Introduction
Products of photogrammetric processing are nowadays used across many domains, and are available via commercial and open-source software tools. The quality demands of such products vary with applications, e.g., from millimetric accuracy in industrial metrology [1] to centimetric or decentimetric accuracies in applications concerning indoor mapping [2], cultural heritage documentation [3], mobile mapping [4] or airborne surveys [5,6]. Typically, as the accuracies grow, so does the on-site labour to prepare an acquisition.
3D measurements derived from images are a product of several successive operations: image tie-point extraction, camera calibration and orientation, 3D triangulation. As they are all functionally related, the tie-point measurement precision will have an impact on the precision of the calculated calibrations, orientations and ultimately the triangulated 3D points. The importance of image observations on the camera modelling has been long known and proven empirically [7]. Poor precision and spatial distribution of the tie-points inhibits both: reliable estimation of the camera orientation and proper modelling of the systematic errors produced by the optics. As a consequence, output pointclouds become deformed and inaccurate [8,9]. To fight against this effect, a lot of research has been focused on utilizing only those tie-points that guarantee the so-called photogrammetric quality, i.e., (i) favour image points with low re-projection errors, (ii) keep good point distribution

Related Works
The tie-points are the interest points (or keypoints) that had been extracted and subsequently matched with their correspondences. Interest points alone are salient features (e.g., points or regions) that are distinctive in their immediate neighbourhood. The performance of an interest point detector is evaluated in terms of repeatability and information content [13]. A good interest point detector is invariant to a class of transformations: translation, rotation, scale and perspective distortions. It is also insensitive to illumination change.
Over the course of the last twenty years many different interest operators have been invented that would implement these properties. The distinctiveness of a pixel can be measured by its local dissimilarity, e.g., with the auto-correlation function to determine the signal changes [14,15] (e.g., Moravec, Harris), with covariance matrices [16] (Foerstner), using comparison functions on circular masks [17,18] (e.g., SUSAN, FAST), or the spatial gradient distribution [19] (e.g., Scale Invariant Feature Transform or SIFT). To include scale invariance, the notion of 3D scale-space representation was introduced [20]. The images would be represented by a pyramid of, e.g., Laplacian-of-Gaussians (LoG) or difference-of-Gaussians (DoG), where the pyramid's maxima were the characteristic points [19,21,22] (e.g., SIFT; Speeded-up robust features or SURF). Scale invariant point detectors, however, assume a uniform scale-change in both directions therefore do not model perspective (or affine) deformations induced by the imaging process. The end effect is that the aforementioned operators identify corresponding points between image couples with very small base-to-height ratios and fail at wide-baseline scenarios. To address this shortcoming, the known operators were affine-adapted and new affine-invariant operators were devised. Affine-invariant point detectors often delineate the so-called distinguished regions (DR) of stable properties and high repeatability. The DRs can be, for instance, parallelograms built on the Harris points and their nearby edges [23], elliptical blobs calculated from Harris-Laplace operator and the local structure described by the second moment matrix [21] (Harris-Affine, Hessian-Affine) or blobs of arbitrary shapes [24,25] (Maximally Stable Extremal Regions or MSER). Description of such regions is accomplished by identifying a local affine coordinate system or by geometric normalising, and establishing a description vector from the intensities within the region.
Out of the most popular affine-invariant region detectors, the MSER is said to be the most repeatable in the presence of changing viewpoint angles, scales, blur and illumination [26]. As proved experimentally [10], a notable point detector robust to changing viewpoints is the SIFT, and its affine adaption affine-SIFT (ASIFT) [27]. In the latter, the authors combine the concepts of simulation and normalization to simulate all distortions a patch can undergo when tilting the image. Next, they compare the totality of the generated images with SIFT. Computational time turns to be a restraining issue of the algorithm.
Regarding the measurement precision of the most popular tie-point extraction algorithms, the bundle block adjustement (BBA) reprojection errors place in the range of 0.5-1 pixel [10,11].

Contributions
The aforementioned algorithms extract image tie-points without making any a priori on the camera or the geometry of the 3D scene. In this research work, our hypothesis is that the camera orientations and approximate 3D geometry is known, and then we exploit it to guide and calculate a new set of tie-points of improved quality. Accordingly, we call it the Second Iteration to Refine image Orientation with improved Tie-Point (SIROP). More precisely, contributions of this research work are as follows • the tie-point extraction is free from seed points contrary to [3,28], and produces regularly spaced points, also in areas of low-contrast regions where classical tie-point extractors typically fail to identify salient points; • contrary to [30], our algorithm is not constrained by the presence of dominant 3D planes in the scene, therefore is not constrained to urban reconstructions; • our initial structure is represented in form of a triangulated mesh, and the re-projection to images of each mesh entity is rectified to a pre-selected geometry; as the perspective deformations are removed, the algorithm is apt to handle wide baseline acquisitions; The method can be equally used as an alternative to Iterative Closest Point (ICP) algorithm [31] for automated alignment of two individual 3D pointclouds captured from very different viewpoints. See Figure 1 for a summary of the proposed processing. All concieved algorithms were implemented in MicMac-the free, open-source software for photgorammetry [32,33]. This paper is an extended version of our preliminary work published in [34].

An Overview
SIROP enhances the obtained result precision by performing the second iteration of photogrammetric processing using the new tie-points (see Figure 1). These new tie-points are extracted based on 3D scene information i.e triangulated mesh. The image orientations and calibrations are used to predict the keypoint matching position and correct the perspective deformation between images before extracting the new tie-points. To assure that the algorithms work properly, it is expected that the approximate (i.e., initial) camera orientation predicts the image points' positions within 10 pixel precision. SIROP method can be described in 3 separate steps as presented in Figure 2: • Image selection and scene partition.

•
Matching by correlation and LSM. The sequence of steps is applied to each triangle. By iterating over each triangle in the mesh, SIROP ensures a homogenous distribution of tie-points in the object space.

Image Selection and Scene Partition
The purpose of this phase is to designate for each triangle a master image and a set of secondary images. The retained images must fulfil two conditions of: • visibility (here implemented by a Z-buffer filter), and • minimum resolution (triangles with surface area below 100 pixels will be filtered-out).
Out of all images, the master image will be the one with the most optimal resolution. Under perspective deformations, the resolution of a projected surface-element changes according to the observation direction (see minimal resolution direction in Figure 3) within the plane of the 3D triangle. We use here a cautious strategy by defining the minimal resolution obtained among all possible directions, and the master image is the image with the highest minimal resolution value (i.e the optimal resolution). With the optimal resolution one wants to favour a selection of images that are: • fronto-parallel with respect to the 3D triangle in question, which results in least deformed triangle back-projections, and • nearer to the triangle in question, which is equivalent of higher resolution. To relate an ellipse defined within the triangle's plane, and its back-projection in an image, an affine transformation is calculated. The 6 parameters of an affine transformation are estimated from 3 corresponding triangle's vertices in 3D space and in the image. Rigorously speaking, this plane to plane mapping is described by a homography, but since the mesh elements are small an affine approximation is justified. The errors in image position prediction induced by the adopted approximations are shown in Figure 4. For a triangle of 9000 pixels, the average prediction discrepancy does not exceed 0.08 pixel. The histogram of maximum difference values proves that 99% of discrepancies falls below 0.13 pixel. Mathematically speaking, let P n be the plane of image n, P Tri the 3D plane containing the triangle and C the circle inscribed in the triangle (see Figure 5). The projection function is approximated by the affine transformation between the 3 vertices of the triangle in its plane in object-space and their back-projections in image n. Only A Tri2n , the linear part of this transformation, is used for image selection. The 6 parameters are computed from the (2 dimensions) × (3 points) observations with a direct inversion. E n is the ellipse transformed from circle C to plane image P n by the affine transformation A Tri2n denoted as follows: For each point (cos(θ), sin(θ)) on circle C, we have a corresponding point on the ellipse E n mapped by an affine transformation A Tri2n : Value min( V n (θ)) Equation (3) represents the minimal possible resolution obtained across all directions in image n of circle C: The master image is selected as the maximum of all possible minimal resolutions across all images: A set of secondary images is chosen by applying a threshold on the minimal possible resolution min( V n (θ) 2 ) between each candidate image and the master image. Finally, after the selection of images for all 3D triangles in mesh, the scene has been partitioned between different master images.

Region of Interest Rectification
For each triangle back-projected to an image, a region of interest (ROI) is defined in its surroundings. All ROIs found in the secondary images are rectified to the geometry of the master image (see Figure 6) using the 6-parameter affine function form master image to the secondary image for this triangle. By doing so, all perspective distorsions due to 3D scene geometry and perspective imaging are removed. If (x s , y s ) is the pixel coordinate in secondary image ROI, the affine mapping is expressed as follows:

Detection of Interest Points
This step generates a set of interest points for all back-projected and rectified triangles. In the subsequent steps the found interest points will guide the sub-pixel matching (see Section 2.4).
Firstly, local intensity maxima/minima points are adopted as candidate points. Then, these candidates are filtered by several criteria such as contrast quality, risk of repetitive patterns and spatial distribution. The general workflow is presented in Figure 7, each filter will be detailed in the following section.

Contrast Quality Filter
The purpose of this filter is to eliminate low-contrast candidate points. The idea of this filter is based on the FAST [35] and SUSAN [17] detectors, by examinating the neighboring points to decide if a point is contrast enough. There are three main reasons for this approach selection: • Fast processing speed: in comparing to Difference of Gaussian approach detector such as SIFT.

•
Perform on non-perspective ROI: as presented in Section 2.2.1, perspective in ROI is eliminated. Therefore, a scale rotation invariant approach is unnecessary.

•
Interest point localization: interest points are detected at pixel exact and FAST filter-based approach work on pixel exact also. Therefore, we avoid the sub-pixel point localization problem in SIFT approach [36]. Region-based matching method by correlation and Least Square Matching in matching step will localize tie-point at sub-pixel precision.
The intensity value of a candidate point is denoted as p, and we examine its neighboring points (1...n) with intensity values (p 1 ...p n ), respectively. There are a total of 24 neighboring points located on a circle with a radius R = 4. A difference in the intensity value d i is computed for each neighboring point at i and the central point as |p − p i | (see Figure 8). To evaluate the contrast quality of a candidate point, a Contrast Quality Score (CQS) is computed. The CQS is composed of two terms: • contiguous priority contrast score CQS 1 , and • global contrast score CQS 2 .
A candidate point is considered to be of good contrast if at least 75% of differences d i pass an appropriate threshold. The 75th percentile value in the ascending sort difference intensity vector V d is then taken as CQS 2 score. Then, a sliding window scans through the vector V d . Each window score is taken as the maximum value of all elements within the window. CQS 1 score is the minimum value among the window scores (see Figure 8). Candidate points are considered to be contrasted if CQS 1 passes a certain threshold. Figure 9 illustrates the performance of the contrast filter on an example. Finally, a combination of two terms CQS 1 and CQS 2 with a weighting factor give a final Contrast Quality Score for each point.
This score will be used in the following interest point reduction step to determine the point's priority.
The figure below presents an example of Contrast Quality Score for interest point and contrast filter performance. On the left image, interest points are colorized as blue/red circle corresponding to local intensity minima/maxima points. A green bar represents contiguous contrast score term CQS 1 . The middle image represents Global contrast score term CQS 2 with a yellow bar assigned to each point in triangle. Finally, the right-most image represents points rejected by the filter, which are colorized in yellow.

Non-Repetitive Pattern Filter
To make the subsequent matching procedure more robust, candidate points with adjacent repetitive patterns are rejected. To do that, for each candidate point, a circle of neighboring patches around at a certain radius is chosen (see Figure 10). Then, a correlation score is computed between the center patch and each of the neighboring patches. A point is eliminated if the high-score normalized correlation Equation (6) is found along any direction. In this implementation, we chose 0.85 as the high-score value.  On the bottom image, blue/red circles represent minima/maxima local points, respectively. The yellow points are rejected by the filter because they lie on a linear pattern that may cause an ambiguity in correlation.

Interest Point Reduction
After the interest points are selected, a filter is applied to reduce their quantity. The interest points are sorted by CQS to form a processing list, with the priority corresponding to the score value. For each master image, for all interest points, the filter then defines a circular region centered at a point (see Figure 11). Only one point with highest CQS within the circle is retained. The circular region size is chosen relatively to image resolution.

Matching
The interest points in master images must now be assigned their correspondences in the secondary images, i.e., the tie-points are created. Because our image patches are rectified to the geometry of the master image, an area-based matching method can be used to find the correspondences. SIROP adopts the normalized cross-correlation technique, followed by least squares matching.

Matching by Correlation
Zero Mean Normalized Cross-Correlation (ZNCC) (see Equation ( 6)) is adopted to provide the initial tie-point's position. For each interest point in the master image, ZNCC is computed for all potential interest points in the secondary images. Finally, the point with the highest matching score is selected as the tie-point. In analogy to interest points in the master images, the potential interest points in secondary images are found as maxima/minima within specified circular regions (see the search region in Figure 12).
where f and g as two image patches of size N * M,f andḡ are the corresponding patch average intensity values. To speed-up the processing, the correlator operates in a multi-scale approach [37]. Three main ZNCC calculation phases can be distinguished: • at 2-times down-sampled resolution (every other pixel is discarded), • at the nominal resolution, and • at sub-pixel resolution (up to 0.1 pixel).

The Tie-Point Spatial Filter
The spatial filter decimates the tie-points previously matched so that their homogeneous distribution in image space is maintained. The following selection criteria are imposed: • prioritisation of points with high ZNCC scores, • prioritisation of points with with high manifold, • even distribution of points in both master and secondary images.
Please note that our tie-points are manifold points (i.e., observed on more than two images). To determine which is the better correlated multiple point, for each tie-point in the master image, a Global Correlation Score C GCS i is computed (see Figure 13 and Equation (7)). This score includes all multiple correlation scores between points in the master image and all of its matched points in secondary images.
where C i_S t as a correlation score Equation (6)   Equation (7) can be interpreted as a quality indicator composed of the correlation score and the manifold. Points with higher manifold and good correlation will have higher score, hence higher position in filter processing queue. For each current point in queue, neighbor points are picked as all points lying within a radius R around it (see blue circle in Figure 14). All points retained from the circle will be tested with Equation (8) which is a function of a correlation score between two potentially corresponding points (i.e., C i_S t calculated with Equation (6)) and the normalized distance d i_ip R between the neighbor i p and the current point i. In other words, points that are close to point i and with poor correlation scores will more likely be filtered out. Figure 15 visualizes intermediary results of the described approach. Pseudocode of tie-points spatial filter is described in Algorithm 1. will be poped out of heap point_i ← TiePointHeap.pop(); Result.push(point_i); // Get all neighbor points lying within a radius R neighbors_i ← GetNeighbors(point_i, R); for point_i p ∈ neighbors_i do for Image_S t ∈ secImages do C i_S t ← GetCorrelationScore(point_i, Master_Image, Image_S t ); TiePointHeap.delete(point_i p );

Matching Refinement by Least Square Matching
In the final step, LSM [28] is used in combination with SIROP to compensate for unmodelled image orientation errors (i.e., y-parallax), see Figure 16. Pixel intensity values in image patches are used as observations. Both the affine transformation (6 parameters as previously discussed) and the radiometric model (2 parameters including the contrast and brightness) are the unknowns in the least square adjustment model: where g 1 , g 2 as two same size image patches of a tie-point given by correlation. A linear model with the contrast (named A) and the brightness (named B) is adopted to model the intensity variation between the two patches.
To transfer the non-linear model in Equation (9) to a linear one, Taylor linearisation is adopted. All parameters are estimated across several iterations. The stop condition is a pre-defined maximum number of iterations or convergence error estimation between two consecutive iterations.
To evaluate the gain of employing the LSM, an experiment involving 5 images was performed. Figure 17 shows a bundle block adjustment (BBA) reprojection error improvement of 0.02 pixel, and a minor change in the estimated camera parameters. Given a typical acquisition with thousands of tie-points, the additional LSM processing times should not be neglected. Accordingly, in the current MicMac implementation, the LSM is an optional parameter, and the parameter was inactive in all the experiments presented in the next section. LSM refinement is said to improve the SIROP's tie-points precision but it has a drawback of longer computation times.

The Evaluation Procedure
The evaluation is carried out on four datasets with different 3D scene types, cameras and acquisition geometries. To obtain 3D accuracy measures, several Ground Control Points (GCPs) were measured in each scene. Each dataset is processed with two methods: the classical processing with SIFT tie-points (implementation of [38]) (Classical) and the SIROP method, which only uses its own-detected points. In any case, grayscale images are used to extract tie-points.
The difference between the two processing methods is in the employed tie-points. In both scenarios the camera parameters are estimated in a self-calibration BBA (i.e., the Fraser camera model [39]). The BBA computation is first carried out in a relative coordinate system, and further transferred to the coordinate system of the GCPs with the 7-parameter transformation estimated from all GCPs. Three case studies are summarized in Table 1. Each case study was processed according to the workflow presented in Figure 1. Due to repetitivity of the obtained results, only one detailed outcome analysis is presented-the terrestrial Hallway dataset in Section 3.2. As for the UAV datasets (Viabon1, Corridor), a compiled presentation of the obtained results is given in Section 3.3.
The evaluation is carried out by looking at the following accuracy measures: • euclidean distance between GCPs resulting from image triangulation and their true values (d). • the BBA standard deviation. • color-coded error map of tie-points residuals.
Additionally, histograms of tie-point residuals and their manifold are reported.

The Hallway Dataset
A snapshot of the dataset and the distribution of GCPs are shown in Figures 18 and 19, respectively. In Figure 20 one clearly sees the advantage of the proposed methodology. The extracted tie-points are, as expected, well distributed in both object and image spaces. The optimal resolution criteria for master image selection, according to which a triangle is assigned its master favouring large image scale, is also kept as revealed in Figure 21.
SIROP reduces the tie-point residuals (see Figures 22 and 23) on average by almost a factor of two. Tie-points manifold is slightly less favourable which is due to the specificity of the long-corridor acquisitions. Throughout the acquisition the camera points at infinity whereas SIROP privileges points close to the camera. As a result, the far points are ignored (see Figure 24 for evidence). Finally, Figure 25 reports on distances between the true value and the photogrammetrically-predicted positions of all GCPs in the Classical and the SIROP processing scenarios. An average the distances decreased by a ratio of 1.2, 2.1 and 3.7 respectively in X, Y and Z, which is equivalent of 1.4σ x , 3.6σ y , 15.8σ z .

UAV Datasets
The aim of this section is to asses the performance of SIROP in an aerial acquisition. Two UAV-acquisitions were processed: a block acquisitions (Viabon1-see Figure 26), and a corridor acquisition (Corridor).
As detailed in Table 2, SIROP outruns the Classical method in all the tested scenarios. The accuracy gain is observed particularly along the Z component, i.e., along the coordinate with the highest estimation uncertainty due to the imaging configuration. Global increase in tie-points manifold is also observed.  The Corridor dataset uncovers another beneficial aspect of SIROP, i.e., the capability of suppressing the "dome" effect. It has been stated by many researchers [8,41,42] that under certain imaging configurations the self-calibrating BBA is suboptimal. The arising correlations between the camera internal parameters and the exterior orientations lead to incorrect modelling of both. As a result, one observes the so-called "dome" effect in the reconstructed pointcloud. One way to overcome is to employ a denser network of GCPs or include oblique views. However, it is not always possible. In such circumstances, SIROP proves that higher accuracy and well distributed tie-points can equally mitigate the effect (see Figures 27 and 28).  The study of the evolution of the camera's internal parameters confirms that with the SIROP's tie-points one is able to model the camera's systematic errors more efficiently. Please note the difference in residuals at the margins of the image for the Classical and SIROP processing, as is presented in Figure 29. Stability of the estimated parameters is also tested by running a second and third SIROP iteration.

Conclusions and Discussion
In this article, a novel photogrammetric processing chain called SIROP and aimed at enhanced precision 3D reconstruction is proposed. The algorithms make use of the approximate 3D scene geometry and the known camera orientations to generate a new set of optimal tie-points. With the new tie-points a more precise camera parameters and orientation are estimated, and therefore more accurate 3D measurements can be done. The methodology is implemented in the free open-source software for photogrammetry-MicMac.
The performance of the introduced algorithms was assessed in three different scenarios: an indoor acquisition (Hallway), and two UAV acquisitions (Viabon1, Corridor). The outcome accuracy were investigated on several GCPs, giving an improvement by a ratio of 2.09, 2.83, 1.46 for the respective datasets. The tie-points manifold was increased in all but the Hallway dataset where far points were eliminated contributing to a slightly lower points' redundancy. To alleviate the issue and the potential camera's rotational drift, the low-resolution tie-points from the Classical method could be integrated with the SIROP tie-points (see Figure 30). Inspection of the UAV datsets revealed that SIROP tie-points contribute to a better estimation of the camera internal parameters and thus are capable of eliminating the systematic "dome" effect. Stability of the internal camera parameters calculated with SIROP has been also verified (see Figure 29).
The processing time of SIROP is also evaluated by the Corridor dataset (see Figure 27 and Table 2) containing 77 panchromatic 20-megapixel images. The classical framework at full-resolution, from the tie-points extraction to 3D mesh generator requires 234 minutes. SIROP is then applied using the 3D mesh and image orientation derived from the classical round. SIROP time consuming essentially depends on the number of images in dataset and the number of triangles in the 3D mesh. In this test, the 3D mesh contains 18021 3D triangles and SIROP processing takes 18 minutes.
Overall, the method proved its efficiency and could be adopted as a post-processing step for fine metrology applications. Furthermore, SIROP could be used to accelerate the classical processing by using a very low resolution scene geometry calculated with a SIFT-like method at its input. Funding: This research was funded by IGN-Institut national de l'information géographique et forestière.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: