Mosaicking of Unmanned Aerial Vehicle Imagery in the Absence of Camera Poses

: The mosaicking of Unmanned Aerial Vehicle (UAV) imagery usually requires information from additional sensors, such as Global Position System (GPS) and Inertial Measurement Unit (IMU), to facilitate direct orientation, or 3D reconstruction approaches (e


Introduction
A huge market is currently emerging from the vast number of potential applications and services offered by small, low-cost, and low-flying unmanned aerial vehicles (UAVs).UAVs can carry payloads such as cameras, infrared cameras, and other sensors.Thus, they enable us to obtain a synoptic view of an area, which is helpful in applications such as surveillance and reconnaissance, environmental monitoring, disaster assessment and management (see, e.g., [1][2][3]).The aim of this paper is to present a novel method for mosaicking of UAV imagery to obtain visually satisfactory mosaics (not orthomosaics) that are not intended for precise measurement purposes.
A single image from a UAV mounted camera only covers limited area.In many digital earth applications, it is therefore necessary to stitch hundreds or even thousands of images together to create a larger image that can provide good overall situational awareness.Mosaicking of UAV imagery usually requires extra information, such as camera calibration parameters, position and orientation data from GPS/IMU, ground control points (GCPs) or a reference map, to achieve accurate mosaicking results [1][2][3][4][5][6][7].When GPS/IMU data are not accurate enough for direct orientation determination, pose estimation using a 3D reconstruction method is usually employed to refine the camera Existing mosaicking methods can be roughly divided into two classes: single-viewpoint mosaicking [13] and multi-viewpoint mosaicking [3,7,12,14].Single-viewpoint mosaicking can be achieved by rotating a camera around its optical center.An excellent review of the single-viewpoint mosaicking approach is given in [15], but further consideration is out of the scope of this paper.In multi-viewpoint mosaicking, the motion of the camera is not constrained purely to rotation only.For example, in UAV applications, the mounted camera undergoes not only rotations but also translations, making the mosaicking of the images more complicated.When the camera undergoes general 3D motion (including both rotation and translation), the eight degrees-of-freedom (DOF) homography is a general alignment model for the mosaicking of planar scenes.However, when the camera undergoes pure rotation, instead of the general 8-DOF homography relating a pair of images, we can get 3-, 4-, or 5-DOF 3D rotation motion models corresponding to cases in which the focal length is known, fixed, or variable.Estimating the 3D rotation matrix (and, optionally, focal length) associated with each image is intrinsically more stable than estimating a full 8-DOF homography [13,15].Thus, the case of the camera undergoing general motion is more challenging.
A single image captured from a typical UAV covers only a limited area on the ground.To enlarge the field of view, feature-based pairwise image registration can be applied to generate precise short-term mosaicking with almost no visual artifacts.Compared to linear-pushbroom sensor mosaicking [4], this method has the advantage that non-linear flight-paths and varying view directions can be handled effectively.However, stitching hundreds, or even thousands, of images would result in a significant drift of parameters due to accumulated errors.Therefore, to obtain large mosaics, geo-referenced satellite images of the whole region may be used as a reference image [2,5] and the smaller mosaics are overlaid on top of the reference map using manually determined control-points.A framework where online mosaicking is used to compute and improve the localization of a UAV is presented in [16].In instances where a region of the mosaic is revisited (namely loop-closure), an Extended Kalman Filter (EKF) can be adopted to minimize the accumulated drift.
Another approach to avoid accumulated errors is the poses-based method [6,7], which uses extra sensors such as GPS and IMU to directly obtain camera poses.Furthermore, if the intrinsic camera parameters are known, the images can be stitched together.However, this method necessitates sensors with a high level of known internal geometry, or a number of well distributed GCPs to determine the camera poses [3].
When sensors with sufficiently high known internal precision or GCPs are not available, image registration information and camera poses are combined to obtain seamless stitching while maintaining global consistence [1,6,7].To combine image-based and pose-based methods together, an objective function to maximize correlation in overlapping areas constrained by camera poses calculated from GPS/IMU data is defined in [7].The method described in [1] employed an EKF to integrate GPS and IMU data to estimate the initial camera poses.The initial poses are used to triangulate the 3D position of terrain features, and also to provide an initial estimate for the bundle adjustment procedure [17].An overview image of the terrain is constructed by projecting each image onto a ground plane.
Compared with most of the methods above, the proposed method is purely image-based.It does not require satellite imagery as a reference, GPS/IMU data, or GCPs.Our work is closely related to the work by Capel [12], which also only employs image information, applying a bundle adjustment approach [17] to simultaneously optimize the image transformations and the 2D coordinates of the features in the reference image.It is a similar approach to the bundle adjustment method employed in the structure-from-motion (SFM) pipeline.The latest image-based approach in [18] aims to handle large parallax, while our work emphasizes the case of a large number of images of approximately planar scenes.

Objective Function
We utilize a feature-based method for image registration.An 8-DOF homography matrix, H i , represents the transformation parameters of the i th image.
When setting g i and h i to zero, then H i becomes an affine homography.When further assuming that the submatrix ˜ai b i c i d i ¸is an orthogonal matrix, then H i becomes a rigid homography.
For a point x in an image, the transformed coordinates of the point given the transformation represented by the homography H is: We assume that N feature pairs are found among the M images.The transformation parameter of the i th image is represented by X i (a column vector that consists of the eight independent parameters of the homography H i ).Let X " `XT 1 , X T 2 , ¨¨¨, X T M ˘T.We construct an objective function E pXq as follows: where Ecor pXq is the energy term based on feature correspondences, which is to minimize the sum of the squared distances of the feature pairs and to geometrically align the images.It is defined as: where e i " T m `pi,m ˘´T n `pi,n ˘, 1 ď m, n ď M. T m and T n denote the transformations of X m and X n , respectively.r e i " T n_re f ´pi,n_re f ¯´p i,n_re f .`pi,m , p i,n ˘represents the i th feature pairs.n_re f denotes the index of the reference image.r N is the number of feature correspondences in the reference image.The effect of the second term in Equation ( 4) is to constrain the reference image to preserve the original shape.
The regularization term E reg pXq, inspired by Sumner et al. [11], enables X i to be as rigid as possible in order to avoid global distortion in the mosaicking result.E reg pXq is defined as: where p i is a weight factor, the value for which is determined by the number of the feature correspondences in the i th image.Note that as the number of feature correspondences in an image increases, the number of the terms with respect to the image in E cor pXq increases accordingly.Therefore, the weight p i should be larger in E reg pXq.
ω is the constant weight factor.Currently, its value is determined heuristically.When ω is small, the mosaicking result appears seamless, but may contain global distortion.With an increase in the value of ω, global distortion is eliminated, but the alignment error of the feature correspondences may simultaneously increase.It is therefore necessary to choose an appropriate value for ω.
When the pitch and roll angles of the UAV vary during flight, the regularization term tries to maintain the original shape of the objects in the images.Meanwhile, the correspondence term tries to change the shape of the images to align them geometrically.The combined effect of the two terms results in an optimized mosaicking result.
There are 8M unknown parameters in Equation (3).Equation ( 3) is a typical non-linear least squares problem that can be solved by the Levenberg-Marquardt (LM) algorithm [19].
To solve Equation (3) using the LM algorithm, the Jacobian matrix has the size of 2N + 4M rows and 8M columns.When M and N are large, for example as in the experiments shown below where M = 59, N = 220,977, the Jacobian of 444,318 rows and 4728 columns becomes a huge matrix, which makes the execution of the standard LM algorithm impossible on a regular PC.To conserve memory and to speed up the computation, we therefore employ the sparse LM algorithm [20] to solve Equation (3).

Mosaicking Work-Flow
The objective function proposed in Section 3.1 is embedded into a full mosaicking workflow as illustrated in Figure 1.It consists of four steps, as follows: To solve Equation (3) using the LM algorithm, the Jacobian matrix has the size of 2N + 4M rows and 8M columns.When M and N are large, for example as in the experiments shown below where M = 59, N = 220,977, the Jacobian of 444,318 rows and 4728 columns becomes a huge matrix, which makes the execution of the standard LM algorithm impossible on a regular PC.To conserve memory and to speed up the computation, we therefore employ the sparse LM algorithm [20] to solve Equation (3).

Mosaicking Work-Flow
The objective function proposed in Section 3.1 is embedded into a full mosaicking workflow as illustrated in Figure 1.It consists of four steps, as follows:

Feature Extraction and Matching
Considering the varied height and pose of a UAV mounted camera during flight, we extract Scale-Invariant Feature Transform (SIFT) features [21], which are, to a certain extent, invariant to scale changes and affine distortions and use the Fast Library for Approximate Nearest Neighbors (FLANN) [22] to match the features.Subsequently, the RANdom SAmple Consensus (RANSAC) algorithm [23] is utilized to remove false matches.

Initialization of the Transformation Parameters
The image plane of the reference image is treated as the reference plane of the initial mosaicking result, with its homography set to an identity matrix.The reference image can be selected at random, but in practice we select an image captured by the camera for which the principal axis is approximately perpendicular to the ground as the reference.To obtain the initial transformations of other images, we simplify the transformation represented by an 8-DOF homography to the 6-DOF affine transformation and neglect the regularization term 3) then turns into a linear least-squares problem that is relatively simple to solve.

Global Optimization
After obtaining the feature correspondences in Step (1) and the initial transformations in Step (2), the sparse LM algorithm is deployed to optimize Equation (3).

Feature Extraction and Matching
Considering the varied height and pose of a UAV mounted camera during flight, we extract Scale-Invariant Feature Transform (SIFT) features [21], which are, to a certain extent, invariant to scale changes and affine distortions and use the Fast Library for Approximate Nearest Neighbors (FLANN) [22] to match the features.Subsequently, the RANdom SAmple Consensus (RANSAC) algorithm [23] is utilized to remove false matches.

Initialization of the Transformation Parameters
The image plane of the reference image is treated as the reference plane of the initial mosaicking result, with its homography set to an identity matrix.The reference image can be selected at random, but in practice we select an image captured by the camera for which the principal axis is approximately perpendicular to the ground as the reference.To obtain the initial transformations of other images, we simplify the transformation represented by an 8-DOF homography to the 6-DOF affine transformation and neglect the regularization term E reg pXq in Equation (3).Equation ( 3) then turns into a linear least-squares problem that is relatively simple to solve.

Global Optimization
After obtaining the feature correspondences in Step (1) and the initial transformations in Step (2), the sparse LM algorithm is deployed to optimize Equation (3).

Blending
Finally, based on the estimated geometrical transformation parameters for all images, we use the multi-band blending method [13] to process the mosaicking artifacts generated by alignment errors in geometry and differences in intensity among the images.Images with excessive overlap are not used to create the final results since their inclusion adds nothing to the final product and their exclusion saves computer memory and CPU time in the multi-band image blending step, as illustrated in Figure 2.

Blending
Finally, based on the estimated geometrical transformation parameters for all images, we use the multi-band blending method [13] to process the mosaicking artifacts generated by alignment errors in geometry and differences in intensity among the images.Images with excessive overlap are not used to create the final results since their inclusion adds nothing to the final product and their exclusion saves computer memory and CPU time in the multi-band image blending step, as illustrated in Figure 2. Note that no camera parameter or pose data were used in our experiments.ω in Equation ( 3) is set to 4000 for all tests.

Datasets
Four datasets (datasets can be obtained by sending an e-mail to the author) were used to test the proposed image mosaicking objective function and workflow.The images in Datasets 1, 2 and 3 were captured by a UAV (KC1600 [24]) over Yongzhou city, Hechi and HeJiangdong villages of Hunan province, China, respectively.Dataset 1, Dataset 2 and Dataset 3 contain 61, 182, and 51 images, respectively, each with resolution of 3680 by 2456 pixels.The average flying heights were 558 m, 405 m and 988 m, respectively.Although the UAV was guided by GPS, GPS data were not used in our mosaicking experiments.Digital orthophoto maps (DOMs) of the three datasets, each with ground resolution of 20 cm generated by DPGrid [25], were used as ground-truth to quantitatively evaluate the proposed algorithm.
The images in Dataset 4 were from Pix4D [26], and are openly available on the Internet.In total, 591 images were used for mosaicking purposes.The images (4000 × 3000 pixels) in the dataset were captured by a Panasonic DMC-GF1 camera with a 20 mm focal length lens.The average flying height was 742 m.In addition, the dataset provided by Pix4D contains a DOM with a ground resolution of 20 centimeters, which was used to quantitatively evaluate the proposed algorithm.
The sites of the four datasets are shown in Figure 3.The variations in terrain elevation for Dataset 1 and Dataset 3 are obvious, while for the other two datasets the variations are relatively small, as shown in  Note that no camera parameter or pose data were used in our experiments.ω in Equation ( 3) is set to 4000 for all tests.

Datasets
Four datasets (datasets can be obtained by sending an e-mail to the author) were used to test the proposed image mosaicking objective function and workflow.The images in Datasets 1, 2 and 3 were captured by a UAV (KC1600 [24]) over Yongzhou city, Hechi and HeJiangdong villages of Hunan province, China, respectively.Dataset 1, Dataset 2 and Dataset 3 contain 61, 182, and 51 images, respectively, each with resolution of 3680 by 2456 pixels.The average flying heights were 558 m, 405 m and 988 m, respectively.Although the UAV was guided by GPS, GPS data were not used in our mosaicking experiments.Digital orthophoto maps (DOMs) of the three datasets, each with ground resolution of 20 cm generated by DPGrid [25], were used as ground-truth to quantitatively evaluate the proposed algorithm.
The images in Dataset 4 were from Pix4D [26], and are openly available on the Internet.In total, 591 images were used for mosaicking purposes.The images (4000 ˆ3000 pixels) in the dataset were captured by a Panasonic DMC-GF1 camera with a 20 mm focal length lens.The average flying height was 742 m.In addition, the dataset provided by Pix4D contains a DOM with a ground resolution of 20 centimeters, which was used to quantitatively evaluate the proposed algorithm.
The sites of the four datasets are shown in Figure 3.The variations in terrain elevation for Dataset 1 and Dataset 3 are obvious, while for the other two datasets the variations are relatively small, as shown in

Results and Discussion
Figures 5-8 illustrate the mosaicking results of the proposed algorithm for the four datasets.The results are found to be visually satisfactory for all of the experimental datasets.All of the mosaics produced by the proposed method are seamless and display no obvious global distortion.

Results and Discussion
Figures 5-8 illustrate the mosaicking results of the proposed algorithm for the four datasets.The results are found to be visually satisfactory for all of the experimental datasets.All of the mosaics produced by the proposed method are seamless and display no obvious global distortion.Although image mosaicking has been studied for decades, many algorithms assume the camera rotates about the optical center [13,15].For image mosaicking in UAV applications, the camera experiences not only rotations but also translations.Therefore, although they are purely image-based, these types of algorithms [13,15] cannot be directly compared to the proposed method.To obtain accurate mosaicking results, many algorithms for UAV imagery integrate the image with other information, such as data from GPS/IMU, or with GCPs [1,3,8,9].By contrast, this additional information is not required in the presented method.

Results and Discussion
We compare the proposed method with that of Capel in aspects of accuracy and efficiency.We implement Capel's method incrementally, i.e., when a new image is added, all the image transformation parameters and 2D feature locations are re-optimized simultaneously.At each iteration, the image with the largest overlap to the current mosaic is added into the optimization procedure.The steps of feature extraction and blending are the same for the two methods under comparison.

Accuracy
Thirty corresponding well-distributed check points were manually selected in both the ground-truth and the mosaicked image for each dataset.A four degrees-of-freedom 2D similarity transformation model was used to align the two sets of control points.The Root-Mean-Square (RMS) error, minimum error and maximum error of the alignment are analyzed.Note that values of various errors are converted to metric units using the ground resolutions of the ground-truth images.In each comparison, the same reference image is used for both methods.
Table 1 shows the mosaicking errors for both methods, including the RMS errors, minimum errors and maximum errors.For Dataset 1 and Dataset 4, the RMS error of the proposed method is twice as good as that resulting from Capel's approach.For Dataset 2, the improvement is fourfold.

Efficiency
The run time for the two algorithms when processing three datasets is shown in Table 2.It shows that the proposed method is more than one order of magnitude faster for all three datasets.Capel's method needs to optimize 8M + 2n unknown parameters, where M is the number of images and n is the number of feature points.Generally, it is the case that n >> M. By contrast, our proposed method only needs to optimize 8M unknown parameters, and is therefore computationally faster.Moreover, our proposed algorithm can be initialized efficiently by solving a linear least-squares problem.Conversely, Capel's method has to optimize the transformation parameters of the images and the 2D feature locations incrementally in an iterative manner, rendering our proposed method theoretically more efficient.

Impact of the Reference Image
The image illustrated in Figure 9a was selected as the reference image for the mosaic in Dataset 4 (see Figure 8).To assess the variability in the result based on the quality of the reference image, we repeated the mosaicking process using a sub-optimal reference image.The mosaicking results from the two methods when using the image illustrated in Figure 9b as the reference image, are shown in Figure 10.Comparing Figure 8 with Figure 10, we see that the results of Capel's method vary significantly, while the results of the proposed method are relatively stable.Moreover, the relative performances of the two methods for Dataset 4 are quantitatively shown in Table 3.We can see that Capel's method is highly affected by the selection of the reference image, displaying a six times degradation in error.By contrast, the proposed method is more robust to the selection of the reference image.

Impact of the Reference Image
The image illustrated in Figure 9a was selected as the reference image for the mosaic in Dataset 4 (see Figure 8).To assess the variability in the result based on the quality of the reference image, we repeated the mosaicking process using a sub-optimal reference image.The mosaicking results from the two methods when using the image illustrated in Figure 9b as the reference image, are shown in Figure 10.Comparing Figure 8 with Figure 10, we see that the results of Capel's method vary significantly, while the results of the proposed method are relatively stable.Moreover, the relative performances of the two methods for Dataset 4 are quantitatively shown in Table 3.We can see that Capel's method is highly affected by the selection of the reference image, displaying a six times degradation in error.By contrast, the proposed method is more robust to the selection of the reference image.Proposed Algorithm (m) Capel's Method (m) Figure 9a 7.9 18.1 Figure 9b 7.8 105.1 The results of Capel's method heavily rely upon the reference image.Once the image plane of the reference image deviates from the image plane of the ground, the final results of Capel's method distort considerably.This will happen, for example, when the camera tilts significantly when capturing the reference image or when the ground elevation of the region covered by the

Impact of the Reference Image
The image illustrated in Figure 9a was selected as the reference image for the mosaic in Dataset 4 (see Figure 8).To assess the variability in the result based on the quality of the reference image, we repeated the mosaicking process using a sub-optimal reference image.The mosaicking results from the two methods when using the image illustrated in Figure 9b as the reference image, are shown in Figure 10.Comparing Figure 8 with Figure 10, we see that the results of Capel's method vary significantly, while the results of the proposed method are relatively stable.Moreover, the relative performances of the two methods for Dataset 4 are quantitatively shown in Table 3.We can see that Capel's method is highly affected by the selection of the reference image, displaying a six times degradation in error.By contrast, the proposed method is more robust to the selection of the reference image.Proposed Algorithm (m) Capel's Method (m) Figure 9a 7.9 18.1 Figure 9b 7.8 105.1 The results of Capel's method heavily rely upon the reference image.Once the image plane of the reference image deviates from the image plane of the ground, the final results of Capel's method distort considerably.This will happen, for example, when the camera tilts significantly when capturing the reference image or when the ground elevation of the region covered by the  The results of Capel's method heavily rely upon the reference image.Once the image plane of the reference image deviates from the image plane of the ground, the final results of Capel's method distort considerably.This will happen, for example, when the camera tilts significantly when capturing the reference image or when the ground elevation of the region covered by the reference image varies considerably.However, the impact of the reference image is minimal for the proposed method as a result of the regularization term in the objective function.

Impact of Variations in Terrain Elevation
The variations in terrain elevation for Dataset 1 and Dataset 3 are obvious, as shown in Figure 4, while the terrain for Dataset 2 and Dataset 4 are both relatively flat.The errors reported for Dataset 1 and Dataset 3 are significantly larger than Dataset 2 and Dataset 4, as shown in Table 1.The results show that the variations in terrain elevation adversely affect the mosaicking accuracy.
For the case of a camera undergoing purely rotation, two images of a 3D scene can be aligned seamlessly with the 8-DOF homography matrix [15].However, for the case of a camera undergoing general 3D motion (including both rotation and translation), the parallax will induce artifacts in the mosaics (see Figure 1).By using the multi-band blending approach [13], the artifacts can be effectively eliminated, as shown in Figure 1.

Conclusions
This paper has introduced a novel mosaicking method for UAV imagery that was inspired by the non-rigid mesh deformation.The novel objective function consists of a feature correspondence energy term and a regularization term.The first term minimizes the sum of the squared distances between matched feature pairs to align the images geometrically.The second term constrains the image transformation parameters directly by keeping all transformations as rigid as possible to avoid global distortion in the final mosaic.
Experiments have demonstrated that the proposed method can effectively avoid global distortions and results in visually satisfactory mosaics for several different datasets.Compared with another popular existing image-based mosaicking algorithm [12], our approach was found to be more accurate (more than twice as high) and efficient (more than one order of magnitude faster).The proposed method is more efficient because it only needs to optimize 8M unknown parameters, versus 8M + 2n (generally, n >> M) unknown parameters in [12].Moreover, the proposed objective function can be initialized efficiently by solving a linear least-squares problem, which provides the initial transformation parameters for the sparse LM algorithm.Conversely, the method in [12] has to optimize the transformation parameters of the images and the 2D feature locations incrementally in an iterative manner.Moreover, our approach is insensitive to selection of the reference image.Although we make the assumption that the ground is approximately planar, our method can adapt to non-planar terrain, at least to a certain degree.For featureless scenes (e.g., surfaces of large lakes or other water bodies), the image matching algorithm may fail since the proposed method is purely image-based.
When there is no demand for high accuracy mosaicking, especially when GPS data are either unavailable or not aligned to the imagery, the approach presented provides an effective and practical alternative method for mosaicking UAV imagery.
Future research will concentrate on the theoretical issues regarding the improvement of mosaicking, such as how to determine the optimal value of ω in Equation (3), this being a key factor in the mosaicking process.In addition, how best to develop the method to adapt to mountainous regions is also a very valuable consideration in engineering applications, and will be studied.

Figure 1 .
Figure 1.The pipeline for the proposed mosaicking algorithm.

Figure 1 .
Figure 1.The pipeline for the proposed mosaicking algorithm.

Figure 2 .
Figure 2. Images with excessive overlap (e.g., left) are not used to create the final result.

Figure 4 .
The elevation of the terrain in Dataset 1 ranges from 62 m to 169 m.The differences in elevation for Dataset 2 are less than 25 m.The variations in elevation for Dataset 3 are the largest.It ranges from 354 m to 507 m.The Digital Elevation Model (DEM) of Dataset 4 is from Pix4D, as shown in Figure 4d.Although the exact elevation values are not shown, we can find out that the terrain is relatively flat.

Figure 2 .
Figure 2. Images with excessive overlap (e.g., left) are not used to create the final result.

Figures 5 -Figure 5 .
Figures 5-8 illustrate the mosaicking results of the proposed algorithm for the four datasets.The results are found to be visually satisfactory for all of the experimental datasets.All of the mosaics produced by the proposed method are seamless and display no obvious global distortion.

Figure 9 .Figure 10 .
Figure 9.The two different reference images in Dataset 4. (a) The reference image for the results in Figure 8; (b) The reference image for the results in Figure 10.

Figure 9 .
Figure 9.The two different reference images in Dataset 4. (a) The reference image for the results in Figure 8; (b) The reference image for the results in Figure 10.

Figure 9 .Figure 10 .
Figure 9.The two different reference images in Dataset 4. (a) The reference image for the results in Figure 8; (b) The reference image for the results in Figure 10.

Figure 10 .
Figure 10.Mosaicking result of Dataset 4 obtained using reference image shown in Figure 9b: (a) mosaicking result using Capel's method; and (b) mosaicking result using the proposed algorithm.

Table 1 .
Mosaicking error for proposed algorithm and Capel's approach.

Table 2 .
Run time for proposed algorithm and Capel's approach.

Table 3 .
Mosaicking errors for Dataset 4 obtained using different methods and reference images.

Table 3 .
Mosaicking errors for Dataset 4 obtained using different methods and reference images.