UAS Navigation with SqueezePoseNet — Accuracy Boosting for Pose Regression by Data Augmentation

The navigation of Unmanned Aerial Vehicles (UAVs) nowadays is mostly based on Global Navigation Satellite Systems (GNSSs). Drawbacks of satellite-based navigation are failures caused by occlusions or multi-path interferences. Therefore, alternative methods have been developed in recent years. Visual navigation methods such as Visual Odometry (VO) or visual Simultaneous Localization and Mapping (SLAM) aid global navigation solutions by closing trajectory gaps or performing loop closures. However, if the trajectory estimation is interrupted or not available, a re-localization is mandatory. Furthermore, the latest research has shown promising results on pose regression in 6 Degrees of Freedom (DoF) based on Convolutional Neural Networks (CNNs). Additionally, existing navigation methods can benefit from these networks. In this article, a method for GNSS-free and fast image-based pose regression by utilizing a small Convolutional Neural Network is presented. Therefore, a small CNN (SqueezePoseNet) is utilized, transfer learning is applied and the network is tuned for pose regression. Furthermore, recent drawbacks are overcome by applying data augmentation on a training dataset utilizing simulated images. Experiments with small CNNs show promising results for GNSS-free and fast localization compared to larger networks. By training a CNN with an extended data set including simulated images, the accuracy on pose regression is improved up to 61.7% for position and up to 76.0% for rotation compared to training on a standard not-augmented data set.


Introduction
More and more Unmanned Aerial Vehicles (UAVs) are operating in industrial, research and private areas, leading to a growth of interest in UAV-related topics such as navigation, data acquisition, path planning or obstacle avoidance.For UAV positioning, general opposite items are of interest and described: • Global Referencing vs. Local Referencing • High Computational Power vs. Low Computational Power • Hand-made Features vs. Learned Features • Multi-View Stereo vs. Pose Regression Unmanned Aerial Systems (UASs) need reliable navigation solutions for Global Referencing which are mostly based on GNSS and often combined with Inertial Navigation Systems (INSs).As failures, due to gaps in signal coverage caused by occlusions or multipath effects, weaken satellite-based navigation, alternative methods for a reliable navigation of Unmanned Aerial Systems are mandatory.One of those state-of-the-art methods is aerial image matching based on feature detection or correlation which are used for navigational purposes.These methods are computational intensive, especially for navigating in large areas where a vast number of descriptors potentially has to be stored and matched.Further methods for navigating in a global reference frame, are commonly based on digital surface models, digital terrain models or CAD models [1].However, the large memory consumption of such models renders this approach infeasible for on-board processing on computational weak computers which are commonly mounted on UAVs.
Global referencing is commonly aided by Local Referencing to bridge gaps and maintain a continuous navigation.Local navigation methods such as visual Simultaneous Localization and Mapping (SLAM) or Visual Odometry (VO) are potential techniques for reconstructing trajectories [2][3][4].Considering additional information, such as geo-referenced key points or initialization to a referenced model, allows global referencing for local navigation methods.Nevertheless, for short frame-to-frame distances, these methods score high accuracies.Drawbacks are drift effects for longer distances caused by cumulative errors.However, these local solutions can potentially support global navigation algorithms in a complementary way.Visual SLAM or VO fail once the pose estimate is lost which can be caused by fast motions or occlusions.Global re-localization methods could be used to recover the pose in such cases.However, global navigation methods are capable as standalone systems.
While most navigation methods work fine on computers with High Computational Power, running them on weak or embedded devices with Low Computational Power may lead to limited real-time capability or even failures.Running a navigation framework on small UAVs or Micro Aerial Vehicles (MAVs) limits the computation power due to maximum payload.Therefore, all developed methods are subject to the limited computational power of the on-board processing units and need to be carefully designed with the target hardware in mind.A potential solution may be provided by Convolutional Neural Networks which need high computational power during the training step, but only a fraction of that during runtime.CNNs can perform fast forward passes even on small on-board GPUs and have a limited memory demand and power consumption [5].Therefore, CNNs offer a valuable solution for on-board computers.
Pose estimation with deep CNNs based on Learned Features, such as PoseNet [6] has been shown recently.CNNs solve image matching tasks, where Hand-made Features such as Scale Invariant Feature Transform (SIFT) [7], Oriented FAST and rotated BRIEF (ORB) [8] or Speeded Up Robust Features (SURFs) [9] might fail due to less descriptive features.However, a satisfying solution for VO using CNNs trained end-to-end is not yet available.For the demands on UAV navigation, the utilization of a small CNN that can be used on-board computers is mandatory.Therefore, a CNN-based solution for the navigation or localization of an UAV in a known environment is utilized by SqueezePoseNet [10].
Data augmentation became an important element, overcoming problems in several fields of computer vision.The success of training-based methods such as CNNs and the coexisting lack of training data increased the value of data augmentation even more.Modification of existing training data as well as the simulation of purely new data are used to augment training sets.Therewith, training classifiers or regressors became very successful by utilizing the diversity of modified, synthesized or simulated data.
The main contribution of this submission, besides SqueezePoseNet, is the investigation on data augmentation on training sets to improve Pose Regression by CNNs.For this purpose, a training data set is extended with simulated images.Therefore, a photo-realistic 3D model of the environment is generated.This 3D model serves for rendering images with arbitrary poses in the environment which augment and enrich the original training data set.Further, the pose results are approved with a standard Multi-View Stereo approach for evaluation.
The paper is structured as follows.After reviewing related work on pose determination, navigation methods, CNNs and data augmentation in Section 2 (Related Work), the modification of CNNs to solve for pose regression as well as the approach of accuracy boosting by data augmentation is investigated in Section 3 (Methodology).The training process on different data sets is further described in Section 4 (Training).In Section 5 (Data), the benchmark data set Shop Façade and the captured data set Atrium utilized for this work are presented.Furthermore, the Puzzle data set is introduced for the approach on data augmentation.The Unmanned Aerial Vehicle as well as the hardware used for data acquisition is specified in Section 6 (Unmanned Aerial System); Experiments on pose regression by PoseNet, a modification of the VGG16-Net and SqueezePoseNet are carried out on the mentioned data sets in Section 7 (Experiments).The results are presented numerically and visually and subsequently interpreted in Section 8 (Discussion).Finally, Section 9 (Conclusion and Outlook) summarizes the findings of the paper and provides ideas for future work and research.

Related Work
The determination of camera poses from images in a global coordinate frame is a useful and wide-ranging task.Solutions for this purpose are provided by finding correlations between aerial and UAV images for image matching and further localization of aerial vehicles [11,12].Feature-based methods successfully match remotely sensed data [13,14] or oblique images [15].Such feature-based methods often suffer from mismatches which is effectively tackled by Locality Preserving Matching (LPM) [16].Alternatives to determine camera poses from imagery are model-based approaches [17,18].Determining the camera pose in real time, in indoor environments, is practicable by CAD model matching [19][20][21][22].Convolutional Neural Networks are used to determine matches between aerial images and UAV images [23] or terrestrial images and UAV images [24].
However, these methods are not operable on small drones in large environments, due to the on-board computers' limited storage.
Feasible methods to conduct local navigation are visual SLAM or VO approaches, reconstructing a trajectory based on image sequences.Efficient solutions are ORB-SLAM [4], Large-Scale Direct Monocular SLAM (LSD-SLAM) [3] or Direct Sparse Odometry (DSO) [2].These methods provide satisfying solutions according to accuracy and real-time capability.However, their trajectory will only be based on a local reference.Navigating in higher level coordinate frames can be achieved by fusing local solutions with, for instance, GNSS or other global localization methods.Even though SLAM or VO solutions show impressive results (e.g., DSO) and drift only slightly for short trajectories, they will certainly drift over long distances, particularly if there are no loop closures.Additionally, SLAM or VO methods fail once the track is lost.Restoring the lost track is impossible without moving back to a known or mapped position.
Therewith, local and global navigation approaches build a complementary framework, merging the advantages of both systems.The satisfying accuracy of relative pose estimations derived from neighboring frames and the computational speed of local methods as well as the global referencing to counter drift and trajectory losses are mutually supportive.
A part of this work is based on PoseNet [6], a Convolutional Neural Network which re-localizes an acquired image in a known area.The CNN is trained with images and their corresponding poses in order to estimate the pose of an unknown camera during test time.The CNN estimates poses between two trained images by regression, but also extends the learned information to determine poses which slightly exceed the trained space.However, these results suffer concerning spatial accuracy.An enhancement to the original PoseNet is the Bayesian PoseNet [25] which provides re-localization uncertainty by adding dropout layers after each convolutional layer and improves the PoseNet accuracy by averaging over multiple forward passes.For large-scale data sets (e.g., King's College, 5600 m 2 ) PoseNet scores approximately 2 m and 3 • for mean pose estimates.PoseNet was originally built on GoogLeNet [26], a CNN designed for classification tasks, as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).However, the feasibility of pose regression with CNNs is shown in earlier works [6].Enhanced accuracies in the task of estimating poses were derived by further improvement [27] using Long Short-Term Memory layers (LSTM) [28], a type of recurrent neural net which was combined with CNNs in the past.LSTMs handle the problem of a dissolving gradient during the back-propagation using so-called gates.However, they showed great success in handwriting or speech recognition and recently in image re-localization.Furthermore, to improve a potential navigation approach, one may support the input of a CNN with additional data.Combining RGB data and depth data in a dual stream CNN showed further improvements of the localization results [29].
Considering that deep CNNs with a large number of parameters need a higher computation power than smaller CNNs, the latter is preferred for UAV applications.For example, the VGG16-Net [30] has 528 MB of weights which will exceed the capacity of an on-board processing unit such as DJI Manifold which is used for development (Section 6.2).The model size of PoseNet or the LSTM-based CNN which are based on GoogLeNet have sizes of about 50 MB.However, due to the modification of some layers for pose regression, the number of weighting parameters increases and therefore the net sizes vary slightly.A CNN that is small enough to run on an on-board processing unit while maintaining sufficient accuracy is mandatory.SqueezeNet [31], for instance, is a small CNN with a data size of only 4.8 MB which is ten times smaller than GoogLeNet.In future work, this might be reduced even more by quantization [32] or binarization [33].
Furthermore, pose regression may potentially be improved by data augmentation.Improving the performance of classifiers and CNNs by augmenting training data sets is widely known and well established [34][35][36][37].Common data augmentation methods are to shift, rotate, scale, flip, crop, transform, compress or blur the training images to extend the training database.This supports classifiers and CNNs to learn invariance to, e.g., shifts and rotations, which helps to generalize and increase accuracy [38,39].Simulated or synthetic images rendered from 3D models have long been used in computer vision to generate even more extensive training data [40,41].Rendering images from 3D objects is also practised to expand training data and improve the performance of CNNs [42,43].To render images for pose regression demands, a 3D model of the target environment is necessary.Generating 3D models is of great interest in research communities such as photogrammetry, computer vision or geo-information sciences [44][45][46].Research focuses on the reconstruction of such models and its automation [46][47][48].Moreover, recently, various benchmark data sets for visual localization and with varying conditions have been published [6,49].

Methodology
The methodology is divided into two sections.The modifications applied on CNNs to solve for pose regression are described in Section 3.1.Subsequently, the data augmentation approach to increase accuracy is outlined in Section 3.2.

Modifications of CNNs
For the demands on pose regression SqueezePoseNet [10], an adaptation of SqueezeNet [31] is utilized.Whereas SqueezeNet is designed for solving classification tasks, SqueezePoseNet is modified to solve for pose regression.The architecture of SqueezePoseNet is depicted in Figure 1.A fully connected layer with 500 neurons is added after the last pooling layer.For pose estimation, two fully connected layers (Position and Rotation) with 3 and 4 neurons, respectively, were added.SqueezeNet's originally design strategy consists of multiple Fire modules.Each Fire module first decreases the number of input channels from the previous layer by 1 × 1 convolutions in a so-called squeeze operation.Subsequently, an expand operation that is a combination of 1 × 1 and 3 × 3 filters increases the number of activation maps but keeps the number of parameters low (Figure 2).
In addition, the traditional SqueezeNet architecture lacks a final fully connected layer as these layers increase the number of parameters significantly.Instead, the final convolutional layer consists of as many 1 × 1 filters as classes.Subsequently, average pooling is used to yield a vector whose length equals the number of classes.To modify SqueezeNet's architecture for pose regression, a fully connected layer with 500 neurons is introduced.This layer acts as a descriptor and is meant to help the network distinguish different poses from each other.Additionally, the layer's activation functions are changed from Rectified Linear Units (ReLUs) to Leaky Rectified Linear Units (Leaky ReLUs) [50].This is meant to help convergence.In addition, Batch Normalization [51] is added after each convolutional layer, making higher learning rates possible.Finally, two fully connected layers for actual pose estimation were added.A three-neuron layer for position and a four-neuron layer for rotation.The rotation is parametrized as a quaternion.The training is described in Section 4.  A so-called squeeze operation [31] is performed by the 1 × 1 convolutional layer.Subsequently, an expand operation that is a combination of 1 × 1 and 3 × 3 filters increases the number of activation maps while keeping the number of parameters low.
For the purpose of evaluation, a deeper Neural Network, the VGG16-Net [30], is modified to compare it with the smaller SqueezePoseNet.Deeper networks obviously tend to be more accurate than small ones [31].The modification on the VGG16-Net was realized the same way as the modification on SqueezeNet.Two dense layers (Position and Rotation) for pose regression were added and the layer's activation functions were set to Leaky Rectified Linear Units.
CNNs are optimized by iteratively changing their weighting parameters using back-propagation.To optimize SqueezePoseNet for pose regression, the following loss [6] is minimized: The loss is calculated as the sum of the position error and the weighted orientation error.xi and x i are ground truth and estimated position in metric dimensions.qi and q i are ground truth and estimated orientation in quaternions.Since position and orientation are not in the same unit space and therefore are not comparable, a weighting is done by using β to keep the errors in the same range.This prevents the CNN from minimizing only one of the two error values.Typically, β is set between 250 and 2000 for outdoor data sets.For the experiments, β is set to 500.

Data Augmentation
To improve the net's performance, data augmentation is used, whereas a training data set is extended with simulated images and their corresponding poses.Therewith, the CNN should be able to learn the appearance of images and their poses in more extended space according to the original training data.For this approach to enhancing the performance of a CNN, simulated images were added to an existing image data set.Rendering additional images to extend the data set presumes that a 3D model of the target environment is existent.This is obtained without the need of additional data by only utilizing the original image data set to generate a photo-realistic model of the environment.For this purpose, a standard multi-view stereo approach [52] is used to generate textured 3D models from images.Automatic feature extraction and feature matching is performed to compute poses of the training images.By finding further correspondences, a 3D model is generated.The model's texture is created automatically from the training images.No additional data, aside from the training data set, is needed for this approach.Since the textures of the model are created directly from the input images, a photo-realistic model is generated.For generating simulated images, the Gazebo simulator [53] is utilized to render images with arbitrary perspectives to extend the Puzzle training data set.

Training
Convolutional Neural Networks usually need a huge amount of training data to perform well, which is often not accessible.Transfer-learning is a valuable process to overcome this issue.For that, a net is initially pre-trained with a huge amount of data, which is often publicly available, to determine weights for the net's layers.These weights are used as initial values for later training.
SqueezeNet is initialized with weighting parameters trained on the ImageNet data set [54].It was shown that for pose regression, training on the Places data set [55,56] leads to further improvement in terms of accuracy [6], as it is a more suitable data set for pose regression.Training was subsequently carried out on the Places data set.
Before starting training for pose regression on the re-localization data set Atrium (Section 5.2), SqueezePoseNet was pre-trained on the Shop Façade benchmark data set (Section 5.1).However, only the added layers were set as trainable (the last three fully connected layers in Figure 1 For accuracy evaluation of SqueezePoseNet, the deeper VGG16-Net, which should be more accurate, was chosen.Considering that VGG16-Net is built to solve classification tasks, it is adapted in a similar manner as SqueezeNet described above to estimate poses.The net's layers were initialized with weights obtained by training on the Places365 data set.The adaptation is also trained on Shop Façade to obtain initial weights for the added layers.The modified net is then trained on the data set Atrium All training procedures were carried out on a 64 GB RAM computer equipped with an Intel R Core TM i7-3820 3.6 GHz processor and an 8 GB GeForce GTX 980 graphics card.

Data
In this section, the benchmark data set Shop Façade (Section 5.1) and the captured data set Atrium (Section 5.2) that are used for training and evaluation of the Convolutional Neural Networks for pose regression are described.Besides, the ImageNet and Places data sets serve for pre-training.The experiments on data augmentation for pose regression are carried out on the Puzzle data set (Section 5.3).

Shop Façade Data Set
The Shop Façade data set is part of the Cambridge Landmarks benchmark [6].This outdoor data set for visual re-localization contains three video sequences recorded with a smartphone camera and their extracted image frames separated in 231 training and 103 evaluation images.A pose is provided for every image.The modified CNNs are trained on Shop Façade to yield initial weighting parameters for further training on the Atrium data set.The data set containing image sequences and poses of Shop Façade (available online http://mi.eng.cam.ac.uk/, accessed on 14 December 2017)).The dimension of the area in which the images were captured is denoted as 35 m × 25 m.

Atrium Data Set
The Atrium data set consists of 864 images collected within the LaFiDa (available online https://www.ipf.kit.edu/lafida.php,accessed on 14 December 2017)) benchmark collection by KIT [57].These images in high resolution (available online https://www2.ipf.kit.edu/pcv2016/downloads/photos_atrium_recon-struction.zip,accessed on 14 December 2017)) are used to determine the poses with Agisofts Structure from Motion (SfM) routine [58].For introducing the Atrium environment, side views of the 3D model are shown in Figure 3a,b.The dimension of the captured area (39 m × 36 m × 18 m) is similar to the Shop Façade environment, at least for the ground area.
Two image sequences of the Atrium serve for evaluation.The images of the medium coverage sequence are captured in a low altitude and are spatially close to the training data.The images of the low coverage sequence are spatially far away and have high discrepancies in perspectives compared to the training data.However, the medium coverage sequence includes 145 extracted images, which show medium coverage of the training data.High coverage can be stated, if training and testing data show very similar poses.The low coverage sequence is captured with a higher altitude than the medium coverage sequence and shows low coverage of the training data, as no images with high altitude or a downward facing field of view are present in the training images.Therefore, the positions as well as the orientations are 'far away' from the training poses.However, the low coverage sequence is challenging for a Convolutional Neural Network, as it has not learned similar images nor poses during the training process.Ground truth is built by adding the evaluation images to the Structure from Motion (SfM) pipeline to determine their poses.For evaluation, the received poses serve as ground truth and are used to compute differences to the corresponding camera poses estimated by the CNN. Figure 4 shows examples of training and testing images.Figure 4a shows a test image where reflections could be detected.Figure 4b shows a blurred image caused by angular motion of the UAV.Both effects are known for challenging computer vision tasks.Figures 3 and 4c show the very ambiguous structures on the Atriums walls.Figure 4d

Puzzle Data Set
This data set assists the investigation by showing how data augmentation can improve the accuracy of CNNs on pose regression.The simplicity of this data set serves to show the feasibility of a data augmentation approach and reduces visualization difficulty.The performance of CNNs is evaluated on the original Puzzle data set and on the Puzzle data set augmented with simulated images.The standard Puzzle data set consists of 19 images captured in nadir perspective of a puzzle.The nadir images were captured to recreate an aerial image-like scene.Therefore, these images could be compared to an aerial flight to collect spacious training data.Additionally, the data set consists of 48 test images captured closer to the puzzle and in oblique perspectives.These images represent images taken within an UAV-flight in lower altitude.The test images serve for evaluation purposes.The spatial extents of the puzzle are 114.3cm × 82.6 cm.These metric dimensions do not match a real-world scenario.However, this is not crucial for the CNNs' work flow and outcome, since the weighting parameter β handles scale differences.Additionally, for later evaluation, not the absolute metric measurements, but the ratio between the metric results compared to the spatial extent of the target 3D environment is of primary interest.
The 19 data set images serve for generating a photo-realistic 3D model (Figure 5).

Unmanned Aerial System
This Section presents an overview of the Unmanned Aerial System (UAS), which mainly consists of a UAV (Section 6.1), an on-board processing unit (Section 6.2) and a camera (Section 6.3).

Unmanned Aerial Vehicle
A Matrice 100 from DJI (Dà-Jiāng Innovations Science and Technology Co., Ltd: Skyworth Semiconductor Design Building, No.18, Gaoxin South 4th Ave, Nanshan District, Shenzhen, China, 518057) (Figure 7) and the X3 camera (Section 6.3) serve to capture images for the experiments (Section 7).The on-board processing unit, a DJI Manifold (Section 6.2) is used for data storage.Besides, it is capable of running the proposed SqueezePoseNet.The quadrocopter is able to carry about 1 kg of payload with a maximum take off weight of 3600 g.The training of the CNNs is processed offline; only the actual re-localization or navigation process has to be processed on-board.

On-Board Processing Unit
The DJI Manifold, shown on top of the UAV in Figure 7, is based on a NVIDIA R Tegra K1 with a Quad-core Arm R Cortex A-15 32-bit processor.The Kepler GPU consists of 192 CUDA R cores.CUDA R is NVIDIAs R software architecture to process computations on a Graphics Processing Unit (GPU) which is necessary for most deep learning tasks.Furthermore, the Manifold is compatible with cuDNN R , NVIDIAs R GPU-based deep neural network library, which includes standard routines for CNN-processing and CNMeM, a memory manager for CUDA.Besides, the DJI Manifold has 2 GB of RAM.With SqueezePoseNet, it is possible to process pose regression in real time.The weight of the device is less than 200 g.

Camera
The UAV is equipped with a DJI Zenmuse X3 camera, shown on the bottom of the UAV in Figure 7.It is capable of taking single shot images with a resolution of 4000 × 2250 pixels or capturing video streams with a resolution of 3840 × 2160 pixels and 30 frames per second.The camera has a 1/2.3"CMOS sensor (6.2 mm × 4.6 mm) with 12.4 megapixel.Its field of view is 94 • .The camera and its gimbal weigh 254 g.

Experiments
For the experiments, the pose estimations derived by CNNs are evaluated by Multi-View Stereo [52,58].SqueezePoseNet is evaluated on the medium and low coverage sequences mentioned in Section 5.2.For comparison, the VGG16-Net adaptation and PoseNet [6] are also evaluated on these two sequences.In Sections 7.1 and 7.2, visual results as well as metric evaluations against ground truth are shown.The experiments on data augmentation on the Puzzle data set are presented in Section 7.3.

Medium Coverage Sequence
This sequence shows a medium coverage between training and evaluation poses.A high coverage would be given by a more dense distribution of training data or a high similarity of training images and evaluation images.However, Figure 8a   The histograms in Figure 9 show the derived errors to the ground truth poses.Figure 9a,c show the spatial errors of the modified VGG16-Net and SqueezePoseNet to ground truth.Figure 9b,d show the angular errors in the same manner.The evaluation errors are represented by the difference between the spatial and angular estimates to the ground truth.The evaluation errors for the tested CNNs on the Atrium sequences are shown in Table 1 and are as follows.The pose estimation accuracy is 4.91 m and 33.30

Low Coverage Sequence
This sequence shows a low coverage between training poses and evaluation poses.Figure 10a shows the training (green) and ground truth (blue) camera poses for the low coverage sequence.The resulting poses derived by the modified VGG16-Net are visualized (red) in Figure 10b.The poses derived by SqueezePoseNet are visualized in Figure 10c.The histograms in Figure 11 show the derived errors to the ground truth poses.Figure 11a,c show the spatial errors of the modified VGG16-Net and SqueezePoseNet, respectively, to ground truth.Figure 11b,d show the angular errors in the same manner.The pose estimation accuracy is 11.34 m and 37.33 • for the modified VGG16-Net, 15.18 m and 65.02 • for SqueezePoseNet and 11.47 m and 46.40 • for PoseNet (Table 1).
An exemplary visualization of a side view is pictured in Figure 12. Green cameras indicate training data, blue cameras depict the ground truth and red cameras are estimated poses derived by the modified VGG16-Net.As can be seen, a major deficit arises in the estimation of the height component; the resulting poses are estimated systematically too low.This behavior is supposed to be influenced by the training poses, which are located far away and mainly on the ground.

Puzzle Data Set and Data Augmentation
To tackle and overcome the issue of the experienced limited extrapolation capability of CNNs for pose regression (Figure 12), the Puzzle data set was created and augmented with simulated images (Section 5.3).The training data-specifically the lack of training data in the spatial neighborhood of the evaluation data-hardly causes the limitation in extrapolation.Therefore, the net never learned that poses appear in such locations.By adding simulated images of the environment in the spatial neighborhood of the evaluation data, the results would be enhanced.However, such images do not exist in the original Puzzle data set and it is expected that on the application level, such data is not available most of the time.Therefore, the training data set was augmented as described earlier in Section 5.3.
PoseNet and the modified VGG16-Net are trained and evaluated on the Puzzle data set as well as on the Puzzle data set augmented with simulated images.The results on PoseNet are visualized in Figure 13; Figure 13a depicts the results on the original Puzzle data set; Figure 13b shows the results trained on the Puzzle data set after data augmentation.The results on VGG16-Net are visualized in Figure 14; Figure 14a depicts the results on the original Puzzle data set; Figure 14b shows the results trained on the Puzzle data set after data augmentation.The numerical results are shown in Table 2 and are as follows.For PoseNet, trained on the original Puzzle data set, the evaluation errors are 0.47 m and 16.07 • for the spatial and angular components, respectively.The evaluation errors on PoseNet, trained on the Puzzle data set after data augmentation, are 0.18 m and 7.53 • .This is an improvement of 61.7% for the spatial component and 53.1% for the angular component.In Figure 15, the evaluation errors are visualized in histograms; Figure 15a

Discussion
The previous experiments show that Convolutional Neural Networks are capable of estimating camera poses.Even the small SqueezePoseNet could successfully estimate sufficient poses and is very well suited for UAS or MAV on-board applications, due to its small net size.
SqueezePoseNet for CNN-based pose regression is tested on the medium and the low coverage sequence.The medium coverage sequence shows satisfying results for regressing a coarse pose.Pose regression performed well for the VGG16-Net adaptation as well as for SqueezePoseNet.However, VGG16-Net shows more accurate results, as it is deeper than SqueezePoseNet.Both CNNs, with position differences to ground truth of 4.91 m and 5.19 m, respectively, would provide good initial poses for a further and optional pose refinement step.Additionally, the calculation of these poses is processed within a few milliseconds.
As expected, the low coverage sequence where no similar training data was captured showed less accurate results.The spatial differences of 11.34 m and 15.18 m in an environment of about 39 m × 36 m × 18 m are not sufficient.The majority of the position error is influenced by a faulty height determination.The poses are systematically estimated too low.It is to be expected that the CNN cannot extend its knowledge too far from the trained data, as it has never learned such extents.Although interpolation works well, extrapolation performs less well.
To overcome this issue of insufficient coverage, data augmentation was introduced.The Puzzle data set serves to demonstrate the feasibility of this approach.After extending a data set with simulated images, the utilized CNNs perform significantly better in the corresponding environment.On the modified VGG16-Net, the results could be improved by about 59.0% for the position component and 76.0% for the angular component.For PoseNet, the improvement was about 61.7% and 53.1% respectively.It is to be noted that no additional data has to be utilized to achieve that improvement, since the simulated images were rendered from a 3D model, which was generated by the original data set itself.

Conclusions and Outlook
It is shown that CNN-based solutions for pose regression can support UAV-based navigation applications and have the potential to improve or substitute existing methods.SqueezeNets architecture could be modified, whereby a small CNN for pose regression, named SqueezePoseNet, could be introduced.The results on this net were promising compared to deeper CNNs, such as PoseNet or VGG16-Net.
A drawback of the proposed idea for UAV navigation is the need for training data.A pose can only be determined if sufficient training data of the environment is available, but such data is often not given.However, image data is captured all over the globe; it is commercially or publicly available.A well known example is the reconstruction of parts of the city of Rome by using solely online data [59].Additionally, there is a lot of aerial imagery available covering urban and rural areas, which may also serve as training data.
Furthermore, the issue of lacking training data was tackled by extending a training set with simulated images by data augmentation.Thereby, the evaluation errors could be decreased by 59.0-61.7% for position and 53.1-76.0%for rotation, compared to the corresponding not-augmented data set.This indicates the success of supporting Convolutional Neural Networks with simulated images to learn extrapolation for pose regression.For this data augmentation, a 3D model of the environment is mandatory.Data for such models could be obtained by accessing the aforementioned image data bases.The reconstruction of 3D models is a basis for rendering images of a target environment.In this contribution, a photo-realistic model was created to render additional images.It is of further interest whether simple triangulated 3D models may serve for data augmentation on pose regression with similar success.By utilizing such 3D models, a CNN-based or -aided navigation approach could be realized.
Of further interest is the amount of data content which can be learned by a CNN, especially by a small CNN.Even though the number of weighting parameters of the CNNs do not increase with a higher amount of training images, a CNN is only able to recognize a finite part of the training data.Different sized CNNs such as PoseNet, VGG16-Net or SqueezePoseNet may fail or at least might have different accuracies depending on the given size of the net itself or the training environment.Both should be investigated to determine the optimal performance for given applications.Furthermore, the investigation of data augmentation with more and diversified simulated images and on larger data sets with real-world examples will be tackled in future work.

Figure 1 .
Figure 1.Architecture of SqueezePoseNet.A fully connected layer with 500 neurons is added after the last pooling layer.For pose estimation, two fully connected layers (Position and Rotation) with 3 and 4 neurons, respectively, were added.

Figure 2 .
Figure2.Architecture of a Fire module.A so-called squeeze operation[31] is performed by the 1 × 1 convolutional layer.Subsequently, an expand operation that is a combination of 1 × 1 and 3 × 3 filters increases the number of activation maps while keeping the number of parameters low.
), since the preceding layers should already be well initialized by the pre-training steps.After training the network on the Shop Façade benchmark, the CNN is finally fine-tuned on the training data of the Atrium data set consisting of 864 images.The inputs for the CNNs are 757 training images and 95 testing images with their corresponding camera poses.The training on SqueezePoseNet ended with errors of 4.45 m for position and 14.35 • for orientation.
. The training errors result in 2.35 m and 9.09 • for position and rotation, respectively.To show the benefit of the utilization of data augmentation by simulated images, an additional training was carried out with the original Puzzle training dataset and the dataset extended with 479 simulated images.Furthermore, to preserve comparable results, performance evaluation was realized on two Convolutional Neural Networks: the original PoseNet and the VGG16-Net modified for pose regression.Note that the focus is on relative performance improvements of the net's accuracy, not on the absolute accuracy itself.The training process with the initial Puzzle data set (without any simulated images) of PoseNet ended in training errors of 0.13 m and 6.00 • ; VGG16-Net training ended with training errors of 0.20 m and 5.00 • .The training with the added 479 simulated images ended for PoseNet with a training error of 0.18 m and 6.44 • and for VGGG16-Net net with 0.04 m and 1.98 • respectively.To summarize, the work flow of the methodology can be described as follows.(i) Choose a suitable CNN for classification tasks; (ii) Initialize the CNN with pre-trained weighting parameters; (iii) Transfer learning on a suitable data set, e.g., Places (optional); (iv) Prepare the CNN to solve for pose regression by adding appropriate layers; (v) Produce training data for a target environment.Optionally, the training data set could be extended by data augmentation; (vi) Train the network (the trainable layers are restricted to the last fully connected layers); (vii) Train the CNN with own training data; (viii) Run the CNN on evaluation images.

Figure 3 .
Figure 3. Side views of the Atrium.To obtain a complete introduction to the environment, the 3D model of the Atrium is visualized in east view (a) and west view (b).Notice, the red-bricked walls show very similar structures, forming a challenging environment for computer vision algorithms.
Figure4shows examples of training and testing images.Figure4ashows a test image where reflections could be detected.Figure4bshows a blurred image caused by angular motion of the UAV.Both effects are known for challenging computer vision tasks.Figures3 and 4cshow the very ambiguous structures on the Atriums walls.Figure4dshows a test image of the low coverage sequence.A similar image is not contained in the training data.The position as well as the orientation are very far away from any training pose.

Figure 4 .
Figure 4. Example images of the Atrium.(a) Windows in the scene cause reflections; this is challenging for computer vision tasks; (b) Image captured with high (angular) motion thus showing motion blur; (c) Image shows ambiguous structures; (d) Test image of the low coverage sequence (Section 7.2) captured at high altitude.A similar perspective is not contained in the training data; the pose is 'far away' from any training pose (position as well as orientation).

Figure 5 .
Figure 5. Nadir view of the puzzle's 3D model.This model was generated with 19 images and serves to render additional images for data augmentation on the Puzzle data set.

Figure 6 .
Figure 6.The top row shows images taken for the Puzzle data set; the bottom row shows corresponding simulated images from a photo-realistic model.The pose of the original image (a) is determined incorrectly within the SfM process leading to a faulty simulated image (c).Images with wrong pose determinations were disregarded in all further processes.The pose of image (b) is determined correctly, whereby the simulated image (d) is rendered accordingly.

Figure 7 .
Figure 7. DJI Matrice 100 with an on-board processing unit DJI Manifold and a Zenmuse X3 camera.
shows the training (green) and ground truth (blue) camera poses for the medium coverage sequence.The estimated poses derived by the adaptation of VGG16-Net are visualized (red) in Figure 8b.The poses derived by SqueezePoseNet are visualized in Figure 8c.

Figure 8 .
Figure 8. Atrium data set, medium coverage sequence.(a) Visualization of ground truth; (b) Results derived by the modified VGG16-Net; (c) Results derived by SqueezePoseNet.Green cameras indicate training data, blue cameras depict the ground truth and red cameras are estimated poses derived by the modified CNNs.

Figure 9 .
Figure 9. Medium coverage sequence.The histograms show the spatial and angular errors of the CNN-derived poses to ground truth.Histograms (a,c) depict the spatial errors of the modified VGG16-Net and SqueezePoseNet, respectively, and (b,d) show the angular errors.

Figure 10 .Figure 11 .
Figure 10.Atrium data set, low coverage sequence.(a) Visualization of ground truth; (b) Results derived by the modified VGG16-Net; (c) Results derived by SqueezePoseNet.Green cameras indicate training data, blue cameras depict the ground truth and red cameras are estimated poses derived by the modified CNNs.

Figure 12 .
Figure 12.Visualization of VGG16-Net results on the Atrium data set.Green cameras indicate training data, blue cameras depict the ground truth and red cameras are estimated poses derived by the modified VGG16-Net.The poses are systematically estimated too low.This behavior is supposed to be influenced by the training poses, which are located far away and mainly on the ground.
,b depict the spatial and angular errors, respectively, based on training on the standard Puzzle data set.The histograms in Figure 15c,d depict the spatial and angular errors, respectively, based on training on the Puzzle data set after data augmentation.For the modified VGG16-Net, trained on the original Puzzle data set, the spatial and angular errors are 0.39 m and 17.07 • respectively.The evaluation errors on VGG16-Net, trained on the Puzzle data set after data augmentation, are 0.16 m and 4.10 • .This is an improvement of 59.0% for the spatial component and 76.0% for the angular component.In Figure 16, the evaluation errors are visualized in histograms; Figure 16a,b depict the spatial and angular errors, respectively, based on training on the standard Puzzle data set.The histograms in Figure 16c,d depict the spatial and angular errors based on training on the Puzzle data set after data augmentation.

Figure 13 .Figure 14 .Figure 15 .Figure 16 .
Figure 13.Visualization of determined poses by PoseNet on the Puzzle data set.(a) Pose results for PoseNet trained on the original Puzzle data set; (b) Pose results for PoseNet trained on the Puzzle data set after data augmentation.Green cameras indicate training data of the original Puzzle data set, blue cameras depict the ground truth and red cameras are estimated poses derived by PoseNet.The black points indicate the puzzle's pointcloud.It is clearly shown that data augmentation supports the extrapolation potential of the net.

Table 1 .
Position and rotation errors on the medium and high coverage sequences, respectively.Bold text marks the best result on a sequence.

Table 2 .
Mean evaluation errors for PoseNet and the modified VGG16-Net trained on the original Puzzle data set and the Puzzle data set after data augmentation.The last row shows the percent improvement of training on the augmented data set against the original.