Efficient Structure from Motion for Large-Size Videos from an Open Outdoor UAV Dataset

Modern UAVs (unmanned aerial vehicles) equipped with video cameras can provide large-scale high-resolution video data. This poses significant challenges for structure from motion (SfM) and simultaneous localization and mapping (SLAM) algorithms, as most of them are developed for relatively small-scale and low-resolution scenes. In this paper, we present a video-based SfM method specifically designed for high-resolution large-size UAV videos. Despite the wide range of applications for SfM, running mainstream SfM methods on such videos is challenging due to their high computational cost. Our method consists of three main steps. First, we employ a visual SLAM (VSLAM) system to efficiently extract keyframes, keypoints, initial camera poses, and sparse structures from downsampled videos. Next, we propose a novel two-step keypoint adjustment method. Instead of matching new points in the original videos, our method effectively and efficiently adjusts the existing keypoints at the original scale. Finally, we refine the poses and structures using a rotation-averaging constrained global bundle adjustment (BA) technique, incorporating the adjusted keypoints. To enrich the resources available for SLAM or SfM studies, we provide a large-size (3840 × 2160) outdoor video dataset with millimeter-level-accuracy ground control points, which supplements the current relatively low-resolution video datasets. Experiments demonstrate that, compared with other SLAM or SfM methods, our method achieves an average efficiency improvement of 100% on our collected dataset and 45% on the EuRoC dataset. Our method also demonstrates superior localization accuracy when compared with state-of-the-art SLAM or SfM methods.


Introduction
Modern unmanned aerial vehicles (UAVs) equipped with cameras have become crucial in several fields, such as surveying and mapping, geographic information systems (GIS), and digital city modeling. To achieve accurate localization and create 3D representations of real-world scenes, techniques like image or video-based structure from motion (SfM) and visual simultaneous localization and mapping (VSLAM) are utilized [1][2][3][4][5][6][7][8][9][10]. However, there is relatively little research on large-size video-based SfM specifically designed for outdoor UAVs. On the one hand, mainstream UAV cameras have already reached resolutions of up to 20 megapixels, thus providing more detailed information for all kinds of applications. However, the widely used video datasets [11][12][13][14][15] provide a resolution below 1 megapixel. On the other hand, there is limited research on how to combine SfM and VSLAM for large-size video-based localization. For large-size videos, current video-based SfM methods usually extract keyframes based on simple empirical rules; for example, Kurniawan et al. [16] performed SfM on keyframes extracted from videos simply according to the overlap rate of images, instead of using a more sophisticated VSLAM method, to achieve 3D terrain reconstruction. In fact, VSLAM, which is designed for continuous image processing, inherently suits video data better. To process large-size videos in real time, SLAM systems estimate camera poses and build maps on downsampled images, which is more efficient but results in lower localization accuracy than SfM methods. Some researchers [17,18] have attempted to utilize VSLAM to assist the SfM method for reconstruction. However, these methods only utilize the camera poses estimated by a SLAM system. In fact, reusing the feature extraction, matching, and keyframe covisibility graph results from SLAM can significantly reduce the computational cost of large-size video processing.
In this paper, we propose an efficient SfM pipeline designed to process high-resolution aerial videos. Additionally, we introduce a new outdoor UAV video dataset comprising images with a resolution of 3840 × 2160 pixels. Our approach maximizes the usefulness of the initial outcomes provided by a fast VSLAM system and incorporates a constrained bundle adjustment (BA) as a single backend refinement step. The pipeline unfolds in the following steps. First, we feed the downsampled video data to a VSLAM system, which serves multiple purposes, including selecting keyframes and keypoints, as well as establishing preliminary camera poses and 3D scene structures. Second, to optimize the efficiency of the pipeline, we employ a coarse-to-fine two-step keypoint adjustment (TS-KA) method with rotation invariance, which adjusts the positions of matched keypoints projected onto the original high-resolution images instead of re-matching new feature points. This adjustment process begins by roughly aligning keypoint positions using normalized cross-correlation (NCC) [19]. Following the rough alignment, we apply direct image alignment [20] within a learned dense feature space to further refine the matched points up to sub-pixel accuracy. Finally, the global bundle adjustment takes the initial camera poses from the VSLAM system as inputs and integrates a rotation averaging strategy [21]. Optionally, ground control point (GCP) constraints can be included to obtain high-accuracy poses and 3D scene points at centimeter-level precision.
The contributions of this paper are summarized as follows.
(1) Efficient SfM pipeline. We propose an efficient pipeline specifically designed to process large-size aerial videos. By leveraging the strengths of a rapid VSLAM system and incorporating refined adjustment steps, our pipeline achieves impressive accuracy and efficiency in pose localization of video sequences.
(2) Two-step keypoint adjustment (TS-KA) strategy. The novel strategy refines the positions of keypoints matched in downsampled images up to sub-pixel accuracy on the original high-resolution images.
(3) High-resolution UAV video dataset. We provide a high-resolution UAV video dataset and supply high-accuracy GCPs to facilitate evaluation. This dataset fills a gap in the current availability of outdoor high-resolution video datasets for SLAM or SfM research.

Unstructured, Sparsely Sampled Collection
Early works laid the foundation for internet photo collections [22]. Inspired by these works, reconstruction systems for increasingly high-resolution photo collections have been developed [15,23]. These methods can be classified into incremental SfM, global SfM, and hybrid SfM, based on the manner in which camera poses are estimated. Currently available open-source incremental SfM algorithms, such as Bundler [1], VisualSfM [2], and COLMAP [3,24,25], provide a solid foundation for SfM research. Mainstream global SfM methods [4,5,26,27] estimate all camera poses and perform a global BA to refine the camera poses and reconstruction scene, resulting in better scalability and efficiency.
However, these methods, which focus on unordered, sparsely sampled images, face challenges when dealing with coherent, densely sampled data. This difficulty arises from frame-wise matching and triangulation with very short parallax, which can result in high computation loads and unreliable geometric structures.

Coherent, Densely Sampled Collection
This type of study addresses continuous feature tracking and mapping on coherent, densely sampled image sequences. Specifically, VSLAM methods have been developed to estimate camera trajectories and reconstruct scene structures from video streams in real time [7][8][9][10]35]. However, these methods often prioritize speed and, as a result, face limitations when processing large-size high-resolution images. This restriction hampers their ability to produce fine-grained high-quality reconstructions.
Over the years, SfM methods have also been developed specifically for densely sampled image sequences or videos. For example, Shum et al. [36] introduced the concept of "virtual keyframes" in a hierarchical SfM approach to enhance efficiency. Resch et al. [37] proposed multiple SfM techniques based on the KLT tracker and linear camera pose estimation [38] for large-scale videos. Leotta et al. [39] accelerated feature tracking for aerial videos by exploiting temporal continuity and planarity of the ground. More recently, a deep learning-based approach [40] was proposed to select appropriate keyframes for videos. To resolve ambiguity arising from repetitive structures, Wang et al. [41] proposed a track-community structure to segment the scene. Gong et al. [42] proposed to disambiguate scenes in SfM by prioritizing pose consistency over feature consistency. However, it should be noted that these methods may rely on fixed camera calibration and could encounter significant drift issues in scenes without a loop. Different from these methods, our work proposes a hybrid SfM solution that combines the advantages of global SfM and feature-based VSLAM methods.

Keypoint Adjustment
Recently, there has been an increased focus on developing local search-based methods to enhance the efficiency and accuracy of keypoint matching. These methods employ both handcrafted [43,44] and learned features [45][46][47] to establish more accurate correspondences between keypoints. For example, Taira et al. [48] presented a method that achieves dense correspondence through a coarse-to-fine matching process using VGG-16 [49]. Li et al. [50] employed a dual-resolution approach to achieve reliable and accurate correspondences. Zhou et al. [51] proposed a detect-to-refine method, where initial matches are refined by regressing pixel-level matches in local regions. However, it should be noted that these methods [48,50,51] are primarily optimized for stereo pairs and may not be directly applicable to multiple views.
In order to enhance the quality of multi-view keypoints for downstream tasks like SfM, Dusmanu et al. [52] incorporated a geometric cost with optical flow. However, this method has limitations in terms of accuracy and scalability for large scenes. Lindenberger et al. [20] addressed the alignment of keypoints by utilizing a feature-metric representation to jointly adjust feature matches across thousands of images. However, this method suffers from a limited range of adjustment and may become less accurate when dealing with images exhibiting significant viewpoint changes. To address these challenges, we introduce an efficient two-step matching approach that takes into account errors in the initial matching at a lower resolution and effectively handles large viewpoint changes.

System Overview
We introduce a novel SfM pipeline that efficiently selects appropriate keyframes and calculates camera poses by utilizing rich information from high-resolution, high-frame-rate videos. As depicted in Figure 1, our proposed pipeline comprises three main steps.
In the first step, we begin by downsampling the original high-resolution video to improve efficiency. Then, we estimate the initial camera poses and select keyframes using visual odometry on the downsampled video. The output of this step includes three components: a set of N keyframes, denoted as $I = \{I_1, \ldots, I_N\}$, along with their poses; the corresponding 2D keypoints, $\{p_u\}$; and a view graph (VG), $G = \{V, E\}$, with the absolute rotations ($R^{W}_{I_i}$) as vertices and the relative rotations of image pairs $(I_i, I_j)$ as edges.
In the second step of the pipeline, we upsample the keypoints obtained from visual odometry to match the original resolution of the keyframes. To achieve sub-pixel accuracy, we employ a two-step keypoint adjustment method called TS-KA. TS-KA refines the positions of the upsampled keypoints in a coarse-to-fine strategy. Initially, we utilize the NCC algorithm [19] to roughly adjust the keypoint positions while accounting for view angle changes. Then, we introduce feature-metric optimization for further refinement.
Moving on to the third step, we perform rotation averaging on the VG obtained in the first step. This helps us estimate the global rotation of all keyframes. The obtained global rotation is integrated into BA as a regularization term, reducing cumulative errors. Then, we refine the camera intrinsic parameters, keyframe poses, and sparse point cloud coordinates through rotation-averaged BA. To handle outliers, we incorporate a reprojection error threshold to filter them out. Additionally, we enhance trajectory accuracy to the centimeter level by including GCPs in the BA process.

Initial Pose Estimation
We utilize visual odometry for both keyframe selection and initial scene reconstruction. Given the high-resolution aerial video used in this study, we initially downsample the raw video by a factor of 4. This downsampling step ensures real-time initial camera trajectory estimation. Visual odometry involves the detection and tracking of distinctive features in consecutive camera frames. By matching these features, it estimates the camera's relative motion and selects keyframes that represent significant viewpoints. In our proposed pipeline, we leverage the widely used OpenVSLAM for the initial camera pose estimation. OpenVSLAM includes three modules: tracking, local mapping, and loop closing. The tracking module is primarily responsible for estimating the camera's pose in real time. This module estimates the camera's position and orientation by extracting and tracking feature points from consecutive video frames. It also determines whether to incorporate the current frame as a keyframe into the map based on specific rules. The local mapping module focuses on building and maintaining the map. It uses feature points from keyframes, creates new map points via triangulation, and performs local optimization of the map's structure to enhance its accuracy. The loop closing module detects and handles loop closures. By recognizing revisited images and aligning them with previous map data, the module corrects cumulative navigational errors. More details can be found in [8].

Two-Step Keypoint Adjustment
As the matched points in VSLAM are obtained from 4× downsampled images with limited precision in point coordinates, it is necessary to adjust the keypoint coordinates to sub-pixel accuracy at the original resolution. Inspired by the keypoint adjustment method in Pixel-SfM [20], we propose a simple yet powerful two-step keypoint adjustment approach, TS-KA.
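Before any adjustment, the keypoints detected on the downsampled frames have to be expressed in full-resolution pixel coordinates. The following is a minimal sketch of that mapping, not the authors' implementation: the function name `upsample_keypoints` is hypothetical, and the pixel-center convention (integer coordinates refer to pixel centers) is an assumption; a plain multiplication by the scale factor is an equally plausible convention.

```python
import numpy as np


def upsample_keypoints(kps_low, scale=4.0):
    """Map (x, y) keypoints from a downsampled frame to the full-resolution frame.

    Assumes integer coordinates refer to pixel centers, so a half-pixel offset
    is applied before and after scaling (an assumption, not stated in the text).
    """
    kps_low = np.asarray(kps_low, dtype=np.float64)
    return (kps_low + 0.5) * scale - 0.5


if __name__ == "__main__":
    # A keypoint found at (10, 20) on the 4x downsampled frame.
    print(upsample_keypoints([[10.0, 20.0]]))
```

Whichever convention is used, the mapped position is only accurate to a few full-resolution pixels, which is why the coarse and fine adjustment steps follow.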

Coarse Keypoint Adjustment
The coarse keypoint adjustment in Figure 2 aims to refine the keypoints within the given search area by utilizing a rotation-invariant similarity measure. This adjustment allows for accurate keypoint refinement over a large range. The first step involves determining the reference keypoint, $p_r$, within a track, $\{p_i\}$ ($i = 1, \ldots, N$). A track refers to a collection of N keypoints corresponding to the same 3D world point. We calculate the accumulated matching score for each keypoint in the track, and the keypoint with the highest accumulated matching score is selected as the reference keypoint, $p_r$. The remaining points in the track are then adjusted and matched to $p_r$. Second, we assign a consistent orientation to each keypoint based on local image characteristics; the keypoint can be represented relative to this orientation, thus ensuring invariance to image rotation. For a pixel, $p$, within the search region of a keypoint, $p_i$, we compute the gray centroid, $p_c$, of its NCC window. The NCC window is a circular window with a radius of 15 pixels here. To achieve orientation invariance, we rotate the NCC window based on the angle between $\overrightarrow{p p_c}$ and $\overrightarrow{p_r p_{rc}}$, where $p_{rc}$ represents the gray centroid of $p_r$. Finally, we obtain the best matching points through NCC matching.
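The orientation assignment and NCC search described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: all function names are hypothetical, images are single-channel NumPy arrays, the rotated window is sampled with nearest-neighbour lookup, and the selection of the reference keypoint by accumulated matching score is omitted.

```python
import numpy as np


def circular_offsets(radius):
    """Pixel offsets (xs, ys) of a square grid and the mask of a circular window."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    return xs, ys, (xs ** 2 + ys ** 2) <= radius ** 2


def centroid_angle(image, x, y, radius):
    """Direction from pixel (x, y) to the gray centroid of its circular window."""
    xs, ys, mask = circular_offsets(radius)
    patch = image[y - radius:y + radius + 1, x - radius:x + radius + 1]
    weights = patch.astype(np.float64) * mask
    return np.arctan2((ys * weights).sum(), (xs * weights).sum())


def sample_window(image, x, y, angle, radius):
    """Sample the circular window around (x, y) after rotating it by `angle`."""
    xs, ys, mask = circular_offsets(radius)
    c, s = np.cos(angle), np.sin(angle)
    u = np.clip(np.rint(x + c * xs - s * ys).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.rint(y + s * xs + c * ys).astype(int), 0, image.shape[0] - 1)
    return image[v, u].astype(np.float64)[mask]


def ncc(a, b):
    """Normalized cross-correlation of two equally sized sample vectors."""
    a, b = a - a.mean(), b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0 else float((a * b).sum() / denom)


def coarse_adjust(ref_img, ref_xy, img, init_xy, search=20, radius=15):
    """Search around init_xy for the pixel whose rotated window best matches ref_xy."""
    ref_angle = centroid_angle(ref_img, ref_xy[0], ref_xy[1], radius)
    ref_patch = sample_window(ref_img, ref_xy[0], ref_xy[1], 0.0, radius)
    height, width = img.shape
    best_score, best_xy = -1.0, tuple(init_xy)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = init_xy[0] + dx, init_xy[1] + dy
            if not (radius <= x < width - radius and radius <= y < height - radius):
                continue  # window would leave the image
            # Rotate the candidate window by its orientation relative to the reference.
            rel = centroid_angle(img, x, y, radius) - ref_angle
            score = ncc(ref_patch, sample_window(img, x, y, rel, radius))
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

The exhaustive double loop keeps the sketch readable; its cost grows with the square of the search radius, which is one reason a bounded search region matters on 3840 × 2160 frames.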

Sub-Pixel Refinement
The coarse keypoint adjustment primarily achieves feature matching accuracy at the pixel level. To meet the accuracy requirements of various downstream tasks, it is often necessary to refine keypoints to sub-pixel accuracy. For this purpose, we introduce the featuremetric keypoint adjustment (FKA) method [20]. We first extract a dense feature map of 16 × 16 patches centered on the keypoints using S2DNet [53], and then we treat the refinement of the keypoints, $M(l)$, in a track belonging to the same landmark, $l$, as an energy minimization problem over the featuremetric differences between matched points, where $\omega_{ab}$ represents the confidence between the matched points $p_a$ and $p_b$, according to the similarity of the local features, and $F[\cdot]$ represents the feature map.
It should be noted that the original FKA [20] lacks a coarse adjustment step, which can result in numerous incorrect adjustments. To address this limitation, we have incorporated a coarse adjustment step in our approach. We also add a constraint that bounds the distance between the keypoint position after fine adjustment and its position after coarse adjustment by $K$, where $K$ is set to be lower than the radius, $r$.
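For reference, the featuremetric energy that this step builds on, together with one plausible form of the added constraint, can be written as below. The energy follows the formulation of [20], including its robust norm $\|\cdot\|_{\gamma}$; the symbols $p^{c}_{best}$ and $p^{f}_{best}$ (keypoint positions after coarse and fine adjustment) and the non-strict inequality are assumptions reconstructed from the surrounding text.

```latex
E_{FKA} = \sum_{(a,b) \in M(l)} \omega_{ab}
          \left\| F_a[p_a] - F_b[p_b] \right\|_{\gamma},
\qquad \text{s.t.} \quad
\left\| p^{f}_{best} - p^{c}_{best} \right\| \le K, \quad K < r .
```

Read this way, the constraint keeps the fine step from undoing the coarse match: the featuremetric optimization may only move a keypoint within a small neighbourhood of its coarse position.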

Global Pose Refinement
The trajectory derived from visual odometry often suffers from the drift accumulation problem, leading to significant deviations from the true trajectory. Inspired by [21], in order to enhance the precision of camera pose estimation, we incorporate the global camera pose obtained through rotation averaging as a regularizer into the BA process. Furthermore, when available, we include the GCPs in the BA equations.

Rotation Averaging
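Rotation averaging (RA) is a method utilized for estimating global camera rotations by simultaneously considering pairwise relative rotations. The global rotations are computed by minimizing a cost function that sums, over all edges of the view graph, the squared distance $d^2$ between each measured relative rotation and the relative rotation implied by the corresponding pair of global rotations, where $d$ is measured with the Euclidean norm. However, RA [21] is sensitive to outliers, which may result in inaccurate estimates.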
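The cost function described above is commonly written in the following form. The symbol names and the composition order $R_j R_i^{\top}$ are assumptions, since conventions differ between implementations: $R_i$ denotes the global rotation of keyframe $I_i$, $R_{ij}$ the measured relative rotation of the edge $(I_i, I_j)$, and $d$ the Euclidean (chordal) distance between two rotation matrices.

```latex
\min_{R_1, \ldots, R_N} \; \sum_{(i,j) \in E}
d^{2}\!\left( R_{ij},\; R_j R_i^{\top} \right),
\qquad
d\!\left( A, B \right) = \left\| A - B \right\|_{F} .
```

Because every edge contributes its squared residual with equal footing, a single grossly wrong relative rotation can dominate the sum, which is the outlier sensitivity noted in the text and the motivation for filtering the view graph beforehand.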
Before performing RA, it is necessary to construct a view graph whose edges are pairs of matched images. To avoid starting image matching from scratch, we leverage the co-visibility data derived from VO, as outlined in Section 3.2, transforming it into a view graph with candidate edges. Then, we assign higher weights to edges with more visible points and a more uniform distribution of matches. Concurrently, we prune edges from the view graph under the following conditions: (1) the number of matches falls short of the predetermined threshold, $N_m$; or (2) the angular error for a given edge exceeds a specified threshold, $\sigma$, where the error is measured against $I_3$, the 3 × 3 identity matrix. $\sigma$ is set to 0.01.
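The pruning rule can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the edge tuple layout, and the composition order of the rotations are hypothetical, the error is taken to be the Frobenius deviation of the residual rotation from $I_3$, and edges are assumed to be discarded when that error is larger than the threshold.

```python
import numpy as np


def rot_z(theta):
    """Rotation by `theta` radians about the z axis (used only for the demo)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def edge_error(R_i, R_j, R_ij):
    """Deviation from identity of the residual between a measured relative
    rotation R_ij and the one implied by the absolute rotations R_i, R_j."""
    residual = R_ij.T @ R_j @ R_i.T  # identity when the edge is consistent
    return float(np.linalg.norm(residual - np.eye(3), ord="fro"))


def prune_edges(rotations, edges, min_matches, sigma=0.01):
    """Keep edges (i, j) with enough matches and a small rotation residual.

    `edges` is a list of (i, j, R_ij, n_matches) tuples (hypothetical layout).
    """
    kept = []
    for i, j, R_ij, n_matches in edges:
        if n_matches < min_matches:
            continue  # condition (1): too few matches
        if edge_error(rotations[i], rotations[j], R_ij) > sigma:
            continue  # condition (2): inconsistent relative rotation
        kept.append((i, j))
    return kept


if __name__ == "__main__":
    rotations = [np.eye(3), rot_z(0.10), rot_z(0.20)]
    edges = [
        (0, 1, rot_z(0.10), 120),  # consistent, well supported
        (1, 2, rot_z(0.50), 150),  # inconsistent relative rotation
        (0, 2, rot_z(0.20), 5),    # too few matches
    ]
    print(prune_edges(rotations, edges, min_matches=30))
```

Since the candidate edges come from the co-visibility graph that VO has already built, this filter only touches a small list of matrices and adds negligible cost compared with re-matching images.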

Rotation Averaged Bundle Adjustment
Rotation-averaged BA is conducted to optimize camera poses, 3D points, and camera intrinsics. Since the observations are independent, the trajectory estimated by RA does not accumulate errors. Therefore, it can serve as a regularizer in BA to mitigate drift in the initial trajectory. The objective function for this optimization combines a reprojection term and a known rotation term, where $\rho(\cdot)$ is the loss function, and, in this paper, the Huber loss function $\rho(x) = \frac{x^2}{x^2 + \delta^2}$ is used. $\omega_{i,j}$ represents the weight value for the known rotation term, $r'_{i,j}$. The two terms are explained as follows:
• Reprojection term: this term represents the reprojection error corresponding to all tie points in bundle adjustment, where $C$ represents the intrinsic matrix, and $P^{W}_{l}$ is the 3D coordinate of a world point. Additionally, GCPs can be included in the reprojection term, where $P^{G}_{l}$ denotes the position of a GCP in the geodetic coordinate system, and $R^{W}_{G}$, $t^{W}_{G}$, and $s^{W}_{G}$ represent the rotation matrix, the translation, and the scale between the world and geodetic coordinates. A trajectory with centimeter-level accuracy can be obtained by introducing the GCP term into BA.

• Known rotation term: this term is used as a regularizer to reduce the accumulated error. Its residual compares the estimated rotation with the global rotation obtained from rotation averaging through log, the logarithm mapping from the special orthogonal group SO(3) to the Lie algebra so(3).

The dataset consists of two aerial video sequences captured using the DJI M300 RTK drone with the DJI P1 camera, both manufactured by DJI in Shenzhen, China. Figure 3 illustrates the two sequences: one with a regular strip configuration and the other with an irregular configuration. These videos were recorded at the Informatics Department of Wuhan University at an altitude of 200 m. The recording frequency was set at 60 frames per second (fps), with a resolution of 3840 × 2160 pixels. The average ground resolution achieved was 0.03 m.
The regular sequence is a 1379 s video that contains evenly distributed flight strips across the area. The overlap between adjacent strips is set to 40%. This sequence covers an area of 860 × 460 m² and contains buildings, trees, and playgrounds. In contrast, the irregular sequence is a 345 s video that follows a heart-shaped loop trajectory. This sequence includes wooded areas, buildings, and a lake, with numerous regions of repeated texture, making it more challenging than the regular sequence.
Additionally, we collected 16 GCPs that are evenly distributed throughout the dataset area.Some of these GCPs were utilized to compute a high-accuracy trajectory, while the remainder served as checkpoints to assess the accuracy of the trajectory.The GCPs were measured using a high-accuracy GPS receiver and processed to achieve a localization accuracy of 9.0 mm.

Metrics
We use check points error (CPE) and absolute trajectory error (ATE) for evaluation.
• Check points error: the accuracy of triangulation is evaluated using surveyed points called check points (CPs) that were not used for georeferencing. Given a check point with coordinates $P_l = (P_l(x), P_l(y), P_l(z))$, the root mean square errors (RMSEs) for the plane ($\delta_{xy}$), elevation ($\delta_z$), and pixel ($\delta_p$) errors are evaluated over the $m$ CPs.
• Absolute trajectory error: ATE is utilized to assess the drift in the position and rotation of the estimated trajectory. The estimated trajectory is aligned with the ground-truth trajectory using Umeyama's method [54], and the RMSEs for the position ($\delta_{pos}$) and rotation ($\delta_{rot}$) are evaluated on the aligned poses.
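The two position-based metrics can be sketched as follows. This is a minimal sketch under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, the planar error is assumed to combine the x and y residuals of each check point, and only the position part of ATE is shown. The alignment follows the closed-form similarity estimate of Umeyama's method [54].

```python
import numpy as np


def check_point_rmse(estimated, surveyed):
    """Planar and elevation RMSE over m check points given as (m, 3) arrays."""
    d = np.asarray(estimated, dtype=np.float64) - np.asarray(surveyed, dtype=np.float64)
    delta_xy = np.sqrt(np.mean(d[:, 0] ** 2 + d[:, 1] ** 2))
    delta_z = np.sqrt(np.mean(d[:, 2] ** 2))
    return float(delta_xy), float(delta_z)


def umeyama(src, dst):
    """Similarity transform (s, R, t) minimizing sum ||dst - (s R src + t)||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((xs ** 2).sum() / len(src))
    t = mu_d - s * R @ mu_s
    return s, R, t


def ate_position_rmse(estimated, ground_truth):
    """Position RMSE after aligning the estimated trajectory to the ground truth."""
    estimated = np.asarray(estimated, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    s, R, t = umeyama(estimated, ground_truth)
    aligned = s * estimated @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - ground_truth) ** 2, axis=1))))
```

The alignment removes the arbitrary scale and reference frame of a monocular reconstruction, so the remaining error reflects drift rather than the choice of coordinate system.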

Results
In our method, the video input is set to 6 fps, and a downsampling rate of 4× is applied.The search radius for the two-step keypoint adjustment is set to 20 pixels.In contrast, the other methods utilize the original-scale videos as input.However, for methods like COLMAP and Theia that require temporally sampled keyframes, we employ two strategies.One strategy involves sampling the video images every one second.The other strategy involves using our method, as described in Section 3.2, which entails applying OpenVSLAM on the 4× downsampled videos to obtain keyframes.
We evaluate our method against the SfM methods COLMAP [3] and Theia [4], as well as the VSLAM method OpenVSLAM [8], on the collected dataset, considering scenarios both with and without GCPs.
Performance on our collected dataset: As shown in Table 1, our pipeline demonstrates significant improvements in accuracy when compared with COLMAP [3], Theia [4], and OpenVSLAM [8]. Specifically, in the regular sequence, our method outperforms the second-best method, OpenVSLAM, with improvements of 4.8 cm in $\delta_{xy}$ (a relative improvement of 137%), 1.7 cm in $\delta_z$ (a relative improvement of 40%), 0.12 pixels in $\delta_p$ (a relative improvement of 7%), as well as 0.7 m in $\delta_{pos}$ (a relative improvement of 175%), and 0.25° in $\delta_{rot}$ (a relative improvement of 178%).
Our method also exhibits significant improvements over other methods in all metrics for the irregular sequence. Compared with OpenVSLAM [8], the second-best performer, our method achieves improvements of 0.7 cm in $\delta_{xy}$ (a relative improvement of 14%), 6.2 cm in $\delta_z$ (a relative improvement of 293%), and 0.82 pixels in $\delta_p$ (a relative improvement of 103%). Additionally, our method demonstrates improvements of 0.24 m in $\delta_{pos}$ (a relative improvement of 48%) and 0.26° in $\delta_{rot}$ (a relative improvement of 96%). These results indicate that our pipeline effectively enhances reconstruction robustness and yields a more accurate scene structure.
It is worth noting that when COLMAP [3] and Theia [4] are initialized with our selected keyframes, the accuracy improves across all metrics in both sequences compared with per-second sampling. This suggests that SLAM-based methods can effectively provide keyframes for subsequent SfM methods.
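Relative improvements above 100% are only possible if the gain is expressed relative to our own error rather than the baseline error. The short sketch below makes that arithmetic explicit; the definition (baseline − ours) / ours is an assumption inferred from the reported percentages, the function name is hypothetical, and the resulting error values are implied by that assumption rather than taken from Table 1.

```python
def implied_errors(absolute_gain, relative_gain):
    """Errors implied by an absolute gain and a relative gain.

    Assumes relative_gain = (baseline - ours) / ours, which is an inferred
    definition and not one stated in the text.
    """
    ours = absolute_gain / relative_gain
    baseline = ours + absolute_gain
    return ours, baseline


if __name__ == "__main__":
    # Planar error on the regular sequence: 4.8 cm gain, 137% relative gain.
    ours, baseline = implied_errors(4.8, 1.37)
    print(f"ours ~ {ours:.2f} cm, second-best ~ {baseline:.2f} cm")
```

Under this reading, a 137% relative improvement means the second-best error is roughly 2.4 times ours.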
Table 1 presents a comparison of the efficiency of different methods. Our method requires the least amount of time compared with other methods, yielding a remarkable 200% enhancement over COLMAP [3], a 100% to 200% enhancement over OpenVSLAM [8], and a 50% to 100% improvement over Theia [4] for the regular sequence.
Performance on the EuRoC MAV dataset: We test our proposed method on the small-scale and low-resolution EuRoC MAV Dataset [11], which consists of 11 sequences categorized into easy, medium, and difficult classes based on illumination and camera motion. In our method, we did not downsample the sequences from the EuRoC MAV Dataset since they already have a resolution of only 752 × 480 pixels. Additionally, the search radius for the two-step keypoint adjustment is set to 4 pixels.
In Table 2, we provide the $\delta_{pos}$ results. Given the small scale of the EuRoC sequences, our method shows slight improvements compared with OpenVSLAM [8]. For most sequences, our method delivers either better or comparable results to the state-of-the-art methods. Notably, COLMAP [3] demonstrates competitive accuracy with our method, but our approach is noticeably more efficient, as seen in Table 3. Our method also outperforms Theia [4] in terms of efficiency, except for sequences V102, V103, and V203. However, it is worth noting that, in sequences V103 and V203, Theia exhibits significantly lower accuracy compared with our method.

Ablation Experiment
We perform several ablation experiments on the collected dataset. Figure 4 illustrates the results for all metrics under different settings.
TS-KA: as shown in Figure 4, the incorporation of TS-KA improves accuracy by a factor of 2 to 5 for all metrics when keypoints are extracted from downsampled images. Even when keypoints are obtained at the original scale, TS-KA still enhances matching performance, particularly for $\delta_{rot}$. This emphasizes the significance of adjusting keypoints prior to global refinement.
Rotation averaging: in the regular sequence, the introduction of global averaged rotations into BA results in a slight improvement for all metrics. However, in the case of the irregular sequence, there can be a slight decrease in accuracy for certain metrics like $\delta_{pos}$ and $\delta_{rot}$ at the original scale. This can be attributed to the fact that the view graph of the regular scene is denser, which necessitates the use of rotation averaging. Conversely, in the irregular sequence, the scene may have a sparser view graph, making the global averaged rotations less beneficial.
Accuracy vs. efficiency: Figure 5 shows the breakdown of the time taken by each component of the total reconstruction in the regular scene. The coarse adjustment takes about the same time as the fine keypoint adjustment but yields a significantly larger improvement in accuracy, according to Figure 4. Therefore, in an efficiency-first scenario, it is preferable to remove the fine adjustment component [20] rather than the coarse adjustment.
TS-KA vs. FKA [20]: we further compare our TS-KA with the featuremetric keypoint adjustment (FKA) [20] for SfM tasks in six outdoor sequences from the ETH3D benchmark [15]. This benchmark provides ground-truth camera poses, intrinsic parameters, and highly accurate dense point clouds. To evaluate the matching effect, we follow the protocol introduced in [52]. We reconstruct a 3D sparse model using COLMAP [3], with fixed camera intrinsics and poses provided by the authors. We use four different features, namely SIFT [43] and the learning-based SuperPoint [45], D2-Net [47], and R2D2 [46], to extract feature points at the original scale.

The results of applying the two keypoint refinement methods on different feature points are presented in Table 4. Our method consistently achieves better accuracy and completeness than FKA [20] across all feature points in almost all cases. This consistent improvement confirms that our TS-KA method offers superior keypoint alignment.

Table 4. Results of 3D sparse reconstruction using our TS-KA or FKA [20] on different feature point extractors. We use the metrics "accuracy" and "completeness" for thresholds of 1 cm, 2 cm, and 5 cm, as defined in [55].

Figure 6 provides samples of feature point refinement, showcasing the ability of our method to adjust feature points in multi-view images to their correct positions. In comparison, FKA [20] is capable of correctly adjusting points under small view angle changes, as demonstrated in the first and second rows. However, when faced with significant variations in view angles, as shown in the third row, FKA tends to produce a larger number of incorrect keypoints. This discrepancy largely contributes to the comparatively poorer performance of FKA, as evident in Table 4.
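The accuracy and completeness figures reported in Table 4 can be illustrated with the sketch below. It assumes the usual reading of these metrics, namely that accuracy is the fraction of reconstructed points lying within a distance threshold of the ground-truth cloud and completeness is the fraction of ground-truth points lying within the threshold of the reconstruction; the exact protocol is the one defined in [55], and the function names and brute-force nearest-neighbour search here are illustrative only.

```python
import numpy as np


def nearest_distances(query, reference):
    """Distance from each query point to its nearest reference point (brute force)."""
    diff = query[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)


def accuracy_completeness(reconstruction, ground_truth, threshold):
    """Fractions of points within `threshold` (same unit as the coordinates)."""
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    accuracy = float((nearest_distances(reconstruction, ground_truth) <= threshold).mean())
    completeness = float((nearest_distances(ground_truth, reconstruction) <= threshold).mean())
    return accuracy, completeness


if __name__ == "__main__":
    gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
    rec = np.array([[0.0, 0.0, 0.005], [1.0, 0.0, 0.03]])
    for thr in (0.01, 0.02, 0.05):  # 1 cm, 2 cm, 5 cm with coordinates in meters
        print(thr, accuracy_completeness(rec, gt, thr))
```

The brute-force search is quadratic in the number of points, so a spatial index would be needed for dense ground-truth clouds such as those in ETH3D.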

Conclusions
This paper introduces an efficient SfM pipeline for processing high-resolution, large-size videos. The pipeline utilizes visual odometry to select keyframes and obtain initial camera poses and reconstruction results efficiently by operating on downsampled video

Figure 1. System overview. (1) The initial scene structure is obtained from 4× downsampled video using visual odometry. (2) Keypoints are refined on full-resolution keyframes by a two-step keypoint adjustment method. Red: original matching points; green: matching points after coarse keypoint adjustment; blue: matching points after sub-pixel refinement. (3) Global rotation is obtained by rotation averaging, and scene structure is finally refined using rotation-averaged bundle adjustment.

Figure 2. Coarse keypoint adjustment.(1) The reference keypoint is selected as the one having the highest score.(2) Each keypoint is assigned with a consistent orientation.(3) Best matching points (green ones) are obtained using NCC.

Figure 3. Visualization of the dataset. (a) Regular scene. (b) Irregular scene. Red lines represent the trajectory of the drone.

Figure 4. RMSE results.Baseline: bundle adjustment applied once at the original scale, and parameters and keyframes are initialized with OpenVSLAM on 4× downsampled video.DS: downsample; RA: rotation average; BA: global bundle adjustment; C: coarse keypoint adjustment; F: fine keypoint adjustment.

Figure 6.The comparison of keypoint adjustment methods.For each keypoint, we select three views.View 1 and View 2 have similar capture angles, whereas the viewing angle of View 3 varies significantly from them.For each view, we demonstrate the matching positions using different keypoint adjustment methods.