Article

Vision-Only Localization of Drones with Optimal Window Velocity Fusion

Department of Artificial Intelligence, Daegu University, Gyeongsan 38453, Republic of Korea
Electronics 2026, 15(3), 637; https://doi.org/10.3390/electronics15030637
Submission received: 19 December 2025 / Revised: 27 January 2026 / Accepted: 29 January 2026 / Published: 2 February 2026

Abstract

Drone localization is essential for tasks such as navigation, autonomous flight, and object tracking. However, it is challenging when satellite signals are unavailable. This paper addresses database-free vision-only localization of flying drones using optimal window template matching and velocity fusion. Assuming the ground is flat, multiple optimal windows are derived from a piecewise linear segment (regression) model of the image-to-real-world conversion function. Each optimal window is used as a fixed-region template to estimate the instantaneous velocity of the drone. The multiple velocities obtained from the multiple optimal windows are integrated by a hybrid fusion rule: a weighted average for lateral (sideways) velocities and a winner-take-all decision for longitudinal velocities. In the experiments, a drone performed a total of six medium-range (800 m to 2 km round trip), high-speed (up to 14 m/s) maneuvering flights in rural and urban areas. The flight maneuvers include forward-backward runs, zigzags, and banked turns. Performance was evaluated by the root mean squared error (RMSE) and drift error between the GNSS-derived ground-truth trajectories and the rigid-body-rotated vision-only trajectories. Four fusion rules (simple average, weighted average, winner-take-all, and hybrid fusion) were evaluated, and the hybrid fusion rule performed best. The proposed video-stream-based method achieves flight errors ranging from a few meters to tens of meters, corresponding to a few percent of the flight length.

1. Introduction

The applications of unmanned aerial vehicles (UAVs) have expanded dramatically [1,2,3]. Determining a drone’s position is crucial for performing high-level tasks. Drones typically estimate their position by integrating external sensors, such as global navigation satellite systems (GNSSs), with internal sensors, such as inertial measurement units (IMUs) [4,5]. However, GNSS signals can be absent, blocked, or intentionally jammed. In these GNSS-denied environments, accurate localization is very challenging [5]. Without an absolute position reference, drones must rely on alternative sensing methods to infer their positions, often requiring additional onboard sensing or external infrastructure [6,7]. One prominent solution to this problem is visual-inertial odometry (VIO), which fuses data from cameras and IMUs [8,9,10].
One vision-based approach is the template-based method, which determines the drone’s movements by matching the current frame with a template from a previous or reference frame. Classical template matching methods are suitable for estimating small translational displacements between frames when lighting is stable and there are no rotational or scale changes [11]. Since template selection is typically scene- or object-dependent, incorrect template updates can lead to accumulated errors [12]. A robust illumination-invariant localization method was proposed for UAV navigation [13]. In [14], a multi-region scene matching-based method was proposed for automated navigation of UAVs. Fourier-based image phase correlation was adopted for absolute velocity estimation of UAVs [15]. However, [13,14] utilize pre-stored reference images, and [15] fuses data from onboard sensors. Various matching techniques for UAV navigation are surveyed in [16].
This paper addresses database-free vision-only localization of drones for medium-range (800 m to 2 km round trip) and high-speed (up to 14 m/s) maneuvering flights. The database-free video stream-only localization method proposed in this paper offers the advantages of low cost, light weight, and energy efficiency without using multiple sensor devices and infrastructure. Furthermore, it does not require the collection or updating of maps or reference data. When the camera is pointed at a flat surface, pixel coordinates can be converted to unique real-world coordinates using ray optics [17,18,19]. However, this conversion produces non-uniform spacing in the real-world coordinate system. The optimal windows were proposed to overcome this non-uniform spacing [19]. A piecewise linear segment (regression) model [20,21] determines the vertical side of the optimal window. This model minimizes total least-squares error over all linear segments that approximate the conversion function. As a result, a frame is divided into several non-overlapping windows, which function as fixed area templates. The optimal window template matching is performed between consecutive frames to minimize the normalized sum of squared differences (NSSD) [22], which estimates the instantaneous velocity. In [19], this technique combined with a state estimator achieves errors of several meters on short-range (about 150 m) flights.
This approach is scene- or object-independent and requires no template updates, as the optimal window is a fixed region spanning the entire frame. Furthermore, this fixed-region template matching method searches for only a small, consistent region, rather than the entire image, providing a simple and fast hardware solution.
In this paper, the optimal window template matching technique is significantly improved through velocity fusion. A hybrid fusion rule is proposed that treats the minimum detectable velocity (MDV) as the velocity resolution. The weights used in hybrid fusion are based on the improved conversion function. This hybrid fusion rule applies a weighted average to lateral velocities and a winner-take-all decision to longitudinal velocities. In addition, a zero-order hold scheme is applied to reduce the computational load.
In the experiments, a multirotor drone (DJI Mavic 2 Enterprise Advanced [23]) performed medium-range flights including high-speed maneuvers in rural and urban areas. The rural paths have flat and simple terrain, while the urban paths have more dynamic and complex terrain. In rural areas, four paths consist of a straight-forward-backward flight, a zigzag forward-backward flight (snake path), a squared path with three banked turns, and a free flight including both banked turns and zigzags. In urban areas, two paths are a straight outbound flight and a straight-forward-backward flight.
For performance evaluation, GNSS-derived trajectories are used as ground-truths. The GNSS trajectories are projected into a local North-East-Down (NED) Cartesian frame using the WGS-84 ellipsoid model for spatial consistency [24,25]. To compensate for the initial orientation offset between the vision-only trajectory and the global reference, a rigid-body rotation is applied to align the vision-only trajectory with the GNSS-derived ground-truth trajectory. The root mean square error (RMSE) and drift error are calculated at three intermediate points and one final point. Four fusion rules (simple average, weighted average, winner-take-all, and hybrid fusion) are evaluated, with the hybrid fusion rule demonstrating the best performance. It will be shown that the proposed method achieves errors ranging from a few meters to tens of meters. These errors mainly stem from the flat ground assumption and from the drone’s mechanical system, both of which are thoroughly examined in Section 5. Figure 1 shows a block diagram of the proposed method: ellipses indicate signals/data, rectangles indicate processing blocks, and arrows indicate the direction of signal flow. The block diagram separates the two primary processes, localization and performance evaluation, into their own encompassing boxes.
The contributions of this study are as follows: (1) The process of converting images to real-world coordinates has been improved, enabling more precise velocity resolution analysis. (2) Multiple velocities observed through multiple window matching were fused based on velocity resolution. A hybrid fusion rule was proposed that applies a weighted average to sideways velocities and a winner-take-all decision to longitudinal velocities. (3) The robustness of the system was verified through high-speed maneuvering flights. Furthermore, the estimated trajectory was verified with the ground-truth derived from GNSS signals.
The rest of this paper is organized as follows: the optimal window template matching is presented in Section 2. The velocity fusion rules and performance evaluation are described in Section 3. Section 4 demonstrates experimental results. Discussion and Conclusions follow in Section 5 and Section 6, respectively.

2. Optimal Windows for Template Matching

This section describes how the optimal windows are derived from the improved image-to-position conversion function.

2.1. Improved Image-to-Position Conversion

The image-to-position conversion [18,19] applies trigonometry to compute real-world coordinates from pixel coordinates when the camera’s angular field of view (AFOV), elevation, and tilt angle are known. It is assumed that the ground is flat and the camera rotates only around the pitch axis. The improved conversion function is as follows:
x_{ij} = \sqrt{h^2 + y_j^2} \, \tan\!\left( \left( i - \frac{W}{2} + 1 \right) \frac{a_x}{W} \right), \quad i = 0, \ldots, W-1, \qquad (1)
y_j = h \, \tan\!\left( \theta_T + \left( \frac{H}{2} - j \right) \frac{a_y}{H} \right), \quad j = 0, \ldots, H-1, \qquad (2)
where W and H are the image sizes in the horizontal and vertical directions, respectively, h is the altitude of the drone (the elevation of the camera), a_x and a_y are the AFOVs in the horizontal and vertical directions, respectively, and θ_T is the tilt angle of the camera. It is noted that the horizontal and vertical directions in the image coincide with the lateral (sideways) and longitudinal directions of the drone body frame, respectively. Figure 2a,b visualize the coordinate conversion function in the horizontal and vertical directions, respectively: W and H are set to 3840 and 2160 pixels, respectively; a_x and a_y are set to 68° and 42°, respectively; h is set to 40 m; θ_T is set to 60°. In Figure 2a, eight nonlinear conversion functions are plotted by varying the vertical index j from 1 to 2101 in 300-pixel intervals. In [18,19], \sqrt{h^2 + y_j^2} was approximated by \sqrt{h^2 + y_{H/2}^2}; thus, the horizontal conversion function was independent of the row index of the image. As shown in Figure 2b, the vertical conversion function exhibits a rapidly decreasing nonlinearity, causing non-uniform spacing in the real-world coordinate system. Due to this nonlinearity, pixel-domain displacement cannot accurately measure real-world displacement.
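The conversion function above can be sketched in a few lines of Python. This is a minimal illustration under the flat-ground, pitch-only geometry; the function name `image_to_position` and the vectorized layout are assumptions, not the authors’ implementation:

```python
import numpy as np

def image_to_position(W=3840, H=2160, h=40.0,
                      ax=np.deg2rad(68.0), ay=np.deg2rad(42.0),
                      tilt=np.deg2rad(60.0)):
    """Map pixel indices to flat-ground positions (meters).

    The longitudinal coordinate y_j depends only on the row index j;
    the lateral coordinate x_ij is scaled by the slant range at row j."""
    j = np.arange(H)
    # Vertical direction: the ray angle decreases as j moves down the frame.
    y = h * np.tan(tilt + (H / 2 - j) * ay / H)            # shape (H,)
    i = np.arange(W)
    # Horizontal direction: scale by the slant range sqrt(h^2 + y_j^2).
    slant = np.sqrt(h ** 2 + y ** 2)                       # shape (H,)
    x = slant[:, None] * np.tan((i - W / 2 + 1) * ax / W)  # shape (H, W)
    return x, y
```

With the experimental settings (40 m altitude, 60° tilt), the center row maps to y = 40·tan 60° ≈ 69.3 m, and the per-row spacing grows rapidly toward the top of the frame, which is exactly the non-uniformity the optimal windows compensate for.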

2.2. Optimal Windows

The optimal window aims to ensure uniform spacing in the real-world coordinates. A piecewise linear segment model is adopted to approximate the nonlinear conversion function as multiple line segments. Therefore, the discrete pixel domain is partitioned into multiple intervals, and the breakpoints between each interval are determined to minimize total least-squares error (total sum of squared errors) as follows:
\{\hat{s}_1, \ldots, \hat{s}_{N-1}\} = \arg\min_{s_1, \ldots, s_{N-1}} \sum_{n=0}^{N-1} \min_{a_n, b_n} \sum_{j=s_n}^{s_{n+1}-1} \left( y_j - a_n j - b_n \right)^2, \qquad (3)
where s_1, …, s_{N−1} are the N − 1 breakpoints for N segments, s_0 and s_N are equal to 0 and the image size in the vertical direction, respectively, and a_n and b_n are the coefficients of the n-th linear regression line [20,21]. It is noted that N is equal to the number of windows. The length of each interval is equal to the height of the corresponding optimal window. The number of windows can be determined empirically. Too many windows (i.e., too few sampling points (pixels) per window) can result in inaccurate matching, while too few windows cannot compensate for the non-uniform spacing of the nonlinear conversion function. In the experiments, the frame was cropped by 90 pixels near the edges to remove distortions that might occur during image capture; thus, the optimal windows tile an area of 3660 × 1980 pixels. The size of the n-th window is 3660 × (s_n − s_{n−1}) pixels. Figure 3 shows three piecewise linear segment models for N = 10, 12, and 15. Figure 4 shows the optimal windows corresponding to Figure 3 in a sample frame. The total sums of squared errors minimized according to Equation (3) are 55.30, 26.85, and 11.11, respectively. The experimental results will be presented with 12 optimal windows, the number of windows that produced the best results.
The computational complexity of exhaustively searching all possible piecewise linear models grows exponentially with the number of segments. Dynamic programming [26,27] efficiently determines the optimal linear segments while significantly reducing the computational burden. This approach first precomputes the least-squares fitting error for all candidate line segments. It then determines the optimal two-segment solutions: in this step, the least partial sums of squared errors for two segments are calculated over the intervals (0, 2), (0, 3), …, (0, H − 1). Then n is increased by one, and the optimal solution for n segments is calculated using the least partial sums of squared errors obtained in the previous step. This process continues until n reaches N.
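The dynamic-programming search described above can be sketched as follows. This is a simplified illustration, not the authors’ implementation; `segment_breakpoints` is a hypothetical name, and the single-segment fitting errors are evaluated in O(1) from cumulative sums, giving an overall O(N·H²) search:

```python
import numpy as np

def segment_breakpoints(yj, N):
    """Breakpoints of the least-total-SSE N-segment piecewise linear fit."""
    H = len(yj)
    t = np.arange(H, dtype=float)
    # Cumulative sums so that any interval's line-fit SSE is O(1) to query.
    c1 = np.concatenate(([0.0], np.cumsum(np.ones(H))))
    ct = np.concatenate(([0.0], np.cumsum(t)))
    ct2 = np.concatenate(([0.0], np.cumsum(t * t)))
    cy = np.concatenate(([0.0], np.cumsum(yj)))
    cty = np.concatenate(([0.0], np.cumsum(t * yj)))
    cy2 = np.concatenate(([0.0], np.cumsum(yj * yj)))

    def sse(a, b):
        """SSE of the least-squares line y = m*t + c over indices a..b-1."""
        n = c1[b] - c1[a]
        st, st2 = ct[b] - ct[a], ct2[b] - ct2[a]
        sy, sty, sy2 = cy[b] - cy[a], cty[b] - cty[a], cy2[b] - cy2[a]
        den = n * st2 - st * st
        if den == 0:
            return 0.0          # a single point fits any line exactly
        m = (n * sty - st * sy) / den
        c = (sy - m * st) / n
        return (sy2 - 2 * m * sty - 2 * c * sy
                + m * m * st2 + 2 * m * c * st + n * c * c)

    INF = float("inf")
    cost = np.full((N + 1, H + 1), INF)   # cost[n, b]: best SSE, n segs on [0, b)
    prev = np.zeros((N + 1, H + 1), dtype=int)
    cost[0, 0] = 0.0
    for n in range(1, N + 1):
        for b in range(n, H + 1):
            for a in range(n - 1, b):
                c_ab = cost[n - 1, a] + sse(a, b)
                if c_ab < cost[n, b]:
                    cost[n, b], prev[n, b] = c_ab, a
    # Backtrack the breakpoints s_{N-1}, ..., s_1 (s_0 = 0 is dropped).
    bps, b = [], H
    for n in range(N, 0, -1):
        b = prev[n, b]
        bps.append(b)
    return sorted(bps)[1:]
```

For the paper’s H = 2160 and N = 12, the cumulative-sum trick is what keeps this tractable compared with refitting every candidate segment from scratch.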
Each window region in the image serves as a fixed-region template to generate pixel displacements between consecutive frames. The velocities in the lateral and longitudinal directions of the drone are obtained by minimizing the NSSD between frames [19,22]. The MDV at pixel (i, j) is calculated as δx_i(j) = |x_{i+1,j} − x_{i,j}|/T in the lateral direction and δy_j = |y_{j+1} − y_j|/T in the longitudinal direction, where T is the frame period. Figure 5a,b show the horizontal and vertical MDV components, respectively, for T = 1/30 s and N = 12. The j values in Figure 5a are the centers of the 12 optimal windows in the vertical direction. As shown in Figure 5b, the MDV in the vertical direction decreases sharply.
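A brute-force version of this fixed-window matching can be sketched as follows. The sketch is illustrative only: the search radius, the `win` bounds, and the function name are assumptions, and the score follows the normalized squared-difference form of OpenCV’s TM_SQDIFF_NORMED (cf. [22]). The resulting pixel displacement, multiplied by the local metric spacing and divided by T, gives the instantaneous velocity:

```python
import numpy as np

def nssd_displacement(prev_frame, curr_frame, win, search=8):
    """Pixel displacement of a fixed window between consecutive frames.

    win = (top, bottom, left, right) bounds the window template inside
    prev_frame; it is matched against curr_frame over a +/- search pixel
    region by minimizing the normalized SSD."""
    t0, t1, l0, l1 = win
    tmpl = prev_frame[t0:t1, l0:l1].astype(float)
    best, best_dy, best_dx = np.inf, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr_frame[t0 + dy:t1 + dy, l0 + dx:l1 + dx].astype(float)
            num = np.sum((tmpl - cand) ** 2)
            den = np.sqrt(np.sum(tmpl ** 2) * np.sum(cand ** 2))
            score = num / den if den > 0 else np.inf
            if score < best:
                best, best_dy, best_dx = score, dy, dx
    return best_dy, best_dx
```

Because the template is a fixed region rather than a tracked object, no template update is needed between frames, which is the property the text emphasizes.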

3. Velocity Fusion and Performance Evaluation

This section describes how to fuse the velocities of multiple optimal windows, and a zero-order hold scheme to reduce the matching counts. Performance evaluation using GNSS trajectories is also presented at the end of the section.

3.1. Hybrid Fusion Rule

It is assumed that the MDV is equal to the velocity resolution. The lateral MDV of an optimal window is set to the MDV at the window center, while the longitudinal MDV of an optimal window is constant within each window because the vertical conversion function is linearized by the piecewise linear segment model. The weight of each window is calculated based on its resolution as
w_n^x = \frac{1/\sigma_{nx}^2}{\sum_{n=1}^{N} 1/\sigma_{nx}^2} = \frac{1/\Delta_{nx}^2}{\sum_{n=1}^{N} 1/\Delta_{nx}^2} = \frac{1/\delta_{nx}^2}{\sum_{n=1}^{N} 1/\delta_{nx}^2}, \qquad (4)
w_n^y = \frac{1/\sigma_{ny}^2}{\sum_{n=1}^{N} 1/\sigma_{ny}^2} = \frac{1/\Delta_{ny}^2}{\sum_{n=1}^{N} 1/\Delta_{ny}^2} = \frac{1/\delta_{ny}^2}{\sum_{n=1}^{N} 1/\delta_{ny}^2}, \qquad (5)
where σ_{nx}² and σ_{ny}² are the velocity variances of the n-th optimal window in the horizontal (lateral) and vertical (longitudinal) directions, respectively, Δ_{nx} and Δ_{ny} are the velocity resolutions of the n-th optimal window in the horizontal and vertical directions, respectively, and δ_{nx} and δ_{ny} are the MDVs of the n-th optimal window in the horizontal and vertical directions, respectively. Assuming the quantization error is uniformly distributed, the variance is proportional to the squared resolution as σ_{nx}² = Δ_{nx}²/12 and σ_{ny}² = Δ_{ny}²/12.
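Under this inverse-variance weighting the 1/12 factor cancels, so the weights can be computed directly from the per-window MDVs. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def fusion_weights(mdv):
    """Inverse-variance weights from per-window MDVs.

    With uniformly distributed quantization error, sigma^2 = delta^2 / 12,
    so the common 1/12 factor cancels and w_n is proportional to 1/delta_n^2."""
    inv = 1.0 / np.asarray(mdv, dtype=float) ** 2
    return inv / inv.sum()
```

A window with half the MDV of another therefore receives four times the weight, which is why the bottom windows of the frame (finest longitudinal resolution) dominate in Figure 6.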
The lateral (sideways) velocities are fused by a weighted average as follows
v_x(k) = \sum_{n=1}^{N} w_n^x \, v_n^x(k), \qquad (6)
where v_n^x(k) is the lateral velocity obtained from the n-th optimal window template matching between frames k and k − 1. Equation (6) is the weighted least-squares estimator [28,29]. As shown in Figure 5b, the resolution of the longitudinal velocity varies significantly. Therefore, the longitudinal velocity fusion adopts a winner-take-all strategy:
v_y(k) = \sum_{n=1}^{N} w_n^{y*} \, v_n^y(k) = v_N^y(k), \qquad (7)
w_n^{y*} = \mathbf{1}\!\left( n = \arg\max_{i=1,\ldots,N} w_i^y \right), \qquad (8)
where v_n^y(k) is the longitudinal velocity obtained from the n-th optimal window template matching between frames k and k − 1, and 1(·) denotes the indicator function, which returns 1 if the condition inside the parentheses is true and 0 otherwise. This winner-take-all decision rule is meaningful when the best resolution is significantly better than the others [30]. Figure 6 shows the lateral and longitudinal velocity weights for the 12 optimal windows. The last longitudinal weight is almost 40% of the total. Therefore, this weight was selected as the winner, and the remaining weights were excluded from the hybrid fusion.
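The hybrid rule above reduces to one dot product and one argmax. A minimal sketch, assuming the per-window velocities and normalized weights are given (the function name is hypothetical):

```python
import numpy as np

def hybrid_fuse(vx, vy, wx, wy):
    """Hybrid fusion: weighted average of the N lateral velocities,
    winner-take-all on the N longitudinal velocities."""
    vx, vy = np.asarray(vx, float), np.asarray(vy, float)
    wx, wy = np.asarray(wx, float), np.asarray(wy, float)
    v_lat = np.dot(wx, vx)        # weighted least-squares estimate
    v_lon = vy[np.argmax(wy)]     # the best-resolution window wins
    return v_lat, v_lon
```

Replacing `np.dot(wx, vx)` by a plain mean, or applying the weighted average to both axes, yields the simple-average and weighted-average baselines compared in the experiments.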

3.2. Vision-Only Trajectory

The vision-only trajectory is generated as
\mathbf{x}_v(k+1) = \mathbf{x}_v(k) + \mathbf{v}(k) T = \begin{bmatrix} x_v(k) + v_x(k) T \\ y_v(k) + v_y(k) T \end{bmatrix}, \quad k = 1, \ldots, l-1, \qquad (9)
where \mathbf{x}_v(k) = [x_v(k) \; y_v(k)]^t, \mathbf{v}(k) = [v_x(k) \; v_y(k)]^t, t denotes the matrix transpose, l is the trajectory length, and the initial position \mathbf{x}_v(1) is set to the zero vector (origin). To reduce the computational cost of template matching, a zero-order hold scheme can be applied. This scheme holds the velocity constant across multiple frames; thus, the fused velocity \mathbf{v}(k) is replaced by \mathbf{v}_{zoh}(k) as follows:
\mathbf{v}_{zoh}(k) = \begin{cases} \mathbf{v}(1), & 1 \le k < M+1, \\ \mathbf{v}(M+1), & M+1 \le k < 2M+1, \\ \;\;\vdots \end{cases} \qquad (10)
where M is the number of frames between consecutive template-matching operations; thus, the frame-matching rate becomes the frame rate (frame capture speed) divided by M.
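The trajectory integration with the zero-order hold can be sketched as a cumulative sum over a held velocity sequence. This is a simplified illustration; `integrate_zoh` is a hypothetical name, and the fused velocities are stacked as an (l − 1) × 2 array:

```python
import numpy as np

def integrate_zoh(velocities, T, M):
    """Integrate fused velocities into a 2D trajectory starting at the origin.

    Template matching is assumed to run every M-th frame; the most recent
    fused velocity is held constant for the frames in between (M = 1 gives
    plain frame-by-frame integration)."""
    velocities = np.asarray(velocities, float)            # shape (l-1, 2)
    held = velocities[(np.arange(len(velocities)) // M) * M]
    pos = np.vstack([np.zeros(2), np.cumsum(held * T, axis=0)])
    return pos                                            # shape (l, 2)
```

With M = 30 at a 30 FPS capture rate, matching runs once per second, trading a thirty-fold reduction in matching cost against the accuracy loss examined in Section 4.3.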

3.3. Performance Evaluation

Since the vision-only method yields a different orientation from the ground-truth trajectory derived from the GNSS signals, a rigid-body rotation is applied to the vision-only trajectory as
\mathbf{x}_g(k) = R(\theta_l) \, \mathbf{x}_v(k) + \boldsymbol{\epsilon}(k), \quad k = 1, \ldots, l, \qquad (11)
R(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \qquad (12)
where \mathbf{x}_g(k) is the ground-truth trajectory at frame k (its initial position is also set to the origin), θ_l is the counterclockwise angle from the vision-only trajectory to the ground-truth trajectory when the frame number of the trajectory is l, and \boldsymbol{\epsilon}(k) is the 2D noise vector generating the discrepancy between the ground-truth and the rotated vision-only trajectories. The optimal rotation angle that aligns the two trajectories in the least-squares sense is obtained as [31,32]
\hat{\theta}_l = \arg\min_{\theta_l} \sum_{k=1}^{l} \left\| \mathbf{x}_g(k) - R(\theta_l) \, \mathbf{x}_v(k) \right\|^2, \qquad (13)
R(\hat{\theta}_l) = V U^t, \qquad (14)
where U D V^t is a singular value decomposition of the cross-covariance matrix X_v X_g^t, X_v = [\mathbf{x}_v(1) \cdots \mathbf{x}_v(l)], and X_g = [\mathbf{x}_g(1) \cdots \mathbf{x}_g(l)]. It is noted that \hat{\theta}_l varies with the trajectory length l. The RMSE and drift error are defined, respectively, as
E_{RMSE}(l) = \sqrt{ \frac{1}{l} \sum_{k=1}^{l} \left\| \mathbf{x}_g(k) - R(\hat{\theta}_l) \, \mathbf{x}_v(k) \right\|^2 }, \qquad (15)
E_D(l) = \left\| \mathbf{x}_g(l) - R(\hat{\theta}_l) \, \mathbf{x}_v(l) \right\|. \qquad (16)
The RMSE evaluates the overall accuracy of the entire trajectory, whereas the drift error measures the accuracy at a specific point. In the experiments, the RMSE and drift error are calculated at three intermediate points and one final point.
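The SVD-based alignment and both error measures can be sketched together. This is a minimal illustration: Xv and Xg are the 2 × l stacked trajectories, the function name is hypothetical, and the determinant check is the standard orthogonal-Procrustes (Kabsch) guard against a reflection solution, which the text does not discuss:

```python
import numpy as np

def align_and_score(Xv, Xg):
    """Rigid-body rotation aligning the vision-only trajectory Xv (2 x l)
    to the ground truth Xg (2 x l), plus the resulting RMSE and drift."""
    H = Xv @ Xg.T                     # 2x2 cross-covariance X_v X_g^t
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                    # R = V U^t
    if np.linalg.det(R) < 0:          # reflection guard (standard Kabsch fix)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    residual = Xg - R @ Xv            # per-frame discrepancy epsilon(k)
    rmse = np.sqrt(np.mean(np.sum(residual ** 2, axis=0)))
    drift = np.linalg.norm(residual[:, -1])
    return R, rmse, drift
```

Because the rotation is refit for each evaluation length l, the RMSE and drift reported at Points A, B, C, and D each use their own optimal alignment.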

4. Results

4.1. Flight Paths

A multi-rotor drone (DJI Mavic 2 Enterprise Advanced) [23] flew along six different paths in rural and urban areas. All videos were captured at 30 frames per second (FPS) with a frame size of 3840 × 2160 pixels. The altitude of the drone was maintained at 40 m, and the camera tilt angle was set to 60°. The AFOV was assumed to be 68° and 42° in the horizontal and vertical directions, respectively.
Figure 7 shows the GNSS trajectories of six flight paths (Paths 1 to 6) obtained from the drone. The GNSS trajectory is generated through the onboard sensor fusion of absolute GNSS and relatively high-frequency IMU data [23]. This process produces a spatially consistent and smooth path. The trajectories are plotted in latitude (north–south) and longitude (east–west) coordinates. The starting point is marked ‘O’, the three intermediate points or waypoints are marked ‘A’, ‘B’, and ‘C’, and the final point is marked ‘D’. Figure 7a–d show four flight paths (Paths 1 to 4) in rural areas, while Figure 7e,f show two flight paths (Paths 5 and 6) in urban areas. The rural areas are mostly flat agricultural land. The urban areas, on the other hand, have complex and irregular terrain due to various artificial structures. Figure 7a,b show a forward-backward straight path and a forward-backward zigzag path (snake path), respectively. Figure 7c is a squared path with three banked turns. Figure 7d is a free path with multiple banked turns and zigzags. Figure 7e,f are an outbound straight path and a forward-backward straight path, respectively. No flat turns (yaw-only rotations) were included in any path, so that the errors could be observed accurately. In all flights, the drone starts from a stationary hovering state and accelerates to fly at its highest possible speed. For the forward-backward flights in Figure 7a,f, an abrupt joystick reversal causes the drone to decelerate, stop, and move in the opposite direction. Figure 8 shows the corresponding ground-truth trajectories for Paths 1 to 6. The GNSS trajectory is projected into a local Cartesian frame (north–east) using the WGS-84 ellipsoid model [24,25]. These projected trajectories are assumed to be the ground-truths for the performance evaluation.
Figure 9a–f show video frames when the drone passes Points O, A, B, C, and D on the six paths. These frames show only part of each path; the full flights can be seen in the MP4 videos uploaded online (Supplementary Materials).
Table 1 shows the frame number and the ground-truth trajectory length of Paths 1 to 6. The shortest length is Path 5, at 810.99 m, and the longest length is Path 1, at 2066.7 m.

4.2. Vision-Only Trajectories

Table 2 shows the RMSE and drift error of the six trajectories at Points A, B, C, and D for the hybrid fusion. The RMSE reflects overall trajectory accuracy, whereas the drift error quantifies the deviation at a specific point. The RMSE and drift error at Points A, B, C, and D range from 2.85 m to 24.89 m and from 2.45 m to 42.47 m, respectively. The average RMSE ranges from 7.17 m to 13.88 m, and the average drift error ranges from 9.44 m to 20.25 m. In percentage terms, the RMSE and drift error at Points A, B, C, and D range from 0.57% to 2.29% and from 0.55% to 4.61%, respectively. The lowest RMSE (0.57%) is obtained both at Point A on Path 1 and at Point D on Path 2, while the lowest drift error (0.55%) is obtained at Point C on Path 2. The highest RMSE and drift error are obtained at Point A on Path 6 and at Point B on Path 3, respectively. Table 3 shows the RMSE and drift error averaged over the four points to compare the fusion rules. Four fusion rules were tested: simple average, weighted average, winner-take-all, and hybrid fusion. The hybrid fusion rule produced the smallest error in all cases except Path 5, followed by the winner-take-all decision. The simple average produced the largest error overall but the lowest error on Path 5.
Figure 10 shows the vision-only trajectory aligned with the ground-truth trajectory. Alignment is achieved through the rigid-body rotation of the vision-only trajectory, as shown in Equations (11)–(14).
Figure 11 shows the ground-truth and the rigid-body-rotated vision-only velocities in the GNSS-projected coordinate system. The ground-truth velocity is the average of the forward and backward velocities over one second. In Figure 11a–f, the north–south and east–west velocity components for the six paths are shown separately. As shown in Figure 10 and Figure 11, higher errors often appear at higher velocities; errors were lowest when the drone was moving forward, followed by backward flight and then sideways flight. Several outliers were observed in the rotated vision-only velocity. These isolated outliers can be removed with a statistical tool such as the Hampel filter [33] or by considering the drone’s actual maximum speed, but the performance improvement was minimal.

4.3. Zero-Order Hold Scheme

Figure 12 shows the RMSE and drift error at the final point, as well as the average RMSE and drift error when applying the zero-order hold scheme. The frame matching rate varies from 30 Hz to 1 Hz. A lower frame matching rate reduces the computational burden.
As the frame matching rate decreases, the performance of Paths 1, 2, 5 degrades, but the performance of Paths 3, 4, 6 remains relatively stable.
A total of 12 supplementary files are available online: six MP4 videos of the six flights and their corresponding GNSS caption files.

5. Discussion

The optimal window aims to ensure uniform spacing in the real-world coordinate system. The number of optimal windows was heuristically set to 12, and the same number of velocities was obtained through template matching. These velocities were fused with the hybrid fusion rule. Two constraints were imposed on the optimal window derivation: the planar ground assumption, and the restriction that the camera rotates only in the pitch direction. Localization errors in these environments primarily originate from the drone’s mechanical system, including its camera stabilizer (gimbal), which is optimized for forward flight. The multirotor tilts its whole body in the direction of travel to increase its speed [23]. This tilting hinders the gimbal from maintaining a stable alignment at the intended angle. Experimental results indicate that these errors are most significant during sideways flights due to the gimbal’s physical limitations in the roll direction. Backward flight generates fewer errors than sideways flight, but more than forward flight. The hardware optimization for forward flight creates a directional performance bias, which is proportional to speed and generates a persistent lag in the calculated location relative to the ground truth. Errors caused by this directional bias and the flat ground assumption can be alleviated by the practical post-processing solution studied in [19]. The flat ground assumption may be more appropriate for high-altitude drone localization, a topic that remains for future research.
All experiments are conducted at 40 m altitude, 60° camera tilt, and 30 FPS, and the AFOV is determined approximately. Evaluating robustness to different acquisition settings (altitude/tilt/AFOV/FPS) including maximum permissible speeds is left for future work.
The optimal window templates are non-overlapping and tile the image; thus, each template contains approximately K/N pixels, where K denotes the total number of image pixels and N the number of templates. Each template restricts its search to a local region of S = (2 × maximum speed/minimum resolution)² candidate offsets. The total number of operations per frame is approximately N × (K/N) × S = KS, yielding 3 × 10¹⁰ operations for K = 3840 × 2160 pixels, a maximum speed of 15 m/s, and a minimum resolution of 0.5 m/s. The number of floating-point operations (FLOPs) for the NSSD is approximately a few times the number of pixel-level operations [34]. Consequently, the computational demand reaches 9 TFLOPS at a 30 Hz frame matching rate. Advanced embedded drone platforms can reach up to 10 TFLOPS; thus, a 9 TFLOPS workload is feasible for real-time onboard processing [35]. It is noted that applying a zero-order hold scheme can reduce the computational burden by several tens of times, at the cost of potential performance degradation.
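The back-of-envelope count above can be reproduced directly. The 15 m/s and 0.5 m/s figures are taken from the text; the FLOP multiplier for the NSSD is left out, since the text only bounds it loosely as "a few times" the pixel-level count:

```python
# Per-frame operation count for optimal-window NSSD matching.
K = 3840 * 2160                 # total pixels per frame
v_max = 15.0                    # assumed maximum drone speed, m/s
res = 0.5                       # assumed minimum velocity resolution, m/s
S = (2 * v_max / res) ** 2      # candidate offsets per template: 60 x 60
ops_per_frame = K * S           # N x (K/N) x S = K x S pixel-level operations
print(f"{ops_per_frame:.2e} pixel operations per frame")
```

Note that the N templates cancel out of the total: tiling the frame into more, smaller templates does not change the aggregate matching cost, only the granularity of the velocity estimates.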

6. Conclusions

This vision-only technique allows drones to estimate their location using only visual sensors. The optimal window template-matching technique was significantly improved in this paper. The optimal window is derived from the piecewise linear segment model that linearizes the nonlinear image-to-real-world conversion function. The optimal window template is scene-independent and requires no update process.
Multiple velocities are fused based on the MDV of each optimal window. The RMSE and drift error were found to range from several meters to a few tens of meters during high-speed maneuvering flights over round-trip distances from about 800 m to 2 km.
Although the application of this method may be limited due to constraints on the optimal window, it has great potential in a wide range of applications, from commercial and industrial sectors to military missions. Thermal imaging cameras [36] could be used in environments with low illumination, a possibility also being explored in future research.

Supplementary Materials

The following are available online at https://doi.org/10.5281/zenodo.17990832 and http://drive.google.com/drive/folders/1pV08XaKD1LvDs5cU5Imk7Y8xjYREkeED, accessed on 20 December 2025. Movies of flights for six paths: Path1.mp4, Path2.mp4, Path3.mp4, Path4.mp4, Path5.mp4, Path6.mp4, and corresponding GNSS caption files: Path1.srt, Path2.srt, Path3.srt, Path4.srt, Path5.srt, Path6.srt.

Funding

This research was supported by a Daegu University Research Grant 2025.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Acknowledgments

The author would like to thank Jong-Ha Kim for his guidance in flying the drone in the city of Andong.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Osmani, K.; Schulz, D. Comprehensive Investigation of Unmanned Aerial Vehicles (UAVs): An In-Depth Analysis of Avionics Systems. Sensors 2024, 24, 3064. [Google Scholar] [CrossRef]
  2. Kumar, P.; Pal, K.; Govil, M.C.; Choudhary, A. Comprehensive Review of Path Planning Techniques for Unmanned Aerial Vehicles (UAVs). ACM Comput. Surv. 2025, 58, 1–44. [Google Scholar] [CrossRef]
  3. Bany Abdelnabi, A.A.; Rabadi, G. Human Detection from Unmanned Aerial Vehicles’ Images for Search and Rescue Missions: A State-of-the-Art Review. IEEE Access 2024, 12, 152009–152035. [Google Scholar] [CrossRef]
Figure 1. Block diagram of vision-only localization of a drone.
Figure 2. Coordinate conversion functions: (a) horizontal direction, (b) vertical direction. The red circle indicates the center of the image.
Figure 3. (a) 10 linear segments, (b) 12 linear segments, (c) 15 linear segments.
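Figure 3 shows the image-to-ground conversion function approximated by 10, 12, and 15 linear segments. One standard way to obtain such a partition is dynamic programming over contiguous segments, minimizing the total least-squares error. The sketch below is a minimal illustration of that idea under the assumption of samples sorted by pixel coordinate; `fit_segments` and its interface are hypothetical names, not the paper's implementation:

```python
import numpy as np

def fit_segments(x, y, k):
    """Partition the samples (x ascending) into k contiguous segments and
    fit a least-squares line to each, choosing breakpoints that minimise
    the total squared error via dynamic programming."""
    n = len(x)
    # cost[i, j]: SSE of a single line fitted to samples i..j (inclusive).
    cost = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(i + 1, n):
            coef = np.polyfit(x[i:j + 1], y[i:j + 1], 1)
            resid = y[i:j + 1] - np.polyval(coef, x[i:j + 1])
            cost[i, j] = float(resid @ resid)
    # dp[s, j]: best total SSE covering samples 0..j with s segments.
    dp = np.full((k + 1, n), np.inf)
    cut = np.zeros((k + 1, n), int)
    dp[1] = cost[0]
    for s in range(2, k + 1):
        for j in range(n):
            for i in range(1, j):
                c = dp[s - 1][i - 1] + cost[i, j]
                if c < dp[s][j]:
                    dp[s][j], cut[s][j] = c, i
    # Walk the cut table backwards to recover the breakpoint indices.
    bounds, j = [], n - 1
    for s in range(k, 1, -1):
        i = cut[s][j]
        bounds.append(i)
        j = i - 1
    return sorted(bounds), float(dp[k][n - 1])
```

The cubic-time inner loop is acceptable here because the conversion function is sampled at a modest number of image rows; the paper's cited optimal-partitioning literature describes more efficient variants.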
Figure 4. Optimal windows corresponding to Figure 3, (a) 10 optimal windows, (b) 12 optimal windows, (c) 15 optimal windows. The centers of the windows are marked with ‘+’.
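Each optimal window in Figure 4 serves as a fixed template cut from one frame and located in the next frame, and the resulting pixel displacement (scaled by the conversion function) gives an instantaneous velocity. A minimal normalized cross-correlation matcher over a small search area could look like the following; the function name, search radius, and exhaustive scan are illustrative assumptions, not the paper's matcher:

```python
import numpy as np

def match_template_ncc(frame, template, top_left, search=10):
    """Locate `template` (cut from the previous frame at `top_left`)
    in `frame` by normalized cross-correlation over a +/-`search`
    pixel area. Returns the (dy, dx) displacement of the best match."""
    h, w = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    best, best_shift = -np.inf, (0, 0)
    y0, x0 = top_left
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = y0 + dy, x0 + dx
            if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue
            patch = frame[y:y + h, x:x + w]
            p = patch - patch.mean()
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom == 0:
                continue  # flat patch or flat template: correlation undefined
            score = (p * t).sum() / denom
            if score > best:
                best, best_shift = score, (dy, dx)
    return best_shift
```

In practice a library routine such as OpenCV's `cv2.matchTemplate` would replace the explicit double loop; the version above only makes the correlation explicit.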
Figure 5. Minimum detectable velocity, (a) lateral direction (horizontal direction in image), (b) longitudinal direction (vertical direction in image). The centers of the windows are marked with ‘+’.
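The minimum detectable velocity shown in Figure 5 follows from the fact that template matching cannot resolve sub-pixel motion: a window whose ground footprint is s metres per pixel, sampled at f frames per second, cannot register speeds below s·f. A one-line sketch, with the 30 fps rate and the slope values chosen purely for illustration:

```python
def min_detectable_velocity(metres_per_pixel, fps=30.0):
    # A displacement under one pixel between consecutive frames is
    # invisible to template matching, so the floor is one pixel
    # footprint per frame interval (1/fps seconds).
    return metres_per_pixel * fps

# Windows nearer the image bottom see the ground at finer resolution,
# so their velocity floor is lower (illustrative slopes in m/px):
floors = [min_detectable_velocity(s) for s in (0.05, 0.12, 0.30)]
```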
Figure 6. Velocity weights for 12 optimal windows.
Figure 7. GNSS trajectory with starting (O), intermediate (A, B, C), and final (D) points, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
Figure 8. Ground-truth trajectory projected from the GNSS trajectory with starting (O), intermediate (A, B, C), and final (D) points, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
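The projection in Figure 8 maps GNSS fixes onto a local flat-ground plane. A common flat-Earth (equirectangular) approximation, adequate over the few-kilometre extents flown here, is sketched below; the paper's exact geodetic conversion may differ, and the function names are illustrative:

```python
import math

R_EARTH = 6_378_137.0  # WGS-84 semi-major axis (m)

def gnss_to_local(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """Project a GNSS fix to a local (east, north) frame in metres,
    using an equirectangular approximation about the origin fix."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    lat0, lon0 = math.radians(lat0_deg), math.radians(lon0_deg)
    east = (lon - lon0) * math.cos(lat0) * R_EARTH
    north = (lat - lat0) * R_EARTH
    return east, north

def path_length(points):
    """Cumulative length (m) of a polyline of (east, north) points,
    as tabulated at waypoints A-D in Table 1."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))
```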
Figure 9. Video frame at starting (O), intermediate (A, B, C), and final (D) points, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
Figure 10. Ground-truth and rigid-body rotated vision-only trajectories, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
Figure 11. Ground-truth and rigid-body rotated vision-only velocities, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
Figure 12. Performance evaluation of zero-order hold scheme, (a) Path 1, (b) Path 2, (c) Path 3, (d) Path 4, (e) Path 5, (f) Path 6.
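The zero-order-hold scheme evaluated in Figure 12 can be read as dead reckoning: each fused velocity estimate is held constant over its frame interval and integrated into position. A minimal sketch, with all names illustrative:

```python
def integrate_zoh(velocities, dt, start=(0.0, 0.0)):
    """Dead-reckon a trajectory from per-frame velocity estimates,
    holding each (vx, vy) sample constant over its interval dt."""
    x, y = start
    traj = [(x, y)]
    for vx, vy in velocities:
        x += vx * dt
        y += vy * dt
        traj.append((x, y))
    return traj
```

Because position is obtained purely by integration, any velocity bias accumulates over the flight, which is why the drift errors in Tables 2 and 3 grow with path length.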
Table 1. Frame number and path lengths of six paths.

Point  | Path 1           | Path 2           | Path 3           | Path 4           | Path 5           | Path 6
       | Frame  Len. (m)  | Frame  Len. (m)  | Frame  Len. (m)  | Frame  Len. (m)  | Frame  Len. (m)  | Frame  Len. (m)
A      | 1175   500.5     | 1063   398.7     | 678    268.35    | 866    302.19    | 476    182.05    | 1131   346.97
B      | 2350   1033.1    | 2126   824.6     | 1482   611.45    | 1731   613.65    | 951    403.22    | 2261   720.68
C      | 3686   1539.43   | 3367   1284.0    | 2242   891.84    | 2596   928.49    | 1426   624.12    | 3291   1031.44
D      | 5021   2066.7    | 4560   1725.6    | 3013   1225.90   | 3462   1231.27   | 1901   810.99    | 4320   1440.1
Table 2. RMSE and drift error for hybrid fusion.

Point  | Path 1         | Path 2         | Path 3         | Path 4         | Path 5         | Path 6
       | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift
A      | 2.85   5.45    | 6.23   5.23    | 4.07   5.91    | 3.18   2.45    | 3.06   7.32    | 7.96   6.99
B      | 11.93  26.45   | 7.68   10.95   | 11.82  28.19   | 5.59   5.27    | 5.67   6.29    | 8.35   9.42
C      | 15.7   17.89   | 7.78   7       | 19.98  33.62   | 11.41  32.99   | 8.67   18.22   | 10.27  11.69
D      | 16.44  21.72   | 9.88   19.45   | 19.65  9.11    | 24.89  40.28   | 11.26  5.92    | 16.75  42.47
Avg.   | 11.73  17.88   | 7.89   10.66   | 13.88  19.21   | 11.27  20.25   | 7.17   9.44    | 10.83  17.64

All RMSE and drift values are in metres.
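The RMSE and drift figures above are computed after rigidly rotating the vision-only trajectory onto the GNSS-derived ground truth. A standard way to obtain that rotation is a 2-D Kabsch/SVD fit; the sketch below assumes both trajectories are expressed about a common start point at the origin, and the function names are illustrative rather than the paper's code:

```python
import numpy as np

def align_rotation(est, ref):
    """Best rigid-body rotation (about the shared start point) mapping
    the estimated track onto the reference, via a 2-D Kabsch/SVD fit."""
    H = est.T @ ref
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    return est @ R.T

def rmse(est, ref):
    """Root mean squared position error over the whole trajectory."""
    return float(np.sqrt(np.mean(np.sum((est - ref) ** 2, axis=1))))

def drift(est, ref, k):
    """Position error at checkpoint index k (e.g., waypoints A-D)."""
    return float(np.linalg.norm(est[k] - ref[k]))
```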
Table 3. Average RMSE and drift errors for various fusion rules.

Rule             | Path 1         | Path 2         | Path 3         | Path 4         | Path 5         | Path 6
                 | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift   | RMSE   Drift
Simple Avg.      | 54.97  69.54   | 19.24  19.32   | 15.66  20.5    | 23.16  27.1    | 15.5   26.77   | 17.05  31.82
Weighted Avg.    | 31.71  38.92   | 14.55  17.14   | 14.72  19.72   | 12.87  23.32   | 13.01  19.88   | 12.68  24.16
Winner-take-all  | 12.35  19.55   | 9.59   13.46   | 14.09  18.93   | 18.33  32.64   | 7.17   9.76    | 15.93  30.56
Hybrid           | 11.73  17.88   | 7.89   10.66   | 13.88  19.21   | 11.27  20.25   | 7.17   9.44    | 10.83  17.64

All RMSE and drift values are in metres.
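The four fusion rules compared above combine the per-window velocity estimates as follows: simple averaging, confidence-weighted averaging, winner-take-all (the most confident window alone), and the hybrid rule that weights the lateral components but takes the winner longitudinally. A compact sketch, where the interface and the meaning of the per-window weights are illustrative assumptions:

```python
import numpy as np

def fuse_velocities(vels, weights, rule="hybrid"):
    """Combine per-window velocity estimates. Rows of `vels` are
    (lateral, longitudinal) pairs; `weights` are per-window confidences."""
    vels = np.asarray(vels, float)
    w = np.asarray(weights, float)
    if rule == "simple":
        return vels.mean(axis=0)
    if rule == "weighted":
        return (w[:, None] * vels).sum(axis=0) / w.sum()
    if rule == "winner":
        return vels[w.argmax()]
    if rule == "hybrid":
        # Weighted average sideways, winner-take-all along-track.
        lat = (w * vels[:, 0]).sum() / w.sum()
        lon = vels[w.argmax(), 1]
        return np.array([lat, lon])
    raise ValueError(f"unknown rule: {rule}")
```

Table 3 then corresponds to running the same velocity stream through each rule and integrating the result into a trajectory before scoring.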