Vision-Based Geolocation of Moving Ground Targets Using Kalman Filtering with a Gimbal Camera on Board a UAV

Kim, Jaemin; Kim, Youngrun; Kim, SuHyeon; Cho, Hyeongjun; Jung, Dongwon

doi:10.3390/aerospace12121065


by Jaemin Kim ¹, Youngrun Kim ², SuHyeon Kim ³, Hyeongjun Cho ⁴ and Dongwon Jung ²,*

¹ Pablo Air, Daejeon 34109, Republic of Korea

² Smart Air Mobility Engineering, Korea Aerospace University, Goyang-si 10540, Republic of Korea

³ Unmanned Aircraft System Research Division, Korea Aerospace Research Institute, Daejeon 34133, Republic of Korea

⁴ LIG Nex1, Yongin-si 16911, Republic of Korea

* Author to whom correspondence should be addressed.

Aerospace 2025, 12(12), 1065; https://doi.org/10.3390/aerospace12121065

Submission received: 22 September 2025 / Revised: 27 November 2025 / Accepted: 28 November 2025 / Published: 30 November 2025

(This article belongs to the Topic Target Tracking, Guidance, and Navigation for Autonomous Systems, 2nd Edition)


Abstract

Unmanned aerial vehicles (UAVs) are vital for surveillance missions requiring the geolocation of moving ground targets, yet small, resource-constrained platforms often lack integrated, robust systems that can handle disturbances such as wind, occlusions, and noise. This paper presents an integrated, end-to-end vision-based geolocation pipeline specifically designed for embedded deployment on resource-constrained UAVs with gimbal cameras. Starting from a rough initial position estimate, pan/tilt angles are computed to orient the gimbal, and then a visual tracking module combining object detection (via Tiny-YOLO) and feedback control (using CSRT) centers the target in the frame. The target’s absolute position is derived from UAV inertial data and gimbal angles. To mitigate noisy or unavailable direct geolocation due to disturbances or visual lock loss, Kalman filtering is integrated with a unicycle-based motion model. Both an extended Kalman filter (EKF) and unscented Kalman filter (UKF) are evaluated and tuned in high-fidelity simulations, with the UKF demonstrating superior performance by reducing the 2D position RMSE by 33% compared to the EKF in occlusion scenarios. The system is implemented on embedded hardware and validated through real flight tests, establishing the operational capability of vision-based surveillance on small UAV platforms.

Keywords: geolocation; estimation; moving target; unscented Kalman filter; on-board camera; gimbal control; visual tracking; unmanned aerial vehicle (UAV)

1. Introduction

Unmanned aerial vehicles (UAVs) have been receiving increasing attention from both academic and industrial communities owing to their expanding range of applications [1,2,3,4]. This growing attention has driven the development of diverse onboard sensing systems for UAVs, tailored to the specific tasks demanded by each application area. Among the various sensing systems used in UAVs, vision-based systems stand out due to their robustness under various environmental conditions [5,6,7] and their ability to provide detailed spatial awareness of the environment using compact, low-cost hardware suitable for small UAVs. These features make them highly effective in diverse applications such as environmental monitoring and inspection [8,9], search and rescue [10], surveillance [11], traffic control [12,13], and precision mapping [14,15]. While all of these applications benefit from accurate positioning, surveillance missions present particularly demanding requirements for ground target localization and tracking. In military and law enforcement operations, achieving reliable geolocation from a moving aerial platform is essential yet technically challenging.

Vision-based target localization using UAVs has been approached from multiple perspectives. Early work focused on geometric projection methods for geolocation. Pachter et al. [16] developed algorithms for target position estimation using cameras fixed to a UAV. To expand the effective field of regard and overcome FOV limitations, researchers have introduced gimbal mechanisms. Redding et al. [17] proposed a gimbal-based geolocation algorithm, Monda et al. [18] extended this with a two-axis system for riverine environments, and Hosseinpoor et al. [19] demonstrated thermal imaging-based tracking with gimbal stabilization.

More recently, research efforts have pursued two parallel directions: improving tracking performance through deep learning and enhancing system capabilities through active gimbal control. Micheal [20] and Zhao et al. [21] developed learning-based approaches using LSTMs and reinforcement learning to improve target tracking robustness. Meanwhile, Xu et al. [22] and Ansen [23] demonstrated that active gimbal control strategies—dynamically adjusting the gimbal orientation based on visual feedback—can mitigate motion blur and improve localization accuracy.

These advances have significantly improved specific aspects of the problem, with some systems demonstrating real-world performance on suitable platforms. However, existing approaches tend to focus on either tracking robustness or control strategies, often evaluated in controlled environments or with high computational resources. For small UAV applications with tight size, weight, and power (SWaP) constraints, there remains a need for practical, integrated frameworks that unify tracking, active control, and state estimation while operating reliably onboard under realistic outdoor flight conditions.

Motivated by this need, this paper presents an integrated framework for vision-based geolocation on small UAVs that unifies active gimbal control, robust tracking, and state estimation in a single onboard system. Our approach combines visual feedback-driven gimbal control to maintain continuous target visibility with Kalman filtering-based state estimation that fuses inertial measurements and visual observations to handle realistic disturbances, including wind, gimbal lag, and measurement noise. The system employs a unicycle-based process model to capture target dynamics and a pinhole camera measurement model under flat-ground assumptions.

The proposed framework was evaluated through high-fidelity simulations, comparing EKF and UKF implementations, and the complete system was validated through real-world flight experiments, where the geolocation estimates enabled closed-loop flight control on a gimbal-equipped UAV running on low-power embedded hardware.

The contributions of this work are as follows:

Operational surveillance validation: Demonstration that persistent vision-based surveillance of moving ground targets is feasible with small UAVs under realistic conditions, including continuous tracking at long range (141 m standoff) and autonomous flight control adaptation based on target estimates.
Integrated practical framework: Design and deployment of a complete onboard system unifying visual tracking, active gimbal control, and state estimation, validated on resource-constrained hardware under actual flight disturbances (wind, vibration, measurement noise).
Systematic filtering evaluation: Comparative analysis of EKF and UKF performance for maintaining tracking continuity under occlusions and measurement uncertainties, providing practical guidance for real-time UAV surveillance applications.

The remainder of this paper is structured as follows: Section 2 describes the geolocation and filtering methods. Section 3 presents simulation and flight experiment results. Finally, Section 4 concludes the paper.

2. Methods and Materials

2.1. Geolocation Geometry

This section introduces the geometric formulation for estimating the ground target’s location using a gimbal-mounted camera onboard a UAV. The position of the target is determined relative to the UAV’s position and orientation, along with the pan and tilt angles of the gimbal. Assuming that the gimbal provides independent horizontal (pan) and vertical (tilt) rotation, the onboard camera can be directed toward the target through two sequential rotations. The geolocation geometry is derived under a flat-Earth assumption, which is valid for operational ranges typically less than 20 km from the takeoff location. The formulation also assumes that the camera’s optical axis is precisely aligned with the target.

The north–east–down (NED) coordinate system shown in Figure 1, whose origin is fixed at the UAV’s takeoff position, is defined as the inertial frame $F_{I}$, in which the positions of both the UAV and the target are expressed. The vehicle-carried frame $F_{V}$ is a non-rotating frame that moves with the UAV and remains aligned with $F_{I}$. The body frame $F_{B}$ is attached to the UAV and is rotated from $F_{V}$ through a 3-2-1 (yaw, pitch, roll) Euler rotation sequence defined by the roll ($ϕ$), pitch ($θ$), and yaw ($ψ$) angles. The rotation matrix $R_{B}^{V}$ describes the orientation of the body frame $F_{B}$ relative to the vehicle-carried frame $F_{V}$, so that it maps vector coordinates expressed in $F_{B}$ into $F_{V}$:

$$ R_{B}^{V}(ϕ, θ, ψ) = R_{z}(ψ) R_{y}(θ) R_{x}(ϕ) = \begin{bmatrix} \hat{i}_{b}^{v} & \hat{j}_{b}^{v} & \hat{k}_{b}^{v} \end{bmatrix} \tag{1} $$

Here, each column vector in Equation (1) represents a unit vector of the body frame expressed in the vehicle-carried frame $F_{V}$.

The gimbal frame $F_{G}$ is attached to the onboard camera and is oriented relative to the body frame $F_{B}$ through two sequential rotations: a pan angle ($ν$) and a tilt angle ($λ$). For simplicity, we assume that $F_{G}$ is aligned with $F_{B}$ when both the pan and tilt angles are zero. The rotation matrix from the body frame to the gimbal frame, denoted as $R_{G}^{B}$, is defined as a body-fixed rotation:

$$ R_{G}^{B}(ν, λ) = R_{z}(ν) R_{y}(λ) = \begin{bmatrix} \hat{i}_{g}^{b} & \hat{j}_{g}^{b} & \hat{k}_{g}^{b} \end{bmatrix} \tag{2} $$

Here, each column vector represents a unit vector of the gimbal frame expressed in the body frame.

Using Equations (1) and (2), the composite rotation matrix from the vehicle-carried frame $F_{V}$ to the gimbal frame $F_{G}$, denoted as $R_{G}^{V}$, is obtained by chaining the two transformations:

$$ R_{G}^{V} = R_{B}^{V}(ϕ, θ, ψ) R_{G}^{B}(ν, λ) = \begin{bmatrix} \hat{i}_{g}^{v} & \hat{j}_{g}^{v} & \hat{k}_{g}^{v} \end{bmatrix} \tag{3} $$
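The chain of rotations in Equations (1) to (3) can be sketched numerically as follows. This is a minimal illustration, assuming NumPy is available; the function names and the example angles are hypothetical and do not come from the paper.

```python
import numpy as np


def rot_x(angle):
    """Basic rotation about the x-axis (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_y(angle):
    """Basic rotation about the y-axis (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def rot_z(angle):
    """Basic rotation about the z-axis (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def rotation_body(roll, pitch, yaw):
    """Equation (1): columns are the body axes expressed in F_V."""
    return rot_z(yaw) @ rot_y(pitch) @ rot_x(roll)


def rotation_gimbal(pan, tilt):
    """Equation (2): columns are the gimbal axes expressed in F_B."""
    return rot_z(pan) @ rot_y(tilt)


def rotation_composite(roll, pitch, yaw, pan, tilt):
    """Equation (3): columns are the gimbal axes expressed in F_V."""
    return rotation_body(roll, pitch, yaw) @ rotation_gimbal(pan, tilt)


# Illustrative angles only (roll, pitch, yaw, pan, tilt in degrees).
R_gv = rotation_composite(*np.radians([5.0, -3.0, 40.0, 20.0, -30.0]))
i_g_v = R_gv[:, 0]  # camera pointing direction expressed in F_V
print(np.round(i_g_v, 3))
```

With this convention, a level UAV with zero pan and a tilt of -90 degrees yields a pointing direction along the NED down axis, which is a quick sanity check on the sign of the tilt angle.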

The geolocation geometry for target positioning is illustrated in Figure 2. The fundamental approach requires knowledge of the UAV’s position vector ($\vec{p}_{b}$) and the initial position of the target ($\vec{p}_{int}$). The objective is to determine the required gimbal pan and tilt angles that orient the camera toward the target location. To establish the gimbal’s pointing direction ($\vec{p}_{G}$), we first define the gimbal frame’s x-axis to align with the target direction. The unit vector representing this direction in the inertial frame is calculated as

$$ \hat{i}_{g}^{I} = \frac{\vec{p}_{int} - \vec{p}_{b}}{\| \vec{p}_{int} - \vec{p}_{b} \|} \tag{4} $$

As the vehicle-carried frame is non-rotating relative to the inertial frame, this relationship simplifies to $\hat{i}_{g}^{I} = \hat{i}_{g}^{v}$.

From Equation (3), given the rotation matrix $R_{G}^{V}$, one can express the rotation matrix $R_{G}^{B}$ as follows:

$$ \begin{aligned} R_{G}^{B}(ν, λ) &= \left( R_{B}^{V}(ϕ, θ, ψ) \right)^{T} \begin{bmatrix} \hat{i}_{g}^{v} & \hat{j}_{g}^{v} & \hat{k}_{g}^{v} \end{bmatrix} \\ &= \begin{bmatrix} \hat{i}_{b}^{v} & \hat{j}_{b}^{v} & \hat{k}_{b}^{v} \end{bmatrix}^{T} \begin{bmatrix} \hat{i}_{g}^{v} & \hat{j}_{g}^{v} & \hat{k}_{g}^{v} \end{bmatrix} \\ &= \begin{bmatrix} \hat{i}_{b}^{v} \cdot \hat{i}_{g}^{v} & \hat{i}_{b}^{v} \cdot \hat{j}_{g}^{v} & \hat{i}_{b}^{v} \cdot \hat{k}_{g}^{v} \\ \hat{j}_{b}^{v} \cdot \hat{i}_{g}^{v} & \hat{j}_{b}^{v} \cdot \hat{j}_{g}^{v} & \hat{j}_{b}^{v} \cdot \hat{k}_{g}^{v} \\ \hat{k}_{b}^{v} \cdot \hat{i}_{g}^{v} & \hat{k}_{b}^{v} \cdot \hat{j}_{g}^{v} & \hat{k}_{b}^{v} \cdot \hat{k}_{g}^{v} \end{bmatrix} \end{aligned} \tag{5} $$

The rotation matrix $R_{G}^{B}(ν, λ)$ is also computed by multiplying two basic rotation matrices:

$$ R_{G}^{B}(ν, λ) = \begin{bmatrix} \cos ν \cos λ & -\sin ν & \cos ν \sin λ \\ \sin ν \cos λ & \cos ν & \sin ν \sin λ \\ -\sin λ & 0 & \cos λ \end{bmatrix} \tag{6} $$

To complete the definition of the gimbal coordinate frame, we next determine the y-axis unit vector $\hat{j}_{g}^{v}$. In Equations (5) and (6), the entry in the third row and second column indicates that $\hat{j}_{g}^{v}$ is orthogonal to $\hat{k}_{b}^{v}$. Moreover, through the construction of a right-handed coordinate frame, $\hat{j}_{g}^{v}$ must also be perpendicular to $\hat{i}_{g}^{v}$. Combining these two constraints specifies the y-axis as

$$ \hat{j}_{g}^{v} = \hat{k}_{b}^{v} \times \hat{i}_{g}^{v} \tag{7} $$

With the complete gimbal frame established, the required pan and tilt angles can be determined by comparing the first-column entries of the matrices in Equations (5) and (6):

$$ \begin{aligned} ν &= \tan^{-1}\left( \frac{\hat{j}_{b}^{v} \cdot \hat{i}_{g}^{v}}{\hat{i}_{b}^{v} \cdot \hat{i}_{g}^{v}} \right) \\ λ &= -\sin^{-1}\left( \hat{k}_{b}^{v} \cdot \hat{i}_{g}^{v} \right) \end{aligned} \tag{8} $$

where $\hat{i}_{b}^{v}$, $\hat{j}_{b}^{v}$, and $\hat{k}_{b}^{v}$ are the unit vectors of the body frame that can be easily derived from the attitude angles of the UAV, while $\hat{i}_{g}^{v}$ can be obtained from Equation (4) using the positions of the UAV and the target.
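A compact sketch of Equations (4) and (8) is given below. It assumes NumPy, positions expressed in NED, and angles in radians; the function names and the example scenario are hypothetical. As a design choice of this sketch (not stated in the paper), `arctan2` replaces the plain inverse tangent so that the pan angle lands in the correct quadrant.

```python
import numpy as np


def body_axes(roll, pitch, yaw):
    """Equation (1): R_z(yaw) R_y(pitch) R_x(roll); columns are i_b, j_b, k_b in F_V."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return rz @ ry @ rx


def pan_tilt_to_target(p_uav, p_target, roll, pitch, yaw):
    """Return (pan, tilt) in radians that align the gimbal x-axis with the target."""
    # Iterating over the transpose yields the columns of Equation (1).
    i_b, j_b, k_b = body_axes(roll, pitch, yaw).T
    los = np.asarray(p_target, dtype=float) - np.asarray(p_uav, dtype=float)
    i_g = los / np.linalg.norm(los)                    # Equation (4)
    pan = np.arctan2(j_b @ i_g, i_b @ i_g)             # Equation (8), quadrant-safe
    tilt = -np.arcsin(np.clip(k_b @ i_g, -1.0, 1.0))   # Equation (8)
    return pan, tilt


# Hypothetical scenario: level UAV at 100 m altitude (NED, so the down
# coordinate is -100 m) and a target 100 m north and 100 m east on the ground.
pan, tilt = pan_tilt_to_target([0.0, 0.0, -100.0], [100.0, 100.0, 0.0], 0.0, 0.0, 0.0)
print(np.degrees(pan), np.degrees(tilt))
```

For this scenario the pan angle is 45 degrees and the tilt is about -35.3 degrees, that is, the camera looks to the front-right and downward.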

Remark 1.

The ground target position is used to calculate the required pan and tilt angles for gimbal control. These computed angles direct the gimbal to acquire the target within the camera’s field of view, after which visual servoing maintains the target at the center of the image frame. To determine the pan and tilt angles using the geolocation geometry described in Equations (4), (7) and (8), an approximate target position must be known a priori. In this work, we assume that the initial target position is provided with sufficient accuracy to enable the gimbal to orient toward the approximate location of the target; here, it is taken from the known location of the target’s starting point.

Once the target appears partially within the image frame, vision-based information can be employed to correct the pan and tilt angles and center the target in the image. This requires determining the target’s displacement from the image center. To handle non-cooperative targets without special markers, target detection is performed using deep learning-based object detection methods. The well-known ‘You Only Look Once’ (YOLO) [24] algorithm is employed in conjunction with a pre-trained dataset for this purpose. Figure 3 demonstrates the target detection results using YOLO, where a vehicle is successfully detected and enclosed within a yellow bounding box with the appropriate label. In this study, targets are assumed to be vehicles or pedestrians. Subsequently, Tiny-YOLO [25], which requires less computational resources, is implemented using a limited pre-trained dataset for these two object classes. This configuration enables real-time target detection on the embedded mission computer, achieving approximately 10-frames-per-second throughput, which satisfies the surveillance mission requirements.

Following target detection and region of interest (ROI) establishment through the bounding box, vision-based target tracking computes the spatial displacement of the ROI across sequential image frames. The tracking algorithm monitors the movement of the bounding box as the target moves, providing corrective information for gimbal adjustment relative to the image center. During surveillance missions, the UAV typically loiters around the target while adjusting zoom levels to obtain higher-resolution imagery. Consequently, the target’s apparent shape and size may vary due to changing relative positions between the UAV and the target. Among various tracking algorithms in the literature, the CSRT tracking scheme [26] is adopted for its robustness to target shape variations, enabling consistent tracking performance despite UAV motion.

To calculate gimbal correction angles, the pixel displacement of the target relative to the image center must be converted to angular measurements. For small correction angles, horizontal pixel displacement corresponds to pan correction, while vertical pixel displacement corresponds to tilt correction. In the pinhole camera model illustrated in Figure 4, the image plane is located at a distance equal to the focal length f from the camera center, and the image resolution is fixed at $w \times h$.

The pixel displacements $Δw$ and $Δh$ correspond to small correction angles for pan ($Δν$) and tilt ($Δλ$), respectively, as follows (with f expressed in pixel units):

$$ \begin{aligned} Δν &= \tan^{-1}\left( \frac{Δw}{f} \right) \\ Δλ &= \tan^{-1}\left( \frac{Δh}{f} \right) \end{aligned} \tag{9} $$

By subtracting these correction angles from the calculated pan and tilt angles in Equation (8), the target can be centered in the image frame.

While Equation (9) requires the camera focal length f, which is typically specified by the manufacturer, this approach may introduce calibration errors due to manufacturing tolerances and lens distortions. To improve accuracy, the correction angles are calculated using experimentally determined horizontal and vertical field-of-view (FOV) values:

$$ \begin{aligned} Δν &= \tan^{-1}\left( 2 \tan\frac{F_{w}(η_{z})}{2} \cdot \frac{Δw}{w} \right) \\ Δλ &= \tan^{-1}\left( 2 \tan\frac{F_{h}(η_{z})}{2} \cdot \frac{Δh}{h} \right) \end{aligned} \tag{10} $$

where $F_{w}$ and $F_{h}$ represent the horizontal and vertical FOV angles of the gimbal camera, respectively. As the FOV values vary with the camera zoom level, different FOV values are experimentally determined at various zoom levels within $η_{z} \in [1, 3]$. This approach provides a robust visual feedback system for continuous target tracking that compensates for gimbal pointing errors in Equation (8) caused by target location uncertainty.
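The conversion in Equation (10) can be sketched as below. The FOV calibration table, the linear interpolation between zoom levels, the image resolution, and all function names are assumptions made for illustration; the paper only states that FOV values were measured at several zoom levels.

```python
import math

# Hypothetical calibration table: zoom level -> (horizontal FOV, vertical FOV) in degrees.
FOV_TABLE_DEG = {1.0: (60.0, 36.0), 2.0: (32.0, 18.5), 3.0: (21.5, 12.3)}


def fov_at_zoom(zoom):
    """Linearly interpolate the calibrated FOV pair (radians) for a zoom level in [1, 3]."""
    levels = sorted(FOV_TABLE_DEG)
    if not levels[0] <= zoom <= levels[-1]:
        raise ValueError("zoom level outside the calibrated range")
    for lo, hi in zip(levels, levels[1:]):
        if lo <= zoom <= hi:
            t = (zoom - lo) / (hi - lo)
            fov_w = (1.0 - t) * FOV_TABLE_DEG[lo][0] + t * FOV_TABLE_DEG[hi][0]
            fov_h = (1.0 - t) * FOV_TABLE_DEG[lo][1] + t * FOV_TABLE_DEG[hi][1]
            return math.radians(fov_w), math.radians(fov_h)


def correction_angles(dw, dh, width, height, zoom):
    """Equation (10): pixel offsets from the image center -> (pan, tilt) corrections in radians."""
    fov_w, fov_h = fov_at_zoom(zoom)
    d_pan = math.atan(2.0 * math.tan(fov_w / 2.0) * dw / width)
    d_tilt = math.atan(2.0 * math.tan(fov_h / 2.0) * dh / height)
    return d_pan, d_tilt


# Example: target 120 px right of and 45 px above the center of a 1280 x 720 image.
d_pan, d_tilt = correction_angles(dw=120, dh=-45, width=1280, height=720, zoom=1.0)
print(math.degrees(d_pan), math.degrees(d_tilt))
```

A useful check is that a target at the image edge (an offset of half the image width) gives a correction of exactly half the horizontal FOV, which follows directly from Equation (10).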

2.2. Target Geolocation

This section presents a method for determining the precise ground position of a target using the measured pan and tilt angles from a gimbal-mounted camera on a UAV. Building upon the geometric relationships established in the previous section, this geolocation approach leverages the visual tracking system to calculate the target’s coordinates with respect to the inertial frame.

Assuming that knowledge of the UAV’s position is accurate and that the target remains on the ground, the target position $\vec{p}_{T}$ can be determined directly from the measured gimbal angles. When the vision system keeps the target centered in the camera frame, the gimbal’s pan and tilt angles accurately represent the line-of-sight vector to the target. This allows the target position to be expressed using the following vector equation:

$$ \vec{p}_{T} = \vec{p}_{B} + ρ \hat{i}_{g} \tag{11} $$

where $\vec{p}_{B}$ represents the UAV position, $\hat{i}_{g}$ is the unit vector along the x-axis of the gimbal frame (pointing toward the target), and $ρ$ is the scalar distance from the UAV to the target.

Given that the target is on the ground (zero altitude), Equation (11) can be expanded as

$$ ρ \hat{i}_{g} = \vec{p}_{T} - \vec{p}_{B} = \begin{bmatrix} x_{t} \\ y_{t} \\ 0 \end{bmatrix} - \begin{bmatrix} x_{b} \\ y_{b} \\ -h \end{bmatrix} \tag{12} $$

where $(x_{t}, y_{t})$ and $(x_{b}, y_{b})$ represent the $(N, E)$ horizontal positions of the target and UAV, respectively, and h is the UAV’s altitude above ground level (with $D_{b} = -h$ in the NED frame).

As illustrated in Figure 5, by taking the dot product of Equation (12) with the inertial frame’s downward unit vector $\hat{k}_{I}$, the distance parameter $ρ$ can be solved as

$$ ρ = \frac{h}{\hat{i}_{g} \cdot \hat{k}_{I}} \tag{13} $$

This allows for the determination of the target’s position by substituting Equation (13) back into Equation (11):

$$ \vec{p}_{T} = \vec{p}_{B} + \frac{h}{\hat{i}_{g} \cdot \hat{k}_{I}} \hat{i}_{g} \tag{14} $$

In this formulation, the UAV position $\vec{p}_{B}$ and altitude h are provided by the flight control computer, while the unit vector $\hat{i}_{g}$ is derived from the measured pan and tilt angles of the gimbal. This approach provides a straightforward and computationally efficient method for the real-time geolocation of ground targets during surveillance missions.
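Equations (11) to (14) amount to intersecting the line of sight with the ground plane. The sketch below assumes NumPy, NED coordinates, and a flat ground at zero altitude; the function names, the guard threshold `min_down`, and the example values are hypothetical.

```python
import numpy as np


def los_in_vehicle_frame(R_bv, pan, tilt):
    """Gimbal x-axis in F_V: R_bv applied to the first column of Equation (6)."""
    i_g_body = np.array([np.cos(pan) * np.cos(tilt),
                         np.sin(pan) * np.cos(tilt),
                         -np.sin(tilt)])
    return R_bv @ i_g_body


def geolocate(p_uav, i_g, min_down=1e-3):
    """Equation (14): intersect the line of sight with the flat ground plane."""
    p_uav = np.asarray(p_uav, dtype=float)
    h = -p_uav[2]      # altitude above ground, since D_b = -h in NED
    down = i_g[2]      # dot product of i_g with k_I = [0, 0, 1]
    if down < min_down:
        # The camera looks at or above the horizon: no ground intersection.
        raise ValueError("line of sight does not intersect the ground plane")
    rho = h / down     # Equation (13)
    return p_uav + rho * i_g


# Hypothetical example: level UAV with zero yaw (R_bv = identity) at 100 m altitude.
R_bv = np.eye(3)
pan, tilt = np.radians(45.0), -np.arcsin(1.0 / np.sqrt(3.0))
p_target = geolocate([0.0, 0.0, -100.0], los_in_vehicle_frame(R_bv, pan, tilt))
print(np.round(p_target, 3))
```

The guard on the downward component reflects a practical limit of Equation (13): as the line of sight approaches the horizon, the denominator tends to zero and small angle errors produce large position errors.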

2.3. Kalman Filtering

Accurate geolocation of ground targets can be affected by multiple factors: inaccurate position and attitude measurements of the UAV, disturbances from wind or gusts causing vehicle vibration, and the inherent limitations of the gimbal controller response. Additionally, targets may become temporarily obscured by obstacles during surveillance. To address these challenges and improve geolocation accuracy, this section details the development of a Kalman filtering approach based on a ground vehicle motion model.

Due to the inherent nonlinearity in both the target motion dynamics and the gimbal-to-target measurement geometry, standard linear filtering approaches are not suitable for this application. Therefore, nonlinear filtering techniques are required, with an extended Kalman filter (EKF) and unscented Kalman filter (UKF) being the most practical options. The EKF linearizes the nonlinear process and measurement models using first-order Taylor expansion around the current state estimate, requiring computation of Jacobian matrices. The UKF employs the unscented transform to deterministically select sigma points that capture the mean and covariance of the state distribution through nonlinear transformations without requiring analytical derivatives. Given the complexity of the gimbal geometry and target motion, the UKF is adopted in this work for its improved handling of nonlinear transformations, although both the EKF and UKF were evaluated in comparison [27].

The filtering framework requires both a process model to predict target motion and a measurement model to incorporate visual observations. The process model is derived from a unicycle-type vehicle kinematic relationship, which provides a reasonable approximation of ground target motion. The equations of motion are formulated as follows:

$$ \begin{aligned} \dot{x} &= v \sin θ \\ \dot{y} &= v \cos θ \\ \dot{θ} &= ω \\ \dot{v} &= a \end{aligned} \tag{15} $$

where x and y represent the horizontal position of the vehicle, v is the forward speed, $θ$ is the heading, $ω$ is the turning rate, and a is the linear acceleration of the vehicle.

Discretizing Equation (15) with a fixed sampling period $Δt$, using the forward Euler method augmented with a second-order acceleration term in the position states, yields the discrete process model for the Kalman filter:

$$ \begin{aligned} x_{k+1} &= x_{k} + Δt\, v_{k} \sin θ_{k} + \tfrac{1}{2} Δt^{2} a_{k} \sin θ_{k} + ε_{N,k}, \\ y_{k+1} &= y_{k} + Δt\, v_{k} \cos θ_{k} + \tfrac{1}{2} Δt^{2} a_{k} \cos θ_{k} + ε_{E,k}, \\ θ_{k+1} &= θ_{k} + Δt\, ω_{k} + ε_{θ,k}, \\ v_{k+1} &= v_{k} + Δt\, a_{k} + ε_{v,k}, \\ ω_{k+1} &= ω_{k} + ε_{ω,k}, \\ a_{k+1} &= a_{k} + ε_{a,k}. \end{aligned} \tag{16} $$

The turning rate and linear acceleration are modeled as Brownian motion (random-walk) variables driven by Gaussian process noises $ε_{ω} \sim N(0, σ_{ω}^{2})$ and $ε_{a} \sim N(0, σ_{a}^{2})$. With the state vector defined as $X_{k} = [x_{k}, y_{k}, θ_{k}, v_{k}, ω_{k}, a_{k}]^{T}$, the additional process noises $ε_{N,k}$, $ε_{E,k}$, $ε_{θ,k}$, and $ε_{v,k}$ account for modeling errors and unexpected target motion. The linearization of Equation (15) particularly affects position states, resulting in relatively larger process noise values for $ε_{N,k}$ and $ε_{E,k}$. The process noise covariance matrix takes the following diagonal form, with entries determined empirically and tuned using experimental data to improve filter performance:

$$ Q_{k} = \mathrm{diag}(σ_{n}^{2}, σ_{e}^{2}, σ_{θ}^{2}, σ_{v}^{2}, σ_{ω}^{2}, σ_{a}^{2}) \tag{17} $$
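The noise-free part of Equation (16), together with the Jacobian an EKF would need, can be sketched as follows. This is an illustration under stated assumptions: NumPy is used, the function names are hypothetical, and the standard deviations placed in `Q` are placeholders, since the tuned values are not listed in this section.

```python
import numpy as np


def process_step(X, dt):
    """Noise-free propagation of Equation (16); X = [x, y, theta, v, omega, a]."""
    x, y, th, v, om, a = X
    s, c = np.sin(th), np.cos(th)
    return np.array([
        x + dt * v * s + 0.5 * dt**2 * a * s,
        y + dt * v * c + 0.5 * dt**2 * a * c,
        th + dt * om,
        v + dt * a,
        om,
        a,
    ])


def process_jacobian(X, dt):
    """Partial derivatives of process_step with respect to the state (used by an EKF)."""
    _, _, th, v, _, a = X
    s, c = np.sin(th), np.cos(th)
    d = dt * v + 0.5 * dt**2 * a  # distance travelled along the heading in one step
    F = np.eye(6)
    F[0, 2], F[0, 3], F[0, 5] = d * c, dt * s, 0.5 * dt**2 * s
    F[1, 2], F[1, 3], F[1, 5] = -d * s, dt * c, 0.5 * dt**2 * c
    F[2, 4] = dt
    F[3, 5] = dt
    return F


# Placeholder standard deviations for Equation (17); not the values used in the paper.
sigmas = np.array([0.5, 0.5, 0.05, 0.2, 0.02, 0.1])
Q = np.diag(sigmas**2)

# One EKF-style covariance prediction from an arbitrary state and covariance.
X0 = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0])
P0 = np.eye(6)
F0 = process_jacobian(X0, 0.1)
X1 = process_step(X0, 0.1)
P1 = F0 @ P0 @ F0.T + Q
print(X1)
```

With zero heading the model moves the target along the y state only, which follows from the sine and cosine placement in Equation (15).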

The measurement model utilizes the pixel values of the target relative to the center of the camera image frame. Even when the target deviates from the image center due to disturbances, an accurate position estimate can still be obtained by incorporating the pixel error into the model. Based on the pinhole camera model shown in Figure 6, the measurement equation is

$$ s \begin{bmatrix} 1 \\ Δx_{pix} \\ Δy_{pix} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & f_{x} & 0 \\ 0 & 0 & f_{y} \end{bmatrix} C_{G}^{I}(ν, λ, ψ, θ, ϕ) \begin{bmatrix} x_{t} - x_{b} \\ y_{t} - y_{b} \\ 0 - (-h) \end{bmatrix} + μ_{k} \tag{18} $$

where $C_{G}^{I}$ is the direction cosine matrix (DCM) from the gimbal frame to the inertial frame, rotating opposite to $R_{G}^{I}$. The parameter s represents the depth (the distance from the camera to the target along the optical axis), which acts as a scale factor in the pinhole camera model, while $f_{x}$ and $f_{y}$ are the horizontal and vertical focal distances of the camera in pixels. The measurement noise $μ_{k} \sim N(0, R_{k})$ follows a Gaussian distribution with zero mean and covariance $R_{k}$.
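A noise-free version of Equation (18) might be evaluated as sketched below. Several points are assumptions of this sketch rather than statements from the paper: the matrix applied to the relative position is taken to be the transpose of the composite rotation in Equation (3), so that it yields gimbal-frame coordinates; the gimbal x-axis is treated as the optical axis; and the gimbal y- and z-axes are mapped to the horizontal and vertical image directions. Function names and numeric values are hypothetical.

```python
import numpy as np


def rot_x(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])


def rot_y(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def rot_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def predict_pixels(X, p_uav, attitude, pan, tilt, fx, fy):
    """Predicted pixel offsets (dx_pix, dy_pix) of the target from the image center."""
    roll, pitch, yaw = attitude
    R_gv = rot_z(yaw) @ rot_y(pitch) @ rot_x(roll) @ rot_z(pan) @ rot_y(tilt)
    C = R_gv.T  # assumption: maps inertial-frame vectors into gimbal-frame coordinates
    # Relative position of the ground target; p_uav[2] = -h, so the last entry is h.
    rel = np.array([X[0] - p_uav[0], X[1] - p_uav[1], 0.0 - p_uav[2]])
    cam = C @ rel
    s = cam[0]  # depth along the optical axis (first row of Equation (18))
    if s <= 0.0:
        raise ValueError("target is behind the camera")
    return np.array([fx * cam[1] / s, fy * cam[2] / s])


# Hypothetical check: gimbal pointed exactly at a target 100 m north and east.
X = np.array([100.0, 100.0, 0.0, 0.0, 0.0, 0.0])
p_uav = np.array([0.0, 0.0, -100.0])
pan, tilt = np.radians(45.0), -np.arcsin(1.0 / np.sqrt(3.0))
centered = predict_pixels(X, p_uav, (0.0, 0.0, 0.0), pan, tilt, fx=1000.0, fy=1000.0)
print(np.round(centered, 6))
```

When the gimbal points exactly at the target, both predicted offsets vanish; any residual pixel error therefore carries information about the difference between the estimated and actual target position.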

The covariance matrix $R_{k}$ adjusts the weighting of measurements in the estimation filter. Through experimental flight tests, the noise levels in pixel measurements were determined empirically. When tracking targets, ideally, they should remain centered in the image frame according to Equation (10), but rapid UAV maneuvers or unexpected target movements create pixel errors. From these observations, the standard deviations for x-axis and y-axis pixel errors were determined to be $σ_{Δx_{pix}} = 5.76$ [pix] and $σ_{Δy_{pix}} = 12.04$ [pix], respectively. The resulting measurement noise covariance matrix is

$$ R = \mathrm{diag}(σ_{Δx_{pix}}^{2}, σ_{Δy_{pix}}^{2}) \tag{19} $$

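For the UKF implementation, the following parameters were used: $α = 1 \times 10^{-3}$, $β = 2.0$, and $κ = 0.0$. These parameters control the sigma point generation: $α$ sets the spread of the sigma points around the mean, $β$ incorporates prior knowledge of the state distribution ($β = 2$ is the usual choice for Gaussian distributions), and $κ$ is a secondary scaling parameter. These values were found to perform well for the nonlinear state estimation in this application.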
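To make the role of these parameters concrete, the sketch below builds the sigma points and weights of the standard scaled unscented transform for the six-dimensional state. It assumes NumPy and the common formulation with $λ = α^{2}(n + κ) - n$; the paper does not spell out its exact sigma-point scheme, and the function names and the example mean and covariance are hypothetical.

```python
import numpy as np


def ukf_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    """Scaling parameter and mean/covariance weights of the scaled unscented transform."""
    lam = alpha**2 * (n + kappa) - n
    wm = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1.0 - alpha**2 + beta)
    return lam, wm, wc


def sigma_points(mean, cov, lam):
    """Return the 2n + 1 sigma points as rows of an array."""
    mean = np.asarray(mean, dtype=float)
    n = mean.size
    # Lower-triangular factor L with L @ L.T = (n + lam) * cov; its columns are the offsets.
    L = np.linalg.cholesky((n + lam) * cov)
    pts = np.tile(mean, (2 * n + 1, 1))
    pts[1:n + 1] += L.T
    pts[n + 1:] -= L.T
    return pts


# Hypothetical state mean and covariance for X = [x, y, theta, v, omega, a].
mean = np.array([10.0, -5.0, 0.3, 4.0, 0.05, 0.1])
cov = np.diag([4.0, 4.0, 0.01, 1.0, 0.001, 0.01])

lam, wm, wc = ukf_weights(n=6)
pts = sigma_points(mean, cov, lam)
print(lam, wm[0], wm[1])
```

With $α = 10^{-3}$ the sigma points sit very close to the mean, and the central weight becomes a large negative number that is balanced by the remaining weights so that they still sum to one.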

3. Tests and Results

This section presents the results of both the simulation and flight experiments conducted to evaluate the effectiveness of the proposed target estimation algorithm utilizing a gimbal-mounted camera on a UAV. The algorithm’s performance was first analyzed in a simulation environment by comparing the accuracy of target geolocation and estimation using a moving average filter, an extended Kalman filter (EKF), and an unscented Kalman filter (UKF). Subsequently, real-world flight experiments were performed to further examine the algorithm’s robustness and accuracy under practical conditions. The results confirm that the proposed target estimation method provides reliable and precise target localization across various scenarios.

3.1. Simulation Studies

3.1.1. Simulation Environment Setup

To validate the target position estimation algorithm, a simulation environment was developed using MATLAB R2022 and UNITY R2022. In this setup shown in Figure 7, MATLAB simulated UAV dynamics, computed the UAV’s attitude and position, and transmitted these data to UNITY via TCP/IP communication. UNITY utilized the received UAV states to compute the gimbal’s pan/tilt compensation angles based on the target’s pixel coordinates within the camera model’s image frame. The target geolocation algorithm, in combination with the Kalman filter-based estimation techniques, was then employed to estimate the target’s position and velocity.

For these simulations, measurement noise was modeled as a Gaussian distribution with a standard deviation of 5 m ($1σ$). The target followed a predefined trajectory, initially moving 40 m northward before shifting 30 m westward from its starting position. This simulation framework enabled a comparative analysis of different estimation approaches, specifically evaluating the performance of the moving average filter, EKF, and UKF in target tracking.

3.1.2. Filter Parameter Analysis

As an initial step in the simulation-based evaluation, a sensitivity analysis was conducted on the Kalman filter’s measurement noise covariance matrix R. All analyses in this subsection were performed within the simulation environment previously described, which allowed for a systematic observation of the filter’s behavior under controlled noise conditions. Although both process noise (Q) and measurement noise (R) influence estimation performance, this study focused on R due to the nature of the sensing modality. As the proposed method relies on image-based observations from a gimbal-mounted camera, uncertainty introduced by pixel noise, lens distortion, and calibration error was expected to be the dominant source of measurement error.

The analysis considered three settings: nominal R, increased $10R$, and decreased $0.1R$. Each was tested to evaluate its impact on estimation responsiveness, stability, and noise sensitivity. The nominal R setting offered the best trade-off, as demonstrated in Figure 8a, where the trajectory showed moderate smoothness without significant delay or noise sensitivity. The corresponding metrics in Table 1 confirm its overall superiority among the tested configurations. Under the $10R$ condition, the filter assigned less weight to measurements and relied more on model predictions. As shown in Figure 8b, this resulted in a smoother trajectory but introduced noticeable phase lag during directional changes. Table 1 supports this observation, with the north-direction mean and standard deviation errors rising to 11.91 m and 13.51 m, respectively, both significantly higher than those of the nominal case. Conversely, with $0.1R$, the filter placed excessive confidence in noisy measurements, resulting in fast but unstable estimates. Figure 8c shows visible jitter in the trajectory, and the standard deviation increased despite a relatively low mean error (1.67 m in the north direction). The instability in this configuration undermined the robustness of the filter under realistic conditions.

These results underscore the importance of tuning the measurement noise covariance matrix R in vision-based, dynamic tracking scenarios. In UAV operations, target motion is often rapid, while image-based measurements are inherently noisy. The comparative analysis showed that larger R values led to tracking delays, while smaller R values amplified noise and reduced stability. Such an analysis is essential for achieving reliable estimation performance in vision-based UAV systems operating under dynamic conditions.
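The trade-off described above can be illustrated with a scalar toy problem: for a random-walk state, the steady-state Kalman gain shrinks as R grows, so the filter leans on its model and lags, whereas a small R drives the gain toward one and lets measurement noise through. The sketch below is not the paper's filter; the values of `q` and `r_nominal` and the function name are hypothetical.

```python
def steady_state_gain(q, r, iterations=200):
    """Iterate the scalar Kalman recursion for a random-walk state until the gain settles."""
    p = r  # arbitrary positive initial variance
    gain = 0.0
    for _ in range(iterations):
        p_pred = p + q                  # prediction inflates the variance by q
        gain = p_pred / (p_pred + r)    # Kalman gain for a direct measurement
        p = (1.0 - gain) * p_pred       # measurement update shrinks the variance
    return gain


q = 1.0            # hypothetical process noise variance
r_nominal = 25.0   # hypothetical measurement noise variance

gains = {scale: steady_state_gain(q, scale * r_nominal) for scale in (0.1, 1.0, 10.0)}
for scale, gain in gains.items():
    print(f"{scale:>4} R -> steady-state gain {gain:.3f}")
```

The gain decreases monotonically from the 0.1 R case to the 10 R case, which mirrors the jitter observed with small R and the phase lag observed with large R.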

3.1.3. Algorithm Comparison Under Nominal Conditions

To evaluate the suitability of different filtering methods for vision-based target estimation in UAV applications, three filters were tested under nominal conditions—defined here as stable flight with moderate sensor noise, minimal environmental disturbances, and no obstructions. The filters tested included a moving average filter, an extended Kalman filter (EKF), and an unscented Kalman filter (UKF). These were selected based on their trade-offs between responsiveness and accuracy—factors critical for onboard UAV systems operating in noisy and dynamic environments.

The raw target geolocation data represent position estimates derived from the target geolocation algorithm under realistic measurement conditions simulated with sensor noise, platform vibrations, and wind gusts. As shown in Figure 9, these disturbances caused significant fluctuations in the estimated trajectory, leading to observable error spikes during target tracking.

The moving average filter achieved some noise suppression by averaging recent measurements, thereby smoothing out short-term fluctuations. However, because it lacks a predictive dynamic model and assigns equal weight to all data within the window, it responded sluggishly to sudden changes in the target’s motion. This resulted in a noticeable phase delay, particularly evident in the east direction in Figure 9a, as the filter tended to “lag behind” real-time changes. The delay became more pronounced with larger window sizes, limiting responsiveness despite improved smoothness.

In contrast, the EKF significantly reduced both the mean error and standard deviation compared to the raw geolocation and moving average results. As reported in Table 2, the EKF roughly halved the north-direction mean error, from 2.17 m to 1.10 m, while maintaining robust performance during motion transitions. This improvement can be attributed to its model-based prediction and measurement update structure, which effectively balances responsiveness and stability under moderate-nonlinearity conditions typical of many UAV tracking scenarios.

The UKF was evaluated to determine whether its more accurate handling of nonlinearities could further enhance estimation performance. It produced strong results, with error metrics being nearly indistinguishable from those of the EKF. Slight improvements were observed during sharp maneuvers, where the UKF’s unscented transform more effectively captured nonlinear state evolution. However, under the nominal conditions tested, where nonlinearity was relatively mild, the UKF did not offer a significant performance advantage.
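The phase delay of the moving average filter can be quantified on a simple case: for a target moving at constant speed, a causal window of N samples lags the true position by (N - 1)/2 samples. The sketch below assumes NumPy; the speed, sampling period, and window size are hypothetical and are not the settings used in the simulations.

```python
import numpy as np


def moving_average(signal, window):
    """Causal moving average: output k averages the most recent `window` samples."""
    signal = np.asarray(signal, dtype=float)
    csum = np.cumsum(signal)
    out = np.empty_like(signal)
    for k in range(signal.size):
        lo = max(0, k - window + 1)
        total = csum[k] - (csum[lo - 1] if lo > 0 else 0.0)
        out[k] = total / (k - lo + 1)
    return out


speed, dt, window = 5.0, 0.1, 11           # m/s, s, samples (illustrative values)
truth = speed * dt * np.arange(100)        # constant-speed motion along one axis
filtered = moving_average(truth, window)

lag_m = truth[-1] - filtered[-1]           # steady-state position lag in meters
lag_s = (window - 1) / 2 * dt              # equivalent delay in seconds
print(lag_m, lag_s)
```

For these values the filtered position trails the truth by 2.5 m, or 0.5 s of travel, and the lag grows linearly with the window length.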

These results suggest that both the EKF and UKF are highly effective for vision-based target estimation under nominal conditions, providing a clear improvement over simple averaging. While the UKF offers greater theoretical robustness, the EKF performs comparably in this scenario and remains a reliable, computationally efficient choice.

3.1.4. Robustness Testing with Occlusion

To assess the robustness of the estimation filters under partial measurement loss, a simulation scenario was implemented in which the target became temporarily unobservable due to occlusion. Specifically, geolocation measurements were withheld between t = 26 s and t = 33 s, simulating a period during which the target was obscured by an obstacle and could not be detected by the camera. During this interval, measurement updates were skipped, and only the prediction step of each filter was executed. Predicted state estimates were propagated using the kinematic motion model defined in Equation (16).
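The prediction-only behavior during occlusion can be sketched as follows. This is a minimal illustration, not the authors' implementation: the state layout `[north, east, heading, speed, turn_rate]` is an assumed form of the unicycle model, and the fixed `gain` is a hypothetical stand-in for the Kalman gain so that the control flow stays visible.

```python
import numpy as np

def unicycle_predict(state, dt):
    """Propagate [north, east, heading, speed, turn_rate] one step (assumed layout)."""
    n, e, psi, v, omega = state
    return np.array([
        n + v * np.cos(psi) * dt,
        e + v * np.sin(psi) * dt,
        psi + omega * dt,
        v,
        omega,
    ])

def run_filter(measurements, state0, dt, gain=0.5):
    """Predict every step; correct the position only when a measurement exists."""
    state = np.asarray(state0, dtype=float)
    history = []
    for z in measurements:
        state = unicycle_predict(state, dt)
        if z is not None:                      # occluded -> z is None -> skip update
            state[:2] += gain * (np.asarray(z) - state[:2])
        history.append(state.copy())
    return np.array(history)

dt = 0.05
steps = int(40 / dt)
truth = [np.array([0.0, 0.0, 0.0, 5.0, 0.1])]
for _ in range(steps):
    truth.append(unicycle_predict(truth[-1], dt))
truth = np.array(truth[1:])

# Withhold measurements for 26 s <= t < 33 s, as in the occlusion scenario.
times = dt * np.arange(1, steps + 1)
measurements = [None if 26.0 <= t < 33.0 else p
                for t, p in zip(times, truth[:, :2])]
estimates = run_filter(measurements, [0.0, 0.0, 0.0, 5.0, 0.1], dt)
```

With noise-free measurements and a matching model the estimate follows the truth through the gap. In practice the prediction drifts during the outage, and the drift depends on how well the filter propagates the state and its uncertainty.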

Figure 10 and Table 3 summarize the tracking performance under occlusion. As expected, the raw geolocation algorithm and the moving average filter were unable to provide estimates in the absence of measurements. In contrast, both the EKF and UKF continued to generate predicted estimates by propagating prior states. While estimation errors increased during the occlusion period due to the lack of corrective updates, the extent of degradation differed between the EKF and UKF. Because the EKF relies on a linearized version of the motion model, it introduces approximation errors when the system exhibits nonlinear behavior—particularly during target turning. This results in an accumulated prediction error, as reflected by elevated standard deviation values during occlusion (e.g., 6.38 m in the east direction). In comparison, the UKF maintained more consistent performance by propagating sigma points through the nonlinear system model without requiring linearization. This approach more accurately captured the target’s trajectory during prediction-only intervals, resulting in reduced estimation errors. As shown in Table 3, the UKF achieved a 52% reduction in the east-direction standard deviation compared to the EKF (from 6.38 m to 3.04 m), and it also maintained a lower mean error throughout the occlusion period.
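The difference between evaluating the model at the mean (as a linearized prediction does) and propagating sigma points can be illustrated with a deliberately simplified scalar example. The sketch below assumes a target moving in a straight line for 7 s (the occlusion duration) with an uncertain heading; the speed, heading uncertainty, and `kappa` value are illustrative choices, and the function names are hypothetical. It is not the filter used in the paper.

```python
import numpy as np

def displacement(psi, v=5.0, horizon=7.0):
    """Straight-line displacement [north, east] after `horizon` seconds at heading psi."""
    return np.array([v * horizon * np.cos(psi), v * horizon * np.sin(psi)])

def unscented_mean(f, mean, std, kappa=2.0):
    """Unscented-transform mean of f(x) for a scalar x ~ N(mean, std^2)."""
    n = 1
    spread = np.sqrt(n + kappa) * std
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return sum(w * f(p) for w, p in zip(weights, points))

heading_std = 0.5                                        # [rad], illustrative
linearized = displacement(0.0)                           # model evaluated at the mean only
unscented = unscented_mean(displacement, 0.0, heading_std)
exact_north = 5.0 * 7.0 * np.exp(-heading_std**2 / 2)    # E[cos(psi)] for Gaussian psi
```

In this toy case the sigma-point mean lies much closer to the exact expected displacement than the value obtained at the mean heading, which is the mechanism the text credits for the UKF's lower error during prediction-only intervals.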

These results demonstrate that the UKF offers improved robustness in scenarios where measurements are intermittently unavailable. Its ability to account for system nonlinearity during prediction enables more accurate estimation under occlusion, making it a valuable choice for vision-based tracking systems operating in cluttered or partially observable environments.

3.1.5. Performance Analysis Across Target Speed Variations

To evaluate the algorithm’s robustness across diverse operational scenarios, systematic speed variation tests were conducted. While the nominal case (Section 3.1.3) validated basic tracking capabilities, real-world surveillance missions encounter targets with varying velocities, from slow-moving pedestrians to fast-moving vehicles. This analysis examines target velocities across operationally relevant ranges to assess performance boundaries. The experiments were conducted in a Unity-based high-fidelity simulation environment with MATLAB control integration. The drone maintained an altitude of 150 m via an altitude-hold controller while performing loitering flight to keep the target centered through active gimbal tracking. The onboard camera operated at 30 fps with a fixed field of view. The simulation sampling rate was 20 Hz (dt = 0.05 s). The target followed a circular trajectory at three systematically varied speeds to isolate velocity effects on tracking performance:

Low speed: 2.5 m/s (jogging).
Nominal speed: 5.0 m/s (slow vehicle speed).
High speed: 10.0 m/s (moderate vehicle speed).

Each speed regime was evaluated over 180 s to observe performance trends across extended tracking periods.
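The three test cases can be reproduced in outline with a short Python sketch. The circle radius of 50 m is an assumption made only for illustration (the radius is not given in this section), and `circular_trajectory` is a hypothetical helper name.

```python
import numpy as np

def circular_trajectory(speed, radius=50.0, duration=180.0, dt=0.05):
    """North/east positions of a target circling the origin at constant speed."""
    t = np.arange(0.0, duration, dt)
    omega = speed / radius                 # turn rate [rad/s]
    north = radius * np.cos(omega * t)
    east = radius * np.sin(omega * t)
    return t, np.column_stack([north, east])

# Low, nominal, and high speed cases [m/s], each 180 s long at 20 Hz.
trajectories = {v: circular_trajectory(v) for v in (2.5, 5.0, 10.0)}
turn_rates = {v: v / 50.0 for v in trajectories}
```

For a fixed radius the turn rate scales linearly with speed, so the high-speed case changes heading four times faster than the low-speed case. This is one reason the nonlinearity of the motion model becomes more pronounced as the target speeds up.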

Table 4 presents the position estimation errors across all speed regimes, and Figure 11 shows the trajectory comparison for visual validation.

Low-speed performance (2.5 m/s): At jogging speeds, both filters demonstrate improved accuracy over the raw geolocation. The raw geolocation shows mean errors of approximately 1.48 m with a standard deviation of 1.90 m. The EKF reduces the mean errors to 0.82 m (north) and 0.59 m (east), though the standard deviation increases to 2.69 m, indicating some instability during continuous turning. The UKF achieves superior performance with mean errors of 0.71 m and 0.64 m and a better standard deviation of 1.81 m. In this speed regime, the UKF demonstrates more stable tracking characteristics.

Nominal-speed performance (5.0 m/s): At slow vehicle speeds representative of typical urban vehicles, both filters maintain excellent tracking accuracy. The raw geolocation shows mean errors of 1.56 m (north) and 1.51 m (east) with a standard deviation of 1.96 m. The EKF achieves 0.76 m and 0.62 m mean errors with a 1.10 m standard deviation, demonstrating effective handling of the nonlinear circular motion. The UKF delivers comparable performance with mean errors of 0.78 m and 0.70 m and a slightly better standard deviation of 1.00 m. In this speed regime, both filters perform similarly with consistent accuracy.

High-speed performance (10.0 m/s): At moderate vehicle speeds, tracking complexity increases substantially. The raw geolocation shows mean errors of 1.60 m (north) and 1.56 m (east) with a standard deviation of 2.01 m. The EKF reduces the mean errors to 1.24 m and 1.00 m, successfully improving positional accuracy. However, its standard deviation slightly increases to 2.02 m, suggesting that the linearization approach introduces some inconsistency when handling rapidly changing dynamics during high-speed circular motion. In contrast, the UKF demonstrates superior performance with mean errors of 1.03 m and 0.96 m while maintaining a significantly better standard deviation of 1.33 m. The unscented transform’s ability to handle nonlinear dynamics without linearization provides more stable and consistent tracking at high speeds.

These results demonstrate that both filters maintained reliable tracking performance across the entire tested speed range from 2.5 to 10.0 m/s. While both filters achieved improved mean accuracy over the raw geolocation, the UKF consistently provided better stability across all speed regimes, particularly at high speeds where the EKF’s linearization limitations became apparent. The consistent performance across this diverse speed range confirms the algorithm’s applicability to various surveillance scenarios, from pedestrian tracking to typical vehicle monitoring.

3.1.6. Performance Evaluation with Complex Target Trajectories

To assess the algorithm’s capability to handle unpredictable target maneuvers, a complex trajectory scenario was designed to simulate exploratory movements. While the previous tests examined performance under controlled circular motion at varying speeds, operational surveillance often involves targets executing irregular patterns with frequent direction changes, variable curvature turns, and non-uniform velocity profiles, including sudden stalls and accelerations. This test evaluates the filters’ ability to maintain accurate tracking under such challenging conditions. The target trajectory consisted of multiple interconnected circular paths with varying radii and orientations, creating a figure-8-like pattern with additional loops, as shown in Figure 12. This path includes sharp turns, direction reversals, and segments with high curvature that stress both the motion model and the filter’s ability to adapt to rapidly changing dynamics. The target maintained an average speed of 5.0 m/s throughout the 200-s simulation, with natural speed variations occurring during turns.

Table 5 summarizes the estimation performance for this complex scenario. The raw geolocation achieves mean errors of 1.48 m (north) and 1.59 m (east) with standard deviations of 1.84 m and 1.65 m. Interestingly, the EKF shows degraded performance compared to the raw geolocation, with mean errors of 1.81 m (north) and 1.41 m (east) and the standard deviation increasing to 2.83 m. This counterintuitive result occurs because the frequent direction changes and varying curvature violate the linearization assumptions of the EKF, causing the filter to introduce additional errors rather than improving accuracy. In contrast, the UKF demonstrates robust performance with mean errors of 1.24 m (north) and 1.15 m (east) and standard deviations of 1.60 m and 1.46 m. The UKF achieves superior tracking compared to both the raw geolocation and EKF, benefiting from its ability to propagate uncertainty through the nonlinear motion model without linearization. These results confirm that the UKF maintains reliable tracking performance under dynamic and unpredictable target motion patterns, demonstrating the algorithm’s suitability for surveillance scenarios where target behavior cannot be anticipated.

3.2. Experimental Implementation

3.2.1. Hardware Platform

To validate the proposed algorithm, a custom UAV platform was developed in-house, as illustrated in Figure 13. The onboard avionics architecture comprises three primary components: a mission computer, a two-axis gimbal camera system, and a flight control computer. The gimbal camera system utilizes a Q10N unit by Viewpro (China) mounted beneath the UAV via a custom-designed retractable mechanism. The camera provides up to 1080p resolution with a 10× optical zoom, making it suitable for vehicle-scale target detection at operational altitudes. This system receives pan and tilt commands from the mission computer to execute vision-based target tracking operations while simultaneously providing angular feedback data for the implementation of the geolocation algorithm. To address lens distortion and improve accuracy, we calibrated the camera using the checkerboard method at the outdoor test site. This approach was found to be more reliable than relying solely on manufacturer-provided focal length specifications or polynomial distortion coefficients, particularly for the variable-zoom gimbal camera.

The mission computer, based on a Jetson Nano platform by NVIDIA (United States), serves as the central processing unit for the onboard systems. Its primary functions include image acquisition, target detection and tracking algorithm execution, and generation of low-level autopilot commands. These commands are transmitted to the flight control computer (Pixhawk4 by Holybro (China)) to implement the stand-off tracking algorithm for mobile ground target following. System supervision is maintained through a ground control station (GCS) that monitors vehicle status via an RF data link connected to the flight control computer. Additionally, a dedicated video modem transmits real-time gimbal camera footage to enable online monitoring of target detection and tracking operations.

3.2.2. System Architecture

Figure 14 illustrates the mission control system architecture, which integrates four key functions: target tracking within the image frame, gimbal control, target state estimation, and serial communication. Each function operates as an independent thread to enable parallel processing, with data exchange managed through mutex locks and condition variables.
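The thread-to-thread data exchange described above can be sketched with a single-slot mailbox guarded by a condition variable. This is an illustrative pattern in Python, not the onboard code (whose implementation language is not stated here); `LatestValue` and `tracker_thread` are hypothetical names.

```python
import threading

class LatestValue:
    """Single-slot mailbox: a producer publishes, a consumer waits for fresh data."""

    def __init__(self):
        self._cond = threading.Condition()   # wraps a mutex internally
        self._value = None
        self._fresh = False

    def publish(self, value):
        with self._cond:
            self._value = value
            self._fresh = True
            self._cond.notify()

    def wait_for_update(self, timeout=1.0):
        with self._cond:
            if not self._cond.wait_for(lambda: self._fresh, timeout=timeout):
                return None                  # timed out: no fresh data arrived
            self._fresh = False
            return self._value

def tracker_thread(mailbox, pixel_offsets):
    """Stand-in for the image tracking thread: publishes target pixel offsets."""
    for offset in pixel_offsets:
        mailbox.publish(offset)

mailbox = LatestValue()
producer = threading.Thread(target=tracker_thread, args=(mailbox, [(12, -4)]))
producer.start()
received = mailbox.wait_for_update()         # e.g., called from the gimbal control thread
producer.join()
```

Keeping only the most recent value suits a control loop that should act on the latest target position rather than queue stale ones. The timeout lets the consumer detect that the tracker has stopped producing data.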

The gimbal control loop operates as follows: The gimbal encoder readings are transmitted to the Jetson Nano via the UART interface, providing the current gimbal orientation. Based on the target position in the image frame, a compensation angle is calculated using Equation (10) to center the target. This compensation angle is then sent as a command to the gimbal motors, which are controlled using a PID controller.
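A centering correction of this kind can be sketched under a pinhole-camera assumption. Equation (10) is not reproduced in this section, so the expression below is a generic stand-in rather than the paper's formula; the focal lengths in pixels, the image size, and the sign conventions are all illustrative assumptions.

```python
import math

def compensation_angles(target_px, image_size, focal_px):
    """Pan/tilt corrections [rad] that would move the target to the image center.

    Assumes an ideal pinhole camera with focal lengths given in pixels.
    """
    dx = target_px[0] - image_size[0] / 2.0   # horizontal offset from center [pix]
    dy = target_px[1] - image_size[1] / 2.0   # vertical offset from center [pix]
    return math.atan2(dx, focal_px[0]), math.atan2(dy, focal_px[1])

# Target 100 pixels right of center in a 1080p frame, focal length 1000 pix (assumed).
pan, tilt = compensation_angles((1060, 540), (1920, 1080), (1000.0, 1000.0))
```

In this example the pan correction is atan(0.1), roughly 5.7°, and the tilt correction is zero. With a zoom lens the focal length in pixels changes with the zoom level, so the same pixel offset maps to a smaller angle at higher zoom.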

3.2.3. Target Detection and Tracking

For the experimental implementation, Tiny-YOLO was selected for deployment on the Jetson Nano platform. Compared to alternatives, it provided a better overall balance between detection accuracy, inference speed, and memory usage, particularly for recognizing small targets typical in UAV surveillance scenarios [28]. Tiny-YOLO achieved approximately 10-frames-per-second throughput on the Jetson Nano while maintaining sufficient detection accuracy for vehicles at the operational distances tested. The system integrates YOLO detection with CSRT tracking through a two-stage process: initial target acquisition using YOLO generates the bounding box ROI, followed by continuous tracking using CSRT to monitor target displacement across sequential frames. This displacement information directly drives the gimbal compensation system to maintain target centering in the image frame. The implementation proved particularly effective during UAV operations involving altitude changes and zoom adjustments, where the CSRT algorithm’s robustness against scale and appearance variations enabled consistent target lock maintenance despite the dynamic viewing conditions induced by UAV motion.
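The two-stage detect-then-track logic can be expressed as a small state machine. The sketch below is a simplified illustration: `detect` and `make_tracker` are hypothetical callables supplied by the caller, and the tracker object is assumed to expose an `update(frame)` method returning `(ok, box)`, which is the convention followed by OpenCV tracker objects.

```python
def track_target(frames, detect, make_tracker):
    """Yield one bounding box (or None) per frame using detect-then-track logic."""
    tracker = None
    for frame in frames:
        box = None
        if tracker is not None:
            ok, box = tracker.update(frame)
            if not ok:                 # lock lost: drop the tracker and re-detect
                tracker, box = None, None
        if tracker is None:
            box = detect(frame)        # detector returns a box or None
            if box is not None:
                tracker = make_tracker(frame, box)
        yield box
```

With the OpenCV contrib modules installed, `make_tracker` could plausibly wrap `cv2.TrackerCSRT_create()` followed by `init(frame, box)`, and `detect` could wrap a Tiny-YOLO forward pass. Both mappings are assumptions about one possible wiring rather than a description of the onboard software.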

3.3. Flight Experiments

3.3.1. Flight Test Methodology

Flight experiments were conducted to validate the proposed target estimation algorithm under real-world conditions. Initially, the gimbal camera pan and tilt angles were set to 0°, facing the front of the UAV. Upon receiving a signal from the ground operator, the gimbal was directed toward the initial target position and detected the target using the YOLO algorithm, as shown in Figure 3. Once the target was successfully detected and confirmed by the ground station operator, the system initiated target tracking and position estimation using the CSRT algorithm. Based on the simulation results presented in the previous section, the UKF demonstrated superior performance compared to the EKF; therefore, the UKF was implemented for target position estimation in the flight tests.

3.3.2. Stationary Target Tests

The first experiment evaluated the algorithm’s performance with a stationary target. The UAV maintained a hovering position at approximately −6 m north, −13 m east, and 23 m altitude from the origin of the NED coordinate system to achieve a high pixel resolution for accurate baseline performance assessment. A gray vehicle, detected through YOLO and selected as the target (Figure 3), remained stationary at approximately 36 m north and −67 m east from the coordinate origin. During this experiment, significant environmental disturbances, particularly wind, affected the system’s performance. The impact of these disturbances is evident in Figure 15 and Figure 16, which demonstrate that the raw geolocation algorithm experienced noise levels exceeding 15 m. In contrast, the UKF-based position estimation algorithm achieved substantial noise reduction, as shown in Figure 15 and Figure 16 and Table 6. The standard deviation of the target position estimated through the UKF algorithm was reduced by more than 7 times compared to that estimated through the geolocation algorithm.

3.3.3. Moving Target Tests

The second experiment assessed the algorithm’s capability to track a moving target performing complex maneuvers. The UAV hovered at approximately −16 m north, −21.5 m east, and 36 m altitude from the NED coordinate origin, providing sufficient coverage for the target’s runway maneuvers while maintaining good detection quality. A gray vehicle was detected through YOLO and subsequently tracked using the CSRT algorithm, as shown in Figure 17.

The target’s trajectory consisted of three distinct phases. During the first 20 s, the target entered the runway perpendicular to its direction at approximately 0.5 m/s, as shown in Figure 17a. From 20 to 45 s, the target executed a right turn maneuver, as illustrated in Figure 17b. After 45 s, the target accelerated along the runway in the northeast direction, as depicted in Figure 17c.

Similar to the stationary target test, significant wind disturbances were present during this experiment. Figure 18 and Figure 19 confirm that the geolocation algorithm experienced maximum noise levels exceeding 5 m. The UKF-based position estimation algorithm demonstrated significant noise reduction compared to the raw geolocation data. Furthermore, the UKF algorithm maintained reliable target position estimation even during nonlinear maneuvers such as turning and acceleration changes.

Additionally, this flight experiment investigated the impact of adjusting the Kalman filter’s measurement noise covariance R on target position tracking accuracy. Three settings were evaluated, namely, the standard R, 10R, and 0.1R, consistent with the simulation experiments. According to Figure 20, the increased measurement noise covariance (10R) produced smoother estimated trajectories. However, this configuration resulted in temporal delays, as the filter became less responsive to new measurement data, failing to immediately reflect system changes. This behavior stems from the filter’s increased reliance on model-based predictions rather than measurement updates. Conversely, the reduced covariance setting (0.1R) eliminated temporal delays but significantly amplified measurement noise in the position estimates. The low measurement noise covariance caused the filter to excessively trust the measurement data, improving responsiveness at the cost of overall filtering performance.

These results demonstrate that the Kalman filter’s R value critically affects both filtering performance and responsiveness, emphasizing the importance of appropriate parameter selection for specific application requirements.
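The effect of scaling R can be seen in the steady-state gain of a scalar Kalman filter. The sketch below uses a one-dimensional random-walk model with illustrative noise values, not the paper's tuning; `steady_state_gain` is a hypothetical helper.

```python
def steady_state_gain(q, r, iterations=200):
    """Steady-state Kalman gain of a scalar random-walk model.

    q: process noise variance, r: measurement noise variance.
    """
    p = 1.0
    gain = 0.0
    for _ in range(iterations):
        p_pred = p + q                 # predict: uncertainty grows by q
        gain = p_pred / (p_pred + r)   # weight given to the new measurement
        p = (1.0 - gain) * p_pred      # update: uncertainty shrinks
    return gain

q, r_nominal = 0.01, 1.0               # illustrative values only
gains = {scale: steady_state_gain(q, scale * r_nominal)
         for scale in (0.1, 1.0, 10.0)}
```

A larger R lowers the gain, so each measurement moves the estimate less (smoother but slower). A smaller R raises the gain, so the estimate follows the measurements, and their noise, more closely. This mirrors the 10R and 0.1R behavior reported above.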

3.3.4. Fixed-Wing UAV Tests with Moving Target

The final experiment evaluated the algorithm’s performance using a fixed-wing UAV platform tracking a long-distance moving target in a realistic surveillance scenario. The experimental setup consisted of a fixed-wing UAV circling at 100 m altitude with a 100 m radius loitering pattern, maintaining approximately 141 m standoff distance from the ground target. Figure 21 shows a video clip of the flight test where a fixed-wing UAV tracks a moving target.
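The quoted standoff distance is consistent with the slant range from a point on the loiter circle to a ground target, under the assumption that the target lies near the center of the circle at ground level (with h the altitude and r the loiter radius):

```latex
d = \sqrt{h^{2} + r^{2}} = \sqrt{(100\,\mathrm{m})^{2} + (100\,\mathrm{m})^{2}} \approx 141.4\,\mathrm{m}
```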

The target trajectory comprised two phases: stationary for the first 60 s, followed by movement along a road. This scenario tested the system’s ability to track a real moving target under challenging conditions including long standoff distance, continuous platform motion, and target state transitions.

Figure 22 and Figure 23 demonstrate successful tracking throughout both target phases. As shown in Figure 22, the UAV maintained its loitering pattern based on the estimated target positions, successfully adapting to the target’s movement after t = 60 s. The UKF provided smooth state estimates by filtering measurement noise, which is particularly important when raw measurements fluctuate due to vision detection uncertainties and GPS noise. This smoothing capability enabled stable gimbal control and maintained visual lock on the target throughout the mission.

Figure 24 shows the estimation errors computed using GPS measurements from both the target vehicle and the UAV as ground truth reference. At the 141 m standoff distance, several error sources contribute to the positioning uncertainty: GPS measurement noise from both platforms (approximately ±5 m each), gimbal measurement errors, camera calibration uncertainties, and geometric projection errors at long range. The UKF estimation achieved a mean error of approximately 10 m during the stationary phase (0–60 s). While absolute positioning accuracy is limited by these sensor and calibration uncertainties, the key achievement demonstrated in this experiment is the system’s operational capability: the UAV successfully maintained persistent visual tracking of a moving target at long range, and the loitering pattern adapted appropriately based on the target position estimates. This validates that the integrated system—combining vision tracking, UKF filtering, and flight control—can perform practical surveillance missions with consumer-grade sensors.

3.4. Discussion

The evaluation, consisting of both simulation and real-world flight experiments, provided valuable insights into the performance and technical characteristics of the proposed vision-based target estimation system.

3.4.1. Filter Performance Analysis

The comparative analysis of different filtering approaches revealed distinct performance characteristics under various operational conditions. While the moving average filter provided basic noise suppression, its lack of predictive capabilities resulted in significant phase delays during target maneuvers, limiting its effectiveness for dynamic tracking scenarios. Both the EKF and UKF demonstrated substantial improvements over raw geolocation data, with standard deviation reductions exceeding 40% under nominal operating conditions.

Under challenging scenarios involving occlusion and complex trajectories, the UKF’s superiority became more pronounced. The ability to propagate uncertainty through nonlinear transformations without linearization errors enabled the UKF to maintain tracking accuracy during measurement outages, achieving a 52% reduction in the east-direction error standard deviation compared to the EKF during occlusion periods (from 6.38 m to 3.04 m). This advantage stems from the UKF’s unscented transform, which more accurately captures the mean and covariance of the state distribution through the nonlinear process model during prediction-only intervals.

3.4.2. Parameter Sensitivity and Tuning

The investigation of measurement noise covariance (R) revealed its critical role in balancing filter responsiveness and stability. The nominal R configuration provided optimal performance for most scenarios, achieving the best trade-off between noise suppression and tracking responsiveness. The 10R setting introduced unacceptable tracking delays during rapid maneuvers, as the filter became overly reliant on model predictions rather than measurement updates. Conversely, the 0.1R configuration amplified measurement noise, compromising estimation stability despite improved responsiveness.

These parameter sensitivity results demonstrate that measurement noise covariance tuning is application-specific and must account for the expected target dynamics, measurement quality, and mission requirements. The empirically determined values of $\sigma_{\Delta x_{pix}} = 5.76$ pix and $\sigma_{\Delta y_{pix}} = 12.04$ pix proved effective across multiple flight scenarios.

3.4.3. Real-World Performance Validation

The flight experiments validated the simulation results while revealing additional practical challenges. For stationary targets, the UKF achieved remarkable noise reduction, decreasing the north-direction standard deviation by a factor of approximately seven compared to the raw geolocation data (from 3.76 m to 0.54 m). This improvement is particularly significant given the presence of substantial wind disturbances during testing.

The moving target experiments demonstrated the algorithm’s robustness during complex maneuvers, including turns and acceleration changes. The UKF maintained consistent tracking accuracy throughout all motion phases, with position errors remaining below 2 m during steady-state tracking and below 5 m during aggressive maneuvers. The system successfully tracked targets through speed variations from 0.5 m/s to 3.5 m/s without degradation in performance. While simulation studies under controlled circular trajectories achieved better accuracy, the slightly higher errors observed in field tests were attributed to outdoor environmental factors such as sunlight reflections on the target surface that were not present in the simulation conditions.

The fixed-wing UAV tests confirmed the algorithm’s adaptability to different platform types, though the extended standoff distance (141 m) resulted in mean position errors of approximately 10 m. Error propagation analysis indicates that a combined angular error from gimbal encoder uncertainty, attitude estimation errors, and camera mounting misalignment produces a position error of this magnitude at long range. Despite these challenges, the system successfully demonstrated operational feasibility through sustained loitering flight operations with continuous target tracking, confirming its applicability for real-world surveillance missions.
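The sensitivity of the ground position to a line-of-sight angle error can be approximated with a first-order flat-ground model. The sketch below is a rough illustration under stated assumptions (small angle error, flat terrain, 100 m altitude, 45° depression angle). The 2° combined angular error is a hypothetical value chosen for the example, not a figure from the paper.

```python
import math

def ground_error(altitude, depression_deg, angle_error_deg):
    """First-order ground position error [m] from a small line-of-sight angle error.

    Returns (along_range, cross_range) for a target on flat ground.
    """
    theta = math.radians(depression_deg)
    delta = math.radians(angle_error_deg)
    slant_range = altitude / math.sin(theta)
    along = slant_range * delta / math.sin(theta)   # tilt error, stretched by grazing geometry
    cross = slant_range * delta                     # pan error
    return along, cross

along, cross = ground_error(altitude=100.0, depression_deg=45.0, angle_error_deg=2.0)
```

Under these assumptions a 2° error maps to roughly 7 m along the line of sight and 5 m across it, so angular errors of a few degrees are enough to produce position errors of the order reported for the fixed-wing test.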

3.4.4. Practical Implementation Considerations

The developed system demonstrates several technical advantages for operational deployment. Real-time operation on embedded hardware achieved 10 fps processing rates on the Jetson Nano platform, meeting the responsiveness requirements for surveillance missions. The integration of CSRT tracking with YOLO detection proved effective in maintaining target lock despite scale and appearance variations induced by UAV motion and zoom changes, with successful tracking maintained through zoom levels ranging from 1× to 3×.

However, several technical limitations affect practical deployment scenarios. The mean position error observed in long-range tracking scenarios suggests that improved calibration procedures may enhance accuracy for precision geolocation applications, though the current performance is acceptable for surveillance missions where maintaining visual contact is the primary objective. Gimbal mechanical limitations, including response lag and encoder resolution constraints, contribute to tracking errors that become more significant at extended ranges.

The current implementation assumes that targets are at ground level and utilizes a planar projection model, requiring system modifications for tracking aerial targets or operations in mountainous terrain. Additionally, the unicycle motion model may be less accurate for targets with significant lateral dynamics, such as emergency vehicles or off-road scenarios.
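The flat-ground assumption amounts to intersecting the camera line of sight with a horizontal plane. The following sketch takes the line-of-sight direction in the NED frame as given (in the full pipeline it would be derived from the UAV attitude and gimbal angles); `geolocate_flat_ground` is a hypothetical name and the numbers are illustrative.

```python
import numpy as np

def geolocate_flat_ground(uav_ned, los_ned, ground_down=0.0):
    """Intersect a line-of-sight ray with the horizontal plane down = ground_down (NED)."""
    uav_ned = np.asarray(uav_ned, dtype=float)
    los_ned = np.asarray(los_ned, dtype=float)
    if los_ned[2] <= 0.0:
        raise ValueError("line of sight must point below the horizon")
    scale = (ground_down - uav_ned[2]) / los_ned[2]
    return uav_ned + scale * los_ned

# UAV 100 m above the origin, looking 45 degrees down toward north.
target = geolocate_flat_ground([0.0, 0.0, -100.0], [1.0, 0.0, 1.0])
# Same ray, but the terrain is actually 10 m above the assumed plane.
target_raised = geolocate_flat_ground([0.0, 0.0, -100.0], [1.0, 0.0, 1.0],
                                      ground_down=-10.0)
```

In this example a 10 m terrain height error shifts the computed target position by 10 m horizontally, and the shift grows as the viewing angle becomes shallower. This is why unmodeled elevation matters most at long standoff distances.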

3.4.5. Failure Cases and Recovery

During flight tests, occasional failures were observed. Failures most commonly occurred when abrupt UAV maneuvers induced by wind gusts caused motion blur and rapid target displacement, leading to temporary detection failures of the Tiny-YOLO module. These failures were transient, and the integrated CSRT tracker enabled re-acquisition of the target once it reappeared in the field of view. This recovery mechanism ensured continuity of operation without requiring manual intervention. No persistent system failures or mission-critical losses were encountered during testing, indicating acceptable robustness for the intended operational scenarios.

3.4.6. Limitations and Future Work

While this work demonstrates operational feasibility on resource-constrained platforms, several technical limitations indicate directions for future enhancement.

The flat-ground assumption, though computationally efficient, restricts applicability to planar environments and introduces geometric errors in complex terrain. The fixed-wing experiment at 141 m standoff exhibited approximately 10 m mean error, partly attributable to this constraint combined with camera calibration uncertainties. The unicycle motion model oversimplifies vehicle dynamics and cannot capture lateral slip during aggressive maneuvers. Additionally, reliance on pre-trained detection networks limits operation to known target classes, preventing generalization to novel targets without retraining.

Future work will address these limitations through complementary enhancements building upon the current framework. Tighter integration of high-rate IMU measurements with the filtering process could significantly improve state estimation during rapid UAV maneuvering or temporary vision degradation, representing an achievable enhancement leveraging existing hardware. Development of terrain-adaptive geolocation algorithms incorporating digital elevation models or online terrain estimation from tracking history would eliminate the flat-ground constraint, enabling operations in mountainous environments and urban settings with elevation variations. Expansion of training datasets to cover more diverse target categories or adoption of efficient few-shot learning techniques suitable for embedded deployment would extend detection capability beyond vehicles and pedestrians, enabling the system to handle varied target types encountered in different operational contexts with minimal additional training overhead.

These enhancements address acknowledged limitations while maintaining embedded deployment capability, expanding practical utility for UAV surveillance across diverse terrain types and varied target categories.

4. Conclusions

This paper presents an integrated, monocular vision-based geolocation pipeline for small UAVs, designed to operate onboard with limited computational resources. The system integrates visual tracking, gimbal control, and Kalman filtering into a unified framework that prioritizes deployability and practical operation under realistic flight conditions.

The key contributions of this work are (1) an integrated onboard framework that unifies visual tracking (YOLO detection with CSRT tracking), gimbal control, and state estimation into a deployable system suitable for small UAV platforms; (2) a systematic filtering analysis comparing EKF and UKF performance under occlusion and measurement uncertainty, providing practical guidance for filter selection; and (3) an embedded flight validation demonstrating that the complete pipeline operates reliably on resource-constrained hardware under real-world flight disturbances, including wind and vibration.

Simulation results show that the UKF implementation reduces position error standard deviations compared to raw geolocation measurements. Flight experiments on the VTOL platform confirmed that the system maintains continuous visual tracking and provides stable feedback for flight control during autonomous surveillance missions. The successful onboard deployment on embedded hardware validates the feasibility of the approach for practical UAV applications in which deployability is prioritized over maximum accuracy.

The system operates reliably within its design scope—surveillance scenarios with ground-level targets on relatively flat terrain—validating a practical alternative to high-end systems where deployability is prioritized. However, several technical limitations indicate directions for future enhancement. The flat-ground assumption restricts applicability in complex terrain, the unicycle motion model cannot capture all vehicle dynamics, and reliance on pre-trained detection networks limits generalization to novel targets.

Future research will focus on better exploitation of available sensor data including IMU measurements for improved state estimation, terrain-adaptive geolocation techniques to handle elevation variations, and adoption of more generalizable detection frameworks to expand operational capability.

Author Contributions

Conceptualization, J.K. and D.J.; methodology, J.K.; software, J.K., H.C., and S.K.; validation, J.K. and S.K.; formal analysis, J.K.; investigation, J.K.; resources, D.J.; data curation, J.K.; writing—original draft preparation, J.K., Y.K., and H.C.; writing—review and editing, Y.K. and D.J.; visualization, J.K.; supervision, D.J.; project administration, D.J.; funding acquisition, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) in 2024 (Project No. NRF-2022M1A3B8073175, Space Debris Capture Mechanism Design and Validation using Lab-scale Experimental Platform).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to institutional restrictions.

Acknowledgments

The authors thank Korea Aerospace University for providing the research facilities and technical assistance for flight experiments.

Conflicts of Interest

Author Jaemin Kim was employed by the company Pablo Air, and author Hyeongjun Cho was employed by the company LIG Nex1. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Definition of coordinate frames: inertial ($F_{I}$), body ($F_{B}$), and gimbal ($F_{G}$).

Figure 2. Illustration of geolocation geometry when the gimbal x-axis is directed toward the target.

Figure 3. The yellow bounding box represents the object detected via YOLO. The center of the image frame is indicated with the red cross-mark, and the green cross-mark indicates the initial location of the target based on given target information.

Figure 4. Pan/tilt correction using the pixel deviation from the center of the image.

Figure 5. Target geolocation using UAV attitude and gimbal pan/tilt angle.

Figure 6. Measurement model based on pinhole camera.

Figure 7. Simulation environment setup using MATLAB and Unity.

Figure 8. Comparison of simulation results according to covariance.

Figure 9. Comparison of target tracking performance using different filters.

Figure 10. Filter performance under occlusion. Occlusion occurs between $t = 26$ s and $t = 33$ s.

Figure 11. Trajectory comparison at nominal speed (5.0 m/s) showing ground truth circular path (55 m radius) with raw geolocation and UKF estimates.

Figure 12. Target position estimation results for the complex trajectory scenario. The ground truth of the target path is depicted by a magenta dashed line, the raw geolocation measurements are shown as a blue dotted line, and the UKF estimates are represented by a cyan line.

Figure 13. UAV platform for target estimation.

Figure 14. The block diagram of the mission control system.

Figure 15. Flight test results of the stationary target geolocation and UKF-based estimation.

Figure 16. Flight test results of the stationary target geolocation and UKF-based estimation.

Figure 17. Target tracking using CSRT and target state estimation.

Figure 18. Flight test results of the moving target geolocation and UKF-based estimation.

Figure 19. Flight test results of the moving target geolocation and UKF-based estimation.

Figure 20. Comparison of flight test results according to covariance.

Figure 21. Fixed-wing flight test with moving target video.

Figure 22. Fixed-wing flight test with moving target trajectory.

Figure 23. Target estimation results from fixed-wing flight test.

Figure 24. Target estimation error from fixed-wing flight test.

Table 1. Comparison of mean errors and standard deviations according to covariance.

Parameter (Unit: m)	Geolocation (Raw Data)	R	0.1R	10R
Mean error of N	2.1683	1.0977	1.6702	11.9126
Mean error of E	1.9927	1.1839	1.6406	3.2167
STD of N	2.8611	1.6621	2.2426	13.5123
STD of E	2.5578	1.6856	2.0761	4.0970
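One plausible reading of the trend in Table 1 is the usual trade-off in the Kalman gain: a larger measurement covariance makes the filter trust measurements less and respond more slowly, while a smaller one passes more measurement noise through. The scalar sketch below illustrates only that mechanism; it is not the paper's filter, and the prior and measurement variances are hypothetical.

```python
def scalar_kalman_gain(p_prior, r_meas):
    """Gain of a scalar Kalman measurement update: K = P / (P + R)."""
    return p_prior / (p_prior + r_meas)

P_PRIOR = 1.0     # hypothetical prior state variance (m^2)
R_NOMINAL = 1.0   # hypothetical nominal measurement variance (m^2)

# Gains for the three scalings compared in Table 1.
gains = {
    label: scalar_kalman_gain(P_PRIOR, scale * R_NOMINAL)
    for label, scale in (("0.1R", 0.1), ("R", 1.0), ("10R", 10.0))
}
```

With these illustrative numbers the gain falls from about 0.91 at 0.1R to 0.5 at R and about 0.09 at 10R, so the 10R setting weights each new measurement very lightly. That is consistent with, though not proof of, the large errors reported in the 10R column for a moving target.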

Table 2. The error and standard deviation of the target position 5 s after the starting point (no obstruction).

Parameter (Unit: m)	Geolocation (Raw Data)	EKF	UKF
Mean error of N	2.1683	1.0977	1.0915
Mean error of E	1.9927	1.1839	1.2007
STD of N	2.8611	1.6621	1.6663
STD of E	2.5578	1.6856	1.6853

Table 3. Mean error and standard deviation of target position (at 5 s after occlusion begins).

Parameter (Unit: m)	Geolocation (Raw Data)	EKF	UKF
Mean error of N	INF	2.4483	2.2708
Mean error of E	INF	2.8829	1.8728
STD of N	INF	4.8306	4.1098
STD of E	INF	6.3825	3.0421
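Table 3 lists per-axis means and standard deviations, whereas the abstract reports a 33% reduction in 2D position RMSE for the UKF relative to the EKF under occlusion. As a hedged cross-check, the sketch below combines the Table 3 values under the assumption that each "mean error" is a signed bias and each STD is taken about that mean, so that the per-axis squared RMSE equals mean squared plus STD squared. If the means are instead absolute errors, this identity does not hold exactly. The function and variable names are illustrative.

```python
import math

# (mean error, STD) in metres, copied from Table 3.
table3 = {
    "EKF": {"N": (2.4483, 4.8306), "E": (2.8829, 6.3825)},
    "UKF": {"N": (2.2708, 4.1098), "E": (1.8728, 3.0421)},
}

def rmse_2d(axes):
    """2D RMSE assuming RMSE^2 = mean^2 + STD^2 on each axis."""
    return math.sqrt(sum(mean**2 + std**2 for mean, std in axes.values()))

ekf_rmse = rmse_2d(table3["EKF"])
ukf_rmse = rmse_2d(table3["UKF"])
reduction = 1.0 - ukf_rmse / ekf_rmse
```

Under this assumption the EKF and UKF values come out near 8.85 m and 5.90 m, a reduction of roughly one third, which agrees with the figure quoted in the abstract.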

Table 4. Position estimation errors across target speed variations.

Speed Regime	Dir.	Raw Geolocation Mean (m)	Raw Geolocation STD (m)	EKF Mean (m)	EKF STD (m)	UKF Mean (m)	UKF STD (m)
Low (2.5 m/s)	N	1.48	1.90	0.82	2.69	0.71	1.81
Low (2.5 m/s)	E	1.49	1.90	0.59	2.69	0.64	1.81
Nominal (5.0 m/s)	N	1.56	1.96	0.76	1.10	0.78	1.00
Nominal (5.0 m/s)	E	1.51	1.96	0.62	1.10	0.70	1.00
High (10.0 m/s)	N	1.60	2.01	1.24	2.02	1.03	1.33
High (10.0 m/s)	E	1.56	2.01	1.00	2.02	0.96	1.33

Table 5. Position estimation errors for complex trajectory scenario.

Parameter (Unit: m)	Geolocation (Raw Data)	EKF	UKF
Mean error of N	1.48	1.81	1.24
Mean error of E	1.58	1.41	1.15
STD of N	1.84	2.83	1.60
STD of E	1.65	2.32	1.46

Table 6. Results of fixed target position estimation through flight experiments.

Parameter (Unit: m)	Geolocation (Raw Data)	UKF
STD of N direction	3.7600	0.5399
STD of E direction	4.2933	0.5352

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
