1. Introduction
Against the backdrop of the rapid development of low-altitude economy and intelligent perception technologies, unmanned aerial vehicles (UAVs) have been widely deployed in critical domains including emergency rescue, power line inspection, and precision agriculture. Their capability for rapid target perception and high-precision localization constitutes a core determinant of mission execution efficiency. As the pivotal component of UAV perception systems, electro-optical pods equipped with fixed-focal-length lenses face an inherent trade-off between field of view and spatial resolution, which hinders their ability to fulfill integrated operational requirements of wide-area search and fine-grained target identification. Consequently, electro-optical pods integrated with continuous zoom lenses have emerged as the prevailing direction of technological advancement in this field. However, the incorporation of continuous zoom systems introduces new technical bottlenecks: while the relative movement of lens elements during zooming is theoretically designed to maintain optical parameter stability, engineering imperfections such as manufacturing tolerances and assembly errors induce complex nonlinear variations in camera intrinsic parameters and distortion coefficients as a function of focal length. This issue has become a core obstacle restricting UAVs from achieving high-precision target localization, as the uncompensated nonlinear parameter drift will continuously accumulate with focal length adjustment and severely degrade the performance of visual perception and localization algorithms.
UAV target localization relies on accurate camera intrinsic parameters and distortion correction models. Failure to precisely estimate the camera’s intrinsic parameters across varying focal lengths will lead to model mismatches and feature point extraction errors, ultimately resulting in 3D target localization accuracy that exceeds acceptable thresholds. This, in turn, limits the applicability of UAVs in scenarios with stringent localization requirements, such as disaster assessment and precision agriculture. Notably, the nonlinear variation in intrinsic parameters is not a simple monotonic change but a coupled variation involving focal length, principal point and radial distortion, which further increases the difficulty of real-time online calibration during continuous zooming.
Accordingly, precise calibration of zoom lenses has become a research hotspot in this field. Traditional calibration methods (e.g., Zhang's method [1,2]) are predicated on the assumption of constant intrinsic parameters, rendering them incompatible with zoom operation and unable to capture the nonlinear variation of parameters with focal length. The present study aims to construct an accurate mathematical model characterizing the variation in intrinsic parameters and distortion coefficients with focal length, and to develop an efficient and robust calibration approach with a dedicated fusion strategy for real-time suppression of parameter drift during focal length changes. This work seeks to provide technical support for the high-precision application of zoom electro-optical pods, thereby breaking through the bottleneck of UAV localization accuracy.
Extensive research efforts have been devoted to zoom lens calibration by scholars worldwide, with technical approaches evolving along three primary directions: model construction, parameter solving strategies, and application scenario expansion. These methods exhibit significant disparities in accuracy, efficiency, and adaptability to operational environments. Early research predominantly adopted discrete calibration strategies, wherein parameters are calibrated independently at several discrete focal points, followed by polynomial fitting or spline interpolation to derive parameter values across the full focal range. While these methods are conceptually simple and easy to implement, they suffer from high calibration workloads. Specifically, discrete zoom-based calibration generally requires sampling at no fewer than 10 distinct focal lengths, with at least 8 images collected at each focal length, leading to cumbersome manual operation. Deep learning-based schemes further demand large-scale labeled training datasets, which significantly increases data acquisition cost, labeling effort, and computational complexity. Moreover, interpolation accuracy is highly dependent on the number and distribution of discrete points, making it difficult to globally approximate complex nonlinear mapping relationships [1,2,3,4] and unable to support real-time online calibration during continuous focal length adjustment. Wilson and Shafer [5] employed Tsai's method [6] to fit the variation of parameters (including focal length, principal point, and radial distortion) with focal length using fifth-order polynomials, which notably reduced reprojection errors. However, this method relies on high-precision calibration targets and controlled environments, impeding its feasibility for online calibration. Notably, principal point drift during zooming is a critical factor affecting calibration accuracy. Yin and Sui [7] proposed the Euclidean Distance Weighting Model (EDWM), which estimates the principal point via the intersection of image feature points before and after focal length adjustment. This approach effectively suppresses the interference of outliers, confining the principal point estimation error within 2.75 pixels. Despite their high accuracy, traditional methods require frequent recalibration and thus fail to meet the real-time demands of dynamic missions, and their pre-fitted polynomial models lack adaptability to the random nonlinear parameter variations encountered in actual flight scenarios.
To overcome the reliance on calibration targets, self-calibration methods leverage multi-view geometric constraints to solve for parameters directly from image sequences. These methods are suitable for camera calibration in unknown scenes or under dynamic motion conditions, but they impose stringent requirements on camera motion trajectories and scene texture richness [8,9,10]. Faugeras et al. [11] utilized the Kruppa equations and fundamental matrix constraints for intrinsic parameter calibration, yet the solving process is computationally complex and highly sensitive to initial values. Triggs [12] introduced the Absolute Dual Quadric (ADQ) to transform the constraints into a quadratic programming problem, enhancing solution stability. Pollefeys' hierarchical stepwise calibration method [13] upgrades projective reconstruction to metric reconstruction through modular constraints, enabling its application to zoom camera calibration. Agapitos et al. [14] proposed a method that first performs linear parameter estimation based on infinite homography constraints, followed by nonlinear optimization to refine focal length and distortion parameters, achieving self-calibration for rotating and zooming cameras. Although self-calibration methods offer high flexibility, they suffer from strong parameter coupling, which easily leads to local optima, and their reprojection errors are generally higher than those of traditional methods, especially in continuous zoom scenarios where nonlinear parameter variations are not explicitly modeled.
Hybrid calibration methods combine the advantages of target-based calibration and self-calibration. They establish parameter models using a small amount of calibration data and then update parameters online, striking a balance between accuracy and flexibility. Oh and Sohn [15] proposed a hybrid method integrating checkerboard-based calibration and pure-rotation self-calibration: parameters are first calibrated at different focal lengths, then parameter variation models are established via scattered data interpolation, while rotation sensors are employed to alleviate ill-posed problems. Sturm [16] determined the relationship between the principal point, distortion, and focal length through pre-calibration, and then solved for the remaining parameters using self-calibration based on the Kruppa equations. Experimental results demonstrate that hybrid methods improve adaptability while maintaining high accuracy, though their performance is contingent on the coverage and quality of calibration data.
In recent years, with the advancement of deep learning technologies, intelligent calibration methods have emerged that leverage neural networks to learn the complex mapping relationships between parameters and focal length, circumventing the limitations of explicit mathematical modeling [17,18]. Butt et al. [19] proposed a multi-task network for joint estimation of intrinsic and extrinsic parameters, incorporating the CARLA dataset to enhance generalization capability. Nevertheless, intelligent methods require large volumes of training data. Furthermore, most existing studies employ a single training dataset to fit multiple tasks, which often leads to local optima in parameter optimization and leaves room for improvement in generalization performance; in addition, these data-driven models lack real-time adaptability to the unforeseen nonlinear parameter variations in actual UAV flight and zoom processes. Current research priorities include online adaptive calibration and cross-focal-range domain learning, aiming to address calibration challenges in dynamic environments.
Overall, zoom lens calibration technology is evolving from static offline calibration to dynamic online calibration, forming a technical framework encompassing four categories: traditional calibration, self-calibration, hybrid calibration, and intelligent learning methods [20,21,22]. However, to meet the practical application requirements of UAV-borne electro-optical (EO) pods, three core issues remain to be urgently addressed. First, existing models lack sufficient capability to characterize full-focal-range nonlinear distortion, resulting in significant accuracy degradation at both the wide-angle and telephoto ends, and fail to explicitly model the coupled nonlinear variation of focal length, principal point, and radial distortion during continuous zooming. Second, most methods rely on high-precision calibration targets and controlled environments, limiting their convenience and robustness for field operations. Third, the balance between real-time performance and calibration accuracy has not been effectively achieved, and dedicated multi-sensor fusion strategies for suppressing parameter drift induced by focal length changes are lacking, making it difficult to adapt to the dynamic mission scenarios of UAVs.
To tackle the aforementioned challenges, this study proposes a high-precision, robust online calibration method tailored for UAV-borne continuous zoom EO pods, and systematically verifies its effectiveness in improving the accuracy of target geolocalization. The main contributions of this paper are threefold: (1) An improved joint parameterization model for zoom lens intrinsic parameters and distortion is proposed, which explicitly characterizes the coupled nonlinear variation of focal length, principal point, and radial distortion during continuous zooming, and embeds this nonlinear parameter constraint into the state estimation framework to enable real-time tracking of zoom-induced parameter changes, with particular emphasis on refining the modeling of the coupled variations in the camera principal point and radial distortion. (2) A tightly coupled IMU-visual fusion strategy is designed for the online calibration of continuously zooming cameras, together with a user-friendly calibration workflow that eliminates the need for sophisticated 3D calibration fields and requires only planar targets. By acquiring image sequences across multiple focal lengths and camera poses, combined with bundle adjustment optimization, the workflow enables efficient and robust solving of the model parameters, while the fusion strategy combines high-frequency IMU motion constraints and visual geometric constraints in real time to effectively suppress parameter drift during focal length changes. (3) A quantitative analysis model linking calibration residuals to final geolocalization errors is established.
Through both simulation and physical experiments, the correction accuracy of the proposed calibration method across different focal lengths and its practical effectiveness in improving the absolute localization accuracy of UAVs are comprehensively validated, and the superiority of the proposed method in handling parameter nonlinearities and suppressing zoom-induced parameter drift is fully demonstrated by comparing with state-of-the-art calibration methods.
2. Materials and Methods
2.1. Camera Model
There are five coordinate systems involved in the camera calibration process: the two-dimensional image pixel coordinate system and image physical coordinate system, and the three-dimensional camera coordinate system, UAV body coordinate system, and world coordinate system. The correspondence among the five coordinate systems is shown in Figure 1.
Due to the inherent manufacturing imperfections of optical lenses, image distortion inevitably occurs during the imaging process. Therefore, a distortion correction model must be incorporated into the camera calibration framework. Common distortion types include radial distortion, tangential distortion, and thin-prism distortion. In practice, only radial distortion is dominant and is typically the only component considered when calibrating UAV electro-optical zoom lenses, as the other distortion components are negligible under normal operating conditions [23,24]. Accordingly, considering radial distortion only, the imaging model is modified as shown in Equation (1).
x_d = x (1 + k_1 r^2 + k_2 r^4) + O(r^6),  y_d = y (1 + k_1 r^2 + k_2 r^4) + O(r^6)    (1)

where k_1 and k_2 represent the first- and second-order radial distortion coefficients, O(r^6) denotes the higher-order infinitesimal term, r^2 = x^2 + y^2, (x, y) denotes the ideal normalized image coordinate, and (x_d, y_d) is the corresponding distorted image coordinate. Considering the trade-off between calibration accuracy and computational efficiency, only the second-order radial distortion model is adopted in this work, which has been widely verified to be sufficient for zoom lenses in UAV electro-optical pods.
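As a concrete illustration, the second-order radial distortion model described above can be sketched in a few lines of Python (the function and variable names are illustrative conventions, not the paper's notation):

```python
def apply_radial_distortion(x, y, k1, k2):
    """Map an ideal normalized image coordinate (x, y) to its distorted
    counterpart with a second-order radial model; higher-order O(r^6)
    terms are truncated, as in the text."""
    r2 = x * x + y * y                      # squared distance from the optical axis
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale

# A point on the optical axis is unaffected by radial distortion.
x0, y0 = apply_radial_distortion(0.0, 0.0, k1=-0.1, k2=0.01)
# An off-axis point is pulled inward when k1 < 0 (barrel distortion).
xb, yb = apply_radial_distortion(0.5, 0.0, k1=-0.1, k2=0.01)
```

For a typical barrel-distorting lens (negative k_1), the distorted coordinate of an off-axis point lies closer to the principal point than the ideal one.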
From the above equation, it can be seen that when the zoom lens settings change, the intrinsic parameters change accordingly; likewise, when the focal length changes, so does the radial distortion coefficient. Different implicit functions relate the calibration parameters to the lens settings, so the calibration of zoom cameras is more complex and variable than that of fixed-focus cameras [25].
2.2. Proposed Algorithm
For the dynamic operational scenarios of UAV-borne zoom electro-optical pods, existing calibration methods struggle to characterize the continuous nonlinear evolution of camera intrinsic parameters during zooming. Meanwhile, mainstream self-calibration methods suffer from poor convergence under zoom conditions due to the insufficient fusion of multi-source heterogeneous sensor information. In view of these limitations, we propose an online calibration framework that balances real-time performance and robustness, aiming to break through the bottlenecks of current technologies.
To address this challenge, we design a framework capable of online, real-time, joint estimation of camera intrinsic/extrinsic parameters and environmental structure, as illustrated in Figure 2. Specifically, camera intrinsic and extrinsic parameters, along with the 3D coordinates of scene feature points, are jointly encoded into a high-dimensional state vector. The Extended Kalman Filter (EKF) is then employed to recursively estimate this state vector, with concurrent constraints imposed on the evolution of the intrinsic parameters. This tightly coupled fusion strategy enables robust, high-precision online calibration for UAV-borne electro-optical pods.
The proposed calibration framework is designed specifically for the calibration scenarios and system lifecycle of unmanned aerial vehicle (UAV) electro-optical pods; the key aspects of these scenarios are detailed below. The calibration process of the UAV electro-optical pod consists of two complementary stages: offline pre-calibration and online real-time calibration. Offline pre-calibration is performed once at factory delivery or during maintenance to provide initial intrinsic and extrinsic parameters, while online real-time calibration runs synchronously with all flight missions to update parameters in real time, ensuring consistency between calibration results and actual flight states.
The core goal of the proposed calibration framework is to support high-precision target geolocation of UAVs in military reconnaissance and emergency rescue scenarios, with mandatory precision indices including reprojection root mean square error (RMSE) below 1 pixel, geolocation view angle error less than 1 mrad, and mean geolocation error under 5 m.
The UAV platform and its inertial measurement unit (IMU) specifications are key prerequisites for the effectiveness of the calibration framework. Prior to the formal calibration process, the IMU must undergo pre-calibration, including factory calibration and on-site RTK-GNSS calibration, to ensure the systematic error of the gyroscope is <0.005 rad/s. During the calibration process, the UAV does not need to follow a fixed trajectory; instead, it only requires smooth motion, which is consistent with the actual flight characteristics of UAV missions.
Parameter drift and stability properties are fully considered in the framework design to maintain long-term calibration accuracy. Camera intrinsics drift nonlinearly with focal length but do not drift over time, whereas the IMU exhibits slow temporal drift (gyroscope bias drift of 0.001 rad/s·h). Importantly, all hardware parameters remain stable within a single flight mission (usually no more than 5 h), which is the key time scale of UAV tasks, ensuring the reliability of calibration results throughout the entire mission period.
The calibration setup for each stage is specifically designed to match the environmental conditions and precision requirements. The proposed tightly coupled EKF-based fusion strategy effectively integrates the constraints from the calibration setup, parameter stability, and platform motion characteristics, ultimately achieving robust and high-precision online calibration for UAV-borne electro-optical pods.
2.3. Problem Modeling
Firstly, we define the complete state vector x_k of the system, which includes all variables to be estimated at discrete time step k:

x_k = [θ_k^T, p_k^T, v_k^T, q_k^T, m^T]^T

where θ_k = (f_x, f_y, u_0, v_0, k_1, k_2)^T denotes the vector of camera intrinsic parameters to be estimated, corresponding to the equivalent focal lengths in the x and y directions, the principal point coordinates, and the two primary radial distortion coefficients, respectively. p_k and v_k represent the camera's position and linear velocity in the world coordinate system, respectively. q_k is a unit quaternion describing the rotational attitude of the camera from the world coordinate system to the camera coordinate system. m is a spatial position vector composed of the 3D coordinates of the N scene feature points, and its dimension is 3N.
In this model, a key simplifying assumption is made: within a limited observation window, the camera's intrinsic parameters and the positions of the scene feature points remain constant. This assumption transforms the complex online calibration problem into a dynamic state estimation problem with static parameters, enabling effective joint estimation via the EKF.
The entire system follows a typical non-linear state-space model, described jointly by a state transition equation and an observation equation:

x_k = f(x_{k-1}) + w_{k-1},  z_k = h(x_k) + v_k

where x_k represents the state value at time k and z_k is the measurement value at that time, composed of the detected distorted pixel coordinates of the N scene feature points in the image. f(·) is the non-linear state transition function and w_{k-1} is the process noise representing model uncertainty; h(·) is the non-linear observation function and v_k is the measurement noise, primarily originating from errors in image feature point detection.
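To make the bookkeeping concrete, a minimal sketch of how such a high-dimensional state vector might be packed and unpacked in NumPy follows (the layout and names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def pack_state(intrinsics, position, velocity, quaternion, landmarks):
    """Concatenate all estimated quantities into one flat state vector:
    6 intrinsics + 3 position + 3 velocity + 4 quaternion + 3N landmarks."""
    return np.concatenate([intrinsics, position, velocity,
                           quaternion, landmarks.ravel()])

def unpack_state(x):
    """Recover the individual blocks from the flat state vector."""
    intrinsics = x[0:6]                  # fx, fy, u0, v0, k1, k2
    position   = x[6:9]
    velocity   = x[9:12]
    quaternion = x[12:16]
    landmarks  = x[16:].reshape(-1, 3)   # N x 3 feature positions
    return intrinsics, position, velocity, quaternion, landmarks

# Example: with N = 2 scene feature points the state has dimension 16 + 6 = 22.
x = pack_state(np.array([800., 800., 320., 240., -0.1, 0.01]),
               np.zeros(3), np.zeros(3),
               np.array([1., 0., 0., 0.]),
               np.array([[0., 0., 5.], [1., 1., 6.]]))
```

Keeping an explicit index map like this is what allows new feature points to be appended to, and stale ones removed from, the state vector during filtering.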
2.4. Camera Motion Model
The state transition model mainly describes how the camera’s extrinsic parameters (position, velocity, attitude) evolve over time. Since the camera is rigidly attached to the carrier, its motion can be directly modeled using measurements from the IMU mounted on the carrier. The IMU includes an accelerometer and a gyroscope, which sense the carrier’s linear acceleration and angular velocity, respectively. However, translational and rotational motions are physically decoupled and differ significantly in their mathematical representation: translation is typically expressed using Euclidean vectors, while rotation belongs to the Lie group SO(3) and cannot be handled by simple vector addition. Therefore, to construct a physically accurate and mathematically tractable model, the camera’s motion must be decomposed into translational and rotational components, which are modeled separately.
2.4.1. Translational Motion Model
For translational dynamics, we adopt the Wiener velocity process model, with the estimated values being the camera's position p and velocity v. This model assumes that the acceleration is driven by a zero-mean white noise process w_a(t), which effectively captures the random, unpredictable acceleration changes experienced by the carrier in real environments. Specifically:

p̈(t) = a(t) + w_a(t)
Converting this into a system of first-order differential equations and integrating in the discrete time domain yields:

p_{k+1} = p_k + v_k Δt_k + (1/2) a_k Δt_k^2,  v_{k+1} = v_k + a_k Δt_k + w_k

where Δt_k represents the k-th IMU sampling period, a_k is the linear acceleration after gravity compensation, and w_k is a random disturbance term with process-noise covariance Q_k.
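Under this translational model, one discrete propagation step can be sketched as follows (a minimal NumPy illustration of the position/velocity update, not flight code; process noise is omitted):

```python
import numpy as np

def propagate_translation(p, v, a, dt):
    """One discrete step of the Wiener-velocity model:
    p <- p + v*dt + 0.5*a*dt^2,  v <- v + a*dt,
    where `a` is the gravity-compensated IMU acceleration."""
    p_next = p + v * dt + 0.5 * a * dt * dt
    v_next = v + a * dt
    return p_next, v_next

# Constant 1 m/s^2 acceleration along x, integrated over 1 s in 100 steps:
p, v = np.zeros(3), np.zeros(3)
a = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    p, v = propagate_translation(p, v, a, 0.01)
# p[0] matches the analytic value 0.5*a*t^2 = 0.5 m; v[0] matches a*t = 1 m/s.
```

Because the per-step update includes the 0.5·a·dt² term, the discrete integration reproduces the constant-acceleration solution exactly rather than only to first order.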
2.4.2. Rotational Motion Model
For rotational motion, the dynamics are driven by the angular velocity measured by the gyroscope. The mathematical nature of rotation prevents updates via simple vector integration as in the translational case. Unit quaternions are widely used for attitude representation due to their non-singularity and high computational efficiency. There is a definite differential relationship between the time derivative of a quaternion and the corresponding angular velocity:

q̇(t) = (1/2) Ω(ω) q(t)

where Ω(ω) is the skew-symmetric matrix constructed from the angular velocity vector ω = (ω_x, ω_y, ω_z)^T:

Ω(ω) = [  0    -ω_x  -ω_y  -ω_z
         ω_x    0     ω_z  -ω_y
         ω_y  -ω_z    0     ω_x
         ω_z   ω_y  -ω_x    0  ]
In the discrete time domain, assuming constant angular velocity within a sampling period Δt_k, the quaternion update can be obtained via the exponential map:

q_{k+1} = exp( (Δt_k / 2) Ω(ω_k) ) q_k

where exp(·) denotes the matrix exponential, which yields a 4 × 4 rotation matrix that operates on the quaternion q_k via standard matrix-vector multiplication. Similarly, this process is affected by gyroscope noise and bias drift, which are modeled as process noise w_q.
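Since Ω(ω)² = −|ω|²I, the matrix exponential admits the closed form exp((Δt/2)Ω) = cos(|ω|Δt/2)·I + (sin(|ω|Δt/2)/|ω|)·Ω, which a short sketch can exploit. The Hamilton convention q = [w, x, y, z] used here is an assumption, since the text does not state its quaternion convention:

```python
import numpy as np

def omega_matrix(w):
    """4x4 matrix Omega(w) such that q_dot = 0.5 * Omega(w) @ q."""
    wx, wy, wz = w
    return np.array([[0., -wx, -wy, -wz],
                     [wx,  0.,  wz, -wy],
                     [wy, -wz,  0.,  wx],
                     [wz,  wy, -wx,  0.]])

def quat_update(q, w, dt):
    """Closed-form exponential-map update for constant angular velocity."""
    n = np.linalg.norm(w)
    if n < 1e-12:
        return q                         # negligible rotation, keep attitude
    half = 0.5 * n * dt
    M = np.cos(half) * np.eye(4) + (np.sin(half) / n) * omega_matrix(w)
    q_next = M @ q
    return q_next / np.linalg.norm(q_next)   # guard against numerical drift

# Rotating at pi/2 rad/s about z for 1 s yields a 90-degree rotation:
q = quat_update(np.array([1., 0., 0., 0.]), np.array([0., 0., np.pi / 2]), 1.0)
```

The final renormalization is a standard safeguard: accumulated floating-point error would otherwise let the quaternion drift off the unit sphere over many IMU steps.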
Combining the translational and rotational models, the complete state transition function can be written in a block-diagonal form, whose Jacobian matrix is used for linearization in the prediction step of EKF.
2.5. Visual Observation Model
The visual observation model establishes the mathematical connection between the state vector and the 2D feature points extracted from images. To clearly elaborate on this model, it is first necessary to define the involved coordinate systems and their transformation relationships. Two main coordinate systems are involved in this process: the world coordinate system and the camera coordinate system. The world coordinate system is a global, inertial reference frame where the true 3D positions of all scene feature points are defined. The camera coordinate system is a local frame with the camera’s optical center as the origin and the Z-axis aligned with the optical axis, whose orientation is determined by the attitude quaternion at the current moment.
Given a 3D feature point m_i in the world coordinate system, its coordinates in the camera coordinate system are obtained via the rigid-body transformation:

p_i^c = R(q_k)(m_i - p_k)

where R(q_k) is the rotation matrix converted from the quaternion q_k. Writing p_i^c = (x_i^c, y_i^c, z_i^c)^T, the undistorted ideal projection point (u_i, v_i) can be derived using the pinhole camera model:

[u_i, v_i, 1]^T = K [x_i^c / z_i^c, y_i^c / z_i^c, 1]^T

where K is the intrinsic parameter matrix composed of f_x, f_y, u_0, and v_0, which performs the affine transformation from the normalized plane to the pixel plane.
However, actual lenses exhibit optical distortion, particularly radial distortion, which is more pronounced during zooming. The actually observed pixel coordinates (u_i^d, v_i^d) are the result of applying radial distortion to the ideal projection point (u_i, v_i). This paper adopts the widely used second-order radial distortion model:

u_i^d = u_i + (u_i - u_0)(k_1 r_i^2 + k_2 r_i^4),  v_i^d = v_i + (v_i - v_0)(k_1 r_i^2 + k_2 r_i^4)

where r_i^2 is the squared normalized distance from the ideal projection point to the principal point.
In summary, the complete visual observation function h(·) is a non-linear mapping composed of the aforementioned projection and distortion models, which maps the relevant components of the state vector to 2D pixel coordinates. In the update step of the EKF, the Jacobian matrix of this function with respect to the state vector must be computed.
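Chaining the rigid-body transform, pinhole projection, and radial distortion, the observation function for a single feature point can be sketched as follows (variable names are illustrative; R is assumed to rotate world coordinates into the camera frame, as in the text):

```python
import numpy as np

def observe_point(m_w, R, p, fx, fy, u0, v0, k1, k2):
    """Project a world-frame 3D point to distorted pixel coordinates."""
    # Rigid-body transform into the camera frame
    pc = R @ (m_w - p)
    # Pinhole projection onto the normalized image plane
    xn, yn = pc[0] / pc[2], pc[1] / pc[2]
    # Second-order radial distortion in normalized coordinates
    r2 = xn * xn + yn * yn
    s = 1.0 + k1 * r2 + k2 * r2 * r2
    # Affine mapping to the pixel plane via the intrinsics
    return np.array([fx * xn * s + u0, fy * yn * s + v0])

# A point on the optical axis projects exactly to the principal point:
uv = observe_point(np.array([0., 0., 5.]), np.eye(3), np.zeros(3),
                   800., 800., 320., 240., -0.1, 0.01)
```

Note that distorting the normalized coordinates before applying the intrinsics is algebraically equivalent to the pixel-space form u_d = u + (u - u_0)(k_1 r² + k_2 r⁴), since u - u_0 = f_x·x_n.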
2.6. Feature Extraction and Tracking Strategy
Robust and efficient feature extraction and tracking are the foundation for constructing visual geometric constraints. To provide stable visual observations, this paper adopts the feature extraction and tracking pipeline proposed by Qinhan Chen et al. The specific strategy is as follows: first, detect high-discriminative and stable key points in a single image to serve as anchors for subsequent inter-frame matching; second, establish correspondences between these feature points across consecutive video frames (i.e., feature tracking) to obtain a set of 2D-2D pixel matching pairs. These matching pairs contain rich information about camera motion and serve as the direct data source for constructing the visual observation model.
2.6.1. Feature Extraction Method
UAVs typically operate in outdoor environments characterized by homogeneous terrain and indistinct ground features [26]. In such scenarios, feature points detected by the traditional ORB algorithm are predominantly concentrated in high-texture regions, and the algorithm exhibits poor adaptability to large-scale variations [27,28]. This often leads to a loss of feature points during UAV movement, thereby degrading the accuracy of online calibration. To address this issue, this study proposes a feature compensation strategy based on uniform distribution. Specifically, the image is divided into grids of 30 × 30 pixels, and an adaptive threshold T is calculated for each grid to detect FAST corner points. This approach effectively resolves the challenge of feature point extraction across regions with varying texture densities. The calculation formula for the adaptive threshold T is given as follows:
where K is the proportional coefficient, 0 < K < 1. The value of K is adaptively selected based on the characteristics of the input image and the specific requirements of the task; in this article, K is set to 0.8. Ī is the average gray value of the pixels in the grid, and I_max and I_min are the maximum and minimum gray values of the pixels in the grid, respectively.
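The per-grid threshold computation can be sketched as below. Since the exact formula depends on the equation above, the simple contrast-based form T = K·(I_max − I_min) used here is an assumption for illustration only:

```python
import numpy as np

def grid_thresholds(img, cell=30, K=0.8):
    """Compute an adaptive FAST threshold for each cell x cell grid.
    Assumed form: T = K * (I_max - I_min) within the grid, so low-contrast
    regions receive a lower threshold and still yield corner candidates."""
    h, w = img.shape
    T = np.zeros((h // cell, w // cell))
    for i in range(h // cell):
        for j in range(w // cell):
            patch = img[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            T[i, j] = K * (float(patch.max()) - float(patch.min()))
    return T

# A flat grid gets threshold 0; a grid containing an edge gets a high one.
img = np.zeros((60, 60), dtype=np.uint8)
img[0:15, 30:60] = 200       # high-contrast structure in the top-right cell
T = grid_thresholds(img)
```

The point of the strategy is visible even in this toy case: a global fixed threshold tuned for the bright region would suppress all detections in the homogeneous cells, whereas the per-grid threshold adapts to local contrast.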
As illustrated in Figure 3, the left panel presents the feature point extraction results of the ORB algorithm with a fixed threshold (a value optimized via multiple experiments). It is evident that the extracted feature points are primarily clustered around foreground objects with significant gradient variations. When the camera undergoes large-angle rotation or abrupt velocity changes, massive feature point loss occurs, resulting in poor system robustness and calibration accuracy. In contrast, the right panel displays the extraction results of the proposed algorithm. It can be observed that the proposed method fully exploits information from all image regions, yielding uniformly distributed feature points across the entire field of view. This uniform feature distribution provides a richer set of correspondences for subsequent image feature matching.
2.6.2. Feature Tracking Method
After extracting features from a single image, inter-frame feature tracking is essential to establish temporal continuity. The quality of feature tracking directly determines the accuracy and robustness of visual observations. For feature tracking, we adopt the pyramid Lucas-Kanade (LK) optical flow method: first, use the uniformly distributed ORB feature points extracted earlier as tracking targets, initialize the optical flow displacement to zero at the top layer of the Gaussian image pyramid, and perform optical flow estimation layer by layer from top to bottom. For each layer, first predict the position based on the rough displacement result from the previous layer, then compute the true gradient of pixels within a neighborhood window in the matching image centered at this predicted position. Next, iteratively solve for the optimal displacement by minimizing the grayscale error of corresponding windows between the two frames—updating the residual optical flow vector in each iteration, while introducing an adaptive termination mechanism based on changes in the condition number of the Hessian matrix. The iteration stops when the difference in condition numbers between two consecutive iterations is less than a set threshold or the displacement increment is sufficiently small. After completing optical flow tracking across all pyramid layers, the rough corresponding positions of each feature point in the matching image are obtained, forming initial matching pairs. Finally, compute the Hamming distance using rotated BRIEF descriptors and eliminate incorrect matches using the nearest neighbor ratio strategy, resulting in high-precision 2D-2D matching results.
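A single-level version of the Lucas-Kanade step described above can be sketched in NumPy; the full pipeline adds the image pyramid, the condition-number stopping rule, and descriptor-based outlier rejection, so this is a simplified illustration rather than the authors' implementation:

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at subpixel location (x, y) with bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def lk_step(img0, img1, pt, win=7, iters=30):
    """Estimate the displacement of `pt` from img0 to img1 by Gauss-Newton
    minimization of the windowed grayscale error (single pyramid level)."""
    gy, gx = np.gradient(img0)            # template-frame image gradients
    xs = np.arange(-win, win + 1)
    px, py = pt
    # Template window and its gradients (fixed across iterations)
    T  = np.array([[bilinear(img0, px + i, py + j) for i in xs] for j in xs])
    Ix = np.array([[bilinear(gx,  px + i, py + j) for i in xs] for j in xs])
    Iy = np.array([[bilinear(gy,  px + i, py + j) for i in xs] for j in xs])
    H = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    d = np.zeros(2)
    for _ in range(iters):
        W = np.array([[bilinear(img1, px + d[0] + i, py + d[1] + j)
                       for i in xs] for j in xs])
        err = T - W                        # grayscale residual over the window
        step = np.linalg.solve(H, [np.sum(Ix * err), np.sum(Iy * err)])
        d += step
        if np.linalg.norm(step) < 1e-4:    # displacement increment small enough
            break
    return d

# Synthetic check: a smooth blob shifted by (1.5, -0.5) pixels.
Y, X = np.mgrid[0:64, 0:64].astype(float)
img0 = np.exp(-((X - 32.0)**2 + (Y - 32.0)**2) / 50.0)
img1 = np.exp(-((X - 33.5)**2 + (Y - 31.5)**2) / 50.0)
d = lk_step(img0, img1, (32.0, 32.0))
```

On this synthetic pair the recovered displacement is close to the true (1.5, −0.5) shift; the pyramid in the full method exists precisely to extend this local, small-motion solver to the large inter-frame displacements caused by fast UAV motion and zooming.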
To mitigate the adverse effects of drastic field-of-view (FoV) changes induced by camera zooming and maintain robust visual constraints, a dynamic feature point management strategy is integrated with the improved ORB feature extraction and LK optical flow tracking. The strategy is implemented as follows: (1) Supplementary feature extraction: Following initial feature detection in each frame, the improved ORB algorithm is reapplied to FoV transition regions (i.e., newly visible or occluded areas due to zoom) to supplement feature points, ensuring sufficient visual constraints for state estimation. (2) Invalid feature rejection: Feature points lost or mismatched during FoV changes are filtered out using a pixel-level displacement threshold and a Hamming distance threshold within the LK optical flow framework, eliminating unreliable observations. (3) Stable feature screening: Only feature points tracked for ≥3 consecutive frames are retained, ensuring the consistency and validity of visual measurements. With this strategy, the algorithm can detect 80 to 120 stable feature points per frame. This strategy is tightly coupled with the extended Kalman filter (EKF): the 3D coordinates of newly extracted features are incorporated into the high-dimensional state vector in real time, while invalid features are removed from the state vector, enabling dynamic updates of the feature set to sustain the efficiency and accuracy of joint recursive estimation.
2.7. EKF Solution Process
Based on the non-linear state-space model constructed earlier, the EKF is used for recursive estimation of the state vector x_k. The whole solving process consists of a prediction stage and an update stage in each time step.
In the prediction phase, the IMU-driven state transition model f(·) is used to predict the state and its uncertainty at the next moment. The prior state estimate x̂_{k|k-1} is computed directly by applying the non-linear state transition function to the posterior estimate of the previous moment, assuming the expected value of the process noise is zero:

x̂_{k|k-1} = f(x̂_{k-1|k-1})

To propagate the state uncertainty, the non-linear function f(·) is linearized around x̂_{k-1|k-1}. Its Jacobian matrix is defined as:

F_k = ∂f/∂x |_{x = x̂_{k-1|k-1}}

Using this Jacobian matrix, the prior covariance at the current moment is computed from the posterior covariance P_{k-1|k-1} of the previous moment:

P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_{k-1}

where Q_{k-1} is the covariance matrix of the process noise w_{k-1}, which quantifies the uncertainty of the IMU model and carrier dynamics.
In the update phase, the current prior state estimate
is used to predict the observation value via the non-linear observation model
:
Define the innovation as the difference between the actual observation value and the estimated observation value:
Similarly to the prediction phase, linearize the observation model
around
. Its Jacobian matrix is defined as:
The innovation covariance matrix is obtained by combining the uncertainty of the predicted state and the noise of the observation itself:
where
is the covariance matrix of measurement noise
, typically set empirically based on the accuracy of image feature point detection.
By definition, the Kalman gain is calculated as:
Update the state estimate and its covariance using the Kalman gain and innovation to obtain the final posterior estimate:
Through the above prediction-update cycle, EKF can recursively output the joint estimation of camera intrinsic parameters, extrinsic parameters, and scene structure in an online, real-time manner, thereby completing the online calibration task for zoom cameras. This process fully leverages the high-frequency motion constraints of the IMU and the geometric consistency constraints of vision, effectively overcoming the limitations of single-sensor systems.
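The prediction-update cycle can be written compactly as a single generic EKF step. This is a minimal sketch with caller-supplied models `f`, `h` and their Jacobians, not the paper's actual IMU/vision models:

```python
import numpy as np

def ekf_step(x, P, u, z, f, F_jac, h, H_jac, Q, R):
    """One EKF prediction-update cycle for a generic nonlinear model."""
    # --- prediction ---
    x_prior = f(x, u)                      # prior state estimate
    F = F_jac(x, u)                        # Jacobian of f at previous posterior
    P_prior = F @ P @ F.T + Q              # prior covariance
    # --- update ---
    z_pred = h(x_prior)                    # predicted observation
    y = z - z_pred                         # innovation
    H = H_jac(x_prior)                     # Jacobian of h at prior estimate
    S = H @ P_prior @ H.T + R              # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_post = x_prior + K @ y               # posterior state estimate
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```

In the paper's setting, `x` is the high-dimensional joint state (intrinsics, extrinsics, feature 3D positions), `u` carries the IMU measurements, and `z` stacks the tracked feature pixel coordinates.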
Notably, the validity of the EKF framework depends on the noise characteristics of the gyroscope, which imposes the following applicability constraints. When the gyroscope noise standard deviation is at most 0.01 rad/s, the quaternion linearization error, derived from the Taylor expansion of the exponential map, is below 1%; the Gaussian assumption of the EKF then remains valid and the filter performs reliably. When it exceeds 0.05 rad/s, the linearization error exceeds 10%, violating the fundamental assumptions of the EKF and degrading estimation performance; in such scenarios, the Unscented Kalman Filter (UKF) is recommended as a more robust alternative.
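The linearization error referred to above can be quantified numerically. The sketch below compares the exact quaternion exponential map with its first-order approximation; the leading error term is the second-order term of the Taylor expansion, of magnitude roughly θ²/8 with θ = |ω|Δt the per-step rotation angle. This is a standard result used here for illustration, not the paper's exact formula:

```python
import numpy as np

def quat_exact(omega, dt):
    """Exact unit quaternion for rotation vector theta = omega * dt."""
    theta = np.linalg.norm(omega) * dt
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = np.asarray(omega) / np.linalg.norm(omega)
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * axis))

def quat_first_order(omega, dt):
    """First-order Taylor approximation q ~ [1, omega*dt/2]."""
    return np.concatenate(([1.0], 0.5 * dt * np.asarray(omega)))

def linearization_error(omega, dt):
    """Norm of the difference between the exact and linearized quaternions."""
    return float(np.linalg.norm(quat_exact(omega, dt) - quat_first_order(omega, dt)))
```

Doubling the angular rate quadruples the error, confirming the quadratic growth that motivates the 0.01 rad/s applicability bound.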
3. Experimental Results and Discussion
To comprehensively evaluate the effectiveness and robustness of the proposed online calibration method for zoom cameras, two complementary sets of experiments were designed and implemented: Monte Carlo simulation experiments and real UAV flight experiments. The simulation experiments quantitatively analyze the algorithm's convergence and accuracy under different noise levels in a controlled environment; the real flight experiments verify the algorithm's practicality and stability in complex, dynamic real-world scenarios. In the experiments, the root mean square error (RMSE) of reprojection was adopted as the core evaluation metric, defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\| \mathbf{z}_i - h_i\!\left(\hat{\mathbf{x}}\right) \right\|^2}$$

where $\mathbf{z}_i$ denotes the actual observed pixel coordinates of the $i$-th feature point, $h_i(\hat{\mathbf{x}})$ represents the reprojected coordinates calculated from the estimated state $\hat{\mathbf{x}}$ via the visual observation model $h(\cdot)$, and $N$ is the total number of feature points involved in the calculation.
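The reprojection RMSE metric reduces to a few lines of code (a minimal sketch operating on arrays of observed and reprojected pixel coordinates):

```python
import numpy as np

def reprojection_rmse(observed, reprojected):
    """RMSE between observed and reprojected pixel coordinates, shape (N, 2)."""
    observed = np.asarray(observed, dtype=float)
    reprojected = np.asarray(reprojected, dtype=float)
    sq_err = np.sum((observed - reprojected) ** 2, axis=1)  # per-point squared error
    return float(np.sqrt(np.mean(sq_err)))
```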
3.1. Simulation Experiments
The simulation experiments were built in a Python 3.8 environment as a virtual IMU-visual integrated platform comprising a pinhole camera with known intrinsic parameters and an ideal IMU; the parameter configurations were strictly aligned with the characteristics of real UAV electro-optical pods. For scene modeling, 1000 3D spatial points were randomly generated within X, Y ∈ [−50, 50] m, Z ∈ [10, 70] m to simulate the spatial distribution of outdoor open scenes. The feature point coordinates were drawn by uniform random sampling to ensure an even distribution of scene texture, avoiding unbalanced observation constraints caused by locally dense or sparse feature points and providing sufficient data support for the visual geometric constraints. The camera adopted a pinhole model with radial distortion, with fixed ground-truth values for the equivalent focal length, principal point, and radial distortion coefficients $k_1$ and $k_2$. The camera moved along a complex trajectory driven by random linear and angular velocities to simulate the dynamics of UAVs in actual flight. Regarding data characteristics, the IMU sampling rate was set to 100 Hz, and a 200-frame sequence of sparse 2D point projections was generated at 30 FPS with a resolution of 1920 × 1080. To simulate the imperfection of real sensors, zero-mean Gaussian white noise was injected into the angular velocity and acceleration measurements of the IMU in each test, and pixel-level zero-mean Gaussian noise was added to the detected coordinates of the image feature points.
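The simulated scene and camera can be sketched as follows. The point distribution and image resolution follow the text; the intrinsic values (`fx`, `cx`, `k1`, …) and the observation noise σ are illustrative placeholders, since the paper's exact ground-truth values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scene: 1000 points uniformly sampled in X, Y in [-50, 50] m, Z in [10, 70] m.
points = np.column_stack([
    rng.uniform(-50.0, 50.0, 1000),
    rng.uniform(-50.0, 50.0, 1000),
    rng.uniform(10.0, 70.0, 1000),
])

# Placeholder intrinsics for a 1920x1080 image (assumed values).
fx = fy = 1200.0
cx, cy = 960.0, 540.0
k1, k2 = -0.1, 0.01

def project(pts):
    """Pinhole projection with two-term radial distortion."""
    x = pts[:, 0] / pts[:, 2]
    y = pts[:, 1] / pts[:, 2]
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2      # radial distortion factor
    u = fx * x * d + cx
    v = fy * y * d + cy
    return np.column_stack([u, v])

pixels = project(points)
# Pixel-level zero-mean Gaussian observation noise (sigma assumed = 0.5 px).
noisy_pixels = pixels + rng.normal(0.0, 0.5, pixels.shape)
```

A point on the optical axis projects to the principal point, which gives a quick sanity check of the model.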
The EKF initialization strategy was as follows: the state vector was randomly initialized within 3σ of the true value, with the initial error of the intrinsic parameters controlled within ±5%, the initial position error within 5 m, the initial velocity error within 0.5 m/s, and the initial attitude quaternion error within 0.05 rad, to test the robustness of the algorithm to initialization errors. In each EKF update step, 50 successfully tracked feature points were randomly selected as observation inputs to balance computational efficiency and estimation accuracy.
Figure 4 illustrates the evolution of the estimated camera intrinsic parameters over time in a single simulation run. The solid blue lines represent the estimated values, and the dashed red lines denote the corresponding true values. All intrinsic parameter estimates exhibit rapid convergence. In the initial stage (frames 0–10), owing to the significant deviation of the initial state estimate from the true values and the limited accumulation of observations, the estimates fluctuate noticeably and the Kalman gain is relatively large; the system relies mainly on visual observations to correct the prior prediction, which is consistent with the expected behavior of the EKF. As more visual observations accumulate (frames 10–50), the EKF quickly corrects this deviation through continuous iterative updates. The focal length parameters $f_x$ and $f_y$ stabilize around the true values after approximately 50 frames, with minimal subsequent fluctuation, indicating that the algorithm effectively captures and recovers the focal length information. The principal point estimates $c_x$ and $c_y$ show a similar rapid convergence trend; a brief spike in the initial stage is quickly suppressed, demonstrating that the algorithm is robust in estimating the camera's imaging center. The final steady-state estimates of all parameters are highly consistent with the true values: the estimation errors of $f_x$ and $f_y$ are 0.32 and 0.28 pixel, respectively; those of $c_x$ and $c_y$ are 0.45 and 0.37 pixel, respectively; and those of the distortion coefficients $k_1$ and $k_2$ are 0.0015 and 0.023, respectively, all meeting the requirements of high-precision calibration. This confirms that the proposed joint optimization framework accurately models nonlinearly varying intrinsic parameters.
Figure 5 depicts the convergence curve of the reprojection RMSE calculated jointly for all intrinsic parameters throughout the estimation process. This curve intuitively reflects the improvement process of the overall geometric consistency of the system. From the evolution trend of the curve, the RMSE exhibits a three-stage characteristic of “rapid decline—slow attenuation—stabilization”: the first 25 frames are the rapid decline stage, where the RMSE drops sharply from the initial 25.6 pixel to 3.2 pixel. During this stage, the system quickly corrects the initial deviation and establishes a preliminary camera model by fusing the high-frequency motion constraints of the IMU with visual observations. Frames 25–100 are the slow attenuation stage, where the RMSE gradually decreases from 3.2 pixel to 0.1 pixel. At this point, the system has basically captured the true values of the intrinsic parameters, and the errors mainly originate from sensor noise and linearization approximation errors. The noise interference is continuously suppressed through iterative updates. After 100 frames, it enters the stabilization stage, where the RMSE stabilizes around 0.08 pixel with a fluctuation range of less than ±0.01 pixel, indicating that the system has reached the optimal estimation state. The estimated parameters can accurately describe the optical characteristics of the camera, and the reprojection error is close to the sensor noise level, verifying the high-precision characteristics of the algorithm.
Overall, the simulation experimental results fully validate the effectiveness of the proposed IMU-visual fusion-based online calibration method for zoom lenses: the method can quickly and accurately estimate the intrinsic parameters of zoom cameras, with fast convergence speed, high steady-state accuracy, and strong noise resistance, meeting the requirements of real-time calibration. By incorporating intrinsic parameters, extrinsic parameters, and 3D positions of scene feature points into a unified state vector, and using EKF to achieve tightly coupled fusion of multi-source information, the limitations of a single sensor are effectively overcome, providing reliable algorithmic support for high-precision online calibration in dynamic zoom scenarios.
3.2. UAV Flight Experiments
To verify the performance of the algorithm on a real UAV electro-optical pod, flight experiments were conducted using a fixed-wing UAV equipped with a zoom camera, as shown in Figure 6a. The pod consists of key components including an uncooled long-wave infrared thermal imager, a high-definition visible light camera, a laser range finder, a laser illuminator, a laser warning system, and a servo-stabilized platform, and is integrated with a high-precision three-axis pan-tilt and an IMU. The IMU underwent factory pre-calibration and on-site calibration to eliminate systematic errors. The visible light camera images in the 0.4–0.9 μm band, with 30× continuous zoom from 4.3 mm to 129 mm, a field of view covering 2.3°–63.7°, and a video output of 1920 × 1080 at 30 FPS. The experimental scene was an open outdoor area, where 7 high-contrast artificial markers were placed on the ground as auxiliary features, as shown in Figure 6b. The center positions of the markers were accurately measured using RTK-GNSS, and their 3D positions in the world coordinate system served as ground truth for subsequent accuracy evaluation.
The experimental procedure is as follows: (1) Place the UAV at the take-off point, complete the debugging and connection of the flight control system, electro-optical pod, and data transmission system to ensure normal operation of each module. Set the flight route through the ground station, planning the UAV to climb spirally to a height of 100 m and then fly at constant speed along a circular route of radius 100 m centered on marker T1; (2) After the UAV enters a stable flight state, the ground station controls the camera to zoom continuously from the wide-angle end (4.3 mm) to the telephoto end (129 mm); (3) During the entire zoom process, the pan-tilt servo system keeps marker T1 locked in the central area of the image and synchronously records high-definition video streams, IMU data, camera focal length control signals, and UAV attitude data; (4) Based on the proposed IMU-visual fusion-based online calibration algorithm for electro-optical pods, real-time processing is performed on the recorded flight data to solve the camera intrinsic parameters at different focal lengths. Meanwhile, the traditional offline Zhang Zhengyou calibration method and the structure-from-motion (SfM)-based self-calibration method are applied to the same data as comparative benchmarks.
The convergence performance of the EKF is highly dependent on the accuracy of the initial state. Therefore, before the flight experiments, the Zhang Zhengyou calibration method was used to perform offline calibration at 7 commonly used focal lengths (5 mm, 20 mm, 40 mm, 55 mm, 70 mm, 93 mm, 120 mm) of the pod under static laboratory conditions. For each focal length, 8 chessboard images from different perspectives were collected, and the intrinsic parameters were solved using the Matlab 2018 calibration toolbox, as shown in Figure 7. Based on these discrete calibration results, an empirical mapping of intrinsic parameters as a function of focal length was constructed through cubic spline interpolation, and this mapping was used to initialize the state vector of the online EKF, ensuring the algorithm starts from a good initial value.
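The interpolation step can be sketched with a hand-rolled natural cubic spline through the seven calibrated focal lengths. The focal-length sample points follow the text; the per-focal-length `fx_px` values are illustrative placeholders, not the measured calibration results:

```python
import numpy as np

def natural_cubic_spline(xs, ys):
    """Return an evaluator for the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = np.diff(xs)
    # Solve for the second derivatives M at the knots (natural: M[0] = M[n] = 0).
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[0, 0] = A[n, n] = 1.0
    for i in range(1, n):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        b[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    M = np.linalg.solve(A, b)

    def evaluate(x):
        i = int(np.clip(np.searchsorted(xs, x) - 1, 0, n - 1))
        dx = x - xs[i]
        c1 = (ys[i + 1] - ys[i]) / h[i] - h[i] * (2.0 * M[i] + M[i + 1]) / 6.0
        c2 = M[i] / 2.0
        c3 = (M[i + 1] - M[i]) / (6.0 * h[i])
        return ys[i] + c1 * dx + c2 * dx * dx + c3 * dx ** 3
    return evaluate

# Calibrated focal-length samples (mm) from the offline stage; fx values
# in pixels are illustrative assumptions.
focal_mm = np.array([5.0, 20.0, 40.0, 55.0, 70.0, 93.0, 120.0])
fx_px = np.array([1450.0, 5800.0, 11600.0, 15950.0, 20300.0, 26970.0, 34800.0])
fx_of_f = natural_cubic_spline(focal_mm, fx_px)

# Query the empirical mapping at an arbitrary zoom command to seed the EKF.
fx_init = float(fx_of_f(65.0))
```

One such evaluator would be built per intrinsic parameter (fx, fy, cx, cy, k1, k2), and all are queried with the commanded focal length to initialize the state vector.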
To evaluate the improvement in UAV target positioning accuracy brought by the proposed online self-calibration method, in addition to the reprojection RMSE, this paper also introduces the target positioning error. This error is obtained by substituting the calibrated intrinsic parameters into the UAV target positioning model [29] and calculating the Euclidean distance between the estimated 3D coordinates of the marker and the true coordinates measured by RTK, thereby objectively reflecting the performance of the algorithm in actual positioning tasks. It is calculated as:

$$E_{\mathrm{loc}} = \left\| \hat{\mathbf{P}} - \mathbf{P}_{\mathrm{true}} \right\|_2$$

where $\mathbf{P}_{\mathrm{true}}$ is the true world coordinate of marker T1, and $\hat{\mathbf{P}}$ is the world coordinate of the marker obtained by applying the estimated intrinsic parameters to the UAV target positioning model.
To comprehensively validate the robustness and generalizability of the proposed algorithm across diverse spatial positions, statistical analyses were performed on the experimental data of all seven artificial markers (T1–T7) deployed in the test scene. All experimental results presented herein correspond to mean values over 10 repeated flight tests, with the reprojection RMSE and geolocation error calculated for each individual marker; the detailed quantitative results are summarized in Table 1. As indicated in the table, the proposed IMU-vision fusion algorithm maintains a consistently low reprojection RMSE of 0.27 to 0.35 pixels and a geolocation error of 0.34 to 0.42 m for all markers throughout the full focal length range. This outcome demonstrates the strong spatial adaptability of the proposed algorithm in estimating camera intrinsic parameters for targets at different spatial positions. In contrast, the SfM-based self-calibration method yields significantly higher errors with large fluctuations for all markers, attributable to the absence of the high-frequency motion constraints provided by the IMU, which leads to unstable parameter estimation during continuous zooming. Although the offline Zhang's calibration method achieves the lowest average error among the three methods, it is inherently inapplicable to the dynamic zoom scenarios of UAV airborne electro-optical pods in practical field operations, owing to its reliance on a static calibration environment and pre-calibrated targets.
Notably, the pan-tilt servo system was employed to keep Marker T1 centered in the image frame throughout the entire flight experiment. This setup ensured the integrity of visual features of T1 and the optimal temporal synchronization between IMU motion data and visual image data during the continuous zoom process from the wide-angle end (4.3 mm) to the telephoto end (129 mm). For a more in-depth evaluation of the calibration performance of the proposed method across the full focal length spectrum, the calibration results of Marker T1 at typical focal length points (wide-angle, middle, and telephoto) are further analyzed and discussed in the following section.
Figure 8 shows the variation curves of reprojection RMSE of different algorithms over the entire focal length range. It can be observed that the RMSE of the proposed algorithm remains at a low level throughout the zoom process, ranging from 0.2 to 0.35 pixels with small fluctuations, indicating that the algorithm has good stability in continuous zoom scenarios. The RMSE of the SfM-based self-calibration method is significantly higher, with obvious fluctuations in the intervals where the focal length changes drastically. This is because the nonlinear variation in intrinsic parameters during the zoom process leads to insufficient pure visual constraints to stably estimate the parameters. The offline Zhang’s calibration method has the lowest RMSE, but this method requires a preset calibration object and cannot adapt to online dynamic scenarios, lacking practical application flexibility. In addition, the RMSE of the proposed algorithm gradually approaches that of the offline calibration method after the focal length stabilizes, further verifying the high-precision characteristics of the algorithm.
By applying the intrinsic parameters obtained from the different calibration algorithms to the UAV target positioning model, Figure 9 shows the distribution of target positioning results for marker T1 under different focal lengths. The red cross marks the true position of the target, and the scattered points of each color represent the 10 repeated positioning results of the corresponding method. The results indicate that: (1) the proposed IMU-visual fusion algorithm exhibits the best positioning concentration at all focal lengths, with the positioning points closely surrounding the true coordinates and latitude/longitude deviations controlled within ±1.5 × 10⁻⁵°; (2) although the offline Zhang Zhengyou calibration achieves slightly better positioning accuracy in the medium focal length range (60 mm), its positioning dispersion at the wide-angle (4.3 mm) and telephoto (129 mm) ends is higher than that of the proposed algorithm, and it cannot adapt to dynamic zoom scenarios; (3) the positioning dispersion of the SfM-based method increases significantly, with deviations expanding to ±5.8 × 10⁻⁵°, caused by the propagation of calibration residuals due to its dependence on visual features alone. In summary, through the tightly coupled fusion of IMU and vision, the proposed algorithm effectively suppresses parameter drift during zooming, achieving high-precision, high-consistency target positioning over the entire focal length range.
The statistical results of the positioning error for marker T1 are shown in Figure 10. The original data refer to the positioning errors obtained by feeding the uncalibrated camera parameters into the target positioning model. The average positioning error of the proposed algorithm is 0.35 m, with a maximum not exceeding 0.52 m; the pure visual self-calibration method yields an average error of 1.87 m and a maximum of 2.63 m; the offline Zhang Zhengyou calibration method achieves an average error of 0.28 m, slightly better than the proposed algorithm. However, since the offline method cannot adapt to dynamic zoom scenarios, its practical application value is limited. These results indicate that the proposed online calibration algorithm provides high-precision intrinsic parameter support for UAV target positioning, effectively reducing positioning deviations caused by inaccurate intrinsic parameters and meeting the accuracy requirements of actual tasks.
4. Conclusions
Aiming at the problem of decreased target positioning accuracy caused by the nonlinear variation in intrinsic parameters of zoom cameras in UAV electro-optical pods with focal length, this paper proposes an IMU-visual fusion-based online calibration method for electro-optical pods. By constructing a high-dimensional state vector including camera intrinsic and extrinsic parameters, as well as the 3D positions of scene feature points, a state transition model is established using the kinematic constraints provided by the IMU, and a nonlinear state space is constructed by combining with the visual observation model. The EKF is adopted to realize the joint recursive estimation of multiple parameters. Both Monte Carlo simulation experiments and real-world UAV flight experiments have verified the effectiveness and robustness of the algorithm. This method fundamentally solves the limitations of traditional calibration technologies in zoom scenarios, providing core technical support for high-precision imaging in dynamic UAV tasks.
The proposed method offers four core advantages that outweigh the minor accuracy edge of offline calibration under static conditions and carry significant engineering value for actual UAV operation: (1) Real-time dynamic adaptability: camera parameters are estimated and updated in real time during continuous zoom, adapting to the dynamic motion state of the UAV; (2) Convenient field deployment: the method eliminates dependence on precise calibration objects and controlled environments, requiring only simple planar targets for initialization, making it suitable for rapid field deployment; (3) Strong environmental robustness: the tightly coupled fusion of IMU and vision effectively suppresses interference from outdoor low texture, illumination changes, and other factors, maintaining high calibration accuracy; (4) Full-focal-length adaptability: the nonlinear variation of intrinsic parameters during zooming is accurately modeled, and stable calibration and positioning performance is maintained over the entire focal length range from wide-angle to telephoto.
Nevertheless, the proposed method still has limitations in specific application scenarios, and corresponding improvements can be pursued in future work: (1) Featureless terrain (water, snow): insufficient feature point extraction weakens the visual observation constraints and degrades EKF estimation accuracy; a future improvement is to fuse laser ranging data to supplement the spatial constraints. (2) Aggressive UAV maneuvers: motion exceeding the uniform-speed/slow-acceleration assumption of the motion model increases the modeling error of the IMU motion constraints and degrades fusion accuracy; a future improvement is to introduce an adaptive motion model that adjusts the EKF process noise covariance in real time according to the UAV's maneuver state. (3) Very low-light conditions: reduced image contrast sharply lowers the success rate of ORB feature detection and LK optical flow tracking, causing visual observations to fail; a future improvement is to incorporate infrared image features to realize multi-spectral feature fusion.