1. Introduction
Against the backdrop of the rapid development of low-altitude economy and intelligent perception technologies, unmanned aerial vehicles (UAVs) have been widely deployed in critical domains including emergency rescue, power line inspection, and precision agriculture. Their capability for rapid target perception and high-precision localization constitutes a core determinant of mission execution efficiency. As the pivotal component of UAV perception systems, electro-optical pods equipped with fixed-focal-length lenses face an inherent trade-off between field of view and spatial resolution, which hinders their ability to fulfill integrated operational requirements of wide-area search and fine-grained target identification. Consequently, electro-optical pods integrated with continuous zoom lenses have emerged as the prevailing direction of technological advancement in this field. However, the incorporation of continuous zoom systems introduces new technical bottlenecks: while the relative movement of lens elements during zooming is theoretically designed to maintain optical parameter stability, engineering imperfections such as manufacturing tolerances and assembly errors induce complex nonlinear variations in camera intrinsic parameters and distortion coefficients as a function of focal length. This issue has become a core obstacle restricting UAVs from achieving high-precision target localization, as the uncompensated nonlinear parameter drift will continuously accumulate with focal length adjustment and severely degrade the performance of visual perception and localization algorithms.
UAV target localization relies on accurate camera intrinsic parameters and distortion correction models. Failure to precisely estimate the camera’s intrinsic parameters across varying focal lengths will lead to model mismatches and feature point extraction errors, ultimately resulting in 3D target localization accuracy that exceeds acceptable thresholds. This, in turn, limits the applicability of UAVs in scenarios with stringent localization requirements, such as disaster assessment and precision agriculture. Notably, the nonlinear variation in intrinsic parameters is not a simple monotonic change but a coupled variation involving focal length, principal point and radial distortion, which further increases the difficulty of real-time online calibration during continuous zooming.
Accordingly, precise calibration of zoom lenses has become a research hotspot in this field. Traditional calibration methods (e.g., Zhang's method [1,2]) are predicated on the assumption of constant intrinsic parameters, rendering them incompatible with zoom operation and unable to capture the nonlinear variation of parameters with focal length. The present study aims to construct an accurate mathematical model characterizing the variation in intrinsic parameters and distortion coefficients with focal length, and to develop an efficient and robust calibration approach with a dedicated fusion strategy for real-time suppression of parameter drift during focal length changes. This work seeks to provide technical support for the high-precision application of zoom electro-optical pods, thereby breaking through the bottleneck of UAV localization accuracy.
Extensive research efforts have been devoted to zoom lens calibration by scholars worldwide, with technical approaches evolving along three primary directions: model construction, parameter solving strategies, and application scenario expansion. These methods exhibit significant disparities in accuracy, efficiency, and adaptability to operational environments. Early research predominantly adopted discrete calibration strategies, wherein parameters are calibrated independently at several discrete focal points, followed by polynomial fitting or spline interpolation to derive parameter values across the full focal range. While these methods are conceptually simple and easy to implement, they suffer from high calibration workloads. Specifically, discrete zoom-based calibration generally requires sampling at no fewer than 10 distinct focal lengths, with at least 8 images collected at each focal length, leading to cumbersome manual operation. Deep learning-based schemes further demand large-scale labeled training datasets, which significantly increases data acquisition cost, labeling effort, and computational complexity. Moreover, interpolation accuracy is highly dependent on the number and distribution of discrete points, making it difficult to globally approximate complex nonlinear mapping relationships [1,2,3,4] and unable to support real-time online calibration during continuous focal length adjustment. Wilson and Shafer [5] employed Tsai's method [6] to fit the variation of parameters (including focal length, principal point, and radial distortion) with focal length using fifth-order polynomials, which notably reduced reprojection errors. However, this method relies on high-precision calibration targets and controlled environments, impeding its feasibility for online calibration. Notably, principal point drift during zooming is a critical factor affecting calibration accuracy. Yin and Sui [7] proposed the Euclidean Distance Weighting Model (EDWM), which estimates the principal point via the intersection of image feature points before and after focal length adjustment. This approach effectively suppresses the interference of outliers, confining the principal point estimation error within 2.75 pixels. Despite their high accuracy, traditional methods require frequent recalibration and thus fail to meet the real-time demands of dynamic missions, and their pre-fitted polynomial models lack adaptability to the random nonlinear parameter variations encountered in actual flight scenarios.
To overcome the reliance on calibration targets, self-calibration methods leverage multi-view geometric constraints to solve for parameters directly from image sequences. These methods are suitable for camera calibration in unknown scenes or under dynamic motion conditions, but they impose stringent requirements on camera motion trajectories and scene texture richness [8,9,10]. Faugeras et al. [11] utilized the Kruppa equations and fundamental matrix constraints for intrinsic parameter calibration, yet the solving process is computationally complex and highly sensitive to initial values. Triggs [12] introduced the Absolute Dual Quadric (ADQ) to transform the constraints into a quadratic programming problem, enhancing solution stability. Pollefeys' hierarchical stepwise calibration method [13] upgrades projective reconstruction to metric reconstruction through modular constraints, enabling its application to zoom camera calibration. Agapitos et al. [14] proposed a method that first performs linear parameter estimation based on infinite homography constraints, followed by nonlinear optimization to refine focal length and distortion parameters, achieving self-calibration for rotating and zooming cameras. Although self-calibration methods offer high flexibility, they suffer from strong parameter coupling, which easily leads to local optima, and their reprojection errors are generally higher than those of traditional methods, especially in continuous zoom scenarios where nonlinear parameter variations are not explicitly modeled.
Hybrid calibration methods combine the advantages of target-based calibration and self-calibration. They establish parameter models using a small amount of calibration data and then update parameters online, striking a balance between accuracy and flexibility. Oh and Sohn [15] proposed a hybrid method integrating checkerboard-based calibration and pure-rotation self-calibration: parameters are first calibrated at different focal lengths, then parameter variation models are established via scattered data interpolation, while rotation sensors are employed to alleviate ill-posed problems. Sturm [16] determined the relationship between the principal point, distortion, and focal length through pre-calibration, and then solved for the remaining parameters using self-calibration based on the Kruppa equations. Experimental results demonstrate that hybrid methods improve adaptability while maintaining high accuracy, though their performance is contingent on the coverage and quality of calibration data.
In recent years, with the advancement of deep learning technologies, intelligent calibration methods have emerged that leverage neural networks to learn the complex mapping relationships between parameters and focal length, circumventing the limitations of explicit mathematical modeling [17,18]. Butt et al. [19] proposed a multi-task network for joint estimation of intrinsic and extrinsic parameters, incorporating the CARLA dataset to enhance generalization capability. Nevertheless, intelligent methods require large volumes of training data. Furthermore, most existing studies employ a single training dataset to fit multiple tasks, which often leads to local optima in parameter optimization and leaves room for improvement in generalization performance; in addition, these data-driven models lack real-time adaptability to the unforeseen nonlinear parameter variations in actual UAV flight and zoom processes. Current research priorities include online adaptive calibration and cross-focal-range domain learning, aiming to address calibration challenges in dynamic environments.
Overall, zoom lens calibration technology is evolving from static offline calibration to dynamic online calibration, forming a technical framework encompassing four categories: traditional calibration, self-calibration, hybrid calibration, and intelligent learning methods [20,21,22]. However, to meet the practical application requirements of UAV-borne electro-optical (EO) pods, three core issues remain to be urgently addressed. First, existing models lack sufficient capability to characterize full-focal-range nonlinear distortion, resulting in significant accuracy degradation at both the wide-angle and telephoto ends, and fail to explicitly model the coupled nonlinear variation of focal length, principal point, and radial distortion during continuous zooming. Second, most methods rely on high-precision calibration targets and controlled environments, limiting their convenience and robustness for field operations. Third, the balance between real-time performance and calibration accuracy has not been effectively achieved, and dedicated multi-sensor fusion strategies for suppressing parameter drift induced by focal length changes are lacking, making it difficult to adapt to the dynamic mission scenarios of UAVs.
To tackle the aforementioned challenges, this study proposes a high-precision, robust online calibration method tailored for UAV-borne continuous zoom EO pods, and systematically verifies its effectiveness in improving the accuracy of target geolocalization. The main contributions of this paper are threefold: (1) An improved joint parameterization model for zoom lens intrinsic parameters and distortion is proposed, which explicitly characterizes the coupled nonlinear variation of focal length, principal point, and radial distortion during continuous zooming, and embeds this nonlinear parameter constraint into the state estimation framework to enable real-time tracking of zoom-induced parameter changes, with particular emphasis on refining the modeling of the coupled variations in the camera principal point and radial distortion. (2) A tightly coupled IMU-visual fusion strategy is designed for the online calibration of continuously zooming cameras, together with a user-friendly calibration workflow that eliminates the need for sophisticated 3D calibration fields and requires only planar targets. By acquiring image sequences across multiple focal lengths and camera poses, combined with bundle adjustment optimization, the workflow enables efficient and robust solving of the model parameters, while the fusion strategy combines high-frequency IMU motion constraints and visual geometric constraints in real time to effectively suppress parameter drift during focal length changes. (3) A quantitative analysis model linking calibration residuals to final geolocalization errors is established.
Through both simulation and physical experiments, the correction accuracy of the proposed calibration method across different focal lengths and its practical effectiveness in improving the absolute localization accuracy of UAVs are comprehensively validated, and the superiority of the proposed method in handling parameter nonlinearities and suppressing zoom-induced parameter drift is fully demonstrated by comparing with state-of-the-art calibration methods.
2. Materials and Methods
2.1. Camera Model
There are five coordinate systems involved in the camera calibration process: the two-dimensional image pixel coordinate system and image physical coordinate system, and the three-dimensional camera coordinate system, UAV body coordinate system, and world coordinate system. The correspondence among the five coordinate systems is shown in Figure 1.
Due to the inherent manufacturing imperfections of optical lenses, image distortion inevitably occurs during the imaging process. Therefore, a distortion correction model must be incorporated into the camera calibration framework. Common distortion types include radial distortion, tangential distortion, and thin-prism distortion. In practice, only radial distortion is dominant and is typically the only component considered when calibrating UAV electro-optical zoom lenses, as the other distortion components are negligible under normal operating conditions [23,24]. Accordingly, considering radial distortion only, the imaging model is modified as shown in Equation (1).
x_d = x (1 + k_1 r^2 + k_2 r^4) + O(r^6),  y_d = y (1 + k_1 r^2 + k_2 r^4) + O(r^6)    (1)

where k_1 and k_2 represent the first- and second-order radial distortion coefficients, O(r^6) denotes the higher-order infinitesimal term, r^2 = x^2 + y^2, (x, y) denotes the ideal normalized image coordinate, and (x_d, y_d) is the corresponding distorted image coordinate. Considering the trade-off between calibration accuracy and computational efficiency, only the second-order radial distortion model is adopted in this work, which has been widely verified to be sufficient for zoom lenses in UAV electro-optical pods.
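As a concrete illustration, the second-order radial distortion model described above can be sketched in a few lines of Python (the function and variable names are illustrative conventions, not the paper's notation):

```python
def apply_radial_distortion(x, y, k1, k2):
    """Map an ideal normalized image coordinate (x, y) to its distorted
    counterpart with a second-order radial model; higher-order O(r^6)
    terms are truncated, as in the text."""
    r2 = x * x + y * y                      # squared distance from the optical axis
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    return x * scale, y * scale

# A point on the optical axis is unaffected by radial distortion.
x0, y0 = apply_radial_distortion(0.0, 0.0, k1=-0.1, k2=0.01)
# An off-axis point is pulled inward when k1 < 0 (barrel distortion).
xb, yb = apply_radial_distortion(0.5, 0.0, k1=-0.1, k2=0.01)
```

For a typical barrel-distorting lens (negative k_1), the distorted coordinate of an off-axis point lies closer to the principal point than the ideal one.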
From the above equation, it can be seen that when the zoom lens settings change, the intrinsic parameters change accordingly; likewise, when the focal length changes, so does the radial distortion coefficient. Different implicit functions relate the calibration parameters to the lens settings, so the calibration of zoom cameras is more complex and variable than that of fixed-focus cameras [25].
2.2. Proposed Algorithm
For the dynamic operational scenarios of UAV-borne zoom electro-optical pods, existing calibration methods struggle to characterize the continuous nonlinear evolution of camera intrinsic parameters during zooming. Meanwhile, mainstream self-calibration methods suffer from poor convergence under zoom conditions due to the insufficient fusion of multi-source heterogeneous sensor information. In view of these limitations, we propose an online calibration framework that balances real-time performance and robustness, aiming to break through the bottlenecks of current technologies.
To address this challenge, we design a framework capable of online, real-time, joint estimation of camera intrinsic/extrinsic parameters and environmental structure, as illustrated in Figure 2. Specifically, camera intrinsic and extrinsic parameters, along with the 3D coordinates of scene feature points, are jointly encoded into a high-dimensional state vector. The Extended Kalman Filter (EKF) is then employed to recursively estimate this state vector, with concurrent constraints imposed on the evolution of the intrinsic parameters. This tightly coupled fusion strategy enables robust, high-precision online calibration for UAV-borne electro-optical pods.
The proposed calibration framework is designed specifically for the calibration scenarios and system lifecycle of unmanned aerial vehicle (UAV) electro-optical pods; the key aspects of these scenarios are detailed below. The calibration process of the UAV electro-optical pod consists of two complementary stages: offline pre-calibration and online real-time calibration. Offline pre-calibration is performed once at factory delivery or during maintenance to provide initial intrinsic and extrinsic parameters, while online real-time calibration runs synchronously with all flight missions to update parameters in real time, ensuring consistency between calibration results and actual flight states.
The core goal of the proposed calibration framework is to support high-precision target geolocation of UAVs in military reconnaissance and emergency rescue scenarios, with mandatory precision indices including reprojection root mean square error (RMSE) below 1 pixel, geolocation view angle error less than 1 mrad, and mean geolocation error under 5 m.
The UAV platform and its inertial measurement unit (IMU) specifications are key prerequisites for the effectiveness of the calibration framework. Prior to the formal calibration process, the IMU must undergo pre-calibration, including factory calibration and on-site RTK-GNSS calibration, to ensure the systematic error of the gyroscope is <0.005 rad/s. During the calibration process, the UAV does not need to follow a fixed trajectory; instead, it only requires smooth motion, which is consistent with the actual flight characteristics of UAV missions.
Parameter drift and stability properties are fully considered in the framework design to maintain long-term calibration accuracy. Camera intrinsics drift nonlinearly with focal length but do not drift over time, whereas the IMU exhibits slow temporal drift (gyroscope bias drift of 0.001 rad/s·h). Importantly, all hardware parameters remain stable within a single flight mission (usually no more than 5 h), which is the key time scale of UAV tasks, ensuring the reliability of calibration results throughout the entire mission period.
The calibration setup for each stage is specifically designed to match the environmental conditions and precision requirements. The proposed tightly coupled EKF-based fusion strategy effectively integrates the constraints from the calibration setup, parameter stability, and platform motion characteristics, ultimately achieving robust and high-precision online calibration for UAV-borne electro-optical pods.
2.3. Problem Modeling
Firstly, we define the complete state vector x_k of the system, which includes all variables to be estimated at discrete time step k:

x_k = [θ_k^T, p_k^T, v_k^T, q_k^T, m^T]^T

where θ_k = (f_x, f_y, u_0, v_0, k_1, k_2)^T denotes the vector of camera intrinsic parameters to be estimated, corresponding to the equivalent focal lengths in the x and y directions, the principal point coordinates, and the two primary radial distortion coefficients, respectively. p_k and v_k represent the camera's position and linear velocity in the world coordinate system, respectively. q_k is a unit quaternion describing the rotational attitude of the camera from the world coordinate system to the camera coordinate system. m is a spatial position vector composed of the 3D coordinates of the N scene feature points, and its dimension is 3N.
In this model, a key simplifying assumption is made: within a limited observation window, the camera's intrinsic parameters and the positions of the scene feature points remain constant. This assumption transforms the complex online calibration problem into a dynamic state estimation problem with static parameters, enabling effective joint estimation via the EKF.
The entire system follows a typical non-linear state-space model, described jointly by a state transition equation and an observation equation:

x_k = f(x_{k-1}) + w_{k-1},  z_k = h(x_k) + v_k

where x_k represents the state value at time k and z_k is the measurement value at that time, composed of the detected distorted pixel coordinates of the N scene feature points in the image. f(·) is the non-linear state transition function and w_{k-1} is the process noise representing model uncertainty; h(·) is the non-linear observation function and v_k is the measurement noise, primarily originating from errors in image feature point detection.
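To make the bookkeeping concrete, a minimal sketch of how such a high-dimensional state vector might be packed and unpacked in NumPy follows (the layout and names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def pack_state(intrinsics, position, velocity, quaternion, landmarks):
    """Concatenate all estimated quantities into one flat state vector:
    6 intrinsics + 3 position + 3 velocity + 4 quaternion + 3N landmarks."""
    return np.concatenate([intrinsics, position, velocity,
                           quaternion, landmarks.ravel()])

def unpack_state(x):
    """Recover the individual blocks from the flat state vector."""
    intrinsics = x[0:6]                  # fx, fy, u0, v0, k1, k2
    position   = x[6:9]
    velocity   = x[9:12]
    quaternion = x[12:16]
    landmarks  = x[16:].reshape(-1, 3)   # N x 3 feature positions
    return intrinsics, position, velocity, quaternion, landmarks

# Example: with N = 2 scene feature points the state has dimension 16 + 6 = 22.
x = pack_state(np.array([800., 800., 320., 240., -0.1, 0.01]),
               np.zeros(3), np.zeros(3),
               np.array([1., 0., 0., 0.]),
               np.array([[0., 0., 5.], [1., 1., 6.]]))
```

Keeping an explicit index map like this is what allows new feature points to be appended to, and stale ones removed from, the state vector during filtering.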
2.4. Camera Motion Model
The state transition model mainly describes how the camera’s extrinsic parameters (position, velocity, attitude) evolve over time. Since the camera is rigidly attached to the carrier, its motion can be directly modeled using measurements from the IMU mounted on the carrier. The IMU includes an accelerometer and a gyroscope, which sense the carrier’s linear acceleration and angular velocity, respectively. However, translational and rotational motions are physically decoupled and differ significantly in their mathematical representation: translation is typically expressed using Euclidean vectors, while rotation belongs to the Lie group SO(3) and cannot be handled by simple vector addition. Therefore, to construct a physically accurate and mathematically tractable model, the camera’s motion must be decomposed into translational and rotational components, which are modeled separately.
2.4.1. Translational Motion Model
For translational dynamics, we adopt the Wiener velocity process model, with the estimated values being the camera's position p and velocity v. This model assumes that the acceleration is driven by a zero-mean white noise process w_a(t), which effectively captures the random, unpredictable acceleration changes experienced by the carrier in real environments. Specifically:

p̈(t) = a(t) + w_a(t)
Converting this into a system of first-order differential equations and integrating in the discrete time domain yields:

p_{k+1} = p_k + v_k Δt_k + (1/2) a_k Δt_k^2,  v_{k+1} = v_k + a_k Δt_k + w_k

where Δt_k represents the k-th IMU sampling period, a_k is the linear acceleration after gravity compensation, and w_k is a random disturbance term with process-noise covariance Q_k.
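Under this translational model, one discrete propagation step can be sketched as follows (a minimal NumPy illustration of the position/velocity update, not flight code; process noise is omitted):

```python
import numpy as np

def propagate_translation(p, v, a, dt):
    """One discrete step of the Wiener-velocity model:
    p <- p + v*dt + 0.5*a*dt^2,  v <- v + a*dt,
    where `a` is the gravity-compensated IMU acceleration."""
    p_next = p + v * dt + 0.5 * a * dt * dt
    v_next = v + a * dt
    return p_next, v_next

# Constant 1 m/s^2 acceleration along x, integrated over 1 s in 100 steps:
p, v = np.zeros(3), np.zeros(3)
a = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    p, v = propagate_translation(p, v, a, 0.01)
# p[0] matches the analytic value 0.5*a*t^2 = 0.5 m; v[0] matches a*t = 1 m/s.
```

Because the per-step update includes the 0.5·a·dt² term, the discrete integration reproduces the constant-acceleration solution exactly rather than only to first order.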
2.4.2. Rotational Motion Model
For rotational motion, the dynamics are driven by the angular velocity measured by the gyroscope. The mathematical nature of rotation prevents updates via simple vector integration as in the translational case. Unit quaternions are widely used for attitude representation due to their non-singularity and high computational efficiency. There is a definite differential relationship between the time derivative of a quaternion and the corresponding angular velocity:

q̇(t) = (1/2) Ω(ω) q(t)

where Ω(ω) is the skew-symmetric matrix constructed from the angular velocity vector ω = (ω_x, ω_y, ω_z)^T:

Ω(ω) = [  0    -ω_x  -ω_y  -ω_z
         ω_x    0     ω_z  -ω_y
         ω_y  -ω_z    0     ω_x
         ω_z   ω_y  -ω_x    0  ]
In the discrete time domain, assuming constant angular velocity within a sampling period Δt_k, the quaternion update can be obtained via the exponential map:

q_{k+1} = exp( (Δt_k / 2) Ω(ω_k) ) q_k

where exp(·) denotes the matrix exponential, which yields a 4 × 4 rotation matrix that operates on the quaternion q_k via standard matrix-vector multiplication. Similarly, this process is affected by gyroscope noise and bias drift, which are modeled as process noise w_q.
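Since Ω(ω)² = −|ω|²I, the matrix exponential admits the closed form exp((Δt/2)Ω) = cos(|ω|Δt/2)·I + (sin(|ω|Δt/2)/|ω|)·Ω, which a short sketch can exploit. The Hamilton convention q = [w, x, y, z] used here is an assumption, since the text does not state its quaternion convention:

```python
import numpy as np

def omega_matrix(w):
    """4x4 matrix Omega(w) such that q_dot = 0.5 * Omega(w) @ q."""
    wx, wy, wz = w
    return np.array([[0., -wx, -wy, -wz],
                     [wx,  0.,  wz, -wy],
                     [wy, -wz,  0.,  wx],
                     [wz,  wy, -wx,  0.]])

def quat_update(q, w, dt):
    """Closed-form exponential-map update for constant angular velocity."""
    n = np.linalg.norm(w)
    if n < 1e-12:
        return q                         # negligible rotation, keep attitude
    half = 0.5 * n * dt
    M = np.cos(half) * np.eye(4) + (np.sin(half) / n) * omega_matrix(w)
    q_next = M @ q
    return q_next / np.linalg.norm(q_next)   # guard against numerical drift

# Rotating at pi/2 rad/s about z for 1 s yields a 90-degree rotation:
q = quat_update(np.array([1., 0., 0., 0.]), np.array([0., 0., np.pi / 2]), 1.0)
```

The final renormalization is a standard safeguard: accumulated floating-point error would otherwise let the quaternion drift off the unit sphere over many IMU steps.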
Combining the translational and rotational models, the complete state transition function can be written in a block-diagonal form, whose Jacobian matrix is used for linearization in the prediction step of EKF.
2.5. Visual Observation Model
The visual observation model establishes the mathematical connection between the state vector and the 2D feature points extracted from images. To clearly elaborate on this model, it is first necessary to define the involved coordinate systems and their transformation relationships. Two main coordinate systems are involved in this process: the world coordinate system and the camera coordinate system. The world coordinate system is a global, inertial reference frame where the true 3D positions of all scene feature points are defined. The camera coordinate system is a local frame with the camera’s optical center as the origin and the Z-axis aligned with the optical axis, whose orientation is determined by the attitude quaternion at the current moment.
Given a 3D feature point m_i in the world coordinate system, its coordinates in the camera coordinate system are obtained via the rigid-body transformation:

p_i^c = R(q_k)(m_i - p_k)

where R(q_k) is the rotation matrix converted from the quaternion q_k. Writing p_i^c = (x_i^c, y_i^c, z_i^c)^T, the undistorted ideal projection point (u_i, v_i) can be derived using the pinhole camera model:

[u_i, v_i, 1]^T = K [x_i^c / z_i^c, y_i^c / z_i^c, 1]^T

where K is the intrinsic parameter matrix composed of f_x, f_y, u_0, and v_0, which performs the affine transformation from the normalized plane to the pixel plane.
However, actual lenses exhibit optical distortion, particularly radial distortion, which is more pronounced during zooming. The actually observed pixel coordinates (u_i^d, v_i^d) are the result of applying radial distortion to the ideal projection point (u_i, v_i). This paper adopts the widely used second-order radial distortion model:

u_i^d = u_i + (u_i - u_0)(k_1 r_i^2 + k_2 r_i^4),  v_i^d = v_i + (v_i - v_0)(k_1 r_i^2 + k_2 r_i^4)

where r_i^2 is the squared normalized distance from the ideal projection point to the principal point.
In summary, the complete visual observation function h(·) is a non-linear mapping composed of the aforementioned projection and distortion models, which maps the relevant components of the state vector to 2D pixel coordinates. In the update step of the EKF, the Jacobian matrix of this function with respect to the state vector must be computed.
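Chaining the rigid-body transform, pinhole projection, and radial distortion, the observation function for a single feature point can be sketched as follows (variable names are illustrative; R is assumed to rotate world coordinates into the camera frame, as in the text):

```python
import numpy as np

def observe_point(m_w, R, p, fx, fy, u0, v0, k1, k2):
    """Project a world-frame 3D point to distorted pixel coordinates."""
    # Rigid-body transform into the camera frame
    pc = R @ (m_w - p)
    # Pinhole projection onto the normalized image plane
    xn, yn = pc[0] / pc[2], pc[1] / pc[2]
    # Second-order radial distortion in normalized coordinates
    r2 = xn * xn + yn * yn
    s = 1.0 + k1 * r2 + k2 * r2 * r2
    # Affine mapping to the pixel plane via the intrinsics
    return np.array([fx * xn * s + u0, fy * yn * s + v0])

# A point on the optical axis projects exactly to the principal point:
uv = observe_point(np.array([0., 0., 5.]), np.eye(3), np.zeros(3),
                   800., 800., 320., 240., -0.1, 0.01)
```

Note that distorting the normalized coordinates before applying the intrinsics is algebraically equivalent to the pixel-space form u_d = u + (u - u_0)(k_1 r² + k_2 r⁴), since u - u_0 = f_x·x_n.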
2.6. Feature Extraction and Tracking Strategy
Robust and efficient feature extraction and tracking are the foundation for constructing visual geometric constraints. To provide stable visual observations, this paper adopts the feature extraction and tracking pipeline proposed by Qinhan Chen et al. The specific strategy is as follows: first, detect high-discriminative and stable key points in a single image to serve as anchors for subsequent inter-frame matching; second, establish correspondences between these feature points across consecutive video frames (i.e., feature tracking) to obtain a set of 2D-2D pixel matching pairs. These matching pairs contain rich information about camera motion and serve as the direct data source for constructing the visual observation model.
2.6.1. Feature Extraction Method
UAVs typically operate in outdoor environments characterized by homogeneous terrain and indistinct ground features [26]. In such scenarios, feature points detected by the traditional ORB algorithm are predominantly concentrated in high-texture regions, and the algorithm exhibits poor adaptability to large-scale variations [27,28]. This often leads to a loss of feature points during UAV movement, thereby degrading the accuracy of online calibration. To address this issue, this study proposes a feature compensation strategy based on uniform distribution. Specifically, the image is divided into grids of 30 × 30 pixels, and an adaptive threshold T is calculated for each grid to detect FAST corner points. This approach effectively resolves the challenge of feature point extraction across regions with varying texture densities. The calculation formula for the adaptive threshold T is given as follows:
where K is the proportional coefficient, 0 < K < 1. The value of K is adaptively selected based on the characteristics of the input image and the specific requirements of the task; in this article, K is set to 0.8. Ī is the average gray value of the pixels in the grid, and I_max and I_min are the maximum and minimum gray values of the pixels in the grid, respectively.
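The per-grid threshold computation can be sketched as below. Since the exact formula depends on the equation above, the simple contrast-based form T = K·(I_max − I_min) used here is an assumption for illustration only:

```python
import numpy as np

def grid_thresholds(img, cell=30, K=0.8):
    """Compute an adaptive FAST threshold for each cell x cell grid.
    Assumed form: T = K * (I_max - I_min) within the grid, so low-contrast
    regions receive a lower threshold and still yield corner candidates."""
    h, w = img.shape
    T = np.zeros((h // cell, w // cell))
    for i in range(h // cell):
        for j in range(w // cell):
            patch = img[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            T[i, j] = K * (float(patch.max()) - float(patch.min()))
    return T

# A flat grid gets threshold 0; a grid containing an edge gets a high one.
img = np.zeros((60, 60), dtype=np.uint8)
img[0:15, 30:60] = 200       # high-contrast structure in the top-right cell
T = grid_thresholds(img)
```

The point of the strategy is visible even in this toy case: a global fixed threshold tuned for the bright region would suppress all detections in the homogeneous cells, whereas the per-grid threshold adapts to local contrast.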
As illustrated in Figure 3, the left panel presents the feature point extraction results of the ORB algorithm with a fixed threshold (a value optimized via multiple experiments). It is evident that the extracted feature points are primarily clustered around foreground objects with significant gradient variations. When the camera undergoes large-angle rotation or abrupt velocity changes, massive feature point loss occurs, resulting in poor system robustness and calibration accuracy. In contrast, the right panel displays the extraction results of the proposed algorithm. It can be observed that the proposed method fully exploits information from all image regions, yielding uniformly distributed feature points across the entire field of view. This uniform feature distribution provides a richer set of correspondences for subsequent image feature matching.
2.6.2. Feature Tracking Method
After extracting features from a single image, inter-frame feature tracking is essential to establish temporal continuity. The quality of feature tracking directly determines the accuracy and robustness of visual observations. For feature tracking, we adopt the pyramid Lucas-Kanade (LK) optical flow method: first, use the uniformly distributed ORB feature points extracted earlier as tracking targets, initialize the optical flow displacement to zero at the top layer of the Gaussian image pyramid, and perform optical flow estimation layer by layer from top to bottom. For each layer, first predict the position based on the rough displacement result from the previous layer, then compute the true gradient of pixels within a neighborhood window in the matching image centered at this predicted position. Next, iteratively solve for the optimal displacement by minimizing the grayscale error of corresponding windows between the two frames—updating the residual optical flow vector in each iteration, while introducing an adaptive termination mechanism based on changes in the condition number of the Hessian matrix. The iteration stops when the difference in condition numbers between two consecutive iterations is less than a set threshold or the displacement increment is sufficiently small. After completing optical flow tracking across all pyramid layers, the rough corresponding positions of each feature point in the matching image are obtained, forming initial matching pairs. Finally, compute the Hamming distance using rotated BRIEF descriptors and eliminate incorrect matches using the nearest neighbor ratio strategy, resulting in high-precision 2D-2D matching results.
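A single-level version of the Lucas-Kanade step described above can be sketched in NumPy; the full pipeline adds the image pyramid, the condition-number stopping rule, and descriptor-based outlier rejection, so this is a simplified illustration rather than the authors' implementation:

```python
import numpy as np

def bilinear(img, x, y):
    """Sample img at subpixel location (x, y) with bilinear interpolation."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0] + dx * dy * img[y0 + 1, x0 + 1])

def lk_step(img0, img1, pt, win=7, iters=30):
    """Estimate the displacement of `pt` from img0 to img1 by Gauss-Newton
    minimization of the windowed grayscale error (single pyramid level)."""
    gy, gx = np.gradient(img0)            # template-frame image gradients
    xs = np.arange(-win, win + 1)
    px, py = pt
    # Template window and its gradients (fixed across iterations)
    T  = np.array([[bilinear(img0, px + i, py + j) for i in xs] for j in xs])
    Ix = np.array([[bilinear(gx,  px + i, py + j) for i in xs] for j in xs])
    Iy = np.array([[bilinear(gy,  px + i, py + j) for i in xs] for j in xs])
    H = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    d = np.zeros(2)
    for _ in range(iters):
        W = np.array([[bilinear(img1, px + d[0] + i, py + d[1] + j)
                       for i in xs] for j in xs])
        err = T - W                        # grayscale residual over the window
        step = np.linalg.solve(H, [np.sum(Ix * err), np.sum(Iy * err)])
        d += step
        if np.linalg.norm(step) < 1e-4:    # displacement increment small enough
            break
    return d

# Synthetic check: a smooth blob shifted by (1.5, -0.5) pixels.
Y, X = np.mgrid[0:64, 0:64].astype(float)
img0 = np.exp(-((X - 32.0)**2 + (Y - 32.0)**2) / 50.0)
img1 = np.exp(-((X - 33.5)**2 + (Y - 31.5)**2) / 50.0)
d = lk_step(img0, img1, (32.0, 32.0))
```

On this synthetic pair the recovered displacement is close to the true (1.5, −0.5) shift; the pyramid in the full method exists precisely to extend this local, small-motion solver to the large inter-frame displacements caused by fast UAV motion and zooming.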
To mitigate the adverse effects of drastic field-of-view (FoV) changes induced by camera zooming and maintain robust visual constraints, a dynamic feature point management strategy is integrated with the improved ORB feature extraction and LK optical flow tracking. The strategy is implemented as follows: (1) Supplementary feature extraction: Following initial feature detection in each frame, the improved ORB algorithm is reapplied to FoV transition regions (i.e., newly visible or occluded areas due to zoom) to supplement feature points, ensuring sufficient visual constraints for state estimation. (2) Invalid feature rejection: Feature points lost or mismatched during FoV changes are filtered out using a pixel-level displacement threshold and a Hamming distance threshold within the LK optical flow framework, eliminating unreliable observations. (3) Stable feature screening: Only feature points tracked for ≥3 consecutive frames are retained, ensuring the consistency and validity of visual measurements. With this strategy, the algorithm can detect 80 to 120 stable feature points per frame. This strategy is tightly coupled with the extended Kalman filter (EKF): the 3D coordinates of newly extracted features are incorporated into the high-dimensional state vector in real time, while invalid features are removed from the state vector, enabling dynamic updates of the feature set to sustain the efficiency and accuracy of joint recursive estimation.
2.7. EKF Solution Process
Based on the non-linear state-space model constructed earlier, the EKF is used for recursive estimation of the state vector x_k. The whole solving process consists of a prediction stage and an update stage in each time step.
In the prediction phase, the IMU-driven state transition model f(·) is used to predict the state and its uncertainty at the next moment. The prior state estimate x̂_{k|k-1} is computed directly by applying the non-linear state transition function to the posterior estimate of the previous moment, assuming the expected value of the process noise is zero:

x̂_{k|k-1} = f(x̂_{k-1|k-1})

To propagate the state uncertainty, the non-linear function f(·) is linearized around x̂_{k-1|k-1}. Its Jacobian matrix is defined as:

F_k = ∂f/∂x |_{x = x̂_{k-1|k-1}}

Using this Jacobian matrix, the prior covariance at the current moment is computed from the posterior covariance P_{k-1|k-1} of the previous moment:

P_{k|k-1} = F_k P_{k-1|k-1} F_k^T + Q_{k-1}

where Q_{k-1} is the covariance matrix of the process noise w_{k-1}, which quantifies the uncertainty of the IMU model and carrier dynamics.
In the update phase, the current prior state estimate
is used to predict the observation value via the non-linear observation model
:
Define the innovation as the difference between the actual observation value and the estimated observation value:
Similarly to the prediction phase, linearize the observation model
around
. Its Jacobian matrix is defined as:
The innovation covariance matrix is obtained by combining the uncertainty of the predicted state and the noise of the observation itself:
where
is the covariance matrix of measurement noise
, typically set empirically based on the accuracy of image feature point detection.
By definition, the Kalman gain is calculated as:
Update the state estimate and its covariance using the Kalman gain and innovation to obtain the final posterior estimate:
Through the above prediction-update cycle, EKF can recursively output the joint estimation of camera intrinsic parameters, extrinsic parameters, and scene structure in an online, real-time manner, thereby completing the online calibration task for zoom cameras. This process fully leverages the high-frequency motion constraints of the IMU and the geometric consistency constraints of vision, effectively overcoming the limitations of single-sensor systems.
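The prediction-update cycle can be written compactly as a single generic EKF step. This is a minimal sketch with caller-supplied models `f`, `h` and their Jacobians, not the paper's actual IMU/vision models:

```python
import numpy as np

def ekf_step(x, P, u, z, f, F_jac, h, H_jac, Q, R):
    """One EKF prediction-update cycle for a generic nonlinear model."""
    # --- prediction ---
    x_prior = f(x, u)                      # prior state estimate
    F = F_jac(x, u)                        # Jacobian of f at previous posterior
    P_prior = F @ P @ F.T + Q              # prior covariance
    # --- update ---
    z_pred = h(x_prior)                    # predicted observation
    y = z - z_pred                         # innovation
    H = H_jac(x_prior)                     # Jacobian of h at prior estimate
    S = H @ P_prior @ H.T + R              # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_post = x_prior + K @ y               # posterior state estimate
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```

In the paper's setting, `x` is the high-dimensional joint state (intrinsics, extrinsics, feature 3D positions), `u` carries the IMU measurements, and `z` stacks the tracked feature pixel coordinates.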
Notably, the validity of the EKF framework depends on the noise characteristics of the gyroscope, which imposes the following applicability constraints. When the gyroscope noise standard deviation is at most 0.01 rad/s, the quaternion linearization error, derived from the Taylor expansion of the exponential map, is below 1%; the Gaussian assumption of the EKF then remains valid and the filter performs reliably. When it exceeds 0.05 rad/s, the linearization error exceeds 10%, violating the fundamental assumptions of the EKF and degrading estimation performance; in such scenarios, the Unscented Kalman Filter (UKF) is recommended as a more robust alternative.
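The linearization error referred to above can be quantified numerically. The sketch below compares the exact quaternion exponential map with its first-order approximation; the leading error term is the second-order term of the Taylor expansion, of magnitude roughly θ²/8 with θ = |ω|Δt the per-step rotation angle. This is a standard result used here for illustration, not the paper's exact formula:

```python
import numpy as np

def quat_exact(omega, dt):
    """Exact unit quaternion for rotation vector theta = omega * dt."""
    theta = np.linalg.norm(omega) * dt
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = np.asarray(omega) / np.linalg.norm(omega)
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * axis))

def quat_first_order(omega, dt):
    """First-order Taylor approximation q ~ [1, omega*dt/2]."""
    return np.concatenate(([1.0], 0.5 * dt * np.asarray(omega)))

def linearization_error(omega, dt):
    """Norm of the difference between the exact and linearized quaternions."""
    return float(np.linalg.norm(quat_exact(omega, dt) - quat_first_order(omega, dt)))
```

Doubling the angular rate quadruples the error, confirming the quadratic growth that motivates the 0.01 rad/s applicability bound.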
3. Experimental Results and Discussion
To comprehensively evaluate the effectiveness and robustness of the proposed online calibration method for zoom cameras, two complementary sets of experiments were designed and implemented: Monte Carlo simulation experiments and real UAV flight experiments. The simulation experiments quantitatively analyze the algorithm's convergence and accuracy under different noise levels in a controlled environment; the real flight experiments verify the algorithm's practicality and stability in complex, dynamic real-world scenarios. In the experiments, the root mean square error (RMSE) of reprojection was adopted as the core evaluation metric, defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\| \mathbf{z}_i - h_i\!\left(\hat{\mathbf{x}}\right) \right\|^2}$$

where $\mathbf{z}_i$ denotes the actual observed pixel coordinates of the $i$-th feature point, $h_i(\hat{\mathbf{x}})$ represents the reprojected coordinates calculated from the estimated state $\hat{\mathbf{x}}$ via the visual observation model $h(\cdot)$, and $N$ is the total number of feature points involved in the calculation.
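The reprojection RMSE metric reduces to a few lines of code (a minimal sketch operating on arrays of observed and reprojected pixel coordinates):

```python
import numpy as np

def reprojection_rmse(observed, reprojected):
    """RMSE between observed and reprojected pixel coordinates, shape (N, 2)."""
    observed = np.asarray(observed, dtype=float)
    reprojected = np.asarray(reprojected, dtype=float)
    sq_err = np.sum((observed - reprojected) ** 2, axis=1)  # per-point squared error
    return float(np.sqrt(np.mean(sq_err)))
```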
3.1. Simulation Experiments
The simulation experiments were built in a Python 3.8 environment as a virtual IMU-visual integrated platform comprising a pinhole camera with known intrinsic parameters and an ideal IMU; the parameter configurations were strictly aligned with the characteristics of real UAV electro-optical pods. For scene modeling, 1000 3D spatial points were randomly generated within X, Y ∈ [−50, 50] m, Z ∈ [10, 70] m to simulate the spatial distribution of outdoor open scenes. The feature point coordinates were drawn by uniform random sampling to ensure an even distribution of scene texture, avoiding unbalanced observation constraints caused by locally dense or sparse feature points and providing sufficient data support for the visual geometric constraints. The camera adopted a pinhole model with radial distortion, with fixed ground-truth values for the equivalent focal length, principal point, and radial distortion coefficients $k_1$ and $k_2$. The camera moved along a complex trajectory driven by random linear and angular velocities to simulate the dynamics of UAVs in actual flight. Regarding data characteristics, the IMU sampling rate was set to 100 Hz, and a 200-frame sequence of sparse 2D point projections was generated at 30 FPS with a resolution of 1920 × 1080. To simulate the imperfection of real sensors, zero-mean Gaussian white noise was injected into the angular velocity and acceleration measurements of the IMU in each test, and pixel-level zero-mean Gaussian noise was added to the detected coordinates of the image feature points.
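The simulated scene and camera can be sketched as follows. The point distribution and image resolution follow the text; the intrinsic values (`fx`, `cx`, `k1`, …) and the observation noise σ are illustrative placeholders, since the paper's exact ground-truth values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Scene: 1000 points uniformly sampled in X, Y in [-50, 50] m, Z in [10, 70] m.
points = np.column_stack([
    rng.uniform(-50.0, 50.0, 1000),
    rng.uniform(-50.0, 50.0, 1000),
    rng.uniform(10.0, 70.0, 1000),
])

# Placeholder intrinsics for a 1920x1080 image (assumed values).
fx = fy = 1200.0
cx, cy = 960.0, 540.0
k1, k2 = -0.1, 0.01

def project(pts):
    """Pinhole projection with two-term radial distortion."""
    x = pts[:, 0] / pts[:, 2]
    y = pts[:, 1] / pts[:, 2]
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2      # radial distortion factor
    u = fx * x * d + cx
    v = fy * y * d + cy
    return np.column_stack([u, v])

pixels = project(points)
# Pixel-level zero-mean Gaussian observation noise (sigma assumed = 0.5 px).
noisy_pixels = pixels + rng.normal(0.0, 0.5, pixels.shape)
```

A point on the optical axis projects to the principal point, which gives a quick sanity check of the model.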
The EKF initialization strategy was as follows: the state vector was randomly initialized within 3σ of the true value, with the initial error of the intrinsic parameters controlled within ±5%, the initial position error within 5 m, the initial velocity error within 0.5 m/s, and the initial attitude quaternion error within 0.05 rad, to test the robustness of the algorithm to initialization errors. In each EKF update step, 50 successfully tracked feature points were randomly selected as observation inputs to balance computational efficiency and estimation accuracy.
Figure 4 illustrates the evolution of the estimated camera intrinsic parameters over time in a single simulation run. The solid blue lines represent the estimated values, and the dashed red lines denote the corresponding true values. All intrinsic parameter estimates exhibit rapid convergence. In the initial stage (frames 0–10), owing to the significant deviation of the initial state estimate from the true values and the limited accumulation of observations, the estimates fluctuate noticeably and the Kalman gain is relatively large; the system relies mainly on visual observations to correct the prior prediction, which is consistent with the expected behavior of the EKF. As more visual observations accumulate (frames 10–50), the EKF quickly corrects this deviation through continuous iterative updates. The focal length parameters $f_x$ and $f_y$ stabilize around the true values after approximately 50 frames, with minimal subsequent fluctuation, indicating that the algorithm effectively captures and recovers the focal length information. The principal point estimates $c_x$ and $c_y$ show a similar rapid convergence trend; a brief spike in the initial stage is quickly suppressed, demonstrating that the algorithm is robust in estimating the camera's imaging center. The final steady-state estimates of all parameters are highly consistent with the true values: the estimation errors of $f_x$ and $f_y$ are 0.32 and 0.28 pixel, respectively; those of $c_x$ and $c_y$ are 0.45 and 0.37 pixel, respectively; and those of the distortion coefficients $k_1$ and $k_2$ are 0.0015 and 0.023, respectively, all meeting the requirements of high-precision calibration. This confirms that the proposed joint optimization framework accurately models nonlinearly varying intrinsic parameters.
Figure 5 depicts the convergence curve of the reprojection RMSE calculated jointly for all intrinsic parameters throughout the estimation process. This curve intuitively reflects the improvement process of the overall geometric consistency of the system. From the evolution trend of the curve, the RMSE exhibits a three-stage characteristic of “rapid decline—slow attenuation—stabilization”: the first 25 frames are the rapid decline stage, where the RMSE drops sharply from the initial 25.6 pixel to 3.2 pixel. During this stage, the system quickly corrects the initial deviation and establishes a preliminary camera model by fusing the high-frequency motion constraints of the IMU with visual observations. Frames 25–100 are the slow attenuation stage, where the RMSE gradually decreases from 3.2 pixel to 0.1 pixel. At this point, the system has basically captured the true values of the intrinsic parameters, and the errors mainly originate from sensor noise and linearization approximation errors. The noise interference is continuously suppressed through iterative updates. After 100 frames, it enters the stabilization stage, where the RMSE stabilizes around 0.08 pixel with a fluctuation range of less than ±0.01 pixel, indicating that the system has reached the optimal estimation state. The estimated parameters can accurately describe the optical characteristics of the camera, and the reprojection error is close to the sensor noise level, verifying the high-precision characteristics of the algorithm.
Overall, the simulation experimental results fully validate the effectiveness of the proposed IMU-visual fusion-based online calibration method for zoom lenses: the method can quickly and accurately estimate the intrinsic parameters of zoom cameras, with fast convergence speed, high steady-state accuracy, and strong noise resistance, meeting the requirements of real-time calibration. By incorporating intrinsic parameters, extrinsic parameters, and 3D positions of scene feature points into a unified state vector, and using EKF to achieve tightly coupled fusion of multi-source information, the limitations of a single sensor are effectively overcome, providing reliable algorithmic support for high-precision online calibration in dynamic zoom scenarios.
3.2. UAV Flight Experiments
To verify the performance of the algorithm on a real UAV electro-optical pod, flight experiments were conducted using a fixed-wing UAV equipped with a zoom camera, as shown in Figure 6a. The pod consists of key components including an uncooled long-wave infrared thermal imager, a high-definition visible light camera, a laser range finder, a laser illuminator, a laser warning system, and a servo-stabilized platform, and is integrated with a high-precision three-axis pan-tilt and an IMU. The IMU underwent factory pre-calibration and on-site calibration to eliminate systematic errors. The visible light camera images in the 0.4–0.9 μm band, with 30× continuous zoom from 4.3 mm to 129 mm, a field of view covering 2.3°–63.7°, and a video output of 1920 × 1080 at 30 FPS. The experimental scene was an open outdoor area, where 7 high-contrast artificial markers were placed on the ground as auxiliary features, as shown in Figure 6b. The center positions of the markers were accurately measured using RTK-GNSS, and their 3D positions in the world coordinate system served as ground truth for subsequent accuracy evaluation.
The experimental procedure is as follows: (1) Place the UAV at the take-off point, complete the debugging and connection of the flight control system, electro-optical pod, and data transmission system to ensure normal operation of each module. Set the flight route through the ground station, planning the UAV to climb spirally to a height of 100 m and then fly at constant speed along a circular route of radius 100 m centered on marker T1; (2) After the UAV enters a stable flight state, the ground station controls the camera to zoom continuously from the wide-angle end (4.3 mm) to the telephoto end (129 mm); (3) During the entire zoom process, the pan-tilt servo system keeps marker T1 locked in the central area of the image and synchronously records high-definition video streams, IMU data, camera focal length control signals, and UAV attitude data; (4) Based on the proposed IMU-visual fusion-based online calibration algorithm for electro-optical pods, real-time processing is performed on the recorded flight data to solve the camera intrinsic parameters at different focal lengths. Meanwhile, the traditional offline Zhang Zhengyou calibration method and the structure-from-motion (SfM)-based self-calibration method are applied to the same data as comparative benchmarks.
The convergence performance of the EKF is highly dependent on the accuracy of the initial state. Therefore, before the flight experiments, the Zhang Zhengyou calibration method was used to perform offline calibration at 7 commonly used focal lengths (5 mm, 20 mm, 40 mm, 55 mm, 70 mm, 93 mm, 120 mm) of the pod under static laboratory conditions. For each focal length, 8 chessboard images from different perspectives were collected, and the intrinsic parameters were solved using the Matlab 2018 calibration toolbox, as shown in Figure 7. Based on these discrete calibration results, an empirical mapping of intrinsic parameters as a function of focal length was constructed through cubic spline interpolation, and this mapping was used to initialize the state vector of the online EKF, ensuring the algorithm starts from a good initial value.
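The interpolation step can be sketched with a hand-rolled natural cubic spline through the seven calibrated focal lengths. The focal-length sample points follow the text; the per-focal-length `fx_px` values are illustrative placeholders, not the measured calibration results:

```python
import numpy as np

def natural_cubic_spline(xs, ys):
    """Return an evaluator for the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = np.diff(xs)
    # Solve for the second derivatives M at the knots (natural: M[0] = M[n] = 0).
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[0, 0] = A[n, n] = 1.0
    for i in range(1, n):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        b[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    M = np.linalg.solve(A, b)

    def evaluate(x):
        i = int(np.clip(np.searchsorted(xs, x) - 1, 0, n - 1))
        dx = x - xs[i]
        c1 = (ys[i + 1] - ys[i]) / h[i] - h[i] * (2.0 * M[i] + M[i + 1]) / 6.0
        c2 = M[i] / 2.0
        c3 = (M[i + 1] - M[i]) / (6.0 * h[i])
        return ys[i] + c1 * dx + c2 * dx * dx + c3 * dx ** 3
    return evaluate

# Calibrated focal-length samples (mm) from the offline stage; fx values
# in pixels are illustrative assumptions.
focal_mm = np.array([5.0, 20.0, 40.0, 55.0, 70.0, 93.0, 120.0])
fx_px = np.array([1450.0, 5800.0, 11600.0, 15950.0, 20300.0, 26970.0, 34800.0])
fx_of_f = natural_cubic_spline(focal_mm, fx_px)

# Query the empirical mapping at an arbitrary zoom command to seed the EKF.
fx_init = float(fx_of_f(65.0))
```

One such evaluator would be built per intrinsic parameter (fx, fy, cx, cy, k1, k2), and all are queried with the commanded focal length to initialize the state vector.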
To evaluate the improvement in UAV target positioning accuracy brought by the proposed online self-calibration method, in addition to the reprojection RMSE, this paper also introduces the target positioning error. This error is obtained by substituting the calibrated intrinsic parameters into the UAV target positioning model [29] and calculating the Euclidean distance between the estimated 3D coordinates of the marker and the true coordinates measured by RTK, thereby objectively reflecting the performance of the algorithm in actual positioning tasks. It is calculated as:

$$E_{\mathrm{loc}} = \left\| \hat{\mathbf{P}} - \mathbf{P}_{\mathrm{true}} \right\|_2$$

where $\mathbf{P}_{\mathrm{true}}$ is the true world coordinate of marker T1, and $\hat{\mathbf{P}}$ is the world coordinate of the marker obtained by applying the estimated intrinsic parameters to the UAV target positioning model.
To comprehensively validate the robustness and generalizability of the proposed algorithm across diverse spatial positions, statistical analyses were performed on the experimental data of all seven artificial markers (T1–T7) deployed in the test scene. All experimental results presented herein correspond to mean values over 10 repeated flight tests, with the reprojection RMSE and geolocation error calculated for each individual marker; the detailed quantitative results are summarized in Table 1. As indicated in the table, the proposed IMU-vision fusion algorithm maintains a consistently low reprojection RMSE of 0.27 to 0.35 pixels and a geolocation error of 0.34 to 0.42 m for all markers throughout the full focal length range. This outcome demonstrates the strong spatial adaptability of the proposed algorithm in estimating camera intrinsic parameters for targets at different spatial positions. In contrast, the SfM-based self-calibration method yields significantly higher errors with large fluctuations for all markers, attributable to the absence of the high-frequency motion constraints provided by the IMU, which leads to unstable parameter estimation during continuous zooming. Although the offline Zhang's calibration method achieves the lowest average error among the three methods, it is inherently inapplicable to the dynamic zoom scenarios of UAV airborne electro-optical pods in practical field operations, owing to its reliance on a static calibration environment and pre-calibrated targets.
Notably, the pan-tilt servo system was employed to keep Marker T1 centered in the image frame throughout the entire flight experiment. This setup ensured the integrity of visual features of T1 and the optimal temporal synchronization between IMU motion data and visual image data during the continuous zoom process from the wide-angle end (4.3 mm) to the telephoto end (129 mm). For a more in-depth evaluation of the calibration performance of the proposed method across the full focal length spectrum, the calibration results of Marker T1 at typical focal length points (wide-angle, middle, and telephoto) are further analyzed and discussed in the following section.
Figure 8 shows the variation curves of reprojection RMSE of different algorithms over the entire focal length range. It can be observed that the RMSE of the proposed algorithm remains at a low level throughout the zoom process, ranging from 0.2 to 0.35 pixels with small fluctuations, indicating that the algorithm has good stability in continuous zoom scenarios. The RMSE of the SfM-based self-calibration method is significantly higher, with obvious fluctuations in the intervals where the focal length changes drastically. This is because the nonlinear variation in intrinsic parameters during the zoom process leads to insufficient pure visual constraints to stably estimate the parameters. The offline Zhang’s calibration method has the lowest RMSE, but this method requires a preset calibration object and cannot adapt to online dynamic scenarios, lacking practical application flexibility. In addition, the RMSE of the proposed algorithm gradually approaches that of the offline calibration method after the focal length stabilizes, further verifying the high-precision characteristics of the algorithm.
By applying the intrinsic parameters obtained from the different calibration algorithms to the UAV target positioning model, Figure 9 shows the distribution of target positioning results for marker T1 under different focal lengths. The red cross marks the true position of the target, and the scattered points of each color represent the 10 repeated positioning results of the corresponding method. The results indicate that: (1) the proposed IMU-visual fusion algorithm exhibits the best positioning concentration at all focal lengths, with the positioning points closely surrounding the true coordinates and latitude/longitude deviations controlled within ±1.5 × 10⁻⁵°; (2) although the offline Zhang Zhengyou calibration achieves slightly better positioning accuracy in the medium focal length range (60 mm), its positioning dispersion at the wide-angle (4.3 mm) and telephoto (129 mm) ends is higher than that of the proposed algorithm, and it cannot adapt to dynamic zoom scenarios; (3) the positioning dispersion of the SfM-based method increases significantly, with deviations expanding to ±5.8 × 10⁻⁵°, caused by the propagation of calibration residuals due to its dependence on visual features alone. In summary, through the tightly coupled fusion of IMU and vision, the proposed algorithm effectively suppresses parameter drift during zooming, achieving high-precision, high-consistency target positioning over the entire focal length range.
The statistical results of the positioning error for marker T1 are shown in Figure 10. The original data refer to the positioning errors obtained by feeding the uncalibrated camera parameters into the target positioning model. The average positioning error of the proposed algorithm is 0.35 m, with a maximum not exceeding 0.52 m; the pure visual self-calibration method yields an average error of 1.87 m and a maximum of 2.63 m; the offline Zhang Zhengyou calibration method achieves an average error of 0.28 m, slightly better than the proposed algorithm. However, since the offline method cannot adapt to dynamic zoom scenarios, its practical application value is limited. These results indicate that the proposed online calibration algorithm provides high-precision intrinsic parameter support for UAV target positioning, effectively reducing positioning deviations caused by inaccurate intrinsic parameters and meeting the accuracy requirements of actual tasks.
4. Conclusions
Aiming at the problem of decreased target positioning accuracy caused by the nonlinear variation in intrinsic parameters of zoom cameras in UAV electro-optical pods with focal length, this paper proposes an IMU-visual fusion-based online calibration method for electro-optical pods. By constructing a high-dimensional state vector including camera intrinsic and extrinsic parameters, as well as the 3D positions of scene feature points, a state transition model is established using the kinematic constraints provided by the IMU, and a nonlinear state space is constructed by combining with the visual observation model. The EKF is adopted to realize the joint recursive estimation of multiple parameters. Both Monte Carlo simulation experiments and real-world UAV flight experiments have verified the effectiveness and robustness of the algorithm. This method fundamentally solves the limitations of traditional calibration technologies in zoom scenarios, providing core technical support for high-precision imaging in dynamic UAV tasks.
The proposed method offers four core advantages that outweigh the minor accuracy edge of offline calibration under static conditions and carry significant engineering value for actual UAV operation: (1) Real-time dynamic adaptability: camera parameters are estimated and updated in real time during continuous zoom, adapting to the dynamic motion state of the UAV; (2) Convenient field deployment: the method eliminates dependence on precise calibration objects and controlled environments, requiring only simple planar targets for initialization, making it suitable for rapid field deployment; (3) Strong environmental robustness: the tightly coupled fusion of IMU and vision effectively suppresses interference from outdoor low texture, illumination changes, and other factors, maintaining high calibration accuracy; (4) Full-focal-length adaptability: the nonlinear variation of intrinsic parameters during zooming is accurately modeled, and stable calibration and positioning performance is maintained over the entire focal length range from wide-angle to telephoto.
Nevertheless, the proposed method still has limitations in specific application scenarios, and corresponding improvements can be pursued in future work: (1) Featureless terrain (water, snow): insufficient feature point extraction weakens the visual observation constraints and degrades EKF estimation accuracy; a future improvement is to fuse laser ranging data to supplement the spatial constraints. (2) Aggressive UAV maneuvers: motion exceeding the uniform-speed/slow-acceleration assumption of the motion model increases the modeling error of the IMU motion constraints and degrades fusion accuracy; a future improvement is to introduce an adaptive motion model that adjusts the EKF process noise covariance in real time according to the UAV's maneuver state. (3) Very low-light conditions: reduced image contrast sharply lowers the success rate of ORB feature detection and LK optical flow tracking, causing visual observations to fail; a future improvement is to incorporate infrared image features to realize multi-spectral feature fusion.