Article

Incremental Multi-Camera Extrinsic Calibration Method Based on PnP Integrating Weighted AprilTag Detections and Multi-View Triangulation

by Liliya A. Demidova * and Vladimir E. Zhuravlev *
Institute for Information Technologies, Federal State Budget Educational Institution of Higher Education “MIREA—Russian Technological University”, 78 Vernadsky Avenue, Moscow 119454, Russia
* Authors to whom correspondence should be addressed.
Algorithms 2026, 19(5), 371; https://doi.org/10.3390/a19050371
Submission received: 9 April 2026 / Revised: 1 May 2026 / Accepted: 6 May 2026 / Published: 8 May 2026
(This article belongs to the Special Issue Visual Attributes in Computer Vision Applications)

Abstract

Accurate extrinsic calibration of multi-camera systems is a central problem in three-dimensional computer vision, as errors in the relative positioning of sensors directly propagate into geometric distortions that critically degrade the quality of downstream applications. This paper proposes an incremental extrinsic camera parameter initialization method that improves upon the baseline iterative registration algorithm based on the Perspective-n-Point (PnP) problem. Unlike board-based calibration frameworks, the proposed approach operates on individually placed markers with no prior knowledge of their mutual positions, enabling recalibration without dedicated calibration sessions. The accuracy improvement is achieved through the introduction of heuristic weighting of fiducial marker detections using AprilTags, as well as the application of a multi-view triangulation algorithm for dynamic refinement of marker spatial coordinates at each stage of scene expansion. Theoretical analysis demonstrates that the incorporation of these mechanisms does not increase the overall asymptotic computational complexity of the complete calibration cycle (including the global optimization stage), despite the higher computational cost of the initialization stage itself. Empirical validation of the method is performed on both synthetic datasets with known ground-truth camera parameters and real-world capture data through the evaluation of geometric errors and their comparison with the baseline method. Experimental results, supplemented by an ablation study, indicate that the proposed algorithm achieves statistically significant improvements on synthetic data in more than 80% of cases, while on real data it is on average 85% more accurate in terms of reprojection error.

1. Introduction

Three-dimensional computer vision has become an integral component of a wide range of modern technologies, finding applications in areas such as autonomous driving, robotics, augmented and virtual reality, as well as industrial quality control [1,2,3]. The foundation for most three-dimensional computer vision systems consists of multi-camera setups [4], which enable richer spatial coverage of the surrounding environment compared to single cameras. Nearly all higher organisms, including humans, rely on binocular vision, as evolution has established this mechanism as the most effective means of perceiving three-dimensional environments.
The key factor determining the accuracy and reliability of any multi-camera system is the quality of its calibration. The calibration process involves estimating both the intrinsic parameters of each camera (intrinsics) and their relative spatial position and orientation (extrinsics) [4,5]. Intrinsics characterize the properties of the optical system and sensor, remaining invariant to camera placement in space. This enables their offline estimation during a preliminary setup stage and allows them to be treated as constant throughout a series of experiments. Although physical factors such as temperature fluctuations may introduce microscopic variations in the optical system, in practice these deviations are either negligibly small or successfully compensated by refinement methods applied during the computation pipeline of specific tasks. Extrinsic parameters, however, require strict recalibration upon any, even minor, change in the multi-camera rig configuration, as errors in the relative positioning of sensors directly propagate into geometric distortions and critically degrade the accuracy of the resulting 3D reconstruction or any other computer vision task.
This paper proposes an incremental initialization method for extrinsic camera calibration in an object-centric rig that utilizes AprilTag marker detection [6] in images, weighted PnP (Perspective-n-Point) solution techniques for pose estimation of objects with known geometry [5], and a multi-view triangulation algorithm for refining marker positions in space. Unlike board-based calibration frameworks such as Kalibr [7] or MC-Calib [8], which require a structured calibration object with known internal geometry, the proposed method places no constraints on the relative positions of the markers, recovering their 3D coordinates simultaneously with camera poses. This distinction has a direct practical consequence: board-based frameworks require a dedicated calibration session each time the camera configuration changes, with the calibration object actively repositioned through the shared field of view of all cameras. The proposed approach eliminates this overhead—once placed in the scene, the markers can be reused for recalibration at any time without additional capture sessions. The proposed method is compared against a baseline method that implements classical iterative calibration and relies solely on the PnP problem for estimating camera and marker poses. This comparison is conducted on both synthetic datasets and real-world capture data. To isolate the specific contribution of weighting and triangulation that distinguish the proposed method from the baseline, an ablation study is performed.
The principal contributions of this paper are as follows:
  • An incremental initialization method for multi-camera extrinsic calibration is proposed, operating on arbitrarily placed AprilTag markers with unknown mutual positions and requiring no structured calibration object.
  • A heuristic detection-quality weighting scheme for the PnP problem is introduced, with physical justification for each factor and empirical evidence of robustness to hyperparameter choice.
  • A multi-view triangulation step is integrated directly into the incremental registration loop, enabling dynamic correction of marker positions and mitigation of geometric drift at each camera addition step.
  • It is shown theoretically that the proposed modifications do not increase the asymptotic complexity of the complete calibration procedure, including the bundle adjustment stage.
The remainder of the paper is organized as follows. Section 2 reviews related work on target-based calibration, SfM-inspired incremental methods, and weighted pose estimation. Section 3 describes the baseline incremental calibration method. Section 4 covers the global optimization (bundle adjustment) stage that completes the calibration procedure. Section 5 introduces the proposed method, which extends the baseline with the two key modifications. Section 6 reports experimental results on synthetic and real-world data, and Section 7 presents an ablation study isolating the contribution of each component. Section 8 concludes the paper.

2. Related Work

Existing methods for extrinsic calibration of multi-camera systems fall into two main categories: methods that employ calibration targets with a priori known geometry [9], and targetless calibration methods that estimate camera parameters through analysis of natural features in the observed scene [10].

2.1. Target-Based Calibration Methods

One of the most widely adopted calibration approaches involves the use of planar regular patterns. In particular, Zhang’s method [11] reduces the calibration problem to establishing correspondences between known 3D coordinates of control points on the pattern (e.g., intersections of chessboard squares) and their 2D projections in images. Based on the established correspondences, extrinsic camera parameters are computed by solving the PnP problem. However, the application of regular patterns in multi-camera systems (especially in object-centric configurations) is hindered by the requirement to ensure simultaneous visibility of the calibration object across multiple cameras.
To address this problem, fiducial markers such as AprilTags [6,12] are actively employed. Each such marker constitutes a distinctive pattern on a square grid, in which a unique identifier is encoded using a binary code. The identifiers themselves are selected to maximize pairwise Hamming distances between codes of different markers, thereby ensuring reliable recognition and minimizing false positives. The works in [8,13,14] demonstrate methods for calibrating multi-camera systems using patterns composed of AprilTag markers arranged in a planar grid. Optimizing camera parameters using markers with known relative positions reduces orientation estimation errors compared to the use of single targets. Popular frameworks such as MC-Calib [8] and Kalibr [7] also employ calibration objects with marker grids for reliable initialization of extrinsics prior to launching global optimization. A common characteristic of all these approaches is the requirement for a structured calibration object whose internal geometry—the relative positions of its control points—is precisely known in advance. Consequently, frameworks such as Kalibr and MC-Calib are not applicable to scenarios in which markers are placed individually at arbitrary, a priori unknown positions—as is the case in object-centric rigs where fiducial markers are permanently affixed to stands in the scene and reused across recalibration sessions without requiring a calibration target to be repositioned in front of the cameras.
Recent studies continue to advance target-based multi-camera calibration for diverse setups, including large outdoor scenes, generic marker-based toolboxes, omnidirectional imaging systems, large capture volumes with bundle adjustment, RGB-D camera networks, and robust meta-board-based calibration frameworks [8,14,15,16,17,18].

2.2. Targetless Calibration and Incremental Registration

Alternative calibration methods draw upon algorithms from the Structure from Motion (SfM) domain [1,19,20] and eliminate the need for specialized calibration targets, operating solely on automatically extracted image features. In frameworks such as COLMAP [19], calibration is accomplished through extraction and matching of scene keypoints with subsequent iterative addition of new cameras to the visibility graph. The computation of parameters for each new camera is performed by solving the PnP problem with respect to already triangulated 3D scene points [19,20,21].
The fundamental problem of incremental methods is geometric drift. The estimation error of extrinsic parameters for the first camera pair propagates to all subsequent iterations, accumulating with each step. In closed-loop camera configurations, this leads to significant geometric misalignments upon loop closure. To mitigate these errors, global optimization is applied, which in the context of camera calibration is conventionally referred to as Bundle Adjustment (BA). However, such optimization is highly sensitive to the quality of the initial estimate [22,23,24]. Substantial drift accumulated during the initialization stage may cause the optimizer to converge to an incorrect local minimum or require a prohibitively large number of iterations [25].

2.3. Weighted Pose Estimation and Multi-View Triangulation

Classical algorithms for solving the PnP problem assume homoscedasticity of observations, that is, equal noise variance for all employed 2D points. In practice, the localization accuracy of marker corners or keypoints is non-uniform and depends on lens distortion, distance to the object, and viewing angle. To account for these factors, weighted optimization methods are employed. In [26], the minimized error function is weighted based on computed covariance matrices of 2D points, which improves localization accuracy. Similarly, the EPro-PnP method [27] utilizes probability distributions for pose prediction, where weight coefficients are computed by a neural network. However, rigorous computation of covariance matrices for detection errors or the use of learned predictive models substantially increases the computational cost of pose estimation.
To reduce geometric drift during incremental scene reconstruction, multi-view triangulation algorithms are also investigated. As demonstrated in [28], the accuracy of 3D point reconstruction increases with the addition of observations from new viewpoints. Integration of coordinate recalculation directly into the camera addition cycle enables dynamic correction of spatial structure. However, in many practical incremental pipelines, marker or point positions are initialized locally and are only refined later during bundle adjustment, if at all [19,20,21].
The method proposed in this paper belongs to the class of incremental PnP-based registration methods described in Section 2.2, adapted specifically to the object-centric rig calibration setting in which fiducial markers are permanently placed in the scene. Unlike the board-based frameworks discussed in Section 2.1—including Kalibr [7], MC-Calib [8], Meta-Calib [14], and CamOdoCal [29]—it imposes no prior constraints on marker placement and requires no structured calibration object. Unlike general-purpose SfM pipelines, its scope is confined to the initialization of extrinsic parameters of a known multi-camera rig, providing a high-quality starting point for subsequent bundle adjustment rather than attempting full scene reconstruction. Within this class, the standard PnP-based incremental registration algorithm—which underlies the initialization stages of both general SfM systems and board-based calibration frameworks—is the natural baseline for evaluating the proposed improvements.

3. Baseline Calibration Method

Since the problem of joint estimation of camera parameters and scene structure (Bundle Adjustment, BA) reduces to the minimization of a non-convex cost function in a high-dimensional space [23,24,25], applying classical optimization methods without a high-quality initial estimate leads to convergence to a local minimum with high probability [22,23,30]. In this regard, the standard approach to parameter initialization is the incremental registration strategy [20].
Incremental registration consists of sequentially solving a series of local geometric problems, where parameters of each new camera are estimated relative to the already reconstructed part of the scene. The key tools for such localization are algorithms for solving the PnP problem. Formally, the PnP problem involves finding the position and orientation of a camera $(R, t)$, where $R \in \mathbb{R}^{3 \times 3}$ is the rotation matrix and $t \in \mathbb{R}^{3}$ is the translation vector, relative to a given coordinate system by minimizing the reprojection error between $A$ known 3D coordinates of scene points and their corresponding 2D projections on the image plane. In general form, the problem is formulated as a nonlinear optimization:
$$R^{*}, t^{*} = \arg\min_{R, t} \sum_{i=1}^{A} \left\| u_i - \pi(K, R, t, x_i) \right\|^{2}, \tag{1}$$
where $u_i \in \mathbb{R}^{2}$ are the observed 2D coordinates of the $i$-th point on the image plane; $x_i \in \mathbb{R}^{3}$ are the 3D coordinates of the $i$-th point in the world coordinate system; $K \in \mathbb{R}^{3 \times 3}$ is the camera intrinsic matrix; $\pi : \mathbb{R}^{3} \to \mathbb{R}^{2}$ is the function projecting a 3D point from the world coordinate system to the camera image plane; $(R, t)$ are the sought rotation matrix $R \in \mathbb{R}^{3 \times 3}$ and translation vector $t \in \mathbb{R}^{3}$ required to transform a point from the world coordinate system to the camera coordinate system; $\|\cdot\|$ denotes the Euclidean norm; $A$ is the total number of points; the expression under the summation represents the squared reprojection error for the $i$-th point.
Finding the transformation $(R, t)$ for a camera that minimizes the reprojection error is algebraically equivalent to finding the rotation and translation of a marker relative to a fixed camera, since any rigid transformation is invertible. Thus, solving the PnP problem enables both estimating camera poses relative to known 3D points and estimating point positions relative to known cameras.
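The objective of the PnP formulation (1) can be illustrated with a minimal NumPy sketch. It assumes an ideal pinhole camera without lens distortion; the function names `project` and `reprojection_cost`, the intrinsic matrix, and the marker geometry are hypothetical choices made for illustration, not part of the method described here.

```python
import numpy as np

def project(K, R, t, x):
    """Pinhole projection of world points x (A, 3) to pixel coordinates (A, 2)."""
    x_cam = x @ R.T + t               # world -> camera coordinates
    uvw = x_cam @ K.T                 # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]   # perspective division

def reprojection_cost(K, R, t, x, u):
    """Sum of squared reprojection errors for a candidate pose (R, t)."""
    residuals = u - project(K, R, t, x)
    return float(np.sum(residuals ** 2))

# Hypothetical data: four corners of a 10 cm marker, 1 m in front of the camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R_true = np.eye(3)
t_true = np.array([0.0, 0.0, 1.0])
x = np.array([[-0.05, -0.05, 0.0], [0.05, -0.05, 0.0],
              [0.05, 0.05, 0.0], [-0.05, 0.05, 0.0]])
u = project(K, R_true, t_true, x)     # noise-free detections

cost_true = reprojection_cost(K, R_true, t_true, x, u)
# Shifting the pose by 1 cm along x moves every corner by 8 px at this depth.
cost_shifted = reprojection_cost(K, R_true, t_true + np.array([0.01, 0.0, 0.0]), x, u)
```

A PnP solver searches for the pose that drives this cost to its minimum; here the cost vanishes at the true pose and grows as the pose is perturbed.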
In this work, the baseline calibration method under consideration is an iterative algorithm implementing the principle of sequential scene expansion. The calibration process begins with fixing the coordinate system of the first camera, after which new sensors and markers are added in a cyclic manner. The position and orientation of each subsequent camera are computed via PnP using already known 3D marker points, while the poses of newly detected markers are, in turn, estimated via PnP from cameras already placed in space. The baseline calibration method is presented in more detail in Algorithm 1.
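Algorithm 1: Baseline extrinsic calibration method
Input: $N$: number of cameras;
$M$: number of markers;
$\{V_i \mid i \in 1, \dots, N\}$: sets of marker indices visible from the $i$-th camera;
$\{u_{ij} \mid i \in 1, \dots, N,\ j \in V_i\}$: detections of the $j$-th marker on the $i$-th camera image plane.
Output: $\{R_{c_i} \mid i \in 1, \dots, N\}$: rotations of all cameras;
$\{t_{c_i} \mid i \in 1, \dots, N\}$: translations of all cameras;
$\{R_{m_j} \mid j \in 1, \dots, M\}$: rotations of all markers;
$\{t_{m_j} \mid j \in 1, \dots, M\}$: translations of all markers.
1. $\mathcal{U} \leftarrow \{1, \dots, N\}$ ▷ Set of uninitialized camera indices
2. $\mathcal{M} \leftarrow \emptyset$ ▷ Set of positioned marker indices
3. while $\mathcal{U} \neq \emptyset$ do
4.     if $|\mathcal{U}| = N$ then
5.         $i \leftarrow \arg\max_{i \in \mathcal{U}} |V_i|$ ▷ Pick camera with strict max observations
6.         $R_{c_i} \leftarrow I$ ▷ Initialize camera rotation as identity matrix
7.         $t_{c_i} \leftarrow (0, 0, 0)^{T}$ ▷ Initialize camera translation by zeros
8.     else
9.         $i \leftarrow \arg\max_{i \in \mathcal{U}} |V_i \cap \mathcal{M}|$ ▷ Pick camera with max overlap with $\mathcal{M}$
10.         $\mathcal{R} \leftarrow \{R_{m_j} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker rotations
11.         $\mathcal{T} \leftarrow \{t_{m_j} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker translations
12.         $\mathcal{D} \leftarrow \{u_{ij} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker detections
13.         $R_{c_i}, t_{c_i} \leftarrow \mathrm{SolvePnP}(\mathcal{R}, \mathcal{T}, \mathcal{D})$ ▷ Estimate camera pose by solving PnP
14.     end if
15.     $\mathcal{U} \leftarrow \mathcal{U} \setminus \{i\}$ ▷ Mark camera as initialized
16.     for each marker $j \in V_i \setminus \mathcal{M}$ do
17.         $R_{m_j}, t_{m_j} \leftarrow \mathrm{SolvePnP}(R_{c_i}, t_{c_i}, u_{ij})$ ▷ Estimate marker pose by solving PnP
18.         $\mathcal{M} \leftarrow \mathcal{M} \cup \{j\}$ ▷ Mark marker as initialized
19.     end for
20. end while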
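The control flow of Algorithm 1 can be sketched in plain Python with the pose estimation steps left out, so that only the order in which cameras and markers are registered is computed. The function name `registration_order`, the tie-breaking rule (lowest camera index wins), and the check for a disconnected visibility graph are assumptions added for illustration.

```python
def registration_order(visible):
    """Return the order in which cameras and markers would be registered.

    visible: dict mapping camera index -> set of marker indices it detects
    (the sets V_i). The SolvePnP calls are omitted on purpose.
    """
    uninitialized = set(visible)   # cameras not yet placed
    positioned = set()             # markers with an estimated pose
    camera_order, marker_order = [], []
    while uninitialized:
        candidates = sorted(uninitialized)   # assumed tie-break: lowest index first
        if len(uninitialized) == len(visible):
            # First camera: most detections; it defines the world frame.
            i = max(candidates, key=lambda c: len(visible[c]))
        else:
            # Next camera: largest overlap with already positioned markers.
            i = max(candidates, key=lambda c: len(visible[c] & positioned))
            if not visible[i] & positioned:
                raise ValueError("visibility graph is disconnected")
            # Camera pose would be estimated here from the overlapping markers.
        uninitialized.remove(i)
        camera_order.append(i)
        for j in sorted(visible[i] - positioned):
            # Marker pose would be estimated here from camera i alone.
            positioned.add(j)
            marker_order.append(j)
    return camera_order, marker_order

# Hypothetical four-camera ring in which neighbouring cameras share markers.
visible = {0: {1, 2}, 1: {2, 3, 4}, 2: {4, 5}, 3: {1, 5}}
cams, markers = registration_order(visible)
```

In this example camera 1 is chosen first because it sees the most markers, and camera 3 is registered last, when both of its markers have already been positioned through other cameras.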
However, the baseline method suffers from a substantial limitation: error accumulation. Since the estimation of position and orientation for each subsequent camera depends on the accuracy of marker position estimates at previous stages, errors are not compensated but rather accumulate along the chain of cameras. In closed-loop configurations, particularly in the object-centric systems considered in this work, this often leads to a significant discrepancy between the computed position of the last calibrated camera and its true position relative to the first.
Accumulated geometric inconsistencies can potentially be eliminated during global optimization, but BA is extremely sensitive to the quality of the initial estimate. An excessively coarse estimate of extrinsics from the baseline method may cause the optimization to converge to an incorrect local minimum. Furthermore, the computational complexity of BA grows nonlinearly with increasing numbers of cameras and markers, which becomes particularly apparent when the initial calibration is insufficiently accurate, forcing the optimization algorithm to perform a larger number of iterations to reach the optimum.

4. Global Optimization

Typically, regardless of the accuracy of the initial parameter estimate for the scene, the calibration process for multi-camera systems cannot avoid the stage of global optimization (BA). This is because sequential initialization algorithms, including the baseline method described earlier, minimize errors locally and do not account for global constraints in the visibility graph. Joint optimization of all scene parameters enables finding a solution corresponding to the maximum likelihood criterion, ensuring a statistically optimal result in the presence of Gaussian measurement noise [23].
Since both cameras and markers possess rotation and translation, the values of which are adjusted during global optimization, separate notation is required. Let $R_{c_i} \in \mathbb{R}^{3 \times 3}$, $t_{c_i} \in \mathbb{R}^{3}$ denote the rotation and translation for the $i$-th camera, and $R_{m_j} \in \mathbb{R}^{3 \times 3}$, $t_{m_j} \in \mathbb{R}^{3}$ denote the rotation and translation for the $j$-th marker. The optimized parameters for all cameras and all markers are packed into a single parameter vector $\theta \in \mathbb{R}^{S}$ for launching BA, where $S = P(N - 1 + M)$ is the total number of parameters, which depends on the number of cameras $N$, the number of markers $M$, and the number of parameters $P$ for encoding rotation and translation. One camera is excluded from the vector $\theta$ so that its position and orientation define the coordinate system and eliminate solution ambiguity. For representing rotations, the axis-angle vector parameterization is employed. This representation is the most compact parameterization of rotation, avoiding parameter redundancy inherent in rotation matrices or quaternions. Thus, 3 parameters are allocated for rotation and 3 parameters for translation for each camera and marker, yielding $P = 6$. The vector $\theta$ is formed by concatenating the parameters of all cameras and markers: $\theta = [\hat{R}_{c_1}^{T}, t_{c_1}^{T}, \dots, \hat{R}_{c_{N-1}}^{T}, t_{c_{N-1}}^{T}, \hat{R}_{m_1}^{T}, t_{m_1}^{T}, \dots, \hat{R}_{m_M}^{T}, t_{m_M}^{T}]^{T}$, where the hat symbol $\hat{R} \in \mathbb{R}^{3}$ denotes the axis-angle representation of rotation.
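The parameterization can be made concrete with a short NumPy sketch: an axis-angle vector is converted to a rotation matrix with the Rodrigues formula, and the per-camera and per-marker blocks are concatenated into one vector. The helper names `rodrigues` and `pack` and the example sizes are hypothetical, and the reference camera is assumed to have been removed by the caller.

```python
import numpy as np

def rodrigues(r):
    """Convert an axis-angle vector r (3,) to a rotation matrix (3, 3)."""
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.eye(3)                  # zero rotation
    k = r / angle                         # unit rotation axis
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])   # cross-product matrix of the axis
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

def pack(cam_rvecs, cam_tvecs, mrk_rvecs, mrk_tvecs):
    """Concatenate [rotation, translation] blocks: cameras first, then markers."""
    blocks = []
    for r, t in zip(cam_rvecs, cam_tvecs):
        blocks += [r, t]
    for r, t in zip(mrk_rvecs, mrk_tvecs):
        blocks += [r, t]
    return np.concatenate(blocks)

# Hypothetical rig: N = 4 cameras (one fixed as the reference) and M = 10 markers.
N, M, P = 4, 10, 6
theta = pack(np.zeros((N - 1, 3)), np.zeros((N - 1, 3)),
             np.zeros((M, 3)), np.zeros((M, 3)))
R_z90 = rodrigues(np.array([0.0, 0.0, np.pi / 2]))   # 90 degrees about the z axis
```

For these sizes the vector has $P(N - 1 + M) = 6 \cdot 13 = 78$ entries.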
The global optimization problem is formulated as follows:
$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \sum_{j \in V_i} \left\| U_{ij} - \pi(K_i, R_{c_i}, t_{c_i}, X_j) \right\|^{2}, \tag{2}$$
where $\theta \in \mathbb{R}^{S}$ is the vector of optimized parameters encoding rotations and translations for all cameras ($R_{c_i} \in \mathbb{R}^{3 \times 3}$ and $t_{c_i} \in \mathbb{R}^{3}$ for each $i$-th camera, respectively) and markers (coordinates $X_j \in \mathbb{R}^{4 \times 3}$ for the four corners of the $j$-th marker, obtained by applying rotation and translation to the template 3D representation of the marker $X_{template} \in \mathbb{R}^{3 \times 4}$); $U_{ij} \in \mathbb{R}^{4 \times 2}$ are the 2D coordinates of the detection of the four corners of the $j$-th marker on the image plane of the $i$-th camera; $\pi : \mathbb{R}^{4 \times 3} \to \mathbb{R}^{4 \times 2}$ is the function projecting the four corners of a marker from the 3D world coordinate system to the camera image plane; $V_i$ is the set of indices of markers for which detections exist in the $i$-th camera; $\|\cdot\|$ denotes the Euclidean norm computed along the second dimension, that is, implementing the mapping $\mathbb{R}^{4 \times 2} \to \mathbb{R}^{4}$.
The expression under the summation in Equation (2) decomposes into 8 squared scalar quantities, since each detection of the $j$-th marker from the $i$-th camera involves 4 points (corners), indexed by variable $k \in \{1, \dots, 4\}$, each having 2 coordinates indexed by variable $c \in \{x, y\}$:
$$\left\| U_{ij} - \pi(K_i, R_{c_i}, t_{c_i}, X_j) \right\|^{2} = \sum_{k=1}^{4} \sum_{c \in \{x, y\}} r_{ijkc}^{2}. \tag{3}$$
Thus, the BA problem reduces to a nonlinear least squares problem, in which the sum of squared residuals $r_{ijkc}^{2}$ must be minimized, where $i \in \{1, \dots, N\}$ is the camera index, $j \in \{1, \dots, M\}$ is the marker index, $k \in \{1, \dots, 4\}$ is the marker corner index, and $c \in \{x, y\}$ is one of the two dimensions on the camera image plane. The complete residual vector $r$ has length $D = \sum_{i=1}^{N} 8 |V_i|$, encompassing all marker detections. Optionally, a robust loss function may be applied to the squared residuals to mitigate the influence of outliers, for example, the Huber loss [31].
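As a small worked example of these dimensions, the following sketch computes the number of optimized parameters and the residual vector length from a visibility table. The function name `ba_problem_size` and the example visibility sets are hypothetical.

```python
def ba_problem_size(visible, params_per_pose=6):
    """Return (S, D): number of optimized parameters and number of residuals.

    visible: dict mapping camera index -> set of detected marker indices.
    """
    n_cameras = len(visible)
    n_markers = len(set().union(*visible.values()))
    # One camera is fixed as the reference frame and carries no parameters.
    S = params_per_pose * (n_cameras - 1 + n_markers)
    # Every detection contributes 4 corners with 2 coordinates each.
    D = sum(8 * len(markers) for markers in visible.values())
    return S, D

# Hypothetical rig: 4 cameras observing 5 markers through 9 detections.
visible = {0: {1, 2}, 1: {2, 3, 4}, 2: {4, 5}, 3: {1, 5}}
S, D = ba_problem_size(visible)
```

Here $S = 6 \cdot (3 + 5) = 48$ and $D = 8 \cdot 9 = 72$, so the Jacobian of this toy problem has 72 rows and 48 columns.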
Second-order iterative methods are traditionally employed to solve the BA problem, such as the Levenberg–Marquardt (LM) algorithm [32,33] or the more commonly used Trust Region Reflective (TRF) algorithm [34]. This choice is motivated by the specific structure of the objective function landscape in BA problems, which often exhibits an ill-conditioned Hessian matrix. Under such conditions, second-order methods that exploit curvature information can achieve quadratic convergence rates when provided with a sufficiently accurate initial estimate.
The computational complexity of a single iteration of such algorithms is determined by the cost of solving the normal equations of the form [23]:
$$H \, \Delta\theta = -J^{T} r, \tag{4}$$
where $H \in \mathbb{R}^{S \times S}$ is the Hessian matrix of the sum of squared residuals objective function or its approximation (depending on the specific second-order optimization method); $J \in \mathbb{R}^{D \times S}$ is the Jacobian matrix of the residual vector $r$ with respect to $\theta$; $\Delta\theta$ is the sought parameter update for the current iteration.
The complexity of solving such a system comprises the cost of constructing and inverting the Hessian matrix $H$. Inverting a matrix of size $S \times S$ in the general case amounts to $O(S^{3}) = O((P(M + N))^{3})$, or simply $O((M + N)^{3})$, since $P$ is a constant. However, it can be observed that the rotation and translation parameters for each marker depend only on those cameras that observe it and are independent of one another. The same holds for cameras: no residual computation involves parameters from multiple cameras simultaneously. Therefore, the matrix $H$ exhibits a specific sparsity pattern: its camera-camera and marker-marker sub-blocks are block-diagonal. This property enables the application of the Schur complement method to the inversion of $H$, which eliminates the marker parameters and reduces the problem to solving a system for cameras only [35], lowering the complexity to $O(M N^{2})$ for constructing the complement and $O(N^{3})$ for the final computation of $H^{-1}$. Leveraging the same sparsity property, it can be established that constructing $H$ costs $O(MN)$ in the worst case, when each camera observes every marker, i.e., $D = 8MN$. Typically, the number of markers in the scene far exceeds the number of cameras, i.e., $M \gg N$, so $M N^{2}$ dominates over $N^{3}$ and $MN$, and the per-iteration complexity ultimately reduces to $O(M N^{2})$. The total runtime of BA is linear in the number of iterations $K$, which can reach hundreds for a poor initial estimate.
Thus, the asymptotic complexity of BA is linear in the number of markers and quadratic in the number of cameras, while the quality of initialization directly affects the number of iterations required to reach the optimum.
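The Schur complement elimination described in this section can be sketched in NumPy as follows. This is an illustrative dense implementation under stated assumptions, not the solver used in the experiments: the function name `schur_solve`, the damping term, the random test problem, and the sign convention of the right-hand side (the negative gradient $-J^{T} r$) are all choices made for the example.

```python
import numpy as np

def schur_solve(A, B, C_blocks, b_cam, b_mrk):
    """Solve [[A, B], [B.T, C]] [dc, dm] = [b_cam, b_mrk] with block-diagonal C.

    A: camera-camera block, B: camera-marker block,
    C_blocks: list of per-marker diagonal blocks of the marker-marker block.
    """
    p = C_blocks[0].shape[0]
    # Invert C block by block: many small inversions instead of one large one.
    C_inv = np.zeros((p * len(C_blocks),) * 2)
    for j, Cj in enumerate(C_blocks):
        C_inv[j * p:(j + 1) * p, j * p:(j + 1) * p] = np.linalg.inv(Cj)
    BCinv = B @ C_inv
    S = A - BCinv @ B.T                          # reduced system for cameras only
    dc = np.linalg.solve(S, b_cam - BCinv @ b_mrk)
    dm = C_inv @ (b_mrk - B.T @ dc)              # back-substitution for markers
    return dc, dm

# Hypothetical problem: 2 free cameras (12 parameters) and 3 markers (6 each).
rng = np.random.default_rng(0)
n_c, p, n_m = 12, 6, 3
rows = []
for j in range(n_m):
    # Residuals of marker j depend on camera parameters and on marker j only.
    Jj = np.zeros((16, n_c + p * n_m))
    Jj[:, :n_c] = rng.normal(size=(16, n_c))
    Jj[:, n_c + j * p:n_c + (j + 1) * p] = rng.normal(size=(16, p))
    rows.append(Jj)
J = np.vstack(rows)
r = rng.normal(size=J.shape[0])
H = J.T @ J + 1e-3 * np.eye(J.shape[1])          # damped Gauss-Newton approximation
g = -J.T @ r                                     # assumed right-hand side
A, B = H[:n_c, :n_c], H[:n_c, n_c:]
C_blocks = [H[n_c + j * p:n_c + (j + 1) * p, n_c + j * p:n_c + (j + 1) * p]
            for j in range(n_m)]
dc, dm = schur_solve(A, B, C_blocks, g[:n_c], g[n_c:])
direct = np.linalg.solve(H, g)                   # reference: dense solve of the full system
```

The update obtained through the reduced camera system matches the dense solution, while the only matrices that are inverted or factorized are the small per-marker blocks and the camera-sized reduced system.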

5. Proposed Calibration Method

The principal limitation of the baseline sequential initialization method described above is the problem of error accumulation, which leads to substantial geometric discrepancies, particularly in closed-loop configurations of object-centric systems. To address this limitation, we propose a modified method that incorporates mechanisms for improved robustness and drift correction directly into the incremental scene reconstruction process.
The proposed method extends the baseline incremental calibration method with two key modifications aimed at improving accuracy and geometric consistency. First, a weighting scheme for observations is introduced in the PnP problem, where the contribution of each detection is determined by a heuristic reliability estimate. Second, the process is augmented with a multi-view triangulation stage, which is employed at each step of scene expansion to recompute the coordinates of all markers, thereby maintaining dynamic geometric consistency and preventing error accumulation.

5.1. Weighted PnP

Standard methods for solving the PnP problem, such as Efficient PnP (EPnP) [36] or iterative LM, minimize the sum of squared reprojection errors, assuming identical noise variance for all points. However, in real-world conditions, the quality of AprilTag marker detection varies significantly depending on observational conditions.
To account for this, a weighting function is introduced that assigns a confidence coefficient to each marker observation. Weights w are incorporated into the summation of the PnP problem formulation (1), computed for each marker as follows:
$$w = \frac{w_{len}}{\gamma_{len}} \left( 1 - \left( 1 - w_{sin} \right)^{\gamma_{sin}} \right) \cdot \left( 1 - \left( \frac{w_{dist}}{\gamma_{diag}} \right)^{\gamma_{dist}} \right), \tag{5}$$
where $w_{len}$ is the average side length of the detection quadrilateral (in pixels); $w_{sin}$ is the average of the sines of the angles of the detection quadrilateral; $w_{dist}$ is the Euclidean distance from the center of the detection quadrilateral to the principal point of the image; $\gamma_{diag}$ is the image diagonal length for normalization; the ratio $w_{len} / \gamma_{len}$ is clipped to $[0, 1]$, so any marker whose projected side length exceeds $\gamma_{len}$ pixels is considered fully reliable with respect to size and receives the maximum contribution from this term; and $\gamma_{len}$, $\gamma_{sin}$, $\gamma_{dist}$ are regularization hyperparameters determining the contribution of each factor.
The proposed heuristic weighting formula accounts for three geometric factors affecting detection accuracy:
  • Projection size $w_{len}$. Markers closer to the camera occupy a larger area on the sensor, improving the localization accuracy of their corners.
  • Viewing angle obliqueness $w_{sin}$. When a marker is observed at an oblique angle, its projection becomes distorted, increasing detection uncertainty.
  • Radial position $w_{dist}$. Detections near the image edges (far from the principal point) are more susceptible to residual lens distortion.
The hyperparameter values used throughout all experiments are $\gamma_{len} = 250$ px, $\gamma_{sin} = 0.5$, and $\gamma_{dist} = 2$. The value $\gamma_{len} = 250$ px represents the projected marker side length at which detection is considered fully reliable; it corresponds to the expected maximum projected size of a marker at the typical operating distances of the experimental setup. The exponent $\gamma_{dist} = 2$ is motivated by the Brown-Conrady distortion model: the dominant radial distortion term is cubic in the radial coordinate, so the rate at which corner localization error grows with distance from the principal point scales approximately quadratically, and the quadratic weight decay matches this order of nonlinearity. The exponent $\gamma_{sin} = 0.5$ defines a square-root (convex) function that assigns relatively more weight to small deviations from frontal viewing (where even modest obliqueness introduces measurable perspective error in corner localization) while compressing differences among highly oblique detections that are already heavily downweighted. To assess the sensitivity to this choice, a grid search over $\gamma_{sin} \in \{0.05, 0.10, \dots, 1.00\}$ was conducted on 10 synthetic datasets, described in Section 6.1.1. A Kruskal-Wallis test found no statistically significant differences in either translation or rotation errors (calculated using Formulas (10) and (11) respectively) across the tested range. Since calibration accuracy is thus robust to the precise value of $\gamma_{sin}$, the value 0.5 is adopted as a natural default: it is the simplest nonlinear choice in the range $[0, 1]$, introducing moderate convexity without collapsing to linear weighting ($\gamma_{sin} = 1.0$) or a near-threshold behavior ($\gamma_{sin} = 0.0$).
This weighting allows the PnP algorithm to minimize reprojection error non-uniformly across all visible markers, so that the optimization emphasis reflects detection reliability, thereby improving the stability of camera pose estimation.
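The weighting heuristic can be sketched in plain Python. The sketch assumes that the weight is the product of three terms: the clipped size ratio $w_{len} / \gamma_{len}$, the angle term $1 - (1 - w_{sin})^{\gamma_{sin}}$, and the radial term $1 - (w_{dist} / \gamma_{diag})^{\gamma_{dist}}$. The function name `detection_weight`, the use of the corner mean as the quadrilateral center, and the example image size are assumptions made for illustration.

```python
import math

def detection_weight(corners, principal_point, image_size,
                     gamma_len=250.0, gamma_sin=0.5, gamma_dist=2.0):
    """Heuristic reliability weight of one marker detection.

    corners: four (x, y) pixel coordinates listed in order around the quadrilateral.
    image_size: (width, height) in pixels, used for the diagonal normalization.
    """
    n = len(corners)
    sides, sines = [], []
    for k in range(n):
        (x0, y0), (x1, y1), (x2, y2) = corners[k - 1], corners[k], corners[(k + 1) % n]
        ax, ay = x0 - x1, y0 - y1                 # edge towards the previous corner
        bx, by = x2 - x1, y2 - y1                 # edge towards the next corner
        la, lb = math.hypot(ax, ay), math.hypot(bx, by)
        sides.append(lb)
        sines.append(abs(ax * by - ay * bx) / (la * lb))   # |sin| of the interior angle
    w_len = sum(sides) / n
    w_sin = sum(sines) / n
    cx = sum(x for x, _ in corners) / n           # assumed center: mean of the corners
    cy = sum(y for _, y in corners) / n
    w_dist = math.hypot(cx - principal_point[0], cy - principal_point[1])
    gamma_diag = math.hypot(*image_size)

    size_term = min(max(w_len / gamma_len, 0.0), 1.0)        # clipped to [0, 1]
    # max() guards against a tiny negative base caused by rounding.
    angle_term = 1.0 - max(0.0, 1.0 - w_sin) ** gamma_sin
    radial_term = 1.0 - (w_dist / gamma_diag) ** gamma_dist
    return size_term * angle_term * radial_term

# Hypothetical 640x480 image with the principal point at its center.
pp, size = (320.0, 240.0), (640, 480)
big = [(195, 115), (445, 115), (445, 365), (195, 365)]       # 250 px square, centered
small = [(257.5, 177.5), (382.5, 177.5), (382.5, 302.5), (257.5, 302.5)]  # 125 px square
shifted = [(x + 100, y) for x, y in big]                     # same square, 100 px off-center
w_big = detection_weight(big, pp, size)
w_small = detection_weight(small, pp, size)
w_shifted = detection_weight(shifted, pp, size)
```

A large, frontal, centered detection receives the full weight of 1, halving the projected side length halves the weight, and moving the detection away from the principal point lowers it slightly.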

5.2. Multiview Triangulation

A critical limitation of the baseline method is its single-shot estimation of marker positions: marker coordinates are fixed at the moment of first detection and never subsequently refined. This leads to error accumulation, as each marker inherits the error of the camera from which its position was estimated, compounds it with its own positioning inaccuracy, and propagates the combined error to the next camera added to the scene.
The proposed method eliminates this drawback by refining marker positions after each new camera is added. Specifically, multi-view triangulation is performed for each $j$-th marker from all already calibrated cameras that observe it. At the moment when the $i$-th camera is added, if the position of the $j$-th marker it observes is already known, this implies that the marker has been observed by at least two positioned cameras, including the $i$-th, thereby enabling triangulation. If the $j$-th marker is encountered for the first time, its position is estimated by solving the PnP problem, as in the baseline method. For computational efficiency, the following triangulation algorithm is proposed.
First, the 3D coordinate of the $k$-th corner of the $j$-th marker is independently computed such that rays cast from the optical centers of all cameras participating in the triangulation towards this point minimize the deviation from the rays directed toward the corresponding 2D detection coordinates on the image planes. Formally, this is achieved by solving an optimization problem minimizing the squared norms of the cross products of the ray pairs (i.e., the squared deviations from collinearity) in homogeneous coordinates:
$$\tilde{x}_{jk}^{*} = \arg\min_{\|\tilde{x}_{jk}\| = 1} \sum_{i \in O_j \cap \mathcal{P}} \left\| \tilde{u}_{ijk} \times P_i \tilde{x}_{jk} \right\|^2, \tag{6}$$
where $\tilde{x}_{jk} = (x_{jk}, y_{jk}, z_{jk}, \omega_{jk})^T \in \mathbb{R}^4$ are the homogeneous coordinates of the $k$-th corner of the $j$-th marker; $\tilde{u}_{ijk} = (u_{ijk}, v_{ijk}, 1)^T \in \mathbb{R}^3$ are the homogeneous coordinates of the detection of the $k$-th corner of the $j$-th marker on the image plane of the $i$-th camera; $P_i \in \mathbb{R}^{3 \times 4}$ is the projection matrix from the 3D world coordinate system to the image plane of the $i$-th camera; $O_j$ is the set of indices of all cameras observing the $j$-th marker; $\mathcal{P}$ is the set of indices of all cameras already positioned at the current iteration; $\|\cdot\|$ is the Euclidean norm; and $\times$ denotes the cross product.
Problem (6) represents a classical formulation for the Direct Linear Transformation (DLT) method [37]. The solution is obtained analytically and has a linear computational complexity of $O(N)$ in the worst case, where the marker is triangulated from all $N$ cameras simultaneously.
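As an illustration of Problem (6), the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each cross product is rewritten as a linear map $[\tilde{u}_{ijk}]_\times P_i$ acting on $\tilde{x}_{jk}$, so that the unit-norm minimizer is the right singular vector with the smallest singular value. The function names `skew` and `triangulate_dlt` are hypothetical.

```python
import numpy as np


def skew(v):
    """Return the 3x3 matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def triangulate_dlt(projections, detections):
    """Triangulate one 3D point from two or more views.

    projections: sequence of 3x4 projection matrices P_i
    detections:  sequence of (u, v) pixel coordinates, one per view
    """
    rows = []
    for P, (u, v) in zip(projections, detections):
        u_h = np.array([u, v, 1.0])
        # u_h x (P x) = [u_h]_x P x gives three linear equations in x per view
        rows.append(skew(u_h) @ P)
    A = np.vstack(rows)  # shape (3 * n_views, 4)
    # The unit-norm minimizer of ||A x||^2 is the right singular vector
    # associated with the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    x_h = vt[-1]
    # Dehomogenize (assumes the point is not at infinity)
    return x_h[:3] / x_h[3]
```

Since the stacked matrix has a fixed number of columns (four), the cost of the decomposition grows linearly with the number of views, which is in line with the $O(N)$ estimate above.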
Since the geometry of the marker is known a priori, the next step estimates the rotation and translation that best align the template 3D representation of the marker $X_{template} \in \mathbb{R}^{3 \times 4}$ with the triangulated points. This problem is solved analytically using the Kabsch-Umeyama method [38] via the singular value decomposition (SVD) of the point covariance matrix, with a constant asymptotic complexity of $O(1)$.
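The alignment step can be sketched as follows. This is an illustrative NumPy version of the rigid (scale-free) Kabsch-Umeyama solution, assuming points are stored as rows of an $(n, 3)$ array rather than as columns of a $3 \times 4$ matrix; the function name `kabsch_umeyama` is hypothetical.

```python
import numpy as np


def kabsch_umeyama(src, dst):
    """Rigid (R, t) minimizing sum_k ||dst_k - (R @ src_k + t)||^2.

    src, dst: arrays of shape (n, 3) with corresponding rows.
    """
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (dst - dst_mean).T @ (src - src_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (no reflection)
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    t = dst_mean - R @ src_mean
    return R, t
```

For a marker, `src` would hold the four template corners and `dst` the four triangulated corners; the determinant correction matters here because the template corners are coplanar.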
This extension to the baseline algorithm ensures that the position of each marker is always consistent with all cameras that observed it. Consequently, error accumulation is mitigated as the camera chain grows.
More formally, the proposed method is described in Algorithm 2.
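Algorithm 2: Proposed extrinsic calibration method
Input: $N$: number of cameras;
$M$: number of markers;
$\{V_i \mid i \in \overline{1, N}\}$: sets of marker indices visible from the $i$-th camera;
$\{O_j \mid j \in \overline{1, M}\}$: sets of camera indices observing the $j$-th marker;
$\{u_{ij} \mid i \in \overline{1, N}, j \in V_i\}$: detections of the $j$-th marker on the $i$-th camera image plane.
Output: $\{R_{c_i} \mid i \in \overline{1, N}\}$: rotations of all cameras;
$\{t_{c_i} \mid i \in \overline{1, N}\}$: translations of all cameras;
$\{R_{m_j} \mid j \in \overline{1, M}\}$: rotations of all markers;
$\{t_{m_j} \mid j \in \overline{1, M}\}$: translations of all markers.
1. $\mathcal{U} \leftarrow \{1, \ldots, N\}$ ▷ Set of uninitialized camera indices
2. $\mathcal{M} \leftarrow \emptyset$ ▷ Set of positioned marker indices
3. while $\mathcal{U} \neq \emptyset$ do
4.      if $|\mathcal{U}| = N$ then
5.            $i \leftarrow \arg\max_{i \in \mathcal{U}} |V_i|$ ▷ Pick camera with strict max observations
6.            $R_{c_i} \leftarrow I$ ▷ Initialize camera rotation as identity matrix
7.            $t_{c_i} \leftarrow (0, 0, 0)^T$ ▷ Initialize camera translation by zeros
8.      else
9.            $i \leftarrow \arg\max_{i \in \mathcal{U}} |V_i \cap \mathcal{M}|$ ▷ Pick camera with strict max observations of markers in $\mathcal{M}$
10.            $R_m \leftarrow \{R_{m_j} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker rotations
11.            $T_m \leftarrow \{t_{m_j} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker translations
12.            $D_m \leftarrow \{u_{ij} \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of visible marker detections
13.            $W \leftarrow \{\mathrm{Weight}(u_{ij}) \mid j \in V_i \cap \mathcal{M}\}$ ▷ Set of detection weights estimated by (5)
14.            $R_{c_i}, t_{c_i} \leftarrow \mathrm{SolveWPnP}(R_m, T_m, D_m, W)$ ▷ Estimate camera pose by solving weighted PnP
15.      end if
16.      $\mathcal{U} \leftarrow \mathcal{U} \setminus \{i\}$ ▷ Mark camera as initialized
17.      for each marker $j \in V_i$ do
18.           if $j \notin \mathcal{M}$ then
19.                $R_{m_j}, t_{m_j} \leftarrow \mathrm{SolvePnP}(R_{c_i}, t_{c_i}, u_{ij})$ ▷ Estimate marker pose by solving PnP
20.           else
21.                $R_c \leftarrow \{R_{c_k} \mid k \in O_j \setminus \mathcal{U}\}$ ▷ Set of rotations of cameras observing the marker
22.                $T_c \leftarrow \{t_{c_k} \mid k \in O_j \setminus \mathcal{U}\}$ ▷ Set of translations of cameras observing the marker
23.                $D_c \leftarrow \{u_{kj} \mid k \in O_j \setminus \mathcal{U}\}$ ▷ Set of marker detections from all cameras
24.                $R_{m_j}, t_{m_j} \leftarrow \mathrm{Triangulate}(R_c, T_c, D_c)$ ▷ Estimate marker pose by triangulation
25.           end if
26.           $\mathcal{M} \leftarrow \mathcal{M} \cup \{j\}$ ▷ Mark marker as initialized
27.      end for
28. end while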
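The set bookkeeping of Algorithm 2 (camera ordering and the per-marker choice between PnP and triangulation) can be sketched in Python as below. This is a hedged illustration only: no poses are computed, the function name `plan_incremental_order` is hypothetical, and ties between cameras are broken by the lowest index, which is an assumption not stated in the algorithm.

```python
def plan_incremental_order(visible):
    """Return a list of (camera, {marker: action}) in the order of addition.

    visible: dict mapping camera index -> set of marker indices it detects.
    The action is "pnp" for a first sighting and "triangulate" otherwise.
    """
    uninitialized = set(visible)   # plays the role of the camera set U
    positioned = set()             # plays the role of the marker set M
    plan = []
    while uninitialized:
        candidates = sorted(uninitialized)
        if not plan:
            # First camera: the one observing the most markers
            cam = max(candidates, key=lambda c: len(visible[c]))
        else:
            # Next camera: the one sharing the most already positioned markers
            cam = max(candidates, key=lambda c: len(visible[c] & positioned))
        uninitialized.remove(cam)
        actions = {}
        for marker in sorted(visible[cam]):
            if marker not in positioned:
                actions[marker] = "pnp"
            else:
                actions[marker] = "triangulate"
            positioned.add(marker)
        plan.append((cam, actions))
    return plan
```

For example, with `visible = {1: {1, 2, 3}, 2: {3, 4}, 3: {1, 2, 4, 5}}` the sketch starts from camera 3, then adds camera 1 (two shared markers) before camera 2 (one shared marker). A practical implementation would also need to handle a camera that shares no positioned markers, which this sketch does not attempt.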

5.3. Method Complexity

The computational complexity of the baseline method for scene initialization is the sum of the complexities of solving PnP problems at each iteration. Since positioning each marker involves only a single camera, the complexity of this operation is independent of $M$ or $N$ and is $O(1)$. The total complexity of positioning all $M$ markers is $O(M)$. To position the $i$-th camera, the PnP problem is solved using all positioned markers visible from that camera. In the worst case, where the $i$-th camera observes all $M$ markers, solving the problem for a single camera has a complexity of $O(M)$. Thus, the complexity of the baseline initialization method is $O(M + MN) = O(MN)$.
However, as mentioned previously, the complete calibration process of a multi-camera system almost never proceeds without Bundle Adjustment (BA). The complexity of BA is estimated as $O(MN^2 + N^3)$, or $O(MN^2)$ under the assumption that $M \geq N$, which clearly dominates the complexity of scene initialization by the baseline method.
The initialization method proposed in this paper introduces two heuristics to the baseline method:
  • Weighted PnP. Computing the weights $w$ using heuristic Formula (5) is performed in $O(1)$, as it is independent of $N$, $M$, or any other variable quantities.
  • Multi-view Triangulation. In the worst case, when adding the $i$-th camera, the positions of all $M$ markers are recalculated. As discussed in Section 5.2, the complexity of triangulating a single marker is $O(N)$; therefore, $O(MN)$ operations are performed for each new $i$-th camera. The resulting complexity added by the triangulation process to the baseline method is $O(MN^2)$, which does not exceed the complexity of a single BA iteration.
Thus, the complexity of the entire calibration process, comprising the proposed initialization method and BA, remains the same as that of the baseline method, amounting to $O(MN^2 + N^3)$ in the general case.
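The triangulation bound can be checked by a short worst-case count. The sketch below assumes that, when the $i$-th camera is added, every one of the $M$ markers is re-triangulated from the $i$ cameras positioned so far, at a cost of $c \cdot i$ per marker for some constant $c$ (the constant $c$ is introduced here only for illustration):

```latex
\sum_{i=1}^{N} M \cdot c\, i
  \;=\; c\, M\, \frac{N(N+1)}{2}
  \;=\; O(MN^2)
  \;\subseteq\; O(MN^2 + N^3).
```

Under this count the added initialization cost is absorbed by the BA term, so the overall complexity class is unchanged.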
Incorporating the proposed scene initialization method does not alter the asymptotic complexity class of the entire calibration process. The primary advantage of the method lies in a substantial improvement in the quality of the initial estimate of scene parameters and the associated error covariance structure. This, in turn, leads to a reduction in the number of BA iterations required to achieve convergence, thereby reducing overall computational costs and enhancing calibration reliability in the presence of noisy data.
Empirical support for this claim is provided by the bundle adjustment convergence comparison in Section 7. The proposed initialization consistently reduces the number of BA iterations required to reach the stopping criterion, confirming that the additional cost incurred at the initialization stage is offset by faster convergence of the global optimization.

6. Experiments

To evaluate the effectiveness of the proposed initialization method, a series of experiments was conducted under both fully controlled conditions with known parameters of synthetically generated scenes and real-world capture scenarios typical of object-centric rigs. The experiments compare the baseline and proposed methods for initial camera calibration using various metrics appropriate to the nature of the data.

6.1. Synthetic Data

6.1.1. Generation

To conduct controlled and reproducible experiments, 40 unique synthetic datasets were generated. Each scene represents a multi-camera system comprising 40 virtual cameras and 80 AprilTag markers randomly distributed on the inner surface of a cylinder. This arrangement ensures substantial overlap between camera fields of view and is representative of standard object-centric configurations. All cameras were set to a resolution of 1920 × 1080 pixels. The camera field of view was randomly selected from the range of 80 to 120 degrees for each dataset. The height and radius of the cylinder were randomly chosen from the ranges $[1.4, 1.8]$ meters and $[0.8, 1.0]$ meters, respectively. Figure 1 presents an example of a generated dataset, where cameras are depicted as orange frustums with the optical center at the apex.
Images from the virtual cameras were generated by rendering the 3D scene. Since only AprilTag marker detections are relevant for this experiment, surface material properties and lighting were not simulated during rendering. Figure 2 shows an example of a render from one of the cameras.
To approximate real-world conditions, random noise was added to the ground-truth values of the intrinsic matrix $K_i$ of each $i$-th camera immediately after rendering:
$$K_i^{*} = \begin{pmatrix} f_{x,i} + \operatorname{rand}(\beta) & 0 & p_{x,i} + \operatorname{rand}(\beta) \\ 0 & f_{y,i} + \operatorname{rand}(\beta) & p_{y,i} + \operatorname{rand}(\beta) \\ 0 & 0 & 1 \end{pmatrix}, \tag{7}$$
where $f_{x,i}$ and $f_{y,i}$ are the true focal lengths of the $i$-th camera; $p_{x,i}$ and $p_{y,i}$ are the true coordinates of the principal point of the $i$-th camera; $\operatorname{rand}(\beta)$ is a random number selected from the range $[-\beta; \beta]$; and $\beta$ is a parameter determining the noise amplitude (for the experiments, a value of 0.5% of the camera image diagonal length was used, i.e., $\beta \approx 11$ pixels).
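The perturbation in (7) can be illustrated with a small NumPy sketch. It assumes that the four noise terms are drawn independently from a uniform distribution on $[-\beta, \beta]$ (the text specifies only the range); the function name `perturb_intrinsics` and the use of `numpy.random.Generator` are choices made for this illustration.

```python
import numpy as np


def perturb_intrinsics(K, beta, rng):
    """Add independent uniform noise from [-beta, beta] to fx, fy, px, py."""
    K_noisy = np.array(K, dtype=float)  # copy, leave the input untouched
    for row, col in [(0, 0), (1, 1), (0, 2), (1, 2)]:
        K_noisy[row, col] += rng.uniform(-beta, beta)
    return K_noisy


# Noise amplitude: 0.5% of the image diagonal for a 1920 x 1080 image
beta = 0.005 * np.hypot(1920, 1080)
```

With these numbers the diagonal is about 2203 pixels, so `beta` evaluates to roughly 11 pixels, matching the value quoted above.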
The addition of noise simulates the errors incurred during preliminary intrinsic calibration, which is inherent in any real-world measurement process. Additionally, distortion D corresponding to the polynomial model for fisheye lenses was applied to all generated renders:
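$$D = \begin{pmatrix} \operatorname{rand}(2 \cdot 10^{-3}) \\ \operatorname{rand}(1 \cdot 10^{-3}) \\ \operatorname{rand}(5 \cdot 10^{-4}) \\ \operatorname{rand}(5 \cdot 10^{-4}) \end{pmatrix}, \tag{8}$$
where the values in the column vector $D$ correspond to the four distortion coefficients of the Brown-Conrady model [10].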
Adding slight distortion to the images allows for evaluating the robustness of the initialization methods to uncompensated optical distortions, which are always present in real cameras due to the physical limitations of optical systems.

6.1.2. Evaluation Metrics for Synthetic Data

In real-world conditions, ideal camera positions are unknown a priori; otherwise, calibration would be meaningless. Therefore, metrics that rely solely on 2D point locations on image planes and their corresponding 3D positions are employed to evaluate calibration quality. Such metrics include reprojection error, triangulation error (the distance between rays cast from different cameras toward the same 3D point) [39], photometric evaluation (when 3D reconstruction is performed during calibration, the similarity between rendered and real images can be assessed) [40], and others. However, this work utilizes synthetic images generated by virtual cameras with known precise intrinsic and extrinsic parameters; therefore, it is most appropriate to use metrics based on direct comparison between the ground-truth and calibrated parameters.
Among these evaluation approaches, geometric metrics such as reprojection error, angular error, and epipolar error directly quantify the consistency between the estimated camera model and the observed 2D measurements—precisely the quantity that calibration procedures optimize. Task-level metrics such as photometric similarity in 3D reconstruction are useful for assessing end-to-end system performance, but they are confounded by factors beyond calibration quality: photometric accuracy depends on scene geometry, surface material properties, and—in the case of learning-based methods [40]—on model capacity and training convergence. Isolating the contribution of calibration accuracy from these confounds requires controlled conditions that are not always practical to establish. The practical relevance of geometric calibration accuracy is nonetheless well-established: errors in extrinsic parameters propagate directly into the triangulation of 3D scene points [39], introduce geometric inconsistencies between camera pairs as characterised by epipolar geometry [4,41], and affect any downstream task that relies on the geometric correspondence between physical space and the image plane.
To evaluate the accuracy of the proposed method and compare it with the baseline method, metrics characterizing the errors in determining camera position and orientation were selected. It is important to note that extrinsic calibration algorithms recover the relative positioning of cameras; however, the resulting coordinate system may be translated and rotated relative to the ground truth coordinate system used for data generation. Nevertheless, since the physical size of the AprilTag markers is known a priori, the recovered 3D structure preserves absolute scale. This allows the two coordinate systems to be aligned for a valid comparison. Figure 3 presents an example of the ground truth camera system (green) and the estimated camera positions after calibration (red).
Alignment of the camera coordinate systems involves finding a rigid transformation, consisting of a rotation $R_{align} \in \mathbb{R}^{3 \times 3}$ and a translation $t_{align} \in \mathbb{R}^3$, that best aligns the calibrated camera system with the ground truth system. To determine this transformation, the Kabsch-Umeyama algorithm [38] is employed. A separate scale factor estimation is not required, since the known physical marker dimensions were used during calibration. Formally, the alignment problem is formulated as follows:
$$R_{align}^{*}, t_{align}^{*} = \arg\min_{R_{align},\, t_{align}} \sum_{i=1}^{N} \left\| c_{gt,i} - \left( R_{align} c_{est,i} + t_{align} \right) \right\|^2, \tag{9}$$
where $c_{gt,i} \in \mathbb{R}^3$ are the coordinates of the optical center of the $i$-th camera in the ground truth system; $c_{est,i} \in \mathbb{R}^3$ are the coordinates of the optical center of the $i$-th camera in the calibrated system; $N = 40$ is the number of cameras in each synthetic dataset; and $\|\cdot\|$ denotes the Euclidean norm.
In this context, the Kabsch-Umeyama algorithm provides an analytical solution to the least squares problem (9); therefore, it aligns the camera systems in the best possible way in terms of pairwise squared distances. The following metrics are then computed.
Translation error $E_{trans}$. It is computed as the Euclidean distance between the true position of the camera optical center and the aligned estimated position after calibration for the $i$-th camera:
$$E_{trans,i} = \left\| c_{gt,i} - \left( R_{align} c_{est,i} + t_{align} \right) \right\|. \tag{10}$$
Rotation error $E_{rot}$. It is computed as the angle required to align the true camera orientation with the aligned estimated orientation for the $i$-th camera:
$$E_{rot,i} = \arccos\left( \frac{\operatorname{trace}\left( R_{gt,i} R_{align} R_{est,i}^T \right) - 1}{2} \right), \tag{11}$$
where $\operatorname{trace}(M)$ is the sum of the diagonal elements of a square matrix $M$; $R_{gt,i} \in \mathbb{R}^{3 \times 3}$ is the rotation matrix transforming the global coordinate system to the coordinate system of the $i$-th ground truth camera; $R_{est,i} \in \mathbb{R}^{3 \times 3}$ is the rotation matrix transforming the global coordinate system to the coordinate system of the $i$-th camera estimated during calibration.
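Both metrics are straightforward to evaluate once the alignment is known. The NumPy sketch below is illustrative and assumes the residual rotation in (11) is formed as $R_{gt,i} R_{align} R_{est,i}^T$; the function names are hypothetical, and the clipping step is an added numerical safeguard rather than part of the definitions.

```python
import numpy as np


def translation_error(c_gt, c_est, R_align, t_align):
    """Distance between the true and the aligned estimated optical center."""
    return np.linalg.norm(c_gt - (R_align @ c_est + t_align))


def rotation_error(R_gt, R_est, R_align):
    """Angle in radians of the residual rotation R_gt @ R_align @ R_est.T."""
    R_res = R_gt @ R_align @ R_est.T
    cos_angle = (np.trace(R_res) - 1.0) / 2.0
    # Clipping guards against round-off pushing the value outside [-1, 1]
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))
```

The rotation error is returned in radians; `np.degrees` can be applied if degrees are preferred for reporting.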

6.1.3. Results for Synthetic Data

Both the baseline and the proposed method were applied to all 40 synthetic datasets for the initial extrinsic calibration of cameras. For each case, both metrics $E_{trans}$ and $E_{rot}$ discussed earlier were computed. Figure 4 presents a boxplot where the y-axis represents the values of the $E_{trans}$ metric for all cameras across all datasets, and the x-axis indicates the dataset numbers. Figure 5 similarly illustrates the distribution of the $E_{rot}$ metric.
Both plots indicate that the proposed method yields consistently better results than the baseline method across both metrics.
To assess the statistical significance of the observed differences, the one-sided Wilcoxon test was applied with a significance level of $\alpha = 0.05$. The results show that the proposed method significantly outperforms the baseline in 97.5% of the datasets for the $E_{trans}$ metric and in 85% of the datasets for the $E_{rot}$ metric. In the remaining cases, no statistically significant differences were observed.
Based on these results, it can be concluded that the proposed method for initial extrinsic camera calibration outperforms the baseline on synthetic datasets in terms of both translation and rotation metrics, yielding more accurate camera pose estimates.

6.2. Real Data

6.2.1. Multicamera Setup

A single GoPro Hero 13 Black camera was used for real-world data acquisition, configured for video recording at a resolution of 3840 × 2160. Twenty-nine AprilTag markers were used: 24 were placed on four vertical stands (six per stand), and the remaining five were distributed on the floor, as shown in Figure 6.
Subsequently, a camera pass around the setup was performed, and 20 frames were extracted from the video stream. Since the objective of the experiment is solely camera calibration using marker detections, the resulting dataset is equivalent to a simultaneous 20-camera capture, with the caveat that all cameras share the same intrinsics and distortion coefficients, as only a single physical camera was used. The intrinsics and distortion coefficients were pre-calibrated using Zhang’s classical algorithm [11] with a chessboard. The GoPro Hero 13 Black features a fisheye lens with strong distortion; however, since undistorted images and the corresponding camera parameters are used in the subsequent experiments, the pinhole camera model is assumed.
Two aspects of this setup warrant clarification. First, the sequential nature of the capture does not introduce any methodological limitation: the scene is static, and the method operates solely on 2D marker detections, which are independent of capture timing. The resulting dataset is therefore equivalent to a synchronous multi-camera capture for the purposes of calibration. Second, the fact that all virtual cameras share identical intrinsic parameters represents a special case of the general setting rather than a fundamental constraint: the method accepts independently calibrated intrinsic matrices $K_i$ for each $i$-th camera and processes them separately throughout. The synthetic experiments, in which independent random noise is applied to the intrinsic parameters of each of the 40 cameras, already constitute an evaluation under heterogeneous intrinsics.

6.2.2. Evaluation Metrics for Real Data

Unlike synthetic data, real-world data lacks ground-truth camera positions. Therefore, calibration quality is assessed using metrics that evaluate the local and global geometric consistency of the scene.
One of the most well-known metrics is the reprojection error $E_{repr}$. This is a classical metric that is minimized both when solving the PnP problem (1) and during Bundle Adjustment (2). It is calculated as the Euclidean distance between the projection of the $k$-th corner of the $j$-th marker onto the image plane of the $i$-th camera and the detection of this corner:
$$E_{repr,ijk} = \left\| u_{ijk} - \pi\left( K_i, R_i, t_i, x_{jk} \right) \right\|, \tag{12}$$
where $u_{ijk} \in \mathbb{R}^2$ is the detection of the $k$-th corner of the $j$-th marker on the image plane of the $i$-th camera; $x_{jk} \in \mathbb{R}^3$ are the coordinates of the $k$-th corner of the $j$-th marker in the global coordinate system; $\pi(K_i, R_i, t_i, x_{jk}) \in \mathbb{R}^2$ is the projection of point $x_{jk}$ onto the image plane of the $i$-th camera; $\|\cdot\|$ is the Euclidean norm.
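A minimal NumPy sketch of (12) is given below, assuming an ideal pinhole projection $\pi$ without distortion (consistent with the use of undistorted images in Section 6.2.1). The function names `project` and `reprojection_error` are hypothetical.

```python
import numpy as np


def project(K, R, t, x):
    """Pinhole projection of a world point x (3,) to pixel coordinates (2,)."""
    x_cam = R @ x + t      # world -> camera coordinates
    u_h = K @ x_cam        # homogeneous pixel coordinates
    return u_h[:2] / u_h[2]


def reprojection_error(u, K, R, t, x):
    """Euclidean pixel distance between detection u and the projection of x."""
    return np.linalg.norm(np.asarray(u, dtype=float) - project(K, R, t, x))
```

For instance, a point on the optical axis projects to the principal point, so a detection offset by (3, 4) pixels from it yields an error of 5 pixels.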
Another well-known metric is the angular error $E_{ang}$. It is calculated as the angle between the ray back-projected from the pixel coordinates of the detection of the $k$-th corner of the $j$-th marker on the image plane of the $i$-th camera and the vector pointing from the optical center of the $i$-th camera to the actual 3D point of the $k$-th corner of that $j$-th marker:
$$E_{ang,ijk} = \arccos\left( \frac{v_{det,ijk} \cdot v_{est,ijk}}{\left\| v_{det,ijk} \right\| \left\| v_{est,ijk} \right\|} \right), \tag{13}$$
where $v_{det,ijk} = R_i^T K_i^{-1} \tilde{u}_{ijk} \in \mathbb{R}^3$ is the ray originating from the optical center of the $i$-th camera towards the detection coordinates of the $k$-th corner of the $j$-th marker on the image plane, expressed in the world 3D coordinate system; $R_i \in \mathbb{R}^{3 \times 3}$ is the rotation matrix for transforming from the world coordinate system to the coordinate system of the $i$-th camera; $K_i \in \mathbb{R}^{3 \times 3}$ is the intrinsic matrix of the $i$-th camera; $\tilde{u}_{ijk} \in \mathbb{R}^3$ are the homogeneous coordinates of the detection of the $k$-th corner of the $j$-th marker on the image plane of the $i$-th camera; $v_{est,ijk} = X_{jk} + R_i^T t_i \in \mathbb{R}^3$ is the ray originating from the optical center of the $i$-th camera towards the 3D coordinates $X_{jk} \in \mathbb{R}^3$ of the $k$-th corner of the $j$-th marker in the world coordinate system; $t_i$ is the translation vector for transforming from the world coordinate system to the coordinate system of the $i$-th camera; $\|\cdot\|$ is the Euclidean norm.
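The angular error (13) can be sketched in NumPy as follows. The function name `angular_error` is hypothetical, and the clipping of the cosine is an added numerical safeguard that is not part of the definition.

```python
import numpy as np


def angular_error(u, K, R, t, X):
    """Angle in radians between the back-projected detection ray and the
    ray from the optical center to the 3D point X, both in the world frame."""
    u_h = np.array([u[0], u[1], 1.0])
    v_det = R.T @ np.linalg.inv(K) @ u_h   # back-projected detection ray
    v_est = X + R.T @ t                    # X minus the optical center -R.T @ t
    cos_angle = (v_det @ v_est) / (np.linalg.norm(v_det) * np.linalg.norm(v_est))
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))
```

Unlike the reprojection error, this quantity does not depend on the image resolution, which makes it convenient when comparing cameras with different focal lengths.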
Another widely known metric for evaluating the quality of calibration is the epipolar error $E_{epi}$. This metric evaluates the quality of pairwise camera poses using the averaged Sampson distance [42] for all point pairs, characterizing the deviation from satisfying the epipolar constraint:
$$\tilde{u}_{mjk}^T F_{nm} \tilde{u}_{njk} = 0, \tag{14}$$
where $\tilde{u}_{njk} = (u_{njk}, v_{njk}, 1)^T \in \mathbb{R}^3$ and $\tilde{u}_{mjk} = (u_{mjk}, v_{mjk}, 1)^T \in \mathbb{R}^3$ are the homogeneous coordinates of the detection of the $k$-th corner of the $j$-th marker on the image planes of the $n$-th and $m$-th cameras, respectively; $F_{nm}$ is the fundamental matrix [4,41] encoding the epipolar geometry of the camera pair $n$ and $m$.
The error $E_{epi}$ is calculated for each pair of cameras $n$ and $m$:
$$E_{epi,nm} = \frac{1}{4 \left| V_n \cap V_m \right|} \sum_{j \in V_n \cap V_m} \sum_{k=1}^{4} \frac{\left( \tilde{u}_{mjk}^T F_{nm} \tilde{u}_{njk} \right)^2}{l_{njk1}^2 + l_{njk2}^2 + l_{mjk1}^2 + l_{mjk2}^2}, \tag{15}$$
where $V_n$ and $V_m$ are the sets of indices of markers detected in the $n$-th and $m$-th cameras, respectively; $l_{njk} = F_{nm} \tilde{u}_{njk} = (l_{njk1}, l_{njk2}, l_{njk3})^T \in \mathbb{R}^3$ and $l_{mjk} = F_{nm}^T \tilde{u}_{mjk} = (l_{mjk1}, l_{mjk2}, l_{mjk3})^T \in \mathbb{R}^3$ are the vectors corresponding to the epipolar lines on the image planes of the $n$-th and $m$-th cameras, respectively.
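The following NumPy sketch illustrates (14) and (15) for one camera pair. It assumes that $F_{nm}$ is built from the estimated world-to-camera poses via the standard relation $F_{nm} = K_m^{-T} [t_{rel}]_\times R_{rel} K_n^{-1}$; how the fundamental matrix is obtained is not spelled out in the text, so this construction and the function names are assumptions made for illustration.

```python
import numpy as np


def skew(v):
    """Return [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def fundamental_from_poses(K_n, R_n, t_n, K_m, R_m, t_m):
    """F_nm with u_m^T F_nm u_n = 0 for world-to-camera poses (R, t)."""
    R_rel = R_m @ R_n.T              # rotation from camera n to camera m
    t_rel = t_m - R_rel @ t_n        # translation from camera n to camera m
    E = skew(t_rel) @ R_rel          # essential matrix
    return np.linalg.inv(K_m).T @ E @ np.linalg.inv(K_n)


def mean_sampson_error(F, pts_n, pts_m):
    """Average Sampson distance over corresponding pixel points of shape (k, 2)."""
    ones = np.ones((len(pts_n), 1))
    u_n = np.hstack([pts_n, ones])   # homogeneous points, shape (k, 3)
    u_m = np.hstack([pts_m, ones])
    l_n = u_n @ F.T                  # row r holds F @ u_n[r]
    l_m = u_m @ F                    # row r holds F.T @ u_m[r]
    residual = np.sum(u_m * l_n, axis=1)   # u_m^T F u_n per point pair
    denom = l_n[:, 0] ** 2 + l_n[:, 1] ** 2 + l_m[:, 0] ** 2 + l_m[:, 1] ** 2
    return np.mean(residual ** 2 / denom)
```

Here `pts_n` and `pts_m` would stack all four corners of every marker in $V_n \cap V_m$, so that the mean runs over $4|V_n \cap V_m|$ point pairs.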

6.2.3. Results for Real Data

For the real-world data, both the baseline method and the proposed method were applied for the initial extrinsic camera calibration. It should be noted that for both methods, the iterative addition of cameras was performed in the same order, starting from camera 12.
Figure 7 presents a boxplot where the x-axis represents the camera indices $i = \overline{1, 20}$ in the dataset, and the y-axis displays the values of the metric $E_{repr,ijk}$ for each $i$-th camera. These values correspond to all corners $k = \overline{1, 4}$ of every marker $j \in V_i$, where $V_i$ is the set of indices of markers for which detections are available in the $i$-th camera.
As can be clearly seen from the plot in Figure 7, the proposed initial calibration method performs significantly better in terms of reprojection error. It is worth noting the case of camera 12, which initiated the iterative calibration process in both methods. This camera is the only instance where the baseline method outperformed the proposed method. Since the baseline algorithm does not update marker positions after their initial placement, the reprojection error for the first camera is essentially the direct result of solving the PnP problem over all markers visible to it. The proposed method, on the other hand, updates marker positions via triangulation at each iteration, thereby sacrificing local accuracy in favor of global consistency.
In a manner similar to Figure 7, Figure 8 presents a boxplot for all errors $E_{ang,ijk}$ on the real-world dataset.
As can be seen from Figure 8, the angular error exhibits results nearly identical to the previously discussed reprojection error. Compared to the baseline, the proposed method demonstrates superior results for all cameras except for camera 12, from which the iterative calibration originated.
Figure 9 presents a graphical representation of the matrix of pairwise epipolar errors $E_{epi,nm}$ for all camera pairs $n, m = \overline{1, 20}$. Each camera pair is represented by a semicircle whose diameter is proportional to the error magnitude and whose color encodes the number of common points $4|V_n \cap V_m|$, where $V_n$ and $V_m$ are the sets of marker indices for which detections are available on the $n$-th and $m$-th cameras, respectively. The upper semicircle corresponds to the baseline method and the lower semicircle to the proposed method. A legend with color and size scales is provided at the top of the plot.
Notably, the most significant differences in errors between the two methods are predominantly observed for camera pairs with relatively few common points. This indicates that the proposed method achieves more stable relative positioning even for cameras whose fields of view barely overlap, unlike the baseline method. For camera 12, which served as the starting point for calibration in both cases, the proposed method exhibits the highest average errors, while the baseline method yields the lowest. This confirms the earlier observation about the special role of the starting camera.
Figure 10 presents a 3D representation of the entire scene comprising 20 cameras and 29 AprilTag markers, reconstructed using the baseline and the proposed initialization methods. The cameras are colored according to the mean reprojection error $E_{repr,i} = \frac{1}{4|V_i|} \sum_{j \in V_i} \sum_{k=1}^{4} E_{repr,ijk}$ for each $i$-th camera, where $V_i$ denotes the set of indices of markers detected by the $i$-th camera. For reference, the floor plane is indicated by a gray circle.
The 3D visualization illustrates how even minor camera displacement errors can substantially degrade calibration quality, as quantified by reprojection error. Despite the visually similar arrangement of markers and cameras in both cases, the baseline approach yields an average error exceeding 7 pixels, while the error of the proposed method remains below one.
Camera 12, from which the calibration process started in both methods, is indicated by a red arrow in the 3D visualization. It is easily identified in the baseline method by its bright green color, corresponding to the lowest reprojection error, as discussed above. A clear spatial pattern is also visible in the baseline method: cameras with minimal errors cluster together, as do those with maximal errors. To confirm this pattern, we evaluate the correlation between the error magnitude of the $i$-th camera and its proximity to camera 12 for all $i = \overline{1, 20}$. Euclidean distance in meters (16) and angular distance in radians (17) are used as proximity metrics:
$$d_{Euclidean,i} = \left\| R_i^T t_i - R_{12}^T t_{12} \right\|, \tag{16}$$
$$d_{angular,i} = \arccos\left( \frac{\operatorname{trace}\left( R_i R_{12}^T \right) - 1}{2} \right), \tag{17}$$
where $R_i \in \mathbb{R}^{3 \times 3}$ and $t_i \in \mathbb{R}^3$ are the rotation and translation for transforming a 3D point from the world coordinate system to the $i$-th camera coordinate system (the extrinsics of the $i$-th camera); $-R_i^T t_i \in \mathbb{R}^3$ are the coordinates of the optical center of the $i$-th camera in the world coordinate system, so that the norm in (16) equals the distance between the two optical centers; $R_i R_{12}^T \in \mathbb{R}^{3 \times 3}$ is the rotation matrix for transforming from the coordinate system of the 12th camera to the coordinate system of the $i$-th camera; $\|\cdot\|$ denotes the Euclidean norm; and $\operatorname{trace}(M)$ is the sum of the diagonal elements of a square matrix $M$.
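The two proximity measures can be computed as in the following NumPy sketch, which assumes world-to-camera extrinsics so that the optical center is $-R^T t$. The function names are hypothetical, and the reference pose passed as `R_ref`, `t_ref` would be that of camera 12 in this experiment.

```python
import numpy as np


def camera_center(R, t):
    """Optical center in world coordinates for world-to-camera extrinsics."""
    return -R.T @ t


def euclidean_distance(R_i, t_i, R_ref, t_ref):
    """Distance between the optical centers of camera i and the reference."""
    return np.linalg.norm(camera_center(R_i, t_i) - camera_center(R_ref, t_ref))


def angular_distance(R_i, R_ref):
    """Angle in radians of the relative rotation R_i @ R_ref.T."""
    cos_angle = (np.trace(R_i @ R_ref.T) - 1.0) / 2.0
    # Clipping guards against round-off pushing the value outside [-1, 1]
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))
```

Given per-camera arrays of distances and mean errors, a Pearson coefficient can then be obtained, for example, with `np.corrcoef(distances, errors)[0, 1]`.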
Table 1 presents the Pearson correlation coefficients between the error magnitude of the $i$-th camera and its proximity to the 12th camera for all $i = \overline{1, 20}$. Only statistically significant correlations are reported, at a significance level of $\alpha = 0.05$.
The correlation coefficients indicate that the farther the $i$-th camera is from the starting camera, the worse its pose is estimated by the baseline algorithm, and the better it is estimated by the proposed method. This is presumably because the order of camera addition in both methods was determined by the number of commonly visible markers, so as to maximize the reliability of each subsequent camera pose estimate (Algorithms 1 and 2). Since all cameras were arranged in a circle, the minimal field-of-view overlap occurs for cameras on opposite sides of the scene. For the baseline method, this implies that the farthest camera accumulates the largest error, whereas for the proposed method, the farthest camera is positioned using markers whose coordinates have already been refined via triangulation from all previously added cameras, which yields greater reliability.

7. Ablation Study

Since the proposed method is a modification of the baseline achieved by adding two key extensions, an ablation study is performed to evaluate the contribution of each extension separately. To this end, two additional configurations of the proposed method were evaluated on the real data: one with PnP weighting disabled, and another with multi-view triangulation disabled.
For the comparison, the metric $E_{repr,ijk}$ was selected for all cameras $i = \overline{1, 20}$, for all corners $k = \overline{1, 4}$ of each marker $j \in V_i$, where $V_i$ is the set of indices of markers for which detections are available on the $i$-th camera. Figure 11 presents a comparison of error distributions for all aforementioned method variants in the form of a boxplot.
As can be seen from the figure, the primary contribution to the quality of the results is made by the multi-view triangulation of markers, which actively counteracts the sequential error accumulation characteristic of the baseline calibration method. PnP weighting, on the other hand, does not yield a comparable improvement on its own, but it reduces both the median and the bulk of the error distribution, albeit at the cost of slightly higher outlier errors. Importantly, the joint application of both extensions yields a significant improvement over their individual use.
Table 2 reports the exact median and mean reprojection errors for each method variant, providing a quantitative basis for the comparisons discussed below.
To assess the statistical significance of the observed differences, Wilcoxon signed-rank tests were performed for all pairwise ablation comparisons at a significance level of $\alpha = 0.001$. The tests used the metric $E_{repr,ijk}$ over all cameras $i = \overline{1, 20}$ and all corners $k = \overline{1, 4}$ of each marker $j \in V_i$ visible from the $i$-th camera. Statistically significant results were obtained in all cases, including the weighted PnP component both in isolation and when added on top of the triangulation-only variant, confirming that the two extensions act as complementary rather than redundant mechanisms. As shown in Table 2, the weighted PnP scheme reduces the median reprojection error by approximately 9% relative to the baseline method, while the triangulation component reduces it by 86%. The full proposed method achieves a further 15% reduction in median error compared to the triangulation-only variant.
Furthermore, global optimization was executed for all versions of the proposed method as well as the baseline method, solving problem (2) using the Trust Region Reflective (TRF) method with a stopping criterion of $\theta < 10^{-6}$. Figure 12 presents plots of the objective function value versus the number of function evaluations. Insets magnify specific regions to facilitate comparison between the methods.
The plot confirms that the proposed method reduces the number of iterations required for bundle adjustment convergence, thereby lowering the overall computational cost of the calibration process. At the same time, it can be observed that the application of PnP weighting, compared to the baseline method, ultimately results in a lower (better) final value of the objective function, despite a slightly higher initial value.
The ablation results also demonstrate that the two components of the proposed method can be deployed independently. The weighted PnP scheme achieves a statistically significant improvement at $O(1)$ additional cost per iteration. The triangulation component provides the dominant accuracy gain at an initialization overhead of $O(NM)$, offset by a reduction in the number of subsequent bundle adjustment iterations required for convergence.

8. Conclusions and Future Work

This paper presents an incremental extrinsic camera parameter initialization method for multi-camera systems, which improves upon the baseline iterative registration algorithm based on solving the PnP problem. The key features distinguishing the proposed method from the baseline are the integration of heuristic weighting for AprilTag marker detections and the use of multi-view triangulation for the dynamic refinement of marker spatial coordinates at each step of scene expansion. Theoretical analysis demonstrates that, when global optimization is included, these mechanisms do not increase the computational complexity of the calibration process: the overall asymptotic complexity remains $O(MN^2)$.
A series of experiments was conducted on synthetic and real-world data to evaluate the effectiveness of the proposed method. Synthetic data enables direct comparison between estimated and ground-truth camera parameters by aligning coordinate systems using the Kabsch-Umeyama algorithm. Validation on real-world data was performed using epipolar geometry metrics, the final reprojection error, and the angular error. The experimental results confirm that the proposed method yields more accurate camera localization and effectively mitigates geometric drift compared to the baseline. An ablation study isolating the contribution of each component reveals that multi-view triangulation is the primary factor in preventing geometric discrepancies. Furthermore, the weighted PnP scheme reduces the median positioning error and decreases the number of iterations required for the subsequent convergence of the global optimization (bundle adjustment).
Several promising directions for future research could further enhance the proposed calibration method. First, since the proposed method involves minimizing nonlinear functions at various stages, it would be worthwhile to investigate alternative optimization methods [18,43,44,45]. Modern population-based and heuristic algorithms could improve robustness to outliers and reduce the risk of convergence to local extrema.
Second, several extensions of the proposed weighting scheme merit investigation. A natural extension would be to integrate the proposed confidence metric directly into the bundle adjustment stage. Weighting the loss function during global optimization according to the accumulated heuristic detection confidence could allow the algorithm to penalize deviations of reliable observations more heavily, thereby further improving final calibration accuracy. Finally, the proposed weighting scheme could be used to refine the criterion for selecting the next camera during incremental addition, prioritizing configurations with the highest total observation reliability rather than simply the largest number of shared markers.
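The last idea, selecting the next camera by total observation reliability rather than by the number of shared markers, can be sketched as follows. This is a hypothetical illustration of the future-work proposal, not the authors' implementation: the function name, the data layout, and the confidence values are assumptions, and the weights stand in for whatever detection confidence metric the weighting scheme produces.

```python
from typing import Dict, Set


def select_next_camera(
    detections: Dict[str, Dict[int, float]],
    registered_markers: Set[int],
    use_weights: bool = True,
) -> str:
    """Pick the candidate camera with the best overlap with known markers.

    `detections` maps a camera id to {marker id: confidence weight}.
    With use_weights=False the score is the number of shared markers;
    with use_weights=True it is the sum of their confidence weights.
    Ties are broken by camera id in ascending order.
    """
    if not detections:
        raise ValueError("no candidate cameras")

    def score(cam: str) -> float:
        shared = [w for m, w in detections[cam].items() if m in registered_markers]
        return sum(shared) if use_weights else float(len(shared))

    return max(sorted(detections), key=score)


if __name__ == "__main__":
    # Hypothetical candidates: cam_a sees more shared markers, but weakly.
    candidates = {
        "cam_a": {1: 0.2, 2: 0.3, 3: 0.1},
        "cam_b": {1: 0.9, 4: 0.95, 5: 0.8},
    }
    known = {1, 2, 3, 4}
    print(select_next_camera(candidates, known, use_weights=False))
    print(select_next_camera(candidates, known, use_weights=True))
```

In this toy configuration the count-based rule prefers `cam_a` (three shared markers), whereas the weighted rule prefers `cam_b`, whose two shared detections carry more total confidence.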

Author Contributions

Conceptualization, guidance, supervision, and validation, L.A.D.; software, resources, visualization, and testing, V.E.Z.; original draft preparation, L.A.D. and V.E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The synthetic and real data used in this paper, including scene images and corresponding camera parameters, are publicly available in the Zenodo repository at https://doi.org/10.5281/zenodo.20072914.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gao, X.; Li, M.; Shen, S. Large-scale structure from motion: A survey. J. Comput.-Aided Des. Comput. Graph. 2024, 36, 969–994.
  2. Liu, S.; Yang, M.; Xing, T.; Yang, R. A Survey of 3D Reconstruction: The Evolution from Multi-View Geometry to NeRF and 3DGS. Sensors 2025, 25, 5748.
  3. Valverde, M.; Moutinho, A.; Zacchi, J.-V. A Survey of Deep Learning-Based 3D Object Detection Methods for Autonomous Driving Across Different Sensor Modalities. Sensors 2025, 25, 5264.
  4. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004.
  5. Zhou, L.; Kaess, M. An efficient and accurate algorithm for the perspective-n-point problem. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6245–6252.
  6. Olson, E. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 3400–3407.
  7. Furgale, P.; Rehder, J.; Siegwart, R. Unified temporal and spatial calibration for multi-sensor systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, 3–7 November 2013; IEEE: Piscataway, NJ, USA, 2013.
  8. Rameau, F.; Park, J.; Bailo, O.; Kweon, I.S. MC-Calib: A generic and robust calibration toolbox for multi-camera systems. Comput. Vis. Image Underst. 2022, 217, 103353.
  9. Salvi, J.; Armangué, X.; Batlle, J. A comparative review of camera calibrating methods with accuracy evaluation. Pattern Recognit. 2002, 35, 1617–1635.
  10. Huai, J.; Shao, Y.; Jozkow, G.; Wang, B.; Chen, D.; He, Y.; Yilmaz, A. Geometric wide-angle camera calibration: A review and comparative study. Sensors 2024, 24, 6595.
  11. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334.
  12. Wang, J.; Olson, E. AprilTag 2: Efficient and robust fiducial detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 4193–4198.
  13. Tang, D.; Hu, T.; Shen, L.; Ma, Z.; Pan, C. AprilTag array-aided extrinsic calibration of camera–laser multi-sensor system. Robot. Biomim. 2016, 3, 13.
  14. Zhou, P.; Yin, H.; Xu, G.; Li, L.; Yao, J.; Li, J.; Liu, C.; Shi, Z. Meta-Calib: A generic, robust and accurate camera calibration framework with ArUco-encoded meta-board. ISPRS J. Photogramm. Remote Sens. 2024, 212, 357–380.
  15. Tripicchio, P.; D’Avella, S.; Camacho-Gonzalez, G.; Landolfi, L.; Baris, G.; Avizzano, C.A.; Filippeschi, A. Multi-camera extrinsic calibration for real-time tracking in large outdoor environments. J. Sens. Actuator Netw. 2022, 11, 40.
  16. Pacheco, J.M.; Tommaselli, A.M.G. Simultaneous calibration of multiple cameras and generation of omnidirectional images. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2024, X-1-2024, 183–190.
  17. Jatesiktat, P.; Lim, G.M.; Ang, W.T. Multi-camera calibration using far-range dual-LED wand and near-range chessboard fused in bundle adjustment. Sensors 2024, 24, 7416.
  18. Taghipour-Gorjikolaie, M.; Volino, M.; Rusbridge, C.; Wells, K. A novel multiple camera RGB-D calibration approach using simulated annealing. IEEE Access 2024, 12, 98723–98733.
  19. Schönberger, J.L.; Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 4104–4113.
  20. Cui, H.; Gao, X.; Shen, S. MCSfM: Multi-camera-based incremental structure-from-motion. IEEE Trans. Image Process. 2023, 32, 6441–6456.
  21. Mouragnon, E.; Lhuillier, M.; Dhome, M.; Dekeyser, F.; Sayd, P. Generic and real-time structure from motion using local bundle adjustment. Image Vis. Comput. 2009, 27, 1178–1193.
  22. Hong, J.-H.; Zach, C. pOSE: Pseudo object space error for initialization-free bundle adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018.
  23. Triggs, B.; McLauchlan, P.F.; Hartley, R.I.; Fitzgibbon, A.W. Bundle adjustment–A modern synthesis. In Vision Algorithms: Theory and Practice; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2000; Volume 1883, pp. 298–372.
  24. Agarwal, S.; Snavely, N.; Seitz, S.M.; Szeliski, R. Bundle adjustment in the large. In Proceedings of the European Conference on Computer Vision (ECCV), Heraklion, Greece, 5–11 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 29–42.
  25. Börlin, N.; Grussenmeyer, P. Camera calibration using the damped bundle adjustment toolbox. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2014, II-5, 89–96.
  26. Vakhitov, A.; Ferraz, L.; Agudo, A.; Moreno-Noguer, F. Uncertainty-aware camera pose estimation from points and lines. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4659–4668.
  27. Chen, H.; Tian, W.; Wang, P.; Wang, F.; Xiong, L.; Li, H. EPro-PnP: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Piscataway, NJ, USA, 2022.
  28. Huang, J.; Shao, X. Three-dimensional reconstruction precision estimation in multi-view measurement systems. Opt. Laser Technol. 2025, 184, 112494.
  29. Heng, L.; Li, B.; Pollefeys, M. CamOdoCal: Automatic intrinsic and extrinsic calibration of a rig with multiple generic cameras and odometry. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1793–1800.
  30. Beketov, S.M.; Zubkova, D.A.; Gintciak, A.M.; Burlutskaya, Z.V.; Redko, S.G. Modern optimization methods and their application features. Russ. Technol. J. 2025, 13, 78–94.
  31. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101.
  32. Levenberg, K. A method for the solution of certain non-linear problems in least squares. Q. Appl. Math. 1944, 2, 164–168.
  33. Marquardt, D.W. An algorithm for least-squares estimation of nonlinear parameters. SIAM J. Appl. Math. 1963, 11, 431–441.
  34. Branch, M.A.; Coleman, T.F.; Li, Y. A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J. Sci. Comput. 1999, 21, 1–23.
  35. Lourakis, M.I.A.; Argyros, A.A. SBA: A software package for generic sparse bundle adjustment. ACM Trans. Math. Softw. 2009, 36, 2.
  36. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vis. 2009, 81, 155–166.
  37. Abdel-Aziz, Y.I.; Karara, H.M. Direct linear transformation from comparator coordinates into object-space coordinates in close-range photogrammetry. Photogramm. Eng. Remote Sens. 2015, 81, 103–107.
  38. Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380.
  39. Hartley, R.I.; Sturm, P. Triangulation. Comput. Vis. Image Underst. 1997, 68, 146–157.
  40. Delaunoy, A.; Pollefeys, M. Photometric bundle adjustment for dense multi-view 3D modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014.
  41. Hartley, R.I. In defense of the eight-point algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 580–593.
  42. Fathy, M.E.; Nguyen, Q.-S.; Saad, M. Fundamental matrix estimation: A study of error criteria. Pattern Recognit. Lett. 2011, 32, 383–391.
  43. Selvarajan, S. A comprehensive study on modern optimization techniques for engineering applications. Artif. Intell. Rev. 2024, 57, 194.
  44. Demidova, L.A.; Zhuravlev, V.E. An Improved Soft Island Model of the Fish School Search Algorithm with Exponential Step Decay Using Cluster-Based Population Initialization. Stats 2025, 8, 10.
  45. Sherstnev, P.A.; Semenkin, E.S. Self-configuring genetic programming algorithms with Success History-based Adaptation. Sib. Aerosp. J. 2025, 26, 60–70.
Figure 1. 3D representation of the generated dataset (left: side view; right: top view).
Figure 2. The example of a render from one of the cameras.
Figure 3. Example of the difference in global coordinate systems while maintaining local calibration accuracy.
Figure 4. Boxplot for $E_{trans}$.
Figure 5. Boxplot for $E_{rot}$.
Figure 6. Scene configuration with 29 AprilTag markers for real-world data acquisition.
Figure 7. Boxplot for $E_{repr}$.
Figure 8. Boxplot for the $E_{ang}$ metric.
Figure 9. Matrix of pairwise errors $E_{epi}$.
Figure 10. 3D representation of the entire scene.
Figure 11. Ablation comparison of methods using the $E_{repr}$ metric.
Figure 12. Ablation comparison of methods by bundle adjustment convergence.
Table 1. The Pearson correlation coefficients.

| Correlation | Baseline Method | Proposed Method |
|---|---|---|
| Between $E_{repr,i}$ and $d_{Euclidean,i}$ | 0.650 | −0.632 |
| Between $E_{repr,i}$ and $d_{angular,i}$ | 0.698 | |
Table 2. Reprojection error statistics for all ablation variants.

| Method | Median (Pixels) | Mean (Pixels) |
|---|---|---|
| Baseline method | 6.417 | 7.451 |
| Proposed method (only weighted PnP) | 5.828 | 7.244 |
| Proposed method (only triangulation) | 0.894 | 1.180 |
| Proposed method (with both) | 0.757 | 1.006 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Demidova, L.A.; Zhuravlev, V.E. Incremental Multi-Camera Extrinsic Calibration Method Based on PnP Integrating Weighted AprilTag Detections and Multi-View Triangulation. Algorithms 2026, 19, 371. https://doi.org/10.3390/a19050371

