1. Introduction
Three-dimensional computer vision has become an integral component of a wide range of modern technologies, finding applications in areas such as autonomous driving, robotics, augmented and virtual reality, as well as industrial quality control [1, 2, 3]. The foundation for most three-dimensional computer vision systems consists of multi-camera setups [4], which enable richer spatial coverage of the surrounding environment compared to single cameras. Many higher organisms, including humans, rely on binocular vision, which suggests that multi-view perception is an effective means of perceiving three-dimensional environments.
The key factor determining the accuracy and reliability of any multi-camera system is the quality of its calibration. The calibration process involves estimating both the intrinsic parameters of each camera (intrinsics) and their relative spatial position and orientation (extrinsics) [4, 5]. Intrinsics characterize the properties of the optical system and sensor, remaining invariant to camera placement in space. This enables their offline estimation during a preliminary setup stage and allows them to be treated as constant throughout a series of experiments. Although physical factors such as temperature fluctuations may introduce microscopic variations in the optical system, in practice these deviations are either negligibly small or successfully compensated by refinement methods applied during the computation pipeline of specific tasks. Extrinsic parameters, however, require strict recalibration upon any, even minor, change in the multi-camera rig configuration, as errors in the relative positioning of sensors directly propagate into geometric distortions and critically degrade the accuracy of the resulting 3D reconstruction or any other computer vision task.
This paper proposes an incremental initialization method for extrinsic camera calibration in an object-centric rig that utilizes AprilTag marker detection [6] in images, weighted PnP (Perspective-n-Point) solution techniques for pose estimation of objects with known geometry [5], and a multi-view triangulation algorithm for refining marker positions in space. Unlike board-based calibration frameworks such as Kalibr [7] or MC-Calib [8], which require a structured calibration object with known internal geometry, the proposed method places no constraints on the relative positions of the markers, recovering their 3D coordinates simultaneously with camera poses. This distinction has a direct practical consequence: board-based frameworks require a dedicated calibration session each time the camera configuration changes, with the calibration object actively repositioned through the shared field of view of all cameras. The proposed approach eliminates this overhead: once placed in the scene, the markers can be reused for recalibration at any time without additional capture sessions. The proposed method is compared against a baseline method that implements classical iterative calibration and relies solely on the PnP problem for estimating camera and marker poses. This comparison is conducted on both synthetic datasets and real-world capture data. To isolate the specific contribution of weighting and triangulation that distinguish the proposed method from the baseline, an ablation study is performed.
The principal contributions of this paper are as follows:
An incremental initialization method for multi-camera extrinsic calibration is proposed, operating on arbitrarily placed AprilTag markers with unknown mutual positions and requiring no structured calibration object.
A heuristic detection-quality weighting scheme for the PnP problem is introduced, with physical justification for each factor and empirical evidence of robustness to hyperparameter choice.
A multi-view triangulation step is integrated directly into the incremental registration loop, enabling dynamic correction of marker positions and mitigation of geometric drift at each camera addition step.
It is shown theoretically that the proposed modifications do not increase the asymptotic complexity of the complete calibration procedure, including the bundle adjustment stage.
The remainder of the paper is organized as follows. Section 2 reviews related work on target-based calibration, SfM-inspired incremental methods, and weighted pose estimation. Section 3 describes the baseline incremental calibration method. Section 4 covers the global optimization (bundle adjustment) stage that completes the calibration procedure. Section 5 introduces the proposed method, which extends the baseline with the two key modifications. Section 6 reports experimental results on synthetic and real-world data, and Section 7 presents an ablation study isolating the contribution of each component. Section 8 concludes the paper.
3. Baseline Calibration Method
Since the problem of joint estimation of camera parameters and scene structure (Bundle Adjustment, BA) reduces to the minimization of a non-convex cost function in a high-dimensional space [23, 24, 25], applying classical optimization methods without a high-quality initial estimate leads to convergence to a local minimum with high probability [22, 23, 30]. In this regard, the standard approach to parameter initialization is the incremental registration strategy [20].
Incremental registration consists of sequentially solving a series of local geometric problems, where the parameters of each new camera are estimated relative to the already reconstructed part of the scene. The key tools for such localization are algorithms for solving the PnP problem. Formally, the PnP problem involves finding the position and orientation of a camera $(R, t)$, where $R \in SO(3)$ is the rotation matrix and $t \in \mathbb{R}^{3}$ is the translation vector, relative to a given coordinate system by minimizing the reprojection error between $n$ known 3D coordinates of scene points and their corresponding 2D projections on the image plane. In general form, the problem is formulated as a nonlinear optimization:
$$(R^{*}, t^{*}) = \arg\min_{R,\, t} \sum_{p=1}^{n} \left\| x_p - \pi(K, R, t, X_p) \right\|^{2}, \quad (1)$$
where $x_p \in \mathbb{R}^{2}$ are the observed 2D coordinates of the $p$-th point on the image plane; $X_p \in \mathbb{R}^{3}$ are the 3D coordinates of the $p$-th point in the world coordinate system; $K$ is the camera intrinsic matrix; $\pi$ is the function projecting a 3D point from the world coordinate system to the camera image plane; $(R^{*}, t^{*})$ are the sought rotation matrix and translation vector required to transform a point from the world coordinate system to the camera coordinate system; $\|\cdot\|$ denotes the Euclidean norm; $n$ is the total number of points; the expression under the summation represents the squared reprojection error for the $p$-th point.
Finding the transformation for a camera that minimizes the reprojection error is algebraically equivalent to finding the rotation and translation of a marker relative to a fixed camera, since any rigid transformation is invertible. Thus, solving the PnP problem enables both estimating camera poses relative to known 3D points and estimating point positions relative to known cameras.
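As a minimal sketch of objective (1) and of the pose inversion argument, the following NumPy code evaluates the reprojection cost for an ideal pinhole camera and inverts a rigid transform. The function names are illustrative, lens distortion is ignored, and the convention that a world point maps to the camera frame as R X + t is an assumption of this sketch.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of world points X (n, 3) to pixel coordinates (n, 2)."""
    Xc = X @ R.T + t                   # world frame -> camera frame
    uvw = Xc @ K.T                     # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

def reprojection_cost(K, R, t, X, x):
    """Sum of squared reprojection errors, i.e. the objective of problem (1)."""
    return float(np.sum((x - project(K, R, t, X)) ** 2))

def invert_pose(R, t):
    """Inverse of the rigid transform Xc = R @ Xw + t."""
    return R.T, -R.T @ t
```

Because `invert_pose` is its own inverse, the same PnP solver can be read either as "camera relative to marker" or as "marker relative to camera".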
In this work, the baseline calibration method under consideration is an iterative algorithm implementing the principle of sequential scene expansion. The calibration process begins with fixing the coordinate system of the first camera, after which new sensors and markers are added in a cyclic manner. The position and orientation of each subsequent camera are computed via PnP using already known 3D marker points, while the poses of newly detected markers are, in turn, estimated via PnP from cameras already placed in space. The baseline calibration method is presented in more detail in Algorithm 1.
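| Algorithm 1: Baseline extrinsic calibration method |
| Input: | $N_c$: number of cameras; $N_m$: number of markers; $\mathcal{M}_i$: sets of marker indices visible from the $i$-th camera; $x_{ij}$: detections of the $j$-th marker on the $i$-th camera image plane. |
| Output: | $R_i^c$: rotations of all cameras; $t_i^c$: translations of all cameras; $R_j^m$: rotations of all markers; $t_j^m$: translations of all markers. |
| 1. | $\mathcal{U} \leftarrow \{1, \dots, N_c\}$ | Set of uninitialized cameras indices |
| 2. | $\mathcal{P} \leftarrow \varnothing$ | Set of positioned markers indices |
| 3. | while $\mathcal{U} \neq \varnothing$ do | |
| 4. | if $\mathcal{P} = \varnothing$ then | |
| 5. | $i \leftarrow \arg\max_{u \in \mathcal{U}} \lvert \mathcal{M}_u \rvert$ | Pick camera with strict max observations |
| 6. | $R_i^c \leftarrow I$ | Initialize camera rotation as identity matrix |
| 7. | $t_i^c \leftarrow 0$ | Initialize camera translation by zeros |
| 8. | else | |
| 9. | $i \leftarrow \arg\max_{u \in \mathcal{U}} \lvert \mathcal{M}_u \cap \mathcal{P} \rvert$ | Pick camera with max overlap with $\mathcal{P}$ |
| 10. | $\mathcal{R} \leftarrow \{R_j^m : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers rotations |
| 11. | $\mathcal{T} \leftarrow \{t_j^m : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers translations |
| 12. | $\mathcal{X} \leftarrow \{x_{ij} : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers detections |
| 13. | $(R_i^c, t_i^c) \leftarrow \mathrm{PnP}(\mathcal{R}, \mathcal{T}, \mathcal{X})$ | Estimate camera pose by solving PnP |
| 14. | end if | |
| 15. | $\mathcal{U} \leftarrow \mathcal{U} \setminus \{i\}$ | Mark camera as initialized |
| 16. | for each marker $j \in \mathcal{M}_i \setminus \mathcal{P}$ do | |
| 17. | $(R_j^m, t_j^m) \leftarrow \mathrm{PnP}(R_i^c, t_i^c, x_{ij})$ | Estimate marker pose by solving PnP |
| 18. | $\mathcal{P} \leftarrow \mathcal{P} \cup \{j\}$ | Mark marker as initialized |
| 19. | end for | |
| 20. | end while | |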
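The control flow of Algorithm 1 can be sketched without any pose solver by tracking only which camera is registered at each step and which markers it positions for the first time. This is a hedged illustration: the function name, the dictionary input format, the lowest-index tie-breaking, and the error raised for a disconnected visibility graph are assumptions introduced here, not part of the described method.

```python
def registration_order(visible):
    """visible[i] is the set of marker indices detected by camera i.

    Returns a list of (camera, newly_positioned_markers) in registration order.
    """
    uninitialized = set(visible)   # camera indices not yet placed
    positioned = set()             # marker indices with a known pose
    order = []
    while uninitialized:
        candidates = sorted(uninitialized)  # deterministic tie-breaking
        if not positioned:
            # First camera: the one with the most detections.
            cam = max(candidates, key=lambda i: len(visible[i]))
        else:
            # Next camera: largest overlap with already positioned markers.
            cam = max(candidates, key=lambda i: len(visible[i] & positioned))
            if not visible[cam] & positioned:
                raise ValueError("no remaining camera shares a positioned marker")
        new_markers = visible[cam] - positioned
        order.append((cam, sorted(new_markers)))
        uninitialized.remove(cam)
        positioned |= new_markers
    return order
```

In a full implementation, each step of the returned order would be paired with one PnP solve for the camera and one PnP solve per newly positioned marker.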
However, the baseline method suffers from a substantial limitation: the problem of error accumulation. Since the estimation of position and orientation for each subsequent camera depends on the accuracy of marker position estimates at previous stages, errors are not compensated but rather accumulate along the chain of cameras. In closed-loop configurations, particularly in the object-centric systems considered in this work, this often leads to a significant discrepancy between the computed position of the last calibrated camera and its true position relative to the first.
Accumulated geometric inconsistencies can potentially be eliminated during global optimization, but BA is extremely sensitive to the quality of the initial estimate. An excessively coarse estimate of extrinsics from the baseline method may cause the optimization to converge to an incorrect local minimum. Furthermore, the computational complexity of BA grows nonlinearly with increasing numbers of cameras and markers, which becomes particularly apparent when the initial calibration is insufficiently accurate, forcing the optimization algorithm to perform a larger number of iterations to reach the optimum.
4. Global Optimization
Typically, regardless of the accuracy of the initial parameter estimate for the scene, the calibration process for multi-camera systems cannot avoid the stage of global optimization (BA). This is because sequential initialization algorithms, including the baseline method described earlier, minimize errors locally and do not account for global constraints in the visibility graph. Joint optimization of all scene parameters enables finding a solution corresponding to the maximum likelihood criterion, ensuring a statistically optimal result in the presence of Gaussian measurement noise [23].
Since both cameras and markers possess rotation and translation, the values of which are adjusted during global optimization, separate notation is required. Let $R_i^c$, $t_i^c$ denote the rotation and translation for the $i$-th camera, and $R_j^m$, $t_j^m$ denote the rotation and translation for the $j$-th marker. The optimized parameters for all cameras and all markers are packed into a single parameter vector $\theta \in \mathbb{R}^{P}$ for launching BA, where $P$ is the total number of parameters, which depends on the number of cameras $N_c$, the number of markers $N_m$, and the number of parameters for encoding rotation and translation. One camera is excluded from the vector so that its position and orientation define the coordinate system and eliminate solution ambiguity. For representing rotations, the axis-angle vector parameterization is employed. This representation is the most compact parameterization of rotation, avoiding parameter redundancy inherent in rotation matrices or quaternions. Thus, 3 parameters are allocated for rotation and 3 parameters for translation for each camera and marker, yielding $P = 6(N_c - 1) + 6N_m$. The vector is formed by concatenating the parameters of all cameras and markers: $\theta = (\hat{R}_2^c, t_2^c, \dots, \hat{R}_{N_c}^c, t_{N_c}^c, \hat{R}_1^m, t_1^m, \dots, \hat{R}_{N_m}^m, t_{N_m}^m)$, taking the reference camera to have index 1, where the hat symbol denotes the axis-angle representation of rotation.
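The packing of poses into a single parameter vector can be illustrated with a short NumPy sketch. It assumes that the first camera in the list is the fixed reference and is therefore skipped; the helper names `pack`, `unpack`, `rotvec_to_matrix`, and `matrix_to_rotvec` are hypothetical, and the inverse Rodrigues map used here is only valid for rotation angles below pi.

```python
import numpy as np

def rotvec_to_matrix(r):
    """Rodrigues formula: axis-angle vector -> 3x3 rotation matrix."""
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.eye(3)
    k = r / angle
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

def matrix_to_rotvec(R):
    """Inverse map, valid for rotation angles in [0, pi)."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-12:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

def pack(camera_poses, marker_poses):
    """Lists of (R, t) pairs -> flat vector; camera_poses[0] is the reference."""
    blocks = [np.concatenate([matrix_to_rotvec(R), t])
              for R, t in camera_poses[1:] + marker_poses]
    return np.concatenate(blocks)

def unpack(theta, n_cameras):
    """Flat vector -> (camera poses incl. identity reference, marker poses)."""
    poses = [(rotvec_to_matrix(b[:3]), b[3:].copy()) for b in theta.reshape(-1, 6)]
    cameras = [(np.eye(3), np.zeros(3))] + poses[:n_cameras - 1]
    return cameras, poses[n_cameras - 1:]
```

With three cameras and four markers, the packed vector has 6 * (3 - 1) + 6 * 4 = 36 entries, matching the parameter count given in the text.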
The global optimization problem is formulated as follows:
$$\theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N_c} \sum_{j \in \mathcal{M}_i} \left\| x_{ij} - \pi(\hat{R}_i^c, t_i^c, X_j) \right\|^{2}, \quad (2)$$
where $\theta$ is the vector of optimized parameters encoding rotations and translations for all cameras ($\hat{R}_i^c$ and $t_i^c$ for each $i$-th camera, respectively) and markers (coordinates $X_j \in \mathbb{R}^{4 \times 3}$ for the four corners of the $j$-th marker, obtained by applying rotation $\hat{R}_j^m$ and translation $t_j^m$ to the template 3D representation of the marker $T$); $x_{ij} \in \mathbb{R}^{4 \times 2}$ are the 2D coordinates of the detection of the four corners of the $j$-th marker on the image plane of the $i$-th camera; $\pi$ is the function projecting the four corners of a marker from the 3D world coordinate system to the camera image plane; $\mathcal{M}_i$ is the set of indices of markers for which detections exist in the $i$-th camera; $\|\cdot\|$ denotes the Euclidean norm computed along the second dimension, that is, implementing the mapping $\mathbb{R}^{4 \times 2} \to \mathbb{R}^{4}$, with the four resulting squared values summed.
The expression under the summation in Equation (2) decomposes into 8 squared scalar quantities, since each detection of the $j$-th marker from the $i$-th camera involves 4 points (corners), indexed by variable $k$, each having 2 coordinates indexed by variable $l$:
$$\left\| x_{ij} - \pi(\hat{R}_i^c, t_i^c, X_j) \right\|^{2} = \sum_{k=1}^{4} \sum_{l=1}^{2} \left( x_{ijkl} - \pi(\hat{R}_i^c, t_i^c, X_j)_{kl} \right)^{2} = \sum_{k=1}^{4} \sum_{l=1}^{2} r_{ijkl}^{2}. \quad (3)$$
Thus, the BA problem reduces to a nonlinear least squares problem, in which the sum of squared residuals $r_{ijkl}$ must be minimized, where $i$ is the camera index, $j$ is the marker index, $k$ is the marker corner index, and $l$ is one of the two dimensions on the camera image plane. The complete residual vector $r$ has length $8 \sum_{i=1}^{N_c} \lvert \mathcal{M}_i \rvert$, encompassing all marker detections. Optionally, a robust loss function may be applied to the squared residuals to mitigate the influence of outliers, for example, the Huber loss [31].
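To make the residual layout concrete, the sketch below stacks per-detection corner residuals into one flat vector and applies the classical Huber loss elementwise. The dictionary format keyed by (camera, marker) pairs and the function names are assumptions for illustration, and the Huber form shown is the textbook one rather than the scaled variant some solvers use internally.

```python
import numpy as np

def stack_residuals(detections, predictions):
    """Both arguments map (camera, marker) -> (4, 2) corner arrays.

    Returns a flat vector with 8 entries per detection.
    """
    keys = sorted(detections)
    return np.concatenate([(detections[k] - predictions[k]).ravel() for k in keys])

def huber(r, delta=1.0):
    """Classical Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
```

For three detections the stacked vector has 24 entries, in line with the count of 8 scalar residuals per detection.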
Second-order iterative methods are traditionally employed to solve the BA problem, such as the Levenberg–Marquardt (LM) algorithm [32, 33] or the more commonly used Trust Region Reflective (TRF) algorithm [34]. This choice is motivated by the specific structure of the objective function landscape in BA problems, which often exhibits an ill-conditioned Hessian matrix. Under such conditions, second-order methods that exploit curvature information can achieve quadratic convergence rates when provided with a sufficiently accurate initial estimate.
The computational complexity of a single iteration of such algorithms is determined by the cost of solving the normal equations of the form [23]:
$$H \delta = -J^{\top}, \quad (4)$$
where $H$ is the Hessian matrix of the sum of squared residuals objective function or its approximation (depending on the specific second-order optimization method); $J$ is the Jacobian matrix of the sum of squared residuals objective function; $\delta$ is the sought parameter update for the current iteration.
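A single damped Gauss-Newton update of the Levenberg-Marquardt type can be sketched as follows. Note the assumptions: `J_r` here is the Jacobian of the residual vector (not of the scalar objective), so the gradient of the objective is proportional to `J_r.T @ r`, and the Hessian is approximated by `J_r.T @ J_r` plus an identity damping term. The function name and the damping parameter are illustrative.

```python
import numpy as np

def lm_step(J_r, r, damping):
    """One damped Gauss-Newton update for the objective 0.5 * ||r||^2.

    J_r: residual Jacobian of shape (m, p); r: residual vector of shape (m,).
    """
    H = J_r.T @ J_r + damping * np.eye(J_r.shape[1])  # Hessian approximation
    g = J_r.T @ r                                     # gradient of the objective
    return np.linalg.solve(H, -g)                     # solve the normal equations
```

For a linear residual and zero damping, one such step lands exactly on the least-squares solution; increasing the damping shortens the step.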
The complexity of solving such a system comprises the cost of constructing and inverting the Hessian matrix $H$. Inverting a matrix of size $P \times P$ in the general case amounts to $O(P^{3})$, or simply $O((N_c + N_m)^{3})$, since the factor 6 is a constant. However, it can be observed that the rotation and translation parameters for each marker depend only on those cameras that observe it and are independent of the parameters of other markers. The same holds for cameras: no residual computation involves parameters from multiple cameras simultaneously. Therefore, the matrix $H$ exhibits a specific sparsity pattern in which the camera-camera and marker-marker parts have a block-diagonal structure. This property enables the application of the Schur complement method to the inversion of $H$, which eliminates the marker parameters and reduces the problem to solving a system for cameras only [35], lowering the complexity to $O(N_c^{2} N_m)$ for constructing the complement and $O(N_c^{3})$ for the final computation of $\delta$. Leveraging the same sparsity property, it can be established that constructing $H$ costs $O(N_c N_m)$ in the worst case, when each camera observes every marker, i.e., $\lvert \mathcal{M}_i \rvert = N_m$. Typically, the number of markers in the scene far exceeds the number of cameras, i.e., $N_m \gg N_c$, so $O(N_c^{2} N_m)$ begins to dominate over $O(N_c^{3})$ and $O(N_c N_m)$, and the per-iteration complexity ultimately reduces to $O(N_c^{2} N_m)$. The total runtime of BA is linear in the number of iterations $T$, which can reach hundreds for a poor initial estimate.
Thus, the asymptotic complexity of BA is linear in the number of markers and quadratic in the number of cameras, while the quality of initialization directly affects the number of iterations required to reach the optimum.
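The Schur complement elimination can be illustrated on a small dense system. In this hedged sketch the block names are assumptions: `A` is the camera-camera block, `B` the camera-marker coupling, `D_blocks` the list of 6x6 marker blocks, and `g_c`, `g_m` the corresponding parts of the right-hand side. Markers are eliminated one block at a time, so only 6x6 matrices and the reduced camera system are ever inverted.

```python
import numpy as np

def schur_solve(A, B, D_blocks, g_c, g_m):
    """Solve [[A, B], [B.T, D]] @ [dc, dm] = [g_c, g_m], D block-diagonal (6x6)."""
    S = A.copy()        # becomes the Schur complement A - B D^{-1} B.T
    rhs = g_c.copy()    # becomes g_c - B D^{-1} g_m
    for j, Dj in enumerate(D_blocks):
        Bj = B[:, 6 * j:6 * j + 6]
        BjDinv = Bj @ np.linalg.inv(Dj)            # small 6x6 inverse per marker
        S -= BjDinv @ Bj.T
        rhs -= BjDinv @ g_m[6 * j:6 * j + 6]
    dc = np.linalg.solve(S, rhs)                   # reduced camera-only system
    dm = np.concatenate([
        np.linalg.solve(Dj, g_m[6 * j:6 * j + 6] - B[:, 6 * j:6 * j + 6].T @ dc)
        for j, Dj in enumerate(D_blocks)
    ])                                             # back-substitution per marker
    return dc, dm
```

The loop over markers is where the linear dependence on the number of markers comes from, while the reduced system depends only on the number of cameras.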
5. Proposed Calibration Method
The principal limitation of the baseline sequential initialization method described above is the problem of error accumulation, which leads to substantial geometric discrepancies, particularly in closed-loop configurations of object-centric systems. To address this limitation, we propose a modified method that incorporates mechanisms for improved robustness and drift correction directly into the incremental scene reconstruction process.
The proposed method extends the baseline incremental calibration method with two key modifications aimed at improving accuracy and geometric consistency. First, a weighting scheme for observations is introduced in the PnP problem, where the contribution of each detection is determined by a heuristic reliability estimate. Second, the process is augmented with a multi-view triangulation stage, which is employed at each step of scene expansion to recompute the coordinates of all markers, thereby maintaining dynamic geometric consistency and preventing error accumulation.
5.1. Weighted PnP
Standard methods for solving the PnP problem, such as Efficient PnP (EPnP) [36] or iterative LM, minimize the sum of squared reprojection errors, assuming identical noise variance for all points. However, in real-world conditions, the quality of AprilTag marker detection varies significantly depending on observational conditions.
To account for this, a weighting function is introduced that assigns a confidence coefficient to each marker observation. Weights $w$ are incorporated into the summation of the PnP problem formulation (1), computed for each marker as follows:
where $s$ is the average side length of the detection quadrilateral (in pixels); $a$ is the average of the sines of the angles of the detection quadrilateral; $d$ is the Euclidean distance from the center of the detection quadrilateral to the principal point of the image; $D$ is the image diagonal length for normalization; the ratio $s / s_{\max}$ is clipped to $[0, 1]$, so any marker whose projected side length exceeds $s_{\max}$ pixels is considered fully reliable with respect to size and receives the maximum contribution from this term; and $s_{\max}$, $\alpha$, $\beta$ are regularization hyperparameters determining the contribution of each factor.
The proposed heuristic weighting formula accounts for three geometric factors affecting detection accuracy:
Projection size. Markers closer to the camera occupy a larger area on the sensor, improving the localization accuracy of their corners.
Viewing angle obliqueness. When a marker is observed at an oblique angle, its projection becomes distorted, increasing detection uncertainty.
Radial position. Detections near the image edges (far from the principal point) are more susceptible to residual lens distortion.
The hyperparameter values used throughout all experiments are
px,
, and
. The value
px represents the projected marker side length at which detection is considered fully reliable; it corresponds to the expected maximum projected size of a marker at the typical operating distances of the experimental setup. The exponent
is motivated by the Brown-Conrady distortion model: the dominant radial distortion term is cubic in the radial coordinate, so the rate at which corner localization error grows with distance from the principal point scales approximately quadratically, and the quadratic weight decay matches this order of nonlinearity. The exponent
defines a square-root (convex) function that assigns relatively more weight to small deviations from frontal viewing—where even modest obliqueness introduces measurable perspective error in corner localization—while compressing differences among highly oblique detections that are already heavily downweighted. To assess the sensitivity to this choice, a grid search over
was conducted on 10 synthetic datasets, described in
Section 6.1.1. A Kruskal-Wallis test found no statistically significant differences in either translation or rotation errors (calculated using Formulas (10) and (11) respectively) across the tested range. Since calibration accuracy is thus robust to the precise value of
, the value
is adopted as a natural default—it is the simplest nonlinear choice in the range
, introducing moderate convexity without collapsing to linear weighting (
) or a near-threshold behavior (
).
This weighting allows the PnP algorithm to minimize reprojection error non-uniformly across all visible markers, so that the optimization emphasis reflects detection reliability, thereby improving the stability of camera pose estimation.
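The three factors can be combined in code as in the following sketch. It is not a reproduction of Formula (5): the product form, the argument names, the exponent defaults (a square root for the sine term and a quadratic decay for the radial term), and the use of `1 - d / D` for the radial factor are all assumptions made for illustration, and `s_max` has no default because its value depends on the setup.

```python
import math

def detection_weight(corners, principal_point, image_size, s_max,
                     alpha=0.5, beta=2.0):
    """Heuristic weight of one quadrilateral detection (assumed product form).

    corners: four (u, v) pixel tuples ordered around the quadrilateral.
    Degenerate quadrilaterals with a zero-length side are not handled.
    """
    n = len(corners)
    sides, sines = [], []
    for k in range(n):
        p_prev, p, p_next = corners[k - 1], corners[k], corners[(k + 1) % n]
        e1 = (p_prev[0] - p[0], p_prev[1] - p[1])
        e2 = (p_next[0] - p[0], p_next[1] - p[1])
        l1, l2 = math.hypot(*e1), math.hypot(*e2)
        sides.append(l2)                                   # each side counted once
        sines.append(abs(e1[0] * e2[1] - e1[1] * e2[0]) / (l1 * l2))
    s = sum(sides) / n                                     # average side length
    a = sum(sines) / n                                     # average sine of angles
    cx = sum(p[0] for p in corners) / n
    cy = sum(p[1] for p in corners) / n
    d = math.hypot(cx - principal_point[0], cy - principal_point[1])
    D = math.hypot(*image_size)                            # image diagonal
    size_term = min(s / s_max, 1.0)                        # clipped to [0, 1]
    return size_term * a ** alpha * (1.0 - d / D) ** beta
```

Under these assumptions, a frontal square marker of side `s_max` centered on the principal point receives weight 1, and the weight decreases as the marker shrinks, skews, or moves toward the image border.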
5.2. Multiview Triangulation
A critical limitation of the baseline method is its single-shot estimation of marker positions: marker coordinates are fixed at the moment of first detection and never subsequently refined. This leads to error accumulation, as each marker inherits the error of the camera from which its position was estimated, compounds it with its own positioning inaccuracy, and propagates the combined error to the next camera added to the scene.
The proposed method eliminates this drawback by refining marker positions after each new camera is added. Specifically, multi-view triangulation is performed for each $j$-th marker from all already calibrated cameras that observe it. At the moment when the $i$-th camera is added, if the position of the $j$-th marker it observes is already known, this implies that the marker has been observed by at least two positioned cameras, including the $i$-th, thereby enabling triangulation. If the $j$-th marker is encountered for the first time, its position is estimated by solving the PnP problem, as in the baseline method. For computational efficiency, the following triangulation algorithm is proposed.
First, the 3D coordinate of the $k$-th corner of the $j$-th marker is independently computed such that rays cast from the optical centers of all cameras participating in the triangulation towards this point minimize the deviation from the rays directed toward the corresponding 2D detection coordinates on the image planes. Formally, this is achieved by solving an optimization problem minimizing the squared norms of the cross products of the ray pairs (i.e., the squared deviations from collinearity) in homogeneous coordinates:
$$\tilde{X}_{jk}^{*} = \arg\min_{\tilde{X}_{jk}} \sum_{i \in \mathcal{C}_j \cap \mathcal{I}} \left\| \tilde{x}_{ijk} \times P_i \tilde{X}_{jk} \right\|^{2}, \quad (6)$$
subject to $\|\tilde{X}_{jk}\| = 1$ to exclude the trivial zero solution, where $\tilde{X}_{jk}$ are the homogeneous coordinates of the $k$-th corner of the $j$-th marker; $\tilde{x}_{ijk}$ are the homogeneous coordinates of the detection of the $k$-th corner of the $j$-th marker on the image plane of the $i$-th camera; $P_i$ is the projection matrix from the 3D world coordinate system to the image plane of the $i$-th camera; $\mathcal{C}_j$ is the set of indices of all cameras observing the $j$-th marker; $\mathcal{I}$ is the set of indices of all cameras already positioned at the current iteration; $\|\cdot\|$ is the Euclidean norm; and $\times$ denotes the cross product.
Problem (6) represents a classical formulation for the Direct Linear Transformation (DLT) method [37]. The solution is obtained analytically and has a linear computational complexity of $O(N_c)$ in the worst case, where the marker is triangulated from all $N_c$ cameras simultaneously.
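A compact NumPy sketch of DLT triangulation is given below. It uses the two linearly independent rows of each cross-product constraint rather than all three, which is the common textbook variant, and takes the right singular vector of the smallest singular value as the homogeneous solution. The function name and the input format (lists of 3x4 projection matrices and pixel pairs) are assumptions.

```python
import numpy as np

def triangulate_dlt(projections, pixels):
    """Triangulate one 3D point from several views.

    projections: list of 3x4 projection matrices; pixels: list of (u, v).
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        rows.append(u * P[2] - P[0])   # from x cross (P X) = 0, first component
        rows.append(v * P[2] - P[1])   # second independent component
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]                        # unit-norm minimizer of ||A X||
    return Xh[:3] / Xh[3]              # back to inhomogeneous coordinates
```

The stacked matrix has two rows per camera, so the cost of building it grows linearly with the number of views.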
Since the geometry of the marker is known a priori, the next step estimates the rotation and translation that best align the template 3D representation of the marker $T$ with the triangulated points. This problem is solved analytically using the Kabsch-Umeyama method [38] via the singular value decomposition (SVD) of the point covariance matrix, with a constant asymptotic complexity of $O(1)$.
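The alignment step can be sketched as a standard SVD-based rigid fit without scale. The function name and array layout (one point per row) are assumptions; the sign correction on the last singular direction keeps the result a proper rotation, which matters for planar templates such as a four-corner marker.

```python
import numpy as np

def rigid_align(template, target):
    """Find R, t minimizing sum_k ||R @ template[k] + t - target[k]||^2."""
    mu_a = template.mean(axis=0)
    mu_b = target.mean(axis=0)
    H = (template - mu_a).T @ (target - mu_b)     # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_b - R @ mu_a
    return R, t
```

Since the template always has four corners, the matrices involved have fixed size, which is consistent with the constant cost stated above.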
This extension to the baseline algorithm ensures that the position of each marker is always consistent with all cameras that observed it. Consequently, error accumulation is mitigated as the camera chain grows.
More formally, the proposed method is described in Algorithm 2.
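| Algorithm 2: Proposed extrinsic calibration method |
| Input: | $N_c$: number of cameras; $N_m$: number of markers; $\mathcal{M}_i$: sets of marker indices visible from the $i$-th camera; $\mathcal{C}_j$: sets of camera indices observing the $j$-th marker; $x_{ij}$: detections of the $j$-th marker on the $i$-th camera image plane. |
| Output: | $R_i^c$: rotations of all cameras; $t_i^c$: translations of all cameras; $R_j^m$: rotations of all markers; $t_j^m$: translations of all markers. |
| 1. | $\mathcal{U} \leftarrow \{1, \dots, N_c\}$ | Set of uninitialized cameras indices |
| 2. | $\mathcal{P} \leftarrow \varnothing$ | Set of positioned markers indices |
| 3. | while $\mathcal{U} \neq \varnothing$ do | |
| 4. | if $\mathcal{P} = \varnothing$ then | |
| 5. | $i \leftarrow \arg\max_{u \in \mathcal{U}} \lvert \mathcal{M}_u \rvert$ | Pick camera with strict max observations |
| 6. | $R_i^c \leftarrow I$ | Initialize camera rotation as identity matrix |
| 7. | $t_i^c \leftarrow 0$ | Initialize camera translation by zeros |
| 8. | else | |
| 9. | $i \leftarrow \arg\max_{u \in \mathcal{U}} \lvert \mathcal{M}_u \cap \mathcal{P} \rvert$ | Pick camera with strict max observations of positioned markers |
| 10. | $\mathcal{R} \leftarrow \{R_j^m : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers rotations |
| 11. | $\mathcal{T} \leftarrow \{t_j^m : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers translations |
| 12. | $\mathcal{X} \leftarrow \{x_{ij} : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of visible markers detections |
| 13. | $\mathcal{W} \leftarrow \{w_{ij} : j \in \mathcal{M}_i \cap \mathcal{P}\}$ | Set of detections weights estimated by (5) |
| 14. | $(R_i^c, t_i^c) \leftarrow \mathrm{WeightedPnP}(\mathcal{R}, \mathcal{T}, \mathcal{X}, \mathcal{W})$ | Estimate camera pose by solving weighted PnP |
| 15. | end if | |
| 16. | $\mathcal{U} \leftarrow \mathcal{U} \setminus \{i\}$ | Mark camera as initialized |
| 17. | for each marker $j \in \mathcal{M}_i$ do | |
| 18. | if $j \notin \mathcal{P}$ then | |
| 19. | $(R_j^m, t_j^m) \leftarrow \mathrm{PnP}(R_i^c, t_i^c, x_{ij})$ | Estimate marker pose by solving PnP |
| 20. | else | |
| 21. | $\mathcal{R}^c \leftarrow \{R_u^c : u \in \mathcal{C}_j \setminus \mathcal{U}\}$ | Set of cameras rotations observing marker |
| 22. | $\mathcal{T}^c \leftarrow \{t_u^c : u \in \mathcal{C}_j \setminus \mathcal{U}\}$ | Set of cameras translations observing marker |
| 23. | $\mathcal{X}^c \leftarrow \{x_{uj} : u \in \mathcal{C}_j \setminus \mathcal{U}\}$ | Set of marker detections from all cameras |
| 24. | $(R_j^m, t_j^m) \leftarrow \mathrm{Triangulate}(\mathcal{R}^c, \mathcal{T}^c, \mathcal{X}^c)$ | Estimate marker pose by triangulation |
| 25. | end if | |
| 26. | $\mathcal{P} \leftarrow \mathcal{P} \cup \{j\}$ | Mark marker as initialized |
| 27. | end for | |
| 28. | end while | |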
5.3. Method Complexity
The computational complexity of the baseline method for scene initialization is the sum of the complexities of solving PnP problems at each iteration. Since positioning each marker involves only a single camera, the complexity of this operation is independent of $N_c$ or $N_m$ and is $O(1)$. The total complexity of positioning all markers is $O(N_m)$. To position the $i$-th camera, the PnP problem is solved using all positioned markers visible from that camera. In the worst case, where the $i$-th camera observes all markers, solving the problem for a single camera has a complexity of $O(N_m)$. Thus, the complexity of the baseline initialization method is $O(N_c N_m)$.
However, as mentioned previously, the complete calibration process of a multi-camera system almost never proceeds without Bundle Adjustment (BA). The complexity of BA is estimated as $O(T(N_c^{2} N_m + N_c^{3}))$ or $O(T N_c^{2} N_m)$ under the assumption that $N_m \gg N_c$, which clearly dominates the complexity of scene initialization by the baseline method.
The initialization method proposed in this paper introduces two heuristics to the baseline method:
Weighted PnP. Computing the weights using heuristic Formula (5) is performed in $O(1)$ per detection, as it is independent of $N_c$, $N_m$, or any other variable quantities.
Multi-view Triangulation. In the worst case, when adding the $i$-th camera, the positions of all $N_m$ markers are recalculated. As discussed in Section 5.2, the complexity of triangulating a single marker is $O(N_c)$; therefore, $O(N_c N_m)$ operations are performed for each new $i$-th camera. The resulting complexity added by the triangulation process to the baseline method is $O(N_c^{2} N_m)$, which does not exceed the complexity of a single BA iteration.
Thus, the complexity of the entire calibration process, comprising the proposed initialization method and BA, remains the same as that of the baseline method, amounting to $O(T(N_c^{2} N_m + N_c^{3}))$ in the general case.
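The quadratic dependence of the triangulation overhead on the number of cameras can be checked with a short worked sum. As an assumption for illustration, let $N_c$ be the number of cameras and $N_m$ the number of markers, let triangulating one marker from $i$ positioned cameras cost at most $c\,i$ operations for some constant $c$ (a symbol introduced here), and let all markers be re-triangulated at every camera addition:

```latex
\sum_{i=1}^{N_c} c \, i \, N_m
  \;=\; c \, N_m \, \frac{N_c (N_c + 1)}{2}
  \;=\; O\!\left(N_c^{2} N_m\right)
```

This bound is incurred once during initialization, whereas a cost of the same order is paid in every BA iteration.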
Incorporating the proposed scene initialization method does not alter the asymptotic complexity class of the entire calibration process. The primary advantage of the method lies in a substantial improvement in the quality of the initial estimate of scene parameters and the associated error covariance structure. This, in turn, leads to a reduction in the number of BA iterations required to achieve convergence, thereby reducing overall computational costs and enhancing calibration reliability in the presence of noisy data.
Empirical support for this claim is provided by the bundle adjustment convergence comparison in Section 7. The proposed initialization consistently reduces the number of BA iterations required to reach the stopping criterion, confirming that the additional cost incurred at the initialization stage is offset by faster convergence of the global optimization.
7. Ablation Study
Since the proposed method is a modification of the baseline achieved by adding two key extensions, an ablation study is performed to evaluate the contribution of each extension separately. To this end, two additional configurations of the proposed method were evaluated on the real data: one with PnP weighting disabled, and another with multi-view triangulation disabled.
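For the comparison, the per-corner reprojection error metric $e_{ijk} = \left\| x_{ijk} - \pi(\hat{R}_i^c, t_i^c, X_{jk}) \right\|$ was selected for all cameras $i \in \{1, \dots, N_c\}$, for all corners $k \in \{1, \dots, 4\}$ of each marker $j \in \mathcal{M}_i$, where $\mathcal{M}_i$ is the set of indices of markers for which detections are available on the $i$-th camera.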
Figure 11 presents a comparison of error distributions for all aforementioned method variants in the form of a boxplot.
As can be seen from the figure, the primary contribution to the quality of the results is made by the multi-view triangulation of markers, which actively counteracts the sequential error accumulation characteristic of the baseline calibration method. PnP weighting, on the other hand, does not yield a comparable improvement on its own, but it reduces both the median and the bulk of the error distribution, albeit at the cost of slightly higher outlier errors. Importantly, the joint application of both extensions yields a significant improvement over their individual use.
Table 2 reports the exact median and mean reprojection errors for each method variant, providing a quantitative basis for the comparisons discussed below.
To assess the statistical significance of the observed differences, Wilcoxon signed-rank tests were performed on the metric $e_{ijk}$ for all cameras $i$, for all corners $k$ of each marker $j$ visible from the $i$-th camera, for all pairwise ablation comparisons at a significance level of
. Statistically significant results were obtained in all cases, including the weighted PnP component both in isolation and when added on top of the triangulation-only variant, confirming that the two extensions act as complementary rather than redundant mechanisms. As shown in
Table 2, the weighted PnP scheme reduces the median reprojection error by approximately 9% relative to the baseline method, while the triangulation component reduces it by 86%. The full proposed method achieves a further 15% reduction in median error compared to the triangulation-only variant.
Furthermore, global optimization was executed for all versions of the proposed method as well as the baseline method, solving problem (2) using the Trust Region Reflective (TRF) method with a stopping criterion of
.
Figure 12 presents plots of the objective function value versus the number of function evaluations. Insets magnify specific regions to facilitate comparison between the methods.
The plot confirms that the proposed method reduces the number of iterations required for bundle adjustment convergence, thereby lowering the overall computational cost of the calibration process. At the same time, it can be observed that the application of PnP weighting, compared to the baseline method, ultimately results in a lower (better) final value of the objective function, despite a slightly higher initial value.
The ablation results also demonstrate that the two components of the proposed method can be deployed independently. The weighted PnP scheme achieves a statistically significant improvement at $O(1)$ additional cost per iteration. The triangulation component provides the dominant accuracy gain at an initialization overhead of $O(N_c^{2} N_m)$, offset by a reduction in the number of subsequent bundle adjustment iterations required for convergence.
8. Conclusions and Future Work
This paper presents an incremental extrinsic camera parameter initialization method for multi-camera systems, which improves upon the baseline iterative registration algorithm based on solving the PnP problem. The key features distinguishing the proposed method from the baseline are the integration of heuristic weighting for AprilTag marker detections and the use of multi-view triangulation for the dynamic refinement of marker spatial coordinates at each step of scene expansion. Theoretical analysis demonstrates that the incorporation of these mechanisms does not increase the overall computational complexity of the calibration process when global optimization is included, preserving the overall asymptotic complexity of $O(T(N_c^{2} N_m + N_c^{3}))$.
A series of experiments was conducted on synthetic and real-world data to evaluate the effectiveness of the proposed method. Synthetic data enables direct comparison between estimated and ground-truth camera parameters by aligning coordinate systems using the Kabsch-Umeyama algorithm. Validation on real-world data was performed using epipolar geometry metrics, the final reprojection error, and the angular error. The experimental results confirm that the proposed method yields more accurate camera localization and effectively mitigates geometric drift compared to the baseline. An ablation study isolating the contribution of each component reveals that multi-view triangulation is the primary factor in preventing geometric discrepancies. Furthermore, the weighted PnP scheme reduces the median positioning error and decreases the number of iterations required for the subsequent convergence of the global optimization (bundle adjustment).
Several promising directions for future research could further enhance the proposed calibration method. First, since the proposed method involves minimizing nonlinear functions at various stages, it would be worthwhile to investigate alternative optimization methods [18, 43, 44, 45]. Modern population-based and heuristic algorithms could improve robustness to outliers and reduce the risk of convergence to local extrema.
Second, several extensions of the proposed weighting scheme merit investigation. A natural extension would be to integrate the proposed confidence metric directly into the bundle adjustment stage. Weighting the loss function during global optimization according to the accumulated heuristic detection confidence could allow the algorithm to penalize deviations of reliable observations more heavily, thereby further improving final calibration accuracy. Finally, the proposed weighting scheme could be used to refine the criterion for selecting the next camera during incremental addition, prioritizing configurations with the highest total observation reliability rather than simply the largest number of shared markers.