High Precision Positioning with Multi-Camera Setups: Adaptive Kalman Fusion Algorithm for Fiducial Markers

The paper addresses the problem of fusing the measurements from multiple cameras in order to estimate the position of fiducial markers. The objectives are to increase the precision and to extend the working area of the system. The proposed fusion method employs an adaptive Kalman algorithm which is used for calibrating the setup of cameras as well as for estimating the pose of the marker. Special measures are taken in order to mitigate the effect of the measurement noise. The proposed method is further tested in different scenarios using a Monte Carlo simulation, whose qualitative precision results are determined and compared. The solution is designed for specific positioning and alignment tasks in physics experiments, but also, has a degree of generality that makes it suitable for a wider range of applications.


Introduction
The evolution of technology has led to an increasing demand for solving complex problems which may be viewed as attempts to control and direct system behaviours towards desired states. The inherent complexity of problems and processes requires new approaches both in system modelling and in defining the emergent interaction with a highly dynamical and sparsely defined environment.
Cognitive approaches are successfully used in contexts where the boundary between the systems and the environment is fuzzy. However, they exhibit strong interrelation and interconnection, assisted by specific perception mechanisms [1]. Advances in complex control applications can be achieved only by considering adequate design approaches for sensory systems, especially in domains like environmental applications [2], health [3], industrial control, agriculture, etc.
Although new technologies such as wireless sensor networks or Internet of Things (IoT) are providing valuable solutions for appropriate perception mechanisms, complex issues are raised with the inclusion of data fusion, reliability, flexibility, reconfigurability, and cost of the measurement system. If some of these problems can be addressed during the modelling process, there are others (e.g., sensor positioning and sensor dynamics) that have a bigger impact on the generality of the overall solution.

Description of the Initial Problem
The large number of high repetition rate, ultrashort pulse, and high power laser facilities that will come online all around the world will require state-of-the-art tools to allow the harnessing of their full potential [5]. The high power lasers of the ELI-NP (Extreme Light Infrastructure-Nuclear Physics) user facility will be employed for a wide range of research topics like studies of nuclear induced reactions, dark matter search, material irradiation, or medical applications [6]. The development of the experimental setups for studying these topics requires specific instrumentation, while also having strict needs in terms of positioning and alignment, in order to ensure optimal experimental conditions.
As the setups are continuously changing, absolute position referencing is hard to achieve. This is a necessity, as multiple instruments need to be precisely positioned relative to each other during the experiment. Figure 1 displays an example of a setup for solid target alignment in which multiple optical diagnostics are positioned using motorised manipulators that have 3 to 6 degrees of freedom (DOF). Apart from the precision requirements, additional ones should be taken into consideration. The alignment should be done remotely because the experiments take place inside vacuum chambers and behind concrete walls used for radiation shielding. Moreover, to take advantage of the high repetition rate of the lasers and to maximise the beam-time available for the users, the positioning has to be performed in a limited amount time. In a previous work [7], we addressed this series of requirements and constraints by developing an automatic alignment system that is based on relative position measurements using imaging cameras and compact and flat fiducial markers, due to space and visibility constraints. The alignment algorithm was based on a real-time optimisation procedure which is the subject of a patent [8].
Although the fiducial markers were initially developed and used in augmented reality applications [9], due to their versatility in determining their position in a non-invasive manner (by imaging them with a camera), they were quickly adopted for a different range of applications. These include kinematic calibration and visual servoing for industrial robots [10,11], robot localisation and navigation [12][13][14], SLAM (simultaneous localisation and mapping) [15][16][17] and sensor fusion [18][19][20].
The motivation behind this work is to combine the measurements from multiple cameras in order to increase the working area of the system and to maximise the positioning precision. Simultaneously, the main focus is to present and test the precision of the proposed methods and algorithm, comparing the results with those recorded while using only one camera [21].

Method
The basic approach behind the proposed method is to use multiple cameras in order to improve the precision of estimating the pose of fiducial markers and to extend the working area of the system. It begins with the assumption that each camera provides measurements that are erroneous and noisy. The problem can be conceived through an analogy with the GPS system where distance measurements from multiple satellites are used to estimate the position of the receiver module with high precision. The key ingredient in the GPS system is to know very precisely the position of the satellites. The proposed positioning system needs to meet the same requirement for its cameras, but wasting too much effort on this task diminishes its practicality. Consequently it is only assumed that the cameras have unknown but fixed positions.
A related approach that has similar objectives can be found in [22]. The main benefit of the method is the possibility to calibrate the setup of cameras using a 3D feature with fiducial markers having unknown configuration. However, it is not meant for estimating the pose of single markers. In order to achieve this, additional methods are required.
Our method is developed using the ArUco fiducial markers [23,24]. The ArUco library detects and estimates the pose of the marker with respect to the camera using the solvepnp algorithm in which the pinhole camera model is data fitted using a Levenberg-Marquardt non-linear optimisation procedure. The output is represented by the extrinsic camera parameters which can be expressed in terms of a homogenous transformation T M C (the transformation between the camera attached reference frame and the marker attached one) detailed in [7].

Adaptive Kalman Fusion Algorithm for Multiple Cameras and Fiducial Markers
The method consists of fusing noisy measurements from multiple cameras, in order to improve the precision of estimating the exact position of the fiducial marker. In the scientific literature, multiple sensor fusion and noise filtering methods have been developed over the years. Among them there is the Kalman filter [25], which is the most widely used. The noise effect is mitigated by using a dynamical model for the physical process involved and a statistical model (covariance matrix) for the noise in the measurement process.
For a discrete linear state-space dynamical model in Equation (1), the Kalman filter estimates the value of the state vector x using noisy measurements for the input u and the output y.
In standard data fusion applications that use the Kalman filter, the pose of various objects is estimated using the input from accelerometer, gyroscope, and magnetometer sensors [26,27] and the output from distance measurement devices [28,29]. Our application makes impossible to use any type of attached sensors and, hence, we only rely on the "non-invasive" pose measurement using the solvepnp algorithm and imaging cameras.
Our approach is to employ a state-space model with free dynamics (where u is zero) and with an identity state matrix (A). In this respect, each measurement is made using only one snapshot image from all the cameras that are synchronised with the help of an electrical trigger signal. In this way, the position of the fiducial marker during the measurement is considered fixed. On the same set of images, the solvepnp algorithm is applied multiple times and, thus, the evolution of the measurement is caused only by the noise and not by the movement of the fiducial marker. Furthermore, the measurements are used to iteratively estimate the real position using the proposed algorithm.
In order to improve the results, we propose a procedure designed to update the statistical model of the noise. The Q and R covariance matrices (which are discussed below) are continuously adjusted using the newly acquired data. Any change of the noise behaviour and any existing correlations are captured and thus, the effect of the noise is mitigated more efficiently. Consequently, the proposed algorithm is an adaptive Kalman version.
The homogeneous transformation representation (T M C ) has numerous practical benefits, especially for pose composition and inversion operations, but it is not suitable in this circumstance because it is redundant (16 numerical values for expressing a 6 DOF pose). The solution is to use an equivalent representation which is composed of the set of translation coordinates (X, Y, and Z) and the set of Euler angles (AX, AY, and AZ).
In this particular case, the structure of the state vector is the real pose of the fiducial marker defined in Equation (2), where the subscript k denotes the present discrete-time sample of the state vector. ( The output vector is composed of the pose elements measured using all the cameras available. The structure of the output vector is defined in Equation (3), where the superscript i is indicating the index of the camera considered.
The discrete-time state-space model considered is defined in Equation (4), where I 6 is the identity matrix of rank 6, q k−1 is a random noise signal with normal distribution (white noise) that is quantifying the false evolution induced by the noisy measurement, the H is the measurement model matrix defined in Equation (5), and r k is the measurement noise also considered white noise.
The equations that describe the Kalman filter are presented in Equation (6). The first two equations give a rough state estimate using the dynamical model while the last 5 equations are used for improving the estimate using the newly acquired output sample. P is a covariance matrix that expresses the confidence degree of the state estimation, which is updated during the algorithm iterations, Q and R are the covariance matrices associated with the noises q and r respectively, v k is the rough estimation error (difference between the measured output and the predicted one using the first two equations), S k is the covariance matrix associated with the predicted output, and K k is the Kalman state update.
Predict the state The filter requires good estimates for the initial state x 0 and the covariance matrices Q and R. In the proposed algorithm this is achieved using an initialisation procedure. The position of the fiducial marker is measured for N i number of sampling times. The result is the set of samples defined in Equation (3) for k = 1, . . . , N i (the length of the initialisation) and i = 1, . . . , n (the number of cameras available).
Averaging the samples gives a good estimate for the initial state, which is built according to Equation (7), where E[·] is the mean operator (expected value). The Q and R matrices are computed from the same set of samples, assuming that there is no correlation between the noises affecting the measurements.
Q is a diagonal matrix defined in Equation (8) built using the mean variance of the pose elements (X, Y, Z, AX, AY, and AZ) along all the cameras. The R matrix is defined in Equation (9) where R i is defined in Equation (10), built using the variance for each pose element measured by each camera.
After the initialisation is finished, the Kalman algorithm is iterated for N e number of sampling times, while at each step, a new set of samples y k in the form of Equation (3) is measured and an estimated state trajectory is built (x k|k for k = N i + 1, . . . , N i + N e ).
Newly measured system outputs can be used to improve the statistical models of the noises for increased performance. Thereby, at each iteration, every new set is added to the initialisation set and the covariance matrices Q and R are updated using Equations (11)-(13) for k = 1, . . . , N e .
Considering that the estimated state trajectory (x k|k ) belongs to a system with free dynamics where, in the absence of the noise effects, the state should be constant, a final estimate with a better precision can be achieved by averaging the values of the estimated state trajectory according to Equation (14). The timeline of all the procedures that are involved in the proposed algorithm is presented in Figure 2. In Algorithm 1 the proposed procedure is summarised. [k]

Setup Calibration Procedure
Given a set of n cameras in a pre-defined fixed configuration, having unknown absolute positions, the purpose of the setup calibration procedure is to determine, in a precise manner, their relative positions. As depicted in Figure 3, the goal is to determine the set of homogenous transformations T Cj Ci where i = 1, . . . , n and j = i + 1, . . . , n. This requires the use of a precision gauge, which can be manufactured in the form of a cube of fiducial markers like in [21] or in any 3D shape where the markers can be viewed all around. The gauge needs to be manufactured or calibrated with increased precision. It can be seen as the absolute precision reference used to calibrate the entire positioning system. In Figure 4 The camera setup calibration procedure is done simultaneously for all the camera pairs C i , C j where i = 1, . . . , n and j = i + 1, . . . , n, with the goal of estimating the homogenous transformation T Cj Ci (as presented in Figure 5). It starts with placing the precision gauge inside the environment. Depending on the orientation, each camera sees a different portion of the gauge, i.e., from all n fiducial markers, C i can measure the position of n i markers and C j can measure only n j , where n i ≤ n and n j ≤ n. It is preferable that at least one fiducial marker M k is seen by both cameras, otherwise, it can be virtually determined using the gauge transformations.
The proposed algorithm estimates the T Mk Ci homogenous transformation using the noisy measurements from all n i markers using the C i camera. This transformation is expressed multiple times in terms of T Ml Ci pose measurement (where l = 1, . . . , n i ) using Equation (15). The T Mk Ml transformation is precisely known from the gauge.
The resultingT Mk Ci l pose is converted to translations and Euler angles which are used for building the output measurement vector of Equation (3). The proposed algorithm is further applied and the result is an estimation of T Mk Ci having a higher degree of precision and being closer to the real value. For the C j camera, T Mk Cj is estimated in a similar manner. Consequently, T Cj Ci transformation is computed using Equation (16).

Position Estimation Procedure
Having a calibrated camera setup, the aim of this procedure is to estimate the position of a fiducial marker attached to a specific instrument, according to the application.
In Figure 6, the conceptual diagram of this procedure is depicted. Depending on the configuration, the M marker can be seen only by a number of n m out of n cameras (n m ≤ n). An arbitrary camera (C r ) is chosen, which is considered the positioning reference.  The proposed algorithm is estimating the T M Cr homogenous transformation using the noisy measurements from each of the n m cameras. In a similar manner to the setup calibration procedure, the T M Cr transformation is expressed multiple times in terms of T M Ck pose measurement (where k = 1, . . . , n m ) using Equation (17). The T Ck Cr is precisely known from the setup calibration.
The resultingT M Cr k pose is converted to translations and Euler angles and used for building the output measurement vector of Equation (3). The proposed algorithm is further applied and the result is an estimation of T M Cr which has a higher degree of precision and is closer to the real value. This approach can be used in a similar manner for measuring the relative position of multiple fiducial markers in the environment, which is very useful in alignment tasks as presented in [7].

Monte Carlo Setup
There are two factors that contribute and affect the precision of estimating the pose of the fiducial marker. First, there are the physical and environment-related aspects, which include the optical specifications of the imaging system (the sensor and optical resolution, the focal length, the depth of field, and the field of view), the environment illumination conditions (how strong, uniform, and consistent the lighting is) and if the cameras are optimally positioned so as to achieve good viewing angles and maintain consistent accuracy along the working area. Secondly, there are factors related to the algorithms regarding how precise the pose can be estimated and how much the effect of the noise can be mitigated during the data fusion. In this paper we decouple the two types of factors and only consider the behaviour of our proposed method, so as to have a first qualitative assessment regarding the precision.
Consequently, we developed a simulation environment that aims to replicate how our method performs in a real setup, considering the noisy pose estimations it receives from the solvepnp algorithm. For an increased confidence in the results, we adopted a Monte Carlo approach in which we statistically analysed how the noise is propagated through our method and how the precision is affected. The simulation is implemented in MATLAB where positions of multiple cameras, precision gauges, and fiducial markers are virtually defined. In order to simulate the noisy input from the solvepnp, each position (that is supposed to be measured in the real environment) is disturbed with additive random noise having normal distribution. The noise is configured considering the precision limits we determined experimentally for one camera in our previous work [21]: for X and Y, 75 µm, for Z, 300 µm, and for AX, AY and AZ, 0.02 • . The mean value of the noise is 0 while the standard deviation (σ) was configured in such a way that 95.45% of the noise values are within the experimental limits (inside [−2σ, 2σ]).
The simulation puts in place three scenarios which are additionally used to assess the different contributions between the number of cameras and the number of the fiducial markers from the gauge: 1. 3 cameras, a precision gauge with 5 markers, and one marker whose position must be estimated; 2. 5 cameras, a precision gauge with 3 markers, and one marker whose position must be estimated; 3. 5 cameras, a precision gauge with 5 markers, and one marker whose position must be estimated.
The results are compared against estimating the pose using only one camera in the same environment. The Monte Carlo simulation performs 5000 iterations where, in each run and for each of the scenarios, the setup is calibrated using the gauge. The calibration is further used to estimate the pose of the fiducial marker which is compared with the true, predefined one. The N i and N e parameters are configured to 20 and 50 respectively. In a real application, the choice of these two parameters is a matter of cost optimisation, considering the computational effort, the resources available, and the required measuring frequency.
Compared with a real setup, we adopted one simplifying hypothesis. The angle of incidence of the marker relative to the camera, as we showed in [21], has an influence on the precision. Close to normal angles of incidence tend to bring more noise in the estimation. In this study, we only consider that the precision of the solvepnp algorithm to be consistent, regardless of the angle. However, in other respects, the simulation is considering the worst case scenario because of the following arguments:

•
Noises from different cameras and from different elements of the pose (X, Y, Z, AX, AY, and AZ) are considered completely uncorrelated. In real circumstances this might not be the case (e.g., the noise induced by the environment illumination which affects all the measurements in a similar manner) thus, any relaxed conditions are contributing to an increased precision of the estimation. This additional information is harnessed using the update procedure for Q and R covariance matrices (which in this case would no longer be diagonal); • The precision of one camera measurement along the Z axis is 5 times lower compared with the X and Y axis. In the simulation this is taken as it is, but in a real situation, this effect can be diminished by placing the camera setup in an optimal configuration. For instance, the measurement along the Z axis from one camera can be replaced by measurements from two cameras placed in lateral positions, for which the Z axis motion is decomposed in X and Y components that have a higher precision.

Results
Figures 7-12 presents the simulation results for estimating X, Y, and Z coordinates and AX, AY, and AZ orientation angles, which are given in terms of the probability density function (PDF) and the standard deviation (SD) of the error. The first plot from each Figure depicts the estimation error achieved using only one camera. The following three plots give the results achieved using the setups from each of the above-mentioned scenarios.
The results show that the proposed algorithm achieves an increase in precision which is close to an order of magnitude. It can also be observed that it is of greater importance to have more fiducial markers in the precision gauge instead of more cameras. In scenario #3, a slight decrease of precision is experienced in comparison to the #1 scenario. This is to be expected as each added camera is an additional noise source. However, the benefit of achieving a larger working area is far more important. In Table 1 the results are summarised and compared with regards to the limits of the variation interval [−2σ, 2σ] where it is expected that 95.45% of the errors will occur. This can be considered the precision that the positioning system is expected to achieve when using th proposed method.

Conclusions
With respect to the considered case study, the results show that the proposed method and algorithm have managed to successfully meet the objectives. The working area could be increased in accordance with the number of cameras in the setup. This is a decision-making procedure that needs to take into account the cost relative to the working area and the precision required. For our simulation scenarios, the precision increase was close to an order of magnitude, which was around 10-15 µm for X and Y coordinates, 30 µm for Z and 0.002 • for AX, AY, and AZ orientation angles.
The cost is an important benefit of the system compared to other solutions like laser trackers which tend to be extremely expensive. In addition to that, our proposed method could achieve relative and simultaneous positioning of multiple fiducial markers, which supports the development of advanced applications. Future work will include a complete analysis of the method in a real environment where all physical and algorithm-related factors are considered, and a precision comparison against other methods presented in the literature.