An Indoor Positioning Approach Based on Fusion of Cameras and Infrared Sensors

A method for infrared and cameras sensor fusion, applied to indoor positioning in intelligent spaces, is proposed in this work. The fused position is obtained with a maximum likelihood estimator from infrared and camera independent observations. Specific models are proposed for variance propagation from infrared and camera observations (phase shifts and image respectively) to their respective position estimates and to the final fused estimation. Model simulations are compared with real measurements in a setup designed to validate the system. The difference between theoretical prediction and real measurements is between 0.4 cm (fusion) and 2.5 cm (camera), within a 95% confidence margin. The positioning precision is in the cm level (sub-cm level can be achieved at most tested positions) in a 4×3 m locating cell with 5 infrared detectors on the ceiling and one single camera, at distances from target up to 5 m and 7 m respectively. Due to the low cost system design and the results observed, the system is expected to be feasible and scalable to large real spaces.


Introduction
The framework of this proposal is positioning in indoor Intelligent Spaces. These kinds of Local Positioning Spaces (LPSs) are complex environments in which several sensors collect information, multiple agents share resources and position and navigation of mobile units appear as main tasks [1,2]. Under such complex conditions, having sensors with complementary capacities is necessary to fulfill all requirements satisfactorily.
At the end of this section we include a table (Table 1) with a comprehensive summary of the indoor positioning outlook presented in this introduction, according to accuracy and cost features, together with a description of the applications and some comments on their main strengths and weaknesses.
Depending on the application goals, a rough, but useful, classification may divide the indoor positioning systems in non-precise systems (from some tens of cm to 1 m level) or precise ones (1 to 10 cm). The former are typically human-centered applications in which m-level or room-level accuracy may be enough to fulfill the requirements (for example, localization of people or objects in office buildings). They are usually user-oriented applications based on portable technologies as mobile phones [3], Inertial Measurement Units (IMUs), etc. [4], or indoor local networks (WiFi, Zigbee, Bluetooth) [5]. They exploit the benefits of having an available infrastructure in the environment reducing, consequently, costs and sensor design effort at the expense of reaching lower precision than an ad-hoc sensorial positioning system, as the networks are originally conceived for communication SLAM, Simultaneous Localization and Mapping) and architecture (self-localization systems on-board of a mobile unit). Their main challenges and achieved performance are therefore hard to compare to the proposal described in this paper, as will be seen next. Another VLC application for positioning with three Light-Emitting Diodes (LEDs) and a fast camera, with fast code (from LEDs) processing, achieving precisions better than 10 cm in a 40 m 2 area, can be found in [38]. A similar application to the latter, although making use of a mobile phone camera, is proposed in [39], showing decimeter level precisions. Many positioning systems are based on RSS measurements [15], or AOA [11,34], but precise optical telemetry is usually based on phase measurements [40].
Summarizing such a complex scenario with so high heterogeneity in the solutions proposed, we can ascertain that cm-level precision (below 10 cm) is quite difficult to achieve, not only in prepared experimental setups typically referred in the literature but, and specially, under realistic conditions in real environments. Home applications can cope with m-level and room-level accuracies while industrial ones may need (robust) performance down to 1 cm and need accurate ad-hoc solutions. Additionally, low cost solutions are very interesting in industry as, in many cases, large spaces must be covered (needing a high number of resources and devices for this purpose). The system we present here is intended to aim at meeting these requirements.
In this context, our proposal is a localization system developed with a phase-shift IR localization subsystem, composed of five receivers acting as anchors and an emitter acting as target, fused with a camera localization one, with a maximum likelihood (ML) approach. IR and camera models are developed and used for variance propagation to the final position estimation. Additive White Gaussian Noise (AWGN) hypothesis is supposed for the IR and camera observations in these models. For this, IR measurements must be mostly multipath (MP) free, which can be reasonably assumed in sufficiently large scenarios with low reflectivity of walls and ceiling, or by implementing MP mitigation techniques [41] or oriented sensors (as in [42] in an IR communications framework). The IR positioning system, with cm level precision, was successfully developed and shown in the past, and a model that relates the variances of position estimate to the target position was derived. The novelty lies in the development of another model to deduce observation variances and further propagation to camera position estimate by means of an homography, so that it can be used in the fused final position estimate. The novelty also lies in the fused sensor system itself as a measuring unit, which performs robustly delivering precisions in the cm level, and at the same time matches the models stated. To our knowledge there are no precise positioning systems with data fusion of a phase-shift IR system, developed ad-hoc with wide angle simple devices (IR LED emitter and photodiodes) and a single low cost camera detecting passive landmarks, performing with cm-level in ranges of 5 to 7 m.
In Section 2 the method description is presented. The IR sensor, camera and fusion estimation models are included in Sections 3-5 respectively, and evaluated with Monte Carlo simulations. Results on a real setup and the comparison with simulations to validate theoretical prediction from the models are presented in Section 6. A summary of key concepts derived from results discussion is included in the final conclusions in Section 7.

Method Description
A block level description of the strategy proposed is depicted in Figure 1. We consider a basic locating cell (BLC) where a target (a mobile robot) is to be positioned. This BLC is covered by an infrared positioning set (IR set hereafter) composed of several IR detectors and one single camera sharing the locating area. In this arrangement, the position X of the target T, defined as X = (x, y, z T ) T , is to be obtained. This target is an IR emitter for the IR sensor and a passive landmark for the camera. It lies in a plane with fixed (and known) height z T , hence positioning is a 2D problem where the coordinates x, y of T in such plane are sought. After the IR and camera processing blocks, two position estimates, X IR and X c , are obtained from the IR and camera sensors respectively. X IR is attained by hyperbolic trilateration from differences of distances [45] and X c is obtained by projecting the camera image plane onto the scene plane by means of a homography transformation. We assume both estimates are affected by bi-dimensional zero-mean Gaussian uncertainties, represented by their respective covariance matrices ∑ IR , ∑ c . A fusion stage with a maximum likelihood (ML) approach is carried out yielding a final estimate X F with expected lower uncertainty values than the original IR and camera ones. The information about the final precision is contained in the resulting covariance matrix ∑ F , which is addressed in Section 5. A deep explanation of the IR positioning system (developed in the past) can be read in [17,45]. Some relevant aspects are recalled herein though, for better understanding of the proposal. In Figure 1, the IR BLC is composed of N + 1 receivers acting as anchors (A i ), one of them as common reference (A r ), being A i and A r the coordinate vectors A i = (x i , y i , z 0 ) T and A R = (x r , y r , z 0 ) T , with z-coordinate fixed and known, equal to z 0 . The coordinates of T are achieved by hyperbolic trilateration (HT) from the differences of distance measurements r ir , from T to each one of the anchors A i (i = 1...N) and from T to A r . These r ir are obtained from differential-phase of arrival (DPOA) measurements, φ ir , between A i and A r [17]. Nevertheless, the quantities r ir will be named as observations hereafter as only a constant factor is needed to convert φ ir into r ir . Every observation r ir is assumed to have additive white Gaussian noise (AWGN) with variance σ 2 ir . The x,y variances of the position IR estimate (which are terms of ∑ IR ) result from the propagation of σ 2 ir through the HT algorithm. This process, addressed in the next section, involves a set of positioning equations solved by non-linear least squares (NLLS) plus a Newton-Gauss recursive algorithm [46]. Note that considering the error as unbiased AWGN means there are no remaining systematic or other biasing error contributions (cancelled after error correction and calibration, or considered negligible compared with random contributions). This includes multipath (MP) errors, which means working in an MP free environment (large open spaces with low wall reflectivity or else, MP cancellation capabilities [41,47] or working with orientable detectors [42]).
On the other hand, the camera captures the scene and the center of a landmark placed on T is detected yielding its coordinates in the BLC locating plane (in the scene) after image processing (landmark identification and center detection algorithm) and further projection from the camera image plane to the scene. As already mentioned, this is carried out by means of a homography so that a bijective correspondence between the image plane and the scene plane makes it possible to express the real world coordinates as a function of pixel coordinates in the image plane (and reciprocally). The homography is applied to the rectified image after camera calibration.
Once the estimates X IR and X c are known, after an ML fusion procedure we obtain X F as where the weights W 1 and W 2 depend on the IR and camera covariance terms, which in turn also depend on the x,y coordinates of T. Detailed models for the X IR and X c estimates and their respective covariance matrices ∑ IR , ∑ c have been derived. We aim at two goals with these models: on the one hand, the models are the key theoretical basis to obtain the final fused estimation, as the covariance matrices acting as weights are obtained from said models. On the other hand, they are a very useful tool for the designer of the positioning system as low level parameters can be easily tuned and allow for evaluating the effect on precision, from sensor observation level to final fused estimation of position. It is also convenient to describe here the set of errors used to quantify the uncertainty of the position estimate, either delivered by simulation or by real measurements, and either referred to IR, camera or fusion. Considering an ellipse of N position estimations, the error metrics used are: 1. Errors in x,y axes: for every position in the test grid, the uncertainty in the x and y axis is assessed by the standard deviation in x or y respectively. This information is contained in the covariance matrix. This applies to Monte Carlo simulations or real measurements. 2. Maximum and minimum elliptical errors: the eigenvalues of the covariance matrix are the squares of the lengths of the 1-sigma confidence ellipsoid axis, and can be easily computed by a singular-value decomposition (SVD) of this covariance matrix. Hence, given a generic covariance matrix Σ θ of a set of observations in R 2 with coordinates x,y, after this SVD decomposition the two eigenvalues λ 1 , λ 2 are obtained in a matrix Σ θ . This matrices have the form: The deviations with respect to these axes provide more spatial information about the uncertainty in each position. We will refer to these deviations as elliptical deviations, or elliptical errors. The complete spatial information of the confidence ellipsoid would include also the rotation angle of its axis with respect to, for instance, x axis. This is also easily computed, if needed, through the aforementioned SVD decomposition. In our case, the axis length is enough to assess the dimensions of the uncertainty ellipsoid and its level of circularity.
The criterion for considering one of them depends on the specific uncertainty description requirements as will be addressed along the paper as needed. The results delivered by Monte Carlo simulations, running the models aforementioned, will be compared with real measurements in a real setup.

Infrared Estimation Model
We developed an IR sensor in the past with cm-level precision in MP-free environments or with MP mitigation capabilities [45]. Three receivers (anchors) at least must be seen by the IR emitter (target) so that trilateration is possible. More receivers are usually used to increase precision as real measurements are always spoiled by additive noise. A 3 × 3 m IR BLC composed of five receivers was proposed in [17] so that three, four or five can be used in different configurations as needed (one of them being the common reference). For scalability purposes, this 5-anchor BLC must be linked with other BLCs and high level strategies must be defined too. Nevertheless, this question fell out of the scope of that work and is not considered here either.
In Figure 2 the IR link elements are depicted, summarizing the basic parameters involved in a DPOA measuring unit. Therefore, just the emitter placed at T and one of the receivers located at A i together with the reference A r are depicted, followed by an I/Q demodulator. As introduced in Section 2, and explained in detail in [17] the observation φ ir is the DPOA ir (phase difference between target T and receivers A i and A r respectively), directly obtained at the output of each I/Q demodulator. Regarding the IR model parameters (which encompasses radiometric, devices and electronics ones), the link T-A r is the same as any of the other T-A i , as all receivers, including the reference one, are implemented with the same devices. It is assumed that φ ir ∼ N φ ir , σ 2 φ ir , where φ ir is the DPOA true value. The corresponding observed differential range r ir is directly obtained as r ir = c 2π f φ ir and we also assume r ir ∼ N d ir , σ 2 ir , where d ir is the distance-difference true value (i.e., d ir = ||T − A i || − ||T − A r ||, and ||·|| represents the 2-norm operator). In order to derive an IR model for the measurements, we consider r ir defined as r ir = r i − r r , being r i ∼ N d i , σ 2 i , d i the true value of the distance between T and A i , and σ 2 i the variance of r i (related to one single anchor A i ). This also applies to r r (with i = r). The variance term σ 2 i can be modeled as a function of X and the IR measuring-system parameters, as will be shown further. This way, considering r i and r r as uncorrelated variables, we can compute σ ir as: Note that while r ir are real observations from the measuring system (the I/Q demodulator directly delivers a phase-difference DPOA ir ), r i and r r are virtual single-anchor observations, defined to derive the model.
As demonstrated in [48], the variance σ 2 i can be expressed as the inverse of the signal to noise ratio at every anchor (SNR i ), and can be modelled as follows where the factor γ is proved to be γ = 1 [48], d i is the Euclidean distance (true value) between A i and T, and K IR is a constant encompassing all parameters of the IR system (including devices, electronics, noise and geometry), as follows: All parameters appearing in (5) are known: P e is the IR emitted power per solid angle unit, A s and R are the photodiode sensitive area and responsivity respectively, G A is the i-v converter gain, K F is the filter gain (after i-v stage), K I/Q is the I/Q demodulator gain (ideally unity gain) and H is the receiver's height (measured from the emitter z-coordinate). The terms in the denominator, η and BW N , are the noise power spectral density and noise bandwidth respectively. Consequently, the quantity in the denominator is the total noise power which, once the receiver parameters are fixed, is constant [17]. The distance d i in (4) is also valid for d r making i = r. The relation in (4) is a very useful tool as it allows for expressing the variance terms as a function of the coordinates of distance d i , hence as a function of the coordinates x, y of the sought target T (given that the coordinates of A i are constant) and fixed parameters grouped together in a single constant. Therefore, (4) and (5) establish the link between the random contributions in the observations and the coordinates X of the target. The covariance matrix of the observations r ir is where every term σ 2 ii in the diagonal represents the variance σ 2 ir term defined in (3) of the observation r ir . The covariance matrix is not diagonal, as the d r distance term is present in all the r ir terms. As said, the covariance matrix modelled this way links the position of the target in the navigation space with the uncertainty in the observations. Every diagonal term is computed as in (3) using the relation in (4), where d i and d r directly depend on the target coordinates through (6). The non-diagonal terms are directly deduced as a function of d r from expressions (4) to (6) The position X of the target is achieved by hyperbolic trilateration. From N observations r ir we can write where n ir is an AWGN contribution with variance σ 2 ir . This is: An estimate for X = (x, y) T can be typically obtained by nonlinear (unweighted) least squares (NLLS) as follows: where ε is an N × 1 column vector formed by the N residuals ε i (noise terms in (9)), i.e.,: which are the Gaussian uncertainty terms in the measurement of r ir . (10) can also be typically solved iteratively with a Newton-Gauss algorithm, yielding a recursive solution: The N × 2 matrix J ε in (12) is the Jacobian of ε with respect to variables x and y (z is constant, as mentioned), with terms where d i and d r are computed as in (6). The covariance matrix of the NLLS estimator in (12) is then [49,50].
computed at each x, y coordinates. The IR estimate obtained in (12) and the covariance matrix in (14) will be used in the final fusion estimate, in the form of (1). Note that the covariances in (7) are not obtained from experimental measurements but from the expressions derived from the IR model. Besides providing the weights in the fusion estimate, this covariance matrix can be also used for off-line simulations in order to evaluate the results under different configurations prior to real performance. In Figure 3 the simulation results of the model explained for realistic conditions are displayed. As shown, in Figure 3a the whole considered test grid is depicted, with a 3 × 3 m IR set with 5 anchors placed at 2.7 m height (four in the corners and one in the center as common reference) covering a synthetic 4 × 3 m BLC. The additional area out of the IR set can be useful for transitions between BLCs or just for widening the area covered by one IR set, although with less precision (in the extra area out of the IR set the dispersion of the estimations increases). A set of 63 positions with 0.5 m separation between consecutive ones has been tested, with 100 realizations at each position. One of the locations is zoomed so that the observations cloud can be seen. The elliptical errors are evaluated at every position upon the observation ellipse (as the one zoomed). The IR link features in this simulation are: emitting power (P e ) 75 mW/sr, detector area (A s ) 100 mm 2 , responsivity (R) 0.64, i-v gain factor (G A ) 33 × 10 3 , filter gain (K F ) and I/Q demodulator gain (K I/Q ) are both unity factor, noise power spectral density (η) 1.34 × 10 −11 W/Hz and noise equivalent bandwidth (BW N ) 30π/2 Hz. These are the values for the parameters appearing in (5) with the geometrical parameter H set to 2.7 m. In Figure 3b the values of these errors along the whole test cell are displayed (indexed starting at the left-bottom corner of the grid and growing in columns up to the upper right corner). As shown, under the conditions aforementioned, less dispersion is observed in the central area. The uncertainty in x, y axis are in a margin between 1 cm and 2.3 cm and between 0.9 cm and 2.1 cm values for σ x and σ y respectively. The 68% confidence ellipse is defined by axis with σ values between 1.15 cm and 2.2 cm and between 0.7 cm and 2 cm (a 95% confidence ellipse would be defined by two times these σ values). The closeness of the deviations in the ellipse axis compared to x, y is due to the central geometry derived from choosing the reference anchor in the center and using the five anchors in the BLC. More ellipticity would be observed in choosing another reference anchor or using less anchors in the BLC.   The IR set defines an area enclosed by the perimeter defined by the four vertices at the external anchors. This is useful for modular scaling of a larger positioning environment. Nevertheless, as can be seen in the figure, a larger area (BLC) can be covered by the same anchor deployment. This may be quite convenient when scaling the LPS as said, for locating the mobile robot when navigating between consecutive BLCs. It allows for different high level design strategies making it possible, for instance, to separate the anchors as much as possible if cost reduction is a need (at the cost of lower precision). This applies also for the camera coverage as will be seen. However, the design strategy of a larger space linking several BLCs lies out of the scope of this paper. Finally, any other configuration in which any of the parameters is changed (photodiode sensitive area, emitted power, etc) can be easily evaluated. The real tests reported in the results section are carried out with similar values as the ones synthetically generated in Figure 3.

Camera Estimation Model
The camera is placed in the BLC at fixed co-ordinates and not necessarily at the same height as the IR anchors. It may lay out of the polygon formed by the IR receivers and, although in this work both sensors are tested with the same (target) test-position grid, it might cover, if needed, a different target-positioning area than the IR sensors (the fusion would be, nevertheless, carried out in the intersection of both areas). Let us distinguish the four procedures that take place for proper camera performance in the LPS.

•
Calibration: it is carried out, customarily, by means of a calibration pattern, in order to obtain the intrinsic and extrinsic camera parameters. Once these parameters are known, the coordinates of the detected target from the scene can be obtained in the image. Calibration has a big impact on errors in the final positioning. In any case, this is a standard stage in any camera-based landmark recognition application. • Homography parameters computation: a homography is established between the image and scene planes (as already mentioned, the scene is also a plane of known height). It is a reciprocal transformation from image to scene. The coefficients of the matrix H for such transformation are obtained by means of specific encoded markers (named H-markers hereafter). This process (with low computational and time cost) allows for having a bijective relation between camera and navigation planes. This way, positioning can be easily carried out with one single camera. As explained in Section 2, the uncertainty in the image capture is propagated to uncertainty in the position in the scene through H. For setup characterization purposes, we have carried out this process off line but, in real performance, it can normally be implemented on line. • Target detection (Projection): a landmark placed on the target is detected by image processing so that the position of the target is obtained in the image plane in pixel coordinates. For simplicity, we will refer hereafter to projection when considering the projection path from image to scene (with matrix H) and to back projection in the opposite sense (scene to image, with H −1 ), as represented in Figure 4. • Back projection: the back projection stage projects the position of T scene to the camera image by means of the inverse homography matrix H −1 matrix. Back projection is used for simulation of the camera estimation model.
In Figure 4 the true position of the landmark is represented by the coordinate vector X (same as for IR sensor) and the captured position in the camera plane is X t p (x p , y p ) where x p and y p are the x, y coordinates in such camera plane in pixelic units (index p stands for pixelic hereafter). A homography can be defined between the scene and camera planes, so that the relation between the coordinates of the landmark in the scene, X t (x, y), and X t p (x p , y p ) in the image can be written as (15) In (15) H is the homography transformation matrix, the terms h ij of which are specific for every scene and camera planes (hence camera location) and must be obtained accordingly. The homography matrix is a 3 × 3 matrix but with 8 DoF (degrees of freedom) because it is generally normalized with  [51]. This way, the coordinates in the scene are obtained as follows: Let us define the 2-variables function F H defined by (16) so that X = F H (X p ). Given pixelic variances σ 2 x p and σ 2 y p and considering the camera observations with behavior x p ∼ N(x p , σ 2 x p ) and y p ∼ N(y p , σ 2 y p ), the jacobian of F H (J F H ) for further variance propagation is: where the terms F 1 , F 2 and D are computed as: Given (x p , y p ), pixelic coordinates of the captured landmark, yielding a corresponding position estimation X c , the covariance matrix of the camera estimation of position is: Obtained from the covariance matrix of the camera observations Σ p 20) and the jacobian described in (17). It is noteworthy to remark here that in addition to the projection-homography (camera to scene) described in (15), it is also necessary to work with the back projection (scene to camera), defined by X p = F H −1 (X), for evaluation of the method through simulation. To do so, synthetic true positions of the target in the grid are generated and back projected to the camera plane by means of H −1 . Next, in the camera image, synthetic realizations with a bidimensional Gaussian distribution are also generated, with center in the true positions back projected to the image plane before. This cloud of image points is then projected to the scene by means of H. This Monte Carlo simulation allows evaluation of errors in the scene given certain known values of pixelic errors and knowledge of H matrix. It must be noted that the H matrix depends on the camera location and, therefore, it must be obtained specifically for such a location. In Figure 5 the same test-grid as the IR one of previous section (a 4 × 3 m rectangular cell with 63 positions separated 0.5 m each other) is evaluated running a simulation as explained in the previous paragraph. The camera, represented by a green square in the figure, is set at an arbitrary position at 3 m high with respect to the horizontal plane. Operating with a single camera it may be interesting to cover a wide angle for inexpensive scalability to large areas, at the cost of more distortion (on the contrary, a camera placed in the center experiences less distortion, but less coverage too).
Supposing realistic uncertainty values σ x p and σ y p equal to 5 pixels (they could also be different from each other) and the following real H coefficients (h 11 = 2.9 × 10 −2 ; h 12 = −9.9 × 10 −1 ; h 13 = 1.1 × 10 3 ; h 21 = 9.9 × 10 −1 ; h 22 = 3.3 × 10 −2 ; h 23 = −1.7 × 10 −3 ; h 31 = 1.2 × 10 −5 ; h 32 = −1.7 × 10 −5 ; h 33 = 8.3 × 10 −1 ), the precision of the position in the scene can be seen in Figure 5: the deviation in x, y axis are between 0.46 cm and 0.7 cm and between 0.47 cm and 0.65 cm values respectively, being the 68% confidence ellipse within margins of 0.5 cm and 0.66 cm and between 0.4 cm and 0.65 cm (ellipse axis respectively). Note the decrease in the deviations as the test positions get closer to the camera. Other tests may be run in the same manner with the camera in any other location (yielding another matrix H).

Fusion of Camera and IR Sensors
The fusion estimation X F is reached by a maximum likelihood (ML) approach [19], i.e., maximizing, with respect to X, the joint probability of having X IR and X c estimates given a true position X. This is, given that both estimates are independent of each other: The IR and camera estimates are X IR ∼ N(X, Σ IR ) and X c ∼ N(X, Σ c )respectively, where X is the true position of T and Σ IR , Σ c are the IR and camera covariance matrixes defined by (12) to (14) and (19) to (20), respectively. Consequently, X IR is described by: and X c : Introducing (22) and (23) in (21), it follows that X F is found as: Yielding the fusion estimate, computed as follows: Regarding covariance X-dependence, note the approach followed herein in reaching (25): the joint probability is maximized in (21) with respect to X, considering covariances as constant. Covariances depend on the position X as seen in the derivation of IR and camera models in previous sections. However, to simplify the optimization process, a good tradeoff to find the solution is solving the ML problem as stated in (21) to (25), where X is the optimization variable, and compute the IR and camera covariance matrixes in 25, defined as in (7) to (14) and (17) to (20) respectively, using the position estimation delivered either by the IR sensor or by the camera sensor. The criterion for choosing one of both can be defined as needed depending on the application. By default, if no other requirement is set, the one with minimum variance (estimated at X with each respective model) is chosen. In static conditions this would be a good solution; in dynamic (navigation) conditions, we can compute the fusion-position X[k] by using the covariance matrixes from X[k − 1]. If the position update velocity is high enough this is also a good fast real-time solution. The results of fusion simulations are depicted in Figure 6. In this figure, the tests reproduce the conditions of the respective IR and camera simulations shown in the previous two sections. The standard deviations in the x, y axis, together with the maximum and minimum variances (68% confidence ellipse axis) are displayed for camera, IR and fusion results. As expected, the deviations delivered by the fusion estimate are lower than any of the two single sensors. The more one of the deviations (IR or camera) increases, the more fusion variance approaches the lowest one.

Results
In this section, a set of results obtained from real measurements conducted in a real basic locating (BLC) to assess the sensors performance and the models derived, is shown. First, IR and camera results are shown independently in Sections 6.2 and 6.3 respectively. The fusion results are addressed next in Section 6.4. Two approaches are considered to determine the camera observation variances that appear in the fusion estimation. One of them is closer to the theoretical description, based on unbiased Gaussian uncertainty of the observations and the other one consists in defining a new standard deviation upon the real measurements (which are not purely unbiased). The latter differs more from the theoretical assumptions but approaches better the real behavior and allows for having a practical design tool with the same theoretical basics derived in previous sections.
The section starts with the description of the setup used for tests. Results from measurements are shown next and finally, a comparative analysis between measurement results and theoretical prediction is included.
In order to facilitate a faster knowledge of the results behavior we include some tables in the different sections as explained next. In Sections 6.2 and 6.3 (IR and camera respectively): a summary of the precision (defined upon standard deviation) in the BLC, together with two indicators to have a better view of the behavior (shape) of the estimation clouds, are presented in a table. These indicators are defined in Section 6.2 (IR) and further used in Section 6.3 (camera) too. In Section 6.4 (results of fusion, also compared with IR and camera) the summarizing table is focused on the precision levels, representing the elliptical deviations explained in Section 2.

Setup
The setup is depicted in Figure 7. A 4 × 3 m rectangular BLC is covered by a set of five IR anchors deployed in a 3 × 3 m square inside the BLC (four at the corners and one in the center) at 2.7 m height and one single camera, placed at 3 m height. The 3 × 3 m IR set covers the full 4 × 3 m BLC, so that the area out of the IR set can be considered as a transition zone between different BLCs in an eventual larger space. The test grid is composed of 63 (9 × 7) test positions separated 50 cm each other. At every position in the grid an IR emitter and a landmark are placed, acting as targets. An amount of 200 observations have been taken (phase shifts with IR and images with camera) yielding 200 IR and camera position estimations respectively and 200 fusion estimations. In addition, five illumination controlled levels where included in the tests, at every test-position and every camera location (therefore, in fact, the total number of camera images at every grid position is 200 × 4). Although the IR anchors could be flexibly chosen (keeping a minimum of three, necessary for trilateration), we work here with the full 5-IR anchor set with the reference in the center. The camera has been placed at four different locations in order to analyze the tradeoff between covered area (by the camera) and precision. We will show here the deviations obtained at these four positions and, next, detailed results are focused on the camera at two of these locations, A and B, (shown in Figure 7), which represent two extremes regarding such precision versus coverage tradeoff. The devices are selected to fulfill low cost requirements while fulfilling performance conditions. All features of the test bench are summarized in Table 2, including devices, BLC configuration and test conditions (notation and configuration indexes or labels corresponding to those in Figure 7). The IR system works with an IRED as emitter (and in turn positioning target) and a photodiode as receiver (in turn positioning anchors), as indicated in the table. All electronic circuits, including the stages for signal conditioning and phase measuring had already been specifically designed for this purpose in past projects. The camera is also an inexpensive one with the sensor shown in the table and a Raspberry Pi 3 Model B as processing unit. The landmarks used for target detection are also shown in Figure 7.
The IR measurements and positioning system performance had already been developed and shown in the past [45]. The camera data was collected for fusion purposes, which constitutes the core of the results presented in this paper.

Infrared Measurements
The IR real measurements, under the conditions described in Table 2, deliver the positioning errors displayed in Figure 8. In this case the configuration with higher precision has been chosen (five receivers with reference in the center). Other possible configurations might need less resources (lower number of anchors and/or more separated) and, hence, implying lower costs at the expense of less precision too. As can be seen, for this IR setup the elliptical deviations are between 0.75 cm and 1.75 cm. The maximum and minimum elliptical deviation values are close to each other (and also close to the x and y deviations) due to the symmetry and circularity of the geometry defined by the IR set chosen) as can be seen in Table 3.
An indicator to assess the shape of the estimation clouds and to have a better geometrical view of the results are introduced here: a dissimilarity index (DI) which will be useful to quantify, in percentage, the level of closeness of the different standard-deviation of the results shown in Figure 8. We first define the DI for two arbitrary matrices M and N of same dimensions: which can vary from 0% (identical matrices) to ∞. In (26) F is the Frobenius norm, also applicable to one dimension vectors. We use (26) to compare, in pairs, the standard deviations in the original x, y axis, σ x , σ y , with the elliptical ones defined in Section 2. For this, a 63 × 1 vector containing the values of σ at each position is built (one vector for each of the four axis), being M in (26) either the σ x vector or the σ y one. Note that F is equal to the classical Euclidean 2 norm if applied to a vector. We nevertheless introduce the general definition for matrices as it would have a wider range of application, if needed, in other works (e.g., if all the covariance matrices are wished to be compared).
Here, it is more convenient to see how close the axis deviations are from each other independently. In addition, note that a DI equal to 0% in both axis would mean that both deviation pairs are identical (i.e., the ellipse is oriented in the x, y original axis). In this case we would not be able to ascertain the circularity of the estimation cloud. Hence, as complementary information, we also define a circularity index (CI) as: where λ i min and λ i max are, respectively, the minimum and maximum value of the eigenvalues pair {λ 1 , λ 2 } appearing in (2). This index is evaluated at every position in the BLC grid. In Table 3 the CI average in the CBL is included (note that CI itself could be enough to ascertain the ellipticity of the estimations but both CI and DI defined provide more detailed information about the estimations' shape and behavior). In this table, the elliptical deviation 2σ values in the major and minor axis are also included. Namely the maximum (worse case) and average values in the whole CBL are shown.
The 2σ values are chosen as they define the 95% confidence ellipsoid.

Camera Measurements
In Figure 9 the position estimations in the CBL obtained from the camera observations at locations A and B are shown (CL 2 and CL 4 in Figure 7, respectively). The whole grid can be seen, with 200 estimations at every test position. As expected, in location A, the dispersion of the estimation clouds is lower because the camera is closer to the target. However, from location A the whole grid is not covered by the camera field of view and some test positions fall in blind areas. On the contrary, from location B the full BLC is covered by the camera, though dispersion is higher (less precision than in location A). In any case, as will be discussed in the next paragraph, and shown in Figure 10 and Table 4, at either of the camera locations A or B, the dispersion of the estimations increases as the distance from target to camera increases (this happens in the left-up direction from A and left direction from B).

Observations
Ground Truth

Camera location 4 (B)
(1) (8)  In order to have a comprehensive view of the camera performance, in Figure 10 the standard deviations for all camera locations (not only A and B, but all locations 1 to 4) at every test position are depicted. The four standard deviation values defined in Section 2 are displayed (x, y deviations and elliptical ones). As can be seen, the elliptical deviations are in a range between 0.1 cm and 1.7 cm, increasing as the camera is more separate from the targets (progressively from locations 1 to 4) being similar at locations 1 and 2 as can be seen in Table 4. This would define 2σ ( 95%) confidence ellipsoids with axis between 0.2 cm and 3.4 cm. In addition, the x, y deviation values increasingly differ from each other, and from the elliptical ones too, from locations 1 to 4. Due to the more distortion as the camera gets further from the grid towards location 4, the estimation clouds get more elliptical. The discontinuities in the graphs at locations 1 and 2 correspond to blind grid positions not covered by the camera. In the same manner as explained in previous Section 6.2 (IR measurements) the Table 4 summarizes the information about estimation-clouds' shape and measurements precision. Next, in order to obtain specific information about pixelic deviations, which are needed to compute the fusion estimate according to the method derived in previous sections, the whole set of collected data is represented in Table 5. It contains the standard deviation values, in pixelic units, in the image x, y axis for each of the camera locations. In the table, the information is summarized showing the maximum and minimum values in the grid, as well as the average of such deviations (for all illuminations). Focusing on locations A and B, the maximum pixelic deviations in the x, y image axes are: σ x p = 4.1, σ y p = 4.5, at A, and σ x p = 4.1, σ y p = 3.9 at B. It must be taken into account, in order not to misunderstand the table information, that a higher pixelic deviation does not necessarily mean a higher distance deviation in the scene (resolution worsens as camera-target distance increases). In fact, as seen in Figure 10, deviations in location 4 (labeled as B) are clearly higher than in location 2 (labeled as A), while the pixelic deviations are similar in both cases. In the next section we will introduce these values of pixelic deviations in the expressions in (19) to obtain the variance of X c and compute the fusion estimate as in (24). The choice of deviation values is a key point, as discussed further. The terms of H needed in (19) to propagate pixelic variance with the function F H , are obtained from an off-line characterization process to define an homography from point correspondences between the ArUco markers and image [28]. It must be remarked here that in Table 5 only the standard deviations are displayed, hence any systematic or any other biased error is not reflected there. This topic will be addressed in the fusion results section as, actually, a higher error is obtained from camera results due to such type of errors.

Fusion Results
In Figure 11 the fusion errors in test-positions, obtained from the real measurements referred in the two previous points with the 5-anchor IR set and the camera in locations A and B are depicted, together with IR and camera ones. For easier figure reading only elliptical errors are displayed, as the information is more useful than x, y ones, because they provide the maximum deviations in the uncertainty ellipsoid axis (enough to define the dimensions of the 68% confidence elliptical area of the estimation). Thus, the elliptical standard deviations of camera, IR and fusion position estimate are shown as maximum and minimum deviations in a figure each. The abscissa axis of the figure represents the positions in the grid as stated in the setup in Figure 7. At every test position in the ground truth, the fusion estimation of such position is obtained as in (25), being the IR and camera position estimations in (25) obtained from the real measurements. The covariance matrixes in (25) are computed, also at every position, using the IR and camera models for variance propagation (from measurements to position estimate) as explained in their respective previous sections. These variances require the position coordinates to be computed. The position estimation obtained from the camera is used for this (the IR estimate, or the one with lowest variance, could be used too). Let us remember that, concerning the IR and camera variances, the former are obtained by means of an analytical expression derived from the IR model, while the latter are empirically inferred from the camera observations (afterwards the propagation to the covariance matrix of the camera position estimate is carried out by the analytical model). The choice of the pixelic deviation value to use in the fusion estimate must keep a balance between representing the real performance while being useful from the practical point of view. We follow a conservative criterion and choose the highest deviation values in Table 5: σ x p = 4.1, σ y p = 4.5, at A, and σ x p = 4.1, σ y p = 3.9 at B.
As can be seen, camera uncertainty (between some mm and 1.8 cm) is, in most positions, lower than IR one and the fusion deviations are much closer to the camera ones. The gaps in the traces in position A correspond to blind positions, as explained before.
Note that the fusion variance is slightly higher than the camera one at some points (namely, a little worse in the minimum elliptical direction, displayed in the image on the right). This is because the values of σ x p and σ y p chosen were the highest in all the grid, according to Table 5. Therefore, in many grid positions the variance is over dimensioned (as if working with a virtual distribution with worse variance in (25) at many positions). Proceeding in the opposite way, i.e., choosing the lowest values would lead to apparently better, but unrealistic, results as it would now overestimate the behavior at many points grids. The theoretical model predicts a fusion variance with lower values than the other two (IR and camera), but there is a differences between the model stated and the real behavior: in the model-based simulations two values of σ x p and σ y p were supposed to be the same for all the images, hence the propagation to the position met this condition. The real behavior shows different σ x p and σ y p at different positions. However, setting specific values at each position is not practical as, unlike the IR sensor, there is not a fully analytical model to express the variance as a function of target position. Selecting the value of the highest variance values along the whole grid is a good solution for the whole scene. Apart from this, the selection of σ x p and σ y p described before would match a theoretical model for unbiased Gaussian distribution of observations. However, the camera position estimates show a bias, as displayed in Figure 12, which is not taken into account by the previous approach. The values of the bias can be in the order of the deviations considered and, therefore, the total error assigned to camera observations is underestimated by the pure unbiased Gaussian assumption. Moreover, this bias is different at every grid-position (and camera location one too) with no systematic pattern, or at least it is not easy to model as it is highly dependent on environment parameters, geometrical configuration and image processing algorithm. Consequently, in order to take into account this contribution to error, now, the bias will be assimilated into the inferred σ x p and σ y p model parameters. Following again a conservative criterion, the highest values are determined from data represented in Figure 12. This deviates the model (AWGN assumption) from the real error performance (biased distribution) but this way the model is much more tractable. Aiming at having a simple procedure based on the model, two unique values σ x p and σ y p are selected from the information represented in Figure 12, for the whole grid and for all camera locations as well. As mentioned, this approach allows for having a useful model while matching the results satisfactorily, as shown next.  Figure 13 shows the results of IR, camera and fusion. The camera error has been redefined and displayed according to the discussion in the previous paragraph in a mixed manner, fitting better the real behavior: in the figure, the bias has been added to the deviation depicted in previous figures. On the other hand, the fusion variance, resulting from the choice of pixelic deviations as explained in the previous paragraph is below both IR and camera variances in most positions. Furthermore, comparing these results with simulation ones (shown in Figure 14), obtained as explained in Sections 3-5 with the model parameters set to the real values and σ x p , σ y p chosen as explained, it can be seen that the (2σ) difference in the fusion (between measurements and model predictions) are within a difference of 0.4 cm considering the whole grid (see Table 6 and 7). The camera real behavior shows higher fluctuations than the model ones, while still remaining within a margin of 2.5 cm (mainly due to two points; on average keeps below 0.3 cm).
In summary, the model allows for having a simulation tool that facilitates testing the positioning space considered within a margin below 1 cm. Regarding precision achieved in target positioning, the whole system performs within a 95% (2σ) confidence error ellipse of 2.5 cm, enhanced with the strong robustness provided by fusion, compared with any of the two sensors working independently.

Conclusions
We have successfully developed an indoor positioning system with fusion of IR sensors and cameras (5 IR sensors and one camera). The achievable precision is in the cm level (maximum of 2.5 cm, 1.4 cm on average, within 95% confidence ellipse) in localization areas of 3 × 4 m, with IR sensors in a 3 × 3 m cell and a camera in possible locations up to a distance of 7 m from the target.
It has been tested with measurements in a real setup, validating both the system performance and also the proposed models for variance propagation from IR and camera observations to the resulting fused position. Model and measurement fusion results agree within a margin of 0.4 cm (2.5 cm for camera, 1 cm for infrared), allowing for the models to be used as a valuable positioning systems simulation tool. Regarding positioning precision itself, while camera and IR ones (95% confidence margin), may rise up to 4.7 cm and 6.7 cm respectively, fusion improves precision reducing the deviations to 1.9 cm. On average, IR and camera present 2σ values of 2.5 cm and 2.7 cm; fusion lowers it to 1.4 cm. This means approximately a 48% precision improvement on average (in the maximum deviations it rises to about 72%). An average improvement of almost 50% may be quite convenient in some applications. Moreover, as important as precision performance, the sensor fusion strategy provides the system with high robustness as, in case one sensor fails or its performance worsens, the fused position precision gets closer to the other one. The system also works under changing illumination. The camera performs satisfactorily with very low illumination levels and also with floor brightness (high artificial illumination conditions). Furthermore, IR and camera provide complementary behavior with respect to light conditions, as in dark conditions the camera might not see the scene but the IR system would perform correctly, while under high illumination levels the IR receivers could saturate but the camera would still capture the landmarks. The last two features pointed out add robustness and reliability to the system. The proposal shows promising perspectives in terms of scalability to large real spaces because the system developed offers wide coverage with a small number of low cost devices (less than five IR anchors is also valid) and high precision features, with the camera enhancing the decrease of IR precision beyond 5 m. Locating areas are expected to be notably enlarged without increasing the number of devices, although further tests are necessary for proper assessment. This could be attained by placing the camera at longer distances and by improvement of the IR precision (increasing sensitive area, frequency and emitted power is feasible). From the low level point of view, this will imply a challenging design effort to balance the distance between devices so as to keep cost as low as possible and still have acceptable coverage and precision (increasing the distance will reduce SNR). Careful selection of new devices with higher working frequency (which also enhances precision) and wider field of view, without engaging the response time, is a key aspect in this low level tradeoff. With respect to response time requirements (under real time navigation conditions), a more restrictive signal filtering (IR) and resolution increase (camera) would increase SNR (precision) but would make the system slower. Considering all these aspects, the final system should also be tackled together with the aid of odometry (which can also be tackled with a fusion approach). Finally, we have proposed a BLC with five receivers but the minimum localization unit needs three, although precision worsens. If the precision requirements relax the number of receivers could be lower (not necessarily the fusion precision, unless the camera worsens too).
From the theoretical point of view, the IR model for variance propagation is completely analytical and matches real results very closely at every target position in unbiased scenarios, i.e., mainly multipath free (or MP canceled). The camera one is semi-empirical, the variance of observations being deduced from the measurements. A completely analytical model for the camera is not easy to define as some stages are not accessible as to link variance with position (mainly the detection algorithm). Nevertheless, unlike the IR system, the determination of the pixelic variances, in order to obtain the H matrix of the model, is very easily achieved with one single picture-burst of the whole scene (with IR, this would need a large amount of measurements, successively at every position in the grid). Moreover, although this task has been carried out offline, under real navigation performance, it could be easily carried out online. In summary, a good tradeoff solution has been found for determining the variance of the camera observations, which represents within 1 cm the real behavior in the whole grid. The camera bias is very position-dependent and not easy to model, but it has also been successfully integrated in the procedure. Regarding IR, as said, AWGN assumption requires multipath free scenario what requires cancelation techniques. Otherwise, multipath errors can grow up to the m level in unfriendly (though common) scenarios. A next challenge is improving the model for the camera, by going further in the analytical link between variance and target position. In addition, it will be important to tackle the problem from an estimator-based approach, addressing the approximations assumed here with respect to the position-depending variances in the ML solution, and investigating the camera bias and IR multipath contribution to integrate them in the models. Finally, exploring sensor cooperation capabilities together with data fusion will be interesting when applied to real applications.